Question

How I deal with raw read counts of RNA-seq

0

Entering edit mode

4.7 years ago

zizigolu ★ 4.3k

Hi

This is raw read counts extracted by htseq from STAR alignment

> head(counts[1:2,1:4])
                 Sample1 Sample2 Sample3
1 ENSG00000000003  115  437  380
2 ENSG00000000005    0    0    0
> dim(counts)
[1] 58735    17
>

By this code I tried to find matched gene symbol for ensmbl gene id

>   ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
>   values=rownames(counts)
>    data <- getBM(attributes=c("ensembl_gene_id", "hgnc_symbol"), filters = "ensembl_gene_id", values = values, mart= ensembl)


> merged_data = merge(x = data, y = counts, by.x = colnames(data)[1], by.y = colnames(counts)[1], all = T)

But in resulted matrix for a lot of ensmbl gene id I don't have gene symbol, when I removed them I finished with a smaller matrix

> dim(new_counts)
[1] 25052    16
>

In your experience is it normal? what should I do then? Please help me to get the right matrix for down stream analysis

Thank you

RNA-Seq DESeq2 STAR • 1.4k views

ADD COMMENT • link updated 4.7 years ago by swbarnes2 14k • written 4.7 years ago by zizigolu ★ 4.3k

0

Entering edit mode

See the answer here: What is the difference between transcript id and Ensembl gene id

Do you have transcript variants for genes in your list? That can explain the larger numbers you initially had.

Also see: ENSEMBL annotation file for quantification: which file to use?

ADD REPLY • link 4.7 years ago by GenoMax 142k

score 1 · Answer 1 · 2019-09-10

1

Entering edit mode

4.7 years ago

dextereyes88 ▴ 10

I prefer to use the bitr function in the clusterProfiler package for mapping gene symbols to ensembl IDs. It will tell you if there is 1:1 matching or many:1 matching, and will print the percentage of non-mapped IDs. https://yulab-smu.github.io/clusterProfiler-book/chapter14.html#bitr

It is certainly possible that the dimensions will change. Especially if the Ensembl IDs in your data set come from a different release of the Ensembl annotation used to map those IDs. Why are you mapping the gene symbols first? Are you doing a targeted analysis? If you are doing a genome wide analysis I would do your DESEq analysis using the ensembl IDs and filter the genes based on some expression threshold across your samples, then add gene annotation using something like BioMart.

ADD COMMENT • link 4.7 years ago by dextereyes88 ▴ 10

0

Entering edit mode

I know I have used Homo_sapiens.GRCh38.dna.primary_assembly.gtf for extracting raw read counts, then how I tell biomaRt to use the same version for id conversion?

ADD REPLY • link 4.6 years ago by zizigolu ★ 4.3k

0

Entering edit mode

> ids <- bitr(hg, fromType="ENSEMBL", toType=c("SYMBOL"), OrgDb="org.Hs.eg.db")
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(hg, fromType = "ENSEMBL", toType = c("SYMBOL"), OrgDb = "org.Hs.eg.db") :
  55.7% of input gene IDs are fail to map...

Also htseq returns gene name by --idattr=gene_name , again a lot of non coding genes like AL390728.6. I am not sure should I ignore them like I am doing with id conversion or not

Don't you think I am losing too much information by this conversion?

ADD REPLY • link 4.6 years ago by zizigolu ★ 4.3k

score 1 · Answer 2 · 2019-09-10

1

Entering edit mode

4.7 years ago

swbarnes2 14k

I'd try "external_gene_name" instead of hgnc symbols. With 25k things having matching names, it looks like hgnc names might only match protein coding genes, but ensembl IDs will include a lot more than that.

ADD COMMENT • link 4.7 years ago by swbarnes2 14k

0

Entering edit mode

@ swbarnes2 I done by external_gene_name now I have a lot of AC120036.5 and so on

Do you think I can ignore them and rely on my data gotten by hgnc_symbol

ADD REPLY • link 4.6 years ago by zizigolu ★ 4.3k