Question: How I deal with raw read counts of RNA-seq
gravatar for A
11 months ago by
A3.8k wrote:


This is raw read counts extracted by htseq from STAR alignment

> head(counts[1:2,1:4])
                 Sample1 Sample2 Sample3
1 ENSG00000000003  115  437  380
2 ENSG00000000005    0    0    0
> dim(counts)
[1] 58735    17

By this code I tried to find matched gene symbol for ensmbl gene id

>   ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
>   values=rownames(counts)
>    data <- getBM(attributes=c("ensembl_gene_id", "hgnc_symbol"), filters = "ensembl_gene_id", values = values, mart= ensembl)

> merged_data = merge(x = data, y = counts, by.x = colnames(data)[1], by.y = colnames(counts)[1], all = T)

But in resulted matrix for a lot of ensmbl gene id I don't have gene symbol, when I removed them I finished with a smaller matrix

> dim(new_counts)
[1] 25052    16

In your experience is it normal? what should I do then? Please help me to get the right matrix for down stream analysis

Thank you

rna-seq star deseq2 • 488 views
ADD COMMENTlink modified 11 months ago by swbarnes28.2k • written 11 months ago by A3.8k

See the answer here: What is the difference between transcript id and Ensembl gene id

Do you have transcript variants for genes in your list? That can explain the larger numbers you initially had.

Also see: ENSEMBL annotation file for quantification: which file to use?

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax87k
gravatar for dextereyes88
11 months ago by
dextereyes8810 wrote:

I prefer to use the bitr function in the clusterProfiler package for mapping gene symbols to ensembl IDs. It will tell you if there is 1:1 matching or many:1 matching, and will print the percentage of non-mapped IDs.

It is certainly possible that the dimensions will change. Especially if the Ensembl IDs in your data set come from a different release of the Ensembl annotation used to map those IDs. Why are you mapping the gene symbols first? Are you doing a targeted analysis? If you are doing a genome wide analysis I would do your DESEq analysis using the ensembl IDs and filter the genes based on some expression threshold across your samples, then add gene annotation using something like BioMart.

ADD COMMENTlink written 11 months ago by dextereyes8810

I know I have used Homo_sapiens.GRCh38.dna.primary_assembly.gtf for extracting raw read counts, then how I tell biomaRt to use the same version for id conversion?

ADD REPLYlink modified 11 months ago • written 11 months ago by A3.8k
> ids <- bitr(hg, fromType="ENSEMBL", toType=c("SYMBOL"), OrgDb="")
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(hg, fromType = "ENSEMBL", toType = c("SYMBOL"), OrgDb = "") :
  55.7% of input gene IDs are fail to map...

Also htseq returns gene name by --idattr=gene_name , again a lot of non coding genes like AL390728.6. I am not sure should I ignore them like I am doing with id conversion or not

Don't you think I am losing too much information by this conversion?

ADD REPLYlink modified 11 months ago • written 11 months ago by A3.8k
gravatar for swbarnes2
11 months ago by
United States
swbarnes28.2k wrote:

I'd try "external_gene_name" instead of hgnc symbols. With 25k things having matching names, it looks like hgnc names might only match protein coding genes, but ensembl IDs will include a lot more than that.

ADD COMMENTlink modified 11 months ago • written 11 months ago by swbarnes28.2k

@ swbarnes2 I done by external_gene_name now I have a lot of AC120036.5 and so on

Do you think I can ignore them and rely on my data gotten by hgnc_symbol

ADD REPLYlink written 11 months ago by A3.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1102 users visited in the last hour