Question: How I deal with raw read counts of RNA-seq
gravatar for A
4 months ago by
A3.7k wrote:


This is raw read counts extracted by htseq from STAR alignment

> head(counts[1:2,1:4])
                 Sample1 Sample2 Sample3
1 ENSG00000000003  115  437  380
2 ENSG00000000005    0    0    0
> dim(counts)
[1] 58735    17

By this code I tried to find matched gene symbol for ensmbl gene id

>   ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
>   values=rownames(counts)
>    data <- getBM(attributes=c("ensembl_gene_id", "hgnc_symbol"), filters = "ensembl_gene_id", values = values, mart= ensembl)

> merged_data = merge(x = data, y = counts, by.x = colnames(data)[1], by.y = colnames(counts)[1], all = T)

But in resulted matrix for a lot of ensmbl gene id I don't have gene symbol, when I removed them I finished with a smaller matrix

> dim(new_counts)
[1] 25052    16

In your experience is it normal? what should I do then? Please help me to get the right matrix for down stream analysis

Thank you

rna-seq star deseq2 • 339 views
ADD COMMENTlink modified 4 months ago by swbarnes27.2k • written 4 months ago by A3.7k

See the answer here: What is the difference between transcript id and Ensembl gene id

Do you have transcript variants for genes in your list? That can explain the larger numbers you initially had.

Also see: ENSEMBL annotation file for quantification: which file to use?

ADD REPLYlink modified 4 months ago • written 4 months ago by genomax76k
gravatar for dextereyes88
4 months ago by
dextereyes8810 wrote:

I prefer to use the bitr function in the clusterProfiler package for mapping gene symbols to ensembl IDs. It will tell you if there is 1:1 matching or many:1 matching, and will print the percentage of non-mapped IDs.

It is certainly possible that the dimensions will change. Especially if the Ensembl IDs in your data set come from a different release of the Ensembl annotation used to map those IDs. Why are you mapping the gene symbols first? Are you doing a targeted analysis? If you are doing a genome wide analysis I would do your DESEq analysis using the ensembl IDs and filter the genes based on some expression threshold across your samples, then add gene annotation using something like BioMart.

ADD COMMENTlink written 4 months ago by dextereyes8810

I know I have used Homo_sapiens.GRCh38.dna.primary_assembly.gtf for extracting raw read counts, then how I tell biomaRt to use the same version for id conversion?

ADD REPLYlink modified 4 months ago • written 4 months ago by A3.7k
> ids <- bitr(hg, fromType="ENSEMBL", toType=c("SYMBOL"), OrgDb="")
'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(hg, fromType = "ENSEMBL", toType = c("SYMBOL"), OrgDb = "") :
  55.7% of input gene IDs are fail to map...

Also htseq returns gene name by --idattr=gene_name , again a lot of non coding genes like AL390728.6. I am not sure should I ignore them like I am doing with id conversion or not

Don't you think I am losing too much information by this conversion?

ADD REPLYlink modified 4 months ago • written 4 months ago by A3.7k
gravatar for swbarnes2
4 months ago by
United States
swbarnes27.2k wrote:

I'd try "external_gene_name" instead of hgnc symbols. With 25k things having matching names, it looks like hgnc names might only match protein coding genes, but ensembl IDs will include a lot more than that.

ADD COMMENTlink modified 4 months ago • written 4 months ago by swbarnes27.2k

@ swbarnes2 I done by external_gene_name now I have a lot of AC120036.5 and so on

Do you think I can ignore them and rely on my data gotten by hgnc_symbol

ADD REPLYlink written 4 months ago by A3.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1204 users visited in the last hour