Question: More gene IDs in gene count matrix than in GTF file
6 months ago by
rthapa0
rthapa0 wrote:

Hi,

I am using a GTF file from the Ensembl database for mapping and alignment. I used StringTie to get the gene count matrix, but I see many more gene IDs in my gene count matrix than in the GTF file. Is this normal? I would appreciate any suggestions.

Thanks

rna-seq • 336 views
modified 6 months ago • written 6 months ago by rthapa0

Thank you for your reply. The extra gene IDs in my gene count matrix are not 'MSTRG' identifiers. I am working on rice samples. The GTF file from Ensembl has IDs starting with either "Os" or "ENSRNA", but my gene count matrix contains many genes with IDs starting with "EPIOSAG". Also, rice has around 40,000 genes in total, but my count matrix has around 90,000 gene IDs. Do you have any idea whether this is normal, or whether it might be due to a mistake in alignment and mapping?

written 6 months ago by rthapa0

Please use ADD COMMENT or ADD REPLY to respond to existing posts to keep threads logically organized.

written 6 months ago by genomax70k

I don't recall any annotation ID starting with "EPIOSAG". Can you show the command you used to create the merged GTF file with StringTie? Maybe you set a name prefix for output transcripts (the -l flag). Also keep in mind that StringTie provides two count files: gene counts and transcript counts. And yes, the larger number of rows in your gene count matrix is expected, because StringTie also reports genes and transcripts (including noncoding ones) that are absent from the provided annotation file. If you wish to stick with the Ensembl GTF only, you could use the Ensembl GTF instead of the StringTie merged GTF when estimating transcript abundances.
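One way to see exactly which IDs are extra is to diff the gene IDs in the annotation against the row IDs of the count matrix. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the matrix is a CSV with a header row and the gene ID in the first column. Some versions of prepDE.py appear to append a gene name after a "|", so the sketch keeps only the part before it.

```python
import csv
import re

# Matches the gene_id attribute in the 9th column of a GTF line.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gtf_gene_ids(gtf_lines):
    """Collect the distinct gene_id values from an iterable of GTF lines."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        match = GENE_ID_RE.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def matrix_gene_ids(csv_lines):
    """Collect row IDs (first column) from a count matrix, skipping the header."""
    reader = csv.reader(csv_lines)
    next(reader, None)  # drop the header row
    # Assumption: an optional "|gene_name" suffix is not part of the ID.
    return {row[0].split("|")[0] for row in reader if row}


def compare_ids(gtf_lines, csv_lines):
    """Return (IDs only in the matrix, IDs only in the GTF), both sorted."""
    in_gtf = gtf_gene_ids(gtf_lines)
    in_matrix = matrix_gene_ids(csv_lines)
    return sorted(in_matrix - in_gtf), sorted(in_gtf - in_matrix)


# Usage with real files (file names are placeholders):
# with open("annotation.gtf") as gtf, open("gene_count_matrix.csv", newline="") as mat:
#     only_matrix, only_gtf = compare_ids(gtf, mat)
#     print(len(only_matrix), "IDs only in matrix;", len(only_gtf), "only in GTF")
```

If the first list comes back empty, every row of the matrix is already present in the annotation, and the surprise is in the annotation itself rather than in the assembly.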

written 6 months ago by ahaswer150

For StringTie, I used the following command:

stringtie /scratch/user/rthapa/RNA-seq/rnaseqS10.hisat.sorted.bam -o /scratch/user/rthapa/RNA-seq/rnaseqS10.gtf -p 8 -G /scratch/user/rthapa/RNA-seq/Oryza_sativa.IRGSP-1.0.37.gtf -A /scratch/user/rthapa/RNA-seq/gene_abund -e

And to convert the GTF file, I used the following command:

python prepDE.py -i sample3a.txt

Yes, I do have both a gene count matrix and a transcript count matrix. In the transcript matrix too, I can see transcripts starting with "EPIOSAT".

written 6 months ago by rthapa0

In fact, your GTF file (Oryza_sativa.IRGSP-1.0.37.gtf) does contain "EPlOSAG" IDs (short for Ensembl Plants Oryza sativa genes). Check:

grep -i "eplosag" Oryza_sativa.IRGSP-1.0.37.gtf | head

These are valid gene/transcript identifiers, though. The newest O. sativa GTF file (1.0.42) contains unified IDs (i.e. without "EPlOSAG"-like IDs). Check the Ensembl releases page.
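To see how many genes fall under each ID family in a given GTF release, the leading letters of each gene_id can be tallied. This is a rough sketch under the assumption that the family is identified by the alphabetic prefix of the ID (e.g. "Os", "ENSRNA", "EPlOSAG"); the function names are illustrative only.

```python
import re
from collections import Counter

# Matches the gene_id attribute in the 9th column of a GTF line.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
# Leading run of letters, used here as the "family" of an identifier.
PREFIX_RE = re.compile(r"[A-Za-z]+")


def id_prefix(gene_id):
    """Return the leading alphabetic prefix of an ID, or '' if there is none."""
    match = PREFIX_RE.match(gene_id)
    return match.group(0) if match else ""


def count_gene_id_prefixes(gtf_lines):
    """Count distinct gene IDs per alphabetic prefix in an iterable of GTF lines."""
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        match = GENE_ID_RE.search(line)
        if match:
            seen.add(match.group(1))  # each gene counted once, however many rows it has
    return Counter(id_prefix(gene_id) for gene_id in seen)


# Usage with a real file:
# with open("Oryza_sativa.IRGSP-1.0.37.gtf") as gtf:
#     for prefix, n in count_gene_id_prefixes(gtf).most_common():
#         print(prefix, n)
```

Running the same tally on the first column of the count matrix and comparing the two results shows whether the matrix contains any ID family that the annotation lacks.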

modified 6 months ago • written 6 months ago by ahaswer150
6 months ago by
ahaswer150
Czech Republic
ahaswer150 wrote:

Do the IDs start with "MSTRG"? You will generally get additional IDs when assembling with an annotation, because an annotation file never contains every gene and isoform. Every assembled transcript that is not included in the annotation is therefore assigned an 'MSTRG' identifier. If you are interested in known transcripts only, you can try Salmon or Kallisto.

modified 6 months ago • written 6 months ago by ahaswer150