gene IDs in stringtie output
3.5 years ago

Dear All,

I'm using StringTie to assemble transcripts, supplying my genome annotation file with the -G flag. However, StringTie assigns its own IDs, such as MSTRG.1 and MSTRG.2 to genes and MSTRG.1.1 and MSTRG.2.1 to transcripts, even though I provide the annotation file, and I'm unable to get the same gene IDs as in the annotation. I need those IDs for subsequent functional analysis. Can anyone suggest how to get the same IDs in the StringTie output as in the genome annotation file?

thanks in anticipation

rna-seq • 2.9k views

Hello blooming.daisy333,

fin swimmer


Dear finswimmer, I really appreciate your kind and quick help, and I'm extremely sorry for the delay, but I'm still working on those questions. They are actually interconnected in my analysis. I will surely leave comments as I have before; please give me some time. Also, for some posts that solved my problem, I could not see any upvote/accept button to click on. That's why they are not marked.


One way is to intersect the coordinates of each MSTRG feature with the known transcriptome GTF. @blooming.daisy333
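A minimal Python sketch of that intersection idea, under a few assumptions: both files are standard tab-separated GTF with a `gene_id` attribute, a gene's span is taken as the min/max of all its features, and an unstranded feature (`.`) may match either strand. The helper names `gene_spans` and `overlapping_reference_genes` and the toy records are made up for illustration. The nested loop is only meant to show the logic; for real data a dedicated tool such as `bedtools intersect` is far faster.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def gene_spans(gtf_lines):
    """Return {gene_id: (chrom, strand, start, end)} spanning all features of each gene."""
    spans = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        gene_id = dict(ATTR_RE.findall(fields[8])).get("gene_id")
        if gene_id is None:
            continue
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        if gene_id in spans:
            # Widen the existing span; keep the strand seen first.
            _, strand, old_start, old_end = spans[gene_id]
            start, end = min(start, old_start), max(end, old_end)
        spans[gene_id] = (chrom, strand, start, end)
    return spans

def overlapping_reference_genes(stringtie_spans, reference_spans):
    """Map each StringTie gene_id to the sorted reference gene_ids it overlaps."""
    hits = {}
    for st_id, (chrom, strand, start, end) in stringtie_spans.items():
        matches = []
        for ref_id, (r_chrom, r_strand, r_start, r_end) in reference_spans.items():
            same_strand = "." in (strand, r_strand) or strand == r_strand
            # GTF coordinates are 1-based and inclusive on both ends.
            if chrom == r_chrom and same_strand and start <= r_end and r_start <= end:
                matches.append(ref_id)
        hits[st_id] = sorted(matches)
    return hits

def record(chrom, start, end, strand, gene_id):
    """Build one toy GTF transcript line (illustration only)."""
    attrs = f'gene_id "{gene_id}";'
    return "\t".join([chrom, "toy", "transcript", str(start), str(end), ".", strand, ".", attrs])

assembled = [record("chr1", 100, 500, "+", "MSTRG.1"),
             record("chr1", 5000, 6000, "+", "MSTRG.2")]
reference = [record("chr1", 150, 450, "+", "GeneA"),
             record("chr1", 5200, 5800, "-", "GeneB"),
             record("chr2", 100, 500, "+", "GeneC")]

hits = overlapping_reference_genes(gene_spans(assembled), gene_spans(reference))
for st_id, ref_ids in hits.items():
    print(st_id, ",".join(ref_ids) or "no overlap")
```

With real files you would pass open file handles instead of the toy lists. An MSTRG gene that maps to exactly one reference gene can then be relabelled; zero or several hits need manual judgement.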


Hello, I am having the same issue: I am getting MSTRG IDs instead of gene names. Were you able to solve this? If yes, please let me know how you did it.

Many thanks


Please don't ask questions in the space reserved for answers; use the ADD COMMENT button instead.


Sorry about that. I am new and didn't realize this.

14 months ago

The missing gene_names from StringTie can originate from three different sources: 1) it is a novel transcript in a known gene; 2) it is a novel transcript in a cluster of genes (multiple gene_names) that StringTie/Cufflinks joins together because they overlap; 3) it is a novel gene, meaning there is no genomic overlap with any feature in the reference you are using.
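To make the three cases concrete, here is a rough Python sketch that sorts unnamed transcripts into them. It is not how any particular package does it. It assumes the StringTie GTF carries `ref_gene_name` (or `ref_gene_id`) on transcripts that match the reference, and it uses "reference names seen among transcripts sharing the same StringTie gene_id" as a proxy for genomic overlap, which can miss overlapping reference genes that were not assembled. The function name, labels, and toy records are hypothetical.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def classify_unnamed_transcripts(gtf_lines):
    """Label each transcript lacking a reference gene name with case 1, 2 or 3."""
    ref_names = defaultdict(set)  # StringTie gene_id -> reference names among its transcripts
    unnamed = []                  # (gene_id, transcript_id) pairs without a reference name
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id, tx_id = attrs.get("gene_id"), attrs.get("transcript_id")
        name = attrs.get("ref_gene_name") or attrs.get("ref_gene_id")
        if name:
            ref_names[gene_id].add(name)
        else:
            unnamed.append((gene_id, tx_id))

    labels = {}
    for gene_id, tx_id in unnamed:
        n_names = len(ref_names.get(gene_id, ()))
        if n_names == 1:
            labels[tx_id] = "1: novel transcript in known gene"
        elif n_names > 1:
            labels[tx_id] = "2: novel transcript in gene cluster"
        else:
            labels[tx_id] = "3: candidate novel gene"
    return labels

def transcript(gene_id, tx_id, ref_gene_name=None):
    """Build one toy StringTie-style transcript line (illustration only)."""
    attrs = f'gene_id "{gene_id}"; transcript_id "{tx_id}";'
    if ref_gene_name:
        attrs += f' ref_gene_name "{ref_gene_name}";'
    return "\t".join(["chr1", "StringTie", "transcript", "1", "100", "1000", "+", ".", attrs])

toy_gtf = [
    transcript("MSTRG.1", "refT1", "GeneA"),
    transcript("MSTRG.1", "MSTRG.1.1"),
    transcript("MSTRG.2", "refT2", "GeneB"),
    transcript("MSTRG.2", "refT3", "GeneC"),
    transcript("MSTRG.2", "MSTRG.2.1"),
    transcript("MSTRG.3", "MSTRG.3.1"),
]

labels = classify_unnamed_transcripts(toy_gtf)
for tx_id, label in labels.items():
    print(tx_id, label)
```

Only case 1 gives an unambiguous gene name to transfer; case 2 needs a rule for choosing among the clustered genes, and case 3 has nothing in the reference to borrow from.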

From my experience with StringTie data, there are typically tens of thousands of missing gene_names, and ~50% of them are due to problems 1 and 2. To solve this I have just released an update to the R package IsoformSwitchAnalyzeR (available in >1.11.6) which can fix problems 1 and 2 for most genes. You simply use the importRdata() function, which fixes the isoform annotation where it is fixable and cleans up the rest of the annotation. From the resulting switchAnalyzeRList object you can analyse isoform switches with predicted functional consequences using IsoformSwitchAnalyzeR, or use extractGeneExpression() to get a gene count matrix for DE analysis with other tools.

Hope this helps.

Cheers

Kristoffer