Question: StringTie --merge creates the same id for neighbouring genes
Asked 18 months ago by marina.v.yurieva (Farmington, CT):

I'm very sorry if this topic has been posted before; I couldn't find anything relevant. I'm merging assemblies with StringTie --merge, using a reference annotation and the list of GTF files from my samples:

stringtie --merge -G ref.gtf -o merged.gtf assembly_gtf_list.txt

For many known genes that lie next to each other, it assigns the same MSTRG ID. For example:

MSTRG.10092 ENSG00000135722 (FBXL8 gene)
MSTRG.10092 ENSG00000265690 (AC074143.2 gene)
MSTRG.10092 ENSG00000102878 (HSF4 gene)

I know that StringTie has an issue with novel isoforms, but these are all known genes, and it doesn't take their original gene IDs into account. I tried using gffcompare instead of stringtie --merge, but that doesn't seem to fix the problem. Are there any other options I can try?

Thank you in advance!

Answered 3 months ago by kristoffer.vittingseerup (European Union):

The missing gene_names from StringTie can originate from three different sources:

1) It is a novel transcript in a known gene.
2) It is a novel transcript in a cluster of genes (multiple gene_names) that StringTie/Cufflinks joins together because the genes overlap.
3) It is a novel gene, meaning it has no genomic overlap with any feature in the reference you are using.
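To see which of these cases applies to each MSTRG locus, one option is to count the reference genes attached to each merged gene_id. The following is only a rough sketch, not part of StringTie or IsoformSwitchAnalyzeR. It assumes that transcript lines in merged.gtf carry a ref_gene_id attribute when they match a reference transcript. The function names, the coordinates, the transcript IDs, and the MSTRG.2 and MSTRG.3 loci are made up for illustration.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def ref_genes_per_locus(gtf_lines):
    """Map each merged gene_id to the set of ref_gene_id values seen on its transcripts.

    Assumption: reference-matched transcripts carry a ref_gene_id attribute.
    """
    loci = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue  # only transcript records are needed here
        attrs = dict(ATTR_RE.findall(cols[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        refs = loci[gene_id]  # registers the locus even if it has no reference gene
        if attrs.get("ref_gene_id"):
            refs.add(attrs["ref_gene_id"])
    return dict(loci)


def classify(loci):
    """Group loci by the number of reference genes they contain.

    'multi' corresponds to merged gene clusters (case 2), 'novel' to loci
    without any reference gene (case 3), 'single' to loci in one known gene.
    """
    groups = {"single": [], "multi": [], "novel": []}
    for gene_id, refs in sorted(loci.items()):
        if not refs:
            groups["novel"].append(gene_id)
        elif len(refs) == 1:
            groups["single"].append(gene_id)
        else:
            groups["multi"].append(gene_id)
    return groups


# Made-up records for illustration; only the MSTRG.10092 / ENSG pairs come from the question.
EXAMPLE = [
    'chr16\tStringTie\ttranscript\t100\t900\t1000\t+\t.\tgene_id "MSTRG.10092"; transcript_id "tx1"; ref_gene_id "ENSG00000135722";',
    'chr16\tStringTie\ttranscript\t800\t1500\t1000\t+\t.\tgene_id "MSTRG.10092"; transcript_id "tx2"; ref_gene_id "ENSG00000265690";',
    'chr16\tStringTie\ttranscript\t1400\t2500\t1000\t+\t.\tgene_id "MSTRG.10092"; transcript_id "tx3"; ref_gene_id "ENSG00000102878";',
    'chr1\tStringTie\ttranscript\t100\t500\t1000\t-\t.\tgene_id "MSTRG.2"; transcript_id "tx4"; ref_gene_id "ENSG_EXAMPLE";',
    'chr1\tStringTie\ttranscript\t900\t1200\t1000\t-\t.\tgene_id "MSTRG.3"; transcript_id "MSTRG.3.1";',
]

groups = classify(ref_genes_per_locus(EXAMPLE))
print(groups["multi"])
```

For a real run, passing an open file handle for merged.gtf to ref_genes_per_locus() in place of EXAMPLE should work the same way, since the function only iterates over lines.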

In my experience with StringTie data there are typically tens of thousands of missing gene_names, and ~50% of them are due to problems 1 and 2. To solve this I have just released an update to the R package IsoformSwitchAnalyzeR (available in >1.11.6) which can fix problems 1 and 2 for most genes. You simply use the importRdata() function, which fixes the isoform annotation where it is fixable and cleans up the rest of the annotation. From the resulting switchAnalyzeRList object you can analyse isoform switches with predicted functional consequences using IsoformSwitchAnalyzeR, or use extractGeneExpression() to get a gene count matrix for DE analysis with other tools.

Hope this helps.

Cheers

Kristoffer
