Question

Stringie output doesn't have any gene names!

0

Entering edit mode

4.6 years ago

c_u ▴ 520

Hi,

So, I ran Stringtie on data derived from 38 samples, according to the protocol mentioned in the paper - https://www.nature.com/articles/nprot.2016.095. I generated the merged gtf file after running stringtie merge (using the -G option), and then I performed transcript abundance calculation.

After that, I used the method given here to run tximport using the Stringtie files. However, upon doing that, I saw that tximport just gave one large number for each sample, instead of raw counts for each gene. So, I tried to look at the data carefully and found that there was no gene_names in the stringtie_merged gtf file, as a result of which (probably) tximport wasn't able to give separate counts for the genes/transcripts. (In comparison, the merged.gtf file in their paper does have gene_names at least for some transcripts)

Then, I searched online and found a code by the developer here that is supposed to append gene_names to the merged.gtf file, but on using it, the output gtf still doesn't have any gene names.

TLDR - what should I do to ensure that the merged.gtf file has the gene_names so that tximport can assign raw counts to the transcripts/genes?

stringtie RNA-Seq • 3.0k views

ADD COMMENT • link updated 3.6 years ago by Kristoffer Vitting-Seerup ★ 4.0k • written 4.6 years ago by c_u ▴ 520

1

Entering edit mode

I need to ask you two questions before being able to answer you: 1)Is it only gene-names that are the problem or do you also lack gene_ids? 2) When you ran stringtie --merge did you use the -G option to include a refrence?

ADD REPLY • link 4.6 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Entering edit mode

Hi Kristoffer,

1) I do see gene_id's in the stringtie merged file, BUT, there are 2.1 million lines in the stringtie_merged.gtf and for 1.95 million of them, the gene_id is of the form MSTRG.x - only the last 0.15 million lines have gene_id of the form ENSG....

2) Yes, I did use the -G option : stringtie --merge -p 8 -G /Volumes/bam/DRG/Homo_sapiens.GRCh38.97.gff3

ADD REPLY • link 4.6 years ago by c_u ▴ 520

1

Entering edit mode

3.6 years ago

Kristoffer Vitting-Seerup ★ 4.0k

I just wanted to post an addition to my comment above:

The missing gene_names from StringTie can originate from 3 different sources: 1) It is a novel transcript in a known gene 2) It is a novel transcript in a cluster of genes (multiple gene_names) which are joined together by StringTie/Cufflinks because of their overlap 3) It is a novel gene - meaning no genomic overlap with any feature in the reference you are using.

From my experience with StringTie data there are typically thens of thousands of missing gene_names and ~50% of the missing gene_names are due to problem 1 and 2. To solve this I have just release an update to the R package IsoformSwitchAnalyzeR (available in >1.11.6) which can fix problem 1 and 2 for most genes. You simply use the importRdata() function - which will fix the isoform annotation which is fixable and clean up the rest of the annotation. From the resulting switchAnalyzeRList object you can analyse isoform switches with predicted functional consequences with IsoformSwitchAnalyzeR or use extractGeneExpression() to get a gene count matrix for DE analysis with other tools.

Hope this helps.

Cheers

Kristoffer

ADD COMMENT • link 3.6 years ago by Kristoffer Vitting-Seerup ★ 4.0k

score 2 · Accepted Answer · 2019-10-02

2

Entering edit mode

4.6 years ago

Kristoffer Vitting-Seerup ★ 4.0k

The problem is that assigning gene names is not trivial since for a subset of the data the genes identified by StringTie (defined as overlapping (re-constructed) transcripts) might overlap multiple genes - then how do you decide which gene name to assign to a specific transcript? This is a very hard task which is why StringTie per default does not assign gene names.

This is however only for a smaller percentage of the data - for the majority of the data a StringTie defined gene (the MSTRGxxxxx) is uniquely associated with a reference gene_id (via the "ref_gene_id" entry in the GTF file) which you manually can assign to the corresponding gene_name via the reference GTF file.

What I would suggest is that you use tximport to summaize to gene_id level (you dont want to use gene_name since sometimes they are duplicated) and then assign gene_names to those where there is no ambiguity. For the cases with multiple overlapping genes I'm not aware of any good automatic way of splitting the data.

The only alternative I know is to use Cuffmerge instead of StringTie --merge since Cuffmerge does not have this annotation problem.

ADD COMMENT • link 4.6 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Entering edit mode

Thanks a lot Kristoffer. I have two questions -

You wrote and then assign gene_names to those where there is no ambiguity. What do you mean by no ambiguity? How can I check that?
Given that Stringtie per default doesn't assign gene names, and for DGE/DTE analysis its imperative to know the gene/transcript name corresponding to the count values, this would look like a major flaw. How do people generally go about doing differential expression analysis after running tools like Stringtie? Does everyone face this problem that most of the reads can't be assigned to specific genes/transcripts?

ADD REPLY • link 4.6 years ago by c_u ▴ 520

1

Entering edit mode

Look for cases where the StringTie gene is associted with multiple refrence gene names.
It's a bit inconvinent for sure - but also a really hard problem. I would rather a tool does not solve a problem unless it does it in a good way. Most people probably use the StringTie IDs to summarize, and the ref_gene_id to get the gene_name and then ignore (knowingly or unknowingly) the problem of multiple joint genes (since it is a small fraction).

ADD REPLY • link 4.5 years ago by Kristoffer Vitting-Seerup ★ 4.0k