Question: Stringie output doesn't have any gene names!
0
gravatar for chahat_u
6 weeks ago by
chahat_u140
United States
chahat_u140 wrote:

Hi,

So, I ran Stringtie on data derived from 38 samples, according to the protocol mentioned in the paper - https://www.nature.com/articles/nprot.2016.095. I generated the merged gtf file after running stringtie merge (using the -G option), and then I performed transcript abundance calculation.

After that, I used the method given here to run tximport using the Stringtie files. However, upon doing that, I saw that tximport just gave one large number for each sample, instead of raw counts for each gene. So, I tried to look at the data carefully and found that there was no gene_names in the stringtie_merged gtf file, as a result of which (probably) tximport wasn't able to give separate counts for the genes/transcripts. (In comparison, the merged.gtf file in their paper does have gene_names at least for some transcripts)

Then, I searched online and found a code by the developer here that is supposed to append gene_names to the merged.gtf file, but on using it, the output gtf still doesn't have any gene names.

TLDR - what should I do to ensure that the merged.gtf file has the gene_names so that tximport can assign raw counts to the transcripts/genes?

rna-seq stringtie • 224 views
ADD COMMENTlink modified 6 weeks ago by kristoffer.vittingseerup2.6k • written 6 weeks ago by chahat_u140
1

I need to ask you two questions before being able to answer you: 1)Is it only gene-names that are the problem or do you also lack gene_ids? 2) When you ran stringtie --merge did you use the -G option to include a refrence?

ADD REPLYlink written 6 weeks ago by kristoffer.vittingseerup2.6k

Hi Kristoffer,

1) I do see gene_id's in the stringtie merged file, BUT, there are 2.1 million lines in the stringtie_merged.gtf and for 1.95 million of them, the gene_id is of the form MSTRG.x - only the last 0.15 million lines have gene_id of the form ENSG....

2) Yes, I did use the -G option : stringtie --merge -p 8 -G /Volumes/bam/DRG/Homo_sapiens.GRCh38.97.gff3

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by chahat_u140
2
gravatar for kristoffer.vittingseerup
6 weeks ago by
European Union
kristoffer.vittingseerup2.6k wrote:

The problem is that assigning gene names is not trivial since for a subset of the data the genes identified by StringTie (defined as overlapping (re-constructed) transcripts) might overlap multiple genes - then how do you decide which gene name to assign to a specific transcript? This is a very hard task which is why StringTie per default does not assign gene names.

This is however only for a smaller percentage of the data - for the majority of the data a StringTie defined gene (the MSTRGxxxxx) is uniquely associated with a reference gene_id (via the "ref_gene_id" entry in the GTF file) which you manually can assign to the corresponding gene_name via the reference GTF file.

What I would suggest is that you use tximport to summaize to gene_id level (you dont want to use gene_name since sometimes they are duplicated) and then assign gene_names to those where there is no ambiguity. For the cases with multiple overlapping genes I'm not aware of any good automatic way of splitting the data.

The only alternative I know is to use Cuffmerge instead of StringTie --merge since Cuffmerge does not have this annotation problem.

ADD COMMENTlink written 6 weeks ago by kristoffer.vittingseerup2.6k

Thanks a lot Kristoffer. I have two questions -

  1. You wrote and then assign gene_names to those where there is no ambiguity. What do you mean by no ambiguity? How can I check that?

  2. Given that Stringtie per default doesn't assign gene names, and for DGE/DTE analysis its imperative to know the gene/transcript name corresponding to the count values, this would look like a major flaw. How do people generally go about doing differential expression analysis after running tools like Stringtie? Does everyone face this problem that most of the reads can't be assigned to specific genes/transcripts?

ADD REPLYlink written 5 weeks ago by chahat_u140
1
  1. Look for cases where the StringTie gene is associted with multiple refrence gene names.
  2. It's a bit inconvinent for sure - but also a really hard problem. I would rather a tool does not solve a problem unless it does it in a good way. Most people probably use the StringTie IDs to summarize, and the ref_gene_id to get the gene_name and then ignore (knowingly or unknowingly) the problem of multiple joint genes (since it is a small fraction).
ADD REPLYlink written 5 weeks ago by kristoffer.vittingseerup2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1965 users visited in the last hour