Question

How can I get true transcript_ids from StringTie?

4.4 years ago

robert.domingues • 0

We need to decide if we work with differentially expressed transcripts or differentially expressed genes.

We expected the gene-level quantification (FPKM and/or TPM) to be the sum over all transcripts of the same gene, but this is not what we observe. Ballgown appears to sum FPKM and/or TPM from only some of the gene's transcripts. This makes us hesitant to use differentially expressed genes in later analyses, since the per-gene FPKM and/or TPM estimate seems biased when it does not consider all transcripts.
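One way to check the expectation above is to sum the transcript-level TPM values directly from the StringTie GTF and compare them with the gene-level numbers reported downstream. The following is a minimal sketch, assuming the transcript lines carry `gene_id` and `TPM` attributes; the function name `gene_tpm_from_gtf` and the example records are made up for illustration.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def gene_tpm_from_gtf(lines):
    """Sum transcript-level TPM per gene_id from StringTie-style GTF lines."""
    totals = defaultdict(float)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # exon lines carry no TPM attribute
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "TPM" in attrs:
            totals[attrs["gene_id"]] += float(attrs["TPM"])
    return dict(totals)

def make_line(feature, attributes):
    """Build one hypothetical GTF line (coordinates are placeholders)."""
    return "\t".join(["chr1", "StringTie", feature, "100", "500",
                      "1000", "+", ".", attributes])

# Hypothetical records, not real data.
example = [
    make_line("transcript", 'gene_id "gene1"; transcript_id "gene1.1"; FPKM "5.0"; TPM "8.0";'),
    make_line("exon", 'gene_id "gene1"; transcript_id "gene1.1"; exon_number "1";'),
    make_line("transcript", 'gene_id "gene1"; transcript_id "gene1.2"; FPKM "1.5"; TPM "2.5";'),
    make_line("transcript", 'gene_id "gene2"; transcript_id "gene2.1"; FPKM "3.0"; TPM "4.0";'),
]

gene_tpm = gene_tpm_from_gtf(example)
print(gene_tpm)
```

If the sums computed this way differ from the gene-level values, comparing which transcripts share a `gene_id` versus a `gene_name` may show where the grouping diverges.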

On the other hand, within the same gene, one transcript may be more highly expressed in one group and another transcript more highly expressed in the other group (alternative splicing). Hence we are considering working with differentially expressed transcripts.

How can we get transcript_id values in the StringTie output (.gtf file) that are compatible with a database (NCBI or ENSEMBL)? In our current output the transcript_id values are:

"gene1.1"
"gene1.2"
"gene2.1"
"gene3.1"
"gene3.2"
"gene3.3"
..

StringTie RNA-Seq • 1.3k views

updated 10 months ago by Ram 43k • written 4.4 years ago by robert.domingues • 0

Which annotation file (GFF/GTF) did you use to map the transcripts? You need the annotation file used for mapping and the StringTie output, which is in the closely related GTF format. Use the GffCompare tool to get the reference IDs for the transcripts.
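As a rough illustration of that last step: running something like `gffcompare -r reference.gtf -o <prefix> stringtie.gtf` (file names are placeholders) writes, among other outputs, a tab-separated `.tmap` table pairing each query transcript with its closest reference transcript. The sketch below turns such a table into an ID lookup. It assumes the header columns are named `ref_id`, `class_code` and `qry_id`, so check them against your own file; the function name and the example rows are hypothetical.

```python
import csv
import io

def stringtie_to_reference(tmap_handle, keep_codes=("=",)):
    """Map StringTie transcript IDs to reference transcript IDs.

    Only rows whose class code is in keep_codes are kept; "=" marks a
    complete intron-chain match to the reference transcript.
    """
    reader = csv.DictReader(tmap_handle, delimiter="\t")
    mapping = {}
    for row in reader:
        if row["class_code"] in keep_codes:
            mapping[row["qry_id"]] = row["ref_id"]
    return mapping

# Hypothetical .tmap content (only the columns used here are shown).
example_tmap = io.StringIO(
    "ref_gene_id\tref_id\tclass_code\tqry_gene_id\tqry_id\n"
    "GENE_A\tTX_A1\t=\tgene1\tgene1.1\n"
    "GENE_A\tTX_A2\tj\tgene1\tgene1.2\n"
    "-\t-\tu\tgene2\tgene2.1\n"
)

id_map = stringtie_to_reference(example_tmap)
print(id_map)
```

Transcripts with other class codes (for example `j` for a partial junction match or `u` for intergenic) are novel or only partly matching, so they have no exact database ID to assign; whether to include them is a judgment call.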

4.4 years ago by c.chakraborty ▴ 170

score 0 · Answer 1 · 2019-11-25

First of all, you probably don't want to use Ballgown: it is one of the worst DE tools in most benchmarks (e.g. this) and even the authors don't use it themselves (e.g. this article). Use DESeq2, edgeR or limma-voom instead; it's very easy after using tximport to get the StringTie data into R. tximport can even do the summing to gene level for you.

Could you elaborate a bit on the problem itself? Do you mean there is a problem with summing by gene_id or by gene_name? Also, which species are you working with? How did you identify the problem?

With regard to transcript analysis, I would highly recommend looking into differential transcript usage, as it nicely complements differential gene expression and enables analysis of isoform switches and alternative splicing. Incidentally, my R package IsoformSwitchAnalyzeR enables such analysis, both at a genome-wide level and for individual genes. Examples of the analysis produced by the package can be found in this section of the vignette.
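To make the idea of transcript usage concrete, here is a minimal sketch of the underlying quantity: the isoform fraction, meaning a transcript's expression divided by the total expression of its gene, and the difference in that fraction between two conditions. This is plain arithmetic on made-up numbers, not the package's implementation, and it does no statistical testing; all names and values are hypothetical.

```python
def isoform_fractions(transcript_expr, transcript_to_gene):
    """Return each transcript's share of its parent gene's total expression."""
    gene_totals = {}
    for tx, value in transcript_expr.items():
        gene = transcript_to_gene[tx]
        gene_totals[gene] = gene_totals.get(gene, 0.0) + value
    fractions = {}
    for tx, value in transcript_expr.items():
        total = gene_totals[transcript_to_gene[tx]]
        # A gene with zero expression has no defined isoform fraction.
        fractions[tx] = value / total if total > 0 else float("nan")
    return fractions

# Hypothetical mean TPM per condition for two isoforms of one gene.
tx2gene = {"gene3.1": "gene3", "gene3.2": "gene3"}
condition_a = {"gene3.1": 30.0, "gene3.2": 10.0}
condition_b = {"gene3.1": 10.0, "gene3.2": 30.0}

if_a = isoform_fractions(condition_a, tx2gene)
if_b = isoform_fractions(condition_b, tx2gene)

# Difference in isoform fraction between conditions (B minus A).
dif = {tx: if_b[tx] - if_a[tx] for tx in tx2gene}
print(dif)
```

In this toy case the gene total is 40 in both conditions, so a gene-level test would see no change, while the isoform fractions swap between 0.75 and 0.25. That is the kind of event a usage analysis is meant to pick up.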

From an interpretation point of view, I personally find differential transcript expression a bit redundant if you already have differential gene expression and differential isoform usage. In other words, I find differential transcript expression uninteresting unless it leads to differential gene expression or differential isoform usage, both of which are better analysed with a more direct approach. Please note that this preference is only from an interpretation point of view. From an analysis point of view, there is some cool recent evidence that performing transcript-level differential expression analysis and then aggregating the results to gene-level events might produce better results (e.g. this article and many more).