Question

Can I compare TPM values for a transcript across different transcriptome assemblies?

4.7 years ago

molly77 ▴ 10

Hi,

I have three transcriptome assemblies representing two different tissues from three different species within the same genus. Each transcriptome assembly is built from three biological replicates (two tissues from each individual, so six RNA-seq libraries per transcriptome assembly). I have Salmon-generated TPM values for each RNA-seq library across all three assemblies.

One of these species has an ecological trait that the other two species have only to a much lesser extent. It is hypothesized that this trait in sp. 1 is attributable to a novel protein that is found only in this genus and is highly expressed within the aforementioned tissues (we already know it is highly expressed in sp. 1). When I look at the TPM value for this transcript in sp. 1, it is among the top 10 most highly expressed transcripts (which we predicted); however, the TPM values for this transcript in the other two species are much lower (which we also predicted).

Is it okay to directly compare TPM values for the same transcript across three different transcriptome assemblies from three different species? Can I infer that this transcript is more highly expressed in sp.1 relative to the other two species? I cannot find any literature or posts in regard to this, if you can point me to any resources, PLEASE DO.

Thanks.

RNA-Seq Gene expression TPM transcript • 3.7k views

ADD COMMENT • link 4.7 years ago by molly77 ▴ 10

Could you clarify: Is it exactly the same transcript you have quantified in the 3 species or are they just similar?

ADD REPLY • link 4.7 years ago by Kristoffer Vitting-Seerup ★ 4.0k

score 0 · Answer 1 · 2019-07-23

I cannot find any literature or posts in regard to this, if you can point me to any resources

You cannot, because it is not possible to compare expression between two different species: many biological factors will affect your quantification, even if it is normalized to TPM. The best you can do is report what you found: the gene is highly expressed in species 1, whereas in species 2 it is not.

score 0 · Answer 2 · 2019-07-23

As JC pointed out, comparing absolute expression levels across species is probably not very meaningful. You may just want to resort to rank-based comparisons, which is something you seem to have done already. One thing you may want to check is whether there are differences in sequence composition (e.g. GC bias); if those are quite different between the species, you would have to correct for them, because they will influence the number of reads that are assigned to the individual transcripts.
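A minimal sketch of such a rank-based comparison, assuming each species' TPM values are available as a simple name-to-TPM mapping. The transcript names and values are hypothetical toy data, and `percentile_rank` is an illustrative helper, not part of Salmon:

```python
def percentile_rank(tpm_by_transcript, target):
    """Fraction of expressed (TPM > 0) transcripts with a lower TPM than the target."""
    expressed = [tpm for tpm in tpm_by_transcript.values() if tpm > 0]
    target_tpm = tpm_by_transcript[target]
    below = sum(1 for tpm in expressed if tpm < target_tpm)
    return below / len(expressed)


# Hypothetical toy TPM tables, one per species-specific assembly.
sp1 = {"novel": 9000.0, "a": 500.0, "b": 120.0, "c": 30.0, "d": 0.0}
sp2 = {"novel_ortholog": 40.0, "a": 800.0, "b": 300.0, "c": 10.0, "e": 5.0}

sp1_rank = percentile_rank(sp1, "novel")           # 3 of 4 expressed transcripts are lower
sp2_rank = percentile_rank(sp2, "novel_ortholog")  # 2 of 5 expressed transcripts are lower
print(sp1_rank, sp2_rank)
```

Ranks computed this way are relative to each assembly's own set of expressed transcripts, so they sidestep the question of whether the TPM scales themselves match.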

Whether there's any biological meaning to your observation will have to be determined with other means anyway, i.e. who's to say that the two species where that gene does not seem to be in the top 10 expressed genes simply don't need such high levels of it because they're better at keeping it around/not destroying it etc. etc.
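For the sequence-composition check mentioned above, a rough first look could be to compare the GC content of the assembled transcripts between species. This is only a sketch: the sequences are hypothetical toy strings (in practice they would be read from each assembly's FASTA file), and `gc_fraction` and `median_gc` are illustrative helpers. Salmon also has a `--gcBias` option that models fragment GC bias during quantification.

```python
from statistics import median


def gc_fraction(seq):
    """GC fraction of a sequence, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def median_gc(sequences):
    """Median GC fraction over a collection of transcript sequences."""
    return median(gc_fraction(s) for s in sequences)


# Hypothetical toy transcripts for two assemblies.
sp1_seqs = ["ATGCGC", "GGCCAT", "ATATAT"]
sp2_seqs = ["ATATGC", "AATTAA", "ATGCAT"]

print(median_gc(sp1_seqs), median_gc(sp2_seqs))
```

A large gap between the two medians would be a hint that composition effects deserve a closer look before interpreting any cross-species difference.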

score 0 · Answer 3 · 2019-07-24

4.7 years ago

Kristoffer Vitting-Seerup ★ 4.0k

I have to say I tend to disagree with JC and Friederike. According to this blog post by Harold Pimentel (co-author of Kallisto), the interpretation of a TPM value is:

if you were to sequence one million full length transcripts, TPM is the number of transcripts you would have seen [...], given the abundances of the other transcripts in your sample.

This means that TPM is a relative measure within a condition, and thereby TPM values should be comparable across conditions/species - right?

ADD COMMENT • link 4.7 years ago by Kristoffer Vitting-Seerup ★ 4.0k

Yes...assuming the same set of transcripts is quantified between experiments, and the non-zero count transcripts are identical between sets (i.e. the set of "expressed" transcripts is the same between conditions). In the case above, we have both different conditions AND different transcriptomes. It's not even clear from the original question that the transcripts being compared are 100% identical, and they likely each emerge from transcriptomes that have many non-overlapping entities. So even though the units are TPM, it's still comparing apples and oranges.
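To illustrate why the transcript set matters, here is a small sketch using the usual TPM definition (reads divided by effective length, rescaled so all transcripts sum to one million). The counts, lengths and transcript names are hypothetical toy values:

```python
def tpm(counts, eff_lengths):
    """TPM: per-length read rate of each transcript, rescaled to sum to 1e6."""
    rates = {t: counts[t] / eff_lengths[t] for t in counts}
    total = sum(rates.values())
    return {t: rate / total * 1e6 for t, rate in rates.items()}


# The target has identical reads and length in both hypothetical assemblies;
# assembly B simply contains one extra expressed transcript.
counts_a = {"target": 500, "bg1": 500}
lengths_a = {"target": 1000.0, "bg1": 1000.0}

counts_b = {"target": 500, "bg1": 500, "bg2": 1000}
lengths_b = {"target": 1000.0, "bg1": 1000.0, "bg2": 1000.0}

tpm_a = tpm(counts_a, lengths_a)
tpm_b = tpm(counts_b, lengths_b)
print(tpm_a["target"], tpm_b["target"])  # 500000.0 250000.0
```

The target's TPM halves purely because the background changed, which is the sense in which TPM values from different transcriptomes are not on the same scale.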

ADD REPLY • link 4.7 years ago by seidel 11k

Technically, it's not totally wrong. I just don't think it makes sense, i.e. what's the biological insight you're gaining that you didn't already know from the rank-based analysis?

ADD REPLY • link 4.7 years ago by Friederike 8.9k

It might be interesting to normalize to a subset of perhaps "common" or "house-keeping" genes that are conserved across species and known to maintain relatively stable expression levels.
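One way this could look in practice, as a rough sketch: divide each species' TPM values by the geometric mean TPM of a shared reference set. The gene names, values and choice of the geometric mean are assumptions for illustration only, and the reference genes must have non-zero TPM:

```python
from math import exp, log


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return exp(sum(log(v) for v in values) / len(values))


def scale_to_reference(tpm_by_transcript, reference_ids):
    """Express every TPM relative to the geometric mean of the reference genes."""
    ref = geometric_mean([tpm_by_transcript[r] for r in reference_ids])
    return {t: v / ref for t, v in tpm_by_transcript.items()}


# Hypothetical TPM values; hk1 and hk2 stand in for conserved housekeeping orthologs.
sp1 = {"novel": 8000.0, "hk1": 100.0, "hk2": 400.0}
sp2 = {"novel": 500.0, "hk1": 50.0, "hk2": 200.0}
reference = ["hk1", "hk2"]

sp1_scaled = scale_to_reference(sp1, reference)
sp2_scaled = scale_to_reference(sp2, reference)
ratio = sp1_scaled["novel"] / sp2_scaled["novel"]
print(ratio)  # approximately 8
```

The result is only as trustworthy as the assumption that the reference genes really are stably expressed in all species and tissues being compared.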

ADD REPLY • link 4.7 years ago by benformatics 3.9k

That's a great suggestion! This would provide at least some sort of assessment of how stable the TPM values are overall when the different species at hand are compared.

ADD REPLY • link 4.7 years ago by Friederike 8.9k

Thank you to everyone who offered input; this was incredibly helpful in regard to the between-experiment comparisons. However, I have another question about the same gene expression analyses, and I figure I'll post it here as a comment in the hope that you will offer additional insight....

So I left out one major caveat in my original post to try and keep things relatively simple: the genes we're interested in are part of a multigene family whose members are incredibly long and HIGHLY repetitive. This makes de novo assembly technically challenging, because in theory the reads pertaining to the repeats just collapse onto each other. What you're really left with are transcripts corresponding to either the non-repetitive N- or C-terminal domain, with very little, if any, of the 3' or 5' flanking repetitive region, respectively.

We have already done PacBio Iso-sequencing to obtain actual full-length transcripts of what we're interested in, so obtaining Illumina transcripts representative of the biological transcript is trivial, as we're primarily doing Illumina sequencing for gene expression analyses.

For the biological transcripts of interest (I say "biological transcript" because in theory you'll end up with two Trinity-assembled transcripts, i.e. the N- and C-terminal domains, derived from the SAME biological transcript), we would typically use the C-terminal domain assembled transcripts as a proxy for quantification. This is because of 3' bias from polyA isolation etc., which makes C-term transcripts more abundant relative to their N-term counterparts.

MY QUESTION: If we only use C-term transcripts as a proxy to quantify the transcripts of interest to us, are these values somehow inflated? Should we aggregate quant values across BOTH the N-term and C-term transcripts stemming from the same "biological transcript"? (Between the transcript types we're interested in, the terminal domains are distinct enough to know which N-term should be paired with which C-term.) Or should the N-terminal domain transcripts of what we're interested in be fished out and removed, and everything RE-quantified? Again, these transcripts are very highly expressed within the tissues we're sequencing.

If you're still reading this, and can offer some expertise, thank you very much. Enjoy your weekend!
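To make the options in the question concrete, here is a small sketch that contrasts the C-term-only proxy with summing the N- and C-term fragments per biological transcript. The fragment names, TPM values and the fragment-to-transcript map are hypothetical, and `aggregate` is an illustrative helper rather than anything provided by Trinity or Salmon:

```python
from collections import defaultdict


def aggregate(tpm_by_fragment, fragment_to_transcript):
    """Sum fragment-level TPM values per biological transcript."""
    totals = defaultdict(float)
    for fragment, value in tpm_by_fragment.items():
        totals[fragment_to_transcript[fragment]] += value
    return dict(totals)


# Hypothetical Trinity fragments and the biological transcript each belongs to.
fragment_tpm = {
    "geneA_Nterm": 1200.0,
    "geneA_Cterm": 3000.0,
    "geneB_Nterm": 400.0,
    "geneB_Cterm": 900.0,
}
fragment_map = {
    "geneA_Nterm": "geneA",
    "geneA_Cterm": "geneA",
    "geneB_Nterm": "geneB",
    "geneB_Cterm": "geneB",
}

summed = aggregate(fragment_tpm, fragment_map)
c_term_only = {
    fragment_map[f]: v for f, v in fragment_tpm.items() if f.endswith("_Cterm")
}
print(summed)       # geneA: 4200.0, geneB: 1300.0
print(c_term_only)  # geneA: 3000.0, geneB: 900.0
```

Note that TPM is already length-normalized, so summing two fragments that come from the same molecule may count that molecule twice. Removing the N-term fragments and re-quantifying would also shift every other transcript's TPM, since the values are rescaled to one million. Whichever option is used, applying it consistently across all three species is probably what matters most.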

ADD REPLY • link 4.7 years ago by molly77 ▴ 10