Gibbs sampling/bootstrap replicates in Salmon
3.6 years ago
Dunois ★ 2.5k

This is a sort-of-kind-of follow-up to this post.

Say I've performed de novo transcriptome assembly with Trinity and quantification with Salmon. Let's suppose Salmon also produced either Gibbs samples or bootstrap replicates alongside the quantification.

My first question is: how do I process and interpret the Gibbs/bootstrap data?

I understand that I can preprocess the data with ConvertBootstrapsToTSV.py, but then what? I've looked around, but there really doesn't seem to be any documentation (especially for biologists/non-bioinformaticians) on how to make use of this data (at the time of reading that post, I incorrectly assumed the downstream analysis steps were well documented). The closest thing I could find was this issue on Salmon's GitHub, but that still leaves one hanging less than a quarter of the way there.

I presume that the Gibbs/bootstrap estimates represent the (theoretical? estimated?) counts (that would have been?) generated from the set of reads at hand for a given set of transcripts over n iterations/draws (conditional upon some set of constraints?). So, if a transcript t is quantified using Salmon with (say) Gibbs sampling enabled (e.g., --numGibbsSamples 4), it would generate a set of counts like so: {20, 44, 22, 18}. I suppose, then, that the variance of this set is the uncertainty in inference that @Rob mentioned in their post. Is this correct?
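
For concreteness, here's roughly what I imagine doing with the output of ConvertBootstrapsToTSV.py. I'm assuming (possibly incorrectly) that the TSV has one column per transcript and one row per Gibbs draw/bootstrap replicate, and the file path below is just a placeholder:

```python
# Sketch only: summarize per-transcript uncertainty from the Gibbs/bootstrap draws.
# Assumes quant_bootstraps.tsv has a header of transcript IDs and one row per draw.
import pandas as pd

draws = pd.read_csv("quant_bootstraps.tsv", sep="\t")

summary = pd.DataFrame({
    "mean_count": draws.mean(axis=0),  # posterior mean count per transcript
    "var_count": draws.var(axis=0),    # spread across draws = the uncertainty (?)
})

# e.g. the four draws {20, 44, 22, 18} for transcript t would give a mean of 26
# and a fairly large variance
print(summary.head())
```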

My second question pertains to using these estimates to rule out transcript isoforms (as Trinity defines them) in the context of finding candidate proteins. Basically, I have two transcript isoforms that yield protein sequences that I've identified as homologs to a protein of interest. The problem is that both of them have the same e-value. My first idea was to check the read support for both of them, and potentially drop the one with lower read support. But it turned out that the one with the lower read support actually aligns slightly better with the protein of interest in comparison to the one with greater read support. So this brings me to the question at hand: If I have the aforementioned estimates for both these isoforms, and one of them has a high uncertainty in its counts, can I rule that one out as unlikely to be a "real" candidate even if it has greater read support?

Your inputs would be much appreciated.

salmon uncertainty

3.6 years ago
Rob 6.5k

Hi @Dunois,

Thanks for the follow-up. You’re absolutely right that there’s not a lot of documentation around for using the posterior estimates _for this purpose_. Currently, the most common uses of these estimates are to improve transcript-level differential expression, as outlined here (http://www.bioconductor.org/packages/devel/bioc/vignettes/fishpond/inst/doc/swish.html), or to group together transcripts that are inferentially indistinguishable as described in the terminus tutorial (https://combine-lab.github.io/terminus-tutorial/2020/running-terminus/).

However, I think the relative sparsity of downstream usage of inferential uncertainty is mostly a historical accident, as these estimates were typically not assumed to be part of the output of transcript-level quantification. So, you are absolutely right that it would be great to develop more documentation and best practices in the community for using this information for other purposes. To your questions:

My first question is: how do I process and interpret the Gibbs/bootstrap data?

The easiest way would be to draw a relatively large number of Gibbs samples (say 100), and then to look at the sorted list of values for each transcript. Taking e.g. the inner 98 entries of this list would define the 98% credible interval (https://en.m.wikipedia.org/wiki/Credible_interval) for this transcript. In terms of interpretation, I think this would be the most straightforward way. You can, of course, do things like take the mean (or, preferably, the median) of the posterior samples to get a potentially more robust estimate of the transcript's expression in a sample.
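
As a rough sketch of what this could look like in practice (assuming you've converted the samples into a table with one row per Gibbs draw and one column per transcript, e.g. via ConvertBootstrapsToTSV.py; the file name below is just a placeholder):

```python
# Sketch: 98% credible intervals and posterior medians from ~100 Gibbs draws.
# Assumes a TSV with one row per draw and one column per transcript.
import pandas as pd

draws = pd.read_csv("quant_bootstraps.tsv", sep="\t")  # e.g. 100 rows x N transcripts

cred = pd.DataFrame({
    "lower": draws.quantile(0.01, axis=0),   # lower end of the 98% credible interval
    "median": draws.quantile(0.50, axis=0),  # robust point estimate of expression
    "upper": draws.quantile(0.99, axis=0),   # upper end of the 98% credible interval
})

# Transcripts whose interval is wide relative to their median are the ones
# with high inferential uncertainty.
cred["rel_width"] = (cred["upper"] - cred["lower"]) / (cred["median"] + 1.0)
print(cred.sort_values("rel_width", ascending=False).head())
```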

My second question pertains to using these estimates to rule out transcript isoforms (as Trinity defines them) in the context of finding candidate proteins. Basically, I have two transcript isoforms that yield protein sequences that I've identified as homologs to a protein of interest. The problem is that both of them have the same e-value. My first idea was to check the read support for both of them, and potentially drop the one with lower read support. But it turned out that the one with the lower read support actually aligns slightly better with the protein of interest in comparison to the one with greater read support. So this brings me to the question at hand: If I have the aforementioned estimates for both these isoforms, and one of them has a high uncertainty in its counts, can I rule that one out as unlikely to be a "real" candidate even if it has greater read support?

So, there are a few possibilities here. As I mentioned in the other post, the posterior estimates may help untangle what is going on here, but as far as I know, there is no "standard" way to solve this issue (other than to group very similar transcripts and treat them together using something like e.g. CD-HIT-EST). If both of the transcripts look like they have substantial support in their posterior samples, then the next question would be: do they have substantial support at the same time? That is, is there sufficient read evidence to support the existence of both of these transcripts in your sample at the same time, or is one transcript only abundant in posterior samples where the other is absent or has very low expression? This, in fact, is exactly the type of question that terminus is designed to answer. It aims to group together transcripts that are difficult (or impossible) to distinguish inferentially.

So, you could do a few things here. You could look at the posterior samples for this pair of transcripts (not sorted in any way --- the order is important), and see if the posterior samples are strongly anti-correlated (whenever one transcript increases in abundance, the other decreases by a similar amount). If they are, that is strong evidence that, even though Trinity's algorithm decided to report both, there is not sufficient evidence in the reads to distinguish between these two possibilities if they are present at the same time. This wouldn't be too surprising, as Trinity is quite sensitive, and this may sometimes come at the cost of reporting very similar (almost redundant) transcripts. Additionally, you could try running terminus on your samples and see if these transcripts are grouped together by its algorithm. That would also be evidence that they are not distinct enough to be accurately quantified simultaneously.
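
As a rough sketch of that anti-correlation check (again assuming the same draws-by-transcripts table, kept in its original, unsorted order; the transcript IDs below are placeholders for your two isoforms):

```python
# Sketch: check whether two isoforms' posterior samples are strongly anti-correlated.
import pandas as pd

draws = pd.read_csv("quant_bootstraps.tsv", sep="\t")  # one row per Gibbs draw, unsorted

t1 = "TRINITY_DN1000_c0_g1_i1"  # hypothetical isoform IDs; substitute your own
t2 = "TRINITY_DN1000_c0_g1_i2"

r = draws[t1].corr(draws[t2])  # Pearson correlation across the draws
print(f"correlation between {t1} and {t2}: {r:.2f}")

# A correlation close to -1 means the draws trade reads back and forth between
# the two isoforms, i.e. the reads cannot really distinguish them.
```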

If this is the case, then the typical thing to do would be to simply choose one to be the representative. If they are, in fact, very similar to each other, then this representative should be able to explain most of the reads from the other. If they are _not_ very similar to each other, then the next question would be why they would have anti-correlated posterior samples. But we'd have to dig deeper into the data for that.

Hi Rob,

Thank you so much for your super detailed response. The documentation your lab puts out is always very detailed and helpful, so maybe there's some material you folks can prepare for this topic (I guess in conjunction with terminus)?

Coming to my problem: I took the easy route and ran terminus as you suggested. Now I'm feeling quite hazy about how to deal with things downstream.

The way I see it, if terminus puts a bunch of isoforms in a cluster, then as you said I could just choose a representative. If I were to have to break ties somehow, I'd go with the counts. And if terminus declares two isoforms as not being in a cluster, I'd still just choose the one with the most counts assigned to it. Where am I going wrong here? (I know this is probably a bit unfair to point that rhetorical question at you, but all the same.)

Also, terminus seems to have created a quant.sf file of its own. What's the difference between this one and the original quant.sf from salmon?

Edit: I also profusely apologize if I am asking really basic, terrible questions. I am super grateful for your help here.
