Question

Effect of Bootstrapping/Gibbs Sampling in Salmon Counts

1


3.1 years ago

saipra003 ▴ 20

Hi everyone, I am a bit confused about the difference between Gibbs sampling and bootstrapping in Salmon, and about how these procedures affect downstream analysis. For context, I am analyzing 49 matched cancer vs. normal RNA-seq samples in the context of alternative splicing (i.e. I am trying to cluster together patients with similar alternative splicing profiles and then see which genes are driving the clustering). I have read that bootstrapping and Gibbs sampling improve transcript quantification for downstream analysis, but I am unsure how dramatic this effect would be for my purpose. Any advice or help in this regard would be appreciated!

Gibbs Bootstrapping RNAseq Salmon Sampling • 2.5k views

updated 19 months ago by Gordon Smyth ★ 8.2k • written 3.1 years ago by saipra003 ▴ 20

2


20 months ago

Gordon Smyth ★ 8.2k

edgeR uses the Salmon bootstrap samples to assess differential transcript expression. edgeR has had this functionality since October 2018 but we have only this year written up a formal manuscript with performance comparisons, see

Baldoni PL, Chen Y, Hediyeh-zadeh S, Liao Y, Dong X, Ritchie ME, Shi W, Smyth GK (2023). Dividing out quantification uncertainty allows efficient assessment of differential transcript expression with edgeR. Nucleic Acids Research, https://doi.org/10.1093/nar/gkad1167
Section 4.6 of the edgeR User's Guide https://bioconductor.org/packages/edgeR

updated 19 months ago by Gordon Smyth ★ 8.2k

score 3 · Accepted Answer · 2022-05-27

To be clear, enabling bootstrapping or Gibbs sampling does not change the “primary estimate” (i.e. the TPM or NumReads values in the quant.sf file) at all. Rather, bootstrapping and Gibbs sampling are both ways to estimate _posterior uncertainty_. That is, when salmon estimates a particular abundance for a transcript in a sample (say, transcript A produced 500 fragments), sometimes there is a high degree of certainty in this estimate and other times a lot of uncertainty. For example, if all 500 fragments assigned to this transcript map uniquely back to it, the uncertainty will be very low. On the other hand, if this transcript has a near-identical splice variant or an allelic variant, and all or almost all of these reads are multi-mapping, the uncertainty may be quite high.
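To make the unique vs. multi-mapping contrast concrete, here is a toy sketch in Python. It is not salmon's actual algorithm: it assumes just two transcripts of equal effective length, a bare-bones EM, and made-up fragment counts, and the function names `em_count_a` and `bootstrap_a` are invented for the illustration. Both scenarios give the same point estimate for transcript A (250 fragments), but resampling the fragments and re-running the EM gives a much wider spread when nearly all fragments are ambiguous.

```python
import random
import statistics


def em_count_a(u_a, u_b, amb, n_iter=2000):
    """Toy EM for two equal-length transcripts A and B.

    u_a, u_b: fragments mapping uniquely to A or to B.
    amb:      fragments compatible with both transcripts.
    Returns the estimated number of fragments originating from A.
    """
    total = u_a + u_b + amb
    theta_a = 0.5  # initial guess for A's share of the fragments
    for _ in range(n_iter):
        # E-step: ambiguous fragments are split by the current proportions
        # (B's share is 1 - theta_a, so A receives amb * theta_a).
        n_a = u_a + amb * theta_a
        # M-step: update A's share from the expected counts.
        theta_a = n_a / total
    return theta_a * total


def bootstrap_a(u_a, u_b, amb, n_boot=200, seed=0):
    """Resample fragments with replacement and re-run the toy EM each time."""
    rng = random.Random(seed)
    total = u_a + u_b + amb
    estimates = []
    for _ in range(n_boot):
        draws = rng.choices(("a", "b", "ab"), weights=(u_a, u_b, amb), k=total)
        estimates.append(
            em_count_a(draws.count("a"), draws.count("b"), draws.count("ab"))
        )
    return estimates


# Scenario 1: every fragment maps uniquely (250 to A, 250 to B).
unique_sd = statistics.stdev(bootstrap_a(250, 250, 0))
# Scenario 2: only 10 unique fragments; the other 490 are ambiguous.
ambig_sd = statistics.stdev(bootstrap_a(5, 5, 490))

print("point estimates for A:", em_count_a(250, 250, 0), em_count_a(5, 5, 490))
print("bootstrap SD, unique case:   ", round(unique_sd, 1))
print("bootstrap SD, ambiguous case:", round(ambig_sd, 1))
```

In the ambiguous scenario the split between A and B is decided by a handful of unique fragments, so small changes in which fragments are resampled move the estimate a lot. That spread is the kind of information the bootstrap replicates carry and the point estimate alone does not.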

The primary estimates used in most common downstream analyses are “point” estimates. That is, in this case, they are maximum likelihood estimates with no notion of their uncertainty. Bootstrapping and Gibbs sampling are two different ways to estimate the uncertainty of each abundance point estimate. They generate information that downstream analysis tools can use to assess not just what the best estimate of abundance is for a transcript in a sample, but how certain we are of that abundance. However, not all downstream tools take advantage of this information. For example, if you are performing a differential analysis, a tool like swish will take advantage of this information, as will edgeR's bootstrap-based workflow (see Gordon Smyth's answer in this thread), but a standard DESeq2 analysis will not. So, you can always look at the variance of the bootstrap replicates or Gibbs samples yourself to assess the confidence in a transcript’s expression, but if you want to make use of this information systematically in downstream analysis, you need to find a tool for your chosen task that takes advantage of it.
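As a rough sketch of the "look at the variance yourself" route, the Python snippet below summarizes per-transcript spread across replicates. It assumes the bootstrap replicates or Gibbs samples have already been loaded into a list of rows, one row per replicate and one column per transcript; reading salmon's output files is not shown. The function name `summarize_replicates`, the transcript names, and the numbers are all made up for illustration.

```python
import statistics


def summarize_replicates(names, replicates):
    """Per-transcript mean, sample variance, and coefficient of variation.

    names:      transcript identifiers, one per column.
    replicates: list of rows; each row holds one replicate's estimated
                counts, in the same order as `names`.
    """
    summary = {}
    for j, name in enumerate(names):
        values = [row[j] for row in replicates]
        mean = statistics.fmean(values)
        var = statistics.variance(values)
        # CV is undefined when the mean is zero, so report NaN there.
        cv = (var ** 0.5) / mean if mean > 0 else float("nan")
        summary[name] = {"mean": mean, "variance": var, "cv": cv}
    return summary


# Made-up replicate values for two hypothetical transcripts.
names = ["txA", "txB"]
replicates = [
    [500.0, 100.0],
    [502.0, 40.0],
    [498.0, 160.0],
]

summary = summarize_replicates(names, replicates)
for name, stats in summary.items():
    print(name, stats)
```

A transcript whose replicates barely move (like `txA` here) has a well-supported estimate, while one with a large coefficient of variation (like `txB`) deserves caution before it is allowed to drive clustering or other downstream results.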