Question

How does Salmon bootstrapping work?

2

Entering edit mode

3.2 years ago

scsc185 ▴ 80

I am trying to understand how salmon creates bootstrap samples. I read the original salmon paper, and in the paper, it says the following.

The bootstrap sampling process works by sampling (with replacement) counts for each equivalence class, and then re-running the offline inference procedure (either the EM or VBEM algorithm) for each bootstrap sample.

What I understand right now:

Bootstrap means sampling with replacement.
An equivalence class is a set of transcripts that a given read can possibly come from.
Read counts are how many reads come from a specific transcript.
EM algorithm in the context of HMM models.

But I am having trouble putting these concepts together to understand the sentence above from the paper. Specifically,

How do you sample read counts? Is it sampling the reads that come from a given transcript?
What's the role of EM algorithm here? What parameters are being maximized here?

Thank you!

RNA-seq salmon bootstrapping • 2.8k views

ADD COMMENT • link 3.2 years ago by scsc185 ▴ 80

1

Entering edit mode

Gibbs sampling/bootstrap replicates in Salmon

ADD REPLY • link 3.2 years ago by GenoMax 152k

score 4 · Answer 1 · 2022-04-29

Hi scsc185,

During it's streaming estimation phase, salmon partitions mapped fragments into range-factorized equivalence classes — which groups together fragments that map to the same set of transcripts with the same approximate conditional probabilities.

During offline estimation (the VBEM or EM, depending on the flags you pass to salmon), the counts of these equivalence classes of fragments, and the conditional probabilities that induced them, are treated as sufficient statistics, at which point a standard bootstrap procedure is run. Consider A(X) to be the estimation algorithm (e.g. VBEM or EM) on the vector of range-factorized equivalence class counts X. Each bootstrap replicate is then generated according to the process for the standard bootstrap procedure) — that is, a draw X' ~ X is made with replacement — treating the count vector X as an empirical distribution a new X' is generated by sampling counts according to X. Then the bootstrap estimate B' = A(X') — that is, the estimation algorithm is simply re-run on X'. If you request 50 bootstrap samples, this is done 50 times. Each time a new X' is drawn, it will be from the distribution of X, but slightly different. This lets salmon assess how differences in the sampling variability (which specific X' is drawn) affect the overall abundance estimates. The result, when looking across the set of bootstrap replicates, is an estimate of the variance (or uncertainty) in the point estimate for the abundance of each transcript.

With respect to the specifics of your questions:

How do you sample read counts? Is it sampling the reads that come from a given transcript?

So, what is being sampled are counts for range-factorized equivalence classes. That is, counts are being drawn, distributed according to the empirical distribution, such that they have the same pattern of multi mapping and conditional probabilities as the original fragments. This works because the counts from the range-factorized equivalence classes act as the sufficient statistics of the inference procedure, so one can bootstrap over those directly rather than the more traditional approach (as taken in e.g. IsoEM) of bootstrapping the actual reads themselves.

What's the role of EM algorithm here? What parameters are being maximized here?

Most probabilistic quantification tools maximize a likelihood that looks like L(θ; F, T), where θ are the parameters. You can view θ as a vector, equal in length to the number of annotated transcripts, that measures the estimated nucleotide fraction of each transcript in the annotation in the current sample. The likelihood seeks to find the θ that maximize the likelihood of the observed data F (the set of sequenced fragments) over the set of annotated transcripts T. Since the range-factorized equivalence class counts are the sufficient statistics, we have that L(θ; F, T) ∝ L(θ; X, T). Thus, for each bootstrap replicate X', we seek to maximize L(θ; X', T).

For the Gibbs Sampling, salmon uses a modified version of the 2-step Gibbs chain as explained in the MMSEQ paper. This is similar in spirit to the bootstrap, but is a Gibbs sampler over the posterior distribution of abundance estimates for the transcripts.

score 2 · Answer 2 · 2022-04-29

For the EM algorithm (and the objective function that is to be maximized), see: https://bioinfo.iric.ca/understanding-how-kallisto-works/

For bootstrapping, imagine each equivalence class (EC) is a different colored ball (e.g. EC 1 = red, EC 2 = blue, EC 3 = green, etc.) and you have 10 red balls, 4 blue balls, 3 green balls, etc. You basically sample your balls with replacement and then, with your bootstrap sample, you try to maximize the likelihood function.