Using bootstrap testing in kallisto to obtain estimated transcript counts
8.7 years ago
neuner.sarah ▴ 120

Hi,

I am using the program Kallisto (version 0.42.2.1) to perform pseudoalignment to a transcriptome and obtain estimated counts for each transcript. The program has an option to perform multiple bootstraps during this procedure, and I recently ran a test on one of my samples using 0, 1, 5, 10, 50, and 100 bootstraps to see at which point the estimated counts become relatively stable.
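For reference, the test looked roughly like the sketch below, driven from R for convenience (the index, read files, and output directory names are placeholders; only the -b value changes between runs):

    # Run kallisto quant with an increasing number of bootstraps (-b).
    # Index, reads, and output directories are placeholders.
    for (b in c(0, 1, 5, 10, 50, 100)) {
      system2("kallisto",
              c("quant",
                "-i", "transcripts.idx",          # kallisto index (placeholder)
                "-o", paste0("kallisto_b", b),    # one output directory per bootstrap setting
                "-b", b,                          # number of bootstrap samples
                "sample_R1.fastq.gz", "sample_R2.fastq.gz"))  # placeholder reads
    }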

Surprisingly, the number of bootstraps did not change the estimated counts contained in the final output file abundance.tsv. However, the program also outputs an abundance file for each individual bootstrap run (compressed into abundance.h5), and the counts contained in these files were different. So the estimated counts differed slightly from bootstrap to bootstrap, but the final output file (abundance.tsv) did not change, no matter how many bootstraps were run.

If anyone could help me understand why the estimated counts differ between bootstraps but have no effect on the final estimated counts, it would be appreciated!

Thanks,
Sarah

What factors should one consider to determine how many bootstraps should be performed? Thanks

8.7 years ago
pmelsted ▴ 120

The output in the .tsv file is the maximum likelihood estimate for the expression of the transcripts. That process is completely deterministic.

The bootstrap outputs are stored in the .h5 file; they are generated by resampling the original data and finding the expression for this new dataset. This is why you will get slightly different est_counts in the bootstraps, and the amount of variation gives you an indication of how reliable the initial point estimate is.
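For example, you can pull the point estimate and the bootstrap draws out of abundance.h5 and look at the spread directly. A minimal sketch in R, assuming the rhdf5 package and kallisto's usual HDF5 layout (/est_counts for the point estimate, /aux/ids and /aux/num_bootstrap for metadata, and /bootstrap/bs0, bs1, ... for the resampled estimates):

    library(rhdf5)

    h5 <- "abundance.h5"                          # path to one sample's kallisto output
    ids       <- h5read(h5, "aux/ids")            # transcript IDs
    point_est <- h5read(h5, "est_counts")         # same values as abundance.tsv
    n_bs      <- h5read(h5, "aux/num_bootstrap")  # number of bootstrap draws

    # One column per bootstrap: est_counts re-estimated from a resampled dataset
    bs <- sapply(seq_len(n_bs) - 1, function(i) h5read(h5, paste0("bootstrap/bs", i)))

    # Per-transcript spread across bootstraps indicates how reliable the point estimate is
    bs_sd <- apply(bs, 1, sd)
    head(data.frame(target_id = ids, est_counts = point_est, bootstrap_sd = bs_sd))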


Sorry for re-upping this post. Can I ask whether, in newer versions of Kallisto, inferential replicates are also taken into account when using the .tsv files? When I import the tsv files with tximport, it says:

Note: importing `abundance.h5` is typically faster than `abundance.tsv`
reading in files with read.delim (install 'readr' package for speed up)
1 2 3 4 5 6 
transcripts missing from tx2gene: 22020
summarizing abundance
summarizing counts
summarizing length
summarizing inferential replicates

So my question is: why is it summarising inferential replicates at this stage? When using tsv files, are these inferential replicates taken into account in the downstream analysis (I am using DESeq2)?
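For reference, a minimal sketch of the usual kallisto → tximport → DESeq2 path (the sample sheet, file paths, and tx2gene mapping here are placeholders). As far as I understand, DESeq2's standard model uses only the counts and average transcript lengths from tximport; the inferential replicates it reads in are used by other tools (e.g. swish in the fishpond package, or sleuth), not by DESeq() itself:

    library(tximport)
    library(DESeq2)

    # Placeholder sample sheet and transcript-to-gene mapping
    samples <- data.frame(sample    = c("s1", "s2", "s3", "s4"),
                          condition = c("A", "A", "B", "B"))
    tx2gene <- data.frame(TXNAME = c("ENST0001", "ENST0002"),
                          GENEID = c("GENE1", "GENE1"))

    files <- file.path("kallisto", samples$sample, "abundance.h5")
    names(files) <- samples$sample

    txi <- tximport(files, type = "kallisto", tx2gene = tx2gene)
    dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)
    dds <- DESeq(dds)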

8.7 years ago
h.mon 35k

The quantification step, which estimates the counts, is probably deterministic and so identical between runs.

The bootstrap has no influence whatsoever on the quantification step. The bootstrapping procedure provides a measure of the accuracy of the quantification by random resampling with replacement. If you use the same seed and the same number of bootstraps on the same dataset, the bootstrap estimates will be identical as well.
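A toy illustration of that last point in R (this is just base R resampling with replacement; kallisto does the equivalent internally, with the seed controlled by its --seed option):

    # Same seed + same number of draws -> identical bootstrap resamples
    set.seed(42)
    draw1 <- sample(1:1000, size = 1000, replace = TRUE)

    set.seed(42)
    draw2 <- sample(1:1000, size = 1000, replace = TRUE)

    identical(draw1, draw2)  # TRUE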


If the abundance estimates don't change with more bootstraps, then they only provide an estimate of technical variance? I would have expected them to improve (converge) as more bootstraps are done. In other words, if the downstream algorithm isn't taking advantage of the technical variance, then there is no reason to do bootstraps, right?


The bootstrap is only for estimating the technical variance; it will not improve the original point estimate. The sleuth program, http://pachterlab.github.io/sleuth/, uses the bootstraps to improve differential expression detection.
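A minimal sleuth sketch, in case it helps (sample table, paths, and design are placeholders; see the sleuth walkthroughs for the full workflow):

    library(sleuth)

    # Sample-to-covariates table; 'path' points at each kallisto output directory
    s2c <- data.frame(sample    = c("s1", "s2", "s3", "s4"),
                      condition = c("A", "A", "B", "B"),
                      path      = file.path("kallisto", c("s1", "s2", "s3", "s4")),
                      stringsAsFactors = FALSE)

    so <- sleuth_prep(s2c, ~condition)          # reads the bootstraps from abundance.h5
    so <- sleuth_fit(so, ~condition, "full")    # full model
    so <- sleuth_fit(so, ~1, "reduced")         # reduced (intercept-only) model
    so <- sleuth_lrt(so, "reduced", "full")     # likelihood ratio test
    results <- sleuth_results(so, test = "reduced:full", test_type = "lrt")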
