I was wondering if there is any data set that I can use as ground truth to compare which tool behaves better at counting the number of reads within a BAM file for instance. So far, I've ran
bcftools mpileup which is supposedly only a wrapper for
samtools mpileup (which I also ran) and a third tool called
I ran all three tools on the same data set making sure that the quality thresholds/filtering parameters are the same for all three tools. However, when I look at the results and compare the read counts for the same chromosome position (i.e., unique ID based on
chr_pos_base) I see that the counts reported highly vary.
I've attached some boxplots here which show the absolute difference in read counts per reported position for each pair of tools tested.
bcftools mpileup, and
samtools mpileup. Performed the analysis on different tissues from the same sample (X-axis). What is intriguing from those plots, is that
samtools mpileup vs.
bcftools mpileup are not agreeing as much as expected. Which is weird, because the
bcftools implementation of
mpileup should be just a wrapper according to this. For instance, the variance in the absolute difference in counts between
samtools mpileup vs.
bam-readcount is a lot smaller.
Any pointers how I can assess which tool is closest to the "truth" are appreciated.
Are you accounting for secondary alignments of reads?
You can load the BAM file read directly in IGV and count them yourself - which in lower coverage areas is very feasible.
Thanks. The problem does not actually rely in the areas with low coverage from what I've seen so far in IGV
@GenoMax To be honest, I am not sure if that is in the default parameter settings of
mpileup. This are the commands I run for
So nothing fancy there :|
You use option -B in samtools mpileup, but not in bcftools. Could explain some of the differences.
Good point. I'll run a quick test on only one chromosome and see if/how much that changes the results
A big difference between mpileup and bam-readcount is the default maximum depth reported (250 vs 10000000). There are other differences in the default parameters, with samtools filtering out more reads than bam-readcount in general.
Yes, that I agree. But I would expect
bcftools mpileupto produce very similar results with the same parameter settings. However, as seen in the plots, actually
samtoolsare the closest in read count reporting
Sorry for a late answer, but I've reached my 5 posts limit and had to wait 6 hours on a Friday evening :D
As Carlo Yague mentioned, I was running
-Bflag set, as opposed to the
samtoolscommand. Setting that flag in
bcftoolsresulted in having identical results for
However, there are still differences between
samtools mpileup, even if I match the maximum depth:
So I still do not know which tool I should trust. Any further suggestions? Thank you
You can trust both, the difference comes from the fact that samtools/bcftools' defaults do some more read filtering than bam-readcount : discard low quality reads, pairs not properly paired, etc...
bam-readcount is closer to the "naïve" read coverage because by default, it does not filter anything.
That is true. I am now performing some other tests where I am setting the map/base quality to 30 and tweak the read depth. Thank you for your input and for spotting that flag there :]
So I guess the answer is: up to date, there is no ground truth data set for BAM read counts.