Hi,
I was wondering if there is any data set that I can use as ground truth to compare which tool behaves better at counting the number of reads within a BAM file for instance. So far, I've ran bcftools mpileup which is supposedly only a wrapper for samtools mpileup (which I also ran) and a third tool called bam-readcount (https://github.com/genome/bam-readcount).
I ran all three tools on the same data set making sure that the quality thresholds/filtering parameters are the same for all three tools. However, when I look at the results and compare the read counts for the same chromosome position (i.e., unique ID based on chr_pos_base) I see that the counts reported highly vary.
I've attached some boxplots here which show the absolute difference in read counts per reported position for each pair of tools tested. BRC is bam-readcount, BCF is bcftools mpileup, and SAM is samtools mpileup. Performed the analysis on different tissues from the same sample (X-axis). What is intriguing from those plots, is that samtools mpileup vs. bcftools mpileup are not agreeing as much as expected. Which is weird, because the bcftools implementation of mpileup should be just a wrapper according to this. For instance, the variance in the absolute difference in counts between samtools mpileup vs. bam-readcount is a lot smaller.
Any pointers how I can assess which tool is closest to the "truth" are appreciated.
Are you accounting for secondary alignments of reads?
You can load the BAM file read directly in IGV and count them yourself - which in lower coverage areas is very feasible.
Thanks. The problem does not actually rely in the areas with low coverage from what I've seen so far in IGV
@GenoMax To be honest, I am not sure if that is in the default parameter settings of
mpileup. This are the commands I run forbcftools mpileupandsamtools mpileup:So nothing fancy there :|
You use option -B in samtools mpileup, but not in bcftools. Could explain some of the differences.
Good point. I'll run a quick test on only one chromosome and see if/how much that changes the results
A big difference between mpileup and bam-readcount is the default maximum depth reported (250 vs 10000000). There are other differences in the default parameters, with samtools filtering out more reads than bam-readcount in general.
Yes, that I agree. But I would expect
samtools mpileupandbcftools mpileupto produce very similar results with the same parameter settings. However, as seen in the plots, actuallybam-readcountandsamtoolsare the closest in read count reportingSorry for a late answer, but I've reached my 5 posts limit and had to wait 6 hours on a Friday evening :D
As Carlo Yague mentioned, I was running
bcftoolswithout the-Bflag set, as opposed to thesamtoolscommand. Setting that flag inbcftoolsresulted in having identical results forsamtools mpileupandbcftools mpileup.However, there are still differences between
bam-readcountandsamtools mpileup, even if I match the maximum depth:So I still do not know which tool I should trust. Any further suggestions? Thank you
You can trust both, the difference comes from the fact that samtools/bcftools' defaults do some more read filtering than bam-readcount : discard low quality reads, pairs not properly paired, etc...
bam-readcount is closer to the "naïve" read coverage because by default, it does not filter anything.
That is true. I am now performing some other tests where I am setting the map/base quality to 30 and tweak the read depth. Thank you for your input and for spotting that flag there :]
So I guess the answer is: up to date, there is no ground truth data set for BAM read counts.