ground truth dataset for BAM read counts
0
1
Entering edit mode
2.1 years ago
Marius ▴ 30

Hi,

I was wondering if there is any data set that I can use as ground truth to compare which tool behaves better at counting the number of reads within a BAM file for instance. So far, I've ran bcftools mpileup which is supposedly only a wrapper for samtools mpileup (which I also ran) and a third tool called bam-readcount (https://github.com/genome/bam-readcount).

I ran all three tools on the same data set making sure that the quality thresholds/filtering parameters are the same for all three tools. However, when I look at the results and compare the read counts for the same chromosome position (i.e., unique ID based on chr_pos_base) I see that the counts reported highly vary.

I've attached some boxplots here which show the absolute difference in read counts per reported position for each pair of tools tested. BRC is bam-readcount, BCF is bcftools mpileup, and SAM is samtools mpileup. Performed the analysis on different tissues from the same sample (X-axis). What is intriguing from those plots, is that samtools mpileup vs. bcftools mpileup are not agreeing as much as expected. Which is weird, because the bcftools implementation of mpileup should be just a wrapper according to this. For instance, the variance in the absolute difference in counts between samtools mpileup vs. bam-readcount is a lot smaller.

Any pointers how I can assess which tool is closest to the "truth" are appreciated.

genome alignment read counts ground truth • 800 views
0
Entering edit mode

Are you accounting for secondary alignments of reads?

0
Entering edit mode

You can load the BAM file read directly in IGV and count them yourself - which in lower coverage areas is very feasible.

0
Entering edit mode

Thanks. The problem does not actually rely in the areas with low coverage from what I've seen so far in IGV

0
Entering edit mode

@GenoMax To be honest, I am not sure if that is in the default parameter settings of mpileup. This are the commands I run for bcftools mpileup and samtools mpileup:

samtools mpileup -q 0 -Q 0 -B -l /data/regions.bed -f /genome/reference.fna -o /data/samtools_output.tsv/data/input.bam



So nothing fancy there :|

0
Entering edit mode

You use option -B in samtools mpileup, but not in bcftools. Could explain some of the differences.

0
Entering edit mode

Good point. I'll run a quick test on only one chromosome and see if/how much that changes the results

0
Entering edit mode

A big difference between mpileup and bam-readcount is the default maximum depth reported (250 vs 10000000). There are other differences in the default parameters, with samtools filtering out more reads than bam-readcount in general.

0
Entering edit mode

Yes, that I agree. But I would expect samtools mpileup and bcftools mpileup to produce very similar results with the same parameter settings. However, as seen in the plots, actually bam-readcount and samtools are the closest in read count reporting

0
Entering edit mode

Sorry for a late answer, but I've reached my 5 posts limit and had to wait 6 hours on a Friday evening :D

As Carlo Yague mentioned, I was running bcftools without the -B flag set, as opposed to the samtools command. Setting that flag in bcftools resulted in having identical results for samtools mpileup and bcftools mpileup.

However, there are still differences between bam-readcount and samtools mpileup, even if I match the maximum depth:

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00    0.00    0.00   41.16    2.00 7749.00


So I still do not know which tool I should trust. Any further suggestions? Thank you

0
Entering edit mode

You can trust both, the difference comes from the fact that samtools/bcftools' defaults do some more read filtering than bam-readcount : discard low quality reads, pairs not properly paired, etc...

bam-readcount is closer to the "naïve" read coverage because by default, it does not filter anything.

0
Entering edit mode

That is true. I am now performing some other tests where I am setting the map/base quality to 30 and tweak the read depth. Thank you for your input and for spotting that flag there :]

So I guess the answer is: up to date, there is no ground truth data set for BAM read counts.