samtools/bcftools mpileup -B option: to use it or not in order to reduce the occurence of false SNPs discovery?
2
1
Entering edit mode
13 months ago
cg1440 ▴ 30

UPDATE: I'll leave this post up since I got a really thorough response from the tool's developer himself. This response might be of a great value later on for someone else. Thank you jkbonfield!

Hey fellow bioinformaticians!

At this point I'm really confused regarding the -B option with the mpileup function.

First to clarify what this option does, it "disables base alignment quality (BAQ) computation"

Now, in the documentation page of samtools mpileup:

BAQ is the Phred-scaled probability of a read base being misaligned. It greatly helps to reduce false SNPs caused by misalignments. BAQ is calculated using the probabilistic realignment method described in the paper “Improving SNP discovery by base alignment quality”

BUT, in the documentation page of bcftools mpileup, they say the exact opposite regarding BAQ:

-B, --no-BAQ Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.

So at this point I'm really confused. Samtools documentation says that computing BAQ helps reduce false SNPs discovery, while the bcftools documentation says that disabling BAQ improves SNPs discovery.

Also, I've tried bcftools mpileup on the same set of data, once without -B and once with it. I got significantly different results: without this option, I got some INDELs in some samples, while with the -B argument, I got no INDELs at all in any of my 42 COVID-19 samples.

Am I missing something? Did I misunderstand the developers' wording?

EDIT: in the ivar manual, the use of -B with samtools mpileup is recommended, but for a different reason:

Please use the -B options with samtools mpileup to call variants and generate consensus. When a reference sequence is supplied, the quality of the reference base is reduced to 0 (ASCII: !) in the mpileup output. Disabling BAQ with -B seems to fix this. This was tested in samtools 1.7 and 1.8

I don't know what to understand from all of this.

covid-19 mpileup bcftools snp samtools • 1.5k views
4
Entering edit mode
13 months ago
jkbonfield ▴ 760

That looks like an error in the bcftools documentation. Good spot.

Generally though it's far more complicated than this.

BAQ assesses the per-position accuracy of an alignment. If the data is complex then it's likely there is a nice 1:1 alignment through the matrix and the Base Alignment Quality is high. For low-complexity indels, eg copy number variations in STRs, there can be multiple places the bases could be added or removed with similar or even identical alignment scores. In these scenarios BAQ will give a low score as it's not sure the alignment produced by the read mapper will be the same for all reads, particularly when close to the end of reads.

It is clear that using BAQ does remove many false positives by reducing the base quality in places where reference bias is likely to creep in. However by its very nature it also removes some true positives too (or increases false negatives if you prefer).

Last year I started looking to speed up bcftools by removing most BAQ calls, as it's the biggest CPU hog. The theory is can we do rapid assessments to identify where there appears to be no problematic alignments in the pileup, meaning BAQ isn't necessary to get the correct answer. (The idea came from Crumble, which assesses multiple alignments to work out which regions need quality values retained and which don't.) Doing this cuts out a good 90% of our BAQ calls, but rather fortuitously this also happens to still remove most false positives while not having the same detrimental impact on false negatives. A rare win/win.

It then turned into a bit of a rabbit hole of improving calling in other ways, particularly indels. This work is mostly complete, barring some tidying up. Some examples of the impact are here: https://github.com/samtools/bcftools/pull/1363#issuecomment-802042033

Hopefully we can get this in the next bcftools release.

Edit: For Covid-19 sequencing, if using amplicon methods, you may also wish to apply this Htslib PR:

https://github.com/samtools/htslib/pull/1273

It makes the overlap removal method randomised between read 1 / read 2. This removes a signficant source of strand bias from amplicon sequencing, which when coupled with BAQ can give major issues to calling. The alternative is to use -x to disable overlap removal. Double counting is a problem on shallow data, but it doesn't really make much difference when we get to high depth.

1
Entering edit mode

Also for with-BAQ vs without-BAQ see this pic from the above github PR:

The blue bars at the top right is elapsed time, shoing how significant BAQ is to the running time of "bcftools mpileup".

This was an earlier version of the PR so the results have changed since then for "partial" mode, but basically it shows the general gist. Each dot there is a different quality filtering level in increments of 10 (GATK may have been bigger increments, I forget). So bottom right is all variants (QUAL>=0), somewhere closer to bottom left corner are around QUAL>=100 ish, and the top left are higher levels of QUAL. As expected increasing the QUAL filter reduces FP (higher precision) but increases FN (lower recall). There's a "sweet spot", but where it is depends on your goals.

It's clear that at 60x coverage BAQ has a significant hit to recall rates (higher Y) but also improves accuracy (lower X). However for most realistic QUAL filtering levels, you're better off disabling BAQ and just increasing your QUAL filter level a bit to compensate. Eg Q>=70 no-BAQ is lower FN and lower FP than Q>=30 full-BAQ. At very high filtering this isn't true, with the change over point around QUAL>=120. The partial BAQ mode (NB this has been revised since, so don't draw too much into that line) did better than either.

At lower depth, eg 15x, it's totally different and BAQ always wins out. This is fair enough as it's how BAQ was initially written. When Heng Li wrote it he was working on 1000 Genomes Project which was doing very shallow sequencing of lots of samples. Data has moved on since then, and the algorithms and parameters aren't well tuned.

As a final comment on tuning, I'd also recommend upping the -h option from 100 to maybe 500 for modern Illumina data sets, ie bcftools mpileup -h 500. This is a major benefit for indel calling. It's a homopolymer error bias, so indels in homopolymers get rejected if the homopolymer length is too long, reflecting the likelihood of sequencing artifact being higher than genuine variant. Again it's a product of the time. In 1000G era, that was true. On modern Illumina platforms it's rarely a major source of errors. While you're at it, also best to add -m 2 or -m 3 to filter the more rediculous indel false positives. Another product of the extreme low-depth origins of bcftools that doesn't apply so much now.

0
Entering edit mode
13 months ago

My personal anecdote: omitting -B caused some sanger verified SNPs to be missed. So I would say do include -B.