Question: BQSR: when is it applicable?
lamteva.vera (140), Ukraine, Kyiv, wrote 16 months ago:

BQSR is recommended by, at least, GATK's Best Practices and "Good laboratory practice for clinical next-generation sequencing informatics pipelines". Heng Li claims that

...BQSR and indel realignment...may make difference on low-coverage data or when the base quality is not well calibrated.

Currently I don't use BQSR, because, frankly, I don't quite understand the point of using it. But maybe I should give it a try? How do I know if the base quality is not well calibrated?

base quality bqsr gatk • 1.8k views
modified 15 months ago by finswimmer (9.9k) • written 16 months ago by lamteva.vera (140)
finswimmer (9.9k) wrote 15 months ago:

Whether to use BQSR or not may depend on the variant caller. The authors of freebayes, for example, state:

The need for base quality recalibration is avoided through the direct detection of haplotypes. Sequencing platform errors tend to cluster (e.g. at the ends of reads), and generate unique, non-repeating haplotypes at a given locus.

fin swimmer

written 15 months ago by finswimmer (9.9k)

Thanks for your comment, fin swimmer. I know that feature of Freebayes and maybe I'll try it as well. Now I'm testing Haplotype Caller.

written 15 months ago by lamteva.vera (140)
h.mon (23k) wrote 16 months ago:

Base quality score recalibration will correct your base scores, in turn leading to more accurate SNP calling. See these two posts for more detailed discussions:

Your statement reminds me of my grandfather, who always said he didn't see the point of seat-belts.

written 16 months ago by h.mon (23k)

I second BQSR; still, I think it's worth asking whether BQSR is still necessary or even detrimental. As sequencing technology improves, you may over-correct and penalize true variation. At the extreme, your reads have no errors at all and all the mismatches with the reference are genuine. In this case BQSR would be deleterious.

written 15 months ago by dariober (9.9k)

Heng Li stated that the effect of BQSR is minimal, whether SAMtools or GATK is used for variant calling. As of now, these steps are pretty time-consuming. According to the pre-release information on GATK4, the processing time has been improved notably, so we will see.

written 15 months ago by ATpoint (13k)

Hi- It would be good if Heng Li would chime in here... I think Heng makes that statement with reference to just the two datasets used in the paper. I don't think he meant to generalize more broadly. On the other hand, in this recent paper, Callari et al. Intersect-then-combine approach..., the authors found that BQSR made a large improvement in their use case.

I guess Illumina must have calibrated their base quality algorithm on "normal" libraries, meaning whole human genome, using good starting material and following their library and sequencing protocols. So if your libraries and instruments are also "normal" it shouldn't be necessary to do BQSR. However, things sometimes deviate and some recalibration may be useful.

modified 15 months ago • written 15 months ago by dariober (9.9k)

I think the probability of that would be tiny. You'd essentially need more true variants in the genome of the individual than errors introduced by the sequencing technology. But i can see where you're coming from - take for example this quote from the NIH about what SNPs are:

SNPs occur normally throughout a person’s DNA. They occur once in every 300 nucleotides on average, which means there are roughly 10 million SNPs in the human genome.

This would imply once sequencing machines get better than 1 error in 300 sequenced bases, your fears might be correct, but there's two confounding issues.

The first is that the NIH's numbers are generated by taking 3 billion bases in a genome and dividing it by all 10 million SNPs in dbSNP (there's 13million now). This is a really stupid way of measuring SNPs, because not everyone has every single SNP, a lot of dbSNP entries are copies of each other represented in a different way, and even more are probably sequencing errors themselves. SNPs aren't randomly distributed either, so statistics like this are totally pointless regardless. In whole genome sequencing, i doubt you even get 100,000 variants. That puts the SNP rate per person at i guess 1 in 30000bp, which is several orders of magnitude lower than what Illumina can currently pull off in sequencing error.

The second is that, although to my knowledge BQSR does not do this, common true/annotated SNPs can be identified and excluded from BQSR quite easily. Certainly, if sequencing becomes better quality, and the noise of mis-sequenced bases drops, identifying SNPs will be easier and so BQSR should become more robust. It's an optimisation problem since you're calling SNPs to change the data to call more SNPs, but it should dramatically reduce false-negatives.

But i agree that as of right now that doesn't happen, so it's a moot point. My feeling on BQSR is that it's a technique that makes the data better, so why not use it. I'll never understand why some practitioners of Bioinformatics will on the one hand complain about the poor quality of public datasets, then in the same breath say BQSR and indel realignment are a waste of time. Certainly, there are situations where you're not publishing the data and you're not SNP calling, but still, even for ChIP-Seq signal extraction stuff, i don't see the downside. (note i'm not saying this is you dariober, i'm just venting :P)

modified 15 months ago • written 15 months ago by John (12k)

Hi John- Thanks for the input and for putting in some numbers.

10 million SNPs in dbSNP (there's 13million now).

Wait, dbSNP currently has 324 million SNPs which makes a few million SNPs per individual reasonable.

That puts the SNP rate per person at i guess 1 in 30000bp, which is several orders of magnitude lower than what Illumina can currently pull off in sequencing error.

1/30000 makes a phred score of ~45. Illumina reports phred up to 41 and it's not unusual to see runs where most of the bases have the maximum score, effectively being off-scale. Of course, those may be over-optimistic runs that need recalibration. Most importantly, once you feed BQSR with dbSNP I think it's safe to assume that the variation left is mostly errors, even in cancer samples. Still, sequencing error rate is approaching real variation. So if you work with a hypervariable organism, say HIV I guess, or you don't have a dbSNP-equivalent, BQSR should be used with care.
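The conversion between error rates and phred scores used above can be checked with a few lines of Python. This is a minimal sketch assuming the standard Phred definition Q = -10 * log10(p); the function names are made up for illustration and do not come from any tool mentioned in this thread.

```python
import math

def phred_from_error_rate(p):
    """Phred quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def error_rate_from_phred(q):
    """Error probability implied by a Phred quality q."""
    return 10 ** (-q / 10)

# Rates quoted in this thread
for label, p in [("1 in 1000", 1 / 1000), ("1 in 30000", 1 / 30000)]:
    print(f"{label}: Q{phred_from_error_rate(p):.1f}")

# The maximum score Illumina reports, according to the comment above
bases_per_error = 1 / error_rate_from_phred(41)
print(f"Q41 implies about 1 error in {bases_per_error:,.0f} bases")
```

By this definition 1 in 1000 is Q30 and 1 in 30000 is roughly Q44.8, so a reported Q41 corresponds to about one error in 12,600 bases.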

In the case of ChIP-Seq or RNA-Seq or similar, BQSR is not necessary because the analysis of these data doesn't use base qualities (as far as I know at least...). Indel realignment could make sense in principle, but ChIP-Seq peak resolution is so low that I don't think the computational burden of realigning (and debugging, and dependencies) pays off. Maybe for ChIP-Exo, I don't know...

written 15 months ago by dariober (9.9k)

Eek, you're right, i pulled the 13m number from the NCBI's FAQ, but it hasn't been updated since 2008 -_-;

Regarding PHRED scores, you're right, this is on the assumption Illumina is producing probabilities of error that match reality. I've always known their process as having a 1:1000 error rate, or at least this is what i was taught at Uni, so it may no longer be true. This comparison of error rates from 2014 says 1:1000 too, but again, perhaps things have changed in recent years.

So if you work with a hypervariable organism, say HIV I guess, or you don't have a dbSNP-equivalent, BQSR should be used with care.

Very good point. And perhaps one could also say the same for poorly studied genomes, genomes from wildly outbred strains, etc.

Your last point about not performing BQSR and potentially Indel Realignment for ChIP-Seq is appreciated and is certainly the majority opinion. I'm just worried about where that kind of thinking leads, if we aren't putting quality and accuracy above all else. Certainly tradeoffs need to be made, but frankly, when you put 6 months into procuring a sample, it seems odd not to put in the 6 hours to BQSR and indel realign - just in case. I mean, even if you don't SNP call, someone else might want to. It's a philosophical difference perhaps. Ideals vs pragmatism.

modified 15 months ago • written 15 months ago by John (12k)

Dear John, could you (or someone else reading this thread) please explain in more detail: how is the sequencer's error rate linked to the SNP rate based on dbSNP, and how are these numbers used to predict the applicability of BQSR?

written 15 months ago by lamteva.vera (140)

Sure, um, well it's to do with how BQSR works. Let's start at the beginning. Base quality scores are the sequencing machine's guess at how likely it is to be wrong when base calling. It is important to realise that it's not how likely the machine is to be wrong, but a guess at how likely the machine is to be wrong, since sequencing machines do not check how often they are wrong/right and recalibrate themselves. It's a second-derivative of wrong-ness so to speak. These machines just guess how wrong they are, and write out the FASTQ data for someone else to deal with. Please note, Illumina is a 30 billion dollar company, which is more than the GDP of Malta, Macedonia and Mali combined, and ranks in the top 25 most innovative companies in the world by Forbes, so it's probably totally legit that it does this and one should not question why publicly funded bioinformaticians are the ones who end up calibrating the data on Illumina's behalf. #sarcasm

To check how right/wrong the machines actually are, BQSR will use a copy of the reference genome of the organism being sequenced and consider any deviation from the reference sequence as a sequencing error. This way it can model not only how often the machine is right/wrong in general, but also in what specific instances. For example, perhaps the machine is over-confident of its abilities when calling the 4th, 11th and 22nd base, but under-confident when calling the 9th, 14th, and 30th. Alternatively, when there are 5 Ts in a row, the machine is always under-confident when calling the next base - whatever that base may be. These are the sorts of patterns BQSR is looking for when assessing how good Illumina machines are at guessing their own inaccuracy, and once it's done figuring out the patterns, it adjusts the quality scores to reflect this. This leads to more accurate quality scores, which leads to more accurate data (particularly for SNP calling).
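To make the idea concrete, here is a toy sketch of that bookkeeping. It is not GATK's implementation: real BQSR uses more covariates (such as sequence context) and a different statistical estimate. The function name, the tuple layout and the add-one smoothing are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def empirical_quality_table(observations):
    """Estimate an empirical Phred score per (cycle, reported quality) bin.

    observations: iterable of (cycle, reported_q, is_mismatch) tuples, where
    is_mismatch is True when the read base differs from the reference.
    Every mismatch is treated as a sequencing error, as described above.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total]
    for cycle, reported_q, is_mismatch in observations:
        bin_counts = counts[(cycle, reported_q)]
        bin_counts[1] += 1
        if is_mismatch:
            bin_counts[0] += 1
    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing (an assumption) keeps the rate above zero
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = -10 * math.log10(error_rate)
    return table

# Made-up data: the machine reports Q30 at both cycles
obs = [(4, 30, False)] * 89 + [(4, 30, True)] * 9   # many mismatches at cycle 4
obs += [(9, 30, False)] * 9998                       # no mismatches at cycle 9
for (cycle, reported_q), q in sorted(empirical_quality_table(obs).items()):
    print(f"cycle {cycle}: reported Q{reported_q}, empirical Q{q:.0f}")
```

With these made-up counts the machine turns out to be over-confident at cycle 4 (empirical Q10 against a reported Q30) and under-confident at cycle 9 (empirical Q40). A recalibrator would then replace the reported scores in each bin with something closer to the empirical ones.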

But obviously, if one is not taking into consideration legitimate differences between the individual being sequenced and the reference genome, i.e. SNPs, one is going to assume the machine is wrong when it is in fact right. So dariober's very legitimate point is that as sequencers become more accurate, these true variants mistaken for sequencing errors will dominate BQSR's model of how well the sequencer is performing, and actually make things worse, not better. My point was it should be possible to give BQSR both a reference genome and a list of known SNPs, but as he astutely points out this only works for organisms where all the common SNPs are known.
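A back-of-the-envelope sketch of that effect, under loud assumptions: the observed mismatch rate is approximated as the true error rate plus the rate of true variants that were not masked by a known-sites list, and the variant rate of 1 in 1000 bp is only an illustrative figure (roughly "a few million" variants in a 3 billion base genome). The function name is hypothetical.

```python
import math

def apparent_quality(true_error_rate, variant_rate, fraction_masked=0.0):
    """Phred score a mismatch-counting recalibrator would infer.

    First-order approximation (an assumption): a base mismatches the
    reference if it is a sequencing error or an unmasked true variant.
    """
    unmasked_variants = variant_rate * (1.0 - fraction_masked)
    mismatch_rate = true_error_rate + unmasked_variants
    return -10 * math.log10(mismatch_rate)

variant_rate = 1e-3  # illustrative assumption, not a measured value
for true_q in (30, 40, 60):
    e = 10 ** (-true_q / 10)
    no_mask = apparent_quality(e, variant_rate)
    masked = apparent_quality(e, variant_rate, fraction_masked=0.99)
    print(f"true Q{true_q}: looks like Q{no_mask:.1f} unmasked, "
          f"Q{masked:.1f} with 99% of variants masked")
```

Under these assumptions a sequencer with a true quality of Q60 cannot look better than about Q30 if no variants are masked, because the unmasked variants dominate the mismatch count. Masking most of them with a known-sites list moves the apparent quality back towards the true one, which is why the approach depends on having a good catalogue of variants for the organism.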

A more in-depth description of BQSR can be found here:

modified 15 months ago • written 15 months ago by John (12k)

Thanks for taking time to explain!

  1. Based on your explanation and the post: Is it an oversimplification and misunderstanding that BQSR automatically assigns poor quality to the (presumably true) variants that are absent in dbSNP (thus potentially excluding them from further consideration)?

  2. What are the requirements for true SNVs to survive BQSR? Does it have to do with covariate analysis?

  3. As I understand it, right now the sequencing technology is imperfect, so recalibrating base quality scores is reasonable and works as designed. But when the technology becomes less error-prone, the sequencing error rate could approach the real variation rate (i.e. most of the observed mismatches are real SNVs). In that case the observed variants (including those absent from dbSNP) are less likely to be erroneous, but could be flagged as such. Right?

  4. Have you read this paper? What do you (and others) think?

written 15 months ago by lamteva.vera (140)

  1. is difficult to answer, but essentially yes. It's not quite so straightforward, because the tool builds its rules from the entire dataset, and none of them are based on genomic mapping location. It will not see a mismatch on chr1:5000 100 times and say "the quality scores of chr1:5000 need to be reduced", rather, it looks at things like sequencer cycle, flowcell tile, and base sequence to a degree (not so unique that the base sequence singles out the SNP only). So even one true SNP, if sequenced by many overlapping reads at different positions in the read, will look like noise to BQSR, not something it can model. Of course, this changes for data like ChIP or RNA-Seq where, due to targeted sequencing, this assumption doesn't always hold true - but it will still be a small blip of noise in a sea of signal. Basically, BQSR is normalising out systematic errors produced by the sequencer across all the reads sequenced. So the effect on true SNVs should be small.

  2. This should be answered by the above.

  3. bingo :)

  4. I haven't but i'll read it today and get back to you either today or tomorrow.

modified 15 months ago • written 15 months ago by John (12k)

As the sequencing technology improves you may over-correct and penalize true variations.

dariober, that's exactly what I meant by "I don't quite understand the point of using it". I wouldn't bother if I were not afraid of side effects such as losing true variants.

modified 15 months ago • written 15 months ago by lamteva.vera (140)

Thanks, h.mon. My priority is the accuracy of data analysis, so I'd prefer to "fasten my seat belt". What I'm trying to figure out is when BQSR is applicable and sensible, and when it is not. See, for example, this paper:

...For these three callers [Platypus, GATK UnifiedGenotyper and HaplotypeCaller] in SNP detection, the role of BQSR is actually adverse rather than beneficial. In the low divergence regions, when the coverage is not sufficient, SAMtools and FreeBayes showed decrease in sensitivity but increase in precision rate by BQSR. In other cases, the loss of sensitivity was not associated with an increase in precision rate, which argues against the application of BQSR in those instances.

What do you think? Thanks for pointing me to the interesting post, it's always nice to gain deeper understanding!

modified 15 months ago • written 15 months ago by lamteva.vera (140)