Question

Doesn't Base Quality Score Recalibration degrade sensitivity in heavily mutated cancers?

12

9.8 years ago

Cyriac Kandoth 6.0k

From the GATK docs - [BQSR] assumes that all reference mismatches are errors and indicative of poor base quality - which is why we have to give it a list of dbSNP sites to skip over. But what about somatic SNPs? Wouldn't data from hypermutated uterine, colorectal, melanoma, or lung tumors be recalibrated to a lower quality than data from AML or breast cancer? Variant caller sensitivity would then drop accordingly. Or is this not a big deal in practice?

As a test - I will try to do some high-confidence SNP calling on uncalibrated uterine cancer BAMs, append those calls to the dbSNP VCF used for BQSR, and redo variant calling. Then I'll compare these calls to the calls from the standard BQSR BAM, recalibrated using only dbSNP.
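The final comparison step of that test could be sketched roughly as below. This is only a minimal illustration, not anyone's actual pipeline: the helper names (`load_calls`, `compare_calls`) and the toy VCF records are made up, and it only looks at the first five VCF columns.

```python
def load_calls(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from the data lines of a VCF."""
    calls = set()
    for line in vcf_lines:
        # Skip blank lines and header lines (## metadata and #CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Split multi-allelic records so each ALT allele is compared separately.
        for allele in alt.split(","):
            calls.add((chrom, int(pos), ref, allele))
    return calls


def compare_calls(standard, bootstrapped):
    """Count calls shared by both runs and calls unique to each run."""
    return {
        "shared": len(standard & bootstrapped),
        "only_standard": len(standard - bootstrapped),
        "only_bootstrapped": len(bootstrapped - standard),
    }


# Hypothetical toy records; real input would be read from the two VCF files.
header = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT"]
standard_vcf = header + ["chr1\t100\t.\tC\tT", "chr2\t200\t.\tG\tA"]
bootstrapped_vcf = standard_vcf + ["chr3\t300\t.\tC\tT,A"]

standard_calls = load_calls(standard_vcf)
bootstrapped_calls = load_calls(bootstrapped_vcf)
summary = compare_calls(standard_calls, bootstrapped_calls)
print(summary)
```

If the concern in the question holds, the `only_bootstrapped` count should be noticeably above zero on real tumor data. If it stays near zero, recalibrating against dbSNP alone is probably not costing much sensitivity.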

sequencing SNP • 6.9k views

updated 2.3 years ago by Ram 43k • written 9.8 years ago by Cyriac Kandoth 6.0k

2

Looking forward to the results of your test. I have observed a similar problem with regard to finding novel variants. Base Quality Score Recalibration will probably decrease the chances of a variant caller detecting novel SNPs.

updated 2.5 years ago by Ram 43k • written 9.8 years ago by kautilya ▴ 430

Ram · Answer 1 · 2014-08-27

7

9.7 years ago

lh3 33k

No, BQSR won't affect sensitivity. The list of dbSNP sites is used to generate the recalibration table. A more complete list of variants helps to yield higher calibrated quality. It is not the case that a variant seen in dbSNP will get a higher quality than a novel variant.

That said, Illumina raw quality is pretty good these days. For high-coverage samples, BQSR is frequently not necessary IMO.

written 9.7 years ago by lh3 33k

0

BQSR is not needed since Illumina base qualities are quite good nowadays. But it is still a standard part of many reference alignment pipelines, because it's recommended by GATK's Best Practices for DNA-seq. And it inevitably gets used in cancer genomics pipelines, where there are real variants at various allele fractions. Since these variants are not in dbSNP, they will be counted as sequencing artifacts when the recalibration table is generated. The table then "corrects for" these real variants, reducing sensitivity.

I still haven't done my test, so I can't confirm this assumption yet. But does it make sense?

written 9.6 years ago by Cyriac Kandoth 6.0k

0

No, no. The recalibration table WON'T "correct for these real variants".

written 9.7 years ago by lh3 33k

1

OK. Then I might have misunderstood the purpose of BQSR. Can you take a look at their docs here, or here's a shortened excerpt:

[BQSR] tabulates and bins data about features of the bases (read group, dinucleotide context, etc.). It counts the number of bases within each bin and how often such bases mismatch the reference base, excluding loci known to vary in the population (dbSNP). The new recalibrated quality scores are based on the sum of the global difference between reported quality scores and the empirical quality.

In cancers, there will be a lot of real variants in very distinctive dinucleotide contexts, for example from UV exposure and cigarette smoke. Or when specific DNA-repair genes (like POLE, MLH1, etc.) are disabled, you get more variants of the type that they were responsible for repairing. All these are real variants in reads that BQSR will downgrade, because the extra mismatches against the reference sequence lower the empirical quality.
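To put rough numbers on this, here is a minimal sketch of how unmasked real variants would shift the empirical quality of a single bin. It assumes a simple pseudocount estimate, Q = -10 * log10((mismatches + 1) / (observations + 2)), which is a simplification and not necessarily GATK's exact estimator. Every number is hypothetical: a 30 Mb exome at 100x treated as one bin, a true error rate of 1e-3, and 3,000 somatic SNVs at a 0.3 allele fraction.

```python
import math


def phred(mismatches, observations):
    """Empirical Phred quality with a +1/+2 pseudocount (simplified estimator)."""
    return -10 * math.log10((mismatches + 1) / (observations + 2))


def empirical_quality(n_bases, true_error_rate, unmasked_variant_bases=0):
    """Quality a bin would be assigned when real variant bases count as errors."""
    sequencing_errors = round(n_bases * true_error_rate)
    return phred(sequencing_errors + unmasked_variant_bases, n_bases)


# Hypothetical inputs, chosen only for illustration.
n_bases = 30_000_000 * 100                  # 30 Mb target at 100x depth
somatic_sites = 3_000                       # about 100 mutations per Mb
variant_bases = round(somatic_sites * 100 * 0.3)  # alt-supporting bases

q_masked = empirical_quality(n_bases, 1e-3)
q_unmasked = empirical_quality(n_bases, 1e-3, variant_bases)
delta = q_masked - q_unmasked
print(f"masked: {q_masked:.2f}  unmasked: {q_unmasked:.2f}  drop: {delta:.2f}")
```

With these made-up inputs the drop is about 0.13 Phred units. The same function shows where the concern would bite harder: a bin with a much lower true error rate, or a small bin for one dinucleotide context that collects most of the mutations, would see a larger shift.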

updated 2.3 years ago by Ram 43k • written 9.7 years ago by Cyriac Kandoth 6.0k

score 4 · Answer 2 · 2014-07-21

The official answer from GATK devs is down here. In short, this was "...not investigated, but it sounds like a use case where the BQSR would benefit from generating a bootstrap set of variants". They go on to give a longer explanation of the test I mentioned in the question, and add that "This should compensate for the risk of counting real mutations as errors in hypermutated cancer tissue. But please understand that it's a theoretical solution that we haven't tested out ourselves, so we can't guarantee results."

Update (Jan 24, 2019): GATK devs provided this comment, which acknowledges the issue but also makes a good case that the degradation in sensitivity should be negligible.