Base recalibration in normal vs. tumor somatic variant calling in WXS data?
5 weeks ago
rebeliscu ▴ 30

Hi there,

I have a tumor and a normal BAM file and am preparing to run base recalibration.

I was planning on calling variants on the normal and using that, in addition to dbSNP, as input for recalibration of tumor BAM(s), e.g.:

gatk BaseRecalibrator \
-I tumor.bam \
-R hg38.fasta \
--known-sites normal.vcf  \
--known-sites dbSNP_hg38.vcf  \
-O tumor_recal.table


Before producing the normal VCF, however, it's not clear to me whether I should run base recalibration on the normal BAM. If this is advised, I had planned on using dbSNP as the known sites (for the normal), e.g.:

gatk BaseRecalibrator \
-I normal.bam \
-R hg38.fasta \
--known-sites dbSNP_hg38.vcf  \
-O normal_recal.table


Alternatively, I could keep things simple and run base recalibration on both tumor and normal using dbSNP only.

Is one of these workflows preferable? Any clarity here would be much appreciated. Thanks!

WXS recalibration variant somatic • 561 views
5 weeks ago

The current "best practice" is to always run BQSR with the latest (and largest) dbSNP VCF on all samples, whether tumor or normal, FFPE or blood, etc. Per discussions in this post and this post, BQSR can benefit slightly if you provide a "bootstrap of known variants" unique to your samples, i.e. the somatic or germline variants found in your tumor or normal. However, doing so means you are effectively running your primary analysis pipeline twice (which is overkill), potentially amplifying false-positive variants (from your first-pass variant list), and potentially breaking compatibility of your BAMs with secondary analysis pipelines (e.g. downstream false-positive filters that use BQ).
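
As a rough sketch of the dbSNP-only route (same recipe for tumor and normal), something like the following should work. The file names are the placeholders from the question, the output names and the `run` dry-run wrapper are made up for illustration, and by default the script only prints the commands rather than executing them:

```shell
# Sketch only: identical BQSR recipe for every sample, dbSNP as the sole known-sites input.
# File names are placeholders; set DRY_RUN=0 to actually execute gatk.
set -eu

REF=hg38.fasta
DBSNP=dbSNP_hg38.vcf

run() {
  # Hypothetical helper: print the command in dry-run mode, otherwise execute it.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

bqsr() {
  sample="$1"
  # Step 1: build the recalibration table, skipping known dbSNP sites.
  run gatk BaseRecalibrator -I "${sample}.bam" -R "$REF" \
    --known-sites "$DBSNP" -O "${sample}_recal.table"
  # Step 2: write a new BAM with the recalibrated base qualities.
  run gatk ApplyBQSR -I "${sample}.bam" -R "$REF" \
    --bqsr-recal-file "${sample}_recal.table" -O "${sample}_recal.bam"
}

for s in normal tumor; do
  bqsr "$s"
done
```

Because neither sample depends on a VCF derived from the other, the two samples can be processed independently and in any order.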

There are also recent arguments against using BQSR at all, in favor of flagging false positives based on base quality drop-off (at the ends of reads, strand bias, etc.). You can find an old Perl script here that implements such BQD filters. I also just found this tweet from Geraldine of GATK saying they're thinking of dropping BQSR from the best practices, presumably because the high computational expense of BQSR is hard to justify now that the quality of DNA sequencing has improved so much. I would still recommend BQSR when re-analyzing old FASTQs, or when comparing FASTQs from a mix of different sequencers (e.g. HiSeq and NovaSeq).


Hi Cyriac, thanks so much for your response, this is very helpful. To be clear, doing BQSR with, for example, dbSNP would not really hurt the output of your analyses; the main drawback is that the computation is cumbersome, yes?


"Hurt" is relative. :) Setting aside the computational expense, doing BQSR gives you a decent balance between variant detection sensitivity and specificity. But if you care more about sensitivity than specificity, then BQSR will hurt your analysis. See more here.


You have all the answers! Thank you again.

5 weeks ago

Hi, if you read the GATK Best Practices forums and posts about BaseRecalibrator, you will find that the purpose of "calibrating" BAMs is to correct systematic errors in the base quality scores reported by the sequencer. In many nucleotide patterns like AAG, the third nucleotide after a repetition tends to get an overestimated PHRED score. If you don't recalibrate these scores, you can get variant calls that pass the hard filters at a position whose true PHRED score is actually low. So, no matter whether you are processing normals or tumors, you should perform this step on all your samples.
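
To put numbers on what an overestimated score means: the standard Phred relation is Q = -10 * log10(P_error), so P_error = 10^(-Q/10). Here is a small shell sketch of the conversion (the function name `phred_to_perr` is made up for this example):

```shell
# Convert a Phred quality score to the error probability it claims.
# P_error = 10^(-Q/10)
phred_to_perr() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

for q in 10 20 30 40; do
  printf 'Q%s -> %s\n' "$q" "$(phred_to_perr "$q")"
done
```

A base reported as Q30 claims a 1 in 1000 chance of being wrong. If its true quality is closer to Q20, the real error rate is ten times higher, which is the kind of gap recalibration is meant to close.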


I guess I intuited that the normal would need to be recalibrated. Additionally, it's not clear to me: should I recalibrate the tumor BAMs using the normal as a "known sites" input (i.e. recal on normal using dbSNP, call variants, use as input to recal tumor) or recalibrate them all the same way, i.e. just using dbSNP? Hopefully that makes sense.

5 weeks ago
tomas4482 ▴ 40

No matter what kind of samples and variants you are dealing with, the preprocessing pipeline needs to be run first. It consists of MarkDuplicatesSpark, base quality score recalibration (BaseRecalibrator), and applying the recalibration (ApplyBQSR). For RNA-seq, an additional SplitNCigarReads step is needed before BQSR as well.

Only after the recalibration has been applied to your BAM can it be used as input for detecting somatic or germline variants.
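
The order of steps can be sketched roughly as below. This is an illustration under assumptions, not a complete pipeline: all file names are placeholders, the `run` wrapper only prints commands unless DRY_RUN=0, Mutect2 is assumed as the somatic caller (the thread does not name one), and `normal_sample` stands in for whatever SM tag the normal BAM's read group actually carries.

```shell
# Sketch of the step order: mark duplicates -> BaseRecalibrator -> ApplyBQSR -> caller.
set -eu

REF=hg38.fasta
DBSNP=dbSNP_hg38.vcf

# Hypothetical helper: print commands by default, execute them when DRY_RUN=0.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi; }

preprocess() {
  s="$1"
  run gatk MarkDuplicatesSpark -I "${s}.bam" -O "${s}_markdup.bam"
  run gatk BaseRecalibrator -I "${s}_markdup.bam" -R "$REF" \
    --known-sites "$DBSNP" -O "${s}_recal.table"
  run gatk ApplyBQSR -I "${s}_markdup.bam" -R "$REF" \
    --bqsr-recal-file "${s}_recal.table" -O "${s}_recal.bam"
}

preprocess normal
preprocess tumor

# Only the recalibrated BAMs are handed to the caller (Mutect2 is an assumption here).
run gatk Mutect2 -R "$REF" -I tumor_recal.bam -I normal_recal.bam \
  -normal "${NORMAL_SM:-normal_sample}" -O somatic.vcf.gz
```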