Is this the correct order of steps in GATK germline cohort variant calling (VQSR workflow)?
7 weeks ago
Ramnaresh • 0

I’m performing germline variant calling using GATK (gVCF-based workflow) and would like to confirm whether the order of my steps is correct. What I have done so far:

CombineGVCFs            ==> cohort.g.vcf.gz
GenotypeGVCFs           ==> output.vcf.gz
VariantFiltration (ExcessHet) ==> cohort_excesshet.vcf.gz
bcftools norm -m-any --check-ref -w -f "$reference" cohort_excesshet.vcf.gz -o cohort.nm.vcf.gz
MakeSitesOnlyVcf        ==> cohort.sitesonly.vcf.gz
VQSR

I’m trying to generate a high-quality cohort VCF and will later analyze per-patient variants. Is this the correct order of steps? Should normalization (bcftools norm) or VariantFiltration be performed before or after VQSR?

System:
GATK 4.6.2.0
Reference: GRCh38
~140 samples (gVCFs)
16 GB RAM

Any suggestions or corrections are appreciated!

Thanks in advance.

gVCF VCF GATK VQSR • 594 views
13 days ago
Kevin Blighe ★ 90k

Your order of steps is mostly correct for the GATK germline joint-genotyping workflow with VQSR. The ExcessHet filtration is appropriately placed after GenotypeGVCFs and before VQSR, as recommended by GATK, so that artifactual variants do not bias the recalibration model. Note that VariantFiltration only flags failing sites in the FILTER column rather than deleting them; VariantRecalibrator skips already-filtered sites by default. Normalization with bcftools should occur after ExcessHet filtration but before VQSR, as in your workflow, so that the split multiallelic sites carry consistent annotations into VQSR. Performing normalization before ExcessHet could require recalculating the ExcessHet annotation, which bcftools norm does not do automatically.
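For reference, the ExcessHet step is usually a single VariantFiltration call. The following is a minimal sketch, not a definitive command: the function name `excesshet_filter` is hypothetical, and the threshold of 54.69 is assumed from GATK's published examples, so check it against the current documentation for your cohort.

```shell
# Sketch: flag (not remove) sites with extreme excess heterozygosity.
# Assumption: threshold 54.69 is taken from GATK's example commands.
excesshet_filter() {
    # $1 = joint-genotyped input VCF, $2 = output VCF with FILTER set
    gatk VariantFiltration \
        -V "$1" \
        --filter-expression "ExcessHet > 54.69" \
        --filter-name ExcessHet \
        -O "$2"
}
```

With the file names from the question, this would be called as `excesshet_filter output.vcf.gz cohort_excesshet.vcf.gz`.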

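MakeSitesOnlyVcf is correctly used before VariantRecalibrator: model building only needs site-level annotations, so dropping the genotype columns makes the input much smaller and faster to process. Make sure it is run on the normalized file (cohort.nm.vcf.gz). You would then run ApplyVQSR on the full normalized and ExcessHet-filtered VCF (cohort.nm.vcf.gz) to produce the final recalibrated cohort VCF.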
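The three VQSR-related steps can be sketched for the SNP model as follows. This is an illustration under stated assumptions rather than a tuned pipeline: `vqsr_snp` is a hypothetical function name, `HAPMAP`, `KG_SNPS` and `DBSNP` are hypothetical environment variables pointing at GRCh38 resource VCFs, and the priors, annotation list, and 99.7 sensitivity level are assumed from GATK's example commands. It restricts the model with `-mode SNP` on the combined file, which is an alternative to physically splitting the file with SelectVariants.

```shell
# Sketch only. Assumptions: HAPMAP, KG_SNPS and DBSNP are hypothetical
# environment variables holding paths to GRCh38 resource VCFs; priors and
# the sensitivity level are assumed from GATK's example commands.
vqsr_snp() {
    full_vcf="$1"   # normalized, ExcessHet-flagged cohort VCF
    prefix="$2"     # prefix for intermediate and output files

    # 1. Drop genotype columns; the model only needs site-level annotations.
    gatk MakeSitesOnlyVcf -I "$full_vcf" -O "${prefix}.sitesonly.vcf.gz" &&

    # 2. Build the SNP recalibration model from the sites-only file.
    gatk VariantRecalibrator \
        -V "${prefix}.sitesonly.vcf.gz" \
        -mode SNP \
        -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an SOR \
        --resource:hapmap,known=false,training=true,truth=true,prior=15 "$HAPMAP" \
        --resource:1000G,known=false,training=true,truth=false,prior=10 "$KG_SNPS" \
        --resource:dbsnp,known=true,training=false,truth=false,prior=7 "$DBSNP" \
        -O "${prefix}.snps.recal" \
        --tranches-file "${prefix}.snps.tranches" &&

    # 3. Apply the model to the full VCF (with genotypes), not the sites-only one.
    gatk ApplyVQSR \
        -V "$full_vcf" \
        -mode SNP \
        --recal-file "${prefix}.snps.recal" \
        --tranches-file "${prefix}.snps.tranches" \
        --truth-sensitivity-filter-level 99.7 \
        -O "${prefix}.snps.vqsr.vcf.gz"
}
```

A call such as `vqsr_snp cohort.nm.vcf.gz cohort` would cover the SNP pass; the indel pass would follow the same pattern with `-mode INDEL` and an indel training resource such as Mills.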

For ~140 samples, consider using GenomicsDBImport instead of CombineGVCFs, as it is more efficient for larger cohorts. Your command for normalization appears correct, but ensure the reference FASTA matches GRCh38 exactly:

bcftools norm -m -any --check-ref w -f $reference cohort_excesshet.vcf.gz -o cohort.nm.vcf.gz

With 16 GB RAM, process chromosomes or intervals separately if memory issues arise during GenotypeGVCFs or VQSR. Separate SNPs and indels with SelectVariants before VQSR for optimal results, using GRCh38 resources like HapMap, 1000G, and Mills indels.
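To illustrate the GenomicsDBImport and per-interval suggestions together, here is a rough sketch that imports and genotypes one chromosome at a time. The function name `genotype_interval`, the workspace and output naming, the batch size of 50, and the 4 GB Java heap are all assumptions chosen for a 16 GB machine, not recommended values; `$reference` is the same variable used in the bcftools command above.

```shell
# Sketch only. Assumptions: heap size, batch size and file naming are
# illustrative; $reference is the GRCh38 FASTA used elsewhere in the thread.
genotype_interval() {
    interval="$1"     # e.g. chr20
    sample_map="$2"   # tab-separated: sample name <TAB> path to its gVCF

    # Import all gVCFs for this interval into a GenomicsDB workspace.
    # The workspace directory must not already exist.
    gatk --java-options "-Xmx4g" GenomicsDBImport \
        --sample-name-map "$sample_map" \
        --genomicsdb-workspace-path "gdb_${interval}" \
        --batch-size 50 \
        -L "$interval" &&

    # Joint-genotype directly from the workspace.
    gatk --java-options "-Xmx4g" GenotypeGVCFs \
        -R "$reference" \
        -V "gendb://gdb_${interval}" \
        -O "cohort.${interval}.vcf.gz"
}
```

Looping this over chr1 to chr22, chrX and chrY keeps peak memory bounded by the largest chromosome; the per-interval VCFs would then need to be concatenated (for example with bcftools concat) before ExcessHet filtering and VQSR.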

Kevin
