Hi, this is a common situation when working with variants that lack the annotations needed for Variant Quality Score Recalibration (VQSR). Here's a breakdown of your options:
Can you proceed without VQSR?
Yes, you can absolutely proceed without VQSR. VQSR is ideal, especially for large datasets and high-quality cohorts, but it requires:
- A large number of variants (GATK's guidance is at least one whole genome or about 30 exomes)
- Known resources (e.g., HapMap, Omni, dbSNP)
- Specific annotations (e.g., QD, MQ, FS, ReadPosRankSum)
If your sample doesn't have those annotations and you can't regenerate the VCF, then VQSR is off the table.
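Before deciding, it can help to check which annotations your VCF actually declares. Here is a minimal Python sketch that reads the `##INFO` header lines and reports which of the annotations named above are missing. The function names and the `WANTED` list are made up for illustration, and a header declaration does not guarantee that every record carries the value, so treat it as a quick first look only.

```python
# Annotations named in the list above; extend as needed (illustrative choice).
WANTED = ["QD", "MQ", "FS", "ReadPosRankSum"]

def info_ids_in_header(lines):
    """Collect the INFO IDs declared in ##INFO header lines."""
    ids = set()
    prefix = "##INFO=<ID="
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends where the records begin
        if line.startswith(prefix):
            ids.add(line[len(prefix):].split(",", 1)[0].rstrip(">\n"))
    return ids

def missing_annotations(lines, wanted=WANTED):
    """Return the wanted annotations that the header does not declare."""
    present = info_ids_in_header(lines)
    return [name for name in wanted if name not in present]

# Tiny inline example standing in for a real file.
example = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=QD,Number=1,Type=Float,Description="Quality by depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=20;QD=12.5",
]
print(missing_annotations(example))  # ['MQ', 'FS', 'ReadPosRankSum']
```

For a real file you would pass an open file handle instead of the `example` list.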
How should you filter your sample instead?
1. Hard Filtering (GATK Recommended)
For single samples or datasets lacking the necessary annotations/resources, GATK recommends hard filtering: applying fixed thresholds to annotations to separate true variants from artifacts.
Here are typical hard-filtering thresholds for SNPs and indels:
For SNPs:
FILTER: QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0
For Indels:
FILTER: QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0
Use GATK VariantFiltration to apply these filters.
Note: If your VCF lacks these annotations entirely (e.g., QD, FS, MQ), then you'll need to filter based on what's available (e.g., QUAL score and depth).
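To make the logic of those thresholds concrete, here is a rough Python sketch of what "failing a hard filter" means for a single record. It is not how GATK implements it, just the same comparisons written out. The helper names are invented for this example, any ALT that is not a single base is treated as an indel (a simplification that ignores MNPs and symbolic alleles), and a missing annotation is skipped rather than counted as a failure, which is an assumption of this sketch.

```python
# Thresholds copied from the expressions above: (annotation, comparison, cutoff).
SNP_RULES = [("QD", "<", 2.0), ("FS", ">", 60.0), ("MQ", "<", 40.0),
             ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -8.0)]
INDEL_RULES = [("QD", "<", 2.0), ("FS", ">", 200.0),
               ("ReadPosRankSum", "<", -20.0)]

def parse_info(info_field):
    """Turn 'QD=1.5;FS=70' into {'QD': '1.5', 'FS': '70'}; flags map to ''."""
    out = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out

def failed_rules(ref, alt, info_field):
    """Return the annotations whose threshold this record violates."""
    info = parse_info(info_field)
    # Simplification: single-base REF and ALT(s) means SNP, anything else indel.
    is_snp = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
    rules = SNP_RULES if is_snp else INDEL_RULES
    failed = []
    for name, op, cutoff in rules:
        try:
            value = float(info[name])
        except (KeyError, ValueError):
            continue  # annotation absent or non-numeric: skip this rule
        if (op == "<" and value < cutoff) or (op == ">" and value > cutoff):
            failed.append(name)
    return failed

print(failed_rules("A", "G", "QD=1.5;FS=70.0;MQ=60.0"))  # ['QD', 'FS']
print(failed_rules("AT", "A", "QD=10.0;FS=70.0"))        # []
```

The second call shows why SNPs and indels get separate thresholds: FS=70 fails the SNP cutoff of 60 but is well under the indel cutoff of 200.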
Alternative Filtering Methods
Besides hard filtering, here are some other options depending on your setup:
2. bcftools filter
You can apply custom expressions on fields like QUAL, DP, and AF:
bcftools filter -e 'QUAL < 30 || DP < 10' input.vcf -o filtered.vcf
3. Machine learning / ensemble filtering
If you're working with tools like DeepVariant, Octopus, or Strelka, they often include internal filtering or confidence scores which can be used.
4. Custom scoring
Some pipelines develop sample-specific filtering heuristics (e.g., based on known variant concordance, sample depth distribution, or variant allele frequency). This can be done in Python, R, or with tools like vcftools.
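As one example of a depth-based heuristic, here is a small Python sketch that derives cutoffs from the sample's own DP distribution and flags sites outside that range. The 5th and 95th percentiles are arbitrary illustrative choices, not a recommendation from GATK or any other tool, and the function names are made up for this example.

```python
import statistics

def depth_bounds(depths, low_pct=5, high_pct=95):
    """Return (low, high) depth cutoffs taken from the sample's DP percentiles."""
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(depths, n=100, method="inclusive")
    return cuts[low_pct - 1], cuts[high_pct - 1]

def outside_depth_range(depth, bounds):
    """True if a site's depth falls below the low or above the high cutoff."""
    low, high = bounds
    return depth < low or depth > high

# Toy depth values standing in for the per-site DP of a real sample.
site_depths = list(range(1, 102))
bounds = depth_bounds(site_depths)
for dp in (3, 50, 100):
    print(dp, outside_depth_range(dp, bounds))
```

Very low depth usually means too little evidence, while unusually high depth often points to collapsed repeats or mapping problems, which is why both tails are flagged.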
5. Use of population-level knowledge (if applicable)
If your sample comes from a population that is well represented in resources such as 1000 Genomes or gnomAD, you can cross-reference known variants to boost confidence.
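The cross-referencing itself can be as simple as a set lookup on (chromosome, position, REF, ALT). Here is a minimal Python sketch with invented helper names and toy data. It assumes both call sets use the same reference build and that indels are normalized the same way on both sides, otherwise identical variants may fail to match. Also keep in mind that a variant missing from the resource is not necessarily wrong, since it may simply be rare.

```python
def variant_key(chrom, pos, ref, alt):
    """Build a comparable key, tolerating 'chr' prefixes and letter case."""
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(pos), ref.upper(), alt.upper())

def load_known(records):
    """Build a lookup set from (chrom, pos, ref, alt) tuples of a known resource."""
    return {variant_key(*record) for record in records}

def annotate_known(calls, known):
    """Pair each call with True/False for whether it appears in the known set."""
    return [(call, variant_key(*call) in known) for call in calls]

# Toy data standing in for a population resource and your own calls.
known = load_known([("1", 100, "A", "G"), ("2", 500, "CT", "C")])
calls = [("chr1", "100", "a", "g"), ("chr1", "200", "T", "C")]
for call, is_known in annotate_known(calls, known):
    print(call, "known" if is_known else "novel")
```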
Hope this helps, friend.
I truly appreciate your clear and comprehensive explanation. I learned a lot of important and practical points.