Question

required annotations for VQSR

10 days ago

Mahdi&Christ ▴ 20

Hi, I have a sample that doesn’t have the required annotations for Variant Quality Score Recalibration (VQSR). Since I don’t have any other files to regenerate the VCF with these annotations, I’m wondering how I should proceed.

Is it possible to continue without applying VQSR and its recalibration? If yes, how should I filter this sample?

Also, besides hard-filtering, are there any other filtering methods I could use in this case?

Thanks a lot for your help

VQSR vcf annotation • 378 views

score 4 · Accepted Answer · 2025-06-10

Hi, this is a common situation when a VCF lacks the annotations that Variant Quality Score Recalibration (VQSR) needs. Here's a breakdown of your options:

Can you proceed without VQSR?

Yes, you can absolutely proceed without VQSR. VQSR is ideal, especially for large datasets and high-quality cohorts, but it requires:

A large number of variants to train on (GATK's guidance is roughly one whole genome or at least 30 exomes)
Known resources (e.g., HapMap, Omni, dbSNP)
Specific annotations (e.g., QD, MQ, FS, ReadPosRankSum, etc.)

If your sample doesn't have those annotations and you can't regenerate the VCF, then VQSR is off the table.
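Before ruling VQSR out, it can be worth confirming which annotations the file actually declares. Here is a minimal sketch that reads the `##INFO` header lines and reports which of the annotations named above are absent. The function names and the example header are made up for illustration, and a header declaration does not guarantee that every record carries the value.

```python
import re

# Annotations named in this answer; adjust the list to your own needs.
WANTED_ANNOTATIONS = ["QD", "MQ", "FS", "MQRankSum", "ReadPosRankSum"]

def declared_info_ids(vcf_lines):
    """Return the set of INFO IDs declared in the VCF header meta-lines."""
    ids = set()
    for line in vcf_lines:
        if not line.startswith("##"):
            break  # meta-lines are finished once the #CHROM line is reached
        match = re.match(r"##INFO=<ID=([^,>]+)", line)
        if match:
            ids.add(match.group(1))
    return ids

def missing_annotations(vcf_lines, wanted=WANTED_ANNOTATIONS):
    """List the wanted annotations that the header does not declare."""
    present = declared_info_ids(vcf_lines)
    return [name for name in wanted if name not in present]

# Hypothetical header for demonstration only.
example_header = [
    '##fileformat=VCFv4.2',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=QD,Number=1,Type=Float,Description="Quality by depth">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO',
]
print(missing_annotations(example_header))
```

For a real file you would pass the lines of the opened VCF (for example via `gzip.open(path, "rt")` for a compressed one) instead of the example list.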

How should you filter your sample instead?

1. Hard Filtering (GATK Recommended)

For single samples or datasets lacking the necessary annotations/resources, GATK recommends hard filtering — applying fixed thresholds to annotations to separate true variants from artifacts.

Here are typical hard-filtering thresholds for SNPs and indels:

For SNPs:

FILTER: QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0

For Indels:

FILTER: QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0

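Use GATK VariantFiltration to apply these filters. GATK's published recommendations also include SOR and QUAL thresholds, so check the current documentation for the full set. Note: if your VCF lacks these annotations entirely (e.g., QD, FS, MQ), you'll need to filter on what is available (e.g., QUAL and depth).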
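To make the logic of those thresholds concrete, here is a rough Python sketch that evaluates the same rules against a record's INFO field. It is not how VariantFiltration is implemented. The helper names are hypothetical, and skipping absent annotations instead of failing the record is an assumption of this sketch.

```python
# Thresholds quoted above, as (annotation, comparison, cutoff).
SNP_RULES = [("QD", "<", 2.0), ("FS", ">", 60.0), ("MQ", "<", 40.0),
             ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -8.0)]
INDEL_RULES = [("QD", "<", 2.0), ("FS", ">", 200.0),
               ("ReadPosRankSum", "<", -20.0)]

def parse_info(info_field):
    """Turn 'QD=1.5;FS=3.2;DB' into {'QD': '1.5', 'FS': '3.2', 'DB': True}."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info

def failed_rules(info, is_snp):
    """Return labels of the rules a record fails."""
    failures = []
    for name, op, cutoff in (SNP_RULES if is_snp else INDEL_RULES):
        raw = info.get(name)
        if not isinstance(raw, str):
            continue  # assumption: an absent annotation does not fail the record
        try:
            value = float(raw)
        except ValueError:
            continue  # unparseable values such as '.' are skipped as well
        if (op == "<" and value < cutoff) or (op == ">" and value > cutoff):
            failures.append(f"{name}{op}{cutoff}")
    return failures

info = parse_info("QD=1.5;FS=70.0;MQ=60.0")
print(failed_rules(info, is_snp=True))
print(failed_rules(info, is_snp=False))
```

The same record fails two SNP rules but only one indel rule, because the FS cutoff is much looser for indels.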

Alternative Filtering Methods

Besides hard filtering, here are some other options depending on your setup:

2. bcftools filter

You can apply custom expressions on fields like QUAL, DP, and AF (-e excludes the records that match the expression):

bcftools filter -e 'QUAL < 30 || DP < 10' input.vcf -o filtered.vcf
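If you want to see what that exclusion expression does record by record, the sketch below applies roughly the same QUAL and DP test in plain Python. It assumes uncompressed, tab-separated VCF lines and reads depth from INFO/DP. Treating a missing QUAL or DP as passing is an assumption of this sketch, not a statement about bcftools.

```python
def keep_record(line, min_qual=30.0, min_depth=10):
    """Apply the exclusion test 'QUAL < 30 || DP < 10' to one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    qual_text, info_text = fields[5], fields[7]
    # Collect key=value pairs from INFO; flags without '=' are ignored here.
    info = dict(item.split("=", 1) for item in info_text.split(";") if "=" in item)
    # Assumption: a missing QUAL or DP never triggers exclusion.
    if qual_text != "." and float(qual_text) < min_qual:
        return False
    if "DP" in info and int(info["DP"]) < min_depth:
        return False
    return True

def filter_vcf_lines(lines):
    """Yield header lines unchanged and only the data lines that pass."""
    for line in lines:
        if line.startswith("#") or keep_record(line):
            yield line

# Hypothetical records for demonstration only.
example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\t.\tDP=20\n",
    "chr1\t200\t.\tC\tT\t10\t.\tDP=20\n",
    "chr1\t300\t.\tG\tA\t50\t.\tDP=5\n",
]
for kept in filter_vcf_lines(example_lines):
    print(kept, end="")
```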

3. Machine learning / ensemble filtering

If the calls came from a tool such as DeepVariant, Octopus, or Strelka, the caller usually applies its own filtering or emits confidence scores that you can threshold on.

4. Custom scoring

Some pipelines develop sample-specific filtering heuristics (e.g., based on known variant concordance, sample depth distribution, or variant allele frequency). This can be done in Python, R, or with tools like vcftools.
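As one possible shape for such a heuristic, here is a small sketch that derives depth cutoffs from the sample's own depth distribution and also checks the alternate-allele read fraction. Every cutoff in it (half and double the median depth, a 0.2 minimum fraction) is an illustrative assumption and not a recommended value, and the function names are hypothetical.

```python
from statistics import median

def depth_bounds(depths, low_factor=0.5, high_factor=2.0):
    """Derive depth cutoffs from this sample's own depth distribution.

    The factors are illustrative assumptions, not recommended values.
    """
    mid = median(depths)
    return mid * low_factor, mid * high_factor

def alt_allele_fraction(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele (None if no reads)."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else None

def flag_site(depth, ref_reads, alt_reads, bounds, min_fraction=0.2):
    """Return the reasons a site looks suspicious under these heuristics."""
    low, high = bounds
    reasons = []
    if depth < low:
        reasons.append("low_depth")
    elif depth > high:
        reasons.append("high_depth")
    fraction = alt_allele_fraction(ref_reads, alt_reads)
    if fraction is not None and fraction < min_fraction:
        reasons.append("low_alt_fraction")
    return reasons

bounds = depth_bounds([20, 30, 40, 30, 25])
print(bounds)
print(flag_site(10, 9, 1, bounds))
print(flag_site(30, 15, 15, bounds))
```

In practice you would tune the cutoffs against something you trust, such as concordance with known variants, before relying on them.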

5. Use of population-level knowledge (if applicable)

If your sample comes from a well-characterized population, you can cross-reference your calls against known variants in resources such as 1000 Genomes or gnomAD to boost confidence in the calls that match.
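A simple way to picture the cross-referencing is an exact lookup on chromosome, position, REF and ALT. The sketch below assumes both call sets use the same reference build and that indels are already normalized the same way. The key format and function names are hypothetical.

```python
def variant_key(chrom, pos, ref, alt):
    """Build a hashable key; strips a leading 'chr' so naming styles match."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(pos), ref.upper(), alt.upper())

def load_known(records):
    """Build a lookup set from (chrom, pos, ref, alt) tuples of known variants."""
    return {variant_key(*record) for record in records}

def annotate_known(calls, known):
    """Pair each call with True or False for presence in the known set."""
    return [(call, variant_key(*call) in known) for call in calls]

# Hypothetical data for demonstration only.
known = load_known([("1", 100, "A", "G")])
calls = [("chr1", "100", "a", "g"), ("chr1", 200, "C", "T")]
for call, is_known in annotate_known(calls, known):
    print(call, "known" if is_known else "novel")
```

A novel call is not necessarily wrong, so this works better as supporting evidence than as a hard filter.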

Hope this helps, friend.