Question

When to use .vcf or .gvcf files from GATK HaplotypeCaller?

23 months ago

Vitor1 ▴ 120

Hi everyone!

I'm currently following this tutorial for variant calling with GATK: https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/

It's very clear and straightforward; however, it uses GATK's HaplotypeCaller to generate output in .vcf format (step 4).

When I looked at the GATK Best Practices for germline variant calling, they use the same tool (HaplotypeCaller) but with the output in .gvcf format, which is later consolidated to produce the .vcf files. (https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows)

I'm wondering which one I should use. I have blood WES data from approximately 30 patients and I'm looking for SNPs and indels in specific genes.

Thanks!

indel gatk calling snp variant • 2.5k views

updated 23 months ago by Medhat 9.7k • written 23 months ago by Vitor1 ▴ 120

score 2 · Answer 1 · 2022-05-10

"The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a GVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not. This estimation is generated by the HaplotypeCaller's built-in reference model."

So, in your case, if you want to analyze the 30 samples as a cohort, use the GVCF format. Additionally, you can convert a GVCF to a VCF, but not the other way around, e.g. with bcftools convert --gvcf2vcf.
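As a rough sketch of the cohort route described above (not a definitive pipeline), the commands could look like the following. The file names (`reference.fasta`, `patientNN.bam`, `cohort.*`) and the three-sample list are placeholders I made up for illustration, and the `run` helper with its `DRY_RUN` switch is also hypothetical: by default it only prints the commands so you can check them before running anything. For WES data you would probably also want to restrict calling to your capture targets with `-L`, which is left out here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder inputs (assumptions, not from the original post).
REF="reference.fasta"
SAMPLES=(patient01 patient02 patient03)

# Hypothetical helper: print the command when DRY_RUN=1 (default), run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

gvcf_workflow() {
  local s
  local variant_args=()

  # 1. Per-sample calling in GVCF mode (-ERC GVCF) instead of plain VCF output.
  for s in "${SAMPLES[@]}"; do
    run gatk HaplotypeCaller -R "$REF" -I "${s}.bam" -O "${s}.g.vcf.gz" -ERC GVCF
    variant_args+=(--variant "${s}.g.vcf.gz")
  done

  # 2. Consolidate the per-sample GVCFs into one cohort GVCF.
  run gatk CombineGVCFs -R "$REF" "${variant_args[@]}" -O cohort.g.vcf.gz

  # 3. Joint genotyping: this is the step that produces the final cohort VCF.
  run gatk GenotypeGVCFs -R "$REF" -V cohort.g.vcf.gz -O cohort.vcf.gz

  # Optional: expand a single GVCF's reference blocks with bcftools
  # (needs the reference FASTA via --fasta-ref).
  run bcftools convert --gvcf2vcf --fasta-ref "$REF" \
    -O z -o "${SAMPLES[0]}.expanded.vcf.gz" "${SAMPLES[0]}.g.vcf.gz"
}

gvcf_workflow
```

Running it as-is just prints the commands; setting `DRY_RUN=0` would execute them, assuming `gatk` and `bcftools` are on your PATH and the BAM files exist. For larger cohorts, the Best Practices use GenomicsDBImport in place of CombineGVCFs for the consolidation step.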