Question: GATK: HaplotypeCaller gVCF and multisample
iraun wrote (5.5 years ago):

Hi all,

Can anyone explain the main difference between using GATK HC in gVCF mode and in multi-sample mode? I know that HC in GVCF mode is used for variant discovery on cohorts of samples, but what does "cohorts of samples" mean? If I have 2 groups of samples, one WT and the other mutant, should I use GVCF mode? I've read almost all the GATK tutorials and how-tos and I still can't work it out.

Also, how can I give more than one bam to HC? Is this the correct way?:

java -Xmx8g -XX:ParallelGCThreads=4 -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I sample1.bam -I sample2.bam -I ... --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 --dbsnp dbsnp_138.b37.vcf -o raw_variants.vcf


Thank you.

abrahamdsl wrote (5.1 years ago):

"between using GATK HC in gVCF mode instead of in multi-sample mode?"

I'm not sure what you mean, but these are the possible scenarios:

- HaplotypeCaller in gVCF mode vs plain variant calling

- HaplotypeCaller (with or without gVCF mode) vs UnifiedGenotyper per sample

I think you meant the second one. HaplotypeCaller also performs de novo assembly of regions containing variants, which gives more confident variant calls. More information is in this thread: Variant Caller Of Choice?

"Cohort" is usually subjective.

  • Cohort: A collection of samples being analyzed together. This organizational unit is the most subjective and depends very specifically on the design goals of the sequencing project. For population discovery projects like the 1000 Genomes, the analysis cohort is the ~100 individual in each population. For exome projects with many deeply sequenced samples (e.g., ESP with 800 EOMI samples) we divide up the complete set of samples into cohorts of ~50 individuals for multi-sample analyses.


In our GATK-based pipeline, "cohort of samples" means all of the "pools" together. For example, we have four pools. We induced mutations in plants; fifteen plants still show a phenotype as if they had not been mutated, and we call those Pool1. The rest, Pools 2-4 with around 15 physical plants per pool, show the mutant phenotype to varying degrees. In the sequencing data, Pools 1, 2, 3 and 4 are separate samples. The cohort is all of them together.

I think you can use GVCF mode or not in your analysis (WT vs mutant); that depends on what comes next in your downstream processing. In everything I have seen so far, people run HaplotypeCaller in GVCF mode, then GenotypeGVCFs, then Variant Quality Score Recalibration, which in practice uses the VariantRecalibrator and ApplyRecalibration walkers of GATK. From there you select variants with an acceptable VQSLOD, usually >= 4.0. Further filtering might be needed after that.
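
The first two steps of that workflow can be sketched as a dry run that only prints the commands. This assumes GATK 3.x (the -T walker syntax used in the question). The pool names, file names and reference.fasta are hypothetical placeholders, and the VQSR step is left out because its resource arguments depend on the organism. Some older 3.x releases may need extra index arguments for gVCF output, so check the documentation for your version.

```shell
#!/bin/sh
# Dry-run sketch: prints GATK 3.x commands instead of running them.
# Sample names and file paths are hypothetical placeholders.
set -eu

REF="reference.fasta"
SAMPLES="Pool1 Pool2 Pool3 Pool4"
GATK="java -Xmx8g -jar GenomeAnalysisTK.jar"

# Step 1: one HaplotypeCaller run per sample, each emitting a gVCF.
hc_commands() {
  for s in $SAMPLES; do
    echo "$GATK -T HaplotypeCaller -R $REF -I $s.bam --emitRefConfidence GVCF -o $s.g.vcf"
  done
}

# Step 2: joint genotyping across all per-sample gVCFs.
genotype_command() {
  cmd="$GATK -T GenotypeGVCFs -R $REF"
  for s in $SAMPLES; do
    cmd="$cmd --variant $s.g.vcf"
  done
  echo "$cmd -o cohort.raw.vcf"
}

hc_commands
genotype_command
```

Printing the commands first makes it easy to check them before running anything. The point of the gVCF route is that step 1 runs once per sample, so adding a sample later only means one more HaplotypeCaller run plus a repeat of step 2.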

And yes, giving two or more BAMs to HaplotypeCaller, each with its own -I, is correct.
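
For comparison, here is a minimal dry-run sketch of the multi-sample call, with several BAMs in a single HaplotypeCaller run. It again assumes GATK 3.x syntax; the BAM names are hypothetical, and the optional thresholds and --dbsnp argument from the question are omitted for brevity.

```shell
#!/bin/sh
# Dry-run sketch: prints one multi-sample HaplotypeCaller command.
# BAM names and the reference path are hypothetical placeholders.
set -eu

REF="reference.fasta"
BAMS="wt.bam mutant.bam"

multisample_command() {
  cmd="java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $REF"
  # One -I argument per input BAM.
  for b in $BAMS; do
    cmd="$cmd -I $b"
  done
  echo "$cmd -o raw_variants.vcf"
}

multisample_command
```

GATK tells samples apart by the SM tag in each BAM's read groups, so each BAM needs a distinct SM value if the samples are to be genotyped separately.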
