GATK VCF output: where are the sample-level annotations?
1
0
Entering edit mode
3.2 years ago
mcrepeau ▴ 20

I also posted this question to the GATK forum, but it doesn't seem to be very active. The question is:

How do I get GATK to write the sample-level annotations to my VCF file? Most VCF files I've worked with have a header that includes FORMAT and then the names of each of my samples. Then each record ends with the annotations for each sample, like:

GT:AD:DP:GQ:PL 0/1:18,15:33:99:393,0,480 1/1:0,30:30:89:913,89,0 etc.

This is the kind of VFC file I get from our current pipeline, which uses Freebayes. But I'm trying to learn to use GATK. I managed to write some WDL scripts to run GATK HaplotypeCaller on 11 mosquito genomes and then run the resulting 11 GVCF files through GATK GenotypeGVCFs. It worked, but... There are no sample-level annotations! The VCF header stops at INFO.

So I have apparently done the "SNP discovery", but I'd also to call the genotypes of each sample. Is there a separate GATK tool to do that? What am I missing here? I looked through the documents for GenotypeGVCFs but couldn't see any command arguments that instruct it to output sample-level annotations.

SNP sequence GATK VCF • 1.1k views
ADD COMMENT
1
Entering edit mode

Well, it was all scripted in WDL, basically the Broad Institute scripts but I stripped them down to the bare minimum. I can post the full scripts, or even the Cromwell workflow logs, but obviously they're quite long! The basic command line for HaplotypeCaller looked like this:

gatk --java-options "-Xmx6G -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \
  HaplotypeCaller \
  -R /cromwell_root/marc-test-gbucket/infiles/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa \
  -I gs://marc-test-gbucket/infiles/2014COV006.bam \
  -L /cromwell_root/marc-test-gbucket/infiles/2R.bed \
  -O 2014COV006.g.vcf.gz \
  --heterozygosity 0.01 \
  --indel-heterozygosity 0.001 \
  --min-base-quality-score 17 \
  -contamination 0 \
  -G StandardAnnotation -G StandardHCAnnotation -G AS_StandardAnnotation \
  -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
  -ERC GVCF

Each sample was sharded across 8 intervals (basically full chromosome arms) and the shards were merged with GATK MergeVCFs. Once I had produced 11 GVCF files that way, I ran a WDL script that used GATK GenomicsDBImport to pull them all into a database and then GenotypeGVCFs used that database as input. The basic GenomicsDBImport command looked like this:

gatk --java-options -Xms8g \
  GenomicsDBImport \
  --genomicsdb-workspace-path genomicsdb \
  --batch-size 50 \
  -L /cromwell_root/fc-98bcf530-34e6-4198-936d-ded43246d2ea/8339b2b6-ed56-4f68-9981-2422250f0c82/JointGenotyping/f1edaf67-859a-44dc-b64d-b48269bd65db/call-SplitIntervalList/cacheCopy/glob-d928cd0f5fb17b6bd5e635f48c18ccfb/0000-scattered.interval_list \
  --sample-name-map /cromwell_root/marc-test-gbucket/infiles/sample_map \
  --reader-threads 5 \
  --merge-input-intervals \
  --consolidate

Again, it was sharded, this time in 5 shards. Then the basic GenotypeVCFs command looked like this:

gatk --java-options -Xms8g \
  GenotypeGVCFs \
  -R /cromwell_root/marc-test-gbucket/infiles/Anopheles-gambiae-PEST_CHROMOSOMES_AgamP4.fa \
  -O test_callset.0.vcf.gz \
  -G StandardAnnotation -G AS_StandardAnnotation \
  --only-output-calls-starting-in-intervals \
  --use-new-qual-calculator \
  -V gendb://$WORKSPACE \
  -L gs://fc-98bcf530-34e6-4198-936d-ded43246d2ea/8339b2b6-ed56-4f68-9981-2422250f0c82/JointGenotyping/f1edaf67-859a-44dc-b64d-b48269bd65db/call-SplitIntervalList/cacheCopy/glob-d928cd0f5fb17b6bd5e635f48c18ccfb/0000-scattered.interval_list \
  --merge-input-intervals

For the same 5 shards. Then there were some filtering commands on each shard that looked like this:

gatk --java-options -Xms3g \
  VariantFiltration \
  --filter-expression "ExcessHet > 54.69" \
  --filter-name ExcessHet \
  -O test_callset.0.variant_filtered.vcf.gz \
  -V /cromwell_root/fc-98bcf530-34e6-4198-936d-ded43246d2ea/8339b2b6-ed56-4f68-9981-2422250f0c82/JointGenotyping/f1edaf67-859a-44dc-b64d-b48269bd65db/call-GenotypeGVCFs/shard-0/test_callset.0.vcf.gz

gatk --java-options -Xms3g \
  MakeSitesOnlyVcf \
  -I test_callset.0.variant_filtered.vcf.gz \
  -O test_callset.0.sites_only.variant_filtered.vcf.gz

And finally, all the shards were gathered with this command:

gatk --java-options -Xms6g \
  GatherVcfsCloud \
  --ignore-safety-checks \
  --gather-type BLOCK \
  --input gs://fc-98bcf530-34e6-4198-936d-ded43246d2ea/8339b2b6-ed56-4f68-9981-2422250f0c82/JointGenotyping/f1edaf67-859a-44dc-b64d-b48269bd65db/call-HardFilterAndMakeSitesOnlyVcf/shard-0/test_callset.0.sites_only.variant_filtered.vcf.gz --input gs://fc-98bcf530-34e6-4198-936d-ded43246d2ea/8339b2b6-ed56-4f68-9981-2422250f0c82/JointGenotyping/f1edaf67-859a-44dc-b64d-b48269bd65db/call-HardFilterAndMakeSitesOnlyVcf/shard-1/test_callset.1.sites_only.variant_filtered.vcf.gz --input gs://fc-98bcf530-34e6-4198-936d-ded43246d2ea/8339b2b6-ed56-4f68-9981-2422250f0c82/JointGenotyping/f1edaf67-859a-44dc-b64d-b48269bd65db/call-HardFilterAndMakeSitesOnlyVcf/shard-2/test_callset.2.sites_only.variant_filtered.vcf.gz --input gs://fc-98bcf530-34e6-4198-936d-ded43246d2ea/8339b2b6-ed56-4f68-9981-2422250f0c82/JointGenotyping/f1edaf67-859a-44dc-b64d-b48269bd65db/call-HardFilterAndMakeSitesOnlyVcf/shard-3/test_callset.3.sites_only.variant_filtered.vcf.gz --input gs://fc-98bcf530-34e6-4198-936d-ded43246d2ea/8339b2b6-ed56-4f68-9981-2422250f0c82/JointGenotyping/f1edaf67-859a-44dc-b64d-b48269bd65db/call-HardFilterAndMakeSitesOnlyVcf/shard-4/test_callset.4.sites_only.variant_filtered.vcf.gz --input gs://fc-98bcf530-34e6-4198-936d-ded43246d2ea/8339b2b6-ed56-4f68-9981-2422250f0c82/JointGenotyping/f1edaf67-859a-44dc-b64d-b48269bd65db/call-HardFilterAndMakeSitesOnlyVcf/shard-5/test_callset.5.sites_only.variant_filtered.vcf.gz \
  --output test_callset.sites_only.vcf.gz
ADD REPLY
0
Entering edit mode

what are your command lines ?

I managed to write some WDL scripts to run GATK HaplotypeCaller on 11 mosquito genomes and then run the resulting 11 GVCF files through GATK GenotypeGVCFs

hum... CombineGVCFs is missing here.

ADD REPLY
0
Entering edit mode
3.2 years ago

MakeSitesOnlyVcf

isn't it why you get a VCF without FORMAT ?.....

ADD COMMENT
0
Entering edit mode

Ah! Okay, yes. That looks like the problem. As I said, I started with WDL scripts from the Broad, so this is the way they had it scripted. I guess they were only interested in SNP discovery. Presumably if I give GatherVcfsCloud the test_callset.n.variant_filtered.vcf.gz shards instead of the test_callset.n.sites_only.variant_filtered.vcf.gz shards I will get what I want.

ADD REPLY
0
Entering edit mode

Yep, that worked. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 2159 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6