HaplotypeCaller GENOTYPE GIVEN ALLELES doesn't genotype given alleles
Entering edit mode
3 months ago
kamanovae ▴ 80


I am trying to run the Gatk HaplotypeCaller (human data):

./gatk- HaplotypeCaller\
 --reference ref.fa \
 --input file.bam \
 --genotyping-mode GENOTYPE_GIVEN_ALLELES \
 --alleles allele_chunk_file.vcf \
 --intervals file.bed  \
 --output  out/file.vcf

After running the above command for any given sample, only ~ 3 sites are called and all of them have a 'LowQual' filter attached, for example:

chr16   89919714    .   C   A   0   LowQual AC=0;AF=0.00;AN=2;DP=35;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.257 GT:AD:DP:GQ:PL  0/0:35,0:35:99:0,105,1210
chr16   89919722    .   T   C   0   LowQual AC=0;AF=0.00;AN=2;DP=32;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.206 GT:AD:DP:GQ:PL  0/0:32,0:32:96:0,96,1148
chr16   89919736    .   C   T   0   LowQual AC=0;AF=0.00;AN=2;DP=31;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.223 GT:AD:DP:GQ:PL  0/0:31,0:31:92:0,92,976
chr16   89919746    .   G   A   168.60  .   AC=1;AF=0.500;AN=2;BaseQRankSum=2.137;DP=34;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=4.96;ReadPosRankSum=0.955;SOR=0.409   GT:AD:DP:GQ:PL  0/1:26,8:34:99:176,0,674
chr16   89920138    .   G   C   0   LowQual AC=0;AF=0.00;AN=2;DP=30;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.307 GT:AD:DP:GQ:PL  0/0:30,0:30:90:0,90,862
chr16   89957794    .   G   A   0   LowQual AC=0;AF=0.00;AN=2;DP=43;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.428 GT:AD:DP:GQ:PL  0/0:43,0:43:99:0,129,1554
chr16   89957798    .   A   G   545.60  .   AC=1;AF=0.500;AN=2;BaseQRankSum=-4.803;DP=44;ExcessHet=3.0103;FS=15.365;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=12.40;ReadPosRankSum=-0.858;SOR=2.363   GT:AD:DP:GQ:PL  0/1:21,23:44:99:553,0,622
chr17   81629785    .   C   T   693.60  .   AC=1;AF=0.500;AN=2;BaseQRankSum=3.231;DP=57;ExcessHet=3.0103;FS=2.273;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=12.17;ReadPosRankSum=-1.027;SOR=1.012 GT:AD:DP:GQ:PL  0/1:33,24:57:99:701,0,908
chr20   34077942    .   A   .   0   LowQual AC=0;AF=0.00;AN=2;DP=47;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.509 GT:AD:DP:GQ:PL  0/0:47,0:47:99:0,144,2147483647
chr20   34197406    .   C   G   0   LowQual AC=0;AF=0.00;AN=2;DP=58;ExcessHet=3.0103;FS=0.000;MLEAC=0;MLEAF=0.00;MQ=60.00;SOR=0.327 GT:AD:DP:GQ:PL  0/0:58,0:58:99:0,174,2030

I've checked the bam file and both the known sites VCF and the called sites VCFs in IGV. There are many instances where the call should be homozygous reference but there is no call in the called VCF. There is clearly sufficient read coverage of one of the known sites VCF calls to call it as homozygous reference but there is no call in the called VCF.

Now I'm trying to find a command that can find the 0/0 genotype and distinguish it between really low coverage

Gatk HaplotypeCaller VCF • 448 views
Entering edit mode

I believe GATK's HaptoypeCaller is behaving correctly: "For each potentially variant site, the program applies Bayes' rule, using the likelihoods of alleles given the read data". Generally, GATK will only emit segregating sites, as in most cases homozygote reference sites are uninformative in many use cases.

If you are using a recent version of GATK (which I don't think you are since the --genotyping-mode argument appears to be deprecated), you can use either of these --output-mode [EMIT_ALL_CONFIDENT_SITES|EMIT_ALL_ACTIVE_SITES] as EMIT_VARIANTS_ONLY is used by default. Docs for this option here.

Entering edit mode

Thanks for your reply! I have another problem with GATK Haplotypecaller. Is there a way to keep the RS field in output vcf file. This is my command:

$gatk_run_line \
 --native-pair-hmm-threads 20 \
 --reference $genome_seq \
 --input $i \
 --alleles $allele_chunk_file \
 --intervals $intervals  \
 --output  out_new/$i.vcf

The input vcf looks like this:

chr1    1222897 rs3813199   G   A   .   .   RS=rs3813199
chr1    1223251 rs6603781   A   G   .   .   RS=rs6603781
chr1    1223251 rs6603781   A   T   .   .   RS=rs6603781
chr1    1225285 rs6671424   G   A   .   .   RS=rs6671424

But the output does not include rsid

chr1    817341  .   A   C   1492.06 .   AC=0;AF=0;AN=2;DP=50;ExcessHet=0;FS=0;MLEAC=0;MLEAF=0;MQ=60;QD=31.08;SOR=1.562  GT:AD:DP:GQ:PL  0/0:0,0:48:99:1506,1508,1511
chr1    817341  .   A   G   1492.06 .   AC=2;AF=1;AN=2;DP=50;ExcessHet=0;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=31.08;SOR=1.562  GT:AD:DP:GQ:PL  1/1:0,48:48:99:1506,143,0
chr1    817341  .   A   T   1492.06 .   AC=0;AF=0;AN=2;DP=50;ExcessHet=0;FS=0;MLEAC=0;MLEAF=0;MQ=60;QD=31.08;SOR=1.562  GT:AD:DP:GQ:PL  0/0:0,0:48:99:1506,1508,1511
chr1    818802  .   A   G   1024.06 .   AC=2;AF=1;AN=2;DP=27;ExcessHet=0;FS=0;MLEAC=2;MLEAF=1;MQ=55.95;QD=25.36;SOR=1.528   GT:AD:DP:GQ:PL  1/1:0,27:27:81:1038,81,0


Entering edit mode

I've not used a SNP database in my analyses on non-model organisms, but I believe you are looking for the VariantAnnotator tool. It has the --dbsnp parameter to annotate variants with dbSNP IDs where applicable. And according to the best practices for germline short variant discovery, VariantAnnotator should be one of the final steps in the pipeline.


Login before adding your answer.

Traffic: 2271 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6