I'm interested to call the genotype at a list of positions for a bunch of panel-like sequencing runs. I run a standard best practices GATK germline variant calling pipelines (without marking duplicates, since duplicates are expected due to the size of the sequenced regions). I make gVCFs with HaplotypeCaller and then VCFs with GenotypeGVCFs.
Largely, the results are fine, but roughly 0.01 of my positions have low quality and no genotype (
./.) in the VCF, e.g.
4 190318080 . C . 29.91 LowQual DP=1015 GT:AD:DP ./.:1015,0:1015
When I look into positions like this, I see something curious in the gVCF. The GC score is good up until exactly this position, and then plummets to 0 for a single base. E.g., the gVCF looks like
4 190318056 . G <NON_REF> . . END=190318079 GT:DP:GQ:MIN_DP:PL 0/0:1014:99:993:0,120,1800 4 190318080 . C <NON_REF> . . END=190318080 GT:DP:GQ:MIN_DP:PL 0/0:1015:0:1015:0,0,0 4 190318081 . A <NON_REF> . . END=190318093 GT:DP:GQ:MIN_DP:PL 0/0:1015:99:989:0,120,1800
I figure this is the cause of my problem. To investigate, I made a bamout from HaplotypeCaller, which looks completely normal. When I
samtools tview it, I get an appropriate column of 1015 Cs at position 190318080 as expected. The qualities in the bamout also look fine: all
5s compared to
Is in the original CRAM, but I figure this is rescaling in the bamout, and this still corresponds to q-score 20, which ought to be fine.
Based on the above, I am puzzled as to why HaplotypeCaller/GenotypeGVCF are unable to call the homozygous genotype at this position. FWIW, I've seen this occur at several different positions some homozygous and other heterozygous (as judged manually from the alignment).
Has anyone got any ideas as to what might be going on here, or what else I might look at in order to try and diagnose this problem?