Hi! I used GATK HaplotypeCaller (GHC) to call variants on amplicon target sequencing data. I think GHC missed some true variants.
I have a set of 40 amplicon target sequencing samples. I called variants on them using GHC and 2 other variant callers. I also have VCFs with "standard" variants called on the same samples by another pipeline I trust more. I use these standard variants to compare variants called by my pipeline against.
I did not mark duplicates in BAMs. I hard-clipped primers in BAMs. I run GHC with --dont-use-soft-clipped-bases false. I used GATK 4.6.1.0.
When I use GHC with --max-reads-per-alignment-start option set to default 50, GHC finds all variants from standard variant set. But when I use GHC with --max-reads-per-alignment-start option set to 0 as recommended for amplicon gene panels, GHC misses some of standard variants. The missed variants appear in more than one sample. And GHC found these variants in some samples, but missed in other samples. For example, imagine there is a variant chr1-123456-A-C. In standard variant set this variant is present in samples A, B, C, D. But GHC in my pipeline found the variant chr1-123456-A-C in samples A, B, C, but missed in sample D.
Worth mentioning, 2 other callers found those missing variants. One of the callers does not have an analog of the --max-reads-per-alignment-start option, so it found the variants at full depth. The other caller has --max-reads-per-alignment-start option analog. It found the missed variants both at threshold of 50 and full depth.
I tried to debug GHC as described in "When HaplotypeCaller and Mutect2 do not call an expected variant" and "Expected variant at a specific site was not called" articles on GATK site. I run GHC with --debug, --bam_output, and --max-reads-per-alignment-start 0 options.
I compared BAM files where GHC found a variant with BAM files where GHC missed exactly the same variant via IGV. I found no significant difference in depth of coverage and strand support between BAMs with found variants and BAMs with missed variants. I also did not find significant difference in mapping quality of reads that cover this variant. And I did not find significant difference in quality of bases that cover the variant.
Looking at GHC BAM output I discovered the positions of missed variants in majority of samples lost coverage, though one missed variant in one sample preserved it. The debug log did not bring me any closer to the problem solution. In all cases GHC considered that variant region was active. It tried to assemble haplotypes at variant regions. May be it succeeded since there were no messages about assembly fail in debug log. In some cases GHC did not find an event at particular location of the missed variant, but found variants in the same assembly, even if they were 20-30 bases away from a missed variant. In some cases GHC even genotyped missed variants, but they still did not appear in VCF. I compared debug logs of samples where variants were missed with debug logs of samples where the same variants were found. I found no difference except variants were genotyped in all samples where they were written to VCFs, whereas in samples where they were missed they were genotyped only in some samples.
In the next test in addition to the options described in paragraph above I used --linked-de-bruijn-graph option. It recovered a few missed variants, but did not change the debug log, except for notifying about genotyping event in case there was no genotyping notification in previous test debug log.
I performed one more debug test, where I used both --linked-de-bruijn-graph and --recover-all-dangling-branches options. It recovered some missed variants, but debug logs changed only to the same extent as in test with --linked-de-bruijn-graph option only.
I tried to gradually increase --max-reads-per-alignment-start option threshold. GHC started to miss variants after I set the threshold above 900.
Considering all the above I have a few questions:
- Why does GHC miss the same variant in one sample and find in another despite depth, strand support, read MAPQ, and base quality are nearly the same?
- Can I make GHC find these missing variants somehow at full available depth? Or may be there is a way to choose a --max-reads-per-alignment-start threshold for amplicon gene panel data that would not be just tuning for specific samples or random? Because I also really liked computational speed at smaller depth.
- Can you share literature about how GHC chooses what reads to consider and what to ignore if it has --max-reads-per-alignment-start set to some non-zero value?
- Can you suggest variant callers that are more suitable than GHC for amplicon gene panel data?