This has been a 'burning' issue of mine since 2014, when I was working in a clinical setting and naively implemented the GATK pipeline (with HaplotypeCaller; HC) to identify germline variants in patient samples passing through the laboratory. It began when the clinical scientists noticed that my NGS reports frequently omitted variants that they knew, from Sanger sequencing, were present in the patient samples. I could easily spot these missed variants in IGV, but there was no apparent connection between the missed variants and, e.g., GC content, MAPQ, read depth, repetitive sequence, or allele fraction - they just happened in every single sample.
Upon recommendation from a former Sanger Institute employee, I later tested samtools mpileup, and it detected everything. I also tested GATK and SAMtools on a 1000 Genomes whole genome sample, and again SAMtools detected known variants in the sample that GATK could not.
After some exploration, I found a way to 'assist' GATK in finding the missing variants: sub-sample the aligned BAMs to lower read depths and then re-call variants on each subset. A modified version of the pipeline is here, although this version now uses bcftools mpileup: https://github.com/kevinblighe/ClinicalGradeDNAseq
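To make the sub-sampling idea concrete, here is a minimal sketch that only builds the command lines for one possible version of it; it does not run anything. The fractions, the seed, the output file names, and the function name are my own illustrative assumptions, not the settings of the linked pipeline, which should be treated as the authoritative version. It relies on `samtools view -s`, where the integer part of the value is the random seed and the fractional part is the proportion of reads to keep.

```python
from pathlib import Path


def subsample_and_call_commands(bam, reference, fractions=(0.75, 0.5, 0.25), seed=42):
    """Build (not run) commands that sub-sample a BAM and re-call variants.

    Fractions, seed and output names are illustrative assumptions only.
    """
    stem = Path(bam).stem
    commands = []
    for fraction in fractions:
        formatted = f"{fraction:.3f}"  # e.g. 0.25 -> "0.250"
        if not formatted.startswith("0.") or formatted == "0.000":
            raise ValueError(f"fraction must be strictly between 0 and 1: {fraction}")
        # samtools view -s takes SEED.FRACTION, e.g. 42.250 keeps ~25% of reads
        s_arg = f"{seed}{formatted[1:]}"
        tag = f"{int(round(fraction * 100))}pc"
        sub_bam = f"{stem}.{tag}.bam"  # hypothetical output name
        vcf = f"{stem}.{tag}.vcf.gz"   # hypothetical output name
        commands.append(["samtools", "view", "-b", "-s", s_arg, "-o", sub_bam, bam])
        commands.append(["samtools", "index", sub_bam])
        commands.append(
            ["gatk", "HaplotypeCaller", "-R", reference, "-I", sub_bam, "-O", vcf]
        )
    return commands


if __name__ == "__main__":
    for cmd in subsample_and_call_commands("sample.bam", "ref.fa"):
        print(" ".join(cmd))
```

The per-fraction call sets would then need to be combined with the calls from the full-depth BAM; how best to do that (and which caller to use on each subset) is left to the pipeline linked above.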
So, it seems that GATK HC's issues were / are in part dependent on read depth. THIS GitHub issue, started in 2017, points the finger at interval padding, i.e., the -ip command line parameter, and makes for interesting reading. In particular, take a look at THIS comment from that same GitHub issue.
In my conversations with people about this over the years, some do not seem to care much, while others are concerned but continue to use GATK. From my perspective, GATK should only be used in a research setting.
Related thread: A: Best tool for variant calling