Hello, I am attempting to determine the ploidy of my organism by observing where the modes of SNP allele ratios fall. For example, a 2N diploid should have a peak of allele read-counts at 0.5 since most SNPs should have, for example, 15 reference and 15 alternate reads. A triploid should have two modal peaks around 0.33 and 0.66, and a tetraploid should have four peaks around 0.25, 0.5, and 0.75.
I am using HaplotypeCaller to call SNPs across ~50 individuals as a cohort in gVCF+joint genotyping mode, then using SelectVariants to keep biallelic SNPs and using the JEXL expression " -select 'vc.getGenotype("sample").isHet()' " to isolate only those SNPs called as heterozygous, then taking the "-GF AD" option on SelectVariants and parsing the allele depths in the resulting table, and finally calculating the alternate allele read depth divided by the total read depth to get my final "ratio". Each individual has >200,000 SNPs included in analysis.
While this workflow seems to work and give a modal peak around 0.5, every sample previously known as a 2N diploid I try this on shows a significant shoulder on the left side of the density plot I generate (see attached, sample name = "1306"). When I apply this to an artificial tetraploid (two 2N diploid individuals merged) I can see the expected three modal peaks around 0.25, 0.5, and 0.75, but again, there is a very significant bias toward the left. The same trend finally applies to an isolate known to be triploid (sample "EC1"). In essence, it appears that the number of reference reads is crowding out the alt reads.
My first theory was that local CNVs drove up reference read alleles, and indeed the SNPs showing this abnormally low read ratio values were usually higher in read depth than the high ratio SNPs. Removing SNPs with high read depths (>200x in a genome with ~50x background) didn't really change anything. My next theory is that reads with alt alleles weren't mapped as well as ref reads, leading HaplotypeCaller to discard them due to low mapping quality.
Is this a problem that anyone employing this this technique has run into? Is HaplotypeCaller just not a good program to use? Should I forgo cohort SNP calling in HC, or just use UnifiedGenotyper? Do I need more sophisticated filtering parameters (I used GATK Best practices recommended for hard filtering, I can't use VQSR)?
TL;DR, too many read ratio peak heights in 2N individuals, and not equal heights in higher ploidies. Why??