I'm currently designing a modification of GATK's "Germline SNPs + Indels best practices" pipeline for pooled sequencing (pool-seq) data. The trouble is that when I run HaplotypeCaller, it generates this warning message:
WARN HaplotypeCallerGenotypingEngine - Removed alt alleles where ploidy is 88 and original allele count is 3, whereas after trimming the allele count becomes 2. Alleles kept are:[T*, A]
Note: the ploidy for my sample is set to 88 (2 x 44) so that SNPs are called from the alleles of all 44 diploid fish in my population. This is a key modification for pool-seq data.
HaplotypeCaller has "trimmed" the number of alleles from 3 down to 2, making this a biallelic site. I want to avoid this: although they are rare, there will be some true multi-allelic SNPs in the population, and the program is coercing them into biallelic SNPs. My guess is that it has something to do with the genotype calling algorithm GATK uses. Can anyone explain what is going on behind the scenes?
I've found a way to mitigate the problem by increasing the value of the --max-genotype-count parameter; however, this substantially increases the computation time (before: 5.9 days for the full run; after: only 40% finished after 5 days). Is there a more efficient way to call multi-allelic SNPs?
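
To put rough numbers on why the third allele gets dropped, here is a minimal sketch of the genotype-count arithmetic. It assumes the limit applies to the number of distinct unordered genotypes, which for ploidy P and A alleles (reference included) is C(P + A - 1, A - 1), and it assumes the default --max-genotype-count is 1024; both should be checked against the documentation for the GATK version in use. The function names are my own and are not part of GATK.

```python
from math import comb


def genotype_count(ploidy: int, n_alleles: int) -> int:
    """Number of unordered genotypes for a given ploidy and allele count.

    Counts multisets of size `ploidy` drawn from `n_alleles` alleles
    (reference allele included): C(ploidy + n_alleles - 1, n_alleles - 1).
    """
    return comb(ploidy + n_alleles - 1, n_alleles - 1)


def max_alleles_within(ploidy: int, max_genotype_count: int) -> int:
    """Largest allele count whose genotype count stays within the limit."""
    n_alleles = 1
    while genotype_count(ploidy, n_alleles + 1) <= max_genotype_count:
        n_alleles += 1
    return n_alleles


PLOIDY = 88            # 44 diploid fish in the pool
ASSUMED_LIMIT = 1024   # assumed default of --max-genotype-count

for n_alleles in (2, 3, 4):
    print(n_alleles, "alleles:", genotype_count(PLOIDY, n_alleles), "genotypes")

print("alleles kept under the assumed limit:",
      max_alleles_within(PLOIDY, ASSUMED_LIMIT))
```

Under these assumptions, 2 alleles give 89 genotypes, 3 alleles give 4005, and 4 alleles give 121485. Only the biallelic case fits under 1024, which matches the warning, and keeping a second alt allele means evaluating roughly 45 times as many genotypes at each such site, which would go some way towards explaining the longer run time.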