I need to filter out SNPs with excess heterozygotes from a large vcf to get reliable estimates of heterozygosity and Fis.
My current estimates are obviously too high, and I think I might have potential paralogues mapped on top of each other or other mapping issues. However, I'm struggling to find a good way to filter out SNPs with excess heterozygotes from a big vcf (millions of SNPs) using already existing tools and scripts. Any advice on how I can solve this?
I have been thinking that I could either:
filter out SNPs showing fixed heterozygosity on a population basis
filter out SNPs based on Fis calculated per SNP (Fis can be used as an indication of poor mapping like explained here: https://gatk.broadinstitute.org/hc/en-us/articles/360035531992-Inbreeding-Coefficient )
filter out SNPs based on a fixed proportion of heterozygotes over the whole dataset (like only keeping SNPs that have proportion of heterozygotes less than 0.75 or something).
The VCF has already been filtered twice - once according to GATK’s best practices quality recommendations: QualbyDepth (QD) < 2.0, RMSMappingQuality (MQ) < 40.0, FisherStrand (FS) > 60.0, MappingQualityRankSumTest (MQRankSum) < -12.5, and ReadPosRankSumTest (ReadPosRankSum) < -8.0.
Then we also used vcftools to filter the VCF additionally by removing indels (--remove-indels), only retain bi-allelic SNPs (--min-alleles 2 --max-alleles 2), SNPs with a minimum quality of 40 (--minQ 40), maximum missingness of 0.9 (--max-missing 0.9), minimum depth > 15 (--min-meanDP 15, --minDP 15), and maximum depth < 60 (--max-meanDP 60, --maxDP 60). I think the maximum depth filtration gets read of possible mapping issues caused by repeats, but perhaps not paralogues.
Super grateful for any advice as I have been stuck on this problem for a while 🙈
p.s. After posting this I discovered a Biostar post on a similar topic, which might be useful if I take the third approach? Remove sites with heterozygote excess with VCFtools