Question

How to filter SNPs with excess heterozygosity from a large vcf

2

Entering edit mode

21 months ago

Bagelbart ▴ 20

Hi all,

I need to filter out SNPs with excess heterozygotes from a large vcf to get reliable estimates of heterozygosity and Fis.

My current estimates are obviously too high, and I think I might have potential paralogues mapped on top of each other or other mapping issues. However, I'm struggling to find a good way to filter out SNPs with excess heterozygotes from a big vcf (millions of SNPs) using already existing tools and scripts. Any advice on how I can solve this?

I have been thinking that I could either:

filter out SNPs showing fixed heterozygosity on a population basis
filter out SNPs based on Fis calculated per SNP (Fis can be used as an indication of poor mapping like explained here: https://gatk.broadinstitute.org/hc/en-us/articles/360035531992-Inbreeding-Coefficient )
filter out SNPs based on a fixed proportion of heterozygotes over the whole dataset (like only keeping SNPs that have proportion of heterozygotes less than 0.75 or something).

The VCF has already been filtered twice - once according to GATK’s best practices quality recommendations: QualbyDepth (QD) < 2.0, RMSMappingQuality (MQ) < 40.0, FisherStrand (FS) > 60.0, MappingQualityRankSumTest (MQRankSum) < -12.5, and ReadPosRankSumTest (ReadPosRankSum) < -8.0.

Then we also used vcftools to filter the VCF additionally by removing indels (--remove-indels), only retain bi-allelic SNPs (--min-alleles 2 --max-alleles 2), SNPs with a minimum quality of 40 (--minQ 40), maximum missingness of 0.9 (--max-missing 0.9), minimum depth > 15 (--min-meanDP 15, --minDP 15), and maximum depth < 60 (--max-meanDP 60, --maxDP 60). I think the maximum depth filtration gets read of possible mapping issues caused by repeats, but perhaps not paralogues.

Super grateful for any advice as I have been stuck on this problem for a while 🙈

p.s. After posting this I discovered a Biostar post on a similar topic, which might be useful if I take the third approach? Remove sites with heterozygote excess with VCFtools

vcf heterozygosity filtering SNPs • 3.1k views

ADD COMMENT • link updated 21 months ago by Prash ▴ 270 • written 21 months ago by Bagelbart ▴ 20

0

Entering edit mode

After posting this, I found a way to filter out SNPs that were showing patterns of fixed heterozygosity in populations. Basically I just divided my VCF into population specific VCFs and then found all SNPs with fixed heterozygosity using grep like this:

grep -E '(0/1.*){10}' mypopfile.vcf > fixed.heterozous.SNPs.vcf

0/1 is the heterozygote pattern and 10 is the number of individuals in the pop. I concatenated all the info on fixed heterozygous SNPs from all populations, and used this to filter the main VCF using inverted grep (using the chromosome + position info).

grep -v -F -f chr.pos.fixed.heterozous.SNPs.vcf myfile.vcf > myfile.filtered.vcf

The chr.pos.fixed.heterozous.SNPs.vcf was just a file with the two first columns of the concatenated fixed.heterozous.SNPs.vcf giving chromosome and position. Not very elegant, but it seemed to be working.

Only problem is that my populations were quite small, so I might need to come up with some better filtering criterion.

ADD REPLY • link 21 months ago by Bagelbart ▴ 20

score 0 · Answer 1 · 2022-07-29

0

Entering edit mode

21 months ago

LChart 3.9k

What annotations are available in your VCF? Do you have mapping-related annotations like mapping quality rank sum, fisher strand, read position ranks sum (etc)? That might be more effective since it can remove sites that have problems but don't show excess heterozygotes (for instance, mapping issues that create hom-alt variants).

Failing that, vcftools allows you to directly filter out sites that fail Hardy-Weinberg equilibrium (excess or depleted heterozygosity):

https://vcftools.github.io/man_latest.html (look for --hwe)

ADD COMMENT • link 21 months ago by LChart 3.9k

0

Entering edit mode

Thank you so much for the suggestion. I should probably have added some info about my vcf and what kind of filtering I have already done (I have updated the post with this info now).

The VCF has already been filtered twice - once according to GATK’s best practices quality recommendations: QualbyDepth (QD) < 2.0, RMSMappingQuality (MQ) < 40.0, FisherStrand (FS) > 60.0, MappingQualityRankSumTest (MQRankSum) < -12.5, and ReadPosRankSumTest (ReadPosRankSum) < -8.0.

Then we also used vcftools to filter the VCF additionally by removing indels (--remove-indels), only retain bi-allelic SNPs (--min-alleles 2 --max-alleles 2), SNPs with a minimum quality of 40 (--minQ 40), maximum missingness of 0.9 (--max-missing 0.9), minimum depth > 15 (--min-meanDP 15, --minDP 15), and maximum depth < 60 (--max-meanDP 60, --maxDP 60). I think the maximum depth filtration gets read of mapping issues caused by repeats, but perhaps not paralogues.

I have been considering if I could use the --hwe flag in vcftools for something, but I guess I would have to apply it on a per population basis and not on the entire file? Or what do you think? I have been worried that the p-values does not make sense when I have a VCF with many different pops. Perhaps I could divide the VCF in to pop based VCFs, filter them using --hwe and then merge them somehow. (I only have four individuals per pop).

ADD REPLY • link 21 months ago by Bagelbart ▴ 20

0

Entering edit mode

Interesting. You are correct that a difference in allele frequencies between populations will lead to empirically higher heterozygosity. I would make two points:

The caveat about over-filtering applies to the general principal of filtering based on heterozygosity, so you cannot avoid it by using inbreeding or heterozygote fraction instead of HWE; any metric based on genotypes will be impacted by FST
You clearly have your own sense of "how much heterozygosity is too much," since you have identified excess heterozygotes as being associated with poor quality (and not as being normal inter-population divergence).

My suggestion is to annotate with HWE; and take a look at variants with the most extreme p-values to determine whether they look "real" or "artefactual"; and move down the p-value distribution until you find a cutoff where you feel like you may be throwing away too many bona-fide variants.

During such a process, I usually figure out what it is I'm seeing in the reads that makes me skeptical of one site versus others; and can turn it into a filter that performs better than HWE itself.

ADD REPLY • link 21 months ago by LChart 3.9k

0

Entering edit mode

That is a good point! I will try this approach and let you know how it goes. Thank you so much for taking the time to think about this.

p.s. I have seen several studies on selfing species where they discovered that there is excess heterozygosity that needs to be filtered away, so my guess is that this is a general problem. (However, none of these studies had approaches that were easy to follow unfortunately).

ADD REPLY • link 21 months ago by Bagelbart ▴ 20

0

Entering edit mode

I checked the p-value distribution for excess heterozygosity, and unfortunately it was a bit hard to determine what was real or artefactual since extreme p-values was tied to very different levels of heterozygosity. I think because of the population structure and mating system this approach might be a bit hard to use for my species. I did however manage to filter out fixed heterozygotes on a population basis - I will add that approach as a comment to this post. Thank you so much for your help!

ADD REPLY • link 21 months ago by Bagelbart ▴ 20

score 0 · Answer 2 · 2022-07-30

0

Entering edit mode

21 months ago

Prash ▴ 270

Great discussions and suggestions by LChart. If I may suggest, this would change if you are working on rare diseases even with excess heterozygosity! The MAF could be as small as 20 instead of 40. In adidtion, I suggest you consider matches: variant allele frequency (VAF) as VAF at nearly 50% or 100% could be considered potential

ADD COMMENT • link 21 months ago by Prash ▴ 270

0

Entering edit mode

Hi Prash! Thank you for the reply. I'm a bit uncertain if I understood what you meant by the MAF comment - do you mean that filtering based on the p-value for excess heterozygosity might not work if there are extreme minor allele frequencies? A lot of singletons are expected in selfing populations, so this might be an issue. (We have a VCF filtered based on MAF and one without). I will check out the potential for VAF filtering! Thank you.