filtering on F_PASS(GT="het") in bcftools gives contradictory results
0
0
Entering edit mode
3.2 years ago

I want to remove sites from a VCF file with an overabundance or an underabundance of heterozygous genotypes. I have 8940 markers to start with:

bcftools view unfiltered.bcf.gz -H | wc -l # 8940

When I exclude sites with a frequency of hets > 0.9, this number goes down a bit:

bcftools filter unfiltered.bcf.gz -e 'F_PASS(GT="het") > 0.9' | bcftools view -H | wc -l #8905

When I exclude sites with a frequency of hets < 0.1, suddenly, almost every marker is removed:

bcftools filter unfiltered.bcf.gz -e 'F_PASS(GT="het") < 0.1' | bcftools view -H | wc -l #9

This does not reflect the actual distribution of heterozygous frequencies in this population (validated using bcftools stats). Indeed, when I perform what to my mind is an equivalent command (i.e., include only sites with a frequency of hets >= 0.1) I get a more reasonable reduction:

bcftools filter unfiltered.bcf.gz -i 'F_PASS(GT="het") >= 0.1' | bcftools view -H | wc -l #8909

Any explanation for the difference in output between these later two commands would be appreciated.

SNP filtering bcftools • 2.0k views
ADD COMMENT
0
Entering edit mode

You could try using bcftools query to find the %N_PASS(GT="het") per site. It's not listed, but maybe %F_PASS works too. That way, you'd have a list of F_PASS that you can compare sites that pass/fail filters to and see what's going on.

ADD REPLY
0
Entering edit mode

Thanks for the tip. Indeed, this just kind of makes things more confusing:

bcftools query -f '%ID %N_PASS(GT="het")\n' unfiltered.bcf.gz

Returns a long list where it is clear that almost every site has 60-80 samples that pass the expression. However:

bcftools filter unfiltered.bcf.gz -e 'N_PASS(GT="het") < 16' | bcftools view -H | wc -l #9

Still does not work (setting the hard threshold of 16 in this case because 0.1 * sample size = 15.9). So bcftools is definitely parsing N_PASS(GT="het") correctly, but somehow when based to the -e option of filter, it doesn't work....

ADD REPLY
0
Entering edit mode

I daresay this warrants an issue on github, or at least an email to their mailing list (if one exists). Check with -i N_PASS(GT="het") >=16 and if that returns more rows, something is quite wrong.

ADD REPLY
0
Entering edit mode

Indeed:

bcftools filter unfiltered.bcf.gz -i 'N_PASS(GT="het") >=16' | bcftools view -H | wc -l

Returns 8909 sites. I opened an in issue on Github and if I get a response I'll give an update here. Thanks for thinking this through.

ADD REPLY
0
Entering edit mode

No problem. Could missing genotypes have something to do with this? Still feels intuitively weird.

ADD REPLY
0
Entering edit mode

There is definitely are missing genotypes (encoded with . characters in the FMT/GT field, but I'm not sure why that would only create a problem for e and not i. The compressed .vcf is here: https://www.dropbox.com/s/ypkh861qkimz6jc/DMxM6_F2.vcf.bgz?dl=0, and it's tabix-generated index here: https://www.dropbox.com/s/0d5sccem1qe9m6s/DMxM6_F2.vcf.bgz.tbi?dl=0, if you want to take a look.

ADD REPLY

Login before adding your answer.

Traffic: 2723 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6