Filtering out samples by FORMAT attributes using bcftools
0
1
Entering edit mode
9 weeks ago

The -e and -i options of the bcftools filter command appear, by default, to only allow for including or excluding _sites_. E.g., -e 'FMT/DP < 10' removes sites where any sample has DP < 10, and -e 'MEAN(FMT/DP) < 10' removes sites where average depth across samples is < 10.

I would like to perform effectively similar filtering commands, but in a way that includes or excludes samples, instead of sites. E.g., exclude all _samples_ that have DP < 10 at any _site_, or exclude all samples with an average depth across all sites < 10, or exclude all samples that have a genotype quality of < 20 at 10% of genotyped sites.

Such commands can very easily be written to apply to sites, but there does not seem to be a straightforward way of applying such filtering expressions to samples in bcftools.

One option mentioned in the manual is to use the double || operator, as described in the man page. As described in my comment on a Github issue, however, this does not seem to work as a filtering operation for including or excluding entire columns in a VCF file, but rather only for selecting more or less individual records (e.g., as illustrated in the example on this tutorial page). In addition, it would seem limited in functionality (e.g., functions such as MEAN(FMT/DP) < 10 would still be evaluated across samples, not sites, when combined with a || operator).

The only other method mentioned in the manual is to use the SMPL or s prefix to functions which evaluate FORMAT fields in expressions passed to filter. This also does not appear to work as I would expect. I'm not sure I understand what it _is_ doing, but it definitely _isn't_ removing columns from the VCF file. Instead it seems to be using as a metric calculated across sites on a per sample basis as a filtering criteria applied to sites (but I could just be confused):

nSites_unfiltered=$(bcftools view -H unfiltered.vcf.gz | wc -l) nSamples_unfiltered=$(bcftools query unfiltered.bcf.gz -l | wc -l)

nSites_DPfilter=$(bcftools filter unfiltered.bcf.gz -e 'sMEAN(FORMAT/DP)<5' | bcftools view -H | wc -l) nSamples_DPfilter=$(bcftools filter unfiltered.bcf.gz -e 'sMEAN(FORMAT/DP)<5' | bcftools query -l | wc -l)

echo "scale=5; $nSites_DPfilter /$nSites_unfiltered" | bc #0.17349 - sites are removed, even though the filter is being applied to samples
echo "scale=5; $nSamples_DPfilter /$nSamples_unfiltered" | bc #1.00000 - no samples removed, despite, again, the filter being applied to samples

nSites_normalDPFilter=$(bcftools filter unfiltered.bcf.gz -e 'MEAN(FMT/DP) < 5' | bcftools view -H | wc -l) echo "scale=5;$nSites_normalDPFilter / \$nSites_unfiltered" | bc #0.83121 - fewer sites removed than the sMEAN filter

The .bcf.gz passed to these commands is here: https://www.dropbox.com/s/fwj1snabsb763sn/unfiltered.bcf.gz?dl=0 Its .csi index file is here: https://www.dropbox.com/s/jzdq6y91nkencmy/unfiltered.bcf.gz.csi?dl=0

In any event, the only option I have been able to come up with that actually excludes or includes entire sample IDs is to extract summary statistics using bcftools stats, export the desired statistics from the PSC, Per-sample counts table, and perform my own exploratory analysis outside of bcftools. Then I can use bcftools view -S to explicitly remove samples by their ID, which I have identified manually as failing some thresholding criteria. This also is obviously cumbersome, and only works for FMT fields that are included in stats (e.g., not GQ or PL).

I am not sure if I am missing some basic functionality of bcftools, or if this is more of a feature request. Any information would be appreciated.

Thanks!

vcf SNP filtering bcftools gbs • 233 views