I have a VCF file where the
FMT/DP of missing genotypes has been recorded itself as missing data (
When I filter this file in
bcftools, such that only sites where at least 90% of samples have a depth >= 10, I am returned 8940 sites.
bcftools filter unfiltered.bcf.gz -i 'F_PASS(FMT/DP >= 10) >= 0.9' -Ob -o filtered.bcf.gz bcftools view filtered.bcf.gz -H | wc -l #8940 sites
When I perform an equivalent filtering operation in
VariantAnnotation in R, I get different results depending on whether I replace this
"." with zeros, or remove missing data from the calculation, and neither value is 8940:
data <- readVcf(file="unfiltered.vcf.bgz") depth <- geno(data)$DP # replace NAs with 0 depth[is.na(depth)] <- 0 DP0.1 <- apply(depth,1,quantile,p=0.1) DP_filter <- (DP0.1>=10) sum(DP_filter) #8953 # don't replace NAs with 0, but remove from quantile calculation depth <- geno(data)$DP DP0.1 <- apply(depth,1,quantile,p=0.1,na.rm=T) DP_filter <- (DP0.1>=10) sum(DP_filter) #9196
I am wondering how
bcftools parses the missing values in this case, since it does not conform to either of the options I can simulate in R.
Relatedly, is there a way to edit the
FMT/DP attribute of specific genotypes and/or, change the way
bcftools treats missing data when operating on FMT/DP information? I.e., what I would really like to do is replace the depth of all missing genotypes with 0s.