Hello!
Could somebody please clarify why applying the variant filters step by step versus all at once gives such different variant counts? At first I did the filtering step by step:
- Read in bgen to produce pgen:
plink2 --bgen <filename.bgen> --sample <filename.sample>
Number of variants: 4,562,905.
- Keep White British samples and SNPs only:
plink2 --pfile <filename> --make-pgen --snps-only --keep wb_sample.txt --out test_chr10_wb_snps
Number of variants: 4,382,572.
- Missingness per SNP filter:
plink2 --pfile test_chr10_wb_snps --make-pgen --geno 0.05 --out test_chr10_wb_snps_call95
Number of variants: 4,224,711.
- Filter on minor allele frequency and imputation quality:
plink2 --pfile test_chr10_wb_snps_call95 --make-pgen --extract chr10_info_maf.txt --out test_chr10_wb_snps_call95_af_info
Number of variants: 1,067,224.
- HWE filter:
plink2 --pfile test_chr10_wb_snps_call95_af_info --make-pgen --hwe 1e-12 --out test_chr10_wb_snps_call95_af_info_hwe
Number of variants: 1,064,964.
- Exclude badly genotyped SNPs:
plink2 --pfile test_chr10_wb_snps_call95_af_info_hwe --make-pgen --exclude /disk.0/data/PRS/bad_geno_snps_exclude.txt --out test_chr10_wb_snps_call95_af_info_hwe_g
Number of variants: 1,064,909.
However, when I then tried to impose those filters all at once:
plink2 --pfile <filename> --make-pgen --keep wb_sample.txt --snps-only --extract chr10_info_maf.txt --geno 0.05 --hwe 1e-12 --exclude bad_geno_snps_exclude.txt --out <filename_filtered>
I ended up with 4,355,904 variants.
I just don't understand why the outputs are so different. I would have thought that, whichever way the filters are imposed, what remains in the end is the intersection of all of them.
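In case it helps with diagnosing this, here is a minimal sketch of how the two final outputs could be compared variant by variant, to see which IDs survive in one result but not the other. The helper names pvar_ids and only_in_first are my own (hypothetical, not part of plink2), and the sketch assumes the .pvar files have the usual '#CHROM POS ID ...' header, so that the variant ID is in column 3.

```shell
# Print the variant IDs of a .pvar file, one per line, sorted and de-duplicated.
# Assumption: header lines start with '#', and the ID is in column 3.
pvar_ids() {
    awk '!/^#/ { print $3 }' "$1" | LC_ALL=C sort -u
}

# Print the IDs that are present in the first .pvar but missing from the second.
only_in_first() {
    ids_a=$(mktemp)
    ids_b=$(mktemp)
    pvar_ids "$1" > "$ids_a"
    pvar_ids "$2" > "$ids_b"
    # comm -23 keeps lines unique to the first (sorted) input.
    LC_ALL=C comm -23 "$ids_a" "$ids_b"
    rm -f "$ids_a" "$ids_b"
}
```

Calling only_in_first with the .pvar from the all-at-once run first and the .pvar from the last step-by-step run second should list the extra variants; checking a few of those IDs against chr10_info_maf.txt would then show whether the --extract list was actually applied in the single command.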
Thank you!