I have made a raw (unfiltered) variant call set following GATK best practices (a VCF file with ~16 million SNPs produced by GenotypeGVCFs). The original WGS data come from 60 samples sequenced at an average coverage of 20x.
We want to identify a small subset of really good SNPs and another subset of really bad SNPs, which we could use for validation.
How can I construct filters that select the SNPs most likely to be true positives and, separately, those most likely to be false positives?
A first choice would be to rank by QUAL and pick the SNPs at the top and the bottom of the list, but I am sure there is a more sophisticated way to do this.
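To make the QUAL-ranking idea concrete, here is a minimal sketch of what I have in mind. It assumes an uncompressed, tab-separated VCF read line by line; the function names `parse_sites` and `qual_extremes` are made up for illustration and are not part of GATK or any other tool. It streams the records once and keeps only the n highest and n lowest QUAL sites in memory.

```python
import heapq


def parse_sites(lines):
    """Yield (qual, chrom, pos) for every record that has a numeric QUAL."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] == ".":
            continue  # skip malformed records and records with missing QUAL
        yield float(fields[5]), fields[0], int(fields[1])


def qual_extremes(lines, n):
    """Return (top, bottom): the n highest and n lowest QUAL sites.

    Hypothetical helper: a single pass with two bounded heaps, so the
    full list of sites never has to be held in memory.
    """
    top = []     # min-heap holding the n largest QUAL values seen so far
    bottom = []  # min-heap on negated QUAL, holding the n smallest
    for qual, chrom, pos in parse_sites(lines):
        site = (qual, chrom, pos)
        negated = (-qual, chrom, pos)
        if len(top) < n:
            heapq.heappush(top, site)
        else:
            heapq.heappushpop(top, site)
        if len(bottom) < n:
            heapq.heappush(bottom, negated)
        else:
            heapq.heappushpop(bottom, negated)
    best = sorted(top, reverse=True)
    worst = sorted((-q, c, p) for q, c, p in bottom)
    return best, worst
```

For a real call set I would pass an open file handle, e.g. `qual_extremes(open(path), 1000)`, or `gzip.open(path, "rt")` for a compressed VCF. This only ranks by the site-level QUAL column and ignores every INFO annotation and all per-sample fields, which is exactly why I suspect it is too crude.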
Also, since the VCF contains multiple samples, would it be better to filter by site or by genotype?
Thanks and I appreciate your feedback!