First post on the site, although I've been benefiting from the questions on here for a while. I've been thinking about some SNP filtering and was hoping for some feedback on my ideas. I've been using MuTect to identify somatic SNPs in tumor samples using a matched control and a panel of normal mutations. I'm still getting quite a few SNPs called with a very low MAF (<0.1) and I'm keen to remove false positives where possible.
One idea I had was to run MuTect with the control and tumor BAMs swapped around and investigate the SNPs which are not rejected by the Bayesian classifier or subsequent filtering (Control SNPs). Assuming that these are all false positives - the frequency of back-mutations should be vanishingly small - I'd expect the attributes of these SNPs (read depth, alternate allele count, power to call strand imbalance, etc.) to be similar to those of false positive SNPs in the tumor sample. I could therefore use these attributes to classify the tumor mutations as low or high confidence depending on how much they resemble the Control SNPs. In essence this would be similar, but opposite, to the GATK Variant Quality Score Recalibration (VQSR) tool (http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-vqsr), which scores variants based on how much they resemble SNPs that are also found in a database of high-quality SNPs.
Does this seem a reasonable approach? If so, any advice on how best to do this? I assume some sort of clustering-based approach would be best?
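To make the idea concrete, here is a rough sketch of the simplest version I can think of: summarise the Control SNPs by the mean and standard deviation of each attribute, then flag any tumor call that sits close to that profile as low confidence. Everything here is an assumption for illustration only: the function names, the three attributes, the toy values and the cutoff of 3.0 are all made up, and a plain standardised distance is much cruder than the Gaussian mixture model that VQSR fits.

```python
import math


def fit_profile(control_calls):
    """Per-attribute mean and standard deviation of the Control SNPs."""
    n = len(control_calls)
    dims = len(control_calls[0])
    means = [sum(c[i] for c in control_calls) / n for i in range(dims)]
    sds = []
    for i in range(dims):
        var = sum((c[i] - means[i]) ** 2 for c in control_calls) / n
        # Fall back to 1.0 so a constant attribute does not divide by zero.
        sds.append(math.sqrt(var) or 1.0)
    return means, sds


def distance(call, means, sds):
    """Standardised Euclidean distance of one call from the Control profile."""
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(call, means, sds)))


def classify(tumor_calls, control_calls, cutoff=3.0):
    """Label calls that resemble the Control SNPs 'low', all others 'high'.

    The cutoff is a hypothetical value and would need tuning on real data.
    """
    means, sds = fit_profile(control_calls)
    return ["low" if distance(c, means, sds) < cutoff else "high"
            for c in tumor_calls]


# Toy attributes: (read depth, alternate allele count, alternate allele fraction)
control = [(40, 2, 0.05), (55, 3, 0.055), (35, 2, 0.057), (60, 3, 0.05)]
tumor = [(50, 3, 0.055), (120, 40, 0.33)]

labels = classify(tumor, control)
print(labels)
```

The first toy tumor call looks like the Control SNPs and comes out as low confidence, while the second is far from them on every attribute. With real data the attributes would come from the MuTect call stats output, and a clustering or mixture model approach would presumably replace the single-profile distance.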
EDIT: In response to Malachi's questions, some more details: I have 8 tumor and matched control exomes from non-smoking lung cancer patients. Our initial aim is to identify somatic SNPs and INDELs, although I would like to extend this to identification of susceptibility factors and mutation hotspots later if at all possible.
For the MuTect analysis, I am employing the HC + PON mode by first running MuTect on my 8 normal samples in single sample mode and then generating a VCF containing all variants found in at least 2 normal samples. All parameters are at their defaults except min_qscore 20 and gap_events_threshold 2.
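For clarity, the "found in at least 2 normal samples" step amounts to something like the sketch below. This is an illustration of the logic rather than the actual commands I ran: the function name and the toy sites are made up, and in practice the per-sample call sets would be parsed from the single-sample VCFs.

```python
from collections import Counter


def panel_of_normals(per_sample_calls, min_samples=2):
    """Return sites called in at least `min_samples` of the normal samples."""
    counts = Counter()
    for calls in per_sample_calls:
        # set() ensures a site is counted at most once per sample.
        counts.update(set(calls))
    return sorted(site for site, n in counts.items() if n >= min_samples)


# Toy call sets, one per normal sample: (chrom, pos, ref, alt)
normals = [
    {("chr1", 100, "A", "T"), ("chr2", 500, "G", "C")},
    {("chr1", 100, "A", "T")},
    {("chr3", 42, "C", "G"), ("chr2", 500, "G", "C")},
    {("chr4", 7, "T", "A")},
]

pon = panel_of_normals(normals)
print(pon)
```

Sites seen in only one normal sample are dropped, so only recurrent calls end up in the panel.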