I use GATK to call variants on exome sequencing data from human tumor samples, and have been using it for a few months now. In the VQSR step, I use Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf to filter out common SNPs/indels. I additionally use GATK's SelectVariants walker to keep only variant sites. At the end of the GATK run, I still have about 2,000 SNPs per sample.
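To make it concrete what that type-selection step is doing, here is a minimal Python sketch of splitting a call set into SNPs vs. indels the way SelectVariants does with its type filter. The classification rule is deliberately simplified (single-base REF and ALTs means SNP, anything longer means indel); GATK's own logic additionally handles MNPs, symbolic alleles, and mixed sites.

```python
def classify(ref, alts):
    """Classify a VCF record as 'SNP' or 'INDEL' from its REF/ALT alleles.

    Simplified rule: a site is a SNP only if REF and every ALT allele
    are single bases; otherwise treat it as an indel.
    """
    if len(ref) == 1 and all(len(a) == 1 for a in alts):
        return "SNP"
    return "INDEL"


def select_type(vcf_lines, wanted):
    """Keep header lines plus records of the wanted type ('SNP' or 'INDEL')."""
    out = []
    for line in vcf_lines:
        if line.startswith("#"):
            out.append(line)
            continue
        fields = line.split("\t")
        ref, alts = fields[3], fields[4].split(",")
        if classify(ref, alts) == wanted:
            out.append(line)
    return out


# Toy VCF: one SNP and one 1-bp deletion.
vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t60\tPASS\t.",
]
snps = select_type(vcf, "SNP")  # header + the chr1:100 A>G record
```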
This number of mutations is not workable for biologists doing wet-lab experiments, so I am always asked to narrow down the list of variants. I have used the PolyPhen-2 score as a guide for filtration. The choice of cutoff is arbitrary: I use a minimum of 0.6, but it is hard to justify why I did not choose a different value. I want to make this filtration step more objective without losing correct, meaningful variants.
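One way to make the cutoff less arbitrary is to look at how the surviving variant count behaves across a range of thresholds instead of committing to a single number up front. A minimal sketch, assuming the PolyPhen-2 scores have already been annotated into the INFO column under a hypothetical key `PP2` (real annotation pipelines will use a different key name):

```python
import re


def pp2_score(info):
    """Extract the PolyPhen-2 score from an INFO string, or None if absent.

    Assumes a hypothetical INFO key 'PP2=<score>'; adjust the pattern to
    match whatever key your annotation tool actually writes.
    """
    m = re.search(r"(?:^|;)PP2=([0-9.]+)", info)
    return float(m.group(1)) if m else None


def survivors_by_cutoff(records, cutoffs):
    """For each cutoff, count records whose score is >= that cutoff."""
    scores = [pp2_score(r.split("\t")[7]) for r in records]
    scores = [s for s in scores if s is not None]
    return {c: sum(s >= c for s in scores) for c in cutoffs}


# Toy annotated records spanning the score range.
records = [
    "chr1\t100\t.\tA\tG\t50\tPASS\tPP2=0.95",
    "chr1\t200\t.\tC\tT\t50\tPASS\tPP2=0.61",
    "chr1\t300\t.\tG\tA\t50\tPASS\tPP2=0.10",
]
counts = survivors_by_cutoff(records, [0.4, 0.6, 0.9])
# counts maps each cutoff to how many variants would survive it
```

Plotting such counts over a dense grid of cutoffs at least lets you point at where the curve flattens, rather than defending 0.6 in isolation.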
I've heard that people use the dbSNP 130 VCF and the NHLBI exome sequencing data (http://evs.gs.washington.edu/EVS/#tabs-7) to filter their VCF results. It looks to me like people are trying to filter out as many previously identified variants as possible, just to get the variants unique to their samples. I am a little concerned about this practice: unique variants may not tell the whole picture of what's going on in tumor samples. So I would like to discuss with you what the best practice is for filtering VCFs for meaningful research.
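To illustrate the concern above, here is a sketch of that "novel-only" filtering: drop any call whose (CHROM, POS, REF, ALT) also appears in a reference panel such as dbSNP 130 or the EVS data. The caveat is visible in the code itself: a recurrent tumor mutation that happens to be in the panel gets discarded right alongside the benign polymorphisms, which is exactly why unique-variant filtering can hide part of the picture.

```python
def variant_key(record):
    """Identify a variant by its (CHROM, POS, REF, ALT) tuple."""
    f = record.split("\t")
    return (f[0], f[1], f[3], f[4])


def novel_only(sample_records, panel_records):
    """Keep only sample variants absent from the reference panel.

    Note: this removes *all* previously observed variants, including any
    known, biologically meaningful ones that happen to be in the panel.
    """
    known = {variant_key(r) for r in panel_records}
    return [r for r in sample_records if variant_key(r) not in known]


# Toy panel (e.g. a dbSNP record) and toy sample calls.
panel = ["chr1\t100\trs123\tA\tG\t.\t.\t."]
sample = [
    "chr1\t100\t.\tA\tG\t90\tPASS\t.",  # in the panel: removed
    "chr1\t500\t.\tT\tC\t90\tPASS\t.",  # novel: kept
]
kept = novel_only(sample, panel)
```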