5.7 years ago by
Rochester, NY USA
As with many questions you'll see posted here, the answer depends on your hypotheses and experimental design. There are many previous threads that address aspects of your question, and you might want to take a look at them:
Filtering Ngs Genomic Alignments
Variant Filtration By Exclusion Of Common Or Well-Known Variants
Filtering Vcf Variants Based On Sequencing Coverage
And last, I'm assuming you are working with human data, and possibly whole exome data (I can't tell from your question). If so, this thread has a lot of information that may be helpful, plus links to other sites where there is more:
What Is The Best Pipeline For Human Whole Exome Sequencing?
Regarding your two specific points for filters:
Number of reads: at least 10 reads (both REF and ALT alleles) with at least 5 reads of mutant allele.
Polyphen2 score of at least 0.4
Assuming, again, that you are working with human data, that this is a whole exome sequencing experiment, and that your experimental design is to identify the variant(s) causing the phenotype you are studying (a lot of assumptions), then those are reasonable filters. One caveat: if your causative variant happens to have poor read depth, the read-count filter will remove it.
Also remember that SIFT, PolyPhen, et al. only provide predictions and are not based on actual in vivo biology, so I don't filter on those annotations straight off. We've all seen mutations that are predicted to be "benign" but are in fact pathogenic, because they happen to cause an amino acid substitution at a key turn in the protein's 3D structure.
These points hold assuming you are looking for causative variants in a single gene. If that's not your experimental design, can you please clarify?
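To make the two filters concrete, here is a minimal Python sketch, not a real pipeline. The field names (`ref_reads`, `alt_reads`, `polyphen2`), the helper functions, and the example records are all made up for illustration; in practice you would pull these values from your own VCF and annotation output. It applies the read-count filter as a hard cutoff but only labels variants by PolyPhen2 score rather than dropping them, for the reason given above.

```python
from typing import Optional

# Thresholds taken from the two filters quoted in the question.
MIN_TOTAL_READS = 10   # REF + ALT reads at the site
MIN_ALT_READS = 5      # reads supporting the mutant (ALT) allele
MIN_POLYPHEN2 = 0.4    # PolyPhen2 score threshold


def passes_depth_filter(ref_reads: int, alt_reads: int) -> bool:
    """True if the site has enough total coverage and enough ALT support."""
    return (ref_reads + alt_reads) >= MIN_TOTAL_READS and alt_reads >= MIN_ALT_READS


def polyphen2_flag(score: Optional[float]) -> str:
    """Label a variant by its PolyPhen2 score instead of discarding it."""
    if score is None:
        return "no_score"
    return "above_threshold" if score >= MIN_POLYPHEN2 else "below_threshold"


def triage(variants):
    """Keep variants passing the depth filter; annotate (not drop) on PolyPhen2."""
    kept = []
    for v in variants:
        if passes_depth_filter(v["ref_reads"], v["alt_reads"]):
            kept.append({**v, "polyphen2_flag": polyphen2_flag(v.get("polyphen2"))})
    return kept


# Hypothetical records, only to show how the cutoffs behave.
example = [
    {"id": "var1", "ref_reads": 12, "alt_reads": 8, "polyphen2": 0.95},
    {"id": "var2", "ref_reads": 9, "alt_reads": 3, "polyphen2": 0.10},  # too few ALT reads
    {"id": "var3", "ref_reads": 6, "alt_reads": 6, "polyphen2": 0.05},  # passes depth, low score
    {"id": "var4", "ref_reads": 2, "alt_reads": 5, "polyphen2": None},  # only 7 reads in total
]

for v in triage(example):
    print(v["id"], v["polyphen2_flag"])
```

In this toy data, `var3` survives the depth filter even though its score is below 0.4, so it stays available for manual review; `var4` shows how a variant with enough ALT reads can still be lost to low total coverage.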