Question

How Do You Usually Filter Variant Calling Results?

2

10.3 years ago

newDNASeqer ▴ 760

I'm new to variant calling and would like to get an idea of how you usually filter the final variant calling results.

I am using GATK and ANNOVAR for variant calling and annotation. The ANNOVAR output includes PolyPhen-2 and SIFT scores, among others. I currently apply the following criteria to filter the variant calls:

Number of reads: at least 10 reads (both REF and ALT alleles) with at least 5 reads of mutant allele.
Polyphen2 score of at least 0.4

I am not sure how good this filter is. I want to minimize false positives without losing too many true positives. Could you shed some light on how to analyze variant calling results? Your replies are appreciated.

filter • 6.8k views

ADD COMMENT • link updated 10.3 years ago by Alex Paciorkowski 3.5k • written 10.3 years ago by newDNASeqer ▴ 760

0

From what organism are your data?

ADD REPLY • link 10.3 years ago by Sean Davis 26k

0

Guessing human if @newDNASeqer is using Polyphen2 and SIFT. Though it does help here to be explicit.

ADD REPLY • link 10.3 years ago by Chris Fields ★ 2.2k

0

For which organism, and what is the study system? That is, are you looking for germline variants, somatic variants, or disease-specific mutations?

ADD REPLY • link 10.3 years ago by pirates.of.the.genome ▴ 100

score 3 · Answer 1 · 2014-01-12

As with many questions you'll see posted here, the answer all depends upon your hypotheses and experimental design. There are many previous threads that address aspects of your question, and you might want to take a look at them:

Filtering Ngs Genomic Alignments

Variant Filtration By Exclusion Of Common Or Well-Known Variants

Filtering Vcf Variants Based On Sequencing Coverage

Finally, I'm only assuming you are working with human data, and possibly whole-exome data (I can't tell from your question), but this thread has a lot of information that may be helpful, plus many links to other sites with more: What Is The Best Pipeline For Human Whole Exome Sequencing?

Regarding your two specific points for filters:

Number of reads: at least 10 reads (both REF and ALT alleles) with at least 5 reads of mutant allele.
Polyphen2 score of at least 0.4

Assuming, again, that you are working with human data, that this is a whole-exome sequencing experiment, and that your experimental design is to identify the variant(s) causing the phenotype you are studying (a lot of assumptions), then those are reasonable filters. The exception is that if your causative variant has poor read depth, you will filter it out. And remember that SIFT, PolyPhen, et al. only provide suggestions and guesses and are not based on actual in vivo biology, so I don't filter on those annotations straight off. We've all seen mutations that are predicted to be "benign" yet are pathogenic because they happen to cause an amino acid substitution at a key turn in the protein's 3D structure. These points hold if you are looking for causative variants in a single gene. If that's not your experimental design, can you please clarify?
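To make the two criteria concrete, here is a minimal Python sketch of one way to apply them. It assumes a single-sample, biallelic VCF record whose FORMAT column carries GATK's `AD` (allelic depths) field, and it assumes you look up the PolyPhen-2 score from your annotation table yourself and pass it in. The function names (`allele_depths`, `triage`) and the returned labels are made up for illustration. The depth thresholds are applied as a hard filter, while a low PolyPhen-2 score only changes the label rather than discarding the variant, in line with the caveat about prediction scores above.

```python
from typing import Optional, Tuple


def allele_depths(format_keys: str, sample: str) -> Optional[Tuple[int, int]]:
    """Return (ref_reads, alt_reads) from the AD field, or None if unavailable.

    Assumes a biallelic site, so only the first ALT allele's count is used.
    """
    keys = format_keys.split(":")
    values = sample.split(":")
    if "AD" not in keys:
        return None
    idx = keys.index("AD")
    if idx >= len(values):
        return None
    counts = values[idx].split(",")
    if len(counts) < 2 or "." in counts:
        return None
    return int(counts[0]), int(counts[1])


def triage(
    vcf_line: str,
    polyphen_score: Optional[float] = None,  # assumed to come from your annotation table
    min_total: int = 10,
    min_alt: int = 5,
    polyphen_cutoff: float = 0.4,
) -> str:
    """Label one single-sample VCF record (hypothetical labels)."""
    fields = vcf_line.rstrip("\n").split("\t")
    # Columns 9 and 10 (0-based 8 and 9) are FORMAT and the first sample.
    depths = allele_depths(fields[8], fields[9])
    if depths is None:
        return "no_depth_info"
    ref_reads, alt_reads = depths
    # Hard filter: total REF + ALT reads and ALT-supporting reads.
    if ref_reads + alt_reads < min_total or alt_reads < min_alt:
        return "low_depth"
    # Soft flag only: a low prediction score does not remove the variant.
    if polyphen_score is not None and polyphen_score < polyphen_cutoff:
        return "keep_predicted_benign"
    return "keep"


if __name__ == "__main__":
    record = "\t".join(
        ["chr1", "100", ".", "A", "G", "50", "PASS", ".", "GT:AD:DP", "0/1:7,6:13"]
    )
    print(triage(record, polyphen_score=0.9))
    print(triage(record, polyphen_score=0.1))
```

Keeping "low_depth" as a label instead of dropping the record outright makes it easy to revisit poorly covered candidates later, which matters if the causative variant happens to sit in a low-coverage region.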

score 0 · Answer 2 · 2014-01-11

0

10.3 years ago

Sean Davis 26k

Assuming you are looking at human data, you might want to look at Variant Quality Score Recalibration.

ADD COMMENT • link 10.3 years ago by Sean Davis 26k