Question

Practice of Filtering VCF Files (from GATK)

2

10.2 years ago

newDNASeqer ▴ 760

I use GATK to call variants in exome sequencing data from human tumor samples, and have been doing so for a few months now. In the VQSR step, I use Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf to filter out common SNPs/indels. I additionally use GATK's SelectVariants walker to select only variants. At the end of the GATK run, I still have about 2000 SNPs for the samples.

That many mutations is not workable for biologists doing wet-lab experiments, so I am always asked to narrow down the list of variants. I have used the PolyPhen-2 score as a guide for filtering. The choice of cutoff is arbitrary: I use a minimum of 0.6, but it is hard to justify why I did not choose a different score. I want to do this filtering more objectively, without losing correct and meaningful variants.

I've heard that people use the dbSNP 130 VCF and the NHLBI exome sequencing data (http://evs.gs.washington.edu/EVS/#tabs-7) to filter VCF results. It looks to me as if people are trying to filter out as many previously identified variants as possible, just to get the variants uniquely identified in their samples. I am a little concerned about this practice: unique variants may not tell the whole picture of what's going on in the tumor samples. So I would like to discuss what the best practice is for filtering a VCF for meaningful research.

filter vcf gatk • 6.5k views

updated 10.2 years ago by donfreed ★ 1.6k • written 10.2 years ago by newDNASeqer ▴ 760

0

Do you also have paired normal samples from the same subject? If so, you could filter out germline variants and keep only somatic variants for your tumor samples.
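A rough sketch of that subtraction, assuming both call sets are available as plain-text VCF lines and that a variant is identified by chromosome, position, REF, and ALT. The function names and the toy records are made up for illustration, and this ignores coverage and allele fraction in the normal, which a dedicated somatic caller would take into account.

```python
def variant_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) keys from VCF data lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # multi-allelic sites list ALTs comma-separated
            keys.add((chrom, int(pos), ref, alt))
    return keys


def somatic_candidates(tumor_lines, normal_lines):
    """Return tumor data lines with at least one ALT allele absent from the normal."""
    germline = variant_keys(normal_lines)
    kept = []
    for line in tumor_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        if any((chrom, int(pos), ref, alt) not in germline for alt in alts.split(",")):
            kept.append(line.rstrip("\n"))
    return kept


# Toy records, not real calls.
tumor = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr2\t200\t.\tC\tT\t60\tPASS\t.",
]
normal = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr2\t200\t.\tC\tT\t55\tPASS\t.",
]
for record in somatic_candidates(tumor, normal):
    print(record)
```

Here the chr2 call is shared with the normal and is dropped, so only the chr1 call remains as a somatic candidate.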

10.2 years ago by Robert Sicko ▴ 630

score 2 · Answer 1 · 2014-02-26

2

10.2 years ago

Robert Sicko ▴ 630

You could look for corresponding variants in COSMIC.

Also, there are other variant effect prediction algorithms besides PolyPhen-2. You could use ANNOVAR to filter on these additional scores (scroll down to #5).

Finally, you could try Exomiser, which uses cross-species phenotype comparison, although I'm not sure how looking at somatic vs. germline variants will affect this algorithm, as it is focused on inherited disease.

1

+1 for ANNOVAR or wANNOVAR (the web-based version of ANNOVAR)

To provide a slightly more detailed answer, PolyPhen and SIFT scores each come with a category status (so you can focus on calls in the "damaging" category for one or both of them).
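As a small illustration of filtering on the category rather than on a raw score cutoff, assuming each annotated variant is a dict and that the prediction columns use "D" for the damaging category. The keys `polyphen_pred` and `sift_pred` are hypothetical; substitute whatever your annotation output actually calls them.

```python
def is_damaging(row, require_both=False):
    """True if PolyPhen and/or SIFT put the variant in the damaging category.

    The keys 'polyphen_pred' and 'sift_pred' and the label "D" are
    assumptions about the annotation table, not a fixed format.
    """
    calls = [row.get("polyphen_pred") == "D", row.get("sift_pred") == "D"]
    return all(calls) if require_both else any(calls)


# Toy annotated variants, not real predictions.
annotated = [
    {"id": "v1", "polyphen_pred": "D", "sift_pred": "D"},
    {"id": "v2", "polyphen_pred": "D", "sift_pred": "T"},
    {"id": "v3", "polyphen_pred": "B", "sift_pred": "T"},
]
either = [v["id"] for v in annotated if is_damaging(v)]
both = [v["id"] for v in annotated if is_damaging(v, require_both=True)]
print(either, both)
```

Requiring both predictors to agree is stricter and shortens the list further, at the cost of dropping variants that only one tool flags.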

Using 1000 Genomes and ESP frequencies (of, say, < 1%) is relatively common practice, with the rationale being that common variants should have been detected already. Accordingly, I also use ANNOVAR to provide GWAS catalog associations in addition to the standard report. That way, you can say that a common variant is OK if it has been associated with a disease. I think the COSMIC suggestion would be similar, but I seem to remember not liking the ANNOVAR hg19 track for COSMIC for some reason.
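A minimal sketch of that rule, assuming each annotated variant is a dict carrying population allele frequencies as fractions and a GWAS catalog field that is empty when there is no association. The key names (`af_1000g`, `af_esp`, `gwas_catalog`) are hypothetical, and the 1% threshold is just the example value from above.

```python
def keep_variant(row, max_af=0.01):
    """Keep rare variants, plus common ones with a recorded disease association."""
    freqs = [row.get("af_1000g"), row.get("af_esp")]
    # A missing frequency (None) means the variant was not seen in that
    # database, so it does not count as evidence of being common.
    common = any(f is not None and f >= max_af for f in freqs)
    return (not common) or bool(row.get("gwas_catalog"))


# Toy rows, not real frequencies or associations.
rows = [
    {"id": "rare", "af_1000g": 0.002, "af_esp": None, "gwas_catalog": ""},
    {"id": "common", "af_1000g": 0.20, "af_esp": 0.18, "gwas_catalog": ""},
    {"id": "common_gwas", "af_1000g": 0.20, "af_esp": 0.18, "gwas_catalog": "example trait"},
]
kept = [r["id"] for r in rows if keep_variant(r)]
print(kept)
```

The same exception could be extended to a COSMIC field if you have one you trust.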

10.2 years ago by Charles Warden 8.2k

score 2 · Answer 2 · 2014-02-26

Expanding on r.j.'s answer, VAAST does a great job of prioritizing variants and predicting their effect.

Also, removal of false positive variants is another good reason to filter with dbSNP. Even if you do not explicitly filter out variants from dbSNP, it is probably a good idea to annotate the variants in your callset that are present in dbSNP, as this adds a lot of information.
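To illustrate annotating instead of filtering, here is a sketch that adds a flag to each variant rather than dropping it. It assumes the variants are dicts and that the dbSNP sites have already been loaded into a set of (chrom, pos, ref, alt) tuples; the function name and the `in_dbsnp` field are made up for the example.

```python
def annotate_dbsnp(variants, dbsnp_keys):
    """Return copies of the variant dicts with an added 'in_dbsnp' flag."""
    annotated = []
    for v in variants:
        key = (v["chrom"], v["pos"], v["ref"], v["alt"])
        annotated.append({**v, "in_dbsnp": key in dbsnp_keys})
    return annotated


# Toy call set and toy "dbSNP" set, not real sites.
calls = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G"},
    {"chrom": "chr2", "pos": 200, "ref": "C", "alt": "T"},
]
known = {("chr2", 200, "C", "T")}
flagged = annotate_dbsnp(calls, known)
for v in flagged:
    print(v["chrom"], v["pos"], v["in_dbsnp"])
```

Nothing is removed, so the decision to exclude known sites can be made (or revisited) later.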