Why is it absolutely necessary to filter a VCF?
7.3 years ago · beausoleilmo ▴ 580

I'm just wondering what it would mean if I did my analysis on an unfiltered VCF file. Is this a problem? Why do we need to filter the data?

VCF bcftools vcftools samtools filtering
Answer · 7.3 years ago · iraun 6.2k

Unfiltered data means lower specificity: the probability of having false positives (mutations that are not real, i.e. errors) is higher than in filtered data. In an ideal world, a variant-calling analysis would identify exactly the set of true variants in the sample; in the real world, some true variants will be missed (reduced sensitivity) and some reported variants will be errors (sequencing errors, artifacts, misalignments...) that the variant-calling algorithm identifies as "mutations". By applying filters, we try to decrease the number of false-positive calls. Is that what you are asking?
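
For illustration, a common first pass is a hard filter on site quality and depth. Here is a minimal sketch assuming a plain-text (optionally gzipped) VCF; the thresholds and file names are hypothetical examples, not recommendations:

```python
import gzip

def hard_filter_vcf(path, min_qual=30.0, min_depth=10):
    """Yield VCF lines passing simple hard filters.

    The thresholds are illustrative only; sensible values depend on
    your data, coverage, and variant caller (see the calibration
    discussion below).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):      # keep all header lines
                yield line
                continue
            fields = line.rstrip("\n").split("\t")
            qual = fields[5]              # QUAL column
            if qual == "." or float(qual) < min_qual:
                continue                  # low-confidence call
            # INFO column: semicolon-separated KEY=VALUE pairs (and flags)
            info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
            if int(info.get("DP", 0)) < min_depth:
                continue                  # insufficient read depth
            yield line

# Hypothetical usage:
# with open("filtered.vcf", "w") as out:
#     out.writelines(hard_filter_vcf("calls.vcf.gz"))
```

In practice you would express the same thing with a dedicated tool (the thread's tags mention bcftools and vcftools); the sketch just makes the logic explicit.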

Reply:

Yep! Do you know anything about the effect, or the typical level, of false-positive calls?

Reply:

You'd have to calibrate that yourself for the data and software you're using.

Reply:

What do you mean by "calibrate"? Is there a way to calculate this? I suppose we would have to simulate data to do it. Sounds like an interesting project!

Reply:

Simulating data is useful for testing and developing software. Calibrating something like a variant caller is a bit harder: you need real data with real answers. So, for example, if you have a trio (a child and both parents), you can confidently determine true and false positives based on inheritance, and use that to calibrate the VCF filtering (using information like, say, "variants with 0.7 AF are correct 99% of the time and those with 0.1 AF are incorrect 99% of the time with our methodology").

Without a trio, you can still calibrate based on known annotated SNPs with known frequencies (using dbSNP and HapMap), for example.

There's no point in filtering a VCF until you know the effects of the filters you plan to apply, so that's what calibration is for.
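
A rough sketch of that trio-based idea, assuming genotypes and allele fractions have already been parsed out of the VCF (all names here are hypothetical, and Mendelian violations are only a proxy for false positives, since true de novo variants also violate inheritance):

```python
from collections import defaultdict

def mendelian_consistent(child, mother, father):
    """True if the child's two alleles could each come from one parent.

    Genotypes are tuples of allele indices, e.g. (0, 1) for a het call.
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)

def calibrate_by_af(trio_calls, bins=10):
    """Estimate the Mendelian-consistency rate per allele-fraction bin.

    trio_calls: iterable of (allele_fraction, child_gt, mother_gt, father_gt).
    Returns {bin_lower_edge: consistency_rate}, i.e. exactly the kind of
    "0.7 AF is usually right, 0.1 AF is usually wrong" table described above.
    """
    counts = defaultdict(lambda: [0, 0])   # bin -> [consistent, total]
    for af, child, mother, father in trio_calls:
        b = min(int(af * bins), bins - 1) / bins
        counts[b][1] += 1
        if mendelian_consistent(child, mother, father):
            counts[b][0] += 1
    return {b: ok / total for b, (ok, total) in sorted(counts.items())}

# Toy illustration with pre-parsed calls:
calls = [
    (0.71, (0, 1), (0, 1), (0, 0)),   # consistent het, AF ~0.7
    (0.12, (0, 1), (0, 0), (0, 0)),   # Mendelian violation, AF ~0.1
]
print(calibrate_by_af(calls))          # {0.1: 0.0, 0.7: 1.0}
```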

Reply:

Moved this to an answer ;-)

Answer · 7.3 years ago · mforde84 ★ 1.4k

1) Most will be relatively common variants, e.g. with a corresponding ExAC MAF of 0.4, or even above 0.5 at a multiallelic site. In one recent data set I was working with, 80/100 samples had the same variant, and the corresponding ExAC adjusted MAF was approximately 0.80.

2) Most will be synonymous, intergenic, intronic, etc. Typically (though not always), people are looking for variants that result in changes to protein composition or structure.

3) Questionable clinical significance, which is something of a mixture of reasons 1 and 2: say, common variants that have been observed extensively and are not associated with a disease phenotype. (A sketch of filters along these lines follows below.)
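
A sketch of such annotation-based filtering, assuming the VCF has already been annotated. The INFO keys (ExAC_AF, Consequence) and the 1% MAF cutoff are made-up examples; real key names depend on your annotator (e.g. VEP, ANNOVAR):

```python
# Consequence terms considered protein-altering (illustrative subset).
PROTEIN_ALTERING = {"missense_variant", "stop_gained", "frameshift_variant",
                    "splice_acceptor_variant", "splice_donor_variant"}

def parse_info(info_field):
    """Split a VCF INFO string into a key -> value dict (flags map to True)."""
    out = {}
    for kv in info_field.split(";"):
        key, _, value = kv.partition("=")
        out[key] = value if value else True
    return out

def keep_variant(info_field, max_maf=0.01):
    """Apply the 'biological' filters from the answer above.

    Drops common variants (population MAF above max_maf, point 1) and
    anything not predicted to alter the protein (point 2). The cutoff
    is an example, not a recommendation.
    """
    info = parse_info(info_field)
    maf = float(info.get("ExAC_AF", 0.0))      # assumed annotation key
    consequence = info.get("Consequence", "")  # assumed annotation key
    if maf > max_maf:
        return False                           # too common
    return consequence in PROTEIN_ALTERING     # synonymous/intronic etc. dropped

print(keep_variant("ExAC_AF=0.40;Consequence=missense_variant"))  # False: common
print(keep_variant("ExAC_AF=0.001;Consequence=intron_variant"))   # False: non-coding
print(keep_variant("ExAC_AF=0.001;Consequence=stop_gained"))      # True
```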

Reply:

These are mostly "biological" filters, i.e. how biologically relevant the variants in my data are. But I think the OP is mainly interested in "methodological/technical" filtering. That said, biological filters can also reduce the false-positive rate (e.g. requiring that variants be shared by all affected individuals).

Reply:

Ah, fair enough. Methodologically, I think the only suitable way is what Brian says in the comments above: you need a gold standard. Thankfully, a lot of time and thought has gone into this problem already, so there are resources readily available for validating a pipeline against real sequencing data.
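
For example, the Genome in a Bottle truth sets are widely used for this. Once calls are normalized to comparable (chrom, pos, ref, alt) records, a naive comparison is just set arithmetic; a toy sketch (a real evaluation should use a dedicated comparison tool such as hap.py and restrict to the truth set's confident regions):

```python
def precision_recall(called, truth):
    """Compare a call set against a gold standard.

    called / truth: sets of (chrom, pos, ref, alt) tuples, presumed to be
    parsed from your VCF and the truth VCF over the same regions.
    """
    tp = len(called & truth)                    # true positives
    fp = len(called - truth)                    # false positives
    fn = len(truth - called)                    # false negatives
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return precision, recall

# Toy illustration:
truth = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")}
called = {("chr1", 1000, "A", "G"), ("chr1", 3000, "G", "A")}
print(precision_recall(called, truth))          # (0.5, 0.5)
```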
