Question

Removing Low Quality Lof Variants From 1000 Genomes Imputation

3


13.3 years ago

Ryan D ★ 3.4k

We are currently carrying out imputation using the 1000 Genomes data on samples genotyped on Affymetrix and Illumina platforms. However, Daniel MacArthur's recent Science paper points out that most LoF variants (about 59%) are going to be false calls. He outlines a pipeline for filtering out false variants on page 37 of the supplement.

His paper used 185 genomes from the 1000 Genomes Project. There are now 1,092 genomes available, so in theory imputation should be more accurate. My question is this: between the low-coverage release they used (2010_07) and the newest data available (2011_05_21), have any of the filters been implemented to remove SNVs that are likely artifacts? These would mostly be mapping/sequencing errors and functional annotation errors (see Figure S1). In other words, have any of the artifactual variants been removed so that imputation quality will be improved, or are the same problems present in the current release?

Jorge answered a related question on this some time back here.

genome imputation snp mutation error • 3.2k views


score 7 · Answer 1 · 2012-03-14

Hey Ryan,

The newest data haven't been explicitly filtered for the artefacts described in our paper. We're currently working on this analysis (it involves recoding a lot of manually intensive steps into an automated pipeline), and we'll release a separate LoF VCF that has been filtered and that also integrates multiple different sources of experimental genotyping data (e.g. Affymetrix Axiom typing, PCR-454, and a targeted validation effort currently underway at the Broad Institute). I'm hoping to have at least a rough draft of this released by early May, in time for the Biology of Genomes meeting.

However, I will note that (even without our filters) the existing data are much higher-quality than the pilot data we used for our Science paper. Our early results suggest >98% of the SNV sequencing errors in LoF variants in the pilot project have been correctly filtered out of the latest release, thanks to the VQSR filtering developed at Broad and other improvements. We've also made changes to the Gencode annotations to remove annotation errors spotted in our paper - so while the calls won't be perfect, they'll be much better than the ones from the pilot.

My suggestion would be to impute using the current release (or wait a few weeks for a newer one with more accurate indel filters), explore your data using those calls, and then update once we push out our filtered phase 1 LoF calls.
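For the "update later" step, one possible approach is to keep a list of the LoF variants you imputed and, once a filtered LoF call set is available, split that list into variants that survive the filter and variants that do not. The sketch below is only an illustration: it assumes both inputs are standard tab-delimited VCF text with matching chromosome naming, and the function names `read_vcf_sites` and `split_by_filtered_calls` are hypothetical helpers, not part of any released pipeline. The format of the eventual filtered LoF file is not known here.

```python
from typing import Iterable, Set, Tuple

# A variant is keyed by (chrom, pos, ref, alt); this key choice is an assumption.
Site = Tuple[str, int, str, str]


def read_vcf_sites(lines: Iterable[str]) -> Set[Site]:
    """Collect (chrom, pos, ref, alt) keys from VCF text lines."""
    sites: Set[Site] = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information, and the header row
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):  # one key per alternate allele
            sites.add((chrom, int(pos), ref, alt))
    return sites


def split_by_filtered_calls(
    imputed_lof: Set[Site], filtered_lof: Set[Site]
) -> Tuple[Set[Site], Set[Site]]:
    """Return (kept, dropped): imputed LoF sites present in or absent from
    the filtered call set."""
    kept = imputed_lof & filtered_lof
    dropped = imputed_lof - filtered_lof
    return kept, dropped


# Toy in-memory example; the records are made up for illustration.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
imputed_lines = [
    "##fileformat=VCFv4.1",
    header,
    "1\t1000\t.\tA\tT\t.\t.\t.",
    "1\t2000\t.\tG\tA,C\t.\t.\t.",
    "2\t500\t.\tC\tT\t.\t.\t.",
]
filtered_lines = [
    "##fileformat=VCFv4.1",
    header,
    "1\t1000\t.\tA\tT\t.\t.\t.",
    "1\t2000\t.\tG\tC\t.\t.\t.",
]

kept, dropped = split_by_filtered_calls(
    read_vcf_sites(imputed_lines), read_vcf_sites(filtered_lines)
)
```

Variants that end up in `dropped` are candidates for exclusion or closer inspection rather than confirmed artifacts, since a site can be missing from a filtered set for other reasons (for example, it was never assessed).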