Question: Removing Low-Quality LoF Variants From 1000 Genomes Imputation
Ryan D (USA) wrote, 7.1 years ago:

We are currently carrying out imputation using the 1000 Genomes data on samples genotyped by Affy and Illumina platforms. However, Daniel MacArthur's recent Science paper points out that most LoF variants (about 59%) are likely to be false calls. He describes a pipeline for filtering out false variants on page 37 of the supplement.

His paper used 185 genomes from the 1000 Genomes project. Now there are 1092 genomes available, so in theory imputation should be more accurate. My question is this: between the low coverage release they used (2010_07) and the newest data available (2011_05_21), have any of the filters been implemented to remove SNVs that are likely artifacts? These would mostly be mapping/sequencing errors and functional annotation errors (see figure S1). In other words, have any of the artifactual variants been removed so that imputation quality will be improved, or are the same problems present in the current release?

Jorge answered a related question on this sometime back here.

Dgmacarthur (Cambridge, UK) wrote, 7.1 years ago:

Hey Ryan,

The newest data haven't been explicitly filtered for the artefacts described in our paper. We're currently working on this analysis (it involves recoding a lot of manually intensive steps into an automated pipeline), and we'll release a separate LoF VCF that has been filtered and that also integrates multiple different sources of experimental genotyping data (e.g. Affymetrix Axiom typing, PCR-454, and a targeted validation effort currently underway at the Broad Institute). I'm hoping to have at least a rough draft of this released by early May, in time for the Biology of Genomes meeting.

However, I will note that (even without our filters) the existing data are much higher-quality than the pilot data we used for our Science paper. Our early results suggest >98% of the SNV sequencing errors in LoF variants in the pilot project have been correctly filtered out of the latest release, thanks to the VQSR filtering developed at Broad and other improvements. We've also made changes to the Gencode annotations to remove annotation errors spotted in our paper - so while the calls won't be perfect, they'll be much better than the ones from the pilot.

My suggestion would be to impute using the current release (or wait a few weeks for a newer one with more accurate indel filters), explore your data using those calls, and then update once we push out our filtered phase 1 LoF calls.
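
For anyone who wants to see what relying on the release's own filtering looks like in practice, here is a minimal sketch. It assumes the release VCF marks variants that passed filtering with PASS in the FILTER column (the standard VCF convention); whether a given release still contains non-PASS records is something to check in the file itself. The function name and the example records are made up for illustration and are not part of any 1000 Genomes tooling.

```python
import io


def keep_passing_records(lines):
    """Yield VCF header lines and only those records whose FILTER is PASS.

    Hypothetical helper for illustration. Records with FILTER set to "."
    (no filtering applied) are dropped too, which may or may not be what
    you want for your release.
    """
    for line in lines:
        if line.startswith("#"):
            # Meta-information and column header lines are kept unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # In VCF, the 7th tab-separated column (index 6) is FILTER.
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


# Made-up records, only to show the behaviour of the filter.
example_rows = [
    ["##fileformat=VCFv4.1"],
    ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"],
    ["1", "1000", ".", "A", "G", "50", "PASS", "."],
    ["1", "2000", ".", "C", "T", "10", "LowQual", "."],
    ["1", "3000", ".", "G", "A", "60", "PASS", "."],
]
example_vcf = "\n".join("\t".join(row) for row in example_rows) + "\n"

filtered = list(keep_passing_records(io.StringIO(example_vcf)))
print("".join(filtered), end="")
```

On a real release you would pass an open file handle (for example from `gzip.open(path, "rt")`) instead of the in-memory example, and write the kept lines to a new file before building the imputation reference panel.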


That's excellent information, Dan. Thanks a lot. That about 98% of errors have been cleared up is really encouraging and makes me feel better about the results of our imputation. I am also curious whether, for the common variant associations, you tried any collapsing methods for LoF variants, or whether you think that would be a worthy endeavor. Or is that the next paper? :)

Reply written 7.1 years ago by Ryan D