Question: Does it make sense to perform Genotype Imputation using variants called from WES?
3.1 years ago, Tao wrote:

Hi Biostars:

I have genotype data called from whole-exome sequencing (WES), containing about 0.12 million variants (mostly SNPs). In theory, it is feasible to impute SNPs across the whole genome using a genotype imputation model. But I am wondering how accurate it would be to use such a small set of variants (exonic variants) to impute all variants across the genome (> 10 million using the 1000G reference). I understand that R2 can be used to filter out low-quality imputed variants, but is it really OK to do imputation this way?

Thanks! Tao

written 3.1 years ago by Tao

3.1 years ago, Kevin Blighe (Republic of Ireland) wrote:

I think that you'd be criticised if you ever tried to publish a study by doing imputation that way. My primary question, if I were a reviewer, would be: 'Why didn't you just do whole genome sequencing or a very dense genotyping array?'

This doesn't mean that studies of a similar nature haven't been done before though:

Thus, as there have been publications in reputable journals that have already done [Edit:] something similar, I think that you can get away with it provided that you follow the methods and QC criteria that these other studies used. Most of the SNPs that you try to impute will fail the QC though (I imagine).

modified 3.1 years ago • written 3.1 years ago by Kevin Blighe

Hi Kevin, thanks for your reply and your time! It's a public dataset, and we just want to use it in our project, which needs genotypes across the whole genome. For the first and second references, it seems they only imputed exome variants, based on a reference panel from the NHLBI Exome Sequencing Project. For the third reference, it seems they wanted to show that imputation based on a whole-genome array (Omni2.5) and 1000G can recover the sites on an exome chip. So I don't see that they did anything like what I described. Please correct me if my understanding is wrong. Thanks! Tao

written 3.1 years ago by Tao

Hey Tao,

Yes, the idea is that these are just similar studies, i.e., not exactly the same but not completely different either, that you could use as a starting point.

I do not doubt that you could complete an imputation in the way that you desire, but it's just the credibility of the results that I doubt. Imputation is just statistical relationships, at the end of the day, and is known to produce incorrect genotypes even when done properly.

I hope that others can contribute to the discussion.


written 3.1 years ago by Kevin Blighe

Hi Kevin,

Thanks! You are right. The imputation can run without any errors, but how accurate would it be? That is exactly what concerns me. Thanks for your references.

Best, Tao

written 3.1 years ago by Tao

Hey Tao,

I would really doubt the accuracy, particularly as you go further into intergenic regions and away from genes. Far away from each gene, you just won't have concrete data with which to make any sort of accurate imputation. It would be akin to making random calls, i.e., by chance you'll be able to impute some genotypes far away from genes, but these could well be errors. However, as you implied, I think that many of the imputed SNPs would not even make it to the final dataset, as they may fall well below an r-squared of 0.3 or 0.4, or would fail by some other metric.
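
As a minimal sketch of the r-squared filter mentioned above: the snippet below assumes a VCF-style file in which each imputed variant carries its estimated imputation quality in the INFO column under a key such as `R2` (Minimac-style output; Beagle writes `DR2` instead). The function names, the default key, and the 0.3 cut-off are illustrative assumptions, not a recommendation taken from any particular tool.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'AF=0.12;R2=0.85;IMPUTED' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flags have no '=value' part
    return info


def passes_r2(vcf_line, key="R2", threshold=0.3):
    """Return True if the record's imputation quality is >= threshold.

    Records that lack the key (e.g. directly genotyped sites) are kept.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # INFO is the 8th VCF column
    if key not in info:
        return True
    return float(info[key]) >= threshold


def filter_vcf_lines(lines, key="R2", threshold=0.3):
    """Yield header lines unchanged, plus data lines that pass the filter."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#") or passes_r2(line, key, threshold):
            yield line
```

In practice you would pass an open file handle as `lines` and write the yielded lines to a new file. For large files, a dedicated tool is likely faster than pure Python.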

What is the aim of your experiment, generally? If you are just interested in imputing genotypes in enhancer and promoter regions, the TSS, or the 5'/3' UTR, then you could just impute within a certain distance of each gene. Why not just impute up to 25,000 bp from each gene start and terminal exon? That is probably still too great a distance, but it's worth a try. Nor will it encompass all enhancer regions, as these can be >100,000 bp from gene bodies and still regulate transcription.
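
To make the "impute only near genes" idea concrete, here is a rough sketch that pads each gene by 25,000 bp on both sides, merges overlapping windows, and tests whether a position falls inside one. The gene coordinates in the usage note are made up, and the function names and the 1-based inclusive coordinate convention are assumptions for illustration only.

```python
def flanked_windows(genes, flank=25_000):
    """Pad each (chrom, start, end) gene by `flank` bp and merge overlaps.

    Coordinates are treated as 1-based and inclusive.
    """
    padded = sorted(
        (chrom, max(1, start - flank), end + flank)
        for chrom, start, end in genes
    )
    merged = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous window on the same chromosome: extend it.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def in_windows(chrom, pos, windows):
    """Return True if (chrom, pos) lies inside any window."""
    return any(c == chrom and s <= pos <= e for c, s, e in windows)
```

For example, with hypothetical genes `[("chr1", 100_000, 110_000), ("chr1", 150_000, 160_000)]`, the two padded windows overlap and are merged into a single region, `("chr1", 75_000, 185_000)`. The resulting windows could then be used to restrict which imputed positions you keep.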

Another interesting study on this topic is here. In it, the authors specifically state that imputation accuracy suffers as the distance between a genotyped SNP and an imputed SNP increases.

I would also encourage you to seek the opinions of others in your department, just to corroborate what I am saying.

Kind regards


Edit: to give you an idea, high density genotyping microarrays will genotype genome-wide with a mean distance between genotyped positions of ~3,500 bp. Even imputing with that level of density, errors in the imputation occur.
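
A quick back-of-the-envelope comparison of the two marker densities, as a sketch only: the genome length of roughly 3.1 billion bp is an approximation introduced here, and the function names are illustrative. The 3,500 bp spacing and the 0.12 million WES variants are the figures from this thread.

```python
GENOME_BP = 3.1e9         # approximate human genome length (assumption)
ARRAY_SPACING_BP = 3_500  # mean spacing quoted for dense arrays
WES_VARIANTS = 120_000    # variants in the WES call set from the question


def markers_for_spacing(genome_bp, spacing_bp):
    """Number of markers implied by a given mean spacing."""
    return genome_bp / spacing_bp


def mean_spacing(genome_bp, n_markers):
    """Mean spacing if the markers were spread evenly over the genome."""
    return genome_bp / n_markers


print(round(markers_for_spacing(GENOME_BP, ARRAY_SPACING_BP)))  # roughly 886,000 markers
print(round(mean_spacing(GENOME_BP, WES_VARIANTS)))             # roughly 25,800 bp
```

Even the evenly spread figure of about 25,800 bp flatters the WES data, because exome variants cluster in and around exons, leaving far larger gaps across intergenic regions.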

modified 3.1 years ago • written 3.1 years ago by Kevin Blighe

Thanks so much for your suggestions and the reference! That's very helpful. This dataset is one of several that I use in my project, which needs genotypes across the whole genome, not only in regions near genes. Luckily, we have just found that genotype data called from WGS is now available for that dataset, so this is no longer a big problem for me. But I benefited a lot from the discussion with you, and I think it will also help others in a similar situation. Best, Tao

written 3.0 years ago by Tao

Okay, great, best of luck with the remainder of your project.

written 3.0 years ago by Kevin Blighe