Question

Does it make sense to perform Genotype Imputation using variants called from WES?

1

6.5 years ago

Tao ▴ 530

Hi Biostars:

I have genotype data called from whole-exome sequencing (WES), comprising about 0.12 million variants (mostly SNPs). In theory, it is feasible to impute SNPs across the whole genome using a genotype imputation model. But I am wondering how accurate it would be to use such a small set of variants (exonic variants only) to impute all variants in the genome (> 10 million with the 1000G reference panel). I understand that R2 can be used to filter out low-quality imputed variants, but is it really OK to do imputation this way?

Thanks! Tao

genotype imputation Whole Exon Sequencing variants • 3.8k views

score 2 · Answer 1 · 2017-11-03

2

6.5 years ago

Kevin Blighe 87k

I think that you'd be criticised if you ever tried to publish a study that did imputation that way. My primary question, if I were a reviewer, would be: 'Why didn't you just use whole-genome sequencing or a very dense genotyping array?'

This doesn't mean that studies of a similar nature haven't been done before though:

Imputation of Exome Sequence Variants into Population-Based Samples and Blood-Cell-Trait-Associated Loci in African Americans: NHLBI GO Exome Sequencing Project
Whole-exome imputation of sequence variants identified two novel alleles associated with adult body height in African Americans.
Evaluating the Coverage and Potential of Imputing the Exome Microarray with Next-Generation Imputation Using the 1000 Genomes Project

Thus, as there have been publications in reputable journals that have already done [Edit:] something similar, I think that you can get away with it provided that you follow the methods and QC criteria that these other studies used. Most of the SNPs that you try to impute will fail the QC though (I imagine).

0

Hi Kevin, Thanks for your reply and time! It's a public dataset; we just want to use it in our project, which needs genotypes across the whole genome. For the first and second references, it seems they only imputed exome variants, based on a reference panel from the NHLBI Exome Sequencing Project. For the third reference, it seems they wanted to show that, by imputing from a whole-genome array (Omni2.5) with 1000G, they could recover the sites on the exome chip. So I don't see that they did anything like what I described. Please correct me if my understanding is wrong. Thanks! Tao

6.5 years ago by Tao ▴ 530

0

Hey Tao,

Yes, the idea is that these are just similar studies, i.e., not exactly the same, but not completely different either, that you could use as a starting point.

I do not doubt that you could complete an imputation in the way that you desire, but it's just the credibility of the results that I doubt. Imputation is just statistical relationships, at the end of the day, and is known to produce incorrect genotypes even when done properly.

I hope that others can contribute to the discussion.

Kevin

6.5 years ago by Kevin Blighe 87k

0

Hi Kevin,

Thanks! You are right. The imputation can be run without any errors, but how accurate would it be? That's exactly what I'm concerned about! Thanks for your references.

Best, Tao

6.5 years ago by Tao ▴ 530

1

Hey Tao,

I would really doubt the accuracy, particularly as you go further into intergenic regions and away from genes. Far away from each gene, you just won't have concrete data with which to make any sort of accurate imputation. It would be akin to making random calls, i.e., by chance, you'll be able to impute some genotypes far away from genes, but these could well be errors. However, as you implied, I think that many of the imputed SNPs would not even make it to the final dataset, as they may fall well below an r-squared of 0.3 or 0.4, or would fail by some other metric.
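To make the r-squared cut-off concrete, here is a minimal Python sketch of that kind of filter. It assumes the imputed variants have already been parsed into simple records; the `ImputedVariant` class, its field names, and the example values are hypothetical and not tied to any particular imputation tool, each of which reports its quality metric under its own name.

```python
from dataclasses import dataclass


@dataclass
class ImputedVariant:
    """Hypothetical record for one imputed variant (not a real tool's format)."""
    variant_id: str
    r2: float  # estimated imputation quality, between 0 and 1


def filter_by_r2(variants, min_r2=0.3):
    """Keep only variants whose imputation r-squared is at least min_r2."""
    return [v for v in variants if v.r2 >= min_r2]


# Made-up values, purely for illustration.
variants = [
    ImputedVariant("rs1", 0.95),
    ImputedVariant("rs2", 0.29),
    ImputedVariant("rs3", 0.30),
]

kept = filter_by_r2(variants, min_r2=0.3)
kept_ids = [v.variant_id for v in kept]
fraction_failed = 1 - len(kept) / len(variants)
print(kept_ids, f"{fraction_failed:.0%} failed")
```

Tracking the fraction that fails, split by distance from the nearest exon, would be one way to check whether quality really drops off in intergenic regions.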

What is the aim of your experiment, generally? If you are just interested in imputing genotypes in enhancer and promoter regions, the TSS, or the 5'/3' UTR, then you could just impute within a certain distance of each gene. Why not just impute up to 25,000 bp from each gene start and terminal exon? That is probably still too great a distance, but it's worth a try. It will not encompass all enhancer regions either, as these can be >100,000 bp from gene bodies and still regulate transcription.
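The idea of restricting imputation to a window around each gene could be sketched roughly as follows. This is only an illustration under assumptions: gene coordinates are given as hypothetical `(chrom, start, end)` tuples, the 25,000 bp flank is the figure suggested above, and the windows are not clipped to chromosome lengths.

```python
def padded_gene_windows(genes, flank=25_000):
    """Extend each gene by `flank` bp on both sides and merge overlapping windows.

    `genes` is an iterable of (chrom, start, end) tuples (hypothetical input format).
    Returns a sorted list of merged (chrom, start, end) windows.
    """
    # Pad each gene, clamping the left edge at position 0.
    padded = sorted(
        (chrom, max(0, start - flank), end + flank) for chrom, start, end in genes
    )
    merged = []
    for chrom, start, end in padded:
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlaps the previous window on the same chromosome: extend it.
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Made-up coordinates, purely for illustration.
genes = [
    ("chr1", 100_000, 120_000),
    ("chr1", 150_000, 160_000),
    ("chr2", 10_000, 20_000),
]
windows = padded_gene_windows(genes)
print(windows)
```

The merged windows could then be used to restrict which reference-panel sites are imputed, or to subset the output afterwards.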

Another interesting study on this topic is here. In it, the authors specifically state that imputation accuracy suffers as distance between a SNP and an imputed SNP increases.

I would also encourage you to seek the opinions of others in your department, just to corroborate what I am saying.

Kind regards

Kevin

Edit: to give you an idea, high-density genotyping microarrays genotype genome-wide with a mean distance between genotyped positions of ~3,500 bp. Even when imputing at that density, imputation errors occur.
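For a back-of-the-envelope comparison of marker densities, the arithmetic can be sketched as below. The genome length of about 3.1 billion bp is an approximation I am assuming here; the 0.12 million WES variants and the ~3,500 bp array spacing are the figures from this thread. Note that the WES average is optimistic, since exome variants cluster in and around exons rather than spreading evenly.

```python
GENOME_BP = 3.1e9  # assumption: approximate length of the human genome in bp


def mean_spacing_bp(n_markers, genome_bp=GENOME_BP):
    """Average distance between markers if they were spread evenly over the genome."""
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return genome_bp / n_markers


def markers_for_spacing(spacing_bp, genome_bp=GENOME_BP):
    """Approximate number of markers implied by a given mean spacing."""
    if spacing_bp <= 0:
        raise ValueError("spacing_bp must be positive")
    return genome_bp / spacing_bp


wes_spacing = mean_spacing_bp(120_000)    # the 0.12 million WES variants
array_markers = markers_for_spacing(3_500)  # the ~3,500 bp array spacing

print(f"WES: one variant every ~{wes_spacing:,.0f} bp on average")
print(f"Array: ~{array_markers:,.0f} genotyped positions")
```

Under these assumptions, the WES call set averages roughly one variant per 26,000 bp, several times sparser than the array even before accounting for the clustering in exons.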

6.5 years ago by Kevin Blighe 87k

0

Thanks so much for your suggestions and reference! That's very helpful! This dataset is one of several datasets I use in my project, which needs genotypes across the whole genome, not only regions near genes. Luckily, we just found that genotype data called from WGS is now available for that dataset, so this is no longer a big problem for me. But I benefited a lot from the discussion with you, and I think it will also benefit others in a similar situation. Best, Tao