Question

In GWAS analysis, my most significant SNPs from the pre-imputation data are no longer significant after imputation. Why? How do I justify this?

19 months ago

debleena.guin • 0

I am performing a GWAS analysis. While comparing my pre-imputation and post-imputation data, I observed that the most significant genetic variant (p < 1×10⁻¹⁶) in the pre-imputation data is no longer significant post-imputation. Imputation was performed on the Michigan Imputation Server using the 1000 Genomes Phase 3 v5 SAS population as the reference panel. These variants were dropped while matching the target data to the reference data. How do I overcome this? What should be reported in the manuscript (pre- or post-imputation data)? How do we justify such findings?

GWAS Imputation • 1.2k views

updated 19 months ago by LChart 3.9k • written 19 months ago by debleena.guin • 0

I am no expert, but my guess is that the imputation step drastically increased the number of SNPs and thus the number of association tests performed. When correcting for multiple testing, your significant SNPs no longer pass the threshold. How many SNPs were tested before and after imputation? Also, what software/model are you using for GWAS?
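
To make the multiple-testing point concrete, here is a minimal sketch of how a Bonferroni threshold scales with the number of tests. The marker counts are hypothetical placeholders, and Bonferroni is just one possible correction, not necessarily the one used in any particular analysis.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def passes(p_value: float, threshold: float) -> bool:
    """True if the p-value is below the corrected threshold."""
    return p_value < threshold


# Hypothetical marker counts before and after imputation (placeholders).
n_pre, n_post = 500_000, 10_000_000

thr_pre = bonferroni_threshold(0.05, n_pre)
thr_post = bonferroni_threshold(0.05, n_post)

# A borderline hit can pass before imputation and fail afterwards.
borderline = 5e-8
print(passes(borderline, thr_pre), passes(borderline, thr_post))
```

With these placeholder counts, a borderline p-value changes status, whereas a very small p-value would stay below both thresholds, so this effect can only explain hits that were close to the cutoff.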

19 months ago by liorglic ★ 1.4k

We did not apply any multiple-testing correction. The SNPs that were significant pre-imputation (unadjusted p < 1×10⁻¹⁶) are no longer significant post-imputation, even at p < 0.05. We genotyped ~800,000 (8 lakh) markers, and after quality control ~300,000 (3 lakh) markers were used as input for imputation. The post-imputation .vcf file contained ~8 million (80 lakh) markers. We performed the association analysis in PLINK, with summary statistics calculated using a χ² test. For imputation we used the Michigan Imputation Server with 1000 Genomes Phase 3 v5 (SAS population) as the reference; phasing was performed with Eagle v2.4, and an r² threshold of 0.2 was applied.
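
For reference, the kind of χ² test described above can be sketched as follows, assuming it is the basic allelic test on a 2x2 table of allele counts with one degree of freedom. The allele counts are made up for illustration and are not from our study; only the Python standard library is used.

```python
import math


def allelic_chi2(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts.

    Returns (chi2, p_value). No continuity correction is applied.
    """
    table = [[case_a1, case_a2], [ctrl_a1, ctrl_a2]]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row)
    if total == 0 or 0 in row or 0 in col:
        raise ValueError("every row and column needs a non-zero total")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts in cases and controls.
chi2, p = allelic_chi2(case_a1=300, case_a2=700, ctrl_a1=200, ctrl_a2=800)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

Running the same function on the pre- and post-imputation genotype counts of a single SNP is a quick way to see whether the counts themselves changed.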

19 months ago by debleena.guin • 0

score 1 · Accepted Answer · 2022-09-28

19 months ago

LChart 3.9k

> the most significant genetic variant (p < 1×10⁻¹⁶) from pre-imputation data is no more significant post imputation.

Imputation should generally not alter test statistics for variants that are on your array: the imputation step fills in missing genotypes and only rarely alters a genotype that was already called (unless the genotype likelihoods are low). A well-called variant on your array will be largely unaffected by imputation, so the pre- and post-imputation p-values should correspond closely. If the p-value of that variant went from 1×10⁻¹⁶ to something much larger, like 1×10⁻⁴, then there is something potentially suspicious about the genotypes. Have you filtered for Hardy-Weinberg equilibrium? Does your top SNP happen to be on the X chromosome?
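
One way to check this, as a rough sketch: compare the p-values of variants present in both result sets and flag those that moved a lot or disappeared. The dictionaries, variant IDs, and the one-order-of-magnitude cutoff below are hypothetical; in practice the p-values would be parsed from your association output files.

```python
import math


def neg_log10(p):
    """-log10(p), with a floor to avoid log(0) for underflowed p-values."""
    return -math.log10(max(p, 1e-300))


def flag_discordant(pre, post, max_log10_diff=1.0):
    """Compare p-values for variants tested before and after imputation.

    pre, post: dicts mapping variant ID -> p-value.
    Returns (discordant, missing): variants whose -log10(p) differs by more
    than max_log10_diff, and variants tested pre-imputation but absent after.
    """
    discordant, missing = [], []
    for snp, p_pre in pre.items():
        if snp not in post:
            missing.append(snp)
            continue
        if abs(neg_log10(p_pre) - neg_log10(post[snp])) > max_log10_diff:
            discordant.append((snp, p_pre, post[snp]))
    return discordant, missing


# Made-up variant IDs and p-values, for illustration only.
pre = {"rsA": 1e-16, "rsB": 3e-4, "rsC": 0.2}
post = {"rsA": 1e-4, "rsB": 2.5e-4}

discordant, missing = flag_discordant(pre, post)
print("discordant:", discordant)
print("missing after imputation:", missing)
```

A genotyped variant that shows up as discordant is worth inspecting by hand, while a variant in the missing list was never tested post-imputation, which is a different problem.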

The imputation was performed on the Michigan Imputation Server (MIS) using the 1000G Phase 3 SAS population as the reference. Phasing was done with Eagle v2.4, and SNPs with r² < 0.2 were filtered out. This imputation platform imputes missing genotypes in our target data by comparing the variants with the reference data. We lost ~50% of our variants during this comparison. My post-imputation data now contains the remaining ~50% of the target variants plus all the SNPs linked to them (SNPs that were not actually genotyped). After imputation, we removed SNPs with MAF < 1% and HWE outliers (p < 1×10⁻⁵). We performed imputation only on the autosomes.
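
The MAF and HWE filters mentioned above can be sketched like this, assuming hard-called genotype counts at a biallelic site. The HWE p-value here comes from a simple 1-df χ² goodness-of-fit test, which is a simplification (an exact test is commonly preferred, especially for rare alleles), and the function names are illustrative only.

```python
import math


def maf_and_hwe(n_aa, n_ab, n_bb):
    """Return (maf, hwe_p) from genotype counts at a biallelic site."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    maf = min(p, q)
    if maf == 0.0:
        return 0.0, 1.0  # monomorphic site: HWE test is undefined
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Chi-square survival function with 1 df.
    return maf, math.erfc(math.sqrt(chi2 / 2.0))


def keep_variant(n_aa, n_ab, n_bb, min_maf=0.01, min_hwe_p=1e-5):
    """Apply the MAF >= 1% and HWE p >= 1e-5 filters to one variant."""
    maf, hwe_p = maf_and_hwe(n_aa, n_ab, n_bb)
    return maf >= min_maf and hwe_p >= min_hwe_p


# Hypothetical genotype counts: in HWE, no heterozygotes, and very rare.
print(keep_variant(250, 500, 250))
print(keep_variant(500, 0, 500))
print(keep_variant(998, 2, 0))
```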

I am assuming that the most significant SNP (p < 1×10⁻¹⁶) in the pre-imputation data was lost because it was not matched with the reference data (1000G SAS). How do I justify this loss? Is it not very unlikely?

Also, I cross-checked the SNPs retained in both the pre- and post-imputation data: the allele frequencies are the same in both datasets.

19 months ago by debleena.guin • 0

> This imputation platform performs imputation of missing genotypes in our target data by comparing the variants with the ref data. We lost ~50% variants during this comparison.

Technically, it's not the imputation platform that does this. Whoever submitted the data to the imputation server did it themselves by running the suggested pre-imputation checks.
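
To see where variants get dropped, it can help to classify each target SNP against the reference panel yourself. The sketch below is a simplified illustration, not the logic of any specific checking tool: the category names are my own, it handles single-base alleles only, and the example variants are invented.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify(target_alleles, ref_alleles):
    """Classify a target SNP against the reference entry at the same position.

    target_alleles, ref_alleles: 2-tuples of single-base alleles;
    ref_alleles is None when the position is absent from the panel.
    """
    if ref_alleles is None:
        return "not_in_reference"
    t, r = set(target_alleles), set(ref_alleles)
    flipped = {COMPLEMENT[a] for a in t}
    if t == flipped:
        return "strand_ambiguous"  # A/T or C/G: strand cannot be told from alleles
    if t == r:
        return "match"
    if flipped == r:
        return "strand_flip"
    return "allele_mismatch"


# Invented example: variant ID -> (target alleles, reference alleles or None).
variants = {
    "v1": (("A", "G"), ("G", "A")),
    "v2": (("A", "G"), ("T", "C")),
    "v3": (("A", "T"), ("A", "T")),
    "v4": (("A", "G"), ("A", "C")),
    "v5": (("C", "T"), None),
}

summary = Counter(classify(t, r) for t, r in variants.values())
print(summary)
```

If most of the lost variants fall into one category, for example strand flips, the loss is a data-preparation issue rather than a property of the cohort.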

> I am assuming that the most significant SNP (p < 1×10⁻¹⁶) in pre imputation data was lost as it was not matched with the ref data (1000G SAS). How do I justify this loss? Is it not very unlikely?

It depends on the allele frequency. There are ~490 genomes in the SAS cohort, so a variant that is common in a well-matched population should almost always be present in the reference. If your top SNP has AF = 10% and is still missing from SAS, then your cohort is not well matched to that reference population. The fact that you lost 50% of your SNPs could be an additional indicator of this (you should not lose 50% of AF > 1% SNPs). What does it look like if you plot AF in your cohort (pre-imputation, all sites) against AF in SAS? See for instance:

https://www.nature.com/articles/nprot.2014.071/figures/4
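
As a rough sketch of both points: the first function gives the chance that an allele of a given frequency is missing from a panel purely by sampling, and the second lists variants whose cohort AF disagrees with the reference AF. The panel size of 500, the 0.2 difference cutoff, and all variant IDs and frequencies are placeholders, and the frequencies are assumed to refer to the same allele in both datasets.

```python
def prob_absent(af, n_genomes):
    """Chance that an allele of frequency af is unseen in n_genomes diploid
    samples, assuming 2 * n_genomes independent chromosome draws."""
    return (1.0 - af) ** (2 * n_genomes)


def af_outliers(cohort_af, ref_af, max_diff=0.2):
    """Compare allele frequencies between the cohort and the reference.

    cohort_af, ref_af: dicts mapping variant ID -> allele frequency.
    Returns (shared, outliers, absent): variants in both sets, shared variants
    whose AF differs by more than max_diff, and cohort variants missing from
    the reference.
    """
    shared = sorted(set(cohort_af) & set(ref_af))
    outliers = [v for v in shared if abs(cohort_af[v] - ref_af[v]) > max_diff]
    absent = sorted(set(cohort_af) - set(ref_af))
    return shared, outliers, absent


# Placeholder frequencies, for illustration only.
cohort_af = {"rs1": 0.10, "rs2": 0.45, "rs3": 0.30, "rs4": 0.10}
ref_af = {"rs1": 0.12, "rs2": 0.05, "rs3": 0.28}

shared, outliers, absent = af_outliers(cohort_af, ref_af)
print("outliers:", outliers, "absent:", absent)
print("P(AF=10% allele unseen in 500 genomes):", prob_absent(0.10, 500))
```

The absence probability for a 10% allele is vanishingly small, which is why a missing common variant points to a population or allele-coding mismatch rather than bad luck.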

19 months ago by LChart 3.9k