Question: Difference in imputation between platforms and RS names
0
gravatar for anamaria
4 weeks ago by
anamaria110
anamaria110 wrote:

Hello,

so I have SNPs (RSIDs) from imputation done in 2011 on http://csg.sph.umich.edu/abecasis/MACH/tour/ (call it 2011 data) and I did imputation on the same genotype files on Michigan Imputation Server, Genotype Imputation (Minimac4) 1.2.4 (call it 2020 data)

using the same QC steps I perfomed GWAS using plink.

In 2011 I have ~2.5 million SNPs and in 2020 I have ~2.7 million SNPs. The issue is that only ~900000 SNPs are matching between those two data sets. Can someone please explain me why? Did RS names changed in the meantime? I did put both genotype files on Build 37. Here I am presenting number of SNPs per chromosome for old (2011) data and new (2020) data. Also I am comparing snps_that_can_be_found_in_old_but_not_in_new and snps_that_can_be_found_in_new_but_not_in_old.

Can someone please explain me what might be the issue and why there is only ~900000 SNPs matching SNPs?

enter image description here

enter image description here

mach imputation snps minimac rs • 143 views
ADD COMMENTlink modified 4 weeks ago by Kevin Blighe66k • written 4 weeks ago by anamaria110
0
gravatar for Kevin Blighe
4 weeks ago by
Kevin Blighe66k
Kevin Blighe66k wrote:

How are you matching: based on rs ID or genomic position?

Different pre-filtering may be applied by the imputation tools, e.g., for linkage disequilibrium, resulting in different individual results at the SNP levels but, globally, the same overall result in terms of genomic loci affected.

Kevin

ADD COMMENTlink written 4 weeks ago by Kevin Blighe66k

HI Kevin,

I did match based on RS name, not on position. Do you advise me matching based on position?

ADD REPLYlink written 4 weeks ago by anamaria110

what about SNPs like this: rs2905035, rs2519016...which can be found in 2011 imputation bot not in 2020

ADD REPLYlink written 4 weeks ago by anamaria110

I am not sure. If any program is of high quality, it should output what it filters out, and the reason why. You can check log files?

ADD REPLYlink written 4 weeks ago by Kevin Blighe66k

I don't have log files for 2011 data.

What about comparing F_A Frequency of this allele in cases and F_U Frequency of this allele in controls. What would this be indication of?

https://imgur.com/JoFtE6P

https://imgur.com/53yGfNh

ADD REPLYlink written 4 weeks ago by anamaria110

I could only speculate, to be honest. As I am generally familiar with your experiment, my advice to you and your supervisor (were I to provide it) would be to not try to replicate perfectly the results from ~10 years ago. Try to see this as an opportunity to process old data with new, better methods, which will produce results with higher confidence.

ADD REPLYlink written 4 weeks ago by Kevin Blighe66k

I completely understand you. And I already confirmed analysis I did in 2020 in other known GWAS results (doing the most recent QC steps, methods...). But we need to know why 2011 was so different.

ADD REPLYlink written 4 weeks ago by anamaria110

it seems that characteristics of overlapping SNPs is that there is not many of them who have very low minor allele frequency. Attached is the plot for frequencies for overlapping SNPs. Any idea why this might be?

https://imgur.com/XkCxpvz

https://imgur.com/cWfsHxu

ADD REPLYlink written 4 weeks ago by anamaria110
1

I suppose that it indicates how disagreement is on rare alleles; so, the differences could be accounted for by a filter for MAF. What we view as 'rare' has changed over the years as progressively larger datasets have been produced.

There really should be some way to check all filtering steps for both the 2011 dataset and the Michigan Imputation Server (2020) dataset. If no code exists for the 2011 dataset, then that is a negative point on your supervisor's behalf.

ADD REPLYlink written 4 weeks ago by Kevin Blighe66k

Yes indeed. Unfortunately I don't have any record on what kind of MAF threshold was used prior and post imputation in 2011. What number for MAF would you suggest me to impose on the new dataset judging by the old data? Should I do MAF 0.1 on both genotypes prior to doing regression analysis in plink or?

ADD REPLYlink written 4 weeks ago by anamaria110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1329 users visited in the last hour