When using Will Rayner’s pre-imputation Perl script (the HRC checking tool), the log file by default reports how many variants differ from the reference panel by more than 20% in allele frequency; a more stringent threshold of 10% can be applied with the flag -t 0.1. My population is of North European ancestry and I’m using HRCr1.1 as my reference panel. Should I exclude variants whose allele frequency difference is greater than 10%, or only those greater than 20%?
I have identified n=207 variants with an allele frequency difference >0.2, and n=482 with a difference >0.1, having already corrected for strand and REF/ALT.
For the variants that show a difference >10% between the HRCr1.1 frequency and my *.bim file frequency, I’ve tried comparing against dbSNP HapMap3 CEU, 1000GP3 EUR, and, where available, the N.Sweden or TwinsUK data in the UCSC genome browser. In 75% of cases my bim file frequency looks correct and the HRC frequency looks incorrect. Presumably there are several possible reasons for this: 1) HRC contains populations other than North European, 2) HRC may not always be showing the same variant at the same chr:pos, 3) HRC may be showing a rarer allele of a multiallelic variant at the same chr:pos, 4) HRC could simply be wrong. Is it safest to go with the stringent 10% cut-off, to ensure that I am not including variants that will cause regions of poor imputation?
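To make the comparison concrete, here is a minimal Python sketch of the kind of check I mean. It is not the HRC checking tool's actual code; the function name, the variant keys and the frequencies are all made up for illustration. It keys each variant on chr, pos, REF and ALT (so a different allele at the same chr:pos is not compared, as in reasons 2 and 3 above) and flags variants whose absolute allele frequency difference exceeds a threshold.

```python
from typing import Dict, List, Tuple

# Hypothetical variant key: (chromosome, position, REF allele, ALT allele)
Variant = Tuple[str, int, str, str]


def flag_af_differences(
    study: Dict[Variant, float],
    reference: Dict[Variant, float],
    threshold: float,
) -> List[Variant]:
    """Return variants whose |study AF - reference AF| exceeds threshold.

    Variants absent from the reference under the exact same
    chr/pos/REF/ALT key are skipped rather than compared.
    """
    flagged = []
    for key, study_af in study.items():
        ref_af = reference.get(key)
        if ref_af is None:
            continue  # no matching allele pair in the reference
        if abs(study_af - ref_af) > threshold:
            flagged.append(key)
    return sorted(flagged)


# Toy frequencies, invented purely for illustration.
study_af = {
    ("1", 1000, "A", "G"): 0.30,
    ("1", 2000, "C", "T"): 0.55,
    ("2", 3000, "G", "A"): 0.10,
    ("2", 4000, "T", "C"): 0.40,
}
reference_af = {
    ("1", 1000, "A", "G"): 0.32,  # small difference
    ("1", 2000, "C", "T"): 0.40,  # exceeds 0.1 but not 0.2
    ("2", 3000, "G", "A"): 0.35,  # exceeds both thresholds
    ("2", 4000, "T", "G"): 0.01,  # different ALT allele: not compared
}

flagged_20 = flag_af_differences(study_af, reference_af, 0.2)
flagged_10 = flag_af_differences(study_af, reference_af, 0.1)
print(len(flagged_20), "variants differ by more than 0.2")
print(len(flagged_10), "variants differ by more than 0.1")
```

Every variant flagged at 0.2 is also flagged at 0.1, so the question is really about the extra variants that fall between the two thresholds.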
I also want to run my data against the TOPMed Imputation Server. I have checked my data against the Bravo variant browser [TOPMed_Freeze3a] on build GRCh37 and have no strand or REF/ALT issues, but obviously I have a lot more variants with an allele frequency difference >0.2 (n=10555). This I would expect, as TOPMed is only 40% Caucasian. I believe pre-phasing is carried out with HRCr1.1, so if I’ve already checked my variants against HRCr1.1 using Will Rayner’s pre-imputation Perl script HRC checking tool, is this the best I can do for pre-imputation checking prior to running on the TOPMed Imputation Server?