I want to find the best way to impute genotypes from 5 different arrays (Omni 1M, Omni 1S, MEGA, Omni 5M, Immunochip) for a GWAS in African Americans.
The samples overlap across arrays: some people genotyped on MEGA, for instance, are also genotyped on Omni 1M.
The number of people genotyped on each array is:
Omni 1M - 1200
Omni 1S - 1200
MEGA - 485
Omni 5M - 985
Immunochip - 1400
I also have CGI whole genome sequencing (38x) on 62 people. These people have genotyping data on one or more platforms, and the sequencing data is high quality, especially for common variants.
The MEGA array in particular is supposed to contain many variants found in people of African ancestry, and the Omni 5M has about 5 million SNPs, so its density is fairly good.
Now with all that as background, my question is: what is the most sensible imputation strategy?
Should I lump everything and impute on the lumped data?
Or should I try a more complex approach, and perhaps judge imputation accuracy by comparing the imputed genotypes to genotypes of the same markers measured directly in the same sample on a different chip? Is that likely to matter to the final analysis, or is it probably academic?
And finally, is there a clearly best program in this day and age? Should I use SHAPEIT2? Genotype Harmonizer seems attractive because it handles phasing and strand flipping across chips as well, but is it as good?
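To make the accuracy check I have in mind concrete, here is a minimal sketch in Python. It assumes I have already pulled out, for one SNP, the imputed dosages (0 to 2) and the genotypes called directly on another chip (0, 1 or 2) for the same people in the same order, with None for missing values. The function names and the input layout are my own placeholders, not part of any imputation program.

```python
def _paired(imputed, truth):
    """Keep only samples where both the imputed and the genotyped value exist."""
    return [(i, t) for i, t in zip(imputed, truth)
            if i is not None and t is not None]


def genotype_concordance(imputed, truth):
    """Fraction of samples whose best-guess imputed genotype matches the chip call."""
    pairs = _paired(imputed, truth)
    if not pairs:
        return None
    # int(d + 0.5) rounds a dosage in [0, 2] to the nearest genotype (0, 1 or 2)
    matches = sum(1 for i, t in pairs if int(i + 0.5) == t)
    return matches / len(pairs)


def dosage_r2(imputed, truth):
    """Squared Pearson correlation between imputed dosage and the chip genotype."""
    pairs = _paired(imputed, truth)
    n = len(pairs)
    if n < 2:
        return None
    mean_i = sum(i for i, _ in pairs) / n
    mean_t = sum(t for _, t in pairs) / n
    sxx = sum((i - mean_i) ** 2 for i, _ in pairs)
    syy = sum((t - mean_t) ** 2 for _, t in pairs)
    sxy = sum((i - mean_i) * (t - mean_t) for i, t in pairs)
    if sxx == 0 or syy == 0:
        return None  # monomorphic in this subset, correlation undefined
    return sxy ** 2 / (sxx * syy)


# Made-up example values for one SNP in five people (the last one has no imputed value)
imputed_dosage = [0.1, 1.0, 1.9, 0.9, None]
chip_genotype = [0, 1, 2, 2, 1]

print(genotype_concordance(imputed_dosage, chip_genotype))
print(dosage_r2(imputed_dosage, chip_genotype))
```

Concordance is dominated by the common homozygote at low-frequency SNPs, so I would probably look at the dosage r2 as well, binned by allele frequency.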
With such questions in mind, I would appreciate any advice on a practical imputation strategy for such data.