To be honest, imputation is not an easy task if you have never done it before. I am curious to understand how / why you are getting into it? Supervisor...?
The general process proceeds in 2 main steps:
- pre-phasing of genotypes into haplotypes (against a reference)
- imputation of genotypes against a reference
A lot of the work in this area derives from a group in Oxford, where programs such as SHAPEIT (for pre-phasing) and IMPUTE2 (for imputation) were developed.
Another commonly used program is called Beagle. There is also the Michigan Imputation Server, which is probably easier if you have never done this before.
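As a rough illustration of the 2 steps, here is a dry-run sketch that only prints what a SHAPEIT pre-phasing command and an IMPUTE2 imputation command might look like for one chromosome. Every file name and prefix (`gwas_chr22`, `ref_chr22`, the genetic map) is a placeholder that I have made up, and the thread count and region are arbitrary; check the flags against the documentation for the versions that you install before running anything for real.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the two commands instead of executing them.
# All file names below are placeholders, not files from a real project.

CHR=22
GEN="gwas_chr${CHR}"              # hypothetical PLINK prefix (.bed/.bim/.fam)
MAP="genetic_map_chr${CHR}.txt"   # hypothetical recombination map
REF="ref_chr${CHR}"               # hypothetical reference panel prefix

# Step 1: pre-phase genotypes into haplotypes against the reference (SHAPEIT)
prephase_cmd() {
  echo shapeit \
    --input-bed "${GEN}.bed" "${GEN}.bim" "${GEN}.fam" \
    --input-map "${MAP}" \
    --input-ref "${REF}.hap.gz" "${REF}.legend.gz" "${REF}.sample" \
    --output-max "${GEN}.phased.haps" "${GEN}.phased.sample" \
    --thread 8
}

# Step 2: impute one region (here 1 to 5,000,000 bp) from the phased haplotypes (IMPUTE2)
impute_cmd() {
  echo impute2 \
    -use_prephased_g \
    -known_haps_g "${GEN}.phased.haps" \
    -m "${MAP}" \
    -h "${REF}.hap.gz" \
    -l "${REF}.legend.gz" \
    -int 1 5000000 \
    -o "${GEN}.chunk1.impute2"
}

prephase_cmd
impute_cmd
```

Printing the commands first makes it easy to eyeball them (or paste them into a job scheduler) before committing days of compute time.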
How long does it take? And things to be cautious of
...how long is a piece of string? It depends on many factors, including your compute resources, the size of your dataset, the size of the reference panel, etc. To give you an idea, I recently imputed an Illumina GSA dataset against 1000 Genomes Phase 3, and it took ~2 weeks of constant processing (32 cores; 32GB RAM), and probably 1.5 months in total when you consider everything else (script development, dealing with errors, etc).
Another key point is that, unless you have full access to something like a Cray supercomputer, you'll have to do the imputation in chunks, looped across each chromosome, e.g., 5-megabase chunks. The imputation programs are intelligent enough to also use a 'buffer' window outside of each chunk to ensure a harmonious overlap between neighbouring chunks. Then, these chunks have to be pieced back together at the end.
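To make the chunking idea concrete, here is a minimal sketch of how the chunk boundaries could be generated and how the per-chunk results could be stitched together afterwards. The function names and the `PREFIX.chunkN.impute2` naming scheme are my own placeholders, and the 12 Mb chromosome length is made up for the example. Each printed start/end pair is what you would pass to the imputation program as the region to impute (for IMPUTE2, via `-int`).

```shell
#!/usr/bin/env bash

# Print "start end" pairs covering positions 1..chr_len in fixed-size chunks.
chunk_intervals() {
  local chr_len=$1
  local chunk_size=${2:-5000000}   # 5 Mb unless told otherwise
  local start=1 end
  while [ "$start" -le "$chr_len" ]; do
    end=$(( start + chunk_size - 1 ))
    if [ "$end" -gt "$chr_len" ]; then
      end=$chr_len                 # the last chunk stops at the chromosome end
    fi
    echo "$start $end"
    start=$(( end + 1 ))
  done
}

# Concatenate per-chunk outputs in chunk order.
# Assumes a hypothetical naming scheme: PREFIX.chunk1.impute2, PREFIX.chunk2.impute2, ...
merge_chunks() {
  local prefix=$1 n_chunks=$2 i
  for (( i = 1; i <= n_chunks; i++ )); do
    # Skip chunks that produced no output file (e.g., no variants in that region)
    if [ -f "${prefix}.chunk${i}.impute2" ]; then
      cat "${prefix}.chunk${i}.impute2"
    fi
  done > "${prefix}.impute2"
}

# Example: a made-up 12 Mb chromosome gives three chunks
chunk_intervals 12000000
```

Looping over the chunk index rather than globbing the files keeps the chunks in genomic order (a plain `*.impute2` glob would sort chunk10 before chunk2).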
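NB - you can convert IMPUTE2's output to VCF via (the output file name below is just a placeholder, so use whatever suits you):
# "${GEN}.vcf" is a placeholder output name
shapeit \
  -convert \
  --input-haps "${GEN}_haps" \
  --output-vcf "${GEN}.vcf"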
Scripts? - you'll find a lot spread across the World Wide Web. For example, I have scripts for pre-phasing, here: C: Phasing with SHAPEIT
Kevin