I've got GWAS data in PLINK format (bed, bim, fam) and need to impute some SNPs that weren't directly genotyped. I've read that I need to phase first, e.g. with SHAPEIT, and then impute, e.g. with IMPUTE2. I'm having trouble figuring out which genetic map to use for SHAPEIT (the one linked in their guide doesn't work). What would be really helpful is a step-by-step guide for going from PLINK files to imputed SNPs, as this process seems quite painful. Here's what I've done:
I'm on a Mac, so I had to set up an Ubuntu VirtualBox VM to run SHAPEIT (shapeit.v2.904.3.10.0-693.11.6.el7.x86_64). The example data from the tutorial works.

My GWAS data is in PLINK format (TRACK-HD_v3_qc_imputed_v3.bed, TRACK-HD_v3_qc_imputed_v3.bim, TRACK-HD_v3_qc_imputed_v3.fam).

I read the SHAPEIT documentation, which says under 'Genetic map' to follow this link to download the map for human populations (http://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html). My GWAS data is on GRCh37, so I want the 'HapMap phase II b37' map, but its download link doesn't work (http://www.shapeit.fr/files/genetic_map_b37.tar.gz).

So I've been looking for an alternative genetic map. First I went to HapMap (http://hapmap.ncbi.nlm.nih.gov), but that has been retired. I then went to their archive (ftp://ftp.ncbi.nlm.nih.gov/hapmap/), but it's not at all clear which file to use as the map. I also read that 1000 Genomes (1KG) data can be used as a map, so I went there, but again it's not clear which file to use (https://www.internationalgenome.org/data).
I've tried using 'genetic_map_chr1_combined_b36.txt' as the genetic map, but I get the following:
michael@michael-VirtualBox:~/bin/shapeit.v2.904.3.10.0-693.11.6.el7.x86_64/bin$ ./shapeit --input-bed TRACK-HD_v3_qc_imputed_v3.bed TRACK-HD_v3_qc_imputed_v3.bim TRACK-HD_v3_qc_imputed_v3.fam --input-map genetic_map_chr1_combined_b36.txt --output-max TRACK-HD_v3_qc_imputed_v3_phased.haps TRACK-HD_v3_qc_imputed_v3_phased.sample

Segmented HAPlotype Estimation & Imputation Tool
  * Authors : Olivier Delaneau, Jared O'Connell, Jean-François Zagury, Jonathan Marchini
  * Contact : send an email to the OXSTATGEN mail list https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=OXSTATGEN
  * Webpage : https://mathgen.stats.ox.ac.uk/shapeit
  * Version : v2.r904
  * Date : 24/11/2019 14:30:04
  * LOGfile : [shapeit_24112019_14h30m04s_8d07a6e7-5f9d-45c0-8e20-706fd12a0ba6.log]

MODE -phase : PHASING GENOTYPE DATA
  * Autosome (chr1 ... chr22)
  * Window-based model (SHAPEIT v2)
  * MCMC iteration

Parameters :
  * Seed : 1574605804
  * Parallelisation: 1 threads
  * Ref allele is NOT aligned on the reference genome
  * MCMC: 35 iterations [7 B + 1 runs of 8 P + 20 M]
  * Model: 100 states per window [100 H + 0 PM + 0 R + 0 COV ] / Windows of ~2.0 Mb / Ne = 15000

Reading site list in [TRACK-HD_v3_qc_imputed_v3.bim]
ERROR: Duplicate site pos=40345847 ref=A alt=AAAC
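One way to see which variants are behind the "Duplicate site" error is to scan the .bim file for positions that occur more than once. The following is only a minimal Python sketch, assuming the standard six-column PLINK .bim layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2). The function name `find_duplicate_positions` is hypothetical and not part of PLINK or SHAPEIT, the sample records are made up for illustration, and grouping on chromosome plus position is an assumption about what counts as a duplicate, since the error message alone doesn't say how SHAPEIT keys its check.

```python
from collections import defaultdict


def find_duplicate_positions(bim_lines):
    """Group .bim records by (chromosome, position); return groups with >1 record.

    Assumes the six-column PLINK .bim layout:
    chromosome, variant ID, genetic distance, bp position, allele 1, allele 2.
    """
    by_site = defaultdict(list)
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, variant_id, _cm, pos, a1, a2 = fields[:6]
        by_site[(chrom, int(pos))].append((variant_id, a1, a2))
    # Keep only sites that carry more than one variant record
    return {site: recs for site, recs in by_site.items() if len(recs) > 1}


# Made-up records for illustration only (not taken from the real .bim file)
example = [
    "1 rsA 0 1000 A G",
    "1 rsB 0 2000 AAAC A",
    "1 rsC 0 2000 C A",
]

for (chrom, pos), records in sorted(find_duplicate_positions(example).items()):
    print(chrom, pos, records)
```

To run it on real data, pass an open file handle instead of the list, e.g. `find_duplicate_positions(open("TRACK-HD_v3_qc_imputed_v3.bim"))`. The variant IDs it reports could then be inspected or excluded before phasing.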
All in all, I'm a bit fed up and going to take a break from this for a while. Any help for when I come back to it later this evening would be much appreciated. Thanks!