I have been fighting with the different GWAS file formats recently and getting more and more confused, so I would like some advice from experienced users here.
It took me a while to figure out that the plink .ped & .dat file formats are different from the Mach (& thus merlin?) .ped & .dat file formats, which was quite surprising. Even worse, plink2 & plink take different command-line parameters, and maybe different file formats, too? These programs also change their file formats when they are updated to newer versions. Even now, I am still not sure how the merlin .ped should be structured for mach1.
For instance, here is a plink .map/.ped file definition:
The fields in a MAP file are:
Chromosome
Marker ID
Genetic distance
Physical position

Example of a MAP file of the standard PLINK format:
21 rs11511647 0 26765
X rs3883674 0 32380
The fields in a PED file are:
Family ID
Sample ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Affection (0=unknown; 1=unaffected; 2=affected)
Genotypes (space or tab separated, 2 for each marker. 0=missing)

Example of a PED file of the standard PLINK format:
FAM1 NA06985 0 0 1 1 A T T T G G C C A T T T G G C C
FAM1 NA06991 0 0 1 1 C T T T G G C C C T T T G G C C
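To make the PED layout above concrete, here is a minimal Python sketch that splits one PLINK-style PED line into its six leading fields and a list of allele pairs. The function name `parse_ped_line` and the dictionary keys are made up for illustration and are not part of PLINK; the sketch only assumes whitespace-delimited fields in the order listed above.

```python
def parse_ped_line(line):
    """Split one PLINK-style PED line into leading fields and genotype pairs.

    Illustrative helper only (not part of PLINK). Assumes six leading
    columns followed by two alleles per marker, separated by spaces or tabs.
    """
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) < 6:
        raise ValueError("expected at least 6 leading columns")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two alleles per marker")
    return {
        "family_id": fields[0],
        "sample_id": fields[1],
        "paternal_id": fields[2],
        "maternal_id": fields[3],
        "sex": fields[4],        # 1=male; 2=female; other=unknown
        "affection": fields[5],  # 0=unknown; 1=unaffected; 2=affected
        # Pair up consecutive alleles: one (a1, a2) tuple per marker.
        "genotypes": [(alleles[i], alleles[i + 1])
                      for i in range(0, len(alleles), 2)],
    }


# First example line from the PED file shown above.
example = "FAM1 NA06985 0 0 1 1 A T T T G G C C A T T T G G C C"
record = parse_ped_line(example)
print(record["sample_id"], len(record["genotypes"]))
```

Because the parser splits on any whitespace, it treats tab-separated and space-separated alleles the same way, which is the distinction that matters in the mach1 problem below.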
It looks like mach1 rejects this .ped format for two reasons:
1) It does not accept the "affection status" field, which has to be deleted. (But the Merlin format should take affection status and numeric traits, right? If the phenotype field has to be removed, where should the phenotypes go?)
2) The two alleles of a locus must be separated by a blank space instead of a tab.
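The two changes described above could be scripted roughly as follows. This is only a sketch: it assumes the target layout is exactly what the two points describe (five leading columns, no affection column, every field separated by a single space), which may not match the actual MACH or Merlin specification. The names `plink_to_mach_ped_line` and `convert_ped` are hypothetical helpers, not functions of any of these tools.

```python
def plink_to_mach_ped_line(line):
    """Drop the affection column (column 6) and re-join fields with spaces.

    Hypothetical helper; the output layout is an assumption based on the
    two points above, not a confirmed MACH/Merlin format definition.
    """
    fields = line.split()  # accepts both tabs and spaces on input
    if len(fields) < 6:
        raise ValueError("expected at least 6 leading columns")
    leading = fields[:5]   # family, sample, father, mother, sex
    alleles = fields[6:]   # everything after the affection column
    return " ".join(leading + alleles)


def convert_ped(in_path, out_path):
    """Convert a whole PED file line by line, skipping blank lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(plink_to_mach_ped_line(line) + "\n")


# Tab-separated input line with two markers.
converted = plink_to_mach_ped_line("FAM1\tNA06985\t0\t0\t1\t1\tA\tT\tT\tT")
print(converted)
```

Note that this throws the phenotype away entirely, so it does not answer the question of where the affection status should be kept.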
OK, I can change the file format to accommodate mach1, but then mach1 complains about multiallelic loci, and the multiallelic variants need to be removed by plink. However, plink does not accept the modified format and refuses to work. So I end up going in circles.
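For reference, the multiallelic loci in question can be spotted without either tool by counting the distinct non-missing alleles seen at each marker. The Python sketch below does that on PLINK-style PED lines; it is not how plink or mach1 perform the check, the function name `find_multiallelic_markers` is made up, and the sample lines and marker names are toy data for illustration only.

```python
def find_multiallelic_markers(ped_lines, marker_ids, missing="0"):
    """Return IDs of markers with more than two distinct non-missing alleles.

    Illustrative only. Assumes PLINK-style PED lines: six leading columns,
    then two alleles per marker in the same order as marker_ids.
    """
    allele_sets = [set() for _ in marker_ids]
    for line in ped_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        alleles = fields[6:]
        if len(alleles) != 2 * len(marker_ids):
            raise ValueError("allele count does not match number of markers")
        for i in range(len(marker_ids)):
            for allele in alleles[2 * i:2 * i + 2]:
                if allele != missing:
                    allele_sets[i].add(allele)
    return [m for m, seen in zip(marker_ids, allele_sets) if len(seen) > 2]


# Toy data (not real genotypes): three samples, three hypothetical markers.
toy_lines = [
    "FAM1 S1 0 0 1 1 A T G G C C",
    "FAM1 S2 0 0 1 1 C T G G C 0",
    "FAM1 S3 0 0 2 1 G T G G C C",
]
flagged = find_multiallelic_markers(toy_lines, ["m1", "m2", "m3"])
print(flagged)
```

In a real run the marker IDs would come from the second column of the .map file, in the same order as the genotype columns.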
I know you can remove multiallelic variants using plink with the plink format, then take the outputs and reformat them to accommodate mach1, but then you have to write your own scripts to shuttle data back and forth between plink and mach1 in two separate file formats. This simply does not sound right. I know there is a free program, mega2, that can do the plink to merlin conversion, but 1) I have a Mac and I cannot install mega2 on it, and 2) I am not sure what versions of the file formats mega2 can handle.
I guess I am doing something wrong. What would be the best practice for running mach1 imputation if you start from a set of plink 1.07 compatible .ped/.map/.dat files?