Merlin format, MaCH, PLINK, PLINK 2, and best practices for SNP imputation
3.6 years ago
moxu ▴ 500

I have been fighting with the different GWAS file formats recently and getting more and more confused, so I would like some guidance from experienced users here.

It took me a while to figure out that the plink .ped & .dat file formats are different from the MaCH (and thus Merlin?) .ped & .dat file formats, which was quite surprising. Even worse, plink2 and plink take different command-line parameters, and maybe different file formats, too? These programs also change their file formats when they are updated to newer versions. I am still not sure how the Merlin .ped should be structured for mach1.

For instance, here is a plink .map/.ped file definition:

The fields in a MAP file are:

Chromosome
Marker ID
Genetic distance
Physical position

Example of a MAP file of the standard PLINK format:
21  rs11511647  0   26765
X   rs3883674   0   32380

The fields in a PED file are:

Family ID
Sample ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Affection (0=unknown; 1=unaffected; 2=affected)
Genotypes (space or tab separated, 2 alleles for each marker; 0=missing)

Example of a PED file of the standard PLINK format:
FAM1    NA06985 0   0   1   1   A   T   T   T   G   G   C   C   A   T   T   T   G   G   C   C
FAM1    NA06991 0   0   1   1   C   T   T   T   G   G   C   C   C   T   T   T   G   G   C   C
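
To make the column layout above concrete, here is a minimal Python sketch that parses one MAP line and one PED line. The names `MapRecord`, `PedRecord`, `parse_map_line`, and `parse_ped_line` are hypothetical helpers made up for illustration, not part of plink. It assumes the four-column MAP layout and the six leading PED columns listed above, with any run of spaces or tabs as the separator.

```python
from typing import List, NamedTuple, Tuple


class MapRecord(NamedTuple):
    chrom: str
    marker_id: str
    genetic_distance: float
    position: int


class PedRecord(NamedTuple):
    family_id: str
    sample_id: str
    paternal_id: str
    maternal_id: str
    sex: str
    phenotype: str
    genotypes: List[Tuple[str, str]]


def parse_map_line(line: str) -> MapRecord:
    """Parse one line of a four-column PLINK .map file."""
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) != 4:
        raise ValueError(f"expected 4 MAP fields, got {len(fields)}")
    chrom, marker_id, cm, bp = fields
    return MapRecord(chrom, marker_id, float(cm), int(bp))


def parse_ped_line(line: str, n_markers: int) -> PedRecord:
    """Parse one line of a PLINK .ped file with n_markers markers."""
    fields = line.split()
    expected = 6 + 2 * n_markers  # 6 leading columns, 2 alleles per marker
    if len(fields) != expected:
        raise ValueError(f"expected {expected} PED fields, got {len(fields)}")
    alleles = fields[6:]
    # Pair consecutive alleles: (a1, a2) for marker 1, then marker 2, ...
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedRecord(*fields[:6], genotypes)
```

Checking the field count against the number of MAP lines is a cheap way to catch a .ped and .map pair that do not describe the same marker set.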

It looks like mach1 does not recognize this .ped format in two respects:

1) It does not like the "affection status" field, which has to be deleted. (But the Merlin format should take affection status and numeric traits, right? If the phenotype field has to be removed, where should the phenotypes go?)
2) Alleles of the same locus should be separated by a blank space instead of a tab.

OK, I can change the file format to accommodate mach1, but then mach1 complains about multiallelic loci, and the multiallelic variants need to be removed by plink. However, plink does not accept the modified format and refuses to work, so I end up going in circles.

I know you can remove multiallelic variants using plink with the plink format, then take the outputs and reformat them to accommodate mach1, but then you have to write your own scripts to go back and forth between plink and mach1 with two separate file formats. This simply does not sound right. I know there is a free program, mega2, that can do the plink-to-Merlin conversion, but 1) I have a Mac and cannot install mega2 on it, and 2) I am not sure which versions of the file formats mega2 can handle.

I guess I did something wrong. What would be the best practice for mach1 imputation if you start from a set of plink 1.07-compatible .ped/.map/.dat files?
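
One way out of that circle is to flag the multiallelic markers directly from the plink-format .ped before any reformatting. The following is a rough Python sketch, not a plink feature: `find_multiallelic` is a hypothetical helper, and it assumes "0" is the only missing-allele code and that a marker counts as multiallelic when more than two distinct alleles are observed across samples.

```python
from typing import Iterable, List, Set

MISSING = {"0"}  # assumption: "0" is the only missing-allele code in the file


def find_multiallelic(ped_lines: Iterable[str], marker_ids: List[str]) -> List[str]:
    """Return IDs of markers with more than two distinct observed alleles."""
    seen: List[Set[str]] = [set() for _ in marker_ids]
    for line in ped_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        alleles = fields[6:]  # drop the 6 leading pedigree/phenotype columns
        if len(alleles) != 2 * len(marker_ids):
            raise ValueError("PED line does not match the number of markers")
        for i in range(len(marker_ids)):
            for allele in alleles[2 * i: 2 * i + 2]:
                if allele not in MISSING:
                    seen[i].add(allele)
    return [m for m, s in zip(marker_ids, seen) if len(s) > 2]
```

The returned IDs can be written one per line to a text file and used as an exclusion list in whichever tool does the filtering.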

SNP software error • 1.9k views

Is it really necessary to use mach1? Imputation software has advanced considerably since 2010, and newer packages support more commonly-used formats than Merlin.

As for plink2 and --file, that doesn't work yet because plink2 is an incomplete program currently undergoing alpha testing. The main priority is enabling things which are impossible with earlier plink versions; stuff which is fully functional in plink 1.9, such as .ped + .map import, is lower-priority. With that said, the central file format has NOT changed: --bfile and --make-bed work the same way in v1.07, v1.9, and v2.0. plink2 introduces an extension (--pfile/--make-pgen) which can represent lots of things relevant to modern GWAS that are outside the plink1 binary format's scope (REF vs. ALT, imputed dosages, phasing information, variant QUAL/FILTER/INFO, categorical covariates...); but if you're starting with .ped + .map data, you can just stick to --bfile/--make-bed.
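
For reference, the .ped + .map to binary conversion mentioned here could be scripted roughly as below. This is only a sketch: `make_bed_command` and `run_make_bed` are hypothetical wrappers, the file prefixes are placeholders, and it assumes a plink 1.x executable named `plink` is on the PATH. The `--file`, `--make-bed`, and `--out` flags are the standard plink 1.x ones.

```python
import shutil
import subprocess
from typing import List


def make_bed_command(text_prefix: str, out_prefix: str,
                     plink_exe: str = "plink") -> List[str]:
    """Build the argument list that converts PREFIX.ped/.map to .bed/.bim/.fam."""
    return [plink_exe, "--file", text_prefix, "--make-bed", "--out", out_prefix]


def run_make_bed(text_prefix: str, out_prefix: str,
                 plink_exe: str = "plink") -> None:
    """Run the conversion; raises if plink is missing or exits with an error."""
    if shutil.which(plink_exe) is None:
        raise FileNotFoundError(f"{plink_exe!r} not found on PATH")
    subprocess.run(make_bed_command(text_prefix, out_prefix, plink_exe), check=True)
```

After that, later steps can read the result with `--bfile` and the output prefix.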

We want to evaluate different phasing/imputation software products, and MaCH is one of them. "mach1" is the only MaCH I can obtain now.

Thanks a lot for the information about plink. Are you suggesting that I convert .ped to .bed and then use plink2 instead?


In your position, I'd use plink 1.x except when you run into something that's impossible to do with v1.x, but possible with v2.

Unfortunately, direct conversion to or from Merlin format is not supported by any version, due to a timing quirk (Merlin format was kind of obsolete when plink 1.07 was released in 2009, then it regained some relevance in 2010, but by the time of the next plink update in 2014 it had again become too rarely used to justify adding new import/export functions for it), so you do actually need to use your own scripts here; fortunately the scripts should be pretty straightforward if you're willing to split or throw out multiallelic variants in advance.
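
A conversion script of the kind described might look like the following Python sketch. It is not an official converter: `plink_to_merlin` is a hypothetical helper. It assumes that multiallelic markers have already been identified (passed in as `skip_markers`), that the target .dat file lists one `M <marker name>` line per marker in Merlin style, and, following the observations in the question, that the affection column is dropped and the two alleles of a genotype are joined by a single space. Those assumptions should be checked against the MaCH documentation for the version in use.

```python
from typing import FrozenSet, Iterable, List, Tuple


def plink_to_merlin(map_lines: Iterable[str], ped_lines: Iterable[str],
                    skip_markers: FrozenSet[str] = frozenset()
                    ) -> Tuple[List[str], List[str]]:
    """Return (dat_lines, ped_lines) in a Merlin-style layout."""
    markers = [line.split()[1] for line in map_lines if line.strip()]
    keep = [i for i, m in enumerate(markers) if m not in skip_markers]

    # One "M <name>" line per retained marker, in PED column order.
    dat_out = [f"M {markers[i]}" for i in keep]

    ped_out = []
    for line in ped_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6 + 2 * len(markers):
            raise ValueError("PED line does not match the MAP file")
        cols = fields[:5]  # family, sample, father, mother, sex; column 6 dropped
        alleles = fields[6:]
        for i in keep:
            cols.append(f"{alleles[2 * i]} {alleles[2 * i + 1]}")
        ped_out.append("\t".join(cols))
    return dat_out, ped_out
```

Writing the two returned lists to a .dat and a .ped file gives a pair that stays in sync, because both are derived from the same retained marker indices.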

OK, thanks a lot for the clarification. I thought MaCH went hand-in-hand with Merlin. I don't mind writing my own scripts for little things like this, but if there is something free, it might be better. Very often, I have found it more time-consuming to search for small tools to do trivial tasks than to write the scripts myself. :)

