How to convert Plink's ped/map files into nexus format?
1
0
Entering edit mode
8.0 years ago
A.Illum • 0

Dear all,

I need to make a phylogenetic network using unphased Illumina genotyping data stored in Plink's ped/map format. Phylogenetic programs like SplitsTree require input files to be in nexus format. How can I convert ped/map into nexus format? Should I also phase the data?

I am a beginner in this field so any help much appreciated.

Ann

alignment next-gen phylogeny sequencing • 4.6k views
ADD COMMENT
0
Entering edit mode
7.7 years ago

Hi,

Here is a python function that converts a ped file to a nexus file as used by SNAPP. I am not sure about the specific requirements of SplitTree, but you can try.

It assumes that your ped is encoded with 12 instead of ACGT. You obtain such a ped from a standard ped with

plink --file <in_filename> --recode12 --out <out_filename>

then run within python:

ped_to_nexus(ped_fn, nexus_fn, n_individuals=<number of individuals>)

The function is attached below.

Best,
Hannes

def ped_to_nexus(ped_fn, nexus_fn, individual_names=None, n_individuals=None,
                                 max_name_len = 12, comments=None,dominant=False):
    """
    Write snps from ped file to nexus alignment file.
    Attention: The bed file has to be encoded with:
                0.. missing
                1.. first allele 
                2.. second allele
            Such a ped file can be obtained from a regular ped file
            with "plink --file <in_filename> --recode12 --out <out_filename>"
    This function assumes that individuals are diploid. 
    The two haploid columns in the ped become a single genotype (0/1/2) 
        in the nexus file.
    -------Parameters---------
    individual_names... list of individual (or taxa) names
                        if not supplied, the names from the bed file are used
    n_individuals... number of individuals (ntax in the nexus format)
                     (if not supplied, the length of indiviudal_names is used)
    max_name_len... maximal number of characters of individual names, 
                    used to determine intent of alignments
    comments... list of strings that are added to the nexus header as comments
                alternatively, it can be a single string
    dominant... if True, the 0 allele is assumed to be dominant and the resulting
                nexus file is encoded 0/1 instead of 0/1/2
    """
     #
    assert individual_names is not None or n_individuals is not None,\
            "specify either individual names or the total number of individuals" 
    if individual_names is not None and n_individuals is not None:
        assert len(individual_names) == n_individuals,\
                "number of individual names must equal argument n_individuals"
    if n_individuals is None:
        n_individuals = len(individual_names)
    if individual_names is None:
        individual_names = [None]*n_individuals

    with open(ped_fn,"r") as bf, open(nexus_fn,"w") as nf:
        nf.write("#NEXUS\n")
        if comments is not None:
            if not isinstance(comments, basestring):
                for comment in comments:
                    nf.write("[{}]\n".format(comment))
            else:
                nf.write("[{}]\n".format(comments))

        for i,(line,name) in enumerate(zip(bf,individual_names)):
            if i >= n_individuals:
                break
            fields = line.split()
            if name is None:
                name = fields[0]
            n_whitespace = max_name_len + 1 - len(name)
            assert n_whitespace >= 1, "max taxon name length is " \
                                    + str(max_name_len) + " but name is " + name
            data = fields[6:]

            if i == 0:
                if dominant:
                    symbols = "01"
                else:
                    symbols = "012"
                nf.write("\nBegin data;\n"
                 "        Dimensions ntax={} nchar={};\n"
                 '        Format datatype=integerdata symbols="{}" missing=? gap=-;\n'
                 "        Matrix\n".format(n_individuals,len(data)/2,symbols))
            nf.write(name)
            nf.write(" " * n_whitespace)
            for i in xrange(0,len(data),2):
                h0 = int(data[i])
                h1 = int(data[i+1])
                if h0!=0 and h1!=0:
                    gt = h0 + h1 - 2
                    if dominant:
                        #assume 0 allele is dominant
                        gt = int(gt/2)
                    gt = str(gt)
                else:
                    gt = "?"
                nf.write(gt)
            nf.write("\n")
        nf.write("        ;\nEnd;")
ADD COMMENT
0
Entering edit mode

Is there a way to do this with standard 0/1 binary? Your method works with --recode12, but SNAPP requires 1/0's.

I can convert to BED format to get a binary file, but I have no way to convert to nexus from there. Help!

ADD REPLY

Login before adding your answer.

Traffic: 2217 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6