Question: How to convert Plink's ped/map files into nexus format?
0
gravatar for A.Illum
4.9 years ago by
A.Illum0
Estonia
A.Illum0 wrote:

Dear all,
I need to make a phylogenetic network using unphased Illumina genotyping data stored in Plink's ped/map format. Phylogenetic programs like SplitsTree require input files to be in nexus format.  How can I convert ped/map into nexus format? Should I also phase the data?

I am a beginner in this field so any help much appreciated.

Ann

ADD COMMENTlink modified 4.5 years ago by hannes.svardal140 • written 4.9 years ago by A.Illum0
0
gravatar for hannes.svardal
4.5 years ago by
Austria
hannes.svardal140 wrote:

Hi,

Here is a python function that converts a ped file to a nexus file as used by SNAPP. I am not sure about the specific requirements of SplitTree, but you can try.

It assumes that your ped is encoded with 12 instead of ACGT. You obtain such a ped from a standard ped with

plink --file <in_filename> --recode12 --out <out_filename>

then run within python:

ped_to_nexus(ped_fn, nexus_fn, n_individuals=<number of individuals>) 

The function is attached below. 

Best,

Hannes

def ped_to_nexus(ped_fn, nexus_fn, individual_names=None, n_individuals=None,
                                 max_name_len = 12, comments=None,dominant=False):
    """
    Write snps from ped file to nexus alignment file.
    Attention: The bed file has to be encoded with:
                0.. missing
                1.. first allele 
                2.. second allele
            Such a ped file can be obtained from a regular ped file
            with "plink --file <in_filename> --recode12 --out <out_filename>"
    This function assumes that individuals are diploid. 
    The two haploid columns in the ped become a single genotype (0/1/2) 
        in the nexus file.
    -------Parameters---------
    individual_names... list of individual (or taxa) names
                        if not supplied, the names from the bed file are used
    n_individuals... number of individuals (ntax in the nexus format)
                     (if not supplied, the length of indiviudal_names is used)
    max_name_len... maximal number of characters of individual names, 
                    used to determine intent of alignments
    comments... list of strings that are added to the nexus header as comments
                alternatively, it can be a single string
    dominant... if True, the 0 allele is assumed to be dominant and the resulting
                nexus file is encoded 0/1 instead of 0/1/2
    """
     #
    assert individual_names is not None or n_individuals is not None,\
            "specify either individual names or the total number of individuals" 
    if individual_names is not None and n_individuals is not None:
        assert len(individual_names) == n_individuals,\
                "number of individual names must equal argument n_individuals"
    if n_individuals is None:
        n_individuals = len(individual_names)
    if individual_names is None:
        individual_names = [None]*n_individuals
        
    with open(ped_fn,"r") as bf, open(nexus_fn,"w") as nf:
        nf.write("#NEXUS\n")
        if comments is not None:
            if not isinstance(comments, basestring):
                for comment in comments:
                    nf.write("[{}]\n".format(comment))
            else:
                nf.write("[{}]\n".format(comments))
        
        for i,(line,name) in enumerate(zip(bf,individual_names)):
            if i >= n_individuals:
                break
            fields = line.split()
            if name is None:
                name = fields[0]
            n_whitespace = max_name_len + 1 - len(name)
            assert n_whitespace >= 1, "max taxon name length is " \
                                    + str(max_name_len) + " but name is " + name
            data = fields[6:]
            
            if i == 0:
                if dominant:
                    symbols = "01"
                else:
                    symbols = "012"
                nf.write("\nBegin data;\n"
                 "        Dimensions ntax={} nchar={};\n"
                 '        Format datatype=integerdata symbols="{}" missing=? gap=-;\n'
                 "        Matrix\n".format(n_individuals,len(data)/2,symbols))
            nf.write(name)
            nf.write(" " * n_whitespace)
            for i in xrange(0,len(data),2):
                h0 = int(data[i])
                h1 = int(data[i+1])
                if h0!=0 and h1!=0:
                    gt = h0 + h1 - 2
                    if dominant:
                        #assume 0 allele is dominant
                        gt = int(gt/2)
                    gt = str(gt)
                else:
                    gt = "?"
                nf.write(gt)
            nf.write("\n")
        nf.write("        ;\nEnd;")
ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by hannes.svardal140

Is there a way to do this with standard 0/1 binary? Your method works with --recode12, but SNAPP requires 1/0's.

I can convert to BED format to get a binary file, but I have no way to convert to nexus from there. Help!

ADD REPLYlink written 2.5 years ago by jcolella.jc0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 931 users visited in the last hour