Convert Fasta Alignment To Phylip In Constant Memory
3
2
Entering edit mode
11.5 years ago
xapple ▴ 230

Hello !

I have an alignment in standard FASTA format and would like to convert it to the PHYLIP format. This, by itself, is not complicated. The problem is that the alignment I'm trying to work with is the full current release of the Silva database. All implementations of this conversion I've tried until now attempt to load all sequences in memory which always falls into swap space and finally gets killed on my 8GB workstation.

Do you know of a quick solution ?

Thanks in advance !

If you have to know, I'm trying to convert it to PHYLIP for input into RAxML.

alignment fasta memory • 9.8k views
ADD COMMENT
4
Entering edit mode
11.4 years ago
xapple ▴ 230

Ah well, I wrote a python script that does it without memory.

#!/usr/bin/env python

"""
Convert fasta alignemnts to relaxed phylip ones in constant memory.
Written by Lucas Sinclair.
Kopimi.

You can use this script from the shell like this::
$ fasta_to_phylip seqs.fasta seqs.phylip
"""

###############################################################################
class Sequence(object):
    """The Sequence object has a string *header* and
    various representations."""

    def __init__(self, header, seq):
        self.header = re.findall('^>(\S+)', header)[0]
        self.seq = seq

    def __len__(self):
        return len(self.seq)

    @property
    def phylip(self):
        return self.header + " " + self.seq.replace('.','-') + "\n"

    @property
    def fasta(self):
        return ">" + self.header + "\n" + self.seq + "\n"

def fasta_parse(path):
    """Reads the file at *path* and yields
       Sequence objects in a lazy fashion"""
    header = ''
    seq = ''
    with open(path) as f:
        for line in f:
            line = line.strip('\n')
            if line.startswith('>'):
                if header: yield Sequence(header, seq)
                header = line
                seq = ''
                continue
            seq += line
    yield Sequence(header, seq)

###############################################################################
# The libraries we need #
import sys, os, random, string, re
# Get the shell arguments #
fa_path = sys.argv[1]
ph_path = sys.argv[2]
# Check that the path is valid #
if not os.path.exists(fa_path): raise Exception("No file at %s." % fa_path)
# Use our two functions #
seqs = fasta_parse(fa_path)
# Write the output to temporary file #
tm_path = ph_path + '.' + ''.join(random.choice(string.letters) for i in xrange(10))
# Count the sequences #
count = 0
with open(tm_path, 'w') as f:
    for seq in seqs:
        f.write(seq.phylip)
        count += 1
# Add number of entries and length at the top #
with open(tm_path, 'r') as old, open(ph_path, 'w') as new:
    new.write(" " + str(count) + " " + str(len(seq)) + "\n")
    new.writelines(old)
# Clean up #
os.remove(tm_path)
ADD COMMENT
0
Entering edit mode

Thanks for posting this!

ADD REPLY
1
Entering edit mode
11.5 years ago
Josh Herr 5.8k

As you mention, this process is a really easy data transformation. I think FASTA to NEXUS would be a little easier if you are writing your own parser, and RAxML will take a NEXUS file also. You say you are not having a problem with the actual parsing, so I won't address how you are doing that here.

The problem you are having is a size problem (it's not impossible to parse the entire silva database - anywhere from half to 2Gb in size - depending on which version you use) but depending on how you are parsing it you may be able to use your computer's memory more economically. If you can't parse with less memory, then I would suggest a cluster for large jobs such as this over your desktop/laptop.

I have a question: Why do you want to construct a phylogeny with the entire 2Gb contents of the SILVA database? The LSU and SSU genes aren't evenly informative across all domains. Not really sure where you are going with things, but it seems like a lot of work for naught. Even if you are using a large phylogeny to place environmental amplicons, it's much less cumbersome to use a selective database (and one which can still be very large in size). I would recommend TopiaryExplorer to phylogenetically show placement of OTU data that you would be inferring from your BLAST or alignment to the SILVA database.

ADD COMMENT
0
Entering edit mode

Thanks for your answer. I didn't know that RAxML takes nexus files also ! I could get lots more RAM, but it's a problem that doesn't require memory, it's O(1) if you implement it right. I'm not attempting to construct a phylogeny, I wanted to try and use the "-f e" option to compute branch lengths and then use the model from "pplacer" to place sequences. I looked at TopiaryExplorer, but was looking for a solution that didn't require a GUI so as to automatize processing.

ADD REPLY
0
Entering edit mode

Excellent, I see your strategy now. I actually have been using pplacer lately and can see how it would work well in a pipeline.

ADD REPLY
0
Entering edit mode
9.4 years ago
zeeefa ▴ 90

I know I'm a little late to post here lol but you can use this tool to convert FASTA format to a relaxed PHYLIP format :) Hope it helps.

ADD COMMENT

Login before adding your answer.

Traffic: 2380 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6