Question

Convert Fasta Alignment To Phylip In Constant Memory

2

Entering edit mode

11.5 years ago

xapple ▴ 230

Hello !

I have an alignment in standard FASTA format and would like to convert it to the PHYLIP format. This, by itself, is not complicated. The problem is that the alignment I'm trying to work with is the full current release of the Silva database. All implementations of this conversion I've tried until now attempt to load all sequences in memory which always falls into swap space and finally gets killed on my 8GB workstation.

Do you know of a quick solution ?

Thanks in advance !

If you have to know, I'm trying to convert it to PHYLIP for input into RAxML.

alignment fasta memory • 9.8k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 11.5 years ago by xapple ▴ 230

score 4 · Answer 1 · 2012-11-05

Ah well, I wrote a python script that does it without memory.

#!/usr/bin/env python

"""
Convert fasta alignemnts to relaxed phylip ones in constant memory.
Written by Lucas Sinclair.
Kopimi.

You can use this script from the shell like this::
$ fasta_to_phylip seqs.fasta seqs.phylip
"""

###############################################################################
class Sequence(object):
    """The Sequence object has a string *header* and
    various representations."""

    def __init__(self, header, seq):
        self.header = re.findall('^>(\S+)', header)[0]
        self.seq = seq

    def __len__(self):
        return len(self.seq)

    @property
    def phylip(self):
        return self.header + " " + self.seq.replace('.','-') + "\n"

    @property
    def fasta(self):
        return ">" + self.header + "\n" + self.seq + "\n"

def fasta_parse(path):
    """Reads the file at *path* and yields
       Sequence objects in a lazy fashion"""
    header = ''
    seq = ''
    with open(path) as f:
        for line in f:
            line = line.strip('\n')
            if line.startswith('>'):
                if header: yield Sequence(header, seq)
                header = line
                seq = ''
                continue
            seq += line
    yield Sequence(header, seq)

###############################################################################
# The libraries we need #
import sys, os, random, string, re
# Get the shell arguments #
fa_path = sys.argv[1]
ph_path = sys.argv[2]
# Check that the path is valid #
if not os.path.exists(fa_path): raise Exception("No file at %s." % fa_path)
# Use our two functions #
seqs = fasta_parse(fa_path)
# Write the output to temporary file #
tm_path = ph_path + '.' + ''.join(random.choice(string.letters) for i in xrange(10))
# Count the sequences #
count = 0
with open(tm_path, 'w') as f:
    for seq in seqs:
        f.write(seq.phylip)
        count += 1
# Add number of entries and length at the top #
with open(tm_path, 'r') as old, open(ph_path, 'w') as new:
    new.write(" " + str(count) + " " + str(len(seq)) + "\n")
    new.writelines(old)
# Clean up #
os.remove(tm_path)

score 1 · Answer 2 · 2012-11-02

As you mention, this process is a really easy data transformation. I think FASTA to NEXUS would be a little easier if you are writing your own parser, and RAxML will take a NEXUS file also. You say you are not having a problem with the actual parsing, so I won't address how you are doing that here.

The problem you are having is a size problem (it's not impossible to parse the entire silva database - anywhere from half to 2Gb in size - depending on which version you use) but depending on how you are parsing it you may be able to use your computer's memory more economically. If you can't parse with less memory, then I would suggest a cluster for large jobs such as this over your desktop/laptop.

I have a question: Why do you want to construct a phylogeny with the entire 2Gb contents of the SILVA database? The LSU and SSU genes aren't evenly informative across all domains. Not really sure where you are going with things, but it seems like a lot of work for naught. Even if you are using a large phylogeny to place environmental amplicons, it's much less cumbersome to use a selective database (and one which can still be very large in size). I would recommend TopiaryExplorer to phylogenetically show placement of OTU data that you would be inferring from your BLAST or alignment to the SILVA database.

Ram · Answer 3 · 2014-12-03

0

Entering edit mode

9.4 years ago

zeeefa ▴ 90

I know I'm a little late to post here lol but you can use this tool to convert FASTA format to a relaxed PHYLIP format :) Hope it helps.

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.4 years ago by zeeefa ▴ 90