Question: biopython script using MUSCLE stalls with >300 sequences
1
gravatar for julestrachsel
3.3 years ago by
United States
julestrachsel20 wrote:

Hello,

I am trying to learn biopython a little better and I am tyring to write a python script that will take in some unaligned DNA sequences in fasta format from the command line, translate these to protein sequences, align these sequences with MUSCLE, and then write out the resulting protein alignment.

I run the script from the command line by typing "python3 script.py input.fasta"

Here is the script so far:

from Bio.SeqRecord import SeqRecord

from Bio import SeqIO

from Bio.SeqUtils.CheckSum import seguid

from sys import argv

from Bio.Align.Applications import MuscleCommandline

from Bio import AlignIO

import subprocess

import sys

def unique_seqs(sequences):

    """returns a list of SeqRecord objects with redundant sequences removed"""

    unique_records = []

    checksum_container = []

    for seq in sequences:

        checksum = seguid(seq.seq)

        if checksum not in checksum_container:

            checksum_container.append(checksum)

            unique_records.append(seq)

    return unique_records

def make_protein_record(nuc_record):

    """Returns a new SeqRecord with the translated sequence (default table)."""

    return SeqRecord(seq=nuc_record.seq.translate(cds=False), id=nuc_record.id, description="translation using def table")

seqs = SeqIO.parse(argv[1], "fasta")

seqs = list(seqs)           

prot = []

print("you input {} sequences".format(len(seqs)))

UNIQUE_Seqs = unique_seqs(seqs)

print("there are {} unique DNA sequences".format(len(UNIQUE_Seqs)))

for seq in UNIQUE_Seqs:

    seq = make_protein_record(seq)

    prot.append(seq)

protein_fasta_handle = "prot" + argv[1]

SeqIO.write(prot, protein_fasta_handle, "fasta")

muscle_cline = MuscleCommandline()

child = subprocess.Popen(str(muscle_cline),

                          stdin =subprocess.PIPE,

                          stdout =subprocess.PIPE,

                          stderr =subprocess.PIPE,

                          universal_newlines =True,

                          shell =(sys.platform != "win32"))

SeqIO.write(prot, child.stdin, "fasta")

child.stdin.close()

align_handle = "ALN" + protein_fasta_handle

align = AlignIO.read(child.stdout, "fasta")

AlignIO.write(align, align_handle, "fasta")

print(align)

 

This script works, however it only works when I give it a low number of starting sequences.  I have successfully gotten it to work with a fasta containing 8 sequences, but when I tried to use 300 starting sequences the script freezes.  I initially thought it was being non-responsive because the MUSCLE alignment was taking a long time.  To test this I ran MUSCLE from outside of biopython and the alignment only took 20 seconds. 

By inserting print statements at various places I learned that the problem arises on line 43:

align = AlignIO.read(child.stdout, "fasta")

I haven't seen any error messages or anything of that nature.  Can anyone spot something I'm doing wrong here?

Thank you for your time.

 

 

biopython sequence alignment • 1.3k views
ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by julestrachsel20

Try adding parameters to the MuscleCommandline() function; remove the stdin parameter in the Popen() function, after that add the line: stdout, stdout = child.communicate() and remove the line SeqIO.write(prot, child.stdin, "fasta") which I dont really understand what are you doing with it because you are already saving the protein sequences in fasta format.

ADD REPLYlink written 3.3 years ago by rafa.rios.5050

Working directly with pipes like this (child.stdout etc) is much harder to debug. Try making temporary files instead.

ADD REPLYlink written 3.3 years ago by Peter5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1577 users visited in the last hour