Question

Error when running clustal omega with python/biopython

2

Entering edit mode

8.0 years ago

Nate ▴ 20

Hi all,

I've been trying to run a sequence alignment using clustal omega from within a python script. However, I keep getting the following error:

FATAL: Sequence file contains no data

However, if I run the raw command on the command line, everything runs fine. Here is the relevant code in my script:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
import Bio.SeqIO as SeqIO
from Bio.Align.Applications import ClustalOmegaCommandline

ref_seq = "ERVVIGSKPFNEQYILANMIAILLEENGYKA"
target_seq = "ERVVIGSKPFNEQYILANMINGYKA"

in_file = "unaligned.fasta"
out_file = "aligned.fasta"

# Write my sequences to a fasta file
handle = open(in_file, 'w')
records = (SeqRecord(Seq(seq, IUPAC.protein), id=str(index), name="Test", description="Test") for index,seq in enumerate([ref_seq, target_seq]) )
SeqIO.write(records, handle, "fasta")

clustalomega_cline = ClustalOmegaCommandline(infile=in_file, outfile=out_file, verbose=True, auto=True, force=True)
clustalomega_cline()
#sp.call(str(clustalomega_cline), shell=True) # This will also fail with the same error

Edit:

It seems that this is due to either the way biopython writes the FASTA file, or how clustalo processes it.

If I comment out the lines where I write to a file and run the script again, is still fails, even if I run the raw command using subprocess. BUT, if I also make sure that there are no spaces in the FASTA sequence headers, then the commands at the bottom run fine.

I don't know why this isn't a problem when just run from the command line with the headers unmodified. This doesn't make any sense.

biopython python clustalo clustalomega • 4.6k views

ADD COMMENT • link updated 8.0 years ago by gearoid ▴ 200 • written 8.0 years ago by Nate ▴ 20

score 4 · Accepted Answer · 2016-04-21

Closing the input file after writing to it, but before calling Clustal Omega, makes this work for me.

Just add

handle.close()

after the call to SeqIO.write(). I think the way it is now, maybe the sequences aren't being written to that file until the script terminates (after the call to Clustal Omega).

Edit: a more elegant version might be:

records = (SeqRecord(Seq(seq, IUPAC.protein), id=str(index), name="Test", description="Test") for index,seq in enumerate([ref_seq, target_seq]) )

with open(in_file, 'w') as handle:
    SeqIO.write(records, handle, "fasta")

where the 'with' statement takes care of closing the file handle for you (even if an exception is thrown in the inner block).