Biopython 1.56 - Strange Behaviour Of Seqio.Parse And Seqio.Write
3
2
Entering edit mode
12.3 years ago
Markus ▴ 20

Hello everybody,

I'm using Biopython 1.56 compiled from source on Ubuntu 10.10 64-bit. It's a great piece of software and I love to work with it. But there is a very strange behavior of the "Bio.SeqIO.parse()" sequence parser, which cost me several hours to find out:

If I uncomment the for-statement with the print commands, "Bio.SeqIO.write()" refuses to write the sequences to the file.

Is this the desired behavior? Is the for-loop iterating the parser object to its end and leaves it there? Could anyone help me out? What am I getting wrong?

Markus

Example:


handle = open("ls_orchid.fasta", "fasta")
parsedfasta = Bio.SeqIO.parse(handle, "fasta")

# If you uncomment the following part, 0 Sequences
# (instead of many more) are written to the file!

#for seq_record in parsedfasta:
#print "Description:\t%s..." % seq_record.description
#print "Length:\t\t%d" % len(seq_record)
#print "Sequence:\t%s\n" % seq_record.seq[:40], seq_record.seq[44:50])

output_handle = open("example.fasta", "w")
count = Bio.SeqIO.write(parsedfasta, output_handle, "fasta")
output_handle.close()
print "%d Sequences written to file" % count

biopython python • 5.8k views
12
Entering edit mode
12.3 years ago

Nice answers so far. SeqIO uses standard python iterators and generators:

http://www.learningpython.com/2009/02/23/iterators-iterables-and-generators-oh-my/

Using 'list' to turn an iterator into an in-memory list does allow you to parse it multiple times, but at the cost of needing to be able to fit the full file in memory. If you have a large file, or would just like to write your code in a more general manner, you can lazily move through the list of sequences a single time by doing the processing inside a generator function. This example processes each record then yields it for writing. Iterators and generators are very useful for dealing with large files in Python:

from Bio import SeqIO

handle = open("ls_orchid.fasta", "r")
parsedfasta = SeqIO.parse(handle, "fasta")

def print_seqrec(seq_record):
"""Specific function to analyze a sequence record.
"""
print "Description:\t%s..." % seq_record.description
print "Length:\t\t%d" % len(seq_record)
print "Sequence:\t%s\n" % seq_record.seq[:40], seq_record.seq[44:50]

def process_and_generate(input_iterator, process_function):
"""Reusable function that processes a record, then generates each record.

input_iterator is an iterator that returns one record at a time
process_function is a function that takes one record and does some
processing on it
"""
for rec in input_iterator:
process_function(rec)
yield rec

generator = process_and_generate(parsedfasta, print_seqrec)

output_handle = open("example.fasta", "w")
count = SeqIO.write(generator, output_handle, "fasta")
output_handle.close()
print "%d Sequences written to file" % count

0
Entering edit mode

Great answer! I'll try to integrate these ideas into my code. Thank you.

6
Entering edit mode
12.3 years ago
Casbon ★ 3.2k

Yes, you are consuming the generator with the for loop. Either call parse twice, or consume the generator in a list so that you can iterate over it more than once. This should work:

handle = open("ls_orchid.fasta", "fasta")
parsedfasta = list(Bio.SeqIO.parse(handle, "fasta"))


Notice we are using list to consume the iterator. For a large file this could take a lot of memory.

0
Entering edit mode

Thanks a lot! It's nice to have some people around willing to help the helpless : )

0
Entering edit mode

open flag should be "r", not "fasta"

3
Entering edit mode
12.3 years ago
Thaman ★ 3.3k

The only thing you have to consider while parsing the result in case of Bio.SeqIO is:

• Non-random access - Which means from Seq[0].......Seq[n] which is possbile just by iterator
• Random access - Like your example which needs special built in python list function to turn the iterator into a list.
from Bio import SeqIO
handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
for seq_record in records:
print "Sequence:\t%s\n" % seq_record.seq[:40], seq_record.seq[44:50])
handle.close()


Hope it will help