Question: Biopython 1.56 - Strange Behaviour Of Seqio.Parse And Seqio.Write
2
gravatar for Markus
10.0 years ago by
Markus20
Markus20 wrote:

Hello everybody,

I'm using Biopython 1.56 compiled from source on Ubuntu 10.10 64-bit. It's a great piece of software and I love to work with it. But there is a very strange behavior of the "Bio.SeqIO.parse()" sequence parser, which cost me several hours to find out:

If I uncomment the for-statement with the print commands, "Bio.SeqIO.write()" refuses to write the sequences to the file.

Is this the desired behavior? Is the for-loop iterating the parser object to its end and leaves it there? Could anyone help me out? What am I getting wrong?

Thanks in advance!

Markus

Example:


handle = open("ls_orchid.fasta", "fasta")
parsedfasta = Bio.SeqIO.parse(handle, "fasta")

# If you uncomment the following part, 0 Sequences 
# (instead of many more) are written to the file!

#for seq_record in parsedfasta:   
    #print "Description:\t%s..." % seq_record.description
    #print "Length:\t\t%d" % len(seq_record)
    #print "Sequence:\t%s\n" % seq_record.seq[:40], seq_record.seq[44:50])

output_handle = open("example.fasta", "w")   
count = Bio.SeqIO.write(parsedfasta, output_handle, "fasta")
output_handle.close()
print "%d Sequences written to file" % count 
python biopython • 4.6k views
ADD COMMENTlink modified 5.5 years ago by Biostar ♦♦ 20 • written 10.0 years ago by Markus20
12
gravatar for Brad Chapman
10.0 years ago by
Brad Chapman9.5k
Boston, MA
Brad Chapman9.5k wrote:

Nice answers so far. SeqIO uses standard python iterators and generators:

http://www.learningpython.com/2009/02/23/iterators-iterables-and-generators-oh-my/

Using 'list' to turn an iterator into an in-memory list does allow you to parse it multiple times, but at the cost of needing to be able to fit the full file in memory. If you have a large file, or would just like to write your code in a more general manner, you can lazily move through the list of sequences a single time by doing the processing inside a generator function. This example processes each record then yields it for writing. Iterators and generators are very useful for dealing with large files in Python:

from Bio import SeqIO

handle = open("ls_orchid.fasta", "r")
parsedfasta = SeqIO.parse(handle, "fasta")

def print_seqrec(seq_record):
    """Specific function to analyze a sequence record.
    """
    print "Description:\t%s..." % seq_record.description
    print "Length:\t\t%d" % len(seq_record)
    print "Sequence:\t%s\n" % seq_record.seq[:40], seq_record.seq[44:50]

def process_and_generate(input_iterator, process_function):
    """Reusable function that processes a record, then generates each record.

    input_iterator is an iterator that returns one record at a time
    process_function is a function that takes one record and does some
      processing on it
    """
    for rec in input_iterator:
        process_function(rec)
        yield rec

generator = process_and_generate(parsedfasta, print_seqrec)

output_handle = open("example.fasta", "w")
count = SeqIO.write(generator, output_handle, "fasta")
output_handle.close()
print "%d Sequences written to file" % count
ADD COMMENTlink modified 14 months ago by RamRS30k • written 10.0 years ago by Brad Chapman9.5k

Great answer! I'll try to integrate these ideas into my code. Thank you.

ADD REPLYlink written 10.0 years ago by Markus20
6
gravatar for Casbon
10.0 years ago by
Casbon3.2k
Casbon3.2k wrote:

Yes, you are consuming the generator with the for loop. Either call parse twice, or consume the generator in a list so that you can iterate over it more than once. This should work:

handle = open("ls_orchid.fasta", "fasta")
parsedfasta = list(Bio.SeqIO.parse(handle, "fasta"))

Notice we are using list to consume the iterator. For a large file this could take a lot of memory.

ADD COMMENTlink modified 14 months ago by RamRS30k • written 10.0 years ago by Casbon3.2k

Thanks a lot! It's nice to have some people around willing to help the helpless : )

ADD REPLYlink written 10.0 years ago by Markus20

open flag should be "r", not "fasta"

ADD REPLYlink written 10.0 years ago by User 927610
3
gravatar for Thaman
10.0 years ago by
Thaman3.3k
Finland
Thaman3.3k wrote:

The only thing you have to consider while parsing the result in case of Bio.SeqIO is:

  • Non-random access - Which means from Seq[0].......Seq[n] which is possbile just by iterator
  • Random access - Like your example which needs special built in python list function to turn the iterator into a list.
from Bio import SeqIO
handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
for seq_record in records:   
     print "Sequence:\t%s\n" % seq_record.seq[:40], seq_record.seq[44:50])
handle.close()

For more information visit SeqIO

Hope it will help

ADD COMMENTlink modified 14 months ago by RamRS30k • written 10.0 years ago by Thaman3.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1556 users visited in the last hour