Question: Biopython Seqio.Index() And Seqio.Index_Db() Very Slow For Large Sequence
gravatar for Orionzhou
8.0 years ago by
St. Paul, MN
Orionzhou0 wrote:

Before switching to Biopython, I thought there are similar indexing features in biopython as in bioperl. However, the biopython SeqIO.index_db() and SeqIO.index() methods are so inefficient that it's almost impossible to random access a segment of genomic sequence using biopython.

I tested the performance of biopython and bioperl in retrieving a 10bp sequence from Arabidopsis chromosome 3, here is my code:

import sys
import os.path
sys.path.append(os.environ['PYTHON_HOME'] + '/lib/python2.6/site-packages')
sys.path.append(os.environ['PYTHON_HOME'] + '/lib64/python2.6/site-packages')
from time import clock, time
from subprocess import Popen, PIPE
from Bio import SeqIO

def seqret(f_seq, id, beg, end, strand=1):
    if not os.path.isfile(f_seq):
        sys.exit(f_seq + " is not there")
    f_idx = f_seq + ".idx"
    seq_idx = SeqIO.index_db(f_idx, f_seq, "fasta")
    if not id in seq_idx:
        sys.exit("cannot find " + id + "in " + f_seq)

    seq_obj = seq_idx[id].seq[beg-1:end]
    if strand == -1 or strand == "-":
        seq_obj = seq_obj.reverse_complement()
    seq_str = seq_obj.tostring()
    return seq_str

if __name__ == "__main__":

    seq_db = "/project/youngn/zhoup/Data/misc3/spada.crp/Athaliana/01_genome/01_refseq.fa"
    seq_id = "Chr3"
    seq_beg = 100001
    seq_end = 100010

    c0, t0 = clock(), time()
    print seqret(seq_db, seq_id, seq_beg, seq_end, 1)

    print "process time [%.2fs], wall time [%.2fs]" % (time()-t0, clock()-c0)

    c0, t0 = clock(), time()
    f_perl = "/soft/bioperl/5.16.1/bin/"
    cmd = '%s -d %s -i %s -s %d -e %d' % (f_perl, seq_db, seq_id, seq_beg, seq_end)
    p = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
    lines = p.stdout.readlines()
    print lines
    print "process time [%.2fs], wall time [%.2fs]" % (time()-t0, clock()-c0)

And here's the output I got:

process time [1.65s], wall time [1.59s]
['>Chr3_100001-100010\n', 'AAAAGGAGCT\n']
process time [0.26s], wall time [0.00s]

So it seems biopython SeqIO.index_db() is much less efficient than bioperl, is it because it only index the sequences by IDs regardless of how large they are? I also noticed that the index file (.idx) created by biopython is much smaller than Bioperl index files (.index).

sequence biopython • 3.2k views
ADD COMMENTlink modified 8.0 years ago by Peter5.8k • written 8.0 years ago by Orionzhou0
gravatar for Peter
8.0 years ago by
Scotland, UK
Peter5.8k wrote:

The Biopython code you are using offers random access to a complete sequence, in this case the ENTIRE chromosome is being loaded. The expected usage is where you have a file (or files) with hundreds or throusands (or more) sequences, for example a set of genes, or a collection of bacterial chromosomes. Do you have any suggestions for how to clarify this in the documentation?

Presumably the BioPerl code you are comparing this to is indexing within the sequence, returning just the small chunk requested? I'm guessing since you've not shown the Perl code.

ADD COMMENTlink modified 8.0 years ago • written 8.0 years ago by Peter5.8k

the perl script is part of BioPerl, you can check the code here (I'm to sure if it's the same version):

ADD REPLYlink written 8.0 years ago by JC12k

Thanks for the answer, I guess the difference is Bioperl indexes sequences into chunks but Biopython does not. In this case, loading the entire chromosome into memory is not what I want in terms of random access.

ADD REPLYlink written 8.0 years ago by Orionzhou0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1433 users visited in the last hour