Question

Biopython Seqio.Index() And Seqio.Index_Db() Very Slow For Large Sequence

2

Entering edit mode

12.6 years ago

Orionzhou ▴ 10

Before switching to Biopython, I thought there are similar indexing features in biopython as in bioperl. However, the biopython SeqIO.index_db() and SeqIO.index() methods are so inefficient that it's almost impossible to random access a segment of genomic sequence using biopython.

I tested the performance of biopython and bioperl in retrieving a 10bp sequence from Arabidopsis chromosome 3, here is my code:

import sys
import os.path
sys.path.append(os.environ['PYTHON_HOME'] + '/lib/python2.6/site-packages')
sys.path.append(os.environ['PYTHON_HOME'] + '/lib64/python2.6/site-packages')
from time import clock, time
from subprocess import Popen, PIPE
from Bio import SeqIO

def seqret(f_seq, id, beg, end, strand=1):
    if not os.path.isfile(f_seq):
        sys.exit(f_seq + " is not there")
    f_idx = f_seq + ".idx"
    seq_idx = SeqIO.index_db(f_idx, f_seq, "fasta")
    if not id in seq_idx:
        sys.exit("cannot find " + id + "in " + f_seq)

    seq_obj = seq_idx[id].seq[beg-1:end]
    if strand == -1 or strand == "-":
        seq_obj = seq_obj.reverse_complement()
    seq_str = seq_obj.tostring()
    return seq_str

if __name__ == "__main__":

    seq_db = "/project/youngn/zhoup/Data/misc3/spada.crp/Athaliana/01_genome/01_refseq.fa"
    seq_id = "Chr3"
    seq_beg = 100001
    seq_end = 100010

    c0, t0 = clock(), time()
    print seqret(seq_db, seq_id, seq_beg, seq_end, 1)

    print "process time [%.2fs], wall time [%.2fs]" % (time()-t0, clock()-c0)

    c0, t0 = clock(), time()
    f_perl = "/soft/bioperl/5.16.1/bin/bp_seqret.pl"
    cmd = '%s -d %s -i %s -s %d -e %d' % (f_perl, seq_db, seq_id, seq_beg, seq_end)
    p = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
    lines = p.stdout.readlines()
    print lines
    print "process time [%.2fs], wall time [%.2fs]" % (time()-t0, clock()-c0)

And here's the output I got:

AAAAGGAGCT
process time [1.65s], wall time [1.59s]
['>Chr3_100001-100010\n', 'AAAAGGAGCT\n']
process time [0.26s], wall time [0.00s]

So it seems biopython SeqIO.index_db() is much less efficient than bioperl, is it because it only index the sequences by IDs regardless of how large they are? I also noticed that the index file (.idx) created by biopython is much smaller than Bioperl index files (.index).

biopython sequence • 5.8k views

ADD COMMENT • link updated 3.7 years ago by Vitis ★ 2.6k • written 12.6 years ago by Orionzhou ▴ 10

score 1 · Answer 1 · 2012-11-27

1

Entering edit mode

12.6 years ago

Peter 6.0k

The Biopython code you are using offers random access to a complete sequence, in this case the ENTIRE chromosome is being loaded. The expected usage is where you have a file (or files) with hundreds or throusands (or more) sequences, for example a set of genes, or a collection of bacterial chromosomes. Do you have any suggestions for how to clarify this in the documentation?

Presumably the BioPerl code you are comparing this to is indexing within the sequence, returning just the small chunk requested? I'm guessing since you've not shown the Perl code.

ADD COMMENT • link 12.6 years ago by Peter 6.0k

0

Entering edit mode

the perl script is part of BioPerl, you can check the code here (I'm to sure if it's the same version): http://www.gnu-darwin.org/www001/src/ports/biology/p5-bioperl-devel/work/bioperl-1.5.1/blib/script/bp_seqret.pl

ADD REPLY • link 12.6 years ago by JC 13k

0

Entering edit mode

Thanks for the answer, I guess the difference is Bioperl indexes sequences into chunks but Biopython does not. In this case, loading the entire chromosome into memory is not what I want in terms of random access.

ADD REPLY • link 12.6 years ago by Orionzhou ▴ 10

score 0 · Answer 2 · 2021-11-04

Looks like BioPython does take entire sequences into memory, which makes it hard to use for fetching a long list of sequence segments because it would soon run out of memory after a few entries. The fastafile module in pysam seems to be a better alternative if bioperl is to be phased out.

from pysam import FastaFile

ref = FastaFile("referencefasta")
seq = ref.fetch(reference = chr, start = start, end = end)

The reference fasta file needs to be indexed with samtools.