Question: SeqIO.write() in biopython is extremely slow
0
gravatar for jincheng.wang1986
3.2 years ago by
Canada
jincheng.wang19860 wrote:

Hi there!

I've got this strange issue with SeqIO.write() in Biopython. It is extremely slow in one of my script but not in my another script  which is similar to that script. I timed the script at each stage so I knew where it is the SeqIO.write() problem.

The first script (fast) is to randomly select 500000 sequences from the fastq file (read 1). The second script is to find the matching sequences in another fastq file (read2). The first sequence took about 4 min in total for randomly selection from on average 900000 reads and to write in a file. But the second took almost 3 hours just to write in a file.

Please ignore the loop, it was just used to perform on all my files from all samples. Any help would be greatly appreciated. Thanks!

The 1st script (fast one):

#! /usr/bin/ python
#This script is to randomly select a pre-defined number of sequences from multiple fastq files (jackknifing) with python
from Bio import SeqIO
import glob
import random
import time
print time.time()
FNS=glob.glob("/home/et/Desktop/PP15_META/01Geneious/R1/*")
for FN in FNS:
    SEQ_INDEX=SeqIO.index(FN,"fastq") 
    SELECTED_SEQ_INDEX=random.sample(SEQ_INDEX,500000) # Randomly selected 500000 sequence index 
    SELECTED_SEQ=(SEQ_INDEX[SELECTION] for SELECTION in SELECTED_SEQ_INDEX) 
    FN_OUT=FN.replace(".fastq","_JACK.fastq")
    COUNT=SeqIO.write(SELECTED_SEQ,FN_OUT,"fastq")
print time.time()

The 2nd script (slow):

import time
import glob
from Bio import SeqIO
FNS_R2=glob.glob("/home/et/Desktop/PP15_META/01Geneious/*2.fastq")
FNS_R1=glob.glob("/home/et/Desktop/PP15_META/01Geneious/R1/*_JACK.fastq")
N_FNS=len(FNS_R1)
for i in range(1,len(FNS_R1)):
    print time.time()
    SEL_R1=SeqIO.parse(FNS_R1[i],"fastq")
    COM_R1=[record.id.split("_")[0] for record in SEL_R1]
    print "Target number of sequences is: ", len(COM_R1)
    print time.time()
    SEL_R2 = (record for record in SeqIO.parse(FNS_R2[i],"fastq") if record.id.split("_")[0] in COM_R1)
    print time.time()
    FN_OUT=FNS_R2[i].replace(".fastq","_JACK.fastq")
    print "Start to write"
    COUNT=SeqIO.write(SEL_R2, FN_OUT, "fastq")
    print COUNT
    print time.time()
    break

biopython • 1.5k views
ADD COMMENTlink modified 3.2 years ago by Peter5.7k • written 3.2 years ago by jincheng.wang19860
0
gravatar for matted
3.2 years ago by
matted7.0k
Boston, United States
matted7.0k wrote:

Biopython is convenient, but not super quick in my experience.  It also keeps everything in memory, at least the way it seems that you're using it.  Instead of writing your own script to do this this common task, I'd recommend using an existing and efficient tool like seqtk.

You can accomplish your core task in two lines with it:

seqtk sample input1.fq 500000 | seqtk seq -a - | grep -F ">" | cut -c 2- > readnames.txt
seqtk subseq input2.fq readnames.txt > output.fq

 

ADD COMMENTlink written 3.2 years ago by matted7.0k

Thanks! I'll try that. I was just curious why that happened. Two similar codes, but one is very slow. I tried to use a simple handle.write() to do the job for the slow script, but also very slow. I am start to guess it might not be the SeqIO.write() problem. I am using ubuntu in a VirtualBox. 

ADD REPLYlink written 3.2 years ago by jincheng.wang19860
1

I haven't looked too closely, so I might be wrong, but this line sticks out:

    SEL_R2 = (record for record in SeqIO.parse(FNS_R2[i],"fastq") if record.id.split("_")[0] in COM_R1)

You're comparing every record in FNS_R2[i] to every record in COM_R1 in an inefficient (quadratic) way.  If I'm right and this is the problem, you can work around it slightly by converting COM_R1 to a set first.  Then your search (the in test) will be constant time instead of a linear one going through the whole list sequentially.

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by matted7.0k

Thanks for replying. I understood the comparison issue. Since my problem is pretty simple so I did not optimize on that. And the problem of the script being slow was not due to that. I timed the script and it was just because of the SeqIO.write this single command.

Anyway, I usd seqtk and it works like a charm.

Thanks!

ADD REPLYlink written 3.2 years ago by jincheng.wang19860
1

It is good that you've measured things, but you have probably been mislead. The SeqIO.write call would look slow because it includes evaluating the generator expression SEL_R2 which @matted pointed out as a possible bottleneck.

Probably the simplest change is to make COM_R1 into a Python set rather than a list, which speeds up membership testing enormously.

ADD REPLYlink written 3.2 years ago by Peter5.7k

As written this is using iterators and generator expressions, so no, it does not keep everything in memory.

ADD REPLYlink written 3.2 years ago by Peter5.7k
0
gravatar for Peter
3.2 years ago by
Peter5.7k
Scotland, UK
Peter5.7k wrote:

If you want a reasonably fast Biopython based solution to the second task (picking out records from a FASTQ file given a list of IDs), try looking at the FASTQ part of this example:

https://github.com/peterjc/pico_galaxy/blob/master/tools/seq_select_by_id/seq_select_by_id.py

This uses SeqIO.index to scan the FASTQ file to get the identifiers and the file offsets, but never bothers to parse the FASTQ file into in-memory SeqRecord objects (which you would output using SeqIO.write). Instead it copies the raw record out of the source file into the destination file (using the get_raw method of the dictionary-like sequence file index).

 

ADD COMMENTlink written 3.2 years ago by Peter5.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1604 users visited in the last hour