I've got a strange issue with SeqIO.write() in Biopython. It is extremely slow in one of my scripts but not in another, very similar script. I timed each stage of the script, so I know where the time is being spent.
The first script (the fast one) randomly selects 500000 sequences from a fastq file (read 1). The second script finds the matching sequences in another fastq file (read 2). The first script took about 4 min in total to randomly select from on average 900000 reads and write them to a file, but the second took almost 3 hours just to write to a file.
Please ignore the loop; it was just used to run over all files from all samples. Any help would be greatly appreciated. Thanks!
The 1st script (fast one):
#!/usr/bin/env python
# This script randomly selects a pre-defined number of sequences
# from multiple fastq files (jackknifing)
from Bio import SeqIO
import glob
import random
import time

print time.time()
FNS = glob.glob("/home/et/Desktop/PP15_META/01Geneious/R1/*")
for FN in FNS:
    SEQ_INDEX = SeqIO.index(FN, "fastq")
    # Randomly select 500000 sequence IDs
    SELECTED_SEQ_INDEX = random.sample(SEQ_INDEX, 500000)
    SELECTED_SEQ = (SEQ_INDEX[SELECTION] for SELECTION in SELECTED_SEQ_INDEX)
    FN_OUT = FN.replace(".fastq", "_JACK.fastq")
    COUNT = SeqIO.write(SELECTED_SEQ, FN_OUT, "fastq")
    print time.time()
The 2nd script (slow):
import time
import glob
from Bio import SeqIO

FNS_R2 = glob.glob("/home/et/Desktop/PP15_META/01Geneious/*2.fastq")
FNS_R1 = glob.glob("/home/et/Desktop/PP15_META/01Geneious/R1/*_JACK.fastq")
N_FNS = len(FNS_R1)
for I in range(N_FNS):
    print time.time()
    SEL_R1 = SeqIO.parse(FNS_R1[I], "fastq")
    COM_R1 = [record.id.split("_") for record in SEL_R1]
    print "Target number of sequences is: ", len(COM_R1)
    print time.time()
    SEL_R2 = (record for record in SeqIO.parse(FNS_R2[I], "fastq")
              if record.id.split("_") in COM_R1)
    print time.time()
    FN_OUT = FNS_R2[I].replace(".fastq", "_JACK.fastq")
    print "Start to write"
    COUNT = SeqIO.write(SEL_R2, FN_OUT, "fastq")
    print COUNT
    print time.time()
    break
Thanks! I'll try that. I was just curious why it happened: two similar scripts, but one is very slow. I tried a plain handle.write() to do the job in the slow script, but it was also very slow, so I'm starting to suspect the problem is not SeqIO.write() itself. I am using Ubuntu in a VirtualBox.
I haven't looked too closely, so I might be wrong, but this line sticks out:

    SEL_R2 = (record for record in SeqIO.parse(FNS_R2[I], "fastq")
              if record.id.split("_") in COM_R1)

You're comparing every record in FNS_R2[I] against every entry in COM_R1, which is quadratic. If I'm right and this is the problem, you can work around it by converting COM_R1 to a set first. Then your membership check (the in test) will be constant time on average instead of a linear scan through the whole list.
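To see the size of the difference, here is a standalone sketch with made-up read IDs (not your fastq data): list membership scans the list element by element, while set membership hashes the key once.

```python
import time

ids = ["read_%d" % i for i in range(200_000)]
tail = ids[-200:]  # keys near the end of the list: worst case for a scan

# List membership: every `in` test scans most of the list,
# so n lookups against an m-element list cost O(n * m).
t0 = time.perf_counter()
hits_list = sum(1 for q in tail if q in ids)
list_time = time.perf_counter() - t0

# Set membership: each `in` test is O(1) on average,
# so even 200000 lookups are cheap.
wanted = set(ids)
t0 = time.perf_counter()
hits_set = sum(1 for q in ids if q in wanted)
set_time = time.perf_counter() - t0

assert hits_list == 200 and hits_set == len(ids)
```

One wrinkle for the script in the question: record.id.split("_") returns a list, which is not hashable, so the keys would need to be converted with tuple(...) before they can go into a set.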
Thanks for replying. I understand the comparison issue; since my problem is pretty simple, I did not optimize that part, and the slowness was not due to it. I timed the script, and the time all went to this single SeqIO.write command.
Anyway, I used seqtk and it works like a charm.
It is good that you've measured things, but you have probably been misled. The SeqIO.write call only looks slow because it is what actually evaluates the generator expression SEL_R2, which @matted pointed out as the likely bottleneck.
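A small illustration of why the timing is misleading (a generic sketch, not Biopython-specific): a generator expression does no work when it is created, and all the deferred filtering runs when the generator is consumed, e.g. inside a write call.

```python
import time

def slow_source():
    # Stand-in for an expensive per-record check
    for i in range(3):
        time.sleep(0.05)
        yield i

t0 = time.perf_counter()
gen = (x for x in slow_source())  # building the generator is effectively instant
build_time = time.perf_counter() - t0

t0 = time.perf_counter()
result = list(gen)                # all the deferred work happens here
consume_time = time.perf_counter() - t0

assert build_time < consume_time
assert result == [0, 1, 2]
```

So a timer placed just before SeqIO.write charges the whole cost of the filtering to the write call.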
Probably the simplest change is to make COM_R1 a Python set rather than a list, which speeds up membership testing enormously. (Since record.id.split("_") returns a list, which is not hashable, convert each key to a tuple before putting it in the set.)
As written, the script uses iterators and generator expressions, so no, it does not keep everything in memory.