Hi there!
I've got this strange issue with SeqIO.write()
in Biopython. It is extremely slow in one of my script but not in my another script which is similar to that script. I timed the script at each stage so I knew where it is the SeqIO.write()
problem.
The first script (fast) is to randomly select 500000 sequences from the fastq file (read 1). The second script is to find the matching sequences in another fastq file (read2). The first sequence took about 4 min in total for randomly selection from on average 900000 reads and to write in a file. But the second took almost 3 hours just to write in a file.
Please ignore the loop, it was just used to perform on all my files from all samples. Any help would be greatly appreciated. Thanks!
The 1st script (fast one):
#! /usr/bin/ python
#This script is to randomly select a pre-defined number of sequences from multiple fastq files (jackknifing) with python
from Bio import SeqIO
import glob
import random
import time
print time.time()
FNS=glob.glob("/home/et/Desktop/PP15_META/01Geneious/R1/*")
for FN in FNS:
SEQ_INDEX=SeqIO.index(FN,"fastq")
SELECTED_SEQ_INDEX=random.sample(SEQ_INDEX,500000) # Randomly selected 500000 sequence index
SELECTED_SEQ=(SEQ_INDEX[SELECTION] for SELECTION in SELECTED_SEQ_INDEX)
FN_OUT=FN.replace(".fastq","_JACK.fastq")
COUNT=SeqIO.write(SELECTED_SEQ,FN_OUT,"fastq")
print time.time()
The 2nd script (slow):
import time
import glob
from Bio import SeqIO
FNS_R2=glob.glob("/home/et/Desktop/PP15_META/01Geneious/*2.fastq")
FNS_R1=glob.glob("/home/et/Desktop/PP15_META/01Geneious/R1/*_JACK.fastq")
N_FNS=len(FNS_R1)
for I in range(1,len(FNS_R1)):
print time.time()
SEL_R1=SeqIO.parse(FNS_R1[i],"fastq")
COM_R1=[record.id.split("_")[0] for record in SEL_R1]
print "Target number of sequences is: ", len(COM_R1)
print time.time()
SEL_R2 = (record for record in SeqIO.parse(FNS_R2[i],"fastq") if record.id.split("_")[0] in COM_R1)
print time.time()
FN_OUT=FNS_R2[i].replace(".fastq","_JACK.fastq")
print "Start to write"
COUNT=SeqIO.write(SEL_R2, FN_OUT, "fastq")
print COUNT
print time.time()
break
Thanks! I'll try that. I was just curious why that happened. Two similar codes, but one is very slow. I tried to use a simple
handle.write()
to do the job for the slow script, but also very slow. I am start to guess it might not be theSeqIO.write()
problem. I am using ubuntu in a VirtualBox.I haven't looked too closely, so I might be wrong, but this line sticks out:
You're comparing every record in
FNS_R2[I]
to every record inCOM_R1
in an inefficient (quadratic) way. If I'm right and this is the problem, you can work around it slightly by converting COM_R1 to aset
first. Then your search (thein
test) will be constant time instead of a linear one going through the whole list sequentially.Thanks for replying. I understood the comparison issue. Since my problem is pretty simple so I did not optimize on that. And the problem of the script being slow was not due to that. I timed the script and it was just because of the
SeqIO.write
this single command.Anyway, I usd seqtk and it works like a charm.
Thanks!
It is good that you've measured things, but you have probably been mislead. The SeqIO.write call would look slow because it includes evaluating the generator expression
SEL_R2
which @matted pointed out as a possible bottleneck.Probably the simplest change is to make
COM_R1
into a Python set rather than a list, which speeds up membership testing enormously.As written this is using iterators and generator expressions, so no, it does not keep everything in memory.