How can I efficiently iterate, from Python, over long FASTQ records and write them to a file if some condition matches? E.g. I want to go through the file, check each read ID for some property, and if it matches, serialize that entire entry to a new FASTQ file.
Biopython is very slow on my system. I'd like to do this from Python, but I wouldn't mind installing a C/C++ library with a Python interface.
To clarify the Biopython issue:
I open my FASTQ file (which is several GB in size) and iterate through the records:

    from Bio import SeqIO

    fastq_parser = SeqIO.parse(fastq_filename, "fastq")
    for fastq_rec in fastq_parser:
        # if some condition is met, write the FASTQ record to file
        # ...
        SeqIO.write(fastq_rec, output_file, "fastq")
Perhaps it would be faster to write all the records at the end, but then I'd have to accumulate them in memory. In any case, this is very slow.
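One way to avoid most of the `SeqRecord` overhead while staying inside Biopython is its lower-level `FastqGeneralIterator`, which yields plain `(title, seq, qual)` string tuples so you can write matching records back out as raw text instead of calling `SeqIO.write()` per record. A sketch of that approach (the `filter_fastq` function and the `keep` predicate are illustrative names, not part of any API):

```python
from Bio.SeqIO.QualityIO import FastqGeneralIterator

def filter_fastq(in_path, out_path, keep):
    """Copy records whose title line satisfies keep() to out_path.

    keep() receives the title string (the "@" header line without
    the leading "@"); records are re-serialized as raw text, which
    skips SeqRecord construction and per-record SeqIO.write() calls.
    """
    with open(in_path) as fin, open(out_path, "w") as fout:
        for title, seq, qual in FastqGeneralIterator(fin):
            if keep(title):
                fout.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
```

For example, `filter_fastq("in.fastq", "out.fastq", lambda t: t.startswith("SRR"))` would keep only reads whose IDs start with "SRR".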
EDIT: after profiling, the culprit was SeqIO.write(). I believe SeqIO.parse() is slow too, but it's unbelievable how slow it all is, given the simplicity of FASTQ. It's too bad, because I really like the Biopython interface, but the solution was to roll my own. Thanks to all who posted.
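Rolling your own can be as simple as reading four lines per record and writing the matching ones back verbatim. A minimal sketch, assuming well-formed four-line FASTQ records (no line-wrapped sequences, which holds for typical sequencer output); the function names here are illustrative:

```python
def iter_fastq(handle):
    """Yield each record as a (header, seq, plus, qual) tuple of raw
    lines (newlines preserved), assuming strict 4-line records."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            break
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield header, seq, plus, qual

def filter_fastq_raw(in_path, out_path, keep):
    """Copy records whose raw "@..." header line satisfies keep().

    Records are written back byte-for-byte, so there is no parsing
    or re-serialization cost beyond the header check.
    """
    with open(in_path) as fin, open(out_path, "w") as fout:
        for rec in iter_fastq(fin):
            if keep(rec[0]):
                fout.writelines(rec)
```

Because each record is copied through unchanged, this does the minimum work possible in pure Python; the trade-off is that it performs no validation at all.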