Question: Efficiently Iterating Over Fastq Records From Python
Asked 8.5 years ago by User 9996

How can I efficiently iterate, from Python, over long FASTQ records, and write them to a file if some condition matches? E.g. I want to go through the file, check each read ID for some property, and if it matches, serialize that entire entry to a new FASTQ file.

BioPython is very, very slow on my system. I'd like to do this from Python, but would not mind installing a C/C++ library with a Python interface.

To clarify the BioPython issue:

I open my FASTQ file (which is several GB in size) and iterate through the records:

from Bio import SeqIO

fastq_parser = SeqIO.parse(fastq_filename, "fastq")
for fastq_rec in fastq_parser:
    # if some condition is met, write the FASTQ record to file
    # ...
    SeqIO.write(fastq_rec, output_file, "fastq")

Perhaps it'd be faster if I write all the records at the end? But then I'd have to accumulate them in memory... in any case, this is very slow.

EDIT: after profiling, the culprit was SeqIO.write(). I believe SeqIO.parse() is slow too... but it's unbelievable how slow it all is, given the simplicity of FASTQ. It's too bad, because I really like the BioPython interface. But the solution was to roll my own; thanks to all who posted.

Any recommendations?

thanks.

— written 8.5 years ago by User 9996

Heng Li is right to recommend learning about Python's generator functions - one slowdown is calling SeqIO.write(...) many times instead of once (I see you say you profiled this). Heng Li is wrong about Biopython being slow because it copes with multi-line FASTQ; that may make a slight difference, but it is dwarfed by the object-creation penalty of the unified SeqRecord-based cross-file-format API (Bio.SeqIO). Biopython's low-level string parser is pretty competitive.

— written 8.5 years ago by Peter

Biopython is slow because it parses multi-line FASTQ and, supposedly, does error checking. It does more than you need. If you prefer the Biopython interface, learn how to implement it yourself (as a generator function). This will help you in the long run.

— written 8.5 years ago by lh3

Yeah, I tend to avoid Biopython parsing for simple file types, and not just because it can be slow. A lot of the time I find myself having to help others analyze stuff on computers that don't have Biopython installed. And I really think it's useful for beginners to learn simple parsing methods.

— written 8.5 years ago by Damian Kao

@Peter: parsing four-line FASTQ is over twice as fast. "Twice" is not a slight difference.

— written 8.5 years ago by lh3

Answer by Peter (Scotland, UK), 8.5 years ago:

Heng Li is right to recommend learning about generators - the main problem in your code is calling SeqIO.write(...) thousands of times instead of once.

If you want to use SeqRecord objects via SeqIO, then use a generator expression like this:

fastq_parser = SeqIO.parse(fastq_filename, "fastq") 
wanted = (rec for rec in fastq_parser if ...)
SeqIO.write(wanted, output_file, "fastq")

Or, equivalently, a generator function:

fastq_parser = SeqIO.parse(fastq_filename, "fastq") 
def my_filter(records):
    for rec in records:
        if ...:
            yield rec
SeqIO.write(my_filter(fastq_parser), output_file, "fastq")
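
For example, with a hypothetical condition (made up for illustration: keep only reads whose ID contains "barcode1"), the filter-and-write step collapses to a single SeqIO.write(...) call:

from Bio import SeqIO

fastq_parser = SeqIO.parse(fastq_filename, "fastq")
# Hypothetical condition: keep records whose ID contains "barcode1"
wanted = (rec for rec in fastq_parser if "barcode1" in rec.id)
count = SeqIO.write(wanted, output_file, "fastq")
print("Saved %i reads" % count)

SeqIO.write(...) accepts any iterable of SeqRecord objects, so the generator is consumed lazily and nothing is accumulated in memory.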

There are examples of this kind of thing in the Biopython Tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html - e.g. "Filtering a sequence file" in the "Cookbook" chapter.

However, if you don't need the SeqRecord objects and the unified cross-file-format API, you're paying an unnecessary speed penalty for their object creation. See http://news.open-bio.org/news/2009/09/biopython-fast-fastq/ for details.
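
In that case you can drop down to Biopython's low-level string parser. A minimal sketch of the same kind of filter (the output filename and the "barcode1" condition are again made up):

from Bio.SeqIO.QualityIO import FastqGeneralIterator

with open(fastq_filename) as in_handle, open("filtered.fastq", "w") as out_handle:
    for title, seq, qual in FastqGeneralIterator(in_handle):
        # title is the header line without the leading "@"
        if "barcode1" in title:
            out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))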

— written 8.5 years ago by Peter

@Peter: you may read my answer to the following question: http://biostar.stackexchange.com/questions/10376/how-to-efficiently-parse-a-huge-fastq-file. For Python and similar languages, parsing multi-line FASTQ is over twice as slow as parsing four-line FASTQ, and Biopython is nearly 20 times as slow. In my opinion, it is necessary to provide an alternative parser that is purely optimized for four-line FASTQ. From what I have heard, the inefficiency of the Biopython parser has pushed many users away.

— written 8.5 years ago by lh3

Is it faster to write a fastq file record by record or to make a list of records and then write the list of records?

— written 2.8 years ago by O.rka

Answer by lh3 (United States), 8.5 years ago:

If you have four-line FASTQ, write your own parser; it is very easy and much faster than Biopython. If you have multi-line FASTQ, use my readfq.py (with brentp's improvements).
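
To illustrate how little code such a parser needs, here is a minimal sketch (this is not readfq; the filename and the length condition are made up for the example):

def parse_fastq(handle):
    # Yield (name, seq, qual) tuples from a strictly four-line FASTQ file.
    while True:
        name = handle.readline().rstrip()
        if not name:
            break  # end of file
        seq = handle.readline().rstrip()
        handle.readline()  # "+" separator line, ignored
        qual = handle.readline().rstrip()
        yield name[1:], seq, qual  # drop the leading "@"

with open("reads.fastq") as handle:
    for name, seq, qual in parse_fastq(handle):
        if len(seq) >= 30:  # hypothetical condition
            print(name)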

Actually, your question is a duplicate of this question, listed as the first "related question" in the sidebar. You should read my answer to that question.

Generally, Biopython/BioPerl trade efficiency for modularity and robustness; they are frequently not the fastest implementations.

— written 8.5 years ago by lh3

It would be much faster than the high-level Biopython parser, but have you compared it with the equivalent in Biopython, the function FastqGeneralIterator, which also returns tuples of three strings? See http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

— written 8.5 years ago by Peter

Comparison added. FastqGeneralIterator is as fast as my readfq, but it parses FASTQ only (readfq seamlessly parses FASTA as well). This means the Biopython parser is still over twice as slow as a simplistic four-line FASTQ parser, and 6x as slow as a multi-line FASTA/Q C parser.

— written 8.5 years ago by lh3

Answer by Manu Prestat (Lyon, France), 8.5 years ago:

I have also experienced how slow SeqIO.parse() can be...

The only thing I can advise is to avoid this function and use a plain dict. For instance, to read a (one-line-per-sequence) FASTA file, I did it like this, and it runs much faster:

#### reading the fasta file (one line of sequence per record)
handle = open(fastafile)

seqs = {}
while True:
    try:
        seq_id = next(handle).strip("\n")  # header line, including the ">"
        seq = next(handle).strip("\n")     # sequence line
        seqs[seq_id] = seq
    except StopIteration:
        break

handle.close()

This should be easy to adapt to FASTQ.
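
For instance, a hypothetical four-line FASTQ adaptation of the same idea (the fastqfile name is made up):

seqs = {}
with open(fastqfile) as handle:
    while True:
        try:
            seq_id = next(handle).strip("\n")  # "@" header line
            seq = next(handle).strip("\n")     # sequence line
            next(handle)                       # "+" separator, ignored
            qual = next(handle).strip("\n")    # quality line
            seqs[seq_id] = (seq, qual)
        except StopIteration:
            break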

— written 8.5 years ago by Manu Prestat

If you want FASTQ files parsed as strings in Biopython, use FastqGeneralIterator - see http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

— written 8.5 years ago by Peter

Answer by blaise.li__biostars (France), 3.0 years ago:

In this recent Bioinformatics SE answer, I tried the new (as of June 2017) Python API for GATB. It contains a FASTA/FASTQ parser module called "Bank" that seems quite fast.

For instance, to only print reads at least 60 nucleotides long:

from gatb import Bank
fastq_parser = Bank(fastq_filename)
for seq in fastq_parser:
    if len(seq) >= 60:
        print("@{name}\n{seq}\n+\n{qual}".format(
            name=seq.comment.decode("utf-8"),
            seq=seq.sequence.decode("utf-8"),
            qual=seq.quality.decode("utf-8")))

See https://github.com/GATB/pyGATB/wiki/Bank-API for examples of how to access sequence features.

— written 3.0 years ago by blaise.li__biostars

Answer by hurfdurf (United States), 23 months ago:

Yet another option, but Python 3.5+ only:

https://github.com/lgautier/fastq-and-furious

— written 23 months ago by hurfdurf