Question: fastq parsing inefficiencies with python
brismiller (Bellingham, WA, USA) wrote, 16 days ago:

Hey everybody. I have written a custom Python script (below), but it is too slow and I need pointers as to where the code can be improved.

What this script is doing: I have two files. The first is a large fastq file (single-end reads); the second is a text file with all of the header lines from the fastq file that I want to keep. The text file and the fastq file are in the same order, with some of the fastq reads not having a corresponding header line in the text file. The script walks down both files and writes the desired reads to a new file. It works, but it is too slow: my fastq file is ~20 GB (~100 million reads), and my current estimate (from a run on a small sample) puts it at ~20 days to finish. As someone without much experience in big data, I'm sure there is something obvious I could be doing to speed this up; any help is appreciated.


import sys
from itertools import islice
# first input (sys.argv[1]) is the original fq file name
# second input (sys.argv[2]) is the quality threshold used (e.g. '30')

print("\nentering python script")

header_file_name = sys.argv[2] + "_good_read_headers.txt"
newfile_name = sys.argv[2] + "_good_reads.fastq"

print("...making new fastq file for: " + newfile_name)

file_iterator = 0
head_iterator = 0

while True:
    four_line_fq, header = [], []

    with open(sys.argv[1], 'r') as name_original:  # opening original fq file
        lines_gen = islice(name_original, file_iterator * 4, file_iterator * 4 + 4)
        for i in lines_gen:
            four_line_fq.append(i)

    with open(header_file_name, 'r') as name_header:  # opening good header names only file
        headers_gen = islice(name_header, head_iterator, head_iterator + 1)
        for i in headers_gen:
            header.append(i)

    if len(four_line_fq) == 0:  # if at end of original fq file
        print("...end of original fastq file reached, done writing")
        break

    if header and four_line_fq[0] == header[0]:
        with open(newfile_name, 'a+') as new:
            for i in four_line_fq:
                new.write(i)
        head_iterator += 1
    file_iterator += 1

print("...exiting python script\n")
modified 16 days ago by Bastien Hervé • written 16 days ago by brismiller

Completely agree with Ram. Quickly jump to Biopython.

written 16 days ago by Vijay Lakhujani

Your strategy of re-opening the files for each slice is very time consuming. You should open each input file once, then iterate through the lines. Same goes for the output file which should be opened just once.
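For illustration, a minimal sketch of that single-pass idea, with tiny demo files standing in for the real inputs (all file names and data here are hypothetical):

```python
from itertools import islice

# Tiny demo data, a stand-in for the real 20 GB files.
with open("reads.fastq", "w") as f:
    f.write("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n")
with open("good_headers.txt", "w") as f:
    f.write("@r1\n@r3\n")

# Every file is opened exactly once and read front to back.
with open("reads.fastq") as fq, open("good_headers.txt") as hd, \
        open("good_reads.fastq", "w") as out:
    wanted = next(hd, None)           # next header we are looking for
    while wanted is not None:
        record = list(islice(fq, 4))  # one 4-line fastq record
        if not record:
            break                     # end of the fastq file
        if record[0] == wanted:
            out.writelines(record)
            wanted = next(hd, None)   # advance to the next wanted header
```

Because both files are in the same order, the cost is one pass over the fastq file in total, instead of one pass per record.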

modified 16 days ago • written 16 days ago by jomo018

Why are you writing a parser yourself? There are multiple tools that do this. IMO people have to stop writing stuff from scratch unless they have an extremely good reason to do so.

written 16 days ago by Ram

Although I agree that using well tested and optimized modules is the way to go, a good reason to write your own parser would be to learn how to do the programming yourself, rather than using the code of others...

written 16 days ago by WouterDeCoster

For tasks like these that are widely performed on large files, I guess I prioritize getting stuff done over learning efficiency intricacies. But that's my bias against tools that re-invent the wheel.

written 16 days ago by Ram
Bastien Hervé (Limoges, CBRS, France) wrote, 16 days ago:

Using Biopython, as advised. This assumes you have a header.txt file with headers in one column (separated by "\n"):

from Bio import SeqIO

###Open your result file
filtered_fastq = open("your_filtered_fastq.fastq", "a")
###Create an array to keep headers
header_array = []
###Stock headers in header_array (assumed to match record ids, i.e. no leading "@")
with open('header.txt', 'r') as f:
    for line in f:
        ###Remove the \n of your header
        header_array.append(line.rstrip())
###Read your fastq file
###SeqIO reads the file record by record. A record holds (name of read, sequence, "+", quality)
for record in SeqIO.parse('your_fastq.fastq', 'fastq'):
    ###For each read, if the fastq header exists in your header_array, the record is written to your result file
    if record.id in header_array:
        ###Write the record
        SeqIO.write(record, filtered_fastq, 'fastq')
###Close file
filtered_fastq.close()
I'm interested in the final running time, if you could write it in a comment. Thanks!

written 16 days ago by Bastien Hervé

Or throw in a list comprehension because those are neat:

header_array=[line.rstrip() for line in open('header.txt', 'r')]
written 16 days ago by WouterDeCoster

Right, 0.15s faster per million lines, but still faster and cleaner!

written 16 days ago by Bastien Hervé

Welp that's an impressive speedup :-D I'm just a list/dict comprehension fanboy...

written 16 days ago by WouterDeCoster

This is not an efficient solution because you are not taking advantage of the details stated by brismiller, mainly that the files are ordered. There is no reason to hold 100M headers in memory, and there is no reason to search for each fastq header among those 100M possible headers. Actually, brismiller does exploit the ordering of the files, albeit incorrectly.

written 16 days ago by jomo018

I think brismiller is storing the entire fastq file in "lines_gen"

lines_gen = islice(name_original, int(file_iterator * 4), int(file_iterator * 4 + 4))

and the same for headers, which is time consuming and memory expensive.

I see what you mean by "exploit the ordering", but if you know a way to catch a record without reading the whole file or without holding it in memory, I am interested to know how.

Moreover, even if the headers are ordered, header 10 in the fastq doesn't necessarily correspond to header 10 in the header file.

written 16 days ago by Bastien Hervé

Well, we could try to exploit this ordering by creating a generator from the headers rather than a list, since we only care for a match of the fastq header to the next() element?
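A dependency-free sketch of that generator idea, consuming headers lazily with next() instead of holding them all in a list (the demo file names and contents are made up; a real script would plug in its own paths):

```python
# Tiny demo inputs, stand-ins for the real files.
with open("header.txt", "w") as f:
    f.write("@r2\n@r3\n")
with open("your_fastq.fastq", "w") as f:
    f.write("@r1\nAAAA\n+\nIIII\n@r2\nCCCC\n+\nIIII\n@r3\nGGGG\n+\nIIII\n")

def fastq_records(path):
    """Yield (header, 4-line chunk) pairs from a fastq file, one at a time."""
    with open(path) as f:
        while True:
            chunk = [f.readline() for _ in range(4)]
            if not chunk[0]:
                return
            yield chunk[0].rstrip("\n"), chunk

headers = (line.rstrip("\n") for line in open("header.txt"))  # lazy generator
wanted = next(headers, None)
with open("your_filtered_fastq.fastq", "w") as out:
    for name, chunk in fastq_records("your_fastq.fastq"):
        if wanted is None:
            break                     # nothing left to look for
        if name == wanted:
            out.writelines(chunk)
            wanted = next(headers, None)
```

Since both inputs are ordered, each fastq header only ever needs to be compared against the single next() element, so memory stays constant no matter how many headers there are.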

written 16 days ago by WouterDeCoster

Thank you everyone for your comments, and thanks Bastien for your code. Biopython is a fantastic resource!!

As stated previously, the total fastq file is 107.4 million reads (21.9 GB). I took the first 100,000 reads and ran Bastien's code on them; it took 51.98268 seconds. This is a great improvement, but it still puts me at 17.29 hours to parse the total fastq file.

written 15 days ago by brismiller

I bet there is a faster way somewhere, like said above, by taking advantage of the ordering... If you have time, maybe try it in the C language. Also, if you really want to improve the speed, try to find which operation takes the longest to process. I don't think filling header_array will be the problem; I'd rather bet on SeqIO.write(record, filtered_fastq, 'fastq'). How many reads do you have to keep, in other words, how many headers do you have? You could also try to split your fastq file and multiprocess the header search.
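On finding the slow operation: one cheap thing to check first is the `if record.id in header_array` test, which scans the whole list on every read. A quick, self-contained timing sketch (the sizes here are made up) comparing a list against a set for membership:

```python
import timeit

# Hypothetical header names, far fewer than the real ~100M.
headers = ["@read_%d" % i for i in range(100000)]
header_list = list(headers)
header_set = set(headers)    # hashed: O(1) average membership test

probe = "@read_99999"        # worst case for the linear list scan
list_t = timeit.timeit(lambda: probe in header_list, number=100)
set_t = timeit.timeit(lambda: probe in header_set, number=100)
print(set_t < list_t)        # the set lookup wins by orders of magnitude
```

With ~100M headers the gap between an O(n) list scan per read and an O(1) set lookup is enormous, so changing `header_array` to a set is worth trying before reaching for C or multiprocessing.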

written 15 days ago by Bastien Hervé