Hey everybody. I have written a custom Python script (below), but it is too slow and I need pointers on where the code can be improved.
What this script is doing: I have two files. The first is a large fastq file (single-end reads), and the second is a text file with all of the header lines from the fastq file that I want to keep. The text file and the fastq file are in the same order, with some of the fastq reads not having a corresponding header line in the text file. The script walks down both files and writes the desired reads to a new file. It works, but it is too slow: my large fastq file is ~20 GB (~100 million reads), and my current estimate (from running it on a small sample) puts it at ~20 days to finish. As someone without much experience with big data, I'm sure there is something obvious I could be doing to speed this up; any help is appreciated.
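For reference, each fastq record is four lines, and the text file contains the full header line of each record I want to keep (the read name here is made up):

@read_001 some description
GATTTGGGGTTCAAAGCAGTATCGATC
+
!''*((((***+))%%%++)(%%%%).1

The corresponding line in the text file would be just "@read_001 some description".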
#!/usr/bin/python3
import sys
from itertools import islice

# first input (sys.argv[1]) is the original fq file name
# second input (sys.argv[2]) is the quality threshold used (e.g. '30')

print("\nentering python script")
header_file_name = sys.argv[1] + "_good_read_headers.txt"
newfile_name = sys.argv[1] + "_good_reads.fastq"
print("...making new fastq file for: " + newfile_name)

file_iterator = 0
head_iterator = 0
while True:
    four_line_fq, header = [], []
    # opening original fq file and pulling out the next 4-line record
    with open(sys.argv[1], 'r') as name_original:
        lines_gen = islice(name_original, int(file_iterator * 4), int(file_iterator * 4 + 4))
        for i in lines_gen:
            four_line_fq.append(i)
    # opening good header names only file and pulling out the next wanted header
    with open(header_file_name, 'r') as name_header:
        headers_gen = islice(name_header, head_iterator, head_iterator + 1)
        for i in headers_gen:
            header.append(i)
    if len(four_line_fq) == 0:  # if at end of original fq file
        print("...end of original fastq file reached, done writing")
        break
    # keep this record if its header line matches the next wanted header
    if header and four_line_fq[0] == header[0]:
        with open(newfile_name, 'a+') as new:
            for i in four_line_fq:
                new.write(i)
        head_iterator += 1
    file_iterator += 1

print("...exiting python script\n")
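From reading around, I suspect the slow part is that I re-open both files and islice from the start of each one on every iteration, so the work grows with the square of the file size. Is a single pass like the sketch below the right direction? (This is just my untested attempt at the structure; the file names are the same as above, and it assumes both files stay in the same order.)

#!/usr/bin/python3
import sys

fastq_name = sys.argv[1]                                   # original fq file
header_file_name = sys.argv[1] + "_good_read_headers.txt"  # good header names only file
newfile_name = sys.argv[1] + "_good_reads.fastq"

# open all three files once and keep them open for the whole pass
with open(fastq_name) as fq, \
     open(header_file_name) as heads, \
     open(newfile_name, 'w') as out:
    wanted = heads.readline()                       # next header line to keep
    while True:
        record = [fq.readline() for _ in range(4)]  # one 4-line fastq record
        if not record[0]:                           # end of fastq file
            break
        if wanted and record[0] == wanted:          # this read is on the keep list
            out.writelines(record)
            wanted = heads.readline()               # advance the keep list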