I have an Illumina library back from the sequencing facility with around 18M PE reads. The sequences are in two huge files (R1.fastq and R2.fastq), and the index reads (barcodes) are in a third file (I1.fastq). I filtered my reads before demultiplexing, so before I demultiplex I need to filter the index file so that it only contains the entries that correspond to the remaining reads. I have a Python script that is meant to output exactly those index records whose headers also appear in the reads .fastq file. The code below works but is incredibly slow (the files I am working with are very big, around 18M for the index file and 15M for the reads file), and I don't really know how to speed it up. Any suggestions? Thank you very much in advance, G.
# usage $: python match_index_to_reads_fastq.py reads.fastq index.fastq
import sys

reads = sys.argv[1]
index = sys.argv[2]

# read the whole reads file and keep every header line
input = open(reads, "r")
all_lines = input.readlines()
input.close()

all_readIDs = []
for i, line in enumerate(all_lines):
    if line.startswith("@HWI"):
        all_readIDs.append(line)

# read the whole index file
input2 = open(index, "r")
all_lines2 = input2.readlines()
input2.close()

# write each index record (header + 3 lines) whose header is also in the reads
output = open("filtered_index.fastq", "w")
for i, line in enumerate(all_lines2):
    if line.startswith("@HWI"):
        if line in all_readIDs:
            output.write(line + "".join(all_lines2[i+1:i+4]))
output.close()
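One likely cause of the slowness is the `line in all_readIDs` test, which scans a list of millions of headers once for every index record. A minimal sketch of a possible alternative is shown here. It stores the read headers in a set (constant-time lookup on average) and streams the index file record by record instead of loading it with `readlines()`. It assumes both files are strict 4-line FASTQ records and that matching headers are identical in both files, as in the script above. The function names `read_ids` and `filter_index` are made up for illustration.

```python
import sys
from itertools import islice


def read_ids(reads_path):
    """Collect the header line of every 4-line FASTQ record into a set."""
    with open(reads_path) as handle:
        # Assumption: strict 4-line records, so lines 0, 4, 8, ... are headers.
        return {line.rstrip("\n") for line in islice(handle, 0, None, 4)}


def filter_index(index_path, wanted_ids, out_path):
    """Write index records whose header is in wanted_ids; return how many were kept."""
    kept = 0
    with open(index_path) as handle, open(out_path, "w") as out:
        while True:
            record = list(islice(handle, 4))  # next header, sequence, '+', quality
            if not record:
                break
            # Set membership is O(1) on average, unlike a list scan.
            if record[0].rstrip("\n") in wanted_ids:
                out.writelines(record)
                kept += 1
    return kept


if __name__ == "__main__" and len(sys.argv) == 3:
    ids = read_ids(sys.argv[1])
    n = filter_index(sys.argv[2], ids, "filtered_index.fastq")
    print(f"kept {n} index records")
```

It would be called the same way as the original script (reads file first, index file second). The set of around 15M header strings still has to fit in memory. If the index headers differ from the read headers in the part after the first space, the comparison key would need to be adjusted, for example by comparing only the text before the space.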