I have a FASTQ file with a lot of reads. I expect many of the sequences to be identical, and I want to count the occurrences of each unique sequence.
I am using Python and Biopython, and I am trying to make this efficient for a large file. Are there any suggestions on how to do this?
What I have so far uses Biopython's fast FastqGeneralIterator and an MD5 hash of each sequence:
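    import hashlib
    from Bio.SeqIO.QualityIO import FastqGeneralIterator

    list_digest = []  # digests of the sequences seen so far
    for title, seq, quals in FastqGeneralIterator(file_read_handle):
        seq_digest = hashlib.md5(seq.encode()).digest()
        if seq_digest in list_digest:
            ...
        else:
            list_digest.append(seq_digest)
            ...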
Is there any other technique for searching for exact sequence matches which might be more efficient?
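For instance, would a dictionary keyed on the sequence itself be better than searching a list? Here is a minimal sketch of what I mean, assuming the records arrive as (title, seq, qual) tuples, which is the shape FastqGeneralIterator yields. The name `count_sequences` and the sample reads are made up for illustration, and I have not measured this on a large file.

```python
from collections import Counter


def count_sequences(records):
    """Count occurrences of each unique sequence.

    `records` is any iterable of (title, seq, qual) tuples.
    The function name is hypothetical, chosen for this sketch.
    """
    counts = Counter()
    for _title, seq, _qual in records:
        # Dictionary lookup is O(1) on average, whereas `x in some_list`
        # scans the whole list every time.
        counts[seq] += 1
    return counts


# Tiny in-memory stand-in for a real FASTQ file (made-up reads).
example_records = [
    ("read1", "ACGT", "IIII"),
    ("read2", "ACGT", "IIII"),
    ("read3", "TTGA", "IIII"),
]
example_counts = count_sequences(example_records)
```

With a real file this would presumably be called as `count_sequences(FastqGeneralIterator(file_read_handle))`. Keying on the raw sequence avoids the hashing step, at the cost of keeping every unique sequence string in memory.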
Thanks very much.