Entering edit mode
7.9 years ago
sumithrasank75 ▴ 140
I have 2 fastq files F1.fastq and F2.fastq. F2.fastq is a smaller file which is a subset of reads from F1.fastq. I want reads in F1.fastq which ARE NOT in F2.fastq. The following python code does not seem to work. Can you suggest edits?
needed_reads =  reads_array =  chosen_array =  for x in Bio.SeqIO.parse("F1.fastq","fastq"): reads_array.append(x) for y in Bio.SeqIO.parse("F2.fastq","fastq"): chosen_array.append(y) for y in chosen_array: for x in reads_array: if str(x.seq) != str(y.seq) : needed_reads.append(x) output_handle = open("DIFF.fastq","w") SeqIO.write(needed_reads,output_handle,"fastq") output_handle.close()
Instead of reinventing the wheel, why not use existing programs? cmpfastq is one such program.
Try it with tiny input files (like 2 records each, with only one in common) and you should be able to track down the bug.
as for the reason it does not work: your program checks each read in file 1 against each read in file 2 and adds it to output every it these two do not match, basically adding the same read many many times to the output
Yes, this is what is happening, but I am unable to get a way to fix this.
Frankly you should look at the results others gave you since this code has many other problems as well regarding its scalability - may not work all that well on large files.
As for your code you need to store the data first in a set and match against it something like this (not tested just typing it in):
Instead of reinventing the wheel, why not use something like:
It appears that your post has been cross-posted to another site: http://stackoverflow.com/questions/30942734/getting-records-which-are-different-from-two-fastq-files
This is typically not recommended as it runs the risk of annoying people in both communities.