I have a list of sequences and would like to find the fastq reads that contain them.
Is there a tool, or some way to script this, that can pull fastq reads matching a specific list of sequences?
My list of sequences looks like the following:
GATAAAAAAAAAAAAAAAC GATAAAAAAAAAAAAAACC GATAAAAAAAAAAAAAATC GATAAAAAAAAAAAAAAGC GATAAAAAAAAAAAAACAC GATAAAAAAAAAAAAACCC GATAAAAAAAAAAAAACTC GATAAAAAAAAAAAAATAC GATAAAAAAAAAAAAATCC GATAAAAAAAAAAAAATGC GATAAAAAAAAAAAAAGAC GATAAAAAAAAAAAAAGCC GATAAAAAAAAAAAAAGGC GATAAAAAAAAAAAACAAC GATAAAAAAAAAAAACACC GATAAAAAAAAAAAACCAC GATAAAAAAAAAAAACCCC GATAAAAAAAAAAAACCTC GATAAAAAAAAAAAATAAC GATAAAAAAAAAAAATCAC GATAAAAAAAAAAAATTAC GATAAAAAAAAAAAAGAAC GATAAAAAAAAAAAAGACC GATAAAAAAAAAAACAAAC GATAAAAAAAAAAACCCCC GATAAAAAAAAAAATAAAC GATAAAAAAAAAAAGAAAC GATAAAAAAAAAACAAAAC
. . . .
I have used grep to do this one sequence at a time, but it is taking too long (I have 40k 19-mers).
grep -A 2 -B 1 "CTCAAAAAAAAACAAAGGA" input.fastq |grep -v "^\-\-$" > output.fastq
Also, there is a problem with overlapping reads.
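
One way to avoid running grep 40k times might be a single pass over the fastq file that checks every 19-base window of each read against a set of the query sequences. Below is a minimal Python sketch of that idea, not a tested tool. The function names (load_kmers, read_fastq, filter_reads) are made up for illustration, and it assumes a plain 4-line-per-record fastq, a text file of whitespace-separated k-mers, and forward-strand matches only (no reverse complement). Each matching read is written once, even if it contains several of the query sequences.

```python
import sys
from itertools import islice


def load_kmers(path):
    """Read whitespace-separated k-mers from a text file into a set."""
    with open(path) as fh:
        return {tok.upper() for line in fh for tok in line.split()}


def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples, assuming 4 lines per record."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            break
        yield tuple(line.rstrip("\n") for line in record)


def filter_reads(records, kmers):
    """Yield each record whose sequence contains at least one k-mer, once."""
    lengths = {len(k) for k in kmers}  # normally just {19}
    for rec in records:
        seq = rec[1].upper()
        # Slide a window of each k-mer length across the read and
        # test set membership, which is O(1) per window on average.
        if any(
            seq[i:i + k] in kmers
            for k in lengths
            for i in range(len(seq) - k + 1)
        ):
            yield rec


# Hypothetical command line: python filter_fastq.py kmers.txt input.fastq > output.fastq
if __name__ == "__main__" and len(sys.argv) == 3:
    kmers = load_kmers(sys.argv[1])
    with open(sys.argv[2]) as fq:
        for rec in filter_reads(read_fastq(fq), kmers):
            sys.stdout.write("\n".join(rec) + "\n")
```

The set lookup means the cost per read depends on the read length rather than on the number of query sequences, so 40k 19-mers should cost about the same as 40. Matching on the parsed sequence line only also avoids accidental hits in header or quality lines.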