Hi,
I ran paired-end sequencing, so I have R1 and R2 files. I need help extracting a sequence of varying length (19-22 bases, so I can't use fixed positions) that is flanked by ATATC and TTTAA. An example read looks like this:
@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT
AGTAATCTGGGGACGAGGCAAGCTAAGATATCTTGCCGCGGCTGTTTTTGCTTTTAAATGCGAAGTAAGGCGGGA
+
ACCCCFGGGGGGGGGGGEGGGGGGFGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG
What I want is to keep each matching read, but with only the sequence flanked by ATATC and TTTAA:
@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT
TTGCCGCGGCTGTTTTTGC
+
ACCCCFGGGGGGGGGGGEGGGGGGFGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG
I've tried fastq-grep and ggrep, but both return the whole reads that contain the flanks, not just the flanked subsequence of those reads:
fastq-grep 'ATATC.{19,22}TTTAA' R1.fastq > test.fastq
ggrep -P -B 1 -A 2 "^\w+ATATC\K[ATGCN]+(?=TTTAA\w+$)" R1.fastq | ggrep -P -v "^--$" > test2.fastq
So, what I need is:
1. A grep command that extracts only this 19-22 base sequence based on exact flanks.
2. A way to extract this 19-22 base sequence based on the flanks, but with 1 base mismatch allowed at any position in the flanks, if this can be done at all.
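To make the two requirements concrete, here is a minimal Python sketch of the behaviour I'm after, rather than a grep one-liner (plain grep has no built-in notion of mismatches). The names `find_insert` and `trim_fastq` are hypothetical. It assumes 4-line FASTQ records, at most one mismatch per flank (not one in total), and that a hit with exact flanks should be preferred over one with mismatched flanks.

```python
from typing import Iterable, Iterator, Optional, Tuple

LEFT, RIGHT = "ATATC", "TTTAA"   # flanks from the question
MIN_LEN, MAX_LEN = 19, 22        # allowed insert lengths


def within(a: str, b: str, max_mm: int) -> bool:
    """True if a and b have equal length and differ at no more than max_mm positions."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= max_mm


def find_insert(seq: str, max_mm: int = 0) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first flanked insert, or None.

    Assumption: max_mm applies to each flank separately. Lower tolerances
    are tried first, so exact flanks win over mismatched ones.
    """
    for allowed in range(max_mm + 1):
        for i in range(len(seq) - len(LEFT) + 1):
            if not within(seq[i:i + len(LEFT)], LEFT, allowed):
                continue
            start = i + len(LEFT)
            # Try the shortest permitted insert first.
            for n in range(MIN_LEN, MAX_LEN + 1):
                end = start + n
                if within(seq[end:end + len(RIGHT)], RIGHT, allowed):
                    return start, end
    return None


def trim_fastq(lines: Iterable[str], max_mm: int = 0) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (header, insert, '+', quality) for reads that contain a flanked insert.

    The quality string is sliced with the same coordinates as the sequence,
    so both stay the same length.
    """
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups the lines into records.
    for header, seq, _plus, qual in zip(*[stripped] * 4):
        hit = find_insert(seq, max_mm)
        if hit is not None:
            start, end = hit
            yield header, seq[start:end], "+", qual[start:end]


if __name__ == "__main__":
    record = [
        "@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT",
        "AGTAATCTGGGGACGAGGCAAGCTAAGATATCTTGCCGCGGCTGTTTTTGCTTTTAAATGCGAAGTAAGGCGGGA",
        "+",
        "ACCCCFGGGGGGGGGGGEGGGGGGFGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG",
    ]
    for rec in trim_fastq(record, max_mm=1):
        print("\n".join(rec))
```

For a real file, something like `trim_fastq(open("R1.fastq"), max_mm=1)` should work the same way. Note that on the example read the first exact TTTAA begins one base after the 19-base sequence shown above, so exact matching returns 20 bases (TTGCCGCGGCTGTTTTTGCT).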
Any help would be greatly appreciated. Thanks