Question: How to extract a sequence by its flanks from FASTQ R1 reads
20 months ago by
doc.pombe0 wrote:


I ran a paired-end sequencing run, so I have R1 and R2 reads. I need help extracting a sequence of varying length (19-22 bases, so I can't extract it by position) that is flanked by ATATC and TTTAA. A read would look like this, for example:



What I want is to retrieve each read trimmed down to only the sequence flanked by ATATC and TTTAA:

@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT




I've used fastq-grep and ggrep, but they retrieve the whole reads that contain the flanks, not just the flanked subsequence of those reads:

fastq-grep 'ATATC.({19,21})TTTAA' R1.fastq > test.fastq

ggrep -P -B 1 -A 2 "^\w+ATATC\K[ATGCN]+(?=TTTAA\w+$)" R1.fastq | ggrep -P -v "^--$" >test2.fastq

So, what I need is:
1. A grep command that extracts only this 19-22 base sequence based on exact flanks.
2. A way to extract this 19-22 base sequence based on the flanks, but with 1 base mismatch allowed at any position in the flanks, if this can be done at all.

Any help would be greatly appreciated. Thanks.
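
The two requirements above can be sketched in Python rather than grep, since grep on its own cannot trim the quality line to match. This is a minimal sketch under stated assumptions, not a tested pipeline: it assumes an uncompressed FASTQ with exactly four lines per record, takes the flanks (ATATC, TTTAA) and the 19-22 base range from the question, and treats "1 base mismatch" as one substitution per flank (no insertions or deletions). The names `build_regex`, `extract_insert`, and `trim_fastq` are hypothetical helpers made up for this illustration.

```python
import re
from itertools import islice

LEFT, RIGHT = "ATATC", "TTTAA"  # flanks from the question
MIN_LEN, MAX_LEN = 19, 22       # insert length range from the question


def one_mismatch_pattern(flank):
    """Regex alternation matching `flank` with at most one substituted base."""
    # Replacing position i with a character class also covers the exact flank.
    variants = [flank[:i] + "[ACGTN]" + flank[i + 1:] for i in range(len(flank))]
    return "(?:" + "|".join(variants) + ")"


def build_regex(allow_one_mismatch=False):
    """Compile a pattern with a named group 'insert' between the two flanks."""
    left = one_mismatch_pattern(LEFT) if allow_one_mismatch else LEFT
    right = one_mismatch_pattern(RIGHT) if allow_one_mismatch else RIGHT
    return re.compile(f"{left}(?P<insert>[ACGTN]{{{MIN_LEN},{MAX_LEN}}}){right}")


def extract_insert(seq, qual, regex):
    """Return (insert, matching quality slice) for the first hit, else None."""
    m = regex.search(seq)
    if m is None:
        return None
    start, end = m.span("insert")
    return seq[start:end], qual[start:end]


def trim_fastq(in_path, out_path, allow_one_mismatch=False):
    """Write only reads that contain the flanked insert, trimmed to that insert."""
    regex = build_regex(allow_one_mismatch)
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            # Assumption: strict 4-line FASTQ records (no wrapped sequences).
            record = [line.rstrip("\n") for line in islice(fin, 4)]
            if len(record) < 4:
                break
            header, seq, plus, qual = record
            hit = extract_insert(seq, qual, regex)
            if hit is not None:
                fout.write(f"{header}\n{hit[0]}\n{plus}\n{hit[1]}\n")
                kept += 1
    return kept
```

A call such as `trim_fastq("R1.fastq", "inserts.fastq", allow_one_mismatch=True)` would write the trimmed reads to `inserts.fastq` (an output name chosen here only for illustration) and return how many reads were kept. Only the first match per read is used, so reads in which the flanks occur more than once, or in which a mismatched flank can align at two offsets, may need manual checking.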


