Question

How to extract a sequence by its flanks from FASTQ R1 reads

5.9 years ago

doc.pombe • 0

Hi,

I ran paired-end sequencing, so I have R1 and R2 files. I need help extracting a sequence of varying length (19-22 bases, so I can't extract it by position) that is flanked by ATATC and TTTAA. A read looks like this, for example:

@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT

AGTAATCTGGGGACGAGGCAAGCTAAGATATCTTGCCGCGGCTGTTTTTGCTTTTAAATGCGAAGTAAGGCGGGA

+

ACCCCFGGGGGGGGGGGEGGGGGGFGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG

What I want is to retrieve each read trimmed down to only the sequence flanked by ATATC and TTTAA:

@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT

TTGCCGCGGCTGTTTTTGC

+

ACCCCFGGGGGGGGGGGEGGGGGGFGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG

I've used fastq-grep and ggrep, but they retrieve the whole reads that contain the flanks, not just the flanked subsequence of those reads:

fastq-grep 'ATATC.{19,22}TTTAA' R1.fastq > test.fastq

ggrep -P -B 1 -A 2 "^\w+ATATC\K[ATGCN]+(?=TTTAA\w+$)" R1.fastq | ggrep -P -v "^--$" >test2.fastq
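The commands above return whole reads because both tools report complete matching lines or records, and `-B 1 -A 2` then adds the untouched neighbouring lines. Trimming needs the sequence and quality strings cut at the same coordinates, which is awkward in grep alone. The following is a minimal Python sketch of that idea rather than a grep solution. The pattern, the names `FLANKED` and `trim_record`, and the choice to take the shortest insert that fits are assumptions made for illustration.

```python
import re

# Assumed pattern: exact 5' flank, a lazily matched 19-22 base insert, exact 3' flank.
FLANKED = re.compile(r"ATATC([ACGTN]{19,22}?)TTTAA")


def trim_record(header, seq, qual):
    """Return (header, insert, insert_quality), or None if the flanks are absent."""
    m = FLANKED.search(seq)
    if m is None:
        return None
    start, end = m.span(1)  # coordinates of the insert only, flanks excluded
    return header, seq[start:end], qual[start:end]


def trim_fastq(lines):
    """Yield trimmed 4-line FASTQ records from an iterable of FASTQ lines."""
    it = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups the lines into records.
    for header, seq, _plus, qual in zip(it, it, it, it):
        hit = trim_record(header, seq, qual)
        if hit is not None:
            yield "{}\n{}\n+\n{}".format(*hit)


# Demo on the example read from this post.
example = [
    "@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT",
    "AGTAATCTGGGGACGAGGCAAGCTAAGATATCTTGCCGCGGCTGTTTTTGCTTTTAAATGCGAAGTAAGGCGGGA",
    "+",
    "ACCCCFGGGGGGGGGGGEGGGGGGFGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG",
]
for record in trim_fastq(example):
    print(record)
```

To process a file, `trim_fastq` can be given an open file handle, for example `trim_fastq(open("R1.fastq"))`. On the example read the first exact TTTAA starts one base after the 19-base sequence shown above, so this sketch returns the 20-base insert TTGCCGCGGCTGTTTTTGCT, with the quality string cut to the same 20 positions.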

So, what I need is:

1. A grep command that extracts only this 19-22 base sequence, based on exact flanks.

2. A way to extract this 19-22 base sequence based on the flanks, but with 1 base mismatch allowed at any position in the flanks, if this can be done at all.
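For the mismatch-tolerant case, plain regular expressions get unwieldy once substitutions are allowed, so one option is to score each candidate flank by Hamming distance. The sketch below assumes substitutions only (no insertions or deletions), at most one mismatch per flank, and a preference for the candidate with the fewest total mismatches. `hamming`, `find_insert`, and their parameters are hypothetical names for illustration, not part of any existing tool.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_insert(seq, left="ATATC", right="TTTAA",
                min_len=19, max_len=22, max_mm=1):
    """Return (start, end) of the best insert in seq, or None.

    Each flank may carry up to max_mm substitutions. The candidate with the
    fewest total mismatches wins; ties go to the leftmost start, then to the
    shortest insert.
    """
    best = None
    last_start = len(seq) - len(left) - min_len - len(right)
    for i in range(last_start + 1):
        mm_left = hamming(seq[i:i + len(left)], left)
        if mm_left > max_mm:
            continue
        start = i + len(left)
        for n in range(min_len, max_len + 1):
            end = start + n
            flank = seq[end:end + len(right)]
            if len(flank) < len(right):
                break  # ran off the end of the read
            mm_right = hamming(flank, right)
            if mm_right > max_mm:
                continue
            total = mm_left + mm_right
            # Strict "<" keeps the earliest candidate when totals are equal.
            if best is None or total < best[0]:
                best = (total, start, end)
    return None if best is None else (best[1], best[2])


# Demo on the example read from this post.
read = "AGTAATCTGGGGACGAGGCAAGCTAAGATATCTTGCCGCGGCTGTTTTTGCTTTTAAATGCGAAGTAAGGCGGGA"
span = find_insert(read)
if span is not None:
    print(read[span[0]:span[1]])
```

The returned coordinates can be used to slice both the sequence line and the quality line of a record. Note that on the example read a 19-base insert followed by TTTTA would also pass with one mismatch, which is why the sketch ranks candidates by total mismatches and prefers the exact 20-base match.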

Any help would be greatly appreciated. Thanks.

sequencing • 813 views
