How to extract a sequence by its flanks from a FASTQ R1 read
5.9 years ago
doc.pombe • 0

Hi,

I ran a paired-end sequencing experiment, so I have R1 and R2 files. I need help extracting a sequence that is flanked by ATATC and TTTAA. Its length varies (19-22 bases), so I can't extract it by position. An example read looks like this:

@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT

AGTAATCTGGGGACGAGGCAAGCTAAGATATCTTGCCGCGGCTGTTTTTGCTTTTAAATGCGAAGTAAGGCGGGA

+

ACCCCFGGGGGGGGGGGEGGGGGGFGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG

What I want is to keep, for each read, only the sequence between the ATATC and TTTAA flanks:

@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT

TTGCCGCGGCTGTTTTTGC

+

ACCCCFGGGGGGGGGGGEGGGGGGFGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG

I've tried fastq-grep and ggrep, but both return the whole reads that contain the flanks, not just the flanked subsequence of those reads:

fastq-grep 'ATATC.{19,22}TTTAA' R1.fastq > test.fastq

ggrep -P -B 1 -A 2 "^\w+ATATC\K[ATGCN]+(?=TTTAA\w+$)" R1.fastq | ggrep -P -v "^--$" >test2.fastq

So, what I need is:

1. A grep command that extracts only this 19-22 base sequence, based on exact flanks.
2. A way to extract the same 19-22 base sequence with 1 base mismatch allowed at any position in the flanks, if this can be done at all.

Any help would be greatly appreciated. Thanks!
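
For illustration, here is one way the two requirements could be approached. It is a minimal Python sketch rather than a grep one-liner, because grep can select matching lines but cannot trim the quality line to the same coordinates. The function names (`flank_pattern`, `build_regex`, `trim_record`, `trim_fastq`) are hypothetical, and the sketch assumes plain four-line FASTQ records, substitutions only (no insertions or deletions) in the flanks, and a preference for the shortest insert when several lengths would fit.

```python
import re

LEFT, RIGHT = "ATATC", "TTTAA"   # flanks from the post
MIN_LEN, MAX_LEN = 19, 22        # allowed insert length


def flank_pattern(flank, mismatches):
    """Regex for `flank`; if `mismatches` is 1, allow one substituted base."""
    variants = [flank]
    if mismatches:
        # One alternative per position, with that base replaced by a wildcard.
        variants += [flank[:i] + "[ACGTN]" + flank[i + 1:]
                     for i in range(len(flank))]
    return "(?:" + "|".join(variants) + ")"


def build_regex(mismatches=0):
    """Left flank, lazily captured 19-22 base insert, right flank."""
    insert = "([ACGTN]{%d,%d}?)" % (MIN_LEN, MAX_LEN)
    return re.compile(flank_pattern(LEFT, mismatches) + insert
                      + flank_pattern(RIGHT, mismatches))


def trim_record(header, seq, plus, qual, regex):
    """Return the record cut down to the insert, or None if no match."""
    m = regex.search(seq)
    if m is None:
        return None
    start, end = m.span(1)
    # Slice the quality string with the same coordinates as the sequence.
    return header, seq[start:end], plus, qual[start:end]


def trim_fastq(lines, regex):
    """Yield trimmed records from an iterable of FASTQ lines (4 per record)."""
    it = (line.rstrip("\n") for line in lines)
    for header, seq, plus, qual in zip(it, it, it, it):
        trimmed = trim_record(header, seq, plus, qual, regex)
        if trimmed is not None:
            yield trimmed


# The example read from the post.
READ = (
    "@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT",
    "AGTAATCTGGGGACGAGGCAAGCTAAGATATCTTGCCGCGGCTGTTTTTGCTTTTAAATGCGAAGTAAGGCGGGA",
    "+",
    "ACCCCFGGGGGGGGGGGEGGGGGGFGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG",
)

exact = trim_record(*READ, build_regex(0))
fuzzy = trim_record(*READ, build_regex(1))
print(exact[1])  # TTGCCGCGGCTGTTTTTGCT (20 bases, exact flanks)
print(fuzzy[1])  # TTGCCGCGGCTGTTTTTGC  (19 bases, right flank read as TTTTA)
```

On the example read, exact flanks give a 20-base insert, because the first exact TTTAA starts one T later than in the desired output shown above. With one mismatch allowed, the lazy quantifier stops at 19 bases and accepts TTTTA as the right flank, so mismatch tolerance can shift the insert boundary. To process a whole file, something like `for rec in trim_fastq(open("R1.fastq"), build_regex(1)): print(*rec, sep="\n")` should work. Unlike the desired output shown above, this sketch trims the quality line to the insert so that the result remains valid FASTQ.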

sequencing • 813 views