Question: How to extract a sequence by its flanks from FASTQ R1 reads
10 months ago, doc.pombe wrote:

Hi,

I ran paired-end sequencing, so I have R1 and R2 files. I need help extracting a sequence that is flanked by ATATC and TTTAA. Its length varies (19-22 bases), so I can't extract it by fixed position. For example, a read looks like this:

@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT
AGTAATCTGGGGACGAGGCAAGCTAAGATATCTTGCCGCGGCTGTTTTTGCTTTTAAATGCGAAGTAAGGCGGGA
+
ACCCCFGGGGGGGGGGGEGGGGGGFGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG

What I want is to get each read back trimmed to only the sequence flanked by ATATC and TTTAA:

@M00865:193:000000000-BRDBG:1:1101:16794:1079 1:N:0:GAATTCGT+TATAGCCT
TTGCCGCGGCTGTTTTTGC
+
ACCCCFGGGGGGGGGGGEGGGGGGFGGGGFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG

I've tried fastq-grep and ggrep, but both return the whole reads that contain the flanks, not just the flanked subsequence of those reads:

fastq-grep 'ATATC.{19,22}TTTAA' R1.fastq > test.fastq

ggrep -P -B 1 -A 2 "^\w+ATATC\K[ATGCN]+(?=TTTAA\w+$)" R1.fastq | ggrep -P -v "^--$" > test2.fastq

So, what I need is:

1. A grep command that extracts only this 19-22 base sequence, based on exact flanks.
2. A way to extract the same 19-22 base sequence with 1 base mismatch allowed at any position in the flanks, if this can be done at all.
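For illustration, here is a minimal Python sketch of both requests. It is Python rather than grep because the quality line has to be cut to the same span as the bases, which a line-oriented grep does not do on its own. The names `find_insert`, `trim_records`, and `max_mismatches` are hypothetical, not part of any existing tool. The sketch assumes plain 4-line FASTQ records, applies the mismatch allowance to each flank separately, and keeps the first qualifying match in each read.

```python
import sys


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_insert(seq, left="ATATC", right="TTTAA",
                min_len=19, max_len=22, max_mismatches=0):
    """Return (start, end) of the first insert between the flanks, or None.

    max_mismatches is applied to each flank separately (an assumption).
    """
    last_start = len(seq) - len(left) - min_len - len(right)
    for i in range(last_start + 1):
        if hamming(seq[i:i + len(left)], left) > max_mismatches:
            continue
        start = i + len(left)
        # Try the shortest allowed insert first, then longer ones.
        for n in range(min_len, max_len + 1):
            end = start + n
            cand = seq[end:end + len(right)]
            if len(cand) == len(right) and hamming(cand, right) <= max_mismatches:
                return start, end
    return None


def trim_records(lines, **opts):
    """Yield (header, seq, plus, qual) tuples trimmed to the insert.

    Reads without a qualifying insert are skipped.
    """
    stripped = (line.rstrip("\n") for line in lines)
    for header, seq, plus, qual in zip(*[stripped] * 4):
        span = find_insert(seq, **opts)
        if span is None:
            continue
        start, end = span
        # Cut bases and qualities with the same coordinates.
        yield header, seq[start:end], plus, qual[start:end]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python trim_flanks.py R1.fastq [max_mismatches]
    mm = int(sys.argv[2]) if len(sys.argv) > 2 else 0
    with open(sys.argv[1]) as handle:
        for record in trim_records(handle, max_mismatches=mm):
            print("\n".join(record))
```

On the example read above, the first TTTAA after ATATC starts one base later than the 19-base output suggests, so the exact-flank search returns the 20-base insert TTGCCGCGGCTGTTTTTGCT. With a mismatch allowed, a flank can match in more than one place, so the first hit is not necessarily the intended one and is worth spot-checking.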

Any help would be greatly appreciated. Thanks.
