Why does my regular expression fail to match all the valid sequences?
7.3 years ago
jinkuozhang ▴ 30

I am trying to find all possible "N20NGG" sequences in a target sequence like:

example_seq ATTAATACTTTTAACAATTGTAGTATATAAAAAAGGGAGTAACCGAAAACGGTCGGGACCGAAAACGG

I used a Python regular expression:

import re
example_seq = "ATTAATACTTTTAACAATTGTAGTATATAAAAAAGGGAGTAACCGAAAACGGTCGGGACCGAAAACGG"
pattern = re.compile(r'(.{20}).GG')
all_matched_seq = pattern.finditer(example_seq)

for record in all_matched_seq:
    print(record.group(1), end="\t")
    print(record.span())

I only got two matched sequences:

  1. ACAATTGTAGTATATAAAAA (13, 36)
  2. AAAACGGTCGGGACCGAAAA (45, 68)

My script failed to retrieve the other 4 matched sequences:

CAATTGTAGTATATAAAAAA; AAAAAGGGAGTAACCGAAAA; AGGGAGTAACCGAAAACGGT; GGGAGTAACCGAAAACGGTC;

How can I modify my script to get all the valid ones?

7.3 years ago
John 13k

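Regex matches in Python are non-overlapping by default: after each match, the search resumes at the end of that match, so any site starting inside it is skipped. To get overlapping matches, wrap the pattern in a lookahead, (?=...), which matches without consuming any characters.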

import re
example_seq = 'ATTAATACTTTTAACAATTGTAGTATATAAAAAAGGGAGTAACCGAAAACGGTCGGGACCGAAAACGG'
# The lookahead consumes nothing, so the scan advances one base at a time.
pattern = re.compile(r'(?=((.{21})GG))')
all_matched_seq = pattern.finditer(example_seq)
for record in all_matched_seq:
    print(record.group(1), end="\t")
    print(record.span())

Output:

ACAATTGTAGTATATAAAAAAGG (13, 13)
CAATTGTAGTATATAAAAAAGGG (14, 14)
AAAAAGGGAGTAACCGAAAACGG (29, 29)
AGGGAGTAACCGAAAACGGTCGG (33, 33)
GGGAGTAACCGAAAACGGTCGGG (34, 34)
AAAACGGTCGGGACCGAAAACGG (45, 45)
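
Two things about the output above may matter: group(1) is the full 23-mer including the NGG, and span() is zero-width (e.g. (13, 13)) because a lookahead consumes no characters. If only the 20 bases before the NGG are wanted, along with their real coordinates, something like the following sketch might do. It assumes the capture group goes around the first 20 bases only and reads the position from the group rather than from the whole match; the function name find_n20_ngg is made up for illustration.

```python
import re

example_seq = "ATTAATACTTTTAACAATTGTAGTATATAAAAAAGGGAGTAACCGAAAACGGTCGGGACCGAAAACGG"

def find_n20_ngg(seq):
    """Return (n20, start, end) for every N20NGG site, overlaps included.

    Hypothetical helper: start/end are 0-based, end-exclusive coordinates
    of the 20 bases that precede the NGG.
    """
    # The lookahead keeps the match zero-width so overlapping sites are found.
    # Group 1 captures only the 20 bases; ".GG" is checked but not captured.
    pattern = re.compile(r'(?=(.{20}).GG)')
    # start(1)/end(1) give the span of group 1, not of the empty overall match.
    return [(m.group(1), m.start(1), m.end(1)) for m in pattern.finditer(seq)]

for n20, start, end in find_n20_ngg(example_seq):
    print(n20, start, end, sep="\t")
```

On the example sequence this should return the six 20-mers listed in the question. Note that "." also matches characters other than A, C, G and T, so [ACGT] could be used instead if the input may contain anything else.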
  

This precisely answered my question. John, Thanks!


If this answered your question it's appropriate to mark this answer as "accepted".
