Find Exact Length Of One Letter Homopolymer
1
1
Entering edit mode
11.2 years ago
stolarek.ir ▴ 700

Hi guys, i have a code looking like this:

import HTSeq
reference = open('genome.fa','r')
sequences = dict( s.name, s) for s in HTSeq.FastaReader(reference))
out = open('/home/istolarek/moleculoNA12878/homopolymers_in_ref','w')

def find_all(a_str,sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub)
homa = 'AAAAAAAAAA'
homc = 'CCCCCCCCCC'
homg = 'GGGGGGGGGG'
homt = 'TTTTTTTTTT'
for key,line in sequences.items():
    seq = str(line)
    a= list(find_all(seq,homa))
    c = list(find_all(seq,homc))
    g = list(find_all(seq,homg))
    t = list(find_all(seq,homt))
    for i in a:
##        print i,key,'A'
        out.write(str(i)+'\t'+str(key)+'\t'+'A'+'\n')
    for i in c:
        out.write(str(i)+'\t'+str(key)+'\t'+'C'+'\n')
##        print i,key,'C'
    for i in g:
        out.write(str(i)+'\t'+str(key)+'\t'+'G'+'\n')
    for i in t:
        out.write(str(i)+'\t'+str(key)+'\t'+'T'+'\n')
out.close()

I used HTSeq to open the reference. What it does - it looks for simple homopolymers of length 10 and outputs start position, chromosome and type (A,C,T,G,). Now what I would like to change is finding the length of each homopolymer. Since, right now if a homopolymer has a length 15, my script can't tell it. I was thinking of doing it by some sort of regex, namely: find at least 10 bp like now and compute length by adding +1 to it until the next base isn't like those in homopolymer.

Any suggestions how to use a regex to do it in python?

python repeats • 4.2k views
ADD COMMENT
0
Entering edit mode
11.2 years ago
matted 7.8k

Yes, just read the documentation.

For example:

import re
re.findall("A{4,10}", "AAA AAAA AAAAA")

This will output the two hits:

['AAAA', 'AAAAA']
ADD COMMENT
0
Entering edit mode

Not sure this is precisely what was asked for. The regex for "at least 10 A" is A{10,}.

ADD REPLY
0
Entering edit mode

The search here needs to be greedy, A{4,10} will find all substrings matching that pattern, so longer stretches will generate many hits:

>>> x = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
>>> import re
>>> re.findall("A{4,10}", x)
['AAAAAAAAAA', 'AAAAAAAAAA', 'AAAAAAAAAA', 'AAAAAAA']

A{10,} works with a few quick tests. OP one issue with using a regex here is that you lose any information about the position of the homopolymers. You can easily generalize the search:

def counthpoly(char, seq):
    return map(len, re.findall(char + "{10,}", seq)))
nhpolys = map(lambda x: counthpoly(x, seq), ["A","T","C","G"])

The above code can be adapted to fill a dictionary if you really wanted, but I generally find it easier to enforce a fixed order for my alphabets if all I am doing is counting and so on. There is a downside to using this method: the memory usage could potentially be massive. re.findall returns a list of substrings matching the provided pattern, when all you're doing is looking for the length of each one. This means a potentially massive list will be generated each time. You would be better off iterating through the string in one shot, counting for all characters as you go. A regex can be handy and quick to write, but there are cases where the few moments less you spent writing the code can result in much worse performance than if you had written some loops.

ADD REPLY

Login before adding your answer.

Traffic: 2064 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6