Question

Is it possible to check if a sequence has a stop codon, without translating first?

0

Entering edit mode

8.1 years ago

uki ▴ 40

I recently started playing around with NGS data and Biopython, so this question might come as a bit of no-brainer to some but I havent quite figured it out on my own.

I realize that it's pretty straight forward to check the AA sequence to see if there is a stop symbol, but since I filter those sequences out eventually anyways I figured maybe I can avoid doing a whole bunch of translation() calls:

for rec in read1:
    nucSeq = rec.seq
    aaSeq = Seq.translate(nucSeq)

    if '*' in str(aaSeq):
        starSeqs += 1
    else:   
        # do awesome stuff!

PS: I could obviously do a RegEx search over the sequence for the stop codons, but that's neither pretty nor efficient. I was wondering if there is a method/api in the library that does the black magic for me.

biopython sequencing • 3.3k views

ADD COMMENT • link updated 8.1 years ago by John 13k • written 8.1 years ago by uki ▴ 40

1

Entering edit mode

Just curious, would a regex search really be less efficient, and why?

ADD REPLY • link 8.1 years ago by Michael 54k

0

Entering edit mode

It would be difficult to really know without testing it, but regexs are usually pretty slow (O(n)) compared to a static string match. I've seen a lot of whining about how Java's regex (which is in everything, like split()) is O(n) when people even provide hints like start/end of line anchors, etc. I mean, it's not a big issue which is what i think you're really getting at - reading the file off the disk is probably 99% of the time taken to execute...

ADD REPLY • link 8.1 years ago by John 13k

1

Entering edit mode

8.1 years ago

John 13k

jsgounot's answer is really great, although it only works for the first frame. I came up with this, but I don't know if it would be any faster than jsgounot's.

def find_all(string):
    idx = 0
    indexes = []
    try:
        while True:
            idx = string.index('T',idx+1)
            if string[idx:idx+3] in ("TGA", "TAG", "TAA"): indexes.append(idx)
    except: return indexes

frameCounter = {1:0, 2:0, 3:0}
for idx in find_all('ACTAGCACTGGCCTGATTCGATCGACTAGCA'):
    frameCounter[(idx%3)+1] += 1

for x,y in frameCounter.items(): print 'Frame:',x,'Stops:',y

---------
output
---------
Frame: 1 Stops: 0
Frame: 2 Stops: 1
Frame: 3 Stops: 2

Ideally you would want to take the input, put it into a multidimentional array where you can stripe across the array to get the DNA in 1 of the 3 frames, then once a stop is found in 1 frame stop looking there in the future. You could also do that with splicing that uses variable jumps, but the more time you spend in python and the less in C/Fortran the slower its going to get.

ADD COMMENT • link 8.1 years ago by John 13k

score 5 · Accepted Answer · 2016-03-10

5

Entering edit mode

8.1 years ago

jsgounot ▴ 170

You mean doing something with python like this ?

stop_codons = ["TGA", "TAG", "TAA"]
sequence = "ATGAAATGA"
sequence2 = "ATGAAATTA"

is_stop_codon = lambda x : any(x[i:i+3] in stop_codons for i in range(0,len(x),3))

print is_stop_codon(sequence)
print is_stop_codon(sequence2)

ADD COMMENT • link 8.1 years ago by jsgounot ▴ 170