Question

Finding All In-Frame Stop Codons Using Python

0

10.9 years ago

nobodyknowsme57 ▴ 10

A nucleotide sequence (seq) containing multiple genes and UTRs is given:

<--UTR--><----------------Gene 1-------------><---- UTR-----><----------- Gene 2 --------------------><-- UTR-->
TGGAGA_startCodon_AGGAAG_stopCodon_GAAGGTAAC_startCodon_AGCTCTG_stopCodon_ATCAAGA

There could be multiple out-of-frame or overlapping start codons between each primary start and stop codon shown above, or in the UTRs. The positions of all start codons, whether in-frame, out-of-frame or overlapping, can be found as follows:

import re

startCodons = ['ATG', 'GTG', 'CTG', 'TTG']

# Start positions of each start codon; the lookahead (?=...) also catches overlapping matches
startCodons_pos = {}
for startCodon_seq in startCodons:
    startCodons_pos[startCodon_seq] = [m.start() for m in re.finditer('(?=' + startCodon_seq + ')', seq)]

While the start codons can be in-frame, out-of-frame or overlapping, I need to find only the stop codons that are in-frame with respect to each 'primary' start codon. This can be done with multiple loops, but I was wondering whether there is a smarter way of doing it in Python.

python • 9.2k views

ADD COMMENT • link updated 2.8 years ago by Ram 43k • written 10.9 years ago by nobodyknowsme57 ▴ 10

2

This old (and fun!) thread might help you. It is about finding the longest ORF in all 6 frames, but you can probably hack one of the answers to report all ORFs: "Code golf: Finding ORF and corresponding strand in a DNA sequence".

ADD REPLY • link 10.9 years ago by David W 4.9k

1

I wrote something for this here: https://github.com/vsbuffalo/findorf/blob/master/findorf/orfprediction.py but it may be too project-specific. It does handle some other cases, though, which you may find interesting.

ADD REPLY • link 10.9 years ago by Vince Buffalo ▴ 470

1

If you're learning Python or bioinformatics, it's a good exercise; otherwise you can use EMBOSS sixpack.

ADD REPLY • link 10.9 years ago by JC 13k

Ram · Answer 1 · 2013-06-18

I'm not aware of a ready-made function for this in Biopython. If your loop is giving you the position of every possible start codon in your string, then you can use the Bio.Seq module to translate from any arbitrary position. So, you have s.seq.translate() for the whole thing, but since your re loop gave you the positions, you can just loop through them:

from Bio.Seq import Seq

somelist = []
for x in all_start_codons:
    # to_stop=True ends the protein at the first in-frame stop codon
    somelist.append(s.seq[x:].translate(to_stop=True))

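That will append each protein to somelist. If you need the position of the in-frame stop codon in the big string, it is x + 3 * len(protein), provided the translation was cut at the first stop codon (for example with translate(to_stop=True)).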
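
If you'd rather not depend on Biopython, the same idea can be sketched in plain Python: step through the sequence three bases at a time from each start position and stop at the first stop codon. This is only a rough sketch, assuming the standard stop codons (TAA, TAG, TGA) and the forward strand only. The function name first_in_frame_stop and the toy sequence are made up for illustration; the start codon list and the lookahead regex are taken from the question.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def first_in_frame_stop(seq, start):
    """Return the index of the first stop codon in frame with `start`, or None."""
    # Step codon by codon from the start position, so only in-frame triplets are checked.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


# Hypothetical toy sequence: CC ATG CTA AGC TGA GG
# It contains an out-of-frame TAA at index 6 that must be ignored.
seq = "CCATGCTAAGCTGAGG"

start_codons = ["ATG", "GTG", "CTG", "TTG"]
starts = sorted(
    m.start()
    for codon in start_codons
    for m in re.finditer("(?=" + codon + ")", seq)
)

# Map each start position to its first in-frame stop (None if there is none).
stops = {x: first_in_frame_stop(seq, x) for x in starts}
print(stops)  # {2: 11, 10: None}
```

Here the ATG at position 2 is followed by two more codons before TGA, so the stop is at 2 + 3 * 3 = 11, which matches the x + 3 * len(protein) rule above. The CTG at position 10 has no in-frame stop before the sequence ends, so it maps to None.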