Finding All In-Frame Stop Codons Using Python
10.9 years ago

A nucleotide sequence (seq) containing multiple genes and UTRs is given:

<--UTR--><----------------Gene 1-------------><---- UTR-----><----------- Gene 2 --------------------><-- UTR-->
TGGAGA_startCodon_AGGAAG_stopCodon_GAAGGTAAC_startCodon_AGCTCTG_stopCodon_ATCAAGA

There may be multiple out-of-frame or overlapping start codons between each primary start and stop codon shown above, or in the UTRs. The positions of all overlapping in-frame and out-of-frame start codons can be found as follows:

import re

startCodons = ['ATG', 'GTG', 'CTG', 'TTG']

# Start positions of start codons (the lookahead also catches overlapping matches)
startCodons_pos = {}
for startCodon_seq in startCodons:
    startCodons_pos[startCodon_seq] = [m.start() for m in re.finditer('(?=' + startCodon_seq + ')', seq)]

While the start codons can be in-frame, out-of-frame, or overlapping, I need to find only the stop codons that are in frame with respect to each 'primary' start codon. This can be done with multiple loops, but I was wondering whether a smarter way of doing it in Python exists.

This old (and fun!) thread might help you. It is about finding the longest ORF in all six frames, but you can probably hack one of the results to include all ORFs: "Code golf: Finding ORF and corresponding strand in a DNA sequence".

I wrote something for it here: https://github.com/vsbuffalo/findorf/blob/master/findorf/orfprediction.py but this may be too project-specific. It does handle some other cases, though, which you may find interesting.

If you're learning Python or bioinformatics, it's a good exercise; otherwise you can use EMBOSS sixpack.

10.8 years ago
Wrf ▴ 210

I'm not familiar with any particular Biopython package that does this. If your loop gives you every position of a possible start codon in your string, you can use the Bio.Seq module to translate from any arbitrary position. You have s.seq.translate() for the whole sequence, but since your re loop gave you the positions, you can just loop through them:

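from Bio.Seq import Seq

somelist = []
for x in all_start_codons:
    # to_stop=True ends each translation at the first in-frame stop codon
    somelist.append(s.seq[x:].translate(to_stop=True))

That appends each protein to somelist. If you need the position of the in-frame stop codon in the big string, it is x + 3 * len(protein), assuming the translation was cut at the first stop as above.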
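
If you would rather skip the translation step, the same idea can be sketched in plain Python by stepping through the sequence three bases at a time from each start position. This is only a rough sketch, not a tested tool: the function names (find_start_positions, in_frame_stops) and the example sequence are made up for illustration, and it assumes the standard stop codons TAA, TAG and TGA on the forward strand only.

```python
import re

# Assumption: standard genetic code stop codons, forward strand only
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Lookahead so that overlapping start codons are all reported
START_CODON_RE = re.compile(r"(?=(?:ATG|GTG|CTG|TTG))")


def find_start_positions(seq):
    """Return the 0-based position of every (possibly overlapping) start codon."""
    return [m.start() for m in START_CODON_RE.finditer(seq)]


def in_frame_stops(seq, start):
    """Return the 0-based positions of all stop codons in frame with `start`."""
    return [
        i
        for i in range(start + 3, len(seq) - 2, 3)  # codon-sized steps after the start codon
        if seq[i:i + 3] in STOP_CODONS
    ]


# Hypothetical example sequence, not taken from the question
seq = "ATGAAATAAGGGTGATTT"
stops_by_start = {start: in_frame_stops(seq, start) for start in find_start_positions(seq)}
```

Taking the first element of each list (if any) gives the stop codon that closes the ORF for that start; the remaining entries are downstream stops in the same frame.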
