Question

Biopython Translation Error?

13.2 years ago

Mark Evans ▴ 50

Hello, I am writing some code intended to translate ambiguous DNA codes into possible amino acids, and I am seeing some strange translations from the Biopython 1.56 package. It appears to be translating ambiguous DNA codes to 'J', which does not exist as a code for anything. I am running Python 2.6.1 on Mac OS X 10.6.6.

For example:

>>> from Bio.Seq import *
>>> translate('ARAWTAGKAMTA')
'XJXJ'

or

>>> from Bio.Seq import Seq
>>> c = Seq('ARAWTAGKAMTA')
>>> c.translate().tostring()
'XJXJ'

I have looked through the Bio.Data.CodonTable source and Bio.Seq source and I cannot find a reason why this would be happening. Any ideas?

Thanks!

Mark

biopython python protein translation • 6.5k views

Ram · Answer 1 · 2011-02-18

13.2 years ago

Pierre Lindenbaum 161k

Biopython seems to use an extended alphabet for the amino acids; see the ExtendedIUPACProtein documentation:

B = "Asx";  Aspartic acid (D) or Asparagine (N)
X = "Xxx";  Unknown or 'other' amino acid
Z = "Glx";  Glutamic acid (E) or Glutamine (Q)
J = "Xle";  Leucine (L) or Isoleucine (I), used in mass-spec (NMR)
U = "Sec";  Selenocysteine
O = "Pyl";  Pyrrolysine

Thanks Pierre, please see my followup -Mark

Ram · Answer 2 · 2011-02-18

I can explain the error in your second bit of code -- IUPACAmbiguousDNA is a class and needs to be instantiated, so

c = Seq('ARAWTAGKAMTA',IUPACAmbiguousDNA)

should be

c = Seq('ARAWTAGKAMTA',IUPACAmbiguousDNA() )
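To make the class-versus-instance point concrete, here is a minimal sketch. The names `Alphabet`, `AmbiguousDNA` and `get_base_alphabet` are hypothetical stand-ins, not Biopython's real code; I am only assuming that the library checks its alphabet argument with something like `isinstance`, which is what the "Invalid alphabet found" message suggests.

```python
class Alphabet(object):
    """Hypothetical stand-in for an alphabet base class."""


class AmbiguousDNA(Alphabet):
    """Hypothetical stand-in for a concrete alphabet such as IUPACAmbiguousDNA."""


def get_base_alphabet(alphabet):
    # Assumed check: the argument must be an *instance* of Alphabet.
    # A class object is an instance of `type`, not of Alphabet, so it fails here.
    if not isinstance(alphabet, Alphabet):
        raise AssertionError("Invalid alphabet found, %r" % (alphabet,))
    return alphabet


get_base_alphabet(AmbiguousDNA())      # instance: accepted

try:
    get_base_alphabet(AmbiguousDNA)    # class itself: rejected
except AssertionError as err:
    print(err)
```

The second call fails for the same kind of reason as the original `Seq(..., IUPACAmbiguousDNA)`: the class was passed where an object built from it was expected.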

Meanwhile, Bio/Data/IUPACData.py maps W to A or T, which means that WTA -> ATA, TTA -> I, L, which is J.
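Here is a rough pure-Python sketch of that expansion, so you can see where the J comes from without digging through the library. It does not call Biopython and is not its implementation; `translate_ambiguous` and the table names are made up for illustration. The codon table is the standard genetic code, and the ambiguity codes are the usual IUPAC ones. One simplification to be aware of: a codon that could be either a stop or an amino acid just becomes X here.

```python
from itertools import product

# Standard genetic code, laid out in the usual TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa for (a, b, c), aa in zip(product(BASES, repeat=3), AMINO)
}

# IUPAC nucleotide ambiguity codes (W = A or T, R = A or G, ...).
AMBIGUOUS_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

# Sets of residues that have their own letter in the extended protein alphabet.
AMBIGUOUS_PROTEIN = {
    frozenset("DN"): "B",
    frozenset("EQ"): "Z",
    frozenset("IL"): "J",
}


def translate_ambiguous(dna):
    """Translate DNA codon by codon, expanding ambiguity codes (sketch only)."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        # Every concrete codon this ambiguous codon could stand for.
        options = [AMBIGUOUS_DNA[base] for base in dna[i:i + 3]]
        residues = frozenset(CODON_TABLE["".join(c)] for c in product(*options))
        if len(residues) == 1:
            protein.append(next(iter(residues)))
        else:
            # Known pair -> B, Z or J; anything else -> X.
            protein.append(AMBIGUOUS_PROTEIN.get(residues, "X"))
    return "".join(protein)


print(translate_ambiguous("ARAWTAGKAMTA"))  # XJXJ
```

Walking through the question's sequence: ARA is AAA or AGA (K or R, so X), WTA is ATA or TTA (I or L, so J), GKA is GGA or GTA (G or V, so X), and MTA is ATA or CTA (I or L, so J).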

I haven't found a way to force Seq.translate() to use IUPACProtein instead of ExtendedIUPACProtein, which might be what you want if you'd rather see X than J. An ugly fix would be to just use string replace:

Seq('ARAWTAGKAMTA',IUPACAmbiguousDNA()).translate().tostring().replace('J','X')

Ugly.
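If you go the string-replace route, a slightly tidier version might collapse all three "either/or" letters (B, Z and J) in one pass. This is only a sketch in Python 3 working on a plain string; `collapse_ambiguous` is a made-up helper name, and whether B and Z should also become X is my assumption about what you want.

```python
# Letters that stand for "one of two residues" in the extended protein alphabet.
AMBIGUOUS_TO_X = str.maketrans("BZJ", "XXX")


def collapse_ambiguous(protein):
    """Replace B, Z and J with X; every other letter is left alone."""
    return protein.translate(AMBIGUOUS_TO_X)


print(collapse_ambiguous("XJXJ"))  # XXXX
```

You would feed it the string you get back from the translation, for example the 'XJXJ' from the question.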

Ram · Answer 3 · 2011-02-18

Thanks Pierre. That helps some. There is still something I must be missing though. You are right that ExtendedIUPACProtein uses J. So in that case, based on my example, WTA would be the corresponding codon. I still don't see where that gets mapped to J.

ExtendedIUPACDNA defines W as wyosine (which I had to google, since I didn't know what it was): http://biopython.org/DIST/docs/api/Bio.Alphabet.IUPAC.ExtendedIUPACDNA-class.html

B = 5-bromouridine
D = 5,6-dihydrouridine
S = thiouridine
W = wyosine

But the "normal" DNA ambiguity codes are in IUPACAmbiguousDNA:

letters = 'GATCRYWSMKHBVDN'

W traditionally codes for T or A.

If I get more specific in my example and specify an alphabet:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet.IUPAC import *
>>> c = Seq('ARAWTAGKAMTA',IUPACAmbiguousDNA)
>>> c.translate().tostring()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mark/Downloads/biopython-1.54/build/lib.macosx-10.6-universal-2.6/Bio/Seq.py", line 930, in translate
  File "/Users/mark/Downloads/biopython-1.54/build/lib.macosx-10.6-universal-2.6/Bio/Alphabet/__init__.py", line 213, in _get_base_alphabet
AssertionError: Invalid alphabet found, <class Bio.Alphabet.IUPAC.IUPACAmbiguousDNA at 0x10057c230>

Bad things happen. So I am still not quite understanding the ins and outs of this. The translate call goes to CodonTable, where 1) I still don't see a J and 2) I don't understand this new error.

Thanks!
Mark