Biopython Translation Error?
10.8 years ago
Mark Evans ▴ 50

Hello, I am writing some code intended to translate ambiguous DNA codes into possible amino acids, and I am seeing some strange translations from the Biopython 1.56 package. It appears to be translating ambiguous DNA codons to 'J', which does not exist as a code for anything. I am running Python 2.6.1 on Mac OS 10.6.6.

For example:

>>> from Bio.Seq import *
>>> translate('ARAWTAGKAMTA')
'XJXJ'


or

>>> from Bio.Seq import Seq
>>> c = Seq('ARAWTAGKAMTA')
>>> c.translate().tostring()
'XJXJ'


I have looked through the Bio.Data.CodonTable source and Bio.Seq source and I cannot find a reason why this would be happening. Any ideas?

Thanks!

Mark

biopython python protein translation • 5.2k views
10.8 years ago

Biopython seems to use an extended alphabet for the amino acids (ExtendedIUPACProtein):

B = "Asx";  Aspartic acid (D) or Asparagine (N)
X = "Xxx";  Unknown or 'other' amino acid
Z = "Glx";  Glutamic acid (E) or Glutamine (Q)
J = "Xle";  Leucine (L) or Isoleucine (I), used in mass-spec (NMR)
U = "Sec";  Selenocysteine
O = "Pyl";  Pyrrolysine


Thanks Pierre, please see my followup -Mark

10.8 years ago
User 2510 ▴ 50

I can explain the error in your second bit of code -- IUPACAmbiguousDNA is a class and needs to be instantiated, so

c = Seq('ARAWTAGKAMTA',IUPACAmbiguousDNA)


should be

c = Seq('ARAWTAGKAMTA',IUPACAmbiguousDNA() )


Meanwhile, Bio/Data/IUPACData.py maps W to A,T, which means that WTA -> ATA,TTA -> I,L, which is J.
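
To make that expansion concrete, here is a rough sketch in plain Python (no Biopython needed). It is a simplified illustration, not Biopython's actual implementation: the names `AMBIGUOUS_DNA`, `CODON_TABLE`, `expand_codon`, `translate_codon` and `translate_ambiguous` are made up for this example, it assumes the standard genetic code only, and it simply returns X for any mix it cannot name (including mixes involving stop codons), where Biopython may behave differently.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes -> the concrete bases they stand for.
AMBIGUOUS_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Extended protein letters that name a specific pair of residues.
AMBIGUOUS_PROTEIN = {"B": {"D", "N"}, "Z": {"E", "Q"}, "J": {"I", "L"}}


def expand_codon(codon):
    """List every unambiguous codon an ambiguous codon could stand for."""
    return ["".join(p) for p in product(*(AMBIGUOUS_DNA[b] for b in codon))]


def translate_codon(codon):
    """Translate one (possibly ambiguous) codon to a single protein letter."""
    residues = {CODON_TABLE[c] for c in expand_codon(codon)}
    if len(residues) == 1:
        return next(iter(residues))
    for letter, members in AMBIGUOUS_PROTEIN.items():
        if residues <= members:
            return letter
    return "X"  # no more specific letter covers this mix


def translate_ambiguous(seq):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return "".join(translate_codon(seq[i:i + 3]) for i in range(0, usable, 3))


print(expand_codon("WTA"))                  # the two codons W can produce here
print(translate_ambiguous("ARAWTAGKAMTA"))  # same input as in the question
```

With this, WTA and MTA both come out as I-or-L (J), while ARA (K or R) and GKA (G or V) have no dedicated letter and fall back to X.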

I haven't found a way to force Seq.translate() to use IUPACProtein instead of ExtendedIUPACProtein, which might be what you want if you'd rather see X than J. An ugly fix would be to just use string replace:

Seq('ARAWTAGKAMTA',IUPACAmbiguousDNA()).translate().tostring().replace('J','X')


Ugly.
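
If B and Z should also collapse to X, the same string-level workaround can be done in one pass with `str.translate`. This is only a minimal sketch that assumes the protein is already a plain string; `collapse_extended` is a hypothetical helper, not part of Biopython. U and O are left alone here because they are real amino acids rather than ambiguity codes.

```python
# Map the ambiguous extended letters (B, Z, J) to X; everything else is untouched.
EXTENDED_TO_X = str.maketrans("BZJ", "XXX")


def collapse_extended(protein):
    """Replace the ambiguous extended protein letters B, Z and J with X."""
    return protein.translate(EXTENDED_TO_X)


print(collapse_extended("XJXJ"))
```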

10.8 years ago
Mark Evans ▴ 50

Thanks Pierre. That helps some. There is still something I must be missing though. You are right that ExtendedIUPACProtein uses J. So in that case, based on my example, WTA would be the corresponding codon. I still don't see where that gets mapped to J.

ExtendedIUPACDNA calls W wyosine (I don't even know what that is... googling): http://biopython.org/DIST/docs/api/Bio.Alphabet.IUPAC.ExtendedIUPACDNA-class.html

B = 5-bromouridine
D = 5,6-dihydrouridine
S = thiouridine
W = wyosine


but the "normal" DNA ambiguity codes are in IUPACAmbiguousDNA:

letters = 'GATCRYWSMKHBVDN'


W traditionally codes for T or A.

If I get more specific in my example and specify an alphabet:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet.IUPAC import *
>>> c = Seq('ARAWTAGKAMTA',IUPACAmbiguousDNA)
>>> c.translate().tostring()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: Invalid alphabet found, <class Bio.Alphabet.IUPAC.IUPACAmbiguousDNA at 0x10057c230>


Bad things happen. So I am still not quite understanding the ins and outs of this. The translate call goes to CodonTable, where 1) I still don't see a 'J' and 2) I don't understand this new error.

Thanks!
Mark


The place in the code where that happens is Bio/Data/CodonTable.py (https://github.com/biopython/biopython/blob/master/Bio/Data/CodonTable.py). All of the ambiguous codes are expanded and shoved into the forward translation table, which is referenced indirectly from the Bio.Seq.translate method.


IUPACAmbiguousDNA is the class, IUPACAmbiguousDNA() is an instance of the class. See shervold's answer.
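
To see why the class itself trips an assertion while an instance does not, here is a small stand-alone sketch. The `Alphabet` and `AmbiguousDNA` classes and the `check_alphabet` function are hypothetical stand-ins written for this example, not Biopython's code; the assumption is only that the library checks its alphabet argument with something like `isinstance`.

```python
class Alphabet:
    """Stand-in base class for this illustration."""


class AmbiguousDNA(Alphabet):
    """Stand-in for an alphabet class such as IUPACAmbiguousDNA."""
    letters = "GATCRYWSMKHBVDN"


def check_alphabet(alphabet):
    """Accept only *instances* of Alphabet, as an isinstance-based check would."""
    if not isinstance(alphabet, Alphabet):
        raise AssertionError("Invalid alphabet found, %r" % (alphabet,))
    return alphabet


check_alphabet(AmbiguousDNA())      # fine: an instance of the class

try:
    check_alphabet(AmbiguousDNA)    # the class object itself is not an instance
except AssertionError as err:
    print(err)
```

The class object is an instance of `type`, not of `Alphabet`, so the check fails until you add the parentheses.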