Hello, I am writing some code intended to translate ambiguous DNA codes into possible amino acids, and I am seeing some strange translations from the Biopython 1.56 package. It appears to be translating ambiguous DNA codes to 'J', which as far as I know is not a code for anything. I am running Python 2.6.1 on Mac OS X 10.6.6.
>>> from Bio.Seq import *
>>> translate('ARAWTAGKAMTA')
'XJXJ'

>>> from Bio.Seq import Seq
>>> c = Seq('ARAWTAGKAMTA')
>>> c.translate().tostring()
'XJXJ'
I have looked through the Bio.Data.CodonTable source and Bio.Seq source and I cannot find a reason why this would be happening. Any ideas?
I can explain the error in your second bit of code: IUPACAmbiguousDNA is a class and needs to be instantiated, so
c = Seq('ARAWTAGKAMTA',IUPACAmbiguousDNA)
should be
c = Seq('ARAWTAGKAMTA', IUPACAmbiguousDNA())
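To illustrate why passing the class rather than an instance trips an "Invalid alphabet" assertion, here is a minimal sketch. The classes and the helper are hypothetical stand-ins, not Biopython's own code, and I am assuming the library's check is an isinstance-style test on the alphabet argument.

```python
class Alphabet(object):
    """Stand-in base class (hypothetical, not Biopython's real Alphabet)."""


class AmbiguousDNA(Alphabet):
    """Stand-in for an alphabet subclass such as IUPACAmbiguousDNA."""


def is_valid_alphabet(alphabet):
    # Assumed check: only *instances* of Alphabet pass.
    # The class object itself is an instance of `type`, not of Alphabet.
    return isinstance(alphabet, Alphabet)


print(is_valid_alphabet(AmbiguousDNA()))  # instance
print(is_valid_alphabet(AmbiguousDNA))    # class object
```

With a check like this, the instance passes and the bare class does not, which is consistent with the traceback naming `<class ...IUPACAmbiguousDNA ...>` as the invalid alphabet.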
Meanwhile, Bio/Data/IUPACData.py maps 'W' to 'A','T', which means that 'WTA' -> 'ATA','TTA' -> 'I','L' which is 'J'.
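The expansion described above can be sketched in plain Python. This is only an illustration of the logic, not Biopython's implementation: the names `possible_amino_acids` and `translate_ambiguous` are made up for this example, the table is the standard genetic code, and I am assuming that any set of outcomes other than D/N, E/Q or I/L simply collapses to 'X'.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    ("".join(codon), aa)
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
)

# IUPAC nucleotide ambiguity codes and the bases each one stands for.
AMBIGUOUS_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

# Ambiguous protein letters that exist for specific pairs of residues.
AMBIGUOUS_PROTEIN = {
    frozenset("DN"): "B",
    frozenset("EQ"): "Z",
    frozenset("IL"): "J",
}


def possible_amino_acids(codon):
    """Expand an ambiguous codon and return the set of residues it can encode."""
    expansions = product(*(AMBIGUOUS_DNA[base] for base in codon))
    return frozenset(CODON_TABLE["".join(c)] for c in expansions)


def translate_ambiguous(dna):
    """Translate codon by codon, collapsing each set of outcomes to one letter."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        options = possible_amino_acids(dna[i:i + 3])
        if len(options) == 1:
            protein.append(next(iter(options)))
        else:
            # Simplification for this sketch: unknown combinations become 'X'.
            protein.append(AMBIGUOUS_PROTEIN.get(options, "X"))
    return "".join(protein)


print(sorted(possible_amino_acids("WTA")))
print(translate_ambiguous("ARAWTAGKAMTA"))
```

Here 'WTA' expands to 'ATA' and 'TTA', i.e. I or L, which collapses to 'J', while 'ARA' (K or R) and 'GKA' (G or V) have no dedicated letter and fall back to 'X'.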
I haven't found a way to force Seq.translate() to use IUPACProtein instead of ExtendedIUPACProtein, which might be what you want if you'd rather see 'X' than 'J'. An ugly fix would be to just use string replace:
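A minimal sketch of that string-replace workaround, operating on the plain string you get back from translation (for example the result of `c.translate().tostring()` in your session). The helper name `mask_xle` is made up for this example.

```python
def mask_xle(protein):
    """Replace the ambiguous Leu/Ile code 'J' with the generic unknown 'X'."""
    return protein.replace("J", "X")


print(mask_xle("XJXJ"))
```

The same approach would extend to 'B' and 'Z' if you want every ambiguous residue reported as 'X'.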
Biopython seems to use an extended alphabet for the amino acids: see http://www.biopython.org/DIST/docs/api/Bio.Alphabet.IUPAC.ExtendedIUPACProtein-class.html
B = "Asx"; Aspartic acid (D) or Asparagine (N)
X = "Xxx"; Unknown or 'other' amino acid
Z = "Glx"; Glutamic acid (E) or Glutamine (Q)
J = "Xle"; Leucine (L) or Isoleucine (I), used in mass-spec (NMR)
U = "Sec"; Selenocysteine
O = "Pyl"; Pyrrolysine
Thanks Pierre. That helps some. There is still something I must be missing though. You are right that ExtendedIUPACProtein uses 'J'. So in that case, based on my example, 'WTA' would be the corresponding codon. I still don't see where that gets mapped to 'J'.
ExtendedIUPACDNA defines 'W' as wyosine (I don't even know what that is... googling): http://biopython.org/DIST/docs/api/Bio.Alphabet.IUPAC.ExtendedIUPACDNA-class.html
B = 5-bromouridine
D = 5,6-dihydrouridine
S = thiouridine
W = wyosine
but "normal" DNA ambiguity codes are here in IUPACAmbiguousDNA http://biopython.org/DIST/docs/api/Bio.Alphabet.IUPAC.IUPACAmbiguousDNA-class.html
letters = 'GATCRYWSMKHBVDN'
'W' traditionally codes for 'T' or 'A'
If I get more specific in my example and specify an alphabet:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet.IUPAC import *
>>> c = Seq('ARAWTAGKAMTA',IUPACAmbiguousDNA)
>>> c.translate().tostring()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mark/Downloads/biopython-1.54/build/lib.macosx-10.6-universal-2.6/Bio/Seq.py", line 930, in translate
  File "/Users/mark/Downloads/biopython-1.54/build/lib.macosx-10.6-universal-2.6/Bio/Alphabet/__init__.py", line 213, in _get_base_alphabet
AssertionError: Invalid alphabet found, <class Bio.Alphabet.IUPAC.IUPACAmbiguousDNA at 0x10057c230>
Bad things happen. So I am still not quite understanding the ins and outs of this. The translate call goes to CodonTable where 1) I still don't see a 'J' and 2) I don't understand this new error.