Question: Back Translating Protein Sequence Without Ambiguities...
gravatar for Badgerichard
10.1 years ago by
Badgerichard50 wrote:

(first question so be gentle...!)

I am wanting to do an exhaustive search over the human genome for DNA sequences that could encode a particular amino acid sequence. I got as far as back translating the amino acids into a DNA sequence with IUPAC ambiguity codes, but could not find a utility to disambiguate it (i.e. generate all possible sequences that could be encoded by the ambiguous sequence). I am intending to take the set of sequences and utilise a short read mapper that does not accept ambiguities, so need a set of discrete nucleotide sequences with no ambiguities.

Thanks in advance!

protein dna • 7.6k views
ADD COMMENTlink modified 7.1 years ago by Biostar ♦♦ 20 • written 10.1 years ago by Badgerichard50
gravatar for Neilfws
10.1 years ago by
Sydney, Australia
Neilfws49k wrote:

I don't know of a tool to generate all possible nucleotides from an ambiguous back-translation. In fact, it's quite a tricky computational problem.

Let's take this very short peptide: MVIL.

Using Bioperl's Bio::Tools::SeqPattern, we can back-translate and expand to a regular expression:

use Bio::Tools::SeqPattern;
my $pat     = 'MVIL';
my $pattern = Bio::Tools::SeqPattern->new(-seq =>$pat, -type =>'amino');
print $pattern->backtranslate->expand, "\n";
# result

If you picture moving along that regular expression, one position at a time and splitting each time that there is more than one option, you can see that there are 4 x 3 x 2 x 4 splits, giving 96 possible combinations of bases for just a 4 residue peptide.

I'm surprised that there is not an algorithm to convert a regular expression into all possible string combinations. It should not be too hard to program, based on the example above, but I'd be tempted as Dan says to look at another approach which uses the ambiguity.

ADD COMMENTlink modified 17 months ago by _r_am31k • written 10.1 years ago by Neilfws49k

Thanks for the code snippet (at least insured my bioperl install was working). I've understood that the . means [GATC] and can see logic. So what is left to do is right a little perl to iterate through all possible sequences...

ADD REPLYlink written 10.1 years ago by Badgerichard50

Exactly, nicely explained

ADD REPLYlink written 9.8 years ago by Michael Dondrup48k
gravatar for Daniel Swan
10.1 years ago by
Daniel Swan13k
Aberdeen, UK
Daniel Swan13k wrote:

I think you might want to try this approach: "Fast-Find: A novel computational approach to analyzing combinatorial motifs"

They index the 'target' DNA sequences you wish to search then allow you to run a degenerate search against those sequences to return matches.

This seems like it will achieve what you want to do, rather than write something that would have to 'disambiguate' into quite a large sequence space if you were to perform every possible back-translation (depending on the size of your peptide of course).

ADD COMMENTlink written 10.1 years ago by Daniel Swan13k

+1 that looks useful. And pretty fast according to a quick read.

ADD REPLYlink written 10.1 years ago by brentp23k
gravatar for Jvb
9.7 years ago by
Jvb50 wrote:

Slightly long-winded approach in Python that does back-translation on ambiguous nucleotides instead of amino acids:

#!/usr/bin/env python

# 1. define codon table
codon_table = {
    'A': ('GCT', 'GCC', 'GCA', 'GCG'),
    'C': ('TGT', 'TGC'),
    'D': ('GAT', 'GAC'),
    'E': ('GAA', 'GAG'),
    'F': ('TTT', 'TTC'),
    'G': ('GGT', 'GGC', 'GGA', 'GGG'),
    'I': ('ATT', 'ATC', 'ATA'),
    'H': ('CAT', 'CAC'),
    'K': ('AAA', 'AAG'),
    'L': ('TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'),
    'M': ('ATG',),
    'N': ('AAT', 'AAC'),
    'P': ('CCT', 'CCC', 'CCA', 'CCG'),
    'Q': ('CAA', 'CAG'),
    'R': ('CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
    'S': ('TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'),
    'T': ('ACT', 'ACC', 'ACA', 'ACG'),
    'V': ('GTT', 'GTC', 'GTA', 'GTG'),
    'W': ('TGG',),
    'Y': ('TAT', 'TAC'),
    '*': ('TAA', 'TAG', 'TGA'),

# 2. create dictionary of base possibilities mapped to ambiguous bases 
ambiguous_bases = {
    frozenset(['A']): 'A',
    frozenset(['C']): 'C',
    frozenset(['G']): 'G',
    frozenset(['T']): 'T',
    frozenset(['U']): 'T',
    frozenset(['A', 'G']): 'R',
    frozenset(['C', 'T']): 'Y',
    frozenset(['G', 'C']): 'S',
    frozenset(['A', 'T']): 'W',
    frozenset(['G', 'T']): 'K',
    frozenset(['A', 'C']): 'M',
    frozenset(['C', 'G', 'T']): 'B',
    frozenset(['A', 'G', 'T']): 'D',
    frozenset(['A', 'C', 'T']): 'H',
    frozenset(['A', 'C', 'G']): 'V',
    frozenset(['A', 'C', 'G', 'T']): 'N',

# 3. naively determine ambiguous codons from ambiguous bases
ambiguous_codon_table = {}
for amino_acid, codons in codon_table.iteritems():
    ambiguous_codon = ''
    for i in range(3): # len(codon)
        bases = frozenset([codon[i] for codon in codons])
        ambiguous_codon += ambiguous_bases[bases]
    ambiguous_codon_table[amino_acid] = (ambiguous_codon,)

# 4. correct the incorrectly determined codons from 2 
ambiguous_codon_table['L'] = ('TTR', 'CTN')
ambiguous_codon_table['R'] = ('CGN', 'AGR')
ambiguous_codon_table['S'] = ('TCN', 'AGY')
ambiguous_codon_table['*'] = ('TAR', 'TGA')

# backtranslation functions that work for any string with a codon_table that is
# a dict mapping characters to a sequence of strings.

def backtranslate_permutations(protein, codon_table=codon_table):
    '''Returns the number of back-translated nucleotide sequences for a protein
    and codon table combination.

    >>> protein = 'ACDEFGHIKLMNPQRSTVWY*'
    >>> backtranslate_permutations(protein)
    >>> backtranslate_permutations(protein, codon_table=ambiguous_codon_table)
    permutations = 1
    for amino_acid in protein:
        permutations *= len(codon_table[amino_acid])
    return permutations

def backtranslate(protein, codon_table=codon_table):
    '''Returns the back-translated nucleotide sequences for a protein and codon 
    table combination.

    >>> protein = 'FVC'
    >>> len(backtranslate(protein))
    # create initial sequences == list of codons for the first amino acid
    sequences = [codon for codon in codon_table[protein[0]]]
    for amino_acid in protein[1:]:
        # add each codon to each existing sequence replacing sequences
        # leaves (num_codons * num_sequences) for next amino acid 
        to_extend = sequences
        sequences = []
        for codon in codon_table[amino_acid]:
            for sequence in to_extend:
                sequence += codon
    return sequences

# 5. reverse ambiguous_bases to get a kind of codon table 
ambiguous_bases_reversed = dict((value, sorted(list(key))) for (key, value) in ambiguous_bases.iteritems())

# 6. Reuse the backtranslation functions with ambiguous nucleotides instead
def disambiguate(ambiguous_dna):
    '''Call backtranslate with ambiguous DNA and reversed ambiguous bases.

    >>> ambiguous_dna = 'ACGTRYSWKMBDHVN'
    >>> backtranslate_permutations(ambiguous_dna, ambiguous_bases_reversed)
    >>> len(disambiguate(ambiguous_dna))
    return backtranslate(ambiguous_dna, ambiguous_bases_reversed)

if __name__ == '__main__':
    import doctest
ADD COMMENTlink modified 17 months ago by _r_am31k • written 9.7 years ago by Jvb50

great code! Just a small bug that affects the behavior of ambiguous_bases_reversed (when a T is given to that dict), to fix it, delete the following line:

frozenset(['U']): 'T',
ADD REPLYlink modified 17 months ago by _r_am31k • written 21 months ago by Rayan Chikhi1.4k
gravatar for Peter
10.1 years ago by
Scotland, UK
Peter5.9k wrote:

Have you ruled out using TBLASTN? It takes a protein query and compares it to a nucleotide database - see

ADD COMMENTlink written 10.1 years ago by Peter5.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2395 users visited in the last hour