Question: Common pattern for several amino acids
4.4 years ago by
Czech Republic
tretyacv40 wrote:


I am trying to generate a nucleotide motif that encodes only chosen amino acids. For example, histidine is encoded by CAT and CAC, and arginine by CGT, CGC, CGA, CGG, AGA and AGG. Taking every nucleotide that occurs at each codon position gives this pattern:

1st position in codon: C or A

2nd position in codon: A or G

3rd position in codon: A, T, C or G

That rule covers the chosen amino acids (H and R), but it also produces amino acids that I don't want (for example, AAA is lysine and AAT is asparagine). So I need a pattern that matches only my chosen amino acids. In the case above it could be [C][A or G][T], which yields only histidine (CAT) and arginine (CGT) and no other amino acid.

I am trying to work out an algorithm that does this for any set of amino acids I choose (more than two). If no such pattern exists, it should fall back to patterns that cover fewer of the amino acids (for example, if there is no pattern for all 5 amino acids, it should find patterns for 4 of the amino acids in the query). This final optimization problem is probably the hardest part. Any suggestions? Thanks a lot, and sorry for my poor English.
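To make the problem concrete, here is a minimal Python sketch of the naive per-position rule described above, assuming the standard genetic code. The helper names (`codons_for`, `naive_pattern`, `encoded`) are illustrative and not part of the original post. It builds the union of nucleotides at each codon position for the target amino acids and then lists every amino acid the resulting pattern can produce.

```python
from itertools import product

# Standard genetic code, built from the usual TCAG ordering ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codons_for(aa):
    """All codons that encode the one-letter amino acid `aa`."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]


def naive_pattern(targets):
    """Union of the nucleotides seen at each codon position over all target codons."""
    codons = [c for aa in targets for c in codons_for(aa)]
    return [{c[i] for c in codons} for i in range(3)]


def encoded(pattern):
    """Every amino acid produced by the codons that a per-position pattern allows."""
    return {CODON_TABLE["".join(c)] for c in product(*pattern)}


pattern = naive_pattern("HR")
print(["".join(sorted(p)) for p in pattern])  # nucleotide mix at each position
print(sorted(encoded(pattern)))               # includes amino acids other than H and R
```

For H and R the union pattern also yields lysine, asparagine, glutamine and serine, which is exactly the over-coverage the question describes.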

modified 4.4 years ago by RamRS21k • written 4.4 years ago by tretyacv40

Hello tretyacv!

It appears that your post has been cross-posted to another site:

This is typically not recommended as it runs the risk of annoying people in both communities.

written 4.4 years ago by Pierre Lindenbaum120k

Yes, that's true, I'm sorry. I thought that Stack Overflow is more about computer science and Biostars is a bioinformatics site, so the communities do not overlap.

written 4.4 years ago by tretyacv40

You are correct in assuming that Biostars specializes in bioinformatics, but people here also have a presence on Stack Overflow because geekiness has no boundaries :)

written 4.4 years ago by RamRS21k
4.4 years ago by
Houston, TX
RamRS21k wrote:

I'd solve it this way:



H or R: (CAT|CAC|CGT|CGC|CGA|CGG|AGA|AGG) # matches only the codons you need.

You can always create a dict mapping each amino acid to its list of codons (AA: List<codons>) and build the pattern by joining the codons of all target AAs with the delimiter '|' and wrapping the whole set in '(' and ')'.
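A minimal Python sketch of that dict-plus-alternation idea, assuming the standard genetic code. The names `AA_TO_CODONS` and `codon_regex` are illustrative, not from the answer.

```python
import re
from itertools import product

# Standard genetic code, built from the usual TCAG ordering ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Invert the table: amino acid -> list of codons that encode it.
AA_TO_CODONS = {}
for codon, aa in CODON_TABLE.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def codon_regex(targets):
    """Compile '(codon1|codon2|...)' from the codons of all target amino acids."""
    codons = [c for aa in targets for c in AA_TO_CODONS[aa]]
    return re.compile("(" + "|".join(codons) + ")")


h_or_r = codon_regex("HR")
print(h_or_r.pattern)
print(bool(h_or_r.fullmatch("CGG")), bool(h_or_r.fullmatch("AAA")))
```

Using `fullmatch` on one triplet at a time keeps the check in the codon reading frame. Searching a whole sequence with the same pattern would also report matches that straddle two codons.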

written 4.4 years ago by RamRS21k

But how do I generate the pattern? I need to create a library of DNA sequences where, for example, the third position holds this universal pattern for histidine or arginine, and I need to tell the organic chemists which nucleotide to add next in the synthesis: first C, then a mix of A and G, and finally only T. This recipe yields DNA sequences with codons only for arginine or histidine. And of course I need to generalize this problem, not only for R and H.

written 4.4 years ago by tretyacv40

That IMO is only possible if the target codons share more rather than fewer nucleotides at corresponding positions. The more diverse the codon set, the more false matches you will get. Looking at this nucleotide by nucleotide cannot yield a solution in general, because a codon only has meaning as a unit of three nucleotides.

Even if you manually created patterns for all combinations, you'd end up with false positives unless you restrict yourself to a subset of the codons, assuming an overlapping subset (with >=1 common nucleotide) exists in the first place. For example, with R and H, if you chose CAT, CAC, CGT and CGC, you could use C[AG][TC] to be sure that you're going position by position and getting only R and H.
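Because there are only 15 possible nucleotide mixes per position, all 15^3 = 3375 position-by-position patterns can simply be enumerated. The following is a brute-force Python sketch, assuming the standard genetic code. The function names are illustrative, and this is one possible approach rather than a method proposed in the thread. It keeps the patterns that encode nothing outside the target set, then returns those covering the most target amino acids, which also gives the fallback to fewer amino acids when no pattern covers them all.

```python
from itertools import combinations, product

# Standard genetic code, built from the usual TCAG ordering ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# The 15 non-empty nucleotide mixes that can be used at a single position.
MIXES = [frozenset(m) for n in range(1, 5) for m in combinations("ACGT", n)]


def exclusive_patterns(targets):
    """Map each covered subset of `targets` to the patterns that encode exactly it."""
    targets = set(targets)
    found = {}
    for pattern in product(MIXES, repeat=3):
        encoded = frozenset(CODON_TABLE["".join(c)] for c in product(*pattern))
        if encoded <= targets:  # nothing unwanted (including stops) is produced
            found.setdefault(encoded, []).append(pattern)
    return found


def best_patterns(targets):
    """Keep only the patterns that cover the largest number of target amino acids."""
    found = exclusive_patterns(targets)
    top = max((len(covered) for covered in found), default=0)
    return {covered: pats for covered, pats in found.items() if len(covered) == top}


def show(pattern):
    """Render a pattern as, for example, '[C][AG][CT]'."""
    return "".join("[" + "".join(sorted(mix)) + "]" for mix in pattern)


for covered, pats in best_patterns("HR").items():
    print("".join(sorted(covered)), [show(p) for p in pats])
```

For H and R this finds patterns that cover both amino acids, including the C[AG][TC] example above. For a query such as H, R and K no single pattern covers all three, so the result falls back to the two-amino-acid subsets that do have one.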

I think this might be possible, but you'll invest more time creating the algorithm than saving time using it.

written 4.4 years ago by RamRS21k