Question: Retrieve The Coding Amino-Acid When There Is Certain Pattern In A Coding Dna Sequence
gravatar for biotech
6.4 years ago by
United States
biotech540 wrote:

Hi community,

I would like to retrieve the coding amino-acid when there is certain pattern in a DNA sequence. For example, the pattern could be: ATAGTA. So, when having:

Input file:



The ideal output would be a table having for each amino-acid the number of times is coded by the pattern. Here in sequence1, pattern codes only for one amino-acid, but in sequence2 it codes for two. I would like to have this tool working to scale to thousands of sequences. I've been thinking about how to get this done, but I only thought to: replace all nucleotides different than the pattern, translate what remains and get summary of the coded amino-acids.

Please let me know if this task can be performed by an already available tool.

Thanks for your help. All the best, Bernardo


Edit (due to the confusion generated with my post):

Please forget the original post and sequence1 and sequence2 too.

Hi all, and sorry for the confusion. The input fasta file is a *.ffn file derived from a GenBank file using 'FeatureExtract' tool (, so a can imagine they are already in frame (+1) and there is no need to get amino-acids coded in a frame different than +1.

I would like to know for which amino-acid the following sequences are coding for:


The unique strings I want to get coding amino-acids are repeats of three AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to retrieve coding amino-acids for repeats of four or more.

Thanks again, Bernardo

amino-acids perl script • 3.1k views
ADD COMMENTlink modified 6.4 years ago • written 6.4 years ago by biotech540

I must be tired. Am I the only one that don't understand at all your question ?

ADD REPLYlink written 6.4 years ago by Pierre Lindenbaum127k

You are not tired, I was vague.

ADD REPLYlink written 6.4 years ago by biotech540

This is a very confusing question as written. I think what you are trying to say is that in the second sequence, ATAGTA is in-frame (counting from position 1) and so contains 2 codons (ATA, GTA). Whereas in the first sequence, only the sub-sequence AGT is in-frame, so represents 1 codon.

Am I right?

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by Neilfws48k

Neilfws, this is exactly what I was trying to say, thanks for the clarification.

ADD REPLYlink written 6.4 years ago by biotech540
gravatar for Devon Ryan
6.4 years ago by
Devon Ryan94k
Freiburg, Germany
Devon Ryan94k wrote:

The reason that no one has replied is that you didn't ask a coherent question. What you seem to have wanted to ask was, "Given a multisequence fasta file, are there any scripts that can output the number of start codons in each sequence?". Or perhaps you meant open reading frames instead of start codons and just don't care about the frame having a stop codon (since sequence1 in your example lacked one). I'll also note that sequence1 does actually have a 2 start codons (the second one is going the other direction). So, I assume that you also only want start codons encoded in the forward direction. With biopython, one possible solution would be:

#!/usr/bin/env python
from Bio import SeqIO
import sys

handle = open(sys.argv[1], "r")
for record in SeqIO.parse(handle, "fasta"):
    StartCodons = 0

Of course, in sequence2, both of the start codons are in the same frame, so perhaps you want that as well. You should be able to get the idea from the trivial example that I wrote how to accomplish that. If you really want this in perl, then you should be able to translate it.

In the future, try to learn the appropriate terminology for the question you want to ask. Otherwise you'll get no help.

ADD COMMENTlink written 6.4 years ago by Devon Ryan94k

Hi dpryan79, is it more clear now? See my edited post above.

ADD REPLYlink written 6.4 years ago by biotech540

Slightly, though I'm still unsure what you really want or why you want to do it. I mean, AGAGAG, for example, will always code for the same thing (RE). Likewise, GAGAGA will always code for ER. If you just want to translate the sequence, you could just use the translate() function in biopython (or whatever the equivalent is in bioperl). Do you instead just want to search for a handful of in-frame hexa-nucleotide patterns?

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by Devon Ryan94k

Hi dpryan79,

My main goal is to see if the patterns:





always code for certain amino-acids or not, because they could be in a frame different than +1, that is, +2 or +3 (remember, the sequences are already in +1 frame).

ADD REPLYlink modified 6.4 years ago • written 6.4 years ago by biotech540

Either modifying what I wrote or using a regex to find the positions and then computing the frame using modulus 3 would work.

ADD REPLYlink written 6.4 years ago by Devon Ryan94k
gravatar for Jelena Aleksic
6.4 years ago by
Cambridge, UK
Jelena Aleksic910 wrote:

You can write a Perl script to do this. You would start from the current gene annotations (e.g. Ensembl rel73 for human), get the coding sequence coordinates from here, and then you need to pay attention to the reading frame. I have a similar script that also compares the predicted amino acid residues to UniProt annotated ones, and flags inconsistencies (sometimes there are multiple potential ORF starts etc - I just filter the contradictory data, as it is a minority case). From there, you can look for a pattern in the DNA, and find out which amino acids it codes for.

However, I don't really understand the purpose of your question. The same DNA sequence will, if in frame, always code for the same amino acids. Are you just counting the frequency of occurrence of a particular DNA sequence in frame? What is the purpose of doing so?

I also don't understand your example. It appears that the sequence is out of frame in the first example (what you describe as coding for only 1 amino acid - though it could be described as contributing to 3), while it's in frame in the second (coding for 2 amino acids). From a biology perspective, why is this interesting?

ADD COMMENTlink written 6.4 years ago by Jelena Aleksic910

Hi Jelena Aleksic, Have a look at my edited post and let's see if we can redirect the solution. Thanks!

ADD REPLYlink written 6.4 years ago by biotech540
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1829 users visited in the last hour