Question

Retrieve The Coding Amino-Acid When There Is Certain Pattern In A Coding Dna Sequence

0

Entering edit mode

10.4 years ago

biotech ▴ 570

Hi community,

I would like to retrieve the coding amino-acid when there is certain pattern in a DNA sequence. For example, the pattern could be: ATAGTA. So, when having:

Input file:

>sequence1
ATGGCGCATAGTAATGC

>sequence2
ATGATAGTAATGCGCGC

The ideal output would be a table having for each amino-acid the number of times is coded by the pattern. Here in sequence1, pattern codes only for one amino-acid, but in sequence2 it codes for two. I would like to have this tool working to scale to thousands of sequences. I've been thinking about how to get this done, but I only thought to: replace all nucleotides different than the pattern, translate what remains and get summary of the coded amino-acids.

Please let me know if this task can be performed by an already available tool.

Thanks for your help. All the best, Bernardo

------------------------------------------------------------------------------------------------

Edit (due to the confusion generated with my post):

Please forget the original post and sequence1 and sequence2 too.

Hi all, and sorry for the confusion. The input fasta file is a *.ffn file derived from a GenBank file using 'FeatureExtract' tool (http://www.cbs.dtu.dk/services/FeatureExtract/download.php), so a can imagine they are already in frame (+1) and there is no need to get amino-acids coded in a frame different than +1.

I would like to know for which amino-acid the following sequences are coding for:

AGAGAG
GAGAGA
CTCTCT
TCTCTC

The unique strings I want to get coding amino-acids are repeats of three AG, GA, CT or TC, that is (AG)3, (GA)3, (CT)3 and (TC)3, respectively. I don't want the program to retrieve coding amino-acids for repeats of four or more.

Thanks again, Bernardo

amino-acids script perl • 4.2k views

ADD COMMENT • link 10.4 years ago by biotech ▴ 570

1

Entering edit mode

I must be tired. Am I the only one that don't understand at all your question ?

ADD REPLY • link 10.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

You are not tired, I was vague.

ADD REPLY • link 10.4 years ago by biotech ▴ 570

0

Entering edit mode

This is a very confusing question as written. I think what you are trying to say is that in the second sequence, ATAGTA is in-frame (counting from position 1) and so contains 2 codons (ATA, GTA). Whereas in the first sequence, only the sub-sequence AGT is in-frame, so represents 1 codon.

Am I right?

ADD REPLY • link 10.4 years ago by Neilfws 49k

0

Entering edit mode

Neilfws, this is exactly what I was trying to say, thanks for the clarification.

ADD REPLY • link 10.4 years ago by biotech ▴ 570

score 2 · Answer 1 · 2013-11-11

2

Entering edit mode

10.4 years ago

Devon Ryan 104k

The reason that no one has replied is that you didn't ask a coherent question. What you seem to have wanted to ask was, "Given a multisequence fasta file, are there any scripts that can output the number of start codons in each sequence?". Or perhaps you meant open reading frames instead of start codons and just don't care about the frame having a stop codon (since sequence1 in your example lacked one). I'll also note that sequence1 does actually have a 2 start codons (the second one is going the other direction). So, I assume that you also only want start codons encoded in the forward direction. With biopython, one possible solution would be:

#!/usr/bin/env python
from Bio import SeqIO
import sys

handle = open(sys.argv[1], "r")
for record in SeqIO.parse(handle, "fasta"):
    StartCodons = 0
    i=0
    while(i

Of course, in sequence2, both of the start codons are in the same frame, so perhaps you want that as well. You should be able to get the idea from the trivial example that I wrote how to accomplish that. If you really want this in perl, then you should be able to translate it.

In the future, try to learn the appropriate terminology for the question you want to ask. Otherwise you'll get no help.

ADD COMMENT • link 10.4 years ago by Devon Ryan 104k

0

Entering edit mode

Hi dpryan79, is it more clear now? See my edited post above.

ADD REPLY • link 10.4 years ago by biotech ▴ 570

0

Entering edit mode

Slightly, though I'm still unsure what you really want or why you want to do it. I mean, AGAGAG, for example, will always code for the same thing (RE). Likewise, GAGAGA will always code for ER. If you just want to translate the sequence, you could just use the translate() function in biopython (or whatever the equivalent is in bioperl). Do you instead just want to search for a handful of in-frame hexa-nucleotide patterns?

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

0

Entering edit mode

Hi dpryan79,

My main goal is to see if the patterns:

AGAGAG

GAGAGA

CTCTCT

TCTCTC

always code for certain amino-acids or not, because they could be in a frame different than +1, that is, +2 or +3 (remember, the sequences are already in +1 frame).

ADD REPLY • link 10.4 years ago by biotech ▴ 570

0

Entering edit mode

Either modifying what I wrote or using a regex to find the positions and then computing the frame using modulus 3 would work.

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

score 0 · Answer 2 · 2013-11-11

You can write a Perl script to do this. You would start from the current gene annotations (e.g. Ensembl rel73 for human), get the coding sequence coordinates from here, and then you need to pay attention to the reading frame. I have a similar script that also compares the predicted amino acid residues to UniProt annotated ones, and flags inconsistencies (sometimes there are multiple potential ORF starts etc - I just filter the contradictory data, as it is a minority case). From there, you can look for a pattern in the DNA, and find out which amino acids it codes for.

However, I don't really understand the purpose of your question. The same DNA sequence will, if in frame, always code for the same amino acids. Are you just counting the frequency of occurrence of a particular DNA sequence in frame? What is the purpose of doing so?

I also don't understand your example. It appears that the sequence is out of frame in the first example (what you describe as coding for only 1 amino acid - though it could be described as contributing to 3), while it's in frame in the second (coding for 2 amino acids). From a biology perspective, why is this interesting?