Question

Translating DNA to protein with perl

0

Entering edit mode

8.1 years ago

ksi216 ▴ 80

I have a DNA sequence in a scalar and have already performed the reverse function, now I want to get the compliment by using tr in perl, I'm getting a syntax error.

my $seq = "CCTCACACGCCCTGTTCTGGTGTCTGGTCGGATGTGTTGGTCGTGCGCAGATTGGTGCTC";
my $revseq = reverse $seq;
print "$revseq";
my $compseq = if($revseq =~ tr/ATCG/TAGC/){
print "$compseq";

then after I get the reverse compliment I have to get the RNA seq by using TR to replace T with U.

The hard part is I have a text file with codons and need to translate my RNA sequence into a protein. Any suggestions?

perl • 10.0k views

ADD COMMENT • link updated 12 months ago by Ram 43k • written 8.1 years ago by ksi216 ▴ 80

1

Entering edit mode

Btw.: your code is a fragment only and this is the error I get:

perl -c fragment.pl
syntax error at fragment.pl line 4, near "= if"
Missing right curly or square bracket at fragment.pl line 6, at end of line
fragment.pl had compilation errors.

You can't use an if statement as right-hand value of an assignment in Perl, the following should be slightly more safe:

my $seq = "CCTcacACGCCCTGTTCTGGTGTCTGGTCGGATGTGTTGGTCGTGCGCAGATTGGTGCTC";
print "$seq\n";
my $revseq = reverse $seq;
print "$revseq\n";
$revseq =~ tr/AaTtCcGg/TtAaGgCc/;
print "$revseq\n";

ADD REPLY • link 8.1 years ago by Michael 54k

0

Entering edit mode

the sequence its returning is shorter than the original ?

ADD REPLY • link 8.1 years ago by ksi216 ▴ 80

1

Entering edit mode

No, why?

perl  fragment.pl
CCTcacACGCCCTGTTCTGGTGTCTGGTCGGATGTGTTGGTCGTGCGCAGATTGGTGCTC
CTCGTGGTTAGACGCGTGCTGGTTGTGTAGGCTGGTCTGTGGTCTTGTCCCGCAcacTCC
GAGCACCAATCTGCGCACGACCAACACATCCGACCAGACACCAGAACAGGGCGTgtgAGG

ADD REPLY • link 8.1 years ago by Michael 54k

0

Entering edit mode

your correct sorry. how would I Get three nucleotides at a time (a codon) from each sequence until it is not possible to get another codon. Get the single letter amino acid code for each codon and add it to a scalar which contains the final translated sequence. i have the codons in a text file

ADD REPLY • link 8.1 years ago by ksi216 ▴ 80

0

Entering edit mode

So, as you are trying to learn how to do it, I will show you some steps that you could try and which documentation to read, and not a full solution, I will also list things that can go wrong. Please check my answer.

ADD REPLY • link 8.1 years ago by Michael 54k

score 2 · Answer 1 · 2016-04-02

Edit: Notes:

Make sure to remove any non-word characters from your input
To get range characters of a string, check out the substr function: perldoc -f substr, you could also use unpack to split the string into an array of triplets, and use array functions. substring will be faster most likely, array function will be more readable. You can find different ways of exactly your problem by searching stackoverflow.
You do only need to reverse complement, if you want to translate the reverse strand. CDS are noted in the correct direction, such that they can be translated directly.
Make a genetic code table for standard genetic code including stop codons, store it in a hash, use the three letter codon as the hash key
If the codon is not covered in the code table, you can output X for unknown sequence, e.g. if the input contains Ns

Pitfalls (you could argue this can never happen ;)> :

String IO, did you treat line breaks correctly? Do you want to have a certain input output format, e.g. Fasta?
How do you treat unknown characters, e.g. N, most genomes contain N, capital vs. non-caps?
A lot of sequences contain IUPAC ambiguity symbols in DNA, sometimes, the AA sequence can be uniquely recovered, even if the codon contains ambiguity
which frame do you want to translate? Could there be an offset at the start of the sequence? What if the sequence length is not multiple of 3?
What do you do if there is no start or stop codon in the sequence?
there are many different genetic codes, standard is often ok, but not always

My only other advice here is, unless this is just for learning perl, and exercise, or you understand exactly all the pitfalls, please do not re-implement a heavily tested and debugged existing piece of code. If this is part of a research project, you should use either the tool EMBOSS transeq, or the BioPerl/Libraries which also have a translate function, see http://www.bioperl.org/wiki/HOWTO:Beginners#Translating.

It is not trivial to specify and implement such a simple at first glance problem as translation, it turns out there are more potential pitfalls than first expected.

To make clearer what I mean, from the documentation of Transeq 6.6.0.0:

History Written 4 March 1999 - Gary Williams July 2001 - changed definition of reverse frames to use the same codon phase as forward frames. - Gary Williams

This program has been used to translate billions of sequences since it was written, it is hard to imagine that there is a special case of input that it could not handle or a bug that has not been discovered until now.

score 1 · Answer 2 · 2016-04-02

1

Entering edit mode

8.1 years ago

mastal511 ★ 2.1k

what is the syntax error you are getting?
there may be a function in one of the BioPerl packages to transform your RNA sequence to protein.

ADD COMMENT • link 8.1 years ago by mastal511 ★ 2.1k

0

Entering edit mode

ive resolved the error. could anyone help with how to . Get three nucleotides at a time (a codon) from each sequence until it is not possible to get another codon. Get the single letter amino acid code for each codon and add it to a scalar which contains the final translated sequence. Print the two final protein sequences. i have the codons and the proteins in a separate file

ADD REPLY • link 8.1 years ago by ksi216 ▴ 80

score 1 · Answer 3 · 2016-04-02

1

Entering edit mode

8.1 years ago

Ibrahim Tanyalcin ★ 1.2k

Dear,

Just a small remark, for your variable $compseq,if you want to use the if clause, than first initiate your variable:

my $compseq; if(...){$compseq = ...;} else {$compseq = .... ;} print $compseq."\n";

or use a ternary operator if you directly want to asign a value.

That being said, make sure you try your script on variety of outputs many times before you release it. Make sure there are no errors or the user is informed properly of the error. I am totally happy that you are reinventing the wheel. If no one reinvents the wheel, we will soon run out of people who can even remember what a wheel was. Keep up the good work,

Regards,

PS: I do have a code for what you are asking in the above comment, but first try to come up with your own solution.

ADD COMMENT • link 8.1 years ago by Ibrahim Tanyalcin ★ 1.2k

0

Entering edit mode

im confused as to the logic of how to perform the translation task any hints or help ?

ADD REPLY • link 8.0 years ago by ksi216 ▴ 80

0

Entering edit mode

There are many different styles to achieve it, here is the logical flow of one: First initate a hash where you keep your codons and their correspoinding amino acid such as: my %degenerate_code = ("A" => ["GCT","GCC","GCA","GCG","Nonpolar"],....) then split the the mRNA/cDNA by "" (empty string) and store it in an array. Now loop through this array with 3 offset values (i=0 or i=1 or i=2; because you have 3 possible frames), at each step combine 3 consecutive letters and search through the hash (using grep) to find the amino acid, once found store in an other array or combine as string. Do the whole loop for the 3 aforementioned offset values. You will get 3 possible candidates.

ADD REPLY • link 8.0 years ago by Ibrahim Tanyalcin ★ 1.2k

0

Entering edit mode

Using the codons as hash keys would be much more efficient: %gencode = (AUG => 'M', ...). The search for hash complexity for hash key lookup is O(1) and also more clean to write the lookup. You can find ready-to-copy-paste tables e.g. https://bioperl.wordpress.com/2011/04/14/genetic-code-perl-hash/ here. Make sure to double check that they are correct.

A ready solution to cheat is here http://www.perlmonks.org/?node_id=94263,

note that the translation is by a single regex replace where %g is the code table s/.{1,3}/$g{$&}/g, this was a code golf to make the shortest (aka. most unreadable obfuscated code). This solution came out fastest while others could reduce the code length to 123 characters including the code table.

ADD REPLY • link 8.0 years ago by Michael 54k

0

Entering edit mode

what would be the shortest style without using bio perl ?

ADD REPLY • link 8.0 years ago by ksi216 ▴ 80

0

Entering edit mode

Please read my response, what is shortest is irrelevant, it is not recommended for learning perl to try to imitate this style.

ADD REPLY • link 8.0 years ago by Michael 54k