How Do I Translate Multiple (More Than 25000) DNA Sequences With Different Frames To Protein Sequences?
8.9 years ago
biostar ▴ 170

How can I translate multiple (more than 25000) DNA sequences with different frames to protein sequences? Is there a program or Perl script I can use to do that? I am also not sure whether I can include all the sequences, each with its own frame, and translate them all at the same time. Please share any information on this. Thanks!

dna protein • 16k views

Thanks guys! @Pavel, I think it allows me to submit the same frame for multiple sequences, but how do I include multiple sequences with different frames in one batch submission? Also, is there a way to omit the sequences that have stop codons in the frame used for translation? @Biolab, so the script you mentioned only works for frame 1? How do I translate other frames? Sorry, I am a novice in Perl. Thanks a bunch!


These are good questions, and you will need to do some work for that: extract and organize the sequences, compose the proper command lines, etc. For sixpack you will need to pre-process/split your dataset, because sixpack extracts all of the possible ORFs from a single sequence and allows customization of that process. transeq will just translate the whole batch, marking stops with *, so you will need to do post-processing. If the ORF positions are known, transeq can take the coordinates and translate only those regions.
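The post-processing step for transeq output could look roughly like the Python sketch below. It is not part of EMBOSS: the function names `read_fasta` and `drop_stops` are made up for illustration, and it assumes that stops appear as `*` in the protein FASTA and that trailing `*` characters (a normal terminator) are acceptable.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; sequences may span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_stops(records):
    """Keep only proteins without an internal stop; trailing '*' is allowed."""
    for header, protein in records:
        if "*" not in protein.rstrip("*"):
            yield header, protein


if __name__ == "__main__":
    # Hypothetical translated records, as lines of a protein FASTA file.
    example = [">seq1_1", "MKT*LV", ">seq2_1", "MKTLV*", ">seq3_1", "MKT", "LVA"]
    for header, protein in drop_stops(read_fasta(example)):
        print(f">{header}\n{protein}")
```

`read_fasta` accepts any iterable of lines, so an open file handle can be passed in place of the example list.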


Hi Youwanpras, I am also a Perl beginner. I wrote the script below. It works, but you had better test it yourself. Note that each sequence must be on a single line (I am not sure how to improve that). My script is not concise, so it would be helpful to ask others on BIOSTARS, as many experts are here. Hope it helps!

#!/usr/bin/env perl
use strict;
use warnings;

my @frames = (1, 2, 3, -1, -2, -3);    # six frames
foreach my $frame (@frames) {
    frame($frame);
}

sub frame {
    my $f = shift;
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    while (<$in>) {
        chomp;
        s/\r$//;                       # also strip a Windows carriage return
        my $m = length($_);            # each sequence should be on a single line
        if (/^>(\w+)/) {
            print ">${1}_frame$f\n";
        } elsif ($f > 0) {
            my $frameseq = substr($_, $f - 1, $m - $f + 1);
            print "$frameseq\n";
        } elsif ($f < 0) {
            my $comprevseq = reverse $_;
            $comprevseq =~ tr/ATCGatcg/TAGCtagc/;    # reverse complement
            my $frameseq = substr($comprevseq, abs($f) - 1, $m - abs($f) + 1);
            print "$frameseq\n";
        }
    }
    close $in;
}
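The single-line restriction mentioned above can be avoided by joining the lines of each record before slicing. Here is a rough Python sketch of the same idea (nucleotide frames only, no translation). The names `revcomp`, `six_frames` and `read_fasta` are illustrative, only A/C/G/T/N are complemented, and negative frames start from the first, second or third base of the reverse complement, as in the Perl script.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse complement of a DNA string (other symbols are left unchanged)."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return {frame: subsequence} for frames 1, 2, 3, -1, -2, -3."""
    rc = revcomp(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = seq[offset:]
        frames[-(offset + 1)] = rc[offset:]
    return frames


def read_fasta(lines):
    """Yield (name, sequence) pairs, joining multi-line sequences."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()            # also removes a Windows '\r'
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name = line[1:].split()[0] if len(line) > 1 else ""
            chunks = []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


if __name__ == "__main__":
    # Hypothetical two-line record; an open file handle works the same way.
    example = [">seq1 some description", "ATGC", "CC"]
    for name, seq in read_fasta(example):
        for frame, sub in six_frames(seq).items():
            print(f">{name}_frame{frame}\n{sub}")
```

Unlike the Perl version, this prints all six frames of one sequence before moving on to the next sequence.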


I am trying to translate a DNA sequence with your code, but it gives the following warning messages:

Use of uninitialized value $_ in scalar chomp at .....
Use of uninitialized value $_ in pattern match (m//) at .....
Use of uninitialized value $m in subtraction (-) at .....
Use of uninitialized value $_ in substr at .....


8.9 years ago
Pavel Senin ★ 1.9k

EMBOSS can do it really fast.

1. sixpack

sixpack reads a DNA sequence and writes an output file giving out the forward and reverse sense sequences with the three forward and (optionally) three reverse translations in a pretty display format. A genetic code may be specified for the translation. There are various options to control the appearance of the output file. It also writes a file of protein sequences corresponding to any open reading frames that are larger than the specified minimum size: the default of 1 base shows all possible open reading frames.


2. transeq

transeq reads one or more nucleotide sequences and writes the corresponding protein sequence translations to file. It can translate in any of the 3 forward or three reverse sense frames, or in all three forward or reverse frames, or in all six frames. The translation may be restricted to specified regions, for example, corresponding to the coding regions of your sequences. It can translate using the standard ('Universal') genetic code and also with a selection of non-standard codes.
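For a batch where each sequence has its own frame, one option is to group the sequences by frame and run transeq once per group. The sketch below only builds the command lines and does not run anything. It assumes EMBOSS is installed and that the `-sequence`, `-outseq` and `-frame` qualifiers behave as described in the transeq documentation (worth checking with `transeq -help` on your installation). The file names and the `frame_of` mapping are made up for illustration.

```python
from collections import defaultdict

VALID_FRAMES = {"1", "2", "3", "-1", "-2", "-3"}


def group_by_frame(frame_of):
    """Map each frame to the list of sequence IDs that should use it."""
    groups = defaultdict(list)
    for seq_id, frame in frame_of.items():
        frame = str(frame)
        if frame not in VALID_FRAMES:
            raise ValueError(f"unexpected frame {frame!r} for {seq_id}")
        groups[frame].append(seq_id)
    return dict(groups)


def transeq_command(infile, outfile, frame):
    """Build (not run) a transeq command line for one frame."""
    # The '-frame=value' form avoids a negative frame being read as a new option.
    return ["transeq", "-sequence", infile, "-outseq", outfile, f"-frame={frame}"]


if __name__ == "__main__":
    frame_of = {"seqA": 1, "seqB": -2, "seqC": 1}   # hypothetical IDs and frames
    for frame, ids in sorted(group_by_frame(frame_of).items()):
        # Assumes the sequences of each group were already written to these files.
        cmd = transeq_command(f"frame_{frame}.fasta", f"frame_{frame}.pep", frame)
        print(" ".join(cmd), "  # sequences:", ", ".join(ids))
```

Each command list could then be passed to `subprocess.run` once the per-frame FASTA files exist.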


FYI the current EMBOSS documentation and downloads can be found at: http://emboss.open-bio.org/. The old EMBOSS SourceForge site is obsolete.

EMBOSS contains a number of programs related to sequence translation (see B.6.25. Applications in group Nucleic:translation) and gene/ORF finding (see B.6.17. Applications in group Nucleic:gene finding).


The obsolescence is not mentioned anywhere on SourceForge, although many links do indeed point to open-bio. How do you know that it is obsolete?


From discussions with the EMBOSS developers.

The emboss.open-bio.org site is based on the content written for the EMBOSS books. The aim is to maintain the book content through updates to the emboss.open-bio.org site. While the content generated from the EMBOSS sources has been updated at SourceForge, the rest of the content is severely out of date, incomplete and occasionally misleading.

8.9 years ago
biolab ★ 1.4k

The following is a script for frame +1 translation. It uses the Bio::SeqIO module from BioPerl.

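use strict;
use warnings;
use Bio::SeqIO;

# Translate every sequence of a FASTA file in frame +1 (BioPerl's default).
# For frames +2/+3, translate() accepts -frame => 1 or 2 (a 0-based offset);
# for the reverse strand, translate $seq->revcom instead.
sub TranslateDNAFile {
    my ($infile, $outfile) = @_;
    my $in  = Bio::SeqIO->new(-file => $infile,     -format => "fasta");
    my $out = Bio::SeqIO->new(-file => ">$outfile", -format => "fasta");

    while (my $seq = $in->next_seq()) {
        $out->write_seq($seq->translate);
    }
}

my $DNAfile = "dna.fasta";
my $pepfile = "pep.fasta";
TranslateDNAFile($DNAfile, $pepfile);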
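To translate each sequence in its own frame, including the reverse strand, the same loop can take a frame per sequence. Below is a rough, dependency-free sketch in Python rather than BioPerl. The `translate` helper and the example batch are illustrative, it uses the standard genetic code only, and any codon containing characters other than A/C/G/T is written as `X`.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, frame=1):
    """Translate dna in frame 1, 2, 3, -1, -2 or -3.

    Negative frames start at the first, second or third base of the reverse
    complement (the same numbering as the Perl frame script in this thread;
    other tools may number reverse frames differently).
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError(f"invalid frame: {frame}")
    seq = dna.upper().replace("U", "T")
    if frame < 0:
        seq = seq.translate(COMPLEMENT)[::-1]
    start = abs(frame) - 1
    # Only complete codons are translated; leftover bases are ignored.
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    # Hypothetical input: sequence ID -> (DNA, frame to use for that sequence).
    batch = {"seqA": ("ATGGCCTAA", 1), "seqB": ("ATGGCCTAA", -1)}
    for seq_id, (dna, frame) in batch.items():
        print(f">{seq_id}_frame{frame}\n{translate(dna, frame)}")
```

With 25000 sequences, the per-sequence frames would typically come from a table (for example a two-column file of ID and frame) rather than a hard-coded dictionary.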

7.1 years ago
x.jack.min ▴ 20
3.4 years ago

I often use TranslatorX for sequence alignment; it gives you the aligned protein sequences as well as the corresponding DNA sequence alignment.

Thanks.

4.0 years ago
Kzra ▴ 40

I have written a program in Python 3 that takes a nucleotide FASTA file as input and translates each sequence in that file in the frame that produces the fewest stop codons.

The software translates each sequence in all six frames and counts the number of stop codons in each. It then writes the original sequence in the 'optimal' frame to an output FASTA file, with the frame name appended to the contig name. In cases where there are multiple optimal frames, Optimal Translate writes each of them to the output FASTA file. It only needs Python 3 to run and works on both DNA and RNA sequences.

You can access it here: https://github.com/Kzra/Optimal-Translate
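The frame-selection idea can be sketched in a few lines. This is not the code from the linked repository, just a minimal illustration of counting stop codons in each of the six frames and keeping every frame that ties for the minimum. The function names are made up, and negative frames are numbered from the start of the reverse complement, which is an assumption and may differ from the tool's own convention.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def count_stops(seq, offset):
    """Count stop codons when reading seq in steps of three from offset."""
    return sum(seq[i:i + 3] in STOPS for i in range(offset, len(seq) - 2, 3))


def best_frames(dna):
    """Return (sorted list of frames with the fewest stops, that stop count)."""
    forward = dna.upper().replace("U", "T")     # accept RNA as well as DNA
    reverse = forward.translate(COMPLEMENT)[::-1]
    counts = {}
    for offset in range(3):
        counts[offset + 1] = count_stops(forward, offset)
        counts[-(offset + 1)] = count_stops(reverse, offset)
    fewest = min(counts.values())
    return sorted(f for f, n in counts.items() if n == fewest), fewest


if __name__ == "__main__":
    frames, stops = best_frames("ATGTAAGGC")    # hypothetical short sequence
    print(f"frames with fewest stops: {frames} ({stops} stop codons)")
```

Counting stops does not require a full codon table, so the sketch only looks for the three stop codons of the standard genetic code.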