How to extract the longest orf?
5
2
Entering edit mode
7.9 years ago
MAPK ★ 2.0k

Hi guys,

I have a couple thousands sequences that I translated using getorf from EMBOSS. I now need to select the the longest sequence out of many sequences. How can we do this? For example I have translated these two sets of contigs:

Contig: 01003765.1 Contig562 mRNA sequence

and

Contig: 5553765.1_1 Contig562 mRNA sequence
>01003765.1_1 [227 - 259] Contig562 mRNA sequence
ATGTCCTCCAGAAATGTTCGACGTTATTTTGGA
>01003765.1_2 [259 - 291] Contig562 mRNA sequence
ATGAGAATCAACTGGACGATGCCTGTGAACATA
>01003765.1_3 [427 - 459] Contig562 mRNA sequence
ATGTCTACGGATACAGTCACGATTCTAGAGGTA
>01003765.1_4 [201 - 476] Contig562 mRNA sequence
ATGGTAGCTGCCGACAAATTGGCTCAATGTCCTCCAGAAATGTTCGACGTTATTTTGGAT
GAGAATCAACTGGACGATGCCTGTGAACATATAGCGGAATATCTGGAGGCATATTGGAGA
GCTACTCATCCAGAAATTGTAACGAGTACAACACGACAGATCGGTAGTCCACCTCAAGCA
TCGCCTAGTGGAGACATGGGAGAAACGACGCTTCCAGCTCAGCACGATGTCTACGGATAC
AGTCACGATTCTAGAGGTATAAATTCAGGGTTCAGC
>5553765.1_1 [227 - 259] Contig562 mRNA sequence
ATGTCCTCCAGAAATGTTCGACGTTATTTTGGACAACTGGACGATGCCTGTGAACATATAGCGGAATATCTGGAGGCATATTGGAGAGCTACTCATCCAGAAATTGTAACGAGTACAACACGACAGATCGGTAGTCCACCTCAAGCATCGCCTAGTGGAGACATGGGAGAAACGACGCTTCCAGCTCAGCACGATGTCTACGGATAC</code>AGTCACGATTCTAGAGGTATAAATTCAGGGTTCAGC
>5553765.1_2 [259 - 291] Contig562 mRNA sequence
ATGAGAATCAACTGGACGATGCCTGTGAACATA
>5553765.1_3 [427 - 459] Contig562 mRNA sequence
ATGTCTACGGATACAGTCACGATTCTAGAGGTA
>5553765.1_4 [201 - 476] Contig562 mRNA sequence
ATGGTAGCTGCCGACAAATTGGCTCAATGTCCTCCAGAAATGTTCGACGTTATTTTGGAT
GAGAAT

As a result, I only want the longest sequence for each contig translated in the same file:

>01003765.1_4 [201 - 476] Contig562 mRNA sequence
ATGGTAGCTGCCGACAAATTGGCTCAATGTCCTCCAGAAATGTTCGACGTTATTTTGGAT
GAGAATCAACTGGACGATGCCTGTGAACATATAGCGGAATATCTGGAGGCATATTGGAGA
GCTACTCATCCAGAAATTGTAACGAGTACAACACGACAGATCGGTAGTCCACCTCAAGCA
TCGCCTAGTGGAGACATGGGAGAAACGACGCTTCCAGCTCAGCACGATGTCTACGGATAC
AGTCACGATTCTAGAGGTATAAATTCAGGGTTCAGC

>5553765.1_1 [227 - 259] Contig562 mRNA sequence
ATGTCCTCCAGAAATGTTCGACGTTATTTTGGACAACTGGACGATGCCTGTGAACATATAGCGGAATATCTGGAGGCATATTGGAGAGCTACTCATCCAGAAATTGTAACGAGTACAACACGACAGATCGGTAGTCCACCTCAAGCATCGCCTAGTGGAGACATGGGAGAAACGACGCTTCCAGCTCAGCACGATGTCTACGGATACAGTCACGATTCTAGAGGTATAAATTCAGGGTTCAGC

I would really appreciate your help if you could provide me the solution to this problem. Thank you.

getorf ORF • 12k views
ADD COMMENT
0
Entering edit mode

I want to ask how you extract the orfs from the sequences,and I want to ask the orf maybe imcomplete,I am trying to extract the complete and imcomplete orf from the DNA sequences,I hope I can get help from you

ADD REPLY
5
Entering edit mode
7.9 years ago

Here is a dirty python script you can try to adapt. I haven't rigorously tested it, so spot check the results and make sure it is correct. This script assumes the output sequences are in blocks, meaning all possible ORFs of a transcript are grouped together in the fasta file. You need biopython.

import sys
from Bio import SeqIO

currentCid = ''
buffer = []

for record in SeqIO.parse(open(sys.argv[1]),"fasta"):
    cid = str(record.description).split('.')[0][1:]

    if currentCid == '':
        currentCid = cid
    else:
        if cid != currentCid:
            buffer.sort(key = lambda x : len(x[1]))
            print '>' + buffer[-1][0]
            print buffer[-1][1]
            currentCid = cid
            buffer = [(str(record.description),str(record.seq))]
        else:
            buffer.append((str(record.description),str(record.seq)))

buffer.sort(key = lambda x : len(x[1]))
print '>' + buffer[-1][0]
print buffer[-1][1]

Save it and use it by:

python longestORF.py yourORF.fasta > output.fasta
ADD COMMENT
0
Entering edit mode

Thanks a lot. I will check it.

ADD REPLY
3
Entering edit mode
7.9 years ago
SES 8.5k

You can find the longest ORFs for a set of DNA sequences with HMMER2GO by using a simple command (more info on the wiki):

hmmer2go getorf -i genes.fasta -o genes_orfs.fasta

This would be the ideal use case if you have ESTs or gene sequences, and it should give you the same result as filtering the results of getorf in a separate script (though I haven't tested the other approaches). You'll want to rethink these approaches if your contigs are actually genomic contigs where you'd expect many genes per contig.

ADD COMMENT
0
Entering edit mode

To add to this, there is an option to get all ORFs now, and select the output type, length, etc. The best place to look is the wiki tutorial, which explains the usage.

ADD REPLY
2
Entering edit mode
7.9 years ago
edrezen ▴ 730

Hi,

As an exercise, I tried to do it with the GATB C++ library, and it can be done with the following code:

#include <gatb/gatb_core.hpp>

#include <map>
#include <list>
#include <set>
#include <string>

using namespace std;

/********************************************************************************/
struct Entry
{
    Entry (int index=0, int length=0) : index(index), length(length) {}
    int    index;
    int    length;
    static bool compare (const Entry& e1, const Entry& e2)  { return e1.length > e2.length; }
};

/********************************************************************************/

int main (int argc, char* argv[])
{
    try
    {
        OptionsParser parser;
        parser.push_front (new OptionOneParam (STR_URI_INPUT,  "input bank"));
        parser.push_front (new OptionOneParam (STR_URI_OUTPUT, "output bank"));

        /** We read command line options. */
        IProperties* props = parser.parse (argc, argv);

        /** We get a handle on the input bank. */
        BankFasta intputBank (props->getStr(STR_URI_INPUT));

        /** We get a handle on the output bank. */
        BankFasta outputBank (props->getStr(STR_URI_OUTPUT));

        map < string,list<Entry> > seqMap;
        set <int> indexes;

        /** We read the input bank sequences. */
        Iterator<Sequence>* it = intputBank.iterator();  LOCAL (it);
        for (it->first(); !it->isDone(); it->next())
        {
            /** Shortcut. */
            Sequence& s = it->item();

            TokenizerIterator token (s.getComment().c_str(), " .");
            token.first();  string k1 = *token;

            seqMap[k1].push_back (Entry (s.getIndex(), s.getDataSize()));
        }

        for (map < string,list<Entry> >::iterator it = seqMap.begin(); it != seqMap.end(); it++)
        {
            /** We sort the sequences by decreasing sequence lengths. */
            it->second.sort(Entry::compare);

            /** We memorize the index of the biggest sequence. */
            indexes.insert (it->second.front().index);
        }

        /** We iterate the input bank sequences again. */
        for (it->first(); !it->isDone(); it->next())
        {
            /** We keep sequences for only our set of computed indexes. */
            if (indexes.find(it->item().getIndex()) != indexes.end())  {  outputBank.insert (it->item());  }
        }

        outputBank.flush ();
    }
    catch (Exception& e)
    {
        std::cout << "EXCEPTION: " << e.getMessage() << std::endl;
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

You can get the whole material for building the binary here. Once unarchived, you have a readme file explaining how to build the binary (mkdir build; cd build; cmake ..; make)

You can then run the command like this: orf -in input.fa -out output.fa

I got the correct output bank with the tiny example you gave.

If you have issues to build it, I can give you my binary.

ADD COMMENT
0
Entering edit mode

Thanks a lot. I will check it.

ADD REPLY
1
Entering edit mode
7.9 years ago
Woa ★ 2.9k

Hi, this perl script can be used for finding out the longest ORF. You can run the script inside the folder contating the EMBOSS generated ORF files (each having many ORFs generated from different frames). This requires Bioperl and Text::Wrap module for formatting, which you can replace by some REGEX based method. NOTE that this script does NOT output ALL longest ORFs for a given DNA sequence. There can be more than one longest sequences having same lengths.

use strict;
use warnings;
use Bio::SeqIO;
use Text::Wrap;
$Text::Wrap::columns = 61;
my @orf_files=<*.orf>;
foreach my $input_orf(@orf_files){
my $in  = Bio::SeqIO->new(-file =>  $input_orf , '-format' => 'Fasta');
my $longest_seq="";
my $longest_id="";
my $longest_desc="";
while ( my $seq = $in->next_seq ){
               my $id=$seq->id();
               my $seq_str= uc ($seq->seq() );
               if(length($seq_str) > length($longest_seq) ){
                $longest_id=$id;
                $longest_desc=$seq->desc();
                $longest_seq=$seq_str;
                }


        }
print  ">",$longest_id," ",$longest_desc,"\n",wrap('', '', $longest_seq),"\n";            
}
ADD COMMENT
0
Entering edit mode

Thanks a lot, Woa.

ADD REPLY
0
Entering edit mode
7.9 years ago
ddusan1 ▴ 50

Transdecoder does what you're asking!

ADD COMMENT
0
Entering edit mode

I used Transdecoder to predict ORF, but there were still many genes had more than one transcripts like below.

 >Zhze_TR100572|c0_g1_i1|m.127365 TR100572|c0_g1_i1|g.127365  ORF TR100572|c0_g1_i1|g.127365 TR100572|c0_g1_i1|m.127365 type:5prime_partial len:354 (-) TR100572|c0_g1_i1:322-1383(-)
TMTSAILRRNSSKQGLQNLIRLTAQWSVEDEEEAARERRRREREKQLRSQAEEGLNGTVS
CSESAALAQENHYDFKPSGTSELEEDEGFSDWSQKLEQRKQRSPRQSYEEENSGVREAEV
KLEQIQLDQECLEEKMVGREEGRLCQEEEEAQEQEEGEQAEQEEKKRRRNDGGKEEETPE
KRQKAPSLASLEEEELCSDHTAVCSTKITDRTESLNRSIQKSNSIKRSQPPLPVSKIDDR
LEQYTQAIETSTKAPKPVRQPSLDLPTTSMMVASTKSLWETGEVTAQSAVKPLACKDIVA
GDIVSKRSLWEQKGNPKPESSIKSIHPSGKKYKFVATGHGQYKKVLIDDAAEQ*
>Zhze_TR100572|c1_g1_i1|m.127368 TR100572|c1_g1_i1|g.127368  ORF TR100572|c1_g1_i1|g.127368 TR100572|c1_g1_i1|m.127368 type:complete len:184 (+) TR100572|c1_g1_i1:287-838(+)
MSDEEKKRRAATARRQHLKSAMLQLAATEIEKEAAAKEVEKQNYLAEHCPPLSLPGSMQE
LQDLCKKLHAKIESVDEERYDTEVKLQKTTKELEDLSQKLFDLRGKFKRPPLRRVRMSAD
AMLRALLGSKHKVCMDLRANLKQVKKEDTEKEKDLRDVGDWRKNIEEKSGMEGRKKMFEA
GES*

I would like to get the longest transcript to represent this gene for the following orthologs prediction, but I have no idea. Would you have some suggestion or some scripts for this problem? I'll really appreciate your help. Thank you.

ADD REPLY
0
Entering edit mode

You could use the script get_longest_ORF_per_transcript.pl that is included in the TransDecoder software.

ADD REPLY
0
Entering edit mode

Can you please tell me how I can use get_longest_ORF_per_transcript.pl script, I have tried following command:

perl TransDecoder-TransDecoder-v5.5.0/util/get_longest_ORF_per_transcript.pl input_file

But got this error:

Error, cannot parse header info: BnaA03g18710D_ref  at TransDecoder-TransDecoder-v5.5.0/util/get_longest_ORF_per_transcript.pl line 31, <$filehandle> line 1.
ADD REPLY

Login before adding your answer.

Traffic: 1404 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6