Question

Retrieve The Fasta Nucleic Sequences Of A List Of Ncbi Accession Number Of Proteins

4

Entering edit mode

13.6 years ago

Samad ▴ 90

Hello everyone, I tried to find the answer in the previous question. I have a list of NCBI accessions (GenPept) of protein sequences, I'd like to retrieve a corresponding nucleic sequences returned in fasta file using bioperl, someone have ideas?

dna sequence retrieval perl bioperl • 55k views

ADD COMMENT • link updated 13.6 years ago by Palu ▴ 290 • written 13.6 years ago by Samad ▴ 90

0

Entering edit mode

Do you want the gene or the transcript?

ADD REPLY • link 13.6 years ago by Jarretinha 3.4k

0

Entering edit mode

I want to get the corresponding CDS. For example I have the ID of the protein ID ZP_07126586.1 (link : http://www.ncbi.nlm.nih.gov/protein/ZP_07126586.1), in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!

ADD REPLY • link 13.6 years ago by Samad ▴ 90

0

Entering edit mode

Your case is much complicated than you've stated. This is a WGS draft. You don't have separated transcripts/genes. So, you have to extract your genes/CDSs from the WGS sequence. You really should restate your question and specify your target. Do your list of accn is from the same draft?

ADD REPLY • link 13.6 years ago by Jarretinha 3.4k

0

Entering edit mode

Exact, I have to retrieve the genes/CDS sequences from WGS in FASTA format. My input data is a list of NCBI Reference Sequences of the corresponding proteins (ZP_05752176.1, ZP_05752888.1, ZP_06245300.1, ...)

ADD REPLY • link 13.6 years ago by Samad • 0

0

Entering edit mode

Hi Samad, Have you retrieved the The Fasta Nucleic Sequences Of A List Of Ncbi Accession Number Of Proteins successfully finally? If yes, could you please share your successful experience to us? Now, I also want to get the nucleotide sequence for each protein gene accession number? I used the search, elink, efetch ---these three E - utilities functions, but I always get "protein 110 1209747831 nuccore protein_nuccore 109 nuccore protein_nuccore_cds 109 nuccore protein_nuccore_mrna 109" at the last step. Did you have the similar experience?

ADD REPLY • link 6.0 years ago by bison100 • 0

0

Entering edit mode

This is an old thread, so not sure Samad is stil active, but did you check this? Retrieve Mrna Seq From Ncbi Given A List Of Protein Accessions I think there are several more questions related to protein ID to RNA/DNA sequence here already. Which solution have you tried?

ADD REPLY • link 6.0 years ago by Michael 55k

0

Entering edit mode

Hi Michael,

At first, thank you for your reply!

Below is the way I have tried until now, which doesn't work well.

Protein seq ---> Nucleotide seq:

Finally, I always get,

>X17019.1 B.taurus mRNA for delta subunit of ATP synthase
GGAGTGAGCTCGCCTCTGCGTCGCCTTCCGCTGTCCGCCAGCCGTCATGCTCCCCTCCGCGTTGCTCCGC
CGTCCGGGTTTGGGCCGCCTCGTTCGCCAGGTCCGCCTCTACGCCGAGGCTGCCGCTGCTCAGGCCCCTG
CCGCGGGTCCGGGACAGATGTCCTTCACCTTCGCCTCACCCACGCAGGTGTTCTTCAACAGCGCCAACGT
CCGCCAGGTGGACGTGCCCACGCAGACCGGCGCCTTTGGCATCCTTGCAGCCCATGTGCCCACTTTGCAG
GTCCTGCGGCCAGGCCTAGTTGTGGTCCACGCTGAGGATGGCACCACCTCCAAATACTTCGTGAGTAGCG
GCTCTGTCACAGTGAATGCTGACTCCTCCGTGCAGTTACTGGCTGAGGAAGCTGTCACCTTGGACATGCT
GGATCTGGGGGCAGCCAAAGCGAACCTGGAGAAGGCACAGTCGGAGCTGTTGGGGGCAGCAGACGAGGCC
ACTCGGGCTGAGATCCAAATCCGCATCGAGGCCAACGAGGCCTTGGTGAAGGCTCTGGAGTAGGCGGTGC
CTGCCTCACACTGGTCCAGCTGGGAGACCTGGCCGGAGCTGGGCAAGGGTGGCCTGGCATGAAGACCCAG
CTCCCCATGGGTGCAGGCTGACAGGGAGCGGGCCTGCGGGAAGCTGTCCTGTTAATAAGCCACTAGGGGG
CAGCACAGTGCCAGTTTCTGCCCAGAGCGCCCCTTGGGGCCCTGCCCCTTTACGCTAAATAAACCCAGGT
AAACAAGCCAGCTCTGGCAGGTTGGGTGGCTGGGGGCAAGCTGGGACACTTGCCCTGCCCGCGAGGAATG
CTGTCCCCCTGGGGGGGTCTCCTGCCATCTGCACCACTGCTCTGCCCCTGCAGTGTCTGCTGCAGAGACC
GCTGCTCCTAGCCACCGCCAGCGGCCTGCCCCCTGCCCAGCCCATATTAAAGACCTAGGATCC

So, now I want to try new ways....

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 6.0 years ago by bison100 • 0

0

Entering edit mode

At first, input this website. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=A3DH67 ---to get protein id : 110 145558928

Then input this site. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=nuccore&id=110 145558928 ---to get protein coding mrna id : nuccore protein_nuccore_mrna 109

Finally, input this. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=109

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 6.0 years ago by bison100 • 0

score 6 · Answer 1 · 2011-08-17

6

Entering edit mode

13.0 years ago

Palu ▴ 290

THIS FACILITY IS PROVIDED BY NCBI ITSELF..HOPE THIS WILL HELP

http://www.ncbi.nlm.nih.gov/sites/batchentrez

ADD COMMENT • link 13.0 years ago by Palu ▴ 290

1

Entering edit mode

No, that will not help at all. Batch Entrez is for retrieval when your identifiers match one database. The questioner wants to use identifiers for one database to retrieve from a second database (i.e. map proteins to nucleotide sequences).

ADD REPLY • link 13.0 years ago by Neilfws 49k

score 2 · Answer 2 · 2011-02-08

2

Entering edit mode

13.6 years ago

Jarretinha 3.4k

The simplest way is to use E-Utilities, which are already ready and written in perl. You can pass the same type of complex queries one can use in the web interface of Entrez (including ranges). The problem in your case is that you can't use accn of protein to get its gene/mrna directly. Given that you have GenPept, you could use it to retrieve the DBSOURCE locus/accession of the gene and feed it into Efetch.

I think you could genpept accn to retrieve genes in NCBI.

Edit:

This solution works for finished genomes with annotated genes and transcripts. For drafts the situation is much more delicate and depends heavily on the annotation quality.

ADD COMMENT • link 13.6 years ago by Jarretinha 3.4k

0

Entering edit mode

Thank you Jarretinha, by the way I tried E-Utilities, I can not find a corresponding script, if you know, just tell me or give me the link to access to!

ADD REPLY • link 13.6 years ago by Samad ▴ 90

0

Entering edit mode

I would go for BioMart instead of EUtils, it's more straightforward to use and doesn't need any programming skills (be it only installing perl and running scripts).

ADD REPLY • link 13.6 years ago by Michael Schubert ★ 7.1k

0

Entering edit mode

By the end of the page, you'll see a section "Demonstration programs". Click there and save the script. There's a lot of useless comments inside. The actual perl part is pretty simple straightforward. http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/eutils_example.pl

ADD REPLY • link 13.6 years ago by Jarretinha 3.4k

0

Entering edit mode

Thanks again, I'm trying o use E-Uilities examples/scripts, can not work for my case, I want to get the corresponding CDS of the protein. For example I have the ID of the protein ID ZP_07126586.1 (link : ncbi.nlm.nih.gov/protein/ZP_07126586.1) in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!

ADD REPLY • link 13.6 years ago by Samad ▴ 90

0

Entering edit mode

I´ve tought this solution for a completely different situation.

ADD REPLY • link 13.6 years ago by Jarretinha 3.4k

0

Entering edit mode

OK, any way thanks a lot for you :)

ADD REPLY • link 13.6 years ago by Samad • 0

score 1 · Answer 3 · 2011-02-08

1

Entering edit mode

13.6 years ago

Michael 55k

Try BioMart

ADD COMMENT • link 13.6 years ago by Michael 55k

1

Entering edit mode

And search the archives here for instructions; retrieval via BioMart using accessions comes up all the time.

ADD REPLY • link 13.6 years ago by Neilfws 49k

Ram · Answer 4 · 2011-08-16

/#!/usr/bin/perl
    use strict;
    use warnings;
    use Bio::DB::GenBank;
    use Bio::SeqIO;
    # get command-line arguments, or die with a usage statement
    my $usage = "RetriveDataGenBank.pl Accessions_infile Fasta_outfilen";
    my $infile = shift or die $usage;
    my $outfile = shift or die $usage;
    my $outfileformat = 'Fasta';
    my $gb = Bio::DB::GenBank->;new(-retrievaltype => 'tempfile' , -format => 'Fasta');
    my @Acc_Numbers;
    open (IN,$infile)||die;
    my $line;
    while ($line =[?]) {
        chomp $line;
        push (@Acc_Numbers, "'".$line."'")
    }
    my $seq_in = $gb->get_Stream_by_acc(@Acc_Numbers);
    my $seq_out = Bio::SeqIO->new('-file' => ">$outfile",'-format' => $outfileformat);
    # write each entry in the input file to the output file
    while (my $inseq = $seq_in->next_seq) {
        $seq_out->write_seq($inseq);
    }
    exit

your INPUT - list of GI's
your OUTPUT - Fasta sequences.fasta

Ilia

score 0 · Answer 5 · 2011-02-08

0

Entering edit mode

13.6 years ago

Dror ▴ 280

Use eutils as follow: 1. first search each protein against the "gene" database to get genes ID's of each protein. - use esearch 2. the search these genes in the "nucleotide" database. - esearch again. 3. use efetch to download the nucleotide FASTA sequences. see also: http://www.ncbi.nlm.nih.gov/books/NBK1058/

ADD COMMENT • link 13.6 years ago by Dror ▴ 280

0

Entering edit mode

Thank you Dror, I'll try to apply your steps to write a corresponding script! but the problem her is, for some protein we can't find the corresponding gene, so I want just to retrieve the corresponding CDS sequence! For example I have the ID of the protein ID ZP_07126586.1 (link : ncbi.nlm.nih.gov/protein/ZP_07126586.1) in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!

ADD REPLY • link 13.6 years ago by Samad ▴ 90

0

Entering edit mode

I think that you can search against the "nucleotide" database, and get the best match of 'CDS' to the protein you want.

ADD REPLY • link 13.5 years ago by Dror ▴ 280

Ram · Answer 6 · 2011-02-14

Hello everyone,
I'd like to show you the code that I trying with:

#/usr/bin/perl
#use strict;

use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::SeqFeatureI;

#open my $OUTFILE, '>', "Gb_parser.fasta";
my $gb = new Bio::DB::GenBank; 
# Récupération dans $seq de l'objet Genbank contenant de nombreuses informations
# Gi, Acc, séquence, Annotations, Organisme, Espèce, Genre ...

my @Acc = qw(ABK27677); #Accession number de la séquence à traiter
my $seq1 = $gb->get_Seq_by_acc(@Acc);
my $Sequence = $seq1->seq();
my $Description = $seq1->desc();
print "Acc = @Acc\nDescription = $Description\n"; #ecrit le num d'accession et sa description associée
my $seqio_object = $seq1;

for my $feat_object ($seqio_object->get_SeqFeatures) {
    #print "\n", $feat_object->primary_tag, "\n";
    for my $tag ($feat_object->get_all_tags){
        print $tag." ";
        for my $value ($feat_object->get_tag_values($tag)){
            print  $value."\n"      
        }
    }
}

foreach $feat ( $seqio_object->get_SeqFeatures() ) {
if ($feat->primary_tag() eq "CDS"){ # ne s'intéresse qu'aux tags "CDS"
    foreach $tag ( $feat->get_all_tags() ) {    
        if ($tag eq "locus_tag"){
            print  "\n>the locus: ", join(' ',$feat->get_tag_values($tag)); #ecrit le locus
        }
        elsif ($tag eq "product"){
            print  " ", join(' ',$feat->get_tag_values($tag)); #ecrit le nom du CDS
        }
        elsif ($tag eq "coded_by"){
            print "\n\nThe corresponding gene acc: ", $feat->get_tag_values($tag);
            #print ">\n\n", join (' ', $feat-> get_tag_values($tag));     #affiche l'acc. du gene

        }
    }

     #print "\n"
    print " \n\n", $feat->start, "..", $feat->end,"\n\n";;# " on strand ",     $feat->strand, "\n";
    $out = $feat->seq; #out est un Primary::Seq
    $string = $out->seq(); #recupère la séquence
    print $string."\n\n";
}
}

There is 2 problem with my script:

I can't use it with several accession numbers (IDs)
The goted CDS sequences are protein sequences, in my case I'm looking for the corresponding nucleic sequences. We have just the coded_by infos that I don't know how I can use it to get the nucleic sequence