Question: Retrieve The Fasta Nucleic Sequences Of A List Of Ncbi Accession Number Of Proteins
4
gravatar for Samad
8.9 years ago by
Samad90
France
Samad90 wrote:

Hello everyone, I tried to find the answer in the previous question. I have a list of NCBI accessions (GenPept) of protein sequences, I'd like to retrieve a corresponding nucleic sequences returned in fasta file using bioperl, someone have ideas?

ADD COMMENTlink modified 8.9 years ago by Palu260 • written 8.9 years ago by Samad90

Do you want the gene or the transcript?

ADD REPLYlink written 8.9 years ago by Jarretinha3.3k

I want to get the corresponding CDS. For example I have the ID of the protein ID ZP_07126586.1 (link : http://www.ncbi.nlm.nih.gov/protein/ZP_07126586.1), in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!

ADD REPLYlink written 8.8 years ago by Samad90

Your case is much complicated than you've stated. This is a WGS draft. You don't have separated transcripts/genes. So, you have to extract your genes/CDSs from the WGS sequence. You really should restate your question and specify your target. Do your list of accn is from the same draft?

ADD REPLYlink written 8.8 years ago by Jarretinha3.3k

Exact, I have to retrieve the genes/CDS sequences from WGS in FASTA format. My input data is a list of NCBI Reference Sequences of the corresponding proteins (ZP_05752176.1, ZP_05752888.1, ZP_06245300.1, ...)

ADD REPLYlink written 8.8 years ago by Samad0

Hi Samad, Have you retrieved the The Fasta Nucleic Sequences Of A List Of Ncbi Accession Number Of Proteins successfully finally? If yes, could you please share your successful experience to us? Now, I also want to get the nucleotide sequence for each protein gene accession number? I used the search, elink, efetch ---these three E - utilities functions, but I always get "protein 110 1209747831 nuccore protein_nuccore 109 nuccore protein_nuccore_cds 109 nuccore protein_nuccore_mrna 109" at the last step. Did you have the similar experience?

ADD REPLYlink written 15 months ago by bison1000

This is an old thread, so not sure Samad is stil active, but did you check this? Retrieve Mrna Seq From Ncbi Given A List Of Protein Accessions I think there are several more questions related to protein ID to RNA/DNA sequence here already. Which solution have you tried?

ADD REPLYlink modified 15 months ago • written 15 months ago by Michael Dondrup47k

Hi Michael,

At first, thank you for your reply!

Below is the way I have tried until now, which doesn't work well.

Protein seq ---> Nucleotide seq:

Finally, I always get,

>X17019.1 B.taurus mRNA for delta subunit of ATP synthase
GGAGTGAGCTCGCCTCTGCGTCGCCTTCCGCTGTCCGCCAGCCGTCATGCTCCCCTCCGCGTTGCTCCGC
CGTCCGGGTTTGGGCCGCCTCGTTCGCCAGGTCCGCCTCTACGCCGAGGCTGCCGCTGCTCAGGCCCCTG
CCGCGGGTCCGGGACAGATGTCCTTCACCTTCGCCTCACCCACGCAGGTGTTCTTCAACAGCGCCAACGT
CCGCCAGGTGGACGTGCCCACGCAGACCGGCGCCTTTGGCATCCTTGCAGCCCATGTGCCCACTTTGCAG
GTCCTGCGGCCAGGCCTAGTTGTGGTCCACGCTGAGGATGGCACCACCTCCAAATACTTCGTGAGTAGCG
GCTCTGTCACAGTGAATGCTGACTCCTCCGTGCAGTTACTGGCTGAGGAAGCTGTCACCTTGGACATGCT
GGATCTGGGGGCAGCCAAAGCGAACCTGGAGAAGGCACAGTCGGAGCTGTTGGGGGCAGCAGACGAGGCC
ACTCGGGCTGAGATCCAAATCCGCATCGAGGCCAACGAGGCCTTGGTGAAGGCTCTGGAGTAGGCGGTGC
CTGCCTCACACTGGTCCAGCTGGGAGACCTGGCCGGAGCTGGGCAAGGGTGGCCTGGCATGAAGACCCAG
CTCCCCATGGGTGCAGGCTGACAGGGAGCGGGCCTGCGGGAAGCTGTCCTGTTAATAAGCCACTAGGGGG
CAGCACAGTGCCAGTTTCTGCCCAGAGCGCCCCTTGGGGCCCTGCCCCTTTACGCTAAATAAACCCAGGT
AAACAAGCCAGCTCTGGCAGGTTGGGTGGCTGGGGGCAAGCTGGGACACTTGCCCTGCCCGCGAGGAATG
CTGTCCCCCTGGGGGGGTCTCCTGCCATCTGCACCACTGCTCTGCCCCTGCAGTGTCTGCTGCAGAGACC
GCTGCTCCTAGCCACCGCCAGCGGCCTGCCCCCTGCCCAGCCCATATTAAAGACCTAGGATCC

So, now I want to try new ways....

ADD REPLYlink modified 3 months ago by RamRS25k • written 15 months ago by bison1000

At first, input this website. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=A3DH67 ---to get protein id : 110 145558928

Then input this site. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=nuccore&id=110 145558928 ---to get protein coding mrna id : nuccore protein_nuccore_mrna 109

Finally, input this. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=109

ADD REPLYlink modified 3 months ago by RamRS25k • written 15 months ago by bison1000
6
gravatar for Palu
8.3 years ago by
Palu260
Palu260 wrote:

THIS FACILITY IS PROVIDED BY NCBI ITSELF..HOPE THIS WILL HELP

http://www.ncbi.nlm.nih.gov/sites/batchentrez

ADD COMMENTlink written 8.3 years ago by Palu260
1

No, that will not help at all. Batch Entrez is for retrieval when your identifiers match one database. The questioner wants to use identifiers for one database to retrieve from a second database (i.e. map proteins to nucleotide sequences).

ADD REPLYlink written 8.3 years ago by Neilfws48k
2
gravatar for Jarretinha
8.9 years ago by
Jarretinha3.3k
São Paulo, Brazil
Jarretinha3.3k wrote:

The simplest way is to use E-Utilities, which are already ready and written in perl. You can pass the same type of complex queries one can use in the web interface of Entrez (including ranges). The problem in your case is that you can't use accn of protein to get its gene/mrna directly. Given that you have GenPept, you could use it to retrieve the DBSOURCE locus/accession of the gene and feed it into Efetch.

I think you could genpept accn to retrieve genes in NCBI.

Edit:

This solution works for finished genomes with annotated genes and transcripts. For drafts the situation is much more delicate and depends heavily on the annotation quality.

ADD COMMENTlink modified 8.8 years ago • written 8.9 years ago by Jarretinha3.3k

Thank you Jarretinha, by the way I tried E-Utilities, I can not find a corresponding script, if you know, just tell me or give me the link to access to!

ADD REPLYlink written 8.8 years ago by Samad90

I would go for BioMart instead of EUtils, it's more straightforward to use and doesn't need any programming skills (be it only installing perl and running scripts).

ADD REPLYlink written 8.8 years ago by Michael Schubert6.9k

By the end of the page, you'll see a section "Demonstration programs". Click there and save the script. There's a lot of useless comments inside. The actual perl part is pretty simple straightforward. http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/eutils_example.pl

ADD REPLYlink written 8.8 years ago by Jarretinha3.3k

Thanks again, I'm trying o use E-Uilities examples/scripts, can not work for my case, I want to get the corresponding CDS of the protein. For example I have the ID of the protein ID ZP_07126586.1 (link : ncbi.nlm.nih.gov/protein/ZP_07126586.1) in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!

ADD REPLYlink written 8.8 years ago by Samad90

I´ve tought this solution for a completely different situation.

ADD REPLYlink written 8.8 years ago by Jarretinha3.3k

OK, any way thanks a lot for you :)

ADD REPLYlink written 8.8 years ago by Samad0
1
gravatar for Michael Dondrup
8.9 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

Try BioMart

ADD COMMENTlink written 8.9 years ago by Michael Dondrup47k
1

And search the archives here for instructions; retrieval via BioMart using accessions comes up all the time.

ADD REPLYlink written 8.9 years ago by Neilfws48k
1
gravatar for Zhidkov
8.3 years ago by
Zhidkov570
Israel
Zhidkov570 wrote:
/#!/usr/bin/perl
    use strict;
    use warnings;
    use Bio::DB::GenBank;
    use Bio::SeqIO;
    # get command-line arguments, or die with a usage statement
    my $usage = "RetriveDataGenBank.pl Accessions_infile Fasta_outfilen";
    my $infile = shift or die $usage;
    my $outfile = shift or die $usage;
    my $outfileformat = 'Fasta';
    my $gb = Bio::DB::GenBank->;new(-retrievaltype => 'tempfile' , -format => 'Fasta');
    my @Acc_Numbers;
    open (IN,$infile)||die;
    my $line;
    while ($line =[?]) {
        chomp $line;
        push (@Acc_Numbers, "'".$line."'")
    }
    my $seq_in = $gb->get_Stream_by_acc(@Acc_Numbers);
    my $seq_out = Bio::SeqIO->new('-file' => ">$outfile",'-format' => $outfileformat);
    # write each entry in the input file to the output file
    while (my $inseq = $seq_in->next_seq) {
        $seq_out->write_seq($inseq);
    }
    exit

your INPUT - list of GI's
your OUTPUT - Fasta sequences.fasta

Ilia

ADD COMMENTlink modified 3 months ago by RamRS25k • written 8.3 years ago by Zhidkov570
0
gravatar for Dror
8.8 years ago by
Dror280
Israel
Dror280 wrote:

Use eutils as follow: 1. first search each protein against the "gene" database to get genes ID's of each protein. - use esearch 2. the search these genes in the "nucleotide" database. - esearch again. 3. use efetch to download the nucleotide FASTA sequences. see also: http://www.ncbi.nlm.nih.gov/books/NBK1058/

ADD COMMENTlink written 8.8 years ago by Dror280

Thank you Dror, I'll try to apply your steps to write a corresponding script! but the problem her is, for some protein we can't find the corresponding gene, so I want just to retrieve the corresponding CDS sequence! For example I have the ID of the protein ID ZP_07126586.1 (link : ncbi.nlm.nih.gov/protein/ZP_07126586.1) in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!

ADD REPLYlink written 8.8 years ago by Samad90

I think that you can search against the "nucleotide" database, and get the best match of 'CDS' to the protein you want.

ADD REPLYlink written 8.8 years ago by Dror280
0
gravatar for Samad
8.8 years ago by
Samad0
Samad0 wrote:

Hello everyone,
I'd like to show you the code that I trying with:

#/usr/bin/perl
#use strict;

use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::SeqFeatureI;

#open my $OUTFILE, '>', "Gb_parser.fasta";
my $gb = new Bio::DB::GenBank; 
# Récupération dans $seq de l'objet Genbank contenant de nombreuses informations
# Gi, Acc, séquence, Annotations, Organisme, Espèce, Genre ...

my @Acc = qw(ABK27677); #Accession number de la séquence à traiter
my $seq1 = $gb->get_Seq_by_acc(@Acc);
my $Sequence = $seq1->seq();
my $Description = $seq1->desc();
print "Acc = @Acc\nDescription = $Description\n"; #ecrit le num d'accession et sa description associée
my $seqio_object = $seq1;

for my $feat_object ($seqio_object->get_SeqFeatures) {
    #print "\n", $feat_object->primary_tag, "\n";
    for my $tag ($feat_object->get_all_tags){
        print $tag." ";
        for my $value ($feat_object->get_tag_values($tag)){
            print  $value."\n"      
        }
    }
}

foreach $feat ( $seqio_object->get_SeqFeatures() ) {
if ($feat->primary_tag() eq "CDS"){ # ne s'intéresse qu'aux tags "CDS"
    foreach $tag ( $feat->get_all_tags() ) {    
        if ($tag eq "locus_tag"){
            print  "\n>the locus: ", join(' ',$feat->get_tag_values($tag)); #ecrit le locus
        }
        elsif ($tag eq "product"){
            print  " ", join(' ',$feat->get_tag_values($tag)); #ecrit le nom du CDS
        }
        elsif ($tag eq "coded_by"){
            print "\n\nThe corresponding gene acc: ", $feat->get_tag_values($tag);
            #print ">\n\n", join (' ', $feat-> get_tag_values($tag));     #affiche l'acc. du gene

        }
    }

     #print "\n"
    print " \n\n", $feat->start, "..", $feat->end,"\n\n";;# " on strand ",     $feat->strand, "\n";
    $out = $feat->seq; #out est un Primary::Seq
    $string = $out->seq(); #recupère la séquence
    print $string."\n\n";
}
}

There is 2 problem with my script:

  1. I can't use it with several accession numbers (IDs)
  2. The goted CDS sequences are protein sequences, in my case I'm looking for the corresponding nucleic sequences. We have just the coded_by infos that I don't know how I can use it to get the nucleic sequence
ADD COMMENTlink modified 3 months ago by RamRS25k • written 8.8 years ago by Samad0

For future reference: indent all lines of code with 4 spaces to format them properly in your question.I did it for you - this time.

ADD REPLYlink modified 3 months ago by RamRS25k • written 8.3 years ago by Neilfws48k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1006 users visited in the last hour