Retrieve The Fasta Nucleic Sequences Of A List Of Ncbi Accession Number Of Proteins
6
4
Entering edit mode
13.7 years ago
Samad ▴ 90

Hello everyone, I tried to find the answer in the previous question. I have a list of NCBI accessions (GenPept) of protein sequences, I'd like to retrieve a corresponding nucleic sequences returned in fasta file using bioperl, someone have ideas?

dna sequence retrieval perl bioperl • 55k views
ADD COMMENT
0
Entering edit mode

Do you want the gene or the transcript?

ADD REPLY
0
Entering edit mode

I want to get the corresponding CDS. For example I have the ID of the protein ID ZP_07126586.1 (link : http://www.ncbi.nlm.nih.gov/protein/ZP_07126586.1), in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!

ADD REPLY
0
Entering edit mode

Your case is much complicated than you've stated. This is a WGS draft. You don't have separated transcripts/genes. So, you have to extract your genes/CDSs from the WGS sequence. You really should restate your question and specify your target. Do your list of accn is from the same draft?

ADD REPLY
0
Entering edit mode

Exact, I have to retrieve the genes/CDS sequences from WGS in FASTA format. My input data is a list of NCBI Reference Sequences of the corresponding proteins (ZP_05752176.1, ZP_05752888.1, ZP_06245300.1, ...)

ADD REPLY
0
Entering edit mode

Hi Samad, Have you retrieved the The Fasta Nucleic Sequences Of A List Of Ncbi Accession Number Of Proteins successfully finally? If yes, could you please share your successful experience to us? Now, I also want to get the nucleotide sequence for each protein gene accession number? I used the search, elink, efetch ---these three E - utilities functions, but I always get "protein 110 1209747831 nuccore protein_nuccore 109 nuccore protein_nuccore_cds 109 nuccore protein_nuccore_mrna 109" at the last step. Did you have the similar experience?

ADD REPLY
0
Entering edit mode

This is an old thread, so not sure Samad is stil active, but did you check this? Retrieve Mrna Seq From Ncbi Given A List Of Protein Accessions I think there are several more questions related to protein ID to RNA/DNA sequence here already. Which solution have you tried?

ADD REPLY
0
Entering edit mode

Hi Michael,

At first, thank you for your reply!

Below is the way I have tried until now, which doesn't work well.

Protein seq ---> Nucleotide seq:

Finally, I always get,

>X17019.1 B.taurus mRNA for delta subunit of ATP synthase
GGAGTGAGCTCGCCTCTGCGTCGCCTTCCGCTGTCCGCCAGCCGTCATGCTCCCCTCCGCGTTGCTCCGC
CGTCCGGGTTTGGGCCGCCTCGTTCGCCAGGTCCGCCTCTACGCCGAGGCTGCCGCTGCTCAGGCCCCTG
CCGCGGGTCCGGGACAGATGTCCTTCACCTTCGCCTCACCCACGCAGGTGTTCTTCAACAGCGCCAACGT
CCGCCAGGTGGACGTGCCCACGCAGACCGGCGCCTTTGGCATCCTTGCAGCCCATGTGCCCACTTTGCAG
GTCCTGCGGCCAGGCCTAGTTGTGGTCCACGCTGAGGATGGCACCACCTCCAAATACTTCGTGAGTAGCG
GCTCTGTCACAGTGAATGCTGACTCCTCCGTGCAGTTACTGGCTGAGGAAGCTGTCACCTTGGACATGCT
GGATCTGGGGGCAGCCAAAGCGAACCTGGAGAAGGCACAGTCGGAGCTGTTGGGGGCAGCAGACGAGGCC
ACTCGGGCTGAGATCCAAATCCGCATCGAGGCCAACGAGGCCTTGGTGAAGGCTCTGGAGTAGGCGGTGC
CTGCCTCACACTGGTCCAGCTGGGAGACCTGGCCGGAGCTGGGCAAGGGTGGCCTGGCATGAAGACCCAG
CTCCCCATGGGTGCAGGCTGACAGGGAGCGGGCCTGCGGGAAGCTGTCCTGTTAATAAGCCACTAGGGGG
CAGCACAGTGCCAGTTTCTGCCCAGAGCGCCCCTTGGGGCCCTGCCCCTTTACGCTAAATAAACCCAGGT
AAACAAGCCAGCTCTGGCAGGTTGGGTGGCTGGGGGCAAGCTGGGACACTTGCCCTGCCCGCGAGGAATG
CTGTCCCCCTGGGGGGGTCTCCTGCCATCTGCACCACTGCTCTGCCCCTGCAGTGTCTGCTGCAGAGACC
GCTGCTCCTAGCCACCGCCAGCGGCCTGCCCCCTGCCCAGCCCATATTAAAGACCTAGGATCC

So, now I want to try new ways....

ADD REPLY
0
Entering edit mode

At first, input this website. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=A3DH67 ---to get protein id : 110 145558928

Then input this site. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=nuccore&id=110 145558928 ---to get protein coding mrna id : nuccore protein_nuccore_mrna 109

Finally, input this. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=109

ADD REPLY
6
Entering edit mode
13.2 years ago
Palu ▴ 290

THIS FACILITY IS PROVIDED BY NCBI ITSELF..HOPE THIS WILL HELP

http://www.ncbi.nlm.nih.gov/sites/batchentrez

ADD COMMENT
1
Entering edit mode

No, that will not help at all. Batch Entrez is for retrieval when your identifiers match one database. The questioner wants to use identifiers for one database to retrieve from a second database (i.e. map proteins to nucleotide sequences).

ADD REPLY
2
Entering edit mode
13.7 years ago

The simplest way is to use E-Utilities, which are already ready and written in perl. You can pass the same type of complex queries one can use in the web interface of Entrez (including ranges). The problem in your case is that you can't use accn of protein to get its gene/mrna directly. Given that you have GenPept, you could use it to retrieve the DBSOURCE locus/accession of the gene and feed it into Efetch.

I think you could genpept accn to retrieve genes in NCBI.

Edit:

This solution works for finished genomes with annotated genes and transcripts. For drafts the situation is much more delicate and depends heavily on the annotation quality.

ADD COMMENT
0
Entering edit mode

Thank you Jarretinha, by the way I tried E-Utilities, I can not find a corresponding script, if you know, just tell me or give me the link to access to!

ADD REPLY
0
Entering edit mode

I would go for BioMart instead of EUtils, it's more straightforward to use and doesn't need any programming skills (be it only installing perl and running scripts).

ADD REPLY
0
Entering edit mode

By the end of the page, you'll see a section "Demonstration programs". Click there and save the script. There's a lot of useless comments inside. The actual perl part is pretty simple straightforward. http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/eutils_example.pl

ADD REPLY
0
Entering edit mode

Thanks again, I'm trying o use E-Uilities examples/scripts, can not work for my case, I want to get the corresponding CDS of the protein. For example I have the ID of the protein ID ZP_07126586.1 (link : ncbi.nlm.nih.gov/protein/ZP_07126586.1) in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!

ADD REPLY
0
Entering edit mode

I´ve tought this solution for a completely different situation.

ADD REPLY
0
Entering edit mode

OK, any way thanks a lot for you :)

ADD REPLY
1
Entering edit mode
13.7 years ago
Michael 55k

Try BioMart

ADD COMMENT
1
Entering edit mode

And search the archives here for instructions; retrieval via BioMart using accessions comes up all the time.

ADD REPLY
1
Entering edit mode
13.2 years ago
Zhidkov ▴ 600
/#!/usr/bin/perl
    use strict;
    use warnings;
    use Bio::DB::GenBank;
    use Bio::SeqIO;
    # get command-line arguments, or die with a usage statement
    my $usage = "RetriveDataGenBank.pl Accessions_infile Fasta_outfilen";
    my $infile = shift or die $usage;
    my $outfile = shift or die $usage;
    my $outfileformat = 'Fasta';
    my $gb = Bio::DB::GenBank->;new(-retrievaltype => 'tempfile' , -format => 'Fasta');
    my @Acc_Numbers;
    open (IN,$infile)||die;
    my $line;
    while ($line =[?]) {
        chomp $line;
        push (@Acc_Numbers, "'".$line."'")
    }
    my $seq_in = $gb->get_Stream_by_acc(@Acc_Numbers);
    my $seq_out = Bio::SeqIO->new('-file' => ">$outfile",'-format' => $outfileformat);
    # write each entry in the input file to the output file
    while (my $inseq = $seq_in->next_seq) {
        $seq_out->write_seq($inseq);
    }
    exit

your INPUT - list of GI's
your OUTPUT - Fasta sequences.fasta

Ilia

ADD COMMENT
0
Entering edit mode
13.7 years ago
Dror ▴ 280

Use eutils as follow: 1. first search each protein against the "gene" database to get genes ID's of each protein. - use esearch 2. the search these genes in the "nucleotide" database. - esearch again. 3. use efetch to download the nucleotide FASTA sequences. see also: http://www.ncbi.nlm.nih.gov/books/NBK1058/

ADD COMMENT
0
Entering edit mode

Thank you Dror, I'll try to apply your steps to write a corresponding script! but the problem her is, for some protein we can't find the corresponding gene, so I want just to retrieve the corresponding CDS sequence! For example I have the ID of the protein ID ZP_07126586.1 (link : ncbi.nlm.nih.gov/protein/ZP_07126586.1) in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!

ADD REPLY
0
Entering edit mode

I think that you can search against the "nucleotide" database, and get the best match of 'CDS' to the protein you want.

ADD REPLY
0
Entering edit mode
13.7 years ago
Samad • 0

Hello everyone,
I'd like to show you the code that I trying with:

#/usr/bin/perl
#use strict;

use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::SeqFeatureI;

#open my $OUTFILE, '>', "Gb_parser.fasta";
my $gb = new Bio::DB::GenBank; 
# Récupération dans $seq de l'objet Genbank contenant de nombreuses informations
# Gi, Acc, séquence, Annotations, Organisme, Espèce, Genre ...

my @Acc = qw(ABK27677); #Accession number de la séquence à traiter
my $seq1 = $gb->get_Seq_by_acc(@Acc);
my $Sequence = $seq1->seq();
my $Description = $seq1->desc();
print "Acc = @Acc\nDescription = $Description\n"; #ecrit le num d'accession et sa description associée
my $seqio_object = $seq1;

for my $feat_object ($seqio_object->get_SeqFeatures) {
    #print "\n", $feat_object->primary_tag, "\n";
    for my $tag ($feat_object->get_all_tags){
        print $tag." ";
        for my $value ($feat_object->get_tag_values($tag)){
            print  $value."\n"      
        }
    }
}

foreach $feat ( $seqio_object->get_SeqFeatures() ) {
if ($feat->primary_tag() eq "CDS"){ # ne s'intéresse qu'aux tags "CDS"
    foreach $tag ( $feat->get_all_tags() ) {    
        if ($tag eq "locus_tag"){
            print  "\n>the locus: ", join(' ',$feat->get_tag_values($tag)); #ecrit le locus
        }
        elsif ($tag eq "product"){
            print  " ", join(' ',$feat->get_tag_values($tag)); #ecrit le nom du CDS
        }
        elsif ($tag eq "coded_by"){
            print "\n\nThe corresponding gene acc: ", $feat->get_tag_values($tag);
            #print ">\n\n", join (' ', $feat-> get_tag_values($tag));     #affiche l'acc. du gene

        }
    }

     #print "\n"
    print " \n\n", $feat->start, "..", $feat->end,"\n\n";;# " on strand ",     $feat->strand, "\n";
    $out = $feat->seq; #out est un Primary::Seq
    $string = $out->seq(); #recupère la séquence
    print $string."\n\n";
}
}

There is 2 problem with my script:

  1. I can't use it with several accession numbers (IDs)
  2. The goted CDS sequences are protein sequences, in my case I'm looking for the corresponding nucleic sequences. We have just the coded_by infos that I don't know how I can use it to get the nucleic sequence
ADD COMMENT
0
Entering edit mode

For future reference: indent all lines of code with 4 spaces to format them properly in your question.I did it for you - this time.

ADD REPLY

Login before adding your answer.

Traffic: 2234 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6