Retrieve The Fasta Nucleic Sequences Of A List Of Ncbi Accession Number Of Proteins
6
4
Entering edit mode
10.8 years ago

Hello everyone, I tried to find the answer in the previous question. I have a list of NCBI accessions (GenPept) of protein sequences, I'd like to retrieve a corresponding nucleic sequences returned in fasta file using bioperl, someone have ideas?

dna sequence retrieval perl bioperl • 50k views
0
Entering edit mode

Do you want the gene or the transcript?

0
Entering edit mode

I want to get the corresponding CDS. For example I have the ID of the protein ID ZP_07126586.1 (link : http://www.ncbi.nlm.nih.gov/protein/ZP_07126586.1), in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!

0
Entering edit mode

Your case is much complicated than you've stated. This is a WGS draft. You don't have separated transcripts/genes. So, you have to extract your genes/CDSs from the WGS sequence. You really should restate your question and specify your target. Do your list of accn is from the same draft?

0
Entering edit mode

Exact, I have to retrieve the genes/CDS sequences from WGS in FASTA format. My input data is a list of NCBI Reference Sequences of the corresponding proteins (ZP_05752176.1, ZP_05752888.1, ZP_06245300.1, ...)

0
Entering edit mode

Hi Samad, Have you retrieved the The Fasta Nucleic Sequences Of A List Of Ncbi Accession Number Of Proteins successfully finally? If yes, could you please share your successful experience to us? Now, I also want to get the nucleotide sequence for each protein gene accession number? I used the search, elink, efetch ---these three E - utilities functions, but I always get "protein 110 1209747831 nuccore protein_nuccore 109 nuccore protein_nuccore_cds 109 nuccore protein_nuccore_mrna 109" at the last step. Did you have the similar experience?

0
Entering edit mode

This is an old thread, so not sure Samad is stil active, but did you check this? Retrieve Mrna Seq From Ncbi Given A List Of Protein Accessions I think there are several more questions related to protein ID to RNA/DNA sequence here already. Which solution have you tried?

0
Entering edit mode

Hi Michael,

At first, thank you for your reply!

Below is the way I have tried until now, which doesn't work well.

Protein seq ---> Nucleotide seq:

Finally, I always get,

>X17019.1 B.taurus mRNA for delta subunit of ATP synthase
GGAGTGAGCTCGCCTCTGCGTCGCCTTCCGCTGTCCGCCAGCCGTCATGCTCCCCTCCGCGTTGCTCCGC
CGTCCGGGTTTGGGCCGCCTCGTTCGCCAGGTCCGCCTCTACGCCGAGGCTGCCGCTGCTCAGGCCCCTG
CCGCGGGTCCGGGACAGATGTCCTTCACCTTCGCCTCACCCACGCAGGTGTTCTTCAACAGCGCCAACGT
CCGCCAGGTGGACGTGCCCACGCAGACCGGCGCCTTTGGCATCCTTGCAGCCCATGTGCCCACTTTGCAG
GTCCTGCGGCCAGGCCTAGTTGTGGTCCACGCTGAGGATGGCACCACCTCCAAATACTTCGTGAGTAGCG
GCTCTGTCACAGTGAATGCTGACTCCTCCGTGCAGTTACTGGCTGAGGAAGCTGTCACCTTGGACATGCT
GGATCTGGGGGCAGCCAAAGCGAACCTGGAGAAGGCACAGTCGGAGCTGTTGGGGGCAGCAGACGAGGCC
ACTCGGGCTGAGATCCAAATCCGCATCGAGGCCAACGAGGCCTTGGTGAAGGCTCTGGAGTAGGCGGTGC
CTGCCTCACACTGGTCCAGCTGGGAGACCTGGCCGGAGCTGGGCAAGGGTGGCCTGGCATGAAGACCCAG
CTCCCCATGGGTGCAGGCTGACAGGGAGCGGGCCTGCGGGAAGCTGTCCTGTTAATAAGCCACTAGGGGG
CAGCACAGTGCCAGTTTCTGCCCAGAGCGCCCCTTGGGGCCCTGCCCCTTTACGCTAAATAAACCCAGGT
AAACAAGCCAGCTCTGGCAGGTTGGGTGGCTGGGGGCAAGCTGGGACACTTGCCCTGCCCGCGAGGAATG
CTGTCCCCCTGGGGGGGTCTCCTGCCATCTGCACCACTGCTCTGCCCCTGCAGTGTCTGCTGCAGAGACC
GCTGCTCCTAGCCACCGCCAGCGGCCTGCCCCCTGCCCAGCCCATATTAAAGACCTAGGATCC


So, now I want to try new ways....

0
Entering edit mode

At first, input this website. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=A3DH67 ---to get protein id : 110 145558928

Then input this site. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=nuccore&id=110 145558928 ---to get protein coding mrna id : nuccore protein_nuccore_mrna 109

6
Entering edit mode
10.3 years ago
Palu ▴ 290

THIS FACILITY IS PROVIDED BY NCBI ITSELF..HOPE THIS WILL HELP

http://www.ncbi.nlm.nih.gov/sites/batchentrez

1
Entering edit mode

No, that will not help at all. Batch Entrez is for retrieval when your identifiers match one database. The questioner wants to use identifiers for one database to retrieve from a second database (i.e. map proteins to nucleotide sequences).

2
Entering edit mode
10.8 years ago

The simplest way is to use E-Utilities, which are already ready and written in perl. You can pass the same type of complex queries one can use in the web interface of Entrez (including ranges). The problem in your case is that you can't use accn of protein to get its gene/mrna directly. Given that you have GenPept, you could use it to retrieve the DBSOURCE locus/accession of the gene and feed it into Efetch.

I think you could genpept accn to retrieve genes in NCBI.

Edit:

This solution works for finished genomes with annotated genes and transcripts. For drafts the situation is much more delicate and depends heavily on the annotation quality.

0
Entering edit mode

Thank you Jarretinha, by the way I tried E-Utilities, I can not find a corresponding script, if you know, just tell me or give me the link to access to!

0
Entering edit mode

I would go for BioMart instead of EUtils, it's more straightforward to use and doesn't need any programming skills (be it only installing perl and running scripts).

0
Entering edit mode

By the end of the page, you'll see a section "Demonstration programs". Click there and save the script. There's a lot of useless comments inside. The actual perl part is pretty simple straightforward. http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/eutils_example.pl

0
Entering edit mode

Thanks again, I'm trying o use E-Uilities examples/scripts, can not work for my case, I want to get the corresponding CDS of the protein. For example I have the ID of the protein ID ZP_07126586.1 (link : ncbi.nlm.nih.gov/protein/ZP_07126586.1) in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format!

0
Entering edit mode

I´ve tought this solution for a completely different situation.

0
Entering edit mode

OK, any way thanks a lot for you :)

1
Entering edit mode
10.8 years ago

Try BioMart

1
Entering edit mode

And search the archives here for instructions; retrieval via BioMart using accessions comes up all the time.

1
Entering edit mode
10.3 years ago
Zhidkov ▴ 580
/#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;
# get command-line arguments, or die with a usage statement
my $usage = "RetriveDataGenBank.pl Accessions_infile Fasta_outfilen"; my$infile = shift or die $usage; my$outfile = shift or die $usage; my$outfileformat = 'Fasta';
my $gb = Bio::DB::GenBank->;new(-retrievaltype => 'tempfile' , -format => 'Fasta'); my @Acc_Numbers; open (IN,$infile)||die;
my $line; while ($line =[?]) {
chomp $line; push (@Acc_Numbers, "'".$line."'")
}
my $seq_in =$gb->get_Stream_by_acc(@Acc_Numbers);
my $seq_out = Bio::SeqIO->new('-file' => ">$outfile",'-format' => $outfileformat); # write each entry in the input file to the output file while (my$inseq = $seq_in->next_seq) {$seq_out->write_seq($inseq); } exit  your INPUT - list of GI's your OUTPUT - Fasta sequences.fasta Ilia ADD COMMENT 0 Entering edit mode 10.8 years ago Dror ▴ 280 Use eutils as follow: 1. first search each protein against the "gene" database to get genes ID's of each protein. - use esearch 2. the search these genes in the "nucleotide" database. - esearch again. 3. use efetch to download the nucleotide FASTA sequences. see also: http://www.ncbi.nlm.nih.gov/books/NBK1058/ ADD COMMENT 0 Entering edit mode Thank you Dror, I'll try to apply your steps to write a corresponding script! but the problem her is, for some protein we can't find the corresponding gene, so I want just to retrieve the corresponding CDS sequence! For example I have the ID of the protein ID ZP_07126586.1 (link : ncbi.nlm.nih.gov/protein/ZP_07126586.1) in this page we have the info about CDS (start 1, end 628 in the genome sequence), so my aim is to retrieve this nucleic sequence in FASTA format! ADD REPLY 0 Entering edit mode I think that you can search against the "nucleotide" database, and get the best match of 'CDS' to the protein you want. ADD REPLY 0 Entering edit mode 10.8 years ago Samad • 0 Hello everyone, I'd like to show you the code that I trying with: #/usr/bin/perl #use strict; use warnings; use Bio::DB::GenBank; use Bio::SeqIO; use Bio::SeqFeatureI; #open my$OUTFILE, '>', "Gb_parser.fasta";
my $gb = new Bio::DB::GenBank; # Récupération dans$seq de l'objet Genbank contenant de nombreuses informations
# Gi, Acc, séquence, Annotations, Organisme, Espèce, Genre ...

my @Acc = qw(ABK27677); #Accession number de la séquence à traiter
my $seq1 =$gb->get_Seq_by_acc(@Acc);
my $Sequence =$seq1->seq();
my $Description =$seq1->desc();
print "Acc = @Acc\nDescription = $Description\n"; #ecrit le num d'accession et sa description associée my$seqio_object = $seq1; for my$feat_object ($seqio_object->get_SeqFeatures) { #print "\n",$feat_object->primary_tag, "\n";
for my $tag ($feat_object->get_all_tags){
print $tag." "; for my$value ($feat_object->get_tag_values($tag)){
print  $value."\n" } } } foreach$feat ( $seqio_object->get_SeqFeatures() ) { if ($feat->primary_tag() eq "CDS"){ # ne s'intéresse qu'aux tags "CDS"
foreach $tag ($feat->get_all_tags() ) {
if ($tag eq "locus_tag"){ print "\n>the locus: ", join(' ',$feat->get_tag_values($tag)); #ecrit le locus } elsif ($tag eq "product"){
print  " ", join(' ',$feat->get_tag_values($tag)); #ecrit le nom du CDS
}
elsif ($tag eq "coded_by"){ print "\n\nThe corresponding gene acc: ",$feat->get_tag_values($tag); #print ">\n\n", join (' ',$feat-> get_tag_values($tag)); #affiche l'acc. du gene } } #print "\n" print " \n\n",$feat->start, "..", $feat->end,"\n\n";;# " on strand ",$feat->strand, "\n";
$out =$feat->seq; #out est un Primary::Seq
$string =$out->seq(); #recupère la séquence
print \$string."\n\n";
}
}


There is 2 problem with my script:

1. I can't use it with several accession numbers (IDs)
2. The goted CDS sequences are protein sequences, in my case I'm looking for the corresponding nucleic sequences. We have just the coded_by infos that I don't know how I can use it to get the nucleic sequence
0
Entering edit mode

For future reference: indent all lines of code with 4 spaces to format them properly in your question.I did it for you - this time.