Question

Unable to retrieve FASTA of certain NCBI entries given their accession number

5.9 years ago

erans995 • 0

Hello everyone

I have the following Perl code that writes an entry's FASTA sequence to a file, given its accession number:

use LWP::Simple;

#append [accn] field to each accession
for ($i=0; $i < @ARGV; $i++) {
   $ARGV[$i] .= "[accn]";
}

#join the accessions with OR
$query = join('+OR+',@ARGV);

#assemble the esearch URL
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "esearch.fcgi?db=nuccore&term=$query&usehistory=y";

#post the esearch URL
$output = get($url);

#parse WebEnv and QueryKey
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);

#assemble the efetch URL
$url = $base . "efetch.fcgi?db=nuccore&query_key=$key&WebEnv=$web";
$url .= "&rettype=fasta&retmode=text";

#post the efetch URL
$fasta = get($url);

my $filename = 'dna.txt';

open(FH, '>', $filename) or die $!;

print FH $fasta;

close(FH);

This is a modified version of application 2 from the "Sample Applications of the E-utilities" page of NCBI. Here is the original version:

use LWP::Simple;
$acc_list = 'NM_009417,NM_000547,NM_001003009,NM_019353';
@acc_array = split(/,/, $acc_list);

#append [accn] field to each accession
for ($i=0; $i < @acc_array; $i++) {
   $acc_array[$i] .= "[accn]";
}

#join the accessions with OR
$query = join('+OR+',@acc_array);

#assemble the esearch URL
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "esearch.fcgi?db=nuccore&term=$query&usehistory=y";

#post the esearch URL
$output = get($url);

#parse WebEnv and QueryKey
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);

#assemble the efetch URL
$url = $base . "efetch.fcgi?db=nuccore&query_key=$key&WebEnv=$web";
$url .= "&rettype=fasta&retmode=text";

#post the efetch URL
$fasta = get($url);
print "$fasta";

If I run the code with the accession number NM_009417, it works fine and the FASTA sequence is written to the file. However, if I run it with CAA30263.1, the following is written to the file instead:

https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20131226/efetch.dtd"> <eFetchResult> <ERROR>Empty result - nothing to do</ERROR> </eFetchResult>

I also tried CAA30263 (without the version number), but that did not work either. Note that I obtained this accession number by running the following code, which writes the accession number matching a given GI to a file, with the GI 672:

use LWP::Simple;
#$gi_list = '24475906,224465210,50978625,9507198';

#assemble the URL
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "efetch.fcgi?db=nucleotide&id=$ARGV[0]&rettype=acc";

#post the URL
$output = get($url);
my $filename = 'acc_num.txt';

open(FH, '>', $filename) or die $!;

print FH $output; 

close(FH);

This code is a modified version of application 1 from the "Sample Applications of the E-utilities" page of NCBI. Here is the original version:

use LWP::Simple;
$gi_list = '24475906,224465210,50978625,9507198';

#assemble the URL
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "efetch.fcgi?db=nucleotide&id=$gi_list&rettype=acc";

#post the URL
$output = get($url);
print "$output";

Your help will be appreciated, thank you very much for your time!!

perl ncbi fasta accession number • 2.2k views

ADD COMMENT • link updated 5.0 years ago by josev.die ▴ 60 • written 5.9 years ago by erans995 • 0

score 1 · Answer 1 · 2018-05-19

5.9 years ago

GenoMax 141k

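CAA30263.1 is a protein accession, but your script searches a nucleotide database (db=nuccore), so the search matches nothing and efetch returns an empty result. Use db=protein in both the esearch and efetch URLs.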
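
As a rough sketch of that change, the script from the question could take the database name as its first argument instead of hard-coding nuccore. The helper names esearch_url and efetch_url, the argument order, and the error checks are assumptions made for illustration; the output file name dna.txt is taken from the question.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

my $BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Hypothetical helper: esearch URL for a list of accessions in database $db.
sub esearch_url {
    my ($db, @acc) = @_;
    # append the [accn] field to each accession and join them with OR
    my $query = join('+OR+', map { $_ . '[accn]' } @acc);
    return $BASE . "esearch.fcgi?db=$db&term=$query&usehistory=y";
}

# Hypothetical helper: efetch URL for a stored search, in the same database.
sub efetch_url {
    my ($db, $key, $web) = @_;
    return $BASE . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web"
                 . '&rettype=fasta&retmode=text';
}

# Runs only when invoked with a database name and at least one accession.
if (!caller && @ARGV >= 2) {
    my ($db, @acc) = @ARGV;

    my $output = get(esearch_url($db, @acc));
    die "esearch request failed\n" unless defined $output;

    # parse WebEnv and QueryKey from the esearch XML
    my ($web) = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;
    my ($key) = $output =~ /<QueryKey>(\d+)<\/QueryKey>/;
    die "esearch returned no WebEnv/QueryKey\n"
        unless defined $web && defined $key;

    my $fasta = get(efetch_url($db, $key, $web));
    die "efetch request failed\n" unless defined $fasta;
    # stop instead of writing an error document to the output file
    die "efetch reported an error:\n$fasta\n" if $fasta =~ /<ERROR>/;

    open(my $fh, '>', 'dna.txt') or die "Cannot open dna.txt: $!";
    print {$fh} $fasta;
    close($fh);
}
```

Saved under a hypothetical name such as fetch_fasta.pl, it would be run as `perl fetch_fasta.pl protein CAA30263.1` for the protein record, or `perl fetch_fasta.pl nuccore NM_009417` for the nucleotide record that already worked.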


score 0 · Answer 2 · 2019-04-14

You can also use the following function written in R:

save_AAfasta <- function(xpsIds, nameFile) {

  # seq_along() also handles an empty vector of ids safely
  for (i in seq_along(xpsIds)) {
    # look up the record summary to get its UID, then fetch the FASTA
    protein <- rentrez::entrez_summary(db = "protein", id = xpsIds[i])
    protein_fasta <- rentrez::entrez_fetch(db = "protein", id = protein$uid, rettype = "fasta")

    # append the amino acid sequence to the FASTA file "<nameFile>.fasta"
    write(protein_fasta, file = paste(nameFile, ".fasta", sep = ""), append = TRUE)
  }
}

Then call the function with your ID; it appends the sequence to a FASTA file (here Downloads/my_proteins.fasta):

save_AAfasta('CAA30263', "Downloads/my_proteins")