Question: Entrez/E-utilities or Blast? NCBI gi numbers to RefSeq accession
0
gravatar for Holly
22 months ago by
Holly0
Birmingham, UK
Holly0 wrote:

I have a list of about 7000 NCBI gi numbers from one program and I wish to take these into the next program, but this requires the refseq accession numbers i.e. those which start NC_*.

e.g. of what I have: `154350369

154350369 
594021901
811154183
407962962
407955691
218540569

What is the best approach?! I have a previous file with the sequence attached to the GI numbers but using blast I only managed to get back the same NCBI accessions (may have been user error?).

Using the following code it is easy to go from GI number to genbank accession (stolen from the docs), but again, still in the genbank format not refseq. I think I have to use the eLink feature and then use eSearch with the search term srcdb_refseq[prop] for the right linked file? (not that I'm sure how to do thins) or would blast be easier? (if I figured how to use the command line version!)

use LWP::Simple;
$gi_list = '154350369, 594021901, 811154183, 407962962';

#assemble the URL
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "efetch.fcgi?db=nucleotide&id=$gi_list&rettype=acc";

#post the URL
$output = get($url);
print "$output";

Very confused and a sanity check needed! Thank you!

genome • 1.1k views
ADD COMMENTlink modified 22 months ago • written 22 months ago by Holly0
1
gravatar for Holly
22 months ago by
Holly0
Birmingham, UK
Holly0 wrote:

OK solved the problem!! Using the Genbank accession numbers, I use the below script to search NCBI but specifying the RefSeq database, returning the RefSeq accession! It's not perfect as some do not create a match but I'll take it! I just use bash to loop around the files in my directory and direct the prints to a new file.

I do, however, get an error message (below) and I'm not sure why! But it works! Thanks to everyone for your suggestions, it helped point me in the right direction.

Use of uninitialized value in print at databaseQuery.pl line 25.

#!/usr/local/bin/perl 
use Bio::DB::EUtilities;
use strict;
use warnings;

my $line = ();
my $filename = $ARGV[0];
open(my $fh ,'<' ,$filename ) or die "could not open file $!";

foreach $line (<$fh>)
        {
        chomp $line;
        my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
                                       -db     => 'nuccore',
                                       -term   => "$line AND srcdb_refseq[prop]",
                                       -email  => 'mymail@foo.bar',
                                       -retmax => 5000);

# query terms are mapped; what's the actual query?
print "Query translation: ",$factory->get_query_translation,"\n";

# query hits
print "Count = ",$factory->get_count,"\n";

# UIDs
my @ids = $factory->get_ids;
print "> @ids \n";
}
print "\n";
ADD COMMENTlink written 22 months ago by Holly0
1
gravatar for Sej Modha
22 months ago by
Sej Modha3.7k
Glasgow, UK
Sej Modha3.7k wrote:

NCBI Unix e-utils equivalent would be

esearch -db nucleotide -query "9507198" |efetch -format acc
ADD COMMENTlink written 22 months ago by Sej Modha3.7k

Thanks for your help! Unfortunately that doesn't work using my GI numbers as they are the genbank ones and not the refseq gi numbers. Any idea if there's a linked file for the refseq accessions?

ADD REPLYlink written 22 months ago by Holly0

Following query returns NM_019353.1 for me.

esearch -db nucleotide -query "9507198" |efetch -format acc
ADD REPLYlink written 22 months ago by Sej Modha3.7k

please try one of the GI numbers 154350369, 59402190, 811154183. These are the example of the GI numbers I have :). My mistake though, I script was taken from the help pages for the utilities and I forgot to substitute their GI numbers for mine to save confusion! Sorry!

ADD REPLYlink written 22 months ago by Holly0

You can have a look at those GIs on the NCBI and check if a RefSeq accession exists for them.

ADD REPLYlink written 22 months ago by Sej Modha3.7k

Also I am not sure if refseq GI number really exists, as my understanding is that refseq only has an accession number attached to it as opposed to a GI. Depending on time BLAST run takes, you could re-run BLAST and extract accession number directly using blast+ commands. For BLAST+ you could use -outfmt '6 qseqid qlen sseqid sacc' where sacc is subject accession number.

ADD REPLYlink written 22 months ago by Sej Modha3.7k

As an example:

Streptococcus mutans UA159 chromosome, complete genome 2,032,925 bp circular DNA Accession: NC_004350.2 GI: 347750429

Streptococcus mutans UA159, complete genome 2,032,925 bp circular DNA Accession: AE014133.2 GI: 345287734`

That is eventually where I need to end up. but how to I get the first GI number, if I start with the second GI number?

ADD REPLYlink modified 22 months ago • written 22 months ago by Holly0
esearch -db nucleotide -query "347750429" |efetch -format acc

returns NC_004350.2

ADD REPLYlink written 22 months ago by Sej Modha3.7k

I don't think you can if they are not linked any way.

ADD REPLYlink written 22 months ago by Sej Modha3.7k

Sorry for editing my last answer -I had ran out of posts for the next 6 hours! OK - I think what I'll have to do is retrieve the taxomic name, then search for the name but specify in the search terms "srcdb_refseq[prop]" which is the refseq database? It is a shame they have no linking information! Thank you for your help.

ADD REPLYlink written 22 months ago by Holly0
0
gravatar for Diwan
22 months ago by
Diwan520
United States
Diwan520 wrote:

Eutils option will be helpful for this. For example, we can use the Rentrez package in R and do the following

library(rentrez)
GI_match <- entrez_summary(db = "nucleotide", id = "9507198")
GI_match$caption

we get "NM_019353" as output. HTH

ADD COMMENTlink written 22 months ago by Diwan520

That package is great! But unfortunately using my GI numbers (which are genbank GI numbers) I still get back the genbank accession number. I'll try to hunt down an additional database, maybe I can link the two together? My first language was R so maybe I'll have more luck here! Thank you for your help!

ADD REPLYlink written 22 months ago by Holly0
 get_acc <- function(ab){
        GI_match <- entrez_summary(db = "nucleotide", id = ab)
        GI_match$extra
    }

GI_query <- c("154350369","594021901","811154183","407962962")
OUT <- sapply(GI_query,get_acc)

OUT[1]

"gi|154350369|gb|CP000560.1|"

It seems to be working for your query too, unless you are looking for some other id.

ADD REPLYlink modified 22 months ago • written 22 months ago by Diwan520

CPaccessions are the genbank accessions - I am after the refseq accessions i.e. NC_. As an example:

Streptococcus mutans UA159 chromosome, complete genome 2,032,925 bp circular DNA Accession: NC_004350.2 GI: 347750429

Streptococcus mutans UA159, complete genome 2,032,925 bp circular DNA Accession: AE014133.2 GI: 345287734

I have the second, I want the first! I think it might be impossible, so I have been trying to extract the genomic name then use that with esearch.

ADD REPLYlink modified 22 months ago • written 22 months ago by Holly0
0
gravatar for Brian Bushnell
22 months ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

The BBMap package has a nifty tool for things like this, gi2taxid.sh. It only works on fasta files, though. You can use it like this:

gi2taxid.sh in=x.fa out=y.fa gi=gitable.int1d.gz

That's after downloading gi_taxid_nucl.dmp.gz and gi_taxid_prot.dmp.gz from NCBI, and converting them like this:

gitable.sh gi_taxid_nucl.dmp.gz,gi_taxid_prot.dmp.gz gitable.int1d.gz

Sorry it's a pain, but NCBI is not a very responsive institution, so their file formats are not very useful. If pure GI number to NCBI number (without fasta files) conversion is crucial, let me know, and I'll add a tool that does that.

ADD COMMENTlink written 22 months ago by Brian Bushnell16k

Thanks for this, however I am missing a step still to switch from genbank to Refseq! Unfortunately the GI numbers I have are genbank, so I need to find a way to get the linking information. Surely somewhere on NCBI there is a file with both accession numbers!? I need to go from one database to another

ADD REPLYlink written 22 months ago by Holly0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2147 users visited in the last hour