Question: Programatically fetch all GI numbers for nucleotide sequences of a species from NCBI?
0
gravatar for biocyberman
17 months ago by
biocyberman620
Denmark
biocyberman620 wrote:

I want to make an blastdb alias with blastdb_aliastool for Homo sapiens. I can go online to search and export a GI list and use it with the aliastool. However, I will have to repeat the process again after some time. Is there away I can fetch GI numbers from a command line or via Python/Biopython, R, and if not avoidable, Perl.

Update

After reading elsewhere on biostars and following the instruction Here to install edirect command line apps, I did this:

epost -db taxonomy -id 9606|elink -batch -target nuccore |efetch -format uid >hsapiens.gi.txt

The output hsapiens.gi.txt contains only 100000 records, this is the maximum retrieval for one batch. Searching further about E-Utilities, I think I have to use `retstart` paramenter somewhere in the command line above. But how?

blast biopython edirect ncbi • 725 views
ADD COMMENTlink modified 17 months ago by 5heikki6.1k • written 17 months ago by biocyberman620
2

Im sure it's possible to bruteforce/scrape your way to all the GIs, or programatically via biomart maybe, but my experience of doing this against the NCBI's websites is to not bother, because whatever you write to do the scraping now will not work in 6 months. You are better off just writing a nice e-mail to the NCBI and asking for the data. They will reply before lunch time, maybe sooner if its all just sitting in an Excel file anyway from the last person that asked, and they may even tell you about API/SQL access you could use to keep your lists as up-to-date as possible. This goes for pretty much anything where you want a large amount of data once, then not much ever again.

ADD REPLYlink written 17 months ago by John10k
1

Thanks for the suggestion. I will contact NCBI if the question can't be answered soon here. I chose to post here because the question may benefit others as well.

ADD REPLYlink written 17 months ago by biocyberman620
2
gravatar for 5heikki
17 months ago by
5heikki6.1k
Finland
5heikki6.1k wrote:

You can extract the GIs straight from your db as instructed here

blastdbcmd -db nr -entry all -outfmt "%g %T" | \
   awk ' { if ($2 == 9606) { print $1 } } ' | \
   blastdbcmd -db nr -entry_batch - -out human_sequences.txt

In the above example you get the seqs, but I don't see why you couldn't pipe the GIs to blastdb_aliastool the same way..

ADD COMMENTlink modified 17 months ago • written 17 months ago by 5heikki6.1k

Use blastdbcmd to dump the data out is a reliable way, but also not the fastest way. Anyway, it is a reliable and working way. I came up with this command to get GI numbers I want and use them for `blastdb_aliastool` command.

blastdbcmd -db nt -entry all -outfmt "%g %T"|gawk '( $2 == 9606 ) {count+=1; print $1 >hs.gi.txt}END{print "Found: "count" records"}' 
ADD REPLYlink modified 17 months ago • written 17 months ago by biocyberman620
2
gravatar for genomax2
17 months ago by
genomax224k
United States
genomax224k wrote:

Get a copy of this file: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

Column 1 contains the taxID's. You can get all kinds of other ID's once your parse the relevant lines out.

File above is regenerated regularly so should be current at all times.

ADD COMMENTlink written 17 months ago by genomax224k

The gi numbers in NCBI are asigned to non-gene sequences, so this approach is not applicable for my case, but it would be very useful in other cases.
 

ADD REPLYlink written 17 months ago by biocyberman620
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1014 users visited in the last hour