Programatically fetch all GI numbers for nucleotide sequences of a species from NCBI?
2
0
Entering edit mode
5.5 years ago
biocyberman ▴ 810

I want to make an blastdb alias with blastdb_aliastool for Homo sapiens. I can go online to search and export a GI list and use it with the aliastool. However, I will have to repeat the process again after some time. Is there away I can fetch GI numbers from a command line or via Python/Biopython, R, and if not avoidable, Perl.

Update

After reading elsewhere on biostars and following the instruction Here to install edirect command line apps, I did this:

epost -db taxonomy -id 9606|elink -batch -target nuccore |efetch -format uid >hsapiens.gi.txt

The output hsapiens.gi.txt contains only 100000 records, this is the maximum retrieval for one batch. Searching further about E-Utilities, I think I have to use retstart paramenter somewhere in the command line above. But how?

ncbi blast biopython edirect • 2.1k views
2
Entering edit mode

Im sure it's possible to bruteforce/scrape your way to all the GIs, or programatically via biomart maybe, but my experience of doing this against the NCBI's websites is to not bother, because whatever you write to do the scraping now will not work in 6 months. You are better off just writing a nice e-mail to the NCBI and asking for the data. They will reply before lunch time, maybe sooner if its all just sitting in an Excel file anyway from the last person that asked, and they may even tell you about API/SQL access you could use to keep your lists as up-to-date as possible. This goes for pretty much anything where you want a large amount of data once, then not much ever again.

1
Entering edit mode

Thanks for the suggestion. I will contact NCBI if the question can't be answered soon here. I chose to post here because the question may benefit others as well.

2
Entering edit mode
5.5 years ago
5heikki 9.6k

You can extract the GIs straight from your db as instructed here

blastdbcmd -db nr -entry all -outfmt "%g %T" | \
awk ' { if ($2 == 9606) { print$1 } } ' | \
blastdbcmd -db nr -entry_batch - -out human_sequences.txt


In the above example you get the seqs, but I don't see why you couldn't pipe the GIs to blastdb_aliastool the same way..

0
Entering edit mode

Use blastdbcmd to dump the data out is a reliable way, but also not the fastest way. Anyway, it is a reliable and working way. I came up with this command to get GI numbers I want and use them for blastdb_aliastool command.

blastdbcmd -db nt -entry all -outfmt "%g %T"|gawk '( $2 == 9606 ) {count+=1; print$1 >hs.gi.txt}END{print "Found: "count" records"}'

2
Entering edit mode
5.5 years ago
GenoMax 99k

Get a copy of this file: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

Column 1 contains the taxID's. You can get all kinds of other ID's once your parse the relevant lines out.

File above is regenerated regularly so should be current at all times.

0
Entering edit mode

The gi numbers in NCBI are asigned to non-gene sequences, so this approach is not applicable for my case, but it would be very useful in other cases.