Question: Programatically fetch all GI numbers for nucleotide sequences of a species from NCBI?
0
gravatar for biocyberman
21 months ago by
biocyberman630
Denmark
biocyberman630 wrote:

I want to make an blastdb alias with blastdb_aliastool for Homo sapiens. I can go online to search and export a GI list and use it with the aliastool. However, I will have to repeat the process again after some time. Is there away I can fetch GI numbers from a command line or via Python/Biopython, R, and if not avoidable, Perl.

Update

After reading elsewhere on biostars and following the instruction Here to install edirect command line apps, I did this:

epost -db taxonomy -id 9606|elink -batch -target nuccore |efetch -format uid >hsapiens.gi.txt

The output hsapiens.gi.txt contains only 100000 records, this is the maximum retrieval for one batch. Searching further about E-Utilities, I think I have to use `retstart` paramenter somewhere in the command line above. But how?

blast biopython edirect ncbi • 826 views
ADD COMMENTlink modified 21 months ago by 5heikki6.5k • written 21 months ago by biocyberman630
2

Im sure it's possible to bruteforce/scrape your way to all the GIs, or programatically via biomart maybe, but my experience of doing this against the NCBI's websites is to not bother, because whatever you write to do the scraping now will not work in 6 months. You are better off just writing a nice e-mail to the NCBI and asking for the data. They will reply before lunch time, maybe sooner if its all just sitting in an Excel file anyway from the last person that asked, and they may even tell you about API/SQL access you could use to keep your lists as up-to-date as possible. This goes for pretty much anything where you want a large amount of data once, then not much ever again.

ADD REPLYlink written 21 months ago by John11k
1

Thanks for the suggestion. I will contact NCBI if the question can't be answered soon here. I chose to post here because the question may benefit others as well.

ADD REPLYlink written 21 months ago by biocyberman630
2
gravatar for 5heikki
21 months ago by
5heikki6.5k
Finland
5heikki6.5k wrote:

You can extract the GIs straight from your db as instructed here

blastdbcmd -db nr -entry all -outfmt "%g %T" | \
   awk ' { if ($2 == 9606) { print $1 } } ' | \
   blastdbcmd -db nr -entry_batch - -out human_sequences.txt

In the above example you get the seqs, but I don't see why you couldn't pipe the GIs to blastdb_aliastool the same way..

ADD COMMENTlink modified 21 months ago • written 21 months ago by 5heikki6.5k

Use blastdbcmd to dump the data out is a reliable way, but also not the fastest way. Anyway, it is a reliable and working way. I came up with this command to get GI numbers I want and use them for `blastdb_aliastool` command.

blastdbcmd -db nt -entry all -outfmt "%g %T"|gawk '( $2 == 9606 ) {count+=1; print $1 >hs.gi.txt}END{print "Found: "count" records"}' 
ADD REPLYlink modified 21 months ago • written 21 months ago by biocyberman630
2
gravatar for genomax
21 months ago by
genomax30k
United States
genomax30k wrote:

Get a copy of this file: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

Column 1 contains the taxID's. You can get all kinds of other ID's once your parse the relevant lines out.

File above is regenerated regularly so should be current at all times.

ADD COMMENTlink written 21 months ago by genomax30k

The gi numbers in NCBI are asigned to non-gene sequences, so this approach is not applicable for my case, but it would be very useful in other cases.
 

ADD REPLYlink written 21 months ago by biocyberman630
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1093 users visited in the last hour