Question: Map GenBank GI To Accession Numbers (A Million Times)
Owen S. (Oakland CA) wrote, 5.1 years ago:

A question of beguiling simplicity:

Given a long list of nucleotide GI numbers, how can one efficiently map them to their GenBank accession numbers?

I doubt NCBI EUtils will work for me, because the list is nearly a million long. My understanding is that you can't (or shouldn't) ping their server with so many requests.

I have used Ensembl's BioMart plenty in the past, but its organization into species-specific datasets rules it out here, because my GI numbers come from multiple taxa.

About the best solution I have been able to come up with is to download every BLAST database and then, for each db, dump the accessions and GIs with a command like:

blastdbcmd -db dbname -entry all -outfmt '%a %g'

Is there a better way?

Thanks.


You can ping servers with a large number of requests, provided that you respect the limits specified by the provider. The problem then becomes that the process takes days. So your solution is correct: once data goes beyond a certain size, it's better to work locally.

Reply written 5.1 years ago by Neilfws
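For comparison, the remote route described in that reply might look roughly like the Python sketch below. It assumes the E-utilities `efetch` endpoint with `db=nuccore` and `rettype=acc`, and a limit of about three requests per second without an API key; both should be checked against NCBI's current usage guidelines. The batch size of 200, the delay, and all function names are illustrative choices, not anything from the thread.

```python
import time
import urllib.parse
import urllib.request

# Assumed endpoint; confirm against the current E-utilities documentation.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_request(gis):
    """Encode one batch of GI strings as a POST body for efetch."""
    params = {
        "db": "nuccore",
        "id": ",".join(gis),
        "rettype": "acc",   # assumed: plain list of accession.version
        "retmode": "text",
    }
    return urllib.parse.urlencode(params).encode("ascii")


def fetch_accessions(gis, batch_size=200, delay=0.34):
    """Return {gi: accession}, pausing `delay` seconds between requests."""
    mapping = {}
    for batch in batches(gis, batch_size):
        with urllib.request.urlopen(EFETCH_URL, data=build_request(batch)) as resp:
            accessions = resp.read().decode().split()
        # Assumption: one accession per GI, returned in request order.
        # If the counts differ, the pairing cannot be trusted.
        if len(accessions) != len(batch):
            raise ValueError("accession count does not match batch size")
        mapping.update(zip(batch, accessions))
        time.sleep(delay)  # stay under the assumed 3 requests/second
    return mapping
```

At one GI per request and three requests per second, a million lookups would take roughly four days, which is why batching matters if this route is used at all.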
Will (United States) wrote, 5.1 years ago:

I've had a similar issue and that was my solution as well. I just dumped out a text file of GI -> accession numbers and then searched through that. After sorting the file and converting it into a fixed-width format (so I could skip around with a binary search), it was the easiest and fastest method I could find.
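A minimal Python sketch of that sorted, fixed-width approach is given below. The field widths, the zero-padded GI column, and the function names are assumptions made for illustration; the thread does not say what layout was actually used.

```python
import os

GI_WIDTH = 12    # assumed width of the zero-padded GI column
ACC_WIDTH = 20   # assumed width of the space-padded accession column
RECORD_LEN = GI_WIDTH + 1 + ACC_WIDTH + 1  # GI, space, accession, newline


def write_fixed_width(pairs, path):
    """Write (gi, accession) pairs sorted by GI, one fixed-length record each."""
    with open(path, "wb") as out:
        for gi, acc in sorted(pairs):
            if len(str(gi)) > GI_WIDTH or len(acc) > ACC_WIDTH:
                raise ValueError(f"record too wide: {gi} {acc}")
            line = f"{gi:0{GI_WIDTH}d} {acc:<{ACC_WIDTH}}\n"
            out.write(line.encode("ascii"))


def lookup(handle, gi):
    """Binary-search an open binary file for `gi`; return accession or None."""
    lo = 0
    hi = os.fstat(handle.fileno()).st_size // RECORD_LEN
    while lo < hi:
        mid = (lo + hi) // 2
        handle.seek(mid * RECORD_LEN)  # jump straight to record `mid`
        record = handle.read(RECORD_LEN)
        found = int(record[:GI_WIDTH])
        if found == gi:
            return record[GI_WIDTH + 1:].decode("ascii").strip()
        if found < gi:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Because every record has the same length, record `n` starts at byte `n * RECORD_LEN`, so each lookup needs only about twenty seeks for a million-record file and nothing has to be held in memory.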

Steve Moss (United Kingdom) wrote, 4.0 years ago:

You can also use:

blastdbcmd -db dbname -entry_batch long_list_of_nucleotide_gi_numbers.txt -outfmt '%a %g' -logfile entry_batch_stdout.log

This has the benefit of only outputting the entries you are interested in, for further downstream analyses. Any failed queries will be written to entry_batch_stdout.log (or whatever you fancy calling it).

I'm doing this with the '%a %T' output format at the moment to get a list of accession numbers and taxonomic IDs.
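The batch lookup above could be driven from Python along the lines of the sketch below. It uses only the blastdbcmd options quoted in this answer (`-db`, `-entry_batch`, `-outfmt`, `-logfile`); the function names and the choice to keep GIs as strings are illustrative assumptions.

```python
import subprocess


def batch_command(db, gi_file, log_file, outfmt="%a %g"):
    """Build the blastdbcmd argument list shown in the answer above."""
    return [
        "blastdbcmd",
        "-db", db,
        "-entry_batch", gi_file,
        "-outfmt", outfmt,
        "-logfile", log_file,
    ]


def parse_pairs(text):
    """Turn '%a %g' output lines into {gi: accession}, skipping odd lines."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # blank or unexpected line
        accession, gi = fields
        mapping[gi] = accession
    return mapping


def map_gis(db, gi_file, log_file):
    """Run blastdbcmd (must be on PATH) and return {gi: accession}."""
    # check=False so that a non-zero exit status, if blastdbcmd returns
    # one for unmatched entries, does not discard the output produced.
    result = subprocess.run(
        batch_command(db, gi_file, log_file),
        capture_output=True, text=True, check=False,
    )
    return parse_pairs(result.stdout)


def missing(gis, mapping):
    """List the requested GIs that did not come back with an accession."""
    return [gi for gi in gis if gi not in mapping]
```

Comparing the requested GIs against the returned mapping with `missing` gives a second check on failed queries alongside the log file.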

Chris Evelo (Maastricht, The Netherlands) wrote, 5.1 years ago:

You might want to have a look at http://www.bridgedb.org. We created that to make your (local) life easier when it comes to identifier mapping.

More info here: http://dx.doi.org/10.1186/1471-2105-11-5

That will not immediately solve your multiple-species problem though, since the "standard" BridgeDb databases are single-species as well. You could just run over each of these, or create your own. We also have a HomoloGene cross-species mapping database, which could be stacked on any of the others, but if I understand your question correctly you will not need that.


Thanks, I had forgotten about BridgeDb. I used it in the past and found it useful. But as you point out, it is not exactly the solution to this particular question, because of the multiple species.

Reply written 5.0 years ago by Owen S.