Hello! I have a huge list of accession numbers (a little over 1 million) and I need to get the corresponding gi numbers. Is there a way to do this?
Please do not use gi numbers; NCBI deprecated them for end-user use almost 2 years ago. Stay with accession numbers where you can.
That said, EntrezDirect should be able to get this information. It does not seem to be behaving at the moment, though.
I know, but I would like to apply Koonin's pipeline from a 2019 article, in which they use gi numbers to identify sequences.
Which pipeline are you referring to? Perhaps it could be modified to use accession numbers?
They use both accession numbers and gi numbers. To be more specific, there are "GeneratedID"s, but in ProtocolFiles (Vicinity.faa) there are lines like ">gi|1000270263|gb|AAD36848.1| AAD36848.1 acetylornithine aminotransferase [Thermotoga maritima MSB8]". An additional problem is that the gi numbers in this article do not match the ones GenBank shows for the same accession numbers. Even the revision history in GenBank lists different numbers.
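To make the comparison easier to reproduce, here is a rough sketch that pulls the gi/accession pairs out of headers like the one above. It assumes every relevant header has the form `>gi|<number>|<db>|<accession>| description`; the function name `gi_acc_table` is just something I made up for this post.

```shell
# Print "gi<TAB>accession" for every FASTA header of the form
#   >gi|<number>|<db>|<accession>| description
# Lines that do not start with ">gi|" (sequence lines, other headers)
# are skipped.
gi_acc_table() {
    awk -F'|' '/^>gi[|]/ { print $2 "\t" $4 }' "$1"
}
```

Running `gi_acc_table Vicinity.faa > vicinity_gi_acc.tsv` would give a two-column table that can be checked against what GenBank reports for each accession.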
Using EntrezDirect (linked above), you can get the gi where possible:
$ esearch -db protein -query "CAA62188" | efetch -format gi
$ esearch -db protein -query "AAD36848" | efetch -format gi
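For a whole list, the same lookup could be wrapped in a loop. This is only a sketch: it assumes a file with one accession per line, the function names `lookup_gi` and `acc_to_gi_table` are made up for illustration, and it keeps only the first gi returned if a query matches more than one record. One request per accession will be slow for a million entries, and NCBI rate limits apply.

```shell
# Look up the gi for one protein accession with the same EDirect
# pipeline as above. stdin is redirected so that esearch does not
# swallow the accession list being read by the calling loop.
lookup_gi() {
    esearch -db protein -query "$1" < /dev/null | efetch -format gi
}

# Read accessions (one per line) from the file given as $1 and print
# "accession<TAB>gi". Accessions with no result get "NA".
acc_to_gi_table() {
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue            # skip blank lines
        gi=$(lookup_gi "$acc" | head -n 1)   # keep the first hit only
        printf '%s\t%s\n' "$acc" "${gi:-NA}"
    done < "$1"
}
```

Usage would be something like `acc_to_gi_table accessions.txt > acc_gi.tsv`, where `accessions.txt` is a placeholder name for your list.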
As for the numbers not matching, that is interesting (as above). Since the authors of the pipeline you link are at NCBI, you should make them aware of the discrepancy and also suggest that they update their pipeline to use accessions instead of gi.
Thanks for EntrezDirect! I think their "gi"s aren't GenBank gi numbers, because throughout the whole file the accession numbers don't match the gi GenBank gives for them.
Hmm. I get the same record for titin if I use the following two URLs. One uses the accession and the other uses the gi.
In my previous post I meant the file from the article.