Question

Convert huge list of accession numbers to GI numbers

4.4 years ago

hazirliver ▴ 10

Hello! I have a huge list of accession numbers (a little more than 1 million) and I need to get the corresponding gi numbers. Is there a way to do this?

ncbi GI gi numbers accession numbers • 2.0k views

ADD COMMENT • link 4.4 years ago by hazirliver ▴ 10

Please do not use gi numbers; NCBI deprecated them for end-user use almost 2 years ago. Stay with accession numbers where you can.

That said, EntrezDirect should be able to get this information, though it does not seem to be behaving at the moment.

ADD REPLY • link 4.4 years ago by GenoMax 141k

I know, but I want to try to apply Koonin's pipeline from a 2019 article, in which they use gi numbers to identify sequences.

ADD REPLY • link 4.4 years ago by hazirliver ▴ 10

Which pipeline are you referring to? Perhaps it could be modified to use accession numbers?

ADD REPLY • link 4.4 years ago by GenoMax 141k

https://www.nature.com/articles/s41596-019-0211-1 They use both accession numbers and gi numbers. To be more specific, there are "GeneratedID"s, but the ProtocolFiles (Vicinity.faa) contain lines like ">gi|1000270263|gb|AAD36848.1| AAD36848.1 acetylornithine aminotransferase [Thermotoga maritima MSB8]". An additional problem is that the gi numbers in this article do not match the ones GenBank lists for the same accession numbers. Even the revision history in GenBank shows different numbers.
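
For headers in the format quoted above, a short awk call could pull out the gi/accession pairs so they can be compared against what GenBank reports. This is only a sketch: the function name extract_gi_acc is made up for illustration, and it assumes every header follows the ">gi|number|db|accession|" layout seen in the Vicinity.faa line.

```shell
# Hypothetical helper: print "gi<TAB>accession" for each FASTA header
# of the form >gi|NUMBER|DB|ACCESSION| ... found in the given file(s) or on stdin.
# Sequence lines and headers in any other format are ignored.
extract_gi_acc() {
    awk -F'|' '/^>gi[|]/ { print $2 "\t" $4 }' "$@"
}
```

Something like "extract_gi_acc Vicinity.faa > gi_acc.tsv" would then give a two-column table to check against the GenBank records.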

ADD REPLY • link 4.4 years ago by hazirliver ▴ 10

Using EntrezDirect (linked above), you can get the gi where possible:

$ esearch -db protein -query "CAA62188" | efetch -format gi
1212992
$ esearch -db protein -query "AAD36848" | efetch -format gi
4982364
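
For a whole list rather than single queries, the same two commands could be wrapped in a loop. The following is a minimal sketch, not a tuned solution: the function name acc_to_gi and the one-accession-per-line input file are assumptions. With a million accessions, one lookup per line will be slow and has to stay within NCBI's request-rate limits, so it is probably best to split the list and run it in chunks.

```shell
# Hypothetical helper: read accessions (one per line) from the file given as $1
# and print "accession<TAB>gi" for each one.
# The gi column is left empty when the lookup returns nothing.
acc_to_gi() {
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue          # skip blank lines
        # </dev/null stops esearch from reading the rest of the accession list
        gi=$(esearch -db protein -query "$acc" </dev/null | efetch -format gi)
        printf '%s\t%s\n' "$acc" "$gi"
    done < "$1"
}
```

Running "acc_to_gi accessions.txt > acc_gi.tsv" (file names are placeholders) would produce a table in which rows with an empty second column mark accessions that returned no gi.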

As for the numbers not matching, that is interesting (as the AAD36848 lookup above shows). Since the authors of the pipeline you link are at NCBI, you should make them aware of the discrepancy and also suggest that they update their pipeline to use accessions instead of gi numbers.

ADD REPLY • link 4.4 years ago by GenoMax 141k

Thanks for EntrezDirect! I think their "gi"s aren't GenBank gi numbers, because none of the accession numbers in the whole file matches its gi.

ADD REPLY • link 4.4 years ago by hazirliver ▴ 10

Hmm. I get the same record for titin if I use the following two URLs. One uses the gi and the other uses the accession.

https://www.ncbi.nlm.nih.gov/protein/1212992
https://www.ncbi.nlm.nih.gov/protein/CAA62188.1