Question: Convert huge list of accession numbers to GI numbers
Asked 6 weeks ago by hazirliver

Hello! I have a huge list of accession numbers (a little more than 1 million) and I need to get the corresponding gi numbers. Is there a way to do this?


Please do not use gi numbers; NCBI deprecated them for end-user use almost 2 years ago. Stay with accession numbers where you can.

That said, EntrezDirect should be able to get this information. It does not seem to be behaving at the moment, though.

written 6 weeks ago by genomax

I know, but I am trying to apply the pipeline from Koonin's 2019 article, in which they use gi numbers to identify sequences.

written 6 weeks ago by hazirliver

Which pipeline are you referring to? Perhaps it could be modified to use accession numbers?

written 6 weeks ago by genomax

https://www.nature.com/articles/s41596-019-0211-1 They use both accession numbers and gi numbers. To be more specific, there are "GeneratedID"s, but in the ProtocolFiles (Vicinity.faa) there are lines like ">gi|1000270263|gb|AAD36848.1| AAD36848.1 acetylornithine aminotransferase [Thermotoga maritima MSB8]". An additional problem is that the gi numbers in this article do not match the ones GenBank gives for the same accession numbers. Even the revision history in GenBank shows different numbers.
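To check the mismatch across the whole file, the gi and accession can be pulled out of each header and compared with what GenBank returns. A minimal sketch, assuming every header follows the ">gi|<gi>|<db>|<accession>|" layout quoted above; extract_gi_pairs is a hypothetical helper name, not something shipped with the protocol files.

```shell
# Hypothetical helper: print "accession<TAB>gi" for every FASTA header
# that starts with ">gi|". Fields are split on "|", so field 2 is the gi
# and field 4 is the accession. Reads the named files, or stdin if none.
extract_gi_pairs() {
    awk -F'|' '/^>gi\|/ { print $4 "\t" $2 }' "$@"
}
```

For example, extract_gi_pairs Vicinity.faa > article_pairs.tsv would write one accession/gi pair per header line, which can then be compared with the gi numbers returned by NCBI.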

written 6 weeks ago by hazirliver

Using EntrezDirect (linked above), you can get the gi where one exists:

$ esearch -db protein -query "CAA62188" | efetch -format gi
1212992
$ esearch -db protein -query "AAD36848" | efetch -format gi
4982364
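For a list stored in a file, the same two commands can be wrapped in a loop. This is only a sketch: lookup_gi and convert_accessions are hypothetical names, the input is assumed to hold one accession per line, and it makes one lookup per accession, so for a million records it would be slow and some form of batching would be needed.

```shell
# Hypothetical helper: look up the gi for a single accession with the same
# esearch | efetch pipeline shown above. Needs EDirect on PATH and network
# access. esearch gets /dev/null as stdin so that it does not consume the
# accession list the calling loop is reading.
lookup_gi() {
    esearch -db protein -query "$1" < /dev/null | efetch -format gi
}

# Read accessions from stdin (one per line, blank lines skipped) and print
# "accession<TAB>gi". Prints "NA" when the lookup returns nothing.
convert_accessions() {
    local acc gi
    while IFS= read -r acc; do
        [ -z "$acc" ] && continue
        gi=$(lookup_gi "$acc" | head -n 1)
        printf '%s\t%s\n' "$acc" "${gi:-NA}"
    done
}
```

It could be run as convert_accessions < accessions.txt > acc2gi.tsv, where accessions.txt is an assumed file name.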

As for the numbers not matching, that is interesting. Since the authors of the pipeline you linked are at NCBI, you should make them aware of the discrepancy and also suggest that they update their pipeline to use accessions instead of gi numbers.

written 6 weeks ago by genomax

Thanks for EntrezDirect! I think their "gi"s aren't GenBank gi numbers, because throughout the whole file no accession number matches its gi.

written 6 weeks ago by hazirliver

Hmm. I get the same record for titin if I use the following two URLs. One uses the gi and the other uses the accession.

https://www.ncbi.nlm.nih.gov/protein/1212992
https://www.ncbi.nlm.nih.gov/protein/CAA62188.1

written 6 weeks ago by genomax

In my previous post I meant the file from the article.

written 6 weeks ago by hazirliver