Question: NCBI GI to Gene Description
1
3.0 years ago, navillusol858 (United Kingdom) wrote:

Hello all,

I have a very large list of NCBI GI numbers (e.g. gi:47221249). I am hoping to use this list to get the description for each ID. For the GI above, it would be "unnamed protein product [Tetraodon nigroviridis]".

The result would be a file with two columns: one with the IDs and the other with the description for each ID.

Would anyone know of a script/software already available to do a job such as this?

Thanks for the help!

blast next-gen gene • 4.5k views
3
3.0 years ago, 5heikki (Finland) wrote:

With Entrez Direct you can:

efetch -db protein -id 47221249 -format docsum | xtract -pattern DocumentSummary -element Title

unnamed protein product [Tetraodon nigroviridis]

 

Or, if you have BLAST+ installed, you could download the latest nr database and query it with blastdbcmd.
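A rough sketch of the blastdbcmd route, assuming a local nr database that can still be looked up by GI (newer database releases may only support accessions) and a file with one ID per line. The function name gi_titles_blastdb and its arguments are made up for illustration; -entry_batch and the %g (GI) and %t (title) format fields are blastdbcmd options.

```shell
# Hypothetical helper: print "GI<TAB>title" for each ID in a file,
# using a local BLAST database (default: nr).
gi_titles_blastdb() {
  ids_file=$1
  db=${2:-nr}
  # %g = GI, %t = sequence title; the first space is then turned into a tab
  blastdbcmd -db "$db" -entry_batch "$ids_file" -outfmt '%g %t' |
    awk '{ gi = $1; sub(/^[^ ]+ /, ""); print gi "\t" $0 }'
}
```

It could be called as: gi_titles_blastdb listOfGis.txt nr > gi_titles.tsv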


Hi 5heikki, thanks for the help. Is it possible to give efetch a file that contains a list of IDs and have it return the titles as a list too (in a file)?

written 3.0 years ago by navillusol858

As far as I know, you can't pass it a list as such, but it's trivial to script it. For example, in a Bash shell:

while read -r gi; do title=$(efetch -db protein -id "$gi" -format docsum < /dev/null | xtract -pattern DocumentSummary -element Title); printf '%s\t%s\n' "$gi" "$title"; done < listOfGis.txt

modified 3.0 years ago • written 3.0 years ago by 5heikki
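For long lists, one request per ID can be slow. A possible alternative, sketched here on the assumption that efetch accepts a comma-separated value for -id and that the protein docsum contains an Id element next to Title, is to send the IDs in a single call. The function name gi_titles_batch is hypothetical.

```shell
# Hypothetical helper: print "GI<TAB>Title" for all IDs in a file
# using a single efetch call.
gi_titles_batch() {
  ids_file=$1
  # Drop blank lines, then join the IDs with commas: 1,2,3
  ids=$(grep -v '^[[:space:]]*$' "$ids_file" | paste -sd, -)
  # Nothing to do for an empty list
  [ -n "$ids" ] || return 0
  # xtract prints the listed elements of each record, tab-separated
  efetch -db protein -id "$ids" -format docsum < /dev/null |
    xtract -pattern DocumentSummary -element Id Title
}
```

Usage would be along the lines of: gi_titles_batch listOfGis.txt > gi_titles.tsv. For very large lists it is probably safer to split the file into chunks of a few hundred IDs first (for example with split -l) and call the function once per chunk.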

Thanks 5heikki, that is exactly what I was looking for, very much appreciated!

Regards

written 3.0 years ago by navillusol858
2
3.0 years ago, Alastair Kerr (The University of Edinburgh, UK) wrote:

Where possible, I suggest avoiding GI identifiers, as I have seen many cases where they could not be used to retrieve historical data.

You could use Batch Entrez and extract the information you need from the resulting file, remembering to select the appropriate database (nucleotide, protein, etc.).

 


Thanks for the help, Alastair! I also have GenBank IDs. Do you think it would be more accurate to use them?

Regards

written 3.0 years ago by navillusol858
1

If using NCBI, ideally I would try to use RefSeq IDs, with the revision number if you have it. See http://www.ncbi.nlm.nih.gov/books/NBK50679/

written 3.0 years ago by Alastair Kerr
1

Where possible, it is usually best to use the accession (e.g. K00650) rather than the GenBank/GenPept Locus/Id (e.g. HUMFOS) or the NCBI GI number (e.g. 182734). The Locus/Id is not guaranteed to be stable and can change between releases. The GI number refers to a specific version of the sequence, which may change in later revisions, and as a bare number it suffers from anonymous identifier syndrome (e.g. is a particular GI a protein or a nucleotide sequence?). The accession, or, if reference to a specific sequence is required, the accession-based sequence version (e.g. K00650.1), is guaranteed to be stable and persistent.
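To illustrate the accession versus accession.version distinction, here is a small shell sketch that splits an identifier such as K00650.1 into its stable accession and its sequence version. The function name split_accession is made up for this example, and it simply assumes the version is whatever follows the last dot.

```shell
# Hypothetical helper: print "accession<TAB>version" for an identifier.
# The version field is left empty when the identifier has no ".N" suffix.
split_accession() {
  id=$1
  case $id in
    *.*) printf '%s\t%s\n' "${id%.*}" "${id##*.}" ;;
    *)   printf '%s\t\n' "$id" ;;
  esac
}
```

For example, split_accession K00650.1 prints K00650 and 1 separated by a tab, which makes it easy to keep the accession as the key in a table and store the version in its own column.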

Since the accession is shared across the INSDC member databases (i.e. DDBJ, ENA and GenBank), using it has the advantage that any of the INSDC databases can be used to retrieve nucleotide sequences (whole entry or CDS features). For protein sequences, the accession used in GenBank/GenPept is the INSDC protein_id, which is also shared across INSDC. It is used in databases that consume data from the INSDC databases, for example in UniParc, which has mappings to other sources that share the same protein sequence, and in UniProtKB through the import of CDS translations into UniProtKB/TrEMBL. It also serves as the CDS identifier for the CDS entries in ENA Coding.

For RefSeq entries the same principle applies, but they use the accession as the Locus/Id in the GenBank format. This is also the case in UniProtKB, where the entry name (ID) is a human friendly mnemonic which is subject to change, but the primary accession is the stable identifier.

modified 3.0 years ago • written 3.0 years ago by hpmcwill