Question: Convert List With Several Different Gene Identifiers Using Command Line/Programmatically
stenemo88 wrote, 6.8 years ago:

The question at hand is how to most effectively take a query list (of over 100 entries) containing multiple different gene and/or protein identifiers and convert them all to the same type of identifier.

EDIT 1: I guess all I really need is a table/csv file with one protein on each line, and the different names for that protein in the columns. Then I can easily use a script to extract the information I need.
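A minimal sketch of the kind of extraction script meant here, assuming a hypothetical CSV layout in which the first column of each row is the identifier to convert to and the remaining columns are alternative names for the same protein. The function name `build_alias_index` and the two sample rows are illustrative only.

```python
import csv
import io


def build_alias_index(handle):
    """Map every name in a row (lower-cased) to that row's first column."""
    index = {}
    for row in csv.reader(handle):
        if not row:
            continue
        primary = row[0].strip()
        for name in row:
            name = name.strip()
            if name:
                # Keep the first mapping seen if a name occurs in several rows.
                index.setdefault(name.lower(), primary)
    return index


# Hypothetical table: target identifier first, then other names for the protein.
table = io.StringIO(
    "P00450,Ceruloplasmin\n"
    "P60709,ACTB_HUMAN\n"
)
index = build_alias_index(table)
print(index.get("ceruloplasmin"))
print(index.get("actb_human"))
```

Lower-casing the keys makes the lookup tolerant of capitalisation differences in the query list; names that contain commas would need to be quoted in the CSV.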

So far I have tried several conversion websites ([link] lists a few of the best ones), all of which require the input to be a single type of ID (so you always have to know which type you have, which makes for a lot of manual work). For several searches there are also a few identifiers that these websites cannot resolve, but that a search in UniProt can find. The best approach I have found is therefore to enter each search term into UniProt manually, take the best Homo sapiens match, and map my query to it. The closest source I have found that may hold an answer is [link], although I am not sure whether this is the best way to go about it.

My question is then how to best automate this using the command line (I have been given the impression that this is possible, but I have no idea as to how to go about doing this specific query).

Example query list:

Ceruloplasmin (CF, F)
Complement C1s subcomponent (F)
Complement C3 (F)
Complement C5 (F)
Haptoglobin-related protein (F)
Hemopexin (CF)
Inter-alpha-trypsin inhibitor heavy chain H1 (CF, F)
Phosphatidylinositol-glycan-specific phospholipase Da
Plasminogen (F)
Protein AMBP
Serum albumin
Vitamin K-dependent protein S (F)

Example output for two queries (input - output (UniProt ID)):

ACTB_HUMAN - P60709
Ceruloplasmin - P00450

Devon Ryan (Freiburg, Germany) wrote, 6.8 years ago:

If you're using UniProt already, why not just download the ID text file, load it into a database, and then query that (likely with a small script or in R)? That would seem relatively straightforward.

Edit: You could probably even write a bash script to simply grep that file and parse the output.
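As a rough illustration of the grep-and-parse idea, here is a small Python sketch. It assumes the downloaded ID file is tab-separated with three columns (accession, ID type, value); that layout, the function name `grep_mapping`, and the sample rows are assumptions rather than something confirmed in this thread, so check them against the actual file.

```python
import io


def grep_mapping(lines, query):
    """Yield (accession, id_type, value) rows whose value equals the query."""
    wanted = query.strip().lower()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip lines that do not fit the assumed layout
        accession, id_type, value = parts
        if value.lower() == wanted:
            yield accession, id_type, value


# Illustrative rows in the assumed three-column layout.
sample = io.StringIO(
    "P60709\tUniProtKB-ID\tACTB_HUMAN\n"
    "P60709\tGene_Name\tACTB\n"
)
hits = list(grep_mapping(sample, "actb_human"))
print(hits)
```

For a real run, `lines` would be the opened mapping file, so it is streamed line by line rather than loaded into memory.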


This seems like a good solution, thank you for linking to the ID text file, which I was unable to find myself.

(reply written 6.8 years ago by stenemo88)
Elisabeth Gasteiger wrote, 6.8 years ago:

Just out of curiosity:

Was the Perl script I sent you via the UniProt helpdesk of any use? The idea was to use Perl LWP with query$query_term%22+organism:9606+reviewed%3Ayes&format=tab&columns=id,entry%20name

(searching in the protein name field of human Swiss-Prot entries) and, as a fallback if there are no results, a full-text search in human Swiss-Prot entries:$query_term+organism:9606+reviewed%3Ayes&format=tab&columns=id,entry%20name
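The two-step logic described above (protein-name search first, full-text search only if that returns nothing) could be sketched in Python roughly as follows. This is not the Perl script mentioned here. The two search functions are passed in as hypothetical callables, so no URL or API call is assumed, and the demo stubs return canned accessions. Treating the trailing "(CF, F)" markers in the example list as annotations to strip before searching is also an assumption.

```python
import re


def clean_term(raw):
    """Drop a trailing parenthesised annotation such as ' (CF, F)'."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", raw).strip()


def lookup(raw_term, search_by_name, search_full_text):
    """Try the protein-name search first; fall back to full text if empty."""
    term = clean_term(raw_term)
    hits = search_by_name(term)
    if not hits:
        hits = search_full_text(term)
    return hits


# Hypothetical stand-ins for the two remote searches (canned results only).
def by_name(term):
    return ["P00450"] if term == "Ceruloplasmin" else []


def full_text(term):
    return ["P60709"] if term == "ACTB_HUMAN" else []


print(lookup("Ceruloplasmin (CF, F)", by_name, full_text))
print(lookup("ACTB_HUMAN", by_name, full_text))
```

Passing the searches in as functions keeps the fallback logic testable offline; in practice each callable would wrap the corresponding HTTP request and parse the tab-separated response.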


Your solution should be perfect for me. This project is currently on hold, but I will give it a try when I have time.

(reply written 6.8 years ago by stenemo88)