Question: Convert List With Several Different Gene Identifiers Using Command Line/Programmatically
stenemo88 (Sweden) wrote 5.5 years ago:

The question at hand is how to most effectively take a query list (of over 100 entries) containing multiple different gene and/or protein identifiers and convert them all to the same type of identifier.

EDIT 1: I guess all I really need is a table/csv file with one protein on each line, and the different names for that protein in the columns. Then I can easily use a script to extract the information I need.

So far I have tried several conversion websites (http://www.protocol-online.org/prot/Research_Tools/Online_Tools/Sequence_Analysis/Gene_ID_Conversion_Tools lists a few of the best ones), all of which require the user to input a single type of ID (and to always know which type you have, which makes for a lot of manual work). In several searches there were also a few entries that these websites could not resolve but that a UniProt search could find. Therefore, the best approach I have found so far is to enter each search term manually into UniProt, take the best Homo sapiens match, and map my query to it. The closest source I have found that may hold an answer is http://www.uniprot.org/faq/28#batch_retrieval_of_entries, although I am not sure this is the best way to go about it.

My question is then how to best automate this using the command line (I have been given the impression that this is possible, but I have no idea as to how to go about doing this specific query).

Example query list:

ACTB_HUMAN
Ceruloplasmin (CF, F)
Complement C1s subcomponent (F)
Complement C3 (F)
Complement C5 (F)
Desmoplakina
Haptoglobin-related protein (F)
Hemopexin (CF)
Inter-alpha-trypsin inhibitor heavy chain H1 (CF, F)
Lactotransferrin
Phosphatidylinositol-glycan-specific phospholipase Da
Plasminogen (F)
Protein AMBP
Serum albumin
Vitamin K-dependent protein S (F)
ARHL1_HUMAN
AFAM_HUMAN
A1AT_HUMAN
ANXA1_HUMAN
ANXA5_HUMAN
APOA1_HUMAN

Example output for the first two (input, then output as UniProt ID):

ACTB_HUMAN P60709
Ceruloplasmin P00450
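
A list like the one above mixes UniProt entry names (such as ACTB_HUMAN) with free-text protein names that carry trailing annotations such as "(CF, F)". As a minimal sketch of a possible preprocessing step, the Python below strips a trailing parenthesised annotation and sorts each query into one of two groups. The function names `clean_query` and `classify` and both regular expressions are illustrative assumptions, not part of any UniProt tooling, and the entry-name pattern is only a heuristic.

```python
import re

# Heuristic (assumption): an entry name looks like MNEMONIC_SPECIES,
# e.g. ACTB_HUMAN, with upper-case letters and digits on both sides.
ENTRY_NAME = re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")

# A single parenthesised group at the very end of the line, e.g. " (CF, F)".
TRAILING_ANNOTATION = re.compile(r"\s*\([^)]*\)\s*$")


def clean_query(raw):
    """Strip surrounding whitespace and one trailing '(...)' annotation."""
    return TRAILING_ANNOTATION.sub("", raw.strip())


def classify(raw):
    """Return (kind, cleaned_term); kind is 'entry_name' or 'protein_name'."""
    term = clean_query(raw)
    kind = "entry_name" if ENTRY_NAME.match(term) else "protein_name"
    return kind, term


if __name__ == "__main__":
    for line in ["ACTB_HUMAN", "Ceruloplasmin (CF, F)", "Serum albumin"]:
        print(classify(line))
```

Footnote-style suffixes fused onto a name (as in "Desmoplakina") cannot be removed safely by a pattern like this and would still need manual checking.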

modified 5.4 years ago • written 5.5 years ago by stenemo88
Devon Ryan (Freiburg, Germany) wrote 5.5 years ago:

If you're using UniProt already, why not just download the ID text file, load it into a database, and then query that (likely with a small script or in R)? That would seem relatively straightforward.

Edit: You could probably even write a bash script to simply grep that file and parse the output.
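
As a rough sketch of that idea in Python rather than bash, the code below builds an in-memory lookup from an ID mapping file and converts a list of queries to accessions. It assumes the file has three tab-separated columns (accession, ID type, identifier), which is how UniProt's idmapping files are usually laid out; the function names `load_id_index` and `convert` and the file name in the usage comment are hypothetical.

```python
from collections import defaultdict


def load_id_index(lines):
    """Map each identifier (upper-cased) to the set of accessions listing it.

    Assumes each line is: accession <TAB> id_type <TAB> identifier.
    Lines that do not have exactly three fields are skipped.
    """
    index = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        accession, _id_type, identifier = fields
        index[identifier.upper()].add(accession)
        # Let an accession resolve to itself as well.
        index[accession.upper()].add(accession)
    return index


def convert(queries, index):
    """Return (query, sorted accessions) pairs; the list is empty on no match."""
    return [(q, sorted(index.get(q.strip().upper(), ()))) for q in queries]


# Usage sketch (file names are assumptions, adjust to the actual downloads):
# with open("HUMAN_9606_idmapping.dat") as fh:
#     index = load_id_index(fh)
# with open("queries.txt") as fh:
#     for query, accessions in convert(fh.read().splitlines(), index):
#         print(query, ",".join(accessions) or "NOT_FOUND", sep="\t")
```

An exact-match lookup like this only resolves identifiers that actually appear in the file. Descriptive protein names such as "Serum albumin" are unlikely to be covered and would still need a name search.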

modified 5.5 years ago • written 5.5 years ago by Devon Ryan

This seems like a good solution. Thank you for linking to the ID text file, which I was unable to find myself.

Reply modified 5.4 years ago • written 5.4 years ago by stenemo88
Elisabeth Gasteiger (Geneva) wrote 5.4 years ago:

Just out of curiosity:

Was the Perl script I sent you via the UniProt helpdesk of any use? The idea was to use Perl LWP with the query

http://www.uniprot.org/uniprot/?query=name%3A%22$query_term%22+organism:9606+reviewed%3Ayes&format=tab&columns=id,entry%20name

(searching in the protein name field of human Swiss-Prot entries) and, as a fallback if there are no results, a full text search in human Swiss-Prot entries:

http://www.uniprot.org/uniprot/?query=$query_term+organism:9606+reviewed%3Ayes&format=tab&columns=id,entry%20name
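
For readers who prefer Python to Perl, the two-step lookup described above might be sketched as follows. The URL patterns are copied from this post and the UniProt web API may have changed since it was written, so treat the endpoint and parameters as assumptions to verify. The function names, the injectable `fetch` argument, and the assumption that the tab output starts with a header row are all illustrative.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint and parameters as quoted in the post (may be outdated).
BASE = "http://www.uniprot.org/uniprot/"
SUFFIX = "+organism:9606+reviewed%3Ayes&format=tab&columns=id,entry%20name"


def name_query_url(term):
    """URL for a search restricted to the protein name field."""
    return f"{BASE}?query=name%3A%22{quote(term)}%22{SUFFIX}"


def fulltext_query_url(term):
    """URL for the fallback full-text search."""
    return f"{BASE}?query={quote(term)}{SUFFIX}"


def parse_tab(text):
    """Split tab-separated output into rows, skipping an assumed header row."""
    lines = [line for line in text.splitlines() if line.strip()]
    return [tuple(line.split("\t")) for line in lines[1:]]


def _http_get(url):
    """Fetch a URL and return the response body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def lookup(term, fetch=_http_get):
    """Try the name-field search first, then the full-text search.

    Returns the rows of the first search that yields any, else an empty list.
    """
    for url in (name_query_url(term), fulltext_query_url(term)):
        rows = parse_tab(fetch(url))
        if rows:
            return rows
    return []
```

Passing a custom `fetch` function makes it possible to test the fallback logic without network access, or to add rate limiting between requests when running through a long query list.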

written 5.4 years ago by Elisabeth Gasteiger

Your solution should be perfect for me. This project is currently on hold, but when I have time I will give it a try.

Reply modified 5.4 years ago • written 5.4 years ago by stenemo88