Question

Convert List With Several Different Gene Identifiers Using Command Line/Programmatically

10.6 years ago

stenemo88 • 0

The question at hand is how to most effectively take a query list (of over 100 entries) containing multiple different gene and/or protein identifiers and convert them all to the same type of identifier.

EDIT 1: I guess all I really need is a table/csv file with one protein on each line, and the different names for that protein in the columns. Then I can easily use a script to extract the information I need.

So far I have tried several conversion websites (http://www.protocol-online.org/prot/Research_Tools/Online_Tools/Sequence_Analysis/Gene_ID_Conversion_Tools lists a few of the best ones), all of which require every entry to be the same type of ID (and require you to know which type you have, which makes for a lot of manual work). For several searches, a few entries remain that these websites cannot resolve but that a UniProt search can find. The best approach I have found is therefore to enter each search term into UniProt manually, take the best Homo sapiens match, and map my query to it. The closest source I have found that may hold an answer is http://www.uniprot.org/faq/28#batch_retrieval_of_entries, although I am not sure this is the best way to go about it.

My question is then how to best automate this using the command line (I have been given the impression that this is possible, but I have no idea as to how to go about doing this specific query).

Example query list:

ACTB_HUMAN
Ceruloplasmin (CF, F)
Complement C1s subcomponent (F)
Complement C3 (F)
Complement C5 (F)
Desmoplakina
Haptoglobin-related protein (F)
Hemopexin (CF)
Inter-alpha-trypsin inhibitor heavy chain H1 (CF, F)
Lactotransferrin
Phosphatidylinositol-glycan-specific phospholipase Da
Plasminogen (F)
Protein AMBP
Serum albumin
Vitamin K-dependent protein S (F)
ARHL1_HUMAN
AFAM_HUMAN
A1AT_HUMAN
ANXA1_HUMAN
ANXA5_HUMAN
APOA1_HUMAN

Example output for the first two (input - output (UniProt ID)):

ACTB_HUMAN P60709
Ceruloplasmin P00450

genetics database uniprot command-line • 3.8k views

ADD COMMENT • link 10.5 years ago by stenemo88 • 0

score 3 · Answer 1 · 2013-09-29

10.6 years ago

Devon Ryan 104k

If you're using UniProt already, why not just download the ID text file, load it into a database, and then just query that (likely with a small script or in R)? That would seem relatively straight-forward.

Edit: You could probably even write a bash script to simply grep that file and parse the output.
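A minimal Python sketch of this lookup idea follows. It assumes the ID file is tab-separated with three columns per row (accession, ID type, identifier); that layout, the function names `load_id_mapping` and `convert`, and the sample rows are illustrative assumptions, so check them against the file you actually download.

```python
from collections import defaultdict


def load_id_mapping(lines):
    """Index rows of 'accession<TAB>id_type<TAB>identifier' by identifier.

    The three-column layout is an assumption about the downloaded file.
    """
    index = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        accession, _id_type, identifier = fields
        # Case-insensitive key so 'actb' and 'ACTB' hit the same entry.
        index[identifier.strip().upper()].add(accession)
    return index


def convert(queries, index):
    """Return (query, sorted accessions) pairs; the list is empty if unmatched."""
    return [(q, sorted(index.get(q.strip().upper(), ()))) for q in queries]


# Illustrative rows only; in practice pass an open file handle instead.
sample = [
    "P60709\tUniProtKB-ID\tACTB_HUMAN",
    "P60709\tGene_Name\tACTB",
    "P00450\tUniProtKB-ID\tCERU_HUMAN",
]
index = load_id_mapping(sample)
for query, accessions in convert(["ACTB_HUMAN", "actb", "Ceruloplasmin"], index):
    print(query, ",".join(accessions) or "NOT_FOUND", sep="\t")
```

Free-text protein names such as "Ceruloplasmin" will not match an identifier table like this one, so unmatched queries are reported rather than dropped and can be handled by a name search afterwards.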

This seems like a good solution, thank you for linking to the ID text file, which I was unable to find myself.

ADD REPLY • link 10.5 years ago by stenemo88 • 0

score 1 · Answer 2 · 2013-10-08

10.6 years ago

Elisabeth Gasteiger ★ 2.4k

Just out of curiosity:

Was the Perl script I sent you via the UniProt helpdesk of any use? The idea was to use Perl LWP with the query

http://www.uniprot.org/uniprot/?query=name%3A%22$query_term%22+organism:9606+reviewed%3Ayes&format=tab&columns=id,entry%20name

(searching in the protein name field of human Swiss-Prot entries) and, as a fallback if there are no results, a full text search in human Swiss-Prot entries:

http://www.uniprot.org/uniprot/?query=$query_term+organism:9606+reviewed%3Ayes&format=tab&columns=id,entry%20name
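The two-step lookup described above (name-field search first, full-text search as fallback) could be sketched in Python roughly as follows. This is not the Perl script mentioned in the answer. The URL pattern is copied from the queries quoted above and may no longer be served in that form; the function names, the injectable `fetch` callable, the canned response, and the assumption that the tab output starts with a header row are all illustrative.

```python
from urllib.parse import quote

# URL pieces copied from the queries quoted above (endpoint may have changed).
BASE = "http://www.uniprot.org/uniprot/"
SUFFIX = "+organism:9606+reviewed%3Ayes&format=tab&columns=id,entry%20name"


def name_query_url(term):
    """Search only the protein name field of human Swiss-Prot entries."""
    return f"{BASE}?query=name%3A%22{quote(term)}%22{SUFFIX}"


def fulltext_query_url(term):
    """Full-text search in human Swiss-Prot entries."""
    return f"{BASE}?query={quote(term)}{SUFFIX}"


def lookup(term, fetch):
    """Try the name search first and fall back to full text if it is empty.

    `fetch` is any callable mapping a URL to response text (hypothetical hook).
    Assumes the first line of a non-empty response is a header row.
    """
    for url in (name_query_url(term), fulltext_query_url(term)):
        body = fetch(url)
        rows = [ln.split("\t") for ln in body.splitlines()[1:] if ln.strip()]
        if rows:
            return rows
    return []


def fake_fetch(url):
    """Canned stand-in for a network call, for demonstration only."""
    if "name%3A" in url:
        return ""  # pretend the name-field search found nothing
    return "Entry\tEntry name\nP02768\tALBU_HUMAN\n"


print(lookup("Serum albumin", fake_fetch))
```

For real requests, `fetch` could be replaced by something like `lambda url: urllib.request.urlopen(url).read().decode()`, ideally with a short pause between queries.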

Your solution should be perfect for me. This project is currently on hold, but when I have time I will give it a try.

ADD REPLY • link 10.5 years ago by stenemo88 • 0