Question: How To Find Sequences For Protein Names (A Challenge)
5
gravatar for Simon Cockell
8.5 years ago by
Simon Cockell7.3k
Newcastle
Simon Cockell7.3k wrote:

I came across an interesting problem yesterday, and, while I solved it in my own way, I thought it would make an interesting challenge for you good folks.

In a situation where you have only the name for a protein in UniProt (rather than the accession), it is perfectly possible to visit a URL such as: http://www.uniprot.org/uniprot/OPSD_HUMAN and the UniProt website will redirect you to http://www.uniprot.org/uniprot/P08100, which is, obviously, the UniProt entry for the protein concerned.

If you have a list of such names, and you want to get FASTA sequences for them, to me the obvious thing to do would be to go to http://www.uniprot.org/uniprot/OPSD_HUMAN.fasta, which in light of the previous behaviour, I would expect to redirect to http://www.uniprot.org/uniprot/P08100.fasta. Unfortunately this is not the case, the .fasta URL also redirects to the UniProt entry for a protein, not to the sequence.

I would imagine it is a one-line fix to "correct" this behaviour, and I was wondering if the Biostar-ers could come up with a solution of a similar length (rather than my crufty Python solution).

So, to clarify - a way to retrieve FASTA sequences for a list of UniProt protein names. An example list can be downloaded from here.

Some rules - there are no rules - any language, any library, any tool. I'm very interested in how broad a range of solutions we can produce. Obviously there is no correct answer, so the highest scoring response by the end of the week (i.e. 5pm Friday UK time, or as close to) gets the accepted answer bonus.

sequence uniprot conversion • 5.3k views
ADD COMMENTlink modified 5.3 years ago by Biostar ♦♦ 20 • written 8.5 years ago by Simon Cockell7.3k
8
gravatar for Mary
8.5 years ago by
Mary11k
Boston MA area
Mary11k wrote:

Ok, I'll do mine in English:

  1. go to UniProt.org.
  2. click tab "retrieve"
  3. Paste list into text box. Click Retrieve button.
  4. On results page, click FASTA download [ Download (30 KB*) | Open ] (Or you could click open just to have a look).

I'm done.

ADD COMMENTlink modified 8.5 years ago • written 8.5 years ago by Mary11k
1

Your "using an already available webform" solution of course wins against my script ;-)

ADD REPLYlink written 8.5 years ago by Michael Schubert6.9k

Well, I can get mine down to 95 characters: Go to uniprot.org, click Retrieve tab. Paste list in box, click Retrieve button. Choose output. [?] And, may I add, that mine outputs as web interface, FASTA, GFF, Flat text, xml, rdf/xml, or a list of the corresponding IDs.

ADD REPLYlink written 8.5 years ago by Mary11k

Heh. Yeah, but that wasn't ruled out! There were no rules--it said any tool! You didn't have to write the platform you are using, right? Kidding. I'm just trying to be the voice of a non-coder here. If this site is to be broadly useful to people, scripts may not always be the only answer.

ADD REPLYlink written 8.5 years ago by Mary11k

Scripts are the only answer. But in your case, one of the uniprot devs did the writing ;-)

ADD REPLYlink written 8.5 years ago by Michael Schubert6.9k

yep, it's always a code or script that does the work, whether already done in a web-based tool or newly created. But, no use in reinventing the wheel (unless it's a better wheel of course) :D.

ADD REPLYlink written 8.5 years ago by Treylathe940

ROFL. Yeah, no one would ever want to reuse anyone else's existing code...wait a minute.... (PS: I can't believe that my answer has a positive value score. I expected to get downrated for this answer!)

ADD REPLYlink written 8.5 years ago by Mary11k
5
gravatar for Michael Schubert
8.5 years ago by
Cambridge, UK
Michael Schubert6.9k wrote:

Quite short (128 characters) and simple in bash:

for N in $NAMES; do curl -D o http://www.uniprot.org/uniprot/$N && \
wget `egrep -o http[\.\:\/a-zA-Z0-9]+ o`.fasta && rm o; done

$NAMES needs to be an array of names, can be replaced with cat [?]filename[?] (in backticks) if names are there.

edit: the shortest way I can come up with is 109 characters:

U="www.uniprot.org/uniprot/";for N in $NAMES;do curl -D o $U$N;\
wget `egrep -o $U[0-9A-Z]+ o`.fasta;rm o;done
ADD COMMENTlink modified 8.5 years ago • written 8.5 years ago by Michael Schubert6.9k
4
gravatar for Neilfws
8.5 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

A couple of Bioperl and Bioruby solutions.

1. Bioperl

If you install the utility scripts, there's one named bp_fetch which you can run as:

bp_fetch net::genpept:OPSD_HUMAN
# result
>OPSD_HUMAN RecName: Full=Rhodopsin; AltName: Full=Opsin-2.
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP
EGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAI
YNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA

Although obviously that fetches from GenPept, rather than UniProt.

The more detailed Bioperl solution:

use Bio::DB::SwissProt;    
use Bio::SeqIO;
use strict;
my $sp = Bio::DB::SwissProt->new('-servertype'   => 'expasy',
                                 '-hostlocation' => 'australia');

my $seq = $sp->get_Seq_by_id('OPSD_HUMAN');
my $outfile = Bio::SeqIO->new(-file => ">OPSD_HUMAN.fa", -format => 'fasta');
$outfile->write_seq($seq);

2. Bioruby

Bioruby also comes with a utility called br_biofetch.rb. It runs like this:

br_biofetch.rb -s http://www.ebi.ac.uk/cgi-bin/dbfetch swissprot OPSD_HUMAN raw fasta
# result
>sp|P08100|OPSD_HUMAN Rhodopsin;
MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP
EGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAI
YNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA

As before, you can also write scripts around the module, Bio::Fetch, as described here. Writing the loops to fetch multiple queries should be easy enough.

ADD COMMENTlink modified 8.5 years ago • written 8.5 years ago by Neilfws48k
4
3
gravatar for Heikki
8.5 years ago by
Heikki360
Heikki360 wrote:

It is good to remember that UniProt maintains up-to-date mappings between not only between AC and ID but also a large number of of other databases. See the README file and rest of the contents of this directory:

ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/idmapping

These mappings are easy to tie to any script fetching the entries.

ADD COMMENTlink written 8.5 years ago by Heikki360

This is how I did it :)

ADD REPLYlink written 8.5 years ago by Simon Cockell7.3k

+1 for ID mapping

ADD REPLYlink written 8.5 years ago by Khader Shameer18k
3
gravatar for Pierre Lindenbaum
8.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

using awk and mysql (uniprot db at UCSC):

curl -s "http://bioinf1.ncl.ac.uk/biostar/protein.list" |\
egrep -v '^$' |\
awk 'BEGIN { printf("select concat(\">\",P.acc,\":\",D.val,char(10),P.val) from protein as P, displayId as D where D.acc=P.acc and D.val in (\"_IGNORE_\"\n",$1);} { printf(",\"%s\"\n",$1); } END { printf(");\n");}' |\
mysql  -N -h  genome-mysql.cse.ucsc.edu -A -u genome -D uniProt -r

>A4L9L3:A4L9L3_CAVPO
MNGTEGENFYIPFSNATGVVRSPFEYPQYYLAEPWQFSILAAYMFMLIVLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVANLFMVLGGFTTTLYTSMNGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVVFTWIMALACAAPPLVGWSRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWVPYASVAAYIFTHQGSNFGPIFMTVPAFFAKSSSIYNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASTTVSKTETSQVAPA
>P02699:OPSD_BOVIN
MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFLLIMLGFPINFLTLYVTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLGGEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIPEGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAVYNPVIYIMMNKQFRNCMVTTLCCGKNPLGDDEASTTVSKTETSQVAPA
>Q6W3E1:OPSD_CALPD
MNGTEGPNFYVPFSNKTGVVRSPFEEPQYYLAEPWQFSCLAAYMFMLIVLGFPINFLTLYVTIQHKKLRTPLNYILLNLAIADLFMVFGGFTTTLYTSLHGYFVFGPTGCDLEGFFATLGGEIALWSLVVLAIERYIVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIPEGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMVVIFFCYGQLVFTVKEAAAQQQESATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPILMTLPAFFAKTSAVYNPVIYIMLNKQFRTCMLTTLCCGKIPLGDDEASATASKTETSQVAPA
(...)
ADD COMMENTlink written 8.5 years ago by Pierre Lindenbaum119k
3
gravatar for Pierre Lindenbaum
8.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

My second solution: using TOGOWS/REST

Nucleic Acids Res. 2010 Jul 1;38 Suppl:W706-11. Epub 2010 May 14. TogoWS: integrated SOAP and REST APIs for interoperable bioinformatics Web services. Katayama T, Nakao M, Takagi T. http://www.ncbi.nlm.nih.gov/pubmed/20472643

for ACC in ` curl -s "http://bioinf1.ncl.ac.uk/biostar/protein.list" | egrep -v '^$' `
do
   curl "http://togows.dbcls.jp/entry/uniprot/${ACC}.fasta"
done

>sp|P02699|OPSD_BOVIN Rhodopsin;
MNGTEGPNFYVPFSNKTGVVRSPFEAPQYYLAEPWQFSMLAAYMFLLIMLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNLEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIP
EGMQCSCGIDYYTPHEETNNESFVIYMFVVHFIIPLIVIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSDFGPIFMTIPAFFAKTSAV
YNPVIYIMMNKQFRNCMVTTLCCGKNPLGDDEASTTVSKTETSQVAPA
>sp|Q6W3E1|OPSD_CALPD Rhodopsin;
MNGTEGPNFYVPFSNKTGVVRSPFEEPQYYLAEPWQFSCLAAYMFMLIVLGFPINFLTLY
VTIQHKKLRTPLNYILLNLAIADLFMVFGGFTTTLYTSLHGYFVFGPTGCDLEGFFATLG
GEIALWSLVVLAIERYIVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLVGWSRYIP
EGMQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMVVIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWLPYAGVAFYIFTHQGSNFGPILMTLPAFFAKTSAV
YNPVIYIMLNKQFRTCMLTTLCCGKIPLGDDEASATASKTETSQVAPA
>sp|P32308|OPSD_CANFA Rhodopsin;
MNGTEGPNFYVPFSNKTGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY
VTVQHKKLRTPLNYILLNLAVADLFMVFGGFTTTLYTSLHGYFVFGPTGCNVEGFFATLG
GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP
EGMQCSCGIDYYTLKPEINNESFVIYMFVVHFAIPMIVIFFCYGQLVFTVKEAAAQQQES
ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSDFGPIFMTLPAFFAKSSSI
YNPVIYIMMNKQFRNCMITTLCCGKNPLGDDEASASASKTETSQVAPA
ADD COMMENTlink written 8.5 years ago by Pierre Lindenbaum119k
2
gravatar for Brad Chapman
8.5 years ago by
Brad Chapman9.4k
Boston, MA
Brad Chapman9.4k wrote:

You can get the mappings directly using the UniProt REST interface. This is a bit more general than a URL modification since you can make use of the mapping API to go from any input identifer to the UniProt reference. Here's a Clojure script that fetches matching accessions for any number of identifiers on the commandline, and then writes them out as FASTA files:

(comment "
  Retrieve UniProt FASTA files for provided input identifiers using REST API.
    http://www.uniprot.org/faq/28
  Usage:
    cljr run fetch_uniprot_fasta.clj [Any number of Uniprot IDs]
")

(ns biostar.uniprot
  (:import [java.net URLEncoder])
  (:use [clojure.contrib.duck-streams :only (with-out-writer)])
  (:require [clojure.contrib.http.agent :as http]
            [clojure.contrib.str-utils2 :as str]))

(defn- url-encode [v] (URLEncoder/encode (str v) "utf-8"))

(defn id-to-acc [uid]
  "Map a named identifier to accessions using UniProt REST API."
  (let [map-url "http://www.uniprot.org/mapping/"
        params {:from "ID" :to "ACC" :format "tab" :query uid}
        param-str (str/join "&" (for [[k v] params] 
                                  (str (name k) "=" (url-encode v))))
        cur-url (format "%s?%s" map-url param-str)
        result (http/string (http/http-agent cur-url))]
    (for [line (rest (str/split-lines result))]
      (second (str/split line #"\t")))))

(defn id-to-fasta [uid]
  "Get FASTA from UniProt REST API, removing duplicate IDs with same record."
  (let [fasta-url "http://www.uniprot.org/uniprot/"
        get-fasta (fn [acc]
                    (let [acc-url (format "%s%s.fasta" fasta-url acc)]
                      (http/string (http/http-agent acc-url))))]
    (distinct (for [acc (id-to-acc uid)] (get-fasta acc)))))

(defn write-as-fasta [uid]
  "Write out FASTA records for a UniProt identifier."
  (let [out-file (format "%s.fa" uid)]
    (with-out-writer out-file
      (doseq [fa (id-to-fasta uid)]
        (println fa)))))

(when *command-line-args*
  (doseq [uid *command-line-args*]
    (write-as-fasta uid))
  (shutdown-agents))

Use the cljr Clojure REPL to run it:

cljr run fetch_uniprot_fasta.clj OPSD_HUMAN TEST_HUMAN
ADD COMMENTlink written 8.5 years ago by Brad Chapman9.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1356 users visited in the last hour