Question: Extracting Sub-cellular location from Uniprot into tabular format
14 months ago by
Bergen, Norway
Michael Dondrup46k wrote:

Hi, here's a question that seems trickier to solve than it first looks. I am trying to convert SwissProt accessions into a tabular format for import into SQL, containing the "best bet" sub-cellular localization of all proteins (one row per (accession, location) pair):

Accession Location Evidence
Q9YH95    Nucleus  Manual

That is, just as shown in the picture on the HTML page (http://www.uniprot.org/uniprot/Q9YH95). Parsing the XML format would be easy; http://www.uniprot.org/uniprot/Q9YH95.xml contains:

<comment type="subcellular location">
  <subcellularLocation>
     <location evidence="1 3">Nucleus</location>
  </subcellularLocation>
</comment>
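
A minimal sketch of how that element could be turned into table rows with Python's standard library. The namespace URI, the wrapper elements in the sample, and the function name `location_rows` are assumptions for illustration; the `evidence` attribute is kept as the raw key string, and mapping those keys to a label such as "Manual" is not attempted here.

```python
import xml.etree.ElementTree as ET

# Assumption: UniProt entry XML declares this default namespace.
NS = {"up": "http://uniprot.org/uniprot"}

# Hand-written, trimmed sample shaped like the Q9YH95 snippet above.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>Q9YH95</accession>
    <comment type="subcellular location">
      <subcellularLocation>
        <location evidence="1 3">Nucleus</location>
      </subcellularLocation>
    </comment>
  </entry>
</uniprot>"""

LOCATION_PATH = (
    "up:comment[@type='subcellular location']"
    "/up:subcellularLocation/up:location"
)


def location_rows(xml_text):
    """Return one (accession, location, evidence_keys) tuple per location."""
    rows = []
    root = ET.fromstring(xml_text)
    for entry in root.findall("up:entry", NS):
        # findtext returns the first <accession>, taken here as the primary one.
        accession = entry.findtext("up:accession", namespaces=NS)
        for loc in entry.findall(LOCATION_PATH, NS):
            rows.append((accession, loc.text, loc.get("evidence", "")))
    return rows


for row in location_rows(SAMPLE):
    print("\t".join(row))
```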

Edit: Should be nicely solved using this XSLT by Pierre: How to map sub-cellular localisation to enteries in uniprot database fasta file.

Not every entry has such a <subcellularLocation> element, though; e.g. http://www.uniprot.org/uniprot/Q96AT9, where http://www.uniprot.org/uniprot/Q96AT9.xml only provides GO terms:

<dbReference type="GO" id="GO:0005829">
   <property type="term" value="C:cytosol"/>
   <property type="evidence" value="ECO:0000318"/>
   <property type="project" value="GO_Central"/>
</dbReference>
<dbReference type="GO" id="GO:0070062">
   <property type="term" value="C:extracellular exosome"/>
   <property type="evidence" value="ECO:0007005"/>
   <property type="project" value="UniProtKB"/>
</dbReference>

Does that mean the way to get the full information is:

  1. Parse the <subcellularLocation> element for those entries that have it.
  2. For the remaining entries, parse the GO terms and select those from the "Cellular Component" ontology (term values prefixed with "C:") using a GO parser?
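
The two steps could be sketched roughly as follows, again with the standard library only. This takes the fallback idea literally (GO terms are consulted only when no curated location exists); the namespace URI, the function name `entry_locations`, and the made-up "P:" term in the sample are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"up": "http://uniprot.org/uniprot"}  # assumed default namespace

# Hand-written sample: the two C: terms come from the Q96AT9 snippet above,
# the P: term is invented only to show that non-component terms are skipped.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>Q96AT9</accession>
    <dbReference type="GO" id="GO:0005829">
      <property type="term" value="C:cytosol"/>
    </dbReference>
    <dbReference type="GO" id="GO:0070062">
      <property type="term" value="C:extracellular exosome"/>
    </dbReference>
    <dbReference type="GO" id="GO:0000000">
      <property type="term" value="P:hypothetical process"/>
    </dbReference>
  </entry>
</uniprot>"""

CURATED_PATH = (
    "up:comment[@type='subcellular location']"
    "/up:subcellularLocation/up:location"
)


def entry_locations(entry):
    """Return (accession, location, source) rows for one <entry> element."""
    acc = entry.findtext("up:accession", namespaces=NS)
    # Step 1: curated subcellular location, if the entry has one.
    curated = entry.findall(CURATED_PATH, NS)
    if curated:
        return [(acc, loc.text, "subcellularLocation") for loc in curated]
    # Step 2: fall back to GO terms whose value starts with "C:".
    rows = []
    for ref in entry.findall("up:dbReference[@type='GO']", NS):
        term = ref.find("up:property[@type='term']", NS)
        if term is not None and term.get("value", "").startswith("C:"):
            rows.append((acc, term.get("value")[2:], ref.get("id")))
    return rows


root = ET.fromstring(SAMPLE)
for row in entry_locations(root.find("up:entry", NS)):
    print("\t".join(row))
```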

It seems it would be best to simply reproduce the code that draws the compartment image. Does anybody have access to that?

Related but not the same: what is the Query to find proteins which Subcellular location have Manually-assigned evidence in uniprot ?

parsing uniprot swissprot • 706 views

Tagging: Elisabeth Gasteiger

written 14 months ago by genomax

We at SIB Swiss-Prot who work on UniProt are offsite; please wait until Thursday ;)

written 13 months ago by me
13 months ago by
Geneva
Elisabeth Gasteiger1.6k wrote:

It looks like you have meanwhile figured out much of the answer. Here is a summary from the UniProt point of view (all SIB employees were out of town for an institutional event for a couple of days, sorry):

The "Subcellular location" section (https://www.uniprot.org/help/subcellular_location_section) in a UniProtKB entry presents

1) annotations that are directly provided by Swiss-Prot biocurators, in the form of a controlled vocabulary (https://www.uniprot.org/locations) complemented by free-text notes (in UniProtKB/TrEMBL, such information can also be present, added by the automatic annotation pipeline, https://www.uniprot.org/help/automatic_annotation). See also https://www.uniprot.org/help/subcellular_location

2) GO terms from the Cellular Component ontology (https://www.uniprot.org/help/gene_ontology)

To be complete, you would indeed have to get data from both sources, as they may be complementary. To filter the UniProtKB annotations by manual evidence, you will need to use our Evidence codes (documented here https://www.uniprot.org/help/evidences, searchable via the advanced search and subsequent re-use of the RESTful URLs), and to filter the GO annotations by evidence, you can use https://www.uniprot.org/help/gene_ontology, also combined with the advanced search and the RESTful URLs it creates.

Please don't hesitate to let us know if you have any additional questions or remarks.

13 months ago by
Bergen, Norway
Michael Dondrup46k wrote:

So I got a solution using SQL. First, it looks like assocdb generated by AmiGO is as close as it gets to what I want. This database associates "termdb (above); all manual gene product annotations; electronic annotations (IEA) from all databases other than UniProtKB".

  1. Download the weekly build as SQL tables from here: http://archive.geneontology.org/latest-lite/go_weekly-assocdb-tables.tar.gz You could also download the complete dump and import it into MySQL, but I wanted to import only the required data and use sqlite instead.

  2. Extract the archive into a local directory.

  3. cd to the local dir and open a new sqlite database:

    sqlite3 celloc.db

At the sqlite prompt, run the following code:

-- create schema for the required tables
-- table definitions are the minimal sqlite compatible definitions derived from the MySQL definitions
DROP TABLE IF EXISTS `association`;
CREATE TABLE `association` (
  `id` int(11) NOT NULL,
  `term_id` int(11) NOT NULL,
  `gene_product_id` int(11) NOT NULL,
  `is_not` int(11) DEFAULT NULL,
  `role_group` int(11) DEFAULT NULL,
  `assocdate` int(11) DEFAULT NULL,
  `source_db_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

DROP TABLE IF EXISTS `term`;
CREATE TABLE `term` (
  `id` int(11) NOT NULL,
  `name` varchar(255) NOT NULL DEFAULT '',
  `term_type` varchar(55) NOT NULL,
  `acc` varchar(255) NOT NULL,
  `is_obsolete` int(11) NOT NULL DEFAULT '0',
  `is_root` int(11) NOT NULL DEFAULT '0',
  `is_relation` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
);

DROP TABLE IF EXISTS `gene_product`;
CREATE TABLE `gene_product` (
  `id` int(11) NOT NULL,
  `symbol` varchar(128) NOT NULL,
  `dbxref_id` int(11) NOT NULL,
  `species_id` int(11) DEFAULT NULL,
  `type_id` int(11) DEFAULT NULL,
  `full_name` text,
  PRIMARY KEY (`id`)
);

DROP TABLE IF EXISTS `dbxref`;
CREATE TABLE `dbxref` (
  `id` int(11) NOT NULL,
  `xref_dbname` varchar(55) NOT NULL,
  `xref_key` varchar(255) NOT NULL,
  `xref_keytype` varchar(32) DEFAULT NULL,
  `xref_desc` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

-- the GO table dumps are tab-separated; this sets the separator for .import and for query output
.separator "\t"

-- import the data from the table flat files
.import term.txt term
.import gene_product.txt gene_product
.import association.txt association
.import dbxref.txt dbxref

-- not required, but speeds up further queries
DELETE FROM dbxref WHERE xref_dbname != 'UniProtKB';
DELETE FROM term where term_type != 'cellular_component';
VACUUM;



-- generate a materialized view
DROP TABLE IF EXISTS uniprot_cellular_localization;
CREATE TABLE uniprot_cellular_localization AS 
 SELECT DISTINCT dbxref.xref_key AS accession, gene_product.symbol, term.name, term.acc
 FROM gene_product
 INNER JOIN  association ON gene_product.id = association.gene_product_id
 INNER JOIN term  ON term.id = association.term_id
 INNER JOIN dbxref ON gene_product.dbxref_id = dbxref.id
 WHERE term.term_type = 'cellular_component';

.headers on

SELECT * FROM uniprot_cellular_localization WHERE accession IN ( 'Q96AT9', 'Q9YH95') ;

-- output:

accession  symbol  name                   acc
Q96AT9     RPE     cytosol                GO:0005829
Q96AT9     RPE     extracellular exosome  GO:0070062
Q9YH95     pax5    nucleus                GO:0005634
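
For anyone who wants to check the join logic before downloading the GO dump, here is a small sketch that runs the same SELECT against toy tables in an in-memory SQLite database via Python's `sqlite3` module. The table definitions are cut down to the columns the query touches, and all row ids are made up; only the accession, symbol, and GO terms mirror the output above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy schema: only the columns used by the join; ids are arbitrary.
conn.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, name TEXT, term_type TEXT, acc TEXT);
CREATE TABLE dbxref (id INTEGER PRIMARY KEY, xref_dbname TEXT, xref_key TEXT);
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT, dbxref_id INTEGER);
CREATE TABLE association (id INTEGER PRIMARY KEY, term_id INTEGER, gene_product_id INTEGER);

INSERT INTO term VALUES
  (1, 'nucleus', 'cellular_component', 'GO:0005634'),
  (2, 'DNA binding', 'molecular_function', 'GO:0003677');
INSERT INTO dbxref VALUES (10, 'UniProtKB', 'Q9YH95');
INSERT INTO gene_product VALUES (100, 'pax5', 10);
INSERT INTO association VALUES (1000, 1, 100), (1001, 2, 100);
""")

# Same join as in the answer; the molecular_function row must be filtered out.
conn.executescript("""
DROP TABLE IF EXISTS uniprot_cellular_localization;
CREATE TABLE uniprot_cellular_localization AS
 SELECT DISTINCT dbxref.xref_key AS accession, gene_product.symbol, term.name, term.acc
 FROM gene_product
 INNER JOIN association ON gene_product.id = association.gene_product_id
 INNER JOIN term ON term.id = association.term_id
 INNER JOIN dbxref ON gene_product.dbxref_id = dbxref.id
 WHERE term.term_type = 'cellular_component';
""")

rows = conn.execute("SELECT * FROM uniprot_cellular_localization").fetchall()
for row in rows:
    print("\t".join(row))
```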

Can we not parse the output of third-party services like TogoWS (the JSON is too large to paste here)? That is, parse the GO cross-references and extract the "C" (cellular component) terms:

http://togows.org/entry/ebi-uniprot/Q96AT9/dr.json

written 13 months ago by cpad0112

The associations are unfortunately incomplete too. An example: https://www.uniprot.org/uniprot/Q7Q6R1 has only automatic IEA GO annotations, which are omitted by AmiGO, so nothing is found; a manual annotation nevertheless exists in the UniProt entry of this protein. We will likely need "How to map sub-cellular localisation to enteries in uniprot database fasta file" in addition. However, applying xsltproc to the 6 GB XML file from Swiss-Prot hits the wall:

 zcat uniprot_sprot.xml.gz |  xsltproc transform.xsl -
 killed

Running the same on the server yields a file with 496192 lines, after the process had grown to a memory size of 80 GB.

grep -e "Q7Q6R1" sprot_cl.txt
Q7Q6R1  Cell membrane
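
One way around the memory growth might be to stream the file instead of loading it whole. The sketch below uses `xml.etree.ElementTree.iterparse` and clears the tree after each finished entry, so memory should stay roughly bounded by the largest single entry; it has not been benchmarked on the full Swiss-Prot file. The namespace URI, the function name `stream_locations`, and the two-entry sample are assumptions for illustration.

```python
import gzip
import io
import xml.etree.ElementTree as ET

UP = "{http://uniprot.org/uniprot}"  # assumed namespace, in Clark notation

# Hand-written two-entry sample standing in for uniprot_sprot.xml.
SAMPLE = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>Q7Q6R1</accession>
    <comment type="subcellular location">
      <subcellularLocation><location>Cell membrane</location></subcellularLocation>
    </comment>
  </entry>
  <entry>
    <accession>Q9YH95</accession>
    <comment type="subcellular location">
      <subcellularLocation><location>Nucleus</location></subcellularLocation>
    </comment>
  </entry>
</uniprot>"""


def stream_locations(fileobj):
    """Yield (accession, location) pairs, processing one <entry> at a time."""
    context = ET.iterparse(fileobj, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != UP + "entry":
            continue
        acc = elem.findtext(UP + "accession")
        for comment in elem.findall(UP + "comment"):
            if comment.get("type") != "subcellular location":
                continue
            for loc in comment.findall(UP + "subcellularLocation/" + UP + "location"):
                yield acc, loc.text
        root.clear()  # drop finished entries so the tree does not keep growing


# Demo on an in-memory gzip stream; for the real file one would pass
# gzip.open("uniprot_sprot.xml.gz", "rb") instead.
with gzip.open(io.BytesIO(gzip.compress(SAMPLE)), "rb") as fh:
    for acc, loc in stream_locations(fh):
        print(acc, loc, sep="\t")
```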
modified 13 months ago • written 13 months ago by Michael Dondrup