Question: How To Retrieve The Crossreferences To Other Databases From Pubchem Compounds
2
7.2 years ago by
Pablacious610
Cambridge, UK
Pablacious610 wrote:

I have a list of nearly 10,000 PubChem compound identifiers, and I want to retrieve the cross-references that PubChem holds for those compounds to other databases (like ChEBI, ChEMBL, ChemSpider, LipidMaps, EINECS, etc). For instance:

http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=9891771

Has cross references for ChEMBL, ChEBI and LipidMaps (which can be seen in the sections "Depositor Supplied Synonyms" within "Identification and Related Records" and in "Substance Categorization Classification" within "Classification").

I have tried with the ASN.1 download, the SDF (which doesn't include these fields in the mol annotation), the web service and the download facility without much success. Maybe I'm doing something wrong with the web service.

If anyone knows how to do this or has achieved it, I would really appreciate some help.


Easily done with the CACTVS toolkit (www.xemistry.com/academic has a free version for academic use).

Script snippet:

foreach cid $cidlist {
    set eh [ens create $cid]
    if {![catch {ens get $eh E_CHEBI_ID} id]} { puts "ChEBI: $id" }
    # Same for other identifiers of interest. The only one from your list currently not supported is LipidMaps (I'll add it). ChemSpider and EINECS require version 3.395 because their query interface has once more morphed.
    ens delete $eh
}

The code performs a fresh lookup on the reference databases, so it does not require registration of the structures at PubChem.

written 7.2 years ago by Wolf Ihlenfeldt
3
7.2 years ago by
Michael Kuhn5.0k
EMBL Heidelberg
Michael Kuhn5.0k wrote:

The trick is to look at PubChem Substance: In your case, this retrieves 8 source substances. For each substance, you can see the data source with the associated external id. The same data is contained in the PubChem Substance download files, together with the PubChem compound id.

This only works if the databases you care about actively deposit their compounds in PubChem. E.g. AFAIK it won't work for CAS.

modified 7.2 years ago • written 7.2 years ago by Michael Kuhn
1
7.2 years ago by
Maastricht
Egon Willighagen5.2k wrote:

You can also use Bio2RDF for discovering links (which you can easily automate), by following the http://bio2rdf.org/bio2rdf_resource:linkedToFrom, http://bio2rdf.org/bio2rdf_resource:xRef, and http://www.w3.org/2002/07/owl#sameAs links recursively.

For example, follow the :linkedToFrom for:

http://bio2rdf.org/page/pubchem:7847069

1
7.2 years ago by
Pablacious610
Cambridge, UK
Pablacious610 wrote:

For future reference, this is the detailed procedure that I followed. I used NCBI's EUtils web services.

The first step was to submit a POST request using ELink, like in this example (Java, using Jersey as HTTP client):

(baseURL is always: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/)

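// Jersey 1.x client; submitPost is the poster's own helper (not shown), presumably POSTing the form parameters.
WebResource webRes = client.resource(baseURL + "elink.fcgi");
MultivaluedMap<String, String> queryParams = new MultivaluedMapImpl();
queryParams.add("dbfrom", "pccompound");
queryParams.add("db", "pcsubstance");
queryParams.add("linkname", "pccompound_pcsubstance_same");
// One "id" parameter per CID keeps the per-compound associations in the response.
for (String id : dbFromIds) { // dbFromIds is a list of PubChem CIDs
    queryParams.add("id", id);
}
ClientResponse resp = submitPost(webRes, queryParams);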
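Since submitPost is not shown, here is a rough, dependency-free sketch of what such a request could look like. It swaps Jersey for the JDK's own java.net.http classes (Java 11+), and the class and method names (ELinkRequest, formBody, build) are made up for illustration; only the parameter names and values come from the snippet above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not the original poster's code.
final class ELinkRequest {
    /** Form-encodes the ELink parameters, repeating "id" once per CID. */
    static String formBody(List<String> cids) {
        StringBuilder body = new StringBuilder(
            "dbfrom=pccompound&db=pcsubstance&linkname=pccompound_pcsubstance_same");
        for (String cid : cids) {
            body.append("&id=").append(URLEncoder.encode(cid, StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    /** Builds (but does not send) the POST request for elink.fcgi. */
    static HttpRequest build(String baseUrl, List<String> cids) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "elink.fcgi"))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(formBody(cids)))
            .build();
    }
}
```

The request could then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()).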

You get an XML response like the one returned by this equivalent GET request:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pccompound&id=2906&id=100&db=pcsubstance&linkname=pccompound_pcsubstance_same

With the POST version you can submit up to 5000 compound IDs at once. The response is an XML document from which you need to extract the compound CID to substance SID associations (when you submit several IDs in the POST form, you don't lose the compound-substance associations, as shown in the example). You could change the linkname parameter to one of the other available flavours, but I wanted the same structures.
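Extracting the CID to SID associations could look roughly like the sketch below. It assumes the usual ELink layout (one LinkSet per submitted id, with the query CID under IdList and the linked SIDs under Link elements) and that only one linkname was requested; the class name ELinkParser is made up, and the SIDs in any test data you feed it are up to you.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not the original poster's code.
final class ELinkParser {
    /** Maps each submitted CID to the SIDs listed in its LinkSet. */
    static Map<String, List<String>> cidToSids(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real responses carry a DOCTYPE; do not fetch the external DTD.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            Map<String, List<String>> result = new LinkedHashMap<>();
            NodeList linkSets = doc.getElementsByTagName("LinkSet");
            for (int i = 0; i < linkSets.getLength(); i++) {
                Element linkSet = (Element) linkSets.item(i);
                Element idList = (Element) linkSet.getElementsByTagName("IdList").item(0);
                if (idList == null || idList.getElementsByTagName("Id").getLength() == 0) {
                    continue; // no query id in this LinkSet
                }
                String cid = idList.getElementsByTagName("Id").item(0)
                    .getTextContent().trim();
                // Collect every linked SID; a CID without links gets an empty list.
                List<String> sids = new ArrayList<>();
                NodeList links = linkSet.getElementsByTagName("Link");
                for (int j = 0; j < links.getLength(); j++) {
                    Element link = (Element) links.item(j);
                    sids.add(link.getElementsByTagName("Id").item(0)
                        .getTextContent().trim());
                }
                result.put(cid, sids);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse ELink response", e);
        }
    }
}
```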

Then, for groups of 5000 substance ids (SIDs, in the pubchemSubstanceIds list), you make a submission to the EPost application:

WebResource epostWebRes = client.resource(baseURL + "epost.fcgi");
MultivaluedMap<String, String> queryParamsEPost = new MultivaluedMapImpl();
queryParamsEPost.add("db", "pcsubstance");
// EPost takes the SIDs as a single comma-separated "id" value.
queryParamsEPost.add("id", StringUtils.join(pubchemSubstanceIds, ","));
ClientResponse respEpost = submitPost(epostWebRes, queryParamsEPost);
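Reading the EPost response could be sketched as follows. The response is a small XML document with a QueryKey and a WebEnv element; the EPostResult class and its getters are a guess at what the poster's own (unshown) result object might look like, not the actual code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical value class, not the original poster's code.
final class EPostResult {
    private final String queryKey;
    private final String webEnv;

    private EPostResult(String queryKey, String webEnv) {
        this.queryKey = queryKey;
        this.webEnv = webEnv;
    }

    String getQueryKey() { return queryKey; }
    String getWebEnv() { return webEnv; }

    /** Extracts QueryKey and WebEnv from the body of an EPost response. */
    static EPostResult parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real responses carry a DOCTYPE; do not fetch the external DTD.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            String key = doc.getElementsByTagName("QueryKey").item(0)
                .getTextContent().trim();
            String env = doc.getElementsByTagName("WebEnv").item(0)
                .getTextContent().trim();
            return new EPostResult(key, env);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse EPost response", e);
        }
    }
}
```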

From the response, you obtain two values, a WebEnv and a query_key, which you can use with ESummary:

// epostRes is assumed to hold the query_key and WebEnv parsed from the EPost response (parsing not shown).
WebResource webRes = client.resource(baseURL + "esummary.fcgi");
MultivaluedMap<String, String> queryParams = new MultivaluedMapImpl();
queryParams.add("db", "pcsubstance");
queryParams.add("query_key", epostRes.getQueryKey());
queryParams.add("WebEnv", epostRes.getWebEnv());
ClientResponse resp = submitPost(webRes, queryParams);

This last response is again XML, from which you can parse the names, synonyms, source identifier (the identifier in the external database) and source name (the database name) for each submitted PubChem substance ID. The source identifier and source name together give you a cross-reference. The synonyms can also contain identifiers from databases that don't deposit directly to PubChem (like HSDB or EINECS, as Michael Kuhn pointed out).
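A generic way to pull those fields out of the ESummary XML might look like the sketch below. It assumes the classic DocSum layout (an Id element followed by Item elements with a Name attribute, where list-valued items nest further Item elements). The class name ESummaryParser is made up, and the item names you would look up afterwards (for example "SourceNameList", "SourceID" or "SynonymList") are assumptions that should be checked against a real pcsubstance response.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not the original poster's code.
final class ESummaryParser {
    /** Returns, per DocSum Id, a map from item name to its value(s). */
    static Map<String, Map<String, List<String>>> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real responses carry a DOCTYPE; do not fetch the external DTD.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            Map<String, Map<String, List<String>>> result = new LinkedHashMap<>();
            NodeList docSums = doc.getElementsByTagName("DocSum");
            for (int i = 0; i < docSums.getLength(); i++) {
                String id = null;
                Map<String, List<String>> fields = new LinkedHashMap<>();
                for (Element child : childElements((Element) docSums.item(i))) {
                    if (child.getTagName().equals("Id")) {
                        id = child.getTextContent().trim();
                    } else if (child.getTagName().equals("Item")) {
                        List<String> values = new ArrayList<>();
                        List<Element> nested = childElements(child);
                        if (nested.isEmpty()) {
                            values.add(child.getTextContent().trim()); // scalar item
                        } else {
                            for (Element n : nested) {                 // list item
                                values.add(n.getTextContent().trim());
                            }
                        }
                        fields.put(child.getAttribute("Name"), values);
                    }
                }
                if (id != null) {
                    result.put(id, fields);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse ESummary response", e);
        }
    }

    /** Direct child elements only, skipping text and comment nodes. */
    private static List<Element> childElements(Element parent) {
        List<Element> out = new ArrayList<>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n instanceof Element) {
                out.add((Element) n);
            }
        }
        return out;
    }
}
```

Pairing the source name with the source identifier for each SID, and joining that back to the CID via the ELink associations, would then give the cross-reference table.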

Keep in mind that, according to the EUtils rules, you shouldn't make requests at intervals of less than 3 seconds. Even so, processing 14,000 PubChem CIDs took approximately an hour (and that included writing a Lucene index with the results).
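The batching and the pause between requests could be handled along these lines. This is only a sketch: the class and method names (ThrottledBatches, partition, forEachBatch) are made up, and the batch size and pause are parameters so that the 5000 and 3 seconds mentioned above can be passed in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not the original poster's code.
final class ThrottledBatches {
    /** Splits items into consecutive batches of at most `size` elements. */
    static <T> List<List<T>> partition(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            batches.add(new ArrayList<>(
                items.subList(i, Math.min(i + size, items.size()))));
        }
        return batches;
    }

    /** Runs `action` on each batch, sleeping `pauseMillis` between batches. */
    static <T> void forEachBatch(List<T> items, int size, long pauseMillis,
                                 Consumer<List<T>> action) {
        boolean first = true;
        for (List<T> batch : partition(items, size)) {
            if (!first) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while throttling", e);
                }
            }
            first = false;
            action.accept(batch);
        }
    }
}
```

For example, forEachBatch(pubchemSubstanceIds, 5000, 3000L, batch -> ...) would run the EPost and ESummary calls for one batch at a time; if each batch makes more than one request, a pause is needed between those requests as well.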

0
7.2 years ago by
Wolf Ihlenfeldt40 wrote:

Easily done with the CACTVS toolkit (www.xemistry.com/academic has a free version for academic use). Script snippet:

foreach cid $cidlist { 
    set eh [ens create $cid] 
    if {![catch {ens get $eh E_CHEBI_ID} id]} { puts "ChEBI: $id" } 
    # same for other identifiers of interest, the only one from your list currently not supported is LipidMaps (I'll see that I can add it), ChemSpider and EINECS IDs require toolkit version 3.395 because their query interface once more has morphed 
    ens delete $eh
}

The sample code performs a fresh lookup at the reference databases, so it does not require registration of the structures at PubChem. 10K cpds will take a while (but it can be scripted multi-threaded if it is urgent, and you want to code a little bit more).

If you want to analyse what is in the PubChem substance records, here is another approach:

foreach cid $cidlist {
   set eh [ens create $cid]
   foreach sid [ens get $eh E_SIDSET] {
      set eh2 [ens create SID$sid]
      echo [ens get $eh2 E_NCBI_SUBSTANCE_SOURCE(db)]
      ens delete $eh2
   }
   ens delete $eh
}

This version only contacts PubChem for cid and sid resolution.

0
7.2 years ago by
Anon10
Anon10 wrote:

Use the Identifier Exchange Service...
