Parsing PubChem Compound Records
12.2 years ago

Can anyone recommend some methods for parsing data from PubChem Compound records? I can get a complete dump of the database from the PubChem FTP site.

The data is available in ASN.1, SDF, and XML formats.

For demonstration purposes, imagine that I want to reproduce a subset of the information displayed for a particular drug on the website. For example the record here: Sunitinib.

More specifically, imagine that for this CID (5329102), I want to determine the drug name, the names listed under 'also known as', and the 'Depositor-Supplied Synonyms'.

I ultimately want to be able to perform this kind of query for every record in PubChem, not just that one.

It sounds like the PubChem Power User Gateway (PUG) might be helpful? If so, can someone provide a description of how I would get going on the example problem I outlined?

xml pubchem database api • 10k views

I'm having great difficulty relating the CID (5329102) to any file in the FTP site. That would be my starting point.


Hi Malachi, you asked this question three years ago but updated it a few weeks ago. Can you tell us how you solved the problem in the end? Apparently you're still working on it. Thanks!


Yes, I was also having trouble immediately relating records viewed on the web with info in the FTP site...


How did you deal with this problem in the end? I want to get the Therapeutic Uses and the Pharmacology and Biochemistry sections for some CIDs from PubChem. Thank you.

12.2 years ago
Rajarshi Guha ▴ 880

You can't really go via the FTP site unless you're planning to mirror the entire database: they changed the layout, so you can no longer determine which file contains the range of CIDs you're interested in.

It's better to go via PUG, which is horrifically painful but really the only practical way to query unless you mirror the collection yourself. See http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html for examples of the XML requests and responses.


Thanks! May I ask, what are your thoughts on mirroring the whole site?

12.2 years ago
Rich Apodaca ▴ 170

"I ultimately want to be able to perform these kind of queries for every record in PubChem, not just that one."

The way I've done it:

  1. Create a local PubChem mirror and (optionally) update it daily. I used the gzipped SDF format (sdfgz).
  2. Develop a set of criteria for screening the PubChem data set, implemented as a record filter.
  3. Iterate over the records one at a time using a tool that can work with the raw sdfgz files as if the entire set were one big SDF file. For example, I developed such a tool as part of an earlier project. Filter out the records that you don't care about.
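The three steps above can be sketched in Python. This is a minimal sketch rather than the tool mentioned in step 3: it assumes the mirror consists of gzipped SDF files whose records end with a `$$$$` line and carry the standard PubChem data tags such as `PUBCHEM_COMPOUND_CID` and `PUBCHEM_IUPAC_NAME`. The function names and the example filter `has_iupac_name` are hypothetical.

```python
import gzip
from typing import BinaryIO, Callable, Dict, Iterable, Iterator, Union


def parse_sdf_record(text: str) -> Dict[str, str]:
    """Collect the '> <TAG>' data items of one SDF record into a dict."""
    fields: Dict[str, str] = {}
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            tag = line[line.index("<") + 1 : line.rindex(">")]
            i += 1
            values = []
            # A data item runs until the first blank line.
            while i < len(lines) and lines[i].strip():
                values.append(lines[i])
                i += 1
            fields[tag] = "\n".join(values)
        i += 1
    return fields


def iter_sdf_records(sources: Iterable[Union[str, BinaryIO]]) -> Iterator[str]:
    """Yield raw records from many .sdf.gz files as if they were one big SDF."""
    for source in sources:  # a path or an already opened binary file object
        with gzip.open(source, "rt", encoding="utf-8", errors="replace") as handle:
            buffer = []
            for line in handle:
                if line.strip() == "$$$$":  # SDF record terminator
                    yield "".join(buffer)
                    buffer = []
                else:
                    buffer.append(line)


def filter_records(
    sources: Iterable[Union[str, BinaryIO]],
    keep: Callable[[Dict[str, str]], bool],
) -> Iterator[Dict[str, str]]:
    """Stream parsed records, keeping only those accepted by the filter."""
    for record in iter_sdf_records(sources):
        fields = parse_sdf_record(record)
        if keep(fields):
            yield fields


def has_iupac_name(fields: Dict[str, str]) -> bool:
    """Example screening criterion: the record carries an IUPAC name."""
    return bool(fields.get("PUBCHEM_IUPAC_NAME"))
```

Because the records are streamed one at a time, memory use stays flat regardless of how many files are in the mirror. Something like `filter_records(sorted(glob.glob("mirror/*.sdf.gz")), has_iupac_name)` would walk the whole set, where the `mirror/` directory is a placeholder.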

Keep in mind that the Substance records contain most of the useful metadata, whereas the Compound records contain most of the useful structure data. There are many ways to combine these records.

For example, you can create a dictionary mapping CAS numbers, IUPAC names, trivial names, etc. to PubChem CID records. You'll need to be careful about how you do this (see this discussion on mapping CAS numbers, for example). Given some effort and possible combination with other large downloadable databases, it's possible to extract a lot of useful information this way.
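A rough sketch of such a dictionary follows. The point of care it illustrates is that one name can map to several CIDs, so each key holds a set of CIDs and ambiguous names can be reviewed instead of silently overwritten. The helper names and the normalization rule (case folding plus whitespace collapsing) are assumptions, not something prescribed by PubChem.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.split()).casefold()


def build_name_index(pairs: Iterable[Tuple[int, str]]) -> Dict[str, Set[int]]:
    """Map each normalized name to the set of CIDs that use it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for cid, name in pairs:
        key = normalize(name)
        if key:  # skip empty names
            index[key].add(cid)
    return dict(index)


def ambiguous(index: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Return only the names that point at more than one CID."""
    return {name: cids for name, cids in index.items() if len(cids) > 1}
```

The `(cid, name)` pairs can come from any source: fields pulled out of the SDF records, a synonym file, or another database joined on the same identifiers.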


"Keep in mind that the Substance records contain most of the useful metadata, whereas the Compound records contain most of the useful structure data." ... That explains a lot. Thanks for the useful explanation and links!

12.2 years ago
Evan Bolton ▴ 50

Hi,

I am not sure I understand the responses to this question... a complete file of these synonyms for all of PubChem is available on their FTP site... what is the issue? ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-Synonym-filtered.gz

You can easily download these per CID as well... http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?q=nmcv&namedisopt=&cid=5329102
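For the bulk file, a minimal sketch of reading it in Python is shown here. It assumes that each line of CID-Synonym-filtered.gz holds a CID and one synonym separated by a tab, and that the lines for a given CID are adjacent; check both against the downloaded file before relying on it. The function names are hypothetical.

```python
import gzip
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def parse_synonym_lines(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Turn 'CID<TAB>synonym' lines into (cid, synonym) pairs."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        cid, _, synonym = line.partition("\t")
        if synonym:
            yield int(cid), synonym


def synonyms_by_cid(lines: Iterable[str]) -> Iterator[Tuple[int, List[str]]]:
    """Group consecutive rows by CID (assumes rows for one CID are adjacent)."""
    rows = parse_synonym_lines(lines)
    for cid, group in groupby(rows, key=lambda row: row[0]):
        yield cid, [synonym for _, synonym in group]


def lookup(path: str, wanted_cid: int) -> List[str]:
    """Stream the gzipped file and return the synonyms of one CID."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for cid, names in synonyms_by_cid(handle):
            if cid == wanted_cid:
                return names
    return []
```

With a local copy of the file, `lookup("CID-Synonym-filtered.gz", 5329102)` would return the synonym list for the Sunitinib record, and iterating over `synonyms_by_cid` directly covers every CID in one pass.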

Questions like these should really be sent to: info@ncbi.nlm.nih.gov

Evan


Indeed, if it's just synonym lookup, the CID synonym file is sufficient. But more generally, isn't PUG the official way to query PubChem? I assumed so, hence my suggestion to go via PUG for generality.


Right, that seems to be what he is interested in... synonyms but for all CIDs. The "Extras" folders augment the full dump information based on frequently requested aspects of PubChem.

PUG is simply one of several approaches that we support for programmatic access. The most heavily used is the EUtils interface. Using plain XML, you get all the info per record. You are limited to 3 requests per second before they begin to notice you (and if you are causing trouble, they may decide to block your IP). PubChem also supports URL-based (almost RESTful) access to data.
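As an illustration of URL-based access with a polite request rate, here is a hedged Python sketch. It assumes the present-day PUG REST URL layout and JSON response shape, neither of which is spelled out in this thread, and it applies the 3 requests per second figure quoted above for EUtils as a conservative ceiling. The `Throttle` class and function names are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, List, Optional

# Assumed base URL of the PUG REST service.
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def synonyms_url(cid: int) -> str:
    """Build the URL that asks for the synonyms of one CID as JSON."""
    return f"{BASE}/compound/cid/{cid}/synonyms/JSON"


def parse_synonyms(payload: str) -> List[str]:
    """Extract the synonym list from a PUG REST synonyms response."""
    data = json.loads(payload)
    info = data.get("InformationList", {}).get("Information", [])
    return list(info[0].get("Synonym", [])) if info else []


class Throttle:
    """Space calls so that at most `per_second` of them start each second."""

    def __init__(
        self,
        per_second: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = 1.0 / per_second
        self.clock = clock
        self.sleep = sleep
        self.last: Optional[float] = None

    def wait(self) -> None:
        now = self.clock()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


def fetch_synonyms(cid: int, throttle: Throttle) -> List[str]:
    """Fetch the synonyms of one CID over HTTP, respecting the throttle."""
    throttle.wait()
    with urllib.request.urlopen(synonyms_url(cid), timeout=30) as response:
        return parse_synonyms(response.read().decode("utf-8"))
```

Sharing one `Throttle` instance across all calls, for example `fetch_synonyms(5329102, throttle)` inside a loop over CIDs, keeps the whole script under the limit. For every record in PubChem, the bulk files remain the better route.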


Honestly I didn't notice that extras directory. Thanks! That will probably provide some convenient shortcuts. But, I really was just using that test case as an example starting point. I ultimately want arbitrarily complex queries of the data. All three answers so far have been really helpful.

12.2 years ago

As far as I know, the only tool outside NCBI which can parse the native ASN.1 PubChem data is the CACTVS Chemoinformatics toolkit (see www.xemistry.com/academic for free academic versions). The toolkit has tight links into PubChem.

For example, to get a structure with all associated standard data via a CID, this can be conveniently scripted as

set eh [ens create 5329102]

or, to make things more obvious,

set eh [ens create CID5329102]

and then, with a few simple additional commands, you can dig into the data content and structure connectivity. This function decodes the native ASN.1 data and is lossless (the SDF records are only an approximation of the registered data content).

The toolkit also supports SID and assay retrieval, CID/SID/AID determination from structure, SID/CID/AID cross-referencing, transparent querying of PubChem via full structure, substructure, formula, data value, etc. (completely hiding the nasty PUG details), retrieval of selected information from PubChem and Entrez via structure (MeSH terms, computed names, etc.), and much more.

Setting up your own copy of PubChem is really unnecessary unless your queries are so secret that the world must never know what you are doing.
