Question: Parsing Pubchem Compound Records
Malachi Griffith (17k), Washington University School of Medicine, St. Louis, USA, wrote 7.1 years ago (7 votes):

Can anyone recommend some methods for parsing data from PubChem Compound records? I can get a complete dump of the database from the PubChem FTP.

The data is available in ASN, SDF, and XML formats.

For demonstration purposes, imagine that I want to reproduce a subset of the information displayed for a particular drug on the website. For example the record here: Sunitinib.

More specifically, imagine that for this CID (5329102), I want to determine the drug name, the names listed under 'also known as', and the 'Depositor-Supplied Synonyms'.

I ultimately want to be able to perform these kinds of queries for every record in PubChem, not just that one.

It sounds like the PubChem Power User Gateway (PUG) might be helpful? If so, can someone provide a description of how I would get going on the example problem I outlined?

Tags: api, database, xml, pubchem • 6.1k views
modified 3.9 years ago by Biostar ♦♦ 20 • written 7.1 years ago by Malachi Griffith (17k)

I'm having great difficulty relating the CID (5329102) to any file in the FTP site. That would be my starting point.

written 7.1 years ago by Neilfws (48k)

Hi Malachi, you asked this question three years ago but updated it a few weeks ago. Can you tell us how you solved the problem in the end? Apparently you're still working on it. Thanks!

written 3.8 years ago by Maximilian Haeussler (1.3k)

Yes, I was also having trouble immediately relating records viewed on the web with info in the FTP site...

written 7.1 years ago by Malachi Griffith (17k)

How did you deal with this problem in the end? I want to get the 'Therapeutic Uses' and 'Pharmacology and Biochemistry' data for some CIDs from PubChem. Thank you.

written 4.3 years ago by Zhilong Jia (1.4k)
Rajarshi Guha (880), United States, wrote 7.1 years ago (4 votes):

You can't really go via the FTP site unless you're planning to mirror the entire database, as they changed the layout so that you can't determine which file will contain the range of CIDs you're interested in.

It's better to go via PUG, which is horrifically painful but really the only practical way to query unless you mirror the collection yourself. See http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html for examples of the XML requests and responses.

written 7.1 years ago by Rajarshi Guha (880)

Thanks! May I ask, what are your thoughts on mirroring the whole site?

written 7.1 years ago by Malachi Griffith (17k)
Rich Apodaca (170), La Jolla, CA, wrote 7.1 years ago (3 votes):

"I ultimately want to be able to perform these kinds of queries for every record in PubChem, not just that one."

The way I've done it:

  1. Create a local PubChem mirror and (optionally) update it daily. I used the gzipped SDF format (sdfgz).
  2. Develop a set of criteria for screening the PubChem data set, implemented as a record filter.
  3. Iterate over the records one at a time using a tool that can work with the raw sdfgz files as if the entire set were one big SDF file. For example, I developed such a tool as part of an earlier project. Filter out the records that you don't care about.

Keep in mind that the Substance records contain most of the useful metadata, whereas the Compound records contain most of the useful structure data. There are many ways to combine these records.

For example, you can create a dictionary mapping CAS numbers, IUPAC names, trivial names, etc. to PubChem CID records. You'll need to be careful about how you do this (see this discussion on mapping CAS numbers, for example). Given some effort and possible combination with other large downloadable databases, it's possible to extract a lot of useful information this way.
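The iterate-and-filter steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the tool mentioned in the answer: it assumes each record ends with a `$$$$` line and that data items are introduced by header lines of the form `> <TAG>` (for example `> <PUBCHEM_COMPOUND_CID>`) and terminated by a blank line. The function names `iter_sdf_records` and `filter_records` are made up for this example.

```python
import gzip
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def iter_sdf_records(paths: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield the data fields of every record in a series of .sdf.gz files,
    as if the files were one big SDF file. The molfile block is skipped."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            fields: Dict[str, str] = {}
            tag: Optional[str] = None
            values: List[str] = []
            for raw in handle:
                line = raw.rstrip("\r\n")
                if line == "$$$$":
                    # End of record: flush any unterminated data item first.
                    if tag is not None:
                        fields[tag] = "\n".join(values)
                    yield fields
                    fields, tag, values = {}, None, []
                elif line.startswith("> <"):
                    # Assumed header form: "> <PUBCHEM_COMPOUND_CID>"
                    tag = line[line.index("<") + 1 : line.rindex(">")]
                    values = []
                elif tag is not None:
                    if line == "":
                        # A blank line closes the current data item.
                        fields[tag] = "\n".join(values)
                        tag = None
                    else:
                        values.append(line)


def filter_records(
    records: Iterable[Dict[str, str]],
    keep: Callable[[Dict[str, str]], bool],
) -> Iterator[Dict[str, str]]:
    """Lazily drop the records that the screening criterion rejects."""
    return (record for record in records if keep(record))
```

Because both functions are generators, only one record is held in memory at a time, which matters when the input is the whole mirror. A call such as `filter_records(iter_sdf_records(paths), lambda r: r.get("PUBCHEM_COMPOUND_CID") == "5329102")` would keep only the Sunitinib record, where `paths` is your own list of downloaded files.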

written 7.1 years ago by Rich Apodaca (170)

"Keep in mind that the Substance records contain most of the useful metadata, whereas the Compound records contain most of the useful structure data." ... That explains a lot. Thanks for the useful explanation and links!

written 7.1 years ago by Malachi Griffith (17k)
Evan Bolton (50) wrote 7.1 years ago (4 votes):

Hi,

I am not sure I understand the responses to this question... the complete file of these for all of PubChem is available on their FTP site... what is the issue? ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-Synonym-filtered.gz
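For the example in the question, a minimal Python sketch for pulling the synonyms of selected CIDs out of a downloaded copy of that file might look like this. It assumes the file is tab-delimited with one `CID<TAB>synonym` pair per line, which is worth checking against the README on the FTP site; the function name `synonyms_for_cids` and the local path are hypothetical.

```python
import gzip
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def synonyms_for_cids(path: str, wanted: Iterable[int]) -> Dict[int, List[str]]:
    """Stream a gzipped CID/synonym listing and collect names for chosen CIDs.

    Assumed line format: "<CID>\t<synonym>", one synonym per line.
    """
    wanted_set: Set[int] = set(wanted)
    found: Dict[int, List[str]] = defaultdict(list)
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            cid_text, sep, name = line.rstrip("\r\n").partition("\t")
            if not sep or not cid_text.isdigit():
                continue  # skip lines that do not match the assumed format
            cid = int(cid_text)
            if cid in wanted_set:
                found[cid].append(name)
    return dict(found)
```

Something like `synonyms_for_cids("CID-Synonym-filtered.gz", [5329102])` would then return a dictionary keyed by CID. The file is streamed line by line, so the full listing never has to fit in memory.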

You can easily download these per CID as well... http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?q=nmcv&namedisopt=&cid=5329102

Questions like these should really be sent to: info@ncbi.nlm.nih.gov

Evan

written 7.1 years ago by Evan Bolton (50)

Indeed, if it's just synonym lookup, the CID synonym file is sufficient. But more generally, isn't PUG the official way to query PubChem? I assumed so, hence my suggestion to go via PUG for generality.

written 7.1 years ago by Rajarshi Guha (880)

Right, that seems to be what he is interested in... synonyms but for all CIDs. The "Extras" folders augment the full dump information based on frequently requested aspects of PubChem.

PUG is simply one of several approaches that we support for programmatic access. The most heavily used is the EUtils interface. Using simple XML, you get all the info, per record. You are limited to 3 requests per second before they begin to notice you (and if you are causing trouble, they may decide to block your IP). PubChem also supports URL-based (almost RESTful) access to data.
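One way to stay under a requests-per-second limit like the one mentioned here is a small client-side throttle. The sketch below is an assumption-laden illustration, not an official client: `Throttle` and `fetch_all` are made-up names, and `fetch` stands in for whatever function you write to request a single record (via EUtils, PUG, or a URL).

```python
import time
from typing import Callable, Dict, Iterable, Optional, TypeVar

T = TypeVar("T")


class Throttle:
    """Space calls at least 1/max_per_second seconds apart."""

    def __init__(
        self,
        max_per_second: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


def fetch_all(
    cids: Iterable[int],
    fetch: Callable[[int], T],
    throttle: Throttle,
) -> Dict[int, T]:
    """Call the caller-supplied fetch(cid) for each CID, pausing as needed."""
    results: Dict[int, T] = {}
    for cid in cids:
        throttle.wait()
        results[cid] = fetch(cid)
    return results
```

The clock and sleep functions are injectable only so the spacing logic can be checked without real delays; in normal use `Throttle()` with the defaults is enough.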

written 7.1 years ago by Evan Bolton (50)

Honestly I didn't notice that extras directory. Thanks! That will probably provide some convenient shortcuts. But, I really was just using that test case as an example starting point. I ultimately want arbitrarily complex queries of the data. All three answers so far have been really helpful.

written 7.1 years ago by Malachi Griffith (17k)
Wolf Ihlenfeldt (40) wrote 7.1 years ago (0 votes):

As far as I know, the only tool outside NCBI that can parse the native ASN.1 PubChem data is the CACTVS Chemoinformatics toolkit (see www.xemistry.com/academic for free academic versions). The toolkit has tight links into PubChem.

For example, to get a structure with all associated standard data via a CID, this can be conveniently scripted as

set eh [ens create 5329102]

or, to make things more obvious,

set eh [ens create CID5329102]

and then, with a few simple additional commands, you can dig into the data content and structure connectivity. This function decodes the native ASN.1 data and is lossless (the SDF records are only an approximation of the registered data content).

The toolkit also supports SID and assay retrieval, CID/SID/AID determination from structure, SID/CID/AID cross-referencing, transparent querying of PubChem via full structure, substructure, formula, data value, etc. (completely hiding the nasty PUG details), retrieval of selected information from PubChem and Entrez via structure (MeSH terms, computed names, etc.), and much more.

Setting up your own copy of PubChem is really unnecessary unless your queries are so secret that the world must never know what you are doing.

written 7.1 years ago by Wolf Ihlenfeldt (40)