Question: Is there a file mapping cmap names in the Connectivity Map to compound structures (e.g. PubChem CID or DrugBank ID)?
6.6 years ago by
Zhilong Jia1.6k wrote:

Is there a file that describes the relationship between cmap names (Connectivity Map) and CIDs (PubChem) or DBxxx IDs (DrugBank)?

I have a file listing cmap names (drug names), and I want to get the structure files (such as smi or sdf). But some cmap names cannot be found in PubChem or DrugBank.

Thank you.

drugbank cmap pubchem • 4.0k views
modified 6.0 years ago by palukursusdisain0 • written 6.6 years ago by Zhilong Jia1.6k

It would make everyone's life easier if Connectivity Map became a PubChem submitting source (anyone from that crew listening?). Then the mappings are taken care of and advanced analysis becomes possible inside PubChem (e.g. exactly which ones are in cmap and/or/not DrugBank).

written 6.6 years ago by cdsouthan1.8k

Anyone from the cmap team actually following this post?

modified 11 months ago by RamRS30k • written 6.5 years ago by cdsouthan1.8k

I emailed cmap-help, but no reply so far.

written 6.5 years ago by Zhilong Jia1.6k
6.6 years ago by
wdiwdi380 wrote:

This is a  simple scripting task for the Cactvs Cheminformatics Toolkit (free academic downloads available at

The scripts below read the original CMAP Excel file (you need to store it as xlsx, there is no table reader for the old xls format) and write an SDF file with the structure in the CTAB section and both PubChem CID and Drugbank ID as data fields, if they can be determined (there are failures, your observation is correct). Since Drugbank have just completely revamped their interface and turned everything upside down, you also need the latest Drugbank ID retriever property definition, which is not yet included in the current academic packages. You can get it directly from me.

Scripted in Tcl:

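# Read the CMAP instance table; the first row holds the column names
set th [table read cmap_instances_02.xlsx colnames 1]
# Write structures to SDF, computing CID and Drugbank ID as data fields
set fh [molfile open cmap_tcl.sdf w writelist "E_CID E_DRUGBANK_ID" writeflags compute]
puts "Process [table get $th nrows] table rows"
table loop $th row {
    # Take the compound name from column index 2
    set name [lindex $row 2]
    if {[catch {ens create name:$name} eh]} {
        puts stderr "Cannot resolve name $name"
    } else {
        molfile write $fh $eh
        ens delete $eh
    }
}
molfile close $fh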

or scripted in Python (sponsored by Vertex Inc.)

fh=Molfile('cmap_py.sdf','w',{'writelist':'E_CID E_DRUGBANK_ID','writeflags':'compute'})
print('Process',th.nrows,' table rows')
    print('Cannot resolve name',row[2])
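
For readers without access to Cactvs, a similar name-to-CID lookup can be sketched with PubChem's PUG REST service and the Python standard library only. This is a minimal sketch, not part of the scripts above: the function names `cid_url` and `fetch_cids` are made up for illustration, and treating every HTTP error as "name not resolved" is a simplifying assumption.

```python
import urllib.error
import urllib.parse
import urllib.request

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name"


def cid_url(name):
    """Build the PUG REST URL that lists the CIDs matching a compound name."""
    # Percent-encode everything, including spaces and slashes in the name.
    return f"{PUG_BASE}/{urllib.parse.quote(name, safe='')}/cids/TXT"


def fetch_cids(name, timeout=30):
    """Return the CIDs PubChem reports for a name, or [] if the lookup fails."""
    try:
        with urllib.request.urlopen(cid_url(name), timeout=timeout) as resp:
            text = resp.read().decode("utf-8")
    except urllib.error.HTTPError:
        # Assumption: any HTTP error status is treated as an unresolved name.
        return []
    # The TXT output lists one CID per line.
    return [int(token) for token in text.split()]
```

Calling `fetch_cids("benzylpenicillin")` for each cmap name and logging the names that return an empty list gives the set that needs manual curation. Replacing `cids/TXT` with `SDF` in the URL requests the structure record instead of the identifier list.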

written 6.6 years ago by wdiwdi380

Thank you. But as you said, some cmap names do not map well to a PubChem CID. This is the key point. On the PubChem FTP site there is a file, CID-Synonym-filtered, which shows the relationship between CIDs and drug synonyms, but the cmap names are sometimes special. Once all the CIDs for the cmap names are known, it's easy to obtain the structure files.
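
A synonym file of this kind does not need to be loaded into memory: it can be streamed line by line while keeping only the names of interest. The sketch below assumes each line holds a CID and one synonym separated by a tab, and that names are matched case-insensitively; the function name `map_names_to_cids` is hypothetical.

```python
import gzip


def map_names_to_cids(synonym_path, names):
    """Stream a CID/synonym file and collect the CIDs for the requested names."""
    # Normalised name -> original spelling, so results use the caller's names.
    wanted = {name.strip().lower(): name for name in names}
    hits = {name: [] for name in names}
    # Read gzip-compressed files transparently, plain text otherwise.
    opener = gzip.open if synonym_path.endswith(".gz") else open
    with opener(synonym_path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            cid, _, synonym = line.rstrip("\n").partition("\t")
            key = synonym.strip().lower()
            if key in wanted:
                hits[wanted[key]].append(int(cid))
    return hits
```

Names that come back with an empty list are the ones that need a second lookup, for example by catalog name or CAS number. Names with several CIDs need a manual check.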

cmap name: The name given to a perturbagen (or group of closely related perturbagens) by cmap curators. For small molecules, the cmap name is always the recommended ('r') or provisional ('p') INN, if one is available. Otherwise a cmap name is selected from amongst the United States Adopted Name (USAN), the British Approved Name (BAN), the monograph titles from Martindale: The Complete Drug Reference or The Merck Index. Different salts of the same compound are given the same cmap name, unless a specific salt is the INN. For example, the cmap name for both propiomazine hydrochloride and propiomazine maleate is "propiomazine" (which is the rINN) but the cmap name for isosorbide dinitrate is "isosorbide dinitrate" since this is the rINN.


modified 11 months ago by RamRS30k • written 6.6 years ago by Zhilong Jia1.6k

OK, so cmap should (please) simply submit to PubChem. They can add whatever they like (as long as it is useful) in the synonym or comment lines of the SID. I can't see why cmap would need any unique names anyway (except for novel structures?), but they can go in. It's only necessary for the chemical structures to be correct (which they have probably curated anyway, or run through a checker and/or InChIKey intersects pre-submission). Inside PubChem, the heuristics of name and synonym merging are looked after during the update of the CID. Users can then query via the SID or the CID fields, or both in fact (e.g. all the INNs, USANs and BANs are in there).

modified 11 months ago by RamRS30k • written 6.6 years ago by cdsouthan1.8k

I believe the cmap name should be checked carefully when looking up the CID.

For instance:

instance_id, cmap_name, catalog_name
3577, benzylpenicillin, Benzylpenicillin sodium [69-57-8]

Searching for benzylpenicillin gives CID 5904; Benzylpenicillin sodium gives SID 23668834; and the CAS number 69-57-8 gives CID 23668834. As a result, the CID should be 23668834.

There are some similar examples, especially in Prestwick_xxx.

And in some cases, the CAS number does not refer to the same compound as the cmap_name or catalog_name.
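
One way to apply this check is to split each catalog_name into its name part and its bracketed CAS number, then try the more specific identifiers before falling back to the cmap_name. The sketch below assumes catalog names follow the "Name [CAS]" pattern of the example above; the helper names are hypothetical. The checksum test only catches malformed CAS numbers, not a valid CAS number that points to a different compound.

```python
import re

# Name, optional whitespace, then a CAS Registry Number in square brackets.
CAS_PATTERN = re.compile(r"^(?P<name>.*?)\s*\[(?P<cas>\d{2,7}-\d{2}-\d)\]\s*$")


def split_catalog_name(catalog_name):
    """Return (name, cas) from 'Name [CAS]'; cas is None if no CAS is present."""
    match = CAS_PATTERN.match(catalog_name)
    if match:
        return match.group("name").strip(), match.group("cas")
    return catalog_name.strip(), None


def cas_checksum_ok(cas):
    """Validate the CAS check digit (weighted digit sum modulo 10)."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    # Weights start at 1 for the digit just left of the check digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


def query_candidates(cmap_name, catalog_name):
    """List search terms from most to least specific for one cmap instance."""
    name, cas = split_catalog_name(catalog_name)
    candidates = [name]
    if cas is not None and cas_checksum_ok(cas):
        candidates.append(cas)
    if cmap_name not in candidates:
        candidates.append(cmap_name)
    return candidates
```

If the catalog name and the CAS number resolve to different CIDs, the instance is a candidate for manual review.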

modified 11 months ago by RamRS30k • written 6.6 years ago by Zhilong Jia1.6k

Hi, I've found the CID-Synonym-filtered file, but it's nearly 7 GB, which is too large to handle. How do you deal with the file? Thanks.

written 6.0 years ago by zhengl070

Can you share the link?

written 6.0 years ago by Khader Shameer18k
6.0 years ago by
cdsouthan1.8k wrote:

I just noticed that UniChem has 35320 LINCS compound structures loaded.

I need to look at the file to see what the mapping is, but InChIKey would be useful.
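
If both sources expose an InChIKey, the mapping reduces to an exact-match join on that key. The sketch below is a generic illustration and does not reflect the actual layout of any UniChem or LINCS file: the function name and the `(name, inchikey)` and `(inchikey, external_id)` record shapes are assumptions.

```python
def join_on_inchikey(cmap_records, reference_records):
    """Map each cmap name to the external IDs that share its InChIKey.

    cmap_records: iterable of (cmap_name, inchikey) pairs.
    reference_records: iterable of (inchikey, external_id) pairs.
    """
    ids_by_key = {}
    for inchikey, external_id in reference_records:
        # Normalise case and whitespace so formatting differences do not matter.
        ids_by_key.setdefault(inchikey.strip().upper(), []).append(external_id)
    return {
        name: ids_by_key.get(inchikey.strip().upper(), [])
        for name, inchikey in cmap_records
    }
```

An exact InChIKey match treats a salt and its parent compound as different structures, so the salt issue discussed in this thread still needs separate handling.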

modified 6.0 years ago • written 6.0 years ago by cdsouthan1.8k