Question: PDB and MMDB differences in ligand structure
cdsouthan1.8k wrote:

During an audit of our PDB ligand links, we have been looking at intersects between PDBe ligands (via UniChem) and PubChem CIDs for what should be the same structures in NCBI MMDB. While the comparison is preliminary, from our total of ~900 (curated, real lead-like structures) we see a peculiar discordance of well over 100 in both directions (i.e. PDBe with no exact match in MMDB and vice versa). We have seen this issue before for individual cases, but I wonder if anyone has done a systematic comparison (e.g. via InChIKeys)? The numbers don't add up for starters, with 19713 in PDBe and 27973 for MMDB.
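
For anyone wanting to try the InChIKey comparison themselves, here is a minimal sketch of the kind of thing I mean, assuming you already have the two lists of standard InChIKeys exported as plain strings. The function names and the example keys are made up for illustration (the keys are placeholders, not real structures). Besides exact matches, it flags keys that share only the first 14-character block, which should mean the same connectivity but a difference in stereo, isotope or protonation state. Note that a genuine change in hydrogen count alters the first block as well, so those cases will show up as plain mismatches.

```python
def first_block(inchikey):
    """Return the first block of an InChIKey (the 14-character connectivity hash)."""
    return inchikey.strip().upper().split("-")[0]


def compare_inchikeys(pdbe_keys, mmdb_keys):
    """Compare two collections of InChIKeys.

    Returns a dict of sets:
      exact         - full keys present in both sources
      pdbe_only     - full keys present only in the PDBe list
      mmdb_only     - full keys present only in the MMDB list
      skeleton_only - first blocks shared between the two *unmatched*
                      sets, i.e. same connectivity but no exact match
    """
    pdbe = {k.strip().upper() for k in pdbe_keys if k.strip()}
    mmdb = {k.strip().upper() for k in mmdb_keys if k.strip()}

    exact = pdbe & mmdb
    pdbe_only = pdbe - mmdb
    mmdb_only = mmdb - pdbe

    # Among the mismatches, look for pairs that agree on connectivity only.
    skeleton_only = ({first_block(k) for k in pdbe_only}
                     & {first_block(k) for k in mmdb_only})

    return {
        "exact": exact,
        "pdbe_only": pdbe_only,
        "mmdb_only": mmdb_only,
        "skeleton_only": skeleton_only,
    }


if __name__ == "__main__":
    # Placeholder keys in the 14-10-1 InChIKey layout; not real compounds.
    pdbe_example = [
        "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
        "BBBBBBBBBBBBBB-UHFFFAOYSA-N",
        "CCCCCCCCCCCCCC-XXXXXXXXSA-N",
    ]
    mmdb_example = [
        "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
        "CCCCCCCCCCCCCC-YYYYYYYYSA-N",
        "DDDDDDDDDDDDDD-UHFFFAOYSA-N",
    ]
    result = compare_inchikeys(pdbe_example, mmdb_example)
    for name, keys in result.items():
        print(name, len(keys))
```

The set sizes give the discordance in each direction, and the skeleton-only bucket is a quick way to separate "nearly the same" from "really different" before inspecting individual ligands.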

Some boil down to just two hydrogens but are still a mismatch (e.g. AWJ).

Others are more serious such as 35Q

Just received a tweet reply (appreciated): PDB ligands include unobserved atoms and idealise geometry; it looks like (e.g. 35Q) PubChem extracts from coordinates?

This is the (we know what we put) "in" versus the (let's see what we can density-fit) "out" problem.

Let's see what else gets pitched in (check Twitter if interested; the exchange seems to have moved there!)


I have expanded the topic at

conroy20 wrote:

To expand, when PDB annotators make a PDB chemical dictionary entry, it is as per the author's definition (as far as it is given) of what the molecule _should_ be. What is built into the coordinates may deviate significantly from that.

Due to disorder, there might be bits of the ligand which are not observed in the crystal structure (e.g. 35Q), but the PDB definition should include the unobserved bit.

In some cases the geometry in the coordinates is improbable. I can think of a ligand where (in a sugar ring) the bonds C-O-C were 1.2 and 1.7Å. The dictionary though would have fixed these to ideal values. Deviation of the coordinates from ideal is listed in the validation reports distributed with each PDB entry.
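
To make the sugar ring example concrete, here is a rough sketch of how one might measure a bond from deposited coordinates and compare it with an ideal value. The coordinates below are invented so that the two C-O bonds come out at 1.2 and 1.7Å, and the ideal length of 1.43Å is my assumption for a typical C-O single bond, not a value taken from any particular dictionary entry. The validation reports use their own statistics, so treat this as a plain difference in Å only.

```python
import math

# Assumed typical C-O single bond length in angstroms (illustrative value).
IDEAL_C_O = 1.43


def bond_length(atom_a, atom_b):
    """Euclidean distance between two (x, y, z) coordinates, in angstroms."""
    return math.dist(atom_a, atom_b)


def deviation_from_ideal(atom_a, atom_b, ideal=IDEAL_C_O):
    """Signed difference (observed minus ideal) for a single bond."""
    return bond_length(atom_a, atom_b) - ideal


if __name__ == "__main__":
    # Hypothetical coordinates for a C-O-C fragment, chosen so the two
    # bonds are 1.2 and 1.7 angstroms long.
    c1 = (0.0, 0.0, 0.0)
    o5 = (1.2, 0.0, 0.0)
    c5 = (1.2, 1.7, 0.0)

    for label, a, b in (("C1-O5", c1, o5), ("O5-C5", o5, c5)):
        print(f"{label}: {bond_length(a, b):.2f} A, "
              f"deviation {deviation_from_ideal(a, b):+.2f} A")
```

A structure perceived directly from coordinates like these could plausibly be assigned the wrong bond orders, whereas the dictionary definition would carry the idealised geometry.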

Covalently bound molecules may be another source of difference. The PDB definition may include a leaving group which has left; I don't know how PubChem handles such cases.

If a molecule definition is made entirely from the XYZ coordinates in a PDB file (with a modal resolution of about 2.3Å, and rarely with hydrogens) it will be prone to error, though I'm not suggesting all PDB definitions are correct by any means; TP7 is currently built incorrectly and is about to be fixed.





Thanks, very useful (BTW, any chance of those 19K InChIKeys?). Next up should be someone from the MMDB team, I hope.

What is  TP7?

Coenzyme B. It has been built in error with an OH rather than a carbonyl, but now that PDB annotators have noticed, it is being fixed and will be updated next week.

