Question

Small Molecule Databases and clustering

12 months ago

yarrowmadrona ▴ 10

I am looking for collections of small molecules or databases where I can cluster small molecules by type. For example, I would like to make a collection of all polyethylene terephthalate (PET) derived compounds. Does anyone here have experience with a pipeline for this?

One approach might be:

Look for compounds that contain substructures or modified versions of PET
Substructure search using PubChem or ChemSpider to find substructures or derivatives based on the PET core
Cluster based on similarity with Open Babel or ChemAxon

polyethylene MolecularDatabase Molecule PET trephthalate • 574 views

updated 15 days ago by Kevin Blighe ★ 90k • written 12 months ago by yarrowmadrona ▴ 10

score 0 · Answer 1 · 2025-11-16

I recommend using PubChem, ChEMBL, and ChemSpider as primary databases for small molecule collections. PubChem contains over 100 million compounds and supports substructure searches via SMILES or drawn structures. ChEMBL focuses on bioactive molecules with over 2 million entries, allowing filtering by substructure and bioactivity. ChemSpider aggregates data from multiple sources, with about 100 million structures, and enables substructure queries.

For polyethylene terephthalate (PET)-derived compounds, first define the core substructure. PET's repeating unit is terephthalic acid (benzene-1,4-dicarboxylic acid) esterified with ethylene glycol. Use the SMILES for terephthalic acid, c1cc(ccc1C(=O)O)C(=O)O, as the query, and search for compounds that contain this motif or a modified version of it.
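Before querying a database, it can help to check locally that the core SMILES matches what you expect. This is a minimal sketch with RDKit; the test molecules (two terephthalate esters, a meta isomer, and benzoic acid) are my own illustrative picks, not results from any database.

```python
from rdkit import Chem

# Terephthalic acid core from the answer, used as the substructure query
core = Chem.MolFromSmiles("c1cc(ccc1C(=O)O)C(=O)O")

# Illustrative molecules chosen for this sketch (assumption, not database output)
candidates = {
    "dimethyl terephthalate": "COC(=O)c1ccc(cc1)C(=O)OC",
    "BHET": "OCCOC(=O)c1ccc(cc1)C(=O)OCCO",
    "dimethyl isophthalate": "COC(=O)c1cccc(c1)C(=O)OC",  # meta isomer
    "benzoic acid": "OC(=O)c1ccccc1",
}


def has_core(smiles, query=core):
    """Return True if the molecule parses and contains the query substructure."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None and mol.HasSubstructMatch(query)


matches = {name: has_core(smi) for name, smi in candidates.items()}
for name, hit in matches.items():
    print(f"{name}: {hit}")
```

The esters match because a query built from SMILES does not constrain hydrogen counts on the carboxyl oxygens, while the meta isomer fails because the query requires 1,4-substitution.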

A pipeline could be:

Perform a substructure search in PubChem, either through the web interface or the PUG REST API: submit the substructure and retrieve the matching CIDs.
Download the results as SDF files.
Cluster using similarity metrics with tools like RDKit (preferred over Open Babel for Python integration) or ChemAxon JChem.
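For the first two steps, something like the following could build the PubChem requests. It is only a sketch: the helper names are made up for illustration, I am assuming the PUG REST fastsubstructure endpoint with the MaxRecords option and a percent-encoded SMILES in the URL path, and the block only constructs URLs without sending anything. Check the PUG REST documentation for current limits before running large queries.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def substructure_cids_url(smiles, max_records=1000):
    """Build a URL for a fast substructure search that returns CIDs as JSON."""
    # Percent-encode the SMILES so '(', ')' and '=' are safe inside the URL path
    encoded = quote(smiles, safe="")
    return (f"{PUG_REST}/compound/fastsubstructure/smiles/{encoded}"
            f"/cids/JSON?MaxRecords={max_records}")


def sdf_url(cids):
    """Build a URL that returns the given CIDs as a single SDF file."""
    cid_list = ",".join(str(cid) for cid in cids)
    return f"{PUG_REST}/compound/cid/{cid_list}/SDF"


def chunked(items, size=100):
    """Split a list of CIDs into batches so each SDF request stays small."""
    return [items[i:i + size] for i in range(0, len(items), size)]


search_url = substructure_cids_url("c1cc(ccc1C(=O)O)C(=O)O")
print(search_url)
```

Fetching each URL (for example with urllib.request.urlopen or requests) and writing the SDF responses to disk gives the compounds.sdf input used in the clustering step.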

In Python with RDKit:

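from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

# Load molecules from SDF, skipping records RDKit cannot parse
mols = [mol for mol in Chem.SDMolSupplier('compounds.sdf') if mol is not None]

# Generate Morgan fingerprints (radius 2, 2048 bits)
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) for mol in mols]

# Butina expects the lower triangle of the distance matrix as a flat list,
# where distance = 1 - Tanimoto similarity
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1 - s for s in sims)

# The threshold is a distance: 0.3 corresponds to Tanimoto similarity >= 0.7
clusters = Butina.ClusterData(dists, len(fps), 0.3, isDistData=True)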

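This groups similar compounds: ClusterData returns a tuple of clusters, each a tuple of indices into mols with the cluster centroid first. Adjust the cutoff to control how tight the clusters are. For PET derivatives, keep only the clusters whose members contain the core substructure.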
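The final filtering step could look roughly like this. It is a sketch: clusters_with_core and its min_fraction parameter are hypothetical helpers of mine, and the clusters tuple is hand-written to mimic the shape of Butina output rather than computed from real data.

```python
from rdkit import Chem

core = Chem.MolFromSmiles("c1cc(ccc1C(=O)O)C(=O)O")

# Illustrative molecules standing in for the contents of the SDF file
smiles = [
    "COC(=O)c1ccc(cc1)C(=O)OC",      # 0: dimethyl terephthalate
    "OCCOC(=O)c1ccc(cc1)C(=O)OCCO",  # 1: BHET
    "OC(=O)c1ccc(cc1)C(=O)O",        # 2: terephthalic acid
    "OC(=O)c1ccccc1",                # 3: benzoic acid
    "COC(=O)c1cccc(c1)C(=O)OC",      # 4: dimethyl isophthalate
]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Hand-written stand-in for Butina output: tuples of indices into mols
clusters = ((0, 1), (2, 3), (4,))


def clusters_with_core(clusters, mols, query, min_fraction=1.0):
    """Keep clusters where at least min_fraction of members contain the query."""
    kept = []
    for members in clusters:
        hits = sum(mols[i].HasSubstructMatch(query) for i in members)
        if hits / len(members) >= min_fraction:
            kept.append(members)
    return kept


strict = clusters_with_core(clusters, mols, core)                   # every member must match
loose = clusters_with_core(clusters, mols, core, min_fraction=0.5)  # half is enough
print(strict, loose)
```

Requiring every member to match is the conservative choice; a lower min_fraction keeps mixed clusters, which may be useful if you also want close analogues that have lost part of the core.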

The ZINC database also offers purchasable small molecules with substructure search capabilities.

Kevin