I recommend PubChem, ChEMBL, and ChemSpider as the primary databases for small-molecule collections. PubChem contains over 100 million compounds and supports substructure searches from a SMILES string or a drawn structure. ChEMBL focuses on bioactive molecules, with over 2 million entries, and lets you filter by substructure and by bioactivity. ChemSpider aggregates data from multiple sources, holds about 100 million structures, and also supports substructure queries.
For polyethylene terephthalate (PET)-derived compounds, first define the core substructure. PET's repeating unit is terephthalic acid (benzene-1,4-dicarboxylic acid) esterified with ethylene glycol. Use the SMILES for terephthalic acid as the query: c1cc(ccc1C(=O)O)C(=O)O. Then search for derivatives that contain this motif, for example its esters or ring-substituted analogues.
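To sanity-check the query before sending it to a database, you can test it locally with RDKit. This is a minimal sketch: the four test molecules are my own illustrative picks, not compounds taken from any search result. Because the query is built from SMILES, implicit hydrogens are not part of the match, so the acid pattern also hits esters.

```python
from rdkit import Chem

# Core query: terephthalic acid, as given above
core = Chem.MolFromSmiles("c1cc(ccc1C(=O)O)C(=O)O")

# Illustrative test molecules (hypothetical examples chosen for this sketch)
candidates = {
    "BHET": "OCCOC(=O)c1ccc(cc1)C(=O)OCCO",            # bis(2-hydroxyethyl) terephthalate
    "dimethyl terephthalate": "COC(=O)c1ccc(cc1)C(=O)OC",
    "isophthalic acid": "OC(=O)c1cccc(c1)C(=O)O",      # 1,3-isomer, should not match
    "benzoic acid": "OC(=O)c1ccccc1",                  # only one carboxyl group
}

# True where the molecule contains the para-dicarboxylate core
matches = {
    name: Chem.MolFromSmiles(smi).HasSubstructMatch(core)
    for name, smi in candidates.items()
}

for name, hit in matches.items():
    print(f"{name}: {hit}")
```

If charged forms such as terephthalate salts matter for your collection, check them separately, since how charged atoms match a neutral query depends on the toolkit.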
A possible pipeline:
1. Run a substructure search in PubChem, either through the web interface or through the PUG REST API: submit the substructure and retrieve the matching CIDs.
2. Download the results as SDF files.
3. Cluster the compounds by fingerprint similarity with a toolkit such as RDKit (preferred over Open Babel for its Python integration) or ChemAxon JChem.
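The search and download steps above can be scripted against PubChem's PUG REST service. The following is a rough sketch, not a tested production client: the helper names (substructure_url, sdf_url, fetch, download_matches), the batch size, and the output file name are my own choices, and I am assuming the fastsubstructure endpoint with the MaxRecords option behaves as described in the PUG REST documentation. Check the current docs and PubChem's request-rate limits before running it on a large query.

```python
from urllib.parse import quote
from urllib.request import urlopen

PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def substructure_url(smiles, max_records=1000):
    """URL for a substructure search that returns matching CIDs as plain text."""
    # Percent-encode the SMILES so characters like '(' and '=' are URL-safe
    encoded = quote(smiles, safe="")
    return (f"{PUG}/compound/fastsubstructure/smiles/{encoded}"
            f"/cids/TXT?MaxRecords={max_records}")

def sdf_url(cids):
    """URL that returns the given CIDs as one SDF document."""
    return f"{PUG}/compound/cid/{','.join(str(c) for c in cids)}/SDF"

def fetch(url, timeout=60):
    """Perform a GET request and return the response body as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

def download_matches(smiles, path="compounds.sdf", max_records=200, batch_size=100):
    """Search by substructure, then write the hits to an SDF file in batches."""
    cids = [int(tok) for tok in fetch(substructure_url(smiles, max_records)).split()]
    with open(path, "w") as out:
        for start in range(0, len(cids), batch_size):
            out.write(fetch(sdf_url(cids[start:start + batch_size])))
    return cids

# Example (requires network access):
# cids = download_matches("c1cc(ccc1C(=O)O)C(=O)O")
```

Nothing here touches the network until download_matches is called, so you can inspect the generated URLs first.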
In Python with RDKit:
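from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina
# Load molecules from SDF, skipping records RDKit cannot parse
mols = [mol for mol in Chem.SDMolSupplier('compounds.sdf') if mol is not None]
# Generate Morgan fingerprints (radius 2, 2048 bits)
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) for mol in mols]
# Butina expects the lower triangle of the distance matrix (1 - Tanimoto similarity)
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1 - s for s in sims)
# Distance cutoff 0.3: members lie within Tanimoto similarity 0.7 of their cluster centroid
clusters = Butina.ClusterData(dists, len(fps), 0.3, isDistData=True)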
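This groups structurally similar compounds; each cluster is a tuple of indices into mols, with the centroid first. Butina's cutoff is a distance (1 - Tanimoto similarity), so a smaller value gives tighter clusters. For PET derivatives, keep only the clusters whose members contain the core substructure.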
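One way to do that last filtering step is sketched below. It assumes one particular reading of "filter": keep only the members that contain the core and drop clusters left empty. The function name core_members is mine, and the molecules and the clusters tuple are hand-written stand-ins for your own mols list and the output of Butina.ClusterData.

```python
from rdkit import Chem

# Core query: terephthalic acid
CORE = Chem.MolFromSmiles("c1cc(ccc1C(=O)O)C(=O)O")

def core_members(mols, clusters, core=CORE):
    """Keep, per cluster, only the member indices whose molecule contains the core."""
    kept = []
    for cluster in clusters:
        hits = tuple(i for i in cluster if mols[i].HasSubstructMatch(core))
        if hits:  # drop clusters with no core-containing member
            kept.append(hits)
    return kept

# Stand-in data (hypothetical): two terephthalates and two non-matching acids
smiles = [
    "OCCOC(=O)c1ccc(cc1)C(=O)OCCO",   # BHET
    "COC(=O)c1ccc(cc1)C(=O)OC",       # dimethyl terephthalate
    "OC(=O)c1ccccc1",                 # benzoic acid
    "OC(=O)c1cccc(c1)C(=O)O",         # isophthalic acid
]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Stand-in for Butina.ClusterData output: tuples of indices into mols
clusters = ((0, 1), (2, 3))

pet_clusters = core_members(mols, clusters)
print(pet_clusters)
```

If you would rather keep whole clusters whenever any member matches, return the original cluster tuple instead of hits.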
The ZINC database is another option: it lists purchasable small molecules and supports substructure searches.
Kevin