Scaffold-based Split

In the field of cheminformatics, scaffold-based split is a method used to divide a dataset of molecules into training and test sets. This method is often used in the context of machine learning where the goal is to train a model to predict certain properties of molecules.

Definition

The ‘scaffold’ in scaffold-based split refers to the core structure of a molecule, typically its ring systems and the linkers between them with side chains removed (the Murcko scaffold used in the example below). Molecules are grouped by scaffold, so that molecules sharing the same core structure fall into the same group. The dataset is then split into training and test sets such that every group is assigned, as a whole, to exactly one set. As a result, no scaffold in the test set has been seen during training, and test performance reflects how well the model generalizes to new core structures rather than how well it recalls close analogues of training molecules.

Pros

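Structural separation: Since all molecules with the same scaffold are kept together, the training and test sets cover different core structures. The test set therefore probes generalization to chemistry the model has not seen, which is usually closer to how the model will be used.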
Avoidance of information leakage: In traditional random splits, molecules with the same scaffold can end up in both the training and test sets. This can lead to overly optimistic performance estimates, as the model may simply be memorizing specific scaffolds. Scaffold splitting avoids this issue.
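
To make the leakage point concrete, here is a minimal sketch that checks whether two sets share a scaffold. It assumes the scaffold of each molecule has already been computed. The helper name `shared_scaffolds`, the molecule names, and the plain-text scaffold labels are all hypothetical and chosen only for readability.

```python
# Hypothetical scaffold labels for five molecules; in practice these would be
# scaffold SMILES computed by a cheminformatics toolkit.
scaffold_of = {
    "aspirin": "benzene",
    "ibuprofen": "benzene",
    "benzyl alcohol": "benzene",
    "3-phenylindole": "phenylindole",
    "phthalic anhydride": "phthalic anhydride",
}


def shared_scaffolds(train, test, scaffold_of):
    """Return the set of scaffolds that occur in both train and test."""
    train_scaffolds = {scaffold_of[m] for m in train}
    test_scaffolds = {scaffold_of[m] for m in test}
    return train_scaffolds & test_scaffolds


# A split that ignores scaffolds: benzene-based molecules land on both sides.
leaky = shared_scaffolds(
    ["aspirin", "3-phenylindole", "phthalic anhydride"],
    ["ibuprofen", "benzyl alcohol"],
    scaffold_of,
)

# A scaffold-aware split: whole groups stay together, so nothing is shared.
clean = shared_scaffolds(
    ["aspirin", "ibuprofen", "benzyl alcohol", "3-phenylindole"],
    ["phthalic anhydride"],
    scaffold_of,
)

print(leaky)  # {'benzene'}
print(clean)  # set()
```

An empty result is what a scaffold-based split should guarantee, so a check like this can serve as a quick sanity test after splitting.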

Cons

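Imbalanced splits: Depending on the distribution of scaffolds, scaffold splitting can lead to imbalanced splits. For example, if a few scaffolds dominate the dataset, each of them must go entirely into one set, so the actual train/test ratio (and the distribution of labels in each set) can differ substantially from the intended one.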
Harder task: Since the test set scaffolds are not seen during training, the prediction task is harder compared to random splitting. This can lead to lower performance estimates.
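
One possible way to limit the imbalance is to assign whole scaffold groups greedily, largest first, while counting molecules rather than scaffolds. The sketch below illustrates that idea under stated assumptions and is not a reference implementation. The function name `balanced_scaffold_split` and the toy group labels are hypothetical, and the input is assumed to be a dictionary mapping each scaffold to its molecules.

```python
def balanced_scaffold_split(groups, train_fraction=0.8):
    """Assign whole scaffold groups to train or test, largest groups first.

    groups: dict mapping a scaffold label to the list of molecules sharing it.
    A group goes to train if it still fits within the target train size;
    otherwise it goes to test.
    """
    n_total = sum(len(members) for members in groups.values())
    train_target = train_fraction * n_total

    train, test = [], []
    # Sort by group size (descending); ties are broken by label so the
    # result is deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for scaffold, members in ordered:
        if len(train) + len(members) <= train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Toy data: one dominant scaffold and three small ones (10 molecules total).
groups = {
    "scaffold_A": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "scaffold_B": ["b1", "b2"],
    "scaffold_C": ["c1"],
    "scaffold_D": ["d1"],
}

train, test = balanced_scaffold_split(groups, train_fraction=0.8)
print(len(train), len(test))  # 8 2
```

Because large groups are placed first, the small groups remain available to fill the rest. The trade-off is that the test set tends to consist of rare scaffolds, which makes the task harder still.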

Comparison to Other Methods

Scaffold-based split is one of many methods to split a dataset into training and test sets. Other methods include random split, stratified split, and time-based split. The best method depends on the specific task and data at hand. Compared to these methods, scaffold-based split is specifically designed for molecular data and aims to provide a more realistic estimate of model performance on unseen data.

Example in Python

In this example, we will use the RDKit library to perform a scaffold-based split. RDKit is a collection of cheminformatics and machine learning tools. It includes functionality for generating and manipulating molecular scaffolds.

First, we need to install the necessary library. If you haven’t installed RDKit, you can do so by running the following command:

!pip install rdkit

Please note that this example assumes you have a dataset of molecules in the form of SMILES strings. SMILES (Simplified Molecular-Input Line-Entry System) is a notation that describes the structure of a molecule as a short ASCII string. Each valid SMILES string describes one structure, although the same molecule can usually be written as several different SMILES strings.

[1]:

# Import necessary libraries
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

import random

# Let's assume we have a list of SMILES strings
smiles_list = ['CC(=O)OC1=CC=CC=C1C(=O)O', 'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O', 'C1=CC=C(C=C1)CO', 'C1=CC=C2C(=C1)C(=CN2)C3=CC=CC=C3', 'C1=CC=C2C(=C1)C(=O)OC2=O']

# Create a dictionary to hold the scaffolds and the molecules they correspond to
scaffolds_to_molecules = {}

for smiles in smiles_list:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # MolFromSmiles returns None for SMILES it cannot parse; skip those
        continue
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)
    scaffold_smiles = Chem.MolToSmiles(scaffold)

    # Group the original SMILES under its canonical scaffold SMILES
    scaffolds_to_molecules.setdefault(scaffold_smiles, []).append(smiles)

# Now we have a dictionary where each key is a scaffold, and the value is a list of molecules (SMILES strings) that have that scaffold
# We can now split the data based on these scaffolds

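# First, we shuffle the scaffolds (unseeded, so the split differs between runs;
# call random.seed(...) beforehand for a reproducible split)
scaffold_list = list(scaffolds_to_molecules.keys())
random.shuffle(scaffold_list)

# Decide on a split: here 80% of the scaffolds (not of the molecules) go to training
train_cutoff = int(0.8 * len(scaffold_list))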

# Split the scaffolds into training and test sets
train_scaffolds = scaffold_list[:train_cutoff]
test_scaffolds = scaffold_list[train_cutoff:]

# Now we can create our training and test sets
train_smiles = [smiles for scaffold in train_scaffolds for smiles in scaffolds_to_molecules[scaffold]]
test_smiles = [smiles for scaffold in test_scaffolds for smiles in scaffolds_to_molecules[scaffold]]

train_smiles, test_smiles

[1]:

(['CC(=O)OC1=CC=CC=C1C(=O)O',
  'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O',
  'C1=CC=C(C=C1)CO',
  'C1=CC=C2C(=C1)C(=CN2)C3=CC=CC=C3'],
 ['C1=CC=C2C(=C1)C(=O)OC2=O'])

[2]:

train_smiles

[2]:

['CC(=O)OC1=CC=CC=C1C(=O)O',
 'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O',
 'C1=CC=C(C=C1)CO',
 'C1=CC=C2C(=C1)C(=CN2)C3=CC=CC=C3']

[3]:

test_smiles

[3]:

['C1=CC=C2C(=C1)C(=O)OC2=O']

[4]:

# scaffold_smiles still holds the scaffold of the last molecule processed in the loop above
scaffold_smiles

[4]:

'O=C1OC(=O)c2ccccc21'