Identifying Chirality in Line Drawings of Molecules Using Imbalanced Dataset Sampler for a Multilabel Classification Task
Yong En Kok, Simon Woodward, Ender Özcan and Mercedes Torres Torres
The paper has been accepted for publication: https://doi.org/10.1002/minf.202200068
Chirality is the ability of molecules to exist as two forms of non-superimposable mirror images. If the two forms cannot be superimposed on each other through any combination of translations, rotations and conformational (bond rotation) changes, the molecules are chiral. There are four common structural motifs that lead to the identification of molecular chirality, namely centre/point, axial, planar and helical chirality.
Chemists have used line drawings to represent chiral organic molecules for more than 150 years, but machine-readable representations were only developed much later: SMILES (in the 1980s) and InChI (from 2000). Nonetheless, these molecular languages are not sufficient to fully define molecular chirality, as they are presently unable to represent axial, planar and helical chirality. Additionally, the process of reconstructing 2D line drawings into machine-readable formats is susceptible to the loss of stereochemical information, thus limiting chiral recognition.
Herein, we compared the pretrained EfficientNetV2 and ResNet50 networks that were fine-tuned for a binary task of chirality classification (achiral/chiral) and a multilabel task of chirality type classification (none/centre/axial/planar).
To address the label-combination imbalance problem in the multilabel task, the study proposed a new data sampling method, the Formulated Imbalanced Dataset Sampler (FIDS), which samples a formulated amount of minority label combinations on top of the training set.
The research also demonstrated the potential of a deep learning network to make predictions that align with human understanding of chirality through the study of heatmaps.
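The idea of sampling extra minority label combinations on top of the training set can be illustrated with a minimal sketch. The function name, the `ratio` parameter and the top-up rule (raise every label combination to a fixed fraction of the largest combination's count) are assumptions made for illustration only; they are not the FIDS formula from the paper, and the toy labels are made up.

```python
import random
from collections import Counter, defaultdict


def oversample_label_combinations(labels, ratio=0.5, seed=0):
    """Return training indices with extra draws of rare label combinations.

    labels: list of tuples, one multilabel vector per training example.
    ratio:  hypothetical target size for each combination, expressed as a
            fraction of the most frequent combination's count.
    """
    rng = random.Random(seed)

    # Group example indices by their exact label combination.
    by_combo = defaultdict(list)
    for idx, combo in enumerate(labels):
        by_combo[tuple(combo)].append(idx)

    target = int(ratio * max(len(members) for members in by_combo.values()))

    # Keep the whole original training set, then add samples on top of it.
    indices = list(range(len(labels)))
    for members in by_combo.values():
        deficit = target - len(members)
        if deficit > 0:
            # Sample with replacement from the minority combination.
            indices.extend(rng.choices(members, k=deficit))
    return indices


# Toy labels in an assumed (centre, axial, planar) order.
toy_labels = [(1, 0, 0)] * 8 + [(0, 1, 0)] * 2 + [(1, 1, 0)] * 1
sampled = oversample_label_combinations(toy_labels, ratio=0.5)
counts = Counter(toy_labels[i] for i in sampled)
print(counts)
```

In this sketch the majority combination is left untouched and only the rarer combinations are topped up, so the original training examples are all still present in the sampled index list.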
pip install -r requirements.txt
Our code is mostly based on the scripts in PyTorch Image Models to train, test, infer and save the models.
We modified PyTorch Image Models at commit 6ae0ac6. The following are the modified files:
- parser_image_folder.py
- parser_factory.py
- dataset_factory.py
- dataset.py
- summary.py
To visualise the activation heatmaps, we applied PyTorch Grad-CAM to our network.
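For readers unfamiliar with how such heatmaps are formed, the core Grad-CAM computation can be sketched in plain Python. This is not the code used in this repository, which relies on the Grad-CAM library; the function name and the toy feature maps and gradients below are hypothetical. Each feature map is weighted by the spatial mean of the class-score gradient, the weighted maps are summed, negative values are clipped, and the result is scaled to [0, 1] for display.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps into one heatmap.

    activations, gradients: lists of K maps, each an H x W nested list.
    gradients holds d(class score)/d(activation) for the same layer.
    """
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weight = spatial mean of that channel's gradient.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]

    # Weighted sum of the feature maps.
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]

    # ReLU keeps only the regions that support the class.
    cam = [[max(value, 0.0) for value in row] for row in cam]

    # Scale to [0, 1] so the map can be overlaid on the image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Two made-up 2 x 2 feature maps and their gradients.
acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
print(heatmap)
```

In practice the activations and gradients come from a chosen convolutional layer of the trained network, and the low-resolution map is upsampled to the input image size before being overlaid.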
The networks were trained and tested on our manually curated molecule dataset, CHIRAL. Note that the database is subject to the normal limitations of human curation. There are two versions of the CHIRAL dataset that are available for download:
- raw
- preprocessed (transparent background removed and images converted to grayscale)
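The two preprocessing steps named above can be sketched on raw pixel values. This is an illustration under stated assumptions, not the script used to build the dataset: the function name is hypothetical, the background is assumed to be white, and the grayscale conversion is assumed to use the common ITU-R BT.601 luma weights.

```python
def flatten_to_grayscale(rgba_pixels, background=(255, 255, 255)):
    """Composite RGBA pixels over a solid background, then convert to 8-bit gray.

    rgba_pixels: iterable of (r, g, b, a) tuples with values in 0..255.
    background:  assumed fill colour for transparent regions (white here).
    """
    gray = []
    for r, g, b, a in rgba_pixels:
        alpha = a / 255.0
        # Alpha-blend each channel over the background colour.
        red = r * alpha + background[0] * (1 - alpha)
        green = g * alpha + background[1] * (1 - alpha)
        blue = b * alpha + background[2] * (1 - alpha)
        # ITU-R BT.601 luma weights (an assumed choice of conversion).
        gray.append(round(0.299 * red + 0.587 * green + 0.114 * blue))
    return gray


# Opaque black ink, a fully transparent pixel, and an opaque red pixel.
pixels = [(0, 0, 0, 255), (0, 0, 0, 0), (255, 0, 0, 255)]
gray = flatten_to_grayscale(pixels)
print(gray)
```

Filling transparent regions before the grayscale conversion matters because a transparent pixel often stores black colour values, which would otherwise turn the background black.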
The EfficientNetV2 and ResNet50 networks were pretrained on the preprocessed ChEMBL+ dataset. There are two versions of the ChEMBL+ dataset that are available for download:
- raw
- preprocessed (transparent background removed and images converted to grayscale)
The pretrained models can be downloaded from here.
We recommend that you download and save the dataset in the dataset folder. Do read about the data in the readme_data.md file stored in the dataset folder.
Binary model (training)
python trainBinary.py --config trainBinary.yaml
Multilabel model (training)
python trainMulti.py --config trainMulti.yaml
Binary model (inference)
python inferenceBinary.py --config inferenceBinary.yaml
Multilabel model (inference)
python inferenceMulti.py --config inferenceMulti.yaml
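As a rough illustration of how multilabel outputs are usually turned into chirality types, the sketch below applies an independent sigmoid to each logit and keeps every label above a threshold. The label order, the 0.5 threshold, the function name and the treatment of "none" as the absence of all other labels are assumptions; the inference script and its config define the actual behaviour.

```python
import math

# Assumed output order; the real order is defined by the trained model.
LABELS = ("centre", "axial", "planar")


def decode_multilabel(logits, threshold=0.5):
    """Map raw per-label logits to a list of predicted chirality types."""
    # Each label gets its own sigmoid, so several labels can be active at once.
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    predicted = [name for name, p in zip(LABELS, probs) if p >= threshold]
    # Assumption: no label above the threshold is reported as "none".
    return predicted or ["none"]


both = decode_multilabel([2.0, -1.0, 0.3])
nothing = decode_multilabel([-3.0, -2.0, -1.0])
print(both, nothing)
```

Unlike a softmax over mutually exclusive classes, this per-label decision lets a molecule carry more than one chirality type at once, which is what makes label combinations (and their imbalance) possible in the first place.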
- dataset and pretrained models
- demo website
- Zooniverse platform for molecule annotation