Digitize your chemical reaction image into a machine-readable representation.
V.0 · IBM Research Zurich
- Description
- Step by step
- Benchmarking
- Installation
- Models - Training - Evaluation - Inference
- Contributing
- Creators
- Thanks
- Citing
Given a chemical reaction image, detect and classify molecules, text and arrows with a transformer-based object detector (DETR). Text detections are then transcribed by OCR and molecule detections are translated into SMILES. The direction of the reaction is detected and preserved in the output file.
Output:
Reaction X
SMILES:
C=Cc1ccc2[nH]cc(C[C@H](N)C(=O)OF)c2c1>>C/C=C(\\C)CN1C2CCC(=O)[C@]1(C)Cc1c2[nH]c2ccc(C)cc12.O>>molecule8>>molecule5>>BCc1ccc2[nH]c3c(c2c1-c1c(O)ccc(C)c1C1=C(C)C2CC4/C(=C\\C)CN2C(C1)C4CC)CC1C(CC)C24CC3N1[C@@H]2/C4=C/C.C.CC=O.CC=O>>C=C1CC2(C#N)C3(C)CC1C1CCNC12CC1=C3Cc2ccc(C)c(-c3c([O-])ccc4[nH]c5c(c34)CC3C[C@H]4C5(C)CC(C)([C@@H](C)CCO)C34CC3CC3)c21
A DETR model with a ResNet50 backbone detects the objects in the image. The classes to be found are ["molecules", "arrows", "text", "+ symbols"]. PNG images are fed as input, and the model returns, as tensors, the bounding boxes of the detected objects together with their respective labels.
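As a minimal sketch of how such outputs can be turned into labelled pixel boxes: upstream DETR predicts boxes as normalized (cx, cy, w, h) values, and the code below assumes this repository keeps that convention. The class index order, the `decode` helper and the 0.7 score threshold are hypothetical and only serve the illustration.

```python
# Hypothetical label order; the repository's actual class indices may differ.
CLASSES = ["molecules", "arrows", "text", "+ symbols"]


def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


def decode(boxes, label_ids, scores, img_w, img_h, threshold=0.7):
    """Keep confident detections and attach readable labels.

    The threshold is an assumed value, not one taken from the repository.
    """
    detections = []
    for box, label_id, score in zip(boxes, label_ids, scores):
        if score < threshold:
            continue
        detections.append({
            "label": CLASSES[label_id],
            "score": score,
            "box": cxcywh_to_xyxy(box, img_w, img_h),
        })
    return detections


# Two made-up detections on an 800x400 image; the second one is below threshold.
example = decode(
    boxes=[(0.25, 0.5, 0.2, 0.4), (0.5, 0.5, 0.1, 0.05)],
    label_ids=[0, 1],
    scores=[0.98, 0.40],
    img_w=800,
    img_h=400,
)
print(example)
```

Only the first detection survives the threshold, and its box is expressed in pixel coordinates so it can be cropped and passed on to the molecule or text recognition step.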
The synthetic dataset consists of 50k randomly generated images that simulate the distribution of reactions found in real-world publications.
- Learning Rate: 1e-4
- Learning Rate Backbone: 1e-5
- Learning Rate Drop: 22 epochs
- Weight Decay: 1e-5
- Epochs: 30
- Clip Max. Norm.: 0.1
- Dropout: 0.1
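To make the learning rate drop concrete, the sketch below computes the rate in effect at each epoch from the values listed above. It assumes a step schedule with a decay factor of 0.1 (the default of PyTorch's `StepLR`); the factor is not stated in this README, so treat `GAMMA` as an assumption.

```python
# Hyperparameters taken from the list above.
LR = 1e-4
LR_BACKBONE = 1e-5
LR_DROP = 22
EPOCHS = 30
# Assumption: decay factor of 0.1, the default of a PyTorch StepLR scheduler.
GAMMA = 0.1


def lr_at(epoch, base_lr, drop=LR_DROP, gamma=GAMMA):
    """Learning rate in effect during a given 0-indexed epoch."""
    return base_lr * gamma ** (epoch // drop)


schedule = [(e, lr_at(e, LR), lr_at(e, LR_BACKBONE)) for e in range(EPOCHS)]
for epoch, lr, lr_backbone in schedule:
    print(f"epoch {epoch:2d}  lr={lr:.1e}  lr_backbone={lr_backbone:.1e}")
```

Under these assumptions both rates stay constant for the first 22 epochs and are divided by ten for the remaining 8.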
Encoder-decoder attention files can be found in the Backend-Output-detections path.
An OCR model from the DocTr library was trained on chemistry-related data generated from US chemical patents. The training data set was created with the Text Recognition Data Generator tool. The resulting weights are used to extract textual information from the text detections.
To recognize the information present in the detected molecule objects, we used the open-access tool MolVec.
To preserve the actual path of the whole reaction, the direction of each detected arrow is extracted.
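This README does not describe how an arrow is tied to the molecules around it, so the following is only one plausible sketch, not the repository's implementation. It assumes that the tail and head points of an arrow are already known and that each molecule has a pixel bounding box; the molecule nearest the tail becomes `prev_mol` and the one nearest the head becomes `post_mol`. The helper names and coordinates are hypothetical.

```python
import math


def center(box):
    """Center point of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def nearest(point, molecules):
    """Name of the molecule whose box center is closest to point."""
    return min(molecules, key=lambda name: math.dist(point, center(molecules[name])))


def link_arrow(tail, head, molecules):
    """Assign the molecules before and after an arrow from its direction."""
    return {
        "prev_mol": nearest(tail, molecules),
        "post_mol": nearest(head, molecules),
    }


# Made-up layout: two molecules with a left-to-right arrow between them.
molecules = {"molecule1": (0, 0, 100, 100), "molecule2": (300, 0, 400, 100)}
link = link_arrow(tail=(120, 50), head=(280, 50), molecules=molecules)
print(link)
```

Swapping `tail` and `head` swaps the two roles, which is why the arrow direction has to be known before the reaction path can be written out.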
A small, randomly selected sample of the test set is evaluated in the "test_results" folder of each approach: DETR, FRCNN and RetinaNet. Check there for a qualitative view of each model's performance.
By aggregating the outcomes of the steps above, we can reconstruct JSON and text files.
{
"arrow11": {
"prev_mol": "CCCC#N.CCCC[Al](CC(C)C)CC(C)C",
"text": ["-duction of nit-ile", "Coordination of nitrog- pair to the ilum-"],
"post_mol": "CCCC#[N+3]1(CCC)C(C)(C)C[AlH2]1(O)CC(C)C"
},
"arrow5": {
"prev_mol": "CCCC#[N+3]1(CCC)C(C)(C)C[AlH2]1(O)CC(C)C",
"text": ["Delivery of hyd-ide to the nitr-- carbon"],
"post_mol": "CCC/C=N/[Al](CC(C)C)CC(C)C"
},
"arrow7": {
"prev_mol": "CCC/C=N/[Al](CC(C)C)CC(C)C",
"text": ["H20","Formation o"],
"post_mol": "CCCC=O"
}
}
Reaction X
SMILES:
C=Cc1ccc2[nH]cc(C[C@H](N)C(=O)OF)c2c1>>C/C=C(\\C)CN1C2CCC(=O)[C@]1(C)Cc1c2[nH]c2ccc(C)cc12.O>>molecule8>>molecule5>>BCc1ccc2[nH]c3c(c2c1-c1c(O)ccc(C)c1C1=C(C)C2CC4/C(=C\\C)CN2C(C1)C4CC)CC1C(CC)C24CC3N1[C@@H]2/C4=C/C.C.CC=O.CC=O>>C=C1CC2(C#N)C3(C)CC1C1CCNC12CC1=C3Cc2ccc(C)c(-c3c([O-])ccc4[nH]c5c(c34)CC3C[C@H]4C5(C)CC(C)([C@@H](C)CCO)C34CC3CC3)c21
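The relationship between the per-arrow JSON records and a chained `>>` string can be sketched as follows. The code assumes a single linear pathway in which each `post_mol` reappears as the `prev_mol` of the next arrow, as in the JSON example above; `chain_reaction` is a hypothetical helper, not a function from this repository.

```python
import json


def chain_reaction(arrows):
    """Order arrow records by following prev_mol -> post_mol links."""
    by_prev = {rec["prev_mol"]: rec for rec in arrows.values()}
    products = {rec["post_mol"] for rec in arrows.values()}
    # The pathway starts at the only molecule that is never a product.
    starts = [mol for mol in by_prev if mol not in products]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting molecule")
    chain = [starts[0]]
    seen = set()
    while chain[-1] in by_prev and chain[-1] not in seen:
        seen.add(chain[-1])  # guard against cycles
        chain.append(by_prev[chain[-1]]["post_mol"])
    return ">>".join(chain)


# Molecule fields copied from the JSON example above; text fields omitted.
arrows = json.loads("""
{
  "arrow11": {"prev_mol": "CCCC#N.CCCC[Al](CC(C)C)CC(C)C",
              "post_mol": "CCCC#[N+3]1(CCC)C(C)(C)C[AlH2]1(O)CC(C)C"},
  "arrow5":  {"prev_mol": "CCCC#[N+3]1(CCC)C(C)(C)C[AlH2]1(O)CC(C)C",
              "post_mol": "CCC/C=N/[Al](CC(C)C)CC(C)C"},
  "arrow7":  {"prev_mol": "CCC/C=N/[Al](CC(C)C)CC(C)C",
              "post_mol": "CCCC=O"}
}
""")
smiles = chain_reaction(arrows)
print(smiles)
```

The arrow keys do not need to be in pathway order, since the ordering is recovered from the molecule links alone.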
The synthetic training data set was benchmarked with well-established CNN detectors and a feature detector approach: RetinaNet as a one-stage detector and Faster-RCNN as a two-stage detector. DETR with the default training schedule nevertheless performed slightly better. See the metrics comparison in the plots folder.
mAP score per class and model (%)
- Make sure everything listed in requirements.txt is installed.
- git clone https://github.com/mindee/doctr.git
Download the DETR_Resnet50 model and move the file to DETR/detr/output/checkpoint.pth
- Follow steps in arrow_78/README.md file.
- DETR/detr/:
- python3 main.py --batch_size 8 --no_aux_loss --eval --resume "output/checkpoint.pth" --arrow_path "images/val/" --output_dir "output/"
- DETR/detr/:
- python3 attention_DETR.py --resume output/checkpoint.pth
DETR - https://github.com/facebookresearch/detectron2
DocTr - https://github.com/mindee/doctr
MolVec - https://github.com/ncats/molvec
This thesis would not have been possible without the guidance of my supervisor, Dr. Daniel Probst, whom I deeply thank. Throughout the writing of this dissertation I have received a great deal of support from all my colleagues in the Accelerated Discovery Team at IBM Research Zurich.
@software{M.Martori2022,
author = {Martori, Mark and Probst, Daniel},
month = {6},
title = {{Machine Learning approach for chemical reactions digitalisation.}},
url = {https://github.com/markmartorilopez/},
version = {0.0},
year = {2022}
}