This repository contains the code, data, and model weights for the ICML 2024 paper "Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates".
The overall model architecture is shown below:
The dependencies can be set up using the following commands:
conda create -n enzygen python=3.8 -y
conda activate enzygen
conda install pytorch=1.10.2 cudatoolkit=11.3 -c pytorch -y
bash setup.sh
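After running the commands above, it can help to confirm that the active environment matches the pinned versions. The snippet below is a minimal sketch, not part of the repository: the helper names `installed_torch_version` and `check_environment` are hypothetical, and the expected versions are simply copied from the conda commands.

```python
import importlib.util
import sys

# Versions taken from the conda commands in this README.
EXPECTED_PYTHON = (3, 8)
EXPECTED_TORCH = "1.10.2"


def installed_torch_version():
    """Return the installed PyTorch version string, or None if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return str(torch.__version__)


def check_environment(python_version, torch_version):
    """Return a list of human-readable problems; an empty list means no mismatch."""
    problems = []
    if tuple(python_version[:2]) != EXPECTED_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} found, expected 3.8"
        )
    if torch_version is None:
        problems.append("PyTorch is not installed")
    elif torch_version.split("+")[0] != EXPECTED_TORCH:
        # Builds such as "1.10.2+cu113" carry a local suffix after "+".
        problems.append(f"PyTorch {torch_version} found, expected {EXPECTED_TORCH}")
    return problems


if __name__ == "__main__":
    issues = check_environment(sys.version_info, installed_torch_version())
    print("Environment OK" if not issues else "\n".join(issues))
```

Run it inside the activated `enzygen` environment; any mismatch is printed as one line per problem.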
We provide the EnzyBench dataset at EnzyBench and the Enzyme Classification Tree (EC) ID-to-index dictionary at EC_Dict.
Please download both files and put them in the data folder:
mkdir data
cd data
wget "https://drive.google.com/file/d/1VycT_gFV2JBpRMCBZlwwxLLRcZDljXCS/view?usp=drive_link"
wget "https://drive.google.com/file/d/1BCitsFRQpzUbGss7xBpTpvKcMcJh_oOz/view?usp=drive_link"
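Note that fetching a Google Drive `view` link with wget typically saves the HTML viewer page rather than the file itself. As a possible workaround, the sketch below extracts the file ID from each link and builds a direct-download URL. The helper names are hypothetical, and the `uc?export=download` URL form is an assumption about how Drive serves files; for large files Drive may still show a confirmation page, in which case a third-party tool such as `gdown` may be needed.

```python
import re

# The two dataset links given in this README.
DRIVE_LINKS = [
    "https://drive.google.com/file/d/1VycT_gFV2JBpRMCBZlwwxLLRcZDljXCS/view?usp=drive_link",
    "https://drive.google.com/file/d/1BCitsFRQpzUbGss7xBpTpvKcMcJh_oOz/view?usp=drive_link",
]


def drive_file_id(url):
    """Extract the file ID from a Drive link of the form .../file/d/<ID>/view."""
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", url)
    if match is None:
        raise ValueError(f"No Drive file ID found in: {url}")
    return match.group(1)


def direct_download_url(url):
    """Build a direct-download URL (assumed form) from a Drive sharing link."""
    return f"https://drive.google.com/uc?export=download&id={drive_file_id(url)}"


if __name__ == "__main__":
    for link in DRIVE_LINKS:
        print(direct_download_url(link))
```

The printed URLs can then be passed, quoted, to wget or opened in a browser.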
We provide the checkpoint used in the paper at Model
Please download the checkpoints and put them in the models folder.
To train your own model, follow the instructions below.
To train a model with the enzyme-substrate interaction constraint introduced in our paper, run:
bash train_enzyme_substrate_33layer.sh
To train a model without the enzyme-substrate interaction constraint, run:
bash train_cluster_enzyme_33layer.sh
In our experience, the best performance comes from first training a model without the enzyme-substrate interaction constraint for around 200,000 steps, and then continuing training with the sequence recovery loss, the coordinate recovery loss, and the enzyme-substrate interaction loss.
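The two-stage recipe can be pictured as a loss that gains an extra term after the first stage. The function below is an illustration only, not the repository's implementation (the repository uses the two separate training scripts above). The function name, the unweighted sum, and the assumption that the first stage uses only the sequence and coordinate recovery losses are all hypothetical.

```python
def training_loss(step, seq_loss, coord_loss, interaction_loss, pretrain_steps=200_000):
    """Combine loss terms according to the two-stage schedule.

    Stage 1 (step < pretrain_steps): sequence + coordinate recovery losses only.
    Stage 2 (step >= pretrain_steps): also add the enzyme-substrate interaction loss.
    The unweighted sum is an assumption made for illustration.
    """
    total = seq_loss + coord_loss
    if step >= pretrain_steps:
        total += interaction_loss
    return total


if __name__ == "__main__":
    # Same per-term values, evaluated before and after the switch point.
    print(training_loss(10_000, 1.0, 0.5, 0.25))
    print(training_loss(200_000, 1.0, 0.5, 0.25))
```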
To design enzymes for the 30 testing third-level categories, please use the following script:
bash generation.sh
There are five items in the output directory:
- protein.txt refers to the designed protein sequence
- src.seq.txt refers to the ground truth sequences
- pdb.txt refers to the target PDB ID and the corresponding chain
- pred_pdbs refers to the directory of designed pdbs
- tgt_pdbs refers to the directory of target pdbs
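The output files listed above can be read together to compare each design with its ground truth. The following is a minimal sketch under stated assumptions: that `protein.txt`, `src.seq.txt`, and `pdb.txt` each hold one entry per line in the same order, and that any whitespace inside a sequence line is only token separation. The helper names are hypothetical and not part of the repository.

```python
from pathlib import Path


def read_lines(path):
    """Return the non-empty, stripped lines of a text file."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def sequence_recovery(designed, reference):
    """Fraction of reference positions where the designed residue is identical."""
    if not reference:
        return 0.0
    matches = sum(a == b for a, b in zip(designed, reference))
    return matches / len(reference)


def load_designs(output_dir):
    """Pair each designed sequence with its ground truth and target PDB entry."""
    out = Path(output_dir)
    # Assumption: whitespace inside a sequence line only separates tokens.
    designed = ["".join(ln.split()) for ln in read_lines(out / "protein.txt")]
    reference = ["".join(ln.split()) for ln in read_lines(out / "src.seq.txt")]
    targets = read_lines(out / "pdb.txt")
    if not (len(designed) == len(reference) == len(targets)):
        raise ValueError("Output files do not have the same number of entries")
    return [
        {"target": t, "designed": d, "reference": r, "recovery": sequence_recovery(d, r)}
        for t, d, r in zip(targets, designed, reference)
    ]
```

For example, `load_designs("path/to/output")` would return one record per test case, where the path is whatever output directory the generation script wrote to.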
python preprocess/get_substrate_feature.py
bash finetune.sh
For ESP evaluation, each test case has the format (Protein_Sequence Substrate_Representation).
The evaluation code for the ESP score was developed by Alexander Kroll and can be found at link
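To prepare inputs in the (Protein_Sequence Substrate_Representation) format, something like the sketch below could be used. It assumes one test case per line with a single space between the two fields, and that the substrate representation is a string without whitespace (for instance a SMILES string). The function names and the example values are hypothetical; check the ESP code for the exact input it expects.

```python
def format_esp_line(protein_sequence, substrate_representation):
    """Format one test case as 'Protein_Sequence Substrate_Representation'."""
    sequence = "".join(protein_sequence.split())  # drop whitespace inside the sequence
    substrate = substrate_representation.strip()
    if not sequence or not substrate:
        raise ValueError("Both the sequence and the substrate must be non-empty")
    if any(ch.isspace() for ch in substrate):
        raise ValueError("The substrate representation must not contain whitespace")
    return f"{sequence} {substrate}"


def write_esp_input(pairs, path):
    """Write (sequence, substrate) pairs to a file, one test case per line."""
    with open(path, "w") as handle:
        for sequence, substrate in pairs:
            handle.write(format_esp_line(sequence, substrate) + "\n")


if __name__ == "__main__":
    # Hypothetical example: a short peptide paired with ethanol as SMILES.
    print(format_esp_line("MKTAYIAK", "CCO"))
```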
Protein Family | 1.1.1 | 1.11.1 | 1.14.13 | 1.14.14 | 1.2.1 | 2.1.1 | 2.3.1 | 2.4.1 |
---|---|---|---|---|---|---|---|---|
EnzyGen | 0.64 | 0.98 | 0.38 | 0.42 | 0.72 | 0.80 | 0.61 | 0.38 |
Protein Family | 2.4.2 | 2.5.1 | 2.6.1 | 2.7.1 | 2.7.10 | 2.7.11 | 2.7.4 | 2.7.7 |
EnzyGen | 0.86 | 0.66 | 0.53 | 0.76 | 0.92 | 0.93 | 0.80 | 0.79 |
Protein Family | 3.1.1 | 3.1.3 | 3.1.4 | 3.2.2 | 3.4.19 | 3.4.21 | 3.5.1 | 3.5.2 |
EnzyGen | 0.76 | 0.62 | 0.88 | 0.47 | 0.26 | 0.73 | 0.40 | 0.14 |
Protein Family | 3.6.1 | 3.6.1 | 3.6.5 | 4.1.1 | 4.2.1 | 4.6.1 | -- | Avg |
EnzyGen | 0.66 | 0.78 | 0.40 | 0.80 | 0.93 | 0.57 | -- | 0.65 |
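As a quick consistency check, the reported average can be recomputed from the 30 per-family scores in the table. The values below are copied row by row from the table; only the variable names are new.

```python
# Per-family EnzyGen scores, copied row by row from the table above.
scores = [
    0.64, 0.98, 0.38, 0.42, 0.72, 0.80, 0.61, 0.38,
    0.86, 0.66, 0.53, 0.76, 0.92, 0.93, 0.80, 0.79,
    0.76, 0.62, 0.88, 0.47, 0.26, 0.73, 0.40, 0.14,
    0.66, 0.78, 0.40, 0.80, 0.93, 0.57,
]

average = sum(scores) / len(scores)
print(f"{len(scores)} families, average = {average:.2f}")  # 30 families, average = 0.65
```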
@inproceedings{songgenerative,
title={Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates},
author={Song, Zhenqiao and Zhao, Yunlong and Shi, Wenxian and Jin, Wengong and Yang, Yang and Li, Lei},
booktitle={Forty-first International Conference on Machine Learning},
year={2024}
}