Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates

Model Architecture

This repository contains the code, data, and model weights for the ICML 2024 paper Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates.

The overall model architecture is shown below:

[Figure: overall model architecture]

Environment

The dependencies can be set up using the following commands:
conda create -n enzygen python=3.8 -y 
conda activate enzygen 
conda install pytorch=1.10.2 cudatoolkit=11.3 -c pytorch -y 
bash setup.sh 

Download Data

We provide the EnzyBench dataset at EnzyBench and the dictionary mapping Enzyme Classification Tree (EC) IDs to indices at EC_Dict.

Please download both files and put them in the data folder.

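mkdir data
cd data
# wget on a Google Drive "view" link saves the HTML viewer page rather than the file,
# so this uses the third-party gdown tool with the same links.
pip install gdown
gdown --fuzzy "https://drive.google.com/file/d/1VycT_gFV2JBpRMCBZlwwxLLRcZDljXCS/view?usp=drive_link"
gdown --fuzzy "https://drive.google.com/file/d/1BCitsFRQpzUbGss7xBpTpvKcMcJh_oOz/view?usp=drive_link"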
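
Google Drive share links can return an HTML page instead of the file itself, depending on the download tool. The sketch below is one way to check a downloaded file before using it. The helper name `looks_like_html` and the 512-byte probe size are illustrative assumptions, not part of this repository.

```python
from pathlib import Path


def looks_like_html(path, probe_bytes=512):
    """Return True if the file starts like an HTML page (hypothetical helper)."""
    with open(path, "rb") as handle:
        # Only the first few bytes are read, so large datasets are not loaded.
        head = handle.read(probe_bytes).lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def report_downloads(folder):
    """Map each file name in `folder` to whether it looks like an HTML page."""
    return {p.name: looks_like_html(p) for p in sorted(Path(folder).iterdir()) if p.is_file()}
```

Calling `report_downloads("data")` after the download step returns `True` for any file that needs to be fetched again.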

Download Model

We provide the checkpoint used in the paper at Model

Please download the checkpoints and put them in the models folder.

If you want to train your own model, please follow the training instructions below.

Training

To train a model with the enzyme-substrate interaction constraint introduced in our paper, run the script below:
bash train_enzyme_substrate_33layer.sh

To train a model without the enzyme-substrate interaction constraint, run the script below:

bash train_cluster_enzyme_33layer.sh

In our experience, the best performance comes from first training a model without the enzyme-substrate interaction constraint for around 200,000 steps, and then continuing training with the sequence recovery loss, coordinate recovery loss, and enzyme-substrate interaction loss.

Inference

To design enzymes for the 30 third-level categories in the test set, run the following script:
bash generation.sh

There are five items in the output directory:

  1. protein.txt contains the designed protein sequences
  2. src.seq.txt contains the ground-truth sequences
  3. pdb.txt contains the target PDB IDs and the corresponding chains
  4. pred_pdbs is the directory of designed PDB files
  5. tgt_pdbs is the directory of target PDB files
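
As a rough illustration of how the three text files might be used together, the sketch below pairs each designed sequence with its ground-truth sequence and target ID, and computes a simple per-position identity. It assumes that each file holds one entry per line, that the files are line-aligned, and that any spaces inside a sequence line are token separators. None of these assumptions is confirmed by this README, and the function names are hypothetical.

```python
from pathlib import Path


def read_lines(path):
    """Return the non-empty lines of a text file, stripped of outer whitespace."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def sequence_recovery(designed, reference):
    """Fraction of compared positions where the two sequences agree."""
    # Assumption: spaces are token separators and carry no residue information.
    designed = "".join(designed.split())
    reference = "".join(reference.split())
    n = min(len(designed), len(reference))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(designed, reference)) / n


def summarize(output_dir):
    """Pair targets with recovery scores, assuming line-aligned files."""
    out = Path(output_dir)
    designed = read_lines(out / "protein.txt")
    reference = read_lines(out / "src.seq.txt")
    targets = read_lines(out / "pdb.txt")
    return [(t, sequence_recovery(d, r)) for t, d, r in zip(targets, designed, reference)]
```

`summarize` returns a list of `(target, recovery)` tuples. Inspect a few lines of each file first to confirm that the layout matches the assumptions above.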

Finetune your own model

To finetune your own model based on our trained model, please follow the guidelines below:

Prepare your own data

We provide an example of the training data at preprocess/case.json. For training and validation, you should prepare the ['seq', 'coor', 'motif', 'pdb', 'ec4', 'substrate', 'binding', 'substrate_coor', 'substrate_feat'] features:

  1. seq denotes the protein sequence
  2. coor denotes the alpha-carbon coordinates, flattened in the order of x, y, z coordinates
  3. motif denotes the functional sites, indexed from 0
  4. pdb denotes the PDB ID and chain
  5. ec4 denotes the fourth-level EC category
  6. substrate denotes the substrate ID
  7. binding (0 or 1) denotes whether the substrate can bind to the enzyme
  8. substrate_coor denotes the coordinates of the substrate
  9. substrate_feat denotes the features of the substrate

You can extract the substrate coordinates and features using preprocess/get_substrate_feature.py:
python preprocess/get_substrate_feature.py
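
Before finetuning, it can help to sanity-check each record against the field descriptions above. The sketch below is a hypothetical validator, not part of this repository. It assumes `seq` is a string, `coor` is a flat list with three values per residue, `motif` is a list of integer indices, and `binding` is the integer 0 or 1. Compare these assumptions with preprocess/case.json before relying on them.

```python
REQUIRED_KEYS = [
    "seq", "coor", "motif", "pdb", "ec4",
    "substrate", "binding", "substrate_coor", "substrate_feat",
]


def check_record(record):
    """Return a list of human-readable problems found in one data record."""
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        return ["missing keys: " + ", ".join(missing)]

    problems = []
    n_residues = len(record["seq"])
    # Assumption: one (x, y, z) triple per residue, stored as a flat list.
    if len(record["coor"]) != 3 * n_residues:
        problems.append("coor length is not 3 * len(seq)")
    # Motif positions are 0-based, so valid indices are 0 .. n_residues - 1.
    out_of_range = [i for i in record["motif"] if not 0 <= i < n_residues]
    if out_of_range:
        problems.append("motif indices out of range: " + str(out_of_range))
    if record["binding"] not in (0, 1):
        problems.append("binding must be 0 or 1")
    return problems
```

An empty list means the record passed these basic checks. It does not guarantee that the training code will accept the record.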

Finetuning your model

After preparing your own data, you can finetune your model using finetune.sh
bash finetune.sh

Evaluation

We provide the ESP evaluation data at [ESP_data_eval](https://drive.google.com/file/d/1q8NENdVWBufz5fDk7TviS6h6_BKmfviN/view?usp=drive_link)

The format for ESP evaluation is (Protein_Sequence Substrate_Representation) for each test case.

The evaluation code for the ESP score was developed by Alexander Kroll and can be found at link.

Expected Results

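| Protein Family | 1.1.1 | 1.11.1 | 1.14.13 | 1.14.14 | 1.2.1 | 2.1.1 | 2.3.1 | 2.4.1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EnzyGen | 0.64 | 0.98 | 0.38 | 0.42 | 0.72 | 0.80 | 0.61 | 0.38 |

| Protein Family | 2.4.2 | 2.5.1 | 2.6.1 | 2.7.1 | 2.7.10 | 2.7.11 | 2.7.4 | 2.7.7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EnzyGen | 0.86 | 0.66 | 0.53 | 0.76 | 0.92 | 0.93 | 0.80 | 0.79 |

| Protein Family | 3.1.1 | 3.1.3 | 3.1.4 | 3.2.2 | 3.4.19 | 3.4.21 | 3.5.1 | 3.5.2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EnzyGen | 0.76 | 0.62 | 0.88 | 0.47 | 0.26 | 0.73 | 0.40 | 0.14 |

| Protein Family | 3.6.1 | 3.6.1 | 3.6.5 | 4.1.1 | 4.2.1 | 4.6.1 | -- | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EnzyGen | 0.66 | 0.78 | 0.40 | 0.80 | 0.93 | 0.57 | -- | 0.65 |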
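
The Avg entry is consistent with an unweighted mean over the 30 listed families. The short check below reproduces it from the values in the table. The scores are stored as a list of pairs rather than a dictionary because the label 3.6.1 appears twice in the table as given, and both entries are kept exactly as listed.

```python
RESULTS = [
    ("1.1.1", 0.64), ("1.11.1", 0.98), ("1.14.13", 0.38), ("1.14.14", 0.42),
    ("1.2.1", 0.72), ("2.1.1", 0.80), ("2.3.1", 0.61), ("2.4.1", 0.38),
    ("2.4.2", 0.86), ("2.5.1", 0.66), ("2.6.1", 0.53), ("2.7.1", 0.76),
    ("2.7.10", 0.92), ("2.7.11", 0.93), ("2.7.4", 0.80), ("2.7.7", 0.79),
    ("3.1.1", 0.76), ("3.1.3", 0.62), ("3.1.4", 0.88), ("3.2.2", 0.47),
    ("3.4.19", 0.26), ("3.4.21", 0.73), ("3.5.1", 0.40), ("3.5.2", 0.14),
    ("3.6.1", 0.66), ("3.6.1", 0.78), ("3.6.5", 0.40), ("4.1.1", 0.80),
    ("4.2.1", 0.93), ("4.6.1", 0.57),
]


def mean_score(results):
    """Unweighted mean of the per-family scores."""
    return sum(score for _, score in results) / len(results)


if __name__ == "__main__":
    # The scores sum to 19.58 over 30 families, i.e. about 0.653.
    print(f"{len(RESULTS)} families, mean = {mean_score(RESULTS):.2f}")
```

The same function can be applied to your own per-family scores to compare against the reported average.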

Citation

If you find our work helpful, please consider citing our paper.
@inproceedings{songgenerative,
  title={Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates},
  author={Song, Zhenqiao and Zhao, Yunlong and Shi, Wenxian and Jin, Wengong and Yang, Yang and Li, Lei},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}