In this study, we explored various machine learning (ML) models to forecast docking scores of ligands for specific target proteins, aiming to reduce the need for extensive docking calculations. Our primary goal? Find a regression model that can determine the docking scores of ligands from a chemical library in relation to a target protein. We achieve this with data from explicit docking of a select few molecules.
Among the ML models:
- 🧠 An LSTM-based Neural Network (common in Natural Language Processing tasks like speech recognition). Combined with an attention mechanism, it effectively extracts ligand data. We used Pytorch for this.
- 🌳 Models like XGBoost, Decision Tree Regression, and Stochastic Gradient Descent from libraries like XGBoost and scikit-learn.
- Ensure Python 3.7 is installed 🐍
- Create a virtual environment and execute
pip install -r requirements.txt
📦 - Navigate to 'swifty' and run
sudo chmod -R 777 logs
📑
- Ensure Python 3.8 is installed 🐍
- Create a virtual environment and execute
pip -r apple-silcon-requirements.txt
📦 - Navigate to 'swifty' and run
sudo chmod -R 777 logs
📑
- Add your target to the 'dataset' folder. Follow the format in
sample_input.csv
. - Example: Lets say you want to train the lstm model for sample_input for mac descriptor and a training set size of 50 without cross validation. First, Navigate to
src/models
and run the below command. Note: All possible descriptors are mac, onehot, and morgan_onehot_mac:
python main_lstm.py --input sample_input --descriptors mac --training_sizes 50 --cross_validation False
python main_lstm.py --input <YOUR_INPUT_FILE> --descriptors <DESCRIPTOR> --training_sizes <TRAINING_SIZE> --cross_validation <CROSS_VALIDATION>
This will produce a result directory with 5 categories. Each file follows the format: lstm_target_descriptor_training_size.
- project_info: Details like training size and durations.
- serialized_models: Trained model post-training.
- test_predictions: Each docking score and corresponding model prediction.
- testing_metrics: Metrics such as R-squared, mean absolute error from testing.
- validation_metrics: Metrics from 5-fold cross-validation (only if
--cross_validation True
).
- Training Using Multiple Descriptors
python main_lstm.py --input sample_input --descriptors mac morgan_onehot_mac --training_sizes 50 --cross_validation False
- Training Using Multiple Descriptors and Multiple Training set sizes
python main_lstm.py --input sample_input --descriptors mac morgan_onehot_mac --training_sizes 50 100 --cross_validation False
- Training Using Multiple Descriptors, Multiple Training set sizes and Multiple Targets
python main_lstm.py --input sample_input sample_input_2 --descriptors mac morgan_onehot_mac --training_sizes 50 100 --cross_validation False
Run
python lstm_inference.py --input_file <YOUR_INPUT_FILE> --output_dir <YOUR_OUTPUT_DIRECTORY> --model_name <YOUR_MODEL_NAME>
Ensure than <YOUR_INPUT_FILE> follows the format of molecules_for_prediction.csv in the 'dataset' folder. Example
python lstm_inference.py --input_file molecules_for_prediction.csv --output_dir prediction_results --model_name lstm_target_mac_50_model.pt
- Add your target to the 'dataset' folder. It should match the format of sample_input.csv
- Run this command to prepare the dataset
python create_fingerprint_data.py --input sample_input --descriptors mac
python create_fingerprint_data.py --input <YOUR_INPUT_FILE> --descriptors <DESCRIPTOR>
python create_fingerprint_data.py --input sample_input --descriptors mac morgan_onehot_mac
- Run this to train
python main_ml.py --input sample_input --descriptors mac --training_sizes 50 --regressor sgreg
Command Format
python main_ml.py --input <YOUR_INPUT_FILE> --descriptors <DESCRIPTOR> --training_sizes <TRAINING_SIZE> --regressor <REGRESSOR>
Note: All possible descriptors are mac, morgan_onehot_mac and onehot. All possible regressors are sgreg, xgboost and decision_tree
- Training Using Multiple Descriptors
python main_ml.py --input sample_input --descriptors mac morgan_onehot_mac --training_sizes 50 --regressor sgreg
- Training Using Multiple Descriptors and Multiple Training set sizes
python main_ml.py --input sample_input --descriptors mac morgan_onehot_mac --training_sizes 50 100 --regressor sgreg
- Training Using Multiple Descriptors, Multiple Training set sizes and Multiple Models
python main_ml.py --input sample_input --descriptors mac morgan_onehot_mac --training_sizes 50 100 --regressor sgreg xgboost
This will give you a result directory with similar categories and file formats as mentioned in the LSTM section.
- Your input CSV should match the format of molecules_for_prediction.csv in the 'dataset' folder.
- Run
python other_models_inference.py --input_file <YOUR_INPUT_FILE> --output_dir <YOUR_OUTPUT_DIRECTORY> --model_name <YOUR_MODEL_NAME>
Ensure than <YOUR_INPUT_FILE> follows the format of molecules_for_prediction.csv in the 'dataset' folder.