Miniprotein stability from Sequence

This project was part of the 2021 Copenhagen Protein Biohackathon.

We participated in the challenge "Predicting multi mutant miniprotein stability".

Our team name is MARS (Miniprotein stAbility fRom Sequence) 🌕

We built a convolutional variational autoencoder (VAE) in PyTorch to predict miniprotein stability from sequence, using the dataset by Rocklin et al., Science 2017.

The idea is to learn a lower-dimensional embedding of the miniprotein sequence space in the latent space, from which the original sequence can be reconstructed or new sequences can be sampled. In addition, a prediction task estimates the stability score of each miniprotein from its learned embedding in the latent space.
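One common way to write such a combined objective is shown below. It is a sketch, not necessarily the loss used in this repository: the weights \beta and \lambda, the squared-error regression term, and all symbol names are assumptions made for illustration. Here x is the encoded sequence, y the measured stability score, z the latent vector, q_\phi the encoder, p_\theta the decoder, and f_\psi the stability predictor applied to the encoder mean \mu_\phi(x).

```latex
\mathcal{L}(x, y) =
  \underbrace{-\,\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction}}
  + \beta \, \underbrace{D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\right)}_{\text{latent regularization}}
  + \lambda \, \underbrace{\left(f_\psi(\mu_\phi(x)) - y\right)^2}_{\text{stability prediction}}
```

The first two terms are the usual VAE objective. The third ties the latent space to stability, so that nearby latent points tend to have similar predicted scores.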

The advantage of this architecture is that one can sample new sequences around a sequence of interest that is known or predicted to be highly stable.
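A minimal sketch of that sampling step, assuming the encoder has already produced a latent mean vector for the sequence of interest. The function name `sample_around`, the `radius` parameter, and the example vector are hypothetical and not taken from this repository. Each sampled vector would then be passed to the trained decoder and stability predictor.

```python
import random


def sample_around(mu, radius=0.1, n_samples=5, seed=0):
    """Draw latent vectors near `mu` by adding isotropic Gaussian noise.

    `radius` scales the noise: small values stay close to the original
    sequence, larger values explore more distant sequences.
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    return [
        [m + radius * rng.gauss(0.0, 1.0) for m in mu]
        for _ in range(n_samples)
    ]


# Hypothetical 4-dimensional latent mean of a stable miniprotein.
mu_stable = [0.5, -1.2, 0.3, 2.0]
neighbours = sample_around(mu_stable, radius=0.1, n_samples=3)
```

Keeping `radius` small is a design choice: the stability prediction is most trustworthy close to latent points the model has seen during training.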

A sketch of the architecture is as follows:

The results for the single and multi mutant datasets are as follows:

Using a one-hot encoded sequence

Each sequence is one-hot encoded, with 3 extra entries to denote secondary structure.
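The encoding can be sketched as follows. This assumes the 3 extra entries are a per-residue one-hot of the secondary structure state (helix, strand, loop, written `H`, `E`, `L`), giving 23 entries per position. The alphabet order and the function name `encode` are illustrative assumptions, not the repository's actual code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (order is an assumption)
SS_STATES = "HEL"                     # assumed states: helix, strand, loop


def encode(sequence, secondary_structure):
    """Return one row of 23 entries per residue: 20 residue + 3 structure slots."""
    if len(sequence) != len(secondary_structure):
        raise ValueError("sequence and secondary structure must have equal length")
    rows = []
    for aa, ss in zip(sequence, secondary_structure):
        row = [0] * (len(AMINO_ACIDS) + len(SS_STATES))
        # str.index raises ValueError for symbols outside the alphabets.
        row[AMINO_ACIDS.index(aa)] = 1
        row[len(AMINO_ACIDS) + SS_STATES.index(ss)] = 1
        rows.append(row)
    return rows


# Toy 3-residue example with a made-up structure string.
encoded = encode("GSW", "LHE")
```

Each row contains exactly two ones: one marking the residue and one marking its secondary structure state.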

Single mutants: Rp 0.72, Spearman 0.73, p < 0.00001
Multi mutants: Rp 0.47, Spearman 0.35, p < 0.00001

Embedding the input sequence using ProtTransBertBFD

Single mutants: Rp 0.80, Spearman 0.81, p < 0.00001
Multi mutants: Rp 0.53, Spearman 0.39, p < 0.00001
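For readers who want to compute the same kind of metrics on their own predictions, here is a dependency-free sketch of the two correlation coefficients. It assumes Rp denotes the Pearson correlation. The numbers in the example are made up and are not results from this project, and the p-value is not computed here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up stability scores for illustration only.
measured = [0.2, 0.9, 1.4, 1.1, 0.5]
predicted = [0.1, 0.8, 2.5, 1.0, 0.7]
r_pearson = pearson(measured, predicted)
r_spearman = spearman(measured, predicted)
```

In this toy example the predictions preserve the ordering of the measurements, so the Spearman coefficient is 1 while the Pearson coefficient is lower because the relationship is not linear.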

How to run

Create conda environment

conda env create -f environment.yml

Activate environment

conda activate biohackathon

Install bio-embeddings

pip3 install -U pip > /dev/null
pip3 install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git" > /dev/null

Run model

python embeddings_protbert_single.py

Examples were run using PyTorch 1.8.1 and CUDA 10.1 on a machine equipped with 2x GTX 2070, 377 GB RAM, and 2x Intel Xeon Gold 5120 CPU @ 2.20GHz.
