emanskaia/Nextflow_MBB659-G100_Similarity


Molecular Similarity Analysis Workflow

Introduction

Molecular similarity analysis is a core technique in drug discovery. By comparing molecular structures, it helps identify potential drug candidates, optimize lead compounds, and guide structure-activity relationship (SAR) studies.

Similarity methods let researchers explore chemical space efficiently and predict pharmacological and pharmacokinetic properties. This reduces the cost and time of the drug discovery process and supports the rational design and selection of compounds with desired bioactivities.

The primary objective of this research project is to design a filtering process that eliminates unsuitable or less suitable molecules while preserving potential drug candidates in the drug discovery process. The project runs a series of three processes, where the output of each process is the input to the next. Implemented in Nextflow, the workflow analyzes the similarity between small molecules, specifically drug candidates, and known drugs based on their molecular structures.

For this analysis, a set of molecules known to be good drug candidates is compared with a second dataset of molecules. If molecules from the second dataset are similar enough to the known drug candidates, they may be added to the conventional drug discovery process. The picture below shows the workflow (in the square box) as part of the larger drug discovery process.

Slide1

Process 1: Calculate Similarity

The first process, "calculateSimilarity," operates on a dataset of small molecules represented as SMILES strings. The dataset contains two columns, "SMILES_RESULT" and "SMILES_QUERY," holding the molecular structures of known drugs and drug candidates, respectively. The goal is to assess the similarity between the drug candidates (SMILES_QUERY) and the known drugs (SMILES_RESULT). Drug candidates that are sufficiently similar to known drugs can be explored further for characteristics such as affinity for a target and ADME (Absorption, Distribution, Metabolism, Excretion).

The script for this process, runSimScore.py, uses the RDKit library to calculate a similarity score for each pair of molecules. The input file, “CS_for_sim.csv”, is provided in the data folder. In addition to the two SMILES columns, it contains columns for dataset, product type, price, supplier, and ID, because the data comes from a similarity search of a real vendor database. The output file, “output.csv”, has an extra column with the calculated similarity score, which ranges from 0 to 1.
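The README does not state which fingerprint or metric runSimScore.py uses. A common choice for scores between 0 and 1 is the Tanimoto coefficient on molecular fingerprints (available in RDKit as `DataStructs.TanimotoSimilarity`). As a minimal sketch of the metric only, assuming Tanimoto is what is meant, the example below computes it on plain Python sets that stand in for fingerprint on-bits; the bit indices are made up for illustration.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto (Jaccard) coefficient of two sets of fingerprint on-bits."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical on-bit indices (not real fingerprints).
known_drug = {1, 4, 7, 9, 15}
candidate_close = {1, 4, 7, 9, 22}   # shares 4 of 6 distinct bits
candidate_far = {2, 30, 41}          # shares no bits

print(tanimoto(known_drug, known_drug))       # identical sets give 1.0
print(tanimoto(known_drug, candidate_close))  # 4 / 6
print(tanimoto(known_drug, candidate_far))    # disjoint sets give 0.0
```

A score of 1 means the two fingerprints are identical and 0 means they share no features, which matches the 0 to 1 range of the score column.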

Input file:

image

Output file:

image

Process 2: Extract Top 10%

The second process, "extractTop10," takes the similarity scores calculated in Process 1 and extracts the top 10% of values along with their corresponding molecule pairs. These pairs represent the drug candidates most similar to known drugs, which can be investigated further. The script is called extractTop10.py. It takes “output.csv” from Process 1 as input and writes the selected subset of molecular pairs to “top_10_percent.csv” for downstream analysis.
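The selection step can be sketched with the standard library alone. This is an illustration, not the contents of extractTop10.py: the score column name `SIMILARITY` is hypothetical, the in-memory CSV stands in for “output.csv”, and "top 10%" is interpreted here as the highest-scoring tenth of the rows, rounded up.

```python
import csv
import io
import math


def top_percent(rows, score_column, percent=10):
    """Return the highest-scoring `percent` of rows, best first."""
    ranked = sorted(rows, key=lambda r: float(r[score_column]), reverse=True)
    # Integer product before dividing avoids float surprises such as 30 * 0.1.
    keep = math.ceil(len(ranked) * percent / 100)
    return ranked[:keep]


# Stand-in for output.csv: 20 made-up pairs with scores 0.00 .. 0.95.
sample = io.StringIO(
    "SMILES_QUERY,SMILES_RESULT,SIMILARITY\n"
    + "".join(f"C{'C' * i},CCO,{i / 20:.2f}\n" for i in range(20))
)
rows = list(csv.DictReader(sample))

top = top_percent(rows, "SIMILARITY")
print(len(top), [r["SIMILARITY"] for r in top])  # 2 rows: 0.95 and 0.90
```

An alternative reading of "top 10%" is every row at or above the 90th-percentile score, which can return more rows when scores are tied; the README does not say which the script uses.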

image

Process 3: Visualization

The third process, "visualization," generates a distribution plot of the similarity scores for the top 10% of molecular pairs obtained from Process 2. The plot shows how the scores are distributed within the selected subset. The script is called visualize_top_pairs.py. It takes “top_10_percent.csv” from Process 2 as input and writes the plot to “distribution_plot.png”. The plot is shown below.

image
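The README does not say which plotting call visualize_top_pairs.py uses (seaborn is listed among the dependencies). As a rough sketch of what a histogram-style distribution plot summarizes, the example below counts scores per equal-width bin with plain Python; the scores and the function name `bin_counts` are made up for illustration, and scores are assumed to lie within the given range.

```python
def bin_counts(scores, n_bins=10, low=0.0, high=1.0):
    """Count how many scores fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for s in scores:
        # A score equal to `high` is placed in the last bin.
        idx = min(int((s - low) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical top-10% similarity scores.
scores = [0.81, 0.83, 0.86, 0.92, 0.97, 1.0]

for i, count in enumerate(bin_counts(scores)):
    if count:
        print(f"bin {i}: {'#' * count}")
```

Each printed bar corresponds to one bar of the histogram; a plotting library draws the same counts graphically.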

Workflow

The three processes are orchestrated in a workflow, where the output of each process serves as the input for the subsequent one. This ensures a seamless analysis of molecular similarity and facilitates the identification of promising drug candidates.

Usage

To execute the workflow, follow these steps:

Set up

Install Git, Conda, Nextflow, and Docker before running this pipeline, following each tool's installation instructions. The full list of dependencies is below:

  • nextflow 23.10.0
  • python=3.10
  • rdkit=2023.9.1
  • seaborn
  • pandas=2.1.2

Using Conda

1. Clone the repository into your home directory.

cd # with no arguments, cd takes you to your home directory

git clone https://github.com/emanskaia/Nextflow_MBB659-G100_Similarity.git

2. Navigate to the project directory

cd Nextflow_MBB659-G100_Similarity

3. Create the environment. In the root of the repository run:

conda env create --file environment.yml

Running the Analysis

Activate the environment

conda activate similarity

From the root of the repository, run

nextflow Manskaia_project.nf

image

Using Docker

Note: Docker did not run on all of the machines tried, but the Dockerfile is provided in the repository.

1. Install and launch Docker on your computer.

2. Clone the repository

git clone https://github.com/emanskaia/Nextflow_MBB659-G100_Similarity.git

3. Navigate to the project directory

cd Nextflow_MBB659-G100_Similarity

4. Running the analysis

docker compose up jupyter-lab

In the terminal, look for a URL that starts with http://127.0.0.1:8888/lab?token= (for an example, see the highlighted text in the terminal below). Copy and paste that URL into your browser.

image

References

  1. Warr, W. A., Nicklaus, M. C., Nicolaou, C. A., & Rarey, M. (2022). Exploration of Ultralarge Compound Collections for Drug Discovery. Journal of Chemical Information and Modeling, 62(9), 2021-2034. https://doi.org/10.1021/acs.jcim.2c00224
  2. Torab-Miandoab, A., Poursheikh Asghari, M., Hashemzadeh, N., et al. (2023). Analysis and identification of drug similarity through drug side effects and indications data. BMC Medical Informatics and Decision Making, 23, 35. https://doi.org/10.1186/s12911-023-02133-3
