Chapter 8: Viewing Many Compounds at Once


To see how a data set is distributed, it is common to map it into an appropriate space. In chemoinformatics, this space is usually called chemical space.

What is Chemical Space

Chemical space refers to the arrangement of compounds in an n-dimensional space according to some measure. Two or three dimensions are generally used so that people can interpret the result. Various measures, that is, definitions of similarity, have been proposed; in practice, one chooses a distance that characterizes the compounds well.
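As an illustration of what such a distance can look like, here is a minimal sketch of the Tanimoto (Jaccard) measure, a common choice for comparing fingerprints. It is written in plain Python on sets of on-bit indices; the function names and the toy fingerprints are assumptions made for this example and do not come from the chapter's data.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(a & b) / len(a | b)


def tanimoto_distance(bits_a, bits_b):
    """Distance derived from the similarity: 0 = identical, 1 = nothing shared."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


# Toy fingerprints (hypothetical on-bit indices, not real compounds)
fp_x = {1, 5, 9, 42}
fp_y = {1, 5, 9, 77}
fp_z = {2, 3, 4}

print(tanimoto_similarity(fp_x, fp_y))  # 3 shared bits / 5 bits in total = 0.6
print(tanimoto_distance(fp_x, fp_z))    # no shared bits, so the distance is 1.0
```

Compounds that share most of their substructure bits end up close together under this distance, which is the property a chemical space map relies on.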

This time, we will visualize which pharmaceutical companies are developing what kinds of compounds as antagonists of the orexin receptor, a known target for sleep medication. See Chapter 4 for how to download the data. We use the data from the 10 papers listed in Table 1.

There are two main things I want to know this time:

  • Were there companies that developed similar compounds?

  • Did Merck optimize only similar scaffolds, or did it optimize several different scaffolds?

Table 1. Orexin Receptor Antagonist

Doc ID        | Journal                                      | Pharma
--------------+----------------------------------------------+-----------
CHEMBL3098111 | Bioorg. Med. Chem. Lett. (2013) 23:6620-6624 | Merck
CHEMBL3867477 | Bioorg. Med. Chem. Lett. (2016) 26:5809-5814 | Merck
CHEMBL2380240 | Bioorg. Med. Chem. Lett. (2013) 23:2653-2658 | Rottapharm
CHEMBL3352684 | Bioorg. Med. Chem. Lett. (2014) 24:4884-4890 | Merck
CHEMBL3769367 | J. Med. Chem. (2016) 59:504-530              | Merck
CHEMBL3526050 | Drug Metab. Dispos. (2013) 41:1046-1059      | Actelion
CHEMBL3112474 | Bioorg. Med. Chem. Lett. (2014) 24:1201-1208 | Actelion
CHEMBL3739366 | MedChemComm (2015) 6:947-955                 | Heptares
CHEMBL3739395 | MedChemComm (2015) 6:1054-1064               | Actelion
CHEMBL3351489 | Bioorg. Med. Chem. (2014) 22:6071-6088       | Eisai

Mapping using Euclidean distance

We use ggplot as the drawing library. Principal component analysis (PCA) is used to project the fingerprints onto two dimensions so that similar compounds are placed close together. First, import the necessary libraries.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
import numpy as np
import pandas as pd
from ggplot import *
from sklearn.decomposition import PCA
import os

Load the downloaded SDF files and create a fingerprint for each compound, keeping track of which pharmaceutical company and document ID each compound came from. If anything is unclear, see Chapter 6.

oxrs = [("CHEMBL3098111", "Merck" ),("CHEMBL3867477", "Merck" ),
     ("CHEMBL2380240", "Rottapharm" ),("CHEMBL3352684", "Merck" ),
     ("CHEMBL3769367", "Merck" ),("CHEMBL3526050", "Actelion" ),
     ("CHEMBL3112474", "Actelion" ),("CHEMBL3739366", "Heptares" ),
     ("CHEMBL3739395", "Actelion" ), ("CHEMBL3351489", "Eisai" )]

fps = []
docs = []
companies = []

for cid, company in oxrs:
    sdf_file = os.path.join("ch08", cid + ".sdf")
    mols = Chem.SDMolSupplier(sdf_file)
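    for mol in mols:
        if mol is not None:  # skip records that RDKit could not parse
            # Morgan fingerprint with radius 2 (default length: 2048 bits)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
            # ConvertToNumpyArray resizes arr to the fingerprint length
            arr = np.zeros((1,))
            DataStructs.ConvertToNumpyArray(fp, arr)
            docs.append(cid)
            companies.append(company)
            fps.append(arr)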
fps = np.array(fps)
companies = np.array(companies)
docs = np.array(docs)

If you check the shape of the fingerprint array, you can see that 293 compounds were obtained from the 10 papers.

fps.shape
# (293, 2048)
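Beyond the overall shape, it can be useful to check how many compounds each company and each document contributes, because a heavily unbalanced data set changes how the later plots should be read. The sketch below uses collections.Counter on small stand-in lists; companies_demo and docs_demo are hypothetical values for illustration, and in the chapter's session you would pass the companies and docs arrays instead.

```python
from collections import Counter

# Toy stand-ins for the `companies` and `docs` arrays built above
# (hypothetical values; the real arrays hold one entry per compound).
companies_demo = ["Merck", "Merck", "Actelion", "Eisai", "Merck"]
docs_demo = ["CHEMBL3098111", "CHEMBL3867477", "CHEMBL3526050",
             "CHEMBL3351489", "CHEMBL3098111"]

per_company = Counter(companies_demo)  # number of compounds per company
per_doc = Counter(docs_demo)           # number of compounds per document

# Print the companies from most to least represented
for name, n in per_company.most_common():
    print(name, n)
```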

You are now ready for principal component analysis. The number of principal components is specified with n_components; since we want a two-dimensional scatter plot, we set it to 2.

pca = PCA(n_components=2)
x = pca.fit_transform(fps)
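To make clearer what fit_transform computes, here is a rough NumPy-only sketch of PCA via the singular value decomposition. It is not scikit-learn's implementation, and the random matrix X is a hypothetical stand-in for fps. The idea is to center the data, decompose it, and project onto the first two components. The signs of the axes may differ from what scikit-learn returns, which does not affect the relative positions of the points.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for `fps`: 20 random 64-bit fingerprints
X = rng.integers(0, 2, size=(20, 64)).astype(float)

Xc = X - X.mean(axis=0)  # PCA operates on mean-centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

coords = Xc @ Vt[:2].T           # projection onto the first two components
explained = S**2 / np.sum(S**2)  # fraction of total variance per component

print(coords.shape)    # one (x, y) pair per compound
print(explained[:2])   # variance captured by the two plotted axes
```

With scikit-learn, the corresponding fractions are available as pca.explained_variance_ratio_ after fitting. If the first two components capture only a small share of the variance, the two-dimensional picture should be interpreted with caution.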

Now draw the plot. So that the points can be colored by label, the data frame carries two attributes, COMPANY and DOCID; here the points are colored by COMPANY.

d = pd.DataFrame(x)
d.columns = ["PCA1", "PCA2"]
d["DOCID"] = docs
d["COMPANY"] = companies
g = ggplot(aes(x="PCA1", y="PCA2", color="COMPANY"), data=d) + geom_point() + xlab("X") + ylab("Y")
g
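If the ggplot package is not available in your environment, a similar scatter plot can be drawn with matplotlib. The following is a minimal sketch under that assumption; scatter_by_label is a hypothetical helper written for this example, and the coordinates and labels are toy values. In the chapter's session you would call it as scatter_by_label(x, companies).

```python
import numpy as np
import matplotlib.pyplot as plt


def scatter_by_label(xy, labels, ax=None):
    """Scatter 2-D coordinates with one color and legend entry per label."""
    xy = np.asarray(xy)
    labels = np.asarray(labels)
    if ax is None:
        _, ax = plt.subplots()
    for name in np.unique(labels):
        mask = labels == name  # rows belonging to this label
        ax.scatter(xy[mask, 0], xy[mask, 1], label=str(name), s=15)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.legend()
    return ax


# Toy data (hypothetical coordinates and labels)
xy_demo = np.array([[0.0, 0.1], [1.0, 0.9], [0.2, 0.0], [1.1, 1.0]])
labels_demo = ["Merck", "Actelion", "Merck", "Actelion"]
ax = scatter_by_label(xy_demo, labels_demo)
```

Passing docs instead of companies as the labels colors the points by document ID, which helps when looking at how many distinct series a single company published.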

You can now see what kinds of compounds each pharmaceutical company has optimized. Merck, Actelion, Eisai and Heptares seem to have optimized similar compounds, as there is an overlapping area in the center of the chemical space. It would be interesting to know whether Actelion succeeded by expanding in a direction of its own (lower left), or did not succeed there and moved into the crowded, red-ocean center.

Merck also seems to have optimized a variety of scaffolds. I don't know whether these were optimized in parallel or whether some were run ahead as backups, but there is no doubt that many scaffold optimizations were under way, so it was probably an attractive target. In fact, suvorexant was launched.

[Figure: PCA]

patinformatics

In this chapter we use data from published papers, but such data is not used when this kind of analysis is carried out in practice. When a company publishes a paper, it means the project is over (it either moved on to clinical development or failed and was closed). In practice, the analysis is performed on patent data.

Based on such analysis, together with the experience and insight of its medicinal chemists, a company infers the situation at other companies and moves its own project forward, believing in its success.

Mapping using tSNE

tSNE is said to separate compounds better than PCA and to be closer to a medicinal chemist's intuition. With scikit-learn, you only need to replace PCA with TSNE.

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
tx = tsne.fit_transform(fps)

As you can see from the plot, the compounds are separated better than with PCA.

d = pd.DataFrame(tx)
# The coordinates come from tSNE, so name the columns accordingly
d.columns = ["TSNE1", "TSNE2"]
d["DOCID"] = docs
d["COMPANY"] = companies
g = ggplot(aes(x="TSNE1", y="TSNE2", color="COMPANY"), data=d) + geom_point() + xlab("X") + ylab("Y")
g

[Figure: tSNE]

Many visualization methods exist besides the PCA and tSNE introduced here, so it is worth looking into them.
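As one example of such an alternative, the sketch below shows classical multidimensional scaling (MDS) applied to a Tanimoto distance matrix, using only NumPy. This method is not covered in the chapter; it is included as an assumption-laden illustration of how a map can be built from a chemistry-oriented distance instead of the Euclidean one. The helper names and the random fingerprints are hypothetical.

```python
import numpy as np


def tanimoto_distance_matrix(X):
    """Pairwise Tanimoto distances between the rows of a 0/1 matrix."""
    X = np.asarray(X, dtype=float)
    inter = X @ X.T                      # shared on-bits for each pair
    counts = X.sum(axis=1)               # on-bits per fingerprint
    union = counts[:, None] + counts[None, :] - inter
    # Pairs of all-zero fingerprints are treated as identical (similarity 1)
    sim = np.divide(inter, union, out=np.ones_like(inter), where=union > 0)
    return 1.0 - sim


def classical_mds(D, n_components=2):
    """Classical MDS: embed points so that distances approximate D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    B = -0.5 * J @ (D ** 2) @ J          # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[order], vecs[:, order]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero
    return vecs * np.sqrt(np.clip(vals, 0.0, None))


rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(10, 32))  # hypothetical fingerprints
D_demo = tanimoto_distance_matrix(X_demo)
coords_demo = classical_mds(D_demo)
print(coords_demo.shape)
```

The resulting coordinates can be plotted in the same way as the PCA and tSNE results, colored by COMPANY or DOCID.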