Chapter 8: Want to have lots of compounds at once

In order to see how much data is distributed, it is common to map in an appropriate space. Especially in chemoinformatics the word chemical space is used.

What is Chemical Space

Chemical space refers to the arrangement of compounds in an n-dimensional space at some scale. In general, two or three dimensions are often used (for human understanding). Although various methods have been proposed for the scale, ie, similarity, it is often decided that a distance that well characterizes a compound is defined.

This time, we will visualize which pharmaceutical company is developing what kind of compound for the antagonist of Orexin Receptor, which is known as a target for sleep medicine. See Chapter 4 for how to download data. This time we used the data of 10 papers in the table.

There are two main things I want to know this time:

Were there companies that developed similar compounds?
Has Merck optimized only similar frameworks, or did it optimize multiple frameworks?

Table 1. Orexin Receptor Antagonist

Doc ID	Journal	Pharma
CHEMBL3098111	Bioorg. Med. Chem. Lett. (2013) 23:6620-6624	Merck
CHEMBL3867477	Bioorg Med Chem Lett (2016) 26:5809-5814	Merck
CHEMBL2380240	Bioorg. Med. Chem. Lett. (2013) 23:2653-2658	Rottapharm
CHEMBL3352684	Bioorg. Med. Chem. Lett. (2014) 24:4884-4890	Merck
CHEMBL3769367	J. Med. Chem. (2016) 59:504-530	Merck
CHEMBL3526050	Drug Metab. Dispos. (2013) 41:1046-1059	Actelion
CHEMBL3112474	Bioorg. Med. Chem. Lett. (2014) 24:1201-1208	Actelion
CHEMBL3739366	MedChemComm (2015) 6:947-955	Heptares
CHEMBL3739395	MedChemComm (2015) 6:1054-1064	Actelion
CHEMBL3351489	Bioorg. Med. Chem. (2014) 22:6071-6088	Eisai

Mapping using Euclidean distance

Use ggplot for the drawing library. Principal component analysis (PCA) is used to distribute and visualize similar compounds close together. At first we import necessary library

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
import numpy as np
import pandas as pd
from ggplot import *
from sklearn.decomposition import PCA
import os

Load the downloaded sdf, and create fingerprints for each compound, enabling correspondence between drug companies and document IDs. If you have any questions please check Chapter 6.

oxrs = [("CHEMBL3098111", "Merck" ),("CHEMBL3867477", "Merck" ),
　　　　　("CHEMBL2380240", "Rottapharm" ),("CHEMBL3352684", "Merck" ),
　　　　　("CHEMBL3769367", "Merck" ),("CHEMBL3526050", "Actelion" ),
　　　　　("CHEMBL3112474", "Actelion" ),("CHEMBL3739366", "Heptares" ),
　　　　　("CHEMBL3739395", "Actelion" ), ("CHEMBL3351489", "Eisai" )]

fps = []
docs = []
companies = []

for cid, company in oxrs:
    sdf_file = os.path.join("ch08", cid + ".sdf")
    mols = Chem.SDMolSupplier(sdf_file)
    for mol in mols:
        if mol is not None:
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
            arr = np.zeros((1,))
            DataStructs.ConvertToNumpyArray(fp, arr)
            docs.append(cid)
            companies.append(company)
            fps.append(arr)
fps = np.array(fps)
companies = np.array(companies)
docs = np.array(docs)

If you check the information of the fingerprint, you can see that data of 293 compounds are obtained from 10 articles.

fps.shape
# (293, 2048)

You are now ready for principal component analysis. The number of principal components can be specified by n_components, but this time, I want to scatter two dimensions, so I set it to 2.

pca = PCA(n_components=2)
x = pca.fit_transform(fps)

Draw. I changed the color option according to each label, so I chose two attributes, COMPANY and DOCID.

d = pd.DataFrame(x)
d.columns = ["PCA1", "PCA2"]
d["DOCID"] = docs
d["COMPANY"] = companies
g = ggplot(aes(x="PCA1", y="PCA2", color="COMPANY"), data=d) + geom_point() + xlab("X") + ylab("Y")
g

You can now see what compounds each pharmaceutical company has optimized. Merck, Acterion, Eisai and Heptaress seem to have optimized similar compounds, as there is an overlapping area in the center of the chemical space. It is interesting to see whether the Acterion has been successfully deployed in a unique direction (lower left) or has not been deployed and has advanced into the red ocean center.

Also, Merck seems to have optimized various frameworks. I don’t know if I’m optimizing at the same time or running ahead for backup, but it’s no doubt that there were a lot of skeletal optimizations running, so it’s probably an attractive target. In fact, SUVOREXANT was launched.

patinformatics

In this chapter, we use dissertation data, but we do not use dissertation data when performing such analysis in a real field. Because when a company disseminates, it means that the project is over (whether it went to clinical or failed and closed). In the actual situation, analysis is performed using patent data.

Based on the analysis and experience of Medicinal Chemist and the insights of these companies, the project will proceed with a belief in their own successes while inferring the situation of other companies.

Mapping using tSNE

It is said that tSNE has better resolution than PCA and is closer to the sense of medicinal chemist. Sklearn just changes PCA to TSNE.

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
tx = tsne.fit_transform(fps)

As you can see when drawing, it is separated better than PCA.

d = pd.DataFrame(tx)
d.columns = ["PCA1", "PCA2"]
d["DOCID"] = docs
d["COMPANY"] = companies
g = ggplot(aes(x="PCA1", y="PCA2", color="COMPANY"), data=d) + geom_point() + xlab("X") + ylab("Y")
g

There are many other drawing methods besides PCA and tSNE introduced this time, so it is good to check.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ch08_visualization.asciidoc

ch08_visualization.asciidoc

Chapter 8: Want to have lots of compounds at once

What is Chemical Space

Mapping using Euclidean distance

Mapping using tSNE

Files

ch08_visualization.asciidoc

Latest commit

History

ch08_visualization.asciidoc

File metadata and controls

Chapter 8: Want to have lots of compounds at once

What is Chemical Space

Mapping using Euclidean distance

Mapping using tSNE