To see how a data set is distributed, it is common to map it into an appropriate space. In chemoinformatics in particular, the term chemical space is used.
Chemical space refers to the arrangement of compounds in an n-dimensional space according to some measure. Two or three dimensions are usually chosen so that people can interpret the result. Various measures of similarity have been proposed; in practice, one usually defines a distance that characterizes the compounds well.
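As one concrete example of such a measure, the Tanimoto (Jaccard) coefficient is often used to compare fingerprints. The sketch below is a minimal pure-Python illustration, assuming each fingerprint is represented as a set of on-bit indices. The function names and the toy bit sets are made up for illustration and are not part of this chapter's code.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints are treated as identical here (an assumption).
        return 1.0
    return len(a & b) / len(union)


def tanimoto_distance(bits_a, bits_b):
    """Distance derived from the similarity: 0 means identical bit sets."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


# Toy fingerprints (hypothetical on-bit indices, not real compounds)
fp_a = {1, 5, 9, 42}
fp_b = {1, 5, 7, 42, 99}

print(tanimoto_similarity(fp_a, fp_b))
print(tanimoto_distance(fp_a, fp_b))
```

The two toy fingerprints share three of six distinct bits, so the similarity is 3/6. Any distance of this kind could in principle define the chemical space; the rest of this section works with the raw fingerprint bits instead.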
Here we visualize which pharmaceutical companies developed what kinds of compounds as antagonists of the orexin receptor, which is known as a target for sleep medication. See Chapter 4 for how to download the data. We use the data from the 10 papers listed in the table below.
There are two main things we want to know:
- Were there companies that developed similar compounds?
- Did Merck optimize only similar frameworks (scaffolds), or did it optimize several different ones?
| Doc ID | Journal | Pharma |
|---|---|---|
| CHEMBL3098111 | | Merck |
| CHEMBL3867477 | | Merck |
| CHEMBL2380240 | | Rottapharm |
| CHEMBL3352684 | | Merck |
| CHEMBL3769367 | | Merck |
| CHEMBL3526050 | | Actelion |
| CHEMBL3112474 | | Actelion |
| CHEMBL3739366 | | Heptares |
| CHEMBL3739395 | | Actelion |
| CHEMBL3351489 | | Eisai |
We use ggplot as the plotting library. Principal component analysis (PCA) is used to project the compounds so that similar ones land close together in the plot. First, import the necessary libraries.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
import numpy as np
import pandas as pd
from ggplot import *
from sklearn.decomposition import PCA
import os
Load the downloaded SDF files and create a fingerprint for each compound, keeping track of which pharmaceutical company and document ID each compound belongs to. If anything is unclear, see Chapter 6.
oxrs = [("CHEMBL3098111", "Merck" ),("CHEMBL3867477", "Merck" ),
("CHEMBL2380240", "Rottapharm" ),("CHEMBL3352684", "Merck" ),
("CHEMBL3769367", "Merck" ),("CHEMBL3526050", "Actelion" ),
("CHEMBL3112474", "Actelion" ),("CHEMBL3739366", "Heptares" ),
("CHEMBL3739395", "Actelion" ), ("CHEMBL3351489", "Eisai" )]
fps = []
docs = []
companies = []
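for cid, company in oxrs:
    # One SDF file per ChEMBL document ID, stored under ch08/
    sdf_file = os.path.join("ch08", cid + ".sdf")
    mols = Chem.SDMolSupplier(sdf_file)
    for mol in mols:
        # SDMolSupplier yields None for records RDKit cannot parse
        if mol is not None:
            # Morgan fingerprint with radius 2 (2048 bits by default)
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
            arr = np.zeros((1,))
            # Copy the bit vector into a NumPy array (arr is resized)
            DataStructs.ConvertToNumpyArray(fp, arr)
            docs.append(cid)
            companies.append(company)
            fps.append(arr)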
fps = np.array(fps)
companies = np.array(companies)
docs = np.array(docs)
Checking the shape of the fingerprint array shows that data for 293 compounds were obtained from the 10 papers.
fps.shape
# (293, 2048)
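To see how those compounds are split across companies or papers, the label arrays can be tallied. The following is a minimal sketch that uses a short made-up label list in place of the real `companies` array; the helper name `count_labels` is hypothetical.

```python
from collections import Counter


def count_labels(labels):
    """Return (label, count) pairs sorted from most to least frequent."""
    return Counter(labels).most_common()


# Toy labels only; the real counts come from the companies/docs arrays.
toy_labels = ["A", "A", "B", "A", "C", "B"]

for label, n in count_labels(toy_labels):
    print(label, n)
```

The same helper should work on the `docs` array as well, which gives the number of compounds per paper.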
We are now ready for principal component analysis. The number of principal components is specified with n_components; since we want a two-dimensional scatter plot, we set it to 2.
pca = PCA(n_components=2)
x = pca.fit_transform(fps)
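To make the idea behind this projection more concrete, here is a small pure-Python sketch of how a first principal component can be found: center the data, build the covariance matrix, and run power iteration. It is only an illustration on toy 2-D points, not a description of how scikit-learn computes PCA, and all function names are hypothetical.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dim = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dim)] for i in range(dim)]


def first_component(cov, n_iter=200):
    """Dominant eigenvector of cov via power iteration (unit length)."""
    dim = len(cov)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate data: no variance at all
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Coordinate of each centered row along the given component."""
    return [sum(a * b for a, b in zip(row, component)) for row in centered]


# Toy points lying roughly along the line y = 2x
points = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]
centered = center(points)
pc1 = first_component(covariance(centered))
scores = project(centered, pc1)
print(pc1)
print(scores)
```

For the toy points, the component points along the direction in which the data vary most, and the scores are the one-dimensional coordinates of the points along it. The two columns of `x` above play the same role for the first two components of the fingerprint data.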
Now draw the plot. So that the color can be switched between labels, we add two attributes to the data frame, COMPANY and DOCID.
d = pd.DataFrame(x)
d.columns = ["PCA1", "PCA2"]
d["DOCID"] = docs
d["COMPANY"] = companies
g = ggplot(aes(x="PCA1", y="PCA2", color="COMPANY"), data=d) + geom_point() + xlab("X") + ylab("Y")
g
You can now see what kinds of compounds each pharmaceutical company optimized. Merck, Actelion, Eisai, and Heptares seem to have optimized similar compounds, since their points overlap in the center of the chemical space. It would be interesting to know whether Actelion successfully expanded in a unique direction (lower left), or could not expand there and moved into the crowded, red-ocean center.
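One rough way to put a number on this kind of overlap is to compare the centroid of each group's points in the projected space. The sketch below assumes the coordinates and labels are available as plain Python lists. The helper names and the toy coordinates are invented for illustration and do not come from the actual plot.

```python
import math
from collections import defaultdict


def group_centroids(points, labels):
    """Mean (x, y) position of the points belonging to each label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in zip(points, labels):
        acc = sums[label]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def centroid_distance(centroids, a, b):
    """Euclidean distance between the centroids of groups a and b."""
    (ax, ay), (bx, by) = centroids[a], centroids[b]
    return math.hypot(ax - bx, ay - by)


# Toy coordinates: groups A and B overlap, group C sits far away.
toy_points = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (1.0, -1.0),
              (10.0, 10.0), (12.0, 10.0)]
toy_groups = ["A", "A", "B", "B", "C", "C"]

cents = group_centroids(toy_points, toy_groups)
print(centroid_distance(cents, "A", "B"))
print(centroid_distance(cents, "A", "C"))
```

A centroid is a crude summary, since a company working on several distinct series can have a centroid that lies between its clusters, so this is best read alongside the scatter plot rather than instead of it.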
Merck also seems to have optimized a variety of frameworks. We cannot tell whether these were optimized in parallel or pursued as backups, but there is no doubt that many scaffold optimizations were under way, so this was probably an attractive target. Indeed, suvorexant was launched.
In this chapter we use data from published papers, but such data are not used when this kind of analysis is performed in practice. When a company publishes a paper, it usually means that the project is over (whether it advanced to clinical trials or failed and was closed). In practice, the analysis is performed using patent data.
Based on such analyses, together with the experience and insight of medicinal chemists, a project moves forward with belief in its own success while inferring the situation at other companies.
t-SNE is said to give better separation than PCA and to be closer to a medicinal chemist's intuition. With scikit-learn, you only need to replace PCA with TSNE.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
tx = tsne.fit_transform(fps)
As the plot shows, the compounds are separated better than with PCA.
d = pd.DataFrame(tx)
# Name the columns after t-SNE rather than PCA to avoid confusion
d.columns = ["TSNE1", "TSNE2"]
d["DOCID"] = docs
d["COMPANY"] = companies
g = ggplot(aes(x="TSNE1", y="TSNE2", color="COMPANY"), data=d) + geom_point() + xlab("X") + ylab("Y")
g
There are many other visualization methods besides the PCA and t-SNE introduced here, so it is worth looking into them.