-
Notifications
You must be signed in to change notification settings - Fork 12
/
6-hopach.Rmd
105 lines (72 loc) · 3.41 KB
/
6-hopach.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
---
title: "HOPACH: hierarchical medoid clustering"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Highlights
* Hierarchical cluster algorithm that is divisive (top down), but also will consider combining clusters if beneficial.
* Based on medoids: each cluster is based on a representative observation with the smallest distance to other cluster members.
* Orders clusters and observations within clusters based on the distance metric.
* The hierarchical tree splits don't have to be binary - could split into 3 or more clusters.
* It will automatically determine the recommended number of clusters.
## Load data
```{r load_data}
# From 1-clean-data.Rmd
data = rio::import("data/clean-data-imputed.RData")
str(data)
# Convert factors to indicators
result = ck37r::factors_to_indicators(data, verbose = TRUE)
data = result$data
# Standardize the data so that distances are more comparable.
# NOTE: we might skip binary variables in this step.
data = ck37r::standardize(data)
```
## Basic HOPACH
```{r basic_hopach}
library(hopach)
# We use the numeric data matrix here, which has converted factors to indicators.
distances = distancematrix(data,
d = "euclid",
# d = "cosangle",
na.rm = TRUE)
# This is an n x n matrix.
dim(distances)
hobj = hopach(data, dmat = distances, newmed = "nn",
d = "euclid")
# d = "cosangle")
# Number of clusters identified.
hobj$clust$k
# Review sizes of each cluster.
hobj$clust$sizes
# Review the cluster assignment vector.
# These counts correspond to the cluster sizes that we reviewed.
table(hobj$clustering$labels)
# Extract medoids - representative observations for each cluster.
# These are the row indices from the dataframe.
hobj$clustering$medoids
# This plot is recommended but does not seeem that useful.
# Red = small distance, white = large distance.
dplot(distances, hobj, ord = "final",
main = "Distance matrix", showclusters = FALSE)
```
## Bootstrap analysis
```{r bootstrap, fig.width = 6, fig.height = 6}
# Bootstrap analysis
# TODO: identify how to set seed.
boot_result = boothopach(data, hobj, B = 100)
# We want to see large bands of the same color, which indicate cluster stability.
# May need to click "zoom" to view the plot.
bootplot(boot_result, hobj, ord = "bootp",
main = "Bootstrap plot", showclusters = FALSE)
```
## Challenges
1. Change the distance metric to "cosangle" and re-run. How do your results compare?
2. Cluster the variables rather than the observations by transposing the data frame with `t()` and re-running the distance calculation and clustering. Examine the medoids to find the representative variables in the dataset.
3. Don't standardize the data and re-run. How do the results compare? What if you standardize everything except the binary variables?
## Resources
* [Bioconductor's HOPACH package](https://www.bioconductor.org/packages/release/bioc/vignettes/hopach/inst/doc/hopach.pdf)
## References
van der Laan, M. J., & Pollard, K. S. (2003). A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap. Journal of Statistical Planning and Inference, 117(2), 275-303.
Pollard, K. S., & van der Laan, M. J. (2005). Cluster analysis of genomic data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor (pp. 209-228). Springer, New York, NY.