To make ggplot plots of hierarchical clusters, we’ll need the ggplot extension ggdendro. If you don’t already have it installed, uncomment the code in the chunk below and install it now.
#install.packages('ggdendro')
We’re going to use some simulated data from a normal distribution to demo hierarchical clustering. Run the code in the chunk below before continuing.
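library(tidyverse)  # tibble(), %>%, column_to_rownames(), and ggplot()
library(ggdendro)   # dendro_data(), ggdendrogram(), and theme_dendro()

set.seed(42)
tibble(x = rnorm(6, mean = 5, sd = 2),
       y = x + runif(6),
       label = c('A', 'B', 'C', 'D', 'E', 'F')) -> clust_demo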
dist()
The clustering algorithm requires a distance matrix, which holds the distance between every pair of rows in the table. By default, dist() calculates the Euclidean distance.
We need to convert our data into a distance matrix using the function dist().
clust_demo %>%
  # make the tibble into a dataframe because we need to make the labels
  # into rownames and tibbles don't allow rownames
  as.data.frame() %>%
  # turn the data labels into rownames so they're carried through the distance
  # matrix and hierarchical clustering calculations
  column_to_rownames('label') %>%
  # use dist() to calculate a distance matrix
  dist(.) -> clust_demo_dist
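To make the calculation concrete, here is a minimal sketch that computes one Euclidean distance by hand and compares it with dist(). The two points and the names pts, manual, and d are made up for illustration; they are not part of the demo data.

```r
# Two hypothetical points, chosen only for illustration
pts <- data.frame(x = c(0, 3), y = c(0, 4), row.names = c("P", "Q"))

# Euclidean distance by hand: sqrt((x1 - x2)^2 + (y1 - y2)^2)
manual <- sqrt((pts["P", "x"] - pts["Q", "x"])^2 +
               (pts["P", "y"] - pts["Q", "y"])^2)

# dist() uses method = "euclidean" unless told otherwise
d <- dist(pts)

manual        # the hand-calculated distance between P and Q
as.matrix(d)  # the same distance, laid out as a full symmetric matrix
```

Because the row names are carried into the result, the matrix rows and columns are labeled P and Q, which is why the tutorial moves the labels into row names before calling dist().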
hclust()
The hclust() function does the hierarchical clustering calculations.
hclust(clust_demo_dist) -> clust_demo_hclust
You can use the base R plot() function to plot the hclust object directly.
plot(clust_demo_hclust)
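If you want to see what hclust() actually stores, the sketch below pokes at the pieces of an hclust object. It uses small made-up one-dimensional data (pts and hc are hypothetical names) so the joins are easy to follow by eye; it is an illustration, not part of the demo analysis.

```r
# Hypothetical one-dimensional data, chosen so the joins are easy to follow
pts <- data.frame(value = c(1, 2, 6, 7, 20),
                  row.names = c("a", "b", "c", "d", "e"))

# method = "complete" is the default linkage rule in hclust()
hc <- hclust(dist(pts))

hc$merge              # which observations or clusters were joined at each step
hc$height             # the distance at which each join happened
hc$labels[hc$order]   # the left-to-right leaf order used when the tree is drawn

# a different linkage rule can give different join heights for the same data
hc_single <- hclust(dist(pts), method = "single")
hc_single$height
```

The heights are what end up on the vertical axis of the dendrogram, so a tall branch means two groups were far apart when they were joined.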
ggplot using ggdendro
Plotting the hclust clustering with ggplot requires more wrangling, but the ggdendro package will do most of it for us.
# start with the saved hclust object
clust_demo_hclust %>%
  # as.dendrogram() turns the hclust results into a special dendrogram class
  # that R uses for representing any kind of tree
  as.dendrogram() %>%
  # dendro_data() turns the dendrogram class data into numbers that are plottable
  dendro_data() -> clust_demo_ggdendro
ggdendrogram()
Plot using ggdendro’s helper function, ggdendrogram(). It plots the dendrogram as a ggplot object.
ggdendrogram(clust_demo_ggdendro)
You can modify this like any other ggplot plot. For example, I don’t like the y axis labels, so I’ll remove them.
ggdendrogram(clust_demo_ggdendro) +
  theme(axis.text.y = element_blank())
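ggdendrogram() also takes a few arguments of its own. As a rough sketch, the example below uses its rotate argument to draw the tree sideways and then adds a title with an ordinary ggplot layer. The data and the names pts, hc, and p are hypothetical stand-ins so the chunk runs on its own.

```r
library(ggplot2)
library(ggdendro)

# Hypothetical data, used only so there is a tree to draw
pts <- data.frame(value = c(1, 2, 6, 7, 20),
                  row.names = c("a", "b", "c", "d", "e"))
hc <- hclust(dist(pts))

# rotate = TRUE draws the tree sideways; the result is still a ggplot object,
# so more layers can be added with +
p <- ggdendrogram(dendro_data(as.dendrogram(hc)), rotate = TRUE) +
  labs(title = "Rotated dendrogram")
p
```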
ggplot()
However, you can’t see (unless you look at the package code) exactly which ggplot commands ggdendrogram() is running, which is super annoying when you want to make adjustments to the plot. Fortunately, as you can see in the dendro_data() table below, all the numbers are there, so we can just plot it ourselves.
clust_demo_ggdendro
## $segments
## x y xend yend
## 1 2.6875 5.9744985 1.500 5.9744985
## 2 1.5000 5.9744985 1.500 1.2034621
## 3 1.5000 1.2034621 1.000 1.2034621
## 4 1.0000 1.2034621 1.000 0.0000000
## 5 1.5000 1.2034621 2.000 1.2034621
## 6 2.0000 1.2034621 2.000 0.0000000
## 7 2.6875 5.9744985 3.875 5.9744985
## 8 3.8750 5.9744985 3.875 3.2020659
## 9 3.8750 3.2020659 3.000 3.2020659
## 10 3.0000 3.2020659 3.000 0.0000000
## 11 3.8750 3.2020659 4.750 3.2020659
## 12 4.7500 3.2020659 4.750 1.1513913
## 13 4.7500 1.1513913 4.000 1.1513913
## 14 4.0000 1.1513913 4.000 0.0000000
## 15 4.7500 1.1513913 5.500 1.1513913
## 16 5.5000 1.1513913 5.500 0.6038454
## 17 5.5000 0.6038454 5.000 0.6038454
## 18 5.0000 0.6038454 5.000 0.0000000
## 19 5.5000 0.6038454 6.000 0.6038454
## 20 6.0000 0.6038454 6.000 0.0000000
##
## $labels
## x y label
## 1 1 0 B
## 2 2 0 F
## 3 3 0 A
## 4 4 0 D
## 5 5 0 C
## 6 6 0 E
##
## $leaf_labels
## NULL
##
## $class
## [1] "dendrogram"
##
## attr(,"class")
## [1] "dendro"
Now we can plot it with ggplot().
# for once, don't put any data in ggplot() !!!
# the stem and label information is in separate tables, so we want to supply
# separate data to separate geoms
ggplot() +
  # the segments table contains the information for plotting branches, so supply
  # that to geom_segment() to plot the branches of the tree
  geom_segment(data = clust_demo_ggdendro$segments,
               aes(x = x, y = y, xend = xend, yend = yend)) +
  # the labels table has the labels for the ends of the branches, so supply that
  # to geom_text() to label the ends of the branches
  geom_text(data = clust_demo_ggdendro$labels,
            aes(x = x, y = y, label = label), vjust = 1.25) +
  theme_dendro()
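Once the plot is built by hand, adjustments are just ordinary ggplot changes. As one possible example, the sketch below colors the leaf labels by cluster. It assumes you want two groups and uses base R's cutree() to get them; the choice of k = 2, the made-up data, and the names pts, hc, dd, groups, and leaf_labels are all assumptions for illustration.

```r
library(ggplot2)
library(ggdendro)

# Hypothetical data, used only so there is a tree to draw
pts <- data.frame(value = c(1, 2, 6, 7, 20),
                  row.names = c("a", "b", "c", "d", "e"))
hc <- hclust(dist(pts))
dd <- dendro_data(as.dendrogram(hc))

# cutree() assigns each observation to one of k groups; k = 2 is arbitrary here
groups <- cutree(hc, k = 2)

# look up each leaf's group by its label and store it next to the label positions
leaf_labels <- dd$labels
leaf_labels$group <- factor(groups[as.character(leaf_labels$label)])

p <- ggplot() +
  # branches, exactly as before
  geom_segment(data = dd$segments,
               aes(x = x, y = y, xend = xend, yend = yend)) +
  # labels, now with colour mapped to the cluster each leaf belongs to
  geom_text(data = leaf_labels,
            aes(x = x, y = y, label = label, colour = group),
            vjust = 1.25) +
  theme_dendro()
p
```

Because the labels live in their own table, adding a column to that table is all it takes to map a new aesthetic; the branch data is untouched.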