Visualizing clustering dendrogram in R | Hierarchical clustering
AGRON Stats
August 20, 2020
Introduction
This tutorial shows how to produce enhanced visualizations of a clustering dendrogram in RStudio. It covers the following topics:
- Distance matrix computation
- Hierarchical clustering
- Customizing dendrogram
Let’s start in RStudio.
Import data
The data come from USArrests, a demo data set that ships with R. The head() function prints the first six rows of the data set, and the str() function displays its internal structure.
head(USArrests)
# Murder Assault UrbanPop Rape
# Alabama 13.2 236 58 21.2
# Alaska 10.0 263 48 44.5
# Arizona 8.1 294 80 31.0
# Arkansas 8.8 190 50 19.5
# California 9.0 276 91 40.6
# Colorado 7.9 204 78 38.7
str(USArrests)
# 'data.frame': 50 obs. of 4 variables:
# $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
# $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
# $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
# $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
Scale data set
To scale the data, use the scale() function. scale() is a generic function whose default method centers and/or scales the columns of a numeric matrix. If center = TRUE, each column is centered by subtracting its mean. If scale = TRUE, the (centered) columns are then divided by their standard deviations.
data.scaled <- scale(x = USArrests,
                     center = TRUE,
                     scale = TRUE)
head(data.scaled)
# Murder Assault UrbanPop Rape
# Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
# Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
# Arizona 0.07163341 1.4788032 0.9989801 1.042878388
# Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
# California 0.27826823 1.2628144 1.7589234 2.067820292
# Colorado 0.02571456 0.3988593 0.8608085 1.864967207
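As a quick check of what center = TRUE and scale = TRUE do, the sketch below recomputes the scaled values by hand and compares them with the output of scale(). The object names x, col_means, col_sds, manual and scaled are chosen here for illustration and are not part of the original code.

```r
# Recompute centering and scaling by hand (illustrative names, base R only)
x <- as.matrix(USArrests)

col_means <- colMeans(x)        # one mean per column
col_sds   <- apply(x, 2, sd)    # one standard deviation per column

# Subtract each column mean, then divide by each column standard deviation
manual <- sweep(sweep(x, 2, col_means, "-"), 2, col_sds, "/")

scaled <- scale(x = USArrests, center = TRUE, scale = TRUE)

# TRUE if both approaches agree (attributes added by scale() are ignored)
all.equal(manual, scaled, check.attributes = FALSE)
```

After scaling, every column has mean 0 and standard deviation 1, so no single variable dominates the distances computed in the next step.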
Distance matrix computation
A distance matrix can be computed with the dist() function, which returns the distances between the rows of a data matrix using the specified distance measure. dist() belongs to the stats package, which is attached by default in a standard R session, so the require(stats) call below is only a safeguard.
The argument x specifies a numeric matrix, data frame or “dist” object. The second argument, method, specifies the distance measure to be used. The method must be one of the following:
"euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"
To print the result, convert the res.dist object to a matrix with the as.matrix() function and select the rows and columns to show with square brackets.
Missing values are allowed, and are excluded from all computations involving the rows within which they occur.
require(stats)
# Distance matrix computation
res.dist = dist(x = data.scaled,
                method = "euclidean")
# Print distance matrix
output = as.matrix(res.dist)[1:6, 1:6]
round(output, digits = 3)
# Alabama Alaska Arizona Arkansas California Colorado
# Alabama 0.000 2.704 2.294 1.290 3.263 2.651
# Alaska 2.704 0.000 2.701 2.826 3.013 2.327
# Arizona 2.294 2.701 0.000 2.718 1.310 1.365
# Arkansas 1.290 2.826 2.718 0.000 3.764 2.831
# California 3.263 3.013 1.310 3.764 0.000 1.288
# Colorado 2.651 2.327 1.365 2.831 1.288 0.000
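To make the "euclidean" measure concrete, the following sketch computes the distance between two states by hand and compares it with the value stored in the distance matrix. The names scaled, by_hand and from_dist are illustrative only.

```r
# Euclidean distance between two rows, by hand versus dist() (base R only)
scaled   <- scale(USArrests)
res.dist <- dist(x = scaled, method = "euclidean")

# Square root of the summed squared differences across the four variables
by_hand <- sqrt(sum((scaled["Alabama", ] - scaled["Alaska", ])^2))

# The same pair looked up in the full distance matrix
from_dist <- as.matrix(res.dist)["Alabama", "Alaska"]

all.equal(by_hand, from_dist)
```

Rounded to three digits, by_hand should agree with the Alabama/Alaska entry in the table printed above.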
The “dist” method of as.matrix() and the as.dist() function can be used for conversion between objects of class “dist” and conventional distance matrices.
d = as.dist(output)
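A minimal sketch of the round trip between the two representations, using the full 50-state distance object rather than the 6 x 6 excerpt. The names full and back are assumptions made for this example.

```r
# Convert a "dist" object to a square matrix and back again (base R only)
res.dist <- dist(scale(USArrests), method = "euclidean")

full <- as.matrix(res.dist)   # conventional symmetric matrix with a zero diagonal
back <- as.dist(full)         # compact "dist" object storing the lower triangle

dim(full)
class(back)

# TRUE if no distances were changed by the round trip
all.equal(as.vector(res.dist), as.vector(back))
```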
Hierarchical clustering
Hierarchical clustering is a cluster analysis performed on a set of dissimilarities between observations. Starting with each observation in its own cluster, it repeatedly merges the two most similar clusters until a single cluster remains. It is carried out with the hclust() function in the stats package.
The argument d specifies a dissimilarity structure as produced by the dist() function. The second argument, method, specifies the agglomeration method to be used. This should be one of the following:
"ward.D", "ward.D2", "single", "complete", "average" (UPGMA), "mcquitty" (WPGMA), "median" (WPGMC) or "centroid" (UPGMC).
# Cluster dendrogram using stats package
require(stats)
res.hc <- hclust(d = res.dist,
                 method = "complete")
plot(x = res.hc)
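The customization examples later in this article use k = 4 groups. As a sketch of what that grouping means, the base R functions cutree() and rect.hclust() can cut the tree into four clusters and outline them on the plot. The object name groups is an assumption made for this example.

```r
# Cut the tree into four clusters and outline them (base R only)
res.hc <- hclust(d = dist(scale(USArrests)), method = "complete")

# Named integer vector: one cluster label (1 to 4) per state
groups <- cutree(res.hc, k = 4)
table(groups)    # cluster sizes
head(groups)

# Draw the dendrogram and add one rectangle per cluster
plot(res.hc, cex = 0.6)
rect.hclust(res.hc, k = 4, border = 2:5)
```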
A more polished dendrogram can be drawn with the factoextra package. The function fviz_dend() draws dendrograms using either the base R plot() function or the ggplot2 package. It also provides options for drawing circular dendrograms and phylogenic trees.
The x argument specifies an object of class dendrogram, hclust, agnes, diana, hcut, hkmeans or HCPC. The size of the labels is controlled by the cex argument, and the width of the branch and rectangle lines by the lwd argument.
# Cluster dendrogram using factoextra package
require(factoextra)
fviz_dend(x = res.hc, cex = 0.7, lwd = 0.7)
Customize dendrogram
Color choices
The clusters in a dendrogram can be colored using R's built-in color names. The colors() function returns the built-in color names that R knows about.
require(grDevices)
colors()
The graphics palette can be set or viewed with the palette() function. In R it is almost always better to specify colors by name. A quick way to display colors in a plot is the show_col() function from the scales package.
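require(scales)
# View the current palette
palette()
# Show six rainbow colors; palette(rainbow(6)) would return the previous palette instead
show_col(rainbow(6))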
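Because colors() returns several hundred names, it can help to search the list and to confirm that a name is valid before passing it to a plotting function. The sketch below does this for the four manually selected colors used later in this article. The names all_names and wanted are illustrative.

```r
# Search the built-in color names and validate a selection (base R only)
all_names <- colors()
length(all_names)                          # number of built-in color names

# All built-in names containing "magenta"
grep("magenta", all_names, value = TRUE)

# Check that each chosen name is a built-in color
wanted <- c("red", "green3", "blue", "magenta")
wanted %in% all_names

# Red, green and blue components of each chosen color
col2rgb(wanted)
```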
To view the color palettes inspired by plots in the Journal of Clinical Oncology, use the pal_jco() function from the ggsci package. The palette argument specifies the palette type; currently the only available option is "default" (a 10-color palette). The alpha argument specifies the transparency level and takes a value between \(0\) and \(1\).
require("ggsci")
show_col(pal_jco(palette = c("default"))(10))
show_col(pal_jco("default", alpha = 0.6)(10))
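To illustrate what a transparency level between 0 and 1 means, here is a small sketch using only base R's rgb() function rather than ggsci. It shows how an alpha value is stored as an extra channel in the color code. The names opaque and semi are assumptions for this example, and the sketch does not claim to reproduce how pal_jco() handles alpha internally.

```r
# Alpha transparency illustrated with base R colors (not ggsci)
opaque <- rgb(1, 0, 0)                # fully opaque red, "#RRGGBB" form
semi   <- rgb(1, 0, 0, alpha = 0.6)   # same red at 60% opacity, "#RRGGBBAA" form

# The alpha row shows the transparency on a 0 to 255 scale
col2rgb(semi, alpha = TRUE)

# Two large points side by side: opaque on the left, semi-transparent on the right
plot(1:2, c(1, 1), pch = 19, cex = 10, col = c(opaque, semi),
     xlim = c(0, 3), axes = FALSE, xlab = "", ylab = "")
```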
Assign colors and draw rectangles
Colors can be assigned to the groups or clusters, both for the branch lines and for the rectangles. The argument k_colors specifies a vector containing the colors to be used for each group. Allowed values also include “gray” for gray color palettes, brewer palettes and scientific journal palettes from the ggsci R package.
ggsci: “npg”, “aaas”, “lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”
The argument rect is a logical value indicating whether to add a rectangle around each cluster. It is used only when k != NULL.
The border color and line type of the rectangles can be customized with the rect_border and rect_lty arguments. The rect_fill argument is a logical value; if TRUE, the rectangles are filled.
fviz_dend(x = res.hc, cex = 0.8, lwd = 0.8, k = 4,
          # Manually selected colors
          k_colors = c("red", "green3", "blue", "magenta"),
          rect = TRUE,
          rect_border = "gray",
          rect_fill = FALSE)
# OR JCO fill color for rectangles
fviz_dend(x = res.hc, cex = 0.8, lwd = 0.8, k = 4,
          k_colors = c("jco"),
          rect = TRUE,
          rect_border = "jco",
          rect_fill = TRUE)
Horizontal alignment
The orientation of the dendrogram can be changed with the logical horiz argument. Setting horiz = TRUE draws a horizontal dendrogram.
fviz_dend(res.hc, cex = 0.8, k=4,
rect = TRUE,
k_colors = "jco",
rect_border = "jco",
rect_fill = TRUE,
horiz = TRUE)
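For comparison, a horizontal layout is also available without factoextra. The sketch below converts the clustering result to a dendrogram object and uses the horiz argument of base R's dendrogram plot method. The names dend and op are illustrative.

```r
# Horizontal dendrogram with base R graphics only
res.hc <- hclust(dist(scale(USArrests)), method = "complete")
dend   <- as.dendrogram(res.hc)

# Widen the right margin so the state names are not clipped
op <- par(mar = c(4, 1, 1, 7))
plot(dend, horiz = TRUE)
par(op)   # restore the previous graphics settings
```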
Apply themes
Different ggplot2 themes can be applied to the dendrogram through the ggtheme argument. The default value for this argument is theme_classic(). Allowed values include the following official ggplot2 themes.
ggtheme: theme_gray(), theme_bw(), theme_minimal(), theme_classic(), theme_void(), ….
fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4,
rect = TRUE,
k_colors = "jco",
rect_border = "jco",
rect_fill = TRUE,
ggtheme = theme_gray())
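To compare several themes side by side, one possible approach is to loop over theme names and draw one dendrogram per theme. This is only a sketch: the vector theme_names and the loop are assumptions made for illustration, and the plots are drawn only if the factoextra and ggplot2 packages are installed.

```r
# Draw the same dendrogram with several ggplot2 themes (illustrative sketch)
res.hc <- hclust(dist(scale(USArrests)), method = "complete")
theme_names <- c("theme_gray", "theme_bw", "theme_minimal")

if (requireNamespace("factoextra", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  for (nm in theme_names) {
    # Look up the exported theme function by its name
    theme_fun <- getExportedValue("ggplot2", nm)
    p <- factoextra::fviz_dend(res.hc, k = 4, cex = 0.6,
                               main = nm, ggtheme = theme_fun())
    print(p)   # plots created inside a loop must be printed explicitly
  }
}
```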
Change type of dendrogram
The type of dendrogram can be changed with the type argument. The allowed values for this argument include the following:
type: “rectangle”, “triangle”, “circular”, “phylogenic”
# Phylogenic
Phylo = fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4,
                  rect = TRUE,
                  k_colors = "jco",
                  rect_border = "jco",
                  rect_fill = TRUE,
                  type = "phylogenic")
# Circular
Circ = fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4,
                 rect = TRUE,
                 k_colors = "jco",
                 rect_border = "jco",
                 rect_fill = TRUE,
                 type = "circular")
Avoid overplotting of labels
In the phylogenic dendrogram above, the labels overlap. To avoid this overplotting, set the logical repel argument.
Use repel = TRUE to avoid label overplotting when type = "phylogenic".
fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4,
rect = TRUE,
k_colors = "jco",
rect_border = "jco",
rect_fill = TRUE,
type = "phylogenic",
repel = TRUE)