Visualizing clustering dendrogram in R | Hierarchical clustering


Introduction

This tutorial shows how to create enhanced visualizations of a clustering dendrogram in RStudio. It covers the following topics:

  • Distance matrix computation
  • Hierarchical clustering
  • Customizing dendrogram

Let’s get started in RStudio.

Import data

The data used here come from USArrests, a demo data set that ships with R. The head() function prints the first six rows of the data set, and the str() function displays its internal structure.

head(USArrests)
#            Murder Assault UrbanPop Rape
# Alabama      13.2     236       58 21.2
# Alaska       10.0     263       48 44.5
# Arizona       8.1     294       80 31.0
# Arkansas      8.8     190       50 19.5
# California    9.0     276       91 40.6
# Colorado      7.9     204       78 38.7
str(USArrests)
# 'data.frame': 50 obs. of  4 variables:
#  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
#  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
#  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
#  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
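
Before scaling, it can help to see how differently the variables are spread. The following is a small sketch, not part of the original walkthrough, that compares the column standard deviations and counts missing values. The variable names col_sd and n_missing are chosen here for illustration.

```r
# Compare the spread of each variable; large differences motivate scaling
col_sd <- sapply(USArrests, sd)
print(round(col_sd, 2))

# Count missing values in the whole data set
n_missing <- sum(is.na(USArrests))
print(n_missing)
```

If one variable has a much larger standard deviation than the others, it would dominate an unscaled Euclidean distance.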

Scale data set

To scale the data, use the scale() function. scale() is a generic function whose default method centers and/or scales the columns of a numeric matrix. If center = TRUE, each column is centered by subtracting its mean. If scale = TRUE, each (centered) column is then divided by its standard deviation.

data.scaled <- scale(x = USArrests,
                     center = TRUE,
                     scale = TRUE)
head(data.scaled)
#                Murder   Assault   UrbanPop         Rape
# Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
# Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
# Arizona    0.07163341 1.4788032  0.9989801  1.042878388
# Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
# California 0.27826823 1.2628144  1.7589234  2.067820292
# Colorado   0.02571456 0.3988593  0.8608085  1.864967207
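
To make the two steps concrete, the sketch below repeats the centering and scaling by hand and compares the result with scale(). The helper names x, centered and manual are illustrative only.

```r
# Center and scale by hand: subtract column means, divide by column SDs
x <- as.matrix(USArrests)
centered <- sweep(x, 2, colMeans(x), "-")
manual <- sweep(centered, 2, apply(x, 2, sd), "/")

# Compare with scale(); attributes (stored center/scale values) are ignored
same <- all.equal(manual, scale(USArrests), check.attributes = FALSE)
print(same)

# After scaling, every column has mean ~0 and standard deviation 1
print(round(colMeans(manual), 10))
print(apply(manual, 2, sd))
```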

Distance matrix computation

The distance matrix is computed with the dist() function, which returns the distances between the rows of a data matrix under the specified distance measure. dist() belongs to the stats package, which R attaches by default, so the require(stats) call below is optional.

The argument x specifies a numeric matrix, data frame or “dist” object. The second argument, method, specifies the distance measure to be used. It must be one of the following:

"euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"

To print the result, convert the res.dist object to a matrix with as.matrix() and select the rows and columns to print with square brackets.

Missing values are allowed, and are excluded from all computations involving the rows within which they occur.

require(stats)
# Distance matrix computation
res.dist = dist(x = data.scaled,
                method = "euclidean")
# Print distance matrix 
output = as.matrix(res.dist)[1:6, 1:6]
round(output, digits = 3)
#            Alabama Alaska Arizona Arkansas California Colorado
# Alabama      0.000  2.704   2.294    1.290      3.263    2.651
# Alaska       2.704  0.000   2.701    2.826      3.013    2.327
# Arizona      2.294  2.701   0.000    2.718      1.310    1.365
# Arkansas     1.290  2.826   2.718    0.000      3.764    2.831
# California   3.263  3.013   1.310    3.764      0.000    1.288
# Colorado     2.651  2.327   1.365    2.831      1.288    0.000
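
As a check on what "euclidean" means here, the sketch below recomputes one entry of the matrix by hand: the square root of the summed squared differences between two scaled rows. The names a, b, manual and from_dist are illustrative.

```r
data.scaled <- scale(USArrests)

# Euclidean distance between two rows, computed by hand
a <- data.scaled["Alabama", ]
b <- data.scaled["Alaska", ]
manual <- sqrt(sum((a - b)^2))

# Same pair taken from dist()
from_dist <- as.matrix(dist(data.scaled, method = "euclidean"))["Alabama", "Alaska"]

print(round(c(manual = manual, dist = from_dist), 3))
```

Both values should agree with the 2.704 shown for Alabama and Alaska in the table above.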

The “dist” method of as.matrix() and the as.dist() function convert between objects of class “dist” and conventional distance matrices. Here, as.dist() turns the 6 × 6 matrix printed above back into a “dist” object.

d = as.dist(output)

Hierarchical clustering

Hierarchical clustering is a cluster analysis performed on a set of dissimilarities. It is carried out with the hclust() function in the stats package.

The argument d specifies a dissimilarity structure as produced by the dist() function. The second argument, method, specifies the agglomeration method to be used. It should be one of the following:

"ward.D", "ward.D2", "single", "complete", "average" (UPGMA), "mcquitty" (WPGMA), "median" (WPGMC) or "centroid" (UPGMC).
# Cluster dendrogram using stats package
require(stats)
res.hc <- hclust(d = res.dist,
                 method = "complete")
plot(x = res.hc) 
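
Once the tree is built, cluster memberships can be read off with cutree() from the stats package. The sketch below cuts the tree into four groups; k = 4 is an assumption chosen for illustration, and grp is a name introduced here.

```r
res.dist <- dist(scale(USArrests), method = "euclidean")
res.hc <- hclust(res.dist, method = "complete")

# Cut the tree into 4 groups; returns a named integer vector (one entry per state)
grp <- cutree(res.hc, k = 4)
print(head(grp))

# Number of states in each cluster
print(table(grp))

# States assigned to cluster 1
print(names(grp)[grp == 1])
```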

The factoextra package offers another, enhanced way to visualize a dendrogram. Its fviz_dend() function draws dendrograms using either base R plot() or the ggplot2 package. It also provides options for drawing circular dendrograms and phylogenic trees.

The x argument specifies an object of class dendrogram, hclust, agnes, diana, hcut, hkmeans or HCPC. The size of the labels is controlled by the cex argument, and the width of the branch and rectangle lines by the lwd argument.

# Cluster dendrogram using factoextra package
require(factoextra)
fviz_dend(x = res.hc, cex = 0.7, lwd = 0.7) 
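
Because fviz_dend() returns a ggplot object, the plot can be stored, extended with ggplot2 layers, and saved to a file. The following is a minimal sketch under that assumption; the title text and the output file name are placeholders.

```r
library(factoextra)
library(ggplot2)

res.hc <- hclust(dist(scale(USArrests)), method = "complete")

# main and ylab are fviz_dend() arguments for the plot title and axis label
p <- fviz_dend(x = res.hc, cex = 0.7, lwd = 0.7,
               main = "USArrests: complete linkage",
               ylab = "Height")

# The result is a ggplot object, so ggplot2 theme tweaks can be added
p <- p + theme(plot.title = element_text(face = "bold"))
print(p)

# Save to a file (placeholder name, written to the session's temp directory)
out_file <- file.path(tempdir(), "dendrogram.png")
ggsave(filename = out_file, plot = p, width = 8, height = 5)
```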

Customize dendrogram

Color choices

The clusters in a dendrogram can be colored using R’s built-in color names. The colors() function returns the built-in color names that R knows about.

require(grDevices)
colors()
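
The full list is long, so it is often easier to search it. The sketch below filters the names with grep() and converts one name to its RGB components with col2rgb(); the search term "green" is just an example.

```r
all_cols <- colors()
print(length(all_cols))          # number of built-in color names

# Find every built-in name containing "green"
greens <- grep("green", all_cols, value = TRUE)
print(head(greens))

# Convert a color name to its red, green and blue components
print(col2rgb("green3"))
```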

The graphics palette can be viewed or set with the palette() function. In R it is almost always better to specify colors by name. A quick way to display a set of colors is the show_col() function from the scales package.

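require(scales)
# View the current palette
palette()
# Display six rainbow colors; palette(rainbow(6)) would set the palette
# but return the previous one, so the colors are passed directly
show_col(rainbow(6))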
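
One detail of palette() is worth spelling out: when given a value, it sets the new palette and invisibly returns the previous one. The sketch below uses that return value to restore the palette afterwards; old and active are names introduced here.

```r
# palette(value) sets a new palette and invisibly returns the previous one
old <- palette(rainbow(6))

# The active palette now holds six colors
active <- palette()
print(active)

# Integer color codes index into the active palette
barplot(rep(1, 6), col = 1:6, names.arg = 1:6)

# Restore the previous palette
palette(old)
```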

To view the color palette inspired by plots in the Journal of Clinical Oncology, use the pal_jco() function from the ggsci package. The palette argument specifies the palette type; currently the only available option is “default” (a 10-color palette). The alpha argument specifies the transparency level and takes a value between 0 and 1.

require("ggsci")
show_col(pal_jco(palette = c("default"))(10))
show_col(pal_jco("default", alpha = 0.6)(10))
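
pal_jco() does not return colors directly; it returns a function that produces the requested number of hex color codes. The sketch below extracts four colors, with and without transparency, and uses them in a base R plot. The names jco4 and jco4_alpha are illustrative.

```r
library(ggsci)

# pal_jco() returns a function; calling it with n gives n hex color codes
jco4 <- pal_jco("default")(4)
jco4_alpha <- pal_jco("default", alpha = 0.6)(4)
print(jco4)
print(jco4_alpha)

# The hex codes can be used wherever R accepts colors
barplot(rep(1, 4), col = jco4, names.arg = jco4, cex.names = 0.7)
```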

Assign colors and draw rectangles

Colors can be assigned to the groups (clusters), both for the branches and for the rectangles drawn around them. The argument k_colors specifies a vector containing the colors to be used for the groups. Allowed values also include “gray” for gray color palettes, brewer palettes, and scientific journal palettes from the ggsci R package.

ggsci: “npg”, “aaas”, “lancet”, “jco”, “ucscgb”, “uchicago”, “simpsons” and “rickandmorty”

The argument rect is a logical value indicating whether to add a rectangle around each cluster. It is used only when k != NULL.

The border color of the rectangles is set with the rect_border argument, and their line type with the rect_lty argument. The rect_fill argument is logical; if TRUE, the rectangles are filled.

# Manually selected colors
fviz_dend(x = res.hc, cex = 0.8, lwd = 0.8, k = 4,
          k_colors = c("red", "green3", "blue", "magenta"),
          rect = TRUE,
          rect_border = "gray",
          rect_fill = FALSE)

# OR JCO palette for the branches and filled rectangles
fviz_dend(x = res.hc, cex = 0.8, lwd = 0.8, k = 4,
          k_colors = "jco",
          rect = TRUE,
          rect_border = "jco",
          rect_fill = TRUE)

Horizontal alignment

The orientation of the dendrogram is controlled by the logical horiz argument. Setting horiz = TRUE draws a horizontal dendrogram.

fviz_dend(res.hc, cex = 0.8, k=4, 
          rect = TRUE,  
          k_colors = "jco",
          rect_border = "jco", 
          rect_fill = TRUE, 
          horiz = TRUE)

Apply themes

ggplot2 themes can be applied to the dendrogram through the ggtheme argument, whose default value is theme_classic(). Allowed values include the following official ggplot2 themes:

ggtheme: theme_gray(), theme_bw(), theme_minimal(), theme_classic(), theme_void(), ….

fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4, 
          rect = TRUE, 
          k_colors = "jco", 
          rect_border = "jco", 
          rect_fill = TRUE,
          ggtheme = theme_gray())

Change type of dendrogram

The type of dendrogram is set with the type argument. Allowed values include the following:

type: “rectangle”, “triangle”, “circular”, “phylogenic”

# Phylogenic
Phylo = fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4,
                  rect = TRUE,
                  k_colors = "jco",
                  rect_border = "jco",
                  rect_fill = TRUE,
                  type = "phylogenic")
# Circular
Circ = fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4,
                 rect = TRUE,
                 k_colors = "jco",
                 rect_border = "jco",
                 rect_fill = TRUE,
                 type = "circular")
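
Assigning a plot to a name does not draw it; the plot appears only when the object is printed. The sketch below stores two variants and then shows them, first one alone and then side by side. It assumes the ggpubr package is installed (factoextra uses it internally), and the names rect_plot, circ_plot and both are illustrative.

```r
library(factoextra)

res.hc <- hclust(dist(scale(USArrests)), method = "complete")

# Build two variants and keep them as objects
rect_plot <- fviz_dend(res.hc, cex = 0.6, k = 4, k_colors = "jco")
circ_plot <- fviz_dend(res.hc, cex = 0.6, k = 4, k_colors = "jco",
                       type = "circular")

# A stored plot is drawn only when printed
print(circ_plot)

# Side-by-side layout; ggarrange() comes from the ggpubr package
both <- ggpubr::ggarrange(rect_plot, circ_plot, ncol = 2, labels = c("A", "B"))
print(both)
```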

Avoid overplotting of labels

In the phylogenic dendrogram above, the labels overlap. Set repel = TRUE to avoid label overplotting when type = "phylogenic".

fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4, 
          rect = TRUE, 
          k_colors = "jco", 
          rect_border = "jco", 
          rect_fill = TRUE,
          type = "phylogenic",
          repel = TRUE)