Visualizing clustering dendrogram in R | Hierarchical clustering

Introduction

In this tutorial you will learn how to produce enhanced visualizations of clustering dendrograms in RStudio. It covers the following topics:

  • Distance matrix computation
  • Hierarchical clustering
  • Customizing dendrogram

Let’s start in RStudio.

Import data

The data used here come from the built-in R data set USArrests. The head() function prints the first six rows of the data set, and the str() function displays its internal structure.

head(USArrests)
#            Murder Assault UrbanPop Rape
# Alabama      13.2     236       58 21.2
# Alaska       10.0     263       48 44.5
# Arizona       8.1     294       80 31.0
# Arkansas      8.8     190       50 19.5
# California    9.0     276       91 40.6
# Colorado      7.9     204       78 38.7
str(USArrests)
# 'data.frame': 50 obs. of  4 variables:
#  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
#  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
#  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
#  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
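
Before scaling, it can help to confirm that every column is numeric and that no values are missing, because scale() stops with an error such as "'x' must be numeric" when it meets a non-numeric column. The following is a minimal sketch; the object names num.cols, has.na and data.numeric are illustrative and not part of the original script.

```r
# Check that all columns are numeric (required by scale() and dist())
num.cols <- sapply(USArrests, is.numeric)
print(num.cols)

# Check for missing values anywhere in the data frame
has.na <- anyNA(USArrests)
print(has.na)

# Keep only the numeric columns if a data set mixes types
data.numeric <- USArrests[, num.cols, drop = FALSE]
dim(data.numeric)
```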

Scale data set

To scale the data, use the scale() function. scale() is a generic function whose default method centers and/or scales the columns of a numeric matrix. If center = TRUE, each column is centered by subtracting its mean. If scale = TRUE, the (centered) columns are then divided by their standard deviations.

data.scaled <- scale(x = USArrests,
                     center = TRUE,
                     scale = TRUE)
head(data.scaled)
#                Murder   Assault   UrbanPop         Rape
# Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
# Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
# Arizona    0.07163341 1.4788032  0.9989801  1.042878388
# Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
# California 0.27826823 1.2628144  1.7589234  2.067820292
# Colorado   0.02571456 0.3988593  0.8608085  1.864967207
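
As a quick sanity check, the scaled columns should have a mean of zero and a standard deviation of one. The sketch below verifies this and reproduces the scaling by hand; the names col.means, col.sds and manual are illustrative only.

```r
data.scaled <- scale(x = USArrests, center = TRUE, scale = TRUE)

# After scaling, column means should be (numerically) zero ...
col.means <- colMeans(data.scaled)
# ... and column standard deviations should be one
col.sds <- apply(data.scaled, 2, sd)
round(col.means, 10)
round(col.sds, 10)

# The same result computed by hand: subtract the mean, divide by the SD
manual <- sweep(as.matrix(USArrests), 2, colMeans(USArrests), "-")
manual <- sweep(manual, 2, apply(USArrests, 2, sd), "/")
all.equal(manual, data.scaled, check.attributes = FALSE)
```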

Distance matrix computation

The distance matrix can be computed with the dist() function, which returns the distances between the rows of a data matrix using the specified distance measure. dist() belongs to the stats package, which is attached by default in a standard R session, so the require(stats) call below is optional.

The argument x specifies a numeric matrix, data frame or "dist" object. The second argument, method, specifies the distance measure to be used. It must be one of the following:

"euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski"

To inspect the result, convert the res.dist object to a matrix with as.matrix() and select the rows and columns to print with square brackets.

Missing values are allowed, and are excluded from all computations involving the rows within which they occur.

require(stats)
# Distance matrix computation
res.dist = dist(x = data.scaled,
                method = "euclidean")
# Print distance matrix 
output = as.matrix(res.dist)[1:6, 1:6]
round(output, digits = 3)
#            Alabama Alaska Arizona Arkansas California Colorado
# Alabama      0.000  2.704   2.294    1.290      3.263    2.651
# Alaska       2.704  0.000   2.701    2.826      3.013    2.327
# Arizona      2.294  2.701   0.000    2.718      1.310    1.365
# Arkansas     1.290  2.826   2.718    0.000      3.764    2.831
# California   3.263  3.013   1.310    3.764      0.000    1.288
# Colorado     2.651  2.327   1.365    2.831      1.288    0.000
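
To see what the "euclidean" method computes, the distance between two states can be reproduced by hand as the square root of the sum of squared differences across the four scaled variables. This is a small sketch; the names a, b, manual and from.dist are illustrative.

```r
data.scaled <- scale(USArrests)
res.dist <- dist(data.scaled, method = "euclidean")

# Scaled values for two states
a <- data.scaled["Alabama", ]
b <- data.scaled["Alaska", ]

# Euclidean distance computed by hand
manual <- sqrt(sum((a - b)^2))

# The same entry taken from the distance matrix
from.dist <- as.matrix(res.dist)["Alabama", "Alaska"]

round(manual, 3)   # 2.704, the Alabama/Alaska entry printed above
all.equal(manual, from.dist)
```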

The “dist” method of as.matrix() and as.dist() can be used for conversion between objects of class “dist” and conventional distance matrices.

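# Convert the 6 x 6 sub-matrix back to a "dist" object
# (d covers only the six states printed above, not all 50)
d = as.dist(output)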
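
The conversion works in both directions, and the size of the resulting "dist" object depends on the matrix that is passed in. The sketch below compares a full round trip with a six-state subset; full.matrix, full.dist and sub.dist are illustrative names.

```r
data.scaled <- scale(USArrests)
res.dist <- dist(data.scaled, method = "euclidean")

# Full round trip: "dist" -> matrix -> "dist"
full.matrix <- as.matrix(res.dist)
full.dist <- as.dist(full.matrix)
all.equal(as.vector(full.dist), as.vector(res.dist))

# A sub-matrix gives a smaller "dist" object
sub.dist <- as.dist(full.matrix[1:6, 1:6])
attr(res.dist, "Size")   # number of states in the full object
attr(sub.dist, "Size")   # number of states in the subset
```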

Hierarchical clustering

Hierarchical clustering builds a tree of nested clusters from a set of dissimilarities. In R it is performed with the hclust() function in the stats package.

The argument d specifies a dissimilarity structure as produced by the dist() function. The second argument, method, specifies the agglomeration method to be used. It should be one of the following:

"ward.D", "ward.D2", "single", "complete", "average" (UPGMA), "mcquitty" (WPGMA), "median" (WPGMC) or "centroid" (UPGMC).

# Cluster dendrogram using stats package
require(stats)
res.hc <- hclust(d = res.dist,
                 method = "complete")
plot(x = res.hc) 
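
Besides plotting the tree, the cluster membership of each state can be extracted with cutree() from the stats package. The following is a minimal sketch assuming four clusters, the same k used in the customized dendrograms in this tutorial; the name groups is illustrative.

```r
data.scaled <- scale(USArrests)
res.hc <- hclust(dist(data.scaled, method = "euclidean"), method = "complete")

# Cut the tree into k = 4 groups; returns a named integer vector
groups <- cutree(res.hc, k = 4)
head(groups)

# Number of states in each cluster
table(groups)

# States assigned to the same cluster as Alabama
names(groups)[groups == groups["Alabama"]]
```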

Another way to get an enhanced dendrogram is the factoextra package. Its fviz_dend() function draws dendrograms using either the base R plot() function or ggplot2. It also provides options for drawing circular dendrograms and phylogenic trees.

The x argument specifies an object of class dendrogram, hclust, agnes, diana, hcut, hkmeans or HCPC. The label size and the line width can be controlled with the cex and lwd arguments.

# Cluster dendrogram using factoextra package
require(factoextra)
fviz_dend(x = res.hc, cex = 0.7, lwd = 0.7) 
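
The sketch below assumes that fviz_dend() returns a ggplot object for the default rectangle type, as its documentation describes, so the plot can be stored, extended with ggplot2 layers and saved with explicit dimensions. The object names p and out.file and the title text are illustrative; the factoextra and ggplot2 packages must be installed.

```r
library(factoextra)
library(ggplot2)

res.hc <- hclust(dist(scale(USArrests)), method = "complete")

# Store the dendrogram instead of printing it directly
p <- fviz_dend(x = res.hc, cex = 0.7, lwd = 0.7)

# Add a title with a regular ggplot2 layer
p <- p + labs(title = "USArrests: complete linkage")

# Save with explicit dimensions (in inches) to control the plot height
out.file <- file.path(tempdir(), "dendrogram.pdf")
ggsave(filename = out.file, plot = p, width = 10, height = 4)
```

Changing width and height in ggsave() is one way to get a shorter or taller figure without altering the clustering itself.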

Customize dendrogram

Color choices

The clusters in a dendrogram can be colored using R's built-in color names. The colors() function returns all the built-in color names that R knows about.

require(grDevices)
colors()
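
Because colors() returns a plain character vector, it can be searched and used to check color names before passing them to a plotting function. A short sketch, using the four manually selected cluster colors from this tutorial (my.colors is an illustrative name):

```r
# Total number of built-in color names
length(colors())

# Find all built-in names containing "green"
grep("green", colors(), value = TRUE)

# Confirm that the names used for clusters are valid built-in colors
my.colors <- c("red", "green3", "blue", "magenta")
my.colors %in% colors()

# Convert the names to RGB values (one column per color)
col2rgb(my.colors)
```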

The graphics palette can be set or viewed with the palette() function. In R it is almost always better to specify colors by name rather than by palette index. A quick way to preview a set of colors in a plot is the show_col() function from the scales package.

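require(scales)
# View the current palette
palette()
# Preview six rainbow colors. Note: palette(rainbow(6)) would change the
# global palette and return the previous one, so pass the colors directly.
show_col(rainbow(6))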

To view the color palette inspired by plots in the Journal of Clinical Oncology, use the pal_jco() function from the ggsci package. The palette argument specifies the palette type; currently the only available option is "default" (a 10-color palette). The alpha argument specifies the transparency level and takes a value between \(0\) and \(1\).

require("ggsci")
show_col(pal_jco(palette = c("default"))(10))
show_col(pal_jco("default", alpha = 0.6)(10))
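
pal_jco() returns a function rather than a vector of colors; calling that function with a number n gives n hex color codes. The sketch below extracts four JCO colors, which could then be supplied wherever a vector of colors is accepted. The names jco.fun, jco.cols and jco.alpha are illustrative, and the ggsci package must be installed.

```r
library(ggsci)

# pal_jco() returns a palette function
jco.fun <- pal_jco(palette = "default")

# Ask the palette function for four colors (hex codes)
jco.cols <- jco.fun(4)
print(jco.cols)

# Semi-transparent versions of the same colors
jco.alpha <- pal_jco("default", alpha = 0.6)(4)
print(jco.alpha)
```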

Assign colors and draw rectangles

Each group or cluster can be given its own color, both for the branch lines and for the rectangles. The k_colors argument specifies a vector containing the colors to be used for the groups. Allowed values also include "gray" for gray color palettes, brewer palettes, and scientific journal palettes from the ggsci R package.

ggsci: "npg", "aaas", "lancet", "jco", "ucscgb", "uchicago", "simpsons" and "rickandmorty"

The rect argument is a logical value indicating whether to draw a rectangle around each cluster. It is used only when k is not NULL.

The border color and line type of the rectangles can be customized with the rect_border and rect_lty arguments. rect_fill is a logical argument; if TRUE, the rectangles are filled.

# Manually selected colors
fviz_dend(x = res.hc, cex = 0.8, lwd = 0.8, k = 4,
          k_colors = c("red", "green3", "blue", "magenta"),
          rect = TRUE,
          rect_border = "gray",
          rect_fill = FALSE)

# OR: JCO palette for branches and filled rectangles
fviz_dend(x = res.hc, cex = 0.8, lwd = 0.8, k = 4,
          k_colors = "jco",
          rect = TRUE,
          rect_border = "jco",
          rect_fill = TRUE)
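
For comparison, rectangles around clusters can also be drawn on a base R dendrogram with rect.hclust() from the stats package. This is a sketch of that base-graphics alternative, not part of the factoextra workflow; the name rects is illustrative.

```r
res.hc <- hclust(dist(scale(USArrests)), method = "complete")

# Base dendrogram; hang = -1 aligns all labels at the bottom
plot(res.hc, cex = 0.8, hang = -1)

# Draw one rectangle per cluster; the result lists the members of each cluster
rects <- rect.hclust(res.hc, k = 4,
                     border = c("red", "green3", "blue", "magenta"))
length(rects)
```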

Horizontal alignment

The orientation of the dendrogram is controlled by the logical horiz argument. Setting horiz = TRUE draws a horizontal dendrogram.

fviz_dend(res.hc, cex = 0.8, k=4, 
          rect = TRUE,  
          k_colors = "jco",
          rect_border = "jco", 
          rect_fill = TRUE, 
          horiz = TRUE)
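
A horizontal layout is also available in base R once the hclust result is converted to a "dendrogram" object, whose plot method accepts horiz. The sketch below is a base-graphics alternative under that assumption; dend and old.par are illustrative names, and the margin values are arbitrary choices.

```r
res.hc <- hclust(dist(scale(USArrests)), method = "complete")

# Convert to a "dendrogram" object, whose plot method supports horiz
dend <- as.dendrogram(res.hc)

# Widen the right margin so state names are not clipped
old.par <- par(mar = c(4, 2, 2, 8))
plot(dend, horiz = TRUE)

# Restore the previous graphical parameters
par(old.par)
```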

Apply themes

Different ggplot2 themes can be applied to the dendrogram through the ggtheme argument. Its default value is theme_classic(). Allowed values include the following official ggplot2 themes.

ggtheme: theme_gray(), theme_bw(), theme_minimal(), theme_classic(), theme_void(), ….

fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4, 
          rect = TRUE, 
          k_colors = "jco", 
          rect_border = "jco", 
          rect_fill = TRUE,
          ggtheme = theme_gray())
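
To compare several themes side by side, the same call can be repeated over a list of theme objects. This is a sketch only: themes and plots are illustrative names, and it assumes the factoextra and ggplot2 packages are installed.

```r
library(factoextra)
library(ggplot2)

res.hc <- hclust(dist(scale(USArrests)), method = "complete")

# Named list of ggplot2 themes to compare
themes <- list(gray = theme_gray(),
               bw = theme_bw(),
               minimal = theme_minimal())

# One dendrogram per theme, stored in a named list
plots <- lapply(themes, function(th) {
  fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4,
            k_colors = "jco", ggtheme = th)
})

# Display one of them
print(plots$minimal)
```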

Change type of dendrogram

The type of dendrogram can be changed with the type argument. Allowed values include the following:

type: "rectangle", "triangle", "circular", "phylogenic"

# Phylogenic
Phylo = fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4,
                  rect = TRUE,
                  k_colors = "jco",
                  rect_border = "jco",
                  rect_fill = TRUE,
                  type = "phylogenic")
# Circular
Circ = fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4,
                 rect = TRUE,
                 k_colors = "jco",
                 rect_border = "jco",
                 rect_fill = TRUE,
                 type = "circular")
# The plots are stored in objects, so print them to display
print(Phylo)
print(Circ)

Avoid overplotting of labels

In the phylogenic dendrogram above, the labels are overplotted. To avoid this, set the logical repel argument: repel = TRUE prevents label overplotting when type = "phylogenic".

fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4, 
          rect = TRUE, 
          k_colors = "jco", 
          rect_border = "jco", 
          rect_fill = TRUE,
          type = "phylogenic",
          repel = TRUE)

Phylogenic layouts

Different layouts can be used for phylogenic trees by setting the phylo_layout argument. Its default value is "layout.auto". Allowed values include:

phylo_layout: "layout.auto", "layout_with_drl", "layout_as_tree", "layout.gem", "layout.mds" and "layout_with_lgl"

# Draw the phylogenic tree with a chosen layout
fviz_dend(res.hc, cex = 0.8, lwd = 0.8, k = 4, 
          rect = TRUE, k_colors = "jco", rect_border = "jco", 
          rect_fill = TRUE, type = "phylogenic", repel = TRUE,
          phylo_layout = "layout_with_drl")
# To try another layout, replace the last argument with one of:
#   phylo_layout = "layout_as_tree"
#   phylo_layout = "layout.gem"
#   phylo_layout = "layout.mds"
#   phylo_layout = "layout_with_lgl"

If you have any questions, feel free to ask in the comment box.


Comments

  1. Hi, thanks for the nice tutorial. Can we follow the same procedure for categorical variables?

    Replies
    1. Thanks for your feedback. Yes, you can use the same procedure.

    2. Hello sir, I have a problem creating the cluster dendrogram. It shows: Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

    3. Please share the code you used and the output of str(data), where data is the object containing the data frame you imported into R.

  2. Thanks for the robust tutorial. How can I reduce (shorten) the height?

    Replies
    1. If you just need to reduce the height and not the amount, you can do this by changing the dimensions of the image when saving it.

      In the RStudio IDE, go to the Plots pane, click "Export" and choose "Save as PDF", then try different widths and heights (in inches) under the device size option to get the desired result.

  3. Can I change the order of clusters in the rectangle dendrogram somehow? (I just want to swap the order of two branches coming from one common node.)
