Cluster analysis in R
AGRON stats
January 25, 2020
Introduction
Clustering is the task of dividing a population or a set of data points into groups such that points in the same group are more similar to each other than to points in other groups. It is one of the important data mining methods for discovering knowledge in multidimensional data. The goal of clustering is to identify patterns or groups of similar objects within a data set of interest.
Packages used
Several R packages are available for cluster analysis. I shall use two of them, cluster and factoextra. You can install these packages with the install.packages() function. Downloading the package files from CRAN will take some time. The cluster package provides clustering algorithms and distance measures such as daisy(), while the factoextra package provides elegant ggplot2-based visualization of clustering results.
install.packages("cluster")
install.packages("factoextra")
# OR install both packages with single command
install.packages(c("cluster", "factoextra"))
To load or attach these packages, use the library() or require() function. Both functions load the namespace of the named package and attach it to the search list. Both check the list of currently attached packages and do not reload a namespace that is already loaded. They differ in how they handle a missing package: library() stops with an error, whereas require() returns FALSE with a warning.
Use .packages(all.available = TRUE) to obtain just the names of all available packages.
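As a rough sketch of the loading step described above, the snippet below attaches cluster with library() and uses the return value of require() to handle the case where factoextra has not been installed yet. The object name attached is made up for this illustration.

```r
# Attach the cluster package; library() stops with an error if it is missing
library(cluster)

# require() returns TRUE or FALSE, so it can be used inside a condition
if (!require(factoextra)) {
  message("factoextra is not installed; run install.packages(\"factoextra\")")
}

# Names of the packages that are currently attached
attached <- .packages()
print(attached)
```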
Import data file
Before importing data into R, first prepare the data file according to the following instructions.
- Use the first row as column names that represent variables.
- Use the first column as row names that represent observations.
- Avoid column names with blank spaces. For example, a good name for plant height places an underscore or dot between the two words, e.g. plant_height or plant.height. A bad name puts a space between the two words (plant height).
- Avoid names with special symbols.
- Avoid beginning variable names with a number. Use a letter instead, e.g. instead of 1000_grain_weight type th_grain_weight or grain_weight_1000.
- Avoid blank rows in your data.
- Delete any comments in your file.
- Replace missing values with NA. Rows containing NA can later be removed in R with the na.omit() function.
- Use the four-digit format for years in columns containing dates.
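A file prepared according to these rules and saved as CSV could be imported along the following lines. This is only a sketch: the file is written to a temporary location so that the example runs on its own, and the variety names and measurements in it are invented for illustration.

```r
# Write a small, made-up CSV file that follows the rules listed above
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "variety,plant_height,grain_weight_1000",
  "V1,95.2,41.5",
  "V2,88.0,NA",
  "V3,102.4,39.8"
), csv_path)

# header = TRUE : the first row holds the variable names
# row.names = 1 : the first column holds the observation names
my_data <- read.csv(csv_path, header = TRUE, row.names = 1)
str(my_data)
```

For your own data, replace csv_path with the path to your file, or use file.choose() to pick it interactively.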
After preparing the file, the next step is to save it in .CSV format. There are also several built-in demo data sets in R for playing with R functions. These include USArrests, iris and mtcars. To load a demo data set you can use the data() function. In this example the USArrests data set will be used to perform cluster analysis in R. The head() function prints the first six rows of the data set.
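The calls that produce the listing shown next would look roughly like this.

```r
# Load the built-in demo data set
data("USArrests")

# Print the first six rows
head(USArrests)
```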
# Murder Assault UrbanPop Rape
# Alabama 13.2 236 58 21.2
# Alaska 10.0 263 48 44.5
# Arizona 8.1 294 80 31.0
# Arkansas 8.8 190 50 19.5
# California 9.0 276 91 40.6
# Colorado 7.9 204 78 38.7
Handling missing values
To check whether the data contain missing values, use the na.fail() function, which returns the object if it does not contain any missing values and stops with an error otherwise. If the data have missing values, use the na.omit() function to remove the incomplete cases.
The na.pass() function returns the object unchanged.
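The behaviour of the three functions can be tried out as sketched below. USArrests itself is complete, so the example makes a copy (here called with_na, a name chosen for this illustration) and inserts one artificial missing value into it.

```r
data("USArrests")

# USArrests has no missing values, so na.fail() simply returns it
checked <- na.fail(USArrests)

# Copy of the data with one artificial missing value
with_na <- USArrests
with_na["Alabama", "Murder"] <- NA

# na.fail() now stops with an error, which is caught here
result <- tryCatch(na.fail(with_na),
                   error = function(e) "missing values found")
print(result)

# na.omit() drops the incomplete row (Alabama)
cleaned <- na.omit(with_na)
nrow(cleaned)

# na.pass() returns the object as it is, NA included
passed <- na.pass(with_na)
```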
Scaling of the data
Scaling is carried out because we do not want the clustering to depend on the arbitrary units in which a variable is measured. It is particularly recommended when variables are measured on different scales. The goal of scaling is to make variables more comparable.
Generally variables are scaled to have a standard deviation of one and a mean of zero.
The data should also be standardized if the means and standard deviations of the variables differ widely. Scaling transforms each value into a ratio: the numerator is the deviation of the value from the center of the variable (its mean or median) and the denominator is a measure of spread (the standard deviation, interquartile range or median absolute deviation).
This approach is widely used in gene expression data analysis before clustering. Use the scale() function in the R console to standardize the data.
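A minimal sketch of the standardization step is given here. With its default arguments scale() subtracts the column mean and divides by the column standard deviation; the object name data.scaled matches the one used in the code later in this post.

```r
data("USArrests")

# Center each column to mean 0 and scale it to standard deviation 1
data.scaled <- scale(USArrests)
head(data.scaled)

# Optional check: column means are (numerically) zero and standard deviations are one
round(colMeans(data.scaled), 10)
apply(data.scaled, 2, sd)
```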
# Murder Assault UrbanPop Rape
# Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
# Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
# Arizona 0.07163341 1.4788032 0.9989801 1.042878388
# Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
# California 0.27826823 1.2628144 1.7589234 2.067820292
# Colorado 0.02571456 0.3988593 0.8608085 1.864967207
Distance matrix computation
Three methods for computing distance measures are covered here:
- Euclidean distance
- correlation based distance
- distance for mixed data
Euclidean distance
First we shall see how to compute the Euclidean distance. The dist() function from the stats package computes the distances between the rows of a data matrix using the specified distance measure. Supply a numeric matrix, data frame or dist object in the x argument. In the method argument, specify the distance measure to be used.
method: “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”
To make the distance information generated by dist() easier to read, convert the distance object to a matrix with as.matrix() and round the values with round(). Here the two 1:3 ranges select the first three rows and the first three columns, respectively. The value 1 means the output is rounded to one decimal place.
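The steps just described can be sketched as follows. The scaled data are recomputed so that the snippet runs on its own, and the object name data.eucl is the one used in the visualization code further down.

```r
data("USArrests")
data.scaled <- scale(USArrests)

# Euclidean distances between the rows (states) of the scaled data
data.eucl <- dist(x = data.scaled, method = "euclidean")

# Show the first three rows and columns, rounded to one decimal place
round(as.matrix(data.eucl)[1:3, 1:3], 1)
```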
# Alabama Alaska Arizona
# Alabama 0.0 2.7 2.3
# Alaska 2.7 0.0 2.7
# Arizona 2.3 2.7 0.0
In this distance matrix, each value is the distance between a pair of objects. The values on the diagonal are the distances between the objects and themselves, which are zero.
Correlation based distance
The second method is to compute correlation-based distance measures. This type is commonly used in gene expression data analysis. To compute correlation-based distances, use the get_dist() function after loading the factoextra package with the library() function.
method: “pearson”, “spearman” or “kendall”
Pearson correlation is the most commonly used method. It is also known as a parametric correlation, which depends on the distribution of the data. Kendall and Spearman correlations are non-parametric and are used to perform rank-based correlation analysis.
library(factoextra)
data.cor <- get_dist(x = data.scaled,
method = "pearson")
round(as.matrix(data.cor)[1:3, 1:3], 1)
# Alabama Alaska Arizona
# Alabama 0.0 0.7 1.4
# Alaska 0.7 0.0 0.8
# Arizona 1.4 0.8 0.0
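To see what a correlation-based distance amounts to, here is a small base R sketch. It assumes the distance is defined as one minus the Pearson correlation between the rows, which gives the same rounded values as the matrix printed above. The object names row.cor and data.cor.base are made up for this illustration.

```r
data("USArrests")
data.scaled <- scale(USArrests)

# cor() correlates columns, so transpose to correlate the observations (rows)
row.cor <- cor(t(data.scaled), method = "pearson")

# Turn correlations into distances: r = 1 gives 0, r = -1 gives 2
data.cor.base <- as.dist(1 - row.cor)
round(as.matrix(data.cor.base)[1:3, 1:3], 1)
```

Under this definition, two states whose profiles rise and fall together are close even if their absolute values differ, which is the main contrast with the Euclidean distance.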
Distances for mixed data
The above two methods accept only numeric data. However, for data containing both numeric and non-numeric variables (mixed data), the daisy() function is used to compute the distances. In daisy(), Gower’s coefficient, one of the most popular measures of proximity for mixed data types, is used as the metric.
Here we shall use a different example, the flower data, which contains factor, ordered factor and numeric variables. For this purpose first load the cluster package using the library() function. Then load the R demo data set flower using the data() function.
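The loading step described above, together with a str() call to inspect the structure of the data, would look roughly like this.

```r
# The flower data set ships with the cluster package
library(cluster)
data("flower")

# Inspect the variable types
str(flower)
```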
# 'data.frame': 18 obs. of 8 variables:
# $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
# $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
# $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
# $ V4: Factor w/ 5 levels "1","2","3","4",..: 4 2 3 4 5 4 4 2 3 5 ...
# $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
# $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<"4"<..: 15 3 1 16 2 12 13 7 4 14 ...
# $ V7: num 25 150 150 125 20 50 40 100 25 100 ...
# $ V8: num 15 50 50 50 15 40 20 15 15 60 ...
To compute the distances for mixed variables, use the daisy() function. Convert the result to a matrix with as.matrix() and round it to two decimal places with round().
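A minimal sketch of this step follows. The metric is set to "gower" explicitly, and the object name data.daisy is the one used in the visualization code below.

```r
library(cluster)
data("flower")

# Gower dissimilarities between the 18 flowers
data.daisy <- daisy(x = flower, metric = "gower")

# Show the first three rows and columns, rounded to two decimal places
round(as.matrix(data.daisy)[1:3, 1:3], 2)
```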
# 1 2 3
# 1 0.00 0.89 0.53
# 2 0.89 0.00 0.51
# 3 0.53 0.51 0.00
Visualizing distances
After computing the distances, the next step is to visualize the distance matrix. A simple way to do this is to use the fviz_dist() function from the factoextra package.
With order = TRUE, this function reorders the samples so that similar objects are placed next to each other, which makes groups visible in the plot.
library(factoextra)
# Visualize Euclidean matrix
fviz_dist(dist.obj = data.eucl,
order = TRUE, show_labels = TRUE)
# Visualize correlation matrix
fviz_dist(dist.obj = data.cor,
order = TRUE, show_labels = TRUE)
# Visualize mixed data distance matrix
fviz_dist(dist.obj = data.daisy,
order = TRUE, show_labels = TRUE)
Red indicates high similarity while blue indicates low similarity. The color level is proportional to the dissimilarity between observations: pure red represents zero and pure blue represents the largest dissimilarity in the matrix.