K-means clustering in R
AGRON Stats Lectures
June 26, 2018
Introduction
\(K-means\) clustering is a method of vector quantization, originally from signal processing, that aims to partition \(n\) observations into \(k\) clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
In other words, the \(k-means\) algorithm identifies \(k\) centroids and then allocates every data point to the nearest centroid, while keeping the within-cluster variation as small as possible.
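The objective behind this allocation can be written explicitly. As a sketch, in notation that is assumed here rather than taken from the lecture, let \(C_1, \dots, C_k\) be the clusters and \(\mu_j\) the mean (centroid) of the observations \(x_i\) assigned to cluster \(C_j\). The algorithm then looks for the partition that makes the total within-cluster sum of squares small:

\[
W = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
\]

This is the quantity that kmeans() reports in its tot.withinss component.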
Let's get started
Import data file
I recommend first clearing all objects from the global environment with rm(list = ls(all = TRUE)) before importing the data set. You can also close all open plots with graphics.off(). Note that shell() only runs a system command on Windows and does not by itself clear the R console; in RStudio the console can be cleared with Ctrl+L or cat("\014").
Now let's import the data set using the read.csv() function. I have already saved the data file as a CSV (comma-delimited) file in the working directory. The file argument specifies the file name, including the .csv extension. The header argument takes a logical value indicating whether the first row of the file contains the variable names. In my data set it does, so I set this argument to TRUE. The head() function prints the first six rows of the data set.
# X Murder Assault UrbanPop Rape
# 1 Alabama 13.2 236 58 21.2
# 2 Alaska 10.0 263 48 44.5
# 3 Arizona 8.1 294 80 NA
# 4 Arkansas 8.8 190 50 19.5
# 5 California 9.0 276 91 40.6
# 6 Colorado 7.9 204 78 38.7
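The import step can be sketched as follows. The name of the original data file is not given here, so this example writes the six rows printed above to a temporary CSV file (the path csv_path is a stand-in, not the author's file) and reads it back with the arguments described.

```r
# Stand-in for the author's CSV file: the six rows shown above,
# written to a temporary location (hypothetical path).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "X,Murder,Assault,UrbanPop,Rape",
  "Alabama,13.2,236,58,21.2",
  "Alaska,10.0,263,48,44.5",
  "Arizona,8.1,294,80,NA",
  "Arkansas,8.8,190,50,19.5",
  "California,9.0,276,91,40.6",
  "Colorado,7.9,204,78,38.7"
), csv_path)

# header = TRUE: the first row holds the variable names
data <- read.csv(file = csv_path, header = TRUE)

# Print the first six rows
head(data)
```

With the real file saved in the working directory, csv_path would simply be replaced by the file name in quotes.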
Observe data & format variables
Before proceeding with the analysis, it is good practice to inspect the data set and check the structure of its variables. The first variable, representing the US states, is read as a factor by R. I would like to change it to character, which can be done with the as.character() function. The structure before and after the conversion is shown below.
# 'data.frame': 30 obs. of 5 variables:
# $ X : Factor w/ 30 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
# $ Assault : int 236 263 294 190 276 204 110 238 335 NA ...
# $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
# $ Rape : num 21.2 44.5 NA 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
# 'data.frame': 30 obs. of 5 variables:
# $ X : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
# $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
# $ Assault : int 236 263 294 190 276 204 110 238 335 NA ...
# $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
# $ Rape : num 21.2 44.5 NA 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
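A minimal sketch of this conversion, using a three-row stand-in data frame built by hand (only the column name X and the first values come from the output above). In R 4.0.0 and later, read.csv() returns character columns by default, so the conversion is only needed when the column arrives as a factor.

```r
# Small stand-in data frame with the state names stored as a factor
data <- data.frame(
  X = factor(c("Alabama", "Alaska", "Arizona")),
  Murder = c(13.2, 10.0, 8.1)
)

# Structure before the conversion: X is a factor
str(data)

# Convert the factor to character
data$X <- as.character(data$X)

# Structure after the conversion: X is chr
str(data)
```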
Next I want to use the first variable, representing the US states, as the row names of this data set. The first variable is X; refer to it with the dollar sign (data$X) and assign it to rownames(data). Calling head() on the data object prints the first six rows, where you can see that the values of the first variable have been assigned as the row names.
# X Murder Assault UrbanPop Rape
# Alabama Alabama 13.2 236 58 21.2
# Alaska Alaska 10.0 263 48 44.5
# Arizona Arizona 8.1 294 80 NA
# Arkansas Arkansas 8.8 190 50 19.5
# California California 9.0 276 91 40.6
# Colorado Colorado 7.9 204 78 38.7
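The assignment described above can be sketched like this, again on a small hand-built stand-in for the imported data:

```r
# Stand-in data frame with the state names in the first column
data <- data.frame(
  X = c("Alabama", "Alaska", "Arizona"),
  Murder = c(13.2, 10.0, 8.1),
  Assault = c(236L, 263L, 294L),
  stringsAsFactors = FALSE
)

# Use the values of X as row names
rownames(data) <- data$X

head(data)
```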
As the US states are now represented by the row names, we no longer need the first variable X. To drop it, use a negative index in square brackets (data[, -1]) and assign the result to a new object (df). The head() function shows that the first variable has been removed from the data set.
# Murder Assault UrbanPop Rape
# Alabama 13.2 236 58 21.2
# Alaska 10.0 263 48 44.5
# Arizona 8.1 294 80 NA
# Arkansas 8.8 190 50 19.5
# California 9.0 276 91 40.6
# Colorado 7.9 204 78 38.7
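As a short sketch of this step on stand-in data (three rows copied from the output above):

```r
# Stand-in data frame that already carries the states as row names
data <- data.frame(
  X = c("Alabama", "Alaska", "Arizona"),
  Murder = c(13.2, 10.0, 8.1),
  Rape = c(21.2, 44.5, NA),
  row.names = c("Alabama", "Alaska", "Arizona"),
  stringsAsFactors = FALSE
)

# Drop the first column; the row names are kept
df <- data[, -1]

head(df)
```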
You can handle missing values in the data set with the na.omit() function, which returns the object with incomplete cases removed. This generic function is useful for dealing with NAs, which kmeans() does not accept. By contrast, na.pass() returns the object unchanged.
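A small sketch of the effect of na.omit(), using four rows copied from the output above (Arizona has a missing Rape value):

```r
# Stand-in data with one incomplete row
df <- data.frame(
  Murder = c(13.2, 10.0, 8.1, 8.8),
  Rape = c(21.2, 44.5, NA, 19.5),
  row.names = c("Alabama", "Alaska", "Arizona", "Arkansas")
)

# Count missing values per column
colSums(is.na(df))

# Remove every row that contains at least one NA
df <- na.omit(df)

# Arizona is no longer among the row names
rownames(df)
```

This matches the clustering output further down, where Arizona does not appear in the clustering vector.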
k-means Clustering
Now let's start with \(k-means\) clustering of the data set. This requires the factoextra package. If it is not installed on your machine, first install it with the install.packages("factoextra") command. Once the installation is complete, load the factoextra package using the require() function.
Determine and visualize the optimal number of clusters
Partitioning methods, such as \(k-means\) clustering, require the user to specify the number of clusters to be generated.
Functions:
fviz_nbclust(): within-cluster sum of squares, average silhouette, gap statistic
fviz_gap_stat(): firstmax method
To determine and visualize the optimal number of clusters, use the fviz_nbclust() function. The x argument specifies the numeric matrix or data frame to be clustered. `x` can also be the result of the function NbClust().
The argument FUNcluster is a partitioning function that accepts a data matrix like x as its first argument and the desired number of clusters, say k with \(k \geq 2\), as its second argument, and returns a list with a component named cluster containing the grouping of observations.
Allowed values include kmeans, cluster::pam, cluster::clara, cluster::fanny, hcut, etc.
This argument is not required when x is an output of the function `NbClust()`.
The method argument specifies the method to be used for estimating the optimal number of clusters.
Possible values are:
silhouette: for average silhouette width
wss: for total within-cluster sum of squares
gap_stat: for the gap statistic
To draw a reference line, use the geom_vline() function. The xintercept argument controls the position of the line, and setting linetype to 2 draws a dashed line. You can add a subtitle with the labs() function.
# Optimal number of clusters
fviz_nbclust(x = df,
             FUNcluster = kmeans,
             method = "wss") +
  # To draw reference line
  geom_vline(xintercept = 4,
             linetype = 2) +
  # Add subtitle "Elbow method"
  labs(subtitle = "Elbow method")
The plot shows that the optimal number of clusters is four, as indicated by the bend (elbow) in the curve.
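To see what the "wss" method measures, the same kind of curve can be sketched in base R by running kmeans() for a range of k and recording the total within-cluster sum of squares. This is an illustration, not the internals of fviz_nbclust(). It assumes R's built-in USArrests data as a stand-in for the author's file, and the names df_demo, k_values and wss, the seed 123 and nstart = 25 are all choices made for this example.

```r
# Built-in data set used as a stand-in (assumption: the author's file is not available here)
df_demo <- na.omit(USArrests)

# Arbitrary seed so that the random starts are reproducible
set.seed(123)

# Total within-cluster sum of squares for k = 1, ..., 10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(df_demo, centers = k, nstart = 25)$tot.withinss
})

# Elbow plot: look for the k after which the curve flattens
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Because this uses a different data set, the position of the elbow need not match the plot discussed above.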
Computing k-means clusters on a data matrix
Because the \(k-means\) algorithm selects its initial centroids at random, use the set.seed() function to fix the seed of the random number generator and make the results reproducible. The kmeans() function belongs to the stats package, which R attaches by default, so loading it with require() is optional.
The kmeans() function requires certain arguments. The x argument specifies the numeric data matrix. The centers argument specifies the number of clusters as determined above, so set it to 4. Because centers is given as a number rather than as a set of initial centers, the nstart argument specifies how many random sets of starting centers should be tried. Trying several random starts (nstart > 1) is often recommended.
# K-means clustering with 4 clusters of sizes 4, 2, 10, 9
#
# Cluster means:
# Murder Assault UrbanPop Rape
# 1 8.27500 182.75 71.75000 26.30000
# 2 13.35000 317.50 73.50000 29.85000
# 3 4.42000 92.20 62.10000 14.67000
# 4 11.58889 253.00 68.55556 29.61111
#
# Clustering vector:
# Alabama Alaska Arkansas California Colorado Connecticut
# 4 4 1 4 1 3
# Delaware Florida Hawaii Idaho Illinois Iowa
# 4 2 3 3 4 3
# Kansas Kentucky Louisiana Maine Maryland Michigan
# 3 3 4 3 2 4
# Minnesota Mississippi Missouri Montana Nebraska Nevada
# 3 4 1 3 3 4
# New Jersey
# 1
#
# Within cluster sum of squares by cluster:
# [1] 2165.068 713.810 7470.237 4494.920
# (between_SS / total_SS = 91.7 %)
#
# Available components:
#
# [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
# [6] "betweenss" "size" "iter" "ifault"
You can also access the individual components of the \(k-means\) result by appending the component name to the km.res object with the dollar sign, for example km.res$cluster or km.res$centers.
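The call and the component access described in this section can be sketched as follows. The example again assumes the built-in USArrests data as a stand-in, so its output will differ from the 30-row result printed above; the seed 123 and nstart = 25 are arbitrary choices, since the values used in the lecture are not shown.

```r
# Stand-in data (assumption) with incomplete rows removed
df_demo <- na.omit(USArrests)

# Fix the seed so the random starting centers are reproducible
set.seed(123)

# k-means with 4 clusters and 25 random starts
km.res <- kmeans(x = df_demo, centers = 4, nstart = 25)

# Access individual components with the dollar sign
km.res$size           # number of observations in each cluster
km.res$centers        # cluster means
head(km.res$cluster)  # cluster assignment of the first six states
```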
Directly computing means using aggregate function
Use the aggregate() function to get the means of each group. In aggregate(), the x argument specifies the object or variables for which means are to be computed; use x = cbind(x1, x2, x3) for a data set with more than one dependent variable. In the by argument, supply a list in which the element cluster is set to km.res$cluster. In the FUN argument, supply mean as the function to apply within each cluster.
# cluster Murder Assault UrbanPop Rape
# 1 1 8.27500 182.75 71.75000 26.30000
# 2 2 13.35000 317.50 73.50000 29.85000
# 3 3 4.42000 92.20 62.10000 14.67000
# 4 4 11.58889 253.00 68.55556 29.61111
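A sketch of the aggregate() call described above, run on the built-in USArrests data as a stand-in (an assumption, as are the seed and nstart values), so the numbers differ from the table above:

```r
# Stand-in data and a k-means fit to group by
df_demo <- na.omit(USArrests)
set.seed(123)
km.res <- kmeans(df_demo, centers = 4, nstart = 25)

# Mean of every variable within each cluster
cluster_means <- aggregate(x = df_demo,
                           by = list(cluster = km.res$cluster),
                           FUN = mean)
cluster_means
```

The result should agree with km.res$centers, which is a quick way to check that the grouping was passed correctly.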
Point classification of original data
To get the cluster assignment of each observation in the original data, combine the data set (df) with the cluster vector (km.res$cluster). The head() function prints the first six rows of the resulting object dd.
# Murder Assault UrbanPop Rape cluster
# Alabama 13.2 236 58 21.2 4
# Alaska 10.0 263 48 44.5 4
# Arkansas 8.8 190 50 19.5 1
# California 9.0 276 91 40.6 4
# Colorado 7.9 204 78 38.7 1
# Connecticut 3.3 110 77 11.1 3
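The combination can be sketched with cbind(), which is one common way to do it; the function actually used in the lecture is not shown, and the data, seed and nstart value below are stand-ins chosen for this example.

```r
# Stand-in data and k-means fit
df_demo <- na.omit(USArrests)
set.seed(123)
km.res <- kmeans(df_demo, centers = 4, nstart = 25)

# Append the cluster assignment as a new column
dd <- cbind(df_demo, cluster = km.res$cluster)

head(dd)
```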