K-means clustering in R


Introduction

\(k\)-means clustering is a method of vector quantization, originally from signal processing, that aims to partition \(n\) observations into \(k\) clusters so that each observation belongs to the cluster with the nearest mean (the cluster centroid), which serves as a prototype of the cluster.

In other words, the \(k\)-means algorithm identifies \(k\) centroids and allocates every data point to the nearest centroid, while keeping the within-cluster sum of squared distances to the centroids as small as possible.
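Written out, and assuming the usual notation of clusters \(C_1, \dots, C_k\) with means \(\mu_1, \dots, \mu_k\) (symbols introduced here for illustration only), the quantity that \(k\)-means tries to minimize is the total within-cluster sum of squares:

\[
\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\]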

Let's get started

Import data file

I recommend first clearing all objects from the global environment using rm(list = ls(all = TRUE)) before importing the data set. You can also close all open plots using graphics.off() and clear the console using the shell() function. Note that shell() is available only on Windows, so skip that line on other systems.

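rm(list = ls(all = TRUE))  # remove all objects from the global environment
graphics.off()             # close all open graphics devices
shell("cls")               # clear the console (Windows only)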

Now let’s import the data set using the read.csv() function. I have already saved the data file as a CSV (comma-delimited) file in the working directory. The file argument specifies the file name, including the .csv extension. The header argument takes a logical value indicating whether the first row of the file contains the variable names. In my data set it does, so I set this argument to TRUE. The head() function prints the first six rows of the data set.

data <- read.csv(file = "data_kmeans.csv", 
                 header = TRUE)
head(data)
#            X Murder Assault UrbanPop Rape
# 1    Alabama   13.2     236       58 21.2
# 2     Alaska   10.0     263       48 44.5
# 3    Arizona    8.1     294       80   NA
# 4   Arkansas    8.8     190       50 19.5
# 5 California    9.0     276       91 40.6
# 6   Colorado    7.9     204       78 38.7
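
If you do not have data_kmeans.csv at hand, the sketch below builds a stand-in data frame with the same columns. It assumes that the file is derived from the first 30 rows of R's built-in USArrests data set, which is only a guess based on the printed values. Only the two missing values visible in the output of this post are reproduced, so later results may differ from those shown here. The column X is created as character directly.

```r
# Assumption: data_kmeans.csv resembles the first 30 states of USArrests
data <- data.frame(X = rownames(USArrests)[1:30],
                   USArrests[1:30, ],
                   stringsAsFactors = FALSE)
rownames(data) <- NULL  # use plain row numbers, as read.csv() would

# Reproduce the two missing values visible in the printed output
data$Rape[data$X == "Arizona"] <- NA
data$Assault[data$X == "Georgia"] <- NA

head(data)
```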

Observe data & format variables

Before proceeding with the analysis, it is good practice to look at the data set and check the structure of the variables using str(). Here the first variable, representing the US states, has been read as a factor (this is the default in R versions before 4.0.0; newer versions already read it as character). I would like to change it to character, which you can do using the as.character() function as shown below.

str(data)
# 'data.frame': 30 obs. of  5 variables:
#  $ X       : Factor w/ 30 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
#  $ Assault : int  236 263 294 190 276 204 110 238 335 NA ...
#  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
#  $ Rape    : num  21.2 44.5 NA 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
data$X <- as.character(x = data$X)
str(data)
# 'data.frame': 30 obs. of  5 variables:
#  $ X       : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
#  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
#  $ Assault : int  236 263 294 190 276 204 110 238 335 NA ...
#  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
#  $ Rape    : num  21.2 44.5 NA 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

Next I want to use the first variable, which represents the US states, as the row names of this data set. The first variable is X, so access it with the dollar sign as data$X and assign it to rownames(data). Using head() on the data object prints the first six rows, where you can see that the values of the first variable have been assigned as the row names.

rownames(data) <- data$X
head(data)
#                     X Murder Assault UrbanPop Rape
# Alabama       Alabama   13.2     236       58 21.2
# Alaska         Alaska   10.0     263       48 44.5
# Arizona       Arizona    8.1     294       80   NA
# Arkansas     Arkansas    8.8     190       50 19.5
# California California    9.0     276       91 40.6
# Colorado     Colorado    7.9     204       78 38.7

As the US states are now represented by the row names, we don’t need the first variable X anymore. To delete it, drop the first column by using a negative index in square brackets (data[, -1]) and assign the result to a new object (df). Using the head() function you will see that the first variable has been removed from the data set.

df <- data[,-1]
head(df)
#            Murder Assault UrbanPop Rape
# Alabama      13.2     236       58 21.2
# Alaska       10.0     263       48 44.5
# Arizona       8.1     294       80   NA
# Arkansas      8.8     190       50 19.5
# California    9.0     276       91 40.6
# Colorado      7.9     204       78 38.7

You can handle missing values in the data set using the na.omit() function, which returns the object with incomplete cases (rows containing at least one NA) removed. This generic function is useful for dealing with NAs. By contrast, na.pass() returns the object unchanged.

df <- na.omit(df)
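
Before dropping rows it can help to see how many values are missing and which rows will be removed. The following is a minimal sketch on a small made-up data frame (toy is a hypothetical stand-in for df, and its values are for illustration only):

```r
# Hypothetical toy data with two incomplete rows
toy <- data.frame(Murder  = c(13.2, 10.0, 8.1, 8.8),
                  Assault = c(236, NA, 294, 190),
                  Rape    = c(21.2, 44.5, NA, 19.5),
                  row.names = c("A", "B", "C", "D"))

colSums(is.na(toy))                   # number of NAs in each column
rownames(toy)[!complete.cases(toy)]   # rows that na.omit() will drop

toy_clean <- na.omit(toy)
nrow(toy_clean)                       # number of complete rows that remain
```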

k-means Clustering

Now let’s start with \(k\)-means clustering of the data set. This requires the factoextra package. If it is not installed on your device, first install it using the install.packages("factoextra") command. After installation is complete, load the factoextra package using the require() function.

require(factoextra)

Determine and visualize the optimal number of clusters

Partitioning methods such as \(k\)-means clustering require the user to specify the number of clusters to be generated.

Functions:

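fviz_nbclust(): determines and visualizes the optimal number of clusters using the within-cluster sum of squares (elbow), average silhouette, or gap statistic method

fviz_gap_stat(): visualizes the gap statistic, for example with the firstmax method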

To determine and visualize the optimal number of clusters, use the fviz_nbclust() function. Its x argument specifies the numeric matrix or data frame that holds the data.

`x` can also be the result of the function NbClust().

The argument FUNcluster is a partitioning function that accepts a data matrix like x as its first argument and the desired number of clusters, say k with \(k \geq 2\), as its second argument, and returns a list with a component named cluster that contains the grouping of observations.

Allowed values include:
kmeans, cluster::pam, cluster::clara, cluster::fanny, hcut, etc. 
This argument is not required when x is an output of the function `NbClust()`. 

The method argument specifies the method to be used for estimating the optimal number of clusters.

Possible values are

silhouette: for average silhouette width

wss: for the total within-cluster sum of squares, and

gap_stat: for gap statistics

To draw a reference line, add the geom_vline() function from ggplot2. Its xintercept argument controls the position of the line, and setting the linetype argument to 2 draws a dashed line. You can add a subtitle using the labs() function.

# Optimal number of clusters
fviz_nbclust(x = df,
             FUNcluster = kmeans,
             method = "wss") +
  # Draw a dashed reference line at k = 4
  geom_vline(xintercept = 4,
             linetype = 2) +
  # Add the subtitle "Elbow method"
  labs(subtitle = "Elbow method")
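
To see what the elbow plot is built from, the total within-cluster sum of squares can also be computed by hand for each candidate k. This is only a sketch in base R and is not a description of how factoextra works internally. It uses the built-in USArrests data as a stand-in for df (an assumption), and the names x, k_values and wss are chosen here for illustration.

```r
# Stand-in data (assumption): replace with your own df
x <- na.omit(USArrests)

set.seed(123)
k_values <- 1:10
# Total within-cluster sum of squares for each number of clusters
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Base R version of the elbow plot
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve starts at the total sum of squares (k = 1) and flattens as k grows; the elbow is the point after which adding clusters brings little further reduction.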

The plot shows that the optimal number of clusters is four, as indicated by the bend (elbow) of the curve.

Computing k-means clusters on a data matrix

The \(k\)-means algorithm starts from randomly selected centroids, so set a seed for the random number generator using the set.seed() function to make the results reproducible. The kmeans() function is part of the stats package, which is loaded by default in every R session; the require(stats) call below only makes this explicit.

The kmeans() function requires certain arguments. The x argument specifies the numeric matrix of data. The centers argument specifies the number of clusters as determined above, so set it to 4. Because centers is given as a number rather than as a set of initial centroids, the nstart argument controls how many random sets of starting centroids are tried, and the best result is kept. Trying several random starts (nstart > 1) is often recommended.

set.seed(123)
require(stats)
km.res <- kmeans(x = df, 
                 centers = 4, 
                 nstart = 25)
print(km.res)
# K-means clustering with 4 clusters of sizes 4, 2, 10, 9
# 
# Cluster means:
#     Murder Assault UrbanPop     Rape
# 1  8.27500  182.75 71.75000 26.30000
# 2 13.35000  317.50 73.50000 29.85000
# 3  4.42000   92.20 62.10000 14.67000
# 4 11.58889  253.00 68.55556 29.61111
# 
# Clustering vector:
#     Alabama      Alaska    Arkansas  California    Colorado Connecticut 
#           4           4           1           4           1           3 
#    Delaware     Florida      Hawaii       Idaho    Illinois        Iowa 
#           4           2           3           3           4           3 
#      Kansas    Kentucky   Louisiana       Maine    Maryland    Michigan 
#           3           3           4           3           2           4 
#   Minnesota Mississippi    Missouri     Montana    Nebraska      Nevada 
#           3           4           1           3           3           4 
#  New Jersey 
#           1 
# 
# Within cluster sum of squares by cluster:
# [1] 2165.068  713.810 7470.237 4494.920
#  (between_SS / total_SS =  91.7 %)
# 
# Available components:
# 
# [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
# [6] "betweenss"    "size"         "iter"         "ifault"
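
The effect of nstart can be checked by comparing a single random start with the best of 25 starts. This is a small sketch that uses the built-in USArrests data as a stand-in for df (an assumption); km1 and km25 are names chosen here for illustration.

```r
x <- na.omit(USArrests)  # stand-in data (assumption)

set.seed(123)
km1 <- kmeans(x, centers = 4, nstart = 1)    # one random start
set.seed(123)
km25 <- kmeans(x, centers = 4, nstart = 25)  # best of 25 random starts

# Lower is better; the best of several starts is usually no worse than one
c(single = km1$tot.withinss, best_of_25 = km25$tot.withinss)
```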

You can also access the individual components of the \(k\)-means results by combining the km.res object with the component name using the dollar sign.

km.res$cluster
km.res$centers
km.res$size
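
The percentage reported as between_SS / total_SS in the printed output can be recomputed from these components. A short sketch on the built-in USArrests data, used as a stand-in for df (an assumption), with km as an illustrative name:

```r
x <- na.omit(USArrests)  # stand-in data (assumption)
set.seed(123)
km <- kmeans(x, centers = 4, nstart = 25)

# The total sum of squares splits into a within-cluster and a between-cluster part
all.equal(km$totss, km$tot.withinss + km$betweenss)

# Share of the total variation explained by the clustering, in percent
round(100 * km$betweenss / km$totss, 1)
```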

Directly computing means using aggregate function

Use the aggregate() function to get the means of each group. In aggregate(), the x argument specifies the object or variables for which means are to be computed. Use x = cbind(x1, x2, x3) to select more than one variable. In the by argument, pass a list that contains the grouping vector, here km.res$cluster, under the name cluster. In the FUN argument, specify mean as the function to apply to each cluster.

aggregate(x = df, 
          by = list(cluster = km.res$cluster), 
          FUN = mean)
#   cluster   Murder Assault UrbanPop     Rape
# 1       1  8.27500  182.75 71.75000 26.30000
# 2       2 13.35000  317.50 73.50000 29.85000
# 3       3  4.42000   92.20 62.10000 14.67000
# 4       4 11.58889  253.00 68.55556 29.61111
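
Because the cluster centers are simply the per-cluster means, the aggregate() result should match km.res$centers. The sketch below checks this on the built-in USArrests data, used as a stand-in for df (an assumption); km and agg are names chosen here for illustration.

```r
x <- na.omit(USArrests)  # stand-in data (assumption)
set.seed(123)
km <- kmeans(x, centers = 4, nstart = 25)

agg <- aggregate(x = x,
                 by = list(cluster = km$cluster),
                 FUN = mean)

# Compare the aggregated means (without the cluster column) with the centers
all.equal(as.matrix(agg[, -1]), km$centers, check.attributes = FALSE)
```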

Point classification of original data

To get the cluster assignment of each observation, combine the data set (df) with the cluster component (km.res$cluster). The head() function prints the first six rows of the resulting object dd.

dd <- cbind(df, 
            cluster = km.res$cluster)
head(dd)
#             Murder Assault UrbanPop Rape cluster
# Alabama       13.2     236       58 21.2       4
# Alaska        10.0     263       48 44.5       4
# Arkansas       8.8     190       50 19.5       1
# California     9.0     276       91 40.6       4
# Colorado       7.9     204       78 38.7       1
# Connecticut    3.3     110       77 11.1       3
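
With the cluster column attached, it is easy to count or list the members of each cluster. A brief sketch on the built-in USArrests data, used as a stand-in for df (an assumption):

```r
x <- na.omit(USArrests)  # stand-in data (assumption)
set.seed(123)
km <- kmeans(x, centers = 4, nstart = 25)
dd <- cbind(x, cluster = km$cluster)

table(dd$cluster)                 # number of observations per cluster
split(rownames(dd), dd$cluster)   # state names grouped by cluster
head(dd[order(dd$cluster), ])     # data sorted by cluster
```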