### Cluster analysis in R


#### AGRON stats

#### January 25, 2020

### Introduction

Clustering is the task of dividing a population or a set of data points into groups such that points in the same group are more similar to each other than to points in other groups. It is one of the important data mining methods for discovering knowledge in multidimensional data. The goal of clustering is to identify patterns or groups of similar objects within a data set of interest.

### Packages used

Several packages are available for cluster analysis. I shall use two of them, `cluster` and `factoextra`. You can install them with the `install.packages()` function; downloading the package files from CRAN will take some time. The `cluster` package is used for distance measures, while the `factoextra` package is used for elegant, ggplot2-based visualization of clustering results.

```
install.packages("cluster")
install.packages("factoextra")
# OR install both packages with single command
install.packages(c("cluster", "factoextra"))
```

To load and attach these packages use the `library()` or `require()` function. Both functions load the namespace of the named package and attach it to the search list. Both also check the list of currently attached packages and do not reload a namespace that is already loaded.

`Use .packages(all.available = TRUE) to obtain just the names of all available packages`
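One practical difference, stated here as general R behaviour rather than something taken from this tutorial: `library()` stops with an error when a package is missing, whereas `require()` returns `FALSE` with a warning. The following is a minimal sketch of loading the two packages defensively; the variable names `pkgs` and `installed` are only illustrative.

```r
# Packages used in this tutorial
pkgs <- c("cluster", "factoextra")

# requireNamespace() returns TRUE/FALSE without attaching the package,
# so it can be used to check whether each package is installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Attach only the packages that are actually available
for (pkg in pkgs[installed]) {
  library(pkg, character.only = TRUE)
}
```

Any package reported as `FALSE` still needs to be installed with `install.packages()`.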

### Import data file

Before importing data into R, the first thing to do is to *prepare the data file* according to the following instructions.

- Use the first row as `column names` that represent variables.
- Use the first column as `row names` that represent observations.
- Avoid column names with `blank spaces`. For example, a good name for plant height places an underscore or dot between the two words, e.g. `plant_height` or `plant.height`. A bad name puts a space between the two words (plant height).
- Avoid names with `special symbols`.
- Avoid beginning variable names with a number. Use a letter instead, e.g. instead of 1000_grain_weight type `th_grain_weight` or `grain_weight_1000`.
- Avoid `blank rows` in your data.
- Delete any `comments` in your file.
- Replace `missing values` with NA; incomplete rows can later be removed with the `na.omit()` function.
- Use a four-digit format for years in columns containing dates.
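As a rough illustration of why these rules matter, the sketch below writes a tiny file that follows them and reads it back with `read.csv()`. The file contents, the column names and the object name `my_data` are made up for this example; only `read.csv()` and its `header` and `row.names` arguments are standard R.

```r
# Write a small, made-up data set to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "genotype,plant_height,grain_weight_1000",
  "G1,95.2,41.5",
  "G2,88.7,NA",
  "G3,101.4,39.8"
), csv_path)

# header = TRUE : the first row holds the column (variable) names
# row.names = 1 : the first column holds the row (observation) names
my_data <- read.csv(csv_path, header = TRUE, row.names = 1)

# Inspect the structure; the cell written as NA is read as a missing value
str(my_data)
```

Because the column names contain no spaces and do not start with a number, R keeps them exactly as written in the file.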

After preparing the file, the next step is to save it in `.csv` format. R also ships with several built-in demo data sets for experimenting with R functions, including `USArrests`, `iris` and `mtcars`. To load a demo data set you can use the `data()` function. In this example the `USArrests` data set will be used to perform cluster analysis in R. The `head()` function prints the first six rows of the data set.
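A minimal sketch of these two steps; the object name `arrests` is an assumption made for this example and does not come from the original text.

```r
# Load the built-in demo data set into the workspace
data("USArrests")

# Keep a working copy; the name `arrests` is only illustrative
arrests <- USArrests

# Print the first six rows
head(arrests)
```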

```
#            Murder Assault UrbanPop Rape
# Alabama      13.2     236       58 21.2
# Alaska       10.0     263       48 44.5
# Arizona       8.1     294       80 31.0
# Arkansas      8.8     190       50 19.5
# California    9.0     276       91 40.6
# Colorado      7.9     204       78 38.7
```

### Handling with missing values

To check whether the data contain missing values, use the `na.fail()` function, which returns the object if it does not contain any missing values and signals an error otherwise. If the data have missing values, use the `na.omit()` function to remove the incomplete cases.

`na.pass() function returns the object unchanged`
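The behaviour of these functions can be sketched as follows. `USArrests` has no missing values, so a copy with one artificially inserted `NA` is used to show the other case; the object names `with_na`, `result` and `cleaned` are hypothetical.

```r
data("USArrests")

# USArrests is complete, so na.fail() simply returns it unchanged
anyNA(USArrests)
checked <- na.fail(USArrests)

# Hypothetical copy with one missing value inserted for illustration
with_na <- USArrests
with_na[1, "Murder"] <- NA

# na.fail() now signals an error, which is caught here
result <- tryCatch(na.fail(with_na),
                   error = function(e) "has missing values")
print(result)

# na.omit() drops the incomplete row
cleaned <- na.omit(with_na)
nrow(cleaned)
```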

### Scaling of the data

The data are scaled because the clustering should not depend on the arbitrary units in which a variable happens to be measured. This is particularly recommended when variables are measured on different scales. The goal of scaling is to make variables more comparable.

`Generally variables are scaled to have a standard deviation of one and a mean of zero.`

The data are also standardized when the means and standard deviations of the variables differ widely. Scaling transforms each value into a ratio of two quantities: the numerator is its deviation from the mean or the median, and the denominator is the standard deviation, the interquartile range or the median absolute deviation.

This approach is widely used in gene expression data analysis before clustering. Use the `scale()` function in the R console to standardize the data.
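A short sketch of the standardization step. The object name `data.scaled` is the one used later in this tutorial; the two checks at the end are an addition for illustration.

```r
data("USArrests")

# scale() centers each column on its mean and divides by its standard deviation
data.scaled <- scale(USArrests)
head(data.scaled)

# Check: column means are (numerically) zero and standard deviations are one
round(colMeans(data.scaled), 10)
apply(data.scaled, 2, sd)
```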

```
#                Murder   Assault   UrbanPop         Rape
# Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
# Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
# Arizona    0.07163341 1.4788032  0.9989801  1.042878388
# Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
# California 0.27826823 1.2628144  1.7589234  2.067820292
# Colorado   0.02571456 0.3988593  0.8608085  1.864967207
```

### Distance matrix computation

For computing distance measures we can use three methods:

- Euclidean distance
- correlation based distance
- distance for mixed data

#### Euclidean distance

First we shall see how to compute the **Euclidean distance**. The `dist()` function from the stats package computes a specified distance measure between the rows of a data matrix. Specify a numeric matrix, data frame or **dist** object in the `x` argument. In the `method` argument you can give the distance measure to be used.

`method`: “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”

To make the distance information generated by `dist()` easier to read, convert it to a matrix with the `as.matrix()` function and round the values with `round()`. Here the ranges 1 to 3 select the first three rows and the first three columns, respectively. The value 1 means the output will be rounded to one decimal place.
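The steps just described can be sketched as follows. The object name `data.eucl` is the one used in the visualization code later in this tutorial; the data are rescaled here only so that the block runs on its own.

```r
data("USArrests")
data.scaled <- scale(USArrests)

# Euclidean distances between the rows (states) of the scaled data
data.eucl <- dist(x = data.scaled, method = "euclidean")

# Convert to a matrix, keep the first three rows and columns,
# and round to one decimal place
round(as.matrix(data.eucl)[1:3, 1:3], 1)
```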

```
#         Alabama Alaska Arizona
# Alabama     0.0    2.7     2.3
# Alaska      2.7    0.0     2.7
# Arizona     2.3    2.7     0.0
```

In this distance matrix, the values represent the distances between the objects. The values on the diagonal represent the distance between each object and itself, which is zero.

#### Correlation based distance

The second type is correlation-based distance. It is commonly used in *gene expression* data analysis. To compute correlation-based distances, use the `get_dist()` function after loading the `factoextra` package with the `library()` function.

`method`: “pearson”, “spearman” or “kendall”

`Pearson` correlation is the most commonly used method. It is also known as a *parametric correlation*, which depends on the distribution of the data. `Kendall` and `Spearman` correlations are *non-parametric* measures of association used for rank-based correlation analysis.

```
library(factoextra)
data.cor <- get_dist(x = data.scaled,
                     method = "pearson")
round(as.matrix(data.cor)[1:3, 1:3], 1)
```

```
#         Alabama Alaska Arizona
# Alabama     0.0    0.7     1.4
# Alaska      0.7    0.0     0.8
# Arizona     1.4    0.8     0.0
```
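To make the idea of a correlation-based distance concrete, here is a base-R sketch that assumes the usual definition of one minus the correlation between two observations. Whether `get_dist()` is implemented exactly this way is an assumption, although this calculation gives the same rounded values as the output above. The name `cor.dist` is hypothetical.

```r
data("USArrests")
data.scaled <- scale(USArrests)

# cor() correlates columns, so transpose to correlate the rows (states)
row.cor <- cor(t(data.scaled), method = "pearson")

# Assumed definition: distance = 1 - correlation
cor.dist <- as.dist(1 - row.cor)

round(as.matrix(cor.dist)[1:3, 1:3], 1)
```

Under this definition, two perfectly correlated observations have distance 0 and two perfectly anti-correlated observations have distance 2.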

#### Distances for mixed data

The two distance measures above accept only numeric data. For data containing both numeric and non-numeric variables (mixed data), the `daisy()` function is used to compute the distances. In `daisy()`, *Gower’s coefficient*, one of the most popular measures of proximity for mixed data types, is used as the metric.

Here we shall use a different example, the flower data, which contain factor, ordered factor and numeric variables. First load the `cluster` package using the `library()` function. Then load the R demo data set `flower` using the `data()` function.
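A minimal sketch of these steps, with `str()` used to display the structure of the data set:

```r
# The flower data set ships with the cluster package
library(cluster)
data("flower")

# Show the variable types: factors, ordered factors and numeric columns
str(flower)
```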

```
# 'data.frame': 18 obs. of 8 variables:
# $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
# $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
# $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
# $ V4: Factor w/ 5 levels "1","2","3","4",..: 4 2 3 4 5 4 4 2 3 5 ...
# $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
# $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<"4"<..: 15 3 1 16 2 12 13 7 4 14 ...
# $ V7: num 25 150 150 125 20 50 40 100 25 100 ...
# $ V8: num 15 50 50 50 15 40 20 15 15 60 ...
```

To compute the distances for mixed variables use the `daisy()` function. Convert the result to a matrix with `as.matrix()` and round it to two decimal places with `round()`.
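A sketch of this computation. The object name `data.daisy` is the one used in the visualization code later in this tutorial; setting `metric = "gower"` explicitly is a choice made here for clarity.

```r
library(cluster)
data("flower")

# Gower dissimilarities between the rows of the mixed-type data
data.daisy <- daisy(x = flower, metric = "gower")

# Convert to a matrix, keep the first three rows and columns,
# and round to two decimal places
round(as.matrix(data.daisy)[1:3, 1:3], 2)
```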

```
#      1    2    3
# 1 0.00 0.89 0.53
# 2 0.89 0.00 0.51
# 3 0.53 0.51 0.00
```

### Visualizing distances

After computing the distances, the next step is to visualize the distance matrix. A simple way to do this is the `fviz_dist()` function, available after loading the `factoextra` package.

`With order = TRUE this function reorders the data samples so that similar objects are placed next to each other.`

```
library(factoextra)
# Visualize Euclidean matrix
fviz_dist(dist.obj = data.eucl,
          order = TRUE, show_labels = TRUE)
# Visualize correlation matrix
fviz_dist(dist.obj = data.cor,
          order = TRUE, show_labels = TRUE)
# Visualize mixed data distance matrix
fviz_dist(dist.obj = data.daisy,
          order = TRUE, show_labels = TRUE)
```