Preparing a data file for cluster analysis in R


Introduction

Here you will learn how to prepare a data set and import it into RStudio for cluster analysis. The data file used here is derived from the R demo data set USArrests. I deleted some rows to keep the data to thirty rows, and I deleted some values (highlighted in the data set) to create missing values.

Figure 1: Original data file

Figure 2: Modified data file

Save this data file in CSV (comma-separated values) format.

Figure 3: Save file as CSV format
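If you would rather build a similar file in R than edit it by hand, the steps above can be sketched roughly as follows. This is only an approximation of my file: keeping the first thirty rows is an assumption (the exact rows deleted are not listed here), only the two missing values visible in the output later in this post are reproduced, and the file name data_prep_demo.csv is hypothetical.

```r
# Assumption: keep the first 30 states; the post does not say which rows were deleted.
demo <- USArrests[1:30, ]

# Blank out the two values that show up as NA in the printed output of this post.
demo["Arizona", "Rape"]    <- NA
demo["Georgia", "Assault"] <- NA

# Hypothetical file name, written to a temporary folder so nothing is overwritten.
out_file <- file.path(tempdir(), "data_prep_demo.csv")
write.csv(demo, file = out_file, row.names = TRUE)

# The state names are written as an unnamed first column,
# which read.csv() later labels "X".
readLines(out_file, n = 3)
```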

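I recommend clearing all objects from the global environment with rm(list = ls(all.names = TRUE)) before importing the data set. You can also close any open plots with graphics.off() and clear the console with shell("cls"). Note that shell() is available only in R on Windows; in RStudio, cat("\014") or Ctrl+L clears the console on any platform.

rm(list = ls(all.names = TRUE))  # remove all objects, including hidden ones
graphics.off()                   # close all open graphics devices
shell("cls")                     # Windows only; in RStudio use cat("\014") instead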

Import data and format variables

Now let’s import the data set using the read.csv() function. I have already saved the data file as CSV (comma delimited) in the working directory. The file argument specifies the file name, including the .csv extension.

The header argument takes a logical value indicating whether the first row of the file contains the variable names. My file does have the variable names in the first row, so I set this argument to TRUE. The head() function prints the first six rows of the data set.

data <- read.csv(file = "data_prep.csv", 
                 header = TRUE)
head(data)
#            X Murder Assault UrbanPop Rape
# 1    Alabama   13.2     236       58 21.2
# 2     Alaska   10.0     263       48 44.5
# 3    Arizona    8.1     294       80   NA
# 4   Arkansas    8.8     190       50 19.5
# 5 California    9.0     276       91 40.6
# 6   Colorado    7.9     204       78 38.7
str(data)
# 'data.frame': 30 obs. of  5 variables:
#  $ X       : Factor w/ 30 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
#  $ Assault : int  236 263 294 190 276 204 110 238 335 NA ...
#  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
#  $ Rape    : num  21.2 44.5 NA 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

The str() (structure) function shows whether each variable is being read as factor, character, numeric or integer. Here the first variable X is read as a factor. (This output comes from a version of R older than 4.0.0; from 4.0.0 onward read.csv() reads text columns as character by default.) If you want to change the first variable to character, use the as.character() function. Within the parentheses type the data frame name and attach the variable name to it with a dollar sign: data$X. The rows printed by head() look the same as before, but running str(data) again will show that X is now a character variable.

data$X = as.character(data$X)
head(data)
#            X Murder Assault UrbanPop Rape
# 1    Alabama   13.2     236       58 21.2
# 2     Alaska   10.0     263       48 44.5
# 3    Arizona    8.1     294       80   NA
# 4   Arkansas    8.8     190       50 19.5
# 5 California    9.0     276       91 40.6
# 6   Colorado    7.9     204       78 38.7
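The conversion step can also be avoided by telling read.csv() how to treat text columns at import time through its stringsAsFactors argument. The sketch below is a small self-contained illustration, not my actual script: it writes five rows of USArrests to a temporary file (the name factor_demo.csv is hypothetical) and reads it back both ways.

```r
# Write a small demo file; the file name is hypothetical.
csv_file <- file.path(tempdir(), "factor_demo.csv")
write.csv(USArrests[1:5, ], csv_file)

# Read the same file twice, once with text columns as factors and once as character.
as_factor <- read.csv(csv_file, header = TRUE, stringsAsFactors = TRUE)
as_char   <- read.csv(csv_file, header = TRUE, stringsAsFactors = FALSE)

class(as_factor$X)  # "factor"
class(as_char$X)    # "character"
```

Setting the argument explicitly makes the script behave the same way regardless of the R version it runs on.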

Add rownames

The first variable X holds the state names, but it is an ordinary column and not yet the row names. Its values can be assigned as row names of the data frame with the rownames() function. The head() function shows that the values of the first variable are now used as row names; however, the first variable itself still exists.

rownames(data) = c(data$X)
head(data)
#                     X Murder Assault UrbanPop Rape
# Alabama       Alabama   13.2     236       58 21.2
# Alaska         Alaska   10.0     263       48 44.5
# Arizona       Arizona    8.1     294       80   NA
# Arkansas     Arkansas    8.8     190       50 19.5
# California California    9.0     276       91 40.6
# Colorado     Colorado    7.9     204       78 38.7

To remove the variable X, type minus one in the column position inside square brackets after the data frame name: data[, -1]. This drops the first variable, and the result is stored here in a new object called newdata.

newdata = data[, -1]
head(newdata)
#            Murder Assault UrbanPop Rape
# Alabama      13.2     236       58 21.2
# Alaska       10.0     263       48 44.5
# Arizona       8.1     294       80   NA
# Arkansas      8.8     190       50 19.5
# California    9.0     276       91 40.6
# Colorado      7.9     204       78 38.7

Handling missing values

When a CSV file is imported with read.csv(), empty cells in numeric columns are read as NA by default. If your data set contains missing values, use the na.omit() function. This generic function returns the object with incomplete cases removed, that is, every row containing at least one NA is dropped. By contrast, na.pass() returns the object unchanged.
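Before dropping rows it helps to see how many values are missing and where. The following sketch uses a small demo data frame built from USArrests with one value blanked out; the object names demo and clean are only for illustration. The equivalent call on the data from this post would be na.omit(newdata).

```r
# Small demo data frame with one missing value.
demo <- USArrests[1:6, ]
demo["Arizona", "Rape"] <- NA

# Count missing values per variable.
colSums(is.na(demo))

# Show the rows that contain at least one NA.
demo[!complete.cases(demo), ]

# Remove incomplete rows; Arizona is dropped.
clean <- na.omit(demo)
nrow(clean)  # 5
```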

Now the data set is ready for cluster analysis. Standardize the variables with scale() so that variables measured on larger scales do not dominate, and choose an appropriate distance measure for the cluster analysis.
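A minimal sketch of that last step is given below. It uses ten rows of USArrests as stand-in data, and the Euclidean distance is only an assumption for illustration; the right distance measure depends on your data and the clustering method you plan to use.

```r
# Stand-in data: ten complete rows of USArrests.
demo <- na.omit(USArrests[1:10, ])

# Standardize each variable to mean 0 and standard deviation 1.
scaled <- scale(demo)

# Pairwise distances between states; Euclidean is an assumed choice.
d <- dist(scaled, method = "euclidean")

# Check the standardization.
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```

The resulting dist object d can be passed to functions such as hclust() for hierarchical clustering.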

Please comment below if you have any questions
