Preparing data file for cluster analysis in R
AGRON Stats
June 27, 2018
Introduction
Here you will learn how to prepare a data set and import it into RStudio for cluster analysis. The data file used here is derived from the R demo data set USArrests. I deleted some rows to keep the data to thirty rows, and I deleted a few values (highlighted in the data set) to create some missing values.
Save this data file in CSV (comma-separated values) format.
Before importing the data set, I recommend first clearing all objects from the global environment using rm(list = ls(all = TRUE))
. You can also close all open plots using graphics.off()
and clear the console using shell("cls")
. Note that shell() is available only on Windows; in RStudio the console is cleared with cat("\014") or Ctrl+L.
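A minimal sketch of this clean-up step, assuming an interactive RStudio session. The object `scratch_value` is a throwaway name used only to show that the environment is emptied.

```r
scratch_value <- 1:3                 # hypothetical object left over from earlier work

rm(list = ls(all.names = TRUE))      # remove every object in the current environment
graphics.off()                       # close any open graphics devices
cat("\014")                          # form feed: clears the console in RStudio
```

Run this only at the top of a script, since it discards everything currently held in the workspace.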
Import data and format variables
Now let’s import the data set using the read.csv()
function. I have already saved the data file as CSV (comma delimited) in the working directory. The file
argument specifies the file name, including the .csv extension.
The header
argument takes a logical value indicating whether the first row of the file contains the variable names. My file has the variable names in the first row, so I set this argument to TRUE
. The head()
function prints the first six rows of the data set.
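The import step might look like the sketch below. The author's CSV file is not reproduced here, so the code first writes a stand-in file: the first thirty rows of the built-in USArrests data with two values blanked out. Both the choice of rows and the file name `usarrests_subset.csv` are assumptions made for illustration.

```r
# Stand-in for the author's file (assumption: first 30 rows of USArrests)
demo <- head(USArrests, 30)
demo["Arizona", "Rape"] <- NA      # blank out one value
demo[10, "Assault"] <- NA          # and one in the 10th row (Georgia)

# Hypothetical file name, written to a temporary folder
csv_path <- file.path(tempdir(), "usarrests_subset.csv")
write.csv(demo, csv_path, na = "") # missing values become empty cells

# Import: header = TRUE because the first row holds the variable names
data <- read.csv(file = csv_path, header = TRUE)
head(data)                         # first six rows
```

Because the state names are written as an unnamed first column, read.csv() labels that column X.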
#            X Murder Assault UrbanPop Rape
# 1    Alabama   13.2     236       58 21.2
# 2     Alaska   10.0     263       48 44.5
# 3    Arizona    8.1     294       80   NA
# 4   Arkansas    8.8     190       50 19.5
# 5 California    9.0     276       91 40.6
# 6   Colorado    7.9     204       78 38.7
# 'data.frame': 30 obs. of 5 variables:
#  $ X       : Factor w/ 30 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
#  $ Murder  : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
#  $ Assault : int 236 263 294 190 276 204 110 238 335 NA ...
#  $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
#  $ Rape    : num 21.2 44.5 NA 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
The structure function, str(), shows whether each variable is read as factor, character, numeric or integer. Here the first variable X
is read as a factor. (Since R 4.0.0, read.csv() reads text columns as character by default unless stringsAsFactors = TRUE is set.) To change the first variable to character, use the as.character()
function. Inside the parentheses, type the data frame name followed by a dollar sign and the variable name: data$X
. Running str() again will show that X is now read as a character variable.
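As a sketch of the conversion, the code below rebuilds a small stand-in for the imported data (an assumption: the first thirty rows of USArrests, with X stored as a factor to match the structure output shown earlier) and then converts X to character.

```r
# Stand-in for the imported data frame (assumption, not the author's file)
data <- head(USArrests, 30)
data <- cbind(X = factor(rownames(data)), data)  # X held as a factor
rownames(data) <- NULL                           # plain row numbers 1..30

str(data$X)                      # reported as a Factor
data$X <- as.character(data$X)   # convert factor to character
str(data)                        # X is now listed as chr
```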
#            X Murder Assault UrbanPop Rape
# 1    Alabama   13.2     236       58 21.2
# 2     Alaska   10.0     263       48 44.5
# 3    Arizona    8.1     294       80   NA
# 4   Arkansas    8.8     190       50 19.5
# 5 California    9.0     276       91 40.6
# 6   Colorado    7.9     204       78 38.7
Add rownames
The first variable X
holds the state names, but it is not yet used as the row names. Assign it as the row names of the data frame using the rownames()
function. The head() function shows that the values of the first variable are now the row names; however, the first variable itself still exists.
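A short sketch of the assignment, again on a stand-in data frame built from the first thirty rows of USArrests (an assumption used only to keep the example self-contained):

```r
# Stand-in data frame with the state names in column X and row numbers 1..30
data <- head(USArrests, 30)
data <- cbind(X = rownames(data), data, stringsAsFactors = FALSE)
rownames(data) <- NULL

rownames(data) <- data$X   # use the values of X as row names
head(data)                 # X is still present as the first column
```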
#                     X Murder Assault UrbanPop Rape
# Alabama       Alabama   13.2     236       58 21.2
# Alaska         Alaska   10.0     263       48 44.5
# Arizona       Arizona    8.1     294       80   NA
# Arkansas     Arkansas    8.8     190       50 19.5
# California California    9.0     276       91 40.6
# Colorado     Colorado    7.9     204       78 38.7
To remove the variable X
, put minus one in the column position inside square brackets after the data frame name: data[, -1]
. This drops the first variable from the data set.
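The negative index can be sketched as follows; the data frame is a stand-in built from USArrests (an assumption), with X duplicating the row names.

```r
# Stand-in: state names appear both as row names and as column X
data <- head(USArrests, 30)
data <- cbind(X = rownames(data), data, stringsAsFactors = FALSE)

data <- data[, -1]   # keep all rows, drop the first column (X)
head(data)
```

Dropping by position assumes X really is the first column; `data$X <- NULL` is an alternative that removes the column by name.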
#            Murder Assault UrbanPop Rape
# Alabama      13.2     236       58 21.2
# Alaska       10.0     263       48 44.5
# Arizona       8.1     294       80   NA
# Arkansas      8.8     190       50 19.5
# California    9.0     276       91 40.6
# Colorado      7.9     204       78 38.7
Handling missing values
When data is imported, R represents missing values as NA. If your data set contains missing values, use the na.omit()
function. This generic function returns the object with incomplete cases (rows containing at least one NA) removed.
By contrast, na.pass() returns the object unchanged.
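The difference between the two functions can be seen in a small sketch. The stand-in data below (first thirty rows of USArrests with two values set to NA) is an assumption chosen to mirror the missing values visible in the printed output.

```r
# Stand-in data with two missing values in two different rows
data <- head(USArrests, 30)
data["Arizona", "Rape"] <- NA
data[10, "Assault"] <- NA

sum(!complete.cases(data))   # number of rows with at least one NA

clean <- na.omit(data)       # rows containing NA are removed
nrow(clean)

same <- na.pass(data)        # object returned unchanged, NAs kept
```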
Now the data set is ready for cluster analysis. Scale the data to standardize it, then choose an appropriate distance measure for the cluster analysis.
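A minimal sketch of these two follow-up steps, using base R's scale() and dist(). The Euclidean method is an assumption for illustration, since no particular distance measure is named here, and the data frame is again a stand-in built from USArrests.

```r
# Stand-in for the prepared data: numeric columns only, incomplete rows removed
data <- head(USArrests, 30)
data["Arizona", "Rape"] <- NA
data[10, "Assault"] <- NA
data <- na.omit(data)

scaled <- scale(data)                     # each column: mean 0, standard deviation 1
d <- dist(scaled, method = "euclidean")   # pairwise distances between states

round(as.matrix(d)[1:3, 1:3], 2)          # peek at the first few distances
```

Standardizing first keeps a variable with large values, such as Assault, from dominating the distances.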
Please comment below if you have any questions