Principal component analysis in R


Introduction

The principal aim of principal component analysis (PCA) is dimension reduction. A data set often contains many variables; projects dealing with soil horizon data, for example, can involve more than a hundred. Graphically inspecting the main structure of such a multivariate data set is difficult, so we look for components that express as much of the inherent variability of the complete data set as possible.

PCA generates as many components as there are variables. However, most of the inherent information in the data set is usually concentrated in the first few components, so keeping only those few considerably reduces the dimension of the data set.

PCA is often used as a first step for further multivariate data analysis procedures like:

  - Cluster analysis
  - Multiple regression
  - Discriminant analysis

PCA is based on the correlation or covariance matrix.

Let’s start the analysis in RStudio. The example data set used here contains measurements of Swiss banknotes: 200 notes in total (100 genuine and 100 counterfeit), with all values measured in millimeters.

Before importing the data set, I often recommend clearing all objects in the global environment with rm(list = ls(all = TRUE)). You can also clear the plots with graphics.off() and clear the console with the shell() function (shell("cls") works on Windows; in RStudio you can press Ctrl+L instead).

rm(list = ls(all = TRUE))  # clear all objects in the global environment
graphics.off()             # close all open graphics devices
shell("cls")               # clear the console (Windows only)

Import data set

Now let’s import the data set using the read.csv() function. I have already saved the data file as a CSV (comma-delimited) file in the working directory. The file argument specifies the file name with the .csv extension.

The header argument takes a logical value indicating whether the first row of the file contains the variable names. My file does contain the variable names in the first row, so I set this argument to TRUE. The head() function will print the first six rows of the data set.

data = read.csv(file = "bank2.csv",
                header = TRUE)
head(data)
#     note.type Length  Left Right Bottom  Top Diagonal
# 1 counterfeit  214.8 131.0 131.1    9.0  9.7    141.0
# 2 counterfeit  214.6 129.7 129.7    8.1  9.5    141.7
# 3 counterfeit  214.8 129.7 129.7    8.7  9.6    142.2
# 4 counterfeit  214.8 129.7 129.6    7.5 10.4    142.0
# 5 counterfeit  215.0 129.6 129.7   10.4  7.7    141.8
# 6 counterfeit  215.7 130.8 130.5    9.0 10.1    141.4
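
Before moving on, it can be worth a quick look at the structure and group sizes. A minimal sketch using base functions (note.type is the grouping column visible in the head() output above):

str(data)              # 200 observations of 7 variables
table(data$note.type)  # number of notes per group
dim(data)              # 200 rows, 7 columns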

Standardize the data set

Scaling is critical when performing principal component analysis. PCA looks for the directions of maximum variance, and variables measured on a larger scale have larger variances, so without scaling the analysis is skewed toward high-magnitude variables.

To scale the data, use the scale() function. scale() is a generic function whose default method centers and/or scales the columns of a numeric matrix. If the center argument is set to TRUE, centering is done by subtracting the column means; if scale = TRUE, the (centered) columns are divided by their standard deviations. In the x argument, data[, -1] excludes the first (categorical) variable.

The head() function will print the first six rows of the scaled data.

data.scaled <- scale(x = data[, -1], 
                     center = TRUE, 
                     scale = TRUE)
head(data.scaled)
#          Length      Left      Right     Bottom        Top  Diagonal
# [1,] -0.2549435  2.433346  2.8299417 -0.2890067 -1.1837648 0.4482473
# [2,] -0.7860757 -1.167507 -0.6347880 -0.9120152 -1.4328473 1.0557460
# [3,] -0.2549435 -1.167507 -0.6347880 -0.4966762 -1.3083061 1.4896737
# [4,] -0.2549435 -1.167507 -0.8822687 -1.3273542 -0.3119759 1.3161027
# [5,]  0.2761888 -1.444496 -0.6347880  0.6801176 -3.6745902 1.1425316
# [6,]  2.1351516  1.879368  1.3450576 -0.2890067 -0.6855997 0.7953894
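
As an optional sanity check, the same standardization can be reproduced by hand with colMeans() and apply(); the result should match the output of scale() up to attributes:

X <- as.matrix(data[, -1])
manual <- sweep(X, 2, colMeans(X), "-")           # subtract column means
manual <- sweep(manual, 2, apply(X, 2, sd), "/")  # divide by column SDs
all.equal(manual, data.scaled, check.attributes = FALSE)  # TRUE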

Extraction of components (eigenvalues)

Eigenvalues & eigenvectors

The second step is the extraction of the principal components. The simplest way is to use the prcomp() or princomp() functions from the stats package.

See the other tutorials where these methods have been used and explained: PCA_1 and PCA_2

Here, I shall extract the principal components using the eigen() function from the base package and the prcomp() function from the stats package. First, apply eigen() to the covariance matrix of the data set to obtain the eigenvalues and eigenvectors, and use the print() function to see the results.

e = eigen(cov(data[, -1]))
print(e)
# eigen() decomposition
# $values
# [1] 3.00030487 0.93562052 0.24341371 0.19465874 0.08521185 0.03551468
# 
# $vectors
#             [,1]        [,2]       [,3]       [,4]        [,5]        [,6]
# [1,] -0.04377427  0.01070966 -0.3263165 -0.5616918  0.75257278  0.09809807
# [2,]  0.11216159  0.07144697 -0.2589614 -0.4554588 -0.34680082 -0.76651197
# [3,]  0.13919062  0.06628208 -0.3447327 -0.4153296 -0.53465173  0.63169678
# [4,]  0.76830499 -0.56307225 -0.2180222  0.1861082  0.09996771 -0.02221711
# [5,]  0.20176610  0.65928988 -0.5566857  0.4506985  0.10190229 -0.03485874
# [6,] -0.57890193 -0.48854255 -0.5917628  0.2584483 -0.08445895 -0.04567946

The eigen() function is a base package function that decomposes a matrix into eigenvalues and eigenvectors:

  - Eigenvectors: a matrix whose columns give the directions of the new axes.
  - Eigenvalues: values specifying the amount of variance along each of those directions.

In most cases, the first two components account for most of the variation in the data set.

Use the same function to obtain the eigenvalues and eigenvectors of the scaled data, and print() to display the results.

e.scaled = eigen(cov(data.scaled))
print(e.scaled)
# eigen() decomposition
# $values
# [1] 2.9455582 1.2780838 0.8690326 0.4497687 0.2686769 0.1888799
# 
# $vectors
#              [,1]        [,2]        [,3]       [,4]       [,5]        [,6]
# [1,] -0.006987029  0.81549497 -0.01768066  0.5746173 -0.0587961 -0.03105698
# [2,]  0.467758161  0.34196711  0.10338286 -0.3949225  0.6394961  0.29774768
# [3,]  0.486678705  0.25245860  0.12347472 -0.4302783 -0.6140972 -0.34915294
# [4,]  0.406758327 -0.26622878  0.58353831  0.4036735 -0.2154756  0.46235361
# [5,]  0.367891118 -0.09148667 -0.78757147  0.1102267 -0.2198494  0.41896754
# [6,] -0.493458317  0.27394074  0.11387536 -0.3919305 -0.3401601  0.63179849

You can see the difference between the eigenvalues and eigenvectors of the scaled and non-scaled data. Looking at the $values component of each decomposition, the variance on the first component PC1 is smaller for the scaled data than for the non-scaled data, and this reduction is matched by a slight increase in the variance of the remaining components.
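
To quantify this, compare the share of the total variance carried by each component in the two decompositions (a quick sketch; the next section does this more fully for the scaled data):

e$values / sum(e$values)                # non-scaled: PC1 carries about 67%
e.scaled$values / sum(e.scaled$values)  # scaled: PC1 drops to about 49%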

Proportion of variance

You can also compute the proportions of variance. First, extract the values component of the eigen decomposition of the scaled data with the dollar sign (e.scaled$values) to get the variances. Since the total variance is 6 (six standardized variables, each with unit variance), divide each variance by the total to obtain its proportion.

Apply the cumsum() function to the proportions of variance obtained above to get the cumulative proportions. The results show that the first PC explains 49% of the variation, and the first three PCs together explain a cumulative proportion of roughly 85% of the total variation.

e.scaled$values               # Variances
e.scaled$values/6             # proportion of variances
cumsum(e.scaled$values/6)     # Cumulative proportion
# [1] 2.9455582 1.2780838 0.8690326 0.4497687 0.2686769 0.1888799
# [1] 0.49092637 0.21301396 0.14483876 0.07496145 0.04477948 0.03147998
# [1] 0.4909264 0.7039403 0.8487791 0.9237405 0.9685200 1.0000000
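
The divisor 6 works here because the scaled data set has exactly six unit-variance variables. A more general form that does not hardcode the total variance is:

prop <- e.scaled$values / sum(e.scaled$values)  # proportion of variance
cumsum(prop)                                    # cumulative proportion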

Computation of principal component vectors

Now let’s compute the principal component vectors. For this, multiply the data.scaled matrix by the matrix of eigenvectors of the scaled data (e.scaled$vectors) using the matrix multiplication operator %*%. The head() function will print the first six rows of the component matrix.

data.pc  = as.matrix(data.scaled) %*% e.scaled$vectors 
head(data.pc)
#            [,1]        [,2]       [,3]       [,4]         [,5]        [,6]
# [1,]  1.7430272  1.64669605  1.4201973 -2.7479691  0.003293759 -0.60202200
# [2,] -2.2686248 -0.53744461  0.5313151 -0.6573558 -0.158171742 -0.45654268
# [3,] -2.2717009 -0.10740754  0.7156191 -0.3408384 -0.453880889  0.04532905
# [4,] -2.2778385 -0.08743490 -0.6041176 -0.3918255 -0.282913485  0.05543875
# [5,] -2.6255397  0.03909779  3.1883837  0.4240168 -0.277502895 -0.72026433
# [6,]  0.7565089  3.08101359  0.7845117 -0.5980322  0.192757017  0.10529393
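
A useful check at this point: the variance of each principal component column should equal the corresponding eigenvalue.

apply(data.pc, 2, var)                              # column variances
all.equal(apply(data.pc, 2, var), e.scaled$values)  # TRUE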

Computation of components using the stats package

Let’s compute the same principal components using the stats package. Load it with the require() function (stats is attached by default in most R sessions, so this is mainly for explicitness). The prcomp() function takes x, center and scale. arguments: x specifies the data set, center = TRUE shifts the variables to be zero-centered, and scale. = TRUE scales the variables to unit variance. The principal component scores are stored in the x component of the result, accessed with the dollar sign (pc$x); head(pc$x) prints their first six rows.

require(stats)
pc <- prcomp(x = data[, -1],
             center = TRUE,
             scale. = TRUE)
head(pc$x)
#             PC1         PC2        PC3        PC4          PC5         PC6
# [1,] -1.7430272 -1.64669605 -1.4201973 -2.7479691  0.003293759  0.60202200
# [2,]  2.2686248  0.53744461 -0.5313151 -0.6573558 -0.158171742  0.45654268
# [3,]  2.2717009  0.10740754 -0.7156191 -0.3408384 -0.453880889 -0.04532905
# [4,]  2.2778385  0.08743490  0.6041176 -0.3918255 -0.282913485 -0.05543875
# [5,]  2.6255397 -0.03909779 -3.1883837  0.4240168 -0.277502895  0.72026433
# [6,] -0.7565089 -3.08101359 -0.7845117 -0.5980322  0.192757017 -0.10529393

The PC values are the same as those obtained with the eigen() function, except that the signs of some columns are reversed.
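
The sign of an eigenvector is arbitrary (if v is an eigenvector, so is -v), so this is expected. A quick check that the two routes agree up to sign, and that the squared standard deviations reported by prcomp() are the eigenvalues:

all.equal(abs(unname(data.pc)), abs(unname(pc$x)))  # TRUE
all.equal(pc$sdev^2, e.scaled$values)               # TRUE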

Use the summary() function to print the variance-related measures of the components.

summary(pc)
# Importance of components:
#                           PC1    PC2    PC3     PC4     PC5     PC6
# Standard deviation     1.7163 1.1305 0.9322 0.67065 0.51834 0.43460
# Proportion of Variance 0.4909 0.2130 0.1448 0.07496 0.04478 0.03148
# Cumulative Proportion  0.4909 0.7039 0.8488 0.92374 0.96852 1.00000

Display a screeplot

Now let’s display the variance explained by the PCs using the plot() and screeplot() functions. First I shall use plot() to display the variance (eigenvalue) accounted for by each component. The argument e.scaled$values supplies the eigenvalues to plot. Use character strings for the xlab and ylab arguments, and the main argument to set the title of the plot. You can magnify the labels, axes and title with the cex.lab, cex.axis and cex.main arguments.

You can display the same scree plot for the pc object created earlier. In the screeplot() function, the x argument specifies the pc object, the type argument sets the type of line to draw, and the main argument sets the title of the plot.

You can also plot bars specifying variation in each PC using plot() function for pc object.

par(mfrow=c(1, 3))

plot(e.scaled$values, xlab = "Index", ylab = "Lambda", 
     main = "Scree plot", 
     cex.lab = 1.2, cex.axis = 1.2, cex.main = 1.8)

screeplot(x = pc, type = "line", main = "Scree plot", cex.main = 1.8)

plot(pc, main = "Variance bar plot", cex.main = 1.8)

Most of the variation (about 70%) is due to the first two principal components.
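
This 70% figure comes from the cumulative proportions computed earlier. The number of components needed to reach a chosen threshold can be read off directly, for example:

cumprop <- cumsum(e.scaled$values / sum(e.scaled$values))
min(which(cumprop >= 0.70))  # 2: the first two PCs reach 70%
min(which(cumprop >= 0.80))  # 3: the first three PCs exceed 80%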

Correlation of original variables with PCs

One may be interested in the correlation of the original variables with the PCs. For this, first get the mean deviations of each variable: in the scale() function, simply set the scale argument to FALSE. The head() function will print the first six rows of mean deviations of each variable.

mean.dev <- scale(data[, -1],
                  center = TRUE,
                  scale = FALSE)
head(mean.dev)
#      Length    Left   Right  Bottom     Top Diagonal
# [1,] -0.096  0.8785  1.1435 -0.4175 -0.9505   0.5165
# [2,] -0.296 -0.4215 -0.2565 -1.3175 -1.1505   1.2165
# [3,] -0.096 -0.4215 -0.2565 -0.7175 -1.0505   1.7165
# [4,] -0.096 -0.4215 -0.3565 -1.9175 -0.2505   1.5165
# [5,]  0.104 -0.5215 -0.2565  0.9825 -2.9505   1.3165
# [6,]  0.804  0.6785  0.5435 -0.4175 -0.5505   0.9165

Now let’s compute the PCs for the non-scaled data. Multiply the mean deviations mean.dev by e$vectors using the matrix multiplication operator. The head() function will print the first six rows of the principal components.

mean.dev.pc <- mean.dev %*% e$vectors
head(mean.dev.pc)
#            [,1]       [,2]        [,3]        [,4]        [,5]        [,6]
# [1,] -0.5496481 -0.5063730 -0.27586457 -1.19372812 -1.17050345  0.05836252
# [2,] -2.0186292 -0.6612636  0.50199746  0.01544486 -0.29113718  0.14582450
# [3,] -1.8356755 -1.1753073 -0.04562915  0.18906546 -0.11268124  0.12578824
# [4,] -2.4943672  0.1188916 -0.07652521  0.31613769 -0.08076370  0.07052799
# [5,] -0.7013228 -3.1947667  0.83877387 -0.52104953  0.08262773  0.26879336
# [6,] -0.8458460 -0.4824940 -0.77029691 -1.07530245 -0.09605941 -0.11128017

To see the correlation between the PCs and the original variables, apply the cor() function. First combine the PCs with the data variables using the cbind() function before applying the correlation function. Printing the first rows of the r.cor object with head() shows the correlation matrix of the PCs and data variables together.

r.cor <- cor(cbind(mean.dev.pc, data[, -1]))
head(r.cor)
#               1             2             3             4             5
# 1  1.000000e+00 -7.014633e-18 -4.469949e-16 -7.976627e-16 -1.020254e-15
# 2 -7.014633e-18  1.000000e+00  1.983991e-16 -5.252503e-16 -1.038670e-16
# 3 -4.469949e-16  1.983991e-16  1.000000e+00  4.464784e-16 -1.835115e-16
# 4 -7.976627e-16 -5.252503e-16  4.464784e-16  1.000000e+00 -1.130514e-15
# 5 -1.020254e-15 -1.038670e-16 -1.835115e-16 -1.130514e-15  1.000000e+00
# 6 -6.521526e-16  5.101457e-16 -1.808638e-15 -1.357142e-15 -2.570084e-15
#               6      Length       Left      Right       Bottom          Top
# 1 -6.521526e-16 -0.20136050  0.5381321  0.5966697  0.921229423  0.435255426
# 2  5.101457e-16  0.02751049  0.1914237  0.1586673 -0.377020918  0.794217722
# 3 -1.808638e-15 -0.42754733 -0.3538910 -0.4209169 -0.074460283 -0.342054932
# 4 -1.357142e-15 -0.65812392 -0.5566063 -0.4534936  0.056839988  0.247648882
# 5 -2.570084e-15  0.58340637 -0.2804091 -0.3862445  0.020200457  0.037046504
# 6  1.000000e+00  0.04909498 -0.4001151  0.2946144 -0.002898297 -0.008181424
#       Diagonal
# 1 -0.870231992
# 2 -0.410109295
# 3 -0.253377217
# 4  0.098959622
# 5 -0.021396513
# 6 -0.007470889

Now let’s pick out the correlations of the variables with the first two important PCs. In the r.cor object above, rows (and columns) 7 to 12 correspond to the original variables, while 1 to 6 correspond to the PCs. So in the r1 object below, rows 7 to 12 select the variables and columns 1 to 2 select the first two PCs. The print() function for r1 shows the correlation matrix of the first two PCs with the data variables.

r1 = r.cor[7:12, 1:2]
print(r1)
#                   1           2
# Length   -0.2013605  0.02751049
# Left      0.5381321  0.19142371
# Right     0.5966697  0.15866725
# Bottom    0.9212294 -0.37702092
# Top       0.4352554  0.79421772
# Diagonal -0.8702320 -0.41010930
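
The same matrix can be obtained more directly by correlating the original variables with the first two PC score columns, which skips the cbind() step:

r1.direct <- cor(data[, -1], mean.dev.pc[, 1:2])
all.equal(unname(r1.direct), unname(r1))  # TRUE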

Displaying correlation of original variables with PCs

Set the graphical parameters: pty = "s" generates a square plotting region. The ucircle object provides the coordinate points of a unit circle, which the plot() function then draws. Set the values for the type, lty, col and lwd arguments.

To display the main title and the X and Y labels, use the main, xlab and ylab arguments. Add horizontal and vertical lines through zero using the abline() function. Place the variable labels with the text() function: the x argument specifies the correlation matrix of the first two PCs, the labels argument lists the variable names in quotes using the c() function, and the cex argument sets the size of the variable names.

par(pty = "s")

ucircle = cbind(cos((0:360)/180*pi), sin((0:360)/180*pi))

plot(ucircle, 
     type= "l", lty= "solid", col = "blue", lwd = 2, 
     xlab = "First PC", ylab = "Second PC", main = "Unit variance circle")
abline(h = 0.0, v = 0.0, lty = 2)

text(x = r1, 
     labels = c("Length", "Left", "Right", "Bottom", "Top", "Diagonal"), 
     cex = 1.2)
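
Optionally, you can also draw each variable's correlation pair as an arrow from the origin, which makes the plot easier to read (a small embellishment, not part of the original figure):

arrows(x0 = 0, y0 = 0, x1 = r1[, 1], y1 = r1[, 2],
       length = 0.08, col = "red")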

The figure shows that the variables top, bottom and diagonal lie near the periphery of the circle, so they are well explained by the first two PCs. PC1 is essentially the difference between the bottom frame variable and the diagonal, while the second PC is best described by the difference between the top frame variable and the sum of the bottom frame and diagonal variables. The percentage of the variance of the length, left and right variables explained by the first two PCs is relatively small.

Plotting principal components

Now let’s plot the principal components. In the plot() function, the x argument specifies PC1 (data.pc[, 1]) and the y argument specifies PC2 (data.pc[, 2]). The plotting symbols are set with the pch argument: pch = c(rep(3, 100), rep(1, 100)) draws plus signs (symbol 3) for the first 100 rows and circles (symbol 1) for the last 100, the intention being that plus signs mark the genuine notes and circles the counterfeit ones. Check that this ordering matches the note.type column of your file and adjust the rep() counts if necessary. You can further customize the plot by setting the col, xlab/ylab and cex arguments.

par(pty = "s")
plot(x = data.pc[, 1], y = data.pc[, 2], 
     pch = c(rep(3, 100), rep(1, 100)),
     col = c(rep("blue", 100), rep("red", 100)), 
     xlab = "PC1", ylab = "PC2", main = "First vs. Second PC",
     cex.lab = 1.2, cex.axis = 1.2)
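
A legend makes the symbol coding explicit. The labels below assume the first 100 rows are the genuine notes, matching the rep() calls above; swap them if your file is ordered the other way:

legend("topright",
       legend = c("Genuine", "Counterfeit"),
       pch = c(3, 1), col = c("blue", "red"))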

Looking at both plots together, one group of banknotes is roughly characterized by large values of the bottom and top variables and smaller values of the diagonal variable, while the other group shows the opposite pattern. In the classic Swiss banknotes data, the notes with the larger bottom margins and the smaller diagonal are the counterfeit ones.

To plot the second versus the third PC, just replace the values 1 and 2 with 2 and 3 in the x and y arguments of the plot() call above, as shown below.
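
Concretely, the second-versus-third plot is:

plot(x = data.pc[, 2], y = data.pc[, 3], 
     pch = c(rep(3, 100), rep(1, 100)),
     col = c(rep("blue", 100), rep("red", 100)), 
     xlab = "PC2", ylab = "PC3", main = "Second vs. Third PC",
     cex.lab = 1.2, cex.axis = 1.2)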

You can also plot the first two PCs using biplot() from the stats package. The x argument specifies the pc object created earlier, and the choices argument gives the indices of the PCs you want to draw. Similarly, if you want to draw the second and third PCs, just change the values in the choices argument.

require(stats)
biplot(pc, choices = 1:2)
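
For example, choices = 2:3 draws the second and third PCs, and the xlabs argument of biplot() replaces the default row-number labels with your own, such as the note.type column:

biplot(pc, choices = 2:3)                     # second vs. third PC
biplot(pc, choices = 1:2, xlabs = data[, 1])  # label points by note type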

Please comment below if you have any questions.


Download data file — Click_here

Download Rscript — Click_here


Download R program — Click_here

Download R studio — Click_here


Comments

  1. Thank you for this tutorial. For the final plot, how do I replace the black numbers with values I have in the first column of the data?

    Reply: By default biplot() labels the points with the row names of the data set, so one option is to make the first column the row names. Simpler still, pass the labels directly through the xlabs argument of biplot():
      biplot(pc, choices = 1:2, xlabs = data[, 1])

      Hope this will work
