Principal component analysis in R
AGRON Stats
May 26, 2020
Introduction
The principal aim of principal component analysis (PCA) is dimension reduction. Data sets often consist of many variables; for example, projects involving soil horizon data may contain more than a hundred variables. It is difficult to graphically inspect the main structure of such a multivariate data set, so we need to find components that express as much of the inherent variability of the complete data set as possible.
PCA will generate as many components as there are variables.
However, the majority of the inherent information in the data set is generally included in the first few components. Taking the first few components results in a considerable reduction of the dimension of the data set.
PCA is often used as a first step for further multivariate data analysis procedures like:
- Cluster analysis
- Multiple regression
- Discriminant analysis
PCA is based on the correlation or covariance matrix.
Let’s start this analysis in RStudio. The example data set used here contains measurements of Swiss banknotes. It consists of 200 measurements: the first half are from genuine banknotes and the other half from counterfeit banknotes. The values were measured in millimeters.
Before importing the data set, I often recommend clearing all objects and values in the global environment with rm(list = ls(all = TRUE)). You can also clear the plots with graphics.off() and clear the console with the shell() function (e.g. shell("cls") on Windows).
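A minimal sketch of this housekeeping step (note that shell("cls") is Windows-specific; cat("\014") is a common alternative in RStudio):

```r
rm(list = ls(all = TRUE))  # remove all objects from the global environment
graphics.off()             # close all open graphics devices
# shell("cls")             # clear the console on Windows
```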
Import data set
Now let’s import the data set using the read.csv() function. I have already saved the data file as a CSV (comma-delimited) file in the working directory. The file argument specifies the file name with the .csv extension. The header argument takes a logical value indicating whether the first row of the file contains the variable names. My data file does contain the variable names in the first row, so I shall use TRUE for this argument. The head() function will print the first six rows of the data set.
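A runnable sketch of the import step (the file name banknotes.csv and the one-row stand-in file are assumptions; point read.csv() at your own data file):

```r
# Write a one-row stand-in CSV so this sketch runs on its own;
# replace "banknotes.csv" with your real data file
write.csv(data.frame(note.type = "counterfeit", Length = 214.8, Left = 131.0,
                     Right = 131.1, Bottom = 9.0, Top = 9.7, Diagonal = 141.0),
          "banknotes.csv", row.names = FALSE)

data <- read.csv(file = "banknotes.csv", header = TRUE)
head(data)  # print the first six rows
```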
# note.type Length Left Right Bottom Top Diagonal
# 1 counterfeit 214.8 131.0 131.1 9.0 9.7 141.0
# 2 counterfeit 214.6 129.7 129.7 8.1 9.5 141.7
# 3 counterfeit 214.8 129.7 129.7 8.7 9.6 142.2
# 4 counterfeit 214.8 129.7 129.6 7.5 10.4 142.0
# 5 counterfeit 215.0 129.6 129.7 10.4 7.7 141.8
# 6 counterfeit 215.7 130.8 130.5 9.0 10.1 141.4
Standardize the data set
Scaling is critical when performing principal component analysis. PCA seeks the directions of maximum variance, and variance is larger for variables measured on larger scales, which skews the PCA towards high-magnitude features.
To scale the data, use the scale() function. scale() is a generic function whose default method centers and/or scales the columns of a numeric matrix. If the center argument is set to TRUE, centering is done by subtracting the column means (giving mean deviations). If scale = TRUE, the (centered) columns are divided by their standard deviations. Passing data[, -1] as the x argument excludes the first (categorical) variable. The head() function will print the first six rows of the scaled data.
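The scaling step can be sketched as follows (the random stand-in data frame is an assumption so the snippet runs on its own; with the real data, data is the imported banknote data frame):

```r
# Stand-in for the imported banknote data frame
set.seed(1)
data <- data.frame(note.type = rep(c("counterfeit", "genuine"), each = 100),
                   matrix(rnorm(200 * 6), ncol = 6,
                          dimnames = list(NULL, c("Length", "Left", "Right",
                                                  "Bottom", "Top", "Diagonal"))))
# Exclude the categorical first column, then center and scale
data.scaled <- scale(x = data[, -1], center = TRUE, scale = TRUE)
head(data.scaled)
```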
# Length Left Right Bottom Top Diagonal
# [1,] -0.2549435 2.433346 2.8299417 -0.2890067 -1.1837648 0.4482473
# [2,] -0.7860757 -1.167507 -0.6347880 -0.9120152 -1.4328473 1.0557460
# [3,] -0.2549435 -1.167507 -0.6347880 -0.4966762 -1.3083061 1.4896737
# [4,] -0.2549435 -1.167507 -0.8822687 -1.3273542 -0.3119759 1.3161027
# [5,] 0.2761888 -1.444496 -0.6347880 0.6801176 -3.6745902 1.1425316
# [6,] 2.1351516 1.879368 1.3450576 -0.2890067 -0.6855997 0.7953894
Extraction of components (eigenvalues)
Eigenvalues & eigenvectors
The second step is the extraction of the principal components. The simplest way is to use the prcomp() or princomp() functions from the stats package. See other tutorials where these methods have been used and explained — PCA_1 and PCA_2.
Here, I shall extract principal components using the eigen() function from the base package and the prcomp() function from the stats package. First, apply eigen() to the covariance matrix of the data set to calculate the eigenvalues and eigenvectors, and use the print() function to see the results.
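This step can be sketched on stand-in data as follows (with the real data, replace X with data[, -1]):

```r
set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6)  # stand-in for the six measurements
e <- eigen(cov(X))                     # decompose the covariance matrix
print(e)
```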
# eigen() decomposition
# $values
# [1] 3.00030487 0.93562052 0.24341371 0.19465874 0.08521185 0.03551468
#
# $vectors
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] -0.04377427 0.01070966 -0.3263165 -0.5616918 0.75257278 0.09809807
# [2,] 0.11216159 0.07144697 -0.2589614 -0.4554588 -0.34680082 -0.76651197
# [3,] 0.13919062 0.06628208 -0.3447327 -0.4153296 -0.53465173 0.63169678
# [4,] 0.76830499 -0.56307225 -0.2180222 0.1861082 0.09996771 -0.02221711
# [5,] 0.20176610 0.65928988 -0.5566857 0.4506985 0.10190229 -0.03485874
# [6,] -0.57890193 -0.48854255 -0.5917628 0.2584483 -0.08445895 -0.04567946
The eigen() function is a base package function that decomposes a square matrix into eigenvalues and eigenvectors.
Eigenvector — a matrix of values that represents a direction.
Eigenvalue — a value specifying the amount of variance in that direction.
In most cases the first two components account for most of the variation in the data set.
Use the same function to get the eigenvalues and eigenvectors of the scaled data; the print() function will display the results.
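A runnable sketch on the same stand-in data (with the real data, data.scaled comes from the scaling step above):

```r
set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6)  # stand-in for data[, -1]
data.scaled <- scale(X)
e.scaled <- eigen(cov(data.scaled))    # same as eigen(cor(X))
print(e.scaled)
```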
# eigen() decomposition
# $values
# [1] 2.9455582 1.2780838 0.8690326 0.4497687 0.2686769 0.1888799
#
# $vectors
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] -0.006987029 0.81549497 -0.01768066 0.5746173 -0.0587961 -0.03105698
# [2,] 0.467758161 0.34196711 0.10338286 -0.3949225 0.6394961 0.29774768
# [3,] 0.486678705 0.25245860 0.12347472 -0.4302783 -0.6140972 -0.34915294
# [4,] 0.406758327 -0.26622878 0.58353831 0.4036735 -0.2154756 0.46235361
# [5,] 0.367891118 -0.09148667 -0.78757147 0.1102267 -0.2198494 0.41896754
# [6,] -0.493458317 0.27394074 0.11387536 -0.3919305 -0.3401601 0.63179849
You can see the difference between the eigenvalues and eigenvectors of the scaled and non-scaled data. If you look at the $values component of each eigen decomposition, the variance of the first component (PC1) is smaller for the scaled data than for the non-scaled data, with a correspondingly slight increase in the variance of the remaining components.
Proportion of variance
You can also compute the proportions of variance. First, extract the values component of the eigen decomposition of the scaled data with the dollar sign (e.scaled$values) to get the variances. Since the total variance is 6 (six standardized variables, each with unit variance), divide each eigenvalue by 6 to get its proportion of variance.
Apply the cumsum() function to these proportions to get the cumulative proportions. The results show that the first PC explains 49% of the variation, and the first three PCs together explain a cumulative 85% of the total variation.
e.scaled$values # Variances
e.scaled$values/6 # proportion of variances
cumsum(e.scaled$values/6) # Cumulative proportion
# [1] 2.9455582 1.2780838 0.8690326 0.4497687 0.2686769 0.1888799
# [1] 0.49092637 0.21301396 0.14483876 0.07496145 0.04477948 0.03147998
# [1] 0.4909264 0.7039403 0.8487791 0.9237405 0.9685200 1.0000000
Computation of principal component vectors
Now let’s compute the principal component scores. Multiply the data.scaled matrix by the matrix of eigenvectors of the scaled data (e.scaled$vectors) using the matrix multiplication operator %*%. The head() function will print the first six rows of the score matrix.
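The multiplication can be sketched as follows (stand-in data again, so the snippet runs on its own):

```r
set.seed(1)
data.scaled <- scale(matrix(rnorm(200 * 6), ncol = 6))  # stand-in data
e.scaled <- eigen(cov(data.scaled))
data.pc <- data.scaled %*% e.scaled$vectors  # principal component scores
head(data.pc)
```

The variance of each score column equals the corresponding eigenvalue, which is a quick sanity check on the computation.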
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1.7430272 1.64669605 1.4201973 -2.7479691 0.003293759 -0.60202200
# [2,] -2.2686248 -0.53744461 0.5313151 -0.6573558 -0.158171742 -0.45654268
# [3,] -2.2717009 -0.10740754 0.7156191 -0.3408384 -0.453880889 0.04532905
# [4,] -2.2778385 -0.08743490 -0.6041176 -0.3918255 -0.282913485 0.05543875
# [5,] -2.6255397 0.03909779 3.1883837 0.4240168 -0.277502895 -0.72026433
# [6,] 0.7565089 3.08101359 0.7845117 -0.5980322 0.192757017 0.10529393
Computation of vectors from stats package
Let’s compute the same principal components using the stats package. Load the package with the require() function. The prcomp() function takes x, center and scale. arguments. The x argument specifies the data set. Setting center = TRUE shifts the variables to be zero-centered, and scale. = TRUE scales the variables to have unit variance. The head() function will print the first six rows of the principal component scores: extract the x component of the pc object with the dollar sign (pc$x).
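A sketch of the prcomp() call (stand-in data; with the real data, pass data[, -1] as x):

```r
require(stats)  # stats is normally attached by default
set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6)  # stand-in for data[, -1]
pc <- prcomp(x = X, center = TRUE, scale. = TRUE)
head(pc$x)      # first six rows of the PC scores
```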
# PC1 PC2 PC3 PC4 PC5 PC6
# [1,] -1.7430272 -1.64669605 -1.4201973 -2.7479691 0.003293759 0.60202200
# [2,] 2.2686248 0.53744461 -0.5313151 -0.6573558 -0.158171742 0.45654268
# [3,] 2.2717009 0.10740754 -0.7156191 -0.3408384 -0.453880889 -0.04532905
# [4,] 2.2778385 0.08743490 0.6041176 -0.3918255 -0.282913485 -0.05543875
# [5,] 2.6255397 -0.03909779 -3.1883837 0.4240168 -0.277502895 0.72026433
# [6,] -0.7565089 -3.08101359 -0.7845117 -0.5980322 0.192757017 -0.10529393
The PC values are the same as those obtained with eigen(); however, the signs of some components are reversed, since the sign of an eigenvector is arbitrary.
Use the summary() function to print variance-related measures of the components.
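For example (on the same kind of stand-in data):

```r
set.seed(1)
pc <- prcomp(matrix(rnorm(200 * 6), ncol = 6), center = TRUE, scale. = TRUE)
summary(pc)  # standard deviation, proportion and cumulative proportion
```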
# Importance of components:
# PC1 PC2 PC3 PC4 PC5 PC6
# Standard deviation 1.7163 1.1305 0.9322 0.67065 0.51834 0.43460
# Proportion of Variance 0.4909 0.2130 0.1448 0.07496 0.04478 0.03148
# Cumulative Proportion 0.4909 0.7039 0.8488 0.92374 0.96852 1.00000
Display a screeplot
Now let’s display the variance explained by the PCs using the plot() and screeplot() functions. First, I shall use plot() to display the variance (eigenvalue) accounted for by each component: the e.scaled$values argument supplies the eigenvalues to plot. Use character strings for the xlab and ylab arguments, and the main argument to set the title of the plot. You can magnify the labels, axes and title with the cex.lab, cex.axis and cex.main arguments.
You can display the same scree plot for the pc object created earlier. Use the screeplot() function, where the x argument specifies the pc object. Set the type argument to specify the kind of line you want to draw, and the main argument for the title. You can also plot bars showing the variation in each PC by calling plot() on the pc object.
par(mfrow = c(1, 3))
plot(e.scaled$values, xlab = "Index", ylab = "Lambda",
     main = "Scree plot",
     cex.lab = 1.2, cex.axis = 1.2, cex.main = 1.8)
screeplot(x = pc, type = "line", main = "Scree plot", cex.main = 1.8)
plot(pc, main = "Variance bar plot", cex.main = 1.8)
Most of the variation (about 70%) is due to the first two principal components.
Correlation of original variables with PCs
One may be interested in the correlation of the original variables with the PCs. For this, first get the mean deviations of each variable: call the scale() function with FALSE for the scale argument to center without scaling. The head() function will print the first six rows of the mean deviations of each variable.
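The centering-only call looks like this (stand-in data; with the real data, pass data[, -1]):

```r
set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6)               # stand-in for data[, -1]
mean.dev <- scale(X, center = TRUE, scale = FALSE)  # center only, no scaling
head(mean.dev)
```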
# Length Left Right Bottom Top Diagonal
# [1,] -0.096 0.8785 1.1435 -0.4175 -0.9505 0.5165
# [2,] -0.296 -0.4215 -0.2565 -1.3175 -1.1505 1.2165
# [3,] -0.096 -0.4215 -0.2565 -0.7175 -1.0505 1.7165
# [4,] -0.096 -0.4215 -0.3565 -1.9175 -0.2505 1.5165
# [5,] 0.104 -0.5215 -0.2565 0.9825 -2.9505 1.3165
# [6,] 0.804 0.6785 0.5435 -0.4175 -0.5505 0.9165
Now let’s compute the PCs for the non-scaled data. Multiply the mean deviations mean.dev by e$vectors using the matrix multiplication operator %*%. The head() function will print the first six rows of the principal components.
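A runnable sketch of this step (stand-in data; e is the eigen decomposition of the raw covariance matrix from earlier):

```r
set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6)  # stand-in for data[, -1]
mean.dev <- scale(X, center = TRUE, scale = FALSE)
e <- eigen(cov(X))
data.pc <- mean.dev %*% e$vectors      # PC scores for the non-scaled data
head(data.pc)
```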
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] -0.5496481 -0.5063730 -0.27586457 -1.19372812 -1.17050345 0.05836252
# [2,] -2.0186292 -0.6612636 0.50199746 0.01544486 -0.29113718 0.14582450
# [3,] -1.8356755 -1.1753073 -0.04562915 0.18906546 -0.11268124 0.12578824
# [4,] -2.4943672 0.1188916 -0.07652521 0.31613769 -0.08076370 0.07052799
# [5,] -0.7013228 -3.1947667 0.83877387 -0.52104953 0.08262773 0.26879336
# [6,] -0.8458460 -0.4824940 -0.77029691 -1.07530245 -0.09605941 -0.11128017
To see the correlation between the PCs and the original variables, apply the cor() function. First combine the PCs with the data variables using the cbind() function, then apply cor(). Printing the r.cor object gives the correlation matrix of the PCs and the data variables together.
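The combined correlation matrix can be sketched as follows (stand-in data with the article's variable names):

```r
set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6,
            dimnames = list(NULL, c("Length", "Left", "Right",
                                    "Bottom", "Top", "Diagonal")))
data.pc <- scale(X, scale = FALSE) %*% eigen(cov(X))$vectors
r.cor <- cor(cbind(data.pc, X))  # PCs in columns 1-6, variables in 7-12
print(r.cor)
```

Note that the PC-by-PC block of this matrix is (numerically) the identity, since the principal components are mutually uncorrelated.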
# 1 2 3 4 5
# 1 1.000000e+00 -7.014633e-18 -4.469949e-16 -7.976627e-16 -1.020254e-15
# 2 -7.014633e-18 1.000000e+00 1.983991e-16 -5.252503e-16 -1.038670e-16
# 3 -4.469949e-16 1.983991e-16 1.000000e+00 4.464784e-16 -1.835115e-16
# 4 -7.976627e-16 -5.252503e-16 4.464784e-16 1.000000e+00 -1.130514e-15
# 5 -1.020254e-15 -1.038670e-16 -1.835115e-16 -1.130514e-15 1.000000e+00
# 6 -6.521526e-16 5.101457e-16 -1.808638e-15 -1.357142e-15 -2.570084e-15
# 6 Length Left Right Bottom Top
# 1 -6.521526e-16 -0.20136050 0.5381321 0.5966697 0.921229423 0.435255426
# 2 5.101457e-16 0.02751049 0.1914237 0.1586673 -0.377020918 0.794217722
# 3 -1.808638e-15 -0.42754733 -0.3538910 -0.4209169 -0.074460283 -0.342054932
# 4 -1.357142e-15 -0.65812392 -0.5566063 -0.4534936 0.056839988 0.247648882
# 5 -2.570084e-15 0.58340637 -0.2804091 -0.3862445 0.020200457 0.037046504
# 6 1.000000e+00 0.04909498 -0.4001151 0.2946144 -0.002898297 -0.008181424
# Diagonal
# 1 -0.870231992
# 2 -0.410109295
# 3 -0.253377217
# 4 0.098959622
# 5 -0.021396513
# 6 -0.007470889
Now let’s select the correlations of the variables with the first two important PCs. In the r.cor object above, the variables appear as entries 7 to 12 for each PC. In the r1 object below, rows seven to twelve select the variables and columns one to two select the first two PCs. The print() function for the r1 object gives the correlation matrix of the first two PCs with the data variables.
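The subsetting step looks like this (self-contained with stand-in data, so the indices can be checked):

```r
set.seed(1)
X <- matrix(rnorm(200 * 6), ncol = 6,
            dimnames = list(NULL, c("Length", "Left", "Right",
                                    "Bottom", "Top", "Diagonal")))
r.cor <- cor(cbind(scale(X, scale = FALSE) %*% eigen(cov(X))$vectors, X))
r1 <- r.cor[7:12, 1:2]  # variables (rows 7-12) vs first two PCs (columns 1-2)
print(r1)
```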
# 1 2
# Length -0.2013605 0.02751049
# Left 0.5381321 0.19142371
# Right 0.5966697 0.15866725
# Bottom 0.9212294 -0.37702092
# Top 0.4352554 0.79421772
# Diagonal -0.8702320 -0.41010930
Displaying correlation of original variables with PCs
Set the graphical parameters: the character string "s" for the pty argument generates a square plotting region. The ucircle object provides the coordinate points of a unit circle, and the plot() function displays it; set the values of the type, lty, col and lwd arguments.
Use the main, xlab and ylab arguments for the title and axis labels. Add horizontal and vertical lines at zero with the abline() function. Place the variable labels with the text() function: the x argument specifies the correlation matrix of the first two PCs, the label argument lists the variable names (combined with the c() function), and the cex argument sets the size of the variable names.
par(pty = "s")
ucircle = cbind(cos((0:360)/180*pi), sin((0:360)/180*pi))
plot(ucircle,
     type = "l", lty = "solid", col = "blue", lwd = 2,
     xlab = "First PC", ylab = "Second PC", main = "Unit variance circle")
abline(h = 0.0, v = 0.0, lty = 2)
text(x = r1,
     label = c("Length", "Left", "Right", "Bottom", "Top", "Diagonal"),
     cex = 1.2)
The figure shows that the variables top, bottom and diagonal correspond to correlations near the periphery of the circle. These variables are well explained by the first two PCs. PC1 is essentially the difference between the bottom frame variable and the diagonal. The second PC is best described by the difference between the top frame variable and the sum of bottom frame and diagonal variables. The percentage of variance of length, left and right variable explained by the first two PCs is relatively small.
Plotting principal components
Now let’s plot the principal components. In the plot() function, the x argument specifies PC1 (data.pc[, 1]) and the y argument specifies PC2 (data.pc[, 2]). Add points or symbols with the pch argument: the values three and one give the symbol types for the genuine and counterfeit banknotes respectively. The value 100 in the rep() function means the first hundred measurements in the data set are drawn with symbol type three, plus signs, representing the genuine banknotes; the second hundred, the counterfeit banknotes, are drawn with symbol type one, circles. You can fully customize the plot by setting the values of the col, labels and cex arguments.
par(pty = "s")
plot(x = data.pc[, 1], y = data.pc[, 2],
pch = c(rep(3, 100), rep(1, 100)),
col = c(rep("blue", 100), rep("red", 100)),
xlab = "PC1", ylab = "PC2", main = "First vs. Second PC",
cex.lab = 1.2, cex.axis = 1.2)
If we look at both plots simultaneously, the genuine banknotes are roughly characterized by large values of the bottom variable and smaller values of the diagonal variable, while the counterfeit banknotes show larger values of the top variable and smaller values of the diagonal variable.
To plot the second versus the third PC, just replace the values 1 and 2 with 2 and 3 for the x and y arguments in the plot() call above.
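For instance (stand-in scores so the sketch runs on its own; with the real data, use your data.pc matrix):

```r
set.seed(1)
data.pc <- prcomp(matrix(rnorm(200 * 6), ncol = 6), scale. = TRUE)$x
plot(x = data.pc[, 2], y = data.pc[, 3],
     pch = c(rep(3, 100), rep(1, 100)),
     col = c(rep("blue", 100), rep("red", 100)),
     xlab = "PC2", ylab = "PC3", main = "Second vs. Third PC")
```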
You can also plot the first two PCs using the biplot() function from the stats package. The x argument specifies the pc object created earlier. In the choices argument, give the indices of the PCs you want to draw; to draw the second and third PCs instead, just change the values in choices.
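The two biplot() calls can be sketched as (stand-in data again):

```r
set.seed(1)
pc <- prcomp(matrix(rnorm(200 * 6), ncol = 6), center = TRUE, scale. = TRUE)
biplot(x = pc, choices = c(1, 2))  # first and second PCs
biplot(x = pc, choices = c(2, 3))  # second and third PCs
```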
Please comment below if you have any questions.
Download data file — Click_here
Download Rscript — Click_here
Download R program — Click_here
Download R studio — Click_here