### Principal component analysis in R


#### AGRON Stats

#### May 26, 2020

### Introduction

The principal aim of principal component analysis is *dimension reduction*. A data set often consists of many variables; for example, projects involving soil horizon data can contain more than a hundred. It is difficult to graphically inspect the main structure of such a multivariate data set, so we need to find components that express as much of the inherent variability of the complete data set as possible.

`PCA will generate as many components as there are variables.`

However, the majority of the inherent information in the data set is generally included in the first few components. Taking the first few components results in a considerable reduction of the dimension of the data set.

PCA is often used as a first step for further multivariate data analysis procedures like:

- Cluster analysis
- Multiple regression
- Discriminant analysis

PCA is based on the **correlation** or **covariance** matrix.

Let’s start this analysis in RStudio. The example data set used here is obtained from measurements of Swiss banknotes. It consists of 200 measurements: the first half are from *genuine* banknotes and the other half from *counterfeit* banknotes. The values were measured in millimeters.

Before importing the data set, I often recommend clearing all the objects in the global environment using `rm(list = ls(all = TRUE))`. You can also close all the plots using `graphics.off()` and clear the console with `cat("\014")` (or `shell("cls")` on Windows).
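As a sketch, these housekeeping calls can be run together; note that I use `cat("\014")` here to clear the console, which sends a form-feed character and works in RStudio:

```r
# Clear the workspace before importing the data
rm(list = ls(all = TRUE))  # remove all objects from the global environment
graphics.off()             # close all open plot devices
cat("\014")                # clear the RStudio console (form feed)
```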

### Import data set

Now let’s import the data set using the `read.csv()` function. I have already saved the data file as a CSV (comma-delimited) file in the working directory. The `file` argument specifies the file name with the **.csv** extension. The `header` argument takes a logical value indicating whether the first row of the file contains the variable names. My file does contain the variable names in the first row, so I shall use `TRUE` for this argument. The `head()` function will print the first six rows of the data set.
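A minimal sketch of this step; the file name `swiss_banknotes.csv` is an assumption, so replace it with the name of your own file:

```r
# Import the CSV file from the working directory; the file name is hypothetical
data = read.csv(file = "swiss_banknotes.csv",  # CSV file with extension
                header = TRUE)                 # first row holds variable names
head(data)  # print the first six rows
```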

```
# note.type Length Left Right Bottom Top Diagonal
# 1 counterfeit 214.8 131.0 131.1 9.0 9.7 141.0
# 2 counterfeit 214.6 129.7 129.7 8.1 9.5 141.7
# 3 counterfeit 214.8 129.7 129.7 8.7 9.6 142.2
# 4 counterfeit 214.8 129.7 129.6 7.5 10.4 142.0
# 5 counterfeit 215.0 129.6 129.7 10.4 7.7 141.8
# 6 counterfeit 215.7 130.8 130.5 9.0 10.1 141.4
```

### Standardize the data set

Scaling is critical when performing principal component analysis. PCA seeks the directions of maximum variance, and the variance is higher for features of high magnitude, which skews the PCA towards high-magnitude features.

To scale the data, use the `scale()` function. `scale()` is a generic function whose default method centers and/or scales the columns of a numeric matrix. If the `center` argument is set to `TRUE`, centering is done by subtracting the column means from the corresponding columns. If `scale = TRUE`, scaling is done by dividing the (centered) columns of the data by their standard deviations. Passing `data[, -1]` to the `x` argument excludes the first (categorical) variable. The `head()` function will print the first six rows of the scaled data.
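Assuming the data frame imported above is named `data`, with the note type in the first column, the scaling step might look like:

```r
# Standardize all numeric variables; drop column 1 (the categorical variable)
data.scaled = scale(x = data[, -1], center = TRUE, scale = TRUE)
head(data.scaled)  # first six rows of the scaled data
```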

```
# Length Left Right Bottom Top Diagonal
# [1,] -0.2549435 2.433346 2.8299417 -0.2890067 -1.1837648 0.4482473
# [2,] -0.7860757 -1.167507 -0.6347880 -0.9120152 -1.4328473 1.0557460
# [3,] -0.2549435 -1.167507 -0.6347880 -0.4966762 -1.3083061 1.4896737
# [4,] -0.2549435 -1.167507 -0.8822687 -1.3273542 -0.3119759 1.3161027
# [5,] 0.2761888 -1.444496 -0.6347880 0.6801176 -3.6745902 1.1425316
# [6,] 2.1351516 1.879368 1.3450576 -0.2890067 -0.6855997 0.7953894
```

### Extraction of components (eigenvalues)

#### Eigen values & vectors

The second step is the extraction of the principal components. The simplest way is to use the `prcomp()` or `princomp()` functions from the `stats` package.

See other tutorials where these methods have been used and explained: `PCA_1` and `PCA_2`

Here, I shall extract the principal components by using the `eigen()` function from the **base** package and the `prcomp()` function from the **stats** package. First I shall use the `eigen()` function to calculate the eigenvalues and eigenvectors: apply it to the covariance matrix of the data set, and use the `print()` function to see the results.
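A sketch of this step, assuming the imported data frame is named `data` with the categorical variable in column 1:

```r
# Eigen decomposition of the covariance matrix of the raw measurements
e = eigen(cov(data[, -1]))
print(e)  # $values: eigenvalues, $vectors: eigenvectors
```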

```
# eigen() decomposition
# $values
# [1] 3.00030487 0.93562052 0.24341371 0.19465874 0.08521185 0.03551468
#
# $vectors
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] -0.04377427 0.01070966 -0.3263165 -0.5616918 0.75257278 0.09809807
# [2,] 0.11216159 0.07144697 -0.2589614 -0.4554588 -0.34680082 -0.76651197
# [3,] 0.13919062 0.06628208 -0.3447327 -0.4153296 -0.53465173 0.63169678
# [4,] 0.76830499 -0.56307225 -0.2180222 0.1861082 0.09996771 -0.02221711
# [5,] 0.20176610 0.65928988 -0.5566857 0.4506985 0.10190229 -0.03485874
# [6,] -0.57890193 -0.48854255 -0.5917628 0.2584483 -0.08445895 -0.04567946
```

The `eigen()` function is a base package function that decomposes a matrix into eigenvalues and eigenvectors.

`Eigenvector`: a matrix of values that represents a direction.

`Eigenvalue`: a value specifying the amount of variance in that direction.

In most cases the first two components account for most of the variation in the data set.

Use the same function to get the eigenvalues and eigenvectors of the **scaled** data. The `print()` function will display the results for the scaled data.
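For the scaled data the call is the same, applied to the covariance matrix of `data.scaled` (which equals the correlation matrix of the original variables):

```r
# Eigen decomposition for the scaled data
e.scaled = eigen(cov(data.scaled))
print(e.scaled)
```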

```
# eigen() decomposition
# $values
# [1] 2.9455582 1.2780838 0.8690326 0.4497687 0.2686769 0.1888799
#
# $vectors
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] -0.006987029 0.81549497 -0.01768066 0.5746173 -0.0587961 -0.03105698
# [2,] 0.467758161 0.34196711 0.10338286 -0.3949225 0.6394961 0.29774768
# [3,] 0.486678705 0.25245860 0.12347472 -0.4302783 -0.6140972 -0.34915294
# [4,] 0.406758327 -0.26622878 0.58353831 0.4036735 -0.2154756 0.46235361
# [5,] 0.367891118 -0.09148667 -0.78757147 0.1102267 -0.2198494 0.41896754
# [6,] -0.493458317 0.27394074 0.11387536 -0.3919305 -0.3401601 0.63179849
```

You can see the difference between the eigenvectors and eigenvalues of the scaled and non-scaled data. If you look at the `$values` component of the eigen decomposition, the variation in the first component *PC1* is smaller for the scaled data than for the non-scaled data. This reduction leads to a slight increase in the variation of the remaining components of the scaled data.

#### Proportion of variance

You can also compute the proportion of variance. For this, first extract the `values` component of the eigen decomposition of the scaled data using the dollar sign (`e.scaled$values`) to get the variances. As the total variation is 6 (*six components, each with unit variance*), divide each variance by the total variation to get the proportion of variance. Apply the `cumsum()` function to the proportions of variance obtained earlier to get the cumulative proportions. The results show that the first PC explains 49% of the variation, and the first three PCs together explain a cumulative 85% of the total variation.

```
e.scaled$values # Variances
e.scaled$values/6 # proportion of variances
cumsum(e.scaled$values/6) # Cumulative proportion
```

```
# [1] 2.9455582 1.2780838 0.8690326 0.4497687 0.2686769 0.1888799
# [1] 0.49092637 0.21301396 0.14483876 0.07496145 0.04477948 0.03147998
# [1] 0.4909264 0.7039403 0.8487791 0.9237405 0.9685200 1.0000000
```

#### Computation of principal component vectors

Now let’s compute the principal component vectors. For this, multiply the `data.scaled` matrix by the matrix of scaled eigenvectors (`e.scaled$vectors`) using the matrix multiplication operator `%*%`. The `head()` function will print the first six rows of the component matrix.
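Continuing with the objects defined above; the result name `pc.scores` is my own choice here, since `data.pc` is used later for the non-scaled data:

```r
# Matrix-multiply the scaled data by the scaled eigenvectors
pc.scores = data.scaled %*% e.scaled$vectors  # hypothetical object name
head(pc.scores)  # first six rows of the component matrix
```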

```
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1.7430272 1.64669605 1.4201973 -2.7479691 0.003293759 -0.60202200
# [2,] -2.2686248 -0.53744461 0.5313151 -0.6573558 -0.158171742 -0.45654268
# [3,] -2.2717009 -0.10740754 0.7156191 -0.3408384 -0.453880889 0.04532905
# [4,] -2.2778385 -0.08743490 -0.6041176 -0.3918255 -0.282913485 0.05543875
# [5,] -2.6255397 0.03909779 3.1883837 0.4240168 -0.277502895 -0.72026433
# [6,] 0.7565089 3.08101359 0.7845117 -0.5980322 0.192757017 0.10529393
```

#### Computation of vectors from stats package

Let’s compute the same principal components using the **stats** package. Load this package using the `require()` function. The `prcomp()` function takes the `x`, `center` and `scale.` arguments. The `x` argument specifies the data set. The logical value `TRUE` for the `center` argument shifts the variables to be zero-centered, and the logical value `TRUE` for the `scale.` argument scales the variables to have unit variance. The `head()` function will print the first six rows of the principal components: extract the `x` component from the `pc` object returned by `prcomp()` using the dollar sign, i.e. `head(pc$x)`.
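A sketch of the `prcomp()` call, assuming the imported data frame `data` (note that in `prcomp()` the scaling argument is spelled `scale.`, with a trailing dot):

```r
require(stats)  # stats is attached by default; shown for completeness
pc = prcomp(x = data[, -1], center = TRUE, scale. = TRUE)
head(pc$x)  # first six rows of the principal component scores
```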

```
# PC1 PC2 PC3 PC4 PC5 PC6
# [1,] -1.7430272 -1.64669605 -1.4201973 -2.7479691 0.003293759 0.60202200
# [2,] 2.2686248 0.53744461 -0.5313151 -0.6573558 -0.158171742 0.45654268
# [3,] 2.2717009 0.10740754 -0.7156191 -0.3408384 -0.453880889 -0.04532905
# [4,] 2.2778385 0.08743490 0.6041176 -0.3918255 -0.282913485 -0.05543875
# [5,] 2.6255397 -0.03909779 -3.1883837 0.4240168 -0.277502895 0.72026433
# [6,] -0.7565089 -3.08101359 -0.7845117 -0.5980322 0.192757017 -0.10529393
```

`The PC values are the same as those obtained using the eigen() function, although the signs of some columns are reversed; the sign of an eigenvector is arbitrary.`

Use the `summary()` function to print the variance-related measures of the components.
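Assuming the `pc` object created above:

```r
summary(pc)  # standard deviation, proportion and cumulative proportion
```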

```
# Importance of components:
# PC1 PC2 PC3 PC4 PC5 PC6
# Standard deviation 1.7163 1.1305 0.9322 0.67065 0.51834 0.43460
# Proportion of Variance 0.4909 0.2130 0.1448 0.07496 0.04478 0.03148
# Cumulative Proportion 0.4909 0.7039 0.8488 0.92374 0.96852 1.00000
```

### Display a **screeplot**

Now let’s display the variance explained by the PCs using the `plot()` function and the `screeplot()` function. First I shall use the `plot()` function to display the variances, or eigenvalues, accounted for by each component. The argument `e.scaled$values` passes the eigenvalues so that the variances explained by the PCs are plotted. Use character strings for the `xlab` and `ylab` arguments, and use the `main` argument to set the title of the plot. You can magnify the labels, axes and title using the `cex.lab`, `cex.axis` and `cex.main` arguments.

You can display the same scree plot for the `pc` object created earlier. Use the `screeplot()` function, where the `x` argument specifies the `pc` object. Set the `type` argument to specify which type of line you want to draw, and set the title of the plot using the `main` argument.

You can also plot bars showing the variation in each PC by applying the `plot()` function to the `pc` object.

```
par(mfrow = c(1, 3))
plot(e.scaled$values, xlab = "Index", ylab = "Lambda",
     main = "Scree plot",
     cex.lab = 1.2, cex.axis = 1.2, cex.main = 1.8)
screeplot(x = pc, type = "line", main = "Scree plot", cex.main = 1.8)
plot(pc, main = "Variance bar plot", cex.main = 1.8)
```

`Most of the variation (70%) is due to the first two principal components.`

### Correlation of original variables with PCs

One may be interested in the correlation of the original variables with the PCs. For this, first get the mean deviations of each variable: in the `scale()` function, simply set the `scale` argument to *FALSE* so the columns are only centered. The `head()` function will print the first six rows of the mean deviations of each variable.
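Assuming the data frame `data` from the import step:

```r
# Center the variables without scaling to obtain mean deviations
mean.dev = scale(x = data[, -1], center = TRUE, scale = FALSE)
head(mean.dev)  # first six rows of the mean deviations
```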

```
# Length Left Right Bottom Top Diagonal
# [1,] -0.096 0.8785 1.1435 -0.4175 -0.9505 0.5165
# [2,] -0.296 -0.4215 -0.2565 -1.3175 -1.1505 1.2165
# [3,] -0.096 -0.4215 -0.2565 -0.7175 -1.0505 1.7165
# [4,] -0.096 -0.4215 -0.3565 -1.9175 -0.2505 1.5165
# [5,] 0.104 -0.5215 -0.2565 0.9825 -2.9505 1.3165
# [6,] 0.804 0.6785 0.5435 -0.4175 -0.5505 0.9165
```

Now let’s compute the PCs for the non-scaled data. Multiply the mean deviations `mean.dev` by `e$vectors` using the matrix multiplication operator `%*%`. The `head()` function will print the first six rows of the principal components.
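Continuing with `mean.dev` and the eigen decomposition `e` of the non-scaled covariance matrix:

```r
# Project the centered data onto the eigenvectors of the covariance matrix
data.pc = mean.dev %*% e$vectors
head(data.pc)  # first six rows of the principal components
```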

```
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] -0.5496481 -0.5063730 -0.27586457 -1.19372812 -1.17050345 0.05836252
# [2,] -2.0186292 -0.6612636 0.50199746 0.01544486 -0.29113718 0.14582450
# [3,] -1.8356755 -1.1753073 -0.04562915 0.18906546 -0.11268124 0.12578824
# [4,] -2.4943672 0.1188916 -0.07652521 0.31613769 -0.08076370 0.07052799
# [5,] -0.7013228 -3.1947667 0.83877387 -0.52104953 0.08262773 0.26879336
# [6,] -0.8458460 -0.4824940 -0.77029691 -1.07530245 -0.09605941 -0.11128017
```

To see the correlation between the PCs and the original variables, apply the `cor()` function. First combine the PCs with the data variables using the `cbind()` function before applying the correlation function. Printing the object `r.cor` gives the correlation matrix of the PCs and the data variables in combination.
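A sketch of this step, continuing with `data.pc` and the data frame `data`:

```r
# Correlation matrix of the PCs (columns 1-6) and the variables (columns 7-12)
r.cor = cor(cbind(data.pc, data[, -1]))
print(r.cor)
```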

```
# 1 2 3 4 5
# 1 1.000000e+00 -7.014633e-18 -4.469949e-16 -7.976627e-16 -1.020254e-15
# 2 -7.014633e-18 1.000000e+00 1.983991e-16 -5.252503e-16 -1.038670e-16
# 3 -4.469949e-16 1.983991e-16 1.000000e+00 4.464784e-16 -1.835115e-16
# 4 -7.976627e-16 -5.252503e-16 4.464784e-16 1.000000e+00 -1.130514e-15
# 5 -1.020254e-15 -1.038670e-16 -1.835115e-16 -1.130514e-15 1.000000e+00
# 6 -6.521526e-16 5.101457e-16 -1.808638e-15 -1.357142e-15 -2.570084e-15
# 6 Length Left Right Bottom Top
# 1 -6.521526e-16 -0.20136050 0.5381321 0.5966697 0.921229423 0.435255426
# 2 5.101457e-16 0.02751049 0.1914237 0.1586673 -0.377020918 0.794217722
# 3 -1.808638e-15 -0.42754733 -0.3538910 -0.4209169 -0.074460283 -0.342054932
# 4 -1.357142e-15 -0.65812392 -0.5566063 -0.4534936 0.056839988 0.247648882
# 5 -2.570084e-15 0.58340637 -0.2804091 -0.3862445 0.020200457 0.037046504
# 6 1.000000e+00 0.04909498 -0.4001151 0.2946144 -0.002898297 -0.008181424
# Diagonal
# 1 -0.870231992
# 2 -0.410109295
# 3 -0.253377217
# 4 0.098959622
# 5 -0.021396513
# 6 -0.007470889
```

Now let’s choose the correlations of the variables with the first two important PCs. In the `r.cor` object above, you can see the original variables listed as the *7th* to *12th* rows for each PC. Let’s separate the first two PCs for these variables: in the `r1` object below, rows seven to twelve select the variables, while columns one to two represent the first two PCs. The `print()` function for the `r1` object gives the correlation matrix of the first two PCs with the data variables.
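Continuing with the `r.cor` object, the subsetting can be sketched as:

```r
# Rows 7 to 12 hold the original variables; columns 1 to 2 the first two PCs
r1 = r.cor[7:12, 1:2]
print(r1)
```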

```
# 1 2
# Length -0.2013605 0.02751049
# Left 0.5381321 0.19142371
# Right 0.5966697 0.15866725
# Bottom 0.9212294 -0.37702092
# Top 0.4352554 0.79421772
# Diagonal -0.8702320 -0.41010930
```

### Displaying correlation of original variables with PCs

Set the graphical parameters. The character string `"s"` for the `pty` argument generates a square plotting region. The `ucircle` object provides the coordinate points of a unit circle, and the `plot()` function displays this unit circle. Set the values for the `type`, `lty`, `col` and `lwd` arguments.

To display the main title and the X and Y labels, use the `main`, `xlab` and `ylab` arguments. Add horizontal and vertical lines to the plot at the zero tick point using the `abline()` function. Place the labels of the variables using the `text()` function: the `x` argument specifies the correlation matrix of the first two PCs, the `label` argument lists the variable names in quotation marks using the concatenate function `c()`, and the `cex` argument sets the size of the variable names.

```
par(pty = "s")
ucircle = cbind(cos((0:360)/180 * pi), sin((0:360)/180 * pi))
plot(ucircle,
     type = "l", lty = "solid", col = "blue", lwd = 2,
     xlab = "First PC", ylab = "Second PC", main = "Unit variance circle")
abline(h = 0.0, v = 0.0, lty = 2)
text(x = r1,
     label = c("Length", "Left", "Right", "Bottom", "Top", "Diagonal"),
     cex = 1.2)
```

The figure shows that the variables top, bottom and diagonal correspond to correlations near the periphery of the circle. These variables are well explained by the first two PCs. PC1 is essentially the difference between the bottom frame variable and the diagonal. The second PC is best described by the difference between the top frame variable and the sum of bottom frame and diagonal variables. The percentage of variance of length, left and right variable explained by the first two PCs is relatively small.

### Plotting principal components

Now let’s plot the principal components. In the `plot()` function, the `x` argument specifies PC1 (`data.pc[, 1]`) and the `y` argument specifies PC2 (`data.pc[, 2]`). Choose the plotting symbols with the `pch` argument: the values three and one select the symbols used for the genuine and counterfeit banknotes respectively. The value 100 in the `rep()` function means that the first hundred measurements in the data set are drawn with symbol type three, plus signs, representing the genuine banknote measurements. The second hundred, the counterfeit banknote measurements, are drawn with symbol type one, open circles. You can fully customize the plot by setting the values of the `col`, `labels` and `cex` arguments.

```
par(pty = "s")
plot(x = data.pc[, 1], y = data.pc[, 2],
     pch = c(rep(3, 100), rep(1, 100)),
     col = c(rep("blue", 100), rep("red", 100)),
     xlab = "PC1", ylab = "PC2", main = "First vs. Second PC",
     cex.lab = 1.2, cex.axis = 1.2)
```

If we look at both plots simultaneously, the genuine banknotes are roughly characterized by large values of the bottom variable and smaller values of the diagonal variable, while the counterfeit banknotes show larger values of the top variable and smaller values of the diagonal variable.

To plot the second versus the third PC, just replace the values 1 and 2 with 2 and 3 in the `x` and `y` arguments of the `plot()` call above.

You can also plot the first two PCs using `biplot()` from the *stats* package. The `x` argument specifies the `pc` object created earlier, and the `choices` argument takes the numbers of the PCs you want to draw. Similarly, if you want to draw the second and third PC, just change the values in the `choices` argument.
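A sketch of the `biplot()` call, assuming the `pc` object from `prcomp()`:

```r
# Biplot of the first two PCs; use choices = c(2, 3) for PC2 vs. PC3
biplot(x = pc, choices = c(1, 2))
```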

`Please comment below if you have any questions.`

Download data file — `Click_here`

Download Rscript — `Click_here`

Download R program — `Click_here`

Download R studio — `Click_here`

