Biplot of PCs using ggbiplot function

See Video  ☝

Introduction

You will learn how to visualize biplot for principal components using ggbiplot() function in R studio. I shall use the bank note data set used in previous tutorial on principal component analysis.

Suggestion:

Watch this video tutorial to understand data set and principal component analysis — See here

In this tutorial I shall describe how to visualize principal components using ggbiplot() function and to interpret the biplot. It is often desirable to simultaneously visualize scores and loadings of the principal components. This can be accomplished with a biplot. The plot is called a biplot because it contains information on loadings (arrows) and scores (data points or sample identifiers) and not just because it plots two principal components against one another.

Let’s start how to display biplot using ggbiplot() function in R studio. 

I often recommend to first clear all the objects or values in global environment using rm(list = ls(all = TRUE)) before importing the data set. You can also clear the plots using graphics.off() and clear everything in console using shell() function.

rm(list = ls(all = TRUE))
graphics.off()
shell("cls")

Import data set

Now let’s import the data set using read.csv() function. I have already saved the data file as CSV (comma delimited file) in the working directory. The file argument specify the file name with extension CSV.

In header argument you can set a logical value that will indicate whether the data file contains first variable names as first row. In my data set the file contains the variable names in the first row, so I shall use TRUE for this argument. The head() function will print the first six rows of the data set.

data = read.csv(file = "biplot.csv",
                header = TRUE)
head(data)
#      type Length  Left Right Bottom  Top Diagonal
# 1 genuine  214.4 130.1 130.3    9.7 11.7    139.8
# 2 genuine  214.9 130.5 130.2   11.0 11.5    139.5
# 3 genuine  214.9 130.3 130.1    8.7 11.7    140.2
# 4 genuine  215.0 130.4 130.6    9.9 10.9    140.3
# 5 genuine  214.7 130.2 130.3   11.8 10.9    139.7
# 6 genuine  215.0 130.2 130.2   10.6 10.7    139.9

Principal component analysis

Now let’s carry out the principal component analysis. Load the stats package by using require() function. I shall use prcomp() function for principal component analysis. The argument x specify the data set. In X argument you need to specify the variables to be used for principal components. Minus one in square brackets [-1] shows that the first categorical variable will be excluded from data set. TRUE for center argument will shift the variables to be zero centered. I shall use FALSE for scale argument. As scaling is preferably done when you have variable measurements in different scales. In this data set all the variables were measured in same units so I shall use non scaled data for this analysis.

Print function for pc object will display the information of standard deviation and loadings. Use summary() function to print variance related measures of components as shown by the importance component below.

require(stats)
pc <- prcomp(x = data[-1],
                  center = TRUE, 
                  scale. = FALSE)
print(pc)
summary(pc)
# Standard deviations (1, .., p=6):
# [1] 1.7321388 0.9672748 0.4933697 0.4412015 0.2919107 0.1884534
# 
# Rotation (n x k) = (6 x 6):
#                  PC1         PC2        PC3        PC4         PC5         PC6
# Length    0.04377427  0.01070966 -0.3263165 -0.5616918  0.75257278 -0.09809807
# Left     -0.11216159  0.07144697 -0.2589614 -0.4554588 -0.34680082  0.76651197
# Right    -0.13919062  0.06628208 -0.3447327 -0.4153296 -0.53465173 -0.63169678
# Bottom   -0.76830499 -0.56307225 -0.2180222  0.1861082  0.09996771  0.02221711
# Top      -0.20176610  0.65928988 -0.5566857  0.4506985  0.10190229  0.03485874
# Diagonal  0.57890193 -0.48854255 -0.5917628  0.2584483 -0.08445895  0.04567946
# Importance of components:
#                           PC1    PC2     PC3     PC4     PC5    PC6
# Standard deviation     1.7321 0.9673 0.49337 0.44120 0.29191 0.1885
# Proportion of Variance 0.6675 0.2082 0.05416 0.04331 0.01896 0.0079
# Cumulative Proportion  0.6675 0.8757 0.92983 0.97314 0.99210 1.0000

Scree plot of eigenvalues

Simply using plot() for pc object will result in bargraph of variances. Screeplot can be displayed using screeplot() function.

plot(pc)
screeplot(pc, type = "line", main = "Scree plot")

Visualizing biplot

Install ggbiplot package

To visualize biplot using ggbiplot() first you need to install the package ggbiplot. You will also need to load the library devtools before installing ggbiplot package.

library(devtools)
install_github("vqv/ggbiplot")

Plot PCA using ggbiplot()

After installation is completed load the ggbiplot package using require() function. The results from prcomp() function showed that first component explains 67% of the variability in the data set. Second component explains 21% of total variability in the data set. This indicates the first two components 88% of the total variation in the data set.

Now let’s display a biplot for the first two principal components using ggbiplot() function for pc object created earlier. The scales for the loadings are shown at the bottom and left axis for PC1 and PC2 respectively.

require(ggbiplot)
ggbiplot(pc)

Custmoizing biplot

In ggbiplot() function we can use different arguments to enhance its visualization. In pcobj argument use pc object created earlier. In choices you can specify the two principal components to be used. I am using same scale factor to apply to obs.scale and var.scale. You can specify your own labels for data points in plot by using labels argument. I shall use rownames as labels for data points. In plot you can see the first cluster of hundred values represent the genuine bank notes. The second cluster from one hundred to two hundred values represent the counterfeit bank notes.

Let’s resize the variable names using varname.size argument. You can shorten the length of variable names by using varname.abbrev argument. TRUE for this argument will abbreviate the variable names. Typing FALSE for variable axis will remove the arrows or variable vectors. As I want to show the arrows so I shall keep this argument as TRUE in the final output.

To draw a correlation circle specify TRUE for circle argument. This argument should only be used when scale argument is specified as TRUE in prcomp() function. To draw a normal ellipse for each group set TRUE for ellipse argument. Also specify the group variable in groups argument. Use print() function to plot this object.

biplot = ggbiplot(pcobj = pc,
                  choices = c(1,2),
                  obs.scale = 1, var.scale = 1,  # Scaling of axis
                  labels = row.names(data),     # Add labels as rownames
                  labels.size = 4,
                  varname.size = 5,
                  varname.abbrev = TRUE,  # Abbreviate variable names (TRUE)
                  var.axes = FALSE,      # Remove variable vectors (TRUE)
                  circle = FALSE,        # Add unit variance circle (TRUE)
                  ellipse = TRUE, groups = data$type) # Adding ellipses
print(biplot)

Add main and legend title

Add title to this plot by using labs() function. In title argument type the main title of the plot. For colour argument type the title of the legend to be used. Print() function for this object will display main title and legend title on the plot.

biplot2 = biplot + labs(title = "PCA of yield contributing parameters",
                        colour = "Seed priming")

Change legend position

Legend is located at right side of the plot. You can change the position of legend by using theme() function. Set bottom for legend.position argument to display the legend in the bottom area of the plot.

biplot3 = biplot2 + theme(legend.position = "bottom")