Biplot of PCs using ggbiplot function
See Video ⮞ ☝ |
AGRON Stats
June 26, 2018
Introduction
You will learn how to visualize biplot for principal components using ggbiplot()
function in R studio. I shall use the bank note data set used in previous tutorial on principal component analysis.
Suggestion:
Watch this video tutorial to understand data set and principal component analysis —
See here
In this tutorial I shall describe how to visualize principal components using ggbiplot() function and to interpret the biplot. It is often desirable to simultaneously visualize scores and loadings of the principal components. This can be accomplished with a biplot. The plot is called a biplot because it contains information on loadings (arrows) and scores (data points or sample identifiers) and not just because it plots two principal components against one another.
Let’s start how to display biplot using ggbiplot() function in R studio.
I often recommend to first clear all the objects or values in global environment using rm(list = ls(all = TRUE))
before importing the data set. You can also clear the plots using graphics.off()
and clear everything in console using shell()
function.
rm(list = ls(all = TRUE))
graphics.off()
shell("cls")
Import data set
Now let’s import the data set using read.csv()
function. I have already saved the data file as CSV (comma delimited file) in the working directory. The file
argument specify the file name with extension CSV.
In header
argument you can set a logical value that will indicate whether the data file contains first variable names as first row. In my data set the file contains the variable names in the first row, so I shall use TRUE
for this argument. The head()
function will print the first six rows of the data set.
= read.csv(file = "biplot.csv",
data header = TRUE)
head(data)
# type Length Left Right Bottom Top Diagonal
# 1 genuine 214.4 130.1 130.3 9.7 11.7 139.8
# 2 genuine 214.9 130.5 130.2 11.0 11.5 139.5
# 3 genuine 214.9 130.3 130.1 8.7 11.7 140.2
# 4 genuine 215.0 130.4 130.6 9.9 10.9 140.3
# 5 genuine 214.7 130.2 130.3 11.8 10.9 139.7
# 6 genuine 215.0 130.2 130.2 10.6 10.7 139.9
Principal component analysis
Now let’s carry out the principal component analysis. Load the stats package by using require() function. I shall use prcomp() function for principal component analysis. The argument x
specify the data set. In X argument you need to specify the variables to be used for principal components. Minus one in square brackets [-1]
shows that the first categorical variable will be excluded from data set. TRUE
for center argument will shift the variables to be zero centered. I shall use FALSE
for scale argument. As scaling is preferably done when you have variable measurements in different scales. In this data set all the variables were measured in same units so I shall use non scaled data for this analysis.
Print function for pc
object will display the information of standard deviation and loadings. Use summary()
function to print variance related measures of components as shown by the importance component below.
require(stats)
<- prcomp(x = data[-1],
pc center = TRUE,
scale. = FALSE)
print(pc)
summary(pc)
# Standard deviations (1, .., p=6):
# [1] 1.7321388 0.9672748 0.4933697 0.4412015 0.2919107 0.1884534
#
# Rotation (n x k) = (6 x 6):
# PC1 PC2 PC3 PC4 PC5 PC6
# Length 0.04377427 0.01070966 -0.3263165 -0.5616918 0.75257278 -0.09809807
# Left -0.11216159 0.07144697 -0.2589614 -0.4554588 -0.34680082 0.76651197
# Right -0.13919062 0.06628208 -0.3447327 -0.4153296 -0.53465173 -0.63169678
# Bottom -0.76830499 -0.56307225 -0.2180222 0.1861082 0.09996771 0.02221711
# Top -0.20176610 0.65928988 -0.5566857 0.4506985 0.10190229 0.03485874
# Diagonal 0.57890193 -0.48854255 -0.5917628 0.2584483 -0.08445895 0.04567946
# Importance of components:
# PC1 PC2 PC3 PC4 PC5 PC6
# Standard deviation 1.7321 0.9673 0.49337 0.44120 0.29191 0.1885
# Proportion of Variance 0.6675 0.2082 0.05416 0.04331 0.01896 0.0079
# Cumulative Proportion 0.6675 0.8757 0.92983 0.97314 0.99210 1.0000
Scree plot of eigenvalues
Simply using plot()
for pc
object will result in bargraph of variances. Screeplot can be displayed using screeplot()
function.
plot(pc)
screeplot(pc, type = "line", main = "Scree plot")
Visualizing biplot
Install ggbiplot package
To visualize biplot using ggbiplot()
first you need to install the package ggbiplot. You will also need to load the library devtools before installing ggbiplot package.
library(devtools)
install_github("vqv/ggbiplot")
Plot PCA using ggbiplot()
After installation is completed load the ggbiplot package using require()
function. The results from prcomp()
function showed that first component explains 67% of the variability in the data set. Second component explains 21% of total variability in the data set. This indicates the first two components 88% of the total variation in the data set.
Now let’s display a biplot for the first two principal components using ggbiplot()
function for pc
object created earlier. The scales for the loadings are shown at the bottom and left axis for PC1 and PC2 respectively.
require(ggbiplot)
ggbiplot(pc)
Custmoizing biplot
In ggbiplot()
function we can use different arguments to enhance its visualization. In pcobj
argument use pc
object created earlier. In choices you can specify the two principal components to be used. I am using same scale factor to apply to obs.scale
and var.scale
. You can specify your own labels for data points in plot by using labels
argument. I shall use rownames
as labels for data points. In plot you can see the first cluster of hundred values represent the genuine bank notes. The second cluster from one hundred to two hundred values represent the counterfeit bank notes.
Let’s resize the variable names using varname.size
argument. You can shorten the length of variable names by using varname.abbrev
argument. TRUE
for this argument will abbreviate the variable names. Typing FALSE
for variable axis will remove the arrows or variable vectors. As I want to show the arrows so I shall keep this argument as TRUE
in the final output.
To draw a correlation circle specify TRUE for circle
argument. This argument should only be used when scale
argument is specified as TRUE in prcomp()
function. To draw a normal ellipse for each group set TRUE for ellipse
argument. Also specify the group variable in groups
argument. Use print()
function to plot this object.
= ggbiplot(pcobj = pc,
biplot choices = c(1,2),
obs.scale = 1, var.scale = 1, # Scaling of axis
labels = row.names(data), # Add labels as rownames
labels.size = 4,
varname.size = 5,
varname.abbrev = TRUE, # Abbreviate variable names (TRUE)
var.axes = FALSE, # Remove variable vectors (TRUE)
circle = FALSE, # Add unit variance circle (TRUE)
ellipse = TRUE, groups = data$type) # Adding ellipses
print(biplot)
Add main and legend title
Add title to this plot by using labs()
function. In title
argument type the main title of the plot. For colour
argument type the title of the legend to be used. Print()
function for this object will display main title and legend title on the plot.
= biplot + labs(title = "PCA of yield contributing parameters",
biplot2 colour = "Seed priming")
Change legend position
Legend is located at right side of the plot. You can change the position of legend by using theme()
function. Set bottom for legend.position
argument to display the legend in the bottom area of the plot.
= biplot2 + theme(legend.position = "bottom") biplot3