Biplot of PCs using ggbiplot function
See Video ⮞ ☝ |
AGRON Stats
June 26, 2018
Introduction
You will learn how to visualize biplot for principal components using ggbiplot()
function in R studio. I shall use the bank note data set used in previous tutorial on principal component analysis.
Suggestion:
Watch this video tutorial to understand data set and principal component analysis —
See here
In this tutorial I shall describe how to visualize principal components using ggbiplot() function and to interpret the biplot. It is often desirable to simultaneously visualize scores and loadings of the principal components. This can be accomplished with a biplot. The plot is called a biplot because it contains information on loadings (arrows) and scores (data points or sample identifiers) and not just because it plots two principal components against one another.
Let’s start how to display biplot using ggbiplot() function in R studio.
I often recommend to first clear all the objects or values in global environment using rm(list = ls(all = TRUE))
before importing the data set. You can also clear the plots using graphics.off()
and clear everything in console using shell()
function.
rm(list = ls(all = TRUE))
graphics.off()
shell("cls")
Import data set
Now let’s import the data set using read.csv()
function. I have already saved the data file as CSV (comma delimited file) in the working directory. The file
argument specify the file name with extension CSV.
In header
argument you can set a logical value that will indicate whether the data file contains first variable names as first row. In my data set the file contains the variable names in the first row, so I shall use TRUE
for this argument. The head()
function will print the first six rows of the data set.
= read.csv(file = "biplot.csv",
data header = TRUE)
head(data)
# type Length Left Right Bottom Top Diagonal
# 1 genuine 214.4 130.1 130.3 9.7 11.7 139.8
# 2 genuine 214.9 130.5 130.2 11.0 11.5 139.5
# 3 genuine 214.9 130.3 130.1 8.7 11.7 140.2
# 4 genuine 215.0 130.4 130.6 9.9 10.9 140.3
# 5 genuine 214.7 130.2 130.3 11.8 10.9 139.7
# 6 genuine 215.0 130.2 130.2 10.6 10.7 139.9
Principal component analysis
Now let’s carry out the principal component analysis. Load the stats package by using require() function. I shall use prcomp() function for principal component analysis. The argument x
specify the data set. In X argument you need to specify the variables to be used for principal components. Minus one in square brackets [-1]
shows that the first categorical variable will be excluded from data set. TRUE
for center argument will shift the variables to be zero centered. I shall use FALSE
for scale argument. As scaling is preferably done when you have variable measurements in different scales. In this data set all the variables were measured in same units so I shall use non scaled data for this analysis.
Print function for pc
object will display the information of standard deviation and loadings. Use summary()
function to print variance related measures of components as shown by the importance component below.
require(stats)
<- prcomp(x = data[-1],
pc center = TRUE,
scale. = FALSE)
print(pc)
summary(pc)
# Standard deviations (1, .., p=6):
# [1] 1.7321388 0.9672748 0.4933697 0.4412015 0.2919107 0.1884534
#
# Rotation (n x k) = (6 x 6):
# PC1 PC2 PC3 PC4 PC5 PC6
# Length 0.04377427 0.01070966 -0.3263165 -0.5616918 0.75257278 -0.09809807
# Left -0.11216159 0.07144697 -0.2589614 -0.4554588 -0.34680082 0.76651197
# Right -0.13919062 0.06628208 -0.3447327 -0.4153296 -0.53465173 -0.63169678
# Bottom -0.76830499 -0.56307225 -0.2180222 0.1861082 0.09996771 0.02221711
# Top -0.20176610 0.65928988 -0.5566857 0.4506985 0.10190229 0.03485874
# Diagonal 0.57890193 -0.48854255 -0.5917628 0.2584483 -0.08445895 0.04567946
# Importance of components:
# PC1 PC2 PC3 PC4 PC5 PC6
# Standard deviation 1.7321 0.9673 0.49337 0.44120 0.29191 0.1885
# Proportion of Variance 0.6675 0.2082 0.05416 0.04331 0.01896 0.0079
# Cumulative Proportion 0.6675 0.8757 0.92983 0.97314 0.99210 1.0000
Scree plot of eigenvalues
Simply using plot()
for pc
object will result in bargraph of variances. Screeplot can be displayed using screeplot()
function.
plot(pc)
screeplot(pc, type = "line", main = "Scree plot")
Visualizing biplot
Install ggbiplot package
To visualize biplot using ggbiplot()
first you need to install the package ggbiplot. You will also need to load the library devtools before installing ggbiplot package.
library(devtools)
install_github("vqv/ggbiplot")
Plot PCA using ggbiplot()
After installation is completed load the ggbiplot package using require()
function. The results from prcomp()
function showed that first component explains 67% of the variability in the data set. Second component explains 21% of total variability in the data set. This indicates the first two components 88% of the total variation in the data set.
Now let’s display a biplot for the first two principal components using ggbiplot()
function for pc
object created earlier. The scales for the loadings are shown at the bottom and left axis for PC1 and PC2 respectively.
require(ggbiplot)
ggbiplot(pc)
Custmoizing biplot
In ggbiplot()
function we can use different arguments to enhance its visualization. In pcobj
argument use pc
object created earlier. In choices you can specify the two principal components to be used. I am using same scale factor to apply to obs.scale
and var.scale
. You can specify your own labels for data points in plot by using labels
argument. I shall use rownames
as labels for data points. In plot you can see the first cluster of hundred values represent the genuine bank notes. The second cluster from one hundred to two hundred values represent the counterfeit bank notes.
Let’s resize the variable names using varname.size
argument. You can shorten the length of variable names by using varname.abbrev
argument. TRUE
for this argument will abbreviate the variable names. Typing FALSE
for variable axis will remove the arrows or variable vectors. As I want to show the arrows so I shall keep this argument as TRUE
in the final output.
To draw a correlation circle specify TRUE for circle
argument. This argument should only be used when scale
argument is specified as TRUE in prcomp()
function. To draw a normal ellipse for each group set TRUE for ellipse
argument. Also specify the group variable in groups
argument. Use print()
function to plot this object.
= ggbiplot(pcobj = pc,
biplot choices = c(1,2),
obs.scale = 1, var.scale = 1, # Scaling of axis
labels = row.names(data), # Add labels as rownames
labels.size = 4,
varname.size = 5,
varname.abbrev = TRUE, # Abbreviate variable names (TRUE)
var.axes = FALSE, # Remove variable vectors (TRUE)
circle = FALSE, # Add unit variance circle (TRUE)
ellipse = TRUE, groups = data$type) # Adding ellipses
print(biplot)
Add main and legend title
Add title to this plot by using labs()
function. In title
argument type the main title of the plot. For colour
argument type the title of the legend to be used. Print()
function for this object will display main title and legend title on the plot.
= biplot + labs(title = "PCA of yield contributing parameters",
biplot2 colour = "Seed priming")
Change legend position
Legend is located at right side of the plot. You can change the position of legend by using theme()
function. Set bottom for legend.position
argument to display the legend in the bottom area of the plot.
= biplot2 + theme(legend.position = "bottom") biplot3
Apply themes
You can specify different themes to the plot from one of the below mentined themes.
themes (classic, dark, gray, light light, linedraw, minimal, void, bw)
You can choose any of the given themes. Below you can see how the plots look like after applying themes.
= biplot3 + theme_bw()
theme1 = biplot3 + theme_classic()
theme2 = biplot3 + theme_dark()
theme3 = biplot3 + theme_gray()
theme4 = biplot3 + theme_minimal()
theme5 = biplot3 + theme_void() theme6
Interpretting biplot
Now, I shall briefly explain the biplot. The graph shows that objects are preferably displayed as points. The variables are displayed as vectors or arrows. Data points or sample identifiers represent the scores. The scores are in deviation from their average for each of the variables. The origin represents the average value, both for each variable and for each object across all variables.
This average has a value of zero in the centered data matrix. The arrows contains the information on loadings or it represent the variable vectors. The length of the variable vectors indicate how well the variables are represented by the graph — with a perfect fit if all vectors have equal lengths.
The length of arrows in the plot are directly proportional to the variability included in the two components. For example a very short arrow in this biplot represent the length variable. It indicates the first two components contains almost no information about this element.
The angle between any two arrows represent the correlation between those variables. If the angle between two variable vectors is zero then it shows both variables or collinear. It means both variable vectors lie on same line in same direction. If the angle between two variable vectors is 90 degrees then it shows both variable vectors are orthogonal. It means both variable show lack of correlation.
The smaller the acute angle the stronger the positive correlation will be between the two variables.
If the angle between two variable vectors is 180 degrees then it shows both variables are collinear. It means both variable vectors lie on same line but in opposite direction. The variables having greater obtuse angle are negatively correlated. Let’s have a look at correlation matrix of the numerical variables.
Correlation of variables
The acute angle between left and right variable is smaller so the correlation between these two variables is strongly positive (74%). The obtuse angle between bottom and diagonal variable indicates slightly strong negative correlation (-62%) between these two variables.
cor(data[, 2:7])
# Length Left Right Bottom Top Diagonal
# Length 1.00000000 0.2312926 0.1517628 -0.1898009 -0.06132141 0.1943015
# Left 0.23129257 1.0000000 0.7432628 0.4137810 0.36234960 -0.5032290
# Right 0.15176280 0.7432628 1.0000000 0.4867577 0.40067021 -0.5164755
# Bottom -0.18980092 0.4137810 0.4867577 1.0000000 0.14185134 -0.6229827
# Top -0.06132141 0.3623496 0.4006702 0.1418513 1.00000000 -0.5940446
# Diagonal 0.19430146 -0.5032290 -0.5164755 -0.6229827 -0.59404464 1.0000000
Variances for most contributing variables
Now let’s have a look at the length of the variable vectors. As most of the variation is explained by the bottom, diagonal and top variables due to greater length of variable vectors. To compute variance for these variables first attach the database to access variables by simply giving their names. We can compute the variance for bottom, diagonal and top variables using variance function. The higher variation in bottom variable is indicated by the greater length of bottom variable vector. The bottom and diagonal variables are the most important contributors to the first PC. Second PC is best explained by the top variable and sum of bottom and diagonal variables.
attach(data)
var(Bottom)
var(Diagonal)
var(Top)
# [1] 2.086878
# [1] 1.327716
# [1] 0.6447234
Please comment below if you have any questions.
Download data file — Click_here
Download Rscript — Click_here
Download R program —
Click_here
Download R studio —
Click_here
Thank you for these very useful explanations. I would like to know if there is a way to modify Xlab and Ylab. That is to say I want to change "PC1 (66.8 % explained var.)" by just "PC1 (66.8%)"?
ReplyDeleteDear Abel,
DeleteI am thankful for your valuable comment. You can simply type below command after plus sign to add X and Y axis labels:
+ xlab ("PC1 (49.1%)") + ylab ("PC2 (21.3%)")
If still have issue please let me know
Many Many thanks for your effort in providing this important tutorial. It is really great to get such an online support, your videos are fantastic
ReplyDeleteThank you for nice explanations, my question when I used my own data , grouping by genotypes not sucesse how can we group and display now names in the plot
ReplyDeleteThank you so much for this! I was just wondering if the ellipses are 95% confidence ellipses and if not how you could add these to the plot? thanks
ReplyDelete