Biplot using base graphic functions in R
See Video ⮞ ☝ |
AGRON Stats
June 26, 2018
Introduction
In this video, you will learn how to visualize biplot for principal components using base graphics functions in R studio.
I shall use the banknote data set. This data has been discussed in previous tutorials on the principal component analysis. Further in another tutorial, I have used the same data to visualize biplot using ggbiplot function.
Also go through some video tutorials to understand the data set, principal component analysis and biplot interpretation —
PCA_R
&Biplot_PCA_R
.
Here you will learn how to create and customize PCs using standard plotting functions. You can fully customize all the plotting functions in the base graphic system.
Let’s start how to display biplot using the high level
graphics functions from the graphics package in R studio.
I often recommend to first clear all the objects or values in global environment using rm(list = ls(all = TRUE))
before importing the data set. You can also clear the plots using graphics.off()
and clear everything in console using shell()
function.
Import data set
Now let’s import the data set using read.csv()
function. I have already saved the data file as CSV (comma delimited file) in the working directory. The file
argument specify the file name with extension CSV.
In header
argument you can set a logical value that will indicate whether the data file contains first variable names as first row. In my data set the file contains the variable names in the first row, so I shall use TRUE
for this argument. The head()
function will print the first six rows of the data set.
# type Length Left Right Bottom Top Diagonal
# 1 genuine 214.4 130.1 130.3 9.7 11.7 139.8
# 2 genuine 214.9 130.5 130.2 11.0 11.5 139.5
# 3 genuine 214.9 130.3 130.1 8.7 11.7 140.2
# 4 genuine 215.0 130.4 130.6 9.9 10.9 140.3
# 5 genuine 214.7 130.2 130.3 11.8 10.9 139.7
# 6 genuine 215.0 130.2 130.2 10.6 10.7 139.9
Principal Component analysis
Now let’s carry out the principal component analysis. Load the stats package by using require()
function. Principal component analysis can be performed in R using prcomp()
on the given data matrix.
This function requires certain arguments to be specified. In x
argument you can specify the numeric or complex data matrix for the principal component analysis. In this argument you need to specify the variables to be used for principal components. Minus one in square brackets (data[,-1]
) shows that the first column containing the categorical variable will not be included in the analysis.
TRUE
for the center
argument will shift the variables to be zero centered. I shall use FALSE
for scale.
argument. As scaling is preferably done when you have variable measurements on different scales. In this data set, all the variables were measured in the same units so I shall use non scaled data for PCA.
The print function for the pc
object will display the information of standard deviation and loadings. Use summary()
function to print variance related measures of components.
# Standard deviations (1, .., p=6):
# [1] 1.7321388 0.9672748 0.4933697 0.4412015 0.2919107 0.1884534
#
# Rotation (n x k) = (6 x 6):
# PC1 PC2 PC3 PC4 PC5 PC6
# Length 0.04377427 0.01070966 -0.3263165 -0.5616918 0.75257278 -0.09809807
# Left -0.11216159 0.07144697 -0.2589614 -0.4554588 -0.34680082 0.76651197
# Right -0.13919062 0.06628208 -0.3447327 -0.4153296 -0.53465173 -0.63169678
# Bottom -0.76830499 -0.56307225 -0.2180222 0.1861082 0.09996771 0.02221711
# Top -0.20176610 0.65928988 -0.5566857 0.4506985 0.10190229 0.03485874
# Diagonal 0.57890193 -0.48854255 -0.5917628 0.2584483 -0.08445895 0.04567946
# Importance of components:
# PC1 PC2 PC3 PC4 PC5 PC6
# Standard deviation 1.7321 0.9673 0.49337 0.44120 0.29191 0.1885
# Proportion of Variance 0.6675 0.2082 0.05416 0.04331 0.01896 0.0079
# Cumulative Proportion 0.6675 0.8757 0.92983 0.97314 0.99210 1.0000
Display biplot
The importance component of summary function indicated that 87% variation is contributed due to the first two principal components. Now let’s display this information as biplot for first two PCs using base graphic functions.
The package (graphics
) used in base graphics system is already loaded in a standard installation of R. However, for a non-standard installation you can load the graphics package using require()
function.
The main function used to access the graphics state is the par()
function. Simply typing par()
will result in a complete listing of the current graphics state. A specific state settings can be queried by supplying specific setting names as arguments to par()
.
The par()
function is a high level function and you can set some arguments which you do not want to repeat when adding other objects by using lower level plotting functions.
In pty
argument you can specify the plotting region to be used. The "s"
character in this argument will generate a square plotting region. You can specify a numeric value to magnifying main title and axis labels using cex
argument. In family
argument specify the name of font family for drawing text. The default value for this argument is blank (e.g. family = " "
) which means the default device font will be used.
To specify color of the main title and axis labels use col
argument. The simplest way is to use the character string giving the color name (e.g. "gray10"
).
You can specify color for axis and boxes around the plot by using col
as foreground argument.
The las
argument specify the style of axis labels. The argument las = 1
specify the placement of axis labels will be horizontal. These are the default settings which will be effective while using lower level plotting functions.
require(graphics)
par(pty = "s",
cex.main = 1.2,
cex.lab = 1,
font.main = 2,
font.lab = 2,
family = "sans",
col.main = "gray10",
col.lab = "gray10",
fg = "gray10",
las = 1)
To Start a new plot frame use plot.new()
function. To set up a graphics window you need specify certain arguments.
In this graphics window specify the limits for X and Y axis. You can also use range(PC$x[, 1])
and range(PC$x[, 2])
obtained from X values in PC object. Set the value for aspect ratio of X and Y axis.
To set up a graphics window specify certain arguments using plot.window()
. In this graphics window specify the limits for x-axis and y-axis using xlim
and ylim
arguments. You can also use range(PC$x[, 1])
and range(PC$x[, 2])
obtained from X values in PC object for these arguments. Set the value for aspect ratio of X and Y axis.
Add axis, title and labels
Let’s add the axis to the plot. The axis function allows the specification of the side, position, labels and other options. In side
argument specify on which the axis is to be drawn on. While the at
argument specify the points at which the tick marks are to be drawn. In label
argument you can set a logical value to specify whether numerical annotations are to be made at the tick marks.
Same arguments can be used for the second Y axis. For this you just need to change the value of side
argument. Using side = 2
for this argument will draw the axis on left side of the plot.
After drawing the axis now let’s add main title and X and Y labels to the plot. For this use title()
function. Some of the title arguments we have already used in high level par()
function. In main
argument you can specify the title of the plot. In line
argument you can set a value to specify the placement of labels on the axis in the figure region of the plot. The argument adj
determines the way in which text strings are justified. The value of adj = 0
will adjust the main title as left justified. You can also use other values for text justification on axis.
To add X and Y axis labels use the same title()
function. In the Xlab
argument specify the text to be used for the X axis. You can simply type the text in quotations. However, I am using the paste function to add variance along with text. The text in quotations will be pasted as such while for showing the value of variance you can use the formula to specify the variance. We can get the value of variance from summary(PC)
.
In importance component typing two in square brackets summary(pc)$importance[2]
will take the second value variation from the first principal component. Multiplying this value with 100 will convert the variance into percent variance. Set the value one for digits argument digits = 1
to round the percent variance to one decimal place. You can use character string to separate the terms in sep
argument.
Similarly with the same arguments you can add Y axis label. Just change the text and variance value. The value 5 in importance component of the PC summary(pc)$importance[5]
represents the second proportion of variance in PC2.
In line
argument set the value to 2 to place the X and Y axis labels in the figure region. Using adj = 0.5
for adjustment argument will place the axis labels as centered justified.
axis(side = 1,
at = c(-4, -2, 0, 2, 4),
labels = TRUE)
axis(side = 2,
at = c(-4, -2, 0, 2, 4),
labels = TRUE)
title(main = "Biplot for PCs of bank note data",
line = 3,
adj = 0)
title(xlab = paste("PC 1 (",
round(summary(pc)$importance[2]*100,
digits = 1),
"%)",
sep = ""),
ylab = paste("PC 2 (",
round(summary(pc)$importance[5]*100,
digits = 1),
"%)",
sep = ""),
line = 2,
adj = 0.5)
Add points
Now let’s add the points to the plot. For this you can use a low level point()
function. In points function the x
argument specify the coordinate vectors of points to the plot. The value pc$x[,1:2]
for this argument represents that the first two PC values from X component of PC object will be used as coordinate vectors. In pch
argument you can specify the type of symbols to be used for each group in categorical variable.
In this data set the categorical variable consists of two levels viz. the genuine and counterfeit bank notes. As the first hundred values represent the genuine bank notes so I am using symbol number 16 to represent genuine notes. As the second hundred values represent the counterfeit bank notes so I am using symbol number 17 for this category. In cex
argument you can set the value to increase the size of the symbols. In col
argument you can use character string by giving color name for each set of symbols.
Add ellipse
Now let’s add ellipse to each set of categorical variable. First load the ellipse package using library()
function. If this package is not installed on your device then you can use install.package()
function to install ellipse package.
To draw ellipse on the plot use polygon()
function. In the ellipse()
function the x
argument specify the correlation matrix at least two by two in size. From the pc$x component of PC object we shall get the correlation from first hundred values of PC1 and PC2 representing genuine and counterfeit bank notes, respectively.
The center
argument specify the position of the center of ellipse. Use mean values of first and second PCs obtained from first hundred values of x component of PC object. Level
argument represents the confidence level of a pairwise confidence region. This argument is used to control the size of the ellipse. The value of level = 0.85
represents the 85 percent of confidence region. You can fully control the appearance of the ellipse by using different arguments of the polygon function. The argument border
controls the color of the border of ellipse.
You can control the line type for ellipse by using lty
argument. Line width argument lwd
will control the width of the border of ellipse. In col
argument you can specify the fill color for ellipse by giving the color name and its transparency level in alpha.f
argument.
Similarly you can draw another ellipse for the second category representing the counterfeit bank notes. Here you need to select the second hundred values for both PCs. Similarly in center
argument you need to specify the second hundred values to calculate mean value for both PCs. Also change the color name for fill color of the ellipse in col
argument.
library(ellipse)
polygon(ellipse(x = cor(pc$x[1:100, 1], pc$x[1:100, 2]),
centre = colMeans(pc$x[1:100, 1:2]),
level = 0.85),
border = "gray10",
lty = "solid",
lwd = 1.5,
col = adjustcolor("darkcyan", alpha.f = 0.20))
polygon(ellipse(x = cor(pc$x[101:200, 1], pc$x[101:200, 2]),
centre = colMeans(pc$x[101:200, 1:2]),
level = 0.85),
border = "gray10",
lty = "solid",
lwd = 1.5,
col = adjustcolor("orangered", alpha.f = 0.20))
Add axis for variable vectors
So for we have completed the plotting of variable objects. Now let’s draw the second axis for variable vectors. Set the value new = TRUE
in par()
function. This will allow a second plot on the same graph. Now you create the second axis without clearing the graphics device.
In plot graphics window function specify the axis limits for rotations of the first two PCs. Also specify the value for the aspect ratio of the axis.
I have already discussed the arguments used in axis function. Here, I have used some additional arguments. In col
argument you can specify the color for axis line and tick marks. Using col.ticks = NULL
for color ticks argument means color for ticks will be used from previous color argument. The argument col.axis
specify color for the axis annotations.
Similarly do the same for second Y axis except changing the value for side
argument.
par(new = TRUE, las = 1)
plot.window(xlim = c(-1, 1),
ylim = c(-1, 1),
asp = 1)
axis(side = 3,
at = c(-1, 0.5, 0, -0.5, 1),
labels = TRUE,
col = "navy",
col.ticks = NULL,
lwd = 2,
col.axis = "navy")
axis(side = 4,
at = c(-1, 0.5, 0, -0.5, 1),
labels = TRUE,
col = "navy",
col.ticks = NULL,
lwd = 2,
col.axis = "navy")
Add labels to second axis
If you want to add labels to second axis then use mtext()
function. It will write text into the margins of a plot. In text argument you can specify the character or expression vector specifying the text to be written. The value three in side
argument specify the placement of text will take place in top margins of the figure region. Again you can specify the cex
, font
, font family
, col
and pos
(line position) for the axis label.
Similarly use the same function for adding label to the Y axis for second PC. Four value for side argument side = 4
specify the placement of the axis label will take place on the right side of the margins of the plot.
I have used an additional las
argument. The value three las = 3
specify the style of the Y axis label will be vertical.
Add box and ablines
Now let’s add box and lines to the plot. By simply using box()
function will add box around the plot. To draw horizontal and vertical lines for PC1 and PC2 use abline()
function. Set zero value for both vertical v = 0
and horizontal lines h = 0
. Setting value two for lty
argument will draw dashed lines. You can also specify color for the lines in col
argument.
Add variable vectors
To add variable vectors to the plot use arrows function. The argument x0
and y0
represents the coordinates of points from which to draw the variable vectors. Set zero value for both of these arguments. The argument x1
and y1
specify the coordinates of points to which to draw the variable vectors. The value one in pc$rotations[,1]
represent the first PC values from rotation component of PC object. Similarly the value two in pc$rotations[,2]
represents the rotation values of the second PC.
In col
argument specify the color for variable vectors. In length
argument you can specify the value for the length of the edges of arrow heads. The default value for this argument is 0.1. The lwd
argument controls the width of the variable vectors. In angle
argument you can specify the angle from the shaft of the arrow to the edges of the arrow head.
Add lables to variable vectors
Now let’s label these variable vectors using the text()
function. In x
and y
argument specify the numerical vectors of coordinates specifying the text to be written. Set these coordinate vectors from rotation component of PC object pc$rotation[,1]
& pc$rotation[,2]
for both PCs. The label
argument specify the text to be written. Use the row names of PC rotation component for this argument (row.names(pc$rotation)
).
You can also adjust cex
, font
and col
arguments for variable vectors names. You can specify the position of variable names using pos
position specifier argument. You can specify position of the text using values of one to four for each of the variable name.
Add unit variance circle
Now let’s add unit variance circle for the variable vectors. Create a unit variance circle object ucircle
by using cbind()
function. Divide each value from zero to three sixty with one eighty and multiply the result with pi𝞹. Take cosine and sine of these values and combine it using cbind()
function. Now again in polygon function use unit circle object ucircle
as matrix of values. Choose solid for lty
line type argument. Set the color of the border of unit variance circle in border argument. Also specify the border line width in lwd
argument.
This will plot a unit variance circle. The variables close to the border of the circle have major contribution in PC analysis.
Adding Legend
Finally you can add legend to the plot using legend()
function. In x
argument you can specify the location of the legend in plotting device region. I am using topleft
position. In legend
argument specify the categorical variable group names. Use same symbols in pch
argument for groups of categorical variable. Also specify color for each set of symbols in col
argument. In text.font
argument set the value for font type used for the legend text. The size of the legend text is controlled by cex
argument. You can also magnify points by changing the value of pt.cex
argument. In box type argument bty
, I am using N character string for this argument. It means box will not be drawn around the legend.
You can control the horizontal and vertical spacing of the legend text by setting the values of x.intersp
and y.intersp
argument. You can specify the legend placement to plot region, figure region and device region by using xpd
argument. FALSE for this argument means legend will be placed within plot region. TRUE for this argument will place the legend in figure region. NA for this argument will place the legend in device region. In adjustment argument adj
you can use values of length one or two for legend symbols and text adjustment.
Let's see how finally the biplot looks like.
legend(x = "topleft",
legend = c("genuine","counterfeit"),
pch = c(16, 17),
col = c("darkcyan", "orangered"),
text.font = 2,
cex = 1.0,
pt.cex = 1.0,
bty = "n",
x.intersp = 0.5,
y.intersp = 0.8,
xpd = FALSE,
adj = c(0, 0.25))
Please comment below if you have any questions.
Download data file — Click_here
Download Rscript — Click_here
Download R program —
Click_here
Download R studio —
Click_here
Very detailed and smart, keep it up with the good work.
ReplyDeleteThank you
DeleteGreat article for beginners
ReplyDeleteThank you for your feedback
DeleteGreat presentation and explanation. I would say you should keep up the good work. The views might not get to the number you want but i feel with a matter of time your videos will ge t more views because you upload quality content. Consistency is key. Thanks for your time in creating these videos.
ReplyDeleteDear Emmanuel,
DeleteThank you for your kind words. Your suggestions really help me to improve my content. Thanks once again.