Biplot using base graphic functions in R

See Video  ☝

Introduction

In this video, you will learn how to visualize biplot for principal components using base graphics functions in R studio.

I shall use the banknote data set. This data has been discussed in previous tutorials on the principal component analysis. Further in another tutorial, I have used the same data to visualize biplot using ggbiplot function.

Also go through some video tutorials to understand the data set, principal component analysis and biplot interpretation — PCA_R & Biplot_PCA_R.

Here you will learn how to create and customize PCs using standard plotting functions. You can fully customize all the plotting functions in the base graphic system.

Let’s start how to display biplot using the high level graphics functions from the graphics package in R studio.

I often recommend to first clear all the objects or values in global environment using rm(list = ls(all = TRUE)) before importing the data set. You can also clear the plots using graphics.off() and clear everything in console using shell() function.

rm(list = ls(all = TRUE))
graphics.off()
shell("cls")

Import data set

Now let’s import the data set using read.csv() function. I have already saved the data file as CSV (comma delimited file) in the working directory. The file argument specify the file name with extension CSV.

In header argument you can set a logical value that will indicate whether the data file contains first variable names as first row. In my data set the file contains the variable names in the first row, so I shall use TRUE for this argument. The head() function will print the first six rows of the data set.

data <- read.csv(file = "biplot.csv", 
                 header = TRUE)
head(data)
#      type Length  Left Right Bottom  Top Diagonal
# 1 genuine  214.4 130.1 130.3    9.7 11.7    139.8
# 2 genuine  214.9 130.5 130.2   11.0 11.5    139.5
# 3 genuine  214.9 130.3 130.1    8.7 11.7    140.2
# 4 genuine  215.0 130.4 130.6    9.9 10.9    140.3
# 5 genuine  214.7 130.2 130.3   11.8 10.9    139.7
# 6 genuine  215.0 130.2 130.2   10.6 10.7    139.9

Principal Component analysis

Now let’s carry out the principal component analysis. Load the stats package by using require() function. Principal component analysis can be performed in R using prcomp() on the given data matrix.

This function requires certain arguments to be specified. In x argument you can specify the numeric or complex data matrix for the principal component analysis. In this argument you need to specify the variables to be used for principal components. Minus one in square brackets (data[,-1]) shows that the first column containing the categorical variable will not be included in the analysis.

TRUE for the center argument will shift the variables to be zero centered. I shall use FALSE for scale. argument. As scaling is preferably done when you have variable measurements on different scales. In this data set, all the variables were measured in the same units so I shall use non scaled data for PCA.

The print function for the pc object will display the information of standard deviation and loadings. Use summary() function to print variance related measures of components.

require(stats)
pc <- prcomp(x = data[ , -1],
             center = TRUE, 
             scale. = FALSE)
print(pc)
summary(pc)
# Standard deviations (1, .., p=6):
# [1] 1.7321388 0.9672748 0.4933697 0.4412015 0.2919107 0.1884534
# 
# Rotation (n x k) = (6 x 6):
#                  PC1         PC2        PC3        PC4         PC5         PC6
# Length    0.04377427  0.01070966 -0.3263165 -0.5616918  0.75257278 -0.09809807
# Left     -0.11216159  0.07144697 -0.2589614 -0.4554588 -0.34680082  0.76651197
# Right    -0.13919062  0.06628208 -0.3447327 -0.4153296 -0.53465173 -0.63169678
# Bottom   -0.76830499 -0.56307225 -0.2180222  0.1861082  0.09996771  0.02221711
# Top      -0.20176610  0.65928988 -0.5566857  0.4506985  0.10190229  0.03485874
# Diagonal  0.57890193 -0.48854255 -0.5917628  0.2584483 -0.08445895  0.04567946
# Importance of components:
#                           PC1    PC2     PC3     PC4     PC5    PC6
# Standard deviation     1.7321 0.9673 0.49337 0.44120 0.29191 0.1885
# Proportion of Variance 0.6675 0.2082 0.05416 0.04331 0.01896 0.0079
# Cumulative Proportion  0.6675 0.8757 0.92983 0.97314 0.99210 1.0000

Display biplot

The importance component of summary function indicated that 87% variation is contributed due to the first two principal components. Now let’s display this information as biplot for first two PCs using base graphic functions.

The package (graphics) used in base graphics system is already loaded in a standard installation of R. However, for a non-standard installation you can load the graphics package using require() function.

The main function used to access the graphics state is the par() function. Simply typing par() will result in a complete listing of the current graphics state. A specific state settings can be queried by supplying specific setting names as arguments to par().

require(graphics)
par()

The par() function is a high level function and you can set some arguments which you do not want to repeat when adding other objects by using lower level plotting functions.

In pty argument you can specify the plotting region to be used. The "s" character in this argument will generate a square plotting region. You can specify a numeric value to magnifying main title and axis labels using cex argument. In family argument specify the name of font family for drawing text. The default value for this argument is blank (e.g. family = " ") which means the default device font will be used.

To specify color of the main title and axis labels use col argument. The simplest way is to use the character string giving the color name (e.g. "gray10").

You can specify color for axis and boxes around the plot by using col as foreground argument.

The las argument specify the style of axis labels. The argument las = 1 specify the placement of axis labels will be horizontal. These are the default settings which will be effective while using lower level plotting functions.

require(graphics)
par(pty = "s",
    cex.main = 1.2,
    cex.lab = 1,
    font.main = 2,
    font.lab = 2,
    family = "sans",
    col.main = "gray10",
    col.lab = "gray10",
    fg = "gray10",
    las = 1)

To Start a new plot frame use plot.new() function. To set up a graphics window you need specify certain arguments.

In this graphics window specify the limits for X and Y axis. You can also use range(PC$x[, 1]) and range(PC$x[, 2]) obtained from X values in PC object. Set the value for aspect ratio of X and Y axis.

To set up a graphics window specify certain arguments using plot.window(). In this graphics window specify the limits for x-axis and y-axis using xlim and ylim arguments. You can also use range(PC$x[, 1]) and range(PC$x[, 2]) obtained from X values in PC object for these arguments. Set the value for aspect ratio of X and Y axis.

plot.new()
plot.window(xlim = c(-4, 4), 
            ylim = c(-4, 4), 
            asp = 1)

Add axis, title and labels

Let’s add the axis to the plot. The axis function allows the specification of the side, position, labels and other options. In side argument specify on which the axis is to be drawn on. While the at argument specify the points at which the tick marks are to be drawn. In label argument you can set a logical value to specify whether numerical annotations are to be made at the tick marks.

Same arguments can be used for the second Y axis. For this you just need to change the value of side argument. Using side = 2 for this argument will draw the axis on left side of the plot.

After drawing the axis now let’s add main title and X and Y labels to the plot. For this use title() function. Some of the title arguments we have already used in high level par() function. In main argument you can specify the title of the plot. In line argument you can set a value to specify the placement of labels on the axis in the figure region of the plot. The argument adj determines the way in which text strings are justified. The value of adj = 0 will adjust the main title as left justified. You can also use other values for text justification on axis.

To add X and Y axis labels use the same title() function. In the Xlab argument specify the text to be used for the X axis. You can simply type the text in quotations. However, I am using the paste function to add variance along with text. The text in quotations will be pasted as such while for showing the value of variance you can use the formula to specify the variance. We can get the value of variance from summary(PC).

In importance component typing two in square brackets summary(pc)$importance[2] will take the second value variation from the first principal component. Multiplying this value with 100 will convert the variance into percent variance. Set the value one for digits argument digits = 1 to round the percent variance to one decimal place. You can use character string to separate the terms in sep argument.

Similarly with the same arguments you can add Y axis label. Just change the text and variance value. The value 5 in importance component of the PC summary(pc)$importance[5] represents the second proportion of variance in PC2.

In line argument set the value to 2 to place the X and Y axis labels in the figure region. Using adj = 0.5 for adjustment argument will place the axis labels as centered justified.

axis(side = 1, 
     at = c(-4, -2, 0, 2, 4),
     labels = TRUE)
axis(side = 2, 
     at = c(-4, -2, 0, 2, 4),
     labels = TRUE)
title(main = "Biplot for PCs of bank note data", 
      line = 3,
      adj = 0)
title(xlab = paste("PC 1 (", 
                   round(summary(pc)$importance[2]*100, 
                         digits = 1),
                   "%)", 
                   sep = ""), 
      ylab = paste("PC 2 (", 
                   round(summary(pc)$importance[5]*100, 
                         digits = 1),
                   "%)", 
                   sep = ""), 
      line = 2,
      adj = 0.5)

Add points

Now let’s add the points to the plot. For this you can use a low level point() function. In points function the x argument specify the coordinate vectors of points to the plot. The value pc$x[,1:2] for this argument represents that the first two PC values from X component of PC object will be used as coordinate vectors. In pch argument you can specify the type of symbols to be used for each group in categorical variable.

In this data set the categorical variable consists of two levels viz. the genuine and counterfeit bank notes. As the first hundred values represent the genuine bank notes so I am using symbol number 16 to represent genuine notes. As the second hundred values represent the counterfeit bank notes so I am using symbol number 17 for this category. In cex argument you can set the value to increase the size of the symbols. In col argument you can use character string by giving color name for each set of symbols.

points(x = pc$x[,1:2],
       pch = c(rep(16, times = 100),
               rep(17, times = 100)),
       cex = 1,
       col = c(rep("darkcyan", times=100),
               rep("orangered", times=100)))

Add ellipse

Now let’s add ellipse to each set of categorical variable. First load the ellipse package using library() function. If this package is not installed on your device then you can use install.package() function to install ellipse package.

To draw ellipse on the plot use polygon() function. In the ellipse() function the x argument specify the correlation matrix at least two by two in size. From the pc$x component of PC object we shall get the correlation from first hundred values of PC1 and PC2 representing genuine and counterfeit bank notes, respectively.

The center argument specify the position of the center of ellipse. Use mean values of first and second PCs obtained from first hundred values of x component of PC object. Level argument represents the confidence level of a pairwise confidence region. This argument is used to control the size of the ellipse. The value of level = 0.85 represents the 85 percent of confidence region. You can fully control the appearance of the ellipse by using different arguments of the polygon function. The argument border controls the color of the border of ellipse.

You can control the line type for ellipse by using lty argument. Line width argument lwd will control the width of the border of ellipse. In col argument you can specify the fill color for ellipse by giving the color name and its transparency level in alpha.f argument.

Similarly you can draw another ellipse for the second category representing the counterfeit bank notes. Here you need to select the second hundred values for both PCs. Similarly in center argument you need to specify the second hundred values to calculate mean value for both PCs. Also change the color name for fill color of the ellipse in col argument.

library(ellipse)
polygon(ellipse(x = cor(pc$x[1:100, 1], pc$x[1:100, 2]),
                centre = colMeans(pc$x[1:100, 1:2]),
                level = 0.85),
        border = "gray10",
        lty = "solid",
        lwd = 1.5,
        col = adjustcolor("darkcyan", alpha.f = 0.20))
polygon(ellipse(x = cor(pc$x[101:200, 1], pc$x[101:200, 2]),
                centre = colMeans(pc$x[101:200, 1:2]),
                level = 0.85),
        border = "gray10",
        lty = "solid",
        lwd = 1.5,
        col = adjustcolor("orangered", alpha.f = 0.20))

Add axis for variable vectors

So for we have completed the plotting of variable objects. Now let’s draw the second axis for variable vectors. Set the value new = TRUE in par() function. This will allow a second plot on the same graph. Now you create the second axis without clearing the graphics device.

In plot graphics window function specify the axis limits for rotations of the first two PCs. Also specify the value for the aspect ratio of the axis.

I have already discussed the arguments used in axis function. Here, I have used some additional arguments. In col argument you can specify the color for axis line and tick marks. Using col.ticks = NULL for color ticks argument means color for ticks will be used from previous color argument. The argument col.axis specify color for the axis annotations.

Similarly do the same for second Y axis except changing the value for side argument.

par(new = TRUE, las = 1)
plot.window(xlim = c(-1, 1), 
            ylim = c(-1, 1), 
            asp = 1)
axis(side = 3, 
     at = c(-1, 0.5, 0, -0.5, 1), 
     labels = TRUE, 
     col = "navy",
     col.ticks = NULL,
     lwd = 2,
     col.axis = "navy")
axis(side = 4, 
     at = c(-1, 0.5, 0, -0.5, 1), 
     labels = TRUE, 
     col = "navy",
     col.ticks = NULL,
     lwd = 2,
     col.axis = "navy")

Add labels to second axis

If you want to add labels to second axis then use mtext() function. It will write text into the margins of a plot. In text argument you can specify the character or expression vector specifying the text to be written. The value three in side argument specify the placement of text will take place in top margins of the figure region. Again you can specify the cex, font, font family, col and pos (line position) for the axis label.

Similarly use the same function for adding label to the Y axis for second PC. Four value for side argument side = 4 specify the placement of the axis label will take place on the right side of the margins of the plot.

I have used an additional las argument. The value three las = 3 specify the style of the Y axis label will be vertical.

mtext((text = "PC 1 rotations"), 
      side = 3,
      cex = 1,
      font = 2,
      family = "sans",
      col = "gray10", 
      line = 2) 
mtext((text = "PC 2 rotations"), 
      side = 4,
      cex = 1,
      font = 2,
      family = "sans",
      col = "gray10", 
      line = 2,
      las = 3)

Add box and ablines

Now let’s add box and lines to the plot. By simply using box() function will add box around the plot. To draw horizontal and vertical lines for PC1 and PC2 use abline() function. Set zero value for both vertical v = 0 and horizontal lines h = 0. Setting value two for lty argument will draw dashed lines. You can also specify color for the lines in col argument.

box()
abline(v = 0, h = 0, lty = 2, col = "grey25")

Add variable vectors

To add variable vectors to the plot use arrows function. The argument x0 and y0 represents the coordinates of points from which to draw the variable vectors. Set zero value for both of these arguments. The argument x1 and y1 specify the coordinates of points to which to draw the variable vectors. The value one in pc$rotations[,1] represent the first PC values from rotation component of PC object. Similarly the value two in pc$rotations[,2] represents the rotation values of the second PC.

In col argument specify the color for variable vectors. In length argument you can specify the value for the length of the edges of arrow heads. The default value for this argument is 0.1. The lwd argument controls the width of the variable vectors. In angle argument you can specify the angle from the shaft of the arrow to the edges of the arrow head.

arrows(x0 = 0, x1 = pc$rotation[,1], 
       y0 = 0, y1 = pc$rotation[,2], 
       col = "navy", 
       length = 0.08, 
       lwd = 2,
       angle = 30)

Add lables to variable vectors

Now let’s label these variable vectors using the text() function. In x and y argument specify the numerical vectors of coordinates specifying the text to be written. Set these coordinate vectors from rotation component of PC object pc$rotation[,1] & pc$rotation[,2] for both PCs. The label argument specify the text to be written. Use the row names of PC rotation component for this argument (row.names(pc$rotation)).

You can also adjust cex, font and col arguments for variable vectors names. You can specify the position of variable names using pos position specifier argument. You can specify position of the text using values of one to four for each of the variable name.

text(x = pc$rotation[,1], y = pc$rotation[,2], 
     labels = row.names(pc$rotation), 
     cex = 1.2,
     font = 2,
     col = "gray10", 
     pos = c(4, 3, 2, 1, 3, 1))

Add unit variance circle

Now let’s add unit variance circle for the variable vectors. Create a unit variance circle object ucircle by using cbind() function. Divide each value from zero to three sixty with one eighty and multiply the result with pi𝞹. Take cosine and sine of these values and combine it using cbind() function. Now again in polygon function use unit circle object ucircle as matrix of values. Choose solid for lty line type argument. Set the color of the border of unit variance circle in border argument. Also specify the border line width in lwd argument.

This will plot a unit variance circle. The variables close to the border of the circle have major contribution in PC analysis.

ucircle = cbind(cos((0:360)/180*pi), 
                sin((0:360)/180*pi))
polygon(ucircle, 
        lty= "solid", border = "gray10", lwd = 1)

Adding Legend

Finally you can add legend to the plot using legend() function. In x argument you can specify the location of the legend in plotting device region. I am using topleft position. In legend argument specify the categorical variable group names. Use same symbols in pch argument for groups of categorical variable. Also specify color for each set of symbols in col argument. In text.font argument set the value for font type used for the legend text. The size of the legend text is controlled by cex argument. You can also magnify points by changing the value of pt.cex argument. In box type argument bty, I am using N character string for this argument. It means box will not be drawn around the legend.

You can control the horizontal and vertical spacing of the legend text by setting the values of x.intersp and y.intersp argument. You can specify the legend placement to plot region, figure region and device region by using xpd argument. FALSE for this argument means legend will be placed within plot region. TRUE for this argument will place the legend in figure region. NA for this argument will place the legend in device region. In adjustment argument adj you can use values of length one or two for legend symbols and text adjustment.

Let's see how finally the biplot looks like.
legend(x = "topleft", 
       legend = c("genuine","counterfeit"),
       pch = c(16, 17),
       col = c("darkcyan", "orangered"),
       text.font = 2,
       cex = 1.0,
       pt.cex = 1.0,
       bty = "n",
       x.intersp = 0.5,
       y.intersp = 0.8,
       xpd = FALSE,
       adj = c(0, 0.25))

Please comment below if you have any questions.

Download data file — Click_here

Download Rscript — Click_here


Download R program — Click_here

Download R studio — Click_here


Comments

  1. Very detailed and smart, keep it up with the good work.

    ReplyDelete
  2. Great presentation and explanation. I would say you should keep up the good work. The views might not get to the number you want but i feel with a matter of time your videos will ge t more views because you upload quality content. Consistency is key. Thanks for your time in creating these videos.

    ReplyDelete
    Replies
    1. Dear Emmanuel,

      Thank you for your kind words. Your suggestions really help me to improve my content. Thanks once again.

      Delete

Post a Comment

Popular posts from this blog

Two way repeated measures analysis in R

Split plot analysis in R

Visualizing clustering dendrogram in R | Hierarchical clustering