Biplot of PCs using ggbiplot function

See Video  ☝

Introduction

You will learn how to visualize biplot for principal components using ggbiplot() function in R studio. I shall use the bank note data set used in previous tutorial on principal component analysis.

Suggestion:

Watch this video tutorial to understand data set and principal component analysis — See here

In this tutorial I shall describe how to visualize principal components using ggbiplot() function and to interpret the biplot. It is often desirable to simultaneously visualize scores and loadings of the principal components. This can be accomplished with a biplot. The plot is called a biplot because it contains information on loadings (arrows) and scores (data points or sample identifiers) and not just because it plots two principal components against one another.

Let’s start how to display biplot using ggbiplot() function in R studio. 

I often recommend to first clear all the objects or values in global environment using rm(list = ls(all = TRUE)) before importing the data set. You can also clear the plots using graphics.off() and clear everything in console using shell() function.

rm(list = ls(all = TRUE))
graphics.off()
shell("cls")

Import data set

Now let’s import the data set using read.csv() function. I have already saved the data file as CSV (comma delimited file) in the working directory. The file argument specify the file name with extension CSV.

In header argument you can set a logical value that will indicate whether the data file contains first variable names as first row. In my data set the file contains the variable names in the first row, so I shall use TRUE for this argument. The head() function will print the first six rows of the data set.

data = read.csv(file = "biplot.csv",
                header = TRUE)
head(data)
#      type Length  Left Right Bottom  Top Diagonal
# 1 genuine  214.4 130.1 130.3    9.7 11.7    139.8
# 2 genuine  214.9 130.5 130.2   11.0 11.5    139.5
# 3 genuine  214.9 130.3 130.1    8.7 11.7    140.2
# 4 genuine  215.0 130.4 130.6    9.9 10.9    140.3
# 5 genuine  214.7 130.2 130.3   11.8 10.9    139.7
# 6 genuine  215.0 130.2 130.2   10.6 10.7    139.9

Principal component analysis

Now let’s carry out the principal component analysis. Load the stats package by using require() function. I shall use prcomp() function for principal component analysis. The argument x specify the data set. In X argument you need to specify the variables to be used for principal components. Minus one in square brackets [-1] shows that the first categorical variable will be excluded from data set. TRUE for center argument will shift the variables to be zero centered. I shall use FALSE for scale argument. As scaling is preferably done when you have variable measurements in different scales. In this data set all the variables were measured in same units so I shall use non scaled data for this analysis.

Print function for pc object will display the information of standard deviation and loadings. Use summary() function to print variance related measures of components as shown by the importance component below.

require(stats)
pc <- prcomp(x = data[-1],
                  center = TRUE, 
                  scale. = FALSE)
print(pc)
summary(pc)
# Standard deviations (1, .., p=6):
# [1] 1.7321388 0.9672748 0.4933697 0.4412015 0.2919107 0.1884534
# 
# Rotation (n x k) = (6 x 6):
#                  PC1         PC2        PC3        PC4         PC5         PC6
# Length    0.04377427  0.01070966 -0.3263165 -0.5616918  0.75257278 -0.09809807
# Left     -0.11216159  0.07144697 -0.2589614 -0.4554588 -0.34680082  0.76651197
# Right    -0.13919062  0.06628208 -0.3447327 -0.4153296 -0.53465173 -0.63169678
# Bottom   -0.76830499 -0.56307225 -0.2180222  0.1861082  0.09996771  0.02221711
# Top      -0.20176610  0.65928988 -0.5566857  0.4506985  0.10190229  0.03485874
# Diagonal  0.57890193 -0.48854255 -0.5917628  0.2584483 -0.08445895  0.04567946
# Importance of components:
#                           PC1    PC2     PC3     PC4     PC5    PC6
# Standard deviation     1.7321 0.9673 0.49337 0.44120 0.29191 0.1885
# Proportion of Variance 0.6675 0.2082 0.05416 0.04331 0.01896 0.0079
# Cumulative Proportion  0.6675 0.8757 0.92983 0.97314 0.99210 1.0000

Scree plot of eigenvalues

Simply using plot() for pc object will result in bargraph of variances. Screeplot can be displayed using screeplot() function.

plot(pc)
screeplot(pc, type = "line", main = "Scree plot")

Visualizing biplot

Install ggbiplot package

To visualize biplot using ggbiplot() first you need to install the package ggbiplot. You will also need to load the library devtools before installing ggbiplot package.

library(devtools)
install_github("vqv/ggbiplot")

Plot PCA using ggbiplot()

After installation is completed load the ggbiplot package using require() function. The results from prcomp() function showed that first component explains 67% of the variability in the data set. Second component explains 21% of total variability in the data set. This indicates the first two components 88% of the total variation in the data set.

Now let’s display a biplot for the first two principal components using ggbiplot() function for pc object created earlier. The scales for the loadings are shown at the bottom and left axis for PC1 and PC2 respectively.

require(ggbiplot)
ggbiplot(pc)

Custmoizing biplot

In ggbiplot() function we can use different arguments to enhance its visualization. In pcobj argument use pc object created earlier. In choices you can specify the two principal components to be used. I am using same scale factor to apply to obs.scale and var.scale. You can specify your own labels for data points in plot by using labels argument. I shall use rownames as labels for data points. In plot you can see the first cluster of hundred values represent the genuine bank notes. The second cluster from one hundred to two hundred values represent the counterfeit bank notes.

Let’s resize the variable names using varname.size argument. You can shorten the length of variable names by using varname.abbrev argument. TRUE for this argument will abbreviate the variable names. Typing FALSE for variable axis will remove the arrows or variable vectors. As I want to show the arrows so I shall keep this argument as TRUE in the final output.

To draw a correlation circle specify TRUE for circle argument. This argument should only be used when scale argument is specified as TRUE in prcomp() function. To draw a normal ellipse for each group set TRUE for ellipse argument. Also specify the group variable in groups argument. Use print() function to plot this object.

biplot = ggbiplot(pcobj = pc,
                  choices = c(1,2),
                  obs.scale = 1, var.scale = 1,  # Scaling of axis
                  labels = row.names(data),     # Add labels as rownames
                  labels.size = 4,
                  varname.size = 5,
                  varname.abbrev = TRUE,  # Abbreviate variable names (TRUE)
                  var.axes = FALSE,      # Remove variable vectors (TRUE)
                  circle = FALSE,        # Add unit variance circle (TRUE)
                  ellipse = TRUE, groups = data$type) # Adding ellipses
print(biplot)

Add main and legend title

Add title to this plot by using labs() function. In title argument type the main title of the plot. For colour argument type the title of the legend to be used. Print() function for this object will display main title and legend title on the plot.

biplot2 = biplot + labs(title = "PCA of yield contributing parameters",
                        colour = "Seed priming")

Change legend position

Legend is located at right side of the plot. You can change the position of legend by using theme() function. Set bottom for legend.position argument to display the legend in the bottom area of the plot.

biplot3 = biplot2 + theme(legend.position = "bottom")

Apply themes

You can specify different themes to the plot from one of the below mentined themes.

themes (classic, dark, gray, light light, linedraw, minimal, void, bw)

You can choose any of the given themes. Below you can see how the plots look like after applying themes.

theme1 = biplot3 + theme_bw()
theme2 = biplot3 + theme_classic()
theme3 = biplot3 + theme_dark()
theme4 = biplot3 + theme_gray()
theme5 = biplot3 + theme_minimal()
theme6 = biplot3 + theme_void()

Interpretting biplot

Now, I shall briefly explain the biplot. The graph shows that objects are preferably displayed as points. The variables are displayed as vectors or arrows. Data points or sample identifiers represent the scores. The scores are in deviation from their average for each of the variables. The origin represents the average value, both for each variable and for each object across all variables.

This average has a value of zero in the centered data matrix. The arrows contains the information on loadings or it represent the variable vectors. The length of the variable vectors indicate how well the variables are represented by the graph — with a perfect fit if all vectors have equal lengths.

The length of arrows in the plot are directly proportional to the variability included in the two components. For example a very short arrow in this biplot represent the length variable. It indicates the first two components contains almost no information about this element.

The angle between any two arrows represent the correlation between those variables. If the angle between two variable vectors is zero then it shows both variables or collinear. It means both variable vectors lie on same line in same direction. If the angle between two variable vectors is 90 degrees then it shows both variable vectors are orthogonal. It means both variable show lack of correlation.

The smaller the acute angle the stronger the positive correlation will be between the two variables. 

If the angle between two variable vectors is 180 degrees then it shows both variables are collinear. It means both variable vectors lie on same line but in opposite direction. The variables having greater obtuse angle are negatively correlated. Let’s have a look at correlation matrix of the numerical variables.

Correlation of variables

The acute angle between left and right variable is smaller so the correlation between these two variables is strongly positive (74%). The obtuse angle between bottom and diagonal variable indicates slightly strong negative correlation (-62%) between these two variables.

cor(data[, 2:7])
#               Length       Left      Right     Bottom         Top   Diagonal
# Length    1.00000000  0.2312926  0.1517628 -0.1898009 -0.06132141  0.1943015
# Left      0.23129257  1.0000000  0.7432628  0.4137810  0.36234960 -0.5032290
# Right     0.15176280  0.7432628  1.0000000  0.4867577  0.40067021 -0.5164755
# Bottom   -0.18980092  0.4137810  0.4867577  1.0000000  0.14185134 -0.6229827
# Top      -0.06132141  0.3623496  0.4006702  0.1418513  1.00000000 -0.5940446
# Diagonal  0.19430146 -0.5032290 -0.5164755 -0.6229827 -0.59404464  1.0000000

Variances for most contributing variables

Now let’s have a look at the length of the variable vectors. As most of the variation is explained by the bottom, diagonal and top variables due to greater length of variable vectors. To compute variance for these variables first attach the database to access variables by simply giving their names. We can compute the variance for bottom, diagonal and top variables using variance function. The higher variation in bottom variable is indicated by the greater length of bottom variable vector. The bottom and diagonal variables are the most important contributors to the first PC. Second PC is best explained by the top variable and sum of bottom and diagonal variables.

attach(data)

var(Bottom)
var(Diagonal)
var(Top)
# [1] 2.086878
# [1] 1.327716
# [1] 0.6447234
Please comment below if you have any questions.


Download data file — Click_here

Download Rscript — Click_here


Download R program — Click_here

Download R studio — Click_here


Comments

  1. Thank you for these very useful explanations. I would like to know if there is a way to modify Xlab and Ylab. That is to say I want to change "PC1 (66.8 % explained var.)" by just "PC1 (66.8%)"?

    ReplyDelete
    Replies
    1. Dear Abel,

      I am thankful for your valuable comment. You can simply type below command after plus sign to add X and Y axis labels:

      + xlab ("PC1 (49.1%)") + ylab ("PC2 (21.3%)")

      If still have issue please let me know

      Delete
  2. Many Many thanks for your effort in providing this important tutorial. It is really great to get such an online support, your videos are fantastic

    ReplyDelete
  3. Thank you for nice explanations, my question when I used my own data , grouping by genotypes not sucesse how can we group and display now names in the plot

    ReplyDelete
  4. Thank you so much for this! I was just wondering if the ellipses are 95% confidence ellipses and if not how you could add these to the plot? thanks

    ReplyDelete

Post a Comment

Popular posts from this blog

Two way repeated measures analysis in R

Split plot analysis in R

Visualizing clustering dendrogram in R | Hierarchical clustering