Visualizing Scatterplots in R | Correlation

Introduction

In the previous tutorial, we learned how to perform a correlation test and how to obtain a matrix of correlation values using different packages. In this lecture, you will learn how to visualize the association between two variables and how to draw a matrix of scatterplots.

Import data set

As a first step, I always recommend clearing the objects and values in the global environment with the rm() function. The list argument of this function takes a character vector naming the objects to be removed; here it is supplied by ls(). Setting all = TRUE in ls() (matched to its all.names argument) also lists objects whose names begin with a dot, so everything you created earlier is removed. The graphics.off() function closes all open graphics devices. The shell("cls") call sends the cls command to the Windows command shell and is meant to clear the console; note that shell() is available only on Windows.

rm(list = ls(all = TRUE))
graphics.off()
shell("cls")
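To see what all = TRUE changes, here is a minimal sketch that repeats the cleanup inside a throwaway environment instead of the global one, so nothing in your session is touched. The environment scratch and the object names in it are made up for illustration.

```r
# A scratch environment standing in for the global environment (hypothetical)
scratch <- new.env()
assign("yield", 3.46, envir = scratch)      # ordinary object
assign(".settings", TRUE, envir = scratch)  # hidden object: name starts with a dot

visible_only <- ls(scratch)                    # dot-names are skipped by default
everything   <- ls(scratch, all.names = TRUE)  # dot-names are included
print(visible_only)
print(everything)

# Remove every object in the scratch environment, hidden ones included
rm(list = ls(scratch, all.names = TRUE), envir = scratch)
remaining <- ls(scratch, all.names = TRUE)
print(length(remaining))
```

Without all.names = TRUE, the hidden object would survive the cleanup.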

Now let’s import the data set. The data set used in this example is the same as the one used in the previous lecture on correlation analysis. For details on the structure of the data set, please visit this link.

I have already saved the data file in CSV format. To import a CSV file we can use the read.csv() function. Its file argument requires the path to the data file. You can copy the path from the file explorer and paste it, in quotation marks, as the value of the file argument. Set the header argument to TRUE to indicate that the first row of the data contains the variable names.

Note that file paths in R are written with forward slashes (/). A single backslash (\), as used in Microsoft Windows paths, does not work inside an R string because it starts an escape sequence; either replace each backslash with a forward slash (/) or double it (\\).

data <- read.csv(file = "E:/Youtube Channel/Agron Info Tech/Lectures/Data Analysis/34 Correlation analysis in R/data_corr.csv", 
                 header = TRUE)
head(data, 3)
#   rep method priming em.head head.mat plant.height total.tillers th.kernal.wt
# 1   1   conv      NP      88       35          113           445          9.9
# 2   1   conv      HP      92       27          126           498         12.0
# 3   1   conv      OP      83       27          118           535         15.7
#   kernal.yield straw.yield harvest.index
# 1         3.46        12.9          26.8
# 2         3.58        13.3          26.9
# 3         3.61        13.9          26.0
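The path rules above can be checked with a small sketch. The C:/projects/demo location is hypothetical, and the temporary CSV only holds two columns retyped from the rows printed above, so the example runs without the original data file.

```r
# Three ways to write the same Windows-style path (hypothetical location)
p_forward <- "C:/projects/demo/data_corr.csv"
p_escaped <- "C:\\projects\\demo\\data_corr.csv"
p_built   <- file.path("C:", "projects", "demo", "data_corr.csv")

# After converting backslashes, all three spell the same path
same_path <- c(gsub("\\", "/", p_escaped, fixed = TRUE) == p_forward,
               p_built == p_forward)
print(same_path)

# Round trip through a temporary CSV so the example runs anywhere
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(total.tillers = c(445, 498, 535),
                     kernal.yield  = c(3.46, 3.58, 3.61)),
          file = tmp, row.names = FALSE)

if (file.exists(tmp)) {   # check the path before reading
  demo <- read.csv(file = tmp, header = TRUE)
  print(demo)
}
```

Checking file.exists() first gives a clearer signal than the error read.csv() raises when the path is mistyped.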

To inspect how the data were imported, you can apply the fix() function to the data object. This function opens the data in an editor window and also allows you to edit the data set.

fix(data)

Data editor window for fix(data) command

Attach the data set by using the attach() function. This adds the data frame to R's search path and gives direct access to its variables, so you can refer to them simply by typing their names. Be aware that attached variables can mask, or be masked by, other objects with the same name.

attach(data)
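If you prefer not to change the search path, with() or the $ operator give the same access for a single expression. The sketch below is only an alternative to attach(), not part of the original workflow; it uses the built-in mtcars data set as a stand-in for the tutorial's data.

```r
# mtcars ships with R and stands in for the tutorial's data here
r_dollar <- cor(mtcars$mpg, mtcars$wt)   # explicit data$column access
r_with   <- with(mtcars, cor(mpg, wt))   # columns are visible only inside with()
print(c(r_dollar, r_with))

# Unlike attach(), with() leaves the search path unchanged
print("mtcars" %in% search())
```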

Now let’s visualize a scatterplot between two variables to see how one variable is related to the other.

Visualizing Scatterplots

Scatterplot between two variables

One nice way to create a scatterplot is with the ggpubr package, whose name stands for ggplot2-based publication ready plots. This package creates beautiful ggplot2-based graphs. If you have already installed it, just load it with the library() function. Otherwise, install it first with the install.packages(“ggpubr”) command, typing the package name in quotation marks inside the parentheses.

# install.packages("ggpubr")      # run once if the package is not installed
# install.packages("backports")   # only needed if loading fails on a missing backports
library(ggpubr)
# Loading required package: ggplot2

While loading this package after installation, I ran into an error indicating that there is no package called backports. If you get the same error, you can fix it by installing the backports package.

Error message while installing ggpubr package
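One way to catch a missing dependency before library() fails is to check which packages are installed. The helper missing_packages() below is a hypothetical name introduced for this sketch; it only reports what is missing, and the install line is left commented out.

```r
# Hypothetical helper: return the packages in 'pkgs' that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("ggpubr", "backports"))
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)   # uncomment to install them
}
```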

After loading the package, you can create a scatterplot with the ggscatter() function. The first argument, data, specifies a data frame; since we saved the data set in an object named data, we pass that object here. The x and y arguments take the names of the variables to be placed on the X and Y axes. Applying the colnames() function to the data object prints the name of every column, which makes it easy to type the variable names correctly. Suppose I am interested in the association between the total number of tillers and kernel yield, so I use these variable names in the x and y arguments. The xlab and ylab arguments take the X and Y axis labels as character strings. Running this command creates a scatterplot.

ggscatter(
          data = data,
          x = "total.tillers", 
          y = "kernal.yield",
          xlab = "Number of tillers",
          ylab = "Kernel yield in rice"
          )
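Because x and y are given as quoted names, a misspelling only shows up when the plot is drawn. The sketch below checks the names first. The demo data frame is a hand-typed subset of the three rows printed by head(data, 3), not the full data set.

```r
# Three rows from the head(data, 3) output, retyped by hand (subset of columns)
demo <- data.frame(
  priming       = c("NP", "HP", "OP"),
  total.tillers = c(445, 498, 535),
  kernal.yield  = c(3.46, 3.58, 3.61)
)
print(colnames(demo))

# Check the spelling of the names before passing them to x and y
wanted <- c("total.tillers", "kernal.yield")
print(wanted %in% colnames(demo))

# The column is spelled "kernal.yield" in the data, so "kernel.yield" is not found
print("kernel.yield" %in% colnames(demo))
```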

A scatterplot alone provides very little information about the relationship. We can add a trend line, or line of best fit, to the points to make the graph more informative. This is done with the add argument.

The value “reg.line” for this argument adds a linear regression line. The value “loess” adds a locally weighted smoothing line, that is, a local regression fit that follows the trend in the data.

ggscatter(
          data = data,
          x = "total.tillers", 
          y = "kernal.yield",
          xlab = "Number of tillers",
          ylab = "Kernel yield in rice",
          add = "reg.line"  # or "loess" for local regression fitting
          )
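To make the difference between the two options concrete, here is a base R sketch of the two kinds of fit. It is not ggpubr's internal code; it assumes the built-in cars data set (speed and dist) as a stand-in for the tutorial's data.

```r
# cars (speed, dist) ships with R and stands in for the tutorial's data
fit_line  <- lm(dist ~ speed, data = cars)     # straight line of best fit
fit_loess <- loess(dist ~ speed, data = cars)  # local regression (smooth curve)

print(coef(fit_line))                          # intercept and slope of the line

# Predictions from both fits at the same speeds
new_x <- data.frame(speed = c(5, 15, 25))
pred <- cbind(new_x,
              line  = predict(fit_line,  newdata = new_x),
              loess = predict(fit_loess, newdata = new_x))
print(pred)
```

The straight line is summarized by two coefficients, while the loess curve has no single slope and can bend where the data do.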

To add a confidence interval around the regression line we can use the conf.int argument, which takes a logical value. Setting it to TRUE adds the confidence interval as a shaded region along the regression line. The default confidence level is 95%. Researchers often choose 90%, 95% or 99% confidence levels, and you can set this value, as a proportion such as 0.95, in the conf.int.level argument.

We can further customize the regression line with the add.params argument. With this argument you can specify the color and size of the regression line, its line type, and the fill color of the shaded confidence region.

ggscatter(
          data = data,
          x = "total.tillers", 
          y = "kernal.yield",
          xlab = "Number of tillers",
          ylab = "Kernel yield in rice",
          add = "reg.line",
          conf.int = TRUE,
          conf.int.level = 0.95,
          add.params = list(color = "black",
                            size = 1,
                            linetype = 1,
                            fill = "lightgray"))
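The effect of the confidence level can be seen numerically with a base R sketch. It again assumes the built-in cars data set in place of the tutorial's data and uses predict() on a linear model, which is a way to compute such a band by hand rather than ggpubr's own code.

```r
fit   <- lm(dist ~ speed, data = cars)
new_x <- data.frame(speed = c(10, 15, 20))

# Confidence bands for the fitted line at two different levels
band_90 <- predict(fit, newdata = new_x, interval = "confidence", level = 0.90)
band_99 <- predict(fit, newdata = new_x, interval = "confidence", level = 0.99)
print(band_90)   # columns: fit (the line), lwr and upr (band limits)
print(band_99)

# Width of the band at each speed: a higher level gives a wider band
width_90 <- band_90[, "upr"] - band_90[, "lwr"]
width_99 <- band_99[, "upr"] - band_99[, "lwr"]
print(width_99 > width_90)
```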

Now let’s add the correlation coefficient and its probability (p) value to the plot. You can do this by setting cor.coef to TRUE. Also specify the correlation method in the cor.method argument; I am using “pearson”, which is the most commonly used method. By default the two values are printed on a single line. We can print them on two separate lines by setting the newline character “\n” as the value of label.sep inside cor.coeff.args. You can also show a grouping variable, often called a factor variable, through the color argument: type the name of the grouping variable as a character string. The palette argument then lets you choose a color for each level of the grouping variable. This creates an elegant scatterplot.

ggscatter(
          data = data,
          x = "total.tillers", 
          y = "kernal.yield",
          xlab = "Number of tillers",
          ylab = "Kernel yield in rice",
          add = "reg.line",
          conf.int = TRUE,
          conf.int.level = 0.95,
          add.params = list(color = "black",
                            size = 1,
                            linetype = 1,
                            fill = "lightgray"),
          cor.coef = TRUE, cor.method = "pearson", # Correlation coefficient and p-value
          cor.coeff.args = list(label.sep = "\n"), # Print them on separate lines
          color = "priming",                       # Color points by grouping variable
          palette = c("#00AFBB", "#E7B800", "#FC4E07")) # Custom color palette
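The two numbers printed on the plot can also be computed directly. The sketch below uses base R's cor.test() on the built-in cars data set as a stand-in for the tutorial's data; the label format is only an approximation of what appears on the plot.

```r
# Pearson correlation test between two numeric variables
res <- cor.test(cars$speed, cars$dist, method = "pearson")
r <- unname(res$estimate)   # correlation coefficient
p <- res$p.value            # probability (p) value

# Label with a newline between the two values, like label.sep = "\n"
cat("R = ", round(r, 2), "\n", "p = ", signif(p, 2), "\n", sep = "")

# A rank-based alternative to Pearson's coefficient
rho <- cor(cars$speed, cars$dist, method = "spearman")
print(rho)
```

Pearson's coefficient measures linear association, while Spearman's is based on ranks and is less sensitive to outliers.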

Further, you can apply different ggplot2 themes through the ggtheme argument to give the plot a different look.

ggscatter(
          data = data,
          x = "total.tillers", 
          y = "kernal.yield",
          xlab = "Number of tillers",
          ylab = "Kernel yield in rice",
          add = "reg.line",
          conf.int = TRUE,
          conf.int.level = 0.95,
          add.params = list(color = "black",
                            size = 1,
                            linetype = 1,
                            fill = "lightgray"),
          cor.coef = TRUE, cor.method = "pearson", # Correlation coefficient and p-value
          cor.coeff.args = list(label.sep = "\n"), # Print them on separate lines
          color = "priming",                       # Color points by grouping variable
          palette = c("#00AFBB", "#E7B800", "#FC4E07"), # Custom color palette
          ggtheme = theme_bw()                     # Customize themes
          )

A matrix of scatterplots

Let’s create a matrix of scatterplots for multiple variables. A simple way to do this is with the pairs() function. The x argument of this function specifies the data as numeric columns of a data frame or matrix. Suppose I am interested in scatterplots for the yield contributing variables. I can select these variables by giving the range of their column numbers, which is columns 7 to 10 in this data set. In the main argument you can specify the title of the plot. You can change the point symbol with the pch argument, and the fill color of the points with the bg argument. If you only want to show the lower triangle, set the upper.panel argument to NULL.

pairs(
      x = data[7:10],               ### Scatterplots Matrices ----
      main = "A matrix of scatterplots",
      pch = 21, 
      bg = c("red", "green3", "blue"),
      upper.panel = NULL            ### show only lower triangle ----
      )
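Selecting columns by position, as in data[7:10], breaks silently if the column order changes. The sketch below shows the equivalent selection by name. The demo data frame is retyped from the three rows printed by head(data, 3) earlier, so it only illustrates the column layout, not the full data.

```r
# The three rows shown by head(data, 3), retyped by hand
demo <- data.frame(
  rep = c(1, 1, 1), method = "conv", priming = c("NP", "HP", "OP"),
  em.head = c(88, 92, 83), head.mat = c(35, 27, 27),
  plant.height = c(113, 126, 118), total.tillers = c(445, 498, 535),
  th.kernal.wt = c(9.9, 12.0, 15.7), kernal.yield = c(3.46, 3.58, 3.61),
  straw.yield = c(12.9, 13.3, 13.9), harvest.index = c(26.8, 26.9, 26.0)
)

by_position <- demo[7:10]     # what the pairs() call selects
print(names(by_position))

# The same selection by name keeps working if columns are reordered
yield_vars <- c("total.tillers", "th.kernal.wt", "kernal.yield", "straw.yield")
by_name <- demo[yield_vars]
print(identical(by_position, by_name))
```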

Let’s visualize such a matrix of scatterplots in a more elegant style with additional information.

ggplot2 generalized pairs plot

We shall use the ggplot2 and GGally packages for more stylish scatterplot matrices. Load both packages with the library() function.

library(GGally)
library(ggplot2)

We can use the ggpairs() function to create a matrix of scatterplots. The data argument specifies the data set, which can contain both numerical and categorical variables; I am using the data object. In the columns argument we specify the range of variables, and I am using the same columns as in the previous command. The mapping argument sets the aesthetic mapping, where color identifies the grouping variable; seed priming is again used as the grouping variable. In the upper triangle I want to show correlation values, so in the upper argument I set continuous to “cor”. In the lower triangle I want to show scatterplots, so in the lower argument I set continuous to “points”. Finally, setting continuous to “densityDiag” in the diag argument draws density plots on the diagonal of the matrix.

When you run this command, you get correlation values with significance in the upper triangle, scatterplots in the lower triangle, and density plots on the diagonal.

library(GGally)

ggpairs(
        data = data,                ### A quick matrix
        columns = 7:10,
        mapping = ggplot2::aes(color=priming),   ### With grouping variable
        upper = list(continuous = "cor"), 
        lower = list(continuous = "points"), 
        diag = list(continuous = "densityDiag")
        ) 
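The numbers in the upper triangle are ordinary correlation coefficients, which can be reproduced in base R. The sketch below assumes the built-in iris data set as a stand-in, with Species playing the role of the priming grouping variable. It shows the kind of overall and per-group values involved, not the exact labels GGally prints.

```r
# Overall correlation matrix for the numeric columns
num_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
overall  <- cor(iris[num_cols])
print(round(overall, 2))

# Correlation of one pair of variables within each level of the grouping factor
by_group <- sapply(split(iris, iris$Species),
                   function(d) cor(d$Petal.Length, d$Petal.Width))
print(round(by_group, 2))
```

Comparing the overall value with the per-group values shows whether an association holds within each group or mostly reflects differences between groups.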
