Correlation analysis in R

Introduction

Every individual element in any population is composed of several quantitative as well as qualitative characters. None of the characters vary in isolation rather these are related with each other. If we study two variables at a time, the association is called bivariate. When we study many number of variables at a time, the association is called multivariate.

The associationship between the variables can be linear or nonlinear. In linear association one can find three different types of relationships.

  • Positive association between the variables
  • Negative association between the variables
  • No association between the variables (fails to present any relationship)

The degree of linear associationship between any two given variables is measured with the help of the correlation coefficient.

For measuring linear relationships, Pearson correlation is the best measure

So let’s start correlation analysis in R studio.

Before importing the data set, I often recommend to first clear all the objects or values in global environment using remove function. Shut down all open graphics devices using graphics off function. Clear everything in console using system command within shell function.

rm(list = ls(all = TRUE))
graphics.off()
shell("cls")

Importing data

Next step is importing the data set. The example used here is comprised of replication, two factor variables and rest of the variables are response variables. Load the package readxl by using library() function. To import the data from excel spreadsheet use read_excel() function. In argument path provide the link of the file. Type TRUE for argument col_names if the file contains first row as variable names. Your data are now available in the R console. They are stored in the object which I have chosen to call data.

Note that file paths are specified using slashes (/). In R, you cannot use backslashes (\), as you would in Microsoft Windows, unless you replace it with forward-slash (/) or double all the backslashes (\\). 
library(readxl)

data = read_excel(path = "E:/Correlation analysis/data_corr.xlsx",
                  col_names = TRUE)

Visualize data

You can visualize them by typing view(). You can also type head() or tail() function to display only the beginning or the end of the data set. These functions cannot be used to edit the data.

View(data)

The instruction fix() opens a small spreadsheet in R, which can be used to visualize and edit the data.

fix(data)
Data editor window for fix(data) command

Data editor window for fix(data) command

Verify the structure of data

To verify the structure of the data, str() function is used.

str(data)
# 'data.frame': 18 obs. of  10 variables:
#  $ rep          : num  1 1 1 1 1 1 2 2 2 2 ...
#  $ method       : chr  "conv" "conv" "conv" "SRI" ...
#  $ priming      : chr  "NP" "HP" "OP" "NP" ...
#  $ em.head      : num  88 92 83 85 86 82 87 89 84 86 ...
#  $ head.mat     : num  35 27 27 41 31 29 37 32 36 39 ...
#  $ plant.height : num  113 126 118 123 120 127 116 128 119 116 ...
#  $ total.tillers: num  445 498 535 497 534 562 450 498 538 452 ...
#  $ th.kernal.wt : num  9.9 12 15.7 13.8 15.3 15.8 14.4 15.1 19.1 16.7 ...
#  $ kernal.yield : num  3.46 3.58 3.61 3.28 3.79 3.73 3.34 3.65 4.03 3.63 ...
#  $ straw.yield  : num  12.9 13.3 13.9 11.2 11.3 12.3 13.5 14.2 14.2 11.7 ...

The structure of the data reveals that the second and third factor variables are being recognized as character variables. To change them to factor variables use as.factor() function for both variables. Using str() function again will show the second and third character variables have been changed to factor variables.

data$method = as.factor(data$method)
data$priming = as.factor(data$priming)
str(data)
# 'data.frame': 18 obs. of  10 variables:
#  $ rep          : num  1 1 1 1 1 1 2 2 2 2 ...
#  $ method       : Factor w/ 2 levels "conv","SRI": 1 1 1 2 2 2 1 1 1 2 ...
#  $ priming      : Factor w/ 3 levels "HP","NP","OP": 2 1 3 2 1 3 2 1 3 2 ...
#  $ em.head      : num  88 92 83 85 86 82 87 89 84 86 ...
#  $ head.mat     : num  35 27 27 41 31 29 37 32 36 39 ...
#  $ plant.height : num  113 126 118 123 120 127 116 128 119 116 ...
#  $ total.tillers: num  445 498 535 497 534 562 450 498 538 452 ...
#  $ th.kernal.wt : num  9.9 12 15.7 13.8 15.3 15.8 14.4 15.1 19.1 16.7 ...
#  $ kernal.yield : num  3.46 3.58 3.61 3.28 3.79 3.73 3.34 3.65 4.03 3.63 ...
#  $ straw.yield  : num  12.9 13.3 13.9 11.2 11.3 12.3 13.5 14.2 14.2 11.7 ...

The function attach() gives direct access to the variables of a data frame by typing the name of a variable as it is written on the first line of the file.

attach(data)

Correlation analysis

Using agricolae package

Now let’s proceed to see the association in response variables of the data set. The correlation analysis can be carried out in several different ways in R. Here you will learn how to find relationship between two variables individually and to compute correlation matrix.

Load agricolae package using the library() function. The correl() function in this package will provide the correlation coefficient of X, Y vectors with the statistical value and its probability. The X and Y arguments specify variable vectors for which the relationship is to be determined. Here I am interested to see the relationship between total number of tillers and kernel yield. In methods argument specify the method to apply. The methods to apply are Pearson, Spearman and Kendall. The alternative argument specify the alternative hypothesis. Use “two-sided”, “less” or “greater” according to the hypothesis. The print() function for this object will show the statistical value, correlation coefficient and probability value.

library(agricolae)
cor.1 = correl(x = total.tillers,
               y = kernal.yield,
               method = "pearson",
               alternative = "two.sided")

print(cor.1)
# $stat
# [1] 4.419862
# 
# $rho
# [1] 0.7414458
# 
# $pvalue
# [1] 0.0004291801

Since the probability value is lower than one percent alpha level for two-sided test, we reject the null hypothesis and accept the alternative hypothesis. There is statistically significant relationship between the total number of tillers and kernel yield. In this example, the total number of tillers was related positively to the kernel yield. The strength of the relationship was indicated by a value close to \(1.0\) and the direction of the relationship by the positive sign.

The second approach to interpreting the Pearson correlation coefficient is to square the sample correlation value.

# R squared value
cor.1$rho^2
var(kernal.yield)
# [1] 0.5497419
# [1] 0.07723301

The rho squared value indicates that \(55\) percent of the variability in kernel yield can be explained by the total number of tillers while \(45\) percent of variability is due to the other variable relationships or unexplained variance.

Using stats package

Another way to get the same results is using cor.test() function from stats package. Load the stats package using library() function. The function cor.test() require same arguments as used in previous analysis. However, an additional argument conf.level where you can specify the value for confidence level. Print function for this object will show the test value, degree of freedom, probability value, confidence interval and correlation coefficient value.

library(stats)
cor.2 = cor.test(x = total.tillers,
                 y = kernal.yield,
                 alternative = "two.sided",
                 method = "pearson",
                 conf.level = 0.95)

print(cor.2)
# 
#   Pearson's product-moment correlation
# 
# data:  total.tillers and kernal.yield
# t = 4.4199, df = 16, p-value = 0.0004292
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  0.4199426 0.8976028
# sample estimates:
#       cor 
# 0.7414458

Getting correlation matrix

Using stats package

Another way is to get a correlation matrix showing correlation coefficients between sets of variables. For this use cor() function to compute a matrix of correlation. The argument x specify a matrix or data frame and in this example the matrix of \(4^{th}\) to \(10^{th}\) variables for which I want to compute the correlation values. The argument use is an optional character string giving a method for computing correlations in the presence of missing values.

require(stats)

cor.3 = cor(x = data[, c(4:10)], y = NULL,
            use = "everything",
            method = "pearson")

print(cor.3)
#                  em.head   head.mat plant.height total.tillers th.kernal.wt
# em.head        1.0000000  0.1545703  -0.40328717 -0.6078961974   -0.7123963
# head.mat       0.1545703  1.0000000  -0.33997627 -0.4923771076   -0.1018873
# plant.height  -0.4032872 -0.3399763   1.00000000  0.5728506083    0.4487721
# total.tillers -0.6078962 -0.4923771   0.57285061  1.0000000000    0.5966086
# th.kernal.wt  -0.7123963 -0.1018873   0.44877214  0.5966086434    1.0000000
# kernal.yield  -0.6865255 -0.4265499   0.46877795  0.7414458464    0.7794977
# straw.yield    0.3121484 -0.2762724  -0.01128121  0.0004545191   -0.1204427
#               kernal.yield   straw.yield
# em.head         -0.6865255  0.3121484037
# head.mat        -0.4265499 -0.2762724365
# plant.height     0.4687779 -0.0112812050
# total.tillers    0.7414458  0.0004545191
# th.kernal.wt     0.7794977 -0.1204426847
# kernal.yield     1.0000000  0.1005602015
# straw.yield      0.1005602  1.0000000000
Using agricolae package

The cor() function just prints the correlation matrix. To print probability value in addition to correlation matrix use correlation() function. Load the library agricolae. The correlation() function require same arguments as discussed earlier. The print() function for this object will show correlation matrix, a matrix of probability values and number of observations.

library(agricolae)

cor.4 = correlation(x = data[, c(4:10)], y = NULL,
                    method = "pearson",
                    alternative = "two.sided")

print(cor.4)
# $correlation
#               em.head head.mat plant.height total.tillers th.kernal.wt
# em.head          1.00     0.15        -0.40         -0.61        -0.71
# head.mat         0.15     1.00        -0.34         -0.49        -0.10
# plant.height    -0.40    -0.34         1.00          0.57         0.45
# total.tillers   -0.61    -0.49         0.57          1.00         0.60
# th.kernal.wt    -0.71    -0.10         0.45          0.60         1.00
# kernal.yield    -0.69    -0.43         0.47          0.74         0.78
# straw.yield      0.31    -0.28        -0.01          0.00        -0.12
#               kernal.yield straw.yield
# em.head              -0.69        0.31
# head.mat             -0.43       -0.28
# plant.height          0.47       -0.01
# total.tillers         0.74        0.00
# th.kernal.wt          0.78       -0.12
# kernal.yield          1.00        0.10
# straw.yield           0.10        1.00
# 
# $pvalue
#                    em.head   head.mat plant.height total.tillers th.kernal.wt
# em.head       1.0000000000 0.54026966   0.09701049  0.0074434513 0.0009087942
# head.mat      0.5402696631 1.00000000   0.16747034  0.0379123294 0.6874722478
# plant.height  0.0970104894 0.16747034   1.00000000  0.0129596892 0.0617512187
# total.tillers 0.0074434513 0.03791233   0.01295969  1.0000000000 0.0089586967
# th.kernal.wt  0.0009087942 0.68747225   0.06175122  0.0089586967 1.0000000000
# kernal.yield  0.0016517901 0.07752274   0.04972248  0.0004291801 0.0001369208
# straw.yield   0.2072882086 0.26710622   0.96456396  0.9985718607 0.6340391279
#               kernal.yield straw.yield
# em.head       0.0016517901   0.2072882
# head.mat      0.0775227402   0.2671062
# plant.height  0.0497224848   0.9645640
# total.tillers 0.0004291801   0.9985719
# th.kernal.wt  0.0001369208   0.6340391
# kernal.yield  1.0000000000   0.6913522
# straw.yield   0.6913521760   1.0000000
# 
# $n.obs
# [1] 18

You can access the components by typing the name of the component with dollar sign just after the object.

cor.4$correlation
cor.4$pvalue
cor.4$n.obs
Using Hmisc package

Another way to get a matrix of correlation with probability values is to use rcorr() function by loading Harrell Miscellaneous package. The rcorr() function computes a matrix of Pearson’s r or Spearman’s rho rank correlation coefficients for all possible pairs of columns of a matrix. Same arguments are required as used earlier. The argument type specify the method to be used for computing correlations. Print function for this object will show the correlation matrix, number of observations used and a matrix of probability values.

require(Hmisc)

cor.5 = rcorr(x = as.matrix(data[, c(4:10)]), y = NULL,
              type ="pearson")

print(cor.5)
#               em.head head.mat plant.height total.tillers th.kernal.wt
# em.head          1.00     0.15        -0.40         -0.61        -0.71
# head.mat         0.15     1.00        -0.34         -0.49        -0.10
# plant.height    -0.40    -0.34         1.00          0.57         0.45
# total.tillers   -0.61    -0.49         0.57          1.00         0.60
# th.kernal.wt    -0.71    -0.10         0.45          0.60         1.00
# kernal.yield    -0.69    -0.43         0.47          0.74         0.78
# straw.yield      0.31    -0.28        -0.01          0.00        -0.12
#               kernal.yield straw.yield
# em.head              -0.69        0.31
# head.mat             -0.43       -0.28
# plant.height          0.47       -0.01
# total.tillers         0.74        0.00
# th.kernal.wt          0.78       -0.12
# kernal.yield          1.00        0.10
# straw.yield           0.10        1.00
# 
# n= 18 
# 
# 
# P
#               em.head head.mat plant.height total.tillers th.kernal.wt
# em.head               0.5403   0.0970       0.0074        0.0009      
# head.mat      0.5403           0.1675       0.0379        0.6875      
# plant.height  0.0970  0.1675                0.0130        0.0618      
# total.tillers 0.0074  0.0379   0.0130                     0.0090      
# th.kernal.wt  0.0009  0.6875   0.0618       0.0090                    
# kernal.yield  0.0017  0.0775   0.0497       0.0004        0.0001      
# straw.yield   0.2073  0.2671   0.9646       0.9986        0.6340      
#               kernal.yield straw.yield
# em.head       0.0017       0.2073     
# head.mat      0.0775       0.2671     
# plant.height  0.0497       0.9646     
# total.tillers 0.0004       0.9986     
# th.kernal.wt  0.0001       0.6340     
# kernal.yield               0.6914     
# straw.yield   0.6914

Properties

  • Correlation coefficient is worked out irrespective of the dependency i.e. \(r_{xy} =r_{yx}\)

  • Correlation coefficient is a unit free measure; as such, this can be used to compare the degree of linear association between any pair of variables measured in different units

  • Correlation coefficient \(r_{xy}\) lies between \(-1\) and \(+1\) i.e. \(-1 ≤ r_{xy} ≤ +1\).

  • Correlation coefficient is independent of change of origin and scale

  • Correlation coefficient between \(X_1\) and \(X_2\) is same as the correlation coefficient between \(X_2\) and \(X_1\)

  • Two independent variables are uncorrelated, but the converse may not be true (Zero correlation coefficient does not always mean that the variables are independent)

Limitations

  • Correlation coefficient can only measure the degree of linear association

  • When more than two variables are operating in a system, then in most of the cases, a simple correlation coefficient fails to measure the actual degree of associationship between the variables

  • Sometimes it is found that two variables (say \(X\) and \(Y\)) are showing high degree of linear associationship, as measured through correlation coefficient, not because they are highly correlated but because of the influence of another variable (say \(Z\)) on both variables under consideration

Please comment below if you have any questions.

Download data file — Click_here

Download Rscript — Download Rscript


Download R program — Click_here

Download R studio — Click_here


Comments

  1. Thank you so much for this excellent tutorial and explanation.

    ReplyDelete

Post a Comment

Popular posts from this blog

Two way repeated measures analysis in R

Split plot analysis in R

Visualizing clustering dendrogram in R | Hierarchical clustering