Path analysis in R | SEM | Lavaan

Introduction

Today we shall learn a multivariate quantitative approach called structural equation modeling (SEM) that explains the relationships between variables of interest. This methodology assists the researcher in testing or validating a theoretical model, both for hypothesis testing and for extending existing theory. SEM is a nearly 100-year-old statistical method that has progressed over three generations. The first generation of SEM developed the logic of causal modeling using path analysis.

Path analysis is an earlier, simpler form of SEM and is very powerful for testing and developing structural hypotheses with both direct and indirect causal effects. It is popular in many fields, such as social science, business, medical and health science, and natural science.

In this lecture we are going to quantify relationships among variables and visualize them in a path diagram.

Data set

The data set consists of two factor variables (water and priming), one variable (rep) representing the replications of each experimental unit, and nine observed (response) variables. The first three observed variables (aba, apx and pod) reflect the phytohormone and enzyme activity of the plants. The variable ph represents plant height. The variables til, pl, grp and tgw are considered the yield-contributing variables, and gy is the grain yield.
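If you do not have the original Excel file, a toy data frame with the same structure can be sketched so the rest of the code can be followed. The variable names match the data set described above; the values (and the third priming level) are made up for illustration:

```r
# Hypothetical stand-in for the Excel file: same columns, made-up values
set.seed(1)
n <- 18  # 2 water levels x 3 priming levels x 3 replications

data <- data.frame(
  rep     = rep(1:3, times = 6),
  water   = rep(c("FLOOD", "DROUGHT"), each = 9),
  priming = rep(rep(c("NP", "HP", "OP"), each = 3), times = 2),  # "OP" is hypothetical
  aba = runif(n, 4, 8),     # phytohormone / enzyme activity variables
  apx = runif(n, 0.7, 1),
  pod = runif(n, 3, 7),
  ph  = runif(n, 80, 95),   # plant height
  til = runif(n, 10, 16),   # yield-contributing variables
  pl  = runif(n, 13, 19),
  grp = runif(n, 70, 95),
  tgw = runif(n, 13, 21),
  gy  = runif(n, 19, 31)    # grain yield
)

str(data)  # 18 observations of 12 variables
```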

As a first step, I always recommend clearing the objects and values in the global environment with the rm() function; setting all = TRUE inside ls() removes everything you may have created earlier, including hidden objects. Shut down all open graphics windows with the graphics.off() function. Passing "cls" to the shell() function clears the console (this works on Windows; on other systems you can skip this line or press Ctrl+L instead).

rm(list = ls(all = TRUE))
graphics.off()
shell("cls")

Importing data

To import the data set I have placed the Excel data file in the project working directory. Load the readxl library to access its functions. Create an object data and assign to it the result of the read_excel() function. In the path argument, after the equals sign, press the Tab key inside the quotation marks to list the files in the working directory and choose the Excel data file. In the next argument, col_names, type TRUE to indicate that the first row of the data file contains the variable names or headings.

In the next line we use the head() function to print the first six rows of the data frame. I then convert the first three variables to factors with the as.factor() function. Finally, attach() puts the data frame on the search path so that its variables can be accessed directly by name.

library(readxl)
data = read_excel(
          path = 'data_path.xlsx',
          col_names = TRUE
)

head(data)

data$rep = as.factor(data$rep)
data$water = as.factor(data$water)
data$priming = as.factor(data$priming)

attach(data)
# A tibble: 6 x 12
#     rep water priming   aba   apx   pod    ph   til    pl   grp   tgw    gy
#   <dbl> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1     1 FLOOD NP        4.6  0.9   3.78  90.6  11    13.7  75    17.3  20.9
# 2     2 FLOOD NP        5.7  0.77  4.4   81.6  11.0  16.9  84.5  13.6  24.0
# 3     3 FLOOD NP        4.1  0.81  5.33  86.5  10.4  14.4  70.2  15.6  19.1
# 4     1 FLOOD HP        7.1  0.97  6.73  94.2  15.4  18.4  94.1  20.9  28.8
# 5     2 FLOOD HP        5.6  0.91  6.31  86.9  13.9  18.9  88.8  19.7  25.9
# 6     3 FLOOD HP        6.3  0.87  5.4   82.0  12.7  15.4  81.1  17.4  30.3

Performing SEM

We shall perform SEM in three steps:

  1. Model specification
  2. Model estimation & identification
  3. Path diagram

Model specification

In the first step we specify the model, using relevant theoretical and research-based information to decide which variables to include and how they are related. Model specification involves determining, from prior knowledge, every relationship and parameter in the model that is of interest to the researcher.

I have divided the variables into two categories: the latent variables, and the observed variables or indicators. The main objective is to see how these variables affect the grain yield of the rice crop. I am excluding plant height, since most studies have shown that it has the least effect on crop yield.

In this example I specify the model in the object mod.id. The latent factor EA represents the enzyme activity. It is followed by the =~ operator ("is measured by"), and on the right-hand side are the observed variables; the latent variable EA is thus indicated by these observed variables in the data set.

Similarly, I have specified another latent variable, YC, which represents the yield-contributing factors. This latent variable is indicated by the observed variables on the right-hand side, separated by plus signs. These are the variables that make a significant contribution towards crop yield.

The last line is the regression: the grain yield gy is modeled as a function of enzyme activity and yield contribution.

mod.id = '
EA =~ aba + apx + pod
YC =~ til + pl + grp + tgw
gy ~ EA + YC
'

Model estimation and identification

Model identification addresses the question of whether it is possible to generate unique estimates for the collection of model parameters.

In the next line we create the mod.est object using the sem() function for model estimation. Inside the parentheses, the model argument takes the object containing the model specification, and the data argument takes the object containing the data frame.

In the next line, call summary() on mod.est to get the model information.

library(lavaan)
mod.est = sem(
          model = mod.id,
          data = data
)

summary(mod.est)
# lavaan 0.6-8 ended normally after 85 iterations
# 
#   Estimator                                         ML
#   Optimization method                           NLMINB
#   Number of model parameters                        18
#                                                       
#   Number of observations                            18
#                                                       
# Model Test User Model:
#                                                       
#   Test statistic                                23.193
#   Degrees of freedom                                18
#   P-value (Chi-square)                           0.183
# 
# Parameter Estimates:
# 
#   Standard errors                             Standard
#   Information                                 Expected
#   Information saturated (h1) model          Structured
# 
# Latent Variables:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   EA =~                                               
#     aba               1.000                           
#     apx               0.315    0.033    9.621    0.000
#     pod               1.102    0.117    9.385    0.000
#   YC =~                                               
#     til               1.000                           
#     pl                0.968    0.137    7.058    0.000
#     grp               6.103    0.742    8.230    0.000
#     tgw               1.060    0.151    7.006    0.000
# 
# Regressions:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   gy ~                                                
#     EA                3.644    1.843    1.977    0.048
#     YC                0.082    1.034    0.080    0.937
# 
# Covariances:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   EA ~~                                               
#     YC                6.341    2.359    2.689    0.007
# 
# Variances:
#                    Estimate  Std.Err  z-value  P(>|z|)
#    .aba               0.594    0.217    2.739    0.006
#    .apx               0.012    0.007    1.655    0.098
#    .pod               0.195    0.100    1.948    0.051
#    .til               2.748    0.974    2.822    0.005
#    .pl                1.287    0.492    2.619    0.009
#    .grp               9.552    8.392    1.138    0.255
#    .tgw               1.614    0.612    2.637    0.008
#    .gy               11.480    4.110    2.793    0.005
#     EA                3.767    1.443    2.611    0.009
#     YC               11.584    4.700    2.465    0.014

We can see the chi-square model fit test and its degrees of freedom. The chi-square statistic tests the null hypothesis that the model-implied covariance matrix matches the observed covariance matrix, so a non-significant discrepancy is preferred. For an optimally fitting SEM, the \(\chi^2\) test should therefore have \(p>0.05\).

In the latent variables section, the first observed variable of each factor has its loading fixed at 1, so no test is associated with it, but the remaining variables have significant tests.

In the next section we have the regressions. We can see the regression coefficients for the latent variables and their significance.

You can also see the covariance between enzyme activity and yield contribution. Its p-value shows a significant association between the two.
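Rather than reading these numbers off the printed summary, you can extract them programmatically: lavaan's parameterEstimates() function returns all estimates, standard errors, z-values and p-values as a data frame (a sketch, assuming the fitted mod.est object from above):

```r
library(lavaan)

# All parameter estimates as a data frame
pe <- parameterEstimates(mod.est)

# e.g. keep only the regression paths (operator "~")
pe[pe$op == "~", c("lhs", "op", "rhs", "est", "se", "pvalue")]

# or only the covariances (operator "~~") between the latent variables
pe[pe$op == "~~" & pe$lhs %in% c("EA", "YC") & pe$rhs %in% c("EA", "YC"), ]
```

This is convenient for building publication tables or filtering significant paths without copying numbers by hand.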

If you want additional fit measures in the summary output, set the argument fit.measures to TRUE.

summary(mod.est,
        fit.measures = TRUE)
# lavaan 0.6-8 ended normally after 85 iterations
# 
#   Estimator                                         ML
#   Optimization method                           NLMINB
#   Number of model parameters                        18
#                                                       
#   Number of observations                            18
#                                                       
# Model Test User Model:
#                                                       
#   Test statistic                                23.193
#   Degrees of freedom                                18
#   P-value (Chi-square)                           0.183
# 
# Model Test Baseline Model:
# 
#   Test statistic                               278.306
#   Degrees of freedom                                28
#   P-value                                        0.000
# 
# User Model versus Baseline Model:
# 
#   Comparative Fit Index (CFI)                    0.979
#   Tucker-Lewis Index (TLI)                       0.968
# 
# Loglikelihood and Information Criteria:
# 
#   Loglikelihood user model (H0)               -258.507
#   Loglikelihood unrestricted model (H1)       -246.911
#                                                       
#   Akaike (AIC)                                 553.015
#   Bayesian (BIC)                               569.042
#   Sample-size adjusted Bayesian (BIC)          513.733
# 
# Root Mean Square Error of Approximation:
# 
#   RMSEA                                          0.127
#   90 Percent confidence interval - lower         0.000
#   90 Percent confidence interval - upper         0.259
#   P-value RMSEA <= 0.05                          0.223
# 
# Standardized Root Mean Square Residual:
# 
#   SRMR                                           0.024
# 
# Parameter Estimates:
# 
#   Standard errors                             Standard
#   Information                                 Expected
#   Information saturated (h1) model          Structured
# 
# Latent Variables:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   EA =~                                               
#     aba               1.000                           
#     apx               0.315    0.033    9.621    0.000
#     pod               1.102    0.117    9.385    0.000
#   YC =~                                               
#     til               1.000                           
#     pl                0.968    0.137    7.058    0.000
#     grp               6.103    0.742    8.230    0.000
#     tgw               1.060    0.151    7.006    0.000
# 
# Regressions:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   gy ~                                                
#     EA                3.644    1.843    1.977    0.048
#     YC                0.082    1.034    0.080    0.937
# 
# Covariances:
#                    Estimate  Std.Err  z-value  P(>|z|)
#   EA ~~                                               
#     YC                6.341    2.359    2.689    0.007
# 
# Variances:
#                    Estimate  Std.Err  z-value  P(>|z|)
#    .aba               0.594    0.217    2.739    0.006
#    .apx               0.012    0.007    1.655    0.098
#    .pod               0.195    0.100    1.948    0.051
#    .til               2.748    0.974    2.822    0.005
#    .pl                1.287    0.492    2.619    0.009
#    .grp               9.552    8.392    1.138    0.255
#    .tgw               1.614    0.612    2.637    0.008
#    .gy               11.480    4.110    2.793    0.005
#     EA                3.767    1.443    2.611    0.009
#     YC               11.584    4.700    2.465    0.014

In this extended output we additionally get the root mean square error of approximation (RMSEA) and the standardized root mean square residual (SRMR).

RMSEA is a "badness of fit" index where \(0\) indicates a perfect fit and higher values indicate lack of fit. It is useful for detecting model misspecification and is less sensitive to sample size than the \(\chi^2\) test. SRMR is similar to RMSEA and should be less than \(0.09\) for a good model fit.

The comparative fit index (CFI) represents the proportion of variance accounted for in the covariance matrix, and its value is less affected by sample size than the chi-square test. CFI ranges from \(0\) to \(1\), with higher values indicating a better fit; in practice the CFI should be close to \(0.95\) or higher. Our CFI of \(0.979\) exceeds this threshold, indicating a good model fit.

Next there is the Tucker-Lewis index (TLI), also known as the non-normed fit index (NNFI), which partly overcomes the disadvantages of the NFI and is largely independent of sample size. A TLI of \(>0.90\) is considered acceptable.

We also have values for the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). These are relative measures used for model selection rather than null hypothesis testing. AIC offers a relative estimate of the information lost when the given model is used to represent the data, while BIC additionally rewards parsimony among several candidate models.
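All of these indices can also be pulled out directly with lavaan's fitMeasures() function, which returns a named vector and is handy when comparing several candidate models side by side (a sketch, assuming the fitted mod.est object from above):

```r
library(lavaan)

# Extract selected fit indices from the fitted model as a named vector
fitMeasures(mod.est,
            c("chisq", "df", "pvalue", "cfi", "tli",
              "rmsea", "srmr", "aic", "bic"))
```

Calling fitMeasures(mod.est) without the second argument returns the full set of available indices.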

Path diagram

To draw the path diagram we load the semPlot package and use its semPaths() function. The object argument specifies the fitted SEM. The what argument specifies what the edges should indicate in the path diagram; the value "path" displays unweighted gray edges. The whatLabels argument specifies what the edge labels should indicate; the values "par" or "est" display the parameter estimates as edge labels. Running this command returns a path diagram for the specified model.

library(semPlot)
semPaths(
          object = mod.est,
          what = "path",
          whatLabels = "par"
)

To make this path diagram more appealing, I make some tweaks by adding a few additional arguments. These control the color and size of the values and text, as well as the rotation of the diagram. This is how the final path diagram looks.

semPaths(
          object = mod.est,
          what = "path",
          whatLabels = "par",
          style = "ram",
          layout = "tree",
          rotation = 2,
          sizeMan = 7,
          sizeLat = 7,
          color = "lightgray",
          edge.label.cex = 1.2,
          label.cex = 1.3
)

Interpretation of the path diagram

In this path diagram, observed variables are enclosed in squares or rectangles and latent variables in circles or ellipses. Single-headed arrows (paths) define causal relationships in the model, with the variable at the tail of the arrow influencing the variable at the head. A double-headed arrow connecting two variables shows the covariance between them. A double-headed arrow looping back to the same variable represents that variable's variance; it is often omitted if the variable is standardized to unit variance.
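Because the raw estimates shown on the edges depend on each variable's measurement scale, path diagrams and tables are often reported in standardized form instead. lavaan provides this via standardizedSolution() or the standardized argument of summary() (a sketch, assuming the fitted mod.est object from above):

```r
library(lavaan)

# Standardized loadings, regressions and covariances as a data frame
standardizedSolution(mod.est)

# Or add standardized columns (Std.lv, Std.all) to the usual summary
summary(mod.est, standardized = TRUE)

# semPaths can also label the edges with standardized estimates:
# semPaths(mod.est, what = "std", whatLabels = "std")
```

Standardized coefficients make the relative strength of the EA and YC paths directly comparable.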


Download data file — Click_here

Download Rscript — Click_here


Download R program — Click_here

Download R studio — Click_here

