blavaan
In this tutorial, we demonstrate how to use blavaan (Merkle and Rosseel 2015) for structural equation models (SEMs) and the corresponding model assessment. The newest version of blavaan is equipped with an efficient approach based on Stan (Merkle et al. 2020).
As a measurement model and probably one of the most popular special cases of a SEM, CFA is often used to 1) validate a hypothesized factor structure among multiple variables, 2) estimate the correlation between factors, and 3) obtain factor scores. For example, consider a two-factor (\(\eta_{1j}, \eta_{2j}\)) model with each factor measured by six items (\(y_{1j}, \dots, y_{6j}\)) for person \(j\): \[ \underbrace{\left[\begin{array}{l} y_{1 j} \\ y_{2 j} \\ y_{3 j} \\ y_{4 j} \\ y_{5 j} \\ y_{6 j} \end{array}\right]}_{\boldsymbol{y}_{j}}=\underbrace{\left[\begin{array}{c} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\ \beta_{4} \\ \beta_{5} \\ \beta_{6} \end{array}\right]}_{\boldsymbol{\beta}}+\underbrace{\left[\begin{array}{cc} 1 & 0 \\ \lambda_{21} & 0 \\ \lambda_{31} & 0 \\ 0 & 1 \\ 0 & \lambda_{52} \\ 0 & \lambda_{62} \end{array}\right]}_{\Lambda}\underbrace{\left[\begin{array}{l} \eta_{1 j} \\ \eta_{2 j} \end{array}\right]}_{\boldsymbol{\eta}_j}+\underbrace{\left[\begin{array}{c} \epsilon_{1 j} \\ \epsilon_{2 j} \\ \epsilon_{3 j} \\ \epsilon_{4 j} \\ \epsilon_{5 j} \\ \epsilon_{6 j} \end{array}\right]}_{\boldsymbol{\epsilon}_j} \]
\[ \boldsymbol{\epsilon}_{j} \sim N_{I}(\mathbf{0}, \mathbf{\Theta}) \]
\[ \boldsymbol{\eta}_{j} \sim N_{K}(\mathbf{0}, \boldsymbol{\Psi}), \] where the number of items or variables is \(I = 6\), the number of factors is \(K = 2\) and \(\mathbf{\Theta}\) is often assumed to be a diagonal matrix.
\(y_{ij}\) is the response of person \(j\) \((j=1,...,J)\) on item \(i\) \((i=1,...,I)\).
\(\beta_i\) is the intercept for item \(i\).
\(\eta_{kj}\) is the \(k\)th common factor for person \(j\).
\(\lambda_{ik}\) is the factor loading of item \(i\) on factor \(k\).
\(\epsilon_{ij}\) is the random error term for person \(j\) on item \(i\).
\(\boldsymbol{\Psi}\) is the variance-covariance matrix of the common factors \(\boldsymbol{\eta}_{j}\) .
\(\mathbf{\Theta}\) is the variance-covariance matrix of the residuals (or unique factors) \(\boldsymbol{\epsilon}_{j}\).
Suppose the errors or residuals \(\epsilon_{ij}\) are independent of each other. Then:
\(\psi_{kk}\) is the variance of the \(k\)th factor, \(\psi_{kk'}\) is the covariance between the \(k\)th and \(k'\)th factors, \(\theta_{ii}\) is the variance of the \(i\)th residual, and \(\theta_{ii'}=0\) whenever \(i\neq i'\). Specifically,
\[ \boldsymbol{\Psi}=\mathrm{Cov}\begin{pmatrix} \eta_{1j} \\ \eta_{2j} \end{pmatrix}= \left[\begin{array}{cc} \psi_{1 1}&\psi_{1 2} \\ \psi_{2 1}&\psi_{2 2} \end{array}\right] \]
\[\mathbf{\Theta}=\mathrm{Cov}\begin{pmatrix} \epsilon_{1 j} \\ \epsilon_{2 j} \\ \epsilon_{3 j} \\ \epsilon_{4 j} \\ \epsilon_{5 j} \\ \epsilon_{6 j} \end{pmatrix}= \left[\begin{array}{cc} \theta_{1 1} &0&0&0&0&0\\ 0&\theta_{2 2}&0&0&0&0 \\ 0&0&\theta_{3 3}&0&0&0 \\ 0&0&0&\theta_{4 4}&0&0 \\ 0&0&0&0&\theta_{5 5}&0 \\ 0&0&0&0&0&\theta_{6 6} \end{array}\right] \]
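To make the matrix form concrete, we can compute the model-implied covariance matrix \(\boldsymbol{\Sigma} = \Lambda \boldsymbol{\Psi} \Lambda^\top + \mathbf{\Theta}\) in R. This is a quick sketch; the parameter values plugged in here are the ones used in the simulation below.

```r
# Model-implied covariance: Sigma = Lambda Psi Lambda' + Theta
Lambda <- cbind(c(1, 1.5, 2, 0, 0, 0), c(0, 0, 0, 1, 1.5, 2))
Psi    <- matrix(c(1, 0.5, 0.5, 0.8), nrow = 2)
Theta  <- diag(0.3, nrow = 6)
Sigma  <- Lambda %*% Psi %*% t(Lambda) + Theta
round(Sigma, 2)
```

The sample covariance matrix of the simulated data should be close to this \(\boldsymbol{\Sigma}\).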
To get started, we load the following R packages:
knitr::opts_chunk$set(echo = TRUE, comment = "", cache = TRUE)
library(rstan)
library(knitr)
library(blavaan)
library(lavaan)
library(MASS)
library(mvtnorm)
library(tidyverse)
library(semPlot)
library(magrittr)
library(Matrix)
options(mc.cores = parallel::detectCores())
To better illustrate the use of blavaan, we simulate data so that we know the data-generating parameters. In our simulation, we set \(\beta_i = 0\), \(\psi_{11} = 1\), \(\psi_{12} = \psi_{21} = .5\), \(\psi_{22} = .8\), \(\lambda_{21} = 1.5\), \(\lambda_{31} = 2\), \(\lambda_{52} = 1.5\), \(\lambda_{62} = 2\), and \(\theta_{ii} = .3\). We simulate data from the above model for \(J = 1000\) units.
# setup
J <- 1000
I <- 6
K <- 2
psi <- matrix(c(1, 0.5,
0.5, 0.8), nrow = K)
# intercepts: beta_i = 0, so no intercept term is added below
# loading matrix
Lambda <- cbind(c(1, 1.5, 2, 0, 0, 0), c(0, 0, 0, 1, 1.5, 2))
# error covariance
Theta <- diag(0.3, nrow = I)
# factor scores
eta <- mvrnorm(J, mu = c(0, 0), Sigma = psi)
# error term
epsilon <- mvrnorm(J, mu = rep(0, ncol(Theta)),Sigma = Theta)
dat <- tcrossprod(eta, Lambda) + epsilon
dat_cfa <- dat %>% as.data.frame() %>% setNames(c("Y1", "Y2", "Y3", "Y4", "Y5", "Y6"))
We define the model for lavaan as follows:
lavaan_cfa <- 'eta1 =~ Y1 + Y2 + Y3
eta2 =~ Y4 + Y5 + Y6'
Two latent variables eta1 and eta2 are specified to be measured by three items each, denoted as eta1 =~ Y1 + Y2 + Y3, and similarly for eta2. By not specifying other parts of the model, we assume by default that the error terms of the items are uncorrelated with each other, while the covariance between the latent variables is free to be estimated.
We represent the CFA model in a path diagram and then fit the model by maximum likelihood estimation using the cfa function in the lavaan package. By convention, latent variables \(\eta_1\) and \(\eta_2\) are represented by circles, and observed variables \(Y_{1}\) to \(Y_{6}\) by rectangles. Straight arrows represent linear relations (here with coefficients given by the factor loadings \(\lambda\)), and double-headed arrows represent variances and covariances. We could make the diagram by simply using the function call semPaths(semPlotModel_lavaanModel(lavaan_cfa)). Below is the more complex syntax to display Greek letters, subscripts, etc.
FIT <-
semPlotModel_lavaanModel(
lavaan_cfa,
auto.var = TRUE,
auto.fix.first = TRUE,
auto.cov.lv.x = TRUE
)
semPaths(
FIT,
what = "paths",
whatLabels = "par",
nodeLabels = c(
expression(paste(Y[1])),
expression(paste(Y[2])),
expression(paste(Y[3])),
expression(paste(Y[4])),
expression(paste(Y[5])),
expression(paste(Y[6])),
expression(paste(eta[1])),
expression(paste(eta[2]))
),
edge.label.cex = 0.8,
edgeLabels = c(
expression(paste(lambda[1])),
expression(paste(lambda[2])),
expression(paste(lambda[3])),
expression(paste(lambda[4])),
expression(paste(lambda[5])),
expression(paste(lambda[6])),
expression(paste("Covariance")),
expression(paste(epsilon[1])),
expression(paste(epsilon[2])),
expression(paste(epsilon[3])),
expression(paste(epsilon[4])),
expression(paste(epsilon[5])),
expression(paste(epsilon[6])),
expression(paste(psi[1])),
expression(paste(psi[2]))
)
)
# lavaan
lav_cfa_fit <- cfa(lavaan_cfa, data = dat_cfa, meanstructure = TRUE)
summary(lav_cfa_fit, fit.measures = TRUE)
lavaan 0.6.15 ended normally after 29 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 19
Number of observations 1000
Model Test User Model:
Test statistic 1.553
Degrees of freedom 8
P-value (Chi-square) 0.992
Model Test Baseline Model:
Test statistic 6005.574
Degrees of freedom 15
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.002
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -7909.932
Loglikelihood unrestricted model (H1) -7909.155
Akaike (AIC) 15857.863
Bayesian (BIC) 15951.111
Sample-size adjusted Bayesian (SABIC) 15890.766
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 1.000
P-value H_0: RMSEA >= 0.080 0.000
Standardized Root Mean Square Residual:
SRMR 0.002
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|)
eta1 =~
Y1 1.000
Y2 1.448 0.032 45.487 0.000
Y3 1.964 0.042 46.845 0.000
eta2 =~
Y4 1.000
Y5 1.513 0.036 41.878 0.000
Y6 1.969 0.045 43.298 0.000
Covariances:
Estimate Std.Err z-value P(>|z|)
eta1 ~~
eta2 0.523 0.037 14.100 0.000
Intercepts:
Estimate Std.Err z-value P(>|z|)
.Y1 -0.016 0.037 -0.435 0.664
.Y2 -0.010 0.049 -0.194 0.846
.Y3 -0.004 0.066 -0.055 0.956
.Y4 -0.086 0.033 -2.581 0.010
.Y5 -0.078 0.047 -1.662 0.097
.Y6 -0.110 0.059 -1.852 0.064
eta1 0.000
eta2 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.Y1 0.328 0.017 19.053 0.000
.Y2 0.271 0.022 12.417 0.000
.Y3 0.351 0.037 9.537 0.000
.Y4 0.296 0.016 18.809 0.000
.Y5 0.297 0.023 12.952 0.000
.Y6 0.340 0.035 9.711 0.000
eta1 1.025 0.059 17.262 0.000
eta2 0.822 0.049 16.788 0.000
By default, lavaan uses maximum likelihood estimation. Estimates of the loadings \(\lambda_{ik}\) are given under “Latent Variables”; the estimated covariance between the common factors is given under “Covariances”; estimates of the intercepts \(\beta_i\) of the measurement models, as well as the means of the common factors (set to 0, not estimated), are given under “Intercepts”; and estimates of the residual variances of the responses (variances \(\theta_{ii}\) of \(\epsilon_{ij}\)) and of the common factors (variances \(\psi_{kk}\) of \(\eta_{kj}\)) are given under “Variances.”
blavaan borrows its syntax from lavaan and is a wrapper for Bayesian estimation using Stan. The Stan object created by blavaan can be seen using blavInspect(). By default, the bcfa function in blavaan uses an anchoring item for each factor for identification, so the factor loading of the first item that loads on a factor is fixed at 1. blavaan also includes intercepts for all observed variables by default, which is different from the default in lavaan (we specified meanstructure = TRUE in the cfa function call above so that the two parameterizations match).
# blavaan
blav_cfa_fit <- bcfa(lavaan_cfa, data = dat_cfa, mcmcfile = TRUE)
Computing posterior predictives...
summary(blav_cfa_fit)
blavaan (0.3-15) results of 1000 samples after 500 adapt/burnin iterations
Number of observations 1000
Number of missing patterns 1
Statistic MargLogLik PPP
Value -8007.416 0.744
Latent Variables:
Estimate Post.SD pi.lower pi.upper Rhat Prior
eta1 =~
Y1 1.000 NA
Y2 1.449 0.031 1.39 1.512 0.999 normal(0,10)
Y3 1.967 0.042 1.887 2.05 1.000 normal(0,10)
eta2 =~
Y4 1.000 NA
Y5 1.516 0.036 1.446 1.587 1.000 normal(0,10)
Y6 1.973 0.044 1.888 2.062 1.000 normal(0,10)
Covariances:
Estimate Post.SD pi.lower pi.upper Rhat Prior
eta1 ~~
eta2 0.522 0.037 0.453 0.598 1.000 lkj_corr(1)
Intercepts:
Estimate Post.SD pi.lower pi.upper Rhat Prior
.Y1 -0.015 0.038 -0.09 0.06 1.000 normal(0,32)
.Y2 -0.009 0.051 -0.108 0.088 1.000 normal(0,32)
.Y3 -0.002 0.067 -0.137 0.13 1.000 normal(0,32)
.Y4 -0.086 0.034 -0.154 -0.022 1.000 normal(0,32)
.Y5 -0.077 0.047 -0.17 0.014 1.001 normal(0,32)
.Y6 -0.109 0.061 -0.23 0.009 1.001 normal(0,32)
eta1 0.000 NA
eta2 0.000 NA
Variances:
Estimate Post.SD pi.lower pi.upper Rhat Prior
.Y1 0.330 0.017 0.297 0.365 1.000 gamma(1,.5)[sd]
.Y2 0.272 0.022 0.23 0.318 1.000 gamma(1,.5)[sd]
.Y3 0.351 0.039 0.279 0.431 1.000 gamma(1,.5)[sd]
.Y4 0.298 0.016 0.269 0.333 1.000 gamma(1,.5)[sd]
.Y5 0.299 0.023 0.255 0.345 1.000 gamma(1,.5)[sd]
.Y6 0.342 0.036 0.272 0.412 1.000 gamma(1,.5)[sd]
eta1 1.027 0.059 0.919 1.151 1.000 gamma(1,.5)[sd]
eta2 0.823 0.049 0.729 0.921 0.999 gamma(1,.5)[sd]
fitmeasures(blav_cfa_fit)
npar logl ppp bic dic p_dic waic
19.000 -7909.981 0.744 15951.190 15858.125 19.081 15858.403
p_waic se_waic looic p_loo se_loo margloglik
19.272 111.599 15858.456 19.299 111.600 -8007.416
We can see that the output from blavaan resembles that from lavaan, although it gives the posterior means and standard deviations of the model parameters. We present the maximum likelihood estimates from lavaan and the Bayesian estimates from blavaan next to each other for comparison. Since by default blavaan uses non-informative priors and the data set is large, the two sets of estimates are very similar:
bind_cols(parameterEstimates(lav_cfa_fit)[, 1:4], parameterEstimates(blav_cfa_fit)[, 4]) %>% rename(ML = est, Bayes = ...5) %>% knitr::kable()
New names:
• `` -> `...5`
lhs | op | rhs | ML | Bayes |
---|---|---|---|---|
eta1 | =~ | Y1 | 1.0000000 | 1.0000000 |
eta1 | =~ | Y2 | 1.4479658 | 1.4494496 |
eta1 | =~ | Y3 | 1.9641409 | 1.9666819 |
eta2 | =~ | Y4 | 1.0000000 | 1.0000000 |
eta2 | =~ | Y5 | 1.5126562 | 1.5157573 |
eta2 | =~ | Y6 | 1.9690845 | 1.9727406 |
Y1 | ~~ | Y1 | 0.3276724 | 0.3296146 |
Y2 | ~~ | Y2 | 0.2709558 | 0.2724962 |
Y3 | ~~ | Y3 | 0.3507402 | 0.3512957 |
Y4 | ~~ | Y4 | 0.2962317 | 0.2984597 |
Y5 | ~~ | Y5 | 0.2972492 | 0.2991214 |
Y6 | ~~ | Y6 | 0.3398565 | 0.3415060 |
eta1 | ~~ | eta1 | 1.0248042 | 1.0265839 |
eta2 | ~~ | eta2 | 0.8222773 | 0.8226690 |
eta1 | ~~ | eta2 | 0.5225991 | 0.5223931 |
Y1 | ~1 | | -0.0159813 | -0.0152329 |
Y2 | ~1 | | -0.0095513 | -0.0086645 |
Y3 | ~1 | | -0.0036208 | -0.0020447 |
Y4 | ~1 | | -0.0863222 | -0.0856052 |
Y5 | ~1 | | -0.0775607 | -0.0767313 |
Y6 | ~1 | | -0.1099804 | -0.1088010 |
eta1 | ~1 | | 0.0000000 | 0.0000000 |
eta2 | ~1 | | 0.0000000 | 0.0000000 |
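As a quick numerical summary of this agreement, we can compute the largest absolute difference between the two sets of point estimates. This is a sketch that assumes both fit objects from above are in the workspace.

```r
# Largest absolute difference between ML and Bayesian point estimates
ml    <- parameterEstimates(lav_cfa_fit)$est
bayes <- parameterEstimates(blav_cfa_fit)$est
max(abs(ml - bayes))
```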
To check the settings for the underlying Stan program (i.e., the number of chains, the number of warm-up iterations, and the number of post-warm-up draws):
blavInspect(blav_cfa_fit, "mcobj")
Inference for Stan model: stanmarg.
3 chains, each with iter=1500; warmup=500; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=3000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
ly_sign[1] 1.45 0.00 0.03 1.39 1.43 1.45 1.47 1.51
ly_sign[2] 1.97 0.00 0.04 1.89 1.94 1.97 2.00 2.05
ly_sign[3] 1.52 0.00 0.04 1.45 1.49 1.52 1.54 1.59
ly_sign[4] 1.97 0.00 0.04 1.89 1.94 1.97 2.00 2.06
Theta_var[1] 0.33 0.00 0.02 0.30 0.32 0.33 0.34 0.36
Theta_var[2] 0.27 0.00 0.02 0.23 0.26 0.27 0.29 0.32
Theta_var[3] 0.35 0.00 0.04 0.28 0.32 0.35 0.38 0.43
Theta_var[4] 0.30 0.00 0.02 0.27 0.29 0.30 0.31 0.33
Theta_var[5] 0.30 0.00 0.02 0.26 0.28 0.30 0.31 0.35
Theta_var[6] 0.34 0.00 0.04 0.27 0.32 0.34 0.37 0.41
Psi_cov[1] 0.52 0.00 0.04 0.45 0.50 0.52 0.55 0.60
Psi_var[1] 1.03 0.00 0.06 0.92 0.99 1.02 1.06 1.15
Psi_var[2] 0.82 0.00 0.05 0.73 0.79 0.82 0.86 0.92
Nu_free[1] -0.02 0.00 0.04 -0.09 -0.04 -0.02 0.01 0.06
Nu_free[2] -0.01 0.00 0.05 -0.11 -0.04 -0.01 0.03 0.09
Nu_free[3] 0.00 0.00 0.07 -0.14 -0.05 0.00 0.04 0.13
Nu_free[4] -0.09 0.00 0.03 -0.15 -0.11 -0.09 -0.06 -0.02
Nu_free[5] -0.08 0.00 0.05 -0.17 -0.11 -0.08 -0.04 0.01
Nu_free[6] -0.11 0.00 0.06 -0.23 -0.15 -0.11 -0.07 0.01
lp__ -7971.62 0.09 3.19 -7978.79 -7973.49 -7971.28 -7969.34 -7966.40
n_eff Rhat
ly_sign[1] 3120 1
ly_sign[2] 2922 1
ly_sign[3] 3137 1
ly_sign[4] 3137 1
Theta_var[1] 4755 1
Theta_var[2] 3581 1
Theta_var[3] 3796 1
Theta_var[4] 4847 1
Theta_var[5] 4220 1
Theta_var[6] 4103 1
Psi_cov[1] 3418 1
Psi_var[1] 3003 1
Psi_var[2] 3284 1
Nu_free[1] 1591 1
Nu_free[2] 1514 1
Nu_free[3] 1529 1
Nu_free[4] 1769 1
Nu_free[5] 1597 1
Nu_free[6] 1564 1
lp__ 1219 1
Samples were drawn using NUTS(diag_e) at Thu Apr 20 18:43:35 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
We see that there were 3 chains with 1500 iterations each, the first 500 of which were warmup, giving a total of 3000 post-warmup draws. In the Stan output above, se_mean is the Monte Carlo error that arises because we use posterior draws to empirically estimate posterior expectations. These Monte Carlo errors are given by \(\frac{sd}{\sqrt{n_{eff}}}\), where sd here refers to the posterior standard deviation, which can be viewed as a Bayesian counterpart of frequentist standard errors. Because there is autocorrelation among the Monte Carlo draws within chains, the “effective” sample size, n_eff, used to estimate the Monte Carlo errors is smaller than the number of post-warmup draws. The effective sample size can be viewed as the number of independent draws that would give the same amount of information in terms of the Monte Carlo errors. The Monte Carlo errors go to zero as the number of draws goes to infinity. Rhat quantifies how well the different chains mix with each other (overlap) post warmup, with a value close to 1 (conventionally, less than 1.1) indicating adequate mixing. The purpose of running multiple chains with different starting values for the parameters is to assess whether the chains have reached the stationary distribution, i.e., the required correct posterior distribution, by checking that the chains are converging to the same distribution and are hence mixing with each other.
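We can verify the relation \(se\_mean = sd/\sqrt{n_{eff}}\) directly from the underlying stanfit object. This is a sketch assuming blav_cfa_fit from above is available.

```r
# Extract the stanfit object and its summary table
mcobj <- blavInspect(blav_cfa_fit, "mcobj")
post  <- rstan::summary(mcobj)$summary
# Monte Carlo standard error recomputed from sd and n_eff;
# this should match the se_mean column printed above
head(round(post[, "sd"] / sqrt(post[, "n_eff"]), 3))
```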
One benefit of running a Bayesian analysis is that we can incorporate prior information if we have it. blavaan allows users to specify priors for all the model parameters via the argument dp and uses non-informative priors (with flat densities/large variances) if this argument is left blank. When non-informative priors are used for all parameters, the Bayesian estimates will be similar to their maximum likelihood counterparts. To check the default priors:
(default_prior <- dpriors())
nu alpha lambda beta
"normal(0,32)" "normal(0,10)" "normal(0,10)" "normal(0,10)"
theta psi rho ibpsi
"gamma(1,.5)[sd]" "gamma(1,.5)[sd]" "beta(1,1)" "wishart(3,iden)"
tau delta
"normal(0,10^.5)" "gamma(1,.5)[sd]"
To change a prior, we can simply substitute the desired one. For example, if we would like to change the prior for beta to normal(0, 1), we can form a new prior vector:
(new_prior <- dpriors(beta = "normal(0, 1)"))
nu alpha lambda beta
"normal(0,32)" "normal(0,10)" "normal(0,10)" "normal(0, 1)"
theta psi rho ibpsi
"gamma(1,.5)[sd]" "gamma(1,.5)[sd]" "beta(1,1)" "wishart(3,iden)"
tau delta
"normal(0,10^.5)" "gamma(1,.5)[sd]"
# new cfa with updated prior for beta
# bcfa(lavaan_cfa, data=dat_cfa, dp = new_prior)
For further information on how to change the priors for individual parameters, see https://faculty.missouri.edu/~merklee/blavaan/prior.html.
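blavaan also supports parameter-specific priors directly in the model syntax via the prior() modifier (see the linked page for details). A sketch, with an assumed normal(1.5, 0.5) prior on the loading of Y2:

```r
# Sketch: attach a prior to a single parameter in the model syntax
# (the prior value here is an illustrative assumption)
lavaan_cfa_prior <- 'eta1 =~ Y1 + prior("normal(1.5, 0.5)")*Y2 + Y3
                     eta2 =~ Y4 + Y5 + Y6'
# blav_cfa_fit2 <- bcfa(lavaan_cfa_prior, data = dat_cfa)
```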
As a popular method for modeling growth, in this section we discuss another special case of SEM: latent growth-curve models. This model is useful for describing change over time within individuals as well as variability among individual trajectories. A latent growth-curve model for six occasions can be written as:
\[ \left[\begin{array}{l} y_{1 j} \\ y_{2 j} \\ y_{3 j} \\ y_{4 j} \\ y_{5 j} \\ y_{6 j} \end{array}\right]=\left[\begin{array}{ll} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \end{array}\right]\left[\begin{array}{l} \eta_{1 j} \\ \eta_{2 j} \end{array}\right]+\left[\begin{array}{l} \epsilon_{1 j} \\ \epsilon_{2 j} \\ \epsilon_{3 j} \\ \epsilon_{4 j} \\ \epsilon_{5 j} \\ \epsilon_{6 j} \end{array}\right]=\left[\begin{array}{l} \eta_{1 j}+0 \eta_{2 j}+\epsilon_{1 j} \\ \eta_{1 j}+1 \eta_{2 j}+\epsilon_{2 j} \\ \eta_{1 j}+2 \eta_{2 j}+\epsilon_{3 j} \\ \eta_{1 j}+3 \eta_{2 j}+\epsilon_{4 j} \\ \eta_{1 j}+4 \eta_{2 j}+\epsilon_{5 j} \\ \eta_{1 j}+5 \eta_{2 j}+\epsilon_{6 j} \end{array}\right] \]
\[ \boldsymbol{\epsilon}_{j} \sim N_{p}(\mathbf{0}, \mathbf{\Theta}) \]
where
\(y_{ij}\) is the response at occasion \(i\) for individual \(j\).
\(\eta_{1j}\) is an individual-specific intercept and \(\eta_{2j}\) is an individual-specific slope of time. These intercepts and slopes are latent variables that follow a bivariate normal distribution with free means and an unstructured covariance matrix.
\(\epsilon_{ij}\) is an occasion-specific error. The errors \(\epsilon_{1j}\) to \(\epsilon_{6j}\) have zero means and a diagonal \(6 \times 6\) covariance matrix:
\[\mathbf{\Theta}=\mathrm{Cov}\begin{pmatrix} \epsilon_{1 j} \\ \epsilon_{2 j} \\ \epsilon_{3 j} \\ \epsilon_{4 j} \\ \epsilon_{5 j} \\ \epsilon_{6 j} \end{pmatrix}= \left[\begin{array}{cc} \theta_{1 1} &0&0&0&0&0\\ 0&\theta_{2 2}&0&0&0&0 \\ 0&0&\theta_{3 3}&0&0&0 \\ 0&0&0&\theta_{4 4}&0&0 \\ 0&0&0&0&\theta_{5 5}&0 \\ 0&0&0&0&0&\theta_{6 6} \end{array}\right] \] The pre-fixed loadings (0, 1, 2, 3, 4, 5) for the individual-specific slope represent the times associated with the occasions, and this kind of model works only if the time variable takes on the same set of values for all individuals. Here the occasions are additionally assumed to be equally spaced.
We simulated data for 500 individuals based on the model above as follows:
N <- 500
psi <- matrix(c(1, 0.5,
0.5, 1), nrow = 2)
# loading matrix
Lambda <- matrix(c(1, 1, 1, 1, 1, 1, 0:5), nrow = 6)
# error covariance
Theta <- diag(0.3, nrow = 6)
# exogenous latent variables
eta <- mvrnorm(N, mu = c(2, 1), Sigma = psi)
# error term
epsilon <- mvrnorm(N, mu = rep(0, ncol(Theta)), Sigma = Theta)
dat <- tcrossprod(eta, Lambda) + epsilon
dat <- dat %>% as.matrix() %>% as.data.frame() %>% setNames(paste0("Y", 1:6))
To estimate the parameters of our model (visualized in the path diagram below) using lavaan, we define two latent variables: ri, the random intercept, and rc, the random slope/coefficient. The syntax 1*Y1 means that we are fixing the factor loading of Y1 to 1. We also constrain the residual variances (labeled c) of all the observed variables to be the same via Y1 ~~ c*Y1, which means Y1 has a residual variance c to be estimated (and the same label is used for the other response variables).
lavaan_lgm <- 'ri =~ 1*Y1 + 1*Y2 + 1*Y3 + 1*Y4 + 1*Y5 + 1*Y6
rc =~ 0*Y1 + 1*Y2 + 2*Y3 + 3*Y4 + 4*Y5 + 5*Y6
Y1 ~~ c*Y1
Y2 ~~ c*Y2
Y3 ~~ c*Y3
Y4 ~~ c*Y4
Y5 ~~ c*Y5
Y6 ~~ c*Y6
ri ~~ rc
'
We can make a path diagram like this:
semPaths(semPlotModel_lavaanModel(lavaan_lgm))
To display Greek letters and subscripts, we can use the following syntax:
FIT2 <-
semPlotModel_lavaanModel(
lavaan_lgm,
auto.var = TRUE,
auto.fix.first = TRUE,
auto.cov.lv.x = TRUE
)
semPaths(
FIT2,
what = "paths",
whatLabels = "par",
nodeLabels = c(
expression(paste(Y[1])),
expression(paste(Y[2])),
expression(paste(Y[3])),
expression(paste(Y[4])),
expression(paste(Y[5])),
expression(paste(Y[6])),
expression(paste(eta[1])),
expression(paste(eta[2]))
),
edge.label.cex = 0.6,
edgeLabels = c(
expression(paste(1)),
expression(paste(1)),
expression(paste(1)),
expression(paste(1)),
expression(paste(1)),
expression(paste(1)),
expression(paste(0)),
expression(paste(1)),
expression(paste(2)),
expression(paste(3)),
expression(paste(4)),
expression(paste(5)),
expression(paste(epsilon[1])),
expression(paste(epsilon[2])),
expression(paste(epsilon[3])),
expression(paste(epsilon[4])),
expression(paste(epsilon[5])),
expression(paste(epsilon[6])),
expression(paste("Covariance")),
expression(paste(psi[1])),
expression(paste(psi[2]))
)
)
Following the above model specification, we fit a latent growth-curve model using the growth function in lavaan. By default, growth estimates the latent means, as required, rather than the intercepts of the observed variables:
# lavaan
lav_lgm_fit <- growth(lavaan_lgm, data = dat)
summary(lav_lgm_fit)
lavaan 0.6.15 ended normally after 26 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 11
Number of equality constraints 5
Number of observations 500
Model Test User Model:
Test statistic 30.656
Degrees of freedom 21
P-value (Chi-square) 0.080
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|)
ri =~
Y1 1.000
Y2 1.000
Y3 1.000
Y4 1.000
Y5 1.000
Y6 1.000
rc =~
Y1 0.000
Y2 1.000
Y3 2.000
Y4 3.000
Y5 4.000
Y6 5.000
Covariances:
Estimate Std.Err z-value P(>|z|)
ri ~~
rc 0.484 0.051 9.506 0.000
Intercepts:
Estimate Std.Err z-value P(>|z|)
.Y1 0.000
.Y2 0.000
.Y3 0.000
.Y4 0.000
.Y5 0.000
.Y6 0.000
ri 2.038 0.048 42.629 0.000
rc 1.058 0.044 24.113 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.Y1 (c) 0.300 0.010 31.623 0.000
.Y2 (c) 0.300 0.010 31.623 0.000
.Y3 (c) 0.300 0.010 31.623 0.000
.Y4 (c) 0.300 0.010 31.623 0.000
.Y5 (c) 0.300 0.010 31.623 0.000
.Y6 (c) 0.300 0.010 31.623 0.000
ri 0.986 0.072 13.602 0.000
rc 0.945 0.061 15.529 0.000
In this model, we are mainly interested in the means and variances of the random intercept and slope. We can use the bgrowth function in the blavaan package to fit the same model:
# blavaan
blav_lgm_fit <- bgrowth(lavaan_lgm, data = dat)
Computing posterior predictives...
summary(blav_lgm_fit)
blavaan (0.3-15) results of 1000 samples after 500 adapt/burnin iterations
Number of observations 500
Number of missing patterns 1
Statistic MargLogLik PPP
Value -4216.911 0.116
Latent Variables:
Estimate Post.SD pi.lower pi.upper Rhat Prior
ri =~
Y1 1.000 NA
Y2 1.000 NA
Y3 1.000 NA
Y4 1.000 NA
Y5 1.000 NA
Y6 1.000 NA
rc =~
Y1 0.000 NA
Y2 1.000 NA
Y3 2.000 NA
Y4 3.000 NA
Y5 4.000 NA
Y6 5.000 NA
Covariances:
Estimate Post.SD pi.lower pi.upper Rhat Prior
ri ~~
rc 0.485 0.051 0.393 0.59 1.000 lkj_corr(1)
Intercepts:
Estimate Post.SD pi.lower pi.upper Rhat Prior
.Y1 0.000 NA
.Y2 0.000 NA
.Y3 0.000 NA
.Y4 0.000 NA
.Y5 0.000 NA
.Y6 0.000 NA
ri 2.038 0.048 1.946 2.13 1.000 normal(0,10)
rc 1.058 0.044 0.97 1.146 0.999 normal(0,10)
Variances:
Estimate Post.SD pi.lower pi.upper Rhat Prior
.Y1 (c) 0.301 0.010 0.283 0.32 0.999 gamma(1,.5)[sd]
.Y2 (c) 0.301 0.010 0.283 0.32 0.999
.Y3 (c) 0.301 0.010 0.283 0.32 0.999
.Y4 (c) 0.301 0.010 0.283 0.32 0.999
.Y5 (c) 0.301 0.010 0.283 0.32 0.999
.Y6 (c) 0.301 0.010 0.283 0.32 0.999
ri 0.992 0.072 0.857 1.143 1.001 gamma(1,.5)[sd]
rc 0.952 0.061 0.839 1.079 1.001 gamma(1,.5)[sd]
For latent growth-curve models, we can also view the responses at the different time points as a univariate outcome (long format), instead of as multivariate (wide format). In this multilevel perspective, responses at the different time points (level 1) are viewed as nested within persons (level 2). The growth-curve model can then be specified as a two-level linear mixed model (or hierarchical linear model, HLM) and estimated using the lmer function in the lme4 package. To do this, we first reshape our data from wide format to long format:
# wide to long
dat_long <-
dat %>% add_column(id = 1:500) %>% gather(key = "time", value = "y",-id) %>% mutate(time = dplyr::recode(
time,
Y1 = 0,
Y2 = 1,
Y3 = 2,
Y4 = 3,
Y5 = 4,
Y6 = 5
))
dat_long %>% head %>% kable()
id | time | y |
---|---|---|
1 | 0 | 2.9577313 |
2 | 0 | 3.2700705 |
3 | 0 | 2.6845096 |
4 | 0 | 1.4890722 |
5 | 0 | 0.5877364 |
6 | 0 | 2.6755304 |
For the random part of the model, (1 + time)|id indicates that the model includes a random intercept and a random slope of time for clusters (persons in the current example), whose identifier is in the variable id. For a detailed tutorial on Bayesian multilevel regression and multilevel regression in general, see https://mc-stan.org/users/documentation/case-studies/tutorial_rstanarm.html.
# mlm framework
lav_lgm_fit_hlm <- lme4::lmer(y ~ 1 + time + (1 + time|id), data = dat_long, REML = FALSE)
summary(lav_lgm_fit_hlm)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: y ~ 1 + time + (1 + time | id)
Data: dat_long
AIC BIC logLik deviance df.resid
8398.1 8434.1 -4193.0 8386.1 2994
Scaled residuals:
Min 1Q Median 3Q Max
-3.3512 -0.5844 0.0021 0.5514 2.9637
Random effects:
Groups Name Variance Std.Dev. Corr
id (Intercept) 0.9856 0.9928
time 0.9452 0.9722 0.50
Residual 0.3005 0.5481
Number of obs: 3000, groups: id, 500
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.03815 0.04781 42.63
time 1.05787 0.04387 24.11
Correlation of Fixed Effects:
(Intr)
time 0.420
We present the frequentist SEM estimates from lavaan, the frequentist HLM estimates from lme4, and the Bayesian estimates from blavaan. As for the CFA, the Bayesian and frequentist estimates from the different functions are similar:
hlm_beta <- lav_lgm_fit_hlm@beta
hlm_var_RE <-
lme4::VarCorr(lav_lgm_fit_hlm) %>% as.data.frame(., comp = c("Variance"), order = "lower.tri")
hlm_estimates <-
c(rep(NA, 12),
rep(hlm_var_RE[4, 4], 6),
hlm_var_RE[c(2, 1, 3), 4],
rep(NA, 6),
hlm_beta)
bind_cols(
parameterEstimates(lav_lgm_fit)[, 1:5],
hlm_estimates,
parameterEstimates(blav_lgm_fit)[, 5]
) %>% rename(ML_SEM = est,
ML_HLM = ...6,
Bayes = ...7) %>% knitr::kable()
New names:
• `` -> `...6`
• `` -> `...7`
lhs | op | rhs | label | ML_SEM | ML_HLM | Bayes |
---|---|---|---|---|---|---|
ri | =~ | Y1 | | 1.0000000 | NA | 1.0000000 |
ri | =~ | Y2 | | 1.0000000 | NA | 1.0000000 |
ri | =~ | Y3 | | 1.0000000 | NA | 1.0000000 |
ri | =~ | Y4 | | 1.0000000 | NA | 1.0000000 |
ri | =~ | Y5 | | 1.0000000 | NA | 1.0000000 |
ri | =~ | Y6 | | 1.0000000 | NA | 1.0000000 |
rc | =~ | Y1 | | 0.0000000 | NA | 0.0000000 |
rc | =~ | Y2 | | 1.0000000 | NA | 1.0000000 |
rc | =~ | Y3 | | 2.0000000 | NA | 2.0000000 |
rc | =~ | Y4 | | 3.0000000 | NA | 3.0000000 |
rc | =~ | Y5 | | 4.0000000 | NA | 4.0000000 |
rc | =~ | Y6 | | 5.0000000 | NA | 5.0000000 |
Y1 | ~~ | Y1 | c | 0.3004538 | 0.3004539 | 0.3008866 |
Y2 | ~~ | Y2 | c | 0.3004538 | 0.3004539 | 0.3008866 |
Y3 | ~~ | Y3 | c | 0.3004538 | 0.3004539 | 0.3008866 |
Y4 | ~~ | Y4 | c | 0.3004538 | 0.3004539 | 0.3008866 |
Y5 | ~~ | Y5 | c | 0.3004538 | 0.3004539 | 0.3008866 |
Y6 | ~~ | Y6 | c | 0.3004538 | 0.3004539 | 0.3008866 |
ri | ~~ | rc | | 0.4838138 | 0.4838131 | 0.4853817 |
ri | ~~ | ri | | 0.9855904 | 0.9855833 | 0.9921259 |
rc | ~~ | rc | | 0.9451540 | 0.9451574 | 0.9524373 |
Y1 | ~1 | | | 0.0000000 | NA | 0.0000000 |
Y2 | ~1 | | | 0.0000000 | NA | 0.0000000 |
Y3 | ~1 | | | 0.0000000 | NA | 0.0000000 |
Y4 | ~1 | | | 0.0000000 | NA | 0.0000000 |
Y5 | ~1 | | | 0.0000000 | NA | 0.0000000 |
Y6 | ~1 | | | 0.0000000 | NA | 0.0000000 |
ri | ~1 | | | 2.0381506 | 2.0381506 | 2.0380207 |
rc | ~1 | | | 1.0578727 | 1.0578727 | 1.0576254 |
SEM is useful for testing hypothesized covariance structures among variables based on substantive theories, by modeling the relations among multiple latent (unobserved) variables or constructs that are measured by manifest (observed) variables or indicators, and by evaluating the corresponding model fit. It can be thought of as a combination of regression analysis and factor analysis. SEM consists of two parts: a so-called measurement model, which describes relations between observed indicators and latent variables, and a so-called structural model, which specifies a set of simultaneous linear relations among latent variables and possibly observed variables. SEM involves two main types of relations: 1) correlational relations between latent variables, represented by two-headed arrows in the path diagram; and 2) cause-and-effect type relations, represented by single-headed arrows. For example, a cause-and-effect type SEM can be written by starting with the measurement model for the exogenous latent variables:
\[ \left[\begin{array}{l} x_{1 j} \\ x_{2 j} \\ x_{3 j} \\ z_{4 j} \\ z_{5 j} \\ z_{6 j} \end{array}\right]=\left[\begin{array}{c} \beta_{1} \\ \beta_{2} \\ \beta_{3} \\ \beta_{4} \\ \beta_{5} \\ \beta_{6} \end{array}\right]+\left[\begin{array}{cc} 1 & 0 \\ \lambda_{21} & 0 \\ \lambda_{31} & 0 \\ 0 & 1 \\ 0 & \lambda_{52} \\ 0 & \lambda_{62} \end{array}\right]\left[\begin{array}{l} \eta_{1 j} \\ \eta_{2 j} \end{array}\right]+\left[\begin{array}{c} \epsilon_{1 j} \\ \epsilon_{2 j} \\ \epsilon_{3 j} \\ \epsilon_{4 j} \\ \epsilon_{5 j} \\ \epsilon_{6 j} \end{array}\right] \]
where
\(x_{i j}\) represents an indicator of \(\eta_{1 j}\) and \(z_{i j}\) represents an indicator of \(\eta_{2 j}\).
\(\eta_{1 j}\) and \(\eta_{2 j}\) represent exogenous latent variables in the model with zero means and unstructured covariance matrix.
\(\epsilon_{i j}\) represents errors of measurement for \(x_{i j}\) or \(z_{i j}\).
\(\beta_i\) represents an intercept for \(x_{ij}\) or \(z_{ij}\).
\(\lambda_{ik}\) represents a factor loading of the measurement model, that is, the magnitude of the expected change in an observed variable for a one-unit change in the latent variable.
We assume that the errors of measurement \(\epsilon_{i j}\) have an expected value of zero, that they are uncorrelated with all latent variables and uncorrelated with each other for all pairs of items.
Next, we have a measurement model for a latent endogenous variable: \[ \left[\begin{array}{l} y_{1 j} \\ y_{2 j} \\ y_{3 j} \end{array}\right]=\left[\begin{array}{c} \gamma_{1} \\ \gamma_{2} \\ \gamma_{3} \end{array}\right]+\left[\begin{array}{c} 1 \\ \lambda_{8} \\ \lambda_{9} \end{array}\right]\left[\begin{array}{l} \xi_{1 j} \end{array}\right]+\left[\begin{array}{c} \delta_{1 j} \\ \delta_{2 j} \\ \delta_{3 j} \end{array}\right] \]
where
\(y_{i j}\) represents an indicator of \(\xi_{1 j}\).
\(\xi_{1 j}\) represents a latent endogenous variable with zero mean.
\(\delta_{i j}\) represents measurement error for \(y_{ij}\).
\(\gamma_i\) represents an intercept for \(y_i\).
\(\lambda_{i}\) represents a factor loading for the measurement model.
We assume that the measurement errors have zero means and are uncorrelated with the exogenous and endogenous latent variables. We also assume that \(\delta_{i j}\) is homoscedastic and non-autocorrelated.
Finally, the structural relations among the latent variables are specified as: \[ \left[\begin{array}{l} \xi_{1 j} \end{array}\right]=\left[\begin{array}{cc} \gamma_{11} & \gamma_{12} \end{array}\right] \left[\begin{array}{l} \eta_{1 j} \\ \eta_{2 j} \end{array}\right] + \zeta_{1j} \] which represents a linear relation. We simulate data from this model as follows:
library(MASS)    # mvrnorm() for multivariate normal draws
library(Matrix)  # bdiag() for the block-diagonal loading matrix
library(dplyr)   # pipes and data wrangling below
N <- 500
# covariance matrix of the exogenous latent variables
lat_cov <- matrix(c(1, 0,
                    0, 0.8), nrow = 2)
# loading matrix
Lambda <- bdiag(c(1, 1.5, 2), c(1, 1.5, 2), c(1, 1.5, 2))
# error covariance
Theta <- diag(0.3, nrow = 9)
# exogenous latent variables
eta <- mvrnorm(N, mu = c(0, 0), Sigma = lat_cov)
# disturbance for latent outcome
zeta <- rnorm(N)
# structural model
xi <- tcrossprod(eta, t(c(1, 2))) + zeta
# error term
epsilon <- mvrnorm(N, mu = rep(0, ncol(Theta)), Sigma = Theta)
cbind(eta, xi) %>% dim
[1] 500 3
dat <- tcrossprod(cbind(eta, xi), Lambda) + epsilon
dat <- dat %>% as.matrix() %>% as.data.frame() %>% setNames(c("X1", "X2", "X3", "Z1", "Z2", "Z3", "Y1", "Y2", "Y3"))
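As a quick sanity check on the simulation, we can compare the sample covariance of dat against the model-implied covariance \(\Lambda \Psi \Lambda^\top + \Theta\), where \(\Psi\) is the joint covariance of \((\eta_{1j}, \eta_{2j}, \xi_{1j})\) implied by the structural coefficients (1, 2) and the unit disturbance variance. A minimal base-R sketch, rebuilding the population matrices rather than reusing the objects above:

```r
# model-implied covariance check (population values from the simulation above)
lat_cov <- diag(c(1, 0.8))                # Var(eta1), Var(eta2); Cov = 0
B <- matrix(c(1, 2), nrow = 1)            # structural coefficients gamma11, gamma12
# joint covariance of (eta1, eta2, xi), with Var(zeta) = 1
cov_eta_xi <- lat_cov %*% t(B)            # Cov(eta, xi)
var_xi <- B %*% lat_cov %*% t(B) + 1      # Var(xi) = 1 + 4 * 0.8 + 1 = 5.2
Psi <- rbind(cbind(lat_cov, cov_eta_xi),
             cbind(t(cov_eta_xi), var_xi))
# block-diagonal loading matrix (9 items x 3 factors), as built by bdiag() above
Lambda <- matrix(0, 9, 3)
Lambda[1:3, 1] <- Lambda[4:6, 2] <- Lambda[7:9, 3] <- c(1, 1.5, 2)
Theta <- diag(0.3, 9)
Sigma_implied <- Lambda %*% Psi %*% t(Lambda) + Theta
round(Sigma_implied[c(1, 4, 7), c(1, 4, 7)], 2)
# compare with the corresponding entries of cov(dat)
```

The diagonal entries 1.3, 1.1, and 5.5 should be close to the sample variances of X1, Z1, and Y1 in dat.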
To fit the SEM, the measurement part is specified as before, with =~ connecting each hypothesized latent variable to its observed indicators. What is new here are the "structural relations" among latent variables, indicated by ~, although "structural relation" is loosely defined here and does not necessarily imply causation.
lavaan_sem <- 'eta1 =~ X1 + X2 + X3
eta2 =~ Z1 + Z2 + Z3
xi =~ Y1 + Y2 + Y3
xi ~ eta1 + eta2'
# lavaan
lav_sem_fit <- cfa(lavaan_sem, data = dat, meanstructure = TRUE)
summary(lav_sem_fit)
lavaan 0.6.15 ended normally after 64 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 30
Number of observations 500
Model Test User Model:
Test statistic 22.985
Degrees of freedom 24
P-value (Chi-square) 0.521
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|)
eta1 =~
X1 1.000
X2 1.547 0.045 34.389 0.000
X3 2.087 0.059 35.315 0.000
eta2 =~
Z1 1.000
Z2 1.511 0.051 29.545 0.000
Z3 2.037 0.064 31.647 0.000
xi =~
Y1 1.000
Y2 1.539 0.022 70.267 0.000
Y3 2.010 0.026 76.904 0.000
Regressions:
Estimate Std.Err z-value P(>|z|)
xi ~
eta1 1.023 0.056 18.201 0.000
eta2 2.010 0.081 24.871 0.000
Covariances:
Estimate Std.Err z-value P(>|z|)
eta1 ~~
eta2 -0.024 0.040 -0.591 0.554
Intercepts:
Estimate Std.Err z-value P(>|z|)
.X1 -0.080 0.051 -1.587 0.113
.X2 -0.084 0.072 -1.165 0.244
.X3 -0.100 0.097 -1.034 0.301
.Z1 0.029 0.045 0.650 0.515
.Z2 0.001 0.063 0.018 0.986
.Z3 0.057 0.082 0.692 0.489
.Y1 -0.015 0.102 -0.143 0.886
.Y2 -0.023 0.155 -0.151 0.880
.Y3 -0.025 0.200 -0.123 0.902
eta1 0.000
eta2 0.000
.xi 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.X1 0.287 0.021 13.626 0.000
.X2 0.251 0.029 8.763 0.000
.X3 0.338 0.048 7.017 0.000
.Z1 0.268 0.019 13.736 0.000
.Z2 0.292 0.026 11.063 0.000
.Z3 0.280 0.037 7.485 0.000
.Y1 0.323 0.024 13.366 0.000
.Y2 0.386 0.038 10.137 0.000
.Y3 0.312 0.053 5.892 0.000
eta1 0.992 0.080 12.471 0.000
eta2 0.739 0.062 11.894 0.000
.xi 0.946 0.078 12.056 0.000
# blavaan; mcmcfile = TRUE saves the blavaan-generated Stan code
blav_sem_fit <- bcfa(lavaan_sem, data = dat, mcmcfile = TRUE)
Computing posterior predictives...
summary(blav_sem_fit)
blavaan (0.3-15) results of 1000 samples after 500 adapt/burnin iterations
Number of observations 500
Number of missing patterns 1
Statistic MargLogLik PPP
Value -6227.716 0.541
Latent Variables:
Estimate Post.SD pi.lower pi.upper Rhat Prior
eta1 =~
X1 1.000 NA
X2 1.552 0.046 1.467 1.645 1.000 normal(0,10)
X3 2.094 0.060 1.979 2.215 1.000 normal(0,10)
eta2 =~
Z1 1.000 NA
Z2 1.518 0.053 1.417 1.625 1.000 normal(0,10)
Z3 2.047 0.067 1.915 2.18 1.001 normal(0,10)
xi =~
Y1 1.000 NA
Y2 1.540 0.023 1.496 1.585 1.001 normal(0,10)
Y3 2.010 0.026 1.959 2.064 1.000 normal(0,10)
Regressions:
Estimate Post.SD pi.lower pi.upper Rhat Prior
xi ~
eta1 1.025 0.058 0.916 1.148 1.001 normal(0,10)
eta2 2.019 0.081 1.859 2.175 1.001 normal(0,10)
Covariances:
Estimate Post.SD pi.lower pi.upper Rhat Prior
eta1 ~~
eta2 -0.023 0.041 -0.104 0.056 0.999 beta(1,1)
Intercepts:
Estimate Post.SD pi.lower pi.upper Rhat Prior
.X1 -0.079 0.051 -0.178 0.025 1.000 normal(0,32)
.X2 -0.081 0.074 -0.225 0.064 1.000 normal(0,32)
.X3 -0.096 0.099 -0.285 0.099 1.001 normal(0,32)
.Z1 0.028 0.046 -0.062 0.118 1.000 normal(0,32)
.Z2 0.000 0.063 -0.121 0.125 1.000 normal(0,32)
.Z3 0.055 0.082 -0.106 0.214 1.001 normal(0,32)
.Y1 -0.013 0.104 -0.216 0.198 1.000 normal(0,32)
.Y2 -0.022 0.158 -0.326 0.29 1.001 normal(0,32)
.Y3 -0.025 0.204 -0.416 0.382 1.001 normal(0,32)
eta1 0.000 NA
eta2 0.000 NA
.xi 0.000 NA
Variances:
Estimate Post.SD pi.lower pi.upper Rhat Prior
.X1 0.291 0.022 0.251 0.336 1.000 gamma(1,.5)[sd]
.X2 0.254 0.028 0.202 0.31 1.001 gamma(1,.5)[sd]
.X3 0.340 0.048 0.253 0.44 1.000 gamma(1,.5)[sd]
.Z1 0.271 0.020 0.233 0.314 1.000 gamma(1,.5)[sd]
.Z2 0.295 0.027 0.246 0.351 0.999 gamma(1,.5)[sd]
.Z3 0.283 0.037 0.214 0.362 0.999 gamma(1,.5)[sd]
.Y1 0.328 0.025 0.281 0.382 0.999 gamma(1,.5)[sd]
.Y2 0.392 0.038 0.321 0.47 0.999 gamma(1,.5)[sd]
.Y3 0.314 0.052 0.214 0.415 1.000 gamma(1,.5)[sd]
eta1 0.998 0.084 0.844 1.177 1.000 gamma(1,.5)[sd]
eta2 0.741 0.063 0.626 0.87 1.000 gamma(1,.5)[sd]
.xi 0.960 0.079 0.812 1.123 0.999 gamma(1,.5)[sd]
The output now has a new section describing the path coefficients among latent variables, which can be interpreted as in linear regression.
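The "interpreted as in linear regression" point can be made concrete: if the latent scores were observed, the structural part would be an ordinary regression of \(\xi\) on \(\eta_1\) and \(\eta_2\). A small self-contained sketch using freshly simulated latent scores with the same population values as above (the seed and sample size are arbitrary choices for illustration):

```r
# if the latent variables were observed, the structural model is plain regression
set.seed(123)
n <- 5000
eta1 <- rnorm(n, sd = 1)                  # Var(eta1) = 1
eta2 <- rnorm(n, sd = sqrt(0.8))          # Var(eta2) = 0.8
xi <- 1 * eta1 + 2 * eta2 + rnorm(n)      # true coefficients: 1 and 2
round(coef(lm(xi ~ eta1 + eta2)), 2)      # estimates close to (0, 1, 2)
```

In the actual SEM, the latent scores are not observed, so the path coefficients are estimated jointly with the measurement model instead.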
# path diagram via the semPlot package
library(semPlot)
FIT3 <- semPlotModel_lavaanModel(lavaan_sem, auto.var = TRUE, auto.fix.first = TRUE)
semPaths(
FIT3,
what = "paths",
whatLabels = "par",
nodeLabels = c(
expression(paste(x[1])),
expression(paste(x[2])),
expression(paste(x[3])),
expression(paste(z[1])),
expression(paste(z[2])),
expression(paste(z[3])),
expression(paste(y[1])),
expression(paste(y[2])),
expression(paste(y[3])),
expression(paste(eta[1])),
expression(paste(eta[2])),
expression(paste(xi))
),
edge.label.cex = 0.6,
edgeLabels = c(
expression(paste(1)),
expression(paste(lambda[21])),
expression(paste(lambda[31])),
expression(paste(1)),
expression(paste(lambda[52])),
expression(paste(lambda[62])),
expression(paste(1)),
expression(paste(lambda[8])),
expression(paste(lambda[9])),
expression(paste(gamma[11])),
expression(paste(gamma[12])),
expression(paste(epsilon[1])),
expression(paste(epsilon[2])),
expression(paste(epsilon[3])),
expression(paste(epsilon[4])),
expression(paste(epsilon[5])),
expression(paste(epsilon[6])),
expression(paste(delta[1])),
expression(paste(delta[2])),
expression(paste(delta[3])),
expression(paste(psi[1])),
expression(paste(psi[2])),
expression(paste(psi[3]))
)
)
We again compare the estimates from lavaan and blavaan:
bind_cols(parameterEstimates(lav_sem_fit)[, 1:4],
          parameterEstimates(blav_sem_fit)[, 4]) %>%
  rename(ML_SEM = est, Bayes = ...5) %>%
  knitr::kable()
New names:
• `` -> `...5`
lhs | op | rhs | ML_SEM | Bayes |
---|---|---|---|---|
eta1 | =~ | X1 | 1.0000000 | 1.0000000 |
eta1 | =~ | X2 | 1.5468897 | 1.5518487 |
eta1 | =~ | X3 | 2.0867574 | 2.0940599 |
eta2 | =~ | Z1 | 1.0000000 | 1.0000000 |
eta2 | =~ | Z2 | 1.5111137 | 1.5179009 |
eta2 | =~ | Z3 | 2.0369639 | 2.0465995 |
xi | =~ | Y1 | 1.0000000 | 1.0000000 |
xi | =~ | Y2 | 1.5394007 | 1.5395212 |
xi | =~ | Y3 | 2.0096801 | 2.0102847 |
xi | ~ | eta1 | 1.0229308 | 1.0254598 |
xi | ~ | eta2 | 2.0102597 | 2.0188056 |
X1 | ~~ | X1 | 0.2870573 | 0.2908129 |
X2 | ~~ | X2 | 0.2508078 | 0.2538065 |
X3 | ~~ | X3 | 0.3376840 | 0.3404139 |
Z1 | ~~ | Z1 | 0.2678232 | 0.2712257 |
Z2 | ~~ | Z2 | 0.2922366 | 0.2952527 |
Z3 | ~~ | Z3 | 0.2798114 | 0.2833929 |
Y1 | ~~ | Y1 | 0.3233531 | 0.3278058 |
Y2 | ~~ | Y2 | 0.3862745 | 0.3917939 |
Y3 | ~~ | Y3 | 0.3124156 | 0.3137321 |
eta1 | ~~ | eta1 | 0.9924733 | 0.9980795 |
eta2 | ~~ | eta2 | 0.7392622 | 0.7408280 |
xi | ~~ | xi | 0.9457461 | 0.9603592 |
eta1 | ~~ | eta2 | -0.0236847 | -0.0234200 |
X1 | ~1 | | -0.0802587 | -0.0786916 |
X2 | ~1 | | -0.0844009 | -0.0808092 |
X3 | ~1 | | -0.0998296 | -0.0958554 |
Z1 | ~1 | | 0.0291862 | 0.0283963 |
Z2 | ~1 | | 0.0011052 | 0.0000806 |
Z3 | ~1 | | 0.0565958 | 0.0549565 |
Y1 | ~1 | | -0.0145914 | -0.0132324 |
Y2 | ~1 | | -0.0233656 | -0.0220552 |
Y3 | ~1 | | -0.0246824 | -0.0246455 |
eta1 | ~1 | | 0.0000000 | 0.0000000 |
eta2 | ~1 | | 0.0000000 | 0.0000000 |
xi | ~1 | | 0.0000000 | 0.0000000 |
When using SEMs to model multivariate linear relations in education, sociology, and psychology, evaluating model fit is usually considered a vital step before drawing sensible conclusions from the results. Statistical fit indices quantify how well our model fits the data, and many choices are available in blavaan, ranging from indices proposed specifically in the SEM context to more general, information-theoretic indices. In frequentist SEM, many common fit indices are available in lavaan:
fitmeasures(lav_cfa_fit)
npar fmin chisq
19.000 0.001 1.553
df pvalue baseline.chisq
8.000 0.992 6005.574
baseline.df baseline.pvalue cfi
15.000 0.000 1.000
tli nnfi rfi
1.002 1.002 1.000
nfi pnfi ifi
1.000 0.533 1.001
rni logl unrestricted.logl
1.001 -7909.932 -7909.155
aic bic ntotal
15857.863 15951.111 1000.000
bic2 rmsea rmsea.ci.lower
15890.766 0.000 0.000
rmsea.ci.upper rmsea.ci.level rmsea.pvalue
0.000 0.900 1.000
rmsea.close.h0 rmsea.notclose.pvalue rmsea.notclose.h0
0.050 0.000 0.080
rmr rmr_nomean srmr
0.005 0.005 0.002
srmr_bentler srmr_bentler_nomean crmr
0.002 0.002 0.002
crmr_nomean srmr_mplus srmr_mplus_nomean
0.003 0.002 0.002
cn_05 cn_01 gfi
9989.555 12941.502 0.999
agfi pgfi mfi
0.998 0.296 1.003
ecvi
0.040
In Bayesian latent variable modeling, there are generally two versions of fit indices, depending on which version of the likelihood function the software uses. blavaan utilizes the marginal likelihood (marginal over the latent variables), so fit indices such as DIC and WAIC reflect the model's predictive validity for a future person's responses, while conditional likelihood-based fit indices concern future responses from the current persons in our sample (see Merkle, Furr, and Rabe-Hesketh (2019) for a more detailed discussion). In almost all applications of SEM, the goal is to make inferences about an underlying population of individuals, so the conditional likelihood-based indices are not appropriate. blavaan uses the marginal version of the likelihood function by default, and we illustrate the difference in implementation below using the raw Stan syntax.
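To see the distinction concretely, consider a toy one-factor, three-item model (all numbers here are illustrative, not taken from the fitted models): loadings (1, 1.5, 2), factor variance 1, and residual variances 0.3. The conditional likelihood evaluates the response density given a person's latent score, while the marginal likelihood integrates that density over the latent distribution; for a normal model the integral has the closed form \(N(\boldsymbol{\beta}, \Lambda \Psi \Lambda^\top + \Theta)\). A base-R sketch:

```r
# marginal vs. conditional likelihood for a toy 1-factor, 3-item model
lambda <- c(1, 1.5, 2)                     # loadings
psi <- 1                                   # factor variance
Theta <- diag(0.3, 3)                      # residual covariance
y <- c(0.5, 1.0, 1.2)                      # one person's responses (made up)

ldmvnorm <- function(y, mu, S) {           # multivariate normal log-density
  r <- y - mu
  as.numeric(-0.5 * (length(y) * log(2 * pi) +
                     determinant(S, logarithm = TRUE)$modulus +
                     t(r) %*% solve(S) %*% r))
}
# marginal: y ~ N(0, lambda psi lambda' + Theta), latent variable integrated out
marg_ll <- ldmvnorm(y, rep(0, 3), psi * tcrossprod(lambda) + Theta)
# conditional: y ~ N(lambda * eta, Theta), given a known latent score eta
cond_ll <- function(eta) ldmvnorm(y, lambda * eta, Theta)
# numerically integrating the conditional over the N(0, psi) prior
# recovers the marginal likelihood
marg_num <- integrate(function(e)
  sapply(e, function(ei) exp(cond_ll(ei)) * dnorm(ei, 0, sqrt(psi))),
  lower = -10, upper = 10)$value
c(marginal = marg_ll, numeric = log(marg_num))
```

The two numbers agree, which is exactly what "marginal over the latent variables" means: averaging the conditional likelihood over the latent distribution.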
To assess how well a model fits the data with blavaan, we can check the fit indices in a similar fashion, taking the fitted CFA model as an example:
# NULL model needed for relative indices such as CFI, TLI, and NFI
cfa_null <-
c(paste0("Y", 1:6, " ~~ Y", 1:6), paste0("Y", 1:6, " ~ 1"))
blav_cfa_null_fit <- bcfa(cfa_null, data = dat_cfa, mcmcfile = TRUE)
Computing posterior predictives...
## The default method mimics fit indices derived from ML estimation
fitinx <-
blavFitIndices(blav_cfa_fit, baseline.model = blav_cfa_null_fit)
fitinx
Posterior mean (EAP) of devm-based fit indices:
BRMSEA BGammaHat adjBGammaHat BMc BCFI BTLI
0.003 1.000 0.999 1.000 1.000 1.002
BNFI
1.000
summary(fitinx)
Posterior summary statistics and highest posterior density (HPD) 90% credible intervals for devm-based fit indices:
EAP Median MAP SD lower upper
BRMSEA 0.003 0.000 0.000 0.009 0.000 0.018
BGammaHat 1.000 1.000 1.000 0.001 0.999 1.000
adjBGammaHat 0.999 1.000 1.000 0.003 0.997 1.000
BMc 1.000 1.000 1.000 0.001 0.999 1.000
BCFI 1.000 1.000 1.000 0.000 1.000 1.000
BTLI 1.002 1.002 1.003 0.002 0.999 1.005
BNFI 1.000 1.000 1.000 0.001 0.998 1.001
Notice that the fit indices available in both lavaan and blavaan tend to be similar, just as the parameter estimates were similar in the above examples, because the posterior means are simply used in place of the maximum likelihood estimates of the same parameters, yielding similar indices.
BRMSEA was proposed by Hoofs et al. (2018), and Garnier-Villarreal and Jorgensen (2020) proposed Bayesian counterparts of many widely used indices, such as CFI, NFI, and TLI, which are available when specifying fit.measures = "all".
Information-criterion indices based on the marginal likelihood are also available:
fitmeasures(blav_cfa_fit)
npar logl ppp bic dic p_dic waic
19.000 -7909.981 0.744 15951.190 15858.125 19.081 15858.403
p_waic se_waic looic p_loo se_loo margloglik
19.272 111.599 15858.456 19.299 111.600 -8007.416
In this section, we analyze an example dataset using blavaan. The dataset is publicly available through the lavaan.survey package. It contains Belgian school children's responses to math efficacy and math self-concept questions, as well as measures of their math ability from the Programme for International Student Assessment (PISA) study conducted in 2003. Ferla, Valcke, and Cai (2009) investigated the association between several math-related constructs and math achievement. In this example, we construct a simplified model and investigate the relations among (negative) math efficacy, (negative) math self-concept, and math achievement.
math achievement is ‘measured’ by four plausible values (PVs). PVs are random draws from the empirical Bayes posterior distribution of each individual’s math achievement, given their item responses on a math test and background variables. The posterior distribution is based on an item response model with a latent regression.
(negative) math efficacy is measured by eight self-reported items, e.g., “I have always believed that Mathematics is one of my best subjects” 1 (strongly agree) - 4 (strongly disagree).
(negative) math self-concept is measured by four self-reported items on perceived math ability for a given task, e.g., "Feel confident doing task:" 1 (very) - 4 (not at all).
library(lavaan.survey)
data(pisa.be.2003)
# Simplified version of Ferla et al. (2009) model.
model.pisa <- "
math =~ PV1MATH1 + PV1MATH2 + PV1MATH3 + PV1MATH4
neg.efficacy =~ ST31Q01 + ST31Q02 + ST31Q03 + ST31Q04 +
ST31Q05 + ST31Q06 + ST31Q07 + ST31Q08
neg.selfconcept =~ ST32Q04 + ST32Q06 + ST32Q07 + ST32Q09
math ~ neg.selfconcept + neg.efficacy
"
semPaths(semPlotModel_lavaanModel(model.pisa))
# Fit the model using lavaan
fit <- sem(model.pisa, data = pisa.be.2003)
summary(fit)
lavaan 0.6.15 ended normally after 87 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 35
Used Total
Number of observations 7890 8796
Model Test User Model:
Test statistic 5581.954
Degrees of freedom 101
P-value (Chi-square) 0.000
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|)
math =~
PV1MATH1 1.000
PV1MATH2 1.110 0.005 204.520 0.000
PV1MATH3 0.994 0.006 174.365 0.000
PV1MATH4 0.964 0.006 170.700 0.000
neg.efficacy =~
ST31Q01 1.000
ST31Q02 1.238 0.032 38.580 0.000
ST31Q03 1.425 0.035 40.178 0.000
ST31Q04 1.174 0.032 37.176 0.000
ST31Q05 1.379 0.035 39.729 0.000
ST31Q06 1.390 0.035 39.397 0.000
ST31Q07 1.396 0.036 38.635 0.000
ST31Q08 1.058 0.031 33.716 0.000
neg.selfconcept =~
ST32Q04 1.000
ST32Q06 1.217 0.018 68.174 0.000
ST32Q07 1.248 0.019 64.541 0.000
ST32Q09 1.024 0.017 61.616 0.000
Regressions:
Estimate Std.Err z-value P(>|z|)
math ~
neg.selfconcpt 0.001 0.004 0.140 0.889
neg.efficacy -0.238 0.008 -30.632 0.000
Covariances:
Estimate Std.Err z-value P(>|z|)
neg.efficacy ~~
neg.selfconcpt 0.112 0.004 25.691 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.PV1MATH1 0.006 0.000 57.948 0.000
.PV1MATH2 0.001 0.000 16.184 0.000
.PV1MATH3 0.003 0.000 51.940 0.000
.PV1MATH4 0.004 0.000 53.499 0.000
.ST31Q01 0.462 0.008 58.312 0.000
.ST31Q02 0.427 0.008 55.328 0.000
.ST31Q03 0.442 0.008 53.216 0.000
.ST31Q04 0.466 0.008 56.656 0.000
.ST31Q05 0.445 0.008 53.895 0.000
.ST31Q06 0.476 0.009 54.350 0.000
.ST31Q07 0.539 0.010 55.268 0.000
.ST31Q08 0.578 0.010 58.796 0.000
.ST32Q04 0.341 0.006 52.790 0.000
.ST32Q06 0.200 0.005 37.223 0.000
.ST32Q07 0.335 0.007 46.470 0.000
.ST32Q09 0.291 0.006 50.354 0.000
.math 0.028 0.001 51.228 0.000
neg.efficacy 0.172 0.008 22.910 0.000
neg.selfconcpt 0.362 0.010 34.841 0.000
fitmeasures(fit)
npar fmin chisq
35.000 0.354 5581.954
df pvalue baseline.chisq
101.000 0.000 89388.597
baseline.df baseline.pvalue cfi
120.000 0.000 0.939
tli nnfi rfi
0.927 0.927 0.926
nfi pnfi ifi
0.938 0.789 0.939
rni logl unrestricted.logl
0.939 -73960.921 -71169.944
aic bic ntotal
147991.842 148235.909 7890.000
bic2 rmsea rmsea.ci.lower
148124.686 0.083 0.081
rmsea.ci.upper rmsea.ci.level rmsea.pvalue
0.085 0.900 0.000
rmsea.close.h0 rmsea.notclose.pvalue rmsea.notclose.h0
0.050 0.996 0.080
rmr rmr_nomean srmr
0.035 0.035 0.056
srmr_bentler srmr_bentler_nomean crmr
0.056 0.056 0.060
crmr_nomean srmr_mplus srmr_mplus_nomean
0.060 0.056 0.056
cn_05 cn_01 gfi
178.333 194.606 0.914
agfi pgfi mfi
0.885 0.679 0.707
ecvi
0.716
# Fit the model using blavaan
bfit <- bsem(model.pisa, data = pisa.be.2003)
blavaan NOTE: Posterior predictives with missing data are currently very slow.
Consider setting test="none".
Computing posterior predictives...
summary(bfit)
blavaan (0.3-15) results of 1000 samples after 500 adapt/burnin iterations
Number of observations 8796
Number of missing patterns 119
Statistic MargLogLik PPP
Value -79503.055 0.000
Latent Variables:
Estimate Post.SD pi.lower pi.upper Rhat
math =~
PV1MATH1 1.000 NA
PV1MATH2 1.118 0.005 1.108 1.128 1.000
PV1MATH3 0.983 0.005 0.973 0.993 1.000
PV1MATH4 1.003 0.005 0.993 1.014 1.000
neg.efficacy =~
ST31Q01 1.000 NA
ST31Q02 1.224 0.030 1.168 1.287 1.000
ST31Q03 1.394 0.034 1.331 1.461 1.000
ST31Q04 1.159 0.030 1.104 1.219 1.000
ST31Q05 1.388 0.035 1.322 1.462 1.000
ST31Q06 1.366 0.034 1.303 1.435 1.000
ST31Q07 1.395 0.037 1.327 1.469 1.001
ST31Q08 1.038 0.029 0.982 1.096 1.000
neg.selfconcept =~
ST32Q04 1.000 NA
ST32Q06 1.215 0.018 1.182 1.25 1.002
ST32Q07 1.248 0.019 1.211 1.288 1.000
ST32Q09 1.020 0.017 0.987 1.053 1.000
Prior
normal(0,10)
normal(0,10)
normal(0,10)
normal(0,10)
normal(0,10)
normal(0,10)
normal(0,10)
normal(0,10)
normal(0,10)
normal(0,10)
normal(0,10)
normal(0,10)
normal(0,10)
Regressions:
Estimate Post.SD pi.lower pi.upper Rhat Prior
math ~
neg.selfconcpt 0.005 0.004 -0.004 0.013 0.999 normal(0,10)
neg.efficacy -0.251 0.008 -0.267 -0.236 0.999 normal(0,10)
Covariances:
Estimate Post.SD pi.lower pi.upper Rhat Prior
neg.efficacy ~~
neg.selfconcpt 0.113 0.004 0.104 0.121 1.000 beta(1,1)
Intercepts:
Estimate Post.SD pi.lower pi.upper Rhat Prior
.PV1MATH1 1.068 0.002 1.063 1.072 1.001 normal(0,32)
.PV1MATH2 1.079 0.002 1.074 1.084 1.001 normal(0,32)
.PV1MATH3 1.058 0.002 1.054 1.062 1.001 normal(0,32)
.PV1MATH4 1.068 0.002 1.063 1.072 1.000 normal(0,32)
.ST31Q01 1.888 0.009 1.87 1.905 1.001 normal(0,32)
.ST31Q02 1.888 0.009 1.871 1.905 1.000 normal(0,32)
.ST31Q03 2.145 0.010 2.127 2.164 1.001 normal(0,32)
.ST31Q04 2.041 0.009 2.023 2.058 1.000 normal(0,32)
.ST31Q05 1.726 0.010 1.706 1.745 1.000 normal(0,32)
.ST31Q06 2.169 0.010 2.15 2.188 1.000 normal(0,32)
.ST31Q07 2.159 0.010 2.139 2.178 1.000 normal(0,32)
.ST31Q08 2.398 0.009 2.378 2.416 1.001 normal(0,32)
.ST32Q04 2.362 0.009 2.345 2.38 0.999 normal(0,32)
.ST32Q06 2.526 0.009 2.508 2.543 0.999 normal(0,32)
.ST32Q07 2.908 0.010 2.888 2.928 0.999 normal(0,32)
.ST32Q09 2.900 0.009 2.883 2.917 0.999 normal(0,32)
.math 0.000 NA
neg.efficacy 0.000 NA
neg.selfconcpt 0.000 NA
Variances:
Estimate Post.SD pi.lower pi.upper Rhat Prior
.PV1MATH1 0.007 0.000 0.006 0.007 1.000 gamma(1,.5)[sd]
.PV1MATH2 0.001 0.000 0 0.001 1.001 gamma(1,.5)[sd]
.PV1MATH3 0.004 0.000 0.004 0.004 1.000 gamma(1,.5)[sd]
.PV1MATH4 0.004 0.000 0.004 0.005 1.001 gamma(1,.5)[sd]
.ST31Q01 0.472 0.008 0.457 0.487 0.999 gamma(1,.5)[sd]
.ST31Q02 0.431 0.008 0.416 0.446 1.000 gamma(1,.5)[sd]
.ST31Q03 0.445 0.008 0.43 0.461 1.000 gamma(1,.5)[sd]
.ST31Q04 0.469 0.008 0.454 0.485 1.000 gamma(1,.5)[sd]
.ST31Q05 0.452 0.009 0.435 0.469 1.000 gamma(1,.5)[sd]
.ST31Q06 0.479 0.008 0.462 0.496 1.000 gamma(1,.5)[sd]
.ST31Q07 0.538 0.010 0.518 0.558 0.999 gamma(1,.5)[sd]
.ST31Q08 0.581 0.010 0.563 0.6 1.000 gamma(1,.5)[sd]
.ST32Q04 0.346 0.006 0.333 0.359 0.999 gamma(1,.5)[sd]
.ST32Q06 0.202 0.005 0.191 0.213 1.000 gamma(1,.5)[sd]
.ST32Q07 0.332 0.007 0.318 0.347 1.000 gamma(1,.5)[sd]
.ST32Q09 0.294 0.006 0.283 0.305 1.001 gamma(1,.5)[sd]
.math 0.030 0.001 0.029 0.031 1.000 gamma(1,.5)[sd]
neg.efficacy 0.179 0.008 0.164 0.193 1.000 gamma(1,.5)[sd]
neg.selfconcpt 0.360 0.010 0.34 0.381 1.002 gamma(1,.5)[sd]
fitmeasures(bfit)
npar logl ppp bic dic p_dic waic
51.000 -79141.585 0.000 158746.349 158384.787 50.808 158388.655
p_waic se_waic looic p_loo se_loo margloglik
54.617 801.011 158388.792 54.685 801.013 -79503.055
As we can see, the estimates from lavaan and blavaan are similar. In particular, the latent construct "math achievement" is estimated to be negatively predicted by (negative) math efficacy (i.e., math achievement and math efficacy are positively related), accounting for (negative) math self-concept.