1 Introduction

A common feature of data structures in education is that units of analysis (e.g., students) are nested in higher organizational clusters (e.g. schools). This kind of structure induces dependence among the responses observed for units within the same cluster. Students in the same school tend to be more alike in their academic and attitudinal characteristics than students chosen at random from the population at large.

Multilevel models1 are designed to model such within-cluster dependence. Multilevel models recognize the existence of data clustering (at two or more levels) by allowing for residual components at each level in the hierarchy. For example, a two-level model that allows for grouping of student outcomes within schools would include residuals at both the student and school level. The residual variance is thus partitioned into a between-school component (the variance of the school-level residuals) and a within-school component (the variance of the student-level residuals).

In this tutorial, we illustrate how to fit a multilevel linear model within a full Bayesian framework using rstanarm. This tutorial is aimed primarily at educational researchers who have used lme4 in R to fit models to their data and who may be interested in learning how to fit Bayesian multilevel models. However, for readers who have not used lme4 before, we briefly review the use of the package for fitting multilevel models.

This tutorial uses Stan version 2.17.3 and requires the following R packages.

# Required Packages
library(mlmRev)
library(lme4)
library(rstanarm)
library(ggplot2)

1.1 Data example

We will be analyzing the Gcsemv dataset (Rasbash et al. 2000) from the mlmRev package in R. The data include the General Certificate of Secondary Education (GCSE) exam scores of 1,905 students from 73 schools in England on a science subject. The Gcsemv dataset consists of the following 5 variables:

  • school: school identifier
  • student: student identifier
  • gender: gender of a student (M: Male, F: Female)
  • written: total score on written paper
  • course: total score on coursework paper
# Use example dataset from mlmRev package: GCSE exam score
data(Gcsemv, package = "mlmRev")
summary(Gcsemv)
     school        student     gender      written          course      
 68137  : 104   77     :  14   F:1128   Min.   : 0.60   Min.   :  9.25  
 68411  :  84   83     :  14   M: 777   1st Qu.:37.00   1st Qu.: 62.90  
 68107  :  79   53     :  13            Median :46.00   Median : 75.90  
 68809  :  73   66     :  13            Mean   :46.37   Mean   : 73.39  
 22520  :  65   27     :  12            3rd Qu.:55.00   3rd Qu.: 86.10  
 60457  :  54   110    :  12            Max.   :90.00   Max.   :100.00  
 (Other):1446   (Other):1827            NA's   :202     NA's   :180     

Two components of the exam were recorded as outcome variables: written paper and course work. In this tutorial, only the total score on the courework paper (course) will be analyzed. As seen above, there some of the observations have missing values for certain covariates. While we do not subset the data to only include complete cases to demonstrate that rstanarm automatically drops these observations, it is generally good practice to manually do so if required.

# Make Male the reference category and rename variable
Gcsemv$female <- relevel(Gcsemv$gender, "M")


# Use only total score on coursework paper 
GCSE <- subset(x = Gcsemv, 
               select = c(school, student, female, course))

# Count unique schools and students
J <- length(unique(GCSE$school))
N <- nrow(GCSE)

The rstanarm package automates several data preprocessing steps making its use very similar to that of lme4 in the following way.

  • Input - rstanarm is able to take a data frame as input2.

  • Missing Data - rstanarm automatically discards observations with NA values for any variable used in the model3.

  • Identifiers - rstanarm does not require identifiers to be sequential4. We do suggest that it is good practice for all cluster and unit identifiers, as well as categorical variables be stored as factors. This applies to using lme4 as much as it does to rstanarm. One can check the structure of the variables by using the str() function.

# Check structure of data frame
str(GCSE)
'data.frame':   1905 obs. of  4 variables:
 $ school : Factor w/ 73 levels "20920","22520",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ student: Factor w/ 649 levels "1","2","3","4",..: 16 25 27 31 42 62 101 113 146 1 ...
 $ female : Factor w/ 2 levels "M","F": 1 2 2 2 1 2 2 1 1 2 ...
 $ course : num  NA 71.2 76.8 87.9 44.4 NA 89.8 17.5 32.4 84.2 ...

2 Likelihood inference using lmer()

In this section, we briefly review three basic multilevel linear models which will be fit in this tutorial. Starting with a varying intercept model with no predictors (Model 1), we then proceed to the varying intercept model with one predictor (Model 2), and the varying intercept and slope model (Model 3).

2.1 Model 1: Varying intercept model with no predictors (Variance components model)

Consider the simplest multilevel model for students \(i=1, ..., n\) nested within schools \(j=1, ..., J\) and for whom we have examination scores as responses. We can write a two-level varying intercept model with no predictors using the usual two-stage formulation as

\[y_{ij} = \alpha_{j} + \epsilon_{ij}, \text{ where } \epsilon_{ij} \sim N(0, \sigma_y^2)\] \[\alpha_j = \mu_{\alpha} + u_j, \text{ where } u_j \sim N(0, \sigma_\alpha^2)\]

where \(y_{ij}\) is the examination score for the ith student in the jth school, \(\alpha_{j}\) is the varying intercept for the jth school, and \(\mu_{\alpha}\) is the overall mean across schools. Alternatively, the model can be expressed in reduced form as \[y_{ij} = \mu_\alpha + u_j + \epsilon_{ij}.\]. If we further assume that the student-level errors \(\epsilon_{ij}\) are normally distributed with mean 0 and variance \(\sigma_{y}^{2}\), and that the school-level varying intercepts \(\alpha_{j}\) are normally distributed with mean \(\mu_{\alpha}\) and variance \(\sigma_{\alpha}^{2}\), then the model can be expressed as

\[y_{ij} \sim N(\alpha_{j}, \sigma_{y}^{2}),\] \[\alpha_{j}\sim N(\mu_{\alpha}, \sigma_{\alpha}^{2}),\]

This model can then be fit using lmer(). We specify an intercept (the predictor “1”) and allow it to vary by the level-2 identifier (school). We also specify the REML = FALSE option to obtain maximum likelihood (ML) estimates as opposed to the default restricted maximum likelihood (REML) estimates.

M1 <- lmer(formula = course ~ 1 + (1 | school), 
           data = GCSE, 
           REML = FALSE)
summary(M1)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: course ~ 1 + (1 | school)
   Data: GCSE

     AIC      BIC   logLik deviance df.resid 
 14111.4  14127.7  -7052.7  14105.4     1722 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.9693 -0.5101  0.1116  0.6741  2.7613 

Random effects:
 Groups   Name        Variance Std.Dev.
 school   (Intercept)  75.24    8.674  
 Residual             190.77   13.812  
Number of obs: 1725, groups:  school, 73

Fixed effects:
            Estimate Std. Error t value
(Intercept)    73.72       1.11    66.4

Under the Fixed effects part of the output, we see that the intercept \(\mu_{\alpha}\), averaged over the population of schools, is estimated as \(73.72\). Under the Random effects part of the output, we see that the between-school standard deviation \(\sigma_{\alpha}\) is estimated as \(8.67\) and the within-school standard deviation \(\sigma_{y}\) as \(13.81\).

2.2 Model 2: Varying intercept model with a single predictor

The varying intercept model5 with an indicator variable for being female \(x_{ij}\) can be written as \[y_{ij} \sim N(\alpha_{j}+\beta x_{ij} , \sigma_{y}^{2}),\] \[\alpha_{j}\sim N(\mu_{\alpha}, \sigma_{\alpha}^{2}).\] The equation of the average regression line across schools is \(\mu_{ij}=\mu_{\alpha}+\beta x_{ij}\). The regression lines for specific schools will be parallel to the average regression line (having the same slope \(\beta\)), but differ in terms of its intercept \(\alpha_{j}\). This model can be estimated by adding female to the formula in the lmer() function, which will allow only the intercept to vary by school, and while keeping the “slope” for being female constant across schools.

M2 <- lmer(formula = course ~ 1 + female + (1 | school), 
           data = GCSE, 
           REML = FALSE)
summary(M2) 
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: course ~ 1 + female + (1 | school)
   Data: GCSE

     AIC      BIC   logLik deviance df.resid 
 14017.4  14039.2  -7004.7  14009.4     1721 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.7809 -0.5401  0.1259  0.6795  2.6753 

Random effects:
 Groups   Name        Variance Std.Dev.
 school   (Intercept)  76.65    8.755  
 Residual             179.96   13.415  
Number of obs: 1725, groups:  school, 73

Fixed effects:
            Estimate Std. Error t value
(Intercept)   69.730      1.185   58.87
femaleF        6.739      0.678    9.94

Correlation of Fixed Effects:
        (Intr)
femaleF -0.338

The average regression line across schools is thus estimated as \(\hat{\mu}_{ij}=69.73+ 6.74 x_{ij}\), with \(\sigma_\alpha\) and \(\sigma_y\) estimated as \(8.76\) and \(13.41\) respectively. Treating these estimates of \(\mu_\alpha\), \(\beta\), \(\sigma^2_{y}\), and \(\sigma^2_{\alpha}\) as the true parameter values, we can then obtain the Best Linear Unbiased Predictions (BLUPs) for the school-level errors \(\hat{u}_j = \hat{\alpha}_{j} - \hat{\mu}_{\alpha}\).

The BLUPs are equivalent to the so-called Empirical Bayes (EB) prediction, which is the mean of the posterior distribution of \(u_{j}\) given all the estimated parameters, as well as the random variables \(y_{ij}\) and \(x_{ij}\) for the cluster. These predictions are called “Bayes” because they make use of the pre-specified prior distribution6 \(\alpha_j \sim N(\mu_\alpha, \sigma^2_\alpha)\), and by extension \(u_j \sim N(0, \sigma^2_\alpha)\), and called “Empirical” because the parameters of this prior, \(\mu_\alpha\) and \(\sigma^2_{\alpha}\), in addition to \(\beta\) and \(\sigma^2_{y}\), are estimated from the data.

Compared to the Maximum Likelihood (ML) approach of predicting values for \(u_j\) by using only the estimated parameters and data from cluster \(j\), the EB approach additionally consider the prior distribution of \(u_{j}\), and produces predicted values closer to \(0\) (a phenomenon described as shrinkage or partial pooling). To see why this phenomenon is called shrinkage, we usually express the estimates for \(u_j\) obtained from EB prediction as \(\hat{u}_j^{\text{EB}} = \hat{R}_j\hat{u}_j^{\text{ML}}\) where \(\hat{u}_j^{\text{ML}}\) are the ML estimates, and \(\hat{R}_j = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \frac{\sigma_y^2}{n_j}}\) is the so-called Shrinkage factor.

head(ranef(M2)$school)
      (Intercept)
20920 -10.1702110
22520 -17.0578149
22710   7.8007260
22738   0.4871012
22908  -8.1940346
23208   4.4304453

These values estimate how much the intercept is shifted up or down in particular schools. For example, in the first school in the dataset, the estimated intercept is about 10.17 lower than average, so that the school-specific regression line is \((69.73 - 10.17) + 6.74 x_{ij}\).

Gelman and Hill (2006) characterize multilevel modeling as partial pooling (also called shrinkage), which is a compromise between two extremes: complete pooling in which the clustering is not considered in the model at all, and no pooling, in which separate intercepts are estimated for each school as coefficients of dummy variables. The estimated school-specific regression lines in the above model are based on partial pooling estimates. To show this, we first estimate the intercept and slope in each school three ways:

# Complete-pooling regression
pooled <- lm(formula = course ~ female,
             data = GCSE)
a_pooled <- coef(pooled)[1]   # complete-pooling intercept
b_pooled <- coef(pooled)[2]   # complete-pooling slope

# No-pooling regression
nopooled <- lm(formula = course ~ 0 + school + female,
               data = GCSE)
a_nopooled <- coef(nopooled)[1:J]   # 73 no-pooling intercepts              
b_nopooled <- coef(nopooled)[J+1]

# Partial pooling (multilevel) regression
a_part_pooled <- coef(M2)$school[, 1]
b_part_pooled <- coef(M2)$school[, 2]

Then, we plot7 the data and school-specific regression lines for a selection of eight schools using the following commands:

# (0) Set axes & choose schools
y <- GCSE$course
x <- as.numeric(GCSE$female) - 1 + runif(N, -.05, .05)
schid <- GCSE$school
sel.sch <- c("65385",
             "68207",
             "60729",
             "67051",
             "50631",
             "60427",
             "64321",
             "68137")

# (1) Subset 8 of the schools; generate data frame
df <- data.frame(y, x, schid)
df8 <- subset(df, schid %in% sel.sch)

# (2) Assign complete-pooling, no-pooling, partial pooling estimates
df8$a_pooled <- a_pooled 
df8$b_pooled <- b_pooled
df8$a_nopooled <- a_nopooled[df8$schid]
df8$b_nopooled <- b_nopooled
df8$a_part_pooled <- a_part_pooled[df8$schid]
df8$b_part_pooled <- b_part_pooled[df8$schid]

# (3) Plot regression fits for the 8 schools
ggplot(data = df8, 
       aes(x = x, y = y)) + 
  facet_wrap(facets = ~ schid, 
             ncol = 4) + 
  theme_bw() +
  geom_jitter(position = position_jitter(width = .05, 
                                         height = 0)) +
  geom_abline(aes(intercept = a_pooled, 
                  slope = b_pooled), 
              linetype = "solid", 
              color = "blue", 
              size = 0.5) +
  geom_abline(aes(intercept = a_nopooled, 
                  slope = b_nopooled), 
              linetype = "longdash", 
              color = "red", 
              size = 0.5) + 
  geom_abline(aes(intercept = a_part_pooled, 
                  slope = b_part_pooled), 
              linetype = "dotted", 
              color = "purple", 
              size = 0.7) + 
  scale_x_continuous(breaks = c(0, 1), 
                     labels = c("male", "female")) + 
  labs(title = "Complete-pooling, No-pooling, and Partial pooling estimates",
       x = "", 
       y = "Total score on coursework paper")+theme_bw( base_family = "serif")
The blue-solid, red-dashed, and purple-dotted lines show the complete-pooling, no-pooling, and partial pooling estimates respectively. 
 We see that the estimated school-specific regression line from the partial pooling estimates lies between the complete-pooling and no-pooling regression lines. There is more pooling (purple dotted line closer to blue solid line) in schools with small sample sizes.

The blue-solid, red-dashed, and purple-dotted lines show the complete-pooling, no-pooling, and partial pooling estimates respectively. We see that the estimated school-specific regression line from the partial pooling estimates lies between the complete-pooling and no-pooling regression lines. There is more pooling (purple dotted line closer to blue solid line) in schools with small sample sizes.

2.3 Model 3: Varying intercept and slope model with a single predictor

We now extend the varying intercept model with a single predictor to allow both the intercept and the slope to vary randomly across schools using the following model8:

\[y_{ij}\sim N(\alpha_{j}+\beta_{j}x_{ij} , \sigma_y ^2 ),\] \[\left( \begin{matrix} \alpha _{ j } \\ \beta _{ j } \end{matrix} \right) \sim N\left( \left( \begin{matrix} { \mu }_{ \alpha } \\ { \mu }_{ \beta } \end{matrix} \right) , \left( \begin{matrix} { \sigma }_{ \alpha }^{ 2 } & \rho { \sigma }_{ \alpha }{ \sigma }_{ \beta } \\ \rho { \sigma }_{ \alpha }{ \sigma }_{ \beta } & { \sigma }_{ \beta }^{ 2 } \end{matrix} \right) \right).\]

Note that now we have variation in the \(\alpha_{j}\)’s and the \(\beta_{j}\)’s, and also a correlation parameter \(\rho\) between \(\alpha_j\) and \(\beta_j\). This model can be fit using lmer() as follows:

M3 <- lmer(formula = course ~ 1 + female + (1 + female | school), 
           data = GCSE, 
           REML = FALSE)
summary(M3) 
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: course ~ 1 + female + (1 + female | school)
   Data: GCSE

     AIC      BIC   logLik deviance df.resid 
 13983.4  14016.2  -6985.7  13971.4     1719 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.6886 -0.5222  0.1261  0.6529  2.6729 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 school   (Intercept) 102.93   10.146        
          femaleF      47.94    6.924   -0.52
 Residual             169.79   13.030        
Number of obs: 1725, groups:  school, 73

Fixed effects:
            Estimate Std. Error t value
(Intercept)   69.425      1.352  51.338
femaleF        7.128      1.131   6.302

Correlation of Fixed Effects:
        (Intr)
femaleF -0.574

In this model, the residual within-school standard deviation is estimated as \(\hat{\sigma}_{y}=\) 13.03. The estimated standard deviations of the school intercepts and the school slopes are \(\hat{\sigma}_{\alpha}= 10.15\) and \(\hat{\sigma}_{\beta}= 6.92\) respectively. The estimated correlation between varying intercepts and slopes is \(\hat{\rho} = -0.52\). We can use code similar to that presented in section 2.2 to plot the data and school-specific regression lines for a selection of eight schools.

3 Bayesian inference for Model 1

We can quickly and easily fit many multilevel models using the lmer() function in R. As previously mentioned, functions such as lmer() are based on a combination of maximum likelihood (ML) estimation of the model parameters, and empirical Bayes (EB) predictions of the varying intercepts and/or slopes. However, in some instances, when the number of groups is small or when the model contains many varying coefficients or non-nested components, the ML approach may not work as well in part because there may not be enough information to estimate variance parameters precisely. In such cases, Restricted Maximum Likelihood (REML) Estimation provides more reasonable inferences.

A fully Bayesian approach also provides reasonable inferences in these instances with the added benefit of accounting for all the uncertainty in the parameter estimates when predicting the varying intercepts and slopes, and their associated uncertainty. This is one of several reasons that one should be interested in fully Bayesian estimation. Other reasons are discussed in Section 3.4.3. One may start by quickly fitting many specifications in building a model using the lmer() function, and then take advantage of the flexibility of a fully Bayesian approach using rstanarm to obtain simulations summarizing uncertainty about coefficients, predictions, and other quantities of interest.

In this section, we present how to fit and evaluate Model 1 using the rstanarm package. Alternatively, but not covered in this tutorial, one can also create a hand-written program in Stan and run it using the rstan package.

3.1 Using the rstanarm package

Many relatively simple models can be fit using the rstanarm package without writing any code in the Stan language. The rstanarm package is a wrapper for the rstan package that enables the most common applied regression models to be estimated using Markov Chain Monte Carlo (MCMC) but still be specified using customary R modeling syntax. Education researchers can use Bayesian estimation for multilevel models with only minimal changes to their existing code with lmer().

For example, Model 1 with default prior distributions for \(\mu_{\alpha}\), \(\sigma_{\alpha}\), and \(\sigma_{y}\) can be specified using the rstanarm package by prepending stan_ to the lmer call:

M1_stanlmer <- stan_lmer(formula = course ~ 1 + (1 | school), 
                         data = GCSE,
                         seed = 349)

This stan_lmer() function is similar in syntax to lmer() but rather than performing maximum likelihood estimation, Bayesian estimation is performed via MCMC. As each step in the MCMC estimation approach involves random draws from the parameter space, we include a seed option to ensure that each time the code is run, stan_lmer outputs the same results.

3.2 Prior distributions

Model 1 is a varying intercept model with normally distributed student residuals and school-level intercepts: \(y_{ij} \sim N(\alpha_{j}, \sigma_{y}^{2}),\) and \(\alpha_{j}\sim N(\mu_{\alpha}, \sigma_{\alpha}^{2})\). The normal distribution for the \(\alpha_{j}\)’s can be thought of as a prior distribution for these varying intercepts. The parameters of this prior distribution, \(\mu_{\alpha}\) and \(\sigma_{\alpha}\), are estimated from the data when using maximum likelihood estimation. In full Bayesian inference, all the hyperparameters (\(\mu_{\alpha}\) and \(\sigma_{\alpha}\)), along with the other unmodeled parameters (in this case, \(\sigma_{y}\)) also need a prior distribution.

Here, we use the default prior distributions for the hyperparameters in stan_lmer by not specifying any prior options in stan_lmer() function. The default priors are intended to be weakly informative in that they provide moderate regularization9 and help stabilize computation. It should be noted that the authors of rstanarm suggest not relying on rstanarm to specify the default prior for a model, but rather, to specify the priors explicitly even if they are indeed the current default, as updates to the package may result in different defaults.

First, before accounting for the scale of the variables, \(\mu_{\alpha}\) is given normal prior distribution with mean 0 and standard deviation 10. That is, \(\mu_{\alpha} \sim N(0, 10^2)\). The standard deviation of this prior distribution, 10, is five times as large as the standard deviation of the response if it were standardized. This should be a close approximation to a noninformative prior over the range supported by the likelihood, which should give inferences similar to those obtained by maximum likelihood methods if similarly weak priors are used for the other parameters.

Second, the (unscaled) prior for \(\sigma_{y}\) is set to an exponential distribution with rate parameter set to 1.

Third, in order to specify a prior for the variances and covariances of the varying (or “random”) effects, rstanarm will decompose this matrix into a correlation matrix of the varying effects and a function of their variances. Since there is only one varying effect in this example, the default (unscaled) prior for \(\sigma_{\alpha}\) that rstanarm uses reduces to an exponential distribution with rate parameter set to 1.

It should also be noted that rstanarm will scale the priors unless the autoscale = FALSE option is used. After fitting a model using stan_lmer, we can check the priors that are used by invoking the prior_summary() function.

# Obtain a summary of priors used
prior_summary(object = M1_stanlmer)
Priors for model 'M1_stanlmer' 
------
Intercept (after predictors centered)
 ~ normal(location = 0, scale = 10)
     **adjusted scale = 163.21

Auxiliary (sigma)
 ~ exponential(rate = 1)
     **adjusted scale = 16.32 (adjusted rate = 1/adjusted scale)

Covariance
 ~ decov(reg. = 1, conc. = 1, shape = 1, scale = 1)
------
See help('prior_summary.stanreg') for more details
# Obtain SD of outcome
sd(GCSE$course, na.rm = TRUE)
[1] 16.32096

As seen above, the scales of the priors for \(\mu_{\alpha}\) and \(\sigma_y\) are set to \(163.21\) and \(16.32\) respectively after rescaling. Since the default prior for the intercept is normal with a scale parameter of \(10\), the rescaled prior is also normal but with a scale parameter of \(\text{scale} \times \text{SD}(y) = 10 \times 16.321= 163.21\). Similarly, since the default prior for \(\sigma_y\) is exponential with a rate parameter of \(1\) (or equivalently, scale parameter \(\text{scale} = \frac{1}{\text{rate}} = 1\)), the rescaled prior is likewise exponential with a scale parameter of \(\text{scale} \times \text{SD}(y) = 1 \times 16.321= 16.32\).

3.3 Direct output from stan_lmer

3.3.1 Posterior medians and posterior median absolute deviations

We can display a quick summary of the fit from Model 1 by using the print method in the following manner:

print(M1_stanlmer, digits = 2)
stan_lmer
 family:       gaussian [identity]
 formula:      course ~ 1 + (1 | school)
 observations: 1725
------
            Median MAD_SD
(Intercept) 73.77   1.08 
sigma       13.82   0.25 

Error terms:
 Groups   Name        Std.Dev.
 school   (Intercept)  8.88   
 Residual             13.82   
Num. levels: school 73 

Sample avg. posterior predictive distribution of y:
         Median MAD_SD
mean_PPD 73.40   0.47 

------
For info on the priors used see help('prior_summary.stanreg').

Here, the point estimate of \(\mu_{\alpha}\) from stan_lmer is \(73.75\) and this corresponds to the median of the posterior draws. This is similar to the ML estimate obtained from lmer. The point estimate for \(\sigma_{\alpha}\) from stan_lmer is \(8.88\), which is larger than the ML estimate (\(8.67\)). This discrepancy may be partly because the ML approach in lmer() does not take into account the uncertainty in \(\mu_{\alpha}\) when estimating \(\sigma_{\alpha}\). The REML approach (\(8.75\)) in lmer(), as mentioned previously, does in fact account for this uncertainty.

When using stan_lmer, standard errors are obtained by considering the median absolute deviation (MAD) of each draw from the median of those draws. It is well known that ML tends to underestimate uncertainties because it relies on point estimates of hyperparameters. Full Bayes, on the other hand, propagates the uncertainty in the hyperparameters throughout all levels of the model and provides more appropriate estimates of uncertainty. See also W. J. Browne, Draper, and others (2006) for further discussion.

3.3.2 Posterior means, posterior standard deviations, 95% credible intervals and Monte Carlo errors

While the use of the median and the MAD of the posterior draws for estimation and inference are the default outputs from rstanarm, users may instead prefer to use the mean and the standard deviation of the posterior distribution instead. Additionally, users may be interested in credible intervals, a concept unique to Bayesian statistics that is the analogue to confidence intervals in frequentist statistics. Unlike the latter, the 95% credible intervals have a 95% probability of containing the true value of the parameter given the data. This 95% credible interval is typically obtained by considering the 2.5th to 97.5th percentiles of the distribution of posterior draws.

To obtain these estimates, we can make use of the summary method as follows. The pars argument specifies which the parameter estimates to display. Here, (Intercept), sigma, and Sigma[school:(Intercept),(Intercept)] correspond to \(\mu_{\alpha}\), \(\sigma_{y}\), and \(\sigma_{\alpha}^2\). The probs argument specifies which quantiles of the posterior distribution to display.

summary(M1_stanlmer, 
        pars = c("(Intercept)", "sigma", "Sigma[school:(Intercept),(Intercept)]"),
        probs = c(0.025, 0.975),
        digits = 2)

Model Info:

 function:     stan_lmer
 family:       gaussian [identity]
 formula:      course ~ 1 + (1 | school)
 algorithm:    sampling
 priors:       see help('prior_summary')
 sample:       4000 (posterior sample size)
 observations: 1725
 groups:       school (73)

Estimates:
                                        mean   sd     2.5%   97.5%
(Intercept)                            73.75   1.11  71.46  75.91 
sigma                                  13.82   0.25  13.35  14.31 
Sigma[school:(Intercept),(Intercept)]  78.91  15.89  52.71 115.82 

Diagnostics:
                                      mcse Rhat n_eff
(Intercept)                           0.04 1.00  826 
sigma                                 0.00 1.00 4000 
Sigma[school:(Intercept),(Intercept)] 0.56 1.00  803 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

It is worthwhile to note that when using the summary method, the estimate for the standard deviation \(\sigma_y\) is the the mean of the posterior draws of the parameter. This is in contrast to the median of the posterior draws that we obtain when using the print method. One advantage of using the median is that the estimate for \(\sigma_y^2\) is simply the square of the estimate for \(\sigma_y\) if the number of samples is odd. This is not true when using the mean. In this case, and more generally when we need to evaluate other functions of the parameters, we need to access the posterior draws directly. This is detailed in the next section.

Under Diagnostics, we refer the reader to Section 5 for more information about Rhat and n_eff. The values under mcse represent estimates for the Monte Carlo standard errors which represent the randomness associated with each MCMC estimation run. That is, with the same data set, repeatedly using an MCMC approach to estimate a paramater produces estimates with a standard deviation equal to the Monte Carlo standard error.

3.4 Other output from stan_lmer

As mentioned, users may prefer to work directly with the posterior draws to obtain estimates of more complex parameters. To do so, users need to manually access them from the stan_lmer object. We demonstrate how to do so in the context of making comparisons between individuals schools.

3.4.1 Accessing the simulations and summarizing results

Based on the default settings, stan_lmer generates 4 MCMC chains of 2,000 iterations each. Half of these iterations in each chain are used as warm-up/burn-in (to allow the chain to converge to the posterior distribution), and hence we only use 1,000 samples per chain. These MCMC-generated samples are taken to be drawn from the posterior distributions of the parameters in the model. We can use these samples for predictions, summarizing uncertainty and estimating credible intervals for any function of the parameters.

To access the posterior draws for all the parameters, we apply the method as.matrix() to the stanreg object M1_stanlmer. This returns an \(S\) by \(P\) matrix, where \(S\) is the size of the posterior sample (or equivalently, the number of MCMC iterations after warm-up) and \(P\) is the number of parameters/quantities. We can then generate a matrix for varying intercepts \(\alpha_{j}\) as well as vectors containing the draws for the within standard deviation and the between variance by manipulating this matrix. Note that in order to select the correct columns for the parameter of interest, it is useful to explore the column names of the matrix sims.

A more direct approach to obtaining the posterior draws for specific parameters is to make use of the built in functionality of the as.matrix method for stanreg objects. When applying the as.matrix method to a stanreg object, the user is able to specify either an optional character vector of parameter names, or an optional character vector of regular expressions10 to extract the posterior draws of only the parameters they are interested in. For example, since the parameter representing the overall mean is labelled (Intercept), we can extract the posterior draws of only this parameter by including the option pars = "(Intercept)". Similarly, since the parameters representing the 73 school-level errors all contain the string b[(Intercept) school: we can extract all parameters that contain this string by using the option regex_pars = "b\\[\\(Intercept\\) school\\:".

# Extract the posterior draws for all parameters
sims <- as.matrix(M1_stanlmer)
dim(sims)
[1] 4000   76
para_name <- colnames(sims)
para_name
 [1] "(Intercept)"                          
 [2] "b[(Intercept) school:20920]"          
 [3] "b[(Intercept) school:22520]"          
 [4] "b[(Intercept) school:22710]"          
 [5] "b[(Intercept) school:22738]"          
 [6] "b[(Intercept) school:22908]"          
 [7] "b[(Intercept) school:23208]"          
 [8] "b[(Intercept) school:25241]"          
 [9] "b[(Intercept) school:30474]"          
[10] "b[(Intercept) school:35270]"          
[11] "b[(Intercept) school:37224]"          
[12] "b[(Intercept) school:47627]"          
[13] "b[(Intercept) school:50627]"          
[14] "b[(Intercept) school:50631]"          
[15] "b[(Intercept) school:60421]"          
[16] "b[(Intercept) school:60427]"          
[17] "b[(Intercept) school:60437]"          
[18] "b[(Intercept) school:60439]"          
[19] "b[(Intercept) school:60441]"          
[20] "b[(Intercept) school:60455]"          
[21] "b[(Intercept) school:60457]"          
[22] "b[(Intercept) school:60501]"          
[23] "b[(Intercept) school:60729]"          
[24] "b[(Intercept) school:60741]"          
[25] "b[(Intercept) school:63619]"          
[26] "b[(Intercept) school:63833]"          
[27] "b[(Intercept) school:64251]"          
[28] "b[(Intercept) school:64321]"          
[29] "b[(Intercept) school:64327]"          
[30] "b[(Intercept) school:64343]"          
[31] "b[(Intercept) school:64359]"          
[32] "b[(Intercept) school:64428]"          
[33] "b[(Intercept) school:65385]"          
[34] "b[(Intercept) school:66365]"          
[35] "b[(Intercept) school:67051]"          
[36] "b[(Intercept) school:67105]"          
[37] "b[(Intercept) school:67311]"          
[38] "b[(Intercept) school:68107]"          
[39] "b[(Intercept) school:68111]"          
[40] "b[(Intercept) school:68121]"          
[41] "b[(Intercept) school:68125]"          
[42] "b[(Intercept) school:68133]"          
[43] "b[(Intercept) school:68137]"          
[44] "b[(Intercept) school:68201]"          
[45] "b[(Intercept) school:68207]"          
[46] "b[(Intercept) school:68217]"          
[47] "b[(Intercept) school:68227]"          
[48] "b[(Intercept) school:68233]"          
[49] "b[(Intercept) school:68237]"          
[50] "b[(Intercept) school:68241]"          
[51] "b[(Intercept) school:68255]"          
[52] "b[(Intercept) school:68271]"          
[53] "b[(Intercept) school:68303]"          
[54] "b[(Intercept) school:68321]"          
[55] "b[(Intercept) school:68329]"          
[56] "b[(Intercept) school:68405]"          
[57] "b[(Intercept) school:68411]"          
[58] "b[(Intercept) school:68417]"          
[59] "b[(Intercept) school:68531]"          
[60] "b[(Intercept) school:68611]"          
[61] "b[(Intercept) school:68629]"          
[62] "b[(Intercept) school:68711]"          
[63] "b[(Intercept) school:68723]"          
[64] "b[(Intercept) school:68805]"          
[65] "b[(Intercept) school:68809]"          
[66] "b[(Intercept) school:71927]"          
[67] "b[(Intercept) school:74330]"          
[68] "b[(Intercept) school:74862]"          
[69] "b[(Intercept) school:74874]"          
[70] "b[(Intercept) school:76531]"          
[71] "b[(Intercept) school:76631]"          
[72] "b[(Intercept) school:77207]"          
[73] "b[(Intercept) school:84707]"          
[74] "b[(Intercept) school:84772]"          
[75] "sigma"                                
[76] "Sigma[school:(Intercept),(Intercept)]"
# Obtain school-level varying intercept a_j
# draws for overall mean
mu_a_sims <- as.matrix(M1_stanlmer, 
                       pars = "(Intercept)")
# draws for 73 schools' school-level error
u_sims <- as.matrix(M1_stanlmer, 
                    regex_pars = "b\\[\\(Intercept\\) school\\:")
# draws for 73 schools' varying intercepts               
a_sims <- as.numeric(mu_a_sims) + u_sims          

# Obtain sigma_y and sigma_alpha^2
# draws for sigma_y
s_y_sims <- as.matrix(M1_stanlmer, 
                       pars = "sigma")
# draws for sigma_alpha^2
s__alpha_sims <- as.matrix(M1_stanlmer, 
                       pars = "Sigma[school:(Intercept),(Intercept)]")

3.4.2 Obtaining means, standard deviations, medians and 95% credible intervals.

In a_sims, we have saved 4,000 posterior draws (from all 4 chains) for the varying intercepts \(\alpha_{j}\) of the 73 schools. For example, the first column of the 4,000 by 73 matrix is a vector of 4,000 posterior simulation draws for the first school’s (School 20920) varying intercept \(\alpha_{1}\). One quantitative way to summarize the posterior probability distribution of these 4,000 estimates for \(\alpha_{1}\) is to examine their quantiles.

# Compute mean, SD, median, and 95% credible interval of varying intercepts

# Posterior mean and SD of each alpha
a_mean <- apply(X = a_sims,     # posterior mean
                MARGIN = 2,
                FUN = mean)
a_sd <- apply(X = a_sims,       # posterior SD
              MARGIN = 2,
              FUN = sd)

# Posterior median and 95% credible interval
a_quant <- apply(X = a_sims, 
                 MARGIN = 2, 
                 FUN = quantile, 
                 probs = c(0.025, 0.50, 0.975))
a_quant <- data.frame(t(a_quant))
names(a_quant) <- c("Q2.5", "Q50", "Q97.5")

# Combine summary statistics of posterior simulation draws
a_df <- data.frame(a_mean, a_sd, a_quant)
round(head(a_df), 2)
                            a_mean a_sd  Q2.5   Q50 Q97.5
b[(Intercept) school:20920]  63.62 4.63 54.48 63.55 72.68
b[(Intercept) school:22520]  57.15 1.80 53.61 57.12 60.78
b[(Intercept) school:22710]  81.63 3.24 75.21 81.68 87.84
b[(Intercept) school:22738]  73.07 4.19 64.96 73.05 81.34
b[(Intercept) school:22908]  66.63 5.04 56.62 66.65 76.61
b[(Intercept) school:23208]  79.26 2.80 73.74 79.26 84.71

We can produce a caterpillar plot to show the fully Bayes estimates for the school varying intercepts in rank order together with their 95% credible intervals.

# Sort dataframe containing an estimated alpha's mean and sd for every school
a_df <- a_df[order(a_df$a_mean), ]
a_df$a_rank <- c(1 : dim(a_df)[1])  # a vector of school rank 

# Plot school-level alphas's posterior mean and 95% credible interval
ggplot(data = a_df, 
       aes(x = a_rank, 
           y = a_mean)) +
  geom_pointrange(aes(ymin = Q2.5, 
                      ymax = Q97.5),
                  position = position_jitter(width = 0.1, 
                                             height = 0)) + 
  geom_hline(yintercept = mean(a_df$a_mean), 
             size = 0.5, 
             col = "red") + 
  scale_x_continuous("Rank", 
                     breaks = seq(from = 0, 
                                  to = 80, 
                                  by = 5)) + 
  scale_y_continuous(expression(paste("varying intercept, ", alpha[j]))) + 
  theme_bw( base_family = "serif")

Of course, the same approach can be taken to generate 95% credible intervales for \(\sigma_y\) and \(\sigma_\alpha\).

3.4.3 Making comparisons between individual schools

Having samples of all the parameters and varying intercepts from their joint posterior distribution makes it easy to draw inferences about functions of these parameters.

In education research and practice, it is often of interest to compare the schools included in the data. Relevant questions include (1) what is the difference between the means of schools A and B, (2) is school A performing better than school B and (3) what are the rankings of these schools within the sample. When non-Bayesian methods are used, we can attempt to make such comparisons based on empirical Bayes (or Best Linear Unbiased) predictions of the varying intercepts, but it will generally be impossible to express the uncertainty for nonlinear function such as rankings. See also Goldstein and Spiegelhalter (1996) for further discussion.

Here we will compare two schools as an example: Schools 60501 (the 21st school) and 68271 (the 51st school). We already have 4,000 posterior simulation draws for both schools. To make inferences regarding the difference between the average scores of the two schools, we can simply take the difference between the two vectors of draws \(\alpha_{51} - \alpha_{21}\).

# The difference between the two school averages (school #21 and #51)
school_diff <- a_sims[, 21] - a_sims[, 51]

We can investigate the posterior distribution of the difference with descriptive statistics and a histogram as follows:

# Investigate differences of two distributions
mean <- mean(school_diff)
sd <- sd(school_diff)
quantile <- quantile(school_diff, probs = c(0.025, 0.50, 0.975))
quantile <- data.frame(t(quantile))
names(quantile) <- c("Q2.5", "Q50", "Q97.5")
diff_df <- data.frame(mean, sd, quantile)
round(diff_df, 2)
  mean  sd  Q2.5  Q50 Q97.5
1 5.13 4.4 -3.53 5.17 13.55
# Histogram of the differences
ggplot(data = data.frame(school_diff), 
       aes(x = school_diff)) + 
  geom_histogram(color = "black", 
                 fill = "gray", 
                 binwidth = 0.75) + 
  scale_x_continuous("Score diffence between two schools: #21, #51",
                     breaks = seq(from = -20, 
                                  to = 20, 
                                  by = 10)) + 
  geom_vline(xintercept = c(mean(school_diff),
                            quantile(school_diff, 
                                     probs = c(0.025, 0.975))),
             colour = "red", 
             linetype = "longdash") + 
  geom_text(aes(5.11, 20, label = "mean = 5.11"), 
            color = "red", 
            size = 4) + 
  geom_text(aes(9, 50, label = "SD = 4.46"), 
            color = "blue", 
            size = 4) + 
  theme_bw( base_family = "serif") 

The expected difference comes to 5.11 with a standard deviation of 4.46 and a wide range of uncertainty. The 95% credible interval is [-3.64, 13.66], so we are 95% certain that the true value of the difference between the two schools lies within the range, given the data.

We also can get the proportion of the time that School 60501 has a higher mean than School 68271:

prop.table(table(a_sims[, 21] > a_sims[, 51]))

  FALSE    TRUE 
0.11725 0.88275 

This means that the posterior probability that School 60501 is better than School 68271 is 88.3%. Any pair of schools within the sample of schools can be compared in this manner.

4 Bayesian inference for Model 2 and 3

4.1 Model 2: Adding a student-level predictor

Researchers might want to extend the varying-intercept models by including observed explanatory variables at the student level \(x_{ij}\), in this example, an indicator variable for being female. A simple varying intercept model with one predictor at the student level can be written as \(y_{ij} \sim N(\alpha_{j} + \beta x_{ij}, \sigma_{y}^{2})\) and \(\alpha_{j} \sim N(\mu_{\alpha}, \sigma_{\alpha}^{2})\). We use noninformative prior distributions for the hyperparameters (\(\mu_{\alpha}\) and \(\sigma_{\alpha}\)) as specified in the varying intercept model with no predictors. Additionally, the regression coefficient \(\beta\) is given normal prior distributions with mean 0 and standard deviation 100. This states, roughly, that we expect this coefficient to be in the range \((-100, 100)\), and if the ML estimate is in this range, the prior distribution is providing very little information for the inference.

The above model can be fit using the stan_lmer() function in the rstanarm package as follows:

M2_stanlmer <- stan_lmer(formula = course ~ female + (1 | school), 
                         data = GCSE, 
                         prior = normal(location = 0, 
                                        scale = 100,
                                        autoscale = FALSE),
                         prior_intercept = normal(location = 0, 
                                                  scale = 100, 
                                                  autoscale = FALSE),
                         seed = 349)
prior_summary(object = M2_stanlmer)
Priors for model 'M2_stanlmer' 
------
Intercept (after predictors centered)
 ~ normal(location = 0, scale = 100)

Coefficients
 ~ normal(location = 0, scale = 100)

Auxiliary (sigma)
 ~ exponential(rate = 1)
     **adjusted scale = 16.32 (adjusted rate = 1/adjusted scale)

Covariance
 ~ decov(reg. = 1, conc. = 1, shape = 1, scale = 1)
------
See help('prior_summary.stanreg') for more details
M2_stanlmer
stan_lmer
 family:       gaussian [identity]
 formula:      course ~ female + (1 | school)
 observations: 1725
------
            Median MAD_SD
(Intercept) 69.7    1.2  
femaleF      6.7    0.7  
sigma       13.4    0.2  

Error terms:
 Groups   Name        Std.Dev.
 school   (Intercept)  9      
 Residual             13      
Num. levels: school 73 

Sample avg. posterior predictive distribution of y:
         Median MAD_SD
mean_PPD 73.4    0.5  

------
For info on the priors used see help('prior_summary.stanreg').

Note that instead of the default priors in stan_lmer, \(\mu_{\alpha}\) and \(\beta\) are given normal prior distributions with mean 0 and standard deviation 100 by specifying the arguments prior and prior_intercept as normal(location = 0, scale = 100, autoscale = FALSE). To prevent stan_lmer from scaling the prior, we need to make sure to append the argument autoscale = FALSE.

The point estimates of \(\mu_{\alpha}\), \(\beta\), and \(\sigma_{y}\) are almost identical to the ML estimates from the lmer() fit. However, partly because ML ignores the uncertainty about \(\mu_{\alpha}\) when estimating \(\sigma_{\alpha}\), the Bayesian estimate for \(\sigma_{\alpha}\) (\(9.0\)) is larger than the ML estimate (\(8.8\)), as with Model 1.

4.2 Model 3: Allowing for varying slopes across schools

We also use stan_lmer to fit Model 3 using the command below. Note that here, we use the default priors which are mostly similar to what was done in Model 1. Additionally, we are also required to specify a prior for the covariance matrix \(\Sigma\) for \(\alpha_j\) and \(\beta_j\) in this Model. stan_lmer decomposes this covariance matrix (up to a factor of \(\sigma_y\)) into (i) a correlation matrix \(R\) and (ii) a matrix of variances \(V\), and assigns them separate priors as shown below. \[ \begin{aligned} \Sigma &= \left(\begin{matrix} \sigma_\alpha^2 & \rho\sigma_\alpha \sigma_\beta \\ \rho\sigma_\alpha\sigma_\beta&\sigma_\beta^2 \end{matrix} \right)\\ &= \sigma_y^2\left(\begin{matrix} \sigma_\alpha^2/\sigma_y^2 & \rho\sigma_\alpha \sigma_\beta/\sigma_y^2 \\ \rho\sigma_\alpha\sigma_\beta/\sigma_y^2 & \sigma_\beta^2/\sigma_y^2 \end{matrix} \right)\\ &= \sigma_y^2\left(\begin{matrix} \sigma_\alpha/\sigma_y & 0 \\ 0&\sigma_\beta/\sigma_y \end{matrix} \right) \left(\begin{matrix} 1 & \rho\\ \rho&1 \end{matrix} \right) \left(\begin{matrix} \sigma_\alpha/\sigma_y & 0 \\ 0&\sigma_\beta/\sigma_y \end{matrix} \right)\\ &= \sigma_y^2VRV. \end{aligned} \]

The correlation matrix \(R\) is 2 by 2 matrix with 1’s on the diagonal and \(\rho\)’s on the off-diagonal. stan_lmer assigns it an LKJ11 prior (Lewandowski, Kurowicka, and Joe 2009), with regularization parameter 1. This is equivalent to assigning a uniform prior for \(\rho\). The more the regularization parameter exceeds one, the more peaked the distribution for \(\rho\) to take the value 0.

The matrix of (scaled) variances \(V\) can first be collapsed into a vector of (scaled) variances, and then decomposed into three parts, \(J\), \(\tau^2\) and \(\pi\) as shown below. \[ \left(\begin{matrix} \sigma_\alpha^2/\sigma_y^2 \\ \sigma_\beta^2/\sigma_y^2 \end{matrix} \right) = 2\left(\frac{\sigma_\alpha^2/\sigma_y^2 + \sigma_\beta^2/\sigma_y^2}{2}\right)\left(\begin{matrix} \frac{\sigma_\alpha^2/\sigma_y^2}{\sigma_\alpha^2/\sigma_y^2 + \sigma_\beta^2/\sigma_y^2} \\ \frac{\sigma_\beta^2/\sigma_y^2}{\sigma_\alpha^2/\sigma_y^2 + \sigma_\beta^2/\sigma_y^2} \end{matrix} \right)= J\tau^2 \pi. \]

In this formulation, \(J\) is the number of varying effects in the model (here, \(J=2\)), \(\tau^2\) can be viewed as an average (scaled) variance across the varying effects \(\alpha_j\) and \(\beta_j\), and \(\pi\) is a non-negative vector that sums to 1 (called a Simplex/probability vector). A symmetric Dirichlet12 distribution with concentration parameter set to 1 is then used as the prior for \(\pi\). By default, this implies a jointly uniform prior over all Simplex vectors of the same size. A scale-invariant Gamma prior with shape and scale parameters both set to 1 is then assigned for \(\tau\). This is equivalent to assigning as a prior the exponential distribution with rate parameter set to 1 which is consistent with the prior assigned to \(\sigma_y\).

M3_stanlmer <- stan_lmer(formula = course ~ female + (1 + female | school), 
                         data = GCSE,
                         seed = 349)
prior_summary(object = M3_stanlmer)
Priors for model 'M3_stanlmer' 
------
Intercept (after predictors centered)
 ~ normal(location = 0, scale = 10)
     **adjusted scale = 163.21

Coefficients
 ~ normal(location = 0, scale = 2.5)
     **adjusted scale = 40.80

Auxiliary (sigma)
 ~ exponential(rate = 1)
     **adjusted scale = 16.32 (adjusted rate = 1/adjusted scale)

Covariance
 ~ decov(reg. = 1, conc. = 1, shape = 1, scale = 1)
------
See help('prior_summary.stanreg') for more details
M3_stanlmer
stan_lmer
 family:       gaussian [identity]
 formula:      course ~ female + (1 + female | school)
 observations: 1725
------
            Median MAD_SD
(Intercept) 69.5    1.3  
femaleF      7.1    1.2  
sigma       13.0    0.2  

Error terms:
 Groups   Name        Std.Dev. Corr 
 school   (Intercept) 10.3          
          femaleF      7.1     -0.48
 Residual             13.0          
Num. levels: school 73 

Sample avg. posterior predictive distribution of y:
         Median MAD_SD
mean_PPD 73.4    0.5  

------
For info on the priors used see help('prior_summary.stanreg').

Here, we notice that the point estimates for \(\mu_{\alpha}\) and \(\sigma_{y}\) are identical to the ML estimates from lmer() fit. The point estimate for \(\beta\) is slightly different in this Model (7.12 compared to 7.13). Furthermore, as in the previous two models, the Bayesian estimate for \(\sigma_{\alpha}\) (10.3) is larger than the ML estimate (10.15). Additionally, the Bayesian estimates for \(\sigma_{\beta}\) (7.1) and \(\rho\) (-0.48) are larger than the corresponding ML estimates (6.92 and -0.52 respectively).

5 Evaluating model convergence

By default, all rstanarm modeling functions will run 4 randomly initialized Markov chains, each for 2,000 iterations (including a warmup period of 1,000 iterations). All chains must converge to the target distribution for inferences to be valid. The diagnostics which we use to assess whether the chains have converged to the posterior distribution are the statistics \(\hat{R}\) and \(N_{\text{eff}}\) (Gelman and Rubin 1992). Each parameter has the \(\hat{R}\) and \(N_{\text{eff}}\) statistic associated with it. As seen previously, these statistics are automatically generated when using the summary method.

The \(\hat{R}\) is essentially the ratio of between-chain variance to within-chain variance analogous to ANOVA. The \(\hat{R}\) statistic should be less than 1.1 if the chains have converged. To see a plot of the \(\hat{R}\) values across parameters we can use the plot method for stanreg object M1_stanlmer:

plot(M1_stanlmer, "rhat")

The \(N_{\text{eff}}\) statistic represents the effective number of simulation draws. If the draws were independent \(N_{\text{eff}}\) would be the number of saved draws, 4,000 (4 chains \(\times\) (2,000 iterations - 1,000 iterations for warmup)), but \(N_{\text{eff}}\) tends to be smaller than 4,000 because Markov chain simulations tend to be autocorrelated. The \(N_{\text{eff}}\) statistic should typically be at least 100 across parameters. We can plot a histogram of the ratio of effective sample size to total sample size:

plot(M1_stanlmer, "ess")

References

Browne, William J, David Draper, and others. 2006. “A Comparison of Bayesian and Likelihood-Based Methods for Fitting Multilevel Models.” Bayesian Analysis 1 (3): 473–514.

Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Gelman, Andrew, and Donald B Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science. JSTOR, 457–72.

Goldstein, Harvey, and David J Spiegelhalter. 1996. “League Tables and Their Limitations: Statistical Issues in Comparisons of Institutional Performance.” Journal of the Royal Statistical Society. Series A (Statistics in Society), 385–443.

Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100 (9). Elsevier: 1989–2001.

Rasbash, Jon, William Browne, Harvey Goldstein, Min Yang, Ian Plewis, Michael Healy, Geoff Woodhouse, David Draper, Ian Langford, and Toby Lewis. 2000. “A User’s Guide to Mlwin.” University of London, Institute of Education, Centre for Multilevel Modelling.


  1. Also known as hierarchical linear models, random-effects models, random coefficient models, or mixed models.

  2. Stan requires the data to be in the form of a list or an environment.

  3. Stan does not accept either NA values even for variables that are not being modeled.

  4. Stan requires identifiers to be sequential.

  5. Equivalently, the model can be expressed using a two-stage formulation as \[y_{ij} = \alpha_j + \beta x_{ij} +\epsilon_{ij},\] \[\alpha_j = \mu_\alpha + u_j,\] or in a reduced form as \[y_{ij} = \mu_\alpha + \beta x_{ij} + u_j + \epsilon_{ij}\] where \(\epsilon_{ij} \sim N(0, \sigma_{y}^{2})\) and \(u_{j}\sim N(0, \sigma_{\alpha}^{2})\).

  6. We elaborate more on prior distributions in Section 3.2

  7. For users who are not familiar with ggplot2 syntax, we refer them to here.

  8. Equivalently, the model can be expressed in a two-stage formulation as \[y_{ij} = \alpha_j + \beta_j x_{ij} +\epsilon_{ij},\] \[\alpha_j = \mu_\alpha + u_j,\] \[\beta_j = \mu_\beta + v_j,\] or in a reduced form as \[y_{ij} = \mu_\alpha + \mu_\beta x_{ij} + u_j + v_j x_{ij} + \epsilon_{ij}\] where \(\epsilon_{ij} \sim N(0, \sigma_{y}^{2})\) and \(\left( \begin{matrix} u_j \\ v_j \end{matrix} \right) \sim N\left( \left( \begin{matrix} 0 \\ 0 \end{matrix} \right) ,\left( \begin{matrix} { \sigma }_{ \alpha }^{ 2 } & \rho { \sigma }_{ \alpha }{ \sigma }_{ \beta } \\ \rho { \sigma }_{ \alpha }{ \sigma }_{ \beta } & { \sigma }_{ \beta }^{ 2 } \end{matrix} \right) \right)\).

  9. Regularization can be regarded as a technique to ensure that estimates are bounded within an acceptable range of values.

  10. For more information on regular expressions, see here

  11. For more details about the LKJ distribution, see here and here

  12. The Dirichlet distribution is a multivariate generalization of the beta distribution with one concentration parameter, which can be interpreted as prior counts of a multinomial random variable (the simplex vector in our context), for details, see here.