Loading [MathJax]/jax/output/HTML-CSS/jax.js

1 Introduction

A common feature of data structures in education is that units of analysis (e.g., students) are nested in higher organizational clusters (e.g. schools). This kind of structure induces dependence among the responses observed for units within the same cluster. Students in the same school tend to be more alike in their academic and attitudinal characteristics than students chosen at random from the population at large.

Multilevel models1 are designed to model such within-cluster dependence. Multilevel models recognize the existence of data clustering (at two or more levels) by allowing for residual components at each level in the hierarchy. For example, a two-level model that allows for grouping of student outcomes within schools would include residuals at both the student and school level. The residual variance is thus partitioned into a between-school component (the variance of the school-level residuals) and a within-school component (the variance of the student-level residuals).

In this tutorial, we illustrate how to fit a multilevel linear model within a full Bayesian framework using rstanarm. This tutorial is aimed primarily at educational researchers who have used lme4 in R to fit models to their data and who may be interested in learning how to fit Bayesian multilevel models. However, for readers who have not used lme4 before, we briefly review the use of the package for fitting multilevel models.

This tutorial uses Stan version 2.17.3 and requires the following R packages.

# Required Packages
library(mlmRev)
library(lme4)
library(rstanarm)
library(ggplot2)

1.1 Data example

We will be analyzing the Gcsemv dataset (Rasbash et al. 2000) from the mlmRev package in R. The data include the General Certificate of Secondary Education (GCSE) exam scores of 1,905 students from 73 schools in England on a science subject. The Gcsemv dataset consists of the following 5 variables:

  • school: school identifier
  • student: student identifier
  • gender: gender of a student (M: Male, F: Female)
  • written: total score on written paper
  • course: total score on coursework paper
# Use example dataset from mlmRev package: GCSE exam score
data(Gcsemv, package = "mlmRev")
summary(Gcsemv)
     school        student     gender      written          course      
 68137  : 104   77     :  14   F:1128   Min.   : 0.60   Min.   :  9.25  
 68411  :  84   83     :  14   M: 777   1st Qu.:37.00   1st Qu.: 62.90  
 68107  :  79   53     :  13            Median :46.00   Median : 75.90  
 68809  :  73   66     :  13            Mean   :46.37   Mean   : 73.39  
 22520  :  65   27     :  12            3rd Qu.:55.00   3rd Qu.: 86.10  
 60457  :  54   110    :  12            Max.   :90.00   Max.   :100.00  
 (Other):1446   (Other):1827            NA's   :202     NA's   :180     

Two components of the exam were recorded as outcome variables: written paper and course work. In this tutorial, only the total score on the courework paper (course) will be analyzed. As seen above, there some of the observations have missing values for certain covariates. While we do not subset the data to only include complete cases to demonstrate that rstanarm automatically drops these observations, it is generally good practice to manually do so if required.

# Make Male the reference category and rename variable
Gcsemv$female <- relevel(Gcsemv$gender, "M")


# Use only total score on coursework paper 
GCSE <- subset(x = Gcsemv, 
               select = c(school, student, female, course))

# Count unique schools and students
J <- length(unique(GCSE$school))
N <- nrow(GCSE)

The rstanarm package automates several data preprocessing steps making its use very similar to that of lme4 in the following way.

  • Input - rstanarm is able to take a data frame as input2.

  • Missing Data - rstanarm automatically discards observations with NA values for any variable used in the model3.

  • Identifiers - rstanarm does not require identifiers to be sequential4. We do suggest that it is good practice for all cluster and unit identifiers, as well as categorical variables be stored as factors. This applies to using lme4 as much as it does to rstanarm. One can check the structure of the variables by using the str() function.

# Check structure of data frame
str(GCSE)
'data.frame':   1905 obs. of  4 variables:
 $ school : Factor w/ 73 levels "20920","22520",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ student: Factor w/ 649 levels "1","2","3","4",..: 16 25 27 31 42 62 101 113 146 1 ...
 $ female : Factor w/ 2 levels "M","F": 1 2 2 2 1 2 2 1 1 2 ...
 $ course : num  NA 71.2 76.8 87.9 44.4 NA 89.8 17.5 32.4 84.2 ...

2 Likelihood inference using lmer()

In this section, we briefly review three basic multilevel linear models which will be fit in this tutorial. Starting with a varying intercept model with no predictors (Model 1), we then proceed to the varying intercept model with one predictor (Model 2), and the varying intercept and slope model (Model 3).

2.1 Model 1: Varying intercept model with no predictors (Variance components model)

Consider the simplest multilevel model for students i=1,...,n nested within schools j=1,...,J and for whom we have examination scores as responses. We can write a two-level varying intercept model with no predictors using the usual two-stage formulation as

yij=αj+ϵij, where ϵijN(0,σ2y) αj=μα+uj, where ujN(0,σ2α)

where yij is the examination score for the ith student in the jth school, αj is the varying intercept for the jth school, and μα is the overall mean across schools. Alternatively, the model can be expressed in reduced form as yij=μα+uj+ϵij.. If we further assume that the student-level errors ϵij are normally distributed with mean 0 and variance σ2y, and that the school-level varying intercepts αj are normally distributed with mean μα and variance σ2α, then the model can be expressed as

yijN(αj,σ2y), αjN(μα,σ2α),

This model can then be fit using lmer(). We specify an intercept (the predictor “1”) and allow it to vary by the level-2 identifier (school). We also specify the REML = FALSE option to obtain maximum likelihood (ML) estimates as opposed to the default restricted maximum likelihood (REML) estimates.

M1 <- lmer(formula = course ~ 1 + (1 | school), 
           data = GCSE, 
           REML = FALSE)
summary(M1)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: course ~ 1 + (1 | school)
   Data: GCSE

     AIC      BIC   logLik deviance df.resid 
 14111.4  14127.7  -7052.7  14105.4     1722 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.9693 -0.5101  0.1116  0.6741  2.7613 

Random effects:
 Groups   Name        Variance Std.Dev.
 school   (Intercept)  75.24    8.674  
 Residual             190.77   13.812  
Number of obs: 1725, groups:  school, 73

Fixed effects:
            Estimate Std. Error t value
(Intercept)    73.72       1.11    66.4

Under the Fixed effects part of the output, we see that the intercept μα, averaged over the population of schools, is estimated as 73.72. Under the Random effects part of the output, we see that the between-school standard deviation σα is estimated as 8.67 and the within-school standard deviation σy as 13.81.

2.2 Model 2: Varying intercept model with a single predictor

The varying intercept model5 with an indicator variable for being female xij can be written as yijN(αj+βxij,σ2y), αjN(μα,σ2α). The equation of the average regression line across schools is μij=μα+βxij. The regression lines for specific schools will be parallel to the average regression line (having the same slope β), but differ in terms of its intercept αj. This model can be estimated by adding female to the formula in the lmer() function, which will allow only the intercept to vary by school, and while keeping the “slope” for being female constant across schools.

M2 <- lmer(formula = course ~ 1 + female + (1 | school), 
           data = GCSE, 
           REML = FALSE)
summary(M2) 
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: course ~ 1 + female + (1 | school)
   Data: GCSE

     AIC      BIC   logLik deviance df.resid 
 14017.4  14039.2  -7004.7  14009.4     1721 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.7809 -0.5401  0.1259  0.6795  2.6753 

Random effects:
 Groups   Name        Variance Std.Dev.
 school   (Intercept)  76.65    8.755  
 Residual             179.96   13.415  
Number of obs: 1725, groups:  school, 73

Fixed effects:
            Estimate Std. Error t value
(Intercept)   69.730      1.185   58.87
femaleF        6.739      0.678    9.94

Correlation of Fixed Effects:
        (Intr)
femaleF -0.338

The average regression line across schools is thus estimated as ˆμij=69.73+6.74xij, with σα and σy estimated as 8.76 and 13.41 respectively. Treating these estimates of μα, β, σ2y, and σ2α as the true parameter values, we can then obtain the Best Linear Unbiased Predictions (BLUPs) for the school-level errors ˆuj=ˆαjˆμα.

The BLUPs are equivalent to the so-called Empirical Bayes (EB) prediction, which is the mean of the posterior distribution of uj given all the estimated parameters, as well as the random variables yij and xij for the cluster. These predictions are called “Bayes” because they make use of the pre-specified prior distribution6 αjN(μα,σ2α), and by extension ujN(0,σ2α), and called “Empirical” because the parameters of this prior, μα and σ2α, in addition to β and σ2y, are estimated from the data.

Compared to the Maximum Likelihood (ML) approach of predicting values for uj by using only the estimated parameters and data from cluster j, the EB approach additionally consider the prior distribution of uj, and produces predicted values closer to 0 (a phenomenon described as shrinkage or partial pooling). To see why this phenomenon is called shrinkage, we usually express the estimates for uj obtained from EB prediction as ˆuEBj=ˆRjˆuMLj where ˆuMLj are the ML estimates, and ˆRj=σ2ασ2α+σ2ynj is the so-called Shrinkage factor.

head(ranef(M2)$school)
      (Intercept)
20920 -10.1702110
22520 -17.0578149
22710   7.8007260
22738   0.4871012
22908  -8.1940346
23208   4.4304453

These values estimate how much the intercept is shifted up or down in particular schools. For example, in the first school in the dataset, the estimated intercept is about 10.17 lower than average, so that the school-specific regression line is (69.7310.17)+6.74xij.

Gelman and Hill (2006) characterize multilevel modeling as partial pooling (also called shrinkage), which is a compromise between two extremes: complete pooling in which the clustering is not considered in the model at all, and no pooling, in which separate intercepts are estimated for each school as coefficients of dummy variables. The estimated school-specific regression lines in the above model are based on partial pooling estimates. To show this, we first estimate the intercept and slope in each school three ways:

# Complete-pooling regression
pooled <- lm(formula = course ~ female,
             data = GCSE)
a_pooled <- coef(pooled)[1]   # complete-pooling intercept
b_pooled <- coef(pooled)[2]   # complete-pooling slope

# No-pooling regression
nopooled <- lm(formula = course ~ 0 + school + female,
               data = GCSE)
a_nopooled <- coef(nopooled)[1:J]   # 73 no-pooling intercepts              
b_nopooled <- coef(nopooled)[J+1]

# Partial pooling (multilevel) regression
a_part_pooled <- coef(M2)$school[, 1]
b_part_pooled <- coef(M2)$school[, 2]

Then, we plot7 the data and school-specific regression lines for a selection of eight schools using the following commands:

# (0) Set axes & choose schools
y <- GCSE$course
x <- as.numeric(GCSE$female) - 1 + runif(N, -.05, .05)
schid <- GCSE$school
sel.sch <- c("65385",
             "68207",
             "60729",
             "67051",
             "50631",
             "60427",
             "64321",
             "68137")

# (1) Subset 8 of the schools; generate data frame
df <- data.frame(y, x, schid)
df8 <- subset(df, schid %in% sel.sch)

# (2) Assign complete-pooling, no-pooling, partial pooling estimates
df8$a_pooled <- a_pooled 
df8$b_pooled <- b_pooled
df8$a_nopooled <- a_nopooled[df8$schid]
df8$b_nopooled <- b_nopooled
df8$a_part_pooled <- a_part_pooled[df8$schid]
df8$b_part_pooled <- b_part_pooled[df8$schid]

# (3) Plot regression fits for the 8 schools
ggplot(data = df8, 
       aes(x = x, y = y)) + 
  facet_wrap(facets = ~ schid, 
             ncol = 4) + 
  theme_bw() +
  geom_jitter(position = position_jitter(width = .05, 
                                         height = 0)) +
  geom_abline(aes(intercept = a_pooled, 
                  slope = b_pooled), 
              linetype = "solid", 
              color = "blue", 
              size = 0.5) +
  geom_abline(aes(intercept = a_nopooled, 
                  slope = b_nopooled), 
              linetype = "longdash", 
              color = "red", 
              size = 0.5) + 
  geom_abline(aes(intercept = a_part_pooled, 
                  slope = b_part_pooled), 
              linetype = "dotted", 
              color = "purple", 
              size = 0.7) + 
  scale_x_continuous(breaks = c(0, 1), 
                     labels = c("male", "female")) + 
  labs(title = "Complete-pooling, No-pooling, and Partial pooling estimates",
       x = "", 
       y = "Total score on coursework paper")+theme_bw( base_family = "serif")
The blue-solid, red-dashed, and purple-dotted lines show the complete-pooling, no-pooling, and partial pooling estimates respectively. 
 We see that the estimated school-specific regression line from the partial pooling estimates lies between the complete-pooling and no-pooling regression lines. There is more pooling (purple dotted line closer to blue solid line) in schools with small sample sizes.

The blue-solid, red-dashed, and purple-dotted lines show the complete-pooling, no-pooling, and partial pooling estimates respectively. We see that the estimated school-specific regression line from the partial pooling estimates lies between the complete-pooling and no-pooling regression lines. There is more pooling (purple dotted line closer to blue solid line) in schools with small sample sizes.

2.3 Model 3: Varying intercept and slope model with a single predictor

We now extend the varying intercept model with a single predictor to allow both the intercept and the slope to vary randomly across schools using the following model8:

yijN(αj+βjxij,σ2y), (αjβj)N((μαμβ),(σ2αρσασβρσασβσ2β)).

Note that now we have variation in the αj’s and the βj’s, and also a correlation parameter ρ between αj and βj. This model can be fit using lmer() as follows:

M3 <- lmer(formula = course ~ 1 + female + (1 + female | school), 
           data = GCSE, 
           REML = FALSE)
summary(M3) 
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: course ~ 1 + female + (1 + female | school)
   Data: GCSE

     AIC      BIC   logLik deviance df.resid 
 13983.4  14016.2  -6985.7  13971.4     1719 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.6886 -0.5222  0.1261  0.6529  2.6729 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 school   (Intercept) 102.93   10.146        
          femaleF      47.94    6.924   -0.52
 Residual             169.79   13.030        
Number of obs: 1725, groups:  school, 73

Fixed effects:
            Estimate Std. Error t value
(Intercept)   69.425      1.352  51.338
femaleF        7.128      1.131   6.302

Correlation of Fixed Effects:
        (Intr)
femaleF -0.574

In this model, the residual within-school standard deviation is estimated as ˆσy= 13.03. The estimated standard deviations of the school intercepts and the school slopes are ˆσα=10.15 and ˆσβ=6.92 respectively. The estimated correlation between varying intercepts and slopes is ˆρ=0.52. We can use code similar to that presented in section 2.2 to plot the data and school-specific regression lines for a selection of eight schools.

3 Bayesian inference for Model 1

We can quickly and easily fit many multilevel models using the lmer() function in R. As previously mentioned, functions such as lmer() are based on a combination of maximum likelihood (ML) estimation of the model parameters, and empirical Bayes (EB) predictions of the varying intercepts and/or slopes. However, in some instances, when the number of groups is small or when the model contains many varying coefficients or non-nested components, the ML approach may not work as well in part because there may not be enough information to estimate variance parameters precisely. In such cases, Restricted Maximum Likelihood (REML) Estimation provides more reasonable inferences.

A fully Bayesian approach also provides reasonable inferences in these instances with the added benefit of accounting for all the uncertainty in the parameter estimates when predicting the varying intercepts and slopes, and their associated uncertainty. This is one of several reasons that one should be interested in fully Bayesian estimation. Other reasons are discussed in Section 3.4.3. One may start by quickly fitting many specifications in building a model using the lmer() function, and then take advantage of the flexibility of a fully Bayesian approach using rstanarm to obtain simulations summarizing uncertainty about coefficients, predictions, and other quantities of interest.

In this section, we present how to fit and evaluate Model 1 using the rstanarm package. Alternatively, but not covered in this tutorial, one can also create a hand-written program in Stan and run it using the rstan package.

3.1 Using the rstanarm package

Many relatively simple models can be fit using the rstanarm package without writing any code in the Stan language. The rstanarm package is a wrapper for the rstan package that enables the most common applied regression models to be estimated using Markov Chain Monte Carlo (MCMC) but still be specified using customary R modeling syntax. Education researchers can use Bayesian estimation for multilevel models with only minimal changes to their existing code with lmer().

For example, Model 1 with default prior distributions for μα, σα, and σy can be specified using the rstanarm package by prepending stan_ to the lmer call:

M1_stanlmer <- stan_lmer(formula = course ~ 1 + (1 | school), 
                         data = GCSE,
                         seed = 349)

This stan_lmer() function is similar in syntax to lmer() but rather than performing maximum likelihood estimation, Bayesian estimation is performed via MCMC. As each step in the MCMC estimation approach involves random draws from the parameter space, we include a seed option to ensure that each time the code is run, stan_lmer outputs the same results.

3.2 Prior distributions

Model 1 is a varying intercept model with normally distributed student residuals and school-level intercepts: yijN(αj,σ2y), and αjN(μα,σ2α). The normal distribution for the αj’s can be thought of as a prior distribution for these varying intercepts. The parameters of this prior distribution, μα and σα, are estimated from the data when using maximum likelihood estimation. In full Bayesian inference, all the hyperparameters (μα and σα), along with the other unmodeled parameters (in this case, σy) also need a prior distribution.

Here, we use the default prior distributions for the hyperparameters in stan_lmer by not specifying any prior options in stan_lmer() function. The default priors are intended to be weakly informative in that they provide moderate regularization9 and help stabilize computation. It should be noted that the authors of rstanarm suggest not relying on rstanarm to specify the default prior for a model, but rather, to specify the priors explicitly even if they are indeed the current default, as updates to the package may result in different defaults.

First, before accounting for the scale of the variables, μα is given normal prior distribution with mean 0 and standard deviation 10. That is, μαN(0,102). The standard deviation of this prior distribution, 10, is five times as large as the standard deviation of the response if it were standardized. This should be a close approximation to a noninformative prior over the range supported by the likelihood, which should give inferences similar to those obtained by maximum likelihood methods if similarly weak priors are used for the other parameters.

Second, the (unscaled) prior for σy is set to an exponential distribution with rate parameter set to 1.

Third, in order to specify a prior for the variances and covariances of the varying (or “random”) effects, rstanarm will decompose this matrix into a correlation matrix of the varying effects and a function of their variances. Since there is only one varying effect in this example, the default (unscaled) prior for σα that rstanarm uses reduces to an exponential distribution with rate parameter set to 1.

It should also be noted that rstanarm will scale the priors unless the autoscale = FALSE option is used. After fitting a model using stan_lmer, we can check the priors that are used by invoking the prior_summary() function.

# Obtain a summary of priors used
prior_summary(object = M1_stanlmer)
Priors for model 'M1_stanlmer' 
------
Intercept (after predictors centered)
 ~ normal(location = 0, scale = 10)
     **adjusted scale = 163.21

Auxiliary (sigma)
 ~ exponential(rate = 1)
     **adjusted scale = 16.32 (adjusted rate = 1/adjusted scale)

Covariance
 ~ decov(reg. = 1, conc. = 1, shape = 1, scale = 1)
------
See help('prior_summary.stanreg') for more details
# Obtain SD of outcome
sd(GCSE$course, na.rm = TRUE)
[1] 16.32096

As seen above, the scales of the priors for μα and σy are set to 163.21 and 16.32 respectively after rescaling. Since the default prior for the intercept is normal with a scale parameter of 10, the rescaled prior is also normal but with a scale parameter of scale×SD(y)=10×16.321=163.21. Similarly, since the default prior for σy is exponential with a rate parameter of 1 (or equivalently, scale parameter scale=1rate=1), the rescaled prior is likewise exponential with a scale parameter of scale×SD(y)=1×16.321=16.32.

3.3 Direct output from stan_lmer

3.3.1 Posterior medians and posterior median absolute deviations

We can display a quick summary of the fit from Model 1 by using the print method in the following manner:

print(M1_stanlmer, digits = 2)
stan_lmer
 family:       gaussian [identity]
 formula:      course ~ 1 + (1 | school)
 observations: 1725
------
            Median MAD_SD
(Intercept) 73.77   1.08 
sigma       13.82   0.25 

Error terms:
 Groups   Name        Std.Dev.
 school   (Intercept)  8.88   
 Residual             13.82   
Num. levels: school 73 

Sample avg. posterior predictive distribution of y:
         Median MAD_SD
mean_PPD 73.40   0.47 

------
For info on the priors used see help('prior_summary.stanreg').

Here, the point estimate of μα from stan_lmer is 73.75 and this corresponds to the median of the posterior draws. This is similar to the ML estimate obtained from lmer. The point estimate for σα from stan_lmer is 8.88, which is larger than the ML estimate (8.67). This discrepancy may be partly because the ML approach in lmer() does not take into account the uncertainty in μα when estimating σα. The REML approach (8.75) in lmer(), as mentioned previously, does in fact account for this uncertainty.

When using stan_lmer, standard errors are obtained by considering the median absolute deviation (MAD) of each draw from the median of those draws. It is well known that ML tends to underestimate uncertainties because it relies on point estimates of hyperparameters. Full Bayes, on the other hand, propagates the uncertainty in the hyperparameters throughout all levels of the model and provides more appropriate estimates of uncertainty. See also W. J. Browne, Draper, and others (2006) for further discussion.

3.3.2 Posterior means, posterior standard deviations, 95% credible intervals and Monte Carlo errors

While the use of the median and the MAD of the posterior draws for estimation and inference are the default outputs from rstanarm, users may instead prefer to use the mean and the standard deviation of the posterior distribution instead. Additionally, users may be interested in credible intervals, a concept unique to Bayesian statistics that is the analogue to confidence intervals in frequentist statistics. Unlike the latter, the 95% credible intervals have a 95% probability of containing the true value of the parameter given the data. This 95% credible interval is typically obtained by considering the 2.5th to 97.5th percentiles of the distribution of posterior draws.

To obtain these estimates, we can make use of the summary method as follows. The pars argument specifies which the parameter estimates to display. Here, (Intercept), sigma, and Sigma[school:(Intercept),(Intercept)] correspond to μα, σy, and σ2α. The probs argument specifies which quantiles of the posterior distribution to display.

summary(M1_stanlmer, 
        pars = c("(Intercept)", "sigma", "Sigma[school:(Intercept),(Intercept)]"),
        probs = c(0.025, 0.975),
        digits = 2)

Model Info:

 function:     stan_lmer
 family:       gaussian [identity]
 formula:      course ~ 1 + (1 | school)
 algorithm:    sampling
 priors:       see help('prior_summary')
 sample:       4000 (posterior sample size)
 observations: 1725
 groups:       school (73)

Estimates:
                                        mean   sd     2.5%   97.5%
(Intercept)                            73.75   1.11  71.46  75.91 
sigma                                  13.82   0.25  13.35  14.31 
Sigma[school:(Intercept),(Intercept)]  78.91  15.89  52.71 115.82 

Diagnostics:
                                      mcse Rhat n_eff
(Intercept)                           0.04 1.00  826 
sigma                                 0.00 1.00 4000 
Sigma[school:(Intercept),(Intercept)] 0.56 1.00  803 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

It is worthwhile to note that when using the summary method, the estimate for the standard deviation σy is the the mean of the posterior draws of the parameter. This is in contrast to the median of the posterior draws that we obtain when using the print method. One advantage of using the median is that the estimate for σ2y is simply the square of the estimate for σy if the number of samples is odd. This is not true when using the mean. In this case, and more generally when we need to evaluate other functions of the parameters, we need to access the posterior draws directly. This is detailed in the next section.

Under Diagnostics, we refer the reader to Section 5 for more information about Rhat and n_eff. The values under mcse represent estimates for the Monte Carlo standard errors which represent the randomness associated with each MCMC estimation run. That is, with the same data set, repeatedly using an MCMC approach to estimate a paramater produces estimates with a standard deviation equal to the Monte Carlo standard error.

3.4 Other output from stan_lmer

As mentioned, users may prefer to work directly with the posterior draws to obtain estimates of more complex parameters. To do so, users need to manually access them from the stan_lmer object. We demonstrate how to do so in the context of making comparisons between individuals schools.

3.4.1 Accessing the simulations and summarizing results

Based on the default settings, stan_lmer generates 4 MCMC chains of 2,000 iterations each. Half of these iterations in each chain are used as warm-up/burn-in (to allow the chain to converge to the posterior distribution), and hence we only use 1,000 samples per chain. These MCMC-generated samples are taken to be drawn from the posterior distributions of the parameters in the model. We can use these samples for predictions, summarizing uncertainty and estimating credible intervals for any function of the parameters.

To access the posterior draws for all the parameters, we apply the method as.matrix() to the stanreg object M1_stanlmer. This returns an S by P matrix, where S is the size of the posterior sample (or equivalently, the number of MCMC iterations after warm-up) and P is the number of parameters/quantities. We can then generate a matrix for varying intercepts αj as well as vectors containing the draws for the within standard deviation and the between variance by manipulating this matrix. Note that in order to select the correct columns for the parameter of interest, it is useful to explore the column names of the matrix sims.

A more direct approach to obtaining the posterior draws for specific parameters is to make use of the built in functionality of the as.matrix method for stanreg objects. When applying the as.matrix method to a stanreg object, the user is able to specify either an optional character vector of parameter names, or an optional character vector of regular expressions10 to extract the posterior draws of only the parameters they are interested in. For example, since the parameter representing the overall mean is labelled (Intercept), we can extract the posterior draws of only this parameter by including the option pars = "(Intercept)". Similarly, since the parameters representing the 73 school-level errors all contain the string b[(Intercept) school: we can extract all parameters that contain this string by using the option regex_pars = "b\\[\\(Intercept\\) school\\:".

# Extract the posterior draws for all parameters
sims <- as.matrix(M1_stanlmer)
dim(sims)
[1] 4000   76
para_name <- colnames(sims)
para_name
 [1] "(Intercept)"                          
 [2] "b[(Intercept) school:20920]"          
 [3] "b[(Intercept) school:22520]"          
 [4] "b[(Intercept) school:22710]"          
 [5] "b[(Intercept) school:22738]"          
 [6] "b[(Intercept) school:22908]"          
 [7] "b[(Intercept) school:23208]"          
 [8] "b[(Intercept) school:25241]"          
 [9] "b[(Intercept) school:30474]"          
[10] "b[(Intercept) school:35270]"          
[11] "b[(Intercept) school:37224]"          
[12] "b[(Intercept) school:47627]"          
[13] "b[(Intercept) school:50627]"          
[14] "b[(Intercept) school:50631]"          
[15] "b[(Intercept) school:60421]"          
[16] "b[(Intercept) school:60427]"          
[17] "b[(Intercept) school:60437]"          
[18] "b[(Intercept) school:60439]"          
[19] "b[(Intercept) school:60441]"          
[20] "b[(Intercept) school:60455]"          
[21] "b[(Intercept) school:60457]"          
[22] "b[(Intercept) school:60501]"          
[23] "b[(Intercept) school:60729]"          
[24] "b[(Intercept) school:60741]"          
[25] "b[(Intercept) school:63619]"          
[26] "b[(Intercept) school:63833]"          
[27] "b[(Intercept) school:64251]"          
[28] "b[(Intercept) school:64321]"          
[29] "b[(Intercept) school:64327]"          
[30] "b[(Intercept) school:64343]"          
[31] "b[(Intercept) school:64359]"          
[32] "b[(Intercept) school:64428]"          
[33] "b[(Intercept) school:65385]"          
[34] "b[(Intercept) school:66365]"          
[35] "b[(Intercept) school:67051]"          
[36] "b[(Intercept) school:67105]"          
[37] "b[(Intercept) school:67311]"          
[38] "b[(Intercept) school:68107]"          
[39] "b[(Intercept) school:68111]"          
[40] "b[(Intercept) school:68121]"          
[41] "b[(Intercept) school:68125]"          
[42] "b[(Intercept) school:68133]"          
[43] "b[(Intercept) school:68137]"          
[44] "b[(Intercept) school:68201]"          
[45] "b[(Intercept) school:68207]"          
[46] "b[(Intercept) school:68217]"          
[47] "b[(Intercept) school:68227]"          
[48] "b[(Intercept) school:68233]"          
[49] "b[(Intercept) school:68237]"          
[50] "b[(Intercept) school:68241]"          
[51] "b[(Intercept) school:68255]"          
[52] "b[(Intercept) school:68271]"          
[53] "b[(Intercept) school:68303]"          
[54] "b[(Intercept) school:68321]"          
[55] "b[(Intercept) school:68329]"          
[56] "b[(Intercept) school:68405]"          
[57] "b[(Intercept) school:68411]"          
[58] "b[(Intercept) school:68417]"          
[59] "b[(Intercept) school:68531]"          
[60] "b[(Intercept) school:68611]"          
[61] "b[(Intercept) school:68629]"          
[62] "b[(Intercept) school:68711]"          
[63] "b[(Intercept) school:68723]"          
[64] "b[(Intercept) school:68805]"          
[65] "b[(Intercept) school:68809]"          
[66] "b[(Intercept) school:71927]"          
[67] "b[(Intercept) school:74330]"          
[68] "b[(Intercept) school:74862]"          
[69] "b[(Intercept) school:74874]"          
[70] "b[(Intercept) school:76531]"          
[71] "b[(Intercept) school:76631]"          
[72] "b[(Intercept) school:77207]"          
[73] "b[(Intercept) school:84707]"          
[74] "b[(Intercept) school:84772]"          
[75] "sigma"                                
[76] "Sigma[school:(Intercept),(Intercept)]"
# Obtain school-level varying intercept a_j
# draws for overall mean
mu_a_sims <- as.matrix(M1_stanlmer, 
                       pars = "(Intercept)")
# draws for 73 schools' school-level error
u_sims <- as.matrix(M1_stanlmer, 
                    regex_pars = "b\\[\\(Intercept\\) school\\:")
# draws for 73 schools' varying intercepts               
a_sims <- as.numeric(mu_a_sims) + u_sims          

# Obtain sigma_y and sigma_alpha^2
# draws for sigma_y
s_y_sims <- as.matrix(M1_stanlmer, 
                       pars = "sigma")
# draws for sigma_alpha^2
s__alpha_sims <- as.matrix(M1_stanlmer, 
                       pars = "Sigma[school:(Intercept),(Intercept)]")

3.4.2 Obtaining means, standard deviations, medians and 95% credible intervals.

In a_sims, we have saved 4,000 posterior draws (from all 4 chains) for the varying intercepts αj of the 73 schools. For example, the first column of the 4,000 by 73 matrix is a vector of 4,000 posterior simulation draws for the first school’s (School 20920) varying intercept α1. One quantitative way to summarize the posterior probability distribution of these 4,000 estimates for α1 is to examine their quantiles.

# Compute mean, SD, median, and 95% credible interval of varying intercepts

# Posterior mean and SD of each alpha
a_mean <- apply(X = a_sims,     # posterior mean
                MARGIN = 2,
                FUN = mean)
a_sd <- apply(X = a_sims,       # posterior SD
              MARGIN = 2,
              FUN = sd)

# Posterior median and 95% credible interval
a_quant <- apply(X = a_sims, 
                 MARGIN = 2, 
                 FUN = quantile, 
                 probs = c(0.025, 0.50, 0.975))
a_quant <- data.frame(t(a_quant))
names(a_quant) <- c("Q2.5", "Q50", "Q97.5")

# Combine summary statistics of posterior simulation draws
a_df <- data.frame(a_mean, a_sd, a_quant)
round(head(a_df), 2)
                            a_mean a_sd  Q2.5   Q50 Q97.5
b[(Intercept) school:20920]  63.62 4.63 54.48 63.55 72.68
b[(Intercept) school:22520]  57.15 1.80 53.61 57.12 60.78
b[(Intercept) school:22710]  81.63 3.24 75.21 81.68 87.84
b[(Intercept) school:22738]  73.07 4.19 64.96 73.05 81.34
b[(Intercept) school:22908]  66.63 5.04 56.62 66.65 76.61
b[(Intercept) school:23208]  79.26 2.80 73.74 79.26 84.71

We can produce a caterpillar plot to show the fully Bayes estimates for the school varying intercepts in rank order together with their 95% credible intervals.

# Sort dataframe containing an estimated alpha's mean and sd for every school
a_df <- a_df[order(a_df$a_mean), ]
a_df$a_rank <- c(1 : dim(a_df)[1])  # a vector of school rank 

# Plot school-level alphas's posterior mean and 95% credible interval
ggplot(data = a_df, 
       aes(x = a_rank, 
           y = a_mean)) +
  geom_pointrange(aes(ymin = Q2.5, 
                      ymax = Q97.5),
                  position = position_jitter(width = 0.1, 
                                             height = 0)) + 
  geom_hline(yintercept = mean(a_df$a_mean), 
             size = 0.5, 
             col = "red") + 
  scale_x_continuous("Rank", 
                     breaks = seq(from = 0, 
                                  to = 80, 
                                  by = 5)) + 
  scale_y_continuous(expression(paste("varying intercept, ", alpha[j]))) + 
  theme_bw( base_family = "serif")

Of course, the same approach can be taken to generate 95% credible intervales for σy and σα.

3.4.3 Making comparisons between individual schools

Having samples of all the parameters and varying intercepts from their joint posterior distribution makes it easy to draw inferences about functions of these parameters.

In education research and practice, it is often of interest to compare the schools included in the data. Relevant questions include (1) what is the difference between the means of schools A and B, (2) is school A performing better than school B and (3) what are the rankings of these schools within the sample. When non-Bayesian methods are used, we can attempt to make such comparisons based on empirical Bayes (or Best Linear Unbiased) predictions of the varying intercepts, but it will generally be impossible to express the uncertainty for nonlinear function such as rankings. See also Goldstein and Spiegelhalter (1996) for further discussion.

Here we will compare two schools as an example: Schools 60501 (the 21st school) and 68271 (the 51st school). We already have 4,000 posterior simulation draws for both schools. To make inferences regarding the difference between the average scores of the two schools, we can simply take the difference between the two vectors of draws α51α21.

# The difference between the two school averages (school #21 and #51)
school_diff <- a_sims[, 21] - a_sims[, 51]

We can investigate the posterior distribution of the difference with descriptive statistics and a histogram as follows:

# Investigate differences of two distributions
mean <- mean(school_diff)
sd <- sd(school_diff)
quantile <- quantile(school_diff, probs = c(0.025, 0.50, 0.975))
quantile <- data.frame(t(quantile))
names(quantile) <- c("Q2.5", "Q50", "Q97.5")
diff_df <- data.frame(mean, sd, quantile)
round(diff_df, 2)
  mean  sd  Q2.5  Q50 Q97.5
1 5.13 4.4 -3.53 5.17 13.55
# Histogram of the differences
ggplot(data = data.frame(school_diff), 
       aes(x = school_diff)) + 
  geom_histogram(color = "black", 
                 fill = "gray", 
                 binwidth = 0.75) + 
  scale_x_continuous("Score diffence between two schools: #21, #51",
                     breaks = seq(from = -20, 
                                  to = 20, 
                                  by = 10)) + 
  geom_vline(xintercept = c(mean(school_diff),
                            quantile(school_diff, 
                                     probs = c(0.025, 0.975))),
             colour = "red", 
             linetype = "longdash") + 
  geom_text(aes(5.11, 20, label = "mean = 5.11"), 
            color = "red", 
            size = 4) + 
  geom_text(aes(9, 50, label = "SD = 4.46"), 
            color = "blue", 
            size = 4) + 
  theme_bw( base_family = "serif") 

The expected difference comes to 5.11 with a standard deviation of 4.46 and a wide range of uncertainty. The 95% credible interval is [-3.64, 13.66], so we are 95% certain that the true value of the difference between the two schools lies within the range, given the data.

We also can get the proportion of the time that School 60501 has a higher mean than School 68271:

prop.table(table(a_sims[, 21] > a_sims[, 51]))

  FALSE    TRUE 
0.11725 0.88275 

This means that the posterior probability that School 60501 is better than School 68271 is 88.3%. Any pair of schools within the sample of schools can be compared in this manner.

4 Bayesian inference for Model 2 and 3

4.1 Model 2: Adding a student-level predictor

Researchers might want to extend the varying-intercept models by including observed explanatory variables at the student level xij, in this example, an indicator variable for being female. A simple varying intercept model with one predictor at the student level can be written as yijN(αj+βxij,σ2y) and αjN(μα,σ2α). We use noninformative prior distributions for the hyperparameters (μα and σα) as specified in the varying intercept model with no predictors. Additionally, the regression coefficient β is given normal prior distributions with mean 0 and standard deviation 100. This states, roughly, that we expect this coefficient to be in the range (100,100), and if the ML estimate is in this range, the prior distribution is providing very little information for the inference.

The above model can be fit using the stan_lmer() function in the rstanarm package as follows:

M2_stanlmer <- stan_lmer(formula = course ~ female + (1 | school), 
                         data = GCSE, 
                         prior = normal(location = 0, 
                                        scale = 100,
                                        autoscale = FALSE),
                         prior_intercept = normal(location = 0, 
                                                  scale = 100, 
                                                  autoscale = FALSE),
                         seed = 349)
prior_summary(object = M2_stanlmer)
Priors for model 'M2_stanlmer' 
------
Intercept (after predictors centered)
 ~ normal(location = 0, scale = 100)

Coefficients
 ~ normal(location = 0, scale = 100)

Auxiliary (sigma)
 ~ exponential(rate = 1)
     **adjusted scale = 16.32 (adjusted rate = 1/adjusted scale)

Covariance
 ~ decov(reg. = 1, conc. = 1, shape = 1, scale = 1)
------
See help('prior_summary.stanreg') for more details
M2_stanlmer
stan_lmer
 family:       gaussian [identity]
 formula:      course ~ female + (1 | school)
 observations: 1725
------
            Median MAD_SD
(Intercept) 69.7    1.2  
femaleF      6.7    0.7  
sigma       13.4    0.2  

Error terms:
 Groups   Name        Std.Dev.
 school   (Intercept)  9      
 Residual             13      
Num. levels: school 73 

Sample avg. posterior predictive distribution of y:
         Median MAD_SD
mean_PPD 73.4    0.5  

------
For info on the priors used see help('prior_summary.stanreg').

Note that instead of the default priors in stan_lmer, μα and β are given normal prior distributions with mean 0 and standard deviation 100 by specifying the arguments prior and prior_intercept as normal(location = 0, scale = 100, autoscale = FALSE). To prevent stan_lmer from scaling the prior, we need to make sure to append the argument autoscale = FALSE.

The point estimates of μα, β, and σy are almost identical to the ML estimates from the lmer() fit. However, partly because ML ignores the uncertainty about μα when estimating σα, the Bayesian estimate for σα (9.0) is larger than the ML estimate (8.8), as with Model 1.

4.2 Model 3: Allowing for varying slopes across schools

We also use stan_lmer to fit Model 3 using the command below. Note that here, we use the default priors which are mostly similar to what was done in Model 1. Additionally, we are also required to specify a prior for the covariance matrix Σ for αj and βj in this Model. stan_lmer decomposes this covariance matrix (up to a factor of σy) into (i) a correlation matrix R and (ii) a matrix of variances V, and assigns them separate priors as shown below. Σ=(σ2αρσασβρσασβσ2β)=σ2y(σ2α/σ2yρσασβ/σ2yρσασβ/σ2yσ2β/σ2y)=σ2y(σα/σy00σβ/σy)(1ρρ1)(σα/σy00σβ/σy)=σ2yVRV.

The correlation matrix R is 2 by 2 matrix with 1’s on the diagonal and ρ’s on the off-diagonal. stan_lmer assigns it an LKJ11 prior (Lewandowski, Kurowicka, and Joe 2009), with regularization parameter 1. This is equivalent to assigning a uniform prior for ρ. The more the regularization parameter exceeds one, the more peaked the distribution for ρ to take the value 0.

The matrix of (scaled) variances V can first be collapsed into a vector of (scaled) variances, and then decomposed into three parts, J, τ2 and π as shown below. (σ2α/σ2yσ2β/σ2y)=2(σ2α/σ2y+σ2β/σ2y2)(σ2α/σ2yσ2α/σ2y+σ2β/σ2yσ2β/σ2yσ2α/σ2y+σ2β/σ2y)=Jτ2π.

In this formulation, J is the number of varying effects in the model (here, J=2), τ2 can be viewed as an average (scaled) variance across the varying effects αj and βj, and π is a non-negative vector that sums to 1 (called a Simplex/probability vector). A symmetric Dirichlet12 distribution with concentration parameter set to 1 is then used as the prior for π. By default, this implies a jointly uniform prior over all Simplex vectors of the same size. A scale-invariant Gamma prior with shape and scale parameters both set to 1 is then assigned for τ. This is equivalent to assigning as a prior the exponential distribution with rate parameter set to 1 which is consistent with the prior assigned to σy.

M3_stanlmer <- stan_lmer(formula = course ~ female + (1 + female | school), 
                         data = GCSE,
                         seed = 349)
prior_summary(object = M3_stanlmer)
Priors for model 'M3_stanlmer' 
------
Intercept (after predictors centered)
 ~ normal(location = 0, scale = 10)
     **adjusted scale = 163.21

Coefficients
 ~ normal(location = 0, scale = 2.5)
     **adjusted scale = 40.80

Auxiliary (sigma)
 ~ exponential(rate = 1)
     **adjusted scale = 16.32 (adjusted rate = 1/adjusted scale)

Covariance
 ~ decov(reg. = 1, conc. = 1, shape = 1, scale = 1)
------
See help('prior_summary.stanreg') for more details
M3_stanlmer
stan_lmer
 family:       gaussian [identity]
 formula:      course ~ female + (1 + female | school)
 observations: 1725
------
            Median MAD_SD
(Intercept) 69.5    1.3  
femaleF      7.1    1.2  
sigma       13.0    0.2  

Error terms:
 Groups   Name        Std.Dev. Corr 
 school   (Intercept) 10.3          
          femaleF      7.1     -0.48
 Residual             13.0          
Num. levels: school 73 

Sample avg. posterior predictive distribution of y:
         Median MAD_SD
mean_PPD 73.4    0.5  

------
For info on the priors used see help('prior_summary.stanreg').

Here, we notice that the point estimates for μα and σy are identical to the ML estimates from lmer() fit. The point estimate for β is slightly different in this Model (7.12 compared to 7.13). Furthermore, as in the previous two models, the Bayesian estimate for σα (10.3) is larger than the ML estimate (10.15). Additionally, the Bayesian estimates for σβ (7.1) and ρ (-0.48) are larger than the corresponding ML estimates (6.92 and -0.52 respectively).

5 Evaluating model convergence

By default, all rstanarm modeling functions will run 4 randomly initialized Markov chains, each for 2,000 iterations (including a warmup period of 1,000 iterations). All chains must converge to the target distribution for inferences to be valid. The diagnostics which we use to assess whether the chains have converged to the posterior distribution are the statistics ˆR and Neff (Gelman and Rubin 1992). Each parameter has the ˆR and Neff statistic associated with it. As seen previously, these statistics are automatically generated when using the summary method.

The ˆR is essentially the ratio of between-chain variance to within-chain variance analogous to ANOVA. The ˆR statistic should be less than 1.1 if the chains have converged. To see a plot of the ˆR values across parameters we can use the plot method for stanreg object M1_stanlmer:

plot(M1_stanlmer, "rhat")

The Neff statistic represents the effective number of simulation draws. If the draws were independent Neff would be the number of saved draws, 4,000 (4 chains × (2,000 iterations - 1,000 iterations for warmup)), but Neff tends to be smaller than 4,000 because Markov chain simulations tend to be autocorrelated. The Neff statistic should typically be at least 100 across parameters. We can plot a histogram of the ratio of effective sample size to total sample size:

plot(M1_stanlmer, "ess")

References

Browne, William J, David Draper, and others. 2006. “A Comparison of Bayesian and Likelihood-Based Methods for Fitting Multilevel Models.” Bayesian Analysis 1 (3): 473–514.

Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.

Gelman, Andrew, and Donald B Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science. JSTOR, 457–72.

Goldstein, Harvey, and David J Spiegelhalter. 1996. “League Tables and Their Limitations: Statistical Issues in Comparisons of Institutional Performance.” Journal of the Royal Statistical Society. Series A (Statistics in Society), 385–443.

Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100 (9). Elsevier: 1989–2001.

Rasbash, Jon, William Browne, Harvey Goldstein, Min Yang, Ian Plewis, Michael Healy, Geoff Woodhouse, David Draper, Ian Langford, and Toby Lewis. 2000. “A User’s Guide to Mlwin.” University of London, Institute of Education, Centre for Multilevel Modelling.


  1. Also known as hierarchical linear models, random-effects models, random coefficient models, or mixed models.

  2. Stan requires the data to be in the form of a list or an environment.

  3. Stan does not accept either NA values even for variables that are not being modeled.

  4. Stan requires identifiers to be sequential.

  5. Equivalently, the model can be expressed using a two-stage formulation as yij=αj+βxij+ϵij, αj=μα+uj, or in a reduced form as yij=μα+βxij+uj+ϵij where ϵijN(0,σ2y) and ujN(0,σ2α).

  6. We elaborate more on prior distributions in Section 3.2

  7. For users who are not familiar with ggplot2 syntax, we refer them to here.

  8. Equivalently, the model can be expressed in a two-stage formulation as yij=αj+βjxij+ϵij, αj=μα+uj, βj=μβ+vj, or in a reduced form as yij=μα+μβxij+uj+vjxij+ϵij where ϵijN(0,σ2y) and (ujvj)N((00),(σ2αρσασβρσασβσ2β)).

  9. Regularization can be regarded as a technique to ensure that estimates are bounded within an acceptable range of values.

  10. For more information on regular expressions, see here

  11. For more details about the LKJ distribution, see here and here

  12. The Dirichlet distribution is a multivariate generalization of the beta distribution with one concentration parameter, which can be interpreted as prior counts of a multinomial random variable (the simplex vector in our context), for details, see here.