# 1 Introduction

A common feature of data structures in education is that units of analysis (e.g., students) are nested in higher organizational clusters (e.g. schools). This kind of structure induces dependence among the responses observed for units within the same cluster. Students in the same school tend to be more alike in their academic and attitudinal characteristics than students chosen at random from the population at large.

Multilevel models1 are designed to model such within-cluster dependence. Multilevel models recognize the existence of data clustering (at two or more levels) by allowing for residual components at each level in the hierarchy. For example, a two-level model that allows for grouping of student outcomes within schools would include residuals at both the student and school level. The residual variance is thus partitioned into a between-school component (the variance of the school-level residuals) and a within-school component (the variance of the student-level residuals).

In this tutorial, we illustrate how to fit a multilevel linear model within a full Bayesian framework using rstanarm. This tutorial is aimed primarily at educational researchers who have used lme4 in R to fit models to their data and who may be interested in learning how to fit Bayesian multilevel models. However, for readers who have not used lme4 before, we briefly review the use of the package for fitting multilevel models.

This tutorial uses Stan version 2.17.3 and requires the following R packages.

# Required Packages
library(mlmRev)
library(lme4)
library(rstanarm)
library(ggplot2)

## 1.1 Data example

We will be analyzing the Gcsemv dataset (Rasbash et al. 2000) from the mlmRev package in R. The data include the General Certificate of Secondary Education (GCSE) exam scores of 1,905 students from 73 schools in England on a science subject. The Gcsemv dataset consists of the following 5 variables:

• school: school identifier
• student: student identifier
• gender: gender of a student (M: Male, F: Female)
• written: total score on written paper
• course: total score on coursework paper
# Use example dataset from mlmRev package: GCSE exam score
data(Gcsemv, package = "mlmRev")
summary(Gcsemv)
     school        student     gender      written          course
68137  : 104   77     :  14   F:1128   Min.   : 0.60   Min.   :  9.25
68411  :  84   83     :  14   M: 777   1st Qu.:37.00   1st Qu.: 62.90
68107  :  79   53     :  13            Median :46.00   Median : 75.90
68809  :  73   66     :  13            Mean   :46.37   Mean   : 73.39
22520  :  65   27     :  12            3rd Qu.:55.00   3rd Qu.: 86.10
60457  :  54   110    :  12            Max.   :90.00   Max.   :100.00
(Other):1446   (Other):1827            NA's   :202     NA's   :180     

Two components of the exam were recorded as outcome variables: written paper and course work. In this tutorial, only the total score on the courework paper (course) will be analyzed. As seen above, there some of the observations have missing values for certain covariates. While we do not subset the data to only include complete cases to demonstrate that rstanarm automatically drops these observations, it is generally good practice to manually do so if required.

# Make Male the reference category and rename variable
Gcsemv$female <- relevel(Gcsemv$gender, "M")

# Use only total score on coursework paper
GCSE <- subset(x = Gcsemv,
select = c(school, student, female, course))

# Count unique schools and students
J <- length(unique(GCSE$school)) N <- nrow(GCSE) The rstanarm package automates several data preprocessing steps making its use very similar to that of lme4 in the following way. • Input - rstanarm is able to take a data frame as input2. • Missing Data - rstanarm automatically discards observations with NA values for any variable used in the model3. • Identifiers - rstanarm does not require identifiers to be sequential4. We do suggest that it is good practice for all cluster and unit identifiers, as well as categorical variables be stored as factors. This applies to using lme4 as much as it does to rstanarm. One can check the structure of the variables by using the str() function. # Check structure of data frame str(GCSE) 'data.frame': 1905 obs. of 4 variables:$ school : Factor w/ 73 levels "20920","22520",..: 1 1 1 1 1 1 1 1 1 2 ...
$student: Factor w/ 649 levels "1","2","3","4",..: 16 25 27 31 42 62 101 113 146 1 ...$ female : Factor w/ 2 levels "M","F": 1 2 2 2 1 2 2 1 1 2 ...
$course : num NA 71.2 76.8 87.9 44.4 NA 89.8 17.5 32.4 84.2 ... # 2 Likelihood inference using lmer() In this section, we briefly review three basic multilevel linear models which will be fit in this tutorial. Starting with a varying intercept model with no predictors (Model 1), we then proceed to the varying intercept model with one predictor (Model 2), and the varying intercept and slope model (Model 3). ## 2.1 Model 1: Varying intercept model with no predictors (Variance components model) Consider the simplest multilevel model for students $$i=1, ..., n$$ nested within schools $$j=1, ..., J$$ and for whom we have examination scores as responses. We can write a two-level varying intercept model with no predictors using the usual two-stage formulation as $y_{ij} = \alpha_{j} + \epsilon_{ij}, \text{ where } \epsilon_{ij} \sim N(0, \sigma_y^2)$ $\alpha_j = \mu_{\alpha} + u_j, \text{ where } u_j \sim N(0, \sigma_\alpha^2)$ where $$y_{ij}$$ is the examination score for the ith student in the jth school, $$\alpha_{j}$$ is the varying intercept for the jth school, and $$\mu_{\alpha}$$ is the overall mean across schools. Alternatively, the model can be expressed in reduced form as $y_{ij} = \mu_\alpha + u_j + \epsilon_{ij}.$. If we further assume that the student-level errors $$\epsilon_{ij}$$ are normally distributed with mean 0 and variance $$\sigma_{y}^{2}$$, and that the school-level varying intercepts $$\alpha_{j}$$ are normally distributed with mean $$\mu_{\alpha}$$ and variance $$\sigma_{\alpha}^{2}$$, then the model can be expressed as $y_{ij} \sim N(\alpha_{j}, \sigma_{y}^{2}),$ $\alpha_{j}\sim N(\mu_{\alpha}, \sigma_{\alpha}^{2}),$ This model can then be fit using lmer(). We specify an intercept (the predictor “1”) and allow it to vary by the level-2 identifier (school). We also specify the REML = FALSE option to obtain maximum likelihood (ML) estimates as opposed to the default restricted maximum likelihood (REML) estimates. M1 <- lmer(formula = course ~ 1 + (1 | school), data = GCSE, REML = FALSE) summary(M1) Linear mixed model fit by maximum likelihood ['lmerMod'] Formula: course ~ 1 + (1 | school) Data: GCSE AIC BIC logLik deviance df.resid 14111.4 14127.7 -7052.7 14105.4 1722 Scaled residuals: Min 1Q Median 3Q Max -4.9693 -0.5101 0.1116 0.6741 2.7613 Random effects: Groups Name Variance Std.Dev. school (Intercept) 75.24 8.674 Residual 190.77 13.812 Number of obs: 1725, groups: school, 73 Fixed effects: Estimate Std. Error t value (Intercept) 73.72 1.11 66.4 Under the Fixed effects part of the output, we see that the intercept $$\mu_{\alpha}$$, averaged over the population of schools, is estimated as $$73.72$$. Under the Random effects part of the output, we see that the between-school standard deviation $$\sigma_{\alpha}$$ is estimated as $$8.67$$ and the within-school standard deviation $$\sigma_{y}$$ as $$13.81$$. ## 2.2 Model 2: Varying intercept model with a single predictor The varying intercept model5 with an indicator variable for being female $$x_{ij}$$ can be written as $y_{ij} \sim N(\alpha_{j}+\beta x_{ij} , \sigma_{y}^{2}),$ $\alpha_{j}\sim N(\mu_{\alpha}, \sigma_{\alpha}^{2}).$ The equation of the average regression line across schools is $$\mu_{ij}=\mu_{\alpha}+\beta x_{ij}$$. The regression lines for specific schools will be parallel to the average regression line (having the same slope $$\beta$$), but differ in terms of its intercept $$\alpha_{j}$$. This model can be estimated by adding female to the formula in the lmer() function, which will allow only the intercept to vary by school, and while keeping the “slope” for being female constant across schools. M2 <- lmer(formula = course ~ 1 + female + (1 | school), data = GCSE, REML = FALSE) summary(M2)  Linear mixed model fit by maximum likelihood ['lmerMod'] Formula: course ~ 1 + female + (1 | school) Data: GCSE AIC BIC logLik deviance df.resid 14017.4 14039.2 -7004.7 14009.4 1721 Scaled residuals: Min 1Q Median 3Q Max -4.7809 -0.5401 0.1259 0.6795 2.6753 Random effects: Groups Name Variance Std.Dev. school (Intercept) 76.65 8.755 Residual 179.96 13.415 Number of obs: 1725, groups: school, 73 Fixed effects: Estimate Std. Error t value (Intercept) 69.730 1.185 58.87 femaleF 6.739 0.678 9.94 Correlation of Fixed Effects: (Intr) femaleF -0.338 The average regression line across schools is thus estimated as $$\hat{\mu}_{ij}=69.73+ 6.74 x_{ij}$$, with $$\sigma_\alpha$$ and $$\sigma_y$$ estimated as $$8.76$$ and $$13.41$$ respectively. Treating these estimates of $$\mu_\alpha$$, $$\beta$$, $$\sigma^2_{y}$$, and $$\sigma^2_{\alpha}$$ as the true parameter values, we can then obtain the Best Linear Unbiased Predictions (BLUPs) for the school-level errors $$\hat{u}_j = \hat{\alpha}_{j} - \hat{\mu}_{\alpha}$$. The BLUPs are equivalent to the so-called Empirical Bayes (EB) prediction, which is the mean of the posterior distribution of $$u_{j}$$ given all the estimated parameters, as well as the random variables $$y_{ij}$$ and $$x_{ij}$$ for the cluster. These predictions are called “Bayes” because they make use of the pre-specified prior distribution6 $$\alpha_j \sim N(\mu_\alpha, \sigma^2_\alpha)$$, and by extension $$u_j \sim N(0, \sigma^2_\alpha)$$, and called “Empirical” because the parameters of this prior, $$\mu_\alpha$$ and $$\sigma^2_{\alpha}$$, in addition to $$\beta$$ and $$\sigma^2_{y}$$, are estimated from the data. Compared to the Maximum Likelihood (ML) approach of predicting values for $$u_j$$ by using only the estimated parameters and data from cluster $$j$$, the EB approach additionally consider the prior distribution of $$u_{j}$$, and produces predicted values closer to $$0$$ (a phenomenon described as shrinkage or partial pooling). To see why this phenomenon is called shrinkage, we usually express the estimates for $$u_j$$ obtained from EB prediction as $$\hat{u}_j^{\text{EB}} = \hat{R}_j\hat{u}_j^{\text{ML}}$$ where $$\hat{u}_j^{\text{ML}}$$ are the ML estimates, and $$\hat{R}_j = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \frac{\sigma_y^2}{n_j}}$$ is the so-called Shrinkage factor. head(ranef(M2)$school)
      (Intercept)
20920 -10.1702110
22520 -17.0578149
22710   7.8007260
22738   0.4871012
22908  -8.1940346
23208   4.4304453

These values estimate how much the intercept is shifted up or down in particular schools. For example, in the first school in the dataset, the estimated intercept is about 10.17 lower than average, so that the school-specific regression line is $$(69.73 - 10.17) + 6.74 x_{ij}$$.

Gelman and Hill (2006) characterize multilevel modeling as partial pooling (also called shrinkage), which is a compromise between two extremes: complete pooling in which the clustering is not considered in the model at all, and no pooling, in which separate intercepts are estimated for each school as coefficients of dummy variables. The estimated school-specific regression lines in the above model are based on partial pooling estimates. To show this, we first estimate the intercept and slope in each school three ways:

# Complete-pooling regression
pooled <- lm(formula = course ~ female,
data = GCSE)
a_pooled <- coef(pooled)[1]   # complete-pooling intercept
b_pooled <- coef(pooled)[2]   # complete-pooling slope

# No-pooling regression
nopooled <- lm(formula = course ~ 0 + school + female,
data = GCSE)
a_nopooled <- coef(nopooled)[1:J]   # 73 no-pooling intercepts
b_nopooled <- coef(nopooled)[J+1]

# Partial pooling (multilevel) regression
a_part_pooled <- coef(M2)$school[, 1] b_part_pooled <- coef(M2)$school[, 2]

Then, we plot7 the data and school-specific regression lines for a selection of eight schools using the following commands:

# (0) Set axes & choose schools
y <- GCSE$course x <- as.numeric(GCSE$female) - 1 + runif(N, -.05, .05)
schid <- GCSE$school sel.sch <- c("65385", "68207", "60729", "67051", "50631", "60427", "64321", "68137") # (1) Subset 8 of the schools; generate data frame df <- data.frame(y, x, schid) df8 <- subset(df, schid %in% sel.sch) # (2) Assign complete-pooling, no-pooling, partial pooling estimates df8$a_pooled <- a_pooled
df8$b_pooled <- b_pooled df8$a_nopooled <- a_nopooled[df8$schid] df8$b_nopooled <- b_nopooled
df8$a_part_pooled <- a_part_pooled[df8$schid]
df8$b_part_pooled <- b_part_pooled[df8$schid]

# (3) Plot regression fits for the 8 schools
ggplot(data = df8,
aes(x = x, y = y)) +
facet_wrap(facets = ~ schid,
ncol = 4) +
theme_bw() +
geom_jitter(position = position_jitter(width = .05,
height = 0)) +
geom_abline(aes(intercept = a_pooled,
slope = b_pooled),
linetype = "solid",
color = "blue",
size = 0.5) +
geom_abline(aes(intercept = a_nopooled,
slope = b_nopooled),
linetype = "longdash",
color = "red",
size = 0.5) +
geom_abline(aes(intercept = a_part_pooled,
slope = b_part_pooled),
linetype = "dotted",
color = "purple",
size = 0.7) +
scale_x_continuous(breaks = c(0, 1),
labels = c("male", "female")) +
labs(title = "Complete-pooling, No-pooling, and Partial pooling estimates",
x = "",
y = "Total score on coursework paper")+theme_bw( base_family = "serif")

## 2.3 Model 3: Varying intercept and slope model with a single predictor

We now extend the varying intercept model with a single predictor to allow both the intercept and the slope to vary randomly across schools using the following model8:

$y_{ij}\sim N(\alpha_{j}+\beta_{j}x_{ij} , \sigma_y ^2 ),$ $\left( \begin{matrix} \alpha _{ j } \\ \beta _{ j } \end{matrix} \right) \sim N\left( \left( \begin{matrix} { \mu }_{ \alpha } \\ { \mu }_{ \beta } \end{matrix} \right) , \left( \begin{matrix} { \sigma }_{ \alpha }^{ 2 } & \rho { \sigma }_{ \alpha }{ \sigma }_{ \beta } \\ \rho { \sigma }_{ \alpha }{ \sigma }_{ \beta } & { \sigma }_{ \beta }^{ 2 } \end{matrix} \right) \right).$

Note that now we have variation in the $$\alpha_{j}$$’s and the $$\beta_{j}$$’s, and also a correlation parameter $$\rho$$ between $$\alpha_j$$ and $$\beta_j$$. This model can be fit using lmer() as follows:

M3 <- lmer(formula = course ~ 1 + female + (1 + female | school),
data = GCSE,
REML = FALSE)
summary(M3) 
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: course ~ 1 + female + (1 + female | school)
Data: GCSE

AIC      BIC   logLik deviance df.resid
13983.4  14016.2  -6985.7  13971.4     1719

Scaled residuals:
Min      1Q  Median      3Q     Max
-4.6886 -0.5222  0.1261  0.6529  2.6729

Random effects:
Groups   Name        Variance Std.Dev. Corr
school   (Intercept) 102.93   10.146
femaleF      47.94    6.924   -0.52
Residual             169.79   13.030
Number of obs: 1725, groups:  school, 73

Fixed effects:
Estimate Std. Error t value
(Intercept)   69.425      1.352  51.338
femaleF        7.128      1.131   6.302

Correlation of Fixed Effects:
(Intr)
femaleF -0.574

In this model, the residual within-school standard deviation is estimated as $$\hat{\sigma}_{y}=$$ 13.03. The estimated standard deviations of the school intercepts and the school slopes are $$\hat{\sigma}_{\alpha}= 10.15$$ and $$\hat{\sigma}_{\beta}= 6.92$$ respectively. The estimated correlation between varying intercepts and slopes is $$\hat{\rho} = -0.52$$. We can use code similar to that presented in section 2.2 to plot the data and school-specific regression lines for a selection of eight schools.