A common feature of data structures in education is that units of analysis (e.g., students) are nested in higher organizational clusters (e.g. schools). This kind of structure induces dependence among the responses observed for units within the same cluster. Students in the same school tend to be more alike in their academic and attitudinal characteristics than students chosen at random from the population at large.

*Multilevel models*^{1} are designed to model such within-cluster dependence. Multilevel models recognize the existence of data clustering (at two or more levels) by allowing for residual components at each level in the hierarchy. For example, a two-level model that allows for grouping of student outcomes within schools would include residuals at both the student and school level. The residual variance is thus partitioned into a between-school component (the variance of the school-level residuals) and a within-school component (the variance of the student-level residuals).

In this tutorial, we illustrate how to fit a multilevel linear model within a full Bayesian framework using **rstanarm**. This tutorial is aimed primarily at educational researchers who have used **lme4** in **R** to fit models to their data and who may be interested in learning how to fit Bayesian multilevel models. However, for readers who have not used **lme4** before, we briefly review the use of the package for fitting multilevel models.

This tutorial uses **Stan** version 2.17.3 and requires the following **R** packages.

```
# Required Packages
library(mlmRev)
library(lme4)
library(rstanarm)
library(ggplot2)
```

We will be analyzing the `Gcsemv`

dataset (Rasbash et al. 2000) from the **mlmRev** package in **R**. The data include the General Certificate of Secondary Education (GCSE) exam scores of 1,905 students from 73 schools in England on a science subject. The `Gcsemv`

dataset consists of the following 5 variables:

*school*: school identifier*student*: student identifier*gender*: gender of a student (M: Male, F: Female)*written*: total score on written paper*course*: total score on coursework paper

```
# Use example dataset from mlmRev package: GCSE exam score
data(Gcsemv, package = "mlmRev")
summary(Gcsemv)
```

```
school student gender written course
68137 : 104 77 : 14 F:1128 Min. : 0.60 Min. : 9.25
68411 : 84 83 : 14 M: 777 1st Qu.:37.00 1st Qu.: 62.90
68107 : 79 53 : 13 Median :46.00 Median : 75.90
68809 : 73 66 : 13 Mean :46.37 Mean : 73.39
22520 : 65 27 : 12 3rd Qu.:55.00 3rd Qu.: 86.10
60457 : 54 110 : 12 Max. :90.00 Max. :100.00
(Other):1446 (Other):1827 NA's :202 NA's :180
```

Two components of the exam were recorded as outcome variables: written paper and course work. In this tutorial, only the total score on the courework paper (*course*) will be analyzed. As seen above, there some of the observations have missing values for certain covariates. While we do not subset the data to only include complete cases to demonstrate that **rstanarm** automatically drops these observations, it is generally good practice to manually do so if required.

```
# Make Male the reference category and rename variable
Gcsemv$female <- relevel(Gcsemv$gender, "M")
# Use only total score on coursework paper
GCSE <- subset(x = Gcsemv,
select = c(school, student, female, course))
# Count unique schools and students
J <- length(unique(GCSE$school))
N <- nrow(GCSE)
```

The **rstanarm** package automates several data preprocessing steps making its use very similar to that of **lme4** in the following way.

Input -

**rstanarm**is able to take a data frame as input^{2}.Missing Data -

**rstanarm**automatically discards observations with NA values for any variable used in the model^{3}.Identifiers -

**rstanarm**does not require identifiers to be sequential^{4}. We do suggest that it is good practice for all cluster and unit identifiers, as well as categorical variables be stored as factors. This applies to using**lme4**as much as it does to**rstanarm**. One can check the structure of the variables by using the`str()`

function.

```
# Check structure of data frame
str(GCSE)
```

```
'data.frame': 1905 obs. of 4 variables:
$ school : Factor w/ 73 levels "20920","22520",..: 1 1 1 1 1 1 1 1 1 2 ...
$ student: Factor w/ 649 levels "1","2","3","4",..: 16 25 27 31 42 62 101 113 146 1 ...
$ female : Factor w/ 2 levels "M","F": 1 2 2 2 1 2 2 1 1 2 ...
$ course : num NA 71.2 76.8 87.9 44.4 NA 89.8 17.5 32.4 84.2 ...
```

`lmer()`

In this section, we briefly review three basic multilevel linear models which will be fit in this tutorial. Starting with a varying intercept model with no predictors (Model 1), we then proceed to the varying intercept model with one predictor (Model 2), and the varying intercept and slope model (Model 3).

Consider the simplest multilevel model for students \(i=1, ..., n\) nested within schools \(j=1, ..., J\) and for whom we have examination scores as responses. We can write a two-level varying intercept model with no predictors using the usual two-stage formulation as

\[y_{ij} = \alpha_{j} + \epsilon_{ij}, \text{ where } \epsilon_{ij} \sim N(0, \sigma_y^2)\] \[\alpha_j = \mu_{\alpha} + u_j, \text{ where } u_j \sim N(0, \sigma_\alpha^2)\]

where \(y_{ij}\) is the examination score for the *i*th student in the *j*th school, \(\alpha_{j}\) is the varying intercept for the *j*th school, and \(\mu_{\alpha}\) is the overall mean across schools. Alternatively, the model can be expressed in reduced form as \[y_{ij} = \mu_\alpha + u_j + \epsilon_{ij}.\]. If we further assume that the student-level errors \(\epsilon_{ij}\) are normally distributed with mean 0 and variance \(\sigma_{y}^{2}\), and that the school-level varying intercepts \(\alpha_{j}\) are normally distributed with mean \(\mu_{\alpha}\) and variance \(\sigma_{\alpha}^{2}\), then the model can be expressed as

\[y_{ij} \sim N(\alpha_{j}, \sigma_{y}^{2}),\] \[\alpha_{j}\sim N(\mu_{\alpha}, \sigma_{\alpha}^{2}),\]

This model can then be fit using `lmer()`

. We specify an intercept (the predictor “1”) and allow it to vary by the level-2 identifier (*school*). We also specify the `REML = FALSE`

option to obtain maximum likelihood (ML) estimates as opposed to the default restricted maximum likelihood (REML) estimates.

```
M1 <- lmer(formula = course ~ 1 + (1 | school),
data = GCSE,
REML = FALSE)
summary(M1)
```

```
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: course ~ 1 + (1 | school)
Data: GCSE
AIC BIC logLik deviance df.resid
14111.4 14127.7 -7052.7 14105.4 1722
Scaled residuals:
Min 1Q Median 3Q Max
-4.9693 -0.5101 0.1116 0.6741 2.7613
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 75.24 8.674
Residual 190.77 13.812
Number of obs: 1725, groups: school, 73
Fixed effects:
Estimate Std. Error t value
(Intercept) 73.72 1.11 66.4
```

Under the `Fixed effects`

part of the output, we see that the intercept \(\mu_{\alpha}\), averaged over the population of schools, is estimated as \(73.72\). Under the `Random effects`

part of the output, we see that the between-school standard deviation \(\sigma_{\alpha}\) is estimated as \(8.67\) and the within-school standard deviation \(\sigma_{y}\) as \(13.81\).

The varying intercept model^{5} with an indicator variable for being female \(x_{ij}\) can be written as \[y_{ij} \sim N(\alpha_{j}+\beta x_{ij} , \sigma_{y}^{2}),\] \[\alpha_{j}\sim N(\mu_{\alpha}, \sigma_{\alpha}^{2}).\] The equation of the average regression line across schools is \(\mu_{ij}=\mu_{\alpha}+\beta x_{ij}\). The regression lines for specific schools will be parallel to the average regression line (having the same slope \(\beta\)), but differ in terms of its intercept \(\alpha_{j}\). This model can be estimated by adding `female`

to the formula in the `lmer()`

function, which will allow only the intercept to vary by school, and while keeping the “slope” for being female constant across schools.

```
M2 <- lmer(formula = course ~ 1 + female + (1 | school),
data = GCSE,
REML = FALSE)
summary(M2)
```

```
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: course ~ 1 + female + (1 | school)
Data: GCSE
AIC BIC logLik deviance df.resid
14017.4 14039.2 -7004.7 14009.4 1721
Scaled residuals:
Min 1Q Median 3Q Max
-4.7809 -0.5401 0.1259 0.6795 2.6753
Random effects:
Groups Name Variance Std.Dev.
school (Intercept) 76.65 8.755
Residual 179.96 13.415
Number of obs: 1725, groups: school, 73
Fixed effects:
Estimate Std. Error t value
(Intercept) 69.730 1.185 58.87
femaleF 6.739 0.678 9.94
Correlation of Fixed Effects:
(Intr)
femaleF -0.338
```

The average regression line across schools is thus estimated as \(\hat{\mu}_{ij}=69.73+ 6.74 x_{ij}\), with \(\sigma_\alpha\) and \(\sigma_y\) estimated as \(8.76\) and \(13.41\) respectively. Treating these estimates of \(\mu_\alpha\), \(\beta\), \(\sigma^2_{y}\), and \(\sigma^2_{\alpha}\) as the true parameter values, we can then obtain the Best Linear Unbiased Predictions (BLUPs) for the school-level errors \(\hat{u}_j = \hat{\alpha}_{j} - \hat{\mu}_{\alpha}\).

The BLUPs are equivalent to the so-called Empirical Bayes (EB) prediction, which is the mean of the posterior distribution of \(u_{j}\) given all the estimated parameters, as well as the random variables \(y_{ij}\) and \(x_{ij}\) for the cluster. These predictions are called “Bayes” because they make use of the pre-specified prior distribution^{6} \(\alpha_j \sim N(\mu_\alpha, \sigma^2_\alpha)\), and by extension \(u_j \sim N(0, \sigma^2_\alpha)\), and called “Empirical” because the parameters of this prior, \(\mu_\alpha\) and \(\sigma^2_{\alpha}\), in addition to \(\beta\) and \(\sigma^2_{y}\), are estimated from the data.

Compared to the Maximum Likelihood (ML) approach of predicting values for \(u_j\) by using only the estimated parameters and data from cluster \(j\), the EB approach additionally consider the prior distribution of \(u_{j}\), and produces predicted values closer to \(0\) (a phenomenon described as *shrinkage* or *partial pooling*). To see why this phenomenon is called *shrinkage*, we usually express the estimates for \(u_j\) obtained from EB prediction as \(\hat{u}_j^{\text{EB}} = \hat{R}_j\hat{u}_j^{\text{ML}}\) where \(\hat{u}_j^{\text{ML}}\) are the ML estimates, and \(\hat{R}_j = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \frac{\sigma_y^2}{n_j}}\) is the so-called Shrinkage factor.

`head(ranef(M2)$school)`

```
(Intercept)
20920 -10.1702110
22520 -17.0578149
22710 7.8007260
22738 0.4871012
22908 -8.1940346
23208 4.4304453
```

These values estimate how much the intercept is shifted up or down in particular schools. For example, in the first school in the dataset, the estimated intercept is about 10.17 lower than average, so that the school-specific regression line is \((69.73 - 10.17) + 6.74 x_{ij}\).

Gelman and Hill (2006) characterize multilevel modeling as *partial pooling* (also called *shrinkage*), which is a compromise between two extremes: *complete pooling* in which the clustering is not considered in the model at all, and *no pooling*, in which separate intercepts are estimated for each school as coefficients of dummy variables. The estimated school-specific regression lines in the above model are based on *partial pooling* estimates. To show this, we first estimate the intercept and slope in each school three ways:

```
# Complete-pooling regression
pooled <- lm(formula = course ~ female,
data = GCSE)
a_pooled <- coef(pooled)[1] # complete-pooling intercept
b_pooled <- coef(pooled)[2] # complete-pooling slope
# No-pooling regression
nopooled <- lm(formula = course ~ 0 + school + female,
data = GCSE)
a_nopooled <- coef(nopooled)[1:J] # 73 no-pooling intercepts
b_nopooled <- coef(nopooled)[J+1]
# Partial pooling (multilevel) regression
a_part_pooled <- coef(M2)$school[, 1]
b_part_pooled <- coef(M2)$school[, 2]
```

Then, we plot^{7} the data and school-specific regression lines for a selection of eight schools using the following commands:

```
# (0) Set axes & choose schools
y <- GCSE$course
x <- as.numeric(GCSE$female) - 1 + runif(N, -.05, .05)
schid <- GCSE$school
sel.sch <- c("65385",
"68207",
"60729",
"67051",
"50631",
"60427",
"64321",
"68137")
# (1) Subset 8 of the schools; generate data frame
df <- data.frame(y, x, schid)
df8 <- subset(df, schid %in% sel.sch)
# (2) Assign complete-pooling, no-pooling, partial pooling estimates
df8$a_pooled <- a_pooled
df8$b_pooled <- b_pooled
df8$a_nopooled <- a_nopooled[df8$schid]
df8$b_nopooled <- b_nopooled
df8$a_part_pooled <- a_part_pooled[df8$schid]
df8$b_part_pooled <- b_part_pooled[df8$schid]
# (3) Plot regression fits for the 8 schools
ggplot(data = df8,
aes(x = x, y = y)) +
facet_wrap(facets = ~ schid,
ncol = 4) +
theme_bw() +
geom_jitter(position = position_jitter(width = .05,
height = 0)) +
geom_abline(aes(intercept = a_pooled,
slope = b_pooled),
linetype = "solid",
color = "blue",
size = 0.5) +
geom_abline(aes(intercept = a_nopooled,
slope = b_nopooled),
linetype = "longdash",
color = "red",
size = 0.5) +
geom_abline(aes(intercept = a_part_pooled,
slope = b_part_pooled),
linetype = "dotted",
color = "purple",
size = 0.7) +
scale_x_continuous(breaks = c(0, 1),
labels = c("male", "female")) +
labs(title = "Complete-pooling, No-pooling, and Partial pooling estimates",
x = "",
y = "Total score on coursework paper")+theme_bw( base_family = "serif")
```

We now extend the varying intercept model with a single predictor to allow both the intercept and the slope to vary randomly across schools using the following model^{8}:

\[y_{ij}\sim N(\alpha_{j}+\beta_{j}x_{ij} , \sigma_y ^2 ),\] \[\left( \begin{matrix} \alpha _{ j } \\ \beta _{ j } \end{matrix} \right) \sim N\left( \left( \begin{matrix} { \mu }_{ \alpha } \\ { \mu }_{ \beta } \end{matrix} \right) , \left( \begin{matrix} { \sigma }_{ \alpha }^{ 2 } & \rho { \sigma }_{ \alpha }{ \sigma }_{ \beta } \\ \rho { \sigma }_{ \alpha }{ \sigma }_{ \beta } & { \sigma }_{ \beta }^{ 2 } \end{matrix} \right) \right).\]

Note that now we have variation in the \(\alpha_{j}\)’s and the \(\beta_{j}\)’s, and also a correlation parameter \(\rho\) between \(\alpha_j\) and \(\beta_j\). This model can be fit using `lmer()`

as follows:

```
M3 <- lmer(formula = course ~ 1 + female + (1 + female | school),
data = GCSE,
REML = FALSE)
summary(M3)
```

```
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: course ~ 1 + female + (1 + female | school)
Data: GCSE
AIC BIC logLik deviance df.resid
13983.4 14016.2 -6985.7 13971.4 1719
Scaled residuals:
Min 1Q Median 3Q Max
-4.6886 -0.5222 0.1261 0.6529 2.6729
Random effects:
Groups Name Variance Std.Dev. Corr
school (Intercept) 102.93 10.146
femaleF 47.94 6.924 -0.52
Residual 169.79 13.030
Number of obs: 1725, groups: school, 73
Fixed effects:
Estimate Std. Error t value
(Intercept) 69.425 1.352 51.338
femaleF 7.128 1.131 6.302
Correlation of Fixed Effects:
(Intr)
femaleF -0.574
```

In this model, the residual within-school standard deviation is estimated as \(\hat{\sigma}_{y}=\) 13.03. The estimated standard deviations of the school intercepts and the school slopes are \(\hat{\sigma}_{\alpha}= 10.15\) and \(\hat{\sigma}_{\beta}= 6.92\) respectively. The estimated correlation between varying intercepts and slopes is \(\hat{\rho} = -0.52\). We can use code similar to that presented in section 2.2 to plot the data and school-specific regression lines for a selection of eight schools.