## 1.11 Item-response theory models

Item-response theory (IRT) models the situation in which a number of students each answer one or more of a group of test questions. The model is based on parameters for the ability of the students, the difficulty of the questions, and in more articulated models, the discriminativeness of the questions and the probability of guessing correctly; see Andrew Gelman and Hill (2007, pps. 314–320) for a textbook introduction to hierarchical IRT models and Curtis (2010) for encodings of a range of IRT models in BUGS.

### Data declaration with missingness

The data provided for an IRT model may be declared as follows to account for the fact that not every student is required to answer every question.

```
data {
int<lower=1> J; // number of students
int<lower=1> K; // number of questions
int<lower=1> N; // number of observations
int<lower=1,upper=J> jj[N]; // student for observation n
int<lower=1,upper=K> kk[N]; // question for observation n
int<lower=0,upper=1> y[N]; // correctness for observation n
}
```

This declares a total of `N`

student-question pairs in the data
set, where each `n`

in `1:N`

indexes a binary observation
`y[n]`

of the correctness of the answer of student `jj[n]`

on question `kk[n]`

.

The prior hyperparameters will be hard coded in the rest of this section for simplicity, though they could be coded as data in Stan for more flexibility.

### 1PL (Rasch) model

The 1PL item-response model, also known as the Rasch model, has one parameter (1P) for questions and uses the logistic link function (L).

The model parameters are declared as follows.

```
parameters {
real delta; // mean student ability
real alpha[J]; // ability of student j - mean ability
real beta[K]; // difficulty of question k
}
```

The parameter `alpha[J]`

is the ability coefficient for student
`j`

and `beta[k]`

is the difficulty coefficient for question
`k`

. The non-standard parameterization used here also includes
an intercept term `delta`

, which represents the average student’s
response to the average question.^{4}

The model itself is as follows.

```
model {
alpha ~ std_normal(); // informative true prior
beta ~ std_normal(); // informative true prior
delta ~ normal(0.75, 1); // informative true prior
for (n in 1:N)
y[n] ~ bernoulli_logit(alpha[jj[n]] - beta[kk[n]] + delta);
}
```

This model uses the logit-parameterized Bernoulli distribution, where \[ \texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right) = \texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right). \]

The key to understanding it is the term inside the
`bernoulli_logit`

distribution, from which it follows that
\[
\textsf{Pr}[y_n = 1] = \operatorname{logit}^{-1}\left(\alpha_{jj[n]} - \beta_{kk[n]}
+ \delta\right).
\]

The model suffers from additive identifiability issues without the priors. For example, adding a term \(\xi\) to each \(\alpha_j\) and \(\beta_k\) results in the same predictions. The use of priors for \(\alpha\) and \(\beta\) located at 0 identifies the parameters; see Andrew Gelman and Hill (2007) for a discussion of identifiability issues and alternative approaches to identification.

For testing purposes, the IRT 1PL model distributed with Stan uses informative priors that match the actual data generation process used to simulate the data in R (the simulation code is supplied in the same directory as the models). This is unrealistic for most practical applications, but allows Stan’s inferences to be validated. A simple sensitivity analysis with fatter priors shows that the posterior is fairly sensitive to the prior even with 400 students and 100 questions and only 25% missingness at random. For real applications, the priors should be fit hierarchically along with the other parameters, as described in the next section.

### Multilevel 2PL model

The simple 1PL model described in the previous section is generalized in this section with the addition of a discrimination parameter to model how noisy a question is and by adding multilevel priors for the question difficulty and discrimination parameters. The model parameters are declared as follows.

```
parameters {
real mu_beta; // mean question difficulty
vector[J] alpha; // ability for j - mean
vector[K] beta; // difficulty for k
vector<lower=0>[K] gamma; // discrimination of k
real<lower=0> sigma_beta; // scale of difficulties
real<lower=0> sigma_gamma; // scale of log discrimination
}
```

The parameters should be clearer after the model definition.

```
model {
alpha ~ std_normal();
beta ~ normal(0, sigma_beta);
gamma ~ lognormal(0, sigma_gamma);
mu_beta ~ cauchy(0, 5);
sigma_beta ~ cauchy(0, 5);
sigma_gamma ~ cauchy(0, 5);
y ~ bernoulli_logit(gamma[kk] .* (alpha[jj] - (beta[kk] + mu_beta)));
}
```

The `std_normal`

function is used here, defined by
\[
\texttt{std}\mathtt{\_}\texttt{normal}(y)
=
\textsf{normal}\left(y \mid 0, 1\right).
\]

The sampling statement is also vectorized using elementwise multiplication; it is equivalent to

```
for (n in 1:N)
y[n] ~ bernoulli_logit(gamma[kk[n]]
* (alpha[jj[n]] - (beta[kk[n]] + mu_beta));
```

The 2PL model is similar to the 1PL model, with the additional parameter
`gamma[k]`

modeling how discriminative question `k`

is. If
`gamma[k]`

is greater than 1, responses are more attenuated with
less chance of getting a question right at random. The parameter
`gamma[k]`

is constrained to be positive, which prohibits there
being questions that are easier for students of lesser ability; such
questions are not unheard of, but they tend to be eliminated from most
testing situations where an IRT model would be applied.

The model is parameterized here with student abilities `alpha`

being
given a standard normal prior. This is to identify both the scale and
the location of the parameters, both of which would be unidentified
otherwise; see the problematic posteriors
chapter for further discussion of
identifiability. The difficulty and discrimination parameters `beta`

and `gamma`

then have varying scales given hierarchically in this
model. They could also be given weakly informative non-hierarchical
priors, such as

```
beta ~ normal(0, 5);
gamma ~ lognormal(0, 2);
```

The point is that the `alpha`

determines the scale and location
and `beta`

and `gamma`

are allowed to float.

The `beta`

parameter is here given a non-centered
parameterization, with parameter `mu_beta`

serving as the mean
`beta`

location. An alternative would’ve been to take:

`beta ~ normal(mu_beta, sigma_beta);`

and

`y[n] ~ bernoulli_logit(gamma[kk[n]] * (alpha[jj[n]] - beta[kk[n]]));`

Non-centered parameterizations tend to be more efficient in hierarchical models; see the reparameterization section for more information on non-centered reparameterizations.

The intercept term `mu_beta`

can’t itself be modeled
hierarchically, so it is given a weakly informative
\(\textsf{Cauchy}(0,5)\) prior. Similarly, the scale terms,
`sigma_beta`

, and `sigma_gamma`

, are given half-Cauchy
priors. As mentioned earlier, the scale and location for `alpha`

are fixed to ensure identifiability. The truncation in the
half-Cauchy prior is implicit; explicit truncation is not necessary
because the log probability need only be calculated up to a proportion
and the scale variables are constrained to \((0,\infty)\) by their
declarations.

*References*

*Journal of Statistical Software*36 (1): 1–34.

*Data Analysis Using Regression and Multilevel-Hierarchical Models*. Cambridge, United Kingdom: Cambridge University Press.

Andrew Gelman and Hill (2007) treat the \(\delta\) term equivalently as the location parameter in the distribution of student abilities.↩︎