## 1.11 Item-Response Theory Models

Item-response theory (IRT) models the situation in which a number of students each answer one or more of a group of test questions. The model is based on parameters for the ability of the students, the difficulty of the questions, and in more articulated models, the discriminativeness of the questions and the probability of guessing correctly; see Gelman and Hill (2007 pps. 314–320) for a textbook introduction to hierarchical IRT models and Curtis (2010) for encodings of a range of IRT models in BUGS.

### Data Declaration with Missingness

The data provided for an IRT model may be declared as follows to account for the fact that not every student is required to answer every question.

```
data {
int<lower=1> J; // number of students
int<lower=1> K; // number of questions
int<lower=1> N; // number of observations
int<lower=1,upper=J> jj[N]; // student for observation n
int<lower=1,upper=K> kk[N]; // question for observation n
int<lower=0,upper=1> y[N]; // correctness for observation n
}
```

This declares a total of `N`

student-question pairs in the data set, where each `n`

in `1:N`

indexes a binary observation `y[n]`

of the correctness of the answer of student `jj[n]`

on question `kk[n]`

.

The prior hyperparameters will be hard coded in the rest of this section for simplicity, though they could be coded as data in Stan for more flexibility.

### 1PL (Rasch) Model

The 1PL item-response model, also known as the Rasch model, has one parameter (1P) for questions and uses the logistic link function (L).

The model parameters are declared as follows.

```
parameters {
real delta; // mean student ability
real alpha[J]; // ability of student j - mean ability
real beta[K]; // difficulty of question k
}
```

The parameter `alpha[J]`

is the ability coefficient for student `j`

and `beta[k]`

is the difficulty coefficient for question `k`

. The non-standard parameterization used here also includes an intercept term `delta`

, which represents the average student’s response to the average question.^{4}

The model itself is as follows.

```
model {
alpha ~ std_normal(); // informative true prior
beta ~ std_normal(); // informative true prior
delta ~ normal(0.75, 1); // informative true prior
for (n in 1:N)
y[n] ~ bernoulli_logit(alpha[jj[n]] - beta[kk[n]] + delta);
}
```

This model uses the logit-parameterized Bernoulli distribution, where \[ \texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right) = \texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right). \]

The key to understanding it is the term inside the `bernoulli_logit`

distribution, from which it follows that \[
\textsf{Pr}[y_n = 1] = \operatorname{logit}^{-1}\left(\alpha_{jj[n]} - \beta_{kk[n]}
+ \delta\right).
\]

The model suffers from additive identifiability issues without the priors. For example, adding a term \(\xi\) to each \(\alpha_j\) and \(\beta_k\) results in the same predictions. The use of priors for \(\alpha\) and \(\beta\) located at 0 identifies the parameters; see Gelman and Hill (2007) for a discussion of identifiability issues and alternative approaches to identification.

For testing purposes, the IRT 1PL model distributed with Stan uses informative priors that match the actual data generation process used to simulate the data in R (the simulation code is supplied in the same directory as the models). This is unrealistic for most practical applications, but allows Stan’s inferences to be validated. A simple sensitivity analysis with fatter priors shows that the posterior is fairly sensitive to the prior even with 400 students and 100 questions and only 25% missingness at random. For real applications, the priors should be fit hierarchically along with the other parameters, as described in the next section.

### Multilevel 2PL Model

The simple 1PL model described in the previous section is generalized in this section with the addition of a discrimination parameter to model how noisy a question is and by adding multilevel priors for the question difficulty and discrimination parameters. The model parameters are declared as follows.

```
parameters {
real mu_beta; // mean question difficulty
vector[J] alpha; // ability for j - mean
vector[K] beta; // difficulty for k
vector<lower=0>[K] gamma; // discrimination of k
real<lower=0> sigma_beta; // scale of difficulties
real<lower=0> sigma_gamma; // scale of log discrimination
}
```

The parameters should be clearer after the model definition.

```
model {
alpha ~ std_normal();
beta ~ normal(0, sigma_beta);
gamma ~ lognormal(0, sigma_gamma);
mu_beta ~ cauchy(0, 5);
sigma_beta ~ cauchy(0, 5);
sigma_gamma ~ cauchy(0, 5);
y ~ bernoulli_logit(gamma[kk] .* (alpha[jj] - (beta[kk] + mu_beta)));
}
```

The `std_normal`

function is used here, defined by \[
\texttt{std}\mathtt{\_}\texttt{normal}(y)
=
\textsf{normal}\left(y \mid 0, 1\right).
\]

The sampling statement is also vectorized using elementwise multiplication; it is equivalent to

```
for (n in 1:N)
y[n] ~ bernoulli_logit(gamma[kk[n]]
* (alpha[jj[n]] - (beta[kk[n]] + mu_beta));
```

The 2PL model is similar to the 1PL model, with the additional parameter `gamma[k]`

modeling how discriminative question `k`

is. If `gamma[k]`

is greater than 1, responses are more attenuated with less chance of getting a question right at random. The parameter `gamma[k]`

is constrained to be positive, which prohibits there being questions that are easier for students of lesser ability; such questions are not unheard of, but they tend to be eliminated from most testing situations where an IRT model would be applied.

The model is parameterized here with student abilities `alpha`

being given a standard normal prior. This is to identify both the scale and the location of the parameters, both of which would be unidentified otherwise; see the problematic posteriors chapter for further discussion of identifiability. The difficulty and discrimination parameters `beta`

and `gamma`

then have varying scales given hierarchically in this model. They could also be given weakly informative non-hierarchical priors, such as

```
beta ~ normal(0, 5);
gamma ~ lognormal(0, 2);
```

The point is that the `alpha`

determines the scale and location and `beta`

and `gamma`

are allowed to float.

The `beta`

parameter is here given a non-centered parameterization, with parameter `mu_beta`

serving as the mean `beta`

location. An alternative would’ve been to take:

`beta ~ normal(mu_beta, sigma_beta);`

and

`y[n] ~ bernoulli_logit(gamma[kk[n]] * (alpha[jj[n]] - beta[kk[n]]));`

Non-centered parameterizations tend to be more efficient in hierarchical models; see the reparameterization section for more information on non-centered reparameterizations.

The intercept term `mu_beta`

can’t itself be modeled hierarchically, so it is given a weakly informative \(\textsf{Cauchy}(0,5)\) prior. Similarly, the scale terms, `sigma_beta`

, and `sigma_gamma`

, are given half-Cauchy priors. As mentioned earlier, the scale and location for `alpha`

are fixed to ensure identifiability. The truncation in the half-Cauchy prior is implicit; explicit truncation is not necessary because the log probability need only be calculated up to a proportion and the scale variables are constrained to \((0,\infty)\) by their declarations.

*References*

Gelman, Andrew, and Jennifer Hill. 2007. *Data Analysis Using Regression and Multilevel-Hierarchical Models*. Cambridge, United Kingdom: Cambridge University Press.

Curtis, S. McKay. 2010. “BUGS Code for Item Response Theory.” *Journal of Statistical Software* 36 (1). American Statistical Association: 1–34.