1.11 Item-response theory models

This is an old version, view current version.

Item-response theory (IRT) models the situation in which a number of students each answer one or more of a group of test questions. The model is based on parameters for the ability of the students, the difficulty of the questions, and in more articulated models, the discriminativeness of the questions and the probability of guessing correctly; see Andrew Gelman and Hill (2007, pps. 314–320) for a textbook introduction to hierarchical IRT models and Curtis (2010) for encodings of a range of IRT models in BUGS.

Data declaration with missingness

The data provided for an IRT model may be declared as follows to account for the fact that not every student is required to answer every question.

data {
  int<lower=1> J;                     // number of students
  int<lower=1> K;                     // number of questions
  int<lower=1> N;                     // number of observations
  array[N] int<lower=1, upper=J> jj;  // student for observation n
  array[N] int<lower=1, upper=K> kk;  // question for observation n
  array[N] int<lower=0, upper=1> y;   // correctness for observation n
}

This declares a total of N student-question pairs in the data set, where each n in 1:N indexes a binary observation y[n] of the correctness of the answer of student jj[n] on question kk[n].

The prior hyperparameters will be hard coded in the rest of this section for simplicity, though they could be coded as data in Stan for more flexibility.

1PL (Rasch) model

The 1PL item-response model, also known as the Rasch model, has one parameter (1P) for questions and uses the logistic link function (L).

The model parameters are declared as follows.

parameters {
  real delta;            // mean student ability
  array[J] real alpha;   // ability of student j - mean ability
  array[K] real beta;    // difficulty of question k
}

The parameter alpha[J] is the ability coefficient for student j and beta[k] is the difficulty coefficient for question k. The non-standard parameterization used here also includes an intercept term delta, which represents the average student’s response to the average question.⁴

The model itself is as follows.

model {
  alpha ~ std_normal();         // informative true prior
  beta ~ std_normal();          // informative true prior
  delta ~ normal(0.75, 1);      // informative true prior
  for (n in 1:N) {
    y[n] ~ bernoulli_logit(alpha[jj[n]] - beta[kk[n]] + delta);
  }
}

This model uses the logit-parameterized Bernoulli distribution, where $\texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right) = \texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right).$

The key to understanding it is the term inside the bernoulli_logit distribution, from which it follows that $\textsf{Pr}[y_n = 1] = \operatorname{logit}^{-1}\left(\alpha_{jj[n]} - \beta_{kk[n]} + \delta\right).$

The model suffers from additive identifiability issues without the priors. For example, adding a term $\xi$ to each $\alpha_j$ and $\beta_k$ results in the same predictions. The use of priors for $\alpha$ and $\beta$ located at 0 identifies the parameters; see Andrew Gelman and Hill (2007) for a discussion of identifiability issues and alternative approaches to identification.

For testing purposes, the IRT 1PL model distributed with Stan uses informative priors that match the actual data generation process used to simulate the data in R (the simulation code is supplied in the same directory as the models). This is unrealistic for most practical applications, but allows Stan’s inferences to be validated. A simple sensitivity analysis with fatter priors shows that the posterior is fairly sensitive to the prior even with 400 students and 100 questions and only 25% missingness at random. For real applications, the priors should be fit hierarchically along with the other parameters, as described in the next section.

Multilevel 2PL model

The simple 1PL model described in the previous section is generalized in this section with the addition of a discrimination parameter to model how noisy a question is and by adding multilevel priors for the question difficulty and discrimination parameters. The model parameters are declared as follows.

parameters {
  real mu_beta;                // mean question difficulty
  vector[J] alpha;             // ability for j - mean
  vector[K] beta;              // difficulty for k
  vector<lower=0>[K] gamma;    // discrimination of k
  real<lower=0> sigma_beta;    // scale of difficulties
  real<lower=0> sigma_gamma;   // scale of log discrimination
}

The parameters should be clearer after the model definition.

model {
  alpha ~ std_normal();
  beta ~ normal(0, sigma_beta);
  gamma ~ lognormal(0, sigma_gamma);
  mu_beta ~ cauchy(0, 5);
  sigma_beta ~ cauchy(0, 5);
  sigma_gamma ~ cauchy(0, 5);
  y ~ bernoulli_logit(gamma[kk] .* (alpha[jj] - (beta[kk] + mu_beta)));
}

The std_normal function is used here, defined by $\texttt{std}\mathtt{\_}\texttt{normal}(y) = \textsf{normal}\left(y \mid 0, 1\right).$

The sampling statement is also vectorized using elementwise multiplication; it is equivalent to

for (n in 1:N) {
  y[n] ~ bernoulli_logit(gamma[kk[n]]
                         * (alpha[jj[n]] - (beta[kk[n]] + mu_beta));
}

The 2PL model is similar to the 1PL model, with the additional parameter gamma[k] modeling how discriminative question k is. If gamma[k] is greater than 1, responses are more attenuated with less chance of getting a question right at random. The parameter gamma[k] is constrained to be positive, which prohibits there being questions that are easier for students of lesser ability; such questions are not unheard of, but they tend to be eliminated from most testing situations where an IRT model would be applied.

The model is parameterized here with student abilities alpha being given a standard normal prior. This is to identify both the scale and the location of the parameters, both of which would be unidentified otherwise; see the problematic posteriors chapter for further discussion of identifiability. The difficulty and discrimination parameters beta and gamma then have varying scales given hierarchically in this model. They could also be given weakly informative non-hierarchical priors, such as

beta ~ normal(0, 5);
gamma ~ lognormal(0, 2);

The point is that the alpha determines the scale and location and beta and gamma are allowed to float.

The beta parameter is here given a non-centered parameterization, with parameter mu_beta serving as the mean beta location. An alternative would’ve been to take:

beta ~ normal(mu_beta, sigma_beta);

and

y[n] ~ bernoulli_logit(gamma[kk[n]] * (alpha[jj[n]] - beta[kk[n]]));

Non-centered parameterizations tend to be more efficient in hierarchical models; see the reparameterization section for more information on non-centered reparameterizations.

The intercept term mu_beta can’t itself be modeled hierarchically, so it is given a weakly informative $\textsf{Cauchy}(0,5)$ prior. Similarly, the scale terms, sigma_beta, and sigma_gamma, are given half-Cauchy priors. As mentioned earlier, the scale and location for alpha are fixed to ensure identifiability. The truncation in the half-Cauchy prior is implicit; explicit truncation is not necessary because the log probability need only be calculated up to a proportion and the scale variables are constrained to $(0,\infty)$ by their declarations.

References

Curtis, S. McKay. 2010. “BUGS Code for Item Response Theory.” Journal of Statistical Software 36 (1): 1–34.

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel-Hierarchical Models. Cambridge, United Kingdom: Cambridge University Press.

Andrew Gelman and Hill (2007) treat the $\delta$ term equivalently as the location parameter in the distribution of student abilities.↩︎