1.11 Item-Response Theory Models
Item-response theory (IRT) models the situation in which a number of students each answer one or more of a group of test questions. The model is based on parameters for the ability of the students, the difficulty of the questions, and in more articulated models, the discriminativeness of the questions and the probability of guessing correctly; (Gelman and Hill 2007, pps. 314–320) for a textbook introduction to hierarchical IRT models and Curtis (2010) for encodings of a range of IRT models in BUGS.
Data Declaration with Missingness
The data provided for an IRT model may be declared as follows to account for the fact that not every student is required to answer every question.
data {
int<lower=1> J; // number of students
int<lower=1> K; // number of questions
int<lower=1> N; // number of observations
int<lower=1,upper=J> jj[N]; // student for observation n
int<lower=1,upper=K> kk[N]; // question for observation n
int<lower=0,upper=1> y[N]; // correctness for observation n
}
This declares a total of N
student-question pairs in the data
set, where each n
in 1:N
indexes a binary observation
y[n]
of the correctness of the answer of student jj[n]
on question kk[n]
.
The prior hyperparameters will be hard coded in the rest of this section for simplicity, though they could be coded as data in Stan for more flexibility.
1PL (Rasch) Model
The 1PL item-response model, also known as the Rasch model, has one parameter (1P) for questions and uses the logistic link function (L).
The model parameters are declared as follows.
parameters {
real delta; // mean student ability
real alpha[J]; // ability of student j - mean ability
real beta[K]; // difficulty of question k
}
The parameter alpha[j]
is the ability coefficient for student
j
and beta[k]
is the difficulty coefficient for question
k
. The non-standard parameterization used here also includes
an intercept term delta
, which represents the average student’s
response to the average question.4
The model itself is as follows.
model {
alpha ~ std_normal(); // informative true prior
beta ~ std_normal(); // informative true prior
delta ~ normal(0.75, 1); // informative true prior
for (n in 1:N)
y[n] ~ bernoulli_logit(alpha[jj[n]] - beta[kk[n]] + delta);
}
This model uses the logit-parameterized Bernoulli distribution, where
\[ \mathsf{BernoulliLogit}(y \mid \alpha) \ = \ \mathsf{Bernoulli}(y \mid \mbox{logit}^{-1}(\alpha)). \]
The key to understanding it is the term inside the
bernoulli_logit
distribution, from which it follows that
\[
\mbox{Pr}[y_n = 1] = \mbox{logit}^{-1}(\alpha_{jj[n]} - \beta_{kk[n]}
+ \delta).
\]
The model suffers from additive identifiability issues without the priors. For example, adding a term \(\xi\) to each \(\alpha_j\) and \(\beta_k\) results in the same predictions. The use of priors for \(\alpha\) and \(\beta\) located at 0 identifies the parameters; see Gelman and Hill (2007) for a discussion of identifiability issues and alternative approaches to identification.
For testing purposes, the IRT 1PL model distributed with Stan uses informative priors that match the actual data generation process used to simulate the data in R (the simulation code is supplied in the same directory as the models). This is unrealistic for most practical applications, but allows Stan’s inferences to be validated. A simple sensitivity analysis with fatter priors shows that the posterior is fairly sensitive to the prior even with 400 students and 100 questions and only 25% missingness at random. For real applications, the priors should be fit hierarchically along with the other parameters, as described in the next section.
Multilevel 2PL Model
The simple 1PL model described in the previous section is generalized in this section with the addition of a discrimination parameter to model how noisy a question is and by adding multilevel priors for the question difficulty and discrimination parameters. The model parameters are declared as follows.
parameters {
real mu_beta; // mean question difficulty
vector[J] alpha; // ability for j - mean
vector[K] beta; // difficulty for k
vector<lower=0>[K] gamma; // discrimination of k
real<lower=0> sigma_beta; // scale of difficulties
real<lower=0> sigma_gamma; // scale of log discrimination
}
The parameters should be clearer after the model definition.
model {
alpha ~ std_normal();
beta ~ normal(0, sigma_beta);
gamma ~ lognormal(0, sigma_gamma);
mu_beta ~ cauchy(0, 5);
sigma_beta ~ cauchy(0, 5);
sigma_gamma ~ cauchy(0, 5);
y ~ bernoulli_logit(gamma[kk] .* (alpha[jj] - (beta[kk] + mu_beta)));
}
The std_normal
function is used here, defined by
\[ \mathsf{std_normal}(y) \ = \ \mathsf{normal}(y \mid 0, 1). \]
The sampling statement is also vectorized using elementwise multiplication; it is equivalent to
for (n in 1:N)
y[n] ~ bernoulli_logit(gamma[kk[n]]
* (alpha[jj[n]] - (beta[kk[n]] + mu_beta));
The 2PL model is similar to the 1PL model, with the additional parameter
gamma[k]
modeling how discriminative question k
is. If
gamma[k]
is greater than 1, responses are more attenuated with
less chance of getting a question right at random. The parameter
gamma[k]
is constrained to be positive, which prohibits there
being questions that are easier for students of lesser ability; such
questions are not unheard of, but they tend to be eliminated from most
testing situations where an IRT model would be applied.
The model is parameterized here with student abilities alpha
being
given a standard normal prior. This is to identify both the scale and
the location of the parameters, both of which would be unidentified
otherwise; see the problematic posteriors
chapter for further discussion of
identifiability. The difficulty and discrimination parameters beta
and gamma
then have varying scales given hierarchically in this
model. They could also be given weakly informative non-hierarchical
priors, such as
beta ~ normal(0, 5);
gamma ~ lognormal(0, 2);
The point is that the alpha
determines the scale and location
and beta
and gamma
are allowed to float.
The beta
parameter is here given a non-centered
parameterization, with parameter mu_beta
serving as the mean
beta
location. An alternative would’ve been to take:
beta ~ normal(mu_beta, sigma_beta);
and
y[n] ~ bernoulli_logit(gamma[kk[n]] * (alpha[jj[n]] - beta[kk[n]]));
Non-centered parameterizations tend to be more efficient in hierarchical models; see the reparameterization section for more information on non-centered reparameterizations.
The intercept term mu_beta
can’t itself be modeled
hierarchically, so it is given a weakly informative
\(\mathsf{Cauchy}(0,5)\) prior. Similarly, the scale terms,
sigma_beta
, and sigma_gamma
, are given half-Cauchy
priors. As mentioned earlier, the scale and location for alpha
are fixed to ensure identifiability. The truncation in the
half-Cauchy prior is implicit; explicit truncation is not necessary
because the log probability need only be calculated up to a proportion
and the scale variables are constrained to \((0,\infty)\) by their
declarations.