Cognitive psychology has a long tradition of empirically investigating how short-term/working memory works: How do we encode information in memory, and what causes forgetting? A common experimental approach to investigating memory-related processes is to have participants memorize words, numbers, or letters, and then to measure recall accuracy and recall time in milliseconds.
Sentence comprehension is a field within linguistics that investigates the cognitive processes that unfold as one reads or hears a sentence. Theories of memory developed within cognitive psychology have come to play an important role in explaining sentence comprehension processes. Because reading or listening to a sentence involves maintaining words and phrases in memory and connecting them to one another to compute the meaning of a sentence, it is generally assumed that constraints on working memory influence the speed and accuracy of sentence comprehension processes. In the sentence shown below, for example, when we reach the verb was imprisoned a linguistic dependency must be completed between this verb and friend in order to understand the main assertion of the sentence.
Alice’s friend, who defeated the Queen in Alice in Wonderland, was imprisoned.
A common assumption in many models of sentence comprehension is that this dependency completion process involves accessing some mental representation of friend at the moment that the verb was imprisoned is processed.
What is the underlying cognitive process that leads to successful dependency completion? We consider here two psycholinguistic models that explain the incremental resolution of linguistic dependencies: the activation-based model (Lewis and Vasishth 2005) and the direct access model (McElree 2000). Both assume that working memory is involved in storing words, and that dependency completion involves a search in memory using a feature specification. For example, at the verb was imprisoned, a search for a singular-marked animate noun that is the subject of the sentence is assumed to be initiated. Informally speaking, the retrieval process involves looking for an item in memory with at least the features subject, animate, and singular.
Although both models assume that we retrieve words or phrases from memory using such feature specifications, there are some important differences in the retrieval process and in the way that the correct dependent is distinguished from other items in memory (i.e., Alice, Queen). In both models, the dependent measures that are modeled are (i) the amount of time taken to complete the dependency (retrieval time) and (ii) the probability of retrieving the correct element from memory (retrieval probability).
The activation-based model assumes that items in memory have an activation value, an abstract numerical quantity that represents how easy an item is to access (as one might expect, items with higher activation are easier to access). Retrieval time in the activation-based model can be thought of as being generated from a lognormal race between evidence accumulators, where the rate of accumulation for each item's accumulator is given by that item's activation in memory. Under this model, retrieval accuracy is determined by the winner of the race, and retrieval time by the rate of accumulation of the winner. Retrieval time and retrieval probability are therefore not independent: an item that takes longer to retrieve on average will also have a lower mean retrieval probability.
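To make the race concrete, the following is a minimal R sketch of a single simulated trial (the activations, shift, threshold, and noise values below are purely illustrative, not estimates): the accumulator that finishes first determines which candidate is retrieved, and its finishing time is the retrieval time.

# Minimal sketch of one trial of the lognormal race (illustrative values only)
set.seed(1)
psi <- 300              # shift: time taken up by processes other than the retrieval (ms)
b <- 10                 # arbitrary threshold
alpha <- c(4.5, 2.3, 2) # hypothetical activations of three candidates in memory
sigma <- 1.2            # noise in the accumulation process
# Finishing time of each accumulator:
finishing_time <- psi + rlnorm(length(alpha), meanlog = b - alpha, sdlog = sigma)
winner <- which.min(finishing_time) # candidate retrieved in this trial
RT <- finishing_time[winner]        # retrieval time of this trial
c(winner = winner, RT = RT)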
In contrast to the activation-based model, the direct access model assumes that the distribution of retrieval times for items in memory is independent of their probability of being retrieved. This model assumes that in some of the trials where an incorrect item is retrieved, a backtracking and repair process is initiated to retrieve the correct item (McElree 1993). As a consequence, retrieval times associated with correct responses are a mixture of directly accessed correct items and directly accessed incorrect items followed by a repair process (which costs time). In contrast, retrieval times associated with incorrect responses are due only to directly accessed incorrect items.
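The corresponding sketch for the direct access model, again with purely illustrative values for the retrieval probabilities, the backtracking probability, and the lognormal parameters: an incorrect retrieval can be backtracked and repaired with probability P_b, which converts it into a correct response at the cost of extra time.

# Minimal sketch of one trial of the direct access model (illustrative values only)
set.seed(1)
psi <- 300                # shift (ms)
theta <- c(0.7, 0.2, 0.1) # hypothetical retrieval probabilities; candidate 1 is correct
P_b <- 0.3                # hypothetical probability of backtracking and repairing
mu_da <- 5.2              # location of the retrieval time distribution for direct access
mu_b <- 0.5               # extra location parameter when there is backtracking
sigma <- 0.7
retrieved <- sample(seq_along(theta), size = 1, prob = theta)
backtracked <- retrieved != 1 && rbinom(1, 1, P_b) == 1
response <- if (backtracked) 1 else retrieved  # repaired retrievals end up correct
mu <- if (backtracked) mu_da + mu_b else mu_da # repair adds extra time
RT <- psi + rlnorm(1, meanlog = mu, sdlog = sigma)
c(response = response, RT = RT)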
All things being equal (as in controlled experimental settings), we assume that reading times (\(RT\)s) at the retrieval site (i.e., was imprisoned in the previous example) include the retrieval time (as well as other theoretically uninteresting processes). We also assume that if participants in an experiment are asked the question, Who was imprisoned?, their answer is informative of the dependency completion they carried out. If they answer, friend, they are assumed to have made the correct dependency, but if they answer Queen or Alice, they are assumed to have built the incorrect dependency.
Below, we implement the two models and evaluate their predictions against empirical data from a reading study (Nicenboim et al. Under revision following review). The original study had sentences in German, and reading times were measured at an auxiliary verb. After every sentence, a multiple-choice question with four options (a noun which is the correct dependent, two other nouns which are incorrect, and the option “I don’t know”) was presented. The dependent variables that we modeled were (i) reading times in milliseconds and (ii) a response choice between one and four. For more detail, see the paper Models of retrieval in sentence comprehension: A computational evaluation using Bayesian hierarchical modeling (Nicenboim and Vasishth).
After implementing the models, we inspect their descriptive adequacy using posterior predictive checks, and we compare them using Pareto smoothed importance sampling leave-one-out cross-validation (PSIS-LOO; Vehtari, Gelman, and Gabry 2016). We show that the direct access model provides a superior fit to the data considered here.
The modeling details are provided below, with reproducible code.
For ease of presentation, we first show non-hierarchical implementations of the models (ignoring participants and experimental items, i.e., sentences), then hierarchical versions with simulated data, and finally we compare the models with the data of Nicenboim et al. (Under revision following review).
We first discuss a non-hierarchical version of the activation-based model. The implementation of the activation-based model is adapted from the shifted lognormal race model proposed by Rouder et al. (2014). We assume that each trial in the empirical data has two pieces of information: (i) reading times at the retrieval site (e.g., the verb was imprisoned in the example above), and (ii) question responses about which item was retrieved from memory (e.g., friend, Queen, Alice). Even though retrieval times are unobserved, we assume that they are included in the reading times and so they are in the same unit. Regarding the probability of retrieving a certain noun, we assume that the probability of choosing an answer is directly mapped to the probability of retrieving a noun. We modeled the observed reading times (\(RT\)s) and question response (\(w\)), assuming that each element of the vector of observed reading times \(\mathbf{RT}\) has the following distribution:
\begin{equation} RT_n \sim \psi + lognormal(b - \alpha_{c=w}, \sigma) \label{eq:observed} \end{equation}where \(\psi\) is a shift representing the time taken up by processes other than the retrieval, \(b\) is an arbitrary threshold (fixed at 10 in the implementation below), \(\alpha_{c=w}\) is the rate of accumulation of evidence of the accumulator associated with the chosen response \(w\), and \(\sigma\) is the scale of the noise in the accumulation process.
The activation-based model assumes that in a given trial, the item with the highest activation is retrieved. It follows that the accumulators that did not win the race, that is, the accumulators associated with any candidate \(c\) other than the winner \(w\) (\(\forall c, c \neq w\)), must have had lower activations (i.e., accumulated evidence more slowly) in that specific trial. This means that the accumulators that lost the race in trial \(n\) are associated with potential reading times \(RT_{n,\forall c, c \neq w}\) that are larger than the reading time associated with the winning accumulator, \(RT_{n,c=w}\), which is the observed reading time for trial \(n\): \(RT_{n}\) from Equation \eqref{eq:observed}.
If all the answers are selected at least once, the race turns into a censored-data problem, where each \(RT_{n,\forall c, c \neq w}\) has a lower bound set by \(RT_{n,c = w}\). In order to calculate the posterior of the rates of accumulation \(\alpha_c\) of all the accumulators, we cannot ignore the censored data (Gelman et al. 2014, 224–27). However, it is not necessary to impute values: the censored values can be integrated out, with each censored data point contributing a probability of
\begin{equation} \begin{aligned} Pr[RT_{n,\forall c, c \neq w} > RT_{n,c=w} ] &= \int_{RT_{n,c=w}}^{\infty} lognormal(RT_{n,\forall c, c \neq w} - \psi | b- \alpha_{\forall c,c \neq w} ,\sigma) \cdot dRT_{n,\forall c, c \neq w} \\ &= 1-\Phi \left( \frac{log(RT_{n,c=w} - \psi) - (b - \alpha_{\forall c,c \neq w}) }{\sigma} \right) \end{aligned} \end{equation}where \(\Phi()\) is the cumulative distribution function of the standard normal distribution.
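For intuition, this censored term is the complementary cumulative distribution function of the lognormal evaluated at the winner's reading time (this is what Stan's lognormal_lccdf computes); a small R sketch with made-up values:

# Log-probability that a losing accumulator finishes after the observed RT
# (the lognormal log complementary CDF); all values are illustrative:
RT_w <- 700  # observed reading time of the winning accumulator (ms)
psi <- 300; b <- 10; sigma <- 1.2
alpha_loser <- 2.3 # rate of accumulation of a losing accumulator
plnorm(RT_w - psi, meanlog = b - alpha_loser, sdlog = sigma,
       lower.tail = FALSE, log.p = TRUE)
# identical to: pnorm((log(RT_w - psi) - (b - alpha_loser)) / sigma,
#                     lower.tail = FALSE, log.p = TRUE)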
In order to fit the model, all the parameters were given weakly informative priors.
The Stan program follows the notation presented above with the probability function race taking two dependent variables: winner and RT.
functions {
real race(int winner, real RT, real[] alpha, real b, real sigma, real psi){
real log_lik;
int N_choices;
N_choices = num_elements(alpha);
log_lik = 0;
for(c in 1:N_choices)
if(c == winner)
log_lik = log_lik + lognormal_lpdf(RT - psi|b - alpha[c], sigma);
else
log_lik = log_lik + lognormal_lccdf(RT - psi|b - alpha[c], sigma);
return(log_lik);
}
}
data {
int<lower = 0> N_obs;
int<lower = 1> N_choices;
int<lower = 1, upper = N_choices> winner[N_obs];
vector<lower = 0>[N_obs] RT;
}
transformed data {
real b; //arbitrary threshold
real min_RT;
b = 10;
min_RT = min(RT);
}
parameters{
real alpha[N_choices];
real<lower=0> sigma;
real<lower=0,upper=min_RT> psi;
}
model {
alpha ~ normal(0,10);
sigma ~ normal(0,2);
psi ~ normal(0,300);
for (n in 1:N_obs) {
target += race(winner[n], RT[n], alpha, b, sigma, psi);
}
}
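As a usage sketch (the file name activation-based.stan is our choice here, and we assume R vectors winner and RT with one entry per trial), the model above could be fit with rstan roughly as follows:

# Sketch: fitting the non-hierarchical model with rstan
# (assumes the program is saved as 'activation-based.stan' and that
# 'winner' and 'RT' are R vectors with one entry per trial)
library(rstan)
data_race <- list(N_obs = length(RT),
                  N_choices = max(winner),
                  winner = winner,
                  RT = RT)
fit_race <- stan(file = "activation-based.stan", data = data_race,
                 chains = 4, iter = 2000)
print(fit_race, pars = c("alpha", "sigma", "psi"))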
The model presented above ignores the fact that the original data were elicited from different participants (subj in the code) reading different sentences (item in the code). We account for this by adding a hierarchical structure to the rates of accumulation \(\alpha\):
\begin{equation} \alpha_{c,i,j} = \alpha_{0,c} + u_{c,i} + w_{c,j} \end{equation}where \(u\) represents the by-participant adjustment to \(\alpha_{0}\), with \(i=\{1,..,N_{subj}\}\), and \(w\) the by-item adjustment, with \(j=\{1,..,N_{item}\}\); the adjustments have the following priors:
\begin{equation} \mathbf{u}\ \sim\ \mathcal{N}(\mathbf{0},\, \boldsymbol\Sigma_u), \end{equation} \begin{equation} \mathbf{w}\ \sim\ \mathcal{N}(\mathbf{0},\, \boldsymbol\Sigma_w), \end{equation} In addition we included a by-participant adjustment to \(\psi\): \begin{equation} \psi_i = exp(\psi_0 + u_{\psi,i}) \end{equation}with
\begin{equation} \mathbf{u_\psi}\ \sim\ \mathcal{N}(\mathbf{0},\, \boldsymbol\Sigma_{u\psi}), \end{equation}The exponential function ensures that \(\psi_i\) is greater than zero. A constraint on the upper bound of the prior distribution of \(\psi_0\) ensures that the shift of each participant is smaller than that participant's minimum RT. This entails that for each observation \(n\), where \([subj[n]]\) represents the participant associated with observation \(n\), the following inequalities hold:
\begin{equation} \begin{aligned} \psi_{[subj[n]]} &< RT_{n} \\ exp \left(\psi_0 + u_{\psi,[subj[n]]}\right) &< RT_{n} \\ \psi_0 + u_{\psi,[subj[n]]} &< log(RT_{n}) \\ \psi_0 &< log(RT_{n}) - u_{\psi,[subj[n]]} \end{aligned} \end{equation} Thus the prior on \(\psi_0\) was constrained in the following way: \begin{equation} \psi_0 < \min_{n}\left(log(RT_{n}) - u_{\psi,[subj[n]]}\right) \end{equation}In addition, all the parameters were given weakly informative priors; see the Stan implementation.
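In R, this bound is a one-liner; a sketch assuming vectors RT and subj and by-participant adjustments u_psi (these objects are only defined later, in the simulation code below):

# Upper bound on psi_0 implied by the inequalities above (sketch);
# this is what the helper function psi_max() in the Stan code below computes
psi_0_max <- min(log(RT) - u_psi[subj])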
The Stan program follows the notation presented above, with the same probability function race as for the non-hierarchical model. However, the code now includes some modifications to achieve convergence faster: we optimized the by-participant and by-item adjustments through Cholesky factorization (Stan Development Team 2016, 76), and \(\alpha_{0,c}\) and \(\psi_0\) were transformed so that their priors would have a scale close to one (Stan Development Team 2016, 225). The rates \(\alpha_{0,c}\) were scaled using a helper function mean_of_X_by_Y that calculates the mean RT for each response (winner). The code also includes a function with a pseudo-random number generator, race_rng, which allows us to generate data following the activation-based model.
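For intuition, the scaling quantity log(mean_of_X_by_Y(RT, winner)) computed in the transformed data block below is equivalent to the following line of base R (a sketch assuming vectors RT and winner):

# R equivalent of the Stan helper: log of the mean RT for each response category
logmean_RT_w <- log(tapply(RT, winner, mean))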
functions {
vector mean_of_X_by_Y(vector X, int[] Y){
int N_rows;
int N_groups;
N_rows = num_elements(X);
if(N_rows!= num_elements(Y))
reject("X and Y don't have the same length");
N_groups = max(Y);
{ // matrix with a column for each group of Y,
// and 1 if Y belongs to the group:
matrix[N_rows, N_groups] matrix_1;
for(r in 1:N_rows)
for(g in 1:N_groups)
matrix_1[r,g] = (g == Y[r]);
return(((X' * matrix_1) // sum of Xs per group
// divided by matrix_1^T * matrix_1 (number of times each group appears):
/ crossprod(matrix_1))');
}
}
real psi_max(vector u_psi, int[] subj, vector RT) {
real psi_max;
psi_max = positive_infinity();
for (i in 1:num_elements(RT))
psi_max = fmin(psi_max, log(RT[i]) - u_psi[subj[i]]);
return (psi_max);
}
real race(int winner, real RT, vector alpha, real b, real sigma, real psi) {
real log_lik;
int N_choices;
N_choices = num_elements(alpha);
log_lik = 0;
for (c in 1:N_choices)
if (c == winner)
log_lik = log_lik + lognormal_lpdf(RT-psi|b-alpha[c],sigma);
else
log_lik = log_lik + lognormal_lccdf(RT-psi|b-alpha[c], sigma);
return(log_lik);
}
vector race_rng(vector alpha, real b, real sigma,
real psi) {
int N_choices;
real curr_accum_RT;
real fastest_accum_RT;
vector[2] gen;
N_choices = num_elements(alpha);
fastest_accum_RT = positive_infinity();
for (c in 1:N_choices) {
curr_accum_RT = psi + lognormal_rng(b - alpha[c], sigma);
if (curr_accum_RT < fastest_accum_RT){ // this accumulator is faster
fastest_accum_RT = curr_accum_RT;
gen[1] = c;
gen[2] = curr_accum_RT;
}
}
return(gen);
}
}
data {
int<lower = 0> N_obs;
int<lower = 1> N_choices;
int<lower = 1, upper = N_choices> winner[N_obs];
vector<lower = 0>[N_obs] RT;
int<lower = 1> N_subj;
int<lower = 1> subj[N_obs];
int<lower = 1> item[N_obs];
int<lower = 1> N_item;
}
transformed data {
real b;
real min_RT;
real logmean_RT;
vector[N_choices] logmean_RT_w;
b = 10;
min_RT = min(RT);
logmean_RT = log(mean(RT));
logmean_RT_w = log(mean_of_X_by_Y(RT, winner));
}
parameters{
vector[N_choices] alpha_0raw;
vector<lower = 0> [N_choices] tau_u;
cholesky_factor_corr[N_choices] L_u;
matrix[N_choices, N_subj] z_u;
vector<lower = 0> [N_choices] tau_w;
cholesky_factor_corr[N_choices] L_w;
matrix[N_choices, N_item] z_w;
real<lower = 0> sigma;
vector[N_subj] u_psi;
real<lower = 0> tau_psi;
real<upper = psi_max(u_psi, subj, RT) / logmean_RT> psi_0raw;
}
transformed parameters {
real psi_0;
matrix[N_choices, N_subj] u;
matrix[N_choices, N_item] w;
vector[N_choices] alpha_0;
// Optimization through Cholesky Fact:
u = diag_pre_multiply(tau_u, L_u) //matrix[N_choices,N_choices]
* z_u;
w = diag_pre_multiply(tau_w, L_w) //matrix[N_choices,N_choices]
* z_w;
// Unit-scaling
alpha_0 = b - alpha_0raw .* logmean_RT_w;
psi_0 = psi_0raw * logmean_RT;
}
model {
vector[N_obs] log_lik;
alpha_0raw ~ normal(0, 1);
tau_u ~ normal(0, 1);
L_u ~ lkj_corr_cholesky(2.0);
to_vector(z_u) ~ normal(0, 1);
tau_w ~ normal(0, 1);
L_w ~ lkj_corr_cholesky(2.0);
to_vector(z_w) ~ normal(0, 1);
sigma ~ normal(0, 2);
psi_0raw ~ normal(0, 1);
tau_psi ~ normal(0, 1);
u_psi ~ normal(0, tau_psi);
for (n in 1:N_obs) {
vector[N_choices] alpha;
real psi;
alpha = alpha_0 + u[, subj[n]] + w[, item[n]];
psi = exp(psi_0 + u_psi[subj[n]]);
log_lik[n] = race(winner[n], RT[n], alpha, b, sigma, psi);
}
target += log_lik;
}
generated quantities {
matrix[N_choices, N_choices] Cor_u;
matrix[N_choices, N_choices] Cor_w;
vector[N_obs] gen_RT;
vector[N_obs] gen_winner;
vector[N_obs] log_lik;
Cor_u = tcrossprod(L_u);
Cor_w = tcrossprod(L_w);
for (n in 1:N_obs) {
vector[N_choices] alpha;
real psi;
vector[2] gen;
alpha = alpha_0 + u[, subj[n]] + w[, item[n]];
psi = exp(psi_0 + u_psi[subj[n]]);
gen = race_rng(alpha, b, sigma, psi);
gen_winner[n] = gen[1];
gen_RT[n] = gen[2];
log_lik[n] = race(winner[n], RT[n], alpha, b, sigma, psi);
}
}
We first show that the model can recover the parameters used to generate a simulated dataset.
# Load R packages
library(ggplot2)
library(scales)
library(hexbin)
library(tidyr)
library(dplyr)
library(MASS)
library(rstan)
library(loo)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
set.seed(42)
iter <- 2000
chains <- 4
We generated 3600 observations, a number similar to that of the data from Nicenboim et al. (Under revision following review):
# We extract the function 'race_rng' for simulating data:
expose_stan_functions(stanc(file = "activation-based_h.stan"))
N_obs <- 3600
b <- 10
alpha_0 <- c(4.5, 2.3, 2, 2.3)
sigma <- 1.2
psi_0 <- 5.6
tau_psi <- 0.2
# For simplicity we assume the same correlation for all the adjustments:
rhos <- 0.3
# Same SD for all the by-subj adjustments
taus_u <- 0.7
N_subj <- 180
# Same SD for all the by-item adjustments
taus_w <- 0.2
N_item <- 20
subj <- rep(1:N_subj, N_obs/N_subj)
N_tau_u <- length(alpha_0)
Cor_u <- matrix(rep(rhos, N_tau_u * N_tau_u), nrow = N_tau_u)
diag(Cor_u) <- 1
tau_u <- rep(taus_u, N_tau_u)
b_subj <- tau_u %*% t(tau_u)
Sigma_u <- b_subj * Cor_u #variance covariance matrix for 'subj'
u <- mvrnorm(n = N_subj, rep(0, N_tau_u), Sigma_u)
item <- rep(1:N_item, each = N_obs/N_item)
N_tau_w <- length(alpha_0)
Cor_w <- matrix(rep(rhos, N_tau_w * N_tau_w), nrow = N_tau_w)
diag(Cor_w) <- 1
tau_w <- rep(taus_w, N_tau_w)
b_item <- tau_w %*% t(tau_w)
Sigma_w <- b_item * Cor_w #variance covariance matrix for 'item'
w <- mvrnorm(n = N_item, rep(0, N_tau_w), Sigma_w)
psi <- exp(psi_0 + rnorm(N_subj, 0, tau_psi))
N_choices <- length(alpha_0)
RT <- c()
winner <- c()
for (n in 1:N_obs) {
pred <- race_rng(alpha_0 + u[subj[n], ] + w[item[n], ], b, sigma, psi[subj[n]])
winner[n] <- round(pred[1])
RT[n] <- round(pred[2])
}
sim_data_race_h <- list(N_obs = N_obs, winner = winner, RT = RT, N_choices = N_choices,
subj = subj, N_subj = N_subj, item = item, N_item = N_item)
# Fit model to simulated data
sim_fit_race_h <- stan(file = "activation-based_h.stan", data = sim_data_race_h,
chains = chains, iter = iter)
We fit the model to the simulated data and show the summary of the parameters below:
print(sim_fit_race_h, pars = c("alpha_0", "sigma", "psi_0", "tau_u", "tau_w",
"tau_psi", "Cor_u", "Cor_w"), prob = c(0.025, 0.5, 0.975))
## Inference for Stan model: activation-based_h.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
## alpha_0[1] 4.44 0.00 0.07 4.30 4.44 4.59 1189 1.00
## alpha_0[2] 2.13 0.00 0.11 1.92 2.14 2.34 2279 1.00
## alpha_0[3] 2.15 0.00 0.09 1.97 2.15 2.32 2411 1.00
## alpha_0[4] 2.32 0.00 0.08 2.17 2.32 2.48 2635 1.00
## sigma 1.21 0.00 0.02 1.16 1.21 1.25 2037 1.00
## psi_0 5.61 0.00 0.02 5.57 5.61 5.65 497 1.02
## tau_u[1] 0.66 0.00 0.04 0.58 0.66 0.75 1610 1.01
## tau_u[2] 0.79 0.00 0.08 0.65 0.79 0.95 1706 1.00
## tau_u[3] 0.69 0.00 0.07 0.56 0.69 0.84 1950 1.00
## tau_u[4] 0.68 0.00 0.06 0.57 0.68 0.81 1965 1.00
## tau_w[1] 0.21 0.00 0.05 0.14 0.21 0.31 1609 1.00
## tau_w[2] 0.26 0.00 0.07 0.13 0.25 0.42 1879 1.00
## tau_w[3] 0.13 0.00 0.07 0.01 0.13 0.27 1268 1.00
## tau_w[4] 0.08 0.00 0.05 0.00 0.07 0.20 2269 1.00
## tau_psi 0.21 0.00 0.01 0.18 0.21 0.23 4000 1.00
## Cor_u[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_u[1,2] 0.05 0.00 0.11 -0.17 0.06 0.27 1561 1.00
## Cor_u[1,3] 0.25 0.00 0.11 0.02 0.26 0.47 2070 1.00
## Cor_u[1,4] 0.45 0.00 0.10 0.25 0.45 0.63 1720 1.00
## Cor_u[2,1] 0.05 0.00 0.11 -0.17 0.06 0.27 1561 1.00
## Cor_u[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 142 1.00
## Cor_u[2,3] 0.36 0.00 0.12 0.11 0.36 0.59 1646 1.00
## Cor_u[2,4] 0.16 0.00 0.13 -0.10 0.16 0.40 1614 1.00
## Cor_u[3,1] 0.25 0.00 0.11 0.02 0.26 0.47 2070 1.00
## Cor_u[3,2] 0.36 0.00 0.12 0.11 0.36 0.59 1646 1.00
## Cor_u[3,3] 1.00 0.00 0.00 1.00 1.00 1.00 275 1.00
## Cor_u[3,4] 0.36 0.00 0.12 0.11 0.37 0.59 1316 1.00
## Cor_u[4,1] 0.45 0.00 0.10 0.25 0.45 0.63 1720 1.00
## Cor_u[4,2] 0.16 0.00 0.13 -0.10 0.16 0.40 1614 1.00
## Cor_u[4,3] 0.36 0.00 0.12 0.11 0.37 0.59 1316 1.00
## Cor_u[4,4] 1.00 0.00 0.00 1.00 1.00 1.00 629 1.00
## Cor_w[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_w[1,2] -0.14 0.00 0.25 -0.61 -0.15 0.37 2848 1.00
## Cor_w[1,3] 0.43 0.00 0.29 -0.23 0.48 0.87 4000 1.00
## Cor_w[1,4] -0.06 0.01 0.35 -0.71 -0.08 0.63 4000 1.00
## Cor_w[2,1] -0.14 0.00 0.25 -0.61 -0.15 0.37 2848 1.00
## Cor_w[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 915 1.00
## Cor_w[2,3] 0.14 0.01 0.32 -0.51 0.16 0.71 4000 1.00
## Cor_w[2,4] 0.19 0.01 0.36 -0.56 0.21 0.78 4000 1.00
## Cor_w[3,1] 0.43 0.00 0.29 -0.23 0.48 0.87 4000 1.00
## Cor_w[3,2] 0.14 0.01 0.32 -0.51 0.16 0.71 4000 1.00
## Cor_w[3,3] 1.00 0.00 0.00 1.00 1.00 1.00 760 1.00
## Cor_w[3,4] 0.05 0.01 0.37 -0.64 0.05 0.73 4000 1.00
## Cor_w[4,1] -0.06 0.01 0.35 -0.71 -0.08 0.63 4000 1.00
## Cor_w[4,2] 0.19 0.01 0.36 -0.56 0.21 0.78 4000 1.00
## Cor_w[4,3] 0.05 0.01 0.37 -0.64 0.05 0.73 4000 1.00
## Cor_w[4,4] 1.00 0.00 0.00 1.00 1.00 1.00 114 1.00
##
## Samples were drawn using NUTS(diag_e) at Fri Jan 6 04:35:26 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
We adapted a graph from Daniel C. Furr’s case study to show the scaled difference (the difference divided by the size of the generated value) between the posterior means of the parameters of the model and the values we generated for those parameters. (The code producing the graph is available in the Rmd file.)
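As a sketch of the quantity being plotted, the scaled difference can be computed for one parameter vector, here alpha_0, as follows (the full plotting code is in the Rmd file):

# Scaled difference between posterior means and generating values for alpha_0 (sketch)
post_mean_alpha_0 <- summary(sim_fit_race_h, pars = "alpha_0")$summary[, "mean"]
(post_mean_alpha_0 - alpha_0) / alpha_0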
The graph above shows that in general the model recovers the main parameters satisfactorily, and that the discrepancies are relatively small. The graph also shows more uncertainty for the accumulators with lower rates of accumulation (\(\alpha\)); this is because they are associated with less frequent responses, and thus there is less data for their estimation.
Regarding the by-participant and by-item adjustments, the previous graph shows that the model cannot recover the correlations very precisely, and that the estimates of the by-item adjustments have more uncertainty than the by-participant adjustments.
As before, we first discuss a non-hierarchical version of the model. The response selection hypothesized by the direct access model assumes that there is a certain probability associated with the retrieval of each candidate, \(\theta_c\), but that the final response is affected by the possibility of backtracking and correctly repairing the incorrect retrievals, with probability \(P_b\). This means that the probability of choosing an answer is not directly mapped to the probability of retrieving a noun, but it is also affected by the probability of backtracking and repairing. We implemented this by assuming that the responses \(w\) have a discrete distribution that follows a one-inflated categorical model, where additional probability mass is added to the outcome \(one\) (correct response) due to backtracking (in order to correct an incorrect response) with probability \(P_b\) as follows:
\begin{equation} P(w_n=1 | \boldsymbol{\theta}, P_b ) = \theta_1 + (1-\theta_1) \cdot P_b \label{eq:dis1} \end{equation} \begin{equation} P(w_n=s |\boldsymbol{\theta}, P_b ) = \theta_s \cdot (1-P_b) \label{eq:dis2} \end{equation}where \(\theta_s\) is the probability of retrieving candidate \(s\) (with \(s=1\) the correct dependent and \(s>1\) the incorrect candidates), and \(P_b\) is the probability of backtracking and repairing an incorrect retrieval.
We assume, as before, that RTs follow a lognormal distribution. If the response given in trial \(n\) is wrong (\(w_n>1\)), we assume that there is no backtracking, and the distribution is as follows:
\begin{equation} RT_{n,\forall w, w>1} \sim \psi + lognormal(\mu_{da},\sigma) \label{eq:errorRT} \end{equation}where \(\mu_{da}\) is the location of the retrieval time distribution when an item is accessed directly from memory, \(\sigma\) is its scale, and \(\psi\) is, as before, a shift representing processes other than the retrieval.
If the response given is correct (i.e., \(w=1\)), \(RT\)s are assumed to have a mixture distribution. This is so because there are two “paths” to reach a correct response: (i) The correct item is accessed from memory at the first attempt, and this means that there is direct access and \(RT\)s should belong to a distribution similar to the previous one as shown in Equation \eqref{eq:errorRT}; or (ii) the incorrect item is retrieved from memory, but is backtracked and repaired, and this means that \(RT\)s should belong to a distribution with a larger location than \(\mu_{da}\), namely, \(\mu_{da}+\mu_{b}\). Thus \(RT\)s should be distributed in the following way:
\begin{equation} RT_{n,w=1} \sim \psi + \begin{cases} lognormal(\mu_{da},\sigma) \text{; if } y=1 | Categorical( y | \boldsymbol{\theta}) \\ lognormal(\mu_{da}+\mu_{b},\sigma) \text{; if }y \neq 1 \text{ and } z=1 | Categorical(y | \boldsymbol{\theta}) \text{ and } Bernoulli(z | P_b) \end{cases} \label{eq:correctRT} \end{equation}where, from Equation \eqref{eq:dis1}, the first component of the mixture defined in \eqref{eq:correctRT} corresponds to a correct retrieval at the first attempt, conditional on obtaining \(one\) as a response (\(w_n=1\)); it occurs with probability
\begin{equation} \frac{\theta_1 }{ \theta_1 + (1-\theta_1) \cdot P_b } \end{equation}and the second component of the mixture in \eqref{eq:correctRT} corresponds to an incorrect retrieval at the first attempt followed by backtracking and repair, conditional on obtaining \(one\) as a response (\(w_n=1\)); it occurs with probability
\begin{equation} \frac{(1 - \theta_1) \cdot P_b }{ \theta_1 + (1-\theta_1) \cdot P_b } \end{equation}In order to fit the model, all the parameters were given weakly informative priors except for \(\boldsymbol{\theta}\): given that correct retrievals should be more common than incorrect ones, we constrained the probability of retrieving the correct item from memory to be larger than the probability of retrieving any individual incorrect item, that is, \(\theta_1 > \theta_c\) for every \(c \neq 1\).
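To see how these probabilities fit together, the following R snippet computes the response probabilities of the one-inflated categorical distribution and the two mixture weights, using purely illustrative values for the retrieval probabilities and for \(P_b\):

# One-inflated categorical probabilities and mixture weights (illustrative values)
theta <- c(0.7, 0.15, 0.1, 0.05) # hypothetical retrieval probabilities; 1 is correct
P_b <- 0.3                       # hypothetical backtracking probability
P_w1 <- theta[1] + (1 - theta[1]) * P_b # P(response = 1)
P_ws <- theta[-1] * (1 - P_b)           # P(response = s), for s > 1
sum(c(P_w1, P_ws))                      # the response probabilities sum to 1
P_da_given_w1 <- theta[1] / P_w1              # weight of the direct-access RT component
P_b_given_w1 <- (1 - theta[1]) * P_b / P_w1   # weight of the backtracked RT component
P_da_given_w1 + P_b_given_w1                  # the mixture weights sum to 1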
The Stan program is a close translation of the notation above, except that, to avoid identifiability problems, we used a categorical distribution with the parameters on the logit scale, so that the argument of the categorical distribution is the vector \(\boldsymbol{\beta}\) instead of \(\boldsymbol{\theta}\), with only \(K-1\) (that is, \(N_{choices}-1\)) free elements. This was achieved by fixing the \(K\)-th element to zero: \((\theta_1,\theta_2,...,\theta_K)^T = softmax((\beta_1,\beta_2,...,\beta_{K-1},0)^T)\).
functions {
real da(int winner, real RT, vector beta, real P_b, real mu_da,
real mu_b, real sigma, real psi){
// theta = softmax(beta)
// log(P(w = 1 | theta, P_b)):
real log_P_w1;
// Prob of direct access given winner = 1
real log_P_da_gw1;
// Prob of backtracking given winner = 1
real log_P_b_gw1;
// Equation (10) in log:
log_P_w1 = log_sum_exp(categorical_logit_lpmf(1 | beta),
log(P_b)+ log1m_exp(categorical_logit_lpmf(1|beta)));
// Equation (14) in log:
log_P_da_gw1 = categorical_logit_lpmf(1 | beta) - log_P_w1;
// Equation (15) in log:
log_P_b_gw1 = log(P_b) + log1m_exp(categorical_logit_lpmf(1 | beta)) -
log_P_w1;
if(winner==1) {
return (log_P_w1 + // Increment on likelihood due to winner=1
// Increment on likelihood due to RT:
log_sum_exp(log_P_da_gw1 + lognormal_lpdf(RT - psi| mu_da, sigma),
log_P_b_gw1 + lognormal_lpdf(RT - psi | mu_da + mu_b, sigma) ));
} else {
return (log1m(P_b) + categorical_logit_lpmf(winner | beta) +
// Increment on likelihood due to RT:
lognormal_lpdf(RT - psi | mu_da, sigma));
}
}
}
data {
int<lower = 0> N_obs;
int<lower = 1> N_choices;
int<lower = 1, upper = N_choices> winner[N_obs];
vector<lower = 0>[N_obs] RT;
}
transformed data {
real<lower=0> min_RT;
min_RT = min(RT);
}
parameters{
real<lower=0,upper=1> P_b;
real<lower=0> mu_da;
real<lower=0> mu_b;
vector[N_choices-2] beta_incorrect;
real<lower=0> beta_added;
real<lower=0> sigma;
real<lower=0,upper=min_RT> psi;
}
transformed parameters{
vector[N_choices] beta;
beta[1] = beta_added + fmax(max(beta_incorrect),0);
beta[2:N_choices-1] = beta_incorrect;
beta[N_choices] = 0;
}
model {
beta_added ~ normal(0,2);
beta_incorrect ~ normal(0,2);
P_b ~ beta(1,1);
mu_da ~ normal(0,10);
mu_b ~ normal(0,2);
sigma ~ normal(0,2);
psi ~ normal(0,300);
for (n in 1:N_obs) {
target += da(winner[n], RT[n], beta, P_b, mu_da, mu_b, sigma, psi);
}
}
generated quantities {
vector[N_choices] theta;
theta = softmax(beta);
}
In order to account for differences between participants and between items, we added a hierarchical structure to the retrieval probabilities \(\mathbf{\theta}\) and to the parameters that affect RTs: \(\mu_{da}\), \(\mu_{b}\), and \(\psi\).
For convenience, we placed the hierarchical structure on the logit-transformed probabilities \(\mathbf{\beta}\) rather than on \(\mathbf{\theta}\):
\begin{equation} \beta_{c,i,j} = \beta_{0,c} + u_{c,i} + w_{c,j} \end{equation}with \(c=\{1,..,N_{choices}-1\}\), where \(u\) represents the by-participant adjustment to \(\beta_{0}\), with \(i=\{1,..,N_{subj}\}\), and \(w\) the by-item adjustment, with \(j=\{1,..,N_{item}\}\); the adjustments have the following priors:
\begin{equation} \mathbf{u}\ \sim\ \mathcal{N}(\mathbf{0},\, \boldsymbol\Sigma_u), \end{equation} \begin{equation} \mathbf{w}\ \sim\ \mathcal{N}(\mathbf{0},\, \boldsymbol\Sigma_w), \end{equation}We included adjustments to the parameters affecting RTs in the following way:
\begin{equation} \mu_{da} = \mu_{da_0} + u_{da,i} + w_{da,j} \end{equation} \begin{equation} \mu_{b} = \mu_{b_0} + u_{b,i} + w_{b,j} \end{equation}with the following priors:
\begin{equation} (u_{da}, u_{b})^T \sim\ \mathcal{N}(\mathbf{0},\, \boldsymbol\Sigma_{u_{rt}}), \end{equation} \begin{equation} (w_{da}, w_{b})^T \sim\ \mathcal{N}(\mathbf{0},\, \boldsymbol\Sigma_{w_{rt}}), \end{equation}We also included a by-participant adjustment to \(\psi\) as described before for the activation-based model.
Similar to the hierarchical version of the activation-based model, the code includes some modifications to the notation presented above to achieve convergence faster: we optimized the by-participant and by-item adjustments through Cholesky factorization (Stan Development Team 2016, 76), and \(\mu_{da_0}\) and \(\psi_0\) were transformed so that their priors would have a scale close to one (Stan Development Team 2016, 225). The code also includes a function with a pseudo-random number generator, da_rng, which allows us to generate data following the assumptions of the direct access model.
functions {
real psi_max(vector u_psi, int[] subj, vector RT) {
real psi_max;
psi_max = positive_infinity();
for (i in 1:num_elements(RT))
psi_max = fmin(psi_max, log(RT[i]) - u_psi[subj[i]]);
return (psi_max);
}
real da(int winner, real RT, vector beta, real P_b, real mu_da,
real mu_b, real sigma, real psi){
// theta = softmax(beta)
// log(P(w = 1 | theta, P_b)):
real log_P_w1;
// Prob of direct access given winner = 1
real log_P_da_gw1;
// Prob of backtracking given winner = 1
real log_P_b_gw1;
// Equation (10) in log:
log_P_w1 = log_sum_exp(categorical_logit_lpmf(1 | beta),
log(P_b)+ log1m_exp(categorical_logit_lpmf(1|beta)));
// Equation (14) in log:
log_P_da_gw1 = categorical_logit_lpmf(1 | beta) - log_P_w1;
// Equation (15) in log:
log_P_b_gw1 = log(P_b) + log1m_exp(categorical_logit_lpmf(1 | beta)) -
log_P_w1;
if(winner==1) {
return (log_P_w1 + // Increment on likelihood due to winner=1
// Increment on likelihood due to RT:
log_sum_exp(log_P_da_gw1 + lognormal_lpdf(RT - psi| mu_da, sigma),
log_P_b_gw1 + lognormal_lpdf(RT - psi | mu_da + mu_b, sigma) ));
} else {
return (log1m(P_b) + categorical_logit_lpmf(winner | beta) +
// Increment on likelihood due to RT:
lognormal_lpdf(RT - psi | mu_da, sigma));
}
}
vector da_rng(vector theta, real P_b, real mu_da, real mu_b, real sigma,
real psi) {
int orig_choice;
int backtracking;
vector[2] gen;
orig_choice = categorical_rng(theta);
backtracking = 0;
if (orig_choice!=1) backtracking = bernoulli_rng(P_b);
// Change the answer to 1 if there was backtracking:
gen[1] = backtracking ? 1 : orig_choice;
{ real mu; // add mu_b to the location if there was backtracking:
mu = mu_da + (backtracking ? mu_b : 0);
gen[2] = psi + lognormal_rng(mu, sigma);
}
return(gen);
}
}
data {
int<lower=0> N_obs;
int<lower=1> N_choices;
vector<lower=0>[N_obs] RT;
int<lower=1,upper=N_choices> winner[N_obs];
int<lower = 1> subj[N_obs];
int<lower = 1> N_subj;
int<lower = 1> item[N_obs];
int<lower = 1> N_item;
}
transformed data {
real<lower=0> min_RT;
real logmean_RT;
min_RT = min(RT);
logmean_RT = log(mean(RT));
}
parameters{
real<lower=0> sigma;
real<lower=0> mu_da_0raw;
real<lower=0> mu_b_0;
vector[N_choices-2] beta_incorrect;
real<lower=0> beta_added;
vector<lower = 0> [N_choices - 1] tau_u;
cholesky_factor_corr[N_choices - 1] L_u;
matrix[N_choices - 1, N_subj] z_u;
vector<lower = 0> [2] tau_u_RT;
cholesky_factor_corr[2] L_u_RT;
matrix[2, N_subj] z_u_RT;
vector<lower = 0> [N_choices - 1] tau_w;
cholesky_factor_corr[N_choices - 1] L_w;
matrix[N_choices - 1, N_item] z_w;
vector<lower = 0> [2] tau_w_RT;
cholesky_factor_corr[2] L_w_RT;
matrix[2, N_item] z_w_RT;
real<lower=0,upper=1> P_b;
vector[N_subj] u_psi;
real<lower = 0> tau_psi;
real<upper = psi_max(u_psi, subj, RT) / logmean_RT> psi_0raw;
}
transformed parameters{
real<lower=0> mu_da_0;
vector[N_choices] beta_0;
matrix[2, N_subj] u_RT;
matrix[N_choices, N_subj] u;
matrix[2, N_item] w_RT;
matrix[N_choices, N_item] w;
real psi_0;
u_RT = diag_pre_multiply(tau_u_RT, L_u_RT) * z_u_RT;
u[1:N_choices-1] = diag_pre_multiply(tau_u, L_u) * z_u;
u[N_choices] = rep_row_vector(0,N_subj);
w_RT = diag_pre_multiply(tau_w_RT, L_w_RT) * z_w_RT;
w[1:N_choices-1] = diag_pre_multiply(tau_w, L_w) * z_w;
w[N_choices] = rep_row_vector(0,N_item);
beta_0[1] = beta_added + fmax(max(beta_incorrect),0);
beta_0[2:N_choices-1] = beta_incorrect;
beta_0[N_choices] = 0;
mu_da_0 = mu_da_0raw * logmean_RT;
psi_0 = psi_0raw * logmean_RT;
}
model {
vector[N_obs] log_lik;
sigma ~ normal(0,2);
beta_added ~ normal(0,2);
beta_incorrect ~ normal(0,2);
psi_0raw ~ normal(0, 1);
tau_psi ~ normal(0, 1);
u_psi ~ normal(0, tau_psi);
to_vector(z_u_RT) ~ normal(0, 1);
to_vector(z_u) ~ normal(0, 1);
tau_u_RT ~ normal(0, 1);
tau_u ~ normal(0, 1);
L_u_RT ~ lkj_corr_cholesky(2.0);
L_u ~ lkj_corr_cholesky(2.0);
to_vector(z_w_RT) ~ normal(0, 1);
to_vector(z_w) ~ normal(0, 1);
tau_w_RT ~ normal(0, 1);
tau_w ~ normal(0, 1);
L_w_RT ~ lkj_corr_cholesky(2.0);
L_w ~ lkj_corr_cholesky(2.0);
P_b ~ beta(1,1);
mu_da_0raw ~ normal(0,1);
mu_b_0 ~ normal(0,2);
for (n in 1:N_obs) {
real mu_da;
real mu_b;
vector[N_choices] beta;
real psi;
mu_da = mu_da_0 + u_RT[1,subj[n]] + w_RT[1,item[n]];
mu_b = mu_b_0 + u_RT[2,subj[n]] + w_RT[2,item[n]];
beta = beta_0 + u[,subj[n]] + w[,item[n]];
psi = exp(psi_0 + u_psi[subj[n]]);
log_lik[n] = da(winner[n], RT[n], beta, P_b, mu_da, mu_b, sigma, psi);
}
target += log_lik;
}
generated quantities {
vector[N_choices] theta_0;
matrix[N_choices-1, N_choices-1] Cor_u;
matrix[N_choices-1, N_choices-1] Cor_w;
matrix[2, 2] Cor_u_RT;
matrix[2, 2] Cor_w_RT;
vector[N_obs] gen_RT;
vector[N_obs] gen_winner;
vector[N_obs] log_lik;
theta_0 = softmax(beta_0);
Cor_u = tcrossprod(L_u);
Cor_w = tcrossprod(L_w);
Cor_u_RT = tcrossprod(L_u_RT);
Cor_w_RT = tcrossprod(L_w_RT);
for (n in 1:N_obs) {
real mu_da;
real mu_b;
vector[N_choices] beta;
real psi;
vector[2] gen;
mu_da = mu_da_0 + u_RT[1,subj[n]] + w_RT[1,item[n]];
mu_b = mu_b_0 + u_RT[2,subj[n]] + w_RT[2,item[n]];
beta = beta_0 + u[,subj[n]] + w[,item[n]];
psi = exp(psi_0 + u_psi[subj[n]]);
gen = da_rng(softmax(beta), P_b, mu_da, mu_b, sigma, psi);
gen_winner[n] = gen[1];
gen_RT[n] = gen[2];
log_lik[n] = da(winner[n], RT[n], beta, P_b, mu_da, mu_b, sigma, psi);
}
}
# We extract the function 'da_rng' for simulating data:
expose_stan_functions(stanc(file = "direct_access_h.stan"))
# Some useful functions:
logsumexp <- function(x) {
y = max(x)
y + log(sum(exp(x - y)))
}
softmax <- function(x) {
exp(x - logsumexp(x))
}
inv_softmax <- function(y) {
# inverse of softmax function, setting that last value to 0
z <- -log(y[length(y)] + 1e-08)
y <- y[-length(y)]
return(c(z + log(y + 1e-08), 0))
}
N_obs <- 3600
theta_m1 <- c(0.68, 0.13, 0.06)
theta_0 <- c(theta_m1, 1 - sum(theta_m1))
beta_0 <- inv_softmax(theta_0)
mu_da_0 <- 5.2
mu_b_0 <- 0.5
P_b <- 0.35
sigma <- 0.7
psi_0 <- 5.6
tau_psi <- 0.2
rhos <- 0.3
taus_u <- 0.7
N_subj <- 180
taus_w <- 0.2
N_item <- 20
## Adjustments by subject:
## * To retrieval prob.:
subj <- rep(1:N_subj, N_obs/N_subj)
N_tau_u <- length(beta_0) - 1
Cor_u <- matrix(rep(rhos, N_tau_u * N_tau_u), nrow = N_tau_u)
diag(Cor_u) <- 1
tau_u <- rep(taus_u, N_tau_u)
b_subj <- tau_u %*% t(tau_u)
Sigma_u <- b_subj * Cor_u #variance covariance matrix for subj
u <- mvrnorm(n = N_subj, rep(0, N_tau_u), Sigma_u)
# * To the shift:
psi <- exp(psi_0 + rnorm(N_subj, 0, tau_psi))
# * To the location of the lognormal:
Cor_u_RT <- matrix(rep(rhos, 2 * 2), nrow = 2)
diag(Cor_u_RT) <- 1
tau_u_RT <- rep(taus_u, 2)
b_subj <- tau_u_RT %*% t(tau_u_RT)
Sigma_u_RT <- b_subj * Cor_u_RT #variance covariance matrix for subj
u_RT <- mvrnorm(n = N_subj, rep(0, 2), Sigma_u_RT)
## Adjustments by item:
## * To retrieval prob.:
item <- rep(1:N_item, each = N_obs/N_item)
N_tau_w <- length(beta_0) - 1
Cor_w <- matrix(rep(rhos, N_tau_w * N_tau_w), nrow = N_tau_w)
diag(Cor_w) <- 1
tau_w <- rep(taus_w, N_tau_w)
b_item <- tau_w %*% t(tau_w)
Sigma_w <- b_item * Cor_w #variance covariance matrix for 'item'
w <- mvrnorm(n = N_item, rep(0, N_tau_w), Sigma_w)
# * To the location of the lognormal:
Cor_w_RT <- matrix(rep(rhos, 2 * 2), nrow = 2)
diag(Cor_w_RT) <- 1
tau_w_RT <- rep(taus_w, 2)
b_item <- tau_w_RT %*% t(tau_w_RT)
Sigma_w_RT <- b_item * Cor_w_RT #variance covariance matrix for 'item'
w_RT <- mvrnorm(n = N_item, rep(0, 2), Sigma_w_RT)
N_choices <- length(theta_0)
RT <- c()
winner <- c()
for (n in 1:N_obs) {
theta <- softmax(beta_0 + c(u[subj[n], ], 0) + c(w[item[n], ], 0))
mu_da <- mu_da_0 + u_RT[subj[n], 1] + w_RT[item[n], 1]
mu_b <- mu_b_0 + u_RT[subj[n], 2] + w_RT[item[n], 2]
pred <- da_rng(theta, P_b, mu_da, mu_b, sigma, psi[subj[n]])
winner[n] <- round(pred[1])
RT[n] <- round(pred[2])
}
sim_data_da_h <- list(N_obs = N_obs, winner = winner, RT = RT, N_choices = N_choices,
subj = subj, N_subj = N_subj, item = item, N_item = N_item)
sim_fit_da_h <- stan(file = "direct_access_h.stan", data = sim_data_da_h, chains = chains,
iter = iter)
print(sim_fit_da_h, pars = c("theta_0", "P_b", "mu_da_0", "mu_b_0", "sigma",
"psi_0", "tau_u", "tau_u_RT", "tau_w", "tau_w_RT", "tau_psi", "Cor_u", "Cor_w",
"Cor_u_RT", "Cor_w_RT"), prob = c(0.025, 0.5, 0.975))
## Inference for Stan model: direct_access_h.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
## theta_0[1] 0.66 0.00 0.04 0.58 0.66 0.73 466 1.01
## theta_0[2] 0.14 0.00 0.02 0.11 0.14 0.18 527 1.01
## theta_0[3] 0.06 0.00 0.01 0.04 0.06 0.08 783 1.00
## theta_0[4] 0.14 0.00 0.02 0.11 0.14 0.18 543 1.01
## P_b 0.29 0.00 0.06 0.17 0.29 0.40 400 1.01
## mu_da_0 5.17 0.00 0.07 5.04 5.17 5.31 344 1.01
## mu_b_0 0.64 0.01 0.21 0.25 0.62 1.10 322 1.01
## sigma 0.68 0.00 0.02 0.65 0.68 0.72 1158 1.00
## psi_0 5.59 0.00 0.02 5.55 5.59 5.63 355 1.02
## tau_u[1] 0.64 0.00 0.10 0.46 0.63 0.83 1079 1.00
## tau_u[2] 0.52 0.01 0.13 0.26 0.52 0.75 465 1.00
## tau_u[3] 0.58 0.01 0.20 0.12 0.60 0.93 334 1.02
## tau_u_RT[1] 0.71 0.00 0.04 0.63 0.71 0.80 582 1.01
## tau_u_RT[2] 0.57 0.01 0.16 0.24 0.58 0.85 335 1.01
## tau_w[1] 0.25 0.00 0.10 0.03 0.25 0.46 455 1.00
## tau_w[2] 0.24 0.00 0.12 0.02 0.24 0.49 823 1.01
## tau_w[3] 0.21 0.00 0.14 0.01 0.20 0.51 1034 1.01
## tau_w_RT[1] 0.19 0.00 0.04 0.13 0.18 0.27 782 1.01
## tau_w_RT[2] 0.19 0.01 0.14 0.01 0.16 0.52 447 1.01
## tau_psi 0.18 0.00 0.01 0.16 0.18 0.21 2128 1.00
## Cor_u[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_u[1,2] -0.23 0.01 0.25 -0.75 -0.21 0.26 513 1.00
## Cor_u[1,3] 0.17 0.01 0.25 -0.36 0.19 0.61 833 1.01
## Cor_u[2,1] -0.23 0.01 0.25 -0.75 -0.21 0.26 513 1.00
## Cor_u[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 1557 1.00
## Cor_u[2,3] -0.18 0.01 0.30 -0.74 -0.17 0.40 589 1.00
## Cor_u[3,1] 0.17 0.01 0.25 -0.36 0.19 0.61 833 1.01
## Cor_u[3,2] -0.18 0.01 0.30 -0.74 -0.17 0.40 589 1.00
## Cor_u[3,3] 1.00 0.00 0.00 1.00 1.00 1.00 179 1.00
## Cor_w[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_w[1,2] 0.38 0.01 0.35 -0.47 0.44 0.88 1056 1.01
## Cor_w[1,3] 0.05 0.01 0.38 -0.68 0.07 0.74 1932 1.00
## Cor_w[2,1] 0.38 0.01 0.35 -0.47 0.44 0.88 1056 1.01
## Cor_w[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 1360 1.00
## Cor_w[2,3] 0.20 0.01 0.39 -0.64 0.24 0.83 1601 1.01
## Cor_w[3,1] 0.05 0.01 0.38 -0.68 0.07 0.74 1932 1.00
## Cor_w[3,2] 0.20 0.01 0.39 -0.64 0.24 0.83 1601 1.01
## Cor_w[3,3] 1.00 0.00 0.00 1.00 1.00 1.00 433 1.00
## Cor_u_RT[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_u_RT[1,2] 0.36 0.01 0.19 -0.01 0.36 0.73 993 1.00
## Cor_u_RT[2,1] 0.36 0.01 0.19 -0.01 0.36 0.73 993 1.00
## Cor_u_RT[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 1667 1.00
## Cor_w_RT[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_w_RT[1,2] 0.13 0.01 0.38 -0.65 0.15 0.80 2894 1.00
## Cor_w_RT[2,1] 0.13 0.01 0.38 -0.65 0.15 0.80 2894 1.00
## Cor_w_RT[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 967 1.00
##
## Samples were drawn using NUTS(diag_e) at Fri Jan 6 08:14:36 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The graph above shows that the direct access model recovers the main parameters with more uncertainty than the activation-based model. This could be due to the increased number of parameters.
As with the activation-based model, the graph above shows that the implementation of the direct access model cannot recover the correlations of the by-participant and by-item adjustments precisely, and that the estimates of the by-item adjustments have more uncertainty than the by-participant adjustments.
The example data are a subset of Nicenboim et al. (Under revision following review), where the comprehension questions are meaningful for assessing which item is retrieved from memory. We ignore the experimental manipulation of the original experiment, and we focus on the ability of the models to account for the responses and the reading times of the data.
load("dataNIG-SRC.Rda")
# dexp contains a subset of the data of Nicenboim et al 2016.
str(dexp)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3588 obs. of 5 variables:
## $ participant: num 23 23 23 23 23 23 23 23 23 23 ...
## $ item : num 26 2 44 5 35 17 59 32 38 8 ...
## $ winner : num 1 1 1 4 1 1 1 1 1 4 ...
## $ RT : int 655 591 694 913 906 418 942 891 845 676 ...
## $ condition : Factor w/ 2 levels "HI","LI": 1 1 1 2 2 2 2 1 1 1 ...
# To avoid problems of participant/item numbers being skipped, we recode
# participant and item numbers
exp_data <- list(N_obs = nrow(dexp), winner = dexp$winner, RT = dexp$RT, N_choices = max(dexp$winner),
subj = as.numeric(as.factor(dexp$participant)), N_subj = length(unique(dexp$participant)),
item = as.numeric(as.factor(dexp$item)), N_item = length(unique(dexp$item)))
# We fit the activation-based and direct access models,
# and we save the time taken to fit each model
timer <- proc.time()
samples_dexp_ab <- stan(file = "activation-based_h.stan", data = exp_data, chains = chains,
iter = iter)
ab_time <- proc.time() - timer
timer <- proc.time()
samples_dexp_da <- stan(file = "direct_access_h.stan", data = exp_data, chains = chains,
iter = iter)
da_time <- proc.time() - timer
We fitted both models to the experimental data (activation-based: 148 minutes; direct access: 160 minutes). We show below the summaries of the posteriors for the activation-based model and the direct access model. Notice that the few NaNs in the Rhats are not a concern, because they appear only for the diagonal elements of a correlation matrix, which do not need to be estimated.
## Inference for Stan model: activation-based_h.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
## alpha_0[1] 4.42 0.00 0.06 4.31 4.42 4.54 901 1.00
## alpha_0[2] 2.33 0.00 0.11 2.10 2.33 2.54 1432 1.00
## alpha_0[3] 1.72 0.00 0.13 1.44 1.73 1.96 1421 1.00
## alpha_0[4] 2.31 0.00 0.10 2.09 2.31 2.50 1342 1.00
## sigma 1.16 0.00 0.02 1.12 1.16 1.20 1918 1.00
## psi_0 5.61 0.00 0.02 5.57 5.61 5.64 280 1.01
## tau_u[1] 0.64 0.00 0.04 0.56 0.64 0.72 1109 1.00
## tau_u[2] 1.05 0.00 0.08 0.92 1.05 1.21 1454 1.00
## tau_u[3] 1.15 0.00 0.09 0.98 1.14 1.33 1406 1.00
## tau_u[4] 1.05 0.00 0.08 0.90 1.05 1.22 1590 1.00
## tau_w[1] 0.14 0.00 0.03 0.08 0.14 0.21 1287 1.00
## tau_w[2] 0.40 0.00 0.07 0.28 0.40 0.55 1259 1.00
## tau_w[3] 0.37 0.00 0.09 0.20 0.37 0.56 1466 1.00
## tau_w[4] 0.23 0.00 0.07 0.09 0.23 0.37 1182 1.00
## tau_psi 0.23 0.00 0.02 0.20 0.23 0.26 2565 1.00
## Cor_u[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_u[1,2] 0.67 0.00 0.06 0.53 0.68 0.79 979 1.00
## Cor_u[1,3] 0.75 0.00 0.06 0.62 0.75 0.86 1140 1.00
## Cor_u[1,4] 0.57 0.00 0.08 0.41 0.57 0.70 972 1.00
## Cor_u[2,1] 0.67 0.00 0.06 0.53 0.68 0.79 979 1.00
## Cor_u[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 336 1.00
## Cor_u[2,3] 0.89 0.00 0.05 0.79 0.89 0.96 1218 1.01
## Cor_u[2,4] 0.62 0.00 0.07 0.46 0.62 0.75 945 1.01
## Cor_u[3,1] 0.75 0.00 0.06 0.62 0.75 0.86 1140 1.00
## Cor_u[3,2] 0.89 0.00 0.05 0.79 0.89 0.96 1218 1.01
## Cor_u[3,3] 1.00 0.00 0.00 1.00 1.00 1.00 4000 1.00
## Cor_u[3,4] 0.70 0.00 0.07 0.54 0.71 0.83 530 1.01
## Cor_u[4,1] 0.57 0.00 0.08 0.41 0.57 0.70 972 1.00
## Cor_u[4,2] 0.62 0.00 0.07 0.46 0.62 0.75 945 1.01
## Cor_u[4,3] 0.70 0.00 0.07 0.54 0.71 0.83 530 1.01
## Cor_u[4,4] 1.00 0.00 0.00 1.00 1.00 1.00 193 1.00
## Cor_w[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_w[1,2] 0.02 0.01 0.22 -0.40 0.02 0.44 766 1.00
## Cor_w[1,3] -0.29 0.01 0.24 -0.73 -0.30 0.23 1291 1.00
## Cor_w[1,4] -0.19 0.01 0.27 -0.69 -0.20 0.36 1350 1.00
## Cor_w[2,1] 0.02 0.01 0.22 -0.40 0.02 0.44 766 1.00
## Cor_w[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 590 1.00
## Cor_w[2,3] -0.13 0.00 0.22 -0.54 -0.14 0.30 2069 1.00
## Cor_w[2,4] 0.32 0.01 0.23 -0.17 0.33 0.74 1937 1.00
## Cor_w[3,1] -0.29 0.01 0.24 -0.73 -0.30 0.23 1291 1.00
## Cor_w[3,2] -0.13 0.00 0.22 -0.54 -0.14 0.30 2069 1.00
## Cor_w[3,3] 1.00 0.00 0.00 1.00 1.00 1.00 422 1.00
## Cor_w[3,4] 0.18 0.01 0.27 -0.37 0.20 0.69 1488 1.00
## Cor_w[4,1] -0.19 0.01 0.27 -0.69 -0.20 0.36 1350 1.00
## Cor_w[4,2] 0.32 0.01 0.23 -0.17 0.33 0.74 1937 1.00
## Cor_w[4,3] 0.18 0.01 0.27 -0.37 0.20 0.69 1488 1.00
## Cor_w[4,4] 1.00 0.00 0.00 1.00 1.00 1.00 89 1.00
##
## Samples were drawn using NUTS(diag_e) at Thu Jan 5 02:39:04 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
## Inference for Stan model: direct_access_h.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
## theta_0[1] 0.68 0.00 0.04 0.60 0.68 0.75 340 1.00
## theta_0[2] 0.13 0.00 0.02 0.09 0.12 0.17 565 1.01
## theta_0[3] 0.06 0.00 0.01 0.04 0.06 0.08 951 1.00
## theta_0[4] 0.14 0.00 0.02 0.10 0.14 0.18 463 1.00
## P_b 0.46 0.00 0.04 0.37 0.47 0.54 180 1.01
## mu_da_0 5.37 0.00 0.06 5.25 5.37 5.49 451 1.01
## mu_b_0 0.43 0.02 0.18 0.06 0.45 0.75 101 1.03
## sigma 0.78 0.00 0.03 0.73 0.77 0.83 409 1.00
## psi_0 5.54 0.00 0.02 5.49 5.54 5.58 373 1.01
## tau_u[1] 1.11 0.00 0.13 0.86 1.10 1.39 1069 1.00
## tau_u[2] 1.03 0.00 0.15 0.75 1.03 1.33 950 1.00
## tau_u[3] 0.92 0.01 0.18 0.57 0.92 1.28 932 1.00
## tau_u_RT[1] 0.66 0.00 0.05 0.57 0.66 0.76 120 1.03
## tau_u_RT[2] 0.97 0.01 0.16 0.69 0.95 1.30 110 1.03
## tau_w[1] 0.48 0.00 0.11 0.27 0.47 0.71 937 1.00
## tau_w[2] 0.57 0.00 0.14 0.30 0.57 0.84 823 1.00
## tau_w[3] 0.38 0.01 0.18 0.04 0.38 0.74 848 1.00
## tau_w_RT[1] 0.10 0.00 0.02 0.05 0.10 0.14 749 1.00
## tau_w_RT[2] 0.08 0.00 0.06 0.00 0.07 0.23 1382 1.00
## tau_psi 0.23 0.00 0.02 0.20 0.23 0.27 796 1.00
## Cor_u[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_u[1,2] 0.35 0.01 0.15 0.03 0.36 0.61 907 1.00
## Cor_u[1,3] 0.40 0.01 0.17 0.03 0.41 0.70 1111 1.00
## Cor_u[2,1] 0.35 0.01 0.15 0.03 0.36 0.61 907 1.00
## Cor_u[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 1375 1.00
## Cor_u[2,3] 0.82 0.00 0.10 0.58 0.84 0.96 1154 1.00
## Cor_u[3,1] 0.40 0.01 0.17 0.03 0.41 0.70 1111 1.00
## Cor_u[3,2] 0.82 0.00 0.10 0.58 0.84 0.96 1154 1.00
## Cor_u[3,3] 1.00 0.00 0.00 1.00 1.00 1.00 2484 1.00
## Cor_w[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_w[1,2] 0.32 0.01 0.24 -0.20 0.35 0.72 1089 1.00
## Cor_w[1,3] -0.07 0.01 0.33 -0.70 -0.07 0.57 1788 1.00
## Cor_w[2,1] 0.32 0.01 0.24 -0.20 0.35 0.72 1089 1.00
## Cor_w[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 162 1.00
## Cor_w[2,3] -0.13 0.01 0.33 -0.76 -0.14 0.52 2027 1.00
## Cor_w[3,1] -0.07 0.01 0.33 -0.70 -0.07 0.57 1788 1.00
## Cor_w[3,2] -0.13 0.01 0.33 -0.76 -0.14 0.52 2027 1.00
## Cor_w[3,3] 1.00 0.00 0.00 1.00 1.00 1.00 70 1.00
## Cor_u_RT[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_u_RT[1,2] -0.08 0.02 0.21 -0.47 -0.08 0.33 113 1.03
## Cor_u_RT[2,1] -0.08 0.02 0.21 -0.47 -0.08 0.33 113 1.03
## Cor_u_RT[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 678 1.00
## Cor_w_RT[1,1] 1.00 0.00 0.00 1.00 1.00 1.00 4000 NaN
## Cor_w_RT[1,2] -0.16 0.01 0.43 -0.86 -0.20 0.69 3110 1.00
## Cor_w_RT[2,1] -0.16 0.01 0.43 -0.86 -0.20 0.69 3110 1.00
## Cor_w_RT[2,2] 1.00 0.00 0.00 1.00 1.00 1.00 75 1.00
##
## Samples were drawn using NUTS(diag_e) at Thu Jan 5 05:19:33 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
We use posterior predictive checking to examine the descriptive adequacy of the models (Shiffrin et al. 2008; Gelman et al. 2014, Chapter 6). The posterior predictive distribution is composed of 4000 datasets (one for each iteration after the warm-up) that the model generates based on the posterior distributions of its parameters.
Since the main difference between the activation-based model and the direct access model is in the way they account for the relationship between retrieval probability and latencies, for each of the generated datasets, we calculate the mean \(RT\) associated with each response and the mean proportion of responses given. We represent this using violin plots (Hintze and Nelson 1998): the width of the violin plots represents the density of the predicted means. The observed mean of the data is represented with a cross. If the data could plausibly have been generated by the model, we would expect the crosses to be inside the violin plots. (The code producing the graph is available in the Rmd file.)
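As a sketch of the summary statistics that enter these plots (shown here for the activation-based fit; the direct access fit is treated analogously, and the full plotting code is in the Rmd file), the per-draw means and response proportions can be computed from the generated quantities gen_RT and gen_winner:

# Sketch: per-draw summaries used in the posterior predictive check
# (assumes 'samples_dexp_ab' from above; levels 1:4 match the four response options)
gen <- extract(samples_dexp_ab, pars = c("gen_RT", "gen_winner"))
n_draws <- nrow(gen$gen_RT)
# For each posterior draw: mean predicted RT and proportion of each response
pred_meanRT <- t(sapply(1:n_draws, function(d)
  tapply(gen$gen_RT[d, ], factor(gen$gen_winner[d, ], levels = 1:4), mean)))
pred_prop <- t(sapply(1:n_draws, function(d)
  prop.table(table(factor(gen$gen_winner[d, ], levels = 1:4)))))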
The posterior predictive checks reveal that the activation-based model is inadequate for predicting some key characteristics of the data. This model predicts shorter times when the correct response is given and longer times when an incorrect answer is given. In other words, the model underestimates the retrieval time of the correct dependent and overestimates the retrieval time of the competitor items. The activation-based model also shows a slight misfit for the predicted accuracy: the model tends to underestimate the proportion of correct responses. In contrast, the direct access model is able to predict the main characteristics of the data fairly well: it predicts that reading times associated with correct responses are on average slower than those associated with incorrect ones, and it also predicts the proportions of responses fairly accurately.
Since descriptive adequacy can also be achieved by a model that is too flexible and can generate too many different patterns of results, we also compared the expected predictive performance of the models using cross-validation. The leave-one-out (LOO; Geisser and Eddy 1979) method is a robust way to compare the expected predictive performance of models (Vehtari, Gelman, and Gabry 2016). In order to estimate the expected log pointwise predictive density (\(\hat{elpd}\)), we approximated LOO with PSIS-LOO using the R package loo.
However, PSIS-LOO can be affected by highly influential observations; the estimated shape parameter \(\hat{k}\) of the generalized Pareto distribution can be used to assess the reliability of the estimates, with \(\hat{k} > 0.7\) indicating an unreliable calculation of \(\hat{elpd}\). Even though the calculation of PSIS-LOO shows several values of \(\hat{k}\) above \(0.7\), in which case k-fold cross-validation is recommended instead, for ease of presentation we show the results using PSIS-LOO. We verified that K-fold cross-validation (Vehtari and Ojanen 2012) with \(K=10\) yields virtually the same results; see the Appendix for details.
loglik_dexp_ab <- extract_log_lik(samples_dexp_ab)
loglik_dexp_da <- extract_log_lik(samples_dexp_da)
(loo_dexp_ab <- loo(loglik_dexp_ab))
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-
## diagnostic') for details.
## Computed from 4000 by 3588 log-likelihood matrix
##
## Estimate SE
## elpd_loo -26820.2 99.3
## p_loo 634.9 18.1
## looic 53640.3 198.7
##
## Pareto k diagnostic values:
## Count Pct
## (-Inf, 0.5] (good) 3344 93.2%
## (0.5, 0.7] (ok) 196 5.5%
## (0.7, 1] (bad) 48 1.3%
## (1, Inf) (very bad) 0 0.0%
## See help('pareto-k-diagnostic') for details.
(loo_dexp_da <- loo(loglik_dexp_da))
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-
## diagnostic') for details.
## Computed from 4000 by 3588 log-likelihood matrix
##
## Estimate SE
## elpd_loo -26707.2 99.1
## p_loo 652.3 15.1
## looic 53414.4 198.2
##
## Pareto k diagnostic values:
## Count Pct
## (-Inf, 0.5] (good) 3452 96.2%
## (0.5, 0.7] (ok) 83 2.3%
## (0.7, 1] (bad) 53 1.5%
## (1, Inf) (very bad) 0 0.0%
## See help('pareto-k-diagnostic') for details.
Comparing the models on PSIS-LOO reveals an estimated difference in \(\hat{elpd}\) in favor of the direct access model over the activation-based model:
(loo_comparison <- compare(loo_dexp_ab, loo_dexp_da))
## elpd_diff se
## 113.0 27.6
We compare the models in their \(\hat{elpd}\), point by point below. (The code producing the graph is available in the Rmd file.)
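As a sketch of the pointwise quantity being compared (using the loo objects computed above; positive differences favor the direct access model):

# Pointwise difference in elpd between the models (sketch)
elpd_diff_n <- loo_dexp_da$pointwise[, "elpd_loo"] - loo_dexp_ab$pointwise[, "elpd_loo"]
sum(elpd_diff_n) # matches the elpd_diff reported by compare()
summary(elpd_diff_n)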
The figure above shows for any given observation, whether one model has an advantage over the other in its predictive accuracy. Since higher (or less negative) values of \(\hat{elpd}\) indicate a better fit for a model, observations that are further away from the dotted line correspond to data that are particularly better predicted by one model (and poorly by the other). This figure shows that the advantage of the direct access model is not due to some outlier observations, but mostly due to a high number of observations that fit slightly better under this model than under the activation-based one (this is the darker patch on the top right corner). (The code producing the graph is available in the Rmd file.)
The figure above shows the difference between the \(\hat{elpd}\) of the two models for every observation corresponding to either a correct or an incorrect response. The figure shows that most of the advantage of the direct access model comes from reading times between 300 and 1000 ms (notice the darker patch above the zero dotted line). In addition, the direct access model has a clear advantage in predicting long reading times associated with correct responses and short reading times associated with incorrect ones, while the activation-based model has an advantage in predicting short reading times for correct responses and long reading times for incorrect ones.
This work presents an evaluation of two well-known models of retrieval processes in sentence comprehension: the activation-based model and the direct access model. We implemented these models in a Bayesian hierarchical framework and showed that some aspects of the data can be explained better by the direct access model. Specifically, the activation-based model cannot predict that, on average, incorrect retrievals are faster than correct ones.
More generally, our work leverages the capabilities of Stan to provide a powerful framework for flexibly developing computational models of competing theories of retrieval, and demonstrates how these models’ predictions can be compared in a Bayesian setting.
Geisser, Seymour, and William F Eddy. 1979. “A Predictive Approach to Model Selection.” Journal of the American Statistical Association 74 (365). Taylor & Francis Group: 153–60. doi:10.2307/2286745.
Gelman, Andrew, John B Carlin, Hal S Stern, and Donald B Rubin. 2014. Bayesian Data Analysis. Third. Taylor & Francis.
Heathcote, Andrew, and Jonathon Love. 2012. “Linear Deterministic Accumulator Models of Simple Choice.” Frontiers in Psychology 3. Frontiers Media SA. doi:10.3389/fpsyg.2012.00292.
Hintze, Jerry L, and Ray D Nelson. 1998. “Violin Plots: A Box Plot-Density Trace Synergism.” The American Statistician 52 (2). Taylor & Francis Group: 181–84. doi:10.1080/00031305.1998.10480559.
Lewis, Richard L., and Shravan Vasishth. 2005. “An Activation-Based Model of Sentence Processing as Skilled Memory Retrieval.” Cognitive Science 29 (3): 375–419. doi:10.1207/s15516709cog0000_25.
McElree, Brian. 1993. “The Locus of Lexical Preference Effects in Sentence Comprehension: A Time-Course Analysis.” Journal of Memory and Language 32 (4). Elsevier: 536–71. doi:10.1006/jmla.1993.1028.
———. 2000. “Sentence Comprehension Is Mediated by Content-Addressable Memory Structures.” Journal of Psycholinguistic Research 29 (2): 111–23. doi:10.1023/A:1005184709695.
Nicenboim, Bruno, Felix Engelmann, Katja Suckow, and Shravan Vasishth. Under revision following review. “Number Interference in German: Evidence for Cue-Based Retrieval.” Cognitive Science. http://www.ling.uni-potsdam.de/~nicenboim/bib/nicenboimetal2016-numint.pdf.
Rouder, Jeffrey N., Jordan M. Province, Richard D. Morey, Pablo Gomez, and Andrew Heathcote. 2014. “The Lognormal Race: A Cognitive-Process Model of Choice and Latency with Desirable Psychometric Properties.” Psychometrika 80 (2). Springer Science + Business Media: 491–513. doi:10.1007/s11336-013-9396-3.
Shiffrin, Richard M., Michael Lee, Woojae Kim, and Eric-Jan Wagenmakers. 2008. “A Survey of Model Evaluation Approaches with a Tutorial on Hierarchical Bayesian Methods.” Cognitive Science 32 (8). Wiley-Blackwell: 1248–84. doi:10.1080/03640210802414826.
Stan Development Team. 2016. “Stan: A C++ Library for Probability and Sampling, Version 2.14.0.” http://mc-stan.org/.
Vehtari, Aki, and Janne Ojanen. 2012. “A Survey of Bayesian Predictive Methods for Model Assessment, Selection and Comparison.” Statistics Surveys 6 (0). Institute of Mathematical Statistics: 142–228. doi:10.1214/12-ss102.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2016. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing.