Bayesian Latent Class Models and Handling of Label Switching
Overview of the model
Latent class models are used to identify unobservable or latent subgroups within a population. They are also a special case of a multivariate finite mixture models, where the response variables are categorical. Each unit \(j\) is assumed to belong to one of \(C\) classes, represented by a class indicator \(z_j = 1, \dots, C\). The marginal response probabilities are:
\[\begin{align} \tag{1} \mbox{Pr}(\mathbf{y}_{j}) = \sum_{c = 1}^C\alpha_{c}\prod_{i=1}^{I}\mbox{Pr}(y_{ji}|z_j = c) \label{marginal} \end{align}\]
Where \(\mathbf{y}_{j}\) is the response vector \((y_{j1}, \dots, y_{jI})^\intercal\) for individual \(j\) on \(I\) items, \(z_j\) is the class that individual \(j\) belongs to, and \(\alpha_{c}\) is the probability of being in class \(c\).
We focus on binary responses (\(y = 1\) or \(y = 0\)), with \(y_{ji} \sim \text{Bernoulli}(p_{ci})\) in the current case study, where \(p_{ci}\) is the probability that an individual in class \(c\) endorses item \(i\).
The latent class indicator \(Z_j\) can be viewed as a discrete latent variable. For those who are familiar with JAGS or WinBUGS, the following hierarchical sampling is often used:
\[\begin{align} z_j &\sim \mathsf{Categorical}(\alpha_j)\\ \mathbf{y}_{j}|z_j = c &\sim \underbrace{\mathsf{Bernoulli}(\mathbf{p}_{c})}_{\text{focus in this case study}} \end{align}\]
Where \(\mathbf{\alpha}\) is a simplex \((\alpha_{1}, \dots, \alpha_{C})^\intercal\) with \(\sum_{c = 1}^{C}\alpha_{c} = 1\) and \(\mathbf{p}_{c}\) is the class-specific parameter vector \((p_{c1}, \dots, p_{cI})^\intercal\).
Stan does not support direct sampling of discrete parameters. Instead, models that involve discrete parameters can be coded by marginalizing out the discrete parameters, which is shown in equation 1. It is noteworthy that this marginalized likelihood approach yields a better estimator (in terms of mean squared errors) due to the Rao-Blackwell theorem (i.e., instead of making inference of the discrete latent variables, we condition on their sufficient statistics, which in our case are the proportions of each class).
Stan code for LCA models
Here we present a Stan program for a latent class model:
writeLines(readLines("LCM0425b.stan"))
data {
int<lower=1> I; // # of items
int<lower=1> J; // # of respondents
int<lower=1> C; // # of classes
int y[J,I]; // response matrix
}
parameters {
simplex[C] alpha; // probabilities of being in one group
real <lower = 0, upper = 1> p[C, I];
}
transformed parameters{
vector[I] p_prod;
/* product of endorsing probs across classes
to check convergence up to permutation of class labels */
for(i in 1:I){
p_prod[i] = prod(p[, i]);
}
}
model {
real lmix[C];
for (j in 1:J){
for (c in 1: C){
lmix[c] = log(alpha[c]) + bernoulli_lpmf(y[j, ] | p[c,]);
}
target += log_sum_exp(lmix);
}
}
generated quantities {
int<lower = 1> pred_class_dis[J]; // posterior predictive prediction for respondent j in latent class c
simplex[C] pred_class[J]; // posterior probabilities of respondent j in latent class c
real lmix[C];
for (j in 1:J){
for (c in 1: C){
lmix[c] = log(alpha[c]) + bernoulli_lpmf(y[j, ] | p[c,]);
}
for (c in 1: C){
pred_class[j][c] = exp((lmix[c])-log_sum_exp(lmix));
}
pred_class_dis[j] = categorical_rng(pred_class[j]);
}
}
The data
block specifies how data are organized: the total number of items I
, the number of persons J
, and the number of classes C
are defined as shown in the model equation. y
denotes a \(J \times I\) data matrix where each column represents one item and each row is the response vector for one person. In the parameter
block we define our class proportion parameters for each class as a vector alpha
and the \(C \times I\) matrix for endorsement probabilities p
for \(I\) items and \(C\) classes. The transformed parameter
block defines the vector p_prod
, the products of each column of p
, i.e., the product of endorsing item \(i\) across the \(C\) classes. We will use p_prod
as a convergence check that allows for label switching, ash we will discuss later.
In the model
block, we define prior distributions for all model parameters (or the default prior with \(\mathsf{Unif}(-\infty, \infty)\) when left unspecified). We compute the logs of the contributions to the marginal probabilities from each class in equation (1), \(\mathrm{log}\alpha_c+\sum_{i=1}^{I}\mathrm{log}\mathsf{Bernoulli}(y_{ji}|p_{ci})\)) and save it in lmix[c]
. log_sum_exp
computes the logarithm of the sum of the exponentiated elements of lmix[c]
, giving \(\log \sum^{C = 1}_c (\alpha_c\prod_{i = 1}^I\mathsf{Bernoulli}(y_{ji}|p_{ci}))\). target +=
increments the resulting log posterior up to a constant.
Coding latent classes in Stan: log_mix
and log_sum_exp
log_mix
For models with two classes/mixture components, Stan has a wrapper log_mix
for easy implementation: it takes a probability parameter and two log densities for the two mixture components as input. For example, to express that \(y\) is sampled from a mixture of two Bernoulli distributions with parameters \(p_1\) and \(p_2\) and with a known marginal probability of being in the first class equal to .3, we would write log_mix(.3, bernoulli_lpmf(y | p_1), bernoulli_lpmf(y | p_2)
.
log_sum_exp
log_sum_exp
is a more general version of log_mix
that handles situations where there are more than two mixture components. It takes two arguments \(a\) and \(b\) (or more if there are more than two mixture components), and then takes the log of the sum of the exponentials (i.e., is \(\text{log_sum_exp}(a,b) = \log(\exp(a) + \exp(b))\). It can be viewed as a summation operation on the log scale.).
To express that \(y\) is sampled from a mixture of two Bernoulli distributions with parameters \(p_1\) and \(p_2\) and with a known marginal probability of being in the first class equal to .3, we would write log_sum_exp(log(.3) + bernoulli_lpmf(y | p_1), log(.7) + bernoulli_lpmf(y | p_2))
.
We use log_sum_exp
for latent class models given its flexibility dealing with an arbitrary number of latent classes.
Prediction of latent class membership
One common research question with latent class analysis is what class a unit belongs to, that is, we are interested in the posterior probability \(\mbox{Pr}(z_j = c|\mathbf{y})\).
In practice, we draw model parameters \(\theta^{(t)} = (\mathbf{p}^{(t)}, \alpha^{(t)})\), with superscript \((t)\) denoting the \(t^{th}\) draw, from the posterior distribution and obtain the posterior probability of being in the class conditional on these draws. By doing this in generated quantities
block, the final sample mean estimate is the Monte Carlo approximation of the expectation of our posterior prediction of class membership. Compared to frequentist approaches, where the posterior probability of being in the class is evaluated at the parameter estimates \(\mbox{Pr}(z_j = c|\mathbf{y}_j; \hat{\theta})\)), the fully Bayesian way of predicting class membership does not treat parameter estimates as known and hence takes all the uncertainty regarding parameter estimates into account, as shown below. \[\begin{align}
\mbox{Pr}(z_j =
c|\mathbf{y}) &= \int_{ \mathbf{\theta}}\mbox{Pr}(z_j =
c|\mathbf{y}_j, \mathbf{\theta})\mbox{Pr}( \mathbf{\theta}|\mathbf{y})d\theta \\
&= E_{\theta|\mathbf{y}}\mbox{Pr}(z_j = c|\mathbf{y}_j, \theta) \\ &\approx \frac{1}{T}\sum^T_{t = 1}\mbox{Pr}(z_j =
c|\mathbf{y}_j, \mathbf{\theta}^{(t)})
\end{align}\]
Where \[\begin{align} \mbox{Pr}(z_j = c|\mathbf{y}_j, \theta^{(t)})\\ &= \frac{\mbox{Pr}(z_j = c|\theta^{(t)})\mbox{Pr}(\mathbf{y}_j|z_j = c, \theta^{(t)})}{\sum_{c = 1}^C\mbox{Pr}(z_j = c|\theta^{(t)})\mbox{Pr}(\mathbf{y}_j|z_j = c, \theta^{(t)})}\\ &= \frac{\alpha_{c}^{(t)}\prod_{i=1}^{I}\mbox{Pr}(y_{ji}|z_j = c, \theta^{(t)})}{\sum_{c = 1}^C\alpha_{c}^{(t)}\prod_{i=1}^{I}\mbox{Pr}(y_{ji}|z_j = c, \theta^{(t)})} &= \frac{\alpha_{c}^{(t)}\prod_{i=1}^{I}p_{ci}^{y_{ji}}(1-p_{ci}^{1-y_{ji}})}{\sum_{c = 1}^C\alpha_{c}^{(t)}\prod_{i=1}^{I}p_{ci}^{y_{ji}}(1-p_{ci}^{1-y_{ji}})} \end{align}\].
Distal outcome in latent class models
One common extension of latent class models is modelling the relations between latent classes and a variable \(d_j\) of interest, which is often called distal outcome, by considering \(\mbox{Pr}(d_j|z_j)\). In Stan, all we need is to jointly model \(d_j\) with \(\mathbf{y}_j\), \(\mbox{Pr}(\mathbf{y}_j, d_j|z_j)\), in the model
block. Of course, we assume conditional independence, namely, \(\mbox{Pr}(\mathbf{y}_j, d_j|z_j) = \mbox{Pr}(\mathbf{y}_j|z_j)\mbox{Pr}(d_j|z_j)\). We consider the case where the the distal outcome dis_out[j]
is binary so we need to include bernoulli_lpmf(dis_out[j] | dis_p[c])
in the same line as the response probability bernoulli_lpmf(y[j] | p[c,])
, where dis_p[c]
is the distal outcome probability in class \(c\). We can also use a Gaussian distribution to model continuous distal outcomes by normal_lpdf(dis_out[j] | dis_mu[c], dis_sigma[c])
, where dis_mu[c]
and dis_sigma[c]
are the distal outcome mean and standard deviation in class \(c\).
For a binary distal outcome, the original code in line 28 should be replaced by lmix[c] = log(alpha[c]) + bernoulli_lpmf(y[j, ] | p[c,]) + bernoulli_lpmf(dis_out[j] | dis_p[c]);
Data-generation and Label-switching
In this section, we will run our Stan code on data simulated from a latent class model with two classes to assess its ability to recover the generating parameters.
Data generating process for two latent classes
Here we create two classes with probabilities \(\alpha_1 = .2\) and \(\alpha_2 = .8\). We create four items with probabilities of agreeing in class one (\(\mathbf{p}_1\)) set to \((0.4, 0.5, 0.4, 0.35)'\) and the probabilities of agreeing in class two (\(\mathbf{p}_2\)) to \((0.7, 0.1, 0.3, 0.9)'\), as summarized in the table below.
Class 1 (20%) | Class 2 (80%) | |
---|---|---|
Item 1 | 0.4 | 0.7 |
Item 2 | 0.5 | 0.1 |
Item 3 | 0.4 | 0.3 |
Item 4 | 0.35 | 0.9 |
# simulate data with four items and two classes
<- 1000
J = 4
I <- sample(1:2,
latent_group prob = c(0.2, 0.8),
size = J,
replace = TRUE)
<- c(0.4, 0.5, 0.4, 0.35)
p1 <- c(0.7, 0.1, 0.3, 0.9)
p2
<- matrix(data = NA, nrow = J, ncol = I)
item
for (i in 1:J) {
= case_when(
item[i, ] == 1 ~ rbinom(
latent_group[i] n = rep(1, I),
size = rep(1, I),
prob = p1
),== 2 ~ rbinom(
latent_group[i] n = rep(1, I),
size = rep(1, I),
prob = p2
)
)
}
# how the data look like
::datatable(item) DT
# check whether the simulated data match the generating value
== 1,] %>% colMeans() item[latent_group
[1] 0.4179894 0.5079365 0.4656085 0.3492063
== 2,] %>% colMeans() item[latent_group
[1] 0.6966708 0.1085080 0.2811344 0.8939581
#Stan program
<- list(y = item, #response matrix
lca_data J = J, #number of units/persons
I = I, #number of items
C = 2)
We run Stan with four chains:
library(rstan)
# Stan
<-
stan_fitstan(
file = "LCM0425b.stan",
data = lca_data,
iter = 1000,
chains = 4
)
Summary of posterior distribution of parameters:
print(stan_fit, c("alpha", "p", "lp__", "p_prod"))
Inference for Stan model: LCM0425b.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
alpha[1] 0.50 0.20 0.29 0.14 0.21 0.47 0.78 0.86
alpha[2] 0.50 0.20 0.29 0.14 0.22 0.53 0.79 0.86
p[1,1] 0.56 0.10 0.15 0.32 0.41 0.60 0.71 0.74
p[1,2] 0.28 0.12 0.18 0.07 0.11 0.23 0.45 0.58
p[1,3] 0.37 0.07 0.10 0.24 0.28 0.33 0.46 0.55
p[1,4] 0.60 0.24 0.35 0.07 0.26 0.71 0.94 0.99
p[2,1] 0.56 0.10 0.15 0.32 0.42 0.60 0.71 0.74
p[2,2] 0.28 0.12 0.18 0.07 0.11 0.23 0.45 0.58
p[2,3] 0.37 0.07 0.10 0.24 0.28 0.33 0.46 0.55
p[2,4] 0.60 0.24 0.35 0.06 0.27 0.71 0.93 0.99
lp__ -2235.01 0.08 2.22 -2239.89 -2236.34 -2234.74 -2233.39 -2231.55
p_prod[1] 0.29 0.00 0.04 0.21 0.27 0.29 0.32 0.37
p_prod[2] 0.05 0.00 0.01 0.03 0.04 0.05 0.06 0.08
p_prod[3] 0.13 0.00 0.02 0.10 0.12 0.13 0.14 0.16
p_prod[4] 0.25 0.00 0.11 0.04 0.17 0.25 0.32 0.45
n_eff Rhat
alpha[1] 2 5.33
alpha[2] 2 5.33
p[1,1] 2 3.80
p[1,2] 2 3.91
p[1,3] 2 2.76
p[1,4] 2 4.36
p[2,1] 2 3.79
p[2,2] 2 3.79
p[2,3] 2 3.00
p[2,4] 2 4.51
lp__ 750 1.00
p_prod[1] 806 1.00
p_prod[2] 640 1.00
p_prod[3] 1033 1.00
p_prod[4] 555 1.00
Samples were drawn using NUTS(diag_e) at Wed Jun 23 11:01:46 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Label-switching
We see that the convergence index Rhat
is large for almost all parameters suggesting that there is a problem. The problem is label-switching - a known problem for latent class models and mixture models in general that is due to non-identifiability (i.e., the model is identifiable up to a permutation of the class labels - see Redner and Walker (1984) and, Stephens (2000)).
To see why it happens, take the likelihood contribution of unit \(j\) on items \(1, \dots, I\) assuming \(2\) latent classes, as an example:
\[\begin{align} \mbox{Pr}(\mathbf{y}_{j}) &= \sum_{c = 1}^2 \mbox{Pr}(z_j = c)\mbox{Pr}(\mathbf{y}_{j}|z_j = c)\\ &= \sum_{z_j = 1}^2\alpha_{c}\mathsf{Bernoulli}(\mathbf{y}_{j}|\mathbf{p}_{c})\\ &=\alpha_1\mathsf{Bernoulli}(\mathbf{y}_{j}|\mathbf{p}_{1}) + \alpha_2\mathsf{Bernoulli}(\mathbf{y}_{j}|\mathbf{p}_{2}) \end{align}\]
No matter what the true values of the parameters are, for example, \(\alpha_1 = 0.3, \alpha_2 = 0.7, \mathbf{p}_{1} = (0.4, 0.5, 0.4, 0.35)^{\intercal}, \mathbf{p}_{2} = (0.7, 0.1, 0.3, 0.9)^{\intercal}\), there will be an alternative set of values with equal likelihood (e.g., \(\alpha_1 = 0.7, \alpha_2 = 0.3, \mathbf{p_{1}} = (0.7, 0.1, 0.3, 0.9)^{\intercal}, \mathbf{p_{2}} = (0.4, 0.5, 0.4, 0.35)^{\intercal}\), that is, \(\mbox{Pr}(\mathbf{y}_j|\alpha_1, \alpha_2, \mathbf{p}_{1}, \mathbf{p}_{2}) = \mbox{Pr}(\mathbf{y}_j|\alpha_2, \alpha_1, \mathbf{p}_{2}, \mathbf{p}_{1})\)). Intuitively, the labels do not matter when no additional constraints are added. The reason we know that label switching is the likely problem here is that p_prod
, the product of the parameters for the two classes, has good R-hat values in the table above.
The traceplots below also provide clear evidence of label-switching: for a given parameter, the different chains converge to two different values - one corresponding to class one and the other to class two. Each plot entry represents one parameter: the four lines of different colors denote the four chains with different initial values. For a single chain, there are no abrupt changes in the mean which suggests that there is no within-chain label-switching. However, when the chains do not overlap, that suggests between-chain switching.
traceplot(stan_fit, c("alpha", "p"))
There are multiple ways to fix label-switching and we will introduce two ways in the current case study1:
Post-hoc relabeling We can post-process and relabel the draws using the methods developed by Papastamoulis (2015).
Variational Bayes (VB)
Post-hoc relabeling
A detailed introduction to different relabeling algorithms is beyond the scope of our case study, but we can easily apply them using the label.swich
package (interested readers can refer to Papastamoulis (2015)).
Taking the Pivotal Reordering Algorithm (PRA
) as an example: we select the iteration of MCMC draws for which the parameter draws have the maximum posterior probability as the pivot (or “gold standard”) and then relabel other MCMC draws so that the relabeled parameters are as close as possible to the pivot in terms of their Euclidean distance.
For \(T\) posterior draws, the pivot draw is defined as \[
\theta^{\mathrm{Pivot}}=\arg \max _{t=1, \ldots, T} p\left\{(\boldsymbol{\theta})^{(t)} \mid \mathbf{y}\right\}
\] To use the package, we need to firstly extract the original posterior draws from Stan fit object, specifically, extract all posterior draws to post_par
. We also extract posterior probabilities of membership to post_class_p
for later use;
# extract stan fit as the required format of the input
<- stan_fit %>% names %>% `[`(1:10)
pars @model_pars stan_fit
[1] "alpha" "p" "p_prod" "pred_class_dis"
[5] "pred_class" "lmix" "lp__"
<- rstan::extract(stan_fit,
post_par c("alpha", "p", "pred_class", "lp__"),
permuted = TRUE)
# classification probabilities
<- post_par$pred_class
post_class_p
# simulated allocation vectors
<- ((post_class_p[,,1] > 0.5)*1) + 1 post_class
We then create an array called mcmc
with dimensions equal to number of posterior draws after warmup
\(\times\) number of classes
\(\times\) number of parameters
. We then specify different relabeling algrithms to the character object set
. We then find the index of MAP draw (i.e., the pivot draw) based on lp__
, which is the log posterior probability up to a constant and assign the index to mapindex
.
= 2000 # of draws
m = 2 # of classes
K = 5 # of component-wise parameters
J
# initialize mcmc arrays
<- array(data = NA, dim = c(m = m, K = K, J = J))
mcmc
# assign posterior draws to the array
1] <- post_par$alpha
mcmc[, , for (i in 1:(J - 1)) {
+ 1] <- post_par$p[, , i]
mcmc[, , i
}
# set of selected relabeling algorithm
<-
set c("PRA",
"ECR",
"ECR-ITERATIVE-1",
"AIC",
"ECR-ITERATIVE-2",
"STEPHENS",
"DATA-BASED")
# find the MAP draw as a pivot
= which.max(post_par$lp__) mapindex
In the label.switching
function, we input the previously defined objects, and we can also check whether different algorithms produce the same relabeling.
# switch labels
<-
ls_lcm label.switching(
method = set,
zpivot = post_class[mapindex,],
z = post_class,
K = K,
prapivot = mcmc[mapindex, ,],
constraint = 1,
mcmc = mcmc,
p = post_class_p,
data = lca_data$y
)
......................................................................................
. Method Time (sec) Status .
......................................................................................
. PRA 0.06 OK .
. ECR 5.39 OK .
. ECR-ITERATIVE-1 29.16 Converged (4 iterations) .
. AIC 0.04 OK .
. ECR-ITERATIVE-2 12.86 Converged (3 iterations) .
. STEPHENS 21.16 Converged (4 iterations) .
. DATA-BASED 8.64 OK .
......................................................................................
Relabelling all methods according to method PRA ... done!
Retrieve the 7 permutation arrays by typing:
[...]$permutations$"PRA"
[...]$permutations$"ECR"
[...]$permutations$"ECR-ITERATIVE-1"
[...]$permutations$"AIC"
[...]$permutations$"ECR-ITERATIVE-2"
[...]$permutations$"STEPHENS"
[...]$permutations$"DATA-BASED"
Retrieve the 7 best clusterings: [...]$clusters
Retrieve the 7 CPU times: [...]$timings
Retrieve the 7 X 7 similarity matrix: [...]$similarity
Label switching finished. Total time: 79.4 seconds.
# run time in seconds
$timings ls_lcm
PRA ECR ECR-ITERATIVE-1 AIC ECR-ITERATIVE-2
0.06 5.39 29.16 0.04 12.86
STEPHENS DATA-BASED
21.16 8.64
# similarity of the classification
$similarity ls_lcm
PRA ECR ECR-ITERATIVE-1 AIC ECR-ITERATIVE-2 STEPHENS DATA-BASED
PRA 1 1 1 1 1 1 1
ECR 1 1 1 1 1 1 1
ECR-ITERATIVE-1 1 1 1 1 1 1 1
AIC 1 1 1 1 1 1 1
ECR-ITERATIVE-2 1 1 1 1 1 1 1
STEPHENS 1 1 1 1 1 1 1
DATA-BASED 1 1 1 1 1 1 1
# permuted posterior based on ECR method
<- permute.mcmc(mcmc, ls_lcm$permutations$ECR)
mcmc_permuted
# change dimension for each parameter defined as in the Stan code
<-
mcmc_permuted array(
data = mcmc_permuted$output,
dim = c(2000, 1, 10),
dimnames = list(NULL, NULL, pars)
)
We can assess the same information as the default Stan summary function by monitor
function with the new relabeled dataset mcmc_permuted
.
# reassess the model convergence after switch the labels
<-
fit_permuted monitor(mcmc_permuted, warmup = 0, digits_summary = 3)
Inference for the input samples (1 chains: each with iter = 2000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
alpha[1] 0.67 0.79 0.86 0.78 0.06 1.00 2092 1733
alpha[2] 0.14 0.21 0.33 0.22 0.06 1.00 2092 1733
p[1,1] 0.67 0.71 0.74 0.71 0.02 1.00 1753 1996
p[2,1] 0.32 0.42 0.50 0.41 0.06 1.00 1919 1795
p[1,2] 0.07 0.11 0.14 0.11 0.02 1.00 1854 1922
p[2,2] 0.36 0.45 0.58 0.46 0.07 1.00 2150 1893
p[1,3] 0.24 0.28 0.31 0.28 0.02 1.00 1951 1923
p[2,3] 0.39 0.46 0.55 0.47 0.05 1.01 1846 1919
p[1,4] 0.89 0.93 0.99 0.94 0.03 1.00 1900 1732
p[2,4] 0.06 0.27 0.45 0.26 0.11 1.00 2025 2032
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (an ESS > 100
per chain is considered good), and Rhat is the potential scale reduction
factor on rank normalized split chains (at convergence, Rhat <= 1.05).
We can plot the Bayesian estimates and associated 95% credible intervals below after relabeling, and we can see that most of the intervals cover the generating values as expected:
# Get estimated and generating values for wanted parameters
= c(.8, .2, p2, p1)
generating_values <- as.data.frame(fit_permuted)
sim_summary <- sim_summary[pars %>% sort(), c("mean", "2.5%", "97.5%")]) (estimated_values
mean 2.5% 97.5%
alpha[1] 0.7795487 0.65369763 0.8717948
alpha[2] 0.2204513 0.12820519 0.3463024
p[1,1] 0.7068443 0.66617328 0.7500191
p[1,2] 0.1100546 0.06606866 0.1453874
p[1,3] 0.2752799 0.23535235 0.3120311
p[1,4] 0.9357286 0.87995252 0.9918262
p[2,1] 0.4149365 0.29843941 0.5199308
p[2,2] 0.4561739 0.34510355 0.6057716
p[2,3] 0.4661228 0.37574023 0.5681748
p[2,4] 0.2613140 0.03824962 0.4744365
# Assesmble a data frame to pass to ggplot()
<- data.frame(parameter = factor(pars, rev(pars)), row.names = NULL)
sim_df $middle <- estimated_values[, "mean"] - generating_values
sim_df$lower <- estimated_values[, "2.5%"] - generating_values
sim_df$upper <- estimated_values[, "97.5%"] - generating_values
sim_df
# Plot the discrepancy
ggplot(sim_df) + aes(x = parameter, y = middle, ymin = lower, ymax = upper) +
scale_x_discrete() + labs(y = "Discrepancy", x = NULL) + geom_abline(intercept = 0,
slope = 0, color = "white") + geom_linerange() + geom_point(size = 2) +
theme(panel.grid = element_blank()) + coord_flip()
Variational Bayes (VB)
Variational Bayes refers a family of methods that approximate often intractable posterior distributions by a simpler distribution. Dropping the subscript for unit j, let \(\theta\) be the vector of parameters, \(\alpha, \mathbf{p}\) in our model, and \(\lambda\) be the parameters of the approximating density \(q\), which serves as an approximation of the posterior distribution \(p(\theta|\mathbf{y})\). Minimizing the Kuback-Leibler distance (\(\mbox{KL}\)) has been shown to be equivalent to maximizing the evidence lower bound (\(\mbox{ELBO}\)), which is then optimized with respect to \(\lambda\) by gradient descent. \(q(\theta; \lambda)\) is often chosen to be some family of distributions that are easy to work with. In Stan, two options are meanfield
, the product of unidimensional Gaussians or fullrank
, a multidimensional Gaussian, interested readers can refer to Kucukelbir et al. (2015) for more details.
<- stan_model(file = "LCM0425b.stan")
stan_vb # VB works
<-
vb_fit vb(
stan_vb,data = lca_data,
iter = 15000,
elbo_samples = 1000,
algorithm = c("fullrank"),
# imporance_resampling = T,
output_samples = 10000,
tol_rel_obj = 0.00001
)
# vb esitmates
print(vb_fit, c("alpha", "p", "p_prod"))
Inference for Stan model: LCM0425b.
1 chains, each with iter=10000; warmup=0; thin=1;
post-warmup draws per chain=10000, total post-warmup draws=10000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff khat
alpha[1] 0.22 NaN 0.05 0.14 0.19 0.22 0.25 0.32 NaN 0.61
alpha[2] 0.78 NaN 0.05 0.68 0.75 0.78 0.81 0.86 NaN 0.59
p[1,1] 0.41 NaN 0.06 0.30 0.37 0.41 0.45 0.52 NaN 0.60
p[1,2] 0.46 NaN 0.06 0.35 0.42 0.46 0.50 0.58 NaN 0.60
p[1,3] 0.47 NaN 0.05 0.37 0.44 0.47 0.50 0.57 NaN 0.63
p[1,4] 0.28 NaN 0.09 0.13 0.21 0.27 0.34 0.49 NaN 0.60
p[2,1] 0.71 NaN 0.02 0.66 0.69 0.71 0.72 0.75 NaN 0.61
p[2,2] 0.11 NaN 0.02 0.08 0.10 0.11 0.12 0.15 NaN 0.61
p[2,3] 0.27 NaN 0.02 0.23 0.26 0.27 0.29 0.32 NaN 0.61
p[2,4] 0.93 NaN 0.03 0.87 0.91 0.93 0.95 0.97 NaN 0.61
p_prod[1] 0.29 NaN 0.04 0.21 0.26 0.29 0.32 0.37 NaN 0.60
p_prod[2] 0.05 NaN 0.01 0.03 0.04 0.05 0.06 0.08 NaN 0.61
p_prod[3] 0.13 NaN 0.01 0.10 0.12 0.13 0.14 0.16 NaN 0.61
p_prod[4] 0.26 NaN 0.09 0.12 0.20 0.25 0.32 0.46 NaN 0.59
Approximate samples were drawn using VB(fullrank) at Wed Jun 23 11:11:17 2021.
We can also plot Bayes estimates and associated credible intervals as above:
# Get estimated and generating values for wanted parameters
<- vb_fit %>% names %>% `[`(1:10) %>% sort()
pars = c(.2, .8 ,p1 , p2)
generating_values <- as.data.frame(summary(vb_fit)[[1]])
sim_summary <- sim_summary[pars, c("mean", "2.5%", "97.5%")]
estimated_values ::traceplot(vb_fit) rstan
# Assesmble a data frame to pass to ggplot()
<- data.frame(parameter = factor(pars, rev(pars)), row.names = NULL)
sim_df $middle <- estimated_values[, "mean"] - generating_values
sim_df$lower <- estimated_values[, "2.5%"] - generating_values
sim_df$upper <- estimated_values[, "97.5%"] - generating_values
sim_df
# Plot the discrepancy
ggplot(sim_df) + aes(x = parameter, y = middle, ymin = lower, ymax = upper) +
scale_x_discrete() + labs(y = "Discrepancy", x = NULL) + geom_abline(intercept = 0,
slope = 0, color = "white") + geom_linerange() + geom_point(size = 2) +
theme(panel.grid = element_blank()) + coord_flip()
Example application
Description of the dataset
We consider an example from poLCA
package, originally from Dayton and Dayton (1998).
319 undergrads responded to four yes-no questions about cheating behaviors.
Whether they have lied to avoid taking an exam (
LIEEXAM
);Whether they have lied to avoid handing a term paper in on time (
LIEPAPER
);Whether they have purchased a term paper to hand in as their own or have obtained a copy of an exam prior to taking the exam (
FRAUD
);Whether they have copied answers during an exam from someone sitting near to them (
COPYEXAM
).
We assume there are two groups - “cheaters” and “non-cheaters” - depending on how they respond to the four questions with their answers no (= 1) and yes (= 2), recorded to 0, 1 respectively for analysis using Stan.
Analysis
First we fit the model with two classes by maximum likelihood using polCA()
:
set.seed(1112)
data("cheating")
# how the data look like
::datatable(cheating) DT
# LCA
<- cbind(LIEEXAM,LIEPAPER,FRAUD,COPYEXAM)~1
f <- poLCA(f,cheating,nclass=2) ch2
Conditional item response (column) probabilities,
by outcome variable, for each class (row)
$LIEEXAM
Pr(1) Pr(2)
class 1: 0.9834 0.0166
class 2: 0.4231 0.5769
$LIEPAPER
Pr(1) Pr(2)
class 1: 0.9708 0.0292
class 2: 0.4109 0.5891
$FRAUD
Pr(1) Pr(2)
class 1: 0.9629 0.0371
class 2: 0.7840 0.2160
$COPYEXAM
Pr(1) Pr(2)
class 1: 0.8181 0.1819
class 2: 0.6236 0.3764
Estimated class population shares
0.8394 0.1606
Predicted class memberships (by modal posterior prob.)
0.8307 0.1693
=========================================================
Fit for 2 latent classes:
=========================================================
number of observations: 319
number of estimated parameters: 9
residual degrees of freedom: 6
maximum log-likelihood: -440.0271
AIC(2): 898.0542
BIC(2): 931.9409
G^2(2): 7.764242 (Likelihood ratio/deviance statistic)
X^2(2): 8.3234 (Chi-square goodness of fit)
#summary(ch2)
plot(ch2)
Now we use Stan with two classes:
#Stan program
<- list(y = (cheating - 1)[, 1:4], #response matrix
lca_data J = nrow(cheating), #number of units/persons
I = 4, #number of items
C = 2)
<-
stan_fitstan(
file = "LCM0425b.stan",
data = lca_data,
iter = 1000,
chains = 4
)print(stan_fit, c("alpha", "p", "p_prod"))
Inference for Stan model: LCM0425b.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha[1] 0.69 0.22 0.32 0.09 0.59 0.85 0.89 0.93 2 7.81
alpha[2] 0.31 0.22 0.32 0.07 0.11 0.15 0.41 0.91 2 7.81
p[1,1] 0.18 0.18 0.26 0.00 0.02 0.04 0.12 0.82 2 3.88
p[1,2] 0.19 0.18 0.26 0.01 0.03 0.05 0.15 0.82 2 4.09
p[1,3] 0.09 0.06 0.09 0.02 0.04 0.05 0.09 0.34 2 2.44
p[1,4] 0.24 0.06 0.10 0.14 0.18 0.20 0.25 0.51 3 2.15
p[2,1] 0.48 0.19 0.29 0.01 0.20 0.57 0.71 0.89 2 2.60
p[2,2] 0.50 0.19 0.29 0.01 0.23 0.59 0.71 0.91 2 2.68
p[2,3] 0.20 0.07 0.11 0.02 0.08 0.21 0.28 0.41 3 1.68
p[2,4] 0.35 0.07 0.13 0.15 0.22 0.36 0.44 0.60 3 1.58
p_prod[1] 0.02 0.00 0.01 0.00 0.01 0.02 0.03 0.05 770 1.01
p_prod[2] 0.03 0.00 0.01 0.00 0.02 0.03 0.04 0.06 837 1.01
p_prod[3] 0.01 0.00 0.00 0.00 0.01 0.01 0.01 0.02 1427 1.00
p_prod[4] 0.07 0.00 0.02 0.04 0.06 0.07 0.09 0.12 1546 1.00
Samples were drawn using NUTS(diag_e) at Wed Jun 23 11:12:23 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
traceplot(stan_fit, c("alpha", "p", "p_prod"))
From the output, we see that only p_prod
s have reached convergence based on n_eff
and Rhat
, which is also consistent with what we observe from the trace plots. Now we move to fix any label switching:
# extract stan fit as the required format of the input
<- stan_fit %>% names %>% `[`(1:10)
pars @model_pars stan_fit
[1] "alpha" "p" "p_prod" "pred_class_dis"
[5] "pred_class" "lmix" "lp__"
<- rstan::extract(stan_fit,
post_par c("alpha", "p", "pred_class", "lp__"),
permuted = TRUE)
# classification probabilities
<- post_par$pred_class
post_class_p
# simulated allocation vectors
<- ((post_class_p[,,1] > 0.5)*1) + 1
post_class
= 2000 # of draws
m = 2 # of classes
K = 5 # of component-wise parameters
J # initialize mcmc arrays
<- array(data = NA, dim = c(m = m, K = K, J = J))
mcmc
# assign posterior draws to the array
1] <- post_par$alpha
mcmc[, , for (i in 1:(J - 1)) {
+ 1] <- post_par$p[, , i]
mcmc[, , i
}
# set of selected relabeling algorithm
<-
set c("PRA",
"ECR",
"ECR-ITERATIVE-1",
"AIC",
"ECR-ITERATIVE-2",
"STEPHENS",
"DATA-BASED")
# find the MAP draw as a pivot
= which.max(post_par$lp__)
mapindex
# switch labels
<-
ls_lcm label.switching(
method = set,
zpivot = post_class[mapindex,],
z = post_class,
K = K,
prapivot = mcmc[mapindex, ,],
constraint = 1,
mcmc = mcmc,
p = post_class_p,
data = lca_data$y
)
......................................................................................
. Method Time (sec) Status .
......................................................................................
. PRA 0.03 OK .
. ECR 8.22 OK .
. ECR-ITERATIVE-1 17.14 Converged (2 iterations) .
. AIC 0.04 OK .
. ECR-ITERATIVE-2 14.36 Converged (2 iterations) .
. STEPHENS 21.85 Converged (3 iterations) .
. DATA-BASED 15.67 OK .
......................................................................................
Relabelling all methods according to method PRA ... done!
Retrieve the 7 permutation arrays by typing:
[...]$permutations$"PRA"
[...]$permutations$"ECR"
[...]$permutations$"ECR-ITERATIVE-1"
[...]$permutations$"AIC"
[...]$permutations$"ECR-ITERATIVE-2"
[...]$permutations$"STEPHENS"
[...]$permutations$"DATA-BASED"
Retrieve the 7 best clusterings: [...]$clusters
Retrieve the 7 CPU times: [...]$timings
Retrieve the 7 X 7 similarity matrix: [...]$similarity
Label switching finished. Total time: 78.1 seconds.
# run time in seconds
$timings ls_lcm
PRA ECR ECR-ITERATIVE-1 AIC ECR-ITERATIVE-2
0.03 8.22 17.14 0.04 14.36
STEPHENS DATA-BASED
21.85 15.67
# similarity of the classification
$similarity ls_lcm
PRA ECR ECR-ITERATIVE-1 AIC ECR-ITERATIVE-2 STEPHENS DATA-BASED
PRA 1 1 1 1 1 1 1
ECR 1 1 1 1 1 1 1
ECR-ITERATIVE-1 1 1 1 1 1 1 1
AIC 1 1 1 1 1 1 1
ECR-ITERATIVE-2 1 1 1 1 1 1 1
STEPHENS 1 1 1 1 1 1 1
DATA-BASED 1 1 1 1 1 1 1
# permuted posterior
<- permute.mcmc(mcmc, ls_lcm$permutations$ECR) mcmc_permuted
We can assess model convergence after relabeling:
# change dimension for each parameter defined as in the Stan code
<-
mcmc_permuted array(
data = mcmc_permuted$output,
dim = c(2000, 1, 10),
dimnames = list(NULL, NULL, pars)
)
# reassess the model convergence after switch the labels
<-
fit_permuted monitor(mcmc_permuted, warmup = 0, digits_summary = 4)
Inference for the input samples (1 chains: each with iter = 2000; warmup = 0):
Q5 Q50 Q95 Mean SD Rhat Bulk_ESS Tail_ESS
alpha[1] 0.075 0.132 0.216 0.137 0.043 1.002 1958 1812
alpha[2] 0.784 0.868 0.925 0.863 0.043 1.002 1958 1812
p[1,1] 0.417 0.627 0.857 0.632 0.137 1.000 2075 1846
p[2,1] 0.005 0.030 0.061 0.031 0.017 1.003 1734 1666
p[1,2] 0.440 0.644 0.880 0.649 0.133 1.004 1823 1595
p[2,2] 0.010 0.042 0.077 0.042 0.020 1.001 1922 1848
p[1,3] 0.126 0.240 0.389 0.245 0.080 1.000 1890 1720
p[2,3] 0.020 0.042 0.069 0.043 0.015 1.001 2165 1922
p[1,4] 0.264 0.398 0.568 0.402 0.094 1.003 1963 1920
p[2,4] 0.145 0.187 0.232 0.187 0.026 1.001 1860 1703
For each parameter, Bulk_ESS and Tail_ESS are crude measures of
effective sample size for bulk and tail quantities respectively (an ESS > 100
per chain is considered good), and Rhat is the potential scale reduction
factor on rank normalized split chains (at convergence, Rhat <= 1.05).
All Rhat
are reported as 1 after rounding so convergence looks good after relabeling. We can see that the estimates using Stan after relabeling are similar to the MLE estimates: specifically, the Stan estimates suggest that, “cheaters” take up 13.7% of population and these individuals tend to have responded “yes” to LIEEXAM, LIEPAPER, FRAUD and COPYEXAM with the estimated probability 0.63, 0.65, 0.25, 0.4, respectively; while “non-cheaters” were estimated to take up 86.3% of population and they tend to have responded “yes” to LIEEXAM, LIEPAPER, FRAUD and COPYEXAM with the estimated probability 0.03, 0.04, 0.04, 0.19, respectively.
Now we use variational Bayes which does not suffer from label switching because it does not involve sampling parameters.
<- stan_model(file = "LCM0425b.stan")
stan_vb # VB works
<-
vb_fit vb(
stan_vb,data = lca_data,
iter = 15000,
elbo_samples = 1000,
algorithm = c("fullrank"),
# imporance_resampling = T,
output_samples = 10000,
tol_rel_obj = 0.00001,
refresh = 0
)
print(vb_fit, c("alpha", "p"))
Inference for Stan model: LCM0425b.
1 chains, each with iter=10000; warmup=0; thin=1;
post-warmup draws per chain=10000, total post-warmup draws=10000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff khat
alpha[1] 0.87 NaN 0.04 0.78 0.84 0.87 0.89 0.93 NaN 0.88
alpha[2] 0.13 NaN 0.04 0.07 0.11 0.13 0.16 0.22 NaN 0.84
p[1,1] 0.03 NaN 0.02 0.01 0.02 0.03 0.04 0.08 NaN 0.83
p[1,2] 0.05 NaN 0.02 0.01 0.03 0.04 0.06 0.11 NaN 0.84
p[1,3] 0.04 NaN 0.02 0.02 0.03 0.04 0.05 0.08 NaN 0.84
p[1,4] 0.19 NaN 0.03 0.14 0.17 0.19 0.21 0.25 NaN 0.83
p[2,1] 0.63 NaN 0.12 0.37 0.55 0.64 0.72 0.85 NaN 0.87
p[2,2] 0.65 NaN 0.11 0.41 0.58 0.66 0.74 0.85 NaN 0.89
p[2,3] 0.25 NaN 0.09 0.10 0.18 0.24 0.31 0.47 NaN 0.87
p[2,4] 0.41 NaN 0.11 0.22 0.33 0.41 0.48 0.62 NaN 0.86
Approximate samples were drawn using VB(fullrank) at Wed Jun 23 11:15:50 2021.
The VB estimates are close to the MCMC estimates and maximum likelihood estimates.
References
there are alternative ways to fix it: for example, put order constrains on the parameters; initialization around a single mode. We refer the readers who are interested in details to a comprehensive treatment of identifying Bayesian mixture models identifying Bayesian mixture models and label-swiching in mixture models.↩︎