Bayesian Latent Class Models and Handling of Label Switching

Overview of the model

Latent class models are used to identify unobservable or latent subgroups within a population. They are also a special case of a multivariate finite mixture models, where the response variables are categorical. Each unit \(j\) is assumed to belong to one of \(C\) classes, represented by a class indicator \(z_j = 1, \dots, C\). The marginal response probabilities are:

\[\begin{align} \tag{1} \mbox{Pr}(\mathbf{y}_{j}) = \sum_{c = 1}^C\alpha_{c}\prod_{i=1}^{I}\mbox{Pr}(y_{ji}|z_j = c) \label{marginal} \end{align}\]

Where \(\mathbf{y}_{j}\) is the response vector \((y_{j1}, \dots, y_{jI})^\intercal\) for individual \(j\) on \(I\) items, \(z_j\) is the class that individual \(j\) belongs to, and \(\alpha_{c}\) is the probability of being in class \(c\).

We focus on binary responses (\(y = 1\) or \(y = 0\)), with \(y_{ji} \sim \text{Bernoulli}(p_{ci})\) in the current case study, where \(p_{ci}\) is the probability that an individual in class \(c\) endorses item \(i\).

The latent class indicator \(Z_j\) can be viewed as a discrete latent variable. For those who are familiar with JAGS or WinBUGS, the following hierarchical sampling is often used:

\[\begin{align} z_j &\sim \mathsf{Categorical}(\alpha_j)\\ \mathbf{y}_{j}|z_j = c &\sim \underbrace{\mathsf{Bernoulli}(\mathbf{p}_{c})}_{\text{focus in this case study}} \end{align}\]

Where \(\mathbf{\alpha}\) is a simplex \((\alpha_{1}, \dots, \alpha_{C})^\intercal\) with \(\sum_{c = 1}^{C}\alpha_{c} = 1\) and \(\mathbf{p}_{c}\) is the class-specific parameter vector \((p_{c1}, \dots, p_{cI})^\intercal\).

Stan does not support direct sampling of discrete parameters. Instead, models that involve discrete parameters can be coded by marginalizing out the discrete parameters, which is shown in equation 1. It is noteworthy that this marginalized likelihood approach yields a better estimator (in terms of mean squared errors) due to the Rao-Blackwell theorem (i.e., instead of making inference of the discrete latent variables, we condition on their sufficient statistics, which in our case are the proportions of each class).

Stan code for LCA models

Here we present a Stan program for a latent class model:

writeLines(readLines("LCM0425b.stan"))
data {
int<lower=1> I; // # of items
int<lower=1> J; // # of respondents
int<lower=1> C; // # of classes
int y[J,I]; // response  matrix
}

parameters {
  simplex[C] alpha; // probabilities of being in one group

   real <lower = 0, upper = 1> p[C, I];
}
transformed parameters{
   vector[I] p_prod; 

   /* product of endorsing probs across classes 
    to check convergence up to permutation of class labels */
   for(i in 1:I){
     p_prod[i] = prod(p[, i]); 
   }

}

model {
real lmix[C];
for (j in 1:J){
  for (c in 1: C){
               lmix[c] = log(alpha[c]) + bernoulli_lpmf(y[j, ] | p[c,]);
               }
  target += log_sum_exp(lmix);
  }
}

generated quantities {
  int<lower = 1> pred_class_dis[J];     // posterior prediction for respondent j in latent class c
  simplex[C] pred_class[J];     // posterior probabilities of respondent j in latent class c
  real lmix[C];


for (j in 1:J){
  for (c in 1: C){
               lmix[c] = log(alpha[c]) + bernoulli_lpmf(y[j, ] | p[c,]);
    }               
  for (c in 1: C){
               pred_class[j][c] = exp((lmix[c])-log_sum_exp(lmix));
    }
    pred_class_dis[j] = categorical_rng(pred_class[j]);
  }
}

The data block specifies how data are organized: the total number of items I, the number of persons J, and the number of classes C are defined as shown in the model equation. y denotes a \(J \times I\) data matrix where each column represents one item and each row is the response vector for one person. In the parameter block we define our class proportion parameters for each class as a vector alpha and the \(C \times I\) matrix for endorsement probabilities p for \(I\) items and \(C\) classes. The transformed parameter block defines the vector p_prod, the products of each column of p, i.e., the product of endorsing item \(i\) across the \(C\) classes. We will use p_prod as a convergence check that allows for label switching, ash we will discuss later.

In the model block, we define prior distributions for all model parameters (or the default prior with \(\mathsf{Unif}(-\infty, \infty)\) when left unspecified). We compute the logs of the contributions to the marginal probabilities from each class in equation (1), \(\mathrm{log}\alpha_c+\sum_{i=1}^{I}\mathrm{log}\mathsf{Bernoulli}(y_{ji}|p_{ci})\)) and save it in lmix[c]. log_sum_exp computes the logarithm of the sum of the exponentiated elements of lmix[c], giving \(\log \sum^{C = 1}_c (\alpha_c\prod_{i = 1}^I\mathsf{Bernoulli}(y_{ji}|p_{ci}))\). target += increments the resulting log posterior up to a constant.

Coding latent classes in Stan: log_mix and log_sum_exp

log_mix

For models with two classes/mixture components, Stan has a wrapper log_mix for easy implementation: it takes a probability parameter and two log densities for the two mixture components as input. For example, to express that \(y\) is sampled from a mixture of two Bernoulli distributions with parameters \(p_1\) and \(p_2\) and with a known marginal probability of being in the first class equal to .3, we would write log_mix(.3, bernoulli_lpmf(y | p_1), bernoulli_lpmf(y | p_2).

log_sum_exp

log_sum_exp is a more general version of log_mix that handles situations where there are more than two mixture components. It takes two arguments \(a\) and \(b\) (or more if there are more than two mixture components), and then takes the log of the sum of the exponentials (i.e., is \(\text{log_sum_exp}(a,b) = \log(\exp(a) + \exp(b))\). It can be viewed as a summation operation on the log scale.).

To express that \(y\) is sampled from a mixture of two Bernoulli distributions with parameters \(p_1\) and \(p_2\) and with a known marginal probability of being in the first class equal to .3, we would write log_sum_exp(log(.3) + bernoulli_lpmf(y | p_1), log(.7) + bernoulli_lpmf(y | p_2)).

We use log_sum_exp for latent class models given its flexibility dealing with an arbitrary number of latent classes.

Prediction of latent class membership

One common research question with latent class analysis is what class a unit belongs to, that is, we are interested in the posterior probability \(\mbox{Pr}(z_j = c|\mathbf{y})\).

In practice, we draw model parameters \(\theta^{(t)} = (\mathbf{p}^{(t)}, \alpha^{(t)})\), with superscript \((t)\) denoting the \(t^{th}\) draw, from the posterior distribution and obtain the posterior probability of being in the class conditional on these draws. By doing this in generated quantities block, the final sample mean estimate is the Monte Carlo approximation of the expectation of our posterior prediction of class membership. Compared to frequentist approaches, where the posterior probability of being in the class is evaluated at the parameter estimates \(\mbox{Pr}(z_j = c|\mathbf{y}_j; \hat{\theta})\)), the fully Bayesian way of predicting class membership does not treat parameter estimates as known and hence takes all the uncertainty regarding parameter estimates into account, as shown below. \[\begin{align} \mbox{Pr}(z_j = c|\mathbf{y}) &= \int_{ \mathbf{\theta}}\mbox{Pr}(z_j = c|\mathbf{y}_j, \mathbf{\theta})\mbox{Pr}( \mathbf{\theta}|\mathbf{y})d\theta \\ &= E_{\theta|\mathbf{y}}\mbox{Pr}(z_j = c|\mathbf{y}_j, \theta) \\ &\approx \frac{1}{T}\sum^T_{t = 1}\mbox{Pr}(z_j = c|\mathbf{y}_j, \mathbf{\theta}^{(t)}) \end{align}\]

Where \[\begin{align} \mbox{Pr}(z_j = c|\mathbf{y}_j, \theta^{(t)})\\ &= \frac{\mbox{Pr}(z_j = c|\theta^{(t)})\mbox{Pr}(\mathbf{y}_j|z_j = c, \theta^{(t)})}{\sum_{c = 1}^C\mbox{Pr}(z_j = c|\theta^{(t)})\mbox{Pr}(\mathbf{y}_j|z_j = c, \theta^{(t)})}\\ &= \frac{\alpha_{c}^{(t)}\prod_{i=1}^{I}\mbox{Pr}(y_{ji}|z_j = c, \theta^{(t)})}{\sum_{c = 1}^C\alpha_{c}^{(t)}\prod_{i=1}^{I}\mbox{Pr}(y_{ji}|z_j = c, \theta^{(t)})} &= \frac{\alpha_{c}^{(t)}\prod_{i=1}^{I}p_{ci}^{y_{ji}}(1-p_{ci}^{1-y_{ji}})}{\sum_{c = 1}^C\alpha_{c}^{(t)}\prod_{i=1}^{I}p_{ci}^{y_{ji}}(1-p_{ci}^{1-y_{ji}})} \end{align}\].

Distal outcome in latent class models

One common extension of latent class models is modelling the relations between latent classes and a variable \(d_j\) of interest, which is often called distal outcome, by considering \(\mbox{Pr}(d_j|z_j)\). In Stan, all we need is to jointly model \(d_j\) with \(\mathbf{y}_j\), \(\mbox{Pr}(\mathbf{y}_j, d_j|z_j)\), in the model block. Of course, we assume conditional independence, namely, \(\mbox{Pr}(\mathbf{y}_j, d_j|z_j) = \mbox{Pr}(\mathbf{y}_j|z_j)\mbox{Pr}(d_j|z_j)\). We consider the case where the the distal outcome dis_out[j] is binary so we need to include bernoulli_lpmf(dis_out[j] | dis_p[c]) in the same line as the response probability bernoulli_lpmf(y[j] | p[c,]), where dis_p[c] is the distal outcome probability in class \(c\). We can also use a Gaussian distribution to model continuous distal outcomes by normal_lpdf(dis_out[j] | dis_mu[c], dis_sigma[c]), where dis_mu[c] and dis_sigma[c] are the distal outcome mean and standard deviation in class \(c\).

For a binary distal outcome, the original code in line 28 should be replaced by lmix[c] = log(alpha[c]) + bernoulli_lpmf(y[j, ] | p[c,]) + bernoulli_lpmf(dis_out[j] | dis_p[c]);

Data-generation and Label-switching

In this section, we will run our Stan code on data simulated from a latent class model with two classes to assess its ability to recover the generating parameters.

Data generating process for two latent classes

Here we create two classes with probabilities \(\alpha_1 = .2\) and \(\alpha_2 = .8\). We create four items with probabilities of agreeing in class one (\(\mathbf{p}_1\)) set to \((0.4, 0.5, 0.4, 0.35)'\) and the probabilities of agreeing in class two (\(\mathbf{p}_2\)) to \((0.7, 0.1, 0.3, 0.9)'\), as summarized in the table below.

Class 1 (20%) Class 2 (80%)
Item 1 0.4 0.7
Item 2 0.5 0.1
Item 3 0.4 0.3
Item 4 0.35 0.9
# simulate data with four items and two classes
J <- 1000
I = 4
latent_group <- sample(1:2,
                       prob = c(0.2, 0.8),
                       size = J,
                       replace = TRUE)

p1 <- c(0.4, 0.5, 0.4, 0.35)
p2 <- c(0.7, 0.1, 0.3, 0.9)

item <- matrix(data = NA, nrow = J, ncol = I)


for (i in 1:J) {
  item[i, ] = case_when(
    latent_group[i] == 1 ~ rbinom(
      n = rep(1, I),
      size = rep(1, I),
      prob = p1
    ),
    latent_group[i] == 2 ~ rbinom(
      n = rep(1, I),
      size = rep(1, I),
      prob = p2
    )
  )
}


# how the data look like
DT::datatable(item)
# check whether the simulated data match the generating value
item[latent_group == 1,] %>% colMeans()
[1] 0.4179894 0.5079365 0.4656085 0.3492063
item[latent_group == 2,] %>% colMeans()
[1] 0.6966708 0.1085080 0.2811344 0.8939581
#Stan program
 lca_data <- list(y = item, #response matrix
                 J = J, #number of units/persons
                 I = I, #number of items
                 C = 2)

We run Stan with four chains:

library(rstan)
# Stan 
stan_fit<-
  stan(
    file = "LCM0425b.stan",
    data = lca_data,
    iter = 1000,
    chains = 4
  )

Summary of posterior distribution of parameters:

print(stan_fit, c("alpha", "p", "lp__", "p_prod"))
Inference for Stan model: LCM0425b.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

              mean se_mean   sd     2.5%      25%      50%      75%    97.5%
alpha[1]      0.50    0.20 0.29     0.14     0.21     0.47     0.78     0.86
alpha[2]      0.50    0.20 0.29     0.14     0.22     0.53     0.79     0.86
p[1,1]        0.56    0.10 0.15     0.32     0.41     0.60     0.71     0.74
p[1,2]        0.28    0.12 0.18     0.07     0.11     0.23     0.45     0.58
p[1,3]        0.37    0.07 0.10     0.24     0.28     0.33     0.46     0.55
p[1,4]        0.60    0.24 0.35     0.07     0.26     0.71     0.94     0.99
p[2,1]        0.56    0.10 0.15     0.32     0.42     0.60     0.71     0.74
p[2,2]        0.28    0.12 0.18     0.07     0.11     0.23     0.45     0.58
p[2,3]        0.37    0.07 0.10     0.24     0.28     0.33     0.46     0.55
p[2,4]        0.60    0.24 0.35     0.06     0.27     0.71     0.93     0.99
lp__      -2235.01    0.08 2.22 -2239.89 -2236.34 -2234.74 -2233.39 -2231.55
p_prod[1]     0.29    0.00 0.04     0.21     0.27     0.29     0.32     0.37
p_prod[2]     0.05    0.00 0.01     0.03     0.04     0.05     0.06     0.08
p_prod[3]     0.13    0.00 0.02     0.10     0.12     0.13     0.14     0.16
p_prod[4]     0.25    0.00 0.11     0.04     0.17     0.25     0.32     0.45
          n_eff Rhat
alpha[1]      2 5.33
alpha[2]      2 5.33
p[1,1]        2 3.80
p[1,2]        2 3.91
p[1,3]        2 2.76
p[1,4]        2 4.36
p[2,1]        2 3.79
p[2,2]        2 3.79
p[2,3]        2 3.00
p[2,4]        2 4.51
lp__        750 1.00
p_prod[1]   806 1.00
p_prod[2]   640 1.00
p_prod[3]  1033 1.00
p_prod[4]   555 1.00

Samples were drawn using NUTS(diag_e) at Thu Aug 26 23:06:38 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Label-switching

We see that the convergence index Rhat is large for almost all parameters suggesting that there is a problem. The problem is label-switching - a known problem for latent class models and mixture models in general that is due to non-identifiability (i.e., the model is identifiable up to a permutation of the class labels - see Redner and Walker (1984) and, Stephens (2000)).

To see why it happens, take the likelihood contribution of unit \(j\) on items \(1, \dots, I\) assuming \(2\) latent classes, as an example:

\[\begin{align} \mbox{Pr}(\mathbf{y}_{j}) &= \sum_{c = 1}^2 \mbox{Pr}(z_j = c)\mbox{Pr}(\mathbf{y}_{j}|z_j = c)\\ &= \sum_{z_j = 1}^2\alpha_{c}\mathsf{Bernoulli}(\mathbf{y}_{j}|\mathbf{p}_{c})\\ &=\alpha_1\mathsf{Bernoulli}(\mathbf{y}_{j}|\mathbf{p}_{1}) + \alpha_2\mathsf{Bernoulli}(\mathbf{y}_{j}|\mathbf{p}_{2}) \end{align}\]

No matter what the true values of the parameters are, for example, \(\alpha_1 = 0.3, \alpha_2 = 0.7, \mathbf{p}_{1} = (0.4, 0.5, 0.4, 0.35)^{\intercal}, \mathbf{p}_{2} = (0.7, 0.1, 0.3, 0.9)^{\intercal}\), there will be an alternative set of values with equal likelihood (e.g., \(\alpha_1 = 0.7, \alpha_2 = 0.3, \mathbf{p_{1}} = (0.7, 0.1, 0.3, 0.9)^{\intercal}, \mathbf{p_{2}} = (0.4, 0.5, 0.4, 0.35)^{\intercal}\), that is, \(\mbox{Pr}(\mathbf{y}_j|\alpha_1, \alpha_2, \mathbf{p}_{1}, \mathbf{p}_{2}) = \mbox{Pr}(\mathbf{y}_j|\alpha_2, \alpha_1, \mathbf{p}_{2}, \mathbf{p}_{1})\)). Intuitively, the labels do not matter when no additional constraints are added. The reason we know that label switching is the likely problem here is that p_prod, the product of the parameters for the two classes, has good R-hat values in the table above.

The traceplots below also provide clear evidence of label-switching: for a given parameter, the different chains converge to two different values - one corresponding to class one and the other to class two. Each plot entry represents one parameter: the four lines of different colors denote the four chains with different initial values. For a single chain, there are no abrupt changes in the mean which suggests that there is no within-chain label-switching. However, when the chains do not overlap, that suggests between-chain switching.

traceplot(stan_fit, c("alpha", "p"))

There are multiple ways to fix label-switching and we will introduce two ways in the current case study1:

  1. Post-hoc relabeling We can post-process and relabel the draws using the methods developed by Papastamoulis (2015).

  2. Variational Bayes (VB)

Post-hoc relabeling

A detailed introduction to different relabeling algorithms is beyond the scope of our case study, but we can easily apply them using the label.swich package (interested readers can refer to Papastamoulis (2015)).

Taking the Pivotal Reordering Algorithm (PRA) as an example: we select the iteration of MCMC draws for which the parameter draws have the maximum posterior probability as the pivot (or “gold standard”) and then relabel other MCMC draws so that the relabeled parameters are as close as possible to the pivot in terms of their Euclidean distance.

For \(T\) posterior draws, the pivot draw is defined as \[ \theta^{\mathrm{Pivot}}=\arg \max _{t=1, \ldots, T} p\left\{(\boldsymbol{\theta})^{(t)} \mid \mathbf{y}\right\} \] To use the package, we need to firstly extract the original posterior draws from Stan fit object, specifically, extract all posterior draws to post_par. We also extract the posterior predictive classification and probabilities of membership to post_class and post_class_p for later use;

# extract stan fit as the required format of the input
pars <- stan_fit %>% names %>% `[`(1:10)
stan_fit@model_pars
[1] "alpha"          "p"              "p_prod"         "pred_class_dis"
[5] "pred_class"     "lmix"           "lp__"          
post_par <- rstan::extract(stan_fit,
                 c("alpha", "p", "pred_class", "pred_class_dis", "lp__"),
                 permuted = TRUE)

# simulated allocation vectors
post_class <- post_par$pred_class_dis
# classification probabilities
post_class_p <- post_par$pred_class

We then create an array called mcmc with dimensions equal to number of posterior draws after warmup \(\times\) number of classes \(\times\) number of parameters. We then specify different relabeling algrithms to the character object set. We then find the index of MAP draw (i.e., the pivot draw) based on lp__, which is the log posterior probability up to a constant and assign the index to mapindex.

m = 2000 # of draws
K = 2 # of classes
J = 5 # of component-wise parameters

# initialize mcmc arrays
mcmc <- array(data = NA, dim = c(m = m, K = K, J = J))

# assign posterior draws to the array
mcmc[, , 1] <- post_par$alpha
for (i in 1:(J - 1)) {
  mcmc[, , i + 1] <- post_par$p[, , i]
}

# set of selected relabeling algorithm
set <-
  c("PRA",
    "ECR",
    "ECR-ITERATIVE-1",
    "AIC",
    "ECR-ITERATIVE-2",
    "STEPHENS",
    "DATA-BASED")

# find the MAP draw as a pivot
mapindex = which.max(post_par$lp__)

In the label.switching function, we input the previously defined objects, and we can also check whether different algorithms produce the same relabeling.

# switch labels
ls_lcm <-
  label.switching(
    method = set,
    zpivot = post_class[mapindex,],
    z = post_class,
    K = K,
    prapivot = mcmc[mapindex, ,],
    constraint = 1,
    mcmc = mcmc,
    p = post_class_p,
    data = lca_data$y
  )

    ......................................................................................
    . Method                         Time (sec)           Status                         . 
    ......................................................................................
    . PRA                            0.03                 OK                             . 
    . ECR                            5.3                  OK                             . 
    . ECR-ITERATIVE-1                28.15                Converged (4 iterations)       . 
    . AIC                            0.07                 OK                             . 
    . ECR-ITERATIVE-2                13.32                Converged (3 iterations)       . 
    . STEPHENS                       23.88                Converged (4 iterations)       . 
    . DATA-BASED                     8.93                 OK                             . 
    ......................................................................................

    Relabelling all methods according to method PRA ... done!
    Retrieve the 7 permutation arrays by typing:
        [...]$permutations$"PRA"
        [...]$permutations$"ECR"
        [...]$permutations$"ECR-ITERATIVE-1"
        [...]$permutations$"AIC"
        [...]$permutations$"ECR-ITERATIVE-2"
        [...]$permutations$"STEPHENS"
        [...]$permutations$"DATA-BASED"
    Retrieve the 7 best clusterings: [...]$clusters
    Retrieve the 7 CPU times: [...]$timings
    Retrieve the 7 X 7 similarity matrix: [...]$similarity
    Label switching finished. Total time: 81.6 seconds. 
# run time in seconds
ls_lcm$timings
            PRA             ECR ECR-ITERATIVE-1             AIC ECR-ITERATIVE-2 
           0.03            5.30           28.15            0.07           13.32 
       STEPHENS      DATA-BASED 
          23.88            8.93 
# similarity of the classification
ls_lcm$similarity
                PRA ECR ECR-ITERATIVE-1 AIC ECR-ITERATIVE-2 STEPHENS DATA-BASED
PRA               1   1               1   1               1        1          1
ECR               1   1               1   1               1        1          1
ECR-ITERATIVE-1   1   1               1   1               1        1          1
AIC               1   1               1   1               1        1          1
ECR-ITERATIVE-2   1   1               1   1               1        1          1
STEPHENS          1   1               1   1               1        1          1
DATA-BASED        1   1               1   1               1        1          1
# permuted posterior based on ECR method
mcmc_permuted <- permute.mcmc(mcmc, ls_lcm$permutations$ECR)

# change dimension for each parameter defined as in the Stan code
mcmc_permuted <-
  array(
    data = mcmc_permuted$output,
    dim = c(2000, 1, 10),
    dimnames = list(NULL, NULL, pars)
  )

We can assess the same information as the default Stan summary function by monitor function with the new relabeled dataset mcmc_permuted.

# reassess the model convergence after switch the labels
fit_permuted <-
  monitor(mcmc_permuted, warmup = 0,  digits_summary = 3)
Inference for the input samples (1 chains: each with iter = 2000; warmup = 0):

           Q5  Q50  Q95 Mean   SD  Rhat Bulk_ESS Tail_ESS
alpha[1] 0.67 0.79 0.86 0.78 0.06  1.00     1647     1749
alpha[2] 0.14 0.21 0.33 0.22 0.06  1.00     1647     1749
p[1,1]   0.67 0.71 0.74 0.71 0.02  1.00     1929     1629
p[2,1]   0.32 0.42 0.50 0.41 0.06  1.00     1901     2000
p[1,2]   0.07 0.11 0.14 0.11 0.02  1.00     1998     2004
p[2,2]   0.36 0.45 0.58 0.46 0.07  1.00     1644     1901
p[1,3]   0.24 0.28 0.31 0.28 0.02  1.00     1843     1960
p[2,3]   0.39 0.46 0.55 0.47 0.05  1.01     1968     1837
p[1,4]   0.89 0.93 0.99 0.94 0.03  1.00     1745     1719
p[2,4]   0.06 0.27 0.45 0.26 0.11  1.00     2011     1677

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.05).

We can plot the Bayesian estimates and associated 95% credible intervals below after relabeling, and we can see that most of the intervals cover the generating values as expected:

# Get estimated and generating values for wanted parameters
generating_values = c(.8, .2, p2, p1)
sim_summary <- as.data.frame(fit_permuted)
(estimated_values <- sim_summary[pars %>% sort(), c("mean", "2.5%", "97.5%")])
              mean       2.5%     97.5%
alpha[1] 0.7795487 0.65369763 0.8717948
alpha[2] 0.2204513 0.12820519 0.3463024
p[1,1]   0.7068443 0.66617328 0.7500191
p[1,2]   0.1100546 0.06606866 0.1453874
p[1,3]   0.2752799 0.23535235 0.3120311
p[1,4]   0.9357286 0.87995252 0.9918262
p[2,1]   0.4149365 0.29843941 0.5199308
p[2,2]   0.4561739 0.34510355 0.6057716
p[2,3]   0.4661228 0.37574023 0.5681748
p[2,4]   0.2613140 0.03824962 0.4744365
# Assesmble a data frame to pass to ggplot()
sim_df <- data.frame(parameter = factor(pars, rev(pars)), row.names = NULL)
sim_df$middle <- estimated_values[, "mean"] - generating_values
sim_df$lower <- estimated_values[, "2.5%"] - generating_values
sim_df$upper <- estimated_values[, "97.5%"] - generating_values

# Plot the discrepancy
ggplot(sim_df) + aes(x = parameter, y = middle, ymin = lower, ymax = upper) +
    scale_x_discrete() + labs(y = "Discrepancy", x = NULL) + geom_abline(intercept = 0,
    slope = 0, color = "white") + geom_linerange() + geom_point(size = 2) +
    theme(panel.grid = element_blank()) + coord_flip()

Variational Bayes (VB)

Variational Bayes refers a family of methods that approximate often intractable posterior distributions by a simpler distribution. Dropping the subscript for unit j, let \(\theta\) be the vector of parameters, \(\alpha, \mathbf{p}\) in our model, and \(\lambda\) be the parameters of the approximating density \(q\), which serves as an approximation of the posterior distribution \(p(\theta|\mathbf{y})\). Minimizing the Kuback-Leibler distance (\(\mbox{KL}\)) has been shown to be equivalent to maximizing the evidence lower bound (\(\mbox{ELBO}\)), which is then optimized with respect to \(\lambda\) by gradient descent. \(q(\theta; \lambda)\) is often chosen to be some family of distributions that are easy to work with. In Stan, two options are meanfield, the product of unidimensional Gaussians or fullrank, a multidimensional Gaussian, interested readers can refer to Kucukelbir et al. (2015) for more details.

stan_vb <- stan_model(file = "LCM0425b.stan")
# VB works
vb_fit <-
  vb(
    stan_vb,
    data = lca_data,
    iter = 15000,
    elbo_samples = 1000,
    algorithm = c("fullrank"),
  # imporance_resampling = T,
    output_samples = 10000,
    tol_rel_obj = 0.00001
  )
# vb esitmates
print(vb_fit, c("alpha", "p", "p_prod"))
Inference for Stan model: LCM0425b.
1 chains, each with iter=10000; warmup=0; thin=1; 
post-warmup draws per chain=10000, total post-warmup draws=10000.

          mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff khat
alpha[1]  0.22     NaN 0.05 0.14 0.19 0.22 0.25  0.32   NaN 0.61
alpha[2]  0.78     NaN 0.05 0.68 0.75 0.78 0.81  0.86   NaN 0.59
p[1,1]    0.41     NaN 0.06 0.30 0.37 0.41 0.45  0.52   NaN 0.60
p[1,2]    0.46     NaN 0.06 0.35 0.42 0.46 0.50  0.58   NaN 0.60
p[1,3]    0.47     NaN 0.05 0.37 0.44 0.47 0.50  0.57   NaN 0.63
p[1,4]    0.28     NaN 0.09 0.13 0.21 0.27 0.34  0.49   NaN 0.60
p[2,1]    0.71     NaN 0.02 0.66 0.69 0.71 0.72  0.75   NaN 0.61
p[2,2]    0.11     NaN 0.02 0.08 0.10 0.11 0.12  0.15   NaN 0.61
p[2,3]    0.27     NaN 0.02 0.23 0.26 0.27 0.29  0.32   NaN 0.61
p[2,4]    0.93     NaN 0.03 0.87 0.91 0.93 0.95  0.97   NaN 0.61
p_prod[1] 0.29     NaN 0.04 0.21 0.26 0.29 0.32  0.37   NaN 0.60
p_prod[2] 0.05     NaN 0.01 0.03 0.04 0.05 0.06  0.08   NaN 0.61
p_prod[3] 0.13     NaN 0.01 0.10 0.12 0.13 0.14  0.16   NaN 0.61
p_prod[4] 0.26     NaN 0.09 0.12 0.20 0.25 0.32  0.46   NaN 0.59

Approximate samples were drawn using VB(fullrank) at Thu Aug 26 23:16:35 2021.
# prediction proportion correct
#(round(get_posterior_mean(vb_fit,par=c("pred_class")))[1:2000] == latent_group - 1) %>% mean()

We can also plot Bayes estimates and associated credible intervals as above:

# Get estimated and generating values for wanted parameters
pars <- vb_fit %>% names %>% `[`(1:10) %>% sort()
generating_values = c(.2, .8 ,p1 , p2)
sim_summary <- as.data.frame(summary(vb_fit)[[1]])
estimated_values <- sim_summary[pars, c("mean", "2.5%", "97.5%")]
rstan::traceplot(vb_fit)

# Assesmble a data frame to pass to ggplot()
sim_df <- data.frame(parameter = factor(pars, rev(pars)), row.names = NULL)
sim_df$middle <- estimated_values[, "mean"] - generating_values
sim_df$lower <- estimated_values[, "2.5%"] - generating_values
sim_df$upper <- estimated_values[, "97.5%"] - generating_values

# Plot the discrepancy
ggplot(sim_df) + aes(x = parameter, y = middle, ymin = lower, ymax = upper) +
    scale_x_discrete() + labs(y = "Discrepancy", x = NULL) + geom_abline(intercept = 0,
    slope = 0, color = "white") + geom_linerange() + geom_point(size = 2) +
    theme(panel.grid = element_blank()) + coord_flip()

Example application

Description of the dataset

We consider an example from poLCA package, originally from Dayton and Dayton (1998).

319 undergrads responded to four yes-no questions about cheating behaviors.

  • Whether they have lied to avoid taking an exam (LIEEXAM);

  • Whether they have lied to avoid handing a term paper in on time (LIEPAPER);

  • Whether they have purchased a term paper to hand in as their own or have obtained a copy of an exam prior to taking the exam (FRAUD);

  • Whether they have copied answers during an exam from someone sitting near to them (COPYEXAM).

We assume there are two groups - “cheaters” and “non-cheaters” - depending on how they respond to the four questions with their answers no (= 1) and yes (= 2), recorded to 0, 1 respectively for analysis using Stan.

Analysis

First we fit the model with two classes by maximum likelihood using polCA():

set.seed(1112)
data("cheating")

# how the data look like
DT::datatable(cheating)
# LCA
f <- cbind(LIEEXAM,LIEPAPER,FRAUD,COPYEXAM)~1
ch2 <- poLCA(f,cheating,nclass=2)
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$LIEEXAM
           Pr(1)  Pr(2)
class 1:  0.9834 0.0166
class 2:  0.4231 0.5769

$LIEPAPER
           Pr(1)  Pr(2)
class 1:  0.9708 0.0292
class 2:  0.4109 0.5891

$FRAUD
           Pr(1)  Pr(2)
class 1:  0.9629 0.0371
class 2:  0.7840 0.2160

$COPYEXAM
           Pr(1)  Pr(2)
class 1:  0.8181 0.1819
class 2:  0.6236 0.3764

Estimated class population shares 
 0.8394 0.1606 
 
Predicted class memberships (by modal posterior prob.) 
 0.8307 0.1693 
 
========================================================= 
Fit for 2 latent classes: 
========================================================= 
number of observations: 319 
number of estimated parameters: 9 
residual degrees of freedom: 6 
maximum log-likelihood: -440.0271 
 
AIC(2): 898.0542
BIC(2): 931.9409
G^2(2): 7.764242 (Likelihood ratio/deviance statistic) 
X^2(2): 8.3234 (Chi-square goodness of fit) 
 
#summary(ch2)
plot(ch2)

Now we use Stan with two classes:

#Stan program
lca_data <- list(y = (cheating - 1)[, 1:4], #response matrix
                 J = nrow(cheating), #number of units/persons
                 I = 4, #number of items
                 C = 2)
stan_fit<-
  stan(
    file = "LCM0425b.stan",
    data = lca_data,
    iter = 1000,
    chains = 4
  )
print(stan_fit, c("alpha", "p", "p_prod"))
Inference for Stan model: LCM0425b.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

          mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
alpha[1]  0.69    0.22 0.32 0.09 0.59 0.85 0.89  0.93     2 7.81
alpha[2]  0.31    0.22 0.32 0.07 0.11 0.15 0.41  0.91     2 7.81
p[1,1]    0.18    0.18 0.26 0.00 0.02 0.04 0.12  0.82     2 3.88
p[1,2]    0.19    0.18 0.26 0.01 0.03 0.05 0.15  0.82     2 4.09
p[1,3]    0.09    0.06 0.09 0.02 0.04 0.05 0.09  0.34     2 2.44
p[1,4]    0.24    0.06 0.10 0.14 0.18 0.20 0.25  0.51     3 2.15
p[2,1]    0.48    0.19 0.29 0.01 0.20 0.57 0.71  0.89     2 2.60
p[2,2]    0.50    0.19 0.29 0.01 0.23 0.59 0.71  0.91     2 2.68
p[2,3]    0.20    0.07 0.11 0.02 0.08 0.21 0.28  0.41     3 1.68
p[2,4]    0.35    0.07 0.13 0.15 0.22 0.36 0.44  0.60     3 1.58
p_prod[1] 0.02    0.00 0.01 0.00 0.01 0.02 0.03  0.05   770 1.01
p_prod[2] 0.03    0.00 0.01 0.00 0.02 0.03 0.04  0.06   837 1.01
p_prod[3] 0.01    0.00 0.00 0.00 0.01 0.01 0.01  0.02  1427 1.00
p_prod[4] 0.07    0.00 0.02 0.04 0.06 0.07 0.09  0.12  1546 1.00

Samples were drawn using NUTS(diag_e) at Thu Aug 26 23:17:45 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
traceplot(stan_fit, c("alpha", "p", "p_prod"))

From the output, we see that only p_prods have reached convergence based on n_eff and Rhat, which is also consistent with what we observe from the trace plots. Now we move to fix any label switching:

# extract stan fit as the required format of the input
pars <- stan_fit %>% names %>% `[`(1:10)
stan_fit@model_pars
[1] "alpha"          "p"              "p_prod"         "pred_class_dis"
[5] "pred_class"     "lmix"           "lp__"          
post_par <-
  rstan::extract(stan_fit,
                 c("alpha", "p", "pred_class", "pred_class_dis", "lp__"),
                 permuted = TRUE)
# simulated allocation vectors
post_class <- post_par$pred_class_dis
# classification probabilities
post_class_p <- post_par$pred_class

m = 2000 # of draws
K = 2 # of classes
J = 5 # of component-wise parameters
# initialize mcmc arrays
mcmc <- array(data = NA, dim = c(m = m, K = K, J = J))

# assign posterior draws to the array
mcmc[, , 1] <- post_par$alpha
for (i in 1:(J - 1)) {
  mcmc[, , i + 1] <- post_par$p[, , i]
}

# set of selected relabeling algorithm
set <-
  c("PRA",
    "ECR",
    "ECR-ITERATIVE-1",
    "AIC",
    "ECR-ITERATIVE-2",
    "STEPHENS",
    "DATA-BASED")

# find the MAP draw as a pivot
mapindex = which.max(post_par$lp__)

# switch labels
ls_lcm <-
  label.switching(
    method = set,
    zpivot = post_class[mapindex,],
    z = post_class,
    K = K,
    prapivot = mcmc[mapindex, ,],
    constraint = 1,
    mcmc = mcmc,
    p = post_class_p,
    data = lca_data$y
  )

    ......................................................................................
    . Method                         Time (sec)           Status                         . 
    ......................................................................................
    . PRA                            0.02                 OK                             . 
    . ECR                            4.5                  OK                             . 
    . ECR-ITERATIVE-1                10.39                Converged (2 iterations)       . 
    . AIC                            0.06                 OK                             . 
    . ECR-ITERATIVE-2                8.39                 Converged (2 iterations)       . 
    . STEPHENS                       13.53                Converged (3 iterations)       . 
    . DATA-BASED                     10.8                 OK                             . 
    ......................................................................................

    Relabelling all methods according to method PRA ... done!
    Retrieve the 7 permutation arrays by typing:
        [...]$permutations$"PRA"
        [...]$permutations$"ECR"
        [...]$permutations$"ECR-ITERATIVE-1"
        [...]$permutations$"AIC"
        [...]$permutations$"ECR-ITERATIVE-2"
        [...]$permutations$"STEPHENS"
        [...]$permutations$"DATA-BASED"
    Retrieve the 7 best clusterings: [...]$clusters
    Retrieve the 7 CPU times: [...]$timings
    Retrieve the 7 X 7 similarity matrix: [...]$similarity
    Label switching finished. Total time: 48.5 seconds. 
# run time in seconds
ls_lcm$timings
            PRA             ECR ECR-ITERATIVE-1             AIC ECR-ITERATIVE-2 
           0.02            4.50           10.39            0.06            8.39 
       STEPHENS      DATA-BASED 
          13.53           10.80 
# similarity of the classification
ls_lcm$similarity
                PRA ECR ECR-ITERATIVE-1 AIC ECR-ITERATIVE-2 STEPHENS DATA-BASED
PRA               1   1               1   1               1        1          1
ECR               1   1               1   1               1        1          1
ECR-ITERATIVE-1   1   1               1   1               1        1          1
AIC               1   1               1   1               1        1          1
ECR-ITERATIVE-2   1   1               1   1               1        1          1
STEPHENS          1   1               1   1               1        1          1
DATA-BASED        1   1               1   1               1        1          1
# permuted posterior
mcmc_permuted <- permute.mcmc(mcmc, ls_lcm$permutations$ECR)

We can assess model convergence after relabeling:

# change dimension for each parameter defined as in the Stan code
mcmc_permuted <-
  array(
    data = mcmc_permuted$output,
    dim = c(2000, 1, 10),
    dimnames = list(NULL, NULL, pars)
  )


# reassess the model convergence after switch the labels
fit_permuted <-
  monitor(mcmc_permuted, warmup = 0,  digits_summary = 4)
Inference for the input samples (1 chains: each with iter = 2000; warmup = 0):

            Q5   Q50   Q95  Mean    SD  Rhat Bulk_ESS Tail_ESS
alpha[1] 0.075 0.132 0.216 0.137 0.043 1.002     1656     1820
alpha[2] 0.784 0.868 0.925 0.863 0.043 1.002     1656     1820
p[1,1]   0.417 0.627 0.857 0.632 0.137 1.000     1929     1811
p[2,1]   0.005 0.030 0.061 0.031 0.017 1.003     1881     1795
p[1,2]   0.440 0.644 0.880 0.649 0.133 1.004     1878     1891
p[2,2]   0.010 0.042 0.077 0.042 0.020 1.001     1970     1771
p[1,3]   0.126 0.240 0.389 0.245 0.080 1.000     2091     2004
p[2,3]   0.020 0.042 0.069 0.043 0.015 1.001     1910     1922
p[1,4]   0.264 0.398 0.568 0.402 0.094 1.003     1927     1920
p[2,4]   0.145 0.187 0.232 0.187 0.026 1.001     2055     1641

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (an ESS > 100 
per chain is considered good), and Rhat is the potential scale reduction 
factor on rank normalized split chains (at convergence, Rhat <= 1.05).

All Rhat are reported as 1 after rounding so convergence looks good after relabeling. We can see that the estimates using Stan after relabeling are similar to the MLE estimates: specifically, the Stan estimates suggest that, “cheaters” take up 13.7% of population and these individuals tend to have responded “yes” to LIEEXAM, LIEPAPER, FRAUD and COPYEXAM with the estimated probability 0.63, 0.65, 0.25, 0.4, respectively; while “non-cheaters” were estimated to take up 86.3% of population and they tend to have responded “yes” to LIEEXAM, LIEPAPER, FRAUD and COPYEXAM with the estimated probability 0.03, 0.04, 0.04, 0.19, respectively.

Now we use variational Bayes which does not suffer from label switching because it does not involve sampling parameters.

stan_vb <- stan_model(file = "LCM0425b.stan")
# VB works
vb_fit <-
  vb(
    stan_vb,
    data = lca_data,
    iter = 15000,
    elbo_samples = 1000,
    algorithm = c("fullrank"),
  # imporance_resampling = T,
    output_samples = 10000,
    tol_rel_obj = 0.00001,
    refresh = 0
  )

print(vb_fit, c("alpha", "p"))
Inference for Stan model: LCM0425b.
1 chains, each with iter=10000; warmup=0; thin=1; 
post-warmup draws per chain=10000, total post-warmup draws=10000.

         mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff khat
alpha[1] 0.87     NaN 0.04 0.78 0.84 0.87 0.89  0.93   NaN 0.88
alpha[2] 0.13     NaN 0.04 0.07 0.11 0.13 0.16  0.22   NaN 0.84
p[1,1]   0.03     NaN 0.02 0.01 0.02 0.03 0.04  0.08   NaN 0.83
p[1,2]   0.05     NaN 0.02 0.01 0.03 0.04 0.06  0.11   NaN 0.84
p[1,3]   0.04     NaN 0.02 0.02 0.03 0.04 0.05  0.08   NaN 0.84
p[1,4]   0.19     NaN 0.03 0.14 0.17 0.19 0.21  0.25   NaN 0.83
p[2,1]   0.63     NaN 0.12 0.37 0.55 0.64 0.72  0.85   NaN 0.87
p[2,2]   0.65     NaN 0.11 0.41 0.58 0.66 0.74  0.85   NaN 0.89
p[2,3]   0.25     NaN 0.09 0.10 0.18 0.24 0.31  0.47   NaN 0.87
p[2,4]   0.41     NaN 0.11 0.22 0.33 0.41 0.48  0.62   NaN 0.86

Approximate samples were drawn using VB(fullrank) at Thu Aug 26 23:21:18 2021.

The VB estimates are close to the MCMC estimates and maximum likelihood estimates.

References

Dayton, C Mitchell, and Mitchell Chauncey Dayton. 1998. “Latent Class Scaling Analysis,” no. 126.
Kucukelbir, Alp, Rajesh Ranganath, Andrew Gelman, and David M Blei. 2015. “Automatic Variational Inference in Stan.” arXiv Preprint arXiv:1506.03431.
Papastamoulis, Panagiotis. 2015. “Label. Switching: An r Package for Dealing with the Label Switching Problem in MCMC Outputs.” arXiv Preprint arXiv:1503.02271.
Redner, Richard A, and Homer F Walker. 1984. “Mixture Densities, Maximum Likelihood and the EM Algorithm.” SIAM Review 26 (2): 195–239.
Stephens, Matthew. 2000. “Dealing with Label Switching in Mixture Models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 (4): 795–809.

  1. there are alternative ways to fix it: for example, put order constrains on the parameters; initialization around a single mode. We refer the readers who are interested in details to a comprehensive treatment of identifying Bayesian mixture models identifying Bayesian mixture models and label-swiching in mixture models.↩︎