5.7 Priors and effective data size in mixture models
Suppose we have a two-component mixture model with mixing rate \(\lambda \in (0, 1)\). Because each component's contribution to the likelihood is weighted by its mixture weight, the effective data size available for estimating each component is only a fraction of the overall data size. Thus although there are \(N\) observations, the two components will be estimated with effective data sizes of \(\theta \, N\) and \((1 - \theta) \, N\) for some \(\theta \in (0, 1)\). The fraction \(\theta\) is determined by the posterior responsibilities of the components for the observations, not simply by the mixing rate \(\lambda\). As a consequence, a prior on a component's parameters has more influence on that component's posterior than it would if the component were fit to all \(N\) observations.
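The relationship between \(\theta\) and the responsibilities can be illustrated with a minimal Python sketch. It assumes a two-component normal mixture evaluated at fixed, known parameter values; the data, the function names `normal_pdf` and `effective_sizes`, and the parameter values are hypothetical and chosen only for illustration. In a full Bayesian fit the responsibilities would vary across posterior draws, so this shows the idea for a single parameter setting rather than a complete analysis.

```python
import math
import random


def normal_pdf(y, mu, sigma):
    """Density of a normal distribution with mean mu and scale sigma at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def effective_sizes(y, lam, mu, sigma):
    """Return (theta * N, (1 - theta) * N) for a two-component normal mixture.

    Each observation contributes its posterior responsibility for
    component 1 to the first effective size and the remainder to the second.
    """
    n1 = 0.0
    for y_n in y:
        p1 = lam * normal_pdf(y_n, mu[0], sigma[0])
        p2 = (1.0 - lam) * normal_pdf(y_n, mu[1], sigma[1])
        n1 += p1 / (p1 + p2)  # responsibility of component 1 for y_n
    return n1, len(y) - n1


# Hypothetical data: 30 draws from one component, 70 from the other.
rng = random.Random(1234)
y = [rng.gauss(-5.0, 1.0) for _ in range(30)]
y += [rng.gauss(5.0, 1.0) for _ in range(70)]

# Evaluate at a mixing rate that does not match the 30/70 split.
lam = 0.5
n1, n2 = effective_sizes(y, lam, mu=(-5.0, 5.0), sigma=(1.0, 1.0))
theta = n1 / len(y)
print(f"lambda = {lam}, theta = {theta:.3f}, effective sizes = {n1:.1f}, {n2:.1f}")
```

With components this well separated, nearly every observation is assigned almost entirely to one component, so \(\theta\) should land close to the 30/70 split in the data rather than at \(\lambda = 0.5\).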
Comparison to model averaging
In contrast to mixture models, which create mixtures at the observation level, model averaging creates mixtures over the posteriors of models separately fit with the entire data set. In this situation, the priors work as expected when fitting the models independently, with the posteriors being based on the complete observed data \(y\).
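The contrast can be written out in symbols. The notation here is assumed for illustration and is not taken from the text above: \(p_1\) and \(p_2\) denote the two models, \(\phi_1\) and \(\phi_2\) their parameters, \(\tilde{y}\) a new observation, and \(w \in (0, 1)\) an averaging weight. A mixture model mixes inside the likelihood, one observation at a time,

\[
p(y \mid \lambda, \phi_1, \phi_2)
= \prod_{n=1}^{N} \bigl( \lambda \, p_1(y_n \mid \phi_1) + (1 - \lambda) \, p_2(y_n \mid \phi_2) \bigr),
\]

whereas model averaging, in one common form, mixes posterior predictive distributions that are each conditioned on all of \(y\),

\[
p(\tilde{y} \mid y) = w \, p_1(\tilde{y} \mid y) + (1 - w) \, p_2(\tilde{y} \mid y),
\qquad
p_k(\tilde{y} \mid y) = \int p_k(\tilde{y} \mid \phi_k) \, p_k(\phi_k \mid y) \, \mathrm{d}\phi_k .
\]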
If different models are expected to account for different observations, we recommend building mixture models directly. If the models being mixed are similar, often a single expanded model will capture the features of both and may be used on its own for inferential purposes (estimation, decision making, prediction, etc.). For example, rather than fitting an intercept-only regression and a slope-only regression and then averaging their predictions or combining them in a mixture model, we would recommend building a single regression with both a slope and an intercept. Model complexity, such as having more predictors than data points, can be tamed using appropriately regularizing priors. If computation becomes a bottleneck, model averaging may be the only recourse, because it can be calculated after fitting each model independently (see Hoeting et al. (1999) and Gelman et al. (2013) for theoretical and computational details).