16.4 Effective sample size
The second technical difficulty posed by MCMC methods is that the samples will typically be autocorrelated (or anticorrelated) within a chain. This increases the uncertainty in estimates of posterior quantities of interest, such as means, variances, or quantiles; see Geyer (2011).
Stan estimates an effective sample size for each parameter, which plays the same role in the Markov chain Monte Carlo central limit theorem (MCMC CLT) as the number of independent draws plays in the standard central limit theorem (CLT).
Unlike most packages, Stan follows the calculations used for split-\(\hat{R}\), which involve both cross-chain (mean) and within-chain (autocorrelation) calculations; see Gelman et al. (2013).
16.4.1 Definition of effective sample size
The amount by which autocorrelation within the chains increases uncertainty in estimates can be measured by effective sample size (ESS). Given independent samples, the central limit theorem bounds uncertainty in estimates based on the number of samples \(N\). Given dependent samples, the number of independent samples is replaced with the effective sample size \(N_{\mathrm{eff}}\), which is the number of independent samples with the same estimation power as the \(N\) autocorrelated samples. For example, estimation error is proportional to \(1 / \sqrt{N_{\mathrm{eff}}}\) rather than \(1/\sqrt{N}\).
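For instance, 1,000 positively correlated draws with \(N_{\mathrm{eff}} = 250\) estimate a posterior mean with the same precision as 250 independent draws; halving the estimation error requires quadrupling the effective sample size, not just the number of iterations.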
The effective sample size of a sequence is defined in terms of the autocorrelations within the sequence at different lags. The autocorrelation \(\rho_t\) at lag \(t \geq 0\) for a chain with joint probability function \(p(\theta)\) with mean \(\mu\) and variance \(\sigma^2\) is defined to be
\[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} (\theta^{(n)} - \mu) (\theta^{(n+t)} - \mu) \, p(\theta) \, d\theta. \]
This is the correlation between the chain and itself offset by \(t\) positions (i.e., a lag in time-series terminology). Because we know \(\theta^{(n)}\) and \(\theta^{(n+t)}\) have the same marginal distribution in an MCMC setting, multiplying the two difference terms and reducing yields
\[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} \theta^{(n)} \, \theta^{(n+t)} \, p(\theta) \, d\theta - \frac{\mu^2}{\sigma^2}. \]
The effective sample size of \(N\) samples generated by a process with autocorrelations \(\rho_t\) is defined by \[ N_{\mathrm{eff}} \ = \ \frac{N}{\sum_{t = -\infty}^{\infty} \rho_t} \ = \ \frac{N}{1 + 2 \sum_{t = 1}^{\infty} \rho_t}. \]
For independent draws, the effective sample size is just the number of iterations. For correlated draws, the effective sample size will be lower than the number of iterations. For anticorrelated draws, the effective sample size can be larger than the number of iterations. In this latter case, MCMC can work better than independent sampling for some estimation problems. Hamiltonian Monte Carlo, including the no-U-turn sampler used by default in Stan, can produce anticorrelated draws if the posterior is close to Gaussian with little posterior correlation.
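As a concrete illustration (not part of Stan's implementation, and with an illustrative function name), consider a stationary AR(1) chain with parameter \(\phi\), for which \(\rho_t = \phi^t\) and the sum above has the closed form \(N_{\mathrm{eff}} = N(1-\phi)/(1+\phi)\). A minimal Python sketch:

```python
def ess_ar1(n_draws, phi):
    """Effective sample size of an AR(1) chain with rho_t = phi**t, using
    N_eff = N / (1 + 2 * sum_{t>=1} phi**t) = N * (1 - phi) / (1 + phi)."""
    return n_draws * (1.0 - phi) / (1.0 + phi)

for phi in (0.0, 0.9, -0.3):
    print(f"phi = {phi:+.1f}: N_eff = {ess_ar1(1000, phi):6.1f} out of 1000 draws")
# phi = +0.0: N_eff = 1000.0  (independent draws)
# phi = +0.9: N_eff =   52.6  (strong positive correlation)
# phi = -0.3: N_eff = 1857.1  (anticorrelation: more than the number of iterations)
```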
16.4.2 Estimation of effective sample size
In practice, the probability function in question cannot be tractably integrated, so neither the autocorrelation nor the effective sample size can be calculated exactly. Instead, these quantities must be estimated from the samples themselves. The rest of this section describes an effective sample size estimator based on within-chain autocorrelations and the split-\(\hat{R}\) quantities, computed from multiple chains. As before, each chain \(\theta_m\) will be assumed to be of length \(N\).
Stan carries out the autocorrelation computations for all lags simultaneously using Eigen’s fast Fourier transform (FFT) package with appropriate padding; see Geyer (2011) for more detail on using FFT for autocorrelation calculations. The autocorrelation estimates \(\hat{\rho}_{t,m}\) at lag \(t\) from multiple chains \(m \in (1,\ldots,M)\) are combined with the within-chain variance estimate \(W\) and the multi-chain variance estimate \(\widehat{\mbox{var}}^{+}\) introduced in the previous section to compute the combined autocorrelation at lag \(t\) as
\[ \hat{\rho}_t = 1 - \frac{\displaystyle W - \textstyle \frac{1}{M}\sum_{m=1}^M s_m^2 \hat{\rho}_{t,m}} {\widehat{\mbox{var}}^{+}}. \]
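The following Python sketch (illustrative only, with hypothetical function names; Stan's C++ implementation computes the per-chain autocorrelations with Eigen's FFT rather than the direct sums used here) shows how \(W\), \(\widehat{\mbox{var}}^{+}\), and the per-chain autocorrelations combine:

```python
import numpy as np

def chain_autocorrelations(x, max_lag):
    """Lag 0..max_lag autocorrelations of a single chain, by direct summation.
    (Stan instead computes all lags at once with Eigen's FFT and padding.)"""
    x = x - x.mean()
    acov = np.array([np.dot(x[:len(x) - t], x[t:]) / len(x)
                     for t in range(max_lag + 1)])
    return acov / acov[0]

def combined_autocorrelation(chains, max_lag):
    """Combined autocorrelation rho_hat_t from an (M, N) array of post-warmup
    draws for a single parameter, using the within-chain variance W and the
    multi-chain variance estimate var_plus from the split-R-hat section."""
    M, N = chains.shape
    s_sq = chains.var(axis=1, ddof=1)            # per-chain variances s_m^2
    W = s_sq.mean()                              # within-chain variance
    B = N * chains.mean(axis=1).var(ddof=1)      # between-chain variance
    var_plus = (N - 1) / N * W + B / N           # multi-chain variance estimate

    rho_m = np.array([chain_autocorrelations(chains[m], max_lag)
                      for m in range(M)])        # rho_hat_{t,m}, shape (M, max_lag+1)
    # rho_hat_t = 1 - (W - (1/M) sum_m s_m^2 * rho_hat_{t,m}) / var_plus
    return 1.0 - (W - (s_sq[:, None] * rho_m).mean(axis=0)) / var_plus
```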
If the chains have not converged, the variance estimator \(\widehat{\mbox{var}}^{+}\) will overestimate the variance, leading to an overestimate of the autocorrelation and an underestimate of the effective sample size.
Because the correlation estimates \(\hat{\rho}_t\) become increasingly noisy as \(t\) increases, a truncated sum of \(\hat{\rho}_t\) is typically used. Negative autocorrelations can occur only on odd lags, so by summing over pairs of lags starting from lag 0, the paired autocorrelation is guaranteed to be positive, monotone, and convex modulo estimator noise (Geyer 1992, 2011). Stan uses Geyer’s initial monotone sequence criterion. The effective sample size estimator is defined as
\[ \hat{N}_{\mathrm{eff}} = \frac{M \cdot N}{\hat{\tau}}, \]
where
\[ \hat{\tau} = 1 + 2 \sum_{t=1}^{2m+1} \hat{\rho}_t = -1 + 2 \sum_{t'=0}^{m} \hat{P}_{t'}, \]
where \(\hat{P}_{t'}=\hat{\rho}_{2t'}+\hat{\rho}_{2t'+1}\). The initial positive sequence estimator is obtained by choosing the largest \(m\) such that \(\hat{P}_{t'} > 0\) for all \(t' = 1,\ldots,m\). The initial monotone sequence is obtained by further reducing each \(\hat{P}_{t'}\) to the minimum of the preceding values, so that the estimated sequence is monotone.
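A simplified Python sketch of this truncation (the function name is hypothetical, and Stan's implementation adds further numerical guards):

```python
import numpy as np

def ess_from_autocorrelations(rho_hat, M, N):
    """Estimate N_eff from the combined autocorrelations rho_hat at lags
    0, 1, 2, ... using Geyer's initial positive / monotone sequence."""
    # Pair consecutive lags: P_t' = rho_{2t'} + rho_{2t'+1}.
    n_pairs = len(rho_hat) // 2
    P = np.array([rho_hat[2 * t] + rho_hat[2 * t + 1] for t in range(n_pairs)])

    # Initial positive sequence: keep the leading run of strictly positive pairs.
    nonpositive = np.where(P <= 0)[0]
    m = nonpositive[0] if len(nonpositive) else len(P)
    P = P[:m]

    # Initial monotone sequence: force the kept pairs to be non-increasing.
    P = np.minimum.accumulate(P)

    tau_hat = -1.0 + 2.0 * P.sum()               # tau_hat = -1 + 2 * sum_t' P_t'
    return M * N / tau_hat
```

Chaining the two sketches, `ess_from_autocorrelations(combined_autocorrelation(chains, max_lag), M, N)` yields \(\hat{N}_{\mathrm{eff}}\) for the draws in `chains`.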
16.4.3 Estimation of MCMC standard error
The posterior standard deviation of a parameter \(\theta_n\) conditioned on observed data \(y\) is just the standard deviation of the posterior density \(p(\theta_n | y)\). This is estimated by the standard deviation of the combined posterior draws across chains,
\[ \hat{\sigma}_n = \mathrm{sd}(\theta^{(1)}_n, \ldots, \theta^{(m)}_n). \]
The previous section showed how to estimate \(N_{\mathrm{eff}}\) for a parameter \(\theta_n\) based on multiple chains of posterior draws.
The mean of the posterior draws of \(\theta_n\) \[ \hat{\theta}_n = \mathrm{mean}(\theta^{(1)}_n, \ldots, \theta^{(m)}_n) \]
is treated as an estimator of the true posterior mean,
\[ \mathbb{E}[\theta_n \mid y] \ = \ \int_{-\infty}^{\infty} \, \theta_n \, p(\theta_n \mid y) \, \mathrm{d}\theta_n, \]
based on the observed data \(y\).
The standard error for the estimator \(\hat{\theta}_n\) is given by the posterior standard deviation divided by the square root of the effective sample size. This standard error is itself estimated as \(\hat{\sigma}_n / \sqrt{N_{\mathrm{eff}}}\). The smaller the standard error, the closer the estimate \(\hat{\theta}_n\) is expected to be to the true value. This is just the MCMC CLT applied to an estimator; see Geyer (2011) for more details of the MCMC central limit theorem.
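A minimal sketch of these two estimates (assuming \(\hat{N}_{\mathrm{eff}}\) has already been computed as above; the function name is illustrative):

```python
import numpy as np

def posterior_mean_and_mcse(chains, n_eff_hat):
    """Posterior-mean estimate and its MCMC standard error for one parameter.
    `chains` is an (M, N) array of draws; `n_eff_hat` is the effective sample
    size estimate from the previous subsection."""
    draws = chains.reshape(-1)            # pool the draws across chains
    theta_hat = draws.mean()              # estimate of E[theta_n | y]
    sigma_hat = draws.std(ddof=1)         # estimate of the posterior sd
    return theta_hat, sigma_hat / np.sqrt(n_eff_hat)
```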
16.4.4 Thinning samples
In complex posteriors, draws are almost always positively correlated, and the autocorrelation \(\rho_t\) decreases as the lag \(t\) increases. In this situation, thinning the sample by keeping only every \(N\)-th draw will reduce the autocorrelation of the resulting chain. This is particularly useful if we need to save storage or re-use the draws for inference.
For instance, consider generating one thousand posterior draws in one of the following two ways.
- Generate 1,000 draws after convergence and save all of them.
- Generate 10,000 draws after convergence and save every tenth draw.
Even though both produce a sample consisting of one thousand draws, the second approach with thinning can produce a higher effective sample size when the draws are positively correlated. That’s because the autocorrelation \(\rho_t\) for the thinned sequence is equal to \(\rho_{10t}\) in the unthinned sequence, so the sum of the autocorrelations will be lower and thus the effective sample size higher.
Now contrast the second approach above with the unthinned alternative,
- Generate 10,000 draws after convergence and save every draw.
This will typically have a higher effective sample size than the thinned sample consisting of every tenth draw, but the gap might not be very large, as the sketch below illustrates. To summarize, the only reason to thin a sample is to reduce memory requirements.
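As a rough numerical illustration (assuming, purely hypothetically, that the chain behaves like an AR(1) process with lag-1 correlation 0.9):

```python
def ess_ar1(n_draws, phi):
    # ESS of an AR(1)-like chain with rho_t = phi**t (same closed form as earlier)
    return n_draws * (1.0 - phi) / (1.0 + phi)

phi = 0.9                          # hypothetical lag-1 correlation of the chain
print(ess_ar1(10_000, phi))        # ~526: generate 10,000 draws, keep them all
print(ess_ar1(1_000, phi ** 10))   # ~483: generate 10,000 draws, keep every tenth
print(ess_ar1(1_000, phi))         # ~ 53: generate only 1,000 draws
```

Under these assumed numbers, the unthinned long run wins, but not by much over the thinned chain, while generating only 1,000 draws in the first place is far worse.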
If draws are anticorrelated, then thinning will increase correlation and reduce the overall effective sample size.