16.4 Effective Sample Size

This is an old version, view current version.

The second technical difficulty posed by MCMC methods is that the samples will typically be autocorrelated (or anticorrelated) within a chain. This increases the uncertainty of the estimation of posterior quantities of interest, such as means, variances, or quantiles; see Geyer (2011).

Stan estimates an effective sample size for each parameter, which plays the role in the Markov chain Monte Carlo central limit theorem (MCMC CLT) as the number of independent draws plays in the standard central limit theorem (CLT).

Unlike most packages, the particular calculations used by Stan follow those for split-\(\hat{R}\), which involve both cross-chain (mean) and within-chain calculations (autocorrelation); see Gelman et al. (2013).

16.4.1 Definition of Effective Sample Size

The amount by which autocorrelation within the chains increases uncertainty in estimates can be measured by effective sample size (ESS). Given independent samples, the central limit theorem bounds uncertainty in estimates based on the number of samples \(N\). Given dependent samples, the number of independent samples is replaced with the effective sample size \(N_{\mathrm{eff}}\), which is the number of independent samples with the same estimation power as the \(N\) autocorrelated samples. For example, estimation error is proportional to \(1 / \sqrt{N_{\mathrm{eff}}}\) rather than \(1/\sqrt{N}\).

The effective sample size of a sequence is defined in terms of the autocorrelations within the sequence at different lags. The autocorrelation \(\rho_t\) at lag \(t \geq 0\) for a chain with joint probability function \(p(\theta)\) with mean \(\mu\) and variance \(\sigma^2\) is defined to be

\[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} (\theta^{(n)} - \mu) (\theta^{(n+t)} - \mu) \, p(\theta) \, d\theta. \]

This is the correlation between the two chains offset by \(t\) positions (i.e., a lag in time-series terminology). Because we know \(\theta^{(n)}\) and \(\theta^{(n+t)}\) have the same marginal distribution in an MCMC setting, multiplying the two difference terms and reducing yields

\[ \rho_t = \frac{1}{\sigma^2} \, \int_{\Theta} \theta^{(n)} \, \theta^{(n+t)} \, p(\theta) \, d\theta - \frac{\mu^2}{\sigma^2}. \]

The effective sample size of \(N\) samples generated by a process with autocorrelations \(\rho_t\) is defined by \[ N_{\mathrm{eff}} \ = \ \frac{N}{\sum_{t = -\infty}^{\infty} \rho_t} \ = \ \frac{N}{1 + 2 \sum_{t = 1}^{\infty} \rho_t}. \]

Effective sample size \(N_{\mathrm{eff}}\) can be larger than \(N\) in case of antithetic Markov chains, which have negative autocorrelations on odd lags. The no-U-turn sampling (NUTS) algorithm used in Stan can produce \(N_{\mathrm{eff}}>N\) for parameters which have close to Gaussian posterior and little dependency on other parameters.

16.4.2 Estimation of Effective Sample Size

In practice, the probability function in question cannot be tractably integrated and thus the autocorrelation cannot be calculated, nor the effective sample size. Instead, these quantities must be estimated from the samples themselves. The rest of this section describes a autocorrelations and split-\(\hat{R}\) based effective sample size estimator, based on multiple chains. As before, each chain \(\theta_m\) will be assumed to be of length \(N\).

Stan carries out the autocorrelation computations for all lags simultaneously using Eigen’s fast Fourier transform (FFT) package with appropriate padding; see Geyer (2011) for more detail on using FFT for autocorrelation calculations. The autocorrelation estimates \(\hat{\rho}_{t,m}\) at lag \(t\) from multiple chains \(m \in (1,\ldots,M)\) are combined with within-sample variance estimate \(W\) and multi-chain variance estimate \(\widehat{\mbox{var}}^{+}\) introduced in the previous section to compute the combined autocorrelation at lag \(t\) as

\[ \hat{\rho}_t = 1 - \frac{\displaystyle W - \textstyle \frac{1}{M}\sum_{m=1}^M s_m^2 \hat{\rho}_{t,m}} {\widehat{\mbox{var}}^{+}}. \]

If the chains have not converged, the variance estimator \(\widehat{\mbox{var}}^{+}\) will overestimate variance, leading to an overestimate of autocorrelation and an underestimate effective sample size.

Because of the noise in the correlation estimates \(\hat{\rho}_t\) as \(t\) increases, a typical truncated sum of \(\hat{\rho}_t\) is used. Negative autocorrelations may occur only on odd lags and by summing over pairs starting from lag 0, the paired autocorrelation is guaranteed to be positive, monotone and convex modulo estimator noise Geyer (1992), Geyer (2011). Stan uses Geyer’s initial monotone sequence criterion. The effective sample size estimator is defined as

\[ \hat{N}_{\mathrm{eff}} = \frac{M \cdot N}{\hat{\tau}}, \]

where

\[ \hat{\tau} = 1 + 2 \sum_{t=1}^{2m+1} \hat{\rho}_t = -1 + 2 \sum_{t'=0}^{m} \hat{P}_{t'}, \]

where \(\hat{P}_{t'}=\hat{\rho}_{2t'}+\hat{\rho}_{2t'+1}\). Initial positive sequence estimators is obtained by choosing the largest \(m\) such that \(\hat{P}_{t'}>0, \quad t' = 1,\ldots,m\). The initial monotone sequence is obtained by further reducing \(\hat{P}_{t'}\) to the minimum of the preceding ones so that the estimated sequence is monotone.

16.4.3 Estimation of MCMC Standard Error

The posterior standard deviation of a parameter \(\theta_n\) conditioned on observed data \(y\) is just the standard deviation of the posterior density \(p(\theta_n | y)\). This is estimated by the standard deviation of the combined posterior draws across chains,

\[ \hat{\sigma}_n = \mathrm{sd}(\theta^{(1)}_n, \ldots, \theta^{(m)}_n). \]

The previous section showed how to estimate \(N_{\mathrm{eff}}\) for a parameter \(\theta_n\) based on multiple chains of posterior draws.

The mean of the posterior draws of \(\theta_n\) \[ \hat{\theta}_n = \mathrm{mean}(\theta^{(1)}_n, \ldots, \theta^{(m)}_n) \]

is treated as an estimator of the true posterior mean,

\[ \mathbb{E}[\theta_n \mid y] \ = \ \int_{-\infty}^{\infty} \, \theta \, p(\theta | y) \, \mathrm{d}\theta_n, \]

based the observed data \(y\).

The standard error for the estimator \(\hat{\theta}_n\) is given by the posterior standard deviation divided by the square root of the effective sample size. This standard error is itself estimated as \(\hat{\sigma}_n / \sqrt{N_{\mathrm{eff}}}\). The smaller the standard error, the closer the estimate \(\hat{\theta}_n\) is expected to be to the true value. This is just the MCMC CLT applied to an estimator; see Geyer (2011) for more details of the MCMC central limit theorem.

16.4.4 Thinning Samples

In the typical situation, the autocorrelation, \(\rho_t\), decreases as the lag, \(t\), increases. When this happens, thinning the samples will reduce the autocorrelation.

For instance, consider generating one thousand posterior draws in one of the following two ways.

Generate 1000 draws after convergence and save all of them.
Generate 10,000 draws after convergence and save every tenth draw.

Even though both produce a sample consisting one thousand draws, the second approach with thinning can produce a higher effective sample size. That’s because the autocorrelation \(\rho_t\) for the thinned sequence is equivalent to \(\rho_{10t}\) in the unthinned sequence, so the sum of the autocorrelations will be lower and thus the effective sample size higher.

Now contrast the second approach above with the unthinned alternative,

Generate 10,000 draws after convergence and save every draw.

This will have a higher effective sample than the thinned sample consisting of every tenth drawn. Therefore, it should be emphasized that the only reason to thin a sample is to reduce memory requirements.