Time-Series Models

Time-series data come arranged in temporal order. This chapter presents two kinds of time-series models: regression-like models, such as autoregressive and moving average models, and hidden Markov models.

The Gaussian processes chapter presents Gaussian processes, which may also be used for time-series (and spatial) data.

Autoregressive models

A first-order autoregressive model (AR(1)) with normal noise takes each point \(y_n\) in a sequence \(y\) to be generated according to \[ y_n \sim \textsf{normal}(\alpha + \beta y_{n-1}, \sigma). \]

That is, the expected value of \(y_n\) is \(\alpha + \beta y_{n-1}\), with noise scaled as \(\sigma\).

AR(1) models

With improper flat priors on the regression coefficients \(\alpha\) and \(\beta\) and on the positively-constrained noise scale (\(\sigma\)), the Stan program for the AR(1) model is as follows.¹

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  for (n in 2:N) {
    y[n] ~ normal(alpha + beta * y[n-1], sigma);
  }
}

The first observed data point, y[1], is not modeled here because there is nothing to condition on; instead, it acts to condition y[2]. This model also uses an improper prior for sigma, but there is no obstacle to adding an informative prior if information is available on the scale of the changes in y over time, or a weakly informative prior to help guide inference if rough knowledge of the scale of y is available.
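As a minimal sketch of the weakly informative option, the model block below adds proper priors to the AR(1) program. The prior families and scales are illustrative assumptions, not recommendations; they presume y has been roughly standardized and should be adjusted to the units of the data.

```stan
model {
  // Illustrative priors (assumed scales; adjust to the units of y)
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 2);  // half-normal, because sigma is declared with lower=0
  for (n in 2:N) {
    y[n] ~ normal(alpha + beta * y[n - 1], sigma);
  }
}
```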

Slicing for efficiency

Although perhaps a bit more difficult to read, a much more efficient way to write the above model is by slicing the vectors, with the model above being replaced with the one-liner

model {
  y[2:N] ~ normal(alpha + beta * y[1:(N - 1)], sigma);
}

The left-hand side slicing operation pulls out the last \(N-1\) elements and the right-hand side version pulls out the first \(N-1\).

Extensions to the AR(1) model

Proper priors of a range of different families may be added for the regression coefficients and noise scale. The normal noise model can be changed to a Student-\(t\) distribution or any other distribution with unbounded support. The model could also be made hierarchical if multiple series of observations are available.
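For instance, a Student-\(t\) noise model might be sketched as follows. The degrees-of-freedom parameter nu, its lower bound, and its gamma prior are assumptions introduced for illustration.

```stan
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
  real<lower=1> nu;      // degrees of freedom (hypothetical parameter)
}
model {
  nu ~ gamma(2, 0.1);    // assumed prior, chosen only for illustration
  y[2:N] ~ student_t(nu, alpha + beta * y[1:(N - 1)], sigma);
}
```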

To enforce the estimation of a stationary AR(1) process, the slope coefficient beta may be constrained with bounds as follows.

real<lower=-1, upper=1> beta;

In practice, such a constraint is not recommended. If the data are not well fit by a stationary model it is best to know this. Stationary parameter estimates can be encouraged with a prior favoring values of beta near zero.
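One way to express such a prior is sketched below, leaving beta unconstrained. The normal family and the scale of 0.5 are assumptions; with that scale, roughly 95% of the prior mass lies in \((-1, 1)\), yet the data can still pull beta outside that interval.

```stan
parameters {
  real alpha;
  real beta;               // left unconstrained
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 0.5);   // assumed scale; favors stationary values
  y[2:N] ~ normal(alpha + beta * y[1:(N - 1)], sigma);
}
```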

AR(2) models

Extending the order of the model is also straightforward. For example, an AR(2) model could be coded with the second-order coefficient gamma and the following model statement.

for (n in 3:N) {
  y[n] ~ normal(alpha + beta*y[n-1] + gamma*y[n-2], sigma);
}
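The AR(2) loop can also be sliced in the same way as the AR(1) model. The following is a sketch of a complete program, assuming y is declared as a vector and that at least three observations are available.

```stan
data {
  int<lower=3> N;        // assumes at least three observations
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real gamma;
  real<lower=0> sigma;
}
model {
  // y[3:N] is regressed on its first lag y[2:(N-1)] and second lag y[1:(N-2)]
  y[3:N] ~ normal(alpha + beta * y[2:(N - 1)] + gamma * y[1:(N - 2)], sigma);
}
```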

AR(\(K\)) models

A general model where the order is itself given as data can be coded by putting the coefficients in an array and computing the linear predictor in a loop.

data {
  int<lower=0> K;
  int<lower=0> N;
  array[N] real y;
}
parameters {
  real alpha;
  array[K] real beta;
  real<lower=0> sigma;
}
model {
  for (n in (K+1):N) {
    real mu = alpha;
    for (k in 1:K) {
      mu += beta[k] * y[n-k];
    }
    y[n] ~ normal(mu, sigma);
  }
}
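The inner loop over lags can alternatively be written as a dot product. This sketch assumes y and beta are declared as vectors rather than arrays and that \(K \geq 1\); it computes the same linear predictor as the program above.

```stan
data {
  int<lower=1> K;        // assumes at least one lag
  int<lower=K> N;
  vector[N] y;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  for (n in (K + 1):N) {
    // reverse() orders the lags as y[n-1], ..., y[n-K] to line up with beta
    y[n] ~ normal(alpha + dot_product(beta, reverse(y[(n - K):(n - 1)])),
                  sigma);
  }
}
```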

ARCH(1) models

Econometric and financial time-series models usually assume heteroscedasticity: they allow the scale of the noise terms defining the series to vary over time. The simplest such model is the autoregressive conditional heteroscedasticity (ARCH) model (Engle 1982). Unlike the autoregressive model AR(1), which modeled the mean of the series as varying over time but left the noise term fixed, the ARCH(1) model takes the scale of the noise terms to vary over time but leaves the mean term fixed. Models could be defined where both the mean and scale vary over time; the econometrics literature presents a wide range of time-series modeling choices.

The ARCH(1) model is typically presented as the following sequence of equations, where \(r_t\) is the observed return at time point \(t\) and \(\mu\), \(\alpha_0\), and \(\alpha_1\) are unknown regression coefficient parameters.

\[\begin{align*} r_t &= \mu + a_t \\ a_t &= \sigma_t \epsilon_t \\ \epsilon_t &\sim \textsf{normal}(0,1) \\ \sigma^2_t &= \alpha_0 + \alpha_1 a_{t-1}^2 \end{align*}\]

In order to ensure the noise terms \(\sigma^2_t\) are positive, the scale coefficients are constrained to be positive, \(\alpha_0, \alpha_1 > 0\). To ensure stationarity of the time series, the slope is constrained to be less than one, i.e., \(\alpha_1 < 1\).²

The ARCH(1) model may be coded directly in Stan as follows.

data {
  int<lower=0> T;                // number of time points
  array[T] real r;               // return at time t
}
parameters {
  real mu;                       // average return
  real<lower=0> alpha0;          // noise intercept
  real<lower=0, upper=1> alpha1; // noise slope
}
model {
  for (t in 2:T) {
    r[t] ~ normal(mu, sqrt(alpha0 + alpha1
                                    * pow(r[t - 1] - mu,2)));
  }
}

The loop in the model is defined so that the return at time \(t=1\) is not modeled; the model in the next section shows how to model the return at \(t=1\). The model can be vectorized to be more efficient; the model in the next section provides an example.
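One possible vectorization of the ARCH(1) program itself is sketched here. It assumes r is declared as a vector instead of an array, so that the elementwise square() and sqrt() functions can be applied to a slice, and that at least two returns are observed.

```stan
data {
  int<lower=2> T;                // number of time points (assumes T >= 2)
  vector[T] r;                   // return at time t
}
parameters {
  real mu;                       // average return
  real<lower=0> alpha0;          // noise intercept
  real<lower=0, upper=1> alpha1; // noise slope
}
model {
  // square() and sqrt() apply elementwise to vectors
  r[2:T] ~ normal(mu, sqrt(alpha0 + alpha1 * square(r[1:(T - 1)] - mu)));
}
```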

Modeling temporal heteroscedasticity

A set of variables is homoscedastic if their variances are all the same; the variables are heteroscedastic if they do not all have the same variance. Heteroscedastic time-series models allow the noise term to vary over time.

GARCH(1,1) models

The basic generalized autoregressive conditional heteroscedasticity (GARCH) model, GARCH(1,1), extends the ARCH(1) model by including the previous variance \(\sigma^2_{t-1}\), alongside the squared previous difference in return from the mean at time \(t-1\), as a predictor of volatility at time \(t\), defining \[ \sigma^2_t = \alpha_0 + \alpha_1 a^2_{t-1} + \beta_1 \sigma^2_{t-1}. \]

To ensure the scale term is positive and the resulting time series stationary, the coefficients must all be positive, \(\alpha_0, \alpha_1, \beta_1 > 0\), and the slopes must satisfy \(\alpha_1 + \beta_1 < 1\).

data {
  int<lower=0> T;
  array[T] real r;
  real<lower=0> sigma1;
}
parameters {
  real mu;
  real<lower=0> alpha0;
  real<lower=0, upper=1> alpha1;
  real<lower=0, upper=(1-alpha1)> beta1;
}
transformed parameters {
  array[T] real<lower=0> sigma;
  sigma[1] = sigma1;
  for (t in 2:T) {
    sigma[t] = sqrt(alpha0
                     + alpha1 * pow(r[t - 1] - mu, 2)
                     + beta1 * pow(sigma[t - 1], 2));
  }
}
model {
  r ~ normal(mu, sigma);
}

To get the recursive definition of the volatility regression off the ground, the data declaration includes a non-negative value sigma1 for the scale of the noise at \(t = 1\).

The constraints are coded directly on the parameter declarations. This declaration is order-specific in that the constraint on beta1 depends on the value of alpha1.

A transformed parameter array of non-negative values sigma is used to store the scale values at each time point. The definition of these values in the transformed parameters block is where the regression is now defined. There is an intercept alpha0, a slope alpha1 for the squared difference in return from the mean at the previous time, and a slope beta1 for the previous noise scale squared. Finally, the whole regression is inside the sqrt function because Stan requires scale (deviation) parameters (not variance parameters) for the normal distribution.

With the regression in the transformed parameters block, the model reduces to a single vectorized distribution statement. Because r and sigma are of length T, all of the data are modeled directly.
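Because sigma is stored for every time point, the same recursion can be carried one step past the data. The following generated quantities block is a sketch of a one-step-ahead simulation. The names sigma_next and r_next are hypothetical, and the block assumes \(T \geq 1\).

```stan
generated quantities {
  // scale and simulated return for the unobserved time T + 1
  real<lower=0> sigma_next
      = sqrt(alpha0
             + alpha1 * square(r[T] - mu)
             + beta1 * square(sigma[T]));
  real r_next = normal_rng(mu, sigma_next);
}
```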

Moving average models

A moving average model uses previous errors as predictors for future outcomes. For a moving average model of order \(Q\), \(\mbox{MA}(Q)\), there is an overall mean parameter \(\mu\) and regression coefficients \(\theta_q\) for previous error terms. With \(\epsilon_t\) being the noise at time \(t\), the model for outcome \(y_t\) is defined by \[ y_t = \mu + \theta_1 \epsilon_{t-1} + \dotsb + \theta_Q \epsilon_{t-Q} + \epsilon_t, \] with the noise term \(\epsilon_t\) for outcome \(y_t\) modeled as normal, \[ \epsilon_t \sim \textsf{normal}(0,\sigma). \] In a proper Bayesian model, the parameters \(\mu\), \(\theta\), and \(\sigma\) must all be given priors.

MA(2) example

An \(\mbox{MA}(2)\) model can be coded in Stan as follows.

data {
  int<lower=3> T;          // number of observations
  vector[T] y;             // observation at time t
}
parameters {
  real mu;                 // mean
  real<lower=0> sigma;     // error scale
  vector[2] theta;         // lag coefficients
}
transformed parameters {
  vector[T] epsilon;       // error terms
  epsilon[1] = y[1] - mu;
  epsilon[2] = y[2] - mu - theta[1] * epsilon[1];
  for (t in 3:T) {
    epsilon[t] = ( y[t] - mu
                    - theta[1] * epsilon[t - 1]
                    - theta[2] * epsilon[t - 2] );
  }
}
model {
  mu ~ cauchy(0, 2.5);
  theta ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  for (t in 3:T) {
    y[t] ~ normal(mu
                  + theta[1] * epsilon[t - 1]
                  + theta[2] * epsilon[t - 2],
                  sigma);
  }
}

The error terms \(\epsilon_t\) are defined as transformed parameters in terms of the observations and parameters. The distribution statement (which also defines the likelihood) follows the definition of the model, which can only be applied to \(y_t\) for \(t > Q\). In this example, the parameters are all given Cauchy (half-Cauchy for \(\sigma\)) priors, although other priors can be used just as easily.

This model could be improved in terms of speed by vectorizing the distribution statement in the model block. The calculation of the \(\epsilon_t\) could also be sped up by using a dot product instead of a loop.
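As a sketch of the first suggestion, the loop over distribution statements in the MA(2) program can be replaced by a single sliced statement. The data, parameters, and transformed parameters blocks are assumed to be unchanged.

```stan
model {
  mu ~ cauchy(0, 2.5);
  theta ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  // one statement for t = 3, ..., T replaces the loop
  y[3:T] ~ normal(mu
                  + theta[1] * epsilon[2:(T - 1)]
                  + theta[2] * epsilon[1:(T - 2)],
                  sigma);
}
```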

Vectorized MA(Q) model

A general \(\mbox{MA}(Q)\) model with a vectorized distribution statement may be defined as follows.

data {
  int<lower=0> Q;       // num previous noise terms
  int<lower=3> T;       // num observations
  vector[T] y;          // observation at time t
}
parameters {
  real mu;              // mean
  real<lower=0> sigma;  // error scale
  vector[Q] theta;      // error coeff, lag -t
}
transformed parameters {
  vector[T] epsilon;    // error term at time t
  for (t in 1:T) {
    epsilon[t] = y[t] - mu;
    for (q in 1:min(t - 1, Q)) {
      epsilon[t] = epsilon[t] - theta[q] * epsilon[t - q];
    }
  }
}
model {
  vector[T] eta;
  mu ~ cauchy(0, 2.5);
  theta ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  for (t in 1:T) {
    eta[t] = mu;
    for (q in 1:min(t - 1, Q)) {
      eta[t] = eta[t] + theta[q] * epsilon[t - q];
    }
  }
  y ~ normal(eta, sigma);
}

Here all of the data are modeled, with missing terms just dropped from the regressions as in the calculation of the error terms. Both models converge quickly and mix well at convergence, with the vectorized model being faster per iteration (though not in the number of iterations needed to converge, because both programs compute the same model).
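The inner loop over lags in the transformed parameters block can also be written as a dot product. This is a sketch that computes the same error terms; whether it is noticeably faster will depend on \(Q\). The local variable m is introduced for illustration.

```stan
transformed parameters {
  vector[T] epsilon;    // error term at time t
  for (t in 1:T) {
    int m = min(t - 1, Q);   // number of lagged errors available at time t
    epsilon[t] = y[t] - mu;
    if (m > 0) {
      // reverse() orders the errors as epsilon[t-1], ..., epsilon[t-m]
      epsilon[t] -= dot_product(theta[1:m],
                                reverse(epsilon[(t - m):(t - 1)]));
    }
  }
}
```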

Autoregressive moving average models

Autoregressive moving-average (ARMA) models combine the predictors of the autoregressive model and the moving average model. An ARMA(1,1) model, with a single state of history, can be encoded in Stan as follows.

data {
  int<lower=1> T;            // num observations
  array[T] real y;           // observed outputs
}
parameters {
  real mu;                   // mean coeff
  real phi;                  // autoregression coeff
  real theta;                // moving avg coeff
  real<lower=0> sigma;       // noise scale
}
model {
  vector[T] nu;              // prediction for time t
  vector[T] err;             // error for time t
  nu[1] = mu + phi * mu;     // assume err[0] == 0
  err[1] = y[1] - nu[1];
  for (t in 2:T) {
    nu[t] = mu + phi * y[t - 1] + theta * err[t - 1];
    err[t] = y[t] - nu[t];
  }
  mu ~ normal(0, 10);        // priors
  phi ~ normal(0, 2);
  theta ~ normal(0, 2);
  sigma ~ cauchy(0, 5);
  err ~ normal(0, sigma);    // error model
}

The data are declared in the same way as the other time-series regressions and the parameters are documented in the code.

In the model block, the local vector nu stores the predictions and err the errors. These are computed similarly to the errors in the moving average models described in the previous section.

The priors are weakly informative for stationary processes. The data model only involves the error term, which is efficiently vectorized here.

Often in models such as these, it is desirable to inspect the calculated error terms. This could easily be accomplished in Stan by declaring err as a transformed parameter, then defining it the same way as in the model above. The vector nu could still be a local variable, only now it will be in the transformed parameter block.
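A sketch of that rearrangement follows, with the data and parameters blocks unchanged. The nested braces keep nu local, so that only err is saved in the output.

```stan
transformed parameters {
  vector[T] err;               // error for time t, saved in the output
  {
    vector[T] nu;              // prediction for time t (local, not saved)
    nu[1] = mu + phi * mu;     // assume err[0] == 0
    err[1] = y[1] - nu[1];
    for (t in 2:T) {
      nu[t] = mu + phi * y[t - 1] + theta * err[t - 1];
      err[t] = y[t] - nu[t];
    }
  }
}
model {
  mu ~ normal(0, 10);          // priors
  phi ~ normal(0, 2);
  theta ~ normal(0, 2);
  sigma ~ cauchy(0, 5);
  err ~ normal(0, sigma);      // error model
}
```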

Wayne Folta suggested encoding the model without local vector variables as follows.

model {
  real err;
  mu ~ normal(0, 10);
  phi ~ normal(0, 2);
  theta ~ normal(0, 2);
  sigma ~ cauchy(0, 5);
  err = y[1] - (mu + phi * mu);
  err ~ normal(0, sigma);
  for (t in 2:T) {
    err = y[t] - (mu + phi * y[t - 1] + theta * err);
    err ~ normal(0, sigma);
  }
}

This approach to ARMA models illustrates how local variables, such as err in this case, can be reused in Stan. Folta’s approach could be extended to higher order moving-average models by storing more than one error term as a local variable and reassigning them in the loop.

Both encodings are fast. The original encoding has the advantage of vectorizing the normal distribution, but it uses a bit more memory. A halfway point would be to vectorize just err.

Identifiability and stationarity

MA and ARMA models are not identifiable if the roots of the characteristic polynomial for the MA part lie inside the unit circle, so it’s necessary to add the following constraint³

real<lower=-1, upper=1> theta;

When the model is run without the constraint, using synthetic data generated from the model, the simulation can sometimes find modes for (theta, phi) outside the \([-1,1]\) interval, which creates a multiple mode problem in the posterior and also causes the NUTS tree depth to get large (often above 10). Adding the constraint both improves the accuracy of the posterior and dramatically reduces the tree depth, which speeds up the simulation considerably (typically by much more than an order of magnitude).

Further, unless one thinks that the process is really non-stationary, it’s worth adding the following constraint to ensure stationarity.

real<lower=-1, upper=1> phi;
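Putting both constraints together, the parameters block of the ARMA(1,1) program might look as follows. With these bounds, the normal priors on phi and theta in the model block are implicitly truncated to \((-1, 1)\).

```stan
parameters {
  real mu;                          // mean coeff
  real<lower=-1, upper=1> phi;      // autoregression coeff (stationarity)
  real<lower=-1, upper=1> theta;    // moving avg coeff (identifiability)
  real<lower=0> sigma;              // noise scale
}
```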

Stochastic volatility models

Stochastic volatility models treat the volatility (i.e., variance) of a return on an asset, such as an option to buy a security, as following a latent stochastic process in discrete time (Kim, Shephard, and Chib 1998). The data consist of mean corrected (i.e., centered) returns \(y_t\) on an underlying asset at \(T\) equally spaced time points. Kim et al. formulate a typical stochastic volatility model using the following regression-like equations, with a latent parameter \(h_t\) for the log volatility, along with parameters \(\mu\) for the mean log volatility, and \(\phi\) for the persistence of the volatility term. The variable \(\epsilon_t\) represents the white-noise shock (i.e., multiplicative error) on the asset return at time \(t\), whereas \(\delta_t\) represents the shock on volatility at time \(t\). \[\begin{align*} y_t &= \epsilon_t \exp(h_t / 2) \\ h_{t+1} &= \mu + \phi (h_t - \mu) + \delta_t \sigma \\ h_1 &\sim \textsf{normal}\left( \mu, \frac{\sigma}{\sqrt{1 - \phi^2}} \right) \\ \epsilon_t &\sim \textsf{normal}(0,1) \\ \delta_t &\sim \textsf{normal}(0,1) \end{align*}\]

Rearranging the first line gives \(\epsilon_t = y_t \exp(-h_t / 2)\), allowing the distribution for \(y_t\) to be written as \[ y_t \sim \textsf{normal}(0,\exp(h_t/2)). \] The recurrence equation for \(h_{t+1}\) may be combined with the scaling of \(\delta_t\) to yield the distribution \[ h_t \sim \mathsf{normal}(\mu + \phi(h_{t-1} - \mu), \sigma). \] This formulation can be directly encoded, as shown in the following Stan model.

data {
  int<lower=0> T;   // # time points (equally spaced)
  vector[T] y;      // mean corrected return at time t
}
parameters {
  real mu;                     // mean log volatility
  real<lower=-1, upper=1> phi; // persistence of volatility
  real<lower=0> sigma;         // white noise shock scale
  vector[T] h;                 // log volatility at time t
}
model {
  phi ~ uniform(-1, 1);
  sigma ~ cauchy(0, 5);
  mu ~ cauchy(0, 10);
  h[1] ~ normal(mu, sigma / sqrt(1 - phi * phi));
  for (t in 2:T) {
    h[t] ~ normal(mu + phi * (h[t - 1] -  mu), sigma);
  }
  for (t in 1:T) {
    y[t] ~ normal(0, exp(h[t] / 2));
  }
}

Compared to the Kim et al. formulation, the Stan model adds priors for the parameters \(\phi\), \(\sigma\), and \(\mu\). The shock terms \(\epsilon_t\) and \(\delta_t\) do not appear explicitly in the model, although they could be calculated efficiently in a generated quantities block.
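A sketch of such a generated quantities block is given below, obtained by solving the first two equations of the model for the shocks. The variable names epsilon and delta are chosen to match the notation above, and the block assumes \(T \geq 1\).

```stan
generated quantities {
  // return shocks: epsilon[t] = y[t] * exp(-h[t] / 2)
  vector[T] epsilon = y .* exp(-h / 2);
  // volatility shocks for t = 1, ..., T - 1, from the recurrence for h[t + 1]
  vector[T - 1] delta
      = (h[2:T] - mu - phi * (h[1:(T - 1)] - mu)) / sigma;
}
```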

The posterior of a stochastic volatility model such as this one typically has high posterior variance. For example, simulating 500 data points from the above model with \(\mu = -1.02\), \(\phi = 0.95\), and \(\sigma = 0.25\) leads to 95% posterior intervals for \(\mu\) of \((-1.23, -0.54)\), for \(\phi\) of \((0.82, 0.98)\), and for \(\sigma\) of \((0.16, 0.38)\).

The NUTS draws show a high degree of autocorrelation, both for this model and the stochastic volatility model evaluated in (Hoffman and Gelman 2014). Using a non-diagonal mass matrix provides faster convergence and higher effective sample size than a diagonal mass matrix, but will not scale to large values of \(T\).

It is relatively straightforward to speed up the effective sample size per second generated by this model by one or more orders of magnitude. First, the distribution statement for return \(y\) is easily vectorized to

y ~ normal(0, exp(h / 2));

This speeds up the iterations, but does not change the effective sample size because the underlying parameterization and log probability function have not changed. Mixing is improved by reparameterizing in terms of a standardized volatility, then rescaling. This requires a standardized parameter h_std to be declared instead of h.

parameters {
  // ...
  vector[T] h_std;  // std log volatility time t
}

The original value of h is then defined in a transformed parameter block.

transformed parameters {
  vector[T] h = h_std * sigma;  // now h ~ normal(0, sigma)
  h[1] /= sqrt(1 - phi * phi);  // rescale h[1]
  h += mu;
  for (t in 2:T) {
    h[t] += phi * (h[t - 1] - mu);
  }
}

The first assignment rescales h_std to have a \(\textsf{normal}(0,\sigma)\) distribution and temporarily assigns it to h. The second assignment rescales h[1] so that its prior differs from that of h[2] through h[T]. The next assignment supplies a mu offset, so that h[2] through h[T] are now distributed \(\textsf{normal}(\mu,\sigma)\); note that this shift must be done after the rescaling of h[1]. The final loop adds in the autoregressive term so that h[2] through h[T] are appropriately modeled relative to phi and mu.

As a final improvement, the distribution statements for h[1] to h[T] are replaced with a single vectorized standard normal distribution statement.

model {
  // ...
  h_std ~ std_normal();
}

Although the original model can take hundreds and sometimes thousands of iterations to converge, the reparameterized model reliably converges in tens of iterations. Mixing is also dramatically improved, which results in higher effective sample sizes per iteration. Finally, each iteration runs in roughly a quarter of the time of the original iterations.
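Assembling the pieces described above, the complete reparameterized program can be sketched as follows. The lower bound of 1 on T is an assumption added here so that h[1] always exists.

```stan
data {
  int<lower=1> T;   // # time points (equally spaced); assumes T >= 1
  vector[T] y;      // mean corrected return at time t
}
parameters {
  real mu;                     // mean log volatility
  real<lower=-1, upper=1> phi; // persistence of volatility
  real<lower=0> sigma;         // white noise shock scale
  vector[T] h_std;             // std log volatility time t
}
transformed parameters {
  vector[T] h = h_std * sigma;  // now h ~ normal(0, sigma)
  h[1] /= sqrt(1 - phi * phi);  // rescale h[1]
  h += mu;
  for (t in 2:T) {
    h[t] += phi * (h[t - 1] - mu);
  }
}
model {
  phi ~ uniform(-1, 1);
  sigma ~ cauchy(0, 5);
  mu ~ cauchy(0, 10);
  h_std ~ std_normal();
  y ~ normal(0, exp(h / 2));
}
```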

Hidden Markov models

A hidden Markov model (HMM) is a probabilistic model over \(N\) observations \(y_{1:N}\) and \(N\) hidden states \(z_{1:N}\). This model is defined by the conditional distributions \(p(y_n \mid z_n, \phi)\) and \(p(z_n \mid z_{n-1}, \phi)\). Here we make the dependency on additional model parameters \(\phi\) explicit. (\(\phi\) may be a vector of parameters.) The complete data likelihood is then \[ p(y, z \mid \phi) = \prod_n p(y_n \mid z_n, \phi) p(z_n \mid z_{n - 1}, \phi). \] When \(z_{1:N}\) is continuous, the user can explicitly encode these distributions in Stan and use Markov chain Monte Carlo to integrate \(z\) out.

When each state \(z\) takes a value over a discrete and finite set, say \(\{1, 2, \ldots, K\}\), we can use Stan’s suite of HMM functions to marginalize out \(z_{1:N}\) and compute \[ p(y_{1:N} \mid \phi) = \int_{\mathcal Z} p(y, z \mid \phi) \text d z. \] We start by defining the conditional observation distribution, stored in a \(K \times N\) matrix \(\omega\) with \[ \omega_{kn} = p(y_n \mid z_n = k, \phi). \] Next, we introduce the \(K \times K\) transition matrix, \(\Gamma\), with \[ \Gamma_{ij} = p(z_n = j \mid z_{n - 1} = i, \phi). \] (This is a right-stochastic matrix.) Finally, we define the initial state \(K\)-vector \(\rho\), with \[ \rho_k = p(z_0 = k \mid \phi). \] It is common practice to set \(\rho\) to be the stationary distribution of the HMM; that is, because \(\Gamma\) is right-stochastic, \(\rho\) is the left eigenvector of \(\Gamma\) with eigenvalue one and solves \(\Gamma^\top \rho = \rho\).

As an example, consider a three-state model with \(K=3\). The observations are normally distributed conditional on the HMM states with \[ y_n \sim \text{normal}(\mu_k, \sigma), \] where \(\mu = (1, 5, 9)\) and the standard deviation \(\sigma\) is the same across all observations. The model is then

data {
  int N;  // Number of observations
  array[N] real y;
}

parameters {
  // Rows of the transition matrix
  array[3] simplex[3] gamma_arr;

  // Initial state
  simplex[3] rho;

  // Parameters of measurement model
  vector[3] mu;
  real<lower = 0.0> sigma;
}

transformed parameters {
  // Build transition matrix
  matrix[3, 3] gamma;
  for (k in 1:3) gamma[k, ] = to_row_vector(gamma_arr[k]);

  // Compute the log likelihoods in each possible state
  matrix[3, N] log_omega;
  for (n in 1:N) {
    for (i in 1:3) {
      log_omega[i, n] = normal_lpdf(y[n] | mu[i], sigma);
    }
  }
}

model {
  // prior
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  
  // no explicit prior on gamma_arr, meaning we default to a
  // uniform prior over the simplexes.

  // Increment target by log p(y | mu, sigma, Gamma, rho)
  target += hmm_marginal(log_omega, gamma, rho);
}

The last function hmm_marginal takes in all the ingredients of the HMM and computes the relevant log marginal distribution, \(\log p(y \mid \phi)\).

If we desire draws from the posterior distribution of \(z\), we use the generated quantities block and draw, for each sample \(\phi\), a sample from \(p(z \mid y, \phi)\). In effect, MCMC produces draws from \(p(\phi \mid y)\) and with the draws in generated quantities, we obtain draws from \(p(\phi \mid y) p(z \mid y, \phi) = p(z, \phi \mid y)\). It is also possible to compute the posterior probability of each hidden state, that is \(\text{Pr}(z_n = k \mid \phi, y)\). Averaging these probabilities over all MCMC draws, we obtain \(\text{Pr}(z_n = k \mid y)\).

generated quantities {
  array[N] int latent_states = hmm_latent_rng(log_omega, gamma, rho);
  matrix[3, N] hidden_probs = hmm_hidden_state_prob(log_omega, gamma, rho);
}

hmm_hidden_state_prob returns the marginal probabilities of each state, \(\text{Pr}(z_n = k \mid \phi, y)\). This function cannot be used to compute the joint probability \(\text{Pr}(z \mid \phi, y)\), because such calculation requires accounting for the posterior correlation between the different components of \(z\). Therefore, hidden_probs should not be used to obtain posterior draws. Instead, users should rely on hmm_latent_rng.

generated quantities {
   array[N] int<lower=1, upper=3> z = hmm_latent_rng(log_omega, gamma, rho);
}

The example in this section is derived from the more detailed case study by Ben Bales: https://mc-stan.org/users/documentation/case-studies/hmm-example.html.

References

Engle, Robert F. 1982. “Autoregressive Conditional Heteroscedasticity with Estimates of Variance of United Kingdom Inflation.” Econometrica 50: 987–1008.

Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15: 1593–623. http://jmlr.org/papers/v15/hoffman14a.html.

Kim, Sangjoon, Neil Shephard, and Siddhartha Chib. 1998. “Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models.” Review of Economic Studies 65: 361–93.

Footnotes

The intercept in this model is \(\alpha / (1 - \beta)\). An alternative parameterization in terms of an intercept \(\gamma\) suggested by Mark Scheuerell on GitHub is \(y_n \sim \textsf{normal}\left(\gamma + \beta \cdot (y_{n-1} - \gamma), \sigma\right)\).↩︎
In practice, it can be useful to remove the constraint to test whether a non-stationary set of coefficients provides a better fit to the data. It can also be useful to add a trend term to the model, because an unfitted trend will manifest as non-stationarity.↩︎
This subsection is a lightly edited comment of Jonathan Gilligan’s on GitHub; see https://github.com/stan-dev/stan/issues/1617#issuecomment-160249142.↩︎