## 2.3 Moving average models

A moving average model uses previous errors as predictors for future outcomes. For a moving average model of order $$Q$$, $$\mbox{MA}(Q)$$, there is an overall mean parameter $$\mu$$ and regression coefficients $$\theta_q$$ for previous error terms. With $$\epsilon_t$$ being the noise at time $$t$$, the model for outcome $$y_t$$ is defined by $y_t = \mu + \theta_1 \epsilon_{t-1} + \dotsb + \theta_Q \epsilon_{t-Q} + \epsilon_t,$ with the noise term $$\epsilon_t$$ for outcome $$y_t$$ modeled as normal, $\epsilon_t \sim \textsf{normal}(0,\sigma).$ In a proper Bayesian model, the parameters $$\mu$$, $$\theta$$, and $$\sigma$$ must all be given priors.
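To make the generative story concrete, here is a minimal NumPy sketch that simulates data from an $$\mbox{MA}(2)$$ process following the equation above; the parameter values (`mu`, `theta`, `sigma`) are hypothetical, chosen only for illustration.

```python
import numpy as np

# Simulate T observations from an MA(2) process:
#   y_t = mu + theta_1 * eps_{t-1} + theta_2 * eps_{t-2} + eps_t
# Parameter values are hypothetical, for illustration only.
rng = np.random.default_rng(0)
T = 500
mu = 3.0
theta = np.array([0.5, -0.25])
sigma = 1.0

epsilon = rng.normal(0.0, sigma, size=T)   # eps_t ~ normal(0, sigma)
y = np.empty(T)
for t in range(T):
    y[t] = mu + epsilon[t]
    for q in range(1, min(t, 2) + 1):      # lagged terms that exist at time t
        y[t] += theta[q - 1] * epsilon[t - q]
```

Note that the first two outcomes have fewer than two lagged error terms available, which is why the Stan programs below treat the initial time steps separately.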

### MA(2) example

An $$\mbox{MA}(2)$$ model can be coded in Stan as follows.

```stan
data {
  int<lower=3> T;          // number of observations
  vector[T] y;             // observation at time t
}
parameters {
  real mu;                 // mean
  real<lower=0> sigma;     // error scale
  vector[2] theta;         // lag coefficients
}
transformed parameters {
  vector[T] epsilon;       // error terms
  epsilon[1] = y[1] - mu;
  epsilon[2] = y[2] - mu - theta[1] * epsilon[1];
  for (t in 3:T) {
    epsilon[t] = ( y[t] - mu
                   - theta[1] * epsilon[t - 1]
                   - theta[2] * epsilon[t - 2] );
  }
}
model {
  mu ~ cauchy(0, 2.5);
  theta ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  for (t in 3:T) {
    y[t] ~ normal(mu
                  + theta[1] * epsilon[t - 1]
                  + theta[2] * epsilon[t - 2],
                  sigma);
  }
}
```

The error terms $$\epsilon_t$$ are defined as transformed parameters in terms of the observations and the parameters. The sampling statement defining the likelihood follows this definition, and it can only be applied to $$y_t$$ for $$t > Q$$ (here $$Q = 2$$), because earlier outcomes do not have a full set of lagged error terms. In this example, the parameters are all given Cauchy (half-Cauchy for $$\sigma$$) priors, although other priors can be used just as easily.
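To see why the transformed parameters block recovers the noise exactly, here is a NumPy sketch that simulates an $$\mbox{MA}(2)$$ series and then runs the same error recursion on it; the parameter values are hypothetical, and the point is only that given $$y$$, $$\mu$$, and $$\theta$$, the recursion reproduces the simulated $$\epsilon_t$$.

```python
import numpy as np

# Simulate MA(2) data with known (hypothetical) parameters.
rng = np.random.default_rng(1)
T = 200
mu = 1.0
theta = np.array([0.6, 0.3])
sigma = 0.5

eps_true = rng.normal(0.0, sigma, size=T)
y = np.empty(T)
y[0] = mu + eps_true[0]
y[1] = mu + theta[0] * eps_true[0] + eps_true[1]
for t in range(2, T):
    y[t] = (mu + theta[0] * eps_true[t - 1]
               + theta[1] * eps_true[t - 2] + eps_true[t])

# Mirror of the Stan transformed parameters block: recover the errors.
epsilon = np.empty(T)
epsilon[0] = y[0] - mu
epsilon[1] = y[1] - mu - theta[0] * epsilon[0]
for t in range(2, T):
    epsilon[t] = (y[t] - mu - theta[0] * epsilon[t - 1]
                            - theta[1] * epsilon[t - 2])
```

Here `epsilon` matches `eps_true` exactly; in the Stan program the same recursion runs at every leapfrog step with the current parameter draws.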

The model above could be sped up by vectorizing the sampling statement in the model block. The calculation of the $$\epsilon_t$$ could also be sped up by using a dot product in place of the explicit sum over lags.

### Vectorized MA(Q) model

A general $$\mbox{MA}(Q)$$ model with a vectorized sampling probability may be defined as follows.

```stan
data {
  int<lower=0> Q;       // num previous noise terms
  int<lower=3> T;       // num observations
  vector[T] y;          // observation at time t
}
parameters {
  real mu;              // mean
  real<lower=0> sigma;  // error scale
  vector[Q] theta;      // error coeff, lag -t
}
transformed parameters {
  vector[T] epsilon;    // error term at time t
  for (t in 1:T) {
    epsilon[t] = y[t] - mu;
    for (q in 1:min(t - 1, Q)) {
      epsilon[t] = epsilon[t] - theta[q] * epsilon[t - q];
    }
  }
}
model {
  vector[T] eta;
  mu ~ cauchy(0, 2.5);
  theta ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  for (t in 1:T) {
    eta[t] = mu;
    for (q in 1:min(t - 1, Q)) {
      eta[t] = eta[t] + theta[q] * epsilon[t - q];
    }
  }
  y ~ normal(eta, sigma);
}
```

Here all of the data are modeled, with terms for unavailable lags simply dropped from the regression, just as in the calculation of the error terms. Both models converge quickly and mix well at convergence; the vectorized model is faster per iteration, though not faster to converge, because the two programs define the same model.
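The dot-product speedup suggested above can be sketched in NumPy: the inner loop over lags in the `eta` calculation is replaced by a dot product of `theta` with the most recent errors, newest first. The values below are hypothetical, and the loop form is kept alongside purely to show the two computations agree.

```python
import numpy as np

# Hypothetical MA(Q) setup: Q lags, fixed coefficients, simulated errors.
rng = np.random.default_rng(2)
T, Q = 100, 3
mu = 0.0
theta = np.array([0.4, 0.2, 0.1])
epsilon = rng.normal(size=T)

# Loop form, mirroring the Stan model block.
eta_loop = np.full(T, mu)
for t in range(T):
    for q in range(1, min(t, Q) + 1):
        eta_loop[t] += theta[q - 1] * epsilon[t - q]

# Dot-product form: theta[:q] against the q most recent errors,
# reversed so that theta[0] multiplies epsilon[t - 1].
eta = np.full(T, mu)
for t in range(1, T):
    q = min(t, Q)
    eta[t] += np.dot(theta[:q], epsilon[t - q:t][::-1])
```

In Stan the analogous change would use the built-in `dot_product` function on the relevant slices of `theta` and `epsilon`.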