In this document, we discuss the implementation of Bayesian model-based inference for causal effects in **Stan**. The Bayesian inferential framework introduced by Rubin (1978) defines causal effects as a comparison of the potential outcomes of the same units and clearly separates the true underlying model of the potential outcomes from the treatment assignment mechanism. Then the framework takes advantage of fully Bayesian posterior predictive inference for multiply-imputing the missing potential outcomes.

We start by providing an introduction to the Bayesian inferential framework by analyzing a simulated dataset generated under unconfounded treatment assignment. Then we analyze an example dataset obtained from a completely randomized experiment focusing on the specification of the joint distribution of the potential outcomes. All of the source code for this case study is available on GitHub at joonho112/Bayesian-causal-inference/Case_study_1.

Consider a random sample of \(N\) units, indexed by \(i\in {1, ..., N}\), which consist of the individuals in a randoimzed study designed to evaluate the effect of a binary treatment \(W\) on some outcome \(Y\). Each unit has two potential outcomes: \(Y_{i}(1)\) for unit \(i\) if the unit receives treatment (\(W_{i} = 1\)), and \(Y_{i}(0)\) for unit \(i\) in the control condition (\(W_{i} = 0\)).

Causal inference is a missing data problem because \(Y_{i}(1)\) and \(Y_{i}(0)\) are never both observed. In any particular sample, only the potential outcome corresponding to the assigned treatment condition is observed and the other potential outcome is missing. Hence we can express the observed and missing potential outcomes as functions of \(Y_{i}(0)\), \(Y_{i}(1)\), and \(W_{i}\):

\[ \begin{array}{rcl} Y_{i}^{obs} &=& Y_{i}(1)W_{i} + Y_{i}(0)(1 - W_{i}) \\ Y_{i}^{mis} &=& Y_{i}(1)(1 - W_{i}) + Y_{i}(0)W_{i} \end{array} \]

In a Bayesian perspective, the missing potential outcomes \(Y_{i}^{mis}\) are viewed as unobserved random variables, which are no different than unknown model parameters (Rubin 1978). Thus, the key step of the Bayesian approach to causal inference is to calculate the conditional distribution of the full vector of missing potential outcomes given the observed data (Imbens and Rubin 2015),

\[ \mathrm{Pr}(\mathrm{Y}^{mis}|\mathrm{Y}^{obs}, \mathrm{W}). \]

This is also called the posterior predictive distribution of \(\mathrm{Y}^{mis}\). The posterior predictive distribution allows us to mutiply-impute the missing potential outcomes by simulation. Once we obtain this distribution, we can calculate the posterior distribution of any causal estimand of the form \(\tau = \tau(\mathrm{Y}^{mis}, \mathrm{Y}^{obs}, \mathrm{W})\) where \(\mathrm{Y}^{obs}\) and \(\mathrm{W}\) are known.

Obtaining the posterior predictive distribution of \(\mathrm{Y}^{mis}\) requires building a model of the joint distribution of potential outcomes and assignment. Rubin (1978) showed that the joint distribution factors into two components: (1) *the assignment mechanism* which describes the process by which some potential outcomes are observed or missing, and (2) the true underlying data called *the science*. The key insignt of Rubin (1978) is that the factorization clearly separates the true unknown state of Nature which we are trying to estimate (the science) from the man-made study indicating what we do to learn about the science (the assignment mechanism).

To illustrate these two components in completely randomized experiments, let us consider a simulated data example. We assume that the true underlying model for the potential outcomes follows a bivariate normal distribution governed by a model parameter \(\theta\):

\[ \begin{pmatrix} Y_{i}(0) \\ Y_{i}(1) \end{pmatrix} \biggm\vert \theta \sim \mathsf{Normal} \begin{pmatrix} \begin{pmatrix} \mu_{c} \\ \mu_{t} \end{pmatrix}, \begin{pmatrix} \sigma_{c}^{2} & \rho \sigma_{c} \sigma_{t} \\ \rho \sigma_{c} \sigma_{t} & \sigma_{t}^{2} \end{pmatrix} \end{pmatrix} \]

where \(\mu_{c}\) and \(\mu_{t}\) are \(\alpha\) and \(\alpha + \tau\) respectively, and thus the parameter vector is \(\theta = (\alpha, \tau, \sigma_{c}, \sigma_{t}, \rho)\). The **R** code that follows simulates a dataset conforming to the model.

```
# Basic setup
set.seed(123456)
N <- 500 # number of observations
alpha <- 1.0 # intercept in the Y model
tau <- 0.25 # treatment effect
# The assignment mechanism
N_t <- 200 # number of treated units
W <- sample(rep(c(0, 1), c(N - N_t, N_t))) # binary treatment variable
ii_t <- which(W == 1); ii_c <- which(W == 0) # index arrays for treatment variable
# The science
mu_c <- alpha + 0*tau; sd_c <- 1 # mean and SD for the control
mu_t <- alpha + 1*tau; sd_t <- 1 # mean and SD for the treated
rho <- 0.0 # correlation between the errors
cov_mat <- rbind(c(sd_c^2, rho*sd_c*sd_t),
c(rho*sd_c*sd_t, sd_t^2)) # variance-covariance matrix
science <- mvrnorm(n = N, mu = c(mu_c, mu_t), Sigma = cov_mat, empirical = TRUE)
Y0 <- science[, 1] # potential outcome if W = 1
Y1 <- science[, 2] # potential outcome if W = 0
tau_unit <- Y1 - Y0 # unit-level treatment effect
# The realization of potential outcomes by the assignment mechanism
Y_obs <- Y0 * (1 - W) + Y1 * W
Y_mis <- Y0 * W + Y1 * (1 - W)
```

The array of values of `Y0`

and `Y1`

represents the science about which we want to learn. Particular consideration should be given to the parameter \(\rho\) representing the correlation between the two potential outcomes. Among the five parameters included in \(\theta\), \(\rho\) is the only parameter about which the observed data cannot provide empirical information because \(Y_{i}(0)\) and \(Y_{i}(1)\) are never both observed (See Imbens and Rubin, 2015, pp. 165-169). Thus, model-based inference requires subject-matter (scientific) knowledge and the sensible assumptions on the joint distribution of the potential outcomes. Although the simulated example assumed no dependence between the potential outcomes, this can be modified easily by assigning any values between 0 and 1 to the `rho`

in the example code above.

Given the science that exists independently of the study design, the assignment mechanism determines what we are able to observe from the science. Suppose we observed the following data from the simulated example, where the values in parentheses are unobserved:

W | Y(0) | Y(1) | unit-level causal effect |
---|---|---|---|

1.0 | ( 2.2) | -1.1 | (-3.3) |

1.0 | ( 2.2) | 0.6 | (-1.6) |

0.0 | 1.3 | ( 0.6) | (-0.7) |

0.0 | 0.5 | ( 1.1) | ( 0.6) |

0.0 | 4.0 | ( 1.1) | (-2.9) |

0.0 | 0.0 | ( 1.8) | ( 1.8) |

In the completely randomized experiment, a fixed number of participants is assigned to receive the treatment. Accordingly, the probabilities to be treated (i.e., the propensity scores) are equal for all units and are strictly between 0 and 1, for example, \(N_{t}/N = 200/500 = 0.4\) in the simulated data. The probabilistic assignment mechanism is determined by the experimenter, say, \(\mathrm{Pr}(\mathrm{W}|\mathrm{Y}(0), \mathrm{Y}(1)) = \mathrm{Pr}(\mathrm{W})\), which, by design, makes the assignment *unconfounded* with the potential outcomes. In the unconfounded assignment mechanism, the assignment of treatment conditions for all units is independent of all potential outcomes, observed or unobserved. We see that `W`

in the simulated data was randomly drawn depending solely on the number of treated and control units.

The posterior predictive distribution of missing potential outcomes is given by

\[ \begin{array}{rcl} \mathrm{Pr}(\mathrm{Y}^{mis}|\mathrm{Y}^{obs}, \mathrm{W}) &=& \displaystyle\int{\mathrm{Pr}(\mathrm{Y}^{mis}, \theta | \mathrm{Y}^{obs}, \mathrm{W})d\theta} \\ &=& \displaystyle\int{\mathrm{Pr}(\mathrm{Y}^{mis} | \mathrm{Y}^{obs}, \mathrm{W}, \theta) \cdot \mathrm{Pr}(\theta | \mathrm{Y}^{obs}, \mathrm{W})d\theta}. \end{array} \]

The joint posterior distribution of the missing data and the parameter factors into the two terms in the second integral. The first term is the sampling distribution for the replicated missing potential outcomes given parameters, treatment assignment, and observed potential coutcomes, which encapsulates the uncertainty in the imputation. The second term represents the posterior distribution of the model parameters \(\theta\). This term captures the uncertainty due to parameter estimation given the observations. The integral incorporates the two forms of uncertainty into the model by taking a weighted average of the sampling distribution with weights given by the posterior of \(\theta\).

The first term is inherently a function of the two underlying primitives, the assignment mechanism and the science (See Ding and Li (2017) pp 18-19). Given unconfounded treatment assignment, however, the assignment mechanism is *ignorable* in the imputation process. This means that causal inference under unconfoundedness requires only a model of the science but depends crucially on the joint distribution of \(Y_{i}(0)\) and \(Y_{i}(1)\). The model-based approach for completely randomized experiments thus starts from modeling the science to ultimately derive the posterior predictive distribution of the \(Y_{i}^{mis}\).

The **Stan** program to obtain the posterior predictive distribution of the missing potential outcomes from the simulated data is provided and explained in section 2.3.2. We illustrate the following program based on how to code the two forms of uncertainty in **Stan**.

```
data {
int<lower=0> N; // sample size
vector[N] y; // observed outcome
vector[N] w; // treatment assigned
real<lower=-1,upper=1> rho; // assumed correlation between the potential outcomes
}
parameters {
real alpha; // intercept
real tau; // super-population average treatment effect
real<lower=0> sigma_c; // residual SD for the control
real<lower=0> sigma_t; // residual SD for the treated
}
model {
// PRIORS
alpha ~ normal(0, 5);
tau ~ normal(0, 5);
sigma_c ~ normal(0, 5);
sigma_t ~ normal(0, 5);
// LIKELIHOOD
y ~ normal(alpha + tau*w, sigma_t*w + sigma_c*(1 - w));
}
generated quantities{
real tau_fs; // finite-sample average treatment effect
real y0[N]; // potential outcome if W = 0
real y1[N]; // potential outcome if W = 1
real tau_unit[N]; // unit-level treatment effect
for(n in 1:N){
real mu_c = alpha;
real mu_t = alpha + tau;
if(w[n] == 1){
y0[n] = normal_rng(mu_c + rho*(sigma_c/sigma_t)*(y[n] - mu_t), sigma_c*sqrt(1 - rho^2));
y1[n] = y[n];
}else{
y0[n] = y[n];
y1[n] = normal_rng(mu_t + rho*(sigma_t/sigma_c)*(y[n] - mu_c), sigma_t*sqrt(1 - rho^2));
}
tau_unit[n] = y1[n] - y0[n];
}
tau_fs = mean(tau_unit);
}
```

The model specified in the **Stan** program is fit to the simulated data:

```
# Collect data into a list format suitable for Stan
stan_data <- list(N = N, y = Y_obs, w = W, rho = 0.0)
# Compile and run the stan model
fit_simdat <- stan(file = "simulated_example.stan",
data = stan_data,
iter = 1000, chains = 4)
```

Note that the `rho`

is assumed to be a fixed value rather than a parameter in the model. Because the parameters governing the association between the potential outcomes cannot be empirically estimated using the data, they need to be given *a priori*. Here we first assumed a correlation coefficient equal to zero when simulating the data.

To obtain the posterior distribution of the model parameters governing the science, \(\mathrm{Pr}(\theta | \mathrm{Y}^{obs}, \mathrm{W})\), we combine the prior distribution with the likelihood function. The first part of the `model`

block defines the sampling statements of the priors on \(\alpha\), \(\tau\), \(\sigma_{c}\), and \(\sigma_{t}\). Considering the fact that the simulated outcome follows a standard normal distribution, we impose weakly informative normal priors with mean 0 and standard deviation 5 on the parameters for \(\mu_{c}\) and \(\mu_{t}\). For the scale parameters \(\sigma_{c}\) and \(\sigma_{t}\), half-cauchy priors with scale set to 5 are used.

The next sampling statement in the `model`

block indicates the distribution of the observed potential outcome, which specifies the likelihood of \(Y_{i}^{obs}\). Conditional on the assignment vector \(\mathrm{W}\) and parameters, the distribution of \(Y_{i}^{obs}\) is given by

\[ \mathrm{Pr}(Y_{i}^{obs} \vert \mathrm{W}, \theta) \sim \mathsf{Normal}(W_{i} \cdot \mu_{t} + (1 - W_{i}) \cdot \mu_{c} , W_{i} \sigma_{t}^{2} + (1 - W_{i}) \cdot \sigma_{c}^{2}) \]

where \(\mu_{c}\) and \(\mu_{t}\) are \(\alpha\) and \(\alpha + \tau\) respectively. Because we can observe only one potential outcome per person, the likelihood function \(\mathrm{Pr}(\mathrm{Y}^{obs}, \mathrm{W} \vert \theta)\) remains the same irrespective of the value of \(\rho\). The sampling statement of \(\mathrm{Y}^{obs}\) is coded as:

` y ~ normal(alpha + tau*w, sigma_t*w + sigma_c*(1 - w));`

The posterior means and standard deviations of the parameters, \(\alpha\), \(\tau\), \(\sigma_{c}\), and \(\sigma_{t}\), can be displayed as follows:

```
print(fit_simdat, pars = c("alpha", "tau", "sigma_c", "sigma_t"),
probs = c(0.1, 0.5, 0.9), digits = 3)
```

```
## Inference for Stan model: simulated_example.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 10% 50% 90% n_eff Rhat
## alpha 1.012 0.001 0.055 0.942 1.013 1.084 1466 0.999
## tau 0.248 0.002 0.091 0.132 0.250 0.367 1596 0.999
## sigma_c 0.984 0.001 0.040 0.934 0.983 1.035 1987 1.000
## sigma_t 1.021 0.001 0.050 0.958 1.017 1.088 2000 1.000
##
## Samples were drawn using NUTS(diag_e) at Thu Sep 13 10:47:34 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
```

Before interpreting the results, it is necessary to check that the chains have converged for each parameter. The \(\hat{R}\) statistics shown in the rightmost column of the model summary are all less than 1.1. Also the effective sample size estimates are sufficient for inference. Thus it seems that **Stan** has produced an adequate approximation of the posterior.

```
posterior <- as.array(fit_simdat)
mcmc_areas(posterior, pars = c("alpha", "tau", "sigma_c", "sigma_t"),
prob = 0.8, prob_outer = 0.99, point_est = "mean" ) +
vline_at(c(0.25, 1.00), linetype = "dashed", color = "red", size = 1) +
ggplot2::labs(title = "Posterior distributions of the parameters",
subtitle = "with means and 80% intervals")
```

`## Warning: package 'bindrcpp' was built under R version 3.4.4`

The plot shows point estimates and 99% credible intervals for parameters along with kernel density curves of the posteriors. The uncertainty intervals shown as shaded areas under the estimated posterior density curves represents 80% credible intervals. We observe that the 80% credible intervals of `alpha`

, `tau`

, `sigma_c`

, and `sigma_t`

include the true parameters (\(\tau = 0.25\), \(\alpha, \sigma_{c}, \sigma_{t} = 1.0\)), which indicates that the **Stan** model successfully recovers the generating values of parameters.

With the posterior distributions of the parameters in hand, we then derive the sampling distribution for the replicated missing potential outcomes given observed outcomes, treatment assignment, and the parameters, \(\mathrm{Pr}(\mathrm{Y}^{mis}|\mathrm{Y}^{obs}, \mathrm{W}, \theta)\). As explained above, imputing the missing potential outcomes \(\mathrm{Y}^{mis}\) requires only a model for the science because of unconfoundedness, that is, the joint distribution of \(Y_{i}(0)\) and \(Y_{i}(1)\). Assuming the random variables are i.i.d. conditional on \(\theta\), the posterior distribution factors into \(N\) terms. Hence for the treated units we can impute the missing control potential outcomes from the conditional distribution of \(Y_{i}(0)\) given \(Y_{i}(1)\) and \(\theta\) (Ding and Li 2017). For control units, we impute the missing treatment potential outcomes from \(\mathrm{Pr}(Y_{i}(1)| Y_{i}(0), \theta)\). Because we assume that the joint distribution of the potential outcomes is bivariate normal, we obtain the conditional distribution of one potential outcome given the other as

\[ \mathrm{Pr}(Y_{i}(1) \vert Y_{i}(0), \theta, W_{i} = 0) \sim \mathsf{Normal} \left(\mu_{t} + \rho \cdot \frac{\sigma_{t}}{\sigma_{c}} \cdot (Y_{i}(0) - \mu_{c}), \sigma_{t}^{2} (1-\rho^{2}) \right), \]

and

\[ \mathrm{Pr}(Y_{i}(0) \vert Y_{i}(1), \theta, W_{i} = 1) \sim \mathsf{Normal} \left(\mu_{c} + \rho \cdot \frac{\sigma_{c}}{\sigma_{t}} \cdot (Y_{i}(1) - \mu_{t}), \sigma_{c}^{2} (1-\rho^{2}) \right). \]

where \(\mu_{c}\) and \(\mu_{t}\) are \(\alpha\) and \(\alpha + \tau\) respectively.

The following **Stan** code in the `generated quantities`

block implements the imputations:

```
for(n in 1:N){
real mu_c = alpha;
real mu_t = alpha + tau;
if(w[n] == 1){
y0[n] = normal_rng(mu_c + rho*(sigma_c/sigma_t)*(y[n] - mu_t), sigma_c*sqrt(1 - rho^2));
y1[n] = y[n];
}else{
y0[n] = y[n];
y1[n] = normal_rng(mu_t + rho*(sigma_t/sigma_c)*(y[n] - mu_c), sigma_t*sqrt(1 - rho^2));
}
tau_unit[n] = y1[n] - y0[n];
}
```

Note that assuming a correlation coefficient between \(Y_{i}(1)\) and \(Y_{i}(0)\) equal to 0 (`rho`

= 0) the **Stan** code for the conditional distribution of one potential outcome given the other can be simplified to `y0[n] = normal_rng(mu_c, sigma_c)`

for \(Y_{i}(0)\) and `y1[n] = normal_rng(mu_t, sigma_t)`

for \(Y_{i}(1)\).

In this code, the two forms of uncertainties, \(\mathrm{Pr}(\mathrm{Y}^{mis} | \mathrm{Y}^{obs}, \mathrm{W}, \theta)\) and \(\mathrm{Pr}(\theta | \mathrm{Y}^{obs}, \mathrm{W})\), are combined to generate the posterior predictive distribution of \(\mathrm{Y}^{mis}\), \(\mathrm{Pr}(\mathrm{Y}^{mis} | \mathrm{Y}^{obs}, \mathrm{W})\). Each of `alpha`

, `tau`

, `sigma_c`

, and `sigma_t`

represents a draw of the parameter from the posterior distribution \(\mathrm{Pr}(\theta| \mathrm{Y}^{obs}, \mathrm{W})\). The uncertainty of parameter estimation is thus rolled into the specified parameter values. On the other hand, the uncertainty due to the imputation is reflected by sampling random numbers from each conditional distribution of the potential outcomes using the pseudorandom number generator, `normal_rng`

.

The imputation procedure defined in the `generated quantities`

block gives us the posterior predictive distribution of each missing potential outcome. Our model-based approach aims to impute the missing potential outcomes which are in parentheses in the true science table shown in section 2.2. We need to impute \(Y(1)\) for the first two units and impute \(Y(0)\)s for the remaining four units.

The posterior predictive distribution of the six missing potential outcomes can be displayed as follows:

```
# Collect MCMC samples for the missing potential outcomes
mcmc <- as.matrix(fit_simdat)
ymis_idx <- paste0(ifelse(W[seq(6)] == 0, "y1", "y0"), "[", seq(6), "]")
subset <- which(colnames(mcmc) %in% ymis_idx)
mcmc_ymis <- data.frame(mcmc[, subset])
names(mcmc_ymis) <- colnames(mcmc)[subset]
mcmc_ymis <- dplyr::select(mcmc_ymis, c("y0[1]", "y0[2]"), everything())
mcmc_ymis_long <- mcmc_ymis %>% gather(key, value, seq(6), factor_key = TRUE)
imputed_ymis <- data.frame(key = levels(mcmc_ymis_long$key), apply(mcmc_ymis, 2, mean))
true_ymis <- data.frame(key = levels(mcmc_ymis_long$key), c(Y_mis[1:6]))
names(imputed_ymis) <- c("key", "imputed_ymean"); names(true_ymis) <- c("key", "ymis_true")
# Plot the posterior predictive distributions with the true Y^mis
ggplot(mcmc_ymis_long, aes(x = value)) +
geom_density(alpha = 0.1) +
geom_vline(data = imputed_ymis, aes(xintercept = imputed_ymean),
linetype = "solid", color = "blue", size = 0.5) +
geom_vline(data = true_ymis, aes(xintercept = ymis_true),
linetype = "dashed", color = "red", size = 0.5) +
geom_text(aes(x = imputed_ymean, y = 0.15, label = sprintf("%.3f", round(imputed_ymean, 3))),
hjust = 0.5, vjust = -1.0, color = "blue", data = imputed_ymis) +
geom_text(aes(x = ymis_true, y = 0, label = sprintf("%.3f", round(ymis_true, 3))),
hjust = 0.5, vjust = -1.0, color = "red", data = true_ymis) +
facet_wrap( ~ key, ncol = 2) +
labs(title = "Posterior predictive distributions of the missing potential outcomes",
subtitle = "with posterior means (blue solid) and true values (red dashed)", x = "")
```

The plot shows that the true values of the missing potential outcomes fall in approximately at least 80% credible intervals of the posterior predictive distributions.

Once we obtain the posterior predictive distributions for \(\mathrm{Y}^{mis}\), we can easily define the posterior distributions of any causal estimand based on averages, variances, quantiles, ratios, or intermediate outcomes. We focus primarily on the average treatment effect (ATE):

\[ \tau_{\mathrm{fs}} = \tau(\mathrm{Y}(0), \mathrm{Y}(1)) = \frac{1}{N} \cdot \sum_{i = 1}^{N}{(Y_{i}(1) - Y_{i}(0))} = \bar{Y}(1) - \bar{Y}(0). \]

We use the subscript \(\mathrm{fs}\) to denote the *finite-sample* average treatment effect because the \(N\) units in the dataset are viewed as the population of interest itself. The estimand is based on comparisons between the two potential outcomes, which gives the unit-level causal effects \(Y_{i}(1) - Y_{i}(0)\). The average treatment effect is obtained by simply taking the average of the unit-level causal effects over the finite sample of \(N\) units. The **Stan** code in the `generated quantities`

block defines the unit-level causal effects as `tau_unit[n] = y1[n] - y0[n];`

and the finite-sample average treatment effect as `tau_fs = mean(tau_unit);`

.

If we instead view the \(N\) observed units as a random sample from an infinite super-population, the random sampling induces an additional source of uncertainty. Even if all the missing potential outcomes could be imputed with certainty, it would still be uncertain whether the estimated \(\tau_{\mathrm{fs}}\) is the same as the average treatment effect in the super-population from which the sample was drawn. Hence the *super-population* average treatment effect, denoted as \(\tau_{\mathrm{sp}}\), can be defined by taking the expectation over the distribution of the \(\tau_{\mathrm{fs}}\) generated by random sampling from the super-population. Given the bivariate normal model for the simulated data governed by parameter \(\theta\), \(\tau_{\mathrm{sp}}\) is given by

\[ \tau_{\mathrm{sp}} = \mathbb{E}_{\mathrm{sp}} \left[ \tau_{\mathrm{fs}}\right] = \mathbb{E}_{\mathrm{sp}} \left[ \bar{Y}(1) - \bar{Y}(0) \right] = \mu_{t} - \mu_{c} = (\alpha + \tau) - \alpha = \tau. \]

Since \(\tau_{\mathrm{sp}}\) is solely a function of the model parameters \(\theta\), we can estimate \(\tau_{\mathrm{sp}}\) by simply obtaining the posterior distribution of \(\tau\). In the **Stan** code the `tau`

in the `model`

block represents the super-population average treatment effect. We observe from the following results that the \(\tau_{\mathrm{sp}}\) estimate is less precise than the \(\tau_{\mathrm{fs}}\) estimate assuming independence between the potential outcomes due to the additional source of uncertainty induced by random sampling from the super-population:

```
print(fit_simdat, pars = c("tau_fs", "tau"),
probs = c(0.1, 0.5, 0.9), digits = 3)
```

```
## Inference for Stan model: simulated_example.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 10% 50% 90% n_eff Rhat
## tau_fs 0.246 0.002 0.066 0.159 0.247 0.331 1847 1.000
## tau 0.248 0.002 0.091 0.132 0.250 0.367 1596 0.999
##
## Samples were drawn using NUTS(diag_e) at Thu Sep 13 10:47:34 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
```

```
color_scheme_set("green")
mcmc_areas(posterior, pars = c("tau_fs", "tau"),
prob = 0.8, prob_outer = 0.99, point_est = "mean" ) +
vline_at(c(0.25), linetype = "dashed", color = "red", size = 1) +
ggplot2::labs(title = "Finite-sample ATE (rho = 0) vs. Super-population ATE",
subtitle = "with means, 80% intervals, and the true value (red dashed)")
```

For more detailed discussion on Bayesian inference for super-population and finite-population estimands, see the section 8.4 of Gelman et al. (2013).

Because observed data generally do not include empirical information about the dependence between potential outcomes, the key difficulty of the model-based inference lies in how the dependence should be modeled. Researchers often vary the specification of the model of the joint distribution of \(Y_{i}(1)\) and \(Y_{i}(0)\) and look at the sensitivity of conclusions to the choice for the model. Focusing on this aspect of the model-based approach, we analyze an example data obtained from a completely randomized experiment in this section. We specifically aim to replicate the analysis performed in chapter 8 of Imbens and Rubin (2015) for the illustration.

We will be analyzing the data come from a randomized evaluation of the National Supported Work (NSW) project. The project was a transitional and subsidized job training program for people with longstanding employment problems. The data has been widely analyzed in the study on program evaluation particularly in econometrics, including Heckman and Hotz (1989), Dehejia and Wahba (1999), and Smith and Todd (2001). We analyze a subset of the data used by Dehejia and Wahba (1999), which is the same data set illustrated in chapter 8 of Imbens and Rubin (2015). The dataset is available in the **Matching** R package, and consists of the following 12 variables:

`age`

: age in years`educ`

: years of schooling`nodegr`

: indicator variable for being a high school dropout`black`

: indicator variable for being African American`hisp`

: indicator variable for being Hispanic/Latino`married`

: indicator variable for being now or ever before married`re74`

: pre-training earnings in 1974`u74`

: an indicator for earnings in 1974 being zero`re75`

: pre-training earnings in 1975`u75`

: an indicator for earnings in 1975 being zero`re78`

: post-program labor market earnings in 1978`treat`

: indicator variable for treatment status

```
# Use example dataset form Matching package
data(lalonde, package = "Matching")
# Change the scales of all earnings variables in thousands
lalonde$re74 <- lalonde$re74/1000
lalonde$re75 <- lalonde$re75/1000
lalonde$re78 <- lalonde$re78/1000
# Prepare outcome and treatment
y <- lalonde$re78
w <- lalonde$treat
# Display summary statistics
sumstats <- lalonde %>%
summarise_all(funs(mean, sd, min, max)) %>%
gather(key, value, everything()) %>%
separate(key, into = c("variable", "stat"), sep = "_") %>%
spread(stat, value) %>%
dplyr::select(variable, mean, sd, min, max) %>%
mutate_each(funs(round(., 1)), -variable)
knitr::kable(sumstats, caption = "Summary statistics for the NSW data")
```

variable | mean | sd | min | max |
---|---|---|---|---|

age | 25.4 | 7.1 | 17 | 55.0 |

black | 0.8 | 0.4 | 0 | 1.0 |

educ | 10.2 | 1.8 | 3 | 16.0 |

hisp | 0.1 | 0.3 | 0 | 1.0 |

married | 0.2 | 0.4 | 0 | 1.0 |

nodegr | 0.8 | 0.4 | 0 | 1.0 |

re74 | 2.1 | 5.4 | 0 | 39.6 |

re75 | 1.4 | 3.2 | 0 | 25.1 |

re78 | 5.3 | 6.6 | 0 | 60.3 |

treat | 0.4 | 0.5 | 0 | 1.0 |

u74 | 0.7 | 0.4 | 0 | 1.0 |

u75 | 0.6 | 0.5 | 0 | 1.0 |

Among the sample of \(N = 445\), 42% of participants were randomly assigned to the job training program (\(N_{t} = 185\)). To replicate Imbens and Rubin (2015)’s analyses, all earning variables, `re74`

, `re75`

, and `re78`

, were converted to thousands by deviding them by 1,000. Alternatively, one may want to standardize continuous covariates `age`

and log-transform all earnings variables to improve MCMC sampling.

Let us first consider the same specification applied to the simulated data above. We assumed a joint normal distribution with (1) independent potential outcomes (\(\rho = 0\)), (2) no covariates, and with (3) different variances in the treatment and control groups (\(\sigma_{t}\) and \(\sigma_{c}\)):

\[ \begin{pmatrix} Y_{i}(0) \\ Y_{i}(1) \end{pmatrix} \biggm\vert \theta \sim \mathsf{Normal} \begin{pmatrix} \begin{pmatrix} \mu_{c} \\ \mu_{t} \end{pmatrix}, \begin{pmatrix} \sigma_{c}^{2} & 0 \\ 0 & \sigma_{t}^{2} \end{pmatrix} \end{pmatrix} \]

where \(\mu_{c}\) and \(\mu_{t}\) are \(\alpha\) and \(\alpha + \tau\) respectively. The following is the **Stan** program **Model_01_Assuming_Independent_Potential_Outcomes.stan** for the model 1:

```
functions {
real quantile(vector x, real p){
int n; // length of vector x
real index; // integer index of p
int lo; // lower integer cap of the index
int hi; // higher integer cap of the index
real h; // generated weight between the lo and hi
real qs; // weighted average of x[lo] and x[hi]
n = num_elements(x);
index = 1 + (n - 1)*p;
lo = 1;
while ((lo + 1) < index)
lo = lo + 1;
hi = lo + 1;
h = index - lo;
qs = (1 - h)*sort_asc(x)[lo] + h*sort_asc(x)[hi];
return qs;
}
}
data {
int<lower=0> N; // sample size
vector[N] y; // observed outcome
vector[N] w; // treatment assigned
}
parameters {
real alpha; // intercept
real tau; // super-population average treatment effect
real<lower=0> sigma_c; // residual SD for the control
real<lower=0> sigma_t; // residual SD for the treated
}
model {
// PRIORS
alpha ~ normal(0, 100);
tau ~ normal(0, 100);
sigma_c ~ normal(0, 100);
sigma_t ~ normal(0, 100);
// LIKELIHOOD
y ~ normal(alpha + tau*w, sigma_t*w + sigma_c*(1 - w));
}
generated quantities{
real tau_fs; // finite-sample average treatment effect
real tau_qte25; // quantile treatment effect at p = 0.25
real tau_qte50; // quantile treatment effect at p = 0.50
real tau_qte75; // quantile treatment effect at p = 0.75
real y0[N]; // potential outcome if W = 0
real y1[N]; // potential outcome if W = 1
real tau_unit[N]; // unit-level treatment effect
for(n in 1:N){
real mu_c = alpha;
real mu_t = alpha + tau;
if(w[n] == 1){
y0[n] = normal_rng(mu_c, sigma_c);
y1[n] = y[n];
}else{
y0[n] = y[n];
y1[n] = normal_rng(mu_t, sigma_t);
}
tau_unit[n] = y1[n] - y0[n];
}
tau_fs = mean(tau_unit);
tau_qte25 = quantile(to_vector(y1), 0.25) - quantile(to_vector(y0), 0.25);
tau_qte50 = quantile(to_vector(y1), 0.50) - quantile(to_vector(y0), 0.50);
tau_qte75 = quantile(to_vector(y1), 0.75) - quantile(to_vector(y0), 0.75);
}
```

The program is basically the same as the one for the simulated data, but has two notable differences. First, we define three additional finite-sample causal estimands based on quantiles in addition to the average treatment effect (`tau_fs`

), that is, quantile treatment effects (QTE) for the 0.25, 0.50, and 0.75 quantiles (`tau_qte25`

, `tau_qte50`

, `tau_qte75`

). The QTEs are defined as the differences in the \(p\)th quantiles of the potential outcomes, \(Q_{p}(Y_{i}(1)) - Q_{p}(Y_{i}(0))\), not the \(p\)th quantiles of the differences in the potential outcomes, \(Q_{p}(Y_{i}(1) - Y_{i}(0))\). The posterior distribution of the QTEs can be easily derived given the imputed missing potential outcomes in the same manner as for the finite-sample average treatement effect. Unfortunately, a built-in function to calculate the sample quantile values for data and generated quantities is not yet available in **Stan**. Thus we need to add a `functions`

block for a user-defined function equivalent to `quantile()`

in **R**. In the `functions`

block, all sample quantiles were defined as a weighted average of consecutive order statistics. See Hyndman and Fan (1996) for further details.

Second, the scales of the prior distributions for the two mean parameters are now set to 100 because the standard deviation of 100 is large relative to the scale of the earning variables measured in thousands of dollars which range from 0 to 60.31. We use unscaled normal priors `normal(0, 100)`

for the `alpha`

and `tau`

for pedagogical purposes - we want to replicate the estimates in Table 8.6 of Imbens and Rubin (2015). In practice, it is better idea to put the `alpha`

and `tau`

on a unit scale by dividing them by the scale such as the median absolute deviation (MAD) and to use priors stronger than `normal(0, 100)`

for the two mean parameters. Then it is generally easy to define weakly informative priors and to make computational algorithm much better conditioned. See a Wiki on prior choice recommendations by the Stan Development Team.

The prior distributions for \(\sigma_{c}\) and \(\sigma_{t}\) are half-normal distributions with the scale parameters are set to 100 and with a `<lower=0>`

constraint in the declaration of the parameter. To replicate the Imbens and Rubin (2015) analyses, one may want to use inverse gamma distributions with the shape parameters \(\alpha\) set to 1 and with the scale parameters \(\beta\) set to 0.01. The estimates are almost the same in both prior specifications with a single-level model with enough data. Note, however, that the purportedly noninformative inverse-gamma prior can affect inferences in other settings such as hierarchical models, particularly where the number of groups is small and the group-level variance is close to zero. See Gelman (2006) for more details.

The **Stan** program assumming independence between the potential outcomes is then run on the example data:

```
# Collect data into a list format suitable for Stan
stan_data <- list(N = nrow(lalonde), y = y, w = w)
# Compile and run the stan model
fit_mod1 <- stan(file = "Model_01_Assuming_Independent_Potential_Outcomes.stan",
data = stan_data,
iter = 1000, chains = 4)
```

We view summaries of the parameter posteriors for each causal estimand. As discussed above, convergence of the chains is assessed for every parameter using \(\hat{R}\). All \(\hat{R}\) values from the model summaries are less than 1.1, which indicates that the chains have converged in the fitted model.

```
param <- c("alpha", "tau", "sigma_c", "sigma_t", "tau_fs", "tau_qte25", "tau_qte50", "tau_qte75" )
print(fit_mod1, pars = param,
probs = c(0.1, 0.5, 0.9), digits = 3)
```

```
## Inference for Stan model: Model_01_Assuming_Independent_Potential_Outcomes.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 10% 50% 90% n_eff Rhat
## alpha 4.562 0.008 0.337 4.135 4.567 4.999 1700 0.999
## tau 1.794 0.017 0.691 0.936 1.782 2.682 1695 0.999
## sigma_c 5.494 0.006 0.237 5.205 5.488 5.802 1764 1.000
## sigma_t 7.876 0.010 0.405 7.375 7.854 8.398 1795 0.998
## tau_fs 1.800 0.012 0.502 1.160 1.788 2.432 1902 0.999
## tau_qte25 0.637 0.008 0.361 0.017 0.647 1.084 2000 1.000
## tau_qte50 1.637 0.013 0.571 0.922 1.605 2.415 1901 0.999
## tau_qte75 3.093 0.015 0.640 2.262 3.103 3.865 1877 1.000
##
## Samples were drawn using NUTS(diag_e) at Mon Jul 30 17:40:49 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
```

The posterior means and standard deviations for treatment effects above are the same as the estimates in the second line of Table 8.6 of Imbens and Rubin (2015) with small differences.

```
posterior <- as.array(fit_mod1)
color_scheme_set("purple")
mcmc_areas(posterior, pars = c("tau", "tau_fs", "tau_qte25", "tau_qte50", "tau_qte75"),
prob = 0.8, prob_outer = 0.99, point_est = "mean" ) +
vline_at(c(1.78), linetype = "dashed", color = "red", size = 1) +
ggplot2::labs(title = "Model 1: Posterior distributions of the causal estimands",
subtitle = "with means and 80% intervals")
```

Again we observe that the posterior standard deviation of the finite-sample ATE (`tau_fs`

), 0.48, is substantially lower than that of the super-population ATE (`tau`

), 0.66. Cosidering the high proportion of zeros in the outcome `re78`

(about 31%), it is not surprising that the posterior distribution of `tau_qte25`

seems to be non-normal. For the readers who attempt to model the zero-inflated distribution of the potential outcomes, the **Stan** program for the two-part model is available on the GitHub repository.

The three QTE estimates shed some light on the treatment effect heterogeneity across the outcome distribution. From the figure above we observe that the effect of the job-training program becomes stronger for the higher conditional quantiles, indicating that the program is most benefical for the participants at the high end of the earning distribution (75%). However, these results for the QTE estimands might be sensitive to choices for the prior distribution of the dependence structure between the two potential outcomes (Imbens and Rubin 2015).

One way to investigate the sensitivity of conclusions to the choice for the dependence between the potential outcomes is to assume the most conservative case and exploits the estimate from the case as a bound. In a mode-based inference, the most conservative case is often the situation assuming the treatment effect is constant for all units, \(Y_{i}(1) - Y_{i}(0) = \tau\). The constant treatment effect assumption implies that the potential outcomes are perfectly correlated (\(\rho = 1.0\)) and the variances of the two potential outcomes are equal (\(\sigma_{t}^{2} = \sigma_{c}^{2} = \sigma^{2}\)). Let us consider the following alternative conservative specification of the joint normal distribution of the potential outcomes, which are again free from dependence on the covariates:

\[ \begin{pmatrix} Y_{i}(0) \\ Y_{i}(1) \end{pmatrix} \biggm\vert \theta \sim \mathsf{Normal} \begin{pmatrix} \begin{pmatrix} \mu_{c} \\ \mu_{t} \end{pmatrix}, \begin{pmatrix} \sigma^{2} & \sigma^{2} \\ \sigma^{2} & \sigma^{2} \end{pmatrix} \end{pmatrix} \]

where \(\mu_{c}\) and \(\mu_{t}\) are \(\alpha\) and \(\alpha + \tau\) respectively.

The **Stan** program **Model_02_Assuming_Constant_Treatment_Effects.stan** for the model 2 is the same as the program for the model 1, with only a few differences. First, due to the assumption of equal variances (`sigma_t`

= `sigma_c`

), the sampling statment of \(\mathrm{Y}^{obs}\) is now coded as `y ~ normal(alpha + tau*w, sigma);`

with the pooled residual standard deviation `sigma`

.

Second, assuming a correlation coefficient between \(Y_{i}(1)\) and \(Y_{i}(0)\) equal to 1, the conditional distribution of one potential outcome given the other now has zero variance (\(\sigma_{t}^{2}(1-\rho^{2}) = \sigma_{c}^{2}(1-\rho^{2}) = 0\)). Hence the **Stan** code for the model-based imputation becomes `y0[n] = mu_c + (y[n] - mu_t);`

for \(Y_{i}(0)\) and `y1[n] = mu_t + (y[n] - mu_c);`

for \(Y_{i}(1)\).

Note that we no longer draw samples using the pseudorandom number generator `normal_rng`

. This means that the the uncertainty in the imputation vanishes and the only source of uncertainty is due to parameter estimation. Under the constant treatment effects assumption, we are able to know the exact value of \(Y_{i}^{mis}\) given \(Y_{i}^{obs}\) and the parameter estimates of \(\mu_{c}\) and \(\mu_{t}\).

The **Stan** program for the Model 2 is fit to the example data:

```
# Compile and run the stan model
fit_mod2 <- stan(file = "Model_02_Assuming_Constant_Treatment_Effects.stan",
data = stan_data,
iter = 1000, chains = 4)
```

To investigate the variance of the unit-level treatment effect in the Model 1 and 2, the first 20 example unit-level effects for the two models are presented side by side:

```
# Model 1. assuming independence between Y(1) and Y(0)
color_scheme_set("brightblue")
tau_unit_idx <- paste0("tau_unit", "[", seq(20), "]")
mod1_posterior <- as.array(fit_mod1)
p1 <- mcmc_intervals(mod1_posterior, pars = tau_unit_idx,
prob = 0.8, prob_outer = 0.99, point_est = "mean" ) +
vline_at(c(0), linetype = "dashed", size = 0.5) +
ggplot2::labs(title = "Model 1. Assuming Independent Potential Outcomes")
# Model 2. assuming constant treatment effect
mod2_posterior <- as.array(fit_mod2)
p2 <- mcmc_intervals(mod2_posterior, pars = tau_unit_idx,
prob = 0.8, prob_outer = 0.99, point_est = "mean" ) +
vline_at(c(0), linetype = "dashed", size = 0.5) +
scale_x_continuous(limits = c(-20, 40)) +
ggplot2::labs(title = "Model 2. Assuming Constant Treatment Effects")
grid.arrange(p1, p2, ncol = 2)
```

We observe that the unit-level treatment effects from the Model 2 are constant over all units and have minimal uncertainty intervals compared to the estimates from the Model 1. The Model 1 estimates show substantial variation among the posterior means with far larger uncertainty intervals. This plot clearly shows that the uncertainty in the imputation constitutes the main source of uncertainty when multiply-imputing the missing potential outcomes in model-based inference.

The conservative estimates of the finite-sample causal estimands are displayed in tabular form as follows:

```
param <- c("alpha", "tau", "sigma", "tau_fs", "tau_qte25", "tau_qte50", "tau_qte75" )
print(fit_mod2, pars = param,
probs = c(0.1, 0.5, 0.9), digits = 3)
```

```
## Inference for Stan model: Model_02_Assuming_Constant_Treatment_Effects.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 10% 50% 90% n_eff Rhat
## alpha 4.565 0.012 0.417 4.032 4.568 5.086 1182 1.001
## tau 1.776 0.019 0.636 0.980 1.778 2.612 1164 1.002
## sigma 6.581 0.006 0.212 6.316 6.577 6.857 1398 1.001
## tau_fs 1.776 0.019 0.636 0.980 1.778 2.612 1164 1.002
## tau_qte25 1.776 0.019 0.636 0.980 1.778 2.612 1164 1.002
## tau_qte50 1.776 0.019 0.636 0.980 1.778 2.612 1164 1.002
## tau_qte75 1.776 0.019 0.636 0.980 1.778 2.612 1164 1.002
##
## Samples were drawn using NUTS(diag_e) at Mon Jul 30 17:41:49 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
```

Since the unit-level treatment effects are assumed to have no heterogeneity in the Model 2, the QTE estimates are all identical to that for the ATE. Note also that the posterior means and standard deviations of the finite-sample ATE (`tau_fs`

) is equal to those of the super-population ATE (`tau`

). The worst-case scenario assumption of constant treatment effects not only gives a *conservative* estimate of the posterior variance for the finite-sample estimands but also provides an *unbiased* estimate of the super-population estimands (Imbens and Rubin 2015). See the discussion in Chapter 6 of Imbens and Rubin (2015) for more details.

In the previous models without covariates, the posterior predictive distributions of the missing control potential outcomes had similar means centered around \(\alpha\) for all the treated units. For control units, the missing treatment potential outcomes were imputed from the poterior predicteve distributions with the expected mean of \(\alpha + \tau\) for all the control units. Adding the pre-treatment covariates improves the imputation process because the covariates provide information to help predict the missing potential outcomes. Now let us extend the previous model assuming independent potential outcomes (Model 1) to allow for pre-treatment covariates \(\mathrm{X}\):

\[ \begin{pmatrix} Y_{i}(0) \\ Y_{i}(1) \end{pmatrix} \biggm\vert X_{i}, \theta \sim \mathsf{Normal} \begin{pmatrix} \begin{pmatrix} \alpha + \mathrm{X}_{i} \beta_{c} \\ \alpha + \mathrm{X}_{i} \beta_{t} + \tau \end{pmatrix}, \begin{pmatrix} \sigma_{c}^{2} & 0 \\ 0 & \sigma_{t}^{2} \end{pmatrix} \end{pmatrix} \]

where now \(\theta = (\alpha, \beta_{c}, \beta_{t}, \tau, \sigma_{c}^{2}, \sigma_{t}^{2})\).

Instead of imposing restrictions that the effects of \(\mathrm{X}_{i}\) are the same for both potential outcomes, we define two different vectors of the slope coefficients \(\beta_{c}\) and \(\beta_{t}\) for the control and treated units repectively. The difference in the two vectors, \(\beta_{t} - \beta_{c}\), can be obtained by including an interaction term between \(\mathrm{X}\) and \(\mathrm{W}\) in the model. We prepare a data list for **Stan** including the nine mean-centered covariates and their interaction terms with the treatment variable as follows:

```
# Add nine mean-centered covariates and their interaction terms
x <- as.matrix(lalonde[, c("age", "educ", "married", "nodegr", "black", "re74", "u74", "re75", "u75")])
x_mean_mat <- matrix(rep(apply(x, 2, mean, na.rm = TRUE), each = nrow(x)), nrow = nrow(x))
x_c_mat <- x - x_mean_mat
xw_inter <- x_c_mat*w
colnames(xw_inter) <- paste0(colnames(x), "_w")
# Collect data into a list format suitable for Stan
stan_data <- list(y = y, w = w, x = x_c_mat, xw_inter = xw_inter, N = nrow(lalonde), N_cov = ncol(x_c_mat))
```

In the **Stan** program **Model_03_Independent_Potential_Outcomes_With_Covariates.stan** for the Model 3, the data variables are coded following their specifications in the data list. In addition to the `alpha`

, `tau`

, `sigma_c`

, and `sigma_t`

, we specify two vectors for the slope coefficients in the `parameters`

block: `beta`

for \(\beta_{c}\) and `beta_inter`

for \(\beta_{t} - \beta_{c}\). The priors for the slope coefficients are specified to be normal with zero means and variance equal to \(100^{2}\). The sampling statement of \(\mathrm{Y}^{obs}\) is then coded using matrix notation as

` y ~ normal(alpha + x*beta + xw_inter*beta_inter + tau * w, sigma_t*w + sigma_c*(1-w));`

Assuming zero correlation between \(Y_{i}(0)\) and \(Y_{i}(1)\), the code for the model-based imputation is the same as the code for Model 1: `y0[n] = normal_rng(mu_c, sigma_c);`

for \(Y_{i}(0)\) and `y1[n] = normal_rng(mu_t, sigma_t);`

for \(Y_{i}(1)\). Because the covariates determines the location of the distribution but not its scale in Model 3, only the `mu_c`

and `mu_t`

are changed from the program for Model 1 as

```
real mu_t = alpha + x[n,]*beta + x[n, ]*beta_inter + tau;
real mu_c = alpha + x[n,]*beta;
```

The **Stan** program for the Model 3 is fit to the example data:

```
# Compile and run the stan model
fit_mod3 <- stan(file = "Model_03_Independent_Potential_Outcomes_With_Covariates.stan",
data = stan_data,
iter = 1000, chains = 4)
```

Next the posterior predictive distributions of the first 20 missng control potential outcomes are displayed. We see that the locations of the posterior predictive distributions vary in Model 3, whereas those of Model 1 are almost the same across the 20 missing potential outcomes.

```
# Model 1. Without Covariates
color_scheme_set("red")
ymis_idx <- paste0(ifelse(w[seq(20)] == 0, "y1", "y0"), "[", seq(20), "]")
mod1_posterior <- as.array(fit_mod1)
p1 <- mcmc_intervals(mod1_posterior, pars = ymis_idx,
prob = 0.8, prob_outer = 0.99, point_est = "mean" ) +
#vline_at(c(0), linetype = "dashed", size = 0.5) +
ggplot2::labs(title = "Model 1. Without Covariates")
# Model 2. assuming constant treatment effect
mod3_posterior <- as.array(fit_mod3)
p3 <- mcmc_intervals(mod3_posterior, pars = ymis_idx,
prob = 0.8, prob_outer = 0.99, point_est = "mean" ) +
#vline_at(c(0), linetype = "dashed", size = 0.5) +
#scale_x_continuous(limits = c(-20, 40)) +
ggplot2::labs(title = "Model 3. With Covariates")
grid.arrange(p1, p3, ncol = 2)
```

The estimates of the causal estimands are displayed below. The posterior standard deviations for the ATE and QTE estimates are almost the same as those in Model 1, which means that the independence assumption between the potential outcomes determines the level of uncertainty for the causal estimates rather than the presence of covariates in the model. The posterior means for the ATE and QTEs replicates the estimates presented in the Table 8.6 of Imbens and Rubin (2015).

```
param <- c("tau", "tau_fs", "tau_qte25", "tau_qte50", "tau_qte75" )
print(fit_mod3, pars = param,
probs = c(0.1, 0.5, 0.9), digits = 3)
```

```
## Inference for Stan model: Model_03_Independent_Potential_Outcomes_With_Covariates.
## 4 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=2000.
##
## mean se_mean sd 10% 50% 90% n_eff Rhat
## tau 1.561 0.015 0.667 0.666 1.561 2.438 2000 1.001
## tau_fs 1.571 0.011 0.504 0.942 1.568 2.224 2000 1.001
## tau_qte25 0.435 0.008 0.341 0.000 0.485 0.892 2000 1.001
## tau_qte50 1.399 0.012 0.548 0.684 1.380 2.128 2000 1.000
## tau_qte75 2.881 0.014 0.637 2.039 2.910 3.648 2000 1.000
##
## Samples were drawn using NUTS(diag_e) at Mon Jul 30 17:42:53 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
```

Dehejia, Rajeev H, and Sadek Wahba. 1999. “Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs.” *Journal of the American Statistical Association* 94 (448). Taylor & Francis Group: 1053–62.

Ding, Peng, and Fan Li. 2017. “Causal Inference: A Missing Data Perspective.” *arXiv Preprint arXiv:1712.06170*.

Gelman, Andrew. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models.” *Bayesian Analysis* 1 (3): 515–34.

Gelman, Andrew, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. 2013. *Bayesian Data Analysis*. Chapman; Hall/CRC.

Heckman, James J, and V Joseph Hotz. 1989. “Choosing Among Alternative Nonexperimental Methods for Estimating the Impact of Social Programs: The Case of Manpower Training.” *Journal of the American Statistical Association* 84 (408). Taylor & Francis Group: 862–74.

Hyndman, Rob J, and Yanan Fan. 1996. “Sample Quantiles in Statistical Packages.” *The American Statistician* 50 (4). Taylor & Francis: 361–65.

Imbens, Guido W, and Donald B Rubin. 2015. *Causal Inference in Statistics, Social, and Biomedical Sciences*. Cambridge University Press.

Rubin, Donald B. 1978. “Bayesian Inference for Causal Effects: The Role of Randomization.” *The Annals of Statistics*. JSTOR, 34–58.

Smith, Jeffrey A, and Petra E Todd. 2001. “Reconciling Conflicting Evidence on the Performance of Propensity-Score Matching Methods.” *American Economic Review* 91 (2): 112–18.