In the following paper, Dorazio et al. provide a hierarchical occupancy model for “estimating species richness and accumulation by forming these community-level attributes as functions of model-based estimators of species occurence while accounting for imperfect detection of individual species.”

Dorazio, R. M., Royle, J. A., Söderström, B., & Glimskär, A. (2006). Estimating species richness and accumulation by modeling species occurrence and detectability. Ecology 87(4), 842-854.

They apply their model to estimating total numbers of butterfly species (and of birds, with a similar format to the data).

This knitr document contains all that you need to replicate Dorazio et al.’s statistical analysis of butterfly species. The model is coded and statistical inference carried out using the Bayesian inference package RStan. A great follow-on exercise would be to code up more of the models from Dorazio and Royle’s book in Stan.

## Data

The data consists of an the observation of butterfly species in July 1997 at “grazed seminatural grasslands located in south-central Sweden…Twenty grasslands thought to be representative of the region were selected for sampling”.

The data consists of the following variables:

• $$n$$: distinct butterfly species detected (28)

• $$J$$: distinct sites visited (20)

• $$K$$: visits to each site (18)

• $$x$$: $$n \times J$$ matrix where $$x_{i,j} \in 1{:}K$$ is the number of visits in which species $$i$$ was observed at site $$j$$

• $$S$$: size of supercommunity out of which community is drawn (50)

The supercommunity size parameter $$S$$ is just required to be large enough to encompass all possible species that might be present; it will affect the estimation of various model parameters, but not the derived quantities of interest such as the estimated total number of species available at a site.

Reading the data in requires a web connection to the publisher’s web site:

x_csv <- read.csv("http://esapubs.org/archive/ecol/E087/050/butterflyData.txt");
n <- dim(x_csv)[1];
J <- dim(x_csv)[2];
x <- matrix(NA, n, J)
for (i in 1:n) {
for (j in 1:J) {
x[i,j] <- x_csv[i,j];
}
}
K <- 18;   # read from paper, figure 5
S <- 50;   # more than adequate judging from plot in figure 6
head(x_csv);
##                       X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
## Aglais.urticae         0  0  1  0  1  2  1  0  1   1   2   0   3   2   1
## Aphantopus.hyperantus 14 10 13 15  3  9  7 17  1  11  11  15  15   6   0
## Argynnis.paphia        0  0  2  4  1  1  0  5  0   0   0   2   2   0   0
## Aricia.artaxerxes      0  1  0  4  0  0  0  0  0   2   2   0   0   0   0
## Aricia.eumedon         0  3  1  4  0  0  0  2  0   0   0   0   0   0   0
## Brenthis.ino           0  1  1  1  0  0  1  1  0   0   3   0   0   0   0
##                       X16 X17 X18 X19 X20
## Aglais.urticae          1   1   0   2   0
## Aphantopus.hyperantus  10  15   5  11   6
## Argynnis.paphia         0   1   3   1   1
## Aricia.artaxerxes       2   1   0   0   0
## Aricia.eumedon          0   0   0   0   0
## Brenthis.ino            0   0   1   0   0

The first thing Dorazio et al. do is plot the data itself (figure 5 of their paper). This requires the R packages ggplot2 and reshape2, (both of which may be installed from CRAN).

library(ggplot2);
library(reshape2);
df_x <- melt(x);
colnames(df_x)[1] <- "species";
colnames(df_x)[2] <- "site";
colnames(df_x)[3] <- "detections";

head(df_x);
##   species site detections
## 1       1    1          0
## 2       2    1         14
## 3       3    1          0
## 4       4    1          0
## 5       5    1          0
## 6       6    1          0
detections_heatmap_plot <-
ggplot(df_x, aes(site, species)) +
geom_tile(aes(fill = detections), colour = "white") +
scale_fill_gradient(low = "white", high = "black") +
labs(x = "site number", y = "species number") +
scale_x_discrete(expand = c(0, -5), breaks=(5 * (1:4))) +
scale_y_discrete(expand = c(0, -5), breaks=(5 * (1:5))) +
ggtitle("Detections of Species at Sites over Visits");

plot(detections_heatmap_plot);

Unlike Dorazio et al.’s original plot, this version does not reorder the species by total detections.

## Statistical Model

### Parameters

The model involves parameters for the site (here the sites are considered exchangeable and considered to share a single parameter) that determine species occupancy and detectability.

• $$\alpha$$ : site-level occupancy (logit scaled)

• $$\beta$$ : site-level detection (logit scaled)

There is a single parameter for the probability of a species being in the region:

• $$\Omega \in (0,1)$$: availability of species

Then for each species $$i$$, there are a pair of parameters for species-specific occupancy and detection

• $$u_i$$ : species-level occupancy (logit scaled)

• $$v_i$$ : species-level detection (logit scaled)

These species-level effects are given a hierarchical multivariate normal prior with zero mean and covariance matrix parameterized by three scalars,

• $$\sigma_u$$ : std deviation of species occupancy
• $$\sigma_v$$ : std deviation of species detection
• $$\rho_{uv}$$ : correlation of species occupancy and detection

so that the covariance matrix for the prior on $$(u,v)$$ pairs is defined as

$$\Sigma = \left[ \begin{array}{cc} \sigma_u^2 & \sigma_u \sigma_v \rho_{uv} \\ \sigma_u \sigma_v \rho_{uv} & \sigma_v^2 \end{array} \right]$$

### Priors

Rather than using Dorazio et al.’s exact diffuse priors, we follow the recommendations given in the Stan manual, and formulate priors as

• $$\alpha, \beta \sim \mbox{Cauchy}(0, 2.5)$$

• $$\sigma_u, \sigma_v \sim \mbox{Cauchy}(0, 2.5)$$

• $$(\rho_{uv} + 1) / 2 \sim \mbox{Beta}(2,2)$$

• $$(u_i, v_i) \sim \mbox{Normal}(0, \Sigma)$$

• $$\Omega \sim \mbox{Beta}(2,2)$$

Although we did not perform any prior sensitivity analysis, the results from the above weakly informative priors are in-line with what Dorazio et al. found with diffuse conjugate inverse gamma priors on $$\sigma_u, \sigma_v$$, diffuse normal priors on $$\alpha, \beta$$, and uniform priors on $$\Omega$$ and $$\rho_{uv}$$.

### Latent Parameters

Dorazio et al. formulate the likelihood for $$x_{i,j}$$ in terms of two discrete variables,

• $$z_{i,j} = 1$$ if species $$i$$ is present at site $$j$$

• $$w_i = 1$$ if species $$i$$ in supercommunity is available

By definition, for $$i \in (n+1){:}S$$, $$x_{i,j} = 0$$, because no members of species $$i$$ was detected for $$i > n$$. Also by definition, $$w_i = 1$$ for $$i \leq n$$, because a species must be available if it was detected, whereas $$w_i$$ for $$i > n$$ is a latent parameter. The availability of a species from the supercommunity is modeled by

• $$w_i \sim \mbox{Bernoulli}(\Omega)$$.

The site-level availability is modeled by

• $$z_{i,j} \sim \mbox{Bernoulli}(\psi_i)$$,

where $$\psi_i = \mbox{logit}^{-1}(u_i + \alpha)$$. Dorazio et al. use two indexes, writing $$\psi_{i,j}$$, but for their analyses assume $$\psi_{i,j} = \psi_{i,j'}$$ for all $$j, j'$$, allowing this note to simplify $$\psi$$ to a vector.

### Likelihood

The likelihood for the observed data is defined by marginalizing over the unobserved latent parameters (by “summing out” using the law of total probability), with the number of visits to site $$j$$ on which species $$i$$ was detected is given by equation (2) in Dorazio et al.’s paper, which we rewrite using sampling notation as

• $$x_{i,j} \sim \mbox{Binomial}(K, z_{i,j} \theta_i)$$,

where the detection probability is $$\theta_i = \mbox{logit}^{-1}(v_i + \beta)$$. (As with $$\psi$$, this note simplifies Dorazio et al.’s notation by treating $$\theta$$ as a vector.)

Note that if $$z_{i,j} = 0$$, then species $$i$$ is not present at site $$j$$ and hence $$x_{i,j} = 0$$ by definition. As pointed out in the paper, “if species $$i$$ is not detected at site $$j$$ (i.e., $$x_{i,j} = 0$$), species $$i$$ is either absent (with probability $$1 - \psi_i$$) or present but undetected (with probability $$\psi_i (1 - \theta_i)^K$$).”

Marginalizing the $$z_{i,j}$$ from the above gives the following for $$i \in 1{:}n$$, which is explicitly conditioned on the species being available $$w_i = 1$$,

$$p(x_{i,j} \, | \, \theta_i, \psi_i) = \mbox{Bernoulli}(1 \, | \, \psi_i) \times \mbox{Binomial}(x_{i,j} \, | \, K, \theta_i) + \mbox{Bernoulli}(0 \, | \, \psi_i) \times \mbox{I}(x_{i,j} = 0)$$

The Stan model will exploit the fact that if $$x_{i,j} > 0$$, the second term representing non-occurrence drops out.

The $$w_i = 1$$ enters the likelihood computation as $$\mbox{Bernoulli}(1 \, | \, \Omega)$$ when looking at the entire vector $$x_i = x_{i,1}, \ldots, x_{i,n}$$ of site-detection counts for detected species $$i \in 1{:}n$$, as

$$p(x_i \, | \, \theta_i, \psi_i, \Omega) = \mbox{Bernoulli}(1 \, | \, \Omega) \times \prod_{j=1}^J p(x_{i,j} \, | \, \theta_i, \psi_i)$$

Expanding out the nested probability function gives the final form of

$$p(x_i \, | \, \theta_i, \psi_i, \Omega) = \mbox{Bernoulli}(1 \, | \, \Omega) \times \prod_{j=1}^J \left( \mbox{Bernoulli}(1 \, | \, \psi_i) \times \mbox{Binomial}(x_{i,j} \, | \, K, \theta_i) + \mbox{Bernoulli}(0 \, | \, \psi_i) \times \mbox{I}(x_{i,j} = 0) \right)$$

The $$1$$ built into the Bernoulli is because we know that $$x_{i,j} > 0$$ for at least one $$j$$ and hence we know species $$i$$ is available.

The last step is marginalizing out the $$w_i$$ for the undetected species $$i \in (n+1){:}S$$, for which we know $$x_{i,j} = 0$$, giving

$$p(x_i \, | \, \Omega, \psi_i, \theta_i) = \mbox{Bernoulli}(0 \, | \, \Omega) + \mbox{Bernoulli}(1 \, | \, \Omega) \times \prod_{j=1}^J p(x_{i,j} \, | \, \theta_i, \psi_i)$$.

where the final term is defined as for the formula conditioning on $$w_i = 1$$ above, knowing that $$x_{i,j} = 0$$ because $$i > n$$, resulting in the full joint likelihood of $$x_i = x_{i,1}, \ldots, x_{i,J}$$ for the case where $$i > n$$ of

$$p(x_i \, | \, \Omega, \psi_i, \theta_i) = \begin{array}[t]{l} \mbox{Bernoulli}(0 \, | \, \Omega) \\ \mbox{ } + \mbox{Bernoulli}(1 \, | \, \Omega) \times \prod_{j=1}^J \big( \mbox{Bernoulli}(1 \, | \, \psi_i) \times \mbox{Binomial}(x_{i,j} \, | \, K, \theta_i) + \mbox{Bernoulli}(0 \, | \, \psi_i) \big) \end{array}$$

## Stan Program

With the data, parameters, likelihood, and prior in hand, the Stan program defines the log density following the steps given above for marginalizing out the discrete parameters.

stan_program <- "
functions {
matrix cov_matrix_2d(vector sigma, real rho) {
matrix[2,2] Sigma;
Sigma[1,1] = square(sigma[1]);
Sigma[2,2] = square(sigma[2]);
Sigma[1,2] = sigma[1] * sigma[2] * rho;
Sigma[2,1] = Sigma[1,2];
return Sigma;
}

real lp_observed(int x, int K, real logit_psi, real logit_theta) {
return log_inv_logit(logit_psi)
+ binomial_logit_lpmf(x | K, logit_theta);
}

real lp_unobserved(int K, real logit_psi, real logit_theta) {
return log_sum_exp(lp_observed(0, K, logit_psi, logit_theta),
log1m_inv_logit(logit_psi));
}

real lp_never_observed(int J, int K, real logit_psi, real logit_theta,
real Omega) {
real lp_unavailable = bernoulli_lpmf(0 | Omega);
real lp_available = bernoulli_lpmf(1 | Omega)
+ J * lp_unobserved(K, logit_psi, logit_theta);
return log_sum_exp(lp_unavailable, lp_available);
}
}
data {
int<lower=1> J;  // sites within region
int<lower=1> K;  // visits to sites
int<lower=1> n;  // observed species
int<lower=0, upper=K> x[n,J];  // visits when species i was detected at site j
int<lower=n> S;  // superpopulation size
}
parameters {
real alpha;  //  site-level occupancy
real beta;   //  site-level detection
real<lower=0, upper=1> Omega;  // availability of species

real<lower=-1,upper=1> rho_uv;  // correlation of (occupancy, detection)
vector<lower=0>[2] sigma_uv;    // sd of (occupancy, detection)
vector[2] uv[S];                // species-level (occupancy, detection)
}
transformed parameters {
vector[S] logit_psi;    // log odds  of occurrence
vector[S] logit_theta;  // log odds of detection
for (i in 1:S)
logit_psi[i] = uv[i,1] + alpha;
for (i in 1:S)
logit_theta[i] = uv[i,2] + beta;
}
model {
// priors
alpha ~ cauchy(0, 2.5);
beta ~ cauchy(0, 2.5);
sigma_uv ~ cauchy(0, 2.5);
(rho_uv + 1) / 2 ~ beta(2, 2);
uv ~ multi_normal(rep_vector(0, 2), cov_matrix_2d(sigma_uv, rho_uv));
Omega ~ beta(2,2);

// likelihood
for (i in 1:n) {
1 ~ bernoulli(Omega); // observed, so available
for (j in 1:J) {
if (x[i,j] > 0)
target += lp_observed(x[i,j], K, logit_psi[i], logit_theta[i]);
else
target += lp_unobserved(K, logit_psi[i], logit_theta[i]);
}
}
for (i in (n + 1):S)
target += lp_never_observed(J, K, logit_psi[i], logit_theta[i], Omega);
}
generated quantities {
real<lower=0,upper=S> E_N = S * Omega; // model-based expectation species
int<lower=0,upper=S> E_N_2;  // posterior simulated species
vector[2] sim_uv;
real logit_psi_sim;
real logit_theta_sim;

E_N_2 = n;
for (i in (n+1):S) {
real lp_unavailable = bernoulli_lpmf(0 | Omega);
real lp_available = bernoulli_lpmf(1 | Omega)
+ J * lp_unobserved(K, logit_psi[i], logit_theta[i]);
real Pr_available = exp(lp_available
- log_sum_exp(lp_unavailable, lp_available));
E_N_2 = E_N_2 + bernoulli_rng(Pr_available);
}

sim_uv = multi_normal_rng(rep_vector(0,2),
cov_matrix_2d(sigma_uv, rho_uv));
logit_psi_sim = alpha + sim_uv[1];
logit_theta_sim = beta + sim_uv[2];
}
";

There are several helper functions defined:

• cov_matrix_2d(): constructs a $$2 \times 2$$ covariance matrix from the standard deviations and correlation

• lp_observed(): log likelihood function for case where $$x_{i,j} > 0$$ (entails $$i \leq n$$)

• lp_unobserved(): log likelihood function for case where $$x_{i,j} = 0$$

• lp_never_observed(): log likelihood function for case where the entire vector $$x_i = 0$$ (entails $$i > n$$)

These log likelihood definitions take advantage of the fact that $$\psi$$ and $$\theta$$ do not vary by site, allowing scalar arguments to the functions and the use of $$J$$ as a multiplicand in the definition of lp_never_observed().

The data and parameters are declared with appropriate constraints based on their restrictions.

The transformed parameters block defines the logit-scaled versions of $$\psi$$ and $$\theta$$ as vectors, again noting they do not vary by site in this model.

The model block defines the priors directly as specified. The expression rep_vector(0, 2) provides a zero-vector for the mean of the multivariate normal prior.

The model block then defines the likelihood by separating out the cases for $$i \leq n$$ and the cases $$i > n$$. The former also branches using a conditional on x[i,j] > 0 to determine whether to use the simplified form of log likelihood for the unobserved case. Otherwise, the likelihood is just called directly. The increment_log_prob() function is used to add directly to the log density defined by the Stan model block.

The generated quantities block performs three calculations. The generated quantities block is executed once per iteration, calculating the variables declared at the top of the block using the parameter values for the iteration. This provides full Bayesian inference via MCMC for the following variables defined in the generated quantities block.

• E_N, the expectation of $$N$$ using the simple formula $$S \times \Omega$$ suggested in the Dorazio et al. paper.

• E_N_2, which we assume is how Dorazio et al. actually estimated species occupancy. This is carried out by reconstructing the expectations of the $$w_i$$ in the posterior for $$i > n$$ and adding the resulting sum to $$n$$, because we know $$w_i = 1$$ for $$i \leq n$$. The log likelihood of availability and unavailability is calculated for each $$i > n$$, then normalized to define Pr_available, which is the posterior expectation of $$w_i$$ for species never detected for the parameter values being used.

• The third generated quantities is a set of simulated $$(u,v)$$ pairs for a new species, along with the logit-scaled $$\psi$$ and $$\theta$$ variables. These are used for posterior predictive checking to see if the actual species are distributed as characterized by their prior.

## Fitting the Model

Next up, we fit the model by running Stan and then printing the fit object producted. The control parameters are provided to avoid divergence of the numerical simulations during warmup. The initial stepsize is set to 0.01 instead of its default 1, which causes the Hamiltonian dynamics to be evaluated using smaller steps at the expense of some computaiton time. The other control parameter is the target acceptance rate (a complicated notion with the no-U-turn sampler used by Stan), which is set to 0.95, which is higher than its default value of 0.8, so that the target step size will be lower and the floating-point arithmetic more stable. Together, these cause Stan to run a bit more stably but at about half the speed of the default settings.

A total of four Markov chains are simulated, each for 5000 iterations, with the default of 2500 warmup iterations being used for adaptation and discarded to produce the final sample, which will consist of the second-half of each chain, or a total of 10,000 draws. The width of the print is increased before printing the fit because it’s rather wide.

### Running the Model

The model is fit by loading the RStan library, then calling the stan() function given the program, names of the data variables in the global environment, a control specification for Stan (optional), and a number of chains, iterations per chain, and random seed (the seed is also optional but will guarantee the same results across runs if provided).

library(rstan);
fit <- stan(model_code=stan_program, data=c("x", "n", "J", "K", "S"),
chains=4, iter=5000, refresh=0, seed=12345);
## In file included from file1ad265ced898.cpp:8:
## In file included from /Library/Frameworks/R.framework/Versions/3.3/Resources/library/StanHeaders/include/stan/math.hpp:4:
## In file included from /Library/Frameworks/R.framework/Versions/3.3/Resources/library/StanHeaders/include/stan/math/rev/mat.hpp:4:
## In file included from /Library/Frameworks/R.framework/Versions/3.3/Resources/library/StanHeaders/include/stan/math/rev/core.hpp:12:
## In file included from /Library/Frameworks/R.framework/Versions/3.3/Resources/library/StanHeaders/include/stan/math/rev/core/gevv_vvv_vari.hpp:5:
## In file included from /Library/Frameworks/R.framework/Versions/3.3/Resources/library/StanHeaders/include/stan/math/rev/core/var.hpp:7:
## In file included from /Library/Frameworks/R.framework/Versions/3.3/Resources/library/BH/include/boost/math/tools/config.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/3.3/Resources/library/BH/include/boost/config.hpp:39:
## /Library/Frameworks/R.framework/Versions/3.3/Resources/library/BH/include/boost/config/compiler/clang.hpp:196:11: warning: 'BOOST_NO_CXX11_RVALUE_REFERENCES' macro redefined [-Wmacro-redefined]
## #  define BOOST_NO_CXX11_RVALUE_REFERENCES
##           ^
## <command line>:5:9: note: previous definition is here
## #define BOOST_NO_CXX11_RVALUE_REFERENCES 1
##         ^
## 1 warning generated.
##
##  Elapsed Time: 59.992 seconds (Warm-up)
##                53.5034 seconds (Sampling)
##                113.495 seconds (Total)
##
##
##  Elapsed Time: 71.6769 seconds (Warm-up)
##                47.7704 seconds (Sampling)
##                119.447 seconds (Total)
##
##
##  Elapsed Time: 75.2507 seconds (Warm-up)
##                65.9798 seconds (Sampling)
##                141.23 seconds (Total)
##
##
##  Elapsed Time: 69.1103 seconds (Warm-up)
##                52.7265 seconds (Sampling)
##                121.837 seconds (Total)
## Warning: There were 26 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems

Setting refresh=0 removes the output updates during warmup and sampling.

### Printing the Fitted Parameters

The fit object can be directly printed in RStan (after widening the default output number of columns).

options(width="120")
print(fit, c("alpha", "beta", "Omega", "sigma_uv", "rho_uv", "E_N", "E_N_2", "lp__"));
## Inference for Stan model: ff975a7e121ef7b489369ef2eb311654.
## 4 chains, each with iter=5000; warmup=2500; thin=1;
## post-warmup draws per chain=2500, total post-warmup draws=10000.
##
##                 mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
## alpha           1.24    0.01  0.29     0.66     1.06     1.24     1.42     1.81  1986 1.00
## beta           -1.88    0.01  0.25    -2.42    -2.03    -1.87    -1.71    -1.40  1141 1.00
## Omega           0.56    0.00  0.07     0.43     0.52     0.56     0.61     0.70 10000 1.00
## sigma_uv[1]     1.04    0.01  0.31     0.51     0.82     1.01     1.22     1.72   888 1.00
## sigma_uv[2]     1.20    0.00  0.21     0.87     1.05     1.17     1.32     1.68  2584 1.00
## rho_uv          0.70    0.01  0.20     0.21     0.59     0.74     0.85     0.96   768 1.00
## E_N            28.19    0.03  3.42    21.54    25.92    28.19    30.50    34.90 10000 1.00
## E_N_2          28.50    0.02  0.93    28.00    28.00    28.00    29.00    31.00  2100 1.00
## lp__        -1157.10    1.10 20.65 -1194.89 -1171.27 -1158.16 -1143.99 -1112.97   355 1.01
##
## Samples were drawn using NUTS(diag_e) at Tue Mar 21 15:55:30 2017.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).

An explicit list of parameters to print is provided to suppress the printing of the parameters u and v and the transformed parameters logit_psi and logit_theta.

### Inspecting Traceplots for Mixing

The traceplots can be shown for some of the parameters (incuding one $$(u,v)$$ pair for a detected species and one for an undetected species) and the log density (up to a constant).

traceplot(fit,
c("Omega", "alpha", "beta", "sigma_uv", "rho_uv",
"uv[1,1]", "uv[1,2]", "uv[30,1]", "uv[30,2]",
"lp__"),
inc_warmup=FALSE) +
coord_cartesian(xlim = c(3001,3200)) +
scale_x_continuous(breaks=c(3000, 3100, 3200))