Summary statistics

We can easily customize the summary statistics reported by $summary() and $print().

fit <- cmdstanr::cmdstanr_example("schools", method = "sample")
Warning: 124 of 4000 (3.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.
Warning: 1 of 4 chains had an E-BFMI less than 0.3.
See https://mc-stan.org/misc/warnings for details.
fit$summary()
   variable       mean     median       sd      mad          q5       q95
1      lp__ -58.697829 -58.911800 5.056327 5.134318 -66.6875850 -49.89660
2        mu   6.353763   6.597480 4.381424 4.372565  -1.2281605  13.23484
3       tau   5.604807   4.871790 3.537844 3.371217   1.4006485  12.28454
4  theta[1]   9.416434   8.789030 7.469694 6.297677  -1.7534945  22.77244
5  theta[2]   6.708629   6.756395 5.775008 5.428570  -2.7675325  16.09375
6  theta[3]   5.013658   5.633410 7.082039 6.055554  -7.5991140  15.69620
7  theta[4]   6.479275   6.629210 6.113995 5.602212  -3.7474250  16.47548
8  theta[5]   4.415845   4.751705 5.837104 5.557445  -5.7668200  13.46467
9  theta[6]   5.316444   5.776025 6.132648 5.695771  -5.1632545  14.89951
10 theta[7]   9.159755   8.682990 6.276413 5.676549  -0.3210604  20.38977
11 theta[8]   7.114358   7.154305 6.994382 6.108230  -4.2959180  18.25591
       rhat  ess_bulk  ess_tail
1  1.018582  172.4200  135.9782
2  1.008425  635.3354 1168.1366
3  1.021625  165.4672  120.1282
4  1.003830 1073.3788 1319.6507
5  1.004990 1050.8445 1693.6799
6  1.007712  871.0003 1537.3018
7  1.003560 1027.3246 1869.0154
8  1.007430  780.1467 1587.7245
9  1.007162  917.8167 1453.6511
10 1.003741  956.3136 1718.3496
11 1.002369 1338.3238 1986.1357

By default all variables are summaries with the follow functions:

[1] "mean"      "median"    "sd"        "mad"       "quantile2"

To change the variables summarized, we use the variables argument

fit$summary(variables = c("mu", "tau"))
  variable     mean  median       sd      mad        q5      q95     rhat
1       mu 6.353763 6.59748 4.381424 4.372565 -1.228160 13.23484 1.008425
2      tau 5.604807 4.87179 3.537844 3.371217  1.400649 12.28454 1.021625
  ess_bulk  ess_tail
1 635.3354 1168.1366
2 165.4672  120.1282

We can additionally change which functions are used

fit$summary(variables = c("mu", "tau"), mean, sd)
  variable     mean       sd
1       mu 6.353763 4.381424
2      tau 5.604807 3.537844

To summarize all variables with non-default functions, it is necessary to set explicitly set the variables argument, either to NULL or the full vector of variable names.

fit$metadata()$model_params
 [1] "lp__"     "mu"       "tau"      "theta[1]" "theta[2]" "theta[3]"
 [7] "theta[4]" "theta[5]" "theta[6]" "theta[7]" "theta[8]"
fit$summary(variables = NULL, "mean", "median")
   variable       mean     median
1      lp__ -58.697829 -58.911800
2        mu   6.353763   6.597480
3       tau   5.604807   4.871790
4  theta[1]   9.416434   8.789030
5  theta[2]   6.708629   6.756395
6  theta[3]   5.013658   5.633410
7  theta[4]   6.479275   6.629210
8  theta[5]   4.415845   4.751705
9  theta[6]   5.316444   5.776025
10 theta[7]   9.159755   8.682990
11 theta[8]   7.114358   7.154305

Summary functions can be specified by character string, function, or using a formula (or anything else supported by rlang::as_function()). If these arguments are named, those names will be used in the tibble output. If the summary results are named they will take precedence.

my_sd <- function(x) c(My_SD = sd(x))
fit$summary(
  c("mu", "tau"), 
  MEAN = mean, 
  "median",
  my_sd,
  ~quantile(.x, probs = c(0.1, 0.9)),
  Minimum = function(x) min(x)
)        
  variable     MEAN  median    My_SD      10%      90%   Minimum
1       mu 6.353763 6.59748 4.381424 0.756247 11.71914 -9.340540
2      tau 5.604807 4.87179 3.537844 1.723530 10.46083  0.749924

Arguments to all summary functions can also be specified with .args.

fit$summary(c("mu", "tau"), quantile, .args = list(probs = c(0.025, .05, .95, .975)))
  variable      2.5%        5%      95%    97.5%
1       mu -2.926103 -1.228160 13.23484 14.96275
2      tau  1.106000  1.400649 12.28454 14.09220

The summary functions are applied to the array of sample values, with dimension iter_samplingxchains.

fit$summary(variables = NULL, dim, colMeans)
   variable dim.1 dim.2          1          2          3          4
1      lp__  1000     4 -57.746884 -58.396713 -59.488281 -59.159440
2        mu  1000     4   6.886773   6.098293   6.275602   6.154385
3       tau  1000     4   4.969696   5.504146   6.035791   5.909594
4  theta[1]  1000     4   9.584919   8.855289   9.788069   9.437459
5  theta[2]  1000     4   7.218764   6.441512   6.701321   6.472920
6  theta[3]  1000     4   6.039431   4.880572   4.532884   4.601744
7  theta[4]  1000     4   6.923818   6.048362   6.708068   6.236851
8  theta[5]  1000     4   5.339809   4.243510   3.959413   4.120650
9  theta[6]  1000     4   5.971513   4.851822   5.284337   5.158102
10 theta[7]  1000     4   9.310560   8.904282   9.350879   9.073298
11 theta[8]  1000     4   7.380020   7.069916   7.245772   6.761726

For this reason users may have unexpected results if they use stats::var() directly, as it will return a covariance matrix. An alternative is the distributional::variance() function, which can also be accessed via posterior::variance().

fit$summary(c("mu", "tau"), posterior::variance, ~var(as.vector(.x)))
  variable posterior::variance ~var(as.vector(.x))
1       mu            19.19687            19.19687
2      tau            12.51634            12.51634

Summary functions need not be numeric, but these won’t work with $print().

strict_pos <- function(x) if (all(x > 0)) "yes" else "no"
fit$summary(variables = NULL, "Strictly Positive" = strict_pos)
   variable Strictly Positive
1      lp__                no
2        mu                no
3       tau               yes
4  theta[1]                no
5  theta[2]                no
6  theta[3]                no
7  theta[4]                no
8  theta[5]                no
9  theta[6]                no
10 theta[7]                no
11 theta[8]                no
# fit$print(variables = NULL, "Strictly Positive" = strict_pos)

For more information, see posterior::summarise_draws(), which is called by $summary().

Extracting posterior draws/samples

The $draws() method can be used to extract the posterior draws in formats provided by the posterior package. Here we demonstrate only the draws_array and draws_df formats, but the posterior package supports other useful formats as well.

# default is a 3-D draws_array object from the posterior package
# iterations x chains x variables
draws_arr <- fit$draws() # or format="array"
str(draws_arr)
 'draws_array' num [1:1000, 1:4, 1:11] -66.1 -61.6 -58.3 -57.5 -55.3 ...
 - attr(*, "dimnames")=List of 3
  ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
  ..$ chain    : chr [1:4] "1" "2" "3" "4"
  ..$ variable : chr [1:11] "lp__" "mu" "tau" "theta[1]" ...
# draws x variables data frame
draws_df <- fit$draws(format = "df")
str(draws_df)
draws_df [4,000 × 14] (S3: draws_df/draws/tbl_df/tbl/data.frame)
 $ lp__      : num [1:4000] -66.1 -61.6 -58.3 -57.5 -55.3 ...
 $ mu        : num [1:4000] 1.93 8.57 4.05 9.53 9.94 ...
 $ tau       : num [1:4000] 10.73 6.28 6.1 4.65 3.2 ...
 $ theta[1]  : num [1:4000] 9.1 21.74 11.18 13.36 9.68 ...
 $ theta[2]  : num [1:4000] 11.006 5.842 -0.815 17.943 11.148 ...
 $ theta[3]  : num [1:4000] 0.4 12.17 7.94 6.17 3.94 ...
 $ theta[4]  : num [1:4000] -3.74 14.86 6.59 9.74 7.95 ...
 $ theta[5]  : num [1:4000] 12.97 5.5 1.32 8.77 12.83 ...
 $ theta[6]  : num [1:4000] -11.04 1.81 6.18 6.33 6.56 ...
 $ theta[7]  : num [1:4000] 22.21 10.53 7.75 14.3 11.54 ...
 $ theta[8]  : num [1:4000] 14.77 -2.98 11.19 7.25 10.37 ...
 $ .chain    : int [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
 $ .iteration: int [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
 $ .draw     : int [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
print(draws_df)
# A draws_df: 1000 iterations, 4 chains, and 11 variables
   lp__   mu  tau theta[1] theta[2] theta[3] theta[4] theta[5]
1   -66  1.9 10.7      9.1    11.01      0.4     -3.7     13.0
2   -62  8.6  6.3     21.7     5.84     12.2     14.9      5.5
3   -58  4.1  6.1     11.2    -0.82      7.9      6.6      1.3
4   -58  9.5  4.6     13.4    17.94      6.2      9.7      8.8
5   -55  9.9  3.2      9.7    11.15      3.9      7.9     12.8
6   -55  9.9  4.0      8.5     8.81      6.6      9.8     12.0
7   -54  9.7  2.7     14.5    10.87      5.8      6.8     10.2
8   -52  8.7  2.2     10.5    10.42      9.4      9.8      9.4
9   -57 10.1  5.0     13.4    10.29     11.4      5.8     12.0
10  -58 10.5  5.5     14.8     9.45     13.9      5.6     12.0
# ... with 3990 more draws, and 3 more variables
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}

To convert an existing draws object to a different format use the posterior::as_draws_*() functions.

To manipulate the draws objects use the various methods described in the posterior package vignettes and documentation.

Structured draws similar to rstan::extract()

The posterior package’s rvar format provides a multidimensional, sample-based representation of random variables. See https://mc-stan.org/posterior/articles/rvar.html for details. In addition to being useful in its own right, this format also allows CmdStanR users to obtain draws in a similar format to rstan::extract().

Suppose we have a parameter matrix[2,3] x. The rvar format lets you interact with x as if it’s a 2 x 3 matrix and automatically applies operations over the many posterior draws of x. To instead directly access the draws of x while maintaining the structure of the matrix use posterior::draws_of(). For example:

draws <- posterior::as_draws_rvars(fit$draws())
x_rvar <- draws$x
x_array <- posterior::draws_of(draws$x)

The object x_rvar will be an rvar that can be used like a 2 x 3 matrix, with the draws handled behind the scenes. The object x_array will be a 4000 x 2 x 3 array (assuming 4000 posterior draws), which is the same as it would be after being extracted from the list returned by rstan::extract().