Summary statistics

We can easily customize the summary statistics reported by $summary() and $print().

fit <- cmdstanr::cmdstanr_example("schools", method = "sample")
Warning: 145 of 4000 (4.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.
fit$summary()
   variable       mean     median       sd      mad           q5       q95
1      lp__ -58.189615 -58.233950 4.983507 5.199033 -66.09672000 -49.48569
2        mu   6.770868   6.720615 4.160694 4.104430   0.05392642  13.50031
3       tau   5.360230   4.524270 3.536662 3.128657   1.41195350  12.01772
4  theta[1]   9.538150   8.703025 6.992732 6.139536  -0.46183645  21.76395
5  theta[2]   7.037750   7.084940 5.672420 5.463211  -1.89471350  16.19477
6  theta[3]   5.865357   5.940265 6.567071 5.598943  -4.94042600  15.87788
7  theta[4]   7.096898   7.021110 5.846662 5.507303  -2.11556450  16.63653
8  theta[5]   4.905340   5.104415 5.567945 5.102183  -4.69825650  13.42720
9  theta[6]   5.743860   5.979900 5.965230 5.476702  -4.29255000  15.06765
10 theta[7]   9.332145   8.785955 5.911919 5.517199   0.73314065  20.11738
11 theta[8]   7.283320   7.100935 6.510593 5.937568  -3.18226350  18.23657
       rhat  ess_bulk  ess_tail
1  1.029985  154.8234  166.9326
2  1.004450  669.4002 1293.6704
3  1.029747  147.4551  128.0359
4  1.006213 1068.8761 2337.7828
5  1.002627 1067.0310 2241.5417
6  1.004902 1217.1050 1912.7529
7  1.002033 1030.8586 1665.2932
8  1.009740  637.5400 1725.4378
9  1.003866 1129.6591 2064.6751
10 1.003642  989.5674 1661.8675
11 1.005570 1163.3998 1970.5629

By default all variables are summaries with the follow functions:

[1] "mean"      "median"    "sd"        "mad"       "quantile2"

To change the variables summarized, we use the variables argument

fit$summary(variables = c("mu", "tau"))
  variable     mean   median       sd      mad         q5      q95     rhat
1       mu 6.770868 6.720615 4.160694 4.104430 0.05392642 13.50031 1.004450
2      tau 5.360230 4.524270 3.536662 3.128657 1.41195350 12.01772 1.029747
  ess_bulk  ess_tail
1 669.4002 1293.6704
2 147.4551  128.0359

We can additionally change which functions are used

fit$summary(variables = c("mu", "tau"), mean, sd)
  variable     mean       sd
1       mu 6.770868 4.160694
2      tau 5.360230 3.536662

To summarize all variables with non-default functions, it is necessary to set explicitly set the variables argument, either to NULL or the full vector of variable names.

fit$metadata()$model_params
 [1] "lp__"     "mu"       "tau"      "theta[1]" "theta[2]" "theta[3]"
 [7] "theta[4]" "theta[5]" "theta[6]" "theta[7]" "theta[8]"
fit$summary(variables = NULL, "mean", "median")
   variable       mean     median
1      lp__ -58.189615 -58.233950
2        mu   6.770868   6.720615
3       tau   5.360230   4.524270
4  theta[1]   9.538150   8.703025
5  theta[2]   7.037750   7.084940
6  theta[3]   5.865357   5.940265
7  theta[4]   7.096898   7.021110
8  theta[5]   4.905340   5.104415
9  theta[6]   5.743860   5.979900
10 theta[7]   9.332145   8.785955
11 theta[8]   7.283320   7.100935

Summary functions can be specified by character string, function, or using a formula (or anything else supported by rlang::as_function()). If these arguments are named, those names will be used in the tibble output. If the summary results are named they will take precedence.

my_sd <- function(x) c(My_SD = sd(x))
fit$summary(
  c("mu", "tau"), 
  MEAN = mean, 
  "median",
  my_sd,
  ~quantile(.x, probs = c(0.1, 0.9)),
  Minimum = function(x) min(x)
)        
  variable     MEAN   median    My_SD      10%      90%   Minimum
1       mu 6.770868 6.720615 4.160694 1.577812 12.27689 -7.905940
2      tau 5.360230 4.524270 3.536662 1.751609 10.20606  0.836362

Arguments to all summary functions can also be specified with .args.

fit$summary(c("mu", "tau"), quantile, .args = list(probs = c(0.025, .05, .95, .975)))
  variable      2.5%         5%      95%    97.5%
1       mu -1.320617 0.05392642 13.50031 14.99084
2      tau  1.176680 1.41195350 12.01772 13.99329

The summary functions are applied to the array of sample values, with dimension iter_samplingxchains.

fit$summary(variables = NULL, dim, colMeans)
   variable dim.1 dim.2          1          2          3          4
1      lp__  1000     4 -59.512825 -57.501953 -58.519416 -57.224267
2        mu  1000     4   6.687580   7.323996   6.485391   6.586506
3       tau  1000     4   6.113202   4.991316   5.617574   4.718827
4  theta[1]  1000     4   9.845526  10.063756   9.480842   8.762475
5  theta[2]  1000     4   7.109487   7.340212   6.950864   6.750437
6  theta[3]  1000     4   5.614556   6.465821   5.368069   6.012981
7  theta[4]  1000     4   7.095362   7.575608   6.862193   6.854430
8  theta[5]  1000     4   4.335314   5.915169   4.347114   5.023764
9  theta[6]  1000     4   5.495814   6.275650   5.270746   5.933231
10 theta[7]  1000     4   9.739179   9.685921   9.144607   8.758873
11 theta[8]  1000     4   7.154340   7.740947   7.134879   7.103115

For this reason users may have unexpected results if they use stats::var() directly, as it will return a covariance matrix. An alternative is the distributional::variance() function, which can also be accessed via posterior::variance().

fit$summary(c("mu", "tau"), posterior::variance, ~var(as.vector(.x)))
  variable posterior::variance ~var(as.vector(.x))
1       mu            17.31137            17.31137
2      tau            12.50798            12.50798

Summary functions need not be numeric, but these won’t work with $print().

strict_pos <- function(x) if (all(x > 0)) "yes" else "no"
fit$summary(variables = NULL, "Strictly Positive" = strict_pos)
   variable Strictly Positive
1      lp__                no
2        mu                no
3       tau               yes
4  theta[1]                no
5  theta[2]                no
6  theta[3]                no
7  theta[4]                no
8  theta[5]                no
9  theta[6]                no
10 theta[7]                no
11 theta[8]                no
# fit$print(variables = NULL, "Strictly Positive" = strict_pos)

For more information, see posterior::summarise_draws(), which is called by $summary().

Extracting posterior draws/samples

The $draws() method can be used to extract the posterior draws in formats provided by the posterior package. Here we demonstrate only the draws_array and draws_df formats, but the posterior package supports other useful formats as well.

# default is a 3-D draws_array object from the posterior package
# iterations x chains x variables
draws_arr <- fit$draws() # or format="array"
str(draws_arr)
 'draws_array' num [1:1000, 1:4, 1:11] -60.3 -61 -59.5 -58.6 -64.2 ...
 - attr(*, "dimnames")=List of 3
  ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
  ..$ chain    : chr [1:4] "1" "2" "3" "4"
  ..$ variable : chr [1:11] "lp__" "mu" "tau" "theta[1]" ...
# draws x variables data frame
draws_df <- fit$draws(format = "df")
str(draws_df)
draws_df [4,000 × 14] (S3: draws_df/draws/tbl_df/tbl/data.frame)
 $ lp__      : num [1:4000] -60.3 -61 -59.5 -58.6 -64.2 ...
 $ mu        : num [1:4000] 13.44 3.8 5.99 5.51 10.76 ...
 $ tau       : num [1:4000] 5.55 7.85 6.33 5.48 8.88 ...
 $ theta[1]  : num [1:4000] 18.07 3.52 12.12 4.46 13.9 ...
 $ theta[2]  : num [1:4000] 16.476 -0.445 4.075 3.544 4.102 ...
 $ theta[3]  : num [1:4000] 11.7279 5.8634 -0.0525 -0.6551 3.4664 ...
 $ theta[4]  : num [1:4000] 14.84 3.94 -1.12 6.6 -2.9 ...
 $ theta[5]  : num [1:4000] 12.108 0.652 -2.931 7.275 -6.486 ...
 $ theta[6]  : num [1:4000] 7.16 9.55 9.69 3.59 6.77 ...
 $ theta[7]  : num [1:4000] 11.31 11.43 4.64 11.3 5.55 ...
 $ theta[8]  : num [1:4000] 24.13 -7.81 9.13 -3.8 20.05 ...
 $ .chain    : int [1:4000] 1 1 1 1 1 1 1 1 1 1 ...
 $ .iteration: int [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
 $ .draw     : int [1:4000] 1 2 3 4 5 6 7 8 9 10 ...
print(draws_df)
# A draws_df: 1000 iterations, 4 chains, and 11 variables
   lp__   mu  tau theta[1] theta[2] theta[3] theta[4] theta[5]
1   -60 13.4  5.6     18.1    16.48   11.728     14.8    12.11
2   -61  3.8  7.9      3.5    -0.45    5.863      3.9     0.65
3   -59  6.0  6.3     12.1     4.08   -0.053     -1.1    -2.93
4   -59  5.5  5.5      4.5     3.54   -0.655      6.6     7.28
5   -64 10.8  8.9     13.9     4.10    3.466     -2.9    -6.49
6   -61  6.6 11.3     19.3    10.54    6.469      5.6    -3.64
7   -49  7.2  2.1      5.6     6.76    6.771      7.5     7.51
8   -53  6.8  1.5      6.1     6.97    7.150      3.8     8.49
9   -53  6.8  1.5      6.1     6.97    7.150      3.8     8.49
10  -56  6.5  3.8      4.8     3.31    9.891      8.5     4.73
# ... with 3990 more draws, and 3 more variables
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}

To convert an existing draws object to a different format use the posterior::as_draws_*() functions.

To manipulate the draws objects use the various methods described in the posterior package vignettes and documentation.

Structured draws similar to rstan::extract()

The posterior package’s rvar format provides a multidimensional, sample-based representation of random variables. See https://mc-stan.org/posterior/articles/rvar.html for details. In addition to being useful in its own right, this format also allows CmdStanR users to obtain draws in a similar format to rstan::extract().

Suppose we have a parameter matrix[2,3] x. The rvar format lets you interact with x as if it’s a 2 x 3 matrix and automatically applies operations over the many posterior draws of x. To instead directly access the draws of x while maintaining the structure of the matrix use posterior::draws_of(). For example:

draws <- posterior::as_draws_rvars(fit$draws())
x_rvar <- draws$x
x_array <- posterior::draws_of(draws$x)

The object x_rvar will be an rvar that can be used like a 2 x 3 matrix, with the draws handled behind the scenes. The object x_array will be a 4000 x 2 x 3 array (assuming 4000 posterior draws), which is the same as it would be after being extracted from the list returned by rstan::extract().