Introduction

This vignette is intended to be read after the "Getting started with CmdStanR" vignette; please read that one first for important background. Here we provide additional details about compiling models, passing in data, and how CmdStan output is saved and read back into R.

We will only use the $sample() method in examples, but all model fitting methods work in a similar way under the hood.

library(cmdstanr)

Compilation

Immediate compilation

The cmdstan_model() function creates a new CmdStanModel object. The CmdStanModel object stores the path to a Stan program as well as the path to a compiled executable:

stan_file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(stan_file)
Compiling Stan program...
mod$print()
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);  // uniform prior on interval 0,1
  y ~ bernoulli(theta);
}
mod$stan_file()
[1] "/Users/jgabry/.cmdstanr/cmdstan-2.25.0/examples/bernoulli/bernoulli.stan"
mod$exe_file()
[1] "/Users/jgabry/.cmdstanr/cmdstan-2.25.0/examples/bernoulli/bernoulli"

Subsequently, if you create a CmdStanModel object from the same Stan file then compilation will be skipped (assuming the file hasn’t changed):

mod <- cmdstan_model(stan_file)
Model executable is up to date!

Internally, cmdstan_model() first creates the CmdStanModel object from just the Stan file and then calls its $compile() method. Optional arguments to the $compile() method can be passed via ..., for example:

mod <- cmdstan_model(
  stan_file,
  force_recompile = TRUE,
  include_paths = "paths/to/directories/with/included/files",
  cpp_options = list(stan_threads = TRUE, STANC2 = TRUE)
)

Delayed compilation

It is also possible to delay compilation when creating the CmdStanModel object by specifying compile=FALSE. You can later call the $compile() method directly:

unlink(mod$exe_file())
mod <- cmdstan_model(stan_file, compile = FALSE)
mod$exe_file() # not yet created
character(0)
mod$compile()
Compiling Stan program...
mod$exe_file()
[1] "/Users/jgabry/.cmdstanr/cmdstan-2.25.0/examples/bernoulli/bernoulli"

Pedantic check

If you are using CmdStan version 2.24 or later and CmdStanR version 0.2.1 or later, you can run a pedantic check for your model. CmdStanR always checks that your Stan program does not contain invalid syntax, but with pedantic mode enabled the check also warns about other potential issues in your model, for example:

  • Distribution usage issues: distribution arguments do not match the distribution specification, or some specific distribution is used in an inadvisable way.
  • Unused parameter: a parameter is defined but does not contribute to target.
  • Large or small constant in a distribution: very large or very small constants are used as distribution arguments.
  • Control flow depends on a parameter: branching control flow (like if/else) depends on a parameter value.
  • Parameter has multiple twiddles: a parameter is on the left-hand side of multiple twiddles (i.e., multiple ~ symbols).
  • Parameter has zero or multiple priors: a parameter has zero or more than one prior distribution.
  • Variable is used before assignment: a variable is used before being assigned a value.
  • Strict or nonsensical parameter bounds: a parameter is given questionable bounds.

Pedantic mode is available when compiling the model or when using the separate $check_syntax() method of a CmdStanModel object. Internally this corresponds to setting the stanc (Stan transpiler) option warn-pedantic. Here we demonstrate pedantic mode with a Stan program that is syntactically correct but is missing a lower bound and a prior for a parameter.

stan_file_pedantic <- write_stan_file("
data {
  int N;
  int y[N];
}
parameters {
  // should have <lower=0> but omitting to demonstrate pedantic mode
  real lambda;
}
model {
  y ~ poisson(lambda);
}
")

To turn on pedantic mode at compile time you can set pedantic=TRUE in the call to cmdstan_model() (or when calling the $compile() method directly if using the delayed compilation approach described above).

mod_pedantic <- cmdstan_model(stan_file_pedantic, pedantic = TRUE)
Compiling Stan program...
Warning:
  The parameter lambda has no priors.
Warning at '/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp6z0m2V/model-14ba25f3a526f.stan', line 11, column 14 to column 20:
  A poisson distribution is given parameter lambda as a rate parameter
  (argument 1), but lambda was not constrained to be strictly positive.

To turn on pedantic mode separately from compilation use the pedantic argument to the $check_syntax() method.

mod_pedantic$check_syntax(pedantic = TRUE) 
Warning:
  The parameter lambda has no priors.
Warning at '/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp6z0m2V/file14ba25b2cde39.stan', line 11, column 14 to column 20:
  A poisson distribution is given parameter lambda as a rate parameter
  (argument 1), but lambda was not constrained to be strictly positive.
Stan program is syntactically correct

Using pedantic=TRUE via the $check_syntax() method also has the advantage that it can be used even if the model hasn’t been compiled yet. This can be helpful because the pedantic and syntax checks themselves are much faster than compilation.

file.remove(mod_pedantic$exe_file()) # delete compiled executable
[1] TRUE
rm(mod_pedantic)

mod_pedantic <- cmdstan_model(stan_file_pedantic, compile = FALSE)
mod_pedantic$check_syntax(pedantic = TRUE)
Warning:
  The parameter lambda has no priors.
Warning at '/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp6z0m2V/file14ba25b2cde39.stan', line 11, column 14 to column 20:
  A poisson distribution is given parameter lambda as a rate parameter
  (argument 1), but lambda was not constrained to be strictly positive.
Stan program is syntactically correct
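
As a minimal sketch of how these checks might fit into a workflow, the variant below adds the lower bound and a prior that the warnings above point to, runs the pedantic check, and compiles only if the check succeeds. The exponential(1) prior and the names stan_file_fixed and mod_fixed are illustrative assumptions, not part of the original example. The sketch also assumes that $check_syntax() invisibly returns TRUE on success and raises an error on invalid syntax. The Stan program keeps the array declaration style used above; more recent Stan releases use the array[N] int y; form instead.

```r
library(cmdstanr)

# Same Poisson model as above, with a lower bound and a prior added.
# The exponential(1) prior is an arbitrary illustrative choice.
stan_file_fixed <- write_stan_file("
data {
  int<lower=0> N;
  int<lower=0> y[N];
}
parameters {
  real<lower=0> lambda;
}
model {
  lambda ~ exponential(1);
  y ~ poisson(lambda);
}
")

mod_fixed <- cmdstan_model(stan_file_fixed, compile = FALSE)

# Run the fast checks first; treat an error from the check as a failure
ok <- tryCatch(
  isTRUE(mod_fixed$check_syntax(pedantic = TRUE)),
  error = function(e) FALSE
)

# Only pay the cost of compilation if the checks passed
if (ok) {
  mod_fixed$compile()
}
```

Checking before compiling keeps the feedback loop short while a model is still being edited.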

Executable location

By default, the executable is created in the same directory as the file containing the Stan program. You can also specify a different location with the dir argument:

mod <- cmdstan_model(stan_file, dir = "path/to/directory/for/executable")
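
A minimal sketch of using the dir argument follows. The directory name is hypothetical, and the sketch assumes the directory should already exist when the model is compiled, so it creates it first.

```r
library(cmdstanr)

# Hypothetical location for executables; created up front for the sketch
exe_dir <- file.path(tempdir(), "stan-executables")
dir.create(exe_dir, showWarnings = FALSE)

stan_file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(stan_file, dir = exe_dir)

# The executable should now be under exe_dir rather than next to the Stan file
mod$exe_file()
```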

Processing data

There are three data formats that CmdStanR allows when fitting a model:

  • named list of R objects
  • JSON file
  • R dump file

Named list of R objects

Like the RStan interface, CmdStanR accepts a named list of R objects where the names correspond to variables declared in the data block of the Stan program. In the Bernoulli model the data are N, the number of data points, and y, an integer array of observations.

mod$print()
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);  // uniform prior on interval 0,1
  y ~ bernoulli(theta);
}
# data block has 'N' and 'y'
data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))
fit <- mod$sample(data = data_list)

Because CmdStan doesn’t accept lists of R objects, CmdStanR will first write the data to a temporary JSON file using write_stan_json(). This happens internally, but it is also possible to call write_stan_json() directly:

data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))
json_file <- tempfile(fileext = ".json")
write_stan_json(data_list, json_file)
cat(readLines(json_file), sep = "\n")
{
  "N": 10,
  "y": [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
}
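
One way to confirm that the JSON file holds what you intended is to read it back and compare it with the original list. The sketch below uses jsonlite::fromJSON() for this, which is an assumption of the example and not part of CmdStanR's interface; round_trip is an illustrative name.

```r
library(cmdstanr)

data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))
json_file <- tempfile(fileext = ".json")
write_stan_json(data_list, json_file)

# Read the file back with jsonlite and compare it with the original list
round_trip <- jsonlite::fromJSON(json_file)
stopifnot(
  setequal(names(round_trip), names(data_list)),
  round_trip$N == data_list$N,
  all(round_trip$y == data_list$y)
)
```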

JSON file

If you already have your data in a JSON file you can just pass that file directly to CmdStanR instead of using a list of R objects. For example, we could pass in the JSON file we created above using write_stan_json():

fit <- mod$sample(data = json_file)

R dump file

Finally, it is also possible to use the R dump file format. This is not recommended because CmdStan can process JSON faster than R dump, but CmdStanR allows it because CmdStan will accept files created by rstan::stan_rdump():

rdump_file <- tempfile(fileext = ".data.R")
rstan::stan_rdump(names(data_list), file = rdump_file, envir = list2env(data_list))
cat(readLines(rdump_file), sep = "\n")
fit <- mod$sample(data = rdump_file)

Writing CmdStan output to CSV

Default temporary files

data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))
fit <- mod$sample(data = data_list)

When fitting a model, the default behavior is to write the output from CmdStan to CSV files in a temporary directory:

fit$output_files()
[1] "/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp6z0m2V/bernoulli-202011241551-1-9165a8.csv"
[2] "/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp6z0m2V/bernoulli-202011241551-2-9165a8.csv"
[3] "/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp6z0m2V/bernoulli-202011241551-3-9165a8.csv"
[4] "/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp6z0m2V/bernoulli-202011241551-4-9165a8.csv"

These files will be lost if you end your R session or if you remove the fit object and force (or wait for) garbage collection.

files <- fit$output_files()
file.exists(files)
[1] TRUE TRUE TRUE TRUE
rm(fit)
gc()
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  856930 45.8    1341254 71.7         NA  1341254 71.7
Vcells 1593242 12.2    8388608 64.0      16384  2363425 18.1
file.exists(files)
[1] FALSE FALSE FALSE FALSE

Non-temporary files

To save these files to a non-temporary location there are two options. You can either specify the output_dir argument to mod$sample() or use fit$save_output_files() after fitting the model:

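# option 1: write the CSV files to a chosen directory when fitting
fit <- mod$sample(
  data = data_list,
  output_dir = "path/to/directory"
)

# option 2: save the CSV files after fitting
# see ?save_output_files for info on optional arguments
fit$save_output_files(dir = "path/to/directory")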
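
The following sketch repeats the earlier garbage collection experiment with output_dir set. The directory name is hypothetical, and it sits under tempdir() only so that the sketch is safe to run; in practice you would use a permanent path.

```r
library(cmdstanr)

stan_file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(stan_file)
data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))

# Hypothetical output location; created up front for the sketch
out_dir <- file.path(tempdir(), "bernoulli-csv")
dir.create(out_dir, showWarnings = FALSE)

fit <- mod$sample(data = data_list, output_dir = out_dir)
files <- fit$output_files()

# Unlike the default temporary files, these should survive removing the fit
rm(fit)
invisible(gc())
file.exists(files)
```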

Reading CmdStan output into R

Lazy CSV reading

With the exception of some diagnostic information, the CSV files are not read into R until their contents are requested by calling a method that requires them (e.g., fit$draws(), fit$summary(), etc.). If we examine the structure of the fit object, we see that the draws_ slot in Private is NULL, indicating that the CSV files haven’t yet been read into R:

str(fit)
Classes 'CmdStanMCMC', 'CmdStanFit', 'R6' <CmdStanMCMC>
  Inherits from: <CmdStanFit>
  Public:
    clone: function (deep = FALSE) 
    cmdstan_diagnose: function (...) 
    cmdstan_summary: function (...) 
    data_file: function () 
    draws: function (variables = NULL, inc_warmup = FALSE) 
    init: function () 
    initialize: function (runset) 
    inv_metric: function (matrix = TRUE) 
    latent_dynamics_files: function (include_failed = FALSE) 
    lp: function () 
    metadata: function () 
    num_chains: function () 
    num_procs: function () 
    output: function (id = NULL) 
    output_files: function (include_failed = FALSE) 
    print: function (variables = NULL, ..., digits = 2, max_rows = 10) 
    return_codes: function () 
    runset: CmdStanRun, R6
    sampler_diagnostics: function (inc_warmup = FALSE) 
    save_data_file: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE) 
    save_latent_dynamics_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE) 
    save_object: function (file, ...) 
    save_output_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE) 
    summary: function (variables = NULL, ...) 
    time: function () 
  Private:
    draws_: NULL
    init_: NULL
    inv_metric_: NULL
    metadata_: list
    read_csv_: function (variables = NULL, sampler_diagnostics = NULL) 
    sampler_diagnostics_: NULL
    warmup_draws_: NULL
    warmup_sampler_diagnostics_: NULL 

After we call a method that requires the draws, reexamining the structure of the object shows that the draws_ slot in Private is no longer empty:

draws <- fit$draws() # force CSVs to be read into R
str(fit)
Classes 'CmdStanMCMC', 'CmdStanFit', 'R6' <CmdStanMCMC>
  Inherits from: <CmdStanFit>
  Public:
    clone: function (deep = FALSE) 
    cmdstan_diagnose: function (...) 
    cmdstan_summary: function (...) 
    data_file: function () 
    draws: function (variables = NULL, inc_warmup = FALSE) 
    init: function () 
    initialize: function (runset) 
    inv_metric: function (matrix = TRUE) 
    latent_dynamics_files: function (include_failed = FALSE) 
    lp: function () 
    metadata: function () 
    num_chains: function () 
    num_procs: function () 
    output: function (id = NULL) 
    output_files: function (include_failed = FALSE) 
    print: function (variables = NULL, ..., digits = 2, max_rows = 10) 
    return_codes: function () 
    runset: CmdStanRun, R6
    sampler_diagnostics: function (inc_warmup = FALSE) 
    save_data_file: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE) 
    save_latent_dynamics_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE) 
    save_object: function (file, ...) 
    save_output_files: function (dir = ".", basename = NULL, timestamp = TRUE, random = TRUE) 
    summary: function (variables = NULL, ...) 
    time: function () 
  Private:
    draws_: -6.78913 -6.76301 -6.76814 -7.29383 -7.03635 -7.40541 -8 ...
    init_: NULL
    inv_metric_: list
    metadata_: list
    read_csv_: function (variables = NULL, sampler_diagnostics = NULL) 
    sampler_diagnostics_: NULL
    warmup_draws_: NULL
    warmup_sampler_diagnostics_: NULL 

For models with many parameters, transformed parameters, or generated quantities, if only some are requested (e.g., by specifying the variables argument to fit$draws()) then CmdStanR will only read in the requested variables (unless they have already been read in).
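
A minimal sketch of requesting a subset of variables is shown below. It relies only on the variables argument of fit$draws() that appears in the structure printed above; theta_draws and all_draws are illustrative names.

```r
library(cmdstanr)

stan_file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(stan_file)
data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))
fit <- mod$sample(data = data_list)

# Request only theta; other variables do not need to be read for this call
theta_draws <- fit$draws(variables = "theta")
dim(theta_draws)  # iterations x chains x variables

# A later call without 'variables' returns all variables
all_draws <- fit$draws()
dimnames(all_draws)$variable
```

For a model this small the saving is negligible, but the same pattern applies to models with thousands of variables.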

read_cmdstan_csv()

Internally, the read_cmdstan_csv() function is used to read the CmdStan CSV files into R. This function is exposed to users, so you can also call it directly:

# see ?read_cmdstan_csv for info on optional arguments controlling 
# what information is read in
csv_contents <- read_cmdstan_csv(fit$output_files())
str(csv_contents)
List of 7
 $ metadata                       :List of 33
  ..$ stan_version_major  : num 2
  ..$ stan_version_minor  : num 25
  ..$ stan_version_patch  : num 0
  ..$ method              : chr "sample"
  ..$ save_warmup         : num 0
  ..$ thin                : num 1
  ..$ gamma               : num 0.05
  ..$ kappa               : num 0.75
  ..$ t0                  : num 10
  ..$ init_buffer         : num 75
  ..$ term_buffer         : num 50
  ..$ window              : num 25
  ..$ algorithm           : chr "hmc"
  ..$ engine              : chr "nuts"
  ..$ metric              : chr "diag_e"
  ..$ stepsize_jitter     : num 0
  ..$ id                  : num [1:4] 1 2 3 4
  ..$ init                : num [1:4] 2 2 2 2
  ..$ seed                : num [1:4] 2.10e+08 1.69e+09 7.13e+08 7.69e+08
  ..$ refresh             : num 100
  ..$ sig_figs            : num -1
  ..$ sampler_diagnostics : chr [1:6] "accept_stat__" "stepsize__" "treedepth__" "n_leapfrog__" ...
  ..$ model_params        : chr [1:2] "lp__" "theta"
  ..$ step_size_adaptation: num [1:4] 0.859 0.921 1.049 1.023
  ..$ model_name          : chr "bernoulli_model"
  ..$ adapt_engaged       : num 1
  ..$ adapt_delta         : num 0.8
  ..$ max_treedepth       : num 10
  ..$ step_size           : num [1:4] 1 1 1 1
  ..$ iter_warmup         : num 1000
  ..$ iter_sampling       : num 1000
  ..$ stan_variable_dims  :List of 2
  .. ..$ lp__ : num 1
  .. ..$ theta: num 1
  ..$ stan_variables      : chr [1:2] "lp__" "theta"
 $ inv_metric                     :List of 4
  ..$ 1: num 0.542
  ..$ 2: num 0.525
  ..$ 3: num 0.444
  ..$ 4: num 0.52
 $ step_size                      :List of 4
  ..$ 1: num 0.859
  ..$ 2: num 0.921
  ..$ 3: num 1.05
  ..$ 4: num 1.02
 $ warmup_draws                   : NULL
 $ post_warmup_draws              : 'draws_array' num [1:1000, 1:4, 1:2] -6.79 -6.76 -6.77 -7.29 -7.04 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
  .. ..$ chain    : chr [1:4] "1" "2" "3" "4"
  .. ..$ variable : chr [1:2] "lp__" "theta"
 $ warmup_sampler_diagnostics     : NULL
 $ post_warmup_sampler_diagnostics: 'draws_array' num [1:1000, 1:4, 1:6] 0.994 1 0.995 0.923 1 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
  .. ..$ chain    : chr [1:4] "1" "2" "3" "4"
  .. ..$ variable : chr [1:6] "accept_stat__" "stepsize__" "treedepth__" "n_leapfrog__" ...
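
When only part of the output is needed, the variables and sampler_diagnostics arguments can restrict what is read. The sketch below is one way to use them; the particular selections and the name csv_subset are illustrative.

```r
library(cmdstanr)

stan_file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(stan_file)
data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))
fit <- mod$sample(data = data_list)

# Read only theta and two of the sampler diagnostics from the CSV files
csv_subset <- read_cmdstan_csv(
  fit$output_files(),
  variables = "theta",
  sampler_diagnostics = c("treedepth__", "divergent__")
)

dimnames(csv_subset$post_warmup_draws)$variable
dimnames(csv_subset$post_warmup_sampler_diagnostics)$variable
```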

Saving and accessing advanced algorithm info (latent dynamics)

If save_latent_dynamics is set to TRUE when running the $sample() method then additional CSV files are created (one per chain) that provide access to quantities used under the hood by Stan’s implementation of dynamic Hamiltonian Monte Carlo.

CmdStanR does not yet provide a special method for processing these files but they can be read into R using R’s standard CSV reading functions:

fit <- mod$sample(data = data_list, save_latent_dynamics = TRUE)
fit$latent_dynamics_files()
[1] "/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp6z0m2V/bernoulli-diagnostic-202011241551-1-55cc49.csv"
[2] "/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp6z0m2V/bernoulli-diagnostic-202011241551-2-55cc49.csv"
[3] "/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp6z0m2V/bernoulli-diagnostic-202011241551-3-55cc49.csv"
[4] "/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp6z0m2V/bernoulli-diagnostic-202011241551-4-55cc49.csv"
# read one of the files in
x <- utils::read.csv(fit$latent_dynamics_files()[1], comment.char = "#")
head(x)
      lp__ accept_stat__ stepsize__ treedepth__ n_leapfrog__ divergent__
1 -6.77163      1.000000    1.01917           1            1           0
2 -7.08782      0.901299    1.01917           1            3           0
3 -6.82835      0.998086    1.01917           1            3           0
4 -7.17254      0.961877    1.01917           2            3           0
5 -6.79864      0.663324    1.01917           2            3           0
6 -9.34056      0.564600    1.01917           2            3           0
  energy__     theta   p_theta   g_theta
1  6.78126 -0.955446 -0.195418  0.333491
2  7.26475 -1.677310  0.837281 -1.110570
3  7.02601 -1.372220  0.884978 -0.572866
4  7.23579 -0.510811 -0.500586  1.500040
5  8.68073 -0.890040  2.730830  0.493219
6  9.36160 -2.899070  0.288769 -2.373610

The column lp__ is also provided via fit$draws(), and the columns accept_stat__, stepsize__, treedepth__, n_leapfrog__, divergent__, and energy__ are also provided by fit$sampler_diagnostics(), but there are several columns unique to the latent dynamics file:

head(x[, c("theta", "p_theta", "g_theta")])
      theta   p_theta   g_theta
1 -0.955446 -0.195418  0.333491
2 -1.677310  0.837281 -1.110570
3 -1.372220  0.884978 -0.572866
4 -0.510811 -0.500586  1.500040
5 -0.890040  2.730830  0.493219
6 -2.899070  0.288769 -2.373610

Our model has a single parameter theta and the three columns above correspond to theta in the unconstrained space (theta on the constrained space is accessed via fit$draws()), the auxiliary momentum p_theta, and the gradient g_theta. In general, each of these three columns will exist for every parameter in the model.
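
Since there is one latent dynamics file per chain, it can be convenient to stack them into a single data frame with a chain index. The helper below is a sketch in base R; read_chain_csvs is a hypothetical name, not a CmdStanR function, and the two toy files contain made-up values so that the sketch runs without fitting a model.

```r
# Hypothetical helper: read several CmdStan-style CSV files (which may
# contain '#' comment lines) and stack them with a chain index column
read_chain_csvs <- function(files) {
  per_chain <- lapply(seq_along(files), function(i) {
    x <- utils::read.csv(files[i], comment.char = "#")
    cbind(chain = i, x)
  })
  do.call(rbind, per_chain)
}

# Toy stand-in files with made-up values, mimicking the layout shown above
toy_files <- replicate(2, tempfile(fileext = ".csv"))
for (f in toy_files) {
  writeLines(
    c(
      "# comment line before the header",
      "lp__,theta,p_theta,g_theta",
      "-6.8,-0.9,-0.2,0.3",
      "# comment line between rows",
      "-7.1,-1.6,0.8,-1.1"
    ),
    f
  )
}

all_chains <- read_chain_csvs(toy_files)
all_chains
```

With a real fit, the same helper could be applied as read_chain_csvs(fit$latent_dynamics_files()).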

Saving fitted model objects

As described above, the contents of the CSV files are only read into R when they are needed. This means that to save a fitted model object containing all of the posterior draws and sampler diagnostics you should do one of two things: call fit$draws() and fit$sampler_diagnostics() before saving the object fit, or use the special $save_object() method provided by CmdStanR, which ensures that everything has been read into R before saving the object using saveRDS():

temp_rds_file <- tempfile(fileext = ".RDS") # temporary file just for demonstration
fit$save_object(file = temp_rds_file)

We can check that this worked by removing fit and loading it back in from the save file:

rm(fit); gc()
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  873638 46.7    1654670 88.4         NA  1341254 71.7
Vcells 1696076 13.0    8388608 64.0      16384  3066776 23.4
fit <- readRDS(temp_rds_file)
fit$summary()
# A tibble: 2 x 10
  variable   mean median    sd   mad      q5    q95  rhat ess_bulk ess_tail
  <chr>     <dbl>  <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 lp__     -7.26  -6.99  0.761 0.325 -8.73   -6.75   1.00    1963.    2302.
2 theta     0.247  0.235 0.118 0.120  0.0819  0.465  1.00    1587.    2049.
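
The manual alternative mentioned at the start of this section can be sketched as follows. The names manual_rds_file and fit_restored are illustrative, and the sketch assumes that draws and sampler diagnostics are the only contents you need after restoring the object.

```r
library(cmdstanr)

stan_file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(stan_file)
data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))
fit <- mod$sample(data = data_list)

# Force the draws and sampler diagnostics to be read from the CSV files
invisible(fit$draws())
invisible(fit$sampler_diagnostics())

# Then save with base R; the temporary file is just for demonstration
manual_rds_file <- tempfile(fileext = ".RDS")
saveRDS(fit, file = manual_rds_file)

fit_restored <- readRDS(manual_rds_file)
fit_restored$summary()
```

Using $save_object() is less error-prone because it does not depend on remembering which methods to call first.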