## Introduction

CmdStanR is a lightweight interface to Stan for R users (see CmdStanPy for Python) that provides an alternative to the traditional RStan interface. See the Comparison with RStan section later in this vignette for more details on how the two inferfaces differ.

CmdStanR is unreleased and user facing functions are still subject to change, but the development version can be installed from GitHub:

devtools::install_github("stan-dev/cmdstanr")

CmdStanR (the cmdstanr R package) can now be loaded like any other R package:

library(cmdstanr)

## Installing CmdStan

CmdStanR requires a working installation of CmdStan, the shell interface to Stan. If you don’t have CmdStan installed you can install it yourself or use the install_cmdstan() function provided by CmdStanR.

install_cmdstan()

Before CmdStanR can be used it needs to know where the CmdStan installation is located. When the package is loaded it tries to help automate this to avoid having to manually set the path every session:

1. If the environment variable "CMDSTAN" exists at load time then its value will be automatically set as the default path to CmdStan for the R session. This is useful if your CmdStan installation is not located in the default directory that would have been used by install_cmdstan() (see #2).

2. If no environment variable is found when loaded but the directory ".cmdstanr/cmdstan" exists in the user’s home directory (not working directory) then it will be set as the path to CmdStan for the R session. This is the same as the default directory that install_cmdstan() uses to install the latest version of CmdStan, so if that’s how you installed CmdStan you shouldn’t need to manually set the path to CmdStan when loading CmdStanR.

If neither of these applies (or you want to subsequently change the path) you can use the set_cmdstan_path() function:

set_cmdstan_path(PATH_TO_CMDSTAN)

If you need to check or get the current path to CmdStan use the cmdstan_path() function.

To verify the version of CmdStan you are using, use the cmdstan_version() function.

cmdstan_version()
[1] "2.22.1"

## Running CmdStan from R

### Compilation

The cmdstan_model() function creates a new CmdStanModel object from a file containing a Stan program. Here we’ll use the example Stan program that comes with the CmdStan installation:

file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(file)
Compiling Stan program...

The object mod is an R6 reference object of class CmdStanModel and behaves similarly to to R’s reference class objects and those in object oriented programming languages. Methods are accessed using the $ operator. This design choice allows for CmdStanR and CmdStanPy to provide a similar user experience and share many implementation details. The Stan program can be printed using the $print() method:

mod$print() # print the Stan program data { int<lower=0> N; int<lower=0,upper=1> y[N]; } parameters { real<lower=0,upper=1> theta; } model { theta ~ beta(1,1); // uniform prior on interval 0,1 y ~ bernoulli(theta); } The path to the compiled executable is returned by the $exe_file() method:

mod$exe_file() [1] "/Users/jgabry/.cmdstanr/cmdstan/examples/bernoulli/bernoulli" ### Model fitting #### MCMC The $sample() method for CmdStanModel objects runs Stan’s default MCMC algorithm. Data can be passed in as a named list of R objects (like for RStan) or as a path to a data file compatible with CmdStan (JSON or R dump).

data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1))
fit <- mod$sample( data = data_list, seed = 123, num_chains = 2, num_cores = 2 ) Running MCMC with 2 chain(s) on 2 core(s)... Running ./bernoulli 'id=1' random 'seed=123' data \ 'file=/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp2DgYHm/standata-47ccd300471.json' \ output \ 'file=/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp2DgYHm/bernoulli-202003031420-1-739d7a.csv' \ 'method=sample' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt \ 'engaged=1' Running ./bernoulli 'id=2' random 'seed=124' data \ 'file=/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp2DgYHm/standata-47ccd300471.json' \ output \ 'file=/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp2DgYHm/bernoulli-202003031420-2-739d7a.csv' \ 'method=sample' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt \ 'engaged=1' Chain 1 Iteration: 1 / 2000 [ 0%] (Warmup) Chain 1 Iteration: 100 / 2000 [ 5%] (Warmup) Chain 1 Iteration: 200 / 2000 [ 10%] (Warmup) Chain 1 Iteration: 300 / 2000 [ 15%] (Warmup) Chain 1 Iteration: 400 / 2000 [ 20%] (Warmup) Chain 1 Iteration: 500 / 2000 [ 25%] (Warmup) Chain 1 Iteration: 600 / 2000 [ 30%] (Warmup) Chain 1 Iteration: 700 / 2000 [ 35%] (Warmup) Chain 1 Iteration: 800 / 2000 [ 40%] (Warmup) Chain 1 Iteration: 900 / 2000 [ 45%] (Warmup) Chain 1 Iteration: 1000 / 2000 [ 50%] (Warmup) Chain 1 Iteration: 1001 / 2000 [ 50%] (Sampling) Chain 1 Iteration: 1100 / 2000 [ 55%] (Sampling) Chain 1 Iteration: 1200 / 2000 [ 60%] (Sampling) Chain 1 Iteration: 1300 / 2000 [ 65%] (Sampling) Chain 1 Iteration: 1400 / 2000 [ 70%] (Sampling) Chain 1 Iteration: 1500 / 2000 [ 75%] (Sampling) Chain 1 Iteration: 1600 / 2000 [ 80%] (Sampling) Chain 1 Iteration: 1700 / 2000 [ 85%] (Sampling) Chain 1 Iteration: 1800 / 2000 [ 90%] (Sampling) Chain 1 Iteration: 1900 / 2000 [ 95%] (Sampling) Chain 1 Iteration: 2000 / 2000 [100%] (Sampling) Chain 2 Iteration: 1 / 2000 [ 0%] (Warmup) Chain 2 Iteration: 100 / 2000 [ 5%] (Warmup) Chain 2 Iteration: 200 / 2000 [ 10%] (Warmup) Chain 2 Iteration: 300 / 2000 [ 15%] (Warmup) Chain 2 Iteration: 400 / 2000 [ 20%] (Warmup) Chain 2 Iteration: 500 / 2000 [ 25%] (Warmup) Chain 2 Iteration: 600 / 2000 [ 30%] (Warmup) Chain 2 Iteration: 700 / 2000 [ 35%] (Warmup) Chain 2 Iteration: 800 / 2000 [ 40%] (Warmup) Chain 2 Iteration: 900 / 2000 [ 45%] (Warmup) Chain 2 Iteration: 1000 / 2000 [ 50%] (Warmup) Chain 2 Iteration: 1001 / 2000 [ 50%] (Sampling) Chain 2 Iteration: 1100 / 2000 [ 55%] (Sampling) Chain 2 Iteration: 1200 / 2000 [ 60%] (Sampling) Chain 2 Iteration: 1300 / 2000 [ 65%] (Sampling) Chain 2 Iteration: 1400 / 2000 [ 70%] (Sampling) Chain 2 Iteration: 1500 / 2000 [ 75%] (Sampling) Chain 2 Iteration: 1600 / 2000 [ 80%] (Sampling) Chain 2 Iteration: 1700 / 2000 [ 85%] (Sampling) Chain 2 Iteration: 1800 / 2000 [ 90%] (Sampling) Chain 2 Iteration: 1900 / 2000 [ 95%] (Sampling) Chain 2 Iteration: 2000 / 2000 [100%] (Sampling) Chain 1 finished in 0.2 seconds. Chain 2 finished in 0.2 seconds. Both chains finished succesfully. Mean chain execution time: 0.2 seconds. Total execution time: 0.3 seconds. The $sample() method creates R6 CmdStanMCMC objects.

The $draws() method can be used to extract the posterior draws as a 3-D array (iteration x chain x variable): # this is a draws_array object from the posterior package draws <- fit$draws()
str(draws)
 'draws_array' num [1:1000, 1:2, 1:2] -7.21 -6.88 -6.77 -6.87 -6.85 ...
- attr(*, "dimnames")=List of 3
..$iteration: chr [1:1000] "1" "2" "3" "4" ... ..$ chain    : chr [1:2] "1" "2"
..$variable : chr [1:2] "lp__" "theta" The $summary() method calls summarise_draws() from the posterior package:

fit$summary() # A tibble: 2 x 10 variable mean median sd mad q5 q95 rhat ess_bulk ess_tail <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 lp__ -7.31 -7.04 0.783 0.380 -8.88 -6.75 1.00 814. 1029. 2 theta 0.244 0.226 0.122 0.126 0.0778 0.467 1.00 684. 762. The $cmdstan_diagnose() and $cmdstan_summary() methods call CmdStan’s diagnose and stansummary utilities, respectively: fit$cmdstan_diagnose()
Running bin/diagnose \
/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp2DgYHm/bernoulli-202003031420-1-739d7a.csv \
/var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp2DgYHm/bernoulli-202003031420-2-739d7a.csv
Processing csv files: /var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp2DgYHm/bernoulli-202003031420-1-739d7a.csv, /var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp2DgYHm/bernoulli-202003031420-2-739d7a.csv

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory for all transitions.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.
fit$cmdstan_summary() Running bin/stansummary \ /var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp2DgYHm/bernoulli-202003031420-1-739d7a.csv \ /var/folders/h6/14xy_35x4wd2tz542dn0qhtc0000gn/T/Rtmp2DgYHm/bernoulli-202003031420-2-739d7a.csv Inference for Stan model: bernoulli_model 2 chains: each with iter=(1000,1000); warmup=(0,0); thin=(1,1); 2000 iterations saved. Warmup took (0.0078, 0.0083) seconds, 0.016 seconds total Sampling took (0.018, 0.019) seconds, 0.037 seconds total Mean MCSE StdDev 5% 50% 95% N_Eff N_Eff/s R_hat lp__ -7.3 3.1e-02 0.78 -8.9 -7.0 -6.8 6.5e+02 1.8e+04 1.0e+00 accept_stat__ 0.92 3.3e-03 0.12 0.68 0.97 1.0 1.3e+03 3.6e+04 1.0e+00 stepsize__ 0.95 1.3e-01 0.14 0.81 1.1 1.1 1.0e+00 2.7e+01 4.2e+13 treedepth__ 1.4 1.2e-02 0.48 1.0 1.0 2.0 1.7e+03 4.5e+04 1.0e+00 n_leapfrog__ 2.5 2.0e-01 1.3 1.0 3.0 3.0 4.4e+01 1.2e+03 1.0e+00 divergent__ 0.00 nan 0.00 0.00 0.00 0.00 nan nan nan energy__ 7.8 4.2e-02 1.0 6.8 7.5 9.8 6.1e+02 1.7e+04 1.0e+00 theta 0.24 4.5e-03 0.12 0.077 0.23 0.47 7.4e+02 2.0e+04 1.0e+00 Samples were drawn using hmc with nuts. For each parameter, N_Eff is a crude measure of effective sample size, and R_hat is the potential scale reduction factor on split chains (at convergence, R_hat=1). It is also possible to create a stanfit object (from RStan) from the csv output files written by CmdStan. This can be done by using rstan::read_stan_csv() in combination with the $output_files() method of the CmdStanMCMC object.

## Comparison with RStan

The RStan interface (rstan package) is an in-memory interface to Stan and relies on R packages like Rcpp and inline to call C++ code from R. On the other hand, the CmdStanR interface does not directly call any C++ code from R, instead relying on CmdStan for compilation, running algorithms, and writing results to output files.

Both forms of interfacing with Stan have advantages and disadvantages. An in-memory interface like RStan is able to offer more advanced features than CmdStanR (for example the rstan::log_prob() and rstan::grad_log_prob() methods) but keeping up with Stan releases is more complicated for RStan, often requiring non-trivial changes to the rstan package and requiring new CRAN releases of rstan and StanHeaders. On the other hand, with CmdStanR, the latest features in Stan will be available from R immediately after updating CmdStan, without an update to the cmdstanr package. We also anticipate that running Stan via external processes will have the advantage of playing nicer with R (and RStudio) and result in fewer unexpected crashes than when using RStan.

Finally, RStan and CmdStanR have different open source licenses. RStan uses the GPL-3 license while the license for CmdStanR (like Stan) is BSD-3, which is a bit more permissive.