This introduction to reduce_sum borrows directly from Richard McElreath’s Multithreading and Map-Reduce in Stan 2.18.0: A Minimal Example.

Introduction

Stan 2.23 introduced reduce_sum, a new way to parallelize the execution of a single Stan chain across multiple cores. It complements the existing map_rect utility and offers a number of features that make parallelism easier to use:

  1. A more flexible argument interface that avoids the packing and unpacking map_rect requires.
  2. Automatic partitioning of the data for parallelization (this must be done manually with map_rect).
  3. Generally greater ease of use.

This tutorial will highlight these new features and demonstrate how to use reduce_sum on a simple logistic regression using cmdstanr.

Overview

Modifying a Stan model for use with reduce_sum has three steps.

  1. Identify the portion of the model that you want to parallelize. reduce_sum is designed specifically for speeding up log likelihood evaluations that are composed of a number of conditionally independent terms, which can be computed in parallel and then added together.

  2. Write a partial sum function that can compute a given number of terms in that sum. Pass this function (and other necessary arguments) to a call to reduce_sum.

  3. Configure your compiler to enable multithreading. Then compile and sample as usual.

To demonstrate this process, we’ll first build the model without reduce_sum, identify the part to be parallelized, and then work out how to use reduce_sum to parallelize it.

Prepping the data

Let’s consider a simple logistic regression. We’ll use a reasonably large amount of data, the football red card data set from the recent crowdsourced data analysis project (https://psyarxiv.com/qkwst/). This data table contains 146,028 player-referee dyads. For each dyad, the table records the total number of red cards the referee assigned to the player over the observed number of games.

RedcardData.csv is provided in the repository here. Load the data in R, and let’s take a look at the distribution of red cards:

d <- read.csv("RedcardData.csv", stringsAsFactors = FALSE)
table(d$redCards)
## 
##      0      1      2 
## 144219   1784     25

The vast majority of dyads have zero red cards. Only 25 dyads show 2 red cards. These counts are our inference target.

The motivating hypothesis behind these data is that referees are biased against darker-skinned players. So we’re going to try to predict these counts using the skin color ratings of each player. Not all players actually received skin color ratings in these data, so let’s reduce down to dyads with ratings:

d2 <- d[!is.na(d$rater1),]
redcard_data <- list(n_redcards = d2$redCards, n_games = d2$games, rating = d2$rater1)
redcard_data$N <- nrow(d2)

This leaves us with 124,621 dyads to predict.
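
As an optional sanity check (my addition, not part of the original workflow), we can confirm the list has the shapes and constraints the Stan program will expect:

# Optional sanity check: do the data match the Stan program's declarations?
stopifnot(
  length(redcard_data$n_redcards) == redcard_data$N,
  length(redcard_data$n_games)    == redcard_data$N,
  length(redcard_data$rating)     == redcard_data$N,
  all(redcard_data$n_redcards <= redcard_data$n_games)  # counts can't exceed games
)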

At this point, you are thinking: “But there are repeat observations on players and referees! You need some cluster variables in there in order to build a proper multilevel model!” You are right. But let’s start simple. Keep your partial pooling on idle for the moment.

Making the model

A Stan model for this problem is just a simple logistic (binomial) GLM. I’ll assume you know Stan well enough already that I can just plop the code down here. Save this code as logistic0.stan where you saved RedcardData.csv (we’ll assume this is your current working directory):

data {
  int<lower=0> N;
  int<lower=0> n_redcards[N];
  int<lower=0> n_games[N];
  vector[N] rating;
}
parameters {
  vector[2] beta;
}
model {
  beta[1] ~ normal(0, 10);
  beta[2] ~ normal(0, 1);

  n_redcards ~ binomial_logit(n_games, beta[1] + beta[2] * rating);
}
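
As a quick plausibility check (my addition, not in the original tutorial), base R’s glm can fit the same logistic GLM by maximum likelihood. With this much data and such weak priors, its estimates should land near the posterior means Stan reports later:

# MLE cross-check: same binomial logistic regression, fit with glm
mle <- glm(cbind(redCards, games - redCards) ~ rater1,
           family = binomial, data = d2)
coef(mle)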

Building and Running the Basic Model

To install cmdstanr, follow the directions here: https://mc-stan.org/cmdstanr/
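
If you are starting from scratch, the installation looks roughly like this (a sketch only; the page linked above has the authoritative, up-to-date instructions):

# Sketch of a fresh cmdstanr install; see https://mc-stan.org/cmdstanr/ for details
remotes::install_github("stan-dev/cmdstanr")
cmdstanr::check_cmdstan_toolchain()  # verify a working C++ toolchain
cmdstanr::install_cmdstan()          # download and build CmdStan itself
library(cmdstanr)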

To build the model in cmdstanr run:

logistic0 <- cmdstan_model("logistic0.stan")
## Model executable is up to date!

To sample from the model run:

time0 = system.time(fit0 <- logistic0$sample(redcard_data,
                                             chains = 4,
                                             parallel_chains = 4,
                                             refresh = 1000))
## Running MCMC with 4 parallel chains...
## 
## Running ./logistic0 'id=1' random 'seed=1821957448' data \
##   'file=/tmp/Rtmp0TDWkT/standata-104856db8c26.json' output \
##   'file=/tmp/Rtmp0TDWkT/logistic0-202006161303-1-1333c5.csv' 'refresh=1000' \
##   'method=sample' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt \
##   'engaged=1'
## Running ./logistic0 'id=2' random 'seed=1139081694' data \
##   'file=/tmp/Rtmp0TDWkT/standata-104856db8c26.json' output \
##   'file=/tmp/Rtmp0TDWkT/logistic0-202006161303-2-1333c5.csv' 'refresh=1000' \
##   'method=sample' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt \
##   'engaged=1'
## Running ./logistic0 'id=3' random 'seed=128574416' data \
##   'file=/tmp/Rtmp0TDWkT/standata-104856db8c26.json' output \
##   'file=/tmp/Rtmp0TDWkT/logistic0-202006161303-3-1333c5.csv' 'refresh=1000' \
##   'method=sample' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt \
##   'engaged=1'
## Running ./logistic0 'id=4' random 'seed=1343942358' data \
##   'file=/tmp/Rtmp0TDWkT/standata-104856db8c26.json' output \
##   'file=/tmp/Rtmp0TDWkT/logistic0-202006161303-4-1333c5.csv' 'refresh=1000' \
##   'method=sample' 'save_warmup=0' 'algorithm=hmc' 'engine=nuts' adapt \
##   'engaged=1'
## Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 1 finished in 172.3 seconds.
## Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 2 finished in 173.3 seconds.
## Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 4 finished in 175.9 seconds.
## Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 3 finished in 183.6 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 176.3 seconds.
## Total execution time: 183.7 seconds.
time0
##    user  system elapsed 
## 627.300  79.702 184.054

Note: Older versions of cmdstanr use num_cores, cores, and num_chains instead of parallel_chains, threads_per_chain, and chains. If you get an error, update cmdstanr.

In this case, we run four chains in parallel on four cores. This computer has eight cores, though, and we’ll see later how to make use of the other four. The elapsed time is the time we would have recorded with a stopwatch, so it is the number relevant to performance here (system time is time spent on system functions like I/O, and user time is the total CPU time summed across all cores, which is why it exceeds the elapsed time when computing in parallel).

Rewriting the Model to Enable Multithreading

The key to porting this calculation to reduce_sum is recognizing that the statement:

n_redcards ~ binomial_logit(n_games, beta[1] + beta[2] * rating);

can be rewritten (up to a proportionality constant) as:

for(n in 1:N) {
  target += binomial_logit_lpmf(n_redcards[n] | n_games[n], beta[1] + beta[2] * rating[n]);
}

Now it is clear that the calculation is a sum of conditionally independent binomial log probability terms. Whenever we need to calculate a large sum in which each term is independent of all the others, so that the terms can be grouped and added in any order, reduce_sum is useful.
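
To see this splitting in miniature, here is a plain R illustration (my own sketch, not Stan code; the beta values are arbitrary) showing that adding two slices of the per-dyad log likelihood recovers the full sum:

# R illustration of the partial-sum idea (arbitrary beta values)
beta <- c(-5.5, 0.3)
p  <- plogis(beta[1] + beta[2] * d2$rater1)          # inverse logit
ll <- dbinom(d2$redCards, d2$games, p, log = TRUE)   # one term per dyad
M  <- length(ll) %/% 2
sum(ll)                                     # full sum
sum(ll[1:M]) + sum(ll[(M + 1):length(ll)])  # same value (up to rounding)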

To use reduce_sum, we need to write a function that computes an arbitrary contiguous section of this sum.

Using the reducer interface defined in the Reduce-Sum section of the User’s Guide:

functions {
  real partial_sum(int[] slice_n_redcards,
                   int start, int end,
                   int[] n_games,
                   vector rating,
                   vector beta) {
    return binomial_logit_lpmf(slice_n_redcards |
                               n_games[start:end],
                               beta[1] + beta[2] * rating[start:end]);
  }
}

The likelihood statement in the model can now be written:

target += partial_sum(n_redcards, 1, N, n_games, rating, beta); // Sum terms 1 to N in the likelihood

Equivalently, it could be broken into two pieces and written as:

int M = N / 2;
target += partial_sum(n_redcards[1:M], 1, M, n_games, rating, beta); // Sum terms 1 to M
target += partial_sum(n_redcards[(M + 1):N], M + 1, N, n_games, rating, beta); // Sum terms M + 1 to N

By passing partial_sum to reduce_sum, we allow Stan to automatically break up these calculations and do them in parallel.

Notice the difference in how n_redcards is sliced in half (to reflect which terms of the sum are being accumulated) while the rest of the arguments (n_games, rating, and beta) are passed along whole. This distinction is important and is more fully described in the User’s Guide section on Reduce-sum.

Given the partial sum function, reduce_sum can be used to automatically parallelize the likelihood:

int grainsize = 1;
target += reduce_sum(partial_sum, n_redcards, grainsize,
                     n_games, rating, beta);

reduce_sum automatically breaks the sum into roughly grainsize-sized pieces and computes them in parallel. grainsize = 1 specifies that the grainsize should be estimated automatically (leave grainsize at 1 unless you have run specific tests to pick a different value).

Declaring grainsize as data (which makes it convenient to experiment with), the final model is:

functions {
  real partial_sum(int[] slice_n_redcards,
                   int start, int end,
                   int[] n_games,
                   vector rating,
                   vector beta) {
    return binomial_logit_lpmf(slice_n_redcards |
                               n_games[start:end],
                               beta[1] + beta[2] * rating[start:end]);
  }
}
data {
  int<lower=0> N;
  int<lower=0> n_redcards[N];
  int<lower=0> n_games[N];
  vector[N] rating;
  int<lower=1> grainsize;
}
parameters {
  vector[2] beta;
}
model {

  beta[1] ~ normal(0, 10);
  beta[2] ~ normal(0, 1);

  target += reduce_sum(partial_sum, n_redcards, grainsize,
                       n_games, rating, beta);
}

Save this as logistic1.stan.

Running the Multithreaded Model

Compile this model with threading support:

logistic1 <- cmdstan_model("logistic1.stan", cpp_options = list(stan_threads = TRUE))
## Model executable is up to date!

Note: If you get an error Error in self$compile(...) : unused argument (cpp_options = list(stan_threads = TRUE)), update your cmdstanr to the latest version (the threading interface was changed in May 2020, shortly after this tutorial was originally published).

The computer I’m on has 8 cores, and we want to make use of all of them. Running 4 chains in parallel gives us 2 threads per chain, which makes full use of all 8 cores on the processor. Within-chain parallelism is generally less efficient than running chains in parallel, but if greater single-chain speedups are desired, you can run fewer chains with more threads each (e.g. 2 chains in parallel with 4 threads each), as sketched below.
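
For example, a 2-chain, 4-thread run would look like this (a sketch only; I have not timed it here):

# Sketch: fewer chains, more threads per chain (still 8 cores total)
redcard_data$grainsize <- 1   # grainsize is part of the data now
fit_alt <- logistic1$sample(redcard_data,
                            chains = 2,
                            parallel_chains = 2,
                            threads_per_chain = 4,
                            refresh = 1000)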

Run and time the model with:

redcard_data$grainsize <- 1
time1 = system.time(fit1 <- logistic1$sample(redcard_data,
                                             chains = 4,
                                             parallel_chains = 4,
                                             threads_per_chain = 2,
                                             refresh = 1000))
## Running MCMC with 4 parallel chains, with 2 thread(s) per chain...
## 
## Running ./logistic1_threads 'id=1' random 'seed=948103613' data \
##   'file=/tmp/Rtmp0TDWkT/standata-1048c784cc7.json' output \
##   'file=/tmp/Rtmp0TDWkT/logistic1_threads-202006161306-1-679aa7.csv' \
##   'refresh=1000' 'method=sample' 'save_warmup=0' 'algorithm=hmc' \
##   'engine=nuts' adapt 'engaged=1'
## Running ./logistic1_threads 'id=2' random 'seed=1654742267' data \
##   'file=/tmp/Rtmp0TDWkT/standata-1048c784cc7.json' output \
##   'file=/tmp/Rtmp0TDWkT/logistic1_threads-202006161306-2-679aa7.csv' \
##   'refresh=1000' 'method=sample' 'save_warmup=0' 'algorithm=hmc' \
##   'engine=nuts' adapt 'engaged=1'
## Running ./logistic1_threads 'id=3' random 'seed=1898131361' data \
##   'file=/tmp/Rtmp0TDWkT/standata-1048c784cc7.json' output \
##   'file=/tmp/Rtmp0TDWkT/logistic1_threads-202006161306-3-679aa7.csv' \
##   'refresh=1000' 'method=sample' 'save_warmup=0' 'algorithm=hmc' \
##   'engine=nuts' adapt 'engaged=1'
## Running ./logistic1_threads 'id=4' random 'seed=662848006' data \
##   'file=/tmp/Rtmp0TDWkT/standata-1048c784cc7.json' output \
##   'file=/tmp/Rtmp0TDWkT/logistic1_threads-202006161306-4-679aa7.csv' \
##   'refresh=1000' 'method=sample' 'save_warmup=0' 'algorithm=hmc' \
##   'engine=nuts' adapt 'engaged=1'
## Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 4 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 4 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 4 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 4 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 4 finished in 69.2 seconds.
## Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 1 finished in 71.4 seconds.
## Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 3 finished in 72.5 seconds.
## Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 2 finished in 73.0 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 71.5 seconds.
## Total execution time: 73.0 seconds.
time1
##    user  system elapsed 
## 571.353   0.499  73.133

Again, elapsed time is the time recorded as if by a stopwatch. Computing the ratio of the two elapsed times gives a speedup on eight cores of:

time0[["elapsed"]] / time1[["elapsed"]]
## [1] 2.516702

This model was particularly easy to parallelize because most of the arguments are data (and do not need to be autodiffed). It is possible to get a speedup greater than two, even though we only doubled the computational resources, because of the way reduce_sum breaks up the work and uses the CPU cache more efficiently. If a model passes in many arguments defined in the parameters, transformed parameters, or model block, does little computation inside the partial sum function, or does not get lucky with caching, the speedup will be much more limited.
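
Since grainsize is declared as data, you can also experiment with it without recompiling the model. A sketch of such a test (my addition; the candidate values are arbitrary, and single-chain timings are noisy, so fix the seed and repeat runs before drawing conclusions):

# Sketch: time a single chain at a few arbitrary grainsizes
for (g in c(1, 100, 1000, 10000)) {
  redcard_data$grainsize <- g
  t <- system.time(logistic1$sample(redcard_data,
                                    chains = 1,
                                    threads_per_chain = 2,
                                    seed = 1,
                                    refresh = 0))
  cat("grainsize", g, ": elapsed", t[["elapsed"]], "seconds\n")
}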

We can always get a speedup in terms of effective sample size per unit of time by running multiple chains in parallel on different cores. reduce_sum is not a replacement for that, and it is still important to run multiple chains to check diagnostics. reduce_sum is a tool for speeding up single-chain calculations, which can be useful during model development and on computers with large numbers of cores.

We can use the posterior package to do a quick check that these two fits are mixing with each other. When parallelizing a model, this is a good way to make sure nothing has broken:

remotes::install_github("jgabry/posterior")
## Skipping install of 'posterior' from a github remote, the SHA1 (d7c0a696) has not changed since last install.
##   Use `force = TRUE` to force installation
library(posterior)
## This is posterior version 0.1.0
summarise_draws(bind_draws(fit0$draws(), fit1$draws(), along = "chain"))
## # A tibble: 3 x 10
##   variable     mean   median      sd     mad       q5      q95  rhat ess_bulk
##   <chr>       <dbl>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl>    <dbl>
## 1 lp__     -9.05e+3 -9.05e+3 1.20e+3 1.78e+3 -1.03e+4 -7.85e+3  1.70     12.6
## 2 beta[1]  -5.53e+0 -5.53e+0 3.44e-2 3.48e-2 -5.59e+0 -5.48e+0  1.00   2902. 
## 3 beta[2]   2.91e-1  2.90e-1 8.09e-2 8.02e-2  1.56e-1  4.23e-1  1.00   3061. 
## # … with 1 more variable: ess_tail <dbl>

Ignoring lp__ (remember that the switch to binomial_logit_lpmf means the two models’ log densities differ by a constant), the Rhats are less than 1.01, so our basic convergence checks pass. Unless something strange is going on, we probably parallelized our model correctly!
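
As one more check, using only functions already loaded above, we can summarize the two fits separately and confirm the beta estimates agree:

summarise_draws(fit0$draws())
summarise_draws(fit1$draws())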

More Information

For a more detailed description of how reduce_sum works, see the User’s Guide. For a more complete description of the interface, see the Stan Functions Reference.