18 Parallelization
Stan provides three ways of parallelizing execution of a Stan model:
- multi-threading with Intel Threading Building Blocks (TBB),
- multi-processing with Message Passing Interface (MPI) and
- manycore processing with OpenCL.
18.1 Multi-threading with TBB
To exploit multi-threading in a Stan model, the model must be rewritten to use the reduce_sum or map_rect functions. For instructions on how to rewrite Stan models to use these functions, see the Stan User's Guide chapter on parallelization, the reduce_sum case study, or the Multithreading and Map-Reduce tutorial.
18.1.1 Compiling
Once a model is rewritten to use the above-mentioned functions, it must be compiled with the STAN_THREADS makefile flag. The flag can be supplied in the make call, but we recommend writing it to the make/local file.
An example of the contents of make/local to enable threading with TBB:
STAN_THREADS=true
The model is then compiled as normal:
make path/to/model
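The recommendation above (write the flag to make/local rather than passing it on every make call) can be sketched as a small shell helper. This is only an illustration: set_make_flag is a hypothetical name, not part of Stan, and the sketch assumes it is called from the directory that contains make/local.

```shell
# Hypothetical helper (not part of Stan): append a NAME=VALUE line to a
# make/local file unless that exact line is already present.
set_make_flag() {
  file="$1"
  flag="$2"
  mkdir -p "$(dirname "$file")"   # create the make/ directory if missing
  touch "$file"                   # create the file if missing
  # -x: match whole line, -F: fixed string, -q: quiet
  grep -qxF "$flag" "$file" || printf '%s\n' "$flag" >> "$file"
}

# Example call, run from the directory assumed to contain make/local:
#   set_make_flag make/local "STAN_THREADS=true"
#   make path/to/model
```

Calling the helper twice with the same flag leaves a single line in the file, so it is safe to run repeatedly, for example from a setup script.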
18.1.2 Running
Before running a multi-threaded model, specify the maximum number of threads the program can use (the total across all chains) by setting the num_threads argument. Valid values for num_threads are positive integers and -1. If num_threads is set to -1, all available cores are used. For best performance, this number should generally not exceed the number of available cores.
Example:
./model sample data file=data.json num_threads=4 ...
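As a rough sketch of the advice not to exceed the number of available cores, the snippet below caps a requested thread count before building the command line. cap_threads is a hypothetical helper, and the use of getconf _NPROCESSORS_ONLN to count cores is an assumption that holds on typical Linux and macOS systems.

```shell
# Number of online cores; falls back to 1 if the query is unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Hypothetical helper: return the requested thread count, capped at the
# number of available cores. A request of -1 ("all cores") passes through.
cap_threads() {
  requested="$1"
  available="$2"
  if [ "$requested" -gt "$available" ]; then
    printf '%s\n' "$available"
  else
    printf '%s\n' "$requested"
  fi
}

threads=$(cap_threads 4 "$cores")

# Print the resulting command instead of running it.
echo "./model sample data file=data.json num_threads=$threads"
```

The snippet only prints the command, so it can be tried without a compiled model; replace the echo with the actual call once the executable exists.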
When the model is compiled with STAN_THREADS, we can sample multiple chains with a single executable (see section running multiple chains for cases when this is available). When running multiple chains, num_threads is the maximum number of threads that can be used by all the chains combined. The exact number of threads used by each chain at a given point in time is determined by the TBB scheduler. The following example starts 2 chains with 8 total threads available:
./model sample num_chains=2 data file=data.json num_threads=8 ...
18.2 Multi-processing with MPI
To use multi-processing with MPI in a Stan model, the model must be rewritten to use the map_rect function. By using MPI, the model can be parallelized across multiple cores or a cluster. MPI with Stan is supported on macOS and Linux.
18.2.1 Dependencies
Compiling and running Stan models with MPI requires that the system has an MPI implementation installed. For Unix systems the most commonly used implementations are MPICH and OpenMPI.
18.2.2 Compiling
Once a model is rewritten to use map_rect, additional makefile flags must be written to the make/local file. These are:
- STAN_MPI: Enables the use of MPI with Stan if true.
- CXX: The name of the MPI C++ compiler wrapper. Typically mpicxx.
- TBB_CXX_TYPE: The C++ compiler the MPI wrapper wraps. Typically gcc on Linux and clang on macOS.
An example of make/local on Linux:
STAN_MPI=true
CXX=mpicxx
TBB_CXX_TYPE=gcc
The model is then compiled as normal:
make path/to/model
18.2.3 Running
A Stan model compiled with STAN_MPI is run using an MPI launcher. The MPI standard suggests using mpiexec, but a vendor-provided launcher such as mpirun can also be used. The launcher is given the path to the built executable and the number of processes to start: -n X for mpiexec or -np X for mpirun, where X is an integer giving the number of processes.
Example for running a model with six processes:
mpiexec -n 6 path/to/model sample data file=data.json ...
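The choice between the two launchers and their process-count options can be sketched as follows. mpi_launcher is a hypothetical helper, not something shipped with Stan or with any MPI implementation; it simply prefers mpiexec when both launchers are on the PATH.

```shell
# Hypothetical helper: print the launcher prefix for N processes,
# preferring mpiexec (-n N) and falling back to mpirun (-np N).
mpi_launcher() {
  n="$1"
  if command -v mpiexec >/dev/null 2>&1; then
    echo "mpiexec -n $n"
  elif command -v mpirun >/dev/null 2>&1; then
    echo "mpirun -np $n"
  else
    echo "no MPI launcher found on PATH" >&2
    return 1
  fi
}

# Example use (commented out because it needs a compiled model):
#   $(mpi_launcher 6) path/to/model sample data file=data.json
```

If neither launcher is installed, the helper reports the problem on standard error and returns a non-zero status, so a calling script can stop early.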
18.3 OpenCL
18.3.1 Dependencies
OpenCL is supported on most modern CPUs and GPUs. In order to run OpenCL-enabled Stan models, an OpenCL runtime for the target device must be installed. This subsection lists installation instructions for OpenCL runtimes of the commonly-found devices.
To check whether an OpenCL-enabled device and its runtime are already present, use the clinfo tool. On Linux, clinfo can typically be installed with the default package manager (for example sudo apt-get install clinfo on Ubuntu). For Windows, a pre-built clinfo binary can be found here. Also use clinfo to verify successful installation of OpenCL runtimes.
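A minimal sketch of such a check is shown below. It assumes the commonly packaged clinfo implementation, whose -l option prints a compact list of platforms and their devices; list_opencl_devices is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: list OpenCL platforms and devices if clinfo is
# installed, otherwise report that the tool is missing.
list_opencl_devices() {
  if command -v clinfo >/dev/null 2>&1; then
    # -l prints a short tree of platforms and the devices under each one
    clinfo -l
  else
    echo "clinfo not found on PATH; install it with your package manager" >&2
    return 127
  fi
}

# Example use:
#   list_opencl_devices
```

An empty listing suggests that no OpenCL runtime is installed for the devices in the system. The platform and device indices in the listing typically correspond to the IDs used when selecting a device at run time.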
18.3.1.1 NVIDIA GPU
Linux:
Install the NVIDIA GPU driver and the NVIDIA CUDA Toolkit. On Ubuntu, the commands to install both are:
sudo apt update
sudo apt install nvidia-driver-460 nvidia-cuda-toolkit
Replace the driver version (460 in the above case) with the latest number at the time of installation.
Windows:
Install the NVIDIA GPU Driver and CUDA Toolkit.
18.3.1.3 AMD CPU
Install the open source PoCL.
18.3.1.4 Intel CPU/GPU
Follow Intel’s install instructions given here (requires registration).
18.3.2 Compiling
To enable the OpenCL backend, the model must be compiled with the STAN_OPENCL makefile flag. The flag can be supplied in the make call, but we recommend writing it to the make/local file.
An example of the contents of make/local to enable parallelization with OpenCL:
STAN_OPENCL=true
If you are using OpenCL with an integrated GPU you also need to add the INTEGRATED_OPENCL
flag, as the sharing of memory between CPU and GPU is slightly different with integrated graphics:
INTEGRATED_OPENCL=true
The model is then compiled as normal:
make path/to/model
18.3.3 Running
A Stan model compiled with STAN_OPENCL can also be supplied the OpenCL platform and device IDs of the target device. These IDs determine the device on which the OpenCL-supported functions run. You can list the devices on your system using the clinfo program. If the system has one GPU and no OpenCL CPU runtime, the platform and device IDs of the GPU are typically 0. In that case you can omit the OpenCL IDs, since both default to 0.
We supply these IDs when starting the executable as shown below:
path/to/model sample data file=data.json opencl platform=0 device=1