Parallelization

Stan provides three ways of parallelizing execution of a Stan model:

  • multi-threading with Intel Threading Building Blocks (TBB),
  • multi-processing with Message Passing Interface (MPI) and
  • manycore processing with OpenCL.

Multi-threading with TBB

In order to exploit multi-threading in a Stan model, the models must be rewritten to use the reduce_sum and map_rect functions. For instructions on how to rewrite Stan models to use these functions see Stan’s User guide chapter on parallelization, the reduce_sum case study or the Multithreading and Map-Reduce tutorial.

Compiling

Once a model is rewritten to use the above-mentioned functions, the model must be compiled with the STAN_THREADS makefile flag. The flag can be supplied in the make call but we recommend writing the flag to the make/local file.

An example of the contents of make/local to enable threading with TBB:

STAN_THREADS=true

The model is then compiled as normal:

make path/to/model

Running

Before running a multi-threaded model, we need to specify the maximum number of threads the program can run (total threads for all chains). This is done by setting the num_threads argument. Valid values for num_threads are positive integers and -1. If num_threads is set to -1, all available cores will be used.

Generally, this number should not exceed the number of available cores for best performance.

Example:

./model sample data file=data.json num_threads=4 ...

When the model is compiled with STAN_THREADS we can sample with multiple chains with a single executable (see section running multiple chains for cases when this is available). When running multiple chains num_threads is the maximum number of threads that can be used by all the chains combined. The exact number of threads that will be used for each chain at a given point in time is determined by the TBB scheduler. The following example start 2 chains with 8 total threads available:

./model sample num_chains=2 data file=data.json num_threads=8 ...

Multi-processing with MPI

In order to use multi-processing with MPI in a Stan model, the models must be rewritten to use the map_rect function. By using MPI, the model can be parallelized across multiple cores or a cluster. MPI with Stan is supported on MacOS and Linux.

Dependencies

Compiling and running Stan models with MPI requires that the system has an MPI implementation installed. For Unix systems the most commonly used implementations are MPICH and OpenMPI.

Compiling

Once a model is rewritten to use map_rect, additional makefile flags must be written to the make/local. These are:

  • STAN_MPI: Enables the use of MPI with Stan if true.
  • CXX: The name of the MPI C++ compiler wrapper. Typically mpicxx.
  • TBB_CXX_TYPE: The C++ compiler the MPI wrapper wraps. Typically gcc on Linux and clang on macOS.

An example of make/local on Linux:

STAN_MPI=true
CXX=mpicxx
TBB_CXX_TYPE=gcc

The model is then compiled as normal:

make path/to/model

Running

The Stan model compiled with STAN_MPI is run using an MPI launcher. The MPI standard suggests using mpiexec, but a vendor wrapper for the launcher like mpirun can also be used. The launcher is supplied the path to the built executable and the number of processes to start: -n X for mpiexec or -np X for mpirun where X is replaced by the integer representing the number of processes.

Example for running a model with six processes:

mpiexec -n 6 path/to/model sample data file=data.json ...

OpenCL

Dependencies

OpenCL is supported on most modern CPUs and GPUs. In order to run OpenCL-enabled Stan models, an OpenCL runtime for the target device must be installed. This subsection lists installation instructions for OpenCL runtimes of the commonly-found devices.

In order to check if any OpenCL-enabled device and its runtime is already present use the clinfo tool. On Linux, clinfo can typically be installed with the default package manager (for example sudo apt-get install clinfo on Ubuntu). For Windows, pre-built clinfo binary can be found here.

Also use clinfo to verify successful installation of OpenCL runtimes.

NVIDIA GPU

  • Linux:

    Install the NVIDIA GPU driver and the NVIDIA CUDA Toolkit. On Ubuntu the commands to install both is:

    sudo apt update
    sudo apt install nvidia-driver-460 nvidia-cuda-toolkit

    Replace the driver version (460 in the above case) with the lastest number at the time of installation.

  • Windows:

    Install the NVIDIA GPU Driver and CUDA Toolkit.

AMD GPU

  • Linux:

    Install Radeon Software for Linux available here.

  • Windows:

    We recommend installing the open source OCL-SDK.

AMD CPU

Install the open source PoCL.

Intel CPU/GPU

Follow Intel’s install instructions given here (requires registration).

Compiling

In order to enable the OpenCL backend the model must be compiled with the STAN_OPENCL makefile flag. The flag can be supplied in the make call but we recommend writing the flag to the make/local file.

An example of the contents of make/local to enable parallelization with OpenCL:

STAN_OPENCL=true

If you are using OpenCL with an integrated GPU you also need to add the INTEGRATED_OPENCL flag, as the sharing of memory between CPU and GPU is slightly different with integrated graphics:

INTEGRATED_OPENCL=true

The model is then compiled as normal:

make path/to/model

Running

The Stan model compiled with STAN_OPENCL can also be supplied the OpenCL platform and device IDs of the target device. These IDs determine the device on which to run the OpenCL-supported functions on. You can list the devices on your system using the clinfo program. If the system has one GPU and no OpenCL CPU runtime, the platform and device IDs of the GPU are typically 0. In that case you can also omit the OpenCL IDs as the default 0 IDs are used in that case.

We supply these IDs when starting the executable as shown below:

path/to/model sample data file=data.json opencl platform=0 device=1
Back to top