Putting your Stan model and data into a Jupyter notebook lets your audience work through your analysis step by step. The challenge in a demo, talk, or classroom situation is getting everyone in the room on the same page, when that page is your presentation notebook: it must display properly and be runnable from each person’s browser. Unfortunately, the law of conservation of energy mandates that the easier things are for your audience, the harder they are for you.
This report takes you through the tedious details of setting up a Jupyter Notebook so that anyone with a modern web browser and a Google account can run your Stan analysis with Google Colaboratory free cloud servers, with plenty of screenshots and technical details. Inasmuch as clouds are always moving and changing, I’ve titled this report “cloud-compute-2020”. While the screenshots may have a sell-by date of Q3 2020, the challenges will remain.
The example notebooks for R and Python in this report use two new lightweight interfaces to Stan: CmdStanR and CmdStanPy. They were developed with the following goals:
Simplicity and modularity: these packages wrap CmdStan and just provide functions to compile models, do inference, and assemble and save the results; other packages are needed for downstream analysis
Keep up with Stan releases: these interfaces can use any (recent) version of CmdStan, including the current release, Stan 2.23.
Quick and easy installation: minimal dependencies on other packages and no direct calls to C++.
Flexible licensing: BSD-3.
A Jupyter notebook consists of blocks of markdown text interleaved with blocks of statements which unifies the exposition of ideas and arguments with the necessary supporting data, computation, and visualizations. The three core programming languages supported are Julia, Python, and R, hence the name, Ju-Pyt-R. In order to author a Jupyter notebook on your machine, you need a local install of both Python and Jupyter, as outlined in the Jupyter installation instructions. Once the Jupyter server is running, you can then run existing notebooks and create new notebooks via your favorite web browser.
A notebook document is a JSON file with suffix .ipynb which contains a list of cells, one cell per content block (code, text, or image), and a dictionary of metadata which specifies the kernel used to run the notebook, here either R or Python. When viewed in a browser, the notebook is displayed as an HTML page where the contents of the text cells are rendered by default, while the code cells are displayed with controls which allow them to be executed independently. Additional controls allow you to create, edit, and publish notebooks as HTML or PDF documents.
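As a sketch of this layout, the following Python snippet builds a minimal notebook dictionary by hand and writes it out as an .ipynb file. The field names follow the nbformat-4 conventions just described; the kernel name "ir" is the IRkernel’s identifier.

```python
import json

# A minimal notebook: kernel metadata plus one markdown and one code cell.
nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {
        # The kernelspec tells Jupyter which kernel runs the code cells.
        "kernelspec": {"name": "ir", "display_name": "R", "language": "R"}
    },
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# My Stan analysis\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["library(cmdstanr)\n"]},
    ],
}

# Serializing this dict as JSON yields a file Jupyter can open.
with open("minimal.ipynb", "w") as f:
    json.dump(nb, f, indent=1)

print([cell["cell_type"] for cell in nb["cells"]])  # ['markdown', 'code']
```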
By distributing an .ipynb notebook file together with your Stan model and data, your audience can replicate your analysis on their machine. But in order to do this, they must have a local install of the Jupyter notebook server or another IDE that can run the notebook. This is not always possible: they might not have permission to install software on their computer, or their machine might not have enough memory or a powerful enough CPU to run Stan. You could give them access to your machine by running a Jupyter notebook server for single-user access or a JupyterHub server for multiple-user access, but this requires bandwidth, compute power, and careful attention to security. Regarding security, here’s some wisdom from the Jupyter blog (emphasis added):
It is important to keep in mind that, since arbitrary code execution is the main point of IPython and Jupyter, running a publicly accessible server without authentication or encryption is a very bad idea indeed.
Hence the need for Jupyter servers in the cloud: many instances of someone else’s server where the audience can run your notebook.
It is a truth, universally acknowledged, that there ain’t no such thing as a free lunch. Nonetheless, as of 2020, Google is providing free email via Gmail, free file storage via Google Drive, and free Jupyter notebook servers via Google Colab. To use these services, you must sign up for a Google account. The Colab server instances are limited, as is the amount of storage on your Google Drive, and should you use Gmail, Google examines all messages and serves ads accordingly; there ain’t no such thing as a free lunch, Q.E.D.
Google Colaboratory, nicknamed Colab, is a free Jupyter notebook environment that runs in the cloud, i.e., on remote Google servers. Google makes no promises about VM availability or resources (see https://research.google.com/colaboratory/faq.html); however, you can inspect the virtual machine instance that a notebook is running on via system commands. To see how this works, I created a notebook called show_VM_specs.ipynb, which is stored on my Google Drive in a folder called ‘Colab_Notebooks’. In order to run this notebook, I can either open it with Google Colab or download it and run it locally.
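A notebook cell along these lines reports the same kind of machine information; this is a sketch using only the Python standard library, not the literal contents of show_VM_specs.ipynb.

```python
import multiprocessing
import platform
import shutil

# Report the specs of whatever machine this cell runs on.
print("OS:  ", platform.platform())
print("CPUs:", multiprocessing.cpu_count())

# disk_usage returns (total, used, free) in bytes for the given path.
total, used, free = shutil.disk_usage("/")
print("Free disk: {:.1f} GB".format(free / 1e9))
```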
The Colab UI is just different enough from the Jupyter UI to be confusing. Execution of each cell is controlled by a “play button” icon.
The above screenshot shows that when I ran this notebook on Colab, it ran on a VM instance with 12 GB RAM and 2 Intel Xeon 2.2 GHz processors, running Ubuntu Linux on a file system with 74 GB of free disk space. This is sufficient to use Stan to fit small-to-medium models on small-to-medium datasets; i.e., good enough for demos and classroom exercises. Colab is a gateway drug: for large-scale processing pipelines you’ll need to move up to Google Cloud Platform or one of its competitors, AWS, Azure, etc.
A Colab VM persists for approximately 10 to 12 hours from first spin-up, therefore you must upload/download local files to your Google Drive or another cloud storage system in order to save your work. The notebook itself is saved on your Google Drive. You can download it to your local machine, or you can save it to GitHub via the “Save a copy in GitHub” option from the Colab UI “File” tab.
The save dialog provides the option of adding an “Open in Colab” button to the saved notebook.
This saves the notebook show_VM_specs.ipynb to the GitHub repo for this report.
Once uploaded, Colab provides minimal editing facilities for data and test files.
Unfortunately, the extension .stan is not recognized; however, .json data files can be edited. This makes editing program files in Colab challenging. Fundamentally, Colab is a notebook viewer, not an IDE.
In a talk, class, or demo situation, every minute spent installing software is a minute less for the presentation itself. A Colab virtual machine instance has both Python and R already installed, as well as both the clang and gcc-7 C++ compilers. But to run the current Stan release (Stan 2.23) from R or Python, you need either CmdStanR or CmdStanPy as well as a local CmdStan installation. Both CmdStanR and CmdStanPy are small packages with minimal dependencies which can be quickly downloaded from their respective repositories and installed in a matter of seconds. Both packages provide a function install_cmdstan which downloads and compiles the latest CmdStan release.
Unfortunately, installing CmdStan can take upwards of 10 minutes, because the install_cmdstan function downloads a CmdStan release from GitHub, decompresses and unpacks the tarball, and compiles all C++ code for CmdStan, Stan, and the math libraries. This must be done every time a fresh VM spins up. As a Colab VM only persists for at most 12 hours, most of the time your class, talk, or demo will require waiting on this installation process before you or your audience can start running Stan.
To avoid this, we’ve created a CmdStan binary for Colab for the CmdStan 2.23.0 release, colab-cmdstan-2.23.0.tar.gz. From a Colab notebook, downloading and unpacking this set of pre-compiled binaries takes on the order of 10 seconds. With this shortcut, the Stan install process for Colab consists of three steps: install the CmdStanR or CmdStanPy package, download and unpack the precompiled CmdStan binaries, and register the path to the CmdStan installation.
To see how this works, I have created two example notebooks for Colab which run CmdStan’s example model bernoulli.stan: CmdStanPy_Example_Notebook.ipynb and CmdStanR_Example_Notebook.ipynb.
The initial cells of the CmdStanR_Example_Notebook follow the steps described above. To create this notebook, it was necessary to first create an R notebook (using the IRkernel) with Jupyter running on my laptop, then upload it to Google Drive. The first cell in the CmdStanR notebook installs the CmdStanR package, as needed. CmdStanR is not yet on CRAN, so we install from GitHub instead. To speed up the install process, only the necessary dependencies are installed.
# Install package CmdStanR from GitHub
library(devtools)
if (!require(cmdstanr)) {
  devtools::install_github("stan-dev/cmdstanr", dependencies = c("Depends", "Imports"))
  library(cmdstanr)
}
The following cells download the precompiled CmdStan binaries and register the path to the CmdStan installation:
# Install CmdStan binaries
if (!file.exists("colab-cmdstan-2.23.0.tar.gz")) {
  system("wget https://github.com/stan-dev/cmdstan/releases/download/v2.23.0/colab-cmdstan-2.23.0.tar.gz", intern = TRUE)
  system("tar zxf colab-cmdstan-2.23.0.tar.gz", intern = TRUE)
}
# Set cmdstan_path to CmdStan installation
set_cmdstan_path("cmdstan-2.23.0")
The CmdStanPy_Example_Notebook contains the Python version of the fast spin-up steps. CmdStanPy requires Python3, which is the default runtime for new Colab notebooks. CmdStanPy is a pure-Python package which can be installed from PyPI:
!pip install --upgrade cmdstanpy
We specify the --upgrade flag in order to get the latest version, in case any of the pre-installed Python packages for the Colab Python runtime pin an older version of CmdStanPy.
We can use Python to download and unpack the precompiled CmdStan binaries:
# Install pre-built CmdStan binary
# (faster than compiling from source via install_cmdstan() function)
import os
import shutil
import urllib.request

tgz_file = 'colab-cmdstan-2.23.0.tar.gz'
tgz_url = 'https://github.com/stan-dev/cmdstan/releases/download/v2.23.0/colab-cmdstan-2.23.0.tar.gz'
if not os.path.exists(tgz_file):
    urllib.request.urlretrieve(tgz_url, tgz_file)
    shutil.unpack_archive(tgz_file)
The following cells check the CmdStan installation and register its location:
# Specify CmdStan location via environment variable
os.environ['CMDSTAN'] = './cmdstan-2.23.0'
# Check CmdStan path
from cmdstanpy import CmdStanModel, cmdstan_path
cmdstan_path()
In Colab, once you have executed a code block, hovering over the “run” icon shows the execution time; here a CmdStan installation took just over 10 seconds.
Note that these binaries might work on other Ubuntu machines, but they definitely won’t work on macOS or Windows.
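A defensive notebook cell can check this before unpacking the tarball. The following guard is a sketch (the function name is mine, not part of either interface): the precompiled tarball was built on Ubuntu, so at a minimum the host must be Linux.

```python
import platform

def colab_binaries_ok():
    # The precompiled CmdStan tarball was built on Ubuntu Linux,
    # so require Linux at a minimum before unpacking it.
    return platform.system() == "Linux"

print(colab_binaries_ok())
```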
This section is for all Stan users who work primarily in R and RStudio. It shows you how to get your local Jupyter environment set up and how to translate R Markdown documents to Jupyter notebooks.
In order to author Jupyter notebooks for R, you also need a local install of R as well as the IRkernel package, the R kernel for Jupyter notebooks (see the IRkernel installation instructions).
Jupyter notebooks for R are similar to RStudio’s R Markdown documents, in that both contain chunks of R code interleaved with chunks of text. Underlyingly, Jupyter notebooks are JSON documents with suffix .ipynb, while R Markdown documents are in R Markdown format and have suffix .Rmd. Although the RStudio IDE provides an R Markdown notebook interface, notebooks authored via the RStudio IDE are in R Markdown format, not Jupyter Notebook format.
The RStudio interface calls this a notebook, but the resulting file is still in R Markdown format:
---
title: "R Notebook"
output: html_notebook
---
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*.
```{r}
plot(cars)
```
Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Cmd+Option+I*.
...
When viewed with Jupyter, this document is treated as a raw text file.
Jupytext identified the R Markdown YAML header as its own block, with celltype raw, and converted the next three chunks to notebook cells with celltypes markdown, code, and markdown, respectively. After editing and testing the notebook in the browser, you can save it as a Jupyter R notebook via the options menu tab “File”, selection “Download as”, option “notebook (.ipynb)”.
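The chunk-to-cell mapping described above can be illustrated with a toy splitter. This is a simplified sketch, not Jupytext’s actual algorithm: the YAML header becomes a raw cell, each {r} chunk a code cell, and the remaining text markdown cells. (The fence string is built up programmatically only so the example stays self-contained.)

```python
fence = "`" * 3  # the literal ``` chunk delimiter used by R Markdown

rmd_lines = [
    "---",
    'title: "R Notebook"',
    "output: html_notebook",
    "---",
    "This is an R Markdown Notebook.",
    fence + "{r}",
    "plot(cars)",
    fence,
    "Add a new chunk by clicking the Insert Chunk button.",
]

cells, current, ctype = [], [], "raw"
for line in rmd_lines:
    if line == "---" and ctype == "raw" and current:
        # Closing fence of the YAML header: emit it as a raw cell.
        current.append(line)
        cells.append(("raw", current))
        current, ctype = [], "markdown"
    elif line == fence + "{r}":
        # Start of an R chunk: flush any accumulated text first.
        if current:
            cells.append((ctype, current))
        current, ctype = [], "code"
    elif line == fence and ctype == "code":
        # End of an R chunk.
        cells.append(("code", current))
        current, ctype = [], "markdown"
    else:
        current.append(line)
if current:
    cells.append((ctype, current))

print([t for t, _ in cells])  # ['raw', 'markdown', 'code', 'markdown']
```

The resulting cell types match the raw/markdown/code/markdown sequence Jupytext produced for the example document.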
Congratulations, you now have a Jupyter R notebook that you can share with the world! Anyone who has access to a Jupyter server can recreate and extend your analysis.
In order to run a Jupyter notebook via Colab, it must be available on the web or on your Google Drive. Google Drive lets you create new Python notebooks or upload existing notebooks from your local machine. To share a Jupyter R notebook with the world, you will need to author the notebook locally and then upload it to Google Drive:
Alternatively, you could use this empty R notebook, but of course, if you want an R notebook that runs Stan, use the example CmdStanR notebook from the previous section.
The example notebooks for CmdStanPy and CmdStanR provide a starting point for sharing your Stan analysis as a Colab notebook by getting you over the critical software installation hurdle. Whatever your analysis, it should include the initial cells in these notebooks which download the Python and R wrapper interface packages, download the precompiled CmdStan installation, and set the package’s CmdStan path variable accordingly.
After installation, an interesting Stan notebook should include cells to:
An extremely interesting Stan notebook would expand the above to the full Bayesian workflow:
We’ve just added notebooks for both Python and R which correspond to Andrew’s blog post on an early study of Covid-19 prevalence in Santa Clara County, CA:
The notebooks, models, and data are available from GitHub: