Putting your Stan model and data into a Jupyter notebook lets your audience work through your analysis step by step. The challenge in a demo, talk, or classroom situation is getting everyone in the room on the same page: when that page is your presentation notebook, it must display properly and be runnable from each person's browser. Unfortunately, the law of conservation of energy mandates that the easier things are for your audience, the harder they are for you.

This report takes you through the tedious details of setting up a Jupyter Notebook so that anyone with a modern web browser and a Google account can run your Stan analysis with Google Colaboratory free cloud servers, with plenty of screenshots and technical details. Inasmuch as clouds are always moving and changing, I’ve titled this report “cloud-compute-2020”. While the screenshots may have a sell-by date of Q3 2020, the challenges will remain.

Introducing CmdStanR and CmdStanPy!

The example notebooks for R and Python in this report use two new lightweight interfaces to Stan: CmdStanR and CmdStanPy. They were developed with the following goals:

Jupyter 101

A Jupyter notebook consists of blocks of markdown text interleaved with blocks of code, unifying the exposition of ideas and arguments with the necessary supporting data, computation, and visualizations. The three core programming languages supported are Julia, Python, and R, hence the name: Ju-Pyt-R. In order to author a Jupyter notebook on your machine, you need a local install of both Python and Jupyter, as outlined in the Jupyter installation instructions. Once the Jupyter server is running, you can run existing notebooks and create new ones via your favorite web browser.

A notebook document is a JSON file with suffix .ipynb which contains a list of cells, one cell per content block (code, text, or image), and a dictionary of metadata which specifies the kernel used to run the notebook, here either R or Python. When viewed in a browser, the notebook is displayed as an HTML page where the contents of the text cells are rendered by default, while the code cells are displayed with controls which allow them to be executed independently. Additional controls allow you to create, edit, and publish notebooks as HTML or PDF documents.
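To make this concrete, here is a minimal sketch of that JSON layout, written out with Python's standard library; the field names follow the nbformat 4 schema, and the two cells are placeholder content, not taken from any notebook in this report:

```python
import json

# Skeleton of a two-cell notebook: a list of cells plus
# metadata naming the kernel (nbformat 4 schema).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3"}
    },
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# My Stan analysis\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [],
         "source": ["print('hello')\n"]},
    ],
}

# Any JSON file with this shape and an .ipynb suffix is a valid notebook.
with open("minimal.ipynb", "w") as f:
    json.dump(notebook, f, indent=1)
```

Swapping the kernelspec to `{"name": "ir", "display_name": "R"}` is all it takes to mark the same file as an R notebook.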

Jupyter notebook in the browser, local Jupyter server

By distributing an .ipynb notebook file together with your Stan model and data, your audience can replicate your analysis on their own machines. But to do this, they must have a local install of the Jupyter notebook server or another IDE that can run the notebook. This is not always possible: they might not have permission to install software on their computer, or their machine might not have enough memory or a powerful enough CPU to run Stan. You could give them access to your machine by running a Jupyter notebook server for single-user access or a JupyterHub server for multi-user access, but this requires bandwidth, compute power, and careful attention to security. On that last point, here's some wisdom from the Jupyter blog (emphasis added):

It is important to keep in mind that, since arbitrary code execution is the main point of IPython and Jupyter, running a publicly accessible server without authentication or encryption is a very bad idea indeed.

Hence the need for Jupyter servers in the cloud: many instances of someone else’s server where the audience can run your notebook.

Challenge: Free Servers

It is a truth, universally acknowledged, that there ain’t no such thing as a free lunch. Nonetheless, as of 2020, Google is providing free email via Gmail, free file storage via Google Drive, and free Jupyter notebook servers via Google Colab. To use these services, you must sign up for a Google account. The Colab server instances are limited, as is the amount of storage on your Google Drive, and should you use Gmail, Google examines all messages and serves ads accordingly; there ain’t no such thing as a free lunch, Q.E.D.

Google Colaboratory 101

Google Colaboratory, nicknamed Colab, is a free Jupyter notebook environment that runs in the cloud, i.e., on remote Google servers. Google makes no promises about VM availability or resources, (see https://research.google.com/colaboratory/faq.html), however you can inspect the virtual machine instance that a notebook is running on via system commands. To see how this works, I created a notebook called show_VM_specs.ipynb which is stored on my Google Drive in a folder called ‘Colab_Notebooks’. In order to run this notebook, I can either open it with Google Colab or download it and run it locally.

Open a notebook in Google Colab

The Colab UI is just different enough from the Jupyter UI to be confusing. Execution of each cell is controlled by a “play button” icon.

Jupyter notebook in Colab

The above screenshot shows that when I ran this notebook on Colab, it ran on a VM instance with 12 GB RAM and 2 Intel Xeon 2.2 GHz processors, running Ubuntu Linux on a file system with 74 GB of free disk space. This is sufficient to use Stan to fit small-to-medium models on small-to-medium datasets, i.e., good enough for demos and classroom exercises. Colab is a gateway drug: for large-scale processing pipelines you'll need to move up to Google Cloud Platform or one of its competitors (AWS, Azure, etc.).
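The kind of information show_VM_specs.ipynb reports can be approximated from Python's standard library; the actual notebook uses system commands (presumably shell one-liners such as `!cat /proc/meminfo`), so the calls below are a portable sketch rather than its exact contents:

```python
import multiprocessing
import platform
import shutil

# Approximate the VM-inspection cells: CPU count, OS, and free disk space.
cpus = multiprocessing.cpu_count()
os_name = platform.platform()
disk = shutil.disk_usage("/")

print(f"CPUs: {cpus}")
print(f"OS: {os_name}")
print(f"Free disk: {disk.free / 1e9:.1f} GB")
```

Running a cell like this at the top of a notebook is a quick sanity check that the VM you landed on is big enough for the model you plan to fit.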

A Colab VM persists for approximately 10 to 12 hours from first spin-up; therefore you must save files to your Google Drive or another cloud storage system in order to preserve your work. The notebook itself is saved on your Google Drive. You can download it to your local machine, or you can save it to GitHub via the "Save a copy in GitHub" option under the Colab "File" menu.
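Generated files (CmdStan output CSVs, plots) disappear with the VM, so it's worth bundling them up before the session ends. A minimal sketch using the standard library; the fit_summary.txt file is a placeholder for whatever your analysis produces, and the Colab-specific download call is shown as a comment since it only exists on Colab:

```python
import os
import shutil

# Collect results into a directory, then zip it for download.
os.makedirs("results", exist_ok=True)
with open("results/fit_summary.txt", "w") as f:
    f.write("placeholder for CmdStan output\n")

# make_archive returns the path of the created zip file.
archive = shutil.make_archive("results", "zip", root_dir="results")
print(archive)

# On Colab, trigger a browser download of the archive:
# from google.colab import files
# files.download(archive)
```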

Saving a notebook to GitHub

The save dialog provides the option of adding an “Open in Colab” button to the saved notebook.

Save dialog

This saves the notebook show_VM_specs.ipynb to the GitHub repo for this report.

Saved notebook in GitHub preview

A few words on Colab security and authorization

In order to run a notebook in Colab, you must first be logged into your Google Account. Once logged in, when you first attempt to run the notebook, you will get a warning that the notebook may request access to your data or read stored credentials.

Security warning for GitHub notebooks

This is a valid warning! However, in order for a notebook to access your Google Drive, it must use the Google Drive API, and this will be evident from reading through the notebook's markdown and code cells. So don't take my word that these notebooks are safe; review the notebook source code before clicking "RUN ANYWAY" (or "CANCEL").

Editing files and data on Colab

The Colab interface provides a "files" utility which lets you upload data and models to your notebook's VM, and also gives you minimal editing facilities for files on the Colab instance. This is the file-folder icon to the left of the main window.
Colab Files Utility

Clicking on this icon opens the files utility, which shows a view of the files in the current working directory of the Colab instance. Across the top of the files utility are controls to upload files from your local computer, refresh the files view, and connect to your Google Drive.
Files utility, controls

The file upload utility opens a file browser that lets you select one or more files to upload.
Files utility file browser

Once uploaded, Colab provides minimal editing facilities for data and text files.

Files utility file editor

Unfortunately, the extension .stan is not recognized; however .json data files can be edited. This makes editing program files in Colab challenging. Fundamentally, Colab is a notebook viewer, not an IDE.
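One workaround: since a notebook cell can write arbitrary files, you can keep your Stan program as a string inside the notebook and write it to disk before compiling, sidestepping the editor's lack of .stan support. A sketch of that pattern, using CmdStan's example model bernoulli.stan as the program:

```python
# Keep the Stan program in a notebook cell and write it out,
# since the Colab file editor cannot edit .stan files.
stan_program = """
data {
  int<lower=0> N;
  int<lower=0,upper=1> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  theta ~ beta(1,1);  // uniform prior on interval 0,1
  y ~ bernoulli(theta);
}
"""

with open("bernoulli.stan", "w") as f:
    f.write(stan_program)
```

Editing the model then means editing the cell and re-running it, which also keeps the model text visible to your audience inside the notebook itself.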

Challenge: Fast Spin-up

In a talk, class, or demo situation, every minute spent installing software is a minute less for the presentation itself. A Colab virtual machine instance has both Python and R already installed, as well as both the clang and gcc-7 C++ compilers. But to run the current Stan release (Stan 2.23) from R or Python, you need either CmdStanR or CmdStanPy, as well as a local CmdStan installation. Both CmdStanR and CmdStanPy are small packages with minimal dependencies which can be quickly downloaded from their respective repositories and installed in a matter of seconds. Both packages provide a function install_cmdstan which downloads and compiles the latest CmdStan release.

Unfortunately, installing CmdStan can take upwards of 10 minutes because the install_cmdstan function downloads a CmdStan release from GitHub, decompresses and unpacks the tarball, and compiles all C++ code for CmdStan, Stan, and the math libraries. This must be done every time a fresh VM spins up. As a Colab VM only persists for at most 12 hours, this means that most of the time your class, talk, or demo will require waiting on this installation process before you or your audience can start running Stan.

To avoid this, we’ve created a set of precompiled CmdStan binaries for Colab for the CmdStan 2.23.0 release, colab-cmdstan-2.23.0.tar.gz. From a Colab notebook, downloading and unpacking these binaries takes on the order of 10 seconds. With this shortcut, the Stan install process for Colab consists of three steps:

1. Install the interface package, CmdStanR or CmdStanPy.
2. Download and unpack the precompiled CmdStan binaries.
3. Register the path to the CmdStan installation.

To see how this works, I have created two example notebooks for Colab which run CmdStan’s example model bernoulli.stan: CmdStanPy_Example_Notebook.ipynb and CmdStanR_Example_Notebook.ipynb.

CmdStanR notebook spin-up

The initial cells of the CmdStanR_Example_Notebook follow the steps described above. To create this notebook, it was necessary to first author an R notebook using a local Jupyter server with the IRkernel, then upload it to Google Drive. The first cell in the CmdStanR notebook installs the CmdStanR package, as needed. CmdStanR is not yet on CRAN, so we install from GitHub instead. To speed up the install process, only the necessary dependencies are installed.

# Install package CmdStanR from GitHub
library(devtools)
if (!require(cmdstanr)) {
  devtools::install_github("stan-dev/cmdstanr", dependencies = c("Depends", "Imports"))
  library(cmdstanr)
}

The following cells download the precompiled CmdStan binaries and register the path to the CmdStan installation:

# Install CmdStan binaries
if (!file.exists("colab-cmdstan-2.23.0.tar.gz")) {
  system("wget https://github.com/stan-dev/cmdstan/releases/download/v2.23.0/colab-cmdstan-2.23.0.tar.gz", intern = TRUE)
  system("tar zxf colab-cmdstan-2.23.0.tar.gz", intern = TRUE)
}

# Set cmdstan_path to CmdStan installation
set_cmdstan_path("cmdstan-2.23.0")

Spinning up a CmdStanPy Notebook

The CmdStanPy_Example_Notebook contains the Python version of the fast spin-up steps. CmdStanPy requires Python3, which is the default runtime for new Colab notebooks. CmdStanPy is a pure-Python package which can be installed from PyPI:

!pip install --upgrade cmdstanpy

We specify the --upgrade flag in order to get the latest version, in case the Colab Python runtime has an older version of CmdStanPy pre-installed.

We can use Python to download and unpack the precompiled CmdStan binaries:

# Install pre-built CmdStan binary
# (faster than compiling from source via install_cmdstan() function)
import os
import shutil
import urllib.request

tgz_file = 'colab-cmdstan-2.23.0.tar.gz'
tgz_url = 'https://github.com/stan-dev/cmdstan/releases/download/v2.23.0/colab-cmdstan-2.23.0.tar.gz'
if not os.path.exists(tgz_file):
    urllib.request.urlretrieve(tgz_url, tgz_file)
    shutil.unpack_archive(tgz_file)

The following cells register the CmdStan location and check the installation path:

# Specify CmdStan location via environment variable
os.environ['CMDSTAN'] = './cmdstan-2.23.0'
# Check CmdStan path
from cmdstanpy import CmdStanModel, cmdstan_path 
cmdstan_path()

In Colab, once you have executed a code block, hovering over the “run” icon shows the execution time - here a CmdStan installation took just over 10 seconds.

Execution time to install precompiled CmdStan binary

Note that these binaries might work on other Ubuntu machines, but they definitely won’t work on macOS or Windows.
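Since the prebuilt binaries are Linux-only, a notebook meant to run both on Colab and locally might guard the download with a platform check. A sketch; the source-build fallback refers to the install_cmdstan function mentioned above:

```python
import platform

# The colab-cmdstan binaries were built on Ubuntu Linux; on any
# other OS, fall back to compiling CmdStan from source
# (e.g. via the interface package's install_cmdstan function).
if platform.system() == "Linux":
    use_prebuilt = True   # download colab-cmdstan-2.23.0.tar.gz
else:
    use_prebuilt = False  # build CmdStan from source instead

print("prebuilt binaries OK:", use_prebuilt)
```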

Challenge: Putting the R in Jupyter

This section is for all Stan users who work primarily in R and RStudio. It shows you how to get your local Jupyter environment set up and how to translate R Markdown documents to Jupyter notebooks.

In order to author Jupyter notebooks for R, you also need a local install of R as well as the IRkernel package, the R kernel for Jupyter; see the IRkernel installation instructions.

Jupyter notebooks for R are similar to RStudio’s R Markdown documents, in that both contain chunks of R code interleaved with chunks of text. Under the hood, however, Jupyter notebooks are JSON documents with suffix .ipynb, while R Markdown documents are plain-text files with suffix .Rmd. Although the RStudio IDE provides an R Markdown notebook interface, notebooks authored via the RStudio IDE are in R Markdown format, not Jupyter notebook format.

RStudio R Notebook

The RStudio interface calls this a notebook, but the resulting file is still in R Markdown format:

---
title: "R Notebook"
output: html_notebook
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*. 

```{r}
plot(cars)
```

Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Cmd+Option+I*.

...

When viewed with Jupyter, this document is treated as a raw text file.
R Markdown document in Jupyter browser

An extremely useful Python package called Jupytext allows you to convert from R Markdown to Jupyter notebook format and back again. Once Jupytext is installed, Jupyter treats R Markdown documents as notebooks.
R Markdown document in Jupyter with Jupytext extension

Jupytext identified the R Markdown YAML header as its own block, with cell type raw, and converted the next three chunks to notebook cells with cell types markdown, code, and markdown, respectively. After editing and testing the notebook in the browser, you can save it as a Jupyter R notebook via the “File” menu, selection “Download as”, option “Notebook (.ipynb)”.

Convert R Markdown file to Jupyter notebook

Congratulations, you now have a Jupyter R notebook that you can share with the world! Anyone who has access to a Jupyter server can recreate and extend your analysis.

Putting the R in Colab

In order to run a Jupyter notebook via Colab, it must be available on the web or on your Google Drive. Google Drive lets you create new Python notebooks or upload existing notebooks from your local machine. To share a Jupyter R notebook with the world, you will need to author the notebook locally and then upload it to Google Drive:

Upload to Google Drive

Alternatively, you could use this empty R notebook; but of course, if you want an R notebook that runs Stan, use the example CmdStanR notebook from the previous section.

Next steps

The example notebooks for CmdStanPy and CmdStanR provide a starting point for sharing your Stan analysis as a Colab notebook by getting you over the critical software installation hurdle. Whatever your analysis, it should include the initial cells from these notebooks, which install the Python or R wrapper package, download the precompiled CmdStan installation, and set the package’s CmdStan path variable accordingly.

After installation, an interesting Stan notebook should include cells to:

An extremely interesting Stan notebook would expand the above to the full Bayesian workflow:

Update, May 2020

We’ve just added notebooks for both Python and R which correspond to Andrew’s blogpost on an early study on Covid-19 prevalence in Santa Clara county, CA:

The notebooks, models, and data are available from GitHub: