This notebook is a short introduction to multilevel regression modeling using Stan and the CmdStanPy interface. It shows how to integrate CmdStanPy into the data analysis workflow: how to instantiate the Stan model, fit it to data, access and validate the inference engine outputs, and use the results for downstream analysis and prediction.
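As a rough sketch of that workflow, the Python fragment below writes a toy Stan program to disk, compiles it with `CmdStanModel`, runs the sampler, and then inspects the diagnostics and posterior draws. The toy normal model, the file name `toy_normal.stan`, and the helper names `write_stan_file`, `make_data`, and `run_workflow` are illustrative assumptions, not the radon models this notebook is built around. `run_workflow` also assumes that CmdStanPy and CmdStan are installed.

```python
import tempfile
from pathlib import Path

# Hypothetical toy model (NOT the radon model): estimate a normal mean and scale.
STAN_PROGRAM = """
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);
  sigma ~ normal(0, 5);
  y ~ normal(mu, sigma);
}
"""


def write_stan_file(directory=None):
    """Write the toy Stan program to disk and return its path."""
    if directory is None:
        directory = tempfile.mkdtemp()
    path = Path(directory) / "toy_normal.stan"
    path.write_text(STAN_PROGRAM)
    return path


def make_data(y):
    """Build the data dictionary matching the program's data block."""
    return {"N": len(y), "y": [float(v) for v in y]}


def run_workflow(stan_file, data, seed=1234):
    """Instantiate, fit, validate, and extract draws (needs CmdStanPy + CmdStan)."""
    # Imported lazily so the helpers above work without CmdStanPy installed.
    from cmdstanpy import CmdStanModel

    model = CmdStanModel(stan_file=str(stan_file))  # compiles the Stan program
    fit = model.sample(data=data, seed=seed)        # runs the NUTS-HMC sampler
    print(fit.diagnose())                           # sampler diagnostics report
    print(fit.summary())                            # posterior summary with R_hat
    return fit.stan_variable("mu")                  # posterior draws of mu
```

With CmdStan available, a call such as `run_workflow(write_stan_file(), make_data([0.3, 1.1, -0.4, 0.8]))` would return the posterior draws of `mu` as a NumPy array, ready for downstream analysis.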
A secondary goal is to demonstrate best practices of Bayesian Data Analysis. Before coding up a model and trying to fit it to the data, it is critical to establish both the analysis goals and the sizes, shapes, and tendencies of the available data. Once the model is running, we can use posterior predictive checks to assess whether the model is properly specified. Both of these activities rely primarily on data visualization. This notebook uses the plotnine package, a Python implementation of a grammar of graphics based on ggplot2.
The data and models for this notebook are taken from chapter 12 of the book Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill, Cambridge University Press, 2007. In this chapter, they use a multilevel regression model to analyze data from a national survey of home radon levels in the US, conducted by the EPA in the early 1990s.
The goal of the radon study is to provide reasonable estimates of home radon levels in each of the approximately 3000 counties in the United States.
Radon gas is a product of the slow decay of uranium into lead. Because of local differences in geology, the level of exposure to radon gas differs from place to place. A common source is uranium-containing minerals in the ground, so radon tends to accumulate in subterranean areas such as basements.