{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Primer on Bayesian Multilevel Modeling using PyStan\n", "\n", "*Author*: Chris Fonnesbeck \n", "*License*: Apache, Version 2.0 (code); CC-BY 3.0 (text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hierarchical or multilevel modeling is a generalization of regression modeling.\n", "\n", "*Multilevel models* are regression models in which the constituent model parameters are given **probability models**. This implies that model parameters are allowed to **vary by group**.\n", "\n", "Observational units are often naturally **clustered**. Clustering induces dependence between observations, despite random sampling of clusters and random sampling within clusters.\n", "\n", "A *hierarchical model* is a particular multilevel model where parameters are nested within one another.\n", "\n", "Some multilevel structures are not hierarchical. \n", "\n", "* e.g. \"country\" and \"year\" are not nested, but may represent separate, but overlapping, clusters of parameters\n", "\n", "We will motivate this topic using an environmental epidemiology example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Radon contamination (Gelman and Hill 2006)\n", "\n", "Radon is a radioactive gas that enters homes through contact points with the ground. It is a carcinogen that is the primary cause of lung cancer in non-smokers. Radon levels vary greatly from household to household.\n", "\n", "\n", "\n", "The EPA did a study of radon levels in 80,000 houses. Two important predictors:\n", "\n", "* measurement in basement or first floor (radon higher in basements)\n", "* county uranium level (positive correlation with radon levels)\n", "\n", "We will focus on modeling radon levels in Minnesota.\n", "\n", "The hierarchy in this example is households within county. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data organization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we import the data from a local file, and extract Minnesota's data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns; sns.set_context('notebook')\n", "\n", "# Import radon data\n", "srrs2 = pd.read_csv('../data/srrs2.dat')\n", "srrs2.columns = srrs2.columns.map(str.strip)\n", "srrs_mn = srrs2.assign(fips=srrs2.stfips*1000 + srrs2.cntyfips)[srrs2.state=='MN']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, obtain the county-level predictor, uranium, by combining two variables." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "cty = pd.read_csv('../data/cty.dat')\n", "cty_mn = cty[cty.st=='MN'].copy()\n", "cty_mn[ 'fips'] = 1000*cty_mn.stfips + cty_mn.ctfips" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the `merge` method to combine home- and county-level information in a single DataFrame." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "srrs_mn = srrs_mn.merge(cty_mn[['fips', 'Uppm']], on='fips')\n", "srrs_mn = srrs_mn.drop_duplicates(subset='idnum')\n", "u = np.log(srrs_mn.Uppm)\n", "\n", "n = len(srrs_mn)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " | idnum | \n", "state | \n", "state2 | \n", "stfips | \n", "zip | \n", "region | \n", "typebldg | \n", "floor | \n", "room | \n", "basement | \n", "... | \n", "stopdt | \n", "activity | \n", "pcterr | \n", "adjwt | \n", "dupflag | \n", "zipflag | \n", "cntyfips | \n", "county | \n", "fips | \n", "Uppm | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "5081 | \n", "MN | \n", "MN | \n", "27 | \n", "55735 | \n", "5 | \n", "1 | \n", "1 | \n", "3 | \n", "N | \n", "... | \n", "12288 | \n", "2.2 | \n", "9.7 | \n", "1146.499190 | \n", "1 | \n", "0 | \n", "1 | \n", "AITKIN | \n", "27001 | \n", "0.502054 | \n", "
1 | \n", "5082 | \n", "MN | \n", "MN | \n", "27 | \n", "55748 | \n", "5 | \n", "1 | \n", "0 | \n", "4 | \n", "Y | \n", "... | \n", "12088 | \n", "2.2 | \n", "14.5 | \n", "471.366223 | \n", "0 | \n", "0 | \n", "1 | \n", "AITKIN | \n", "27001 | \n", "0.502054 | \n", "
2 | \n", "5083 | \n", "MN | \n", "MN | \n", "27 | \n", "55748 | \n", "5 | \n", "1 | \n", "0 | \n", "4 | \n", "Y | \n", "... | \n", "21188 | \n", "2.9 | \n", "9.6 | \n", "433.316718 | \n", "0 | \n", "0 | \n", "1 | \n", "AITKIN | \n", "27001 | \n", "0.502054 | \n", "
3 | \n", "5084 | \n", "MN | \n", "MN | \n", "27 | \n", "56469 | \n", "5 | \n", "1 | \n", "0 | \n", "4 | \n", "Y | \n", "... | \n", "123187 | \n", "1.0 | \n", "24.3 | \n", "461.623670 | \n", "0 | \n", "0 | \n", "1 | \n", "AITKIN | \n", "27001 | \n", "0.502054 | \n", "
4 | \n", "5085 | \n", "MN | \n", "MN | \n", "27 | \n", "55011 | \n", "3 | \n", "1 | \n", "0 | \n", "4 | \n", "Y | \n", "... | \n", "13088 | \n", "3.1 | \n", "13.8 | \n", "433.316718 | \n", "0 | \n", "0 | \n", "3 | \n", "ANOKA | \n", "27003 | \n", "0.428565 | \n", "
5 rows × 27 columns
\n", "