30.1 Some examples

30.1.1 Earth science

Stratification and poststratification have many applications beyond survey sampling (Kennedy and Gelman 2019). For example, large-scale whole-earth soil-carbon models are fit with parametric models of how soil carbon depends on features of an area such as soil composition, flora, fauna, temperature, humidity, etc. Given a model that predicts soil-carbon concentration from these features, a whole-earth model can be created by stratifying the earth into a grid of, say, 10km by 10km “squares” (they can’t literally be square because the earth’s surface is topologically a sphere). Each grid area has an estimated makeup of soil type, forestation, climate, etc. The global level of soil carbon is then estimated using poststratification by simply summing the expected soil carbon estimated for each square in the grid (Paustian et al. 1997). Dynamic models can then be constructed by layering a time-series component, varying the poststratification predictors over time, or both (Field et al. 1998).
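The grid-summation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `GridCell` fields, the `predict_carbon_density` function, its coefficients, and the three example cells are all hypothetical stand-ins, not taken from any of the cited models. The only part that mirrors the text is the final step, which sums the expected soil carbon over the cells of the grid.

```python
from dataclasses import dataclass


@dataclass
class GridCell:
    """One stratum of the grid (all fields are hypothetical features)."""
    area_km2: float        # cells need not have identical area
    temperature_c: float   # mean temperature of the cell
    clay_fraction: float   # share of clay in the soil, between 0 and 1


def predict_carbon_density(cell: GridCell) -> float:
    """Hypothetical stand-in for a fitted model.

    Returns expected soil carbon per square kilometer. The coefficients
    are made up for illustration and carry no scientific meaning.
    """
    density = 5000.0 - 100.0 * cell.temperature_c + 2000.0 * cell.clay_fraction
    return max(0.0, density)


def total_soil_carbon(cells: list[GridCell]) -> float:
    """Poststratify by summing expected carbon (density times area) over cells."""
    return sum(predict_carbon_density(c) * c.area_km2 for c in cells)


# Made-up cells, for illustration only.
cells = [
    GridCell(area_km2=100.0, temperature_c=10.0, clay_fraction=0.2),
    GridCell(area_km2=100.0, temperature_c=25.0, clay_fraction=0.1),
    GridCell(area_km2=95.0, temperature_c=0.0, clay_fraction=0.4),
]
total = total_soil_carbon(cells)
print(total)
```

In a real application the prediction function would be replaced by the expected value under a fitted model, and the cell features would come from measured or estimated data for each grid area.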

30.1.2 Polling

Suppose a university’s administration would like to estimate the support for a given proposal among its students. A poll is carried out in which 490 respondents are undergraduates, 112 are graduate students, and 47 are continuing education students. Now suppose that support for the issue among the poll respondents is 25% among undergraduate students (subgroup 1), 40% among graduate students (subgroup 2), and 80% among continuing education students (subgroup 3). Now suppose that the student body is made up of 20,000 undergraduates, 5,000 graduate students, and 2,000 continuing education students. It is important that our subgroups are exclusive and exhaustive, i.e., they form a partition of the population.

The proportion of support in the poll among students in each group provides a simple maximum likelihood estimate $\theta^* = (0.25, 0.4, 0.8)$ of support in each group for a simple Bernoulli model where student $n$'s vote is modeled as $y_n \sim \textrm{bernoulli}(\theta_{jj[n]}),$ where $jj[n] \in 1:3$ is the subgroup to which the $n$-th student belongs.

An estimate of the population prevalence of support for the issue among students can be constructed by weighting the estimated support in each group by the size of that group and dividing by the total population size. Letting $N = (20\,000,\, 5\,000,\, 2\,000)$ be the subgroup sizes, the poststratified estimate of support in the population is $\phi^* = \frac{\displaystyle \sum_{j = 1}^3 \theta_j^* \cdot N_j} {\displaystyle \sum_{j = 1}^3 N_j}.$ Plugging in our estimates and population counts yields $\begin{eqnarray*} \phi^* & = & \frac{0.25 \cdot 20\,000 + 0.4 \cdot 5\,000 + 0.8 \cdot 2\,000} {20\,000 + 5\,000 + 2\,000} \\[4pt] & = & \frac{8\,600}{27\,000} \\[4pt] & \approx & 0.32. \end{eqnarray*}$
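The arithmetic for the polling example can be checked with a short Python sketch. The function name `poststratify` and the variable names are illustrative choices, not part of any library. The inputs are the per-group support proportions from the poll (25%, 40%, and 80%) and the subgroup sizes of the student body.

```python
def poststratify(theta, sizes):
    """Return the population-weighted average of per-group estimates.

    theta: estimated proportion of support in each subgroup
    sizes: population count of each subgroup (same order as theta)
    """
    if len(theta) != len(sizes):
        raise ValueError("theta and sizes must have the same length")
    total = sum(sizes)
    if total <= 0:
        raise ValueError("total population size must be positive")
    return sum(t * n for t, n in zip(theta, sizes)) / total


# Per-group support among poll respondents: undergraduate,
# graduate, and continuing education students.
theta_star = [0.25, 0.40, 0.80]

# Subgroup sizes in the full student body, in the same order.
N = [20_000, 5_000, 2_000]

phi_star = poststratify(theta_star, N)
print(f"{phi_star:.2f}")  # prints 0.32
```

Because the weights are the population counts rather than the numbers of poll respondents, the estimate corrects for any subgroup that is over- or under-represented in the sample relative to the student body.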

References

Field, Christopher B, Michael J Behrenfeld, James T Randerson, and Paul Falkowski. 1998. “Primary Production of the Biosphere: Integrating Terrestrial and Oceanic Components.” Science 281 (5374): 237–40.

Kennedy, Lauren, and Andrew Gelman. 2019. “Know Your Population and Know Your Model: Using Model-Based Regression and Poststratification to Generalize Findings Beyond the Observed Sample.” arXiv, no. 1906.11323.

Paustian, Keith, Elissa Levine, Wilfred M Post, and Irene M Ryzhova. 1997. “The Use of Models to Integrate Information and Understanding of Soil C at the Regional Scale.” Geoderma 79 (1-4): 227–60.