29.4 Regression and poststratification

In applications to polling, there are often numerous demographic features, such as age, gender, income, education, and state of residence. If each of these demographic features induces a partition on the population, then their product also induces a partition on the population. Often sources such as the census have matching (or at least matchable) demographic data; otherwise it must be estimated.

The problem facing poststratification by demographic feature is that the number of strata grows exponentially with the number of features. For instance, 4 age brackets, 2 sexes, 5 income brackets, and 50 states of residence lead to $4 \cdot 2 \cdot 5 \cdot 50 = 2000$ strata. Adding another 5-way distinction, say for education level, leads to 10,000 strata. A simple model like the one in the previous section, which takes an independent parameter $\theta_j$ for support in each stratum, is unworkable because many groups will have zero respondents and almost all groups will have very few respondents.
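The growth in the number of strata can be checked with a few lines of arithmetic. The following is a minimal Python sketch using the level counts from the example; the survey size `n_respondents` is a hypothetical value chosen only to illustrate how many cells must be empty.

```python
from math import prod

# Number of levels per demographic feature, as in the example above.
levels = {"age": 4, "sex": 2, "income": 5, "state": 50}

n_strata = prod(levels.values())      # 4 * 2 * 5 * 50
n_strata_with_edu = n_strata * 5      # add a 5-way education split

# Hypothetical survey size (an assumption for illustration only).
n_respondents = 1000

# Each respondent falls in exactly one cell, so at least this many
# cells contain no respondents at all.
min_empty = max(0, n_strata - n_respondents)

print(n_strata, n_strata_with_edu, min_empty)
```

Under this assumed survey size, at least half of the 2000 cells would have no data, which is why a separate parameter per cell is not workable.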

A practical approach to overcoming the problem of low data size per stratum is to use a regression model. Each demographic feature requires a regression coefficient for each of its subgroups, but now the number of parameters grows with the sum, rather than the product, of the number of levels per feature. For example, with 4 age brackets, 2 sexes, 5 income brackets, and 50 states of residence, there are only $4 + 2 + 5 + 50 = 61$ regression coefficients to estimate, in addition to the intercept $\alpha$. Now suppose that respondent $n$ has demographic features $\textrm{age}_n \in 1:4$, $\textrm{sex}_n \in 1:2$, $\textrm{income}_n \in 1:5$, and $\textrm{state}_n \in 1:50$. A logistic regression may be formulated as $y_n \sim \textrm{bernoulli}(\textrm{logit}^{-1}( \alpha + \beta_{\textrm{age}[n]} + \gamma_{\textrm{sex}[n]} + \delta_{\textrm{income}[n]} + \epsilon_{\textrm{state}[n]} )),$ where $\textrm{age}[n]$ is the age bracket of the $n$-th respondent, $\textrm{sex}[n]$ is their sex, $\textrm{income}[n]$ their income bracket, and $\textrm{state}[n]$ their state of residence. These coefficients can be assigned priors, resulting in a Bayesian regression model.
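To make the additive structure concrete, here is a minimal Python sketch of the linear predictor and the resulting response probability for one respondent. It assumes the level counts from the example (4, 2, 5, 50). The coefficient values are random placeholders rather than estimates, and the names `coef` and `support_prob` are hypothetical.

```python
import math
import random


def inv_logit(u):
    """Numerically stable inverse logit."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


random.seed(0)  # reproducible placeholder values

n_levels = {"age": 4, "sex": 2, "income": 5, "state": 50}

# Placeholder values standing in for alpha, beta, gamma, delta, epsilon.
alpha = 0.0
coef = {name: [random.gauss(0, 1) for _ in range(k)]
        for name, k in n_levels.items()}

# One coefficient per level of each feature: the counts add up.
n_coef = sum(len(values) for values in coef.values())


def support_prob(age, sex, income, state):
    """Probability that y = 1 for a respondent; indices are 1-based as in the text."""
    eta = (alpha
           + coef["age"][age - 1]
           + coef["sex"][sex - 1]
           + coef["income"][income - 1]
           + coef["state"][state - 1])
    return inv_logit(eta)


print(n_coef, support_prob(age=2, sex=1, income=3, state=17))
```

Every one of the 2000 cells receives a prediction even if it has no respondents, because each prediction is assembled from coefficients shared across cells.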

To poststratify the results, the population size for each combination of predictors must still be known. Then the population estimate is constructed as $\sum_{i = 1}^4 \sum_{j = 1}^2 \sum_{k = 1}^5 \sum_{m = 1}^{50} \textrm{logit}^{-1}(\alpha + \beta_i + \gamma_j + \delta_k + \epsilon_m) \cdot \textrm{pop}_{i, j, k, m},$ where $\textrm{pop}_{i, j, k, m}$ is the size of the subpopulation with age $i$, sex $j$, income level $k$, and state of residence $m$. This sum is the expected number of positive responses ($y = 1$) in the population; dividing it by the total population size $\sum_{i, j, k, m} \textrm{pop}_{i, j, k, m}$ gives the expected proportion.
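The sum over cells can be sketched directly in Python. This is an illustrative sketch only: the parameter values and the population counts in `pop` are randomly generated placeholders, not real estimates or census figures, and the level counts (4, 2, 5, 50) follow the running example. In a Bayesian fit, the same calculation would typically be repeated for each posterior draw of the parameters.

```python
import math
import random
from itertools import product


def inv_logit(u):
    """Numerically stable inverse logit."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


random.seed(1)  # reproducible placeholder values
n_age, n_sex, n_income, n_state = 4, 2, 5, 50

# Placeholder parameter values (assumptions, not estimates from any data).
alpha = -0.2
beta = [random.gauss(0, 0.5) for _ in range(n_age)]
gamma = [random.gauss(0, 0.5) for _ in range(n_sex)]
delta = [random.gauss(0, 0.5) for _ in range(n_income)]
epsilon = [random.gauss(0, 0.5) for _ in range(n_state)]

# Hypothetical population count for every (age, sex, income, state) cell.
cells = list(product(range(n_age), range(n_sex),
                     range(n_income), range(n_state)))
pop = {cell: random.randint(100, 10_000) for cell in cells}

# Expected number of positive responses: cell probability times cell size.
expected_yes = sum(
    inv_logit(alpha + beta[i] + gamma[j] + delta[k] + epsilon[m])
    * pop[i, j, k, m]
    for (i, j, k, m) in cells
)

# Dividing by the total population gives the expected proportion.
total_pop = sum(pop.values())
proportion = expected_yes / total_pop

print(len(cells), round(proportion, 3))
```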

As formulated, it should be clear that any kind of prediction could be used as a basis for poststratification. For example, a Gaussian process or neural network could be used to produce a non-parametric model of outcomes $y$ given predictors $x$ .
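One way to see this is that the poststratification step needs only a predicted probability per cell and a population count per cell. The Python sketch below assumes a hypothetical helper `poststratify` that accepts any prediction function; the two-feature cells, counts, and probabilities are made up for illustration.

```python
def poststratify(predict, pop):
    """Population-weighted average of predict(cell) over all cells.

    predict: any callable mapping a cell to a probability in [0, 1]
             (a regression, Gaussian process, neural network, lookup table, ...).
    pop:     dict mapping each cell to its population count.
    """
    total = sum(pop.values())
    if total <= 0:
        raise ValueError("total population must be positive")
    return sum(predict(cell) * n for cell, n in pop.items()) / total


# Made-up cells and counts, purely for illustration.
pop = {
    ("young", "urban"): 300,
    ("young", "rural"): 100,
    ("old", "urban"): 200,
    ("old", "rural"): 400,
}

# A lookup table stands in for a fitted model's per-cell predictions.
predicted = {
    ("young", "urban"): 0.6,
    ("young", "rural"): 0.4,
    ("old", "urban"): 0.5,
    ("old", "rural"): 0.3,
}

estimate = poststratify(predicted.__getitem__, pop)
print(estimate)
```

With these made-up numbers the estimate is roughly 0.44. Swapping in a different model changes only the `predict` argument.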