## 31.4 Regression and poststratification

In applications to polling, there are often numerous demographic features like age, gender, income, education, state of residence, etc. If each of these demographic features induces a partition on the population, then their product also induces a partition on the population. Often sources such as the census have matching (or at least matchable) demographic data; otherwise it must be estimated.

The problem facing poststratification by demographic feature is that the number of strata increases exponentially as a function of the number of features. For instance, 4 age brackets, 2 sexes, 5 income brackets, and 50 states of residence leads to $$5 \cdot 2 \cdot 5 \cdot 50 = 2000$$ strata. Adding another 5-way distinction, say for education level, leads to 10,000 strata. A simple model like the one in the previous section that takes an independent parameter $$\theta_j$$ for support in each stratum is unworkable in that many groups will have zero respondents and almost all groups will have very few respondents.

A practical approach to overcoming the problem of low data size per stratum is to use a regression model. Each demographic feature will require a regression coefficient for each of its subgroups, but now the parameters add to rather than multiply the total number of parameters. For example, with 4 age brackets, 2 sexes, 5 income brackets, and 50 states of residence, there are only $$4 + 2 + 5 + 50 = 61$$ regression coefficients to estimate. Now suppose that item $$n$$ has demographic features $$\textrm{age}_n \in 1:5$$, $$\textrm{sex}_n \in 1:2$$, $$\textrm{income}_n \in 1:5,$$ and $$\textrm{state}_n \in 1:50$$. A logistic regression may be formulated as $y_n \sim \textrm{bernoulli}(\textrm{logit}^{-1}( \alpha + \beta_{\textrm{age}[n]} + \gamma_{\textrm{sex}[n]} + \delta_{\textrm{income}[n]} + \epsilon_{\textrm{state}[n]} )),$ where $$\textrm{age}[n]$$ is the age of the $$n$$-th respondent, $$\textrm{sex}[n]$$ is their sex, $$\textrm{income}[n]$$ their income and $$\textrm{state}[n]$$ their state of residence. These coefficients can be assigned priors, resulting in a Bayesian regression model.

To poststratify the results, the population size for each combination of predictors must still be known. Then the population estimate is constructed as $\sum_{i = 1}^5 \sum_{j = 1}^2 \sum_{k = 1}^5 \sum_{m = 1}^{50} \textrm{logit}^{-1}(\alpha + \beta_i + \gamma_j + \delta_k + \eta_m) \cdot \textrm{pop}_{i, j, k, m},$ where $$\textrm{pop}_{i, j, k, m}$$ is the size of the subpopulation with age $$i$$, sex $$j$$, income level $$k$$, and state of residence $$m$$.

As formulated, it should be clear that any kind of prediction could be used as a basis for poststratification. For example, a Gaussian process or neural network could be used to produce a non-parametric model of outcomes $$y$$ given predictors $$x$$.