Exercise 1.1

Modeling heights in a population

The !Kung San are the most famous foraging population of the 20th century, mostly because they have the most detailed quantitative studies conducted by anthropologists like Nancy Howell in the late 1960s.

I’ve excluded the children from the data for now, so we’re only modeling adult heights.

str(height_data)

'data.frame':   352 obs. of  4 variables:
 $ height: num  152 140 137 157 145 ...
 $ weight: num  47.8 36.5 31.9 53 41.3 ...
 $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
 $ male  : int  1 0 0 1 0 1 0 1 0 1 ...

Simple modeling

How tall is the average !Kung San?

mean(height_data$height)

[1] 154.5971

How dispersed are the heights?

sd(height_data$height)

[1] 7.742332

What might we expect a new !Kung San height to look like?

rnorm(1, mean(height_data$height), sd(height_data$height))

[1] 145.2516

Draws all the way down

Primary output of a fit object is a set of draws of parameter values
Draws retain uncertainty, correlation, and other aspects of the posterior that point estimates lose
Standard toolbox for working with draws
Summarize only when required

Bayesian ways to summarize a distribution

Exercise

Data: http://mc-stan.org/workshops/dec2017/1.1.zip
http://mc-stan.org > Users > Documentation > Stan Reference Manual (pdf)
- List of functions in the index
We think adult heights are normally distributed, \(y \sim \mathcal{N}(\mu, \sigma)\)

data {
  int num_people;
  real heights[num_people];
}

parameters { ??? }
model { ??? }

Write heights.stan (missing) in your working directory by using the above and filling in the model and parameter blocks.

A first draft

data {
  int num_people;
  real heights[num_people];
}
parameters {
  real height_location;
  real<lower=0> height_scale;
}
model {
  heights ~ normal(height_location, height_scale);
  height_location ~ normal(183, 120);
  height_scale ~ normal(0, 20);
}

Priors

The fit

Inference for Stan model: heights1.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean   sd    2.5%     50%   97.5% n_eff Rhat
height_location  154.60    0.01 0.42  153.77  154.60  155.41  2846    1
height_scale       7.78    0.00 0.29    7.23    7.77    8.38  3500    1
lp__            -895.02    0.02 0.99 -897.64 -894.72 -894.02  1967    1

Samples were drawn using NUTS(diag_e) at Tue Dec  5 13:45:33 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

How far off are we, roughly?

Pretend we got point estimates - take the mean of the samples of the parameter
Use alpha as our prediction, \(\hat{y_i} = \alpha\)
Sum the residuals/error between the true values and the “prediction”

samples.simple = extract(fit1)
alpha.simple = mean(samples.simple$height_location)
sum(abs(height_data$height - alpha.simple))

[1] 2259.431

Let’s see if we can get a better fit

We’ll look at better ways of evaluating this later, but for now we’ll try to reduce the standard deviation for height by introducing weight as a predictive feature and adding a regression coefficient for it.

data {
  int num_people;
  vector[num_people] weights;
  real heights[num_people];
}

Weight regression model

transformed data {
  vector[num_people] std_weights =
      (weights - mean(weights)) / sd(weights);
}
parameters {
  real height_location_avg_weight;
  real weight_coeff;
  real<lower=0> height_scale;
}
model {
  heights ~ normal(height_location_avg_weight + weight_coeff * std_weights
                   , height_scale);
  height_location_avg_weight ~ normal(183, 120);
  height_scale ~ normal(0, 20);
  weight_coeff ~ normal(7, 10);
}

Weight regression fit

hash mismatch so recompiling; make sure Stan code ends with a blank line

Inference for Stan model: heights2.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                              mean se_mean   sd    2.5%   97.5% n_eff Rhat
height_location_avg_weight  154.61    0.00 0.27  154.07  155.13  3790    1
weight_coeff                  5.85    0.00 0.27    5.32    6.39  3071    1
height_scale                  5.11    0.00 0.19    4.74    5.49  3886    1
lp__                       -747.45    0.03 1.21 -750.61 -746.09  1658    1

Samples were drawn using NUTS(diag_e) at Tue Dec  5 13:46:00 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

How are we doing?

sum(abs(height_data$height - alpha.simple))

[1] 2259.431

Now incorporate weight_coeff to get \(\hat{y_i} = \beta x_i + \alpha\)

samples = extract(fit2)
alpha = mean(samples$height_location_avg_weight)
beta = mean(samples$weight_coeff)
std_weight = (height_data$weight - mean(height_data$weight)) / sd(height_data$weight)
sum(abs(height_data$height - (beta * std_weight + alpha)))

[1] 1390.102

Interpretation

height_location_avg_weight
weight_coeff

sd(height_data$weight) #kg

[1] 6.456708