Exercise 1.1

Modeling heights in a population

The !Kung San are the most famous foraging population of the 20th century, mostly because they have the most detailed quantitative studies conducted by anthropologists like Nancy Howell in the late 1960s.

I’ve excluded the children from the data for now, so we’re only modeling adult heights.

str(height_data)
'data.frame':   352 obs. of  4 variables:
 $ height: num  152 140 137 157 145 ...
 $ weight: num  47.8 36.5 31.9 53 41.3 ...
 $ age   : num  63 63 65 41 51 35 32 27 19 54 ...
 $ male  : int  1 0 0 1 0 1 0 1 0 1 ...

Simple modeling

mean(height_data$height)
[1] 154.5971
sd(height_data$height)
[1] 7.742332
rnorm(1, mean(height_data$height), sd(height_data$height))
[1] 145.2516

Draws all the way down

Bayesian ways to summarize a distribution

Exercise

data {
  int num_people;
  real heights[num_people];
}
parameters { ??? }
model { ??? }

Write heights.stan (missing) in your working directory by using the above and filling in the model and parameter blocks.

A first draft

data {
  int num_people;
  real heights[num_people];
}
parameters {
  real height_location;
  real<lower=0> height_scale;
}
model {
  heights ~ normal(height_location, height_scale);
  height_location ~ normal(183, 120);
  height_scale ~ normal(0, 20);
}

The fit

Inference for Stan model: heights1.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                   mean se_mean   sd    2.5%     50%   97.5% n_eff Rhat
height_location  154.60    0.01 0.42  153.77  154.60  155.41  2846    1
height_scale       7.78    0.00 0.29    7.23    7.77    8.38  3500    1
lp__            -895.02    0.02 0.99 -897.64 -894.72 -894.02  1967    1

Samples were drawn using NUTS(diag_e) at Tue Dec  5 13:45:33 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

How far off are we, roughly?

samples.simple = extract(fit1)
alpha.simple = mean(samples.simple$height_location)
sum(abs(height_data$height - alpha.simple))
[1] 2259.431

Let’s see if we can get a better fit

We’ll look at better ways of evaluating this later, but for now we’ll try to reduce the standard deviation for height by introducing weight as a predictive feature and adding a regression coefficient for it.

data {
  int num_people;
  vector[num_people] weights;
  real heights[num_people];
}

Weight regression model

transformed data {
  vector[num_people] std_weights =
      (weights - mean(weights)) / sd(weights);
}
parameters {
  real height_location_avg_weight;
  real weight_coeff;
  real<lower=0> height_scale;
}
model {
  heights ~ normal(height_location_avg_weight + weight_coeff * std_weights
                   , height_scale);
  height_location_avg_weight ~ normal(183, 120);
  height_scale ~ normal(0, 20);
  weight_coeff ~ normal(7, 10);
}

Weight regression fit

hash mismatch so recompiling; make sure Stan code ends with a blank line
Inference for Stan model: heights2.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                              mean se_mean   sd    2.5%   97.5% n_eff Rhat
height_location_avg_weight  154.61    0.00 0.27  154.07  155.13  3790    1
weight_coeff                  5.85    0.00 0.27    5.32    6.39  3071    1
height_scale                  5.11    0.00 0.19    4.74    5.49  3886    1
lp__                       -747.45    0.03 1.21 -750.61 -746.09  1658    1

Samples were drawn using NUTS(diag_e) at Tue Dec  5 13:46:00 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

How are we doing?

sum(abs(height_data$height - alpha.simple))
[1] 2259.431

Now incorporate weight_coeff to get \(\hat{y_i} = \beta x_i + \alpha\)

samples = extract(fit2)
alpha = mean(samples$height_location_avg_weight)
beta = mean(samples$weight_coeff)
std_weight = (height_data$weight - mean(height_data$weight)) / sd(height_data$weight)
sum(abs(height_data$height - (beta * std_weight + alpha)))
[1] 1390.102

Interpretation

sd(height_data$weight) #kg
[1] 6.456708

Futher directions?