The !Kung San are the most famous foraging population of the 20th century, largely because they were the subject of the most detailed quantitative studies, conducted by anthropologists such as Nancy Howell in the late 1960s.
I’ve excluded the children from the data for now, so we’re only modeling adult heights.
str(height_data)
'data.frame': 352 obs. of 4 variables:
$ height: num 152 140 137 157 145 ...
$ weight: num 47.8 36.5 31.9 53 41.3 ...
$ age : num 63 63 65 41 51 35 32 27 19 54 ...
$ male : int 1 0 0 1 0 1 0 1 0 1 ...
mean(height_data$height)
[1] 154.5971
sd(height_data$height)
[1] 7.742332
rnorm(1, mean(height_data$height), sd(height_data$height))
[1] 145.2516
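To see how well a single normal distribution describes the data, we can simulate many draws and compare their summary statistics to the sample's. A minimal sketch, plugging in the sample mean and standard deviation computed above:

```r
# Simulate 10,000 heights from a normal with the sample mean and sd above.
set.seed(42)
sim_heights <- rnorm(1e4, mean = 154.5971, sd = 7.742332)

# These should land close to the data's summary statistics.
mean(sim_heights)
sd(sim_heights)

# Roughly 95% of simulated adults fall within about 2 sd of the mean.
quantile(sim_heights, c(0.025, 0.975))
```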
data {
int num_people;
real heights[num_people];
}
parameters { ??? }
model { ??? }
Write heights.stan in your working directory by taking the skeleton above and filling in the parameters and model blocks.
data {
int num_people;
real heights[num_people];
}
parameters {
real height_location;
real<lower=0> height_scale;
}
model {
  // weakly informative priors
  height_location ~ normal(183, 120);
  height_scale ~ normal(0, 20);
  // likelihood
  heights ~ normal(height_location, height_scale);
}
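The fit summarized below was produced with rstan. A sketch of the call, assuming the model is saved as heights.stan and the data frame is loaded as height_data (rstan's defaults match the output: 4 chains, 2000 iterations each):

```r
library(rstan)

# Pass the data as a named list whose names match the Stan data block.
stan_data <- list(num_people = nrow(height_data),
                  heights    = height_data$height)

# Compile the model and sample with rstan's defaults.
fit1 <- stan(file = "heights.stan", data = stan_data)
print(fit1)
```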
Inference for Stan model: heights1.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
height_location 154.60 0.01 0.42 153.77 154.60 155.41 2846 1
height_scale 7.78 0.00 0.29 7.23 7.77 8.38 3500 1
lp__ -895.02 0.02 0.99 -897.64 -894.72 -894.02 1967 1
Samples were drawn using NUTS(diag_e) at Tue Dec 5 13:45:33 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
samples.simple = extract(fit1)
alpha.simple = mean(samples.simple$height_location)
sum(abs(height_data$height - alpha.simple))
[1] 2259.431
We’ll look at better ways of evaluating this fit later; for now we’ll try to reduce the residual standard deviation of height by introducing weight as a predictive feature and adding a regression coefficient for it.
data {
int num_people;
vector[num_people] weights;
real heights[num_people];
}
transformed data {
vector[num_people] std_weights =
(weights - mean(weights)) / sd(weights);
}
parameters {
real height_location_avg_weight;
real weight_coeff;
real<lower=0> height_scale;
}
model {
  // priors
  height_location_avg_weight ~ normal(183, 120);
  weight_coeff ~ normal(7, 10);
  height_scale ~ normal(0, 20);
  // likelihood, with standardized weight as a linear predictor
  heights ~ normal(height_location_avg_weight + weight_coeff * std_weights,
                   height_scale);
}
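Assuming the extended model is saved as heights2.stan, fitting it mirrors the first model, with the weights passed alongside the heights:

```r
# The data block now also expects a vector of weights.
stan_data2 <- list(num_people = nrow(height_data),
                   weights    = height_data$weight,
                   heights    = height_data$height)

fit2 <- stan(file = "heights2.stan", data = stan_data2)
print(fit2)
```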
hash mismatch so recompiling; make sure Stan code ends with a blank line
Inference for Stan model: heights2.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 97.5% n_eff Rhat
height_location_avg_weight 154.61 0.00 0.27 154.07 155.13 3790 1
weight_coeff 5.85 0.00 0.27 5.32 6.39 3071 1
height_scale 5.11 0.00 0.19 4.74 5.49 3886 1
lp__ -747.45 0.03 1.21 -750.61 -746.09 1658 1
Samples were drawn using NUTS(diag_e) at Tue Dec 5 13:46:00 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
For comparison, recall the total absolute error of the intercept-only model:

sum(abs(height_data$height - alpha.simple))
[1] 2259.431
Now incorporate weight_coeff to get \(\hat{y}_i = \beta x_i + \alpha\), where \(x_i\) is the standardized weight.
samples = extract(fit2)
alpha = mean(samples$height_location_avg_weight)
beta = mean(samples$weight_coeff)
std_weight = (height_data$weight - mean(height_data$weight)) / sd(height_data$weight)
sum(abs(height_data$height - (beta * std_weight + alpha)))
[1] 1390.102
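Dividing each error sum by the 352 adults gives a mean absolute error per person, which makes the improvement easier to read. These are just the two totals above divided by n:

```r
n <- 352       # number of adults in the data
2259.431 / n   # intercept-only model: about 6.42 cm off per person
1390.102 / n   # model with weight:    about 3.95 cm off per person
```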
sd(height_data$weight) #kg
[1] 6.456708