Survival Models
Survival models apply to animals and plants as well as inanimate objects such as machine parts or electrical components. Survival models arise when there is an event of interest for a group of subjects, machine components, or other items that is
- certain to occur after some amount of time,
- but only measured for a fixed period of time, during which the event may not have occurred for all subjects.
For example, one might wish to estimate the distribution of time to failure for solid state drives in a data center, but only measure drives for a two year period, after which some number will have failed and some will still be in service.
Survival models are often used comparatively, such as comparing time to death of patients diagnosed with stage one liver cancer under a new treatment and a standard treatment (pure controls are not allowed when there is an effective existing treatment for a serious condition). During a two year trial, some patients will die and others will survive.
Survival models may involve covariates, such as the factory at which a component is manufactured, the day on which it is manufactured, and the amount of usage it gets. A clinical trial might be adjusted for the sex and age of a cancer patient or the hospital at which treatment is received.
Survival models come in two main flavors, parametric and semi-parametric. In a parametric model, the survival time of a subject is modeled explicitly using a parametric probability distribution. There is a great deal of flexibility in how the parametric probability distribution is constructed. The sections below consider exponential and Weibull distributed survival times.
Rather than explicitly modeling a parametric survival probability, semi-parametric survival models instead model the relative effect on survival of covariates. The final sections of this chapter consider the proportional hazards survival model.
Exponential survival model
The exponential distribution is commonly used in survival models where there is a constant risk of failure that does not go up the longer a subject survives. This is because the exponential distribution is memoryless in the sense that if $T$ has an exponential distribution, then $\Pr[T > t + s \mid T > t] = \Pr[T > s]$ for all $s, t > 0$; the probability of surviving an additional amount of time $s$ does not depend on how long the subject has already survived.
The exponential survival model has a single parameter for the rate, which assumes all subjects have the same distribution of failure time (this assumption is relaxed in the next section by introducing per-subject covariates). With the rate parameterization, the expected survival time for a component with survival time represented as the random variable $T \sim \textrm{exponential}(\lambda)$ is $\mathbb{E}[T \mid \lambda] = 1 / \lambda$.
The data for a survival model consists of two components. First, there is a vector $t \in (0, \infty)^N$ of $N$ observed failure times. Second, there is a number $N^{\textrm{cens}}$ of censored items, for which it is known only that they survived past a fixed censoring time $t^{\textrm{cens}}$.
The model for the observed failure times is exponential, so that for $n \in 1{:}N$, $t_n \sim \textrm{exponential}(\lambda)$.
The model for the censored failure times is also exponential. All that is known of a censored item is that its failure time is greater than the censoring time, so each censored item contributes a factor of $\Pr[T > t^{\textrm{cens}}] = 1 - F_T(t^{\textrm{cens}}) = \exp(-\lambda \, t^{\textrm{cens}})$ to the likelihood.
On the log scale, that’s $\log \Pr[T > t^{\textrm{cens}}] = -\lambda \, t^{\textrm{cens}}$.
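Putting the two pieces together, the joint log likelihood for the $N$ observed failure times and the $N^{\textrm{cens}}$ censored items is
$$\log p(t \mid \lambda) = \sum_{n=1}^{N} \left( \log \lambda - \lambda \, t_n \right) - N^{\textrm{cens}} \, \lambda \, t^{\textrm{cens}},$$
which is what the Stan model block below computes using a vectorized distribution statement and a single scaled lccdf term.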
The model can be completed with a standard lognormal prior on $\lambda$, $\lambda \sim \textrm{lognormal}(0, 1)$, which places the rate (and hence the expected failure time $1 / \lambda$) roughly within a factor of ten of 1 with high prior probability.
Stan program
The data for a simple survival analysis without covariates can be coded as follows.
data {
  int<lower=0> N;
  vector[N] t;
  int<lower=0> N_cens;
  real<lower=0> t_cens;
}
In this program, N is the number of uncensored observations and t contains the times of the uncensored observations. There are a further N_cens items that are right censored at time t_cens. Right censoring means that if the time to failure is greater than t_cens, it is only observed that the part survived until time t_cens. In the case where there are no covariates, the model only needs the number of censored items because they all share the same censoring time.
There is a single rate parameter, the inverse of which is the expected time to failure.
parameters {
  real<lower=0> lambda;
}
The exponential survival model and the prior are coded directly using vectorized distribution and ccdf statements. This both simplifies the code and makes it more computationally efficient by sharing computation across instances.
model {
  t ~ exponential(lambda);
  target += N_cens * exponential_lccdf(t_cens | lambda);
  lambda ~ lognormal(0, 1);
}
The likelihood for the rate lambda is just the density of the exponential distribution for the observed failure times. The Stan code is vectorized, modeling each entry of the vector t as having an exponential distribution with rate lambda. This data model could have been written as
for (n in 1:N) {
  t[n] ~ exponential(lambda);
}
The log likelihood contribution from the censored items is the number of censored items times the log complementary cumulative distribution function (lccdf) of the exponential distribution with rate lambda, evaluated at the censoring time. The log likelihood terms arising from the censored events could have been added to the target log density one at a time,
for (n in 1:N_cens) {
  target += exponential_lccdf(t_cens | lambda);
}
to define the same log density, but it is much more computationally efficient to multiply by a constant than to perform N_cens separate additions.
Weibull survival model
The Weibull distribution is a popular alternative to the exponential distribution in cases where the risk of failure does not remain constant as a subject gets older. The Weibull distribution models this by generalizing the exponential distribution to include a power-law trend.
The Weibull distribution is parameterized by a shape $\alpha > 0$ and a scale $\sigma > 0$. For an outcome $t \geq 0$, its probability density function is
$$\textrm{Weibull}(t \mid \alpha, \sigma) = \frac{\alpha}{\sigma} \left( \frac{t}{\sigma} \right)^{\alpha - 1} \exp\!\left( -\left( \frac{t}{\sigma} \right)^{\alpha} \right).$$
If $T \sim \textrm{Weibull}(\alpha, \sigma)$, then the expected survival time is $\mathbb{E}[T] = \sigma \, \Gamma\!\left(1 + \frac{1}{\alpha}\right)$, where $\Gamma$ is the gamma function.
There are three regimes of the Weibull distribution.
- $\alpha < 1$: A subject is more likely to fail early. When $\alpha < 1$, the Weibull density approaches infinity as $t \rightarrow 0$.
- $\alpha = 1$: The Weibull distribution reduces to the exponential distribution, with a constant rate of failure over time. When $\alpha = 1$, the Weibull density approaches $1 / \sigma$ as $t \rightarrow 0$.
- $\alpha > 1$: Subjects are less likely to fail early. When $\alpha > 1$, the Weibull density approaches zero as $t \rightarrow 0$.
With $\alpha = 1$ and $\sigma = \frac{1}{\lambda}$, the Weibull distribution reduces to the exponential distribution with rate $\lambda$, that is, $\textrm{Weibull}\!\left(t \mid 1, \frac{1}{\lambda}\right) = \textrm{exponential}(t \mid \lambda)$.
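As a quick check, substituting $\alpha = 1$ and $\sigma = 1 / \lambda$ into the density above gives
$$\textrm{Weibull}\!\left(t \,\middle|\, 1, \tfrac{1}{\lambda}\right) = \lambda \, (\lambda t)^{0} \exp(-\lambda t) = \lambda \exp(-\lambda t) = \textrm{exponential}(t \mid \lambda).$$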
Stan program
With Stan, one can just swap the exponential distribution for the Weibull distribution with the appropriate parameters and the model remains essentially the same. Recall the exponential model’s parameters and model block.
parameters {
  real<lower=0> lambda;
}
model {
  t ~ exponential(lambda);
  target += N_cens * exponential_lccdf(t_cens | lambda);
  lambda ~ lognormal(0, 1);
}
The Stan program for the Weibull model just swaps in the Weibull distribution and complementary cumulative distribution function with shape (alpha) and scale (sigma) parameters.
parameters {
  real<lower=0> alpha;
  real<lower=0> sigma;
}
model {
  t ~ weibull(alpha, sigma);
  target += N_cens * weibull_lccdf(t_cens | alpha, sigma);
  alpha ~ lognormal(0, 1);
  sigma ~ lognormal(0, 1);
}
As usual, if more is known about expected survival times, alpha and sigma should be given more informative priors.
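For example, here is a minimal sketch of the same block under the hypothetical assumption that survival times are expected to be on the order of 100 time units; because $\mathbb{E}[T] = \sigma \, \Gamma(1 + 1/\alpha) \approx \sigma$ when $\alpha$ is near 1, the scale's prior can be centered accordingly.
parameters {
  real<lower=0> alpha;
  real<lower=0> sigma;
}
model {
  t ~ weibull(alpha, sigma);
  target += N_cens * weibull_lccdf(t_cens | alpha, sigma);
  alpha ~ lognormal(0, 1);
  // hypothetical prior knowledge: expected survival times near 100 time units,
  // so center the scale's prior at log(100)
  sigma ~ lognormal(log(100), 1);
}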
Survival with covariates
Suppose that for each of the $N$ uncensored items and each of the $N^{\textrm{cens}}$ censored items there is a covariate (row) vector $x_n \in \mathbb{R}^{K}$ of $K$ real-valued predictors.
Survival with covariates replaces what is essentially a simple regression with only an intercept ($\lambda$) with a generalized linear model with a log link, where the rate for item $n$ is $\lambda_n = \exp(x_n \cdot \gamma)$ for a coefficient vector $\gamma \in \mathbb{R}^{K}$.
The covariates form an $N \times K$ matrix $x$ for the uncensored items and an $N^{\textrm{cens}} \times K$ matrix $x^{\textrm{cens}}$ for the censored items.
A Stan program for the exponential survival model with covariates is as follows. It relies on the fact that the order of failure times (t and t_cens) corresponds to the ordering of items in the covariate matrices (x and x_cens).
data {
  int<lower=0> N;
  vector[N] t;
  int<lower=0> N_cens;
  real<lower=0> t_cens;
  int<lower=0> K;
  matrix[N, K] x;
  matrix[N_cens, K] x_cens;
}
parameters {
  vector[K] gamma;
}
model {
  gamma ~ normal(0, 2);
  t ~ exponential(exp(x * gamma));
  target += exponential_lccdf(t_cens | exp(x_cens * gamma));
}
Both the distribution statement for uncensored times and the log density increment statement for censored times are vectorized, one in terms of the exponential distribution and one in terms of the log complementary cumulative distribution function.
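If per-item expected failure times are of interest, they can be recovered after fitting; the following generated quantities block is a minimal sketch that could be appended to the program above, using the fact that the expected failure time under an exponential model is the inverse rate, $1 / \lambda_n = \exp(-x_n \cdot \gamma)$.
generated quantities {
  // expected failure time for each uncensored item: 1 / lambda_n
  vector[N] expected_t = exp(-x * gamma);
  // expected failure time for each censored item
  vector[N_cens] expected_t_cens = exp(-x_cens * gamma);
}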
Hazard and survival functions
Suppose $T$ is a random variable representing a survival time, with a smooth cumulative distribution function $F_T(t) = \Pr[T \leq t]$, so that its probability density function is $p_T(t) = \frac{\textrm{d}}{\textrm{d}t} F_T(t)$.
The survival function $S(t)$ is the probability of surviving until at least time $t$, $S(t) = \Pr[T > t] = 1 - F_T(t)$. The survival function starts at $S(0) = 1$ and decreases monotonically toward zero as $t \rightarrow \infty$.
The hazard function $h(t)$ is the instantaneous risk of failure at time $t$ given survival up to time $t$, $h(t) = \frac{p_T(t)}{S(t)}$. The cumulative hazard function is $H(t) = \int_0^t h(u) \, \textrm{d}u$.
The hazard function and survival function are related through the differential equation $h(t) = -\frac{\textrm{d}}{\textrm{d}t} \log S(t)$, so that $S(t) = \exp\!\left( -\int_0^t h(u) \, \textrm{d}u \right) = \exp(-H(t))$.
If $T \sim \textrm{exponential}(\lambda)$, then the hazard function is constant, $h(t) = \lambda$. The constant hazard is another way of expressing the memorylessness of the exponential distribution.
If $T \sim \textrm{Weibull}(\alpha, \sigma)$, then the hazard function is $h(t) = \frac{\alpha}{\sigma} \left( \frac{t}{\sigma} \right)^{\alpha - 1}$, which is decreasing in $t$ when $\alpha < 1$, constant when $\alpha = 1$, and increasing when $\alpha > 1$.
If $\alpha = 1$, the Weibull hazard reduces to the constant hazard of the exponential distribution with rate $\lambda = 1 / \sigma$.
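As a check, both hazard functions follow directly from the definition $h(t) = p_T(t) / S(t)$ together with the survival functions $S(t) = \exp(-\lambda t)$ for the exponential distribution and $S(t) = \exp\!\left(-(t / \sigma)^{\alpha}\right)$ for the Weibull distribution:
$$h(t \mid \lambda) = \frac{\lambda \exp(-\lambda t)}{\exp(-\lambda t)} = \lambda,
\qquad
h(t \mid \alpha, \sigma) = \frac{\frac{\alpha}{\sigma} \left( \frac{t}{\sigma} \right)^{\alpha - 1} \exp\!\left( -\left( \frac{t}{\sigma} \right)^{\alpha} \right)}{\exp\!\left( -\left( \frac{t}{\sigma} \right)^{\alpha} \right)} = \frac{\alpha}{\sigma} \left( \frac{t}{\sigma} \right)^{\alpha - 1}.$$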
Proportional hazards model
The exponential model is parametric in that it specifies an explicit parametric form for the distribution of survival times. Cox (1972) introduced a semi-parametric survival model specified directly in terms of a hazard function $h(t)$ rather than a full parametric survival distribution.
Let $x_n \in \mathbb{R}^{K}$ denote the (row) vector of covariates for subject $n$.
In the semi-parametric, proportional hazards model, the baseline hazard function $h_0(t)$ is shared by all subjects, while covariates enter through a multiplicative factor, so that subject $n$'s hazard is $h(t \mid x_n) = h_0(t) \cdot \exp(x_n \cdot \beta)$, where $\beta \in \mathbb{R}^{K}$ is a vector of regression coefficients.
Cox’s proportional hazards model is not fully generative. There is no way to generate the times of failure because the baseline hazard function $h_0(t)$ is left unmodeled.
Partial likelihood function
Cox’s proportional specification of the hazard function is insufficient to generate random variates because the baseline hazard function $h_0(t)$ is unspecified. Nevertheless, the regression coefficients $\beta$ can be estimated through a partial likelihood that does not involve the baseline hazard.
The hazard function factors into a baseline hazard $h_0(t)$ shared by all subjects and a subject-specific factor $\exp(x_n \cdot \beta)$. Because the baseline hazard is common to every subject, it cancels from ratios of hazards, and it is such ratios that make up the partial likelihood.
Suppose there are $N$ subjects with observed failure times sorted in ascending order, $t_1 < t_2 < \cdots < t_N$, with covariate (row) vectors $x_1, \ldots, x_N$, plus $N^{\textrm{cens}}$ censored subjects whose censoring time exceeds every observed failure time and whose covariates form the rows of $x^{\textrm{cens}}$.
With the failure times sorted this way, the partial likelihood for each observed subject $n$ is the probability that it is subject $n$, out of all the subjects still at risk at time $t_n$, that fails at that time,
$$\frac{\exp(x_n \cdot \beta)}{\sum_{n' = n}^{N} \exp(x_{n'} \cdot \beta) + \sum_{n'' = 1}^{N^{\textrm{cens}}} \exp\!\left(x^{\textrm{cens}}_{n''} \cdot \beta\right)},$$
where the first sum is over the uncensored subjects whose failure times are at least $t_n$ and the second sum is over the censored subjects, all of whom remain at risk at $t_n$.
The partial likelihood, defined in this form by Breslow (1975), is just the product of the partial likelihoods for the observed subjects (i.e., excluding subjects whose failure time is censored).
This likelihood follows the same approach to ranking as that developed by Plackett (1975) for estimating the probability of the order of the first few finishers in a horse race.
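On the log scale, the log partial likelihood, which is the quantity the Stan program below accumulates one term at a time, is
$$\log L(\beta) = \sum_{n = 1}^{N} \left( x_n \cdot \beta - \log\!\left( \sum_{n' = n}^{N} \exp(x_{n'} \cdot \beta) + \sum_{n'' = 1}^{N^{\textrm{cens}}} \exp\!\left(x^{\textrm{cens}}_{n''} \cdot \beta\right) \right) \right).$$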
A simple normal prior on the components of $\beta$ completes the model; the Stan program below uses independent $\textrm{normal}(0, 2)$ priors.
Stan program
To simplify the Stan program, the survival times for uncensored events are sorted into decreasing order (unlike in the mathematical presentation, where they were sorted into ascending order). The covariates for censored and uncensored observations are separated into two matrices.
data {
  int<lower=0> K;          // num covariates
  int<lower=0> N;          // num uncensored obs
  vector[N] t;             // event time (non-strict decreasing)
  matrix[N, K] x;          // covariates for uncensored obs
  int N_c;                 // num censored obs
  real<lower=t[N]> t_c;    // censoring time
  matrix[N_c, K] x_c;      // covariates for censored obs
}
The parameters are just the coefficients.
parameters {
  vector[K] beta;  // slopes (no intercept)
}
The prior is a simple independent centered normal distribution on each element of the parameter vector, which is vectorized in the Stan code.
model {
  beta ~ normal(0, 2);
  ...
The log likelihood is implemented so as to minimize duplicated effort. The first order of business is to calculate the linear predictors, which is done separately for the subjects whose event time is observed and those for which the event time is censored.
  vector[N] log_theta = x * beta;
  vector[N_c] log_theta_c = x_c * beta;
These vectors are computed using efficient matrix-vector multiplies. The censored items' contribution to the denominator is the same in every factor of the partial likelihood, so it is computed once; on the log scale, the running denominator starts as the log sum of exponentials of the censored items' linear predictors.
  real log_denom = log_sum_exp(log_theta_c);
Then, for each observed survival time, going backwards from the latest to the earliest, the denominator can be incremented (which turns into a log sum of exponentials on the log scale), and then the target is updated with its likelihood contribution.
  for (n in 1:N) {
    log_denom = log_sum_exp(log_denom, log_theta[n]);
    target += log_theta[n] - log_denom;  // log likelihood
  }
The running log sum of exponentials is why the list is iterated in reverse order of survival times. It allows the log denominator to be accumulated one term at a time. The condition that the survival times are sorted into decreasing order is not checked. It could be checked very easily in the transformed data block by adding the following code.
transformed data {
  for (n in 2:N) {
    if (!(t[n] < t[n - 1])) {
      reject("times must be strictly decreasing, but found ",
             "!(t[", n, "] < t[", n - 1, "])");
    }
  }
}
Stan model for tied survival times
Technically, for continuous survival times, the probability of two survival times being identical will be zero. Nevertheless, real data sets often round survival times, for instance to the nearest day or week in a multi-year clinical trial. The technically “correct” thing to do in the face of unknown survival times in a range would be to treat their order as unknown and infer it. But considering all possible orderings of the tied survival times quickly becomes intractable as the number of ties grows, so Efron's approximation to the partial likelihood is used instead.
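Writing $D_j$ for the set of $d_j$ subjects that share the $j$-th distinct failure time and $R_j$ for the risk set of all subjects whose failure or censoring time is at least that failure time, Efron's approximation replaces the tied subjects' contribution to the partial likelihood with
$$\frac{\prod_{n \in D_j} \exp(x_n \cdot \beta)}{\prod_{\ell = 1}^{d_j} \left( \sum_{n' \in R_j} \exp(x_{n'} \cdot \beta) - \frac{\ell - 1}{d_j} \sum_{n'' \in D_j} \exp(x_{n''} \cdot \beta) \right)}.$$
On the log scale, the numerator is the sum of the tied subjects' linear predictors and each denominator factor corresponds to one entry of the vector diff in the program below.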
In the interest of completeness, here is the Stan code for an implementation of Efron’s estimator. It uses two user-defined functions. The first calculates how many different survival times occur in the data.
functions {
  int num_unique_starts(vector t) {
    if (size(t) == 0) return 0;
    int us = 1;
    for (n in 2:size(t)) {
      if (t[n] != t[n - 1]) us += 1;
    }
    return us;
  }
This is then used to compute the value J to send into the function that computes the positions in the array of failure times at which each new failure time starts, plus an end point that goes one past the final element. This is a standard way in Stan to code ragged arrays.
  array[] int unique_starts(vector t, int J) {
    array[J + 1] int starts;
    if (J == 0) return starts;
    starts[1] = 1;
    int pos = 2;
    for (n in 2:size(t)) {
      if (t[n] != t[n - 1]) {
        starts[pos] = n;
        pos += 1;
      }
    }
    starts[J + 1] = size(t) + 1;
    return starts;
  }
}
The data format is exactly the same as for the model in the previous section, but in this case, the transformed data block is used to cache some precomputations required for the model, namely the ragged array grouping elements that share the same survival time.
transformed data {
  int<lower=0> J = num_unique_starts(t);
  array[J + 1] int<lower=0> starts = unique_starts(t, J);
}
For each unique survival time j in 1:J, the subjects indexed from starts[j] to starts[j + 1] - 1 (inclusive) share the same survival time. The number of elements with survival time j is thus (starts[j + 1] - 1) - starts[j] + 1, or just starts[j + 1] - starts[j]. For example, if the sorted failure times are (9, 9, 7, 5, 5, 5), then J is 3 and starts is (1, 3, 4, 7), so the third group consists of the elements at positions 4 through 6.
The parameters and prior are also the same: just a vector beta of coefficients with a centered normal prior. Although it starts with the same caching of results for later and uses the same accumulator for the denominator, the overall partial likelihood is much more involved, and it depends on the transformed data variables J and starts defined by the user-defined functions.
  vector[N] log_theta = x * beta;
  vector[N_c] log_theta_c = x_c * beta;
  real log_denom_lhs = log_sum_exp(log_theta_c);
  for (j in 1:J) {
    int start = starts[j];
    int end = starts[j + 1] - 1;
    int len = end - start + 1;
    real log_len = log(len);
    real numerator = sum(log_theta[start:end]);
    log_denom_lhs = log_sum_exp(log_denom_lhs,
                                log_sum_exp(log_theta[start:end]));
    vector[len] diff;
    for (ell in 1:len) {
      diff[ell] = log_diff_exp(log_denom_lhs,
                               log(ell - 1) - log_len
                               + log_sum_exp(log_theta[start:end]));
    }
    target += numerator - sum(diff);
  }
The special function log_diff_exp is defined by $\textrm{log\_diff\_exp}(u, v) = \log(\exp(u) - \exp(v))$.
Because of how J and starts are constructed, the length len will always be strictly positive, so that the log is well defined.
References
Footnotes
Cox mentioned in his seminal paper that modeling the baseline hazard function would improve statistical efficiency, but he did not do it for computational reasons.