Chapter 9 Survival Analysis
Survival analysis is concerned with the statistical modelling of the time to ‘failure’, measured from a well-defined origin or starting point. For instance:
the time for a hard drive to fail, measured from when it was built or bought (computer science),
the time for a patient to die, measured from when the disease was diagnosed (medicine).
9.1 Distributions
Survival times have two specific features:
the times are non-negative and have skewed distributions with long tails,
some subjects may survive beyond the end of the study, so their failure time is not observed. In this case, the data are said to be censored.
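For reference, the two distributions used in this chapter are the exponential and the Weibull. Their densities are not written out above; the forms below are the ones obtained by differentiating the distribution functions \(F\) given in Section 9.2, for \(y\geq 0\), \(\theta>0\) and \(\lambda>0\): \[p(y|\theta)=\theta\ \exp(-\theta\ y)\] \[p(y|\theta,\lambda)=\lambda\ \theta\ y^{\lambda-1}\ \exp\left(-\theta\ y^{\lambda}\right)\] The exponential distribution is the special case \(\lambda=1\) of the Weibull distribution.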
Exercise:
Show that \(\mathbb{E}[y]=\frac{1}{\theta}\) for the exponential distribution.
Exercise:
Show that \(\mathbb{E}[y]=\left(\frac{1}{\theta}\right)^{1/\lambda} \Gamma(1+\frac{1}{\lambda})\) for the Weibull distribution with \(\Gamma(u)=\int_0^{+\infty} s^{u-1}\ \exp(-s) \ ds\).
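Before attempting the proofs, the two formulas can be checked numerically. The sketch below is only a sanity check, not a derivation: it assumes the densities \(\theta \exp(-\theta y)\) and \(\lambda \theta y^{\lambda-1}\exp(-\theta y^{\lambda})\), uses illustrative parameter values, and approximates \(\int y\ f(y)\ dy\) with a simple midpoint rule on a truncated interval. The function names are hypothetical.

```python
import math

def exponential_pdf(y, theta):
    """Exponential density: theta * exp(-theta * y)."""
    return theta * math.exp(-theta * y)

def weibull_pdf(y, theta, lam):
    """Weibull density: lam * theta * y^(lam - 1) * exp(-theta * y^lam)."""
    return lam * theta * y ** (lam - 1) * math.exp(-theta * y ** lam)

def numerical_mean(pdf, upper, steps=200_000):
    """Midpoint-rule approximation of the integral of y * pdf(y) over [0, upper]."""
    width = upper / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * width
        total += y * pdf(y)
    return total * width

theta, lam = 0.5, 1.5  # illustrative values, not taken from any data set

# The upper limits are chosen so that the neglected tails are negligible.
exp_numeric = numerical_mean(lambda y: exponential_pdf(y, theta), upper=100.0)
exp_closed = 1.0 / theta

wei_numeric = numerical_mean(lambda y: weibull_pdf(y, theta, lam), upper=50.0)
wei_closed = (1.0 / theta) ** (1.0 / lam) * math.gamma(1.0 + 1.0 / lam)

print(f"exponential: numeric={exp_numeric:.5f}, closed form={exp_closed:.5f}")
print(f"Weibull:     numeric={wei_numeric:.5f}, closed form={wei_closed:.5f}")
```

Agreement between the numerical and closed-form values supports the formulas but does not replace the integration asked for in the exercises.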
9.2 Survivor and hazard functions
The random variable \(y\) denotes the survival time and \(f(\cdot)\) is its p.d.f. (i.e. either the exponential \(p_{y|\theta}\) or Weibull \(p_{y|\lambda\theta}\)).
The probability of failure before a specific time \(y\) is: \[F(y)=\mathbb{P}(0 \leq t\leq y)=\int_{0}^{y} f(t) \ dt\] The median survival time is the solution of the equation \(F(y)=0.5\); it serves as a measure of the typical survival time.
The survivor function is the probability of survival beyond time \(y\): \[S(y)=\int_{y}^{+\infty} f(t) \ dt =1-F(y)\]
The hazard function is defined as: \[h(y)=\frac{f(y)}{S(y)}=-\frac{d \log(S(y))}{dy}\] \(h(y)\) can be understood as an instantaneous failure rate: \(h(y)\ \delta y\) is approximately the probability of failure between time \(y\) and \(y+\delta y\) (with \(\delta y \rightarrow 0\)) given that the subject has survived up to time \(y\). Its antiderivative \(H(y)=\int_0^y h(t)\ dt=-\log(S(y))\) is called the cumulative hazard function.
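The relations between \(f\), \(S\), \(h\) and \(H\) can be verified numerically. The following is a minimal sketch that uses a Weibull distribution with assumed parameter values as a test case: it compares \(f(y)/S(y)\) with a finite-difference derivative of \(-\log(S(y))\), and the integral of \(h\) with \(-\log(S(y))\). All function names are hypothetical.

```python
import math

THETA, LAM = 0.5, 1.5  # illustrative Weibull parameters (assumed values)

def pdf(y):
    """Weibull density f(y)."""
    return LAM * THETA * y ** (LAM - 1) * math.exp(-THETA * y ** LAM)

def survivor(y):
    """Weibull survivor function S(y)."""
    return math.exp(-THETA * y ** LAM)

def hazard(y):
    """Hazard as the ratio h(y) = f(y) / S(y)."""
    return pdf(y) / survivor(y)

def hazard_from_log_survivor(y, step=1e-6):
    """Hazard as a central finite difference of -log S(y)."""
    return (math.log(survivor(y - step)) - math.log(survivor(y + step))) / (2 * step)

def cumulative_hazard(y, steps=10_000):
    """Midpoint-rule approximation of the integral of h(t) over [0, y]."""
    width = y / steps
    return sum(hazard((k + 0.5) * width) for k in range(steps)) * width

y = 2.0
print("h(y) two ways:", hazard(y), hazard_from_log_survivor(y))
print("H(y) two ways:", cumulative_hazard(y), -math.log(survivor(y)))
```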
9.2.1 Survivor and hazard functions for the exponential distribution
\(F_{\theta}(y)=1-\exp(-\theta \ y)\). The median survival time is \(\log(2)/\theta\); because the exponential distribution is skewed, this is a more appropriate description of the typical survival time than the mean \(\mathbb{E}[y]=1/\theta\).
\(S_{\theta}(y)=\exp(-\theta \ y)\)
\(h_{\theta}(y)=\theta\). The hazard function does not depend on \(y\), so the probability of failure in the interval \([y;y+\delta y]\), given survival up to \(y\), is not related to how long the subject has already survived. This memoryless property can be a limitation when the probability of failure increases with time.
\(H_{\theta}(y)=\theta\ y\)
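The memoryless property and the gap between median and mean can be illustrated with a few lines of code. This is a sketch with an arbitrary rate \(\theta=0.2\); the helper names are hypothetical. It checks that \(\mathbb{P}(y>s+t \mid y>s)=S(s+t)/S(s)\) equals \(S(t)\) whatever the time \(s\) already survived.

```python
import math

def exponential_survivor(y, theta):
    """S(y) = exp(-theta * y)."""
    return math.exp(-theta * y)

def conditional_survival(s, t, theta):
    """P(y > s + t | y > s) = S(s + t) / S(s)."""
    return exponential_survivor(s + t, theta) / exponential_survivor(s, theta)

theta = 0.2  # illustrative rate (assumed value)
median = math.log(2) / theta
mean = 1.0 / theta
print(f"median={median:.3f}, mean={mean:.3f}")

# The chance of surviving 3 more time units is the same after 0, 5 or 20 units.
for s in (0.0, 5.0, 20.0):
    print(s, conditional_survival(s, 3.0, theta))
```

The median is smaller than the mean here, which reflects the long right tail of the distribution.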
9.2.2 Survivor and hazard functions for the Weibull distribution
\(F_{\theta,\lambda}(y)=1-\exp\left( -\theta\ y^\lambda \right)\). The median survival time is then \(\theta^{-1/\lambda}\ (\log(2))^{1/\lambda}\).
\(S_{\theta,\lambda}(y)=\exp\left( -\theta\ y^\lambda \right)\)
\(h_{\theta,\lambda}(y)=\lambda\ \theta\ y^{\lambda-1}\). When \(\lambda\neq 1\), the hazard function depends on \(y\), so the probability of failure in the interval \([y;y+\delta y]\), given survival up to \(y\), is related to how long the subject has already survived: the hazard increases with time when \(\lambda>1\) and decreases when \(\lambda<1\). This allows the failure rate to accelerate over time.
\(H_{\theta,\lambda}(y)=\theta\ y^\lambda\)
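To see how the shape parameter \(\lambda\) changes the behaviour of the hazard, the sketch below evaluates \(h_{\theta,\lambda}\) on a small grid of times for three values of \(\lambda\), and checks the median formula against \(F_{\theta,\lambda}\). The parameter values and function names are assumptions chosen for illustration.

```python
import math

def weibull_cdf(y, theta, lam):
    """F(y) = 1 - exp(-theta * y^lam)."""
    return 1.0 - math.exp(-theta * y ** lam)

def weibull_hazard(y, theta, lam):
    """h(y) = lam * theta * y^(lam - 1)."""
    return lam * theta * y ** (lam - 1)

def weibull_median(theta, lam):
    """Median: theta^(-1/lam) * log(2)^(1/lam)."""
    return (math.log(2) / theta) ** (1.0 / lam)

theta = 0.5                    # illustrative value
times = [0.5, 1.0, 2.0, 4.0]   # increasing evaluation times

for lam in (0.5, 1.0, 2.0):
    hazards = [weibull_hazard(y, theta, lam) for y in times]
    print(f"lambda={lam}: hazards={hazards}, median={weibull_median(theta, lam):.3f}")
```

With \(\lambda=0.5\) the hazards decrease along the grid, with \(\lambda=1\) they are constant (the exponential case), and with \(\lambda=2\) they increase.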
9.3 Link function
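The expectation of the survival time, \(\mathbb{E}[y]=\left(\frac{1}{\theta}\right)^{1/\lambda} \Gamma(1+\frac{1}{\lambda})\), is an element of \(\mathbb{R}^{+}\). Hence a convenient invertible link function \(g\), mapping \(\mathbb{R}^{+}\) to \(\mathbb{R}\) where the linear predictor \(x^T\beta\) lives, is the \(\log\) function: \[g\left(\mathbb{E}[y]\right)=\log\left(\mathbb{E}[y]\right)=x^T\beta\] Since \(\log\left(\mathbb{E}[y]\right)=-\frac{1}{\lambda}\log(\theta)+\log\Gamma(1+\frac{1}{\lambda})\) is linear in \(\log(\theta)\), this amounts, up to a change of sign and scale of \(\beta\), to modelling \(\log(\theta)\) linearly: \[\theta\propto \exp \left(x^T\beta\right)\]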
9.4 Estimation of \(\beta\) with the Likelihood
The likelihood function can be used as the objective function: the parameters \(\beta\) are estimated by maximising it.
9.4.1 With uncensored data
Having collected \(N\) responses with the values of their explanatory variables \(\lbrace (y_i,x_i) \rbrace_{i=1,\cdots,N}\), the likelihood is: \[\mathcal{L}(\beta)=\prod_{i=1}^N p(y_i|\theta_i,\lambda) \quad \text{constrained by } \theta_i=\exp(x_i^{T}\beta)\]
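In practice the logarithm of the likelihood is maximised. The sketch below writes the Weibull log-likelihood with \(\theta_i=\exp(x_i^T\beta)\), assuming the density \(\lambda\theta y^{\lambda-1}\exp(-\theta y^{\lambda})\). It uses a made-up data set and an intercept-only exponential model (\(\lambda=1\)), for which the maximiser has the closed form \(\hat\theta=N/\sum_i y_i\). The function name and the data are hypothetical.

```python
import math

def weibull_log_likelihood(beta, lam, times, covariates):
    """Sum over i of log p(y_i | theta_i, lam), with theta_i = exp(x_i^T beta)."""
    total = 0.0
    for y, x in zip(times, covariates):
        log_theta = sum(b * xj for b, xj in zip(beta, x))
        total += (math.log(lam) + log_theta + (lam - 1) * math.log(y)
                  - math.exp(log_theta) * y ** lam)
    return total

# Made-up survival times, all observed (no censoring).
times = [2.0, 5.0, 1.5, 8.0, 3.5]
covariates = [[1.0]] * len(times)  # intercept only: x_i^T beta = beta_0

# Closed-form maximiser for the exponential case: theta_hat = N / sum(y_i).
beta_hat = [math.log(len(times) / sum(times))]

for shift in (-0.1, 0.0, 0.1):
    beta = [beta_hat[0] + shift]
    print(shift, weibull_log_likelihood(beta, 1.0, times, covariates))
```

With covariates or \(\lambda\neq 1\) there is in general no closed form, and the same function would be handed to a numerical optimiser.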
9.4.2 With censored data
Unfortunately, in many applications, the values of some responses are only partially known, for instance when the study ends before the failure time has been observed, or when a subject is lost to follow-up during the study period. In this case, the collected responses are censored. For a censored response, the survival time is at least \(y_i\) and the probability associated with this response is then \(S(y_i)\) (i.e. the probability of having survived beyond time \(y_i\)). Let us define \(\delta_i\) as an indicator variable that is 0 when the response \(y_i\) is censored and 1 when it is not. The likelihood can then be written: \[\mathcal{L}(\beta)=\prod_{i=1}^N p(y_i|\theta_i,\lambda) ^{\delta_i}\ S(y_i)^{1-\delta_i} \quad \text{with } \theta_i=\exp(x_i^{T}\beta)\] The data to analyse are then better expressed as \(\left\lbrace \left((y_i,\delta_i),x_i\right)\right \rbrace_{i=1,\cdots,N}\) and, in R, when using survreg to fit the data, the formula is written as \((y,\delta)\sim x\), i.e. Surv(y, delta) ~ x (the indicator variable \(\delta\) is encapsulated with the response).
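The censored likelihood translates directly into code: each observation contributes \(\log p(y_i|\theta_i,\lambda)\) when \(\delta_i=1\) and \(\log S(y_i)\) when \(\delta_i=0\). The sketch below assumes the Weibull forms of this chapter and a made-up data set; it is not the algorithm used by survreg. For an intercept-only exponential model the maximiser is \(\hat\theta=\sum_i\delta_i/\sum_i y_i\), which is compared with the naive estimate that treats censored times as failures.

```python
import math

def censored_weibull_log_likelihood(beta, lam, times, events, covariates):
    """Sum of delta_i * log p(y_i) + (1 - delta_i) * log S(y_i), theta_i = exp(x_i^T beta)."""
    total = 0.0
    for y, delta, x in zip(times, events, covariates):
        log_theta = sum(b * xj for b, xj in zip(beta, x))
        log_survivor = -math.exp(log_theta) * y ** lam  # log S(y_i)
        log_density = (math.log(lam) + log_theta
                       + (lam - 1) * math.log(y) + log_survivor)  # log p(y_i)
        total += delta * log_density + (1 - delta) * log_survivor
    return total

# Made-up data: events[i] is 1 for an observed failure, 0 for a censored time.
times = [2.0, 5.0, 1.5, 8.0, 3.5]
events = [1, 0, 1, 0, 1]
covariates = [[1.0]] * len(times)  # intercept only

theta_censored = sum(events) / sum(times)  # maximiser of the censored likelihood
theta_naive = len(times) / sum(times)      # wrongly treats censored times as failures

print("theta accounting for censoring:", theta_censored)
print("theta ignoring censoring:      ", theta_naive)
```

Ignoring censoring overstates the failure rate in this example, because the censored subjects are known to have survived at least as long as their recorded times.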
9.5 Proportional hazard models
For the exponential distribution, where \(x\) is an indicator variable, the hazard function is: \[h(y)=\theta=\exp(x^{T}\beta)=\exp(\beta_0+\beta_1 \ x)\propto \exp(\beta_1 \ x)\] For the Weibull distribution, the hazard function is: \[h(y)=\lambda \theta y^{\lambda-1}=\lambda y^{\lambda-1} \ \exp(\beta_0+\beta_1 \ x)=h_0(y)\ \exp(\beta_1 \ x)\] where \(h_0(y)=\lambda y^{\lambda-1}\exp(\beta_0)\) is the baseline hazard, i.e. the hazard when \(x=0\). In both cases the ratio \[\frac{h^{(x=1)}(y)}{h^{(x=0)}(y)}=\frac{\exp(\beta_1 \ 1)}{\exp(\beta_1 \ 0)}=\exp(\beta_1)\] does not depend on \(y\); it is called the hazard ratio for presence or absence of exposure to \(x\).
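The defining feature of a proportional hazard model is that this ratio is the same at every time \(y\). A short sketch with assumed coefficient values (\(\beta_0\), \(\beta_1\) and \(\lambda\) are illustrative, not fitted) makes this concrete for the Weibull hazard; the function names are hypothetical.

```python
import math

def weibull_hazard(y, x, beta0, beta1, lam):
    """h(y) = lam * theta * y^(lam - 1) with theta = exp(beta0 + beta1 * x)."""
    theta = math.exp(beta0 + beta1 * x)
    return lam * theta * y ** (lam - 1)

def hazard_ratio(y, beta0, beta1, lam):
    """Ratio of the hazard for x = 1 to the hazard for x = 0 at time y."""
    return (weibull_hazard(y, 1, beta0, beta1, lam)
            / weibull_hazard(y, 0, beta0, beta1, lam))

beta0, beta1, lam = -1.0, -0.7, 1.5  # illustrative values (assumed)

for y in (0.5, 1.0, 5.0, 20.0):
    print(y, hazard_ratio(y, beta0, beta1, lam), math.exp(beta1))
```

The individual hazards change with \(y\), but their ratio stays equal to \(\exp(\beta_1)\).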
Example of remission time.
\(x=0\) indicates that a placebo has been given to the patient and \(x=1\) indicates that a treatment (a specific drug) has been given. The hazard ratio \(\frac{h^{(x=1)}(y)}{h^{(x=0)}(y)} \simeq 1/4\) indicates that the drug is helping patients: at any given time, they are 4 times more likely to ‘fail’ (i.e. for their remission to end) when they take the placebo.
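As a worked reading of this example, taking the value of about \(1/4\) at face value (the fitted coefficients themselves are not shown here), the corresponding coefficient is \[\beta_1=\log(1/4)\approx -1.39\] Because the hazard ratio is constant in time, the cumulative hazards satisfy \(H^{(x=1)}(y)=\exp(\beta_1)\ H^{(x=0)}(y)\), and since \(S(y)=\exp(-H(y))\): \[S^{(x=1)}(y)=\left[S^{(x=0)}(y)\right]^{\exp(\beta_1)}\approx\left[S^{(x=0)}(y)\right]^{1/4}\] For instance, at the time when half of the placebo group is still in remission, \(S^{(x=0)}(y)=0.5\), the model gives \(S^{(x=1)}(y)\approx 0.5^{1/4}\approx 0.84\) for the treated group.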