# Integrating survival curves

· 9 minutes read

A survival curve `\(S(t)\)`

represents the probability of a given event to happen at a given time point `\(t\)`

.
It is often estimated using the Kaplan-Meier (product-limit) estimator.

However, sometimes it is useful to calculate the integral of a survival function, e.g. when computing restricted mean survival time.

The restricted mean is a measure of average survival from time 0 to a specified time point, and may be estimated as the area under the survival curve up to that point.

Formally, the restricted mean survival time `\(\mu\)`

is the mean of the survival time, limited to some horizon time `\(t^*\)`

.
It equals the area under the survival curve `\(S(t)\)`

from time zero to `\(t^*\)`

:

$$ \mu = \int_0^{t^*} S(u) \ du $$

Most interestingly, the *difference* in restricted mean survival time between groups provides an alternative measure of treatment effect.
Remember: the hazard ratio is *not* (in general) an appropriate measure of treatment effect in randomised trials with time to event outcomes.

The restricted mean survival time can be easily computed analytically if `\(S(t)\)`

follows a (piecewise) exponential distribution.
Otherwise, we can always compute the integral above numerically.
The goal of this post is to show how to do so.

# Data

Every example needs an example dataset: here I’ll be using data from the German breast cancer study, which comes bundled with Stata.

We can import it in R directly from the web:

```
library(haven)
brcancer <- read_dta("https://www.stata-press.com/data/r16/brcancer.dta")
brcancer
```

```
## # A tibble: 686 x 14
## id hormon x1 x2 x3 x4 x5 x6 x7 rectime censrec x4a
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 70 2 21 2 3 48 66 1814 1 1
## 2 2 1 56 2 12 2 7 61 77 2018 1 1
## 3 3 1 58 2 35 2 9 52 271 712 1 1
## 4 4 1 59 2 17 2 4 60 29 1807 1 1
## 5 5 0 73 2 35 2 1 26 65 772 1 1
## 6 6 0 32 1 57 3 24 0 13 448 1 1
## 7 7 1 59 2 8 2 2 181 0 2172 0 1
## 8 8 0 65 2 16 2 1 192 25 2161 0 1
## 9 9 0 80 2 39 2 30 0 59 471 1 1
## 10 10 0 66 2 18 2 7 0 3 2014 0 1
## # … with 676 more rows, and 2 more variables: x4b <dbl>, x5e <dbl>
```

For simplicity, I will only include the treatment of interest (`hormon`

, which represents hormonal therapy) as a covariate in the model.
The survival time is `rectime`

(in days), and the event indicator variable is `censrec`

.

I am also creating a new variable `hormon2`

that will be useful for plotting:

```
brcancer$hormon2 <- ifelse(brcancer$hormon == 0, "Control arm", "Treatment arm")
```

# Modelling

I fit a flexible parametric model using the `rstpm2`

package and assuming 5 degrees of freedom for the baseline hazard:

```
library(rstpm2)
fit <- stpm2(Surv(rectime, censrec) ~ hormon, data = brcancer, df = 5)
summary(fit)
```

```
## Maximum likelihood estimation
##
## Call:
## stpm2(formula = Surv(rectime, censrec) ~ hormon, data = brcancer,
## df = 5)
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## (Intercept) -6.82308 0.78776 -8.6613 < 2.2e-16 ***
## hormon -0.36445 0.12493 -2.9173 0.00353 **
## nsx(log(rectime), df = 5)1 5.35779 0.77579 6.9062 4.977e-12 ***
## nsx(log(rectime), df = 5)2 5.88972 0.78949 7.4602 8.640e-14 ***
## nsx(log(rectime), df = 5)3 5.01742 0.50802 9.8763 < 2.2e-16 ***
## nsx(log(rectime), df = 5)4 9.94593 1.53864 6.4641 1.019e-10 ***
## nsx(log(rectime), df = 5)5 4.92988 0.36824 13.3875 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 5213.05
```

The estimated hazard ratio is 0.695… but we don’t care about this measure, right?

A benefit of modelling the baseline hazard (as compared to a Cox model) is that we can easily obtain smooth fitted survival curves for each combination of covariates of interest. For instance, the fitted survival curves for the two treatment arms are:

```
# Prediction:
fitted_curves <- predict(fit, type = "surv", se.fit = TRUE)
brcancer$S_hat <- fitted_curves$Estimate
brcancer$S_hat_lower <- fitted_curves$lower
brcancer$S_hat_upper <- fitted_curves$upper
# Plot:
ggplot(brcancer, aes(x = rectime, y = S_hat, linetype = hormon2)) +
geom_ribbon(aes(ymin = S_hat_lower, ymax = S_hat_upper), alpha = 0.25) +
geom_line() +
coord_cartesian(ylim = c(0, 1)) +
scale_x_continuous(breaks = 365 * seq(7)) +
theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
labs(x = "Follow-up time (days)", y = "Fitted survival", linetype = "")
```

So easy!

# Computing restricted mean survival time

Now, let’s get down to business.

The fitted model is a flexible parametric model; in this setting, it is not straightforward to obtain an analytical solution.

We can, however, use spline-interpolation followed by numerical integration to compute the above integral very easily, and without needing to install any extra package. It will be a three-steps procedure:

- We obtain a fine grid of predictions for each survival curve that we want to integrate (over follow-up time);
- We obtain a function representing the spline interpolation of each curve;
- We integrate the function that we just obtained.
- Piece of cake!

*Ok, there are four steps above, but don’t we deserve some cake?*

Step number one:

```
data_grid <- expand.grid(
rectime = seq(.Machine$double.eps, 365 * 5, length.out = 100),
hormon = unique(brcancer$hormon)
)
```

Here we set `\(t^*\)`

as 5 years, hence creating an appropriate grid for time and each treatment arm.
Also, we start from `.Machine$double.eps`

(roughly 2e-16) instead of zero as survival times need to be strictly positive; nevertheless, the difference will be negligible.

Then, onto step two. We predict the survival curve based on this grid:

```
data_grid$fitted <- predict(fit, newdata = data_grid, type = "surv")
```

Done.

Step three consists of fitting an interpolating spline for each fitted survival curve:

```
data_grid_0 <- data_grid[data_grid$hormon == 0, ]
int_spline_0 <- splinefun(
x = data_grid_0$rectime,
y = data_grid_0$fitted,
method = "natural"
)
data_grid_1 <- data_grid[data_grid$hormon == 1, ]
int_spline_1 <- splinefun(
x = data_grid_1$rectime,
y = data_grid_1$fitted,
method = "natural"
)
```

Here I am using a natural spline for the interpolation, but other methods work well too (in my experience) given that the function we are trying to interpolate is quite regular.

The cool thing about `splinefun`

is that it returns a function that we can call.
For instance, to obtain fitted (interpolated) survival at one year:

```
int_spline_0(365)
```

```
## [1] 0.9072905
```

```
int_spline_1(365)
```

```
## [1] 0.9346556
```

Finally, step four consist of using the built-in function `integrate`

to (*what a surprise!*) integrate the two functions we just obtained:

```
RMST_0 <- integrate(f = int_spline_0, lower = 0, upper = 365 * 5)$value
RMST_1 <- integrate(f = int_spline_1, lower = 0, upper = 365 * 5)$value
```

We now have restricted mean survival time (in days) for both treatment arms:

```
RMST_0
```

```
## [1] 1266.97
```

```
RMST_1
```

```
## [1] 1406.486
```

In years, that will be:

```
RMST_0 / 365
```

```
## [1] 3.471151
```

```
RMST_1 / 365
```

```
## [1] 3.853386
```

Calculating the difference between the two is now trivial:

```
RMST_1 - RMST_0
```

```
## [1] 139.5155
```

```
(RMST_1 - RMST_0) / 365
```

```
## [1] 0.3822341
```

Turns out, women treated with hormonal therapy have a 0.382-years longer time to recurrence compared to controls (and focussing on the first five years of follow-up only). Cool!

We would need to quantify the uncertainty in our estimate (and there is a closed-form formula for the standard error of the difference), but that’s a lesson for another day.

# What are we actually calculating?

The two fitted survival curves that we are comparing are the following:

```
data_grid$hormon2 <- ifelse(data_grid$hormon == 0, "Control arm", "Treatment arm")
ggplot(data_grid, aes(x = rectime, y = fitted, linetype = hormon2)) +
geom_line() +
coord_cartesian(ylim = c(0, 1)) +
scale_x_continuous(breaks = 365 * seq(5)) +
theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
labs(x = "Follow-up time (days)", y = "Fitted survival", linetype = "")
```

`RMST_0`

is the area below the survival curve of the control arm:

```
ggplot() +
geom_ribbon(data = data_grid_0, aes(x = rectime, ymin = 0, ymax = fitted), alpha = 0.2) +
geom_line(data = data_grid, aes(x = rectime, y = fitted, linetype = hormon2)) +
coord_cartesian(ylim = c(0, 1)) +
scale_x_continuous(breaks = 365 * seq(5)) +
theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
labs(x = "Follow-up time (days)", y = "Fitted survival", linetype = "")
```

`RMST_1`

is the area below the survival curve of the treatment arm:

```
ggplot() +
geom_ribbon(data = data_grid_1, aes(x = rectime, ymin = 0, ymax = fitted), alpha = 0.2) +
geom_line(data = data_grid, aes(x = rectime, y = fitted, linetype = hormon2)) +
coord_cartesian(ylim = c(0, 1)) +
scale_x_continuous(breaks = 365 * seq(5)) +
theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
labs(x = "Follow-up time (days)", y = "Fitted survival", linetype = "")
```

The difference between `RMST_1`

and `RMST_0`

is the area between the curves:

```
library(tidyr)
data_grid_diff <- pivot_wider(data_grid[, -4], names_from = hormon, values_from = fitted)
ggplot() +
geom_line(data = data_grid, aes(x = rectime, y = fitted, linetype = hormon2)) +
geom_ribbon(data = data_grid_diff, aes(x = rectime, ymin = `0`, ymax = `1`), alpha = 0.2) +
coord_cartesian(ylim = c(0, 1)) +
scale_x_continuous(breaks = 365 * seq(5)) +
theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
labs(x = "Follow-up time (days)", y = "Fitted survival", linetype = "")
```

`data_grid_diff`

is a dataset that was reshaped for plotting purposes, but otherwise contains the same information of `data_grid`

.

# Wrapping up

This post showed how to compute restricted mean survival time after fitting a flexible parametric model, but it can be replicated with every model that yields predictions for the survival function `\(S(t)\)`

.

I started writing this blog post with the idea of describing how to integrate survival functions in general terms but got carried away with the estimation of restricted mean survival time. Nevertheless, knowing how to integrate a survival function can be broadly useful e.g. when comparing different predictions from competing models or when comparing with the truth (if known, as in simulation studies). And it’s terribly easy!

This example used functions that come bundled with R for simplicity, but other functions and packages can be used as well.
For instance, I prefer using the `quadinf`

function from the `pracma`

package for numerical integration, as I found it (empirically, of course) to be more robust and reliable than `integrate`

.
Other approaches for numerical integration can be used as well, such as Gaussian quadrature or more simple trapezoidal rules (and so on).

As always, let me know if I got something horribly wrong. Cheers!