ALL-IN meta-analysis with Safe logrank testing

Below we illustrate, via examples in R, the Safe logrank test. Like the classical logrank test, this is a test for time-to-event data with censoring. But unlike the standard approach, the Safe logrank test outputs an e-value as a measure of evidence.

The Safe logrank test

We first introduce the Safe logrank test in its use for a single trial with only one underlying stratum. One stratum means that the baseline expected number of events per time point is the same for all enrolled participants. If participants enter the study from different locations or hospitals, this assumption might be unrealistic. We therefore also introduce the Safe logrank test for use on stratified time-to-event data. We conclude with an illustration of its use on several studies, each in its own stage of analysis. Due to the simplicity of the meta-analysis procedure – pooling by multiplication – neither of these extensions introduces any additional complexity.

Using the Safe logrank test: one study, one stratum, many interim analyses

Please also find the complete R markdown file here.

library(devtools)
devtools::install_github("AlexanderLyNL/Safestats", ref = "logrank")
library(safestats)
library(survival)
library(knitr)

Consider the following survival data of a first study:

enrollment.study1 <- 1500     # 750 treatment, 750 placebo, so:
ratio.study1 <- 1             # ratio = nT/nP = 1
fup.study1 <- 180             # follow up of 180 days
nEventsC.study1 <- 150        # 150 events in 180 days among 1500
                              # in case of no treatment effect,
                              # this defines the baseline hazard
hr1.study1 <- 0.5             # hazard ratio between treatment and placebo group
                              # (hr1: alternative hypothesis is true)
                              # 113 events anticipated within 180 days 
                              # if treatment reduces the risk by hr1.study1:
anticipNevents.study1 <- nEventsC.study1*(1/(1 + ratio.study1)) + 
                         nEventsC.study1*(ratio.study1/(1 + ratio.study1))*hr1.study1  

# assuming constant hazard, we simulate the following baseline hazard and data:
lambdaC.study1 <- 1 - (1 - (nEventsC.study1/enrollment.study1))^(1/fup.study1)  
data.study1 <- generateSurvData(nP = enrollment.study1*(1/(1 + ratio.study1)),
                                nT = enrollment.study1*(ratio.study1/(1 + ratio.study1)),
                                lambdaP = lambdaC.study1,
                                lambdaT = hr1.study1*lambdaC.study1,
                                endTime = fup.study1,
                                seed = 52020)  # set seed to 52020 to make the result reproducible
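As a quick sanity check on this parameterization (a small sketch, not part of the original code), the constant per-day hazard lambdaC.study1 indeed implies the assumed 150 events under the null hypothesis within the 180-day follow up:

# per-participant probability of an event within the follow-up period,
# multiplied by the number of enrolled participants:
enrollment.study1 * (1 - (1 - lambdaC.study1)^fup.study1)  # = 150 expected events under the null

The first rows of the simulated data look as follows: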

  time status group
     2      2     T
     3      2     P
     3      2     T
     5      2     T
     8      2     T
     9      2     P
     9      2     T
    10      2     P
    11      2     P
    12      2     T

We see events occurring in either the treatment (T) or the placebo control group (P) at day 2, 3, 5, 8, 9, 10, 11, 12,… (events have status = 2, censoring has status = 1). These are the possible days for an interim analysis. An analysis at day 75 (d = 75) would look as follows:

alpha <- 0.025
d <- 75                                    # day of the interim analysis

dataSoFar <- data.study1
dataSoFar$status[dataSoFar$time > d] <- 1  # censor everything after day d

designObj.study1 <- designSafeLogrank(hrMin = hr1.study1,
                                      alpha = alpha,
                                      alternative = "less",  # one-sided test hr < 1
                                      ratio = ratio.study1)
dataSoFar$survObj <- Surv(dataSoFar$time, dataSoFar$status)

test <- safeLogrankTest(formula = dataSoFar$survObj ~ dataSoFar$group, 
                        designObj = designObj.study1)
test
## 
##  Safe Logrank Test
## 
## data:  dataSoFar$survObj by dataSoFar$group (P, T). nEvents = 61
## 
## test: log(thetaS) = -0.69315
## e-value = 21.64 > 1/alpha = 40 : FALSE
## alternative hypothesis: true hazard ratio is less than 1 
## 
## design: the test was designed with alpha = 0.025
## for minimal relevant hazard ratio = 0.5 (less)

The analysis returns an e-value of 21.6, which has an interpretation in terms of evidence: a larger e-value means more evidence against the null hypothesis. Just like with a p-value, we can use a threshold based on \(\alpha\) to obtain a test with \(\alpha\) type-I error control (the probability of crossing the threshold when the null hypothesis is true is smaller than \(\alpha\)). The threshold to use for the e-value is \(\frac{1}{\alpha}\), meaning that for the one-sided test above with \(\alpha = 0.025\), our e-value needs to be larger than \(40\).
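In code, this is the comparison reported in the output above (a small check, relying on the eValue field of the test object that is also used later in this text):

test$eValue            # 21.64
test$eValue > 1/alpha  # FALSE: at day 75 the evidence does not yet cross the threshold of 40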

If we track the available evidence using a Safe logrank test, our progress over all days with possible interim analyses is as follows:

eValues.study1 <- rep(NA, times = fup.study1)
interimDays.study1 <- unique(data.study1$time[data.study1$status == 2])

eValues.study1[1:interimDays.study1[1]] <- 1  # before you observe any event, your e-value is 1

for (d in interimDays.study1[1]:fup.study1) {
  if (!(d %in% interimDays.study1)) {           # at days on which you do not observe an event, 
    eValues.study1[d] <- eValues.study1[d - 1]  # evidence stays the same as the day before
  } else {
    dataSoFar <- data.study1
    dataSoFar$status[dataSoFar$time > d] <- 1  # censor everything after day d
    
    dataSoFar$survObj <- Surv(dataSoFar$time, dataSoFar$status)
    eValues.study1[d] <- safeLogrankTest(formula = dataSoFar$survObj ~ dataSoFar$group,
                                         designObj = designObj.study1
                                         )$eValue
  }
  if (eValues.study1[d] > (1/alpha)) break
}
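The e-value path over these interim days is best inspected on a logarithmic scale; below is a minimal base-R sketch of such a plot (the actual figure is produced in the accompanying R markdown):

# plot the tracked e-values on a 2-log y-axis, together with the decision threshold
plot(1:fup.study1, log2(eValues.study1), type = "s",
     xlab = "days since start of study 1", ylab = "e-value (2-log scale)")
abline(h = log2(1/alpha), lty = 2)  # the threshold e > 1/alpha = 40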

Note that on the 2-log y-axis the evidence in terms of the e-value is doubling. We observe that the threshold of \(e > 40\) is crossed at the interim analysis around day 82. This means that we can stop the trial or choose this moment to initiate a pooled analysis. We discuss pooled analyses in the final sections.

Power and type-I error control of the Safe logrank test

What to expect before starting a trial – how many days we need before reaching \(e > 40\) – depends of course on how many events we observe early on and on how large the difference in hazard between our two groups is. The simulation below shows the power of our analysis if the data are generated following the assumptions of our first study above:

numSim <- 1000                # number of simulated trials
simeValues.study1 <- matrix(NA, nrow = numSim, ncol = fup.study1)
simEvents.study1 <- rep(NA, times = numSim)

for (j in 1:numSim){
    data <- generateSurvData(nP = enrollment.study1*(1/(1 + ratio.study1)),
                             nT = enrollment.study1*(ratio.study1/(1 + ratio.study1)),
                             lambdaP = lambdaC.study1,
                             lambdaT = hr1.study1*lambdaC.study1,
                             endTime = fup.study1, 
                             seed = j)

    interimDays <- unique(data$time[data$status == 2])

    for (d in interimDays) {
      # to speed up the simulation, we do not fill all days with e-values,
      # but leave those at which evidence does not change NA
      dataSoFar <- data
      dataSoFar$status[dataSoFar$time > d] <- 1  # censor everything after day d
      
      dataSoFar$survObj <- Surv(dataSoFar$time, dataSoFar$status)
      simeValues.study1[j, d] <- safeLogrankTest(formula = dataSoFar$survObj ~ dataSoFar$group,
                                                 designObj = designObj.study1
                                                 )$eValue
      if (simeValues.study1[j, d] > (1/alpha)) {
        simEvents.study1[j] <- sum(dataSoFar$status == 2)
        break
      }
    }
}

power.optimal <- mean(sapply(1:numSim, function(i) any(simeValues.study1[i, ] > (1/alpha), na.rm = TRUE)))*100

The vertical lines in the above plot show all the possible days of interim analysis at which a simulated analysis reaches an e-value of \(40\). Of those 1000 simulated paths, a sample of 50 are shown. Green e-value paths indicate that the study reaches \(e > 40\) at some interim analysis, red e-value paths indicate that it does not.

84.6% of all simulated e-value paths are green, which means that within our follow up of 180 days we have an approximate power of 84.6% to correctly conclude that there is a difference between the two groups (we have set that difference to be a hazard ratio of \(0.5\), hr1 = 0.5). If we set the hazard ratio to \(1\) (hr0 = 1) and run the same simulation, we observe that our type-I error – the probability of hitting our threshold of 40 within 180 days when there is no difference between the groups – is only 1.9%.
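The null simulation itself is not shown above; here is a minimal sketch of how it could be run, reusing the same design object but generating the data without a treatment effect (lambdaT = lambdaC):

hitThreshold.null <- rep(FALSE, times = numSim)

for (j in 1:numSim) {
  data <- generateSurvData(nP = enrollment.study1*(1/(1 + ratio.study1)),
                           nT = enrollment.study1*(ratio.study1/(1 + ratio.study1)),
                           lambdaP = lambdaC.study1,
                           lambdaT = lambdaC.study1,  # hr0 = 1: no difference between the groups
                           endTime = fup.study1,
                           seed = j)

  for (d in unique(data$time[data$status == 2])) {
    dataSoFar <- data
    dataSoFar$status[dataSoFar$time > d] <- 1  # censor everything after day d

    dataSoFar$survObj <- Surv(dataSoFar$time, dataSoFar$status)
    eValue <- safeLogrankTest(formula = dataSoFar$survObj ~ dataSoFar$group,
                              designObj = designObj.study1)$eValue
    if (eValue > (1/alpha)) {
      hitThreshold.null[j] <- TRUE
      break
    }
  }
}

typeIerror.null <- mean(hitThreshold.null)*100  # about 1.9% in the simulation reported above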

Unlike with alpha-spending, this type-I error control does not depend on the specification of a spending function. Type-I error control is preserved by comparing the e-value to \(40\), regardless of the days on which interim analyses are performed. So the threshold of \(40\) allows for all possible interim analyses without spending all alpha, as shown by our simulated type-I error of 1.9%. In fact, not all of \(\alpha = 2.5\%\) is spent in these possible interim analyses: the Safe test always reserves some \(\alpha\) for pooled analyses with data from other studies, and can in fact keep combining data indefinitely.

Designing the Safe logrank test

Type-I error control does not depend on which e-value we use, that is, on how the Safe test is designed. The power and the number of events before stopping the trial, on the other hand, do depend on how the Safe logrank test is designed. In the above simulation the design object designObj.study1 was defined based on a one-sided test and a hazard ratio hr1 = 0.5 that exactly matched the simulation. This is the best possible scenario, and it results in the following distribution of the number of events before stopping at an e-value of \(40\), out of 1000 simulated runs:
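A minimal sketch of how this distribution could be displayed, using the simEvents.study1 vector filled in the power simulation above (the actual figure is produced in the accompanying R markdown):

# number of events observed at the moment the threshold e > 40 was reached
hist(simEvents.study1[!is.na(simEvents.study1)],
     xlab = "events before stopping at e > 40",
     main = "Optimal design (hrMin = 0.5)")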

However, in reality we can supply a hazard ratio that is of minimal clinical interest, but we can never truly know the true hazard ratio in advance. If we design a test based on a difference between the groups that is more extreme than the true value, type-I error control is preserved, but we obtain suboptimal power and need more events to detect the true difference. We are underpowered, because the true effect is smaller than the effect we set to be of minimal clinical relevance.

The hazard ratio used to design the Safe test should be prespecified at the same stage as a power analysis would be: before seeing any data. Adjusting the design object to the observed hazard ratio is like changing your alpha-spending function post hoc to capitalize on many early or late events: it invalidates the analysis and loses type-I error control.

hr1.study1 <- 0.5
hr1.suboptimal <- 0.3
designObj.study1.suboptimal <- designSafeLogrank(hrMin = hr1.suboptimal,
                                                 alpha = alpha,
                                                 alternative = "less",  # one-sided test hr < 1
                                                 ratio = ratio.study1)
  
designObj.study1.suboptimal
## 
##  Safe Logrank Test Design
## 
##             minimal hazard ratio = 0.3
##                      alternative = less
##           parameter: log(thetaS) = -1.203973
##                            alpha = 0.025
## decision rule: e-value > 1/alpha = 40
## 
## Timestamp: 2021-06-30 10:57:44 CEST

The consequences of an overly optimistic hazard ratio are shown in the simulation below. While the data are simulated assuming a hazard ratio of 0.5, the test is designed based on a hazard ratio of 0.3.

Note that if the number of events is sufficiently large, such as when pooling time-to-event data across multiple trials, a Safe test designed with an overly optimistic effect size can still have sufficient power.
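Below is a sketch of how this suboptimal-design simulation could be run: it is identical to the power simulation above, except that the suboptimally designed test is used, while the data are still generated with the true hazard ratio of 0.5.

simeValues.suboptimal <- matrix(NA, nrow = numSim, ncol = fup.study1)

for (j in 1:numSim) {
  data <- generateSurvData(nP = enrollment.study1*(1/(1 + ratio.study1)),
                           nT = enrollment.study1*(ratio.study1/(1 + ratio.study1)),
                           lambdaP = lambdaC.study1,
                           lambdaT = hr1.study1*lambdaC.study1,  # true hazard ratio 0.5
                           endTime = fup.study1,
                           seed = j)

  for (d in unique(data$time[data$status == 2])) {
    dataSoFar <- data
    dataSoFar$status[dataSoFar$time > d] <- 1  # censor everything after day d

    dataSoFar$survObj <- Surv(dataSoFar$time, dataSoFar$status)
    simeValues.suboptimal[j, d] <- safeLogrankTest(formula = dataSoFar$survObj ~ dataSoFar$group,
                                                   designObj = designObj.study1.suboptimal  # hrMin = 0.3
                                                   )$eValue
    if (simeValues.suboptimal[j, d] > (1/alpha)) break
  }
}

power.suboptimal <- mean(sapply(1:numSim, function(i)
  any(simeValues.suboptimal[i, ] > (1/alpha), na.rm = TRUE)))*100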

The power obtained now with 1500 participants over 180 days is power.suboptimal = 69.2% instead of power.optimal = 84.6%. And the distribution of events needed before stopping the trial looks as follows:

Combining the Safe logrank test: one study, multiple strata, many interim analyses

If our trial is composed of several strata, such as hospitals with different characteristics, it is optimal to account for this structure. After all, the underlying baseline hazard at a specific day after randomization might differ from hospital to hospital. So we want to allow the events observed on that day to tell a different story per hospital.

Both the standard logrank test and the Safe logrank test assume that on each day on which an event occurs there is one underlying baseline hazard. Under the null hypothesis of no difference between the placebo and treatment group, this baseline hazard underlies the observations in both groups. When the observations deviate from that pattern, the e-value grows and indicates evidence against the null. Consequently, it is not optimal to calculate an e-value based on combined events from different strata: in that case the e-value might not be able to detect a difference between the treatment and control group, because the differences between the strata make the baseline more uncertain.

So accounting for strata – or, more generally, baseline characteristics – makes the analysis more powerful. But the type-I error is always controlled, whether we correct for baseline characteristics or not.

To account for differences between strata, we calculate an e-value per stratum and multiply these e-values. As we observed in the beginning of this text, this preserves type-I error control: for every fixed \(\alpha\), the probability that a product of e-values ever crosses \(1/\alpha\) is bounded by \(\alpha\).
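A minimal sketch of this procedure for two hypothetical strata, where data.hospitalA and data.hospitalB are assumed to be data frames in the same format as data.study1 and both strata use the same prespecified design:

# e-value per stratum, each based on its own logrank statistic
data.hospitalA$survObj <- Surv(data.hospitalA$time, data.hospitalA$status)
eValue.hospitalA <- safeLogrankTest(formula = data.hospitalA$survObj ~ data.hospitalA$group,
                                    designObj = designObj.study1)$eValue

data.hospitalB$survObj <- Surv(data.hospitalB$time, data.hospitalB$status)
eValue.hospitalB <- safeLogrankTest(formula = data.hospitalB$survObj ~ data.hospitalB$group,
                                    designObj = designObj.study1)$eValue

# pooling by multiplication, compared to the prespecified threshold
eValueStratified <- eValue.hospitalA * eValue.hospitalB
eValueStratified > 1/alpha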

Intuitively, if the null hypothesis of no difference between the groups is true, each e-value will only get smaller. So if we do not specify when we multiply two or more e-values, we expect to multiply something small with something small, and do not expect the product to be anything but small. Because each e-value has a probability of only \(1\) in \(40\) of ever becoming larger than \(40\) if we keep adding data, so does the combined e-value if we add data by multiplying e-values.

This procedure does not inflate the type-I error if we constantly multiply our e-value evidence. But even if we are not able to do this constantly due to practical limitations, type-I error control is retained for any selection of possible interim meta-analyses. All these properties rely on the assumption that, before we start the data collection, all analysis times are possible: if we control type-I error over all possible analyses, we also control it over the analyses we actually perform.

Meta-analyzing the Safe logrank test: many studies, multiple stages, many interim analyses

Just like tracking an individual study e-value, as shown above, we can also track e-values from multiple studies at different stages, and pool the results into one meta-analysis. The following code describes the properties of a second study:

enrollment.study2 <- 4000     # 2000 treatment, 2000 placebo, so:
ratio.study2 <- 1             # ratio = nT/nP = 1
fup.study2  <- 140            # follow up of 140 days
nEventsC.study2 <- 300        # 300 events in 140 days among 4000
                              # in case of no treatment effect,
                              # this defines the baseline hazard
hr1.study2 <- 0.7             # hazard ratio between treatment and placebo group
                              # (hr1: alternative hypothesis is true)
                              # 255 events anticipated within 140 days 
                              # if treatment reduces the risk by hr1.study2:
anticipNevents.study2 <- nEventsC.study2*(1/(1 + ratio.study2)) + 
                         nEventsC.study2*(ratio.study2/(1 + ratio.study2))*hr1.study2  

designObj.study2 <- designSafeLogrank(hrMin = hr1.study2,
                                      alpha = alpha,
                                      alternative = "less",  # one-sided test hr < 1
                                      ratio = ratio.study2)

# assuming constant hazard, we simulate the following baseline hazard and data:
lambdaC.study2 <- 1 - (1 - (nEventsC.study2/enrollment.study2))^(1/fup.study2) 
data.study2 <- generateSurvData(nP = enrollment.study2*(1/(1 + ratio.study2)),
                                nT = enrollment.study2*(ratio.study2/(1 + ratio.study2)),
                                lambdaP = lambdaC.study2,
                                lambdaT = hr1.study2*lambdaC.study2,
                                endTime = fup.study2,
                                seed = 20)  # set seed to 20 to make the result reproducible
# Assume that this new study starts 40 days later than the previous one
data.study2$time <- data.study2$time + 40

  time status group
    41      2     P
    44      2     P
    44      2     T
    44      2     T
    45      2     P
    46      2     P
    46      2     P
    46      2     T
    47      2     P
    47      2     P

Because we assume that this second study starts 40 days later than the first, the days for interim analyses on the timeline of the first study are: 41, 44, 45, 46, 47, 48, 49, 51, 52, 53
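The e-value path of the second study can be tracked just like that of the first one. A sketch of how eValues.study2, used below, could be obtained (an assumption about code not shown here), on the 180-day timeline of the first study and without an early stop, since the purpose here is pooling:

eValues.study2 <- rep(NA, times = fup.study1)
interimDays.study2 <- unique(data.study2$time[data.study2$status == 2])

eValues.study2[1:interimDays.study2[1]] <- 1  # e-value is 1 before the first event (and before day 41)

for (d in interimDays.study2[1]:fup.study1) {
  if (!(d %in% interimDays.study2)) {           # no event on day d:
    eValues.study2[d] <- eValues.study2[d - 1]  # evidence stays the same as the day before
  } else {
    dataSoFar <- data.study2
    dataSoFar$status[dataSoFar$time > d] <- 1   # censor everything after day d

    dataSoFar$survObj <- Surv(dataSoFar$time, dataSoFar$status)
    eValues.study2[d] <- safeLogrankTest(formula = dataSoFar$survObj ~ dataSoFar$group,
                                         designObj = designObj.study2)$eValue
  }
}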

We can pool the evidence in the second study with the first one by multiplying their e-values:

eValueMeta <- sapply(1:fup.study1, function(day) prod(eValues.study1[day], eValues.study2[day]))

We observe that in its first 66 days the second study does not have much power on its own to reach \(40\), due to the very low number of events. Fortunately, pooling the little evidence it does provide with the evidence already provided by the first study enables the meta-analysis e-value to reach the threshold sooner.
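This can be checked directly from the tracked paths (a small sketch; the exact days depend on the simulated data):

which(eValues.study1 > 1/alpha)[1]  # first day on which study 1 alone crosses e > 40
which(eValueMeta > 1/alpha)[1]      # first day on which the pooled e-value crosses e > 40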

Type-I error control over all possible analysis times

The error control of multiplied e-values is preserved over all interim analyses and all possible timings of the pooled meta-analysis. This means that if you design to pool, the first pooled analysis may well contain only one study. While such an analysis would not conventionally be called a meta-analysis, our e-value meta-analysis allows for this situation. Before observing any event, the e-value of a study is 1, indicated by the flat line of study 2 in the first 40 days of the e-value plot above. Therefore, pooling by multiplication is always possible, even when the e-values of some studies still equal \(1\).

Note that error control and power over all possible pooled analyses therefore inherently assume that stopping the pooled analysis at a stage with only a single study contributing is possible. Disallowing such a scenario does not necessarily compromise type-I error control (for details, see Schure and Grünwald (2019) and its definition of error control surviving over time). However, only allowing pooled analyses of more than one trial does decrease power, because it excludes the possibility of reaching \(e > 40\) with only one study.

We show that type-I error control is preserved over a pooled analysis in which sometimes the first study that provides data is sufficient, and sometimes the two studies combined provide the evidence.

The plot depicts the evidence of the meta-analysis e-value. The first 40 days of each path equal the e-values of the first study. After results from the second study come in, the meta-analytic e-value starts pooling the evidence. Although the rate at which either the first or the second study reaches \(e > 40\) will be larger than \(\alpha\) due to multiple testing (shown in dark green and light green), no more than a fraction \(\alpha\) of the pooled meta-analysis e-value paths will reach the threshold (shown in blue).

We observe that under the null hypothesis adding data will only decrease the e-value. Pooling evidence means adding data by multiplying e-values. So pooled e-values also decrease, and due to the larger amount of data available at each time, even more so than single-study e-values (indicated by the many pink e-value paths at the bottom of the graph).

The type-I error of the pooled e-value in the above simulation is therefore controlled: the simulated rate is only 1.4%.

Summary: general recommendations for the Safe logrank test

We recommend pooling the evidence available at all interim times at which pooling is practically possible. The analysis per trial stratum should be performed using a Safe logrank test, and results – from multiple strata and trials at different interim stages – should be pooled by multiplying their e-values. Type-I error is controlled if the multiplied e-values are compared to a prespecified threshold of \(\frac{1}{\alpha}\). The available evidence can be assessed based on the pooled e-values as well as the individual trial e-values. The pooled product of e-values should be leading, but additional requirements on individual trial e-values can be imposed, since those will only decrease the type-I error.

So to preserve type-I error control, the statistical analysis plan needs to prespecify the design of the Safe test: at a minimum \(\alpha\), whether the test is two-sided or left- or right-sided, and the effect size of minimal clinical relevance in terms of the hazard ratio.

Larger e-values indicate more evidence against the null hypothesis of no difference between the treated and control groups. Trends in evidence can be inspected by plotting the e-values on a logarithmic scale at the various interim analyses. An increasing trend is a good indication of conditional power to cross the threshold in the future. Visual inspection of this trend can be used to substantiate preparations needed to act as soon as the threshold is crossed. Formal conditional power analyses can also be performed, as well as futility analyses, and will soon be available as straightforward functions in the safestats R package.

The Safe logrank test enables unlimited interim analyses and pooling of strata and trials. Any possible dependency between the timing of the analysis, the number of trials included and the results of those trials is allowed for. This means that to achieve optimal efficiency, we encourage intermediate decisions: Use the interim analyses to decide whether new trials are needed or should be discouraged. Use the observation of a large difference between the two groups in one stratum to increase enrollment in that stratum. Use the number of events observed in one stratum to optimize the power-design of the e-value in the next stratum. Note that the \(\frac{1}{\alpha}\) threshold as well as the design of the e-value need to be specified before observing the data that is tested. But if that is the case, anything is possible.

References

Grünwald, Peter, Rianne de Heide, and Wouter Koolen. 2019. “Safe Testing.” arXiv Preprint arXiv:1906.07801.

Robbins, Herbert. 1970. “Statistical Methods Related to the Law of the Iterated Logarithm.” Annals of Mathematical Statistics 41: 1397–1409.

Schure, Judith ter, and Peter Grünwald. 2019. “Accumulation Bias in Meta-Analysis: The Need to Consider Time in Error Control [Version 1; Peer Review: 2 Approved].” F1000Research 8 (June): 962. https://doi.org/10.12688/f1000research.19375.1.

Shafer, Glenn, Alexander Shen, Nikolai Vereshchagin, Vladimir Vovk, and others. 2011. “Test Martingales, Bayes Factors and P-Values.” Statistical Science 26 (1): 84–101.