This article is intended to give a gentle mathematical and statistical introduction to group sequential design. We also provide relatively simple examples from the literature to explain clinical applications. There is no programming shown, but by accessing the source for the article all required programming can be accessed; substantial commenting is provided in the source in the hope that users can understand how to implement the concepts developed here. Hopefully, the few mathematical and statistical concepts introduced will not discourage those wishing to understand some underlying concepts for group sequential design.
A group sequential design enables repeated analysis of an endpoint for a clinical trial to enable possible early stopping of a trial for either a positive result, for futility, or for a safety issue. This approach can
Examples of outcomes that might be considered include:
Examples of the above include:
We assume
For a time-to-event outcome, \(\delta\) would typically represent the logarithm of the hazard ratio for the control group versus the experimental group. For a difference in response rates, \(\delta\) would represent the underlying response rates. For a continuous outcome such as the HAM-D, we would examine the difference in change from baseline at a milestone time point (e.g., at 6 weeks as in Binneman et al. (2008)). For \(j=1,\ldots,k\), the tests \(Z_j\) are asymptotically multivariate normal with correlations as above, and for \(i=1,\ldots,k\) have \(\text{Cov}(Z_i,Z_j)=\text{Corr}(\hat\delta_i,\hat\delta_j)\) and \(E(Z_j)=\delta\sqrt{I_j}.\)
This multivariate asymptotic normal distribution for \(Z_1,\ldots,Z_k\) is referred to as the canonical form by Jennison and Turnbull (2000) who have also summarized much of the surrounding literature.
We assume that the primary test the null hypothesis \(H_{0}\): \(\delta=0\) against the alternative \(H_{1}\): \(\delta = \delta_1\) for a fixed effect size \(\delta_1 > 0\) which represents a benefit of experimental treatment compared to control. We assume further that there is interest in stopping early if there is good evidence to reject one hypothesis in favor of the other. For \(i=1,2,\ldots,k-1\), interim cutoffs \(l_{i}< u_{i}\) are set; final cutoffs \(l_{k}\leq u_{k}\) are also set. For \(i=1,2,\ldots,k\), the trial is stopped at analysis \(i\) to reject \(H_{0}\) if \(l_{j}<Z_{j}< u_{j}\), \(j=1,2,\dots,i-1\) and \(Z_{i}\geq u_{i}\). If the trial continues until stage \(i\), \(H_{0}\) is not rejected at stage \(i\), and \(Z_{i}\leq l_{i}\) then \(H_{1}\) is rejected in favor of \(H_{0}\), \(i=1,2,\ldots,k\). Thus, \(3k\) parameters define a group sequential design: \(l_{i}\), \(u_{i}\), and \(\mathcal{I}_{i}\), \(i=1,2,\ldots,k\). Note that if \(l_{k}< u_{k}\) there is the possibility of completing the trial without rejecting \(H_{0}\) or \(H_{1}\). We will often restrict \(l_{k}= u_{k}\) so that one hypothesis is rejected.
We begin with a one-sided test. In this case there is no interest in stopping early for a lower bound and thus \(l_i= -\infty\), \(i=1,2,\ldots,k\). The probability of first crossing an upper bound at analysis \(i\), \(i=1,2,\ldots,k\), is
\[\alpha_{i}^{+}(\delta)=P_{\delta}\{\{Z_{i}\geq u_{i}\}\cap_{j=1}^{i-1} \{Z_{j}< u_{j}\}\}\]
The Type I error is the probability of ever crossing the upper bound when \(\delta=0\). The value \(\alpha^+_{i}(0)\) is commonly referred to as the amount of Type I error spent at analysis \(i\), \(1\leq i\leq k\). The total upper boundary crossing probability for a trial is denoted in this one-sided scenario by
\[\alpha^+(\delta) \equiv \sum_{i=1}^{k}\alpha^+_{i}(\delta)\]
and the total Type I error by \(\alpha^+(0)\). Assuming \(\alpha^+(0)=\alpha\) the design will be said to provide a one-sided group sequential test at level \(\alpha\).
With both lower and upper bounds for testing and any real value \(\delta\) representing treatment effect we denote the probability of crossing the upper boundary at analysis \(i\) without previously crossing a bound by
\[\alpha_{i}(\delta)=P_{\delta}\{\{Z_{i}\geq u_{i}\}\cap_{j=1}^{i-1} \{ l_{j}<Z_{j}< u_{j}\}\},\]
\(i=1,2,\ldots,k.\) The total probability of crossing an upper bound prior to crossing a lower bound is denoted by
\[\alpha(\delta)\equiv\sum_{i=1}^{k}\alpha_{i}(\delta).\]
Next, we consider analogous notation for the lower bound. For \(i=1,2,\ldots,k\) denote the probability of crossing a lower bound at analysis \(i\) without previously crossing any bound by \[\beta_{i}(\delta)=P_{\delta}\{\{Z_{i}\leq l_{i}\}\cap_{j=1}^{i-1}\{ l_{j} <Z_{j}< u_{j}\}\}.\] The total lower boundary crossing probability in this case is written as \[\beta(\delta)= {\sum\limits_{i=1}^{k}} \beta_{i}(\delta).\]
When a design has final bounds equal (\(l_k=u_k\)), \(\beta(\delta_1)\) is the Type II error which is equal to 1 minus the power of the design. In this case, \(\beta_i(\delta)\) is referred to as the \(\beta\)-spending at analysis \(i, i=1,\ldots,k\).
Type I error is most often defined with \(\alpha_i^+(0), i=1,\ldots,k\). This is referred to as non-binding Type I error since any lower bound is ignored in the calculation. This means that if a trial is continued in spite of a lower bound being crossed at an interim analysis that Type I error is still controlled at the design \(\alpha\)-level. For Phase III trials used for approvals of new treatments, non-binding Type I error calculation is generally expected by regulators.
For any given \(0<\alpha<1\) we define a non-decreasing \(\alpha\)-spending function \(f(t; \alpha)\) for \(t\geq 0\) with \(\alpha\left( 0\right) =0\) and for \(t\geq 1\), \(f( t; \alpha) =\alpha\). Letting \(t_0=0\), we set \(\alpha_j(0)\) for \(j=1,\ldots,k\) through the equation \[\alpha^+_{j}(0) = f(t_j;\alpha)-f(t_{j-1}; \alpha).\] Assuming an asymmetric lower bound, we similarly use a \(\beta\)-spending function and to set \(\beta\)-spending at analysis \(j=1,\ldots, k\) as: \[\beta_{j}(\delta_1) = g(t_j;\delta_1, \beta) - g(t_{j-1};\delta_1, \beta).\]
In the following example, the function \(\Phi()\) represents the cumulative distribution function for the standard normal distribution function (i.e., mean 0, standard deviation 1). The major depression study of Binneman et al. (2008) considered above used the Lan and DeMets (1983) spending function approximating an O’Brien-Fleming bound for a single interim analysis half way through the trial with
\[f(t; \alpha) = 2\left( 1-\Phi\left( \frac{\Phi ^{-1}(\alpha/2)}{\sqrt{t}}\right) \right).\]
\[g(t; \beta) = 2\left( 1-\Phi\left( \frac{\Phi ^{-1}(\beta/2)}{\sqrt{t}}\right) \right).\]
delta1 <- 3 # Treatment effect, alternate hypothesis
delta0 <- 0 # Treatment effect, null hypothesis
ratio <- 1 # Randomization ratio (experimental / control)
sd <- 7.5 # Standard deviation for change in HAM-D score
alpha <- 0.1 # 1-sided Type I error
beta <- 0.17 # Targeted Type II error (1 - targeted power)
k <- 2 # Number of planned analyses
test.type <- 4 # Asymmetric bound design with non-binding futility bound
timing <- .5 # information fraction at interim analyses
sfu <- sfLDOF # O'Brien-Fleming spending function for alpha-spending
sfupar <- 0 # Parameter for upper spending function
sfl <- sfLDOF # O'Brien-Fleming spending function for beta-spending
sflpar <- 0 # Parameter for lower spending function
delta <- 0
endpoint <- "normal"
# Derive normal fixed design sample size
n <- nNormal(
delta1 = delta1,
delta0 = delta0,
ratio = ratio,
sd = sd,
alpha = alpha,
beta = beta
)
# Derive group sequential design based on parameters above
x <- gsDesign(
k = k,
test.type = test.type,
alpha = alpha,
beta = beta,
timing = timing,
sfu = sfu,
sfupar = sfupar,
sfl = sfl,
sflpar = sflpar,
delta = delta, # Not used since n.fix is provided
delta1 = delta1,
delta0 = delta0,
endpoint = "normal",
n.fix = n
)
# Convert sample size at each analysis to integer values
x <- toInteger(x)
#> toInteger: rounding done to nearest integer since ratio was not specified as postive integer .
The planned design used \(\alpha=0.1\), one-sided and Type II error
17% (83% power) with an interim analysis at 50% of the final planned
observations. This leads to Type I \(\alpha\)-spending of 0.02 and \(\beta\)-spending of 0.052 at the planned
interim. An advantage of the spending function approach is that bounds
can be adjusted when the number of observations at analyses are
different than planned. The actual observations for experimental versus
control at the analysis were 59 as opposed to the planned 67, which
resulted in interim spending fraction \(t_1=\) 0.4403. With the Lan-DeMets spending
function to approximate O’Brien-Fleming bounds this results in \(\alpha\)-spending of 0.0132
(P(Cross) if delta=0
row in Efficacy column) and \(\beta\)-spending of 0.0386
(P(Cross) if delta=3
row in Futility column). We note that
the Z-value and 1-sided p-values in the table below correspond exactly
and either can be used for evaluation of statistical significance for a
trial result. The rows labeled ~delta at bound
are
approximations that describe approximately what treatment difference is
required to cross a bound; these should not be used for a formal
evaluation of whether a bound has been crossed. The O’Brien-Fleming
spending function is generally felt to provide conservative bounds for
stopping at interim analysis. Most of the error spending is reserved for
the final analysis in this example. The futility bound only required a
small trend in the wrong direction to stop the trial; a nominal p-value
of 0.77 was observed which crossed the futility bound, stopping the
trial since this was greater than the futility p-value bound of 0.59.
Finally, we note that at the final analysis, the cumulative probability
for P(Cross) if delta=0
is less than the planned \(\alpha=0.10\). This probability represents
\(\alpha(0)\) which excludes the
probability of crossing the lower bound at the interim analysis and the
final analysis. The value of the non-binding Type I error is still \(\alpha^+(0) = 0.10\).
# Updated alpha is unchanged
alphau <- 0.1
# Updated sample size at each analysis
n.I <- c(59, 134)
# Updated number of analyses
ku <- length(n.I)
# Information fraction is used for spending
usTime <- n.I / x$n.I[x$k]
lsTime <- usTime
# Update design based on actual interim sample size and planned final sample size
xu <- gsDesign(
k = ku,
test.type = test.type,
alpha = alphau,
beta = x$beta,
sfu = sfu,
sfupar = sfupar,
sfl = sfl,
sflpar = sflpar,
n.I = n.I,
maxn.IPlan = x$n.I[x$k],
delta = x$delta,
delta1 = x$delta1,
delta0 = x$delta0,
endpoint = endpoint,
n.fix = n,
usTime = usTime,
lsTime = lsTime
)
#> Analysis Value Efficacy Futility
#> IA 1: 44% Z 2.2209 -0.2304
#> N: 59 p (1-sided) 0.0132 0.5911
#> ~delta at bound 4.3370 -0.4500
#> P(Cross) if delta=0 0.0132 0.4089
#> P(Cross) if delta=3 0.2468 0.0386
#> Final Z 1.3047 1.3047
#> N: 134 p (1-sided) 0.0960 0.0960
#> ~delta at bound 1.6907 1.6907
#> P(Cross) if delta=0 0.0965 0.9035
#> P(Cross) if delta=3 0.8350 0.1650