## Biomarkers, Subgroup Evaluation, and Clinical Trial Design

**Abstract:** Advances in clinical and basic sciences are raising the potential to use genetic and clinical biomarkers to identify a subgroup of patients who would most likely benefit from treatment, and to evaluate the benefit of treatment in that subgroup. To make full use of this potential, special clinical trial designs and analyses are needed. For identifying and evaluating a subgroup based on a single continuous biomarker, the most informative approach is the biomarker-analysis design, which is a randomized trial whose analysis involves estimation of the treatment benefit within patient groups defined with respect to various cutpoints or intervals of the biomarker. For identifying and evaluating a subgroup considering a range of possible markers, the adaptive signature design is recommended. In the adaptive signature design, participants are randomly split into training and test samples, a rule for creating the subgroup is formulated in the training sample, and treatment benefit is estimated in the test sample. The adaptive signature design can be usefully extended via the sliding-window subgroup plot that was originally developed for the biomarker-analysis design.

**Introduction**

In many randomized clinical trials, it is becoming increasingly feasible to measure hundreds or thousands of genes as well as other baseline clinical variables such as age, tumor stage, and co-morbidity. With these measurements, there is great interest in pharmacogenetics, which is defined here as using genetic and clinical markers to identify a promising subgroup of patients who may benefit from experimental versus control treatment substantially more than the average benefit among all patients. [A narrower definition of pharmacogenetics involves the use of only genetic markers to identify a promising subgroup (Hopkins *et al*., 2006); the inclusion of clinical markers can add important information.] Identifying a promising subgroup is particularly desirable when either (i) the experimental treatment has a worse outcome than the control treatment overall but a better outcome in the subgroup, or (ii) the experimental treatment has a better outcome than the control treatment among all patients, but that benefit does not outweigh detrimental side effects except in the subgroup.

While observational studies of various genetic or clinical factors can provide useful preliminary insights for defining a promising subgroup or hypotheses for future definitive testing (Parker and Strout, 2011), a randomized clinical trial is needed to more fully exploit the potential of pharmacogenetics. Pharmacogenetics involves predictive, as opposed to prognostic, markers. Prognostic markers are used to predict the outcome in a control group of subjects and are therefore useful for projecting the natural history of a cancer independent of therapy. Predictive markers, also called treatment selection markers, are used to define a specific subgroup of patients for which treatment will be beneficial (Jiang *et al*., 2010; Janes *et al*., 2011). An example of a predictive marker is the presence or absence of *K-Ras* mutations in colorectal cancers; patients without *K-Ras* mutations benefit from anti-epidermal growth factor receptor (EGFR) treatment while patients with *K-Ras* mutations derive little, if any, benefit (Dahabreh *et al*., 2011). Clinical trial designs and analyses for identifying and evaluating subgroups based on predictive markers are discussed here with an emphasis on situations where there are a large number of possible pharmacogenetic markers.

**Biomarker-enrichment Design**

The biomarker-enrichment design (also called the gene-enrichment design) is a randomized trial involving only patients testing positive for a biomarker (Freidlin *et al*., 2010; Sargent *et al*., 2005). The biomarker-enrichment design is most appropriate when there is preliminary evidence that patients testing positive for a biomarker will likely benefit from the treatment. (A stronger case for the biomarker-enrichment design can be made if there is also preliminary evidence that patients testing negative for the biomarker would not likely receive any benefit from the treatment.) For example, because about 50% of melanomas have an activating ^{V600E}*BRAF* mutation, investigators hypothesized that inhibition of mutated BRAF kinase may have clinical benefit (Chapman *et al*., 2011). Therefore, the investigators randomized only patients who tested positive for the ^{V600E}*BRAF* mutation to an inhibitor of mutated BRAF kinase or control treatment (Chapman *et al*., 2011) and demonstrated a large treatment effect in this prospectively specified subgroup. This design essentially consists of an added specification to the patient inclusion criteria of the study.

**Biomarker-strategy Design**

The biomarker-strategy design (also called a marker-based strategy design) randomizes patients to either control treatment or *marker-based treatment selection*, namely receiving experimental treatment if the marker is positive and control treatment if the marker is negative (Freidlin *et al*., 2010; Sargent *et al*., 2005). The biomarker-strategy design is generally a poor substitute for a randomized trial comparing experimental and control treatments because the effect of experimental versus control treatment is diluted by marker-based treatment selection. As will be discussed, the same comparison can be estimated using the biomarker-analysis design that has the advantages of (i) also comparing experimental and control treatments among all participants and (ii) investigating multiple possible cutpoints of a continuous marker that yield a positive or negative marker result.

**Biomarker-stratified Design**

The biomarker-stratified design (also called a marker-by-treatment-interaction design) involves first testing patients for the biomarker and then separately randomizing patients who test positive and patients who test negative for the marker (Freidlin *et al*., 2010; Sargent *et al*., 2005). This design is predicated on the assumption that there is no preliminary evidence to strongly favor a positive or negative biomarker that would necessitate a biomarker-enrichment design. The biomarker-stratified design is more informative than the biomarker-enrichment design but less informative than the biomarker-analysis design (to be discussed).

**Biomarker-analysis Design**

Consider a single continuous biomarker measured at randomization in which it is thought that treatment benefit increases as marker level increases (but there is not sufficient evidence to warrant a biomarker-enrichment design). For clinical use, a cutpoint must be established where patients with marker levels greater than or equal to a cutpoint are deemed positive and patients with a marker level less than the cutpoint are deemed negative. For this situation, the biomarker-analysis design (a name coined here) is an informative design. The biomarker-analysis design has two components: a randomized trial with the single continuous biomarker ascertained in all participants, and subsequently the identification of a promising subgroup using a plot of treatment benefit versus various cutpoints or intervals of the biomarker. Biomarker ascertainment would usually occur at the time of randomization. Alternatively, specimens could be collected at randomization and stored until the end of the trial when the biomarkers would be ascertained, thereby avoiding the possibility of noncompliance with treatment due to knowledge of marker results (Baker and Freedman, 1995), although investigators would need to consider any ethical implications of withholding such information.

The biomarker-analysis design can involve the following five comparisons: (i) experimental versus control treatment among all participants, (ii) experimental versus control treatment in the subgroup with positive markers, (iii) experimental versus control treatment in the subgroup with negative markers, (iv) marker-based treatment selection versus control treatment, and (v) marker-based treatment selection versus experimental treatment. Song and Pepe (2004) and Vickers *et al*. (2007) emphasize comparisons (iv) and (v). See Appendix A for details (involving a binary outcome) of how comparisons (iv) and (v) are computed and why they are equivalent to comparisons (ii) and (iii), respectively, multiplied by a constant.
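
The relationship among the five comparisons can be checked numerically. The following is a minimal sketch, not a reproduction of Appendix A: it assumes a binary favorable outcome, measures each comparison as a difference in the probability of that outcome, and uses hypothetical probabilities. The function name `strategy_comparisons` and all argument names are illustrative only.

```python
def strategy_comparisons(p_exp_pos, p_ctl_pos, p_exp_neg, p_ctl_neg, prev_pos):
    """Comparisons (i)-(v) as differences in the probability of a favorable outcome.

    p_exp_pos / p_ctl_pos: probability under experimental / control, marker positive
    p_exp_neg / p_ctl_neg: probability under experimental / control, marker negative
    prev_pos: proportion of participants who are marker positive
    """
    prev_neg = 1.0 - prev_pos
    # Probability of a favorable outcome if everyone follows one policy.
    all_exp = prev_pos * p_exp_pos + prev_neg * p_exp_neg
    all_ctl = prev_pos * p_ctl_pos + prev_neg * p_ctl_neg
    # Marker-based selection: experimental if positive, control if negative.
    selection = prev_pos * p_exp_pos + prev_neg * p_ctl_neg
    return {
        "i": all_exp - all_ctl,
        "ii": p_exp_pos - p_ctl_pos,
        "iii": p_exp_neg - p_ctl_neg,
        "iv": selection - all_ctl,
        "v": selection - all_exp,
    }


# Hypothetical inputs chosen only for illustration.
c = strategy_comparisons(0.60, 0.40, 0.30, 0.35, prev_pos=0.25)
print(c["iv"], 0.25 * c["ii"])    # (iv) equals prev_pos * (ii)
print(c["v"], -0.75 * c["iii"])   # (v) equals -(1 - prev_pos) * (iii)
```

Under these assumptions the constants are the proportion of marker-positive participants for comparison (iv) and the negative of the proportion of marker-negative participants for comparison (v).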

To account for multiple comparisons, significance levels and confidence intervals must be adjusted in the biomarker-analysis design. Based on a conservative Bonferroni adjustment, to obtain an overall significance level of 5%, the significance levels used for all comparisons should sum to 5% (Costigan, 1998). For example, suppose there are two comparisons, namely, (i) and (ii). A reasonable choice is significance levels of 4% and 1% for comparisons (i) and (ii), which implies that estimates of treatment benefit associated with comparisons (i) and (ii) should involve 96% and 99% confidence intervals, respectively.
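
The allocation of the overall significance level can be written as a small check. This is an illustrative sketch only; the function name `bonferroni_confidence_levels` is hypothetical.

```python
def bonferroni_confidence_levels(alphas, overall_alpha=0.05):
    """Map per-comparison significance levels to confidence levels.

    Raises ValueError if the levels sum to more than the overall level,
    which would break the Bonferroni guarantee.
    """
    if sum(alphas) > overall_alpha + 1e-12:
        raise ValueError("significance levels exceed the overall level")
    return [1.0 - a for a in alphas]


# Comparisons (i) and (ii) at 4% and 1%, as in the example above.
print(bonferroni_confidence_levels([0.04, 0.01]))
```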

The identification of a promising subgroup in the biomarker-analysis design involves one of various plots. Bonetti and Gelber (2000; 2004) introduced the tail-oriented and sliding-window subpopulation treatment effect pattern plots, whose names are shortened here to tail-oriented and sliding-window subgroup plots. The *tail-oriented subgroup plot* graphs the estimated benefit of experimental versus control treatment among patients with a marker level greater than a cutpoint (thereby specifying the tail of the distribution) as a function of different cutpoints. The *sliding-window subgroup plot* graphs the estimated benefit of experimental versus control treatment among persons with a marker level within an interval (thereby defining a sliding window) as a function of marker level. The *selection impact curve* (Song and Pepe, 2004) graphs the benefit of marker-based treatment selection as a function of marker cutpoints. The *marker-by-treatment predictiveness curves* graph the risks of outcome separately under experimental and control treatments for persons with a marker in the interval. Appendix B formally defines these curves and discusses their mathematical connections when the outcome is binary. The tail-oriented subgroup plot is closely related to the selection impact curve, and the sliding-window subgroup plot can be obtained from marker-by-treatment predictiveness curves.
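
The point estimates behind the two subgroup plots can be sketched as follows. This is a simplified illustration for a binary outcome, using a difference in proportions as the benefit measure; it is not the implementation of Bonetti and Gelber. The function names, the use of `>=` for the tail, and the fixed-width window are assumptions made here.

```python
import numpy as np


def risk_difference(y, treated):
    """Proportion with the outcome under experimental minus control (nan if an arm is empty)."""
    y1, y0 = y[treated == 1], y[treated == 0]
    if len(y1) == 0 or len(y0) == 0:
        return float("nan")
    return y1.mean() - y0.mean()


def tail_oriented(marker, y, treated, cutpoints):
    """Estimated benefit among participants with marker >= each cutpoint."""
    return np.array([
        risk_difference(y[marker >= c], treated[marker >= c]) for c in cutpoints
    ])


def sliding_window(marker, y, treated, centers, half_width):
    """Estimated benefit among participants with marker within half_width of each center."""
    estimates = []
    for m in centers:
        inside = np.abs(marker - m) <= half_width
        estimates.append(risk_difference(y[inside], treated[inside]))
    return np.array(estimates)
```

Plotting the returned arrays against the cutpoints or window centers gives the two curves; confidence bands would need the multiple-comparison adjustment discussed next.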

Confidence intervals for the tail-oriented and sliding-window subgroup plots adjust for multiple comparisons of many cutpoints or intervals (Bonetti and Gelber, 2000; 2004; Cai *et al*., 2011). Cai *et al*. (2011) applied the sliding-window subgroup plot to a single marker derived from fitting a mathematical model to a small number of pre-specified markers. With a few markers, there is little overfitting bias from using the same data to both fit a model to combine markers and evaluate treatment benefit based on a single marker derived from this model. With more markers under consideration, the adaptive signature design (to be discussed) is required to avoid overfitting bias.

When evaluating treatment benefit, an important population quantity is the *benefit threshold*, denoted by *T*, which is the benefit of experimental versus control treatment that exactly compensates for any additional detrimental side effects associated with the experimental treatment (Vickers *et al*., 2007). Using the benefit threshold, investigators can consider the net benefit of treatment comparisons, which is the benefit minus the cost in the same units. As derived in Appendix C, for marker-based treatment selection to have greater net benefit than control treatment, the benefit of experimental versus control treatment must be greater than *T* among persons positive for the marker. Also, for marker-based treatment selection to have greater net benefit than experimental treatment, the benefit of experimental versus control treatment must be less than *T* among persons negative for the marker, although this is usually less of a consideration. A promising subgroup can be identified by selecting the subgroup in which the lower bound of the confidence interval is greater than the benefit threshold *T*. An example is discussed later for the extension of the adaptive signature design using the tail-oriented and sliding-window subgroup plots.
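
The two net-benefit conditions can be seen from a short calculation. The following is a sketch under stated assumptions rather than the derivation in Appendix C: the symbols $\pi$, $B_{+}$, $B_{-}$, and $\mathrm{NB}$ are notation introduced here, with the cost of experimental treatment taken to be *T* per person treated.

```latex
% \pi  : proportion of persons who are marker positive, 0 < \pi < 1
% B_+  : benefit of experimental versus control treatment among marker-positive persons
% B_-  : benefit of experimental versus control treatment among marker-negative persons
% T    : benefit threshold
\begin{align*}
\mathrm{NB}(\text{selection vs. control})
  &= \pi\,(B_{+} - T) > 0 \iff B_{+} > T, \\
\mathrm{NB}(\text{experimental vs. control})
  &= \pi\,(B_{+} - T) + (1-\pi)\,(B_{-} - T), \\
\mathrm{NB}(\text{selection vs. experimental})
  &= \pi\,(B_{+} - T) - \bigl[\pi\,(B_{+} - T) + (1-\pi)\,(B_{-} - T)\bigr] \\
  &= (1-\pi)\,(T - B_{-}) > 0 \iff B_{-} < T.
\end{align*}
```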

**Biomarker-adaptive Threshold Design**

The biomarker-adaptive threshold design (Jiang *et al*., 2007) is a type of biomarker-analysis design involving the sequential investigation of the two aforementioned comparisons, (i) and (ii). For example, if the null hypothesis for comparison (i) is rejected at a significance level of 4%, then the procedure is complete; otherwise (and this is the “adaptive” aspect in the name of the design), a statistical test is performed for comparison (ii). In the biomarker-adaptive threshold design, the subgroup is chosen based on the cutpoint (threshold) yielding the strongest statistical evidence for a treatment benefit (in contrast to the use of the benefit threshold to select a subgroup).

**Biomarker-nested-case-control Design**

For a rare outcome, the *biomarker-nested-case-control design* is an appealing variant of the biomarker-analysis design because it can reduce costs with little reduction in the precision of estimates (Baker and Kramer, 2005). In the biomarker-nested-case-control design, all participants are randomized to either control or experimental treatment with specimens stored at the time of randomization. At the end of the trial, some participants are randomly selected for marker testing using the stored specimens, where the probability of selection for marker testing depends on the participant’s outcome. Alternatively, if the biomarker indicates a gene carrier or not, which is time invariant, the determination of the marker level at the end of the trial (assuming participants are available) does not require stored specimens. Another version of the biomarker-nested-case-control design involves testing for the biomarker among only participants with the outcome of interest, but this design can only be used to estimate relative risk (King *et al*., 2001).
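
Outcome-dependent selection for marker testing can be sketched as below. The selection probabilities are hypothetical, and the inverse-probability weights are an assumption added here as one common way to account for the sampling in later estimation; the source does not specify the analysis method.

```python
import numpy as np


def select_for_marker_testing(y, prob_case=1.0, prob_noncase=0.2, seed=0):
    """Randomly select participants for marker testing with outcome-dependent probability.

    y: array of 0/1 outcomes (1 = participant had the rare outcome).
    Returns a boolean selection mask and inverse-probability weights
    (0 for participants whose specimens are not tested).
    Both probabilities must be greater than zero.
    """
    rng = np.random.default_rng(seed)
    prob = np.where(y == 1, prob_case, prob_noncase)
    selected = rng.random(len(y)) < prob
    weights = np.where(selected, 1.0 / prob, 0.0)
    return selected, weights
```

With `prob_case=1.0`, every participant with the outcome is tested, and only a fraction of the much larger group without the outcome needs marker testing.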

**Adaptive Signature Design**

The advantage of pharmacogenetics over the use of a single pre-specified marker comes from the potential to identify a more promising subgroup using a large set of possible markers measured at randomization compared to using only a single marker measured at randomization. Freidlin and Simon (2005) introduced the *adaptive signature design* to identify and evaluate treatment in a subgroup of trial participants using a large number of possible markers. (The word “signature” in the name of the design refers to a combination of genetic markers that is sometimes called a gene signature.) The adaptive signature design involves the following procedure. The null hypothesis that the benefit of experimental versus control treatment equals the benefit threshold *T* (with *T* = 0 in the original formulation) is tested among all participants at a slightly reduced statistical significance level of 0.04 instead of the usual 0.05 because a second test at level 0.01 will also be implemented. If the null hypothesis is rejected in this test, the experimental treatment is recommended over control treatment for all eligible persons and the analysis is complete. Otherwise (and this is the “adaptive” aspect in the name of the design) a potentially promising subgroup is identified and evaluated in the test sample by the following algorithm. First, using the training sample, a model is constructed to predict the benefit of experimental versus control treatment as a function of baseline variables. This model is called here a *benefit function*. The benefit function is then computed for each participant in the test sample to obtain a *benefit score*. A pre-specified cutpoint for the benefit score is used to specify a subgroup. The estimated benefit of treatment in this subgroup is tested at a statistical significance level of 0.01 to determine if it is greater than the benefit threshold of *T* (*T* = 0 in the original formulation).
According to Freidlin and Simon (2005) the adaptive signature design “is especially attractive for allowing pharmaceutical companies to invest in the development of pharmacogenomic signatures without the risk of losing of broad labeling indications where supported by the results of phase III trials.”
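
The two-stage logic of the procedure can be sketched in code. This is an illustrative outline under several assumptions that are not taken from the source: a binary outcome, a Wald test on the difference in proportions, an even random split into training and test samples, and a user-supplied `fit_benefit_function` callable (hypothetical) that returns a scoring function.

```python
import math
import numpy as np


def wald_test(y, treated, threshold=0.0):
    """Risk difference and two-sided Wald p-value for H0: difference == threshold."""
    y1, y0 = y[treated == 1], y[treated == 0]
    diff = y1.mean() - y0.mean()
    se = math.sqrt(y1.var() / len(y1) + y0.var() / len(y0))
    if se == 0.0:  # degenerate case: no variability within either arm
        return diff, (1.0 if diff == threshold else 0.0)
    z = (diff - threshold) / se
    return diff, math.erfc(abs(z) / math.sqrt(2.0))


def adaptive_signature(x, y, treated, fit_benefit_function, cutpoint,
                       threshold=0.0, train_fraction=0.5, seed=0):
    """Stage 1: test all participants at 0.04. Stage 2: test a subgroup at 0.01."""
    diff_all, p_all = wald_test(y, treated, threshold)
    if p_all < 0.04 and diff_all > threshold:
        return {"recommend": "all", "benefit": diff_all, "p_value": p_all}

    # Random split; the benefit function is built on the training sample only.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(train_fraction * len(y))
    train, test = order[:n_train], order[n_train:]
    benefit_function = fit_benefit_function(x[train], y[train], treated[train])

    # Benefit scores and the subgroup are evaluated in the test sample only.
    scores = benefit_function(x[test])
    subgroup = test[scores >= cutpoint]
    diff_sub, p_sub = wald_test(y[subgroup], treated[subgroup], threshold)
    if p_sub < 0.01 and diff_sub > threshold:
        return {"recommend": "subgroup", "benefit": diff_sub, "p_value": p_sub}
    return {"recommend": "none", "benefit": diff_sub, "p_value": p_sub}
```

Keeping the model fitting inside the training indices is the design choice that matters: the test-sample estimate is then not affected by the search over many markers.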

One benefit function that has been used with the adaptive signature design is the minimum odds ratio for treatment effect in at least *k* markers with treatment-marker level interactions that are statistically significant at some level *c*, where *k* and *c* are specified in advance (Freidlin and Simon, 2005). Another benefit function, which has been used in the biomarker-analysis design but could be applied to the adaptive signature design, is the risk difference (Vickers *et al*., 2007; Cai *et al*., 2011; Foster *et al*., 2011). The risk difference is the probability of outcome based on baseline variables in the experimental group minus the probability of outcome based on baseline variables in the control group.
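
A risk-difference benefit function can be sketched by fitting one risk model per randomization group and subtracting the predictions. For brevity this sketch swaps in a linear probability model fitted by least squares and clipped to [0, 1]; that choice, and the name `fit_risk_difference`, are assumptions made here, and a logistic or other risk model could be used instead.

```python
import numpy as np


def fit_risk_difference(x, y, treated):
    """Fit separate risk models by arm; return a function giving benefit scores.

    x: (n, p) array of baseline variables; y: 0/1 outcomes; treated: 0/1 arm labels.
    """
    def fit_arm(mask):
        design = np.column_stack([np.ones(mask.sum()), x[mask]])
        coef, *_ = np.linalg.lstsq(design, y[mask].astype(float), rcond=None)
        return coef

    coef_exp = fit_arm(treated == 1)
    coef_ctl = fit_arm(treated == 0)

    def benefit_score(x_new):
        design = np.column_stack([np.ones(len(x_new)), x_new])
        risk_exp = np.clip(design @ coef_exp, 0.0, 1.0)
        risk_ctl = np.clip(design @ coef_ctl, 0.0, 1.0)
        # Probability of outcome under experimental minus under control.
        return risk_exp - risk_ctl

    return benefit_score
```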

**Adaptive Signature Design with Subgroup Plots**

The adaptive signature design can be made more flexible, at the “price” of wider confidence intervals of estimated benefit, by identifying a subgroup after considering various cutpoints of the benefit score using tail-oriented or sliding-window subgroup plots. This extension of the adaptive signature design is considered in the context of a risk difference benefit function. To estimate the risk difference benefit function in the training sample, a separate model for the risk of outcome as a function of markers is fit to data in each randomization group. Typically, to speed computations, the large number of markers is reduced to a much smaller number of the most promising prognostic markers, namely those markers that best individually separate favorable versus unfavorable outcomes. Then a stepwise fitting algorithm can be applied to this reduced set of promising prognostic markers, successively selecting the marker that most improves fit given the previous markers already selected. It is generally a good idea to limit the number of markers selected because there are diminishing returns in classification performance after only a few markers are included in the risk model (Hand, 2006). Using the risk difference benefit function estimated in the training sample, benefit scores are computed in the test sample, which are used to create tail-oriented and sliding-window subgroup plots. It is convenient to use quantiles of benefit score rather than the raw benefit score on the horizontal axes of these plots. As only the particular benefit function obtained from the training sample is evaluated, estimates of parameters in the benefit function are viewed as fixed, and the only variability in the estimated benefit in the subgroups comes from the data in the test sample.
Simultaneous 99% confidence intervals that adjust for multiple comparisons are computed by repeatedly re-sampling the test sample data and finding a multiplier of standard error that includes 99% of the estimates. For comparison, pointwise 99% confidence intervals that do not adjust for multiple comparisons are also computed. These plots allow identification of a subgroup when the lower bound of the simultaneous 99% confidence interval is greater than the benefit threshold.
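
One way to obtain such a multiplier is a bootstrap of the maximum standardized deviation across cutpoints. The sketch below is an assumption about the general approach, not the computation in Appendix E: it resamples rows of the test data with replacement (without stratifying by randomization group) and takes `estimator` to be any hypothetical function that returns one estimate per cutpoint or window.

```python
import numpy as np


def simultaneous_intervals(data, estimator, n_boot=1000, level=0.99, seed=0):
    """Simultaneous confidence intervals from a bootstrap of the test sample.

    data: (n, k) array with one row per test-sample participant.
    estimator: function mapping such an array to a 1-D array of estimates.
    Returns (estimate, lower, upper, multiplier).
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    estimate = np.asarray(estimator(data), dtype=float)
    boot = np.array([
        estimator(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)
    ])
    se = boot.std(axis=0, ddof=1)
    # Largest standardized deviation across all cutpoints in each replicate.
    max_dev = np.max(np.abs(boot - estimate) / se, axis=1)
    # Multiplier of standard error whose band contains `level` of the replicates.
    multiplier = np.quantile(max_dev, level)
    return estimate, estimate - multiplier * se, estimate + multiplier * se, multiplier
```

Replacing the multiplier with the usual normal quantile would give the pointwise intervals used for comparison.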

To illustrate the application of tail-oriented and sliding-window subgroup plots to the adaptive signature design, hypothetical data were created for a trial with 10,000 markers and 400 participants. Expression levels for each marker were generated under independent normal distributions. In each randomization group, a binary outcome for each individual was generated with a probability based on two markers. See Appendix D for details. (Although this example considers a binary outcome, the same design and analysis could be applied to a survival outcome.) The benefit threshold was specified as *T* = 0.15. With these hypothetical data, the estimated benefit of treatment among all participants was 0.13 with a 96% confidence interval of (0.03, 0.23). Thus there is little evidence that the benefit of treatment among all participants is greater than the benefit threshold; in fact these results suggest that the benefit of treatment among all participants is less than the benefit threshold. Therefore there is interest in looking for a promising subgroup with a lower bound of the 99% confidence interval for estimated benefit that is greater than the benefit threshold.
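
A data set of this general shape, and the overall estimate with its 96% confidence interval, can be sketched as follows. The outcome model in `simulate_trial` (logistic form, coefficients, equal allocation) is invented here for illustration and is not the model in Appendix D; `risk_difference_ci` uses a standard Wald interval.

```python
from statistics import NormalDist

import numpy as np


def simulate_trial(n=400, n_markers=10_000, seed=0):
    """Hypothetical trial: independent normal markers, outcome driven by two of them.

    n is assumed to be even so that the two arms have equal size.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, n_markers))
    treated = rng.permutation(np.repeat([0, 1], n // 2))
    # Assumed model: treatment helps more when markers 0 and 1 are high.
    logit = -0.5 + treated * (0.5 + 1.0 * x[:, 0] + 1.0 * x[:, 1])
    risk = 1.0 / (1.0 + np.exp(-logit))
    y = (rng.random(n) < risk).astype(int)
    return x, treated, y


def risk_difference_ci(y, treated, level=0.96):
    """Difference in outcome proportions (experimental minus control) with a Wald interval."""
    y1, y0 = y[treated == 1], y[treated == 0]
    p1, p0 = y1.mean(), y0.mean()
    se = np.sqrt(p1 * (1 - p1) / len(y1) + p0 * (1 - p0) / len(y0))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p1 - p0
    return diff, (diff - z * se, diff + z * se)


if __name__ == "__main__":
    x, treated, y = simulate_trial()
    print(risk_difference_ci(y, treated, level=0.96))
```

The 96% level matches the 0.04 significance level allotted to the comparison among all participants.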

Tail-oriented and sliding-window subgroup plots were computed from these hypothetical data. (See Appendix E for computational details.) Because high quantiles of benefit scores involve too little data for estimating treatment benefit in both subgroup plots and low quantiles involve too little data for estimating treatment benefit in the sliding-window subgroup plot, interest focused on a range of quantiles for benefit score between 0.10 and 0.90 (**Figure 1**). The confidence intervals are narrower for the tail-oriented subgroup plot than for the sliding-window subgroup plot because most of the subgroups in the tail-oriented subgroup plot involve a larger sample size. However, the sliding-window subgroup plot provides more relevant information for an individual. This is illustrated by comparing the subgroups with a lower bound of *T* = 0.15 for the simultaneous 99% confidence interval for estimated benefit. In the sliding-window subgroup plot, this subgroup of interest consists of individuals with a benefit score in the 85^{th} percentile or greater. In the tail-oriented subgroup plot, this subgroup of interest consists of individuals with a benefit score in the 35^{th} percentile or greater. The subgroup of interest derived from the tail-oriented subgroup plot includes many individuals with lower bounds of the estimated benefit that are substantially less than the benefit threshold of *T*. Therefore, despite the wider confidence intervals, from the perspective of treatment evaluation for an individual, the sliding-window subgroup plot is preferred to the tail-oriented subgroup plot for identifying a subgroup of interest based on the benefit threshold.
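
Reading a subgroup of interest off either plot amounts to finding where the simultaneous lower bound rises above *T*. The helper below is a hypothetical sketch that assumes the subgroup is the upper range of quantiles over which the lower bound stays above the threshold; the input values in the usage line are invented.

```python
import numpy as np


def promising_cutoff(quantiles, lower_bounds, threshold):
    """Smallest quantile from which the lower bound stays above the threshold.

    Returns None if the lower bound at the highest quantile is not above it.
    """
    above = np.asarray(lower_bounds) > threshold
    if not above[-1]:
        return None
    k = len(above) - 1
    # Walk down from the top of the grid while the bound remains above threshold.
    while k > 0 and above[k - 1]:
        k -= 1
    return quantiles[k]


# Invented lower bounds on a coarse grid, with T = 0.15.
print(promising_cutoff([0.5, 0.6, 0.7, 0.8, 0.9], [0.0, 0.2, 0.1, 0.16, 0.2], 0.15))
```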

**Discussion**

One criticism of the adaptive signature design relative to the biomarker-analysis design is the lack of information on disease mechanism to guide selection of the subgroup. However, medical advances do not always require an understanding of disease mechanism. A particularly striking example is quinine to treat malaria, which was successfully used centuries ago without any knowledge of its mechanism (Achan *et al*., 2011). Echoing this viewpoint in a different context, Ransohoff (2004) writes “if we relied on a ‘Should it work?’ line of reasoning, we could never learn anything really new….we cannot [always] decide whether new things do work by reasoning about whether they should work.” The adaptive signature design has the potential to identify a promising subgroup beyond the constraints of what is currently known about the mechanistic cause of disease. The use of the sliding-window subgroup plot with the adaptive signature design increases the flexibility of the adaptive signature design in finding a subgroup with treatment benefit that is significantly greater than the benefit threshold.

**Supplementary Document**

**Acknowledgments**

This work was supported by the National Cancer Institute. The authors thank Rich Simon for his helpful comments.

**Disclosure**

The authors report no conflicts of interest.

**Corresponding Author**

Stuart G. Baker, Sc.D., Division of Cancer Prevention, National Cancer Institute, Bethesda, Maryland 20892, USA.

**References**

Achan J, Talisuna AO, Erhart A, Yeka A, Tibenderana JK, Baliraine FN, Rosenthal PJ, D'Alessandro U. Quinine, an old anti-malarial drug in a modern world: role in the treatment of malaria. *Malaria J* 10:144, 2011.

Baker SG, Freedman LS. Potential impact of genetic testing on cancer prevention trials, using breast cancer as an example. *J Natl Cancer Inst* 87:1137-1144, 1995.

Baker SG, Kramer BS. Statistics for weighing benefits and harms in a proposed genetic substudy of a randomized cancer prevention trial. *Appl Stat* 54(5):941-954, 2005.

Bonetti M, Gelber RD. A graphical method to assess treatment-covariate interactions using the Cox model on subsets of the data. *Stat Med* 19:2595-2609, 2000.

Bonetti M, Gelber RD. Patterns of treatment effects in subsets of patients in clinical trials. *Biostatistics* 5:465-481, 2004.

Cai T, Tian L, Wong PH, Wei LJ. Analysis of randomized comparative clinical trial data for personalized treatment selections. *Biostatistics* 12(2):270-282, 2011.

Chapman PB, Hauschild A, Robert C, Larkin JMG, Haanen JBAG, Ribas A, Hogg D, Day S, Ascierto PA, Testori A, Lorigan P, Dummer R, Sosman JA, Garbe C, Lee RJ, Nolop KB, Nelson B, Hou J, Flaherty KT, McArthur GA. Phase III randomized, open-label, multicenter trial (BRIM3) comparing BRAF inhibitor vemurafenib with dacarbazine (DTIC) in patients with ^{V600E}BRAF-mutated melanoma. *J Clin Oncol* 29(suppl), abstract #LBA4, 2011.

Costigan T. Bonferroni inequalities and intervals. In: *Encyclopedia of Biostatistics*. Armitage P, Colton T (eds.). Volume 1, pp. 421-425, John Wiley & Sons, Ltd., Chichester, UK, 1998.

Dahabreh IJ, Terasawa T, Castaldi PJ, Trikalinos TA. Systematic review: anti-epidermal growth factor receptor treatment effect modification by KRAS mutations in advanced colorectal cancer. *Ann Intern Med* 154(1):37-49, 2011.

Foster JC, Taylor JMG, Ruberg SJ. Subgroup identification from randomized clinical trial data. *Stat Med* 30(24):2867-2880, 2011.

Freidlin B, McShane LM, Korn EL. Randomized clinical trials with biomarkers: design issues. *J Natl Cancer Inst* 102(3):152-160, 2010.

Freidlin B, Simon R. Adaptive signature design: an adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients. *Clin Cancer Res* 11(21):7872-7878, 2005.

Hand DJ. Classifier technology and the illusion of progress. *Statist Sci* 21(1):1-14, 2006.

Hopkins MM, Ibarreta D, Gaisser S, Enzing CM, Ryan J, Martin PA, Lewis G, Detmar S, van den Akker-van Marle ME, Hedgecoe AM, Nightingale P, Dreiling M, Hartig KJ, Vullings W, Forde T. Putting pharmacogenetics into practice. *Nat Biotechnol* 24(4):403-410, 2006.

Janes H, Pepe MS, Bossuyt PM, Barlow WE. Measuring the performance of markers for guiding treatment decisions. *Ann Intern Med* 154(4):253-259, 2011.

Jiang W, Freidlin B, Simon R. Biomarker-adaptive threshold design: a procedure for evaluating treatment with possible biomarker-defined subset effect. *J Natl Cancer Inst* 99(13):1036-1043, 2007.

King MC, Wieand S, Hale K, Lee M, Walsh T, Owens K, Tait J, Ford L, Dunn BK, Costantino J, Wickerham L, Wolmark N, Fisher B; National Surgical Adjuvant Breast and Bowel Project. Tamoxifen and breast cancer incidence among women with inherited mutations in BRCA1 and BRCA2: National Surgical Adjuvant Breast and Bowel Project (NSABP-P1) Breast Cancer Prevention Trial. *JAMA* 286(18):2251-2256, 2001.

Parker TL, Strout MP. Chronic lymphocytic leukemia: prognostic factors and impact on treatment. *Discov Med* 11(4):115-123, 2011.

Ransohoff DF. Evaluating discovery-based research: when biologic reasoning cannot work. *Gastroenterology* 127:1028, 2004.

Sargent DJ, Conley BA, Allegra C, Collette L. Clinical trial designs for predictive marker validation in cancer treatment trials. *J Clin Oncol* 23(9):2020-2027, 2005.

Song X, Pepe MS. Evaluating markers for selecting a patient’s treatment. *Biometrics* 60(4):874-883, 2004.

Vickers AJ, Kattan MW, Sargent DJ. Method for evaluating prediction models that apply the results of randomized trials to individual patients. *Trials* 8:14, 2007.

**[Discovery Medicine; ISSN: 1539-6509; *Discov Med* 13(70):187-192, March 2012. Copyright © Discovery Medicine. All rights reserved.]**