## Evaluating a New Marker for Risk Prediction: Decision Analysis to the Rescue

**Abstract:** In many areas of medicine risk prediction models are used to identify high-risk persons to receive treatment, with the goal of maximizing the ratio of benefits to harms. Thus there is considerable interest in evaluating markers to improve risk prediction. Many measures to evaluate a new marker for risk prediction are based solely on predictive accuracy, including the odds ratio, change in the area under the receiver operating characteristic curve, and net reclassification improvement. However, predictive accuracy measures do not capture important clinical implications. Decision analysis comes to the rescue by including the ratio of the anticipated harm (“cost”) of a false positive to the anticipated benefit of a true positive, which is transformed into a risk threshold (*T*) of indifference between treatment and no treatment. A decision-analytic measure of the “value” of a new marker is the number needed to test at a particular risk threshold, denoted *NNTest(T)*, the minimum number of marker tests per true positive needed for risk prediction to be worthwhile. If *NNTest(T)* is acceptable given the invasiveness and adverse consequences of the test for the new marker, the new marker is recommended for inclusion in risk prediction. We provide a simple review of the derivation and computation of *NNTest(T)* from risk stratification tables and compare the minimum of *NNTest(T)*, over risk thresholds, with measures of predictive accuracy in six studies. The results illustrate the advantages of this decision-analytic approach for evaluating a new marker for risk prediction.

**Introduction**

Risk prediction models are increasingly common in the medical literature. A risk prediction model is a statistical model to predict the probability of an outcome (here disease occurrence) based on risk factors and markers. We consider the situation in which persons receive no treatment in the absence of risk prediction and persons at sufficiently high risk are recommended for treatment (or an intervention to prevent disease). The goal is to identify those people at sufficiently high risk for disease that the benefits outweigh the risks of the treatment, which is particularly important in the setting of primary prevention in a healthy population. In this context, we evaluate a new (or additional) marker to improve the risk prediction model. Motivation and illustration come from the following examples:

(i) the addition of systolic blood pressure (SBP) to predict the 10-year risk of developing cardiovascular disease among women (Cook and Ridker, 2009),

(ii) the addition of C-reactive protein (CRP) to predict the 10-year risk of developing cardiovascular disease among women (Cook *et al.,* 2006),

(iii) the addition of breast density (BD) to predict the 5-year risk of developing breast cancer in women (Tice *et al.,* 2008; Janes *et al*., 2008),

(iv) the addition of C-reactive protein and parental history (CRP+PH) to predict the 10-year risk of developing cardiovascular disease in men (Ridker *et al.,* 2008),

(v) the addition of high density lipoprotein (HDL) to predict the 10-year risk of cardiovascular disease among women (Cook, 2007),

(vi) the addition (to covariates from ATP III risk score) of a particular genotype to predict the 10-year risk of cardiovascular disease among women (Paynter *et al*., 2009).

These examples were selected because the articles presented risk stratification tables. In these examples, only persons at sufficiently high risk are recommended for an intervention to prevent disease occurrence in the time range indicated.

Commonly used measures of the “value” of a new marker for risk prediction emphasize predictive accuracy; these include the odds ratio (*OR*), the hazard ratio (*HR*), the change in area under the receiver operating characteristic (ROC) curve (Tzoulaki *et al.,* 2009), abbreviated here as *∆AUC*, and the net reclassification improvement (*NRI*), which equals the net increase in the percent of cases classified as higher risk plus the net increase in the percent of controls classified as lower risk (Pencina *et al.,* 2008; 2010). However, *OR*, *HR*, *∆AUC*, and *NRI* provide only a limited view of the “value” of a new marker because they do not account for anticipated costs (harms) and benefits. They therefore do not capture the tradeoffs that the physician and patient face in making decisions about interventions that can carry both benefits and harms.

Decision analysis comes to the rescue by providing a more informative measure of the net “value” of a new marker. A key quantity in formulating a decision-analytic measure is the ratio of the anticipated cost (or harm) of a false positive to the anticipated benefit of a true positive, which is transformed into a risk threshold of indifference (denoted by *T*) between treatment and no treatment (Pauker and Kassirer, 1975). A useful decision-analytic measure is the number needed to test at a given risk threshold, *NNTest*(*T*), the minimum number of marker tests per true positive required for risk prediction to be worthwhile when the risk threshold is *T*. Previous names for *NNTest*(*T*) are the test tradeoff and the test threshold (Baker, 2009; Baker *et al*., 2009). Here we provide a simple review of the derivation and computation of *NNTest*(*T*) from summary risk stratification tables. Also, using the aforementioned studies, we compare the minimum of *NNTest*(*T*) with measures of predictive accuracy and discuss interpretation of the results.

**Risk Stratification Tables**

Let Model 1 denote a baseline risk prediction model and Model 2 denote an enhanced risk prediction model that adds a new marker or set of markers to Model 1. A risk stratification table is a cross-classification involving intervals of predicted risk (risk scores) for Models 1 and 2. Risk stratification tables are the basis for computing *NRI* (Janes *et al.,* 2008), but can also be used to compute *NNTest*(*T*). We focus on a common clinical scenario of no treatment in the absence of risk prediction with data based only on persons not receiving the treatment. Ideally the computation of risk stratification tables is based on data in an application (also called validation) sample using a model that was previously fit to an independent development sample. In the aforementioned cardiovascular disease prediction examples, the same data were used for fitting and application.

With survival data, two risk stratification tables are required: one in which each cell of the table displays the Kaplan-Meier estimate of the probability of disease incidence, *Q*, in some time period, and one in which each cell displays the number of persons, *N* (**Table 1**). These tables are used to derive two other risk stratification tables, one in which each cell displays the number of persons developing disease in the time period (cases), *X* = *Q N*, and one in which each cell displays the number of persons not developing disease in the time period (controls), *Y* = (1-*Q*) *N* (**Table 2**). Let *j* index intervals of risk score, called risk intervals. Also let *X_{j}* denote marginal counts for cases and *Y_{j}* denote marginal counts for controls, where these calculations apply either to row margins (Model 1) or column margins (Model 2). The observed risk for interval *j* is *R_{j}* = *X_{j}* / *N_{j}*, where *N_{j}* = *X_{j}* + *Y_{j}* (**Table 3**). The observed risk should not be confused with the risk score. The risk score is a preliminary guess of risk based on the risk prediction model. The observed risk directly uses outcome data and is thus a better estimate of risk than the risk score.
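As a concrete sketch of these derivations, the following Python fragment builds the counts of Table 2 and the observed risks of Table 3 from hypothetical Table 1 inputs; the *Q* and *N* values are purely illustrative and are not taken from any of the six studies.

```python
import numpy as np

# Hypothetical 2x2 risk stratification tables (rows: Model 1 risk intervals,
# columns: Model 2 risk intervals). Values are illustrative only.
Q = np.array([[0.01, 0.03],
              [0.04, 0.09]])        # Kaplan-Meier disease probabilities (Table 1)
N = np.array([[5000, 1000],
              [800, 1200]])         # number of persons per cell (Table 1)

X = Q * N                           # cases per cell, X = Q N (Table 2)
Y = (1 - Q) * N                     # controls per cell, Y = (1-Q) N (Table 2)

# Marginal counts: summing over rows gives Model 2 (column) margins.
X2 = X.sum(axis=0)                  # X_j for Model 2 intervals
Y2 = Y.sum(axis=0)                  # Y_j for Model 2 intervals
N2 = X2 + Y2                        # N_j = X_j + Y_j
R2 = X2 / N2                        # observed risk R_j = X_j / N_j (Table 3)
```

Summing over columns (`axis=1`) instead gives the corresponding Model 1 margins.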

**ROC Curve**

Although the ROC curve provides a summary of predictive accuracy, it is also an important intermediate quantity in the calculation of *NNTest*(*T*) (Baker, 2009; Baker *et al.,* 2012c). Let *FPR_{j}* and *TPR_{j}* denote the false and true positive rates corresponding to risk interval *j*. *FPR_{j}* is the estimated probability that the risk score of a person who does not develop disease lies in a risk interval greater than or equal to *j*. *TPR_{j}* is the estimated probability that the risk score of a person who develops disease lies in a risk interval greater than or equal to *j*. The calculation of *FPR_{j}* and *TPR_{j}* based on observed risks is presented in Appendix A. The ROC curve is a plot of successive points (*FPR_{j}*, *TPR_{j}*) connected by lines.

*NNTest*(*T*) requires a concave (decreasing-slope) ROC curve, which is nearly always the case with a clinically useful risk prediction model but which may not occur due to sampling variability, particularly with many risk intervals. If the curve is not uniformly concave, a concave curve can be created by connecting the points with the largest slopes from left to right (Baker *et al.,* 2012c).
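Computed from the marginal counts of a risk stratification table, *FPR_{j}* and *TPR_{j}* are reverse cumulative proportions of controls and cases. A minimal sketch, with made-up marginal counts ordered from lowest to highest risk interval:

```python
import numpy as np

# Hypothetical marginal counts per risk interval j (lowest to highest risk).
X = np.array([30.0, 50.0, 58.0])       # cases X_j
Y = np.array([4000.0, 1500.0, 418.0])  # controls Y_j

# FPR_j (TPR_j): probability that a control (case) falls in an interval >= j.
# Reverse cumulative sums over intervals, divided by the totals.
FPR = Y[::-1].cumsum()[::-1] / Y.sum()
TPR = X[::-1].cumsum()[::-1] / X.sum()

# Successive points (FPR_j, TPR_j), connected by lines, trace the ROC curve
# from (1, 1) at the lowest interval down toward (0, 0).
```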

**Net Benefit**

A key quantity in the computation of *NNTest(T)* is net benefit: the total expected benefit minus the total expected cost (or harm), measured in the same units as the benefit. Net benefit plays an important role in decision-making because the fundamental rule in making a rational choice is to select the alternative with the highest net benefit (Stokey and Zeckhauser, 1978). This is the decision process that clinicians face in practice, but they usually do not have the luxury of an empirically derived quantitative model as an aid.

Three types of net benefit related to marker evaluation are (*i*) the net benefit of treatment, (*ii*) the net benefit of no treatment, and (*iii*) the net benefit of a treatment decision based on risk prediction. For the situation here, in which no treatment is recommended in the absence of risk prediction, the net benefit of risk prediction is defined as (*iii*) - (*ii*).

The net benefit formulas are derived from utilities associated with decisions to treat or not treat among persons who do or do not develop disease (Metz, 1978; Baker *et al.,* 2012c). These formulas reduce to two key quantities: *B*, the anticipated gain from treatment instead of no treatment among persons who would develop disease in the absence of treatment, and *C*, the anticipated gain from no treatment instead of treatment among persons who would not develop disease in the absence of treatment. In the following formula for the net benefit of risk prediction, *B* can also be interpreted as the anticipated benefit of treatment in a true positive and *C* can also be interpreted as the anticipated harm, or “cost,” of treatment in a false positive:

*NB_{j}* = *B TPR_{j} P* - *FPR_{j}* (1-*P*) *C*   (1)

In equation (1) *P *is the estimated probability of disease, which means that *B* is multiplied by the estimated probability of a true positive and *C* is multiplied by the estimated probability of a false positive. A similar formula was proposed by C.S. Peirce in the 19^{th} century in the context of evaluating weather prediction (Peirce, 1884).
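Equation (1) can be sketched in code as follows; the values of *B*, *C*, *P*, *TPR_{j}*, and *FPR_{j}* below are illustrative placeholders, not data from the studies above.

```python
def net_benefit(B, C, P, TPR_j, FPR_j):
    """Net benefit of risk prediction at interval j (equation 1):
    NB_j = B * TPR_j * P - FPR_j * (1 - P) * C."""
    return B * TPR_j * P - FPR_j * (1 - P) * C

# Illustrative inputs: benefit B in arbitrary utility units, cost C = 0.05 B,
# disease probability P = 0.10, and one (TPR_j, FPR_j) operating point.
nb = net_benefit(B=1.0, C=0.05, P=0.10, TPR_j=0.6, FPR_j=0.2)
```

Here *B* is weighted by the probability of a true positive (*TPR_{j} P*) and *C* by the probability of a false positive (*FPR_{j}* (1-*P*)), as the text describes.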

**Relative Utility**

Relative utility is the maximum net benefit of risk prediction (over risk intervals) divided by the net benefit of perfect prediction. Computation of relative utility involves the risk threshold *T*, the risk of developing disease at which there is indifference between treatment and no treatment (Pauker and Kassirer, 1975). Mathematically *T* solves the equation *T B* = (1-*T*) *C*, which implies *C*/*B* = *T*/(1-*T*) and *T* = (*C*/*B*) / {(*C*/*B*) + 1}. According to a well-known result in decision analysis, the maximum net benefit of risk prediction occurs in the risk interval *j* at which *R_{j}* = *T*, or equivalently where the slope of the ROC curve equals {(1-*P*)/*P*} *T*/(1-*T*) (Metz, 1978; Gail and Pfeiffer, 2005). See also Baker *et al.* (2012c). The net benefit of perfect prediction is obtained by substituting *TPR_{j}* = 1 and *FPR_{j}* = 0 into equation (1) to yield *B P*. Thus the relative utility at risk threshold *T* = *R_{j}* is *RU_{j}* = *NB_{j}* / (*B P*), which can be written as

*RU_{j}* = *TPR_{j}* - *FPR_{j}* {(1-*P*)/*P*} {*R_{j}*/(1-*R_{j}*)}   (2)

Let *RU*(*T*) denote the relative utility at risk threshold *T* computed from linear interpolation (Appendix B). A relative utility curve is a plot of *RU*(*T*) versus *T*. An important but under-appreciated consideration is the range of *T*. If no treatment is provided in the absence of risk prediction (the scenario discussed here), then *T ≥ P*. This result follows from setting the net benefit of treatment versus no treatment, namely *B P* - (1-*P*) *C*, to less than or equal to zero and solving for *T* (Baker *et al.,* 2012c). For *T ≥ P* the relative utility decreases as *T* increases (**Figure 1**). More general relative utility curves also plot relative utility for *T* < *P*. Computations of relative utility are illustrated in Table 3, which is similar to a table in Baker (2009).
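Equation (2) translates directly into code. The sketch below evaluates the relative utility at one risk interval, under the assumption that the risk threshold equals the observed risk *R_{j}*; the inputs are illustrative values, not data from the examples.

```python
def relative_utility(TPR_j, FPR_j, P, R_j):
    """Relative utility at risk threshold T = R_j (equation 2):
    RU_j = TPR_j - FPR_j * ((1 - P) / P) * (R_j / (1 - R_j))."""
    return TPR_j - FPR_j * ((1 - P) / P) * (R_j / (1 - R_j))

# Illustrative operating point: disease probability P = 0.10 and an
# observed risk R_j = 0.15 (so T = 0.15 >= P, as required here).
ru = relative_utility(TPR_j=0.6, FPR_j=0.2, P=0.10, R_j=0.15)
```

Note that the ratio *R_{j}*/(1-*R_{j}*) is just *C*/*B* evaluated at the threshold *T* = *R_{j}*, so this is equation (1) divided by the net benefit of perfect prediction, *B P*.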

Decision curves (Vickers and Elkin, 2006), which were introduced before relative utility curves, are relative utility curves multiplied by *P *for *T ≥* *P*. Decision curves can be interpreted as a maximum net benefit of risk prediction (in units of true positives) when *B *is set to a reference value of 1. Although similar methods can be used to estimate decision curves and relative utility curves, estimation based on risk stratification tables has only been used with relative utility curves. See Baker *et al.* (2012c) for a comparison of various methods of estimation.

**Number Needed to Test (NNTest)**

Let *RU_{1}*(*T*) and *RU_{2}*(*T*) denote the relative utilities for Models 1 and 2, respectively, at risk threshold *T*, and let *∆RU*(*T*) = *RU_{2}*(*T*) - *RU_{1}*(*T*). As derived in Appendix C,

*NNTest*(*T*) = 1/{*P ∆RU*(*T*)} (3)

is the minimum number of new marker tests per true positive required for a positive net benefit of risk prediction that includes the harms and costs of marker testing. Because *NNTest*(*T*) is inversely proportional to the estimated probability of developing disease, all else being equal in the setting of disease prevention, *NNTest*(*T*) is larger for a rare disease than for a commonly occurring one. If *∆RU*(*T*) is negative in equation (3), *NNTest*(*T*) is designated as “harm.”

A useful summary measure is the minimum of *NNTest*(*T*) over a set of risk thresholds (here five equally spaced thresholds from *P* to the largest observed risk), denoted *minNNTest*(*T**), where *T** is the risk threshold corresponding to this minimum. The value of *minNNTest*(*T**) is weighed against the cost of testing for the new marker. Confidence intervals for *minNNTest*(*T**) are computed by multinomial bootstrapping of the risk stratification tables (Appendix D). These confidence intervals are applicable with the counts derived from Kaplan-Meier estimates (Baker, 2012b, Appendix A).
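Equation (3) and the “harm” designation can be sketched as follows; the relative utilities and disease probability are hypothetical values, and “harm” is returned here as infinity purely for convenience.

```python
def nntest(P, RU1_T, RU2_T):
    """Number needed to test at threshold T (equation 3):
    NNTest(T) = 1 / (P * dRU(T)), with dRU(T) = RU2(T) - RU1(T).
    A non-positive dRU(T) is designated "harm" (returned as infinity)."""
    dRU = RU2_T - RU1_T
    if dRU <= 0:
        return float("inf")
    return 1.0 / (P * dRU)

# Illustrative inputs: P = 0.10, with the new marker raising relative
# utility from 0.25 to 0.30 at threshold T.
n = nntest(P=0.10, RU1_T=0.25, RU2_T=0.30)
```

The inverse proportionality to *P* is visible directly: halving *P* doubles the number of marker tests needed per true positive, as the text notes for rare diseases.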

**Specific Examples**

**Figures 1**, **2**, and **3** depict ROC and relative utility curves for the aforementioned six studies, arbitrarily grouped into pairs. **Table 4** compares *minNNTest*(*T**) with various measures of predictive accuracy (reporting the largest odds or hazard ratios when multiple values are available). In these studies the odds or hazard ratios are large by traditional epidemiological standards, yet the ROC curves for Models 1 and 2 are similar and *∆AUC* is small, a well-known result (Pepe *et al.,* 2004). Also *NRI* is larger than *∆AUC*, as is typically the case (Cook, 2007). A fundamental problem is that without any information on the anticipated harms of incorrect clinical decisions and the anticipated benefits of correct ones, it is difficult to draw conclusions based on *OR*, *HR*, *∆AUC*, or *NRI*.

Now consider *minNNTest*(*T**) for the evaluation of new markers. For the SBP example, *minNNTest*(*T**) is 860 with a 95% confidence interval of (570, 1820) at *T** = 0.84 (Table 4). In other words, 860 tests of SBP are needed per correctly identified person who would later develop cardiovascular disease in order for risk prediction to be worthwhile at the threshold *T** = 0.84. Because ascertaining SBP is non-invasive and inexpensive, this tradeoff may be worthwhile. For the C-reactive protein and parental history example, *minNNTest*(*T**) is 210 with a 95% confidence interval of (150, 380) at *T** = 0.137. This smaller value of *minNNTest*(*T**) compared with that for the SBP example is primarily due to the much larger probability of disease. Again the tradeoff is likely worthwhile. For the HDL example, *minNNTest*(*T**) is 1630 with a 95% confidence interval of (810, harm) at *T** = 0.028. The upper bound of “harm” indicates the HDL marker is consistent with no improvement in risk prediction. For the genetic marker, *minNNTest*(*T**) is 2990 with a 95% confidence interval of (1390, harm) at *T** = 0.084. If collecting the genetic data is expensive, even the point estimate (ignoring the upper bound) might represent an unfavorable tradeoff.

**Discussion**

This analysis focuses on a single summary measure, namely *minNNTest*(*T**), rather than the entire curve *NNTest*(*T*). Although some information is lost, we believe this single summary is reasonable because *T** is likely applicable to a large subset of the population. Very large values of *T* would apply to only a few persons; otherwise risk prediction would not be a consideration because the costs of false positives would be too high. Also, as mentioned previously, *T* must be greater than or equal to *P*. Thus *T** falls in a limited range applicable to many persons.

For rare diseases, the cost of a study can be reduced by collecting data on random samples of cases for the risk stratification tables. With such case-control data, *minNNTest*(*T**) is still valid because ROC curves are invariant to this sampling. However, a simple adjustment is needed to correct *T** for different probabilities of developing disease (Baker, 2009; Baker *et al.,* 2012c).

Although *NNTest*(*T*) evaluates a new marker in a population, ultimately risk prediction models are applied to individuals. With data from risk stratification tables, a person’s estimated risk is the observed risk in the interval that includes the person’s risk score. For individual decision-making, clinicians may nevertheless modify estimates of risk based on factors not included in the risk prediction model. Also clinicians may consider a particular risk threshold for a patient (instead of *T**) based on particularities of the specific treatment and the patient’s preference. Thus, none of these tools replace the importance of judgments at the individual level in the clinical setting. However, clinical decisions without such tools are more prone to error and inadequately informed judgments about tradeoffs. The focus of the analysis presented here is the population evaluation of a new marker added to a risk prediction model, a public health issue.

It is important to keep in mind that these evaluations of a new marker are observational studies because risks are estimated using only data from persons not receiving treatment. With randomized trials, risk prediction can be used to identify a promising subgroup for receiving treatment and to evaluate treatment in that subgroup, providing an even higher level of evidence to support clinical decisions (Baker *et al.,* 2012a).

**Supplementary Document**

**Acknowledgment**

This work was supported by the National Institutes of Health.

**Disclosure**

Opinions expressed in this manuscript are those of the authors and should not be interpreted as official positions or statements of the U.S. Federal government or of the Department of Health and Human Services.

**Corresponding Author**

Stuart G. Baker, Sc.D., Division of Cancer Prevention, National Cancer Institute, Bethesda, Maryland 20892, USA.

**References **

Baker SG. Putting risk prediction in perspective: relative utility curves. *J Natl Cancer Inst* 101:1538-1542, 2009.

Baker SG, Cook NR, Vickers A, Kramer BS. Using relative utility curves to evaluate risk prediction. *J R Stat Soc [Ser A] *172:729-748, 2009.

Baker SG, Kramer BS, Sargent DJ, Bonetti M. Biomarkers, subgroup evaluation, and trial design. *Discov Med *13(70):187-192, 2012a.

Baker SG, Sargent DJ, Buyse M, Burzykowski T. Predicting treatment effect from surrogate endpoints and historical trials: an extrapolation involving probabilities of a binary outcome or survival to a specific time. *Biometrics* 68:248-257, 2012b.

Baker SG, Van Calster B, Steyerberg EW. Evaluating a new marker for risk prediction using the test tradeoff: An update. *Int J Biostat* 8(1):5, 2012c.

Cook NR, Ridker PM. Advances in measuring the effect of individual predictors of cardiovascular risk: the role of reclassification measures. *Ann Intern Med* 150(11):795-802, 2009.

Cook NR, Buring JE, Ridker PM. The effect of including C-reactive protein in cardiovascular risk prediction models for women. *Ann Intern Med* 145:21-29, 2006.

Cook NR. Use and misuse of the Receiver Operating Characteristic curve in risk prediction. *Circulation* 115(7):928-935, 2007.

Gail MH, Pfeiffer RM. On criteria for evaluating models for absolute risk. *Biostatistics* 6:227-239, 2005.

Janes H, Pepe MS, Gu W. Assessing the value of risk predictions by using risk stratification tables. *Ann Intern Med *149(10):751-760, 2008.

Metz CE. Basic principles of ROC analysis. *Semin Nucl Med* 8(4):283-298, 1978.

Mihaescu M, van Zitteren M, van Hoek M, Sijbrands EJG, Uitterlinden AG, Witteman JCM, Hofman A, Hunink MGM, van Duijn CM, Janssens ACJW. Improvement of risk prediction by genomic profiling: reclassification measures versus the area under the receiver operating characteristic curve. *Am J Epidemiol* 172(3):353-361, 2010.

Pauker SG, Kassirer JP. Therapeutic decision making: a cost-benefit analysis. *N Engl J Med* 293:229-234, 1975.

Paynter NP, Chasman DI, Buring JE, Shiffman D, Cook NR, Ridker PM. Cardiovascular disease risk prediction with and without knowledge of genetic variation at chromosome 9p21.3. *Ann Intern Med* 150(2):65-72, 2009.

Peirce CS. The numerical measure of the success of predictions. *Science* 4:453-454, 1884.

Pencina MJ, D’Agostino RB Sr, Steyerberg EW. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. *Stat Med* 30(1):11-21, 2010.

Pencina MJ, D’Agostino RB Sr, D’Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. *Stat Med* 27:157-172, 2008.

Pepe MS, Janes H, Longton G, Leisenring W, Newcomb P. Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. *Am J Epidemiol* 159(9):882-890, 2004.

Pepe MS. Problems with risk reclassification methods for evaluating prediction models. *Am J Epidemiol* 173:1327-1335, 2011.

Ridker PM, Paynter NP, Rifai N, Gaziano JM, Cook NR. C-reactive protein and parental history improve global cardiovascular risk prediction: the Reynolds Risk Score for men. *Circulation* 118(22):2243-2251, 2008.

Stokey E, Zeckhauser R. *A Primer for Policy Analysis*. W.W. Norton & Company, New York, New York, USA, 1978.

Tice JA, Cummings SR, Smith-Bindman R, Ichikawa L, Barlow WE, Kerlikowske K. Using clinical factors and mammographic breast density to estimate breast cancer risk: development and validation of a new predictive model. *Ann Intern Med* 148(5):337-347, 2008.

Tzoulaki I, Liberopoulos G, Ioannidis JP. Assessment of claims of improved prediction beyond the Framingham risk score. *J Am Med Assoc* 302:2345-2352, 2009.

Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. *Med Decis Making* 26:565-574, 2006.

Vickers AJ, Cronin AM, Elkin EB, Gonen M. Extensions to decision curve analysis, a novel method for evaluating diagnostic tests, prediction models and molecular markers. *BMC Med Inform Decis Mak* 8:53, 2008.

**[Discovery Medicine; ISSN: 1539-6509; *Discov Med* 14(76):181-188, September 2012. Copyright © Discovery Medicine. All rights reserved.]**