2. Evaluation of Markers and Risk Prediction
Models: Overview of Relationships between
NRI and Decision-Analytic Measures
Ben Van Calster, PhD, Andrew J. Vickers, PhD, Michael J. Pencina, PhD,
Stuart G. Baker, ScD, Dirk Timmerman, PhD, Ewout W. Steyerberg, PhD
Background. For the evaluation and comparison of
markers and risk prediction models, various novel measures have recently been introduced as alternatives to the
commonly used difference in the area under the receiver
operating characteristic (ROC) curve (DAUC). The net reclassification improvement (NRI) is increasingly popular
to compare predictions with 1 or more risk thresholds,
but decision-analytic approaches have also been proposed. Objective. We aimed to identify the mathematical
relationships between novel performance measures for the
situation that a single risk threshold T is used to classify
patients as having the outcome or not. Methods. We con-
appear to be transformations of each other and hence
always lead to consistent conclusions. On the other
hand, conclusions may differ when using the standard
NRI, depending on the adopted risk threshold T, prevalence P, and the obtained differences in sensitivity and
specificity of the 2 models that are compared. In the
case study, adding the CA-125 tumor marker to a baseline
set of covariates yielded a negative NRI yet a positive
value
Medical
for the utility-based 013;
33:490–501.
decision
making
2 measures. Conclusions. The
decision-analytic measures are each appropriate to indicate the clinical usefulness of an added marker or compare prediction models since these measures each reflect
3. Terminology
and
Nota0on
• 二値変数のアウトカムの予測モデルに焦点
(survival-‐type
dataにも応用可能)
• Risk
threshold:
Tにおける予測モデルの陽性
(イベントあり),陰性(イベントなし)から得ら
れる予測ルールを考える
• データセットのサイズをN(イベントありをN+,
なしN-‐)
• イベントのprevalence:
P=N+/N-‐
Medical
decision
making
2013;
33:490–501.
5. Classifica0on
as
a
Decision
Problem
• 患者の分類の際,eventsとnoneventsの分類
を同時に最適化するthresholdを設定すること
は可能だが,これだけではmisclassifica0onの
コストが考慮に入っていない
• 「疾患ありの人を正しく分類するbenefit」と
「疾患なしの人を誤って疾患があると分類す
るharm」は通常異なる
• 疾患ありの人の治療はovertreatmentの害を
上回ることが多い
Medical
decision
making
2013;
33:490–501.
7. harm-‐to-‐benefit
ra0o
• Risk
thresholdは治療するか,治療しないかに
よってアウトカムのリスクがちょうどつりあうと
ころ
• →期待されるu0lity/costがつりあうところと定
義できる
Medical
decision
making
2013;
33:490–501.
8. e. Likewise, cTP – cFN is the benefit of providatment to diseased patients or, equivalently,
nefit of a true 検査がposi0veの時の
positive. The risk threshold
informs us about the harm-to-benefit ratio.
harm-‐to-‐benefit
ra0o
pecifically,
• 途中の計算がうまく理解できませんが,
cTN À cFP
T
:
5
1ÀT
cTP À cFN
ð4Þ
• という関係が成り立つようです
• CTN-‐CFP:
疾患なしの人を治療しないbenefit
→false
posi0veのharm
491
• CTP-‐CFN:
疾患ありの人に治療するbenefit
→true
posi0veのbenefit
n January 29, 2014
Medical
decision
making
2013;
33:490–501.
9. Tのoddsがharm-‐to-‐benefit
ra0oと
等しい
• thresholdが5%の時,
0.05/(1-‐0.05)=1/19
→「TPはFPよりも19倍重要だ」という意味
• Thresholdが50%なら,どちらの誤分類の価
値も等しいと考えられる
• 好ましいと考えるTの値は個人(医師や患者)
によって異なる
Medical
decision
making
2013;
33:490–501.
11. DECISION-ANALYTIC EVALUATION OF MARKERS AND
IOTAモデルにCA-‐125を加えたら
ROC曲線
実線がreference
model,点線がextended
model
Medical
decision
making
2013;
33:490–501.
13. Performance
Measures
• モデルの予測能評価には総ての可能性のある
thresholdでの性能の要約が一般的で,従来はAUCや
ΔAUCが用いられてきた
• モデルの分類を評価するにはYouden
indexが一般的
である
• 最近は2つのモデルの比較にNRIが用いられる
• あるthreshold
TにおけるNRIはΔYoudenに等しい
• Youden
indexはeventsとnoneventsの分類の重要性を
同等に扱う→”science
of
the
method”
(by
Peirce)
• Decision-‐analy0c
measures
(net
benefit,
rela0ve
u0lity,
wNRI)はeventsとnoneventsの重要性の違いを考慮す
るので,”u0lity
of
the
method”
(by
Peirce)
Medical
decision
making
2013;
33:490–501.
14. 各方法の比較
Figure 1 Receiver operating characteristic (ROC) (a), Youden (b), decision (c), and RUT (d) curves for the reference model (full line) and the
extended model (dashed line). Vertical lines in panels b, c, and d indicate the risk threshold T of 5% and the prevalence P of 28%.
Table 1
Evaluation Measures for the Comparison of Competing Models
Approach
Method
Evaluation of predictions (summary
DAUC
measures over all possible thresholds)
Evaluation of classifications (using a risk threshold to define 2 groups)
‘‘Science of the method’’
NRIT
‘‘Utility of the method’’
DNBT
DRUT
wNRIT
Characteristics
Equal to difference in Mann-Whitney statistics;
rank-order statistic
Sum of differences in sensitivity and
specificity; identical to DYouden
(TP – wFP)/N
Expresses NBT relative to the net benefit of the
baseline strategies and the maximum net
benefit
Weights the NRIT with misclassification costs
ORIGINAL ARTICLES
493
Downloaded from mdm.sagepub.com at Kyoto University on January 29, 2014
Medical
decision
making
2013;
33:490–501.
15. 938
and
ent,
rom
ultithe
ROC
use
e.g.,
d to
reclassified in the correct direction minus the proportion of cases reclassified in the incorrect direction is
computed to obtain the event NRI and nonevent
NRI
NRI. The NRI can then be written as
NRI 5 event NRI 1 nonevent NRI
5 PðupjeventÞ À PðdownjeventÞ
VAN CALSTER A
ð5Þ
1 Pð thresholds. We use upjnonevent NRIT as
on 2 or more downjnoneventÞ À Pðthe notationÞ:
we focus on the NRI for 2 risk groups using T.15 We
The NRI can be used for は
• can reformulate NRIT as follows: groups based on 1
Threshold
TにおけるNRIT 2 risk
risk threshold but also for 3 or more risk groups based
TP2 À TP1
FP1 À FP2
DTP DFP
NRIT 5
1
5
1
:
Nþ
NÀ
Nþ
NÀ
• NRIT=ΔSensi0vity+ΔSpecificity=ΔYouden
ð6Þ
NRIT is the sum of improvement in sensitivity
(event NRIT) and specificity (nonevent NRIT). For
epub.com at Kyoto University on January 29, 2014
Medical
d event NRIT and nonappropriate interpretability, ecision
making
2013;
33:490–501.
event NRIT should be reported as well.14,15
17. Net Benefit
Relative
RUT i
Net
B consistent
‘‘treat n
The NB, written as NBT to beenefit
in notanet bene
ion, corrects TP for FP based on misclassification
13,31
‘‘treat al
and is written as
osts
• TPを誤分類のコストに応じたFPで補正
given in
none’’ if
TP
FP
Àw
:
ð7Þ
NBT 5
given in
N
N
• TPのbenefitはFPにwをかけたharmと等しい
are based
The benefit of a true positive is considered equivadetailed
ent to the harm of w false positives. The weight w for
• wはTのオッズで表される
found el
alse positives is defined as odds(T), or the harm-toRUT e
• TPをwFPで補正することによりNBはTPと分類
13
benefit ratio that corresponds to threshold T. By
される純増分(net
propor0on
of
true-‐posi0ve) tion of th
penalizing TP for wFP, NB is the net proportion of
strategy
を意味する
rue-positive classifications. NBT can be plotted as
a functio
function of T, yielding a decision curve.mFigure 31c
Medical
decision
aking
2013;
3:490–501.
Figure
in
hows the decision curves for the ovarian tumor
ence in R
18. Tを変えてNBTをプロットすると
decision
curveが得られる
extended
model
reference
model
Figure 1 Receiver operating characteristicdecision
mYouden013;
33:490–501.
RUT (d) curv
Medical
(ROC) (a), aking
2 (b), decision (c), and
extended model (dashed line). Vertical lines in panels b, c, and d indicate the risk threshold T
19. penalizing TP for wFP, NB is the net proportion of
true-positive classifications. NBT can be plotted as
a function of T, yielding a decision curve. Figure 1c
shows the decision curves for the ovarian tumor
models.
A “treat
ncan be compared with 2 baseline stratemodel one”:
常に疾患なしとみなし誰も治療
•
gies: ‘‘treat none’’ (i.e., always predict absence of disしない
ease) and ‘‘treat all’’ (i.e., always predict disease).
→NBtreat
always zero as there are no true
NBtreat none is noneはTPもFPもないので定義上0
or false
positives. For treat all,
• “treat
all”:
常に疾患ありとみなし全員治療す
る
Nþ
NÀ
PÀT
Àw
5
NBtreat all 5
:
ð8Þ
N
N
1ÀT
NBT can be interpreted as the equivalent of the
Tがprevalence
(P)と等しくなるとき0になる
increase in the proportion of true positives relative
Medical
decision
making
2013;
33:490–501.
st
af
in
en
in
W
tio
w
ba
re
m
de
22. DFP
:
NÀ
ð6Þ
sensitivity
NRIT). For
and non-
5
•
of the decrease in the proportion of false positives relative to ‘‘treat all,’’ without a decrease in true positives. Decision curve analysis is an elegant way of
evaluating the clinical consequences of classifications derived from a prediction model without performing a formal and complex decision analysis.13,31
The difference in NBT (DNBT) of model 1 and
model 2 is given as follows:
モデル1とモデル2のNBTの差(ΔNBT)
ositives by
y N–, NRIT
TP2
FP2
TP1
FP1
Àw
À
Àw
DNB 5
h that senN
N
N
N
ð9Þ
ally impor1
5 ðDTP À wDFPÞ:
nd hence it
N
re a consech reflects • ΔNBTはモデル1の代わりにモデル2を使った場合
DNBT can be interpreted as being equivalent to the
not explic- にFPを変化させずにTPがどれくらい増えるかの
increase in the proportion of true positives without
ghted vera change in false positives when
model 2
正味の割合
1. Alternatively, using /w is the
15
osed,
as
instead of model
DNBT
equivalent of the decrease in the proportion of false
• ΔNBT/wはTPを減らすことなくFPがどれくらい減る
ws the You- かの正味の割合
positives without a decrease in true positives.
varying T.
Relative Utility
Medical
decision
making
2013;
33:490–501.
RUT is the net benefit in excess of ‘‘treat all’’ or
23. Table 2 Overview of Relationships between Measures That Compare Classifications of 2 Competing
Models at Risk Threshold T
Condition
Regarding T and P
NRIT 5
DNBT 5
Relationships
between Measures
If T P
DRUT 5
If T ! P
1
DRUT 5 P DNBT
P
wNRIT 5 T DRUT
If T = P
1
NRIT 5 P DNBT 5 DRUT 5 wNRIT
Irrespective
of T and P
NRIT = DYouden
NRIT = 2*DAUCT (i.e., twice the
difference in the areas under
the prediction rules’
single-point receiver operating
characteristic curves)
1
wNRIT 5 T DNBT
DRUT;T ! P 5
1
wð1ÀPÞ DNBT
DRUT;TP 5
wNRIT 5
Alternatively, these
a weighted sum of the
specificity (DSe and DS
Rela0ve
U0lity
(to a lower risk group by model 2. When
reclassified RU),
weighted
NRI
(wNRI)もあるが,
using Bayes’ rule on the NRI and including s1 and
上の式で変換可能なので結局同じ結論になる
s2, wNRI equals
NRIT 5 D
DNBT 5 P
DRUT;T ! P 5 D
DRUT;TP 5
w
P
wNRIT 5
T
Medical
decision
making
2013;
33:490–501.
wNRI 5 s1 ðPðeventjupÞPðupÞ À PðeventjdownÞPðdownÞÞ 1
The decision-analyt
s2 ðPðnoneventjdownÞPðdownÞ À PðnoneventjupÞPðupÞÞ:
ease prevalence and t
24. asng
RIT
he
r
NRIT 5
DNBT 5
DRUT;T ! P 5
DRUT;TP 5
wNRIT 5
DTP DFP
À
;
Nþ
NÀ
DTP
DFP
Àw
;
N
N
DTP
DFP
Àw
;
Nþ
Nþ
DTN 1 DFN
À
;
NÀ
w NÀ
1 DTP
DFP
Àw
:
T
N
N
ð11Þ
Alternatively, these measures can be expressed as
こんな風に変換可能
a weighted sum of the differences in sensitivity and
Medical
decision
making
2013;
33:490–501.
specificity (DSe and DSp):
25. specificity (DSe and DSp):
NRIT 5 DSe 1 DSp;
DNBT 5 PDSe 1 w ð1 À PÞDSp;
w ð1 À PÞ
DSp;
DRUT;T ! P 5 DSe 1
P
P
DRUT;TP 5
DSe 1 DSp;
w ð1 À PÞ
P
1ÀP
wNRIT 5 DSe 1
DSp:
T
1ÀT
ð12
感度と特異度で表現すると
The decision-analytic measures account for di
ease prevalenceこんな風にも変換可能
and the harm-to-benefit ratio an
Medical
decision
making
2013;
33:490–501.
are thus consistent approaches that are simple tran
26. Maximum
Test
Harm
of
the
New
Marker
• 新しいマーカーを加えた場合のtest
harmが
u0lityを上回るなら,新しいマーカーを入れない
モデルの方が望ましい
• Test
harmは test
tradeoff=1/ΔNBT
で定量化できる
• test
tradeoffはTPを余分に得るために何人新し
い検査が必要になるかの値
• Test
tradeoffがとても高いと,新しいマーカーで
得られるu0lityが検査に関連するharmに見あわ
ないこともある
Medical
decision
making
2013;
33:490–501.
27. 卵巣腫瘍のケーススタディの結果
DECISION-ANALYTIC EVALUATION OF MARKERS AND MODELS
Table 3
Evaluation Measures for the Reference and Extended Models for the Diagnosis of Ovarian Tumors
Approach
Evaluation of predictions
(summary measure)
Evaluation of classifications
Science of the method
Utility of the method
Method
DAUC
NRI0.05
Event NRI0.05
Nonevent NRI0.05
DNB0.05
DRU0.05
wNRI0.05
Value (95% CI)a
0.008 (0.004 to 0.013)
20.010 (–0.033 to 0.011)
0.011 (0.000 to 0.025)
20.021 (20.040 to 20.004)
0.0023 (–0.0011 to 0.0064)
0.061 (–0.029 to 0.165)
0.046 (–0.022 to 0.127)
Reference v. Extended Model
AUC: 0.934 v. 0.942
Youden index: 0.654 v. 0.644
Sensitivity: 0.943 v. 0.954
Specificity: 0.711 v. 0.691
NB0.05: 0.253 v. 0.255
RU0.05: 0.289 v. 0.350
The risk threshold T is set at 0.05. CI, confidence interval.
a
Approximate confidence intervals were obtained using the bias-corrected bootstrap method based on 1000 replicates of the data set.
percentage improvement over ‘‘treat all.’’ Or, as
T P, this indicates the increase in net specificity
obtained by model 2: At the same level of sensitivity,
model 2’s specificity is 6.1 percentage points higher.
The test tradeoff derived from the decision-analytic
measures was 435. This means that per true positive
at least 435 patients need 2 have CA-125 tested to
Medical
decision
making
to013;
33:490–501.
make risk prediction with model 2 worthwhile.
Figure 2 demonstrates that the decision-analytic
measures always give concordant recommendations
T=5%では,NRIは悪化しているものの,
ΔNB,ΔRU,wNRIは改善している
28. Science of the method
Utility of the method
NRI0.05
Event NRI0.05
Nonevent NRI0.05
DNB0.05
DRU0.05
wNRI0.05
20.010 (–0.033 to 0.011)
0.011 (0.000 to 0.025)
20.021 (20.040 to 20.004)
0.0023 (–0.0011 to 0.0064)
0.061 (–0.029 to 0.165)
0.046 (–0.022 to 0.127)
Youden index: 0.654 v. 0.644
Sensitivity: 0.943 v. 0.954
Specificity: 0.711 v. 0.691
NB0.05: 0.253 v. 0.255
RU0.05: 0.289 v. 0.350
Risk
thresholdに応じた比較
The risk threshold T is set at 0.05. CI, confidence interval.
a
Approximate confidence intervals were obtained using the bias-corrected bootstrap method based on 1000 replicates of the data set.
percentage improvement over ‘‘treat all.’’ Or, as
T P, this indicates the increase in net specificity
obtained by model 2: At the same level of sensitivity
model 2’s specificity is 6.1 percentage points higher
The test tradeoff derived from the decision-analytic
measures was 435. This means that per true positive
at least 435 patients need to have CA-125 tested to
make risk prediction with model 2 worthwhile.
Figure 2 demonstrates that the decision-analytic
measures always give concordant recommendations
but that NRIT may give a different recommendation
These curves show that the CA-125 marker mainly
T
has value when T ! 0.60. This observation explains
why summary measures over all possible thresholds
point at an advantage of adding the CA-125 marker
However, high-risk thresholds are irrational in this
example as it would lead to an unacceptable number
of women with cancer who would be denied appropriate treatment.
Figure 2 Curves showing NRIT, DNBT, DRUT, and wNRIT (full,
The calibration plots for both models are shown
dashed, dotted, and dash-dotted lines, respectively) when varying
in Figure 3. After calibration using local regression
the risk threshold on the x-axis. Vertical lines indicate the risk
(loess) as in the calibration plots, sensitivity
threshold T of 5% and the prevalence P of 28%.
increased aking
2 percentage points and
Medical
decision
mwith 0.2013;
33:490–501.
specificity
with 2.4 percentage points. Then, NRI0.05 equaled
20.0011 to 0.0064). This difference can be inter0.026, DNB0.05 equaled 0.0014, and DRU0.05 equaled
preted as equivalent to an additional 2.3 detected
0.038. NRI0.05 became more supportive of model 2
Decision-‐analy0c
measuresの結果
はそれぞれ矛盾
していないが,
NRI は結果が異
なる範囲がある
29. teyerberg and Vickers
前立腺癌の予測モデル
Page 7
40%を超えたくら
いでようやく”treat
all”(グレーの線)
と差が出てくる
→前立腺癌の可
能性が40%以上
のところで生検す
るかどうかを迷う
人がいるか?
Figure 3.
Decision curve analysis for prostate biopsy. Black line: model including age, PSA and
urokinase. Dashed line: Model including age and PSA only.
Medical
decision
making
2008;
28:146–149.
30. teyerberg and Vickers
精巣腫瘍における化学療法後の
残存腫瘍の予測モデル
Page 8
AUC
0.79と良好だが,
臨床的な決断で迷う
だろうthreshold
30%
以下の部分
で”resec0on
in
all”と
Net
benefitの差がな
い
Figure 4.
Decision curve for testis cancer model
Medical
decision
making
2008;
28:146–149.
31. Decision
Curve
Analysis
(DCA)に関する文献
•
•
•
•
•
•
•
Peirce
CS.
The
numerical
measure
of
the
success
of
predic0ons.
Science
1884;
4:453–454.
Net
benefitの考え方の基本になった論文
Pauker
SG,
Kassirer
JP.
The
threshold
approach
to
clinical
decision
making.
New
England
Journal
of
Medicine
1980;
302:1109–1117.
決断分析のthresholdについて
Vickers
AJ,
Elkin
EB.
Decision
curve
analysis:
a
novel
method
for
evalua0ng
predic0on
models.
Medical
decision
making
2006;
26:565–574.
DCAを初めて提唱した論文
Vickers
AJ,
Cronin
AM,
Elkin
EB,
Gonen
M.
Extensions
to
decision
curve
analysis,
a
novel
method
for
evalua0ng
diagnos0c
tests,
predic0on
models
and
molecular
markers.
BMC
Med
Inform
Decis
Mak
2008;
8:53.
DCAのoverfit補正(crossvalida0on),信頼区間,打ち切りデータへの対処(生存
データへの応用),decision
curveのスムージングについて
Steyerberg
EW,
Vickers
AJ.
Decision
curve
analysis:
a
discussion.
Medical
decision
making
2008;
28:146–149.
Steyerberg(予測モデルの大家)とVickersの対話
Steyerberg
EW,
Vickers
AJ,
Cook
NR,
et
al.
Assessing
the
performance
of
predic0on
models:
a
framework
for
tradi0onal
and
novel
measures.
Epidemiology
2010;
21:128–138.
予測モデルの評価の仕方のまとめ
Moons
KGM,
de
Groot
JAH,
Linnet
K,
Reitsma
JB,
Bossuyt
PMM.
Quan0fying
the
added
value
of
a
diagnos0c
test
or
marker.
Clin.
Chem.
2012;
58:1408–1417.
診断検査の評価法のレビュー
32. DCAを用いている研究例
•
•
•
•
•
•
•
Sakura
M,
Kawakami
S,
Ishioka
J,
et
al.
A
novel
repeat
biopsy
nomogram
based
on
three-‐dimensional
extended
biopsy.
Urology
2011;
77:915–920.
Raji
OY,
Duffy
SW,
Agbaje
OF,
et
al.
Predic0ve
accuracy
of
the
Liverpool
Lung
Project
risk
model
for
stra0fying
pa0ents
for
computed
tomography
screening
for
lung
cancer:
a
case-‐control
and
cohort
valida0on
study.
Ann
Intern
Med
2012;
157:242–250.
Ishioka
J,
Saito
K,
Sakura
M,
et
al.
Development
of
a
nomogram
incorpora0ng
serum
C-‐reac0ve
protein
level
to
predict
overall
survival
of
pa0ents
with
advanced
urothelial
carcinoma
and
its
evalua0on
by
decision
curve
analysis.
Br
J
Cancer
2012;
107:1031–1036.
Xiao
W-‐J,
Ye
D-‐W,
Yao
X-‐D,
Zhang
S-‐L,
Dai
B.
Comparison
of
three
versions
of
par0n
tables
to
predict
final
pathologic
stage
in
a
chinese
cohort:
a
decision
curve
analysis.
Urol.
Int.
2013;
91:69–74.
Boyce
S,
Fan
Y,
Watson
RW,
Murphy
TB.
Evalua0on
of
predic0on
models
for
the
staging
of
prostate
cancer.
BMC
Med
Inform
Decis
Mak
2013;
13:126.
Premaor
M,
Parker
RA,
Cummings
S,
et
al.
Predic0ve
value
of
FRAX
for
fracture
in
obese
older
women.
J
Bone
Miner
Res
2013;
28:188–195.
Parry-‐Jones
AR,
Abid
KA,
Di
Napoli
M,
et
al.
Accuracy
and
clinical
usefulness
of
intracerebral
hemorrhage
grading
scores:
a
direct
comparison
in
a
UK
popula0on.
Stroke
2013;
44:1840–1845.