Comparing the diagnostic performance of methods used in a full-factorial design multi-reader multi-case studies

Abstract

In radiology, patients are frequently diagnosed according to radiologists' subjective interpretations of an image. Such diagnoses may be biased and may differ significantly among evaluators (i.e., readers) because of differences in education and experience. One way to address this problem is a multi-reader multi-case study design, in which multiple readers evaluate the same images. Several methods, both model-based and bootstrap-based, are available for analyzing multi-reader multi-case studies. In this study, we compared the performance of the available methods on a mammogram dataset. We also conducted a comprehensive simulation study to extend the results to more general scenarios. In the simulation set-up, we considered the effects of the number of samples and readers, the data structure (i.e., correlation structures and variance components), and the overall accuracy of the diagnostic tests (AUC). The results showed that the type-I error rates of the model-based methods approached the nominal level as the number of samples and readers increased. Bootstrap-based methods, on the other hand, were generally conservative; however, they performed best when the sample size was small and the AUC level was high. In conclusion, the performance of the methods was not the same under all conditions and was affected by the factors considered in the simulation study. Therefore, using a single method under all scenarios is not advisable because it may lead to biased conclusions.

Code availability

All source code, written in R, is publicly available on GitHub at https://github.com/basolmerve/MRMC-Simulation-ArticleSupplementary.git.

References

Beiden SV, Wagner RF, Campbell G (2000) Components-of-variance models and multiple-bootstrap experiments: an alternative method for random-effects, receiver operating characteristic analysis. Acad Radiol 7(5):341–349. https://doi.org/10.1016/S1076-6332(00)80008-2
Chakraborty D, Philips P, Zhai X (2019) RJafroc: analyzing diagnostic observer performance studies. https://CRAN.R-project.org/package=RJafroc. R package version 1.2.0
DeLong E, DeLong D, Clarke-Pearson D (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44(3):837–845. https://doi.org/10.2307/2531595
Dodd LE, Pepe MS (2003) Partial AUC estimation and regression. Biometrics 59(3):614–623. https://doi.org/10.1111/1541-0420.00071
Dorfman DD, Berbaum KS, Metz CE (1992) Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investig Radiol 27(9):723–731
Dorfman DD, Berbaum KS, Lenth RV, Chen YF, Donaghy BA (1998) Monte Carlo validation of a multireader method for receiver operating characteristic discrete rating data: factorial experimental design. Acad Radiol 5(9):591–602. https://doi.org/10.1016/S1076-6332(98)80294-8
Efron B, Tibshirani RJ (1993) An introduction to the bootstrap, vol 57. Monographs on statistics and applied probability. Chapman and Hall/CRC, Florida
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36. https://doi.org/10.1148/radiology.143.1.7063747
Hillis SL (2012) Simulation of unequal-variance binormal multireader ROC decision data: an extension of the Roe and Metz simulation model. Acad Radiol 19(12):1518–1528. https://doi.org/10.1016/j.acra.2012.09.011
Hillis SL, Obuchowski NA, Schartz KM, Berbaum KS (2005) A comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette methods for receiver operating characteristic (ROC) data. Stat Med 24(10):1579–1607. https://doi.org/10.1002/sim.2024
Obuchowski NA (1995) Multireader receiver operating characteristic studies: a comparison of study designs. Acad Radiol 2(8):709–716. https://doi.org/10.1016/S1076-6332(05)80441-6
Obuchowski NA (2007) New methodological tools for multiple-reader ROC studies. Radiology 243(1):10–12. https://doi.org/10.1148/radiol.2432060387
Obuchowski NA, Rockette HE (1995) Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations. Commun Stat Simul Comput 24(2):285–308. https://doi.org/10.1080/03610919508813243
Obuchowski NA, Beiden SV, Berbaum KS, Hillis SL, Ishwaran H, Song HH, Wagner RF (2004) Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods. Acad Radiol 11(9):980–995. https://doi.org/10.1016/j.acra.2004.04.014
Quenouille MH (1956) Notes on bias in estimation. Biometrika 43(3/4):353–360
R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Roe CA, Metz CE (1997a) Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation. Acad Radiol 4(4):298–303. https://doi.org/10.1016/S1076-6332(97)80032-3
Roe CA, Metz CE (1997b) Variance-component modeling in the analysis of receiver operating characteristic index estimates. Acad Radiol 4(8):587–600. https://doi.org/10.1016/S1076-6332(97)80210-3
Samuelson FW, Wagner RF (2005) Bootstrapped MRMC confidence intervals. In: Medical imaging 2005: image perception, observer performance, and technology assessment, vol 10, p 597660
Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Biom Bull 2(6):110–114
Song X, Zhou XH (2005) A marginal model approach for analysis of multi-reader multi-test receiver operating characteristic (ROC) data. Biostatistics 6(2):303–312. https://doi.org/10.1093/biostatistics/kxi011
Zanca F, Hillis SL, Claus F, Van Ongeval C, Celis V, Provoost V, Yoon HJ, Bosmans H (2012) Correlation of free-response and receiver-operating-characteristic area-under-the-curve estimates: results from independently conducted FROC/ROC studies in mammography. Med Phys 39(10):5917–5929. https://doi.org/10.1118/1.4747262
Zhou XH, Obuchowski NA, McClish DK (2011) Statistical methods in diagnostic medicine, 2nd edn. John Wiley & Sons, New Jersey
Zou KH, Liu A, Bandos AI, Ohno-Machado L, Rockette HE (2012) Statistical evaluation of diagnostic performance: topics in ROC analysis. Chapman and Hall/CRC, Florida

Acknowledgements

We would like to thank anonymous reviewers for their valuable comments that improved the quality of our manuscript.

Funding

This study was not supported by any institution/organization.

Author information

Authors and Affiliations

Department of Biostatistics, School of Medicine, Erciyes University, Kayseri, 38280, Turkey
Merve Basol & Dincer Goksuluk
Department of Biostatistics, School of Medicine, Hacettepe University, Sihhiye Campus, 06230, Ankara, Turkey
Ergun Karaagaoglu

Authors

Merve Basol
Dincer Goksuluk
Ergun Karaagaoglu

Corresponding author

Correspondence to Merve Basol.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 92 KB)

Appendices

Appendix

Defining data correlation and reader variance

Consider the following full factorial experimental design.

$$\begin{aligned} Y_{ijk} = \mu&+ \tau _i + R_j + C_k + \left( \tau R \right) _{ij} + \left( \tau C \right) _{ik} \nonumber \\&+ \left( RC \right) _{jk} + \left( \tau R C \right) _{ijk} + \varepsilon _{ijk} \nonumber \\&i: 1, 2, \dots , t \quad j: 1, 2, \dots , r \quad k: 1, 2, \dots , n \end{aligned}$$

(9)
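As a rough illustration of how decision variables could be generated from the model in Eq. 9, the following Python sketch draws each random effect from a zero-mean normal distribution. This is not the authors' R implementation: the function name `simulate_scores`, the dictionary keys, and the example values are hypothetical, and normality of the random effects is an assumption made here for illustration.

```python
import random


def simulate_scores(mu, tau, r, n, var, seed=None):
    """Draw Y[i][j][k] from the full-factorial model of Eq. 9.

    mu  : overall mean
    tau : list of fixed test effects, one per test (length t)
    r, n: number of readers and cases
    var : dict of variances with (hypothetical) keys
          'R', 'C', 'tauR', 'tauC', 'RC', 'tauRC', 'eps'
    """
    rng = random.Random(seed)

    def draw(key):
        # Zero-mean normal draw with the requested variance (assumption).
        return rng.gauss(0.0, var[key] ** 0.5)

    t = len(tau)
    reader = [draw("R") for _ in range(r)]
    case = [draw("C") for _ in range(n)]
    test_reader = [[draw("tauR") for _ in range(r)] for _ in range(t)]
    test_case = [[draw("tauC") for _ in range(n)] for _ in range(t)]
    reader_case = [[draw("RC") for _ in range(n)] for _ in range(r)]

    return [
        [
            [
                mu + tau[i] + reader[j] + case[k]
                + test_reader[i][j] + test_case[i][k] + reader_case[j][k]
                + draw("tauRC") + draw("eps")
                for k in range(n)
            ]
            for j in range(r)
        ]
        for i in range(t)
    ]


# Illustrative values only; they are not taken from the article.
example_var = {"R": 0.01, "C": 0.3, "tauR": 0.01, "tauC": 0.3,
               "RC": 0.2, "tauRC": 0.1, "eps": 0.1}
scores = simulate_scores(mu=0.0, tau=[0.0, 0.5], r=5, n=20,
                         var=example_var, seed=1)
```

The result is indexed as `scores[i][j][k]` for test `i`, reader `j`, and case `k`. In a full simulation one such array would be generated for each truth state (diseased and non-diseased), with different means.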

From the model in Eq. 9, the decision variable $Y_{ijk}$ has variance

$$\begin{aligned} \sigma ^2 = \sigma _{C}^{2} + \sigma _{\tau C}^{2} + \sigma _{RC}^{2} + \sigma _{\varepsilon }^{2} \end{aligned}$$

(10)

for both diseased and non-diseased subjects, i.e., equal-variance components. To create unequal-variance components, Hillis (2012) modified the variance components in Eq. 10 by defining

$$\begin{aligned} \sigma _{*(1)}^2 = \dfrac{1}{b^2} \sigma _{*(0)}^2 \end{aligned}$$

(11)

so that, for example, $\sigma _{C(1)}^2 = \frac{1}{b^2} \sigma _{C(0)}^2$, where $b = \frac{\sigma _{(0)}}{\sigma _{(1)}} > 0$ is the ratio of the standard deviations of the non-diseased (0) and diseased (1) groups. Using the variance components in Eq. 10, we define

$$\begin{aligned} \rho _{WR} = \dfrac{\sigma ^{2}_{C} + \sigma ^{2}_{\tau C} + \sigma ^{2}_{RC}}{\sigma ^{2}_{C} + \sigma ^{2}_{\tau C} + \sigma ^{2}_{RC} + \sigma ^{2}_{\varepsilon }}, \qquad \rho _{BR} = \dfrac{\sigma ^{2}_{C} + \sigma ^{2}_{\tau C}}{\sigma ^{2}_{C} + \sigma ^{2}_{\tau C} + \sigma ^{2}_{RC} + \sigma ^{2}_{\varepsilon }} \end{aligned}$$

(12)

where $\rho _{WR}$ and $\rho _{BR}$ are used to define the first letter (i.e., data correlation) and $\sigma _R^2$ or $\sigma _{\tau R}^2$ are used to define the second letter (i.e., reader variance) of the data structures given in Supplementary Tables S1 and S2. For example, HH stands for high data correlation and high reader variance. The correlations in Eq. 12 can be estimated using the variance components from either the diseased or the non-diseased group. For more details, see Roe and Metz (1997a) and Hillis (2012).
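The definitions in Eqs. 10 to 12 can be checked numerically with a small Python sketch. The class name `VarianceComponents`, its field names, and the example values are hypothetical and are not taken from the article or its R code. The sketch also assumes that subscript (0) denotes the non-diseased group and (1) the diseased group in Eq. 11.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarianceComponents:
    """Case-related variance components appearing in Eq. 10."""
    case: float         # sigma_C^2
    test_case: float    # sigma_{tau C}^2
    reader_case: float  # sigma_{RC}^2
    error: float        # sigma_epsilon^2

    def total(self) -> float:
        # Eq. 10: variance of the decision variable.
        return self.case + self.test_case + self.reader_case + self.error

    def rho_wr(self) -> float:
        # Eq. 12, first expression.
        return (self.case + self.test_case + self.reader_case) / self.total()

    def rho_br(self) -> float:
        # Eq. 12, second expression.
        return (self.case + self.test_case) / self.total()

    def rescaled(self, b: float) -> "VarianceComponents":
        # Eq. 11: every component is divided by b^2 (assumed here to map
        # non-diseased components to diseased components).
        if b <= 0:
            raise ValueError("b must be positive")
        scale = 1.0 / b ** 2
        return VarianceComponents(
            self.case * scale,
            self.test_case * scale,
            self.reader_case * scale,
            self.error * scale,
        )


# Illustrative values only.
nondiseased = VarianceComponents(case=0.3, test_case=0.3,
                                 reader_case=0.2, error=0.2)
diseased = nondiseased.rescaled(b=0.5)
print(round(nondiseased.rho_wr(), 3), round(nondiseased.rho_br(), 3))
print(round(diseased.rho_wr(), 3), round(diseased.rho_br(), 3))
```

Because Eq. 11 multiplies every component by the same factor, the two correlations are identical in both groups, which is consistent with the statement that they can be estimated from either one.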

Test statistics for the ANOVA models of the DBM and OR methods

The DBM and OR models are based on three-way and two-way mixed-effects ANOVA models, respectively. The statistical significance of each component in the DBM and OR models is evaluated via an F statistic calculated from the mean squares.

The DBM model. The F statistic for the significance of the test effect $\tau$ in the ANOVA model Eq. (3) is obtained using the mean squares (MS) as

$$\begin{aligned} F_{DBM} = \dfrac{MS\left( \tau \right) }{MS\left( \tau R \right) + MS\left( \tau C \right) - MS\left( \tau R C\right) } \end{aligned}$$

(13)

where the numerator degrees of freedom is $df_1 = t - 1$ and the denominator degrees of freedom, $df_2$, is calculated as in Eq. (14) using Satterthwaite's approximation (Satterthwaite 1946).

$$\begin{aligned} df_2 = \dfrac{\left[ MS\left( \tau R \right) + MS\left( \tau C \right) - MS\left( \tau R C\right) \right] ^2}{\dfrac{MS\left( \tau R \right) ^2}{\left( t - 1\right) \left( r - 1\right) } + \dfrac{MS\left( \tau C \right) ^2}{\left( t - 1 \right) \left( n - 1 \right) } + \dfrac{MS\left( \tau R C \right) ^2}{\left( t - 1 \right) \left( r - 1\right) \left( n-1 \right) }} \end{aligned}$$

(14)
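Eqs. (13) and (14) translate directly into a few lines of Python. The sketch below is a minimal illustration rather than the authors' implementation: the function name `dbm_f_test`, its argument names, and the numerical inputs are hypothetical, and rejecting a non-positive denominator is a design choice made here for simplicity.

```python
def dbm_f_test(ms_t, ms_tr, ms_tc, ms_trc, t, r, n):
    """Return (F, df1, df2) for the test effect, following Eqs. (13)-(14).

    ms_t, ms_tr, ms_tc, ms_trc : mean squares MS(tau), MS(tau R),
                                 MS(tau C), MS(tau R C)
    t, r, n                    : number of tests, readers, and cases
    """
    denom = ms_tr + ms_tc - ms_trc
    if denom <= 0:
        # The F ratio is undefined here; this guard is an illustrative choice.
        raise ValueError("denominator of F_DBM must be positive")

    f_stat = ms_t / denom                      # Eq. (13)
    df1 = t - 1
    df2 = denom ** 2 / (                       # Eq. (14), Satterthwaite
        ms_tr ** 2 / ((t - 1) * (r - 1))
        + ms_tc ** 2 / ((t - 1) * (n - 1))
        + ms_trc ** 2 / ((t - 1) * (r - 1) * (n - 1))
    )
    return f_stat, df1, df2


# Made-up mean squares for 2 tests, 5 readers, and 100 cases.
f_stat, df1, df2 = dbm_f_test(ms_t=4.0, ms_tr=1.0, ms_tc=1.5, ms_trc=0.5,
                              t=2, r=5, n=100)
print(f_stat, df1, round(df2, 2))
```

A p-value would then follow from the upper tail of an F distribution with `df1` and `df2` degrees of freedom, which requires a statistics library and is omitted here.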

The OR model. The corrected F statistic for the significance of the test effect $\tau$ in the ANOVA model Eq. (4) is

$$\begin{aligned} F_{OR} = \dfrac{MS\left( \tau \right) }{MS\left( \tau R \right) + r\left( {\widehat{Cov}}_2 - {\widehat{Cov}}_3 \right) } \end{aligned}$$

(15)

with degrees of freedom $df_1$ and $df_2$. Here, $df_1$ equals $(t - 1)$ and $df_2$ is calculated as in Eq. (16).

$$\begin{aligned} df_2 = \dfrac{\left\{ MS\left( \tau R \right) + r\left( {\widehat{Cov}}_2 - {\widehat{Cov}}_3 \right) \right\} ^2}{\dfrac{MS \left( \tau R \right) ^2}{(t - 1)(r - 1)}} \end{aligned}$$

(16)

The covariance estimates ${\widehat{Cov}}_2$ and ${\widehat{Cov}}_3$ are calculated from Eq. (5).
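The OR statistic in Eq. (15) and its denominator degrees of freedom in Eq. (16) can be sketched in Python in the same way. The function name `or_f_test`, its arguments, and the example inputs are hypothetical. The covariance estimates are taken as given inputs, since their computation (Eq. (5)) is not reproduced in this appendix.

```python
def or_f_test(ms_t, ms_tr, cov2, cov3, t, r):
    """Return (F, df1, df2) for the test effect, following Eqs. (15)-(16).

    ms_t, ms_tr : mean squares MS(tau) and MS(tau R)
    cov2, cov3  : covariance estimates Cov_2 and Cov_3
    t, r        : number of tests and readers
    """
    denom = ms_tr + r * (cov2 - cov3)
    if denom <= 0:
        # Illustrative guard; the F ratio is undefined in this case.
        raise ValueError("denominator of F_OR must be positive")

    f_stat = ms_t / denom                                   # Eq. (15)
    df1 = t - 1
    df2 = denom ** 2 / (ms_tr ** 2 / ((t - 1) * (r - 1)))   # Eq. (16)
    return f_stat, df1, df2


# Made-up inputs for 2 tests and 5 readers.
f_stat, df1, df2 = or_f_test(ms_t=8.0, ms_tr=1.0, cov2=0.3, cov3=0.1,
                             t=2, r=5)
print(round(f_stat, 3), df1, round(df2, 3))
```

When the two covariance estimates are equal, the correction term vanishes and `df2` reduces to $(t - 1)(r - 1)$, the usual degrees of freedom of the test-by-reader interaction.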

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Basol, M., Goksuluk, D. & Karaagaoglu, E. Comparing the diagnostic performance of methods used in a full-factorial design multi-reader multi-case studies. Comput Stat 38, 1537–1553 (2023). https://doi.org/10.1007/s00180-022-01309-1

Received: 25 August 2021
Accepted: 24 November 2022
Published: 18 December 2022
Issue Date: September 2023
DOI: https://doi.org/10.1007/s00180-022-01309-1
