Abstract
In radiology, patients are frequently diagnosed according to radiologists' subjective interpretations of an image. Such diagnoses may be biased and may differ substantially among evaluators (i.e., readers) owing to differences in education and experience. One solution to this problem is the multi-reader multi-case study design, in which multiple readers evaluate the same set of images. Several methods, both model-based and bootstrap-based, are available for analyzing multi-reader multi-case studies. In this study, we compared the performance of the available methods on a mammogram dataset. We also conducted a comprehensive simulation study to generalize the results to more general scenarios, considering the number of samples and readers, the data structure (i.e., correlation structure and variance components), and the overall accuracy of the diagnostic tests (AUC). Results showed that the model-based methods had type-I error rates close to the nominal level as the number of samples and readers increased, whereas the bootstrap-based methods were generally conservative; the latter performed best when the sample size was small and the AUC was high. In conclusion, the performance of the compared methods was not the same under all conditions and was affected by the factors considered in the simulation study. Relying on a single method under all scenarios is therefore not advisable, as it may lead to biased conclusions.
Code availability
All source code, written in R, is publicly available on GitHub at https://github.com/basolmerve/MRMC-Simulation-ArticleSupplementary.git.
References
Beiden SV, Wagner RF, Campbell G (2000) Components-of-variance models and multiple-bootstrap experiments: an alternative method for random-effects, receiver operating characteristic analysis. Acad Radiol 7(5):341–349. https://doi.org/10.1016/S1076-6332(00)80008-2
Chakraborty D, Philips P, Zhai X (2019) RJafroc: analyzing diagnostic observer performance studies. https://CRAN.R-project.org/package=RJafroc. R package version 1.2.0
DeLong E, DeLong D, Clarke-Pearson D (1988) Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44(3):837–845. https://doi.org/10.2307/2531595
Dodd LE, Pepe MS (2003) Partial AUC estimation and regression. Biometrics 59(3):614–623. https://doi.org/10.1111/1541-0420.00071
Dorfman DD, Berbaum KS, Metz CE (1992) Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investig Radiol 27(9):723–731
Dorfman DD, Berbaum KS, Lenth RV, Chen YF, Donaghy BA (1998) Monte Carlo validation of a multireader method for receiver operating characteristic discrete rating data: factorial experimental design. Acad Radiol 5(9):591–602. https://doi.org/10.1016/S1076-6332(98)80294-8
Efron B, Tibshirani RJ (1993) An introduction to the bootstrap, vol 57. Monographs on statistics and applied probability. Chapman and Hall/CRC, Florida
Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1):29–36. https://doi.org/10.1148/radiology.143.1.7063747
Hillis SL (2012) Simulation of unequal-variance binormal multireader ROC decision data: an extension of the Roe and Metz simulation model. Acad Radiol 19(12):1518–1528. https://doi.org/10.1016/j.acra.2012.09.011
Hillis SL, Obuchowski NA, Schartz KM, Berbaum KS (2005) A comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette methods for receiver operating characteristic (ROC) data. Stat Med 24(10):1579–1607. https://doi.org/10.1002/sim.2024
Obuchowski NA (1995) Multireader receiver operating characteristic studies: a comparison of study designs. Acad Radiol 2(8):709–716. https://doi.org/10.1016/S1076-6332(05)80441-6
Obuchowski NA (2007) New methodological tools for multiple-reader ROC studies. Radiology 243(1):10–12. https://doi.org/10.1148/radiol.2432060387
Obuchowski NA, Rockette HE (1995) Hypothesis testing of diagnostic accuracy for multiple readers and multiple tests: an ANOVA approach with dependent observations. Commun Stat Simul Comput 24(2):285–308. https://doi.org/10.1080/03610919508813243
Obuchowski NA, Beiden SV, Berbaum KS, Hillis SL, Ishwaran H, Song HH, Wagner RF (2004) Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods. Acad Radiol 11(9):980–995. https://doi.org/10.1016/j.acra.2004.04.014
Quenouille MH (1956) Notes on bias in estimation. Biometrika 43(3/4):353–360
R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Roe CA, Metz CE (1997a) Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: validation with computer simulation. Acad Radiol 4(4):298–303. https://doi.org/10.1016/S1076-6332(97)80032-3
Roe CA, Metz CE (1997b) Variance-component modeling in the analysis of receiver operating characteristic index estimates. Acad Radiol 4(8):587–600. https://doi.org/10.1016/S1076-6332(97)80210-3
Samuelson FW, Wagner RF (2005) Bootstrapped MRMC confidence intervals. In: Medical imaging 2005: image perception, observer performance, and technology assessment, vol 10, p 597660
Satterthwaite FE (1946) An approximate distribution of estimates of variance components. Biom Bull 2(6):110–114
Song X, Zhou XH (2005) A marginal model approach for analysis of multi-reader multi-test receiver operating characteristic (ROC) data. Biostatistics 6(2):303–312. https://doi.org/10.1093/biostatistics/kxi011
Zanca F, Hillis SL, Claus F, Van Ongeval C, Celis V, Provoost V, Yoon HJ, Bosmans H (2012) Correlation of free-response and receiver-operating-characteristic area-under-the-curve estimates: results from independently conducted FROC/ROC studies in mammography. Med Phys 39(10):5917–5929. https://doi.org/10.1118/1.4747262
Zhou XH, Obuchowski NA, McClish DK (2011) Statistical methods in diagnostic medicine, 2nd edn. John Wiley & Sons, New Jersey
Zou KH, Liu A, Bandos AI, Ohno-Machado L, Rockette HE (2012) Statistical evaluation of diagnostic performance: topics in ROC analysis. Chapman and Hall/CRC, Florida
Acknowledgements
We would like to thank the anonymous reviewers for their valuable comments, which improved the quality of our manuscript.
Funding
This study was not supported by any institution or organization.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The electronic supplementary material is available online.
Appendices
Appendix
Defining data correlation and reader variance
Consider the following full factorial experimental design.
From the model in Eq. (9), the decision variable \(Y_{ijk}\) has variance
for both diseased and non-diseased subjects, i.e., equal variance components across the two groups. To create unequal variance components, Hillis (2012) modified the variance components in Eq. (10) by defining
such that \(\sigma _{C(1)}^2 = \frac{1}{b^2} \sigma _{C(0)}^2\), where \(b = \frac{\sigma _{(0)}}{\sigma _{(1)}}\) is the sigma ratio for some \(b > 0\). Using the variance components in Eq. (10), we define
where \(\rho _{WR}\) and \(\rho _{BR}\) define the first letter (i.e., data correlation) and \(\sigma _R^2\) or \(\sigma _{\tau R}^2\) define the second letter (i.e., reader variance) of the data structures given in Supplementary Tables S1 and S2. For example, HH stands for high data correlation and high reader variance. The correlations in Eq. (12) can be estimated using the variance components from either the diseased or the non-diseased group. For more details, see Roe and Metz (1997a) and Hillis (2012).
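To make the data-generating mechanism concrete, the following R sketch draws decision scores from a Roe-Metz-type model with Hillis' (2012) unequal-variance extension, in which the case-related components of the diseased group are scaled by \(1/b^2\) (here the scaling is applied to all case-related terms, which is an assumption of the sketch). This is illustrative only and is not the code used in the study (see the GitHub repository for the actual implementation); the function name and all variance-component values are assumed.

## Illustrative Roe-Metz-type simulation with an unequal-variance extension;
## parameter values are assumed for illustration, not those of the study.
simulate_mrmc <- function(n_test = 2, n_reader = 5, n_case = 50,
                          delta = 1.5,                 # mean shift for diseased cases
                          b = 1,                       # sigma ratio sigma_(0) / sigma_(1)
                          var_R = 0.03, var_TR = 0.02, # reader-related components
                          var_C0 = 0.3, var_TC0 = 0.1, # case-related (non-diseased)
                          var_RC0 = 0.2, var_E0 = 0.2) {
  out <- expand.grid(test = 1:n_test, reader = 1:n_reader,
                     case = 1:n_case, truth = 0:1)
  out$score <- NA_real_
  for (d in 0:1) {                                     # 0 = non-diseased, 1 = diseased
    s <- if (d == 1) 1 / b^2 else 1                    # sigma^2_C(1) = sigma^2_C(0) / b^2
    R  <- rnorm(n_reader, 0, sqrt(var_R))
    TR <- matrix(rnorm(n_test * n_reader, 0, sqrt(var_TR)), n_test, n_reader)
    C  <- rnorm(n_case, 0, sqrt(s * var_C0))
    TC <- matrix(rnorm(n_test * n_case, 0, sqrt(s * var_TC0)), n_test, n_case)
    RC <- matrix(rnorm(n_reader * n_case, 0, sqrt(s * var_RC0)), n_reader, n_case)
    for (row in which(out$truth == d)) {
      i <- out$test[row]; j <- out$reader[row]; k <- out$case[row]
      out$score[row] <- d * delta + R[j] + TR[i, j] + C[k] + TC[i, k] +
        RC[j, k] + rnorm(1, 0, sqrt(s * var_E0))
    }
  }
  out
}

## Example: five readers, 50 cases per truth state, sigma ratio b = 1.2
set.seed(1)
scores <- simulate_mrmc(n_reader = 5, n_case = 50, b = 1.2)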
Test statistics for the ANOVA models of the DBM and OR methods
The DBM and OR models are based on three-way and two-way mixed-effects ANOVA models, respectively. The statistical significance of each component in the DBM and OR models is evaluated via an F statistic calculated from the mean squares.
The DBM model. The F statistic for the significance of the test effect \(\tau\) in the ANOVA model in Eq. (3) is obtained using the mean squares (MS) as
where the denominator degrees of freedom, \(df_2\), are calculated as in Eq. (14) using Satterthwaite's approximation (Satterthwaite 1946).
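For reference, the standard form of this F statistic and its Satterthwaite denominator degrees of freedom, as given by Dorfman et al. (1992) and Hillis et al. (2005), is the following, where the mean squares are computed from the jackknife pseudovalues and \(t\), \(r\), and \(c\) denote the numbers of tests, readers, and cases:
\[
F_{\mathrm{DBM}} = \frac{\mathrm{MS}(T)}{\mathrm{MS}(TR) + \mathrm{MS}(TC) - \mathrm{MS}(TRC)}, \qquad df_1 = t - 1,
\]
\[
df_2 = \frac{\left[\mathrm{MS}(TR) + \mathrm{MS}(TC) - \mathrm{MS}(TRC)\right]^2}{\dfrac{\mathrm{MS}(TR)^2}{(t-1)(r-1)} + \dfrac{\mathrm{MS}(TC)^2}{(t-1)(c-1)} + \dfrac{\mathrm{MS}(TRC)^2}{(t-1)(r-1)(c-1)}}.
\]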
The OR model. The corrected F statistic for the significance of the test effect \(\tau\) in the ANOVA model in Eq. (4) is
with degrees of freedom \(df_1\) and \(df_2\). Here, \(df_1\) equals \((t - 1)\) and \(df_2\) is calculated as in Eq. (16).
where the covariance estimates are calculated from Eq. (5).
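For reference, a commonly used corrected form of the OR F statistic and a Satterthwaite-type denominator degrees of freedom, following Hillis' modification of the OR approach (cf. Obuchowski and Rockette 1995; Hillis et al. 2005), is the following, where \(\widehat{\mathrm{Cov}}_2\) and \(\widehat{\mathrm{Cov}}_3\) are the covariance estimates from Eq. (5) and \(r\) is the number of readers:
\[
F_{\mathrm{OR}} = \frac{\mathrm{MS}(T)}{\mathrm{MS}(TR) + r \cdot \max\!\left(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3,\, 0\right)},
\]
\[
df_2 = \frac{\left[\mathrm{MS}(TR) + r \cdot \max\!\left(\widehat{\mathrm{Cov}}_2 - \widehat{\mathrm{Cov}}_3,\, 0\right)\right]^2}{\mathrm{MS}(TR)^2 / \left[(t-1)(r-1)\right]}.
\]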
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Basol, M., Goksuluk, D. & Karaagaoglu, E. Comparing the diagnostic performance of methods used in a full-factorial design multi-reader multi-case studies. Comput Stat 38, 1537–1553 (2023). https://doi.org/10.1007/s00180-022-01309-1