In radiology, patients are frequently diagnosed according to radiologists' subjective interpretations of an image. Such diagnoses may be biased and may differ significantly among evaluators (i.e., readers) because of differences in training and experience. One way to address this problem is a multi-reader multi-case study design, in which multiple readers evaluate the same set of images. Several methods, both model-based and bootstrap-based, are available for analyzing multi-reader multi-case studies. In this study, we aimed to compare the performance of the available methods on a mammogram dataset. We also conducted a comprehensive simulation study to extend the results to more general scenarios. The simulation set-up considered the effects of the number of samples and readers, the data structure (i.e., correlation structures and variance components), and the overall accuracy of the diagnostic tests, measured by the area under the ROC curve (AUC). The results showed that the model-based methods had type-I error rates close to the nominal level as the number of samples and readers increased. Bootstrap-based methods, on the other hand, were generally conservative; however, they performed best when the sample size was small and the AUC level was high. In conclusion, the performance of the methods was not the same under all conditions and was affected by the factors considered in the simulation study. Therefore, using a single method under all scenarios is not advisable because it may lead to biased conclusions.
Code availability
All source code, written in R, is publicly available on GitHub at https://github.com/basolmerve/MRMC-Simulation-ArticleSupplementary.git.
Defining data correlation and reader variance
Consider the following full factorial experimental design.
From the model in Eq. 9, the decision variable \(Y_{ijk}\) has variance
for both diseased and non-diseased subjects, i.e., equal-variance components. To create unequal-variance components, Hillis (2012) modified the variance components in Eq. 10 by defining
such that \(\sigma _{C(1)}^2 = \frac{1}{b^2} \sigma _{C(0)}^2\), where \(b = \frac{\sigma _{(0)}}{\sigma _{(1)}}\) is the sigma ratio (the ratio of the non-diseased to the diseased standard deviation) for some \(b > 0\). Using the variance components in Eq. 10, we define
where \(\rho _{WR}\) and \(\rho _{BR}\) determine the first letter (i.e., data correlation) and \(\sigma _R^2\) or \(\sigma _{\tau R}^2\) determines the second letter (i.e., reader variance) of the data structures given in Supplementary Tables S1 and S2. For example, HH stands for high data correlation and high reader variance. The correlations in Eq. 12 can be estimated using the variance components from either the diseased or the non-diseased group. For more details, see Roe and Metz (1997a) and Hillis (2012).
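The unequal-variance scaling described above can be sketched in a few lines of Python. This is only an illustration, not the authors' R code: the component names and numeric values are hypothetical, and the sketch assumes that every variance component of the diseased group is scaled by \(1/b^2\) (the text states this explicitly only for \(\sigma _C^2\)) and that the variance of the decision variable is the sum of independent components.

```python
import math


def scale_components(nondiseased, b):
    """Return diseased-group variance components as nondiseased / b**2.

    Assumption: all components are scaled by the same factor 1/b**2.
    """
    if b <= 0:
        raise ValueError("the sigma ratio b must be positive")
    return {name: value / b**2 for name, value in nondiseased.items()}


def total_variance(components):
    """Variance of the decision variable, assuming independent random effects."""
    return sum(components.values())


# Hypothetical variance components for the non-diseased group.
nondiseased = {"R": 0.01, "tauR": 0.01, "C": 0.3, "tauC": 0.3, "RC": 0.2, "eps": 0.2}
b = 0.8  # hypothetical sigma ratio

diseased = scale_components(nondiseased, b)

# The ratio of the two standard deviations recovers b under this assumption.
recovered_b = math.sqrt(total_variance(nondiseased) / total_variance(diseased))
print(diseased["C"], recovered_b)
```

With \(b = 1\) the two groups share the same components, which corresponds to the equal-variance case.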
Test statistics for the ANOVA models of the DBM and OR methods
The DBM (Dorfman-Berbaum-Metz) and OR (Obuchowski-Rockette) models are based on three-way and two-way mixed-effect ANOVA models, respectively. The statistical significance of each component in the DBM and OR models is evaluated via an F statistic calculated from the mean squares.
The DBM model. The F statistic for the significance of the test effect \(\tau\) in the ANOVA model of Eq. (3) is obtained from the mean squares (MS) as
where the denominator degrees of freedom, \(df_2\), is calculated as in Eq. (14) using Satterthwaite's approximation (Satterthwaite 1946).
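As an illustration of this kind of test, the following Python sketch computes the classical DBM F statistic for the test effect together with Satterthwaite denominator degrees of freedom. It is a sketch under stated assumptions rather than the authors' implementation: it uses the commonly cited form with denominator MS(T×R) + MS(T×C) − MS(T×R×C), which may differ in detail from Eqs. (13) and (14), and the function name, argument names, and input values are hypothetical.

```python
def dbm_f_test(ms_t, ms_tr, ms_tc, ms_trc, t, r, c):
    """Classical DBM F test for the test effect (assumed form).

    ms_t, ms_tr, ms_tc, ms_trc: mean squares for test, test x reader,
    test x case, and test x reader x case.
    t, r, c: numbers of tests, readers, and cases.
    Returns (F, df1, df2).
    """
    denom = ms_tr + ms_tc - ms_trc
    if denom <= 0:
        raise ValueError("non-positive denominator; F statistic is undefined")
    f_stat = ms_t / denom
    df1 = t - 1
    # Satterthwaite approximation for a linear combination of mean squares.
    df2 = denom**2 / (
        ms_tr**2 / ((t - 1) * (r - 1))
        + ms_tc**2 / ((t - 1) * (c - 1))
        + ms_trc**2 / ((t - 1) * (r - 1) * (c - 1))
    )
    return f_stat, df1, df2


# Hypothetical mean squares for 2 tests, 5 readers, and 100 cases.
f_stat, df1, df2 = dbm_f_test(0.8, 0.2, 0.3, 0.1, t=2, r=5, c=100)
print(f_stat, df1, df2)
```

The p-value would then be the upper tail of an F distribution with \(df_1\) and \(df_2\) degrees of freedom; \(df_2\) is generally not an integer.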
The OR model. The corrected F statistic for the significance of the test effect \(\tau\) in the ANOVA model of Eq. (4) is
with degrees of freedom \(df_1\) and \(df_2\). Here, \(df_1 = t - 1\), where \(t\) is the number of tests, and \(df_2\) is calculated as in Eq. (16),
where the covariance estimates are calculated from Eq. (5).
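A corrected F test of this type can be sketched in Python as follows. The sketch assumes the commonly cited Hillis-type form, in which the denominator is MS(T×R) + r·max(Cov2 − Cov3, 0) and the denominator degrees of freedom are rescaled accordingly; this may not match Eqs. (15) and (16) exactly. The names `cov2` and `cov3` (covariances between AUC estimates from different readers under the same test and under different tests, respectively), the function name, and the input values are all assumptions made for illustration.

```python
def or_f_test(ms_t, ms_tr, cov2, cov3, t, r):
    """Corrected OR F test for the test effect (assumed Hillis-type form).

    ms_t, ms_tr: mean squares for test and test x reader, computed from
    the reader-by-test table of accuracy estimates.
    cov2, cov3: estimated covariances (different readers, same test;
    different readers, different tests).
    t, r: numbers of tests and readers.
    Returns (F, df1, df2).
    """
    denom = ms_tr + r * max(cov2 - cov3, 0.0)
    if denom <= 0:
        raise ValueError("non-positive denominator; F statistic is undefined")
    f_stat = ms_t / denom
    df1 = t - 1
    df2 = denom**2 / (ms_tr**2 / ((t - 1) * (r - 1)))
    return f_stat, df1, df2


# Hypothetical inputs for 2 tests and 5 readers.
f_stat, df1, df2 = or_f_test(0.004, 0.001, 0.0004, 0.0002, t=2, r=5)
print(f_stat, df1, df2)
```

Under this form, when the estimated Cov2 does not exceed Cov3 the correction term vanishes and \(df_2\) reduces to \((t - 1)(r - 1)\).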
