Trial Factors Associated With Completion of Clinical Trials Evaluating AI: Retrospective Case-Control Study

Original Paper

¹Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada

²Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada

³Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada

Corresponding Author:

Srinivas Raman, MASc, MD

Department of Radiation Oncology

University of Toronto

610 University Avenue

Toronto, ON, M5G 2M9

Canada

Phone: 1 416 946 4501 ext 2320

Fax: 1 416 946 6566

Email: srinivas.raman@uhn.ca

Background: Evaluation of artificial intelligence (AI) tools in clinical trials remains the gold standard for translation into clinical settings. However, design factors associated with successful trial completion and the common reasons for trial failure are unknown.

Objective: This study aims to compare trial design factors of complete and incomplete clinical trials testing AI tools. We conducted a case-control study of complete (n=485) and incomplete (n=51) clinical trials registered on ClinicalTrials.gov that evaluated AI as an intervention.

Methods: Trial design factors, including area of clinical application, intended use population, and intended role of AI, were extracted. Trials that did not evaluate AI as an intervention and active trials were excluded. The assessed trial design factors related to AI interventions included the domain of clinical application related to organ systems; intended use population for patients or health care providers; and the role of AI for different applications in patient-facing clinical workflows, such as diagnosis, screening, and treatment. In addition, we also assessed general trial design factors including study type, allocation, intervention model, masking, age, sex, funder, continent, length of time, sample size, number of enrollment sites, and study start year. The main outcome was the completion of the clinical trial. Odds ratio (OR) and 95% CI values were calculated for all trial design factors using propensity-matched, multivariable logistic regression.

Results: We queried ClinicalTrials.gov on December 23, 2023, using AI keywords to identify complete and incomplete trials testing AI technologies as a primary intervention, yielding 485 complete and 51 incomplete trials for inclusion in this study. Our nested propensity-matched, case-control results suggest that trials conducted in Europe were significantly associated with trial completion when compared with North American trials (OR 2.85, 95% CI 1.14-7.10; P=.03), and the trial sample size was positively associated with trial completion (OR 1.00, 95% CI 1.00-1.00; P=.02).

Conclusions: Our case-control study is one of the first to identify trial design factors associated with completion of AI trials and catalog study-reported reasons for AI trial failure. We observed that trial design factors positively associated with trial completion include trials conducted in Europe and sample size. Given the promising clinical use of AI tools in health care, our results suggest that future translational research should prioritize addressing the design factors of AI clinical trials associated with trial incompletion and common reasons for study failure.

J Med Internet Res 2024;26:e58578

doi:10.2196/58578

Keywords

artificial intelligence; clinical trial; completion; AI; cross-sectional study; application; intervention; trial design; logistic regression; Europe; clinical; trials testing; health care; informatics; health information

The advent of artificial intelligence (AI) is expected to transform the practice and delivery of health care, including applications in clinical diagnosis, treatment, and management [1,2]. Adoption of these promising but often untested tools requires systematic evaluation through clinical trials, widely regarded as one of the highest forms of evidence to inform clinical practice [3,4].

It is well known that clinical trials fail to complete at different stages of the research and development process [5,6]. However, the evaluation of AI tools as interventions in clinical trials raises the potential for new modes of trial incompletion that have yet to be explored. Unique challenges in translational AI research can include poor patient cohort selection, ineffective patient monitoring during trials, and logistical difficulties for implementation [7,8].

Prioritizing research to address the common limitations of AI in trials is a critical step toward the validation and adoption of these tools. To address this issue, we performed a case-control study of AI trials in ClinicalTrials.gov to identify trial design factors associated with trial completion and catalog study-reported reasons for trial incompletion.

We queried ClinicalTrials.gov, the largest international registry of clinical trials, on December 23, 2023, using AI keywords based on previous methodology to identify AI-related trials (n=6738) [9]. Incomplete trials were defined as terminated, suspended, or withdrawn trials. We only included complete and incomplete trials testing AI technologies as a primary intervention, categorized by complete (n=485) or incomplete status (n=51; Figure S1 in Multimedia Appendix 1). Our study focused on clinical trials that evaluated AI technologies in at least 1 study arm to assess the characteristics of trial completion associated with primary AI interventions. We excluded exact duplicate trials but included separate trials evaluating the same AI intervention for different trial methods or targeted populations. We excluded ongoing trials given their unknown status of trial completion by the date of data collection needed for this study’s case-control design. We excluded observational trials and studies with missing trial design elements such as allocation, intervention model, and masking.
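The grouping rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' extraction code: the status strings are assumed to follow the ClinicalTrials.gov overall status labels, and the record fields `id` and `status` are hypothetical names chosen for the example.

```python
from typing import Optional

# Assumed status labels, modelled on the ClinicalTrials.gov overall status field.
COMPLETE_STATUSES = {"completed"}
INCOMPLETE_STATUSES = {"terminated", "suspended", "withdrawn"}


def classify_status(overall_status: str) -> Optional[str]:
    """Map a registry status to 'complete' or 'incomplete'.

    Returns None for any other status (for example, ongoing trials),
    which the study excluded.
    """
    status = overall_status.strip().lower()
    if status in COMPLETE_STATUSES:
        return "complete"
    if status in INCOMPLETE_STATUSES:
        return "incomplete"
    return None


def split_trials(records):
    """Group trial ids by completion label, dropping excluded trials.

    `records` is a list of dicts with hypothetical keys 'id' and 'status'.
    """
    groups = {"complete": [], "incomplete": []}
    for record in records:
        label = classify_status(record["status"])
        if label is not None:
            groups[label].append(record["id"])
    return groups
```

Applied to the screened registry records, a rule of this kind would yield the two comparison groups; the remaining inclusion criteria (AI as a primary intervention, no duplicates, complete design fields) still require manual review.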

Two reviewers (CC and RK) independently screened all studies for inclusion and data extraction after a pilot on 20 studies to improve interreviewer agreement. Discordances in screening and data extraction were resolved through discussion with a third reviewer (DC) to achieve full agreement across the reviewer team. We extracted trial design factors including the clinical area addressed by AI technology, the intended use population, and the intended role of AI technology (Table S1 in Multimedia Appendix 1). Trial factors with no available data were coded as “Unknown.”

Chi-square tests with Benjamini-Hochberg correction were used to compare the distribution of trial factors between complete and incomplete trials. Logistic regression models were fitted for univariable and multivariable analysis of trial factors associated with trial completion. Stepwise variable selection using the Akaike information criterion was used to identify the optimal set of trial factors for the multivariable regression model. The propensity score–matched, multivariable regression was conducted using 3 complete trials for each incomplete trial, matched based on all nonsignificant trial factors from the multivariable analysis. Odds ratio (OR) and 95% CI values were calculated for all factors observed in both complete and incomplete trials. A 2-sided P value threshold of .05 was used for statistical significance. This study was completed in accordance with the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) reporting guidelines [10].
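Two of the steps above, the Benjamini-Hochberg adjustment and the 3:1 matching, can be illustrated with a short sketch. It is offered under stated assumptions rather than as the authors' implementation: the text does not say which matching algorithm was used, so greedy nearest-neighbor matching without replacement is assumed here, and the function names and the precomputed propensity scores are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that each adjusted
    # value is no larger than the one ranked above it.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def match_controls(incomplete_scores, complete_scores, ratio=3):
    """Greedy nearest-neighbor matching on propensity scores.

    Assumption: each incomplete trial is matched, without replacement,
    to the `ratio` complete trials with the closest propensity score.
    Both arguments map a trial id to a precomputed score.
    """
    available = dict(complete_scores)
    matches = {}
    for trial_id, score in incomplete_scores.items():
        nearest = sorted(available, key=lambda c: abs(available[c] - score))
        chosen = nearest[:ratio]
        matches[trial_id] = chosen
        for control_id in chosen:
            del available[control_id]
    return matches
```

Greedy matching depends on the order in which incomplete trials are processed; optimal or caliper-based matching are common alternatives that may give different matched sets.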

Neither ethics approval nor informed consent was needed because this study analyzed publicly available data and did not include human subjects.

The majority of AI trials were categorized as being completed (485/536, 90.5%). Trials primarily implemented diagnostic AI interventions (200/536, 37.3%), tested AI interventions intended for health care providers (432/536, 80.6%), and were conducted in adult and older adult populations (397/536, 74.6%). Furthermore, the most prevalent clinical areas addressed in trials included oncology (93/536, 17.4%), cardiovascular system (71/536, 13.2%), and generic health (60/536, 11.2%). Geographically, the majority of trials were conducted in Europe (174/536, 32.5%), North America (145/536, 27.1%), or Asia (142/536, 26.5%), and many trials recruited relatively larger sample sizes (1000 participants; 131/536, 24.4%). We found a paucity of reporting information concerning study allocation (399/536, 74.4% unknown), intervention model (321/536, 60.1% unknown), masking procedures (59.9% unknown), and funding source (438/536, 81.7% unknown) and excluded these factors from the univariable analysis. Multimedia Appendix 2 reports a summary of the included studies.

From the univariable logistic regression model, the role of AI (prediction: OR 3.93, 95% CI 1.32-11.72; P<.001), study type (interventional: OR 0.52, 95% CI 0.29-0.92; P=.03), continent (Asia: OR 16.0, 95% CI 3.73-68.77; P<.001 and Europe: OR 4.11, 95% CI 1.90-9.25; P<.001), and sample size (OR 1.00, 95% CI 1.000-1.003; P<.001) were associated with completion of AI trials (Multimedia Appendix 3). All significant factors from univariable analysis were included in our multivariable model. From the multivariable logistic regression model, the role of AI (prediction: OR 4.55, 95% CI 1.44-14.36; P=.01), continent (Asia: OR 11.57, 95% CI 2.59-51.73; P=.001 and Europe: OR 4.44, 95% CI 1.91-10.3; P<.001), and sample size (OR 1.00, 95% CI 1.000232-1.00243; P=.02) were associated with completion of AI trials (Multimedia Appendix 3). The case-control, propensity-matched, multivariable logistic regression model found that continent (Europe: OR 2.85, 95% CI 1.14-7.10; P=.02) and sample size (OR 1.00, 95% CI 1.000202-1.00218; P=.02) were associated with completion of AI trials (Multimedia Appendix 3).
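An OR that rounds to 1.00 with a CI excluding 1 is easier to read once rescaled. The sketch below assumes that sample size entered the models as a continuous count, so that the reported OR is per additional participant (the text does not state the unit), and shows how a per-unit OR and its Wald CI relate to the underlying logistic coefficient. The function names are illustrative only.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Convert a logistic regression coefficient and its standard error
    into an odds ratio with a Wald confidence interval (95% by default)."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


def rescale_odds_ratio(or_per_unit, units):
    """Re-express a per-unit odds ratio over a larger increment.

    The log odds are linear in the covariate, so the OR for an increase
    of `units` equals the per-unit OR raised to the power `units`.
    """
    return math.exp(units * math.log(or_per_unit))


# Example: the lower CI bound reported for sample size in the
# multivariable model, re-expressed per 1000 additional participants
# (assuming the reported OR is per participant).
lower_per_1000 = rescale_odds_ratio(1.000232, 1000)
```

Under that assumption, a per-participant bound of 1.000232 corresponds to odds of completion that are more than 20% higher per 1000 additional participants, which is why the association can be statistically significant even though the per-unit OR prints as 1.00.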

Common study-reported reasons for trial incompletion include poor accrual (13/51, 25.5%), poor results at interim (3/51, 5.9%), administration (22/51, 43.1%), and other (12/51, 23.5%; Figure 1). Among incomplete trials due to poor administration, reported reasons included logistical difficulties (8/22, 36.4%), lack of funding (7/22, 31.8%), COVID-19 pandemic (4/22, 18.2%), departure of investigator (3/22, 13.6%), and lack of ethics approval (1/22, 4.5%).

**Figure 1.** Reasons for failure of clinical trials that evaluated artificial intelligence. PI: principal investigator.

Principal Findings

Our multivariable analysis found that clinical trials conducted in Europe were positively associated with trial completion compared with trials in North America. However, we note that trials conducted outside of North America are less likely to be registered on ClinicalTrials.gov [11], which may be due in part to different national and funding mandates, thus potentially resulting in geographic reporting bias. There remains a need for sound methodological design in AI model training to improve generalizability in validation cohorts [12]. We hypothesize that AI models trained on data from local, homogeneous cohorts may be more likely to complete but could fail to generalize to external cohorts with increased heterogeneity and a lack of training representation. The design of AI training architectures should consider representation from data-poor sources and due diligence in external validation before clinical implementation [13].

Our findings also highlight the importance of trial design: larger sample sizes correlated with successful trial completion, which may be unique to AI-based trials, where the performance of the tools is enhanced with larger datasets. We note that studies involving large sample sizes may lead to false positive discoveries, because P values shrink toward 0 as samples grow even for negligible effects [14], and may face administrative difficulties that can contribute to trial failure [15]. This finding aligns with recent reports on the need to consider appropriate sample size to ensure reliable estimates of AI intervention performance [16] as well as the association between sample size and the sensitivity of detecting differences in study outcomes [17,18]. The emergence of noninferiority trials evaluating the performance of AI interventions compared with standard-of-care controls should require consistent reporting and justification for sample size [19] and should consider the use and challenges of large-scale training and validation cohorts for AI models [20]. Randomized clinical trials (RCTs), including but not limited to trials evaluating AI interventions, should consider sample size with respect to type 1 error, power, effect size of clinical interest, and population variance, as well as justify the use of sample size calculations based on applicable assumptions [21].
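To make the last point concrete, the sketch below applies the standard normal-approximation formula for comparing 2 proportions, in which the required sample size per arm depends on the type 1 error, power, effect size, and outcome variance. It is a generic textbook calculation, not a method used in this study, and the function name and the example proportions are illustrative assumptions.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Participants needed per arm to detect a difference between two
    proportions with a 2-sided test (normal approximation, equal arms)."""
    if p_control == p_treatment:
        raise ValueError("the effect size must be nonzero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # type 1 error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - type 2 error
    # Sum of the binomial variances in the two arms.
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment               # effect size of interest
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical example: detecting a drop in an event rate from 50% to 40%.
example_n = n_per_arm(0.50, 0.40)
```

Halving the detectable difference roughly quadruples the required sample size, which illustrates why the choice of a clinically meaningful effect size should be justified before enrollment begins.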

Compared with a cross-sectional study of all trials reported in ClinicalTrials.gov [22], our results also demonstrated that administrative reasons made up a greater proportion of the reasons for failure among AI trials (45.1% vs 29.8%), and further research is required to understand the unique administrative challenges present in AI trials. There remain several key challenges that should be addressed to translate AI interventions in medicine, including the assessment of performance metrics in relation to clinical use, algorithmic biases that limit generalizability to new populations, and logistical difficulties in implementing AI systems into clinical workflows led by clinicians [8]. Broadly, the shift of AI interventions toward integration into compound systems with multiple inputs, outputs, and operators may present new administrative challenges in clinical workflows that should be addressed in future AI clinical trials.

Our study has several limitations. First, this study is limited by the lack of paired trials in the literature that could provide case-control comparisons between different trial design factors. Despite incorporating relevant trial design covariates and applying propensity-matching techniques, simplifying the intricacies of trial completion into its constituent design factors may overlook several other considerations that could significantly influence trial outcomes. Second, there was a paucity of reported information, with several study design factors not consistently reported [23]. Researchers should adhere to reporting guidelines wherever possible to enhance scientific transparency and accountability [24].

Conclusion

This study suggests that clinical trials that recruited larger sample sizes and were conducted in Europe, compared with North America, are associated with successful trial completion. The most common reasons for trial incompletion included poor participant accrual and administrative difficulties. Future research is needed to address the limitations of AI clinical trials associated with trial incompletion to improve the translation of AI into clinical practice.

Data Availability

The datasets generated during and/or analyzed during this study are available from the corresponding author on reasonable request.

Conflicts of Interest

None declared.

Multimedia Appendix 1

Inclusion and exclusion criteria and data extraction procedure.

DOCX File, 203 KB

Multimedia Appendix 2

Descriptive statistics of complete and incomplete interventional trials evaluating artificial intelligence.

XLSX File (Microsoft Excel File), 13 KB

Multimedia Appendix 3

Predictive statistics of complete and incomplete interventional trials evaluating artificial intelligence.

XLSX File (Microsoft Excel File), 12 KB

1. Meskó B, Görög M. A short guide for medical professionals in the era of artificial intelligence. NPJ Digit Med. 2020;3:126.
2. Alowais SA, Alghamdi SS, Alsuhebany N, Alqahtani T, Alshaya AI, Almohareb SN, et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ. 2023;23(1):689.
3. Tricoci P, Allen JM, Kramer JM, Califf RM, Smith SC. Scientific evidence underlying the ACC/AHA clinical practice guidelines. JAMA. 2009;301(8):831-841.
4. Plana D, Shung DL, Grimshaw AA, Saraf A, Sung JJY, Kann BH. Randomized clinical trials of machine learning interventions in health care: a systematic review. JAMA Netw Open. 2022;5(9):e2233946.
5. Rajpurkar P, Chen E, Banerjee O, Topol EJ. AI in health and medicine. Nat Med. 2022;28(1):31-38.
6. Fogel DB. Factors associated with clinical trials that fail and opportunities for improving the likelihood of success: a review. Contemp Clin Trials Commun. 2018;11:156-164.
7. Harrer S, Shah P, Antony B, Hu J. Artificial intelligence for clinical trial design. Trends Pharmacol Sci. 2019;40(8):577-591.
8. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17(1):195.
9. Pearce FJ, Cruz Rivera S, Liu X, Manna E, Denniston AK, Calvert MJ. The role of patient-reported outcome measures in trials of artificial intelligence health technologies: a systematic evaluation of ClinicalTrials.gov records (1997-2022). Lancet Digit Health. 2023;5(3):e160-e167.
10. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP, et al. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Lancet. 2007;370(9596):1453-1457.
11. Lindsley K, Fusco N, Li T, Scholten R, Hooft L. Clinical trial registration was associated with lower risk of bias compared with non-registered trials among trials included in systematic reviews. J Clin Epidemiol. 2022;145:164-173.
12. Eche T, Schwartz LH, Mokrane F, Dercle L. Toward generalizability in the deployment of artificial intelligence in radiology: role of computation stress testing to overcome underspecification. Radiol Artif Intell. 2021;3(6):e210097.
13. Celi LA, Cellini J, Charpignon M, Dee EC, Dernoncourt F, Eber R, et al. Sources of bias in artificial intelligence that perpetuate healthcare disparities-a global review. PLOS Digit Health. 2022;1(3):e0000022.
14. Lin M, Lucas HC, Shmueli G. Too big to fail: larger samples and false discoveries. Robert H. Smith School Research Paper No. RHS 06-068. 2011:37.
15. Reith C, Landray M, Devereaux P, Bosch J, Granger CB, Baigent C, et al. Randomized clinical trials--removing unnecessary obstacles. N Engl J Med. 2013;369(11):1061-1065.
16. Rajput D, Wang W, Chen C. Evaluation of a decided sample size in machine learning applications. BMC Bioinformatics. 2023;24(1):48.
17. Gieraerts C, Dangis A, Janssen L, Demeyere A, de Bruecker Y, de Brucker N, et al. Prognostic value and reproducibility of AI-assisted analysis of lung involvement in COVID-19 on low-dose submillisievert chest CT: sample size implications for clinical trials. Radiol Cardiothorac Imaging. 2020;2(5):e200441.
18. Liu L, Parker KJ, Jung S. Design and analysis methods for trials with AI-based diagnostic devices for breast cancer. J Pers Med. 2021;11(11):1150.
19. Rehal S, Morris TP, Fielding K, Carpenter JR, Phillips PPJ. Non-inferiority trials: are they inferior? A systematic review of reporting in major medical journals. BMJ Open. 2016;6(10):e012594.
20. L'Heureux A, Grolinger K, Elyamany HF, Capretz MAM. Machine learning with big data: challenges and approaches. IEEE Access. 2017;5:7776-7797.
21. Noordzij M, Tripepi G, Dekker FW, Zoccali C, Tanck MW, Jager KJ. Sample size calculations: basic principles and common pitfalls. Nephrol Dial Transplant. 2010;25(5):1388-1393.
22. Williams RJ, Tse T, DiPiazza K, Zarin DA. Terminated trials in the clinicaltrials.gov results database: evaluation of availability of primary outcome data and reasons for termination. PLoS One. 2015;10(5):e0127242.
23. Anderson ML, Chiswell K, Peterson ED, Tasneem A, Topping J, Califf RM. Compliance with results reporting at ClinicalTrials.gov. N Engl J Med. 2015;372(11):1031-1039.
24. Kwong JCC, Khondker A, Lajkosz K, McDermott MBA, Frigola XB, McCradden MD, et al. APPRAISE-AI tool for quantitative evaluation of AI studies for clinical decision support. JAMA Netw Open. 2023;6(9):e2335377.

AI: artificial intelligence

OR: odds ratio

RCT: randomized clinical trial

STROBE: Strengthening the Reporting of Observational Studies in Epidemiology

Edited by A Mavragani; submitted 19.03.24; peer-reviewed by A Wani, J Mistry; comments to author 27.04.24; revised version received 02.05.24; accepted 11.07.24; published 23.09.24.

©David Chen, Christian Cao, Robert Kloosterman, Rod Parsa, Srinivas Raman. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 23.09.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
