Abstract
Background:
It has been shown that metrics recorded for instrument kinematics during robotic surgery can predict urinary continence outcomes.
Objective:
To evaluate the contributions of patient and treatment factors, surgeon efficiency metrics, and surgeon technical skill scores, especially for vesicourethral anastomosis (VUA), to models predicting urinary continence recovery following robot-assisted radical prostatectomy (RARP).
Design, setting, and participants:
Automated performance metrics (APMs; instrument kinematics and system events) and patient data were collected for RARPs performed from July 2016 to December 2017. Robotic Anastomosis Competency Evaluation (RACE) scores during VUA were manually evaluated. Training datasets included: (1) patient factors; (2) summarized APMs (reported over RARP steps); (3) detailed APMs (reported over suturing phases of VUA); and (4) technical skills (RACE). Feature selection was used to compress the dimensionality of the inputs.
Outcome measurements and statistical analysis:
The study outcome was urinary continence recovery, defined as use of 0 or 1 safety pads per day. Two predictive models (Cox proportional hazards [CoxPH] and deep learning survival analysis [DeepSurv]) were used.
Results and limitations:
Of 115 patients undergoing RARP, 89 (77.4%) recovered urinary continence, and the median recovery time was 166 d (interquartile range [IQR] 82–337). VUAs were performed by 23 surgeons. The median RACE score was 28/30 (IQR 27–29). Among the individual datasets, technical skills (RACE) produced the best models (C index: CoxPH 0.695, DeepSurv 0.708). Among summary APMs, posterior/anterior VUA yielded superior model performance over other RARP steps (C index 0.543–0.592). Among detailed APMs, metrics for needle driving yielded top-performing models (C index 0.614–0.655) over other suturing phases. DeepSurv models consistently outperformed CoxPH; both approaches performed best when provided with all the datasets. Limitations include feature selection, which may have excluded relevant information but prevented overfitting.
Conclusions:
Technical skills and “needle driving” APMs during VUA were most contributory. The best-performing model used synergistic data from all datasets.
Patient summary:
One of the steps in robot-assisted surgical removal of the prostate involves joining the bladder to the urethra. Detailed information on surgeon performance for this step improved the accuracy of predicting recovery of urinary continence among men undergoing this operation for prostate cancer.
Keywords: Prostatectomy, Urinary incontinence, Survival analysis, Robotics, Machine learning, Artificial intelligence
1. Introduction
Increasing data confirm that surgeon performance impacts patient outcomes after robot-assisted radical prostatectomy (RARP) [1] and procedures outside urology [2]. How we assess surgeon performance is attracting increasing interest for both training and high-stakes credentialing purposes. Surgeon performance is currently assessed using three modalities: manual measures, automated measures, and patient outcomes.
Several manually observed assessment tools for technical skills, often used by expert surgeons or reviewers who have undergone standardized training, have been developed and validated. The original assessment tools measure global psychomotor skills [3]. Superior performance, as measured by Global Evaluative Assessment of Robotic Skills (GEARS), has been linked to urinary continence recovery after RARP [4]. Automated assessments exist as computer-generated motion-tracking evaluations of surgeon efficiency, as measured by automated performance metrics (APMs). These metrics, while not a direct measure of technical skill, have been linked to perioperative and long-term patient outcomes after RARP [1,5]. Finally, patient outcomes are arguably the ultimate measure of surgeon performance. For patients with prostate cancer, an important functional outcome is urinary continence recovery after RARP.
Alongside the growth of robot-assisted surgeries and the generation of big data has been the application of artificial intelligence in surgical assessment and training [6]. Our prior work using machine learning for survival analysis showed that select APMs, particularly for vesicourethral anastomosis (VUA; the key reconstructive step in RARP), are associated with urinary continence recovery prediction after RARP [1]. Whether VUA performance improved outcomes or whether skills demonstrated during VUA were particularly reflective of overall surgeon skill or proficiency remained to be elucidated. Regardless, we found that performance during VUA was worthy of further evaluation because of its connection with continence recovery. In this study, we analyzed surgeon performance during VUA through granular evaluation of manual and automated assessments of technical skill. As opposed to measuring global robotic skills, we used a procedure-specific tool for assessing psychomotor skills, the Robotic Anastomosis Competency Evaluation (RACE), for VUA during RARP [7]. While prior APMs have summarized kinematic and events data over whole steps of RARP, they have not provided granularity at the level of individual surgical maneuvers. Here we report APMs for individual stitch and substitch phases of suturing during VUA.
Our study evaluates the association between patient and treatment factors, automated assessments of surgeon efficiency (APMs) during the whole VUA step and individual stitches, and a VUA-specific manual assessment of surgeon technical skill (RACE), as they contribute to predicting continence recovery time after RARP. To date, this is the most granular and comprehensive array of surgeon-focused datasets we have used to predict post-RARP continence recovery.
2. Patients and methods
2.1. Study overview
We used a combination of four different datasets to predict continence recovery after RARP: (1) a set of 16 patient and treatment features (Table 1); (2) summary APMs at surgeon level (41 APMs during each of the 12 standardized RARP steps [8]); (3) detailed APMs at surgeon level (41 APMs summarized over each substitch phase for suturing during VUA); and (4) surgeon technical skills (RACE scores) measuring surgeon VUA performance.
Table 1 –
Characteristic | Total cohort (N = 115) | Continent at 3 mo (N = 34) | Continent during FU (N = 89) | Not continent during FU (N = 26) |
---|---|---|---|---|
Patient characteristics | ||||
Median age, yr (IQR) | 64 (60–69) | 63 (60–67) | 63 (60–68) | 67 (63–74) |
Median BMI, kg/m2 (IQR) | 28.8 (25.9–32.5) | 28.8 (27.1–31.8) | 28.9 (26.2–32.5) | 27.2 (25.3–32.7) |
Median ASA score (IQR) | 3 (2–3) | 3 (2–3) | 3 (2–3) | 3 (2–3) |
Median PSA, ng/ml (IQR) | 6.8 (5.2–10.3) | 6.6 (4.9–8.2) | 6.8 (5.1–9.9) | 7.1 (5.6–11.1) |
Preoperative Gleason score, % (n/N) | ||||
≤6 | 15.7 (18/115) | 14.7 (5/34) | 15.7 (14/89) | 15.4 (4/26) |
7 | 67.0 (77/115) | 70.6 (24/34) | 66.3 (59/89) | 69.2 (18/26) |
≥8 | 17.4 (20/115) | 14.7 (5/34) | 18.0 (16/89) | 15.4 (4/26) |
Postoperative Gleason score, % (n/N) | ||||
≤6 | 5.2 (6/115) | 5.9 (2/34) | 5.6 (5/89) | 3.8 (1/26) |
7 | 77.4 (89/115) | 79.4 (27/34) | 79.8 (71/89) | 69.2 (18/26) |
≥8 | 17.4 (20/115) | 14.7 (5/34) | 14.6 (13/89) | 26.9 (7/26) |
Pathologic tumor stage, % (n/N) | ||||
pT2 | 44.3 (51/115) | 41.2 (14/34) | 46.1 (41/89) | 38.5 (10/26) |
≥pT3 | 55.7 (64/115) | 58.8 (20/34) | 53.9 (48/89) | 61.5 (16/26) |
Extracapsular extension, % (n/N) | 55.7 (64/115) | 58.8 (20/34) | 53.9 (48/89) | 61.5 (16/26) |
Median prostate volume, g (IQR) | 48.0 (35.0–63.0) | 45.0 (33.5–62.5) | 48.0 (35.5–62.0) | 50.0 (35.0–70.5) |
Median prostate lobe, % (n/N) | 13.9 (16/115) | 8.8 (3/34) | 12.4 (11/89) | 19.2 (5/26) |
Treatment characteristics | ||||
Median surgery time, min (IQR) | 237 (208–262) | 245 (212–267) | 238 (211–264) | 236 (190–258) |
Median EBL, ml (IQR) | 100 (100–150) | 100 (100–150) | 100 (100–150) | 100 (75–200) |
Nerve-sparing surgery, % (n/N) | 92.2 (106/115) | 97.1 (33/34) | 92.1 (82/89) | 92.3 (24/26) |
Urethropexy, % (n/N) | 58.3 (67/115) | 50.0 (17/34) | 53.9 (48/89) | 73.1 (19/26) |
Pelvic lymph node dissection % (n/N) | ||||
Standard | 63.5 (73/115) | 67.6 (23/34) | 61.8 (55/89) | 69.2 (18/26) |
Extended | 36.5 (42/115) | 32.4 (11/34) | 38.2 (34/89) | 30.8 (8/26) |
Radiation after surgery, % (n/N) | 21.7 (25/115) | 26.5 (9/34) | 22.5 (20/89) | 19.2 (5/26) |
ASA = American Society of Anesthesiologists; BMI = body mass index; EBL = estimated blood loss; FU = follow-up; IQR = interquartile range; PSA = prostate-specific antigen.
2.2. Datasets
Consecutive RARP cases from July 2016 to December 2017 in our single institution for which there were complete data for the four datasets were included in the study. All RARPs were performed using the anterior, non–Retzius-sparing approach. The operating surgeon was prospectively recorded for each step of the procedure. Patient data, including baseline characteristics and perioperative data, were prospectively collected according to an institutional review board–approved protocol. Urinary continence recovery was defined as use of no pads or one safety pad that was mostly dry [9].
APMs, as previously described [10-13], were derived from systems data obtained directly from the da Vinci robot via a custom data recorder (Intuitive Surgical, Sunnyvale, CA, USA). In this study, 41 previously validated APMs during all standardized RARP steps (summary APMs) were used for analysis. In addition, the VUA step was further deconstructed into its individual stitches and substitch phases of suturing: needle handling/targeting, needle driving, and suture cinching (Fig. 1), and APMs were reported for each of these elements (detailed APMs).
RACE is a previously validated assessment tool used to rate suturing technical skills during VUA [7]. Total scores were based on six different domains: needle positioning, needle entry, needle driving/tissue trauma, suture placement, tissue approximation, and knot tying. After standardized training by the senior author (A.J.H.), three raters reviewed ten random videos of VUA procedures from our cohort to assess interobserver variability (intraclass correlation [ICC]). The remaining VUAs were rated by a single trained rater. All raters were blinded to identification of the operating surgeon.
2.3. Models and inputs
Cox proportional hazards (CoxPH) and deep learning survival analysis (DeepSurv) models were used to predict urinary continence after RARP. DeepSurv is a deep feed-forward neural network that predicts the effects of a patient’s covariates on their hazard ratio for an event using a nonlinear network parameterization θ. DeepSurv generalizes the CoxPH model. Both models were trained using a combination of the four datasets to predict continence recovery and to measure and contrast the insights of each of the datasets.
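For orientation, the relationship between the two models can be written out as follows. This is the standard formulation from the DeepSurv literature [16], not a description of our exact implementation. Apart from θ, the notation is assumed for illustration: β for the CoxPH coefficients, ĥ_θ(x) for the network output (log-risk), T_i and E_i for the observed time and event indicator (here, continence recovery) of patient i, N_{E=1} for the number of observed events, and R(T_i) for the set of patients still at risk at time T_i. Any regularization term is omitted.

```latex
\begin{align}
h(t \mid x) &= h_0(t)\,\exp\!\big(\beta^{\top} x\big) && \text{(CoxPH)} \\
h(t \mid x) &= h_0(t)\,\exp\!\big(\hat{h}_\theta(x)\big) && \text{(DeepSurv)} \\
\ell(\theta) &= -\frac{1}{N_{E=1}} \sum_{i:\,E_i = 1}
  \Big( \hat{h}_\theta(x_i) - \log \sum_{j \in \mathcal{R}(T_i)} \exp\!\big(\hat{h}_\theta(x_j)\big) \Big)
\end{align}
```

Setting ĥ_θ(x) = βᵀx recovers the CoxPH model, which is the sense in which DeepSurv generalizes it; the third line is the average negative log partial likelihood minimized during training.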
2.4. Data preprocessing
Data were preprocessed by imputing missing values for any feature using its median value, bucketing patient/treatment features by quartiles, and standardizing each of the summary and detailed APMs. We used our dataset of 115 RARP cases for training, validation, and testing of the models. The dataset was split into five folds for cross-validation: each model was trained on three-fifths of the data, validated on another one-fifth, and tested on the remaining one-fifth.
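The preprocessing steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions rather than the study code: the function names are hypothetical, the quartile cut points use the default method of `statistics.quantiles`, standardization uses the population standard deviation, and the fold rotation (test fold k, validation fold k + 1) is one possible way to realize the 3/1/1 split.

```python
import random
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def quartile_bucket(values):
    """Map each value to a quartile index 0..3 using the sample quartile cut points."""
    cuts = statistics.quantiles(values, n=4)  # three cut points: Q1, Q2, Q3
    return [sum(v > cut for cut in cuts) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]


def five_fold_splits(n_cases, seed=0):
    """Yield (train, validation, test) index lists covering 3/5, 1/5, and 1/5 of the cases."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::5] for i in range(5)]
    for k in range(5):
        val_k = (k + 1) % 5  # assumption: the validation fold is the one after the test fold
        train = [i for j in range(5) if j not in (k, val_k) for i in folds[j]]
        yield train, folds[val_k], folds[k]
```

With 115 cases, each of the five rotations trains on 69 cases and holds out 23 each for validation and testing. Whether imputation and scaling statistics were computed on the training portion only is not stated in the text; in this sketch the caller decides which values are passed in.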
2.5. Model performance and feature selection
We used the concordance index (C index) measured for the testing phase in cross-validation to measure prediction performance. A feature selection step compressed the inputs and prevented overfitting from the high dimensionality of the surgical APMs [14,15]. To compress the dimensions of the summary APMs, we selected a subset of RARP steps that yielded the top-performing models according to the C index. Similarly, for detailed APMs, we measured the C index for each substitch phase (needle handling/targeting, needle driving, and suture cinching; Fig. 1) to select the most discriminatory variables and substitch within the feature set.
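For readers unfamiliar with the C index, the sketch below shows how Harrell's concordance index can be computed from follow-up times, event indicators, and predicted risk scores. It is an illustrative implementation, not the routine used in the study; the function name is hypothetical, tied follow-up times are simply skipped, and a higher risk score is assumed to mean earlier continence recovery.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs that the risk scores order correctly.

    A pair (i, j) is comparable when patient i had the event (recovery) and
    did so before patient j's recovery or censoring time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier recovery was given the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the C indices in the Results should be read.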
2.6. Feature ranking
From the top-performing model with the overall highest C index, we ranked the variables (APMs, patient/treatment factors, and RACE scores) according to their importance for predicting urinary continence. To calculate the score for each variable x, we randomly permuted the values of the x column and measured the resulting C index. The variable importance is defined as the absolute difference in C index before and after permutation. This procedure breaks the relationship between the feature and the target; thus, the drop in the model score indicates how much the model depends on the feature.
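The permutation procedure can be illustrated with a short, model-agnostic sketch. The names here are hypothetical: `score_fn` stands for any callable that takes a list of per-patient feature dictionaries and returns the C index of an already fitted model on those rows. A single permutation per feature is used; repeating and averaging over several permutations is a common refinement that the text does not specify.

```python
import random


def permutation_importance(score_fn, rows, feature_names, seed=0):
    """Absolute change in model score after shuffling one feature column at a time.

    rows          -- list of dicts, one per patient, mapping feature name -> value
    score_fn      -- callable(rows) -> float, e.g. the C index of a fitted model
    feature_names -- the columns to evaluate
    """
    rng = random.Random(seed)
    baseline = score_fn(rows)
    importance = {}
    for name in feature_names:
        shuffled = [row[name] for row in rows]
        rng.shuffle(shuffled)  # break the link between this feature and the outcome
        permuted = [dict(row, **{name: value}) for row, value in zip(rows, shuffled)]
        importance[name] = abs(baseline - score_fn(permuted))
    return importance
```

Features can then be ranked by sorting the returned dictionary in descending order of importance.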
2.7. Training procedure and hyperparameters
CoxPH analysis was performed using the Python lifelines package. DeepSurv models were constructed using the PyCox 0.2.1 package for time-to-event prediction with PyTorch. For DeepSurv, we used a two-layer feed-forward network to parameterize the hazard function, with the number of neurons in each layer proportional to three times the input dimension. A dropout probability of 0.1 was used to reduce overfitting. Optimization was carried out in mini-batches of size 32 for 100 epochs with the Adam optimizer and a learning rate of 1e−2.
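The hyperparameters above can be collected in a small configuration object, sketched below without any deep learning dependency. The class and method names are hypothetical, the hidden width is assumed to be exactly three times the input dimension, and the input dimension used in the usage note is an arbitrary example rather than the number of features in our models.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepSurvConfig:
    n_inputs: int                # number of selected input features
    width_factor: int = 3        # assumption: hidden width = 3 x input dimension
    n_hidden_layers: int = 2
    dropout: float = 0.1
    batch_size: int = 32
    epochs: int = 100
    learning_rate: float = 1e-2  # used with the Adam optimizer

    def layer_widths(self):
        """Hidden-layer widths followed by the single log-risk output unit."""
        return [self.width_factor * self.n_inputs] * self.n_hidden_layers + [1]

    def gradient_steps(self, n_train):
        """Total optimizer updates for a training fold of n_train cases."""
        return math.ceil(n_train / self.batch_size) * self.epochs
```

For example, `DeepSurvConfig(n_inputs=10).layer_widths()` gives `[30, 30, 1]`, and a training fold of 69 cases (three-fifths of 115) yields three mini-batches per epoch, or 300 updates over 100 epochs.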
3. Results
3.1. Demographics
A total of 115 RARP cases were included in the study, representing VUAs performed by 23 surgeons (residents, fellows, and faculty) with median RARP experience of 120 cases (interquartile range [IQR] 38–375). Eighty-nine patients (77.4%) achieved urinary continence and the median recovery time was 166 d (IQR 82–337). Median follow-up was 371 d (IQR 320–553).
The median RACE score for VUA was 28 (IQR 27–29). Interobserver variability among three independent raters for ten initial cases was moderate to high (ICC 0.78, 95% confidence interval 0.37–0.94; p = 0.003).
3.2. CoxPH performance
When individual datasets were used, technical skills evaluation (RACE) achieved the best-performing model (C index 0.695; Fig. 2). Summary APMs, detailed APMs, and patient/surgical factors resulted in comparably lower performance (C index 0.493, 0.562, and 0.527, respectively).
Use of all datasets resulted in a C index of 0.592, while the overall best-performing CoxPH model combined patient factors, technical skills scores, and detailed APMs (C index 0.662; Fig. 2).
3.3. Feature selection performance
Among the summary APMs, the posterior and anterior VUA steps yielded the top-performing model (Fig. 3A). Among the detailed APMs for VUA, APMs for the needle-driving task yielded the top-performing model (Fig. 3B).
3.4. CoxPH performance using selected data
Reducing the features to only summary APMs for VUA and only the needle-driving task for suturing among detailed APMs improved the CoxPH model performance (Fig. 2). Selected summary APMs alone resulted in an improved C index of 0.546 and selected detailed APMs improved it to 0.609. This selection of summary and detailed APMs also substantially improved the performance of models using a combination of the datasets. Notably, use of all four datasets resulted in the best-performing model, with a C index of 0.713. Of note, when both the selected summary and detailed APMs were used, the VUA was theoretically covered twice, once as metrics summarized across the whole VUA and again as metrics summarizing each substitch phase of suturing.
3.5. DeepSurv performance
When the selected datasets were used, DeepSurv had better performance than the CoxPH model (Fig. 2). Among the standalone datasets, RACE score yielded the best-performing model with DeepSurv (C index 0.708). The best-performing model with DeepSurv utilized all four datasets, with a C index of 0.782.
3.6. Feature ranking
RACE scores accounted for three of the five top-ranking metrics (needle positioning, tissue approximation, and needle entry angle; Fig. 4). Patient age was the only patient/treatment factor in the top ten. Detailed APMs for the needle-driving phase of suturing accounted for four of the ten top-ranking features. Finally, summary APMs for the VUA steps accounted for two of the top ten features.
4. Discussion
In this study we used our most comprehensive datasets of patient factors, summary and detailed APMs, and scores for technical skills to predict urinary continence, a key functional outcome after RARP. The study underscores the ability of DeepSurv models to optimize results while processing large datasets and to provide insights into the relative contributions of different datasets.
While this is not our first attempt to predict urinary continence recovery after RARP, the addition of detailed metrics (down to the level of suturing substitch maneuvers) and evaluation of surgeon performance during VUA has yielded the best performance to date. While our prior work indicated that metrics during VUA were particularly relevant in predicting urinary continence recovery [1], our findings confirm that the additional attention to new performance datasets was a correct direction to take.
Feature selection improves performance by reducing the dimensionality of the data and preventing overfitting of the models. Of the summary APMs, the best-performing metrics were from VUA, a procedural step that logically affects continence. Of the detailed APMs, the best-performing metrics came from the needle-driving phase of suturing, highlighting an emphasis on instrument kinematics while the needle is in direct contact with tissue. Focusing on these featured metrics boosted the performance of the models using summary APMs and detailed APMs individually, as well as the models using a combination of these datasets.
In agreement with our prior work [1], DeepSurv performed better than CoxPH with the selected dataset for almost every combination of data. This finding aligns well with theoretical analysis in the literature [16], as DeepSurv extends the linear risk function used in Cox models into a nonlinear risk function parameterized by a deep neural network. This additional expressivity allows DeepSurv to better capture nonlinear interactions between covariates, which is crucial in settings in which data from multiple sources are combined.
The top-ranked feature for urinary continence prediction was the RACE score for needle positioning. This may appear to be a discrepancy with the feature selection process for detailed APMs, which identified needle-driving APMs as more robust, and may suggest inconsistency of our results (needle-positioning vs -driving phases of suturing). However, it may also signify that manual versus automated assessments of surgeon performance fundamentally rely on different data. Manual skills assessment uniquely allows for evaluation of instrument, needle, and tissue interactions. RACE scoring for needle positioning focuses on spatial placement (how the needle is held) irrespective of kinematics (eg, path length, velocity). On the contrary, automated assessments track the movement of instruments. Needle-driving APMs comprise kinematic metrics (path length, wrist articulation) while a needle is in contact with tissue, and may be more indicative of smoothness or, conversely, direct tissue trauma. Nevertheless, the combined datasets generally improved the performance of the models, suggesting that the individual datasets may each provide some unique characteristics that contribute synergistically to the overall model performance and reiterating the notion that surgeon technique affects continence recovery.
Compared to features from the other datasets, RACE metrics performed well, suggesting that the value of manual assessments of a surgeon’s technical skills cannot yet be replaced by APMs, which are largely measures of surgeon efficiency. Future work should attempt to confirm these results. If they are confirmed, more investment to streamline and potentially automate evaluation of technical skills, which is currently resource-intensive and subject to human bias and error, would be warranted. Scores for technical skills could then become an easily accessible tool in training environments.
This study is not without limitations. There was a relatively small cohort of patients for whom complete summary and detailed APMs, RACE skills assessment, and patient factors were available. It is challenging and resource-intensive to collect comprehensive and nuanced surgeon and patient data. Furthermore, the majority of the cases had only one blinded rater. Manual assessment is time-consuming and potentially inconsistent; therefore, we initially had two additional raters evaluate a random sample of cases, which demonstrated relatively high ICC with the primary rater. Had the ICC been poor, all cases would have been submitted to additional raters. In addition, owing to the sheer number of features per sample, we performed feature selection to prevent overfitting and avoid the problem of high dimensionality. However, this potentially leaves out other relevant information, which is at this time resource-intensive to obtain. In future work, we aim to expand DeepSurv to model APM data in an even more granular approach (focusing on other surgical maneuvers such as those during tissue dissection and beyond VUA), taking into account the kinematics and trajectories of the surgical manipulators. This could better leverage our domain knowledge and use all aspects of the given data, which could potentially improve performance. APMs, while reflecting surgeon performance, probably serve as surrogates of surgeon skill; it is unlikely that APMs in their current form speak to a specific skill that directly impacts patient outcomes. Finally, our results are limited to the scope of this project (urinary continence recovery after RARP), and thus the findings cannot be generalized to other procedures or outcomes.
5. Conclusions
APMs that evaluate the most granular surgical movements (ie, substitch maneuvers) and technical skills appear to aid in prediction of urinary continence recovery after RARP. Evaluation of technical skills in suturing appears to be particularly important in anticipating this key patient outcome of functional recovery.
Acknowledgments:
We would like to acknowledge Anthony Jarc (Intuitive Surgical Clinical Research, Norcross, GA, USA) for processing of automated performance metrics.
Funding/Support and role of the sponsor:
This study was supported in part by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health under award number K23EB026493 and an Intuitive Surgical clinical research grant. Intuitive Surgical played a role in collection and management of the data and in manuscript approval.
Footnotes
Financial disclosures: Andrew J. Hung certifies that all conflicts of interest, including specific financial interests and relationships and affiliations relevant to the subject matter or materials discussed in the manuscript (eg, employment/affiliation, grants or funding, consultancies, honoraria, stock ownership or options, expert testimony, royalties, or patents filed, received, or pending), are the following: Andrew J. Hung has received consultant fees from Quantgene, Mimic Technologies, and Johnson & Johnson.
References
- [1] Hung AJ, Chen J, Ghodoussipour S, et al. A deep-learning model using automated performance metrics and clinical features to predict urinary continence recovery after robot-assisted radical prostatectomy. BJU Int 2019;124:487–95.
- [2] Birkmeyer JD, Finks JF, O’Reilly A, et al. Surgical skill and complication rates after bariatric surgery. N Engl J Med 2013;369:1434–42.
- [3] Goh AC, Goldfarb DW, Sander JC, Miles BJ, Dunkin BJ. Global evaluative assessment of robotic skills: validation of a clinical assessment tool to measure robotic surgical skills. J Urol 2012;187:247–52.
- [4] Goldenberg MG, Goldenberg L, Grantcharov TP. Surgeon performance predicts early continence after robot-assisted radical prostatectomy. J Endourol 2017;31:858–63.
- [5] Hung AJ, Chen J, Che Z, et al. Utilizing machine learning and automated performance metrics to evaluate robot-assisted radical prostatectomy performance and predict outcomes. J Endourol 2018;32:438–44.
- [6] Ma R, Vanstrum EB, Lee R, Chen J, Hung AJ. Machine learning in the optimization of robotics in the operative field. Curr Opin Urol 2020;30:808–16.
- [7] Raza SJ, Field E, Jay C, et al. Surgical competency for urethrovesical anastomosis during robot-assisted radical prostatectomy: development and validation of the robotic anastomosis competency evaluation. Urology 2015;85:27–32.
- [8] Hung AJ, Bottyan T, Clifford TG, et al. Structured learning for robotic surgery utilizing a proficiency score: a pilot study. World J Urol 2017;35:27–34.
- [9] Patel VR, Sivaraman A, Coelho RF, et al. Pentafecta: a new concept for reporting outcomes of robot-assisted laparoscopic radical prostatectomy. Eur Urol 2011;59:702–7.
- [10] Hung AJ, Chen J, Jarc A, Hatcher D, Djaladat H, Gill IS. Development and validation of objective performance metrics for robot-assisted radical prostatectomy: a pilot study. J Urol 2018;199:296–304.
- [11] Hung AJ, Oh PJ, Chen J, et al. Experts vs super-experts: differences in automated performance metrics and clinical outcomes for robot-assisted radical prostatectomy. BJU Int 2019;123:861–8.
- [12] Chen J, Oh PJ, Cheng N, et al. Use of automated performance metrics to measure surgeon performance during robotic vesicourethral anastomosis and methodical development of a training tutorial. J Urol 2018;200:895–902.
- [13] Hung AJ, Chen J, Gill IS. Automated performance metrics and machine learning algorithms to measure surgeon performance and anticipate clinical outcomes in robotic surgery. JAMA Surg 2018;153:770–1.
- [14] Fan J, Feng Y, Wu Y. High-dimensional variable selection for Cox’s proportional hazards model. IMS Collections 2010;6:70–86.
- [15] Chen Z, Pang M, Zhao Z, et al. Feature selection may improve deep neural networks for the bioinformatics problems. Bioinformatics 2020;35:1542–52.
- [16] Katzman JL, Shaham U, Cloninger A, Bates J, Jiang T, Kluger Y. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol 2018;18:24.