Abstract
Objectives
Manually collected suturing technical skill scores are strong predictors of continence recovery after robotic radical prostatectomy. Herein, we automate suturing technical skill scoring through computer vision (CV) methods as a scalable means of providing feedback.
Methods
Twenty-two surgeons completed a suturing exercise three times on the Mimic™ FlexVR simulator. Instrument kinematic data (XYZ coordinates and pose of each instrument) were captured at 30 Hz. After standardized training, three human raters manually segmented the suturing task videos into four sub-stitch phases (Needle handling, Needle targeting, Needle driving, Needle withdrawal) and labeled the corresponding technical skill domains (Needle positioning, Needle entry angle, Needle driving, and Needle withdrawal). The CV framework extracted RGB features and optical flow frames using a pre-trained AlexNet. Additional CV strategies, including auxiliary supervision (using kinematic data during training only) and attention mechanisms, were implemented to improve performance.
Results
This study included data from 15 expert surgeons (median caseload 300 [IQR 165–750]) and 7 training surgeons (median caseload 0 [IQR 0–8]). In all, 226 virtual sutures were captured. Automated assessment of Needle positioning performed best with the simplest approach (one-second video; AUC 0.749). The remaining skill domains improved with auxiliary supervision and attention mechanisms when each was deployed separately (AUC 0.604–0.794). Combining all techniques produced the best performance, particularly for Needle driving and Needle withdrawal (AUC 0.959 and 0.879, respectively).
Conclusions
This study demonstrated the best performance of automated suturing technical skills assessment to date using advanced CV techniques. Future work will determine if a “human in the loop” is necessary to verify surgeon evaluations.
Keywords: machine learning, artificial intelligence, surgeon performance, technical skill, robotic surgery
INTRODUCTION
As computer vision techniques have continued to advance, so has their breadth of application in the healthcare field. Thus far, image-based processing has successfully achieved feature/object detection (e.g., robotic instruments), task segmentation (e.g., procedural steps), and action recognition (e.g., suturing gestures), especially in robotic and laparoscopic surgery, where medical images and videos are readily accessible [1–4].
The latest machine learning-based computer vision endeavors have focused on the automation of technical skill assessment, since surgeon performance has been shown to affect clinical patient outcomes [5, 6]. In one instance, manually rated suturing technical skill scores were the strongest predictors of patient continence recovery following robot-assisted radical prostatectomy compared with other objective measures of surgeon performance [7]. Ultimately, the value of skill assessments lies not only in their ability to predict surgical outcomes, but also in their function as formative feedback for training surgeons. The need to automate skills assessment is readily apparent, especially since manual assessments by expert raters are subjective, time consuming, and unscalable [8, 9].
Preliminary attempts at automating skill assessment have had favorable results. Machine learning methods automating suturing technical skill assessment using instrument kinematic (motion-tracking) data as the sole input have achieved an AUC of 0.766 [10]. Other computer vision techniques have garnered greater success through meticulous feature extraction of key items in each image frame [11]. For example, researchers have automated assessments of general robotic skills and of the thoroughness of lymph node dissections with greater than 80% accuracy by extracting instrument pose data and identifying nerves/vessels, respectively [12, 13].
In the present study, we sought to optimize computer vision methods to automate suturing technical skill evaluation, utilizing a previously validated suturing assessment tool [14]. Compared with previous video-based automation tasks, our challenge was capturing fine-grained details during suturing to distinguish ideal from non-ideal technical skill. We built on established deep-learning computer vision approaches and augmented them with two novel strategies in the automated pipeline, auxiliary supervision (using instrument kinematic data only during the training phase) and attention mechanisms, to improve model performance.
METHODS
Study setup
Data was collected while surgeons performed the Basic Suture Sponge exercise on the Mimic™ FlexVR robotic simulator (Mimic Technologies, Seattle, WA) (Fig. 1). The simulation platform provided a reproducible environment for each surgeon and allowed collection of the exercise video recordings and the corresponding instrument/camera kinematic data (XYZ coordinates and pose estimations [roll/pitch/yaw] captured at 30 Hz).
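As a point of reference, the sketch below shows one way a single 30 Hz kinematic sample could be represented in code; the field and instrument names are hypothetical assumptions for illustration and do not reflect the simulator's actual data export format.

```python
# Hypothetical record for one 30 Hz kinematic sample; field names are illustrative
# and not taken from the Mimic FlexVR data export.
from dataclasses import dataclass

@dataclass
class KinematicSample:
    frame_index: int   # index into the 30 Hz video/kinematic stream
    instrument: str    # e.g., "left_instrument", "right_instrument", or "camera"
    x: float           # XYZ position coordinates
    y: float
    z: float
    roll: float        # pose estimation angles
    pitch: float
    yaw: float

sample = KinematicSample(frame_index=0, instrument="right_instrument",
                         x=1.2, y=-0.4, z=3.1, roll=0.05, pitch=0.12, yaw=-0.30)
```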
Video Segmentation and Data Labeling
Videos of all suturing attempts were broken down into four sub-stitch phases (Needle handling, Needle targeting, Needle driving, Needle withdrawal), with technical skill domains corresponding to the same phases: Needle positioning, Needle entry angle, Needle driving, and Needle withdrawal (Fig. 2). Each video recording was first temporally segmented into the four sub-stitch phases, and each of these video segments was then annotated with a binary technical score (Ideal or Non-Ideal). Both segmentation and skill assessment were performed manually. The goal of the study was video-based skill assessment: we aimed to train a model that could predict the binary technical score using only the video segment as model input during evaluation.
Three labelers were trained by a content expert (AJH) to manually assess the quality of suturing skills using a previously validated suturing assessment tool [14]. The labelers provided the ground truth technical scores for each video segment through an established group consensus-building process [10].
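As a minimal sketch of the resulting annotations, each consensus label could be stored as a record like the one below; the structure, field names, and phase identifiers are assumptions for illustration rather than the study's actual data schema.

```python
# Illustrative record for one manually segmented and scored sub-stitch; the field
# names and phase identifiers are assumptions, not the study's actual schema.
from dataclasses import dataclass

SUB_STITCH_PHASES = ("needle_handling", "needle_targeting", "needle_driving", "needle_withdrawal")

@dataclass
class ScoredSegment:
    video_id: str      # source exercise recording
    phase: str         # one of SUB_STITCH_PHASES
    start_frame: int   # segment boundaries at 30 Hz, inclusive
    end_frame: int     # exclusive
    ideal: bool        # consensus binary technical score (True = Ideal)

# Example: a needle handling segment rated Non-Ideal by rater consensus
segment = ScoredSegment("surgeon01_attempt02", "needle_handling", 120, 210, ideal=False)
```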
Video-based automatic skill assessment approaches
Our goal of video-based surgical skill assessment is closely related to action recognition in computer vision [15, 16]. However, one key difference from previous work on surgical video recognition is that our model needed to distinguish much more fine-grained details in the videos to perform skill assessment [17]. The visual difference between an Ideal and a Non-Ideal needle withdrawal is much smaller than the difference between a Needle withdrawal and an entirely different suturing sub-stitch action (e.g., Needle handling). While this is a more challenging problem than standard action recognition, recent developments in computer vision have shown promising results in recognizing fine-grained actions, such as different cooking-related actions (e.g., slicing vs. dicing) [18]. One recurring theme in these works is to focus attention at finer granularity in the videos [19]. In that example, focusing on subjects’ hands and the objects they interact with provided important cues that significantly improved performance [20].
Even before implementing innovative machine learning solutions, we determined that for certain skill assessments (Needle positioning, Needle entry angle), a very brief one-second interval may be all that is necessary to capture a critical representation of ideal performance. For other assessments (Needle driving, Needle withdrawal), the full video segment of the action may be most beneficial for accurate assessment. We then further approached video-based surgical skill assessment with two distinct but mutually beneficial machine learning advances. First, instrument kinematic data were synchronized with the videos and used as auxiliary supervision during model training. That is, in addition to the task of interest (i.e., skill assessment), we trained the model to simultaneously predict the corresponding kinematic data (instrument position and pose) for each frame. Recent work in computer vision has shown that leveraging privileged information, which is not available during subsequent evaluation, as auxiliary supervision during training can significantly improve performance on computer vision tasks [21]. Instrument kinematics are indeed privileged information for our task of skill assessment. As previously discussed, hand movements are crucial to recognizing fine-grained activities. This additional supervision encourages our video models to focus on visual signals that predict the privileged kinematic information, improving the model’s capability for skill assessment.
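A minimal PyTorch-style sketch of this idea follows, assuming per-frame visual features have already been extracted; the layer sizes, kinematic dimensionality, and loss weighting are illustrative assumptions rather than the study's exact configuration.

```python
# Minimal sketch of auxiliary supervision with privileged kinematic data, assuming
# a video encoder that yields one feature vector per frame. Layer sizes, kinematic
# dimensionality, and loss weighting are illustrative, not the study's exact setup.
import torch
import torch.nn as nn

class SkillModelWithAuxSupervision(nn.Module):
    def __init__(self, feat_dim=256, kin_dim=12):  # kin_dim: XYZ + roll/pitch/yaw for two instruments (assumed)
        super().__init__()
        self.temporal = nn.LSTM(feat_dim, 128, batch_first=True)   # stand-in for the ConvLSTM backbone
        self.skill_head = nn.Linear(128, 1)            # binary Ideal vs Non-Ideal logit
        self.kinematic_head = nn.Linear(128, kin_dim)  # per-frame kinematic regression (training only)

    def forward(self, frame_feats):                    # frame_feats: (batch, time, feat_dim)
        h, _ = self.temporal(frame_feats)
        skill_logit = self.skill_head(h[:, -1])        # assess skill from the final hidden state
        kin_pred = self.kinematic_head(h)              # predict kinematics at every frame
        return skill_logit, kin_pred

def training_loss(skill_logit, label, kin_pred, kin_true, aux_weight=0.5):
    # Skill classification loss plus an auxiliary regression loss on the privileged
    # kinematics; at evaluation time only the skill prediction is used.
    cls = nn.functional.binary_cross_entropy_with_logits(skill_logit.squeeze(1), label)
    aux = nn.functional.mse_loss(kin_pred, kin_true)
    return cls + aux_weight * aux
```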
Second, we leveraged an attention mechanism to improve the model’s ability to recognize fine-grained motion from videos [22]. While previous work has shown that standard action recognition architectures (e.g., two-stream convolutional long short-term memory [ConvLSTM]) can recognize different suturing gestures with high accuracy [15], our goal of skill assessment requires model architectures that are better suited to differentiating fine-grained details [23]. To this end, we processed the videos with attention models before feeding them into our action recognition model [24]. The attention model is trained in a data-driven way to select input regions that are relevant to the task of interest. This allows the model to ignore the largely irrelevant background and focus on the regions relevant to fine-grained skill assessment.
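The sketch below illustrates one simple form of data-driven spatial attention over per-frame feature maps, in the spirit of the mechanism described above; the specific design (a 1×1 convolution producing a softmax-normalized attention map) is an assumption for illustration, not the study's exact attention model.

```python
# Illustrative spatial attention over convolutional feature maps; a simplified
# stand-in for the attention preprocessing, not the study's exact architecture.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # A 1x1 convolution scores each spatial location for task relevance
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat_map):                        # feat_map: (batch, C, H, W)
        b, c, h, w = feat_map.shape
        attn = self.score(feat_map).view(b, 1, h * w)
        attn = torch.softmax(attn, dim=-1).view(b, 1, h, w)
        # Reweight features so background regions contribute little downstream
        return feat_map * attn

# Usage: attend to convolutional feature maps of each frame before the recognition model
attend = SpatialAttention(channels=256)
frame_features = torch.randn(4, 256, 13, 13)            # assumed per-frame feature map shape
attended = attend(frame_features)
```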
Model Implementation Details
An overview of our model is shown in Fig. 3. We followed previous work on surgical video analysis and used ConvLSTM as our backbone architecture [15]. We used a pre-trained AlexNet as the convolutional feature extractor and considered both the red/green/blue (RGB) frames and the corresponding optical flow images as inputs. The extracted RGB and optical flow feature maps were concatenated along the channel dimension and fused by one convolutional layer to form the input feature map.
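A hedged sketch of this feature fusion step is shown below; the channel counts, input size, and rendering of optical flow as a three-channel image are assumptions, and the ConvLSTM backbone itself is omitted.

```python
# Sketch of the RGB / optical-flow feature fusion described above. Channel counts,
# input size, and encoding optical flow as a 3-channel image are assumptions; in
# practice ImageNet-pretrained AlexNet weights would be loaded.
import torch
import torch.nn as nn
from torchvision.models import alexnet

rgb_extractor = alexnet(weights=None).features     # convolutional feature extractor for RGB frames
flow_extractor = alexnet(weights=None).features    # separate extractor for optical-flow images
fuse = nn.Conv2d(256 + 256, 256, kernel_size=1)    # one conv layer fuses the concatenated maps

rgb_frame = torch.randn(1, 3, 224, 224)             # one RGB video frame
flow_frame = torch.randn(1, 3, 224, 224)            # corresponding optical flow, rendered as 3 channels

rgb_feat = rgb_extractor(rgb_frame)                 # (1, 256, 6, 6)
flow_feat = flow_extractor(flow_frame)              # (1, 256, 6, 6)
fused = fuse(torch.cat([rgb_feat, flow_feat], dim=1))  # input feature map for the ConvLSTM backbone
```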
RESULTS
Each of the 22 robotic surgeons included in this study performed the Basic Suture Sponge exercise on the Mimic™ FlexVR three times. The surgeon cohort comprised 15 expert surgeons (median prior caseload 300 [IQR 165–750]) and 7 training surgeons (median prior caseload 0 [IQR 0–8]). In all, 226 virtual sutures were captured.
Full Time Segments vs One-second Intervals
Initial efforts to automate assessment of Needle positioning and Needle entry angle utilized the entire video segments of the two corresponding sub-stitch phases (AUC 0.691 and 0.548, respectively; Table 1). Subsequently, the video inputs for both domains were narrowed to one-second intervals to assess whether the models improved. For Needle positioning, a one-second interval around the final needle position resulted in an AUC of 0.749. Similarly, Needle entry angle assessment improved when utilizing a one-second interval around the first needle contact with the target (AUC 0.767; Fig. 4).
Table 1.
| Skill Domain | Video Segment Input | AUC |
|---|---|---|
| Needle positioning | Full time segment during needle handling sub-stitch phase | 0.691 |
| Needle positioning | One-second interval around the final needle position | 0.749* |
| Needle entry angle | Full time segment during needle targeting sub-stitch phase | 0.548 |
| Needle entry angle | One-second interval around the first contact with target | 0.767* |

*Best performing model for given skill domain
Augmented CV Training Strategies (for Full Time Segments)
Further analysis sought to improve the models (employing full-segment videos) by applying various augmented training strategies. The simplest approach, using video only (no supplementary strategies), achieved AUC 0.548–0.691 across the skill domains (Table 2). By individually implementing the novel strategies, auxiliary supervision with kinematic data and attention mechanisms, model performance improved (AUC 0.604–0.794) for all skill domains except Needle positioning. The combined application of both augmented approaches (video + auxiliary supervision + attention) achieved the best performance for three skill domains (AUC 0.762–0.959; Table 2). Notably, the Needle driving and Needle withdrawal domains achieved remarkable performance of AUC 0.959 and 0.879, respectively, using the combined approach (Fig. 5).
Table 2.
| Skill Domain | Training Strategy (all full time segment videos) | AUC |
|---|---|---|
| Needle positioning | Video only | 0.691* |
| Needle positioning | Video + auxiliary supervision (kinematic data) | 0.519 |
| Needle positioning | Video + attention | 0.471 |
| Needle positioning | Video + auxiliary supervision + attention | 0.606 |
| Needle entry angle | Video only | 0.548 |
| Needle entry angle | Video + auxiliary supervision (kinematic data) | 0.604 |
| Needle entry angle | Video + attention | 0.756 |
| Needle entry angle | Video + auxiliary supervision + attention | 0.762* |
| Needle driving | Video only | 0.662 |
| Needle driving | Video + auxiliary supervision (kinematic data) | 0.697 |
| Needle driving | Video + attention | 0.776 |
| Needle driving | Video + auxiliary supervision + attention | 0.959* |
| Needle withdrawal | Video only | 0.637 |
| Needle withdrawal | Video + auxiliary supervision (kinematic data) | 0.720 |
| Needle withdrawal | Video + attention | 0.794 |
| Needle withdrawal | Video + auxiliary supervision + attention | 0.879* |

*Best performing model per suturing skill domain
DISCUSSION
The goal of this study was video-based automation of suturing skills assessment, motivated by the ready availability of surgical video and the lack of a truly objective and scalable method to provide surgeons with skills assessment feedback. Our results demonstrate moderate to strong model performance in automating skills assessment for four primary suturing technical skills: Needle positioning, Needle entry angle, Needle driving, and Needle withdrawal.
The key challenge we had to address was attention: identifying the most relevant visual details to consider so that an accurate assessment of suturing technical skill can be made. The simplest strategy was to narrow the input video from the full sequence to a one-second time frame for Needle positioning and Needle entry angle. Human raters similarly rely on the few frames after the surgeon has finalized the needle grasp or has committed to an angle of attack with the needle tip to rate these two skill domains, respectively. Providing a longer time frame (and more data) for evaluation in these scenarios was not helpful and likely muddied the picture.
The two augmented machine learning strategies to increase the model’s ability to capture fine-grained details were the real innovations of this study, demonstrating that auxiliary supervision (privileged kinematic data utilized during the training phase only) and an attention mechanism can truly improve automated technical skills assessment. Without these strategies, our video evaluation performance was comparable to other published efforts [25]. With the addition of these two strategies, our video-based evaluation surpassed prior efforts. Indeed, we show that the architectural improvement from the attention mechanism is synergistic with auxiliary supervision from kinematic data.
We note a few limitations of our study. Single-institution data were utilized for both training and validation of the automated pipeline; further work must include validation with external data. Additionally, our initial efforts focused on videos of VR suturing performance to keep the “surgical field” streamlined. Subsequent evaluation should include footage from live surgery. We chose suturing as a test bed for skills assessment because of its more predictable and structured nature; ongoing efforts are tackling the more complex and nuanced task of tissue dissection, where greater change to the tissue occurs. Finally, while we have utilized kinematic data for training the model, we intend to treat it as truly privileged information and do not expect to need or rely on it in the future.
Future work will determine with what confidence skills assessments may be made, and whether a “human in the loop” will be necessary to verify assessments and ensure that they are truly accurate depictions of surgeon performance. Under-scoring (giving a surgeon a lower mark than deserved) may unfairly disadvantage a surgeon and prevent them from obtaining credentials or privileges to operate. Conversely, over-scoring (giving a surgeon a higher mark than deserved) may endanger patients by allowing a potentially unsafe surgeon to operate. The accuracy with which the model performs, and minimizes under-scoring and over-scoring, will ultimately dictate whether surgeons and the greater society can accept autonomous skills assessments.
CONCLUSIONS
Machine learning innovations, as utilized in our study, demonstrate that video-based evaluation of suturing technical skills is achievable with robust performance. Future work will expand on the applications demonstrated in the present work and will determine whether a “human in the loop” is necessary to verify surgeon evaluations or whether the process can be fully autonomous.
Acknowledgements:
We thank Daniel Sanford1, Balint Der1, Ryan Hakim1, Runzhuo Ma1, and Taseen Haque1 for data collection and grading of technical skill scores through video review. Mimic Technologies, Inc. provided access to the raw kinematic instrument data for each exercise.
1Center for Robotic Simulation & Education, Catherine & Joseph Aresty Department of Urology, USC Institute of Urology, University of Southern California, Los Angeles, California
Funding/Support:
Research reported in this publication was supported in part by the National Cancer Institute under Award No. R01CA251579-01A1.
Footnotes
Statement and Declarations: Andrew J. Hung has financial disclosures with Intuitive Surgical, Inc.
DECLARATIONS
Ethics approval: All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. Our study complied with protocols put forth by the University of Southern California’s IRB.
Consent to participate: Informed consent was obtained from all individual participants included in the study.
REFERENCES
1. Luongo F, Hakim R, Nguyen JH, Anandkumar A, Hung AJ (2021) Deep learning-based computer vision to recognize and classify suturing gestures in robot-assisted surgery. Surgery 169(5):1240–1244. doi: 10.1016/j.surg.2020.08.016
2. Wesierski D, Jezierska A (2018) Instrument detection and pose estimation with rigid part mixtures model in video-assisted surgeries. Med Image Anal 46:244–265. doi: 10.1016/j.media.2018.03.012
3. Cai T, Zhao Z (2020) Convolutional neural network-based surgical instrument detection. Technol Health Care 28(S1):81–88. doi: 10.3233/THC-209009
4. Kitaguchi D, Takeshita N, Matsuzaki H, Takano H, Owada Y, Enomoto T, Oda T, Miura H, Yamanashi T, Watanabe M, Sato D, Sugomori Y, Hara S, Ito M (2020) Real-time automatic surgical phase recognition in laparoscopic sigmoidectomy using the convolutional neural network-based deep learning approach. Surg Endosc 34(11):4924–4931. doi: 10.1007/s00464-019-07281-0
5. Birkmeyer JD, Finks JF, O'Reilly A, Oerline M, Carlin AM, Nunn AR, Dimick J, Banerjee M, Birkmeyer NJ, Michigan Bariatric Surgery C (2013) Surgical skill and complication rates after bariatric surgery. N Engl J Med 369(15):1434–1442. doi: 10.1056/NEJMsa130062
6. Hung AJ, Chen J, Ghodoussipour S, Oh PJ, Liu Z, Nguyen J, Purushotham S, Gill IS, Liu Y (2019) A deep-learning model using automated performance metrics and clinical features to predict urinary continence recovery after robot-assisted radical prostatectomy. BJU Int 124(3):487–495. doi: 10.1111/bju.14735
7. Trinh L, Mingo S, Vanstrum EB, Sanford DI, Aastha, Ma R, Nguyen JH, Liu Y, Hung AJ (2021) Survival analysis using surgeon skill metrics and patient factors to predict urinary continence recovery after robot-assisted radical prostatectomy. Eur Urol Focus S2405-4569(21)00107-3. doi: 10.1016/j.euf.2021.04.001
8. Chen J, Cheng N, Cacciamani G, Oh P, Lin-Brande M, Remulla D, Gill IS, Hung AJ (2019) Objective assessment of robotic surgical technical skill: a systematic review. J Urol 201(3):461–469. doi: 10.1016/j.juro.2018.06.078
9. Lendvay TS, White L, Kowalewski T (2015) Crowdsourcing to assess surgical skill. JAMA Surg 150(11):1086–1087. doi: 10.1001/jamasurg.2015.2405
10. Hung AJ, Rambhatla S, Sanford DI, Pachauri N, Vanstrum E, Nguyen JH, Liu Y (2021) Road to automating robotic suturing skills assessment: battling mislabeling of the ground truth. Surgery S0039-6060(21)00784-4. doi: 10.1016/j.surg.2021.08.014
11. Levin M, McKechnie T, Khalid S, Grantcharov TP, Goldenberg M (2019) Automated methods of technical skill assessment in surgery: a systematic review. J Surg Educ 76(6):1629–1639. doi: 10.1016/j.jsurg.2019.06.011
12. Law H, Ghani K, Deng J (2017) Surgeon technical skill assessment using computer vision based analysis. Proceedings of Machine Learning for Healthcare 68:88–99.
13. Baghdadi A, Hussein AA, Ahmed Y, Cavuoto LA, Guru KA (2019) A computer vision technique for automated assessment of surgical performance using surgeons' console-feed videos. Int J Comput Assist Radiol Surg 14(4):697–707. doi: 10.1007/s11548-018-1881-9
14. Raza S, Field E, Jay C, Eun D, Fumo M, Hu J, Lee D, Mehboob Z, Peabody JO, Sarle R, Stricker H, Yang Z, Wilding G, Mohler JL, Guru KA (2015) Surgical competency for urethrovesical anastomosis during robot-assisted radical prostatectomy: development and validation of the Robotic Anastomosis Competency Evaluation. Urology 85(1):27–32. doi: 10.1016/j.urology.2014.09.017
15. Luongo F, Hakim R, Nguyen JH, Anandkumar A, Hung AJ (2021) Deep learning-based computer vision to recognize and classify suturing gestures in robot-assisted surgery. Surgery 169(5):1240–1244. doi: 10.1016/j.surg.2020.08.016
16. Poppe R (2010) A survey on vision-based human action recognition. Image Vis Comput 28(6):976–990.
17. Lea C, Vidal R, Hager GD (2016) Learning convolutional action primitives for fine-grained action recognition. IEEE International Conference on Robotics and Automation (ICRA). doi: 10.1109/ICRA.2016.7487305
18. Rohrbach M, Amin S, Andriluka M, Schiele B (2012) A database for fine grained activity detection of cooking activities. IEEE Conference on Computer Vision and Pattern Recognition, 1194–1201. doi: 10.1109/CVPR.2012.6247801
19. Ni B, Paramathayalan VR, Moulin P (2014) Multiple granularity analysis for fine-grained action detection. IEEE Conference on Computer Vision and Pattern Recognition, 756–763. doi: 10.1109/CVPR.2014.102
20. Ma M, Fan H, Kitani KM (2016) Going deeper into first-person activity recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1605.03688
21. Hoffman J, Gupta S, Darrell T (2016) Learning with side information through modality hallucination. IEEE Conference on Computer Vision and Pattern Recognition. doi: 10.1109/CVPR.2016.96
22. Mnih V, Heess N, Graves A, Kavukcuoglu K (2014) Recurrent models of visual attention. Advances in Neural Information Processing Systems. arXiv:1406.6247
23. Ni B, Paramathayalan VR, Moulin P (2014) Multiple granularity analysis for fine-grained action detection. IEEE Conference on Computer Vision and Pattern Recognition. doi: 10.1109/CVPR.2014.102
24. Li Z, Huang Y, Cai M, Sato Y (2019) Manipulation-skill assessment from videos with spatial attention network. IEEE/CVF International Conference on Computer Vision Workshops. doi: 10.1109/ICCVW.2019.00539
25. Kitaguchi D, Takeshita N, Matsuzaki H, Igaki T, Hasegawa H, Ito M (2021) Development and validation of a 3-dimensional convolutional neural network for automatic surgical skill assessment based on spatiotemporal video analysis. JAMA Netw Open 4(8):e2120786. doi: 10.1001/jamanetworkopen.2021.20786