Introduction

Thorough training of novice surgeons is crucial for increasing patient safety and avoiding adverse operation outcomes. This holds especially for technically challenging approaches such as traditional and robot-assisted minimally invasive surgery. One important aspect of teaching surgery is surgical skill assessment. It is a prerequisite for targeted feedback, which facilitates efficient skill acquisition by telling the novice how to improve. Additionally, skill assessment is important for certifying prospective surgeons.

Traditionally, novice surgeons are taught and evaluated by experts in the field. To ensure objective and standardized assessments, procedure-specific checklists and global rating scales for surgical skill evaluation were introduced [1]. However, these evaluation protocols make skill assessment very time-consuming and thus expensive. For this reason, automating the process of surgical skill evaluation is a promising idea. Besides saving time and money, this would enable novice surgeons to train effectively in the absence of a human supervisor, using a surgical simulator equipped with automatic assessment and feedback functionalities. Consequently, objective computer-aided technical skill evaluation (OCASE-T) [28] has received increasing interest in the research community over the last decade.

For computer-aided technical skill evaluation, sensor data is collected during surgical training, for example video data, robot kinematics, tool motion data, or data acquired by force and torque sensors. This data is then analyzed in order to estimate the surgical skill of the trainee. Here, surgical skill can be described either numerically, for example as an average OSATS [21] score, or categorically, for example using the classes novice, intermediate, and expert.

One approach for OCASE-T is to calculate descriptive metrics from the observed data and to use these metrics to determine surgical skill. For example, Fard et al. [10] record instrument trajectories and extract several features from the trajectories, such as path length, motion smoothness, curvature, and task completion time. Using these features, they train various machine learning classifiers to distinguish between expert and novice. Zia and Essa [33] evaluate texture features, frequency-based features, and entropy-based features extracted from robot kinematic data. For classifying surgeons into expert, intermediate, or novice, they employ a nearest-neighbor classifier after dimensionality reduction using principal component analysis. They achieve very high classification accuracies using a feature based on approximate entropy (ApEn).

Another approach is to model the collected data as time series using Markov models, hidden Markov models, or variations thereof. For example, Tao et al. [26] propose a sparse hidden Markov model (S-HMM). For surgical skill classification, one S-HMM per class is learned from performances of the respective skill level. For classifying a new performance, the model that generates this performance with the highest likelihood is determined and the corresponding skill class is returned.

Recently, methods based on convolutional neural networks (ConvNets) have also been proposed. ConvNets are able to learn hierarchical representations from data and can serve as feature extractors that build high-level features from low-level ones. As recent studies suggest [14, 31], they can also learn to detect discriminative motion patterns in robot kinematic data and thus distinguish accurately between expert, intermediate, and novice surgeons.

A brief introduction to ConvNets and deep learning is given in [20]. Notably, ConvNets can process data in their raw form. In contrast, conventional machine learning classifiers require data to be transformed into an adequate representation, such as the motion features described above. Such features usually need to be designed manually, which can be a cumbersome process that depends on experience in feature engineering and requires domain knowledge.

Most methods for OCASE-T focus on the analysis of robot kinematic data or tool motion data. However, obtaining this data generally requires the availability of a robotic surgical system or of specialized tracking systems [7]. On the other hand, video data of the surgical field can be easily acquired in traditional and robot-assisted minimally invasive surgery using the laparoscopic camera. However, video data is high-dimensional and considerably more complex than sequences of a few motion variables. To deal with these challenges, approaches working on video data usually apply one of two strategies.

One strategy is to track the surgical tools in the video and then analyze the obtained tool motion data. A number of algorithms exist for vision-based surgical tool tracking [3], and ConvNets have also been shown to work well for this task [9, 19]. In one study, Jin et al. [16] use region-based ConvNets for instrument detection and localization. They then assess surgical skill by analyzing tool usage patterns, movement range, and economy of motion.

The other strategy is to transform the video data into a D-dimensional time series using a bag-of-words (BoW) approach. For example, Zia et al. [34] extract spatiotemporal interest points (STIPs) from each video frame. Each STIP is represented by a descriptor composed of a histogram of optical flow (HOF) and a histogram of oriented gradients (HOG). A dictionary of visual words is learned from two distinct expert videos by clustering the STIP descriptors of their video frames using k-means clustering with \(k = D\). For the remaining videos, each frame is represented by a D-dimensional histogram, where the ith bin counts the number of STIP descriptors extracted from the frame that are assigned to the ith visual word. The D-dimensional time series corresponding to one video can then be analyzed using descriptive metrics or time series analysis as described before.
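To make this strategy more concrete, the following sketch (ours, not taken from [34]) outlines how per-frame descriptors could be turned into such a D-dimensional time series. The descriptor extraction itself is passed in as a placeholder function, and the clustering setup only approximates the procedure described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_dictionary(expert_videos, extract_descriptors, D=100):
    """Cluster the STIP descriptors of the expert videos into D visual words.

    `extract_descriptors(frame)` is a placeholder for HOF/HOG descriptor
    extraction around spatiotemporal interest points.
    """
    descriptors = np.vstack([
        desc
        for video in expert_videos
        for frame in video
        for desc in extract_descriptors(frame)
    ])
    return KMeans(n_clusters=D).fit(descriptors)

def video_to_time_series(video, dictionary, extract_descriptors):
    """Represent each frame by a D-bin histogram of visual word occurrences."""
    D = dictionary.n_clusters
    series = []
    for frame in video:
        hist = np.zeros(D)
        descriptors = list(extract_descriptors(frame))
        if descriptors:
            for word in dictionary.predict(np.asarray(descriptors)):
                hist[word] += 1
        series.append(hist)
    return np.stack(series)  # shape: (num_frames, D)
```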

Fig. 1 The temporal segment network employed for video classification. Each of the K video snippets is classified using one ConvNet instance. The snippet-level classifications \(c(s_i)\) are aggregated in the consensus layer, yielding the overall video classification c(v). Snippets are stacks of several consecutive video frames or, as depicted, stacks of several optical flow fields calculated between consecutive video frames

In contrast to both strategies, we propose to process video data without any of the intermediate steps discussed above. Using a 3D ConvNet, we directly learn feature representations from video data. This way, we avoid the process of manually engineering suitable intermediate video representations for video-based surgical skill assessment. To the best of our knowledge, we are the first to present a deep learning-based approach for automatic, objective surgical skill classification using video data only.

To achieve this, we bring together three advances from video-based human action recognition: (i) a modern 3D ConvNet architecture [6] to extract spatiotemporal features from video snippets with a typical length of 16–64 frames, (ii) Kinetics [17], a very large, publicly available human action dataset, which we exploit for cross-domain pretraining, and (iii) the temporal segment network (TSN) framework [30] to learn to resolve ambiguities in single video snippets by considering several snippets at a time, which are selected using a segment-based sampling scheme. The source code will be made available at https://gitlab.com/nct_tso_public/surgical_skill_classification.

We evaluate the method on the JIGSAWS dataset [2, 11], which consists of recordings of three elementary tasks of robot-assisted surgery. The tasks are performed in a simulated environment using dry lab bench-top models. While the JIGSAWS dataset does not contain intraoperative recordings and hence does not represent real surgical scenes, it represents surgical training scenarios very well. It is common that surgical novices practice on dry lab bench-top trainers, where they perform tasks of curricula such as the fundamentals of laparoscopic surgery (FLS) [23] or the structured fundamental inanimate robotic skills tasks (FIRST) [12]. These training scenarios are the ones where we expect automatic skill assessment to have an impact first, by providing automatic feedback to the novice and thus enabling him or her to practice without human expert supervision. In general, however, our method can be applied to video recordings of any kind of surgical task as long as a sufficiently large database of labeled videos is available.

Most similar to our approach is the work by Doughty et al. [8] on skill-based ranking of instructional videos. They train a temporal segment network with a modified loss function to learn which video of a pair of videos exhibits a higher level of skill. However, they do not investigate the use of 3D ConvNets and therefore rely on spatial image features only.

Methods

Formally, we solve a video classification problem. Given are a number of videos recorded during surgical training. Each video displays the performance of one surgical task, such as suturing or knot tying. Such a video has a typical length of one to five minutes. The problem is to classify each video as expert, intermediate, or novice depending on the level of surgical skill exhibited in the video.

Temporal segment networks (TSNs) were introduced by Wang et al. [29, 30] as a deep learning framework for video classification in the context of human action recognition. The main idea is to use a ConvNet to classify several short snippets taken from a video. Then, the snippet-level predictions are aggregated using a consensus function to obtain the overall video classification result (see Fig. 1). Several differentiable aggregation functions can be used to form consensus. One of the most straightforward choices is average pooling, which has been shown to work well in practice [30].

The sampling strategy for obtaining the video snippets differs slightly between the training and testing stages of the TSN. During training, snippets are sampled in a segment-based manner: each video is partitioned into K non-overlapping segments along the temporal axis, and from each segment a snippet consisting of one or more consecutive frames is sampled at a random temporal position within the segment. This yields K snippets in total. Usually, K is in the range of 3–9 for classifying video clips with a typical length of around 10 s [30]. During testing, \(\kappa \) snippets are sampled equidistantly from the test video. Usually, \(\kappa \) is larger than K, for example \(\kappa = 25\).
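The following minimal sketch illustrates the two sampling schemes (segment-based at training time, equidistant at test time). It is an illustration of the idea rather than the authors' implementation; indices refer to the first frame of each snippet.

```python
import random

def sample_training_snippets(num_frames, K, snippet_len):
    """Sample one snippet start index per segment, K segments in total."""
    starts = []
    seg_len = num_frames // K
    for k in range(K):
        lo = k * seg_len
        hi = min((k + 1) * seg_len, num_frames - snippet_len)
        starts.append(random.randint(lo, max(lo, hi)))
    return starts

def sample_test_snippets(num_frames, kappa, snippet_len):
    """Sample kappa snippet start indices spread evenly over the video."""
    step = max(1, (num_frames - snippet_len) // max(1, kappa - 1))
    return [min(i * step, num_frames - snippet_len) for i in range(kappa)]
```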

During training, a TSN can be seen as a Siamese [5] architecture: several instances of the same ConvNet, which share all of their weights, are used to classify one video snippet each. After forming consensus, the video classification error is backpropagated to update the network parameters. This way, the ConvNet learns to classify video snippets such that, when it is applied to several snippets of a video, the consensus over all predictions yields the correct video class. Notably, forming consensus allows contradictions and ambiguities in single video snippets to be compensated.
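Conceptually, a TSN forward pass can be sketched as follows in PyTorch. The class name and tensor layout are our own illustration, assuming a 3D ConvNet backbone that maps a snippet to class scores.

```python
import torch
import torch.nn as nn

class TemporalSegmentNetwork(nn.Module):
    """Weight-sharing snippet classifier with average-pooling consensus."""

    def __init__(self, snippet_convnet):
        super().__init__()
        self.convnet = snippet_convnet  # one set of weights, applied to every snippet

    def forward(self, snippets):
        # snippets: (batch, K, C, T, H, W) for a 3D ConvNet backbone
        b, k = snippets.shape[:2]
        flat = snippets.flatten(0, 1)             # (batch * K, C, T, H, W)
        scores = self.convnet(flat)               # (batch * K, num_classes)
        return scores.view(b, k, -1).mean(dim=1)  # consensus over the K snippets
```

During training, the loss is computed on the consensus output, so the video-level classification error is backpropagated through all K snippet predictions.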

A TSN is a promising choice for surgical skill classification from video. Looking at short snippets instead of complete videos is an effective strategy to reduce the size and complexity of the problem at hand. Additionally, the segment-based snippet sampling scheme at training time already provides a form of data augmentation. This is appealing because the available training data for the problem of surgical skill assessment is still very limited. Moreover, with a TSN it is possible to classify single snippets within the same video differently from each other, as long as the snippet-level predictions add up to the correct overall result. This is important for video-based skill classification, where usually only an overall assessment per video is given, while individual parts of the video could exhibit different levels of surgical skill.

In order to implement a TSN for surgical skill classification, two design decisions must be made. First, we need to decide how a single snippet is constructed from a video. Secondly, we need to decide about the architecture of the ConvNet used for snippet classification.

Snippet structure

In the original paper by Wang et al. [29], a snippet is a single RGB video frame or a stack of five dense optical flow (OF) fields calculated between consecutive video frames. Here, each optical flow frame has two channels: one containing the horizontal component, the other containing the vertical component of optical flow.

Because we want to provide as much information as possible to the network, we decided to use larger snippets consisting of 64 consecutive frames. We extract frames from the video at a frequency of 10 Hz, which means that a snippet of 64 frames represents more than six seconds of surgical video.

While prior work [24] indicates that the optical flow modality yields better results than the RGB modality for deep learning-based human action recognition, optical flow is expensive to calculate. Moreover, optical flow explicitly captures motion between video frames but does not represent the appearance of static objects in the scene. In this study, we therefore evaluate both modalities: a video snippet consists either of 64 consecutive RGB frames or of 64 fields of optical flow calculated between consecutive video frames.
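For illustration, the two snippet formats can be assembled into tensors as follows. The channels x time x height x width layout is an assumption about how a 3D ConvNet consumes the data; the spatial size of 224 pixels refers to the cropped frames described later.

```python
import torch

def frames_to_snippet(frames):
    """Stack 64 frames, each a (C, H, W) tensor, into a (C, 64, H, W) snippet."""
    assert len(frames) == 64
    return torch.stack(frames, dim=1)

rgb_snippet = frames_to_snippet([torch.zeros(3, 224, 224) for _ in range(64)])   # RGB modality
flow_snippet = frames_to_snippet([torch.zeros(2, 224, 224) for _ in range(64)])  # horizontal + vertical flow
```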

ConvNet architecture

Originally, Wang et al. [30] proposed to use deep ConvNet architectures that are well established for image classification tasks. Examples are Inception [25] and ResNet [13]. However, image classification networks only extract features along the spatial dimensions. They do not model the temporal information encoded in stacks of consecutive video or optical flow frames.

In order to facilitate spatiotemporal feature learning for action recognition, 3D ConvNets were proposed [15, 27]. In a nutshell, 3D ConvNets are a variant of convolutional neural networks with the property that convolutional filters, pooling operations, and feature maps have a third—the temporal—dimension.

Carreira and Zisserman [6] describe how to canonically inflate deep 2D ConvNet architectures along the temporal dimension to obtain deep 3D ConvNets. They propose Inception-v1 I3D, the 3D version of Inception. Because this architecture achieves state-of-the-art results on many human action datasets, we decided to use this network for surgical skill classification.

In order to adapt the network to our problem, we set the number of output neurons to three. The output neurons are meant to indicate the surgical skill class, namely expert, intermediate, or novice, using a one-hot encoding.
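A minimal sketch of this adaptation is given below, assuming the publicly available pytorch-i3d implementation referenced in the "Implementation and training details" section. The module name, checkpoint path, and the replace_logits helper are assumptions about that code base, not part of the paper.

```python
import torch
from pytorch_i3d import InceptionI3d  # module and class names assumed

# Kinetics-pretrained flow variant with 2 input channels; for the RGB
# modality, use in_channels=3 and the corresponding RGB checkpoint.
model = InceptionI3d(num_classes=400, in_channels=2)
model.load_state_dict(torch.load('models/flow_imagenet.pt'))  # checkpoint path assumed
model.replace_logits(3)  # three output neurons: expert, intermediate, novice
```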

ConvNet pretraining

To successfully train a 3D ConvNet from scratch, a lot of training data and a lot of computational resources are required, both of which we lack. For image classification problems, a popular strategy for coping with limited training data is to pretrain the ConvNet on a large labeled dataset from a related domain and to fine-tune it afterward. Fortunately, it has been shown that this also works for 3D ConvNets and video classification problems [6]. Moreover, pretrained 3D ConvNet models have become publicly available and are ready to be fine-tuned on custom datasets. We chose to use an Inception-v1 I3D model pretrained on the Kinetics dataset, which is publicly available for both input modalities, RGB and optical flow.

Kinetics [17] is one of the largest human action datasets available so far. It consists of 400 action classes and contains at least 400 video clips per class. We assume that the spatiotemporal features learned from the Kinetics dataset are a helpful initialization for subsequently fine-tuning the 3D ConvNet for surgical skill classification.

Before fine-tuning the pretrained model, we freeze the weights of all layers up to, but not including, the second-to-last Inception block (Mixed_5b). This way, the number of trainable parameters is greatly reduced, which enables us to fine-tune the model using one GPU with only 8 GB of video RAM.
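One possible way to implement this freezing step is sketched below. It assumes that the parameter names of the pretrained model contain the Inception block identifiers (for example "Mixed_5b"); the exact names depend on the underlying implementation.

```python
def freeze_early_layers(model, trainable=('Mixed_5b', 'Mixed_5c', 'logits')):
    """Freeze all parameters except those of the last two Inception blocks
    and the output layer. Matching by substring assumes that parameter
    names contain these block identifiers."""
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable)
```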

Fig. 2 Surgical tasks recorded in JIGSAWS [2]

Implementation and training details

We implement the TSN for surgical skill classification using the PyTorch package [22]. Our implementation is based on the source code published with the TSN paper and the PyTorch implementation of Inception-v1 I3D provided by Piergiovanni. The optical flow fields are calculated using the TV-L1 algorithm [32] implemented in OpenCV [4]. For data augmentation, we use corner cropping, scale jittering, and horizontal flipping as described in [29]. The final width and height of each frame are 224 pixels, which is the expected input size of Inception-v1 I3D.
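As an illustration, dense TV-L1 flow between two consecutive frames could be computed as follows. The factory function shown is the one exposed by recent opencv-contrib builds; older OpenCV versions name it differently, and any further preprocessing of the flow fields is not shown.

```python
import cv2

# Factory name for recent opencv-contrib builds; older OpenCV releases
# expose cv2.DualTVL1OpticalFlow_create() instead.
tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

def tvl1_flow(prev_bgr, next_bgr):
    """Dense TV-L1 optical flow between two consecutive frames.

    Returns an (H, W, 2) float array with the horizontal and vertical
    flow components.
    """
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    return tvl1.calc(prev_gray, next_gray, None)
```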

When fine-tuning the pretrained 3D ConvNet for surgical skill classification, the parameters of the modified output layer are initialized with random values sampled uniformly from the range \((-\sqrt{\frac{1}{1024}}, \sqrt{\frac{1}{1024}})\). The dropout probability of the dropout layer, which is located just before the output layer in Inception-v1 I3D, is set to 0.7. We use standard cross-entropy loss and the Adam optimizer [18] with a learning rate of \(10^{-5}\). We train the TSN for 1200 epochs. During each epoch of training, we sample \(K = 10\) snippets from each video in the training set. The batch size is set to 2. For testing, we use \(\kappa = 25\) snippets per video. All hyperparameters were chosen heuristically.
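The following sketch summarizes this training configuration (output-layer initialization, loss, and optimizer). The attribute path to the new output layer is an assumption about the underlying I3D code, and the TSN training loop itself is omitted.

```python
import math
import torch
import torch.nn as nn

def configure_training(model, fan_in=1024, lr=1e-5):
    """Re-initialize the new output layer and set up loss and optimizer."""
    bound = math.sqrt(1.0 / fan_in)
    for p in model.logits.parameters():   # attribute name assumed
        nn.init.uniform_(p, -bound, bound)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    return criterion, optimizer
```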

Evaluation

We evaluate the TSN for surgical skill classification on the JHU-ISI gesture and skill assessment working set (JIGSAWS) [2, 11], one of the largest public datasets for surgical skill evaluation. It contains recordings of three basic surgical tasks performed on a bench-top model using the da Vinci Surgical System: suturing, needle passing, and knot tying (see Fig. 2). Each recording contains the synchronized robot kinematics in addition to the laparoscopic stereo video captured during the performance of one task. Eight participants with varying robotic surgical experience performed each of the tasks up to five times. The participants are classified into the categories expert, intermediate, and novice based on their hours of robotic surgical experience.

In this work, our goal is to predict the level of surgical experience (expert, intermediate, or novice) on the JIGSAWS dataset using the video data only. The video data in JIGSAWS was recorded at 30 Hz with a resolution of \(640 \times 480\) pixels. We only use the right video from each stereo recording.

The authors define two cross-validation schemes for JIGSAWS: leave-one-supertrial-out (LOSO) and leave-one-user-out (LOUO). For LOSO, the data of one surgical task is split into five folds, where the ith fold contains the ith performance of each participant. For LOUO, the data is split into eight folds, where each fold contains the performances of one specific participant.
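For illustration, the two fold definitions could be constructed as follows, assuming each recording is identified by a (participant, trial) pair; the actual JIGSAWS file organization differs.

```python
from collections import defaultdict

def loso_folds(recordings):
    """ith fold = ith trial of every participant."""
    folds = defaultdict(list)
    for participant, trial in recordings:
        folds[trial].append((participant, trial))
    return [folds[t] for t in sorted(folds)]

def louo_folds(recordings):
    """One fold per participant."""
    folds = defaultdict(list)
    for participant, trial in recordings:
        folds[participant].append((participant, trial))
    return [folds[p] for p in sorted(folds)]
```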

LOSO evaluation

Table 1 Surgical skill classification results on the knot tying task, averaged over four evaluation runs

We evaluate our approach separately for each of the three surgical tasks, following the LOSO cross-validation scheme. During one evaluation run for one task, we train a model for each of the five LOSO folds using data from the other folds. Then, we predict the surgical skill class for each video using the model that has not seen this video at training time.

Table 2 Surgical skill classification results on the suturing and needle passing tasks, averaged over four evaluation runs

For the obtained predictions, we calculate accuracy, average recall, and average \(F_1\) score. Accuracy is the proportion of correct predictions among all predictions. For a single class, recall is the fraction of instances of this class that are recognized as such, whereas precision is the fraction of examples predicted to belong to this class that actually do. \(F_1\) is the harmonic mean of precision and recall. Recall and \(F_1\) score are averaged over the three classes expert, intermediate, and novice. Notably, accuracy and average recall correspond to the Micro and Macro measures defined in [2].
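These metrics can be computed, for example, with scikit-learn, given per-video ground-truth labels and predictions over the three classes; macro averaging corresponds to the class-wise averaging described above.

```python
from sklearn.metrics import accuracy_score, recall_score, f1_score

def skill_metrics(y_true, y_pred):
    """Accuracy, class-averaged recall, and class-averaged F1 score."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'avg_recall': recall_score(y_true, y_pred, average='macro'),
        'avg_f1': f1_score(y_true, y_pred, average='macro'),
    }
```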

Fig. 3 Predictions on the knot tying task, accumulated over four evaluation runs

We perform our experiments using either RGB or optical flow as input modality. To account for the stochastic nature of ConvNet initialization and training, we perform four evaluation runs per task and input modality and average the results. We report mean and standard deviation. The results can be found in Tables 1 and 2, where our method is referred to as 3D ConvNet (\(*\)). The input modality employed, namely plain video (RGB) or optical flow (OF), is denoted in brackets.

Table 3 Surgical skill classification results of the ablation experiments (a)–(c) on the knot tying task using the optical flow modality, averaged over four evaluation runs

For comparison, we state the results of some state-of-the-art methods described in the “Introduction” section. Notably, all of these methods utilize the robot kinematics provided in JIGSAWS. Unfortunately, we could not find another video-based method for surgical skill classification that has been evaluated on JIGSAWS.

The confusion matrices in Fig. 3 visualize the predictions on the knot tying task, using either RGB (Fig. 3a) or optical flow (Fig. 3b) as input modality. Here, for matrix C, the number in cell \(C(i, j)\), \(i, j \in \lbrace 0, 1, 2\rbrace \), denotes how many times an instance of class i is classified as an instance of class j. Class 0 corresponds to skill level novice, class 1 to intermediate, and class 2 to expert.
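Such a matrix can be computed, for example, with scikit-learn, which follows the same convention (rows correspond to true classes, columns to predicted classes).

```python
from sklearn.metrics import confusion_matrix

def skill_confusion_matrix(y_true, y_pred):
    """C[i, j] counts instances of true class i predicted as class j."""
    return confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
```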

Ablation studies

To evaluate the impact of our main design decisions on skill classification performance, we investigate the following questions:

  (a) What is the benefit of using a 3D ConvNet?

  (b) What is the benefit of extending the 3D ConvNet into a temporal segment network during training?

  (c) What is the benefit of pretraining the 3D ConvNet on the Kinetics dataset?

To this end, we re-run our experiments with one of the following modifications. Unless stated otherwise, we stick to the training procedure described in the “Methods” section. We restrict the experiments to the knot tying task, which turned out to be the most challenging one. As input modality, we use optical flow.

  (a) We use the 2D ConvNet Inception-v3 [25] instead of Inception-v1 I3D. This also requires changing the video snippet structure: we use stacks of 5 dense optical flow fields calculated at a frequency of 5 Hz. The 2D ConvNet is initialized with the weights of an Inception-v3 TSN model pretrained on Kinetics. For fine-tuning, the weights of all layers up to, but not including, the last Inception block (Mixed_10) are frozen.

  (b) During training, we do not extend the 3D ConvNet into a TSN. Instead, we train the network on single video snippets that are sampled at random temporal positions from the video. This is equivalent to setting the number of segments K to 1. Because the network sees only one snippet, instead of 10, from each video during one epoch of training, we multiply the number of training epochs by a factor of 10.

  (c) Instead of initializing the 3D ConvNet with weights obtained by training on the Kinetics dataset, we train the TSN from scratch. For initialization, we use PyTorch's standard routine and initialize each layer with weights sampled uniformly from the range \((-\sqrt{\frac{1}{n}}, \sqrt{\frac{1}{n}})\). Here, n denotes the total number of weights in the layer. Because we need to train all layers of the network, more computational resources are required. In fact, we cannot fit the model with \(K = 10\) segments on the GPU. Thus, we train the TSN with \(K = 5\) segments and a batch size of 1. Because we halve the number of video snippets that are processed per epoch, we double the number of training epochs.

      For fair comparison, we conduct one further experiment where we use a pretrained 3D ConvNet, as proposed, but train the TSN also with only \(K = 5\) segments, a batch size of 1, and for 2400 training epochs.

The LOSO cross-validation results of the ablation experiments can be found in Table 3. The evaluation metrics are calculated as described in “LOSO evaluation” section and are averaged over four evaluation runs.

LOUO evaluation

Fig. 4 Predictions on the knot tying task using the optical flow modality, obtained from one LOUO evaluation run

Establishing LOUO validity is important to demonstrate that a skill assessment algorithm will perform well on new subjects. However, we believe that the data provided in JIGSAWS is too limited to produce meaningful LOUO cross-validation results for learning-based skill assessment approaches.

One problem is that there are only two experts and two intermediates among the participants in JIGSAWS. Thus, the training data is not representative for many of the LOUO splits. For example, when the test split consists of performances by an expert, the corresponding training data contains examples of only one other expert, exhibiting a severe lack of intra-class variability. This issue can be observed in Fig. 4, which illustrates the predictions on the knot tying task obtained by a model trained following the LOUO cross-validation scheme. While the model classifies most novice performances correctly, it struggles, as one would expect, to recognize the performances of intermediates and experts.

Discussion

As can be seen in Tables 1 and 2, our method achieves high classification accuracies that are on a par with state-of-the-art approaches. Both evaluated input modalities, RGB and optical flow, perform similarly well. Optical flow may have a slight advantage, as it achieves 100% classification accuracy on the needle passing task. Further exploring the strengths and weaknesses of both modalities, as well as possibilities to combine them, is left for future work. All in all, our model can learn to extract discriminative patterns from video data that distinguish experienced surgeons from less experienced ones.

Notably, our method succeeds in learning from highly complex video data, while the other approaches analyze the robot kinematics, which consist of only 76 variables. The ability to analyze raw video opens up new application areas. For example, our method could be used on low-cost surgical trainers, which lack sophisticated data acquisition systems. Still, a combination of video with kinematic data, if available, or with other sensor modalities could increase the robustness and reliability of our model and further improve results.

In line with the other approaches, our method achieves the lowest classification accuracy on the knot tying task. Here, our model mostly misclassifies intermediate surgeons as novices (see Fig. 3). However, we want to point out that the skill labels in JIGSAWS were assigned to participants based on their hours of robotic surgical experience (novice: \(< 10\) hours, expert: \(> 100\) hours, intermediate: in between). Hence, it is possible that intermediate surgeons still exhibit novice-like motion patterns when performing the challenging knot tying task, so that some of these misclassifications may in fact be justified.

It is noteworthy that both our 3D ConvNet and the ConvNet described by Ismail Fawaz et al. [14] outperform the ConvNet proposed by Wang and Fey [31]. One reason might be that [31] use a rather generic network architecture, while [14] draw on domain knowledge to design their network specifically for the skill classification task. Additionally, the ConvNet by [14] processes the complete time series of kinematic variables at once, while the ConvNet by [31] only considers short snippets with a duration of up to 3 s. Furthermore, [31] make the simplifying assumption that the skill label of a snippet is the same as the label assigned to the video containing that snippet. In contrast, we extend our ConvNet into a temporal segment network to avoid explicitly assigning skill labels to individual snippets.

The handcrafted feature based on approximate entropy (ApEn) [33] still outperforms our approach. However, we believe that learning-based methods have more potential to solve advanced problems for computer-assisted surgical training. One example would be the automatic detection of procedural errors, which cannot be detected based on dexterity features alone.

The results of the ablation studies (see Table 3) indicate that our method benefits from the combination of a 3D ConvNet architecture with TSN-based training. Using a 3D ConvNet enables spatiotemporal feature learning from video. Training the 3D ConvNet as TSN allows the network to look at multiple snippets of a video at a time, finding consensus to resolve contradictions and ambiguities of single snippets.

Pretraining the 3D ConvNet on the Kinetics dataset turned out to be the most important step. Indeed, a TSN that is trained from scratch achieves only 41% classification accuracy on the knot tying task. This drastic drop in performance cannot be explained by the fact that, due to memory limits, the TSN is trained using only \(K = 5\) segments. Rather, we observed a classic overfitting effect: the model converged and fit the training data perfectly but could not generalize to unseen examples.

The proposed approach still has its limitations. First, our model can only detect motion patterns that can be observed in rather short video snippets. It cannot explicitly model long-term temporal relations in surgical video. Second, as a deep learning-based method, our model is rather data-hungry and will not work without a comprehensive video database of surgical task performances that exhibit varying levels of surgical skill. Fortunately, it is relatively easy to record video of traditional or robot-assisted surgery or surgical training. Thus, we are optimistic that enough data can be collected and annotated to establish LOUO validity in future work. Another way to address data limitations could be the investigation of semi- and self-supervised learning algorithms.

Conclusion

In this paper, we describe a deep learning-based approach to assess technical surgical skill using video data only. Using the temporal segment network (TSN) framework, we fine-tune a pretrained 3D ConvNet on stacks of video frames or optical flow fields for the task of surgical skill classification.

Our method achieves competitive video-level classification accuracies of at least 95% on the JIGSAWS dataset, which contains recordings of basic tasks in robot-assisted minimally invasive surgery. This indicates that the 3D ConvNet is able to learn meaningful patterns from the data, which enable the differentiation between expert, intermediate, and novice surgeons.

Future work will investigate how to integrate RGB video data and optical flow into one framework to fully exploit the richness of video data for surgical skill assessment. The ultimate goal is to develop a thorough understanding of executions of surgical tasks, where we not only model surgical motion or dexterity but also the interactions between surgical actions and the environment as well as long-range temporal structure. A crucial step will be the extension of the datasets currently available for surgical skill evaluation in order to facilitate large-scale training and evaluation of deep neural networks. Especially the curation and annotation of collected video data will require a close collaboration with surgical educators.