A comparative analysis of multi-backbone Mask R-CNN for surgical tools detection

2020 International Joint Conference on Neural Networks (IJCNN)

Gioele Ciaparrone (gciaparrone@unisa.it), Francesco Bardozzo (fbardozzo@unisa.it), Mattia Delli Priscoli (mdellipriscoli@unisa.it), Juanita Londoño Kallewaard (j.londonokallewaa@studenti.unisa.it), Maycol Ruiz Zuluaga (m.ruizzuluaga@studenti.unisa.it), Roberto Tagliaferri (robtag@unisa.it)

Neuronelab, DISA-MIS, Università degli Studi di Salerno, Salerno, Italy.
Juanita Londoño Kallewaard and Maycol Ruiz Zuluaga are also with the Faculty of Engineering - UTP, Pereira, Colombia.

Abstract—Real-time surgical tool segmentation and tracking based on convolutional neural networks (CNN) has gained increasing interest in the field of mini-invasive surgery. In fact, the application of these novel artificial vision technologies allows both reducing surgical risks and increasing patient safety. Moreover, these types of models can be used both to track the tools and to detect markers or external artefacts in a real-time video stream. Multiple object detection and instance segmentation can be addressed efficiently by leveraging region-based CNN models. Thus, this work provides a comparison among state-of-the-art multi-backbone Mask R-CNNs to solve these tasks. Moreover, we show that such models can serve as a basis for tracking algorithms. The models were trained and tested on a dataset of 4955 manually annotated images, validated by 3 experts in the field. We tested 12 different combinations of CNN backbones and training hyperparameters. The results show that it is possible to employ a modern CNN to tackle the surgical tool detection problem, with the best-performing Mask R-CNN configuration achieving 87% Average Precision (AP) at Intersection over Union (IOU) 0.5.

I. INTRODUCTION

Real-time surgical tool segmentation based on CNNs has gained increasing interest in the field of mini-invasive surgery (MIS). MIS is adopted to reduce patient pain and post-operative complications. However, due to the limited operating space and insufficient feedback, the risk of causing damage to surfaces and internal organs is concrete. For this reason, considering the impressive results previously achieved with the use of deep learning in various tasks in the biomedical field [1]–[3], surgical tool detection and tracking based on CNNs are applied to provide an augmented perception to the surgeon, thereby reducing surgical risks and preserving patient safety. Therefore, real-time tool detection is an essential component to avoid operative and post-operative complications [4]. First, advanced computer vision approaches could be useful to detect external objects, such as surgical markers, and they can help solve the problems of identifying and localizing artefacts of different nature [5]–[12]. Second, hiding objects from the background by segmenting multiple objects is an important task in diagnostics and image-based investigations. In fact, if the region of interest is partially occluded by the presence of surgical instruments, it is possible to remove these tools from the frames.
For example, this is particularly useful in two types of registration, namely volume-to-surface or surface-to-surface alignment problems [13], and in 3D soft tissue reconstruction tasks [14]. Third, the segmentation of multiple objects could provide proximity measurements between the left and right robotic tools or between them and the background, by predicting their overlap and possible collisions [15]. To address these tasks, even though several CNN-based models (e.g. U-Net [16]) have been adopted, as far as we know, a comparative study on Mask R-CNN [17] models has never been conducted in this area of medical imaging. While models like U-Net perform semantic image segmentation, that is, they are able to segment the foreground and classify it into different classes, the main advantage of employing Mask R-CNN-like models lies in their ability to perform object instance segmentation, that is, identifying separate object instances of a target class and predicting their segmentation masks separately; so, while U-Net would only be able to detect pixels belonging to the "surgical tools" class, Mask R-CNN is also able to distinguish among different instances of such tools. In turn, this can be useful as a basis to track each of the tools.

The main contribution of this work is a comparison among different configurations of the state-of-the-art Mask R-CNN detectors for recognizing and segmenting endoscopic surgical tools. Moreover, we evaluate the robustness of these models under challenging conditions, such as low-resolution videos. The bounding boxes and masks output by the Mask R-CNN model were also used as input to a simple tracking algorithm, to evaluate the potential of implementing a full tracking pipeline using this model. In particular, our work suggests that it is possible to train a high-performance model for the detection and segmentation of surgical tools in endoscopic images, and that it can also serve as a base for a tool tracking algorithm.

Fig. 1: Pipeline of the implemented system, which consists of two main parts: the model training phase (data augmentation of the training images and training of Mask R-CNN with ResNet-50, ResNet-101 or ResNet-152 backbones) and the prediction phase (the trained weights are used to predict bounding boxes and masks on validation/test images, which are then used to track the objects through the frames).

The organization of this paper is as follows. In Section II the dataset collection, pre-processing and hand-curated image labelling procedures are described. Section III, after a description of the employed model, is divided into two sub-sections: sub-section III-A discusses the training procedure, while sub-section III-B describes the data augmentation procedure. In Section IV the experimental results are discussed and a simple tracking algorithm is presented to evaluate the potential of using the model as a base for a tool tracking pipeline. Finally, Section V sums up the results of this study and possible future directions of research.

II. DATASET

To build our dataset, we obtained a total of 4198 selected video frames with a resolution of 1920 × 1080 pixels from 13 high-quality endoscopic/laparoscopic videos, plus an additional 757 frames from a low-resolution video (384 × 192 pixels).
In particular, some of them contain noise in the form of superimposed written text or minimal graphical user interfaces. We chose to keep those noisy frames, as they can be useful to analyze whether the models are able to generalize to unseen conditions and unfiltered noisy frames. The dataset is divided into three parts: 3195 images for the training set and 290 images for the validation set were extracted from different sections of the same videos. To further test the generalization capabilities of the model, 713 frames from an unseen video have also been extracted and used as a test set. Finally, a low-resolution dataset (757 images) was used as a second test set to evaluate the model's robustness to drastic changes in resolution.

The images were manually annotated by marking each visible tool with a binary mask [18]–[20]. The annotation procedure was performed using VIA (VGG Image Annotator) [21], [22], and validated by 3 experts in the field. Figure 2 shows the annotation process of a single sample frame. The annotation process involved manually drawing the polygons delimiting each of the tools in every video frame. The class label "Tool" was assigned to every polygon, which represented a foreground mask during the training process; the rest of the image was considered background.

Fig. 2: Manual annotation of an example sample from the endoscopic dataset (2a: annotation process of a sample frame) and its resulting annotated mask (2b: foreground annotation sample). The process was performed using VIA (VGG Image Annotator) [21].

III. MODELS AND METHODS

Since deep neural networks are the most effective techniques currently available to solve object detection and instance segmentation tasks [23]–[27], we chose to employ the Mask R-CNN architecture [17] to detect and segment surgical tools. The Mask R-CNN model is an evolution of Faster R-CNN [28]. In addition to predicting the bounding boxes containing each instance of the target class(es) and a confidence score for each box, it can also compute a segmentation mask for each object instance. The Mask R-CNN architecture can be implemented using various backbone network structures, with varying numbers of layers and complexity. For this study, we chose to use three different backbone architectures: ResNet-50, ResNet-101 and ResNet-152 [29], where 50, 101 and 152, respectively, indicate the number of convolutional layers in the network. More information about the ResNet backbones (such as kernel and output sizes) can be found in the original ResNet article [29].

A. Training procedure

For the training procedure, we exploited transfer learning [30] by initializing the network with pre-trained weights obtained on the COCO dataset [31]. The original classification head was replaced by a 2-class classification head (background vs. surgical tools). To evaluate the best model training procedure and backbone architecture, we trained the models by varying different networks and hyperparameters: i) the backbone network employed, ii) the regularization parameter, iii) the number of epochs, iv) the use of data augmentation. In particular, as already mentioned, we used ResNet-50, ResNet-101 and ResNet-152 as backbone networks. Training was performed using Stochastic Gradient Descent (SGD) with learning rate 0.001 and momentum 0.9. We tested two values for the L2-regularization parameter [32] (0.0001, weaker regularization, and 0.001, stronger regularization) and for the number of epochs (25 or 30) [33].
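Section IV-A lists Keras, TensorFlow and a Mask R-CNN package [17] among the tools used; the sketch below shows how the training setup described above (COCO pre-trained weights, a 2-class head, SGD with learning rate 0.001 and momentum 0.9, L2 weight decay of 0.0001 or 0.001) could be expressed with a Matterport-style Mask R-CNN configuration. The class name, file paths and dataset objects are illustrative, and the paper does not state which specific implementation was used.

```python
# Hypothetical training setup, assuming a Matterport-style Mask R-CNN package
# (the paper cites the Mask R-CNN implementation [17] together with Keras and
# TensorFlow but does not name a specific code base). Names are illustrative.
from mrcnn.config import Config
from mrcnn import model as modellib


class SurgicalToolConfig(Config):
    NAME = "surgical_tools"
    NUM_CLASSES = 1 + 1        # background + the single "Tool" class
    BACKBONE = "resnet101"     # "resnet50" is also available out of the box;
                               # a ResNet-152 backbone would need a custom extension
    LEARNING_RATE = 0.001      # SGD learning rate reported in the paper
    LEARNING_MOMENTUM = 0.9    # SGD momentum reported in the paper
    WEIGHT_DECAY = 0.0001      # L2 regularization (0.001 in the "stronger" runs)
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1


def train_surgical_tool_model(dataset_train, dataset_val, epochs=25):
    """dataset_train / dataset_val: mrcnn Dataset objects built from the VIA polygons."""
    config = SurgicalToolConfig()
    model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")
    # Transfer learning: start from COCO weights and drop the class-dependent
    # heads so that they are re-initialized for the 2-class problem.
    model.load_weights("mask_rcnn_coco.h5", by_name=True,
                       exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                                "mrcnn_bbox", "mrcnn_mask"])
    model.train(dataset_train, dataset_val,
                learning_rate=config.LEARNING_RATE,
                epochs=epochs,          # 25 or 30 in the experiments
                layers="all")
    return model
```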
The different experimental setting combinations are shown in Table II, where Exp is the experiment number, Bb is the number of layers of the ResNet backbone, Reg is the regularization parameter, Ep is the number of epochs, Aug is the number of augmented images generated for each original training image (see Section III-B), and PC Spec is the hardware used for each training/validation/test procedure (see Section IV). The model's ability to generalize, its accuracy and its performance are assessed on the validation set and tested on the two test videos. The goodness of the model is evaluated using the well-known Average Precision (AP) metric, as described in Section IV.

TABLE II: Hyperparameter sets

Exp | Bb  | Reg       | Ep | Aug | PC Spec (Train / Val / Test)
1   | 101 | L2 0.0001 | 25 | 0   | 1 / 1 / 2
2   | 101 | L2 0.0001 | 25 | 2   | 1 / 1 / 2
3   | 101 | L2 0.0001 | 30 | 2   | 1 / 1 / 2
4   | 101 | L2 0.001  | 30 | 2   | 2 / 1 / 2
5   | 50  | L2 0.0001 | 25 | 0   | 1 / 1 / 2
6   | 50  | L2 0.0001 | 25 | 2   | 1 / 1 / 2
7   | 50  | L2 0.0001 | 30 | 2   | 2 / 1 / 2
8   | 50  | L2 0.001  | 30 | 2   | 2 / 1 / 2
9   | 152 | L2 0.0001 | 25 | 0   | 2 / 1 / 2
10  | 152 | L2 0.0001 | 25 | 2   | 1 / 1 / 2
11  | 152 | L2 0.0001 | 30 | 2   | 2 / 2 / 2
12  | 152 | L2 0.001  | 30 | 2   | 2 / 2 / 2

B. Data augmentation

The segmentation accuracy has been improved by applying data augmentation [34], a well-known technique adopted in real-world problems to improve model accuracy and reduce overfitting to the training set by presenting variations of the same image to the network during training. This helps the model to generalize to unseen images, thereby improving its performance on external data [35]. In this work, the following augmentation techniques have been applied: i) rotation, ii) scaling, iii) flipping, iv) perspective changes and v) linear contrast changes. More details about the augmentation parameters are provided in Table I. The effectiveness of data augmentation is proved by training the models with different backbones both with and without data augmentation. In Table II, a value of Aug = 0 indicates the absence of data augmentation, while a value of Aug = 2 means that the training was performed by adding to the dataset 2 augmented images for each original image in the training set, effectively growing the number of training images from 3195 to 9585. Each augmented image was generated by sequentially applying the previously described augmentation techniques, randomly selecting a value for each transformation parameter listed in Table I. As we will see, on our endoscopic/laparoscopic image dataset, data augmentation turns out to be relevant for an improved detection and segmentation accuracy for our best-performing model.

TABLE I: Data augmentation parameters

Technique              | Details
Rotation               | 10° clockwise and counterclockwise
Scaling                | Range [0.8, 1.2]
Flipping               | Vertical and horizontal
Perspective change     | Range [0.01, 0.1]
Linear contrast change | Range [0.8, 1.2]
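As a concrete illustration, the Table I parameters can be composed into an imgaug [39] pipeline as in the following sketch; the specific augmenter classes and the 0.5 flip probabilities are assumptions chosen to match the listed ranges, not necessarily the exact pipeline used by the authors.

```python
# Sketch of the Table I augmentations using the imgaug package [39]; the choice
# of augmenter classes and the flip probabilities are assumptions made to
# match the listed parameter ranges.
import imgaug.augmenters as iaa

augmentation = iaa.Sequential([
    iaa.Affine(rotate=(-10, 10)),                 # rotation: up to 10° clockwise/counterclockwise
    iaa.Affine(scale=(0.8, 1.2)),                 # scaling range [0.8, 1.2]
    iaa.Fliplr(0.5),                              # horizontal flipping
    iaa.Flipud(0.5),                              # vertical flipping
    iaa.PerspectiveTransform(scale=(0.01, 0.1)),  # perspective change range [0.01, 0.1]
    iaa.LinearContrast((0.8, 1.2)),               # linear contrast change range [0.8, 1.2]
])

# With Aug = 2, two augmented variants are generated per original training
# image, growing the training set from 3195 to 9585 images.
```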
IV. EXPERIMENTS AND RESULTS

As explained in the previous section, a set of experiments was performed using three versions of Mask R-CNN, trained with different hyperparameters (see Table II). Furthermore, in addition to the train and validation sets, two independent test sets were used. The former contains high-resolution images (1920 × 1080 px) but presents visual artefacts, specifically a navigation bar under the images. The latter has no virtual artefacts (such as surgical machine logos, medical notes or virtual markers) but is made of low-resolution images (384 × 192 px). This was done to test the robustness and generalization capabilities of the trained models.

Fig. 3: Example of a bad prediction (3a) and a good prediction (3b). The quality of the prediction is determined by the type and conditions of the analyzed image and by the quality of the trained model.

A. Implementation details

All the processes and experiments were developed in Python 3.7. The following packages were used to implement Mask R-CNN and for image processing: OpenCV [36], Mask R-CNN [17], TensorFlow-GPU [37], Keras [38] and imgaug [39]. The pycocotools package was used for evaluation [31]. All the experiments were performed on machines with two different hardware configurations: 1) CPU Intel Core i7-8700, 16 GB of RAM, GPU Nvidia GTX 1060, and 2) CPU Intel Core i7-8700, 16 GB of RAM, GPU Nvidia GTX 1070. The PC Spec column in Table II specifies which machine was used for each experiment. The training and inference times thus varied among experiments, according to the backbone and PC used. The fastest model was ResNet-50, as expected given its lower number of layers, with an inference time of around 0.4 seconds per image on the fastest machine, while the inference time reached 0.6 seconds per image when using ResNet-152 on the 1920 × 1080 px dataset. While this is still not enough for real-time performance on high-quality video, it shows encouraging results for future real-time applications on more specialized hardware.

B. Metrics

To evaluate the performance of all the trained models, the AP [40] was used, both for the bounding boxes and for the masks, as is common practice in object detection and instance segmentation [17], [27], [41]. We followed the COCO evaluation protocol and computed the AP at varying levels of bounding box/mask overlap (IOU). In particular, for each results table, we list six different metrics: APbb_50 is the Average Precision at bounding box IOU threshold 0.50, while APbb_75 is the AP at bounding box IOU threshold 0.75; APbb indicates the average AP computed at different IOU threshold levels, from 0.5 to 0.95, increasing in steps of 0.05. APmask_50, APmask_75 and APmask work similarly, but use mask IOU instead of bounding box IOU, in order to evaluate the accuracy of the mask branch of the network.
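Since the paper states that pycocotools [31] was used for evaluation, a minimal sketch of the standard COCO evaluation protocol is shown below; the file names are illustrative, and both the ground truth and the predictions are assumed to have been exported in COCO JSON format.

```python
# Minimal sketch of the COCO evaluation protocol with pycocotools [31]; file
# names are illustrative, and both the ground truth and the Mask R-CNN
# detections are assumed to be stored in COCO JSON format.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations_val.json")              # ground-truth annotations
coco_dt = coco_gt.loadRes("detections_val.json")    # predicted boxes/masks

for iou_type in ("bbox", "segm"):                   # box AP and mask AP
    coco_eval = COCOeval(coco_gt, coco_dt, iou_type)
    coco_eval.evaluate()
    coco_eval.accumulate()
    # summarize() prints the AP averaged over IOU 0.50:0.95 as well as the AP
    # at IOU 0.50 and 0.75, i.e. the metrics reported in Tables III-V.
    coco_eval.summarize()
```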
C. Results and discussion

The results of the analysis are shown in Table III for the validation set, Table IV for the high-resolution test set, and Table V for the low-resolution test set. According to the presented results, on all considered datasets the Mask R-CNN with ResNet-101 backbone obtained the best performance in the segmentation of surgical tools in endoscopic/laparoscopic images, reaching 92% AP on both boxes and masks at IOU threshold 0.50 on the validation set, and 87% on the high-resolution test set. The ResNet-152 backbone also presented good performance on the high-resolution test set, reaching 86% AP at IOU threshold 0.50 with both boxes and masks. We also notice that the AP on boxes and on masks are usually highly correlated, showing that whenever a box is correctly identified on a given tool, the mask is also often correctly predicted. As expected, the AP is lower at a higher IOU threshold, but still relatively good: when the IOU overlap threshold is set to 0.75, it reaches 80% and 76% on boxes on the validation and high-resolution test sets respectively, and 79% and 70% on masks.

TABLE III: RESULTS ON THE VALIDATION SET

Experiment | APbb | APbb_50 | APbb_75 | APmask | APmask_50 | APmask_75
(1) RN101  | 0.63 | 0.90 | 0.72 | 0.58 | 0.90 | 0.70
(2) RN101  | 0.68 | 0.92 | 0.80 | 0.64 | 0.92 | 0.79
(3) RN101  | 0.52 | 0.67 | 0.61 | 0.48 | 0.67 | 0.58
(4) RN101  | 0.46 | 0.58 | 0.53 | 0.42 | 0.58 | 0.51
(5) RN50   | 0.45 | 0.56 | 0.53 | 0.42 | 0.56 | 0.51
(6) RN50   | 0.42 | 0.53 | 0.49 | 0.39 | 0.52 | 0.48
(7) RN50   | 0.41 | 0.50 | 0.48 | 0.38 | 0.51 | 0.47
(8) RN50   | 0.40 | 0.49 | 0.47 | 0.37 | 0.49 | 0.45
(9) RN152  | 0.39 | 0.49 | 0.46 | 0.37 | 0.48 | 0.45
(10) RN152 | 0.39 | 0.48 | 0.46 | 0.36 | 0.48 | 0.44
(11) RN152 | 0.39 | 0.48 | 0.45 | 0.36 | 0.48 | 0.44
(12) RN152 | 0.39 | 0.48 | 0.45 | 0.36 | 0.48 | 0.45

TABLE IV: RESULTS ON THE HIGH-RESOLUTION TEST SET

Experiment | APbb | APbb_50 | APbb_75 | APmask | APmask_50 | APmask_75
(1) RN101  | 0.57 | 0.86 | 0.71 | 0.55 | 0.85 | 0.68
(2) RN101  | 0.59 | 0.87 | 0.76 | 0.57 | 0.87 | 0.70
(3) RN101  | 0.50 | 0.71 | 0.64 | 0.46 | 0.71 | 0.57
(4) RN101  | 0.43 | 0.59 | 0.54 | 0.39 | 0.59 | 0.49
(5) RN50   | 0.40 | 0.56 | 0.50 | 0.37 | 0.55 | 0.45
(6) RN50   | 0.35 | 0.52 | 0.43 | 0.34 | 0.52 | 0.42
(7) RN50   | 0.37 | 0.54 | 0.45 | 0.36 | 0.55 | 0.45
(8) RN50   | 0.33 | 0.47 | 0.39 | 0.32 | 0.48 | 0.39
(9) RN152  | 0.56 | 0.86 | 0.67 | 0.52 | 0.86 | 0.65
(10) RN152 | 0.58 | 0.85 | 0.73 | 0.52 | 0.85 | 0.64
(11) RN152 | 0.48 | 0.66 | 0.58 | 0.43 | 0.65 | 0.55
(12) RN152 | 0.45 | 0.61 | 0.56 | 0.40 | 0.60 | 0.52

TABLE V: RESULTS ON THE LOW-RESOLUTION TEST SET

Experiment | APbb | APbb_50 | APbb_75 | APmask | APmask_50 | APmask_75
(1) RN101  | 0.18 | 0.40 | 0.10 | 0.09 | 0.28 | 0.04
(2) RN101  | 0.26 | 0.49 | 0.25 | 0.24 | 0.47 | 0.22
(3) RN101  | 0.24 | 0.42 | 0.26 | 0.21 | 0.39 | 0.21
(4) RN101  | 0.22 | 0.39 | 0.24 | 0.20 | 0.37 | 0.20
(5) RN50   | 0.22 | 0.39 | 0.25 | 0.20 | 0.37 | 0.20
(6) RN50   | 0.22 | 0.39 | 0.24 | 0.20 | 0.37 | 0.20
(7) RN50   | 0.22 | 0.38 | 0.23 | 0.20 | 0.37 | 0.20
(8) RN50   | 0.21 | 0.38 | 0.23 | 0.20 | 0.37 | 0.21
(9) RN152  | 0.21 | 0.38 | 0.22 | 0.20 | 0.37 | 0.20
(10) RN152 | 0.21 | 0.37 | 0.22 | 0.20 | 0.36 | 0.20
(11) RN152 | 0.21 | 0.36 | 0.24 | 0.19 | 0.35 | 0.19
(12) RN152 | 0.21 | 0.35 | 0.24 | 0.19 | 0.34 | 0.19

Despite the visual artefacts, the results on the high-resolution test set show that Mask R-CNN is able to generalize well and to exclude those artefacts from the segmentation. At the same time, the performance on the low-resolution test set degraded for all the models, with the highest APbb_50 being 49% and the highest APmask_50 reaching 47%. While those scores are worse than on the other datasets, they are still acceptable and show once again that the network trained in experiment 2 did not overfit the training set. In general, we found that the quality of the prediction varies depending on the scene, the quality of the image and the position of the tools (e.g. tool occlusions), with the worst results obtained in situations that were not present in the training set, such as crossing tools, or when an organ or tissue in the image has a similar color and shape to a tool. In Figure 3, examples of good and bad predictions are shown. The most common errors include predicting wrong elements as tools and errors in the prediction of overlapping tools.

Fig. 4: Mask R-CNN prediction examples vs ground truth. True positives: cyan, false positives: magenta, false negatives: yellow, true negatives: green. Image 4a shows the output from the model trained in experiment number 2 (best performance), while image 4b shows the output from model number 11 (worst performance on the validation set).

In Figure 4 it is possible to observe the slight difference between the mask predictions and the ground truth on the same image for two different models. True/false positives and true/false negatives are highlighted in the image.
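As an aside, a Figure 4-style overlay can be derived directly from a predicted and a ground-truth binary mask; the helper below is purely illustrative (it is not part of the paper's pipeline) and uses the color coding given in the caption.

```python
# Illustrative helper (not part of the paper's pipeline) for a Figure 4-style
# visualization: each pixel of a predicted binary mask is classified against
# the ground truth and painted with the colors listed in the caption.
import numpy as np


def pixel_confusion_overlay(pred_mask, gt_mask):
    """Return an RGB image coloring TP/FP/FN/TN pixels of a binary mask pair."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    overlay = np.zeros(pred.shape + (3,), dtype=np.uint8)
    overlay[pred & gt] = (0, 255, 255)       # true positives  -> cyan
    overlay[pred & ~gt] = (255, 0, 255)      # false positives -> magenta
    overlay[~pred & gt] = (255, 255, 0)      # false negatives -> yellow
    overlay[~pred & ~gt] = (0, 255, 0)       # true negatives  -> green
    return overlay
```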
Regarding data augmentation, by comparing the results of experiments 1-2, 5-6 and 9-10, we can see that the ResNet-101-based model shows an improvement on all datasets when trained on the dataset with augmented images, highlighting the importance of this technique. The other models did not benefit from data augmentation. By comparing experiments 3-4, 7-8 and 11-12, we can also see that all the models seem to prefer the weaker L2 regularization parameter of 0.0001.

D. Segmentation qualitative analysis as a base for tracking methods

In order to evaluate the possibility of employing the segmentation output as a basis for surgical tool tracking, a basic tracking algorithm was implemented. Since annotations for the tracking task are not currently available, a qualitative analysis was performed by experts on the output of the tracking algorithm. The implemented tracking procedure is based on a two-phase comparison: i) evaluation of the proximity of the bounding boxes, and ii) computation of the structural similarity index (SSIM) [42] between the two frame-adjacent masks. The between-frames bounding box proximity is computed between adjacent frames and is a necessary condition for applying the segmentation-based tool similarity comparison. In particular, for each tool in a new frame, a comparison is made with the tools detected in the previous frame. If the Euclidean distance between the centers of the bounding boxes is less than a certain threshold (empirically fixed to 150 pixels in the high-resolution case) on both axes, the masks of the selected boxes are compared with SSIM against a second threshold. In our case, the bounding boxes are considered to outline the same segmented tool only if SSIM(m1, m2) ≥ 0.95, with m1 and m2 being the two compared masks. If the tool is not found in the previous frame, it is compared to a pool of previously encountered tools, in order to recover the identity of tools that have disappeared in the previous frames (or that have not been detected by the Mask R-CNN). The two previously described comparisons are then repeated using the previously saved bounding boxes and masks to try to find a match. If a match is not found, the tool is considered a new tool and is added to the pool of saved tools. We decided to limit the size of the pool to 5. Both the Euclidean distance threshold and the SSIM threshold were chosen empirically. The choice of using the SSIM was inspired by recent works that employed it in other segmentation tasks [43] and by its use in some tracking algorithms [44].

Fig. 5: The pipeline to track the tools in consecutive frames. The process is performed by computing the similarity among the binary masks of the tools and the between-frames positional center similarities. In a) the Euclidean distances between the bounding box centers are computed. Then in b) a test with the structural similarity index (SSIM) between the masks of the tools is performed. If the test is not passed, the tool is then compared to the last 5 saved tools (c), as in a) and b). The tools are considered the same, and are visualized with the same colors, if and only if the Euclidean distance between the centers is below a certain threshold (150 px in our case) and the SSIM is greater than or equal to a second threshold (0.95 in our example). When a new tool is found, it is added to the saved pool of tools.
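A condensed sketch of this two-phase matching rule is given below, using the empirical thresholds reported above (150 px and 0.95); the function names are illustrative, and the "on both axes" distance condition is interpreted here as a per-axis check on the box centers.

```python
# Sketch of the two-phase matching rule described above; function names are
# illustrative and the thresholds are the empirical values reported in the text.
import numpy as np
from skimage.metrics import structural_similarity  # compare_ssim in older skimage releases

CENTER_THRESHOLD_PX = 150   # maximum center displacement (high-resolution case)
SSIM_THRESHOLD = 0.95       # minimum structural similarity between binary masks


def box_center(box):
    """Center (x, y) of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])


def same_tool(box_a, mask_a, box_b, mask_b):
    """Decide whether two detections in adjacent frames are the same tool."""
    # Phase 1: bounding box proximity (necessary condition), checked per axis.
    if np.any(np.abs(box_center(box_a) - box_center(box_b)) > CENTER_THRESHOLD_PX):
        return False
    # Phase 2: structural similarity between the two binary masks.
    ssim = structural_similarity(mask_a.astype(np.uint8),
                                 mask_b.astype(np.uint8),
                                 data_range=1)
    return ssim >= SSIM_THRESHOLD
```

When no match is found in the previous frame, the same test would be repeated against the pool of the last 5 saved tools, as described above.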
The SSIM was computed using the compare_ssim function available in the skimage Python package. In short, the tracking algorithm recognizes the shape of similar segmented tools that are close in space in adjacent frames and assigns the same identity to such tools. In Figure 6 a sequence of surgical tools is segmented and tracked by applying this methodology. A different color is used to distinguish the different tools. In this specific scene, four different tools show up at different times in the sequence. The identities of the masks have been checked along the whole sequence by three experts, and the tracking output was judged to be of good quality. A quantitative tracking evaluation is planned to be performed in the future.

Fig. 6: Example of the results of the tracking algorithm; each tool is recognized and segmented using a specific color. From frame 6a to frame 6d two objects are recognized (in gray and purple); the same objects are correctly recognized in frames 6f and 6h. In frames 6e and 6g, two identical tools alternate with the previous tools and are assigned different colors (black and green).

V. CONCLUSION

We have proposed the use of Mask R-CNN to detect and segment surgical tools in endoscopic/laparoscopic images. We trained and evaluated Mask R-CNN with different backbone structures (ResNet-50, ResNet-101 and ResNet-152), along with different data augmentation techniques and hyperparameter tuning. After evaluation, the proposed approach shows good potential for use in endoscopic surgical tool detection and segmentation, as well as being a solid base for the implementation of a tracking algorithm. The best results on our dataset were obtained by training the network with a ResNet-101 backbone for 25 epochs. Moreover, we showed that the trained model was robust to image artefacts and could still work reasonably well on low-resolution images. However, the model still presented some limitations and failure cases, opening the way to possible future improvements, such as the use of a bigger and richer training dataset. A quantitative tracking evaluation, along with more complex tracking algorithms, should also be explored in future research.

REFERENCES

[1] N. Mammone, C. Ieracitano, and F. C. Morabito, "A deep CNN approach to decode motor preparation of upper limbs from time–frequency maps of EEG signals at source level," Neural Networks, vol. 124, pp. 357–372, 2020.
[2] C. Ieracitano, N. Mammone, A. Bramanti, A. Hussain, and F. C. Morabito, "A convolutional neural network approach for classification of dementia stages based on 2D-spectral representation of EEG recordings," Neurocomputing, vol. 323, pp. 96–107, 2019.
[3] M. Zhou, C. Tian, R. Cao, B. Wang, Y. Niu, T. Hu, H. Guo, and J. Xiang, "Epileptic seizure detection based on EEG signals and CNN," Frontiers in Neuroinformatics, vol. 12, p. 95, 2018.
[4] B. Choi, K. Jo, S. Choi, and J. Choi, "Surgical-tools detection based on convolutional neural network in laparoscopic robot-assisted surgery," in 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2017, pp. 1756–1759.
[5] J. Kang and J. Gwak, "Ensemble of instance segmentation models for polyp segmentation in colonoscopy images," IEEE Access, vol. 7, pp. 26440–26447, 2019.
[6] X. Mo, K. Tao, Q. Wang, and G. Wang, "An efficient approach for polyps detection in endoscopic videos based on Faster R-CNN," in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 3929–3934.
[7] Y. Shin, H. A. Qadir, L. Aabakken, J. Bergsland, and I. Balasingham, "Automatic colon polyp detection using region based deep CNN and post learning approaches," IEEE Access, vol. 6, pp. 40950–40962, 2018.
[8] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, "Multi-digit number recognition from Street View imagery using deep convolutional neural networks," arXiv preprint arXiv:1312.6082, 2013.
[9] K. Rohit Malhotra, A. Davoudi, S. Siegel, A. Bihorac, and P. Rashidi, "Autonomous detection of disruptions in the intensive care unit using deep Mask R-CNN," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1863–1865.
[10] S. Ali, F. Zhou, C. Daul, B. Braden, A. Bailey, S. Realdon, J. East, G. Wagnières, V. Loschenov, E. Grisan et al., "Endoscopy artifact detection (EAD 2019) challenge dataset," arXiv preprint arXiv:1905.03209, 2019.
[11] S. Ali, F. Zhou, A. Bailey, B. Braden, J. East, X. Lu, and J. Rittscher, "A deep learning framework for quality assessment and restoration in video endoscopy," arXiv preprint arXiv:1904.07073, 2019.
[12] J. Hung and A. Carpenter, "Applying Faster R-CNN for object detection on malaria images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 56–61.
[13] S. Bernhardt, S. A. Nicolau, L. Soler, and C. Doignon, "The status of augmented reality in laparoscopic surgery as of 2016," Medical Image Analysis, vol. 37, pp. 66–90, 2017.
[14] J. Kowalczuk, A. Meyer, J. Carlson, E. T. Psota, S. Buettner, L. C. Pérez, S. M. Farritor, and D. Oleynikov, "Real-time three-dimensional soft tissue reconstruction for laparoscopic surgery," Surgical Endoscopy, vol. 26, no. 12, pp. 3413–3417, 2012.
[15] M.-C. Dy, K. Tagawa, H. T. Tanaka, and M. Komori, "Method in collision detection and interaction between rigid surgical tools and deformable organs," in SIGGRAPH Asia 2014 Posters, 2014, pp. 1–1.
[16] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[18] Q. You, J. Luo, H. Jin, and J. Yang, "Building a large scale dataset for image emotion recognition: The fine print and the benchmark," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[19] N. Murray, L. Marchesotti, and F. Perronnin, "AVA: A large-scale database for aesthetic visual analysis," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2408–2415.
[20] D. Acuna, H. Ling, A. Kar, and S. Fidler, "Efficient interactive annotation of segmentation datasets with Polygon-RNN++," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 859–868.
[21] A. Dutta and A. Zisserman, "The VIA annotation software for images, audio and video," in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 2276–2279.
[22] C. Zhang, K. Loken, Z. Chen, Z. Xiao, and G. Kunkel, "Mask Editor: an image annotation tool for image segmentation tasks," arXiv preprint arXiv:1809.06461, 2018.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[24] M. Seyedhosseini, M. Sajjadi, and T. Tasdizen, "Image segmentation with cascaded hierarchical models and logistic disjunctive normal networks," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2168–2175.
[25] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[27] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, "Fully convolutional instance-aware semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2359–2367.
[28] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[30] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, "A survey on deep transfer learning," in Artificial Neural Networks and Machine Learning – ICANN 2018, V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis, Eds. Cham: Springer International Publishing, 2018, pp. 270–279.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[32] K. Yu, W. Xu, and Y. Gong, "Deep learning with kernel regularization for visual recognition," in Advances in Neural Information Processing Systems, 2009, pp. 1889–1896.
[33] R. Rawat, J. K. Patel, and M. T. Manry, "Minimizing validation error with respect to network size and number of training epochs," in The 2013 International Joint Conference on Neural Networks (IJCNN). IEEE, 2013, pp. 1–7.
[34] D. A. Van Dyk and X.-L. Meng, "The art of data augmentation," Journal of Computational and Graphical Statistics, vol. 10, no. 1, pp. 1–50, 2001.
[35] L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning," arXiv preprint arXiv:1712.04621, 2017.
[36] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.
[37] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[38] F. Chollet et al., "Keras," https://keras.io, 2015.
[39] A. B. Jung, "imgaug," https://github.com/aleju/imgaug, 2018, [Online; accessed 30-Oct-2018].
[40] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. USA: McGraw-Hill, Inc., 1986.
[41] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in European Conference on Computer Vision. Springer, 2014, pp. 297–312.
[42] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2. IEEE, 2003, pp. 1398–1402.
[43] S. Zhao, B. Wu, W. Chu, Y. Hu, and D. Cai, "Correlation maximized structural similarity loss for semantic segmentation," arXiv preprint arXiv:1910.08711, 2019.
[44] A. Loza, L. Mihaylova, N. Canagarajah, and D. Bull, "Structural similarity-based object tracking in video sequences," in 2006 9th International Conference on Information Fusion. IEEE, 2006, pp. 1–6.