
Towards a Generic Diver-Following Algorithm: Balancing Robustness and Efficiency in Deep Visual Detection

Md Jahidul Islam, Michael Fulton, and Junaed Sattar

The authors are with the Interactive Robotics and Vision Laboratory, Department of Computer Science and Engineering, University of Minnesota, Twin Cities, US. {islam034, fulto081, junaed}@umn.edu
Manuscript received: September 11, 2018; Accepted: November 5, 2018. This paper was recommended for publication by Editor Jonathan Roberts upon evaluation of the Associate Editor and Reviewers' comments.

Abstract—This paper explores the design and development of a class of robust diver detection algorithms for autonomous diver-following applications. By considering the operational challenges for underwater visual tracking in diverse real-world settings, we formulate a set of desired features of a generic diver-following algorithm. We attempt to accommodate these features and maximize general tracking performance by exploiting the state-of-the-art deep object detection models. We fine-tune the building blocks of these models with a goal of balancing the trade-off between robustness and efficiency in an on-board setting under real-time constraints. Subsequently, we design an architecturally simple Convolutional Neural Network (CNN)-based diver detection model that is much faster than the state-of-the-art deep models yet provides comparable detection performances. In addition, we validate the performance and effectiveness of the proposed model through a number of diver-following experiments in closed-water and open-water environments.

Index Terms—Human Detection and Tracking; Field Robots; Marine Robotics

I. INTRODUCTION

Underwater applications of autonomous underwater robots range from inspection and surveillance to data collection and mapping tasks. Such missions often require a team of divers and robots to collaborate for successful completion. Without sacrificing the generality of such applications, we can consider a single-robot setting where a human diver leads the task and interacts with the robot, which follows the diver at certain stages of the mission. Such situations arise in numerous important applications such as submarine pipeline and ship-wreck inspection, marine life and seabed monitoring, and many other underwater exploration activities [1]. Although following the diver is not the primary objective in these applications, it significantly simplifies the operational loop and reduces the associated overhead by eliminating the necessity of tele-operation.

Fig. 1: Snapshots of a set of diverse first-person views of the robot from different diver-following scenarios. Notice the variation in appearances of the divers and possible noise or disturbances in the scene over different scenarios. The rectangles and text overlaid on the figures are the outputs generated by our model at test time.

Robust underwater visual perception is generally challenging due to marine artifacts such as poor visibility, variations in illumination, suspended particles, etc. Additionally, color distortion and scarcity of salient visual features make it harder to robustly detect and accurately follow a diver in arbitrary directions. Moreover, divers' appearances to the robot vary greatly based on their swimming styles, choices of wearables, and relative orientations with respect to the robot. These problems are exacerbated underwater since both the robot and diver are suspended in a six-degrees-of-freedom (6DOF) environment. Consequently, classical model-based detection algorithms fail to achieve good generalization performance [2]. On the other hand, model-free algorithms incur significant target drift [3] under such noisy conditions.

In this paper, we address the inherent difficulties of underwater visual detection by designing a class of diver detection algorithms that are: a) invariant to color (of divers' body/wearables [4]), b) invariant to divers' relative motion and orientation, c) robust to noise and image distortions [5], and d) reasonably efficient for real-time deployment. We exploit the current state-of-the-art object detectors to accommodate these features and maximize the generalization performance for diver detection using RGB images as input.


Specifically, we use the following four models: Faster R-CNN [6] with Inception V2 [7] as a feature extractor, Single Shot MultiBox Detector (SSD) [8] with MobileNet V2 [9], [10] as a feature extractor, You Only Look Once (YOLO) V2 [11], and Tiny YOLO [12]. These are the fastest (in terms of processing time for a single frame) among the family of current state-of-the-art models [13] for general object detection. We train these models using a rigorously prepared dataset containing sufficient training instances to capture the variabilities of underwater visual sensing.

Subsequently, we design an architecturally simple (i.e., sparse) CNN-based model that is computationally much faster than the state-of-the-art diver detection models. The faster running time ensures real-time tracking performance with limited on-board computational resources. We also demonstrate its effectiveness in terms of detection performance compared to the state-of-the-art models through extensive quantitative experiments. We then validate these results with a series of field experiments. Based on our design, implementation, and experimental findings, we make the following contributions in this paper:

• We attempt to overcome the limitations of existing model-based diver-following algorithms by leveraging state-of-the-art deep object detection models. These models are trained on comprehensive datasets to deal with the challenges involved in underwater visual perception (the dataset and trained models are available for academic research purposes).
• In addition, we design a CNN-based diver detection model to balance the trade-offs between robustness and efficiency. The proposed model provides considerably faster running time, in addition to achieving detection performances comparable to the state-of-the-art models.
• Finally, we validate the effectiveness of the proposed diver detection models through extensive experimental evaluations. A number of diver-following experiments are performed both in open-water and closed-water (i.e., oceans and pools, respectively) environments in order to demonstrate their real-time tracking performances.

Furthermore, we demonstrate that the proposed models can be extended for a wide range of other applications such as human-robot communication [14], robot convoying [3], cooperative localization [15], [16], etc. The state-of-the-art detection performance, fast running time, and architectural portability are the key features of these models, which make them suitable for underwater human-robot collaborative applications.

II. RELATED WORK

A categorization of the visual perception techniques that are commonly used for autonomous diver-following is illustrated in Figure 2. Based on the algorithmic usage of the input features, the perception techniques can be classified as feature-based tracking, feature-based learning, or feature/representation learning algorithms. On the other hand, they can be categorized as model-based or model-free techniques based on whether or not any prior knowledge about the appearance or motion of the diver is used for tracking.

Fig. 2: An algorithmic categorization of the visual perception techniques used for diver-following [17]: a feature perspective (feature-based tracking, feature-based learning, feature or representation learning) and a model perspective (model-based, model-free).

A. Model Perspective

In model-free algorithms, no prior information about the target (e.g., the diver's motion model, color of wearables, etc.) is used for tracking. These algorithms are initialized arbitrarily and then iteratively learn to track the target in a semi-supervised fashion [18]. TLD ("tracking-learning-detection") trackers [19] and optical flow-based trackers [20] are the most commonly used model-free algorithms for general object tracking. The TLD trackers train a detector using positive and negative feedback obtained from image-based features. In contrast, the optical flow-based methods estimate the motion of each pixel by solving the Horn and Schunck formulation [21]. Although model-free techniques work reasonably well in practice for general object tracking, they often suffer from tracking drift caused by the accumulation of detection errors over time.
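As a concrete illustration of the model-free, optical flow-based idea described above, the sketch below tracks sparse feature points across consecutive frames with OpenCV's pyramidal Lucas-Kanade tracker (rather than the Horn and Schunck formulation cited above) and uses the median point displacement as a crude estimate of the target's image motion. The video path, feature-detector settings, and the median-displacement heuristic are illustrative assumptions, not part of the systems surveyed here.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("dive_footage.mp4")   # hypothetical input clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=7)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Lucas-Kanade sparse optical flow: estimate where each tracked point moved
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, prev_pts, None)
    good_new = next_pts[status.flatten() == 1]
    good_old = prev_pts[status.flatten() == 1]
    # median displacement of the surviving points approximates the target's image motion
    dx, dy = np.median((good_new - good_old).reshape(-1, 2), axis=0)
    prev_gray, prev_pts = gray, good_new.reshape(-1, 1, 2)
```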
On the other hand, model-based algorithms use prior knowledge about the divers' motion and appearances in order to formulate a model in the input feature-space. Iterative search methods are then applied to find the target model in the feature-space [22]. Machine learning techniques are also widely used to learn diver-specific features [17], [23] and predict the target location in the feature-space. The performance of such model-based algorithms depends on the comprehensiveness of the model descriptors and the underlying input feature-space. Hence, they require careful design and thorough training processes to ensure good tracking performance.

B. Feature Perspective

Simple feature-based trackers [24], [25] are often practical choices for autonomous diver-following due to their operational simplicity and computational efficiency. For instance, color-based trackers perform binary image thresholding based on the color of a diver's flippers or suit. The thresholded binary image is then refined to track the centroid of the target (diver) using algorithms such as mean-shift, particle filters, etc. Optical flow-based methods can also be utilized to track divers' motion in the spatio-temporal volume [17], [21].

Since color distortions and low-visibility issues are common in underwater settings, frequency-domain signatures of divers' swimming patterns are often used for reliable detection. Specifically, intensity variations in the spatio-temporal volume caused by a diver's swimming gait generate identifiable high-energy responses in the 1-2 Hz frequency range, which can be used for diver detection [26].


Moreover, the frequency-domain signatures can be combined with the spatial-domain features for robust diver tracking. For instance, in [22], a Hidden Markov Model (HMM) is used to track divers' potential swimming trajectories in the spatio-temporal domain, and then frequency-domain features are utilized to detect the diver along those trajectories.
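To make the frequency-domain idea concrete, the following sketch computes the relative spectral energy of an intensity signal in the 1-2 Hz band; patches of the spatio-temporal volume with high band energy are candidate flipper/gait regions. This is a minimal illustration of the principle behind [26], not the authors' implementation; the function name, the patch-averaged input, and the normalization are assumptions.

```python
import numpy as np

def relative_gait_energy(intensity, fps):
    """Fraction of spectral power in the 1-2 Hz band of a patch-intensity signal.

    `intensity` is a 1-D array, e.g., the mean intensity of an image patch
    sampled over consecutive frames of the spatio-temporal volume.
    """
    signal = intensity - np.mean(intensity)            # drop the DC component
    power = np.abs(np.fft.rfft(signal)) ** 2           # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    band = (freqs >= 1.0) & (freqs <= 2.0)
    return power[band].sum() / max(power.sum(), 1e-9)

# Example: a 4 s window at 15 frames per second containing a ~1.5 Hz oscillation
t = np.arange(60) / 15.0
print(relative_gait_energy(100 + 10 * np.sin(2 * np.pi * 1.5 * t), fps=15))
```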
Another class of approaches uses machine learning techniques to approximate the underlying function that relates the input feature-space to the target model of the diver. For instance, Support Vector Machines (SVMs) are trained using Histogram of Oriented Gradients (HOG) features [27] for robust person detection in general. Ensemble methods such as Adaptive Boosting (AdaBoost) [23] are also widely used, as they are computationally inexpensive yet highly accurate in practice; AdaBoost learns a strong tracker from a large number of simple feature-based diver trackers. Several other machine learning techniques have been investigated for diver tracking and underwater object tracking in general [17]. One major challenge in using these models is to design a set of robust features that are invariant to noise, lighting conditions, and other variabilities such as divers' swimming motion and wearables.
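For reference, the snippet below runs OpenCV's stock pedestrian detector, a HOG feature extractor paired with a pre-trained linear SVM, which is the same HOG+SVM pattern referenced above [27]; a diver-specific detector would instead be trained on annotated diver images. The input file name and the confidence cut-off are placeholders.

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("frame.jpg")                      # hypothetical camera frame
boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8), padding=(8, 8), scale=1.05)
for (x, y, w, h), score in zip(boxes, weights):
    if float(score) > 0.5:                            # arbitrary confidence cut-off
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```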
Convolutional Neural Network (CNN)-based deep models improve generalization performance by learning a feature representation from the image-space. The extracted features are used as inputs to the detector (i.e., the fully-connected layers); this end-to-end training process significantly improves detection performance compared to using hand-crafted features. Once trained with sufficient data, these models are quite robust to occlusion, noise, and color distortions [3]. Despite the robust performance, the applicability of these models to real-time applications is often limited due to their slow running time on embedded devices. In this paper, we investigate the performances and feasibilities of the state-of-the-art deep object detectors for diver-following applications. We also design a CNN-based model that achieves robust detection performance in addition to ensuring that the real-time operating constraints are met.

III. NETWORK ARCHITECTURE AND DESIGN

A. State-of-the-art Object Detectors

We use a Faster R-CNN model, two YOLO models, and an SSD model for diver detection. These are end-to-end trainable models that provide state-of-the-art performances on standard object detection datasets; we refer to [12], [13] for detailed comparisons of their detection performances and running times. As outlined in Figure 3, we now briefly discuss their methodologies and the related design choices in terms of major computational components.

1) Faster R-CNN with Inception V2: Faster R-CNN [6] is an improvement of R-CNN [28] that introduces a Region Proposal Network (RPN) to make the whole object detection network end-to-end trainable. The RPN uses the last convolutional feature-maps to produce region proposals, which are then fed to the fully connected layers for the final detection. The original implementation of Faster R-CNN uses the VGG-16 [29] model for feature extraction. However, we use the Inception V2 [7] model for feature extraction instead, as it is known to provide better object detection performances on standard datasets [13].

2) YOLO V2 and Tiny YOLO: YOLO models [30], [11] formulate object detection as a regression problem in order to avoid using computationally expensive RPNs. They divide the image-space into rectangular grids and predict a fixed number of bounding boxes, their corresponding confidence scores, and class probabilities. Although there are restrictions on the maximum number of object categories, they perform faster than the standard RPN-based object detectors. Tiny YOLO [12] is a scaled-down version of the original model with sparser layers; it runs much faster than the original model but sacrifices detection accuracy in the process.

3) SSD with MobileNet V2: SSD (Single-Shot Detector) [8] also performs object localization and classification in a single pass of the network, using the regression trick of the YOLO [30] model. The architectural difference of SSD from YOLO is that it introduces additional convolutional layers at the end of a base network, which results in improved performances. In our implementation, we use MobileNet V2 [9] as the base network to ensure faster running time.

B. Proposed CNN-based Model

Figure 4 shows a schematic diagram of the proposed CNN-based diver detection model. It consists of three major parts: a convolutional block, a regressor block, and a classifier block. The convolutional block consists of five layers, whereas the classifier and regressor blocks each consist of three fully connected layers. Detailed network parameters and dimensions are specified in Table I.

1) Design Intuition: The state-of-the-art deep visual models are designed for general applications and are trained on standard datasets having a large number of object categories. However, for most underwater human-robot collaborative applications including diver-following, only a few object categories (e.g., diver, robot, coral reefs, etc.) are relevant. We try to take advantage of this by designing an architecturally simpler model that ensures much faster running time on an embedded platform in addition to providing robust detection performance. The underlying design intuitions can be summarized as follows:

• The proposed model demonstrated in Figure 4 is particularly designed for detecting a single diver. Five convolutional layers are used to extract the spatial features in the RGB image-space by learning a set of convolutional kernels.
• The extracted features are then fed to the classifier and regressor block for detecting a diver and localizing the corresponding bounding box, respectively. Both the classifier and regressor block consist of three fully connected layers.
• Therefore, the task of the regressor block is to locate a potential diver in the image-space, whereas the classifier block provides the confidence scores associated with that detection.


Fig. 3: Schematic diagrams of the deep visual models used for diver detection: (a) Faster R-CNN with Inception V2, (b) YOLO V2 and Tiny YOLO, (c) SSD with MobileNet V2.

Fig. 4: A schematic diagram of the proposed CNN-based model for detecting a single diver in the image-space.

TABLE I: Parameters and dimensions of the CNN model outlined in Figure 4 (convolutional block: conv1-conv5; classifier block: fc1-fc3; regressor block: rc1-rc3; n: the number of object categories; *an additional pooling layer was used before passing the conv5 feature-maps to fc1).

Layer | Input feature-map | Kernel size | Strides | Output feature-map
conv1 | 224x224x3 | 11x11x3x64 | [1,4,4,1] | 56x56x64
pool1 | 56x56x64 | 1x3x3x1 | [1,2,2,1] | 27x27x64
conv2 | 27x27x64 | 5x5x64x192 | [1,1,1,1] | 27x27x192
pool2 | 27x27x192 | 1x3x3x1 | [1,2,2,1] | 13x13x192
conv3 | 13x13x192 | 3x3x192x192 | [1,1,1,1] | 13x13x192
conv4 | 13x13x192 | 3x3x192x192 | [1,1,1,1] | 13x13x192
conv5 | 13x13x192 | 3x3x192x128 | [1,1,1,1] | 13x13x128
fc1 | 4608x1* | - | - | 1024x1
fc2 | 1024x1 | - | - | 128x1
fc3 | 128x1 | - | - | n
rc1 | 21632x1 | - | - | 4096x1
rc2 | 4096x1 | - | - | 192x1
rc3 | 192x1 | - | - | 4n

Fig. 5: Allowing detections of multiple divers in the proposed model using a region selector named Edge-box [31].

The proposed model has a sparse convolutional block and uses a three-layer regressor block instead of an RPN. As demonstrated in Table I, it has significantly fewer network parameters compared to the state-of-the-art object detection models.
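The layer dimensions in Table I can be reproduced with a few lines of tf.keras; the sketch below mirrors the conv1-conv5 / fc1-fc3 / rc1-rc3 structure and its feature-map sizes (including the extra pooling step before fc1). Activation functions, padding modes, and the absence of batch normalization are assumptions on our part, since the table specifies only shapes and strides.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_diver_detector(n_classes=1):
    inp = layers.Input(shape=(224, 224, 3))
    # Convolutional block (conv1-conv5 of Table I)
    x = layers.Conv2D(64, 11, strides=4, padding="same", activation="relu")(inp)  # 56x56x64
    x = layers.MaxPool2D(3, strides=2)(x)                                          # 27x27x64
    x = layers.Conv2D(192, 5, padding="same", activation="relu")(x)                # 27x27x192
    x = layers.MaxPool2D(3, strides=2)(x)                                          # 13x13x192
    x = layers.Conv2D(192, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(192, 3, padding="same", activation="relu")(x)
    feat = layers.Conv2D(128, 3, padding="same", activation="relu")(x)             # 13x13x128

    # Classifier block: extra pooling before fc1 (6x6x128 = 4608), then fc1-fc3
    c = layers.Flatten()(layers.MaxPool2D(3, strides=2)(feat))
    c = layers.Dense(1024, activation="relu")(c)
    c = layers.Dense(128, activation="relu")(c)
    class_scores = layers.Dense(n_classes, name="class_scores")(c)

    # Regressor block: rc1-rc3 on the flattened conv5 features (13x13x128 = 21632)
    r = layers.Flatten()(feat)
    r = layers.Dense(4096, activation="relu")(r)
    r = layers.Dense(192, activation="relu")(r)
    bbox = layers.Dense(4 * n_classes, name="bbox")(r)

    return Model(inp, [class_scores, bbox])
```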
2) Allowing Multiple Detections: Although following a single diver is the most common diver-following scenario, detecting multiple divers and other objects is necessary for many human-robot collaborative applications. As shown in Figure 5, we add multi-object detection capabilities to our proposed model by replacing the regressor with a region selector. We use the state-of-the-art class-agnostic region selector named Edge-box [31]. Edge-box utilizes image-level statistics such as edges and contours to measure objectness scores in various prospective regions of the image-space.

We use the same convolutional block to extract feature maps. The bounding boxes generated by Edge-box are filtered based on their objectness scores, and then non-maximum suppression is applied to obtain the dominant regions of interest in the image-space. The corresponding feature maps are then fed to the classifier block to predict the object categories. Although we need additional computation for Edge-box, it runs independently and in parallel with the convolutional block; the overall pipeline is still faster than if we were to use an RPN-based object detector model.
IV. EXPERIMENTS

We now discuss the implementation details of the proposed networks and present the experimental results.

A. Dataset Preparation

We performed numerous diver-following experiments in pools and oceans in order to prepare training datasets for the deep models.

In addition, we collected data from underwater field trials performed by different research groups over the years in pools, lakes, and oceans. This variety of experimental setups is crucial to ensure the comprehensiveness of the datasets so that the supervised models can learn the inherent diversity of various application scenarios. We made sure that the datasets contain training examples that capture the following variabilities:

• Natural variabilities: changes in visibility for different sources of water, lighting conditions at varied depths, chromatic distortions, etc.
• Artificial variabilities: data collected using different robots and cameras.
• Human variabilities: different persons and appearances, choice and variations of wearables such as suits, flippers, goggles, etc.

We extracted the robot's camera-feed during these experiments and prepared image-based datasets for supervised training. The images were annotated using the 'label-image' software (github.com/tzutalin/labelImg) by a number of human participants (acknowledged later in the paper) over a period of six months. A few sample images from the dataset are shown in Figure 6; it contains a total of 30K images, which are annotated with class-labels and bounding boxes.

Fig. 6: A few samples from the training dataset. The annotated training images have class labels (e.g., diver, robot) and corresponding bounding boxes. A total of 30K of these annotated images are used for supervised training.

B. Supervised Training Processes

We train all the supervised deep models on a Linux machine with four GPU cards (NVIDIA GTX 1080). The TensorFlow [32] and Darknet [12] libraries are used for implementation. Once training is done, the trained inference model (and parameters) is saved and transferred to the robot CPU for validation and real-time experiments.

For the state-of-the-art models (Figure 3), we utilized the pre-trained models for Faster R-CNN, YOLO, and SSD. These models are trained with the recommended configurations provided with their APIs; we refer to [12], [13] for the detailed processes. On the other hand, our proposed CNN-based models are trained from scratch; unsupervised pre-training and drop-outs are not used. RMSProp [33] is used as the optimization function with an initial learning rate of 0.001. In addition, standard cross-entropy and L2 loss functions are used by the classifier and regressor, respectively. The overall convergence behavior of the model is visualized in Figure 7.

Fig. 7: Convergence behavior of the proposed CNN-based model in terms of training accuracy (top) and training loss (bottom).
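The optimizer and loss configuration described above can be expressed as a single training step; the sketch below pairs a softmax cross-entropy loss on the class scores with an L2 (mean-squared-error) loss on the box coordinates and updates the weights with RMSProp at the stated initial learning rate of 0.001. The unweighted sum of the two losses, the tensor names, and the data format are assumptions rather than the authors' exact pipeline.

```python
import tensorflow as tf

cls_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)   # classifier: cross-entropy
reg_loss = tf.keras.losses.MeanSquaredError()                          # regressor: L2 penalty
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-3)            # RMSProp [33], lr = 0.001

@tf.function
def train_step(model, images, labels_onehot, boxes_true):
    # `model` returns (class_scores, bbox), e.g., the two-headed sketch from Section III-B
    with tf.GradientTape() as tape:
        class_scores, bbox_pred = model(images, training=True)
        total_loss = cls_loss(labels_onehot, class_scores) + reg_loss(boxes_true, bbox_pred)
    grads = tape.gradient(total_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return total_loss
```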
C. Performance Evaluation

We evaluate and compare the detection performances of all the models based on standard performance metrics. The test dataset contains 2.2K images that are chosen from separate field experiments (i.e., they are excluded from the training dataset).

1) Metrics: We use the following two standard performance metrics:

• mAP (mean Average Precision): the average of the maximum precisions at different recall values. Precision and recall are defined as precision = TP / (TP + FP) and recall = TP / (TP + FN); here, TP, FP, and FN stand for True Positive, False Positive, and False Negative, respectively.
• IoU (Intersection over Union): a measure of how well a model predicts the locations of the objects. It is calculated from the overlap of the predicted and ground-truth bounding boxes, defined as IoU = (area of overlap) / (area of union).

As their definitions suggest, mAP measures the detection accuracy, whereas IoU measures the object localization performance.
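As an illustration of the mAP definition above, the helper below computes average precision as the mean of the maximum precision attained at a set of fixed recall levels; the 11-point (PASCAL VOC style) interpolation used here is one common convention, chosen as an assumption since the text does not state which interpolation is used. mAP is then the mean of this quantity over object classes.

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP as the mean of the maximum precision at 11 equally spaced recall levels."""
    recalls, precisions = np.asarray(recalls, float), np.asarray(precisions, float)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        above = precisions[recalls >= r]
        ap += above.max() if above.size else 0.0
    return ap / 11.0

# Example: precision/recall pairs measured at a few detection-score thresholds
print(average_precision([0.2, 0.4, 0.6, 0.8], [1.0, 0.9, 0.75, 0.6]))
```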


We also evaluate and compare the running times of the models based on FPS (Frames Per Second), the average number of image-frames that a model can process per second. We measure the running times on three different devices:

• NVIDIA GTX 1080 GPU
• Embedded GPU (NVIDIA Jetson TX2)
• Robot CPU (Intel i3-6100U)

TABLE II: Performance comparison of the diver detection models based on standard metrics.

Model | mAP (%) | IoU (%) | FPS (GTX 1080) | FPS (Jetson TX2) | FPS (Robot CPU)
Faster R-CNN (Inception V2) | 71.1 | 78.3 | 17.3 | 2.1 | 0.52
YOLO V2 | 57.84 | 62.42 | 73.3 | 6.2 | 0.11
Tiny YOLO | 52.33 | 59.94 | 220 | 20 | 5.5
SSD (MobileNet V2) | 61.25 | 69.8 | 92 | 9.85 | 3.8
Proposed CNN-based model | 53.75 | 67.4 | 263.5 | 17.35 | 6.85

2) Results: The performances of the diver detection models based on mAP, IoU, and FPS are illustrated in Table II. The Faster R-CNN (Inception V2) model achieves much better detection performance than the other models, although it is the slowest in terms of running time. On the other hand, YOLO V2, SSD (MobileNet V2), and the proposed CNN-based model provide comparable detection performances. Although Tiny YOLO provides a fast running time, its detection performance is not as good as that of the other models.

As the results demonstrate, the proposed CNN-based model runs at a rate of 6.85 FPS on the robot CPU and 17.35 FPS on the embedded GPU, which validates its applicability for real-time diver-following applications. This fast running time comes at the cost of losing approximately 18% mAP and 11% IoU compared to the Faster R-CNN (Inception V2) model. Nevertheless, in our real-world experiments, we have found these detection performances to be sufficient for achieving reasonable tracking performance. In the following sections, we provide details of these field experiments and discuss the general applicability of the proposed model from a practical standpoint.

D. Field Experiments

1) Setup: We have performed several real-world experiments both in closed-water and in open-water conditions (i.e., in pools and in oceans). An autonomous underwater robot of the Aqua [34] family is used for testing the diver-following modules. During the experiments, a diver swims in front of the robot in arbitrary directions. The task of the robot is to visually detect the diver using its camera feed and follow behind him/her with a smooth motion.

2) Visual Servoing Controller: The Aqua robots have five degrees-of-freedom of control, i.e., three angular (yaw, pitch, and roll) and two linear (forward and vertical speed) controls. In our experiments for autonomous diver-following, we adopt a tracking-by-detection method where the visual servoing [35] controller uses the uncalibrated camera feed for navigation. The controller regulates the motion of the robot in order to bring the observed bounding box of the target diver to the center of the camera image. The distance of the diver is approximated by the size of the bounding box, and forward velocity rates are generated accordingly. Additionally, the yaw and pitch commands are normalized based on the horizontal and vertical displacements of the observed bounding-box center from the image center (see Figure 8); these navigation commands are then regulated by separate PID controllers. Roll stabilization and hovering, on the other hand, are handled by the robot's autopilot module [36].

Fig. 8: Illustration of how the yaw and pitch commands are generated based on the horizontal and vertical displacements of the center of the detected bounding box.
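A minimal sketch of this controller logic is given below: the normalized horizontal and vertical offsets of the bounding-box center drive the yaw and pitch commands, the bounding-box area (as a proxy for distance) drives the forward speed, and each error is passed through its own PID loop. The gains, sign conventions, and the target area ratio are placeholders; the actual Aqua controller and autopilot [36] are not reproduced here.

```python
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, err, dt):
        self.integral += err * dt
        deriv = (err - self.prev_err) / dt if dt > 0 else 0.0
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

yaw_pid, pitch_pid, speed_pid = PID(0.8, 0.0, 0.1), PID(0.8, 0.0, 0.1), PID(0.5, 0.0, 0.05)

def servo_commands(box, img_w, img_h, dt, target_area_ratio=0.2):
    """Map a detected box (x1, y1, x2, y2) to yaw, pitch, and forward-speed commands."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    # displacements of the box center from the image center, normalized to [-1, 1]
    yaw_err = (cx - img_w / 2.0) / (img_w / 2.0)
    pitch_err = (cy - img_h / 2.0) / (img_h / 2.0)
    # bounding-box area approximates distance: a small box means the diver is far away
    area_ratio = (box[2] - box[0]) * (box[3] - box[1]) / float(img_w * img_h)
    speed_err = target_area_ratio - area_ratio
    return (yaw_pid.step(yaw_err, dt),
            pitch_pid.step(pitch_err, dt),
            speed_pid.step(speed_err, dt))
```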
3) Feasibility and General Applicability: As mentioned, the diver-following module uses the monocular camera feed of the robot in order to detect a diver in the image-space and generate a bounding box. The visual servoing controller uses this bounding box and regulates the robot's motion commands in order to follow the diver. Therefore, correct detection of the diver is essential for the overall success of the operation. We provided the detection performances of our proposed model over a variety of test scenarios in Table II (a few snapshots are illustrated in Figure 1). During the field experiments, we found 6-7 positive detections per second on average, which is sufficient for successfully following a diver in real-time [2]. In addition, the on-board memory overhead is low, as the saved inference model is only about 60MB in size.

The proposed model is also considerably robust to occlusion and noise, and it is invariant to divers' appearances and wearables. Nevertheless, the detection performance might be negatively affected by unfavorable visual conditions; we demonstrate a few such cases in Figure 9. In Figure 9(a), the diver is only partially detected with low confidence (67%). This is because the flippers' motion produces a flurry of air-bubbles (since the diver was swimming very close to the ocean surface), which occluded the robot's view. Suspended particles cause similar difficulties in diver-following scenarios. The visual servoing controller can recover from such inaccurate detections as long as the diver is partially visible. However, continuous tracking might fail if the diver moves away from the robot's field of view before it can recover. In this experiment, 27 consecutive inaccurate detections (i.e., confidence scores less than 50%) caused enough drift in the robot's motion for it to lose sight of the person. On the other hand, occlusion also affects the detection performance, as shown in Figure 9(b); here, the proposed model could not localize the two divers correctly due to occlusion.


Fig. 9: A few cases where the diver-detection performance is challenged by noise and occlusion: (a) air-bubbles produced by divers' flippers while swimming very close to the ocean surface; (b) a diver occluded by another diver; (c) color-distorted visuals due to poor lighting conditions.

Fig. 10: Detection of ROVs and hand gestures by the same diver-detector model. In this case, the SSD (MobileNet V2) model was re-trained on additional data and object categories for ROVs and hand gestures (used for human-robot communication [14]).

Lastly, since our training datasets include a large collection of gray-scale and color-distorted underwater images, the proposed models are considerably robust to noise and color distortions (Figure 9(c)). Nonetheless, state-of-the-art image enhancement techniques for underwater imagery can be utilized to alleviate severe chromatic distortions. We refer interested readers to [5], where we tried to address these issues for generic underwater applications.

We also performed experiments to explore the usability of the proposed diver detection models for other underwater applications. As demonstrated in Figure 10, by simply re-training on additional data and object categories, the same models can be utilized in a wide range of underwater human-robot collaborative applications such as following a team of divers, robot convoying [3], human-robot communication [14], etc. In particular, if the application does not pose real-time constraints, we can use models such as Faster R-CNN (Inception V2) for better detection performance.

V. CONCLUSION

In this paper, we have tried to address the challenges involved in underwater visual perception for autonomous diver-following. At first, we investigated the performances and applicabilities of the state-of-the-art deep object detectors. We prepared and used a comprehensive dataset for training these models; then we fine-tuned each computational component in order to meet the real-time and on-board operating constraints. Subsequently, we designed a CNN-based diver detection model that establishes a delicate balance between robust detection performance and fast running time. Finally, we validated the tracking performances and general applicability of the proposed models through a number of field experiments in pools and oceans.

In the future, we seek to improve the running time of the general object detection models on embedded devices. Additionally, we aim to investigate the use of human body-pose detection models to understand divers' motions, instructions, and activities.

ACKNOWLEDGMENT

We gratefully acknowledge the support of the MnDrive initiative on this research. We are also thankful to the Bellairs Research Institute of Barbados for providing the facilities for our field experiments. In addition, we are grateful for the support of NVIDIA Corporation with the donation of a Titan Xp GPU for our research. We also acknowledge our colleagues, namely Cameron Fabbri, Marc Ho, Elliott Imhoff, Youya Xia, and Julian Lagman, for their assistance in collecting and annotating the training data.

REFERENCES

[1] J. Sattar, G. Dudek, O. Chiu, I. Rekleitis, P. Giguere, A. Mills, N. Plamondon, C. Prahacs, Y. Girdhar, M. Nahon et al., "Enabling Autonomous Capabilities in Underwater Robotics," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2008, pp. 3628-3634.
[2] M. J. Islam, M. Ho, and J. Sattar, "Understanding Human Motion and Gestures for Underwater Human-Robot Collaboration," Journal of Field Robotics (JFR), pp. 1-23, 2018.
[3] F. Shkurti, W.-D. Chang, P. Henderson, M. J. Islam, J. C. G. Higuera, J. Li, T. Manderson, A. Xu, G. Dudek, and J. Sattar, "Underwater Multi-Robot Convoying using Visual Tracking by Detection," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017.
[4] J. Sattar and G. Dudek, "Where is Your Dive Buddy: Tracking Humans Underwater using Spatio-Temporal Features," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2007, pp. 3654-3659.
[5] C. Fabbri, M. J. Islam, and J. Sattar, "Enhancing Underwater Imagery using Generative Adversarial Networks," in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 7159-7165.
[6] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in Advances in Neural Information Processing Systems (NIPS), 2015.
[7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818-2826.


[8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single Shot MultiBox Detector," in European Conference on Computer Vision (ECCV). Springer, 2016, pp. 21-37.
[9] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," arXiv preprint arXiv:1801.04381, 2018.
[10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv preprint arXiv:1704.04861, 2017.
[11] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[12] J. Redmon and A. Farhadi, "Tiny YOLO," https://pjreddie.com/darknet/yolo/, 2017, accessed: 11-10-2018.
[13] Google, "TensorFlow Object Detection Zoo," https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md, 2017, accessed: 11-10-2018.
[14] M. J. Islam, M. Ho, and J. Sattar, "Dynamic Reconfiguration of Mission Parameters in Underwater Human-Robot Collaboration," in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 1-8.
[15] A. Bahr, J. J. Leonard, and M. F. Fallon, "Cooperative Localization for Autonomous Underwater Vehicles," International Journal of Robotics Research (IJRR), vol. 28, no. 6, pp. 714-728, 2009.
[16] I. Rekleitis, G. Dudek, and E. Milios, "Probabilistic Cooperative Localization and Mapping in Practice," in IEEE International Conference on Robotics and Automation (ICRA), vol. 2. IEEE, 2003, pp. 1907-1912.
[17] M. J. Islam, J. Hong, and J. Sattar, "Person Following by Autonomous Robots: A Categorical Overview," in review at the International Journal of Robotics Research (IJRR), arXiv preprint arXiv:1803.08202, 2018.
[18] Q. Yu, T. B. Dinh, and G. Medioni, "Online Tracking and Reacquisition using Co-trained Generative and Discriminative Trackers," in European Conference on Computer Vision (ECCV). Springer, 2008, pp. 678-691.
[19] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-Learning-Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1409-1422, 2012.
[20] J. Shin, S. Kim, S. Kang, S.-W. Lee, J. Paik, B. Abidi, and M. Abidi, "Optical Flow-based Real-time Object Tracking using Non-prior Training Active Feature Model," Real-Time Imaging, vol. 11, no. 3, pp. 204-218, 2005.
[21] H. Inoue, T. Tachikawa, and M. Inaba, "Robot Vision System with a Correlation Chip for Real-time Tracking, Optical Flow and Depth Map Generation," in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1992, pp. 1621-1626.
[22] M. J. Islam and J. Sattar, "Mixed-domain Biological Motion Tracking for Underwater Human-Robot Interaction," in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 4457-4464.
[23] J. Sattar and G. Dudek, "Robust Servo-Control for Underwater Robots using Banks of Visual Filters," in IEEE International Conference on Robotics and Automation (ICRA), 2009, pp. 3583-3588.
[24] J. Sattar and G. Dudek, "On the Performance of Color Tracking Algorithms for Underwater Robots under Varying Lighting and Visibility," in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2006, pp. 3550-3555.
[25] J. Sattar, P. Giguere, G. Dudek, and C. Prahacs, "A Visual Servoing System for an Aquatic Swimming Robot," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2005, pp. 1483-1488.
[26] J. Sattar and G. Dudek, "Underwater Human-Robot Interaction via Biological Motion Identification," in Robotics: Science and Systems (RSS), 2009.
[27] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 886-893.
[28] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 580-587.
[29] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[30] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-time Object Detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.
[31] C. L. Zitnick and P. Dollár, "Edge Boxes: Locating Object Proposals from Edges," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 391-405.
[32] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale Machine Learning on Heterogeneous Distributed Systems," arXiv preprint arXiv:1603.04467, 2016.
[33] T. Tieleman and G. Hinton, "Lecture 6.5-RMSProp: Divide the Gradient by a Running Average of its Recent Magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26-31, 2012.
[34] G. Dudek, P. Giguere, C. Prahacs, S. Saunderson, J. Sattar, L.-A. Torres-Mendez, M. Jenkin, A. German, A. Hogue, A. Ripsman et al., "Aqua: An Amphibious Autonomous Robot," Computer, vol. 40, no. 1, 2007.
[35] B. Espiau, F. Chaumette, and P. Rives, "A New Approach to Visual Servoing in Robotics," IEEE Transactions on Robotics and Automation, vol. 8, no. 3, pp. 313-326, 1992.
[36] D. Meger, F. Shkurti, D. C. Poza, P. Giguere, and G. Dudek, "3D Trajectory Synthesis and Control for a Legged Swimming Robot," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2014, pp. 2257-2264.

Md Jahidul Islam is an IEEE student member, and currently a Ph.D. candidate at the Computer Science and Engineering (CSE) Department at the University of Minnesota Twin Cities, supervised by Dr. Junaed Sattar. His research work focuses on the design and development of visual perception techniques in order to understand human motions, gestures, body poses, and activities for human-robot collaborative applications. Before starting his Ph.D. in Fall 2015, he received his B.Sc. (Engg.) and M.Sc. (Engg.) degrees in CSE from Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh, in 2012 and 2015, respectively.

Michael Fulton is a Computer Science Ph.D. student supervised by Dr. Junaed Sattar. His work focuses on developing human-robot interaction methods, trash and invasive species detection techniques, and risk assessment algorithms for autonomous field robots, particularly for underwater robots. He graduated with distinction from Clarkson University with a B.Sc. in Computer Science in 2017, and is now pursuing his Ph.D. at the University of Minnesota Twin Cities.

Junaed Sattar is an IEEE member, an assistant professor at the Department of Computer Science and Engineering at the University of Minnesota Twin Cities, and a MnDrive (Minnesota Discovery, Research, and Innovation Economy) faculty member. Junaed is the founding director of the Interactive Robotics and Vision Lab, where he and his students investigate problems in field robotics, robot vision, human-robot communication, assisted driving, and applied (deep) machine learning, not to mention developing rugged robotic systems. Before coming to the UoM, Junaed was a post-doctoral fellow at the University of British Columbia working on service and assistive robotics, and at Clarkson University in upstate New York as an Assistant Professor. Find him at junaedsattar.org, and the IRV Lab at irvlab.cs.umn.edu or @irvlab on Twitter.
