Maritime Vessel Images Classification Using Deep Convolutional Neural Networks

Cuong Dao-Duc (HUST, NUS), duccuong.hust@gmail.com
Hua Xiaohui (SJTU, NUS), coraline@sjtu.edu.cn
Olivier Morère (I2R, UPMC, IPAL), olivier.morere@etu.upmc.fr

HUST: Hanoi University of Science and Technology, Hanoi, Vietnam. NUS: National University of Singapore, Singapore. SJTU: Shanghai Jiao Tong University, Shanghai, China. I2R: Institute for Infocomm Research, A*STAR, Singapore. UPMC: Université Pierre et Marie Curie, Paris, France. IPAL: Image & Pervasive Access Lab, UMI CNRS 2955, Singapore.

ABSTRACT

The ability to identify maritime vessels and their type is an important component of modern maritime safety and security. In this work, we present the application of deep convolutional neural networks to the classification of maritime vessel images. We use the AlexNet deep convolutional neural network as our base model and propose a new model that is half the size of AlexNet. We conduct experiments on different configurations of the model on commodity hardware and comparatively evaluate and analyse their performance, measuring the top-1 and top-5 accuracy rates. The contribution of this work is the implementation, tuning and evaluation of an automatic image classifier for the specific domain of maritime vessels with deep convolutional neural networks, under the constraints imposed by commodity hardware and the size of the image collection.

CCS Concepts

•Computing methodologies → Object detection; Object recognition; Neural networks;

Keywords

Deep learning, convolutional neural networks, image classification, maritime vessel classification

1. INTRODUCTION

The ability to identify maritime vessels and their type is an important component of modern maritime safety and security. Knowing the class of a maritime vessel sailing along the coast of Vietnam or in the straits of Malacca reveals information about its tonnage, displacement, length, beam, draft, propulsion and capacity that can be included in a navigation security or safety assessment.

Images are readily available to navigation companies and ship owners (e.g., from on-ship cameras), and to port traffic authorities and security agencies (e.g., from on-shore surveillance cameras). Crews of vessels, officers of the various agencies and authorities, as well as members of the general public are also in a position to collect photos and videos using the multitude of personal mobile devices available, such as mobile phones, tablets and digital cameras.

The recent spectacular results achieved by deep convolutional neural networks for general-purpose image classification leverage the availability of very large collections of labelled images.

We foresee the development of navigation safety and security tools and services that allow the automatic, fast and accurate classification and identification of maritime vessels, and that can be easily embedded on-ship or on-shore, or made available on general public mobile devices.

In this work, we present the application of deep convolutional neural networks to the classification of images of maritime vessels. We use the AlexNet deep convolutional neural network as our base model. We train and test variants of the model with a collection of 130,000 images collected from a specialized website. We classify the images into 35 classes indicated in, or directly derived from, the image source.

The contribution of this work is the implementation, tuning and evaluation of an automatic image classifier for the specific domain of maritime vessels with deep convolutional neural networks, under the constraints imposed by commodity hardware and the size of the image collection. We show that the deep convolutional neural network approach to domain-specific classification is effective. It requires a reasonable amount of labelled images. It can be performed on commodity hardware but is clearly time and resource consuming.

The remainder of this paper is organized as follows. We present and synthesize the related work in section 2. We then present the detailed methodology in section 3, with some background on convolutional neural networks, a discussion of the dataset used for the experiments and a detailed outline of the training process. We comparatively evaluate and analyse the performance of the different configurations of the model by measuring the top-1 and top-5 accuracy rates in section 4. Finally, we summarize our contributions and findings in section 5.

SoICT 2015, December 03-04, 2015, Hue City, Viet Nam. © 2015 ACM. ISBN 978-1-4503-3843-1/15/12. DOI: http://dx.doi.org/10.1145/2833258.2833266

2. RELATED WORK

Conventional approaches to image classification rely heavily on the availability of visual features. Feature extraction processes are often handcrafted into invariant representations (e.g., BoVW [3]; IFV [13]).

Deep convolutional neural networks (often referred to as CNNs, for short) have now substantially outperformed approaches based on handcrafted features on visual classification and recognition tasks. Convolutional neural networks were first introduced by [11]. AlexNet [10] is one of the first popular deep convolutional neural networks. It achieved astonishing performance on ILSVRC 2012, making convolutional neural networks mainstream for image classification. Due to the availability of massive datasets (e.g., Pascal VOC [5], ImageNet [4]) and powerful graphics processing units (GPUs), the architectures of convolutional neural networks have been growing in both the depth of the network [12] and the size of its layers [15]. State-of-the-art convolutional neural networks have sophisticated architectures [16] and have even surpassed humans in image classification tasks [7].

Deep convolutional neural networks have not only been used for general classification on datasets whose objects are highly distinctive, as in ImageNet and Pascal VOC; a number of attempts have been made to utilise convolutional neural networks in other classification problems, including scene recognition [19] and, more interestingly, fine-grained classification tasks, i.e., classification among categories which are both visually and semantically very similar, such as birds [1] and pedestrian poses [6]. In the maritime domain, several attempts have been made at ship classification tasks. [18] used high-resolution synthetic aperture radar (SAR) images and classified them into three different classes; [8] investigated three different classifiers, i.e., the K nearest neighbor classifier, the Bayes classifier, and the back-propagation neural network classifier, and proposed an SVM combination strategy to fuse the results of the individual classifiers for SAR images. Note also that the related area of maritime vessel detection, which is highly relevant to vessel image classification, has developed in recent years. [17] combined a deep convolutional neural network and an extreme learning machine for maritime vessel detection on optical images.

We are interested in developing a domain-specific classifier that can classify images of maritime vessels into a number of classes. This is a different challenge from the general-purpose classification of images of random objects into a very large number of general classes. It poses an extra challenge over classification into general classes such as books, pens and computers, because of the close similarity among sub-classes.

3. METHODS

In this section, we provide detailed descriptions of the methods that we utilise to classify maritime vessels. We first give a background introduction to deep convolutional neural networks. We then describe the maritime vessel image dataset that we use to perform our experiments. Finally, we describe and discuss the details of the network training process and the corresponding parameter settings.

3.1 Convolutional Neural Network

A convolutional neural network is composed of a set of multiple convolutional layers, some of which may be stacked, with pooling and/or nonlinear layers in between, followed by one or several fully-connected layers as in standard artificial neural networks.

The input of a convolutional layer is a volume of size [W × H × D], where W and H are the width and height respectively and D is the depth of the input volume. In practice, typically in image classification applications, W and H are equal (square image) and D = 3, the number of channels (i.e., RGB) of the input image. Each layer of this type has K filters (kernels) of size [F × F × Q], where F (the receptive field) is smaller than W. In the first convolutional layer, Q is the number of channels of the input image; in the other layers, it equals the number of filters of the previous layer. Each filter is convolved with the input volume to produce a feature map whose width is W − F + 1 (for unit stride and no padding; with stride S and zero-padding P the width is (W − F + 2P)/S + 1), and in total each convolutional layer produces K feature maps of that size. In addition, the parameter-sharing scheme of convolutional neural networks significantly reduces their number of parameters, which makes them easier to train compared to traditional fully-connected neural networks.
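To make the dimension bookkeeping concrete, the following sketch (our own illustration, not code from the paper) computes the spatial output size and weight count of a convolutional layer and checks them against the conv1 and conv2 rows of Table 1:

    def conv_output_size(w, f, s, p):
        """Spatial output size of a convolutional layer: (W - F + 2P) / S + 1."""
        return (w - f + 2 * p) // s + 1

    def conv_params(k, f, q):
        """Weight count of a layer with K filters of size F x F x Q (no biases)."""
        return k * f * f * q

    # conv1 of Table 1: K=96 filters of 11 x 11 x 3, stride 4, no padding. The
    # Caffe Reference Model crops its input to 227 x 227, which is what makes
    # the 55 x 55 output size work out.
    assert conv_output_size(227, 11, 4, 0) == 55
    assert conv_params(96, 11, 3) == 34848

    # conv2 of Table 1: K=256 filters of 5 x 5 x 96, stride 1, padding 2,
    # applied to the 27 x 27 output of pool1.
    assert conv_output_size(27, 5, 1, 2) == 27
    assert conv_params(256, 5, 96) == 614400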
AlexNet [10] was the first popular convolutional neural network that achieved astounding performance, becoming the winning model of the ILSVRC 2012 [14] contest. It consists of eight layers with weights; the first five are convolutional layers stacked together with some max-pooling and ReLU layers in between; the last three are fully-connected layers. The input of this network is an image of size 224 × 224 randomly cropped from a 256 × 256 image. In our case, the output layer is a 35-way SoftMax that calculates the probability distribution over 35 class labels (corresponding to the 35 classes of vessel images).

The minor modification to the architecture compared to the one used by [10] is that instead of splitting the hidden layers into two parallel parts (similar to the “columnar” convolutional neural network of [2]), we use the unified AlexNet architecture provided by the open-source Caffe framework [9], referred to as the Caffe Reference Model. Table 1 describes the architecture and parameters of this model.

The reason we choose AlexNet for the experiments in this work is that it has been the source of inspiration for many subsequent deep learning models. In addition, the network is simple enough to train on typical computer systems with commodity hardware, making the training and testing of convolutional neural networks for typical image classification tasks more feasible.
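As an illustration of how such a model can be instantiated and inspected with the Caffe Python interface (a sketch; the two file names below are placeholders for a deploy definition with a 35-way output layer and its trained weights, not files shipped with this paper):

    import caffe

    # Placeholder file names: a deploy definition of the Caffe Reference Model
    # with its last fully-connected layer resized to 35 outputs, plus weights.
    net = caffe.Net('deploy_vessel35.prototxt', 'vessel35.caffemodel', caffe.TEST)

    # Output shape of every layer, mirroring the "output size" column of Table 1.
    for name, blob in net.blobs.items():
        print(name, blob.data.shape)

    # Learnable weight count per layer, mirroring the "parameters" column.
    for name, params in net.params.items():
        print(name, params[0].data.size)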
3.2 The Dataset

For this work, we create a data set, E2S2-Vessel, using a collection of high-resolution images from the online maritime vessel photo and tracking website ShipSpotting (http://shipspotting.com). The images on this website are gathered mainly from professional photographers and from on-ship and on-shore cameras. The following sub-sections explain in detail the process of collecting, categorizing and preprocessing the images in the data set.

3.2.1 Data collection

Natural maritime vessel images were downloaded in the period from July 16th, 2015 to July 19th, 2015. While crawling those images, we do not use the sorting criteria (e.g., chronological order, grouping by photographer's name, etc.) provided by ShipSpotting but randomly select images in discrete time periods and from different photographers.

Table 1: Model architecture

No.  layer     type             K    F   S  P  pool  output size       parameters
1    conv1     convolution      96   11  4  0        [55 × 55 × 96]    34,848
     relu1     ReLU
     pool1     pooling               3   2     MAX   [27 × 27 × 96]
     norm1     LRN
2    conv2     convolution      256  5   1  2        [27 × 27 × 256]   614,400
     relu2     ReLU
     pool2     pooling               3   2     MAX   [13 × 13 × 256]
     norm2     LRN
3    conv3     convolution      384  3   1  1        [13 × 13 × 384]   884,736
     relu3     ReLU
4    conv4     convolution      384  3   1  1        [13 × 13 × 384]   663,552
     relu4     ReLU
5    conv5     convolution      256  3   1  1        [13 × 13 × 256]   884,736
     relu5     ReLU
     pool5     pooling               3   2     MAX   [6 × 6 × 256]
6    fc6       fully-connected                       [1 × 1 × 4096]    37,748,736
     relu6     ReLU
     dropout6  dropout (50%)
7    fc7       fully-connected                       [1 × 1 × 4096]    16,777,216
     relu7     ReLU
     dropout7  dropout (50%)
8    fc8       fully-connected                       [1 × 1 × 35]      143,360
     loss      SoftMax                               [1 × 1 × 35]

Legend: K: number of filters in a layer; F: size of the receptive field; S: stride; P: size of the zero-padding applied to the border of the input volume.

In total, we have collected 150,000 images whose size, orientation and lighting conditions vary greatly. Those images correspond to 51,500 different maritime vessels, and each image is associated with a ground-truth class label provided by the website.

To improve the quality of E2S2-Vessel, we refine the data set and eliminate surplus images so that no image belongs to more than one class. After this elimination, approximately 130,000 images remain. Subsequently, those images are divided into two groups: a training set and a validation set. 80% (103,000) of the images are in the training set while the remaining 20% (26,000) are put into the validation set. This ratio between the number of images in the training and validation sets is the one commonly recommended for training deep convolutional neural networks. We further ensure that images of the same maritime vessel belong either to the training set or to the validation set, but not both.
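A minimal sketch of such a vessel-aware split (our own illustration; the (image_path, vessel_id, label) record layout is hypothetical, not the paper's actual data format):

    import random
    from collections import defaultdict

    def split_by_vessel(records, train_fraction=0.8, seed=42):
        """Split (image_path, vessel_id, label) records so that all images of a
        given vessel end up in exactly one of the two sets."""
        groups = defaultdict(list)
        for record in records:
            groups[record[1]].append(record)  # group by vessel_id

        vessel_ids = list(groups)
        random.Random(seed).shuffle(vessel_ids)

        target = train_fraction * sum(len(g) for g in groups.values())
        train, val = [], []
        for vid in vessel_ids:
            # Assign whole vessels until the training set reaches ~80% of images.
            (train if len(train) < target else val).extend(groups[vid])
        return train, val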

3.2.2 Class hierarchy

ShipSpotting classifies maritime vessels into 175 hierarchical classes. While this number of classes could support a very detailed categorization, those classes are not mutually exclusive, which makes the categorization unsuitable for most applications related to maritime vessel image classification. For instance, images in the two distinct classes “Crude Oil Tanker” and “Chemical and Products Tanker” cannot be visually differentiated. Therefore, we aggregate classes into higher-level ones (i.e., combine sub-classes into their parents), resulting in 35 classes of maritime vessels.
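A sketch of this aggregation step; the mapping entries shown are illustrative examples only, not the actual 175-to-35 table used for E2S2-Vessel:

    # Illustrative excerpt of the sub-class to parent-class mapping; the real
    # table covers all 175 ShipSpotting classes and yields 35 parent classes.
    PARENT_CLASS = {
        'Crude Oil Tanker': 'Tankers',
        'Chemical and Products Tanker': 'Tankers',
        'Container Ship': 'Container vessel',
        'Bulk Carrier': 'Bulkers',
    }

    def aggregate_label(fine_label):
        """Map a fine-grained label to its parent class, or keep it unchanged
        when it is already a top-level class."""
        return PARENT_CLASS.get(fine_label, fine_label)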
Of those 35 classes, the following are the dominant ones in terms of number of images: cargo vessels (21.97%); tankers (16.36%); bulkers (15.99%); container vessels (14.50%); passenger vessels (14.34%). Several sample images of these classes are shown in Figure 2.

Due to the nature of the existing maritime vessel population, the images in E2S2-Vessel are not uniformly distributed over the classes, as depicted in Figure 1.

Figure 1: Distribution of images over classes in E2S2-Vessel.
3.2.3 Preprocessing

The image preprocessing consists of squashing all images to size 256 × 256, without preserving the original aspect ratio.

Figure 2: Sample images from the E2S2-Vessel dataset. Images within the same column belong to the same class. From left to right: (a) Cargo vessel; (b) Container vessel; (c) Fishing vessel; (d) Military vessels; (e) Tugs.

This deformation does not affect the performance and accuracy of the classification task, as deep convolutional neural networks are fairly robust to elastic deformations of the input image thanks to the max-pooling operations. Moreover, we choose to deform images rather than take the risk of cropping out parts of maritime vessel images that may be relevant to identification.

After that, the images in the training and validation sets are converted to LMDB format. We also randomly shuffle the order of the images in both sets to avoid problems that might arise when the images in a mini-batch belong to the same class during the training process.
3.3 Training the Convolutional Neural Network

3.3.1 Hardware specification

We train variations of AlexNet on a commodity desktop computer with the following hardware specifications: CPU: Intel Core 2 Quad Q9550 @ 2.83GHz; RAM: 4GB DDR3; GPU: NVIDIA GeForce GTX 560 Ti with 2GB of memory.
3.3.2 Model configurations

We evaluate two different configurations of AlexNet, referred to as Config-1 and Config-2 respectively. Config-1 follows exactly the generic architecture described in Table 1. Config-2 modifies layers FC6 and FC7, using 2048 neurons in each layer instead of the 4096 used in Config-1. This reduces the number of parameters of those layers to 18,874,368 and 4,194,304 respectively. Consequently, the number of parameters of the whole model is reduced by approximately 55%. Table 2 presents the number of neurons and parameters in layers FC6 and FC7, as well as the total number of parameters of the whole model, for the two configurations.
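The reduction can be cross-checked with a few lines of arithmetic; the sketch below recomputes the fully-connected weight counts from the layer sizes (our own calculation, counting FC8 in both totals):

    # FC6 takes the flattened pool5 output of 6 * 6 * 256 = 9216 values.
    POOL5_OUTPUTS = 6 * 6 * 256

    def fc_params(n6, n7, n_classes=35):
        """Weight counts of FC6, FC7 and FC8 for the given layer widths."""
        return POOL5_OUTPUTS * n6, n6 * n7, n7 * n_classes

    CONV_PARAMS = 34848 + 614400 + 884736 + 663552 + 884736  # conv1-conv5, Table 1

    fc6, fc7, fc8 = fc_params(4096, 4096)
    assert (fc6, fc7) == (37748736, 16777216)   # Config-1 row of Table 2
    config1 = CONV_PARAMS + fc6 + fc7 + fc8     # 57,751,584 parameters

    fc6, fc7, fc8 = fc_params(2048, 2048)
    assert (fc6, fc7) == (18874368, 4194304)    # Config-2 row of Table 2
    config2 = CONV_PARAMS + fc6 + fc7 + fc8     # roughly 26.2 million parameters

    print('reduction: {:.0%}'.format(1 - float(config2) / config1))  # about 55%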
Consequently, the number of parameters of the whole model 70% before increasing to approximately 80% at 40, 000th it-
is significantly reduced by a factor of 65%. Table 2 presents eration in which learning rate α = 0.001. As the training
the number of neurons and parameters in each of layers FC6 goes on, top-1 accuracy measured on both models seem to

Table 2: Number of neurons and parameters in layers FC6 and FC7, and total number of parameters of the model, for the two configurations

Configuration   FC6 neurons   FC7 neurons   FC6 parameters   FC7 parameters   Total parameters
Config-1        4096          4096          37,748,736       16,777,216       57,751,584
Config-2        2048          2048          18,874,368       4,194,304        26,150,944

While the model is being trained, we monitor the accuracy regularly by taking a snapshot every 2,000 iterations and evaluating each snapshot on both the training and validation sets. This is a useful strategy for training deep convolutional neural networks in Caffe, since it is not necessary to wait until training has fully completed (which can take five to six days in the case of AlexNet) to assess the performance of the model. The evolution of the top-1 and top-5 accuracy rates during the training of both configurations is depicted in Figure 3. As can be seen in Figure 3, the accuracy of both configurations surges dramatically over the first 2,000 iterations. In the period from iteration 2,000 to iteration 40,000, in which the learning rate is α = 0.01, the accuracy of both models fluctuates around 70% before increasing to approximately 80% after iteration 40,000, when the learning rate drops to α = 0.001. As training goes on, the top-1 accuracy of both models reaches a plateau around 80%.

Figure 3: The evolution of the top-1 and top-5 accuracy rates on both the training and validation sets over the course of training for (a) Config-1 and (b) Config-2. The x-axis is the number of iterations; the y-axis is the accuracy rate.

4. RESULTS

4.1 Training and validation times

Table 3 compares two timing measurements for the two configurations. The first, T (hours), is the amount of time needed to train each model for the first 100,000 iterations. The second, τ (milliseconds), is the amount of time needed to classify a single image once the model has been trained. The second configuration, Config-2, is 16% faster to train as it has fewer parameters to learn during the training process. For the classification of a single maritime vessel image, Config-1 takes 4.12 milliseconds while Config-2 achieves a slightly better result of 4.09 milliseconds. Though the difference for classifying a single image is not considerable, Config-2 will achieve more convincing performance when it comes to the classification of a massive amount of data.

Table 3: Time for model training and single-image classification

Configuration   T (hours)   τ (milliseconds)
Config-1        28.6        4.12
Config-2        23.9        4.09
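Per-image classification times of the kind reported in Table 3 can be measured with a simple loop over preprocessed images; a sketch assuming a trained pycaffe net with an input blob named data (the conventional name in the Caffe Reference Model):

    import time

    def mean_forward_time_ms(net, images, repeats=1000):
        """Average single-image forward-pass time in milliseconds.

        net: a trained pycaffe model reshaped for batch size 1.
        images: preprocessed arrays matching the shape of net.blobs['data'].
        """
        start = time.time()
        for i in range(repeats):
            net.blobs['data'].data[...] = images[i % len(images)]
            net.forward()
        return (time.time() - start) / repeats * 1000.0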
4.2 Performance of models

The classification performance is evaluated by two measurements: the top-1 and top-5 accuracy rates. The former is the rate at which the ground-truth class matches the first predicted class; the latter is computed from the number of images whose ground-truth class is among the top five classes predicted by the network.
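Both measurements can be computed directly from the network's class scores; a minimal NumPy sketch (our own illustration, with a toy three-class example):

    import numpy as np

    def topk_accuracy(scores, labels, k):
        """Fraction of images whose ground-truth label is among the k classes
        with the highest predicted scores.

        scores: (n_images, n_classes) array of class scores or probabilities.
        labels: (n_images,) array of ground-truth class indices.
        """
        topk = np.argsort(scores, axis=1)[:, -k:]      # k best classes per image
        hits = (topk == labels[:, None]).any(axis=1)
        return hits.mean()

    # Toy three-class example: top-1 only accepts the argmax, top-2 also
    # accepts the runner-up.
    scores = np.array([[0.1, 0.7, 0.2],
                       [0.5, 0.3, 0.2]])
    labels = np.array([1, 1])
    print(topk_accuracy(scores, labels, 1))  # 0.5
    print(topk_accuracy(scores, labels, 2))  # 1.0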
The Config-1 model achieves a top-1 accuracy rate of 80.39% and a top-5 accuracy rate of 94.93%, while the corresponding rates for Config-2 are 80.91% and 95.43%.

Although the difference in performance between the two configurations is insignificant, the interesting finding is that Config-2 is much faster to train and has far fewer parameters. This suggests that for certain fine-grained classification problems, architectures simpler than AlexNet are worth investigating, as they can be trained in a shorter amount of time and can be deployed to a variety of platforms such as mobile devices and embedded systems.

5. CONCLUSIONS

We present an application of deep convolutional neural networks to the classification of images of maritime vessels. We train and test the AlexNet deep convolutional neural network with a data set of 130,000 images of maritime vessels labelled with 35 classes. We conduct experiments on different configurations of the model on commodity hardware, comparatively evaluating and analysing their performance by measuring the top-1 and top-5 accuracy rates.

Our experimental results yield solid evidence that the deep convolutional neural network approach to domain-specific classification is as effective as for general image classification, provided there is a sufficient number of images in each category, a proper design of the network architecture and proper parameter settings. We obtain a top-1 accuracy rate of 80.39% and a top-5 accuracy rate of 94.93% with the full model. Of the two AlexNet configurations tested, the simpler one with fewer parameters performs comparably well, with a top-1 accuracy rate of 80.91% and a top-5 accuracy rate of 95.43%, and is significantly faster to train.

The contribution of this work is the implementation, tuning and evaluation of an automatic image classifier for the specific domain of maritime vessels with deep convolutional neural networks, under the constraints imposed by commodity hardware and the size of the image collection.

6. ACKNOWLEDGMENTS

This research was funded by the National Research Foundation of Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme, with the SP2 project of the “Energy and Environmental Sustainability Solutions for Megacities - E2S2” programme.

7. REFERENCES

[1] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In Proc. Conf. Computer Vision and Pattern Recognition (CVPR), June 2014.
[2] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. High-performance neural networks for visual object classification. CoRR, abs/1102.0183, 2011.
[3] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, pages 1–22, 2004.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[6] D. Hall and P. Perona. Fine-grained classification of pedestrians in video: Benchmark and state of the art. In Proc. Conf. Computer Vision and Pattern Recognition (CVPR), June 2015.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.
[8] K. Ji, X. Xing, W. Chen, H. Zou, and J. Chen. Ship classification in TerraSAR-X SAR images based on classifier combination. In Geoscience and Remote Sensing Symposium (IGARSS), 2013 IEEE International, pages 2589–2592, July 2013.
[9] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, MM ’14, pages 675–678, New York, NY, USA, 2014. ACM.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[11] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems 2, pages 396–404. Morgan-Kaufmann, 1990.
[12] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[13] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV’10, pages 143–156, Berlin, Heidelberg, 2010. Springer-Verlag.
[14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), pages 1–42, April 2015.
[15] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[17] J. Tang, C. Deng, G. Huang, and B. Zhao. Compressed-domain ship detection on spaceborne optical image using deep neural network and extreme learning machine. IEEE Transactions on Geoscience and Remote Sensing, 53(3):1174–1185, 2015.
[18] C. Wang, H. Zhang, F. Wu, S. Jiang, B. Zhang, and Y. Tang. A novel hierarchical ship classifier for COSMO-SkyMed SAR data. IEEE Geoscience and Remote Sensing Letters, 11(2):484–488, Feb 2014.
[19] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems 27, pages 487–495. Curran Associates, Inc., 2014.
