Artificial Life and Robotics (2019) 24:219–224
A deep unified framework for suspicious action recognition
https://doi.org/10.1007/s10015-018-0518-y
ORIGINAL ARTICLE
Received: 10 April 2018 / Accepted: 12 November 2018 / Published online: 19 December 2018
© International Society of Artificial Life and Robotics (ISAROB) 2018
Abstract
As action recognition undergoes change as a field under the influence of the recent deep learning trend, and while research in areas such as background subtraction, object segmentation and action classification is steadily progressing, experiments devoted to evaluating a combination of the aforementioned fields, be it from a speed or a performance perspective, are few and far between. In this paper, we propose a deep, unified framework targeted towards suspicious action recognition that takes advantage of recent discoveries, fully leverages the power of convolutional neural networks, and strikes a balance between speed and accuracy not accounted for in most research. We carry out a performance evaluation on the KTH dataset and attain 95.4% accuracy in 200 ms of computational time, which compares favorably to other state-of-the-art methods. We also apply our framework to a video surveillance dataset and obtain 91.9% accuracy for suspicious actions in 205 ms of computational time.
Keywords Suspicious action recognition · Deep learning · Convolutional neural networks · Background subtraction ·
Optical flow estimation · Action classification
frame into small pixel-centered patches and classifying these as background or foreground patches. It should be noted that focusing on these pixel-centered patches instead of the whole scene does not hamper performance. Also, contrary to expectations, using only a small number of frames (25–50) suffices to obtain very good to excellent results, and allows for camera stabilization and shadow removal among other benefits.
2.2 Optical flow

Optical flow estimation has traditionally been done using differential methods, be they sparse like the Lucas–Kanade method [4], or dense like the Farnebäck method [5] and the TV-L1 method [6]. While speed and accuracy vary wildly across these kinds of methods, a major drawback common to all of them is lack of generalization to large data. One breakthrough in this domain is FlowNet [7], an optical flow estimation method that uses convolutional neural networks to take advantage of large datasets. FlowNet 2.0 [8], an improved iteration standing as the current state of the art, has also been released, as well as a dataset aimed at stereo optical flow estimation [9]. It is worth pointing out that while datasets used for training purposes are artificial (with some created with 3D modeling software), the resulting models generalize surprisingly well to real-world data.
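For concreteness, the classical dense methods mentioned above can be run in a few lines; the following sketch uses OpenCV's Farnebäck implementation. It is purely illustrative of this family of methods, not the FlowNet-2.0-based estimator adopted later in the framework; the file name and parameter values are placeholder assumptions.

# Illustrative only: dense optical flow with OpenCV's Farneback method,
# one of the classical differential approaches discussed above.
# "input.avi" and the parameter values are placeholders, not the paper's settings.
import cv2

cap = cv2.VideoCapture("input.avi")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Arguments: previous frame, next frame, initial flow, pyramid scale,
    # pyramid levels, window size, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    # flow is an H x W x 2 array of per-pixel (dx, dy) displacements.
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    prev_gray = gray

cap.release()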
2.3 Action classification

3 The framework

3.1 Deep background subtraction

We use the method described in [3] to achieve efficient background subtraction with the use of convolutional neural networks. More specifically (assuming we are inputting a grayscale video):

1. We construct a simple background model of the video input by computing a temporal average of each pixel.
2. We generate for each input frame a three-channel frame, where the first channel is the untouched input, the second channel is the background model, and the third channel is left empty.
3. We extract for each pixel of the generated frame a square patch centered around that pixel, and we feed it to the neural network.
4. The neural network, if training, learns from the input patches, given pixel-precise ground truth; if predicting, it classifies each patch as either background or foreground.
5. We generate the foreground video based on the above classification results and pass it on to the optical flow estimation part of the framework.

Results of the above steps on sample frames can be seen in Fig. 1. The background model, while retaining faint traces of motion, is sufficiently accurate for an uncluttered scene. The three-channel frame shows the static background as brown and motion (including removed shadows) as red.
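The following is a minimal sketch of the above pipeline, assuming the grayscale video is held as a NumPy array of frames. The patch size, the variable names, and the trivial stand-in used in place of the trained network of step 4 are illustrative assumptions, not the paper's actual settings or architecture.

# Sketch of the patch-based background subtraction steps described above.
# PATCH and classify_patch are illustrative assumptions; in the paper's method,
# a trained convolutional neural network performs the patch classification.
import numpy as np

PATCH = 27               # assumed (odd) patch side length
HALF = PATCH // 2

def background_model(frames):
    # Step 1: temporal average of each pixel over the input frames.
    return frames.mean(axis=0).astype(np.uint8)

def three_channel_frame(frame, bg):
    # Step 2: channel 0 = untouched input, channel 1 = background model, channel 2 = empty.
    return np.stack([frame, bg, np.zeros_like(frame)], axis=-1)

def classify_patch(patch):
    # Stand-in for step 4: the trained CNN would label the patch's centre pixel.
    # Here we simply threshold the difference between the input and background channels.
    centre = patch[HALF, HALF].astype(np.int32)
    return 1 if abs(centre[0] - centre[1]) > 25 else 0

def foreground_mask(frame, bg):
    # Steps 3-5: classify a square patch around every pixel to build the foreground mask.
    stacked = three_channel_frame(frame, bg)
    padded = np.pad(stacked, ((HALF, HALF), (HALF, HALF), (0, 0)), mode="reflect")
    mask = np.zeros(frame.shape, dtype=np.uint8)
    for y in range(frame.shape[0]):
        for x in range(frame.shape[1]):
            mask[y, x] = classify_patch(padded[y:y + PATCH, x:x + PATCH])
    return mask

# Usage, with frames a (T, H, W) uint8 array of grayscale frames:
# bg = background_model(frames)
# foreground = np.stack([foreground_mask(f, bg) for f in frames])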
f(x) = \begin{cases} 0, & x < -\alpha \\ 255 \cdot \frac{x+\alpha}{2\alpha}, & |x| < \alpha \\ 255, & x > \alpha \end{cases} \qquad (1)
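The text surrounding Eq. (1) is not reproduced in this excerpt; as a worked example only, the sketch below evaluates the piecewise mapping, assuming x is a signed per-pixel quantity (for instance an optical-flow component) and α a saturation threshold, so that values in [−α, α] are scaled linearly into [0, 255] and values outside that interval are clipped.

# Worked example of Eq. (1): piecewise-linear mapping of a signed value x into
# the 8-bit range [0, 255], saturating below -alpha and above +alpha.
# Interpreting x as an optical-flow component is an assumption, not stated here.
import numpy as np

def f(x, alpha):
    x = np.asarray(x, dtype=np.float64)
    linear = 255.0 * (x + alpha) / (2.0 * alpha)   # middle branch, |x| < alpha
    return np.clip(linear, 0.0, 255.0)             # 0 for x < -alpha, 255 for x > alpha

# Example with alpha = 8:
# f([-10, -4, 0, 4, 10], 8) -> [0.0, 63.75, 127.5, 191.25, 255.0]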