Human Action Recognition Algorithm Based on Multi-feature Map Fusion

HAOFEI WANG1, JUNFENG LI1
1 Control Science and Engineering, Zhejiang SCI-TECH University, Hangzhou 310000, China

Corresponding author: JUNFENG LI (e-mail: ljf2003@zstu.edu.cn).
This work was supported in part by the National Natural Science Foundation of China under Grant No. 61374022.

ABSTRACT The emergence of convolutional neural networks has greatly improved the accuracy of human action recognition. However, as networks deepen, fewer and fewer features are extracted, and in some datasets the size of the target to be recognized varies with the shooting angle. To solve this problem, we build on the resnext human action recognition method and propose an improved resnext method based on multi-feature map fusion. First, the videos are uniformly sampled to generate training samples, and we generate samples with different numbers of frames as the input to the network. Second, we add n up-sampling layers after layer 1 of resnext to enlarge the feature maps and extract multiple feature maps, so that the extracted feature maps are clearer and small targets can be better recognized. Finally, the n results obtained are fused with the weighted geometric means combination forecasting method based on the L_1 norm to obtain the final result. In experiments on UCF-101 and HMDB-51, the accuracy of our model reaches 90.3% on UCF-101, which is higher than most state-of-the-art algorithms.

INDEX TERMS Human action recognition, Resnext-101 network, up-sampling method, weight fusion

I. INTRODUCTION
Due to the potential applications of human action recognition in video surveillance, behavior analysis, video retrieval, and other fields, human action recognition has become a very important area of computer vision research [1]. Human action recognition refers to detecting, classifying, and tracking the moving targets in a video sequence of human action, analyzing and recognizing the action, and describing it in natural language [2]. In the real world, human action recognition plays a basic role in video analysis.

Early human action recognition was based on hand-crafted features. These features depend heavily on the databases they were designed for: they performed well on some databases, but they are not necessarily applicable to others. Moreover, hand-crafted features take a long time to compute, which is not conducive to feature extraction on large databases.

With the rise of deep learning, automatic feature learning has overcome the shortcomings of hand-crafted feature engineering and made significant progress in the field of human action recognition. However, because of the long information period, the large amount of redundant information, complex backgrounds, and the diversity of viewpoints [3], there are still many problems to be solved in human action recognition methods based on deep learning.

In the context of big data, human action recognition has broad application scenarios, including video recommendation, monitoring analysis, and human-computer interaction. Although current algorithms achieve good performance, they still need to be improved in accuracy and running speed. Therefore, ensuring accuracy while improving running speed remains the focus and difficulty of current research on human action recognition algorithms.

II. RELATED WORK
A. Feature-based approaches
Some classic image feature extraction methods have been generalized to video [4]. Traditional feature extraction methods for human action recognition include SIFT (Scale-Invariant Feature Transform), which relies on prior knowledge, and HOG (Histogram of Oriented Gradients), as well as SURF (Speeded-Up Robust Features) [5] and 3D-SIFT [6], which improve on SIFT, and the HOG3D [7] algorithm, among others.

The detection operator of a traditional feature extraction algorithm is designed by hand and obtained from a large amount of prior knowledge. Therefore, traditional algorithms are time-consuming and the workload is heavy. With the advent of deep learning, more and more studies have been influenced by the significant achievements of convolutional neural networks in static image recognition. In action recognition, deep models trained end-to-end have clearly surpassed hand-crafted features [8].

B. CNN-based approaches
Human action recognition methods based on convolutional neural network architectures are roughly divided into 2D convolutional neural networks and 3D convolutional neural networks [9].


2D convolutional neural networks achieve good performance in static image recognition [10], [11]. It is easy to apply a 2D convolutional neural network to video representations to extract features, but this ignores the relationship between video frames. For this reason, Ji et al. [12] proposed a 3D convolutional neural network method. A 3D convolutional neural network consists of a 2D spatial dimension and a temporal dimension; 3D convolutional neural networks take time information into account, but they have too many parameters and, compared to 2D convolutional neural networks, are more difficult to train [13].

The 2D two-stream network architecture [14] divides the video into two parts, the temporal domain and the spatial domain; the resulting RGB images and optical flow frames are input into the network to learn features in the temporal and spatial domains. However, the final prediction is obtained only by averaging the classification scores, and convolution between adjacent frames alone cannot capture long-term information. Based on the problems of the 2D two-stream architecture, improved 3D two-stream network architectures were proposed. The I3D model proposed by Carreira and Zisserman [15] increases the length of the input video clip to obtain a longer range of information, but it is computationally intensive and cannot handle longer videos. Based on I3D, Wang et al. [16] proposed a non-local neural network, which exploits the spatiotemporal non-local relationships in the video. Xu et al. [17] and Qiu et al. [18] proposed coding methods to obtain video-level representations but ignored the connection between frames. TSN, proposed by Wang et al. [19], adopts a sparse, global sampling method that samples a fixed number of frames to cover the temporal structure of a long range, so that the entire length of a video need not be considered and fusion is carried out at the end. Different from the above spatio-temporal features, [20] encodes spatio-temporal features by imposing a weight-sharing constraint on the learnable parameters, so that temporal and spatial features can benefit from each other through collaborative learning. The above spatio-temporal fusion methods make network training time-consuming and expensive. To solve this problem, Zhou et al. [21] proposed a new method that embeds the spatiotemporal fusion strategy into a pre-defined probability space, so that multiple fusion strategies can be evaluated at the network level without having to train them separately, which greatly improves the efficiency of analyzing spatiotemporal fusion strategies.

Inspired by FPN [22], we propose a multi-scale fusion method for human action recognition. By observing the datasets, we found that the background of some actions is complex and the targets we want to recognize are small relative to the entire background, so we use up-sampling to enlarge the feature maps, making small targets clearer and easier to detect. Moreover, deep learning learns from the extracted features: the more detailed and clear the extracted features are, the better the learning effect. Therefore, on the basis of resnext, we propose a Human Action Recognition Algorithm Based on Multi-feature Map Fusion, which adds n up-sampling layers after layer 1 that are trained separately, with the aim of enlarging the feature maps to make the extracted features clearer. To avoid obtaining the final result only by averaging classification scores, as in the 2D two-stream network architecture [14], we propose to use the weighted geometric means combination forecasting method based on the L_1 norm to fuse the n results obtained.
III. RESNEXT-101
We first introduce the architecture of resnext. The traditional way to improve the accuracy of a model is to deepen or widen the network, but as the number of hyperparameters increases (such as the number of channels, filter size, etc.), the difficulty of network design and the computational overhead also increase. The resnext structure improves accuracy without increasing the complexity of the parameters, while reducing the number of hyperparameters. Xie et al. [23] proposed resnext, which adopts the VGG idea of stacking and Inception's split-transform-merge strategy at the same time, yet has strong extensibility; it can be regarded as increasing accuracy while keeping or even reducing the complexity of the model. The key notion is cardinality, meaning the size of the set of transformations. Experiments demonstrate that increasing cardinality is a more effective way of gaining accuracy than going deeper or wider [23]. Fig. 1 shows a block of resnext.

Figure 1. A block of resnext with cardinality=32

Based on this block, the internal structure of resnext-101 is shown in Table I.
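For readers who prefer code to diagrams, the sketch below is our own illustrative PyTorch rendering of one such block (the paper gives no code; the module names and the use of torch.nn are our assumptions). The grouped 3x3 convolution with groups=32 realizes a cardinality of 32; the paper's backbone is the 3D variant of resnext-101, which would use Conv3d/BatchNorm3d instead of the 2D layers shown here.

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Illustrative resnext bottleneck block (cf. Fig. 1): 256 -> 128 -> 128 (32 groups) -> 256."""
    def __init__(self, in_ch=256, mid_ch=128, out_ch=256, cardinality=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        # The grouped convolution implements the split-transform-merge idea:
        # 32 paths of width 4, aggregated by concatenation.
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1,
                               groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)   # residual connection

x = torch.randn(1, 256, 56, 56)
y = ResNeXtBlock()(x)                      # y.shape == (1, 256, 56, 56)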
Shallow features contain detailed information, while deep features contain semantic information, and semantic information helps to detect the target accurately. However, as the network [23] deepens, more and more useful features are filtered out, and in the datasets the size of the target to be recognized varies. Therefore, we add n up-sampling layers after layer 1 of resnext-101 to enlarge the extracted feature maps, so that the network extracts more detailed features and small targets are better recognized.

Our work differs from previous approaches [24] in two main aspects: (1) based on resnext-101, we add several up-sampling layers [25] to extract more feature maps; (2) the several groups of results obtained are fused using the weighted geometric means combination forecasting method based on the L_1 norm to obtain the final result.


TABLE I
THE STRUCTURE OF RESNEXT-101

Stage    Output     ResNeXt-101
Conv1    112x112    7x7, 64, stride 2
                    3x3 max pool, stride 2
Conv2    56x56      [1x1, 128; 3x3, 128, C=32; 1x1, 256] x3
Conv3    28x28      [1x1, 256; 3x3, 256, C=32; 1x1, 512] x4
Conv4    14x14      [1x1, 512; 3x3, 512, C=32; 1x1, 1024] x23
Conv5    7x7        [1x1, 1024; 3x3, 1024, C=32; 1x1, 2048] x3
         1x1        global average pool, fc, softmax

IV. HUMAN ACTION RECOGNITION ALGORITHM BASED ON MULTI-FEATURE MAP FUSION
A. Network architecture
The input image of the neural network is convolved with the convolution kernels of the input layer to obtain feature maps. A feature map is a description of the characteristics of the image: the more features are extracted and the more detailed they are, the better the recognition effect. Therefore, the more feature maps there are, the more representative the extracted features will be and the better the recognition effect will be. Shallow features show more detailed information, while deep features contain more semantic information, which helps to detect the target accurately. However, after multi-layer convolution many features have been filtered out, so on the basis of resnext-101 we propose a new architecture, the Human Action Recognition Algorithm Based on Multi-feature Map Fusion, in which n up-sampling layers are added after layer 1 of resnext-101 for prediction.
Fig. 2 shows the architecture of our network. First, the sampling method is uniform sampling, and the default size of each sample is 3 channels x 16 frames x 112 pixels x 112 pixels. Second, we use stochastic gradient descent to train the network and obtain n prediction results. We calculate the weights of the training results through the fusion model described below and fuse the test results according to these weights to get the final results.

Figure 2. The architecture of our network
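As a rough sketch of how we read Fig. 2 (a sketch under assumptions, not the authors' code: the attribute names stem and layer1-layer4 are borrowed from common 3D resnext-101 implementations, and the 2048-dimensional classifier input follows Table I), each branch enlarges the layer-1 feature maps by its own factor before running the remaining stages:

import torch.nn as nn
import torch.nn.functional as F

class UpsampledBranch(nn.Module):
    """One of the n branches: up-sample the layer-1 feature maps, then finish the backbone.
    'backbone' is assumed to expose stem and layer1..layer4 of a 3D resnext-101."""
    def __init__(self, backbone, scale_factor, num_classes=101):
        super().__init__()
        self.backbone = backbone
        self.scale_factor = scale_factor
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, clip):                       # clip: (B, 3, 16, 112, 112)
        x = self.backbone.stem(clip)
        x = self.backbone.layer1(x)
        # Up-sampling layer inserted after layer 1 (trilinear, see Section IV-B).
        x = F.interpolate(x, scale_factor=self.scale_factor,
                          mode='trilinear', align_corners=False)
        x = self.backbone.layer2(x)
        x = self.backbone.layer3(x)
        x = self.backbone.layer4(x)
        x = F.adaptive_avg_pool3d(x, 1).flatten(1)
        return self.fc(x)                          # per-clip class scores

Each branch is trained separately with stochastic gradient descent, and the resulting class scores are fused at test time with the weights described in Section IV-C.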

B. Up-sampling method
On the basis of resnext-101, we separately add n up-sampling layers to extract more features. To make this concrete, we now discuss several up-sampling methods.

1) NEAREST NEIGHBOR INTERPOLATION
This is the simplest interpolation method: we obtain the coordinate (srcX, srcY) of the source image corresponding to (dstX, dstY) by (1) and fill in the corresponding pixel value.

srcX = dstX x (srcWidth / dstWidth)
srcY = dstY x (srcHeight / dstHeight)    (1)

Here (srcWidth, srcHeight) is the width and height of the source image, and (dstWidth, dstHeight) is the width and height of the image after interpolation.
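A minimal NumPy sketch of Eq. (1), for illustration only (the function and variable names are ours):

import numpy as np

def nearest_neighbor_resize(src, dst_h, dst_w):
    """Enlarge a 2D map by nearest neighbor interpolation following Eq. (1)."""
    src_h, src_w = src.shape
    dst = np.empty((dst_h, dst_w), dtype=src.dtype)
    for dst_y in range(dst_h):
        for dst_x in range(dst_w):
            src_x = min(int(dst_x * src_w / dst_w), src_w - 1)   # srcX = dstX * (srcWidth / dstWidth)
            src_y = min(int(dst_y * src_h / dst_h), src_h - 1)   # srcY = dstY * (srcHeight / dstHeight)
            dst[dst_y, dst_x] = src[src_y, src_x]
    return dst

For example, nearest_neighbor_resize(np.arange(16.0).reshape(4, 4), 8, 8) doubles a 4x4 map to 8x8 by copying the nearest source pixel.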


2) BILINEAR INTERPOLATION
This method calculates the pixel value of a point P(x, y) from the pixel values of the four nearest points of P(x, y). The core idea is to perform linear interpolation in the two directions respectively. The pixel values of Q11, Q12, Q21, Q22 are known; we first use (2) to calculate the pixel values of R1 and R2, and then calculate the pixel value of P using (3). Substituting (2) into (3) gives (4).

Figure 3. Bilinear interpolation

In the x direction:

f(R1) ≈ ((x2 − x) / (x2 − x1)) f(Q11) + ((x − x1) / (x2 − x1)) f(Q21)
f(R2) ≈ ((x2 − x) / (x2 − x1)) f(Q12) + ((x − x1) / (x2 − x1)) f(Q22)    (2)

In the y direction:

f(P) ≈ ((y2 − y) / (y2 − y1)) f(R1) + ((y − y1) / (y2 − y1)) f(R2)    (3)

So the pixel value at point P is:

f(x, y) ≈ [f(Q11)(x2 − x)(y2 − y) + f(Q21)(x − x1)(y2 − y)
          + f(Q12)(x2 − x)(y − y1) + f(Q22)(x − x1)(y − y1)] / [(x2 − x1)(y2 − y1)]    (4)
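The following sketch evaluates Eqs. (2)-(4) for a single point (our own illustration; it assumes the four neighbors lie at (x1, y1), (x2, y1), (x1, y2) and (x2, y2)):

def bilinear_point(q11, q21, q12, q22, x1, x2, y1, y2, x, y):
    """Pixel value at P(x, y) from the four neighbors:
    q11 = f(Q11) at (x1, y1), q21 = f(Q21) at (x2, y1),
    q12 = f(Q12) at (x1, y2), q22 = f(Q22) at (x2, y2)."""
    r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21   # Eq. (2), row y = y1
    r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22   # Eq. (2), row y = y2
    return (y2 - y) / (y2 - y1) * r1 + (y - y1) / (y2 - y1) * r2   # Eq. (3); expanding gives Eq. (4)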
3) TRILINEAR INTERPOLATION
Trilinear interpolation is linear interpolation (n = 1) in a three-dimensional (D = 3) parameter space, so the 8 points adjacent to the point to be interpolated are needed. On a periodic cube grid, let xd, yd, zd be the differences between each of x, y, z and the smaller related coordinate, that is:

xd = (x − x0) / (x1 − x0)
yd = (y − y0) / (y1 − y0)    (5)
zd = (z − z0) / (z1 − z0)

where x0 is the grid point below x and x1 is the grid point above x, and y0, y1, z0, z1 are defined in the same way. The pixel values f000, f001, f010, f011, f100, f101, f110, f111 of the eight corners are known. First, we calculate the values of R1, R2, R3, R4 using (6); then we use (7) to calculate the values of r1 and r2; finally, we calculate the pixel value f using (8).

Figure 4. Trilinear interpolation

First, interpolate in the x direction:

R1 = f000 (1 − xd) + f100 xd
R2 = f010 (1 − xd) + f110 xd
R3 = f011 (1 − xd) + f111 xd
R4 = f001 (1 − xd) + f101 xd    (6)

Then interpolate in the y direction:

r1 = R4 (1 − yd) + R3 yd
r2 = R1 (1 − yd) + R2 yd    (7)

Finally, interpolate in the z direction:

f = r2 (1 − zd) + r1 zd    (8)

The nearest neighbor interpolation method causes discontinuities in the grayscale of the generated image: when the feature map is enlarged, it directly copies the nearest pixel to generate a new pixel, so obvious jagged edges appear wherever the grayscale changes. The bilinear interpolation method is more complicated and computationally heavier, but using four pixels largely eliminates the jaggedness and avoids grayscale discontinuities; however, bilinear interpolation acts as a low-pass filter, so high-frequency components are damaged and edges become blurred. The trilinear interpolation method overcomes the shortcomings of the above two methods, with higher calculation accuracy and better effect. Therefore, we choose trilinear interpolation as the up-sampling method.
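Transcribed directly into code, Eqs. (5)-(8) read as follows (an illustrative sketch; the eight corner values are passed in as a dict keyed by their xyz labels):

def trilinear_point(f, x0, x1, y0, y1, z0, z1, x, y, z):
    """f maps corner labels '000'..'111' (in x, y, z order) to their pixel values."""
    xd = (x - x0) / (x1 - x0)                     # Eq. (5)
    yd = (y - y0) / (y1 - y0)
    zd = (z - z0) / (z1 - z0)
    R1 = f['000'] * (1 - xd) + f['100'] * xd      # Eq. (6): interpolate along x
    R2 = f['010'] * (1 - xd) + f['110'] * xd
    R3 = f['011'] * (1 - xd) + f['111'] * xd
    R4 = f['001'] * (1 - xd) + f['101'] * xd
    r1 = R4 * (1 - yd) + R3 * yd                  # Eq. (7): interpolate along y
    r2 = R1 * (1 - yd) + R2 * yd
    return r2 * (1 - zd) + r1 * zd                # Eq. (8): interpolate along z

In practice this is not hand-written for feature maps; in PyTorch, for instance, torch.nn.functional.interpolate(x, scale_factor=2, mode='trilinear') applies the same scheme to a 5-dimensional (batch, channel, T, H, W) tensor.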
C. Fuse method
We train these networks separately to obtain different training results, and the weights of the training results are obtained by the method of [26]. When evaluating the network, the weights obtained by [26] are used to fuse the results into the final result. The combined prediction model of [26] can be expressed as:

min F(L) = Σ_{t=1}^{N} | Σ_{i=1}^{n} l_i e_it |
e_it = ln x_t − ln x_it
s.t.  Σ_{i=1}^{n} l_i = 1,  l_i ≥ 0,  i = 1, 2, …, n    (9)

Here F(L) is the logarithmic error, based on the L1 norm, between the weighted geometric means combination prediction and the actual values of the index sequence, and e_it is the logarithmic error between the predicted value x_it of the i-th prediction method and the actual value x_t at time t. The smaller F(L) is, the closer the weighted geometric means combination prediction is to the actual values of the index series, and thus the more accurate and effective it is.
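As we read Eq. (9), the weights l_i are found on the training results by minimizing the summed absolute logarithmic errors, which can be rewritten as a small linear program, and the fused prediction is then the weighted geometric mean of the individual predictions. The sketch below is our own formulation under that reading (it assumes SciPy is available and that all scores are strictly positive, e.g., softmax outputs); it is not the reference implementation of [26]:

import numpy as np
from scipy.optimize import linprog

def fit_fusion_weights(preds, actual):
    """preds: (n_methods, N) predicted values; actual: (N,) true values.
    Returns the weights l_i of Eq. (9)."""
    e = np.log(actual)[None, :] - np.log(preds)          # e_it = ln x_t - ln x_it
    n, N = e.shape
    # Minimize sum_t u_t with -u_t <= sum_i l_i e_it <= u_t, sum_i l_i = 1, l_i >= 0.
    c = np.concatenate([np.zeros(n), np.ones(N)])
    A_ub = np.block([[ e.T, -np.eye(N)],
                     [-e.T, -np.eye(N)]])
    b_ub = np.zeros(2 * N)
    A_eq = np.concatenate([np.ones(n), np.zeros(N)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=(0, None))
    return res.x[:n]

def fuse(preds, weights):
    """Weighted geometric mean of the individual predictions."""
    return np.exp(weights @ np.log(preds))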


V. EXPERIMENT
A. Dataset
UCF-101 is one of the databases with the largest numbers of action categories and samples; it contains 13320 videos and 101 categories. The samples are taken from various sports broadcasts collected from the BBC/ESPN and from videos downloaded from the Internet. Fig. 5 shows several clips of UCF-101.

Figure 5. Several clips of UCF-101

HMDB-51 contains 6849 videos and 51 categories, and each category contains at least 101 videos. Most of the videos are from movies, and some are from public databases such as YouTube. Fig. 6 shows several clips of HMDB-51.

Figure 6. Several clips of HMDB-51

We use split 1 of UCF-101 and HMDB-51 for training and validation. When testing, the dataset is the same as the validation set.
B. Implementation
In the experiments we take n = 1, 2, that is, we add one and two up-sampling layers respectively.

Training. We use SGD with momentum to train the networks. At the same time, in order to augment the data, we randomly generate samples from the training videos. Fig. 7 shows the method of generating training samples.

Figure 7. Generating training samples

First, we choose a temporal location in the video by uniform sampling and produce a 16-frame clip around it; if the video is shorter than 16 frames, we loop it as many times as necessary. Then, we randomly pick a spatial location from the four corners or the center and select the spatial scale of the sample for multi-scale cropping. The aspect ratio of the sample is 1, and the sample is cropped spatio-temporally. Finally, the size of each sample is 3 channels x 16 frames x 112 pixels x 112 pixels. All the resulting samples retain the same class labels as their original videos.
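A hedged sketch of this clip-generation procedure as we read it (the helper names, the candidate scale set, and the omitted resize to 112x112 pixels are our assumptions):

import random

def sample_clip(frames, clip_len=16):
    """Pick a random temporal location and return a 16-frame clip, looping short videos."""
    while len(frames) < clip_len:
        frames = frames + frames
    start = random.randint(0, len(frames) - clip_len)
    return frames[start:start + clip_len]

def random_spatial_crop(clip, scales=(1.0, 0.84, 0.71, 0.59, 0.5)):
    """Multi-scale cropping: one of the four corners or the center, at a random scale.
    The cropped square is then resized to 112x112 (resize step omitted here)."""
    h, w = clip[0].shape[:2]
    crop = int(min(h, w) * random.choice(scales))
    positions = {
        'center':       ((h - crop) // 2, (w - crop) // 2),
        'top_left':     (0, 0),
        'top_right':    (0, w - crop),
        'bottom_left':  (h - crop, 0),
        'bottom_right': (h - crop, w - crop),
    }
    y, x = positions[random.choice(list(positions))]
    return [frame[y:y + crop, x:x + crop] for frame in clip]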

For training, we initialize the parameters of the network with resnext-101 pre-trained on Kinetics and fine-tune the last two layers using the SGD optimizer with momentum 0.9. We start with a learning rate of 0.05 and divide it by 10 after the validation loss saturates. To prevent overfitting, we also add dropout with probability 0.5. The loss function is the cross-entropy loss, where ŷ represents the predicted value and y represents the actual value:

L(ŷ, y) = −log(ŷ)        if y = 1
          −log(1 − ŷ)     otherwise    (10)
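A compact sketch of this fine-tuning setup (the attribute names layer4 and fc, the scheduler patience, and the placement of the dropout layer are assumptions on our part, not details given in the paper):

import torch
import torch.nn as nn

def configure_training(model, num_classes=101):
    """Freeze the Kinetics-pretrained backbone except the assumed last two stages."""
    for p in model.parameters():
        p.requires_grad = False
    # Dropout with probability 0.5 before the new classifier, as described above.
    model.fc = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(2048, num_classes))
    for p in list(model.layer4.parameters()) + list(model.fc.parameters()):
        p.requires_grad = True
    optimizer = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad],
        lr=0.05, momentum=0.9)
    # Divide the learning rate by 10 when the validation loss stops improving.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.1, patience=10)
    criterion = nn.CrossEntropyLoss()   # multi-class cross-entropy loss (cf. Eq. (10))
    return optimizer, scheduler, criterion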
Recognition. We use a sliding window to produce the input clips, input each clip into the network, and evaluate its class score; the video-level score is the average over all clips. Using the up-sampling method described above, we generate two feature maps with different scales, train and recognize with each, and obtain different class scores. Finally, we fuse the different class scores through [26] to get the final class score.
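For completeness, a small sketch of this test-time procedure as we read it (function names are ours; the weights l_i come from the method of Section IV-C):

import numpy as np

def video_score(clip_scores):
    """clip_scores: (num_clips, num_classes) scores from the sliding window for one branch.
    The video-level score is the average over all clips."""
    return clip_scores.mean(axis=0)

def fused_prediction(branch_scores, weights):
    """branch_scores: list of video-level score vectors, one per up-sampling branch.
    The branches are fused by a weighted geometric mean with the learned weights."""
    stacked = np.stack(branch_scores)                    # (n_branches, num_classes)
    fused = np.exp(weights @ np.log(stacked + 1e-12))
    return int(np.argmax(fused))                         # final class index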


C. Feature maps
We run the three different up-sampling methods separately. Fig. 8 shows the feature maps obtained with each of them.

(a) Nearest neighbor interpolation (b) Bilinear interpolation (c) Trilinear interpolation
Figure 8. The feature maps of the three up-sampling methods

(a) Nearest neighbor interpolation (b) Bilinear interpolation (c) Trilinear interpolation
Figure 9. The same feature map enlarged from each of the three sets of feature maps

We separately enlarge the same feature map from each of the three sets of feature maps. As shown in Fig. 9, the feature map obtained by nearest neighbor interpolation is fuzzy and obviously jagged; the feature map obtained by bilinear interpolation is not obviously jagged, but its edges are blurred; the feature map obtained by trilinear interpolation is not obviously jagged and its edges are clearer. These results confirm that the choice made in Section IV is correct.
D. Results
We separately sample a 16-frame clip and a 32-frame clip for training. Figs. 10 and 11 show the training and validation losses for the different sample durations. As can be seen in the figures, the validation losses are only slightly higher than the training losses on UCF-101, which indicates that the network performs well on UCF-101. However, on HMDB-51 the validation losses converge quickly and remain higher than the training losses, which indicates that the performance of the network on HMDB-51 is not as good as on UCF-101.

Figure 10. Training and validation losses with 16 frames: (a) UCF-101, (b) HMDB-51
Figure 11. Training and validation losses with 32 frames: (c) UCF-101, (d) HMDB-51

For evaluating the network, we measure clip and video accuracies. We take the class with the maximum score as the prediction for each clip, and we obtain the video score by averaging the clip scores. Table II shows the accuracies for different sample durations. As the sample duration increases, both the clip-level accuracy and the video-level accuracy improve.


TABLE II
THE ACCURACIES OF DIFFERENT SAMPLE DURATIONS

                 16f x 112 x 112        32f x 112 x 112
Dataset          Clip      Video        Clip      Video
UCF-101          85.9      89.2         87.5      90.3
HMDB-51          57.9      57.9         57.8      58.4

We then compare our architecture with state-of-the-art algorithms; the results are presented in Table III. The proposed architecture achieves higher accuracies than the other state-of-the-art algorithms. Moreover, on UCF-101 our architecture improves by 4.4% over iDT [4], the most effective hand-crafted-feature method, and by 2.3% over the two-stream network [14] with deep features.

TABLE III
COMPARISON WITH STATE-OF-THE-ART ALGORITHMS

Method                      UCF-101     HMDB-51
iDT [4]                     85.9        51.9
Res3D [27]                  85.8        54.9
ConvNet [28]                65.4        -
C3D [29]                    82.3        -
P3D [30]                    88.6        -
Two-stream CNN [14]         88.0        59.4
Two-stream + LSTM [31]      88.6        -
FstCN [32]                  84.5        49
Proposed                    90.3        58.4

Figure 12. The confusion matrices: (a) UCF-101 with 16 frames, (b) UCF-101 with 32 frames, (c) HMDB-51 with 16 frames, (d) HMDB-51 with 32 frames

Fig. 12 shows the confusion matrices for the different sample durations on UCF-101 and HMDB-51.

In UCF-101, most of the classes perform well; some classes, such as ApplyEyeMakeUp, ApplyLipstick, Archery, BabyCrawling, and Rowing, reach accuracies of 99.8%, 99.7%, 99.9%, 99.9%, and 99.8% respectively. However, as shown in Fig. 12 (a) and (b), the most confused class pairs are: Shotput with VolleyballSpiking (39.1%), Skiing with Surfing (35.0%), ShavingBeard with ApplyEyeMakeUp (32.6%), RockClimbingIndoor with RopeClimbing (32.7%), and PlayingFlute with PlayingViolin (29.2%). Fig. 13 shows the most confused classes of UCF-101. Shotput and VolleyballSpiking involve similar movements and have many people in the scene. Skiing and Surfing are from the same type of sport and have the same kind of background. ShavingBeard and ApplyEyeMakeUp take place in similar rooms of a house and have similar movements. RockClimbingIndoor and RopeClimbing are the same type of sport. PlayingFlute and PlayingViolin both involve playing musical instruments; the subjects are in similar positions and move their hands.

In HMDB-51, the classes do not perform as well as in UCF-101. As shown in Fig. 12 (c) and (d), the most confused class pairs are: ride_horse with ride_bike (56.6%), smile with laugh (43.3%), talk with chew (36.7%), cartwheel with flic_flac (30.0%), and walk with run (26.7%). Fig. 14 shows the most confused classes of HMDB-51. Ride_horse and ride_bike both involve riding movements and similar scenes. Smile and laugh are similar facial expressions. Talk and chew have similar mouth movements. Cartwheel and flic_flac are two similar actions. Walk and run have similar leg movements. In addition, some classes are confused with several other classes: fall_floor is confused with kick_ball, jump, punch, and stand, and shoot_bow is confused with shoot_gun, jump, and punch.

Figure 13. The most confused classes in UCF-101 (Shotput vs. VolleyballSpiking)

Figure 14. The most confused classes in HMDB-51 (ride_horse vs. ride_bike)

VI. CONCLUSION
At present, human action recognition is a focus and difficulty of research and has very wide application prospects, mainly in monitoring, human-computer interaction, and other scenarios. Due to the complexity and diversity of human action, research on human action recognition faces great challenges. This paper presented a solution to improve the performance of human action recognition. We proposed an architecture based on Multi-feature Map Fusion, which uses multiple up-sampling layers to enlarge the feature maps, so that smaller targets can be better detected and, at the same time, the features extracted by the network carry more, and clearer, information. For the up-sampling method, the nearest neighbor, bilinear, and trilinear interpolation methods were studied; experiments show that the feature maps obtained by trilinear interpolation are better than those of the other two methods. We also trained with clips of different sample durations, and the results indicate that as the sample duration increases, the accuracies also improve. Finally, instead of fusing the results obtained by the network by averaging scores as in [14], we proposed to use the weighted geometric means combination forecasting method based on the L_1 norm to fuse the n results. The proposed architecture achieves 90.3% and 58.4% on UCF-101 and HMDB-51 respectively, which illustrates that the architecture is effective and comparable.

REFERENCES
[1] A. Diba, V. Sharma and L. Van Gool, "Deep temporal linear encoding networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, 2016, pp. 2329-2338.
[2] Q. Fu, "Analysis of Human Behavior Recognition," China Computer & Communication (Theory Edition), vol. 2017, no. 24, pp. 146-147, 2017. DOI: CNKI: SUN: XXDL.0.2017-24-059.
[3] T. Yu, H. Gu, L. Wang, S. Xiang and C. Pan, "Cascaded temporal spatial features for video action recognition," in Proc. Int. Conf. Image Process. (ICIP), Beijing, China, Sep. 2017, pp. 1552-1556.
[4] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proc. IEEE Int. Conf. Comput. Vision (ICCV), Sydney, Australia, Dec. 2013, pp. 3551-3558.
[5] G. Willems, T. Tuytelaars and L. V. Gool, "An efficient dense and scale-invariant spatio-temporal interest point detector," in Proc. European Conf. on Computer Vision (ECCV), Springer, Berlin, Heidelberg, Oct. 2008, pp. 650-663.
[6] P. Scovanner, S. Ali and M. Shah, "A 3-dimensional sift descriptor and its application to action recognition," in Proc. ACM Int. Multimedia Conf., Augsburg, Bavaria, Germany, Sep. 2007, pp. 357-360. DOI: 10.1145/1291233.1291311.


[7] A. Klaser, M. Marszalek and C. Schmid, "A spatio-temporal descriptor based on 3d-gradients," in Proc. Br. Mach. Vis. Conf. (BMVC), Leeds, UK, Sep. 2008, pp. 275:1-10.
[8] Y. Zhou, X. Sun, Z. J. Zha and W. Zeng, "MiCT: Mixed 3d/2d convolutional tube for human action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Salt Lake City, USA, Jun. 2018, pp. 449-458.
[9] D. L. He, F. Li, Q. J. Zhao, X. Long, Y. Fu and S. L. Wen, "Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition," Jun. 2018, arXiv: 1806.10319. [Online]. Available: https://arxiv.org/abs/1806.10319
[10] A. S. Razavian, H. Azizpour, J. Sullivan and S. Carlsson, "CNN Features Off-the-Shelf: An Astounding Baseline for Recognition," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recogn. Workshops (CVPRW), Columbus, Ohio, USA, Jun. 2014, pp. 806-813.
[11] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu and F. E. Alsaadi, "A survey of deep neural network architectures and their applications," Neurocomputing, vol. 234, no. 19, pp. 11-26, Apr. 2017. DOI: 10.1016/j.neucom.2016.12.038.
[12] S. Ji, W. Xu, M. Yang and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221-231, Jan. 2013.
[13] V. M. Khong and T. H. Tran, "Improving human action recognition with two-stream 3d convolutional neural network," in Proc. Int. Conf. Multimed. Anal. Pattern Recognit. (MAPR), Ho Chi Minh City, Vietnam, Apr. 2018, pp. 1-6.
[14] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," Adv. Neural Inf. Process. Syst., Dec. 2014.
[15] J. Carreira and A. Zisserman, "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, May 2017, pp. 6299-6308.
[16] X. Wang, R. Girshick, A. Gupta and K. He, "Non-local neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Nov. 2017, pp. 7794-7803.
[17] Z. Xu, Y. Yang and A. G. Hauptmann, "A discriminative CNN video representation for event detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 1798-1807.
[18] Z. Qiu, T. Yao and T. Mei, "Deep quantization: Encoding convolutional activations with deep generative model," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Nov. 2017, pp. 6759-6768.
[19] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang and L. Van Gool, "Temporal segment networks for action recognition in videos," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 11, pp. 2740-2755, Sep. 2018.
[20] C. Li, Q. Y. Zhong, D. Xie and S. L. Pu, "Collaborative Spatiotemporal Feature Learning for Video Action Recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, Jun. 2019, pp. 7872-7881.
[21] Y. Z. Zhou, X. Y. Sun, C. Luo, Z. J. Zha and W. J. Zeng, "Spatiotemporal Fusion in 3D CNNs: A Probabilistic View," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2020, pp. 9829-9838.
[22] T. Y. Lin, P. Dollár, R. Girshick, K. M. He, B. Hariharan and S. Belongie, "Feature Pyramid Networks for Object Detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, May 2017, pp. 2117-2125.
[23] S. Xie, R. Girshick, P. Dollár, Z. Tu and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Nov. 2017, pp. 1492-1500.
[24] M. Zolfaghari, K. Singh and T. Brox, "ECO: Efficient convolutional network for online video understanding," in Proc. European Conf. on Computer Vision (ECCV), Munich, Germany, Sep. 2018, pp. 695-712.
[25] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," Apr. 2018, arXiv: 1804.02767. [Online]. Available: https://arxiv.org/abs/1804.02767
[26] H. Y. Chen, "Weighted geometric means combination forecasting method based on L_1 norm," Journal of Anhui University (Natural Sciences), vol. 28, no. 4, pp. 5-10, Jul. 2004.
[27] D. Tran, J. Ray, Z. Shou, S. F. Chang and M. Paluri, "ConvNet architecture search for spatiotemporal feature learning," Aug. 2017, arXiv: 1708.05038. [Online]. Available: https://arxiv.org/abs/1708.05038
[28] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Columbus, OH, USA, Sep. 2014, pp. 1725-1732.
[29] D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proc. IEEE Int. Conf. Comput. Vision (ICCV), Santiago, Chile, Dec. 2015, pp. 4489-4497.
[30] Z. Qiu, T. Yao and T. Mei, "Learning spatio-temporal representation with pseudo-3d residual networks," in Proc. IEEE Int. Conf. Comput. Vision (ICCV), Venice, Italy, Oct. 2017, pp. 5533-5541.
[31] Y. H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 4694-4702.
[32] L. Sun, K. Jia, D. Y. Yeung and B. E. Shi, "Human action recognition using factorized spatio-temporal convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 4597-4605.

HAOFEI WANG received a bachelor's degree in Electrical Engineering and Automation from Zhejiang SCI-TECH University in 2018. She is currently studying for a master's degree in the Department of Control Science and Engineering, Zhejiang SCI-TECH University. Her current research interests are pattern recognition and intelligent systems.

JUNFENG LI received the B.S. degree in electrical engineering from Zhengzhou University, China, in 2002, the M.S. degree in mechanical engineering from Zhejiang SCI-TECH University, Hangzhou, China, in 2005, and the Ph.D. degree in mechanical engineering from Donghua University, Shanghai, China, in 2010. From 2005 to 2011, he was a lecturer with the Department of Automation, Zhejiang SCI-TECH University. Since 2011, he has been an Assistant Professor with the Department of Electrical Engineering, Zhejiang SCI-TECH University. His research interests include intelligent information processing, machine learning, and defect detection.
