IET Image Processing
Research Article
Improving the robustness of motion vector
temporal descriptor
ISSN 1751-9659
Received on 4th March 2017
Revised 5th September 2017
Accepted on 1st October 2017
E-First on 17th November 2017
doi: 10.1049/iet-ipr.2017.0206
www.ietdl.org
Farzaneh Rahmani1, Farzad Zargari1, Mohammad Ghanbari2,3
1IT Faculty, Research Institute for ICT (Iran Telecomm Research Centre), North Kargar, Tehran, Iran
2School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
3School of Electrical and Computer Engineering, University of Tehran, North Kargar, Tehran, Iran
E-mail: Zargari@itrc.ac.ir
Abstract: Motion vectors (MVs) are the most common temporal descriptors in video analysis, indexing and retrieval applications. However, video indexing and analysis based on MVs do not perform well for videos at different dimension ratios (DRs) or resolutions. As a result, MV-based identification of similar videos faces many difficulties when the compared videos differ in DR or resolution. In this study, a two-stage algorithm is introduced to make MV descriptors robust against variations first in DR and then in resolution. In experiments performed on motion vector histograms, the proposed method improves the performance of identifying similar videos at various spatial specifications by up to 73%. Moreover, in the video retrieval experiments, the proposed modified MV feature vector outperforms the original MV feature vector, indicating improved differentiation of similar and dissimilar videos by the proposed temporal feature vector.
1 Introduction
The visual contents of video, such as colour, texture, shape and motion, are used in content-based video analysis, indexing and retrieval. Colour, texture and shape are spatial features and are used in both image and video indexing and analysis applications [1], whereas motion is a temporal feature which is extracted from a sequence of frames in a video clip and is specific to video.
Spatial and temporal features can be extracted either in pixel
domain or in compressed domain. Compressed domain feature
vector extraction avoids the extra time and computational power
which is required for decompression of coded video or image.
There are several proposed methods in [2–6] for extraction of
spatial features in the compressed domain. Motion vectors (MVs)
can be directly extracted in the compressed domain from videos
coded by standard video codecs such as MPEG and H.26X
families.
Even though both temporal and spatial descriptors can be used
in video analysis and indexing, there are several proposed solutions
only based on temporal features for many applications such as
action detection, shot classification and shot boundary detection
[7–13]. Temporal features are also employed to obtain key frames
by dividing a shot into segments of equal cumulative motion
activity using MPEG-7 motion activity descriptor [1]. Moreover,
temporal features are used extensively in compressed domain video
for analysis and indexing applications [14–22]. In [7], motion
information of a video is modelled by a two-dimensional motion
histogram of MVs. The picture displacement in the horizontal and
vertical directions is quantised into 121 segments (60 segments for
positive, 60 for negative and one for zero). Totally, there are 121 ×
121 bins for the 2D motion histogram. MVs between consecutive
frames of MPEG-1 video stream are used to generate motion
histogram of P frames. The motion vector histogram (MVH) is
then normalised to the number of P frames in a shot. Shao et al.
[13] have presented an automated video analysis system which
addresses segmentation and detection of human actions in an
indoor environment. They used colour intensity and motion
information for action segmentation. They also described human
actions using motion and shape features for human action
recognition. Babu and Ramakrishnan [14] presented an objectbased video indexing and retrieval system using motion
IET Image Process., 2018, Vol. 12 Iss. 1, pp. 98-104
© The Institution of Engineering and Technology 2017
information obtained from compressed MPEG video. The main
contribution of their proposed system is the utility of the readily
available motion information of MPEG video for global and
object-based retrieval. Akrami and Zargari [15] introduced a
temporal feature in compressed domain for video indexing and
retrieval. Their proposed indexing method is based on the
histogram of positions of blocks which are used in motion
compensation. The method was implemented on H.264/AVC-coded video and outperformed MVH in video retrieval
experiments. Tom et al. [16] discussed an approach for human
action recognition in H.264/AVC compressed domain. Their
proposed algorithm utilises information from quantisation parameters and MVs extracted from the compressed video sequences for feature extraction, and classification is performed
by support vector machines. Biswas and Babu [17] proposed a
simple and effective approach to classify H.264 compressed video,
by capturing orientation information from the MVs. Fei and Zhu
[18] presented a mean shift clustering-based moving object
segmentation approach in the H.264 compressed domain. In their
approach, the motion information extracted from H.264
compressed video, including MVs and partitioned block size, are
used for moving object segmentation. A real-time video object
segmentation algorithm is proposed in [19] that works in H.264
compressed domain. The algorithm utilises the motion information
from the H.264 compressed bit stream to identify background
motion model and moving objects. Yu et al. [20] proposed a
pioneering motion estimation approach based on modelling of
motion imaging to stabilise the captured video from a fast-moving
car. Their method was based on MVs.
Due to rapid advancement in consumer electronics, there are
various digital video capturing devices with different specifications
which can generate video at various dimension ratios (DRs) and
resolutions. Variations in the DR and resolution of the captured or produced video make the employment of many of the aforementioned MV-based techniques difficult. This is because even the same video at different DRs or resolutions produces a different set of MVs; hence, the MVs cannot be considered representative of the temporal content of a video, as they are highly affected by spatial specifications such as DR and resolution. As shown in Fig. 1, the same motion in videos with different resolutions produces MVs of different lengths, and in videos with different DRs it produces MVs differing in both length and angle.
Fig. 1 Relation of MVs to different DRs and resolutions
(a1, b1, c1) First frames, (a2, b2, c2) Corresponding following frames
Fig. 2 Successive stages of the proposed method
This imposes limitations on the applications based on
MVs. For example, in [19] a number of constant values have been
defined as thresholds, such as γ in motion cost termination, and the
threshold values depend on video resolution, though it is not
expressed explicitly in the article. Therefore, for a video at a different resolution these values should be re-initialised; as a result, the reported experiments in [19] are performed on a single video resolution. Similarly, the 2D motion histogram constructed in [7] is based on the size of MVs in the positive and negative directions; hence, the histogram is sensitive to variations in resolution and DR, especially in the retrieval of similar videos at different resolutions and DRs.
Tasdemir et al. [22] addressed the problem of variation of MVs
for the same video in different resolutions. They extracted MVs at
lower frame rates using exhaustive search in pixel domain. Mean
of the magnitudes of MVs (MMMV) as well as the mean of the
angles of MVs (MPMV) for macro blocks of a frame are used as
feature vectors. They were faced with the problem of varying MVs
for similar video at different resolutions. To solve this problem,
they have normalised MMMV and MPMV values by the mean and
standard deviation of the features at the entire frames. Even though
IET Image Process., 2018, Vol. 12 Iss. 1, pp. 98-104
© The Institution of Engineering and Technology 2017
their method is robust against variations in resolution for a number
of video resolutions, it suffers from high computational load and
moreover it is offline, because it requires information about the
entire frames’ MVs.
In this paper, which is an improved and extended version of our
previous work [23], a method for making MVs robust against DR
and resolution variations is proposed. The proposed method is
tested on H.264/AVC-coded video but the same principles can be
used for the MVs derived from coded video by the other MV-based
video coding standards. The first stage of the proposed method
deals with variations in DR and in the second stage, the variations
in resolution are addressed.
Experimental results indicate that employing only the first stage
can improve the performance of identifying similar video with
various DRs by up to 28%, and the improvement achieved by both
stages increases the performance by up to 73%. Fig. 2 shows the successive stages of the proposed method: the MVs extracted by partial decoding are processed in two stages, for DR and resolution modification, and the modified MVs are used in the evaluations instead of the initially extracted MVs.
This paper is organised as follows: In Section 2, the proposed
method is described. Section 3 is dedicated to the evaluation of the
proposed method followed by concluding remarks in Section 4.
2 Proposed method
In this section, the proposed method for improving the robustness of MVs against variations in DR and resolution is presented, using the MV histogram intersection as a similarity measure between videos. Hence, a brief explanation of MV histograms is provided first, and then the proposed method, named resolution-dimension scaled motion vector histogram (R-D SMVH), is explained.
Motion estimation is commonly used in all standard video
coding families of H.26X and MPEG. MVs can be directly
extracted from the H.26X and MPEG coded video streams
including H.264/AVC. Each frame in H.264/AVC can be coded as
I, P or B. I-frames are intra coded, whereas macroblocks in P and B
frames can be inter coded. Motion estimation in P-frames is
unidirectional, whereas B frames can use bidirectional motion
estimation as well. The proposed two-stage method is described for
MV information derived from P frames. However, the method can
be easily employed for B-frames without any modification.
Assume MVm(mvx, mvy) is the MV of an inter-coded macroblock m in a P frame. The MV histogram is a two-dimensional histogram with Q × R bins, and MVm is assigned to bin (i, j), where i and j are derived as

    i = ⌊(mvx + biasx)/Kx⌋    if 0 ≤ mvx + biasx ≤ upperlimit_mvx
        Q                     if mvx + biasx > upperlimit_mvx          (1)
        0                     if mvx + biasx < 0

    j = ⌊(mvy + biasy)/Ky⌋    if 0 ≤ mvy + biasy ≤ upperlimit_mvy
        R                     if mvy + biasy > upperlimit_mvy          (2)
        0                     if mvy + biasy < 0

where biasx and biasy are used to up-shift the MV ranges to non-negative values. The bins are two dimensional, and Kx and Ky are the horizontal and vertical ranges of a bin.
The MV histogram for a P frame is defined as H(P), consisting of Q × R bins:

    H(P) = {H0,0(P), H0,1(P), …, H2,1(P), H2,2(P), …, HQ,R(P)}         (3)

For a P frame containing n macroblocks, each bin of the MV histogram is computed as

    Hq,r(P) = Σ (m = 1 to n) #Blocks(m) × δ(Bin(MVm) − (q, r))         (4)

where #Blocks(m) is the number of 4 × 4 pixel blocks in macroblock m, and δ is the Kronecker delta function, which is 1 when Bin(MVm), computed from (1) and (2), is equal to (q, r) and is zero otherwise. It is worth noting that two extra bins are added to the histogram for intra-coded and skipped macroblocks.
The MV histogram can be normalised as

    H̄(P) = H(P)/W                                                      (5)

where W is the total number of 4 × 4 blocks in a P frame. This normalisation is common in image and video histograms; e.g. colour histograms are usually divided by the image resolution to produce a histogram robust against image resolution. However, it is not enough to make the MV histogram robust against the DR and resolution of frames, since besides the number of MVs, the MV values themselves are highly affected by the DR and resolution of the coded frame.
In the following sub-sections, the proposed two-stage method, which makes the MV histograms more robust against the DR and resolution of coded frames, is explained.

2.1 MV modification based on DR

In the first stage, MVs are modified to be robust against DR. The MVs are scaled using the DRatio parameter, defined as

    DRatio = F_width / F_height                                        (6)

where F_width and F_height are the width and the height of the P frame, respectively. Each MV in P is scaled by DRatio as

    DScaledMVm(dsmvx, dsmvy) = (mvx / DRatio, mvy)     if DRatio > 1
                               (mvx, mvy × DRatio)     if DRatio < 1   (7)

Now the MV histogram is generated for the scaled MVs, hereafter referred to as the dimension scaled MV histogram (DSMVH). By scaling the MVs, the resultant scaled MVs correspond to a virtual P-frame with resolution (VP_width, VP_height), where VP_width and VP_height are defined as

    VP_width  = F_width / DRatio = F_height     if DRatio > 1
                F_width                         if DRatio < 1          (8)

    VP_height = F_height                        if DRatio > 1
                F_height × DRatio = F_width     if DRatio < 1          (9)

and the virtual P-frame resolution is

    (VP_width, VP_height) = (F_height, F_height)    if DRatio > 1
                            (F_width, F_width)      if DRatio < 1      (10)

Even though the resultant DSMVH is robust to DR, it may not produce satisfactory results when the resultant virtual frames do not have similar resolutions.
Fig. 3 provides a pictorial representation of the MV modification process. The original videos might differ in both resolution and DR. In the first step, the MVs of each video are scaled such that they correspond to a square virtual video whose side is equal to the minimum of the sides of the original video. In the second step, the resulting MVs of the higher-resolution virtual video are scaled once more to generate MVs corresponding to a virtual video of the same resolution as the lower-resolution virtual video. It is worth noting that at each step at most one multiplication or division is necessary for modifying each MV; thus, the computational overhead of the modifications is negligible.
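To make the histogram construction concrete, the bin assignment of (1) and (2) and the accumulation and normalisation of (4) and (5) can be sketched as follows. The bin parameters (Q, R, bias, K and the upper limit) are hypothetical placeholders, not values from the paper, and the two extra intra/skip bins are omitted for brevity.

```python
import numpy as np

def mv_bin(v, bias, k, upper, top):
    # eqs (1)-(2): shift the component by `bias`; out-of-range values are
    # clamped to the extreme bins 0 and `top` (i.e. Q or R)
    v = v + bias
    if v < 0:
        return 0
    if v > upper:
        return top
    return int(v // k)

def mv_histogram(mvs, n_blocks, W, Q=10, R=10, bias=64, k=13, upper=128):
    """Normalised 2-D MV histogram of one P frame, eqs (4)-(5).
    mvs: list of (mvx, mvy) per inter-coded macroblock; n_blocks[m]: number
    of 4x4 blocks in macroblock m; W: total number of 4x4 blocks in the
    frame (the normaliser of eq. (5))."""
    H = np.zeros((Q + 1, R + 1))
    for (mvx, mvy), b in zip(mvs, n_blocks):
        # each MV contributes the 4x4-block count of its macroblock, eq. (4)
        H[mv_bin(mvx, bias, k, upper, Q), mv_bin(mvy, bias, k, upper, R)] += b
    return H / W
```

With these placeholder parameters, two 16-block macroblocks in a 32-block frame each contribute 0.5 to their bins, so the histogram sums to 1.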
Fig. 3 Original and virtual videos after two-stage modifications
2.2 MV modification based on resolution

By employing the second stage, robustness of MVs against resolution is achieved. As shown in Fig. 3, in the modification of MVs for videos with different resolutions, the MVs of the higher-resolution video are down-scaled. Assume XA × XA and XB × XB are the resolutions of the two videos A and B. The resolution scaled MVs for each frame of A are derived as

    R-DScaledMVm(rdsmvx, rdsmvy) = (dsmvx, dsmvy)                         if XA < XB
                                   (dsmvx × XB/XA, dsmvy × XB/XA)         if XA > XB   (11)
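The two scaling stages, (7) in the first stage and (11) in the second, each amount to at most one multiplication or division per MV component. A minimal sketch (the function names are ours, not from the paper):

```python
def d_scale(mvx, mvy, f_width, f_height):
    # Stage 1, eqs (6)-(7): scale the longer-axis component by DRatio so the
    # MVs correspond to a square virtual frame, eq. (10)
    dratio = f_width / f_height
    if dratio > 1:
        return mvx / dratio, mvy      # landscape: shrink horizontal MVs
    if dratio < 1:
        return mvx, mvy * dratio      # portrait: shrink vertical MVs
    return mvx, mvy                   # already square

def rd_scale(dsmvx, dsmvy, xa, xb):
    # Stage 2, eq. (11): down-scale the MVs of the higher-resolution virtual
    # video (side xa) to match the lower-resolution one (side xb)
    if xa > xb:
        return dsmvx * xb / xa, dsmvy * xb / xa
    return dsmvx, dsmvy
```

For example, a (32, 0) MV in a 1280 × 720 frame is first mapped to approximately (18, 0) for the 720 × 720 virtual frame, and then to approximately (6, 0) when compared against a 240 × 240 virtual video.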
The resulting MV histogram is called R-D SMVH. The next step is histogram normalisation, performed as

    H̄α,β(P) = Hα,β(P)/W                                                   (12)

Histograms can be generated for a part of a video, such as a number of groups of pictures or a shot. If there are F P-frames in the given part of the video, each bin of the final MV histogram is constructed as

    H̄α,β(final) = (1/F) Σ (j = 1 to F) H̄α,β(Pj)                           (13)
Similarity between two videos is computed as the sum of the intersections between the corresponding bins of the given final histograms:

    S = Σ (i = 1 to Q) Σ (j = 1 to R) (H̄i,j(1) ∩ H̄i,j(2))
      = Σ (i = 1 to Q) Σ (j = 1 to R) min(H̄i,j(1), H̄i,j(2))              (14)

Fig. 4 MVH, DSMVH and R-D SMVH histograms for the Kimono video in two different DRs and resolutions
(a1) MVH for 320 × 240, (a2) DSMVH for 320 × 240 ≃ MVH for 240 × 240, (a3) R-D SMVH for 320 × 240 ≃ MVH for 240 × 240, (b1) MVH for 1280 × 720, (b2) DSMVH for 1280 × 720 ≃ MVH for 720 × 720, (b3) R-D SMVH for 1280 × 720 ≃ MVH for 240 × 240
Since the histograms are normalised, S is a number in the range [0, 1], and a higher S value implies higher similarity between the two videos. Thus, in the comparison of a video at two different resolutions or DRs, the feature vector which yields the higher similarity S has the superior performance.
In order to provide a pictorial representation of the operation of the proposed method, a sample of the histograms produced by the proposed method is shown in Fig. 4.
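The final-histogram averaging of (13) and the intersection similarity of (14) can be sketched as follows; for normalised histograms the intersection lies in [0, 1].

```python
import numpy as np

def final_histogram(frame_hists):
    # eq. (13): average the normalised per-frame histograms of the F
    # P-frames in the given part of the video
    return np.mean(frame_hists, axis=0)

def similarity(h1, h2):
    # eq. (14): histogram intersection, i.e. the sum of bin-wise minima;
    # identical normalised histograms give S = 1
    return np.minimum(h1, h2).sum()
```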
3 Experimental results
Two groups of experiments were conducted to evaluate the
performance of the proposed method. At first, the effectiveness of
the proposed method on detecting the same video at different
resolutions and DRs was evaluated. In these experiments, down
sampled or up sampled versions of the original videos were
employed. In the second group of experiments, a video retrieval
application was conducted to evaluate the performance of the
proposed modified feature vectors in discriminating similar videos
from dissimilar ones. Finally, significance analysis and execution-time analysis are included to cover other aspects that may be important in implementing the proposed method. Moreover, in order to facilitate access to the test videos used in the experiments, they are made available to interested readers at [24].
Fig. 5 Resolutions of original videos (a1, b1) and virtual videos (a2, b2)
(a1) First frame of original video 1, 720 × 480 (DR = 3:2), (a2) First frame of the corresponding virtual video 1, 480 × 480, (b1) First frame of original video 2, 480 × 640 (DR = 3:4), (b2) First frame of the corresponding virtual video 2, 480 × 480
Table 1 Used test sequences
Class      Resolution, pixels   Video              Spatial complexity   Temporal complexity
A          2560 × 1600          people-on-street   high                 high
B          1920 × 1080          Kimono1            high                 high
                                ParkScene          low                  medium
C          832 × 480            basketball-drill   high                 low
                                BQMall             high                 medium
                                PartyScene         low                  low
D          416 × 240            basketballpass     medium               low
                                blowing bubbles    low                  high
                                Mobisode2          low                  low
                                RaceHorses         low                  high
E          1280 × 720           FourPeople         low                  high
                                mobilecalender     low                  low
ultra HD   4096 × 2304          honey bees         low                  low
                                PuppiesBath        high                 high
Table 2 Resolution of videos in the conducted tests
Test number   Resolution of video 1, pixels   Resolution of video 2, pixels
1             320 × 240                       1280 × 720
2             480 × 640                       1280 × 720
3             360 × 360                       1280 × 720
4             240 × 320                       640 × 360
5             480 × 640                       3840 × 2160
Fig. 4 shows the histograms of the original MVs and the modified MVs for the 320 × 240 and 1280 × 720 pixel versions of the Kimono video. Figs. 4a1 and b1 show the original histograms, Figs. 4a2 and b2 show the modified DSMVH histograms, and Figs. 4a3 and b3 show the modified R-D SMVH histograms. It can be seen in Fig. 4 that at each step of the MV modification the resulting MV histograms for the two versions of the video become more similar, both objectively and subjectively.
Moreover, Fig. 5 shows the relation between two videos at different resolutions and their corresponding virtual videos. The first column of Fig. 5 depicts the two original videos and the second column shows the corresponding virtual videos after the modification of the MVs of the original videos.
3.1 Similarity of videos in different DRs and resolutions

The test sequences include 12 standard video sequences from classes A to E of the video sequences proposed by the JCT-VC group, along with two ultra-HD video sequences [25]. A number of specifications of these classes are tabulated in Table 1. It is worth noting that the spatial and temporal complexities are reported based on the analysis of the 20 frames of each video sequence used in the experiments.
In the experiments, new copies of the test videos at different resolutions and DRs were first constructed. The new videos were then coded by the H.264/AVC reference software JM-18.0 [26]. The group of pictures (GOP), which is the number of frames from one I-frame to the next, was set to ten, comprising one I frame and nine P frames. The videos were coded and the MV information for the first two GOPs of each coded video was extracted. Motion estimation was performed using the fast full search option of JM-18.0, which employs an extended diamond search algorithm. Consequently, MVH, DSMVH and R-D SMVH histograms were generated from the MVs using the methods described in this article. Five tests were conducted, comparing copies of each video at different DRs and resolutions.
Table 2 shows the resolutions of the compared videos in each test. According to Table 2, in this experiment the ultra-HD videos are only down-sampled, but the other test sequences might be either up-sampled or down-sampled according to the required resolutions of the corresponding test condition. In fact, in each test two copies of the same video were used and the similarity between them was computed using the MVH, DSMVH and R-D SMVH methods.
Table 3 shows the resulting similarities for the different videos under the five test conditions. Since the same video is used in each comparison, higher similarity values in Table 3 imply better performance of the temporal feature. The experimental results in Table 3 indicate that MVH for the Kimono video sequence at different DRs and resolutions may yield an average similarity value as low as 0.53. This increases to 0.68 on average with DSMVH, whereas with R-D SMVH the average of the derived similarities over the entire set of tested cases is higher than 0.92, corresponding to a 73% improvement of R-D SMVH with respect to MVH. The average similarity and standard deviation over all test videos are shown in the last row of Table 3. The standard deviation for R-D SMVH is on average 0.13, which is lower than for the other feature vectors, showing that a persistent and reliable improvement is achieved by this feature vector. Hence, a steady improvement in performance can be achieved by employing the successive stages of the proposed modification of the MVs.
However, Table 3 indicates only a small improvement in similarity for a few of the tested videos, such as 'Mobilecalender' and 'Mobisode2'. This is because in these video sequences the percentage of non-zero MVs is very low. To justify this claim, the ratios of blocks that are coded as intra, skipped, and with non-zero MVs in each video sequence are tabulated in Table 4. The results in Table 4 indicate that the percentage of blocks with non-zero MVs
in the aforementioned video sequences is less than 20%, and hence the similarity between the histograms is mainly determined by the number of blocks coded as skip or intra. As a result, scaling the MVs has little impact on improving the performance of the similarity metric, which is due to the few non-zero MVs, or low temporal complexity, in these coded videos.

Table 3 Average and standard deviation of the performance of MVH, DSMVH and R-D SMVH under the different test conditions
Video             MVH avg   MVH std   DSMVH avg   DSMVH std   R-D SMVH avg   R-D SMVH std
BasketballDrill   0.8026    0.0837    0.8215      0.0942      0.8688         0.1133
BasketballPass    0.8236    0.1348    0.8473      0.1468      0.8821         0.1487
blowing bubbles   0.6722    0.1854    0.7605      0.2332      0.8440         0.1627
BQMall            0.7881    0.1211    0.8319      0.1477      0.8893         0.1493
FourPeople        0.9388    0.0458    0.9395      0.0453      0.9466         0.0411
Kimono            0.5392    0.1300    0.6847      0.1163      0.9253         0.0221
mobilecalender    0.8107    0.2375    0.8115      0.2359      0.8132         0.2329
Mobisode2         0.9438    0.0131    0.9513      0.0086      0.9654         0.0081
ParkScene         0.8433    0.0386    0.8761      0.0450      0.9031         0.0471
PartyScene        0.7734    0.1213    0.7875      0.1071      0.8018         0.1059
PeopleOnStreet    0.7760    0.1157    0.7876      0.1202      0.8541         0.0248
RaceHorses        0.4156    0.0794    0.4979      0.1403      0.7831         0.2074
honey bees        0.6920    0.1091    0.6750      0.0876      0.8124         0.0773
PuppiesBath       0.8568    0.0936    0.8989      0.0734      0.9349         0.0243
all video tests   0.7587    0.1837    0.7939      0.1724      0.8690         0.1345

Table 4 Motion compensation statistics of the test video sequences (all coded at 1280 × 720 pixels)
Video             Class      Intra, %   Skipped, %   Non-zero MVs, %   Class avg intra, %   Class avg skipped, %   Class avg non-zero MVs, %
PeopleOnStreet    A          5.5        29.6         64.9              5.5                  29.6                   64.9
Kimono            B          7.3        8.2          84.5              6.45                 25.55                  68
ParkScene         B          5.6        42.9         51.5
BasketballDrill   C          9.2        63.4         27.4              7.4                  57.66                  34.93
BQMall            C          5.5        68.4         26.1
PartyScene        C          7.5        41.2         51.3
blowing bubbles   D          6.6        22.03        71.37             7.55                 48.60                  43.84
BasketballPass    D          5.8        75.7         18.5
Mobisode2         D          2.3        92.2         5.5
RaceHorses        D          15.5       4.5          80
Mobilecalender    E          8.9        73.2         17.9              6                    79.8                   14.2
FourPeople        E          3.1        86.4         10.5
honey bees        ultra HD   8.2        30.1         61.7              8.85                 34.25                  56.9
PuppiesBath       ultra HD   9.5        38.4         52.1

3.2 Video retrieval using MVH, DSMVH and R-D SMVH

In this section, a video retrieval experiment is conducted to evaluate the performance of the proposed temporal features in differentiating similar videos from dissimilar ones. In this experiment, the UCF Sports Action database [27] is employed, which consists of 115 videos of 720 × 404 pixels covering 13 different actions. The queries are resized videos, from 240 × 320 pixels up to 3840 × 2160 pixels, from six different actions. The ANMRR [28] metric is used to measure the retrieval performance; ANMRR is a number in the range [0, 1], and a higher value implies lower retrieval performance. The experimental results tabulated in Table 5 show that R-D SMVH outperforms the original MVs in all of the conducted retrieval experiments. This indicates the superior performance of R-D SMVH not only in detecting similar videos but also in differentiating similar videos from dissimilar ones.

Table 5 ANMRR values for retrieval with queries of different sizes using MVH, DSMVH and R-D SMVH
Resolution of query, pixels   MVH    DSMVH   R-D SMVH
3840 × 2160                   0.57   0.39    0.49
1920 × 1080                   0.56   0.4     0.39
1280 × 720                    0.47   0.45    0.43
640 × 480                     0.46   0.44    0.43
480 × 640                     0.44   0.43    0.35
360 × 360                     0.44   0.4     0.4
320 × 240                     0.51   0.47    0.38
240 × 320                     0.44   0.48    0.36
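The ANMRR metric itself is not defined in this paper; the following sketch follows the common MPEG-7 formulation (ground-truth items not retrieved in the top K are penalised with rank 1.25K), which should be checked against [28] before use.

```python
def nmrr(ranks, ng, gtm):
    """NMRR for one query: `ranks` are the 1-based retrieval ranks of the
    NG ground-truth items, `gtm` is the largest NG over all queries."""
    k = min(4 * ng, 2 * gtm)                       # cut-off window K
    penalised = [r if r <= k else 1.25 * k for r in ranks]
    avr = sum(penalised) / ng                      # average rank
    mrr = avr - 0.5 - ng / 2                       # modified retrieval rank
    return mrr / (1.25 * k + 0.5 - 0.5 * ng)       # normalised to ~[0, 1]

def anmrr(queries, gtm):
    # queries: list of (ranks, ng) pairs; lower ANMRR means better retrieval
    return sum(nmrr(r, ng, gtm) for r, ng in queries) / len(queries)
```

A perfect retrieval (all ground-truth items at the top ranks) yields NMRR = 0, and missing every ground-truth item drives NMRR towards 1.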
3.3 Significance analysis of the proposed method

Significance analysis is performed on the experimental results in Tables 3 and 5 using the paired-samples T-test, to provide a comparative analysis of the MVH, DSMVH and R-D SMVH methods. Table 6 tabulates the resulting T value and its corresponding p value for each pair of methods at the significance level of 0.05. According to Table 6, DSMVH is significant compared with MVH only in the similarity calculation, but R-D SMVH is significant against MVH in both the similarity calculation and video retrieval.
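The paired-samples T statistic used above can be computed directly from the per-video score differences (the p value is then read from a t distribution with n − 1 degrees of freedom); the input lists below are hypothetical, not the values behind Table 6.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired-samples T statistic for two matched score lists:
    T = mean(d) / (stdev(d) / sqrt(n)) for differences d = a - b."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))
```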
Table 6 T-test results for the similarity measure and video retrieval experiments
                        Similarity measure     Video retrieval
Compared methods        T-value    p           T-value    p
DSMVH and MVH           3.042      0.0094      −1.996     0.086
R-D SMVH and MVH        3.403      0.0047      −4.828     0.0019

3.4 Time complexity analysis

In order to perform the time complexity analysis, the decoding execution time of 20 frames of three 3840 × 2160 H.264-coded videos was measured on the same PC for MVH, DSMVH and R-D SMVH. Table 7 reports the total decoding time for each method. The results in Table 7 indicate that the R-D SMVH decoding time is only about 4% higher than that of MVH, i.e. the proposed method imposes a negligible computational overhead on the common MVH method.

Table 7 Average decoding time of 3840 × 2160 videos using MVH, DSMVH and R-D SMVH
Method      Average decoding time, s
MVH         2.33
DSMVH       2.398
R-D SMVH    2.426
4 Conclusion
Variation in spatial dimensions is among the challenging issues in video indexing and retrieval based on MVs. This paper presented a two-stage modification of the MV information extracted from the compressed domain of H.264/AVC-coded videos, named the resolution-dimension scaled MV histogram (R-D SMVH). It was shown that this method makes the MV histograms robust against variations in the DR and resolution of the video frames. The proposed method is based on selective scaling of the MVs according to the DR and resolution of the video frames. Experimental results showed that the first stage of the proposed method improved the detection of similar videos with different DRs by up to 28% compared with MVH, and applying both stages increases the improvement to up to 73%. Moreover, in the video retrieval experiments (based on MVs), the proposed R-D SMVH outperforms the original MVs, which indicates its superior performance in differentiating between similar and dissimilar videos. As a result, R-D SMVH can be used more effectively than MVH in applications based on measuring video similarity, such as video retrieval and action detection, over a wide range of videos with various DRs and resolutions.
5 References

[1] Wiegand, T., Sullivan, G.J., Bjøntegaard, G., et al.: 'Overview of the H.264/AVC video coding standard', IEEE Trans. Circuits Syst. Video Technol., 2003, 13, (7), pp. 560–576
[2] Rahmani, F., Zargari, F.: 'Compressed domain visual information retrieval based on I-frames in HEVC', Multimedia Tools Appl., 2016, 75, (10), pp. 1–18
[3] Zargari, F., Mehrabi, M., Ghanbari, M.: 'Compressed domain texture based visual information retrieval method for I-frame coded frames', IEEE Trans. Consum. Electron., 2010, 56, (2), pp. 728–736
[4] Mehrabi, M., Zargari, F., Ghanbari, M.: 'Fast and low complexity method for content accessing and extracting DC-frames from H.264 coded videos', IEEE Trans. Consum. Electron., 2010, 56, (3), pp. 1801–1808
[5] Zargari, F., Rahmani, F.: 'Visual information retrieval in HEVC compressed domain'. Proc. 23rd Iranian Conf. Electrical Engineering, Tehran, Iran, May 2015, pp. 793–798
[6] Mehrabi, M., Zargari, F., Ghanbari, M., et al.: 'Fast content access and retrieval of JPEG compressed images', Signal Process. Image Commun., 2016, 46, pp. 54–59
[7] Chen, L.H., Chin, K.H., Liao, H.Y.: 'An integrated approach to video retrieval', Proc. Nineteenth Conf. Australasian Database, Australia, 2008, 75, pp. 49–55
[8] Ciptadi, A., Goodwin, M.S., Rehg, J.M.: 'Movement pattern histogram for action recognition and retrieval', in Fleet, D., Pajdla, T., Schiele, B., et al. (Eds.): 'Computer vision – ECCV 2014' (Springer International Publishing, 2014), pp. 695–710
[9] Koumaras, H., Gardikis, G., Xilouris, G., et al.: 'Shot boundary detection without threshold parameters', J. Electron. Imag., 2006, 15, (2), pp. 1–3
[10] Zhao, Z.C., Cai, A.N.: 'Shot boundary detection algorithm in compressed domain based on adaboost and fuzzy theory', in Jiao, L., Wang, L., Gao, X., et al. (Eds.): 'Advances in natural computation' (Springer Berlin Heidelberg, 2006), pp. 617–626
[11] Rossetto, L., Giangreco, I., Schuldt, H., et al.: 'Imotion – a content-based video retrieval engine', in He, X., Luo, S., Tao, D., et al. (Eds.): 'Multimedia modeling' (Springer International Publishing, 2015), pp. 255–260
[12] Chen, L.H., Chin, K.H., Liao, H.Y.: 'An integrated approach to video retrieval'. Proc. Nineteenth Conf. Australasian Database, Gold Coast, Australia, December 2007, pp. 49–55
[13] Shao, L., Ji, L., Liu, Y., et al.: 'Human action segmentation and recognition via motion and shape analysis', Pattern Recognit. Lett., 2012, 33, (4), pp. 438–445
[14] Babu, R.V., Ramakrishnan, K.R.: 'Compressed domain video retrieval using object and global motion descriptors', Multimedia Tools Appl., 2007, 32, (1), pp. 93–113
[15] Akrami, F., Zargari, F.: 'An efficient compressed domain video indexing method', Multimedia Tools Appl., 2014, 72, (1), pp. 705–721
[16] Tom, M., Babu, R.V., Praveen, R.G.: 'Compressed domain human action recognition in H.264/AVC video streams', Multimedia Tools Appl., 2015, 74, (21), pp. 9323–9338
[17] Biswas, S., Babu, R.V.: 'H.264 compressed video classification using histogram of oriented motion vectors (HOMV)'. Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, Canada, May 2013, pp. 2040–2044
[18] Fei, L., Zhu, S.: 'Mean shift clustering-based moving object segmentation in the H.264 compressed domain', IET Image Process., 2010, 4, (1), pp. 11–18
[19] Mak, C.M., Cham, W.K.: 'Real-time video object segmentation in H.264 compressed domain', IET Image Process., 2009, 3, (5), pp. 272–285
[20] Yu, J., Xiang, K., Wang, X., et al.: 'Video stabilisation based on modelling of motion imaging', IET Image Process., 2016, 10, (3), pp. 177–188
[21] Bruyne, S., Deursen, D., Cock, J., et al.: 'A compressed-domain approach for shot boundary detection on H.264/AVC bit streams', Signal Process. Image Commun., 2008, 23, (8), pp. 473–489
[22] Tasdemir, K., Cetin, A.E.: 'Content-based video copy detection based on motion vectors estimated using a lower frame rate', Signal Image Video Process., 2014, 8, (6), pp. 1049–1057
[23] Zargari, F., Rahmani, F.: 'A temporal feature vector which is robust against aspect ratio variations'. Proc. Eighth Int. Symp. Telecommunications, Tehran, Iran, September 2016
[24] 'The entire test videos'. Available at http://jmp.sh/aLxR8kb, accessed 2 September 2017
[25] 'Free downloadable 4K sample content'. Available at http://4ksamples.com/, accessed 1 August 2017
[26] 'H.264/AVC software coordination'. Available at http://iphome.hhi.de/suehring/tml/, accessed 1 August 2017
[27] Soomro, K., Zamir, A.R.: 'Action recognition in realistic sports videos', in 'Computer vision in sports' (Springer International Publishing, 2014)
[28] Zhu, M.: 'Recall, precision, and average precision', Technical Report, 9, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada, 2004