IET Image Processing
Research Article
Improving the robustness of motion vector
temporal descriptor
ISSN 1751-9659
Received on 4th March 2017
Revised 5th September 2017
Accepted on 1st October 2017
E-First on 17th November 2017
doi: 10.1049/iet-ipr.2017.0206
www.ietdl.org
Farzaneh Rahmani1, Farzad Zargari1, Mohammad Ghanbari2,3
1IT Faculty, Research Institute for ICT (Iran Telecomm Research Centre), North Kargar, Tehran, Iran
2School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
3School of Electrical and Computer Engineering, University of Tehran, North Kargar, Tehran, Iran
E-mail: Zargari@itrc.ac.ir
Abstract: Motion vectors (MVs) are the most common temporal descriptors in video analysis, indexing and retrieval applications. However, video indexing and analysis based on MVs do not perform well for videos at different dimension ratios (DRs) or resolutions. As a result, MV-based identification of similar videos faces many difficulties when the compared videos differ in DR or resolution. In this study, a two-stage algorithm is introduced to make MV descriptors robust against variations first in DR and then in resolution. In experiments performed on motion vector histograms, the proposed method improves the performance of identifying similar videos at various spatial specifications by up to 73%. Moreover, in the video retrieval experiments, the proposed modified MV feature vector outperforms the original MV feature vector, indicating improved differentiation of similar and dissimilar videos by the proposed temporal feature vector.
1 Introduction
The visual contents of video, such as colour, texture, shape and motion, are used in content-based video analysis, indexing and retrieval. Colour, texture and shape are spatial features and are used in both image and video indexing and analysis applications [1], whereas motion is a temporal feature which is extracted from a sequence of frames in a video clip and is specific to video.
Spatial and temporal features can be extracted either in pixel
domain or in compressed domain. Compressed domain feature
vector extraction avoids the extra time and computational power
which is required for decompression of coded video or image.
There are several proposed methods in [2–6] for extraction of
spatial features in the compressed domain. Motion vectors (MVs)
can be directly extracted in the compressed domain from videos
coded by standard video codecs such as MPEG and H.26X
families.
Even though both temporal and spatial descriptors can be used
in video analysis and indexing, there are several proposed solutions
only based on temporal features for many applications such as
action detection, shot classification and shot boundary detection
[7–13]. Temporal features are also employed to obtain key frames
by dividing a shot into segments of equal cumulative motion
activity using MPEG-7 motion activity descriptor [1]. Moreover,
temporal features are used extensively in compressed domain video
for analysis and indexing applications [14–22]. In [7], motion
information of a video is modelled by a two-dimensional motion
histogram of MVs. The picture displacement in the horizontal and
vertical directions is quantised into 121 segments (60 segments for
positive, 60 for negative and one for zero). Totally, there are 121 ×
121 bins for the 2D motion histogram. MVs between consecutive
frames of MPEG-1 video stream are used to generate motion
histogram of P frames. The motion vector histogram (MVH) is
then normalised to the number of P frames in a shot. Shao et al.
[13] have presented an automated video analysis system which
addresses segmentation and detection of human actions in an
indoor environment. They used colour intensity and motion
information for action segmentation. They also described human
actions using motion and shape features for human action
recognition. Babu and Ramakrishnan [14] presented an objectbased video indexing and retrieval system using motion
IET Image Process., 2018, Vol. 12 Iss. 1, pp. 98-104
© The Institution of Engineering and Technology 2017
information obtained from compressed MPEG video. The main
contribution of their proposed system is the utility of the readily
available motion information of MPEG video for global and
object-based retrieval. Akrami and Zargari [15] introduced a
temporal feature in compressed domain for video indexing and
retrieval. Their proposed indexing method is based on the
histogram of positions of blocks which are used in motion
compensation. The method was implemented on H.264/AVC-coded video and outperformed MVH in video retrieval
experiments. Tom et al. [16] discussed an approach for human
action recognition in H.264/AVC compressed domain. Their
proposed algorithm utilises information from quantisation parameters and MVs extracted from the compressed video sequences for feature extraction, and classification is performed
by support vector machines. Biswas and Babu [17] proposed a
simple and effective approach to classify H.264 compressed video,
by capturing orientation information from the MVs. Fei and Zhu
[18] presented a mean shift clustering-based moving object
segmentation approach in the H.264 compressed domain. In their
approach, the motion information extracted from H.264
compressed video, including MVs and partitioned block size, are
used for moving object segmentation. A real-time video object
segmentation algorithm is proposed in [19] that works in H.264
compressed domain. The algorithm utilises the motion information
from the H.264 compressed bit stream to identify background
motion model and moving objects. Yu et al. [20] proposed a
pioneering motion estimation approach based on modelling of
motion imaging to stabilise the captured video from a fast-moving
car. Their method was based on MVs.
Due to rapid advancement in consumer electronics, there are
various digital video capturing devices with different specifications
which can generate video at various dimension ratios (DRs) and
resolutions. Variations in the DR and resolution of the captured or produced video make the employment of many of the aforementioned MV-based techniques difficult. This is because even the same video at different DRs or resolutions produces a different set of MVs; hence, the MVs cannot be considered representative of the temporal content of a video, as they are highly affected by spatial specifications such as DR and resolution. As shown in Fig. 1, the same motion in videos with different resolutions produces MVs of different lengths, and in videos with different DRs it produces MVs differing in both length and angle.
Fig. 1 Relation of MVs to different DRs and resolutions
(a1, b1, c1) First frames, (a2, b2, c2) Corresponding following frames
Fig. 2 Successive stages of the proposed method
This imposes limitations on the applications based on
MVs. For example, in [19] a number of constant values have been
defined as thresholds, such as γ in motion cost termination, and the
threshold values depend on video resolution, though it is not
expressed explicitly in the article. Therefore, for a video at a different resolution these values should be re-initialised; as a result, the reported experiments in [19] are performed on a single video resolution. Similarly, the 2D motion histogram constructed in [7] is based on the size of MVs in the positive and negative directions; hence, the histogram is sensitive to variations in resolution and DR, especially in the retrieval of similar videos at different resolutions and DRs.
Tasdemir et al. [22] addressed the problem of variation of MVs
for the same video in different resolutions. They extracted MVs at
lower frame rates using exhaustive search in pixel domain. Mean
of the magnitudes of MVs (MMMV) as well as the mean of the
angles of MVs (MPMV) for macro blocks of a frame are used as
feature vectors. They were faced with the problem of varying MVs
for similar video at different resolutions. To solve this problem,
they have normalised MMMV and MPMV values by the mean and
standard deviation of the features at the entire frames. Even though
IET Image Process., 2018, Vol. 12 Iss. 1, pp. 98-104
© The Institution of Engineering and Technology 2017
their method is robust against variations in resolution for a number
of video resolutions, it suffers from high computational load and
moreover it is offline, because it requires information about the
entire frames’ MVs.
In this paper, which is an improved and extended version of our
previous work [23], a method for making MVs robust against DR
and resolution variations is proposed. The proposed method is
tested on H.264/AVC-coded video but the same principles can be
used for the MVs derived from coded video by the other MV-based
video coding standards. The first stage of the proposed method
deals with variations in DR and in the second stage, the variations
in resolution are addressed.
Experimental results indicate that employing only the first stage
can improve the performance of identifying similar video with
various DRs by up to 28%, and the improvement achieved by both
stages increases the performance by up to 73%. Fig. 2 shows the successive stages of the proposed method: the MVs extracted by partial decoding are processed in two stages, for DR and resolution modification, and the modified MVs are used in the evaluations instead of the initially extracted MVs.
This paper is organised as follows: In Section 2, the proposed
method is described. Section 3 is dedicated to the evaluation of the
proposed method followed by concluding remarks in Section 4.
2 Proposed method
In this section, the proposed method for improving the robustness of MVs against variations in DR and resolution is presented, using the MV histogram intersection as a similarity measure between videos. Hence, a brief explanation of MV histograms is provided first, and then the proposed method, named resolution-dimension scaled motion vector histogram (R-D SMVH), is explained.
Motion estimation is commonly used in all standard video
coding families of H.26X and MPEG. MVs can be directly
extracted from the H.26X and MPEG coded video streams
including H.264/AVC. Each frame in H.264/AVC can be coded as
I, P or B. I-frames are intra coded, whereas macroblocks in P and B
frames can be inter coded. Motion estimation in P-frames is
unidirectional, whereas B frames can use bidirectional motion
estimation as well. The proposed two-stage method is described for
MV information derived from P frames. However, the method can
be easily employed for B-frames without any modification.
Assume MVm(mvx, mvy) is the MV of an inter-coded macroblock m in a P frame. The MV histogram is a two-dimensional histogram with Q × R bins, and MVm is assigned to bin (i, j), where i and j are derived as

    i = ⌊(mvx + biasx)/Kx⌋    if 0 ≤ mvx + biasx ≤ upperlimit_mvx
        Q                     if mvx + biasx > upperlimit_mvx          (1)
        0                     if mvx + biasx < 0

    j = ⌊(mvy + biasy)/Ky⌋    if 0 ≤ mvy + biasy ≤ upperlimit_mvy
        R                     if mvy + biasy > upperlimit_mvy          (2)
        0                     if mvy + biasy < 0

where biasx and biasy are used to up-shift the MV ranges to non-negative values. The bins are two dimensional, and Kx and Ky are the horizontal and vertical ranges of a bin.
The MV histogram for a P frame is defined as H(P), consisting of Q × R bins:

    H(P) = {H0,0(P), H0,1(P), …, H2,1(P), H2,2(P), …, HQ,R(P)}         (3)

For a P frame containing n macroblocks, each bin of the MV histogram is computed as

    Hq,r(P) = Σ (m = 1 to n) #Blocks(m) × δ(Bin(MVm) − (q, r))         (4)

where #Blocks(m) is the number of 4 × 4 pixel blocks in macroblock m, and δ is the Kronecker delta function, which is 1 when Bin(MVm), computed from (1) and (2), is equal to (q, r) and is zero otherwise. It is worth noting that two extra bins are added to the histogram for intra-coded and skipped macroblocks.
The MV histogram can be normalised as

    H̄(P) = H(P)/W                                                      (5)

where W is the total number of 4 × 4 blocks in a P frame. This normalisation is common in image and video histograms; e.g. colour histograms are usually divided by the image resolution to produce a histogram robust against image resolution. However, it is not enough to make the MV histogram robust against the DR and resolution of frames, since besides the number of MVs, the MV values themselves are highly affected by the DR and resolution of the coded frame.
In the following sub-sections, the proposed two-stage method, which makes the MV histograms more robust against the DR and resolution of coded frames, is explained.

2.1 MV modification based on DR

In the first stage, MVs are modified to be robust against DR. The MVs are scaled using the DRatio parameter, defined as

    DRatio = F_width / F_height                                        (6)

where F_width and F_height are the width and the height of the P frame, respectively. Each MV in P is scaled by DRatio as

    DScaledMVm(dsmvx, dsmvy) = (mvx / DRatio, mvy)     if DRatio > 1
                               (mvx, mvy × DRatio)     if DRatio < 1   (7)

Now the MV histogram is generated for the scaled MVs, hereafter referred to as the dimension scaled MV histogram (DSMVH). By scaling the MVs, the resultant scaled MVs correspond to a virtual P-frame with resolution (VP_width, VP_height), where VP_width and VP_height are defined as

    VP_width  = F_width / DRatio = F_height     if DRatio > 1
                F_width                         if DRatio < 1          (8)

    VP_height = F_height                        if DRatio > 1
                F_height × DRatio = F_width     if DRatio < 1          (9)

and the virtual P-frame resolution is

    (VP_width, VP_height) = (F_height, F_height)    if DRatio > 1
                            (F_width, F_width)      if DRatio < 1      (10)

Even though the resultant DSMVH is robust to DR, it may not produce satisfactory results when the resultant virtual frames do not have similar resolutions.
Fig. 3 provides a pictorial representation of the MV modification process. The original videos might differ in both resolution and DR. In the first step, the MVs of each video are scaled such that they correspond to a square virtual video whose side is equal to the minimum of the sides of the original video. In the second step, the resulting MVs of the higher-resolution virtual video are scaled once more to generate MVs corresponding to a virtual video of the same resolution as the lower-resolution virtual video. It is worth noting that at each step at most one multiplication or division is necessary for modifying each MV; thus, the computational overhead of the modifications is negligible.
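To make the histogram construction concrete, the bin assignment of (1) and (2) and the accumulation and normalisation of (4) and (5) can be sketched as follows. The bin parameters (Q, R, bias, K and the upper limit) are hypothetical placeholders, not values from the paper, and the two extra intra/skip bins are omitted for brevity.

```python
import numpy as np

def mv_bin(v, bias, k, upper, top):
    # eqs (1)-(2): shift the component by `bias`; out-of-range values are
    # clamped to the extreme bins 0 and `top` (i.e. Q or R)
    v = v + bias
    if v < 0:
        return 0
    if v > upper:
        return top
    return int(v // k)

def mv_histogram(mvs, n_blocks, W, Q=10, R=10, bias=64, k=13, upper=128):
    """Normalised 2-D MV histogram of one P frame, eqs (4)-(5).
    mvs: list of (mvx, mvy) per inter-coded macroblock; n_blocks[m]: number
    of 4x4 blocks in macroblock m; W: total number of 4x4 blocks in the
    frame (the normaliser of eq. (5))."""
    H = np.zeros((Q + 1, R + 1))
    for (mvx, mvy), b in zip(mvs, n_blocks):
        # each MV contributes the 4x4-block count of its macroblock, eq. (4)
        H[mv_bin(mvx, bias, k, upper, Q), mv_bin(mvy, bias, k, upper, R)] += b
    return H / W
```

With these placeholder parameters, two 16-block macroblocks in a 32-block frame each contribute 0.5 to their bins, so the histogram sums to 1.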
Fig. 3 Original and virtual videos after two-stage modifications
2.2 MV modification based on resolution

By employing the second stage, robustness of MVs against resolution is achieved. As shown in Fig. 3, in the modification of MVs for videos with different resolutions, the MVs of the higher-resolution video are down-scaled. Assume XA × XA and XB × XB are the resolutions of the two videos A and B. The resolution scaled MVs for each frame of A are derived as

    R-DScaledMVm(rdsmvx, rdsmvy) = (dsmvx, dsmvy)                         if XA < XB
                                   (dsmvx × XB/XA, dsmvy × XB/XA)         if XA > XB   (11)
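The two scaling stages, (7) in the first stage and (11) in the second, each amount to at most one multiplication or division per MV component. A minimal sketch (the function names are ours, not from the paper):

```python
def d_scale(mvx, mvy, f_width, f_height):
    # Stage 1, eqs (6)-(7): scale the longer-axis component by DRatio so the
    # MVs correspond to a square virtual frame, eq. (10)
    dratio = f_width / f_height
    if dratio > 1:
        return mvx / dratio, mvy      # landscape: shrink horizontal MVs
    if dratio < 1:
        return mvx, mvy * dratio      # portrait: shrink vertical MVs
    return mvx, mvy                   # already square

def rd_scale(dsmvx, dsmvy, xa, xb):
    # Stage 2, eq. (11): down-scale the MVs of the higher-resolution virtual
    # video (side xa) to match the lower-resolution one (side xb)
    if xa > xb:
        return dsmvx * xb / xa, dsmvy * xb / xa
    return dsmvx, dsmvy
```

For example, a (32, 0) MV in a 1280 × 720 frame is first mapped to approximately (18, 0) for the 720 × 720 virtual frame, and then to approximately (6, 0) when compared against a 240 × 240 virtual video.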
The resulting MV histogram is called R-D SMVH. The next step is histogram normalisation, performed as

    H̄α,β(P) = Hα,β(P)/W                                                   (12)

Histograms can be generated for a part of a video, such as a number of groups of pictures or a shot. If there are F P-frames in the given part of the video, each bin of the final MV histogram is constructed as

    H̄α,β(final) = (1/F) Σ (j = 1 to F) H̄α,β(Pj)                           (13)
Similarity between two videos is computed as the sum of the intersections between the corresponding bins of the given final histograms:

    S = Σ (i = 1 to Q) Σ (j = 1 to R) (H̄i,j(1) ∩ H̄i,j(2))
      = Σ (i = 1 to Q) Σ (j = 1 to R) min(H̄i,j(1), H̄i,j(2))              (14)

Fig. 4 MVH, DSMVH and R-D SMVH histograms for the Kimono video in two different DRs and resolutions
(a1) MVH for 320 × 240, (a2) DSMVH for 320 × 240 ≃ MVH for 240 × 240, (a3) R-D SMVH for 320 × 240 ≃ MVH for 240 × 240, (b1) MVH for 1280 × 720, (b2) DSMVH for 1280 × 720 ≃ MVH for 720 × 720, (b3) R-D SMVH for 1280 × 720 ≃ MVH for 240 × 240
Since the histograms are normalised, S is a number in the range [0, 1], and a higher S value implies higher similarity between the two videos. Thus, in the comparison of a video at two different resolutions or DRs, the feature vector which yields the higher similarity S has the superior performance.
In order to provide a pictorial representation of the operation of the proposed method, a sample of the histograms produced by the proposed method is shown in Fig. 4.
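The final-histogram averaging of (13) and the intersection similarity of (14) can be sketched as follows; for normalised histograms the intersection lies in [0, 1].

```python
import numpy as np

def final_histogram(frame_hists):
    # eq. (13): average the normalised per-frame histograms of the F
    # P-frames in the given part of the video
    return np.mean(frame_hists, axis=0)

def similarity(h1, h2):
    # eq. (14): histogram intersection, i.e. the sum of bin-wise minima;
    # identical normalised histograms give S = 1
    return np.minimum(h1, h2).sum()
```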
3 Experimental results
Two groups of experiments were conducted to evaluate the
performance of the proposed method. At first, the effectiveness of
the proposed method on detecting the same video at different
resolutions and DRs was evaluated. In these experiments, down
sampled or up sampled versions of the original videos were
employed. In the second group of experiments, a video retrieval
application was conducted to evaluate the performance of the
proposed modified feature vectors in discriminating similar videos
from dissimilar ones. Finally, significance analysis and execution-time analysis are included to cover other aspects that may be important in implementing the proposed method. Moreover, in order to facilitate access to the test videos used in the experiments, they are made available to interested readers at [24].
Fig. 5 Resolutions of original videos (a1, b1) and virtual videos (a2, b2)
(a1) First frame of original video 1, 720 × 480 (DR = 3:2), (a2) First frame of the corresponding virtual video 1, 480 × 480, (b1) First frame of original video 2, 480 × 640 (DR = 3:4), (b2) First frame of the corresponding virtual video 2, 480 × 480
Table 1 Used test sequences
Class      Resolution, pixels   Video              Spatial complexity   Temporal complexity
A          2560 × 1600          people-on-street   high                 high
B          1920 × 1080          Kimono1            high                 high
                                ParkScene          low                  medium
C          832 × 480            basketball-drill   high                 low
                                BQMall             high                 medium
                                PartyScene         low                  low
D          416 × 240            basketballpass     medium               low
                                blowing bubbles    low                  high
                                Mobisode2          low                  low
                                RaceHorses         low                  high
E          1280 × 720           FourPeople         low                  high
                                mobilecalender     low                  low
ultra HD   4096 × 2304          honey bees         low                  low
                                PuppiesBath        high                 high
Table 2 Resolution of videos in the conducted tests
Test number   Resolution of video 1, pixels   Resolution of video 2, pixels
1             320 × 240                       1280 × 720
2             480 × 640                       1280 × 720
3             360 × 360                       1280 × 720
4             240 × 320                       640 × 360
5             480 × 640                       3840 × 2160
Fig. 4 shows the histograms of the original MVs and the modified MVs for the 320 × 240 and 1280 × 720 pixel versions of the Kimono video. Figs. 4a1 and b1 show the original histograms, Figs. 4a2 and b2 show the modified DSMVH histograms, and Figs. 4a3 and b3 show the modified R-D SMVH histograms. It can be seen in Fig. 4 that at each step of the MV modification the resulting MV histograms for the two versions of the video become more similar, both objectively and subjectively.
Moreover, Fig. 5 shows the relation between two videos at different resolutions and their corresponding virtual videos. The first column of Fig. 5 depicts the two original videos and the second column shows the corresponding virtual videos after the modification of the MVs of the original videos.
3.1 Similarity of videos in different DRs and resolutions

The test sequences include 12 standard video sequences from classes A to E of the video sequences proposed by the JCT-VC group, along with two ultra-HD video sequences [25]. A number of specifications of these classes are tabulated in Table 1. It is worth noting that the spatial and temporal complexities are reported based on the analysis of the 20 frames of each video sequence used in the experiments.
In the experiments, new copies of the test videos at different resolutions and DRs were first constructed. The new videos were then coded by the H.264/AVC reference software JM-18.0 [26]. The group of pictures (GOP), which is the number of frames from one I-frame to the next, was set to ten, comprising one I frame and nine P frames. The videos were coded and the MV information for the first two GOPs of each coded video was extracted. Motion estimation was performed using the fast full search option of JM-18.0, which employs an extended diamond search algorithm. Consequently, MVH, DSMVH and R-D SMVH histograms were generated from the MVs using the methods described in this article. Five tests were conducted, comparing copies of each video at different DRs and resolutions.
Table 2 shows the resolutions of the compared videos in each test. According to Table 2, in this experiment the ultra-HD videos are only down-sampled, but the other test sequences might be either up-sampled or down-sampled according to the required resolutions of the corresponding test condition. In fact, in each test two copies of the same video were used and the similarity between them was computed using the MVH, DSMVH and R-D SMVH methods.
Table 3 shows the resulting similarities for the different videos under the five test conditions. Since the same video is used in each comparison, higher similarity values in Table 3 imply better performance of the temporal feature. The experimental results in Table 3 indicate that MVH for the Kimono video sequence at different DRs and resolutions may yield an average similarity value as low as 0.53. This increases to 0.68 on average with DSMVH, whereas with R-D SMVH the average of the derived similarities over the entire set of tested cases is higher than 0.92, corresponding to a 73% improvement of R-D SMVH with respect to MVH. The average similarity and standard deviation over all test videos are shown in the last row of Table 3. The standard deviation for R-D SMVH is on average 0.13, which is lower than for the other feature vectors, showing that a persistent and reliable improvement is achieved by this feature vector. Hence, a steady improvement in performance can be achieved by employing the successive stages of the proposed modification of the MVs.
However, Table 3 indicates only a small improvement in similarity for a few of the tested videos, such as 'Mobilecalender' and 'Mobisode2'. This is because in these video sequences the percentage of non-zero MVs is very low. To justify this claim, the ratios of blocks that are coded as intra, skipped, and with non-zero MVs in each video sequence are tabulated in Table 4. The results in Table 4 indicate that the percentage of blocks with non-zero MVs
in the aforementioned video sequences is less than 20%, and hence the similarity between the histograms is mainly determined by the number of blocks coded as skip or intra. As a result, scaling the MVs has little impact on improving the performance of the similarity metric, which is due to the few non-zero MVs, or low temporal complexity, in these coded videos.

Table 3 Average and standard deviation of the performance of MVH, DSMVH and R-D SMVH under the different test conditions
Video             MVH avg   MVH std   DSMVH avg   DSMVH std   R-D SMVH avg   R-D SMVH std
BasketballDrill   0.8026    0.0837    0.8215      0.0942      0.8688         0.1133
BasketballPass    0.8236    0.1348    0.8473      0.1468      0.8821         0.1487
blowing bubbles   0.6722    0.1854    0.7605      0.2332      0.8440         0.1627
BQMall            0.7881    0.1211    0.8319      0.1477      0.8893         0.1493
FourPeople        0.9388    0.0458    0.9395      0.0453      0.9466         0.0411
Kimono            0.5392    0.1300    0.6847      0.1163      0.9253         0.0221
mobilecalender    0.8107    0.2375    0.8115      0.2359      0.8132         0.2329
Mobisode2         0.9438    0.0131    0.9513      0.0086      0.9654         0.0081
ParkScene         0.8433    0.0386    0.8761      0.0450      0.9031         0.0471
PartyScene        0.7734    0.1213    0.7875      0.1071      0.8018         0.1059
PeopleOnStreet    0.7760    0.1157    0.7876      0.1202      0.8541         0.0248
RaceHorses        0.4156    0.0794    0.4979      0.1403      0.7831         0.2074
honey bees        0.6920    0.1091    0.6750      0.0876      0.8124         0.0773
PuppiesBath       0.8568    0.0936    0.8989      0.0734      0.9349         0.0243
all video tests   0.7587    0.1837    0.7939      0.1724      0.8690         0.1345

Table 4 Motion compensation statistics of the test video sequences (all coded at 1280 × 720 pixels)
Video             Class      Intra, %   Skipped, %   Non-zero MVs, %   Class avg intra, %   Class avg skipped, %   Class avg non-zero MVs, %
PeopleOnStreet    A          5.5        29.6         64.9              5.5                  29.6                   64.9
Kimono            B          7.3        8.2          84.5              6.45                 25.55                  68
ParkScene         B          5.6        42.9         51.5
BasketballDrill   C          9.2        63.4         27.4              7.4                  57.66                  34.93
BQMall            C          5.5        68.4         26.1
PartyScene        C          7.5        41.2         51.3
blowing bubbles   D          6.6        22.03        71.37             7.55                 48.60                  43.84
BasketballPass    D          5.8        75.7         18.5
Mobisode2         D          2.3        92.2         5.5
RaceHorses        D          15.5       4.5          80
Mobilecalender    E          8.9        73.2         17.9              6                    79.8                   14.2
FourPeople        E          3.1        86.4         10.5
honey bees        ultra HD   8.2        30.1         61.7              8.85                 34.25                  56.9
PuppiesBath       ultra HD   9.5        38.4         52.1

3.2 Video retrieval using MVH, DSMVH and R-D SMVH

In this section, a video retrieval experiment is conducted to evaluate the performance of the proposed temporal features in differentiating similar videos from dissimilar ones. In this experiment, the UCF Sports Action database [27] is employed, which consists of 115 videos of 720 × 404 pixels covering 13 different actions. The queries are resized videos, from 240 × 320 pixels up to 3840 × 2160 pixels, from six different actions. The ANMRR [28] metric is used to measure the retrieval performance; ANMRR is a number in the range [0, 1], and a higher value implies lower retrieval performance. The experimental results tabulated in Table 5 show that R-D SMVH outperforms the original MVs in all of the conducted retrieval experiments. This indicates the superior performance of R-D SMVH not only in detecting similar videos but also in differentiating similar videos from dissimilar ones.

Table 5 ANMRR values for retrieval with queries of different sizes using MVH, DSMVH and R-D SMVH
Resolution of query, pixels   MVH    DSMVH   R-D SMVH
3840 × 2160                   0.57   0.39    0.49
1920 × 1080                   0.56   0.4     0.39
1280 × 720                    0.47   0.45    0.43
640 × 480                     0.46   0.44    0.43
480 × 640                     0.44   0.43    0.35
360 × 360                     0.44   0.4     0.4
320 × 240                     0.51   0.47    0.38
240 × 320                     0.44   0.48    0.36
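The ANMRR metric itself is not defined in this paper; the following sketch follows the common MPEG-7 formulation (ground-truth items not retrieved in the top K are penalised with rank 1.25K), which should be checked against [28] before use.

```python
def nmrr(ranks, ng, gtm):
    """NMRR for one query: `ranks` are the 1-based retrieval ranks of the
    NG ground-truth items, `gtm` is the largest NG over all queries."""
    k = min(4 * ng, 2 * gtm)                       # cut-off window K
    penalised = [r if r <= k else 1.25 * k for r in ranks]
    avr = sum(penalised) / ng                      # average rank
    mrr = avr - 0.5 - ng / 2                       # modified retrieval rank
    return mrr / (1.25 * k + 0.5 - 0.5 * ng)       # normalised to ~[0, 1]

def anmrr(queries, gtm):
    # queries: list of (ranks, ng) pairs; lower ANMRR means better retrieval
    return sum(nmrr(r, ng, gtm) for r, ng in queries) / len(queries)
```

A perfect retrieval (all ground-truth items at the top ranks) yields NMRR = 0, and missing every ground-truth item drives NMRR towards 1.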
3.3 Significance analysis of the proposed method

Significance analysis is performed on the experimental results in Tables 3 and 5 using the paired-samples T-test, to provide a comparative analysis of the MVH, DSMVH and R-D SMVH methods. Table 6 tabulates the resulting T value and its corresponding p value for each pair of methods at the significance level of 0.05. According to Table 6, DSMVH is significant compared with MVH only in the similarity calculation, but R-D SMVH is significant against MVH in both the similarity calculation and video retrieval.
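The paired-samples T statistic used above can be computed directly from the per-video score differences (the p value is then read from a t distribution with n − 1 degrees of freedom); the input lists below are hypothetical, not the values behind Table 6.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired-samples T statistic for two matched score lists:
    T = mean(d) / (stdev(d) / sqrt(n)) for differences d = a - b."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))
```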
Table 6 T-test results for the similarity measure and video retrieval experiments
                        Similarity measure     Video retrieval
Compared methods        T-value    p           T-value    p
DSMVH and MVH           3.042      0.0094      −1.996     0.086
R-D SMVH and MVH        3.403      0.0047      −4.828     0.0019

3.4 Time complexity analysis

In order to perform the time complexity analysis, the decoding execution time of 20 frames of three 3840 × 2160 H.264-coded videos was measured on the same PC for MVH, DSMVH and R-D SMVH. Table 7 reports the total decoding time for each method. The results in Table 7 indicate that the R-D SMVH decoding time is only about 4% higher than that of MVH, i.e. the proposed method imposes a negligible computational overhead on the common MVH method.

Table 7 Average decoding time of 3840 × 2160 videos using MVH, DSMVH and R-D SMVH
Method      Average decoding time, s
MVH         2.33
DSMVH       2.398
R-D SMVH    2.426
4 Conclusion
Variation in spatial dimensions is among the challenging issues in video indexing and retrieval based on MVs. This paper presented a two-stage modification of the MV information extracted from the compressed domain of H.264/AVC-coded videos, named the resolution-dimension scaled MV histogram (R-D SMVH). It was shown that this method makes the MV histograms robust against variations in the DR and resolution of the video frames. The proposed method is based on selective scaling of the MVs according to the DR and resolution of the video frames. Experimental results showed that the first stage of the proposed method improved the detection of similar videos with different DRs by up to 28% compared with MVH, and applying both stages increases the improvement to up to 73%. Moreover, in the video retrieval experiments (based on MVs), the proposed R-D SMVH outperforms the original MVs, which indicates its superior performance in differentiating between similar and dissimilar videos. As a result, R-D SMVH can be used more effectively than MVH in applications based on measuring video similarity, such as video retrieval and action detection, over a wide range of videos with various DRs and resolutions.
5 References

[1] Wiegand, T., Sullivan, G.J., Bjøntegaard, G., et al.: 'Overview of the H.264/AVC video coding standard', IEEE Trans. Circuits Syst. Video Technol., 2003, 13, (7), pp. 560–576
[2] Rahmani, F., Zargari, F.: 'Compressed domain visual information retrieval based on I-frames in HEVC', Multimedia Tools Appl., 2016, 75, (10), pp. 1–18
[3] Zargari, F., Mehrabi, M., Ghanbari, M.: 'Compressed domain texture based visual information retrieval method for I-frame coded frames', IEEE Trans. Consum. Electron., 2010, 56, (2), pp. 728–736
[4] Mehrabi, M., Zargari, F., Ghanbari, M.: 'Fast and low complexity method for content accessing and extracting DC-frames from H.264 coded videos', IEEE Trans. Consum. Electron., 2010, 56, (3), pp. 1801–1808
[5] Zargari, F., Rahmani, F.: 'Visual information retrieval in HEVC compressed domain'. Proc. 23rd Iranian Conf. Electrical Engineering, Tehran, Iran, May 2015, pp. 793–798
[6] Mehrabi, M., Zargari, F., Ghanbari, M., et al.: 'Fast content access and retrieval of JPEG compressed images', Signal Process. Image Commun., 2016, 46, pp. 54–59
[7] Chen, L.H., Chin, K.H., Liao, H.Y.: 'An integrated approach to video retrieval', Proc. Nineteenth Conf. Australasian Database, Australia, 2008, 75, pp. 49–55
[8] Ciptadi, A., Goodwin, M.S., Rehg, J.M.: 'Movement pattern histogram for action recognition and retrieval', in Fleet, D., Pajdla, T., Schiele, B., et al. (Eds.): 'Computer vision – ECCV 2014' (Springer International Publishing, 2014), pp. 695–710
[9] Koumaras, H., Gardikis, G., Xilouris, G., et al.: 'Shot boundary detection without threshold parameters', J. Electron. Imag., 2006, 15, (2), pp. 1–3
[10] Zhao, Z.C., Cai, A.N.: 'Shot boundary detection algorithm in compressed domain based on adaboost and fuzzy theory', in Jiao, L., Wang, L., Gao, X., et al. (Eds.): 'Advances in natural computation' (Springer Berlin Heidelberg, 2006), pp. 617–626
[11] Rossetto, L., Giangreco, I., Schuldt, H., et al.: 'Imotion – a content-based video retrieval engine', in He, X., Luo, S., Tao, D., et al. (Eds.): 'Multimedia modeling' (Springer International Publishing, 2015), pp. 255–260
[12] Chen, L.H., Chin, K.H., Liao, H.Y.: 'An integrated approach to video retrieval'. Proc. Nineteenth Conf. Australasian Database, Gold Coast, Australia, December 2007, pp. 49–55
[13] Shao, L., Ji, L., Liu, Y., et al.: 'Human action segmentation and recognition via motion and shape analysis', Pattern Recognit. Lett., 2012, 33, (4), pp. 438–445
[14] Babu, R.V., Ramakrishnan, K.R.: 'Compressed domain video retrieval using object and global motion descriptors', Multimedia Tools Appl., 2007, 32, (1), pp. 93–113
[15] Akrami, F., Zargari, F.: 'An efficient compressed domain video indexing method', Multimedia Tools Appl., 2014, 72, (1), pp. 705–721
[16] Tom, M., Babu, R.V., Praveen, R.G.: 'Compressed domain human action recognition in H.264/AVC video streams', Multimedia Tools Appl., 2015, 74, (21), pp. 9323–9338
[17] Biswas, S., Babu, R.V.: 'H.264 compressed video classification using histogram of oriented motion vectors (HOMV)'. Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vancouver, Canada, May 2013, pp. 2040–2044
[18] Fei, L., Zhu, S.: 'Mean shift clustering-based moving object segmentation in the H.264 compressed domain', IET Image Process., 2010, 4, (1), pp. 11–18
[19] Mak, C.M., Cham, W.K.: 'Real-time video object segmentation in H.264 compressed domain', IET Image Process., 2009, 3, (5), pp. 272–285
[20] Yu, J., Xiang, K., Wang, X., et al.: 'Video stabilisation based on modelling of motion imaging', IET Image Process., 2016, 10, (3), pp. 177–188
[21] Bruyne, S., Deursen, D., Cock, J., et al.: 'A compressed-domain approach for shot boundary detection on H.264/AVC bit streams', Signal Process. Image Commun., 2008, 23, (8), pp. 473–489
[22] Tasdemir, K., Cetin, A.E.: 'Content-based video copy detection based on motion vectors estimated using a lower frame rate', Signal Image Video Process., 2014, 8, (6), pp. 1049–1057
[23] Zargari, F., Rahmani, F.: 'A temporal feature vector which is robust against aspect ratio variations'. Proc. Eighth Int. Symp. Telecommunications, Tehran, Iran, September 2016
[24] 'The entire test videos'. Available at http://jmp.sh/aLxR8kb, accessed 2 September 2017
[25] 'Free downloadable 4K sample content'. Available at http://4ksamples.com/, accessed 1 August 2017
[26] 'H.264/AVC software coordination'. Available at http://iphome.hhi.de/suehring/tml/, accessed 1 August 2017
[27] Soomro, K., Zamir, A.R.: 'Action recognition in realistic sports videos', in 'Computer vision in sports' (Springer International Publishing, 2014)
[28] Zhu, M.: 'Recall, precision, and average precision', Technical Report, 9, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada, 2004