Video Summarization Based on Mutual Information and Entropy Sliding Window Method
Figure 1. Entropy $H[F]$ measures the degree of change in gesture movements; the correlation entropy $H[F^t, F^{t+1}]$ indicates the similarity of two frames. (a) The mathematical relationship between information entropy and mutual information: $F^t$ denotes a frame at time $t$, $H[F^t]$ represents the information entropy of $F^t$, and $I[F^t, F^{t+1}]$ represents the mutual information of the two consecutive frames at times $t$ and $t+1$. (b) Two gray-scale gesture images. (c) Their gray-level distribution histograms, which can be used to calculate the entropy of each image. In (d), the horizontal axis represents the gray level and the vertical axis shows the number of pixels. The joint histogram counts the frequencies at which different gray-value combinations appear at corresponding positions in the two images. The shapes of the two histograms are similar, indicating that the probability distributions of their pixel gray values are approximately the same.
Figure 2. Flow chart of the Mutual Information and Entropy based adaptive Sliding Window (MIESW) algorithm for extracting key frames.
Figure 3. Schematic of the improved sliding window.
Figure 4. The framework of Algorithms 1 and 2.
Figure 5. Frames in the video of a gesture for the word 'Clic'.
Figure 6. Schematic diagram of key frame extraction for a hand gesture sequence from Marcel-IDIAP2001, which contains 99 frames. The plotted curve (data1) is the mutual information value of successive frame pairs in the video sequence, ranging from 2 to 3.5. Blocks of different colors indicate the grouping results after applying Algorithms 2 and 3, and the horizontal line in each color block indicates the mean value of that group. The key frames obtained by our method, marked with red boxes, are frames 17, 33, 72, and 98.
Figure 7. SURF (Speeded Up Robust Features) analysis.
Figure 8. The key frame extraction result for the test video.
Figure 9. Precision, Recall, and F-measure of different techniques on the test videos. The purple bars, yellow bars, and blue line represent Precision, Recall, and F-measure, respectively. The horizontal axis represents the four test videos, and the vertical axis the PRF evaluation metrics of the key frame extraction results. On the same test videos, our proposed algorithm obtains higher Precision, Recall, and F-measure. For detailed experimental results, see Table 2.
Figure 10. Key frame extraction applied to video S2, which contains 141 frames and consists of four gestures: the gesture starts with a fist, then the fingers open, the fist clenches again, and finally the palm opens. The first three comparison algorithms miss frames; in particular, the third gesture is lost, which is a serious flaw in key frame extraction. The fourth algorithm (our method without optimization) does extract the third gesture, but retains many redundant frames. The results of our proposed method show that the series of gesture changes is extracted more completely.
Abstract
1. Introduction
2. Related Work
3. Key Frames Extraction and Feature Fusion Principle
3.1. Entropy and Mutual Information Theory
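To make the quantities in this section concrete, the following is a minimal sketch (ours, not the paper's implementation) of computing the entropy of a frame and the mutual information of a frame pair from gray-level histograms, assuming 8-bit gray-scale frames stored as NumPy arrays:

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy H[F] of a gray-scale image from its histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins to avoid log(0)
    return -np.sum(p * np.log2(p))

def mutual_information(img1, img2, bins=256):
    """Mutual information I[F^t, F^{t+1}] from the joint gray-level histogram."""
    joint, _, _ = np.histogram2d(img1.ravel(), img2.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1)  # marginal distribution of frame t
    py = pxy.sum(axis=0)  # marginal distribution of frame t+1
    hx = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log2(py[py > 0]))
    hxy = -np.sum(pxy[pxy > 0] * np.log2(pxy[pxy > 0]))
    return hx + hy - hxy
```

The last line uses the identity $I[F^t, F^{t+1}] = H[F^t] + H[F^{t+1}] - H[F^t, F^{t+1}]$, which is the relationship illustrated in Figure 1a.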
- Input video: Read in the video from which key frames are to be extracted.
- Conversion part: Convert the video into a frame sequence, count the length of the sequence, and resize each frame to a uniform size.
- Key frame extraction: First, an adaptive sliding window is used to process the frame sequence: the mutual information value is calculated for the group of frames entering the window, the window size is adaptively adjusted according to the difference between frames, and similar frames are classified into one group. Then, according to the entropy values of the frames, groups with similar content are further merged. Finally, in each group, the frame closest to the group's average mutual information value is selected for the key frame candidate sequence. (The full pipeline is sketched in code after this list.)
- Eliminate redundancy: Use SURF to extract the features of the frames in the candidate sequence, and eliminate the redundant frames with high similarity.
- Output key frames: Output the final key frame extraction results. It should be noted that during the key frame extraction process, the frame sequence retains time sequential information.
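A skeleton of these four stages might look as follows. This is a hedged sketch, not the authors' code: it assumes the helper functions `sliding_window_grouping`, `merge_and_select`, and `remove_redundant` sketched in Sections 3.3 and 3.4 below, and the resize target is illustrative, since the exact frame size is not stated here.

```python
import cv2

def extract_key_frames(video_path, size=(320, 240)):
    # Stages 1-2: decode the video into a resized gray-scale frame sequence.
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2GRAY))
    cap.release()
    # Stage 3: MIESW grouping and key frame candidate selection.
    groups = sliding_window_grouping(frames)       # Algorithm 1 (Section 3.3)
    candidates = merge_and_select(frames, groups)  # Algorithm 2 (Section 3.3)
    # Stage 4: SURF-based redundancy elimination; frame order is preserved
    # throughout, so temporal information is retained.
    return remove_redundant(frames, candidates)
```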
3.2. The Proposed Mutual Information and Entropy Based Sliding Window Method (MIESW) for Key Frames Extraction
3.3. Improved Adaptive Sliding Window Method to Extract Key Frames
- step 1. Pre-define the initial length of the sliding window as $w$ and set the threshold $T$. Start the algorithm from the first frame.
- step 2. Check whether the last frame in the sliding window is the final frame $F^N$ of the sequence. If it is, denote the current group as $G_k$, terminate the algorithm, and output the result. Otherwise, initialize the current window size $w_c = w$ and go to step 3.
- step 3. The window slides over the sequence, and the frames entering the sliding window are denoted as $G_k = \{F^i, F^{i+1}, \ldots, F^{i+w_c-1}\}$, satisfying $i + w_c - 1 \le N$. Calculate the mutual information value of each pair of consecutive frames in $G_k$, recorded as $\{M_1, M_2, \ldots, M_{w_c-1}\}$, and then calculate the mean value $\bar{M} = \frac{1}{w_c-1}\sum_{j=1}^{w_c-1} M_j$.
- step 4. Add the frame next to the current window's right boundary into the group to obtain $G_k^{\prime}$ with mean mutual information $\bar{M}^{\prime}$, and calculate the absolute difference $D = |\bar{M}^{\prime} - \bar{M}|$. If $D > T$, the newly added frame reduces the overall correlation of the original group, so finalize the group as $G_k$ (without the new frame), set $k = k + 1$, and go to step 2. Otherwise, go to step 5.
- step 5. If $D \le T$, the newly added frame is correlated with the original group. If the newly added frame is not the final frame $F^N$, set $w_c = w_c + 1$ and go to step 4; if it is, go to step 2. (A code sketch of steps 1-5 follows.)
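Under the assumption that `mutual_information` is the histogram-based sketch from Section 3.1, the adaptive window procedure above could be sketched as follows; the default values of `w` and `T` are illustrative, not the paper's settings.

```python
def sliding_window_grouping(frames, w=3, T=0.1):
    """Adaptive sliding-window grouping (steps 1-5 above)."""
    n = len(frames)
    groups = []
    start = 0
    while start < n:
        end = min(start + w, n)          # window covers frames [start, end)
        if end - start < 2:              # lone trailing frame: its own group
            groups.append(list(range(start, n)))
            break

        def mean_mi(hi):
            """Mean mutual information of consecutive pairs in [start, hi)."""
            vals = [mutual_information(frames[i], frames[i + 1])
                    for i in range(start, hi - 1)]
            return sum(vals) / len(vals)

        m_bar = mean_mi(end)
        while end < n:                   # steps 4-5: try to grow the window
            m_new = mean_mi(end + 1)
            if abs(m_new - m_bar) > T:   # new frame breaks group correlation
                break
            m_bar, end = m_new, end + 1  # keep the frame, grow the window
        groups.append(list(range(start, end)))
        start = end                      # the next group begins here
    return groups
```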
- step 1. Take the groups in the sequence, $\{G_1, G_2, \ldots, G_n\}$, as input.
- step 2. Calculate the standard deviation of the entropy values of the frames in each group as $\sigma_i$, and obtain the average standard deviation as the threshold $T_\sigma = \frac{1}{n}\sum_{i=1}^{n}\sigma_i$.
- step 3. In the second grouping, adjacent groups whose entropy statistics differ by less than the threshold are merged: for each adjacent pair $G_i$ and $G_{i+1}$, if $|\sigma_{i+1} - \sigma_i| < T_\sigma$, merge $G_{i+1}$ into $G_i$; if not, keep the groups unchanged.
- step 4. After this process, the final groups are denoted as $\{G_1^{\prime}, G_2^{\prime}, \ldots, G_m^{\prime}\}$. In each group, the frame closest to the group's average mutual information value is selected as the key frame: $KF_j = \arg\min_{F^t \in G_j^{\prime}} |M_t - \bar{M}_{G_j^{\prime}}|$, where $M_t$ is the mutual information of $F^t$ and its successor. (A code sketch of this merging and selection follows.)
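A sketch under the same assumptions, reusing `entropy` and `mutual_information` from Section 3.1; note that the merge test in step 3 follows our reading of the condition above, which the extracted text leaves ambiguous.

```python
import numpy as np

def merge_and_select(frames, groups):
    """Second grouping and key frame selection (steps 1-4 above)."""
    # step 2: standard deviation of frame entropies within each group,
    # with the average standard deviation as the merge threshold.
    sigmas = [np.std([entropy(frames[i]) for i in g]) for g in groups]
    t_sigma = float(np.mean(sigmas))
    # step 3: merge adjacent groups whose entropy spreads are similar.
    merged = [list(groups[0])]
    for g, s_prev, s in zip(groups[1:], sigmas, sigmas[1:]):
        if abs(s - s_prev) < t_sigma:
            merged[-1].extend(g)
        else:
            merged.append(list(g))
    # step 4: in each final group, pick the frame whose mutual information
    # with its successor is closest to the group's mean value.
    key_frames = []
    for g in merged:
        if len(g) < 2:
            key_frames.append(g[0])
            continue
        mi = [mutual_information(frames[i], frames[i + 1]) for i in g[:-1]]
        mean_mi = sum(mi) / len(mi)
        best = min(range(len(mi)), key=lambda k: abs(mi[k] - mean_mi))
        key_frames.append(g[best])
    return key_frames
```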
3.4. Remove Redundant Frames
4. Experiment
4.1. Improved Precision, Recall, and F-Measure Criteria
4.2. Experimental Results
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
Table 1. SURF matching results between adjacent candidate key frames.

| Compared Images | Best Matching Value | Worst Matching Value | SURF Similarity |
|---|---|---|---|
| (14, 33) | 0.0414 | 0.5746 | 0.3469 |
| (33, 72) | 0.0313 | 0.5511 | 0.3864 |
| (72, 98) | 0.0187 | 0.6372 | 0.5882 |
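For illustration, here is one way the SURF similarity in Table 1 could be computed and used to drop redundant candidates. This is a sketch, assuming opencv-contrib-python (SURF lives in `cv2.xfeatures2d` and may be unavailable in builds that exclude patented algorithms); the threshold is chosen by eye against the value range in Table 1, not taken from the paper.

```python
import cv2

def surf_mean_distance(img1, img2, hessian=400):
    """Mean L2 distance between matched SURF descriptors of two frames
    (smaller = more similar); our stand-in for Table 1's 'SURF Similarity'."""
    surf = cv2.xfeatures2d.SURF_create(hessian)
    _, d1 = surf.detectAndCompute(img1, None)
    _, d2 = surf.detectAndCompute(img2, None)
    if d1 is None or d2 is None:
        return 1.0  # no features detected: treat the pair as dissimilar
    matches = cv2.BFMatcher(cv2.NORM_L2).match(d1, d2)
    if not matches:
        return 1.0
    return sum(m.distance for m in matches) / len(matches)

def remove_redundant(frames, candidates, threshold=0.3):
    """Keep a candidate only if it differs enough from the last kept frame."""
    kept = [candidates[0]]
    for idx in candidates[1:]:
        if surf_mean_distance(frames[kept[-1]], frames[idx]) > threshold:
            kept.append(idx)
    return kept
```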
Table 2. Precision (P), Recall (R), and F-measure (F) of key frame extraction on the test videos. $N_{GT}$ is the number of ground-truth key frames, $N_E$ the number of extracted frames, and $N_C$ the number of correctly extracted frames.

| Method | Video Name | $N_{GT}$ | $N_E$ | $N_C$ | P | R | F |
|---|---|---|---|---|---|---|---|
| entropy-based | c1 | 5 | 6 | 3 | 0.5000 | 0.6000 | 0.5455 |
| entropy-based | c2 | 5 | 2 | 1 | 0.5000 | 0.2000 | 0.2857 |
| entropy-based | s1 | 7 | 4 | 3 | 0.7500 | 0.4286 | 0.5455 |
| entropy-based | s2 | 7 | 9 | 4 | 0.4444 | 0.5714 | 0.5000 |
| color-based | c1 | 5 | 6 | 3 | 0.5000 | 0.6000 | 0.5455 |
| color-based | c2 | 5 | 10 | 4 | 0.4000 | 0.8000 | 0.5333 |
| color-based | s1 | 7 | 9 | 5 | 0.5556 | 0.7143 | 0.6250 |
| color-based | s2 | 7 | 11 | 4 | 0.3636 | 0.5714 | 0.4444 |
| sliding-window-based | c1 | 5 | 6 | 3 | 0.5000 | 0.6000 | 0.5455 |
| sliding-window-based | c2 | 5 | 7 | 3 | 0.4286 | 0.6000 | 0.5000 |
| sliding-window-based | s1 | 7 | 8 | 5 | 0.6250 | 0.7143 | 0.6667 |
| sliding-window-based | s2 | 7 | 9 | 4 | 0.4444 | 0.5714 | 0.5000 |
| proposed (w/o optimization) | c1 | 5 | 5 | 3 | 0.6000 | 0.6000 | 0.6000 |
| proposed (w/o optimization) | c2 | 5 | 5 | 3 | 0.6000 | 0.6000 | 0.6000 |
| proposed (w/o optimization) | s1 | 7 | 8 | 4 | 0.5000 | 0.5714 | 0.5333 |
| proposed (w/o optimization) | s2 | 7 | 6 | 3 | 0.5000 | 0.4286 | 0.4615 |
| our method | c1 | 5 | 5 | 4 | 0.8000 | 0.8000 | 0.8000 |
| our method | c2 | 5 | 7 | 4 | 0.5714 | 0.8000 | 0.6667 |
| our method | s1 | 7 | 6 | 5 | 0.8333 | 0.7143 | 0.7692 |
| our method | s2 | 7 | 6 | 5 | 0.8333 | 0.7143 | 0.7692 |
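As a check on the table, Precision, Recall, and F-measure follow from the three counts as $P = N_C / N_E$, $R = N_C / N_{GT}$, and $F = 2PR / (P + R)$; a minimal sketch:

```python
def prf(n_gt, n_e, n_c):
    """Precision, Recall, and F-measure from the counts in Table 2."""
    p = n_c / n_e
    r = n_c / n_gt
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return round(p, 4), round(r, 4), round(f, 4)

# entropy-based method on video c1: 5 ground-truth, 6 extracted, 3 correct
print(prf(5, 6, 3))  # -> (0.5, 0.6, 0.5455)
```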
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).