Abstract
The exponential growth in the use of computing technologies across applications has led to the creation of huge amounts of multimedia information such as video, audio, and text. The enormous amount of video data generated in recent years necessitates video summarization techniques, which have become an emerging field of research. These techniques may facilitate quick browsing, indexing, and faster sharing of content among various sources. Video summarization is a popular way to generate a short summary of a longer video, and the approaches may be broadly classified into handcrafted (using feature descriptors) and deep learning (DL)-based algorithms. In this paper, we present a comprehensive review of state-of-the-art (SOTA) techniques for video summarization, from traditional to modern data-driven approaches. In addition, we propose a taxonomy for classifying video summarization methods based on multiple criteria. We also analyze the evaluation protocols for these approaches, covering benchmark datasets and performance metrics. We identify and list research challenges specific to each sub-category of video summarization. Our review indicates that modern deep learning-based approaches outperform traditional methods in terms of accuracy, at the cost of additional training overhead. Furthermore, most handcrafted approaches offer limited performance in dynamic video scenarios and suffer from several inconsistencies, such as scaling or rotational variations under different illumination conditions. Our analysis also indicates that multi-criteria-based video summarization is an area that requires further exploration by the research community. This survey may serve as a reference for new researchers carrying out investigations in this active field of computer vision.
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Ethics declarations
Competing interests
All the authors declare that they do not have any conflict of interest.
About this article
Cite this article
Sabha, A., Selwal, A. Data-driven enabled approaches for criteria-based video summarization: a comprehensive survey, taxonomy, and future directions. Multimed Tools Appl 82, 32635–32709 (2023). https://doi.org/10.1007/s11042-023-14925-w
DOI: https://doi.org/10.1007/s11042-023-14925-w