Abstract
Shot Boundary Detection (SBD) is a key pre-processing task for intelligent video analysis applications, so an efficient SBD method is essential. A wide variety of methods have been proposed in the literature, but only a few adopt a multimodal approach. In this work, we introduce a new multimodal technique for shot boundary detection that learns the distance measure between audiovisual features using a Siamese network. The proposed system consists of two models: a Convolutional Neural Network-Gated Recurrent Unit (CNN-GRU) based model for the audio modality and the pre-trained EfficientNet model for the visual modality. The network learns a similarity score from image embedding features and from the Power Spectral Density (PSD), which serves as the audio feature. The resulting similarity scores are then used to build a signal that represents the audio-visual change. We apply a global threshold to this signal to detect transitions, and an adaptive threshold to distinguish between the detected transition types (abrupt or gradual). The experimental study on standard datasets (TRECvid 2001 and TRECvid 2007) shows that introducing audio features yields an improvement over state-of-the-art models in terms of F1 score (91.36%) and gradual transition detection (89.06%). The proposed approach can be incorporated into different multimedia applications to reduce their complexity.
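The thresholding stage described in the abstract can be illustrated with a minimal sketch. The abstract does not give the threshold formulas, so everything below is an assumption made for illustration: the function name `detect_transitions`, the mean-plus-k-standard-deviations form of both thresholds, the parameters `k_global`, `k_local`, and `window`, and the rule that a single-step peak above the local threshold counts as abrupt. It is not the authors' implementation.

```python
from statistics import mean, pstdev


def detect_transitions(similarity, k_global=1.0, k_local=3.0, window=5):
    """Return (start, end, kind) tuples from per-step similarity scores in [0, 1].

    All threshold rules here are hypothetical; the abstract only states that a
    global threshold detects transitions and an adaptive one separates types.
    """
    if not similarity:
        return []

    # Audio-visual change signal: low similarity means high change.
    change = [1.0 - s for s in similarity]

    # Assumed global threshold: mean + k_global * std over the whole signal.
    global_thr = mean(change) + k_global * pstdev(change)

    transitions = []
    i = 0
    while i < len(change):
        if change[i] <= global_thr:
            i += 1
            continue
        # Group consecutive above-threshold steps into one candidate transition.
        j = i
        while j + 1 < len(change) and change[j + 1] > global_thr:
            j += 1
        # Assumed adaptive threshold: statistics of the neighbourhood around
        # the candidate, excluding the candidate itself.
        lo, hi = max(0, i - window), min(len(change), j + 1 + window)
        context = change[lo:i] + change[j + 1:hi]
        if context:
            local_thr = mean(context) + k_local * pstdev(context)
        else:
            local_thr = global_thr
        peak = max(change[i:j + 1])
        # Assumed rule: a one-step spike above the local threshold is abrupt,
        # anything else that passed the global threshold is gradual.
        kind = "abrupt" if (j == i and peak > local_thr) else "gradual"
        transitions.append((i, j, kind))
        i = j + 1
    return transitions


# Synthetic scores: one sharp drop (a cut) and one four-step dip (a gradual change).
scores = [0.95] * 10 + [0.1] + [0.95] * 10 + [0.6, 0.5, 0.55, 0.6] + [0.95] * 10
print(detect_transitions(scores))
# [(10, 10, 'abrupt'), (21, 24, 'gradual')]
```

In this sketch the global threshold only decides where a transition occurs, while the local statistics decide its type, which mirrors the two-stage role the abstract assigns to the global and adaptive thresholds.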
Funding
The authors have no relevant financial or non-financial interests to disclose.
Ethics declarations
Competing interests
The authors have no competing interests to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript. The authors have no financial or proprietary interests in any material discussed in this article.
Additional information
Ben Ayed Yassine contributed equally to this work.
About this article
Cite this article
Mohamed, B., Yassine, B.A. Shot boundary detection using multimodal Siamese network. Multimed Tools Appl 83, 5055–5078 (2024). https://doi.org/10.1007/s11042-023-15428-4
DOI: https://doi.org/10.1007/s11042-023-15428-4