Video Scene Detection Using Transformer Encoding Linker Network (TELNet)
Figure 1. TELNet overall architecture.
Figure 2. Details of the proposed TELNet model.
Figure 3. Transformer encoding layer.
Figure 4. Merge algorithm. Two separate key-shot candidates exist for the target shot in window n and window n + 1; the candidate from window n is compared with the candidate from window n + 1, and the more related one is kept as the final key-shot.
Figure 5. Training label generation. The diagram shows a sample scene in which all the other shots within the scene are linked to the key-shot (red rectangle).
Figure 6. Comparison of predicted scene boundaries and ground-truth labels for video 08 of the BBC Planet Earth dataset, titled "Ocean Deep".
Abstract
1. Introduction
- We proposed a transformer encoding linker network (TELNet) that models correlations among video shots and identifies scene boundaries without prior knowledge of the video structure, such as the number of scenes. TELNet improves the F-score by 50% in half of the transfer settings while remaining on par with the other state-of-the-art (SOTA) models in the remaining evaluation settings.
- The transformer encoder and the linker were trained jointly, in contrast to the other SOTA models, which are trained in separate stages. TELNet was trained on newly generated graphs in which the nodes are shot representations and the edges are links between the key-shot and the other shots within a scene.
- Given that TELNet scans video shots in batches and aggregates the results, its computational complexity grows linearly with the number of shots (see the rolling-window sketch below), in contrast to models whose complexity grows with the square of the number of shots [14], or to the NP-hard complexity of the NCut algorithm used to estimate the number of scenes in a video [15]. Prior video scene detection models are summarized in Table 1.
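As a concrete illustration of the linear-complexity scan, the following is a minimal sketch of rolling-window processing. The window size, stride, and `process_window` callable are illustrative assumptions, not the exact TELNet settings or interface.

```python
def scan_in_windows(shot_features, process_window, window_size=20, stride=10):
    """Run a fixed-size model over overlapping windows of shots.

    Each window has constant size, so the total cost is proportional to the
    number of windows, i.e. linear in the number of shots, rather than
    quadratic as for models that compare every pair of shots.
    """
    results = []
    n = len(shot_features)
    for start in range(0, max(n - window_size, 0) + 1, stride):
        window = shot_features[start:start + window_size]
        # process_window stands in for the transformer encoder + linker
        results.append((start, process_window(window)))
    return results  # per-window predictions, later merged (Section 3.4)
```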
2. Related Work
2.1. Shot Detection
2.2. Shot Representation
2.3. Scene Boundary Detection
3. Method
3.1. Shot Representation
3.2. Transformer Encoding
3.3. Linker
3.4. Merging Algorithm
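The merge step (Figure 4) keeps, for each shot that receives key-shot candidates from two overlapping windows, the candidate that is most related to that shot. The following is a minimal sketch under that description; cosine similarity over shot features is an assumed relatedness measure, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def merge_key_shots(candidates, shot_features):
    """Resolve conflicting key-shot candidates from overlapping windows.

    candidates: dict mapping shot index -> list of candidate key-shot indices
    (one candidate per window that covered the shot). For each shot, keep the
    candidate whose feature vector is most similar to the shot itself.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    final = {}
    for shot, cands in candidates.items():
        final[shot] = max(cands, key=lambda k: cosine(shot_features[shot], shot_features[k]))
    return final  # shot index -> final key-shot index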
4. Experiment
- Selecting the key-shot as the shot closest to the mean of the shot features within a scene (refer to Algorithm 1; a sketch follows the algorithm below). In contrast to the maximum-variance method, which selects the most diverse shot within a scene, we select the key-shot by its proximity to the mean of the shot features. The rationale is to identify a shot that best represents the storyline of the scene by encapsulating its most common visual content. Choosing the shot closest to the mean prioritizes shots that align with the overall visual theme of the scene, yielding a more cohesive and representative graph, whereas the maximum-variance method may favor outliers that do not accurately depict the scene's primary content. The key-shot therefore serves as a reliable anchor for connecting the other intra-scene shots, improving the interpretability and coherence of the video graph.
- Establishing links from the other intra-scene shots to the key-shot (refer to Figure 5). This step connects the key-shot with the other shots in the same scene; the resulting graph captures the cohesive relationships among the shots and provides insight into the scene's content.
Algorithm 1. Key-shot selection and graph label generation.
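A minimal sketch of the two steps above (key-shot selection by proximity to the scene mean, then linking intra-scene shots to the key-shot), assuming shot features as a NumPy array and ground-truth scenes as shot-index ranges; all names are illustrative rather than the authors' implementation of Algorithm 1.

```python
import numpy as np

def generate_graph_labels(shot_features, scenes):
    """Build training links: every shot in a scene points to that scene's key-shot.

    shot_features: (num_shots, feature_dim) array.
    scenes: list of (start_shot, end_shot) index pairs, end inclusive.
    """
    links = {}
    for start, end in scenes:
        idx = np.arange(start, end + 1)
        feats = shot_features[idx]
        centroid = feats.mean(axis=0)
        # key-shot = shot whose features are nearest to the scene centroid
        key_shot = idx[np.argmin(np.linalg.norm(feats - centroid, axis=1))]
        for shot in idx:
            links[int(shot)] = int(key_shot)  # every intra-scene shot links to the key-shot
    return links
```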
4.1. Implementation Detail
4.2. Datasets
4.2.1. BBC Planet Earth Dataset
4.2.2. OVSD Dataset
4.2.3. MSC Dataset
4.3. Evaluation Metrics
4.4. Performance Comparison
- In the canonical setting, the F-score results on the BBC and OVSD datasets are obtained by averaging the leave-one-out results. For the MSC dataset, the F-score is calculated by splitting the dataset into 70% for training and 30% for testing.
- In the transfer setting, on the other hand, the F-score is calculated by training on one full dataset and testing on a different dataset. In this scenario, TELNet exhibits superior performance over ACRNet, indicating that the proposed encoder–linker learns the essential video structure for scene boundary detection independently of the specific video subjects. Both protocols are sketched below.
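The two evaluation protocols can be summarized with the following minimal sketch; `train_and_test` is a hypothetical helper standing in for training the model and computing the F-score on the test videos, not part of any released code.

```python
def canonical_f_score(videos, train_and_test):
    """Leave-one-out protocol: train on all videos but one, test on the held-out video."""
    scores = []
    for i, held_out in enumerate(videos):
        train_set = videos[:i] + videos[i + 1:]
        scores.append(train_and_test(train_set, [held_out]))
    return sum(scores) / len(scores)  # average F-score over all held-out videos

def transfer_f_score(source_videos, target_videos, train_and_test):
    """Transfer protocol: train on one full dataset, test on a different dataset."""
    return train_and_test(source_videos, target_videos)
```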
4.5. Complexity Comparison
4.6. Results Sample
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wang, H.; Neumann, J.; Choi, J. Determining Video Highlights and Chaptering. U.S. Patent 11,172,272, 9 November 2021.
- Jindal, A.; Bedi, A. Extracting Session Information from Video Content to Facilitate Seeking. U.S. Patent 10,701,434, 30 June 2020.
- Otani, M.; Nakashima, Y.; Rahtu, E.; Heikkila, J. Rethinking the evaluation of video summaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7596–7604.
- Rui, Y.; Huang, T.S.; Mehrotra, S. Constructing table-of-content for videos. Multimed. Syst. 1999, 7, 359–368.
- Cotsaces, C.; Nikolaidis, N.; Pitas, I. Video shot detection and condensed representation: A review. IEEE Signal Process. Mag. 2006, 23, 28–37.
- Abdulhussain, S.H.; Ramli, A.R.; Saripan, M.I.; Mahmmod, B.M.; Al-Haddad, S.A.R.; Jassim, W.A. Methods and challenges in shot boundary detection: A review. Entropy 2018, 20, 214.
- Chavate, S.; Mishra, R.; Yadav, P. A Comparative Analysis of Video Shot Boundary Detection using Different Approaches. In Proceedings of the 2021 IEEE 10th International Conference on System Modeling & Advancement in Research Trends (SMART), Moradabad, India, 8–10 December 2021; pp. 1–7.
- Pal, G.; Rudrapaul, D.; Acharjee, S.; Ray, R.; Chakraborty, S.; Dey, N. Video shot boundary detection: A review. In Emerging ICT for Bridging the Future—Proceedings of the 49th Annual Convention of the Computer Society of India CSI; Springer: Berlin/Heidelberg, Germany, 2015; Volume 2, pp. 119–127.
- Kishi, R.M.; Trojahn, T.H.; Goularte, R. Correlation based feature fusion for the temporal video scene segmentation task. Multimed. Tools Appl. 2019, 78, 15623–15646.
- Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 961–970.
- Baraldi, L.; Grana, C.; Cucchiara, R. Recognizing and presenting the storytelling video structure with deep multimodal networks. IEEE Trans. Multimed. 2016, 19, 955–968.
- Trojahn, T.H.; Goularte, R. Temporal video scene segmentation using deep-learning. Multimed. Tools Appl. 2021, 80, 17487–17513.
- Baraldi, L.; Grana, C.; Cucchiara, R. A deep siamese network for scene detection in broadcast videos. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26 October 2015; pp. 1199–1202.
- Rotman, D.; Yaroker, Y.; Amrani, E.; Barzelay, U.; Ben-Ari, R. Learnable optimal sequential grouping for video scene detection. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1958–1966.
- Liu, D.; Kamath, N.; Bhattacharya, S.; Puri, R. Adaptive Context Reading Network for Movie Scene Detection. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 3559–3574.
- Sidiropoulos, P.; Mezaris, V.; Kompatsiaris, I.; Meinedo, H.; Bugalho, M.; Trancoso, I. Temporal video segmentation to scenes using high-level audiovisual features. IEEE Trans. Circuits Syst. Video Technol. 2011, 21, 1163–1177.
- Protasov, S.; Khan, A.M.; Sozykin, K.; Ahmad, M. Using deep features for video scene detection and annotation. Signal Image Video Process. 2018, 12, 991–999.
- Pei, Y.; Wang, Z.; Chen, H.; Huang, B.; Tu, W. Video scene detection based on link prediction using graph convolution network. In Proceedings of the 2nd ACM International Conference on Multimedia in Asia, Singapore, 7–9 March 2021; pp. 1–7.
- Bouyahi, M.; Ayed, Y.B. Video scenes segmentation based on multimodal genre prediction. Procedia Comput. Sci. 2020, 176, 10–21.
- Rotman, D.; Porat, D.; Ashour, G. Robust video scene detection using multimodal fusion of optimally grouped features. In Proceedings of the 2017 IEEE 19th International Workshop on Multimedia Signal Processing (MMSP), Luton, UK, 16–18 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–6.
- Son, J.W.; Lee, A.; Kwak, C.U.; Kim, S.J. Supervised Scene Boundary Detection with Relational and Sequential Information. In Proceedings of the 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT); IEEE: Piscataway, NJ, USA, 2020; pp. 250–258.
- Chu, W.S.; Song, Y.; Jaimes, A. Video co-summarization: Video summarization by visual co-occurrence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3584–3592.
- Zawbaa, H.M.; El-Bendary, N.; Hassanien, A.E.; Kim, T.H. Event detection based approach for soccer video summarization using machine learning. Int. J. Multimed. Ubiquitous Eng. 2012, 7, 63–80.
- Potapov, D.; Douze, M.; Harchaoui, Z.; Schmid, C. Category-specific video summarization. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 540–555.
- Sokeh, H.S.; Argyriou, V.; Monekosso, D.; Remagnino, P. Superframes, a temporal video segmentation. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 566–571.
- Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Networks Learn. Syst. 2022, 33, 6999–7019.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25.
- Zhou, B.; Lapedriza, A.; Xiao, J.; Torralba, A.; Oliva, A. Learning deep features for scene recognition using places database. Adv. Neural Inf. Process. Syst. 2014, 27.
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231.
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA, 7–13 December 2015; pp. 4489–4497.
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
- Hara, K.; Kataoka, H.; Satoh, Y. Learning spatio-temporal features with 3D residual networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 3154–3160.
- Chen, S.; Nie, X.; Fan, D.; Zhang, D.; Bhat, V.; Hamid, R. Shot contrastive self-supervised learning for scene boundary detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021.
- Wu, H.; Chen, K.; Luo, Y.; Qiao, R.; Ren, B.; Liu, H.; Xie, W.; Shen, L. Scene Consistency Representation Learning for Video Scene Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14021–14030.
- Hanjalic, A.; Lagendijk, R.L.; Biemond, J. Automated high-level movie segmentation for advanced video-retrieval systems. IEEE Trans. Circuits Syst. Video Technol. 1999, 9, 580–588.
- Chasanis, V.T.; Likas, A.C.; Galatsanos, N.P. Scene detection in videos using shot clustering and sequence alignment. IEEE Trans. Multimed. 2008, 11, 89–100.
- Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the 7th IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; IEEE: Piscataway, NJ, USA, 1999; Volume 2, pp. 1150–1157.
- Haroon, M.; Baber, J.; Ullah, I.; Daudpota, S.M.; Bakhtyar, M.; Devi, V. Video scene detection using compact bag of visual word models. Adv. Multimed. 2018, 2018, 2564963.
- Trojahn, T.H.; Kishi, R.M.; Goularte, R. A new multimodal deep-learning model to video scene segmentation. In Proceedings of the 24th Brazilian Symposium on Multimedia and the Web, Salvador, Brazil, 16–19 October 2018; pp. 205–212.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- White, S.; Smyth, P. A spectral clustering approach to finding communities in graphs. In Proceedings of the 2005 SIAM International Conference on Data Mining, Newport Beach, CA, USA, 21–23 April 2005; pp. 274–285.
- Shi, J.; Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 888–905.
- Islam, M.M.; Hasan, M.; Athrey, K.S.; Braskich, T.; Bertasius, G. Efficient Movie Scene Detection Using State-Space Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2023; pp. 18749–18758.
- Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 23–28 June 2014; pp. 1725–1732.
- Movieclips. Movieclips YouTube Channel. Available online: http://www.youtube.com/user/movieclips (accessed on 1 September 2020).
- Vendrig, J.; Worring, M. Systematic evaluation of logical story unit segmentation. IEEE Trans. Multimed. 2002, 4, 492–499.
- Wikipedia Contributors. Segmentation-Based Object Categorization. Wikipedia, The Free Encyclopedia, 2021. Available online: https://en.wikipedia.org/wiki/Segmentation-based_object_categorization (accessed on 17 February 2022).
Model | Shot Feature | Feature Encoding | Shot Clustering | Complexity | Prior Knowledge |
---|---|---|---|---|---|
Sidiropoulos et al. [16] | Color Histogram + audio | | Scene Transition Graph (STG) | | Not required
Triplet [11] | 2D CNN + MFCC + textual | DNN with triplet loss | Temporal Aware Clustering | O () |
Kishi et al. [9] | SIFT + MFCC | | Off-the-shelf STG, etc. | |
Trojahn et al. [12] | ConvFeats + MFCC + textual | LSTM | | |
SDN [13] | 2D CNN + textual | Siamese Network | | |
SAK-18 [17] | | | Overlapping Link | |
Pei et al. [18] | 2D CNN | GCN | | |
Bouyahi et al. [19] | 2D CNN + audio | | Bi-Clustering | |
OSG [20] | | | Optimal Sequence Grouping (OSG) | O | Number of scenes required
OSG-Triplet [14] | | Triplet loss | | |
ACRNet [15] | 3D CNN | Self-attention | Normalized Cuts (NCuts) | NP-hard |
SDRS [21] | 3D CNN + audio | GRU | | |
TELNet | 3D CNN | Transformer encoding linker | | O (N) | Not required
Layer | Configuration
---|---
Transformer encoder | Heads = 4, number of stacks = 6, d_model = 4096
Fully connected layer 1 | (4096, 2048), activation function = ReLU
Fully connected layer 2 | (2048, 1024), activation function = ReLU
Fully connected layer 3 | (1024, rolling window size)
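For reference, the following is a minimal PyTorch sketch of the dimensions listed above, assuming standard nn.TransformerEncoder layers; it is illustrative only and not the authors' released implementation.

```python
import torch
import torch.nn as nn

class EncoderLinkerSketch(nn.Module):
    """Encoder-linker sketch with the layer sizes from the hyperparameter table."""

    def __init__(self, window_size, d_model=4096, n_heads=4, n_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Linker: three fully connected layers, one output score per shot in the window
        self.linker = nn.Sequential(
            nn.Linear(d_model, 2048), nn.ReLU(),
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, window_size),
        )

    def forward(self, shots):          # shots: (window_size, batch, 4096)
        encoded = self.encoder(shots)  # contextualized shot representations
        return self.linker(encoded)    # per-shot link scores over the window
```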
Video Name | Video Length | Number of Shots | Number of Scenes | FPS | Resolution |
---|---|---|---|---|---|
From Pole to Pole (01) | 49:15 | 445 | 46 | 25 | 360 × 288 |
Mountains (02) | 48:05 | 383 | 44 | ||
Ice Worlds (03) | 49:17 | 421 | 48 | ||
Great Plains (04) | 49:03 | 472 | 57 | ||
Jungles (05) | 49:14 | 460 | 54 | ||
Seasonal Forests (06) | 49:19 | 526 | 52 | ||
Fresh Water (07) | 49:17 | 531 | 57 | ||
Ocean Deep (08) | 49:14 | 410 | 46 | ||
Shallow Seas (09) | 49:14 | 366 | 58 | ||
Caves (10) | 48:55 | 374 | 53 | ||
Total | 8:59:53 | 4855 | 568 |
Video Name | Video Length | Number of Shots | Number of Scenes | FPS | Resolution |
---|---|---|---|---|---|
BBB | 09:56 | 112 | 15 | 24 | 1280 × 720 |
BWNS | 01:09:46 | 257 | 36 | 30 | 524 × 360 |
CL | 12:10 | 98 | 7 | 24 | 1920 × 804 |
FBW | 1:16:06 | 686 | 62 | 30 | 720 × 528 |
Honey | 1:26:49 | 315 | 20 | 30 | 480 × 216 |
Meridian | 11:58 | 56 | 9 | 30 | 1280 × 720 |
LCDUP | 10:23 | 118 | 10 | 25 | 1264 × 720 |
Route 66 | 1:43:25 | 700 | 55 | 25 | 640 × 432 |
Star Wreck | 1:43:14 | 1055 | 55 | 25 | 640 × 304 |
Total | 11:44:37 | 3397 | 269 |
Number of Videos | Total Shots | Average Video Length | Average Number of Shots |
---|---|---|---|
468 | 16,131 | 2:35 | 34 |
Video Name | Triplet [11] | Kishi et al. [9] | Trojahn [12] | SDN [13] | SAK-18 [17] | SDRS [21] | Pei et al. [18] | Bouyahi et al. [19] | OSG [20] | OSG-Triplet [14] | ACRNet [15] | TELNet |
---|---|---|---|---|---|---|---|---|---|---|---|---|
From Pole to Pole (01) | 0.72 | 0.65 | 0.63 | 0.56 | 0.5 | 0.78 | 0.57 | 0.48 | 0.66 | 0.68 | 0.83 | 0.77 |
Mountains (02) | 0.75 | 0.65 | 0.65 | 0.63 | 0.54 | 0.73 | 0.58 | 0.5 | 0.65 | 0.65 | 0.82 | 0.68 |
Ice Worlds (03) | 0.73 | 0.66 | 0.64 | 0.66 | 0.5 | 0.74 | 0.56 | 0.54 | 0.64 | 0.64 | 0.77 | 0.69 |
Great Plains (04) | 0.63 | 0.7 | 0.68 | 0.61 | 0.54 | 0.68 | 0.57 | 0.66 | 0.6 | 0.6 | 0.72 | 0.75 |
Jungles (05) | 0.62 | 0.67 | 0.63 | 0.55 | 0.51 | 0.66 | 0.55 | 0.56 | 0.55 | 0.7 | 0.74 |
Seasonal Forests (06) | 0.65 | 0.69 | 0.64 | 0.64 | 0.51 | 0.69 | 0.48 | 0.59 | 0.58 | 0.61 | 0.7 | 0.75 |
Fresh Water (07) | 0.67 | 0.67 | 0.66 | 0.59 | 0.53 | 0.73 | 0.58 | 0.54 | 0.56 | 0.7 | 0.74 | |
Ocean Deep (08) | 0.65 | 0.64 | 0.67 | 0.64 | 0.38 | 0.66 | 0.55 | 0.68 | 0.65 | 0.66 | 0.73 | 0.76
Shallow Seas (09) | 0.74 | 0.69 | 0.64 | 0.64 | 0.55 | 0.67 | 0.56 | 0.57 | 0.56 | 0.8 | 0.7 |
Caves (10) | 0.62 | 0.65 | 0.67 | 0.64 | 0.43 | 0.66 | 0.54 | 0.64 | 0.59 | 0.61 | 0.75 | 0.77 |
Deserts (11) | 0.62 | 0.69 | 0.66 | 0.64 | 0.51 | 0.7 | 0.52 | 0.62 | 0.65 | 0.65 | 0.71 | 0.77 |
Average | 0.67 | 0.67 | 0.65 | 0.62 | 0.5 | 0.7 | 0.55 | 0.53 | 0.61 | 0.62 | 0.76 | 0.74 |
Video Name | Trojahn et al. [12] | SDRS [21] | Pei et al. [18] | ACRNet [15] | OSG-Triplet [14] | OSG [20] | TELNet |
---|---|---|---|---|---|---|---|
BBB | 0.57 | 0.75 | 0.65 | 0.74 | 0.81 | 0.83 | 0.69 |
BWNS | 0.53 | 0.67 | 0.7 | 0.75 | 0.63 | 0.6 | |
CL | 0.64 | 0.69 | 0.78 | 0.61 | 0.49 | 0.62 | 0.88 |
FBW | 0.57 | 0.55 | 0.58 | 0.76 | 0.57 | 0.66 | |
Honey | 0.6 | 0.67 | 0.73 | 0.73 | 0.58 | 0.77 | |
Meridian | 0.45 | 0.86 | 0.69 | 0.63 | 0.75 | ||
LCDUP | 0.63 | 0.83 | 0.71 | 0.72 | 0.73 | 0.76 | |
Route 66 | 0.63 | 0.55 | 0.64 | 0.72 | 0.54 | 0.64 | |
Star Wreck | 0.62 | 0.63 | 0.66 | 0.55 | 0.71 | ||
Average | 0.58 | 0.68 | 0.69 | 0.73 | 0.7 | 0.63 | 0.72 |
MSC Dataset | OSG [20] | OSG-Triplet [14] | ACRNet [15] | TELNet |
---|---|---|---|---|
Random 30% test | 0.57 | 0.59 | 0.67 | 0.69 |
Train / Test | MSC | BBC | OVSD |
---|---|---|---|
MSC | 0.67 | 0.64 | 0.63 |
BBC | 0.28 | 0.76 | 0.22 |
OVSD | 0.29 | 0.23 | 0.73 |
Train / Test | MSC | BBC | OVSD |
---|---|---|---|
MSC | 0.69 | 0.62 | 0.6 |
BBC | 0.64 | 0.74 | 0.56 |
OVSD | 0.64 | 0.64 | 0.72 |
Share and Cite
Tseng, S.-M.; Yeh, Z.-T.; Wu, C.-Y.; Chang, J.-B.; Norouzi, M. Video Scene Detection Using Transformer Encoding Linker Network (TELNet). Sensors 2023, 23, 7050. https://doi.org/10.3390/s23167050