DOI: 10.1145/3397481.3450672

Research Article

Non-Linear Consumption of Videos Using a Sequence of Personalized Multimodal Fragments

Published: 14 April 2021

Abstract

As videos progressively take a central role in conveying information on the Web, current linear consumption methods, which require spending time proportional to the duration of a video, need to be revisited. In this work, we present NoVoExp, a method that enables a Non-linear Video Consumption Experience by generating a sequence of multimodal fragments that succinctly represents the content of different segments of a video. These fragments aid understanding of the video's content without requiring it to be watched in its entirety, and they serve as pointers to different segments of the video, enabling a new mechanism for consuming videos. We design several baselines by building on top of video captioning and video summarization work to understand the relative advantages and disadvantages of NoVoExp, and compare performance across video durations (short, medium, long) and categories (entertainment, lectures, tutorials). We observe that the sequences of multimodal fragments generated by NoVoExp have higher relevance to the video and are more diverse yet coherent. Our extensive evaluation using automated metrics and human studies shows that our fragments are not only good at representing the contents of the video but also align well with targeted viewer preferences.
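The paper itself details the method; purely as an illustrative sketch (the names, fields, and helper below are assumptions for illustration, not the authors' implementation), a sequence of multimodal fragments can be modeled as timestamped entries that both summarize a video segment and point back into it:

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    """One multimodal fragment: a succinct summary of a video segment
    plus a pointer (timestamps) back into the source video."""
    start_s: float   # segment start, in seconds
    end_s: float     # segment end, in seconds
    caption: str     # textual summary of the segment
    thumbnail: str   # representative frame identifier (hypothetical)

def seek_target(fragments, query):
    """Return the start time of the first fragment whose caption mentions
    the query, letting a viewer jump straight to a relevant segment
    instead of scrubbing through the whole video."""
    for f in fragments:
        if query.lower() in f.caption.lower():
            return f.start_s
    return None

fragments = [
    Fragment(0.0, 45.0, "Introduction and agenda", "frame_001.jpg"),
    Fragment(45.0, 180.0, "Setting up the environment", "frame_014.jpg"),
    Fragment(180.0, 420.0, "Walkthrough of the main demo", "frame_052.jpg"),
]

print(seek_target(fragments, "demo"))  # → 180.0
```

Such a structure captures the two roles the abstract describes: the captions (and thumbnails) convey content without playback, while the timestamps make each fragment a navigation target.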


Cited By

  • (2023) Unsupervised Audio-Visual Lecture Segmentation. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5221–5230. DOI: 10.1109/WACV56688.2023.00520. Online publication date: Jan 2023.
  • (2023) Transitioning from Information Sharing to Knowledge Services: Unpacking the Evolution of Rural Media Within the Context of Media Integration. Journal of the Knowledge Economy 15(3), 12751–12782. DOI: 10.1007/s13132-023-01593-2. Online publication date: 14 Nov 2023.
  • (2023) User experience evaluation of Edurell interface for video augmentation. Multimedia Tools and Applications 83(12), 36695–36717. DOI: 10.1007/s11042-023-15912-x. Online publication date: 29 Jun 2023.
  • (2022) Power to the Learner: Towards Human-Intuitive and Integrative Recommendations with Open Educational Resources. Sustainability 14(18), 11682. DOI: 10.3390/su141811682. Online publication date: 17 Sep 2022.

Published In
          IUI '21: Proceedings of the 26th International Conference on Intelligent User Interfaces
          April 2021
          618 pages
          ISBN:9781450380171
          DOI:10.1145/3397481
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States



          Author Tags

          1. crossmodal translation
          2. non-linear consumption/interaction
          3. videos

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

Conference

IUI '21

          Acceptance Rates

          Overall Acceptance Rate 746 of 2,811 submissions, 27%

