DOI: 10.1145/3397481.3450672

Research Article

Non-Linear Consumption of Videos Using a Sequence of Personalized Multimodal Fragments

Published: 14 April 2021

Abstract

As videos progressively take a central role in conveying information on the Web, current linear consumption methods, which require spending time proportional to the duration of a video, need to be revisited. In this work, we present NoVoExp, a method that enables a Non-linear Video Consumption Experience by generating a sequence of multimodal fragments that succinctly represents the content of different segments of a video. These fragments aid understanding of the video's content without requiring it to be watched in its entirety, and they serve as pointers to different segments of the video, enabling a new mechanism for consuming videos. We design several baselines by building on top of video captioning and video summarization work to understand the relative advantages and disadvantages of NoVoExp, and compare performance across video durations (short, medium, long) and categories (entertainment, lectures, tutorials). We observe that the sequences of multimodal fragments generated by NoVoExp have higher relevance to the video and are more diverse yet coherent. Our extensive evaluation using automated metrics and human studies shows that our fragments are not only good at representing the contents of the video but also align well with targeted viewer preferences.
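The paper itself details the method; purely as an illustrative sketch (the names, fields, and helper below are assumptions for illustration, not the authors' implementation), a sequence of multimodal fragments can be modeled as timestamped entries that both summarize a video segment and point back into it:

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    """One multimodal fragment: a succinct summary of a video segment
    plus a pointer (timestamps) back into the source video."""
    start_s: float   # segment start, in seconds
    end_s: float     # segment end, in seconds
    caption: str     # textual summary of the segment
    thumbnail: str   # representative frame identifier (hypothetical)

def seek_target(fragments, query):
    """Return the start time of the first fragment whose caption mentions
    the query, letting a viewer jump straight to a relevant segment
    instead of scrubbing through the whole video."""
    for f in fragments:
        if query.lower() in f.caption.lower():
            return f.start_s
    return None

fragments = [
    Fragment(0.0, 45.0, "Introduction and agenda", "frame_001.jpg"),
    Fragment(45.0, 180.0, "Setting up the environment", "frame_014.jpg"),
    Fragment(180.0, 420.0, "Walkthrough of the main demo", "frame_052.jpg"),
]

print(seek_target(fragments, "demo"))  # → 180.0
```

Such a structure captures the two roles the abstract describes: the captions (and thumbnails) convey content without playback, while the timestamps make each fragment a navigation target.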


Cited By

  • (2023) Unsupervised Audio-Visual Lecture Segmentation. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5221–5230. DOI: 10.1109/WACV56688.2023.00520. Online publication date: Jan 2023.
  • (2023) Transitioning from Information Sharing to Knowledge Services: Unpacking the Evolution of Rural Media Within the Context of Media Integration. Journal of the Knowledge Economy 15(3), 12751–12782. DOI: 10.1007/s13132-023-01593-2. Online publication date: 14 Nov 2023.
  • (2023) User experience evaluation of Edurell interface for video augmentation. Multimedia Tools and Applications 83(12), 36695–36717. DOI: 10.1007/s11042-023-15912-x. Online publication date: 29 Jun 2023.
  • (2022) Power to the Learner: Towards Human-Intuitive and Integrative Recommendations with Open Educational Resources. Sustainability 14(18), 11682. DOI: 10.3390/su141811682. Online publication date: 17 Sep 2022.

Published In
          IUI '21: Proceedings of the 26th International Conference on Intelligent User Interfaces
          April 2021
          618 pages
          ISBN:9781450380171
          DOI:10.1145/3397481
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States



          Author Tags

          1. crossmodal translation
          2. non-linear consumption/interaction
          3. videos

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

Conference

IUI '21

          Acceptance Rates

          Overall Acceptance Rate 746 of 2,811 submissions, 27%

