
Online Estimation of Evolving Human Visual Interest

Published: 04 September 2014

Abstract

Regions in video streams attracting human interest contribute significantly to human understanding of the video. Predicting salient and informative Regions of Interest (ROIs) from a sequence of eye movements is a challenging problem. Applications such as content-aware retargeting of videos to different aspect ratios while preserving informative regions, and smart insertion of dialog (closed-caption text) into the video stream, can be significantly improved using the predicted ROIs. We propose an interactive human-in-the-loop framework to model eye movements and predict visual saliency in yet-unseen frames. Eye tracking and video content are used to model visual attention in a manner that accounts for important eye-gaze characteristics such as temporal discontinuities due to sudden eye movements, noise, and behavioral artifacts. A novel statistics- and algorithm-based method, gaze buffering, is proposed for eye-gaze analysis and its fusion with content-based features. Our robust saliency prediction is instantiated for two challenging applications. The first alters video aspect ratios on the fly using content-aware video retargeting, making videos suitable for a variety of display sizes. The second dynamically localizes active speakers and places dialog captions on the fly in the video stream. Our method ensures that captions are faithful to active speaker locations and do not interfere with salient content in the video stream. Our framework naturally accommodates personalization of the application to suit the biases and preferences of individual users.
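To make the gaze-buffering step above concrete, the following minimal Python sketch shows one plausible reading: a sliding buffer of recent gaze samples that is flushed on saccade-like jumps (temporal discontinuities) and averaged into an ROI estimate. The class name, buffer length, and velocity threshold are illustrative assumptions, not the paper's actual formulation.

    import math
    from collections import deque

    # Minimal gaze-buffering sketch. Names, window size, and the
    # velocity-based saccade test are assumptions for illustration,
    # not the authors' exact method.
    class GazeBuffer:
        def __init__(self, window=30, saccade_speed=500.0):
            # window: number of recent gaze samples retained (assumed)
            # saccade_speed: speed in px/s above which a jump is treated
            # as a saccadic discontinuity and the buffer is flushed
            self.samples = deque(maxlen=window)
            self.saccade_speed = saccade_speed

        def add(self, x, y, t):
            # Append one gaze sample: pixel coordinates, timestamp (seconds).
            if self.samples:
                px, py, pt = self.samples[-1]
                dt = max(t - pt, 1e-6)
                speed = math.hypot(x - px, y - py) / dt
                if speed > self.saccade_speed:
                    # Sudden eye movement: drop stale fixation samples so
                    # they do not bias the next ROI estimate.
                    self.samples.clear()
            self.samples.append((x, y, t))

        def roi_estimate(self):
            # Centroid of buffered samples as the current ROI; None if empty.
            if not self.samples:
                return None
            xs, ys, _ = zip(*self.samples)
            return sum(xs) / len(xs), sum(ys) / len(ys)

    # Usage (hypothetical values):
    #   buf = GazeBuffer()
    #   buf.add(320.0, 240.0, t=0.00)
    #   buf.add(322.0, 238.0, t=0.02)
    #   roi = buf.roi_estimate()   # approximate fixation centroid

In a full system, this per-viewer ROI estimate would then be fused with content-based saliency features (as the abstract describes) before driving retargeting or caption placement.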

Supplementary Material

a8-katti-apndx.pdf (katti.zip)
Supplemental movie, appendix, image, and software files for "Online Estimation of Evolving Human Visual Interest".



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 11, Issue 1
August 2014
151 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/2665935
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 September 2014
Accepted: 01 April 2014
Revised: 01 December 2013
Received: 01 August 2013
Published in TOMM Volume 11, Issue 1


Author Tags

  1. video retargeting
  2. gaze
  3. video captioning
  4. visual attention

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2024) Attention-based automatic editing of virtual lectures for reduced production labor and effective learning experience. International Journal of Human-Computer Studies 181:C. DOI: 10.1016/j.ijhcs.2023.103161. Online publication date: 1-Jan-2024.
  • (2022) Recognition of Advertisement Emotions With Application to Computational Advertising. IEEE Transactions on Affective Computing 13:2, 781-792. DOI: 10.1109/TAFFC.2020.2964549. Online publication date: 1-Apr-2022.
  • (2022) Active Speaker Recognition using Cross Attention Audio-Video Fusion. 2022 10th European Workshop on Visual Information Processing (EUVIP), 1-6. DOI: 10.1109/EUVIP53989.2022.9922810. Online publication date: 11-Sep-2022.
  • (2021) Automatic Subtitle Placement Through Active Speaker Identification in Multimedia Documents. 2021 International Conference on e-Health and Bioengineering (EHB), 1-4. DOI: 10.1109/EHB52898.2021.9657604. Online publication date: 18-Nov-2021.
  • (2020) A View on the Viewer: Gaze-Adaptive Captions for Videos. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1-12. DOI: 10.1145/3313831.3376266. Online publication date: 21-Apr-2020.
  • (2019) DEEP-HEAR: A Multimodal Subtitle Positioning System Dedicated to Deaf and Hearing-Impaired People. IEEE Access. DOI: 10.1109/ACCESS.2019.2925806. Online publication date: 2019.
  • (2018) Fast Volume Seam Carving With Multipass Dynamic Programming. IEEE Transactions on Circuits and Systems for Video Technology 28:5, 1087-1101. DOI: 10.1109/TCSVT.2016.2620563. Online publication date: May-2018.
  • (2017) Evaluating content-centric vs. user-centric ad affect recognition. Proceedings of the 19th ACM International Conference on Multimodal Interaction, 402-410. DOI: 10.1145/3136755.3136796. Online publication date: 3-Nov-2017.
  • (2016) Region-of-Interest-Based Subtitle Placement Using Eye-Tracking Data of Multiple Viewers. Proceedings of the ACM International Conference on Interactive Experiences for TV and Online Video, 123-128. DOI: 10.1145/2932206.2933558. Online publication date: 17-Jun-2016.
  • (2016) An Adjustable Gaze Tracking System and Its Application for Automatic Discrimination of Interest Objects. IEEE/ASME Transactions on Mechatronics 21:2, 973-979. DOI: 10.1109/TMECH.2015.2470522. Online publication date: Apr-2016.
