
Fixation Prediction through Multimodal Analysis

Published: 25 October 2016

Abstract

In this article, we propose to predict human eye fixations by incorporating both audio and visual cues. Traditional visual attention models generally make the most of a stimulus's visual features, yet they bypass all audio information. In the real world, however, we not only direct our gaze according to visual saliency but are also attracted by salient audio cues. Psychological experiments show that audio influences visual attention and that subjects tend to be attracted to sound sources. Therefore, we propose fusing both audio and visual information to predict eye fixations. In our proposed framework, we first localize the moving, sound-generating objects through multimodal analysis and generate an audio attention map. Then, we calculate the spatial and temporal attention maps using the visual modality. Finally, the audio, spatial, and temporal attention maps are fused to generate the final audiovisual saliency map. The proposed method is applicable to scenes containing moving, sound-generating objects. We gather a set of video sequences and collect eye-tracking data under an audiovisual test condition. Experimental results show that we achieve better eye fixation prediction performance when taking both audio and visual cues into consideration, especially in typical scenes in which object motion and audio are highly correlated.
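
The fusion stage at the end of this framework lends itself to a brief illustration. Below is a minimal NumPy sketch of combining per-frame audio, spatial, and temporal attention maps into a single audiovisual saliency map. The weighted-sum fusion rule, the weights, and the helper names (normalize, fuse_audiovisual) are illustrative assumptions for this sketch, not the paper's actual fusion scheme.

```python
# Minimal sketch of the fusion stage described in the abstract.
# The weighted-sum rule and the weights below are illustrative
# assumptions, not the paper's actual fusion scheme.
import numpy as np

def normalize(m: np.ndarray) -> np.ndarray:
    """Rescale an attention map to [0, 1] so modalities are comparable."""
    lo, hi = float(m.min()), float(m.max())
    if hi <= lo:
        return np.zeros_like(m)
    return (m - lo) / (hi - lo)

def fuse_audiovisual(audio: np.ndarray,
                     spatial: np.ndarray,
                     temporal: np.ndarray,
                     weights: tuple = (0.4, 0.3, 0.3)) -> np.ndarray:
    """Fuse per-frame audio, spatial, and temporal attention maps
    into one audiovisual saliency map via a normalized weighted sum."""
    maps = (audio, spatial, temporal)
    fused = sum(w * normalize(m) for w, m in zip(weights, maps))
    return normalize(fused)

# Usage on random stand-in maps for a single 120x160 frame.
rng = np.random.default_rng(0)
audio_map = rng.random((120, 160))     # e.g. from sound-source localization
spatial_map = rng.random((120, 160))   # e.g. from a static saliency model
temporal_map = rng.random((120, 160))  # e.g. from motion/optical flow
saliency = fuse_audiovisual(audio_map, spatial_map, temporal_map)
print(saliency.shape, saliency.min(), saliency.max())
```

In practice the relative weights would be tuned rather than fixed; the abstract notes the approach helps most in scenes where object motion and audio are highly correlated, which suggests the audio term should matter most exactly there.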

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 13, Issue 1
February 2017
278 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3012406
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 October 2016
Accepted: 01 August 2016
Revised: 01 July 2016
Received: 01 November 2015
Published in TOMM Volume 13, Issue 1


Author Tags

  1. Audiovisual attention
  2. attention fusion
  3. eye fixation prediction
  4. multimodal analysis
  5. saliency

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National High-Tech R&D Program of China
  • National Natural Science Foundation of China

Article Metrics

  • Downloads (last 12 months): 54
  • Downloads (last 6 weeks): 2
Reflects downloads up to 05 Mar 2025

Cited By

  • (2025) GeodesicPSIM: Predicting the Quality of Static Mesh With Texture Map via Geodesic Patch Similarity. IEEE Transactions on Image Processing 34, 44-59. DOI: 10.1109/TIP.2024.3501074. Online publication date: 1-Jan-2025.
  • (2025) CLIPVQA: Video Quality Assessment via CLIP. IEEE Transactions on Broadcasting 71, 1, 291-306. DOI: 10.1109/TBC.2024.3511927. Online publication date: Mar-2025.
  • (2025) GSNet: A new small object attention based deep classifier for presence of gun in complex scenes. Neurocomputing, 129855. DOI: 10.1016/j.neucom.2025.129855. Online publication date: Mar-2025.
  • (2025) Is red alert always optimal? An empirical study on the effects of red and blue feedback on performance under excessive stress. Displays 88, 103008. DOI: 10.1016/j.displa.2025.103008. Online publication date: Jul-2025.
  • (2025) Perceptually-calibrated synergy network for night-time image quality assessment with enhancement booster and knowledge cross-sharing. Displays 86, 102877. DOI: 10.1016/j.displa.2024.102877. Online publication date: Jan-2025.
  • (2024) Cascade contour-enhanced panoptic segmentation for robotic vision perception. Frontiers in Neurorobotics 18. DOI: 10.3389/fnbot.2024.1489021. Online publication date: 21-Oct-2024.
  • (2024) IQAGPT: computed tomography image quality assessment with vision-language and ChatGPT models. Visual Computing for Industry, Biomedicine, and Art 7, 1. DOI: 10.1186/s42492-024-00171-w. Online publication date: 5-Aug-2024.
  • (2024) Perceptual Quality Assessment of Face Video Compression: A Benchmark and An Effective Method. IEEE Transactions on Multimedia 26, 8596-8608. DOI: 10.1109/TMM.2024.3380260. Online publication date: 1-Jan-2024.
  • (2024) Blind Image Quality Assessment via Transformer Predicted Error Map and Perceptual Quality Token. IEEE Transactions on Multimedia 26, 4641-4651. DOI: 10.1109/TMM.2023.3325719. Online publication date: 1-Jan-2024.
  • (2024) SVGC-AVA: 360-Degree Video Saliency Prediction With Spherical Vector-Based Graph Convolution and Audio-Visual Attention. IEEE Transactions on Multimedia 26, 3061-3076. DOI: 10.1109/TMM.2023.3306596. Online publication date: 1-Jan-2024.
