DOI: 10.1145/3462244.3479954
Short Paper

Predicting Gaze from Egocentric Social Interaction Videos and IMU Data

Published: 18 October 2021

Abstract

Gaze prediction in egocentric videos is a fairly new research topic with potential applications in assistive technology (e.g., supporting blind people in their daily interactions), security (e.g., attention tracking in risky work environments), education (e.g., augmented/mixed reality training simulators, immersive games), and so forth. Egocentric gaze is typically estimated from video alone, while only a few works attempt to use inertial measurement unit (IMU) data, a sensor modality often available in wearable devices (e.g., augmented reality headsets). In this paper, we instead examine whether joint learning of egocentric video and the corresponding IMU data can improve first-person gaze prediction compared to using these modalities separately. To this end, we propose a multimodal network and evaluate it on several unconstrained social interaction scenarios captured from a first-person perspective. The proposed multimodal network achieves better results than unimodal methods as well as several multimodal baselines, showing that using egocentric video together with IMU data can boost first-person gaze estimation performance.
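The abstract does not spell out the network design. As an illustration only, the sketch below shows one plausible way to fuse an egocentric-video encoder with an IMU encoder for 2D gaze regression in PyTorch; all module choices, feature dimensions, and the late-fusion-by-concatenation strategy are assumptions made for this sketch, not the architecture proposed in the paper.

# Hypothetical sketch: late fusion of video and IMU features for 2D gaze regression.
# Encoder choices, feature sizes, and the fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn

class GazeFusionNet(nn.Module):
    def __init__(self, video_feat_dim=1024, imu_channels=6, imu_feat_dim=128):
        super().__init__()
        # Placeholder video branch: in practice this could be a pretrained
        # spatio-temporal backbone producing a clip-level feature.
        self.video_branch = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, video_feat_dim),
        )
        # IMU branch: a small 1D CNN over accelerometer/gyroscope time series.
        self.imu_branch = nn.Sequential(
            nn.Conv1d(imu_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(64, imu_feat_dim),
        )
        # Late fusion by concatenation, followed by a regressor to (x, y) gaze.
        self.regressor = nn.Sequential(
            nn.Linear(video_feat_dim + imu_feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, video, imu):
        # video: (B, 3, T, H, W) clip; imu: (B, imu_channels, T_imu) time series.
        v = self.video_branch(video)
        m = self.imu_branch(imu)
        return self.regressor(torch.cat([v, m], dim=1))

# Example forward pass with dummy tensors.
model = GazeFusionNet()
gaze = model(torch.randn(2, 3, 8, 112, 112), torch.randn(2, 6, 100))
print(gaze.shape)  # torch.Size([2, 2])

A real system would likely replace the toy video branch with a pretrained action-recognition backbone and train the regressor on gaze annotations from a head-mounted eye tracker, but those details depend on the paper's actual setup.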

Supplementary Material

Presentation slides (p717-presentation.pptx)
MP4 File (ICMI21-sp1225.mp4)
Egocentric vision is a fairly new research domain that provides the details necessary to understand human perception and interaction abilities. Egocentric gaze is typically estimated using visual models, although some works have tried to estimate gaze from sensor data, essentially IMUs. In this work, we combine both modalities to examine whether joint learning of egocentric video and the corresponding IMU data can improve first-person gaze prediction compared to using these modalities separately. The proposed approach performs better than unimodal methods on an in-the-wild dataset collected with a Tobii Pro Glasses 3 eye tracker for various scenarios.
MP4 File (p717-ICMI21-sp1225.mp4)
Video presentation

Published In

ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction
October 2021
876 pages
ISBN: 9781450384810
DOI: 10.1145/3462244
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 October 2021

Author Tags

  1. IMU
  2. egocentric video
  3. gaze
  4. multimodal
  5. social interactions

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

ICMI '21: International Conference on Multimodal Interaction
October 18-22, 2021
Montréal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Article Metrics

  • Downloads (last 12 months): 85
  • Downloads (last 6 weeks): 12
Reflects downloads up to 25 Nov 2024

Cited By

  • (2024) GESCAM: A Dataset and Method on Gaze Estimation for Classroom Attention Measurement. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 636-645. https://doi.org/10.1109/CVPRW63382.2024.00068. Online publication date: 17-Jun-2024.
  • (2024) An Outlook into the Future of Egocentric Vision. International Journal of Computer Vision 132(11), 4880-4936. https://doi.org/10.1007/s11263-024-02095-7. Online publication date: 28-May-2024.
  • (2023) Gaze Target Detection Based on Predictive Consistency Embedding. Journal of Image and Signal Processing 12(2), 144-157. https://doi.org/10.12677/JISP.2023.122015. Online publication date: 2023.
  • (2023) ViWise: Fusing Visual and Wireless Sensing Data for Trajectory Relationship Recognition. ACM Transactions on Internet of Things 4(4), 1-29. https://doi.org/10.1145/3614441. Online publication date: 22-Nov-2023.
  • (2023) Egocentric Auditory Attention Localization in Conversations. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14663-14674. https://doi.org/10.1109/CVPR52729.2023.01409. Online publication date: Jun-2023.
  • (2023) Head Mounted IMU-Based Driver's Gaze Zone Estimation Using Machine Learning Algorithm. International Journal of Human–Computer Interaction 40(23), 7970-7981. https://doi.org/10.1080/10447318.2023.2276520. Online publication date: 8-Nov-2023.
  • (2023) MECCANO. Computer Vision and Image Understanding 235(C). https://doi.org/10.1016/j.cviu.2023.103764. Online publication date: 1-Oct-2023.
  • (2023) In the Eye of Transformer: Global–Local Correlation for Egocentric Gaze Estimation and Beyond. International Journal of Computer Vision 132(3), 854-871. https://doi.org/10.1007/s11263-023-01879-7. Online publication date: 18-Oct-2023.