Research article
DOI: 10.1145/3340555.3353730

Multitask Prediction of Exchange-level Annotations for Multimodal Dialogue Systems

Published: 14 October 2019

Abstract

This paper presents multimodal computational models of three labels, each annotated independently per exchange, to support an adaptive dialogue strategy for spoken dialogue systems that recognize user sentiment through multimodal signal processing. The three labels are (1) the user's interest in the current topic, (2) the user's sentiment, and (3) topic continuance, denoting whether the system should continue the current topic or change it. Predicting these three labels, which capture different aspects of the user's sentiment level and of the system's next action, contributes to adapting the dialogue strategy to the user's sentiment. To this end, we enhanced a shared multimodal dialogue corpus by annotating impressed sentiment labels and topic continuance labels. Using this corpus, we developed a multimodal prediction model for the three labels. A multitask learning technique is applied to the three binary classification tasks to exploit the partial similarities among them; thanks to this framework, the prediction model was trained efficiently even on a small data set (fewer than 2,000 samples). Experimental results show that a multitask deep neural network (DNN) trained on multimodal features, including linguistic, facial expression, body and head motion, and acoustic features, outperformed single-task DNNs by up to 1.6 points.
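The paper itself publishes no code; as a rough illustration of the multitask setup the abstract describes, below is a minimal PyTorch sketch: a shared trunk over concatenated multimodal feature vectors with three binary heads, one per exchange-level label. All names and dimensions (MultitaskDNN, in_dim=512, hidden sizes, dropout rate) are hypothetical placeholders, not the authors' architecture.

```python
# Minimal sketch (not the authors' code) of a multitask DNN with a shared
# trunk and three binary heads: interest, sentiment, topic continuance.
# Feature dimension and layer sizes are assumed for illustration only.
import torch
import torch.nn as nn

class MultitaskDNN(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 128, p_drop: float = 0.5):
        super().__init__()
        # Shared layers: the three related tasks regularize each other,
        # which is what makes training feasible on <2,000 samples.
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
        )
        # One binary logit per exchange-level label.
        self.heads = nn.ModuleDict({
            "interest": nn.Linear(hidden, 1),
            "sentiment": nn.Linear(hidden, 1),
            "topic_continuance": nn.Linear(hidden, 1),
        })

    def forward(self, x: torch.Tensor) -> dict:
        h = self.trunk(x)
        return {name: head(h).squeeze(-1) for name, head in self.heads.items()}

# Toy usage: x stands in for concatenated linguistic, facial, motion, and
# acoustic features (the 512-dim input is an assumption).
model = MultitaskDNN(in_dim=512)
x = torch.randn(8, 512)
labels = {k: torch.randint(0, 2, (8,)).float() for k in model.heads}
loss_fn = nn.BCEWithLogitsLoss()
logits = model(x)
loss = sum(loss_fn(logits[k], labels[k]) for k in model.heads)  # summed task losses
loss.backward()
```

A single summed loss over the three heads is the simplest multitask objective; per-task weighting is a natural variant when the tasks differ in difficulty or label balance.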




Published In

ICMI '19: 2019 International Conference on Multimodal Interaction
October 2019
601 pages
ISBN: 9781450368605
DOI: 10.1145/3340555
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMI '19

Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)


Article Metrics

  • Downloads (last 12 months): 23
  • Downloads (last 6 weeks): 3
Reflects downloads up to 16 Nov 2024


Cited By

  • (2024) Adaptive Interview Strategy Based on Interviewees’ Speaking Willingness Recognition for Interview Robots. IEEE Transactions on Affective Computing 15(3), 942–957. DOI: 10.1109/TAFFC.2023.3309640. Online publication date: Jul-2024.
  • (2024) Empirical Analysis of Individual Differences Based on Sentiment Estimation Performance Toward Speaker Adaptation for Social Signal Processing. Social Computing and Social Media, 359–371. DOI: 10.1007/978-3-031-61281-7_26. Online publication date: 1-Jun-2024.
  • (2023) A multilayer perceptron-based model applied to histopathology image classification of lung adenocarcinoma subtypes. Frontiers in Oncology 13. DOI: 10.3389/fonc.2023.1172234. Online publication date: 18-May-2023.
  • (2023) Effects of Physiological Signals in Different Types of Multimodal Sentiment Estimation. IEEE Transactions on Affective Computing 14(3), 2443–2457. DOI: 10.1109/TAFFC.2022.3155604. Online publication date: 1-Jul-2023.
  • (2022) Multimodal Analysis for Communication Skill and Self-Efficacy Level Estimation in Job Interview Scenario. Proceedings of the 21st International Conference on Mobile and Ubiquitous Multimedia, 110–120. DOI: 10.1145/3568444.3568461. Online publication date: 27-Nov-2022.
  • (2022) Investigating the relationship between dialogue and exchange-level impression. Proceedings of the 2022 International Conference on Multimodal Interaction, 359–367. DOI: 10.1145/3536221.3556602. Online publication date: 7-Nov-2022.
  • (2021) Multimodal User Satisfaction Recognition for Non-task Oriented Dialogue Systems. Proceedings of the 2021 International Conference on Multimodal Interaction, 586–594. DOI: 10.1145/3462244.3479928. Online publication date: 18-Oct-2021.
  • (2021) Recognizing Social Signals with Weakly Supervised Multitask Learning for Multimodal Dialogue Systems. Proceedings of the 2021 International Conference on Multimodal Interaction, 141–149. DOI: 10.1145/3462244.3479927. Online publication date: 18-Oct-2021.
  • (2021) Multimodal Human-Agent Dialogue Corpus with Annotations at Utterance and Dialogue Levels. 2021 9th International Conference on Affective Computing and Intelligent Interaction (ACII), 1–8. DOI: 10.1109/ACII52823.2021.9597447. Online publication date: 28-Sep-2021.
  • (2020) Packing, Stacking, and Tracking: An Empirical Study of Online User Adaptation. Conversational Dialogue Systems for the Next Decade, 319–336. DOI: 10.1007/978-981-15-8395-7_24. Online publication date: 25-Oct-2020.
