Knowledge-aware Multimodal Dialogue Systems

Published: 15 October 2018
DOI: 10.1145/3240508.3240605

Abstract

By offering a natural way to seek information, multimodal dialogue systems are attracting increasing attention in domains such as retail and travel. However, most existing dialogue systems are limited to the textual modality, and cannot be easily extended to capture the rich semantics of the visual modality, such as product images. For example, in the fashion domain, the visual appearance of clothes and matching styles play a crucial role in understanding the user's intention. Without considering these, the dialogue agent may fail to generate desirable responses for users. In this paper, we present a Knowledge-aware Multimodal Dialogue (KMD) model to address this limitation of text-based dialogue systems. It gives special consideration to the semantics and domain knowledge revealed in visual content, and it features three key components. First, we build a taxonomy-based learning module to capture the fine-grained semantics in images (e.g., the category and attributes of a product). Second, we propose an end-to-end neural conversational model that generates responses based on the conversation history, visual semantics, and domain knowledge. Lastly, to avoid inconsistent dialogues, we adopt a deep reinforcement learning method that accounts for future rewards to optimize the neural conversational model. We perform extensive evaluation on a multi-turn task-oriented dialogue dataset in the fashion domain. Experimental results show that our method significantly outperforms state-of-the-art methods, demonstrating the efficacy of modeling the visual modality and domain knowledge for dialogue systems.
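
The abstract's third component optimizes the response generator with a deep reinforcement learning method that accounts for future rewards. Below is a minimal, hypothetical PyTorch sketch of a REINFORCE-style policy-gradient update of this general kind: a recurrent decoder is treated as a policy, a sampled response is scored by a dialogue-level reward, and the reward-weighted log-likelihood is used to update the generator. The names here (ToyDecoder, dialogue_reward, the dimensions) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a REINFORCE-style update for a response generator.
# Not the paper's code: module names, sizes, and the reward are placeholders.
import torch
import torch.nn as nn
from torch.distributions import Categorical

VOCAB, HIDDEN, MAX_LEN = 1000, 128, 12

class ToyDecoder(nn.Module):
    """GRU decoder mapping a context vector to a token distribution per step."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.gru = nn.GRUCell(HIDDEN, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, context):
        h = context                                   # (batch, HIDDEN) dialogue state
        tok = torch.zeros(context.size(0), dtype=torch.long)  # start token id 0
        log_probs, tokens = [], []
        for _ in range(MAX_LEN):
            h = self.gru(self.embed(tok), h)
            dist = Categorical(logits=self.out(h))
            tok = dist.sample()                       # sample next token (exploration)
            log_probs.append(dist.log_prob(tok))
            tokens.append(tok)
        return torch.stack(tokens, 1), torch.stack(log_probs, 1)

def dialogue_reward(responses):
    # Placeholder: in the paper the reward reflects dialogue-level quality
    # over future turns; here it is random just to make the sketch runnable.
    return torch.rand(responses.size(0))

decoder = ToyDecoder()
opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)

context = torch.randn(4, HIDDEN)                      # stand-in for encoded history
tokens, log_probs = decoder(context)
reward = dialogue_reward(tokens)
baseline = reward.mean()                              # variance-reducing baseline
loss = -((reward - baseline).detach().unsqueeze(1) * log_probs).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Subtracting a mean-reward baseline is a standard variance-reduction choice for REINFORCE; the paper's actual reward design and optimization details are given in the full text.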

Published In

MM '18: Proceedings of the 26th ACM international conference on Multimedia
October 2018
2167 pages
ISBN: 9781450356657
DOI: 10.1145/3240508
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. domain knowledge
  2. fashion
  3. multimodal dialogue

Qualifiers

  • Research-article

Conference

MM '18: ACM Multimedia Conference
October 22 - 26, 2018
Seoul, Republic of Korea

Acceptance Rates

MM '18 Paper Acceptance Rate 209 of 757 submissions, 28%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months): 107
  • Downloads (Last 6 weeks): 20
Reflects downloads up to 24 Nov 2024

Cited By

  • (2024) Self-Adaptive Fine-grained Multi-modal Data Augmentation for Semi-supervised Muti-modal Coreference Resolution. Proceedings of the 32nd ACM International Conference on Multimedia, 8576-8585. DOI: 10.1145/3664647.3680966. Online publication date: 28-Oct-2024.
  • (2024) Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements. ACM Transactions on Intelligent Systems and Technology 15(3), 1-25. DOI: 10.1145/3645099. Online publication date: 12-Mar-2024.
  • (2024) Marrying Dialogue Systems with Data Visualization: Interactive Data Visualization Generation from Natural Language Conversations. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2733-2744. DOI: 10.1145/3637528.3671935. Online publication date: 25-Aug-2024.
  • (2024) Asking Multimodal Clarifying Questions in Mixed-Initiative Conversational Search. Proceedings of the ACM Web Conference 2024, 1474-1485. DOI: 10.1145/3589334.3645483. Online publication date: 13-May-2024.
  • (2024) MulmQA: Multimodal Question Answering for Database Alarm. 2024 5th Information Communication Technologies Conference (ICTC), 291-296. DOI: 10.1109/ICTC61510.2024.10602092. Online publication date: 10-May-2024.
  • (2024) Hierarchical matching and reasoning for multi-query image retrieval. Neural Networks 173, 106200. DOI: 10.1016/j.neunet.2024.106200. Online publication date: May-2024.
  • (2024) Construction of Multimodal Dialog System via Knowledge Graph in Travel Domain. Web and Big Data, 422-437. DOI: 10.1007/978-981-97-2421-5_28. Online publication date: 12-May-2024.
  • (2024) Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval. Computer Vision – ECCV 2024, 447-464. DOI: 10.1007/978-3-031-72986-7_26. Online publication date: 2-Nov-2024.
  • (2023) Natural-Language-Driven Multimodal Representation Learning for Audio-Visual Scene-Aware Dialog System. Sensors 23(18), 7875. DOI: 10.3390/s23187875. Online publication date: 14-Sep-2023.
  • (2023) Content Based Deep Learning Image Retrieval: A Survey. Proceedings of the 2023 9th International Conference on Communication and Information Processing, 158-163. DOI: 10.1145/3638884.3638908. Online publication date: 14-Dec-2023.
