Overview of the NLPCC 2019 Shared Task: Open Domain Conversation Evaluation

Abstract

This paper presents an overview of the Open Domain Conversation Evaluation task at NLPCC 2019. The evaluation consists of two sub-tasks: single-turn conversation and multi-turn conversation. Each reply is judged along four to five dimensions, ranging from syntax and content to deep semantics. We describe the problem definition, evaluation metrics, scoring strategy, and datasets in detail. We built our dataset from commercial chatbot logs and the public Internet; it covers 16 topical domains and two non-topical domains. We prepared to have all the data judged by human annotators; however, no teams submitted their systems, which may be due to the complexity of building such conversation systems. Our baseline system achieves a single-turn score of 55 out of 100 and a multi-turn score of 292 out of 400, which indicates that the system behaves more like an answering system than a chatting system. We expect more participation in succeeding years.
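As a rough illustration of the scoring scales mentioned above, the following minimal Python sketch shows one way such scores could be aggregated. The dimension names, the 0/1 judgment scale, and the aggregation rules are illustrative assumptions only; the abstract reports just the 0–100 single-turn and 0–400 multi-turn ranges, not the official formula.

# Hypothetical scoring sketch. The dimensions, the 0/1 judgment scale,
# and the aggregation below are illustrative assumptions, not the
# official NLPCC 2019 rules; only the 0-100 single-turn and 0-400
# multi-turn ranges come from the paper.
from typing import Dict, List

# Assumed judgment dimensions (the task uses four to five per reply).
DIMENSIONS = ["grammaticality", "relevance", "informativeness", "coherence"]

def reply_score(judgments: Dict[str, int]) -> float:
    """Fraction of dimensions judged acceptable for one reply (0/1 each)."""
    return sum(judgments[d] for d in DIMENSIONS) / len(DIMENSIONS)

def single_turn_score(all_judgments: List[Dict[str, int]]) -> float:
    """Mean reply score over the test set, rescaled to the reported 0-100 range."""
    return 100.0 * sum(reply_score(j) for j in all_judgments) / len(all_judgments)

def multi_turn_score(dimension_scores: List[float]) -> float:
    """One plausible reading of the 0-400 range: four per-dimension
    session scores, each on 0-100, summed."""
    assert len(dimension_scores) == 4
    return sum(dimension_scores)

replies = [
    {"grammaticality": 1, "relevance": 1, "informativeness": 0, "coherence": 1},
    {"grammaticality": 1, "relevance": 0, "informativeness": 0, "coherence": 1},
]
print(single_turn_score(replies))          # 62.5 on the 0-100 scale
print(multi_turn_score([80, 70, 72, 70]))  # 292.0 on the 0-400 scale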



Published In

Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9–14, 2019, Proceedings, Part II
Oct 2019
875 pages
ISBN: 978-3-030-32235-9
DOI: 10.1007/978-3-030-32236-6

Publisher

Springer-Verlag
Berlin, Heidelberg

Publication History

Published: 09 October 2019
DOI: 10.1007/978-3-030-32236-6_76

Author Tags

1. Chatbot
2. Conversation systems
3. Conversation evaluation

Qualifiers

• Article
