Abstract
In recent years, extensive state-of-the-art research has been conducted on natural language processing (NLP) problems, including improved models for text generation and text comprehension. These solutions are deeply data dependent, as models require large amounts of high-quality data. The scarcity of data in a particular language severely restricts the number of available datasets. This investigation proposes a methodology for creating conversational datasets (MCCD), designed to extract multi-turn, multi-user conversational datasets. MCCD obtains data from existing sources and identifies multiple answers to the same message to build conversation flows within the extracted datasets. MCCD produces larger datasets suited to question answering (QA) for open-domain conversational agents. In addition, this article proposes a tool based on MCCD to assist future researchers and applications. We applied our software tool to extract two human conversation datasets. We evaluated our methodology and the resulting datasets by training a Portuguese NLP model. We explored the resulting models in a classification task, obtaining better results than state-of-the-art models.
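The core idea in the abstract, that several replies to the same message each branch into their own conversation flow, can be illustrated with a minimal sketch. This is an assumption-laden illustration, not MCCD's actual implementation: the function name, the `(msg_id, parent_id, text)` record layout, and the sample thread are all hypothetical, chosen only to show how reply branches become separate root-to-leaf flows.

```python
# Hypothetical sketch of the flow-extraction idea: each reply to a
# message forks a separate conversation flow (a root-to-leaf path).
# The data layout and names are illustrative, not MCCD's actual API.

def extract_flows(messages):
    """messages: list of (msg_id, parent_id, text); parent_id is None for roots."""
    children, by_id, roots = {}, {}, []
    for msg_id, parent_id, text in messages:
        by_id[msg_id] = text
        if parent_id is None:
            roots.append(msg_id)
        else:
            children.setdefault(parent_id, []).append(msg_id)

    flows = []

    def walk(msg_id, path):
        path = path + [by_id[msg_id]]
        replies = children.get(msg_id, [])
        if not replies:          # leaf: one complete conversation flow
            flows.append(path)
        for reply in replies:    # each reply branches into its own flow
            walk(reply, path)

    for root in roots:
        walk(root, [])
    return flows

thread = [
    (1, None, "How do I update my system?"),
    (2, 1, "Run apt upgrade."),
    (3, 1, "Use the software updater GUI."),
    (4, 2, "Thanks, that worked!"),
]
flows = extract_flows(thread)
# Replies 2 and 3 both answer message 1, so two flows are produced.
```

In this toy thread, the two answers to the root question yield two multi-turn flows, which is how a single source thread can expand into several training conversations.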
Data Availability
The two datasets covered by this publication have been made publicly available.
Ethics declarations
Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
This article is part of the topical collection “Advances on Enterprise Information Systems” guest edited by Michal Smialek, Slimane Hammoudi, Alexander Brodsky and Joaquim Filipe.
Cite this article
Sanches, M.F., de Sá, J.M.C., de Souza, R.R. et al. Automatic Extraction of Conversation Flows from Human Dialogues: Understanding Their Impact to Refine NLP Models. SN COMPUT. SCI. 4, 706 (2023). https://doi.org/10.1007/s42979-023-02148-7