Abstract
In recent years, extensive state-of-the-art research has been conducted on natural language processing (NLP) problems, including improved models for text generation and text comprehension. These solutions are deeply data dependent, as models require large amounts of high-quality data. The scarcity of data in a particular language severely restricts the number of available datasets. This investigation proposes a methodology for creating conversational datasets (MCCD), designed to extract multi-turn, multi-user conversational datasets. MCCD obtains data from existing sources and identifies multiple answers to the same message to build conversation flows within the extracted datasets. MCCD produces larger datasets suited to question answering (QA) for open-domain conversational agents. In addition, this article proposes a tool based on MCCD to assist future researchers and applications. We applied our software tool to extract two human conversation datasets. We evaluated our methodology and the resulting datasets by training a Portuguese NLP model. We explored the resulting models in a classification task, obtaining better results than state-of-the-art models.
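The core idea in the abstract, that several replies to the same message each branch into their own conversation flow, can be illustrated with a minimal sketch. This is an assumption-laden illustration, not MCCD's actual implementation: the function name, the `(msg_id, parent_id, text)` record layout, and the sample thread are all hypothetical, chosen only to show how reply branches become separate root-to-leaf flows.

```python
# Hypothetical sketch of the flow-extraction idea: each reply to a
# message forks a separate conversation flow (a root-to-leaf path).
# The data layout and names are illustrative, not MCCD's actual API.

def extract_flows(messages):
    """messages: list of (msg_id, parent_id, text); parent_id is None for roots."""
    children, by_id, roots = {}, {}, []
    for msg_id, parent_id, text in messages:
        by_id[msg_id] = text
        if parent_id is None:
            roots.append(msg_id)
        else:
            children.setdefault(parent_id, []).append(msg_id)

    flows = []

    def walk(msg_id, path):
        path = path + [by_id[msg_id]]
        replies = children.get(msg_id, [])
        if not replies:          # leaf: one complete conversation flow
            flows.append(path)
        for reply in replies:    # each reply branches into its own flow
            walk(reply, path)

    for root in roots:
        walk(root, [])
    return flows

thread = [
    (1, None, "How do I update my system?"),
    (2, 1, "Run apt upgrade."),
    (3, 1, "Use the software updater GUI."),
    (4, 2, "Thanks, that worked!"),
]
flows = extract_flows(thread)
# Replies 2 and 3 both answer message 1, so two flows are produced.
```

In this toy thread, the two answers to the root question yield two multi-turn flows, which is how a single source thread can expand into several training conversations.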
Data Availability
The two datasets covered by this publication have been made publicly available.
Ethics declarations
Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
This article is part of the topical collection “Advances on Enterprise Information Systems” guest edited by Michal Smialek, Slimane Hammoudi, Alexander Brodsky and Joaquim Filipe.
Cite this article
Sanches, M.F., de Sá, J.M.C., de Souza, R.R. et al. Automatic Extraction of Conversation Flows from Human Dialogues: Understanding Their Impact to Refine NLP Models. SN COMPUT. SCI. 4, 706 (2023). https://doi.org/10.1007/s42979-023-02148-7