Automatic Extraction of Conversation Flows from Human Dialogues: Understanding Their Impact to Refine NLP Models

  • Original Research
  • Published:
SN Computer Science

Abstract

In recent years, extensive state-of-the-art research has addressed natural language processing (NLP) problems, including improved models for text generation and text comprehension. These solutions are deeply data dependent, since the models require large amounts of high-quality data, and the scarcity of data in a particular language severely restricts the number of available datasets. This investigation proposes a methodology for creating conversational datasets (MCCD), designed to extract multi-turn, multi-user conversational datasets. MCCD obtains data from existing sources and identifies multiple answers to the same message to create conversation flows in the extracted datasets. This yields larger datasets suited to question answering (QA) for open-domain conversational agents. In addition, this article presents a tool based on MCCD to assist future researchers and applications. We applied our software tool to extract two human conversation datasets, and we evaluated the methodology and the resulting datasets by training a Portuguese NLP model. The resulting models, explored in a classification task, obtained better results than state-of-the-art models.
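
The abstract's central mechanism, turning forum-style reply trees into separate conversation flows whenever a message receives multiple answers, can be illustrated with a short sketch. This is a minimal illustration of the idea under stated assumptions, not the authors' implementation: the Message record and its field names are hypothetical, and a real extractor would also handle quoting, pagination, and text cleaning.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical message record; field names are illustrative, not MCCD's actual schema.
@dataclass
class Message:
    msg_id: str
    user: str
    text: str
    reply_to: Optional[str] = None  # None marks a thread-opening message


def extract_flows(messages: List[Message]) -> List[List[Message]]:
    """Enumerate conversation flows as root-to-leaf reply chains.

    When several messages answer the same parent, each answer starts its
    own branch, so one forum thread can yield several multi-turn,
    multi-user conversations.
    """
    children: Dict[str, List[Message]] = {}
    roots: List[Message] = []
    for m in messages:
        if m.reply_to is None:
            roots.append(m)
        else:
            children.setdefault(m.reply_to, []).append(m)

    flows: List[List[Message]] = []

    def walk(node: Message, path: List[Message]) -> None:
        path = path + [node]
        replies = children.get(node.msg_id, [])
        if not replies:              # leaf: one complete flow
            flows.append(path)
        for reply in replies:        # one branch per answer to the same message
            walk(reply, path)

    for root in roots:
        walk(root, [])
    return flows


# A question answered by two different users produces two separate flows.
thread = [
    Message("1", "ana", "Which GPU should I buy for deep learning?"),
    Message("2", "bruno", "A used RTX 3060 is enough to start.", reply_to="1"),
    Message("3", "carla", "Wait for the next generation instead.", reply_to="1"),
    Message("4", "ana", "Thanks, I will look for a used one.", reply_to="2"),
]
for flow in extract_flows(thread):
    print(" -> ".join(f"{m.user}: {m.text}" for m in flow))
```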


Data Availability

This point is covered by the two datasets made publicly available for this publication.

Notes

  1. https://huggingface.co/datasets.

  2. https://xenforo.com/.

  3. https://forum.adrenaline.com.br.

  4. https://forum.outerspace.com.br.


Author information


Corresponding author

Correspondence to Matheus Ferraroni Sanches.

Ethics declarations

Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances on Enterprise Information Systems” guest edited by Michal Smialek, Slimane Hammoudi, Alexander Brodsky and Joaquim Filipe.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sanches, M.F., de Sá, J.M.C., de Souza, R.R. et al. Automatic Extraction of Conversation Flows from Human Dialogues: Understanding Their Impact to Refine NLP Models. SN COMPUT. SCI. 4, 706 (2023). https://doi.org/10.1007/s42979-023-02148-7


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s42979-023-02148-7
