Text2EL+: Expert Guided Event Log Enrichment Using Unstructured Text

Published: 06 March 2024

Abstract

Through the application of process mining, business processes can be improved on the basis of process execution data captured in event logs. Naturally, the quality of this data determines the quality of the improvement recommendations. Improving data quality is non-trivial, and there is great potential to exploit unstructured text, e.g., from notes, reviews, and comments, for this purpose and to enrich event logs. To this end, this article introduces Text2EL+, a three-phase approach to enrich event logs using unstructured text. In its first phase, events and (case and event) attributes are derived from unstructured text linked to organisational processes. In its second phase, these events and attributes undergo a semantic and contextual validation before their incorporation in the event log. In its third and final phase, recognising the importance of human domain expertise, expert guidance is used to further improve data quality by removing redundant and irrelevant events. Expert input is used to train a Named Entity Recognition (NER) model with customised tags to detect event log elements. The approach applies natural language processing techniques, sentence embeddings, training pipelines and models, as well as contextual and expression validation. Various unstructured clinical notes associated with a healthcare case study were analysed, and completeness, concordance, and correctness of the derived event log elements were evaluated through experiments. The results show that the proposed method is feasible and applicable.
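To make the three-phase flow concrete, the following is a minimal Python sketch of the overall shape of such a pipeline. It is not the authors' implementation: a regular expression stands in for the NLP-based extraction, exact matching and a time-window check stand in for the semantic and contextual validation, and a hand-written set of labels stands in for expert guidance and the trained NER model. The note format, the `Event` class, and all function names are assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: one row of an event log."""
    case_id: str
    activity: str
    timestamp: datetime


# Assumed note format: "<YYYY-MM-DD HH:MM> <activity>" clauses separated by ";".
NOTE_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2})\s+([^;.]+)")


def extract_candidates(case_id, note):
    """Phase 1 stand-in: pull (timestamp, activity) pairs out of free text."""
    return [
        Event(case_id, activity.strip().lower(),
              datetime.strptime(ts, "%Y-%m-%d %H:%M"))
        for ts, activity in NOTE_PATTERN.findall(note)
    ]


def validate(candidates, log, case_start, case_end):
    """Phase 2 stand-in: keep candidates that fall inside the case window
    and are not already present in the log."""
    existing = set(log)
    return [
        e for e in candidates
        if case_start <= e.timestamp <= case_end and e not in existing
    ]


def expert_filter(candidates, irrelevant_activities):
    """Phase 3 stand-in: drop activities an expert flagged as irrelevant."""
    return [e for e in candidates if e.activity not in irrelevant_activities]


def enrich(log, case_id, note, case_start, case_end, irrelevant_activities):
    """Run the three stand-in phases and return the enriched, time-ordered log."""
    candidates = extract_candidates(case_id, note)
    candidates = validate(candidates, log, case_start, case_end)
    candidates = expert_filter(candidates, irrelevant_activities)
    return sorted(log + candidates, key=lambda e: e.timestamp)


if __name__ == "__main__":
    log = [Event("c1", "admission", datetime(2024, 1, 5, 8, 0))]
    note = ("2024-01-05 09:30 Chest X-ray ordered; "
            "2024-01-05 10:15 family visited; "
            "2024-01-05 08:00 Admission.")
    enriched = enrich(log, "c1", note,
                      datetime(2024, 1, 5), datetime(2024, 1, 6),
                      irrelevant_activities={"family visited"})
    for event in enriched:
        print(event.timestamp, event.activity)
```

In this toy run, the admission event is rejected in the validation step because it is already logged, the family visit is removed by the expert filter, and only the X-ray order is added to the log. A realistic version would replace each stand-in with the techniques the abstract names, such as sentence embeddings for the validation step.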

Published In

Journal of Data and Information Quality, Volume 16, Issue 1
March 2024
187 pages
EISSN:1936-1963
DOI:10.1145/3613486

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 March 2024
Online AM: 10 January 2024
Accepted: 31 December 2023
Revised: 30 October 2023
Received: 03 December 2022
Published in JDIQ Volume 16, Issue 1

Author Tags

  1. Event data quality
  2. process mining
  3. event log
  4. unstructured text
  5. natural language processing
  6. semantic validation

Qualifiers

  • Research-article
