Neural knowledge extraction from cloud service incidents

Authors: Nachiappan Nagappan, Thomas Zimmermann

ICSE-SEIP '21: Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice

Pages 218 - 227

https://doi.org/10.1109/ICSE-SEIP52600.2021.00031

Published: 17 December 2021

Abstract

The move from boxed products to services and the widespread adoption of cloud computing have had a huge impact on the software development life cycle and DevOps processes. In particular, incident management has become critical for developing and operating large-scale services. Prior work on incident management has focused heavily on the challenges of incident triaging and de-duplication. In this work, we address the fundamental problem of structured knowledge extraction from service incidents. We have built SoftNER, a framework for unsupervised knowledge extraction from service incidents. We frame knowledge extraction as a Named-Entity Recognition (NER) task for extracting factual information. SoftNER leverages structural patterns, such as key-value pairs and tables, to bootstrap the training data. Further, we build a novel multi-task learning based BiLSTM-CRF model that leverages not just the semantic context but also the data types for named-entity extraction. We have deployed SoftNER at Microsoft, a major cloud service provider, and have evaluated it on more than 2 months of cloud incidents. We show that the unsupervised machine learning pipeline has a high precision of 0.96. Our multi-task learning based deep learning model also outperforms the state-of-the-art NER models. Lastly, using the knowledge extracted by SoftNER, we are able to build significantly more accurate models for important downstream tasks like incident triaging.
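To make the bootstrapping idea in the abstract concrete, the following is a minimal Python sketch of how key-value pairs in an incident description could be turned into weakly labeled NER training data. It is an illustration under stated assumptions, not SoftNER's implementation: the regular expression, the function names `extract_key_values` and `bio_tag`, the BIO tagging scheme, and the sample incident text are all hypothetical and do not come from the paper.

```python
import re

# Hypothetical pattern for "Key: value" lines; the key is a short alphabetic phrase.
KEY_VALUE = re.compile(r"^\s*([A-Za-z][A-Za-z ]{0,30}?)\s*:\s*(\S.*?)\s*$")


def extract_key_values(text):
    """Return (entity_type, value) pairs found on 'Key: value' lines."""
    pairs = []
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if match:
            key, value = match.groups()
            # Normalize the key into an entity type name, e.g. "problem_type".
            entity_type = key.strip().lower().replace(" ", "_")
            pairs.append((entity_type, value))
    return pairs


def bio_tag(sentence, pairs):
    """Weakly label whitespace tokens with BIO tags by exact value match."""
    tokens = sentence.split()
    tags = ["O"] * len(tokens)
    for entity_type, value in pairs:
        value_tokens = value.split()
        n = len(value_tokens)
        for i in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[i:i + n])
            if tokens[i:i + n] == value_tokens and window_free:
                tags[i] = "B-" + entity_type
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + entity_type
    return list(zip(tokens, tags))


# Made-up incident text: two key-value lines followed by free-form prose.
incident = """Problem Type: VM not found
Subscription Id: 1234-abcd
The deployment failed because VM not found for subscription 1234-abcd today"""

pairs = extract_key_values(incident)
labeled = bio_tag(incident.splitlines()[-1], pairs)
for token, tag in labeled:
    print(token, tag)
```

In this sketch the structured lines supply the entity types and example values, and the free-form sentence is labeled by matching those values. A sequence model such as a BiLSTM-CRF could then be trained on the resulting token and tag pairs.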

Published In

ICSE-SEIP '21: Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice

May 2021

405 pages

ISBN: 9780738146690

Conference Chairs: Sigrid Eldh (Ericsson, Sweden), Davide Falessi (California Polytechnic State University)

Sponsors: SIGSOFT: ACM Special Interest Group on Software Engineering

In-Cooperation: IEEE CS

Publisher: IEEE Press

Qualifiers: Research-article

Conference

ICSE '21: 43rd International Conference on Software Engineering

May 25 - 28, 2021

Virtual Event, Spain

Sponsor: SIGSOFT
