DOI: 10.1145/3368089.3417061
research-article

Efficient customer incident triage via linking with system incidents

Published: 08 November 2020 Publication History

Abstract

In cloud service systems, customers report the service issues they encounter to cloud service providers. Although many issues can be handled by the support team, some customer issues cannot be easily solved and are therefore raised as customer incidents. Quick troubleshooting of a customer incident is critical. To this end, a customer incident should be assigned to its responsible team accurately and in a timely manner.
Our industrial experience shows that linking customer incidents with detected system incidents can help customer incident triage. In particular, our empirical study on 7 real cloud service systems shows that with additional information about the system incidents (i.e., incident reports generated by system monitors), the triage of customer incidents can be accelerated 13.1× on average. Based on this observation, we propose LinkCM, a learning-based approach that automatically links customer incidents to monitor-reported system incidents. LinkCM incorporates a novel learning-based model that effectively extracts related information from the two sources, together with a transfer learning strategy that helps LinkCM achieve better performance without a huge amount of data. The experimental results indicate that LinkCM achieves accurate link prediction. Furthermore, case studies demonstrate how LinkCM can help the customer incident triage procedure in real production cloud service systems.
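
To make the linking task concrete, the following is a minimal sketch of what "linking" a customer incident to candidate system incidents could look like. It is not LinkCM's learning-based model, which the abstract does not specify in detail: it swaps in a plain bag-of-words cosine similarity as a stand-in. The function names, incident IDs, and incident texts are all hypothetical and were invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_system_incidents(customer_text, system_incidents):
    """Return (incident_id, score) pairs sorted by descending similarity."""
    query = Counter(tokenize(customer_text))
    scored = [
        (incident_id, cosine(query, Counter(tokenize(text))))
        for incident_id, text in system_incidents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical example data, not taken from the paper.
system_incidents = {
    "SI-1": "storage cluster latency high disk io timeout",
    "SI-2": "authentication service token refresh failure",
}
ranked = rank_system_incidents(
    "customer cannot log in, token refresh keeps failing", system_incidents
)
print(ranked[0][0])  # ID of the most similar system incident
```

In this toy setting the top-ranked system incident would suggest the team to which the customer incident is routed. A learned model, as proposed in the paper, replaces the hand-written similarity with one trained on historical links.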

Supplementary Material

Auxiliary Teaser Video (fse20ind-p90-p-teaser.mp4)
This is a presentation video of my talk at FSE 2020 on our paper accepted in the industry track. In this paper, we first conduct an empirical study on customer incident triage in production cloud service systems, and the results indicate that linking customer incidents with system incidents is a promising way to improve the efficiency of customer incident triage. Then we present LinkCM, a tool to automatically link the customer incidents with the monitor reported system incidents. We use transfer learning to overcome the problem of insufficient training data, which can shed light on other cloud system maintenance tasks. We show that LinkCM is an effective tool via case studies on 7 production cloud service systems in Microsoft.
Auxiliary Presentation Video (fse20ind-p90-p-video.mp4)

Published In

ESEC/FSE 2020: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2020
1703 pages
ISBN: 9781450370431
DOI: 10.1145/3368089
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Cloud Service Systems
  2. Customer Issue Triage
  3. Transfer Learning

Qualifiers

  • Research-article

Conference

ESEC/FSE '20

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Article Metrics

  • Downloads (last 12 months): 38
  • Downloads (last 6 weeks): 2
Reflects downloads up to 22 Sep 2024

Cited By

  • (2024) FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, 392-404. DOI: 10.1145/3639477.3639754. Online publication date: 14-Apr-2024
  • (2024) Dependency Aware Incident Linking in Large Cloud Systems. Companion Proceedings of the ACM Web Conference 2024, 141-150. DOI: 10.1145/3589335.3648311. Online publication date: 13-May-2024
  • (2024) SOIL: Score Conditioned Diffusion Model for Imbalanced Cloud Failure Prediction. Companion Proceedings of the ACM Web Conference 2024, 65-72. DOI: 10.1145/3589335.3648303. Online publication date: 13-May-2024
  • (2023) Assess and Summarize: Improve Outage Understanding with Large Language Models. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1657-1668. DOI: 10.1145/3611643.3613891. Online publication date: 30-Nov-2023
  • (2023) Incident-aware Duplicate Ticket Aggregation for Cloud Systems. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2299-2311. DOI: 10.1109/ICSE48619.2023.00193. Online publication date: May-2023
  • (2023) Prism: Revealing Hidden Functional Clusters from Massive Instances in Cloud Systems. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 268-280. DOI: 10.1109/ASE56229.2023.00077. Online publication date: 11-Sep-2023
  • (2022) An Intelligent Framework for Timely, Accurate, and Comprehensive Cloud Incident Detection. ACM SIGOPS Operating Systems Review 56:1, 1-7. DOI: 10.1145/3544497.3544499. Online publication date: 14-Jun-2022
  • (2022) Muffin. Proceedings of the 44th International Conference on Software Engineering, 1418-1430. DOI: 10.1145/3510003.3510092. Online publication date: 21-May-2022
  • (2022) Using Screenshot Attachments in Issue Reports for Triaging. Empirical Software Engineering 27:7. DOI: 10.1007/s10664-022-10228-0. Online publication date: 1-Dec-2022
  • (2021) NTAM: Neighborhood-Temporal Attention Model for Disk Failure Prediction in Cloud Platforms. Proceedings of the Web Conference 2021, 1181-1191. DOI: 10.1145/3442381.3449867. Online publication date: 19-Apr-2021
