In cloud service systems, customers report the service issues they encounter to cloud service providers. Although many issues can be handled by the support team, some customer issues cannot be easily solved and are thus raised as customer incidents. Quick troubleshooting of a customer incident is critical. To this end, a customer incident should be assigned to its responsible team accurately and in a timely manner.

Our industrial experience shows that linking customer incidents with detected system incidents can help customer incident triage. In particular, our empirical study on 7 real cloud service systems shows that with additional information about the system incidents (i.e., incident reports generated by system monitors), the triage of customer incidents can be accelerated 13.1× on average. Based on this observation, we propose LinkCM, a learning-based approach that automatically links customer incidents to monitor-reported system incidents. LinkCM incorporates a novel learning-based model that effectively extracts related information from the two sources, together with a transfer learning strategy that helps LinkCM achieve better performance without a huge amount of data. The experimental results indicate that LinkCM achieves accurate link prediction. Furthermore, case studies demonstrate how LinkCM can help the customer incident triage procedure in real production cloud service systems.

Supplementary Material

Auxiliary Teaser Video (fse20ind-p90-p-teaser.mp4)

This is a presentation video of my talk at FSE 2020 on our paper accepted in the industry track. In this paper, we first conduct an empirical study on customer incident triage in production cloud service systems; the results indicate that linking customer incidents with system incidents is a promising way to improve the efficiency of customer incident triage. We then present LinkCM, a tool that automatically links customer incidents with monitor-reported system incidents. We use transfer learning to overcome the problem of insufficient training data, which can shed light on other cloud system maintenance tasks. We show that LinkCM is an effective tool via case studies on 7 production cloud service systems at Microsoft.


Auxiliary Presentation Video (fse20ind-p90-p-video.mp4)


