A Survey of AIOps Methods for Failure Management

Authors:

Michael Gerndt

ACM Transactions on Intelligent Systems and Technology (TIST), Volume 12, Issue 6

Article No.: 81, Pages 1 - 45

https://doi.org/10.1145/3483424

Published: 30 November 2021

Abstract

Modern society is increasingly moving toward complex and distributed computing systems. The increase in scale and complexity of these systems challenges O&M teams that perform daily monitoring and repair operations, in contrast with the increasing demand for reliability and scalability of modern applications. For this reason, the study of automated and intelligent monitoring systems has recently sparked much interest across applied IT industry and academia. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges thanks to Machine Learning, AI, and Big Data. However, AIOps as a research topic is still largely unstructured and unexplored, due to missing conventions in categorizing contributions for their data requirements, target goals, and components. In this work, we focus on AIOps for Failure Management (FM), characterizing and describing 5 different categories and 14 subcategories of contributions, based on their time intervention window and the target problem being solved. We review 100 FM solutions, focusing on applicability requirements and the quantitative results achieved, to facilitate an effective application of AIOps solutions. Finally, we discuss current development problems in the areas covered by AIOps and delineate possible future trends for AI-based failure management.

References

[1]

Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. 2009. Spectrum-based multiple fault localization. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering. IEEE, 88–99. https://doi.org/10.1109/ASE.2009.25

Digital Library

[2]

Armen Aghasaryan, Eric Fabre, Albert Benveniste, Renée Boubour, and Claude Jard. 1998. Fault detection and diagnosis in distributed systems: An approach by partially stochastic petri nets. Discrete Event Dynam. Syst. 8, 2 (1998), 203–231. https://doi.org/10.1023/a:1008241818642

Digital Library

[3]

Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. 2003. Performance debugging for distributed systems of black boxes. ACM SIGOPS Operat. Syst. Rev. 37, 5 (Dec. 2003), 74–89. https://doi.org/10.1145/1165389.945454

Digital Library

[4]

Javier Alonso, Jordi Torres, Josep Ll. Berral, and Ricard Gavalda. 2010. Adaptive on-line software aging prediction based on machine learning. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems Networks (DSN’10). IEEE, 507–516. https://doi.org/10.1109/dsn.2010.5544275

[5]

J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J.-C. Fabre, J.-C. Laprie, E. Martins, and D. Powell. 1990. Fault injection for dependability validation: A methodology and some applications. IEEE Trans. Softw. Eng. 16, 2 (Feb. 1990), 166–182. https://doi.org/10.1109/32.44380

Digital Library

[6]

Mona Attariyan, Michael Chow, and Jason Flinn. 2012. X-Ray: Automating root-cause diagnosis of performance anomalies in production software. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 307–320. https://doi.org/10.5555/2387880.2387910

Digital Library

[7]

Julien Audibert, Pietro Michiardi, Frédéric Guyard, Sébastien Marti, and Maria A. Zuluaga. 2020. USAD: UnSupervised anomaly detection on multivariate time series. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’20). ACM, New York, NY, 3395–3404. https://doi.org/10.1145/3394486.3403392

Digital Library

[8]

Tom Auld, Andrew W. Moore, and Stephen F. Gull. 2007. Bayesian neural networks for internet traffic classification. IEEE Trans. Neural Netw. 18, 1 (Jan. 2007), 223–239. https://doi.org/10.1109/tnn.2006.883010

Digital Library

[9]

Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, and Ming Zhang. 2007. Towards highly reliable enterprise network services via inference of multi-level dependencies. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM’07). ACM, New York, NY, 13–24. https://doi.org/10.1145/1282380.1282383

Digital Library

[10]

Chetan Bansal, Sundararajan Renganathan, Ashima Asudani, Olivier Midy, and Mathru Janakiraman. 2020. DeCaf: Diagnosing and triaging performance issues in large-scale cloud services. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP’20). ACM, New York, NY, 201–210. https://doi.org/10.1145/3377813.3381353

Digital Library

[11]

Paul Barham, Rebecca Isaacs, Richard Mortier, and Dushyanth Narayanan. 2003. Magpie: Online modelling and performance-aware systems. In Proceedings of the 9th Conference on Hot Topics in Operating Systems, Vol. 9. USENIX Association, 15. https://doi.org/10.5555/1251054.1251069

Digital Library

[12]

Ivan Beschastnikh, Yuriy Brun, Michael D. Ernst, and Arvind Krishnamurthy. 2014. Inferring models of concurrent systems from logs of their behavior with CSight. In Proceedings of the 36th International Conference on Software Engineering (ICSE’14). ACM, New York, NY, 468–479. https://doi.org/10.1145/2568225.2568246

Digital Library

[13]

Netflix Technology Blog. 2016. Netflix Chaos Monkey Upgraded. Retrieved from https://netflixtechblog.com/netflix-chaos-monkey-upgraded-1d679429be5d.

[14]

Peter Bodik, Moises Goldszmidt, Armando Fox, Dawn B. Woodard, and Hans Andersen. 2010. Fingerprinting the datacenter: Automated classification of performance crises. In Proceedings of the 5th European Conference on Computer Systems (EuroSys’10). ACM, New York, NY, 111–124. https://doi.org/10.1145/1755913.1755926

Digital Library

[15]

A. T. Bouloutas, S. Calo, and A. Finkel. 1994. Alarm correlation and fault identification in communication networks. IEEE Trans. Commun. 42, 2/3/4 (2 1994), 523–533. https://doi.org/10.1109/tcomm.1994.577079

[16]

L. C. Briand, J. W. Daly, and J. K. Wust. 1999. A unified framework for coupling measurement in object-oriented systems. IEEE Trans. Softw. Eng. 25, 1 (1 1999), 91–121. https://doi.org/10.1109/32.748920

Digital Library

[17]

Broadcom. 2020. AIOps—Broadcom. Retrieved from https://www.broadcom.com/products/software/aiops.

[18]

Andy Brown, Aaron Tuor, Brian Hutchinson, and Nicole Nichols. 2018. Recurrent neural network attention mechanisms for interpretable system log anomaly detection. In Proceedings of the 1st Workshop on Machine Learning for Computing Systems (MLCS’18). ACM, New York, NY, Article 1, 8 pages. https://doi.org/10.1145/3217871.3217872

Digital Library

[19]

Lisa Burnell and Eric Horvitz. 1995. Structure and chance: Melding logic and probability for software debugging. Commun. ACM 38, 3 (Mar. 1995), 31–ff.https://doi.org/10.1145/203330.203338

Digital Library

[20]

K. L. Butler and J. A. Momoh. 1999. A neural net based approach for fault diagnosis in distribution networks. In Proceedings of the IEEE Power Engineering Society, Vol. 1. IEEE, 353–356. https://doi.org/10.1109/PESW.1999.747478

[21]

V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, and W. P. Zeggert. 2001. Proactive management of software aging. IBM J. Res. Dev. 45, 2 (Mar. 2001), 311–332. https://doi.org/10.1147/rd.452.0311

Digital Library

[22]

Raghavendra Chalapathy and Sanjay Chawla. 2019. Deep Learning for Anomaly Detection: A Survey. Retrieved from http://arxiv.org/abs/1901.03407.

[23]

Thanyalak Chalermarrewong, Tiranee Achalakul, and Simon Chong Wee See. 2012. Failure prediction of data centers using time series and fault tree analysis. In Proceedings of the IEEE 18th International Conference on Parallel and Distributed Systems. IEEE, 794–799. https://doi.org/10.1109/icpads.2012.129

Digital Library

[24]

Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. Comput. Surveys 41, 3 (July 2009), 15:1–15:58. https://doi.org/10.1145/1541880.1541882

Digital Library

[25]

M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. 2002. Pinpoint: Problem determination in large, dynamic Internet services. In Proceedings of the International Conference on Dependable Systems and Networks. IEEE, 595–604. https://doi.org/10.1109/DSN.2002.1029005

Digital Library

[26]

Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, and Eric Brewer. 2004. Path-based failure and evolution management. In Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation (NSDI’04). USENIX Association, 23. Retrieved from https://dl.acm.org/doi/10.5555/1251175.1251198.

Digital Library

[27]

Xin Chen, Charng-Da Lu, and Karthik Pattabiraman. 2014. Failure analysis of jobs in compute clouds: A Google cluster case study. In Proceedings of the IEEE 25th International Symposium on Software Reliability Engineering. IEEE, 167–177. https://doi.org/10.1109/issre.2014.34

Digital Library

[28]

S. R. Chidamber and C. F. Kemerer. 1994. A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20, 6 (6 1994), 476–493. https://doi.org/10.1109/32.295895

Digital Library

[29]

Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. The mystery machine: End-to-end performance analysis of large-scale internet services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, 217–231. https://dl.acm.org/doi/10.5555/2685048.2685066

Digital Library

[30]

Holger Cleve and Andreas Zeller. 2005. Locating causes of program failures. In Proceedings of the 27th International Conference on Software Engineering (ICSE’05). ACM, New York, NY, 342–351. https://doi.org/10.1145/1062455.1062522

Digital Library

[31]

Ira Cohen, Moises Goldszmidt, Terence Kelly, Julie Symons, and Jeffrey S. Chase. 2004. Correlating instrumentation data to system states: A building block for automated diagnosis and control. In Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation (OSDI’04). USENIX Association, 16. Retrieved from https://dl.acm.org/doi/10.5555/1251254.1251270.

Digital Library

[32]

Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, and Armando Fox. 2005. Capturing, indexing, clustering, and retrieving system history. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP’05). ACM, New York, NY, 105–118. https://doi.org/10.1145/1095810.1095821

Digital Library

[33]

Carlos H. A. Costa, Yoonho Park, Bryan S. Rosenburg, Chen-Yong Cher, and Kyung Dong Ryu. 2014. A system software approach to proactive memory-error avoidance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14). IEEE, 12 pages. https://doi.org/10.1109/SC.2014.63

Digital Library

[34]

A. Csenki. 1990. Bayes predictive analysis of a fundamental software reliability model. IEEE Trans. Reliabil. 39, 2 (June 1990), 177–183. https://doi.org/10.1109/24.55879

[35]

Marco D’Ambros, Michele Lanza, and Romain Robbes. 2011. Evaluating defect prediction approaches: A benchmark and an extensive comparison. Empir. Softw. Eng. 17, 4–5 (Aug. 2011), 531–577. https://doi.org/10.1007/s10664-011-9173-9

Digital Library

[36]

Yingnong Dang, Qingwei Lin, and Peng Huang. 2019. AIOps: Real-world challenges and research innovations. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE’19). IEEE, 4–5. https://doi.org/10.1109/icse-companion.2019.00023

Digital Library

[37]

Nickolas Allen Davis, Abdelmounaam Rezgui, Hamdy Soliman, Skyler Manzanares, and Milagre Coates. 2017. FailureSim: A system for predicting hardware failures in cloud data centers using neural networks. In Proceedings of the IEEE 10th International Conference on Cloud Computing (CLOUD’17). IEEE, 544–551. https://doi.org/10.1109/cloud.2017.75

[38]

Karel Dejaeger, Thomas Verbraken, and Bart Baesens. 2013. Toward comprehensible software fault prediction models using Bayesian network classifiers. IEEE Trans. Softw. Eng. 39, 2 (Feb. 2013), 237–257. https://doi.org/10.1109/tse.2012.20

Digital Library

[39]

B. Dhanalaxmi, G. Apparao Naidu, and K. Anuradha. 2015. A review on software fault detection and prevention mechanism in software development activities. J. Comput. Eng. 17, 6 (2015), 25–30.

[40]

Peter A. Dinda and David R. O’Hallaron. 1999. An evaluation of linear models for host load prediction. In Proceedings of The 8th International Symposium on High Performance Distributed Computing. IEEE, 87–96.

Digital Library

[41]

Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. ACM, 1285–1298. https://doi.org/10.1145/3133956.3134015

Digital Library

[42]

Karim O. Elish and Mahmoud O. Elish. 2008. Predicting defect-prone software modules using support vector machines. J. Syst. Softw. 81, 5 (May 2008), 649–660. https://doi.org/10.1016/j.jss.2007.07.040

Digital Library

[43]

Alice Este, Francesco Gringoli, and Luca Salgarelli. 2009. Support vector machines for TCP traffic classification. Comput. Netw. 53, 14 (Sep. 2009), 2476–2490. https://doi.org/10.1016/j.comnet.2009.05.003

Digital Library

[44]

Ilenia Fronza, Alberto Sillitti, Giancarlo Succi, Mikko Terho, and Jelena Vlasenko. 2013. Failure prediction based on log files using random indexing and support vector machines. J. Syst. Softw. 86, 1 (1 2013), 2–11. https://doi.org/10.1016/j.jss.2012.06.025

Digital Library

[45]

Qiang Fu, Jian-Guang Lou, Yi Wang, and Jiang Li. 2009. Execution anomaly detection in distributed systems through unstructured log analysis. In Proceedings of the 9th IEEE International Conference on Data Mining. IEEE Computer Society, 149–158. https://doi.org/10.1109/icdm.2009.60

Digital Library

[46]

Zhiwei Gao, Carlo Cecati, and Steven X. Ding. 2015. A survey of fault diagnosis and fault-tolerant techniques—Part I: Fault diagnosis with model-based and signal-based approaches. IEEE Trans. Industr. Electr. 62, 6 (June 2015), 3757–3767. https://doi.org/10.1109/tie.2015.2417501

[47]

Zhiwei Gao, Carlo Cecati, and Steven X. Ding. 2015. A survey of fault diagnosis and fault-tolerant techniques—Part II: Fault diagnosis with knowledge-based and hybrid/active approaches. IEEE Trans. Industr. Electr. 62, 6 (June 2015), 3768–3774. https://doi.org/10.1109/TIE.2015.2419013

[48]

S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi. 1995. Analysis of software rejuvenation using Markov regenerative stochastic petri net. In Proceedings of the 6th International Symposium on Software Reliability Engineering (ISSRE’95) (1995). IEEE, 180–187. https://doi.org/10.1109/issre.1995.497656

[49]

S. Garg, A. van Moorsel, K. Vaidyanathan, and K. S. Trivedi. 1998. A methodology for detection and estimation of software aging. In Proceedings of the 9th International Symposium on Software Reliability Engineering. IEEE, 283–292. https://doi.org/10.1109/issre.1998.730892

Digital Library

[50]

Emanuel Giger, Marco D’Ambros, Martin Pinzger, and Harald C. Gall. 2012. Method-level bug prediction. In Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM’12). ACM, New York, NY, 171–180. https://doi.org/10.1145/2372251.2372285

Digital Library

[51]

Tarun Goyal, Ajit Singh, and Aakanksha Agrawal. 2012. Cloudsim: Simulator for cloud computing infrastructure and modeling. Proc. Eng. 38 (Nov. 2012), 3566–3572. https://doi.org/10.1016/j.proeng.2012.06.412

[52]

T. L. Graves, A. F. Karr, J. S. Marron, and H. Siy. 2000. Predicting fault incidence using software change history. IEEE Trans. Softw. Eng. 26, 7 (July 2000), 653–661. https://doi.org/10.1109/32.859533

Digital Library

[53]

Maurice H. Halstead. 1977. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc.

Digital Library

[54]

Greg Hamerly and Charles Elkan. 2001. Bayesian approaches to failure prediction for disk drives. In Proceedings of the 18th International Conference on Machine Learning (ICML’01). Morgan Kaufmann, San Francisco, CA, 202–209. https://doi.org/10.5555/645530.655825

Digital Library

[55]

Seungjae Han, K. G. Shin, and H. A. Rosenberg. 1995. DOCTOR: An integrated software fault injection environment for distributed real-time systems. In Proceedings of the IEEE International Computer Performance and Dependability Symposium. IEEE, 204–213. https://doi.org/10.1109/IPDS.1995.395831

Digital Library

[56]

J. L. Hellerstein, Fan Zhang, and P. Shahabuddin. 1999. An approach to predictive detection for service management. In Proceedings of the 6th IFIP/IEEE International Symposium on Integrated Network Management. IEEE, 309–322. https://doi.org/10.1109/inm.1999.770691

[57]

Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton. 1995. Software rejuvenation: Analysis, module and applications. In Proceedings of the 25th International Symposium on Fault-Tolerant Computing. Digest of Papers. IEEE, 381–390. https://doi.org/10.1109/FTCS.1995.466961

Digital Library

[58]

G. F. Hughes, J. F. Murray, K. Kreutz-Delgado, and C. Elkan. 2002. Improved disk-drive failure warnings. IEEE Trans. Reliabil. 51, 3 (Sep. 2002), 350–357. https://doi.org/10.1109/TR.2002.802886

[59]

Monica Hutchins, Herb Foster, Tarak Goradia, and Thomas Ostrand. 1994. Experiments on the effectiveness of dataflow- and control-flow-based test adequacy criteria—IEEE conference publication. In Proceedings of the 16th International Conference on Software Engineering. IEEE, 191–200. Retrieved from https://ieeexplore.ieee.org/document/296778.

Digital Library

[60]

Olumuyiwa Ibidunmoye, Francisco Hernández-Rodriguez, and Erik Elmroth. 2015. Performance anomaly detection and bottleneck identification. Comput. Surveys 48, 1 (Sep. 2015), 1–35. https://doi.org/10.1145/2791120

Digital Library

[61]

Tariqul Islam and Dakshnamoorthy Manivannan. 2017. Predicting application failure in cloud: A machine learning approach. In Proceedings of the IEEE International Conference on Cognitive Computing (ICCC’17). IEEE, 24–31. https://doi.org/10.1109/ieee.iccc.2017.11

[62]

Itthichok Jangjaimon and Nian-Feng Tzeng. 2015. Effective cost reduction for elastic clouds under spot instance pricing through adaptive checkpointing. IEEE Trans. Comput. 64, 2 (Feb. 2015), 396–409. https://doi.org/10.1109/tc.2013.225

Digital Library

[63]

David Jauk, Dai Yang, and Martin Schulz. 2019. Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’19). ACM, New York, NY, Article 30, 13 pages. https://doi.org/10.1145/3295500.3356185

Digital Library

[64]

Srikanth Kandula, Dina Katabi, and Jean-Philippe Vasseur. 2005. Shrink: A tool for failure diagnosis in IP networks. In Proceeding of the ACM SIGCOMM Workshop on Mining Network Data (MineNet’05). ACM, New York, NY, 6 pages. https://doi.org/10.1145/1080173.1080178

Digital Library

[65]

N. Karunanithi, D. Whitley, and Y. K. Malaiya. 1992. Prediction of software reliability using connectionist models. IEEE Trans. Softw. Eng. 18, 7 (July 1992), 563–574. https://doi.org/10.1109/32.148475

Digital Library

[66]

Taghi M. Khoshgoftaar and David L. Lanning. 1995. A neural network approach for early detection of program modules having high risk in the maintenance phase. J. Syst. Softw. 29, 1 (Apr. 1995), 85–91. https://doi.org/10.1016/0164-1212(94)00130-f

Digital Library

[67]

Barbara A. Kitchenham, David Budgen, and O. Pearl Brereton. 2010. The value of mapping studies - A participant-observer case study. In Proceedings of the 14th International Conference on Evaluation and Assessment in Software Engineering (EASE’10). BCS Learning & Development, 25–33. https://doi.org/10.14236/ewic/EASE2010.4

Digital Library

[68]

S. Kliger, S. Yemini, Y. Yemini, D. Ohsie, and S. Stolfo. 1995. A Coding Approach to Event Correlation. Springer US, Boston, MA, 266–277. https://doi.org/10.1007/978-0-387-34890-2_24

Digital Library

[69]

Yevgeniy Sverdlik Data Center Knowledge. 2016. What Facebook Has Learned from Regularly Shutting Down Entire Data Centers. Retrieved from https://www.datacenterknowledge.com/archives/2016/08/31/facebook-learned-regularly-shutting-entire-data-centers.

[70]

Khairy A. H. Kobbacy and Sunil Vadera. 2011. A survey of AI in operations management from 2005 to 2009. J. Manufact. Technol. Manage. 22, 6 (July 2011), 706–733. https://doi.org/10.1108/17410381111149602

[71]

K. A. H. Kobbacy, S Vadera, and M. H. Rasmy. 2007. AI and OR in management of operations: History and trends. J. Oper. Res. Soc. 58, 1 (Jan. 2007), 10–28. https://doi.org/10.1057/palgrave.jors.2602132

[72]

Anukool Lakhina, Mark Crovella, and Christophe Diot. 2004. Diagnosing network-wide traffic anomalies. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM’04). ACM, New York, NY, 219–230. https://doi.org/10.1145/1015467.1015492

Digital Library

[73]

Anukool Lakhina, Mark Crovella, and Christophe Diot. 2005. Mining anomalies using traffic feature distributions. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (Philadelphia, Pennsylvania) (SIGCOMM’05). ACM, New York, NY, 217–228. https://doi.org/10.1145/1080091.1080118

Digital Library

[74]

Andrew Lerner. 2017. AIOps Platforms—Gartner. Retrieved from https://blogs.gartner.com/andrew-lerner/2017/08/09/aiops-platforms/.

[75]

Anna Levin, Shelly Garion, Elliot K. Kolodner, Dean H. Lorenz, Katherine Barabash, Mike Kugler, and Niall McShane. 2019. AIOps for a cloud object storage service. In Proceedings of the IEEE International Congress on Big Data (BigDataCongress’19). IEEE, 165–169. https://doi.org/10.1109/BigDataCongress.2019.00036

[76]

Jian Li, Pinjia He, Jieming Zhu, and Michael R. Lyu. 2017. Software defect prediction via convolutional neural network. In Proceedings of the IEEE International Conference on Software Quality, Reliability and Security (QRS’17). IEEE, 318–328. https://doi.org/10.1109/qrs.2017.42

[77]

Jing Li, Xinpu Ji, Yuhan Jia, Bingpeng Zhu, Gang Wang, Zhongwei Li, and Xiaoguang Liu. 2014. Hard drive failure prediction using classification and regression trees. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE, 383–394. https://doi.org/10.1109/dsn.2014.44

Digital Library

[78]

Jing Li, Rebecca J. Stones, Gang Wang, Zhongwei Li, Xiaoguang Liu, and Kang Xiao. 2016. Being accurate is not enough: New metrics for disk failure prediction. In Proceedings of the IEEE 35th Symposium on Reliable Distributed Systems (SRDS’16). IEEE, 71–80. https://doi.org/10.1109/srds.2016.019

[79]

Yangguang Li, Zhen Ming (Jack) Jiang, Heng Li, Ahmed E. Hassan, Cheng He, Ruirui Huang, Zhengda Zeng, Mian Wang, and Pinan Chen. 2020. Predicting node failures in an ultra-large-scale cloud computing platform: An AIOps solution. ACM Trans. Softw. Eng. Methodol. 29, 2 (27 4 2020), 13:1–13:24. https://doi.org/10.1145/3385187

Digital Library

[80]

Zeyan Li, Chengyang Luo, Yiwei Zhao, Yongqian Sun, Kaixin Sui, Xiping Wang, Dapeng Liu, Xing Jin, Qi Wang, and Dan Pei. 2019. Generic and robust localization of multi-dimensional root causes. In Proceedings of the IEEE 30th International Symposium on Software Reliability Engineering (ISSRE’19). IEEE, IEEE, 47–57.

[81]

Yinglung Liang, Yanyong Zhang, Hui Xiong, and Ramendra Sahoo. 2007. Failure prediction in IBM bluegene/l event logs. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM’07). IEEE, 583–588. https://doi.org/10.1109/icdm.2007.46

Digital Library

[82]

Fan Lin, Matt Beadon, Harish Dattatraya Dixit, Gautham Vunnam, Amol Desai, and Sriram Sankar. 2018. Hardware remediation at scale. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W’18). IEEE, 14–17. https://doi.org/10.1109/dsn-w.2018.00015

[83]

Fred Lin, Keyur Muzumdar, Nikolay Pavlovich Laptev, Mihai-Valentin Curelea, Seunghak Lee, and Sriram Sankar. 2020. Fast dimensional analysis for root cause investigation in a large-scale service environment. Proc. ACM Meas. Anal. Comput. Syst. 4, 2, Article 31 (June 2020), 23 pages. https://doi.org/10.1145/3392149

Digital Library

[84]

Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, and Xuewei Chen. 2016. Log clustering based problem identification for online service systems. In Proceedings of the 38th International Conference on Software Engineering Companion (ICSE’16). ACM, New York, NY, 102–111. https://doi.org/10.1145/2889160.2889232

Digital Library

[85]

Chao Liu, Xifeng Yan, Long Fei, Jiawei Han, and Samuel P. Midkiff. 2005. SOBER: Statistical model-based bug localization. ACM SIGSOFT Softw. Eng. Notes 30, 5 (Sep. 2005), 286. https://doi.org/10.1145/1095430.1081753

Digital Library

[86]

Dapeng Liu, Youjian Zhao, Haowen Xu, Yongqian Sun, Dan Pei, Jiao Luo, Xiaowei Jing, and Mei Feng. 2015. Opprentice: Towards practical and automatic anomaly detection through machine learning. In Proceedings of the Internet Measurement Conference. ACM, New York, NY, 211–224. https://doi.org/10.1145/2815675.2815679

Digital Library

[87]

David Lo, Hong Cheng, Jiawei Han, Siau-Cheng Khoo, and Chengnian Sun. 2009. Classification of software behaviors for failure detection: A discriminative pattern mining approach. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09). ACM, New York, NY, 557–566. https://doi.org/10.1145/1557019.1557083

Digital Library

[88]

Ao Ma, Fred Douglis, Guanlin Lu, Darren Sawyer, Surendar Chandra, and Windsor Hsu. 2015. RAIDShield: Characterizing, monitoring, and proactively protecting against disk failures. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST’15). USENIX Association, 241–256. https://www.usenix.org/conference/fast15/technical-sessions/presentation/ma

Digital Library

[89]

Farzaneh Mahdisoltani, Ioan Stefanovici, and Bianca Schroeder. 2017. Proactive error prediction to improve storage system reliability. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’17). USENIX Association, 391–402. Retrieved from https://www.usenix.org/conference/atc17/technical-sessions/presentation/mahdisoltani.

Digital Library

[90]

T. J. McCabe. 1976. A complexity measure. IEEE Trans. Softw. Eng. SE-2, 4 (Dec. 1976), 308–320. https://doi.org/10.1109/TSE.1976.233837

Digital Library

[91]

Meiliana, Syaeful Karim, Harco Leslie Hendric Spits Warnars, Ford Lumban Gaol, Edi Abdurachman, and Benfano Soewito. 2017. Software metrics for fault prediction using machine learning approaches: A literature review with PROMISE repository dataset. In Proceedings of the IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom’17). IEEE, 19–23. https://doi.org/10.1109/cyberneticscom.2017.8311708

[92]

Weibin Meng, Ying Liu, Yichen Zhu, Shenglin Zhang, Dan Pei, Yuqing Liu, Yihao Chen, Ruizhi Zhang, Shimin Tao, Pei Sun, et al. 2019. LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19). 4739–4745. https://doi.org/10.24963/ijcai.2019/658

Digital Library

[93]

Tim Menzies. 2004. PROMISE DATASETS PAGE. Retrieved from http://promise.site.uottawa.ca/SERepository/datasets-page.html.

[94]

Tim Menzies, Jeremy Greenwald, and Art Frank. 2007. Data mining static code attributes to learn defect predictors. IEEE Trans. Softw. Eng. 33, 1 (Jan. 2007), 2–13. https://doi.org/10.1109/TSE.2007.256941

Digital Library

[95]

Adam Moody, Greg Bronevetsky, Kathryn Mohror, and de Bronis R. Supinski. 2010. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–11. https://doi.org/10.1109/sc.2010.18

Digital Library

[96]

Moogsoft. 2020. What Is AIOps? Moogsoft. Retrieved from https://www.moogsoft.com/resources/aiops/guide/everything-aiops/.

[97]

Andrew W. Moore and Denis Zuev. 2005. Internet traffic classification using bayesian analysis techniques. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’05). ACM, New York, NY, 50–60. https://doi.org/10.1145/1064212.1064220

Digital Library

[98]

Raimund Moser, Witold Pedrycz, and Giancarlo Succi. 2008. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Proceedings of the 13th International Conference on Software Engineering. ACM, New York, NY, 181–190. https://doi.org/10.1145/1368088.1368114

Digital Library

[99]

Mukosi Abraham Mukwevho and Turgay Celik. 2021. Toward a smart cloud: A review of fault-tolerance methods in cloud systems. IEEE Trans. Serv. Comput. 14, 2 (2021), 589–605. https://doi.org/10.1109/tsc.2018.2816644

[100]

Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. 2003. Hard drive failure prediction using non-parametric statistical methods. Retrieved from http://dsp.ucsd.edu/jfmurray/publications/Murray2003.pdf.

[101]

Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. 2005. Machine learning methods for predicting failures in hard drives: A multiple-instance application. J. Mach. Learn. Res. 6 (Jan. 2005), 783–816. https://doi.org/10.5555/1046920.1088699

Digital Library

[102]

Nachiappan Nagappan, Thomas Ball, and Andreas Zeller. 2006. Mining metrics to predict component failures. In Proceeding of the 28th International Conference on Software Engineering (ICSE’06). ACM, New York, NY, 452–461. https://doi.org/10.1145/1134285.1134349

Digital Library

[103]

Jaechang Nam, Sinno Jialin Pan, and Sunghun Kim. 2013. Transfer defect learning. In Proceedings of the 35th International Conference on Software Engineering (ICSE’13). IEEE, 382–391. https://doi.org/10.1109/icse.2013.6606584

Digital Library

[104]

Iyswarya Narayanan, Kushagra Vaid, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, and Badriddine Khessib. 2016. SSD failures in datacenters: What? When? and Why? In Proceedings of the 9th ACM International on Systems and Storage Conference (SYSTOR’16). ACM, New York, NY, Article 7, 11 pages. https://doi.org/10.1145/2928275.2928278

Digital Library

[105]

Roberto Natella, Domenico Cotroneo, Joao A. Duraes, and Henrique S. Madeira. 2013. On fault representativeness of software fault injection. IEEE Trans. Softw. Eng. 39, 1 (Jan. 2013), 80–96. https://doi.org/10.1109/tse.2011.124

Digital Library

[106]

Roberto Natella, Domenico Cotroneo, and Henrique S. Madeira. 2016. Assessing dependability with software fault injection: A survey. Comput. Surveys 48, 3 (Aug. 2016), 44:1–44:55. https://doi.org/10.1145/2841425

Digital Library

[107]

Sasho Nedelkoski, Jorge Cardoso, and Odej Kao. 2019. Anomaly detection and classification using distributed tracing and deep learning. In Proceedings of the 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID’19). IEEE, 241–250. https://doi.org/10.1109/ccgrid.2019.00038

[108]

NetManAIOps. 2019. SMD Dataset—OmniAnomaly. Retrieved from https://github.com/NetManAIOps/OmniAnomaly.

[109]

Clodoaldo Brasilino Leite Neto, Pedro Batista De Carvalho Filho, and Alexandre Nóbrega Duarte. 2013. A systematic mapping study on fault management in cloud computing. In Proceedings of the International Conference on Parallel and Distributed Computing, Applications and Technologies. IEEE, 332–337. https://doi.org/10.1109/PDCAT.2013.59

[110]

Hiep Nguyen, Zhiming Shen, Yongmin Tan, and Xiaohui Gu. 2013. FChain: Toward black-box online fault localization for cloud systems. In Proceedings of the IEEE 33rd International Conference on Distributed Computing Systems. IEEE, 21–30. https://doi.org/10.1109/icdcs.2013.26

[111]

Thuy T. T. Nguyen and Grenville Armitage. 2008. A survey of techniques for internet traffic classification using machine learning. IEEE Commun. Surveys Tutor. 10, 4 (2008), 56–76. https://doi.org/10.1109/surv.2008.080406

[112]

Changhai Nie and Hareton Leung. 2011. A survey of combinatorial testing. Comput. Surveys 43, 2 (Apr. 2011), 11:1–11:29. https://doi.org/10.1145/1883612.1883618

[113]

Paolo Notaro, Jorge Cardoso, and Michael Gerndt. 2020. A systematic mapping study in AIOps. In Proceedings of the International Conference on Service-oriented Computing (ICSOC’20). Workshops: AIOps, CFTIC, STRAPS, AI-PA, AI-IOTS, and Satellite Events. Springer, 110–123. Retrieved from http://arxiv.org/abs/2012.09108.

[114]

H. Okamura, Y. Nishimura, and T. Dohi. 2004. A dynamic checkpointing scheme based on reinforcement learning. In Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing. IEEE, 151–158. https://doi.org/10.1109/PRDC.2004.1276566

[115]

Ahmet Okutan and Olcay Taner Yıldız. 2012. Software defect prediction using Bayesian networks. Empir. Softw. Eng. 19, 1 (Aug. 2012), 154–181. https://doi.org/10.1007/s10664-012-9218-8

[116]

OpsRamp. 2020. AIOps (AI for IT Operations)—OpsRamp. Retrieved from https://www.opsramp.com/solutions/service-centric-aiops/.

[117]

T. J. Ostrand, E. J. Weyuker, and R. M. Bell. 2005. Predicting the location and number of faults in large software systems. IEEE Trans. Softw. Eng. 31, 4 (Apr. 2005), 340–355. https://doi.org/10.1109/tse.2005.49

[118]

Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07). USENIX Association, 2. Retrieved from https://www.usenix.org/conference/fast-07/failure-trends-large-disk-drive-population.

[119]

Teerat Pitakrat, Dušan Okanović, André van Hoorn, and Lars Grunske. 2018. Hora: Architecture-aware online failure prediction. J. Syst. Softw. 137 (Mar. 2018), 669–685. https://doi.org/10.1016/j.jss.2017.02.041

[120]

A. Podgurski, D. Leon, P. Francis, W. Masri, M. Minch, Jiayang Sun, and Bin Wang. 2003. Automated support for classifying software failure reports. In Proceedings of the 25th International Conference on Software Engineering. IEEE, 465–475. https://doi.org/10.1109/icse.2003.1201224

[121]

M. Renieris and S. P. Reiss. 2003. Fault localization with nearest neighbor queries. In Proceedings of the 18th IEEE International Conference on Automated Software Engineering. IEEE, 30–39. https://ieeexplore.ieee.org/document/1240292

[122]

Felix Salfner, Maren Lenk, and Miroslaw Malek. 2010. A survey of online failure prediction methods. Comput. Surveys 42, 3 (Mar. 2010), 1–42. https://doi.org/10.1145/1670679.1670680

[123]

Felix Salfner and Miroslaw Malek. 2007. Using hidden semi-Markov models for effective online failure prediction. In Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS’07). IEEE, 161–174. https://doi.org/10.1109/srds.2007.35

[124]

Areeg Samir and Claus Pahl. 2019. A Controller Architecture for Anomaly Detection, Root Cause Analysis and Self-Adaptation for Cluster Architectures. Retrieved from https://orbilu.uni.lu/handle/10993/42062.

[125]

Mark Schwabacher and Kai Goebel. 2007. A survey of artificial intelligence for prognostics. In Proceedings of the AAAI Fall Symposium on Artificial Intelligence for Prognostics. AAAI, 108–115. Retrieved from https://www.aaai.org/Library/Symposia/Fall/2007/fs07-02-016.php.

[126]

Qihong Shao, Yi Chen, Shu Tao, Xifeng Yan, and Nikos Anerousis. 2008. Efficient ticket routing by resolution sequence mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08). ACM, New York, NY, 605–613. https://doi.org/10.1145/1401890.1401964

[127]

Bikash Sharma, Praveen Jayachandran, Akshat Verma, and Chita R. Das. 2013. CloudPD: Problem determination and diagnosis in shared dynamic clouds. In Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’13). IEEE, 1–12. https://doi.org/10.1109/dsn.2013.6575298

[128]

Akbar Siami Namin, James H. Andrews, and Duncan J. Murdoch. 2008. Sufficient mutation operators for measuring test effectiveness. In Proceedings of the 30th International Conference on Software Engineering (ICSE’08). ACM, New York, NY, 351–360. https://doi.org/10.1145/1368088.1368136

[129]

BMC Software. 2020. AIOps—BMC. Retrieved from https://www.bmc.com/it-solutions/aiops.html.

[130]

Marc Solé, Victor Muntés-Mulero, Annie Ibrahim Rana, and Giovani Estrada. 2017. Survey on Models and Techniques for Root-Cause Analysis. Retrieved from http://arxiv.org/abs/1701.08546.

[131]

Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. 2019. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’19). ACM, New York, NY, 2828–2837.

[132]

Yongqian Sun, Youjian Zhao, Ya Su, Dapeng Liu, Xiaohui Nie, Yuan Meng, Shiwen Cheng, Dan Pei, Shenglin Zhang, Xianping Qu, et al. 2018. HotSpot: Anomaly localization for additive KPIs with multi-dimensional attributes. IEEE Access 6 (2018), 10909–10923.

[133]

Resolve Systems. 2020. What is AIOps?—Resolve. Retrieved from https://resolve.io/what-is-aiops.

[134]

D. Tang and R. K. Iyer. 1993. Dependability measurement and modeling of a multicomputer system. IEEE Trans. Comput. 42, 1 (1993), 62–75. https://doi.org/10.1109/12.192214

[135]

Timothy K. Tsai and Ravishankar K. Iyer. 1995. FTAPE: A fault injection tool to measure fault tolerance. NASA STI/Recon Technical Report. 25333 pages. https://doi.org/10.2514/6.1995-1041

[136]

K. Vaidyanathan and K. S. Trivedi. 1999. A measurement-based model for estimation of resource exhaustion in operational software systems. In Proceedings of the 10th International Symposium on Software Reliability Engineering. IEEE, 84–93. https://doi.org/10.1109/issre.1999.809313

[137]

K. Vaidyanathan and K. S. Trivedi. 2005. A comprehensive model for software rejuvenation. IEEE Trans. Depend. Secure Comput. 2, 2 (Feb. 2005), 124–137. https://doi.org/10.1109/tdsc.2005.15

[138]

Kashi Venkatesh Vishwanath and Nachiappan Nagappan. 2010. Characterizing cloud computing hardware reliability. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC’10). ACM, New York, NY, 193–204. https://doi.org/10.1145/1807128.1807161

[139]

Jing Wang, Daniel Rossell, Christos G. Cassandras, and Ioannis Ch. Paschalidis. 2013. Network anomaly detection: A survey and comparative analysis of stochastic and deterministic methods. In Proceedings of the 52nd IEEE Conference on Decision and Control. IEEE, 182–187. https://doi.org/10.1109/CDC.2013.6759879

[140]

Qing Wang, Wubai Zhou, Chunqiu Zeng, Tao Li, Larisa Shwartz, and Genady Ya. Grabarnik. 2017. Constructing the knowledge base for cognitive IT service management. In Proceedings of the IEEE International Conference on Services Computing (SCC’17). IEEE, 410–417. https://doi.org/10.1109/scc.2017.59

[141]

Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering (ICSE’16). ACM, New York, NY, 297–308. https://doi.org/10.1145/2884781.2884804

[142]

Wei Wang, Ming Zhu, Jinlin Wang, Xuewen Zeng, and Zhongzhen Yang. 2017. End-to-end encrypted traffic classification with one-dimensional convolution neural networks. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics (ISI’17). IEEE, 43–48. https://doi.org/10.1109/isi.2017.8004872

[143]

Yu Wang, Qiang Miao, Eden W. M. Ma, Kwok-Leung Tsui, and Michael G. Pecht. 2013. Online anomaly detection for hard disk drives based on Mahalanobis distance. IEEE Trans. Reliabil. 62, 1 (Mar. 2013), 136–145. https://doi.org/10.1109/tr.2013.2241204

[144]

Amy Ward, Peter Glynn, and Kathy Richardson. 1998. Internet service performance failure detection. ACM SIGMETRICS Perform. Eval. Rev. 26, 3 (Dec. 1998), 38–43. https://doi.org/10.1145/306225.306237

[145]

W. Eric Wong, Vidroha Debroy, Ruizhi Gao, and Yihao Li. 2014. The DStar method for effective software fault localization. IEEE Trans. Reliabil. 63, 1 (Mar. 2014), 290–308. https://doi.org/10.1109/tr.2013.2285319

[146]

W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A survey on software fault localization. IEEE Trans. Softw. Eng. 42, 8 (Aug. 2016), 707–740. https://doi.org/10.1109/tse.2016.2521368

[147]

Rongxin Wu, Hongyu Zhang, Sunghun Kim, and Shing-Chi Cheung. 2011. ReLink: Recovering links between bugs and changes. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE’11). ACM, 15–25. https://doi.org/10.1145/2025113.2025120

[148]

Jiang Xiao, Zhuang Xiong, Song Wu, Yusheng Yi, Hai Jin, and Kan Hu. 2018. Disk failure prediction in data centers via online learning. In Proceedings of the 47th International Conference on Parallel Processing (ICPP’18). ACM, New York, NY, Article 35, 10 pages. https://doi.org/10.1145/3225058.3225106

[149]

Chang Xu, Gang Wang, Xiaoguang Liu, Dongdong Guo, and Tie-Yan Liu. 2016. Health status assessment and failure prediction for hard drives with recurrent neural networks. IEEE Trans. Comput. 65, 11 (Nov. 2016), 3502–3508. https://doi.org/10.1109/tc.2016.2538237

[150]

Haowen Xu, Yang Feng, Jie Chen, Zhaogang Wang, Honglin Qiao, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, et al. 2018. Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In Proceedings of the World Wide Web Conference (WWW’18). ACM, New York, NY, 187–196. https://doi.org/10.1145/3178876.3185996

[151]

Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael I. Jordan. 2009. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP’09). ACM, New York, NY, 117–132. https://doi.org/10.1145/1629575.1629587

[152]

Zhenghua Xue, Xiaoshe Dong, Siyuan Ma, and Weiqing Dong. 2007. A survey on failure prediction of large-scale server clusters. In Proceedings of the 8th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD’07). IEEE, 733–738. https://doi.org/10.1109/snpd.2007.284

[153]

Xiaoxing Yang, Ke Tang, and Xin Yao. 2015. A learning-to-rank approach to software defect prediction. IEEE Trans. Reliabil. 64, 1 (Mar. 2015), 234–246. https://doi.org/10.1109/tr.2014.2370891

[154]

Ding Yuan, Haohui Mai, Weiwei Xiong, Lin Tan, Yuanyuan Zhou, and Shankar Pasupathy. 2010. SherLog: Error diagnosis by connecting clues from run-time logs. ACM SIGARCH Comput. Architect. News 38, 1 (Mar. 2010), 143–154. https://doi.org/10.1145/1735970.1736038

[155]

Ding Yuan, Soyeon Park, Peng Huang, Yang Liu, Michael M. Lee, Xiaoming Tang, Yuanyuan Zhou, and Stefan Savage. 2012. Be conservative: Enhancing failure diagnosis with proactive logging. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 293–306.

[156]

Andreas Zeller. 2002. Isolating cause-effect chains from computer programs. In Proceedings of the 10th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT’02/FSE’10). ACM, New York, NY, 1–10. https://doi.org/10.1145/587051.587053

[157]

Andreas Zeller. 2006. Eclipse Bug Data!—Software Engineering Chair (Prof. Zeller)—Saarland University. Retrieved from https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/.

[158]

Chunqiu Zeng, Wubai Zhou, Tao Li, Larisa Shwartz, and Genady Ya. Grabarnik. 2017. Knowledge guided hierarchical multi-label classification over ticket data. IEEE Trans. Netw. Service Manage. 14, 2 (Jun. 2017), 246–260. https://doi.org/10.1109/tnsm.2017.2668363

[159]

Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V. Chawla. 2019. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. AAAI, 1409–1416.

[160]

Ke Zhang, Jianwu Xu, Martin Renqiang Min, Guofei Jiang, Konstantinos Pelechrinis, and Hui Zhang. 2016. Automated IT system failure prediction: A deep learning approach. In Proceedings of the IEEE International Conference on Big Data (BigData’16). IEEE, 1291–1300. https://doi.org/10.1109/bigdata.2016.7840733

[161]

Shenglin Zhang, Weibin Meng, Jiahao Bu, Sen Yang, Ying Liu, Dan Pei, Jun Xu, Yu Chen, Hui Dong, Xianping Qu, et al. 2017. Syslog processing for switch failure diagnosis and prediction in datacenter networks. In Proceedings of the IEEE/ACM 25th International Symposium on Quality of Service (IWQoS’17). IEEE, 1–10. https://doi.org/10.1109/iwqos.2017.7969130

[162]

Xu Zhang, Yong Xu, Qingwei Lin, Bo Qiao, Hongyu Zhang, Yingnong Dang, Chunyu Xie, Xinsheng Yang, Qian Cheng, Ze Li, Junjie Chen, Xiaoting He, Randolph Yao, Jian-Guang Lou, Murali Chintalapati, Furao Shen, and Dongmei Zhang. 2019. Robust log-based anomaly detection on unstable log data. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’19). ACM, New York, NY, 807–817. https://doi.org/10.1145/3338906.3338931

[163]

Xu Zhao, Kirk Rodrigues, Yu Luo, Michael Stumm, Ding Yuan, and Yuanyuan Zhou. 2017. Log20: Fully automated optimal placement of log printing statements under specified overhead threshold. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP’17). ACM, New York, NY, 565–581. https://doi.org/10.1145/3132747.3132778

[164]

Ying Zhao, Xiang Liu, Siqing Gan, and Weimin Zheng. 2010. Predicting disk failures with HMM- and HSMM-based approaches. In Proceedings of the 10th Industrial Conference on Advances in Data Mining: Applications and Theoretical Aspects (ICDM’10). Springer-Verlag, Berlin, 390–404.

[165]

Shuai Zheng, Kosta Ristovski, Ahmed Farahat, and Chetan Gupta. 2017. Long short-term memory network for remaining useful life estimation. In Proceedings of the IEEE International Conference on Prognostics and Health Management (ICPHM’17). IEEE, 88–95. https://doi.org/10.1109/icphm.2017.7998311

[166]

Wubai Zhou, Liang Tang, Tao Li, Larisa Shwartz, and Genady Ya. Grabarnik. 2015. Resolution recommendation for event tickets in service management. In Proceedings of the IFIP/IEEE International Symposium on Integrated Network Management (IM’15). IEEE, 287–295. https://doi.org/10.1109/inm.2015.7140303

[167]

Bingpeng Zhu, Gang Wang, Xiaoguang Liu, Dianming Hu, Sheng Lin, and Jingwei Ma. 2013. Proactive drive failure prediction for large scale storage systems. In Proceedings of the IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST’13). IEEE, 1–5. https://doi.org/10.1109/msst.2013.6558427

[168]

Jieming Zhu, Pinjia He, Qiang Fu, Hongyu Zhang, Michael R. Lyu, and Dongmei Zhang. 2015. Learning to log: Helping developers make informed logging decisions. In Proceedings of the IEEE/ACM 37th IEEE International Conference on Software Engineering. IEEE, 415–425. https://doi.org/10.1109/icse.2015.60

Information & Contributors

Information

Published In

ACM Transactions on Intelligent Systems and Technology Volume 12, Issue 6

December 2021

356 pages

ISSN:2157-6904

EISSN:2157-6912

DOI:10.1145/3501281

Editor:
Huan Liu
Arizona State University, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 November 2021

Accepted: 01 August 2021

Revised: 01 July 2021

Received: 01 April 2021

Published in TIST Volume 12, Issue 6

Qualifiers

Survey
Refereed

Article Metrics

Total Citations: 35
Total Downloads: 2,896
Downloads (Last 12 months): 850
Downloads (Last 6 weeks): 79

Reflects downloads up to 19 Nov 2024

Citations

Cited By

Zhang S, Xu T, Zhu J, Sun Y, Jin P, Shi B, Pei D. 2025. Privacy-preserving MTS anomaly detection for network devices through federated learning. Information Sciences 690 (121590). Online publication date: Feb-2025. https://doi.org/10.1016/j.ins.2024.121590
Zha J, Shan X, Lu J, Zhu J, Liu Z. 2024. Leveraging Large Language Models for Efficient Alert Aggregation in AIOPs. Electronics 13:22 (4425). Online publication date: 12-Nov-2024. https://doi.org/10.3390/electronics13224425
Duan Y, Bao H, Bai G, Wei Y, Xue K, You Z, Zhang Y, Liu B, Chen J, Wang S, Ou Z. 2024. Learning to Diagnose: Meta-Learning for Efficient Adaptation in Few-Shot AIOps Scenarios. Electronics 13:11 (2102). Online publication date: 28-May-2024. https://doi.org/10.3390/electronics13112102
Shetty M, Chen Y, Somashekar G, Ma M, Simmhan Y, Zhang X, Mace J, Vandevoorde D, Las-Casas P, Gupta S, Nath S, Bansal C, Rajmohan S. 2024. Building AI Agents for Autonomous Clouds: Challenges and Design Principles. Proceedings of the 2024 ACM Symposium on Cloud Computing (99-110). Online publication date: 20-Nov-2024. https://dl.acm.org/doi/10.1145/3698038.3698525
Han Y, Du Q, Huang Y, Wu J, Tian F, He C, Filkov V, Ray B, Zhou M. 2024. The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (931-943). Online publication date: 27-Oct-2024. https://dl.acm.org/doi/10.1145/3691620.3695475
Ye W, Wang H, Cao S, Yan L. 2024. Design and Implementation of Fault Reconfiguration System Based on MQTT. 2024 6th International Conference on Electronics and Communication, Network and Computer Technology (ECNCT) (505-509). Online publication date: 19-Jul-2024. https://doi.org/10.1109/ECNCT63103.2024.10704320
Li P, Du Q, Zhao S. 2024. KEWS: A KPIs-Based Evaluation Framework of Workload Simulation On Microservice System. 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD) (1146-1152). Online publication date: 8-May-2024. https://doi.org/10.1109/CSCWD61410.2024.10580478
Tsubouchi Y, Tsuruta H. 2024. MetricSifter: Feature Reduction of Multivariate Time Series Data for Efficient Fault Localization in Cloud Applications. IEEE Access 12 (37398-37417). Online publication date: 2024. https://doi.org/10.1109/ACCESS.2024.3374334
Wu S, Guan J. 2024. Indicator Fault Detection Method Based on Periodic Self Discovery and Historical Anomaly Filtering. IEEE Access 12 (20530-20539). Online publication date: 2024. https://doi.org/10.1109/ACCESS.2024.3361672
Zhang C, Hu D, Yang T. 2024. Research of artificial intelligence operations for wind turbines considering anomaly detection, root cause analysis, and incremental training. Reliability Engineering & System Safety 241 (109634). Online publication date: Jan-2024. https://doi.org/10.1016/j.ress.2023.109634
