A Survey of AIOps Methods for Failure Management

Published: 30 November 2021

Abstract

Modern society increasingly relies on complex, distributed computing systems. The growing scale and complexity of these systems challenge the operations and maintenance (O&M) teams that perform daily monitoring and repair, even as demand rises for more reliable and scalable applications. For this reason, automated and intelligent monitoring systems have recently attracted much interest in both industry and academia. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges using Machine Learning, AI, and Big Data. However, AIOps as a research topic is still largely unstructured and unexplored, because there are no established conventions for categorizing contributions by their data requirements, target goals, and components. In this work, we focus on AIOps for Failure Management (FM), characterizing and describing 5 categories and 14 subcategories of contributions, based on their time intervention window and the target problem being solved. We review 100 FM solutions, focusing on their applicability requirements and the quantitative results they achieve, to facilitate the effective application of AIOps solutions. Finally, we discuss current development problems in the areas covered by AIOps and outline possible future trends for AI-based failure management.

References

[1]
Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. 2009. Spectrum-based multiple fault localization. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering. IEEE, 88–99. https://doi.org/10.1109/ASE.2009.25
[2]
Armen Aghasaryan, Eric Fabre, Albert Benveniste, Renée Boubour, and Claude Jard. 1998. Fault detection and diagnosis in distributed systems: An approach by partially stochastic petri nets. Discrete Event Dynam. Syst. 8, 2 (1998), 203–231. https://doi.org/10.1023/a:1008241818642
[3]
Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. 2003. Performance debugging for distributed systems of black boxes. ACM SIGOPS Operat. Syst. Rev. 37, 5 (Dec. 2003), 74–89. https://doi.org/10.1145/1165389.945454
[4]
Javier Alonso, Jordi Torres, Josep Ll. Berral, and Ricard Gavalda. 2010. Adaptive on-line software aging prediction based on machine learning. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems Networks (DSN’10). IEEE, 507–516. https://doi.org/10.1109/dsn.2010.5544275
[5]
J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J.-C. Fabre, J.-C. Laprie, E. Martins, and D. Powell. 1990. Fault injection for dependability validation: A methodology and some applications. IEEE Trans. Softw. Eng. 16, 2 (Feb. 1990), 166–182. https://doi.org/10.1109/32.44380
[6]
Mona Attariyan, Michael Chow, and Jason Flinn. 2012. X-Ray: Automating root-cause diagnosis of performance anomalies in production software. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 307–320. https://doi.org/10.5555/2387880.2387910
[7]
Julien Audibert, Pietro Michiardi, Frédéric Guyard, Sébastien Marti, and Maria A. Zuluaga. 2020. USAD: UnSupervised anomaly detection on multivariate time series. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’20). ACM, New York, NY, 3395–3404. https://doi.org/10.1145/3394486.3403392
[8]
Tom Auld, Andrew W. Moore, and Stephen F. Gull. 2007. Bayesian neural networks for internet traffic classification. IEEE Trans. Neural Netw. 18, 1 (Jan. 2007), 223–239. https://doi.org/10.1109/tnn.2006.883010
[9]
Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, and Ming Zhang. 2007. Towards highly reliable enterprise network services via inference of multi-level dependencies. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM’07). ACM, New York, NY, 13–24. https://doi.org/10.1145/1282380.1282383
[10]
Chetan Bansal, Sundararajan Renganathan, Ashima Asudani, Olivier Midy, and Mathru Janakiraman. 2020. DeCaf: Diagnosing and triaging performance issues in large-scale cloud services. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP’20). ACM, New York, NY, 201–210. https://doi.org/10.1145/3377813.3381353
[11]
Paul Barham, Rebecca Isaacs, Richard Mortier, and Dushyanth Narayanan. 2003. Magpie: Online modelling and performance-aware systems. In Proceedings of the 9th Conference on Hot Topics in Operating Systems, Vol. 9. USENIX Association, 15. https://doi.org/10.5555/1251054.1251069
[12]
Ivan Beschastnikh, Yuriy Brun, Michael D. Ernst, and Arvind Krishnamurthy. 2014. Inferring models of concurrent systems from logs of their behavior with CSight. In Proceedings of the 36th International Conference on Software Engineering (ICSE’14). ACM, New York, NY, 468–479. https://doi.org/10.1145/2568225.2568246
[13]
Netflix Technology Blog. 2016. Netflix Chaos Monkey Upgraded. Retrieved from https://netflixtechblog.com/netflix-chaos-monkey-upgraded-1d679429be5d.
[14]
Peter Bodik, Moises Goldszmidt, Armando Fox, Dawn B. Woodard, and Hans Andersen. 2010. Fingerprinting the datacenter: Automated classification of performance crises. In Proceedings of the 5th European Conference on Computer Systems (EuroSys’10). ACM, New York, NY, 111–124. https://doi.org/10.1145/1755913.1755926
[15]
A. T. Bouloutas, S. Calo, and A. Finkel. 1994. Alarm correlation and fault identification in communication networks. IEEE Trans. Commun. 42, 2/3/4 (2 1994), 523–533. https://doi.org/10.1109/tcomm.1994.577079
[16]
L. C. Briand, J. W. Daly, and J. K. Wust. 1999. A unified framework for coupling measurement in object-oriented systems. IEEE Trans. Softw. Eng. 25, 1 (1 1999), 91–121. https://doi.org/10.1109/32.748920
[17]
Broadcom. 2020. AIOps—Broadcom. Retrieved from https://www.broadcom.com/products/software/aiops.
[18]
Andy Brown, Aaron Tuor, Brian Hutchinson, and Nicole Nichols. 2018. Recurrent neural network attention mechanisms for interpretable system log anomaly detection. In Proceedings of the 1st Workshop on Machine Learning for Computing Systems (MLCS’18). ACM, New York, NY, Article 1, 8 pages. https://doi.org/10.1145/3217871.3217872
[19]
Lisa Burnell and Eric Horvitz. 1995. Structure and chance: Melding logic and probability for software debugging. Commun. ACM 38, 3 (Mar. 1995), 31–ff.https://doi.org/10.1145/203330.203338
[20]
K. L. Butler and J. A. Momoh. 1999. A neural net based approach for fault diagnosis in distribution networks. In Proceedings of the IEEE Power Engineering Society, Vol. 1. IEEE, 353–356. https://doi.org/10.1109/PESW.1999.747478
[21]
V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, and W. P. Zeggert. 2001. Proactive management of software aging. IBM J. Res. Dev. 45, 2 (Mar. 2001), 311–332. https://doi.org/10.1147/rd.452.0311
[22]
Raghavendra Chalapathy and Sanjay Chawla. 2019. Deep Learning for Anomaly Detection: A Survey. Retrieved from http://arxiv.org/abs/1901.03407.
[23]
Thanyalak Chalermarrewong, Tiranee Achalakul, and Simon Chong Wee See. 2012. Failure prediction of data centers using time series and fault tree analysis. In Proceedings of the IEEE 18th International Conference on Parallel and Distributed Systems. IEEE, 794–799. https://doi.org/10.1109/icpads.2012.129
[24]
Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. Comput. Surveys 41, 3 (July 2009), 15:1–15:58. https://doi.org/10.1145/1541880.1541882
[25]
M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. 2002. Pinpoint: Problem determination in large, dynamic Internet services. In Proceedings of the International Conference on Dependable Systems and Networks. IEEE, 595–604. https://doi.org/10.1109/DSN.2002.1029005
[26]
Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, and Eric Brewer. 2004. Path-based failure and evolution management. In Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation (NSDI’04). USENIX Association, 23. Retrieved from https://dl.acm.org/doi/10.5555/1251175.1251198.
[27]
Xin Chen, Charng-Da Lu, and Karthik Pattabiraman. 2014. Failure analysis of jobs in compute clouds: A Google cluster case study. In Proceedings of the IEEE 25th International Symposium on Software Reliability Engineering. IEEE, 167–177. https://doi.org/10.1109/issre.2014.34
[28]
S. R. Chidamber and C. F. Kemerer. 1994. A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20, 6 (6 1994), 476–493. https://doi.org/10.1109/32.295895
[29]
Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. The mystery machine: End-to-end performance analysis of large-scale internet services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, 217–231. https://dl.acm.org/doi/10.5555/2685048.2685066
[30]
Holger Cleve and Andreas Zeller. 2005. Locating causes of program failures. In Proceedings of the 27th International Conference on Software Engineering (ICSE’05). ACM, New York, NY, 342–351. https://doi.org/10.1145/1062455.1062522
[31]
Ira Cohen, Moises Goldszmidt, Terence Kelly, Julie Symons, and Jeffrey S. Chase. 2004. Correlating instrumentation data to system states: A building block for automated diagnosis and control. In Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation (OSDI’04). USENIX Association, 16. Retrieved from https://dl.acm.org/doi/10.5555/1251254.1251270.
[32]
Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, and Armando Fox. 2005. Capturing, indexing, clustering, and retrieving system history. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP’05). ACM, New York, NY, 105–118. https://doi.org/10.1145/1095810.1095821
[33]
Carlos H. A. Costa, Yoonho Park, Bryan S. Rosenburg, Chen-Yong Cher, and Kyung Dong Ryu. 2014. A system software approach to proactive memory-error avoidance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14). IEEE, 12 pages. https://doi.org/10.1109/SC.2014.63
[34]
A. Csenki. 1990. Bayes predictive analysis of a fundamental software reliability model. IEEE Trans. Reliabil. 39, 2 (June 1990), 177–183. https://doi.org/10.1109/24.55879
[35]
Marco D’Ambros, Michele Lanza, and Romain Robbes. 2011. Evaluating defect prediction approaches: A benchmark and an extensive comparison. Empir. Softw. Eng. 17, 4–5 (Aug. 2011), 531–577. https://doi.org/10.1007/s10664-011-9173-9
[36]
Yingnong Dang, Qingwei Lin, and Peng Huang. 2019. AIOps: Real-world challenges and research innovations. In Proceedings of the IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE’19). IEEE, 4–5. https://doi.org/10.1109/icse-companion.2019.00023
[37]
Nickolas Allen Davis, Abdelmounaam Rezgui, Hamdy Soliman, Skyler Manzanares, and Milagre Coates. 2017. FailureSim: A system for predicting hardware failures in cloud data centers using neural networks. In Proceedings of the IEEE 10th International Conference on Cloud Computing (CLOUD’17). IEEE, 544–551. https://doi.org/10.1109/cloud.2017.75
[38]
Karel Dejaeger, Thomas Verbraken, and Bart Baesens. 2013. Toward comprehensible software fault prediction models using Bayesian network classifiers. IEEE Trans. Softw. Eng. 39, 2 (Feb. 2013), 237–257. https://doi.org/10.1109/tse.2012.20
[39]
B. Dhanalaxmi, G. Apparao Naidu, and K. Anuradha. 2015. A review on software fault detection and prevention mechanism in software development activities. J. Comput. Eng. 17, 6 (2015), 25–30.
[40]
Peter A. Dinda and David R. O’Hallaron. 1999. An evaluation of linear models for host load prediction. In Proceedings of The 8th International Symposium on High Performance Distributed Computing. IEEE, 87–96.
[41]
Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. 2017. DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security. ACM, 1285–1298. https://doi.org/10.1145/3133956.3134015
[42]
Karim O. Elish and Mahmoud O. Elish. 2008. Predicting defect-prone software modules using support vector machines. J. Syst. Softw. 81, 5 (May 2008), 649–660. https://doi.org/10.1016/j.jss.2007.07.040
[43]
Alice Este, Francesco Gringoli, and Luca Salgarelli. 2009. Support vector machines for TCP traffic classification. Comput. Netw. 53, 14 (Sep. 2009), 2476–2490. https://doi.org/10.1016/j.comnet.2009.05.003
[44]
Ilenia Fronza, Alberto Sillitti, Giancarlo Succi, Mikko Terho, and Jelena Vlasenko. 2013. Failure prediction based on log files using random indexing and support vector machines. J. Syst. Softw. 86, 1 (1 2013), 2–11. https://doi.org/10.1016/j.jss.2012.06.025
[45]
Qiang Fu, Jian-Guang Lou, Yi Wang, and Jiang Li. 2009. Execution anomaly detection in distributed systems through unstructured log analysis. In Proceedings of the 9th IEEE International Conference on Data Mining. IEEE Computer Society, 149–158. https://doi.org/10.1109/icdm.2009.60
[46]
Zhiwei Gao, Carlo Cecati, and Steven X. Ding. 2015. A survey of fault diagnosis and fault-tolerant techniques—Part I: Fault diagnosis with model-based and signal-based approaches. IEEE Trans. Industr. Electr. 62, 6 (June 2015), 3757–3767. https://doi.org/10.1109/tie.2015.2417501
[47]
Zhiwei Gao, Carlo Cecati, and Steven X. Ding. 2015. A survey of fault diagnosis and fault-tolerant techniques—Part II: Fault diagnosis with knowledge-based and hybrid/active approaches. IEEE Trans. Industr. Electr. 62, 6 (June 2015), 3768–3774. https://doi.org/10.1109/TIE.2015.2419013
[48]
S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi. 1995. Analysis of software rejuvenation using Markov regenerative stochastic petri net. In Proceedings of the 6th International Symposium on Software Reliability Engineering (ISSRE’95) (1995). IEEE, 180–187. https://doi.org/10.1109/issre.1995.497656
[49]
S. Garg, A. van Moorsel, K. Vaidyanathan, and K. S. Trivedi. 1998. A methodology for detection and estimation of software aging. In Proceedings of the 9th International Symposium on Software Reliability Engineering. IEEE, 283–292. https://doi.org/10.1109/issre.1998.730892
[50]
Emanuel Giger, Marco D’Ambros, Martin Pinzger, and Harald C. Gall. 2012. Method-level bug prediction. In Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM’12). ACM, New York, NY, 171–180. https://doi.org/10.1145/2372251.2372285
[51]
Tarun Goyal, Ajit Singh, and Aakanksha Agrawal. 2012. Cloudsim: Simulator for cloud computing infrastructure and modeling. Proc. Eng. 38 (Nov. 2012), 3566–3572. https://doi.org/10.1016/j.proeng.2012.06.412
[52]
T. L. Graves, A. F. Karr, J. S. Marron, and H. Siy. 2000. Predicting fault incidence using software change history. IEEE Trans. Softw. Eng. 26, 7 (July 2000), 653–661. https://doi.org/10.1109/32.859533
[53]
Maurice H. Halstead. 1977. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc.
[54]
Greg Hamerly and Charles Elkan. 2001. Bayesian approaches to failure prediction for disk drives. In Proceedings of the 18th International Conference on Machine Learning (ICML’01). Morgan Kaufmann, San Francisco, CA, 202–209. https://doi.org/10.5555/645530.655825
[55]
Seungjae Han, K. G. Shin, and H. A. Rosenberg. 1995. DOCTOR: An integrated software fault injection environment for distributed real-time systems. In Proceedings of the IEEE International Computer Performance and Dependability Symposium. IEEE, 204–213. https://doi.org/10.1109/IPDS.1995.395831
[56]
J. L. Hellerstein, Fan Zhang, and P. Shahabuddin. 1999. An approach to predictive detection for service management. In Proceedings of the 6th IFIP/IEEE International Symposium on Integrated Network Management. IEEE, 309–322. https://doi.org/10.1109/inm.1999.770691
[57]
Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton. 1995. Software rejuvenation: Analysis, module and applications. In Proceedings of the 25th International Symposium on Fault-Tolerant Computing. Digest of Papers. IEEE, 381–390. https://doi.org/10.1109/FTCS.1995.466961
[58]
G. F. Hughes, J. F. Murray, K. Kreutz-Delgado, and C. Elkan. 2002. Improved disk-drive failure warnings. IEEE Trans. Reliabil. 51, 3 (Sep. 2002), 350–357. https://doi.org/10.1109/TR.2002.802886
[59]
Monica Hutchins, Herb Foster, Tarak Goradia, and Thomas Ostrand. 1994. Experiments on the effectiveness of dataflow- and control-flow-based test adequacy criteria—IEEE conference publication. In Proceedings of the 16th International Conference on Software Engineering. IEEE, 191–200. Retrieved from https://ieeexplore.ieee.org/document/296778.
[60]
Olumuyiwa Ibidunmoye, Francisco Hernández-Rodriguez, and Erik Elmroth. 2015. Performance anomaly detection and bottleneck identification. Comput. Surveys 48, 1 (Sep. 2015), 1–35. https://doi.org/10.1145/2791120
[61]
Tariqul Islam and Dakshnamoorthy Manivannan. 2017. Predicting application failure in cloud: A machine learning approach. In Proceedings of the IEEE International Conference on Cognitive Computing (ICCC’17). IEEE, 24–31. https://doi.org/10.1109/ieee.iccc.2017.11
[62]
Itthichok Jangjaimon and Nian-Feng Tzeng. 2015. Effective cost reduction for elastic clouds under spot instance pricing through adaptive checkpointing. IEEE Trans. Comput. 64, 2 (Feb. 2015), 396–409. https://doi.org/10.1109/tc.2013.225
[63]
David Jauk, Dai Yang, and Martin Schulz. 2019. Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’19). ACM, New York, NY, Article 30, 13 pages. https://doi.org/10.1145/3295500.3356185
[64]
Srikanth Kandula, Dina Katabi, and Jean-Philippe Vasseur. 2005. Shrink: A tool for failure diagnosis in IP networks. In Proceeding of the ACM SIGCOMM Workshop on Mining Network Data (MineNet’05). ACM, New York, NY, 6 pages. https://doi.org/10.1145/1080173.1080178
[65]
N. Karunanithi, D. Whitley, and Y. K. Malaiya. 1992. Prediction of software reliability using connectionist models. IEEE Trans. Softw. Eng. 18, 7 (July 1992), 563–574. https://doi.org/10.1109/32.148475
[66]
Taghi M. Khoshgoftaar and David L. Lanning. 1995. A neural network approach for early detection of program modules having high risk in the maintenance phase. J. Syst. Softw. 29, 1 (Apr. 1995), 85–91. https://doi.org/10.1016/0164-1212(94)00130-f
[67]
Barbara A. Kitchenham, David Budgen, and O. Pearl Brereton. 2010. The value of mapping studies - A participant-observer case study. In Proceedings of the 14th International Conference on Evaluation and Assessment in Software Engineering (EASE’10). BCS Learning & Development, 25–33. https://doi.org/10.14236/ewic/EASE2010.4
[68]
S. Kliger, S. Yemini, Y. Yemini, D. Ohsie, and S. Stolfo. 1995. A Coding Approach to Event Correlation. Springer US, Boston, MA, 266–277. https://doi.org/10.1007/978-0-387-34890-2_24
[69]
Yevgeniy Sverdlik Data Center Knowledge. 2016. What Facebook Has Learned from Regularly Shutting Down Entire Data Centers. Retrieved from https://www.datacenterknowledge.com/archives/2016/08/31/facebook-learned-regularly-shutting-entire-data-centers.
[70]
Khairy A. H. Kobbacy and Sunil Vadera. 2011. A survey of AI in operations management from 2005 to 2009. J. Manufact. Technol. Manage. 22, 6 (July 2011), 706–733. https://doi.org/10.1108/17410381111149602
[71]
K. A. H. Kobbacy, S Vadera, and M. H. Rasmy. 2007. AI and OR in management of operations: History and trends. J. Oper. Res. Soc. 58, 1 (Jan. 2007), 10–28. https://doi.org/10.1057/palgrave.jors.2602132
[72]
Anukool Lakhina, Mark Crovella, and Christophe Diot. 2004. Diagnosing network-wide traffic anomalies. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM’04). ACM, New York, NY, 219–230. https://doi.org/10.1145/1015467.1015492
[73]
Anukool Lakhina, Mark Crovella, and Christophe Diot. 2005. Mining anomalies using traffic feature distributions. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (Philadelphia, Pennsylvania) (SIGCOMM’05). ACM, New York, NY, 217–228. https://doi.org/10.1145/1080091.1080118
[74]
Andrew Lerner. 2017. AIOps Platforms—Gartner. Retrieved from https://blogs.gartner.com/andrew-lerner/2017/08/09/aiops-platforms/.
[75]
Anna Levin, Shelly Garion, Elliot K. Kolodner, Dean H. Lorenz, Katherine Barabash, Mike Kugler, and Niall McShane. 2019. AIOps for a cloud object storage service. In Proceedings of the IEEE International Congress on Big Data (BigDataCongress’19). IEEE, 165–169. https://doi.org/10.1109/BigDataCongress.2019.00036
[76]
Jian Li, Pinjia He, Jieming Zhu, and Michael R. Lyu. 2017. Software defect prediction via convolutional neural network. In Proceedings of the IEEE International Conference on Software Quality, Reliability and Security (QRS’17). IEEE, 318–328. https://doi.org/10.1109/qrs.2017.42
[77]
Jing Li, Xinpu Ji, Yuhan Jia, Bingpeng Zhu, Gang Wang, Zhongwei Li, and Xiaoguang Liu. 2014. Hard drive failure prediction using classification and regression trees. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. IEEE, 383–394. https://doi.org/10.1109/dsn.2014.44
[78]
Jing Li, Rebecca J. Stones, Gang Wang, Zhongwei Li, Xiaoguang Liu, and Kang Xiao. 2016. Being accurate is not enough: New metrics for disk failure prediction. In Proceedings of the IEEE 35th Symposium on Reliable Distributed Systems (SRDS’16). IEEE, 71–80. https://doi.org/10.1109/srds.2016.019
[79]
Yangguang Li, Zhen Ming (Jack) Jiang, Heng Li, Ahmed E. Hassan, Cheng He, Ruirui Huang, Zhengda Zeng, Mian Wang, and Pinan Chen. 2020. Predicting node failures in an ultra-large-scale cloud computing platform: An AIOps solution. ACM Trans. Softw. Eng. Methodol. 29, 2 (27 4 2020), 13:1–13:24. https://doi.org/10.1145/3385187
[80]
Zeyan Li, Chengyang Luo, Yiwei Zhao, Yongqian Sun, Kaixin Sui, Xiping Wang, Dapeng Liu, Xing Jin, Qi Wang, and Dan Pei. 2019. Generic and robust localization of multi-dimensional root causes. In Proceedings of the IEEE 30th International Symposium on Software Reliability Engineering (ISSRE’19). IEEE, IEEE, 47–57.
[81]
Yinglung Liang, Yanyong Zhang, Hui Xiong, and Ramendra Sahoo. 2007. Failure prediction in IBM bluegene/l event logs. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM’07). IEEE, 583–588. https://doi.org/10.1109/icdm.2007.46
[82]
Fan Lin, Matt Beadon, Harish Dattatraya Dixit, Gautham Vunnam, Amol Desai, and Sriram Sankar. 2018. Hardware remediation at scale. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W’18). IEEE, 14–17. https://doi.org/10.1109/dsn-w.2018.00015
[83]
Fred Lin, Keyur Muzumdar, Nikolay Pavlovich Laptev, Mihai-Valentin Curelea, Seunghak Lee, and Sriram Sankar. 2020. Fast dimensional analysis for root cause investigation in a large-scale service environment. Proc. ACM Meas. Anal. Comput. Syst. 4, 2, Article 31 (June 2020), 23 pages. https://doi.org/10.1145/3392149
[84]
Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, and Xuewei Chen. 2016. Log clustering based problem identification for online service systems. In Proceedings of the 38th International Conference on Software Engineering Companion (ICSE’16). ACM, New York, NY, 102–111. https://doi.org/10.1145/2889160.2889232
[85]
Chao Liu, Xifeng Yan, Long Fei, Jiawei Han, and Samuel P. Midkiff. 2005. SOBER: Statistical model-based bug localization. ACM SIGSOFT Softw. Eng. Notes 30, 5 (Sep. 2005), 286. https://doi.org/10.1145/1095430.1081753
[86]
Dapeng Liu, Youjian Zhao, Haowen Xu, Yongqian Sun, Dan Pei, Jiao Luo, Xiaowei Jing, and Mei Feng. 2015. Opprentice: Towards practical and automatic anomaly detection through machine learning. In Proceedings of the Internet Measurement Conference. ACM, New York, NY, 211–224. https://doi.org/10.1145/2815675.2815679
[87]
David Lo, Hong Cheng, Jiawei Han, Siau-Cheng Khoo, and Chengnian Sun. 2009. Classification of software behaviors for failure detection: A discriminative pattern mining approach. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09). ACM, New York, NY, 557–566. https://doi.org/10.1145/1557019.1557083
[88]
Ao Ma, Fred Douglis, Guanlin Lu, Darren Sawyer, Surendar Chandra, and Windsor Hsu. 2015. RAIDShield: Characterizing, monitoring, and proactively protecting against disk failures. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST’15). USENIX Association, 241–256. https://www.usenix.org/conference/fast15/technical-sessions/presentation/ma
[89]
Farzaneh Mahdisoltani, Ioan Stefanovici, and Bianca Schroeder. 2017. Proactive error prediction to improve storage system reliability. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC’17). USENIX Association, 391–402. Retrieved from https://www.usenix.org/conference/atc17/technical-sessions/presentation/mahdisoltani.
[90]
T. J. McCabe. 1976. A complexity measure. IEEE Trans. Softw. Eng. SE-2, 4 (Dec. 1976), 308–320. https://doi.org/10.1109/TSE.1976.233837
[91]
Meiliana, Syaeful Karim, Harco Leslie Hendric Spits Warnars, Ford Lumban Gaol, Edi Abdurachman, and Benfano Soewito. 2017. Software metrics for fault prediction using machine learning approaches: A literature review with PROMISE repository dataset. In Proceedings of the IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom’17). IEEE, 19–23. https://doi.org/10.1109/cyberneticscom.2017.8311708
[92]
Weibin Meng, Ying Liu, Yichen Zhu, Shenglin Zhang, Dan Pei, Yuqing Liu, Yihao Chen, Ruizhi Zhang, Shimin Tao, Pei Sun, et al. 2019. LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI’19). 4739–4745. https://doi.org/10.24963/ijcai.2019/658
[93]
Tim Menzies. 2004. PROMISE DATASETS PAGE. Retrieved from http://promise.site.uottawa.ca/SERepository/datasets-page.html.
[94]
Tim Menzies, Jeremy Greenwald, and Art Frank. 2007. Data mining static code attributes to learn defect predictors. IEEE Trans. Softw. Eng. 33, 1 (Jan. 2007), 2–13. https://doi.org/10.1109/TSE.2007.256941
[95]
Adam Moody, Greg Bronevetsky, Kathryn Mohror, and de Bronis R. Supinski. 2010. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–11. https://doi.org/10.1109/sc.2010.18
[96]
Moogsoft. 2020. What Is AIOps? Moogsoft. Retrieved from https://www.moogsoft.com/resources/aiops/guide/everything-aiops/.
[97]
Andrew W. Moore and Denis Zuev. 2005. Internet traffic classification using bayesian analysis techniques. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’05). ACM, New York, NY, 50–60. https://doi.org/10.1145/1064212.1064220
[98]
Raimund Moser, Witold Pedrycz, and Giancarlo Succi. 2008. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Proceedings of the 13th International Conference on Software Engineering. ACM, New York, NY, 181–190. https://doi.org/10.1145/1368088.1368114
[99]
Mukosi Abraham Mukwevho and Turgay Celik. 2021. Toward a smart cloud: A review of fault-tolerance methods in cloud systems. IEEE Trans. Serv. Comput. 14, 2 (2021), 589–605. https://doi.org/10.1109/tsc.2018.2816644
[100]
Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. 2003. Hard drive failure prediction using non-parametric statistical methods. Retrieved from http://dsp.ucsd.edu/jfmurray/publications/Murray2003.pdf.
[101]
Joseph F. Murray, Gordon F. Hughes, and Kenneth Kreutz-Delgado. 2005. Machine learning methods for predicting failures in hard drives: A multiple-instance application. J. Mach. Learn. Res. 6 (Jan. 2005), 783–816. https://doi.org/10.5555/1046920.1088699
[102]
Nachiappan Nagappan, Thomas Ball, and Andreas Zeller. 2006. Mining metrics to predict component failures. In Proceeding of the 28th International Conference on Software Engineering (ICSE’06). ACM, New York, NY, 452–461. https://doi.org/10.1145/1134285.1134349
[103]
Jaechang Nam, Sinno Jialin Pan, and Sunghun Kim. 2013. Transfer defect learning. In Proceedings of the 35th International Conference on Software Engineering (ICSE’13). IEEE, 382–391. https://doi.org/10.1109/icse.2013.6606584
[104]
Iyswarya Narayanan, Kushagra Vaid, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, and Badriddine Khessib. 2016. SSD failures in datacenters: What? When? and Why? In Proceedings of the 9th ACM International on Systems and Storage Conference (SYSTOR’16). ACM, New York, NY, Article 7, 11 pages. https://doi.org/10.1145/2928275.2928278
[105]
Roberto Natella, Domenico Cotroneo, Joao A. Duraes, and Henrique S. Madeira. 2013. On fault representativeness of software fault injection. IEEE Trans. Softw. Eng. 39, 1 (Jan. 2013), 80–96. https://doi.org/10.1109/tse.2011.124
[106]
Roberto Natella, Domenico Cotroneo, and Henrique S. Madeira. 2016. Assessing dependability with software fault injection: A survey. Comput. Surveys 48, 3 (Aug. 2016), 44:1–44:55. https://doi.org/10.1145/2841425
[107]
Sasho Nedelkoski, Jorge Cardoso, and Odej Kao. 2019. Anomaly detection and classification using distributed tracing and deep learning. In Proceedings of the 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID’19). IEEE, 241–250. https://doi.org/10.1109/ccgrid.2019.00038
[108]
NetManAIOps. 2019. SMD Dataset—OmniAnomaly. Retrieved from https://github.com/NetManAIOps/OmniAnomaly.
[109]
Clodoaldo Brasilino Leite Neto, Pedro Batista De Carvalho Filho, and Alexandre Nóbrega Duarte. 2013. A systematic mapping study on fault management in cloud computing. In Proceedings of the International Conference on Parallel and Distributed Computing, Applications and Technologies. IEEE, 332–337. https://doi.org/10.1109/PDCAT.2013.59
[110]
Hiep Nguyen, Zhiming Shen, Yongmin Tan, and Xiaohui Gu. 2013. FChain: Toward black-box online fault localization for cloud systems. In Proceedings of the IEEE 33rd International Conference on Distributed Computing Systems. IEEE, 21–30. https://doi.org/10.1109/icdcs.2013.26
[111]
Thuy T. T. Nguyen and Grenville Armitage. 2008. A survey of techniques for internet traffic classification using machine learning. IEEE Commun. Surveys Tutor. 10, 4 (2008), 56–76. https://doi.org/10.1109/surv.2008.080406
[112]
Changhai Nie and Hareton Leung. 2011. A survey of combinatorial testing. Comput. Surveys 43, 2 (Apr. 2011), 11:1–11:29. https://doi.org/10.1145/1883612.1883618
[113]
Paolo Notaro, Jorge Cardoso, and Michael Gerndt. 2020. A systematic mapping study in AIOps. In Proceedings of the International Conference on Service-oriented Computing (ICSOC’20). Workshops: AIOps, CFTIC, STRAPS, AI-PA, AI-IOTS, and Satellite Events. Springer, 110–123. Retrieved from http://arxiv.org/abs/2012.09108.
[114]
H. Okamura, Y. Nishimura, and T. Dohi. 2004. A dynamic checkpointing scheme based on reinforcement learning. In Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing. IEEE, 151–158. https://doi.org/10.1109/PRDC.2004.1276566
[115]
Ahmet Okutan and Olcay Taner Yıldız. 2012. Software defect prediction using Bayesian networks. Empir. Softw. Eng. 19, 1 (Aug. 2012), 154–181. https://doi.org/10.1007/s10664-012-9218-8
[116]
OpsRamp. 2020. AIOps (AI for IT Operations)—OpsRamp. Retrieved from https://www.opsramp.com/solutions/service-centric-aiops/.
[117]
T. J. Ostrand, E. J. Weyuker, and R. M. Bell. 2005. Predicting the location and number of faults in large software systems. IEEE Trans. Softw. Eng. 31, 4 (Apr. 2005), 340–355. https://doi.org/10.1109/tse.2005.49
[118]
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07). USENIX Association, 2. Retrieved from https://www.usenix.org/conference/fast-07/failure-trends-large-disk-drive-population.
[119]
Teerat Pitakrat, Dušan Okanović, André van Hoorn, and Lars Grunske. 2018. Hora: Architecture-aware online failure prediction. J. Syst. Softw. 137 (Mar. 2018), 669–685. https://doi.org/10.1016/j.jss.2017.02.041
[120]
A. Podgurski, D. Leon, P. Francis, W. Masri, M. Minch, Jiayang Sun, and Bin Wang. 2003. Automated support for classifying software failure reports. In Proceedings of the 25th International Conference on Software Engineering. IEEE, 465–475. https://doi.org/10.1109/icse.2003.1201224
[121]
M. Renieris and S. P. Reiss. 2003. Fault localization with nearest neighbor queries. In Proceedings of the 18th IEEE International Conference on Automated Software Engineering. IEEE, 30–39. https://ieeexplore.ieee.org/document/1240292
[122]
Felix Salfner, Maren Lenk, and Miroslaw Malek. 2010. A survey of online failure prediction methods. Comput. Surveys 42, 3 (Mar. 2010), 1–42. https://doi.org/10.1145/1670679.1670680
[123]
Felix Salfner and Miroslaw Malek. 2007. Using hidden semi-Markov models for effective online failure prediction. In Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS’07). IEEE, 161–174. https://doi.org/10.1109/srds.2007.35
[124]
Areeg Samir and Claus Pahl. 2019. A Controller Architecture for Anomaly Detection, Root Cause Analysis and Self-Adaptation for Cluster Architectures. Retrieved from https://orbilu.uni.lu/handle/10993/42062.
[125]
Mark Schwabacher and Kai Goebel. 2007. A survey of artificial intelligence for prognostics. In Proceedings of the AAAI Fall Symposium on Artificial Intelligence for Prognostics. AAAI, 108–115. Retrieved from https://www.aaai.org/Library/Symposia/Fall/2007/fs07-02-016.php.
[126]
Qihong Shao, Yi Chen, Shu Tao, Xifeng Yan, and Nikos Anerousis. 2008. Efficient ticket routing by resolution sequence mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08). ACM, New York, NY, 605–613. https://doi.org/10.1145/1401890.1401964
[127]
Bikash Sharma, Praveen Jayachandran, Akshat Verma, and Chita R. Das. 2013. CloudPD: Problem determination and diagnosis in shared dynamic clouds. In Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’13). IEEE, 1–12. https://doi.org/10.1109/dsn.2013.6575298
[128]
Akbar Siami Namin, James H. Andrews, and Duncan J. Murdoch. 2008. Sufficient mutation operators for measuring test effectiveness. In Proceedings of the 13th International Conference on Software Engineering (ICSE’08). ACM, New York, NY, 351–360. https://doi.org/10.1145/1368088.1368136
[129]
BMC Software. 2020. AIOps—BMC. Retrieved from https://www.bmc.com/it-solutions/aiops.html.
[130]
Marc Solé, Victor Muntés-Mulero, Annie Ibrahim Rana, and Giovani Estrada. 2017. Survey on Models and Techniques for Root-Cause Analysis. Retrieved from http://arxiv.org/abs/1701.08546.
[131]
Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. 2019. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’19). ACM, New York, NY, 2828–2837.
[132]
Yongqian Sun, Youjian Zhao, Ya Su, Dapeng Liu, Xiaohui Nie, Yuan Meng, Shiwen Cheng, Dan Pei, Shenglin Zhang, Xianping Qu et al. 2018. Hotspot: Anomaly localization for additive KPIs with multi-dimensional attributes. IEEE Access 6 (2018), 10909–10923.
[133]
Resolve Systems. 2020. What is AIOps?—Resolve. Retrieved from https://resolve.io/what-is-aiops.
[134]
D. Tang and R. K. Iyer. 1993. Dependability measurement and modeling of a multicomputer system. IEEE Trans. Comput. 42, 1 (1993), 62–75. https://doi.org/10.1109/12.192214
[135]
Timothy K. Tsai and Ravishankar K. Iyer. 1995. FTAPE: A fault injection tool to measure fault tolerance. NASA STI/Recon Technical Report. 25333 pages. https://doi.org/10.2514/6.1995-1041
[136]
K. Vaidyanathan and K. S. Trivedi. 1999. A measurement-based model for estimation of resource exhaustion in operational software systems. In Proceedings of the 10th International Symposium on Software Reliability Engineering. IEEE, 84–93. https://doi.org/10.1109/issre.1999.809313
[137]
K. Vaidyanathan and K. S. Trivedi. 2005. A comprehensive model for software rejuvenation. IEEE Trans. Depend. Secure Comput. 2, 2 (Feb. 2005), 124–137. https://doi.org/10.1109/tdsc.2005.15
[138]
Kashi Venkatesh Vishwanath and Nachiappan Nagappan. 2010. Characterizing cloud computing hardware reliability. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC’10). ACM, New York, NY, 193–204. https://doi.org/10.1145/1807128.1807161
[139]
Jing Wang, Daniel Rossell, Christos G. Cassandras, and Ioannis Ch. Paschalidis. 2013. Network anomaly detection: A survey and comparative analysis of stochastic and deterministic methods. In Proceedings of the 52nd IEEE Conference on Decision and Control. IEEE, 182–187. https://doi.org/10.1109/CDC.2013.6759879
[140]
Qing Wang, Wubai Zhou, Chunqiu Zeng, Tao Li, Larisa Shwartz, and Genady Ya. Grabarnik. 2017. Constructing the knowledge base for cognitive IT service management. In Proceedings of the IEEE International Conference on Services Computing (SCC’17). IEEE, 410–417. https://doi.org/10.1109/scc.2017.59
[141]
Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering (ICSE’16). ACM, New York, NY, 297–308. https://doi.org/10.1145/2884781.2884804
[142]
Wei Wang, Ming Zhu, Jinlin Wang, Xuewen Zeng, and Zhongzhen Yang. 2017. End-to-end encrypted traffic classification with one-dimensional convolution neural networks. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics (ISI’17). IEEE, 43–48. https://doi.org/10.1109/isi.2017.8004872
[143]
Yu Wang, Qiang Miao, Eden W. M. Ma, Kwok-Leung Tsui, and Michael G. Pecht. 2013. Online anomaly detection for hard disk drives based on Mahalanobis distance. IEEE Trans. Reliabil. 62, 1 (Mar. 2013), 136–145. https://doi.org/10.1109/tr.2013.2241204
[144]
Amy Ward, Peter Glynn, and Kathy Richardson. 1998. Internet service performance failure detection. ACM SIGMETRICS Perform. Eval. Rev. 26, 3 (Dec. 1998), 38–43. https://doi.org/10.1145/306225.306237
[145]
W. Eric Wong, Vidroha Debroy, Ruizhi Gao, and Yihao Li. 2014. The DStar method for effective software fault localization. IEEE Trans. Reliabil. 63, 1 (Mar. 2014), 290–308. https://doi.org/10.1109/tr.2013.2285319
[146]
W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A survey on software fault localization. IEEE Trans. Softw. Eng. 42, 8 (Aug. 2016), 707–740. https://doi.org/10.1109/tse.2016.2521368
[147]
Rongxin Wu, Hongyu Zhang, Sunghun Kim, and Shing-Chi Cheung. 2011. ReLink: Recovering links between bugs and changes. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE’11). ACM, 15–25. https://doi.org/10.1145/2025113.2025120
[148]
Jiang Xiao, Zhuang Xiong, Song Wu, Yusheng Yi, Hai Jin, and Kan Hu. 2018. Disk failure prediction in data centers via online learning. In Proceedings of the 47th International Conference on Parallel Processing (ICPP’18). ACM, New York, NY, Article 35, 10 pages. https://doi.org/10.1145/3225058.3225106
[149]
Chang Xu, Gang Wang, Xiaoguang Liu, Dongdong Guo, and Tie-Yan Liu. 2016. Health status assessment and failure prediction for hard drives with recurrent neural networks. IEEE Trans. Comput. 65, 11 (Nov. 2016), 3502–3508. https://doi.org/10.1109/tc.2016.2538237
[150]
Haowen Xu, Yang Feng, Jie Chen, Zhaogang Wang, Honglin Qiao, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li et al. 2018. Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In Proceedings of the World Wide Web Conference (WWW’18). ACM, New York, NY, 187–196. https://doi.org/10.1145/3178876.3185996
[151]
Wei Xu, Ling Huang, Armando Fox, David Patterson, and Michael I. Jordan. 2009. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP’09). ACM, New York, NY, 117–132. https://doi.org/10.1145/1629575.1629587
[152]
Zhenghua Xue, Xiaoshe Dong, Siyuan Ma, and Weiqing Dong. 2007. A survey on failure prediction of large-scale server clusters. In Proceedings of the 8th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD’07). IEEE, 733–738. https://doi.org/10.1109/snpd.2007.284
[153]
Xiaoxing Yang, Ke Tang, and Xin Yao. 2015. A learning-to-rank approach to software defect prediction. IEEE Trans. Reliabil. 64, 1 (Mar. 2015), 234–246. https://doi.org/10.1109/tr.2014.2370891
[154]
Ding Yuan, Haohui Mai, Weiwei Xiong, Lin Tan, Yuanyuan Zhou, and Shankar Pasupathy. 2010. SherLog: Error diagnosis by connecting clues from run-time logs. ACM SIGARCH Comput. Architect. News 38, 1 (Mar. 2010), 143–154. https://doi.org/10.1145/1735970.1736038
[155]
Ding Yuan, Soyeon Park, Peng Huang, Yang Liu, Michael M. Lee, Xiaoming Tang, Yuanyuan Zhou, and Stefan Savage. 2012. Be conservative: Enhancing failure diagnosis with proactive logging. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, 293–306.
[156]
Andreas Zeller. 2002. Isolating cause-effect chains from computer programs. In Proceedings of the 10th ACM SIGSOFT Symposium on Foundations of Software Engineering (SIGSOFT’02/FSE’10). ACM, New York, NY, 1–10. https://doi.org/10.1145/587051.587053
[157]
Andreas Zeller. 2006. Eclipse Bug Data!—Software Engineering Chair (Prof. Zeller)—Saarland University. Retrieved from https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/.
[158]
Chunqiu Zeng, Wubai Zhou, Tao Li, Larisa Shwartz, and Genady Ya Grabarnik. 2017. Knowledge guided hierarchical multi-label classification over ticket data. IEEE Trans. Netw. Service Manage. 14, 2 (June 2017), 246–260. https://doi.org/10.1109/tnsm.2017.2668363
[159]
Chuxu Zhang, Dongjin Song, Yuncong Chen, Xinyang Feng, Cristian Lumezanu, Wei Cheng, Jingchao Ni, Bo Zong, Haifeng Chen, and Nitesh V. Chawla. 2019. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. AAAI, 1409–1416.
[160]
Ke Zhang, Jianwu Xu, Martin Renqiang Min, Guofei Jiang, Konstantinos Pelechrinis, and Hui Zhang. 2016. Automated IT system failure prediction: A deep learning approach. In Proceedings of the IEEE International Conference on Big Data (BigData’16). IEEE, 1291–1300. https://doi.org/10.1109/bigdata.2016.7840733
[161]
Shenglin Zhang, Weibin Meng, Jiahao Bu, Sen Yang, Ying Liu, Dan Pei, Jun Xu, Yu Chen, Hui Dong, Xianping Qu et al. 2017. Syslog processing for switch failure diagnosis and prediction in datacenter networks. In Proceedings of the IEEE/ACM 25th International Symposium on Quality of Service (IWQoS’17). IEEE, 1–10. https://doi.org/10.1109/iwqos.2017.7969130
[162]
Xu Zhang, Yong Xu, Qingwei Lin, Bo Qiao, Hongyu Zhang, Yingnong Dang, Chunyu Xie, Xinsheng Yang, Qian Cheng, Ze Li, Junjie Chen, Xiaoting He, Randolph Yao, Jian-Guang Lou, Murali Chintalapati, Furao Shen, and Dongmei Zhang. 2019. Robust log-based anomaly detection on unstable log data. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’19). ACM, New York, NY, 807–817. https://doi.org/10.1145/3338906.3338931
[163]
Xu Zhao, Kirk Rodrigues, Yu Luo, Michael Stumm, Ding Yuan, and Yuanyuan Zhou. 2017. Log20: Fully automated optimal placement of log printing statements under specified overhead threshold. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP’17). ACM, New York, NY, 565–581. https://doi.org/10.1145/3132747.3132778
[164]
Ying Zhao, Xiang Liu, Siqing Gan, and Weimin Zheng. 2010. Predicting disk failures with HMM- and HSMM-based approaches. In Proceedings of the 10th Industrial Conference on Advances in Data Mining: Applications and Theoretical Aspects (ICDM’10). Springer-Verlag, Berlin, 390–404.
[165]
Shuai Zheng, Kosta Ristovski, Ahmed Farahat, and Chetan Gupta. 2017. Long short-term memory network for remaining useful life estimation. In Proceedings of the IEEE International Conference on Prognostics and Health Management (ICPHM’17). IEEE, 88–95. https://doi.org/10.1109/icphm.2017.7998311
[166]
Wubai Zhou, Liang Tang, Tao Li, Larisa Shwartz, and Genady Ya. Grabarnik. 2015. Resolution recommendation for event tickets in service management. In Proceedings of the IFIP/IEEE International Symposium on Integrated Network Management (IM’15). IEEE, 287–295. https://doi.org/10.1109/inm.2015.7140303
[167]
Bingpeng Zhu, Gang Wang, Xiaoguang Liu, Dianming Hu, Sheng Lin, and Jingwei Ma. 2013. Proactive drive failure prediction for large scale storage systems. In Proceedings of the IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST’13). IEEE, 1–5. https://doi.org/10.1109/msst.2013.6558427
[168]
Jieming Zhu, Pinjia He, Qiang Fu, Hongyu Zhang, Michael R. Lyu, and Dongmei Zhang. 2015. Learning to log: Helping developers make informed logging decisions. In Proceedings of the IEEE/ACM 37th IEEE International Conference on Software Engineering. IEEE, 415–425. https://doi.org/10.1109/icse.2015.60

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 12, Issue 6
December 2021
356 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3501281
Editor: Huan Liu

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 November 2021
Accepted: 01 August 2021
Revised: 01 July 2021
Received: 01 April 2021
Published in TIST Volume 12, Issue 6

Author Tags

  1. AIOps
  2. IT operations and maintenance
  3. failure management
  4. artificial intelligence

Qualifiers

  • Survey
  • Refereed

