Abstract
Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools; at that point, human intervention will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using live data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating predictive models for node failures. Our results support the practicality of a data-driven approach by showing the effectiveness of predictive models based on data found in typical data center logs. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing node state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict whether nodes will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88%, with precision varying between 50% and 72%. This level of performance allows us to recover a large fraction of job executions (by redirecting them to other nodes when a failure of the present node is predicted) that would otherwise be wasted due to failures. We discuss the feasibility of including our predictive model as the central component of a data-driven autonomic manager and operating it online with live data streams (rather than offline on data logs). All of the scripts used for the BigQuery and classification analyses are publicly available on GitHub.
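To make the prediction setup concrete, the sketch below illustrates one way to build such an ensemble of Random Forests and tune its decision threshold to a 5% false positive budget. It is a minimal illustration using scikit-learn with randomly generated placeholder data, not the authors' released BigQuery/ML scripts (those are in the GitHub repository cited in the references); all variable names and the subsampling scheme are assumptions for the example.

```python
# Minimal sketch (hypothetical data, not the authors' pipeline): an ensemble
# of Random Forest classifiers scoring whether a node fails within the next
# 24 h, thresholded at a 5% false positive rate on a validation set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))        # per-node feature vectors (placeholder)
y_train = rng.integers(0, 2, size=1000)      # 1 = node fails in next 24 h (placeholder)
X_val = rng.normal(size=(500, 20))
y_val = rng.integers(0, 2, size=500)

# Train several forests, each on a different bootstrap resample of the
# training data (one common way to obtain a diverse ensemble).
forests = []
for seed in range(10):
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_train[idx], y_train[idx])
    forests.append(rf)

def ensemble_score(X):
    """Average the per-forest failure probabilities into one score."""
    return np.mean([rf.predict_proba(X)[:, 1] for rf in forests], axis=0)

# Pick the threshold as the 95th percentile of scores on validation
# negatives, so at most ~5% of healthy nodes are flagged.
scores = ensemble_score(X_val)
threshold = np.quantile(scores[y_val == 0], 0.95)
flagged = scores >= threshold                # nodes predicted to fail
```

Thresholding the averaged probabilities, rather than taking a majority vote, makes the false-positive budget an explicit knob, mirroring the 5% operating point used in the evaluation above.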
Notes
Delfina Eberly, Director of Data Center Operations at Facebook, speaking on “Operations at Scale” at the 7×24 Exchange 2013 Fall Conference.
Based on current Google BigQuery pricing.
References
Abdul-Rahman, O.A., Aida, K.: Towards understanding the usage behavior of Google cloud users: the mice and elephants phenomenon. In: IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pp. 272–277. Singapore (2014)
Balliu, A., Olivetti, D., Babaoglu, O., Marzolla, M., Sîrbu, A.: BiDAl: Big Data Analyzer for cluster traces. In: Informatik 2014 (BigSys Workshop). Lecture Notes in Informatics, vol. 232, pp. 1781–1795. GI-Edition (2014)
Balliu, A., Olivetti, D., Babaoglu, O., Marzolla, M., Sîrbu, A.: A big data analyzer for large trace logs. Computing (2015). doi:10.1007/s00607-015-0480-7
Breitgand, D., Dubitzky, Z., Epstein, A., Feder, O., Glikson, A., Shapira, I., Toffetti, G.: An adaptive utilization accelerator for virtualized environments. In: International Conference on Cloud Engineering (IC2E), pp. 165–174. IEEE, Boston (2014)
Caglar, F., Gokhale, A.: iOverbook: Intelligent resource-overbooking to support soft real-time applications in the cloud. In: 7th IEEE International Conference on Cloud Computing (IEEE CLOUD). Anchorage (2014). http://www.dre.vanderbilt.edu/gokhale/WWW/papers/CLOUD-2014.pdf
Di, S., Kondo, D., Cirne, W.: Characterization and comparison of Google cloud load versus grids. In: International Conference on Cluster Computing (IEEE CLUSTER), pp. 230–238 (2012)
Di, S., Kondo, D., Cirne, W.: Host load prediction in a Google compute cloud with a Bayesian model. In: 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2012). doi:10.1109/SC.2012.68
Di, S., Robert, Y., Vivien, F., Kondo, D., Wang, C.L., Cappello, F.: Optimization of cloud task processing with checkpoint-restart mechanism. In: 25th International Conference on High Performance Computing, Networking, Storage and Analysis (SC). Denver (2013)
Dudko, R., Sharma, A., Tedesco, J.: Effective Failure Prediction in Hadoop Clusters. University of Idaho White Paper, pp. 1–8 (2012)
Gainaru, A., Bouguerra, M.S., Cappello, F., Snir, M., Kramer, W.: Navigating the blue waters: online failure prediction in the petascale era. Argonne National Laboratory Technical Report, ANL/MCS-P5219-1014 (2014)
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. C 42(4), 463–484 (2012)
Guan, Q., Fu, S.: Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures. In: 32nd IEEE Symposium on Reliable Distributed Systems (SRDS), pp. 205–214. Braga (2013)
Iglesias, J.O., Lero, L.M., Cauwer, M.D., Mehta, D., O’Sullivan, B.: A methodology for online consolidation of tasks through more accurate resource estimations. In: IEEE/ACM International Conference on Utility and Cloud Computing (UCC). London (2014)
Javadi, B., Kondo, D., Iosup, A., Epema, D.: The Failure Trace Archive: enabling the comparison of failure measurements and models of distributed systems. J. Parallel Distrib. Comput. 73(8), 1208–1223 (2013)
Khoshgoftaar, T.M., Golawala, M., Van Hulse, J.: An empirical study of learning from imbalanced data using random forest. In: 19th IEEE International Conference on Tools with Artificial Intelligence, 2007 (ICTAI 2007), vol. 2, pp. 310–317. IEEE (2007)
Kuncheva, L.I., Whitaker, C.J., Shipp, C.A., Duin, R.P.: Is independence good for combining classifiers? In: Proceedings of the 15th International Conference on Pattern Recognition, 2000, vol. 2, pp. 168–171. IEEE (2000)
Liang, Y., Zhang, Y., Xiong, H., Sahoo, R.: Failure prediction in IBM BlueGene/L event logs. In: Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 583–588 (2007). doi:10.1109/ICDM.2007.46
Liu, Z., Cho, S.: Characterizing machines and workloads on a Google cluster. In: 8th International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS) (2012)
Mishra, A.K., Hellerstein, J.L., Cirne, W., Das, C.R.: Towards characterizing cloud backend workloads: insights from Google compute clusters. ACM SIGMETRICS Perform. Eval. Rev. 37(4), 34–41 (2010)
Opitz, D.W., Shavlik, J.W., et al.: Generating accurate and diverse members of a neural-network ensemble. In: Advances in Neural Information Processing Systems, pp. 535–541 (1996)
Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A.: Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In: ACM Symposium on Cloud Computing (SoCC) (2012)
Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A.: Towards understanding heterogeneous clouds at scale: Google trace analysis. Carnegie Mellon University Technical Report ISTC-CC-TR-12-101 (2012)
Reiss, C., Wilkes, J., Hellerstein, J.L.: Obfuscatory obscanturism: making workload traces of commercially-sensitive systems safe to release. In: Network Operations and Management Symposium (NOMS), 2012, pp. 1279–1286. IEEE (2012)
Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1–2), 1–39 (2010)
Rosà, A., Chen, L., Birke, R., Binder, W.: Demystifying casualties of evictions in big data priority scheduling. ACM SIGMETRICS Perform. Eval. Rev. 42(4), 12–21 (2015)
Rosà, A., Chen, L.Y., Binder, W.: Predicting and mitigating jobs failures in big data clusters. In: 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 221–230. IEEE (2015). doi:10.1109/CCGrid.2015.139
Salfner, F., Lenk, M., Malek, M.: A survey of online failure prediction methods. ACM Comput. Surv. (CSUR) 42(3), 1–68 (2010)
Samak, T., Gunter, D., Goode, M., Deelman, E., Juve, G., Silva, F., Vahi, K.: Failure analysis of distributed scientific workflows executing in the cloud. In: 8th International Conference on Network and Service Management (CNSM) and 2012 Workshop on Systems Virtualization Management (SVM), pp. 46–54. IEEE (2012)
Sharma, B., Chudnovsky, V., Hellerstein, J.L., Rifaat, R., Das, C.R.: Modeling and synthesizing task placement constraints in Google compute clusters. In: Proceedings of the 2nd ACM Symposium on Cloud Computing, p. 3. ACM (2011)
Shipp, C.A., Kuncheva, L.I.: Relationships between combination methods and measures of diversity in combining classifiers. Inf. Fusion 3(2), 135–148 (2002). doi:10.1016/S1566-2535(02)00051-9
Sîrbu, A., Babaoglu, O.: BigQuery and ML scripts. GitHub (2015). Available at https://github.com/alinasirbu/google_cluster_failure_prediction
Sîrbu, A., Babaoglu, O.: Towards data-driven autonomics in data centers. In: 2015 International Conference on Cloud and Autonomic Computing (ICCAC), pp. 45–56 (2015). doi:10.1109/ICCAC.2015.19
Tigani, J., Naidu, S.: Google BigQuery Analytics. Wiley, Indianapolis (2014)
Verma, A., Pedrosa, L., Korupolu, M.R., Oppenheimer, D., Tune, E., Wilkes, J.: Large-scale cluster management at Google with Borg. In: Proceedings of the European Conference on Computer Systems (EuroSys). Bordeaux, France (2015)
Wang, G., Butt, A.R., Monti, H., Gupta, K.: Towards synthesizing realistic workload traces for studying the Hadoop ecosystem. In: 19th IEEE Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 400–408 (2011)
Wilkes, J.: More Google cluster data. Google research blog (2011). http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html
Zhang, Q., Hellerstein, J.L., Boutaba, R.: Characterizing task usage shapes in Google’s compute clusters. In: Proceedings of the 5th International Workshop on Large Scale Distributed Systems and Middleware (2011)
Zhang, Q., Zhani, M.F., Boutaba, R., Hellerstein, J.L.: Dynamic heterogeneity-aware resource provisioning in the cloud. IEEE Trans. Cloud Comput. 2(1), 14–28 (2014)
Acknowledgments
BigQuery analysis was carried out through a generous Cloud Credits grant from Google. We are grateful to John Wilkes of Google for helpful discussions regarding the cluster trace data.
Cite this article
Sîrbu, A., Babaoglu, O. Towards operator-less data centers through data-driven, predictive, proactive autonomics. Cluster Comput 19, 865–878 (2016). https://doi.org/10.1007/s10586-016-0564-y