Abstract
Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools; at that point, human intervention will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using live data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating predictive models for node failures. Our results support the practicality of a data-driven approach by showing the effectiveness of predictive models based on data found in typical data center logs. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing node state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict whether nodes will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88%, with precision varying between 50% and 72%. This level of performance allows us to recover a large fraction of job executions (by redirecting them to other nodes when a failure of the present node is predicted) that would otherwise be wasted due to failures. We discuss the feasibility of including our predictive model as the central component of a data-driven autonomic manager and operating it online with live data streams (rather than offline on data logs). All of the scripts used for the BigQuery and classification analyses are publicly available on GitHub.
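To make the prediction setup concrete, the sketch below illustrates one way to build such an ensemble of Random Forests and tune its decision threshold to a 5% false positive budget. It is a minimal illustration using scikit-learn with randomly generated placeholder data, not the authors' released BigQuery/ML scripts (those are in the GitHub repository cited in the references); all variable names and the subsampling scheme are assumptions for the example.

```python
# Minimal sketch (hypothetical data, not the authors' pipeline): an ensemble
# of Random Forest classifiers scoring whether a node fails within the next
# 24 h, thresholded at a 5% false positive rate on a validation set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))        # per-node feature vectors (placeholder)
y_train = rng.integers(0, 2, size=1000)      # 1 = node fails in next 24 h (placeholder)
X_val = rng.normal(size=(500, 20))
y_val = rng.integers(0, 2, size=500)

# Train several forests, each on a different bootstrap resample of the
# training data (one common way to obtain a diverse ensemble).
forests = []
for seed in range(10):
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_train[idx], y_train[idx])
    forests.append(rf)

def ensemble_score(X):
    """Average the per-forest failure probabilities into one score."""
    return np.mean([rf.predict_proba(X)[:, 1] for rf in forests], axis=0)

# Pick the threshold as the 95th percentile of scores on validation
# negatives, so at most ~5% of healthy nodes are flagged.
scores = ensemble_score(X_val)
threshold = np.quantile(scores[y_val == 0], 0.95)
flagged = scores >= threshold                # nodes predicted to fail
```

Thresholding the averaged probabilities, rather than taking a majority vote, makes the false-positive budget an explicit knob, mirroring the 5% operating point used in the evaluation above.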
Notes
Delfina Eberly, Director of Data Center Operations at Facebook, speaking on “Operations at Scale” at the 7×24 Exchange 2013 Fall Conference.
Based on current Google BigQuery pricing.
References
Abdul-Rahman, O.A., Aida, K.: Towards understanding the usage behavior of Google cloud users: the mice and elephants phenomenon. In: IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pp. 272–277. Singapore (2014)
Balliu, A., Olivetti, D., Babaoglu, O., Marzolla, M., Sîrbu, A.: BiDAl: Big Data Analyzer for cluster traces. In: Informatik 2014 (BigSys Workshop). Lecture Notes in Informatics, vol. 232, pp. 1781–1795. GI-Edition (2014)
Balliu, A., Olivetti, D., Babaoglu, O., Marzolla, M., Sîrbu, A.: A big data analyzer for large trace logs. Computing (2015). doi:10.1007/s00607-015-0480-7
Breitgand, D., Dubitzky, Z., Epstein, A., Feder, O., Glikson, A., Shapira, I., Toffetti, G.: An adaptive utilization accelerator for virtualized environments. In: International Conference on Cloud Engineering (IC2E), pp. 165–174. IEEE, Boston (2014)
Caglar, F., Gokhale, A.: iOverbook: Intelligent resource-overbooking to support soft real-time applications in the cloud. In: 7th IEEE International Conference on Cloud Computing (IEEE CLOUD). Anchorage (2014). http://www.dre.vanderbilt.edu/gokhale/WWW/papers/CLOUD-2014.pdf
Di, S., Kondo, D., Cirne, W.: Characterization and comparison of Google cloud load versus grids. In: International Conference on Cluster Computing (IEEE CLUSTER), pp. 230–238 (2012)
Di, S., Kondo, D., Cirne, W.: Host load prediction in a Google compute cloud with a Bayesian model. In: 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2012). doi:10.1109/SC.2012.68
Di, S., Robert, Y., Vivien, F., Kondo, D., Wang, C.L., Cappello, F.: Optimization of cloud task processing with checkpoint-restart mechanism. In: 25th International Conference on High Performance Computing, Networking, Storage and Analysis (SC). Denver (2013)
Dudko, R., Sharma, A., Tedesco, J.: Effective Failure Prediction in Hadoop Clusters. University of Idaho White Paper, pp. 1–8 (2012)
Gainaru, A., Bouguerra, M.S., Cappello, F., Snir, M., Kramer, W.: Navigating the blue waters: online failure prediction in the petascale era. Argonne National Laboratory Technical Report, ANL/MCS-P5219-1014 (2014)
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. C 42(4), 463–484 (2012)
Guan, Q., Fu, S.: Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures. In: 32nd IEEE Symposium on Reliable Distributed Systems (SRDS), pp. 205–214. Braga (2013)
Iglesias, J.O., Lero, L.M., Cauwer, M.D., Mehta, D., O’Sullivan, B.: A methodology for online consolidation of tasks through more accurate resource estimations. In: IEEE/ACM International Conference on Utility and Cloud Computing (UCC). London (2014)
Javadi, B., Kondo, D., Iosup, A., Epema, D.: The Failure Trace Archive: enabling the comparison of failure measurements and models of distributed systems. J. Parallel Distrib. Comput. 73(8), 1208–1223 (2013)
Khoshgoftaar, T.M., Golawala, M., Van Hulse, J.: An empirical study of learning from imbalanced data using random forest. In: 19th IEEE International Conference on Tools with Artificial Intelligence, 2007 (ICTAI 2007), vol. 2, pp. 310–317. IEEE (2007)
Kuncheva, L.I., Whitaker, C.J., Shipp, C.A., Duin, R.P.: Is independence good for combining classifiers? In: Proceedings of the 15th International Conference on Pattern Recognition, 2000, vol. 2, pp. 168–171. IEEE (2000)
Liang, Y., Zhang, Y., Xiong, H., Sahoo, R.: Failure prediction in IBM BlueGene/L event logs. In: Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 583–588 (2007). doi:10.1109/ICDM.2007.46
Liu, Z., Cho, S.: Characterizing machines and workloads on a Google cluster. In: 8th International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS) (2012)
Mishra, A.K., Hellerstein, J.L., Cirne, W., Das, C.R.: Towards characterizing cloud backend workloads: insights from Google compute clusters. ACM SIGMETRICS Perform. Eval. Rev. 37(4), 34–41 (2010)
Opitz, D.W., Shavlik, J.W., et al.: Generating accurate and diverse members of a neural-network ensemble. In: Advances in Neural Information Processing Systems, pp. 535–541 (1996)
Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A.: Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In: ACM Symposium on Cloud Computing (SoCC) (2012)
Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A.: Towards understanding heterogeneous clouds at scale: Google trace analysis. Carnegie Mellon University Technical Report ISTC-CC-TR-12-101 (2012)
Reiss, C., Wilkes, J., Hellerstein, J.L.: Obfuscatory obscanturism: making workload traces of commercially-sensitive systems safe to release. In: Network Operations and Management Symposium (NOMS), 2012, pp. 1279–1286. IEEE (2012)
Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1–2), 1–39 (2010)
Rosà, A., Chen, L., Birke, R., Binder, W.: Demystifying casualties of evictions in big data priority scheduling. ACM SIGMETRICS Perform. Eval. Rev. 42(4), 12–21 (2015)
Rosà, A., Chen, L.Y., Binder, W.: Predicting and mitigating jobs failures in big data clusters. In: 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 221–230. IEEE (2015). doi:10.1109/CCGrid.2015.139
Salfner, F., Lenk, M., Malek, M.: A survey of online failure prediction methods. ACM Comput. Surv. (CSUR) 42(3), 1–68 (2010)
Samak, T., Gunter, D., Goode, M., Deelman, E., Juve, G., Silva, F., Vahi, K.: Failure analysis of distributed scientific workflows executing in the cloud. In: 8th International Conference on Network and Service Management (CNSM) and 2012 Workshop on Systems Virtualization Management (SVM), pp. 46–54. IEEE (2012)
Sharma, B., Chudnovsky, V., Hellerstein, J.L., Rifaat, R., Das, C.R.: Modeling and synthesizing task placement constraints in Google compute clusters. In: Proceedings of the 2nd ACM Symposium on Cloud Computing, p. 3. ACM (2011)
Shipp, C.A., Kuncheva, L.I.: Relationships between combination methods and measures of diversity in combining classifiers. Inf. Fusion 3(2), 135–148 (2002). doi:10.1016/S1566-2535(02)00051-9
Sîrbu, A., Babaoglu, O.: BigQuery and ML scripts. GitHub (2015). Available at https://github.com/alinasirbu/google_cluster_failure_prediction
Sîrbu, A., Babaoglu, O.: Towards data-driven autonomics in data centers. In: 2015 International Conference on Cloud and Autonomic Computing (ICCAC), pp. 45–56 (2015). doi:10.1109/ICCAC.2015.19
Tigani, J., Naidu, S.: Google BigQuery Analytics. Wiley, Indianapolis (2014)
Verma, A., Pedrosa, L., Korupolu, M.R., Oppenheimer, D., Tune, E., Wilkes, J.: Large-scale cluster management at Google with Borg. In: Proceedings of the European Conference on Computer Systems (EuroSys). Bordeaux, France (2015)
Wang, G., Butt, A.R., Monti, H., Gupta, K.: Towards synthesizing realistic workload traces for studying the Hadoop ecosystem. In: 19th IEEE Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 400–408 (2011)
Wilkes, J.: More Google cluster data. Google research blog (2011). http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html
Zhang, Q., Hellerstein, J.L., Boutaba, R.: Characterizing task usage shapes in Google’s compute clusters. In: Proceedings of the 5th International Workshop on Large Scale Distributed Systems and Middleware (2011)
Zhang, Q., Zhani, M.F., Boutaba, R., Hellerstein, J.L.: Dynamic heterogeneity-aware resource provisioning in the cloud. IEEE Trans. Cloud Comput. 2(1), 14–28 (2014)
Acknowledgments
BigQuery analysis was carried out through a generous Cloud Credits grant from Google. We are grateful to John Wilkes of Google for helpful discussions regarding the cluster trace data.
Cite this article
Sîrbu, A., Babaoglu, O. Towards operator-less data centers through data-driven, predictive, proactive autonomics. Cluster Comput 19, 865–878 (2016). https://doi.org/10.1007/s10586-016-0564-y