DOI: 10.1145/3265723.3265730
research-article

Detecting Data Center Cooling Problems Using a Data-driven Approach

Published: 27 August 2018

Abstract

Cooling problems are common in data centers, and many of them are hard to detect, especially hidden ones. These problems affect overall system dependability, performance, and power efficiency. We propose a novel method to detect cooling problems. Using common monitoring data available in most data centers, such as environmental temperature and hardware status, we build a workload-independent cooling profile for each server. With the cooling profiles, we are able to detect two types of cooling failures: transient and lasting. We detect transient failures by comparing the observed temperature with the model prediction, and we detect lasting failures by comparing cooling profiles across servers. We demonstrate the general applicability of our detection methods in three production data centers with vastly different scales, server types, and workloads, and detect several real cooling problems that had been hidden for months.
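The two detection ideas in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the paper's model, so everything here is an assumption made for illustration: the cooling profile is taken to be a least-squares line relating a server's temperature to its inlet (environmental) temperature, workload effects are ignored, and the names `fit_profile`, `transient_anomalies`, `lasting_outliers`, `tol`, and `k` are hypothetical.

```python
from statistics import mean, median


def fit_profile(inlet, temp):
    """Fit temp ~ offset + slope * inlet by least squares for one server.

    Assumption: a straight line stands in for the paper's cooling profile.
    """
    mx, my = mean(inlet), mean(temp)
    sxx = sum((x - mx) ** 2 for x in inlet)
    sxy = sum((x - mx) * (y - my) for x, y in zip(inlet, temp))
    slope = sxy / sxx
    return my - slope * mx, slope  # (offset, slope)


def transient_anomalies(profile, inlet, temp, tol=3.0):
    """Return sample indices where the observed temperature differs from
    the profile's prediction by more than tol degrees (hypothetical threshold)."""
    offset, slope = profile
    return [
        i
        for i, (x, y) in enumerate(zip(inlet, temp))
        if abs(y - (offset + slope * x)) > tol
    ]


def lasting_outliers(profiles, k=3.0):
    """Return servers whose profile offset lies more than k median absolute
    deviations from the fleet median (an assumed comparison rule)."""
    offsets = {server: p[0] for server, p in profiles.items()}
    med = median(offsets.values())
    mad = median(abs(v - med) for v in offsets.values())
    limit = k * max(mad, 1e-9)  # guard against a zero MAD
    return sorted(s for s, v in offsets.items() if abs(v - med) > limit)


# Toy data, invented for illustration only.
profile = fit_profile([20, 22, 24, 26], [40, 42, 44, 46])
spikes = transient_anomalies(profile, [20, 22, 24], [40, 50, 44])
fleet = {"a": (20.0, 1.0), "b": (21.0, 1.0), "c": (19.0, 1.0), "d": (35.0, 1.0)}
suspects = lasting_outliers(fleet)
```

In this toy run, the second sample is flagged as a transient deviation and server `d` stands out from its peers. A median-based comparison is used so that one badly cooled server does not distort the baseline it is compared against.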


Cited By

  • (2022) A systematic literature review about integrating dependability attributes, performability and sustainability in the implantation of cooling subsystems in data center. The Journal of Supercomputing, 78:14 (15820-15856). DOI: 10.1007/s11227-022-04515-2. Online publication date: 27-Apr-2022
  • (2021) Operating Liquid-Cooled Large-Scale Systems: Long-Term Monitoring, Reliability Analysis, and Efficiency Measures. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (881-893). DOI: 10.1109/HPCA51647.2021.00078. Online publication date: Feb-2021
  • (2021) Thermal Management in Large Data Centres: Security Threats and Mitigation. Security in Computing and Communications (165-179). DOI: 10.1007/978-981-16-0422-5_12. Online publication date: 10-Feb-2021

Published In

APSys '18: Proceedings of the 9th Asia-Pacific Workshop on Systems
August 2018
150 pages
ISBN:9781450360067
DOI:10.1145/3265723

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Abnormal detection
  2. Cooling Profile
  3. Data-driven approach

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

APSys '18: 9th Asia-Pacific Workshop on Systems
August 27 - 28, 2018
Jeju Island, Republic of Korea

Acceptance Rates

APSys '18 Paper Acceptance Rate 18 of 48 submissions, 38%;
Overall Acceptance Rate 169 of 430 submissions, 39%
