Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures

Published: 01 July 2013

Abstract

With the advent of cloud computing and online services, large enterprises rely heavily on their datacenters to serve end users. When failures increase, a large datacenter facility incurs higher maintenance costs in addition to service unavailability. Among server components, hard disk drives are known to contribute significantly to server failures; however, there is very little understanding of the major determinants of disk failures in datacenters. In this work, we focus on the interrelationship between temperature, workload, and hard disk drive failures in a large-scale datacenter. We present a dense storage case study from a population housing thousands of servers and tens of thousands of disk drives, hosting a large-scale online service at Microsoft. We specifically establish the correlation between temperatures and failures observed at different location granularities: (a) across drive locations within a server chassis, (b) across server locations in a rack, and (c) across multiple racks in a datacenter. We show that temperature correlates more strongly with drive failures than disk utilization does. We establish that variations in temperature are not significant in datacenters and have little impact on failures. We also explore workload impacts on temperature and disk failures and show that the impact of workload is not significant. We then experimentally evaluate knobs that control disk drive temperature, including workload and chassis design knobs. We corroborate our findings from the real data study and show that workload knobs have minimal impact on temperature, whereas chassis knobs such as disk placement and fan speed have a larger impact. Finally, we show the cost benefit of the proposed temperature optimizations that increase hard disk drive reliability.



Published In

ACM Transactions on Storage, Volume 9, Issue 2
July 2013
89 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/2491472
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2013
Accepted: 01 October 2012
Revised: 01 September 2012
Received: 01 February 2012
Published in TOS Volume 9, Issue 2


Author Tags

  1. Datacenter
  2. hard disk drives
  3. temperature impact

Qualifiers

  • Research-article
  • Research
  • Refereed


