research-article

A 'cool' way of improving the reliability of HPC machines

Authors:

Esteban Meneses,

Laxmikant V. Kale

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 58, Pages 1 - 12

https://doi.org/10.1145/2503210.2503228

Published: 17 November 2013

Abstract

Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
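The interaction between temperature, fault rate, and checkpoint/restart efficiency described in the abstract can be illustrated with a small sketch. It is not the validated model from the paper: it assumes only the stated rule that fault rates double for every 10°C rise, plus a generic first-order checkpoint model that uses Young's approximation for the checkpoint interval. All function names and the example costs (60 s checkpoint, 60 s restart, 1 h MTBF) are hypothetical and chosen for illustration.

```python
import math


def reliability_gain(cooling_celsius: float) -> float:
    """MTBF multiplier obtained by lowering core temperature.

    Assumes the rule quoted in the abstract: the fault rate doubles
    for every 10 degrees C rise, so it halves for every 10 degrees C drop.
    """
    return 2.0 ** (cooling_celsius / 10.0)


def cooling_for_gain(gain: float) -> float:
    """Temperature drop (degrees C) that the doubling rule implies for a given MTBF gain."""
    return 10.0 * math.log2(gain)


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def efficiency(checkpoint_cost: float, restart_cost: float, mtbf: float) -> float:
    """Rough fraction of time spent on useful work (illustrative model only).

    Waste per unit time is the checkpoint overhead plus the expected
    rework (half an interval) and restart cost paid once per failure.
    """
    tau = young_interval(checkpoint_cost, mtbf)
    waste = checkpoint_cost / tau + (tau / 2.0 + restart_cost) / mtbf
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    base_mtbf = 3600.0  # hypothetical system MTBF in seconds
    for drop in (0.0, 5.0, 10.0):
        mtbf = base_mtbf * reliability_gain(drop)
        print(f"cooling {drop:4.1f} C -> MTBF {mtbf:7.0f} s, "
              f"efficiency {efficiency(60.0, 60.0, mtbf):.3f}")
```

Under these assumptions, a larger temperature drop lengthens the MTBF, which in turn lengthens the optimal checkpoint interval and raises the estimated efficiency. The numbers it prints are properties of this toy model, not results from the paper.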


Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2013

1123 pages

ISBN: 9781450323789

DOI: 10.1145/2503210

General Chair: William Gropp, University of Illinois at Urbana-Champaign, Urbana, Illinois

Program Chair: Satoshi Matsuoka, Tokyo Institute of Technology, Tokyo, Japan

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SC13: International Conference for High Performance Computing, Networking, Storage and Analysis

Sponsors: SIGHPC, SIGARCH, IEEE-CS

November 17 - 21, 2013

Denver, Colorado

Acceptance Rates

SC '13 Paper Acceptance Rate: 91 of 449 submissions, 20%

Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Bibliometrics

Total Citations: 22
Total Downloads: 271
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 1

Reflects downloads up to 18 Nov 2024

Cited By

Massari G, Peta M, Campi A, Reghenzani F, Terraneo F, Agosta G, Fornaciari W, Ciesielski S, Kulczewski M, Piatek W (2023). Reliability-oriented resource management for High-Performance Computing. Sustainable Computing: Informatics and Systems 39 (100873). Online publication date: Sep-2023
https://doi.org/10.1016/j.suscom.2023.100873
Ferreira K, Levy S, Hemmert J, Pedretti K, Weissman J, Chandra A, Gavrilovska A, Tiwari D (2022). Understanding Memory Failures on a Petascale Arm System. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing (84-96). Online publication date: 27-Jun-2022
https://dl.acm.org/doi/10.1145/3502181.3531465
Jiang W, Jia Z, Feng S, Liu F, Jin H, Manne S, Hunter H, Altman E (2019). Fine-grained warm water cooling for improving datacenter economy. Proceedings of the 46th International Symposium on Computer Architecture (474-486). Online publication date: 22-Jun-2019
https://dl.acm.org/doi/10.1145/3307650.3322236
Fanfarillo A (2019). Quantifying Uncertainty in Source Term Estimation with Tensorflow Probability. 2019 IEEE/ACM HPC for Urgent Decision Making (UrgentHPC) (1-6). Online publication date: Nov-2019
https://doi.org/10.1109/UrgentHPC49580.2019.00006
Babaoglu O, Sirbu A (2018). Cognified Distributed Computing. 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS) (1180-1191). Online publication date: Jul-2018
https://doi.org/10.1109/ICDCS.2018.00118
Dauwe D, Jhaveri R, Pasricha S, Maciejewski A, Siegel H (2017). Optimizing checkpoint intervals for reduced energy use in exascale systems. 2017 Eighth International Green and Sustainable Computing Conference (IGSC) (1-8). Online publication date: Oct-2017
https://doi.org/10.1109/IGCC.2017.8323598
Acun B, Lee E, Park Y, Kale L (2017). Support for Power Efficient Proactive Cooling Mechanisms. 2017 IEEE 24th International Conference on High Performance Computing (HiPC) (94-103). Online publication date: Dec-2017
https://doi.org/10.1109/HiPC.2017.00020
Sun H, Stolf P, Pierson J (2017). Spatio-temporal thermal-aware scheduling for homogeneous high-performance computing datacenters. Future Generation Computer Systems 71:C (157-170). Online publication date: 1-Jun-2017
https://dl.acm.org/doi/10.1016/j.future.2017.02.005
Padoin E, Pilla L, Castro M, Navaux P, Méhaut J (2017). Exploration of Load Balancing Thresholds to Save Energy on Iterative Applications. High Performance Computing (76-88). Online publication date: 29-Apr-2017
https://doi.org/10.1007/978-3-319-57972-6_6
Wallace S, Yang X, Vishwanath V, Allcock W, Coghlan S, Papka M, Lan Z, West J (2016). A data driven scheduling approach for power management on HPC systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-11). Online publication date: 13-Nov-2016
https://dl.acm.org/doi/10.5555/3014904.3014979
