DOI: 10.1145/2503210.2503228
research-article

A 'cool' way of improving the reliability of HPC machines

Published: 17 November 2013

Abstract

Soaring energy consumption and declining reliability loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale could degrade to the point where failures become the norm rather than the exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining software and hardware aspects to improve the reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve a two-fold benefit: improving the reliability of parallel machines and reducing the total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce execution time by 12%. In addition, our scheme can reduce machine energy consumption by as much as 25%. For a 350K-socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
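
The abstract's rule of thumb (fault rates double for every 10°C rise in core temperature) can be linked to checkpoint/restart efficiency with a small calculation. The sketch below is only illustrative and is not the paper's validated model. It assumes Young's first-order checkpoint interval, sqrt(2 * C * M), and a simplified first-order waste estimate. The function names (`mtbf_scale`, `young_interval`, `efficiency`) and the numeric inputs are hypothetical, chosen for the example.

```python
import math


def mtbf_scale(temp_change_c: float) -> float:
    """Factor applied to MTBF when core temperature changes by temp_change_c.

    Assumption: the fault rate doubles for every 10 degC rise, so MTBF halves.
    A negative temp_change_c (cooling) therefore lengthens MTBF.
    """
    return 2.0 ** (-temp_change_c / 10.0)


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimum checkpoint interval, sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def efficiency(checkpoint_cost: float, mtbf: float, restart_cost: float = 0.0) -> float:
    """Rough fraction of time spent on useful work under checkpoint/restart.

    Simplified first-order model (an assumption, not the paper's model):
    waste = C / tau + (tau / 2 + R) / M, evaluated at Young's interval tau.
    The result is clamped at 0 when the machine cannot make progress.
    """
    tau = young_interval(checkpoint_cost, mtbf)
    waste = checkpoint_cost / tau + (tau / 2.0 + restart_cost) / mtbf
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    base_mtbf = 100.0  # minutes, illustrative value only
    ckpt_cost = 2.0    # minutes per checkpoint, illustrative value only
    for cooling_c in (0.0, 10.0):
        mtbf = base_mtbf * mtbf_scale(-cooling_c)
        print(cooling_c, mtbf, young_interval(ckpt_cost, mtbf),
              efficiency(ckpt_cost, mtbf))
```

Under these assumptions, lowering core temperature lengthens the MTBF, which in turn lengthens the optimal checkpoint interval and raises the estimated efficiency. When the MTBF is short relative to the checkpoint cost, the estimate collapses to zero, which mirrors the abstract's point that regular checkpoint/restart can fail to make progress at very large scale.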

    Published In

    SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
    November 2013
    1123 pages
    ISBN: 9781450323789
    DOI: 10.1145/2503210
    • General Chair: William Gropp
    • Program Chair: Satoshi Matsuoka
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. actionable modeling
    2. checkpointing restart
    3. energy minimization
    4. fault tolerance
    5. load balancing
    6. temperature capping
    7. temperature thresholds
    8. thermal control

    Qualifiers

    • Research-article

    Conference

    SC13

    Acceptance Rates

    SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Article Metrics

    • Downloads (last 12 months): 4
    • Downloads (last 6 weeks): 1
    Reflects downloads up to 18 Nov 2024

    Cited By

    • (2023) Reliability-oriented resource management for High-Performance Computing. Sustainable Computing: Informatics and Systems, vol. 39 (100873). DOI: 10.1016/j.suscom.2023.100873. Online publication date: Sep-2023
    • (2022) Understanding Memory Failures on a Petascale Arm System. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, pp. 84-96. DOI: 10.1145/3502181.3531465. Online publication date: 27-Jun-2022
    • (2019) Fine-grained warm water cooling for improving datacenter economy. Proceedings of the 46th International Symposium on Computer Architecture, pp. 474-486. DOI: 10.1145/3307650.3322236. Online publication date: 22-Jun-2019
    • (2019) Quantifying Uncertainty in Source Term Estimation with Tensorflow Probability. 2019 IEEE/ACM HPC for Urgent Decision Making (UrgentHPC), pp. 1-6. DOI: 10.1109/UrgentHPC49580.2019.00006. Online publication date: Nov-2019
    • (2018) Cognified Distributed Computing. 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pp. 1180-1191. DOI: 10.1109/ICDCS.2018.00118. Online publication date: Jul-2018
    • (2017) Optimizing checkpoint intervals for reduced energy use in exascale systems. 2017 Eighth International Green and Sustainable Computing Conference (IGSC), pp. 1-8. DOI: 10.1109/IGCC.2017.8323598. Online publication date: Oct-2017
    • (2017) Support for Power Efficient Proactive Cooling Mechanisms. 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pp. 94-103. DOI: 10.1109/HiPC.2017.00020. Online publication date: Dec-2017
    • (2017) Spatio-temporal thermal-aware scheduling for homogeneous high-performance computing datacenters. Future Generation Computer Systems, vol. 71:C, pp. 157-170. DOI: 10.1016/j.future.2017.02.005. Online publication date: 1-Jun-2017
    • (2017) Exploration of Load Balancing Thresholds to Save Energy on Iterative Applications. High Performance Computing, pp. 76-88. DOI: 10.1007/978-3-319-57972-6_6. Online publication date: 29-Apr-2017
    • (2016) A data driven scheduling approach for power management on HPC systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-11. DOI: 10.5555/3014904.3014979. Online publication date: 13-Nov-2016
