DOI: 10.1145/3311790.3396624
Research article

Monitoring and Analysis of Power Consumption on HPC Clusters using XDMoD

Published: 26 July 2020

Abstract

As part of the NSF-funded XMS project, we are developing tools and techniques for the audit and analysis of HPC infrastructure, including a suite of tools for analyzing HPC jobs based on performance metrics collected from compute nodes. Although it may not be salient to the user, the energy consumption of an HPC system is an important part of the cost of maintenance and contributes a substantial fraction of the cost of the calculations performed on the system. We added support for energy-usage analysis to the open-source XDMoD tool chain, allowing HPC centers to provide information about power consumption directly to HPC stakeholders: end users receive energy-usage information about their jobs, and HPC center staff obtain data for analyzing how the energy usage of the system relates to other system parameters. We explain how energy metrics were added to XDMoD and describe the issues we overcame in instrumenting a 1,400-node academic HPC cluster. We present an analysis of 14 months of data collected from real jobs on the cluster, together with a machine learning analysis showing how energy usage is related to other system performance metrics.
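The abstract describes turning node-level power measurements into per-job energy reporting. As a rough illustration only, not the paper's actual collection pipeline, the sketch below assumes a node whose power draw can be read out of band with `ipmitool dcmi power reading` (a real ipmitool subcommand, though the output format varies by BMC vendor) and integrates periodic samples with the trapezoidal rule to estimate energy; the function names and parameters are hypothetical.

```python
"""Minimal sketch (assumptions noted inline): estimate a node's energy use
from periodic out-of-band DCMI power readings."""
import re
import subprocess
import time

# Assumed output line, e.g. "Instantaneous power reading:   220 Watts";
# the exact format depends on the BMC vendor.
POWER_RE = re.compile(r"Instantaneous power reading:\s+(\d+)\s+Watts")


def read_power_watts():
    """Return the node's instantaneous power draw in watts, or None on failure."""
    try:
        out = subprocess.run(
            ["ipmitool", "dcmi", "power", "reading"],
            capture_output=True, text=True, check=False,
        ).stdout
    except FileNotFoundError:
        return None  # ipmitool not installed on this node
    m = POWER_RE.search(out)
    return float(m.group(1)) if m else None


def integrate_energy(duration_s=60.0, interval_s=10.0):
    """Sample power for duration_s seconds and return estimated energy in joules."""
    samples = []  # list of (timestamp, watts)
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        w = read_power_watts()
        if w is not None:
            samples.append((time.monotonic(), w))
        time.sleep(interval_s)
    joules = 0.0
    for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
        joules += 0.5 * (w0 + w1) * (t1 - t0)  # trapezoidal rule
    return joules


if __name__ == "__main__":
    j = integrate_energy(duration_s=30.0, interval_s=5.0)
    print(f"Estimated energy over the sampling window: {j:.1f} J ({j / 3.6e6:.6f} kWh)")
```

In a production setting such per-node estimates would still need to be attributed to individual jobs using scheduler start/end times and node assignments; the sketch stops at the single-node energy figure.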

Supplemental Material

MP4 file: Presentation video


Cited By

  • (2024) The Data Analytics Framework for XDMoD. SN Computer Science 5(5). https://doi.org/10.1007/s42979-024-02789-2. Online publication date: 20-Apr-2024.
  • (2021) Performance Analysis of Cloud Computing for Distributed Data Center using Cloud-Sim. 2021 IEEE International Conference on Communications Workshops (ICC Workshops), 1–6. https://doi.org/10.1109/ICCWorkshops50388.2021.9473876. Online publication date: Jun-2021.


Published In
PEARC '20: Practice and Experience in Advanced Research Computing 2020: Catch the Wave
July 2020
556 pages
ISBN: 9781450366892
DOI: 10.1145/3311790
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 July 2020


Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PEARC '20

Acceptance Rates

Overall Acceptance Rate 133 of 202 submissions, 66%

Article Metrics

  • Downloads (last 12 months): 40
  • Downloads (last 6 weeks): 9
Reflects downloads up to 14 Dec 2024
