DOI: 10.1145/3624062.3624131
research-article
Open access

ICE 2.0: Restructuring and Growing an Instructional HPC Cluster

Published: 12 November 2023

Abstract

The Partnership for an Advanced Computing Environment (PACE) at Georgia Tech (GT) has been running two campus-wide cluster resources available for academic courses and workshops for five years. The initial design focused on creating a federated resource for a wide range of educational topics, based on a PACE and College of Computing (COC) partnership. Due to funding, this took the form of separate resources, one funded by PACE, and another by COC. These "Instructional Cluster Environments", PACE-ICE and COC-ICE, became very popular with instructors at GT but led to a high maintenance cost due to the split nature of the environments. With the transition to the Slurm scheduler, PACE collaborated with COC to merge the two clusters into one, ICE. This work details the strategies used to sensibly merge the two production systems, including the storage architecture, shared system policies, and scheduler priority configurations that honor funding complexities.

Supplemental Material

MP4 File
Recording of "ICE 2.0: Restructuring and Growing an Instructional HPC Cluster" presentation at HPCSYSPROS23.

Cited By

  • (2024) Exploring Research Dataset-Sharing Strategies for Concurrent AI Workflows. Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, pp. 1-4. DOI: 10.1145/3626203.3670597. Online publication date: 17-Jul-2024

Published In

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
November 2023
2180 pages
ISBN: 9798400707858
DOI: 10.1145/3624062
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Accounting
  2. Education
  3. HPC
  4. HPC Access
  5. Instructional Infrastructure
  6. System design

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SC-W 2023

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 243
  • Downloads (Last 6 weeks): 122

Reflects downloads up to 14 Dec 2024
