"PoliMOR: A Policy Engine \"Made-to-Order\" for Automated and Scalable Data Management in Lustre"

Published: 12 November 2023 | DOI: 10.1145/3624062.3624190

Abstract

Modern supercomputing systems are increasingly reliant on hierarchical, multi-tiered file and storage system architectures due to cost-performance-capacity trade-offs. Within such multi-tiered systems, data management services are required to maintain healthy utilization, performance, and capacity levels. We present PoliMOR, a pragmatic and reliable policy-driven data management framework. PoliMOR is composed of modular, single-purpose agents that gather file system metadata and enforce policies on storage systems. PoliMOR facilitates automated and scalable data management with customizable agents tailored to HPC facility-specific storage systems and policies. Our evaluations demonstrate the scalability and performance of PoliMOR, both for its individual agents and for the framework as a whole. We believe PoliMOR is widely applicable across HPC facilities facing large-scale data management challenges and, given its flexible and open-source nature, will garner interest from the HPC community.
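
The agent model described above can be illustrated with a rough sketch. The snippet below is not PoliMOR's actual implementation; it assumes a hypothetical single-purpose policy agent, written in Python against a NATS message queue purely for illustration, that consumes file-metadata records published by a separate scanner agent, applies a simple age-based purge policy, and forwards candidates to a downstream action agent. The subject names, JSON message schema, and 90-day threshold are invented for this example.

    # Hypothetical sketch of a single-purpose policy agent: it reads file
    # metadata from a message queue, applies an age-based purge policy, and
    # publishes purge candidates for a separate action agent to handle.
    # Subject names, message schema, and the threshold are illustrative only.
    import asyncio
    import json
    import time

    import nats  # nats-py client (pip install nats-py)

    PURGE_AGE_DAYS = 90                      # assumed policy threshold
    METADATA_SUBJECT = "scan.metadata"       # assumed subject fed by a scanner agent
    CANDIDATE_SUBJECT = "purge.candidates"   # assumed subject read by an action agent


    async def run_agent() -> None:
        nc = await nats.connect("nats://localhost:4222")

        async def on_record(msg) -> None:
            # Each message is assumed to carry one file's metadata as JSON,
            # e.g. {"path": "/lustre/scratch/u/f.dat", "atime": 1690000000}.
            record = json.loads(msg.data)
            age_days = (time.time() - record["atime"]) / 86400.0
            if age_days > PURGE_AGE_DAYS:
                # Flag the file for the downstream purge/archive agent; this
                # agent never deletes anything itself, keeping it single-purpose.
                await nc.publish(CANDIDATE_SUBJECT, json.dumps(record).encode())

        await nc.subscribe(METADATA_SUBJECT, cb=on_record)
        await asyncio.sleep(60)  # consume records for a while, then shut down
        await nc.drain()


    if __name__ == "__main__":
        asyncio.run(run_agent())

Separating metadata collection, policy evaluation, and policy actions into narrowly scoped agents of this kind is the design property the abstract emphasizes: each agent can be swapped out or tuned for a facility's own storage systems and policies without touching the others.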

Supplemental Material

MP4 File
Recording of "PoliMOR: A Policy Engine \"Made-to-Order\" for Automated and Scalable Data Management in Lustre" at PDSW 2023.



      Information & Contributors

      Information

      Published In

      SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
      November 2023
      2180 pages
      ISBN:9798400707858
      DOI:10.1145/3624062
      Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 12 November 2023


      Author Tags

      1. high performance computing
      2. message queues
      3. multi-tiered parallel file system
      4. policy engine
      5. storage and data management

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • Office of Science of the U.S. Department of Energy

      Conference

      SC-W 2023
