DOI: 10.1145/3493649.3493654
short-paper

Designing a Kubernetes Operator for Machine Learning Applications

Published: 06 December 2021

Abstract

Machine Learning workloads such as deep learning and hyperparameter tuning are compute-intensive by nature. Parallel execution is key to reducing the learning time. The Ray Framework is a distributed middleware that provides primitives to seamlessly parallelize machine learning code execution across a cluster of compute nodes. Launching a Ray-managed machine learning application requires a Ray cluster that is diligently configured, well connected, and easily scalable. Kubernetes, the container management middleware, satisfies all the requirements to create and scale Ray clusters. However, setting up a cluster within Kubernetes is a tedious and error-prone task when done manually. In this paper we present KubeRay, an Operator and suite of tools designed and built to create Ray clusters in Kubernetes with minimum effort. We present our architectural choices and our open-source implementation, and we analyze the performance of our solution.

    Information

    Published In

    WoC '21: Proceedings of the Seventh International Workshop on Container Technologies and Container Clouds
    December 2021
    37 pages
    ISBN:9781450391719
    DOI:10.1145/3493649
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

    In-Cooperation

    • IFIP

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Clustering, Linux Containers
    2. Distributed Systems
    3. Kubernetes
    4. Machine Learning
    5. Operators
    6. Ray Framework

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    Middleware '21
    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 124
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 26 Sep 2024

    Cited By

    • (2024) A Survey on Spatio-Temporal Big Data Analytics Ecosystem: Resource Management, Processing Platform, and Applications. IEEE Transactions on Big Data 10:2 (174-193). DOI: 10.1109/TBDATA.2023.3342619. Online publication date: Apr-2024
    • (2024) A MECApp-aware Lifecycle Management Approach in 5G Edge-Cloud Deployments. 2024 33rd International Conference on Computer Communications and Networks (ICCCN) (1-6). DOI: 10.1109/ICCCN61486.2024.10637651. Online publication date: 29-Jul-2024
    • (2024) An Evaluation of Time-Sliced GPU Sharing with KubeRay for Machine Learning Workloads. 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC) (1456-1459). DOI: 10.1109/COMPSAC61105.2024.00194. Online publication date: 2-Jul-2024
    • (2023) Performance costs for IPv6-based mobility management on the top of Kubernetes. 2023 IEEE 9th International Conference on Network Softwarization (NetSoft) (217-221). DOI: 10.1109/NetSoft57336.2023.10175456. Online publication date: 19-Jun-2023
    • (2023) Resource Orchestration and Scheduling Algorithms for Enhancing Distributed Reinforcement Learning. 2023 China Automation Congress (CAC) (8450-8455). DOI: 10.1109/CAC59555.2023.10451295. Online publication date: 17-Nov-2023
    • (2022) Intent-based 5G UPF configuration via Kubernetes Operators in the Edge. 2022 Thirteenth International Conference on Ubiquitous and Future Networks (ICUFN) (186-189). DOI: 10.1109/ICUFN55119.2022.9829576. Online publication date: 5-Jul-2022
