Computer Science > Distributed, Parallel, and Cluster Computing
[Submitted on 20 May 2022 (v1), last revised 12 Aug 2024 (this version, v3)]
Title: MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services
Abstract: Modern internet services, such as chatbots, search engines, and online advertising, demand large-scale deep neural networks (DNNs), and distributed training and inference over heterogeneous computing systems are needed to serve these models. Mixture-of-Experts (MoE) is one of the most common strategies for lowering training cost relative to the overall size of models/data, through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts to carry out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference can be further improved from several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present MoESys, a novel system that boosts efficiency in both large-scale training and inference. Specifically, in the training procedure, MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage, so as to achieve efficient parallelism. For scalable inference on a single node, especially when the model size exceeds GPU memory, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate MoESys, in which it successfully trained a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The comparison against the state of the art shows that MoESys outperformed DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. In particular, under unbalanced MoE tasks, e.g., UFO, MoESys achieved 64% higher throughput with an 18% lower memory footprint.
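To make the single-node inference idea concrete, the following is a minimal, framework-free sketch of executing a model over a "ring" of CPU-GPU memory sections in round-robin order, as described in the abstract. All names here (Section, RingScheduler, gpu_capacity) and the synchronous load/evict logic are illustrative assumptions for exposition, not the actual MoESys implementation; in a real system the next section's host-to-device copy would overlap with computation on the currently resident section.

```python
# Sketch: round-robin inference over a ring of CPU/GPU memory sections.
# Hypothetical names and toy computation; not the MoESys API.

from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class Section:
    """A contiguous slice of model layers sized to fit one memory section."""
    name: str
    weights: List[float]            # stand-in for real layer parameters
    on_gpu: bool = False

    def load_to_gpu(self) -> None:
        # In a real system: an asynchronous host-to-device copy,
        # overlapped with computation on the previously loaded section.
        self.on_gpu = True

    def offload_to_cpu(self) -> None:
        self.on_gpu = False

    def forward(self, x: float) -> float:
        assert self.on_gpu, "section must be resident before compute"
        for w in self.weights:      # toy computation: chained scaling
            x = x * w
        return x


class RingScheduler:
    """Cycles sections through a fixed GPU budget in round-robin order."""

    def __init__(self, sections: List[Section], gpu_capacity: int = 2):
        self.ring = deque(sections)
        self.gpu_capacity = gpu_capacity
        self.resident: deque = deque()

    def _admit(self, section: Section) -> None:
        # Evict the oldest resident section when the GPU budget is full.
        if len(self.resident) == self.gpu_capacity:
            evicted = self.resident.popleft()
            evicted.offload_to_cpu()
        section.load_to_gpu()
        self.resident.append(section)

    def infer(self, x: float) -> float:
        # Walk the ring once: each section is loaded on demand, used,
        # and eventually evicted to make room for the next one.
        for section in self.ring:
            if not section.on_gpu:
                self._admit(section)
            x = section.forward(x)
        return x


if __name__ == "__main__":
    # A "model" of 6 sections but a GPU budget of only 2 sections,
    # mimicking the case where the model is larger than GPU memory.
    model = [Section(f"sec{i}", weights=[1.0 + 0.1 * i] * 3) for i in range(6)]
    scheduler = RingScheduler(model, gpu_capacity=2)
    print(scheduler.infer(1.0))
```

The design point illustrated is that only a bounded number of sections ever occupy GPU memory at once, so inference cost in device memory stays constant while the full model cycles through from host memory.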
Submission history
From: Liang Shen
[v1] Fri, 20 May 2022 09:09:27 UTC (450 KB)
[v2] Mon, 12 Jun 2023 12:07:22 UTC (465 KB)
[v3] Mon, 12 Aug 2024 09:23:13 UTC (1,275 KB)