Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2408.12596 (cs)

[Submitted on 22 Aug 2024]

Title:Poplar: Efficient Scaling of Distributed DNN Training on Heterogeneous GPU Clusters

Authors:WenZheng Zhang, Yang Hu, Jing Shi, Xiaoying Bai

Abstract:Scaling Deep Neural Networks (DNNs) requires significant computational resources in terms of GPU quantity and compute capacity. In practice, there usually exist a large number of heterogeneous GPU devices because of the rapid release cycle of GPU products. Harnessing the power of heterogeneous GPUs efficiently and economically is therefore essential to meeting the requirements of DNN research and development. This paper introduces Poplar, a distributed training system that extends the Zero Redundancy Optimizer (ZeRO) with heterogeneity-aware capabilities. We explore a broader spectrum of GPU heterogeneity, including compute capability, memory capacity, quantity, and combinations of them. To achieve high computational efficiency across all heterogeneous conditions, Poplar conducts fine-grained measurements of GPUs in each ZeRO stage. We propose a novel batch allocation method and a search algorithm to optimize the utilization of heterogeneous GPU clusters. Furthermore, Poplar implements fully automated parallelism, eliminating the need for deploying heterogeneous hardware and finding a suitable batch size. Extensive experiments on three heterogeneous clusters, comprising six different types of GPUs, demonstrate that Poplar achieves a training throughput improvement of 1.02-3.92x over current state-of-the-art heterogeneous training systems.
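
The abstract does not spell out the batch allocation method, so the following is only an illustrative sketch of the general idea of heterogeneity-aware batch allocation, not Poplar's algorithm. It assumes each GPU has already been profiled for a throughput figure (samples per second) and a memory-limited maximum batch size; the function name `allocate_batch` and both input lists are hypothetical.

```python
from typing import List


def allocate_batch(global_batch: int,
                   throughput: List[float],
                   max_batch: List[int]) -> List[int]:
    """Split a global batch across GPUs in proportion to measured throughput.

    Hypothetical inputs (assumptions, not taken from the paper):
      throughput[i] -- positive samples/sec measured on GPU i
      max_batch[i]  -- largest batch that fits in GPU i's memory
    """
    if global_batch > sum(max_batch):
        raise ValueError("global batch exceeds total memory-limited capacity")

    alloc = [0] * len(throughput)
    remaining = global_batch
    active = set(range(len(throughput)))  # GPUs that still have room

    while remaining > 0:
        total = sum(throughput[i] for i in active)
        progressed = False
        # Give each GPU its proportional share, clipped to its memory cap.
        for i in sorted(active):
            share = int(remaining * throughput[i] / total)
            give = min(share, max_batch[i] - alloc[i])
            if give > 0:
                alloc[i] += give
                progressed = True
        remaining = global_batch - sum(alloc)
        active = {i for i in active if alloc[i] < max_batch[i]}

        if not progressed:
            # Rounding left a few samples: hand them to the fastest GPUs.
            for i in sorted(active, key=lambda j: (-throughput[j], j)):
                if remaining == 0:
                    break
                alloc[i] += 1
                remaining -= 1
            active = {i for i in active if alloc[i] < max_batch[i]}

    return alloc
```

For example, with a global batch of 64 and two GPUs whose measured throughputs are in a 3:1 ratio, the faster GPU receives three quarters of the samples unless its memory cap is reached, in which case the overflow moves to the slower device. A real system would also need to account for communication cost and the memory behaviour of each ZeRO stage, which this sketch ignores.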

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2408.12596 [cs.DC]
	(or arXiv:2408.12596v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2408.12596

Submission history

From: WenZheng Zhang
[v1] Thu, 22 Aug 2024 17:58:06 UTC (725 KB)
