Computer Science > Distributed, Parallel, and Cluster Computing

arXiv:2402.06194 (cs)

[Submitted on 9 Feb 2024 (v1), last revised 8 Jun 2024 (this version, v2)]

Title: SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation

Abstract: Reliability in cloud AI infrastructure is crucial for cloud service providers, prompting the widespread use of hardware redundancies. However, these redundancies can inadvertently lead to hidden degradation, so-called "gray failure", for AI workloads, significantly affecting end-to-end performance and concealing performance issues, which complicates root cause analysis for failures and regressions.
We introduce SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation caused by hardware redundancies and enhances overall reliability. SuperBench features a comprehensive benchmark suite capable of evaluating individual hardware components and representing most real AI workloads. It comprises a Validator that learns benchmark criteria to clearly pinpoint defective components. Additionally, SuperBench incorporates a Selector to balance validation time and issue-related penalties, enabling optimal timing for validation execution with a tailored subset of benchmarks. Through testbed evaluation and simulation, we demonstrate that SuperBench can increase the mean time between incidents by up to 22.61x. SuperBench has been successfully deployed in Azure production, validating hundreds of thousands of GPUs over the last two years.

Comments:	USENIX ATC '24
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2402.06194 [cs.DC]
	(or arXiv:2402.06194v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2402.06194

Submission history

From: Yifan Xiong
[v1] Fri, 9 Feb 2024 05:27:07 UTC (683 KB)
[v2] Sat, 8 Jun 2024 01:30:27 UTC (670 KB)
