Computer Science > Machine Learning

arXiv:2401.09135 (cs)

[Submitted on 17 Jan 2024 (v1), last revised 23 Sep 2024 (this version, v2)]

Title:Asynchronous Local-SGD Training for Language Modeling

Authors:Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato

View PDF HTML (experimental)

Abstract:Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of {\it asynchronous} Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation by examining how worker hardware heterogeneity, model size, number of workers, and optimizer could impact the learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We propose a novel method that utilizes a delayed Nesterov momentum update and adjusts the workers' local training steps based on their computation speed. This approach, evaluated with models up to 150M parameters on the C4 dataset, matches the performance of synchronous Local-SGD in terms of perplexity per update step, and significantly surpasses it in terms of wall clock time.

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2401.09135 [cs.LG]
	(or arXiv:2401.09135v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2401.09135

Submission history

From: Arthur Douillard [view email]
[v1] Wed, 17 Jan 2024 11:17:04 UTC (6,353 KB)
[v2] Mon, 23 Sep 2024 10:49:33 UTC (6,353 KB)

Computer Science > Machine Learning

Title:Asynchronous Local-SGD Training for Language Modeling

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Machine Learning

Title:Asynchronous Local-SGD Training for Language Modeling

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators