Computer Science > Machine Learning

arXiv:2407.17465 (cs)

[Submitted on 24 Jul 2024 (v1), last revised 29 Oct 2024 (this version, v2)]

Title: u-$\mu$P: The Unit-Scaled Maximal Update Parametrization

Authors: Charlie Blake, Constantin Eichenberg, Josef Dean, Lukas Balles, Luke Y. Prince, Björn Deiseroth, Andres Felipe Cruz-Salinas, Carlo Luschi, Samuel Weinbach, Douglas Orr

Abstract: The Maximal Update Parametrization ($\mu$P) aims to make the optimal hyperparameters (HPs) of a model independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-$\mu$P, which improves upon $\mu$P by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: $\mu$P ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-$\mu$P models reaching a loss that is equal to or lower than that of comparable $\mu$P models and working out-of-the-box in FP8.
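
To illustrate the claim that activations can begin training with a scale of one regardless of model size, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes unit-variance Gaussian inputs and weights and a forward scaling factor of 1/sqrt(fan_in) on the matrix multiply; the helper names `unit_scaled_matmul` and `output_std` are hypothetical, and the scaling of gradients and the $\mu$P-specific rules from the paper are not reproduced.

```python
import math
import random
import statistics


def unit_scaled_matmul(x, w):
    """Compute (x @ w) / sqrt(fan_in) for x of shape (batch, fan_in)
    and w of shape (fan_in, fan_out), using plain Python lists.

    Assumption: the 1/sqrt(fan_in) factor is an illustrative forward
    scaling rule; the paper's full set of rules is not shown here.
    """
    fan_in = len(w)
    fan_out = len(w[0])
    scale = 1.0 / math.sqrt(fan_in)
    return [
        [scale * sum(row[k] * w[k][j] for k in range(fan_in)) for j in range(fan_out)]
        for row in x
    ]


def output_std(fan_in, batch=32, fan_out=32, seed=0):
    """Standard deviation of the scaled matmul output when inputs and
    weights are both drawn from a unit-variance Gaussian."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)] for _ in range(batch)]
    w = [[rng.gauss(0.0, 1.0) for _ in range(fan_out)] for _ in range(fan_in)]
    y = unit_scaled_matmul(x, w)
    return statistics.pstdev([value for row in y for value in row])


if __name__ == "__main__":
    # The output scale should stay close to one as the width grows.
    for width in (64, 256, 1024):
        print(width, round(output_std(width), 3))
```

Without the 1/sqrt(fan_in) factor, the output standard deviation in this setup would grow roughly as sqrt(fan_in), so the activation scale would depend on model width.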

Comments:	50 pages
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2407.17465 [cs.LG]
	(or arXiv:2407.17465v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2407.17465

Submission history

From: Charlie Blake
[v1] Wed, 24 Jul 2024 17:58:42 UTC (6,696 KB)
[v2] Tue, 29 Oct 2024 18:09:28 UTC (7,220 KB)
