
Deep Fusion: Efficient Network Training via Pre-trained Initializations

Published: 02 May 2024 · Last Modified: 25 Jun 2024 · ICML 2024 Poster · CC BY 4.0
Abstract: Training deep neural networks for large language models (LLMs) remains computationally expensive. To mitigate this, network growing algorithms offer potential cost savings, but their underlying mechanisms are poorly understood. In this paper, we propose a theoretical framework using backward error analysis to illuminate the dynamics of mid-training network growth. Furthermore, we introduce Deep Fusion, an efficient network training approach that leverages pre-trained initializations of smaller networks, facilitating network growth from diverse sources. Our experiments validate the power of our theoretical framework in guiding the optimal use of Deep Fusion. With carefully optimized training dynamics, Deep Fusion demonstrates significant reductions in both training time and resource consumption. Importantly, these gains are achieved without sacrificing performance. We demonstrate reduced computational requirements and improved generalization performance across a variety of NLP tasks and T5 model sizes.
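Since the paper itself is not reproduced on this page, the following is only a minimal, generic sketch of the kind of growth-from-pre-trained-initialization idea the abstract alludes to: re-using a smaller pre-trained layer's weights to initialize a wider layer before continuing training. It is not the Deep Fusion algorithm; the function name `grow_width`, the column-tiling scheme, and the `noise_scale` parameter are illustrative assumptions.

```python
import numpy as np

def grow_width(w_small, new_width, noise_scale=1e-3):
    """Initialize a wider weight matrix from a smaller pre-trained one.

    w_small: pre-trained weights of shape (d_in, d_small).
    Returns an array of shape (d_in, new_width) whose columns are tiled
    copies of the pre-trained columns plus small symmetry-breaking noise.
    """
    d_in, d_small = w_small.shape
    assert new_width >= d_small, "target width must not shrink the layer"
    idx = np.arange(new_width) % d_small              # reuse pre-trained columns cyclically
    w_big = w_small[:, idx].copy()
    w_big += noise_scale * np.random.randn(d_in, new_width)  # break ties between duplicated units
    return w_big

# Example: grow a 128-unit pre-trained layer to 256 units.
np.random.seed(0)
w_pretrained = 0.02 * np.random.randn(512, 128)
w_init = grow_width(w_pretrained, new_width=256)
print(w_init.shape)  # (512, 256)
```

The small additive noise is one common way to keep duplicated units from remaining identical under gradient descent; the paper's actual initialization and growth schedule should be taken from the full text rather than this sketch.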
Submission Number: 3451