Abstract
High-performance optimization algorithms are essential in deep learning. However, understanding the behavior of optimization (i.e., the learning process) remains challenging due to the instability and weak interpretability of the algorithms. Since gradient-based optimizations can be interpreted as continuous-time dynamical systems, applying feedback control to the dynamical systems that model the optimizers may provide another perspective for exploring more robust, accurate, and explainable optimization algorithms. In this study, we present a framework for optimization called the controlled heavy-ball optimizer. By employing the proportional-integral-derivative (PID) controller in the optimizer, we develop a deterministic continuous-time optimizer called the Proportional-Integral-Derivative Accelerated Optimizer (PIDAO), and provide theoretical convergence analysis of PIDAO in unconstrained (non-)convex optimization. As a byproduct, we derive PIDAO-family schemes for training deep neural networks by using specific discretization methods. Compared with classical optimizers, PIDAO empirically demonstrates a more aggressive capacity to explore the loss landscape at lower computational cost, owing to the properties of the PID controller. Experimental evaluations demonstrate that PIDAO can accelerate convergence and enhance the accuracy of deep learning, achieving state-of-the-art performance compared with advanced algorithms.
Introduction
Deep learning has gained popularity due to the ability of deep function approximation to achieve state-of-the-art performance in a large variety of engineering and scientific tasks. A representative example is learning-based control, where machine learning or deep learning is leveraged to serve complex controlled intelligent systems with remarkable results surpassing those of traditional methods (see Fig. 1a), e.g., motion planning in robotics1,2,3, racing4,5, prediction and control of complex physical systems6,7,8,9, such as fusion plasmas10,11. All these deep learning-based tasks primarily rely on solving large-scale optimization problems12,13, i.e., training deep neural networks (NNs), some of which have millions or even billions of decision variables14,15. In general, the training loss landscapes of deep learning models are highly degenerate16, featuring multiple local and global minima that lead to varying generalization capability and model performance17,18. In order to obtain suitable minima in the high-dimensional variable space, gradient-based optimization algorithms19 are generally and successfully employed (Fig. 1a). For example, the gradient descent (GD) and adaptive moment estimation (Adam20) algorithms are the most commonly used optimizers in the deep learning community. To accelerate the training process, physics-inspired mechanisms such as momentum21 can also be incorporated into the optimization algorithms. There is no doubt that these optimizers contribute greatly to the success of deep learning. However, understanding the impact of optimization on the generalization and approximation of deep learning models remains challenging.
a Setup of the connection between machine learning, control, and optimization. In this work, a perspective of active control is introduced into optimization to further provide a route for control-oriented learning. b A physical interpretation of optimization. The algorithm can be modeled as a passively forced ball using Newton’s second law. c An illustration of the controlled heavy-ball optimizer. d A diagram of the classical PID controller. e Architecture of the proposed optimizer, PIDAO, which consists of three key components from the PID mechanism: (1) a proportional part of the “present” output y(t) = ∇ f(X); (2) a derivative of the output extracting the “tendency”; and (3) an integral of the output representing an accumulation of the “past”.
A possible solution to understanding the optimization (i.e., learning process) is to model the gradient-based algorithms or NNs as continuous-time dynamical systems22,23, or discrete-time feedback systems24,25,26,27. Here, we use the momentum optimizer as an example. Let us consider a differentiable objective function \(f:\Omega \to {\mathbb{R}},x\mapsto f(x)\) from a feasible set Ω to the real numbers, where x ∈ Ω is the decision variable. The momentum optimizer satisfies the iterative scheme xk+1 = xk + β(xk − xk−1) − bs ∇ f(xk) with iteration index k, momentum coefficient β > 0, a step size of s, and a step-scaling factor of b > 0. Upon that, for \({\min }_{x\in \Omega }f(x)\), we can obtain the sequence \({\{{x}_{k}\}}_{k\ge 0}\) generated by the momentum method and compute f(xk) to approximate the minimum as k → + ∞. By rewriting the momentum method as \(\frac{{x}_{k+1}-2{x}_{k}+{x}_{k-1}}{{(\sqrt{s})}^{2}}=-\frac{1-\beta }{\sqrt{s}}\frac{{x}_{k}-{x}_{k-1}}{\sqrt{s}}-b\nabla f({x}_{k})\), with \(\frac{{x}_{k+1}-2{x}_{k}+{x}_{k-1}}{{(\sqrt{s})}^{2}}\approx \ddot{X}(k\sqrt{s})\) and \(\frac{{x}_{k}-{x}_{k-1}}{\sqrt{s}}\approx \dot{X}(k\sqrt{s})\), this iterative scheme can be regarded as an explicit Euler discretization of the underlying dynamical system:
$$\ddot{X}(t)+a\dot{X}(t)+b\nabla f(X(t))=0,\qquad (1)$$
where \(a\, \triangleq \, \frac{1-\beta }{\sqrt{s}}\), X ≜ X(t) is the state at time t in the continuous-time domain, \(\dot{X}\) is the derivative of X with respect to t, and ∇ f(X) is the gradient of f with respect to X. \(X(k\sqrt{s})\) approximates xk as s → 0. Here, the state X(t) and a are tightly coupled. Specifically, since Eq. (1) is a second-order system, by using Newton’s second law and treating \(\ddot{X}\) as an acceleration21, Eq. (1) can be modeled as a ball with unit mass rolling across the loss landscape f(X), seeking minima passively, driven by a “friction force” \(-a\dot{X}\) and a “gravity” − b ∇ f(X) (see Fig. 1b).
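This correspondence can be checked numerically. Below is a minimal sketch (our illustration, not the paper’s released code) that runs the discrete momentum method and an explicit Euler discretization of Eq. (1) side by side on a one-dimensional quadratic; the values of β, b, and s are arbitrary illustrative choices.

```python
import numpy as np

# Illustrative 1D quadratic; grad f(x) = x. Any smooth objective would do.
grad = lambda x: x

beta, b, s = 0.9, 1.0, 1e-4          # momentum coefficient, scaling factor, step size
a = (1 - beta) / np.sqrt(s)          # friction coefficient of the limiting ODE (1)

# Discrete momentum method: x_{k+1} = x_k + beta*(x_k - x_{k-1}) - b*s*grad(x_k).
x, x_prev = 2.0, 2.0
for _ in range(2000):
    x, x_prev = x + beta * (x - x_prev) - b * s * grad(x), x

# Explicit Euler discretization of Eq. (1) with time step sqrt(s).
X, V, dt = 2.0, 0.0, np.sqrt(s)
for _ in range(2000):
    X, V = X + dt * V, V + dt * (-a * V - b * grad(X))

print(f"momentum iterate: {x:.6f}   Euler iterate of Eq. (1): {X:.6f}")
```

The two iterates remain close, consistent with the s → 0 limit described above.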
Interpreting the gradient-based optimizations as continuous-time dynamical systems allows us to gain insights into the optimization process. By doing this, phenomena that are not easily explained can be revealed from a physical perspective by building the connection between the properties of learning frameworks (e.g., universal approximation and controllability) and the concepts of control (convergence and Lyapunov stability)23,27,28,29,30,31,32. For example, in addition to the aforementioned momentum optimizer, the commonly used GD and Adam algorithms have been regarded as continuous-time differential equations33,34,35, allowing us to reveal the properties of the training dynamics (e.g., generalization and convergence) via Lyapunov stability29,36,37,38 and to understand the NNs trained using gradient-based algorithms39,40. Such a combination of dynamical systems and control theory to understand the optimization and learning process (i.e., control for learning) has attracted increasing attention and has been developing rapidly over the past few years41,42,43. One can introduce open-loop or closed-loop feedback controllers44,45 into the dynamical systems to mitigate various kinds of deep learning pathologies. For example, an optimal control-based optimizer has been proposed that may avoid some pitfalls (e.g., slow convergence on flat loss landscapes) of gradient-based methods24,46,47. Recently, closed-loop control via feedback has been utilized to enhance the robustness of the training dynamics of generative adversarial networks48 and to automatically detect and rectify errors caused by perturbations in NNs49. The state-feedback law has also been used to mitigate the training unbalance caused by gradient flow50 and convergence rate problems51,52. Following these pioneering works, which provide promising directions on control for learning, in this paper we are interested in answering the following question: can the optimization algorithms in deep learning benefit from control theory?
To this end, we integrate active control via feedback into the continuous-time optimization framework (Fig. 1a). First, we recall the momentum optimizer in (Eq. (1)). The ODE (Eq. (1)) can be rewritten as a second-order system \(\ddot{X}+a\dot{X}+u=0\) with a control input u = b ∇ f(X). Building upon this concept, we propose a framework called the controlled heavy-ball optimizer, which is depicted in Fig. 1c and formulated by the following ODE:
$$\ddot{X}+a\dot{X}+u\left(X,\dot{X},\nabla f(X),{\nabla }^{2}f(X)\right)=0,\qquad (2)$$
where ∇2f(X) is the Hessian matrix. From a physical point of view, Eq. (2) is regarded as a mechanical system with active control \(u\, \triangleq \, u(X,\dot{X},\nabla f(X),{\nabla }^{2}f(X))\) (Fig. 1c). The optimizer with the dynamics (2) seeks optima across the loss landscape f(X) driven by the feedback law u. Unlike the momentum optimizer (Eq. (1)), which seeks optima passively, the optimizer (2) actively employs the feedback control u to improve its optimization performance, as depicted in Fig. 1c. For the momentum optimizer (Eq. (1)), different values of the coefficient a can impact the convergence rate and the generalization capabilities of the trained models53, yet it is challenging to determine a suitable a due to the tight coupling between a and the state X(t). By introducing an additional active control u, we can apply various control theories44 to decouple or minimize this dependency, thus improving the interpretability and stability of the optimizers. Therefore, Eq. (2) is the basic formulation of the optimizer proposed in this paper. The next question is how to design the feedback control u.
Designing the feedback control law is the central issue of control theory44. One of the most renowned and efficient feedback controllers in the control community is the proportional-integral-derivative (PID) controller54. The PID input is a sum of the proportional, derivative, and integral actions of the error, where the error is the difference between the output of a physical model and the desired reference, as depicted in Fig. 1d. The PID controller can reduce the error and enhance the robustness of the system through a feedback input combining the “present” (proportional action), the “past” (integral action), and the “tendency” (derivative action)55,56. The idea of the PID controller has been used in the design of model architectures57,58 and optimizers59. In particular, for the optimizer, through ∇ f(X), the GD and momentum (Eq. (1)) methods can be seen as P-controllers that only consider the “present”. Wang et al.59 first introduced the idea of PID into optimizer design and proposed a discrete-time optimizer (called PIDopt) by adding a derivative term to the momentum optimizer. However, PIDopt was proposed through qualitative analysis and therefore lacks a rigorous theoretical framework and interpretability.
In this paper, after formulating the optimization problem as a continuous-time control system (Eq. (2)), we quantitatively introduce the PID controller into the design of optimization algorithms to accelerate the learning process and improve the accuracy. By doing this, a more straightforward interpretation of optimization is available due to the interpretability of the PID parametric space. We first employ the PID controller to generate a feedback control input u for the controlled heavy-ball optimizer (Fig. 1e). As a result, we propose an accelerated optimizer called the PID accelerated optimizer (PIDAO) within the continuous-time framework, formulated by Eq. (3) below. We present global and local convergence results of PIDAO in unconstrained (strongly) convex and non-convex settings via Lyapunov analysis, respectively, which demonstrate its ability to explore the loss landscape aggressively via the result \(\int_{{t}_{0}}^{t}\nabla f(X(s)){{{\rm{d}}}}s \to 0\). This exploration capability helps PIDAO escape saddle points faster and converge to better minima during the deep learning training process, such as flat minima that promote improved generalization16,17. We use specific discretization methods to derive PIDAO-family schemes for deep learning problems and further validate the effectiveness of the algorithm. Numerical results show that PIDAO inherits the effectiveness of the PID controller in terms of reducing generalization errors and accelerating training convergence with lower computational costs, and outperforms the state-of-the-art methods in various benchmark cases.
We note that our proposal shares a similar idea with PIDopt59, but differs from PIDopt in terms of the problem formulation, availability of convergence proofs, discrete iterative formulation, and performance for a number of test cases. A detailed comparison between PIDopt and our work is shown in the experiments and in Supplementary Information (SI) Section I.
Results
We first describe the setup of the unconstrained optimization problem. Consider the optimization problem \({\min }_{X\in {{\mathbb{R}}}^{n}}f(X)\), where f is a differentiable and bounded function. We denote \({f}^{\star }\triangleq {\min }_{X\in {{\mathbb{R}}}^{n}}f\) and \({X}^{\star }\triangleq {{\mbox{argmin}}}_{X\in {{\mathbb{R}}}^{n}}f(X)\). We use ∥ ⋅ ∥ to represent the standard Euclidean norm hereafter. Three sets of functions are introduced for f(X): \({{{{\mathcal{S}}}}}_{\mu,L}^{2}({{\mathbb{R}}}^{n})\), \({{{{\mathcal{F}}}}}_{L}^{2}({{\mathbb{R}}}^{n})\), and \({{{{\mathcal{P}}}}}_{\mu }^{2}({{\mathbb{R}}}^{n})\). In particular, (1) \(f\in {{{{\mathcal{F}}}}}_{L}^{2}\) means that f(x) is convex and twice-differentiable for any \(x\in {{\mathbb{R}}}^{n}\) with an L-Lipschitz continuous gradient. (2) \(f\in {{{{\mathcal{S}}}}}_{\mu,L}^{2}({{\mathbb{R}}}^{n})\) (called a strongly convex function) means that \(f\in {{{{\mathcal{F}}}}}_{L}^{2}\) and \(f(X)-\frac{\mu }{2}\parallel X-{X}^{\star }{\parallel }^{2}\) is also convex (X⋆ is the minimizer of f). (3) Lastly, \(f\in {{{{\mathcal{P}}}}}_{\mu }^{2}\) satisfies twice-differentiability and the Polyak-Łojasiewicz (PL) condition \(\frac{1}{2}\parallel \nabla f(X){\parallel }^{2}\ge \mu (f(X)-{f}^{\star })\) for any \(X\in {{\mathbb{R}}}^{n}\). Moreover, \({C}^{k}({{{\mathcal{M}}}})\) denotes the space of k-times continuously differentiable functions from space \({{{\mathcal{M}}}}\) to space \({{{\mathcal{M}}}}\). Here, we define three parametric sets: (1) \({{{{\mathcal{S}}}}}_{{{{\rm{PID}}}}{-}{{{\rm{SC}}}}}\) and \({{{{\mathcal{S}}}}}_{{{{\rm{PID}}}}{-}{{{\rm{NC}}}}}\) are two sets of hyperparameters (kp, ki, kd, a) in the PIDAO, and (2) \({{{{\mathcal{S}}}}}_{{{{\rm{EPID}}}}{-}{{{\rm{C}}}}}\) is a set of hyperparameters (kp, ki, kd, a, c) in the Enhanced PIDAO. These three sets are defined in Methods.
Problem Setup: formulations of PIDAO
Before presenting the basic results of the PIDAO for \({\min }_{X\in {{\mathbb{R}}}^{n}}f(X)\), we first formulate two types of PIDAOs for unconstrained optimization. Firstly, based on Fig. 1d, one basic control input u can be defined as the sum of the proportional, integral, and derivative terms of the output y(t) = ∇ f(X):
$$u(t)={k}_{p}\,y(t)+{k}_{i}\int_{{t}_{0}}^{t}y(s)\,{{{\rm{d}}}}s+{k}_{d}\,\dot{y}(t),$$
where kp, ki, and kd are constants. Then, by combining Eq. (2) and the above input u, for unconstrained optimization, the standard PIDAO is described by the dynamical system:
$$\ddot{X}+a\dot{X}+{k}_{p}\nabla f(X)+{k}_{d}{\nabla }^{2}f(X)\dot{X}+{k}_{i}\int_{{t}_{0}}^{t}\nabla f(X(s))\,{{{\rm{d}}}}s=0,\qquad (3)$$
where a, kp, ki, and kd are hyperparameters of this optimizer. Without loss of generality, let us assume that the optimizer starts at t = t0. As discussed in Sec. “Introduction”, a control system has various output options, such as ∇ f(X) and \(\dot{X}\). Hence, both ∇ f(X) and \(\dot{X}\) can be integrated into the output y(t), resulting in \(y(t)=c\dot{X}+\nabla f(X)\). Then, by replacing y(t) = ∇ f(X) in (3) with \(y(t)=c\dot{X}+\nabla f(X)\), we can obtain another PID-form optimizer called the Enhanced PIDAO. This formulation is mathematically equivalent to
$$(1+c{k}_{d})\ddot{X}+(a+c{k}_{p})\dot{X}+{k}_{p}\nabla f(X)+{k}_{d}{\nabla }^{2}f(X)\dot{X}+{k}_{i}\int_{{t}_{0}}^{t}\left(\nabla f(X(s))+c\dot{X}(s)\right){{{\rm{d}}}}s=0.\qquad (4)$$
In addition, when the integral coefficient ki in the PIDAO or the Enhanced PIDAO equals 0, the degenerated optimizer is called PDAO, which is formulated by
$$\ddot{X}+a\dot{X}+{k}_{p}\nabla f(X)+{k}_{d}{\nabla }^{2}f(X)\dot{X}=0.\qquad (5)$$
Furthermore, we also consider the possibility of dynamically adjusting the parameters kp, ki, kd in order to better balance or capture the information contained in the output feedback y(t) = ∇ f(X). More discussion of a time-varying PIDAO is presented in SI Section E. For unconstrained optimization \({\min }_{X\in {{\mathbb{R}}}^{n}}f(X)\), one fundamental question arises regarding Eqs. (3)–(5): how to choose the hyperparameters kp, ki, kd, and a such that the solutions X(t) of Eqs. (3)–(5) satisfy \(f(X(t))\to {\min }_{X\in {{\mathbb{R}}}^{n}}f\) or ∇ f(X(t)) → 0 as t → + ∞. We provide theoretical results for this question in Theorems 1–4 in “Methods”. In the following, we demonstrate the main findings of PIDAO for various optimization problems.
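As noted in Methods, the continuous-time optimizers in Fig. 2 are implemented with the solve_ivp method. The minimal sketch below illustrates how Eq. (3) can be integrated in this way, by augmenting the state with the velocity and the running gradient integral; the objective is the ill-conditioned quadratic of Fig. 2a, while the hyperparameter values are illustrative choices of ours, not the tuned settings used in the experiments.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Ill-conditioned quadratic from Fig. 2a: f(X) = 0.05*X1^2 + 5*X2^2.
H = np.diag([0.1, 10.0])                 # Hessian, so grad f(x) = H @ x
grad = lambda x: H @ x

# Illustrative hyperparameters (assumed; they satisfy kp > ki/a + a*kd).
kp, ki, kd, a = 20.0, 10.0, 0.5, 4.0

def pidao_rhs(t, s):
    # State s = [x (2 dims), v = dX/dt (2 dims), z = integral of grad f (2 dims)].
    x, v, z = s[:2], s[2:4], s[4:]
    dv = -a * v - kp * grad(x) - kd * H @ v - ki * z     # Eq. (3)
    return np.concatenate([v, dv, grad(x)])

s0 = np.zeros(6)
s0[:2] = [-2.0, 2.0]                                     # X(0) = [-2, 2]
sol = solve_ivp(pidao_rhs, (0.0, 50.0), s0, rtol=1e-9, atol=1e-12)

print("X(T)               ~", sol.y[:2, -1])   # approaches the minimizer [0, 0]
print("integral of grad f ~", sol.y[4:, -1])   # offset effect: the integral also vanishes
```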
Global convergence for convex optimization
Firstly, the strongly convex objective \(f\in {{{{\mathcal{S}}}}}_{\mu,L}^{2}({{\mathbb{R}}}^{n})\) is considered here. The strongly convex objective is the simplest type of objective function: it has a unique global minimum and often arises in regression problems. We present a global convergence result in Theorem 1 in Methods, namely for any \(({k}_{p},{k}_{i},{k}_{d},a)\in {{{{\mathcal{S}}}}}_{{{{\rm{PID}}}}{-}{{{\rm{SC}}}}}\) defined in (6) and \(X({t}_{0})\in {{\mathbb{R}}}^{n}\), there exists a positive constant η such that the solution X(t) of the PIDAO satisfies \(\parallel X(t)-{X}^{\star }\parallel+\parallel \int_{{t}_{0}}^{t}\nabla f(X(s)){{{\rm{d}}}}s\parallel \le O({e}^{-\eta t})\). Since \(f\in {{{{\mathcal{S}}}}}_{\mu,L}^{2}({{\mathbb{R}}}^{n})\), it follows that ∥ ∇ f(X(t))∥≤O(e−ηt) and f(X(t)) − f⋆≤O(e−ηt), which eventually means that when the hyperparameters of PIDAO are selected in \({{{{\mathcal{S}}}}}_{{{{\rm{PID}}}}{-}{{{\rm{SC}}}}}\), f(X(t)) converges to the global minimum at an exponential rate.
We consider a vanilla quadratic loss as the strongly convex objective, \(f(X)=0.05{X}_{1}^{2}+5{X}_{2}^{2}\), to verify the performance of the PIDAO, which is demonstrated in Fig. 2a. There exists a line y = kt + b with a finite slope k < 0 and a finite intercept b such that \(\log \left(f(X(t))-{f}^{\star }\right)\le kt+b\), which further means that f(X(t)) − f⋆ ≤ O(ekt). However, in Fig. 2a, we observe that compared with the momentum and PDAO optimizers, PIDAO has a significant overshoot near X⋆ = [0, 0], which slows down the convergence for strongly convex optimization. This phenomenon is denoted as the offset effect of PIDAO, which will be discussed in Sec. “PIDAO has more potential to escape local minima” when explaining the dynamics of PIDAO.
a Evolution trajectories and loss errors of different optimizers with X(0) = [ − 2, 2] for the same ill-conditioned quadratic loss \(f(X)=0.05{X}_{1}^{2}+5{X}_{2}^{2}\). b PIDAO can be approximated by the Enhanced PIDAO with a small parameter ci(i = 1, 2, 3), where they all start from X(0) = [ − 2, 2]. Here c1 = 1/2, c2 = 1/50, c3 = 1/1000, respectively. Evolution trajectories of PIDAOs for three test objectives: Six-hump camel back function (c), Rastrigin function (d), and Rosenbrock function (e). f Evolution trajectories of four PIDAOs with different ki for the same quadratic objective function \(f(X)=2{X}_{1}^{2}+5{X}_{2}^{2}\) with initial point X(0) = [ − 2, − 2], formulated by (3), where ki,1 > ⋯ > ki,4. The red star in the above figures denotes the initial point X(0). The dynamical systems of the optimizers: Nesterov’s accelerated gradient optimizer (NAG)38, Momentum, PIDAO, Enhanced PIDAO, and PDAO, are summarized in Table 3. Furthermore, the hyperparameters of the above cases are given in Methods.
Apart from the basic exponential convergence of PIDAO, including ∥X(t) − X⋆∥ ≤ O(e−ηt), we remark that PIDAO achieves an interesting result: \(\parallel \int_{{t}_{0}}^{t}\nabla f(X(s)){{{\rm{d}}}}s\parallel\) also converges at an exponential rate, which will be discussed in more detail in Sec. “PIDAO has more potential to escape local minima” as well. To the best of our knowledge, none of the existing gradient-based dynamics, such as the low-resolution dynamics38,60 and the high-resolution dynamics29,61, can produce this result, which also cannot be deduced from the inequality ∥ ∇ f(X(t))∥ ≤ O(e−ηt). A more detailed argument that other optimizers do not share this property is given in SI Section B.1.
Next, we consider convex objective functions that are more general than strongly convex functions, which appear in machine learning problems such as support vector machines and principal component analysis. We present convergence results of the PIDAO (3) and the Enhanced PIDAO (4) for the convex objective \(f\in {{{{\mathcal{F}}}}}_{L}^{2}({{\mathbb{R}}}^{n})\). Let us first focus on the convex optimization problem for the Enhanced PIDAO (4). We provide a global convergence result in Theorem 2, which says that for any \(({k}_{p},{k}_{i},{k}_{d},a,c)\in {{{{\mathcal{S}}}}}_{{{{\rm{EPID}}}}{-}{{{\rm{C}}}}}\) defined in (7) and X(t0), the solution X(t) of the Enhanced PIDAO has \({\lim }_{t \to +\infty }\int_{{t}_{0}}^{t}\left(\nabla f(X(s))+c\dot{X}\right){{{\rm{d}}}}s =\mathop{\lim }_{t \to +\infty }f(X(t))-{f}^{\star }=0\). Regarding the second result in Theorem 2, since a strongly convex function \(f\in {{{{\mathcal{S}}}}}_{\mu,L}^{2}({{\mathbb{R}}}^{n})\subset {{{{\mathcal{F}}}}}_{L}^{2}({{\mathbb{R}}}^{n})\bigcap {{{{\mathcal{P}}}}}_{\mu }^{2}({{\mathbb{R}}}^{n})\), one can also obtain the exponential convergence of the Enhanced PIDAO. Specifically, if \(f\in {{{{\mathcal{S}}}}}_{\mu,L}^{2}({{\mathbb{R}}}^{n})\) and \(({k}_{p},{k}_{i},{k}_{d},a,c)\in {{{{\mathcal{S}}}}}_{{{{\rm{EPID}}}}{-}{{{\rm{C}}}}}\) defined in (7), there exists η > 0 that only depends on hyperparameters such that \(\parallel \int_{{t}_{0}}^{t} \left(\nabla f(X(s))+ c\dot{X}\right){{{\rm{d}}}}s\parallel+f(X(t))-{f}^{\star }\le O({e}^{-\eta t}).\)
To verify Theorem 2, we test the performance of Enhanced PIDAO using the same quadratic loss function as in Fig. 2a. We choose three values of c for the optimizer and the exponential convergence results are shown in Fig. 2b. It is also shown that the difference in c has an impact on the convergence rate.
We note that Theorem 2 is only valid for the Enhanced PIDAO. In the convex setting, \(f\in {{{{\mathcal{F}}}}}_{L}^{2}({{\mathbb{R}}}^{n})\), the convergence analysis of the PIDAO (3) cannot be carried out in this framework, i.e., ∇ f(X(t)) → 0 and \(\int_{{t}_{0}}^{t}\left(\nabla f(X(s))\right){{{\rm{d}}}}s \to 0\) as t → + ∞ cannot be proved together. Fortunately, PIDAO can be approximated by the Enhanced PIDAO as c approaches zero (see Propositions 1 and 2 in SI Section C). According to this approximation and Theorem 2, one can use the solution of the Enhanced PIDAO to depict the solution of the PIDAO for convex optimization as precisely as possible (see details in SI Section D.2). In Fig. 2b, we show that the difference between PIDAO and Enhanced PIDAO with the same \(({k}_{p},{k}_{i},{k}_{d},a)\in {{{{\mathcal{S}}}}}_{{{{\rm{EPID}}}}{-}{{{\rm{C}}}}}\) is governed by the small parameter c. As c decreases, the evolution of PIDAO closely aligns with that of the Enhanced PIDAO.
In SI Section E, we further investigate a time-varying optimizer for convex optimization with the objective function \(f\in {{{{\mathcal{F}}}}}_{L}^{2}({{\mathbb{R}}}^{n})\), which provides a unified framework for acceleration and yields an exponential convergence rate similar to that of ref. 32, or a faster convergence rate \(O\left(\frac{1}{{t}^{p}}\right),p \, > \, 2\), compared with ref. 38, by regulating the prescribed function g(t) defined in SI Section E.
Local convergence for nonconvex optimization
Now we consider the objective f to be twice-differentiable and bounded in \({{\mathbb{R}}}^{n}\) (\(f\in {{{{\mathcal{C}}}}}^{2}({{\mathbb{R}}}^{n})\)). Typically, such an objective function is common in neural network training. Additionally, we also consider a class of twice-differentiable functions that satisfy the PL condition (\(f\in {{{{\mathcal{P}}}}}_{\mu }^{2}({{\mathbb{R}}}^{n})\)) in this section, which says that for μ > 0, the following holds for the function \(f\in {{{{\mathcal{P}}}}}_{\mu }^{2}({{\mathbb{R}}}^{n})\): \(\frac{1}{2}\parallel \nabla f(X){\parallel }^{2}\ge \mu (f(X)-{f}^{\star }).\) We note that each stationary point \(\hat{X}\) of \(f\in {{{{\mathcal{P}}}}}_{\mu }^{2}({{\mathbb{R}}}^{n})\) is a global minimum62, since \(0=\frac{1}{2}\parallel \nabla f(\hat{X}){\parallel }^{2}\ge \mu (f(\hat{X})-{f}^{\star })\ge 0\) implies \(f(\hat{X})={f}^{\star }\) and \(\hat{X}\in {{\mbox{argmin}}}_{{{\mathbb{R}}}^{n}}f\). An example of a non-convex function satisfying the PL condition is \(f(X)=\mathop{\sum }_{k=1}^{n}\left({X}_{k}^{2}+2{\sin }^{2}({X}_{k})\right)\). The PL condition has also received wide attention in the machine learning community63,64,65.
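The PL inequality for this example can be probed numerically. The following minimal sketch (an illustration we add here, not one of the paper’s experiments) estimates the largest admissible μ for the one-dimensional term g(x) = x² + 2 sin²(x) on a grid; the resulting positive value is empirical evidence on that grid, not a proof.

```python
import numpy as np

# PL example from the text: f(X) = sum_k (X_k^2 + 2*sin(X_k)^2), with f* = 0 at X = 0.
# Since the objective is separable, it suffices to probe g(x) = x^2 + 2*sin(x)^2 in 1D.
x = np.linspace(-10, 10, 200001)
x = x[np.abs(x) > 1e-3]                  # avoid 0/0 at the minimizer
g = x**2 + 2 * np.sin(x)**2
dg = 2 * x + 2 * np.sin(2 * x)           # g'(x)

# PL condition: 0.5 * |grad f|^2 >= mu * (f - f*); estimate the largest valid mu.
mu_hat = np.min(0.5 * dg**2 / g)
print(f"empirical PL constant on the grid: mu ~ {mu_hat:.3f}")   # positive => PL holds here
```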
We present a local convergence result in Theorem 3 in Methods: for any \(({k}_{p},{k}_{i},{k}_{d},a)\in {{{{\mathcal{S}}}}}_{{{{\rm{PID}}}}{-}{{{\rm{NC}}}}}\) defined in (8) and \(X({t}_{0})\in {{\mathbb{R}}}^{n}\), we have \(\mathop{\lim }_{t \to + \infty }\parallel \nabla f(X(t))\parallel=0\). We verify the local convergence results on three non-convex losses, as shown in Fig. 2c–e. In all three examples, PIDAO converges to a critical point; moreover, unlike the other optimizers in Fig. 2c, d, PIDAO even converges to the global minimum.
One surprising aspect of Theorem 3 is that whether the function \(f\in {{{{\mathcal{C}}}}}^{2}({{\mathbb{R}}}^{n})\) is convex or non-convex, convergence to critical points ( ∇ f(X(t)) → 0 as t → + ∞) can be guaranteed. In addition, since Theorem 3 gives \(\mathop{\lim }_{t \to+\infty }\dot{X}(t)+\frac{{k}_{i}}{a}\int_{{t}_{0}}^{t}\nabla f(X(s)){{{\rm{d}}}}s=0\), if ∇ f(X(t)) → 0 as t → + ∞ implies that X(t) converges to a finite point X⋆ (namely \(\mathop{\lim }_{t \to+\infty }X(t)={X}^{\star }\) and ∥X⋆∥ < + ∞), and hence \(\mathop{\lim }_{t\to+\infty }\dot{X}(t)=0\), then one can further obtain \(\mathop{\lim }_{t\to+\infty }\int_{{t}_{0}}^{t}\nabla f(X(s)){{{\rm{d}}}}s=0\) for nonconvex optimization.
PIDAO has more potential to escape local minima
As analyzed above, PIDAO exhibits remarkable behavior in both convex and general settings: either directly or indirectly, it achieves convergence of the integral \(\int_{{t}_{0}}^{t}\nabla f(X){{{\rm{d}}}}s\to 0\) as t approaches infinity. Based on this deduction, we conjecture that PIDAO possesses an effective capacity to explore the loss landscape f(X) along the solution trajectory of this optimizer. Qualitatively, for a sufficiently large t with \(t-{t}_{0}\in {{\mathbb{N}}}_{+}\) and a tagged partition of the closed interval [t0, t] given by t0 ≤ t1 ≤ t2 ≤ ⋯ ≤ tn−1 ≤ tn = t with uniform spacing δ = tk − tk−1, the integral can be approximated by a Riemann sum of ∇ f(X(t)) with respect to this tagged partition:
$$\int_{{t}_{0}}^{t}\nabla f(X(s))\,{{{\rm{d}}}}s\approx \mathop{\sum }_{k=0}^{n-1}\nabla f(X({t}_{k}))\,\delta .$$
Therefore, in the physical sense, the trajectory X(t) of PIDAO needs to cover as much of the loss landscape f(X) as possible so that the subsequent accumulation \({\sum }_{k=m}^{n-1}\nabla f(X({t}_{k}))\) compensates the previous accumulation \({\sum }_{k=0}^{m-1}\nabla f(X({t}_{k}))\), eventually driving the total accumulation \(\mathop{\sum }_{k=0}^{n-1}\nabla f(X({t}_{k}))\) to zero. This phenomenon is referred to as the offset effect, meaning that PIDAO does not immediately stop searching even in the neighborhood of a local minimum. Instead, it continues to move forward, exploring the loss landscape further. Figure 2f illustrates this offset effect with different values of ki, where PIDAO exhibits a more aggressive capacity to explore the loss landscape as ki increases. Hence, if ki is appropriately selected, PIDAO may escape local minima or saddle points and converge to a better minimum (even the global minimum).
This property is further verified by two more examples (Fig. 2c, d), where the objective functions are the Six-hump camel back function and the Rastrigin function, respectively. Compared with the other optimizers, PIDAO demonstrates its superiority: it can escape local optima and converge to the global optimum. As shown in the subfigures of Fig. 2c, d, the loss error decreases continuously with respect to time. Regarding the Rosenbrock function (see Fig. 2e), due to this offset effect, PIDAO keeps moving instead of stopping even when it approaches the unique minimum (see the subfigure at the bottom), which temporarily reduces PIDAO’s convergence rate. This example indicates that if the objective f(X) has a single global minimum, the offset effect caused by ki, i.e., by the integral term, may lead to unnecessary oscillation and decrease the convergence rate. Therefore, PDAO, namely the PIDAO without the integral term (ki = 0 in Eq. (3)), is more suitable for an f(X) that has only one minimum (e.g., the Rosenbrock function).
The aforementioned results are summarized in Theorem 4 in Methods. For a strongly convex function \(f(X)\in {{{{\mathcal{S}}}}}_{\mu,L}^{2}({{\mathbb{R}}}^{n})\), PDAO may converge to the minimum faster because there is no additional offset effect caused by ki. We show that for any \(({k}_{p},{k}_{d},a)\in {{{{\mathcal{S}}}}}_{{{{\rm{PD}}}}{-}{{{\rm{SC}}}}}\) and X(t0), the solution X(t) of PDAO satisfies the global convergence result f(X(t)) − f⋆ ≤ O(e−2mt), where \(m=\min \{a,\mu {k}_{d}\}\). The high-resolution NAG dynamics proposed by Shi et al.29 is a special case of the PDAO optimizer with \({k}_{p}=1+\sqrt{\mu s},{k}_{d}=\sqrt{s},a=2\sqrt{\mu }\) and s > 0. However, our result provides a faster convergence rate \(f(X)-\mathop{\min }_{X\in {{\mathbb{R}}}^{n}}f\le O({e}^{-\sqrt{2\mu }t})\), rather than \(O({e}^{-\sqrt{\mu }t/4})\), via \(s=1/(2\mu )\). In Fig. 2a, we show that the solution of PDAO converges to the minimum more rapidly than that of PIDAO, which can be attributed to the absence of the offset effect. Therefore, if the objective function is strongly convex, PDAO is superior to PIDAO, which may oscillate around the minimum during convergence.
Furthermore, it is shown that PIDAO can escape saddle points at a faster rate than the classical optimizers. In particular, we use a saddle point problem here to discuss PIDAO’s ability to escape saddle points. Consider a quadratic objective \(f({X}_{1},{X}_{2})=\frac{1}{2}({X}_{1}^{2}-\epsilon {X}_{2}^{2})\) with a positive constant ϵ ≪ 1. We define the escaping time \(\tau=\inf \left\{\tau \, > \, 0\left\vert f({X}_{1}(\tau ),{X}_{2}(\tau )) < -\delta \right.\right\}\) for optimizers, where δ > 0 and X(τ) is the solution of the optimizer. For different optimizers that have the same X(t0), a smaller escaping time away from the saddle point (0, 0) means a faster escaping rate. We show that the escaping times τ for PIDAO, momentum, and gradient flow are \(O(\frac{1}{{\epsilon }^{\frac{1}{3}}}\ln \frac{\delta }{\epsilon })\), \(O(\frac{1}{{\epsilon }^{\frac{1}{2}}}\ln \frac{\delta }{\epsilon })\) and \(O\left(\frac{1}{\epsilon }\ln \frac{\delta }{\epsilon }\right)\), respectively, which means that PIDAO can help to escape rather flat saddles and accelerate the optimization process. As shown in Fig. 3a, as ϵ gets smaller (i.e., the landscape around the saddle point becomes flatter), the escaping times of all three optimizers increase, but that of PIDAO remains the smallest. In other words, PIDAO can escape the saddle point with higher probability and in less time. More details are provided in SI Section B.2.
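The qualitative ordering of escaping times can be reproduced with a simple forward-Euler simulation of the three continuous-time dynamics. In the sketch below, the gains and damping coefficients are illustrative per-optimizer choices of ours rather than the asymptotically scaled hyperparameters analyzed in SI Section B.2, so only the ordering (not the exact rates) should be read from the output.

```python
import numpy as np

# Saddle objective from the text: f(X1, X2) = 0.5*(X1^2 - eps*X2^2).
eps, delta = 1e-3, 0.1
f = lambda x: 0.5 * (x[0] ** 2 - eps * x[1] ** 2)
grad = lambda x: np.array([x[0], -eps * x[1]])
H = np.diag([1.0, -eps])

def escape_time(step, state, dt=1e-2, t_max=5e3):
    t = 0.0
    while t < t_max:
        state = step(state, dt)
        t += dt
        if f(state[0]) < -delta:       # escaped the saddle region
            return t
    return np.inf

x0 = np.array([0.5, 1.0])

# Gradient flow: dx/dt = -grad f(x); state = (x,).
gf = lambda s, dt: (s[0] - dt * grad(s[0]),)

# Momentum ODE (1) with lightly tuned damping a_m (illustrative); state = (x, v).
a_m, b = 0.05, 1.0
mom = lambda s, dt: (s[0] + dt * s[1], s[1] + dt * (-a_m * s[1] - b * grad(s[0])))

# PIDAO, Eq. (3), with illustrative gains; z is the running gradient integral.
a_p, kp, ki, kd = 1.0, 5.0, 3.0, 0.2
def pid(s, dt):
    x, v, z = s
    dv = -a_p * v - kp * grad(x) - kd * H @ v - ki * z
    return (x + dt * v, v + dt * dv, z + dt * grad(x))

print("gradient flow:", escape_time(gf, (x0,)))
print("momentum:     ", escape_time(mom, (x0, np.zeros(2))))
print("PIDAO:        ", escape_time(pid, (x0, np.zeros(2), np.zeros(2))))
```

With these settings, PIDAO escapes first, followed by momentum, with gradient flow slowest, mirroring Fig. 3a.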
a Escaping time of different optimizers with the same X(0) = [0.5, 1] for a quadratic function \(f({X}_{1},{X}_{2})=\frac{1}{2}({X}_{1}^{2}-\epsilon {X}_{2}^{2})\). b–d Evolution trajectories and loss errors of the PIDAO with different hyperparameters (kp, ki, kd) and X(0) = [ − 6, 5.5] for the Rastrigin loss. b Select ki = kd = 0 but increasing kp. c Select kp = 10 here and kd = 0 but increasing ki. d Select kp = 10 and ki = 14 but increasing kd. The red star in the above figures denotes the initial point X(0).
Connection between PIDAO and PID controller
We note that there is a clear connection between the PID controller and the PIDAO, which is expected to provide interpretability for PIDAO. In a closed-loop system controlled by a PID controller, the proportional term provides a fast response that allows the state to approach the target quickly. However, there may be a steady-state error, i.e., a residual difference between the desired setpoint and the actual state of the system. The integral term is utilized to reduce the steady-state error. The derivative term reduces overshoot and oscillation of the state54,59.
In the context of optimization, we can draw a parallel by considering the steady-state error as the distance between the state X(t) and the global minimum X⋆. (1) kp provides a fast convergence rate for the error e(t) = ∇ f(X) to converge to zero, which may leave a steady-state error, since a point with vanishing gradient may only be a local minimum. (2) ki reduces the steady-state error: by employing a suitable integral gain ki in the PIDAO, the exploration capacity of the optimizer can effectively reduce this steady-state error by escaping local minima or saddle points, as discussed earlier. (3) kd reduces overshoot and oscillation of the optimization process.
Figure 3b–d shows numerical verification of the above interpretability for (kp, ki, kd) in the PIDAO. The Rastrigin loss has local minima \(\Omega=\{X| {X}_{1}=-5\sin ({X}_{1}),{X}_{2}=-5\sin ({X}_{2})\}\) and \(\mathop{\min }_{X\in {{\mathbb{R}}}^{2}}f({X}_{1},{X}_{2})=f(0,0)=0\). In Fig. 3b, as kp increases, the convergence speed of PIDAO to Ω = {X∣ ∇ f(X) = 0} becomes faster. However, PIDAO finally converges to a local minimum. To overcome this issue, by increasing ki, PIDAO converges to a global minimum, reducing the steady-state error (Fig. 3c). Moreover, as shown in Fig. 3d, the overshoot and oscillation of the optimization process are eliminated by introducing an appropriate parameter kd.
PIDAO improves accuracy and efficiency in deep learning
In the aforementioned low-dimensional cases, PIDAO is implemented by directly solving the ODE Eq. (3) with respect to the optimization problems. However, for high-dimensional optimization \(\mathop{\min }_{X\in {{\mathbb{R}}}^{n}}f(X)\), it is unrealistic to implement the proposed PIDAO method by solving the ODE within a given error tolerance due to the high computational cost.
To this end, we apply the semi-implicit Euler and symplectic integration methods66 to develop two iterative forms that can be easily integrated with existing deep learning codes, namely PIDAO (SI) and PIDAO (ST); the PDAO scheme is contained in PIDAO (SI) with ki = 0. Moreover, an adaptive algorithm, PIDAO (AdSI), is also proposed by using an adaptive learning rate (lr) in PIDAO (SI), inspired by the implementation of RMSprop. These three main algorithms and the corresponding discretization methods are reported in Table 1 and Methods. The hyperparameter h in these algorithms is the learning rate (lr). We remark that the gradient term in the algorithms can be replaced by the stochastic gradient when using the mini-batch setting. We refer to these algorithms as the PIDAO-family algorithms, which provide computationally efficient ways to implement the PIDAO for high-dimensional optimization. In this work, we pay more attention to the implementation of the PIDAO-family algorithms. Although properties such as convergence cannot be proven theoretically for the discrete schemes, they are illustrated by the empirical experiments below. Moreover, suggestions for selecting the hyperparameters (kp, ki, kd, a) are provided in Methods.
Example 1: low-dimensional optimization setting
We first compare our method on low-dimensional optimization objectives f(x) against the Momentum, PIDopt, Adam, and adaptive heavy-ball (AdaHB)67 algorithms in the discrete-time domain. To ensure a fair comparison, we set the same lr and momentum coefficient (if any) in each algorithm. We perform this comparison on two benchmark objectives, quadratic and Rosenbrock, and provide a detailed description of the experimental settings in Methods. The results are summarized in Fig. 4a–b. We find that our method provides comparable or even better results than classical algorithms in each case. On the quadratic objective (Fig. 4a), the PDAO, PIDopt, and Momentum algorithms converge to the minimum faster than PIDAO (SI) and PIDAO (ST) due to the offset effect of the latter two. Notably, the PDAO, PIDopt, and Momentum algorithms exhibit almost identical convergence rates. Regarding the comparison of adaptive algorithms, although PIDAO (AdSI) has a trajectory similar to those of Adam and AdaHB, it achieves a faster convergence rate. A similar result can be observed in the Rosenbrock case (Fig. 4b), where all PIDAO-family algorithms converge to smaller loss values, indicating that PIDAO can explore the loss landscape more efficiently.
Convergence trajectories and loss errors of different algorithms on two objective functions, namely the ill-conditioned quadratic loss \(f(X)=0.05{X}_{1}^{2}+5{X}_{2}^{2}\) (a) and the Rosenbrock function (b). The red star in (a) and (b) denotes the initial point X(0). Performance of different algorithms on classification tasks for MNIST (c), Fashion MNIST (d), and CIFAR-10 (e), respectively. Four evaluation indexes are demonstrated, including train loss, test loss, train accuracy, and test accuracy. The shadows near each curve represent the 90% confidence interval for each data point from 8 experiments.
Example 2: deep learning-based classification
Here, we investigate the performance of PIDAO algorithms on high-dimensional optimization problems using deep learning classifiers as examples. We provide quantitative results for three testing datasets, where each dataset employs a different neural network (NNθ) to approximate the classifier. Specifically, we utilize fully connected NN (FNN), convolutional NN (CNN), and residual NN (ResNet) architectures for classifying the MNIST (Fig. 4c), FashionMNIST (Fig. 4d), and CIFAR-10 (Fig. 4e) datasets, respectively. For each case, we compare the performance of our algorithms with other algorithms. The evaluation metrics include train loss, test loss, train accuracy, and test accuracy. Further details about the experimental setup can be found in the “Methods” section.
Given the enhanced exploration capability of PIDAO-family algorithms in the loss landscape, we expect that our algorithms can converge to a better minimum with a faster convergence rate. As depicted in Fig. 4c–e, we observe that PIDAO (SI) and PIDAO (ST), with fixed lr, consistently outperform the other algorithms in terms of accuracy, loss, and convergence speed in each case. Furthermore, we can conclude that the minimum to which PIDAO (SI) and PIDAO (ST) ultimately converge is flatter than that of other algorithms, leading to better model generalization17. On the other hand, regarding the comparison of adaptive algorithms, the results indicate that PIDAO (AdSI) achieves faster convergence than Adam and AdaHB without overfitting, which is confirmed by the higher accuracy in testing. In addition to the aforementioned three cases, we also provide an example where PIDAO is applied to train the PointNet for 3D point cloud classification, which is shown in SI Section H.3, indicating the superior performance of PIDAO as well.
As mentioned above, a key differentiating feature of our algorithms is that the PIDAO-family algorithms (both fixed-lr and adaptive-lr) converge to flatter minima θ⋆ that lead to better model generalization. To demonstrate this finding clearly, we plot the loss landscapes of the trained models, taking the MNIST-FNN result as an example. We project the training and test loss landscapes of the FNN around the parameter θ⋆ onto one- and two-dimensional parameter subspaces16; see Fig. 5a, b, respectively. More descriptions of the projection are contained in Methods. As shown in Fig. 5a, for the models trained by PIDAO (SI) and PIDAO (ST), the difference between the corresponding train loss function and test loss function is smaller than that produced by the Momentum and PIDopt algorithms. The PIDAO algorithms bring training and testing closer, leading to higher test accuracies (both above 98.5% on average). Compared with the other adaptive algorithms, similar conclusions hold for PIDAO (AdSI). Furthermore, in the train loss shown in Fig. 4c, PIDAO (SI) and PIDAO (ST) present a “loss descent-ascent” phenomenon but end up at a minimum θ⋆ with a train loss similar to that of the Momentum algorithm. We assert that this phenomenon is caused by the effective exploration capacity of PIDAO. Specifically, in Fig. 5b, we project the learning trajectory and train loss landscape of each algorithm onto the two-dimensional parameter subspace. During training, both PIDAO (SI) and PIDAO (ST) pass through basins with lower losses (as shown by the red stars at the 30th epoch). Instead of terminating the training there, they both move on and eventually stop at flatter minima (as shown by the blue squares), although the losses are slightly higher than before. This phenomenon is not observed for PIDAO (AdSI) in Fig. 5b, which may be because the adaptive parameterization changes the continuous-time dynamics of the adaptive PIDAO algorithm. However, compared with Adam and AdaHB, PIDAO (AdSI) still has a better capacity to learn, which may be partly inherited from the dynamics of PIDAO.
a Results of one-dimensional projection around the minimum θ⋆ with respect to the train loss function, the test loss function, the train accuracy, and the test accuracy. The projection function is formulated as f(α; θ⋆) = f(θ⋆ + αθpd), where f and θpd are the metric function and the projection direction in the parameter space, respectively. b Two-dimensional projection around θ⋆ with respect to the train loss function, and the training trajectory. Here, the formula of each projection function is given by \(f({\alpha }_{1},{\alpha }_{2};{\theta }^{\star })=f({\theta }^{\star }+{\alpha }_{1}{\theta }_{p{d}_{1}}+{\alpha }_{2}{\theta }_{p{d}_{2}})\) with two projection directions \({\theta }_{p{d}_{1}}\) and \({\theta }_{p{d}_{2}}\). The dotted line, red star, and blue square represent the projected training trajectory, the projected point at the 30th epoch, and the ended point, respectively. The shadows near each curve represent the 90% confidence interval for each data point from 8 experiments.
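As a rough illustration of how such a projection can be computed, the sketch below evaluates f(α; θ⋆) for a PyTorch model along one random direction whose per-parameter norm is matched to the weights; this is only a simplified stand-in for the filter-normalized directions of ref. 16 used in Fig. 5, and the function and argument names are our own.

```python
import torch

@torch.no_grad()
def loss_1d_projection(model, loss_fn, data_loader, alphas):
    """Evaluate f(alpha; theta*) = f(theta* + alpha * theta_pd) along one
    random direction theta_pd in parameter space (cf. Fig. 5a)."""
    model.eval()                                        # freeze dropout/batch norm
    theta_star = [p.detach().clone() for p in model.parameters()]
    # Random direction, rescaled per parameter to match the weight norms
    # (a simplified stand-in for the filter normalization of ref. 16).
    direction = [torch.randn_like(p) for p in theta_star]
    direction = [d * (p.norm() / (d.norm() + 1e-12))
                 for d, p in zip(direction, theta_star)]

    losses = []
    for alpha in alphas:
        for p, t, d in zip(model.parameters(), theta_star, direction):
            p.copy_(t + alpha * d)                      # move to theta* + alpha*theta_pd
        total, n = 0.0, 0
        for xb, yb in data_loader:                      # mean loss over the dataset
            total += loss_fn(model(xb), yb).item() * xb.shape[0]
            n += xb.shape[0]
        losses.append(total / n)
    for p, t in zip(model.parameters(), theta_star):    # restore theta*
        p.copy_(t)
    return losses
```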
Example 3: deep learning-based PDE solver
To investigate the capacity of PIDAO-family algorithms on more complex cases such as scientific machine learning, we apply the PIDAO (AdSI) to train NNs for learning partial differential equations (PDEs). The physics-informed NNs (PINNs)68 and Fourier neural operator (FNO)69 are investigated. Further details about the experiment are presented in “Methods”.
We first consider solving a benchmark fluid dynamics problem via PINNs, namely the steady-state flow in a 2D lid-driven cavity, which is governed by the incompressible Navier-Stokes (NS) equations. The results are summarized in Fig. 6a, where we can observe that the velocity field obtained by the PIDAO (AdSI) is in good agreement with the reference. Moreover, compared to Adam and other optimizers (Fig. 6a), PIDAO (AdSI) presents a higher accuracy and a faster convergence rate. Another example of PINNs is solving the 1D Burgers’ equation. As shown in Fig. 6b, we find that the exact and the predicted solutions at different time instants t = 0.3, 0.6 are consistent. In addition, the result obtained by PIDAO (AdSI) is more accurate than those of other algorithms, as depicted by the plot of test errors against training epoch in Fig. 6b.
a Results of 2D lid-driven cavity flow based on physics-informed neural networks (PINNs). The reference flow, predicted flow obtained by PIDAO (AdSI), and the residual flow are plotted. The testing losses against the training epoch based on four optimizers are compared. b Results of 1D Burgers' equation solved by PINNs, including the prediction u(t, x) obtained by PIDAO (AdSI) along with the initial and boundary training data, the comparison between the predicted solution and the exact solution at t = 0.3, 0.6, and the comparison of test loss between different optimizers. Comparison of the training and testing losses among different optimizers for solving 1D Burgers' equation (c) and 2D Darcy equation (d) based on Fourier neural operator network. The shadows near each curve represent the 90% confidence interval for each data point from 8 experiments.
We also investigate the capacity of the proposed algorithm, PIDAO (AdSI), to solve PDEs based on FNO. Here, the 1D Burgers’ equation and the 2D Darcy flow equation are considered, the results of which are illustrated in Fig. 6c, d, respectively. From the results, we observe that the models for the two cases trained by PIDAO (AdSI) both converge to lower L2 losses in testing, which means that PIDAO (AdSI) is more applicable and produces a better performance. The discussion here also provides evidence for the assertion that the PIDAO-family algorithms have more capacity to explore the loss landscape than other algorithms.
All results of the above deep learning examples are summarized in Table 2.
Computational cost of PIDAO
We test the computational cost of the PIDAO-family algorithms and the comparison algorithms in two ways: the running time per epoch and the running time needed to reach a certain train loss (e.g., 1e − 3), shown in Fig. 7a, b, respectively. As shown in Fig. 7a, in each task, the PIDAO-family algorithms use slightly more runtime per epoch than the other algorithms. The reason is that the PIDAO-family algorithms must store an integral variable zk during the iteration, which consumes a certain amount of running time. However, the PIDAO-family algorithms require much less runtime than the other algorithms to achieve a given error, due to the acceleration of PIDAO. Specifically, in the comparison of adaptive algorithms on the three tasks, PIDAO (AdSI) requires less runtime; in the comparison of non-adaptive algorithms, PIDAO (SI) and PIDAO (ST) also require less runtime, because the PIDAO-family algorithms have a faster convergence rate, so fewer epochs are needed to reach the specified loss.
Discussion
In this paper, we explore the possibility of applying control theory to deep learning. We establish a connection between closed-loop feedback control and optimization algorithms through the controlled heavy-ball optimizer. This connection not only motivates us to explain the algorithmic behavior and training dynamics through the lens of continuous-time dynamical systems29,33,38,39,50, but also enables us to develop high-performance closed-loop control optimizers for learning via feedback. In this work, by leveraging PID control, we propose an accelerated optimizer, PIDAO. We investigate the convergence properties and the performance of PIDAO for three basic types of unconstrained optimization problems: strongly convex, convex, and bounded non-convex optimization. Most of the characteristics are described with theoretical analysis. We prove that PIDAO achieves exponential convergence in the strongly convex setup and achieves global and local convergence results in the convex and non-convex setups, respectively. Our numerical experiments demonstrate that PIDAO exhibits a more aggressive exploration capacity across the loss landscape. Such a property gives PIDAO the potential to escape suboptimal minima faster and converge towards superior minima, such as global minima or flat minima. Finally, to further improve the applicability of PIDAO, we derive the PIDAO-family algorithms via discretization methods, including two learning-rate-fixed algorithms and one adaptive algorithm, which can be easily employed in deep learning tasks. As demonstrated in numerous examples (see Figs. 4–7), our algorithms possess a more effective exploration competence along the loss landscape with lower computational costs, resulting in improved accuracy and efficiency in deep learning applications (e.g., classification and PDE solvers).
Our proposed framework is inspired by the second-order ODE (1) characterizing the heavy-ball algorithm. However, there are notable distinctions between PIDAO and other existing optimizers, which are summarized in Table 3. The primary differentiation lies in how each optimizer is derived. Most optimizers are typically obtained by establishing an iteration scheme and then deducing the corresponding continuous-time optimizer through a limit analysis. In contrast, PIDAO is first introduced by incorporating a controller into the momentum optimizer from a physical perspective, followed by the implementation of several PIDAO-family algorithms through discretization. In addition, there are also differences in how the optimizers handle the current, past, and tendency information. The momentum optimizer employs the difference between the current state and the previous state for acceleration, whereas the adaptive optimizers (e.g., Adam) construct the first and second moments to process the past and current gradient information and further adaptively update the search direction in the solution space and the learning rate. By contrast, PIDAO captures the current, past, and tendency information from the gradient, corresponding to the proportional, integral, and derivative computations in the PID controller. This distinctive approach makes PIDAO more comprehensive and interpretable in learning and optimization problems. Moreover, the motivation of PIDAO differs from that of the other PID-inspired optimizer, PIDopt59. We also propose PIDAO with theoretical results in the continuous-time framework. In addition, PIDAO is a more general framework, and PIDopt can be considered a typical PD “controller” within PIDAO. We provide more details on the comparison between PIDAO and PIDopt in SI Section I.
It is worth noting again that PIDAO achieves an interesting result, \(\int_{{t}_{0}}^{t}\nabla f(X(s)){{{\rm{d}}}}s\to 0\) as t → + ∞, which we call the offset effect and which is evidence that PIDAO has a stronger exploration capability along the loss landscape. We set up numerous experiments, including low-dimensional optimization problems and high-dimensional learning tasks, to verify the effectiveness of the proposed optimizer. In particular, the results on complex deep-learning tasks indicate that our algorithms can outperform state-of-the-art algorithms (e.g., Adam) in terms of accuracy.
The framework of PIDAO is constructed based on the PID mechanism in this study. We believe there is substantial room for future work in applying other existing but suitable control methods to construct controlled physical optimizers (e.g., heavy ball in Sec. “Introduction”), that can efficiently solve learning problems. Although this article focuses on unconstrained optimization issues, in the future, we could study constrained optimization problems. For example, it is possible to consider physical systems for optimization on manifolds and their control problems, such as the control of Euler-Lagrange systems on Riemannian manifolds. Regarding PIDAO, there are still several aspects that require further investigation. The first one is the optimal convergence rate of PIDAO for strongly convex optimization. Although both the PIDAO family of learning rate-fixed algorithms and adaptive algorithms outperform other algorithms in the examples here, we do not yet know why they improve the accuracy of learning. A possibility would be to explore the relationship between a key feature of the proposed algorithms, namely the offset effect, and the generalization ability of deep learning models, drawing upon well-established aspects39,70. Moreover, although the algorithm PIDAO (AdSI) achieves great performance in the deep learning examples, the continuous-time dynamical system of PIDAO (AdSI) is no longer exactly equivalent to PIDAO due to the incorporation of an adaptive learning rate. Therefore, deducing a continuous-time system for PIDAO (AdSI) would be a promising research direction to gain further insights into this adaptive algorithm. Last but not least, the concept of PIDAO can be extended to continuous-time stochastic differential equations, as mini-batch algorithms are commonly employed in deep learning practice.
Methods
Here, we describe more details about the proposed optimizer PIDAO, and summarize the discretization methods of PIDAO, the hyperparameters, the testing objective functions, and the NN architectures that we used in our numerical experiments. We note that in this paper, the continuous-time optimizers for low-dimensional optimization in Fig. 2 and the discrete-time algorithms for deep learning in Figs. 4–6 are implemented with the solve_ivp method (SciPy) and the PyTorch package, respectively. Our implementations are all publicly available.
Global convergence for strongly convex optimization
We denote one nonempty set of hyperparameters (kp, ki, kd, a):
Theorem 1
We suppose that \(f\in {{{{\mathcal{S}}}}}_{\mu, L}^{2}({{\mathbb{R}}}^{n})\). For any \(({k}_{p},{k}_{i},{k}_{d},a)\in {{{{\mathcal{S}}}}}_{{{{\rm{PID}}}}{-}{{{\rm{SC}}}}}\) defined in (6), there exists a positive constant η such that the solutions of the proposed PIDAO (3), \(X:\left[{t}_{0},+\infty \right) \to {{\mathbb{R}}}^{n}\), satisfy \(\parallel X(t)-{X}^{\star }\parallel+\parallel \int_{{t}_{0}}^{t}\nabla f(X(s)){{{\rm{d}}}}s\parallel \le O({e}^{-\eta t}),\) where η only depends on parameters kp, ki, kd, a, μ.
The proof of Theorem 1 is given in SI Section A.1. We also provide a stochastic analysis of PIDAO in this setting in SI Section A.2. We highlight the following points. (i) One key aspect in the proof of Theorem 1 is the construction of a general Lyapunov function V(X)44, as shown in SI Section A.1. The set \({{{{\mathcal{S}}}}}_{{{{\rm{PID}}}}{-}{{{\rm{SC}}}}}\) ensures that V(X) decreases exponentially along the solution X(t) of (3), which means that different Lyapunov functions V(X) may require different sets \({{{{\mathcal{S}}}}}_{{{{\rm{PID}}}}{-}{{{\rm{SC}}}}}\) for convergence. Hence, the set \({{{{\mathcal{S}}}}}_{{{{\rm{PID}}}}{-}{{{\rm{SC}}}}}\) is conservative. (ii) η is a nonlinear function of kp, ki, kd, a, μ; in other words, the convergence rate is determined in a nonlinear manner by these parameters. The parameters leading to an optimal convergence rate for strongly convex optimization71 are not discussed in this work. However, in Methods, we summarize how to select kp, ki, kd, a in practical applications.
Global convergence for convex optimization
We propose another set for hyperparameters (kp, ki, kd, a, c):
where r > 0 is related to ki, c, and a by the second equality in (7).
Theorem 2
For the solution of the proposed Enhanced PIDAO \(X:\left[{t}_{0},+ \infty \right)\to {{\mathbb{R}}}^{n}\), if \(f\in {{{{\mathcal{F}}}}}_{L}^{2}({{\mathbb{R}}}^{n})\) and \(({k}_{p},{k}_{i},{k}_{d},a,c)\in {{{{\mathcal{S}}}}}_{{{{\rm{EPID}}}}{-}{{{\rm{C}}}}}\) defined in (7), then the following assertions hold:
(1) For all \(X({t}_{0})\in {{\mathbb{R}}}^{n}\), global convergence holds, i.e.,
$$\mathop{\lim }_{t\to+\infty }\int_{{t}_{0}}^{t}\left(\nabla f(X(s))+c\dot{X}\right){{{\rm{d}}}}s=\mathop{\lim }_{t\to+\infty }f(X(t))-\mathop{\min }_{X\in {{\mathbb{R}}}^{n}}f=0.$$
(2) Moreover, if the convex objective f obeys the Polyak-Łojasiewicz condition with μ > 0, i.e., \(f\in {{{{\mathcal{F}}}}}_{L}^{2}({{\mathbb{R}}}^{n})\bigcap {{{{\mathcal{P}}}}}_{\mu }^{2}({{\mathbb{R}}}^{n})\), there exist M, η > 0 such that the exponential convergence rate holds: \(\parallel \int_{{t}_{0}}^{t}(\nabla f(X(s))+c\dot{X}){{{\rm{d}}}}s\parallel+(f(X(t))-{\min }_{X\in {{\mathbb{R}}}^{n}}f)\le M{e}^{-\eta t}\), where M and η only depend on the parameters kp, ki, kd, a, L, μ.
We provide the proof in SI Section D.1.
Local convergence for non-convex optimization
Now we define a set of \(({k}_{p},{k}_{i},{k}_{d},a)\in {{\mathbb{R}}}_{+}^{4}\) for this class of objective,
Theorem 3
We consider the parameters \(({k}_{p},{k}_{i},{k}_{d},a)\in {{{{\mathcal{S}}}}}_{{{{\rm{PID}}}}{-}{{{\rm{NC}}}}}\) defined in (8) and the bounded objective function \(f\in {{{{\mathcal{C}}}}}^{2}({{\mathbb{R}}}^{n})\). Then, the solution trajectory of PIDAO (3) with the respective parameters (kp, ki, kd, a) starting from any initial condition X(t0) satisfies
$$\mathop{\lim }_{t\to+\infty }\parallel \nabla f(X(t))\parallel=0\quad {{\mbox{and}}}\quad \mathop{\lim }_{t\to+\infty }\left(\dot{X}(t)+\frac{{k}_{i}}{a}\int_{{t}_{0}}^{t}\nabla f(X(s)){{{\rm{d}}}}s\right)=0.$$
Moreover, if \(f\in \mathcal{P}_{\mu}^{2}(\mathbb{R}^{n})\), global convergence to the global minimum is obtained, i.e., \(\mathop{\lim}_{t\to+\infty}f(X(t))-f^{\star}=0\) for any \(X(t_{0})\in \mathbb{R}^{n}\).
More details of the proof can be found in SI Section F.
Fast convergence of the PDAO optimizer for strongly convex optimization
Let us define a set for (kp, kd, a) in the PDAO:
Theorem 4
We assume that \(f\in \mathcal{S}_{\mu,L}^{2}(\mathbb{R}^{n})\). For any \((k_{p},k_{d},a)\in \mathcal{S}_{\rm PD{-}SC}\) defined in (9), the solution \(X:[t_{0},+\infty)\to \mathbb{R}^{n}\) of the PDAO, formulated by Eq. (3) with ki = 0, satisfies \(f(X(t))-f^{\star}\le O(e^{-2mt})\), where \(m=\min\{a,\mu k_{d}\}\).
We provide the proof in SI Section G.
Discrete-time algorithms of PIDAO
As shown in Proposition 1 in SI Section C, the following system of first-order ODEs is equivalent to the original ODE (Eq. (3)):
We can then discretize these ODEs to solve the optimization problems. In this paper, three discrete-time PIDAO algorithms are proposed and implemented in PyTorch; they are summarized in Table 1.
PIDAO (SI)
The semi-implicit Euler method is used to obtain PIDAO (SI): on the right-hand side of the above ODE system, the state xk is treated explicitly while yk+1 and zk+1 are treated implicitly to drive the scheme. When ki = 0, this scheme also produces the PDAO scheme.
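To make the scheme concrete, the following is a minimal Python sketch of one PIDAO (SI) update. The first-order form used here is inferred from the substitution \(y_{k+1}=(x_{k+1}-x_{k})/h+k_{d}\nabla f(x_{k})\) and the momentum limit discussed below, not copied from Eq. (3), so treat the exact signs as an assumption.

```python
import numpy as np

def pidao_si_step(x, y, z, grad_f, h, kp, ki, kd, a):
    """One semi-implicit Euler step of PIDAO (SI): explicit x_k on the
    right-hand side, implicit y_{k+1} and z_{k+1} (a sketch under the
    inferred first-order form; see the caveat above)."""
    g = grad_f(x)                                   # gradient at the explicit x_k
    z_next = z + h * g                              # integral ("past") state
    # Implicit in y: (1 + a*h) * y_next = y + h * ((a*kd - kp) * g - ki * z_next)
    y_next = (y + h * ((a * kd - kp) * g - ki * z_next)) / (1.0 + a * h)
    x_next = x + h * (y_next - kd * g)              # position update
    return x_next, y_next, z_next

# With ki = kd = 0 this reduces to the momentum scheme discussed below.
```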
PIDAO (ST)
The algorithm PIDAO (ST) is based on the symplectic method and the trapezoidal rule: the first and second equations of the above ODE system are discretized by the symplectic method, while the third equation uses the trapezoidal rule.
PIDAO (AdSI)
Based on the PIDAO (SI) scheme, we develop an adaptive algorithm, PIDAO (AdSI), by introducing the adaptive learning-rate factor \(\sqrt{\hat{v}_{k}}\) and a moving average into the gradient direction ∇f(x) and zk. Instead of using all past gradients in the update of zk, we use a moving average of past gradients, inspired by RMSprop. In particular, we compute the exponentially weighted moving average of the squared gradient, vk+1 = vk + (1 − ρ1)(∇f(xk+1) ⊙ ∇f(xk+1) − vk), and use the corrected estimate \(\hat{v}_{k}=\frac{v_{k}}{1-\rho_{1}^{2}}\) to update kp, ki, kd.
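As an illustration, the moving-average correction described above can be sketched as follows; the stability constant eps and the use of the resulting factor as a divisor of the gains are RMSprop-style assumptions, not details specified in the text.

```python
import numpy as np

def adsi_scaling(v, g, rho1, eps=1e-8):
    """Update the exponentially weighted moving average of the squared
    gradient and return an RMSprop-style scaling factor (a sketch;
    eps and the divisor form are assumptions)."""
    v = v + (1.0 - rho1) * (g * g - v)     # v_{k+1} = v_k + (1 - rho1)(g ⊙ g - v_k)
    v_hat = v / (1.0 - rho1 ** 2)          # corrected estimate from the text
    scale = 1.0 / (np.sqrt(v_hat) + eps)   # elementwise rescaling of kp, ki, kd
    return v, scale
```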
The pseudo-codes for these three algorithms are provided as Algorithms 1–3 in SI Section H.
Hyperparameters in discretized PIDAO
To further improve the practicality of the algorithms, we explore the choice of hyperparameters in the above PIDAO-family algorithms (Table 1). Recalling PIDAO (SI), we can rewrite the corresponding scheme by substituting \(y_{k+1}=\frac{x_{k+1}-x_{k}}{h}+k_{d}\nabla f(x_{k})\) into its second equation, which generates the following scheme:
By setting kd = ki = 0, PIDAO (SI) reduces to the momentum scheme \(x_{k+1}-x_{k}=\frac{1}{1+ah}(x_{k}-x_{k-1})-\frac{h^{2}k_{p}}{1+ah}\nabla f(x_{k})\). We therefore regard the two terms \(\frac{1}{1+ah}\) and \(\frac{h^{2}k_{p}}{1+ah}\) as the equivalent momentum and the equivalent learning rate, denoted by aeq and lreq respectively, which yields suitable kp, a, h from this viewpoint. For example, taking aeq = 0.9 and lreq = 0.01 gives h = lreq = 0.01, \(a=\frac{1}{h}(\frac{1}{a_{eq}}-1)=\frac{100}{9}\), and \(k_{p}h^{2}a_{eq}=lr_{eq}\Rightarrow k_{p}=\frac{1}{a_{eq}lr_{eq}}=\frac{1000}{9}\). The remaining two hyperparameters, kd and ki, must satisfy \(k_{p} > \frac{k_{i}}{a}+ak_{d}\) so that \((k_{p},k_{i},k_{d},a)\in \mathcal{S}_{\rm PID{-}NC}\) as defined in (8).
Therefore, we summarize the method for selecting hyperparameters: once aeq and lreq are determined, (kp, ki, kd, a) and h follow
$$h=lr_{eq},\quad a=\frac{1}{lr_{eq}}\left(\frac{1}{a_{eq}}-1\right),\quad k_{p}=\frac{1}{a_{eq}\,lr_{eq}},\quad k_{p} > \frac{k_{i}}{a}+ak_{d}.$$
The above parameter-selection method also applies to PIDAO (ST) and PIDAO (AdSI). In practice, we suggest ki ∈ (0, 10) and \(k_{d}\in (0,1]\). Further tuning of ki and kd can rely on Bayesian optimization or on the rich PID tuning methods of classical control theory, such as Ziegler-Nichols tuning54. The suggested defaults are lreq = 10−2, aeq = 0.9 for PIDAO (SI) and PIDAO (ST), and lreq = 10−3, aeq = 0.9 for PIDAO (AdSI).
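The selection rule above can be packaged into a small helper; the default ki and kd below are illustrative picks from the suggested ranges, not values prescribed by the paper.

```python
def pidao_hyperparameters(lr_eq=1e-2, a_eq=0.9, ki=5.0, kd=0.5):
    """Map (lr_eq, a_eq) to PIDAO hyperparameters via the rule in the text
    and check the admissibility condition kp > ki/a + a*kd."""
    h = lr_eq
    a = (1.0 / lr_eq) * (1.0 / a_eq - 1.0)
    kp = 1.0 / (a_eq * lr_eq)
    assert kp > ki / a + a * kd, "outside S_PID-NC: reduce ki or kd"
    return h, a, kp, ki, kd

# Example from the text: lr_eq = 0.01 and a_eq = 0.9 give
# h = 0.01, a = 100/9 ≈ 11.1, and kp = 1000/9 ≈ 111.1.
print(pidao_hyperparameters())
```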
Experiments on low-dimensional optimization via continuous-time setting
The continuous-time optimizers investigated in Fig. 2 are summarized in Table 3. The corresponding ODE for each optimizer is solved using the backward differentiation formula (BDF) method. First, we introduce the hyperparameters of PIDAO used in the strongly convex cases (Fig. 2a, f). In Fig. 2a, b, we use \(k_{p}=\frac{1000}{9},k_{i}=60,k_{d}=2.5,a=\frac{100}{9}\) for PIDAO and Enhanced PIDAO. In Fig. 2f, for a function \(f\in \mathcal{S}_{\mu,L}^{2}(\mathbb{R}^{n})\), f − μ∥X − X⋆∥2/2 is convex. By choosing suitable k, s, ϵ, one can construct the following subset of \(\mathcal{S}_{\rm PID{-}SC}\):
For example, in Fig. 2f, we derive μ = 4 and select k = 9, s = 0.01, ϵ = 10−4, which produces \(k_{d}=0.9,a=4,k_{p}=3.6001,k_{i,n}=\frac{12.9595}{n}\). Moreover, for fairness, we use the same kp, kd, a (where applicable) in the PDAO, NAG, and Momentum optimizers.
Second, for the non-convex cases (Fig. 2c–e), the hyperparameters of PIDAO must obey \(\mathcal{S}_{\rm PID{-}NC}\) in (8). Here, we consider three non-convex objectives, namely (1) the six-hump camel back (SHCB) \(f(X)=(4-2.1X_{1}^{2}+\frac{X_{1}^{4}}{3})X_{1}^{2}+X_{1}X_{2}+(-4+4X_{2}^{2})X_{2}^{2}\), (2) the Rosenbrock (ROS) \(f(X)=20(X_{2}-X_{1}^{2})^{2}+(X_{1}-1)^{2}\), and (3) the Rastrigin (RAS) \(f(X)=20+X_{1}^{2}+X_{2}^{2}-10(\cos(X_{1})+\cos(X_{2}))\). The values of kp, kd, and a are the same in all three cases (10, 0.3, and 3, respectively), while ki differs: 28 for SHCB, 9 for ROS, and 9 for RAS.
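For reproducibility, the three objectives and a BDF integration of the PIDAO dynamics can be set up as below. The right-hand side reuses the inferred first-order form from the PIDAO (SI) sketch above, and the initial condition is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.integrate import solve_ivp

def shcb(X):   # six-hump camel back
    x1, x2 = X
    return (4 - 2.1 * x1**2 + x1**4 / 3) * x1**2 + x1 * x2 + (-4 + 4 * x2**2) * x2**2

def ros(X):    # Rosenbrock
    x1, x2 = X
    return 20 * (x2 - x1**2)**2 + (x1 - 1)**2

def ras(X):    # Rastrigin
    x1, x2 = X
    return 20 + x1**2 + x2**2 - 10 * (np.cos(x1) + np.cos(x2))

def grad(f, X, eps=1e-6):
    """Central-difference gradient (autograd would also work)."""
    g = np.zeros_like(X)
    for i in range(len(X)):
        e = np.zeros_like(X)
        e[i] = eps
        g[i] = (f(X + e) - f(X - e)) / (2 * eps)
    return g

def pidao_rhs(t, s, f, kp, ki, kd, a, n=2):
    """Inferred first-order PIDAO field (same caveat as the SI sketch)."""
    x, y, z = s[:n], s[n:2 * n], s[2 * n:]
    g = grad(f, x)
    return np.concatenate([y - kd * g, -a * y + (a * kd - kp) * g - ki * z, g])

# Rosenbrock case with kp = 10, ki = 9, kd = 0.3, a = 3, solved by BDF.
s0 = np.concatenate([np.array([-1.5, 1.5]), np.zeros(4)])
sol = solve_ivp(pidao_rhs, (0.0, 20.0), s0, method="BDF",
                args=(ros, 10.0, 9.0, 0.3, 3.0))
```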
Experiments on low-dimensional optimization via discrete-time setting
We compare the proposed PIDAO-family algorithms with other algorithms in two cases: the ill-conditioned quadratic loss \(f(X)=0.05X_{1}^{2}+5X_{2}^{2}\) and the Rosenbrock objective \(f(X)=20(X_{2}-X_{1}^{2})^{2}+(X_{1}-1)^{2}\). After selecting the equivalent momentum aeq and the equivalent learning rate lreq, h and (kp, ki, kd, a) in the PIDAO-family algorithms follow the hyperparameter design method illustrated above: \(h=lr_{eq},\,a=\frac{1}{lr_{eq}}(\frac{1}{a_{eq}}-1),\,k_{p}=\frac{1}{a_{eq}lr_{eq}},\,k_{p} > \frac{k_{i}}{a}+ak_{d}\). The learning rate and momentum coefficient α (if any) of the Adam and Momentum algorithms are set via h = lreq, α = aeq.
Experiments on deep learning-based classification
We compare the PIDAO-family algorithms with the comparison algorithms (Adam and Momentum) on different datasets, namely MNIST, Fashion MNIST, and CIFAR-10. Specifically, we train a deep FNN, a convolutional network, and the ResNet1872 on MNIST, Fashion MNIST, and CIFAR-10, respectively. Each network is trained with a supervised loss defined by the cross-entropy formula
where xk and yk denote the input and the label of a sample (xk, yk), and NNθ(xk) is the output of the NN for xk. This cross-entropy loss is also used to assess the performance of the network models. More experimental details about the networks and the hyperparameters of the optimization algorithms are presented in SI Section H.
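A minimal PyTorch rendering of this training loss is shown below; it uses the standard cross_entropy, which averages over the mini-batch, consistent with the description (the exact display formula is given in the paper).

```python
import torch.nn.functional as F

def supervised_loss(model, x_batch, y_batch):
    """Mean cross-entropy over a mini-batch: model(x_batch) are the logits
    NN_theta(x_k) and y_batch holds the integer class labels (a sketch)."""
    logits = model(x_batch)
    return F.cross_entropy(logits, y_batch)
```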
Experiments on deep learning-based PDE solver
We mainly consider solving three benchmark PDEs with deep learning methods: the 1D Burgers' equation with Dirichlet boundary conditions, the 2D incompressible Navier-Stokes (NS) equation in the lid-driven square cavity, and the 2D Darcy flow equation on the unit box.
In the context of PINNs68, the solutions of the Burgers' equation and the NS equation are approximated by FNNs, where the loss functions used for training follow ref. 73. In particular, each network is trained with a supervised loss of the general form
where ui is the measurement data for the initial and boundary conditions, \(\mathrm{NN}_{\theta}({\boldsymbol{x}}_{i})\) is the corresponding DNN-approximated solution, Nu is the total number of training data points, and \(\mathcal{R}[\mathrm{NN}_{\theta}({\boldsymbol{x}})]\) constrains the network outputs to satisfy the PDE, weighted by a parameter λ. To quantitatively assess the performance of the network models, we adopt a relative L2 norm as the test metric, defined by
where Nc is the total number of testing data points.
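The relative L2 test metric admits a standard implementation; the discrete form below (over the Nc testing points) is a common convention consistent with the description.

```python
import numpy as np

def relative_l2(u_pred, u_true):
    """Relative L2 error over the Nc testing points:
    ||u_pred - u_true||_2 / ||u_true||_2."""
    return np.linalg.norm(u_pred - u_true) / np.linalg.norm(u_true)
```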
In addition, we utilize the FNO method69 to solve the Burgers' equation and the Darcy flow equation, where the network architectures used to approximate the PDE solutions can be found in ref. 69. In these FNO experiments, we use an L2 loss to train and test the network, formulated by
In SI Section H, we provide more details about the networks and the hyperparameters of the optimization algorithms. Here, we train the deep learning models with mini-batch algorithms, where the mini-batch size is identical for all algorithms in each example. We also note that, given lreq and aeq, the hyperparameters of the PIDAO-family algorithms used in the above experiments (the classification and physical learning problems) follow one common criterion, \(h=lr_{eq},\,a=\frac{1}{lr_{eq}}(\frac{1}{a_{eq}}-1),\,k_{p}=\frac{1}{a_{eq}lr_{eq}},\,k_{p} > \frac{k_{i}}{a}+ak_{d}\), such that the equivalent momentum and equivalent learning rate of the PIDAO-family algorithms are equal to aeq and lreq, respectively. To keep the comparison fair, the learning rate and the momentum coefficient of the other algorithms (if any) are also set to lreq and aeq, respectively.
Subspace projection of loss landscape
In the MNIST-FNN case, we project the training and test loss landscapes (or surfaces) around the final trained parameters θ⋆ (or θn) onto low-dimensional parameter subspaces using the visualization method16. More precisely, after obtaining the one-dimensional projections of the training and test loss landscapes, \(f(\alpha;\theta^{\star})=f(\theta^{\star}+\alpha \theta_{pd}),\ \alpha \in [-1,1]\), with a specific projection direction θpd, we calculate the training and test accuracy along the projected landscapes. Additionally, we denote by θk the model parameter vector at epoch k during training. The training trajectory of each algorithm, \([\theta_{1},\theta_{2},\cdots,\theta^{\star}]\), is also projected onto a two-dimensional subspace \(f(\alpha_{1},\alpha_{2};\theta^{\star})=f(\theta^{\star}+\alpha_{1}\theta_{pd_{1}}+\alpha_{2}\theta_{pd_{2}})\), which is the projection of the training loss landscape. Here, after applying principal component analysis to the matrix \(M=[\theta_{1}-\theta^{\star},\theta_{2}-\theta^{\star},\cdots,\theta_{n-1}-\theta^{\star}]\) and obtaining the first two principal components, these two components serve as the projection directions \(\theta_{pd_{1}}\) and \(\theta_{pd_{2}}\). The first principal component also serves as the projection direction θpd used in the one-dimensional projection above.
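As an illustration, the two projection directions can be obtained with an off-the-shelf PCA; the flattening of parameter vectors and the use of scikit-learn here are implementation choices, not prescribed by the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

def projection_directions(thetas, theta_star):
    """First two principal components of M = [theta_k - theta_star],
    used as the 2D projection directions theta_pd1 and theta_pd2
    (theta_pd1 also serves as the 1D projection direction)."""
    M = np.stack([th - theta_star for th in thetas])  # one row per epoch
    pca = PCA(n_components=2).fit(M)
    return pca.components_[0], pca.components_[1]
```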
Data availability
All benchmark datasets used in this paper are publicly available, including MNIST, FashionMNIST, and CIFAR-10, which can be accessed at https://git-disl.github.io/GTDLBench/datasets/, and the PDE dataset for FNO, which is available at https://drive.google.com/drive/folders/1UnbQh2WWc6knEHbLn-ZaXrKUZhp7pjt-.
Code availability
The source code74 is available at https://github.com/NoulliCHEN/PIDAO.
References
O’Connell, M. et al. Neural-fly enables rapid learning for agile flight in strong winds. Sci. Robot. 7, eabm6597 (2022).
Ichnowski, J., Avigal, Y., Satish, V. & Goldberg, K. Deep learning can accelerate grasp-optimized motion planning. Sci. Robot. 5, eabd7710 (2020).
Loquercio, A. et al. Learning high-speed flight in the wild. Sci. Robot. 6, eabg5810 (2021).
Kaufmann, E. et al. Champion-level drone racing using deep reinforcement learning. Nature 620, 982–987 (2023).
Wurman, P. R. et al. Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature 602, 223–228 (2022).
Iwami, R. et al. Controlling chaotic itinerancy in laser dynamics for reinforcement learning. Sci. Adv. 8, eabn8325 (2022).
Böttcher, L., Antulov-Fantulin, N. & Asikis, T. AI Pontryagin or how artificial neural networks learn to control dynamical systems. Nat. Commun. 13, 333 (2022).
Course, K. & Nair, P. B. State estimation of a physical system with unknown governing equations. Nature 622, 261–267 (2023).
Han, J., Jentzen, A. & Weinan, E. Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci. USA 115, 8505–8510 (2018).
Degrave, J. et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 602, 414–419 (2022).
Kates-Harbeck, J., Svyatkovskiy, A. & Tang, W. Predicting disruptive instabilities in controlled fusion plasmas through deep learning. Nature 568, 526–531 (2019).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Lake, B. M. & Baroni, M. Human-like systematic generalization through a meta-learning neural network. Nature 623, 115–121 (2023).
Bi, K. et al. Accurate medium-range global weather forecasting with 3D neural networks. Nature 619, 533–538 (2023).
Liang, J., Xu, C. & Cai, S. Recurrent graph optimal transport for learning 3D flow motion in particle tracking. Nat. Mach. Intell. 5, 505–517 (2023).
Li, H., Xu, Z., Taylor, G., Studer, C. & Goldstein, T. Visualizing the loss landscape of neural nets. In: 32nd International Conference on Advances in Neural Information Processing Systems (NeurIPS, Montreal, Canada, 2018).
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On large-batch training for deep learning: generalization gap and sharp minima. In: 5th International Conference on Learning Representations (ICLR, Palais des Congrès Neptune, Toulon, France, 2017).
Baldassi, C., Pittorino, F. & Zecchina, R. Shaping the learning landscape in neural networks around wide flat minima. Proc. Natl. Acad. Sci. USA 117, 161–170 (2020).
Bottou, L., Curtis, F. E. & Nocedal, J. Optimization methods for large-scale machine learning. SIAM Rev. 60, 223–311 (2018).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations (ICLR, San Diego, CA, USA, 2015).
Polyak, B. T. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4, 1–17 (1964).
Jordan, M. I. Dynamical, symplectic and stochastic perspectives on gradient-based optimization. In Proc. International Congress of Mathematicians: Rio de Janeiro 2018 523–549 (World Scientific, 2018).
Tabuada, P. & Gharesifard, B. Universal approximation power of deep residual neural networks through the lens of control. IEEE Trans. Autom. Control 68, 2715–2728 (2022).
Li, Q. & Hao, S. An optimal control approach to deep learning and applications to discrete-weight neural networks. In Proc. International Conference on Machine Learning 2985–2994 (PMLR, 2018).
Lessard, L., Recht, B. & Packard, A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26, 57–95 (2016).
Hu, B. & Lessard, L. Control interpretations for first-order optimization methods. In Proc. 2017 American Control Conference (ACC) 3114–3119 (IEEE, 2017).
Fazlyab, M., Ribeiro, A., Morari, M. & Preciado, V. M. Analysis of optimization algorithms via integral quadratic constraints: nonstrongly convex problems. SIAM J. Optim. 28, 2654–2689 (2018).
Ruiz-Balet, D. & Zuazua, E. Neural ODE control for classification, approximation and transport. SIAM Rev. 65, 735–773 (2023).
Shi, B., Du, S. S., Jordan, M. I. & Su, W. J. Understanding the acceleration phenomenon via high-resolution differential equations. Math. Program. 195, 79–148 (2021).
Wilson, A. C., Recht, B. & Jordan, M. I. A Lyapunov analysis of accelerated methods in optimization. J. Mach. Learn. Res. 22, 5040–5073 (2021).
Yang, L. et al. The physical systems behind optimization algorithms. In: 32nd International Conference on Advances in Neural Information Processing Systems (NeurIPS, Montreal, Canada, 2018).
Wibisono, A., Wilson, A. C. & Jordan, M. I. A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. USA 113, E7351–E7358 (2016).
Li, Q., Tai, C. & Weinan, E. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proc. International Conference on Machine Learning 2101–2110 (PMLR, 2017).
Zhou, P., Feng, J., Ma, C., Xiong, C. & Hoi, S. C. H. Towards theoretically understanding why SGD generalizes better than Adam in deep learning. Adv. Neural Inf. Process. Syst. 33, 21285–21296 (2020).
Da Silva, A. B. & Gazeau, M. A general system of differential equations to model first-order adaptive algorithms. J. Mach. Learn. Res. 21, 5072–5113 (2020).
Barrett, D. & Dherin, B. Implicit gradient regularization. In: 9th International Conference on Learning Representations (ICLR, Vienna, Austria, 2021).
Compagnoni, E. M. et al. An SDE for modeling SAM: theory and insights. In Proc. International Conference on Machine Learning 25209–25253 (PMLR, 2023).
Su, W., Boyd, S. & Candès, E. J. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17, 1–43 (2016).
Jacot, A., Gabriel, F. & Hongler, C. Neural tangent kernel: convergence and generalization in neural networks. In: 32nd International Conference on Advances in Neural Information Processing Systems (NeurIPS, Montreal, Canada, 2018).
Wang, S., Yu, X. & Perdikaris, P. When and why PINNs fail to train: a neural tangent kernel perspective. J. Comput. Phys. 449, 110768 (2022).
Weinan, E. A proposal on machine learning via dynamical systems. Commun. Math. Stat. 1, 1–11 (2017).
Xue, H., Araujo, A., Hu, B. & Chen, Y. Diffusion-based adversarial sample generation for improved stealthiness and controllability. In: 37th International Conference on Advances in Neural Information Processing Systems (NeurIPS, New Orleans, Louisiana, USA, 2023).
Hauswirth, A., He, Z., Bolognani, S., Hug, G. & Dörfler, F. Optimization algorithms as robust feedback controllers. Annu. Rev. Control 57, 100941 (2024).
Krstic, M., Kokotovic, P. V. & Kanellakopoulos, I. Nonlinear and Adaptive Control Design (John Wiley & Sons, Inc., 1995).
Sontag, E. D. Mathematical Control Theory: Deterministic Finite Dimensional Systems Vol. 6 (Springer Science & Business Media, 2013).
Ross, I. M. An optimal control theory for nonlinear optimization. J. Comput. Appl. Math. 354, 39–51 (2019).
Li, Q., Chen, L., Tai, C. & Weinan, E. Maximum principle based algorithms for deep learning. J. Mach. Learn. Res. 18, 1–29 (2018).
Xu, K., Li, C., Zhu, J. & Zhang, B. Understanding and stabilizing GANs' training dynamics using control theory. In Proc. International Conference on Machine Learning 10566–10575 (PMLR, 2020).
Chen, Z., Li, Q. & Zhang, Z. Self-healing robust neural networks via closed-loop control. J. Mach. Learn. Res. 23, 14329–14382 (2022).
Wang, S., Teng, Y. & Perdikaris, P. Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput. 43, A3055–A3081 (2021).
Kolarijani, A. S., Esfahani, P. M. & Keviczky, T. Continuous-time accelerated methods via a hybrid control lens. IEEE Trans. Autom. Control 65, 3425–3440 (2019).
Vaquero, M. & Cortés, J. Convergence-rate-matching discretization of accelerated optimization flows through opportunistic state-triggered control. In: 33rd International Conference on Advances in Neural Information Processing Systems (NeurIPS, Vancouver, Canada, 2019).
Ghosh, A., Lyu, H., Zhang, X. & Wang, R. Implicit regularization in heavy-ball momentum accelerated stochastic gradient descent. In: 11th International Conference on Learning Representations (ICLR, Kigali, Rwanda, 2023).
Astrom, K. J. PID Controllers: Theory, Design, and Tuning (The Instrumentation, Systems and Automation Society, 1995).
Filo, M., Kumar, S. & Khammash, M. A hierarchy of biomolecular proportional-integral-derivative feedback controllers for robust perfect adaptation and dynamic performance. Nat. Commun. 13, 2119 (2022).
Zhao, C. & Guo, L. Towards a theoretical foundation of PID control for uncertain nonlinear systems. Automatica 142, 110360 (2022).
Ma, R., Zhang, B., Zhou, Y., Li, Z. & Lei, F. PID controller-guided attention neural network learning for fast and effective real photographs denoising. IEEE Trans. Neural Netw. Learn. Syst. 33, 3010–3023 (2021).
Xu, J., Xiong, Z. & Bhattacharyya, S. P. PIDNet: a real-time semantic segmentation network inspired by PID controllers. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 19529–19539 (IEEE, 2023).
Wang, H. et al. PID controller-based stochastic optimization acceleration for deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 31, 5079–5091 (2020).
Attouch, H., Chbani, Z., Peypouquet, J. & Redont, P. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Program. 168, 123–175 (2018).
Attouch, H., Chbani, Z., Fadili, J. & Riahi, H. First-order optimization algorithms via inertial systems with Hessian driven damping. Math. Program. 193, 113–155 (2022).
Karimi, H., Nutini, J. & Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Proc. European Conference on Machine Learning and Knowledge Discovery in Databases, Vol. 9851, 795–811 (ECML, 2016).
Fazel, M., Ge, R., Kakade, S. & Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. In Proc. International Conference on Machine Learning 1467–1476 (PMLR, 2018).
Liu, C., Zhu, L. & Belkin, M. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Appl. Comput. Harmon. Anal. 59, 85–116 (2022).
Li, Y., Ma, T. & Zhang, H. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Proc. Conference On Learning Theory 2–47 (PMLR, 2018).
Hairer, E., Hochbruck, M., Iserles, A. & Lubich, C. Geometric numerical integration. Oberwolfach Rep. 3, 805–882 (2006).
Saab Jr, S., Phoha, S., Zhu, M. & Ray, A. An adaptive Polyak heavy-ball method. Mach. Learn. 111, 3245–3277 (2022).
Karniadakis, G. E. et al. Physics-informed machine learning. Nat. Rev. Phys. 3, 422–440 (2021).
Li, Z. et al. Fourier neural operator for parametric partial differential equations. In: 9th International Conference on Learning Representations (ICLR, Vienna, Austria, 2021).
Wu, L., Ma, C. & Weinan, E. How SGD selects the global minima in over-parameterized learning: a dynamical stability perspective. In: 32nd International Conference on Advances in Neural Information Processing Systems (NeurIPS, Montreal, Canada, 2018).
Nesterov, Y. Introductory Lectures on Convex Optimization: a Basic Course Vol. 87 (Springer Science & Business Media, 2003).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Raissi, M., Perdikaris, P. & Karniadakis, G. E. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019).
Chen, S. et al. Accelerated optimization in deep learning with a proportional-integral-derivative controller. PIDAO, https://doi.org/10.5281/zenodo.13939583 (2024).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In Proc. 7th International Conference on Learning Representations (ICLR, New Orleans, Louisiana, United States, 2019).
Acknowledgements
C.X., S. Cai, and S. Chen would like to acknowledge the support by the National Natural Science Foundation of China (No. 61973270, No. 62301486, and No. 12171431). C.X. gratefully acknowledges the financial support provided by the Zhejiang Province High-Level Talent Special Support Plan for Scientific and Technological Innovation Leading Talents (No. 2023R5217). S. Cai acknowledges the support by the Qizhen Scholar Foundation of Zhejiang University Education Foundation and the State Key Laboratory of Industrial Control Technology (Grant No. ICT2024A05). S. Chen acknowledges the scholarship of the Doctoral Academic Star Program of Zhejiang University.
Author information
Authors and Affiliations
Contributions
S. Chen performed theoretical analysis of the methods and wrote the original manuscript, assisted by J.L. S. Chen, P.W., C.X., and S. Cai implemented the algorithms, performed the experiments, and analyzed the results. S. Chen, J.L., P.W., C.X., S. Cai, and J.C. contributed to discussing the results and writing and reviewing the paper. C.X. and S. Cai supervised the research.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, S., Liu, J., Wang, P. et al. Accelerated optimization in deep learning with a proportional-integral-derivative controller. Nat Commun 15, 10263 (2024). https://doi.org/10.1038/s41467-024-54451-3