A Generalized Iterative LQG Method for Locally-Optimal Feedback Control of Constrained Nonlinear Stochastic Systems
June 8-10, 2005. Portland, OR, USA
Abstract - We present an iterative Linear-Quadratic-Gaussian method for locally-optimal feedback control of nonlinear stochastic systems subject to control constraints. Previously, similar methods have been restricted to deterministic unconstrained problems with quadratic costs. The new method constructs an affine feedback control law, obtained by minimizing a novel quadratic approximation to the optimal cost-to-go function. Global convergence is guaranteed through a Levenberg-Marquardt method; convergence in the vicinity of a local minimum is quadratic. Performance is illustrated on a limited-torque inverted pendulum problem, as well as a complex biomechanical control problem involving a stochastic model of the human arm, with 10 state dimensions and 6 muscle actuators. A Matlab implementation of the new algorithm is available at www.cogsci.ucsd.edu/~todorov.

This work is supported by NIH Grant R01-NS045915.
Emanuel Todorov is with the faculty of the Cognitive Science Department, University of California San Diego.
Weiwei Li is with the Department of Mechanical and Aerospace Engineering, University of California San Diego.

I. INTRODUCTION

Despite an intense interest in optimal control theory over the last 50 years, solving complex optimal control problems - that do not fit in the well-developed Linear-Quadratic-Gaussian (LQG) formalism - remains a challenge [17]. Most existing numerical methods fall in one of two categories. Global methods based on the Hamilton-Jacobi-Bellman (HJB) equations and the idea of dynamic programming [7], [1] can yield globally-optimal feedback control laws for general stochastic systems. However, such methods involve discretizations of the state and control spaces - which makes them inapplicable to high-dimensional problems, due to the curse of dimensionality. Local methods based on the Maximum Principle [3], [13] avoid the curse of dimensionality, by solving a set of ODEs - via shooting, relaxation, collocation, or gradient descent. But the resulting locally-optimal control laws are open-loop, and stochastic dynamics cannot be taken into account.

An ideal blend of the advantages of local and global methods is provided by Differential Dynamic Programming (DDP) [5]. This method is still local, in the sense that it maintains a representation of a single trajectory and improves it locally. The improvement however does not rely on solving ODEs, but is based on dynamic programming - applied within a "tube" around the current trajectory. DDP is known to have second-order convergence [9], [11], and numerically appears to be more efficient [10] than (efficient implementations of) Newton's method [12]. We recently developed a new method - iterative Linear-Quadratic Regulator design (ILQR) - which is closely related to DDP but turns out to be significantly more efficient: by a factor of 10 on reasonably complex control problems [8]. Our ILQR method uses iterative linearizations of the nonlinear dynamics around the current trajectory, and improves that trajectory via modified Riccati equations. Both DDP and ILQR yield feedback control laws - which is a major advantage compared to open-loop methods. However, these methods are still deterministic. Another shortcoming is that, unlike open-loop methods, DDP and ILQR cannot deal with control constraints and non-quadratic cost functions. The goal of the present paper is to remove these limitations.

While the new algorithm should be applicable to a range of problems, our specific motivation for developing it is the modeling of biological movement. Such modeling has proven extremely useful in the study of how the brain controls movement [15]. Yet, progress has been impeded by the lack of efficient methods that can handle realistic biomechanical control problems. The characteristics of such problems are: high-dimensional nonlinear dynamics; control constraints (e.g. non-negative muscle activations); multiplicative noise, with standard deviation proportional to the control signals [4], [14]; complex performance criteria, that are rarely quadratic in the state variables [6].

Before deriving our new iterative Linear-Quadratic-Gaussian (ILQG) method, we give a more detailed overview of what is new here:

A. Noise

DDP, ILQR, and the new ILQG are dynamic programming methods that use quadratic approximations to the optimal cost-to-go function. All such methods are "blind" to additive noise (see Discussion). However, in many problems of interest the noise is control-dependent, and such noise can easily be captured by quadratic approximations as we show below. Our new ILQG method incorporates control-dependent noise - which turns out to have an effect similar to an energy cost.

B. Constraints

Quadratic approximation methods are presently restricted to unconstrained problems. Generally speaking, constraints make the optimal cost-to-go function non-quadratic. But since we are approximating that function anyway, we might as well take into account the effects of control constraints to the extent possible. Our new ILQG method does that - by
0-7803-9098-9/05/$25.00 ©2005 AACC
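The control-multiplicative noise discussed above (standard deviation proportional to the control signal) can be made concrete with a small simulation sketch. This is our own illustration, not code from the paper: an Euler-Maruyama rollout of dx = f(x,u) dt + sigma*u dw, with a scalar example system chosen purely for demonstration.

```python
import numpy as np

def simulate(f, sigma, u_seq, x0, dt, rng):
    """Euler-Maruyama rollout of dx = f(x,u) dt + sigma*u dw.

    The diffusion term scales with the control signal, so the noise
    standard deviation is proportional to u (control-multiplicative)."""
    x = np.asarray(x0, dtype=float)
    xs = [x.copy()]
    for u in u_seq:
        noise = sigma * u * np.sqrt(dt) * rng.standard_normal(x.shape)
        x = x + dt * f(x, u) + noise
        xs.append(x.copy())
    return np.array(xs)

# Scalar example: a damped system driven by a constant control.
rng = np.random.default_rng(0)
f = lambda x, u: -x + u
traj = simulate(f, sigma=0.2, u_seq=[1.0] * 100, x0=[0.0], dt=0.01, rng=rng)
print(traj.shape)  # (101, 1)
```

Setting sigma to zero recovers the deterministic Euler rollout used to generate the "zero-noise" trajectory in Section III.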
modifying the feedback gain matrix whenever an element of the open-loop control sequence lies on the constraint boundary.

C. Convexity

Quadratic approximation methods are based on Riccati equations, which are derived by setting up a quadratic optimization problem at time step t, solving it analytically, and obtaining a formula for the optimal cost-to-go function at time step t-1. Optimizing a quadratic is only possible when the Hessian is positive-definite. This is of course true in the classic LQG setting, but when LQG methods are used to approximate general nonlinear dynamics with non-quadratic costs, the Hessian can (and in practice does) have zero and even negative eigenvalues. The traditional remedy is to "fix" the Hessian, using a Levenberg-Marquardt method, or an adaptive shift scheme [10], or simply replace it with the identity matrix (which yields the steepest descent method). The problem is that after fixing the Hessian, the optimization at time step t is no longer performed exactly. Our new method takes the fixed Hessian into account, and constructs a cost-to-go approximation consistent with the resulting control law. This is done by modified Riccati-like equations.

II. PROBLEM STATEMENT

Consider the nonlinear dynamical system described by the stochastic differential equation

dx = f(x, u) dt + F(x, u) dω

with state x ∈ R^n, control u ∈ R^m, and standard Brownian motion noise ω ∈ R^p. Let ℓ(t, x, u) ≥ 0 be an instantaneous cost rate, h(x(T)) ≥ 0 a final cost, T a specified final time, and u = π(t, x) a deterministic control law. Define the cost-to-go function v^π(t, x) as the total cost expected to accumulate if the system is initialized in state x at time t, and controlled until time T according to the control law π:

v^π(t, x) ≜ E[ h(x(T)) + ∫_t^T ℓ(τ, x(τ), π(τ, x(τ))) dτ ]

The time axis is discretized into K steps of duration Δt = T/(K-1); thus u_k ≜ u((k-1)Δt), and similarly for all other time-varying quantities.

III. LOCAL LQG APPROXIMATION

The locally-optimal control law is constructed iteratively. Each iteration of the algorithm begins with an open-loop control sequence ū(t) and the corresponding "zero-noise" trajectory x̄(t), obtained by applying ū(t) to the deterministic dynamics ẋ = f(x, u) with x̄(0) = x_0. This can be done by Euler integration

x̄_{k+1} = x̄_k + Δt f(x̄_k, ū_k)    (1)

or by defining a continuous ū(t) via interpolation, applying a continuous-time integrator such as Runge-Kutta, and discretizing the resulting x̄(t).

Next we linearize the system dynamics and quadratize the cost functions around x̄, ū, to obtain a discrete-time linear dynamical system with quadratic cost. Importantly, the linearized dynamics and quadratized cost are expressed not in terms of the actual state and control variables, but in terms of the state and control deviations δx_k ≜ x_k - x̄_k, δu_k ≜ u_k - ū_k. Writing the discrete-time dynamics as x_{k+1} = x_k + Δt f(x_k, u_k), expanding f around x̄_k, ū_k up to first order, and subtracting (1), our original optimal control problem is approximated locally by the following modified LQG problem:

δx_{k+1} = A_k δx_k + B_k δu_k + C_k(δu_k) ξ_k    (2)

C_k(δu_k) ≜ [c_{1,k} + C_{1,k} δu_k, ..., c_{p,k} + C_{p,k} δu_k]

cost_k = q_k + δx_k^T q_k + ½ δx_k^T Q_k δx_k + δu_k^T r_k + ½ δu_k^T R_k δu_k + δu_k^T P_k δx_k

where δx_1 = 0, ξ_k ~ N(0; I^p), and the last time step is K. The quantities that define this modified LQG problem are A_k, B_k, c_{i,k}, C_{i,k}, q_k, q_k, Q_k, r_k, R_k, P_k, K. At each (x̄_k, ū_k), these quantities are obtained from the original problem as:

A_k = I + Δt ∂f/∂x;    B_k = Δt ∂f/∂u    (3)
c_{i,k} = √Δt F^[i];    C_{i,k} = √Δt ∂F^[i]/∂u
q_k = Δt ℓ;    q_k = Δt ∂ℓ/∂x;    Q_k = Δt ∂²ℓ/∂x∂x
r_k = Δt ∂ℓ/∂u;    R_k = Δt ∂²ℓ/∂u∂u;    P_k = Δt ∂²ℓ/∂u∂x

Note that we are using a simplified noise model, where F(x, u) is only linearized with respect to δu. This is
sufficient to capture noise that is multiplicative in the control signal - which is what we are mainly interested in. It is straightforward to include state-dependent noise as well, and repeat the derivation that follows.

Now that we have an LQG approximation to our original optimal control problem, we can proceed with computing an approximately-optimal control law. This is done in the following two sections. We then describe the main iteration of the algorithm, which constructs a convergent sequence of LQG approximations.

IV. COMPUTING THE COST-TO-GO FUNCTION

The approximately-optimal control law for the LQG approximation will be affine, in the form δu = π_k(δx) = l_k + L_k δx. The unusual open-loop component l_k (letter 'el') arises because we are dealing with state and control deviations, and is needed to make the algorithm iterative (see below). The control law is approximately-optimal because we may have control constraints and non-convex costs.

Suppose the control law π has already been designed for time steps k ... K-1. Then the cost-to-go function v_k(δx) is well-defined - as the cost expected to accumulate if system (2) is initialized in state δx at time step k, and controlled according to π for the remaining time steps. We will show by induction (backwards in time) that if the control law is affine, the corresponding cost-to-go function remains in the quadratic form

v_k(δx) = s_k + δx^T s_k + ½ δx^T S_k δx    (4)

for all k. At the last time step this holds, because the final cost is a quadratic that does not depend on the control signal. Now suppose that v is in the above form for time steps k+1 ... K. Using the shortcut π in place of the control signal l_k + L_k δx that our control law generates, the Bellman equation for the cost-to-go function is

v_k(δx) = immediate cost + E[v_{k+1}(next state)]
        = q_k + δx^T (q_k + ½ Q_k δx) + π^T (r_k + ½ R_k π) + π^T P_k δx
          + E[v_{k+1}(A_k δx + B_k π + C_k ξ_k)]

Evaluating the expectation term E[·] above yields

s_{k+1} + (A_k δx + B_k π)^T s_{k+1} + ½ (A_k δx + B_k π)^T S_{k+1} (A_k δx + B_k π)
+ ½ trace( Σ_i (c_{i,k} + C_{i,k} π)(c_{i,k} + C_{i,k} π)^T S_{k+1} )

Collecting terms, the cost-to-go function becomes

v_k(δx) = q_k + s_{k+1} + ½ Σ_i c_{i,k}^T S_{k+1} c_{i,k}    (5)
        + δx^T (q_k + A_k^T s_{k+1}) + ½ δx^T (Q_k + A_k^T S_{k+1} A_k) δx
        + π^T (g + G δx) + ½ π^T H π

The shortcuts g, G, H appearing on the last line of (5) are defined at each time step as

g ≜ r_k + B_k^T s_{k+1} + Σ_i C_{i,k}^T S_{k+1} c_{i,k}    (6)
G ≜ P_k + B_k^T S_{k+1} A_k
H ≜ R_k + B_k^T S_{k+1} B_k + Σ_i C_{i,k}^T S_{k+1} C_{i,k}

At this point one may notice that the expression for v_k(δx) is a quadratic function of π, and set the control signal to the value of π which makes the gradient vanish: δu_k = -H^-1 g - H^-1 G δx. But we will not assume this specific form of the control law here, because H may have negative eigenvalues (in which case the above δu is not a minimum), and also because some control constraints may be violated. Instead we will defer the computation of the control law to the next section. All we assume for now is that the control law computed later will be in the general form δu = l_k + L_k δx. With this assumption we can complete the computation of the cost-to-go function. Replacing π with l_k + L_k δx, and noting that the square matrix S_k is symmetric, the π-dependent expression in the last line of (5) becomes

l_k^T g + ½ l_k^T H l_k + δx^T (G^T l_k + L_k^T g + L_k^T H l_k)
+ ½ δx^T (L_k^T H L_k + L_k^T G + G^T L_k) δx

We now see that the cost-to-go function remains quadratic in δx, which completes the induction proof. Thus we have

Lemma (ILQG). For any affine control law in the form δu(k, δx) = l_k + L_k δx, the cost-to-go for problem (2) is in the form (4) for all k. At the final time step K, the cost-to-go parameters are S_K = Q_K, s_K = q_K, s_K = q_K. For k < K the parameters can be computed recursively as

S_k = Q_k + A_k^T S_{k+1} A_k + L_k^T H L_k + L_k^T G + G^T L_k    (7)
s_k = q_k + A_k^T s_{k+1} + L_k^T H l_k + L_k^T g + G^T l_k
s_k = q_k + s_{k+1} + ½ Σ_i c_{i,k}^T S_{k+1} c_{i,k} + ½ l_k^T H l_k + l_k^T g

where g, G, H are defined in (6). The total cost is s_1.

If we are not concerned with negative eigenvalues of H or violations of control constraints, and set l_k = -H^-1 g, L_k = -H^-1 G as mentioned above, a number of terms cancel and (7) reduces to

S_k = Q_k + A_k^T S_{k+1} A_k - G^T H^-1 G
s_k = q_k + A_k^T s_{k+1} - G^T H^-1 g
s_k = q_k + s_{k+1} + ½ Σ_i c_{i,k}^T S_{k+1} c_{i,k} - ½ g^T H^-1 g
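The recursion of the Lemma can be sketched as a backward pass. This is a minimal illustration under our own naming conventions (not the paper's Matlab code), with a one-step scalar sanity check against the reduced, unconstrained form.

```python
import numpy as np

def backward_pass(A, B, C, c, q0, q, Q, r, R, P, l, L, S_K, s_K, s0_K):
    """Cost-to-go recursion of the ILQG Lemma, eqs. (6)-(7), for a given
    affine law du_k = l_k + L_k dx_k.  Arguments are per-step lists;
    C[k][i] and c[k][i] are the control-dependent noise terms of eq. (2).
    Returns (S_1, s_1, s0_1); the total expected cost is s0_1."""
    S, s, s0 = S_K, s_K, s0_K
    for k in reversed(range(len(A))):
        # Shortcuts of eq. (6).
        g = r[k] + B[k].T @ s + sum(Ci.T @ S @ ci for Ci, ci in zip(C[k], c[k]))
        G = P[k] + B[k].T @ S @ A[k]
        H = R[k] + B[k].T @ S @ B[k] + sum(Ci.T @ S @ Ci for Ci in C[k])
        # Recursion of eq. (7).
        S_new = Q[k] + A[k].T @ S @ A[k] + L[k].T @ H @ L[k] + L[k].T @ G + G.T @ L[k]
        s_new = q[k] + A[k].T @ s + L[k].T @ H @ l[k] + L[k].T @ g + G.T @ l[k]
        s0_new = (q0[k] + s0 + 0.5 * sum(ci @ S @ ci for ci in c[k])
                  + 0.5 * l[k] @ H @ l[k] + l[k] @ g)
        S, s, s0 = S_new, s_new, s0_new
    return S, s, s0

# One-step sanity check (n = m = 1, no noise) against the reduced form.
A = [np.eye(1)]; B = [np.eye(1)]
C = [[np.zeros((1, 1))]]; c = [[np.zeros(1)]]
q0 = [0.0]; q = [np.zeros(1)]; Q = [np.eye(1)]
r = [np.zeros(1)]; R = [np.eye(1)]; P = [np.zeros((1, 1))]
S_K, s_K, s0_K = np.eye(1), np.ones(1), 0.0
g = r[0] + B[0].T @ s_K                 # noise terms vanish here
G = P[0] + B[0].T @ S_K @ A[0]
H = R[0] + B[0].T @ S_K @ B[0]
l = [-np.linalg.solve(H, g)]            # unconstrained choice l = -H^-1 g
L = [-np.linalg.solve(H, G)]            # and L = -H^-1 G
S, s, s0 = backward_pass(A, B, C, c, q0, q, Q, r, R, P, l, L, S_K, s_K, s0_K)
print(S[0, 0], s[0], s0)  # 1.5 0.5 -0.25, matching the reduced recursion
```

With the unconstrained choice of l and L the output indeed equals Q + A^T S A - G^T H^-1 G and its companions, as claimed after eq. (7).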
Thus our method can be reduced to familiar methods in special cases, but has the added flexibility of keeping the quadratic cost-to-go calculation consistent regardless of how the affine feedback control law is computed.

It can be seen from the above equations that in the absence of control-dependent noise (i.e. when C = 0), additive noise has no effect on the control law. However, when C > 0, increasing the additive noise magnitude c results in changes of the control law.

V. COMPUTING THE CONTROL LAW

As we saw in (5), the cost-to-go function v_k(δx) depends on the control δu_k = π_k(δx) through the term

a(δu, δx) = δu^T (g + G δx) + ½ δu^T H δu    (8)

where we have suppressed the time index k. Ideally we would choose the δu that minimizes a for every δx, subject to whatever control constraints are present. However, this is not always possible within the family of affine control laws δu = l + L δx that we are considering. Since the goal of the LQG stage is to approximate the optimal controller for the nonlinear system in the vicinity of x̄, we will give preference to linear control laws that are optimal/feasible for small δx, even if that (unavoidably) makes them suboptimal/infeasible for larger δx. In particular we need to make sure that for δx = 0, the new open-loop control δu = l performs no worse than the current open-loop control δu = 0. Since a(0, δx) = 0, this holds if a(l, 0) = l^T g + ½ l^T H l ≤ 0. The latter is always achievable by setting l = -εg for a small enough ε ≥ 0.

A. First-order and second-order methods

The gradient ∇_δu a(δu, δx) evaluated at δu = 0 is g + G δx. Therefore we can make an improvement along the gradient of a by setting δu = -ε(g + G δx). If we are only interested in open-loop control, we can use δu = -εg. If the new control signal (for the nonlinear system) ū - εg violates the constraints, we have to reduce ε until the constraints are satisfied. Note that ū is by definition a feasible control signal, so unless it lies on the constraint boundary (and g points inside the feasible set) we can find an ε > 0 for which ū - εg is also feasible. This gives a first-order method.

To obtain a second-order method we use the Hessian H. If H is positive semi-definite we can compute the unconstrained optimal control law δu = -H^-1 (g + G δx), and deal with the control constraints as described below. But when H has negative eigenvalues, there exist δu's that make a (and therefore v) arbitrarily negative. Note that the cost-to-go function for the nonlinear problem is always non-negative, but since we are using an approximation to the true cost we may (and in practice do) encounter situations where a does not have a minimum. In that case the gradient ∇_δu a = g + G δx is still correct, and so the true cost-to-go decreases in the direction -H̃^-1 (g + G δx) for any positive definite matrix H̃. Thus we use a matrix H̃ that "resembles" H, but is positive definite. We have experimented with several choices of H̃, and found the best convergence using a method related to Levenberg-Marquardt: (1) compute the eigenvalue decomposition [V, D] = eig(H); (2) replace all negative elements of the diagonal matrix D with 0; (3) add a positive constant λ to the diagonal of D; (4) set H̃ = V D V^T, using the modified D from steps (2) and (3). The constant λ is set and adapted in the main iteration of the algorithm, as described below. Note that H̃^-1 can be efficiently computed as V D^-1 V^T.
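Steps (1)-(4) of the eigenvalue-decomposition fix can be sketched directly; a minimal illustration (the function name and example matrix are ours).

```python
import numpy as np

def fix_hessian(H, lam):
    """Levenberg-Marquardt-style fix of Section V-A: clamp negative
    eigenvalues to 0, add lam to the diagonal of D, and rebuild a
    positive definite H_tilde = V D V^T."""
    d, V = np.linalg.eigh(H)                  # step (1), symmetric H
    d = np.maximum(d, 0.0) + lam              # steps (2) and (3)
    H_tilde = V @ np.diag(d) @ V.T            # step (4)
    H_tilde_inv = V @ np.diag(1.0 / d) @ V.T  # cheap inverse V D^-1 V^T
    return H_tilde, H_tilde_inv

H = np.array([[2.0, 0.0], [0.0, -1.0]])       # an indefinite Hessian
Ht, Ht_inv = fix_hessian(H, lam=0.1)
print(np.linalg.eigvalsh(Ht))                 # all eigenvalues positive
```

Note that `eigh` already exploits symmetry of H, and the inverse costs only two matrix products once the decomposition is available.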
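For the typical case of independent box constraints on u, the open-loop backtracking and the row-zeroing of L described in Section V-B can be sketched as follows. This is an illustrative sketch under our own names, using the single-backtrack option rather than quadratic programming; the numeric example is ours.

```python
import numpy as np

def constrained_affine_law(g, G, Ht_inv, u_bar, u_min, u_max):
    """Sketch of Section V-B for independent box constraints on u.

    Open-loop part: backtrack the unconstrained minimum du* = -Ht_inv g
    once, so that l + u_bar stays feasible.  Feedback part: start from
    the optimal L = -Ht_inv G and zero the rows of saturated elements,
    so L dx can only cause changes along the constraint boundary."""
    du_star = -Ht_inv @ g
    # Largest eps in [0, 1] keeping eps*du* + u_bar inside the box.
    eps = 1.0
    for i in range(len(g)):
        if du_star[i] > 0:
            eps = min(eps, (u_max[i] - u_bar[i]) / du_star[i])
        elif du_star[i] < 0:
            eps = min(eps, (u_min[i] - u_bar[i]) / du_star[i])
    l = max(eps, 0.0) * du_star
    L = -Ht_inv @ G
    saturated = np.isclose(l + u_bar, u_min) | np.isclose(l + u_bar, u_max)
    L[saturated, :] = 0.0
    return l, L

# Illustrative 2-D example: the first control element hits its limit.
g = np.array([-4.0, 0.5]); G = np.ones((2, 2)); Ht_inv = np.eye(2)
u_bar = np.zeros(2); u_min = -np.ones(2); u_max = np.ones(2)
l, L = constrained_affine_law(g, G, Ht_inv, u_bar, u_min, u_max)
print(l, L[0])  # first element saturates at 1, so the first row of L is zeroed
```

A single backtracking factor ε is applied to all elements, matching the "backtrack once" option; per-element clipping would be a different (and less conservative) choice.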
VI. MAIN ITERATION

Each iteration of ILQG starts with an open-loop control sequence ū^(i)(t), and a corresponding state trajectory x̄^(i)(t) computed as in (1). As described in the previous sections, we then build a local LQG approximation around x̄^(i), and design an affine control law for the linearized system, in the form δu_k = l_k + L_k δx_k. This control law is applied forward in time to the linearized system δx_{k+1} = A_k δx_k + B_k δu_k, initialized at δx_1 = 0. The new open-loop controls ū_k = ū_k^(i) + δu_k are computed along the way, enforcing ū ∈ U if necessary. If the sequences ū^(i) and ū are sufficiently close, the iteration ends. Note that in the absence of the l_k term we would have δu_k = 0 and δx_k = 0 for all k, and so the reference trajectory used to center the LQG approximation would never improve.

If ū results in better performance than ū^(i), we set ū^(i+1) = ū and decrease the Levenberg-Marquardt constant λ. Otherwise we increase λ and recompute ū. When λ = 0 we have a Newton method using the true Hessian. When λ is large the Hessian is effectively replaced by λI, and so the algorithm takes very small steps in the direction of the gradient. We have found empirically that using such an adaptive method makes a difference - in terms of both robustness and speed of convergence.

VII. OPTIMAL CONTROL PROBLEMS TO BE STUDIED

We have thus far tested the method on two problems, both of which have nonlinear dynamics, non-quadratic costs, control constraints, and (for the second problem) multiplicative noise.

[Fig. 1 about here: (A) inverted pendulum with motor torque and 180 deg goal state; (B) 2-link arm with muscle groups.]
Fig. 1. Illustration of the control problems being studied. In (B), we have elbow flexors (EF), elbow extensors (EX), shoulder flexors (SF), shoulder extensors (SX), biarticulate flexors (BF), and biarticulate extensors (BX).

A. An inverted pendulum

We use the popular inverted pendulum problem (Fig 1A), with a limit on the torque that the motor can generate, and also a quadratic cost on the torque. Thus the optimal solutions are not always in the form of bang-bang control (which is the case when the control cost is absent) but exhibit both torque saturation and continuous transitions between torque limits. The dynamics are

ẋ_1 = x_2
ẋ_2 = u - 4 sin x_1

where the state variables are x_1 = θ, x_2 = θ̇. The goal is to make the pendulum swing up (corresponding to a 180 deg angle) and also make it stop - at the final time T. The control objective is to find the control u(t) that minimizes the performance index

J_0 = (1 + cos x_1(T))^2 + 0.1 x_2(T)^2 + ∫_0^T 0.01 u(t)^2 dt

We use a time step of 10 msec, T = 4 sec, and the maximum control torque that can be generated is |u| ≤ 2.

B. A model of the human arm

The second model we study is rather complex, and we lack the space to describe it in detail - see [8]. We model the nonlinear dynamics of a 2-link 6-muscle human arm moving in the horizontal plane (Fig 1B), using standard equations of motion in the form

M(θ) θ̈ + C(θ, θ̇) + B θ̇ = τ

where θ ∈ R^2 is the joint angle vector (shoulder: θ_1, elbow: θ_2), M(θ) ∈ R^{2x2} is the inertia matrix, C(θ, θ̇) ∈ R^2 is the vector of centripetal and Coriolis forces, B is the joint friction matrix, and τ ∈ R^2 is the joint torque that the muscles generate. The muscles are partitioned into 6 actuator groups. The joint torques produced by a muscle are a function of its posture-dependent moment arm, length-velocity-tension curve, and activation level. Muscles act like first-order nonlinear low-pass filters, and thus have activation states a. Each muscle receives control u_i that is polluted with multiplicative noise, and has activation dynamics ȧ_i = (u_i - a_i)/t(u_i, a_i). Thus the state x = [θ_1; θ_2; θ̇_1; θ̇_2; a_1; a_2; a_3; a_4; a_5; a_6] is 10-dimensional.

The task we study is reaching: the arm has to start at some initial position and move to a target in a specified time interval. It also has to stop at the target, and do all that with minimal energy consumption. There are good reasons to believe that such costs are indeed relevant to the neural control of movement [14]. The cost function is defined as

J_0 = ||e(θ(T)) - e*||^2 + 0.001 ||ė(θ(T), θ̇(T))||^2 + ½ ∫_0^T 0.0001 ||u||^2 dt

where e(θ) is the forward kinematics transformation from joint coordinates to end-point coordinates, and the target e* is defined in end-point coordinates.

VIII. NUMERICAL RESULTS

Since the pendulum has a two-dimensional state space, we can discretize it with a dense grid and solve the time-varying Hamilton-Jacobi-Bellman PDE numerically. We used a 100x100 grid, and 400 time steps (4 sec interval, 10 msec time step).

Fig 2 shows the optimal trajectories of the pendulum, according to the HJB solution, in gray. The best solutions found by our iterative method, after 3 restarts (with constant initial control sequences u(t) = 0, 1, -1) are shown in black. Note the close correspondence. In the only case where we saw a mismatch, running the algorithm with additional initial conditions found a better local minimum (dotted line) that agrees with the HJB solution.

We also applied ILQG to the human arm model described above. Note that this model is stochastic: we include multiplicative noise in the muscle activations, with standard deviation equal to 20% of the control signal. Fig 3 shows average behavior: hand paths in (A), tangential speed profiles
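The pendulum problem of Section VII-A can be written down directly; a sketch with the cost coefficients as reconstructed above (treat the exact weights as illustrative, since the printed cost function is partially garbled).

```python
import numpy as np

# Torque-limited inverted pendulum of Section VII-A.
DT, T_FINAL, U_MAX = 0.01, 4.0, 2.0   # 10 msec step, T = 4 sec, |u| <= 2

def f(x, u):
    """x = (theta, theta_dot); dynamics x1' = x2, x2' = u - 4 sin(x1)."""
    u = np.clip(u, -U_MAX, U_MAX)
    return np.array([x[1], u[0] - 4.0 * np.sin(x[0])])

def final_cost(x):
    """Distance from the upright (180 deg) posture plus residual speed."""
    return (1.0 + np.cos(x[0])) ** 2 + 0.1 * x[1] ** 2

def running_cost(u):
    """Quadratic torque cost, integrated with the Euler step."""
    return 0.01 * float(u[0] ** 2) * DT

# Zero-control rollout from the hanging-down position:
x = np.array([0.0, 0.0])
for _ in range(int(T_FINAL / DT)):
    x = x + DT * f(x, np.zeros(1))
print(final_cost(x))  # 4.0: the pendulum stays at the stable equilibrium
```

Plugging f, final_cost, and running_cost into the rollout/linearization machinery of Section III reproduces the setting compared against the HJB grid solution.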
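The λ-adaptation of the main iteration (accept and decrease λ; or reject, increase λ, and recompute) can be sketched as a generic skeleton. Here `improve` and `evaluate` are placeholders we introduce for the machinery of Sections III-V, not functions from the paper; the toy quadratic at the end only exercises the loop logic.

```python
import numpy as np

def ilqg_main_iteration(improve, evaluate, u_init, lam=1.0,
                        lam_factor=10.0, max_iter=100, tol=1e-6):
    """Skeleton of the ILQG main iteration (Section VI).

    improve(u, lam): one LQG approximation + affine-law forward pass,
        returning a new open-loop control sequence (placeholder).
    evaluate(u): cost of a control sequence on the nonlinear problem.
    lam is the Levenberg-Marquardt constant: decreased after an
    improvement, increased (and the step recomputed) otherwise."""
    u = u_init
    cost = evaluate(u)
    for _ in range(max_iter):
        u_new = improve(u, lam)
        new_cost = evaluate(u_new)
        if new_cost < cost:
            converged = np.max(np.abs(u_new - u)) < tol
            u, cost = u_new, new_cost
            if converged:
                break                 # sequences sufficiently close
            lam /= lam_factor         # accept: decrease lambda
        else:
            lam *= lam_factor         # reject: increase lambda, recompute
    return u, cost

# Toy usage: cost |u|^2, with "improve" a damped step toward the minimum.
improve = lambda u, lam: u - u / (1.0 + lam)
evaluate = lambda u: float(u @ u)
u_opt, cost = ilqg_main_iteration(improve, evaluate, np.array([1.0, -2.0]))
print(cost < 1e-10)  # True
```

Large λ makes `improve` take tiny gradient-like steps, small λ approaches a full Newton step, mirroring the discussion in the text.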
[Fig. 2 residue: state-control trajectory plots, state axis -180 to 180 deg, control axis -280 to 280.]
Fig. 2. State-control trajectories for the inverted pendulum, found by ILQG (black) and a global method (gray). Circles mark the starting state, crosses mark the target (repeated every 360 deg).

[Fig. 3 residue: (A) hand paths (10 cm scale bar); (B) tangential speed profiles over 0-350 ms; (C) muscle activations for Elbow Flexor, Elbow eXtensor, Shoulder Flexor, Shoulder eXtensor, Bi-art. Flexor, Bi-art. eXtensor.]

arm dynamics is too complex and so we had to resort to a centered finite difference approximation (which turned out to be the computational bottleneck).

Finally, we explored the issue of local minima for the