约束非线性随机系统局部最优反馈控制的广义迭代 LQG 方法

2005 American Control Conference WeA10.
2
June 8-10, 2005. Portland, OR, USA
A generalized iterative LQG method for locally-optimal feedback

control of constrained nonlinear stochastic systems
Emanuel Todorov and Weiwei Li
Abstract-We present an iterative Linear-Quadratic- developed a new method - iterative Linear-Quadratic Reg-
Gaussian method for locally-optimal feedback control of ulator design (ILQR) - which is closely related to DDP
nonlinear stochastic systems subject to control constraints. but turns out to be significantly more efficient: by a factor
Previously, similar methods have been restricted to determin-
istic unconstrained problems with quadratic costs. The new of 10 on reasonably complex control problems [SI. Our
method constructs an affine feedback control law, obtained by ILQR method uses iterative linearizations of the nonlinear
minimizing a novel quadratic approximation to the optimal dynamics around the current trajectory, and improves that
cost-to-go function. Global convergence is guaranteed through trajectory via modified Riccati equations. Both DDP and
a Levenberg-Marquardt method; convergence in the vicinity ILQR yield feedback control laws - which is a major
of a local minimum is quadratic. Performance is illustrated
on a limited-torque inverted pendulum problem, as well as a advantage compared to open-loop methods. However, these
complex biomechanical control problem involving a stochastic methods are still deterministic. Another shortcoming is that,
model of the human arm, with 10 state dimensions and unlike open-loop methods, DDP and ILQR cannot deal with
6 muscle actuators. A Matlab implementation of the new control constraints and non-quadratic cost functions. The
algorithm is availabe at www.cogsci.ucsd.edu/-todorov. goal of the present paper is to remove these limitations.
While the new algorithm should be applicable to a range
I. INTRODUCTION of problems, our specific motivation for developing it is
Despite an intense interest in optimal control theory the modeling of biological movement. Such modeling has
over the last 50 years, solving complex optimal control proven extremely useful in the study of how the brain
problems - that do not fit in the well-developed Linear- controls movement [ 151. Yet, progress has been impeded
Quadratic-Gaussian (LQG) formalism - remains a challenge by the lack of efficient methods that can handle realis-
[17]. Most existing numerical methods fall in one of two tic biomechanical control problems. The characteristics of
categories. Global methods based on the Hamilton-Jacobi- such problems are: high-dimensional nonlinear dynamics;
Bellman (HJB) equations and the idea of dynamic program- control constraints (e.g. non-negative muscle activations);
ming [7], [ l ] can yield globally-optimal feedback control multiplicative noise, with standard deviation proportional to
laws for general stochastic systems. However, such methods the control signals [4], [14]; complex performance criteria,
involve discretizations of the state and control spaces - that are rarely quadratic in the state variables [6].
which makes them inapplicable to high-dimensional prob- Before deriving our new iterative Linear-Quadratic-
lems, due to the curse of dimensionality. Local methods Gaussian (ILQG) method, we give a more detailed overview
based on the Maximum Principle [3], [13] avoid the curse of what is new here:
of dimensionality, by solving a set of ODEs - via shooting,
A . Noise
relaxation, collocation, or gradient descent. But the resulting
locally-optimal control laws are open-loop, and stochastic DDP, ILQR, and the new ILQG are dynamic program-
dynamics cannot be taken into account. ming methods that use quadratic approximations to the
An ideal blend of the advantages of local and global optimal cost-to-go function. All such methods are "blind" to
methods is provided by Differential Dynamic Programming additive noise (see Discussion). However, in many problems
(DDP) [5]. This method is still local, in the sense that of interest the noise is control-dependent, and such noise
it maintains a representation of a single trajectory and can easily be captured by quadratic approximations as we
improves it locally. The improvement however does not rely show below. Our new ILQG method incorporates control-
on solving ODEs, but is based on dynamic programming - dependent noise - which turns out to have an effect similar
applied within a "tube" around the current trajectory. DDP to an energy cost.
is known to have second-order convergence [9], [ll], and
B. Constraints
numerically appears to be more efficient [ 101 than (efficient
implementations of) Newton's method [ 121. We recently Quadratic approximation methods are presently restricted
to unconstrained problems. Generally speaking, constraints
This work is supported by NIH Grant R01-NS04591.5. make the optimal cost-to-go function non-quadratic. But
Emanuel Todorov is with the faculty of the Cognitive Science Depart- since we are approximating that function anyway, we might
ment, University of California San Diego.
Weiwei Li is with the Deparment of Mechanical and Aerospace Engi- as well take into account the effects of control constraints to
neering. University of California San Diego. the extent possible. Our new ILQG method does that - by
0-7803-9098-9105/$25.0002005 AACC 300
modifying the feedback gain matrix whenever an element 111. LOCALLQG APPROXIMATION
of the open-loop control sequence lies on the constraint The locally-optimal control law is constructed iteratively.
boundary. Each iteration of the algorithm begins with an open-loop
C. Convexity control sequence ii ( t )and the corresponding "zero-noise''
Quadratic approximation methods are based on Riccati trajectoly X ( t ) ,obtained by applying Ti ( t )to the determin-
equations, which are derived by setting up a quadratic istic dynamics 2 = f (x,u) with X(0) = X O . This can be
optimization problem at time step t , solving it analytically, done by Euler integration
and obtaining a formula for the optimal cost-to-go function
at time step t - 1. Optimizing a quadratic is only possible
Xk+l =X k + At f ( X k , EA) (1)
when the Hessian is positive-definite. This is of course true or by defining a continuous ii ( t )via interpolation, applying
in the classic LQG setting, but when LQG methods are a continuous-time integrator such as Runge-Kutta, and
used to approximate general nonlinear dynamics with non- discretizing the resulting X (t).
quadratic costs, the Hessian can (and in practice does) have Next we linearize the system dynamics and quadratize
zero and even negative eigenvalues. The traditional rem- the cost functions around X,U, to obtain a discrete-time
edy is to "fix" the Hessian, using a Levenberg-Marquardt linear dynamical system with quadratic cost. Importantly,
method, or an adaptive shift scheme [lo], or simply replace the linearized dynamics and quadratized cost are expressed
it with the identity matrix (which yields the steepest descent not in terms of the actual state and control variables, but in
method). The problem is that after fixing the Hessian, the terms of the state and control deviations b X k a X k - K k ,
optimization at time step t is no longer performed exactly b u k a u k - i i k . Writing the discrete-time dynamics as
- contrary to what the derivation of Riccati equations
assumes. Instead of making this invalid assumption, our new

dxk+l + xk+l =bxk + Kk -k At f (6xk + X k , bUk + Uk)
method takes the fixed Hessian into account, and constructs expanding f around X k , u k up to first order, and subtracting
a cost-to-go approximation consistent with the resulting (l), our original optimal control problem is approximated
control law. This is done by modified Riccati-like equations. locally by the following modified LQG problem:
11. PROBLEM STATEMENT bxk+l = AkbXk f BkbUk -k c k ( b u k ) <k (2)
Consider the nonlinear dynamical system described by c k (buk)
=
A
[Cl,k+ C l , k b U k Cp,k + C p , k b U k ]
' ' '
qk + bXk q k + -bXZQkbXk
the stochastic differential equation T 1
costk =
dx =f ( x ,u)dt + F ( x ,u)dw 2
+bUZrk + 5bU:RkbUk + bU;PkbXk
1
-
with state x E R", control u E R", and standard
Brownian motion noise w E R P . Let k ' ( t , x , u ) 2 0 be where bxl = 0, < k N (0; I p ) , and the last time step is
an instantaneous cost rate, h ( x ( T ) )2 0 a final cost, T a K . The quantities that define this modified LQG problem
specified final time, and u = T ( t ,x ) a deterministic control are A k , B k , C i , k , C i , k , q k , q k , Q k , r k , R k , p k , K . At each
law. Define the cost-to-go function vT ( t ,x ) as the total cost ( X k , &), these quantities are obtained from the original
expected to accumulate if the system is initialized in state problem as:
x at time t , and controlled until time T according to the
control law T: = I+At d f / d x ; B k = A t d f / d u (3)
1
A k
= 6F[Z1; C i , k = 6d F [ i l / d u
[ JT
Ci,k
urn( t ,x ) E h (x ( T ) ) + k' (7,x ,
( T ) 7r ( 7 ,x ( T ) ) ) d~
qk = At f?; q k = A t de/&
The expectation is taken over the instantiations of the Qk = At d2k/dxdx; P k = At d2f?/dudx

stochastic process w . The admissible control signals may be rk = At df?/du; R k = At d2e/dudu
constrained: u ( t )E U , where we assume that U is convex. where F[i]denotes the ith column of the matrix F . At
The objective of optimal control is to find the control the final time step IC = K the cost model is qK = h;
law T * that minimizes uT (0, X O ) . Note that the globally-
q K = ah/&; Q K = d2h/dxdx, and the control-related
optimal control law 7r* ( t ,x ) does not depend on a specific
a
cost terms r K , R K ,PK do not affect the optimization. The
initial state. However, finding this control law in com-
term appears because the covariance of Brownian
plex problems is unlikely. Instead, we seek locally-optimal
motion grows linearly with time. The ith column of the
control laws: we will approximate 7r* in the vicinity of
the trajectory X*( t ) that results from applying 7r* to the
+
matrix C k ( h k ) is ci C i 6 u k . Thus the noise covariance
is
deterministic system k = f ( x , u ) . Since X* depends on
+ +
P
X O , so does our approximation to 7r*. Throughout the cov [ c k ( b u k ) < k ] = (Ci,k Ci,khuk) (ci,k Ci,kfiUk)'
paper time is discretized as k = 1. . . K , with time step i=l
At = T / ( K - 1); thus u k u ( ( k - 1)At),and similarly Note that we are using a simplified noise model, where
for all other time-varying quantities. F (x,u) is only linearized with respect to b u . This is
301
sufficient to capture noise that is multiplicative in the control The shortcuts g , G , H appearing on the last line of (5) are
signal - which is what we are mainly interested in. It is defined at each time step as
straightforward to include state-dependent noise as well, and
repeat the derivation that follows. g =
A
rk +BLsk+l+ ci C:kSk+lCi,k (6)
'
Now that we have an LQG approximation to our original G Pk+BlSk+lAk
optimal control problem, we can proceed with computing
an approximately-optima1 control law. This is done in the
H Rk +B,TSk+lBk f C:kSk+lCi,k
following two sections. We then describe the main iteration At this point one may notice that the expression for
of the algorithm, which constructs a convergent sequence uk ( 6x) is a quadratic function of r,and set the control
of LQG approximations. signal to the value of r which makes the gradient vanish:
buk = -H-lgk - H - l G G x . But we will not assume this
I v . COMPUTING THE COST-TO-GO FUNCTION
specific form of the control law here, because H may have
The approximately-optima1 control law for the LQG negative eigenvalues (in which case the above 6u is not
approximation will be affine, in the form 6u = r k ( 6 x ) = a minimum), and also because some control constraints
lk + L k 6 x . The unusual open-loop component l k (letter 'el') may be violated. Instead we will defer the computation
arises because we are dealing with state and control devi- of the control law to the next section. All we assume for
ations, and is needed to make the algorithm iterative (see now is that the control law computed later will be in the
below). The control law is approximately-optimal because
we may have control constraints and non-convex costs.
+
general form 6u = l k L k 6 X . With this assumption we
can complete the computation of the cost-to-go function.
Suppose the control law r has already been designed
for time steps k . . . K - 1. Then the cost-to-go function
+
Replacing r with l k & 6 X , and noting that the square
matrix s k is symmetric, the 7r-dependent expression in the
u k (6x) is well-defined - as the cost expected to accumulate
last line of (5) becomes
if system (2) is initialized in state bx at time step k , and
controlled according to r for the remaining time steps. 1;g + i1LHlk + 6 X T ( G T l k + L:g + L i H l k )
We will show by induction (backwards in time) that if the
control law is affine, the corresponding cost-to-go function
+i6XT (LLHLk + L l G 4- GTLk) 6X
remains in the quadratic form We now see that the cost-to-go function remains quadratic
uk ( 6 x ) = Sk + 6XTSk f i 6 x T S k 6 x (4)
in bx, which completes the induction proof. Thus we have
for all k. At the last time step this holds, because the final Lemma (ILQG) For any affine control law in the form
cost is a quadratic that does not depend on the control +
6u ( k ,6x) = 1k L k 6 x , the cost-to-go for problem (2) is
signal. Now suppose that 2) is in the above form for time in the form (4)for all k. At the final time step K , the cost-
steps k + 1 . . . K . Using the shortcut r in place of the to-go parameters are SK = Q x , S K = q K , S K = q K . For
+
control signal l k L k 6 X that our control law generates, the k < K the parameters can be computed recursively as
Bellman equation for the cost-to-go function is s k = Qk f A;Sk+lAk +
L l H L k + LiG + GTLk (7)
Uk (6x) = immediate cost + E [uk+l (next state)] sk = qk + T
AkSk+l +LhHlk Llg + + GTlk
= qk + 6XT ( q k + aQk6x) + rT + (rk i R k r ) = + sk+l +i + fc g
+ +
sk
+E
qk c i C[kSk+lCi,k ilLHlk
+rTP,6X [Uk+l ( A k d x B k r Cktk)]
where g, G , H are defined in (6). The total cost is sl.
Evaluating the expectation term E [.] above yields
If we are not concerned with negative eigenvalues of H
sk+l + ( A k 6 x + B k r ) T Sk+l + or violations of control constraints, and set 1k = -H-'g,
3 ( A k b x f B k r ) T s k + l ( A k 6 x +B k r ) f L k = -H-% as mentioned above, a number of terms
31 trace (E:=, ( c z , k + c z , k r ) ( c z , k + C ~ , k SI;+])
r)~ cancel and (7) reduces to
Using the fact that trace(UV) = trace(VU), the

s k = +AlSk+lAk G T H - l G -
i trace (.) term above becomes Sk = q k +AlSk+l -G T H - l g
qk + Sk+l + 3 X i C:kSk+lCi,k
1 TH-1
2g g
$rT (E,c : k s k + l c z , k ) Sk = -
(c,
Tf
rT CzkSk+lCt,k) + i (E,C : k S k + l C z , k ) If we further remove the control-dependent noise (by setting

ci,k = 0) and the linear terms in the cost function (by
Combining the results and grouping terms, the cost-to-go is setting q k = r k = 0), we see that g = l k = Sk = 0 and the
"Jk ( 6 x ) = 4k + s k + l + E, c T s k + l c z (5) first line of (7) reduces to the familiar LQR discrete-time
f 6 X T ( q k + ALSk+i)
Riccati equation
+$biT (Qk + A l S k + i A k ) 6 X s k = Qk +A:Sk+iAk
+rT( g + Gbx) + i x T H x -AlSk+iBk (Rk +B:Sk+iBk)-' B:Sk+iAk
302
Thus our method can be reduced to familiar methods in several choices of 71, and found the best convergence using
special cases, but has the added flexibility of keeping the a method related to Levenberg-Marquardt: (1) compute the
quadratic cost-to-go calculation consistent regardless of how eigenvalue decomposition [V,D ] = eig ( H ) ; ( 2 ) replace all
the affine feedback control law is computed. negative elements of the diagonal matrix D with 0; (3)
It can be seen from the above equations that in the add a positive constant X to the diagonal of D ;(4) set
absence of control-dependent noise (i.e. when C = 0), 71 = VDVT, using the modified D from steps (2) and
additive noise has no effect on the control law. However, (3). The constant X is set and adapted in the main iteration
when C > 0, increasing the additive noise magnitude c of the algorithm, as described below. Note that 71-1 can be
results in changes of the control law. efficiently computed as VDP1VT.
v. COMPUTING THE CONTROL LAW B. Constrained second-order methods
As we saw in ( 5 ) , the cost-to-go function V I , (6x) depends The problem here is to find the affine control law 6u =
on the control 6Uk = 7Fk (6x) through the term + +
1 L b x minimizing (8) subject to constraints 6 u ii E U ,
assuming that H has already been replaced with a positive
a (bu,bx)= 6uT( g + GSx) + ;SuTH6u (8) definite 'FI. We first optimize the open-loop component 1
where we have suppressed the time index k. Ideally we for 6 x = 0, and then deal with the feedback term LSx. The
would choose the 6u that minimizes a for every 6x, subject unconstrained minimum is bu' = -'FI-'g. If it satisfies
to whatever control constraints are present. However, this SU* +Ti E U we are done. Otherwise we have two options.
is not always possible within the family of affine control The more efficient but less accurate method is to backtrack
+
laws b u = 1 LSx that wc arc considering. Since the goal once, i.e. to find the maximal E E [O; 11 such that ~Su*+iiE
of the LQG stage is to approximate the optimal controller U . This is appropriate in the early phase of the iterative
for the nonlinear system in the vicinity of X, we will give algorithm when X is still far away from X'; in that phase
preference to linear control laws that are optimaVfeasible it makes more sense to quickly improve the control law
for small 6x, even if that (unavoidably) makes them sub- rather than refine the solution to an LQG problem that is
optimalhnfeasible for larger 6x. In particular we need to an inaccurate approximation to the original problem. But in
make sure that for 6 x = 0, the new open-loop control the final phase of the iterative algorithm we want to obtain
6 u = 1 performs no worse than the current open-loop the best control law possible for the given LQG problem.
control 6 u = 0. Since a (0,6x) = 0, this holds if a (1,O) = In that phase we use quadratic programming.
l T g + + l T H l 5 0. The latter is always achievable by setting Once 1 is determined, we have to compute the feedback
1 = - Eg for a small enough E 2 0. gain matrix L. Given that 6x is unconstrained, the only
general way to enforce the constraints U is to set L = 0. In
A. First-order and second-order methods practice we do not want to be that conservative, since we
The gradient Vdua (6u,Sx) evaluated at Su = 0 is g + are looking for an approximation to the nonlinear problem
GSx. Therefore we can make an improvement along the +
that is valid around Sx = 0. If 1 ii is inside U , small
+
gradient of a by setting Su = - E ( g G6x). If we are only changes Lbx will not cause constraint violations and so we
interested in open-loop control, we can use 6 u = - ~ g .If the can use the optimal L = -71-'G. But if 1 ti lies on+
new control signal (for the nonlinear system) E--Eg violates the constraint boundary dU, we have to modify L so that
the constraints, we have to reduce E until the constraints L6x can only cause changes along the boundary. Modifying
are satisfied. Note that ii is by definition a feasible control L is straightforward in the typical case when the range of
signal, so unless it lies on the constraint boundary (and g each element of u is specified independently. In that case
points inside the feasible set) we can find an E > 0 for which we simply set to 0 the rows of -'FI-'G corresponding to
ii - Eg is also feasible. This gives a first-order method. elements of 1 + ii that have reached their limits.
To obtain a second-order method we use the Hessian
H . If H is positive semi-definite we can compute the VI. MAINITERATION
unconstrained optimal control law 6 u = -H-' ( g + G ~ x ) , Each iteration of ILQG starts with an open-loop con-
and deal with the control constraints as described below. trol sequence ( t ) ,and a corresponding state trajectory
But when H has negative eigenvalues, there exist bu's that di) ( t ) computed as in (1). As described in the previous
make a (and therefore v) arbitrarily negative. Note that the sections, we then build a local LQG approximation around
cost-to-go function for the nonlinear problem is always non- di), and design an affine control law for the linearized
negative, but since we are using an approximation to the system, in the form 6uk = 1k + LkbXk. This control
true cost we may (and in practice do) encounter situations law is applied forward in time to the linearized system
where a does not have a minimum. In that case the gradient d X k + l = Akbxk + Bkbuk, initialized at 6x1 = 0. The new
V6"a = g + GSx is still correct, and so the true cost-to-go open-loop controls Uk = E t i ) + 6 u k are computed along the
+
decreases in the direction -X-' ( g G6x) for any positive way, enforcing U E ZA if necessary. If the sequences idi)
definite matrix 'H. Thus we use a matrix 'FI that "resembles" and U are sufficiently close, the iteration ends. Note that in
H , but is positive definite. We have experimented with the absence of the 11, term we would have 6uk = 0 and
303
(A) (B)
B. A model of the human arm
goal state
(180 deg) The second model we study is rather complex, and we
EF
!2 lack the space to describe it in detail - see [8]. We model
BF the nonlinear dynamics of a 2-link 6-muscle human arm
torque SF moving in the horizontal plane (Fig lB), using standard
EX
motor equations of motion in the form
M(8)8+ C(e, e) + I38 =

BX
!1 T,
pendulum SX
where 0 E R2 is the joint angle vector (shoulder: 81, elbow:
e,), M(B) E R2x2is the inertia matrix, C(0,e) E R2
is the vector of centripetal and Coriolis forces, I3 is the
Fig. 1 . Illustration of the control problems being studied. In (B), we have joint friction matrix, and T E R2 is the joint torque that
elbow flexors (EF), elbow extensors (EX), shoulder flexors (SF), shoulder
extensors (SX), biarticulate flexors (BF), and biarticulate extensors (BX). the muscles generate. The muscles are partitioned into 6
actuator groups. The joint torques produced by a muscle
are a function of its posture-dependent moment arm,length-
6xk = 0 for all k , and so the reference trajectory used to velocity-tension curve, and activation level. Muscles act
center the LQG approximation will never improve. like first-order nonlinear low-pass filters, and thus have
If U results in better performance than idi), we set activations states a. Each muscle receives control ui that
idi+')= U and decrease the Levenberg-Marquardt constant is polluted with multiplicative noise, and has activation
PI;
A. Otherwise we increase X and recompute U. When X = 0 dynamics ui = (u,- ui)/t(ui,ui).Thus the state x =
we have a Newton method using the true Hessian. When e 2 ; 81; e a ; a l ; a2; as; a4; as; a61 is 10-dimensional.
X is large the Hessian is effectively replaced by XI, and The task we study is reaching: the arm has to start at
so the algorithm takes very small steps in the direction of some initial position and move to a target in a specified
the gradient. We have found empirically that using such time interval. It also has to stop at the target, and do all that
an adaptive method make a difference - in terms of both with minimal energy consumption. There are good reasons
robustness and speed of convergence. to believe that such costs are indeed relevant to the neural
control of movement [14]. The cost function is defined as
1
VII. OPTIMAL CONTROL PROBLEMS TO BE STUDIED
- e*1I2+ 0.001
2
We have thus far tested the method on two problems, JO = Ile (qq) j/i:( e p ) , B ( T ) )
both of which have nonlinear dynamics, non-quadratic
costs, control constraints, and (for the second problem) +: lTO.OOOl l l ~ 1 1d t~
multiplicative noise.
where e (e) is the forward kinematics transformation from
A. An inverted pendulum joint coordinates to end-point coordinates, and the target e*
We use the popular inverted pendulum problem (Fig lA), is defined in end-point coordinates.
with a limit on the torque that the motor can generate,
VIII. NUMERICAL
RESULTS
and also a quadratic cost on the torque. Thus the optimal
solutions are not always in the form of bang-bang control Since the pendulum has a two-dimensional state space,
(which is the case when the control cost is absent) but we can discretize it with a dense grid and solve the time-
exhibit both torque saturation and continuous transitions varying Hamilton-Jacobi-Bellman PDE numerically. We
between torque limits. The dynamics are used a 100x100 grid, and 400 time step (4 sec interval,
10 msec time step).
il = 2 2 Fig 2 shows the optimal trajectories of the pendulum,
x2 = u - 4 s i n x 1 according to the HJB solution, in gray. The best solutions
found by our iterative method, after 3 restarts (with constant
where the state variables are 21 = 8, 5 2 = 8. The goal is initial control sequences u ( t ) = 0,1, -1 ) are shown in
to make the pendulum swing up (corresponding to a 180 black. Note the close correspondence. In the only case
deg angle) and also make it stop - at the final time T. The where we saw a mismatch, running the algorithm with
control objective is to find the control u ( t ) that minimizes additional initial conditions found a better local minimum
the performance index (dotted line) that agrees with the HJB solution.
T We also applied ILQG to the human arm model described
~. (l T ) ~ + 0 . ~0 (1t 1) ~ d above.
.To= ( l + c o ~ z ~ ( T ) ) ~ +~O t Note that this model is stochastic: we include
multiplicative noise in the muscle activations, with standard
We use a time step of 10 msec, T = 4sec, and the maximum deviation equal to 20% of the control signal. Fig 3 shows av-
control torque that can be generated is I uI 5 2. erage behavior: hand paths in (A), tangential speed profiles
304
280 (A) (B)
-280
-180 deg 180
10 cm 0 350
Time (ms)
(C)
Elbow Flexor Elbow eXtensor Shoulder Flexor
Fig. 2. State-control trajectories for the inverted pendulum, found by Shoulder eXtensor Bi-art. Flexor Bi-art. eXtensor
ILQG (black) and a global method (gray). Circles mark the starting state,
crosses mark the target (repeated every 360 deg).
in (B), and muscle activations in (C). Both the movement

kinematics and the muscle activations share many features 0
Time (ms)
350
with experimental data on human arm movements - but
a detailed discussion of the relevance to Motor Control is
beyond the scope of this paper. Fig. 3. Average behavior of the ILQG controller for reaching movements,
Fig 4 illustrates the robustness to noise: open-loop control using a 2-link 6-muscle model of the human arm. (A) Hand paths for
in (A), closed-loop control in (B), and closed-loop control movement in 16 directions; (B) Speed profiles; (C) Muscle activations.
optimized for a deterministic system in (C). Closed-loop (A) (B) (C)
control is based on the time-varying feedback gain matrix
L generated by the ILQG method, while open-loop control
only uses the final ii constructed by the algorithm. As the
endpoint error ellipses show, the feedback control scheme
substantially reduces the effects of the noise, and benefits
from being optimized for the correct multiplicative noise
model. 5 cm
Another encouraging result is that in terms of CPU
time, the complex arm model does not require much more
computation than the pendulum. Fig 5A shows how the total Fig. 4. Effects of control-dependent noise on hand reaching trajectories,
cost (for the arm control problem) decreased over iterations under different control laws. (A) open-loop control; (B) closed-loop
of the algorithm, for reaching in 8 different directions. control; (C) closed-loop controller optimized for deterministic system.
On average, ILQG found a locally-optimal time-varying (A) (B)
-1
feedback control law in about 5 seconds (on a 3.2GHz
Pentium 4 machine, in Matlab). It should be noted that while
2
the pendulum dynamics could be linearized analytically, the
elbow angle (rad)
log10 cost
-3
arm dynamics is too complex and so we had to resort to a
centered finite difference approximation (which turned out
to be the computational bottleneck). -5
Finally, we explored the issue of local minima for the 1
arm control problem (Fig 5B). We initialized the open-

loop controls to random sequences, whose elements were -7
0 2 4 6
uniformly sampled between 0 and 0.1 (the allowed range of cpu time (seconds)
0 1
shoulder angle (rad)
muscle control signals was between 0 and 1). The resulting
initial trajectories are shown as an inset in Fig 5B. Note
that they all go in roughly the same direction, because Fig. 5. (A) Cost over iterations of ILQG. (B) Hand paths for random
some arm muscle are stronger than others. We used 10 initial conrol laws (inset) and optimized paths (black) to 8 targets. Poor
different initialization, for each of 8 movement directions. local minima (7 out of 80) are shown in cyan.
305
Over 90% of the optimization runs (73 out of 80) converged to the optimal cost-to-go, but that has to be done whenever
to a solution very similar to the best solution we found for fixed-horizon model-predictive control is used.
the corresponding target direction. The final trajectories are (4) For deterministic systems ILQG converges to ex-
shown in black in Fig 5B. The remaining 7 runs found tremal trajectories, just like ODE methods based on Pon-
local minima (shown in cyan). Therefore, a small number tryagin’s Maximum principle (with the advantage of yield-
of restarts of ILQG are sufficient to discover what appears ing a feedback controller). In that case the linear-quadratic
to be the global minimum in a relatively complex control approximation does not lead to suboptimality of the solu-
problem. tion (modulo local minima problems). But for stochastic
problems we have no such guarantees, and in fact we know
I x . DISCUSSION
AND EXTENSIONS that second-order methods are insensitive to additive noise
Here we presented a new local method for optimal - suggesting that higher-order approximations to the cost-
control of stochastic nonlinear systems subject to control to-go may be needed. There is a simple way to increase
constraints, and applied it to two test problems - a simple accuracy, by augmenting the state with some nonlinear
pendulum and a high-dimensional biomechanical model of functions of the state: z = [x; y], where y = h (x).Then
the human arm. In the inverted pendulum problem we 9 = dh (x)/& k,and so the augmented system has well-
demonstrated numerically that the ILQG solutions are close defined dynamics. In the reaching problem, for example,
to the global minimum. Additional work is needed to the cost is a complex trigonometric function in the joint
ascertain the properties of the algorithm in more complex angles but a simple quadratic in the endpoint coordinates.
problems, where we cannot use global methods for vali- Augmenting the state with those coordinates may therefore
dation. One possibility that we are considering is to take increase performance for stochastic problems.
the solution of the ILQG method, apply a stochastic policy
REFERENCES
gradient algorithm to it (which can be very inefficient, but in
the limit of large sample size avoids approximation errors), [l] D. Bertsekas and J. Tsitsiklis (1996). Neuro-dynamic programming.
Athena Scientific, Belmont, MA
and see if the ILQG solution is a minimum of gradient [2] I. Brown, E. Cheng and G. Loeb (1999). Measured and modeled
descent in the space of feedback control laws. We are also properties of mammalian skeletal muscle. 11. The effects of stimulus
in the process of implementing a very detailed model of the frequency on force-length and force-velocity relationships. J. Muscle
Res Cell Motil 20: 627-643
human arm, including 10 degrees of freedom and around [3] A. Bryson and Y.-C. Ho (1969). Applied Optimal Control. Blaisdell
50 muscles; it will be interesting to see if ILQG can handle Publishing Company, Massachusetts, USA
such a complex system in a reasonable amount of time. [4] C. Harris and D. Wolpert (1998). Signal-dependent Noise Determines
Motor Planning. Nature, 394: 780-784
Finally, there are several possible extensions to the work [5] D. Jacobson and D. Mayne (1970). Dzfferential Dynamic Program-
we presented here: ming. Elsevier Publishing Company, New York, USA
(1) We assumed implicitly that the control system can [6] K. Koerding and D. Wolpert (2004). The loss function of sensorimo-
tor learning. Pmceedings of the National Academy of Sciences 101:
directly observe the plant state x, while in reality feedback 9839-9842
is based on delayed and noisy sensors that may not measure [7] H. Kushner and P. Dupuis (2001). Numerical methods for stochastic
all state variables. It should be possible to extend the control problems in continuous time (2nd edition). Spinger, New York
[8] W. Li and E. Todorov (2004). Iterative Linear Quadratic Regulator
algorithm to the partially observable case by combining Design for Nonlinear Biological Movement Systems. In proceedings
it with an extended Kalman filter. This will result in a of the 1st International Conference on Informatics in Control,
coupled estimation-control problem, which is complicated Automation and Robotics, 1: 222-229
[9] L.-Z. Liao and C. Shoemaker (1991). Convergence in unconstrained
in the presence of multiplicative noise. However, we have discrete-time differential dynamic programming. IEEE Transactions
recently solved such a problem for linear systems [16], and on Automatic Control 36(6): 692-706
are now working on adapting that methodology to the ILQG [IO] L.-Z. Liao and C. Shoemaker (1993). Advantages of differential dy-
namic programming over Newton’s method for discrete-time optimal
framework. control problems. Cornell University Technical Report.
(2) As with all second-order methods, convergence is fast [ I l l C. Nh, L.-Z. Liao and D. Li (2002). A globally convergent and
near a local minimum but can be slow far away from it. The efficient method for unconstrained discrete-time optimal control.
Journal of Global Optimization 23: 401-421
question then arises, how can we make the algorithm faster [12] J. Pantoja (1988) Differential Dynamic Programming and Newton’s
at the beginning of the iteration? A very promising idea is to method. Intl J Control 47: 1539
use a multi-resolution method in time: start with a large time [13] 0. von Stryk (1993). Numerical solution to optimal control problems
by direct collocation. International Series on Numerical Mathemat-
step, and gradually reduce it (possibly alternating). With this ics 111: 129-143
in mind, we showed explicitly how the LQG approximation [14] E. Todorov and M. Jordan (2002). Optimal Feedback Control as a
depends on the time step At - so that the algorithm can Theory of Motor Coordination. Nature Neuroscience, 5(11): 1226-
1235
be applied without modification with a At that varies from [I51 E. Todorov (2004). Optimality principles in sensorimotor control
one iteration to the next. (Review). Narure Neuroscience 7(9): 907-91 5
(3) While we assumed a specified final time T , the [ 161 E. Todorov (2005). Stochastic optimal control and estimation meth-
ods adapted to the noise characteristics of the sensorimotor system.
algorithm can be applied in model-predictive mode, using Neural Computation 17(5), in press
a fixed time horizon rather than a fixed final time. The final [ 171 R. Vinter (2000). Optimal Control. Birkhauser, Boston.
cost h (x)will have to be replaced with some approximation
306

约束非线性随机系统局部最优反馈控制的广义迭代 LQG 方法

Uploaded by

Copyright:

Available Formats

约束非线性随机系统局部最优反馈控制的广义迭代 LQG 方法

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

约束非线性随机系统局部最优反馈控制的广义迭代 LQG 方法

Uploaded by

Copyright:

Available Formats

2005 American Control Conference WeA10.

A generalized iterative LQG method for locally-optimal feedback

- contrary to what the derivation of Riccati equations

assumes. Instead of making this invalid assumption, our new

The expectation is taken over the instantiations of the Qk = At d2k/dxdx; P k = At d2f?/dudx

Using the fact that trace(UV) = trace(VU), the

i trace (.) term above becomes Sk = q k +AlSk+l -G T H - l g

rT CzkSk+lCt,k) + i (E,C : k S k + l C z , k ) If we further remove the control-dependent noise (by setting

M(8)8+ C(e, e) + I38 =

in (B), and muscle activations in (C). Both the movement

arm control problem (Fig 5B). We initialized the open-

You might also like