Communications and Control Engineering
Rushikesh Kamalapurkar
Patrick Walters · Joel Rosenfeld
Warren Dixon
Reinforcement Learning for Optimal Feedback Control
A Lyapunov-Based Approach
Communications and Control Engineering
Series editors
Alberto Isidori, Roma, Italy
Jan H. van Schuppen, Amsterdam, The Netherlands
Eduardo D. Sontag, Boston, USA
Miroslav Krstic, La Jolla, USA
Communications and Control Engineering is a high-level academic monograph
series publishing research in control and systems theory, control engineering and
communications. It has worldwide distribution to engineers, researchers, educators
(several of the titles in this series find use as advanced textbooks although that is not
their primary purpose), and libraries.
The series reflects the major technological and mathematical advances that have
a great impact on the fields of communication and control. The range of areas to
which control and systems theory is applied is broadening rapidly with particular
growth being noticeable in the fields of finance and biologically-inspired control.
Books in this series generally pull together many related research threads in more
mature areas of the subject than the highly-specialised volumes of Lecture Notes in
Control and Information Sciences. This series’s mathematical and control-theoretic
emphasis is complemented by Advances in Industrial Control which provides a
much more applied, engineering-oriented outlook.
Publishing Ethics: Researchers should conduct their research from research
proposal to publication in line with best practices and codes of conduct of relevant
professional bodies and/or national and international regulatory bodies. For more
details on individual ethics matters please see:
https://www.springer.com/gp/authors-editors/journal-author/journal-author-help-desk/publishing-ethics/14214.
Reinforcement Learning
for Optimal Feedback
Control
A Lyapunov-Based Approach
Rushikesh Kamalapurkar
Mechanical and Aerospace Engineering
Oklahoma State University
Stillwater, OK, USA

Joel Rosenfeld
Electrical Engineering
Vanderbilt University
Nashville, TN, USA
MATLAB® and Simulink® are registered trademarks of The MathWorks, Inc., 1 Apple Hill Drive,
Natick, MA 01760-2098, USA, http://www.mathworks.com.
Mathematics Subject Classification (2010): 49-XX, 34-XX, 46-XX, 65-XX, 68-XX, 90-XX, 91-XX,
93-XX
This Springer imprint is published by the registered company Springer International Publishing AG
part of Springer Nature
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my nurturing grandmother, Mangala
Vasant Kamalapurkar.
—Rushikesh Kamalapurkar
Making the best possible decision according to some desired set of criteria is always
difficult. Such decisions are even more difficult when there are time constraints and
can be impossible when there is uncertainty in the system model. Yet, the ability to
make such decisions can enable higher levels of autonomy in robotic systems and,
as a result, have dramatic impacts on society. Given this motivation, various
mathematical theories have been developed related to concepts such as optimality,
feedback control, and adaptation/learning. This book describes how such theories
can be used to develop optimal (i.e., the best possible) controllers/policies (i.e., the
decision) for a particular class of problems. Specifically, this book is focused on the
development of concurrent, real-time learning and execution of approximate opti-
mal policies for infinite-horizon optimal control problems for continuous-time
deterministic uncertain nonlinear systems.
The developed approximate optimal controllers are based on reinforcement
learning-based solutions, where learning occurs through an actor–critic-based
reward system. Detailed attention to control-theoretic concerns such as convergence
and stability differentiates this book from the large body of existing literature on
reinforcement learning. Moreover, both model-free and model-based methods are
developed. The model-based methods are motivated by the idea that a system can
be controlled better as more knowledge is available about the system. To account
for the uncertainty in the model, typical actor–critic reinforcement learning is
augmented with unique model identification methods. The optimal policies in this
book are derived from dynamic programming methods; hence, they suffer from the
curse of dimensionality. To address the computational demands of such an
approach, a unique function approximation strategy is provided to significantly
reduce the number of required kernels along with parallel learning through novel
state extrapolation strategies.
The material is intended for readers that have a basic understanding of nonlinear
analysis tools such as Lyapunov-based methods. The development and results may
help to support educators, practitioners, and researchers with nonlinear
systems/control, optimal control, and intelligent/adaptive control interests working
in aerospace engineering, computer science, electrical engineering, industrial
1 Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 The Bolza Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.1 Necessary Conditions for Optimality . . . . . . . . . . . . . . . . 3
1.4.2 Sufficient Conditions for Optimality . . . . . . . . . . . . . . . . 5
1.5 The Unconstrained Affine-Quadratic Regulator . . . . . . . . . . . . . . . 5
1.6 Input Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Connections with Pontryagin’s Maximum Principle . . . . . . . . . . . 9
1.8 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8.1 Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.8.2 Differential Games and Equilibrium Solutions . . . . . . . . . 11
1.8.3 Viscosity Solutions and State Constraints . . . . . . . . . . . . 12
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Approximate Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Exact Dynamic Programming in Continuous Time and Space . . . . . 17
2.2.1 Exact Policy Iteration: Differential and Integral
Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Value Iteration and Associated Challenges . . . . . . . . . . . . . 22
2.3 Approximate Dynamic Programming in Continuous Time
and Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Some Remarks on Function Approximation . . . . . . . . . . . 23
2.3.2 Approximate Policy Iteration . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Development of Actor-Critic Methods . . . . . . . . . . . . . . . 25
2.3.4 Actor-Critic Methods in Continuous Time and Space . . . . 26
2.4 Optimal Control and Lyapunov Stability . . . . . . . . . . . . . . . . . . . 26
6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
6.2 Station-Keeping of a Marine Craft . . . . . . . . . . . . . . . . . . . . . . . . 196
6.2.1 Vehicle Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.2.2 System Identifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.2.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
6.2.4 Approximate Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
6.2.5 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
6.2.6 Experimental Validation . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.3 Online Optimal Control for Path-Following . . . . . . . . . . . . . . . . . 213
6.3.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.3.2 Optimal Control and Approximate Solution . . . . . . . . . . . 215
6.3.3 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.3.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.3.5 Experiment Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
6.4 Background and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . 223
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
7 Computational Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.2 Reproducing Kernel Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . . . 230
7.3 StaF: A Local Approximation Method . . . . . . . . . . . . . . . . . . . . . 232
7.3.1 The StaF Problem Statement . . . . . . . . . . . . . . . . . . . . . . 232
7.3.2 Feasibility of the StaF Approximation
and the Ideal Weight Functions . . . . . . . . . . . . . . . . . . . . 233
7.3.3 Explicit Bound for the Exponential Kernel . . . . . . . . . . . 235
7.3.4 The Gradient Chase Theorem . . . . . . . . . . . . . . . . . . . . . 237
7.3.5 Simulation for the Gradient Chase Theorem . . . . . . . . . . 240
7.4 Local Approximation for Efficient Model-Based
Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.4.1 StaF Kernel Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.4.2 StaF Kernel Functions for Online Approximate
Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
7.4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.4.4 Extension to Systems with Uncertain Drift
Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
7.4.5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
7.5 Background and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . 260
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Appendix A: Supplementary Lemmas and Definitions . . . . . . . . . . . . . . . 265
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
Symbols
Lists of abbreviations and symbols used in definitions, lemmas, theorems, and the
development in the subsequent chapters.
1.1 Introduction
The ability to learn behaviors from interactions with the environment is a desirable
characteristic of a cognitive agent. Typical interactions between an agent and its
environment can be described in terms of actions, states, and rewards (or penalties).
Actions executed by the agent affect the state of the system (i.e., the agent and the
environment), and the agent is presented with a reward (or a penalty). Assuming that
the agent chooses an action based on the state of the system, the behavior (or the
policy) of the agent can be described as a map from the state-space to the action-space.
Desired behaviors can be learned by adjusting the agent-environment interaction
through the rewards/penalties. Typically, the rewards/penalties are quantified by a cost.
For example, in many applications, the correctness of a policy is often quantified in
terms of the Lagrange cost and the Mayer cost. The Lagrange cost is the cumulative
penalty accumulated along a path traversed by the agent and the Mayer cost is the
penalty at the boundary. Policies with lower total cost are considered better and
policies that minimize the total cost are considered optimal. The problem of finding
the optimal policy that minimizes the total Lagrange and Mayer cost is known as the
Bolza optimal control problem.
1.2 Notation
Throughout the book, unless otherwise specified, the domain of all the functions is
assumed to be R≥0 . Function names corresponding to state and control trajectories are
reused to denote elements in the range of the function. For example, the notation u (·)
is used to denote the function u : R≥t0 → Rm , the notation u is used to denote an arbi-
trary element of Rm , and the notation u (t) is used to denote the value of the function
u (·) evaluated at time t. Unless otherwise specified, all the mathematical quanti-
ties are assumed to be time-varying, an equation of the form g (x) = f + h (y, t)
is interpreted as g (x (t)) = f (t) + h (y (t) , t) for all t ∈ R≥0 , and a definition of
the form g (x, y) ≜ f (y) + h (x) for functions g : A × B → C, f : B → C, and
h : A → C is interpreted as g (x (t) , y (t)) ≜ f (y (t)) + h (x (t)) for all t ∈ R≥0.

1.3 The Bolza Problem

Consider a dynamical system of the form

\[
\dot{x}(t) = f\left(x(t), u(t), t\right), \tag{1.1}
\]

where t0 is the initial time, x : R≥t0 → Rn denotes the system state and u : R≥t0 →
U ⊂ Rm denotes the control input, and U denotes the action-space.
To ensure local existence and uniqueness of Carathéodory solutions to (1.1), it is
assumed that the function f : Rn × U × R≥t0 → Rn is continuous with respect to
t and u, and continuously differentiable with respect to x. Furthermore, the control
signal, u (·), is restricted to be piecewise continuous. The assumptions stated here are
sufficient but not necessary to ensure local existence and uniqueness of Carathéodory
solutions to (1.1). For further discussion on existence and uniqueness of Carathéodory
solutions, see [1, 2]. Further restrictions on the dynamical system are stated, when
necessary, in subsequent chapters.
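The role of piecewise continuity can be illustrated numerically: a Carathéodory solution under a control signal with a single jump is obtained by integrating on each interval of continuity and chaining the state across the switch. The system and switching time below are hypothetical, chosen only to make the construction concrete:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical piecewise-continuous control: u jumps from +1 to -1 at t = 1.
def u(t):
    return 1.0 if t < 1.0 else -1.0

# xdot = f(x, u, t) = u(t); integrate separately on each interval of
# continuity and chain the state across the switching instant.
rhs = lambda t, x: np.array([u(t)])
s1 = solve_ivp(rhs, (0.0, 1.0), [0.0], rtol=1e-9, atol=1e-12)
s2 = solve_ivp(rhs, (1.0, 2.0), s1.y[:, -1], rtol=1e-9, atol=1e-12)
x_final = s2.y[0, -1]  # analytically x(2) = 0 + 1 - 1 = 0
```

The resulting trajectory is continuous and satisfies the dynamics almost everywhere, which is exactly the Carathéodory solution concept invoked above.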
Consider a fixed final time optimal control problem where the optimality of a
control policy is quantified in terms of a cost functional
\[
J(t_0, x_0, u(\cdot)) = \int_{t_0}^{t_f} L\left(x(t; t_0, x_0, u(\cdot)), u(t), t\right) \mathrm{d}t + \Phi\left(x_f\right), \tag{1.2}
\]

where L denotes the Lagrange cost, Φ denotes the Mayer cost, and t_f and x_f ≜ x(t_f) denote the final time and state, respectively. In (1.2),
the notation x (t; t0 , x0 , u (·)) is used to denote a trajectory of the system in (1.1),
evaluated at time t, under the controller u (·), starting at the initial time t0 , and with
the initial state x0 . Similarly, for a given policy φ : Rn → Rm , the short notation
x (t; t0 , x0 , φ (x (·))) is used to denote a trajectory under the feedback controller
u (t) = φ (x (t; t0 , x0 , u (·))). Throughout the book, the symbol x is also used to
denote generic initial conditions in Rn . Furthermore, when the controller, the initial
time, and the initial state are understood from the context, the shorthand x (·) is used
when referring to the entire trajectory, and the shorthand x (t) is used when referring
to the state of the system at time t.
The two most popular approaches to solve Bolza problems are Pontryagin’s max-
imum principle and dynamic programming. The two approaches are independent,
both conceptually and in terms of their historical development. Both approaches
are built on the foundation of the calculus of variations, which has its origins in
Newton’s Minimal Resistance Problem dating back to 1685 and Johann Bernoulli’s
Brachistochrone problem dating back to 1696. The maximum principle was devel-
oped by the Pontryagin school at the Steklov Institute in the 1950s [3]. The devel-
opment of dynamic programming methods was simultaneously but independently
initiated by Bellman at the RAND Corporation [4]. While Pontryagin’s maximum
principle results in optimal control methods that generate optimal state and control
trajectories starting from a specific state, dynamic programming results in methods
that generate optimal policies (i.e., they determine the optimal decision to be made
at any state of the system).
Barring some comparative remarks, the rest of this monograph will focus on the
dynamic programming approach to solve Bolza problems. The interested reader is
directed to the books by Kirk [5], Bryson and Ho [6], Liberzon [7], and Vinter [8]
for an in-depth discussion of Pontryagin’s maximum principle.
Instead of a single Bolza problem, a family of Bolza problems with cost functionals of the form

\[
J(t, x, u(\cdot)) = \int_t^{t_f} L\left(x(\tau; t, x, u(\cdot)), u(\tau), \tau\right) \mathrm{d}\tau + \Phi\left(x_f\right) \tag{1.3}
\]

is solved, where t ∈ [t0, tf], tf ∈ R≥0, and x ∈ Rn. A solution to the family of Bolza
problems in (1.3) can be characterized using the optimal cost-to-go function (i.e.,
the optimal value function) V∗ : Rn × R≥0 → R, defined as

\[
V^*(x, t) \triangleq \inf_{u_{[t, t_f]}} J(t, x, u(\cdot)),
\]
where the notation u [t,τ ] for τ ≥ t ≥ t0 denotes the controller u (·) restricted to the
time interval [t, τ ].
Define the function V as

\[
V(x, t) \triangleq \inf_{u_{[t, t+\Delta t]}} \left\{ \int_t^{t+\Delta t} L\left(x(\tau), u(\tau), \tau\right) \mathrm{d}\tau + V^*\left(x(t+\Delta t), t+\Delta t\right) \right\}.
\]

Substituting the definition of the optimal value function,

\[
V(x, t) = \inf_{u_{[t, t+\Delta t]}} \left\{ \int_t^{t+\Delta t} L\left(x(\tau), u(\tau), \tau\right) \mathrm{d}\tau + \inf_{u_{[t+\Delta t, t_f]}} J\left(t+\Delta t, x(t+\Delta t), u(\cdot)\right) \right\}.
\]
Thus, V (x, t) ≤ V ∗ (x, t), which, along with (1.6), implies V (x, t) = V ∗ (x, t).
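The argument above is the continuous-time form of the principle of optimality. In a discrete-time, finite-state analogue (the integer dynamics, costs, and horizon below are illustrative, not from the text), the same principle lets the value function be computed by backward recursion, and the result can be checked against brute-force minimization over entire control sequences:

```python
import itertools

# Illustrative discrete problem: integer states 0..4, actions -1, 0, 1.
states = range(5)
actions = (-1, 0, 1)
T = 3                                        # horizon (number of steps)
step = lambda x, u: min(max(x + u, 0), 4)    # saturated integer dynamics
L = lambda x, u: x * x + u * u               # running (Lagrange) cost
Phi = lambda x: x * x                        # terminal (Mayer) cost

# Backward recursion: V_k(x) = min_u { L(x, u) + V_{k+1}(step(x, u)) }.
V = {x: Phi(x) for x in states}
for _ in range(T):
    V = {x: min(L(x, u) + V[step(x, u)] for u in actions) for x in states}

# Brute force: minimize the total cost over every T-step control sequence.
x0 = 3
def seq_cost(us):
    x, c = x0, 0
    for u in us:
        c += L(x, u)
        x = step(x, u)
    return c + Phi(x)

brute = min(seq_cost(us) for us in itertools.product(actions, repeat=T))
# The principle of optimality guarantees V[x0] == brute.
```

Backward recursion visits each state once per stage, while the brute-force search grows exponentially in T, which is precisely the computational advantage dynamic programming provides.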
Under the assumption that V∗ ∈ C¹(Rn × [t0, tf], R), the optimal value function
can be shown to satisfy

\[
0 = -\nabla_t V^*(x, t) - \inf_{u \in U} \left\{ L(x, u, t) + \nabla_x V^{*\top}(x, t)\, f(x, u, t) \right\},
\]
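For the special case of linear dynamics and quadratic costs (a standard result; the matrices below are illustrative assumptions), the quadratic ansatz V∗(x) = xᵀPx reduces the time-invariant, infinite-horizon form of this equation to the algebraic Riccati equation AᵀP + PA − PBR⁻¹BᵀP + Q = 0, which can be verified numerically:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative LQ problem: xdot = A x + B u, cost integrand x'Qx + u'Ru.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

# V*(x) = x' P x satisfies the infinite-horizon HJB iff P solves the ARE.
P = solve_continuous_are(A, B, Q, R)

# Riccati residual A'P + PA - P B R^{-1} B' P + Q (vanishes up to round-off),
# and the associated optimal feedback u* = -K x with K = R^{-1} B' P.
residual = A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q
K = np.linalg.solve(R, B.T @ P)
```

For general nonlinear systems no such closed-form reduction exists, which is what motivates the approximate dynamic programming methods developed in the later chapters.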