
Combining Reinforcement Learning And Optimal Control For The Control Of Nonlinear Dynamical Systems

2016, PhD Thesis


Combining Reinforcement Learning And Optimal Control For The Control Of Nonlinear Dynamical Systems

Ekaterina Abramova
Department of Computing, Imperial College London

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of Imperial College London and the Diploma of Imperial College London, May 2015

Abstract

This thesis presents a novel hierarchical learning framework, Reinforcement Learning Optimal Control, for controlling nonlinear dynamical systems with continuous states and actions. The adopted approach mimics the neural computations that allow our brain to bridge the divide between symbolic action selection and low-level actuation control by operating at two levels of abstraction. First, current findings demonstrate that at the level of limb coordination human behaviour is explained by linear optimal feedback control theory, where cost functions match the energy and timing constraints of tasks. Second, humans learn cognitive tasks involving symbolic-level action selection, in terms of both model-free and model-based reinforcement learning algorithms. We postulate that the ease with which humans learn complex nonlinear tasks arises from combining these two levels of abstraction. The Reinforcement Learning Optimal Control framework learns the local task dynamics from naive experience using an expectation maximization algorithm for the estimation of linear dynamical systems, and forms locally optimal Linear Quadratic Regulators producing continuous low-level control. A high-level reinforcement learning agent uses these available controllers as actions and learns how to combine them in state space while maximising a long-term reward. The optimal control costs form the training signals for the high-level symbolic learner. The algorithm demonstrates that a small number of locally optimal linear controllers can be combined in a smart way to solve global nonlinear control problems, and forms a proof of principle for how the brain may bridge the divide between low-level continuous control and high-level symbolic action selection. It competes in terms of computational cost and solution quality with state-of-the-art control, which is illustrated with solutions to benchmark problems.

Academic Outcomes

The initial findings (Chapter 3) were published in the proceedings of the Imperial College Computing Students Workshop (ICCSW, 2011) [3] and the final findings (Chapter 4) were published in the proceedings of the European Workshop on Reinforcement Learning (EWRL, 2012) [198]. The work of Chapter 5 was presented at the Champalimaud Neuroscience Symposium (CNS, 2012) and Computational and Systems Neuroscience (COSYNE, 2013), was then followed by an oral presentation at the International Cognitive Robotics Workshop (ICRW, 2014), with an invitation for a publication in the Journal of Artificial Intelligence, and is currently in the process of submission to a journal. Further awards: the 'Qualcomm Travel Award' – COSYNE 2013; the 'Distinguished Poster Award' – Imperial Symposium 2014; 3rd place – Google Poster Competition 2014; and the 'Best Oral Presentation' – International Cognitive Robotics Workshop 2014.

Ekaterina Abramova, May 2015

Acknowledgements

I would like to thank my supervisor Dr. Aldo A. Faisal for everything which he has taught me and for many hours of highly interesting discussions, from which I have learnt things I had not previously considered. I'd like to extend my special gratitude to my 2nd supervisor Dr. Daniel Kuhn, post-doc Dr. Luke Dickens and to Dr. Marc Deisenroth, Dr. Guy-Bart Stan, Prof. Ammar Kherraz, Dr. Wolfram Wiesemann, Dr. Aivar Sootla, Dr. Jennifer Siggers, Dr. Simon Schultz, Dr. Anil Bharath and Dr. Mariea Brady for their support.

The foundations of my knowledge were seeded in Moscow, School N811, Russia. I would like to extend my gratitude to all of my teachers, whose guidance I was so lucky to have throughout my academic years: Biology – Semenova Svetlana Alexandrovna (Head of Curriculum), Zhuk Nadezhda Vladimirovna; Chemistry – Roslyakova Svetlana Borisovna; Computer Science – Shishkova Nadezhda Alekseevna; Director – Tylina Zinaida Petrovna; English Language; Class Leader – Sineva Zinaida Ivanovna; Geography – Volkova Nadezhda Vladimirovna; History – Lifanova Natalia Nikolaevna; Mathematics – Kisileva Valentina Pavlovna; Physical Education – Markina Elena Alexandrovna, Titova Natalia Romanova; Physics – Sahno Valentina Arkadievna; Primary school teachers – Klimova Svetlana Grigorievna, Parkina Vera Andreevna; Russian Language and Literature – Kalmykova Lubov Egorovna.

I would like to thank all of the lecturers who taught me during my undergraduate degree at Loughborough University: Biochemistry – Dr. Christie, Dr. Dart, Prof. Jones, Dr. Salt, Dr. Lucas; Forensic Analysis and Analytical Chemistry – Dr. Dart, Dr. Edmonds, Dr. Mastana, Dr. Miller, Dr. Reid, Dr. Sharp, Prof. Smith, Dr. Sutherland; Inorganic Chemistry – Dr. Dann (Mentor), Dr. Elsegood, Dr. Kelly, Dr. Smith, Prof. McKee; Mathematics for Chemistry – Dr. Lewis; Organic Chemistry – Dr. Allin, Prof. Bowman, Dr. Cristie, Dr. Edgar, Prof. Jones, Prof. Page, Dr. Pritchard, Dr. Weaver; Physical Chemistry – Prof. Fletcher, Dr. Marken (Quantum Mechanics, Dissertation Supervisor), Dr. Mortimer, Prof. Warwick, Dr. Worral.

I also extend my sincere gratitude to all of the lecturers who taught me during my postgraduate degree at Birkbeck College London: Applied Statistics and Econometrics – Prof. Smith; Financial Econometrics – Prof. Psaradakis; Financial Markets – Dr. Hori; Mathematical Finance – Dr. Wright; Mathematical Methods; Pricing – Prof. Brummelhuis; Programming in C++ – Prof. Mitton; Quantitative Techniques – Dr. Adcock; Risk – Dr. Hubbert.

I would also like to thank the Department of Computing of Imperial College London, especially Dr. Amani El-Kholy, Prof. Marek Sergot, Dr. Krysia Broda, Dr. William Knottenbelt and Prof. Berc Rustem for their continued guidance, as well as the Department of Bioengineering and other members of the Faisal Lab: Alessandro Ticchi, Ali Neishabouri, Alan Zucconi, Anastasia Sylaidi, Andreas Thomik, Ariadne Whitby, Chin-Hsuan Lin, Constantinos Gavriel, Feryal Mehraban Pour, Rajeshwari Iyer, Scott Taylor, Tomaso Muzzu and William Abbott; and the Imperial College London staff members Britta Ross, Silvana Zappacosta and Duncan White, as well as everybody else who made my PhD a wonderful experience. I am also grateful to the EPSRC and the Department of Computing for their financial support.

I would like to extend my special gratitude to the people who I am very close to and without whose help I would not have been able to achieve my goals: Natasha Uvanova, Ulija Xvostova, Angela Aguele, Uliana Kmitovska, Samantha Daniel, Glen Williams and Nikolai Jangaev.

This thesis is dedicated to my family, without whose support, encouragement and love I would not have been able to achieve this: my mum Abramova Galina Geroldovna, my granddad Apazov Gerold Dimitrievich, my grandmum Apazova Ludmila Vladimirovna and my aunt Apazova Natalya Geroldovna.
Declaration

I hereby certify that all material in this dissertation is my own original work, except where explicitly acknowledged. The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any re-use or re-distribution, researchers must make clear to others the licence terms of this work.

Ekaterina Abramova, May 2015

Nomenclature

θ – Angle (joint / pole) ∈ R
π – Angle measurement in radians (equivalent to 180°) ∈ R
θ̈ – Angular acceleration ∈ R
θ̇ – Angular velocity ∈ R
ē_d – Average error of dimension d across trials ∈ R
µ_ē_d – Average of the average error of dimension d across trials ∈ R
z – Cart location on the track ∈ R
C(·) – Centripetal and Coriolis forces vector
J_ilqr – Cost of controlling a system with an iLQR benchmark algorithm ∈ R≥0
J_lqr – Cost of controlling a system with an LQR benchmark algorithm ∈ R≥0
J_rloc – Cost of controlling the system with the RLOC learnt policy ∈ R≥0
B_p(i) – DIPC friction constant of link i = 1, 2 ∈ R
L(i) – DIPC length of link i = 1, 2 ∈ R
l(i) – DIPC length of midpoint of link i = 1, 2 ∈ R
m_c – DIPC mass of cart ∈ R
m(i) – DIPC mass of link i = 1, 2 ∈ R
e_d(i) – Error for dimension d for symbolic starting state (i) ∈ R
F – Force applied to the cart ∈ R
B – Friction vector/matrix
C(k)(δx(k), δu(k)) – iLQG constant of linearisation of nonlinear dynamics with noise ∈ R^(ℓ×p)
c_F(j,k) – iLQG constant of linearisation: jth column of matrix F[j] at time step k ∈ R^ℓ
C_u(j,k) – iLQG constant of linearisation: derivative of the jth column of matrix F[j] at time step k with respect to the control ∈ R^(ℓ×m)
C_x(j,k) – iLQG constant of linearisation: derivative of the jth column of matrix F[j] at time step k with respect to the state ∈ R^(ℓ×ℓ)
a_ilqg(δu(k), δx(k)) – iLQG control-dependent cost-to-go term ∈ R
G(k) – iLQG cost-to-go parameter vector term ∈ R^(m×ℓ)
H(k) – iLQG cost-to-go parameter vector term ∈ R^(m×m)
g(k) – iLQG cost-to-go parameter vector term ∈ R^m
S(k) – iLQG cost-to-go parameter, matrix term ∈ R^(ℓ×ℓ)
s(k) – iLQG cost-to-go parameter, scalar term ∈ R
s(k) – iLQG cost-to-go parameter, vector term ∈ R^ℓ
r(k) – iLQG derivative of the instantaneous discrete cost at time step k with respect to the control over ∆ ∈ R^m
q(k) – iLQG derivative of the instantaneous discrete cost at time step k with respect to the state over ∆ ∈ R^ℓ
F(x(t), u(t)) – iLQG diffusion coefficient of the nonlinear dynamical system with noise ∈ R^(ℓ×p)
F[j] – iLQG jth column of the diffusion coefficient matrix ∈ R^ℓ
f(x(t), u(t)) – iLQG drift term of the nonlinear dynamical system with noise ∈ R^ℓ
δu(k) – iLQG incremental improvement of the control vector ∈ R^m
δx(k) – iLQG incremental improvement of the state vector ∈ R^ℓ
h(x(T)) – iLQG instantaneous continuous cost at time T (final cost) ∈ R≥0
q(k) – iLQG instantaneous discrete cost at time step k over ∆ ∈ R≥0
H̃ – iLQG matrix similar to H with only positive eigenvalues
P(k) – iLQG second derivative of the instantaneous discrete cost at time step k with respect to the control and state over ∆ ∈ R^(m×ℓ)
λ_min(H) – iLQG minimum eigenvalue of H
J̄_ilqr – iLQR average cost for controlling the system to target, taken across symbolic starting states ∈ R≥0
ū⁰_ilqr – iLQR initial control trajectory ∈ R^(m×nK)
λ_ilqr^init – iLQR initial value of the Levenberg-Marquardt lambda ∈ R
λ̃_ilqr – iLQR Levenberg-Marquardt lambda multiplication factor ∈ R
δ_ilqr^tol – iLQR tolerance on improvement ∈ R
n_ilqr^tol – iLQR tolerance on iteration number ∈ R
λ_ilqr^max – iLQR Levenberg-Marquardt maximum allowed lambda ∈ R
M(·) – Inertia matrix
L – Lagrangian function specifying system dynamics
z(k) – LDSi hidden state ∈ R^ℓ
δ_ldsi^tol – LDSi likelihood tolerance ∈ N+
C_ldsi – LDSi matrix acting upon the hidden state ∈ R^(p×ℓ)
v(k) – LDSi noise for the observed output, ∼ N(0, Σ_v)
w(k) – LDSi noise for the state, ∼ N(0, Σ_w)
n_ldsi^c – LDSi number of cycles ∈ N+
Σ_v – LDSi output noise variance ∈ R^p
Σ_w – LDSi state noise covariance matrix ∈ R^(ℓ×ℓ)
l(·) – Length of a rod (e.g. link 1 l_1, link 2 l_2) ∈ R>0
L – LQR finite horizon feedback gain matrix ∈ R^(m×ℓ×nK)
n_L – LQR total number of feedback gains L∞ / linearisation points x_0 used in a study ∈ N+
c – OC (actuator / low-level) constant of linearisation ∈ R^ℓ
u – OC (actuator / low-level) continuous control vector ∈ R^m
ẋ – OC (actuator / low-level) continuous dynamics ∈ R^ℓ
x – OC (actuator / low-level) continuous state vector ∈ R^ℓ
x̃ – OC (actuator / low-level) continuous state vector in a shifted coordinate system ∈ R^ℓ
l(t, x, u) – OC (actuator / low-level) control instantaneous cost ∈ R≥0
u_0(i) – OC (actuator / low-level) control space linearisation point, i = 1, ..., n_a ∈ R^m
π(x) – OC (actuator / low-level) deterministic control law/policy ∈ U(x)
m – OC (actuator / low-level) dimensionality of the control vector ∈ N+
ℓ – OC (actuator / low-level) dimensionality of the state vector ∈ N+
∆ – OC (actuator / low-level) discretisation time step ∈ R>0
L(V(k), k) – OC (actuator / low-level) finite time horizon feedback gain matrix ∈ R^(m×ℓ×nK)
L∞ – OC (actuator / low-level) infinite horizon feedback gain matrix ∈ R^(m×ℓ)
cost(x, u) – OC (actuator / low-level) instantaneous cost obtained as a result of applying control u in state x ∈ R≥0
ζ(i) = (x_0(i), u_0(i)) – OC (actuator / low-level) linearisation centre / point
u_max – OC (actuator / low-level) maximum admissible control signal size ∈ R
u_min – OC (actuator / low-level) minimum admissible control signal size ∈ R
x* – OC (actuator / low-level) nonzero target state ∈ R^ℓ
ñ_K – OC (actuator / low-level) number of time steps for calculation of the infinite horizon feedback gain matrix ∈ N+
J(x̃(k), u(k)) – OC (actuator / low-level) objective function ∈ R≥0
v(x̃(k), k) – OC (actuator / low-level) optimal cost-to-go function of a single period ∈ R≥0
Z – OC (actuator / low-level) real symmetric positive definite input (control) weight matrix ∈ R^(m×m)
V(k) – OC (actuator / low-level) real symmetric positive semi-definite square matrix to be determined ∈ R^(ℓ×ℓ)
W – OC (actuator / low-level) real symmetric positive semi-definite state weight matrix ∈ R^(ℓ×ℓ)
U(x) – OC (actuator / low-level) set of admissible controls, i.e. control space
next(x, u) – OC (actuator / low-level) state obtained as a result of applying control u in state x ∈ R^ℓ
x_0(i) – OC (actuator / low-level) state space linearisation point, i = 1, ..., n_a ∈ R^ℓ
A(x(t)) – OC (actuator / low-level) time-invariant, state dependent system control matrix ∈ R^(ℓ×ℓ)
B – OC (actuator / low-level) time-invariant, state independent control gain matrix ∈ R^(ℓ×m)
A – OC (actuator / low-level) time-invariant, state independent system control matrix ∈ R^(ℓ×ℓ)
B(x(t), t) – OC (actuator / low-level) time-varying, state dependent control gain matrix ∈ R^(ℓ×m)
A(x(t), t) – OC (actuator / low-level) time-varying, state dependent system control matrix ∈ R^(ℓ×ℓ)
B(x(t)) – OC (actuator / low-level) time-varying, state independent control gain matrix ∈ R^(ℓ×m)
B(t) – OC (actuator / low-level) time-varying, state independent control gain matrix ∈ R^(ℓ×m)
A(t) – OC (actuator / low-level) time-varying, state independent system control matrix ∈ R^(ℓ×ℓ)
n_K – OC (actuator / low-level) total number of simulation time steps ∈ N+
T – OC (actuator / low-level) total simulation time ∈ N+
ū – OC (actuator / low-level) trajectory of the continuous control signal ∈ R^(m×(nK−1))
ū* – OC (actuator / low-level) trajectory of the continuous optimal control signal ∈ R^(m×(nK−1))
x̄ – OC (actuator / low-level) trajectory of the continuous state ∈ R^(ℓ×nK)
x̃* – OC (actuator / low-level) zero target state of a dynamical system in a shifted coordinate space ∈ R^ℓ
X – OC (actuator / low-level) set of admissible states, i.e. state space
* – Optimality / Target (used interchangeably)
K_d – PID derivative gain parameter ∈ R^(m×ℓ)
e(k) – PID error variable ∈ R^ℓ
K_i – PID integral gain parameter ∈ R^(m×ℓ)
y(k) – PID output ∈ R^ℓ
K_p – PID proportional gain parameter ∈ R^(m×ℓ)
Pr – Probability
τ(j) – RL jth epoch ∈ N+
a(i)(t) – RL (symbolic / high-level) action at RL time step t ∈ N+
r(t) – RL (symbolic / high-level) immediate reward at RL time step t ∈ R
n_a – RL (symbolic / high-level) number of actions / linearisation points in state space ∈ N+
n_ℓ – RL (symbolic / high-level) number of evaluated dimensions of the state vector ∈ N+
N_RL – RL (symbolic / high-level) number of state-action pairs to be investigated by MC at least once ∈ N+
n_s – RL (symbolic / high-level) number of states ∈ N+
n_z – RL (symbolic / high-level) number of segments each dimension is discretised into ∈ N+
s(i)(t) – RL (symbolic / high-level) state at RL time step t ∈ N+
A – RL (symbolic / high-level) action space
R – RL (symbolic / high-level) reward function
S – RL (symbolic / high-level) state space
π(s, a) – RL (symbolic / high-level) stochastic control policy ∈ R^(ns×na)
T – RL (symbolic / high-level) transition probability
Q^π(s, a) – RL action value function ∈ R^(ns×na)
ε_i – RL degree of exploration parameter at the ith visit to (s, a)
γ – RL discount factor ∈ R
α_i – RL learning parameter at the ith visit to (s, a)
ν – RL rate of exploration parameter decay
µ – RL rate of learning parameter decay
R – RL return ∈ R
n_T – RL total number of epochs in a single learning trial ∈ N+
n_T(j) – RL total number of time steps (i.e. total number of triplets {r_t+1, s_t, a_t}) recorded in the jth epoch ∈ N+
ψ(a) – RLOC control mapping A → R^m
ω(x(k), k) – RLOC control policy R^ℓ × N → R^m
c(i) – RLOC controller ∈ R≥0
φ(x(k)) – RLOC feature mapping R^ℓ → S
M(i) – RLOC linear model of the system
D – RLOC naive control initiation set {0, 1, −1}
b – RLOC naive control rate of decay ∈ N+
U – RLOC naive control trajectories tensor ∈ R^(m×(nK−1)×nc)
ū(i,0) – RLOC naive control trajectory i ∈ R^(m×(nK−1))
s̃(i) – RLOC naive starting state (i) ∈ N+
X – RLOC naive state trajectories tensor ∈ R^(ℓ×nK×nN)
n_st – RLOC number of actuator time steps spent in RL state s_t ∈ N
n_lqr – RLOC number of LQR feedback gain matrices (n_lqr = n_a − n_pid) ∈ N+
n_c – RLOC number of naive control trajectories ∈ N+
n_N – RLOC number of naive state trajectories ∈ N+
n_pid – RLOC number of PID controllers (n_pid = n_a − n_lqr) ∈ N+
RCD – RLOC relative cost difference measure ∈ R≥0
δ_θ – RLOC tolerance on sub-sampling of naive trajectories for angle dimensions ∈ R>0
δ_θ̇ – RLOC tolerance on sub-sampling of naive trajectories for angular velocity dimensions ∈ R>0
n_s̃ – RLOC total number of naive starting states ∈ N+
ε – Standard normal random variable vector, ∼ N(0, I_p) ∈ R^p
dW_t – Stochastic differential into the future, ∼ N(0, dt·I_p) ∈ R^p
τ – System control signal ∈ R^m

Table of contents

List of figures
List of tables
1 Introduction
1.1 Motivation
1.2 Early Optimal Control
1.2.1 Calculus of Variations
1.2.2 Random Processes
1.2.3 Linear and Nonlinear Programming
1.2.4 Classical Control
1.3 Modern Optimal Control
1.3.1 The Simplest Methods – Shooting Algorithms
1.3.2 From Discrete to Continuous Optimal Control
1.3.3 From Dynamic Programming to Reinforcement Learning and MDPs
1.4 State-of-the-Art Nonlinear Control
1.4.1 Model-Based Algorithms
1.4.2 Model-Free Algorithms
1.4.3 Hybrid Algorithms
1.5 How To Do Better?
2 Mathematical Theory
2.1 Linearisation and Discretisation of Nonlinear Systems
2.1.1 Time-Invariant Nonlinear Dynamical Systems
2.1.2 Studied Time-Invariant Nonlinear Dynamical System
2.2 Linear Quadratic Regulator
2.2.1 Studied Time-Invariant Linear Dynamical System
2.2.2 Infinite Horizon, Nonzero Target Control Solution
2.3 Iterative Linear Quadratic Gaussian
2.3.1 Local Linear Quadratic Gaussian Approximation
2.3.2 Computing The Cost-To-Go Function
2.3.3 Computing The Control Law
2.4 Benchmark Dynamical Systems
2.4.1 Planar 2-Link Model of the Human Arm
2.4.2 Free Swinging Pendulum on a Moving Cart
3 Reinforcement Learning Optimal Control
3.1 Introduction
3.1.1 Engineering Motivation
3.1.2 Neurobiological Motivation
3.2 Architectural Design
3.2.1 Learning and Controlling In RLOC Framework
3.2.2 Mapping Optimal Control Problem into Markov Process Framework
3.3 Combining Reinforcement Learning and Optimal Control
3.3.1 Top-Down Hierarchical Design
3.3.2 High-Level Controller Design
3.3.3 Low-Level Controller Design
3.4 Summary
4 Heterogeneous, Hierarchical Control of Systems with Known Dynamics
4.1 Introduction
4.2 Methods
4.2.1 Low- and High-Level Controller Specification
4.2.2 Heterogeneous Reinforcement Learning Optimal Control Algorithm
4.2.3 Benchmark Experiments
4.2.4 Study Of The LQR Feedback Gain Matrices – Robotic Arm, Cart Pole
4.3 Results
4.4 Discussion
5 Reinforcement Learning Optimal Control for Unknown Dynamical Systems
5.1 Introduction
5.2 Methods
5.2.1 Naive State-Control Trajectory Formation
5.2.2 Low-Level Controller Design
5.3 Results
5.4 Discussion
6 General Discussion
References
Appendix A Further Work
A.1 Dynamical System – Double Inverted Pendulum Cart
A.2 Physical Robotic Arm
A.3 Dynamical System – Simple Example
A.4 Linear Bellman Combination for Control
Appendix B Additional Results
B.1 Robotic Arm (Unknown Dynamics) – Supplement Ch 5
Appendix C Additional Information
C.1 Muscle Synergies

List of figures

1.1 Example Optimal Control Problems
1.2 Closed Loop Control Frameworks
1.3 Open vs. Closed Loop Control Systems
1.4 Agent-Environment Interaction in Reinforcement Learning
2.1 Robotic Arm System
2.2 Cart Pole System
2.3 Total Energy of the Cart Pole System
3.1 RLOC Conceptual Overview
3.2 Hierarchical Organisation of Action Selection
3.3 RLOC Architecture
3.4 Arbitrary Problem Design
3.5 Example State-Action Mapping
3.6 Epoch Sample Path
3.7 Overview of a PID Controller in Discrete Time Setting
4.1 System control gain matrices of the Robotic Arm and the Cart Pole tasks
4.2 Feature Mapping in RLOC Framework for Benchmark Tasks
4.3 Control and State Trajectories Produced for nL Linearisation Points – Robotic Arm
4.4 Control Trajectories Produced for nL Linearisation Points Grouped by Similarity – Cart Pole
4.5 Relative Cost Difference Results, RLOC vs. LQR/iLQR – Robotic Arm
4.6 Learning Curves – Robotic Arm
4.7 Relative Cost Difference Results, RLOC vs. LQR – Cart Pole
4.8 Learning Curves – Cart Pole
5.1 Naive Trajectories Sub-Sampling – Cart Pole
5.2 RLOC Performance vs. Number Of Controllers – Cart Pole
5.3 Tukey boxplot of RLOC performance for varied number of controls – Cart Pole
5.4 Histograms of RLOC Performance – Cart Pole
5.5 High-Level Reinforcement Learning Agent, Learning Curve (na = 8) – Cart Pole
5.6 Value Function – Cart Pole
5.7 Action Sequences of an Example Learnt Policy – Cart Pole
5.8 Example State Trajectories – Cart Pole
A.1 Double Inverted Pendulum on Cart System
A.2 Total Energy for Double Inverted Pendulum System
A.3 Physical Air Muscle Robotic Arm
A.4 Simple example – system control matrix entries
A.5 Simple example – control matrix entries
B.1 Value Function – Robotic Arm
B.2 Action Sequences of an Example Learnt Policy – Robotic Arm
B.3 Example State Trajectories – Robotic Arm

List of tables

2.1 Robotic Arm Simulation Parameters
2.2 Cart Pole Simulation Parameters
3.1 Comparing Reinforcement Learning and Optimal Control
4.1 PID Parameters of RLOC Simulation
4.2 Pre-Defined (Fixed) Linearisation Centers of LQR Controllers in RLOC Simulation
4.3 ε-Greedy On-Policy Monte Carlo Learning Parameters
4.4 Simulation Parameters – RLOC
4.5 Simulation Parameters – LQR / iLQR
5.1 Parameters for Linear Model Inference
5.2 Simulation Parameters – RLOC (unknown dynamics)
A.1 Double Inverted Pendulum Cart Simulation Parameters
B.1 Randomly Selected LQR Linearisation Centers for an RLOC Simulation

Chapter 1
Introduction

"Nothing at all takes place in the universe in which some rule of maximum or minimum does not appear." – L. Euler

Obtaining the best possible outcome under given conditions forms the principle of optimality, which is deeply ingrained in the laws of the Universe as we know it. Consider that over millions of years nature has essentially been performing a process of optimisation, known to us as evolution, producing ever more sophisticated physical systems able to operate under large internal fluctuations, noise and unpredictable changes in the environment. The principles of optimality form the foundation of numerous important fields of study, such as Optimal Control (OC) and Reinforcement Learning (RL), which provide a principled approach to choosing optimal actions in multi-stage decision processes.

Optimal control theory has developed over the last 400 years; it is a mathematical optimisation method that aims to find a control policy able to manipulate a dynamical system optimally over time in order to achieve a desired target state of the system (reference value). Typically, solving an optimal control problem requires the minimisation of a cost function (performance index) associated with the system of interest, which depends both on the state and control variables, and leads to continuous control of the continuously evolving system state. Some examples of applications are: vehicle fuel consumption [221], energy management of electric trains [7], minimisation of trading execution costs of a portfolio [36], waste recycling in a production line [179], and applications to quantum mechanics and quantum computing [60], while recent findings showed optimal feedback control to accurately predict the output of the human sensorimotor system [223] (see Figure 1.1 for further examples).

In the late 1980s a branch of Artificial Intelligence called reinforcement learning emerged from three separately existing fields – the psychology of animal trial-by-trial learning, the dynamic programming/value functions of optimal control theory, and temporal-difference methods.
It represents a computational approach to automating goal-directed learning by interacting with the world using decisions, interpreting their outcomes and maximising an observed reward signal. An appropriate discretisation of the system state is typically required in order to learn a discrete global state-to-control mapping using environment-generated reward signals. Reinforcement learning is able to solve tasks with nonlinear, unknown, stochastic dynamics, with examples of practical applications in robotics (active sensing [153], quadrupedal robot locomotion [142], ball grasping [86]), control (autonomous helicopter flight [20]), operations research (vehicle pricing [211], targeted marketing [2], autonomous driving [177]), games (backgammon [251], solitaire [285], chess [23], checkers [216]) and economics [181]. Other applications include: simulated robot soccer [236], an agent becoming part of a virtual world community [121] and optimising dialogue policy design in spoken dialogue systems [227].

Fig. 1.1 Example Optimal Control Problems. (a) Sensorimotor control: from an infinite number of possible movement trajectories, humans and animals show stereotyped trajectories with smooth velocity profiles [110], and the control of redundant biomechanical systems has been shown to fall under the optimal control framework [262]. (b) Manufacturing: processes such as manufacturing [5], assembly line [38] and factory floor control [269] require optimality in decisions on time consumption, tools/materials inventory, worker assignment and other factors for profit maximisation. (c) Optimal drug delivery: optimal control of drug delivery is essential for the currently crude/ad-hoc methods of anaesthetic drug administration during operations [169, 182] and insulin injections for patients with diabetes [39]. (d) Prosthetics: the application of suitable control algorithms allowing smooth (optimal) movements of prosthetic devices can enable ease of mobility for impaired individuals [196, 215]. (e) Piloting; steering: optimal control algorithms are applied to aerospace control problems such as those encountered in piloting: steering [191], landing [159], trajectory planning [208, 230], and in military applications such as missile tracking (e.g. finding the rocket thrust trajectory that achieves maximum altitude) [118]. (f) Space exploration: space exploration and the construction/repair of international space stations require optimal actions of robotic devices.

Even though control theory has been a subject of intense study, solving complex control problems, for example those involving nonlinear dynamics, remains a challenge [263]. The key aim of this thesis is to combine the best of both worlds of reinforcement learning and optimal control (see Figure 1.2 for an overview of each control framework) in order to exploit their respective advantages while mitigating the known disadvantages, so as to find a practical, efficient algorithm for robust numerical approximation of nonlinear control problems, using continuous control of the continuous state evolution.

1.1 Motivation

Nonlinear optimal control problems represent the most difficult area of control theory and involve complex, often unknown, dynamics. The analytical solutions for nonlinear systems with known/approximated dynamics are based on the Hamilton-Jacobi-Bellman partial differential equation and are often limited to stylised problems of little practical relevance [44, 200].
The lack of availability of analytical solutions led to the development of iterative, yet often nonadaptive, methods which find approximately optimal policies while trying to keep the computational costs low. Such state-of-the-art methods frequently require partial or full knowledge of the system's dynamics, tend to carry a large computational cost associated with the repetitive calculations, and may require heuristic tuning of the parameters to achieve convergence. Even though optimal control has been a subject of intense study, the numerical solution of complex nonlinear optimal control problems, such as those encountered in air/ground traffic control, train scheduling, fluid dynamics, robotic manipulation, supply chain management, missile systems and other engineering areas (see Figure 1.1), remains a fundamental challenge [134, 263]. The equations describing the evolution of state with time may be difficult to formulate for systems which are not well characterised, i.e. systems whose dynamics may be subject to frequent systematic changes. Such cases require either making linear assumptions or attempting to estimate whole nonlinear functions [96] by using, for example, Least Squares Estimation [92], Maximum Likelihood Estimation [165], Expectation Maximisation techniques [95], and others based on Machine Learning methods and Finite Difference methods [180].

The difficulties in dealing with nonlinear systems may be overcome using reinforcement learning techniques. Together with supervised and unsupervised learning, reinforcement learning forms one of the three pillars of machine learning [37]. Supervised learning trains the system using known input-output pairs, where regression is used for continuous data (e.g. house price prediction), while classification is used for discrete data (e.g. digit recognition). Unsupervised learning uses unlabelled data and therefore has no teacher/error signal. Such tasks may include clustering, finding hidden structure in data, density estimation determining the distribution of data within the input space, and dimensionality reduction for purposes of visualisation of high dimensional data. Reinforcement learning differs from both of the above pillars, since it is neither given examples of optimal actions, nor is it trying to find an underlying data structure. Rather, it learns optimal behavioural strategies on a trial-by-trial basis, while interacting directly with an environment yielding immediate rewards, with the sole goal of maximising the long term reward it obtains from the system. Advantageously, RL is able to deal with both linear and nonlinear systems, which can have known or unknown dynamics.

In order to solve a task using reinforcement learning, the task is typically formulated as a Markov Decision Process – a discrete state, discrete action, probabilistic state transition system which satisfies the Markov Property, the requirement that each consecutive state and reward depend only on the previous state and action. RL is a versatile field of study, offering a variety of algorithms that can be applied to systems, with the three most important elementary solutions: Dynamic Programming, Monte Carlo
Monte Carlo methods are preferred for semi-Markovian tasks with unknown system dynamics, since they have been shown to be more robust than other standard RL methods [239]. For Markovian tasks with unknown dynamics the Temporal Difference methods tend to be the method of choice, as they represent a combination of Monte Carlo and Dynamic Programming ideas in a single algorithm. A control problem in an RL framework represents finding an optimal policy and is therefore a problem of learning an optimal state-to-action mapping. Usually however, this involves discretising the system’s state space into a number of symbolic states and applying discrete actions while learning this mapping. The act of discretisation leads to what is known as the curse of dimensionality – an exponential growth in the complexity of a problem as the dimensionality of the task increases, making the problem intractable [26]. Consider an RL framework with nℓ evaluated dimensions each discretised into nz equal segments, then the number of states to be investigated grows exponentially as ns = nz nℓ . Each state needs to be investigated using all possible actions, na , therefore for each state-action pair to be visited at least once the overall "cost" is NRL = na nz nℓ number of evaluations [148]. In practice only up to six discretised dimensions (nℓ = 6) are computationally feasible, even if highly optimised methods are used. (a) Reinforcement Learning: Maximise Reward. (b) Optimal Control: Minimise Cost. Fig. 1.2 Closed Loop Control Frameworks. Both reinforcement learning and optimal control frameworks operate in a closed loop feedback system. (a) Reinforcement learning involves a high-level agent that makes decisions on choosing a discrete action, a, applied to the world advancing it to a next discrete state, s, giving an observable immediate reward, r. Both of these are fed back into the agent, which evaluates the effect of its chosen actions on the world for maximisation of a long term reward (aggregated, discounted future rewards) and learns appropriate state-action mapping. (b) Optimal control uses an analytical solution for linear systems to calculate continuous control signal u, which is applied to a plant in its current state x advancing it to the next continuous state x′ and producing a continuous cost, l. Both the cost and the new state are fed back into the system, which in turn applies the necessary control given the current state, in order to achieve the desired reference point (goal state). 1.2 Early Optimal Control 1.2 5 Early Optimal Control Optimal control is a global parameter optimisation approach to finding a controller for Multi-Input Multi-Output (MIMO) systems based on the minimisation of a cost function associated with the system. Optimal control originates from various disciplines: 1) Calculus of Variations, 2) Random Processes, 3) Linear and Nonlinear Programming and 4) Classical Control [46], which are discussed below. 1.2.1 Calculus of Variations The Calculus of Variations involves minimising/maximising a function, that takes in a vector and returns a scalar (functional), over a space of all possible state trajectories. It is therefore a subject of determining functions that optimize a function of a function (hence optimal control is sometimes termed functional optimization), unlike ordinary calculus where the objective is to determine points that optimize a function. Any problem that can be solved using this calculus is called variational problem. 
In the case of optimal control, the functional is a performance index (cost function) which is minimised with respect to a control signal. The calculus of variations began with the brachistochrone problem posed by Galileo Galilei in 1638. It concerns finding the path of quickest descent between two points (i.e. an explicit minimum time problem), which Johann Bernoulli solved in 1697 [93]. In 1744 Leonhard Euler formulated the problem in general terms and derived the first-order necessary condition in his paper "The Method of Finding Curves that Show Some Property of Maximum or Minimum" [81], which laid the foundation of the Calculus of Variations.

First Order Variation. Up until 1755 the solutions to the Euler equation were mostly geometrical [101], at which point Jean Lagrange developed an analytical solution based on variations (perturbations to the performance index, and hence to the optimal state trajectory) and the method of multipliers, which leads directly to Euler's necessary condition¹; he published his work in 1762, which gave the subject its name.

Second Order Variation. Adrien Legendre studied the second-order variation in 1786 and discovered the second-order necessary condition, which he applied to scalar problems [155]. This was extended to vector spaces by Alfred Clebsch [53], forming the Legendre-Clebsch condition that the Euler-Lagrange equation must satisfy². Rowan Hamilton reformulated the necessary conditions in his published work on least action in mechanical systems [107], expressing his principle as a pair of differential equations.

Hamilton-Jacobi Equation. In 1838 Karl Jacobi showed that only one equation in the partial derivatives of a performance index w.r.t. each state was necessary, resulting in the Hamilton-Jacobi equation [202]. Karl Weierstrass defined a condition using an excess function [276] while considering strong variations, which served as a precursor to Bellman's and Pontryagin's maximum principle.

¹ This led to the Euler-Lagrange Equation based on first-order variations – a second-order partial differential equation whose solutions are the functions for which a given functional is stationary.
² The Legendre-Clebsch condition requires the second partial derivative of the performance index w.r.t. the state to be nonnegative definite.

1.2.2 Random Processes

In 1828 an English botanist, Robert Brown, noticed that plant pollen suspended in still water exhibited a perpetual, irregular, swarming motion, which puzzled scientists for decades. This was inadvertently solved by Albert Einstein when, shortly following his dissertation on the statistical molecular theory of liquids in 1905, he published a paper in which he applied the molecular theory of heat to liquids. He predicted that the random motions of molecules in a liquid impacting on larger suspended particles would result in an irregular, random motion directly observable under a microscope, confirming exactly what Brown saw, and terming it Brownian motion. This theory was used by Norbert Wiener to form generalised harmonic analysis [280], which in 1949 he applied to an optimal filter problem [281].

Linear Quadratic Regulator. Rudolf Kalman laid the foundation of the era of modern control theory by publishing two ground-breaking papers. In 1960 he introduced an integral performance index with quadratic penalties on control size and output errors [130], which resulted in what is now known as the Linear Quadratic Regulator (LQR) algorithm [13] for solving deterministic optimal control problems.
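As a concrete illustration of the quadratic-penalty idea, the sketch below computes a steady-state (infinite-horizon) discrete-time LQR feedback gain by iterating the backward Riccati recursion until it settles. It is a hedged example only: the notation loosely follows the thesis (W and Z for the state and control weight matrices, L for the feedback gain), but the toy system matrices and the Python/NumPy formulation are illustrative assumptions, not the thesis's own implementation (the thesis's LQR, Section 2.2, additionally handles the nonzero-target case).

import numpy as np

def lqr_gain(A, B, W, Z, n_iter=500):
    """Steady-state LQR gain via repeated backward Riccati updates."""
    V = W.copy()                                              # value matrix initialised with the state cost
    for _ in range(n_iter):
        K = np.linalg.solve(Z + B.T @ V @ B, B.T @ V @ A)     # feedback gain for the current value matrix
        V = W + A.T @ V @ (A - B @ K)                         # Riccati update of the value matrix
    return K

A = np.array([[1.0, 0.1], [0.0, 1.0]])                        # toy double-integrator dynamics (illustrative)
B = np.array([[0.0], [0.1]])
W = np.eye(2)                                                 # state penalty
Z = np.array([[0.01]])                                        # control penalty
L_inf = lqr_gain(A, B, W, Z)
print("steady-state feedback gain:", L_inf)                   # control law u = -L_inf @ x

The resulting linear state feedback u = −L∞ x is what makes LQR attractive as a low-level controller: once the gain is computed off-line, applying it on-line is a single matrix-vector product per time step.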
Linear Quadratic Estimator. In 1960 Kalman extended the work of Wiener to discrete-time, finite-dimensional, time-varying linear systems disturbed by additive Gaussian white noise (system and measurement) and derived the optimal controls (also known as optimal filter gains, state-feedback gains) using the calculus of variations. He solved the backward Riccati equation³ to a steady state (i.e. zero target state) and showed that the optimal controls were linear feedbacks of the difference between estimated and actual states (estimate errors) [131]. This came to be known as the Linear Quadratic Estimator (LQE) or discrete Kalman filter; however, similar equations were independently derived earlier by Peter Swerling in 1957 [244, 245]. The continuous version of the Kalman filter was a result of Kalman's collaboration with Bucy, resulting in the Kalman-Bucy filter in 1961 [132]. Kalman was the first to propose a compact vector/matrix notation, enabling the application of these theories to multiple-input multiple-output systems and dramatically improving on the then best-to-date successive loop-closure solutions.

Linear Quadratic Gaussian. The combination of LQE theory (estimated state feedback) with LQR theory (gains under quadratic state and control penalties) resulted in a controller applicable to linear systems that minimises the expected integral value of a quadratic performance index under additive white Gaussian noise – the Linear Quadratic Gaussian (LQG) compensator⁴ [105, 127]. Worst-case design algorithms were developed to deal with the LQG controllers' sensitivity to variations in plant parameters and unmodelled dynamics, which cause robustness problems. The H-infinity, H∞, compensator [79, 287] stabilises a family of plants with an anticipated distribution of parameters within a bounded range and minimises a quadratic performance index for the worst case parameter deviations, forming a minimax problem where the plant parameters depend on the worst controller parameters and vice-versa.

³ The scalar form of the Matrix Riccati Equation for solving linear second-order two-point boundary-value problems was given by Jacopo Riccati in the 18th century.
⁴ Shown earlier using a dynamic programming approach in an econometric discipline by Herbert Simon in 1958 [226].

1.2.3 Linear and Nonlinear Programming

Mathematical Programming (Optimisation) was largely founded by Leonid Kantorovich in 1939, reported in a publication on solving a wide class of planning problems using a unified approach [133]. General (non)linear programming optimisation problems involve finding a maximum or minimum of an objective function for a system defined with equalities and inequalities (i.e. a system defined under constraints), where either the objective function and/or the system may be (non)linear. Outside of its influence on the development of optimal control theory, mathematical programming is used in mechanics (e.g. problems of rigid body dynamics), engineering (e.g. computational models of petroleum reservoirs), economics (e.g. the utility optimisation problem), operations research (e.g. factory items packaging) and control engineering (e.g. Model Predictive Control, which places explicit constraints on the calculated input signal and calculates the iterative solution to the optimisation problem online for an infinite horizon setting). The application of programming to the field of optimal control was pioneered by Harold Kuhn and Albert Tucker in the 1950s [146].
They identified the necessary condition which states that a system is required to be on a constraint boundary, i.e. the gradient of a performance index must lie inside a cone of constraint gradients. Nonlinear programming (NLP) problems are categorised into dense (a small number of variables/constraints and many nonzero derivatives of the objective and constraint functions w.r.t. the control) and sparse (a large number of variables/constraints and many zero derivatives of the objective and constraint functions w.r.t. the control). NLP problems can be solved using gradient based methods and heuristic numerical methods, and have been used to solve steady-state linear quadratic control problems through optimisation of the controller parameters [171]. A general optimal control problem can be converted to an NLP problem and solved using a direct trajectory optimisation method [109], while linear programming methods are used in stochastic optimal control problems [233].

1.2.4 Classical Control

Classical Control deals with influencing the behaviour of dynamical systems and forms a collection of ad-hoc techniques offering simplicity in numerically approximating linear, time-invariant systems using step-by-step methods. It handles well the Single-Input Single-Output (SISO) systems, in which a single output signal is controlled by one input signal. A simple form of classical control uses an open-loop input for directing a system to a desired target, which means that the output of the system is not made available to the controller and the controller has no data on how well it is performing the given task. A non-error based variation of open-loop control, termed feedforward control, accounts for any disturbances by the controller before they have a chance to affect the system. In order to overcome the limitation of open-loop control, the state resulting from applying the control at the last time step is passed back into the system, forming feedback, thus allowing the controller to make the necessary adjustments based on the observed error at each point in time in order to bring the system closer to the desired reference point, thus forming a closed-loop control system (see Figure 1.3).

Fig. 1.3 Open vs. Closed Loop Control Systems. (A) Open loop control framework – a control (input) signal is applied to a system without observing the outcome of an action (e.g. a heater warming up a room without having a temperature sensor). Any disturbances are unaccounted for in advance. (B) Feedforward control framework – an open loop control is applied to a system, while any disturbances are observed and accounted for in advance of them affecting the system (e.g. an increase in the amount of heat applied by a heater when a signal of an opened window is received). (C) Feedback control framework – a closed loop control operates on the basis of sensing deviations from a desired reference point (e.g. a continuous room temperature reading supplied to a heater, which adjusts the amount of heat applied to the system based on the sensor reading). Any disturbances are unaccounted for. (D) Feedforward and feedback control framework – a combination of both feeding back of a state error and an ability to account for any disturbances to a system before they take effect.
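The difference between the open- and closed-loop schemes of Figure 1.3 can be reproduced in a few lines. The snippet below is a hedged illustration only: the first-order plant, the constant disturbance and the proportional feedback gain are made-up values, chosen to show that feedback rejects a disturbance the open-loop input cannot.

dt, steps, reference, disturbance = 0.01, 2000, 1.0, -0.2

def simulate(controller):
    """Roll the toy plant x' = u - 0.5*x + d forward under a given control law."""
    x = 0.0
    for _ in range(steps):
        u = controller(x)
        x += dt * (u - 0.5 * x + disturbance)
    return x

def open_loop(x):
    # pre-computed input that would reach the reference if there were no disturbance
    return 0.5 * reference

def closed_loop(x):
    # same nominal input plus proportional feedback on the observed error
    return 0.5 * reference + 4.0 * (reference - x)

print("open loop final state:  ", round(simulate(open_loop), 3))    # settles near 0.6
print("closed loop final state:", round(simulate(closed_loop), 3))  # settles near 0.96

Adding the integral term of the PID controller described next removes the remaining steady-state error, which is one reason the I term is so widely used in industrial practice.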
Proportional Integral and Derivative Controller. A common closed-loop controller is the Proportional (P), Integral (I) and Derivative (D) controller, applied either as P, PI, PD or PID; these account for 90% of modern industrial control applications [140]. PID controllers are effective and simple to implement and are generally used for simple systems that can be hand tuned on-line. They do not require intensive computations, however they may be difficult to implement for multivariate systems. In 1948 Walter Evans developed a graphical method, root locus analysis, for examining the effect of variation of gains on the roots of a feedback system [82, 83]. It allowed the stability of a given system to be determined and controllers (such as PID) to be better designed⁵. The availability of analog computers in the 1950s meant that the time-response criteria were easier to check (e.g. overshoot, settling time for a step command), resulting in PIDs being applied to more complicated tasks. Progress towards optimal control was made by George Newton et al. [189] in 1957, who first proposed an integral square error as a performance index, implicitly investigating the use of a penalty on the integral square control without producing a clear algorithm. This work was extended by Sheldon Chang [50] in 1961, who explicitly stated the need for a constraint on the integral square control (i.e. a penalty on control size) using a square weight factor k² (in analogy to Newton's integral square error), proposing the root-square locus vs. k² plot in the complex s-plane. Even though the use of classical control is widespread, optimal control is generally preferred due to its ability to (a) deal with MIMO systems, (b) specify explicit constraints on the input and (c) perform global optimisation.

⁵ The Root Locus plot operates in the complex s-plane, where the poles of the closed loop transfer function are plotted as a function of a gain parameter.

1.3 Modern Optimal Control

Most optimal control problems are solved using methods with the following fundamental components: solving a system of nonlinear algebraic equations (non-essential), solving a nonlinear optimisation problem (non-essential), and solving differential equations and integrating functions (essential). The development of modern digital computers allowed nonlinear optimal control problems to be approximated using numerical methods, divided into (a) Direct methods (control discretisation / nonlinear programming), (b) Indirect methods (variational approach / Pontryagin), and (c) Dynamic programming (discrete time systems) / the Hamilton-Jacobi-Bellman equation (continuous time systems) [272].

Indirect Method. An indirect method solves an optimal control problem indirectly by converting the original problem to a multiple-point boundary-value problem using the calculus of variations. The optimal solution is found by solving a system of differential equations that satisfies end point and/or interior point conditions leading to first-order necessary conditions, producing candidate optimal trajectories (extremal paths / extremals), each of which is examined for a local minimum, maximum or saddle point, with the extremal of the lowest cost being chosen. Therefore, the numerical solution of differential equations is combined with the numerical solution of systems of nonlinear algebraic equations.
Direct Method. A direct method discretises and approximates the state and/or control of the original optimal control problem (an infinite-dimensional optimisation problem), transcribing it into a nonlinear optimisation / nonlinear programming problem (a finite-dimensional optimisation problem) [24, 32], which is then solved using well known optimisation techniques. The method of control parameterisation involves approximation of the control variable only, while state and control parameterisation involves an approximation of both the state and control variables [100]. Therefore, the numerical solution of differential equations is combined with nonlinear optimisation.

Stability and controllability analysis of dynamical systems originated in 1892 with the thesis publication of Aleksandr Lyapunov on "The General Problem of the Stability of Motion" [172], in which he described the necessary and sufficient condition for the existence of scalar functions (Lyapunov Functions) that prove the stability of an equilibrium point of an ordinary differential equation (ODE). This means that the existence of a Lyapunov Function at an equilibrium point of a dynamical system drives the system to converge to and remain at that equilibrium point. This led to the development of Control Lyapunov Functions in the 1980s by Eduardo Sontag [231], which are used to test whether a dynamical system with inputs is feedback stabilisable; the existence of such functions for the dynamics of a problem guarantees the existence of a stabilising control (i.e. the system can be brought to an equilibrium by applying control). These theories play a crucial role in the study of stabilisability, "strong stability" (all trajectories go to an equilibrium point), and controllability, "weak stability" (some trajectories go to an equilibrium point), of dynamical systems.

1.3.1 The Simplest Methods – Shooting Algorithms

The simplest of the modern optimal control methods were developed in the 1950s. They reduce a two-point boundary value problem (finding a solution to an ODE defined with an initial-terminal constraint) to purely an initial value problem (one with an initial constraint only) [135]; these are called shooting methods and can be split into:

• Indirect Shooting Methods use a guess of initial values as if the ODE were formulated as an initial value problem. A trajectory is rolled out by integrating the Hamiltonian system forward in time until the final step, and the error between the resultant boundary value and the desired one is recorded. The choice of initial condition is altered and the process is repeated iteratively until the error meets a given tolerance.

• Indirect Multiple Shooting Methods reduce the sensitivity of the integration error to the choice of the unknown initial condition by segmenting the problem into smaller time intervals. Standard indirect shooting methods can then be applied over each sub-interval, where the initial values of the state and adjoint equations (Lagrange multipliers) of the interior intervals need to be determined.

• Direct Shooting Methods parametrise the control variable using a specified functional form with unknown parameters to be determined from the optimisation. This transformation results in a nonlinear programming problem, and an iterative trajectory for the full time interval is rolled out. An error in the terminal conditions is computed and the initial values are updated in order to minimise the cost subject to paths and constraints.
• Direct Multiple Shooting Methods dramatically improve the numerical stability of direct methods by dividing the full time interval into sub-intervals and by solving an initial value problem with direct shooting methods for each sub-interval.
In order to overcome the instability of shooting methods, impulse response functions (gradients) were introduced by Henry Kelley [136] and Arthur Bryson et al. [45] in 1960. An initial guess of a control trajectory is applied to a system and a corresponding state trajectory is recorded. The gradients of a performance index are obtained by integrating adjoint equations backward over the nominal trajectory and the control trajectory is adjusted in the direction of negative gradients (for a minimum problem). The procedure is repeated iteratively until both the performance index variation and the terminal conditions variation fall within a pre-determined tolerance. Gradient methods are the de facto standard choice for optimal control problems [204]. Controller design and stability analysis can be greatly simplified using decentralised control, developed in the 1970s. It calculates multiple compensators [57] for decentralised / large-scale / interconnected / composite dynamical systems (those consisting of, or those that have been decomposed into, several low-level interconnected sub-systems), where multiple controllers are linearly combined to drive the overall system to a goal.
1.3.2 From Discrete to Continuous Optimal Control
Classical Dynamic Programming In the 1950s Richard Bellman extended the Hamilton-Jacobi theory to discrete-time dynamic systems with quantized states and controls (combinatorial systems). He introduced a series of concepts [25]: state – a description of a dynamical system at a given time, optimal policy – a sequence of decisions which is most advantageous according to a pre-assigned criterion, optimal value function – the return obtained using an optimal policy starting from an initial state, and control function – a feedback on the current state and time, thus establishing the theory of Dynamic Programming. Bellman recognised that rather than considering an exhaustive search of all possible decision paths from which the minimal cost control path is selected, it is sufficient to choose a decision optimally at the present time based on the current state only, reducing both the problem dimensionality and its computational requirements. Therefore he broke a single multi-stage decision problem into a set of sub-problems to which he applied his Principle of Optimality (see Definition 1) to define a functional equation, which expresses a relationship between the value of a state and the values of its successor states [26], the Bellman Equation6

v(x) = min_{u ∈ U(x)} [ cost(x, u) + v(next(x, u)) ]    (1.3.1)

where x ∈ X is the state of an environment, X is the set of admissible states, u ∈ U is the control signal (action), U(x) is the set of admissible controls, v(x) is the value of the current state ("cost-to-go" – the cost expected to accumulate for starting from state x and thereafter following a control policy π(x)), π(x) is a deterministic control law / policy – a mapping from states to actions, cost(x, u) is the cost obtained as a result of applying control u in state x, and next(x, u) is the state obtained as a result of applying control u in state x.
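To make the recursion of Equation 1.3.1 concrete, the following is a minimal Python sketch of value iteration on a toy deterministic shortest-path problem; the chain of states, the unit step costs and the convergence tolerance are illustrative assumptions and are not taken from the benchmark tasks of this thesis.

# Minimal sketch of the Bellman recursion (Equation 1.3.1) on a toy
# deterministic shortest-path problem; states, costs and transitions are
# illustrative only.
import numpy as np

n_states, n_actions = 5, 2                    # small chain of states, move left/right
cost = np.ones((n_states, n_actions))         # unit cost per step
cost[-1, :] = 0.0                             # the last state is an absorbing goal

def next_state(x, u):                         # next(x, u): deterministic transition
    step = 1 if u == 1 else -1
    return min(max(x + step, 0), n_states - 1) if x < n_states - 1 else x

v = np.zeros(n_states)                        # value ("cost-to-go") for every state
for sweep in range(100):                      # value iteration: repeated Bellman backups
    v_new = np.array([min(cost[x, u] + v[next_state(x, u)] for u in range(n_actions))
                      for x in range(n_states)])
    if np.max(np.abs(v_new - v)) < 1e-9:      # stop once the backup is a fixed point
        break
    v = v_new

policy = [int(np.argmin([cost[x, u] + v[next_state(x, u)] for u in range(n_actions)]))
          for x in range(n_states)]           # greedy policy w.r.t. the converged values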
Definition 1 (Bellman's Principle of Optimality) An optimal policy has the property that whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decisions [25].
Dynamic programming deals with families of extremal paths that meet specified terminal conditions. It finds a globally optimal closed-loop control7 law (policy) off-line, which gives an appropriate action for any state, by reducing the problem of cost-to-go minimisation over the whole path to a sequence of minimisations at each time step only. The algorithm utilises Value Iteration – iterating backwards from the final time-step, where the terminal cost is known, at each time interval solving the Bellman equation for all possible states and storing the (optimal) control signal of least cost for that time step only, thus reducing the computational requirements of an exhaustive search. In this new framework, the partial derivatives of the value function w.r.t. the state are identical to Lagrange's multipliers and the Euler-Lagrange equations are easily derivable [80].
6 The Bellman Equation considers every action available at the current state, adds its immediate cost to the value of the resulting next state and chooses an action for which the sum is minimal [260].
7 Closed-loop / feedback control gives the optimal controller as a function of the current state, time and task parameters (i.e. the controller is a decision rule – it can be obtained for any admissible state and time period).
The discretisation of admissible controls and state variables not only results in approximation errors but also causes what Bellman called the Curse of Dimensionality – an exponential growth in computation requirements with respect to the state dimension, therefore making high-dimensional problems intractable. Even though dynamic programming was used to derive important results for simple control problems such as Linear Quadratic (Gaussian) optimal control, it is not considered to be a practical algorithm and was used as a conceptual tool, particularly before Pontryagin's Maximum Principle was well understood [193]. Bellman also studied minimum-time optimal control problems and in 1955 found that the feedback control signal for such systems is bounded but discontinuous, with abrupt switches between the lower and upper bounds of its limits (two extremes); this was called Bang-Bang Control, where for an nth-order linear system there are at most n − 1 switches [27]. Typically such control is applied to binary "on" / "off" input systems or to discontinuous nonlinear systems (variable structure systems).
Hamilton-Jacobi-Bellman vs. Pontryagin Classical dynamic programming was extended to continuous-time dynamical systems by Kalman in the 1960s, who made an explicit connection between the work of Bellman [29] and the Hamilton-Jacobi equation, to form a system-specific Hamilton-Jacobi-Bellman equation (see Appendix A.4.8). It uses the principle of optimality and requires the cost-to-go function to be smooth and everywhere differentiable; it is usually solved by backward iteration and results in a closed-loop solution, u∗(x, k)8. The first-order partial differential equation is a sufficient and necessary condition for optimality when solved for the whole state space and a necessary condition when solved locally (i.e. provides candidate solutions for optimality).
For many real life systems this equation is nonlinear and presents difficulties in its solution requiring numerical approximations. The form of the value function may be guessed for simpler nonlinear systems leading to a solution. Dynamic programming methods may fail for non-smooth optimal cost-to-go functions in which case an alternative method, developed by Lev Pontryagin in 1962, may be used instead. The Pontryagin’s Minimum Principle [194] transforms a variational problem into a nonlinear programming problem, which considers one extremal at a time. This avoids the curse of dimensionality and leads to an ODE, producing an open-loop solution u∗ (k) from specific initial conditions. Lev extended the necessary condition of Weierstrass to cases where control functions are bounded [195], resulting in a generalisation of the Euler-Lagrange equation, which is required to be satisfied at each point along the control path, while minimising the Hamiltonian by optimal controls within their bounded region. Computationally this is less complicated than the Hamilton-Jacobi-Bellman approach since it does not require discretisation of the state space, thus bypassing the expensive calculation of the cost-to-go function and finds an optimal control trajectory directly. Even though both Hamilton-Jacobi-Bellman approach and Pontryagin’s Minimum Principle involve: (a) optimization of the Hamiltonian w.r.t. the control signal, (b) an initialisation of value function using terminal cost, and (c) a differential equation that can be used in a backward pass through time, they are fundamentally different. 8 Even though the method itself is continuous, numerical approximation of nonlinear equations requires discretisation when solved on digital computers. 1.3 Modern Optimal Control 13 Differential Dynamic Programming In 1966 David Mayne extended the classic dynamic programming method that considers an optimal cost function globally, to an iterative local optimisation algorithm, called Differential Dynamic Programming, [173, 174] which considers a cost function locally to a nominal state and control trajectories. The state space dynamics are iteratively linearised to a second order and expanded around each point of a nominal state trajectory, producing a linear system model and a linear quadratic one-step cost function model. A linear quadratic regulator control is applied to each linearised point improving the local cost-to-go function and the local linear optimal feedback controller [124]. The differential dynamic programming achieves quadratic rate of global convergence under assumption of convexity and twice continuously differentiability of the loss and state transition functions. Mayne originally discussed deterministic discrete time problems, however Jacobson & Mayne extended it to stochastic variations able to handle additive noise [125] (see further work carried out by Emo Todorov et al. [254] able to handle multiplicative state and control noise), and showed its applicability to bang-bang optimal control problems [123]. 
Differential dynamic programming has been used in a receding horizon setting [248], combined with local trajectory optimisation by Christopher Atkeson [14] in 1994 (who adapted and modified the state-incrementing dynamic programming approach of Larson [157], reducing the number of necessary discretisation steps), and further improved in the calculation of steady-state policies by combining sparse random sampling of states [15], value function / policy approximation and local trajectory optimisation [164, 234, 235].
1.3.3 From Dynamic Programming to Reinforcement Learning and MDPs
Approximate Dynamic Programming The work on stochastic dynamical systems and on tackling the curse of dimensionality led to the development of a new method, Approximate Dynamic Programming, through the unification of work by control theorists, artificial intelligence researchers, computer scientists and psychologists. It originated from research: (a) of Richard Bellman in 1957 on a discrete stochastic version of optimal control problems [28], which became known as Markov Decision Processes (MDPs), a framework used to define a class of control problems satisfying the Markov Property, with Ron Howard developing the first Policy Iteration Method for solving MDPs in the 1960s, (b) of Arthur Samuel in 1959 [213] in computer science on training computers to play the game of checkers [214], and (c) in control theory, the work of Paul Werbos in 1974 [277, 278] and the work of David White & Donald Sofge in 1992 [279] on using artificial neural networks to generate control signals for driving motors. These works involved research on solving dynamic programming problems in an approximate manner and, together with (d) theories of trial-and-error animal learning using reward and punishment in the 1980s, led to the development of an entirely new field, initially called Adaptive Dynamic Programming [273], later termed Approximate Dynamic Programming and then given its current name Reinforcement Learning [129, 239–242]. The term Neuro-Dynamic Programming (NDP) was introduced by Dimitri Bertsekas and John Tsitsiklis [34] in 1995, who made the connection between classical dynamic programming and artificial neural networks used as function approximators.
Rollout Algorithms Intermediate methods were developed in 1997 by Dimitri Bertsekas, who introduced Rollout Algorithms [35] that substantially improve the performance of base heuristics (underlying methods providing sub-optimal, fast solutions, where standard methods are too slow) through sequential application of online learning and simulation. The algorithms provide approximate solutions to discrete, deterministic / stochastic optimisation problems and work by embedding the original system into a dynamic programming framework and applying policy iteration algorithms (NDP / RL). They form an intermediate methodology between heuristics and NDP / RL methods, since rollout algorithms are superior (but slower) in solution quality to stand-alone heuristics and simpler (but faster and sub-optimal) compared with stand-alone NDP / RL. Even though RL methods require more computation than heuristics, the RL methodology offers an artificial agent (a system which learns and makes decisions) which optimises its behaviour in an unknown, stochastic / nonlinear environment without prior knowledge – a significant advance over previously existing dynamic programming methods, which are model-based, i.e. require a priori knowledge of the system dynamics.
Markov Decision Processes In order to solve a problem using RL algorithms and have convergence guarantees, the system needs to be formulated as a Markov Decision Process – a process defined by its state space S, action space A(s) (where s represents the state), transition probability T : S × A(s) × S → R≥0 and reward function R : S × A(s) × S → R, where the transition probability specifies the system's response to actions (not required for the majority of RL algorithms) and the reward function specifies the quality of state transitions, and which overall satisfy the Markov Property – the requirement that each consecutive state and reward depends only on the previous state and action. In this framework control is applied according to a policy – defined by Bellman to be a sequence of decisions which maps the state space to the control signal, π(s) : S → A(s). More formally, the agent-environment interface is described using Figure 1.4 [239], where at each discrete RL time step t an agent receives information about the current state s_t ∈ S of the environment and applies an action a_t ∈ A(s_t) according to the policy π(s). The environment responds to the applied action, transitioning into a new state s_{t+1} according to the transition probability T, and produces a scalar reward r_{t+1} according to a reward function R which evaluates the immediate effect of action a_t. The new state and reward are fed back into the agent, which records this information and repeats the cycle while maximising the long term reward it receives for its actions. A central issue of reinforcement learning is the temporal credit assignment problem – the assignment of penalty / credit to actions based on the results obtained following their selection. In the RL framework the Bellman Equation for the value function v^π, expressing the relationship between the value of a state and the values of its successor states, is

v^π(s) = E_π[ R_t | s_t = s ] = ∑_a π(s, a) ∑_{s′} P^a_{ss′} [ R^a_{ss′} + γ v^π(s′) ]    (1.3.2)

where R_t = ∑_{n=0}^{∞} γ^n r_{t+n+1} is the discounted return under discount factor γ ∈ [0, 1] (the sum of discounted immediate rewards received for starting in state s at time t and following policy π); π(s, a) is a stochastic policy – the probability of selecting action a in state s; P^a_{ss′} is the probability of transitioning from state s into next state s′ under action a; and R^a_{ss′} is the expected immediate reward for transitioning from state s into next state s′ under action a. The classification of reinforcement learning algorithms is based on the path taken to search for an optimal policy and coincides with that of dynamic programming due to its mixed inheritance. These typically fall into three classes of algorithms:
• Policy Iteration Algorithms iteratively perform two steps: (1) Policy Evaluation (E) – evaluates the cost-to-go (value) function for the current policy, and (2) Policy Improvement (I) – improves the current policy based on the evaluated value function; thus finding a monotonically improving sequence until convergence to the optimal policy / value function, π^0 –E→ v^{π^0} –I→ π^1 –E→ ... –I→ π* –E→ v^{π*}.
• Value Iteration Algorithms search for the optimal cost-to-go (value) function, improving its estimate at each iteration, where upon convergence the optimal value function is used to find an optimal policy. Each step is typically equivalent to a single sweep of policy iteration and more generally can utilise multiple policy evaluation sweeps between each policy improvement.
• Policy Search Algorithms search for an optimal policy directly using optimization methods. Fig. 1.4 Agent-Environment Interaction in reinforcement learning. A closed-loop control applied to a system formulated in a Markov Decision Process framework, capable of learning optimal actions from interaction with an unknown environment. This is made possible solely by following an objective of long term (observed) reward maximisation by the agent. Note: adapted from Sutton & Barto book [239]. Most RL methods work on-line – interact with an environment in order to learn from it – and are model-free – receive transition and reward data from the environment as opposed to knowing it a priori. An overview of the current state-of-the-art model-based and model-free algorithms available for (approximate) optimal solutions of nonlinear control systems is given below. Introduction 16 1.4 State-of-the-Art Nonlinear Control The aim of my thesis is to optimally control nonlinear systems with (un)known dynamics. This section surveys the state-of-the-art control algorithms for solving this class of problems. The common frameworks employed when solving optimal control tasks can be classified into three distinct groups: • Model-based algorithms – these either (a) require full knowledge of the plant dynamics prior to interaction with the environment (e.g. majority of control algorithms in engineering), or (b) learn the system dynamics on-line through experience, thus extracting valuable information from available data9 (e.g. some RL algorithms). • Model-free algorithms – receive all the necessary information from the environment, thus not requiring any system identification in order to control it (e.g. artificial neural networks, majority of RL algorithms10 ). The major advantages are: (a) ability to handle changes in an underlying system dynamics, (b) no requirement of open-loop training data and (c) possession of robustness in presence of strongly varying controls. • Hybrid algorithms – combine model-based techniques with model-free techniques and thus usually require only an approximate knowledge of the dynamics. In order to solve complex nonlinear, high-dimensional or continuous tasks, most methods require performing numerical approximations of the value function / policy, with popular function approximators: polynomials / splines / trigonometric series, Gaussian functions / Gaussian processes, artificial neural networks / wavelet networks etc. The approximations typically result in a near-optimal rather than optimal control solutions and can be classified into two distinct groups: • Parametric approximators – are functions the form of which is given a priori and the form of which is not data dependent. The parameters of the functions are tuned using data based on target value function / policy (e.g. linear combination of a fixed set of basis functions). • Nonparametric approximators – are functions the form of which depends on and is derived from data. The parameters of the functions and the number of parameters are derived from the data (e.g. kernel-based approximators represent a target function / policy with a linear combination of basis functions, but define one basis function for each data point). These methods tend to be less prone to modelling errors, but may require more training data. All of the surveyed methods aim to approximate solutions to nonlinear control problems and some result in control laws applicable in transfer learning11 . 
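To make the agent-environment interface of Figure 1.4 concrete before surveying specific algorithms, the following is a minimal Python sketch of the interaction loop with a tabular, model-free agent (a Q-learning update, one of the classical model-free methods mentioned below); the environment interface (reset / step) and all parameter values are illustrative assumptions, not the implementation used in this thesis.

# Minimal sketch of the agent-environment loop of Figure 1.4 with a tabular,
# model-free agent (Q-learning update); the environment interface and the
# parameter values are illustrative assumptions only.
import numpy as np

def run_episode(env, Q, alpha=0.1, gamma=0.95, epsilon=0.1):
    s = env.reset()                                   # agent observes initial state s_t
    done = False
    while not done:
        if np.random.rand() < epsilon:                # epsilon-greedy action selection
            a = np.random.randint(Q.shape[1])
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)                 # environment returns s_{t+1}, r_{t+1}
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])         # temporal-difference update
        s = s_next                                    # the new state is fed back to the agent
    return Q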
9 Model-based algorithms are prone to model bias – inaccurately learnt models due to restrictive / uninformative data obtained from interaction with the environment [17].
10 The large number of real-life trials required by model-free RL contrasts sharply with how humans learn to perform a task.
11 Transfer learning utilises information learnt in a past task in order to improve learning of a new related task.
1.4.1 Model-Based Algorithms
Nonlinear Model Predictive Control (NMPC) Many real life problems are inherently nonlinear and do not have a naturally defined end point, requiring a control framework with an infinite time horizon setting and possibly bounded admissible inputs. Nonlinear Model Predictive Control [87, 104, 154] addresses this class of problems and is an extension of the Receding Horizon Control developed in the 1970s [91, 175]. It linearises a system at the current time step t and estimates the system dynamics over a finite prediction horizon t + Tp. A pre-determined performance objective is then minimised using an open-loop control12 over the control horizon t + Tc ≤ Tp. Due to the mismatch between the predicted plant dynamics and the actual system behaviour, feedback is incorporated by using the calculated open-loop controller only for a short time horizon, the recalculation time δ ≪ Tc. After this a new state is obtained at time step t + δ, and the prediction and optimisation are repeated for an infinite time horizon. Apart from the heavy on-line computation requirement, the open-loop control calculation leads to a poor approximation of the closed-loop control it actually performs in an uncertain environment, leading to sub-optimality. However, this technique is widely accepted and used in engineering disciplines.
Iterative Linear Quadratic Gaussian (iLQG) An extension to the differential dynamic programming method, which utilises the Linear Quadratic Regulator of Kalman [76, 130], was made by Emanuel Todorov et al. and called the Iterative Linear Quadratic Regulator (iLQR) [162]. It was developed specifically for the control of biological systems, however it was also tested against benchmark algorithms on an inverted pendulum problem (no cart) and found to be 10 times more efficient. It iteratively linearises the nonlinear system dynamics around nominal state and control trajectories (x̄^(i)(t), ū^(i)(t))13 and computes a locally optimal feedback control law using a modified LQR technique. The control law is applied to the linearised system and is used to improve the nominal trajectory incrementally [162] until convergence (i.e. until the change in both control and state trajectories upon successive iterations meets a required tolerance). The Iterative Linear Quadratic Gaussian (iLQG) method [263] (a) handles control and state multiplicative noise, (b) handles bounded control signals, which typically make the cost-to-go function non-quadratic, and (c) constructs consistent value function approximations via modified Riccati equations (only possible when the square matrix of second-order partial derivatives, the Hessian, is positive-definite). The latter requirement holds for any quadratic approximation method based on Riccati equations, such as DDP, therefore the iLQG method systematically corrects any negative Hessian matrix entries. A detailed derivation of the iLQG method is given in Section 2.3, which to my best knowledge is not available elsewhere in the literature.
12 Open-loop control – a controller derived via the necessary conditions of optimal control as a function of: boundary conditions (initial/ terminal values), time and task parameters (i.e. the controller is a curve – it is obtained as the independent time variable changes over the planning horizon). 13 The initial state trajectory x̄init is obtained with an arbitrary control ūinit and if a good initialisation of the control signal can not be used, a best guess of 0 is employed. 18 Introduction LQR-Trees, Motion Planning The LQR-trees algorithm [250] offers a probabilistic feedback coverage with Lyapunov verified trajectories stabilised with LQR feedback by growing a randomised tree backwards from a target state. The calculated Lyapunov functions evaluate the regions of attraction of the randomized trees and are closely related to motion planning funnels [48], producing a stored control policy in the form of LQR-trees thus mapping state space to nonlinear feedback control. Modular Selection and Identification for Control (MOSAIC) Optimal control has provided a solid general framework for constructing models of biological movement and the Modular Selection and Identification for Control (MOSAIC) method forms a prominent model for sensorimotor learning and control [113]. The MOSAIC learns the system’s dynamics from on-line experience and postulates that the brain has multiple modules responsible for the global sensorimotor control, thus suggesting a symbolic task representation within the brain. Each module of the MOSAIC contains three parts: a responsibility predictor "prior" (which evaluates contextual sensory information - calculated prior to movement), a forward model "likelihood" (which predicts the effect of a motor command, effectively learning the dynamics - calculated after movement) and an inverse model "controller" (which produces a feedforward motor command). The responsibility predictor and the forward model (trained using prediction errors) are used to determine a degree ("responsibility") to which each module captures the current context and should therefore participate in control. The overall motor command is calculated by weighting each unit’s feedforward control with its corresponding responsibility. The learning loop is closed with the generated next state and overall control feeding back into the forward model for training. Hence the algorithm executes n-paired modules, each producing a controller, the contributions of which are weighed up with machine learning methods both prior to and during task execution. Probabilistic Inference for Learning Control (PILCO) More recently Bayesian and function approximation approaches have been combined, such as the work of Marc Deisenroth and Carl Rasmussen on Probabilistic Inference for Learning Control (PILCO) [69, 205] which uses complex representations based on non-parametric probabilistic Gaussian Processes to model and leverage the uncertainty of the unknown dynamics and policy, while simultaneously learning both. PILCO’s policysearch framework is comprised from: a model of the dynamics (trained through forward modelling), an analytic indirect and approximate policy search method (thus not requiring an explicit value function model) and a gradient-based policy improvement method. It forms an exceptionally efficient, modelbased technique for control of nonlinear plant dynamics. 
Notoriously, Gaussian Processes require an inversion of the kernel matrices, which can scale unfavourably with respect to the number of data points required for learning, e.g. when the task is temporally complex or highly dimensional in state space. Even though PILCO requires only 17.5s of interaction with a physical system [205] to learn the benchmark cart-pole problem, it performs intensive off-line calculations for global policy parameter optimization; a reported 1 hour is required for the cart pole task using a 2008 work station [205]. This may present a challenge for more complex, higher-dimensional real-world tasks.
1.4.2 Model-Free Algorithms
A number of model-free algorithms used for solving nonlinear optimal control problems exist, such as fuzzy logic [274], artificial neural networks [119] and those based on reinforcement learning. The RL field was originally developed for control of stochastic systems with unknown dynamics and thus the majority of its algorithms are model-free, such as Q-learning and TD-learning. I will focus on the recent advances in RL because these are most relevant to my work. Most problems have continuous or large discrete state / action spaces and classical RL algorithms encounter the curse of dimensionality. Modifications to standard RL methods yielded successful applications, for example optimising quadrupedal robotic locomotion [143], however still requiring a large number of trials. As a result standard RL algorithms have been extended to form novel state-of-the-art techniques explained below.
Continuous Reinforcement Learning The scalability of reinforcement learning can be much improved with function approximation methods [205, 239], resulting in algorithms operating in continuous time such as Advantage Updating [21, 129]. The most prominent algorithm for continuous time dynamical systems without a priori discretisation of time, state and action was developed by Kenji Doya in 2000 [77]. It is based on the Hamilton-Jacobi-Bellman equation for infinite horizon discounted reward problems and uses function approximators to estimate value functions and improve policies, where (a) value function estimation involves minimisation of the temporal difference error extended to continuous-time form (using backward Euler approximation, exponential eligibility traces) and (b) policy improvement is performed with either a continuous actor-critic method or a value-gradient-based greedy policy (using a value gradient and an input gain model for deriving nonlinear feedback control).
Hierarchical Reinforcement Learning (HRL) A different approach to solving the curse of dimensionality, Hierarchical Reinforcement Learning, proposes temporal abstractions [22] and concerns actions ("options") resulting in state transitions of extended and variable durations [40, 66, 68, 128, 197, 228, 243]. An option consists of an initiation set (determining when it should be activated), a policy (which decides which primitive actions to use) and a terminating condition (indicating when a new option should be picked). Any fixed set of options defines a discrete time semi-Markov Decision Process (semi-MDP) – a process that selects among options, executing each to termination – embedded within an original MDP. The architecture typically involves manually decomposing a task into a set of sub-tasks and learning both high-level and low-level policies [176], as for example is used in robotic tasks [160, 183].
By defining such sub-goals, the set of policies that need to be considered during RL is constrained, allowing individual algorithms within the hierarchy to ignore large parts of state space. The MAXQ is one such general purpose algorithm making use of temporal abstractions. It works by decomposing the value function into a set of sub-problems, each forming a context independent policy – MAX-nodes, which overall converge to a recursively optimal policy using actions attributed to each sub-task – context dependent value function, Q-nodes. The MAXQ keeps the task Markovian at Introduction 20 each level within the hierarchy, thereby allowing a problem to be solved with standard RL methods. This results in faster learning than the flat Q-learning, however may only be locally optimal [73, 74]. In 2003 Andrew Barto & Sridhar Mahadevan provided an overview of hierarchical RL methods [22] and in 2009 Barto et al., proposed further improvement using new option ("skill") discovery on-line [144], which establishes chains of options in continuous domains. Motion Primitives A special kind of parametrized policies, called motion primitives exploit the analogy between autonomous differential equations and control policies (i.e. those that are based on dynamical systems) and term each learnt policy – a single motion primitive [8, 120]. In this framework a ’primitive’ is defined as a segment of movement rather than as a superposition of concurrent primitives [63]). A hierarchical probabilistic generative model developed by Amos Storkey et al., analysed handwritten data to infer a model for the superposition of sparsely activated motion primitives [282]. It is comprised of a model of primitive timing (high-level), Hidden Markov Model (low-level) and an Expectation Maximisation algorithm for the inference of motion primitives and their timing. The single motion primitive approach was extended to using many motion primitives in a reinforcement learning framework, where temporal abstractions, termed motion templates / dynamic movement primitives, allowed learning an arbitrary parametrization of the abstracted level rather than fixing it a priori, forming building blocks of motion and defining new "parametrized options" in continuous time [188]. Recently Schaal et al., proposed a probabilistic reinforcement learning method for Policy Improvement using Path Integrals (PI2 ), which scales well to high dimensional control systems. The authors show a successful learning outcome for controlling a complex jumping behaviour of both simulated and physical robot dog with 12 degrees of freedom [252, 253]. 1.4.3 Hybrid Algorithms Hybrid models Models combining model-free and model-based approaches are known as hybrid. Successful robot control was also achieved by hybrid algorithms learning a reward function from demonstrations, e.g. Atkeson & Schaal [18]. Another example of hybrid framework is proposed by Abbeel et al., who developed an algorithm that requires only an approximate model dynamics and a small number of real-life trials [1]. It uses regression to form a crude model of the dynamics on-line and applies differential dynamic programming to iteratively improve deterministic systems via policy gradient evaluations using real-life experience. In more recent work Abbeel et al., show a hybrid algorithm that learns stationary nonlinear policies for solutions to unknown dynamical systems under guided policy search. 
The method builds upon the iLQG algorithm [263] and utilises a background dynamics distribution as a prior to reduce complexity [161]. The algorithm succeeds in the presence of non-smooth dynamics, partially observed environments, underactuation and contact discontinuities.
1.5 How To Do Better?
Even though significant progress towards solving nonlinear control problems with (un)known dynamics has been made in the last 50 years, difficulties in applying the state-of-the-art algorithms reviewed in Section 1.4 still exist: (a) it is extremely difficult to build accurate models of high-dimensional MDPs using model-based RL, which often results in policies that do not transfer from simulation to real life; (b) the difficulty of solving the Hamilton-Jacobi-Bellman partial differential equations for nonlinear systems to obtain a closed-form analytical solution raises the need for numerical approximations; (c) model-free RL methods often require an infeasibly large number of real-life trials due to the curse of dimensionality. This contrasts with humans, who can learn difficult tasks (such as turning a car around a bend) in a matter of a few trials. It means that this important and difficult class of problems remains a challenging research topic, which is of interest to a wide research community. The aim of my thesis is to successfully combine reinforcement learning and optimal control in order to produce an accurate, data efficient and computationally inexpensive method for numerical approximation of nonlinear control problems with unknown dynamics. I wish to exploit the advantages of each method while mitigating the known disadvantages, and introduce a hybrid hierarchical algorithm, Reinforcement Learning Optimal Control (RLOC), that uses both model-free and model-based techniques to produce near-optimal, continuous control of the continuous state evolution. To my best knowledge, existing research providing such a framework is scarce. The RLOC algorithm has a hierarchical structure with a high-level symbolic decision maker (Monte-Carlo RL) and low-level continuous control (LQR controllers). It uses low-level optimal control costs as training signals for the high-level symbolic decision maker to learn how to globally combine ("chain") actions in state space for near-optimal control. The choice of both high-level and low-level controllers is not restrictive and I show that the RL agent is capable of making a selection between varied low-level controllers. I provide a neurobiological motivation for RLOC and show that it forms a feasible proof-of-principle framework for how the brain may bridge the divide between low-level continuous control and high-level symbolic learning. Hence my thesis proposes a general purpose algorithm applicable to a wide range of control problems with (un)known (non)linear dynamics. The thesis has the following structure: Chapter 1 gives a brief overview of the optimal control field and its link to the emergence of the RL field. Chapter 2 gives the mathematical tools necessary to develop the RLOC framework. A full derivation of the state-of-the-art nonlinear control algorithm iLQG is provided, as well as the derivations of the equations of motion for the benchmark tasks. Chapter 3 provides neurobiological and control engineering motivation, and describes the principles behind combining Reinforcement Learning with Optimal Control which result in the general RLOC framework.
Chapter 4 develops the RLOC algorithm from an engineering perspective, showing how a heterogeneous set of low-level controllers (PID and LQR) can be combined hierarchically, so as to optimally solve nonlinear problems with known dynamics. Chapter 5 extends RLOC in order to learn to control nonlinear systems with unknown dynamics and utilises the hierarchical neurobiological motivation of the earlier chapters. Chapter 6 provides the general discussion and conclusion.
Chapter 2
Mathematical Theory
"A man is like a fraction whose numerator is what he is and whose denominator is what he thinks of himself. The larger the denominator, the smaller the fraction." — L. Tolstoy
In this chapter I derive from first principles (a) the linearisation and time-discretisation of general (Section 2.1.1) and commonly studied (Section 2.1.2) continuous nonlinear dynamics [33], showing the resulting system and control matrices (Section 2.1.4), (b) the infinite horizon, nonzero target Linear Quadratic Regulator [130] control (Section 2.2), (c) the iterative Linear Quadratic Gaussian [263] control (Section 2.3) and (d) the studied system dynamics of the Robotic Arm (Section 2.4.1) and Cart Pole (Section 2.4.2). Even though some of the above derivations exist in the literature, I have found them to be either incomplete or unavailable.
2.1 Linearisation and Discretisation of Nonlinear Systems
The optimal control problems solved by engineering algorithms require a model (a mathematical description) of the system dynamics of the process to be controlled. Dynamical systems are generally represented by first or higher order ordinary differential equations and can be classified into [137]
• Time Invariant / Stationary / Fixed Nonlinear Systems – dynamics do not depend on time, ẋ(t) = f(x(t), u(t)); mostly simulated systems (real life example: electrical circuits)
• Time Varying Systems – dynamics depend on time (true for the majority of real-life systems), ẋ(t) = f(x(t), u(t), t) (real life examples: aircraft, vocal tract)
where x(t) ∈ R^ℓ, u(t) ∈ R^m and ẋ(t) = d(x(t))/dt are the continuous state, control and time-derivative of the state vectors respectively, and f(·) is a (transfer) function describing the state evolution with time. The dynamical systems studied in optimal control theory [33] are represented by the equations for
• Linear Systems1 – dynamics do not depend on the state; these may be either time-varying, ẋ(t) = A(t)x(t) + B(t)u(t), or time-invariant, ẋ(t) = Ax(t) + Bu(t)
• Nonlinear Systems2 – dynamics do depend on the state; these may be either time-varying, ẋ(t) = A(x(t), t)x(t) + B(x(t), t)u(t), or time-invariant, ẋ(t) = A(x(t))x(t) + B(x(t))u(t)
where A(·) ∈ R^{ℓ×ℓ} is the system control matrix acting on the state vector and B(·) ∈ R^{ℓ×m} is the control gain matrix acting on the control vector.
2.1.1 Time-Invariant Nonlinear Dynamical Systems
Many engineering control techniques operate on and assume linear time-invariant system dynamics (even though real-life systems are often nonlinear and time-varying), since these models have proved to be exceptionally valuable in optimal control due to their ability to accurately approximate real-life dynamics and to allow for an approximately optimal feedback control design [33].
Linearisation The celebrated optimal control algorithms such as LQR and iLQR, among others, require the dynamics to be linear; as a result nonlinear systems are often (iteratively) linearised [14, 263].
Consider the dynamics of a general time-invariant continuous system

ẋ(t) = f(x(t), u(t))    (2.1.1)

which can be linearised around each linearisation point (x_0^(i), u_0^(i)), i = 1, ..., n_a (where x_0^(i) ∈ R^ℓ and u_0^(i) ∈ R^m are points in state and control space respectively) with a first order Taylor approximation, using the Leibniz notation for partial derivatives evaluated at a point

ẋ(t) = f(x_0^(i), u_0^(i)) + ∂f/∂x(x_0^(i), u_0^(i)) (x(t) − x_0^(i)) + ∂f/∂u(x_0^(i), u_0^(i)) (u(t) − u_0^(i)) + O(|x(t) − x_0^(i)|²) + O(|u(t) − u_0^(i)|²)
     = ∂f/∂x(x_0^(i), u_0^(i)) x(t) + ∂f/∂u(x_0^(i), u_0^(i)) u(t) + f(x_0^(i), u_0^(i)) − ∂f/∂x(x_0^(i), u_0^(i)) x_0^(i) − ∂f/∂u(x_0^(i), u_0^(i)) u_0^(i) + O(|x(t) − x_0^(i)|²) + O(|u(t) − u_0^(i)|²)    (2.1.2)

1 A linear system satisfies the Superposition Principle, which states that the net response to two or more stimuli at a given point is the sum of the responses to each individual stimulus – i.e. the additivity property f(x + u) = f(x) + f(u) and the homogeneity properties f(αx) = αf(x), f(βu) = βf(u) hold, so that f(αx + βu) = αf(x) + βf(u).
2 An example of a nonlinear system of the form f(x, u) = A(x)x + B(x)u is f(2x, 2u) = A(2x)2x + B(2u)2u ≠ 2f(x, u).
I will use a condensed Leibniz partial derivative notation, where ∂f(x, u)/∂x evaluated at x = a_x, u = a_u is written ∂f/∂x(a_x, a_u).
Discretisation In order to apply the state-of-the-art algorithms (such as those discussed in Section 1.4) using present day digital computers, the continuous-time dynamical system needs to be discretised in the time domain, which can be done using the formal definition of the derivative

ẋ(t) = lim_{∆t→0} [x(t + ∆t) − x(t)] / ∆t    (2.1.3)

where ∆t ∈ R_{>0} is an increment of time. When this is fixed at a small value ∆ = T/n_K ∈ R_{>0}, called the discretisation time step, Equation 2.1.3 can be rewritten as (x_{k+1} − x_k)/∆, where x_k = x(k × ∆), T is the total simulation time, n_K ∈ N⁺ is the total number of time steps and k = 1, ..., n_K ∈ N⁺ is the time step number of a finite discrete-time optimal control problem. The appropriate choice of the discretisation time step size is of great importance in the approximation of continuous nonlinear systems. The linearised, time-invariant, continuous-time system of Equation 2.1.2 can be approximated with a time-invariant, discrete-time linear system

(x_{k+1} − x_k)/∆ = f(x_0^(i), u_0^(i)) + ∂f/∂x(x_0^(i), u_0^(i)) (x_k − x_0^(i)) + ∂f/∂u(x_0^(i), u_0^(i)) (u_k − u_0^(i)) + O(|x_k − x_0^(i)|²) + O(|u_k − u_0^(i)|²)

x_{k+1} = x_k + f(x_0^(i), u_0^(i))∆ + ∂f/∂x(x_0^(i), u_0^(i)) x_k ∆ − ∂f/∂x(x_0^(i), u_0^(i)) x_0^(i) ∆ + ∂f/∂u(x_0^(i), u_0^(i)) u_k ∆ − ∂f/∂u(x_0^(i), u_0^(i)) u_0^(i) ∆ + O(|x_k − x_0^(i)|²) + O(|u_k − u_0^(i)|²)

x_{k+1} = [I + ∂f/∂x(x_0^(i), u_0^(i))∆] x_k + [∂f/∂u(x_0^(i), u_0^(i))∆] u_k + [f(x_0^(i), u_0^(i)) − ∂f/∂x(x_0^(i), u_0^(i)) x_0^(i) − ∂f/∂u(x_0^(i), u_0^(i)) u_0^(i)]∆ + O(|x_k − x_0^(i)|²) + O(|u_k − u_0^(i)|²)

x_{k+1} = A x_k + B u_k + c + O(|x_k − x_0^(i)|²) + O(|u_k − u_0^(i)|²)    (2.1.4)

where A = I + ∂f/∂x(x_0^(i), u_0^(i))∆ ∈ R^{ℓ×ℓ} and B = ∂f/∂u(x_0^(i), u_0^(i))∆ ∈ R^{ℓ×m} are the time-invariant, discrete-time, state-independent system and control gain matrices, and c = [f(x_0^(i), u_0^(i)) − ∂f/∂x(x_0^(i), u_0^(i)) x_0^(i) − ∂f/∂u(x_0^(i), u_0^(i)) u_0^(i)]∆ ∈ R^ℓ is the constant of linearisation.
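As an illustration of Equation 2.1.4, the following is a minimal Python sketch of how the matrices A, B and the constant of linearisation c could be obtained numerically for a given nonlinear f around a linearisation point (x_0^(i), u_0^(i)); the use of central finite differences for the Jacobians, the toy pendulum-like dynamics and the step sizes are illustrative assumptions (analytical derivatives would serve equally well).

# Minimal sketch of Equation 2.1.4: numerical linearisation of a nonlinear,
# time-invariant system xdot = f(x, u) around (x0, u0) and Euler time
# discretisation with step size dt. Finite-difference Jacobians and the toy
# dynamics below are illustrative choices only.
import numpy as np

def jacobian(f, z, eps=1e-6):
    """Central finite-difference Jacobian of f with respect to z."""
    f0 = np.asarray(f(z))
    J = np.zeros((f0.size, z.size))
    for j in range(z.size):
        dz = np.zeros_like(z); dz[j] = eps
        J[:, j] = (np.asarray(f(z + dz)) - np.asarray(f(z - dz))) / (2 * eps)
    return J

def linearise_discretise(f, x0, u0, dt):
    """Return A, B, c of x_{k+1} ≈ A x_k + B u_k + c (Equation 2.1.4)."""
    fx = jacobian(lambda x: f(x, u0), x0)          # ∂f/∂x evaluated at (x0, u0)
    fu = jacobian(lambda u: f(x0, u), u0)          # ∂f/∂u evaluated at (x0, u0)
    A = np.eye(x0.size) + fx * dt                  # A = I + ∂f/∂x ∆
    B = fu * dt                                    # B = ∂f/∂u ∆
    c = (f(x0, u0) - fx @ x0 - fu @ u0) * dt       # constant of linearisation
    return A, B, c

# usage on a pendulum-like toy system (illustrative dynamics, not the thesis benchmarks)
f = lambda x, u: np.array([x[1], -9.81 * np.sin(x[0]) + u[0]])
A, B, c = linearise_discretise(f, np.array([0.1, 0.0]), np.array([0.0]), dt=0.01)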
2.1.2 Studied Time-Invariant Nonlinear Dynamical System
The nonlinear dynamical system I wish to study has a more specific form than the general dynamics of Equation 2.1.1 and is a common representation of nonlinear systems using state Ã(x(t)) ∈ R^{ℓ×ℓ} and control B̃(x(t)) ∈ R^{ℓ×m} weight matrices, nonlinear in the state x(t). It describes the nonlinear evolution of the state with time, forming a sub-set of all possible time-invariant, continuous-time, nonlinear dynamical systems (formally known as input-affine systems)

ẋ(t) = f(x(t), u(t)) = Ã(x(t)) x(t) + B̃(x(t)) u(t)    (2.1.5)

It is evident from Equation 2.1.2 that the matrices A(x(t)) and B(x(t)) are partial derivatives of ẋ(t) with respect to the state x(t) ∈ R^ℓ and control u(t) ∈ R^m variables respectively, thus the derivatives are found using a conventional summation notation with dummy indices, which are bound to each other in an expression in order to avoid ambiguity in tensor contraction. The derivative of the system dynamics ẋ(t) with respect to the state vector x(t), using both the explicit notation of tensor contraction [247] (2.1.6) and the compact notation (2.1.7), is

∂f_i/∂x_j = ∂/∂x_j ∑_{k=1}^{ℓ} Ã_{i,k} x_k + ∂/∂x_j ∑_{k=1}^{m} B̃_{i,k} u_k = ∑_{k=1}^{ℓ} (∂Ã_{i,k}/∂x_j) x_k + Ã_{i,k} (∂x_k/∂x_j) + ∑_{k=1}^{m} (∂B̃_{i,k}/∂x_j) u_k
           = Ã_{i,j} + ∑_{k=1}^{ℓ} (∂Ã_{i,k}/∂x_j) x_k + ∑_{k=1}^{m} (∂B̃_{i,k}/∂x_j) u_k    (2.1.6)

∂f(x(t), u(t))/∂x(t) = ∂[Ã(x(t)) x(t) + B̃(x(t)) u(t)]/∂x(t) = Ã(x(t)) + [∂Ã(x(t))/∂x(t)] x(t) + [∂B̃(x(t))/∂x(t)] u(t)    (2.1.7)

where i, j are the indices for the (i, j)th component of the derivative of the function f ∈ R^ℓ w.r.t. the state x ∈ R^ℓ, which can be expressed using the more compact notation, where ∂Ã(x(t))/∂x(t) ∈ R^{ℓ×ℓ×ℓ} and ∂B̃(x(t))/∂x(t) ∈ R^{ℓ×m×ℓ} are 3-dimensional tensors, while [∂Ã(x(t))/∂x(t)] x(t) ∈ R^{ℓ×ℓ} and [∂B̃(x(t))/∂x(t)] u(t) ∈ R^{ℓ×ℓ} are 2-dimensional matrices. The derivative of the system dynamics ẋ(t) with respect to the control vector u(t), using both the explicit notation (2.1.8) and the compact notation (2.1.9), is

∂f_i/∂u_j = ∂/∂u_j ∑_{k=1}^{m} B̃_{i,k} u_k = B̃_{i,j}    (2.1.8)

∂f(x(t), u(t))/∂u(t) = ∂[Ã(x(t)) x(t) + B̃(x(t)) u(t)]/∂u(t) = B̃(x(t))    (2.1.9)

where i, j are the indices for the (i, j)th component of the derivative of f ∈ R^ℓ w.r.t. the control u ∈ R^m.
Linearisation The nonlinear dynamics of Equation 2.1.5 can be linearised around a point (x_0^(i), u_0^(i)), i = 1, ..., n_a, using a first order Taylor approximation employing the condensed notation of Equations 2.1.7 and 2.1.9 (while bearing in mind the explicit notation of Equations 2.1.6 and 2.1.8)

ẋ(t) = f(x_0^(i), u_0^(i)) + ∂f/∂x(x_0^(i), u_0^(i)) (x(t) − x_0^(i)) + ∂f/∂u(x_0^(i), u_0^(i)) (u(t) − u_0^(i)) + O(|x(t) − x_0^(i)|²) + O(|u(t) − u_0^(i)|²)
     = Ã(x_0^(i)) x_0^(i) + B̃(x_0^(i)) u_0^(i) + [Ã(x_0^(i)) + (∂Ã/∂x(x_0^(i))) x_0^(i) + (∂B̃/∂x(x_0^(i))) u_0^(i)] (x(t) − x_0^(i)) + B̃(x_0^(i)) (u(t) − u_0^(i)) + O(|x(t) − x_0^(i)|²) + O(|u(t) − u_0^(i)|²)    (2.1.10)

Discretisation The continuous-time system can be time-discretised using the formal definition of the derivative (2.1.3); fixing ∆t to a small value ∆, the dynamics ẋ(t) can be written as (x_{k+1} − x_k)/∆, where x_k = x(k × ∆), giving the time-discretised linearised system
(x_{k+1} − x_k)/∆ = Ã(x_0^(i)) x_0^(i) + B̃(x_0^(i)) u_0^(i) + [Ã(x_0^(i)) + (∂Ã/∂x(x_0^(i))) x_0^(i) + (∂B̃/∂x(x_0^(i))) u_0^(i)] (x_k − x_0^(i)) + B̃(x_0^(i)) (u_k − u_0^(i)) + O(|x_k − x_0^(i)|²) + O(|u_k − u_0^(i)|²)

where the terms Ã(x_0^(i)) x_0^(i) and B̃(x_0^(i)) u_0^(i) cancel upon expansion, so that

x_{k+1} = x_k + Ã(x_0^(i)) x_k ∆ + [(∂Ã/∂x(x_0^(i))) x_0^(i)] x_k ∆ + [(∂B̃/∂x(x_0^(i))) u_0^(i)] x_k ∆ + B̃(x_0^(i)) u_k ∆ − [(∂Ã/∂x(x_0^(i))) x_0^(i)] x_0^(i) ∆ − [(∂B̃/∂x(x_0^(i))) u_0^(i)] x_0^(i) ∆ + O(|x_k − x_0^(i)|²) + O(|u_k − u_0^(i)|²)

x_{k+1} = [I + Ã(x_0^(i))∆ + (∂Ã/∂x(x_0^(i))) x_0^(i) ∆ + (∂B̃/∂x(x_0^(i))) u_0^(i) ∆] x_k + [B̃(x_0^(i))∆] u_k + [− (∂Ã/∂x(x_0^(i))) x_0^(i) x_0^(i) ∆ − (∂B̃/∂x(x_0^(i))) u_0^(i) x_0^(i) ∆] + O(|x_k − x_0^(i)|²) + O(|u_k − u_0^(i)|²)

x_{k+1} = A x_k + B u_k + c + O(|x_k − x_0^(i)|²) + O(|u_k − u_0^(i)|²)    (2.1.11)

where A = I + Ã(x_0^(i))∆ + [(∂Ã/∂x(x_0^(i))) x_0^(i)]∆ + [(∂B̃/∂x(x_0^(i))) u_0^(i)]∆, B = B̃(x_0^(i))∆ and c = −[(∂Ã/∂x(x_0^(i))) x_0^(i)] x_0^(i) ∆ − [(∂B̃/∂x(x_0^(i))) u_0^(i)] x_0^(i) ∆, and where ∂Ã/∂x(x_0^(i)) ∈ R^{ℓ×ℓ×ℓ} and ∂B̃/∂x(x_0^(i)) ∈ R^{ℓ×m×ℓ} are 3-dimensional tensors, while [(∂Ã/∂x(x_0^(i))) x_0^(i)] ∈ R^{ℓ×ℓ} and [(∂B̃/∂x(x_0^(i))) u_0^(i)] ∈ R^{ℓ×ℓ} are 2-dimensional matrices, with the explicit notation as per Equation 2.1.6.
If the system is linearised around points x_0^(i) = 0, u_0^(i) = 0 the constant term disappears, reducing to

x_{k+1} = A x_k + B u_k + O(|x_k − x_0^(i)|²) + O(|u_k − u_0^(i)|²)    (2.1.12)

where A = I + Ã(x_0^(i))∆ and B = B̃(x_0^(i))∆.
The aim of a general nonlinear control problem is to find a sequence of optimal controls ū*(x(k), k) ∈ R^{m×n_K−1}, where (̄·) indicates a trajectory. If the first state variable x_1 = x_init is known at time k = k_1, then the sequence x_1, x_2, ..., x_{n_K} is known for time k ≥ k_1, provided the inputs u_1, u_2, ..., u_{n_K−1} to the system are given at each time step k. Note that the control signal of a finite time horizon control system is not applied at the final step n_K.
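The statement above — that the state sequence follows from x_init once the inputs are given — amounts to iterating Equation 2.1.11 forward in time. A minimal Python sketch, assuming A, B and c have already been obtained from a linearisation such as the one sketched earlier, is:

# Minimal sketch: rolling out the time-discretised linearised dynamics of
# Equation 2.1.11 from an initial state, given a control sequence. A, B and c
# are assumed to come from a previous linearisation step.
import numpy as np

def rollout(A, B, c, x_init, u_seq):
    """Return the state trajectory x_1..x_nK generated by u_1..u_{nK-1}."""
    x_traj = [np.asarray(x_init, dtype=float)]
    for u_k in u_seq:                         # the control is not applied at step nK
        x_next = A @ x_traj[-1] + B @ np.atleast_1d(u_k) + c
        x_traj.append(x_next)
    return np.stack(x_traj, axis=1)           # shape: (state dimension) x nK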
2.2 Linear Quadratic Regulator
An analytical control solution of nonlinear dynamical systems is generally unavailable due to the difficulty of solving the Hamilton-Jacobi-Bellman partial differential equation. Closed-form solutions typically exist for linear systems only [9] and are obtained using the Linear Quadratic Regulator (LQR) theory [13, 130]. Even though the control of linear systems may seem of little practical relevance to the solution of real-life nonlinear systems, it plays a crucial role in many state-of-the-art algorithms, which iteratively linearise the dynamics around nominal x̄, ū trajectories while incrementally improving the control solution. Given the importance of LQR control in solving nonlinear control problems (and its application in my algorithm), I now show the full derivation of the feedback gain matrix for a nonzero target, infinite time horizon linear control problem. It deals with systems which are either inherently linear (Equation 2.2.1) or with nonlinear systems which are linearised around a point (Equation 2.1.2). My derivations supplement the partially shown equations of the standard LQR theory (finite horizon, zero target control system) by Emanuel Todorov in [260].
2.2.1 Studied Time-Invariant Linear Dynamical System
Consider the following time-invariant, continuous-time linear dynamical system [260]

ẋ(t) = f(x(t), u(t)) = Ã x(t) + B̃ u(t)    (2.2.1)

where Equation 2.1.2 shows that the Ã and B̃ matrices are the partial derivatives of ẋ(t) with respect to the state x(t) ∈ X ≡ R^ℓ and control u(t) ∈ U ≡ R^m variables respectively; applying the differentiation, ∂ẋ(t)/∂x(t) = ∂[Ãx(t) + B̃u(t)]/∂x(t) = Ã and ∂ẋ(t)/∂u(t) = ∂[Ãx(t) + B̃u(t)]/∂u(t) = B̃ are obtained, as expected for a linear system.
Discretisation The continuous-time system can be time-discretised using the formal definition of the derivative (Equation 2.1.3); fixing ∆t to a small value ∆, the dynamics ẋ(t) can be written as (x_{k+1} − x_k)/∆, where x_k = x(k × ∆), giving the time-discretised linear system

x_{k+1} = x_k + Ã x_k ∆ + B̃ u_k ∆ = (I + Ã∆) x_k + (B̃∆) u_k = A x_k + B u_k    (2.2.2)

where A = I + (∂ẋ/∂x)∆ ∈ R^{ℓ×ℓ} and B = (∂ẋ/∂u)∆ ∈ R^{ℓ×m} are the time-invariant, discrete-time, state-independent system and control gain matrices.
2.2.2 Infinite Horizon, Nonzero Target Control Solution
The Linear Quadratic Regulator operates in a discrete-time setting, enabling calculations on digital computers, and provides continuous optimal control u*(k) for linear systems (Equation 2.2.2) using the information about the state x(k) at each time step k. The control of linear systems with LQR inherently assumes a target state of zero in its derivation of the analytical optimal control solution (steady state solution) [78], however most real-life systems require control to a nonzero target state x* (where * indicates 'target'). Moreover, many tasks require control of an indefinite time duration, e.g. control of a hovering helicopter above a target location or stabilisation of a pole on a moving cart, i.e. they do not have a naturally defined control end point, requiring an infinite time horizon solution.
Nonzero Target State Setting The objective of a nonzero target LQR optimal control solution is to move the system to and keep it at a nonzero target state, i.e. to attain x_k = x_{k+1} = x*. For notational convenience I define a change of variables x̃(k) = x(k) − x*, such that in this shifted coordinate system the target is at zero, x̃* = 0, with the linearised dynamics

x̃_{k+1} = A x̃_k + B u_k    (2.2.3)

The objective is to find an infinite horizon feedback gain matrix L_∞ ∈ R^{m×ℓ}, which allows an optimal control trajectory ū*(x̃(k), k) ∈ R^{m×n_K−1} to be calculated for each time step given an initial state x_init, where n_K = T/∆ is the total number of simulation time steps and * indicates 'optimality'.
Infinite Time Horizon Setting In the infinite horizon setting the LQR seeks to minimise an infinite horizon quadratic cost function / performance index / objective function J(x̃(k), u(k)) ∈ R_{≥0} of the state and control, which captures the objective of moving the system to the target in a manner defined by the weight matrices and which, in the transformed coordinate system, takes the form

J(x̃(k), u(k)) = ½ x̃ᵀ_{ñ_K} W_{ñ_K} x̃_{ñ_K} + ½ ∑_{k=1}^{ñ_K−1} (x̃ᵀ_k W x̃_k + uᵀ_k Z u_k)    (2.2.4)

with real symmetric positive semi-definite state weight matrix W ∈ R^{ℓ×ℓ} ≥ 0 and real symmetric positive definite input (control) weight matrix Z ∈ R^{m×m} > 0, specifying the penalty on the state deviation from target and the penalty on the control size respectively.
There are several ways of calculating the infinite horizon feedback gain matrix; I choose to use a finite but sufficiently large number of iteration time steps, ñ_K ≫ n_K, to approximate the infinite horizon matrix L_∞ ∈ R^{m×ℓ} [152]. In the infinite horizon setting the final state penalty at time step ñ_K is imposed by W_{ñ_K}, which in the infinite time horizon is equal to the incremental weight matrix, W = W_{ñ_K}.
Feedback Gain Derivation via the Bellman Equation Bellman's Principle of Optimality (see Definition 1) leads to the Bellman Equation used in dynamic programming which, adjusted for the new shifted coordinate system, takes the form of

v(x̃) = min_{u∈U(x̃)} [ cost(x̃, u) + v(next(x̃, u)) ]    (2.2.5)

where cost(x̃, u) is the incremental cost of a single period from Equation 2.2.4, next(x̃, u) gives the variables at the next time step, and v(x̃) ∈ R_{≥0} is the optimal value function / optimal cost-to-go function – the minimal total cost for controlling the system from state x̃ to target x̃*. The value function is guessed to have a quadratic form

v(x̃(k), k) = ½ x̃ᵀ_k V_k x̃_k    (2.2.6)

where V(k) ∈ R^{ℓ×ℓ} is a real symmetric positive semi-definite square matrix. Substituting the cost(·) function (Equation 2.2.4), the quadratic value function v(·) (Equation 2.2.6) and the expression for next(·) (Equation 2.2.3) into the Bellman Equation (2.2.5) gives

v(x̃_k, k) = ½ x̃ᵀ_k V_k x̃_k = min_{u∈U(x̃)} [ cost_k(x̃_k, u_k) + v_{k+1}(x̃_{k+1}) ] = min_{u∈U(x̃)} [ J_k(x̃_k, u_k) + ½ x̃ᵀ_{k+1} V_{k+1} x̃_{k+1} ]
          = min_{u∈U(x̃)} [ ½ x̃ᵀ_k W x̃_k + ½ uᵀ_k Z u_k + ½ (A x̃_k + B u_k)ᵀ V_{k+1} (A x̃_k + B u_k) ]
          = ½ min_{u∈U(x̃)} [ x̃ᵀ_k W x̃_k + uᵀ_k Z u_k + x̃ᵀ_k Aᵀ V_{k+1} A x̃_k + x̃ᵀ_k Aᵀ V_{k+1} B u_k + uᵀ_k Bᵀ V_{k+1} A x̃_k + uᵀ_k Bᵀ V_{k+1} B u_k ]    (2.2.7)

An analytical result for the optimal control law u*(x̃(k), k) is found by setting the gradient of the Hamiltonian (the term inside the brackets of Equation 2.2.7) with respect to the control to zero

0 = ½ (2uᵀ_k Z) + ½ x̃ᵀ_k Aᵀ V_{k+1} B + ½ x̃ᵀ_k Aᵀ V_{k+1} B + ½ (2uᵀ_k Bᵀ V_{k+1} B)
0 = uᵀ_k Z + x̃ᵀ_k Aᵀ V_{k+1} B + uᵀ_k Bᵀ V_{k+1} B
−uᵀ_k Z − uᵀ_k Bᵀ V_{k+1} B = x̃ᵀ_k Aᵀ V_{k+1} B
−(Zᵀ + Bᵀ V_{k+1} B) u_k = Bᵀ V_{k+1} A x̃_k
u*_k = −(Z + Bᵀ V_{k+1} B)^{−1} Bᵀ V_{k+1} A x̃_k    (2.2.8)

where V(k) = V(k)ᵀ and Z = Zᵀ are symmetric, u*(x̃(k), k) ∈ U(x̃) is the optimal control law calculated using the state x̃_k at each time step k, and U(x̃) is a convex space of admissible controls.
The recursive relationship for $V(k)$ is found by substituting the optimal control $u^*\big(\tilde{x}(k), k\big)$ of Equation 2.2.8 back into Equation 2.2.7. Writing $L_k = (Z + B^T V_{k+1} B)^{-1} B^T V_{k+1} A$, so that $u_k^* = -L_k\tilde{x}_k$, the factor $\tfrac{1}{2}$ and the quadratic form $\tilde{x}_k^T(\cdot)\tilde{x}_k$ appear on both sides and cancel, leaving
\[
V_k = W + L_k^T Z L_k + \big(A - BL_k\big)^T V_{k+1}\big(A - BL_k\big)
    = W + A^T V_{k+1} A - A^T V_{k+1} B L_k - L_k^T B^T V_{k+1} A + L_k^T\big(Z + B^T V_{k+1}B\big)L_k
\]
where $\big((Z + B^T V_{k+1} B)^{-1}\big)^T = \big(Z^T + B^T V_{k+1}^T B\big)^{-1} = (Z + B^T V_{k+1} B)^{-1}$, since $Z = Z^T$ and $V(k) = V(k)^T$ are both symmetric, square matrices. Factorising out the common term $A^T V_{k+1} B(Z + B^T V_{k+1} B)^{-1} B^T V_{k+1} A$, note that $L_k^T\big(Z + B^T V_{k+1}B\big)L_k = A^T V_{k+1} B\big(Z + B^T V_{k+1}B\big)^{-1}B^T V_{k+1}A = A^T V_{k+1} B L_k = L_k^T B^T V_{k+1} A$, so the last three terms collapse to $-A^T V_{k+1} B\big(Z + B^T V_{k+1} B\big)^{-1} B^T V_{k+1} A$. The recursive relationship for $V(k)$ is therefore defined by the discrete-time Dynamic Riccati Equation³
\[
V_k = W + A^T V_{k+1} A - A^T V_{k+1} B\big(Z + B^T V_{k+1} B\big)^{-1} B^T V_{k+1} A \tag{2.2.9}
\]
which does not have an explicit dependence on $x(k)$, hence it can be calculated off-line.

³The Algebraic Riccati Equation arises in the continuous/discrete time infinite horizon setting, and can be solved by matrix factorisations, where $V_k = V_{k+1} = V_\infty$, or by iterating the Dynamic Riccati Equation (finite horizon setting) until it converges. I use the latter approach in my derivation of the recursive relationship for the value function.

The discrete-time Dynamic Riccati Equation 2.2.9 is iterated backwards in time for a very large number of steps $k = \tilde{n}_K, \ldots, 1$, where $V_{\tilde{n}_K} = W_{\tilde{n}_K} = W$, and the value function is used to obtain the feedback gain matrix $L\big(V(k), k\big) \in \mathbb{R}^{m\times\ell\times\tilde{n}_K - 1}$ at every step, $L_k = (Z + B^T V_{k+1} B)^{-1} B^T V_{k+1} A$, as defined by Equation 2.2.8. The feedback gain matrix obtained at the first time step $k = 1$ is guaranteed to converge [156] to the infinite horizon feedback gain matrix $L_\infty \in \mathbb{R}^{m\times\ell}$, $\lim_{\tilde{n}_K\to\infty} L_1 \to L_\infty$, hence
\[
L_\infty \approx L_1 = \big(Z + B^T V_2 B\big)^{-1} B^T V_2 A \tag{2.2.10}
\]
\[
u_k^* = -L_\infty \tilde{x}_k \tag{2.2.11}
\]
The matrix $L_\infty$ is used to find the optimal control sequence $\bar{u}^*\big(x(k), k\big) \in \mathbb{R}^{m\times n_K - 1}$ at every step $k = 1, \ldots, n_K - 1$, generating an optimal state trajectory $\bar{x}^*(k) \in \mathbb{R}^{\ell\times n_K}$ for a finite number of simulation time steps $n_K$, using the adapted Equation 2.2.8.
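As a hedged illustration of the procedure just described, the following sketch (all system and weight matrices are arbitrary placeholders and the convergence tolerance is my own choice, not values used in this thesis) approximates $L_\infty$ by iterating the Dynamic Riccati Equation 2.2.9 backwards until the value matrix stops changing, and then applies $u_k = -L_\infty(x_k - x^*)$ in the shifted coordinates of Equations 2.2.3 and 2.2.11.

```python
# Minimal sketch: infinite horizon LQR gain via backward Riccati iteration.
import numpy as np

def lqr_gain(A, B, W, Z, n_iter=5000, tol=1e-10):
    """Approximate L_inf by iterating Equation 2.2.9 with V_{n_K} = W."""
    V = W.copy()
    for _ in range(n_iter):
        L = np.linalg.solve(Z + B.T @ V @ B, B.T @ V @ A)   # gain at this step
        V_new = W + A.T @ V @ A - A.T @ V @ B @ L           # Riccati update
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return np.linalg.solve(Z + B.T @ V @ B, B.T @ V @ A)    # Eq. 2.2.10 form

# placeholder 2-state, 1-control discrete-time system and weights
A = np.array([[1.0, 0.01], [0.0, 0.995]])
B = np.array([[0.0], [0.01]])
W = np.diag([1.0, 0.1])
Z = np.array([[0.01]])
L_inf = lqr_gain(A, B, W, Z)

x_target = np.array([1.0, 0.0])          # nonzero target x*
x = np.zeros(2)
for k in range(500):
    u = -L_inf @ (x - x_target)          # u_k = -L_inf * x_tilde_k (Eq. 2.2.11)
    x = A @ x + B @ u
```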
2.3 Iterative Linear Quadratic Gaussian

Iterative Linear Quadratic Gaussian control (iLQG) [263] is the state-of-the-art algorithm for iterative approximation of the control policy in stochastic nonlinear optimal control problems. I use the deterministic version of this algorithm (iLQR) [162] as a benchmark for comparing my deterministic control solution of nonlinear dynamical systems. In this section I present the full derivations⁴ of the key equations shown (but not explicitly derived) in the original paper by Todorov et al. [263], derivations which I have not found anywhere else and which in some cases I have found to be incorrect elsewhere, such as in [255]⁵. Please see Subsection 1.4.1 for an overview of the principles behind the iLQR and iLQG.

⁴The RLOC operates in a deterministic setting, however it could be extended to handling stochastic systems. With this in mind, I derive the full stochastic equations for iLQG, which can be reduced to a deterministic setting if necessary.

⁵The following equations in Chapter IV of the paper [255] contain mistakes: $\delta x_{t_{k+1}} + \bar{x}_{t_{k+1}}$, $\delta x_{t_{k+1}}$ and $C_{i,k}$.

Consider a continuous-time nonlinear dynamical system described by a stochastic differential equation possessing multiplicative control and state noise terms
\[
dx(t) = f\big(x(t), u(t)\big)dt + F\big(x(t), u(t)\big)dW_t \tag{2.3.1}
\]
where $x(t) \in \mathbb{R}^{\ell}$ is the system state, $u(t) = \pi\big(x(t), t\big) \in \mathbb{R}^m$ is the deterministic control, $\pi\big(x(t), t\big)$ is the control policy / law, $f\big(x(t), u(t)\big) \in \mathbb{R}^{\ell}$ is the drift term, $F\big(x(t), u(t)\big) \in \mathbb{R}^{\ell\times p}$ is the diffusion coefficient, the stochastic term $W_t \sim \mathcal{N}(0, tI_p) \in \mathbb{R}^p$ is the standard Brownian Motion / Wiener Process, $dW_t \sim \mathcal{N}(0, dt\,I_p) \in \mathbb{R}^p$ is the stochastic differential into the future, $T$ is a specified final time, $l\big(t, x(t), u(t)\big) \in \mathbb{R}^{\geq 0}$ is the instantaneous cost and $h\big(x(T)\big) \in \mathbb{R}^{\geq 0}$ is the final cost at time $T$. The time-discretised nonlinear system dynamics take on the following form
\[
x_{k+1} = x_k + f(x_k, u_k)\Delta + F(x_k, u_k)\,\varepsilon_k\sqrt{\Delta} \tag{2.3.2}
\]
where $\Delta = \frac{T}{n_K}$ is the discretisation time step, $x_k = x(k\times\Delta)$, $\varepsilon_k \sim \mathcal{N}(0, I_p) \in \mathbb{R}^p$ is a vector of standard normal random variables and $\varepsilon_k\sqrt{\Delta} = dW \sim \mathcal{N}(0, \Delta I_p)$.

Each iteration of the iLQG algorithm involves the following calculations:

• An open-loop control sequence $\bar{u}^{(i)}(k) \in \mathbb{R}^{m\times n_K - 1}$, at iteration number $i = 1, \ldots, N$ and discretisation time step number $k = 1, \ldots, n_K - 1$, is applied to the discrete-time deterministic dynamics one time step at a time, as defined by Equation 2.3.3 and initiated with $x_{\mathrm{init}}$ at time step $k = 1$, producing a corresponding state trajectory $\bar{x}^{(i)}(k) \in \mathbb{R}^{\ell\times n_K}$ (a short sketch of this rollout step follows the list below)
\[
\bar{x}_{k+1}^{(i)} = \bar{x}_k^{(i)} + f\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\Delta \tag{2.3.3}
\]
where the stochastic term is omitted because the applied control is deterministic by definition.

• The nonlinear time-discretised system is linearised around $\bar{x}^{(i)}(k)$ and $\bar{u}^{(i)}(k)$, building a local modified LQG approximation to the original nonlinear problem, where the new linear dynamics and quadratic costs are expressed in terms of improvements
\[
\delta x_k = x_k - \bar{x}_k^{(i)}, \qquad \delta u_k = u_k - \bar{u}_k^{(i)} \tag{2.3.4}
\]

• A near-optimal affine control law⁶ $\delta u_k = l_k + L_k\delta x_k$ is obtained via dynamic programming and is applied forward in time to the linearised system $\delta x_{k+1} = A_k\delta x_k + B_k\delta u_k$, initiated at $\delta x_1 = 0$. The new open-loop controls $\bar{u}_k^{(i+1)} = \bar{u}_k^{(i)} + \delta u_k$ are computed along the way, enforcing $\bar{u}^{(i+1)}(k) \in \mathcal{U}(x)$. Once $\bar{u}^{(i+1)}(k)$ and $\bar{u}^{(i)}(k)$ are sufficiently close, the iteration ends.

The open-loop component $l_k$ makes the algorithm iterative, as without it $\delta x_k = 0$ and $\delta u_k = 0$.

⁶The control law is only approximately optimal because of the approximation of the optimal controller for nonlinear systems in the vicinity of $\bar{x}(k)$, especially in the presence of control constraints and non-convex costs [263].
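The nominal rollout of the first bullet (Equation 2.3.3) can be sketched in a few lines. The drift $f$ below is an assumed toy example (a damped pendulum), not one of the benchmark systems of this thesis, and the horizon and initial state are illustrative.

```python
# Minimal sketch: rolling an open-loop control sequence through the
# deterministic time-discretised dynamics (Equation 2.3.3).
import numpy as np

def f(x, u):
    """Assumed example drift: damped pendulum, x = (angle, angular velocity)."""
    return np.array([x[1], -9.81 * np.sin(x[0]) - 0.1 * x[1] + u[0]])

dt = 0.01                       # discretisation step Delta
n_K = 200                       # number of time steps
x_init = np.array([np.pi, 0.0])
u_bar = np.zeros((n_K - 1, 1))  # open-loop sequence u_bar^{(i)}

x_bar = np.zeros((n_K, 2))      # nominal trajectory x_bar^{(i)}
x_bar[0] = x_init
for k in range(n_K - 1):
    x_bar[k + 1] = x_bar[k] + f(x_bar[k], u_bar[k]) * dt   # Equation 2.3.3
```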
2.3.1 Local Linear Quadratic Gaussian Approximation

The iLQG approximates the original nonlinear optimal control problem by reducing it to a local linear LQG problem at each point along the nominal $\bar{x}^{(i)}(k)$ and $\bar{u}^{(i)}(k)$ trajectories. An approximately optimal control law can then be computed by constructing a convergent sequence of such LQG approximations.

Linearisation and Time-Discretisation of System Dynamics. The discrete-time dynamics of Equation 2.3.2, expressing the actual states $x(k)$ and controls $u(k)$, can be written in terms of the nominal trajectories to which small improvements ("deviations") are applied
\[
\underbrace{\bar{x}_{k+1}^{(i)} + \delta x_{k+1}}_{x_{k+1}} = \underbrace{\bar{x}_k^{(i)} + \delta x_k}_{x_k} + \underbrace{f\big(\bar{x}_k^{(i)} + \delta x_k,\, \bar{u}_k^{(i)} + \delta u_k\big)}_{f(x_k, u_k)}\Delta + \underbrace{F\big(\bar{x}_k^{(i)} + \delta x_k,\, \bar{u}_k^{(i)} + \delta u_k\big)}_{F(x_k, u_k)}\varepsilon_k\sqrt{\Delta} \tag{2.3.5}
\]
The nonlinear terms $f\big(x(k), u(k)\big)$ and $F\big(x(k), u(k)\big)$ of Equation 2.3.5 can be linearised around the nominal trajectories $\bar{x}^{(i)}(k)$, $\bar{u}^{(i)}(k)$ with deviations $\delta x(k)$, $\delta u(k)$ using a first order Taylor approximation
\[
f\big(\bar{x}_k^{(i)} + \delta x_k, \bar{u}_k^{(i)} + \delta u_k\big) = f\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big) + \frac{\partial f}{\partial x}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\delta x_k + \frac{\partial f}{\partial u}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\delta u_k \tag{2.3.6}
\]
\[
F\big(\bar{x}_k^{(i)} + \delta x_k, \bar{u}_k^{(i)} + \delta u_k\big) = F\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big) + \frac{\partial F}{\partial x}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\delta x_k + \frac{\partial F}{\partial u}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\delta u_k \tag{2.3.7}
\]
The time-discretised linearised dynamics are written in terms of the improvements $\delta(\cdot)$ using Equation 2.3.4, by substituting Equations 2.3.6 and 2.3.7 into Equation 2.3.5 and subtracting Equation 2.3.3 (the nominal terms $\bar{x}_{k+1}^{(i)} = \bar{x}_k^{(i)} + f(\bar{x}_k^{(i)}, \bar{u}_k^{(i)})\Delta$ cancel), thus creating a local approximation of the original problem in LQG form
\begin{align}
\delta x_{k+1} &= \Big(I + \frac{\partial f}{\partial x}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\Delta\Big)\delta x_k + \Big(\frac{\partial f}{\partial u}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\Delta\Big)\delta u_k \nonumber\\
&\quad + \sqrt{\Delta}\,F\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\varepsilon_k + \sqrt{\Delta}\,\frac{\partial F}{\partial x}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\delta x_k\,\varepsilon_k + \sqrt{\Delta}\,\frac{\partial F}{\partial u}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\delta u_k\,\varepsilon_k \nonumber\\
&= A_k\delta x_k + B_k\delta u_k + C_k(\delta x_k, \delta u_k)\varepsilon_k \tag{2.3.8}
\end{align}
where the system control matrix $A_k = I + \frac{\partial f}{\partial x}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\Delta$ and the control gain matrix $B_k = \frac{\partial f}{\partial u}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\Delta$ are both time dependent, since at each time step $k$ the linearisation point $\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)$ around which the dynamics are linearised varies; the state and control dependent noise transition matrix $C(k) \in \mathbb{R}^{\ell\times p}$ is defined column-wise as $C_k(\delta x_k, \delta u_k) = \big[c_{1,k}^F + C_{1,k}^x\delta x_k + C_{1,k}^u\delta u_k,\ \ldots,\ c_{p,k}^F + C_{p,k}^x\delta x_k + C_{p,k}^u\delta u_k\big]$, where the vector $c_{j,k}^F = \sqrt{\Delta}\,F^{[j]}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big) \in \mathbb{R}^{\ell}$, the matrix $C_{j,k}^x = \sqrt{\Delta}\,\frac{\partial F^{[j]}}{\partial x}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big) \in \mathbb{R}^{\ell\times\ell}$, the matrix $C_{j,k}^u = \sqrt{\Delta}\,\frac{\partial F^{[j]}}{\partial u}\big(\bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big) \in \mathbb{R}^{\ell\times m}$, and $F^{[j]}$ is the $j$th column of the matrix $F$.
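Since $A_k$ and $B_k$ only require the Jacobians of $f$ at the nominal point, they can be approximated numerically when analytical derivatives are inconvenient. The sketch below is my own illustration (central finite differences on the same assumed toy drift as above); it is not the scheme used in this thesis, where the derivatives are obtained analytically.

```python
# Minimal sketch: A_k = I + (df/dx) * dt, B_k = (df/du) * dt around a nominal
# pair (x_bar_k, u_bar_k), via central finite differences (Equation 2.3.8).
import numpy as np

def f(x, u):
    """Assumed example drift: damped pendulum, x = (angle, angular velocity)."""
    return np.array([x[1], -9.81 * np.sin(x[0]) - 0.1 * x[1] + u[0]])

def linearise(f, x0, u0, dt, eps=1e-6):
    n, m = x0.size, u0.size
    dfdx = np.zeros((n, n))
    dfdu = np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        dfdx[:, i] = (f(x0 + dx, u0) - f(x0 - dx, u0)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        dfdu[:, j] = (f(x0, u0 + du) - f(x0, u0 - du)) / (2 * eps)
    A_k = np.eye(n) + dfdx * dt     # system matrix of the local model
    B_k = dfdu * dt                 # control gain matrix of the local model
    return A_k, B_k

A_k, B_k = linearise(f, np.array([np.pi, 0.0]), np.array([0.0]), dt=0.01)
```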
Linearisation and Time-Discretisation of Cost Function. The second order Taylor approximation of the time-discretised, instantaneous continuous cost $l\big(t, x(t), u(t)\big) \geq 0$ around the nominal trajectories $\bar{x}^{(i)}(k)$, $\bar{u}^{(i)}(k)$, with deviations $\delta x(k)$, $\delta u(k)$ and discretisation time step $\Delta = \frac{T}{n_K}$, is
\begin{align}
l_k &= l\big(k, \bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\Delta + \frac{\partial l}{\partial x}\delta x_k\Delta + \frac{\partial l}{\partial u}\delta u_k\Delta + \frac{1}{2}\delta x_k^T\frac{\partial^2 l}{\partial x\partial x}\delta x_k\Delta + \frac{1}{2}\delta u_k^T\frac{\partial^2 l}{\partial u\partial u}\delta u_k\Delta + \delta u_k^T\frac{\partial^2 l}{\partial u\partial x}\delta x_k\Delta \nonumber\\
&= q_k + \delta x_k^T\mathbf{q}_k + \frac{1}{2}\delta x_k^T W_k\delta x_k + \delta u_k^T\mathbf{r}_k + \frac{1}{2}\delta u_k^T Z_k\delta u_k + \delta u_k^T P_k\delta x_k \tag{2.3.9}
\end{align}
where all derivatives are evaluated at $\big(k, \bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)$, and where $q_k = l\big(k, \bar{x}_k^{(i)}, \bar{u}_k^{(i)}\big)\Delta \in \mathbb{R}^{\geq 0}$, $\mathbf{q}_k = \frac{\partial l}{\partial x}\Delta \in \mathbb{R}^{\ell}$, $W_k = \frac{\partial^2 l}{\partial x\partial x}\Delta \in \mathbb{R}^{\ell\times\ell}$, $Z_k = \frac{\partial^2 l}{\partial u\partial u}\Delta \in \mathbb{R}^{m\times m}$, $\mathbf{r}_k = \frac{\partial l}{\partial u}\Delta \in \mathbb{R}^m$, $P_k = \frac{\partial^2 l}{\partial u\partial x}\Delta \in \mathbb{R}^{m\times\ell}$ and $k = 1, \ldots, n_K - 1$.

The second order Taylor approximation of the final cost at time step $n_K$ results in (note this cost is not incremental and therefore does not contain the $\Delta$ term)
\[
h(x_{n_K}) = h\big(\bar{x}_{n_K}^{(i)}\big) + \frac{\partial h}{\partial x}\big(\bar{x}_{n_K}^{(i)}\big)\delta x_{n_K} + \frac{1}{2}\delta x_{n_K}^T\frac{\partial^2 h}{\partial x\partial x}\big(\bar{x}_{n_K}^{(i)}\big)\delta x_{n_K} = q_{n_K} + \delta x_{n_K}^T\mathbf{q}_{n_K} + \frac{1}{2}\delta x_{n_K}^T W_{n_K}\delta x_{n_K} \tag{2.3.10}
\]
where $q_{n_K} = h$, $\mathbf{q}_{n_K} = \frac{\partial h}{\partial x}$ and $W_{n_K} = \frac{\partial^2 h}{\partial x\partial x}$. Note the control related terms $\mathbf{r}_{n_K}$, $Z_{n_K}$, $P_{n_K}$ do not appear in the derivation of the final cost, since no control is applied at time step $n_K$.

2.3.2 Computing The Cost-To-Go Function

The optimal control $u^*\big(\delta x(k), k\big)$ minimises the value (cost-to-go) function $v\big(x(k), k\big)$, assumed to be quadratic, which is defined as the cost expected to accumulate while controlling the system from a starting state $x(k)$ to a target $x^*$ using a policy $\pi\big(x(k), k\big)$. In the new local LQG approximation setting (Equation 2.3.8), the optimal control improvement $\delta u^*\big(\delta x(k), k\big)$ minimises the still quadratic cost-to-go function $v(\delta x, k)$ for every $\delta x(k)$. The quadratic form of $v(\delta x, k)$ is proven by backwards induction using the Bellman Equation
\[
v_k(\delta x_k) = \min_{\delta u}\Big[l_k(\delta x_k, \delta u_k) + E\big[v_{k+1}(\delta x_{k+1})\big]\Big] \tag{2.3.11}
\]
\[
v_k(\delta x_k) = s_k + \delta x_k^T\mathbf{s}_k + \frac{1}{2}\delta x_k^T S_k\delta x_k \tag{2.3.12}
\]
The expectation $E\big[v_{k+1}(\delta x_{k+1})\big]$ is found by substituting $\delta x_{k+1}$ (Equation 2.3.8) into Equation 2.3.12, changing notation from $\delta u(k)$ to $\pi(k)$ and assuming control dependent noise only, $C_k(\pi_k)\varepsilon_k$,
\begin{align}
E[\cdot] &= E\Big[v_{k+1}\big(A_k\delta x_k + B_k\pi_k + C_k(\pi_k)\varepsilon_k\big)\Big] \nonumber\\
&= s_{k+1} + \big(A_k\delta x_k + B_k\pi_k\big)^T\mathbf{s}_{k+1} + \frac{1}{2}\big(A_k\delta x_k + B_k\pi_k\big)^T S_{k+1}\big(A_k\delta x_k + B_k\pi_k\big) \nonumber\\
&\quad + \frac{1}{2}E\Big[\big(C_k(\pi_k)\varepsilon_k\big)^T S_{k+1}\big(C_k(\pi_k)\varepsilon_k\big)\Big] \tag{2.3.13}
\end{align}
where the terms linear in $\varepsilon_k$ vanish since $E[\varepsilon_k] = 0$, and where $E[(\cdot)] = E[\mathrm{trace}(\cdot)] = \mathrm{trace}\,E[(\cdot)]$, since a quadratic form is a scalar quantity and both $E$ and trace are linear operators.
The last expectation term of Equation 2.3.13 can be re-written using the rule $E[z^T\Lambda z] = \mathrm{tr}(\Sigma\Lambda) + \mu^T\Lambda\mu$, where $\mu$ is the mean and $\Sigma$ is the variance-covariance matrix of $z$, and where tr denotes a trace. Since $E\big[C_k(\pi_k)\varepsilon_k\big] = 0$, only the variance term survives; using $\mathrm{Var}(Az + a) = A\,\mathrm{Var}(z)A^T$, $\mathrm{Var}(\varepsilon_k) = I_p$ and $\mathrm{trace}(UV) = \mathrm{trace}(VU)$,
\begin{align}
\frac{1}{2}E\Big[\big(C_k(\pi_k)\varepsilon_k\big)^T S_{k+1}C_k(\pi_k)\varepsilon_k\Big]
&= \frac{1}{2}\mathrm{tr}\Big(C_k(\pi_k)C_k(\pi_k)^T S_{k+1}\Big) = \frac{1}{2}\sum_{i=1}^{p}\big(c_{i,k}^F + C_{i,k}^u\pi_k\big)^T S_{k+1}\big(c_{i,k}^F + C_{i,k}^u\pi_k\big) \nonumber\\
&= \frac{1}{2}\sum_{i=1}^{p}c_{i,k}^{F\,T} S_{k+1}c_{i,k}^F + \pi_k^T\sum_{i=1}^{p}C_{i,k}^{u\,T} S_{k+1}c_{i,k}^F + \frac{1}{2}\pi_k^T\Big(\sum_{i=1}^{p}C_{i,k}^{u\,T} S_{k+1}C_{i,k}^u\Big)\pi_k \tag{2.3.14}
\end{align}
The expression for the cost-to-go function can now be obtained by substituting the results of Equations 2.3.9, 2.3.13 and 2.3.14 into the Bellman Equation 2.3.11. Collecting the constant, linear and quadratic terms in $\delta x_k$ and $\pi_k$ gives
\begin{align}
v_k(\delta x_k) = \min_{\delta\pi}\Big[&\,q_k + s_{k+1} + \frac{1}{2}\sum_{i=1}^{p}c_{i,k}^{F\,T} S_{k+1}c_{i,k}^F + \delta x_k^T\big(\mathbf{q}_k + A_k^T\mathbf{s}_{k+1}\big) + \frac{1}{2}\delta x_k^T\big(W_k + A_k^T S_{k+1}A_k\big)\delta x_k \nonumber\\
&+ \pi_k^T\big(\mathbf{g}_k + G_k\delta x_k\big) + \frac{1}{2}\pi_k^T H_k\pi_k\Big] \tag{2.3.15}
\end{align}
where $\pi(k) = \delta u(k)$, and where the variables $\mathbf{g}(k)$, $G(k)$ and $H(k)$ at time step $k$ are defined as
\begin{align}
\mathbf{g}_k &= \mathbf{r}_k + B_k^T\mathbf{s}_{k+1} + \sum_{i=1}^{p}C_{i,k}^{u\,T} S_{k+1}c_{i,k}^F \tag{2.3.16}\\
G_k &= P_k + B_k^T S_{k+1}A_k \tag{2.3.17}\\
H_k &= Z_k + B_k^T S_{k+1}B_k + \sum_{i=1}^{p}C_{i,k}^{u\,T} S_{k+1}C_{i,k}^u \tag{2.3.18}
\end{align}
Using the result $\delta u_k = \ell_k + L_k\delta x_k$ of Subsection 2.3.3, the proof of the quadratic value function is completed by substituting $\pi_k = \ell_k + L_k\delta x_k$ into the last two terms of Equation 2.3.15
\begin{align}
\pi_k^T(\mathbf{g}_k + G_k\delta x_k) + \frac{1}{2}\pi_k^T H_k\pi_k
&= \big(\ell_k + L_k\delta x_k\big)^T\big(\mathbf{g}_k + G_k\delta x_k\big) + \frac{1}{2}\big(\ell_k + L_k\delta x_k\big)^T H_k\big(\ell_k + L_k\delta x_k\big) \nonumber\\
&= \ell_k^T\mathbf{g}_k + \frac{1}{2}\ell_k^T H_k\ell_k + \delta x_k^T\big(G_k^T\ell_k + L_k^T\mathbf{g}_k + L_k^T H_k\ell_k\big) \nonumber\\
&\quad + \frac{1}{2}\delta x_k^T\big(L_k^T G_k + G_k^T L_k + L_k^T H_k L_k\big)\delta x_k \tag{2.3.19}
\end{align}
The value function $v\big(\delta x(k), k\big)$ of the new LQG setting, computed with deviations $\delta x(k)$, $\delta u(k)$, is thus proven to remain in the quadratic form of Equation 2.3.12, with constant, linear and quadratic terms
\begin{align}
v_k(\delta x_k) &= \underbrace{q_k + s_{k+1} + \frac{1}{2}\sum_{i=1}^{p}c_{i,k}^{F\,T}S_{k+1}c_{i,k}^F + \frac{1}{2}\ell_k^T H_k\ell_k + \ell_k^T\mathbf{g}_k}_{s_k}
 + \delta x_k^T\underbrace{\big(\mathbf{q}_k + A_k^T\mathbf{s}_{k+1} + L_k^T H_k\ell_k + L_k^T\mathbf{g}_k + G_k^T\ell_k\big)}_{\mathbf{s}_k} \nonumber\\
&\quad + \frac{1}{2}\delta x_k^T\underbrace{\big(W_k + A_k^T S_{k+1}A_k + L_k^T H_k L_k + L_k^T G_k + G_k^T L_k\big)}_{S_k}\delta x_k \tag{2.3.20}
\end{align}
where $S(k)$ is square and symmetric. At the final time step $n_K$ the cost-to-go parameters are set to $S_{n_K} = W_{n_K} \in \mathbb{R}^{\ell\times\ell}$, $\mathbf{s}_{n_K} = \mathbf{q}_{n_K} \in \mathbb{R}^{\ell}$, $s_{n_K} = q_{n_K} \in \mathbb{R}^{\geq 0}$, and for time $k < n_K$ they can be computed recursively
\begin{align}
s_k &= q_k + s_{k+1} + \frac{1}{2}\sum_{i=1}^{p}c_{i,k}^{F\,T}S_{k+1}c_{i,k}^F + \frac{1}{2}\ell_k^T H_k\ell_k + \ell_k^T\mathbf{g}_k \tag{2.3.21}\\
\mathbf{s}_k &= \mathbf{q}_k + A_k^T\mathbf{s}_{k+1} + L_k^T H_k\ell_k + L_k^T\mathbf{g}_k + G_k^T\ell_k \tag{2.3.22}\\
S_k &= W_k + A_k^T S_{k+1}A_k + L_k^T H_k L_k + L_k^T G_k + G_k^T L_k \tag{2.3.23}
\end{align}

2.3.3 Computing The Control Law

The goal of the iLQG is to approximate the optimal controller (which is noise-free, i.e. $\varepsilon_k = 0\ \forall k$) for the nonlinear system in the vicinity of the state trajectory $\bar{x}(k)^{(i)}$ at the $i$th iteration of the algorithm. Consider that the cost-to-go function $v\big(\delta x(k), k\big)$ of Equation 2.3.15 depends on the control $\pi(k) = \delta u(k)$ through the term
\[
a_{\mathrm{ilqg}}(\delta u_k, \delta x_k) = \delta u_k^T\big(\mathbf{g}_k + G_k\delta x_k\big) + \frac{1}{2}\delta u_k^T H_k\delta u_k \tag{2.3.24}
\]
The value of $\delta u(k)$ which minimises the value function $v\big(\delta x(k), k\big)$ is found by setting the gradient with respect to $\delta u$ to zero
\begin{align}
\frac{\partial v\big(\delta x(k), k\big)}{\partial\delta u} = \mathbf{g}_k^T + \delta x_k^T G_k^T + \delta u_k^T H_k &= 0 \nonumber\\
H_k\delta u_k &= -\mathbf{g}_k - G_k\delta x_k \tag{2.3.25}\\
\delta u_k^* = \ell_k + L_k\delta x_k & \tag{2.3.26}
\end{align}
where $*$ indicates optimality, $H(k) = H(k)^T$ is a square symmetric matrix, $\mathbf{g}(k)$, $G(k)$, $H(k)$ at time step $k$ are defined by Equations 2.3.16, 2.3.17, 2.3.18, and where the open loop component $\ell_k = -H_k^{-1}\mathbf{g}_k$ and the feedback gain matrix $L_k = -H_k^{-1}G_k$. If $H(k)$ is a positive definite matrix, Equation 2.3.26 can be used to calculate the optimal control increment $\delta u^*(k)$ directly; otherwise negative eigenvalues may result either in $\delta u^*(k)$ not being the minimum or in the violation of constraints, forcing approximation methods to be used.

First Order Method. Rather than choosing a control which minimises $a_{\mathrm{ilqg}}$ of Equation 2.3.24 for all $\delta x(k)$ (i.e. the $\delta u^*(k)$ found in Equation 2.3.26), a control optimal for small $\delta x(k)$ is chosen. Consider making an improvement along the gradient $\nabla_{\delta u}\,a_{\mathrm{ilqg}}\big(\delta u(k), \delta x(k)\big)$, evaluated at $\delta u = 0$ for Equation 2.3.24 without the $H(k)$ term, i.e. $\delta u_k^* = -\mathbf{g}_k - G_k\delta x_k$, made using a "small enough" constant $\alpha \geq 0$
\[
\delta u_k = -\alpha\,\mathbf{g}_k - \alpha\,G_k\delta x_k \tag{2.3.27}
\]

Second Order Method. Use a technique based on the Levenberg-Marquardt method to obtain a matrix $\tilde{H} = H + \big(\alpha - \lambda_{\min}(H)\big)I$, similar to $H$, which has been altered to contain only positive eigenvalues, where $\lambda_{\min}(H)$ is the minimum eigenvalue of $H$ and $\alpha > 0$. Alternatively, use the eigenvalue decomposition $[V, D] = \mathrm{eig}(H)$, replace all elements on the diagonal of $D$ smaller than $\alpha$ with $\alpha$ to give $\tilde{D}$, and set $\tilde{H} = V\tilde{D}V^T$ (so that $\tilde{H}^{-1} = V\tilde{D}^{-1}V^T$). This can be further enhanced to deal with control constraints.
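The recursions above can be collected into a compact deterministic backward pass (the iLQR special case, in which the noise-related terms $c^F_{i,k}$ and $C^u_{i,k}$ vanish). The sketch below is my own illustration with placeholder inputs; it assumes that the sequences of linearised dynamics and quadratised cost terms have already been computed, and it omits the constant term $s_k$ and any regularisation of $H_k$.

```python
# Minimal sketch: deterministic backward pass computing g_k, G_k, H_k, the
# affine law delta_u = l_k + L_k delta_x (Eq. 2.3.26) and the cost-to-go
# recursions for s_k (Eq. 2.3.22) and S_k (Eq. 2.3.23).
import numpy as np

def backward_pass(A_seq, B_seq, W_seq, q_seq, Z_seq, r_seq, P_seq, W_final, q_final):
    n_K = len(A_seq) + 1
    n, m = A_seq[0].shape[0], B_seq[0].shape[1]
    S, s = W_final.copy(), q_final.copy()        # S_{n_K} = W_{n_K}, s_{n_K} = q_{n_K}
    l_seq = np.zeros((n_K - 1, m))
    L_seq = np.zeros((n_K - 1, m, n))
    for k in reversed(range(n_K - 1)):
        A, B = A_seq[k], B_seq[k]
        g = r_seq[k] + B.T @ s                   # Eq. 2.3.16 (no noise terms)
        G = P_seq[k] + B.T @ S @ A               # Eq. 2.3.17
        H = Z_seq[k] + B.T @ S @ B               # Eq. 2.3.18
        l_k = -np.linalg.solve(H, g)             # open-loop component
        L_k = -np.linalg.solve(H, G)             # feedback gain matrix
        s = q_seq[k] + A.T @ s + L_k.T @ H @ l_k + L_k.T @ g + G.T @ l_k
        S = W_seq[k] + A.T @ S @ A + L_k.T @ H @ L_k + L_k.T @ G + G.T @ L_k
        l_seq[k], L_seq[k] = l_k, L_k
    return l_seq, L_seq

# toy usage: a single (placeholder) linearisation repeated along the horizon
nK = 50
A = np.array([[1.0, 0.01], [0.0, 1.0]]); B = np.array([[0.0], [0.01]])
W = np.diag([1.0, 0.1]); Z = np.array([[1e-3]]); P = np.zeros((1, 2))
l_seq, L_seq = backward_pass([A] * (nK - 1), [B] * (nK - 1), [W] * (nK - 1),
                             [np.zeros(2)] * (nK - 1), [Z] * (nK - 1),
                             [np.zeros(1)] * (nK - 1), [P] * (nK - 1),
                             W, np.zeros(2))
```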
In this section I derived the full stochastic version of the iLQG algorithm in order to compare the solution quality of my algorithm, RLOC, with this state-of-the-art benchmark method. In this thesis RLOC is designed to operate on deterministic systems, with a view to extending it to stochastic control tasks. Therefore the benchmark I use is the iterative Linear Quadratic Regulator (iLQR), which is obtained by a straightforward simplification of the iLQG equations.

2.4 Benchmark Dynamical Systems

In this section I introduce two dynamical systems which are used as benchmark tasks for testing my algorithm Reinforcement Learning Optimal Control: a planar 2-link robotic arm and a swinging pendulum on a moving cart. A further three nonlinear dynamical systems are derived / described in Appendix A (a double inverted pendulum on a cart, a physical robotic arm and an arbitrary toy example), which could be used to further test the RLOC algorithm.

2.4.1 Planar 2-Link Model of the Human Arm

Human motor control continually specifies muscle lengths and joint angles in order to attain desired arm and hand positions (the inverse kinematic problem) in a highly redundant physical system. The ability of biological systems to produce controls optimised over millions of years of evolution motivates research into engineering improved robots that can learn from experience, rather than being susceptible to repeated execution errors due to external calibration references [16]. A specific human arm path trajectory can be achieved through a transformation from joint torques to movement (forward dynamics), and the pattern of muscle activations required to achieve these torques has been shown to be consistent with the prediction of torque signals obtained by an optimal control framework [222]. Hence, approximations of the human arm dynamics in particular have been extensively studied, resulting in robot arm models of varied complexity, the control of which is relevant to both robotics and neuro-prosthetics [162, 163, 263].

The simulated dynamics of a 2-link planar robotic arm system, to be controlled by the RLOC algorithm, are shown in Figure 2.1. The shoulder joint is 'attached' to a point, while allowing both the arm and forearm links to move in an angle range $[0, \pi] \in \mathbb{R}^2$, with shoulder and elbow joint angles positively defined to the right. The arm is modelled as two weightless rods of lengths $l_1$ (arm) and $l_2$ (forearm), with masses centred a third of the way along the arm link and half way along the forearm link, reflecting the physical properties of the biological system, and with friction acting on the elbow joint.

Fig. 2.1 Robotic Arm System. The 2 degrees of freedom planar robotic arm is modelled as 2 weightless rods of lengths $l_1$, $l_2$ and masses $m_1$, $m_2$ centred near the middle of each link. The shoulder joint is 'attached' to a point, with planar freely moving shoulder and elbow links defined for angles positive to the right, $\theta = (\theta_1, \theta_2)^T$, and restricted to a range $(0, \pi) \in \mathbb{R}^2$. The simulated robot arm is a function of joint positions $\theta = (\theta_1, \theta_2)^T$, joint velocities $\dot{\theta} = (\dot{\theta}_1, \dot{\theta}_2)^T$ and joint accelerations $\ddot{\theta} = (\ddot{\theta}_1, \ddot{\theta}_2)^T$. A biologically plausible constraint on the size of the admissible torque exerted on each joint is applied, $[-10, 10]$ Nm. See Table 2.1 for the full list of parameter values.
The robotic arm system is represented using the state $x = (\theta_1, \theta_2, \dot{\theta}_1, \dot{\theta}_2)^T \in \mathbb{R}^4$ and the state differentiated with respect to time $\dot{x} = (\dot{\theta}_1, \dot{\theta}_2, \ddot{\theta}_1, \ddot{\theta}_2)^T \in \mathbb{R}^4$, where the forward dynamics are
\[
\ddot{\theta} = M(\theta)^{-1}\big(\tau - C(\theta, \dot{\theta}) - B\dot{\theta}\big) \tag{2.4.1}
\]
where $\theta \in \mathbb{R}^2$ is the joint angle vector, $M(\theta) \in \mathbb{R}^{2\times 2}$ is a positive semi-definite symmetric inertia matrix, $C(\theta, \dot{\theta}) \in \mathbb{R}^2$ is a vector of centripetal and Coriolis forces, $B \in \mathbb{R}^{2\times 2}$ is the joint friction matrix and $\tau = u = (u_1, u_2)^T \in \mathbb{R}^2$ is the joint torque, defined to be the system control. The variables and parameters of the forward dynamics (defined in Table 2.1) are
\[
M(\theta) = \begin{pmatrix} a_1 + 2a_2\cos\theta_2 & a_3 + a_2\cos\theta_2 \\ a_3 + a_2\cos\theta_2 & a_3 \end{pmatrix}, \qquad
C(\theta, \dot{\theta}) = \begin{pmatrix} -\dot{\theta}_2\big(2\dot{\theta}_1 + \dot{\theta}_2\big) \\ \dot{\theta}_1^2 \end{pmatrix} a_2\sin\theta_2
\]
where the parameters $a_1 = I_1 + I_2 + m_2 l_1^2$, $a_2 = m_2 l_1 s_2$, $a_3 = I_2$, $B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$, $s_i$ represents the distance from the joint to the centre of mass of link $i$ and $I_i$ represents the moment of inertia of link $i$.

The control task of the robotic arm consists of picking any initial state configuration $\theta_{\mathrm{init}} \in [0, \pi]$, $\dot{\theta}_{\mathrm{init}} \in [-1.5\pi, 1.5\pi]$ and performing a planar reaching movement to a selected target $x^*$ (where $*$ indicates 'target') by applying admissible controls (torques) $u \in [-10, 10]$ Nm to each joint, as dictated by a control policy $\pi$.

The nonlinear robotic arm system will be used as a benchmark for testing the 'quality' of the optimal control solution of my RLOC algorithm. I will simulate the dynamics defined by $\dot{x}$ of Equation 2.4.1 in order to apply my optimal control solution to planar arm reaching movements (i.e. no gravity effect), starting from any initial arm configuration and taking the arm to a nonzero target.

Table 2.1 Robotic Arm Simulation Parameters.

Parameter Description            Symbol   Value    SI Unit
Length of link 1                 l1       0.3      [m]
Length of link 2                 l2       0.33     [m]
Mass of link 1                   m1       1.4      [kg]
Mass of link 2                   m2       2.5      [kg]
Centre of mass for link 1        s1       0.11     [m]
Centre of mass for link 2        s2       0.165    [m]
Moment of inertia for link 1     I1       0.025    [kg m2]
Moment of inertia for link 2     I2       0.072    [kg m2]
Joint friction                   b11      0.5      [N]
Joint friction                   b12      0.1      [N]
Joint friction                   b21      0.1      [N]
Joint friction                   b22      0.5      [N]
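For concreteness, the forward dynamics of Equation 2.4.1 can be written out directly. The snippet below is a minimal sketch using the parameter values listed in Table 2.1 (the Euler integration step, the example state and torques are my own illustrative choices, not simulation settings from this thesis).

```python
# Minimal sketch: planar 2-link arm forward dynamics (Equation 2.4.1) and one
# Euler step of the state x = (theta1, theta2, theta1_dot, theta2_dot).
import numpy as np

l1, m2, s2 = 0.3, 2.5, 0.165                     # from Table 2.1
I1, I2 = 0.025, 0.072
Bmat = np.array([[0.5, 0.1], [0.1, 0.5]])        # joint friction matrix
a1 = I1 + I2 + m2 * l1**2
a2 = m2 * l1 * s2
a3 = I2

def arm_accel(theta, theta_dot, tau):
    """theta_ddot = M(theta)^-1 (tau - C(theta, theta_dot) - B theta_dot)."""
    c2 = np.cos(theta[1])
    M = np.array([[a1 + 2 * a2 * c2, a3 + a2 * c2],
                  [a3 + a2 * c2,      a3]])
    C = np.array([-theta_dot[1] * (2 * theta_dot[0] + theta_dot[1]),
                  theta_dot[0]**2]) * a2 * np.sin(theta[1])
    return np.linalg.solve(M, tau - C - Bmat @ theta_dot)

x = np.array([np.pi / 4, np.pi / 2, 0.0, 0.0])   # example state
u = np.clip(np.array([1.0, -0.5]), -10, 10)      # torque limits [-10, 10] Nm
dt = 0.01
x_dot = np.concatenate([x[2:], arm_accel(x[:2], x[2:], u)])
x = x + x_dot * dt
```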
2.4.2 Free Swinging Pendulum on a Moving Cart

The cart-pole system depicted in Figure 2.2 is a standard, yet challenging, benchmark in modern control theory. The task consists of actively balancing the pole in an upright position (unstable equilibrium), with the cart stopping at a pre-defined position on the track ($z = 0$), using limited control forces applied to the cart laterally along the horizontal plane. Since the rotation of the pole cannot be controlled directly, the cart-pole is an underactuated system, making standard nonlinear control techniques ineffective [170].

Fig. 2.2 Cart Pole System. The cart-pole system consists of a cart moving on a track, possessing a mass $m_c$ and a velocity $V_c = \dot{z}$ at a location $(x_c, 0)$, and a freely swinging pendulum of length $l$ with a mass $m_p$, angle $\theta$ defined positive to the right, angular velocity $\dot{\theta}$ and tangential velocity $V_p$ defined at a right angle to the pendulum at the tip point $(x_p, y_p)$. The admissible control (force) is applied to the front / rear of the cart, $F \in [-20, 20]$ N. The cart-pole system has a stable equilibrium with the pendulum hanging down, $\theta \in \{-\pi, \pi\}$, and an unstable equilibrium with the pendulum upright, $\theta = 0$.

The cart-pole system is modelled as having a weightless rod with a point mass, acted upon by gravity, at the end of the pole, with friction acting at the cart-ground contact (non-zero) and at the cart-pendulum attachment (zero). The system is a function of the cart position $z$, cart velocity $\dot{z}$, pendulum angle $\theta$ and pendulum angular velocity $\dot{\theta}$. The system is represented with the state $x = (z, \dot{z}, \theta, \dot{\theta})^T \in \mathbb{R}^4$ and the state derivative with respect to time $\dot{x} = (\dot{z}, \ddot{z}, \dot{\theta}, \ddot{\theta})^T \in \mathbb{R}^4$, with the inverse dynamics (see Equations 2.4.6, 2.4.8)
\[
M(q)\ddot{q} + C(q, \dot{q})\dot{q} + G(q) + B\dot{q}_e = \tau \tag{2.4.2}
\]
where $\tau = (F, 0)^T \in \mathbb{R}^2$, $F \in \mathbb{R}$ is the force applied to the system, $q = (z, \theta)^T \in \mathbb{R}^2$, $\dot{q} = (\dot{z}, \dot{\theta})^T \in \mathbb{R}^2$, $\dot{q}_e = (\mathrm{sgn}(\dot{z}), \dot{\theta})^T \in \mathbb{R}^2$, where $\mathrm{sgn}(\cdot)$ means 'sign of', $\ddot{q} = (\ddot{z}, \ddot{\theta})^T \in \mathbb{R}^2$, $M(q) \in \mathbb{R}^{2\times 2}$ is a positive definite symmetric matrix, $C(q, \dot{q}) \in \mathbb{R}^{2\times 2}$, $G(q) \in \mathbb{R}^2$ and $B = (B_c, B_p)^T$ is a vector of friction parameters (acting element-wise on $\dot{q}_e$). The variables and parameters of the inverse dynamics, defined in Table 2.2, are
\[
M(q) = \begin{pmatrix} m_p + m_c & m_p l\cos\theta\\ m_p l\cos\theta & m_p l^2 \end{pmatrix}, \qquad
C(q, \dot{q}) = \begin{pmatrix} 0 & -m_p l\dot{\theta}\sin\theta\\ 0 & 0 \end{pmatrix}, \qquad
G(q) = \begin{pmatrix} 0\\ -m_p g l\sin\theta \end{pmatrix}
\]
where $m_c$ is the mass of the cart, $m_p$ is the mass of the pendulum, $l$ is the length of the pendulum, $g$ is the standard gravity and $B_c$, $B_p$ are the friction coefficients acting upon the cart and the pendulum respectively.

Table 2.2 Cart Pole Simulation Parameters.

Parameter Description           Symbol   Value      SI Unit
Length of pendulum              l        0.6        [m]
Mass of pendulum                m_p      0.5        [kg]
Mass of cart                    m_c      0.5        [kg]
Standard gravity                g        9.80665    [m/s2]
Pendulum friction constant      B_p      0          [N/m/s]
Cart friction constant          B_c      0.1        [N/m/s]

The control task of the cart-pole consists of starting from any state configuration $x_{\mathrm{init}} = (z, \dot{z}, \theta, \dot{\theta})^T$, where the cart position $z$ and cart velocity $\dot{z}$ are initialised at $0$, the pendulum angle $\theta \in [-\pi, \pi]$ and the pendulum angular velocity $\dot{\theta} \in [-1.5\pi, 1.5\pi]$ (with $\theta = 0$ defined to be the upright, unstable equilibrium), and controlling the cart-pole to the target state $x^* = 0$ by applying admissible controls (forces) $u \in [-20, 20]$ N to the front / rear of the cart, as dictated by a control policy $\pi$.

Derivation of Equations of Motion. The Lagrangian, $\mathcal{L}$, is a function that summarises the dynamics of the system, $\mathcal{L} = T - U$, where $T = \frac{1}{2}mV^2$ is the kinetic energy and $U = mgh$ is the potential energy, $m$ is the mass of the object acted upon by gravity, $h$ is the height at which the object is held above the ground and $g$ is the standard gravity. The velocity of the cart is $V_c = \frac{d(x_c)}{dt} = \dot{z}$ and the velocity of the pendulum is $V_p = \sqrt{\big(\frac{dx_p}{dt}\big)^2 + \big(\frac{dy_p}{dt}\big)^2} = \sqrt{V_{p1}^2 + V_{p2}^2}$ (as labelled in Figure 2.2).
Substituting the velocities, with the pendulum tip coordinates $x_p = z + l\sin\theta$, $y_p = l\cos\theta$ and the height of the point mass $h = l\cos\theta$, the Lagrangian becomes
\begin{align}
\mathcal{L} &= \frac{1}{2}m_p V_p^2 + \frac{1}{2}m_c V_c^2 - m_p g h - m_c g\cdot 0 \nonumber\\
&= \frac{1}{2}m_p\Big[\Big(\frac{d(z + l\sin\theta)}{dt}\Big)^2 + \Big(\frac{d(l\cos\theta)}{dt}\Big)^2\Big] + \frac{1}{2}m_c\dot{z}^2 - m_p g l\cos\theta \nonumber\\
&= \frac{1}{2}m_p\Big[\big(\dot{z} + l\dot{\theta}\cos\theta\big)^2 + \big(-l\dot{\theta}\sin\theta\big)^2\Big] + \frac{1}{2}m_c\dot{z}^2 - m_p g l\cos\theta \nonumber\\
&= \frac{1}{2}m_p\Big[\dot{z}^2 + 2l\dot{z}\dot{\theta}\cos\theta + l^2\dot{\theta}^2\underbrace{\big(\cos^2\theta + \sin^2\theta\big)}_{=1}\Big] + \frac{1}{2}m_c\dot{z}^2 - m_p g l\cos\theta \nonumber\\
&= \frac{1}{2}\big(m_p + m_c\big)\dot{z}^2 + m_p l\dot{z}\dot{\theta}\cos\theta + \frac{1}{2}m_p l^2\dot{\theta}^2 - m_p g l\cos\theta \tag{2.4.3}
\end{align}
The Equations of Motion for the cart (Equation 2.4.4) and the pole (Equation 2.4.5) are
\[
\frac{d}{dt}\Big(\frac{\partial\mathcal{L}}{\partial\dot{z}}\Big) - \frac{\partial\mathcal{L}}{\partial z} = F - B_c\,\mathrm{sgn}(\dot{z}) \tag{2.4.4}
\]
\[
\frac{d}{dt}\Big(\frac{\partial\mathcal{L}}{\partial\dot{\theta}}\Big) - \frac{\partial\mathcal{L}}{\partial\theta} = 0 - B_p\dot{\theta} \tag{2.4.5}
\]
Obtaining the expression for the equation of motion of the cart using Equations 2.4.3 and 2.4.4,
\begin{align}
F - B_c\,\mathrm{sgn}(\dot{z}) &= \frac{d}{dt}\Big[\big(m_p + m_c\big)\dot{z} + m_p l\dot{\theta}\cos\theta\Big] - 0 = \big(m_p + m_c\big)\ddot{z} + m_p l\big(\ddot{\theta}\cos\theta - \dot{\theta}^2\sin\theta\big) \nonumber\\
F &= \big(m_p + m_c\big)\ddot{z} + m_p l\ddot{\theta}\cos\theta - m_p l\dot{\theta}^2\sin\theta + B_c\,\mathrm{sgn}(\dot{z}) \tag{2.4.6}
\end{align}
which can be re-written to obtain $\ddot{z}$
\[
\ddot{z} = \frac{1}{m_p + m_c}\Big[F + m_p l\big(\dot{\theta}^2\sin\theta - \ddot{\theta}\cos\theta\big) - B_c\,\mathrm{sgn}(\dot{z})\Big] \tag{2.4.7}
\]
Obtaining the expression for the equation of motion of the pole using Equations 2.4.3 and 2.4.5,
\begin{align}
-B_p\dot{\theta} &= \frac{d}{dt}\Big[m_p l\dot{z}\cos\theta + m_p l^2\dot{\theta}\Big] - \big(-m_p l\dot{z}\dot{\theta}\sin\theta + m_p g l\sin\theta\big) \nonumber\\
&= m_p l\ddot{z}\cos\theta - m_p l\dot{z}\dot{\theta}\sin\theta + m_p l^2\ddot{\theta} + m_p l\dot{z}\dot{\theta}\sin\theta - m_p g l\sin\theta \nonumber\\
0 &= \ddot{z}\cos\theta + l\ddot{\theta} - g\sin\theta + \frac{B_p\dot{\theta}}{m_p l} \tag{2.4.8}
\end{align}
which can be re-written to obtain $\ddot{z}$
\[
\ddot{z} = \frac{1}{\cos\theta}\Big(g\sin\theta - l\ddot{\theta} - \frac{B_p\dot{\theta}}{m_p l}\Big) \tag{2.4.9}
\]
In order to establish the dynamics of the cart-pole system, $\dot{x}$ needs to be found. Substituting Equation 2.4.9 into Equation 2.4.6, multiplying through by $\cos\theta$ and collecting the $\ddot{\theta}$ terms gives
\[
\ddot{\theta}\,l\big[m_p\cos^2\theta - (m_p + m_c)\big] = \big[F - B_c\,\mathrm{sgn}(\dot{z}) + m_p l\dot{\theta}^2\sin\theta\big]\cos\theta - (m_p + m_c)\Big(g\sin\theta - \frac{B_p\dot{\theta}}{m_p l}\Big)
\]
and hence
\[
\ddot{\theta} = \frac{g\sin\theta + \dfrac{\big(B_c\,\mathrm{sgn}(\dot{z}) - F - m_p l\dot{\theta}^2\sin\theta\big)\cos\theta}{m_p + m_c} - \dfrac{B_p\dot{\theta}}{m_p l}}{l\,\dfrac{m_c + m_p\underbrace{(1 - \cos^2\theta)}_{\sin^2\theta}}{m_p + m_c}} \tag{2.4.10}
\]

Derivations of Partial Derivatives of the Dynamics. In order to define the matrices $A = I + \frac{\partial f}{\partial x}\big(x_0^{(i)}, u_0^{(i)}\big)\Delta$ and $B = \frac{\partial f}{\partial u}\big(x_0^{(i)}, u_0^{(i)}\big)\Delta$ for the cart-pole task, evaluated at the points $\big(x_0^{(i)}, u_0^{(i)}\big)$, the derivatives of the system dynamics $\dot{x}$ with respect to the state $x \in \mathbb{R}^4$ and control $u \in \mathbb{R}$ need to be found.
With the state ordered as $x = (z, \dot{z}, \theta, \dot{\theta})^T$ and the control $u = F$, the required Jacobians are
\[
\frac{\partial\dot{x}}{\partial x} =
\begin{pmatrix}
0 & 1 & 0 & 0\\
0 & 0 & \frac{\partial\ddot{z}}{\partial\theta} & \frac{\partial\ddot{z}}{\partial\dot{\theta}}\\
0 & 0 & 0 & 1\\
0 & 0 & \frac{\partial\ddot{\theta}}{\partial\theta} & \frac{\partial\ddot{\theta}}{\partial\dot{\theta}}
\end{pmatrix},
\qquad
\frac{\partial\dot{x}}{\partial u} =
\begin{pmatrix}
0\\[2pt] \frac{\partial\ddot{z}}{\partial F}\\[2pt] 0\\[2pt] \frac{\partial\ddot{\theta}}{\partial F}
\end{pmatrix}
\]
The following constants are defined for ease of calculation
\[
c_1 = (m_p + m_c)m_p l g, \qquad c_2 = m_p l B_c\,\mathrm{sgn}(\dot{z}), \qquad c_3 = m_p l^2(m_p + m_c), \qquad c_4 = m_p^2 l^2 \tag{2.4.11}
\]
The expression for $\ddot{\theta}$ of Equation 2.4.10 can be re-written using the constants of Equation 2.4.11
\[
\ddot{\theta} = \frac{c_1\sin\theta + c_2\cos\theta - m_p l F\cos\theta - c_4\dot{\theta}^2\sin\theta\cos\theta - B_p\dot{\theta}}{c_3 - c_4\cos^2\theta} \tag{2.4.12}
\]
The individual matrix entries are defined as follows
\begin{align}
\frac{\partial\ddot{z}}{\partial\theta} &= \frac{m_p l\big(\dot{\theta}^2\cos\theta + \ddot{\theta}\sin\theta\big)}{m_p + m_c} \tag{2.4.13}\\
\frac{\partial\ddot{z}}{\partial\dot{\theta}} &= \frac{2m_p l\dot{\theta}\sin\theta}{m_p + m_c} \tag{2.4.14}\\
\frac{\partial\ddot{z}}{\partial F} &= \frac{1}{m_p + m_c} \tag{2.4.15}\\
\frac{\partial\ddot{\theta}}{\partial F} &= -\frac{m_p l\cos\theta}{c_3 - c_4\cos^2\theta} \tag{2.4.16}\\
\frac{\partial\ddot{\theta}}{\partial\dot{\theta}} &= \frac{-2c_4\dot{\theta}\sin\theta\cos\theta - B_p}{c_3 - c_4\cos^2\theta} \tag{2.4.17}
\end{align}
I use the quotient rule to derive the differential of $\ddot{\theta}$ with respect to $\theta$, where for a function $f(\theta) = \frac{u(\theta)}{v(\theta)}$ the derivative is $f'(\theta) = \frac{v(\theta)\frac{du(\theta)}{d\theta} - u(\theta)\frac{dv(\theta)}{d\theta}}{v(\theta)^2}$; hence, setting $v(\theta) = c_3 - c_4\cos^2\theta$ and $u(\theta) = c_1\sin\theta + c_2\cos\theta - m_p l F\cos\theta - c_4\dot{\theta}^2\sin\theta\cos\theta - B_p\dot{\theta}$,
\begin{align}
\frac{dv(\theta)}{d\theta} &= 2c_4\sin\theta\cos\theta \nonumber\\
\frac{du(\theta)}{d\theta} &= c_1\cos\theta + \big(m_p l F - c_2\big)\sin\theta - c_4\dot{\theta}^2\big(\cos^2\theta - \sin^2\theta\big) \nonumber\\
\frac{\partial\ddot{\theta}}{\partial\theta} &= \frac{\big(c_3 - c_4\cos^2\theta\big)\Big[c_1\cos\theta + \big(m_p l F - c_2\big)\sin\theta - c_4\dot{\theta}^2\big(\cos^2\theta - \sin^2\theta\big)\Big]}{\big(c_3 - c_4\cos^2\theta\big)^2} \nonumber\\
&\quad - \frac{2c_4\sin\theta\cos\theta\Big[c_1\sin\theta + c_2\cos\theta - m_p l F\cos\theta - c_4\dot{\theta}^2\sin\theta\cos\theta - B_p\dot{\theta}\Big]}{\big(c_3 - c_4\cos^2\theta\big)^2} \tag{2.4.18}
\end{align}

Validation of Derivations by Simulation. I use the total energy of the system, composed of the kinetic and potential energies acting on a frictionless system, $E = T + U$ (Equation 2.4.19), to ensure that the equations of motion are derived correctly. If this is the case, the total energy should remain constant over the simulation horizon (here chosen as $T = 5$ sec) when the pendulum is dropped from an arbitrary height (chosen as $\theta = \frac{\pi}{2}$ rad).
\[
T = \frac{1}{2}\big(m_p + m_c\big)\dot{z}^2 + m_p l\dot{z}\dot{\theta}\cos\theta + \frac{1}{2}m_p l^2\dot{\theta}^2, \qquad U = m_p g l\cos\theta \tag{2.4.19}
\]
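This energy check can be reproduced with a short simulation. The sketch below is my own illustration using SciPy's solve_ivp in place of MATLAB's ODE45; it integrates the frictionless, unforced dynamics of Equations 2.4.7 and 2.4.10 from $\theta = \pi/2$ and reports the spread of the total energy of Equation 2.4.19.

```python
# Minimal sketch: energy-conservation check for the frictionless cart-pole.
import numpy as np
from scipy.integrate import solve_ivp

l, m_p, m_c, g = 0.6, 0.5, 0.5, 9.80665   # Table 2.2 values, friction set to 0

def rhs(t, x, F=0.0):
    z, z_dot, th, th_dot = x
    s, c = np.sin(th), np.cos(th)
    # Eq. 2.4.10 with B_c = B_p = 0 and F = 0
    th_ddot = (g * s + c * (-F - m_p * l * th_dot**2 * s) / (m_p + m_c)) \
              / (l * (m_c + m_p * s**2) / (m_p + m_c))
    # Eq. 2.4.7 with B_c = 0 and F = 0
    z_ddot = (F + m_p * l * (th_dot**2 * s - th_ddot * c)) / (m_p + m_c)
    return [z_dot, z_ddot, th_dot, th_ddot]

def total_energy(x):
    z, z_dot, th, th_dot = x
    T = 0.5 * (m_p + m_c) * z_dot**2 + m_p * l * z_dot * th_dot * np.cos(th) \
        + 0.5 * m_p * l**2 * th_dot**2
    U = m_p * g * l * np.cos(th)
    return T + U                                 # Equation 2.4.19

sol = solve_ivp(rhs, (0, 5), [0, 0, np.pi / 2, 0], rtol=1e-9, atol=1e-9,
                dense_output=True)
E = np.array([total_energy(sol.sol(t)) for t in np.linspace(0, 5, 500)])
print(E.max() - E.min())   # should stay near zero, up to solver tolerance
```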
Figure 2.3 depicts the stability of the kinetic (see Figure 2.3 a) and potential (see Figure 2.3 b) energies, which oscillate with an equal and opposite magnitude as the pendulum swings about the stable equilibrium when dropped from the angle $\theta = \frac{\pi}{2}$. Due to the conservation of energy, the potential energy turns into kinetic energy as the height of the pendulum tip above the ground is reduced, and vice versa. The state trajectories demonstrate the stable system behaviour during the simulation, with stable cart position, cart velocity, pole angle and angular velocity, as the pendulum swings between $\frac{\pi}{2}$ and $-\frac{\pi}{2}$ (see Figure 2.3 c). The stability of the total energy ($E = 0$) of the cart-pole system (see Figure 2.3 d) confirms the correct derivation of the equations of motion, where any variation is due to the accuracy set for the ODE45 solver (an inbuilt Matlab function used to solve non-stiff differential equations)⁷.

⁷The ODE45 relative and absolute tolerances were set to 10⁻⁹, resulting in a total energy fluctuation range of 2 × 10⁻⁸.

Fig. 2.3 Total Energy of the Cart Pole System. The pole is dropped from an angle of π/2 radians and the behaviour of the system is observed over a period of 5 seconds (with equivalent results for a period of 5 hours). The simulation is carried out using the ODE45 Matlab function with absolute and relative tolerances of 10⁻⁹. (a) Displays the periodic changes in kinetic energy of the system. (b) Displays the periodic changes in potential energy of the system. (c) Displays the cart-pole system trajectory over a 5 sec interval. (d) Displays the stable Total Energy (mean 4.8 × 10⁻¹⁰, variance 8.7 × 10⁻¹⁸).

The nonlinear cart-pole system is used as a benchmark for testing the 'quality' of the optimal control solution of my RLOC algorithm. I simulate the dynamics defined by ẋ in order to apply my optimal control solution to active pole balancing in the upright unstable equilibrium, with the cart stopping at a pre-defined position on the track, starting from any state configuration within a bounded range.

Chapter 3
Reinforcement Learning Optimal Control

"The scientist is not a person who gives the right answers, he's one who asks the right questions." (C. Lévi-Strauss)

In this chapter I explain the motivation and principles behind my algorithm, Reinforcement Learning Optimal Control (RLOC), provide the theory of its design and explain the algorithm's architecture.

3.1 Introduction

The most challenging class of control problems in engineering is the one seeking to provide optimal control solutions to tasks with nonlinear dynamics. The analytical solutions to such systems, based on the Hamilton-Jacobi-Bellman partial differential equation, are limited to stylised problems of little practical relevance, resulting in the vast development of approximation methods (the state-of-the-art algorithms discussed in Section 1.4). These approaches typically require knowledge of the system dynamics, carry a large computational cost associated with iterative calculations and kernel matrix inversions, which scale unfavourably with respect to the number of data points required for learning [69], and may require expert heuristic tuning of the parameters to achieve convergence [263]. As an alternative, this thesis presents a novel hierarchical learning framework, RLOC, for solving nonlinear control tasks with (un)known dynamics.

The rich history of Optimal Control (OC) (see Sections 1.2, 1.3) laid the foundation for the Machine Learning [37] topic of Reinforcement Learning (RL) (see Subsection 1.3.3), and the two fields quickly diverged (see Table 3.1 for a comparison, including advantages and disadvantages), with OC theory solving systems with known dynamics and RL solving systems with (un)known dynamics by maximising an observed long term reward and learning from experience [239]. Even though OC has been a subject of intense study, solving complex control problems with multiple input multiple output (MIMO) systems [132], for example those involving nonlinear dynamics, remains a challenge [263].
My research seeks to answer the question of "How can the two fields of Reinforcement Learning and Optimal Control be combined in order to utilise the best of both worlds?".

3.1.1 Engineering Motivation

A combination of reinforcement learning with optimal control, employed for solving nonlinear control problems with (un)known dynamics, allows a divide-and-conquer technique¹ to be implemented. This involves decomposing a task into one of selection between local strategies, pushing back the curse of dimensionality² [25] otherwise encountered by trying to solve a problem monolithically. Combining these two fields utilises their respective advantages, such as RL's ability to deal with nonlinear systems of unknown dynamics while learning from online trials [239] and OC's advantage of producing smooth, continuous, cost-minimal control [260], while mitigating their common disadvantage of the curse of dimensionality (see Table 3.1 for a comparison of each methodology).

¹A divide-and-conquer algorithm breaks down a problem into a number of directly solvable sub-problems of similar type, which are first tackled individually and then recombined to present a global solution to the original problem.

²The curse of dimensionality refers to an exponential growth in the complexity of a problem as the dimensionality of the task increases, making the problem intractable [26].

Table 3.1 Reinforcement Learning and Optimal Control Comparison.

Reinforcement Learning                           | Optimal Control
Machine Learning                                 | Control Engineering
Bellman Optimality Principle / Markov Property   | Bellman Optimality Principle
Model-Free / Model-Based                         | Model-Based
Learns from Reward / Punishment                  | Requires Full Knowledge of Dynamics
Generally Discrete States and Actions            | Generally Continuous States and Inputs
Learns Linear & Nonlinear Systems                | Analytical Solution for Linear Systems Only
Maximises Long Term Reward                       | Minimises Global Cost (Performance Index)
Suffers from Curse of Dimensionality             | Suffers from Curse of Dimensionality

It is natural to combine these two approaches in a hierarchical way: an RL agent can learn to choose between its available high-level actions, each linked to a low-level feedback controller of choice. This provides abstraction and task complexity reduction by efficiently structuring the control of high-level actions to achieve a global solution. Here each RL action activates a low-level controller operating on an area of state space, with the RL policy dictating the switching between controllers (see Figure 3.1). The choice of low-level optimal controllers is not restricted to one particular type and in theory any feedback controller can be chosen (e.g. optimal Linear Quadratic Regulators (LQR), Proportional Integral and Derivative (PID), H-infinity (H∞) controllers etc.). I postulate that using optimal controllers forms a sound basis of choice, since they represent state-of-the-art solutions in control engineering, able to handle MIMO systems (the majority of real-life systems). Even though PID controllers are extremely popular in industry [12], there are no guarantees of the stability of their solutions for MIMO systems [106], requiring a decoupling of the system dynamics. Hence, while bearing in mind that in theory any controller can be chosen, optimal controllers seem to be the best choice. This feature of the RLOC algorithm is a result of its ability to learn a global state-action mapping [239], where only the action (controller) yielding the highest long term reward is chosen to be activated in a particular area of state space.
This means that the global policy learnt by RLOC simply will not utilise any low-level controller which is not deemed to lead to the best long term outcome.

Fig. 3.1 RLOC Conceptual Overview. A snowboarding task is used to illustrate RLOC's architecture. The agent's actuators (low-level muscle activations controlling the board/body, e.g. local optimal feedback controllers) are engaged by a high-level reinforcement learning policy (comprised of high-level symbolic actions: "Glide", "Lift", "Fly", "Land"). The continuous actuator state space is mapped into an RL framework using feature mapping and each low-level controller corresponds to a high-level symbolic RL action. The type of the low-level controller is a design choice and in theory any feedback controller could be implemented (e.g. LQR, PID, H∞ controllers), permitted by the ability of the high-level RL policy to learn to choose the best actions for the (discrete/continuous) symbolic RL state space. In the case where the task dynamics are unknown, the local models may be learnt and moderated online through repeated trials (local dynamics learning). The RLOC algorithm learns to smartly re-use the available low-level controllers, producing a global action-to-state mapping, while maximising the long term reward of a given task.

The learning loop between the low-level control and the high-level action selection is closed by mapping the accumulated actuator cost, J, of a chosen low-level controller, acquired for controlling the system, to the high-level reward signal, which drives the agent's learning (note: cost = −reward). The use of abstractions (i.e. low-level optimal controllers as RL actions) results in a reduced amount of exploration needed for the high-level agent, as utilised in Hierarchical Reinforcement Learning (HRL) [22]. This allows an agent to learn from fewer trials when compared to a flat RL structure, which typically requires a very small sampling frequency when choosing new actions (e.g. on the order of 10³ more state-action pairs would need to be explored in an RL setting without abstractions in order to solve the benchmark problems addressed in this thesis). Equally, the use of an RL agent as a high-level learner allows for a reduction in the number of controllers necessary when compared to stand-alone methods such as the iterative Linear Quadratic Regulator (iLQR) [263]. The iLQR method requires as many linearisation points as there are incremental time steps, which are iteratively calculated during optimisation, and it often requires expertly initiated control trajectories in order to achieve convergence. RLOC, on the other hand, reduces the number of necessary linearisation points by roughly two orders of magnitude (10²) for the tasks addressed in this thesis.

An RL agent is able to operate on systems with both known and unknown dynamics, and can create models of the system from online experience, allowing additional learning from a simulated environment [239]. Hence, in cases of "costly data" (e.g. patient drug administration), RLOC can be initially trained off-line on available data and then applied to patients, performing tuning using limited physical interaction by simulating experience from learnt individual data.
3.1.2 Neurobiological Motivation

In contrast to artificial systems, which struggle to control high-dimensional nonlinear problems, humans and animals are able to control highly complex nonlinear systems with ease. Consider that children can learn to ride a bicycle with a few hours of practice, which computationally involves the simultaneous orchestration of 650 muscles to control a highly nonlinear system with evolutionarily unknown dynamics from a relatively minute data set. Intriguingly, while we tend to plan and formulate complex tasks at a symbolic level [41] (e.g. "turn left", "accelerate", "jump"), our body's actuators (muscles, limbs, joints) are controlled by firing patterns of motor neuron pools [108] of the spinal cord³. The neural computations that allow our brain to bridge across the divide between cognitive symbolic action selection and low-level continuous control command execution, let alone (re)learn complex representations of nonlinear task dynamics, remain unclear [58, 72, 283]. Recent findings emerge on two levels:

³Motor neurons are spatially grouped into pools of the spinal cord (motor neuron pools), with each motor neuron innervating an individual unit of grouped muscle fibres (muscle motor unit).

(1) At the level of actuation, it has been shown that the human brain controls the redundant degrees of freedom of reaching movements in a manner well predicted by linear optimal feedback control (OFC) theory [110, 222, 259], which overcomes sensory feedback delays using optimal state estimation with a Kalman filter [260, 262] (in part carried out by the cerebellum, which builds an internal model / representation of a task, and associated cortical regions [199, 238]). The evidence strongly suggests that the brain minimises motor noise [84] and/or effort [190] (jerk, torque change [88]), which are assumed to be quadratic cost functions of the muscle control signals. The application of mechanical perturbations in arm reaching tasks results in goal-directed correctional control responses with possible overcompensation [223], correctly predicted by OFC, since early compensation produces movements with minimum effort [122, 187]. Even though the majority of complex feedback appears to be generated through transcortical feedback pathways, the spinal cord provides the first level of feedback processing [223], as in many species the spinal circuits support sophisticated control such as scratching, wiping and quadrupedal locomotion [89, 210, 232] (e.g. turtle, cat), as well as multi-joint / multi-limb reflex patterns. Optimal linear controllers of the state space were found to operate at the actuator level and characteristic linear control motor primitives (muscle synergies, see Appendix C.1) were found in the spinal cord [30, 63, 111, 185, 257].

(2) Experimental evidence from mice and monkeys reveals that optimal action selection in symbolic serial decision making tasks is effectively solved by the implementation of reinforcement learning algorithms inside the brain [65, 184] (such as the basal ganglia and lateral prefrontal cortex). Evidence for a reward prediction error, which calculates the difference between the actual and expected reward of a chosen action in a certain state, is shown to be encoded in the activity of the mid-brain dopaminergic neurons of the substantia nigra and ventral tegmental areas, which form a part of the basal ganglia circuit [62, 219].
In the context of RL this reward prediction error signal is used to learn values for action choices that maximise an expected future reward and is known as model-free RL [239]. Research opposing this view has been carried out by Kevin Gurney et al., who propose that the short-latency dopamine response is too short to signal a reward prediction error, and that instead it may signal attention switching to an unexpected stimulus (associative learning) [207] and may play a role in discovering novel actions [206]. In contrast, another class of RL builds internal models of the interaction between actions and states, enabling the simulation of experience. This model-based RL calculates a state prediction error, which measures the surprise in the new state given the current estimate of the state-action-state transition probabilities, and a neural signature for this error has been identified in the intraparietal sulcus and lateral prefrontal cortex areas of the brain [99]. Recent research by Nathaniel Daw et al. reveals that in multi-step decision making tasks, in which model-free and model-based RL could be distinguished, the neural signatures reflected the simultaneous presence of both predictions, suggesting an integrated architecture [64].

Motor-Babbling. A natural question may be asked of how the unknown body/task dynamics are learnt and represented within the brain. During the early development of a baby, the exploration of the World seems to occur through unstructured interactions generated with endogenous motor commands, known as motor-babbling [47, 192]. In engineering terms, both plant and environment models are learnt through random controller selection, and the motor-babbling approach to learning purposive actions has been applied in dynamical systems methods to early robot development [71]. It can be postulated that motor-babbling in humans provides the initial data for local models of body/World dynamics to be learnt, and forms a key step in the development of more directed movements. Following the development of the initial local models in babies, and as the organism evolves, a more sophisticated approach to executing novel motor tasks may be employed, e.g. blending existing controls [59]. In fact such initial local models might be the foundation of structure specific learning, the creation of generic / abstract representations within the brain that correspond to multiple tasks (thus reducing computational complexity and enabling transfer learning) [4, 43]. Previously encountered tasks may be re-learnt at a faster rate (a phenomenon known as savings [139, 225]) through updates to the created representations and to the previously learnt control policies, however how this is done in parallel remains unclear.

Hierarchical Organisation of Behaviour in Sequential Tasks. In 1951 Karl Lashley [158] proposed that the sequencing of low-level actions requires higher-level task context representations, and that sequential behaviour displays a hierarchical structure made up of nested subroutines (e.g. the tree-like structure of the coffee-making task [41], see Figure 3.2). Computational models of the hierarchical structure of behaviour first appeared in 1960 [178] and have since featured in cognitive psychology [217], developmental psychology [212], neuropsychology [220] and neuroscience [19, 141].
Many of the suggested models closely resemble Hierarchical RL [40] (see Subsection 1.4.2 for an overview of HRL methods) and postulate a modular representation, where each module/unit can be activated for an indefinite period of time while building individual policy mappings [41].

Fig. 3.2 Hierarchical Organisation of Action Selection. Research in cognitive psychology suggests that the brain: (1) represents sequential tasks with hierarchically organised symbolic tree-like processing units [54-56], e.g. the coffee-making task (figure adapted from [41]); and (2) solves decision-making tasks with RL algorithms [207, 219]. On the lower level of the Central Nervous System (figure adapted from [249]), the spinal cord provides the first level of feedback processing (primarily performed by transcortical feedback pathways) and linear OFC theory predicts well the activation patterns of muscles in reaching movements [222, 223]. The control of actuators appears to be represented with linear motor primitives in the spinal cord [185, 257].

In summary, recent findings suggest that the brain learns complex nonlinear control tasks, such as object manipulation and sensorimotor control, using computations on two levels of abstraction. This involves (a) representing tasks symbolically on the high level and (b) controlling the body's actuators using characteristic linear optimal control signals while engaging motor-neuron pools in the spinal cord. It is suggested that local models of the unknown dynamics of the body/World may be inferred through unstructured interactions, resembling motor-babbling in babies. Parallel research suggests that the brain may learn the dynamics of reaching movements at the level of motor primitives, i.e. at the level of low-level controllers [257], which implies placing system identification algorithms on the low level of the RLOC hierarchy. Overall, the RLOC algorithm aims to combine these recent findings into a proof-of-principle algorithm in order to bridge the divide between low-level actuation control and high-level symbolic decision making, with a hierarchical architecture motivated by the current research (see Figure 3.2).

3.2 Architectural Design

I will now outline the design of RLOC, which forms the core of my thesis, providing targeted background on the components of the algorithm. The specific implementation and applications of RLOC are then covered in the subsequent chapters. The RLOC algorithm is designed for solving nonlinear continuous control problems⁴ with (un)known dynamics of the following general form
\[
\dot{x}(t) = f\big(x(t), u(t)\big) = \tilde{A}\big(x(t)\big)x(t) + \tilde{B}\big(x(t)\big)u(t) \tag{3.2.1}
\]
where $x(t)$ is the continuous actuator state, $u(t)$ is the continuous affine actuator control, $\tilde{A}\big(x(t)\big)$ is the nonlinear system control matrix and $\tilde{B}\big(x(t)\big)$ is the nonlinear control gain matrix. The system dynamics may be known, in which case they can be made available to the algorithm, or they may be unknown, in which case they can be learnt by RLOC, as addressed in Chapters 4 and 5 respectively.

⁴The studied dynamics are assumed to be deterministic and time invariant (even though most real-life tasks are explicitly time dependent with stochastic dynamics, e.g. neuron firing); this is a design decision, since most optimal control solutions operate on time-invariant assumptions, however the proof-of-principle RLOC algorithm can be extended to handle stochastic dynamics.
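Equation 3.2.1 simply states that the nonlinearity is carried by state-dependent system matrices. As a hedged toy illustration (a damped pendulum of my own choosing, not one of the benchmark systems of this thesis), the snippet below folds the nonlinearity sin(θ)/θ into Ã(x):

```python
# Minimal sketch: writing a nonlinear system in the state-dependent form
# x_dot = A(x) x + B(x) u of Equation 3.2.1.
import numpy as np

def A_of_x(x, g=9.81, L=1.0, b=0.1):
    th = x[0]
    sinc = np.sinc(th / np.pi)          # sin(theta)/theta, well defined at 0
    return np.array([[0.0, 1.0],
                     [-(g / L) * sinc, -b]])

def B_of_x(x):
    return np.array([[0.0], [1.0]])

x = np.array([0.7, 0.0])
u = np.array([0.2])
x_dot = A_of_x(x) @ x + B_of_x(x) @ u   # Equation 3.2.1 evaluated at (x, u)
```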
The architectural decisions of the hierarchical RLOC algorithm are based on piecing together research evidence from symbolic decision making [54-56], action selection [103, 126], optimal feedback control [222, 224, 262] and processes of model learning / representation using system identification that is proposed to resemble motor-babbling in babies. Overall, RLOC suggests that the brain learns to control nonlinear systems by learning, at the level of the actuators, the linear dynamics needed to infer linear feedback control, and by learning to choose, at the symbolic level, which actions (i.e. low-level controllers) to engage given an area of state space, in order to solve a global task optimally.

3.2.1 Learning and Controlling In RLOC Framework

Control in RLOC Framework. Neuroscience research shows that the control of actuators (limbs, joints, muscles) appears to be represented in the form of local controllers of the state space [110, 222], and characteristic linear control primitives (muscle synergies, see Appendix C.1) have been found to operate on the level of the spinal cord [111, 257]. RLOC postulates that the brain controls its nonlinear motor system by using a high-level policy that dictates the switching between a set of low-level optimal feedback controllers, resulting in chained actions, where for a given region of the state space a single controller is engaged. The control signals are thus hierarchically organised, which reduces the complexity of the required global control solution. The low-level controllers are obtained for predefined regions of state space, linearisation points, where each point leads to a single locally optimal linear feedback controller, using known/learnt plant dynamics. If the plant dynamics are unknown they are inferred via state and control signals limited to a region of state space assumed sufficiently small such that a local linearisation of the control plant is a "good enough" approximation of the globally nonlinear dynamics. Thus, control involves high-level action selection (the symbolic controller) and low-level control specifying the continuous signals of the actuators for a global task solution.

Learning in RLOC Framework. Control in RLOC is learnt on two levels of abstraction: (1) At the high level (symbolic level) RLOC learns to choose the correct sequence of actions, i.e. which local controller to use in a given state to achieve a composite task goal, using an RL agent. The state-action mapping is learnt while smartly re-using the chosen low-level controllers by assignment to possibly non-neighbouring ("non-topologically conjunct") regions of state space. (2) At the low level (actuator level), if the plant dynamics are unknown, they can be learnt using system identification during online experience, namely by evaluating through recursive estimation an internal forward model of how control signals and the current state predict future states (i.e. learning how actuator control and movement are coupled to state space, e.g. [95]). The two levels of abstraction are integrated naturally by RLOC, as the low-level control cost accumulated as a result of choosing a high-level action determines the reward signal used to drive the high-level learning, thus closing the learning loop between these two levels.
3.2.2 Mapping Optimal Control Problem into Markov Process Framework The two levels of abstraction can be combined into a single algorithm for any problem that can be modelled as a finite Markov Decision Process (finite-MDP)5 . This is possible if the problem can be described using a finite number of states and actions with dynamics satisfying the Markov property6 . The nonlinear optimal control problem (see Equation 3.2.1) satisfies the Markov property requirement because in the discrete time setting the system dynamics at iteration step k depend purely on the state and control at the time step k − 1. Hence the nonlinear continuous optimal control problem can be converted into a finite Markov Decision Process using the following steps: (1) Define a feature mapping φ (x) between the low-level (optimal control) and the high-level (reinforcement learner) state space representation. Note, this mapping can either be from the continuous OC domain to a continuous RL domain, or to a discrete RL domain, since it can handle either. (2) Define a control mapping ψ (a) between the high-level (RL) actions and the low-level (OC) continuous feedback control. In a traditional RL setting used here the action space is discrete, however this can be extended to continuous domain using function approximation techniques. (3) Define a reward mapping which transforms the accumulated cost of the low-level controls into high-level rewards that reinforcement learner b , where the control and feature utilises for improving its policy. (4) Define a control policy mapping ω mappings are combined with a learnt deterministic policy π (s) : S → A (s) to define an estimate of the time-dependent optimal control policy mapping ω ∗ (where ∗ indicates optimality). 5 A finite Markov Decision Process is a discrete time stochastic control process, which is defined by its finite state and action sets, and by one-step dynamics of the environment. At each time step the environment/plant/process takes on a state s and the agent selects an action a ∈ A (s) to be applied in that state. The process transitions with certain probability into the new state s′ and the agent observes a reward associated with that action. 6 The Markov property requires subsequent state and reward to depend only on previous state and action. 3.2 Architectural Design 57 Fig. 3.3 RLOC Architecture. A top-level model-free RL agent operates in the symbolic space of mapped state features φ (x) and actions ψ(a). It selects between its symbolic actions a(i) , i = 1, . . . , na each linked to an underlying closed-loop feedback controller c(i) operating on the actuator space in continuous domain, for i = 1, . . . , na , where na is the number of actions available to the algorithm. The agent learns a mapping between a problem defined state space and RL actions by evaluating outcomes of each controller, which generates accumulated costs that are mapped into the RL framework driving the agent’s learning. The accumulated costs are attributed to a state-action pair (s, a), thus allowing the agent to chain sequences of successful actions for global use, repeating exploration over many RL epochs. 
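One possible instance of the feature mapping φ(x) and control mapping ψ(a) from steps (1) and (2) above is sketched below; the uniform grid, the bin counts and the digitize-based indexing are implementation assumptions made for illustration, not the thesis's exact construction.

import numpy as np

def make_feature_map(lows, highs, bins):
    """Return phi: R^l -> {0,...,ns-1}, a uniform-grid discretisation of chosen dims."""
    lows, highs, bins = map(np.asarray, (lows, highs, bins))
    edges = [np.linspace(lo, hi, b + 1)[1:-1] for lo, hi, b in zip(lows, highs, bins)]

    def phi(x):
        idx = [np.digitize(xi, e) for xi, e in zip(x, edges)]   # per-dimension bin
        return int(np.ravel_multi_index(idx, bins))             # flatten to one state id
    return phi

def make_control_map(gains):
    """Return psi: action index -> feedback law u(x_err) = -L x_err."""
    def psi(a):
        L = gains[a]
        return lambda x_err: -L @ x_err
    return psi

# Example: two joint angles discretised into a 6 x 6 grid (36 symbolic states);
# the [0, pi] bounds are illustrative.
phi = make_feature_map(lows=[0.0, 0.0], highs=[np.pi, np.pi], bins=[6, 6])
s = phi([0.5, 2.0])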
The combination of the two levels of abstraction results in a new problem formulation, where two distinct spaces are defined (see Figure 3.3): (1) The low-level Actuator Space representing information relevant to the optimal control definitions, such as the cost function, continuous states x ∈ Rℓ and controls u ∈ Rm ; and (2) The high-level Symbolic Space representing information relevant to the reinforcement learning agent (formulated as a finite-MDP) which deals with the feature and control variables mapped from the actuator space, such as discrete states s(i) , i = 1, . . . , ns and actions a(i) , i = 1, . . . , na where ns is the number of discrete RL states and na is the number of discrete RL actions (note: na is equal to the number of low-level controllers), as well as the mapped rewards. A typical mapping employed in this thesis (see Figure 3.4) involves discretising the continuous actuator state space X of nℓ chosen state dimensions into ns number of symbolic states s(i) ∈ S , i = 1, . . . , ns . In order to simplify explanation of RLOC design in this Chapter, I assume that only LQR controllers are employed as low-level actions, therefore na linearisation points (greed solid circles) are placed randomly in state space, from the location of which na individual low-level optimal controllers are calculated off-line using the OC framework. RLOC uses the low-level controllers to find the 58 Reinforcement Learning Optimal Control best boundaries for engaging each controller in a given region of state space. This true hypothetical boundary is continuous (blue curves), however due to the discretisation of state space in formulating the new problem, RLOC assigns actions to a gridded symbolic state space resulting in an approximation of these true control regions (red lines). Fig. 3.4 Arbitrary Problem Design. An arbitrary control problem is used to illustrate a mapping of choice between a low-level actuator state space X and a high-level symbolic state space S , here chosen as φ (x) : Rℓ → N+ (as opposed to φ (x) : Rℓ → R). This mapping is performed by choosing a number of state dimensions, nℓ (here chosen as nℓ = 2 and denoted by z1 , z2 ) to be discretised into an (un)even grid, creating {s(1) , s(2) , ..., s(ns ) } symbolic states. A naive assignment of LQR controllers to state space is shown for a hypothetical system in Figure 3.5a. The controllers labelled A–F are obtained from linearisation points (white solid circles) and the Naive policy selects at each time step k the controller with the shortest Euclidean distance to the present actuator state. This results in topologically conjunct regions of state space being served by the same controller (marked with colour blocks) until the Euclidean distance to another linearisation point becomes shorter, triggering a switch to that controller. The RLOC method on the other hand is able to smartly re-use the available controllers and find a solution using a high-level policy, where controllers are assigned to possibly non-topologically conjunct regions of state space (see Figure 3.5b). A pseudo Algorithm 1 provides a step by step overview of RLOC’s operations. The discretisation of the continuous actuator state space is a design choice implemented to perform the mapping between low- and high-level abstractions. It is possible to create a continuous mapping between these two levels using function approximation, allowing to deal with high-dimensional tasks. 
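The Naive assignment of Figure 3.5a reduces to a nearest-neighbour rule: at every time step the controller whose linearisation centre is closest in Euclidean distance to the current actuator state is engaged. A sketch, where the centres and labels are placeholders:

import numpy as np

def naive_action(x, centres):
    """Return the index of the linearisation centre nearest to state x (Euclidean)."""
    d = np.linalg.norm(centres - np.asarray(x), axis=1)
    return int(np.argmin(d))

# Example: six centres A-F along one state dimension.
centres = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0],
                    [1.5, 0.0], [2.0, 0.0], [2.5, 0.0]])
a = naive_action([1.2, 0.3], centres)   # -> 2, i.e. controller "C"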
In this work, discretisation of the continuous low-level space is performed, allowing to obtain proof-of-principle results, note however, that the low-level control is performed in the continuous state and control domains of the original problem. 3.2 Architectural Design (a) Niave state space controller activation. 59 (b) Example RLOC state-action space learnt mapping. Fig. 3.5 Example State-Action Mapping. An arbitrary optimal control problem is considered, showing an state-action space (learnt) mapping for (a) Naive and (b) RLOC control. Each square corresponds to a discretised symbolic state {s(1) , s(2) , ..., s(ns ) } used by the RL agent for assignment of high-level actions. Algorithm 1 RLOC Overview 1: Discretise actuator space to give an equally spaced grid of symbolic states. 2: Collect state-control mapped trajectories using naive control. 3: for each controller do 4: Specify the required parameters for the controller of choice (learn dynamics if necessary) 5: Obtain closed-loop controller (using learnt dynamics if necessary) 6: end for 7: for a number of RL iterations do 8: Use reinforcement learning algorithm of choice for sampling of the symbolic states 9: Implement low-level feedback controllers as actions 10: Improve RL policy using rewards based on accumulated costs of low-level controllers 11: end for 12: Use learnt policy to control the given task from any actuator state. In summary, the RLOC algorithm is designed to solve nonlinear time-invariant continuous control systems of the form Equation 3.2.1 with either known or unknown dynamics, built using a hierarchical architecture. A Reinforcement Learner is used as a high-level symbolic decision maker, which learns to chain low-level continuous optimal controllers forming global control solutions in continuous time and control domains. The learning loop between the low-level control and the high-level action selection is closed by mapping the accumulated quadratic cost (penalising control size and deviation from the target) of a chosen optimal controller to the high-level reward signal. The following section gives formal definitions of the hierarchical structure of RLOC and reviews the theory behind the selected high- and low-level controller design. Reinforcement Learning Optimal Control 60 3.3 Combining Reinforcement Learning and Optimal Control I will now formally define the hierarchical design of my framework and explain the algorithms employed on each level of abstraction, providing motivation and theoretical background for each choice. 3.3.1 Top-Down Hierarchical Design The Reinforcement Learning Optimal Control (RLOC) algorithm is hierarchically structured with a high-level RL decision maker and low-level continuous feedback controllers (e.g. LQRs, PIDs). Therefore the algorithm effectively operates in two spaces defined as (1) Symbolic Space describing every aspect of the symbolic level Markov Decision Process (MDP) such as the discrete actions A = {a(i) : i, . . . , na ∈ N+ }7 and discrete states S = {s( j) : j, . . . ns ∈ N+ }, where na and ns are the total number of reinforcement learning actions and states respectively; and (2) Actuator Space describing every aspect of the closed-loop control problem such as the continuous state x ∈ Rℓ and continuous control u ∈ Rm , modelled using k = 1, . . . , nK time steps, where nK is the total number of low-level controller time steps per episode (see Figure 3.3). Each high-level discrete RL action a(i) engages a corresponding low-level controller c(i) , i = 1, . . . 
, na when selected by the RL policy. The c(i) controller uses an existing state xk to calculate a continuous control uk which is in turn applied to the system advancing it to the next state xk+1 as a response. Hence the continuous time control interacts with the system to evolve the state, generating trajectories of state x̄ ∈ Rℓ×nK and control ū ∈ Rm×nK −1 . These control signals and state features are mapped into RL framework (see Subsection 3.2.2) and the learning loop is completed with the inclusion of a low-level quadratic cost J(e x, u) ∈ R (see Equation 2.2.4) which is defined uniformly 8 across all chosen controllers . This cost is observed by the Reinforcement Learner, and the action selection policy is improved in response. For systems with known dynamics the infinite time horizon LQR controllers can be directly built = 1, . . . , nlqr , where nlqr is the number of feedback gain matrices, using a linear model of the  system M (i) = ζ (i) , A(i) , B(i) where ζ (i) is the "linearisation centre" i.e. a point in state space (i) (i)  x0 , u0 around which the nonlinear dynamics are linearised and indicate the neighbourhood in which the approximation is expected to be most accurate, A(i) ∈ Rℓ×ℓ and B(i) ∈ Rℓ×m are the true system and control gain matrices. For systems with unknown dynamics the linear models need to be  b(i) ∈ Rℓ×ℓ is the estimated b(i) , Bb(i) , Σ b(i) estimated from data M (i) = ζ (i) , A , i = 1, . . . nlqr , where A w b(i) system control matrix, Bb(i) ∈ Rℓ×m is the estimated control gain matrix and Σ w is the estimated state covariance noise. In the case where a PID controllers are utilised, the linear model is represented by the  PID’s parameters, M ( j) = K p , Ki , Kd , j = 1, . . . , n pid , where n pid is the number of PID controllers, (i) L∞ , i 7 The cardinality of set A is finite and equal to the total number of low-level controllers used. The low-level controller costs are defined by the quadratic cost function typically utilsed in LQR theory, and remains the same irrespective of the type of controller selected. 8 3.3 Combining Reinforcement Learning and Optimal Control 61 since this form of control does not require knowledge of the system dynamics. Note the total number of low-level controllers is na = n pid + nlqr . Following this overview of the hierarchical structure of my framework, I will present the specific algorithms chosen for each level of RLOC’s abstraction and provide the specific theory required for the design of RLOC. 3.3.2 High-Level Controller Design A reinforcement learning agent is chosen to operate on the high-(abstract) level of the RLOC algorithm, due to both the engineering and neurobiological motivations given in Subsections 3.1.1 and 3.1.2 respectively. It involves an agent (a decision maker) which repeatedly interacts with its environment that is typically formulated as a Markov Decision Process (MDP), by choosing actions and observing the resulting state and reward, reflecting desirability of the most recent state change. The agent’s objective is to maximize the long term reward signal over time (see Subsection 1.3.3 for overview of reinforcement learning theory). I choose to employ model-free RL algorithms, as they typically make no assumption about the system dynamics other than the Markov property, and optimise control purely through interaction with the environment. 
Model-based RL algorithms on the other hand build internal models of the system dynamics based on interaction with the world and can learn from simulating experience of the learnt dynamics. These types of algorithms should be considered for cases when interactions with the physical system are "costly" e.g. patient drug administration. This highlights the architectural advantages of hierarchical algorithms, where the low-level design is separated from the high-level abstraction, and the high-level learner can be interchanged so long as the mapping between the two levels can be defined. Standard model-free RL algorithms for discrete state problems include Monte-Carlo (MC) Control, SARSA and Q-learning [239]. The high-level symbolic space domain thus allows for the task abstraction and task dimensionality reduction. The chosen RL agent thus operates on the symbolic space, where the original low-level actuator space, mapped into the RL framework, is Markovian since the environment’s response at RL time step t + 1 depends only on state and action at t, more formally Pr{st+1 = s′ , rt+1 = r|st , at } ∀s′ , r, st , at , where s is the RL state, s′ is the next RL state which is a result of applying action a at time t, producing an observable immediate reward r. However the use of abstractions in creating the high-level controller, results in some loss of information, making the symbolic space semi-Markovian [229]. This means that the Markov property condition is not strictly met by the RLOC algorithm, however the symbolic space is a coarse approximation to that, and the method of choice for the high-level controller, Monte-Carlo RL, is known to be robust in situations where this property is not strictly met [239]. Monte Carlo Reinforcement Learning A Monte Carlo (MC) agent is chosen because of its robustness in dealing with semi-Markovian environments [239]: its value estimates are not updated based on the value estimates of successor Reinforcement Learning Optimal Control 62 states (i.e. it does not bootstrap). The Monte-Carlo reinforcement learning algorithm is episodically iterative: it improves its policy π on an episode-by-episode basis using epochs – trajectory roll outs which record the {rt+1 , st , at } triplets until an absorbing state is met (see Figure 3.6 for a description of each epoch). MC typically employs ε -soft (i.e. π (s, a) > 0 ∀s ∈ S , a ∈ A (s)) policies, which allow both exploration – selecting an action deemed less successful than others, and exploitation – selecting an action which gives the highest long term reward – the greedy action. The MC algorithm −1  is expected to converge to the true solution with a convergence rate of O nT 2 based on the Law of Large Numbers, where nT is the number of MC simulations (epochs). Hence if one wishes to double MC result accuracy the number of simulations needs to be quadrupled. Fig. 3.6 Epoch Sample Path. Each epoch sample path (denoted in blue) begins with a randomly chosen symbolic state st (states are denoted with a white circle) and an action at (actions are denoted with a red circle) is chosen according to policy π . This results in the environment transitioning into a next state st+1 and producing a reward rt+1 associated with the choice of action at time t. This is recorded as a triplet {rt+1 , st , at } and the process is continued until an absorbing state (denoted with ’T’ in a green square) is reached, at which point a full epoch sample path is obtained. 
Note a full MC simulation involves obtaining many such epochs of varied duration, each depending on the number of steps taken until an absorbing state is reached. The policy can either be improved at the end of each epoch or at the end of a number of epochs (termed a batch). Note: figure adapted from Sutton and Barto [239].
Monte Carlo is also a natural algorithm choice for RLOC, not only because it handles semi-Markovian environments, but also because the tasks addressed in this thesis (e.g. pole balancing / arm reaching) are episodic – they break down into sequences of repeated trials with a defined goal (final time step) – which is a requirement of Monte Carlo. Arguably MC is also better suited to tasks under episodic conditions than SARSA or Q-learning, because MC does not bootstrap and keeps all the information about the visited states without the need for approximation.
Specifically, I choose to use the ε-greedy [275] on-policy MC Control algorithm with iterative updating as the high-level learner. The on-policy algorithm is used (see footnote 9), since the systems tested in this thesis are directly accessible, allowing the algorithm direct interaction and value function updates based on actual (vs. simulated) experience. Using an ε-greedy policy for estimating the action-values Qπ allows maintaining a degree of exploration of the non-greedy actions during learning and is guaranteed to be an improvement over any other ε-soft policy by the policy improvement theorem [239]. The choice of ε-greedy action selection often proves hard to beat [271]. It has been known to outperform softmax action selection [147] and is considered the preferred method of choice [114]. I choose an iterative (vs. batch) MC in order to take advantage of recency – updating the policy with the most recent experience after every epoch – alleviating the computational requirements by using the information instantly without storing it over many batches.
Footnote 9: On-policy algorithms estimate the policy while using it for decision making, as opposed to off-policy algorithms which use two policies: one for control (the behaviour policy) and one being estimated (the estimation policy).
ε-Greedy On-Policy Monte-Carlo Algorithm  The ε-greedy on-policy MC algorithm keeps a table of action-values Q̂π(s, a) ∈ R^{ns×na} under policy π, where ε ∈ [0, 1] is the degree of exploration, ns is the number of symbolic (RL) states and na is the number of symbolic (RL) actions, with entries for each state-action pair (s, a). It aims to approximate the expected future discounted return, the action-value function under policy π, Qπ(s, a) = Eπ(Rt | st = s, at = a) = Eπ(∑_{n=0}^{∞} γ^n r_{t+n+1} | st = s, at = a), where t is the RL time step, γ ∈ [0, 1] is the discount factor, r ∈ R is the observed immediate reward and R ∈ R is the discounted return. The control decisions are made by the policy π, which maps past experience to actions, and can be probabilistic (see the general pseudo code of Algorithm 2).
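As a complement to the pseudo code of Algorithm 2 listed next, a minimal tabular sketch of the two core operations – ε-greedy action selection from the Q-table and the every-visit update with a visit-dependent step size – is given below. The array shapes, the example triplets and the fact that only α is decayed here (ε would be decayed analogously) are illustrative assumptions, not the exact implementation.

import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Sample an action: greedy w.p. 1 - eps + eps/|A|, otherwise uniform over A."""
    n_actions = Q.shape[1]
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def mc_update(Q, visits, epoch, gamma, alpha0=0.1, mu=0.5):
    """Every-visit Monte Carlo update from one epoch of (reward, state, action) triplets."""
    rewards = np.array([r for r, _, _ in epoch], dtype=float)
    for t, (_, s, a) in enumerate(epoch):
        discounts = gamma ** np.arange(len(epoch) - t)
        G = float(discounts @ rewards[t:])           # discounted return from step t
        visits[s, a] += 1
        alpha = alpha0 / visits[s, a] ** mu          # visit-dependent step size
        Q[s, a] += alpha * (G - Q[s, a])
    return Q

# Example usage with ns = 36 states and na = 10 actions.
rng = np.random.default_rng(0)
Q = np.zeros((36, 10)); visits = np.zeros((36, 10))
epoch = [(-1.2, 3, 1), (-0.7, 4, 1), (-0.1, 5, 2)]   # (r_{t+1}, s_t, a_t) triplets
Q = mc_update(Q, visits, epoch, gamma=0.9)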
Algorithm 2 An ε-Greedy On-Policy Monte Carlo Pseudo Algorithm
1: Initialise, for all s ∈ S, a ∈ A(s):
2:   π(s, a) ← 1/|A(s)|
3:   Q̂π(s, a) ← 0
4:   i(s, a) ← 0, ∀s ∈ S, a ∈ A(s)
5: Repeat
6:   Generate an epoch τ using π
7:   For all (s, a) appearing in τ do
8:     i(s, a) ← i(s, a) + 1
9:     R_i ← return from every appearance of (s, a) in τ
10:    Q̂π_i(s, a) ← Q̂π_i(s, a) + α_i (R_i − Q̂π_i(s, a))
11:  For each s in the epoch
12:    a∗ ← arg max_a Q̂π_i(s, a)
13:    For all a ∈ A(s)
14:      π(s, a) = 1 − ε_i + ε_i/|A| if a = a∗, and ε_i/|A| otherwise
For simplicity, stationary policies are chosen for RLOC, where this mapping depends only on the current state. So for an observed state s and action a, π(s, a) ∈ [0, 1] gives the probability of choosing a in s. This critic-only reinforcement learning uses the estimated Q-values to determine the policy. In particular, the ε-greedy policy uses the estimates Q̂π(s, a) ∀s ∈ S, a ∈ A(s) to update the policy: π(st, at) = Pr(at|st) = 1 − ε + ε/|A| iff at = arg max_a Q̂π_t(st, a), and ε/|A| otherwise, with ε ∈ [0, 1].
In my implementation of the algorithm, a step-size parameter α_i = α̃ / i^µ (i.e. α̃ is reduced based on the number of visits i to the (s, a) pair, where µ ∈ [0, 1] is the rate of decay; see footnote 10) is used to incrementally improve the estimate of the action-value function, Qπ(s, a), during epoch τ^(j), j = 1, . . . , nT, where nT ∈ N+ is the total number of RL learning epochs. Equally, the degree of exploration is made step dependent, ε_i = ε̃ / i^ν (i.e. ε̃ is reduced based on the number of visits i to (s, a), where ν ∈ [0, 1] is the rate of decay), which incrementally reduces the degree of exploration based on the number of visits to (s, a). The i-th update of state s using action a (i.e. the i-th visit to state-action pair (s, a)) corresponds to the t-th time step, t = 1, . . . , n_T^(j) (see footnote 11), of the j-th epoch

Q̂π_{i+1}(s, a) = Q̂π_i(s, a) + α_i ( R_i(s, a) − Q̂π_i(s, a) )        (3.3.1)

R_i(s, a) = ∑_{n=0}^{n_T^(j) − t} γ^n r_{t+n+1}        (3.3.2)

where R_i(s, a) is the return of the i-th visit to (s, a) within the current epoch τ^(j), obtained by aggregating discounted immediate rewards from time step t of epoch j to the end-of-epoch time step n_T^(j), corresponding to the i-th update of (s, a). Each MC epoch τ^(j) is simulated for nK low-level controller time steps and whenever a new symbolic state s^(i), i = 1, . . . , ns is encountered the agent either chooses a new or the same action a^(i), i = 1, . . . , na (which engages the corresponding low-level controller). Every-visit Monte Carlo is performed, meaning that all of the visits to (s, a) pairs are considered for a given epoch. The ε-greedy on-policy MC reinforcement learning algorithm is thus used as the high-level learner, operating on a discrete state space with discrete actions which are coupled to low-level controllers, the choice of which is discussed in the next section.

3.3.3 Low-Level Controller Design

The choice of the low-level closed-loop controllers is not restricted to one particular type, permitted by the ability of the high-level reinforcement learning policy to learn the best action (each linked to a continuous low-level controller) for a given area of state space. In theory any feedback controller can be chosen (e.g. optimal Linear Quadratic Regulators (LQR), Proportional Integral and Derivative (PID), H-infinity (H∞) controllers etc.).
This means that the final policy learnt by RLOC simply will not pick any low-level controller deemed not to yield the highest long term reward. In this section I give an overview of the two types of controllers which are used in the RLOC framework, the PID and LQR controllers, and give the reasoning behind their selection.
Footnote 10: The use of a non-constant step-size parameter α_i was inspired by the sample average method [239], which results in a constant weighting for each prior return when estimating Qπ(s, a) for stationary environments.
Footnote 11: Note that the total number of time steps, n_T^(j) ∈ N+, for each epoch j differs between epochs and represents the number of recorded triplets within each epoch.
Proportional Integral and Derivative Controller  The Proportional (P), Integral (I) and Derivative (D) controller accounts for 90% of modern industrial control applications [140] and is a simple-to-implement closed-loop controller. This type of control is generally effective on simple systems that tend to be either SISO or decoupled MIMO systems, and hence PID implementation does not require a priori knowledge of the system dynamics. The PID is a closed-loop feedback controller [11] and uses three parameters, P, I and D (leading to its second name, Three-Term Control), which are tuned individually for each system using various techniques. My implementation involves time-discretising the continuous control in order to perform calculations on modern digital computers (see Chapter 2 for more information). The PID control applied to decoupled MIMO systems aims to minimize the error e(k) ∈ Rℓ at each time step k between the output y(k) ∈ Rℓ and a prescribed target state x∗ ∈ Rℓ ("set point"), calculated as e(k) = y(k) − x∗. The control signal at each time step, u(k), is calculated using the error variable utilised in three different ways: (1) the proportional error, i.e. the current error e(k), (2) the integral error, i.e. the sum of the history of past errors up to e(k) (i.e. up to time step k) and (3) the derivative error, i.e. calculated over the last time step. These three components are weighted by the proportional, integral and derivative gain parameters K_p, K_i, K_d ∈ Rm×ℓ respectively (tuned by the user for each system) and are summed to produce an overall plant control u(k) at time step k (see Figure 3.7).
Fig. 3.7 Overview of a PID Controller in Discrete Time Setting. The PID control involves three control parameters (P, I and D) and a measured output variable at each time step k only, hence it does not require knowledge of the system dynamics in order to control it. At each time step the error variable, e(k) = y(k) − x∗, is calculated and utilised in three different ways: (1) the current error e_k is weighted by the proportional term K_p, (2) the history of past errors up to e_k is summed and weighted by the integral term K_i and (3) the derivative of the error over the last time step is weighted by the derivative term K_d. These three components are summed to produce an overall plant control u_k.

u_k = K_p e_k + K_i ∑_{i=1}^{k} e_i + K_d (e_k − e_{k−1}) / ∆        (3.3.3)

where k = 1, . . . , nK is the time step, K_p, K_i, K_d ∈ Rm×ℓ are the proportional, integral and derivative weights, and ∆ is the discretisation time step. The proportional term K_p e_k ∈ Rm in Equation 3.3.3 is a simple error contribution to the control.
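A direct transcription of Equation 3.3.3, keeping a running sum of errors for the integral term and the previous error for the derivative term, might look as follows; the gain values in the usage line are placeholders rather than the tuned parameters reported later in Table 4.1.

import numpy as np

class DiscretePID:
    """Three-term controller u_k = Kp e_k + Ki * sum(e_1..e_k) + Kd (e_k - e_{k-1}) / dt."""

    def __init__(self, Kp, Ki, Kd, dt):
        self.Kp, self.Ki, self.Kd = np.atleast_2d(Kp), np.atleast_2d(Ki), np.atleast_2d(Kd)
        self.dt = dt
        self.e_sum = None
        self.e_prev = None

    def control(self, y, x_star):
        e = np.asarray(y, dtype=float) - np.asarray(x_star, dtype=float)
        if self.e_sum is None:                        # first call: no history yet
            self.e_sum, self.e_prev = np.zeros_like(e), e
        self.e_sum += e                               # integral term accumulator
        de = (e - self.e_prev) / self.dt              # derivative over the last step
        self.e_prev = e
        return self.Kp @ e + self.Ki @ self.e_sum + self.Kd @ de

# Placeholder gains for a 4-dimensional output and 1-dimensional control.
pid = DiscretePID(Kp=[[0, 0, -0.5, 0]], Ki=[[0, 0, -0.2, 0]], Kd=[[0, 0, -0.8, 0]], dt=0.01)
u = pid.control(y=[0.0, 0.0, 0.3, 0.0], x_star=[0.0, 0.0, 0.0, 0.0])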
The integral Ki ∑ki=1 ei ∈ Rm and derivative Kd k ∆k−1 ∈ Rm terms act as a forcing condition for slow convergence and as a damping force for an anticipated future oscillations respectively. The simple properties of the PID controller (which is not necessarily optimal) make it appealing to use in the RLOC architecture as a low-level controller, which may or may not be selected by the high-level policy (as discussed in Chapter 4). Linear Quadratic Regulator Controller The Linear Quadratic Regulator (LQR) is a popular controller used to provide an analytical solution for linear dynamical systems and its attractive feature is that it can be calculated off-line. From a purely engineering motivation (see Subsection 3.1.1), it forms an attractive low-level controller, which can be smartly re-used by the high-level policy if deemed appropriate. The neruobiological motivation (see Subsection 3.1.2) points to optimal feedback control being utilised in regions of the brain and the spinal cord. Hence overall it seems a good choice of low-level control in the RLOC hierarchy. The LQR control used in my thesis is derived in detail in Section 2.2, and now I will go over the key ideas piecing them together to describe how LQR control is used in the RLOC framework. The type of time-discretised linear dynamics to which an LQR controller is applied, has the form of Equation 2.2.2 in which x(k) ∈ Rℓ is the state and u(k) ∈ Rm is the control12 . The LQR inherently assumes the target state to be equal to zero, however in many settings a system may need to be controlled to a nonzero target state x∗ (where ∗ indicates ’target’). This requires a change of variables, e x(k) = x(k) − x∗ , such that in this shifted coordinate system the target is at zero, e x∗ = 0, with the linear dynamics becoming (see Equation 2.2.3) e xk+1 = Ae xk + Buk where A = I + ∂∂ xẋ ∆ ∈ Rℓ×ℓ and B = ∂∂ uẋ ∆ ∈ Rℓ×m are the time-invariant, discrete-time, state-independent system control and control gain matrices. In order to obtain the optimal control law u∗ (where ∗ indicates ’optimality’), a quadratic cost function of the state and control is minimised with respect to the control signal. In a finite time horizon setting this yields a feedback gain matrix L ∈ Rm×ℓ×nK −1 i.e. providing one gain matrix per time step k = 1, . . . , nK − 1. However in the RLOC setting only one low-level controller L ∈ Rm×ℓ is coupled to each high-level action a ∈ N+ . This ensures that when the high-level policy evaluates the outcome of a 12 In the case of nonlinear systems, this linear form of the dynamics can be obtained by linearising the nonlinear system (Equation 2.1.5) around a set of points (x0 , u0 ) in state X and control U space respectively (derived in Subsection 2.1.2). 3.3 Combining Reinforcement Learning and Optimal Control 67 controller (action) in different parts of state space, the comparison is between the same controller L and not its different parts depending on the time step k = 1, . . . , nK . In order to achieve this, an infinite control horizon setting is used which produces a single infinite feedback gain matrix L∞ ∈ Rm×ℓ to be applied to the system at all time steps. This type of control is typically used in tasks with no naturally defined end time point (e.g. hovering a helicopter above a point). The infinite horizon feedback gain matrix (2.2.10) can be obtained by iterating backwards in time the discrete-time Dynamic Riccati Equation (2.2.9) for a very large number of time steps neK ≫ nK (approximating their ’infinite’ number, e.g. 
ñK = 10^4 vs. nK = 300). The cost function in this approximately infinite control horizon (and nonzero target) setting is (see Equation 2.2.4)

J(x̃(k), u(k)) = (1/2) x̃_{ñK}ᵀ W_{ñK} x̃_{ñK} + (1/2) ∑_{k=1}^{ñK−1} ( x̃_kᵀ W x̃_k + u_kᵀ Z u_k )

where W = W_{ñK} ∈ Rℓ×ℓ ≥ 0 is a symmetric positive semi-definite state weight matrix and Z ∈ Rm×m > 0 is a symmetric positive definite input weight matrix, specifying the penalty on the state deviation from the target x∗ and the penalty on the control size respectively. Each low-level controller (i.e. an individual infinite horizon feedback gain matrix) is linked to a high-level action and, when activated, produces a continuous optimal control signal (see Equation 2.2.11) for an unfixed number of time steps as dictated by the high-level policy

u∗_k = −L∞ x̃_k

In an LQR framework the linear control solution is only optimal for linear systems, and in an infinite horizon setting a single feedback gain matrix is used for the whole duration of the simulation. This means that for nonlinear systems linearised at a point, the application of linear feedback controllers will soon become sub-optimal once the system departs from the point of linearisation, thus not guaranteeing a solution. Various techniques have been applied in order to utilise LQR controllers for global control solutions (see Section 1.4). The RLOC algorithm ensures a global near-optimal solution is found by learning how to chain low-level controls together using a high-level RL policy and linking each high-level action to a single low-level controller. Hence on the low level of abstraction RLOC utilises na different controllers of choice, which are engaged by the high-level policy. Each low-level controller controls the plant for a varied duration of time, until the high-level policy dictates a switch to a new action, while a single simulation lasts a total of nK time steps.
Learning of the Local Dynamics Online  Most real world control tasks either have very complicated dynamics which are not easily written in the mathematical notation of differential equations, or the dynamics may be unknown altogether. Since the locally optimal linear controllers (LQRs) used in RLOC require knowledge of the system dynamics, these can be learnt online through recursive estimation of an internal forward model of how control signals and the current state predict future states (i.e. learning how actuator control and movement are coupled to state space). The parameters of the linear dynamical system, A and B (see Equation 2.2.3), can be estimated using the LDSi algorithm (Linear Dynamical Systems with Inputs) [95], based on the expectation maximization method described in [94]. Linear Dynamical Systems are used to model the temporal evolution of systems that undergo changes which depend only on the previously observed state. A parallel between this inference and "motor-babbling" [47, 192] (constrained towards particular tasks, see 3.1.2) is made, since in engineering terms [71] this is equivalent to a naive exploration of the state space through random controller actions ('unstructured interactions'). An LDSi can be described using the following equations

z_{k+1} = A z_k + B u_k + w_k        (3.3.4)
x_k = C_ldsi z_k + v_k        (3.3.5)

where the observable output, x_k ∈ R^p, is linearly dependent on the hidden state z_k ∈ Rℓ, while subsequent states, z_{k+1}, are linearly dependent on the prior state z_k and controls u_k.
The terms wk ∼ N (0, Σw ) and vk ∼ N (0, Σv ) are the state and the observed output noise respectively, with the diagonal covariance matrices being estimated by the model, Σw ∈ Rℓ×ℓ , Σv ∈ R p . The A ∈ Rℓ×ℓ , B ∈ Rℓ×m ,Cldsi ∈ R p×ℓ matrices act on hidden state, input and observable state vectors respectively. The LDSi algorithm estimates the A, B and Cldsi matrices from data. In a simplified model setting employed here, the system is assumed non-deterministic, vk ̸≡ 0, with a fully observable state z(k) = x(k), hence Cldsi = I. The unobserved state noise term accounts for nonlinearities of the system and hence wk ̸≡ 0. The RLOC algorithm postulates that motor-babbling in humans provides the initial data for local models of body/World dynamics to be learnt, and are a key step in the development of more directed movement. Following the development of the initial local models in babies, and as the organism develops, a more sophisticated approach to executing novel motor tasks may be employed, e.g. by blending of the existing controls [59]). This section described the overall RLOC hierarchy and provided the theoretical underpinnings of each level of abstraction in the RLOC framework, setting the scene for showing how the specific design choices of low-level controllers (either using a mixture of PID and LQR or using single type of controllers only) influence the control solution for two benchmark tasks. 3.4 Summary Here the key algorithm developed in this thesis, Reinforcement Learning Optimal Control (RLOC), is presented and motivated through both engineering and neurobiological considerations. Its main aim is to solve nonlinear control problems with either known or unknown dynamics for multiple input multiple output systems. In the case where the dynamics are unknown, RLOC is able to infer the dynamics online through system identification [95]. The RLOC algorithm has a hierarchical design with high- and low-level abstractions, which allows it to implement the divide-and-conquer 3.4 Summary 69 technique for solving a control task globally. A reinforcement learning agent is chosen to operate on the high-level, which selects between the available low-level controllers, using their costs as high-level rewards driving the learning. In theory any low-level controller can be chosen, enabled by the RL agent’s ability to select only the best actions. In the next two chapters I will present the specific design and application of RLOC algorithm to problems with known (Chapter 4) and unknown (Chapter 5) dynamics. I will show how different low-level controllers are combined in a single policy in Chapter 4 to provide heterogeneous control and show in Chapter 5 how the gap between the low-level actuation control and the high-level symbolic decision making in the brain can be bridged using RLOC for neurobiological research. Chapter 4 Heterogeneous, Hierarchical Control of Systems with Known Dynamics "Science is organized knowledge. Wisdom is organized life." — Will Durant The previous chapter introduced the hierarchical framework, Reinforcement Learning Optimal Control (RLOC), motivated with engineering and neurobiological considerations, and explained the algorithm choices for the high- and low-levels of abstraction. In this chapter I am going to show how RLOC algorithm is designed and applied to finding global control solutions to two benchmark tasks: robotic arm and cart pole systems, where system dynamics are made available a priori. 
The utilisation of a mixture of low-level controllers: Proportional Integral Derivative and Linear Quadratic Regulator both described in Section 3.3.3, results in a heterogeneous control on the low-level of abstraction1 . 4.1 Introduction This chapter focuses on solving nonlinear control tasks with known dynamics and aims to utilise the engineering motivation of the RLOC design: the ability of RLOC to use a heterogeneous set of controllers at the low-level of abstraction that operate directly on task kinematics. Specifically, the locally optimal linear controllers (LQRs) and proportional integral derivative controllers (PIDs) are allowed to operate on the actuator level and are engaged by the symbolic high-level reinforcement learning policy. The agent learns the state-action mapping which is in contrast to using standalone low-level control, such as usually employed in the engineering literature [250]. I show that such a design is able to compete in terms of computational cost and control quality with a state-of-the-art engineering solution for nonlinear control problems: iterative Linear Quadratic Regulator algorithm [263], and that it offers a robust flexible framework to address large scale nonlinear control problems. 1 The initial findings presented here were first published in the proceedings of the Imperial College Computing Students Workshop (ICCSW, 2011) [3] and the fully developed work was published in the proceedings of the European Workshop on Reinforcement Learning (EWRL, 2012) [198]. Heterogeneous, Hierarchical Control of Systems with Known Dynamics 72 4.2 Methods The RLOC algorithm seeks to solve the following class of stationary continuous time nonlinear dynamical systems (introduced in Chapter 2, Equation 2.1.5) e e ẋ(t) = f(x(t), u(t)) = A(x(t))x(t) + B(x(t))u(t) e e ∈ Rℓ×ℓ is the nonlinear system control matrix acting on the state vector and B(x(t)) where A(x(t)) ∈ ℓ×m is the nonlinear control gain matrix acting on the control vector. The overall hierarchy of RLOC R in this chapter is as per Subsection 3.3.1 (see Figure 3.3), with the specific choice of actuator controllers c(i) , for i = 1, . . . , na total number of low-level controllers, representing a mixture of PIDs and LQRs utilised in the way described below. 4.2.1 Low- and High-Level Controller Specification Proportional Integral Derivative (PID) Control The theory of PID control (used at the low(actuator) level of RLOC) is explained in Section 3.3.3. Usually PIDs are applied to simple systems such as Single Input Single Output or decoupled Multiple Input Multiple Output (MIMO) systems. This is largely due to the simplicity of its implementation [140]. The stability of PID control solution has not been analytically proven for MIMO systems [106]. The degree to which a system is coupled can be determined using a Condition Number (CN) of a matrix, which is the ratio of the largest to smallest singular value of the matrix. A system is considered to be difficult to decouple for CN > 50. In order to ensure that PID controllers can be used on the benchmark MIMO systems addressed in this thesis (robotic arm and cart pole, Section 2.4), the typical A matrices for both benchmarks are obtained following the linearisation of the dynamics around a linearisation point ζ , which for the robotic arm was chosen to be ζ = [ π2 , π2 , 0, 0]T and for the cart pole ζ = 0 (upright unstable equilibrium), where A = I + ∂∂ xẋ ∆. The corresponding A matrices for each benchmark are plotted in Figure 4.1. 
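The coupling check described above reduces to a one-line computation once a linearised A matrix is available; the example matrix below is arbitrary and not one of the benchmark linearisations.

import numpy as np

def condition_number(A):
    """Ratio of the largest to the smallest singular value of A."""
    sv = np.linalg.svd(A, compute_uv=False)
    return sv.max() / sv.min()

# Arbitrary example of a weakly coupled linearised system matrix (not the benchmark values).
A = np.eye(4) + 0.01 * np.array([[0, 1, 0, 0],
                                 [0, 0, 0.5, 0],
                                 [0, 0, 0, 1],
                                 [0, 0.3, 0, 0]])
print(condition_number(A))   # well below the CN > 50 threshold for hard-to-decouple systems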
The singular value decomposition is performed in order to obtain the condition numbers for the robotic arm and the cart pole tasks, and these are shown to exhibit weak coupling (i.e. CN ≪ 50), giving CN of 1.1 and 1.4 respectively. The control signal uk ∈ Rm at time step k depends purely on the observation of the output, yk , and the error between the observed output and the target state ek = yk − x∗ , and is calculated using (see Equation 3.3.3) k ek − ek−1 uk = K p ek + Ki ∑ ei + Kd ∆ i=1 where ∆ is the discretisation time step, k = 1, . . . , nK is the simulation step number, nK is the total number of simulation steps, K p ∈ Rm×ℓ is the proportional gain, Ki ∈ Rm×ℓ is the integral gain and Kd ∈ Rm×ℓ is the derivative gain parameters (specified by the user), which are tuned to a particular task (where dimensionality of the parameters is task specific). 4.2 Methods 73 (a) System control gain of Robotic Arm. (b) System control gain of Cart Pole. Fig. 4.1 System control gain matrices of the Robotic Arm and the Cart Pole tasks. The system control gain matrices do not show large variation with respect to different linearisation points, hence this figure is a good representation of the system control gain values of the robotic arm and cart pole linearised systems. In the RLOC control framework, a set of three PID control parameters K p , Ki , Kd are treated as a single low-level controller c(i) , i = 1, . . . , na 2 , and the parameters are obtained by brief optimisation for individual tasks in parallel using a brute-force search method. The RLOC algorithm uses the following values of the PID parameters (see Table 4.1), where each column of parameters M ( j) = {K p , Ki , Kd }, j = 1, . . . , n pid represents a single PID controller for each of the benchmark tasks, and n pid is the number of PID controllers used on the low-level of RLOC. The table shows that the robotic arm simulation utilises n pid = 4 controllers, while cart pole simulation utilises n pid = 6 controllers. Each low-level PID controller is linked to a single high-level RL action a(i) , i = 1, . . . , na . Table 4.1 PID Parameters of RLOC Simulation. Sym Kp Ki Kd Sym Kp Ki Kd Robotic  −0.5 0 Arm   −0.2 0 0 0   −0.1 0 0 0   −0.8 0 0 0  00 0 −0.5 0 0   0 −0.2 0 0   0 −0.1 0 0   0 −0.8 0 0   −0.1 0 00 −1.0 0 0 0 −0.5 0 0 0 −0.3 0 0 0 0 −0.1 0 0   0 −1.0 0 0   0 −0.5 0 0   0 −0.3 0 0   −0.5 0 00 −1.0 0 0 0 −0.3 0 0 0 −0.1 0 0 0 0 −0.5 0 0 0 −1.0 0 0 0 −0.3 0 0 0 −0.1 0 0 Cart Pole [ 0 0 −0.5 0 ] [ 0 0 −0.2 0 ] [ 0 0 −0.8 0 ] [ 0 0 +0.8 0 ] [ 0 0 +0.5 0 ] [ 0 0 +0.6 0 ] [ 0 0 −0.1 0 ] [ 0 0 −1.0 0 ] [ 0 0 −0.2 0 ] [ 0 0 +0.2 0 ] [ 0 0 −0.5 0 ] [ 0 0 +0.3 0 ] [ 0 0 −0.5 0 ] [ 0 0 −1.0 0 ] [ 0 0 −0.3 0 ] [ 0 0 +0.3 0 ] [ 0 0 −0.5 0 ] [ 0 0 +0.1 0 ] SI Unit [a.u.] [a.u.] [a.u.] SI Unit [a.u.] [a.u.] [a.u.] Linear Quadratic Regulator (LQR) Control The theory of LQR control used at the low-(actuator) level of RLOC is explained in Subsection 2.2.2. The nonlinear dynamics (see Equation 2.1.5), made available to the RLOC algorithm a priori, are time-discretised and linearised around a set of (i) (i)  linearisation points ζ (i) , x0 , u0 for i = 1, . . . , nlqr , where nlqr is the total number of LQR controllers, 2 Note the RLOC algorithm used in this chapter employs n pid PID controllers and nlqr LQR controllers, making up a total of na number of low-level controllers, c (i.e. high-level actions, a). Heterogeneous, Hierarchical Control of Systems with Known Dynamics 74 producing the linearised dynamics. 
Hence each linearisation center ζ (i) contains the system and control  gain matrices A(i) , B(i) (see Subsection 2.1.2). Each pair of these matrices is used by the standard optimal control framework in an infinite horizon setting (for a nonzero target state x∗ ) to produce a (i) single feedback gain matrix (LQR controller) applicable at any time step k, L∞ (see Equation 2.2.10). The RLOC algorithm links each high-level action to an individual low-level continuous LQR controller which are applied in a finite horizon setting. This ensures that the same controller (and not its different parts dependent on the step k) is evaluated by the high-level policy when assigned to different areas of state space in the learning phase (see Subsection 3.3.3). The control matrices vary depending on the placement of linearisation points ζ (i) in the state X and control U (x) space (all controllers are (i) obtained for u0 = 0). For the initial testing of the RLOC algorithm in this chapter, an arbitrary number of such points is selected on an evenly spaced grid (an arbitrary example of such linearisation points placement is shown in Figure 3.4). (i) The feedback gain matrix L∞ , when chosen by the high-level policy as an action, is used together xk to compute a locally optimal linear control law at time step k (see Equation with the current state e 2.2.11) (i) e xk uk = −L∞ where u(k) ∈ Rm is the linear optimal control signal at time step k, e x(k) = x(k) − x∗ ∈ Rℓ is the state in the shifted coordinate system, that accounts for the nonzero target. Thus on the low-level of RLOC control each linearisation point, ζ (i) , i = 1, . . . , nlqr , represents a high-level action of a reinforcement learner a( j) , j = 1, . . . , na (where na = nlqr + n pid ), which allows the agent to learn the mapping from a set discretised symbolic states S to a set of local continuous closed-loop controllers that operate directly on task kinematics. (i) (i)  The linearisation points ζ (i) = x0 , u0 used for the Robotic Arm and the Cart Pole simulations (i) are defined in Table 4.2 (note u0 = 0 for all simulations). Table 4.2 Pre-Defined (Fixed) Linearisation Centers of LQR Controllers in RLOC Simulation. Benchmark Sym. Robotic Arm x0 Cart Pole x0 (i) (i) Value  1.57   1.57   1.57   1.57   1.57   1.57  1.83 1.31 0.78 2.35 0.26 0 0 0 0 0 0 0 0 0 0  0  0 0 0  0  0 0 0 0 0 −1.3 −2.5 1.3 0 2.5 0 0 0 0 0 2.87 0 0 SI Unit " rad # rad rad/s " rad/s # m m/s rad rad/s High-Level Reinforcement Learner The high-level reinforcement learner used in this chapter is described in detail in Subsection 3.3.2 and the exact values of the RL parameters used in the simulations of RLOC algorithm are shown in Table 4.3. The reason for the specific choice of each parameter of the Monte Carlo algorithm are described below: (a) a low epsilon, e ε , value is used to allow a restricted degree of exploration, hence the epsilon decay constant, ν , is also chosen to be low to allow for a slow rate of decay; (b) a high discount factor gamma, γ , makes the algorithm far-sighted, allowing it to 4.2 Methods 75 take future rewards (those further along the episode) more strongly into account when calculating the e , is used in order to reduce oscillations in the estimated action-value return; (c) a low learning rate, α function from one iteration to the next; (d) the learning rate decay, µ , is chosen to be higher than epsilon decay rate ν , to allow for a trade-off between the rates of decay. 
All of the above values have been found by running a small number of simulations with different parameters and keeping to the standard values in the literature [240]. Table 4.3 ε-Greedy On-Policy Monte Carlo Learning Parameters. Parameter Description Epsilon Epsilon decay Gamma Alpha Alpha decay Symbol e ε ν γ e α µ Value 0.1 0.1 0.9 0.1 0.5 SI Unit [a.u.] [a.u.] [a.u.] [a.u.] [a.u.] In the RLOC control framework, the high-level policy learning is driven by the observation of low-level rewards r(t) accumulated for selecting a particular action a(t) in a given state s(t), each linked to a continuous low-level controller (PID, LQR), defined by a linear model M (i) , i = 1, . . . , na . It is important for the high-level learner to be able to compare the choice of actions and the observed rewards across state space S , therefore the same quadratic cost based on Equation 2.2.4 is used across the continuous state space X , penalising the state deviation from the target and the amount of control used at each time step k. The accumulated quadratic cost of using low-level controller c(i) engaged by selecting action at in state st is recorded as negative reward rt+1 and is calculated using n  1 st T J(e x(k), u(k)) = ∑ e xk We xk + uTk Zuk 2 k=1 (4.2.1) where nst is the number of actuator time steps k = 1, . . . , nK spent in RL state st , with the maximum number of simulation time steps nK . 4.2.2 Heterogeneous Reinforcement Learning Optimal Control Algorithm The pseudo-algorithm for heterogeneous control of nonlinear systems with known dynamics using RLOC framework, is given in Algorithm 3. 4.2.3 Benchmark Experiments The two benchmark tasks studied in this thesis have been discussed and derived in detail in Section 2.4, specifically the Robotic Arm system (see Subsection 2.4.1) and the Cart Pole task (see Subsection 2.4.2). Heterogeneous, Hierarchical Control of Systems with Known Dynamics 76 Algorithm 3 Heterogeneous RLOC Algorithm 1. Perform feature mapping φ : Rℓ → S : discretise nℓ ≤ ℓ chosen dimensions of the actuator state, x ∈ X , into symbolic states {s(1) , s(2) , . . . s(ns ) } to obtain equally spaced nℓ -dim grid. 2. Perform control mapping ψ : A (s) → Rm : define a(i) , i = 1, . . . , na RL actions for na = n pid +nlqr using linear models M (i) 1: for each LQR controller c(i) , i = 1, . . . , nlqr , do (i) (i)  (i) 2: Randomly place linearisation point x0 , u0 in state X space and keep u0 = 0  3: Create a linear model M (i) = ζ (i) , A(i) , B(i) (known dynamics) 4: 5: 6: 7: 8: 9: 10: 11: (i) Derive infinite horizon feedback gain matrix L∞ using LQR theory (i) Associate symbolic action a(i) with optimal gain matrix L∞ end for for each PID controller c( j) , j = 1, . . . , n pid do, Briefly optimise the K p , Ki , Kd parameters using brute search  Create a linear model M ( j) = K p , Ki , Kd Associate symbolic action a( j) with M ( j) end for 3. Initialize policy π and state-action values Qπ , for all s ∈ S , a ∈ A (s). 1: π (s, a) ← |A 1(s)| 2: Qπ (s, a) ← 0 3: i(s, a) ← 0 ∀s ∈ S , a ∈ A (s) 4. Approximate best policy π (s, a) with a modified ε -Greedy On-Policy Monte Carlo RL. 1: for epoch τ ( j) , j = 1, ..., nT do 2: Pick a random state s ∈ S , choose its centre as initial actuator state xinit 3: Sample action a0 ∼ Pr(a|π , s0 ) and set symbolic time t = 0 4: Initialise control uinit from initial actuator state xinit and controller ψ (a0 ) 5: for Actuator time step k = 1, . . . , nK do 6: Update xk using previous state xk−1 , control uk−1 and system dynamics. 
7: if state changes i.e. φ (xk ) ̸= st then  nst e xTk We xk + uTk Zuk for at in st as negative rt+1 Store cost J(e x(k), u(k)) = 12 ∑k=1 8: 9: Append triplet {rt+1 , st , at } to a list (forming epoch path) 10: Increment symbolic time, t = t + 1 11: Update state, st = φ (xk ) 12: Sample action at ∼ Pr(a|π , st ) 13: end if 14: Update control uk from current state xk and controller ψ (at ) 15: end for 16: for each (s,a) pair appearing in episode τ ( j) do 17: i(s, a) ← i(s, a) + 1  bπ (s, a) ← Q bπ (s, a) bπ (s, a) + αi Ri (s, a) − Q 18: Q i i i bπ 19: Update π (s, a) to ε -Greedy from Q i 20: end for 21: end for bπ (s, a) ∀s ∈ S. 5. Finalise policy π (s) ← arg maxa Q 6. Use π (s) to control the system from any actuator state to the target. 4.2 Methods (a) Feature Mapping – Robotic Arm Task. 77 (b) Feature Mapping – Cart Pole Task. Fig. 4.2 Feature Mapping in RLOC Framework for Benchmark Tasks. The feature mapping φ (x) : Rℓ → S for both tasks is visually displayed for the robotic arm and the cart pole tasks, where an evenly spaced grid of symbolic states (i) (i) s(i) , i = 1 . . . , ns is obtained. The linearisation points x0 , i = 1, . . . , nlqr for LQR controllers (note u0 = 0) are depicted with green circles, chosen to be equally spaced in state space for the simulations carried out in this chapter. The dashed red line symbolises a stencil for the dimension of interest, along which the LQR controller placement varies. (a) Displays the robotic arm system, with the two discretised dimensions (θ1 , θ2 ), resulting in a total of ns = 36 RL states. (b) Displays the cart pole task, for which the most relevant dimensions (θ , θ̇ ) are discretised, resulting in a total of ns = 25 RL states. Robotic Arm The Robotic Arm control task represents planar arm reaching movement control to a selected target x∗ ∈ X by applying biologically plausible control torques to the joints θ . A feature mapping φ (x) : Rℓ → S is performed between the robotic arm actuator state x = (θ1 , θ2 , θ̇1 , θ̇2 )T ∈ R4 and the high-level symbolic state s(i) ∈ N+ , involving a discretisation of nℓ = 2 dimensions – the joint positions θ = (θ1 , θ2 )T in continuous state space, where θ1 is the shoulder angle and θ2 is the elbow angle – to produce a 2D-grid of equally spaced symbolic states (see Figure 4.2a). The control mapping ψ : A → Rm is performed between the symbolic actions a(i) , i = 1, . . . , na and the low-level continuous control u(k) ∈ Rm , where each controller is individually obtained for the known dynamics. In the case of PID, the parameters are briefly optimised, while the LQR controllers are obtained using optimal control framework [260] utilising the known dynamics. The nonlinearity of the robotic arm is concentrated in the elbow angle, as demonstrated by Equation 2.4.1 (where the inertia matrix and the vector centripetal and coriolis forces impose nonlinear sin(·), cos(·) operations on θ2 , the elbow angle). Therefore in order to help the RLOC algorithm solve the nonlinear dynamics, (i) (i) the LQR controller state linearisation points x0 ∈ X , i = 1, . . . , nlqr (note u0 = 0) are restricted to the elbow angle, where they are spaced equally along the dimension (see Figure 4.2a). 
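A small sketch of this linearisation-point placement – the elbow angle varied over an assumed [0, π] range while the shoulder angle and both angular velocities are held fixed – is given below; the exact centres used in the simulations are those of Table 4.2, so the values produced here are only illustrative.

import numpy as np

def elbow_linearisation_centres(n_lqr, shoulder=np.pi / 2, lo=0.0, hi=np.pi):
    """Equally spaced centres varying only the elbow angle; state = [th1, th2, th1dot, th2dot]."""
    elbows = np.linspace(lo, hi, n_lqr)
    return np.stack([np.full(n_lqr, shoulder), elbows,
                     np.zeros(n_lqr), np.zeros(n_lqr)], axis=1)

centres = elbow_linearisation_centres(n_lqr=6)   # one row per LQR linearisation point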
Cart Pole The control task of the cart pole consists of starting from any admissible state configuration xinit = [z, ż, θ , θ̇ ]T where z ∈ [0], ż ∈ [0], θ ∈ [−π , π ], θ̇ ∈ [−1.5π , 1.5π ], and learning an optimal policy π (s) for controlling the pole to and actively balancing the pole in an upright position of unstable equilibrium by applying admissible forces u ∈ [−20, 20] N to the front/rare of the cart. In this chapter Heterogeneous, Hierarchical Control of Systems with Known Dynamics 78 the cart is not required to stop at a pre-defined position while in the next chapter the target cart position is z = 0 (i.e. the penalty on the cart position is imposed through the state weight matrix W ). A feature mapping φ (x) : Rℓ → S transforms the cart pole low-level continuous state x = (z, ż, θ , θ̇ )T ∈ R4 into the high-level framework of discretised state space s(i) ∈ N+ , which involves segmenting the state dimensions, θ , θ̇ i.e. nℓ = 2, resulting in an equally spaced grid of symbolic states (see Figure 4.2b). The control mapping ψ : A (s) → Rm is performed between symbolic actions and the low-level actuation. The PID controllers are obtained by brute search optimisation and the LQR controllers are built using standard linear control theory [260]. The cart pole task is known to exhibit higher nonlinearities in its control solution than the robotic arm, due to the underactuation of the cart pole system where the rotation of the pole cannot be controlled directly. The state linearisation points (i) (i) x0 ∈ X , i = 1, . . . , nlqr (note u0 = 0) are chosen to be equally spaced along the pendulum angle, θ , while setting angular velocity θ̇ = 0 (see Figure 4.2b). The specific task parameters of the robotic arm and the cart pole tasks are shown in Table 4.4. Table 4.4 Simulation Parameters – RLOC. Parameter Dim. actuator state Dim. actuator control Dim. symbolic state Max. control size Max. neg. control size Num. epochs Sym. ℓ m ns umax umin nT State weight matrix W Control weight matrix Num. PID controllers Num. LQR controllers Total simulation time Discretisation step Actuator num. time steps (i) L∞ num. time steps Z n pid nlqr T ∆ nK neK Target state x∗ Robotic Arm 4 2 36 10 -10 1000   10 0 0 0 0 10 0 0 0 0 0.01 0  100 0 0 0.01 01 Cart Pole 4 1 25 20 -20 2000   0 0 0 0 0 0 0 0 0 0 200 0 0 0 0 20 4 6 3 0.01 300 10000 1 6 5 3 0.01 300 10000 [1.3, 1.3, 0, 0]T [0, 0, 0, 0]T SI Unit [a.u.] [a.u.] [a.u.] [N m], [N] [N m], [N] [a.u.] [a.u.] [a.u.] [a.u.] [a.u.] [a.u.] [a.u.] [a.u.] [a.u.]   rad rad, rad, rad , s s  m m, s , rad, rad s The number of epochs, nT , was optimised using brute-force search in the range [20,104 ]. Both benchmarks were simulated using the Runge-Kutta method which approximates solutions to ordinary differential equations of the form Equation 2.1.1, found using xk+1 = xk + 16 (l1 +2l2 +2l3 +l4 )+O(n5 ), where l1 = ∆f(xk , uk ), l2 = ∆f(xk + 21 l1 , uk ), l3 = ∆f(xk + 12 l2 , uk ), l4 = ∆f(xk + l3 , uk ). The simulation parameters of iLQR benchmark state-of-the-art algorithm are kept constant for all experiments and shown in Table 4.5 (where LM stands for Levenberg-Marquardt). 4.2 Methods 79 Table 4.5 Simulation Parameters – LQR / iLQR. 4.2.4 Parameter Dim. state Dim. control Num. start states Max. control size Max. neg. control size Sym. ℓ m ns umax umin State weight matrix W Control weight matrix Total simulation time Discretisation step Num. time steps Improvement tolerance Iteration num. tolerance LM initial lambda LM lambda multiplier LM max. lambda Initial control traj. 
The simulation parameters of the iLQR benchmark state-of-the-art algorithm are kept constant for all experiments and are shown in Table 4.5 (where LM stands for Levenberg-Marquardt).

Table 4.5 Simulation Parameters – LQR / iLQR.
Parameter (Symbol): Robotic Arm / Cart Pole [SI Unit]
Dim. state (ℓ): 4 / 4 [a.u.]
Dim. control (m): 2 / 1 [a.u.]
Num. start states (ns): 36 / 25 [a.u.]
Max. control size (umax): 10 / 20 [N m] / [N]
Max. neg. control size (umin): −10 / −20 [N m] / [N]
State weight matrix (W): diag(10, 10, 0.01, 0.01) / diag(0, 0, 200, 20) [a.u.]
Control weight matrix (Z): diag(1, 1) / 1 [a.u.]
Total simulation time (T): 3 / 3 [s]
Discretisation step (∆): 0.01 / 0.01 [s]
Num. time steps (nK): 300 / 300 [a.u.]
Improvement tolerance (tol δ_ilqr): 0.001 / 0.001 [a.u.]
Iteration num. tolerance (ntol_ilqr): 300 / 300 [a.u.]
LM initial lambda (λ_init_ilqr): 0.1 / 0.1 [a.u.]
LM lambda multiplier (e_λilqr): 10 / 10 [a.u.]
LM max. lambda (λ_max_ilqr): 10^3 / 10^3 [a.u.]
Initial control traj. (ū0_ilqr): 0̄ / 0̄ [N m] / [N]
Target state (x*): [1.3, 1.3, 0, 0]^T / [0, 0, 0, 0]^T [rad, rad, rad/s, rad/s] / [m, m/s, rad, rad/s]

4.2.4 Study Of The LQR Feedback Gain Matrices – Robotic Arm, Cart Pole

The individual control trajectories ū^(i)(k), i = 1, ..., nL (where nL is the total number of feedback gains / linearisation points used in this study), produced by the different LQR control matrices L∞^(i), are studied in order to obtain a better understanding of how each LQR controller behaves when trying to solve each nonlinear benchmark problem (robotic arm, cart pole) monolithically. Visual observation of the individual control and state trajectories aids understanding of the potential behaviour of the high-level RLOC learner when utilising each low-level controller in a region of sub-space. The linearisation points (x0^(i), u0^(i)), i = 1, ..., nL, are equally spaced along the chosen dimensions of each control task (see the red dotted line in Figure 4.2). The control part of each linearisation point is kept at zero, u0^(i) = 0 ∀i (which is consistent with how the low-level LQR controllers are obtained in RLOC). Each linearisation point is used to obtain the parameters A, B for the calculation of the infinite control horizon gains L∞^(i) (Equations 2.2.9, 2.2.10) using standard LQR theory [260]. A total number nL = 40 of state linearisation points x0^(i) (resulting in nL gains) is used in both control tasks, with the parameters of Table 4.4. Each control trajectory is obtained using Equation 2.2.11 and setting each starting state xinit = x0^(i).

Gains – Robotic Arm The linearisation points (x0^(i), u0^(i) = 0), i = 1, ..., 40, are equally spaced along the elbow dimension on the [0, π] interval, with a shoulder angle starting position of π/2 radians and angular velocities of zero, resulting in nL = 40 L∞^(i) gain matrices. An arbitrary position of the shoulder reflects the fact that this angle does not contribute to the nonlinearities of the system, as per Equation 2.4.1. The control trajectories ū^(i)(k) are produced one per linearisation point, using the gain matrices L∞^(i) and the information about the current state x_k, according to Equation 2.2.11. The robotic arm is controlled to the target of x* = x̃* = 0, with parameters initialised as per Table 2.1. The control trajectories are reported in Figure 4.3a and the corresponding elbow angle trajectories θ̄2^(i)(k), i = 1, ..., 40, k = 1, ..., 300, are displayed in Figure 4.3b. This investigation shows that the target elbow angle θ2 = 0 is approached correctly but imprecisely, while the control size is reduced to near zero, consistent with what is expected from an LQR controller. This means that even though the individual locally optimal linear LQR controllers are able to solve the robotic arm reaching control task, they do so sub-optimally due to the task nonlinearities.

(a) Control trajectories, ū^(i)(k). (b) Elbow angle trajectories, θ̄2^(i)(k).
Fig. 4.3 Control Trajectories and State Produced for nL Linearisation Points – Robotic Arm. The control ū^(i)(k) for i = 1, ..., nL, nL = 40, and the corresponding elbow trajectories θ̄2 ∈ [0, π] are displayed. (a) Near optimal, smooth controls are produced, ranging between [0.8, −10] N m, which is biologically plausible, indicating an appropriate cost function. The amount of initial torque applied to the elbow joint increases with the size of the angle, as expected. (b) The corresponding elbow angle θ2 is taken to the near 0 target state as intended, indicating that appropriate controls are applied to the system. The target state is not precisely achieved, however, and small oscillations can be seen, indicating sub-optimality.
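To make the construction of the individual gains concrete, the sketch below shows how an infinite-horizon LQR gain can be computed from a linearisation (A, B) and the weight matrices W, Z, and then rolled out on the nonlinear plant with the admissible control bounds enforced. It assumes the discrete-time LQR formulation and uses SciPy's solve_discrete_are for the Riccati solution purely for illustration; it is not the exact routine of Equations 2.2.9-2.2.11.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_infinite_horizon(A, B, W, Z):
    """Infinite-horizon discrete-time LQR gain L such that u_k = -L (x_k - x*)."""
    P = solve_discrete_are(A, B, W, Z)                      # steady-state Riccati solution
    L = np.linalg.solve(Z + B.T @ P @ B, B.T @ P @ A)
    return L

def rollout_lqr(f_step, L, x_init, x_target, n_steps, u_min, u_max):
    """Apply a single LQR controller, obtained at one linearisation point, to the nonlinear plant."""
    x, xs, us = np.asarray(x_init, float), [], []
    for _ in range(n_steps):
        u = np.clip(-L @ (x - x_target), u_min, u_max)      # enforce admissible control bounds
        xs.append(x.copy()); us.append(u)
        x = f_step(x, u)                                    # e.g. one RK4 step of the true dynamics
    return np.array(xs), np.array(us)
```

In this form, a single gain paired with its linearisation point is exactly the object that the high-level RLOC learner later treats as one symbolic action.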
Gains – Cart Pole The state linearisation points x0^(i), i = 1, ..., 40, are equally spaced along the pole angle θ ∈ [0, π], while keeping the cart position z, cart velocity ż and angular velocity θ̇ at zero (note u0^(i) = 0). The reason for not using the full angle range of [−π, π] is the symmetry of the cart pole task about the upright unstable equilibrium of θ = 0, which results in equality of the feedback gain matrices for angles of equal but opposite sign (i.e. the dynamics have the same behaviour)³. The control trajectories obtained with the L∞^(i) gain matrices for xinit = x0^(i), and information about the states x(k) at each time step k, are shown in Figure 4.4. The control trajectories clearly show that the LQR controllers are only able to solve the cart pole task within the vicinity of the unstable upright pole equilibrium, x* = x̃* = 0, i.e. for the pole angle in the approximate range of [0, 1] rad, while admissible controls of [−20, 20] N, as employed in RLOC, are only observed for angles in the range of [0, 0.3] rad (see Figure 4.4a). All of the other observed control for controlling the cart pole with LQRs obtained in the range of [1, π] rad (see Figures 4.4b-d) provides unstable solutions, with controls far beyond the admissible size. The difficulty in solving the cart pole task for angles beyond ±0.3 rad is due to the fact that LQR is only optimal for linear systems, and the nonlinearity of the cart pole task increases the further the starting angle is from the target. Hence simple stand-alone LQR controllers are unable to solve the cart pole task unless the stabilisation is performed very near to the target state of the upright angle. This raises the interesting question of whether the RLOC algorithm can successfully utilise LQR controllers with linearisation points beyond [0, 0.3] rad and potentially re-use them in areas of state space other than where they were obtained, in order to solve the global nonlinear task optimally.

³ The dynamics for state variables (θ = a, θ̇ = 0) ≠ (θ = a, θ̇ ≠ 0) due to the nonlinearities in θ̇ (see Equation 2.4.2).

Fig. 4.4 Control Trajectories Produced for nL Linearisation Points Grouped by Similarity – Cart Pole. The state linearisation points x0^(i), i = 1, ..., 40, are equally spread along the pendulum angle dimension θ ∈ [0, π]. (a) The control solution for starting angles [0, 1.3] rad is smooth and continuous, while guiding the cart pole to target, as evidenced by the reduction of the control to near zero size (17 trajectories). The red colour indicates control produced for lower starting angle values, i.e. least effort required for stabilisation. (b) The LQR control produces inadmissible controls for starting pole angles [1.3, 1.6] rad (3 trajectories). The solid squares indicate the point after which the control signals exploded and therefore the reported trajectories were truncated before this point.
(c) Oscillating control produced for the angle range [1.6, 1.9] rad means that no solution was found (4 trajectories). (d) The control trajectories produced for the angle range [1.9, π] rad can be seen to destabilise quickly, producing oscillating behaviour with inadmissible control size (16 trajectories).

The investigated control trajectories reveal that for the robotic arm task the LQR controllers are able to solve the task sub-optimally, while for the cart pole task the LQR controllers are only able to provide solutions near the upright unstable equilibrium. The RLOC solutions to controlling these two benchmark tasks using both LQRs and PIDs as low-level controllers, providing heterogeneous control, are now discussed in the results section below.

4.3 Results

Two benchmark tasks are examined: the Robotic Arm (see Subsection 2.4.1) and the Cart Pole (see Subsection 2.4.2). This chapter aims to show that RLOC can achieve near optimal control by performing a single learning pass over the control task with a fixed LQR linearisation point placement, and an average of 5 trials is reported. The next chapter will focus on carrying out a more detailed analysis of many learning trials while varying the linearisation point locations. The full list of parameters used in the simulations (including the locations of the linearisation points for LQR, ζ^(i) = (x0^(i), u0^(i)) for i = 1, ..., nlqr, and PID, ζ^(i) = (Kp, Ki, Kd)^(i) for i = 1, ..., n_pid) can be found in Tables 4.1, 4.2, 4.3 and 4.4. At the end of the simulation a deterministic RL policy π(s) is obtained from the stochastic learnt policy π(s, a) and is used to generate state x̄^(i) and control ū^(i) trajectories for each symbolic state s^(i) ∈ S, i = 1, ..., ns, used as a starting state. Each trajectory is generated by equating the centre of each symbolic state to xinit (see Figure 4.2) and activating low-level controllers according to π(s) for nK (actuator level) simulation steps. The high-level policy switches actions based on the encountered symbolic states, while the low-level actuation is performed in a continuous state and action domain. The state-of-the-art iLQR algorithm is used to find the control and state trajectories (ū^(i), x̄^(i)), i = 1, ..., ns, following an iLQR policy, using the same starting states and simulation parameter values as in Table 4.5. The results are compared based on the accumulated quadratic costs of the state and control trajectories (one cost per starting state simulation, i.e. per (ū^(i), x̄^(i))), calculated using Equation 4.2.1. The final results are reported using the Relative Cost Difference (RCD)

RCD = (J_rloc − J_b) / |J_b| · 100%   (4.3.1)

where J_b is the total cost for controlling a task with a benchmark LQR/iLQR algorithm, resulting in the J_lqr, J_ilqr cost respectively. Hence negative RCD values indicate that RLOC costs are lower than the benchmark (i.e. that it is less costly to control the system using RLOC, which is desired).
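As a small illustration of how these comparison numbers are produced, the sketch below accumulates the quadratic cost of one state-control trajectory and converts two such totals into the Relative Cost Difference of Equation 4.3.1; the quadratic form is assumed to match Equation 4.2.1, with states expressed as deviations from the target.

```python
import numpy as np

def accumulated_quadratic_cost(xs, us, W, Z, x_target):
    """Accumulated quadratic state/control cost of one (state, control) trajectory."""
    err = xs - x_target
    return 0.5 * (sum(e @ W @ e for e in err) + sum(u @ Z @ u for u in us))

def relative_cost_difference(J_rloc, J_benchmark):
    """Relative Cost Difference (Equation 4.3.1): negative values mean RLOC is cheaper."""
    return (J_rloc - J_benchmark) / abs(J_benchmark) * 100.0
```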
Results – Robotic Arm The investigation of the accumulated costs for controlling the robotic arm with the learnt deterministic RLOC policy π(s) (single learning trial, see Figure 4.5) reveals that the solution quality of RLOC is almost as good as that of the state-of-the-art nonlinear control algorithm iLQR (see Figure 4.5a), with a difference of at most 1%, when using nlqr = 6 LQR and n_pid = 4 PID controllers. This is very encouraging, since iLQR was specifically designed for solving the nonlinear 6-muscle arm model and is expected to do well on the even simpler arm model utilised here. LQR control is only optimal for linear systems and can be seen to solve the arm reaching task sub-optimally (see Figure 4.3). A comparison between RLOC and LQR is also made (see Figure 4.5b) and it reveals that RLOC beats LQR for almost all of the states by up to 89%, while matching the performance of LQR for the states where the nonlinearity is most minimal, i.e. those states where θ2 = 1.3, which do not require much change in θ2, the variable containing all of the nonlinearity of the arm kinematics, in order to reach the target (θ1, θ2) = [1.3, 1.3]. This is a solid result, meaning that a reduced number of controllers (in this case a total of 10) is able to achieve optimal control for areas of state space which are linear w.r.t. the target state, and also meet the quality of the state-of-the-art iLQR solution for areas containing nonlinearities. The RLOC algorithm did not seem to benefit from the inclusion of the PID controllers at the low-level of actuation when solving the robotic arm task, since the high-level policy did not assign any PID controller to any area of state space. This may seem to be a negative result, however it is actually an interesting one, because it highlights an architectural design advantage of the RLOC algorithm: the high-level policy is able to discard controllers which do not provide the best possible solution (least cost) for given symbolic states when seeking to achieve the overall task goal. In the case of the RLOC simulations for the control of the robotic arm, the LQR controllers were much better at solving the overall task, which meant that the PID controllers were not chosen. In order to show that the RLOC algorithm did indeed learn to control the robotic arm to the target state, an average of 5 learning trials (with the same starting conditions, hence variation arises only from the Monte Carlo RL algorithm) is obtained (see Figure 4.6). The error in the final position of the robotic arm is reported in Figure 4.6a. First, the error in the final position, e_d, for each dimension, d, of the robotic arm is recorded for all symbolic starting states s^(i), i = 1, ..., ns, across all trials, n (e.g. the shoulder dimension error is e_θ1^(i) = θ1^(i) − θ1*, where * indicates the target). The error e_d is averaged through the n = 5 learning trials across the states s^(i) to give ē_d^(i), i.e. resulting in ns different average errors for each dimension. The average of ē_d^(i) is then taken to find a single value representing the 'error in the final position' for dimension d, µ_ēd. The results show that the RLOC algorithm outperforms the LQR algorithm in terms of the final end effector position for a given target and outperforms iLQR for the elbow dimension (but not the shoulder dimension). Overall, this figure shows that all 3 algorithms correctly solve the robotic arm reaching movement task with a varied degree of precision. The high-level RL learning curve is reported for the RLOC algorithm (see Figure 4.6b), where a direct sample of the RLOC learning reward signal is averaged (sampled at a 6 epoch interval) across trials. The costs for the benchmark algorithm iLQR (i.e. the average of iLQR costs for controlling the arm from all starting states to target, J̄_ilqr) – red line, and LQR – green line, are also provided.
The plot shows that on average the direct reward experienced by RLOC during the learning phase beats the LQR after 300 epochs, while achieving the same cost solution as the optimal iLQR control algorithm after 500 epochs. The mean CPU usage for the simulations is RLOC (unoptimized) 281 ticks, iLQR 95 ticks and LQR 8 ticks.

(a) Relative Cost Difference for RLOC vs. iLQR – Robotic Arm. (b) Relative Cost Difference for RLOC vs. LQR – Robotic Arm.
Fig. 4.5 Relative Cost Difference Results, RLOC vs. LQR/iLQR – Robotic Arm. The RCD values are reported, where a negative number indicates that RLOC excels in controlling the task when compared to a benchmark policy. The solid red circle at the position (θ1, θ2) = [1.3, 1.3] radians indicates the target state (symbolic state 15). (a) The accumulated costs are compared between RLOC and iLQR, revealing that RLOC is able to achieve near-optimal performance which is on par with the solution quality of the state-of-the-art iLQR algorithm. (b) The results of comparing RLOC vs. LQR show that RLOC outperforms LQR for the states with elbow angle θ2 ≠ 1.3, while meeting the performance of LQR for the states where θ2 = 1.3. This result reflects the nonlinearities of the arm kinematics, which are contained in the elbow angle only.

(a) Average effector position error – Robotic Arm. (b) Learning curve over 5 trials – Robotic Arm.
Fig. 4.6 Learning Curves – Robotic Arm. Averaged results for 5 learning trials are reported. Note LQR/iLQR cost = −reward. (a) The average error in the end effector position, µ_ēd, shows that all 3 algorithms achieve the control solution of the reaching arm movement. (b) The average sample of direct reward experienced by the RLOC algorithm during learning is reported against the benchmark of iLQR (average cost – red line) and the linear optimal controller LQR (average cost – green line). The results show that RLOC learning is achieved after 500 epochs, after which a stable policy for the control of the robotic arm is produced.

Results – Cart Pole The cart pole is a much more difficult problem to solve than the robotic arm, since the rotation of the pole cannot be controlled directly (i.e. the system is underactuated) and therefore represents a nonlinear control problem with a higher degree of difficulty. The investigation of the accumulated costs for controlling the cart pole with the learnt deterministic RLOC policy π(s) is shown in Figure 4.7 (single learning trial, fixed controller placements, nlqr = 5 and n_pid = 6). Figure 4.7a reveals that the solution quality of RLOC beats that of the benchmark state-of-the-art iLQR algorithm by up to 36%. This is an improvement over the results of the robotic arm, where RLOC was only able to approach the iLQR performance. It can be seen that RLOC performs better than iLQR in the regions of state space which would be most difficult to control: pendulum hanging down and given a high initial velocity. The large differences recorded for some of the states, for example state s^(1), could be due to iLQR particularly struggling to solve the control task from that starting state. Typically, the recommended initial control trajectory of ū0_ilqr = 0̄ [263] (where (·̄) indicates a trajectory) was supplied to the iLQR algorithm. However, iLQR struggled to provide solutions for some starting states, specifically those making up the central band region (θ = 0 target, nonzero initial velocity).
The initial control trajectory of ū0_ilqr = 0̄ applied to initial states with θ = 0 results in first dropping the pendulum (i.e. no control applied), creating an initial trajectory that is far from optimal. However, it is not clear why iLQR failed to converge for those states altogether. In the cases where it failed to outperform the LQR controller for a given initial state, other initial control trajectories were supplied: ū0_ilqr = Ī, ū0_ilqr = −Ī and ū0_ilqr = ū*_lqr (where * indicates optimality). The first two did show some incremental improvement, however iLQR did not improve on the LQR solution. This result is explained by the fact that an LQR controller linearised at the x0 = 0 target position is able to solve the cart-pole balancing task for initial configurations within approximately ±0.3 rad of the upright unstable equilibrium (Figure 4.4a), and the controller is able to compensate for cases where some initial velocity is present. Hence the LQR controller is already near-optimal in this region of state space (less nonlinear and hence suitable to be solved with a linear controller), thus making it difficult to improve further. Figure 4.7b shows that RLOC is able to outperform LQR for 18 out of 25 states s^(i) ∈ S, i = 1, ..., 25, by up to 74%. These states correspond to regions of state space containing higher nonlinearities than the central band of near upright position (θ = 0) with initial velocities (θ̇ ∈ [−1.5π, 1.5π]). As shown in Figures 4.4b-d, a stand-alone linear optimal LQR controller cannot solve the highly nonlinear cart pole task from areas beyond 0.3 rad within admissible control bounds. The central band region of the cart pole RLOC solution shows that it is not able to outperform the LQR solution (same result as iLQR) for the initial pendulum configurations of near upright (target) position containing initial velocity. Since these states are not as nonlinear as the rest of the state space, an LQR controller linearised at the top, x0 = 0 (optimal for linear systems), is able to better solve the task for that area of state space, compared to the global solution found by the RLOC policy⁴. One of the low-level controllers provided to RLOC was an LQR controller linearised at x0 = 0. The result of Figure 4.7b means that this LQR controller was not selected to be activated in the central band region states when searching for the global solution, even though it would have been locally optimal. Hence RLOC finds a global (yet sub-optimal) solution for the entire state space when controlling the cart-pole to the target upright unstable equilibrium state. The RLOC algorithm did benefit from the inclusion of the PID controllers at the low-level of actuation when solving the cart pole task, since the high-level policy π(s) did assign some PID controllers to operate on areas of state space.

⁴ The pre-fixed controller centres ('linearisation points') were located at equal and opposite angles of the pendulum (see Table 4.2), which were shown to produce exactly the same control matrices (see Section 4.2.4). Therefore in reality this cart-pole task was actually solved by RLOC with 3 and not 5 controllers, since 2 controllers were duplicate solutions.

(a) RLOC vs. iLQR – Cart Pole. (b) RLOC vs. LQR – Cart Pole.
Fig. 4.7 Relative Cost Difference Results, RLOC vs. LQR/iLQR – Cart Pole. The Relative Cost Difference (RCD) values are reported, where a negative number indicates that RLOC excels in controlling the task when compared to a benchmark policy. The solid red circle at the position (θ, θ̇) = [0, 0] indicates the target state (symbolic state 13). (a) The accumulated costs are compared between RLOC and iLQR, revealing that RLOC is able to outperform by up to 36% (and at worst equal) the optimal performance of the nonlinear iLQR control benchmark. (b) The results of comparing RLOC vs.
LQR show that RLOC outperforms LQR for 18 out of 25 states, by up to 74%, for the regions of state space which are most nonlinear. The middle part (the coarse 'central band', θ = 0, i.e. pendulum already upright but containing some amount of velocity) can clearly be seen to be learnt sub-optimally when compared to the LQR controller. This result reflects the fact that LQRs are able to control the cart pole task in the vicinity of the target (even when some velocity is present).

The average error of the end effector (pole) is shown in Figure 4.8a. The procedure for obtaining the average error values is the same as explained for the case of the robotic arm above. The results show that RLOC is the best of the 3 algorithms in controlling the cart pole to the stationary upright target state, and it does so with an error of µ_ēθ = 0.05 rad, µ_ēθ̇ = 0.15 rad/s. In the case of iLQR, the control solution tends to be nonstationary in θ̇, which it misses on average by µ_ēθ̇ = 1.45 rad/s. The LQR result shows that on average it does not solve the task effectively, struggling to bring the cart pole to target, with µ_ēθ = 0.9 rad and µ_ēθ̇ = 0.95 rad/s. Figure 4.8b shows the average cumulative reward (± standard error of the mean (SEM)) experienced by the RLOC high-level agent for each epoch (tested at a 10 epoch interval, with the total number of epochs nT = 2000, and compared to the LQR and iLQR benchmarks). The results show that RLOC (red curve) outperforms the LQR algorithm (black line) after 250 epochs and the iLQR algorithm (green line) after 500 epochs, following which the direct reward experienced by the agent stabilises. The oscillations observed for the direct reward are due to the nature of the high-level learner. The ε-Greedy Monte Carlo agent keeps a degree of exploration of the non-greedy actions (those deemed less desirable), therefore potentially selecting actions with higher negative reward. Equally, the direct reward sample is calculated as a cumulative sum of all experienced rewards for a given epoch, therefore particularly long epochs may produce higher negative total rewards. In this simulation the high-level control policy did benefit from the inclusion of PID controllers at the low-level of actuation, and these were selected for the final control policy. It shows that the PIDs can indeed be integrated with LQRs at the low-level of actuation, while letting the high-level agent dictate how best to combine them for global use.

(a) Average error of final state – Cart Pole. (b) Learning curve over 5 trials – Cart Pole.
Fig. 4.8 Learning Curves – Cart Pole. Average results for 5 learning trials are reported. Note LQR/iLQR cost = −reward. (a) The average errors of the (θ, θ̇) state values, µ_ēd, are the lowest for the RLOC control solution, while the errors for iLQR tend to show non-stationary control solutions (θ̇ of up to 1.5 rad/s). The LQR result, on the other hand, shows that on average it struggles to bring the cart pole to target, missing it by up to 1 radian (i.e. not solving the task); errors are supplemented by the standard error of the mean (SEM, pastel pink band).
(b) The average sample of direct reward experienced by the RLOC algorithm during learning (red curve) is reported against the benchmark of iLQR (average cost – green solid line) and the linear optimal controller LQR (average cost – black solid line). The results show that the direct cost sample reduces as the simulation progresses, indicating learning. A good solution quality is achieved after 500 epochs, when the iLQR benchmark starts to be outperformed. The mean CPU usage for the simulations is RLOC (unoptimized) 402 ticks, iLQR 350 ticks and LQR 6 ticks.

4.4 Discussion

This chapter tested the engineering motivation of the RLOC algorithm: whether the best of both worlds of reinforcement learning and optimal control can be brought together to compete with, or outperform, the solution of the state-of-the-art engineering nonlinear control algorithm iLQR. A parallel question of interest rested in whether RLOC would be able to combine (or would benefit from combining) heterogeneous low-level controllers, i.e. having the availability to use a mixture of different types of closed-loop controls, operating directly on the task kinematics at the actuator level of the RLOC hierarchy. The results showed that RLOC is able to find near-optimal control solutions from a single learning trial, suitable for operating on the entire continuous state space of the task domain (see Figures 4.5, 4.7). This means that with a reduced number of low-level controllers (robotic arm na = 10, cart pole na = 11), the RLOC policy is able to either match or outperform the stand-alone state-of-the-art policies (see Figures 4.7a, 4.8a). It has to be highlighted that the iLQR algorithm requires an iterative optimisation of the entire control trajectory (and its complete re-calculation for different start-end point pairs), thus requiring as many controllers as there are time steps. RLOC reduces this number by a factor of 10^2 and finds it sufficient. It also provides a reduction of the discretisation of state space by a factor of approximately 1.5 ∗ 10^4, which would otherwise need to be explored using monolithic RL implementations. Interestingly, the learnt RLOC policy, π(s), which maps states to actions and thus combines low-level controllers for global use, was found to contain both LQR and PID controllers for the cart pole task (but not the robotic arm task), indicating that the RLOC algorithm design allows it to effectively choose between the varied low-level controllers as most suitable for the task. This is one of the strong aspects of RLOC, meaning that other controllers can be easily plugged in and tested when searching for a global control solution, enabled by the smart high-level learning. As an engineering solution, RLOC allows the use of different types of RL algorithms: discrete vs. continuous state, model-free vs. model-based approaches. Here one of the standard discrete state-action algorithms, Monte-Carlo RL, was used for illustration purposes and without limiting the generality of RLOC. It acts more reliably in the semi-Markovian environments which result from high-level abstractions of the hierarchical organisation of RLOC. Biologically speaking, RLOC can implement, at the symbolic level, both the class of model-free and model-based RL algorithms. Indeed, recent fMRI studies, using episodic multi-step decision tasks, found that the ventral striatum supported an integrated model-free and model-based computational process [64].
In accordance with these findings it is straightforward to combine RLOC's symbolic level with model-based RL. This forms a novel and flexible approach to solving nonlinear control problems, as demonstrated on two standard benchmarks. Overall, the basic building blocks of RLOC are shown to provide near-optimal control solutions using a possibly heterogeneous set of low-level controllers when dealing with nonlinear systems with known dynamics. Real world tasks, however, often have dynamics which are too complicated to be written down or may altogether be unknown a priori, which motivates an extension of the RLOC algorithm to learning a representation of the task during online interaction with the unknown environment; this forms the topic of the next chapter.

Chapter 5 Reinforcement Learning Optimal Control for Unknown Dynamical Systems

"Never trust a computer you can't throw out of a window." — Steve Wozniak

The previous chapter showed that the RLOC algorithm can be successfully applied to solving nonlinear control systems with known dynamics, where it was able to meet/exceed the optimal control solution of the two benchmark tasks when compared to state-of-the-art engineering nonlinear control algorithms. This chapter is based on the neurobiological motivation provided in Subsection 3.1.2 and focuses on extending the RLOC algorithm to systems with unknown dynamics, which in this adapted framework are learnt at the level of actuation by using direct interaction and model inference on the unknown environment. It aims to provide a framework to address one of the key questions in neuroscience: how to bridge the gap between high-level symbolic action selection and continuous low-level actuator control for highly redundant nonlinear systems¹.

¹ The work in this chapter was presented at the International Cognitive Robotics Workshop (ICRW, 2014) (award for best oral presentation and an invitation for a publication in the Journal of Artificial Intelligence), awarded the 'Qualcomm Travel Award' COSYNE 2013, the 'Distinguished Poster Award' – Imperial Symposium 2014, 3rd place – Google Poster Competition 2014, and is currently in the process of being submitted to a journal publication.

5.1 Introduction

The architectural foundations of the RLOC algorithm (here with an extension to unknown system dynamics) are as strongly motivated by neuroscience as by the engineering considerations presented in Chapter 3. RLOC uses computations on two levels of abstraction, which involves (a) representing tasks symbolically on the high-level and linking them to (b) low-level locally optimal feedback controllers, responsible for direct operation on task kinematics and for producing continuous smooth control execution. The former (point (a)) is strongly motivated by a number of research findings, such as evidence from symbolic decision making [54–56]. The sequencing of low-level actions (i.e. sequential behaviour) is suggested to require higher-level task context representations, which should result in a hierarchical structure made up of nested sub-routines [41]. The work carried out on action selection and action sequencing showed that recordings from the striatum point to chunking of representations of motor and cognitive action sequences, which may later be implemented as single performance units [103, 126]. Thus the basal ganglia are believed to play a central role in learning and influencing motor / cognitive pattern generation.
For example, recent work in rodents shows that the basal ganglia can implement whole action sequences (lever presses) as single performance units [126]. On the level of motor control, imitation and social cognition, the generation and perception of structured action sequences, composed of some form of "motor primitives" [63, 98, 185], is considered of prime importance [49, 284]. Indeed, grammar-like rules have been proposed as neurocognitive mechanisms for the sequencing of behaviour, and were shown to have provided a foundation for the evolution of language [201], with Palaeolithic stone toolmaking indicating early expressions of this evolving capacity [85, 116, 237]. However, it remains unclear how these high-level sequences are learnt biologically and how the high-level representations are linked to the low-level learning of the controller commands. This leaves an open topic of research, which RLOC hopes to address from a computational perspective, proposing a plausible proof-of-principle hierarchical solution. The latter (point (b)) rests on the research evidence that the brain generates high-level motor commands which are mapped to movement, as demonstrated by the correlated muscle activation ("muscle synergies", "motor primitives" [31, 63], see Appendix C.1) and limb movement patterns from which motion primitives are derived and found to generalise across subjects [256]. These distinct functional modules are suggested to be organised at the level of the spinal cord, where upon activation the synergies are linearly combined to produce a wide range of motor behaviours [63, 98, 185]; as such, these comprise low-level building blocks in a modular organisation of motor control [111]. Evidence also exists in support of individual primary motor cortex (M1) neurons encoding individual muscle synergies [115], which may be activated by higher level inputs to control a particular synergy. These activity-dependent correlation patterns appear to be consistent with movements being orchestrated by simple feedback controllers, as proposed by optimal feedback control (OFC) theory for optimal motor control [72, 186, 222]. While the majority of complex feedback is processed through transcortical feedback pathways, the spinal cord represents the first level of feedback processing [223]. The OFC theory assumes that motor control incurs costs that the motor system aims to minimise while achieving a task goal, and has been shown to be very successful in predicting entire trajectories and correlation structures in both reaching movements and the manipulation of complex objects [72, 186, 222]. This is consistent with placing the locally optimal linear controllers on the low-level of the RLOC hierarchy. A question which naturally follows from this organisation is: how are such low-level controllers learnt from online interaction with the unknown object/world? This question represents a strong topic of interest with a large body of research. The research by Shadmehr et al. shows that the brain may learn the dynamics of reaching movements through a combination of motor primitives with Gaussian-like tuning curves, resembling those of Purkinje cells located in the cerebellum [257]. Most recently, evidence has emerged suggesting that our brain learns optimal control for object manipulation through system identification of the dynamics of the arm and object [246]. This mechanism appears to predict trial-to-trial changes in reaching trajectories during learning.
It was also posited that the brain can perform system identification of the local system dynamics through gradient-descent of the movement error (predicted vs. actual trajectory) over the parameters of the arm and object dynamics [246]. While RLOC could employ such a method, a different system identification method was chosen, based on an Expectation-Maximisation algorithm (Linear Dynamical Systems with Inputs, LDSi [95]), which is more data efficient in the nonlinear optimal control scenarios employed here. Thus RLOC uses the LDSi numerical algorithm in order to infer the task/world online (local linear models M^(i), i = 1, ..., na), enabling learning at the actuator level by employing unstructured interactions resembling "motor-babbling" in babies [47, 192]. Correspondingly, RLOC implements low-level linear optimal feedback controllers that are learned through local system identification, which is consistent with the latest research [246]. The crucial computational aspect that enables RLOC to learn nonlinear control tasks using both high-level and low-level control in an integrated manner is that it uses the low-level controllers' cost function (quadratic in state and control) to derive endogenous (internally generated) reinforcement signals for learning the correct sequence of high-level actions. This enables RLOC to simultaneously learn both levels of control while operating on two very different levels of representation: a high-level abstract ("cognitive") action domain representing entire trajectories of movement, and low-level actuation with continuous motor commands.

5.2 Methods

The hierarchical architecture of the RLOC algorithm is described in Section 3.3. In this chapter the low-level controllers are restricted to Linear Quadratic Regulators only (na = nlqr) in order to accurately reflect the neurobiological research. The controllers are formed from the inferred linear models (M^(i)), and the variability of RLOC's performance vs. the number of actuator level controllers is investigated for the benchmark tasks, using repeated trials with the controller placements randomised in state space across trials.

Linear Local Models The feedback gain matrices L∞^(i) ('low-level controllers') are derived from a collection of approximate linear models of the system M^(i) = (ζ^(i), Â^(i), B̂^(i), Σ̂w^(i)), i = 1, ..., na, where ζ^(i) is the "linearisation centre", which is a point in state space around which the nonlinear dynamics are linearly approximated and which indicates the neighbourhood in which the approximation is expected to be most accurate, Â^(i) ∈ R^{ℓ×ℓ} is the estimated system matrix, B̂^(i) ∈ R^{ℓ×m} is the estimated control gain matrix and Σ̂w^(i) is the estimated state covariance noise. The initialisation of the low-level actions is performed using random placements of the linearisation points in state space. The local linear controllers differ from one another based on the nonlinearities of the dynamics around which the linearisation centres are obtained. Many learning trials are performed in order to account for the variability resulting from the random controller placements. In order to approximate the nonlinear dynamics, i.e. to obtain each Â^(i) and B̂^(i), the RLOC model interacts with the system using naive control U ∈ R^{m×(nK−1)×nc}, which is a stack of nc = 3^m naive control trajectories ū^(j,0) ∈ R^{m×(nK−1)},
j = 1, ..., nc, where m is the dimensionality of the actuator control vector at a single time step k and (0) indicates a 'naive' solution. Each naive control trajectory ū^(j,0) is applied to each naive starting state s̃^(i), i = 1, ..., n_s̃, where n_s̃ is the total number of naive starting states, discretised on a grid, producing state experience X ∈ R^{ℓ×nK×nN}, which is a stack of the nN = nc · n_s̃ naive state trajectories collected, x̄^(l,0) ∈ R^{ℓ×nK}, l = 1, ..., nN, where ℓ is the dimensionality of the actuator state vector at a single time step k. The state experience X is paired with the corresponding control trajectories and stored as a tensor Y ∈ R^{(ℓ+m)×(nK−1)×nN}. The collected trajectories Y are sub-sampled to give only the experience local to each linearisation centre ζ^(i), i = 1, ..., na, defined by boundaries on each element of the state using tolerances δθ, δθ̇ (e.g. the sub-sampling of trajectories of length h = 7 time steps for the cart pole task is shown in Figure 5.1). This produces nH sub-trajectories {(x̄0_{k:k+h}, ū0_{k:k+h})^(i,p)}, p = 1, ..., nH, for each linearisation centre ζ^(i). Each linear model M^(i) is inferred using the sampled sub-trajectories and applying the LDSi algorithm [95] with the parameters given in Table 5.1. The models provide a principled basis for forming the continuous controllers, using standard infinite horizon LQR theory; hence each high-level action a^(i) is linked to a low-level linear feedback controller L∞^(i) corresponding to a learnt linear model M^(i), i = 1, ..., na. The pseudo-algorithm for the RLOC framework employed in this chapter is given in Algorithm 4 (note only step 2 differs from Algorithm 3). The RLOC algorithm operates on formally defined mappings of the symbolic space: the control mapping

ψ : A(s) → R^m, where ψ(a^(i)) → u(k) for i = 1, ..., na   (5.2.1)

and a feature mapping between an actuator state x(k) and a symbolic state s ∈ S,

φ : R^ℓ → S, where φ(x(k)) → s^(i) for i = 1, ..., ns   (5.2.2)

An RL agent is used to learn the best action choice at each symbolic state to minimise the cost-to-go, giving a policy π(s) : S → A(s). The control and feature mappings can be combined with the learnt deterministic policy, π(s), to define the full control policy mapping ω̂ : N → R^m, which is an approximate estimate of the time-dependent optimal control policy mapping ω*,

ω*(x_k, k) ≈ ω̂(x_k, k) = ψ(x_k, π(φ(x_k)))   (5.2.3)

Algorithm 4 RLOC Algorithm
1. Perform feature mapping φ : R^ℓ → S: discretise n_ℓ ≤ ℓ chosen dimensions of the actuator state, x ∈ X, into symbolic states {s^(1), s^(2), ..., s^(ns)} to obtain an equally spaced n_ℓ-dimensional grid.
2. Perform control mapping ψ : A(s) → R^m: define a^(i), i = 1, ..., na, symbolic actions, from (X, U) trajectories collected with naive control sequences ("motor-babbling").
   1: for each controller a^(i), i = 1, ..., na, do
   2:   Randomly place linearisation centre ζ^(i) in state space X
   3:   Sample nH sub-trajectories {(x̄0_{k:k+h}, ū0_{k:k+h})^(i,p)}, p = 1, ..., nH, local to ζ^(i)
   4:   Estimate linear model M^(i) = (ζ^(i), Â^(i), B̂^(i), Σ̂w^(i)) with LDSi (unknown dynamics)
   5:   Derive infinite horizon feedback gain matrix L∞^(i) using LQR theory
   6:   Associate symbolic action a^(i) with feedback gain matrix L∞^(i)
   7: end for
3. Initialize policy π and state-action values Q^π, for all s ∈ S, a ∈ A(s).
   1: π(s, a) ← 1/|A(s)|
   2: Q^π(s, a) ← 0
   3: i(s, a) ← 0, ∀s ∈ S, a ∈ A(s)
4. Approximate best policy π(s, a) with a modified ε-Greedy On-Policy Monte Carlo RL.
   1: for epoch τ^(j), j = 1, ..., nT do
   2:   Pick a random state s ∈ S, choose its centre as initial actuator state xinit
   3:   Sample action a0 ∼ Pr(a|π, s0) and set symbolic time t = 0
   4:   Initialise control uinit from initial actuator state xinit and controller ψ(a0)
   5:   for actuator time step k = 1, ..., nK do
   6:     Update xk using previous state x_{k−1}, control u_{k−1} and system dynamics
   7:     if state changes, i.e. φ(xk) ≠ st, then
   8:       Store cost J(x̃(k), u(k)) = ½ Σ_{k=1}^{n_{st}} (x̃_k^T W x̃_k + u_k^T Z u_k) for at in st as negative r_{t+1}
   9:       Append triplet {r_{t+1}, st, at} to a list (forming the epoch path)
   10:      Increment symbolic time, t = t + 1
   11:      Update state, st = φ(xk)
   12:      Sample action at ∼ Pr(a|π, st)
   13:    end if
   14:    Update control uk from current state xk and controller ψ(at)
   15:  end for
   16:  for each (s, a) pair appearing in episode τ^(j) do
   17:    i(s, a) ← i(s, a) + 1
   18:    Q̂^π_i(s, a) ← Q̂^π_i(s, a) + α_i (R_i(s, a) − Q̂^π_i(s, a))
   19:    Update π(s, a) to ε-Greedy from Q̂^π_i
   20:  end for
   21: end for
5. Finalise policy π(s) ← argmax_a Q̂^π(s, a), ∀s ∈ S.
6. Use π(s) to control the system from any actuator state to the target.
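Step 4 of Algorithm 4 can be summarised by the following sketch of an incremental ε-greedy Monte Carlo update; the first-visit return bookkeeping and the 1/count learning rate are simplifying assumptions standing in for the modified update and the α_i schedule used in the thesis.

```python
import numpy as np

def mc_update(Q, counts, episode):
    """Incremental Monte Carlo value update (cf. Algorithm 4, step 4, lines 16-19).
    `episode` is a list of (state, action, reward) triplets collected in one epoch;
    the return R(s, a) is taken as the sum of rewards from the first visit onwards."""
    rewards = [r for (_, _, r) in episode]
    seen = set()
    for t, (s, a, _) in enumerate(episode):
        if (s, a) in seen:                 # first-visit variant (a simplification)
            continue
        seen.add((s, a))
        R = sum(rewards[t:])               # undiscounted return following (s, a)
        counts[s, a] += 1
        alpha = 1.0 / counts[s, a]         # running-average learning rate
        Q[s, a] += alpha * (R - Q[s, a])
    return Q, counts

def epsilon_greedy(Q, s, epsilon, rng=np.random):
    """Sample an action from the ε-greedy policy derived from the current Q estimates."""
    if rng.rand() < epsilon:
        return rng.randint(Q.shape[1])     # explore a non-greedy action
    return int(np.argmax(Q[s]))            # otherwise act greedily
```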
Fig. 5.1 Naive Trajectories Sub-Sampling – Cart Pole. Example depicting sampling of the collected cart-pole naive trajectories Y in order to obtain nH sub-trajectories of length h = 7, using 2 dimensions of the state trajectory x̄^(l,0), l = 1, ..., nN: the pendulum angle θ (x-axis) and pendulum angular velocity θ̇ (y-axis). The figure shows how ten sub-trajectories (nH = 10), {(x̄0_{k:k+7}, ū0_{k:k+7})^(i,p)}, p = 1, ..., nH, each of length h = 7, are sampled in the region close to the first (i = 1) linearisation centre ζ^(1) = (x0^(1) = 0, u0^(1) = 0) (red dot). The sub-sampling boundaries are expertly defined for each task (red rectangle). Sub-trajectories that fall outside of the specified boundaries are not considered (black dots), while those that fall inside are suitable and hence sub-sampled (each sub-trajectory is uniquely coloured). The sampled sub-trajectory entries do not repeat and all points of the sub-trajectories are within a bounding box centred at the linearisation centre.

5.2.1 Naive State-Control Trajectory Formation

A naive control tensor, U = (ū^(1,0), ..., ū^(nc,0))^T ∈ R^{m×(nK−1)×nc}, which contains individual trajectories ū^(j,0) ∈ R^{m×(nK−1)}, j = 1, ..., nc, of harmonically decaying controls, where m is the dimensionality of the control, nK is the total number of actuator level simulation steps of each trajectory and nc is the total number of naive control trajectories, is built using the following steps:

• Construct nc vectors u1^(j,0) ∈ R^m, j = 1, ..., nc, for the first time step k = 1, where each m-dimensional vector is built by choosing m elements from the sample points of a naive control set D = {0, 1, −1}. All permutations (with replacement [209]) are obtained by building nc = |D|^m such naive control vectors, each containing entries d_{i,j}, where d_{i,j} is drawn from D.

• Create nc matrices ū^(j,0) ∈ R^{m×(nK−1)}, j = 1, ..., nc, by replicating the chosen sample points d_{i,j} along each of the dimensions of each vector for nK − 1 steps.

• Adjust the individual control matrices ū^(j,0) ∈ R^{m×(nK−1)}, j = 1, ..., nc, to form harmonically decaying control trajectories, using the following formula for each of the dimensions

u^(j,0)_{i,k} = d_{i,j} · umax · b / (b + k − 1)   (5.2.4)

where k = 1, ..., nK − 1 indexes the time steps, b is the naive control rate of decay, umax is the upper task control bound, which remains the same for each of the control dimensions, and d_{i,j} is the chosen sample point of D (note that for a chosen (i, j) pair the sample point remains constant over k).
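The construction above maps directly onto a few lines of code. The sketch below builds the nc = 3^m harmonically decaying naive control trajectories of Equation 5.2.4 from the sample set D = {0, 1, −1}; the array shapes and names are illustrative rather than those of the original implementation.

```python
import itertools
import numpy as np

def naive_control_tensor(m, n_K, u_max, b):
    """Build the n_c = 3**m harmonically decaying naive control trajectories of Eq. 5.2.4:
    u_{i,k} = d_{i,j} * u_max * b / (b + k - 1), with d drawn from D = {0, 1, -1}."""
    D = [0.0, 1.0, -1.0]
    directions = list(itertools.product(D, repeat=m))   # all n_c = |D|**m sign patterns
    k = np.arange(1, n_K)                               # k = 1, ..., n_K - 1
    decay = u_max * b / (b + k - 1)                     # harmonic decay, equals u_max at k = 1
    U = np.array([[d_i * decay for d_i in d] for d in directions])
    return U                                            # shape (n_c, m, n_K - 1)
```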
Model Inference with LDSi System identification for a dynamical system around the linearisation centres ζ^(i), i = 1, ..., na, is performed with the following steps:

• Apply each naive control trajectory, ū^(j,0), j = 1, ..., nc (see Equation 5.2.4), to all naive start states s̃^(i), i = 1, ..., n_s̃ (positioned at equally spaced locations on a grid), and collect the naive state trajectories X = (x̄^(1,0), ..., x̄^(nN,0)) ∈ R^{ℓ×nK×nN}, where nN = nc · n_s̃ and x̄^(j,0) ∈ R^{ℓ×nK}.

• Pair the state trajectories x̄^(j,0) ∈ R^{ℓ×nK}, j = 1, ..., nN, with the corresponding controls ū^(j,0), j = 1, ..., nc (the control applied at time step k, u_k^(j,0), is paired with the resultant state obtained, x_{k+1}^(j,0)), and store all pairs as "experience" Y ∈ R^{(ℓ+m)×(nK−1)×nN}.

• Sample Y for nH non-overlapping sub-trajectories of length h, {(x̄0_{k:k+h}, ū0_{k:k+h})^(i,p)}, p = 1, ..., nH, in the region close to each linearisation centre ζ^(i), for i = 1, ..., na, defined more strictly by a tolerance on the angle and angular velocity deviations from ζ^(i) (δθ and δθ̇), such that all points of the sub-trajectories lie within a bounding box centred at the linearisation centre (see Figure 5.1).

• Use the sampled sub-trajectories to estimate the local dynamics (Â, B̂)^(i) for each linearisation centre using parameter estimation for LDSi, where x_{k+1} = I z_{k+1} + v_k and z_{k+1} = Â^(i) z_k + B̂^(i) u_k + w_k, and where the covariance matrices Σv and Σw of the v_k, w_k noise terms are estimated from data [95]. The LDSi theory is described in Subsection 3.3.3 (for parameters see Table 5.1).

Table 5.1 Parameters for Linear Model Inference.
Parameter Description (Symbol): Value [SI Unit]
LDSi number of cycles (nc_ldsi): 100 [a.u.]
LDSi likelihood tol. (tol δ_ldsi): 0.0000001 [a.u.]
Sub-trajectory tol. angle deviation from ζ^(i) (δθ): 20 [deg]
Sub-trajectory tol. angular vel. deviation from ζ^(i) (δθ̇): 120 [deg/s]
Naive control rate of decay (b): 10 [a.u.]

For a given model M^(i), it is reasonable to assume that the linear model using (Â, B̂)^(i) is an accurate approximation close to ζ^(i), and also that an LQR controller L∞^(i) gives a good control strategy close to this point, for i = 1, ..., na. Hence each local linear model M^(i) allows a principled way of deriving the feedback gain matrices L∞^(i) corresponding to the high-level actions {a^(1), a^(2), ..., a^(na)}. I predict that RLOC can learn how to re-use the low-level controllers in areas of state space other than where they were obtained. This mapping is able to be learnt by RLOC due to its hierarchical structure and the ability of the high-level decision maker to operate on an abstract domain.
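The sub-sampling step listed above (and illustrated in Figure 5.1) can be sketched as a simple bounding-box filter over the collected experience; the state indices assumed for θ and θ̇, and the non-overlapping window policy, are illustrative assumptions rather than the exact procedure used for the thesis simulations.

```python
import numpy as np

def sample_subtrajectories(xs, us, centre, h, d_theta, d_theta_dot,
                           theta_idx=2, theta_dot_idx=3, n_H=10):
    """Collect up to n_H non-overlapping sub-trajectories of length h whose states all lie
    within a bounding box around the linearisation centre (tolerances in the units of the task).
    `xs` has shape (n_K, state_dim); `us` has shape (n_K - 1, control_dim)."""
    subs, k = [], 0
    while k + h < len(us) and len(subs) < n_H:
        window = xs[k:k + h + 1]
        inside = (np.all(np.abs(window[:, theta_idx] - centre[theta_idx]) <= d_theta) and
                  np.all(np.abs(window[:, theta_dot_idx] - centre[theta_dot_idx]) <= d_theta_dot))
        if inside:
            subs.append((window.copy(), us[k:k + h].copy()))
            k += h                         # non-overlapping windows
        else:
            k += 1
    return subs
```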
5.2.2 Low-Level Controller Design

The design of the low-level Linear Quadratic Regulator (LQR) controllers is explained in detail in Subsections 3.3.3 and 4.2.1. It is important to note that the system dynamics are not made available to the agent a priori and are instead approximated from experience via local linear approximations with LDSi, while the penalty imposed on the inputs makes the controller more robust to model inaccuracies [10]. So, for each linearisation point ζ^(i), RLOC learns a local model of the dynamics (Â^(i), B̂^(i)) and uses this to calculate, off-line, the feedback gain matrix L∞^(i) (the ith controller). Each controller, L∞^(i), in turn corresponds to a high-level reinforcement learning action a^(i) (the ith action).

5.3 Results

The algorithm's effectiveness is demonstrated on the standard yet difficult benchmark nonlinear cart pole control task (see Section 2.4). For the analogous robotic arm reaching task results, see Appendix B, Section B.1. The system dynamics are not available to the algorithm a priori. Instead RLOC learns the dynamics from naive control experience on-line, while the simulated environment is being perturbed by additive white Gaussian noise. All the simulation parameters defined in Tables 4.3, 4.4, 4.5 and 5.1 are kept constant (except where explicitly stated in Table 5.2). The difference in the cart-pole state matrix entries W11 = 30, W22 = 3 is due to the requirement for the cart to stop at a pre-defined position on the track (z = 0), therefore requiring a penalty on the cart position and cart velocity; the entries W33 = 2000, W44 = 200 were increased in order to reflect the primary requirement of the task for an upright and stationary pole.

Table 5.2 Simulation Parameters – RLOC (unknown dynamics), iLQR (benchmark).
Parameter (Symbol): Robotic Arm / Cart Pole [SI Unit]
RLOC num. symbolic states (ns): 36 / 49 [a.u.]
RLOC num. LQR controllers (na): 5 / 10 [a.u.]
RLOC/iLQR state weight matrix (W): diag(30, 30, 0, 0) / diag(30, 3, 2000, 200) [a.u.]
RLOC/iLQR target state (x*): [π/2, π/2, 0, 0]^T / [0, 0, 0, 0]^T [rad, rad, rad/s, rad/s] / [m, m/s, rad, rad/s]
iLQR initial control traj. (ū0_ilqr): {−Ī, 0̄, Ī} / {−Ī, 0̄, Ī} [N m], [N]

Obtaining Benchmark Solutions The benchmark costs for the following figures were obtained by averaging the individual start state costs of attempting to reach the upright unstable equilibrium target from 100 equally spaced start positions in actuator state space (see Figure 4.2b for an example of 25 such starting positions – the centre of each state). The LQR infinity benchmark (black line) used as many controllers as starting states (i.e. 100 linearisation points, located at the centre of each start state) and is displayed as the mean cost of the deterministic control of 1 learning trial per starting position. The state-of-the-art nonlinear iterative optimal control, iLQR (green line), used as many linearisation points as the number of simulation time steps (with each control trajectory recalculated from scratch for individual start-target states), calculated recursively to achieve convergence. The algorithm was initialised with three different best-guess control force trajectories ū_init_ilqr ∈ R^{m×(nK−1)} = {−Ī, 0̄, Ī}, resulting in 3 trials per starting position, with results reported as mean ± standard error of the mean (SEM). All of the algorithms were simulated for nK total time steps, using the identical quadratic cost function of Equation 4.2.1. The state-of-the-art data efficient probabilistic policy search algorithm that expresses model uncertainty, PILCO (orange line), used 50 Gaussian basis functions to give a probabilistic policy for each state. The final policy was recorded after 5 iterations of the PILCO algorithm. The PILCO average cost for one learning trial was found by averaging the cost for controlling the cart-pole from each of the 100 starting states (repeated for 3 trials, reported as mean ± SEM).
PILCO was trained using 100 time steps (total simulation time of 5 seconds with discretisation step ∆ = 0.05). PILCO was trained using its preferred 'saturating' cost ([69]) in order to ensure the best learning outcomes and evaluated on the quadratic cost function used to train RLOC (see Equation 4.2.1) in order to enable a fair comparison.

RLOC Performance vs. Number of Controllers Plot Figure 5.2 shows the performance (average cost) of RLOC for controlling the cart pole system from 100 equally spaced start positions to a pre-defined upright, stationary target, shown as a function of the number of controllers available to the system. The costs are calculated using quadratic penalisation of the state deviation from the target and of the forces applied to the cart. This performance is compared to the standard benchmark average costs of infinite horizon LQR (black solid line), iLQR (green solid line) and PILCO (orange solid line), averaged through repeated trials while using the same dynamics (see Table 2.2). To account for the variability in learning performance resulting from the initial random placement of the controllers, the random placement and subsequent learning was repeated (referred to as one learning "trial") 500 times. The average RLOC cost for each learning trial was obtained by taking a mean of the control cost (after learning) across all start configurations of the cart-and-pendulum (100 possibilities). Following this, the per-trial cost was averaged across the 500 random initialisations of the controller placements and reported as the average cost and standard error of the mean across trials. To make the results comparable across the increasing number of controllers obtained via random placement of linearisation points, the number of controllers used in the RL agent, na = {1, ..., 10}, was grown by preserving the previously obtained linearisation locations at (na − 1) and adding controllers sequentially until the maximum number of tested linearisation points was reached (repeated for 500 trials). The RLOC data points show a smooth decrease in the average cost as the number of controllers increases. On average RLOC is able to outperform the (stochastic) state-of-the-art control system iLQR, and match/outperform the mean solution cost of the probabilistic state-of-the-art algorithm PILCO, while achieving a stable performance with 7 or more controllers for the cart-pole task.

Fig. 5.2 RLOC Performance vs. Number Of Controllers – Cart Pole. The performance (reported as average cost) of the RLOC control solution learnt as a function of individual controllers (data points), with respect to engineering state-of-the-art solutions (solid lines: PILCO – orange line, LQR – black line) and NNOC – blue data points. RLOC – red data points, with learning trial-to-trial variability displayed as mean and standard error of the mean of 500 learning trials, shows performance for a system with both high-level and low-level learning. The results show that on average RLOC outperforms LQR with 2 or more controllers, iLQR with 4 or more controllers and matches/outperforms the average cost of PILCO with 7 or more controllers, respectively. Note the SEM of the iLQR and PILCO algorithms is small, however it does exist, as pointed out by the zoomed-in sections.

High-level learning for smart controller switching is necessary, as demonstrated by plotting the average cost associated with naive nearest neighbour controller switching for task execution using
the policies obtained by RLOC for the increasing number of controllers during the 500 trials. The Naive Nearest Optimal Control (NNOC) involves switching to the nearest controller by calculating the Euclidean distance from the current symbolic state to all of the available linearisation points and picking the one with the shortest Euclidean distance, therefore providing non-smart switching between the locally linear controllers of the low-level learner. The NNOC (blue data points, reported as mean ± SEM of 500 learning trials) shows an average cost higher than that of RLOC, which proves the necessity of the high-level learning. It is worth remarking that the average cost of RLOC for 1 controller is different from the LQR∞ mean cost because RLOC uses one controller located at the target state, while LQR∞ uses linearisation points located at the centre of each starting state (i.e. 100 controllers in total). The distribution of the data was checked with the D'Agostino-Pearson K² test and was found to not be normally distributed.

Fig. 5.3 Tukey boxplot of RLOC performance for varied number of controls – Cart Pole. The performance of the RLOC control solution learnt as a function of individual controllers (data points), with respect to engineering state-of-the-art solutions (solid lines), reported using a box plot and whiskers diagram in order to use a non-parametric representation of the data. The Tukey boxplot allows one to observe the following statistics for the number of controllers a^(i), i = 1, ..., na: lower quartile (q1), median (q2, black line) and upper quartile (q3). The whiskers are calculated using the interquartile range (IQR): upper whisker data ∈ (q3, q3 + 1.5 × IQR); lower whisker data ∈ (q1, q1 − 1.5 × IQR); data outside of the whiskers is reported as outliers (black dots). RLOC's performance for solving the cart-pole task stabilises with the increasing number of controllers, and for na ≥ 6 the interquartile range and the median of the average RLOC costs are below the average cost of PILCO and iLQR (orange and green lines respectively). Note the SEM of the iLQR and PILCO algorithms is small, however it does exist, as pointed out by the zoomed-in sections.

Box-Plot and Histogram-Plots A non-parametric representation of the data (i.e. no assumption of an underlying distribution) is displayed in Figure 5.3, which analyses the average RLOC cost for controlling the cart-pole from 100 equally spaced start positions to the upright target state as a function of the number of controllers available to the system. Each controller learning trial is repeated 500 times, i.e. the same data as Figure 5.2, and the results are displayed using the Tukey boxplot, where outliers are plotted as individual points. The plot shows the interquartile range and median of the average RLOC costs (dynamics unknown a priori) to be below the average cost of the state-of-the-art algorithms PILCO (dynamics unknown a priori) and iLQR (dynamics known a priori) for na ≥ 6 controllers. The implication is that RLOC is able to outperform the PILCO state-of-the-art probabilistic algorithm. In order to investigate the percentage by which RLOC beats PILCO's performance, a histogram (see Figure 5.4) is plotted. Figure 5.4a shows that even with the random controller placement, RLOC is able to find optimal solutions for the cart-pole problem with fewer controllers and with less cost than state-of-the-art control solutions, specifically it outperforms iLQR 96% of the time and PILCO 75% of the time.
Figure 5.4b shows that even though naive switching of controls (to the nearest controller in state space using Euclidean distance) can outperform LQR for solving the cart-pole task, high-level learning – smartly assigning controllers to state space (i.e. RLOC) – is necessary.

Fig. 5.4 Histograms of RLOC Performance – Cart Pole. (a) Histogram plots of RLOC performance for varied number of controls: depicts 10 histograms, 100 bins each, for RLOC average costs when solving the CP task with a(i), i = 1, . . . , 10, LQR controllers. The average costs for the 500 trials of each controller count increasingly outperform iLQR and PILCO as the number of controllers increases, where on average for na ≥ 6 controllers the average RLOC costs outperform those of iLQR 93% and of PILCO 75% of the time. (b) Histogram plots of NNOC performance for varied number of controls: depicts 10 histograms, 100 bins each, for NNOC average costs when solving the CP task with a(i), i = 1, . . . , 10, LQR controllers. We show that the average costs for the 500 trials of each controller count do not outperform either iLQR or PILCO, being at most 6% better.

Fig. 5.5 High-Level Reinforcement Learning Agent, Learning Curve (na = 8) – Cart Pole. The number of controllers is fixed to na = 8 (as it produces a good RLOC performance) and the shape of the learning curve is investigated for the 500 trials. The plot shows the average "direct reward sample" (defined as the current full epoch's cumulative reward experienced by the algorithm, taken at a fixed 6 epoch interval during the 2000 epoch learning trial), averaged across corresponding epochs of the 500 trials and reported as mean ± SEM, colour coded in red and pastel red respectively. The benchmark algorithms PILCO (orange), iLQR (green) and LQR infinity (black) are shown as solid lines. The learning curve for the cart-pole task shows that, on average, RLOC is able to outperform the state-of-the-art iLQR algorithm after just 250 epochs of learning, following which the algorithm's performance stabilises for the remainder of the epochs. Note that the SEM of the iLQR and PILCO algorithms is small but non-zero, as indicated by the zoomed-in sections.

Agent Learning Curves

To visualise the results, RLOC was operated with 8 controllers, as the algorithm shows a stable performance for this number of actions. Figure 5.5 shows the learning curve achieved by the RLOC algorithm, plotted as a direct sample of the learning reward signal (experienced by the reinforcement learner during the 2000 epochs of learning and sampled at a 6 epoch interval), averaged over the 500 trials and reported as mean ± SEM. The plot shows that on average the algorithm starts off with a performance better than the LQR infinity (black solid line; which is consistent with the results of Figure 5.3) and is able to outperform the state-of-the-art iLQR algorithm (green solid line) after 250 epochs of learning. The reinforcement learning policy used to control the cart-pole during learning is probabilistic and hence retains a degree of exploration of less desirable actions; therefore the reported direct reward mean is just below that of PILCO, in contrast to the deterministic (greedy) policy used to obtain Figure 5.4a, where for na = 8 the RLOC policy on average outperforms PILCO 75% of the time.
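For clarity, the construction of the learning curve in Figure 5.5 amounts to sampling each trial's cumulative epoch reward at a fixed interval and averaging across trials. The sketch below illustrates this reduction under the assumption that the per-epoch cumulative rewards have already been recorded in a trials-by-epochs array (the variable names are illustrative only).

```python
import numpy as np

def learning_curve(epoch_rewards, sample_every=6):
    """Direct reward samples averaged across learning trials.

    epoch_rewards : array of shape (n_trials, n_epochs), each entry holding one
                    epoch's cumulative reward (here 500 trials x 2000 epochs).
    Returns the sampled epoch indices, the mean curve and its standard error.
    """
    n_trials, n_epochs = epoch_rewards.shape
    epochs = np.arange(0, n_epochs, sample_every)      # fixed 6-epoch sampling interval
    samples = epoch_rewards[:, epochs]                 # direct reward samples per trial
    mean = samples.mean(axis=0)                        # average across the trials
    sem = samples.std(axis=0, ddof=1) / np.sqrt(n_trials)
    return epochs, mean, sem
```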
Fig. 5.6 Value Function – Cart Pole. An example RLOC value function (cost-to-go) plot depicting the state-space to control mapping, reflecting the algorithm's learnt choice of actions when a particular area of space is encountered during task execution. The plot demonstrates the cost of the learnt deterministic control for starting in any area of state space (discretised to 10^6 states, learning for nT = 2000 epochs, using na = 8 actions labelled A - H, any of which can be assigned by the high-level policy to any of the ns = 49 available symbolic states) and controlling the task to the unstable upright equilibrium end-point, with warmer areas indicating a higher incurred cost. The action-state space assignment learnt by the high-level policy is shown with letters superimposed onto the area of state space where the action is active, while the shape of the displayed cost function is dictated by the underlying dynamics of the task. The value function largely displays symmetry, reflecting the periodic nature of the task; however, the learnt high-level RLOC state-action mapping (and hence the displayed cost-to-go) is not exactly symmetrical due to the nature of the abstract learning for the global solution.

Figures 5.6 and 5.7 show the evaluation of a learnt policy for the cart-pole task using 8 controllers (labelled A - H) from 100 equally spaced starting states. For the purposes of demonstration, a random final policy using 8 actions was selected from the 500 learnt policies of the executed trials. Every time the system enters a new symbolic state, the high-level policy dictates the action, corresponding to an underlying LQR∞ controller, which should be engaged in that state, thus generating sequences of actions as the cart-pole is guided to its upright target.

Value Function

Figure 5.6 displays the state space plot of the final value function, expressed using the costs for performing the control task from 10^6 start positions for nT = 2000 MC epochs, a ∆ = 0.01 time step and a T = 10 second simulation of the cart-pole system. The colormap indicates the difficulty of the task with respect to its dynamics, while the superimposed letters refer to the low-level continuous control policy mapping from states to actions. The cart-pole value function plot corresponds well to the optimal value function obtained using a Dynamic Programming method that has full knowledge of the dynamics (see Fig. 2a, M. Deisenroth et al. [70]) and exhibits the expected discontinuity near the central diagonal band. The band is given by those states from which a small cost is incurred for bringing the pendulum up to the unstable equilibrium target without dropping it first. However, once the starting angle position goes beyond a critical point, a higher cost is incurred for dropping the pendulum (regardless of the force applied to the cart) and then swinging it back up.

Footnote 2: The linearisation centre for each controller of the example policy has the general form ζ(i) = [0, 0, θ(i), 0]^T, and in the specific example policy used in the above figure the pendulum angle had the following randomly selected entries θ(i) = {0, −155°, −109°, −86°, −10°, 86°, −160°, 155°}, for controllers i = {1, . . . , 8} respectively.

Footnote 3: The solution of the Hamilton-Jacobi-Bellman equation defines the optimal value function, which maps the task state to the lowest achievable cost of reaching the task goal. The optimal policy follows the gradient of the optimal value function.
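Footnote 3 refers to the Hamilton-Jacobi-Bellman characterisation of the optimal value function; for reference, a standard discrete-time statement of this relation (the Bellman equation, quoted here in generic textbook notation rather than the thesis's own symbols) is

\[
V^*(\mathbf{x}) \;=\; \min_{\mathbf{u}} \Big[ \ell(\mathbf{x},\mathbf{u}) + V^*\!\big(f(\mathbf{x},\mathbf{u})\big) \Big],
\qquad
\pi^*(\mathbf{x}) \;=\; \arg\min_{\mathbf{u}} \Big[ \ell(\mathbf{x},\mathbf{u}) + V^*\!\big(f(\mathbf{x},\mathbf{u})\big) \Big],
\]

where ℓ(x, u) is the instantaneous (here quadratic) cost, f the state transition dynamics and V* the cost-to-go plotted in Figure 5.6.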
The cart-pole is an underactuated system, which means that the pole cannot be directly controlled to achieve arbitrary trajectories; thus for the majority of the state space enough momentum needs to be built up first in the swing-up stage before stabilisation can begin (while both of these phases accumulate cost).

Fig. 5.7 Action Sequences of an Example Learnt Policy – Cart Pole. The action sequences obtained during control of the cart-pole to the upright target from 100 start states, while using the same policy as used to obtain the Value Function plot. Each row within the sub-plot indicates an action (labelled A-H) which, if chosen, is marked with a black dot. Each column corresponds to the cart-pole moving to a new symbolic state, where a new action (which may be the same) is chosen. Note that the number of switches depends on the cart-pole trajectory through state space, since the policy is allowed to engage a new controller once the continuous actuator state enters a new symbolic state. The figure shows that RLOC's high-level policy dictates the linking of actions into sequences and demonstrates that the algorithm is able not only to reuse low-level actions, but also to learn to reuse whole concatenated action sequences for solving the task. Mostly, topologically conjunct regions of state space produce similar patterns of activation, which are clearly correlated with regions of the value function, while the switching in the action patterns can be seen to occur as the value function abruptly changes. Observing these patterns of higher-level action organisation, and using an analogy to language sequences with similar structure (but different specific meaning), these patterns may be referred to as "action grammars". This suggests that biological learning of complex motor control tasks may take the form of concatenation of high-level controllers into complex action sequences.

Action Grammars

Figure 5.7 shows the sequences of controller actions ("action sequences") obtained during control of the cart-pole system. Each row depicts the activation time of one controller (black dot, labelled A - H). Each column corresponds to the cart-pole moving to a new symbolic state, where a new action (which may be the same) is chosen. Note that the number of switches depends on the cart-pole trajectory through state space, since the policy is allowed to engage a new controller once a new symbolic state is entered. The continuous motor control commands generated by RLOC exhibit characteristic action patterns across the state space. Regions of state space can be seen to produce similar, characteristic patterns of action sequences during task execution. Mostly, adjacent regions of state space produce similar patterns, but on the boundary between some regions abrupt changes in the patterns occur. These regions, while characterised by the sequence of actions, are clearly correlated with regions of the value function, and switching in the action patterns occurs as the value function abruptly changes. Observing these patterns of higher-level action organisation, and using an analogy to language sequences with similar structure (but different specific meaning), we may refer to the patterns as "action grammars".
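Before turning to the biological interpretation, the switching mechanism that generates these sequences can be summarised in a few lines. The sketch below is an illustration under assumptions rather than the thesis code: discretise, policy, gains and centres are hypothetical placeholders for the symbolic-state mapping, the learnt high-level policy, and the per-controller LQR gain matrices and linearisation centres, and the low-level command is written in the standard LQR feedback form.

```python
import numpy as np

def rloc_rollout(x0, policy, gains, centres, discretise, step, n_steps=1000):
    """Execute an RLOC policy: a new LQR controller may be engaged whenever the
    continuous state enters a new symbolic state; the engaged LQR produces the
    continuous control command at every time step."""
    x = np.asarray(x0, dtype=float)
    trajectory, actions = [x.copy()], []
    s = discretise(x)                            # current symbolic state
    a = policy[s]                                # controller engaged in that state
    for _ in range(n_steps):
        u = -gains[a] @ (x - centres[a])         # low-level LQR feedback command
        x = step(x, u)                           # one simulation step of the plant
        s_new = discretise(x)
        if s_new != s:                           # entering a new symbolic state ...
            s, a = s_new, policy[s_new]          # ... allows a new controller to be engaged
        trajectory.append(x.copy())
        actions.append(a)
    return np.array(trajectory), actions
```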
Indeed, grammar-like rules have been proposed as neurocognitive mechanisms for the sequencing of motor behaviour, and were simultaneously shown to have provided a foundation for the evolution of language [201], with Palaeolithic stone toolmaking representing early expressions of this evolving capacity [85, 116, 237]. The high-level policy of RLOC (located putatively in the striatum of the basal ganglia circuit of the brain) dictates the concatenation of actions into sequences generating activation patterns, which reduces the overall task complexity not only by reusing actions in different areas of state space, but also by reusing whole action sequences for controlling the task from conjunct regions of state space. Modular corticostriatal projection patterns (observed experimentally) have been suggested to enable striatal re-coding of cortically derived information as part of basal ganglia learning, allegedly enabling the striatum (whose activity patterns are modified gradually during stimulus-response learning) to chunk representations of motor action sequences into performance units [103]. Recent work by Jin et al. [126] suggested that entire action sequences are encoded as single actions within basal ganglia circuits. Hence a parallel is drawn between [126] and the persistent activation of repeatable sequences (that may be considered as single actions observed upon repeated trials) while "recording" from RLOC during movement. Therefore biological learning of complex motor control tasks may take the form of concatenation of high-level controllers into complex action sequences using known mechanisms of action selection in the basal ganglia. The hierarchical nature of RLOC arises naturally from the representation of the high-level controllers as individual chunks (which, when put together, form complex motor action sequences), since chunking in the striatum implies a hierarchical organization of learnt behaviour [103].

Trajectories

The characteristic trajectories experienced during control of the cart-pole system are evaluated with the same policy as used to obtain the final value function of the cart-pole task. An evaluation time of T = 10 sec (longer than that of the learning phase, T = 3 sec) was used to investigate the behaviour of the system beyond the learnt simulation time. This allowed observing how the system copes with prolonged task execution, showing that the RLOC policy is able to continue to control the system after the learning phase is complete, over an effectively infinite horizon, which could include stabilizing the system following unexpected perturbations. This is in contrast to the state-of-the-art iLQR technique, which calculates the required control for a specific start-end state combination and for a pre-specified number of time steps only. The resulting trajectories are displayed in Figure 5.8.

Fig. 5.8 Example State Trajectories – Cart Pole. The figure depicts 100 example state trajectories in the state space of the cart-pole following the completion of a single learning trial. The learnt policy is able to successfully control the given task to the target (upright and stationary pendulum and the cart stopping in a pre-defined position on the track, z = 0) from 100 equally spaced starting positions (solid circles) located on a grid for comprehensiveness of admissible state space testing.
The area of space containing the target is shown in a zoomed box section and demonstrates that the RLOC solution provides a policy which controls the cart-pole to the target using smooth, precise, near-optimal control. The RLOC algorithm successfully stabilizes the pendulum (upright and stationary) on the cart, stopping in a pre-defined position on the track, from all starting states.

The displayed trajectories demonstrate that the RLOC learnt policy is successfully able to control the task to the pre-determined target (shown in the zoomed panel) from any area of the state space (tested on a grid with 100 start states denoted by solid circles of different colour). Figure 5.8 shows the periodic cart-pole task represented on a flat state space surface. The plot reflects the nature of the periodic cart-pole task, showing a discontinuity at the stable equilibrium of the downward pole position, which means that visually single trajectories are sliced at that point, even though they are continuous (−π is equivalent to +π). Most start states can be seen to require the pendulum to build up enough momentum first, before being swung up towards the unstable upright equilibrium, which is consistent with the nature of the underactuated cart-pole task. The RLOC algorithm successfully stabilizes the pendulum in an upright and stationary position, with the cart stopping in a predefined position on the track, for all 100 starting states of the symbolic space, as shown in the zoomed-in section containing the location of the target. Even though the target state requires both the pole and the cart to be stationary, meaning that the cart and the pendulum velocity is driven to zero by the high-level control, some degree of movement is required to be retained in order to actively balance the pole in its unstable equilibrium (as seen by the spiral trajectories near the target area). The ability of RLOC to control a chosen system from any area of state space is highly advantageous, since for most control algorithms (e.g. iLQR) the control sequences must be recalculated for different start-end points, while RLOC defines a single policy after just one complete learning trial covering the entire state space.

5.4 Discussion

The human and animal motor system has been optimised by evolution over millions of years, producing a central nervous system with a remarkable ability to learn to manipulate redundant sets of non-linear actuators (joints, muscles, limbs) with very little data, when compared to engineering systems. This proves highly motivating when designing state-of-the-art nonlinear control algorithms, and researchers presently tend to look to the control solutions of biological systems for inspiration. Alongside engineering considerations, this motivated a novel neurobiologically inspired algorithm, RLOC, which was developed in this chapter to form a viable proof-of-principle for combining two key effects of cognitive and motor learning. The first is high-level symbolic action selection based on a state-dependent policy (which could be learnt in the basal ganglia), generating persistent action sequences ("activation patterns"); the second is low-level continuous action execution represented in the form of locally optimal linear feedback controllers (Linear Quadratic Regulators (LQRs) with costs quadratic in state and input).
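For reference, the locally optimal linear feedback controllers referred to here take the standard discrete-time infinite-horizon LQR form (stated in textbook notation for illustration, not quoted from the earlier chapters): for local linear dynamics x_{t+1} = A x_t + B u_t and a cost quadratic in state and input,

\[
J \;=\; \sum_{t=0}^{\infty} \left( \mathbf{x}_t^{\top} Q\, \mathbf{x}_t + \mathbf{u}_t^{\top} R\, \mathbf{u}_t \right),
\qquad
\mathbf{u}_t = -L\, \mathbf{x}_t,
\qquad
L = (R + B^{\top} P B)^{-1} B^{\top} P A,
\]

where P solves the discrete algebraic Riccati equation
\[
P = Q + A^{\top} P A - A^{\top} P B\,(R + B^{\top} P B)^{-1} B^{\top} P A .
\]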
The biological analogy of these low-level control signals is the generation of muscle activation patterns, which could be implemented in the motor execution related parts of the motor cortex and the spinal cord [185]. Thus this chapter presented RLOC as an integrative model for how the brain can perform nonlinear control and learning, one that spans both the 'cognitive' domain – the symbolic domain of "abstract" actions: "grab cup", "pour coffee", "spoon sugar" – and the low-level musculoskeletal level, coordinating muscle activations. In this view, employing specific linear optimal feedback controllers as low-level actions is biologically both plausible and feasible.

Areas of Caution and Consideration

Some research suggests that humans do not behave optimally [67, 90, 117, 138, 168] and criticises OFC theory because it 'cannot be disproved': a cost function can always be fitted to behaviour [223] (see Appendix C.1). Equally, the OFC framework assumes full knowledge of the dynamics, which is not the case for the CNS; hence it represents a coarse framework and should not be taken literally. Other considerations are the difficulties in merging adaptation to novel contexts with simultaneous updates in task parameters and in the high-level control policy. Even though in theory any feedback controller can be implemented at the low level of actuation, caution must be exercised for cases where 'bad' controls can lead to 'terminal' outcomes (e.g. fatality, breakage of equipment). A solution could be to implement a model-based high-level policy, which allows partial off-line learning by utilising simulations of a built representation of the world. RLOC should also be extended to handle stochastic systems and higher dimensional tasks, in order to better deal with real life problems; this can be done by utilising the Kalman filter (iLQG) at the low level of actuation, and by moving RL to a continuous state/action domain [42], respectively. Another area of improvement could be optimising the linearisation centre placements in the actuator state space. An interesting avenue would be to investigate the emergence of the observed action grammars. While RLOC in its current state does not explicitly exploit the emerging "action grammars" for learning, it does provide a first model of how such task abstraction may arise during structure learning.

Related Work

RLOC's hierarchical framework differs conceptually from other hierarchical or modular approaches to control. Todorov et al. [264] proposed a hierarchical optimal control framework for the control of nonlinear redundant systems (where the dynamics are known a priori). The architecture of the algorithm is inspired by the hierarchical divide-and-conquer strategies that seem to be implemented in the nervous system, with the general idea that in sensorimotor control tasks the brain monitors a small number of task parameters and generates abstract commands (high-level control), which are mapped to muscle activations using motor synergies (low-level control). Based on this, the authors proposed a low-level controller which receives rich information about the high-dimensional plant, which is then compressed to generate a more abstract representation of the state. The high-level controller monitors the task progress while operating on this reduced state and issues commands on how it should change. The job of the low-level controller is to compute efficient controls which are consistent with the high-level command.
Hence the authors designed an algorithm that allows the high-level control to operate on a plant whose dimensionality is reduced. It is worth noting, however, that the function performing the mapping from the high-dimensional state to the low-dimensional state was expertly provided by the authors and was specific to the investigation of the 6 muscle planar arm model. Another seminal framework, Modular Selection and Identification for Control (MOSAIC) [113], addresses the accurate generation of motor control by creating a modular architecture where all modules simultaneously contribute to learning a combined, weighted feedforward motor command. The weights are derived from the predictions of an internal forward model and the actual and desired states. After this modular weighted combination, however, the motor command is combined with an unspecified feedback controller to learn object tracking tasks [113]. MOSAIC uses supervised learning in order to understand how each module contributes to the overall motor command, with the algorithm having a flat (non-hierarchical) structure across the n paired units producing pure motor control commands. RLOC, on the other hand, operates at a different level of representation and utilises a hierarchical approach: it uses a reinforcement learning agent as a high-level decision maker for action selection and action sequencing, which allows abstraction and dimensionality reduction as compared to pure low-level motor control. The ability of RLOC to learn to control actions at the lower controller level and the higher symbolic level enables it to span both conceptual and experimental findings arising in motor neuroscience and cognitive neuroscience. RLOC high-level controllers can be thought of as abstract controllers that can generate specific series of high-level actions towards the completion of a complex task goal (as highlighted in Figure 3.1). The solution of a complex task by means of individual action symbols (here, the choice of high-level controllers) is in this case implemented by acting out the state-action policy. This chapter shows how this results in visibly striking invariant patterns of controller activations, which abstract away from the specific initial conditions of the task. The control actions tend to group into 5-6 different generic patterns, which follow overall rules but show local deviations, and may be considered as reflecting a type of "action grammar". The invariant patterns emerging from RLOC may explain how motor task abstraction may emerge during motor learning. It was recently found that humans can perform structure learning, i.e. learn higher-order representations that capture features invariant to task-specific details, but instead reflect generic features of the task [43]. In structural learning tasks it was found that subjects were faster at learning novel motor tasks if these followed a common underlying structure. While RLOC in its current state does not explicitly exploit the emerging "action grammars" for learning, it does provide a first model of how such task abstraction may arise during structure learning. Recent work in rodents suggests that action sequences are represented as single units in the basal ganglia [126], in analogy to how "recording" from RLOC representations would show persistent and constant activation of high-level actions throughout a movement sequence.
Thus the RLOC framework presented in this chapter suggests how, biologically, the learning and representation of more complex motor control tasks can be achieved using the known mechanisms of action selection in the basal ganglia. More specifically, the cognitive control of action sequencing may well be learnt and implemented by the very infrastructure that learns complex motor control tasks. The RLOC algorithm therefore offers a novel way of learning an important class of nonlinear control problems. Our proof-of-principle RLOC framework demonstrates how it may be used for solving harder nonlinear problems where iterative linearisation algorithms may fail to converge, or where the curse of dimensionality makes monolithic learning difficult. The RLOC framework's performance was shown to rival or beat that of other nonlinear learning policies for controlling the benchmark problems, whether those policies were determined by function approximating approaches using probabilistic inference, by control theoretic algorithms with full knowledge of the system dynamics, or by computationally intensive monolithic RL approaches. This work offers an extended understanding from both the control/machine learning and the biological points of view. In the former it offers a sound framework for solving nonlinear control tasks with unknown dynamics, while in the latter it allows for a bridge across the divide between the understanding of low-level actuation control and high-level symbolic decision making. This chapter extended the RLOC algorithm presented in Chapter 4 to the control of nonlinear systems with unknown dynamics and used the neurobiological considerations presented in Chapter 3. RLOC relied on locally optimal linear controllers to operate at the low level of abstraction, as well as drawing a parallel to 'motor babbling' in babies and using on-line model inference from unstructured interactions with the environment for system identification. The next chapter will summarise the work carried out in this thesis, review the outcomes of RLOC's learning, provide an overview of advantages and disadvantages, discuss relevant literature, and provide ideas on further algorithm development.

Chapter 6

General Discussion

Many real world control problems have unknown dynamics, which makes it difficult to apply standard control engineering solutions. Moreover, typical control problems require efficient solutions, i.e. those that achieve a goal while minimising a cost, thus requiring a form of optimality in their control which is difficult to compute for nonlinear task domains. Hence, optimal or near-optimal solutions to nonlinear control problems with unknown dynamics remain an open challenge in the control, robotics, and machine learning fields [44, 134, 263]. This thesis proposed a framework motivated by both engineering and neurobiological considerations with the view of finding control solutions to systems with nonlinear task dynamics, which may be either known a priori or entirely unknown. The general design of the RLOC framework was introduced in Chapter 3. From a purely engineering perspective it was motivated by the use of abstractions, state dimensionality reduction and the reduction in the number of controllers required, while looking to utilise the best of both worlds of Reinforcement Learning (RL) and Optimal Control (OC). All of this naturally suggested a hierarchical architecture, which allows the implementation of divide-and-conquer techniques for a global control solution of reduced complexity, naturally suited to RL.
From the biological aspect of the research, RLOC was motivated by optimal control theory [223], the discovery of motor primitives at the level of the spinal cord [185], and the existence of optimal feedback pathways in the transcortical circuits [223], all pointing to the suitability of optimal feedback controllers at the low level of actuation. Equally, it was motivated by symbolic/hierarchical action selection and representation within the brain [41] and the implementation of both model-based [64] and model-free [62] RL algorithms inside the brain [65], all pointing to the suitability of using an RL agent at the high level of the RLOC hierarchy. Therefore, RLOC was chosen to be built in a hierarchical way, utilising a high-level RL agent and low-level feedback controllers. It is shown in the following that the hierarchical combination of a high-level action reinforcement learner with continuous low-level optimal controllers meets the goals and produces the expected outcomes of abstraction and task complexity reduction. RLOC does indeed allow a significant reduction in the number of controllers typically necessary as compared to the non-learning iterative optimal control algorithm iLQR, by a factor of approx. 10^2 linearised controllers. Figure 5.2 shows that on average 4 controllers are sufficient for RLOC to solve the nonlinear optimal control problem of the cart-pole upswing with the same final cost as the 300 controller equivalents (recalculated iteratively) used by iLQR. On the other hand, RLOC also reduces the overall state complexity as compared to stand-alone reinforcement learning methods (operating without approximations) by a factor of approx. 10^3: for the cart-pole task RLOC requires approx. 2 × 10^2 state-action pairs to be explored vs. 2 × 10^5 as required by a flat RL algorithm. The need for high-level learning is demonstrated by observing the accumulated costs of naive controller switching (no smart learning), where the system automatically switches to the nearest linear optimal controller as it moves through state space. Figure 5.2 shows that high-level action selection achieves an over 50% lower average cost than the naive solution when solving the cart-pole task. An exciting feature of the RLOC architecture is that it allowed an investigation of the use of heterogeneous controllers (a mixture of PIDs and LQRs) at the low level of actuation in Chapter 4, while solving nonlinear tasks with known dynamics. This is permitted by the ability of RLOC to learn the best possible state-action mapping for global use, which means that a low-level controller will not be engaged in any area of state space for which it was deemed to be suboptimal by the high-level policy. The results (Figure 4.7a) show that such a heterogeneous combination was able to achieve optimal control solutions while utilising both types of controllers, when solving the cart-pole task and comparing RLOC costs to the iLQR benchmark. Hence, as an engineering solution, the RLOC algorithm is not limited to a particular type of low-level controller. The same is true for the high-level reinforcement learner; indeed, recent studies showed both model-free and model-based computational processes in the basal ganglia [64], meaning that RLOC could potentially build models of the world and supplement its exploratory experience using simulation, and thus without the need for interaction (model-based RL). This means that a future avenue of work could focus on developing this aspect of RLOC's high-level learner
and utilise model-based RL, especially for tasks where interaction with the world may be costly (e.g. patient drug administration). Meanwhile, Chapter 5 further developed the RLOC algorithm to allow learning, at the low level of actuation, the unknown nonlinear dynamics of the system in order to infer the local linear models used for building the low-level LQR controllers. Here I focused on investigating the behaviour of RLOC when designed with purely neurobiological considerations. More specifically, RLOC aimed to answer the question of whether it is possible to design a proof-of-principle framework for bridging the divide between low-level actuation control and high-level symbolic decision making (see Figure 3.2), one of the key questions in neuroscience: combining two key effects of cognitive and motor learning. The results have provided many promising aspects. Firstly, the ability of RLOC to learn to control and chain actions (see Figure 5.7) on both the high/symbolic level and the lower linear optimal controller level addresses the experimental and conceptual findings researched in motor neuroscience and cognitive neuroscience.

Footnote 1: In my personal opinion, model-based RL (in its highly simplified version) may correspond to the dreaming phase, in which the brain enters and is able to simulate any kind of artificial environment it has previously encountered, carry out actions and learn from the observed experience, all done off-line, i.e. with no interaction with the world. For me this draws a parallel between model-based RL's off-line exploration and dreaming, as utilised by the brain.

Footnote 2: Motor neuroscience concerns the representation of actions in the brain, i.e. how motor learning occurs in terms of the low-level continuous control of actuators.

Moreover, the cart-pole results presented in this chapter show that on average the RLOC algorithm (no prior knowledge of the nonlinear task dynamics, random controller placements) outperforms the state-of-the-art probabilistic control system PILCO (no prior knowledge of the dynamics) 75% of the time, and the state-of-the-art deterministic control algorithm iLQR (full prior knowledge of the dynamics) 96% of the time, when using 7 or more controllers (see Figure 5.4a). I would now like to turn attention to existing work carried out by other researchers which resembles RLOC from the engineering perspective of combining reinforcement learning with optimal controllers. Firstly, the majority of the body of work in this area used full knowledge of the dynamics and expertly built controllers, therefore not truly learning the problem from the bottom up (as RLOC does in Chapter 5). It is also not clear that these approaches would generalize well to other non-linear tasks without expertly reconstructing controllers, due to being fine-tuned to a particular problem. For example, Radlov et al. proposed combining an RL agent with a single handcrafted LQR controller [203], which was activated once the agent drove the system to a predefined state positioned near to the target, hence fixing the switching condition in advance. The LQR controller was expertly located in this area, which guaranteed stability at the target of the double-link pendulum (with a motor at each joint), allowing the system to learn faster than when using a stand-alone agent.
This approach has been extended by work such as that of Yoshimoto et al., who used an RL agent to learn how to best combine controllers previously devised for controlling the double-link pendulum task with known dynamics in different areas of state space for global use [286]. The authors show that a rather large number of epochs is required to train the system, on the order of 400 times more than for a similar task with unknown dynamics solved with RLOC (see Figure 5.5), i.e. the acrobot vs. the cart-pole respectively. The final policy was only able to control the acrobot (using a single motor located at the joint of the two links) to the target from the stable equilibrium starting state (acrobot hanging down), which contrasts with the RLOC framework, where by design a global policy is learnt for controlling a given system from any starting configuration in the admissible state space. In addition, RLOC learns the unknown system dynamics from experience and forms locally optimal controllers on-line. RLOC learning operates at two levels of abstraction: RL for the optimal sequencing of actions and LQRs for the continuous optimal control of action execution, to achieve a composite task goal. The two levels are integrated naturally by RLOC, as the low-level control cost (quadratic in state and control) determines the reward signal generated as a result of choosing the high-level actions. This is in contrast to hierarchical RL algorithms, where the system switches to a different sub-task (action) once its sub-goal has been reached. These sub-goals either have to be specified and defined by hand or may be learnt on-line. In contrast, RLOC action selection chooses an underlying locally optimal controller and switches to another one from the top down without a predefined temporal and/or sub-goal structure, while keeping the same cost function (i.e. the same reward structure) over the entire symbolic state space.

Footnote 3: Cognitive neuroscience concerns the processes of capturing behaviour in cognitive tasks and relies strongly on theories in decision making and reinforcement learning.

Footnote 4: An acrobot system consists of two links (a double pendulum) attached at a stationary point, in analogy to a gymnast swinging on a high bar. This is an underactuated system, which can only exert torque on the second link (corresponding to the gymnast bending at the waist).

RLOC is a data efficient algorithm: it learns both the low-level dynamics as well as the high-level control, so that within 250 trials (taking 1.25 minutes of running time) it matches the performance of iLQR (see Figure 5.5), and it requires on the order of 4 × 10^2 fewer trials when compared to learning a similar "acrobot" task performed in [286]. In contrast, an algorithm using a monolithic continuous RL agent [77] was reported to have taken 1000 epochs (approx. 6 hours of computation time) to learn the benchmark cart-pole task, controlled from any area of state space, without reporting the learning curves. The RLOC solution quality is also compared to PILCO, the current state-of-the-art and the most data efficient learning control algorithm, which uses complex representations based on Gaussian Processes to leverage the uncertainty of the learnt data.
This algorithm is trained through forward modelling [69, 205] (requiring only 17.5 seconds of interaction with a physical system to learn the benchmark cart-pole problem); however, it requires intensive off-line calculations for global policy parameter optimization (a reported order of 1 hour of off-line computation time for a single start-target state learning run on the cart-pole task [205]). In comparison, RLOC has a simple representation enabling fast computation to learn nonlinear control from any start position in state space to a target position in a single learning run, while performing all the simple calculations on-line (and on current computers) much faster than real-time. RLOC's speed and simplicity in terms of computational operations and representations suggest a feasible neurobiological implementation. Comparing the computational requirements of RLOC and PILCO for solving the cart-pole task for 100 different start states of the pole, we find that RLOC takes approximately 2 minutes, while the PILCO algorithm takes approximately 10 hours, as timed on the same machine. This is in part because PILCO uses probabilistic representations of uncertainty that are expensive to compute; specifically, Gaussian Processes require an inversion of the kernel matrices, which scales unfavourably as O(n^3) with respect to the number of data points required for learning, e.g. when the task is temporally complex or has a very high dimensional state space, although approximative solutions may improve the performance. RLOC, meanwhile, benefits from using linear optimal controllers for which the feedback gain matrices are not state dependent and hence are calculated off-line once, with the most expensive operation being O(n^3), where n is the dimensionality of the state vector. The high-level learning is performed iteratively using Monte Carlo reinforcement learning, which is not as expensive (scaling as O(n^2)). This thesis proposed a computational framework, Reinforcement Learning Optimal Control, motivated by both engineering and neurobiological considerations with the view of finding control solutions to nonlinear systems with unknown dynamics. The power of RLOC stems from combining two relatively simple approaches and demonstrating that this combination proves to be far more computationally efficient and effective in terms of solution quality than previous highly engineered approaches. Simplicity is the strength of RLOC, combining simultaneous learning of symbolic and continuous control strategies, inspired by neuroscience. RLOC's applications and further development encompass bio-inspired robotics and its use as a potential model for understanding the complexity of human motor learning and control.

References

[1] Abbeel, P., Quigley, M., and Ng, A. Y. (2006). Using inaccurate models in reinforcement learning. Proceedings of the 23rd International Conference on Machine Learning, pages 1–8.
[2] Abe, N., Verma, N., Apte, C., and Schroko, R. (2004). Cross channel optimized marketing by reinforcement learning. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 767–772.
[3] Abramova, E., Kuhn, D., and Faisal, A. (2011). Combining Markov decision processes with linear optimal controllers. Imperial College Computing Student Workshop, pages 3–9.
[4] Acuna, D. and Schrater, P. R. (2009). Structure learning in human sequential decision-making. pages 1–8.
[5] Akella, R. and Kumar, P. (1986).
Optimal control of production rate in a failure prone manufacturing system. IEEE Transactions on Automatic Control, 31(2):116–126. [6] Alnajjar, F., Itkonen, M., Berenz, V., Tournier, M., Nagai, C., and Shimoda, S. (2014). Sensory synergy as environmental input integration. Frontiers in neuroscience, 8. [7] Ambühl, D., Sundström, O., Sciarretta, A., and Guzzella, L. (2010). Explicit optimal control policy and its practical application for hybrid electric powertrains. Control engineering practice, 18(12):1429–1439. [8] Amit, R. and Matarić, M. J. (2002). Parametric primitives for motor representation and control. IEEE International Conference on Robotics and Automation, 1:863–868. [9] Anderson, B. and Moore, J. (1989). Optimal control: linear quadratic methods. Prentice-Hall. [10] Anderson, B. D. and Moore, J. B. (1990). Optimal control: linear quadratic methods. Prentice Hall, New Jersey. [11] Åström, K. and Hägglund, T. (2006a). Advanced PID control. The ISA Society. [12] Åström, K. J. and Hägglund, T. (2006b). Advanced PID control. ISA-The Instrumentation, Systems, and Automation Society; Research Triangle Park, NC 27709. [13] Athans, M. et al. (1971). Special issue on linear-quadratic-gaussian problem. IEEE Transactions on Automatic Control, 16(6):847–869. [14] Atkeson, C. (1994). Using local trajectory optimizers to speed up global optimization in dynamic programming. Proceedings of the Advances in Neural Information Processing Systems, pages 663–663. [15] Atkeson, C. and Stephens, B. (2008). Random sampling of states in dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 38(4):924–929. 114 References [16] Atkeson, C. G. (1989). Learning arm kinematics and dynamics. Annual Review of Neuroscience, 12(1):157–183. [17] Atkeson, C. G. and Santamaria, J. C. (1997). A comparison of direct and model-based reinforcement learning. International Conference on Robotics and Automation. [18] Atkeson, C. G. and Schaal, S. (1997). Robot learning from demonstration. Proceedings of the International Conference on Machine Learning, 97:12–20. [19] Badre, D. (2008). Cognitive control, hierarchy, and the rostro–caudal organization of the frontal lobes. Trends in cognitive sciences, 12(5):193–200. [20] Bagnell, J. A. and Schneider, J. G. (2001). Autonomous helicopter control using reinforcement learning policy search methods. Robotics and Automation, 2001. Proceedings 2001 ICRA. IEEE International Conference on, 2:1615–1620. [21] Baird III, L. C. (1994). Reinforcement learning in continuous time: Advantage updating. IEEE International Conference on Neural Networks, 4:2448–2453. [22] Barto, A. G. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379. [23] Baxter, J., Tridgell, A., and Weaver, L. (1998). Knightcap: A chess programm that learns by combining td (lambda) with game-tree search. Proceedings of the 15th International Conference on Machine Learning, pages 28–36. [24] Bazaraa, M. S., Sherali, H. D., and Shetty, C. M. (2013). Nonlinear programming: theory and algorithms. John Wiley & Sons. [25] Bellman, R. (1954). The theory of dynamic programming. Technical report, DTIC Document. [26] Bellman, R. (1957). Dynamic programming. Princenton University Press. [27] Bellman, R., Glicksberg, I., and Gross, O. (1956). On the ’bang-bang’ control problem. Quarterly of Applied Mathematics, 14:11–18. [28] Bellman, R. and Kalaba, R. (1959). On adaptive control processes. 
IRE Transactions on Automatic Control, 4(2):1–9. [29] Bellman, R. E. and Kalaba, R. E. (1964). Selected papers on mathematical trends in control theory. Dover Publications. [30] Berniker, M., Jarc, A., Bizzi, E., and Tresch, M. C. (2009). Simplified and effective motor control based on muscle synergies to exploit musculoskeletal dynamics. Proceedings of the National Academy of Sciences, 106(18):7601–7606. [31] Bernstein, N. A. (1967). The coordination and regulation of movements. Pergamon. [32] Bertsekas, D. P. (1999). Nonlinear programming. Athena scientific Belmont. [33] Bertsekas, D. P., Bertsekas, D. P., Bertsekas, D. P., and Bertsekas, D. P. (1995). Dynamic programming and optimal control, volume 1. Athena Scientific Belmont, MA. [34] Bertsekas, D. P. and Tsitsiklis, J. N. (1995). Neuro-dynamic programming: an overview. Proceedings of the 34th IEEE Conference on Decision and Control, 1:560–564. References 115 [35] Bertsekas, D. P., Tsitsiklis, J. N., and Wu, C. (1997). Rollout algorithms for combinatorial optimization. Journal of Heuristics, 3(3):245–262. [36] Bertsimas, D. and Lo, A. W. (1998). Optimal control of execution costs. Journal of Financial Markets, 1(1):1–50. [37] Bishop, C. M. (2006). Pattern recognition and machine learning, volume 4. Springer New York. [38] Bonvik, A. M., Couch, C., and Gershwin, S. B. (1997). A comparison of production-line control mechanisms. International journal of production research, 35(3):789–804. [39] Bothe, M. K., Dickens, L., Reichel, K., Tellmann, A., Ellger, B., Westphal, M., and Faisal, A. A. (2013). The use of reinforcement learning algorithms to meet the challenges of an artificial pancreas. [40] Botvinick, M., Niv, Y., and Barto, A. (2009). Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition, 113(3):262–280. [41] Botvinick, M. M. (2008). Hierarchical models of behavior and prefrontal function. Trends in cognitive sciences, 12(5):201–208. [42] Boyan, J. A. (1999). Least-squares temporal difference learning. pages 49–56. [43] Braun, D. A., Mehring, C., and Wolpert, D. M. (2010). Structure learning in action. Behavioural brain research, 206(2):157–165. [44] Bryson, A. E. (1975). Applied optimal control: optimization, estimation and control. Hemisphere Publishing Corporation. [45] Bryson, A. E. and Denham, W. F. (1962). A steepest-ascent method for solving optimum programming problems. Journal of Applied Mechanics, 29(2):247–257. [46] Bryson Jr, A. E. (1996). Optimal control-1950 to 1985. Control Systems, IEEE, 16(3):26–33. [47] Bullock, D., Grossberg, S., and Guenther, F. H. (1993). A self-organizing neural model of motor equivalent reaching and tool use by a multijoint arm. Journal of Cognitive Neuroscience, 5(4):408–435. [48] Burridge, R. R., Rizzi, A. A., and Koditschek, D. E. (1999). Sequential composition of dynamically dexterous robot behaviors. The International Journal of Robotics Research, 18(6):534–555. [49] Byrne, R. W. and Russon, A. E. (1998). Learning by imitation: a hierarchical approach. Behavioral and Brain Sciences, 21(05):667–684. [50] Chang, S. S. (1961). Synthesis of optimum control systems. McGraw-Hill New York. [51] Chhabra, M. and Jacobs, R. A. (2006). Properties of synergies arising from a theory of optimal motor behavior. Neural computation, 18(10):2320–2342. [52] Chomsky, N. and Halle, M. (1968). The sound pattern of english. [53] Clebsch, A. (1858). Ueber die reduction der zweiten variation auf ihre einfachste form. 
Journal für die reine und angewandte Mathematik, 55:254–273. [54] Cooper, R. and Shallice, T. (2000). Contention scheduling and the control of routine activities. Cognitive neuropsychology, 17(4):297–338. 116 References [55] Cooper, R. P., Schwartz, M. F., Yule, P., and Shallice, T. (2005). The simulation of action disorganisation in complex activities of daily living. Cognitive Neuropsychology, 22(8):959–1004. [56] Cooper, R. P. and Shallice, T. (2006). Hierarchical schemas and goals in the control of sequential behavior. Psychology Review, 113:887–931. [57] Corfmat, J.-P. and Morse, A. (1976). Decentralized control of linear multivariable systems. Automatica, 12(5):479–495. [58] Costa, R. (2011). A selectionist account of de novo action learning. Current Opinion in Neurobiology, 21(4):579–586. [59] Da Silva, M., Durand, F., and Popović, J. (2009). Linear bellman combination for control of character animation. Proceedings of the ACM Transactions on Graphics, 28(3):82–92. [60] D’alessandro, D. and Dahleh, M. (2001). Optimal control of two-level quantum systems. IEEE Transactions on Automatic Control, 46(6):866–876. [61] Danna-Dos-Santos, A., Shapkova, E. Y., Shapkova, A. L., Degani, A. M., and Latash, M. L. (2009). Postural control during upper body locomotor-like movements: similar synergies based on dissimilar muscle modes. Experimental brain research, 193(4):565–579. [62] D’Ardenne, K., McClure, S. M., Nystrom, L. E., and Cohen, J. D. (2008). Bold responses reflecting dopaminergic signals in the human ventral tegmental area. Science, 319(5867):1264– 1267. [63] d’Avella, A., Saltiel, P., and Bizzi, E. (2003). Combinations of muscle synergies in the construction of a natural motor behavior. Nature neuroscience, 6(3):300–308. [64] Daw, N., Gershman, S., Seymour, B., Dayan, P., and Dolan, R. (2011). Model-based influences on humans’ choices and striatal prediction errors. Neuron, 69(6):1204–1215. [65] Dayan, P. and Daw, N. D. (2008). Decision theory, reinforcement learning, and the brain. Cognitive, Affective, & Behavioral Neuroscience, 8(4):429–453. [66] Dayan, P. and Hinton, G. E. (1993). Feudal reinforcement learning. Proceedings of the Advances in Neural Information Processing Systems, pages 271–279. [67] De Rugy, A., Loeb, G. E., and Carroll, T. J. (2012). Muscle coordination is habitual rather than optimal. The Journal of Neuroscience, 32(21):7384–7391. [68] Dean, T. and Lin, S.-H. (1995). Decomposition techniques for planning in stochastic domains. Proceedings of the International Joint Conferences on Artificial Intelligence, pages 3–49. [69] Deisenroth, M. and Rasmussen, C. E. (2011). Pilco: A model-based and data-efficient approach to policy search. Proceedings of the 28th International Conference on Machine Learning, pages 465–472. [70] Deisenroth, M. P., Rasmussen, C. E., and Peters, J. (2009). Gaussian process dynamic programming. Neurocomputing, 72(7):1508–1524. [71] Der, R. and Martius, G. (2006). From motor babbling to purposive actions: Emerging selfexploration in a dynamical systems approach to early robot development. Proceedings of the From Animals to Animats 9, pages 406–421. References 117 [72] Diedrichsen, J., Shadmehr, R., and Ivry, R. (2010). The coordination of movement: optimal feedback control and beyond. Trends in cognitive sciences, 14(1):31–39. [73] Dietterich, T. G. (1998). The maxq method for hierarchical reinforcement learning. Proceedings of the International Conference on Machine Learning, pages 118–126. [74] Dietterich, T. G. (2000). 
Hierarchical reinforcement learning with the maxq value function decomposition. Artificial Intelligence, 13:227–303. [75] Dipietro, L., Krebs, H. I., Fasoli, S. E., Volpe, B. T., Stein, J., Bever, C., and Hogan, N. (2007). Changing motor synergies in chronic stroke. Journal of neurophysiology, 98(2):757–768. [76] Dorato, P. and Levis, A. (1971). Optimal linear regulators: the discrete-time case. Automatic Control, IEEE Transactions on, 16(6):613–620. [77] Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Comp., 12(1):219– 245. [78] Doya, K. (2007). Bayesian brain: Probabilistic approaches to neural coding. MIT press. [79] Doyle, J. (1982). Analysis of feedback systems with structured uncertainties. IEE Proceedings on Control Theory and Applications, 129(6):242–250. [80] Dreyfus, S. E. (1965). Dynamic programming and the calculus of variations. Academic Press New York. [81] Euler, L. (1774). Methodus inveniendi lineas curvas maximi minimive proprietate gaudentes (appendix, de curvis elasticis). Bousquent, Lausanne und Genevae. [82] Evans, W. R. (1948). Graphical analysis of control systems. American Institute of Electrical Engineers, Transactions of the, 67(1):547–551. [83] Evans, W. R. (1950). Control system synthesis by root locus method. American Institute of Electrical Engineers, Transactions of the, 69(1):66–69. [84] Faisal, A., Selen, L., and Wolpert, D. (2008). Noise in the nervous system. Nature Reviews Neuroscience, 9(4):292–303. [85] Faisal, A., Stout, D., Apel, J., and Bradley, B. (2010). The manipulative complexity of lower paleolithic stone toolmaking. PloS one, 5(11):e13718. [86] Fidelman, P. and Stone, P. (2004). Learning ball acquisition on a physical robot. pages 6–12. [87] Findeisen, R. and Allgöwer, F. (2002). An introduction to nonlinear model predictive control. Procedings of the 21st Benelux Meeting on Systems and Control, 11:57–89. [88] Franklin, D. W. and Wolpert, D. M. (2011). Computational mechanisms of sensorimotor control. Neuron, 72(3):425–442. [89] Frigon, A. (2012). Central pattern generators of the mammalian spinal cord. The Neuroscientist, 18(1):56–69. [90] Friston, K. (2011). What is optimal about motor control? Neuron, 72(3):488–498. [91] Garcia, C. E., Prett, D. M., and Morari, M. (1989). Model predictive control: theory and practice—a survey. Automatica, 25(3):335–348. 118 References [92] Garg, D. and Boziuk III, J. (1972). Parameter estimation of non-linear dynamical systems. International Journal of Control, 15(6):1121–1127. [93] Gelfand, I. M. and Fomin, S. V. (2000). Calculus of variations. Courier Corporation. [94] Ghahramani, Z. and Hinton, G. (1996a). The em algorithm for mixtures of factor analyzers. Technical report, Technical Report CRG-TR-96-1, University of Toronto. [95] Ghahramani, Z. and Hinton, G. E. (1996b). Parameter estimation for linear dynamical systems. Technical report, Technical Report CRG-TR-96-2, University of Totronto, Dept. of Computer Science. [96] Ghahramani, Z. and Roweis, S. T. (1999). Learning nonlinear dynamical systems using an em algorithm. Advances in neural information processing systems, pages 431–437. [97] Giszter, S. F., Hart, C. B., and Silfies, S. P. (2010). Spinal cord modularity: evolution, development, and optimization and the possible relevance to low back pain in man. Experimental brain research, 200(3-4):283–306. [98] Giszter, S. F., Mussa-Ivaldi, F. A., and Bizzi, E. (1993). Convergent force fields organized in the frog’s spinal cord. The Journal of Neuroscience, 13(2):467–491. 
[99] Gläscher, J., Daw, N., Dayan, P., and O’Doherty, J. (2010). States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron, 66(4):585–595. [100] Goh, C. and Teo, K. (1988). Control parametrization: a unified approach to optimal control problems with general constraints. Automatica, 24(1):3–18. [101] Goldstine, H. H. and Goldstine, H. H. (1980). A History of the Calculus of Variations from the 17th through the 19th Century. Springer New York. [102] Goodman, S. R. and Latash, M. L. (2006). Feed-forward control of a redundant motor system. Biological cybernetics, 95(3):271–280. [103] Graybiel, A. M. (1998). The basal ganglia and chunking of action repertoires. Neurobiology of learning and memory, 70(1):119–136. [104] Grüne, L. and Pannek, J. (2011). Nonlinear model predictive control. Springer. [105] Gunckel, T. and Franklin, G. F. (1963). A general solution for linear, sampled-data control. Journal of Fluids Engineering, 85(2):197–201. [106] Hagiwara, T., Yamada, K., Matsuura, S., and Aoyama, S. (2012). A design method for modified pid controllers for multiple-input/multiple-output plants. Journal of System Design and Dynamics, 6(2):131–144. [107] Hamilton, S. W. R. (1931). Mathematical papers. XIII. [108] Hamm, T. M., Nemeth, P. M., Solanki, L., Gordon, D. A., Reinking, R. M., and Stuart, D. G. (1988). Association between biochemical and physiological properties in single motor units. Muscle & nerve, 11(3):245–254. [109] Hargraves, C. R. and Paris, S. W. (1987). Direct trajectory optimization using nonlinear programming and collocation. Journal of Guidance, Control, and Dynamics, 10(4):338–342. References 119 [110] Harris, C. and Wolpert, D. (1998). Signal-dependent noise determines motor planning. Nature, 394(6695):780–784. [111] Hart, C. and Giszter, S. (2010). A neural basis for motor primitives in the spinal cord. Journal of Neuroscience, 30(4):1322–1336. [112] Hart, C. B. and Giszter, S. F. (2004). Modular premotor drives and unit bursts as primitives for frog motor behaviors. The Journal of neuroscience, 24(22):5269–5282. [113] Haruno, M., Wolpert, D. M., and Kawato, M. (2001). Mosaic model for sensorimotor learning and control. Neural computation, 13(10):2201–2220. [114] Heidrich-Meisner, V. (2009). Interview with. Richard S Sutton. Künstliche Intelligenz, 3:41–43. [115] Holdefer, R. and Miller, L. (2002). Primary motor cortical neurons encode functional muscle synergies. Experimental Brain Research, 146(2):233–243. [116] Holloway Jr, R. L. (1969). Culture: a human domain. Current Anthropology, 10:395–412. [117] Hudson, T. E., Tassinari, H., and Landy, M. S. (2010). Compensation for changing motor uncertainty. PLoS computational biology, 6(11):e1000982. [118] Hull, D. G. (1997). Conversion of optimal control problems into parameter optimization problems. Journal of Guidance, Control, and Dynamics, 20(1):57–60. [119] Hunt, K. J., Sbarbaro, D., Żbikowski, R., and Gawthrop, P. J. (1992). Neural networks for control systems—a survey. Automatica, 28(6):1083–1112. [120] Ijspeert, A. J., Nakanishi, J., and Schaal, S. (2003). Learning attractor landscapes for learning motor primitives. Proceedings of the Advances in Neural Information Processing Systems, pages 1547–1554. [121] Isbell Jr, C. L., Kearns, M., Singh, S., Shelton, C. R., Stone, P., and Kormann, D. (2006). Cobot in lambdamoo: An adaptive social statistics agent. Autonomous Agents and Multi-Agent Systems, 13(3):327–354. 
[122] Izawa, J., Rane, T., Donchin, O., and Shadmehr, R. (2008). Motor adaptation as a process of reoptimization. The Journal of Neuroscience, 28(11):2883–2891. [123] Jacobson, D. (1968a). Differential dynamic programming methods for solving bang-bang control problems. Automatic Control, IEEE Transactions on, 13(6):661–675. [124] Jacobson, D. (1968b). New second-order and first-order algorithms for determining optimal control: A differential dynamic programming approach. Journal of Optimization Theory and Applications, 2(6):411–440. [125] Jacobson, D. H. and Mayne, D. Q. (1970). Differential dynamic programming. North-Holland. [126] Jin, X., Tecuapetla, F., and Costa, R. M. (2014). Basal ganglia subcircuits distinctively encode the parsing and concatenation of action sequences. Nature Neuroscience, 17:423–430. [127] Joseph, D. P. and Tou, T. J. (1961). On linear control theory. Transactions of the American Institute of Electrical Engineers, Part II: Applications and Industry, 80(4):193–196. [128] Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. Prodeedings of the 10th International Conference on Machine Learning, pages 167–173. 120 References [129] Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of artificial intelligence research, pages 237–285. [130] Kalman, R. (1960a). Contributions to the theory of optimal control. Bol. Soc. Mat. Mexicana, 5(2):102–119. [131] Kalman, R. E. (1960b). A new approach to linear filtering and prediction problems. Journal of Fluids Engineering, 82(1):35–45. [132] Kalman, R. E. and Bucy, R. S. (1961). New results in linear filtering and prediction theory. Journal of Fluids Engineering, 83(1):95–108. [133] Kantorovich, L. V. (1960). Mathematical methods of organizing and planning production. Management Science, 6(4):366–422. [134] Kappen, H. J. (2011). Optimal control theory and the linear bellman equation. Inference and Learning in Dynamic Models, pages 363–387. [135] Keller, H. B. (1976). Numerical solution of two point boundary value problems, volume 24. SIaM. [136] Kelley, H. J. (1960). Gradient theory of optimal flight paths. Journal of American Rocket Society, 30(10):947–954. [137] Kirk, D. E. (2012). Optimal control theory: an introduction. Courier Corporation. [138] Kistemaker, D. A., Wong, J. D., and Gribble, P. L. (2010). The central nervous system does not minimize energy cost in arm movements. Journal of neurophysiology, 104(6):2985–2994. [139] Kitago, T., Ryan, S. L., Mazzoni, P., Krakauer, J. W., and Haith, A. M. (2013). Unlearning versus savings in visuomotor adaptation: comparing effects of washout, passage of time, and removal of errors on motor memory. Frontiers in human neuroscience, 7. [140] Knospe, C. (2006). Pid control. IEEE Control Systems Magazine, 26(1):30–31. [141] Koechlin, E., Ody, C., and Kouneiher, F. (2003). The architecture of cognitive control in the human prefrontal cortex. Science, 302(5648):1181–1185. [142] Kohl, N. and Stone, P. (2004a). Machine learning for fast quadrupedal locomotion. 4:611–616. [143] Kohl, N. and Stone, P. (2004b). Policy gradient reinforcement learning for fast quadrupedal locomotion. Proceedings of the 2004 IEEE International Conference on Robotics and Automation, 3:2619–2624. [144] Konidaris, G. and Barto, A. (2009). Skill discovery in continuous reinforcement learning domains using skill chaining. Advances in Neural Information Processing Systems, 22:1015–1023. [145] Krishnamoorthy, V., Latash, M. 
L., Scholz, J. P., and Zatsiorsky, V. M. (2003). Muscle synergies during shifts of the center of pressure by standing persons. Experimental brain research, 152(3):281– 292. [146] Kuhn, H., Tucker, A., et al. (1951). Nonlinear programming. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability. [147] Kuleshov, V. and Precup, D. (2014). Algorithms for multi-armed bandit problems. Journal of Machine Learning. References 121 [148] Kuo, F. and Sloan, I. (2005). Lifting the curse of dimensionality. Notices of the AMS, 52(11):1320–1328. [149] Kurtzer, I., Pruszynski, J. A., Herter, T. M., and Scott, S. H. (2006). Primate upper limb muscles exhibit activity patterns that differ from their anatomical action during a postural task. Journal of neurophysiology, 95(1):493–504. [150] Kutch, J. J., Kuo, A. D., Bloch, A. M., and Rymer, W. Z. (2008). Endpoint force fluctuations reveal flexible rather than synergistic patterns of muscle cooperation. Journal of neurophysiology, 100(5):2455–2471. [151] Kutch, J. J. and Valero-Cuevas, F. J. (2012). Challenges and new approaches to proving the existence of muscle synergies of neural origin. PLoS Comput. Biol, 8(5). [152] Kwakernaak, H. and Sivan, R. (1972). Linear optimal control systems. Wiley Interscience, New York. [153] Kwok, C. and Fox, D. (2004). Reinforcement learning for sensing strategies. 4:3158–3163. [154] Kwon, W., Bruckstein, A., and Kailath, T. (1983). Stabilizing state-feedback design via the moving horizon method. International Journal of Control, 37(3):631–643. [155] Lagrange, J. L. (1797). Théorie des fonctions analytiques. De l’imprimerie de la République, Paris. [156] Lancaster, P. and Rodman, L. (1995). Algebraic riccati equations. Oxford University Press. [157] Larson, R. E. (1968). State increment dynamic programming. [158] Lashley, K. S. (1951). The problem of serial order in behavior. Bobbs-Merrill. [159] Lee, A. Y., Bryson, A. E., and Hindson, W. S. (1988). Optimal landing of a helicopter in autorotation. Journal of Guidance, Control, and Dynamics, 11(1):7–12. [160] Lee, H., Shen, Y., Yu, C., Singh, G., and Ng, A. (2006). Quadruped robot obstacle negotiation via reinforcement learning. Proceedings of the IEEE International Conference on Robotics and Automation, pages 3003–3010. [161] Levine, S. and Abbeel, P. (2014). Learning neural network policies with guided policy search under unknown dynamics. pages 1071–1079. [162] Li, W. and Todorov, E. (2004). Iterative linear-quadratic regulator design for nonlinear biological movement systems. Proceedings of the First International Conference on Informatics in Control, Automation, and Robotics, pages 222–229. [163] Li, W. and Todorov, E. (2007). Iterative linearization methods for approximately optimal control and estimation of non-linear stochastic system. International Journal of Control, 80(9):1439–1453. [164] Liu, C. and Atkeson, C. (2009). Standing balance control using a trajectory library. IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3031–3036. [165] Ljung, L. (1998). System identification. Springer. [166] Lockhart, D. B. and Ting, L. H. (2007). Optimal sensorimotor transformations for balance. Nature neuroscience, 10(10):1329–1336. 122 References [167] Loeb, G., Brown, I., and Cheng, E. (1999). A hierarchical foundation for models of sensorimotor control. Experimental brain research, 126(1):1–18. [168] Loeb, G. E. (2012). Optimal isn’t good enough. Biological cybernetics, 106(11-12):757–765. [169] Lowery, C. and Faisal, A. 
A. (2013). Towards efficient, personalized anesthesia using continuous reinforcement learning for propofol infusion control. Proceedings of the 6th International IEEE/EMBS Conference on Neural Engineering, pages 1414–1417. [170] Lozano, R., Fantoni, I., and Block, D. (2000). Stabilization of the inverted pendulum around its homoclinic orbit. Systems & Control Letters, 40(3):197–204. [171] Ly, U.-L. (1982). Design Algorithm for Robust Low-order Controllers. Stanford University. [172] Lyapunov, A. M. (1992). The general problem of the stability of motion. International Journal of Control, 55(3):531–534. [173] Mayne, D. (1966). A second-order gradient method for determining optimal trajectories of non-linear discrete-time systems. International Journal of Control, 3(1):85–95. [174] Mayne, D. (1973). Differential dynamic programming — a unified approach to the optimization of dynamical systems. Control and Dynamical Systems: Advances in Theory and Applications, 10:179–254. [175] Mayne, D. Q., Rawlings, J. B., Rao, C. V., and Scokaert, P. O. (2000). Constrained model predictive control: Stability and optimality. Automatica, 36(6):789–814. [176] McGovern, A., Precup, D., Ravindran, B., Singh, S., and Sutton, R. (1998). Hierarchical optimal control of mdps. Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, pages 186–191. [177] Michels, J., Saxena, A., and Ng, A. Y. (2005). High speed obstacle avoidance using monocular vision and reinforcement learning. Proceedings of the 22nd International Conference on Machine Learning, pages 593–600. [178] Miller, G. A., Galanter, E., and Pribram, K. H. (1986). Plans and the structure of behavior. Adams Bannister Cox. [179] Minner, S. and Kleber, R. (2001). Optimal control of production and remanufacturing in a simple recovery model with linear cost functions. OR-Spektrum, 23(1):3–24. [180] Mitrovic, D., Klanke, S., and Vijayakumar, S. (2010). Adaptive optimal feedback control with learned internal dynamics models. From Motor Learning to Interaction Learning in Robots, pages 65–84. [181] Moody, J. and Saffell, M. (2001). Learning to trade via direct reinforcement. Neural Networks, IEEE Transactions on, 12(4):875–889. [182] Moore, B. L., Panousis, P., Kulkarni, V., Pyeatt, L. D., and Doufas, A. G. (2010). Reinforcement learning for closed-loop propofol anesthesia: A human volunteer study. IAAI. [183] Morimoto, J. and Doya, K. (2001). Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems, 36(1):37–51. [184] Morris, G., Nevet, A., Arkadir, D., Vaadia, E., and Bergman, H. (2006). Midbrain dopamine neurons encode decisions for future action. Nature neuroscience, 9(8):1057–1063. References 123 [185] Mussa-Ivaldi, F. A., Giszter, S. F., and Bizzi, E. (1994). Linear combinations of primitives in vertebrate motor control. Proceedings of the National Academy of Sciences, 91(16):7534–7538. [186] Nagengast, A. J., Braun, D. A., and Wolpert, D. M. (2009). Optimal control predicts human performance on objects with internal degrees of freedom. PLoS Computational Biology, 5(6):e1000419. [187] Nashed, J. Y., Crevecoeur, F., and Scott, S. H. (2012). Influence of the behavioral goal and environmental obstacles on rapid feedback responses. Journal of neurophysiology, 108(4):999– 1009. [188] Neumann, G., Maass, W., and Peters, J. (2009). Learning complex motions by sequencing simpler motion templates. 
Proceedings of the 26th Annual International Conference on Machine Learning, pages 753–760. [189] Newton, G. C., Gould, L. A., and Kaiser, J. F. (1957). Analytical design of linear feedback controls. [190] O’Sullivan, I., Burdet, E., and Diedrichsen, J. (2009). Dissociating variability and effort as determinants of coordination. PLoS computational biology, 5(4):e1000345. [191] Parmar, M. and Hung, J. Y. (2004). A sensorless optimal control system for an automotive electric power assist steering system. IEEE Transactions on Industrial Electronics, 51(2):290–298. [192] Petitto, L. A., Holowka, S., Sergio, L. E., and Ostry, D. (2001). Language rhythms in baby hand movements. Nature, 413(6851):35–36. [193] Polak, E. (1973). SIAM review, 15(2):553–584. [194] Pontryagin, L., Boltyansky, V., Gamkrelidze, R., and Mischenko, E. (1962). Mathematical theory of optimal processes. [195] Pontryagin, L. S. (1987). Mathematical theory of optimal processes. CRC Press. [196] Popović, D., Oĝuztöreli, M., and Stein, R. (1991). Optimal control for the active above-knee prosthesis. Annals of biomedical engineering, 19(2):131–150. [197] Precup, D., Sutton, R. S., and Singh, S. (1998). Multi-time models for temporally abstract planning. Proceedings of the Advances in Neural Information Processing Systems, pages 1050–1056. [198] Press, E. W. (2015). Hierarchical, heterogeneous control of non-linear dynamical systems using reinforcement learning. [199] Prevosto, V., Graf, W., and Ugolini, G. (2009). Cerebellar inputs to intraparietal cortex areas lip and mip: functional frameworks for adaptive control of eye movements, reaching, and arm/eye/head movement coordination. Cerebral Cortex, page bhp091. [200] Primbs, J. A., Nevistić, V., and Doyle, J. C. (1999). Nonlinear optimal control: A control lyapunov function and receding horizon perspective. Asian Journal of Control, 1(1):14–24. [201] Pulvermüller, F. and Fadiga, L. (2010). Active perception: sensorimotor circuits as a cortical basis for language. Nature Reviews Neuroscience, 11(5):351–360. [202] Pytlak, R. (1999). Numerical methods for optimal control problems with state constraints, volume 1707. Springer Science. 124 References [203] Randløv, J., Barto, A., and Rosenstein, M. (2000). Combining reinforcement learning with a local control algorithm. Proceedings of the 17th International Conference on Machine Learning, pages 775–782. [204] Rao, A. V. (2009). A survey of numerical methods for optimal control. Advances in the Astronautical Sciences, 135(1):497–528. [205] Rasmussen, C. E. and Deisenroth, M. P. (2008). Probabilistic inference for fast learning in control. Lecture Notes in Computer Science, 5323:229–242. [206] Redgrave, P. and Gurney, K. (2006). The short-latency dopamine signal: a role in discovering novel actions? Nature reviews neuroscience, 7(12):967–975. [207] Redgrave, P., Prescott, T. J., and Gurney, K. (1999). Is the short-latency dopamine response too short to signal reward error? Trends in neurosciences, 22(4):146–151. [208] Richards, A. and How, J. P. (2002). Aircraft trajectory planning with collision avoidance using mixed integer linear programming. Proceedings of the American Control Conference, 3:1936–1941. [209] Riordan, J. (2012). Introduction to combinatorial analysis. Courier Corporation. [210] Rossignol, S., Dubuc, R., and Gossard, J.-P. (2006). Dynamic sensorimotor interactions in locomotion. Physiological reviews, 86(1):89–154. [211] Rusmevichientong, P., Salisbury, J. A., Truss, L. T., Van Roy, B., and Glynn, P. W. (2006). 
Opportunities and challenges in using online preference data for vehicle pricing: A case study at general motors. Journal of Revenue and Pricing Management, 5(1):45–61. [212] Saffran, J. R. and Wilson, D. P. (2003). From syllables to syntax: Multilevel statistical learning by 12-month-old infants. Infancy, 4(2):273–284. [213] Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of research and development, 3:211–229. [214] Samuel, A. L. (1967). Some studies in machine learning using the game of checkers. ii—recent progress. IBM Journal of research and development, 11(6):601–617. [215] Saridis, G. N. and Lee, C.-S. G. (1979). An approximation theory of optimal control for trainable manipulators. IEEE Transactions on Systems, Man and Cybernetics, 9(3):152–159. [216] Schaeffer, J., Hlynka, M., and Jussila, V. (2001). Temporal difference learning applied to a high-performance game-playing program. 1:529–534. [217] Schneider, D. W. and Logan, G. D. (2006). Hierarchical control of cognitive processes: Switching tasks in sequences. Journal of Experimental Psychology: General, 135(4):623–640. [218] Scholz, J. P. and Schöner, G. (1999). The uncontrolled manifold concept: identifying control variables for a functional task. Experimental brain research, 126(3):289–306. [219] Schultz, W., Dayan, P., and Montague, P. (1997). A neural substrate of prediction and reward. Science, 275(5306):1593–1599. [220] Schwartz, M. F. (2006). The cognitive neuropsychology of everyday action and planning. Cognitive neuropsychology, 23(1):202–221. References 125 [221] Sciarretta, A., Back, M., and Guzzella, L. (2004). Optimal control of parallel hybrid electric vehicles. IEEE Transactions on Control Systems Technology, 12(3):352–363. [222] Scott, S. (2004). Optimal feedback control and the neural basis of volitional motor control. Nature Reviews Neuroscience, 5(7):532–546. [223] Scott, S. H. (2012). The computational and neural basis of voluntary motor control and planning. Trends in Cognitive Sciences, 16(11):541–549. [224] Shadmehr, R. and Krakauer, J. W. (2008). A computational neuroanatomy for motor control. Experimental Brain Research, 185(3):359–381. [225] Shadmehr, R. and Moussavi, Z. M. (2000). Spatial generalization from learning dynamics of reaching movements. The Journal of Neuroscience, 20(20):7807–7815. [226] Simon, H. A. (1956). Dynamic programming under uncertainty with a quadratic criterion function. Econometrica, Journal of the Econometric Society, pages 74–81. [227] Singh, S., Litman, D., Kearns, M., and Walker, M. (2002). Optimizing dialogue management with reinforcement learning: Experiments with the njfun system. Journal of Artificial Intelligence Research, pages 105–133. [228] Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3-4):323–339. [229] Singh, S. P., Jaakkola, T., and Jordan, M. I. (1994). Learning without state estimation in partially observable markovian decision processes. Proceedings of the 11th International Conference on Machine Learning, pages 284–292. [230] Soler, M., Olivares, A., and Staffetti, E. (2010). Hybrid optimal control approach to commercial aircraft trajectory planning. Journal of Guidance, Control, and Dynamics, 33(3):985–991. [231] Sontag, E. (1983). Lyapunov-like characterization of asymptotic controllability. SIAM Journal on Control and Optimization, 21(3):462–471. [232] Stein, P. (2005). Neuronal control of turtle hindlimb motor rhythms. 
Journal of Comparative Physiology A, 191(3):213–229. [233] Stockbridge, R. H. (1990). Time-average control of martingale problems: Existence of a stationary solution. The Annals of Probability, pages 190–205. [234] Stolle, M. and Atkeson, C. (2006). Policies based on trajectory libraries. Proceedings of the International Conference on Robotics and Automation, pages 3344–3349. [235] Stolle, M., Tappeiner, H., Chestnutt, J., and Atkeson, C. (2007). Transfer of policies based on trajectory libraries. IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2981–2986. [236] Stone, P. and Sutton, R. (2001). Scaling reinforcement learning toward robocup soccer. Proceedings of the 18th International Conference on Machine Learning, pages 537––544. [237] Stout, D. and Chaminade, T. (2012). Stone tools, language and the brain in human evolution. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1585):75–87. [238] Strick, P. L., Dum, R. P., and Fiez, J. A. (2009). Cerebellum and nonmotor function. Annual review of neuroscience, 32:413–434. 126 References [239] Sutton, R. and Barto, A. (1998). Reinforcement learning. MIT Press. [240] Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning, phd thesis. [241] Sutton, R. S. and Barto, A. G. (1981). Toward a modern theory of adaptive networks: expectation and prediction. Psychological review, 88(2):135–170. [242] Sutton, R. S., Barto, A. G., and Williams, R. J. (1992). Reinforcement learning is direct adaptive optimal control. Control Systems, IEEE, 12(2):19–22. [243] Sutton, R. S., Precup, D., and Singh, S. (1999). Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1):181–211. [244] Swerling, P. (1958). A proposed stagewise differential correction procedure for satellite tracking and prediciton. Rand Corporation. [245] Swerling, P. (1959). First-order error propagation in a stagewise smoothing procedure for satellite observations. [246] Sylaidi, A. and Faisal, A. A. (2012). What is the hierarchical representation of tasks involving objects with complex internal dynamics? Frontiers in Computational Neuroscience, (129). [247] Synge, J. L. (1969). Tensor calculus, volume 5. Courier Corporation. [248] Tassa, Y., Erez, T., and Smart, W. D. (2008). Receding horizon differential dynamic programming. Advances in neural information processing systems, pages 1465–1472. [249] Technology, P. (2015). Cns description. [250] Tedrake, R. (2009). Lqr-trees: Feedback motion planning on sparse randomized trees. Proceedings of the 5th annual Robotics: Science and Systems Conference, pages 17–24. [251] Tesauro, G. (1995). Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68. [252] Theodorou, E., Buchli, J., and Schaal, S. (2010a). A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11:3137–3181. [253] Theodorou, E., Buchli, J., and Schaal, S. (2010b). Reinforcement learning of motor skills in high dimensions: A path integral approach. IEEE International Conference on Robotics and Automation (ICRA), pages 2397–2403. [254] Theodorou, E., Tassa, Y., and Todorov, E. (2010c). Stochastic differential dynamic programming. Proceedings of the American Control Conference, pages 1125–1132. [255] Theodorou, E., Todorov, E., and Valero-Cuevas, F. J. (2011). Neuromuscular stochastic optimal control of a tendon driven index finger model. 
American Control Conference, pages 348–355. [256] Thomik, A. and Faisal, A. A. (2012). Deriving motion primitives from naturalistic hand movements for neuroprosthetic control. Computational Neuroscience, 256:201–202. [257] Thoroughman, K. A. and Shadmehr, R. (2000). Learning of action through adaptive combination of motor primitives. Nature, 407(6805):742–747. [258] Ting, L. H. and Macpherson, J. M. (2005). A limited set of muscle synergies for force control during a postural task. Journal of neurophysiology, 93(1):609–613. References 127 [259] Todorov, E. (2004). Optimality principles in sensorimotor control. Nature neuroscience, 7(9):907–915. [260] Todorov, E. (2006). Optimal control theory. Bayesian brain: probabilistic approaches to neural coding, pages 269–298. [261] Todorov, E. and Ghahramani, Z. (2003). Unsupervised learning of sensory-motor primitives. 2:1750–1753. [262] Todorov, E. and Jordan, M. I. (2002). Optimal feedback control as a theory of motor coordination. Nature neuroscience, 5(11):1226–1235. [263] Todorov, E. and Li, W. (2005). A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems. Proceedings of the American Control Conference, pages 300–306. [264] Todorov, E., Li, W., and Pan, X. (2005). From task parameters to motor synergies: A hierarchical framework for approximately optimal control of redundant manipulators. Journal of Robotic Systems, 22(11):691–710. [265] Torres-Oviedo, G., Macpherson, J. M., and Ting, L. H. (2006). Muscle synergy organization is robust across a variety of postural perturbations. Journal of neurophysiology, 96(3):1530–1546. [266] Torres-Oviedo, G. and Ting, L. H. (2007). Muscle synergies characterizing human postural responses. Journal of neurophysiology, 98(4):2144–2156. [267] Tresch, M. C. and Jarc, A. (2009). The case for and against muscle synergies. Current opinion in neurobiology, 19(6):601–607. [268] Tresch, M. C., Saltiel, P., and Bizzi, E. (1999). The construction of movement by the spinal cord. Nature neuroscience, 2(2):162–167. [269] Uzsoy, R., Lee, C.-Y., and Martin-Vega, L. A. (1994). A review of production planning and scheduling models in the semiconductor industry part ii: shop-floor control. Transactions of Institute of Industrial Engineers, 26(5):44–55. [270] Valero-Cuevas, F. J., Venkadesan, M., and Todorov, E. (2009). Structured variability of muscle activations supports the minimal intervention principle of motor control. Journal of neurophysiology, 102(1):59–68. [271] Vermorel, J. and Mohri, M. (2005). Multi-armed bandit algorithms and empirical evaluation. ECML Machine Learning, pages 437–448. [272] Von Stryk, O. and Bulirsch, R. (1992). Direct and indirect methods for trajectory optimization. Annals of operations research, 37(1):357–373. [273] Vrabie, D., Pastravanu, O., Abu-Khalaf, M., and Lewis, F. L. (2009). Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45(2):477–484. [274] Wang, L.-X. (1993). Stable adaptive fuzzy control of nonlinear systems. Fuzzy Systems, IEEE Transactions on, 1(2):146–155. [275] Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge England. 128 References [276] Weierstrass, K. (1927). Vorlesungen über variationsrechnung. Akademisches Verlagsgesellschaft, Leipzig, Werke, Bd. 7. [277] Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences, phd thesis. [278] Werbos, P. J. (1989). 
Backpropagation and neurocontrol: A review and prospectus. International Joint Conference on Neural Networks, pages 209–216. [279] White, D. A. and Sofge, D. A. (1992). Handbook of Intelligent Control: Neural, Fuzzy, and Adaptative Approaches. Van Nostrand Reinhold Company. [280] Wiener, N. (1930). Generalized harmonic analysis. Acta mathematica, 55(1):117–258. [281] Wiener, N. (1949). Extrapolation, interpolation, and smoothing of stationary time series, volume 2. MIT press Cambridge, MA. [282] Williams, B., Toussaint, M., and Storkey, A. J. (2008). Modelling motion primitives and their timing in biologically executed movements. Advances in neural information processing systems, pages 1609–1616. [283] Wolpert, D., Diedrichsen, J., and Flanagan, J. (2011). Principles of sensorimotor learning. Nature Reviews Neuroscience, 12(12):739–751. [284] Wolpert, D. M., Doya, K., and Kawato, M. (2003). A unifying computational framework for motor control and social interaction. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 358(1431):593–602. [285] Yan, X., Diaconis, P., Rusmevichientong, P., and Roy, B. V. (2004). Solitaire: Man versus machine. Advances in Neural Information Processing Systems, 17:1553–1560. [286] Yoshimoto, J., Nishimura, M., Tokita, Y., and Ishii, S. (2005). Acrobot control by learning the switching of multiple controllers. Artificial Life and Robotics, 9(2):67–71. [287] Zames, G. and Francis, B. A. (1983). Feedback minimax sensitivity and optimal robustness. IEEE Transactions on Automatic Control, 28(5):585–601. Appendix A Further Work A.1 Dynamical System – Double Inverted Pendulum Cart The double inverted pendulum on the cart system depicted in Figure A.1 is a challenging benchmark in modern control theory due to underactuation of the system [170]. The control tasks involves actively balancing both poles in an upright (unstable equilibrium) position, with the cart stopping in a pre-defined position on the track (z = 0), with limited control forces applied to the cart laterally along the horizontal plane. The system is modelled with 2 solid rods, where each centre of mass is located in the middle of each rod. The state and the dynamics are given by x = (z, ż, θ1 , θ2 , θ̇1 , θ̇2 )T ẋ = f(x, u) = (ż, z̈, θ̇1 , θ̇2 , θ̈1 , θ̈2 )T where z, ż are the cart position and velocity, θ1 , θ2 are the angles of poles (links) 1 and 2, θ̇1 , θ̇2 are the angular velocities of poles 1 and 2, and θ̈1 , θ̈2 ) are the accelerations of poles 1 and 2 respectively. The parameters used for simulating the DIPC are displayed in Table A.1. Derivation of Equations of Motion Since the system deals with solid rods, both translational and rotational kinetic energies must be considered. 
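For each rod, the kinetic energy therefore combines the translation of its centre of mass with the rotation about it. Writing v_i for the speed of the centre of mass of rod i (a symbol used only in this remark), a uniform rod of mass m_i and length L_i contributes

\[ T_i = \tfrac{1}{2} m_i v_i^2 + \tfrac{1}{2} I_i \dot{\theta}_i^2, \qquad I_i = \tfrac{1}{12} m_i L_i^2, \]

which is the form expanded term by term in the expressions that follow.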
The Lagrangian is defined as the total kinetic energy minus the total potential energy,

\[ L = T - U = (T_c + T_1 + T_2) - (U_c + U_1 + U_2) \]

where

\[ T_c = \tfrac{1}{2} m_c \dot{z}^2 \]
\[ T_1 = \tfrac{1}{2} m_1 \dot{z}^2 + m_1 \dot{z} l_1 \dot{\theta}_1 \cos(\theta_1) + \tfrac{1}{2} \dot{\theta}_1^2 (m_1 l_1^2 + I_1) \]
\[ T_2 = \tfrac{1}{2} m_2 \dot{z}^2 + \tfrac{1}{2} m_2 L_1^2 \dot{\theta}_1^2 + \tfrac{1}{2} (m_2 l_2^2 + I_2) \dot{\theta}_2^2 + m_2 L_1 \dot{z} \dot{\theta}_1 \cos(\theta_1) + m_2 l_2 \dot{z} \dot{\theta}_2 \cos(\theta_2) + m_2 L_1 l_2 \dot{\theta}_1 \dot{\theta}_2 \cos(\theta_1 - \theta_2) \]
\[ U_c = 0, \qquad U_1 = m_1 g l_1 \cos(\theta_1), \qquad U_2 = m_2 g \big( L_1 \cos(\theta_1) + l_2 \cos(\theta_2) \big) \]

Table A.1 Double Inverted Pendulum Cart Simulation Parameters

Parameter Description                      | Symbol | Value   | SI Unit
Length of pendulum 1                       | L1     | 0.6     | m
Length of pendulum 2                       | L2     | 0.6     | m
Mass of pendulum 1                         | m1     | 0.5     | kg
Mass of pendulum 2                         | m2     | 0.5     | kg
Mass of cart                               | mc     | 0.5     | kg
Acceleration due to gravity                | g      | 9.80665 | m/s^2
Pendulum 1 friction constant               | Bp1    | 0.5     | N/m/s
Pendulum 2 friction constant               | Bp2    | 0.5     | N/m/s
Cart friction constant                     | Bc     | 0.1     | N/m/s
Maximum force                              | umax   | 20      | N
Maximum negative force                     | umin   | -20     | N
Number of naive start states               | nse    | 500     | a.u.
LDSi sub-trajectory length                 | h      | 7       | a.u.
LDSi average number of sub-trajectories    | nH     | 4       | a.u.
Number of actuator space simulated steps   | nK     | 300     | a.u.
Low-level control simulation time step     | ∆t     | 0.05    | s
Target state                               | x*     | 0       | (m, m/s, rad, rad, rad/s, rad/s)
Number of discretised dimensions of x      | nl     | 4       | a.u.
Number of Symbolic States                  | ns     | 2500    | a.u.

hence the Lagrangian is given by

\[ L = \tfrac{1}{2}(m_c + m_1 + m_2)\dot{z}^2 + \tfrac{1}{2}(m_1 l_1^2 + m_2 L_1^2 + I_1)\dot{\theta}_1^2 + \tfrac{1}{2}(m_2 l_2^2 + I_2)\dot{\theta}_2^2 + (m_1 l_1 + m_2 L_1)\dot{z}\dot{\theta}_1 \cos(\theta_1) + m_2 l_2 \dot{z}\dot{\theta}_2 \cos(\theta_2) + m_2 L_1 l_2 \dot{\theta}_1 \dot{\theta}_2 \cos(\theta_1 - \theta_2) - (m_1 l_1 + m_2 L_1) g \cos(\theta_1) - m_2 l_2 g \cos(\theta_2) \]

Differentiating the Lagrangian with respect to the coordinates and their velocities gives the Equations of Motion (the Lagrange Equations)

\[ \frac{d}{dt}\!\left(\frac{\partial L}{\partial \dot{z}}\right) - \frac{\partial L}{\partial z} = F - B_c \, \mathrm{sgn}(\dot{z}) \]
\[ \frac{d}{dt}\!\left(\frac{\partial L}{\partial \dot{\theta}_1}\right) - \frac{\partial L}{\partial \theta_1} = -B_{p1} \dot{\theta}_1 \]
\[ \frac{d}{dt}\!\left(\frac{\partial L}{\partial \dot{\theta}_2}\right) - \frac{\partial L}{\partial \theta_2} = -B_{p2} \dot{\theta}_2 \]

Fig. A.1 Double Inverted Pendulum on Cart System. The DIPC is modelled as having two solid rods, where the centre of mass is assumed to be in the middle of each rod. The cart is required to stop in a pre-defined position on the track at the origin (z = 0 meters).

The three equations can be rewritten as a linear system in the accelerations

\[
\begin{bmatrix}
m_c + m_1 + m_2 & (m_1 l_1 + m_2 L_1)\cos(\theta_1) & m_2 l_2 \cos(\theta_2) \\
(m_1 l_1 + m_2 L_1)\cos(\theta_1) & m_1 l_1^2 + m_2 L_1^2 + I_1 & m_2 L_1 l_2 \cos(\theta_1 - \theta_2) \\
m_2 l_2 \cos(\theta_2) & m_2 L_1 l_2 \cos(\theta_1 - \theta_2) & m_2 l_2^2 + I_2
\end{bmatrix}
\times
\begin{bmatrix} \ddot{z} \\ \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix}
=
\begin{bmatrix}
F + (m_1 l_1 + m_2 L_1)\dot{\theta}_1^2 \sin(\theta_1) + m_2 l_2 \dot{\theta}_2^2 \sin(\theta_2) - B_c \,\mathrm{sgn}(\dot{z}) \\
-m_2 L_1 l_2 \sin(\theta_1 - \theta_2)\dot{\theta}_2^2 + (m_1 l_1 + m_2 L_1) g \sin(\theta_1) - B_{p1}\dot{\theta}_1 \\
m_2 L_1 l_2 \sin(\theta_1 - \theta_2)\dot{\theta}_1^2 + m_2 l_2 g \sin(\theta_2) - B_{p2}\dot{\theta}_2
\end{bmatrix}
\]

and in the more compact form, we have P r = s

\[
\underbrace{\begin{bmatrix}
c_1 & c_2 \cos\theta_1 & c_3 \cos\theta_2 \\
c_2 \cos\theta_1 & c_4 & c_5 \cos(\theta_1 - \theta_2) \\
c_3 \cos\theta_2 & c_5 \cos(\theta_1 - \theta_2) & c_6
\end{bmatrix}}_{P}
\times
\underbrace{\begin{bmatrix} \ddot{z} \\ \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix}}_{r}
=
\underbrace{\begin{bmatrix}
F + c_2 \dot{\theta}_1^2 \sin(\theta_1) + c_3 \dot{\theta}_2^2 \sin(\theta_2) - B_c \,\mathrm{sgn}(\dot{z}) \\
-c_5 \dot{\theta}_2^2 \sin(\theta_1 - \theta_2) + c_2 g \sin(\theta_1) - B_{p1}\dot{\theta}_1 \\
c_5 \dot{\theta}_1^2 \sin(\theta_1 - \theta_2) + c_3 g \sin(\theta_2) - B_{p2}\dot{\theta}_2
\end{bmatrix}}_{s}
\]  (A.1.1)

where l_i = L_i/2 and I_i = m_i L_i^2/12, and

c_1 = m_c + m_1 + m_2, c_2 = m_1 l_1 + m_2 L_1, c_3 = m_2 l_2, c_4 = m_1 l_1^2 + m_2 L_1^2 + I_1, c_5 = m_2 L_1 l_2, c_6 = m_2 l_2^2 + I_2.

The linear system of equations can be solved for z̈, θ̈1 and θ̈2 using Gaussian Elimination, hence allowing numerical simulation to be carried out. Using a general notation for a 3 × 4 (augmented) matrix

\[ \begin{bmatrix} a & b & c & k_1 \\ d & e & f & k_2 \\ g & h & i & k_3 \end{bmatrix} \]

Forward Elimination

1. Use a as a pivot to obtain zeros below a.
   (a) Obtain zero in element (2, 1) of the matrix
       i. m = −d/a
       ii. Replace R2 by mR1 + R2
       iii. mR1 = [−(d/a)a, −(d/a)b, −(d/a)c, −(d/a)k1]
       iv.
mR1 + R2 = [0, − da b + e, − da c + f , − da k1 + k2 ] (a) Obtain zero in element(3, 1) of the matrix i. m = − ga ii. Replace R3 by mR1 + R3 iii. mR1 = [− ga a, − ag b, − ag c, − ga k1 ] iv. mR1 + R2 = [0, − ag b + h, − ga c + i, − ga k1 + k3 ] 2. Repeat step 1. for the sub-matrix formed by all the rows excluding the first one. (a) Taking first non-zero entry of the sub-matrix first row as a pivot (e − db a ) obtain zero in element(3, 2) of the matrix. i. m = − h− gb a e− db a ii. Replace R3 by mR2 + R3 " iii. mR2 = h− gb h− gb a 0, (− dba )(e − db )( f ), (− a e− a e− db a "  iv. mR2 + R3 = 0, 0, −  gb h− a e− db a and the Echelon form is obtained    a b c k1 a    d e f k2  ∼ 0 g h i k3 0 f−  dc a b  e − db a  h − ga b − # h− gb dc a )(k2 − dka1 ) ), (− a e− db a   h− gb + i − ag c , − dba c  f − dc a i − ag c e− a   k2 − dka1 + k3 −  k1  k2 − dka1  ∼  k3 − gka1 gk1  a # A.1 Dynamical System – Double Inverted Pendulum Cart  a  0   0 133  b c k1    dk1  dc  f − e − db k − 2 a a a     ∼ gb gb     h− a h− a g gk1  dk1 dc f − a + i − ac 0 − db − db k2 − a + k3 − a e− e− a a   k1 a b c     f − dc k2 − dka1  0 e − db a a 0 0 A B     gb     h− gb h− g dc short hand notation A = − dba f − a + i − a c and B = − dba k2 − dka1 + k3 − gka1 e− e− a a Backward Elimination 1. Start from last row: use A as a pivot and obtain zeros above it. (a) Make pivot of the last row equal to 1. i. m = 1 A ii. Replace R3 by mR3   iii. mR3 = 0, 0, 1, AB (b) Obtain zero in element(2, 3)  i. m = − f − dc a ii. Replace R2 by mR3 + R2 h i dc dc B iii. mR3 = 0, 0, −( f − a ), −( f − a ) A h i dk1 db dc B iv. mR3 + R2 = 0, e − a , 0, (k2 − a ) − ( f − a ) A (c) Obtain zero in element(1, 3) i. m = −c ii. Replace R1 by mR3 + R1 i h B iii. mR3 = 0, 0, −c, −c A h i B iv. mR3 + R1 = a, b, 0, k1 − c A 2. Repeat step 1. for the sub-matrix formed by all the rows excluding the last one: use (e − db a ) as a pivot and obtain zero above it. (a) Make pivot of the 2nd row equal to 1. i. m = 1 (e− db a ) ii. Replace R2 by mR2 " iii. mR2 = 0, 1, 0, dk B (k2 − a1 )−( f − dc a )A db (e− a ) # Further Work 134 (b) Obtain zero in element(1, 2) i. m = −b ii. Replace R1 by mR2 + R1 # "  dk1 dc B  (k − )−( f − ) iii. mR2 = 0, −b, 0, −b 2 a db a A (e− a ) "   B iv. mR1 + R1 = a, 0, 0, k1 − c A − b (k2 − dk1 dc B a )−( f − a ) A db (e− a ) And the Row Canonical form is obtained    k1 a b a b c       dk k2 − a1  ∼ 0 e − db f − dc 0 e − db a a a B 0 0 A 0 0  a  0 0  a 0 0    0 1 0   0 0 1      z̈ =    k1 − c gb h− a (− db e− a (− gk dk )(k2 − a1 )+(k3 − a1 ) gb h− a db e− a g )( f − dc a )+(i− a c) !!   − b   a (k2 − dk1 dc a )−( f − a ) A   k1 − c AB   ! dk B  k2 − a1 − f − dc  a A ∼  e− db a   B A  k1 −c AB −b (k2 − dk B (k2 − a1 )−( f − dc a )A db (e− a ) a dk1 dc B a )−( f − a ) A db (e− a ) B A Therefore we obtain the results:   k1  k2 − dka1  ∼  B c  f − dc a 1  a b 0  k1 − c A b 0    dk1  dc B  ∼ 0 1 0 e − db 0 − f − k −  2  a a a A   B 0 1 A 0 0 1     !   dk1 dc B    k − − f − 2 a a A B  1 0 0  k1 − c A − b  db  e− a     !     dk1  dc B k2 − a − f − a A ∼    db e− a   0 1 0   B  A 0 0 1  B !#              ! gb h− a gk dk )(k2 − a1 )+(k3 − a1 ) db e− a gb h− a g )( f − dc (− a )+(i− a c) db e− a (e− db a ) (− ! 
 !            (A.1.2) A.1 Dynamical System – Double Inverted Pendulum Cart  (k2 − dka1 ) − ( f − dc a)   θ̈1 =   (− gb h− a db e− a (− )(k2 − gb h− a e− db a gk1 dk1 a )+(k3 − a ) g )( f − dc a )+(i− a c) ! (e − db a )   h− gb a e− db a 135       (A.1.3)  (− )(k2 − dka1 ) + (k3 − gka1 )     θ̈2 = gb   h− g  (− dba )( f − dc ) + (i − c)  a a e− (A.1.4) a (A.1.5) and in the previous shorthand notation:  !  1 ˙2 ˙2 z̈ =  c1 F + c2 θ1 sin(θ1 ) + c3 θ2 sin(θ2 ) − Bc sgn(ż)  1 B − c3 cos(θ2 ) c1 A ! 2 −    θ̈1 =   1 c2 cos(θ1 ) c1 − c5 θ˙2 sin(θ1 − θ2 ) + c2 g sin(θ1 ) − B p1 θ˙1 −  c2 cos(θ1 )c2 cos(θ1 ) c1 c4 − ! − c5 cos(θ1 − θ2 ) − c2 cos(θ1c)c1 3 cos(θ2 ) AB   1    − c2 cos(θ1 )  c1 c2 cos(θ1 )c2 cos(θ1 )  c4 − c1 2 − c5 θ˙2 sin(θ1 − θ2 ) + c2 g sin(θ1 ) − B p1 θ˙1 −  − c4 − c5 cos(θ1 − θ2 ) − c2 cos(θ1c)c1 3 cos(θ2 ) BA       c2 cos(θ1 )c2 cos(θ1 )  c − 4 c1  2  2 2 c2 cos(θ1 )F+c2 θ˙1 sin(θ1 )+c3 θ˙2 sin(θ2 )−Bc sgn(ż) c1 c2 cos(θ1 )c2 cos(θ1 ) c1 ! 2 c2 cos(θ1 )F+c2 θ˙1 sin(θ1 )+c3 θ˙2 sin(θ2 )−Bc sgn(ż) c1 ! ! Further Work 136 θ̈2 = B A where  !  c5 cos(θ1 − θ2 ) − c2 c3 cos(θc11 ) cos(θ2 )  B =  − c2 (cos(θ ))2  c4 − 2 c 1 1 ! ˙1 2 sin(θ1 ) + c3 θ˙2 2 sin(θ2 ) − Bc sgn(ż) c cos( θ θ )F + c 2 1 2 − c5 θ˙2 sin(θ1 − θ2 ) + c2 g sin(θ1 ) − B p1 θ˙1 − c1  2  ˙1 2 sin(θ1 ) + c3 θ˙2 2 sin(θ2 ) − Bc sgn(ż)  θ θ )F + c c cos( 2 2 3  + c5 θ˙1 sin(θ1 − θ2 ) + c3 g sin(θ2 ) − B p2 θ˙2 −  c1  2 A= − c5 cos(θ1 − θ2 ) − c3 cos(θ2c)c1 2 cos(θ1 ) ! c2 cos(θ1 )c3 cos(θ2 ) c5 cos(θ1 − θ2 ) − c1 c4 − c2 cos(θ1c)c1 2 cos(θ1 ) ! c3 cos(θ2 ) + c6 − c3 cos(θ2 ) c1    2   c5 cos(θ1 − θ2 ) − c2 c3 cos(θc11 )cos(θ2 )  c23 (cos(θ2 ))2    A =c6 − −  c22 (cos(θ1 ))2 c 1   c4 − c1 ! Derivations of Partial Derivatives of the Dynamics In order to define the matrices A = I +  (i) (i) ∂ f (i) (i) ∂ f (i) (i) ∂ x (x0 , u0 ) and B = ∂ u (x̄0 , ū0 ) for the DIPC stabilisation task evaluated at points (x0 , u0 ), the derivatives of the system dynamics ẋ w.r.t. state x ∈ R6 and control u ∈ R need to be found. The differential of the first derivative of state with respect to state is given by  ∂ ẋ ∂x ∂ ż ∂z  ∂ z̈  ∂z   ∂ θ˙1  ∂z =  ∂ θ˙2  ∂z  ∂ θ¨  1  ∂z ∂ θ¨2 ∂z ∂ ż ∂ ż ∂ z̈ ∂ ż ∂ θ˙1 ∂ ż ∂ θ˙2 ∂ ż ∂ θ¨1 ∂ ż ∂ θ¨2 ∂ ż ∂ ż ∂ θ1 ∂ z̈ ∂ θ1 ∂ θ˙1 ∂ θ1 ∂ θ˙2 ∂ θ1 ∂ θ¨1 ∂ θ1 ∂ θ¨2 ∂ θ1 ∂ ż ∂ θ2 ∂ z̈ ∂ θ2 ∂ θ˙1 ∂ θ2 ∂ θ˙2 ∂ θ2 ∂ θ¨1 ∂ θ2 ∂ θ¨2 ∂ θ2 ∂ ż ∂ θ˙1 ∂ z̈ ∂ θ˙1 ∂ θ˙1 ∂ θ˙1 ∂ θ˙2 ∂ θ˙1 ∂ θ¨1 ∂ θ˙1 ∂ θ¨2 ∂ θ˙1  ∂ ż ∂ θ˙2 ∂ z̈   ∂ θ˙2  ˙ ∂ θ1   ∂ θ˙2  ∂ θ˙2  = ∂ θ˙2   ∂ θ¨1  ˙ ∂ θ2  ∂ θ¨2 ∂ θ˙2  0 0   0   0  0  0 1 0 0 0 0 0 0 0 0 ∂ z̈ ∂ θ1 ∂ z̈ ∂ θ2 0 0 0 0 ∂ z̈ ∂ θ˙1 ∂ θ¨1 ∂ θ1 ∂ θ¨2 ∂ θ1 ∂ θ¨1 ∂ θ2 ∂ θ¨2 ∂ θ2 1 0 ∂ θ¨1 ∂ θ˙1 ∂ θ¨2 ∂ θ˙1 0  ∂ z̈  ∂ θ˙2        ∂ θ¨1  ˙ ∂ θ2  0 1 ∂ θ¨2 ∂ θ˙2 A.1 Dynamical System – Double Inverted Pendulum Cart 137 Deriving the equations by hand is not optimal, therefore I used Matlab’s ’Symbolic Math Toolbox’ to obtain symbolic differentiation results. The differential of the first derivative of state with respect to control is given by ∂ ẋ h ∂ ż ∂ z̈ ∂ θ̇1 ∂ θ̇2 ∂ θ̈1 ∂ θ̈2 iT h = ∂u ∂u ∂u ∂u ∂u ∂u = 0 ∂u ∂ z̈ ∂u 00 ∂ θ̈1 ∂ θ̈2 ∂u ∂u iT Deriving the equations by hand is not optimal, therefore I have used the Matlab’s ’Symbolic Math Toolbox’ to obtain symbolic differentiation results. 
Validation of Derivations by Simulation

Analogous to the cart-pole of Subsection 2.4.2, the DIPC system is tested to ensure the equations of motion are correctly derived, using the total energy of the system. The frictionless system is tested for E = T + U (Equation A.1.6), which is expected to remain constant over the tested period of time (here taken to be T = 5 sec) when the tip of the double-link pendulum is released from a height (θ1 = π/2 rad, θ2 = 0 rad).

\[ T = \tfrac{1}{2}(m_c + m_1 + m_2)\dot{z}^2 + \tfrac{1}{2}(m_1 l_1^2 + m_2 L_1^2 + I_1)\dot{\theta}_1^2 + \tfrac{1}{2}(m_2 l_2^2 + I_2)\dot{\theta}_2^2 + (m_1 l_1 + m_2 L_1)\dot{z}\dot{\theta}_1 \cos(\theta_1) + m_2 l_2 \dot{z}\dot{\theta}_2 \cos(\theta_2) + m_2 L_1 l_2 \dot{\theta}_1 \dot{\theta}_2 \cos(\theta_1 - \theta_2) \]
\[ U = (m_1 l_1 + m_2 L_1) g \cos(\theta_1) + m_2 l_2 g \cos(\theta_2) \]  (A.1.6)

Figure A.2 shows the results of the stability testing of the kinetic (see Figure A.2 a) and potential (see Figure A.2 b) energies, which oscillate at equal but opposite magnitudes, showing varied frequencies due to the presence of two interconnected links. The potential energy turns into kinetic energy as the height of the pendulum tip from the ground is reduced, and vice versa, due to the conservation of energy. The state trajectories demonstrate stable system behaviour during the simulation, with stable cart position, cart velocity, θ1 angle and θ̇1 angular velocity as the pendulum swings between π/2 and −π/2 (see Figure A.2 c). The total energy of the DIPC system (see Figure A.2 d) shows a stable total of E = 2.94 J with a fluctuation range of 1.5 × 10^−7 (the relative and absolute tolerances of the ODE45 solver, an inbuilt Matlab function used to solve non-stiff differential equations, were set to 10^−9, and any fluctuation of the recorded total energy is due to this set accuracy). Note that the addition of an extra link to the cart-pole task of Section 2.4.2 affects the total energy of the system, making it not equal to zero (as opposed to that of the cart-pole, which is zero, see Figure 2.3 (d)).

The derivations for the dynamics of the system, ẋ, were also tested by solving the linear equation system (A.1.1) for r, i.e. finding r = P⁻¹s, and comparing the answer to the derived equations using arbitrary parameters. If all of the manual derivations are correct, the two answers should agree exactly, which they did.

Fig. A.2 Total Energy for Double Inverted Pendulum System. The tip of the second link is dropped from an angle of π/2 rad (while θ1 = 0) and the behaviour of the system is observed over a period of 5 seconds (with equivalent results for a period of 5 hours). The simulation was carried out using the ODE45 Matlab function with absolute and relative tolerances of 10^−9. (a) Displays the periodic changes in kinetic energy of the system. (b) Displays the periodic changes in potential energy of the system. (c) Displays the DIPC state trajectory generated by ODE45 for a 5 sec simulation. (d) Displays the stable non-zero total energy (mean 2.942, variance 6.58 × 10^−17).

A.2 Physical Robotic Arm

A physical air-muscle based Robotic Arm was built by Mr Dieter Buchler (MSc, FaisalLab), see Figure A.3. The arm has to be controlled to a non-zero target state under the influence of gravity, and hence using a non-zero target control, which standard LQR does not support.
Therefore a careful transformation of the system was applied. Start with a standard assumption of the dynamics

\[ x_{k+1} = \tilde{A} x_k + \tilde{B} u_k \]  (A.2.1)

The target state and target control are x_k = x_{k+1} = x*, u_k = u_{k+1} = u* respectively, therefore Equation A.2.1 takes the form

\[ x^* = \tilde{A} x^* + \tilde{B} u^* \]

Fig. A.3 Physical Air Muscle Robotic Arm. The physical air-muscle based robotic arm under the effects of gravity.

A careful transformation of Equation A.2.1 is performed:

\[
\begin{aligned}
x_{k+1} - x^* &= \tilde{A} x_k + \tilde{B} u_k - x^* \\
x_{k+1} - x^* &= \tilde{A} x_k + \tilde{B} u_k - (\tilde{A} x^* + \tilde{B} u^*) \\
x_{k+1} - x^* &= \tilde{A} x_k - \tilde{A} x^* + \tilde{B} u_k - \tilde{B} u^* \\
x_{k+1} - x^* &= \tilde{A} (x_k - x^*) + \tilde{B} u_k - \tilde{B} u^* \\
\tilde{x}_{k+1} &= \tilde{A} \tilde{x}_k + \tilde{B} u_k + c
\end{aligned}
\]

where x̃_{k+1} = x_{k+1} − x* and c = −B̃u*. The transformation results in a constant term which accounts for the non-zero target control. However, this constant needs to be learnt together with the system Ã and control B̃ matrices, i.e. 'system identification' must be performed. The system is hence re-written in an augmented state form, in order to allow for the constant term to be incorporated into the Ã matrix:

\[
\begin{bmatrix} \tilde{x}_{k+1} \\ 1 \end{bmatrix}
=
\begin{bmatrix} \tilde{A} & c \\ 0 & 1 \end{bmatrix}
\times
\begin{bmatrix} \tilde{x}_k \\ 1 \end{bmatrix}
+
\begin{bmatrix} \tilde{B} \\ 0 \end{bmatrix}
\times u_k
\]  (A.2.2)

Hence we obtain the new form

\[ \hat{x}_{k+1} = \hat{A} \hat{x}_k + \hat{B} u_k \]  (A.2.3)

where

\[ \hat{x}(k) = \begin{bmatrix} \tilde{x}(k) \\ 1 \end{bmatrix}, \qquad \hat{A} = \begin{bmatrix} \tilde{A} & c \\ 0 & 1 \end{bmatrix}, \qquad \hat{B} = \begin{bmatrix} \tilde{B} \\ 0 \end{bmatrix}. \]  (A.2.4)

The system identification was performed by RLOC using naive interaction with the robotic arm, described in Section 5.2. The results demonstrated that the new system A and control B matrices are learnt correctly for the chosen linearisation points.

A.3 Dynamical System – Simple Example

A simple nonlinear dynamical system is defined by the system and control weight surfaces A(x) and B(x) respectively, in order to test the RLOC algorithm on an arbitrary problem. The state space over which the matrices are valid is X ∈ [−10, 10] for a time-varying state x(k) = (x1, x2)^T ∈ R². The control signal u ∈ R² is applied to both of the arbitrary state entries. The surfaces are designed separately for each matrix entry, using 5 Gaussian distributions merged together into a single distribution. Initially, arbitrary A^(i), B^(i) and covariance matrices (Σ^(i) = diag(4.5, 4.5), i = 1, ..., 5) are created, where each element of each matrix serves as the basis for one peak within a surface. The locations of the peaks of each distribution are defined using centers C. Each point on the state space grid (defined by (x1, x2)) is passed through the Multivariate Gaussian function (A.3.1) for each of the centers, defined with the covariance matrices and means, resulting in 'weights'. The weights define the shape of the Gaussian peaks, based on the arbitrary scale of the original A and B matrices.

\[ f(x_1, \ldots, x_n) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)} \]  (A.3.1)

The resulting A(x) and B(x) matrices defining the dynamics of the system are displayed in Figures A.4 and A.5, where the characteristic shapes of each overall Gaussian function can be adjusted by a different initialisation of the A and B matrices. It should be noted that this is a simple problem and does not necessarily reflect a real-life dynamical system. The code used to generate the data is displayed below.

    settings.wCntrs    = 5;   % 5 peaks per each element of the matrix
    settings.numDims_x = 2;   % state dimension, x = (x1, x2); required by the reshape step below
    settings.numDims_u = 2;   % control dimension, u in R^2;   required by the reshape step below
    for ii = 1 : settings.wCntrs
        As(:,:,ii) = [(1.5*ii) -0.1; -0.05 (0.5*ii)];          % 2x2x5
        Bs(:,:,ii) = [(2*ii)/10 -0.05; -0.025 -(2.5*ii)/10];   % 2x2x5
        sigmas(:,:,ii) = diag([4.5, 4.5]);                     % var-cov matrix
    end
    % centers for location of each peak on the state space grid.
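    % The five centres below lie at the four corners (+/-4, +/-4) and at the origin of
    % the [-10, 10] state-space grid; each centre positions one Gaussian peak in every
    % entry of A(x) and B(x). The loops that follow evaluate the multivariate Gaussian
    % of Equation A.3.1 at each grid point, producing one column of weights per centre.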
    C(:,1) = [4,-4]; C(:,2) = [-4,4]; C(:,3) = [4,4]; C(:,4) = [-4,-4]; C(:,5) = [0 0];
    % ----- create weights
    spacing = -10 : 0.1 : 10;
    nPts    = length(spacing);
    [X1, X2] = meshgrid(spacing, spacing);
    X = [X1(:) X2(:)]';   % X is a matrix of state space in range [-10,10]
    W = [];               % weights creating gaussian surface for A, B.
    for ii = 1 : settings.wCntrs
        temp = [];
        for jj = 1 : length(X)
            temp(jj,1) = 1 / ( 2*pi*det(sigmas(:,:,ii))^(1/2) ) * exp( -0.5 ...
                .*(X(:,jj)-C(:,ii))'*inv(sigmas(:,:,ii))*(X(:,jj)-C(:,ii)) );
        end
        W(:,ii) = temp;
    end
    dim_A  = settings.numDims_x*settings.numDims_x;
    dim_B  = settings.numDims_x*settings.numDims_u;
    new_As = reshape(As,[dim_A,settings.wCntrs]);   % column vec per dim
    new_Bs = reshape(Bs,[dim_B,settings.wCntrs]);
    temp_overall_A = zeros(length(X), dim_A);   % each row of overall_A contains 4 elements of A at that pt X
    temp_overall_B = zeros(length(X), dim_B);
    for jj = 1 : length(X)
        % Matrices linearly combined, weighted by ws appropriate to the state space
        temp_overall_A(jj,:) = new_As * W(jj,:)';
        temp_overall_B(jj,:) = new_Bs * W(jj,:)';
    end
    % ----- reshape A, B into matrices
    sqrt_spacing = sqrt(length(X));
    overall_A = zeros(sqrt_spacing,sqrt_spacing,dim_A);
    overall_B = zeros(sqrt_spacing,sqrt_spacing,dim_B);
    for ii = 1 : dim_A
        overall_A(:,:,ii) = reshape(temp_overall_A(:,ii),sqrt_spacing,sqrt_spacing);
    end
    for ii = 1 : dim_B
        overall_B(:,:,ii) = reshape(temp_overall_B(:,ii),sqrt_spacing,sqrt_spacing)/10;
    end

Fig. A.4 Simple example – system matrix entries. Each entry of the system matrix A specifying the passive dynamics of the arbitrary task is designed using 5 Gaussian peaks. The points in the 3-D plot correspond to the area of state space x = (x1, x2)^T ∈ [−10, 10]. (a) Element A11. (b) Element A12. (c) Element A21. (d) Element A22.

Fig. A.5 Simple example – control matrix entries. Each entry of the control matrix B of the arbitrary task is designed using 5 Gaussian peaks. The points in the 3-D plot correspond to the area of state space x = (x1, x2)^T ∈ [−10, 10]. (a) Element B11. (b) Element B12. (c) Element B21. (d) Element B22.

The matrices A(x) and B(x) can be used to define the behaviour of the system, given the current state and control, using the formula

\[ \dot{x}(t) = A(x(t)) x(t) + B(x(t)) u(t) \]  (A.3.2)

A.4 Linear Bellman Combination for Control

In this Section I explain key concepts and fully derive the key equations stated in the paper by J. Popovic et al., Linear Bellman Combination for Control of Character Animation [59]. This research is of prime interest to the RLOC algorithm development, since it allows a smooth combination of the available LQR controllers, applicable to adaptation to new tasks and broadening of the existing control domain. The theory can be applied to the low-level LQR controllers of the RLOC algorithm. A simple linear combination of optimal value functions is not sufficient due to the non-linearity of the Hamilton-Jacobi-Bellman (HJB) Equation.
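To make the end result concrete before the derivation, the following minimal Matlab sketch evaluates the combination rule of Equation A.4.14 for two controllers; the quadratic value functions, target states, blend weights and feedback gains below are illustrative placeholders chosen for this sketch, not values taken from the thesis experiments.

    % Minimal sketch of the Linear Bellman Combination rule (Equation A.4.14) for two
    % hypothetical controllers; all numerical values are illustrative only.
    x     = [0.4; -0.2];                  % current state
    w     = [0.5, 0.5];                   % fixed blend weights w_i
    xstar = {[1; 0], [-1; 0]};            % target states of the two example controllers
    P     = {[2 0; 0 1], [1 0; 0 3]};     % quadratic value function weight matrices
    K     = {[1.2 0.8], [0.9 1.1]};       % feedback gains of the two policies

    n  = numel(w);
    z  = zeros(1, n);                     % reward-to-go z_i = exp(-v_i)
    ui = zeros(1, n);                     % individual policy outputs (scalar control here)
    for i = 1 : n
        dx    = x - xstar{i};
        v     = 0.5 * dx' * P{i} * dx;    % cost-to-go of controller i at the current state
        z(i)  = exp(-v);
        ui(i) = -K{i} * dx;               % policy of controller i
    end
    alpha   = (w .* z) / sum(w .* z);     % state-dependent coefficients alpha_i
    u_blend = sum(alpha .* ui);           % Linear Bellman Combination of the policies
    u_naive = sum(w .* ui);               % naive fixed-weight average, for contrast

The coefficients α_i shift the blend towards whichever controller is cheapest (largest z_i) from the current state, which is what distinguishes the combination from the naive fixed-weight average computed on the last line.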
approximating value functions (one per controller) around samples from example trajectories. Each controller is weighted by its relevance to current state and goal of the system, which outperforms naive combination. A representation of each controller’s value function and policy is required; and only temporally aligned optimal controllers can be combined. The value functions are approximated around example trajectories. Hamilton-Jacobi-Bellman Equation The type of non-linear dynamical systems controlled are restricted to the case of control-affine, stochastic dynamics   dx = a(x) + B(x)u dt +C(x)dW (A.4.1) where a(x) ∈ Rℓ describes the passive dynamics of the system, B(x) ∈ Rℓ×m non-linear function of the state, and u ∈ Rm is the control signal, dW ∼ N (0, dtIℓ ) ∈ Rℓ is the stochastic differential into the future (while C(x)dW is distributed as ∼ N (0, dtCCT ). In continuous time the Optimal Policy / Optimal Control Law minimises the objective function for starting at a given (x,t) and acting according to policy ( min g(xT ) + u∈U Z T 0 l(x, u,t)dt ) (A.4.2) where g(x) is the terminal cost function and l(x, u,t) is the instantaneous cost function defined as 1 l(x, u,t) = q(x,t) + uT u 2 (A.4.3) for q(x,t) any positive function. The stochastic Bellman Equation defines the recursive relationship for the value function v(x) = min u∈U (x) (  ) cost(x, u) + E v(next(x, u)) (A.4.4) where next(x, u) is the next state obtained as a result of applying action u in state x. Hence applying Dynamic Programming (see Subsection 1.3.2) to the time-discretised stochastic problem gives )    √   v(x, k) = min ∆l(x, u, k∆) + E v x + a(x) + B(x)u ∆ + ε k ∆C(x), k + 1 u∈U ( (A.4.5) Further Work 144 where ∆ = nTK is the discretisation time step, xk = x(k × ∆), ε ∼ N (0, Iℓ ) ∈ Rℓ is a vector of standard √ normal random variables and where ε k ∆ = dW ∼ N (0, ∆Iℓ ). In order to compute the expectation, first the second-order Taylor expansion of the value function √   around x where δ = a(x) + B(x)u ∆ + ε k ∆C(x), while keeping the time variable constant, is found followed by taking the expectation and keeping terms up to first order in ∆, since (dt)2 is negligible by   Ito’s calculus. Note O |x − (x + δ )|n+1 = O δ n+1 .  
1 T ∂ 2 v(x, k + 1) 3 (A.4.6) δ + O δ v(x + δ , k + 1) = v(x, k + 1) + δ + δ ∂x 2 ∂ x2      ∂ v(x, k + 1) 1 T ∂ 2 v(x, k + 1) E v(x + δ , k + 1) = E v(x, k + 1) + δ T + δ δ +O δ3 2 ∂x 2 ∂x   T ∂ v(x, k + 1) √  = E v(x, k + 1) + a(x) + B(x)u ∆ + ε k ∆C(x) ∂x T ∂ 2 v(x, k + 1)   √ √   1  + a(x) + B(x)u ∆ + ε k ∆C(x) a(x) + B(x)u ∆ + ε k ∆C(x) 2 ∂ x2  + O ∆3  √ T ∂ v(x, k + 1) T ∂ v(x, k + 1) = E v(x, k + 1) + ∆a(x) + ∆B(x)u + ε k ∆C(x) ∂x ∂x T ∂ 2 v(x, k + 1)  1 + ∆a(x) + ∆B(x)u ∆a(x) + ∆B(x)u 2 ∂ x2 2 √ √ T ∂ v(x, k + 1)  1 ε ∆C(x) + ε k ∆C(x) k 2 ∂ x2  √ T ∂ 2 v(x, k + 1)   1 + 2 ∆a(x) + ∆B(x)u ε k ∆C(x) + O ∆3 2 2 ∂x       √ T ∂ v(x, k + 1) T ∂ v(x, k + 1) = E v(x, k + 1) + E ∆a(x) + ∆B(x)u + E ε k ∆C(x) ∂x ∂x {z } | T ∂ v(x, k + 1)    E[εε ]=0 T ∂ 2 v(x, k + 1) 1 ∆a(x) + ∆B(x)u + E ∆a(x) + ∆B(x)u 2 ∂ x2 | {z  } O ∆2  √ √ T ∂ 2 v(x, k + 1)  1 + E ε k ∆C(x) ε k ∆C(x) 2 ∂ x2   √   T ∂ 2 v(x, k + 1) 3 ε ∆C(x) + O ∆ + E ∆a(x) + ∆B(x)u k ∂ x2 | {z }  E[εε ]=0     2 T ∂ v(x, k + 1) 1  T ∂ v(x, k + 1) + O ∆2 + tr ∆C(x)C(x) E v(x + δ , k + 1) = v(x, k + 1) + ∆a(x) + ∆B(x)u 2 ∂x 2 ∂x A.4 Linear Bellman Combination for Control 145     √ √ T ∂ 2 v(x,k+1)   √ T ∂ 2 v(x,k+1)  since E ε k ∆C(x) ε k ∆C(x) = E tr ε k ∆C(x) ε k ∆C(x) ∂ x2 ∂ x2           √ √  √ T ∂ 2 v(x,k+1)  ∂ 2 v(x,k+1) T ε ε ε = tr E = tr Var = tr ∆CC vxx . k ∆C(x) k ∆C(x) k ∆C(x) ∂ x2 ∂ x2  √ The Bellman Equation (A.4.5) therefore takes on the form ( T ∂ v(x, k + 1) v(x, k) = min ∆l(x, u, k∆) + v(x, k + 1) + ∆a(x) + ∆B(x)u ∂x u∈U )   2  1 T ∂ v(x, k + 1) 2 + tr ∆C(x)C(x) + O ∆ 2 ∂ x2 ( T ∂ v(x, k + 1) v(x, k + 1) − v(x, k) = min l(x, u, k∆) + a(x) + B(x)u − ∆ ∂x u∈U )    ∂ 2 v(x, k + 1) 1 (A.4.7) + tr C(x)C(x)T + O ∆ 2 ∂ x2 Therefore the Hamilton-Jacobi-Bellman Equation for problems with finite-horizon is obtained from (A.4.7) in the limit of ∆ → 0 using formal definition of a derivative (2.1.3) ) (  T ∂ v(x,t) 1 h 2 v(x,t) i ∂ v(x,t) ∂ = min l(x, u,t) + a(x) + B(x)u + tr C(x)C(x)T − ∂t ∂x 2 ∂ x2 u∈U (A.4.8) v(x, T ) = g(x) This second order differential equation is non-linear (due to the presence of the min{·} operator) and is typically high-dimensional, making it difficult to find the solution analytically. Exponential Mapping The cost-to-go function, v(x,t), is exponentially mapped into a reward-to-go function, z(x,t) (allowing to transform the Hamilton-Jacobi-Bellman Equation A.4.8 into a form where the dependence on the value function is linear). Given the particular quadratic penalty on control (A.4.3), the linearity of the Bellman equation only holds if B(x) = C(x), thus implying that noise acts in the same sub-space as control. 
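As a concrete instance of the mapping introduced next, a quadratic terminal cost centred on a target state x* exponentiates to an (unnormalised) Gaussian boundary function; the weight matrix S below is simply a placeholder for the quadratic terminal cost weight:

\[ g(x) = \tfrac{1}{2}(x - x^*)^T S (x - x^*) \quad \Longrightarrow \quad h(x) = e^{-g(x)} = \exp\!\left( -\tfrac{1}{2}(x - x^*)^T S (x - x^*) \right), \]

a Gaussian bump over the state space, which is why the restriction to quadratic terminal costs noted below yields Gaussian reward functions.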
z(x,t) = e−v(x,t) z(x, T ) = e−g(x) = h(x) The derivatives necessary to complete the transformation are ∂ z(x,t) ∂ v(x,t) ∂ v(x,t) = −e−v(x,t) = −z(x,t) ∂t ∂t ∂t ∂ z(x,t) ∂ v(x,t) ∂ v(x,t) = −e−v(x,t) = −z(x,t) ∂x ∂x ∂x Further Work 146 2 2 ∂ 2 z(x,t) ∂ v(x,t) T −v(x,t) ∂ v(x,t) ∂ v(x,t) T ∂ z(x,t) −v(x,t) ∂ v(x,t) −v(x,t) ∂ v(x,t) e = −e + = −e − ∂ x2 ∂ x2 ∂x ∂x ∂ x2 ∂x ∂x 1 ∂ z(x,t) ∂ v(x,t) − = ∂t z(x,t) ∂ t ∂ v(x,t) 1 ∂ z(x,t) =− ∂x z(x,t) ∂ x    ∂ 2 v(x,t) ∂ 2 z(x,t) ∂ v(x,t) T ∂ v(x,t) 1 z(x,t) − = − ∂ x2 ∂ x2 ∂x ∂x z(x,t)    2    1 ∂ z(x,t)  1 ∂ z(x,t) 1 ∂ z(x,t) T ✟ ✟ z(x,t) − − − − = ✟ ∂x ✟ ∂ x2 z(x,t) ∂ x z(x,t) ✟ z(x,t) ✟  2   ∂ z(x,t) 1 1 ∂ z(x,t) T ∂ z(x,t) = − − 2 ∂x z(x,t) ∂ x ∂x z(x,t) Hence the Hamilton-Jacobi-Bellman Equation A.4.8 is equivalent to (  T  1 ∂ z(x,t) 1 T   1 ∂ z(x,t)  z(x,t) q(x,t) + = min u u + a(x) + B(x)u − ∂t 2 u∈U ✟✟ ✟✟ ∂ x z(x,t) z(x,t) ✟ ✟ " #)   2 T z(x,t) 1 ∂ ∂ z(x,t) ∂ z(x,t) 1 1 + tr C(x)C(x)T − − 2 ∂ x2 z(x,t) ∂ x ∂x ✟✟ z(x,t) ✟ (  T ∂ z(x,t) 1 ∂ z(x,t) = min z(x,t)q(x,t) + z(x,t)uT u − a(x) + B(x)u ∂t 2 ∂x u∈U " #)  T 2 z(x,t) 1 ∂ ∂ z(x,t) ∂ z(x,t) 1 + tr B(x)B(x)T − (A.4.9) + 2 ∂ x2 z(x,t) ∂ x ∂x The minimum of the transformed HJB function is found by setting the gradient of the Hamiltonian (term inside Equation A.4.9), with respect to control u, to zero z(x,t) T ∂ z(x,t) T u − B(x) ∂x 2✁ ∂ z(x,t) T B(x) z(x,t)uT = ∂x 1 ∂ z(x,t) T uT = B(x) z(x,t) ∂ x 1 ∂ v(x,t) ✟✟ u∗ = B(x)T (−✟ z(x,t)) ✟ ∂x ✟ z(x,t) ✟ ∂ v(x,t) u∗ = −B(x)T ∂x 0 = 2✁ (A.4.10) A.4 Linear Bellman Combination for Control 147 Substitute the expression for the resulting optimal control, u∗ , into the Hamiltonian (the term being minimised) of Equation A.4.9 T  ∂z ∂ z(x,t) z(x,t) ∂ v(x,t) T T ∂ v(x,t) T ∂ v(x,t) B(x)B(x) = z(x,t)q(x,t) + − a(x) − B(x)B(x) ∂t 2 ∂x ∂x ∂x ∂x # " T 2 ∂ z(x,t) ∂ z(x,t) 1 ∂ z(x,t) 1 B(x)B(x)T + + tr − B(x)B(x)T 2 2 ∂x z(x,t) ∂ x ∂x T  T ✟✟ ∂ z(x,t) 1 ∂ z(x,t) z(x,t) T ∂ z(x,t) T ∂ z(x,t) ✟ B(x)B(x) − a(x) + B(x)B(x) = z(x,t)q(x,t) + ∂x z(x,t) ∂x ∂x 2z(x,t)✁2 ∂ x # # " " 2 ∂ z(x,t) 1 1 ∂ z(x,t) T 1 T ∂ z(x,t) B(x)B(x)T + tr − tr B(x)B(x) 2 2 ∂x 2 z(x,t) ∂ x ∂x ✭✭✭ 1 1 ∂ z(x,t) T✭✭✭✭✭T✭∂✭z(x,t) T ∂ z(x,t) ✭ B(x)B(x) = z(x,t)q(x,t) + − a(x) ✭ ✭ ✭ ✭✭ ∂ x 2✭ z(x,t) ∂x ∂x ✭ # " ✭ ✭ ✭✭✭ ✭ 1 ∂ z(x,t) T ✭✭✭✭✭ 1 1 ∂ z(x,t) T✭✭✭✭✭T✭∂✭z(x,t) ∂✭z(x,t) ∂ 2 z(x,t) 1 T T + − tr B(x)B(x) + ✭✭ B(x)B(x) ✭✭✭✭B(x)B(x) ✭✭✭✭ ✭✭✭ ∂ x z(x,t) ∂x 2 ∂ x2 2✭ z(x,t) ∂x ∂x ✭ ✭ The Hamilton-Jacobi-Bellman Equation (A.4.8) is therefore transformed into a linear PDE for the time evolution of the reward function z (i.e. the dependence on z is linear and the minimum operator, min{·}, could be eliminated) # " 2 z(x,t) ∂ z(x,t) ∂ z(x,t) ∂ 1 = q(x,t)z(x,t) − a(x)T − tr C(x)C(x)T ∂t ∂x 2 ∂ x2 z(x, T ) = e−g(x) = h(x) (A.4.11) The linearity of the Bellman Equation only holds if noise acts in the same sub-space as control, i.e. B(x) = C(x). In this space linear combinations of optimal value functions are solutions to an optimal control problem. The terminal functions, g(x), are restricted to be quadratic, which results in Gaussian reward functions, h(x), following exponentiation. Linear Bellman Combination The Optimal Policy for the control affine systems (A.4.1), under quadratic cost (A.4.3), which descends the value function along its steepest gradient for each configuration of the system, is given by [33] π (x,t) = −B(x)T ∂ v(x,t) ∂x (A.4.12) While the Superposition Principle means that for i = 1, . . . 
, n solutions zi (x,t) to Equation A.4.11 with boundary conditions hi (x), a new solution is z(x,t) with boundary function h(x) (which can be Further Work 148 transformed into original space for v(x,t) and g(x)) n n z(x,t) = ∑ wi zi (x,t) h(x) = ∑ wi hi (x) i=1 v(x,t) = −log  n ∑ wizi(x,t) i=1  i=1  n  g(x) = −log ∑ wi hi (x) (A.4.13) i=1 The Linear Bellman Combination applies the superposition principle while using Equation A.4.12, to derive a combined controller, π (x,t), that solves a new task by weighting i = 1, . . . , n existing control policies ui (x,t) = π i (x,t) ∂ π (x,t) = −B(x)T   − log ∑ni=1 wi zi (x,t) ∂x   n 1 ∂ zi (x,t) T = B(x) n ∑ wi ∂ x ∑i=1 wi zi (x,t) i=1   n 1 ∂ vi (x,t) T = B(x) n ∑ wi − zi(x,t) ∂ x ∑i=1 wi zi (x,t) i=1   n 1 T ∂ vi (x,t) = ∑ wizi(x,t) − B(x) ∂ x z(x,t) i=1 n wi zi (x,t) π i (x,t) i=1 z(x,t) =∑ n = ∑ αi (x,t)π i (x,t) (A.4.14) i=1 i zi (x,t) where αi (x,t) = wz(x,t) are the time-varying coefficients (principally computed using = ∑nwi zwi (x,t) i=1 i zi (x,t) the controller’s value function (approximated around samples from example trajectories)), n is the number of existing controllers. The fixed blend weights, wi , determine whether the designed controller is formed for an interpolated terminal cost (weights are set automatically given a new terminal cost function), or whether the task domain of control is being broadened for the same terminal cost (weights are set to create a terminal cost function with many minima). Appendix B Additional Results B.1 Robotic Arm (Unknown Dynamics) – Supplement Ch 5 The results for the robotic arm benchmark task performed with RLOC (see Algorithm 4) are reported here and supplement the more difficult benchmark cart-pole task results discussed in Section 5.3. The protocol for obtaining the following plots is identical to the one described in Chapter 5. The task parameters are specified in Tables 4.3, 4.4, 4.5, 5.1 are kept constant (except for where explicitly stated, see Table 5.2). The target state for learning to control robotic arm with unknown dynamics was selected to be x∗ = [ π2 , π2 , 0, 0]T , which is different from that in Chapter 4 in order to investigate RLOC performance on a reaching task to a different end point. The admissible state space range for init starting joint angles and angular velocities is θiinit ∈ [0, π ], θ̇θ i ∈ [−1.5π , 1.5π ], i = 1, 2. The results discussed in this section use a single deterministic learnt RLOC policy π (s) (selected at random from 500 trials for na = 5 controllers, random controller placement). The linearisation points placement of Chapter 5 was performed randomly, and the selected policy has linearisation centers ζ (i) of the  π T , uinit = 0 displayed in Table B.1. , low-level controllers located at xinit = [ θ , 0, 0] 2 0 0 2 Table B.1 Randomly Selected LQR Linearisation Centers for an RLOC Simulation. Benchmark Sym. Robotic Arm x0 (i) Value  1.57   1.57   1.57   1.57   1.57  3 0 0 1.7 0 0 1.15 0 0 0.23 0 0 0.52 0 0 SI Unit # " rad rad rad/s rad/s Figure B.1 shows the value function of the robotic arm planar reaching movement task obtained using a deterministic learnt policy utilising 5 low-level LQR controllers, labelled A–E and obtained via random linearisation points placement. The state-to-control space mapping is shown with letters superimposed onto each area of state space where the controller is activated by the high-level policy π (s) (ns = 36 number of states). 
Figure B.1 shows the value function of the robotic arm planar reaching movement task obtained using the deterministic learnt policy with 5 low-level LQR controllers, labelled A–E, obtained via random placement of linearisation points. The state-to-control space mapping is shown with letters superimposed onto each area of state space in which the corresponding controller is activated by the high-level policy π(s) (n_s = 36 states). The value function is evaluated for a total of 1 × 10^5 discretised states, spaced equally on the grid.

Fig. B.1 Value Function – Robotic Arm. An example robotic arm final policy is evaluated for n_a = 5 controllers (labelled A–E) and the cost-to-go function is displayed with the superimposed state-space-to-control mapping. The high-level policy learns the area of state space (i.e. the state) to which each controller should be assigned in order to solve the robotic arm inverse kinematics task. A smooth convex value function is observed, demonstrating that the RLOC policy is able to solve the task from any area of state space, with the most nonlinearity observed at the fully outstretched arm (elbow joint θ2 = 0).

A smooth, convex value function is observed, with the smallest cost incurred for controlling the arm from states near the target x* = [π/2, π/2, 0, 0]^T (least control distance and least non-linearity to solve). The cost of controlling the arm increases gradually as the start position moves further away from the target, both because the non-linear system has to be controlled for a longer duration (hence incurring more cost) and because more of the task non-linearities have to be handled. Mathematically, the planar robotic arm is described by the Lagrange equations of motion, whose non-linear terms depend only on the elbow angle (and not on the shoulder), which means that the arm is hardest to move with the elbow fully extended (i.e. θ2 = 0 rad); hence the area of state space around the (0, 0)^T region is the hardest to control, displaying the highest cost-to-go (red colour).

Figure B.2 depicts the action sequences generated by 'recording' from the learnt deterministic RLOC policy, π(s), during control of the robotic arm to the target from 100 equally spaced initial states, using 5 controllers. The linearisation centre of each of the 5 controllers is given in Table B.1. Each row within a sub-plot indicates an action (labelled A–E) which, if chosen, is marked with a black dot plotted against the action switch number on the horizontal axis. Each column within a sub-plot corresponds to the robotic arm moving into a new symbolic state, at which a new action (which may be the same as the current one) is chosen.

Fig. B.2 Action Sequences of an Example Learnt Policy – Robotic Arm. An example robotic arm final policy π(s) utilising n_a = 5 controllers is evaluated with the quadratic cost to produce characteristic activation patterns. Each starting state is evaluated for T = 3 s and the high-level action switches are marked with black dots plotted against the action switch number on the horizontal axis. The observed patterns of higher-level action organisation suggest that biological learning of complex motor control tasks may take the form of concatenating high-level controllers into complex action sequences.

The switching of controllers, as dictated by the high-level policy, is allowed once the continuous actuator state x_k (i.e. that of the optimal control space) enters a new discrete symbolic state s^(i), i = 1, …, n_s (i.e. that of the reinforcement learner), at which point either a new controller or the current controller is activated. Hence the number of switches made by the high-level policy varies for different starting states and is determined by the number of symbolic states encountered during a single start-to-target control run.
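The switching rule described above can be summarised with a short Python sketch. It assumes a learnt tabular high-level policy pi mapping symbolic states to controller indices, a function discretise() that maps the continuous actuator state to a symbolic state, per-controller LQR gains and regulation targets, and a one-step simulator step(); all of these names are placeholders rather than the actual RLOC code.

    import numpy as np

    def run_rloc_episode(x0, pi, gains, targets, discretise, step, T=3.0, dt=0.01):
        """Roll out a learnt policy: the controller may only change on symbolic state changes."""
        x, actions = np.array(x0, dtype=float), []
        s = discretise(x)                        # current symbolic (reinforcement learning) state
        a = pi[s]                                # controller chosen by the high-level policy
        actions.append(a)
        for _ in range(int(T / dt)):
            u = -gains[a] @ (x - targets[a])     # low-level LQR control of the active controller
            x = step(x, u, dt)                   # one step of the continuous actuator dynamics
            s_new = discretise(x)
            if s_new != s:                       # the actuator entered a new symbolic state,
                s = s_new                        # so the high-level policy may switch controller
                a = pi[s]
                actions.append(a)
        return x, actions                        # final state and the recorded action sequence

Recording the actions list for each of the 100 starting states yields the kind of activation patterns plotted in Figure B.2.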
The high-level action sequences appear as "activation patterns", which are re-used by the RLOC policy when solving the task from areas of actuator state space with similar cost-to-go values. Action "C" can be seen to be preferred when bringing the arm to the target; interestingly, this is not the controller with the shortest Euclidean distance to the target, showing that RLOC finds a globally near-optimal policy rather than relying on the LQR controller typically offered by control theory (i.e. one linearised at the target state). The results suggest that RLOC is able to re-use both the low-level LQR controllers and whole concatenated action sequences across adjacent regions of state space. These regions correlate with regions of the value function in Figure B.1 and suggest 'action grammar'-like rules [52].

Figure B.3 shows that the robotic arm was successfully controlled from all of the starting states (solid coloured circles) to the target. The shape of the control trajectories reflects the task dynamics and the demands of the cost function during learning. A higher ratio of the deviation-from-target penalty to the control penalty results in a flatter individual trajectory shape (i.e. the algorithm uses the admissible control, a torque of u_i ∈ [−10, 10] Nm on each joint, to bring the arm to the target as quickly as possible). The zoomed-in section shows that the target was reached precisely from all of the starting states. The control of the robotic arm was also tested for its response to unexpected perturbations¹, and the results (not reported here) showed the arm to be correctly controlled to the target state by the RLOC policy.

¹ If the actuator state went outside the admissible range of joint angles, it was treated as the admissible state nearest in Euclidean distance.

Fig. B.3 Example State Trajectories – Robotic Arm. Example state trajectories for the robotic arm obtained using the deterministic learnt policy π(s). All of the 100 starting states are successfully controlled to the target x* = [π/2, π/2, 0, 0]^T in a smooth manner. In contrast to the cart-pole system, the robotic arm performs planar reaching movements and therefore does not have to work against gravity; hence RLOC can be seen (cf. the zoomed-in box section) to bring both the shoulder and the elbow to a near stop, as expected.

The robotic arm planar reaching movement experiments were performed with the RLOC algorithm without a priori knowledge of the system dynamics. The results showed that RLOC is able to successfully control the arm to the target after a single learning run (random controller placement) from any area of the admissible state space. The available low-level controllers were re-used by RLOC's high-level policy, and entire concatenated action sequences were produced for regions of state space.

Appendix C Additional Information

C.1 Muscle Synergies

The Central Nervous System (CNS – brain and spinal cord) is able to organise a large number of musculoskeletal effectors (muscles, joints, limbs), which create many degrees of freedom by which an action can be achieved. It produces complex task performance under relatively few constraints, thus raising the question of motor redundancy (the Bernstein Problem). The smooth and efficient sequential combination of body parts (motor coordination) is achieved using proprioceptive information about the relative position of body parts and the strength of effort employed in a movement¹.

¹ Proprioceptive information is obtained with sensory feedback provided by (a) skeletal striated muscles (voluntarily controlled muscles) via muscle spindles – sensory receptors that detect changes in the length of the muscle, (b) tendons (Golgi tendon organ) and (c) fibrous capsules in joints.
A proposed neural strategy of the CNS for producing smooth motor control is to linearly combine distinct functional modules identified in the neural circuits of the spinal cord's grey matter [185], termed "muscle synergies"² [31]. This suggests a modular organisation of the spinal motor system, allowing it to produce a wide range of behaviours. Such synergies are typically identified as specific patterns of muscle activation using statistical analysis of filtered EMG (electromyography) data recorded from muscles during complex behaviour. A number of muscle synergies that best represent the original EMG are then found and related to task-relevant variables. A wide range of behaviour has been fitted using muscle synergies identified in this way. Another body of research identifies individual neurons within primary motor cortex (M1) that encode individual functional muscle synergies [115].

² A muscle synergy is a pattern of co-activation of muscles recruited by a single neural command signal [265, 266]. A 'synchronous synergy' activates all muscles in the synergy without a time delay, while a 'time-varying synergy' [63] allows for delays between muscle activations within a synergy. Note: one muscle can be part of multiple muscle synergies.

Muscle synergies are suggested to (a) be the means by which the CNS simplifies the control of multiple degrees of freedom: the CNS thus controls a much smaller number of variables, i.e. a number of synergies rather than all of the degrees of freedom individually [63, 258, 268]; (b) provide the CNS with a way to simplify complex optimal control problems by controlling only the task-relevant parameters [51, 261, 264]; (c) provide a hierarchical control strategy for the organisation of sensory feedback and motor control parameters, by offering a translation between task-level goals and execution-level commands [166, 167], with the control of end effectors simplified by muscle synergies that engage the most relevant muscle groups; (d) be the primitive solution to motor coordination by the spinal cord, considered to be an older neural system [75, 97, 112]. It is widely debated whether muscle synergies are the result of kinematic constraints imposed by 'flexion'/'extension' synergies on the muscles/joints, which allow the redundant degrees of freedom to be eliminated [267], or whether they are a neural strategy ("neural synergies") [6].

Evidence Against Muscle Synergies

(a) Synergies may reflect task/biomechanical constraints rather than a neural control strategy (the motor redundancy is greatly reduced once stability requirements, noise minimisation and other additional constraints placed on the CNS are taken into consideration) [149–151, 270]. (b) The Minimum Intervention Principle / Uncontrolled Manifold Hypothesis proposes that the CNS only corrects the variability that prevents the achievement of the task goal and does not impose penalties on task-irrelevant variability (the 'uncontrolled manifold') [218, 264, 270], i.e. it does not eliminate task-irrelevant redundant degrees of freedom (with proposed computational frameworks supporting both feedback [262] and feedforward [102] processes).
Studies in this line of research often identify structures identical to muscle synergies (here a 'synergy' is defined as an organisation of task-stabilising degrees of freedom), yet these are not essential for correct task performance [61, 145].
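As a brief illustration of the synergy-extraction procedure mentioned above (finding a small number of muscle activation patterns that best reconstruct filtered EMG), the following Python sketch applies non-negative matrix factorisation to a hypothetical EMG matrix. The data, the choice of three synergies and the variance-accounted-for criterion are illustrative assumptions, not results from any of the studies cited in this section.

    import numpy as np
    from sklearn.decomposition import NMF

    # Hypothetical rectified, filtered EMG: rows are time samples, columns are muscles.
    rng = np.random.default_rng(0)
    emg = np.abs(rng.standard_normal((1000, 8)))       # 1000 samples from 8 muscles (toy data)

    # Factorise EMG ~ activations @ synergies with k = 3 synergies.
    model = NMF(n_components=3, init="nndsvda", max_iter=500)
    activations = model.fit_transform(emg)             # time-varying synergy activation coefficients
    synergies = model.components_                      # each row is one muscle co-activation pattern

    # Variance accounted for (VAF), a common criterion for choosing the number of synergies.
    reconstruction = activations @ synergies
    vaf = 1.0 - np.sum((emg - reconstruction) ** 2) / np.sum(emg ** 2)
    print(f"VAF with 3 synergies: {vaf:.3f}")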