Abstract
This chapter discusses mixed density reinforcement learning (RL)-based approximate optimal control methods applied to deterministic systems. Such methods typically require a persistence of excitation (PE) condition for convergence. This chapter discusses data-based methods that soften the stringent PE condition by learning via simulation-based extrapolation. The development is based on the observation that, given a model of the system, RL can be implemented by evaluating the Bellman error (BE) at any number of desired points in the state space, thus virtually simulating the system. The sections discuss necessary and sufficient conditions for optimality, regional model-based RL, local state following (StaF) RL, the combination of regional and local model-based RL, and RL with sparse BE extrapolation. Notes on stability are included in each method’s respective section.
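As a rough illustration of the BE-extrapolation idea, the sketch below evaluates the BE at user-selected state-space points and updates a critic weight estimate from those virtual experiences alone. This is a minimal sketch, not the chapter's development: the dynamics, basis, gains, and extrapolation points are all hypothetical choices made for illustration.

```python
# A minimal sketch of model-based Bellman error (BE) extrapolation.
# The dynamics, basis, gains, and extrapolation points below are
# hypothetical choices for illustration, not the chapter's examples.
import numpy as np

# Control-affine model x_dot = f(x) + g(x) u with quadratic cost weights.
f = lambda x: np.array([-x[0] + x[1], -0.5 * x[1]])
g = lambda x: np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

# Jacobian of the polynomial value-function basis sigma(x) = [x1^2, x1*x2, x2^2].
grad_sigma = lambda x: np.array([[2 * x[0], 0.0],
                                 [x[1], x[0]],
                                 [0.0, 2 * x[1]]])

def bellman_error(W, x):
    """Evaluate the BE and its regressor at an arbitrary point x via the model."""
    dsig = grad_sigma(x)                                  # Jacobian of the basis
    u = -0.5 * np.linalg.solve(R, g(x).T @ (dsig.T @ W))  # approximate optimal input
    omega = dsig @ (f(x) + g(x) @ u)                      # BE regressor
    delta = W @ omega + x @ Q @ x + u @ R @ u             # Bellman error
    return delta, omega

# Because the model is known, the BE can be evaluated at any user-selected
# points: the trajectory never has to visit them, which is what relaxes PE.
points = [np.array(p, dtype=float) for p in [(1, 0), (0, 1), (1, 1), (-1, 0.5)]]
W = np.zeros(3)                  # critic weight estimate
k_c, nu, step = 1.0, 0.1, 1e-3   # learning gain, normalization, update step
for _ in range(5000):
    update = np.zeros_like(W)
    for x in points:
        delta, omega = bellman_error(W, x)
        rho = 1 + nu * (omega @ omega)           # normalization
        update -= k_c * delta * omega / rho ** 2
    W += step * update / len(points)
print("critic weights:", W)
```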
Notes
- 1.
The notation \(\nabla _{x}h\left( x,y,t\right) \) denotes the partial derivative of the generic function \(h\left( x,y,t\right) \) with respect to the generic variable x. The notation \(h^{\prime }\left( x,y\right) \) denotes the gradient with respect to the first argument of the generic function \(h\left( \cdot ,\cdot \right) \), e.g., \(h'\left( x,y\right) =\nabla _{x}h\left( x,y\right) \).
- 2.
For notational brevity, unless otherwise specified, the domain of all the functions is assumed to be \(\mathbb {R}_{\ge 0}\), where \(\mathbb {R}_{\ge a}\) denotes the interval \(\left[ a,\infty \right) \). The notation \(\left\| \cdot \right\| \) denotes the Euclidean norm for vectors and the Frobenius norm for matrices.
- 3.
The notation \(I_{n}\) denotes the \(n\times n\) identity matrix.
- 4.
The notation G, \(G_{\sigma }\), and \(G_{\varepsilon }\) is defined as \(G=G\left( x\right) \triangleq g\left( x\right) R^{-1}g^{T}\left( x\right) \), \(G_{\sigma }=G_{\sigma }\left( x\right) \triangleq \sigma ^{\prime }\left( x\right) G\left( x\right) \sigma ^{\prime }\left( x\right) ^{T}\), and \(G_{\varepsilon }=G_{\varepsilon }\left( x\right) \triangleq \varepsilon ^{\prime }\left( x\right) G\left( x\right) \varepsilon ^{\prime }\left( x\right) ^{T}\), respectively.
- 5.
- 6.
The Lipschitz property is exploited here for clarity of exposition. The bound in (5.38) can be easily generalized to \(\left\| Y\left( x\right) \right\| \le L_{Y}\left( \left\| x\right\| \right) \left\| x\right\| \), where \(L_{Y}:\mathbb {R}\rightarrow \mathbb {R}\) is a positive, non-decreasing function.
- 7.
The notation \(\binom{a}{b}\) denotes the combinatorial operation “a choose b”.
- 8.
Similar to NN-based approximation methods such as [1,2,3,4,5,6,7,8], the function approximation error, \(\varepsilon \), is unknown and, in general, infeasible to compute for a given function, since the ideal NN weights are unknown. Since a bound on \(\varepsilon \) is unavailable, the gain conditions in (5.57)–(5.59) cannot be formally verified. However, they can be met by trial and error: increasing the gain \(k_{a2}\), increasing the number of StaF basis functions, and increasing \(\underline{c}\) by selecting more points at which to extrapolate the BE (see the sketch following these notes).
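To make the trial-and-error tuning in Note 8 concrete, the sketch below continues the one after the abstract (reusing `bellman_error`, `W`, and `points`) and estimates an excitation-like level over the extrapolation points. This is an assumption-laden stand-in: the precise quantity bounded by \(\underline{c}\) in (5.57)–(5.59) is defined in the chapter, not here.

```python
# A rough, assumption-laden check of the excitation-like condition that
# replaces PE; reuses bellman_error, W, and points from the earlier sketch.
def excitation_level(W, points, nu=0.1):
    # Smallest eigenvalue of the averaged, normalized regressor outer
    # products; near-zero means the chosen points are not rich enough.
    M = np.zeros((len(W), len(W)))
    for x in points:
        _, omega = bellman_error(W, x)
        rho = 1 + nu * (omega @ omega)
        M += np.outer(omega, omega) / rho ** 2
    return np.linalg.eigvalsh(M / len(points))[0]

# Note 8's trial-and-error tuning: if the level is too small, add or
# re-spread extrapolation points before resorting to larger gains.
print("excitation level:", excitation_level(W, points))
```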
References
Doya, K.: Reinforcement learning in continuous time and space. Neural Comput. 12(1), 219–245 (2000)
Padhi, R., Unnikrishnan, N., Wang, X., Balakrishnan, S.: A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural Netw. 19(10), 1648–1660 (2006)
Al-Tamimi, A., Lewis, F.L., Abu-Khalaf, M.: Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Trans. Syst. Man Cybern. Part B Cybern. 38, 943–949 (2008)
Lewis, F.L., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)
Dierks, T., Thumati, B., Jagannathan, S.: Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence. Neural Netw. 22(5–6), 851–860 (2009)
Mehta, P., Meyn, S.: Q-learning and Pontryagin’s minimum principle. In: Proceedings of the IEEE Conference on Decision and Control, pp. 3598–3605 (2009)
Vamvoudakis, K.G., Lewis, F.L.: Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46(5), 878–888 (2010)
Zhang, H., Cui, L., Zhang, X., Luo, Y.: Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Trans. Neural Netw. 22(12), 2226–2236 (2011)
Bhasin, S., Kamalapurkar, R., Johnson, M., Vamvoudakis, K.G., Lewis, F.L., Dixon, W.E.: A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 49(1), 89–92 (2013)
Zhang, H., Cui, L., Luo, Y.: Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP. IEEE Trans. Cybern. 43(1), 206–216 (2013)
Zhang, H., Liu, D., Luo, Y., Wang, D.: Adaptive Dynamic Programming for Control: Algorithms and Stability. Communications and Control Engineering. Springer, London (2013)
Kaelbling, L., Littman, M., Moore, A.: Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285 (1996)
Vrabie, D.: Online adaptive optimal control for continuous-time systems, Ph.D. dissertation, University of Texas at Arlington (2010)
Vamvoudakis, K.G., Vrabie, D., Lewis, F.L.: Online adaptive algorithm for optimal control with integral reinforcement learning. Int. J. Robust Nonlinear Control 24(17), 2686–2710 (2014)
Kamalapurkar, R., Walters, P., Dixon, W.E.: Model-based reinforcement learning for approximate optimal regulation. Automatica 64, 94–104 (2016)
He, P., Jagannathan, S.: Reinforcement learning neural-network-based controller for nonlinear discrete-time systems with input constraints. IEEE Trans. Syst. Man Cybern. Part B Cybern. 37(2), 425–436 (2007)
Zhang, H., Wei, Q., Luo, Y.: A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans. Syst. Man Cybern. Part B Cybern. 38(4), 937–942 (2008)
Kamalapurkar, R., Rosenfeld, J., Dixon, W.E.: Efficient model-based reinforcement learning for approximate online optimal control. Automatica 74, 247–258 (2016)
Al-Tamimi, A., Lewis, F.L., Abu-Khalaf, M.: Model-free Q-learning designs for linear discrete-time zero-sum games with application to \(H_{\infty }\) control. Automatica 43, 473–481 (2007)
Vamvoudakis, K.G., Lewis, F.L.: Multi-player non-zero-sum games: Online adaptive learning solution of coupled Hamilton–Jacobi equations. Automatica 47, 1556–1569 (2011)
Vamvoudakis, K.G., Lewis, F.L., Hudas, G.R.: Multi-agent differential graphical games: Online adaptive learning solution for synchronization with optimality. Automatica 48(8), 1598–1611 (2012). http://www.sciencedirect.com/science/article/pii/S0005109812002476
Modares, H., Lewis, F.L., Naghibi-Sistani, M.-B.: Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks. IEEE Trans. Neural Netw. Learn. Syst. 24(10), 1513–1525 (2013)
Kiumarsi, B., Lewis, F.L., Modares, H., Karimpour, A., Naghibi-Sistani, M.-B.: Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica 50(4), 1167–1175 (2014)
Modares, H., Lewis, F.L., Naghibi-Sistani, M.-B.: Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems. Automatica 50(1), 193–202 (2014)
Modares, H., Lewis, F.L.: Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning. Automatica 50(7), 1780–1792 (2014)
Kamalapurkar, R., Walters, P.S., Rosenfeld, J.A., Dixon, W.E.: Reinforcement Learning for Optimal Feedback Control: A Lyapunov-Based Approach. Springer, Berlin (2018)
Singh, S.P.: Reinforcement learning with a hierarchy of abstract models. AAAI Natl. Conf. Artif. Intell. 92, 202–207 (1992)
Atkeson, C.G., Schaal, S.: Robot learning from demonstration. Int. Conf. Mach. Learn. 97, 12–20 (1997)
Abbeel, P., Quigley, M., Ng, A.Y.: Using inaccurate models in reinforcement learning. In: International Conference on Machine Learning, pp. 1–8. ACM, New York (2006)
Deisenroth, M.P.: Efficient Reinforcement Learning Using Gaussian Processes. KIT Scientific Publishing (2010)
Mitrovic, D., Klanke, S., Vijayakumar, S.: Adaptive optimal feedback control with learned internal dynamics models. In: Sigaud, O., Peters, J. (eds.) From Motor Learning to Interaction Learning in Robots. Studies in Computational Intelligence, vol. 264, pp. 65–84. Springer, Berlin (2010)
Deisenroth, M.P., Rasmussen, C.E.: PILCO: a model-based and data-efficient approach to policy search. In: International Conference on Machine Learning, pp. 465–472 (2011)
Liberzon, D.: Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press, Princeton (2012)
Kirk, D.: Optimal Control Theory: An Introduction. Dover, Mineola (2004)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
Konda, V., Tsitsiklis, J.: On actor-critic algorithms. SIAM J. Control Optim. 42(4), 1143–1166 (2004)
Dierks, T., Jagannathan, S.: Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics. In: Proceedings of the IEEE Conference on Decision and Control, Shanghai, China, pp. 6750–6755 (2009)
Vamvoudakis, K.G., Lewis, F.L.: Online synchronous policy iteration method for optimal control. In: Yu, W. (ed.) Recent Advances in Intelligent Control Systems, pp. 357–374. Springer, London (2009)
Dierks, T., Jagannathan, S.: Optimal control of affine nonlinear continuous-time systems. In: Proceedings of the American Control Conference, 2010, pp. 1568–1573 (2010)
Khalil, H.K.: Nonlinear Systems, 3rd edn. Prentice Hall, Upper Saddle River (2002)
Chowdhary, G.: Concurrent learning for convergence in adaptive control without persistency of excitation, Ph.D. dissertation, Georgia Institute of Technology (2010)
Chowdhary, G., Johnson, E.: A singular value maximizing data recording algorithm for concurrent learning. In: Proceedings of the American Control Conference, 2011, pp. 3547–3552 (2011)
Chowdhary, G., Yucelen, T., Mühlegg, M., Johnson, E.N.: Concurrent learning adaptive control of linear systems with exponentially convergent bounds. Int. J. Adapt. Control Signal Process. 27(4), 280–301 (2013)
Kamalapurkar, R., Andrews, L., Walters, P., Dixon, W.E.: Model-based reinforcement learning for infinite-horizon approximate optimal tracking. IEEE Trans. Neural Netw. Learn. Syst. 28(3), 753–758 (2017)
Kamalapurkar, R., Klotz, J., Dixon, W.E.: Concurrent learning-based online approximate feedback Nash equilibrium solution of N-player nonzero-sum differential games. IEEE/CAA J. Autom. Sin. 1(3), 239–247 (2014)
Luo, B., Wu, H.-N., Huang, T., Liu, D.: Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design. Automatica (2014)
Yang, X., Liu, D., Wei, Q.: Online approximate optimal control for affine non-linear systems with unknown internal dynamics using adaptive dynamic programming. IET Control Theory Appl. 8(16), 1676–1688 (2014)
Rosenfeld, J.A., Kamalapurkar, R., Dixon, W.E.: The state following (StaF) approximation method. IEEE Trans. Neural Netw. Learn. Syst. 30(6), 1716–1730 (2019)
Lorentz, G.G.: Bernstein Polynomials, 2nd edn. Chelsea Publishing Co., New York (1986)
Rosenfeld, J.A., Kamalapurkar, R., Dixon, W.E.: State following (StaF) kernel functions for function approximation Part I: theory and motivation. In: Proceedings of the American Control Conference, 2015, pp. 1217–1222 (2015)
Deptula, P., Rosenfeld, J., Kamalapurkar, R., Dixon, W.E.: Approximate dynamic programming: combining regional and local state following approximations. IEEE Trans. Neural Netw. Learn. Syst. 29(6), 2154–2166 (2018)
Walters, P.S.: Guidance and control of marine craft: an adaptive dynamic programming approach, Ph.D. dissertation, University of Florida (2015)
Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceeding of the International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323 (2011)
Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. In: Proceedings of the Advances in Neural Information Processing Systems, 2007, pp. 801–808 (2007)
Nivison, S.A., Khargonekar, P.: Improving long-term learning of model reference adaptive controllers for flight applications: a sparse neural network approach. In: Proceedings of the AIAA Guidance, Navigation and Control Conference, Jan. 2017 (2017)
Nivison, S.A., Khargonekar, P.P.: Development of a robust deep recurrent neural network controller for flight applications. In: Proceedings of the American Control Conference, IEEE, 2017, pp. 5336–5342 (2017)
Boureau, Y.-L., LeCun, Y., Ranzato, M.: Sparse feature learning for deep belief networks. In: Proceedings of the Advances in Neural Information Processing Systems, 2008, pp. 1185–1192 (2008)
Nivison, S.A., Khargonekar, P.P.: Development of a robust, sparsely-activated, and deep recurrent neural network controller for flight applications. In: Proceedings of the IEEE Conference on Decision and Control, pp. 384–390. IEEE (2018)
Nivison, S.A., Khargonekar, P.: A sparse neural network approach to model reference adaptive control with hypersonic flight applications. In: Proceedings of the AIAA Guidance, Navigation and Control Conference, 2018, p. 0842 (2018)
Greene, M.L., Deptula, P., Nivison, S., Dixon, W.E.: Sparse learning-based approximate dynamic programming with barrier constraints. IEEE Control Syst. Lett. 4(3), 743–748 (2020)
Nivison, S.A.: Sparse and deep learning-based nonlinear control design with hypersonic flight applications, Ph.D. dissertation, University of Florida (2017)
Walters, P., Kamalapurkar, R., Voight, F., Schwartz, E., Dixon, W.E.: Online approximate optimal station keeping of a marine craft in the presence of an irrotational current. IEEE Trans. Robot. 34(2), 486–496 (2018)
Fan, Q.-Y., Yang, G.-H.: Active complementary control for affine nonlinear control systems with actuator faults. IEEE Trans. Cybern. 47(11), 3542–3553 (2016)
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Greene, M.L., Deptula, P., Kamalapurkar, R., Dixon, W.E. (2021). Mixed Density Methods for Approximate Dynamic Programming. In: Vamvoudakis, K.G., Wan, Y., Lewis, F.L., Cansever, D. (eds) Handbook of Reinforcement Learning and Control. Studies in Systems, Decision and Control, vol 325. Springer, Cham. https://doi.org/10.1007/978-3-030-60990-0_5
DOI: https://doi.org/10.1007/978-3-030-60990-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60989-4
Online ISBN: 978-3-030-60990-0
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)