Abstract
There has been little research into multiobjective reinforcement learning (MORL) algorithms using stochastic or non-stationary policies, even though such policies may Pareto-dominate deterministic stationary policies. One approach is steering, which forms a non-stationary combination of deterministic stationary base policies. This paper presents two new steering algorithms designed for the task of learning Pareto-optimal policies. The first algorithm (w-steering) is a direct adaptation of previous approaches to steering, and therefore requires prior knowledge of recurrent states which are guaranteed to be revisited. The second algorithm (Q-steering) eliminates this requirement. Empirical results show that both algorithms perform well when given knowledge of recurrent states, but that Q-steering provides substantial performance improvements over w-steering when this knowledge is not available.
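To make the steering idea concrete, the sketch below illustrates a non-stationary agent that follows deterministic base policies and, at designated recurrent states, switches to whichever base policy best drives its running average vector return toward a chosen target point in objective space. This is not the paper's implementation: the environment interface, the base-policy representation, and the simple look-ahead switching rule are all illustrative assumptions, and the paper's w-steering and Q-steering algorithms refine how and when the switch is made.

```python
# Minimal sketch (assumed, not the paper's code) of steering:
# a non-stationary combination of deterministic stationary base policies,
# switching only at known recurrent states.

import numpy as np


def choose_base_policy(avg_return, policy_returns, target):
    """Pick the base policy whose expected vector return best moves the
    running average return toward the target point (simplified rule)."""
    best_i, best_dist = 0, float("inf")
    for i, r in enumerate(policy_returns):
        # Hypothetical one-step look-ahead: blend the running average
        # slightly toward this policy's expected vector return.
        candidate = 0.9 * avg_return + 0.1 * np.asarray(r)
        dist = np.linalg.norm(candidate - np.asarray(target))
        if dist < best_dist:
            best_i, best_dist = i, dist
    return best_i


def steer(env, base_policies, policy_returns, recurrent_states,
          target, n_steps=10_000):
    """Run a non-stationary combination of deterministic base policies.
    env.reset()/env.step() returning (state, reward_vector, done) is an
    assumed interface; base_policies are callables state -> action."""
    avg_return = np.zeros(len(target))
    state = env.reset()
    active = choose_base_policy(avg_return, policy_returns, target)
    for t in range(1, n_steps + 1):
        action = base_policies[active](state)        # deterministic base policy
        state, reward_vec, done = env.step(action)   # vector-valued reward
        avg_return += (np.asarray(reward_vec) - avg_return) / t
        if state in recurrent_states or done:
            # Switching only at recurrent states (or episode ends) is the
            # prior knowledge that w-steering relies on.
            active = choose_base_policy(avg_return, policy_returns, target)
        if done:
            state = env.reset()
    return avg_return
```

The restriction that policy switches occur only at recurrent states is precisely the prior knowledge that w-steering assumes and that Q-steering is designed to remove.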
Notes
- 1. MORL methods based on linear scalarisation are limited to discovering policies which lie on the convex hull of the Pareto front [17]. However, in the context of learning base policies this is actually advantageous, as policies lying inside the convex hull will not form part of any Pareto-optimal combination of base policies [15]. A minimal sketch of learning a base policy by linear scalarisation follows these notes.
- 2. The other alternative is to treat no states as members of \(S_R\), but this would mean the agent would never switch base policies, thereby losing the benefits of non-stationarity.
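As note 1 observes, base policies can be learned with a linearly scalarised single-objective method. The sketch below shows one hedged way to do this with tabular Q-learning; the environment API (env.reset, env.step returning a vector reward, env.n_actions) is an assumption for illustration, not an interface from the paper.

```python
# Minimal sketch (assumed) of learning one deterministic stationary base
# policy via linear scalarisation: the vector reward is collapsed to a
# scalar with a weight vector w, and ordinary tabular Q-learning is run
# on that scalar signal.

import numpy as np
from collections import defaultdict


def scalarised_q_learning(env, w, n_episodes=500, alpha=0.1,
                          gamma=0.99, epsilon=0.1):
    w = np.asarray(w, dtype=float)
    Q = defaultdict(lambda: np.zeros(env.n_actions))   # hypothetical env API
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy exploration over the scalarised Q-values.
            if np.random.rand() < epsilon:
                action = np.random.randint(env.n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward_vec, done = env.step(action)
            r = float(w @ np.asarray(reward_vec))       # linear scalarisation
            target = r + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    # Deterministic stationary base policy: greedy w.r.t. the learned Q.
    return lambda s: int(np.argmax(Q[s]))
```

Sweeping the weight vector w over different trade-offs yields a set of convex-hull base policies, which is sufficient here because only such policies can contribute to a Pareto-optimal steering combination.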
References
Brys, T., Van Moffaert, K., Van Vaerenbergh, K., Nowé, A.: On the behaviour of scalarization methods for the engagement of a wet clutch. In: The 12th International Conference on Machine Learning and Applications. IEEE (2013)
Castelletti, A., Corani, G., Rizzolli, A., Soncini-Sessa, R., Weber, E.: Reinforcement learning in the operational management of a water system. In: IFAC Workshop on Modeling and Control in Environmental Issues, pp. 325–330 (2002)
Chatterjee, K., Majumdar, R., Henzinger, T.A.: Markov decision processes with multiple objectives. In: Durand, B., Thomas, W. (eds.) STACS 2006. LNCS, vol. 3884, pp. 325–336. Springer, Heidelberg (2006)
Handa, H.: Solving multi-objective reinforcement learning problems by EDA-RL - acquisition of various strategies. In: Proceedings of the Ninth International Conference on Intelligent Systems Design and Applications, pp. 426–431 (2009)
Kalyanakrishnan, S., Stone, P.: An empirical analysis of value function-based and policy search reinforcement learning. In: Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems, vol. 2, pp. 749–756. International Foundation for Autonomous Agents and Multiagent Systems (2009)
Lizotte, D.J., Bowling, M., Murphy, S.A.: Efficient reinforcement learning with multiple reward functions for randomized clinical trial analysis. In: 27th International Conference on Machine Learning, pp. 695–702 (2010)
Mannor, S., Shimkin, N.: The steering approach for multi-criteria reinforcement learning. In: Neural Information Processing Systems, pp. 1563–1570 (2001)
Mannor, S., Shimkin, N.: A geometric approach to multi-criterion reinforcement learning. J. Mach. Learn. Res. 5, 325–360 (2004)
Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., Restelli, M.: Policy gradient approaches for multi-objective sequential decision making. In: 2014 International Joint Conference on Neural Networks (IJCNN), pp. 2323–2330. IEEE (2014)
Roijers, D.M., Whiteson, S., Oliehoek, F.A.: Computing convex coverage sets for multi-objective coordination graphs. In: Perny, P., Pirlot, M., Tsoukiàs, A. (eds.) ADT 2013. LNCS, vol. 8176, pp. 309–323. Springer, Heidelberg (2013)
Roijers, D., Vamplew, P., Whiteson, S., Dazeley, R.: A survey of multi-objective sequential decision-making. J. Artif. Intell. Res. 48, 67–113 (2013)
Shelton, C.: Importance sampling for reinforcement learning with multiple objectives. AI Technical report 2001–003, MIT, August 2001
Soh, H., Demiris, Y.: Evolving policies for multi-reward partially observable Markov decision processes (MR-POMDPs). In: Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO 2011, pp. 713–720 (2011)
Taylor, M.E., Whiteson, S., Stone, P.: Temporal difference and policy search methods for reinforcement learning: an empirical comparison. In: Proceedings of the National Conference on Artificial Intelligence, vol. 22, p. 1675 (2007)
Vamplew, P., Dazeley, R., Barker, E., Kelarev, A.: Constructing stochastic mixture policies for episodic multiobjective reinforcement learning tasks. In: Nicholson, A., Li, X. (eds.) AI 2009. LNCS, vol. 5866, pp. 340–349. Springer, Heidelberg (2009)
Vamplew, P., Dazeley, R., Berry, A., Dekker, E., Issabekov, R.: Empirical evaluation methods for multiobjective reinforcement learning algorithms. Mach. Learn. 84(1–2), 51–80 (2011)
Vamplew, P., Yearwood, J., Dazeley, R., Berry, A.: On the limitations of scalarisation for multi-objective reinforcement learning of Pareto fronts. In: Wobcke, W., Zhang, M. (eds.) AI 2008. LNCS (LNAI), vol. 5360, pp. 372–378. Springer, Heidelberg (2008)
Van Moffaert, K., Nowé, A.: Multi-objective reinforcement learning using sets of Pareto dominating policies. J. Mach. Learn. Res. 15, 3483–3512 (2014)
Watkins, C.J.C.H.: Learning from delayed rewards. Ph.D. thesis, University of Cambridge (1989)
Whiteson, S., Taylor, M.E., Stone, P.: Critical factors in the empirical performance of temporal difference and evolutionary methods for reinforcement learning. Auton. Agent. Multi-Agent Syst. 21(1), 1–35 (2010)
Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C.M., da Fonseca, V.G.: Performance assessment of multiobjective optimisers: an analysis and review. IEEE Trans. Evol. Comput. 7(2), 117–132 (2003)