Abstract
Reinforcement learning algorithms describe how an agent can learn an optimal action policy in a sequential decision process, through repeated experience. In a given environment, the agent's policy yields running and terminal rewards. As in online learning, the agent learns sequentially. As in multi-armed bandit problems, when the agent picks an action, it cannot infer ex post the rewards that other actions would have induced. In reinforcement learning, however, actions have consequences: they influence not only immediate rewards but also future states of the world. The goal of reinforcement learning is to find an optimal policy, a mapping from the states of the world to the set of actions that maximizes cumulative reward, which is inherently a long-term objective. Exploring may be sub-optimal over a short horizon but can lead to better long-term outcomes. Many optimal control problems, popular in economics for more than forty years, can be expressed in the reinforcement learning framework, and recent advances in computational science, driven in particular by deep learning algorithms, can help economists solve complex behavioral problems. In this article, we survey state-of-the-art reinforcement learning techniques and present applications in economics, game theory, operations research and finance.
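The abstract contrasts bandit-style exploration with full reinforcement learning, where actions also move the state of the world. As a concrete illustration only, not code from the article, the following sketch implements tabular Q-learning with epsilon-greedy exploration on a hypothetical five-state chain; the environment, reward structure and hyper-parameters are assumptions made for the example.

```python
import numpy as np

# Minimal tabular Q-learning on a toy "chain" environment. The environment,
# reward structure and hyper-parameters below are illustrative assumptions,
# not taken from the article.

n_states, n_actions = 5, 2            # states 0..4; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))   # action-value estimates Q(s, a)
alpha, gamma, eps = 0.1, 0.95, 0.3    # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def step(s, a):
    """Move left or right on the chain; reaching the last state pays 1 and ends the episode."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = (s_next == n_states - 1)
    return s_next, (1.0 if done else 0.0), done

for episode in range(500):
    s = 0
    for _ in range(10_000):           # cap episode length as a safeguard
        # epsilon-greedy: explore with probability eps, otherwise act greedily
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # Q-learning update: move Q(s, a) toward the bootstrapped target
        target = r + gamma * (0.0 if done else Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break

# Greedy policy in non-terminal states: should be all 1s (always move right).
print(Q.argmax(axis=1)[:-1])
```

The exploration branch sacrifices immediate reward to discover that moving right eventually pays, which is precisely the short-term versus long-term trade-off described above.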
Funding
Arthur Charpentier acknowledges the financial support of the AXA Research Fund through the joint research initiative Use and value of unusual data in actuarial science, as well as NSERC grant 2019-07077.
Cite this article
Charpentier, A., Élie, R. & Remlinger, C. Reinforcement Learning in Economics and Finance. Comput Econ 62, 425–462 (2023). https://doi.org/10.1007/s10614-021-10119-4
Keywords
- Causality
- Control
- Machine learning
- Markov decision process
- Multi-armed bandits
- Online learning
- Q-learning
- Regret
- Reinforcement learning
- Rewards
- Sequential learning