
DOI: 10.1145/3219819.3219846

Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application

Published: 19 July 2018

Abstract

In e-commerce platforms such as Amazon and TaoBao, ranking items within a search session is a typical multi-step decision-making problem. Learning to rank (LTR) methods have been widely applied to such ranking problems, but they usually treat the ranking steps within a session as independent, even though those steps can be highly correlated with one another. To better exploit this correlation, we propose to use reinforcement learning (RL) to learn an optimal ranking policy that maximizes the expected accumulative rewards in a search session. First, we formally define the search session Markov decision process (SSMDP) to formulate the multi-step ranking problem. Second, we analyze the properties of the SSMDP and theoretically prove the necessity of maximizing accumulative rewards. Finally, we propose a novel policy gradient algorithm for learning an optimal ranking policy, which is able to deal with the high reward variance and unbalanced reward distribution of an SSMDP. Experiments are conducted both in simulation and in the TaoBao search engine. The results demonstrate that our algorithm significantly outperforms state-of-the-art LTR methods, with more than 40% and 30% growth in total transaction amount in the simulation and the real application, respectively.
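To make the general idea concrete, the following is a minimal, illustrative REINFORCE-style sketch of policy-gradient learning to rank over a multi-step session, where each step's parameter update is weighted by the accumulated reward from that step onward. It is not the authors' SSMDP algorithm: the toy environment, the linear reward signal, and every name below are hypothetical assumptions for illustration only.

```python
# Illustrative sketch only: a minimal REINFORCE-style policy-gradient loop for
# a multi-step ranking policy that maximizes accumulated session reward.
# This is NOT the paper's SSMDP algorithm; the toy environment, the reward
# signal, and all names below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 8   # item feature dimension (assumed)
LR = 0.01        # learning rate (assumed)

# Hidden user preference that drives the toy reward (unknown to the learner).
TRUE_PREF = rng.normal(size=N_FEATURES)

# Linear scoring weights of the softmax ranking policy.
theta = np.zeros(N_FEATURES)

def softmax(scores):
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

def run_session(items, max_steps=5):
    """Roll out one search session; at each step the policy places one item.

    Returns a list of (grad_log_pi, reward) pairs, one per ranking step.
    """
    trajectory = []
    pool = list(range(len(items)))
    for _ in range(min(max_steps, len(pool))):
        feats = items[pool]                    # remaining candidate items
        probs = softmax(feats @ theta)         # softmax over linear scores
        idx = rng.choice(len(pool), p=probs)
        # Gradient of log pi for a linear softmax policy:
        #   d/d(theta) log pi(idx) = x_idx - sum_j pi(j) * x_j
        grad_log_pi = feats[idx] - probs @ feats
        # Toy per-step reward: purchase propensity of the chosen item + noise.
        reward = float(feats[idx] @ TRUE_PREF + 0.1 * rng.normal())
        trajectory.append((grad_log_pi, reward))
        pool.pop(idx)                          # each item is ranked once
    return trajectory

for episode in range(2000):
    items = rng.normal(size=(20, N_FEATURES))  # candidate items for a query
    traj = run_session(items)
    rewards = [r for _, r in traj]
    for t, (grad_log_pi, _) in enumerate(traj):
        # Undiscounted return-to-go from step t: the accumulative reward
        # that the session-level objective maximizes.
        G_t = sum(rewards[t:])
        theta += LR * G_t * grad_log_pi
```

Note that a plain update like this suffers from exactly the issues the abstract names, high reward variance and an unbalanced reward distribution across sessions, which the paper's proposed algorithm is designed to handle.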




Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN:9781450355520
DOI:10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery, New York, NY, United States



Author Tags

  1. online learning to rank
  2. policy gradient
  3. reinforcement learning

Qualifiers

  • Research-article

Funding Sources

  • Jiangsu SF

Conference

KDD '18

Acceptance Rates

KDD '18 paper acceptance rate: 107 of 983 submissions (11%).
Overall acceptance rate: 1,133 of 8,635 submissions (13%).


Article Metrics

  • Downloads (last 12 months): 127
  • Downloads (last 6 weeks): 8

Reflects downloads up to 24 Sep 2024.

Cited By
  • (2024) The Power of Linear Programming in Sponsored Listings Ranking: Evidence from Field Experiments. SSRN Electronic Journal. DOI: 10.2139/ssrn.4767661. Online publication date: 2024.
  • (2024) A social image recommendation system based on deep reinforcement learning. PLOS ONE 19:4 (e0300059). DOI: 10.1371/journal.pone.0300059. Online publication date: 4-Apr-2024.
  • (2024) Maximum-Entropy Regularized Decision Transformer with Reward Relabelling for Dynamic Recommendation. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 376-384. DOI: 10.1145/3637528.3671750. Online publication date: 25-Aug-2024.
  • (2024) Teach and Explore: A Multiplex Information-guided Effective and Efficient Reinforcement Learning for Sequential Recommendation. ACM Transactions on Information Systems 42:5, 1-26. DOI: 10.1145/3630003. Online publication date: 29-Apr-2024.
  • (2024) Reinforcement Learning-based Recommender Systems with Large Language Models for State Reward and Action Modeling. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 375-385. DOI: 10.1145/3626772.3657767. Online publication date: 10-Jul-2024.
  • (2024) Reinforcing Long-Term Performance in Recommender Systems with User-Oriented Exploration Policy. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1850-1860. DOI: 10.1145/3626772.3657714. Online publication date: 10-Jul-2024.
  • (2024) On the Effectiveness of Unlearning in Session-Based Recommendation. Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 855-863. DOI: 10.1145/3616855.3635823. Online publication date: 4-Mar-2024.
  • (2024) HGCR: A Heterogeneous Graph-Enhanced Interactive Course Recommendation Scheme for Online Learning. IEEE Transactions on Learning Technologies 17, 364-374. DOI: 10.1109/TLT.2023.3314399. Online publication date: 2024.
  • (2024) A Review of Explainable Recommender Systems Utilizing Knowledge Graphs and Reinforcement Learning. IEEE Access 12, 91999-92019. DOI: 10.1109/ACCESS.2024.3422416. Online publication date: 2024.
  • (2024) UET4Rec: U-net encapsulated transformer for sequential recommender. Expert Systems with Applications 255 (124781). DOI: 10.1016/j.eswa.2024.124781. Online publication date: Dec-2024.
