
DOI: 10.1145/3219819.3219846

Reinforcement Learning to Rank in E-Commerce Search Engine: Formalization, Analysis, and Application

Published: 19 July 2018

Abstract

In e-commerce platforms such as Amazon and TaoBao, ranking items within a search session is a typical multi-step decision-making problem. Learning to rank (LTR) methods have been widely applied to such ranking problems, but they usually treat the ranking steps within a session as independent, even though those steps can be highly correlated with one another. To better exploit this correlation, we propose to use reinforcement learning (RL) to learn an optimal ranking policy that maximizes the expected accumulative rewards in a search session. First, we formally define the search session Markov decision process (SSMDP) to formulate the multi-step ranking problem. Second, we analyze the properties of the SSMDP and theoretically prove the necessity of maximizing accumulative rewards. Finally, we propose a novel policy gradient algorithm for learning an optimal ranking policy, which is able to deal with the high reward variance and unbalanced reward distribution of an SSMDP. Experiments are conducted both in simulation and in the TaoBao search engine. The results demonstrate that our algorithm significantly outperforms state-of-the-art LTR methods, with more than 40% and 30% growth in total transaction amount in the simulation and the real application, respectively.
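To make the general idea concrete, the following is a minimal, illustrative REINFORCE-style sketch of policy-gradient learning to rank over a multi-step session, where each step's parameter update is weighted by the accumulated reward from that step onward. It is not the authors' SSMDP algorithm: the toy environment, the linear reward signal, and every name below are hypothetical assumptions for illustration only.

```python
# Illustrative sketch only: a minimal REINFORCE-style policy-gradient loop for
# a multi-step ranking policy that maximizes accumulated session reward.
# This is NOT the paper's SSMDP algorithm; the toy environment, the reward
# signal, and all names below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 8   # item feature dimension (assumed)
LR = 0.01        # learning rate (assumed)

# Hidden user preference that drives the toy reward (unknown to the learner).
TRUE_PREF = rng.normal(size=N_FEATURES)

# Linear scoring weights of the softmax ranking policy.
theta = np.zeros(N_FEATURES)

def softmax(scores):
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

def run_session(items, max_steps=5):
    """Roll out one search session; at each step the policy places one item.

    Returns a list of (grad_log_pi, reward) pairs, one per ranking step.
    """
    trajectory = []
    pool = list(range(len(items)))
    for _ in range(min(max_steps, len(pool))):
        feats = items[pool]                    # remaining candidate items
        probs = softmax(feats @ theta)         # softmax over linear scores
        idx = rng.choice(len(pool), p=probs)
        # Gradient of log pi for a linear softmax policy:
        #   d/d(theta) log pi(idx) = x_idx - sum_j pi(j) * x_j
        grad_log_pi = feats[idx] - probs @ feats
        # Toy per-step reward: purchase propensity of the chosen item + noise.
        reward = float(feats[idx] @ TRUE_PREF + 0.1 * rng.normal())
        trajectory.append((grad_log_pi, reward))
        pool.pop(idx)                          # each item is ranked once
    return trajectory

for episode in range(2000):
    items = rng.normal(size=(20, N_FEATURES))  # candidate items for a query
    traj = run_session(items)
    rewards = [r for _, r in traj]
    for t, (grad_log_pi, _) in enumerate(traj):
        # Undiscounted return-to-go from step t: the accumulative reward
        # that the session-level objective maximizes.
        G_t = sum(rewards[t:])
        theta += LR * G_t * grad_log_pi
```

Note that a plain update like this suffers from exactly the issues the abstract names, high reward variance and an unbalanced reward distribution across sessions, which the paper's proposed algorithm is designed to handle.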




Published In

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN:9781450355520
DOI:10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery, New York, NY, United States



Author Tags

  1. online learning to rank
  2. policy gradient
  3. reinforcement learning

Qualifiers

  • Research-article

Funding Sources

  • Jiangsu SF

Conference

KDD '18

Acceptance Rates

KDD '18 paper acceptance rate: 107 of 983 submissions (11%).
Overall acceptance rate: 1,133 of 8,635 submissions (13%).


Article Metrics

  • Downloads (last 12 months): 127
  • Downloads (last 6 weeks): 8

Reflects downloads up to 24 Sep 2024.

Cited By
  • (2024) The Power of Linear Programming in Sponsored Listings Ranking: Evidence from Field Experiments. SSRN Electronic Journal. DOI: 10.2139/ssrn.4767661. Online publication date: 2024.
  • (2024) A social image recommendation system based on deep reinforcement learning. PLOS ONE 19:4 (e0300059). DOI: 10.1371/journal.pone.0300059. Online publication date: 4-Apr-2024.
  • (2024) Maximum-Entropy Regularized Decision Transformer with Reward Relabelling for Dynamic Recommendation. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 376-384. DOI: 10.1145/3637528.3671750. Online publication date: 25-Aug-2024.
  • (2024) Teach and Explore: A Multiplex Information-guided Effective and Efficient Reinforcement Learning for Sequential Recommendation. ACM Transactions on Information Systems 42:5, 1-26. DOI: 10.1145/3630003. Online publication date: 29-Apr-2024.
  • (2024) Reinforcement Learning-based Recommender Systems with Large Language Models for State Reward and Action Modeling. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 375-385. DOI: 10.1145/3626772.3657767. Online publication date: 10-Jul-2024.
  • (2024) Reinforcing Long-Term Performance in Recommender Systems with User-Oriented Exploration Policy. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1850-1860. DOI: 10.1145/3626772.3657714. Online publication date: 10-Jul-2024.
  • (2024) On the Effectiveness of Unlearning in Session-Based Recommendation. Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 855-863. DOI: 10.1145/3616855.3635823. Online publication date: 4-Mar-2024.
  • (2024) HGCR: A Heterogeneous Graph-Enhanced Interactive Course Recommendation Scheme for Online Learning. IEEE Transactions on Learning Technologies 17, 364-374. DOI: 10.1109/TLT.2023.3314399. Online publication date: 2024.
  • (2024) A Review of Explainable Recommender Systems Utilizing Knowledge Graphs and Reinforcement Learning. IEEE Access 12, 91999-92019. DOI: 10.1109/ACCESS.2024.3422416. Online publication date: 2024.
  • (2024) UET4Rec: U-net encapsulated transformer for sequential recommender. Expert Systems with Applications 255 (124781). DOI: 10.1016/j.eswa.2024.124781. Online publication date: Dec-2024.
