Abstract
We study the design of rating systems that incentivize (more) efficient social learning among self-interested agents. Agents arrive sequentially and are presented with a set of possible actions, each of which yields a positive reward with an unknown probability. A disclosure policy sends messages about the rewards of previously chosen actions to arriving agents. These messages can alter agents' incentives towards exploration: taking potentially sub-optimal actions for the sake of learning more about their rewards. Prior work has made substantial progress with disclosure policies that merely recommend an action to each user, without any other supporting information, and sometimes recommend exploratory actions. All of this work relies heavily on standard, yet very strong, rationality assumptions. However, these assumptions are quite problematic in the context of the motivating applications: recommendation systems such as Yelp, Amazon, or Netflix, and matching markets such as AirBnB. It is unclear whether users would know and understand a complicated disclosure policy announced by the principal, let alone trust the principal to faithfully implement it. (The principal may deviate from the announced policy intentionally, due to insufficient information about the users, or because of bugs in implementation.) Even if the users understand the policy and trust that it was implemented as claimed, they might not react to it rationally, particularly given the lack of supporting information and the possibility of being singled out for exploration. For example, users may find such disclosure policies unacceptable and leave the system.
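To make the interaction protocol concrete, here is a minimal Python sketch of the model above, under illustrative assumptions: rewards are Bernoulli, and the function names (`run_protocol`, `disclose`, `respond`) are ours rather than the paper's. The usage example at the bottom instantiates full-history disclosure with greedy frequentist agents, the combination that risks herding.

```python
from dataclasses import dataclass
from typing import List
import random

@dataclass
class Round:
    agent: int
    action: int
    reward: int

def run_protocol(mu, T, disclose, respond, seed=0):
    """Toy rendering of the sequential protocol.

    mu: Bernoulli mean reward of each action (unknown to agents).
    disclose(t, history): the principal's message to agent t,
        e.g. a subhistory of past (action, reward) pairs.
    respond(message, num_actions): the agent's self-interested
        choice of an action, given only that message.
    """
    rng = random.Random(seed)
    history: List[Round] = []
    for t in range(T):
        message = disclose(t, history)
        action = respond(message, len(mu))
        reward = 1 if rng.random() < mu[action] else 0
        history.append(Round(t, action, reward))
    return history

# Full-history disclosure with greedy (frequentist) agents: each
# agent plays the empirically best action, which can herd on a
# sub-optimal action.
def full_history(t, history):
    return list(history)

def greedy(message, num_actions):
    counts = [0] * num_actions
    totals = [0] * num_actions
    for r in message:
        counts[r.action] += 1
        totals[r.action] += r.reward
    # Unsampled actions are tried first (empirical mean treated as 1).
    means = [totals[a] / counts[a] if counts[a] else 1.0
             for a in range(num_actions)]
    return max(range(num_actions), key=lambda a: means[a])

hist = run_protocol([0.6, 0.5], T=1000, disclose=full_history, respond=greedy)
```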
We study a particular class of disclosure policies that use messages, called unbiased subhistories, consisting of the actions and rewards from a subsequence of past agents. Each subsequence is chosen ahead of time, according to a predetermined partial order on the rounds. We posit a flexible model of frequentist agent response, which we argue is plausible for this class of "order-based" disclosure policies. We measure the performance of a policy by its regret, i.e., the difference in expected total reward between the best action and the policy. A disclosure policy that reveals the full history in each round risks inducing herding behavior among the agents, and typically has regret linear in the time horizon T. Our main result is an order-based disclosure policy that achieves regret Õ(√T). This regret rate is known to be optimal in the worst case over reward distributions, even absent incentives. We also exhibit simpler order-based policies with higher, but still sublinear, regret. These policies can be interpreted as dividing a sublinear number of agents into constant-sized focus groups, whose histories are then revealed to future agents.
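Below is a minimal sketch of the focus-group interpretation, assuming a few details the abstract does not pin down (the group size, a √T default for the number of groups, and optimistic tie-breaking for unsampled actions are our illustrative choices, not the paper's construction):

```python
import random

def simulate_focus_groups(mu, T, group_size=3, num_groups=None, seed=0):
    """Toy simulation of a focus-group-style order-based policy.

    Phase 1: num_groups focus groups of group_size agents each;
    every member sees only the earlier rounds of her own group.
    Phase 2: all remaining agents see the union of the focus-group
    histories and play the empirically best action (frequentist
    response; unsampled actions are tried first).
    """
    rng = random.Random(seed)
    K = len(mu)
    if num_groups is None:
        num_groups = int(T ** 0.5)  # sublinearly many exploring agents

    def best_empirical(counts, totals):
        means = [totals[a] / counts[a] if counts[a] else 1.0 for a in range(K)]
        top = max(means)
        return rng.choice([a for a in range(K) if means[a] == top])

    revealed_counts, revealed_totals = [0] * K, [0] * K
    total_reward, t = 0, 0
    for _ in range(num_groups):                    # phase 1: focus groups
        g_counts, g_totals = [0] * K, [0] * K
        for _ in range(group_size):
            if t >= T:
                break
            a = best_empirical(g_counts, g_totals)  # sees own group only
            r = 1 if rng.random() < mu[a] else 0
            g_counts[a] += 1; g_totals[a] += r
            revealed_counts[a] += 1; revealed_totals[a] += r
            total_reward += r; t += 1
    while t < T:                                   # phase 2: exploitation
        a = best_empirical(revealed_counts, revealed_totals)
        total_reward += 1 if rng.random() < mu[a] else 0
        t += 1
    return max(mu) * T - total_reward  # realized regret vs. best action

print(simulate_focus_groups([0.6, 0.5], T=10_000))
```

Note that phase-2 agents' choices never feed back into the revealed history, so the statistics they see are unaffected by earlier agents' selection behavior; informally, this is the sense in which such subhistories can remain unbiased.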
Helping market participants find what they are looking for, and coordinating their search and exploration behavior in a globally optimal way, is an essential part of market design. This paper continues the line of work on "incentivized exploration": essentially, exploration-exploitation learning in the presence of self-interested users whose incentives are skewed in favor of exploitation. Conceptually, we study the interplay of information design, social learning, and multi-armed bandit algorithms. To the best of our knowledge, this is the first paper in the literature on incentivized exploration (and possibly in the broader literature on "learning and incentives") that attempts to mitigate the limitations of standard economic assumptions. Full version: https://arxiv.org/abs/1811.06026.