1 Introduction
As online services have become inescapable fixtures of modern life, recommender systems have become ubiquitous, influencing the music curated into our playlists, the movies pumped into the carousels of streaming services, the news that we read, and the products suggested whenever we visit e-commerce sites. These systems are commonly data-driven and algorithmic, built upon the intuition that historical interactions might be informative of users’ preferences, and thus could be leveraged to make better recommendations [
23]. While these systems are prevalent in real-world applications, we often observe misalignment between their behavior and human preferences [
30]. In many cases, such divergence comes from the fact that the underlying assumptions powering the learning algorithms are questionable.
Recommendation algorithms for these systems mostly rely on supervised learning heuristics [
9,
23], including latent factor models such as matrix factorization [
34] and deep learning approaches that are designed to predict various heuristically chosen targets [
60] (e.g., purchases, ratings, reviews, watch time, or clicks [
9,
43]). Typically, they rely on the naive assumption that these behavioral signals straightforwardly indicate users’ preferences. However, this assumption may not hold in general for a variety of reasons, including exposure bias (users are only able to provide behavioral signals for items they have been recommended) and censoring (e.g., reviews tend to be written by users with strong opinions) [
31,
57].
On the other hand, to study the decision-theoretic nature of recommender systems, the online decision-making framework—multi-armed bandits (MABs)—has commonly been used [
35,
56]. In MABs, at any given time, the decision-maker chooses an arm to pull (a recommendation to make in the recommender system setting) among a set of arms and receives a reward, with the goal of obtaining high expected cumulative reward over time. Theoretical research on MABs centers on algorithms that balance the exploration-exploitation tradeoff and analyses capturing the performance
of these algorithms [
35]. There is a long line of work on developing MAB-based approaches for recommender systems in settings including traditional
K-armed bandits [
5, and references therein], contextual bandits [
41,
44,
45, and references therein], and Markov decision processes [
55,
12,
13,
14,
59,
15, and references therein]. To guide such research on applying MAB-based algorithms in recommender systems, it is important to
test whether the assumptions that these algorithms are built upon are valid in real-world recommendation settings.
In this work, we focus on the assumption of temporal stability that underlies both practical supervised learning methods and algorithms for classical MAB settings, where the reward distribution is assumed to be fixed over time. In a recommendation setting, the reward distribution of an arm corresponds to the user’s preference towards that recommendation item. Although the assumption that the reward distribution is fixed may be appropriate for the applications that drove early MAB research (e.g., sequential experimental design in medical domains) [
51], one may find it unreasonable in recommender systems, given that the interactants are humans and the reward distributions represent human preferences. For example, consider the task of restaurant recommendation: though a user may be happy with a recommended restaurant the first time, such enjoyment may decline as the same recommendation is made over and over again. This particular form of evolving preference is known as satiation, and results from repeated consumption [
21]. One may also think of cases where a user’s enjoyment increases as the same item is recommended multiple times, for reasons including sensitization [
26]. In both settings, the assumption that reward distributions are fixed is violated and the recommendation algorithms may influence the preferences of their users.
We test the assumption of fixed reward distributions through randomized controlled trials conducted on Amazon Mechanical Turk. In the experiment, we simulate a
K-armed bandit setting (a MAB setup where the arm set is the same set of
K arms over time) and recommend comics from
K comic series to the study participants. After reading a comic, the participants provide an enjoyment score on a 9-point Likert scale [
17], which serves as the reward received by the algorithm for the recommendation (for pulling the corresponding arm). Each comic series belongs to a different genre and represents an arm. Our analyses of the collected dataset reveal that in a bandit recommendation setup, human preferences can evolve, even within a short period of time (less than 30 minutes) (Section
4). In particular, between two predefined sequences that result in the same number of pulls of each arm in different order, the mean reward for one arm has a significant difference of 0.57 (
\(95\%\) CI = [0.30, 0.84],
p-value < 0.001). This suggests that any MAB algorithm applied to recommendation settings should account for the dynamic aspects of human preferences and for the fact that the recommendations made by these algorithms may influence users’ preferences.
The line of work that develops contextual bandits and reinforcement learning algorithms for recommender systems [
12,
13,
14,
15,
44,
45,
55,
59] hinges upon the assumption that users’ preferences depend on the past recommendations they receive. The proposed algorithms have been deployed to interact with users on large-scale industry platforms (e.g., YouTube). These prior works differ from ours in multiple ways. The most significant distinction is that our work aims to understand and identify assumptions on reward distributions that better represent user preference characteristics, whereas this line of work asks how, once an assumption on the reward distribution has been made, to design recommendation algorithms with better performance. For example, in settings where recommender systems are modeled as contextual bandits [
44,
45], the reward distributions are assumed to take a certain functional form (often linear) in terms of the observable states. When treating recommender systems as reinforcement learning agents in a Markov decision process [
12,
13,
14,
15,
55,
59], one has made the core assumption that the reward is Markovian and depends only on the
observable user states (e.g., user history) and recommendations. In other words, the reward (and user preference) does not depend on
unobservable states (e.g., user emotions), as it would in partially observable Markov decision processes. Under this assumption, the proposed algorithms address difficulties (e.g., large action spaces) in the reinforcement learning problem. It is worth noting that in these prior works, the proposed algorithms have been evaluated on industry platforms. For academics who want to evaluate their recommendation algorithms with human interactants, such infrastructure is not easily accessible. We take an initial step to address this need by developing our experimental framework.
The experimental framework we developed is flexible. It allows one to conduct field tests of MAB algorithms, use pre-defined recommendation sequences to analyze human preferences, and ask users to choose an arm to pull on their own. These functionalities can be used to identify assumptions on user preference dynamics and reward distributions that better capture user characteristics. As an illustration of the flexibility of our experimental framework, we have collected data while participants interact with several traditional MAB algorithms and analyzed their experience with these algorithms. Interestingly, we observe that the interactants (of a particular algorithm) who experienced the lowest level of satisfaction are the ones with the poorest performance in recalling their past ratings for previously seen items.
In summary, we provide a flexible experimental framework that can be used to run field tests with humans for any
K-armed bandit algorithm (Section
3.3). Using this experimental framework, we have tested the validity of the fixed-reward-distribution (fixed-user-preference) assumption for applying MAB algorithms to recommendation settings (Section
4). As an illustration of the flexibility of our experimental framework, we have inspected different bandit algorithms in terms of user enjoyment and attentiveness (Section
5). We discuss the limitations of our study in Section
6.
2 Related Work
The study of evolving preferences has a long history, addressed by such diverse areas as psychology [
16,
21], economics [
27,
48], marketing [
58], operations research [
7], philosophy [
4,
42], and recommender systems [
32,
37,
49]. In the bandits literature, there is a recent line of work, motivated by recommender systems, that aims to incorporate the dynamic nature of human preference into the design of algorithms. These papers have different models on human preferences, expressed as specific forms of how the reward of an arm depends on the arm’s past pulls. Levine et al. [
40], Seznec et al. [
54] model rewards as monotonic functions of the number of pulls. By contrast, Basu et al. [
6], Cella and Cesa-Bianchi [
11], Kleinberg and Immorlica [
33] consider the reward to be a function of the time elapsed since the last pull of the corresponding arm. In Mintz et al. [
46], rewards are context-dependent, where the contexts are updated based on known deterministic dynamics. Finally, Leqi et al. [
39] consider the reward dynamics to be unknown stochastic linear dynamical systems. These prior works model the reward (user preferences) in distinct ways, and lack (i) empirical evidence on whether user preferences evolve over a short period of time in a bandit setup; and (ii) datasets and experimental toolkits that can be used to verify the proposed theoretical models and to explore more realistic ways of modeling user preferences. On the other hand, there is a line of work on developing contextual bandits and reinforcement learning algorithms to account for user preference dynamics in recommender systems. The evaluations of these algorithms against human users rely on access to large-scale industry platforms [
12,
13,
14,
15,
44,
45,
55,
59].
Datasets that are publicly available and can be used to evaluate bandit algorithms in recommendation settings often contain ratings or other types of user feedback on recommendations [
38,
52]. These datasets do not contain trajectories of recommendations and the associated feedback signal for a particular user, making it hard to understand the dynamical aspects of user preferences and to identify effects of past recommendations on the preferences. In addition, existing MAB libraries (e.g., [
28]) only consist of implementations of bandit algorithms, but lack the appropriate tools and infrastructure to conduct field tests of these algorithms when interacting with humans. The toolkit we have developed closes this gap and allows one to use these libraries while conducting human experiments on bandit algorithms. Another relevant stream of research considers human experiments in bandit settings in which human participants are asked to make decisions in a bandit task, and data are collected for modeling their decision-making [
2,
36,
50]. In other words, in those experiments, human participants take the role of the algorithm that selects the arm to pull instead of the role of providing reward signals. In one of our experimental setups, the study participants are asked to select comics to read on their own. However, whereas the reward distributions in those experiments are defined by the experiment designers, in our setting the rewards are provided by the human participants and indicate their preferences.
3 Experimental Setup
In this section, we first describe a classical MAB setting—stochastic
K-armed bandits—and discuss the algorithms we have used in our experiments (Section
3.1). We then explain why we choose comics as the recommendation domain and our selection criteria for the comics used for recommendations (Section
3.2). Finally, we discuss our experimental framework (Section
3.3).
3.1 Stochastic K-armed bandits
In stochastic
K-armed bandits, at any time
t, the learner (the recommender in our case) chooses an arm
a(
t) ∈ [
K] to pull (a recommendation to present in our case) and receives a reward
R(
t). For any horizon
T, the goal for the learner is to attain the highest expected cumulative reward
\(\mathbb {E}[\sum _{t=1}^T R(t)]\) . In this setup, the reward distribution of each arm
k ∈ [
K] is assumed to be fixed over time and the rewards received by pulling the same arm are independently and identically distributed. An oracle (the best policy), in this case, would always play the arm with the highest mean reward. The difference between the expected cumulative reward obtained by the oracle and the one obtained by the learner is known as the “regret.” As shown in the existing literature [
35], the regret lower bound for stochastic
K-armed bandits is
\(\Omega (\sqrt {T})\) . Many existing algorithms achieve the regret at the optimal rate
\(O(\sqrt {T})\) , including the Upper Confidence Bound (UCB) algorithm and the Thompson Sampling (TS) algorithm. We also include a traditional algorithm Explore-then-Commit (ETC) and a greedy heuristic (ε-Greedy) in our study. These algorithms along with their regret guarantees are built upon the key assumption that the reward distributions are fixed. In Section
4, we test the validity of this assumption in a recommendation setting where the interactants are humans and the rewards represent their preferences.
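To make the setup concrete, the following sketch simulates a stationary K-armed bandit and computes the (empirical) regret of a uniformly random policy. The arm means, noise scale, and horizon are hypothetical values chosen purely for illustration, not parameters from our study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary K-armed bandit: each arm k has a fixed mean
# reward mu[k]; pulling it returns mu[k] plus Gaussian noise.
mu = np.array([0.2, 0.5, 0.4, 0.8, 0.3])  # K = 5 arms
K, T = len(mu), 50

def pull(k):
    return mu[k] + rng.normal(scale=0.1)

# A (poor) baseline policy: pull an arm uniformly at random at each step.
rewards = [pull(rng.integers(K)) for _ in range(T)]

# Empirical regret: the oracle always plays the best arm (mean mu.max()),
# so we compare its expected cumulative reward to the learner's.
regret = T * mu.max() - sum(rewards)
print(round(regret, 2))
```

Since the random policy spreads its pulls across suboptimal arms, its regret grows linearly in \(T\), whereas the algorithms discussed below target \(O(\sqrt{T})\).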
Algorithms. We give a brief summary and a high-level intuition for each algorithm. More detailed descriptions of these algorithms can be found in Appendix
A in the supplementary material. We use the term “algorithm” in a broad sense and the following items (e.g., Self-selected and Fixed sequence) may not all be traditional MAB algorithms.
•
Self-selected: The participants who are assigned to the Self-selected algorithm will choose which arm to interact with by themselves. In other words, at each time t, instead of a prescribed learning policy determining the arm a(t), the participants themselves will choose the arm.
•
UCB: At time t, UCB deterministically pulls the arm with the highest upper confidence bound, a value that for each arm, combines the empirical mean reward with an uncertainty estimate of the mean reward. An arm with high upper confidence bound can have high empirical mean and/or high uncertainty on the mean estimate.
•
TS: In Thompson Sampling, a belief distribution is maintained over the possible reward values for each arm. At time t, the algorithm samples a reward from each arm’s belief distribution and pulls the arm with the highest sampled reward. When a reward is received after pulling the arm, TS updates the corresponding arm’s prior belief distribution to obtain a posterior.
•
ETC: Unlike UCB and TS, the Explore-then-Commit algorithm has two separate stages—the exploration stage and the exploitation stage. It starts with an exploration period where the algorithm pulls the arms in a cyclic order and then switches to an exploitation stage where only the arm with the highest empirical mean in the exploration stage will be pulled. Given that the ETC algorithm achieves a regret of \(O(T^{2/3})\) when the exploration time is on the order of \(T^{2/3}\), we have set the exploration period to be \(c \cdot T^{2/3}\) for a positive constant c.
•
ε-Greedy: This greedy heuristic pulls the arm with the highest empirical mean with probability 1 − ε where 0 < ε < 1, and pulls an arm uniformly at random with probability ε. In settings with a long interaction period, one way of setting ε is to have it decrease over time, e.g., setting ε to be on the order of
\(\frac{1}{t}\) [
3]. Given the short interaction period in our setting, we have used a fixed ε = 0.1 (which may result in linear regret).
•
Fixed sequence (CYCLE, REPEAT): The fixed sequence algorithms pull arms by following a predefined sequence. That is, the arm pulled at time
t only depends on the current time step and does not rely on the received rewards so far. We have used two fixed sequences CYCLE and REPEAT for testing whether the reward distributions (that represent participants’ preferences) are fixed over time. We provide more details on these two fixed sequences in Section
4.
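The arm-selection rules of UCB, TS, and ε-Greedy can be sketched as follows. This is a minimal illustration rather than our actual implementation: the pull counts, empirical means, and the Gaussian-posterior variant of Thompson Sampling are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5

# Hypothetical per-arm state accumulated from earlier interactions.
counts = np.array([3, 3, 3, 3, 3])           # pulls of each arm so far
means = np.array([5.0, 6.2, 4.8, 7.1, 5.5])  # empirical means, 9-point scale
t = counts.sum() + 1                         # current time step

def ucb(counts, means, t):
    # Pull the arm with the highest empirical mean plus uncertainty bonus.
    return int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))

def epsilon_greedy(counts, means, eps=0.1):
    # With probability eps explore uniformly; otherwise exploit the best mean.
    if rng.random() < eps:
        return int(rng.integers(K))
    return int(np.argmax(means))

def thompson_gaussian(counts, means, sigma=1.0):
    # Gaussian TS sketch: sample a mean from each arm's (assumed) posterior
    # N(mean, sigma^2 / count) and pull the arm with the highest sample.
    samples = rng.normal(means, sigma / np.sqrt(counts))
    return int(np.argmax(samples))

print(ucb(counts, means, t))  # with equal counts, the bonus ties and the
                              # highest empirical mean (arm index 3) wins
```

Note how the three rules differ only in how they trade off the empirical mean against uncertainty: UCB adds a deterministic bonus, TS randomizes through posterior sampling, and ε-Greedy randomizes with a fixed exploration probability.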
Next, we present how we have selected the comics used for recommendations.
3.2 Comics data
In our experiment, we choose comics as our recommendation domain for the following reasons: (i) Fast consumption time: Given the nature of our experiment where study participants are recommended a sequence of items to consume in a limited amount of time, we require the time for consuming each of the recommendations to be short. For example, recommending movie clips to watch may not be appropriate in our setting given that each clip may take a couple of minutes to finish. (ii) No strong pre-existing preferences: Another important feature of the chosen recommendation domain is that the majority of the study participants should have no strong preference on that subject prior to the experiment. For example, unlike comics, music is a subject that most people tend to already have strong preferences towards [
53]. In such cases, the effects of recommendations on the participants’ preferences may be minimal.
We collected comics from 5 comic series on GoComics [
22]. Each comic series belongs to a genre and represents an arm for pulling. The 5 comic series along with their genre are Lisa Benson (political, conservative), Nick Anderson (political, liberal), Baldo (family), The Born Loser (office), and The Argyle Sweater (gag). The genres of these comics are assigned by GoComics. For each series, we take the following steps to select the set of comics:
(1)
We first collect all comics belonging to the comic series from the year 2018. We select this time period to be not too recent so that the results of the study are not heavily influenced by ongoing events. It is also not too distant so that the content is still relevant to all subjects.
(2)
For each individual comic, we obtain its number of likes on GoComics. Then, we choose the top 60 comics from each comic genre/series in terms of the number of likes. This selection criterion is designed to ensure the quality of the chosen comics.
(3)
Finally, we randomly assign an ordering to the comics. We keep this ordering fixed throughout the study such that if an arm is pulled (a genre is chosen) for the j-th time, the presented comic is always the same.
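The steps above amount to a small data-selection pipeline. The sketch below uses hypothetical like counts in place of the scraped GoComics metadata; the identifiers and seeds are illustrative only.

```python
import random

# Hypothetical records for one comic series from 2018: (comic_id, like_count).
# In the study, these come from GoComics metadata.
comics = [(f"comic_{i}", likes) for i, likes in
          enumerate(random.Random(0).sample(range(1000), 200))]

# Step 2: keep the top 60 comics by number of likes.
top60 = sorted(comics, key=lambda c: c[1], reverse=True)[:60]

# Step 3: fix one random ordering; the j-th pull of this arm always
# shows the j-th comic in this list, for every participant.
rng = random.Random(42)
ordering = top60[:]
rng.shuffle(ordering)

assert len(ordering) == 60
```

Fixing the ordering once (rather than reshuffling per participant) is what makes the CYCLE and REPEAT sequences in Section 4 show the same set of comics.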
The comics within the same comic series are independent, in the sense that they can generally be read in any order, without requiring much context from previous comics in the same series. For these collected comics, we have labeled the number of unique characters in them, and use it for the attention check questions to ensure that the study participants have read the comics. Although we have adopted the above selection criteria to ensure that comics from the same series are of similar quality, there is heterogeneity among individual comics, which may influence the interpretation of our results. We provide more discussion on this in Section
6. In Section
3.3, we discuss our experimental protocol and platform.
3.3 Experimental protocol and platform
We first outline our experimental protocol, which consists of the following steps (Figure
1):
(1)
Background survey (initial filtering): We ask the study participants to complete a brief background survey (Appendix
D in the supplementary material). The participants will only be given a code to continue to the next step if an arithmetic question is answered correctly. The goal of this first step is to provide an initial filter for participants.
(2)
Registration: After completing the background survey, participants are then asked to register on our platform using their completion code. Each participant in our study is assigned an algorithm with the following fixed probabilities—0.25 for Self-selected, 0.125 for UCB, 0.125 for TS, 0.125 for ETC, 0.125 for ε-Greedy, and 0.125 for each of the two fixed sequences.
(3)
Comic rating: In this step, participants will read a sequence of 50 comics and provide an enjoyment score (a rating) after reading each comic. The sequence of comics can be generated by any of the algorithms discussed in Section
3.1. After reading each comic and providing a rating, the participants are also asked to answer an attention check question.
(4)
Post-study survey: Once the participants have finished the comic-rating step of the study, they are asked to complete a post-study survey about their reading experience. They are asked whether they remember reading certain comics and whether they rated them positively, as well as how they perceive the recommendations they were provided. An example of the post-study survey questions can be found in Figure
6 (Appendix
E in the supplementary material).
Our experimental platform is built as a toolkit for running field tests for any MAB algorithms with human users. It consists of (a) the participant-facing web interface, and (b) the server backend that stores and processes incoming data from the web interface. When designing the experimental protocol and platform, we consider the following questions:
(1)
Given that we are asking users for subjective feedback, how do we elicit responses that reflect the users’ true preferences?
(2)
How do we design an experimental interface that imposes less bias on the users?
(3)
Since our study requires users to complete a long (up to 30-minute) sequence of non-independent tasks, how do we keep the study as uninterrupted as possible?
(4)
How do we make the system flexible enough to conduct studies of different MAB algorithms and test different assumptions of MAB setups, including ones that we do not cover in this work?
For (1), we adopt a 9-point Likert scale so that the numbers are sufficiently distinguishable to the participants [
17] and check whether users are paying sufficient attention during the study. In particular, we test the participants on objective properties of the comics they have read, e.g., the number of unique characters (with a face and/or body) in them. We also set a minimum time threshold of 10 seconds before each user response can be submitted so that users spend adequate time on each comic.
For (2), to ensure that the participant’s rating is not biased towards (e.g., anchored on) the Likert scale slider’s initial value, we set the slider to be transparent before the participant clicks on the scale. In addition, in the Self-selected setting where the participants choose the genre of comics to read next, we minimize the color- and ordering-based biases by setting the category selection buttons to be the same color and in random order.
As stated in design question (3), because we are interested in studying evolving preferences over a sequence of non-independent tasks, we would like to have continuous and uninterrupted user attention over a period of time. To do so, we design the system to be stateful so that participants can resume the study where they left off in the event of brief network disconnection and browser issues.
Finally, we discuss the flexibility of our system and address design question (4). Our experimental platform allows the experimenter to specify any recommendation domains and MAB algorithms for interacting with the human interactant. This flexibility allows the experimenter to not only test the performance of different MAB algorithms but also design pull sequences to understand user preference dynamics and test existing assumptions on user preferences. For example, one may design pull sequences to study the correlation between rewards obtained at different time steps and rewards obtained from pulling different but related arms. It is worth noting that the attention checks in our system are also customizable, allowing for diverse attention checks for each recommendation domain.
We plan to open source our code and data. For more details about the experimental platform and MTurk-related implementation details, we refer the readers to Appendix
E in the supplementary material.
3.4 Recruitment and compensation
To ensure the quality of our collected data, we only allow MTurk workers to participate in the study if they are U.S. residents and have completed at least 500 Human Intelligence Tasks (HITs) with an above
\(97\%\) HIT approval rate. The anticipated (and actual) time to complete the study is less than 30 minutes. For participants who have successfully completed the study and answered the attention check questions correctly
\(70\%\) of the time, we paid $7.50 (the equivalent hourly rate is above $15/hr). In Appendix
D, we provide more details on the participants’ demographics and backgrounds. Out of the 360 participants who have successfully completed the study, 316 passed the attention check (acceptance rate
\(87.8 \%\) ). Our analyses are only conducted on the data collected from these participants. Table
1 shows the number of participants for each experimental setup.
4 Evolving Preferences in K-armed bandits
As we have previously discussed, in
K-armed bandits, the reward distribution of each arm is assumed to be fixed over time [
35,
51,
56], which implies that the mean reward of each arm remains the same. It is unclear whether such an assumption is reasonable in recommender system settings where the reward distributions represent human preferences. Our first aim is to test for the existence of evolving preferences in the
K-armed bandits setup. In other words, using randomized controlled trials, we want to test the following hypothesis:
In a K-armed bandit recommendation setting, the reward distribution of each arm (i.e., the user preference towards each item) is not fixed over time.

To answer this, we collect enjoyment scores for two fixed recommendation sequences, where each sequence is of length
T = 50. The sequence consists of recommendations from
K = 5 genres of comics. In other words, the total number of arms is 5. The first sequence pulls the arms in a cyclic fashion, which we denote by CYCLE, while the second sequence pulls each arm repeatedly for
m =
T/
K times, which we denote by REPEAT. That is, under CYCLE, \(a(t) = ((t-1) \bmod K) + 1\), while under REPEAT, \(a(t) = \lceil t/m \rceil\).
We note that for both sequences, the number of arm pulls of each arm is the same (
m times). Since the order of the comics is fixed for each arm (e.g., pulling an arm for
m times will always result in the same sequence of
m comics from that genre), the set of comics recommended by the two sequences is the same. The only difference between the two sequences is the order of the presented comics. Intuitively, if the mean reward of each arm is fixed over time (and does not depend on the past pulls of that arm), then the (empirical) mean reward of each arm should be very similar under the two pull sequences.
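Concretely, with K = 5 arms and horizon T = 50, the two pull sequences can be generated as follows (a sketch using 1-indexed arms):

```python
from collections import Counter

K, T = 5, 50
m = T // K  # each arm is pulled m = 10 times under both sequences

# CYCLE pulls arms 1..K round-robin; REPEAT pulls each arm m times in a row.
cycle = [(t % K) + 1 for t in range(T)]
repeat = [(t // m) + 1 for t in range(T)]

# Both sequences pull every arm exactly m times, only in different orders.
assert Counter(cycle) == Counter(repeat) == {k: m for k in range(1, K + 1)}
```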
In this work, we utilize a modification of the two-sample permutation test [
19] to deal with the different numbers of participants under the two recommendation sequences. We let
\(\mathcal {P}^{\text{CYCLE}}\) and
\(\mathcal {P}^{\text{REPEAT}}\) denote the set of participants assigned to the CYCLE and REPEAT recommendation sequence, respectively. For each participant
\(i \in \mathcal {P}^{\text{CYCLE}}\) , we use
ai(
t) to denote the pulled arm (the recommended comic genre) at time
t and
Xi(
t) to denote the corresponding enjoyment score (the reward) that the participant has provided. Similarly, for each participant
\(j \in \mathcal {P}^{\text{REPEAT}}\) , we use
aj(
t) and
Yj(
t) to denote the arm pulled at time
t and the enjoyment score collected from participant
j at time
t. Using these notations, for each arm
k ∈ [
K], we define the test statistic as \(\tau _k = M_k^{\text{CYCLE}} - M_k^{\text{REPEAT}}\).
The test statistic
τk is the difference between the mean reward (enjoyment score)
\(M_k^\text{CYCLE}\) under the CYCLE recommendation sequence and the mean reward
\(M_k^\text{REPEAT}\) under the REPEAT recommendation sequence for arm
k. A non-zero
τk suggests that the mean reward of arm
k is different under CYCLE and REPEAT, and that the reward distribution is evolving. The larger the absolute value of
τk, the larger the difference between the two mean rewards. A positive test statistic indicates that the participants prefer the arm under CYCLE over REPEAT on average.
To quantify the significance of the value of the test statistic, we use a two-sample permutation test to obtain the
p-value of the test [
19]: First, we permute participants between
\(\mathcal {P}^{\text{CYCLE}}\) and
\(\mathcal {P}^{\text{REPEAT}}\) uniformly at random 10,000 times, ensuring that the size of each group remains the same after each permutation. Then, we recompute the test statistic
τk for each permutation to obtain a distribution of the test statistic. Finally, we use the original value of our test statistic along with this distribution to determine the
p-value.
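A minimal sketch of this two-sample permutation test is given below. The per-participant scores, group sizes, and effect size are hypothetical illustrations, not our collected dataset (and the sketch permutes individual scores, whereas the study permutes participants' full trajectories).

```python
import numpy as np

rng = np.random.default_rng(0)

def perm_test(x, y, n_perm=10_000):
    # Two-sample permutation test on the difference of group means.
    obs = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        # Keep the group sizes fixed across permutations, as in the study.
        stat = perm[:len(x)].mean() - perm[len(x):].mean()
        if abs(stat) >= abs(obs):
            count += 1
    return obs, count / n_perm

# Hypothetical per-participant mean enjoyment scores for one arm
# under CYCLE (x) and REPEAT (y), on a 9-point scale.
x = rng.normal(6.0, 1.0, size=100)
y = rng.normal(5.0, 1.0, size=95)
tau, p = perm_test(x, y)
print(round(float(tau), 2), p)
```

Because the permutations break any association between group membership and scores, the fraction of permuted statistics at least as extreme as the observed one directly estimates the p-value.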
To report the
\(95\%\) confidence interval of the test statistic for each arm, we use the bootstrap and resample the data 5,000 times at the level of arm pulls. That is, for each arm pull at time
t, we obtain its bootstrapped rewards under CYCLE and REPEAT by resampling from the actual rewards obtained by pulling that arm under CYCLE and REPEAT at time
t, respectively. Given that we have conducted 5 tests simultaneously (one for each arm), in order to control the family-wise error rate, we need to correct the level
αk (
k ∈ [
K]) for each test. More formally, to ensure that the probability that we falsely reject
any null hypothesis is at most
α, the
p-value of each test should be at most its corresponding corrected
αk. We adopt Holm’s Sequential Bonferroni Procedure (details presented in Appendix
B) to perform this correction [
1].
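Holm's procedure can be sketched as follows; the p-values in the example are hypothetical.

```python
def holm_reject(p_values, alpha=0.1):
    """Holm's sequential Bonferroni procedure: sort p-values ascending and
    compare the i-th smallest (0-indexed) to alpha / (K - i), stopping at
    the first failure. This controls the family-wise error rate at alpha."""
    K = len(p_values)
    order = sorted(range(K), key=lambda i: p_values[i])
    reject = [False] * K
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (K - rank):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values also fail
    return reject

# Hypothetical p-values for the five per-arm tests.
print(holm_reject([0.001, 0.20, 0.015, 0.03, 0.40], alpha=0.1))
# -> [True, False, True, True, False]
```

Holm's procedure is uniformly more powerful than the plain Bonferroni correction (which would compare every p-value to alpha / K) while providing the same family-wise error guarantee.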
Our results show that for three arms—family, political (conservative), and political (liberal)—the non-zero differences between the mean rewards under the two recommendation sequences are significant at level
α = 0.1 (Table
2). These findings confirm our research hypothesis that user preferences are not fixed (even in a short amount of time) in a
K-armed bandit recommendation setting. There may be many causes of the evolving reward distributions (preferences). One possibility, among many others, is that the reward of an arm depends on its past pulls. In other words, people’s preference towards an item depends on their past consumption of it. For example, existing marketing and psychology literature has suggested that people may experience hedonic decline upon repeated exposures to the same item [
7,
21]. On the other hand, in music play-listing, one may expect the expected reward of an arm (a genre) to increase due to the taste that the listener has developed for that genre of music [
53]. For a more comprehensive discussion on preference formation, we refer the readers to Becker [
8].
Finally, to better understand the nature of our findings, we divide the participants into heavy comic readers who read comics daily and light comic readers who read comics at a lower frequency. Among participants who are assigned the CYCLE sequence, there are 17 heavy readers and 23 light readers. For REPEAT, there are 16 heavy readers and 22 light readers. We perform the same analysis as noted above for each of the two groups. Similar to the overall findings, among both heavy and light readers, evolving preferences (evolving mean reward) have been observed (Table
2), confirming our research hypothesis. Interestingly, we find that for each genre, the heavy readers tend to prefer the recommendations from the REPEAT sequence over the CYCLE sequence. On the contrary, for each genre, light readers prefer recommendations from the CYCLE sequence over the REPEAT sequence. As an initial step towards understanding this phenomenon, we present descriptive data analysis on how rewards (user preferences) evolve for heavy and light readers under the two recommendation sequences. Similar to our results in Table
2, for light readers, at most time steps, the reward trajectory of CYCLE lies above that of REPEAT, while for heavy readers the opposite holds (Figure
3). By examining the reward trajectories (and the lines fitted through them) over the entire recommendation sequence (Figure
3), we find additional trends: for light readers, the difference between the rewards collected under CYCLE and REPEAT increases over time, while for heavy readers it remains relatively stable. The same trend is observed in the reward trajectory (and the line fitted through it) of each arm under CYCLE and REPEAT (Figure
4). In particular, for light readers, among all arms (comic genres) except family, we find that the lines fitted through the rewards collected under CYCLE and REPEAT grow further apart as the number of arm pulls increases. This distinction between heavy and light readers may arise for various reasons: for example, light readers may prefer variety because they are still exploring their interests, or light and heavy readers may simply have different satiation rates. On a related note, a recent study has shown that people's satiation rates may depend on their personality traits [
20]. Precisely understanding the causes of this distinction between heavy and light readers is of future interest.
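The line-fitting comparison described above can be sketched as follows. The per-step reward values here are hypothetical stand-ins on the 9-point scale, not the study's data; only the analysis pattern (fit a line per trajectory, then compare the gap between the fitted lines at the start and end of the sequence) mirrors the text.

```python
import numpy as np

# Hypothetical per-step mean rewards (9-point scale) for one group of
# readers under the two fixed recommendation sequences.
steps = np.arange(10)
cycle_rewards = np.array([6.0, 6.1, 6.3, 6.2, 6.5, 6.6, 6.8, 6.7, 7.0, 7.1])
repeat_rewards = np.array([6.0, 5.9, 5.7, 5.8, 5.6, 5.5, 5.3, 5.4, 5.2, 5.1])

# Fit a line through each reward trajectory.
slope_c, icept_c = np.polyfit(steps, cycle_rewards, 1)
slope_r, icept_r = np.polyfit(steps, repeat_rewards, 1)

# Gap between the fitted lines at the first and last step: a growing
# gap indicates the two trajectories drift apart over the sequence.
gap_start = icept_c - icept_r
gap_end = (slope_c * steps[-1] + icept_c) - (slope_r * steps[-1] + icept_r)
widening = gap_end > gap_start
```

With these illustrative numbers, the CYCLE fit slopes upward, the REPEAT fit slopes downward, and the gap widens, which is the qualitative pattern reported for light readers.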
6 Conclusions and Future Work
Our work provides a general experimental framework and toolkit for examining assumptions about human preferences in a bandit setup. It also allows one to field-test different bandit algorithms that interact with humans. Using these tools, we build a publicly available dataset containing trajectories of recommendations and the ratings that multiple users gave them in a bandit setting. The collected data has been used to check the validity of a core assumption about human preferences in the MABs literature: that the reward distributions (corresponding to user preferences) are fixed. We show that even over a short time period such as our bandit recommendation setup, such preference dynamics exist. Further, our observations on the differences between light and heavy users' preference dynamics suggest the need for a more granular understanding of human preferences. As an illustrative use of our experimental framework, we have explored the study participants' preferences between selecting content to read on their own and being recommended content to read. As discussed above, these findings are exploratory in nature, with the goal of showcasing different uses of our experimental framework; thus, they should not be interpreted as definitive answers. In our exploratory analysis, we observe that an algorithm achieving the highest cumulative reward does not necessarily achieve the highest hindsight satisfaction.
At a higher level, our work fits into the broader effort of understanding the validity of the assumptions that machine learning systems (in our case, recommender systems) rely on. Although all machine learning models are built upon simplifying assumptions about the world, some of these assumptions oversimplify it to the extent that they lose the core characteristics of the problem we are interested in. In our case, the assumptions we want to understand center on user preferences. We have identified that the assumptions on the temporal stability of preferences used in the traditional MABs literature are oversimplifications. Striking a balance between simplifications that are helpful for building learning systems and oversimplifications that discard core characteristics of the problem is difficult. As the famous statistician George Box put it, “all models are wrong, but some are useful” [
10]. Our goal in developing the experimental toolkit and conducting the human-subjects study is to provide ways to identify assumptions about user preferences that are useful for developing MABs algorithms in recommender systems. Below we discuss the limitations and future directions of our work.
6.1 Limitations
We discuss several limitations of our study. First, the mean reward differences for each arm reported in Section
4 are within 1.5 points on the 9-point Likert scale, which may be considered small. There are several possible reasons. For one, the enjoyment score (the reward) provided by each participant is subjective and may therefore have high variance. Moreover, a difference that would be considered strong in a within-subject study may appear small in a between-subject study (the type of study we conduct). However, we note that our permutation-test results show that the reward distributions are indeed not fixed over time for multiple arms in the
K-armed bandit recommendation setup. Second, each arm represents a comic series. Though we have selected the comics from each series based on their quality (the number of likes they receive), there may still be heterogeneity among the selected comics within a series, and it is open to discussion whether these comics should be considered to belong to the same arm. Third, many quantities we want to measure (e.g., enjoyment/satisfaction, mindfulness/attentiveness) are difficult to define precisely; we measure them from a particular angle, which may not be widely applicable. Fourth, the study domain is comics, chosen in part because a single comic can be consumed quickly, which matters since the study requires participants to read 50 comics. Although all of our experiments are completed within 30 minutes, it is uncommon in real-world settings for people to read 50 comics at a time, which may introduce uncontrolled boredom effects. Fifth, in our experiments, we compare user preferences towards each arm using two fixed sequences; one could also use a purely randomized sequence as a baseline to better understand user preferences. Finally, due to resource constraints (e.g., funding limits for conducting the study and computational limits on the number of simultaneous experiment instances we can host on our server), we recruited 360 participants (316 of whom passed the attention checks). Compared to industry-scale experiments, this number of participants is on the lower end. It is also worth mentioning that our implementations of the bandit algorithms ensure that comic recommendations depend only on the user's own history, which is not common practice for recommender systems on existing platforms; in practice, one may use other users' interaction histories to warm-start these bandit algorithms.
Despite these limitations, we would like to emphasize that our work makes a substantive step towards understanding the applicability of traditional assumptions about user preferences in the MABs literature.
6.2 Future Work
There are multiple future directions for our work. Our findings on the existence of evolving preferences in a
K-armed bandit recommendation setting suggest that to study the decision-theoretic nature of recommender systems using the MAB framework, one must account for such preference dynamics. The need for learning algorithms (often reinforcement learning algorithms) that address the impact of recommendations on user preferences has also been raised in recent works [
12,
13,
14,
15,
29,
44,
45,
55,
59,
61]. An important building block for this line of research is better modeling of human preference dynamics, and our experimental framework and toolkit can provide more grounding and accessibility for such research. As an example, our observation that heavy and light comic readers exhibit different preference dynamics can be investigated further using our experimental framework, advancing our understanding of evolving preferences at a more granular level. More broadly, our toolkit can be used for: (i) collecting data using fixed or randomized recommendation sequences or bandit algorithms to identify and estimate preference dynamics; and (ii) conducting field tests of bandit algorithms designed to address evolving preferences.
Our exploratory data analysis of the performance of different algorithms suggests that the humans interacting with bandit algorithms may care about aspects of their experience beyond cumulative reward. For example, Self-selected has a higher percentage of participants satisfied in hindsight than UCB, even though UCB has a higher average cumulative reward. This suggests that besides the traditional performance metrics used to analyze and develop bandit algorithms, we should consider a broader set of objectives and metrics when studying these problems. On a related note, given that we want our algorithms to account for evolving preferences, when regret (the difference between the expected cumulative reward obtained by a proposed policy and by an oracle) is used as the performance metric, the oracle should be adaptive rather than the best fixed arm considered in much of the MAB (including contextual bandit) literature.
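To make the choice of oracle explicit, write \(\mu_t(a)\) for the (possibly time-varying) mean reward of arm \(a\) at step \(t\) and \(a_t\) for the arm pulled at step \(t\); this notation is introduced here only to contrast the two standard definitions.

```latex
% Static regret: the comparator is the single best fixed arm in hindsight.
R_T^{\mathrm{static}} \;=\; \max_{a}\sum_{t=1}^{T}\mu_t(a) \;-\; \sum_{t=1}^{T}\mu_t(a_t)

% Dynamic regret: the comparator adapts, playing the best arm at every
% step, which is the appropriate oracle when preferences evolve.
R_T^{\mathrm{dynamic}} \;=\; \sum_{t=1}^{T}\max_{a}\mu_t(a) \;-\; \sum_{t=1}^{T}\mu_t(a_t)
```

Since \(\sum_{t}\max_{a}\mu_t(a) \ge \max_{a}\sum_{t}\mu_t(a)\), dynamic regret is never smaller than static regret, and a policy with low static regret may still track evolving preferences poorly.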
B Holm’s Sequential Bonferroni Procedure
When testing K hypotheses (in our case, one hypothesis for each arm), we adopt Holm’s Sequential Bonferroni Procedure to control the family-wise error rate. To ensure that the probability of falsely rejecting any null hypothesis is at most α, we perform the following procedure. Suppose we are given m p-values p1, …, pm, sorted from lowest to highest, for hypotheses H1, …, Hm. For i ∈ [m], if pi < α/(m + 1 − i), then reject Hi and move on to the next hypothesis; otherwise, exit the procedure (and fail to reject the remaining hypotheses). As an illustration, in Table 2, since we test 5 hypotheses (one for each arm) simultaneously and have set the overall α level to 0.1, rejecting all null hypotheses requires the p-values, from lowest to highest, to be below 0.1/5 = 0.02, 0.1/4 = 0.025, 0.1/3 ≈ 0.033, 0.1/2 = 0.05, and 0.1/1 = 0.1, respectively. In other words, setting α at level 0.1 guarantees that the probability of falsely rejecting any null hypothesis is at most 0.1. Under Bonferroni’s correction, when rejecting the null hypothesis with the lowest p-value, the corrected alpha level is α/K = 0.02.
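The procedure above can be sketched in a few lines of Python; the function name and the example p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.1):
    """Holm's sequential Bonferroni procedure.

    Compares the i-th smallest p-value against alpha / (m + 1 - i) and
    stops at the first failure. Returns rejection flags aligned with the
    original order of `p_values`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 holds the smallest p-value
        if p_values[i] < alpha / (m - rank):
            reject[i] = True
        else:
            break  # this and all larger p-values fail to reject
    return reject

# Five hypotheses (one per arm) at overall alpha = 0.1: thresholds are
# 0.02, 0.025, 0.033..., 0.05, 0.1 from smallest to largest p-value.
flags = holm_reject([0.015, 0.020, 0.035, 0.040, 0.090])
```

In this example the first two hypotheses are rejected; the third p-value (0.035) exceeds 0.1/3 ≈ 0.033, so it and all larger p-values are retained.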
C Additional analyses for Section 5
We present additional analyses comparing the performance of different MABs algorithms in terms of user enjoyment and attentiveness. As discussed in the main paper, these analyses are meant to showcase potential use cases of our experimental framework instead of claiming that certain algorithms outperform others. For user enjoyment, we focus on users’ hindsight satisfaction (Section
5.1) for different algorithms and use the two-sample permutation test to check whether the algorithms’ performances differ much from Self-selected’s. More specifically, for each algorithm (UCB, TS, ETC, ϵ-Greedy, CYCLE and REPEAT), we compare its hindsight satisfaction rate with the rate for Self-selected. The test statistic is the difference between the two empirical hindsight satisfaction rates:
\(\beta _\text{Algo} = \widehat{\mathrm{SR}}_\text{Algo} - \widehat{\mathrm{SR}}_\text{Self-selected}.\)
Under the null hypothesis,
\(\beta _\text{Algo} = 0\) , which indicates that the algorithm and Self-selected share similar performances in terms of the hindsight satisfaction rate. The results shown in Table
5 suggest that in our experiment, although Self-selected performs better on average than the other algorithms in terms of hindsight satisfaction, we generally cannot reject the null hypotheses that these algorithms’ hindsight satisfaction rates are similar to Self-selected’s. The only null hypothesis we can reject is the one for REPEAT; that is, in terms of hindsight satisfaction rate, Self-selected outperforms REPEAT. In general, collecting more data may help increase the power of our tests. In terms of attentiveness, we study the rating memory of the participants (Section
5.2) under different algorithms. Similarly, we adopt the two-sample permutation test to check whether other algorithms’ performances differ much from that of Self-selected, in terms of the percentage of questions that participants answer correctly about the ratings they provided for the comics they read. The test statistic is analogous, with correctness percentages in place of satisfaction rates:
\(\beta _\text{Algo} = \widehat{\mathrm{CP}}_\text{Algo} - \widehat{\mathrm{CP}}_\text{Self-selected}.\)
The null hypothesis states that the correctness percentage on these rating-memory-related questions should not differ much between other algorithms and Self-selected. Using the data from our experiments, we cannot reject this null hypothesis. However, as demonstrated here, our experimental framework equips researchers with tools to answer these types of questions.
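A two-sample permutation test of the kind used above can be sketched as follows. The function, the group sizes, and the binary outcomes are hypothetical illustrations, not the study’s actual implementation or data.

```python
import numpy as np

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sample permutation test on a difference of rates.

    group_a, group_b: binary outcomes (e.g., 1 = satisfied in hindsight,
    or 1 = rating-memory question answered correctly).
    Returns the observed rate difference and a two-sided p-value.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel outcomes under the null hypothesis
        diff = pooled[: len(a)].mean() - pooled[len(a):].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Hypothetical hindsight-satisfaction indicators for two conditions.
self_selected = [1] * 18 + [0] * 7   # 18 of 25 satisfied
repeat_seq = [1] * 10 + [0] * 15     # 10 of 25 satisfied
obs, p_value = permutation_test(self_selected, repeat_seq)
```

The `(extreme + 1) / (n_perm + 1)` estimator keeps the p-value strictly positive, a common convention for Monte Carlo permutation tests.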