In the following, we describe the individual aspects of the evaluation design space.
3.4.1 Types of Data.
The essential basis for the evaluation of recommender systems is data. The characteristics of data can be manifold and may depend on the type of data used for computing the actual recommendations, among other factors. In the following, we give a brief overview of the different characteristics of data that may be used when evaluating RS.
Implicit and Explicit Rating Data. User ratings are usually collected by observing user behavior, which may, for instance, include records of the items that a user consumed, purchased, rated, viewed, or explored (e.g., pre-listening to songs), where the source may be an existing dataset or one that is collected for the respective study. When relying on observations of user behavior while interacting with a RS, we typically distinguish between explicit and implicit feedback [
105,
120]. Explicit feedback is provided directly by the user, and the data unequivocally captures the user’s perception of an item. Platforms that employ recommender systems frequently integrate mechanisms that allow users to explicitly express their interest in or preference for a specific item via rating scales (e.g., a five-star rating scale, likes, thumbs-up, or thumbs-down). The rating scales used for providing explicit feedback usually allow for expressing both positive and negative preferences (e.g., a scale from “I like it a lot” to “I do not like it”).
Implicit feedback, in contrast, is inferred from a user’s observable and measurable behavior when interacting with a RS (e.g., purchases, clicks, dwell time). When relying on implicit feedback, evaluations presume that, for instance, a consumed item is a quality choice, while all other items are considered irrelevant [
15]. Hence, implicit feedback is typically positive only (e.g., purchase, click), while the absence of such information does not imply that the user does not like an item (e.g., a user not having listened to a track does not imply that the user dislikes the track). Some scenarios also provide opportunities for negative implicit feedback, for instance, the skipping of songs. Furthermore, implicit feedback can be used to infer relative preferences (for example, a user who watched one movie ten times but most other movies only once, or the play counts of songs in a music RS). Thus, implicit feedback may be mapped to a degree of preference, ranging on a continuous scale up to its positive extremity [
120]. When interpreting implicit feedback, the assumption is that a specific behavior indicates quality, regardless of whether that behavior may have other causes; for example, closing a music streaming app may be mistakenly interpreted as a skip (i.e., negative feedback) [
31], or the behavior may be influenced by interruptions or distractions [
56].
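To make the mapping of implicit signals to preference degrees concrete, the following minimal Python sketch converts raw play counts into a graded preference; the log-scaling and the function name are our own illustrative choices, not prescribed by the cited works, and missing interactions are treated as unknown rather than negative.

```python
import math

def playcount_to_preference(play_count, max_count):
    """Map a raw play count to a preference degree in (0, 1].

    Log-scaling is an illustrative choice; the absence of a play count
    yields None (unknown) rather than a negative preference.
    """
    if play_count is None or play_count <= 0:
        return None  # no implicit signal: do not infer dislike
    return math.log1p(play_count) / math.log1p(max_count)

# Example: play counts per (user, track); tracks never played stay unknown.
play_counts = {("u1", "t1"): 25, ("u1", "t2"): 1}
max_count = max(play_counts.values())
preferences = {pair: playcount_to_preference(c, max_count) for pair, c in play_counts.items()}
```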
Most of the research in RS has focused on either explicit or implicit data [
120], while comparatively few works have combined these two heterogeneous types of feedback (e.g., References [
147,
151,
152]). Table
3 summarizes the characteristics of explicit and implicit feedback. Explicit feedback is more accurate than implicit feedback, which is inferred from behavior based on assumptions (e.g., the assumption that users only click on items they are interested in). Typically, when users navigate through a platform that employs a RS, an abundance of data about user behavior is logged. In contrast, users are reluctant to explicitly rate items [
100,
128], which leads to comparatively little explicit feedback data. Note that explicit feedback tends to concentrate on either side of the rating scale, because users are more likely to express their preferences if they feel strongly in favor of or against an item [
12].
Although explicit and implicit feedback are heterogeneous types of feedback [
120], research investigating the relations between implicit and explicit feedback for preference elicitation has shown that using implicit feedback is a viable alternative [
165]. Still, implicit measures may reveal aspects that explicit measures do not [
211]—particularly when user self-reports are not consistent with the actual user behavior. Integrating both the observation of actual user behavior and users’ self-reports on intentions and perceptions may deliver rich insights for which each approach in isolation would be insufficient.
Note that many evaluation designs presume that a consumed item is a viable option in other contexts as well (e.g., another time, location, or activity) and consider item consumption as generally valid positive implicit feedback. What the user indeed experiences, however, remains unclear. The validity of the feedback for other contexts depends on the design of the feedback mechanism. For instance, an item rated with five stars may be the user’s lifetime favorite, but still not suitable for a certain occasion (e.g., a ballad for a workout, or a horror movie when watching with kids).
User and Item Information. RS algorithms typically rely heavily on rating data for the computation of recommendations, where the computations are mostly based solely on the user-item matrix. However, these approaches have been shown to suffer from sparsity and from the cold-start problem, where recommendations for new items or users cannot be computed accurately because there is not enough information on the user or item, respectively. Therefore, metadata on the users, items, or context can also be incorporated to further enhance recommendations (this information is often referred to as “side information”) [
75,
162]. For instance, keywords describing the item may be extracted from, e.g., reviews on the item [
10] or social ties between users can be extracted from relationships in social networks [
154,
210]. Furthermore, when working toward business-oriented goals and metrics (cf. Section
3.4.4), data such as revenue information or click-through rates also have to be logged and analyzed [
112]. In addition, context information is useful when users are expected to have different preferences in different contexts (e.g., watching a movie in a cinema or at home [
187]).
Qualitative and Quantitative Data. Besides collecting behavioral user data (e.g., implicit feedback logged during user interactions with the system), evaluations may also rely on qualitative or quantitative evidence where data is gathered directly from the user. Quantitative data collection methods are highly structured instruments—such as scales, tests, surveys, or questionnaires—which are typically standardized (e.g., same questions, same scales). This standardization facilitates validity and comparability across studies. Quantitative evidence allows for a deductive mode of analysis using statistical methods; answers can be compared and interrelated and allow for generalization to the population. Qualitative evidence is frequently deployed to understand the sample studied. Commonly used data collection methods include interviews, focus groups, and participant observations, where data is collected in the form of notes, videos, audio recordings, images, or text documents [
80].
Natural and Synthetic Data. Herlocker et al. [
101] distinguish between natural and synthetic datasets. While natural datasets capture user interactions with a RS or are directly derived from those, synthetic datasets are artificially created (e.g., References [
58,
223]). Natural datasets contain (historical) data that may capture previous interactions of users with a RS (e.g., user behavior such as clicks or likes), data that may be associated with those interactions (e.g., data that reflects users’ attitudes and feelings while interacting with a RS), or data derived from user interactions (e.g., turnover attributed to recommendations). In cases where a natural real-world dataset that is sufficiently suitable for developing, training, and evaluating a RS is not available, a synthesized dataset may be used. Such a synthesized dataset allows for modeling specific critical aspects that should be evaluated. For instance, a synthesized dataset may be created to reflect out-of-the-norm behavior. Herlocker et al. [
101] stress that a synthetic dataset should only be used in the early stages of developing a RS and that synthesized datasets cannot simulate and represent real user behavior. Yet, not only user-behavior-related data can be synthesized. For instance, Jannach and Adomavicius [
110] use fictitious profit values to investigate profitability aspects of RS.
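As an illustration of how such a synthetic dataset might be constructed, the sketch below samples explicit ratings with a skewed item popularity; all names and distributional choices (Zipf-like weights, a fixed density, uniform ratings) are assumptions made for this example only.

```python
import random

def synthesize_ratings(n_users=100, n_items=50, density=0.05, seed=42):
    """Create a synthetic explicit-feedback dataset of (user, item, rating) tuples.

    Item popularity is skewed so that a few items collect most ratings;
    all distributional choices are illustrative, not prescriptive.
    """
    rng = random.Random(seed)
    # Zipf-like popularity weights over items.
    weights = [1.0 / (rank + 1) for rank in range(n_items)]
    ratings = []
    for user in range(n_users):
        n_interactions = max(1, int(density * n_items))
        items = rng.choices(range(n_items), weights=weights, k=n_interactions)
        for item in set(items):
            ratings.append((user, item, rng.randint(1, 5)))
    return ratings

synthetic = synthesize_ratings()
```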
3.4.2 Data Collection.
Data collection methods may be distinguished based on whether they consider historical or contemporary events: methods may rely on past events (e.g., existing datasets, data retrieved from social media) or investigate contemporary events (e.g., observations, laboratory experiments) [
221]. In the following, we give an overview of data collection aspects.
User Involvement. Evaluation methods may be distinguished with respect to user involvement. While offline studies do not require user interaction with a RS, user-centric evaluations need users to be involved, which is typically more expensive in terms of time and money [
93,
189]; this is especially true for online evaluations with large user samples (cf. Section
3.3).
Randomized control trials are often considered the gold standard in behavioral science and related fields. In terms of RS evaluation, this means that users are recruited for the trial and randomly allocated to the RS to be evaluated (i.e., intervention) or to a standard RS (i.e., baseline) as the control. This procedure is also referred to as A/B-testing (e.g., References [
54,
135,
136]). Randomized group assignment minimizes selection bias, keeping the participant groups that encounter an intervention or the baseline as similar as possible. Presuming that the environment can control for all the remaining variables (i.e., keeping the variables constant), the different groups allow for comparing the proposed system to the baselines. For instance, randomized control trials that are grounded on prior knowledge (e.g., observations or theory) [
217] and where the factors measured (and the instruments used for measuring these factors) are carefully selected may help determine whether an intervention was effective [
42]; explaining presumed causal links in real-world interventions is often too complex for experimental methods.
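A minimal sketch of the random assignment step underlying such A/B tests is shown below; hash-based bucketing is one common implementation choice, and the function and experiment names are hypothetical.

```python
import hashlib

def assign_group(user_id: str, experiment: str = "rec-algo-v2") -> str:
    """Deterministically assign a user to 'treatment' or 'control'.

    Hashing the user id together with the experiment name keeps the assignment
    stable across sessions while remaining effectively random across users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Example: route each incoming user either to the new RS or to the baseline.
print(assign_group("user-123"))
```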
While randomized control trials are conducted in laboratory settings, experiments in field settings are typically referred to as “social experiments.” The term social experiment covers research in a field setting where investigators treat whole groups of people in different ways [
221]. In online environments, this is referred to as online field experiment [
47]. In field settings, the investigator has only partial control. Field settings have the advantage that outcomes are observed in a natural, real-world environment rather than in an artificial laboratory environment—in the field, people are expected to behave naturally. Overall, though, field experiments are always less controlled than laboratory experiments and are more difficult to replicate [
150]. For RS evaluation, an online field experiment [
47] very often requires collaboration with a RS provider from industry, who is commercially oriented and may not be willing to engage in risky interventions that may cause a loss of users and/or revenue. However, for the 2017 RecSys Challenge, for example, the best job recommendation approaches (determined by offline experiments) were also rolled out in XING’s production systems for online field experiments. Besides collaborating with industry, a number of online field experiments have been carried out using research systems (e.g., MovieLens) (e.g., References [
47,
227]). However, when carrying out a study with a research system, one also has to build a user community for it, which is often too great an investment just to carry out an experiment. This is why many researchers have argued for funding shared research infrastructure (in both Europe and the USA), including a system with actual users [
137].
It is important to note that it is rarely feasible to repeat studies with user involvement for a substantially different set of algorithms and settings. System-centric (offline) evaluations are, in contrast, easily repeatable with varying algorithms [
93,
101,
189]. However, offline evaluations have several weaknesses. For instance, data sparsity limits the coverage of items that can be evaluated. Also, the evaluation does not capture any explanation of why a particular system or recommendation is preferred by a user (e.g., recommendation quality, aesthetics of the interface) [
101]. Knijnenburg et al. [
132,
133] propose a theoretical framework for user-centric evaluations that describes how users’ personal interpretation of a system’s critical features influences their experience and interaction with a system. In addition, Herlocker et al. [
101] describe various dimensions that may be used to further differentiate user study evaluations. Examples of user-centric evaluations can be found, for instance, in References [
51,
63,
70,
188].
Overall, while system-centric methods without user involvement typically aim to evaluate the RS from an algorithmic perspective (e.g., in terms of accuracy of predictions), user involvement opens up possibilities for evaluating user experience [
189].
User Feedback Elicitation. At the core of many recommender systems are user preference models. Building such models requires eliciting feedback from users, for which—at runtime—data is typically collected while users interact with the RS. For evaluation purposes, we can leverage a wider variety of methods for data collection. For instance, besides considering interaction logs, observation [
127] may be used to elicit users’ behavior. An alternative method is to ask users about their behavior or intentions in a particular scenario. Such self-reports may cover what users have done in the past or what they intend to do in a certain context. However, self-reports may not be consistent with actual user behavior [
43,
141,
211], because the link between an individual’s attitude and behavior is generally not very strong [
7]. Furthermore, the process of reporting on one’s behavior may itself induce reflection and actual change of behavior, which is known as the question-behavior effect [
201]. It is, thus, good practice to combine self-report data with other information or to apply adjustment methods, because such an assessment considering several perspectives is more likely to provide an accurate picture [
11].
For the elicitation of feedback on user experience, Pu et al. [
172] propose an evaluation framework called ResQue (Recommender systems’ Quality of user experience) that aims to evaluate a comprehensive set of features of a RS: the system’s usability, usefulness, interaction qualities, the influence of these qualities on users’ behavioral intentions, aspects influencing adoption, and so on. ResQue provides specific questionnaire items and is, thus, considered highly operational. The framework by Knijnenburg et al. [
133] for the user-centric evaluation of recommender systems takes a more abstract approach. It describes the structural relationships between higher-level concepts without tying the concepts to specific questionnaire items. Therefore, it provides the flexibility to use and adapt the framework for various RS purposes and contextual settings and allows researchers to define and operationalize a set of specific, lower-level constructs. Both frameworks (i.e., Knijnenburg et al. [
133] and Pu et al. [
172]) may be integrated in user studies and online evaluations alike.
Existing Datasets. One advantage of relying on existing datasets is that (offline) evaluations can be conducted early in a project. In comparison to soliciting and evaluating contemporary events, it is frequently “easier” and less expensive in terms of money and time to rely on historical data [
93]. Also, by utilizing popular datasets (e.g., the MovieLens dataset [
97]), results can be compared with similar research. However, such an evaluation is restricted to the past. For instance, the goal of a leave-
\(n\)-out analysis [
32] is to analyze to what extent recommender algorithms can reconstruct past user interactions. Hence, such an evaluation can only serve as a baseline evaluation measure, because it only considers items that a user has already used in the past, assuming that unused items would not be used even if they were actually recommended [
93]. Additional items that users might still consider useful are not considered in the evaluation, because ratings for these items are not contained in the dataset [
224]. This is also stressed by Gunawardana et al. [
93] by the following scenario: “For example, a user may not have used an item because she was unaware of its existence, but after the recommendation exposed that item the user can decide to select it. In this case, the number of false positives is overestimated.”
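To illustrate, the sketch below shows one possible leave-\(n\)-out protocol over historical interaction data; holding out each user's \(n\) most recent interactions is an assumption of this example, and other variants hold out randomly sampled interactions instead.

```python
from collections import defaultdict

def leave_n_out_split(interactions, n=1):
    """Split (user, item, timestamp) tuples into train and test sets.

    For each user, the n most recent interactions are held out for testing;
    the recommender must reconstruct them from the remaining history.
    Users with at most n interactions contribute only to the test set.
    """
    per_user = defaultdict(list)
    for user, item, ts in interactions:
        per_user[user].append((ts, item))
    train, test = [], []
    for user, events in per_user.items():
        events.sort()  # chronological order
        for ts, item in events[:-n]:
            train.append((user, item))
        for ts, item in events[-n:]:
            test.append((user, item))
    return train, test
```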
Another risk is that the dataset chosen might not be (sufficiently) representative—the more realistic and representative the dataset is for real user behavior, the more reliable the results of the offline experiments are [
93]. In fact, the applicability of the findings gained in an evaluation based on a historic dataset is highly impacted by the “quality, volume and closeness of the evaluation dataset to the data which would be collected by the intended recommender system” [
81].
Table
4 lists datasets widely used for evaluating recommender systems and their main characteristics, such as the domain, size, rating type, and examples of papers that have utilized the dataset in the evaluation of their system. There are different MovieLens datasets, differing in the number of ratings contained (from 100K ratings in the ML100K dataset to 20M ratings in the ML20M dataset; we list ML1M and ML20M in the table). Alternatively, the yearly RecSys Challenge also provides datasets, with the application domain and task changing every year (including job, music, or accommodation (hotel) recommendation).
3.4.4 Evaluation Metrics.
There is an extensive number of facets of RS that may be considered when assessing the performance of a recommendation algorithm [
92,
93]. Consequently, the evaluation of RS relies on a diverse set of metrics, which we briefly summarize in the following. The presented metrics can be utilized for different experiment types; however, we note that due to the dominance of offline experiments, most of the presented metrics stem from offline settings.
In their early work on RS evaluation, Herlocker et al. [
101] differentiate metrics for quantifying predictive accuracy, classification accuracy, rank accuracy, and prediction-rating correlation. Along the same lines, Gunawardana and Shani [
92] investigate accuracy evaluation metrics and distinguish metrics based on the underlying task (rating prediction, recommending good items, optimizing utility, recommending fixed recommendation lists). Said et al. [
189] classify the available metrics into classification metrics, predictive metrics, coverage metrics, confidence metrics, and learning rate metrics. In contrast, Avazpour et al. [
16] provide a more detailed classification, distinguishing 15 classes of evaluation dimensions; these range, for instance, from correctness to coverage, utility, robustness, and novelty. Gunawardana et al. [
93] distinguish prediction accuracy (rating prediction accuracy, usage prediction, ranking measures), coverage, novelty, serendipity, diversity, and confidence.
Chen and Liu [
45] review evaluation metrics from four different perspectives (or rather, disciplines): machine learning (e.g., mean absolute error), information retrieval (e.g., recall or precision), human-computer interaction (e.g., diversity, trust, or novelty), and software engineering (e.g., robustness or scalability).
In the following, we discuss the most widely used categories of evaluation metrics. Table
5 gives an overview of these metrics, which we classify along the lines of previous classifications. For an extensive overview of evaluation metrics in the context of recommender systems, we refer to References [
45,
87,
92,
93,
101,
166,
195]. Several works [
185,
209] have shown that the metrics implemented in different libraries for RS evaluation (Section
3.4.5) sometimes use the same name while measuring different things, which leads to different results given the same input. Similarly, Bellogín and Said [
24] report that papers present different variations of metrics (e.g., normalized vs. non-normalized; computed over the entire dataset or on a per-user basis and then averaged), and that sometimes the details of the evaluation protocol are not reported in papers [
24,
36]. Tamm et al. [
209] conclude that the more complex a metric is, the more room there is for different interpretations of the metric, leading to different variations of metric implementations. As a result, this might lead to misinterpretations of results within an evaluation [
209], and limits the comparability across evaluations [
24,
36,
185,
209]. In line with previous works [
24,
36], we urge a more detailed description of evaluation protocols, as this will strengthen reproducibility and improve accountability [
24].
Fundamentally, we emphasize that it is important to evaluate a RS with a suite of metrics, because a one-metric evaluation will—in most cases—be one-sided and cannot characterize the broad performance of a RS. When optimizing a RS for one metric, it is crucial to also evaluate whether this optimization sacrifices performance elsewhere in the process [
87,
101]. For instance, it is doubtful whether a RS algorithm optimized for prediction accuracy while sacrificing performance in terms of diversity, novelty, or coverage is overall desirable. Similarly, a RS that performs equally across various user groups, but with similarly low accuracy and low diversity for all of them, is unlikely to provide a good user experience for any user. It is, thus, crucial to measure—and report—a set of complementary metrics. In many cases, it will be key to find a good balance across metrics.
Prediction accuracy refers to the extent to which the RS can predict user ratings [
93,
101]. These metrics quantify the error of the rating prediction performed by the RS (i.e., the difference between the predicted rating and the actual rating in a leave-
\(n\)-out setting). The most widely used prediction accuracy metrics are mean absolute error and root mean squared error.
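For illustration, a minimal implementation of both error metrics over paired predicted and actual ratings might look as follows (the function names are ours):

```python
import math

def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error; penalizes large errors more strongly than MAE."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

print(mae([4.2, 3.1], [5, 3]), rmse([4.2, 3.1], [5, 3]))
```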
Usage prediction metrics can be seen as classification metrics that capture the rate of correct recommendations—in a setting where each recommendation can be classified as relevant or non-relevant [
92,
93,
101]. This involves binarizing ratings: e.g., on a rating scale of 1–5, ratings of 1–3 may be considered non-relevant and ratings of 4 and 5 relevant. The most popular usage prediction metrics are recall, precision, and the F-score, which combines recall and precision. Precision is the fraction of recommended items that are also relevant. In contrast, recall measures the fraction of relevant items that are indeed recommended. Often, the evaluation is restricted to the \(k\) top-ranked recommendations, capturing the system’s ability to identify the \(k\) most suitable items for a user rather than evaluating all recommendations (referred to as recall@\(k\) and precision@\(k\), respectively) [
101]. Alternatively, the receiver operating characteristic curve can also be used to measure usage prediction, where the true positive rate is plotted against the false positive rate for various recommendation list lengths
\(k\). These curves can also be aggregated into a single score by computing the area under the ROC curve (AUC).
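A compact sketch of precision@\(k\) and recall@\(k\) for a single user is given below; dividing precision by \(k\) even when fewer than \(k\) items are recommended is one common convention and an assumption of this example.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    `recommended` is a ranked list of item ids, `relevant` a set of item ids
    the user found relevant (e.g., rated 4 or 5 on a 1-5 scale).
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall_at_k(["a", "b", "c", "d"], {"b", "d", "e"}, k=3))
```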
Ranking metrics are used to quantify the quality of the ranking of recommendation candidates [
92,
166]. Relevant recommendations that are ranked higher are scored higher, whereas relevant items that are ranked lower receive a discounted score. Typical ranking metrics include normalized discounted cumulative gain (NDCG) [
119], or mean reciprocal rank (MRR) [
215].
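The following sketch computes NDCG@\(k\) with a logarithmic discount and the reciprocal rank for a single ranked list; the graded-relevance input format is an assumption of this example.

```python
import math

def ndcg_at_k(recommended, relevance, k):
    """NDCG@k with graded relevance (dict: item -> gain) and a log2 rank discount."""
    gains = [relevance.get(item, 0) for item in recommended[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def reciprocal_rank(recommended, relevant):
    """Reciprocal rank of the first relevant item (0 if none is found);
    averaging this value over users yields MRR."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0
```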
Diversity refers to the dissimilarity of the items recommended [
38,
125,
143,
214], where low similarity values mean high diversity. Diversity is often measured by computing the intra-list diversity [
200,
231], i.e., by aggregating the pairwise similarity of all items in the recommendation list. Here, similarity can be computed, e.g., using the Jaccard or cosine similarity [
125].
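A minimal sketch of intra-list diversity based on Jaccard dissimilarity over item attribute sets (e.g., genres) might look as follows; the attribute-set representation of items is an assumption of this example.

```python
def intra_list_diversity(recommended, item_features):
    """Average pairwise dissimilarity (1 - Jaccard) of the recommended items.

    `item_features` maps each item to a set of attributes (e.g., genres);
    higher values indicate a more diverse recommendation list.
    """
    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    pairs = [(i, j) for idx, i in enumerate(recommended) for j in recommended[idx + 1:]]
    if not pairs:
        return 0.0
    dissimilarities = [1.0 - jaccard(item_features[i], item_features[j]) for i, j in pairs]
    return sum(dissimilarities) / len(pairs)
```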
Novelty metrics aim at measuring to what extent recommended items are novel [
38]. Item novelty [
107,
230] refers to the fraction of recommended items that are indeed new to the user, whereas global long-tail novelty measures the global novelty of items—i.e., whether an item is known by only a few users and, hence, is in the long tail of the item popularity distribution [
32,
40].
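The sketch below illustrates one common popularity-based formulation of long-tail novelty, scoring items by their self-information; treating never-seen items as maximally novel is an assumption of this example.

```python
import math

def long_tail_novelty(recommended, item_popularity, n_users):
    """Mean self-information of the recommended items.

    `item_popularity` maps an item to the number of users who interacted with
    it; items known to few users contribute higher novelty. Items without any
    recorded interaction are assigned the maximal score (an assumption).
    """
    scores = []
    for item in recommended:
        p = item_popularity.get(item, 0) / n_users
        scores.append(-math.log2(p) if p > 0 else math.log2(n_users))
    return sum(scores) / len(scores) if scores else 0.0
```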
Serendipity describes how surprising recommendations are to a user and, hence, is tightly related to novelty [
125,
159]. However, as Gunawardana et al. [
93] note, recommending a movie starring an actor that the user has liked in the past might be novel, but not necessarily surprising to the user. The so-called unexpectedness measure compares the recommendations produced by a serendipitous recommender to the recommendations computed by a baseline [
159]. Building on the unexpectedness measure, serendipity can be measured by the fraction of relevant and unexpected recommendations in the list [
125] or the unexpectedness measure [
2].
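A simplified sketch of the first variant, i.e., the fraction of recommendations that are both unexpected (not returned by a baseline recommender) and relevant, is shown below; the input format is an assumption of this example.

```python
def serendipity(recommended, baseline_recommended, relevant):
    """Fraction of recommended items that are both unexpected and relevant.

    Unexpected items are those not produced by a (e.g., popularity-based)
    baseline recommender; relevance comes from held-out user feedback.
    """
    if not recommended:
        return 0.0
    unexpected = [item for item in recommended if item not in set(baseline_recommended)]
    hits = [item for item in unexpected if item in relevant]
    return len(hits) / len(recommended)
```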
Coverage metrics describe the extent to which items are actually recommended [
4,
87]. This includes catalog coverage (i.e., the fraction of all available items that can be recommended; often referred to as item space coverage) [
189], user space coverage [
93] (i.e., the fraction of users for whom recommendations can be computed; often also referred to as prediction coverage [
87]), or measuring the distribution of items chosen by users (e.g., by using the Gini index or Shannon entropy) [
93]. Coverage metrics are also used to measure fairness, because coverage captures the share of items or users that are served by the RS.
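For illustration, the following sketch computes catalog coverage and a Gini index over a set of recommendation lists; items are assumed to be identified by ids drawn from a catalog of known size.

```python
from collections import Counter

def catalog_coverage(recommendation_lists, catalog_size):
    """Fraction of the item catalog that appears in at least one recommendation list."""
    recommended_items = {item for rec_list in recommendation_lists for item in rec_list}
    return len(recommended_items) / catalog_size

def gini_index(recommendation_lists, catalog_size):
    """Gini index of how evenly recommendations are spread over the catalog.

    0 means all items are recommended equally often; values close to 1
    indicate that recommendations concentrate on very few items.
    """
    counts = Counter(item for rec_list in recommendation_lists for item in rec_list)
    # Frequencies for all catalog items, including those never recommended.
    frequencies = sorted(list(counts.values()) + [0] * (catalog_size - len(counts)))
    total = sum(frequencies)
    if total == 0:
        return 0.0
    weighted = sum((2 * rank - catalog_size - 1) * freq
                   for rank, freq in enumerate(frequencies, start=1))
    return weighted / (catalog_size * total)
```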
Fairness metrics concern both fairness across users and fairness across items. In both cases, fairness may be captured at the level of the individual or at group level. Individual fairness captures fairness (or unfairness) at the level of individual subjects [
27] and implies that similar subjects (hence, similar users or similar items) are treated similarly [
65]. Group fairness defines fairness on a group level and requires that salient subject groups (e.g., demographic groups) should be treated comparably [
66]; in other words, group fairness is defined as the collective treatment received by all members of a group [
27]. A major goal of group fairness is that protected attributes—for instance, demographic traits such as age, gender, or ethnicity—do not influence recommendation outcomes due to data bias or model inaccuracies and biases [
27,
196].
Fairness across users is typically addressed at the group level. One way to address group fairness from the user perspective is to disaggregate the user-oriented metrics to measure and compare to what extent user groups are provided with lower-quality recommendations (e.g., References [
69,
73,
74,
108,
142,
158,
196]). Yao and Huang [
220] propose three (un-)fairness metrics: value unfairness measures whether groups of users consistently receive lower or higher predicted ratings compared to their true preferences; absolute unfairness measures the absolute difference of the estimation error across groups; and underestimation/overestimation unfairness measures inconsistency in the extent to which predictions under- or overestimate the true ratings.
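A simplified sketch of the value unfairness idea for two user groups is given below; the record format and group labels are assumptions of this example, and the sketch is not the exact formulation from [220].

```python
from collections import defaultdict

def value_unfairness(records):
    """Simplified value unfairness over two user groups.

    `records` are tuples (group, item, predicted, actual) with group in
    {"A", "B"}; for each item, the signed prediction error is averaged per
    group, and the absolute difference of these averages is then averaged
    over all items rated by both groups.
    """
    errors = defaultdict(lambda: {"A": [], "B": []})
    for group, item, predicted, actual in records:
        errors[item][group].append(predicted - actual)

    diffs = []
    for item, by_group in errors.items():
        if by_group["A"] and by_group["B"]:
            mean_a = sum(by_group["A"]) / len(by_group["A"])
            mean_b = sum(by_group["B"]) / len(by_group["B"])
            diffs.append(abs(mean_a - mean_b))
    return sum(diffs) / len(diffs) if diffs else 0.0
```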
Fairness across items addresses the fair representation of item groups [
27] and is addressed both at the group level and at the level of individual items. The goal of many metrics is to measure the exposure or attention [
27,
198] an item group receives and assess the fairness of this distribution: in a ranked list of recommendations, lower ranks are assumed to get less exposure and, thus, less attention.
Beutel et al. [
26] propose the concept of pairwise fairness, which aims to measure whether items of one group are consistently ranked lower than those of another group. Other metrics relate exposure across groups to the relevance of items. The disparate treatment ratio (DTR) [
198] is a statistical parity metric that measures whether exposure across groups is proportional to relevance. Diaz et al. [
60] consider the distribution over rankings instead of a single fixed ranking. The idea behind the principle of equal expected exposure is that “no item should receive more or less expected exposure than any other item of the same relevance grade” [
60]. Biega et al. [
27] capture unfairness at the level of individual items; they propose the equity of amortized attention, which indicates whether the attention is distributed proportionally to relevance when amortized over a sequence of rankings. The disparate impact ratio (DIR) [
198] goes further than exposure and considers the impact of exposure: DIR measures, across item groups, whether items obtain proportional impact in terms of the click-through rate. The viable-
\(\Lambda\) test [
191] accounts for varying user attention patterns through parametrization in the measurement of group fairness across items.
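As a rough illustration of such exposure-based measurements, the sketch below accumulates rank-discounted exposure per item group for a single ranking; the logarithmic discount and the input format are assumptions of this example, and comparing the resulting exposure to group relevance approximates DTR-style checks.

```python
import math

def group_exposure(ranked_items, item_group):
    """Rank-discounted exposure per item group in a single ranking.

    Exposure at rank r is modeled as 1 / log2(r + 1) (an assumption);
    `item_group` maps each item id to its group label.
    """
    exposure = {}
    for rank, item in enumerate(ranked_items, start=1):
        group = item_group[item]
        exposure[group] = exposure.get(group, 0.0) + 1.0 / math.log2(rank + 1)
    return exposure
```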
Business-oriented metrics are used by service providers to assess the business value of recommendations [
112]. While service providers are naturally interested in user-centered metrics, as a positive user experience impacts revenue, business-oriented metrics allow them to directly measure click-through rates [
55,
86,
89,
126], adoption and conversion rates [
55,
89], and revenue [
46,
145]. Click-through rates measure the number of clicks generated by recommendations, whereas adoption and conversion rates measure how many clicks actually lead to the consumption of recommended items. Therefore, adoption and conversion rates, and even more so, the sales and revenue generated by recommended items, more directly measure the generated business value of recommendations.