5.1 Experimental setup
Dataset preprocessing. We use six freely available datasets (see Table 6) from [46]: Lastfm; Ml-1m [14]; Book-x [50]; Amazon-lb, Amazon-dm, and Amazon-is [30]. We remove users/items with \(\lt 5\) interactions and use an \(80\%/10\%/10\%\) train/validation/test split, with a user-based random split for Lastfm and Book-x (timestamps are not available), and a user-based temporal split for all other datasets, i.e., the last \(10\%\) of each user’s interactions are in the test set. We convert ratings \(\ge 3\) on Ml-1m and Amazon-* and ratings \(\ge 6\) on Book-x to 1, and discard the rest of the ratings. We choose these thresholds because ratings range from 1–5 in Ml-1m and Amazon-*, and from 0–10 in Book-x. We do not convert ratings for Lastfm, as it uses implicit feedback, so all interactions have a value of 1. For duplicate user-item interactions, we keep the last one.
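For concreteness, a minimal pandas sketch of these preprocessing steps is shown below; the column names and the threshold parameter are illustrative assumptions rather than the exact pipeline used in our code.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, threshold: int = 3, min_interactions: int = 5) -> pd.DataFrame:
    """Binarize ratings, deduplicate, and drop sparse users/items.

    Assumes columns user_id, item_id, rating, timestamp (illustrative names).
    """
    # Keep ratings at or above the dataset-specific threshold
    # (e.g., 3 for Ml-1m and Amazon-*, 6 for Book-x) and binarize them.
    df = df[df["rating"] >= threshold].copy()
    df["rating"] = 1

    # For duplicate user-item pairs, keep only the last interaction.
    df = df.sort_values("timestamp").drop_duplicates(["user_id", "item_id"], keep="last")

    # Remove users/items with fewer than `min_interactions` interactions
    # (a single filtering pass; the exact filtering procedure may differ).
    user_ok = df.groupby("user_id")["item_id"].transform("size") >= min_interactions
    item_ok = df.groupby("item_id")["user_id"].transform("size") >= min_interactions
    return df[user_ok & item_ok]
```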
Recommenders. For recommendation we use: Pop [34] (recommends the k most popular items), item-based K-Nearest Neighbours (ItemKNN) [8], Sparse Linear Method (SLIM) [31], Bayesian Personalized Ranking (BPR) [35], Neural Graph Collaborative Filtering (NGCF) [39], Neural Matrix Factorization (NeuMF) [15], and the Variational Autoencoder with multinomial likelihood (MultiVAE) [23]. We use a training batch size of 4096, Adam [20] as the optimizer, and the RecBole library [46]. We train BPR, NGCF, NeuMF, and MultiVAE for 300 epochs, with early stopping after 10 epochs without improvement, and keep the model that produces the best NDCG@10 on the validation set. We tune the hyperparameters of all models except Pop with RecBole’s hyperparameter tuning module. The hyperparameter search space and optimal hyperparameters are in Appendix B.1. For all recommenders, when we generate the recommendation list for a user during testing, the items in the user’s train or validation set are placed at the end of the user’s list to avoid re-recommending them.
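A minimal sketch of how such a training run can be launched with RecBole is shown below; configuration keys and defaults vary across RecBole versions, so the values shown are illustrative rather than the exact settings used in our experiments (see Appendix B.1 for the tuned hyperparameters).

```python
from recbole.quick_start import run_recbole

# Illustrative configuration only; see Appendix B.1 for the actual search
# space and the tuned hyperparameters per model and dataset.
config_dict = {
    "train_batch_size": 4096,
    "learner": "adam",          # Adam optimizer
    "epochs": 300,
    "stopping_step": 10,        # early-stopping patience
    "valid_metric": "NDCG@10",  # model selection criterion
    "topk": [10],
}

run_recbole(model="BPR", dataset="ml-1m", config_dict=config_dict)
```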
Measures. We evaluate models w.r.t. (a) relevance-only measures (HR, MRR, Precision (P), Recall (R), MAP, NDCG), and (b) individual item fairness measures, both the original and our corrected measures. All measures are computed at \(k=10\), unless otherwise stated. We evaluate on the full test set of items instead of a sample of them, as the latter is known to yield misleading results [21]. This leads to lower performance than reported when sampling the test set. Lastly, for Ent, we use the logarithm with base n. For VoCD we choose the values of \(\alpha\) and \(\beta\) such that VoCD maintains comparability with the other fairness measures: all recommended items are considered similar (\(\alpha =2\)), and thus A is the set of all possible pairs of different items in the top k, without any tolerance for coverage disparity (\(\beta =0\)). We also choose this configuration to avoid reliance on similarity scores based on item embeddings. For II-D and AI-D, we use \(\gamma =0.8\) [43].
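For reference, the sketch below shows how some of the fairness measures can be computed from per-item recommendation counts, following the standard formulas for Jain’s index, entropy (here with log base n, as in our setup), and the Gini coefficient. It is a simplified illustration, not the corrected versions introduced in Section 4.

```python
import numpy as np

def jain_index(counts: np.ndarray) -> float:
    """Jain's fairness index over per-item exposure counts (length n)."""
    return float(counts.sum() ** 2 / (len(counts) * (counts ** 2).sum()))

def entropy_base_n(counts: np.ndarray) -> float:
    """Entropy of the exposure distribution, using log base n as in our setup.

    Items with zero exposure make p*log(p) undefined; the original Ent fails
    exactly in that case (Section 5.2.2), so zero terms are skipped here only
    to keep the sketch runnable.
    """
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * (np.log(p) / np.log(len(counts)))).sum())

def gini(counts: np.ndarray) -> float:
    """Gini coefficient of the exposure distribution (0 = equal, 1 = maximally unequal)."""
    x = np.sort(counts.astype(float))
    n = len(x)
    cum = np.cumsum(x)
    return float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)

# Example: exposure counts of n=5 items aggregated over all top-k lists.
counts = np.array([40, 30, 20, 10, 0])
print(jain_index(counts), entropy_base_n(counts), gini(counts))
```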
5.2 Analysis of Relevance and Fairness
We start by studying the relevance and fairness of several recommender models. We compare (a) different recommender models w.r.t. relevance and fairness scores, and (b) different evaluation measures, including corrected and uncorrected measures. The goal of (a) is to study whether relevance and/or fairness scores vary between models, and to obtain a ranking of models that is used in subsequent analysis (Section
5.3). The goal of (b) is to study measures based on diverse concepts of fairness, and highlight their differences (and similarity, if any).
The evaluation results are shown in Table
7 for Lastfm and Ml-1m, and in Appendix
B.2 for the other datasets.
5.2.1 Discussion of Recommendation Models in Table 7.
BPR has the highest relevance scores, while MultiVAE generally has the highest fairness scores. Comparing BPR and MultiVAE, the relevance scores are higher for BPR but the fairness scores are higher for MultiVAE. E.g., in Lastfm, NDCG \(=0.223\) for BPR and NDCG \(=0.219\) for MultiVAE, while the higher-is-better fairness scores range between \([0.078, 0.656]\) for BPR and \([0.132, 0.763]\) for MultiVAE. This is not observed for all models. E.g., for ItemKNN and SLIM, better fairness tends to be accompanied by better relevance. Furthermore, other discrepancies exist: the relevance of ItemKNN and MultiVAE is on par, but some of their fairness scores differ markedly, e.g., the \(\uparrow\)Jain scores of ItemKNN are only half of those achieved by MultiVAE. Generally, the recommenders agree in the relative ordering of scores, but some models have higher scores for .\(_{\text{our}}\) than .\(_{\text{ori}}\) and vice-versa, e.g., for Lastfm, \(\downarrow\)Gini\(_{\text{ori}}\)\(\gt\)\(\downarrow\)Gini\(_{\text{our}}\) for ItemKNN but \(\downarrow\)Gini\(_{\text{ori}}\)\(\lt\)\(\downarrow\)Gini\(_{\text{our}}\) for SLIM. Overall, we observe that:
—
A recommender model that is the best in terms of relevance may also be relatively fair.
—
The recommender models mostly have a similar ordering of the scores of fairness measures: if one fairness measure has a higher value than another for recommender model X, the same typically holds for recommender model Y.
—
Some models achieve relatively similar relevance scores, but with a large disparity between their fairness scores.
5.2.2 Discussion of Fairness Evaluation Measures in Table 7.
For Jain, QF, FSat, and Gini, the scores of the original measures and our measures are similar. Both \(\uparrow\)Jain and \(\uparrow\)QF should range in \([0,1]\), but \(\uparrow\)Jain is very close to 0 (i.e., \(\sim\)0.1 or less), while \(\uparrow\)QF scores are \(\sim\)0.7 (in Lastfm) and \(\sim\)0.5 (in Ml-1m). Similarly, the scores of \(\downarrow\)II-D and \(\downarrow\)AI-D are also very close to 0, while \(\downarrow\)Gini scores are closer to 1. While these differences stem from the different underlying fairness ideas behind the measures, the large gaps in scores may cause confusion, e.g., suggesting that a recommendation is very unfair based on \(\uparrow\)Jain or \(\downarrow\)Gini, but moderately fair based on \(\uparrow\)QF. For Lastfm and Ml-1m, we also see that the absolute scores for the same recommender, e.g., MultiVAE, follow the same order from the lowest to the highest:
\(\uparrow\)Jain,
\(\uparrow\)FSat,
\(\uparrow\)QF, and
\(\uparrow\)Ent. This indicates that
\(\uparrow\)Jain tends to give lower scores (more unfair) than the other measures. We observe similar trends for
\(\downarrow\)II-D and
\(\downarrow\)AI-D, which tend to give lower scores (more fair) compared to other lower-is-better measures. The weighted
\(\downarrow\)Gini-w is also more strict than the unweighted
\(\downarrow\)Gini as
\(\downarrow\)Gini-w tends to give more unfair scores than
\(\downarrow\)Gini. We study the strictness of these measures further in Section
5.6.
We also observe that the scores of \(\downarrow\)AI-D are hardly distinguishable: they differ only in the fourth decimal place or beyond for both Lastfm and Ml-1m, whereas differences in other measures appear in the first or second decimal place. The small scores of AI-D may be due to the measure quantifying the disparity between item exposure and random exposure, which is very small when the number of items is large. This finding suggests that when computing \(\downarrow\)AI-D, care should be taken to avoid rounding errors and failure to distinguish the scores due to the floating-point format.
For all datasets and models, the original Ent cannot be calculated because it returns NaN due to zero division errors. This happens because there are items in the dataset that are not recommended. Our corrected version of this measure (Section
4) does not suffer from this problem.
For the same dataset and k, regardless of the recommender, the II-D scores are always the same. Due to the fixed number of slots km within a single recommendation round, the exposure values \(E_{u,i} \in \lbrace 1, \gamma , \dots , \gamma ^k, 0\rbrace\) (see Equation (8)) for all user-item pairs, and the number of user-item pairs having a specific exposure value \(E_{u,i}\) is always m. Both of these properties lead to constant II-D scores, as II-D is calculated by taking the mean squared difference between each \(E_{u,i}\) value and a constant value based on random expected exposure. When considering cases with multiple rounds of recommendations, the II-D score may no longer remain constant, as the user-item exposure values are aggregated across recommendation rounds, so \(E_{u,i}\) can take linear combinations of values from the set above. We have also illustrated in Section 3.1.4 that II-D is not constant across multiple recommendation rounds.
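To make the constancy argument concrete, the short sketch below only illustrates the property stated above (we do not reproduce Equation (8) here): in a single round, the multiset of exposure values \(E_{u,i}\) is identical for any recommender, so any statistic aggregated over that multiset, such as a mean squared deviation from a constant, is identical as well. The exposure exponent and the constant used are illustrative assumptions.

```python
import numpy as np

def exposure_multiset(rec_lists: np.ndarray, n_items: int, gamma: float = 0.8) -> np.ndarray:
    """Flattened exposure values E_{u,i} for a single recommendation round.

    rec_lists: (m, k) array of recommended item ids per user. Exposure of the
    item at rank r (1-based) is taken as gamma**(r-1); unrecommended items get
    0. (The exact exponent in Equation (8) may differ; this is illustrative.)
    """
    m, k = rec_lists.shape
    E = np.zeros((m, n_items))
    ranks = np.arange(k)
    for u in range(m):
        E[u, rec_lists[u]] = gamma ** ranks
    return E.ravel()

rng = np.random.default_rng(0)
m, n, k = 50, 200, 10
# Two very different single-round recommenders: random vs. "same k items to everyone".
rand_lists = np.array([rng.choice(n, size=k, replace=False) for _ in range(m)])
same_lists = np.tile(np.arange(k), (m, 1))

for lists in (rand_lists, same_lists):
    E = exposure_multiset(lists, n)
    # Any statistic of the exposure multiset, e.g., a mean squared deviation
    # from a constant "random exposure" value, is identical for both.
    print(np.mean((E - k / n) ** 2))
```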
Overall, we observe that:
—
The different fairness measures have different ranges in these experiments, even though theoretically they have the same range.
—
The original Ent is always incomputable in the experiments, and our corrected Ent resolves the issue.
—
II-D\(_{\text{ori}}\) remains constant for the same dataset, rendering this measure notably less meaningful under this single-round experimental set-up.
—
Both \(\downarrow\)II-D\(_{\text{ori}}\) and \(\downarrow\)AI-D\(_{\text{ori}}\) have minuscule values, indicating near-perfect fairness even though this contradicts other fairness scores.
5.3 Correlation between Measures
When comparing different recommender models, the ranking of the models (e.g., from the most to the least fair) is sometimes of more interest than the absolute values of the measures that we have seen in Table
7. Motivated by this, we analyse the measures’ correlation in order to study the agreement of model rankings based on different measures of relevance and fairness. We compare: (1) the agreement between measures of the same type (relevance or fairness); (2) the agreement between measures of different types; (3) the agreement between the original measures and our corrections to the original measures; and (4) the agreement between measures across different datasets. Through this analysis, we also gain insights into how measures that capture different fairness concepts (dis)agree with one another.
We use Kendall’s
\(\tau\) between measures to compute ranking agreement. Figures
1–
2 show the Kendall’s
\(\tau\) values between relevance measures and fairness measures for Lastfm and Ml-1m (see Appendix
B.3 for the other datasets). The computation is as follows: for each dataset, we rank the models from the most to the least relevant or from the most to the least fair score. We omit Ent\(_{\text{ori}}\) as it produces NaN in Table 7, and we also omit II-D as its scores are the same across models for a given dataset. We compute the significance of the correlations and correct for multiple testing within each dataset using the Benjamini-Hochberg (BH) procedure, which is based on the false discovery rate [3]. Upon correction, some correlations remain significant; these are indicated by an asterisk (
\(^*\)) in Figures
1–
2.
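For reference, a compact sketch of this correlation analysis with SciPy and statsmodels is shown below; the scores dictionary, mapping each measure to the per-model scores of one dataset, is a hypothetical stand-in for the values in Table 7.

```python
from itertools import combinations

from scipy.stats import kendalltau
from statsmodels.stats.multitest import multipletests

# Hypothetical per-measure scores for the same ordered list of models.
scores = {
    "NDCG": [0.22, 0.18, 0.15, 0.10, 0.05],
    "Jain": [0.13, 0.08, 0.07, 0.05, 0.02],
    "Gini": [0.80, 0.88, 0.90, 0.93, 0.97],  # lower is better
}

pairs, taus, pvals = [], [], []
for a, b in combinations(scores, 2):
    tau, p = kendalltau(scores[a], scores[b])
    pairs.append((a, b)); taus.append(tau); pvals.append(p)

# Benjamini-Hochberg correction over all measure pairs within one dataset.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for (a, b), tau, p, sig in zip(pairs, taus, p_adj, reject):
    print(f"{a} vs {b}: tau={tau:.2f}, adjusted p={p:.3f}{' *' if sig else ''}")
```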
We first analyse the correlation among measures of the same type. The relevance measures are highly correlated with each other:
\([0.81,1]\) for Lastfm and
\([0.52,1]\) for Ml-1m, as expected [
42]. The fairness measures are also strongly correlated with each other:
\([0.71,1]\) for Lastfm and
\([0.81,1]\) for Ml-1m, except VoCD
\(_{\text{ori}}\) for Lastfm,
\([0.43,0.71]\). This is expected from how the measures treat items when computing fairness; VoCD
\(_{\text{ori}}\) only considers items in the recommendation list, while the remaining fairness measures consider all items in the dataset. For Ml-1m, all the computed correlations between fairness measures are significant after applying the BH procedure. On the other hand, after applying the same procedure for Lastfm, neither QF nor VoCD has significant correlations with the rest of the fairness measures, except for QF with Gini/Gini-w. It is also reasonable for QF not to correlate significantly with most of the measures, as it is the only measure insensitive to differences in the number of times an item is recommended.
Interestingly, even though in Table
7 the scores of
\(\uparrow\)Jain,
\(\uparrow\)QF,
\(\uparrow\)Ent, and
\(\uparrow\)FSat occupy different parts of their range, these measures are highly correlated. The same goes for
\(\downarrow\)Gini and
\(\downarrow\)AI-D. This shows that even measures based on different concepts of fairness are still capable of producing similar rankings of models. Nevertheless, the absolute scores of the measures can be misinterpreted due to the measures occupying different parts of their range.
Our corrected fairness measures are always perfectly correlated with the original fairness measures (\(\tau = 1\) in both datasets). This is expected because our corrected versions are obtained by normalization, which does not change the relative order of the models.
Regarding the correlations between measures of different types, we see different trends between relevance and fairness measures for Lastfm and Ml-1m. In Lastfm, we see moderate correlations between fairness and relevance measures,
\([0.33,0.62]\), but these are lower for Ml-1m
\([0.14,0.52]\). These findings are expected as the fairness measures do not consider relevance. None of these correlations are significant after applying the BH procedure. Yet, some correlations between fairness and relevance measures are significant for Book-x and Amazon-is (Appendix
B.3).
5.4 Max/min Achievable Fairness
The aim of this experiment is to quantify the extent to which the fairness measures can achieve their theoretical maximum and minimum fairness value (0 or 1) for different datasets and different
\(k \in \lbrace 1,2,3,5,10,15,20\rbrace\). This relates to the
non-realisability limitation (
Causes 1–3). We experiment solely with the fairness measures for which we have resolved this limitation, namely Jain, QF, Ent, Gini, Gini-w, and FSat. We primarily compare the original (uncorrected) versions of these measures against the corrected ones. We use two settings: repeatable recommendation, where items in the train/val split can be re-recommended to users, following practical cases in industry settings; and nonrepeatable recommendation, which is the typical setting for evaluating recommender systems in academic work. For each setting, we devise two recommenders: MostFair and MostUnfair. Repeatable MostFair aims at recommending each item in the dataset the same number of times. However, this is impossible if \(n \nmid km\); in this case some items are recommended \(\left\lfloor \frac{km}{n} \right\rfloor\) times while others are recommended \(\left\lfloor \frac{km}{n} \right\rfloor +1\) times. For Nonrepeatable MostFair, for each user we generate a list of recommendable items, defined as items in I that have not appeared in their corresponding train/val split. For one user at a time, we then recommend the k least popular recommendable items, based on the current recommendation lists of all users. Repeatable MostUnfair recommends the same k items to each user. Nonrepeatable MostUnfair does the same, but if any of those k items is non-recommendable for a user, it is replaced by a recommendable item. The results of this experiment for Lastfm and Ml-1m are presented in Figures
3–
6 and for the remaining datasets in Appendix
B.4. We discuss the findings below.
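The two oracle recommenders can be implemented in a few lines. The sketch below covers the repeatable setting; the nonrepeatable variants additionally restrict each user to their recommendable items. Function names and data layout are our own illustrative choices.

```python
import heapq

def repeatable_most_fair(m: int, n: int, k: int) -> list[list[int]]:
    """Recommend each of the n items either floor(km/n) or floor(km/n)+1 times."""
    # Min-heap of (times recommended so far, item id): always pick the least-exposed items.
    heap = [(0, i) for i in range(n)]
    heapq.heapify(heap)
    rec_lists = []
    for _ in range(m):
        picked = [heapq.heappop(heap) for _ in range(k)]
        rec_lists.append([item for _, item in picked])
        for count, item in picked:
            heapq.heappush(heap, (count + 1, item))
    return rec_lists

def repeatable_most_unfair(m: int, k: int) -> list[list[int]]:
    """Recommend the same k items to every user."""
    return [list(range(k)) for _ in range(m)]

# Example: 7 users, 10 items, top-3; exposure counts differ by at most 1 for MostFair.
lists = repeatable_most_fair(m=7, n=10, k=3)
```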
Theoretical maximum fairness. For both nonrepeatable and repeatable settings, all original measures fail to achieve their theoretical maximum fairness values due to the
non-realisability limitation (
Causes 2–3). The scores of the original measures get closer to the theoretical maximum fairness values as
k increases. However, these scores are still not equal to the theoretical maximum fairness value. In the original measures, having more slots due to larger
k does not guarantee higher scores. E.g., the score of
\(\uparrow\)Jain
\(_{\text{ori}}\) in Figure
3 is higher at
\(k=3\) compared to
\(k=5\), because of the changing values of
\(km \bmod n\) for different values of
k. Our corrected versions always reach their theoretical maximum fairness values for both repeatability settings, except for Gini-w
\(_{\text{our}}\). This behaviour is due to the non-realisability limitation (Cause 4), which is unresolvable for Gini-w. However, Gini-w
\(_{\text{our}}\) can still reach the theoretical most fair value when
\(k=1\) for Lastfm (Figure
4), while the original version fails to do so.
Theoretical minimum fairness. All original measures fail to reach the theoretical minimum values for all experimented values of
k for all settings due to the
non-realisability limitation (
Cause 1). This happens less frequently in our measures (Figure
5). Our measures successfully achieve the theoretical minimum fairness values under the repeatable setting, except for FSat in Lastfm (Figure
5). This is because when
\(k=1\), there are not enough slots for all items, and due to the always-fair limitation, which is unresolvable (Section 4.3), the score of \(\uparrow\)FSat is 1, which is not the theoretical minimum fairness value. Additionally, the scores of the original measures diverge from the theoretical minimum fairness value with larger k. This happens to our measures only in the nonrepeatable setting, because the normalization assumes that any item can be recommended to any user. This assumption does not hold in the nonrepeatable setting, as some items cannot be re-recommended to some users. However, unlike the original measures, the scores of our measures diverge less as
k increases.
Overall, while the difference in scores between the original measures and our versions is not large, our measures quantify the actual most (un)fair situations more accurately than the original measures. The difference between the original measures and our versions in the most unfair recommendation under the repeatable setting is
\(\frac{k}{n}-0 = \frac{k}{n}\) for Jain, QF, Gini, and FSat; and
\(\log {k}\) for Ent (see Table
5). However, the difference would be greater in item-poor domains where
n is small and therefore possibly close to
k, e.g., insurance [
5]. The scores of the original measures also change with
k. This makes their interpretation harder because the distance between the original scores and the theoretical maximum/minimum fair score also changes without an intuitive pattern for different values of
k, as seen in the
\(\uparrow\)Jain
\(_{\text{ori}}\) scores in Figure
3, which can increase or decrease as
k increases. Furthermore, the original measures suffer particularly for low
k values, which are the most important rank positions in real-life RSs. The scores of our measures rarely change with different values of
k.
5.5 Sliding Window: Relevance and Fairness at Different Rank Positions
This experiment studies how relevance and fairness scores of all measures vary at decreasing rank positions. The experiment aims at observing (1) the change in relevance scores, if any, as items should ideally be placed in the ranks according to decreasing order of true relevance; and (2) whether and how the fairness scores change across different rank positions. Due to bias in recommenders, popular items tend to be given more exposure. Thus, we expect the relevance scores to decrease and the fairness scores to become more fair at decreasing rank positions. We study how the above changes may generally differ between relevance measures and fairness measures, as well as between different fairness measures, including the ones with different fairness notions.
We conduct this experiment as follows. We use the runs from the BPR model, which performs best in our experiments. Given one run, we compute the measures for different sliding windows of rank positions: 1–5, 2–6, and so on, until 5–9. We reorder the recommended items such that items previously recommended at the top positions are now at the bottom positions when we shift the window to lower ranks. The results for Lastfm and Ml-1m are presented in Figure
7 and for the rest of the datasets in Appendix
B.5.
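The windowing itself is straightforward, as the sketch below illustrates: each rank window is cut out of the full recommendation lists before the measures are recomputed on it. The names rec_lists and evaluate are hypothetical placeholders for the BPR run and the measure computation.

```python
import numpy as np

def rank_windows(rec_lists: np.ndarray, window: int = 5, last_start: int = 5):
    """Yield (start_rank, sub_lists) for windows 1-5, 2-6, ..., 5-9 (1-based ranks).

    rec_lists: (m, k) array of item ids ordered by rank for each user.
    """
    for start in range(last_start):            # start = 0 .. 4 -> windows 1-5 .. 5-9
        yield start + 1, rec_lists[:, start:start + window]

# Usage sketch: recompute every measure on each window of the BPR run.
# for start, sub_lists in rank_windows(bpr_rec_lists):
#     results[start] = evaluate(sub_lists, test_labels)   # hypothetical evaluate()
```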
The following observations from Figure
7 apply to both the original fairness measures and our corrected versions of these measures unless otherwise stated. All relevance scores decrease as rank decreases. The drop of relevance scores for Ml-1m (
\([0.04,0.23] \rightarrow [0.03, 0.20]\)) is less extreme than in Lastfm (
\([0.12,0.48] \rightarrow [0.04, 0.25]\)). This is partly because the test set of Lastfm has at most five relevant items per user, while on average, Ml-1m has many more. While relevance scores decrease, fairness measures show that fairness slightly increases down the rank, except for
\(\downarrow\)VoCD
\(_{\text{ori}}\). The range of higher-is-better fairness measures increases from
\([0.06,0.65]\rightarrow [0.10,0.71]\) for Lastfm and
\([0.05,0.65]\rightarrow [0.08,0.71]\) for Ml-1m. The range of
\(\downarrow\)Gini and
\(\downarrow\)Gini-w decreases from
\([0.91,0.92]\rightarrow [0.87,0.88]\) for Lastfm and
\([0.91,0.92]\rightarrow [0.88,0.89]\) for Ml-1m.
\(\downarrow\)VoCD
\(_{\text{ori}}\) seems invariant to changes in the position window (
\(0.61 \rightarrow 0.60\) for Lastfm and
\(0.68\rightarrow 0.67\) for Ml-1m). This may be because VoCD
\(_{\text{ori}}\) is the only measure that considers fairness exclusively for recommended items, and the recommended items differ only slightly in the number of times they are recommended as rank decreases.
\(\downarrow\)AI-D has even smaller changes in scores as the values are already minuscule in the first place, while
\(\downarrow\)II-D is constant and small for a given dataset. These small values, compared to other measures, are due to II-D and AI-D quantifying fairness using a different concept than the other measures, i.e., comparing exposure to random exposure (also observed and explained in Section
5.2). The ranges of all fairness measures are roughly the same across datasets, but the range of relevance measures varies across datasets. This also holds for the datasets in Appendix
B.5. This may be due to the distribution of the recommended items being similar across datasets, and the distribution of the number of relevant items differing across datasets, as explained above for Lastfm and Ml-1m.
Fairness measures are also somewhat invariant to changes in relevance. This is anticipated as the equations of fairness measures are independent of relevance values.
5.6 Measure Strictness and Sensitivity through Artificial Insertion of Items
We have observed in Section
5.2 that different fairness measures vary in how strictly they quantify fairness (e.g., some measures give scores close to the most fair values, and others the opposite). It is, however, unknown how sensitive the fairness measures are to changes in the number of times an item is exposed in the recommendation lists across all users. Therefore, the goal of this experiment is to study the strictness and sensitivity of the measures, and to compare these aspects between measures of similar and different fairness concepts. Knowing the strictness and sensitivity of the measures matters, as this affects how we interpret their scores. For example, if one uses a measure that tends to produce scores close to the most fair value, one must be aware that the score may not reflect fairness accurately.
As such, we devise an experiment to specifically study how the scores of the relevance measures, the existing fairness measures, and our corrected fairness measures change when we artificially control the fraction of jointly least exposed and relevant items in the recommendation list. We start with an initial recommendation list. We define a
least exposed (LE) item as an item in the dataset with the least exposure, based on the current recommendation list.
An LE item in this experiment is therefore an item that has not appeared in the current recommendation list. We define a
relevant item as per the labels of relevance.
From the initial recommendation list, we insert jointly LE and relevant items, one item at a time. We create a synthetic dataset with
\(m=1000\) users and
\(n=10000\) items. The number of items is exactly the number of recommendation slots
km for a cut-off
\(k=10\). We artificially generate the top-k rankings as follows. The artificial insertion of jointly LE and relevant items begins with the recommendation of the same 10 items \(i_1, i_2, \dots , i_{10}\) to all users. These items are irrelevant to every user except \(u_1\), and we keep the recommendation list of \(u_1\) unchanged throughout the experiment. This is because we want to keep the number of items at exactly km, so that theoretically each item could be recommended exactly once; if we had to completely replace all m users’ recommendation lists, we would need more than km items. We expect the relevance measures to give scores close to zero on this initial recommendation list, as only
\(u_1\) has relevant items. We expect the fairness measures to give scores that are equal to or close to the theoretical most unfair scores.
Let P be the fraction of items in the top k that are artificially inserted by us. We vary P from \(P=0\), the original recommendation where we have not inserted any items artificially, to \(P=1\), where all items in the top k are jointly LE and relevant items that are artificially inserted by us. We increase P in steps of \({1}/{k}\). From the bottom of a user’s recommendation list, we replace one item at a time with a known jointly LE and relevant item, until we end up with km different items recommended across all users, each relevant only to the user to whom it is recommended. At the end of the insertion process, each user is recommended exactly 10 relevant items, and those items are also fair w.r.t. the entire recommendation list for all users, considering all items in the dataset; item fairness is not defined w.r.t. a specific user. We expect the relevance measures to give scores of 1 on the final recommendation list and the fairness measures to give scores that are (close to) the fairest scores.\(^{{\href {fn:close_to}{17}}}\)
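A minimal sketch of how such a synthetic recommendation matrix can be constructed is given below; it follows the protocol described above, with the item-id bookkeeping being our own illustrative choice.

```python
import numpy as np

m, n, k = 1000, 10_000, 10  # users, items (= k*m recommendation slots), cut-off

def synthetic_lists(P: float) -> np.ndarray:
    """Top-k lists where a fraction P of each list (except u_1's) is replaced,
    from the bottom up, by items recommended to no other user (jointly least
    exposed and, by construction, relevant only to that user)."""
    rec = np.tile(np.arange(k), (m, 1))          # start: same 10 items for everyone
    n_replaced = round(P * k)
    for u in range(1, m):                        # u_1 (index 0) is left unchanged
        # Unique item ids per user, disjoint across users and from the initial k items.
        fresh = k + (u - 1) * k + np.arange(n_replaced)
        rec[u, k - n_replaced:] = fresh          # replace from the bottom of the list
    return rec

# P = 1.0 yields k*m distinct items overall; P = 0.0 reproduces the initial lists.
lists = synthetic_lists(P=0.5)
```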
The results of this experiment are presented in Figure
8. We see that all relevance measures increase as we add more relevant items.
The following observations apply to both the original fairness measures and to our corrected versions of these measures, unless otherwise specified. All fairness measures, except VoCD
\(_{\text{ori}}\) and II-D
\(_{\text{ori}}\), indicate more fairness as we increase
P, but with varying sensitivity, explained next.
\(\uparrow\)Jain is one of the strictest fairness measures. Even when the proportion of LE items is 0.9 (item
\(i_1\) is recommended to all users, but the rest of the recommendation lists are filled with different items), the
\(\uparrow\)Jain score is still close to 0, which translates to unfair, while
\(\uparrow\)QF,
\(\uparrow\)Ent, and
\(\uparrow\)FSat are 0.9, which is close to the fairest score of 1. The scores of QF
\(_{\text{ori}}\) are exactly the same as FSat
\(_{\text{ori}}\), because all items in the recommendation list are recommended once, which is also the maximin share (defined in Section
2.2).
\(\uparrow\)QF
\(_{\text{our}}\),
\(\uparrow\)Ent
\(_{\text{our}}\), and
\(\uparrow\)FSat
\(_{\text{our}}\) also give identical scores. This is expected, as the increase in the scores is constant and proportional to the fraction of artificially inserted LE items; yet it is interesting, as these three measures are based on three different fairness notions (QF being insensitive to the number of times an item is recommended, and FSat being based on maximin-share fairness).
Meanwhile, the increase of fairness in \(\downarrow\)Gini and \(\downarrow\)Gini-w follows a non-linear trend, with \(\downarrow\)Gini-w being stricter than \(\downarrow\)Gini. The non-linear trend is also expected, as Gini and Gini-w are based on the Lorenz curve, a graphical representation of the cumulative proportion of exposure against the cumulative proportion of items. We also see that Gini-w\(_{\text{our}}\) is able to reach the theoretical most fair value when the entire recommendation list consists of artificially inserted LE items, while Gini-w\(_{\text{ori}}\) fails. \(\downarrow\)VoCD\(_{\text{ori}}\) is insensitive to the insertion of items as it only considers fairness for recommended items. The number of times these items are recommended across all users does not differ much in this set-up; therefore, \(\downarrow\)VoCD\(_{\text{ori}}\) returns scores that are close to the fairest. Most notably, \(\downarrow\)II-D\(_{\text{ori}}\) and \(\downarrow\)AI-D\(_{\text{ori}}\) are very close to 0 (on the scale of \(10^{-3}\) or even smaller) even when the same k items are recommended to all users. \(\downarrow\)II-D\(_{\text{ori}}\) remains constant, while \(\downarrow\)AI-D\(_{\text{ori}}\) is rather insensitive to the addition of LE and relevant items. The small scores are due to these measures quantifying fairness according to the closeness of item exposure to random exposure, while the other measures make no such comparison. Therefore, for these measures, the scales are not very meaningful, even though for AI-D the scores still indicate improvement as we insert more LE items.
We see similar trends with
\(m \in \lbrace 100, 500\rbrace\), but the change of the scores is most stable with
\(m=1000\). As we increase the number of users (and items), the range of VoCD, II-D, and AI-D scores also becomes more compressed. In contrast, the range of the other measures remains similar. We also observe a similar but opposite trend when we artificially insert known irrelevant items and multiple copies of items already in the recommendation list. Both of these results are in Appendix
B.6.
Overall, the artificial insertion experiment indicates that several measures, i.e., \(\uparrow\)QF, \(\uparrow\)FSat, and \(\uparrow\)Ent, respond linearly to the insertion of LE items, while the rest do so non-linearly. This can affect the interpretation of these scores, as we observe that it is generally harder to achieve a high fairness score with some measures. With some other measures, it is also easier to improve fairness when starting from a relatively fair situation, but much harder when starting from a completely unfair situation.