5.1 Collections and Measures
We make use of the binary judgments created for two TREC datasets: TREC 7 (Disks 4 and 5) and TREC Robust 2004 (Disks 4 and 5, minus the congressional records). The dataset statistics are summarized in Table
2. These judgments, which are created to a high standard by the TREC processes, are used in our experiments as attestments. They undoubtedly do still contain errors, but of necessity are assumed to be a usable approximation of true attestments.
All 50 queries associated with TREC 7 were used. The Robust 2004 collection has a much larger number of associated topics; these were divided into two sets, one with 50 queries and another with 199 queries, to allow consideration of the effect of topic set size on experimental stability. Across the full set of 249 Robust 2004 queries, the mean numbers of judged relevant and judged irrelevant documents per topic are 70 and
\(1{,}181\), respectively. To obtain the two subsets, every fifth query was assigned to the smaller query set. The runs from the top 50 systems in these two TREC rounds were selected, with “top” defined by each system’s average score for the metric
rank-biased precision (RBP) [
37] using a persistence value of
\(\phi =0.95\), which corresponds to users who, on average, examine the first 20 documents in each run and who value relevance in rank position 1 approximately 2.65 times more highly than they value relevance in rank position 20. As a check, we also selected the top 50 systems using AP and NDCG, and across the four combinations of metric and collection the minimum top-50 overlap against RBP was 46 out of 50.
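Both characterizations of \(\phi =0.95\) follow from the geometric RBP weighting, under which the document at rank \(i\) receives weight \((1-\phi )\phi ^{i-1}\); as a brief worked check,
\[
  \sum_{i\ge 1} i\,(1-\phi )\,\phi ^{i-1} = \frac{1}{1-\phi } = 20,
  \qquad
  \frac{(1-\phi )\,\phi ^{0}}{(1-\phi )\,\phi ^{19}} = 0.95^{-19} \approx 2.65 .
\]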
We also used RBP (again with
\(\phi =0.95\), a value that should be assumed throughout unless otherwise noted) as a metric with which to evaluate runs, along with several other well-known approaches:
average precision (AP);
precision at 10 (P@10);
normalized discounted cumulative gain (NDCG) [
23] with log base 2; and
reciprocal rank (RR). Note that the use of
\(\phi =0.95\) broadly corresponds to the expected evaluation depth attained by AP and NDCG [
41].
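For concreteness, the following minimal sketch (our own simplified rendering, not the evaluation code used in the experiments) shows how these measures score a single ranked run against binary judgments, assuming binary gains and the conventions stated above; the function names and the toy run are ours:

```python
import math

def ap(rel, num_rel):
    """Average precision: mean precision at each relevant rank, normalized by
    the number of judged relevant documents for the topic."""
    hits, total = 0, 0.0
    for i, r in enumerate(rel, start=1):
        if r:
            hits += 1
            total += hits / i
    return total / num_rel if num_rel else 0.0

def p_at_10(rel):
    return sum(rel[:10]) / 10

def rr(rel):
    """Reciprocal rank of the first relevant document (0 if there is none)."""
    return next((1 / i for i, r in enumerate(rel, start=1) if r), 0.0)

def ndcg(rel, num_rel):
    """NDCG with binary gains and a log-base-2 rank discount."""
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rel, start=1))
    ideal = sum(1 / math.log2(i + 1) for i in range(1, min(num_rel, len(rel)) + 1))
    return dcg / ideal if ideal else 0.0

def rbp(rel, phi=0.95):
    """Rank-biased precision with persistence phi."""
    return (1 - phi) * sum(r * phi ** (i - 1) for i, r in enumerate(rel, start=1))

run = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]    # binary relevance down one ranked run
print(ap(run, 3), p_at_10(run), rr(run), ndcg(run, 3), rbp(run))
```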
Rank-biased overlap (RBO) [
62] was employed in experiments that required comparison of system rankings. Rank-biased overlap is a top-weighted list similarity measure that assigns larger penalties to differences at the head of a ranking than to differences that occur further down; we use a parameter of
\(\phi =0.90\) except where otherwise specified, which corresponds to a probabilistic viewer of the pair of rankings who scans from the top of the two lists, exits after examining an average of 10 pairs of items, and then assesses the degree of overlap they have observed. We mainly use RBO to compare system orderings and hence evaluate it to depth 50, the number of systems employed in the experiments.
Rank-biased overlap is an overlap coefficient and is zero only if the two lists are disjoint. When the two lists are permutations of each other, there is an expected “background” RBO score that can be determined by a Monte Carlo technique that generates and scores a large number of random permutations. That process was used to establish the floor value applicable to these experiments: permutations of length 50 and a parameter of \(\phi =0.90\). The expected RBO for random orderings was found to be \(0.193887\pm 0.071999\)—that is, around 0.2, implying that reported RBO values of less than around 0.3 should be interpreted as meaning that two orderings each containing 50 systems are uncorrelated. We also computed Kendall’s tau in connection with system orderings, noting that it is not top-weighted.
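That estimate can be reproduced, at least approximately, with a short Monte Carlo sketch; the truncated, evaluate-to-depth-50 form of RBO used below and the number of trials are our assumptions rather than a restatement of the original computation:

```python
import random
import statistics

def rbo_at_depth(a, b, phi=0.90, depth=50):
    """Truncated RBO: a geometrically weighted average of prefix-overlap proportions."""
    seen_a, seen_b, score = set(), set(), 0.0
    for k in range(1, depth + 1):
        seen_a.add(a[k - 1])
        seen_b.add(b[k - 1])
        score += phi ** (k - 1) * len(seen_a & seen_b) / k
    return (1 - phi) * score

# Expected "background" RBO between two random permutations of 50 systems.
rng = random.Random(1)
base = list(range(50))
samples = []
for _ in range(10_000):
    x, y = base[:], base[:]
    rng.shuffle(x)
    rng.shuffle(y)
    samples.append(rbo_at_depth(x, y))
print(statistics.mean(samples), statistics.stdev(samples))   # roughly 0.19 +/- 0.07
```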
5.3 Raw Score Changes
Figure
6 shows the run score perturbations that result from the two judgoid generation protocols. To form the two graphs, the process shown in Figures
2 and
3 was applied to the TREC 7 qrels with a discriminating (
\(disc=3.0\)) and neutral (
\(bias=0.0\)) assessor assumed (that is, an “expert”), with 100 sets of judgoids generated. That resulted in a total of
\(50 \times 50 \times 100 = 250{,}000\) individual system–query runs (systems by topics by judgoids). A random sample of
\(5{,}000\) (that is, 2%) of those runs was then taken, and judgoid-derived and attestment-derived metric scores for AP and RBP computed and scatter-plotted.
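In its simplest reading, the judgoid-generation step amounts to independent Bernoulli flips of individual judgments; the sketch below is a hypothetical simplification that assumes the assessor model of Figures 2 and 3 reduces to fixed \({{ {TPR}}}\) and \({{ {FPR}}}\) values (the mapping from disc and bias to those rates is not reproduced here):

```python
import random

def random_qrel_flips(attestments, tpr, fpr, seed=0):
    """Derive one judgoid set from binary attestments {doc_id: 0 or 1}.

    Simplified stand-in for the paper's protocol: each relevant attestment is
    retained as relevant with probability tpr, and each irrelevant attestment
    is flipped to relevant with probability fpr, independently per document.
    """
    rng = random.Random(seed)
    return {
        doc: (1 if rng.random() < tpr else 0) if rel else (1 if rng.random() < fpr else 0)
        for doc, rel in attestments.items()
    }

# e.g., 100 judgoid sets for one topic with ~70 relevant and ~1,200 irrelevant attestments
attestments = {f"d{i}": int(i < 70) for i in range(1270)}
judgoid_sets = [random_qrel_flips(attestments, tpr=0.93, fpr=0.07, seed=s) for s in range(100)]
```

The rank-biased qrel flips variant differs in that the irrelevant-to-relevant flips are concentrated on popular documents rather than being chosen uniformly.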
In the left pane, the random qrel flips process generates RBP scores (blue dots) that are visibly consistent between judgoids and attestments, and hence well correlated. In contrast, the AP scores (red dots) are more dispersed and also show a clear pattern of shifting above the dotted mid-line, that is, of being numerically smaller with the perturbed judgments than with the attestments.
With the rank-biased qrel flips approach in the right pane, RBP scores tend to be below the mid-line; judgoid-based run scores tend to be larger than the attestment-based scores, as popular non-relevant documents get flipped to relevant, a consequence of the biased subset selection employed in the \({{ {FPR}}}\) computation. In contrast, AP has many scores closer to the mid-line, but a large mass of near-zero attestment scores have much higher corresponding perturbed scores; that is, both AP and RBP now show a high degree of dispersion.
However, the absolute run scores are in general not of direct interest, as their primary value is in system comparison; the key test is whether systems are affected in a consistent way by any pattern of drift in scores. The next two subsections examine what happens when the perturbed run scores are used in system-to-system comparisons.
5.4 Variations in System Rankings
In this experiment, we explore how perturbed judgments affect system ordering, in which each system is ranked from 1 to 50 by its average measured score, with ties (which are rare) broken at random. Here, and in several of the other experiments discussed below, we report results only for AP and RBP, as well-performing and widely used representatives of recall-based and utility-based metrics, respectively.
For each of the 100 sets of judgoids, a system ranking is computed based on the mean metric score for each system across the corresponding set of topics, and then RBO is calculated, comparing that system ordering against the system ordering induced by the attestments. This process leads to 100 RBO scores for each parameter setting, each a measure of the similarity between the results for one set of perturbed judgments and the original judgments. These scores are top-weighted, reflecting an interest in being able to identify a relatively small number of top-performing systems. We then average the RBO values across the 100 sets of judgoids to obtain a single outcome value for each combination of parameters.
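In code, the comparison for one parameter setting might look like the following sketch, in which per-system, per-topic metric scores are assumed to be available already, and rbo_at_depth is the truncated form sketched in Section 5.1, repeated so that the fragment stands alone:

```python
import random
import statistics

def rbo_at_depth(a, b, phi=0.90, depth=50):
    seen_a, seen_b, score = set(), set(), 0.0
    for k in range(1, depth + 1):
        seen_a.add(a[k - 1])
        seen_b.add(b[k - 1])
        score += phi ** (k - 1) * len(seen_a & seen_b) / k
    return (1 - phi) * score

def system_ranking(scores, seed=0):
    """Order systems by decreasing mean score over topics, breaking (rare) ties at random."""
    rng = random.Random(seed)
    means = {s: statistics.mean(topic_scores) for s, topic_scores in scores.items()}
    return sorted(means, key=lambda s: (-means[s], rng.random()))

def mean_rbo(attest_scores, judgoid_score_sets):
    """Average RBO between the attestment-induced ranking and each judgoid-induced ranking.

    attest_scores: {system: [per-topic scores]}; judgoid_score_sets: a list of 100 such dicts.
    """
    truth = system_ranking(attest_scores)
    return statistics.mean(
        rbo_at_depth(system_ranking(js), truth) for js in judgoid_score_sets
    )
```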
Before considering RBO scores, it is also helpful to look at rank ranges. These are computed from the same experiment by tabulating, for each rank in each judgoid-induced system ordering, the corresponding system rank in the attestment-induced system ordering. This element of the
reverse protocol yields information that helps answer the question, “if a system is at rank
x using the judgoids, where might it have been ranked using the underlying attestments?” Figure
7 shows this type of output. A discriminating (
\(disc=3.0\)) and neutral (
\(bias=0.0\)) assessor is again assumed, the
random qrel flips strategy is employed, and rank correspondences for AP are shown on the left and for RBP on the right. Each vertical box-whisker element shows the distribution of attestment rank positions corresponding to one judgoid rank position, with the solid area in each column depicting the middle two quartiles. The orange dots provide a guide that shows the location of the line of perfect consistency.
As is noted in the figure, these settings for
disc and
bias give a
TPR of 0.93 and an
FPR of 0.07. With around
\(1{,}200\) judged irrelevant documents per query and 70 judged relevant, this corresponds to flipping to “off” around 4 previously positive judgments and flipping to “on” around 85 previously negative judgments. Those error rates intuitively feel quite small; as noted earlier, we would generally be happy with such an accurate assessment from human judges. However, as the raw numbers show, they do mean that the incorrect positive judgments risk swamping the correct ones. Nonetheless, RBP appears to be robust, with a notable consistency apparent in the plot and, as corroboration, a mean RBO of
\(0.945\); the corresponding RBO value for AP is
\(0.646\). That lower correlation is clearly evident in the left plot in Figure
7, with a wide range of attestment-induced ranks corresponding to each judgoid-induced rank, even among systems that score well (the low ranks in the horizontal axis of the plot).
When the same experiment was carried out using the 199 Robust 2004 queries and the corresponding 50 best systems, similar results emerged, with RBO correlations for AP and RBP of
\(0.742\) and
\(0.940\), respectively. These results are shown in Figure
8, using the same presentation as in Figure
7. This is somewhat surprising: using a larger set of queries appears not to help the results converge, suggesting that the number of queries is not an important factor. That is, for a given error rate, we tentatively hypothesize that the instability in ranks under AP with
random qrel flips is innate and is not dampened by increasing the size of the query set. The reliance of AP on normalization by the number of relevant documents for each topic may be a contributing factor here, since each set of judgoids ends up with its own actual
\({{ {FPR}}}\), with the process in Figure
3 delivering perturbations in expectation and not at a guaranteed rate that applies to every set of judgoids.
In other experiments with random qrel flips (results not shown here), RR and NDCG display a relatively wide range of variations in ordering similar to or worse than AP, while P@10 is more akin to RBP and is relatively stable.
Figure
9 presents the system rank relationships that arise when the
rank-biased qrel flips approach is applied to the TREC 7 documents and queries, again for a discriminating, neutral assessor. Now, both AP and RBP show severe degradation in system ordering, presumably because rank-biased perturbation particularly affects top-ranking systems—they get to the top by returning some unpopular yet relevant documents, which are exactly the ones that have a relatively high likelihood of being flipped and deemed “not relevant” in the judgoid sets. In particular, both measures have RBO scores of approximately 0.2; as discussed in Section
5.1, that means that the degree of judgment distortion deployed to obtain Figure
9 would render an experimental outcome essentially meaningless.
The rank-biased qrel flips perturbation method also affects outcomes for the Robust 2004 data and topics (not shown in a figure), but not as gravely, with RBO values of 0.517 and 0.629 for AP and RBP, respectively. Analysis of the TREC 7 runs revealed a high level of diversity, with the runs more distinct from one another in the documents they returned, including the relevant documents. The Robust 2004 runs, in contrast, overlapped more with each other and hence were less vulnerable to the volatility introduced by rank-biased qrel flips. Even so, there was still wide variation in attestment rank for Robust 2004 systems that were top-scoring according to the judgoids, showing that even RBO scores of 0.5 or 0.6 are not necessarily a cause for celebration.
Table
3 presents a broader sweep of results, focusing on mean RBO scores between judgoid-induced system rankings and attestment-induced system rankings. Five effectiveness measures, two different true positive rates, and four different (and smaller) false positive rates are shown for each of four different experimental configurations. As previously noted, RBO is calculated with a persistence of
\(\phi =0.90\) to depth 50; the greatest RBO score for each combination of collection,
\({{ {TPR}}}\), and
\({{ {FPR}}}\) is highlighted in blue. Figure
10 shows the corresponding outcomes for the TREC 7 dataset. In Figure
10, the
\({{ {TPR}}}=0.9\) results for each metric and for the four
\({{ {FPR}}}\) values are plotted first as solid-colored bars, and the
\({{ {TPR}}}=0.7\) results are then overlaid as hatched bars.
To confirm the validity of these results, we used a paired Student
t-test to assess the significance of the differences between the results for RBP and AP. For
random qrel flips, across the 24 cases in Table
3 and Figure
10 the computed
p value was less than
\(10^{-6}\) in all but three instances, in which the difference was not significant. For
rank-biased qrel flips, AP is significantly better than RBP in one case but RBP is significantly better than AP in the other 23, always with
p similarly close to zero. We thus conclude that RBP does indeed lead to larger RBO values when rankings induced by perturbed qrels are compared to the rankings induced by the original attestments.
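A sketch of that test for a single cell of Table 3, under our reading that the pairing is over the 100 judgoid sets, with one RBO value per set for each of the two metrics; the function name is ours and scipy is assumed to be available:

```python
from scipy.stats import ttest_rel

def compare_rbo(rbo_rbp, rbo_ap):
    """Paired Student t-test over per-judgoid-set RBO values for RBP versus AP.

    rbo_rbp and rbo_ap each hold 100 values for one collection / TPR / FPR case.
    """
    stat, p = ttest_rel(rbo_rbp, rbo_ap)
    winner = "RBP" if sum(rbo_rbp) > sum(rbo_ap) else "AP"
    return winner, p
```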
Given that the results shown in Table
3 and Figure
10 can be regarded as being significant, our overarching observations are as follows:
•
In all cases, RBO falls with decreasing TPR and with increasing FPR. This demonstrates that increasing error rates do degrade the reliability of measurements—a completely unsurprising result, but nevertheless useful confirmation that the experimental framework is behaving as predicted.
•
As already noted, the rank-biased qrel flips results for TREC 7 decrease quickly, a consequence of the distinctiveness of the systems used in that round of experimentation and the large number of documents retrieved by the top systems that are not found by other systems.
•
The
rank-biased qrel flips protocol degrades rankings more than does the
random qrel flips protocol, sometimes only slightly but sometimes substantially. As discussed above, this occurs because
rank-biased qrel flips has a disproportionate impact on the measurement of the top-scoring systems.
•
The recall-based measures AP and NDCG are in all but one case more vulnerable to error than is the utility-based measure RBP. Reciprocal rank is also significantly worse than the utility-based metrics, consistent with other work that has shown it is insensitive and unstable [
42].
•
In terms of the robustness of experiments in the presence of errors, increasing the number of queries does not appear to help: The Robust 2004 (199) outcomes are no better than the outcomes with 50 queries.
•
Rank-biased precision is less sensitive to error than are the other measures, with the exception of P@10 on the Robust 2004 (199) collection with rank-biased qrel flips, where RBP falls slightly behind. At low error rates it gives superior performance to the other measures, though the alternatives are also competitive. Then, as error rates increase, it tends to remain better for longer.
As a confirmation, we also used Kendall’s
\(\tau\) to compare system rankings, providing a non-top-weighted perspective. Figure
11 shows the same trend of ranking correlations as observed for the RBO scores, namely, that utility-based metrics are more robust to judgment errors than are recall-based measures. We also observe again that the
rank-biased qrel flips protocol results in a faster onset of degradation.
To further explore the contrast between
random qrel flips and
rank-biased qrel flips, the collective behavior of the 10 top-scoring systems was compared to that of the remaining lower-scoring systems, noting that when the RBO persistence factor is 0.9 (used in Table
3 and Figure
10) the top 10 systems contribute the majority of the overlap calculation. We compared them to the remainder by building a consensus run for each topic that combined the 50 systems, ordering documents by decreasing meta-AP score. We then computed an RBO score for each topic and for each system relative to that consensus run. Because we are now comparing rankings of documents (rather than rankings of system), a much deeper overlap was sought, and RBO with a persistence factor of
\(\phi =0.98\) was employed.
For TREC 7 the average RBO (
\(\phi =0.98\)) score across all queries for the top 10 systems was
\(0.357928\), with the average over the other 40 systems being much higher, at
\(0.521761\). Investigating further, we found that the top systems in TREC 7 return documents that are relevant yet not popular; these documents do not gain a high meta-AP score and thus tend to be favored (Figure
1(b)) for flipping by the
rank-biased qrel flips approach. In contrast, the Robust 2004 systems tend to behave similarly to each other, with the top-ranking systems having slightly higher RBO scores against the consensus run than the non-top systems do. That is, the top systems and non-top systems retrieve similar proportions of popular and unpopular documents.
In results not included here, we also carried out experiments using the TREC 9 collection and relevance judgments. While those results revealed further patterns of behavior, in very broad terms their sensitivity to disruption fell between that of the TREC 7 results and that of the Robust 2004 results.
Our various results have several implications. First, they provide justification for the strategy of pooling; it is essential that all documents be considered equally, regardless of how many systems retrieved them. Second, they highlight the value of having a large number of systems contributing to the pool. Third, and most importantly, they suggest that popularity is only an approximate predictor of assessor error. We have relied on it here and are satisfied that our results are robust—as is illustrated by the strength of confirmation between random and rank-biased flipping of judgments—but it does imply that the model should be used with awareness of its limitations.
Arguably the previous work most pertinent to that reported in this section is that of Bailey et al. [
4] and Voorhees [
59]. In these papers, the authors investigated the agreement between different sets of relevance judgments. However, the different designs in this prior work mean that there are confounds to a direct comparison. Where we have been able to examine many millions of sets of synthetic judgments generated under a range of parameters, the prior work was of necessity limited to only one or two sets of additional real judgments, on one or two test collections, and with smaller numbers of systems and queries.
Bailey et al.’s results align well with ours, showing that substantial differences between systems can reverse when some of the judgments change. The methodology used by Voorhees makes direct comparison more difficult: one of the sets of judgments is not based on pooling, so many unjudged documents are assumed irrelevant in the score calculation (that is, there is an unknown error bound in the results), and the correlation function that was used (Kendall’s tau) is not top-weighted. The greater volume of data in our work has allowed us to draw more conclusive inferences, with the limited scope of the previous experiments meaning that their outcomes were, at least to some extent, tentative.
5.5 Stability of Significance Tests
A potential confound in the experiments reported in the previous subsection is that some disruptions to system orderings are uninteresting. When two systems have similar scores, or their differences in score are not statistically significant, changes in ordering may not be relevant to the hypotheses being tested. While we have noted consistent effects that vary monotonically in the degradation parameters, we nevertheless need to be alert to the possibility that what we have measured is the result of chance outcomes. To build confidence in those experiments, we now restrict our attention to system differences that have been determined to be statistically significant. This is, after all, the core of a great many offline retrieval experiments: the use of statistical significance tests to establish whether a new method can reasonably be claimed to be superior to alternatives. What we would hope to find is that when a pair of systems is found to be significantly different as measured on a set of judgoids (which are all that is available in a typical experiment), that significance is also present in an evaluation based on the original attestments from which the judgoids were derived.
Consider the results shown in the two plots in Figure
12, which, as is the case throughout the rest of this subsection, concern the measures AP, NDCG, P@10, and RBP. The dataset is TREC 7, with judgoids created via the
random qrel flips protocol. From among the
\(122{,}500\) individual comparisons (that is, covering all \(1{,}225\) pairs from among 50 systems, with 100 sets of judgoids applied to each pair), we extracted all “System
A, System
B, judgoids” triples where the two systems were found to be significantly different (according to those judgoids) with a
p value in the range
\(p\in [0.005, 0.015]\). Each of these outcomes is thus close to the “1 in 100” mistake level that would be regarded as strong evidence of superiority. This filtering step is applied solely for the purpose of selecting, for subsequent analysis, a group of system pairs in which the relationship is highly significant and the
p values are of comparable magnitude; it does not affect the “even more highly significant” outcomes derived when
\(p\lt 0.005\). The number of such outcomes for each of the four effectiveness metrics is shown in the legend; for example, on the left-hand side the number for RBP is
\(11{,}053\).
We then took the matching “System
A, System
B” attestment-based
p-values (as a multi-set) corresponding to the selected triples and examined their distribution. The legend shows statistics of those four sets of corresponding
p-values as computed via the attestments; on the left, for RBP it is
\(0.02\pm 0.06\), for P@10 it is
\(0.04\pm 0.11\), for AP it is
\(0.32\pm 0.39\), and for NDCG it is
\(0.30\pm 0.36\). The distribution of those attestment-based
p-values is then shown by the inferred density curves.
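The selection-and-matching step can be sketched as follows; the per-pair significance test is not restated at this point in the text, so the use of a paired two-sided t-test over per-topic scores is an assumption, as are the data-structure conventions:

```python
from scipy.stats import ttest_rel

def matched_attestment_p_values(judgoid_scores, attest_scores, lo=0.005, hi=0.015):
    """For every (system A, system B, judgoid set) triple whose judgoid-based
    p-value falls in [lo, hi], collect the attestment-based p-value for the
    same system pair (as a multi-set).

    judgoid_scores: {judgoid_id: {system: [per-topic scores]}}
    attest_scores:  {system: [per-topic scores]}
    """
    matched = []
    systems = sorted(attest_scores)
    for scores in judgoid_scores.values():
        for i, a in enumerate(systems):
            for b in systems[i + 1:]:
                _, p = ttest_rel(scores[a], scores[b])
                if lo <= p <= hi:
                    _, p_true = ttest_rel(attest_scores[a], attest_scores[b])
                    matched.append(p_true)
    return matched
```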
If system
A outperforms system
B on the judgoids, then it is still possible for
B to outperform
A on the attestments, and indeed for
B to be so much better than
A that the results are statistically significant. To capture this possibility, in Figure
12, we report what we call
mapped p-values:
(1)
when \(p\gt 0.5\), the attestment and judgoid results disagree, and \(1-p\) is used as the value plotted on the horizontal axis.
(2)
when \(p\lt 0.5\), the attestment and judgoid results agree, and \(p\) is used as the value plotted on the horizontal axis.
Thus, as a further subtlety in this experiment, p-values that are greater than 0.5 correspond to reversed system mean scores. If the perturbed observation was, say, that system A outperformed system B with \(p=0.01\), then an attestment observation of \(p=0.99\) would mean that B outperforms A with \(p=0.01\)—a strong contradiction of the finding that was inferred using the perturbed judgments.
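The mapping itself is mechanical; a minimal sketch, assuming the attestment p-value has been computed in the direction favored by the judgoids, so that values above 0.5 indicate reversed mean scores:

```python
def mapped_p(p_attest):
    """Map an attestment p-value (computed in the direction favored by the
    judgoids) to an agreement flag plus the value plotted on the horizontal axis."""
    if p_attest > 0.5:
        return "disagree", 1.0 - p_attest   # reversed ordering: plot 1 - p
    return "agree", p_attest                # same ordering: plot p

# A judgoid finding of "A beats B, p = 0.01" paired with an attestment value of
# p = 0.99 maps to a disagreement plotted at about 0.01: B beats A.
print(mapped_p(0.99))
print(mapped_p(0.02))
```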
As can be seen, in Figure
12(a), which shows results with a very low
TPR selected as an extreme case, the RBP
p-values remain reasonably tightly grouped at the left and show that it is behaving consistently: If the greatly perturbed judgoids suggest
\(p\approx 0.01\) for some
A-versus-
B system comparison, then the corresponding attestments also heavily favor small
p-values for
A-versus-
B. However, the matched
p-values for AP and NDCG are, more or less, spread right across the whole range. That is, in this context AP and NDCG have failed: The observed results in the presence of error bear little or no correspondence to the results that would have been obtained if the attestments had been available. This implies that experimental outcomes assessed with AP and NDCG—presuming that judgment errors of the supposed magnitude were indeed being made—would be consistently incorrect, even though each of the comparisons using the judgoids yielded statistical significance and high confidence.
Unfortunately, the same observations also apply to Figure
12(b), where the error rates are lower and are at the “expert assessor” levels reported by previous authors. The distribution for AP and NDCG is not quite as poor, but it remains the case that, for most sets of judgoids and with the attestments taken as “truth,” what are apparently successful findings using the judgoids are at best unsubstantiated and at worst are false.
This finding has serious implications for the use of AP and NDCG in practice. While the RBP attestment results are centered on \(p=0.01\), in alignment with the judgoid results, those for AP and NDCG are centered on \(p=0.07\) and \(p=0.15\), respectively. That is, even with these low error rates, what appears to be a 1-in-100 chance of the significance being a false positive corresponds to an underlying likelihood that the results are in fact not significant. Even a small volume of errors, well below that observed even among expert assessors, let alone crowd workers, can lead to experimental outcomes that are simply wrong.
Experiments with
random qrel flips on Robust 2004 (199) were very similar to those shown for TREC 7. A more interesting contrast is with the results for Robust 2004 (199) and
rank-biased qrel flips perturbation, shown in Figure
13. (Note that in this figure the vertical scale is approximately one-tenth that of the previous figure.) This data reveals failure for all four effectiveness measures. While RBP and P@10 are slightly better than AP and NDCG, in that they offer a higher peak on the left, the results show that the presence of rank-influenced judgment errors of the type suggested by Webber et al. [
63] has the potential to render meaningless any assessment of statistical significance when comparing systems. The plot in Figure
13(b), with a reduced error rate, is only marginally better; judgoid significance centered on
\(p=0.01\) corresponds to attestment significance of
\(p=0.14\),
\(p=0.22,\) and
\(p=0.34\) for P@10, RBP, and NDCG, respectively, even when the error rate is low.
A variant on the above results is shown in Figure
14. These two plots show the cumulative distribution of all attestment p-values that correspond (as a multi-set) to judgoid-based p-values of \({}\le 0.01\). In other words, these are the “true” p-values for every observed p-value from a “plausible experiment in the presence of judgment errors” in which there was high statistical confidence in the measured outcome. The left plot shows results with the random qrel flips approach, and the right shows results with the rank-biased qrel flips perturbation mechanism.
In these figures, the intercept with the right vertical axis (which is at
\(p=0.5\)) shows the fraction of system pairs that have the same ordering under both attestments and judgoids. For example, in Figure
14(b), for RBP, NDCG, and AP the proportions of system orderings that are consistent between judgoids and attestments are around 82%, 74%, and 67%, respectively. In Figure
14(a), effectively all of the
attestment p-values for RBP are well below 0.05 and over 90% are below 0.01, and the curve as a whole is in keeping with the expected false negative rate for a significance test. For the other effectiveness measures, the results are less consistent; NDCG is particularly poor, with less than 70% agreement at 0.05. Nor do the results improve much for lower error rates. It is clear that on this collection the recall-based measures are unacceptably vulnerable to errors in the judgments. However, again echoing the earlier results, none of the metrics has done especially well on the Robust 2004 (199) collection, nor do any of them do especially well in the face of
rank-biased qrel flips perturbations. If indeed judgment errors are made according to the patterns modeled by Webber et al. [
63], we may well need to find judges who are more expert than experts.
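Reading such a curve amounts to computing, for each threshold, the fraction of matched attestment p-values at or below it; a small sketch, again assuming the one-sided convention in which values greater than 0.5 denote a reversed ordering:

```python
def cumulative_fractions(attest_p_values, thresholds=(0.01, 0.05, 0.5)):
    """Fraction of matched attestment p-values at or below each threshold.

    With the one-sided convention used above, the fraction at 0.5 is the
    right-axis intercept in Figure 14: the proportion of system pairs that are
    ordered the same way by the judgoids and by the attestments.
    """
    n = len(attest_p_values)
    return {t: (sum(p <= t for p in attest_p_values) / n if n else 0.0) for t in thresholds}

# For the RBP curve in Figure 14(a), the text reports that over 90% of the matched
# attestment p-values fall below 0.01 and effectively all fall well below 0.05.
```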