Ch5 Retrieval Evaluation 2021
July, 2021
• Types of Evaluation Strategies
• Major evaluation criteria
• Difficulties in Evaluating IR Systems
• Measuring retrieval effectiveness
Why System Evaluation?
It provides the ability to measure the difference between IR
systems
How well do our search engines work?
Is system A better than B?
Under what conditions? (relevancy/access time)
A. System-centered evaluation
• Given documents, queries, & relevance judgments
• Try several variations of the system
• Measure which system returns the “best” matching list of documents
B. User-centered evaluation
• Given several users, and at least two IR systems
• Have each user try the same task on each system
• Measure which system works “best” for the user’s information need
• How do we measure user satisfaction?
Major Evaluation Criteria
What are some of the main measures for evaluating an IR system’s performance?
I. Efficiency: time, space
• Speed in terms of retrieval time and indexing time
• Speed of query processing
• The space taken by corpus vs. index
Is there a need for compression?
Index size: Index/corpus size ratio
II. Effectiveness
• How well is the system capable of retrieving relevant documents from the
collection?
• Is a system better than another one?
• User satisfaction: How “good” are the documents that are returned as
a response to user query?
• “Relevance” of results in meeting the information need of users
Difficulties in Evaluating IR System
IR systems essentially facilitate communication between a user and document
collections
Relevance is a measure of the effectiveness of communication
Effectiveness is related to the relevancy of retrieved items.
Relevance: relates information need (query) and a document or
surrogate
Relevancy is not typically binary but continuous.
Even if relevancy is binary, it is a difficult judgment to make.
Relevance is the degree of correspondence between a document and a query, as
determined by the requester, an information specialist, an external judge, or
other users
Retrieval scenario
• Consider 13 results (positions 1–13) retrieved by different systems for a
given query.
[Figure: the 13 ranked positions with their relevant/irrelevant markings; the markings are not recoverable from the text.]
Measuring Retrieval Effectiveness
• Retrieved but irrelevant documents (B) are “false positives”, also known as
“Type I errors” or “errors of commission”.
Measuring Retrieval Effectiveness

               Relevant   Not relevant
Retrieved         A            B
Not retrieved     C            D

Collection size = A+B+C+D;  Relevant = A+C;  Retrieved = A+B
Hits 1-10
Recall    1/14 1/14 1/14 1/14 2/14 3/14 3/14 4/14 4/14 4/14
Precision 1/1  1/2  1/3  1/4  2/5  3/6  3/7  4/8  4/9  4/10
Hits 11-20
Recall    5/14 5/14 5/14 5/14 5/14 6/14 6/14 6/14 6/14 6/14
Precision 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20
(Relevant documents appear at ranks 1, 5, 6, 8, 11, and 16; the collection contains 14 relevant documents in total.)
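These per-rank values can be reproduced with a short sketch; the function name is ours, and the inputs (relevant documents at ranks 1, 5, 6, 8, 11, and 16, with 14 relevant documents in the collection) come from the worked example.

```python
def precision_recall_at_ranks(relevant_ranks, total_relevant, depth):
    """Return a list of (rank, recall, precision) tuples for ranks 1..depth."""
    relevant = set(relevant_ranks)
    rows, hits = [], 0
    for rank in range(1, depth + 1):
        if rank in relevant:
            hits += 1  # one more relevant document seen so far
        rows.append((rank, hits / total_relevant, hits / rank))
    return rows

# Values from the example: relevant docs at ranks 1, 5, 6, 8, 11, 16;
# 14 relevant documents in the whole collection.
rows = precision_recall_at_ranks([1, 5, 6, 8, 11, 16], total_relevant=14, depth=20)
```

At rank 5 this gives recall 2/14 and precision 2/5, matching the table.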
Example 2
• Let the total number of relevant documents be 6; compute recall and
precision at each cutoff point n:

n   doc #  relevant  Recall  Precision
1   588    x         0.167   1
2   589    x         0.333   1
3   576
4   590    x         0.5     0.75
5   986
6   592    x         0.667   0.667
7   984
8   988
9   578
10  985
11  103
12  591
13  772    x         0.833   0.38
14  990
Precision/Recall tradeoff
• Can increase recall by retrieving many documents (down to a low level of
relevance ranking), but many irrelevant documents would be fetched, reducing
precision.
• Can get high recall (but low precision) by retrieving all documents for all
queries.

[Figure: precision plotted against recall (0 to 1). At the high-precision end the system returns mostly relevant documents but misses many useful ones too; at the high-recall end it returns most relevant documents but includes lots of junk.]
Compare Two or More Systems

[Figure: precision (0–1) plotted against recall levels 0.1–1.0 for two systems, “NoStem” and “Stem”.]
Calculating Precision at Standard Recall Levels
Assume that there are a total of 10 relevant documents.

Ranking        Relevant  Recall  Precision
1.  Doc. 50    Rel       10%     100%
2.  Doc. 34    Not rel   ?       ?
3.  Doc. 45    Rel       20%     67%
4.  Doc. 8     Not rel   ?       ?
5.  Doc. 23    Not rel   ?       ?
6.  Doc. 16    Rel       30%     50%
7.  Doc. 63    Not rel   ?       ?
8.  Doc. 119   Rel       40%     50%
9.  Doc. 10    Not rel   ?       ?
10. Doc. 2     Not rel   ?       ?
11. Doc. 9     Rel       50%     45%
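One common way to turn such a table into precision at the standard recall levels 0.1, 0.2, ..., 1.0 is interpolation, where the interpolated precision at recall level r is the maximum precision observed at any recall ≥ r. This convention is not spelled out on the slide, so the sketch below is an assumption; the ranking it uses (relevant documents at ranks 1, 3, 6, 8, and 11, with 10 relevant documents in total) is the one above.

```python
def interpolated_precision(relevant_ranks, total_relevant, depth):
    """Interpolated precision at recall levels 0.1 .. 1.0 for one ranking."""
    points = []  # (recall, precision) observed at each rank
    hits = 0
    for rank in range(1, depth + 1):
        if rank in relevant_ranks:
            hits += 1
        points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(1, 11)]
    # max precision at any recall >= r; 0.0 where that recall is never reached
    return {r: max((p for rec, p in points if rec >= r), default=0.0)
            for r in levels}

interp = interpolated_precision({1, 3, 6, 8, 11}, total_relevant=10, depth=11)
```

For this ranking, recall levels above 0.5 are never reached, so their interpolated precision is 0.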
Single-valued measures
Single value measures: may want a single value for each query to
evaluate performance
Average precision
• Average the precision values at each retrieved relevant document
• Relevant documents not retrieved contribute zero to the score
• Example: assume a total of 16 relevant documents, with relevant documents
retrieved at ranks 1, 5, 6, 8, 11, and 16; compute the average precision.
Hits 1-10
Precision 1/1  1/2  1/3  1/4  2/5  3/6  3/7  4/8  4/9  4/10
Hits 11-20
Precision 5/11 5/12 5/13 5/14 5/15 6/16 6/17 6/18 6/19 6/20
• Sum at the relevant documents: (1/1)+(2/5)+(3/6)+(4/8)+(5/11)+(6/16)
= 1+0.4+0.5+0.5+0.4545+0.375 = 3.23
• AP = 3.23/16 ≈ 0.202
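The same computation as a sketch (function name ours; the ranks and the total of 16 relevant documents are from the example above):

```python
def average_precision(relevant_ranks, total_relevant):
    """AP for one query: sum of precision at each retrieved relevant
    document, divided by the total number of relevant documents
    (unretrieved relevant documents contribute zero)."""
    total = 0.0
    for i, rank in enumerate(sorted(relevant_ranks), start=1):
        total += i / rank  # precision at the i-th relevant document
    return total / total_relevant

ap = average_precision([1, 5, 6, 8, 11, 16], total_relevant=16)
# ap ≈ 0.202, matching the slide's result
```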
MAP (Mean Average Precision)
• Computes the mean of the average precision over a set of test queries:

  MAP = (1/n) Σ_Qi (1/|Ri|) Σ_{Dj ∈ Ri} (j / r_ij)

• r_ij = rank of the j-th relevant document for query Qi
• |Ri| = number of relevant documents for Qi
• n = number of test queries
E.g. assume that for query 1 and 2, there are 3 and 2 relevant documents in
the collection, respectively:

                        Query 1   Query 2
Rank of 1st rel. doc.      1         4
Rank of 2nd rel. doc.      5         8
Rank of 3rd rel. doc.     10         -

  MAP = (1/2) [ (1/3)(1/1 + 2/5 + 3/10) + (1/2)(1/4 + 2/8) ] ≈ 0.41
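A sketch of the MAP computation; the helper names are ours, and the illustrative ranks (query 1: relevant documents at ranks 1, 5, 10 out of 3 relevant; query 2: ranks 4, 8 out of 2 relevant) are our reading of the worked example.

```python
def average_precision(relevant_ranks, total_relevant):
    # precision at the i-th relevant document is i / rank
    return sum(i / r for i, r in enumerate(sorted(relevant_ranks), 1)) / total_relevant

def mean_average_precision(queries):
    """queries: list of (relevant_ranks, total_relevant) pairs, one per query."""
    return sum(average_precision(rr, tr) for rr, tr in queries) / len(queries)

map_score = mean_average_precision([([1, 5, 10], 3), ([4, 8], 2)])
# (1/2) * [(1/3)(1 + 0.4 + 0.3) + (1/2)(0.25 + 0.25)] ≈ 0.41
```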
F-Measure
• The harmonic mean of precision and recall:

  F = 2PR / (P + R) = 2 / (1/R + 1/P)

• Compared to the arithmetic mean, both P and R need to be high for the
harmonic mean to be high.
• What if no relevant documents exist?
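A quick numeric check of the harmonic-mean property (function name ours, example values hypothetical): two P/R pairs with the same arithmetic mean of 0.5 give very different F scores, because F is dragged down by whichever value is low.

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

balanced = f_measure(0.5, 0.5)  # arithmetic mean 0.5, F = 0.5
skewed = f_measure(0.9, 0.1)    # arithmetic mean 0.5, F = 0.18
```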
E-Measure
Associated with van Rijsbergen.
Allows the user to specify the relative importance of recall and precision.
It is a parameterized F-measure: a variant that allows weighting the emphasis
on precision over recall through a parameter b:

  E = 1 - (1 + b²)PR / (b²P + R) = 1 - (1 + b²) / (b²/R + 1/P)
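A minimal sketch of the E-measure, assuming van Rijsbergen's usual form E = 1 - (1 + b²)PR / (b²P + R); the function name is ours.

```python
def e_measure(p, r, b=1.0):
    """Van Rijsbergen's E; b weights recall relative to precision."""
    return 1.0 - (1 + b**2) * p * r / (b**2 * p + r)

e = e_measure(0.5, 0.5, b=1.0)  # with b = 1, E reduces to 1 - F
```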
Problems with both precision and recall
• Number of irrelevant documents in the collection is not taken into
account.
• Recall is undefined when there is no relevant document in the collection.
• Precision is undefined when no document is retrieved.
Other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence/Miss = non-retrieved relevant docs / relevant docs
• Noise = 1 – Precision; Silence = 1 – Recall

  Miss = |{Relevant} ∩ {NotRetrieved}| / |{Relevant}|
  Fallout = |{Retrieved} ∩ {NotRelevant}| / |{NotRelevant}|
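These measures can be written directly in terms of the contingency cells from the earlier table (A = retrieved and relevant, B = retrieved and irrelevant, C = not retrieved and relevant, D = not retrieved and irrelevant); the example counts below are hypothetical.

```python
def noise(a, b):
    """Retrieved irrelevant / retrieved = 1 - precision."""
    return b / (a + b)

def silence(a, c):
    """Non-retrieved relevant / relevant = 1 - recall (the 'miss' rate)."""
    return c / (a + c)

def fallout(b, d):
    """Retrieved irrelevant / all irrelevant documents."""
    return b / (b + d)

# Hypothetical counts: A=4, B=6, C=10, D=80
# noise = 6/10, silence = 10/14, fallout = 6/86
```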