16 Flat
16 Flat
16 Flat
Introduction to
Information Retrieval
1
Introduction to Information Retrieval
Overview
❶ Recap
❷ Clustering: Introduction
❸ Clustering in IR
❹ K-means
❺ Evaluation
2
Introduction to Information Retrieval
Outline
❶ Recap
❷ Clustering: Introduction
❸ Clustering in IR
❹ K-means
❺ Evaluation
3
Introduction to Information Retrieval
4
Introduction to Information Retrieval
Linear classifiers
5
Introduction to Information Retrieval
A linear classifier in 1D
▪ A linear classifier in 1D is
a point described by the
equation w1d1 = θ
▪ The point at θ/w1
▪ Points (d1) with w1d1 ≥
are in the class c.
▪ Points (d1) with w1d1 < θ
are in the complement
class
6
Introduction to Information Retrieval
A linear classifier in 2D
▪ A linear classifier in 2D is a
line described by the
equation w1d1 +w2d2 = θ
▪ Example for a 2D linear
classifier
▪ Points (d1 d2) with w1d1 +
w2d2 ≥ θ are in the class c.
▪ Points (d1 d2) with w1d1 +
w2d2 < θ are in the
complement class
7
Introduction to Information Retrieval
A linear classifier in 3D
▪ A linear classifier in 3D is
a plane described by the
equation w1d1 + w2d2 +
w3d3 = θ
▪ Example for a 3D linear
classifier
▪ Points (d1 d2 d3) with w1d1 +
w2d2 + w3d3 ≥ θ are in the
class c.
▪ Points (d1 d2 d3) with w1d1 +
w2d2 + w3d3 < θ are in the
complement class
8
Introduction to Information Retrieval
9
Introduction to Information Retrieval
10
Introduction to Information Retrieval
11
Introduction to Information Retrieval
Take-away today
▪ What is clustering?
▪ Applications of clustering in information retrieval
▪ K-means algorithm
▪ Evaluation of clustering
▪ How many clusters?
12
Introduction to Information Retrieval
Outline
❶ Recap
❷ Clustering: Introduction
❸ Clustering in IR
❹ K-means
❺ Evaluation
13
Introduction to Information Retrieval
Clustering: Definition
14
Introduction to Information Retrieval
Propose algorithm
for finding the
cluster structure
in this example
15
Introduction to Information Retrieval
16
Introduction to Information Retrieval
Outline
❶ Recap
❷ Clustering: Introduction
❸ Clustering in IR
❹ K-means
❺ Evaluation
17
Introduction to Information Retrieval
18
Introduction to Information Retrieval
Applications of clustering in IR
Application What is Benefit
clustered?
Search result clustering search more effective
results information
presentation to user
Scatter-Gather (subsets alternative user
of) collection interface: “search
without typing”
Collection clustering collection effective information
presentation for
exploratory browsing
Cluster-based retrieval collection higher efficiency:
faster search
19
Introduction to Information Retrieval
20
Introduction to Information Retrieval
Scatter-Gather
21
Introduction to Information Retrieval
22
Introduction to Information Retrieval
23
Introduction to Information Retrieval
24
Introduction to Information Retrieval
25
Introduction to Information Retrieval
26
Introduction to Information Retrieval
27
Introduction to Information Retrieval
http://news.google.com
28
Introduction to Information Retrieval
29
Introduction to Information Retrieval
Propose algorithm
for finding the
cluster structure
in this example
30
Introduction to Information Retrieval
32
Introduction to Information Retrieval
Flat algorithms
▪ Flat algorithms compute a partition of N documents into a
set of K clusters.
▪ Given: a set of documents and the number K
▪ Find: a partition into K clusters that optimizes the chosen
partitioning criterion
▪ Global optimization: exhaustively enumerate partitions,
pick optimal one
▪ Not tractable
▪ Effective heuristic method: K-means algorithm
34
Introduction to Information Retrieval
Outline
❶ Recap
❷ Clustering: Introduction
❸ Clustering in IR
❹ K-means
❺ Evaluation
35
Introduction to Information Retrieval
K-means
36
Introduction to Information Retrieval
37
Introduction to Information Retrieval
K-means
▪ Each cluster in K-means is defined by a centroid.
▪ Objective/partitioning criterion: minimize the average
squared difference from the centroid
▪ Recall definition of centroid:
K-means algorithm
39
Introduction to Information Retrieval
40
Introduction to Information Retrieval
42
Introduction to Information Retrieval
43
Introduction to Information Retrieval
44
Introduction to Information Retrieval
45
Introduction to Information Retrieval
46
Introduction to Information Retrieval
47
Introduction to Information Retrieval
48
Introduction to Information Retrieval
49
Introduction to Information Retrieval
50
Introduction to Information Retrieval
51
Introduction to Information Retrieval
52
Introduction to Information Retrieval
53
Introduction to Information Retrieval
54
Introduction to Information Retrieval
55
Introduction to Information Retrieval
56
Introduction to Information Retrieval
57
Introduction to Information Retrieval
58
Introduction to Information Retrieval
59
Introduction to Information Retrieval
60
Introduction to Information Retrieval
61
Introduction to Information Retrieval
62
Introduction to Information Retrieval
63
Introduction to Information Retrieval
64
Introduction to Information Retrieval
66
Introduction to Information Retrieval
Optimality of K-means
67
Introduction to Information Retrieval
68
Introduction to Information Retrieval
Initialization of K-means
▪ Random seed selection is just one of many ways K-means
can be initialized.
▪ Random seed selection is not very robust: It’s easy to get a
suboptimal clustering.
▪ Better ways of computing initial centroids:
▪ Select seeds not randomly, but using some heuristic (e.g., filter
out outliers or find a set of seeds that has “good coverage” of
the document space)
▪ Use hierarchical clustering to find good seeds
▪ Select i (e.g., i = 10) different random sets of seeds, do a K-
means clustering for each, select the clustering with lowest RSS
69
Introduction to Information Retrieval
70
Introduction to Information Retrieval
Outline
❶ Recap
❷ Clustering: Introduction
❸ Clustering in IR
❹ K-means
❺ Evaluation
71
Introduction to Information Retrieval
▪ Internal criteria
▪ Example of an internal criterion: RSS in K-means
▪ But an internal criterion often does not evaluate the actual
utility of a clustering in the application.
▪ Alternative: External criteria
▪ Evaluate with respect to a human-defined classification
72
Introduction to Information Retrieval
73
Introduction to Information Retrieval
74
Introduction to Information Retrieval
75
Introduction to Information Retrieval
Rand index
▪ Definition:
78
Introduction to Information Retrieval
79
Introduction to Information Retrieval
80
Introduction to Information Retrieval
Outline
❶ Recap
❷ Clustering: Introduction
❸ Clustering in IR
❹ K-means
❺ Evaluation
81
Introduction to Information Retrieval
82
Introduction to Information Retrieval
Exercise
83
Introduction to Information Retrieval
▪ Basic idea:
▪ Start with 1 cluster (K = 1)
▪ Keep adding clusters (= keep increasing K)
▪ Add a penalty for each new cluster
▪ Trade off cluster penalties against average squared distance
from centroid
▪ Choose the value of K with the best tradeoff
84
Introduction to Information Retrieval
86
Introduction to Information Retrieval
Take-away today
▪ What is clustering?
▪ Applications of clustering in information retrieval
▪ K-means algorithm
▪ Evaluation of clustering
▪ How many clusters?
87
Introduction to Information Retrieval
Resources
▪ Chapter 16 of IIR
▪ Resources at http://ifnlp.org/ir
▪ K-means example
▪ Keith van Rijsbergen on the cluster hypothesis (he was one of
▪ the originators)
▪ Bing/Carrot2/Clusty: search result clustering
88