The VLDB Journal (2009) 18:255–276
DOI 10.1007/s00778-008-0098-x

REGULAR PAPER

Swoosh: a generic approach to entity resolution

Omar Benjelloun · Hector Garcia-Molina · David Menestrina ·
Qi Su · Steven Euijong Whang · Jennifer Widom

Received: 22 May 2007 / Revised: 9 January 2008 / Accepted: 10 January 2008 / Published online: 11 March 2008
© Springer-Verlag 2008
Abstract We consider the entity resolution (ER) problem
(also known as deduplication, or merge–purge), in which
records determined to represent the same real-world entity
are successively located and merged. We formalize the
generic ER problem, treating the functions for comparing and
merging records as black-boxes, which permits expressive
and extensible ER solutions. We identify four important
properties that, if satisfied by the match and merge functions,
enable much more efficient ER algorithms. We develop three
efficient ER algorithms: G-Swoosh for the case where the
four properties do not hold, and R-Swoosh and F-Swoosh that
exploit the four properties. F-Swoosh in addition assumes
knowledge of the features (e.g., attributes) used by the
match function. We experimentally evaluate the algorithms
using comparison shopping data from Yahoo! Shopping and
hotel information data from Yahoo! Travel. We also show that
R-Swoosh (and F-Swoosh) can be used even when the four
match and merge properties do not hold, if an approximate
result is acceptable.
O. Benjelloun
Google Inc., Mountain View, CA 94043, USA
e-mail: benjello@google.com

H. Garcia-Molina · D. Menestrina · Q. Su · S. E. Whang (✉) · J. Widom
Computer Science Department, Stanford University,
Stanford, CA 94305, USA
e-mail: euijong@cs.stanford.edu

H. Garcia-Molina
e-mail: hector@cs.stanford.edu

D. Menestrina
e-mail: dmenest@cs.stanford.edu

Q. Su
e-mail: qisu@cs.stanford.edu

J. Widom
e-mail: widom@cs.stanford.edu
Keywords Entity resolution · Generic entity resolution ·
Data cleaning
1 Introduction
Entity resolution (ER) (sometimes referred to as deduplica-
tion) is the process of identifying and merging records judged
to represent the same real-world entity. ER is a well-known
problem that arises in many applications. For example, mail-
ing lists may contain multiple entries representing the same
physical address, but each record may be slightly different,
e.g., containing different spellings or missing some informa-
tion. As a second example, consider a company that has dif-
ferent customer databases (e.g., one for each subsidiary), and
would like to consolidate them. Identifying matching records
is challenging because there are no unique identifiers across
databases. A given customer may appear in different ways
in each database, and there is a fair amount of guesswork in
determining which customers match.
Deciding if records match is often computationally expensive
and application specific. For instance, a customer information
management solution from a company¹ we have been
interacting with uses a combination of nickname algorithms,
edit distance algorithms, fuzzy logic algorithms, and trainable
engines to match customer records. On the latest hardware,
the speed of matching records ranges from 10M to
100M comparisons per hour (single threaded), depending
on the parsing and data cleansing options executed. A record
comparison can thus take up to about 0.36 ms, greatly
exceeding the runtime of any simple string/numeric value
comparison. How to match and combine records is also
application specific.

¹ This company wishes to remain anonymous so that the performance
numbers we give here are not associated with their product specifically.
For instance, the functions used by that company to
match customers are different from those used by others to
match, say, products or DNA sequences.
In this paper we take a generic approach for solving
ER, i.e., we do not study the internal details of the functions
used to compare and merge records. Rather, we view these
functions as black-boxes to be invoked by the ER engine.
(Incidentally, there has been a lot of work done on the design
of effective comparison and merge functions; see Sect. 6.)
Given such black-boxes, we study algorithms for efficiently
performing ER, i.e., we develop strategies that minimize the
number of invocations to these potentially expensive black-boxes.
In a way, our work is analogous to the design of efficient
join algorithms, except that the operator we study is the
ER operator. An important component of our work is that we
identify a set of properties that, if satisfied by the match and
merge functions, lead to significantly more efficient ER. For
example, associativity of merges is one such important property:
if merges are not associative, the order in which records
are merged may impact the final result. Another notable feature
is that we do not perform the matching and merging
separately, but tightly integrate them into a single process.
In this paper we focus on pairwise ER, a common way
to resolve records in the commercial world. In particular, the
following assumptions are made:
Pairwise decisions. Our black-box functions to match
and merge records operate on two records at a time. Their
operation depends solely on the data in these records, and
not on the evidence in other records. In general, it is easier
for application specialists to write pairwise record
comparison and merge functions, as opposed to, say,
functions that determine when a group of records may
represent the same entity. Note that this requirement
need only hold at the time ER is performed, and
does not preclude a prior training phase that considers
the whole dataset, or a representative sample. (For
example, a first phase can compute term frequencies
for, say, all product descriptions, and the frequencies can
then be used in comparing pairs of descriptions.) Thus,
approaches based on machine learning can be leveraged
to match or merge records.
No confidences. We do not work with numeric similarity
values or confidences. Record comparison functions may
indeed compute numeric similarities (e.g., how close is
this name to that name), but in the end they make a yes–no
decision as to whether records match. Carrying confidences
in the ER computations could in principle lead to
more accurate decisions, but complicates processing significantly.
For instance, one must decide how to combine
confidences when records are merged. Also, confidences
may decrease upon merges, which makes it more challenging
to compare the information in merged records
to that of base records. In a technical report [26], we
study generic ER with confidences, in an extension of the
framework presented here, where confidences are also
handled by the black-box match and merge functions.

Fig. 1 An instance of records representing persons
No relationships. In our model, records contain all the
information that pertains to each entity (see Fig. 1 for an
example). We do not consider a separate class of records
that describe relationships between entities. Of course,
some relationships can be represented in our model: for
example, say Fred is Bill's brother. Then the record for
Fred may contain the value "brother: {Bill}".

Consistent labels. We assume that the input data has gone
through a schema-level integration phase, where incoming
data is mapped to a common set of well-defined
labels. For instance, we assume that a "salary" label
means the same thing, no matter what the source of the
information is. However, we do not impose a rigid structure
on records: we allow missing or multiple values for
each label.
The particular variant of the ER problem that we study in
this paper may not be the most sophisticated, but is used frequently
in practice, at least in the commercial world. Indeed,
IBM's recently introduced DB2 Entity Analytic Solutions
[21] (formerly SRD) provides an exact, order-insensitive
solution to the ER problem (applied to human identities),
which abstracts away from the particular functions used to
compare values. Another leading commercial offering from
Fair Isaac Corp. also encapsulates the match process as pairwise
Boolean functions [10]. The customer information management
company uses a pairwise matching framework to
which a combination of comparison algorithms can be
applied. Although these products have extra features, the core
of their approach is the same as ours. In fact, their commercial
success originally motivated our study of this particular
approach to entity resolution (see Sect. 6 for an overview of
alternative techniques).
In summary, the ER variant we address here is relatively
simple, but as we will see, can still be very expensive to compute.
One fundamental cause of this complexity in ER is that
record merges can lead to new matches. To illustrate, consider
the records of Fig. 1. Suppose that our black-box record
match function works as follows: the function compares the
name, phone and email values of the two records. If the names
are very similar (above some threshold), the records are said
to match. The records also match if the phone and email are
identical. For matching records, the black-box merge function
combines the names into a normalized representative,
and performs a set-union on the emails and phone numbers.
Note that phone and email are being treated as a unit for
comparison purposes. We call such a unit a feature (defined
formally in Sect. 4). Thus, in this example, there are two features:
one is name and the other is the pair phone–email.
For our example, the black-box comparison function
determines that r1 and r2 match, but r3 does not match either
r1 or r2. For instance, the function finds that "John Doe" and
"J. Doe" are similar, but finds "John D." not similar to anything
(e.g., because "John" is a frequent first name). Thus, r1
and r2 merge into a new record r4:

r4: name {John Doe}, phone {234-4358, 235-2635},
email {jdoe@yahoo}
Notice that r4 now matches r3, since the same phone
and email appear in both records. The combination of the
information in r1 and r2 led us to discover a new match with
r3, therefore yielding an initially unforeseen merge. Thus,
every time two records are merged, the combined record
needs to be re-compared with everything else.
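To make this behavior concrete, the following is a minimal Java sketch of
match and merge functions in the style just described. It is only an
illustration: the record representation (attribute name mapped to a set of
values), the trivial name-similarity test, and all identifiers are our own
choices, not the actual functions used in the experiments.

    import java.util.*;

    class IntroExample {
        // A record maps an attribute to the set of values seen for it, e.g.
        // {name={John Doe}, phone={235-2635}, email={jdoe@yahoo}}.
        static boolean shares(Map<String, Set<String>> r1,
                              Map<String, Set<String>> r2, String attr) {
            return !Collections.disjoint(r1.getOrDefault(attr, Set.of()),
                                         r2.getOrDefault(attr, Set.of()));
        }

        // Placeholder name similarity; a real system would use an
        // edit-distance or Jaro-Winkler style measure with a threshold.
        static boolean similarNames(Map<String, Set<String>> r1,
                                    Map<String, Set<String>> r2) {
            for (String a : r1.getOrDefault("name", Set.of()))
                for (String b : r2.getOrDefault("name", Set.of()))
                    if (a.equalsIgnoreCase(b)) return true;
            return false;
        }

        // Match: similar names, OR identical phone and identical email.
        static boolean match(Map<String, Set<String>> r1,
                             Map<String, Set<String>> r2) {
            return similarNames(r1, r2)
                || (shares(r1, r2, "phone") && shares(r1, r2, "email"));
        }

        // Merge: union the value sets of every attribute, as in the r4
        // example (we union names rather than normalizing, for simplicity).
        static Map<String, Set<String>> merge(Map<String, Set<String>> r1,
                                              Map<String, Set<String>> r2) {
            Map<String, Set<String>> out = new HashMap<>();
            for (Map<String, Set<String>> r : List.of(r1, r2))
                r.forEach((a, vs) ->
                    out.computeIfAbsent(a, k -> new HashSet<>()).addAll(vs));
            return out;
        }
    }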
Because record matching is inherently expensive, large
sets of input records are often divided into buckets using
application knowledge, and then ER is run on each bucket.
For instance, if we are resolving products, we may be able to
divide them using a "category" field. Thus, camera records
will only be matched against other cameras, CDs will only be
matched against other CDs, and so on. If a record may match
records in more than one category, then typically copies of
the record are placed in multiple buckets. For example, a cell
phone with a camera may be placed in the camera and the
telephone buckets. (In our related work section we briefly
mention other ways in which domain knowledge can be used
to prune the search space.) In this paper we focus on resolving
records within one bucket, that is, we study algorithms
that must exhaustively consider all (within bucket) possible
record matches. This type of exhaustive algorithm is invoked
by a higher-level process that divides the data and decides
what buckets need to be resolved. And since buckets can be
quite large, it is still important to have as efficient an algorithm
as possible for exhaustive ER. (Note that if the semantic
function that divides records is imprecise, then overall
matches may be missed, e.g., two wet-suits may be incorrectly
placed in different buckets, say clothing and sporting
goods. In this paper we do not consider the accuracy of the
semantic function that partitions records.)
In summary, in this paper we make the following contributions:

– We formalize the generic ER problem (Sect. 2). Unlike
other works that focus only on identifying matching
records (see related work in Sect. 6), we also include the
process of merging records and how it may lead to new
matches.
– We identify the ICAR properties (see Sect. 2.2) of match
and merge functions that lead to efficient strategies.
– We present ER algorithms for three scenarios:
  – G-Swoosh: The most general ER algorithm, for the
    case where the four properties of match and merge
    functions do not hold (Sect. 3.1).
  – R-Swoosh: An algorithm that exploits the four properties
    of match and merge functions, and that performs
    comparisons at the granularity of records (Sect. 3.2).
  – F-Swoosh: An algorithm that also exploits the four
    properties, and uses feature-level comparison functions
    (Sect. 4.1). F-Swoosh avoids repeated feature
    comparisons and can be significantly more efficient
    than R-Swoosh.
– For each algorithm, we show that it computes the correct
ER result and that it is optimal in terms of the number
of comparisons performed. (What we mean by "optimal"
varies by scenario and is precisely defined in each section.)
– We experimentally evaluate the algorithms using actual
comparison shopping data from Yahoo! Shopping and
hotel information data from Yahoo! Travel. Our results
show that G-Swoosh can only be used on relatively small
data sets when merges occur frequently, while R-Swoosh
and F-Swoosh can handle substantially more data. Furthermore,
when we know the features used for comparisons,
we can use F-Swoosh and achieve between 1.1× and
11.4× performance improvement.
– Since G-Swoosh is so expensive, we investigate using
R-Swoosh even when the ICAR properties of match and
merge functions do not hold. In this case R-Swoosh does
not produce the correct answer, but we show that what
R-Swoosh produces is close to what G-Swoosh produces.
Thus, if the application can tolerate an approximate
answer, R-Swoosh and F-Swoosh are viable algorithms
for all scenarios.
2 Fundamentals of generic ER

We first consider entity resolution at the granularity of
records. Our approach is very generic, since no assumption
is made on the form or data model used for records. Finer
granularity ER will be considered in Sect. 4.

2.1 Basic model

We assume an infinite domain of records R. An instance
I = {r1, . . . , rn} is a finite set of records from R.
A match function M is a Boolean function over R × R,
used to determine if two records r1 and r2 represent the same
real-world entity (in which case M(r1, r2) = true). Such a
match function reflects the restrictions we are making that (1)
matching decisions depend solely on the two records being
compared, and (2) that such decisions are Boolean, and not
associated with any kind of numeric confidence. In practice,
such functions are easier to write than functions that consider
multiple records.
A merge function µ is a partial function from R × R into
R, that captures the computation of merged records. Function
µ is only defined for pairs of matching records (i.e., for
r1, r2 s.t. M(r1, r2) = true).

When M and µ are understood from the context, M(r1, r2)
= true (resp. M(r1, r2) = false) is denoted by r1 ≈ r2
(resp. r1 ≉ r2), and µ(r1, r2) is denoted by ⟨r1, r2⟩.
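From the engine's perspective, M and µ are opaque callbacks. A minimal
Java sketch of this black-box contract (the interface and its names are
ours, introduced only for the sketches in this paper; they are not part of
any published API):

    // Black-box contract used by the ER engine: a Boolean pairwise match
    // function M and a partial merge function mu, defined (and only
    // invoked) when M(r1, r2) = true.
    interface MatchMerge<R> {
        boolean match(R r1, R r2);  // M(r1, r2)
        R merge(R r1, R r2);        // mu(r1, r2), written <r1, r2>
    }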
In order to define ER, we need to introduce two key intermediary
notions: the merge closure of an instance, and record
domination.

Merge closure. Intuitively, given an instance I, we would
like to find all pairs of matching records in I and merge
them, using the match and merge functions defined above.
The notion of extending I with all the records that can be
derived this way is called the merge closure of I:
Definition 2.1 Given an instance I, the merge closure of I,
denoted Ī, is the smallest set of records S such that:

– I ⊆ S
– For any records r1, r2 ∈ S, if r1 ≈ r2, then ⟨r1, r2⟩ ∈ S.

For any instance I, the merge closure of I clearly exists
and is unique. It can be obtained as the fixpoint of adding to
I merges of matching records.

Note that the merge closure of a (finite) instance I may
be infinite. Intuitively, arbitrarily long chains of matches
and merges may keep producing new records. However, the
match and merge functions used in practice for ER do not
exhibit such a behavior. We will give in Sect. 2.2 some simple
properties, often satisfied by match and merge functions,
which guarantee that the merge closure is finite.
Domination. The merge closure is only a first step towards
defining ER. The goal of ER is to determine the set of records
that best represent some real-life entities. Intuitively, if two
records r and r′ are about the same entity but r holds more
information than r′, then r′ is useless for representing this
entity. In this case, we say that r′ is dominated by r, denoted
r′ ⊑ r. For instance, in our example, it is natural to consider
that r1 ⊑ r4, as r4 contains all the values of r1, and maybe
also that r2 ⊑ r4. Even though the name "J. Doe" does not
appear in r4, it can be considered as subsumed by "John Doe".

Formally, domination is defined to be any partial order
relation ⊑ on records (i.e., a reflexive, transitive and anti-symmetric
binary relation). The choice of a specific partial
order depends on the particular data and application at hand.
Just like the match and merge functions, we view domination
as a black-box. Hence, our focus is not on the accuracy
of the domination test. We will see in Sect. 2.2 that when
the match and merge function have some simple and natural
properties, then a canonical domination order can be defined
using them.
Domination on records can be naturally extended to
instances as follows:

Definition 2.2 Given two instances I1, I2, we say that I1
is dominated by I2, denoted I1 ⊑ I2, if ∀r1 ∈ I1, ∃r2 ∈ I2
such that r1 ⊑ r2.

It is straightforward to verify that instance domination is a
partial pre-order, i.e., that it is a reflexive and transitive relation.
Instance domination is not a partial order because it is
not anti-symmetric. Indeed, if r1 ⊑ r2, the instances {r2} and
{r1, r2} are distinct yet dominate each other.
Entity resolution. We are now ready to define entity resolution
formally:

Definition 2.3 Given an instance I, recall that Ī is the merge
closure of I. An entity resolution of I is a set of records I′
that satisfies the following conditions:

1. I′ ⊆ Ī,
2. Ī ⊑ I′,
3. No strict subset of I′ satisfies conditions 1 and 2.

The following property establishes that ER is well-defined.
Proofs for this result and subsequent ones can be
found in Appendices A and B.

Proposition 2.1 For any instance I, the entity resolution of
I exists and is unique. We denote it ER(I).

Although ER is well defined, just like the merge closure it
may be infinite, and therefore not computable. Even when it
is finite, its computation may be very expensive. Intuitively,
any finite sequence of merges may produce a different record,
and dominated records can only be removed after all matches
have been found. We will give in Sect. 3.1 an algorithm that
computes ER when the merge closure is finite, which is optimal
in terms of the number of record comparisons it performs.
Before that, we introduce in the next section some natural
properties often satisfied by the match and merge functions,
which ensure the ER computation is tractable.
2.2 ICAR match and merge properties

In practice, some M and µ functions have some desirable
properties that lead to efficient ER. We have identified the
following four such properties, which are quite intuitive.

1. Idempotence: ∀r, r ≈ r and ⟨r, r⟩ = r. A record always
matches itself, and merging it with itself still yields the
same record.
2. Commutativity: ∀r1, r2, r1 ≈ r2 iff r2 ≈ r1, and if
r1 ≈ r2, then ⟨r1, r2⟩ = ⟨r2, r1⟩.
3. Associativity: ∀r1, r2, r3 such that ⟨r1, ⟨r2, r3⟩⟩ and
⟨⟨r1, r2⟩, r3⟩ exist, ⟨r1, ⟨r2, r3⟩⟩ = ⟨⟨r1, r2⟩, r3⟩.
4. Representativity: If r3 = ⟨r1, r2⟩ then for any r4 such
that r1 ≈ r4, we also have r3 ≈ r4.
We call these the ICAR properties. We stress that not all
match and merge functions will satisfy these properties, but
it is nevertheless important to study the special case where
they hold.
Commutativity and idempotence are fairly natural properties
to expect from match and merge functions. Associativity
is also a reasonable property to expect from a merge function.
Note that if associativity does not hold, then it becomes
harder to interpret a result record, since it depends not only
on the source records, but also on the order in which they
were merged.
The meaning of the representativity property is that record
r3 obtained from merging two records r1 and r2 represents
the original records, in the sense that any record r4 that would
have matched r1 (or r2 by commutativity) will also match
r3. Intuitively, this property states that there is no "negative
evidence": merging two records r1 and r2 cannot create evidence
(in the merged record r3) that would prevent r3 from
matching any other record that would have matched r1 or r2.
Note also that we do not assume the match function to
be transitive (i.e., r1 ≈ r2 and r2 ≈ r3 does not necessarily
imply r1 ≈ r3). Transitive match functions were considered
by [27]. In practice, designing transitive match functions is
difficult.
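Because the ICAR properties are properties of the black-box functions
themselves, they can be spot-checked mechanically on sample records. Below
is a small Java sketch (building on the hypothetical MatchMerge interface
above) that tests the four properties on one concrete triple; passing on
samples does not prove the properties hold in general, it only exposes
violations.

    import java.util.*;

    class IcarSpotCheck {
        // Report which ICAR properties are violated on the sample (r1, r2, r3).
        static <R> List<String> violations(MatchMerge<R> mm, R r1, R r2, R r3) {
            List<String> bad = new ArrayList<>();
            // Idempotence: r ~ r and <r, r> = r.
            if (!mm.match(r1, r1) || !mm.merge(r1, r1).equals(r1))
                bad.add("idempotence");
            if (mm.match(r1, r2)) {
                // Commutativity: r1 ~ r2 iff r2 ~ r1, and <r1, r2> = <r2, r1>.
                if (!mm.match(r2, r1)
                        || !mm.merge(r1, r2).equals(mm.merge(r2, r1)))
                    bad.add("commutativity");
                R r12 = mm.merge(r1, r2);
                // Representativity: the merge matches whatever r1 matched.
                if (mm.match(r1, r3) && !mm.match(r12, r3))
                    bad.add("representativity");
                // Associativity, checked only when both parenthesizations exist.
                if (mm.match(r2, r3) && mm.match(r12, r3)
                        && mm.match(r1, mm.merge(r2, r3))
                        && !mm.merge(r1, mm.merge(r2, r3))
                               .equals(mm.merge(r12, r3)))
                    bad.add("associativity");
            }
            return bad;
        }
    }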
Merge domination. When the match and merge functions
satisfy the ICAR properties, there is a natural domination
order that can be defined based on them, which we call the
merge domination:

Definition 2.4 Given two records r1 and r2, we say that r1
is merge dominated by r2, denoted r1 ≤ r2, if r1 ≈ r2 and
⟨r1, r2⟩ = r2.
The properties of the match and merge functions defined
in the previous section guarantee that merge domination is a
valid domination partial order on records:

Proposition 2.2 Merge domination is a valid domination
order.

Note that all the properties we required for match and
merge functions are necessary to ensure that domination is a
partial order relation.
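Under the ICAR properties, merge domination is thus directly computable
from the two black boxes. A one-method Java sketch (again using the
hypothetical MatchMerge interface):

    // r1 is merge dominated by r2 (r1 <= r2) iff r1 ~ r2 and <r1, r2> = r2.
    static <R> boolean mergeDominated(MatchMerge<R> mm, R r1, R r2) {
        return mm.match(r1, r2) && mm.merge(r1, r2).equals(r2);
    }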
The merge domination order on records is useful to understand
how records relate to each other. For instance one can
easily check that the following monotonicity conditions hold:

(A) for any records r1, r2 such that r1 ≈ r2, it holds that
r1 ≤ ⟨r1, r2⟩ and r2 ≤ ⟨r1, r2⟩, i.e., a merge record
always dominates the records it was derived from,
(B) if r1 ≤ r2 and r1 ≈ r, then r2 ≈ r, i.e., the match
function is monotonic,
(C) if r1 ≤ r2 and r1 ≈ r, then ⟨r1, r⟩ ≤ ⟨r2, r⟩, i.e., the
merge function is monotonic,
(D) if r1 ≤ s, r2 ≤ s and r1 ≈ r2, then ⟨r1, r2⟩ ≤ s.
Interestingly, merge domination is a canonical domination
order in the sense that it is the only one for which the match
and merge functions behave well, i.e., satisfy the above
monotonicity conditions:

Proposition 2.3 Given match and merge functions such that
the match function is reflexive and commutative, if a domination
order ⊑ exists such that the four monotonicity conditions
(A)–(D) above are satisfied with ≤ replaced by ⊑, then the
ICAR properties of Sect. 2.2 are also satisfied, and ⊑ coincides
with the merge domination order ≤.

In some sense, the above proposition justifies the properties
we required from match and merge functions, as they
capture the requirements needed to make entity resolution
a monotonic process. We believe that checking our simple
properties on match and merge functions is more practical
than looking for an order for which the monotonicity conditions
(A)–(D) are satisfied. In the rest of the paper, whenever
the match and merge function satisfy the ICAR properties of
Sect. 2.2, we consider merge domination to be our default
domination order.
ER with ICAR properties. When the match and merge functions
satisfy the ICAR properties above, then the ER process
itself has interesting computational properties: it is guaranteed
to be finite, records can be matched and merged in any
order, and dominated records can be discarded anytime. We
next define the notion of maximal derivation sequence, and
then use it to state these properties precisely.

Definition 2.5 Given an instance I, a derivation step I → I′
is a transformation of instance I into instance I′ obtained by
applying one of the following two operations:
– Merge step: Given two records r1 and r2 of I s.t. r1 ≈ r2,
and r3 = ⟨r1, r2⟩ ∉ I, I′ = I ∪ {r3},
– Purge step: Given two records r1 and r2 of I s.t. r1 ≤ r2,
I′ = I − {r1}.

A derivation sequence I →* In is any non-empty sequence
of derivation steps I → I1 → · · · → In. A derivation
sequence I →* In is maximal if there exists no instance In+1
s.t. In → In+1 is a valid derivation step.
The following theorem (proven in appendix) states the
properties of ER:

Theorem 2.1 Given match and merge functions that are
idempotent, commutative, associative and representative, for
any instance I, ER(I) is finite, and any maximal derivation
sequence starting from I computes ER(I).
Union class of match and merge functions. There is a broad
class of match and merge functions that satisfy the ICAR
properties because they are based on union of values. We
call this class the Union Class. The key idea is that each
record maintains all the values seen in its base records. For
example, if a record with name {John Doe} is merged with
a record with name {J. Doe}, the result would have the
name {John Doe, J. Doe}. Unioning values is convenient
in practice since we record all the variants seen for a person's
name, a hotel's name, a company's phone number, and
so on. Keeping the lineage of our records is important in
many applications, and furthermore ensures we do not miss
future potential matches. Notice that the actual presentation
of this merged record to the user does not have to be a set, but
can be any string operation result on the possible values (e.g.,
{John Doe}). Such a strategy is perfectly fine as long as the
records only use the underlying set values for matching and
merging. Two records match if there exists a pair of values
from the records that match. In our example, say the match
function compares a third record with name {Johnny Doe} to
the merged record obtained earlier. If the function compares
names, then it would declare a match if Johnny Doe matches
either one of the two names. The match and merge functions
in this Union Class satisfy the ICAR properties as long as the
match function is reflexive and commutative (two properties
that most functions have):

Proposition 2.4 Given match and merge functions such that
the match function is reflexive and commutative, if the match
and merge functions are in the Union Class, the ICAR properties
are satisfied.

Beyond the Union Class, there are other functions that,
while not strictly in this class, also record in some way all
the values they have encountered. For example, a record may
represent the range of prices that have been seen. If the record
is merged with another record with a price outside the range,
the range is expanded to cover the new value. Thus, the range
covers all previously encountered values. Instead of checking
if the prices in the records match exactly, the match function
checks if price ranges overlap. It can be shown that match and
merge functions that keep all values explicitly or in ranges
also satisfy the ICAR properties.
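A Union-Class pair of functions can be written once, generically, and
instantiated with any reflexive, commutative value-level matcher. A Java
sketch under those assumptions (the valueMatch predicate is a placeholder
for an application-specific comparison such as a name similarity test; the
merge is the same union merge as in the introduction sketch):

    import java.util.*;
    import java.util.function.BiPredicate;

    // Generic Union-Class match/merge over records represented as maps
    // from attribute name to the set of all values seen for that attribute.
    class UnionClass {
        // Two records match if some pair of values of a shared attribute
        // matches under the supplied value-level matcher.
        static boolean match(Map<String, Set<String>> r1,
                             Map<String, Set<String>> r2,
                             BiPredicate<String, String> valueMatch) {
            for (String attr : r1.keySet())
                for (String v1 : r1.get(attr))
                    for (String v2 : r2.getOrDefault(attr, Set.of()))
                        if (valueMatch.test(v1, v2)) return true;
            return false;
        }

        // Merging unions the value sets, preserving lineage.
        static Map<String, Set<String>> merge(Map<String, Set<String>> r1,
                                              Map<String, Set<String>> r2) {
            Map<String, Set<String>> out = new HashMap<>();
            for (Map<String, Set<String>> r : List.of(r1, r2))
                r.forEach((a, vs) ->
                    out.computeIfAbsent(a, k -> new HashSet<>()).addAll(vs));
            return out;
        }
    }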
In this section, we proposed four simple and natural conditions
on merge and match functions for records: commutativity,
idempotence, representativity, and associativity. We
showed that under these conditions, records and instances can
be meaningfully ordered through merge domination, and that
ER is finite and independent from the order in which records
are processed. We believe that the ICAR properties above are
important in practice, for two main reasons:

(a) There are many applications where these properties
hold. For example, in some intelligence gathering
applications, values are unioned during merges, to
accumulate all evidence. One can show that such additive
applications use Union Class match and merge
functions, satisfying the properties. The properties also
hold if values can be combined (when two records are
merged) into a representative value that captures all
matches with values it represents.
(b) By understanding the huge performance advantages
that the properties give us, we believe that application
designers will be strongly incentivized to develop functions
that have the properties. In some cases, achieving
the properties involves small changes. For example, in
one application we ran across a match function that
was not idempotent. However, it was easy to make the
function idempotent by adding an explicit check for the
case where both input records had identical content.
In other cases, obtaining good functions may involve
more complex changes. But without knowing what efficient
algorithms exist for the case where the properties
hold, the designer may never put the effort into developing
good functions.
In the next two sections, we propose actual algorithms to
compute ER for both the cases when the properties do not
hold and when they do. The performance advantage of having
the properties satisfied will be illustrated by our experiments
in Sect. 5.
3 Record-level ER algorithms

We start by presenting G-Swoosh, an algorithm that does not
require the match and merge functions to satisfy any particular
properties. As ER may be infinite, G-Swoosh may not
terminate, and in general is expensive, but we show that it
is cost optimal for this very general scenario. We then present
R-Swoosh, an algorithm that applies when the match and
merge functions satisfy the ICAR properties, and which is
also optimal for that situation.
1: input: a set I of records
2: output: a set I′ of records, I′ = ER(I)
3: I′ ← I; N ← ∅
4: repeat
5:   I′ ← I′ ∪ N; N ← ∅
6:   for all pairs (r, r′) of records in I′ do
7:     if r ≈ r′ then
8:       merged ← ⟨r, r′⟩
9:       if merged ∉ I′ then
10:        add merged to N
11:      end if
12:    end if
13:  end for
14: until N = ∅
15: for all pairs (r, r′) of records in I′ where r ≠ r′ do
16:   if r′ ⊑ r then
17:     remove r′ from I′
18:   end if
19: end for

Alg. 1: The BFA algorithm for ER(I)
3.1 The G-Swoosh algorithm

To motivate G-Swoosh, we first present a simple, naive algorithm
that makes no assumptions about the match and merge
functions. As defined in Sect. 2.1, ER(I) is the set of all non-dominated
records that can be derived from the records in I,
or from records derived from them. Algorithm 1 presents a
brute force algorithm, BFA, that performs ER. The proposition
that follows states the correctness of BFA. Its proof is
given in Appendix.

Proposition 3.1 For any instance I such that Ī is finite, BFA
terminates and correctly computes ER(I).
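For concreteness, here is a direct Java transcription of BFA. It is a
sketch only, using the hypothetical MatchMerge interface from Sect. 2 and
a caller-supplied dominatedBy predicate standing in for the black-box
domination test; it assumes the merge closure is finite and that records
have meaningful equals/hashCode.

    import java.util.*;
    import java.util.function.BiPredicate;

    class BFA {
        // Brute force: repeatedly compare all pairs, collect new merge
        // results, iterate to a fixpoint, then drop dominated records
        // (the analogue of lines 15-19 of Alg. 1).
        static <R> Set<R> run(Set<R> input, MatchMerge<R> mm,
                              BiPredicate<R, R> dominatedBy) {
            Set<R> iPrime = new LinkedHashSet<>(input);
            Set<R> fresh = new LinkedHashSet<>();
            do {
                iPrime.addAll(fresh);
                fresh.clear();
                for (R r : iPrime)
                    for (R r2 : iPrime)
                        if (mm.match(r, r2)) {
                            R merged = mm.merge(r, r2);
                            if (!iPrime.contains(merged)) fresh.add(merged);
                        }
            } while (!fresh.isEmpty());
            // Final pass: remove records dominated by some other record.
            Set<R> dominated = new HashSet<>();
            for (R a : iPrime)
                for (R b : iPrime)
                    if (!a.equals(b) && dominatedBy.test(a, b)) dominated.add(a);
            iPrime.removeAll(dominated);
            return iPrime;
        }
    }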
To illustrate how BFA works, consider the instance of
Fig. 2. The initial instance I is represented by the records
in the left column. Matching (similar) records are enclosed
by a rectangle, and the arrow points to the resulting merged
record. The horizontal order corresponds to the progression
of the algorithm.

In the first iteration, BFA compares all possible pairs of
records in the initial I, generating the new records r12 and
r23. Since new records were generated, the algorithm continues
with a second iteration, in which seven records are
compared (the 5 original ones plus the two new ones). Thus,
in this second iteration, the new record r123 is generated.
Again, since a new record was found, we iterate with I′ now
containing eight records, generating r1235. The fourth and last
iteration finds no matches. Finally, BFA eliminates all dominated
records and terminates.
Fig. 2 Brute force ER on a simple instance
It is clear that BFA is performing a large number of match
calls, and that many of them are unnecessary. For example,
note that records r4 and r5 are compared a total of four
times. The three redundant comparisons could be avoided by
remembering results of previous match calls. The
G-Swoosh algorithm (Algorithm 2) avoids all these unnecessary
comparisons, not by explicitly remembering calls, but
by intelligently ordering the match and merge calls.
1: input: a set I of records
2: output: a set I′ of records, I′ = ER(I)
3: I′ ← ∅
4: while I ≠ ∅ do
5:   r ← a record from I
6:   remove r from I
7:   for all records r′ in I′ ∪ {r} do
8:     if r ≈ r′ (resp. r′ ≈ r) then
9:       merged ← ⟨r, r′⟩ (resp. ⟨r′, r⟩)
10:      if merged ∉ I ∪ I′ ∪ {r} then
11:        add merged to I
12:      end if
13:    end if
14:  end for
15:  add r to I′
16: end while
17: Remove dominated records from I′ (see lines 15–18 in BFA)
18: return I′

Alg. 2: The G-Swoosh algorithm for ER(I)
G-Swoosh works by maintaining two sets. I is the set of
records that have not been compared yet, and I′ is a set of
records that have all been compared with each other. The
algorithm iteratively takes a record r out of I, compares it to
every record in I′, and then adds it to I′. For each record r′
that matches r, the record ⟨r, r′⟩ is added to I.

Returning to the example of Fig. 2, r1 is first added to I′
(there is nothing yet to compare against). Next, r2 is
compared against all records in I′, i.e. against r1. This step
generates r12, which is placed in I. At the same time, r2
is added to I′. Next, r3 is compared against I′ = {r1, r2},
adding r23 to I and r3 to I′. Records r4, r5 and r12 generate
no matches, so at this point we have I = {r23} and
I′ = {r1, r2, r3, r4, r5, r12}. When we compare r23 against
I′ we add r123 to I and r23 to I′. We continue in this fashion
until I is empty and I′ contains Ī (i.e., all the records
shown in Fig. 2), then dominated records are removed and
the algorithm terminates. It is easy to see that in this example
G-Swoosh performs many fewer comparisons than BFA.
Note incidentally that if the commutativity and idempo-
tence properties hold, we can eliminate many comparisons
in G-Swoosh (as well as in BFA). Idempotence and commu-
tativity of the match and merge functions are easy to satisfy
in most applications. If they hold, in G-Swoosh we can elim-
inate one of the two match calls in line 8, and one of the two
merge calls in line 9. Furthermore, we do not need to match
r against itself (line 7).
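In the same sketch style as BFA above, here is a Java rendition of
G-Swoosh that already applies the commutativity/idempotence simplification
just described (one match and one merge call per pair, and no
self-comparison); all names are ours.

    import java.util.*;
    import java.util.function.BiPredicate;

    class GSwoosh {
        static <R> Set<R> run(Collection<R> input, MatchMerge<R> mm,
                              BiPredicate<R, R> dominatedBy) {
            Deque<R> I = new ArrayDeque<>(input);   // not yet compared
            Set<R> iPrime = new LinkedHashSet<>();  // pairwise compared
            while (!I.isEmpty()) {
                R r = I.removeFirst();
                for (R r2 : iPrime)
                    if (mm.match(r, r2)) {
                        R merged = mm.merge(r, r2);
                        // Queue the merge result unless it is already known.
                        if (!merged.equals(r) && !iPrime.contains(merged)
                                && !I.contains(merged))
                            I.addLast(merged);
                    }
                iPrime.add(r);  // keep r: sources are not discarded here
            }
            // Final domination removal, exactly as in BFA.
            Set<R> dominated = new HashSet<>();
            for (R a : iPrime)
                for (R b : iPrime)
                    if (!a.equals(b) && dominatedBy.test(a, b)) dominated.add(a);
            iPrime.removeAll(dominated);
            return iPrime;
        }
    }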
Proposition 3.2 For any instance I such that Ī is finite,
G-Swoosh terminates and computes ER(I).

Even though G-Swoosh may not terminate (because the
match and merge functions are so general), it is an optimal
algorithm, in terms of the number of record comparisons, our
main cost metric.

Theorem 3.1 G-Swoosh is optimal, in the sense that no algorithm
that computes ER(I) makes fewer comparisons in the
worst case.
Notice that in our evaluation of BFA and G-Swoosh we
have used the number of calls to the match function as the
main metric. We believe this metric is the right one. Each
record comparison may be quite complex, taking into account
several data values and using costly techniques. Moreover,
the number of record comparisons is roughly quadratic in the
number of records in the original instance (see Sect. 5). (As
an aside, note that the quadratic cost is not specific to our
approach; for instance, machine learning approaches, overviewed
in Sect. 6, need to compute similarities for at least all
pairs of records.) By contrast, merging records is generally
less costly, as it often relies on simple syntactic rules. It is
also less of a discriminating factor between algorithms, since
for a given instance they will all roughly perform the same
merges.
Another cost factor in G-Swoosh is the elimination of
dominated records at the end. Depending on how domination
is defined, this step can also be quite expensive, but is
similar for both the BFA and G-Swoosh algorithms.

If domination is a black-box partial order, then we can
only eliminate dominated records after we generate the merge
closure Ī (Definition 2.1). However, if we know how domination
is checked, we may be able to perform dominated record
elimination more efficiently. In particular, if we know that
dominated records can never generate undominated records,
then we can eliminate dominated records as soon as they are
found. Note that this informal property exactly corresponds
to monotonicity properties B and C of Proposition 2.2. There
are actually several ways to exploit this property to improve
G-Swoosh by eliminating dominated records early, but we
do not discuss them here.
3.2 The R-Swoosh algorithm

In this section we assume that the ICAR properties defined in
Sect. 2.2 hold. Furthermore, we assume that merge domination
(Definition 2.4) is used as the definition of domination.
As we argued earlier, these properties hold naturally in some
applications. In other applications the match and merge functions
may not initially satisfy these properties, but with small
changes to the functions we may achieve the properties.
The properties simplify ER processing in two significant
ways:

1. When two records r1 and r2 match to yield r12, we are
able to immediately discard the source records r1 and
r2, since whatever records can be derived from r1 or r2
can now be derived from r12.
2. If we eliminate records used in a merge, we do not need
to explicitly eliminate dominated records. To see this
fact, say we run ER without explicitly eliminating dominated
records at the end. In particular, say two records
r1 and r2 appear in the final answer, and r1 ≤ r2. By definition
of merge domination, r1 ≈ r2 and ⟨r1, r2⟩ = r2.
Thus, the comparison of r1 and r2 should have generated
merged record r2, and r1 should have been eliminated.
We use these two ideas in the R-Swoosh algorithm, given
in Algorithm 3. To illustrate the operation of R-Swoosh, we
revisit the example of Fig. 2. Processing is similar to that of
G-Swoosh, except when we find a match, we immediately
discard both source records. In particular, when we find that
r in I matches r′ in I′, we do not need to compare r to any
other I′ records: we simply remove r from I and r′ from I′
and add the new record to I. For example, after r1 and r2
are processed, I′ is empty (in G-Swoosh it contained {r1, r2})
and I = {r3, r4, r5, r12}. When we next compare r3 against
I′ we do not perform any comparisons and just add r3 to I′.
The final result is I′ = {r4, r1235}. At the end, there is no
need to remove dominated records.

With R-Swoosh we clearly avoid many comparisons that
G-Swoosh would have performed. For instance, once r1 is
merged into r12, we do not need to compare r1 to any other
records. Furthermore, we avoid generating some intermediate
merged records. For example, R-Swoosh never generates
r23; r3 merges directly with r12 to generate r123.
1: input: a set I of records /* Initialization */
2: output: a set I′ of records, I′ = ER(I)
3: I′ ← ∅
4: while I ≠ ∅ do /* Main loop */
5:   currentRecord ← a record from I
6:   remove currentRecord from I
7:   buddy ← null
8:   for all records r′ in I′ do
9:     if M(currentRecord, r′) = true then
10:      buddy ← r′
11:      exit for
12:    end if
13:  end for
14:  if buddy = null then
15:    add currentRecord to I′
16:  else
17:    r″ ← ⟨currentRecord, buddy⟩
18:    remove buddy from I′
19:    add r″ to I
20:  end if
21: end while
22: return I′

Alg. 3: The R-Swoosh algorithm for ER(I)
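The corresponding Java sketch is short; as before, MatchMerge is our
hypothetical interface, and the code assumes the ICAR properties hold:

    import java.util.*;

    class RSwoosh {
        // Under ICAR: on a match, discard both sources and reprocess the
        // merged record; no domination-removal pass is needed at the end.
        static <R> Set<R> run(Collection<R> input, MatchMerge<R> mm) {
            Deque<R> I = new ArrayDeque<>(input);
            Set<R> iPrime = new LinkedHashSet<>();
            while (!I.isEmpty()) {
                R current = I.removeFirst();
                R buddy = null;
                for (R r2 : iPrime)
                    if (mm.match(current, r2)) { buddy = r2; break; }
                if (buddy == null) {
                    iPrime.add(current);
                } else {
                    iPrime.remove(buddy);
                    I.addFirst(mm.merge(current, buddy));
                }
            }
            return iPrime;
        }
    }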
The following proposition establishes the correctness of
the R-Swoosh algorithm.

Proposition 3.3 Given an instance I, the R-Swoosh algorithm
computes ER(I).
As R-Swoosh randomly picks the next record from the
set I, this leaves room for improvement. In some cases, additional
knowledge can be used to influence the order in which
records are picked (e.g. through sorting the records, in the
style of [19]), so that the number of comparisons is reduced
on average. However, if we have no knowledge of what order
is best, then R-Swoosh is optimal in the sense that even on
the most unfavorable instances, R-Swoosh performs at least
as well as any other possible algorithm.
Proposition 3.4 For any instance I of n records such that
entity resolution yields j records, R-Swoosh performs at most
(n − 1)² − (j − 1)(j − 2)/2 record comparisons. There exists an
instance (with n records, yielding j records) on which any
algorithm performs at least as many record comparisons.
4 Feature-level ER
Although R-Swoosh is optimal in terms of record comparisons,
it may still perform redundant comparisons of the
underlying values. To see why, recall the example we used
in the introduction, corresponding to the instance of Fig. 1.
The names "John D." and "John Doe" are first compared
when records r1 and r3 are compared, and then recompared
when r4 (obtained from merging r1 and r2) and r3 are compared.
More generally, different records may share common
values, therefore the same comparisons may be performed
redundantly.

We classify value comparisons based on their outcome:
positive comparisons are the ones that succeed (e.g., the
names "John Doe" and "J. Doe" are similar), while negative
comparisons fail (e.g., "John D." and "John Doe" do not
match in our example). Our goal is to avoid repeating both
positive and negative value comparisons.

To avoid these redundant comparisons, we refine the granularity
of the match function, to take into account the contents
of records. We break down the process of comparing records
into several fine-grained comparisons on features (data subsets)
of the compared records. In the previous example, the
name is such a feature, while the combination of e-mail and
phone number forms another feature. For each feature, a specific
comparison function is used. Two records match if one
or more of their features match. In a nutshell, the F-Swoosh
algorithm will improve performance by taking into account
these feature comparisons, and by keeping track of encountered
values in order to avoid positive and negative redundant
comparisons.
More formally, we consider a finite set of features f1, . . . ,
fm. Each feature fi is a function on records that returns
a set of feature values from some domain Dfi. For example,
since the second feature f2 of our example above is
phone+email, f2(r4) = {{234-4358, jdoe@yahoo}, {235-2635,
jdoe@yahoo}}. Each feature fi comes with a boolean
match function Mfi defined over Dfi × Dfi. Two records
r1, r2 match iff there exists a feature fi and feature values
v1, v2 s.t. v1 ∈ fi(r1), v2 ∈ fi(r2) and Mfi(v1, v2) = true.
Thus, record matching is defined as an existentially quantified
disjunction over feature matches. One could think of the
scenario where record matching can be done by any one of
several match functions. Suppose two records match if they
are similar according to either an edit distance algorithm or
a fuzzy logic algorithm. In this case, the entire matching is a
disjunction of the two match functions. On the other hand, if
the individual match functions by themselves are not powerful
enough to determine matches, one may want to consider
more complex combinations of features, e.g., involving conjunction,
universal quantification, or negation. However, the
disjunctive case we consider here leads to simple bookkeeping,
since one can determine if records match by comparing
one feature at a time. We believe that with more complex
conditions, bookkeeping will be significantly more complex,
and more storage will be necessary, slowing down the performance
of F-Swoosh. Bookkeeping becomes more complex
because we can no longer store each feature separately as we
do in F-Swoosh based on the assumption that a single feature
match implies an entire record match. Managing several
features together also requires larger data structures, which
take longer to access.
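This disjunctive, feature-at-a-time contract is easy to express in code. A
Java sketch (types and names are ours): each feature bundles a value
extractor with its value-level match function, and the record-level match
is the existential disjunction defined above.

    import java.util.*;
    import java.util.function.BiPredicate;
    import java.util.function.Function;

    // A feature f_i: extracts the set of feature values of a record, and
    // carries the value-level match function M_{f_i}.
    class Feature<R, V> {
        final Function<R, Set<V>> extract;
        final BiPredicate<V, V> valueMatch;
        Feature(Function<R, Set<V>> extract, BiPredicate<V, V> valueMatch) {
            this.extract = extract;
            this.valueMatch = valueMatch;
        }
    }

    class FeatureLevelMatch {
        // r1 ~ r2 iff SOME feature has SOME pair of matching values.
        static <R, V> boolean match(R r1, R r2, List<Feature<R, V>> features) {
            for (Feature<R, V> f : features)
                for (V v1 : f.extract.apply(r1))
                    for (V v2 : f.extract.apply(r2))
                        if (f.valueMatch.test(v1, v2)) return true;
            return false;
        }
    }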
Just as for R-Swoosh, we still require that the ICAR properties
of Sect. 2.2 be satisfied. We need to make sure that the
feature-level match functions Mfi are such that their combination
yields a record-level match function that satisfies
these properties. A simple sufficient condition is to have
an idempotent, commutative and associative merge function,
and have each of the Mfi be idempotent, commutative and
representative for this merge function.
4.1 The F-Swoosh algorithm

We now present the F-Swoosh algorithm. As its name suggests,
F-Swoosh has a similar structure to that of R-Swoosh.
The set I′ is here also used to incrementally build a set of
non-dominated, non-matching records. The main difference
is that for each feature, a hash table and a set are used to keep
track of previously seen feature values and save redundant
positive and negative comparisons, respectively. An important
point is that these data structures have a size which is only
linear in the size of the data. Simply recording the outcome of
all previously performed match comparisons would occupy
a quadratic space, which is unacceptable for large datasets.

The F-Swoosh algorithm is given in Algorithm 4. We first
introduce the data structures used by F-Swoosh, before discussing
the algorithm itself and its properties.
For each feature fi, we maintain a data structure Pfi that
avoids repeating positive value comparisons for fi. Pfi is a
hash table that records all previously seen values of fi, and
associates with each feature value v the record r² that currently
represents v. The record r is either the first record
where feature value v appeared for feature fi, or one that was
derived from it through a sequence of merge steps. If there is
no such record, i.e., feature value v is seen for the first time,
Pfi(v) returns null. Note that there can be at most one record
associated with the feature value v for fi; if more than one
record has been seen, the records have been merged into the
record returned by Pfi(v). The hash table is updated by a
command of the form Pfi(v) ← r. If the feature value v (for
fi) had not been recorded earlier, then this command adds
the pair (v, r) to the table. If v had been seen (for fi), then
the command replaces the (v, r′) pair by (v, r), indicating
that the old record r′ has been merged into r.
For each feature fi, we also maintain a data structure Nfi
aimed at avoiding redundant negative value comparisons. Nfi
is a set that records the feature values of fi that were compared
against all the feature values of records in I′ and did not
match any of them (line 31). By representativity, this implies
that if the record currently processed by the algorithm has
an fi value that appears in Nfi, then this value need not be
further compared (line 23).
² In fact, we slightly abuse notation, as this is a pointer to the corresponding
record, and not the record itself.
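In Java, this per-feature bookkeeping can be as simple as one HashMap and
one HashSet per feature; a sketch (our own names, with V the feature-value
type and R the record type):

    import java.util.*;

    // Per-feature state used by F-Swoosh: P_f (positive bookkeeping) and
    // N_f (negative bookkeeping). Both are linear in the data size.
    class FeatureState<V, R> {
        private final Map<V, R> P = new HashMap<>(); // value -> representing record
        private final Set<V> N = new HashSet<>();    // values with no match in I'

        R representative(V v) { return P.get(v); }   // P_f(v), null if unseen
        void represent(V v, R r) { P.put(v, r); }    // P_f(v) <- r
        boolean knownNegative(V v) { return N.contains(v); }
        void markNegative(V v) { N.add(v); }         // line 31 of Alg. 4
    }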
1: input: a set I of records
2: output: a set I′ of records, I′ = ER(I)
3: Pf ← empty hash table, for each feature f /* Initialization */
4: Nf ← empty set, for each feature f
5: I′ ← ∅, currentRecord ← null
6: while I ≠ ∅ or currentRecord ≠ null do /* Main loop */
7:   if currentRecord = null then
8:     currentRecord ← a record from I
9:     remove currentRecord from I
10:  end if
11:  buddy ← null
12:  for all (f, v) of currentRecord do /* Keep track of any new values in the record */
13:    if Pf(v) = null then Pf(v) ← currentRecord
14:  end for
15:  for all (f, v) of currentRecord do /* Was any value previously encountered? */
16:    if Pf(v) ≠ currentRecord then
17:      buddy ← Pf(v)
18:      exit for
19:    end if
20:  end for
21:  if buddy = null then /* If not, look for matches */
22:    for all (f, v) of currentRecord do
23:      if v ∉ Nf then /* If a value never made it to Nf ... */
24:        for all values v′ of each r′ ∈ I′ do /* ... compare it to the values of I′ */
25:          if Mf(v, v′) then
26:            buddy ← r′
27:            exit for
28:          end if
29:        end for
30:        if buddy ≠ null then exit for
31:        add v to Nf
32:      end if
33:    end for
34:  end if
35:  if buddy = null then
36:    add currentRecord to I′
37:    currentRecord ← null
38:  else
39:    r″ ← ⟨currentRecord, buddy⟩
40:    remove buddy from I′
41:    for all (f, v) where Pf(v) ∈ {currentRecord, buddy} do /* Update Pf's to point to the merged record */
42:      Pf(v) ← r″
43:    end for
44:    currentRecord ← r″
45:  end if
46: end while
47: return I′

Alg. 4: The F-Swoosh algorithm for ER(I)
Algorithm. When a new record is processed by F-Swoosh,
the algorithm first registers any new feature values (lines 12–14),
then checks if any of the values of the record already
appeared in a different record (lines 15–20). If this is the
case, the record will be merged with the one pointed to by the
corresponding entry in the Pfi hash table. If not, the feature
values of the record are compared to those of the records
in I′ (lines 21–34), and if no match is found, the record is
inserted in I′. As for R-Swoosh, when a match is found, the
old records buddy and currentRecord are purged, while the
merged record is placed in I for processing. Additionally, the
Pfi hash tables are updated so that feature values that previously
pointed to buddy or currentRecord now point to the
new merged record (lines 42–44).

As an optimization, and to avoid scanning the hash tables
in this last step, one can keep an inverted hash table that
maintains, for each record, a list of (feature, feature value)
pairs that point to it. This data structure costs space linear
in the size of the instance, and its maintenance is straightforward.
This optimization was used in the code that ran our
experiments.
To illustrate the operation of F-Swoosh, suppose we have
the three records r1 = {name: John Doe, phone: 235-2635,
email: jdoe@yahoo}, r2 = {name: Fred, phone: 678-1253,
email: fred@yahoo}, and r3 = {name: John Doe, phone: 235-2635,
email: jdoe@microsoft}. Suppose that there are two
features name and phone+email, and that two records
match if their names are similar or if both their phones and
emails are the same. We first add r1 to I′ and then compare r2
to r1. Since r2 does not match with r1, r2 is also added to I′.
Unlike R-Swoosh, however, r3 is then directly merged with
r1 (without running the match function) because the feature
value {John Doe} is found in Pname. The merged record r13
= {name: John Doe, phone: 235-2635, email: {jdoe@yahoo,
jdoe@microsoft}} is now the current record. This time, we
do not need to compare the names of r13 and r2 (unlike
R-Swoosh) because {John Doe} is found in Nname (which
means that we know {John Doe} has already been compared
with all the feature values of name in I′ and thus does not
match with {Fred}). After we compare the phone+email
values of r13 and r2, r13 is added to I′. As a result, F-Swoosh
performs fewer feature value comparisons than R-Swoosh.
The correctness of F-Swoosh is established by the follow-
ing proposition.
Proposition 4.1 Given an instance I, the F-Swoosh algo-
rithm computes the maximal derivation of I, and therefore
solves the ER problem.
F-Swoosh exercises a lot of care not to perform redundant
or unnecessary feature value comparisons. The Pfi hash
tables record all previously seen feature values (including
those that may have disappeared from I′ because of merges)
and keep track of the records that represent them, to immediately
merge any records where these feature values may appear
again (lines 15–20). Pairs of feature values that match immediately
lead to a merge, and are never recompared again,
while pairs of feature values that do not match (or feature
values that represent them) are added to the sets Nfi, and
once this happens, are guaranteed to never be recompared
again.
Some feature value comparisons may still be carried out
multiple times by F-Swoosh. In particular, pairs of feature
values that do not match may be recompared at a later time
if a merge happens, and at least one of the feature values has
not been recorded in Nfi. Avoiding such redundancies would
require storing the outcome of all previous unsuccessful feature
value comparisons, which would have an unacceptable
storage cost. Instead, our algorithm tries to minimize the windows
where such redundant comparisons may occur, by constraining
the order in which records are processed. Whenever
a match is found, the merged record will be set as the next
record to be processed (line 44), and no new record will be
processed before the merged record, or one derived from it,
is added to I′. At this time, encountered feature values have
been added to Nfi and will not be recompared against each
other.

The benefits of F-Swoosh are further illustrated by our
experimental evaluation, presented in Sect. 5.
Incremental F-Swoosh. One important advantage of
F-Swoosh is that it is very easy to adapt to an incremental
scenario where new data or new features are added. For
example, suppose we have performed ER on a set of records
S and have obtained S′ using F-Swoosh. Next, new evidence
arrives, in the form of new records ΔS. We do not need to
run F-Swoosh on S ∪ ΔS. Instead, we run F-Swoosh with
I′ ← S′ and I ← ΔS, and initialize the hash tables Pfi and
the sets Nfi to the state they had at the end of the original
run. This setup will avoid unnecessary comparisons between
S′ records, which we already know have no matches among
themselves. In fact, this will avoid any comparison of feature
values that was performed during the original run. Note, incidentally,
that R-Swoosh can also be made incremental, in the
same fashion. Since R-Swoosh does not use additional data
structures, no internal state needs to be saved or restored.

An analogous setup can be used if a new feature fm+1 is
defined after F-Swoosh ran and generated S′. We again initialize
each Pfi and Nfi to the state they had after the original
run, and add a new hash table Pfm+1 and a new set Nfm+1,
both empty. We take I ← S′ (and I′ ← ∅) and run F-Swoosh
with the full set of features (old ones plus fm+1). Because
the data structures already hold the feature values for old features,
no unnecessary comparisons will be performed during
the new run.
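As a sketch of how this looks in code (hypothetical: the FSwooshEngine
interface stands for an implementation of Algorithm 4 that accepts
pre-seeded state, using the FeatureState sketch from Sect. 4.1):

    import java.util.*;

    class IncrementalRun {
        // An implementation of Algorithm 4 parameterized by saved state.
        interface FSwooshEngine<R, V> {
            Set<R> run(Set<R> iPrime, Deque<R> input,
                       Map<String, FeatureState<V, R>> state);
        }

        // Resume with the saved P_f/N_f state: seed I' with the previous
        // result S', and feed only the new records (delta S) as I.
        static <R, V> Set<R> addRecords(FSwooshEngine<R, V> engine,
                                        Set<R> sPrime, Collection<R> deltaS,
                                        Map<String, FeatureState<V, R>> savedState) {
            return engine.run(new LinkedHashSet<>(sPrime),
                              new ArrayDeque<>(deltaS), savedState);
        }
    }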
5 Experiments
We implemented G-Swoosh (assuming commutativity and
idempotence; see Sect. 3.1), R-Swoosh, and F-Swoosh as
described in the paper and conducted extensive experiments
on a comparison shopping dataset from Yahoo! Shopping and
a hotel information dataset from Yahoo! Travel. To evaluate
the performance of an algorithm, we counted the number
of feature value comparisons done by the match function.
As confirmed by practitioners (see some of the companies
we interacted with in Sect. 1), black-box match functions
are very expensive and hence the number of times they are
invoked is the natural metric here. We compared the performance
of the three algorithms, while varying the selectivity
of the match function, to cover a large spectrum of ER situations.
We also compared their scalability as the size of the
input dataset grows. Finally, we quantitatively investigated
whether R-Swoosh and F-Swoosh could be used even when
some of the properties do not hold.

Notice that in our context it does not make sense to evaluate
the run-time of the match functions themselves, nor the
accuracy of the results. We have only studied when and how
to invoke match and merge functions, not how to build efficient
functions nor how to build ones that are good at identifying
records that truly correspond to the same real-world
entity (see Sect. 6).
5.1 Experimental setting
We ran our experiments on a comparison shopping dataset
provided by Yahoo! Shopping. In this application, hundreds
of thousands of records arrive on a regular basis from different
online stores and must be resolved before they are used
to answer customer queries. Because of the volume of data,
records cannot be exhaustively compared to each other, and
must first be partitioned into independent clusters using some
semantic knowledge, e.g., by product category, a technique
commonly known as blocking. Exhaustive algorithms such
as those proposed in this paper are then used to resolve similar
records within a partition or bucket. In our experiments, we
used a partition of size 5,000 containing records with the substring
"iPod" in their titles; we will call these "iPod-related"
records from now on (as mentioned in the Introduction, when
we partition the data, we miss inter-partition matches).

In addition, we ran our experiments on a hotel dataset from
Yahoo! Travel. This time, many records arrive from different
travel sources (e.g., Orbitz.com), and must be resolved before
they are shown to the users. The hotels are located in different
countries including the United States, United Kingdom,
Germany, etc. Thus, a natural way to partition the hotels is
by their countries. In our experiments, we used a partition of
size 14,574 containing hotels in the United States (called "US
hotels" from now on).

The ER code was implemented in Java, and our experiments
were run on a 1.8 GHz AMD Opteron dual-core processor
with 24.5 GB of RAM. Though our server had multiple
processors, we did not exploit parallelism. This is a topic we
are addressing in a separate paper [7].
5.2 Match and merge strategies
We used different match and merge strategies for the two datasets, called MM_shop and MM_trav, that satisfy all the ICAR properties of Sect. 2.2. MM_shop uses three attributes (title, price, and category) for matching and merging shopping data records, while MM_trav uses eight (name, street address, city, state, zip, country, latitude, and longitude) for hotel records. Each strategy will be explained in detail. Since all the ICAR properties are satisfied, G-Swoosh may use the merge domination of Definition 2.4 as its domination function.
MM_shop. The MM_shop strategy uses a union approach for titles and categories, and a range approach for prices. Titles are compared using the Jaro–Winkler similarity measure³ [22], to which a threshold t from 0 to 1 is applied to get a yes/no answer. Categories are compared using exact string matches. When two records merge, the titles and categories are unioned. (Note that, since we do exact matches for categories, each record has one category.) Prices are represented by ranges of possible values and are matched when their ranges overlap. Initially, a single price x has a range that includes all the prices that match within percentage p, i.e., [x − p·x, x + p·x]. When two prices merge, the two ranges are merged into a new range.
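As an illustration of the range approach, the following is a minimal sketch of how the price match and merge could be coded; the class name PriceRange and its methods are hypothetical, not the paper's code.

// Minimal sketch (our illustration) of the price range approach: a single
// price x starts as the range [x - p*x, x + p*x], two ranges match when
// they overlap, and merging keeps a range covering all encountered prices.
class PriceRange {
    final double lo, hi;

    PriceRange(double lo, double hi) { this.lo = lo; this.hi = hi; }

    // Initial range for a single price x with tolerance p (e.g., p = 0.1).
    static PriceRange of(double x, double p) {
        return new PriceRange(x - p * x, x + p * x);
    }

    // Match: two price ranges match when they overlap.
    boolean matches(PriceRange other) {
        return lo <= other.hi && other.lo <= hi;
    }

    // Merge: the smallest range covering both input ranges.
    PriceRange merge(PriceRange other) {
        return new PriceRange(Math.min(lo, other.lo), Math.max(hi, other.hi));
    }
}

Keeping a covering range on merges is what lets the price component satisfy the ICAR properties, as discussed below.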
MM_shop uses two features: F1_sh, which consists of the attributes title and price, and F2_sh, which consists of title, price, and category. Although the attributes of F1_sh are a subset of the attributes of F2_sh, using both F1_sh and F2_sh makes sense because F1_sh is given higher (stricter) thresholds for matching. The feature values of a record for a certain feature are the combinations of all possible values of the attributes in the feature. Two records match if at least one of their feature values match for F1_sh or F2_sh. Two feature values match if each of the corresponding attribute values match, as described in the previous paragraph.
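The following sketch illustrates this existential feature-value match; the names (ExistentialMatch, valueMatch) are ours, and records are simplified to maps from feature names to sets of string feature values.

import java.util.*;
import java.util.function.BiPredicate;

// Minimal sketch (our illustration): two records match if, for some feature,
// at least one feature value of one record matches one of the other's.
class ExistentialMatch {
    static boolean match(Map<String, Set<String>> r1,
                         Map<String, Set<String>> r2,
                         BiPredicate<String, String> valueMatch) {
        for (String feature : r1.keySet()) {
            for (String v1 : r1.get(feature))
                for (String v2 : r2.getOrDefault(feature, Set.of()))
                    if (valueMatch.test(v1, v2)) return true;
        }
        return false;
    }
}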
MM_shop satisfies the ICAR properties because each record keeps all the values explicitly for titles and categories, and a range that covers all encountered prices (see our discussion in Sect. 2.2).
We experimented with MM_shop using various thresholds. We fixed the price threshold to 0.1 (10%). Also, we always set the title threshold of F1_sh halfway between the title threshold of F2_sh and 1.0 to make it stricter than F2_sh's title threshold. Thus, for the experiments we report on here we only varied the title threshold of F2_sh.
There are obviously many other strategies we can use on the shopping dataset. In our technical report [6], we show more experimental results comparing additional strategies. However, we think that MM_shop is representative of the behavior of all those strategies.
³ The Jaro–Winkler similarity measure returns a similarity score in the 0 to 1 range based on many factors, including the number of characters in common and the length of the common prefix.
MM_trav. The MM_trav strategy uses match and merge functions in the Union Class (see Sect. 2.2). Each attribute of a record retains all the distinct attribute values of the base records. When comparing two records, we do pairwise comparisons between all the feature values from each record and look for an existing match. The names and street addresses are compared using the Jaro–Winkler similarity measure, to which a threshold is applied. The attributes city, state, zip, and country are compared using exact string matches. Finally, the latitude and longitude are compared by checking if the absolute difference of the numeric values is less than a threshold. Since the match and merge functions are inside the Union Class, MM_trav naturally satisfies the ICAR properties.
MM_trav uses three features for matching records: F1_tr consists of the attributes name, street address, city, state, and country; F2_tr consists of name, street address, zip, and country; and F3_tr consists of name, street address, latitude, and longitude.
We also experimented with MM_trav using various thresholds. We used the same threshold for comparing the names and street addresses because they had similar string lengths. This threshold was used by all three features. We also fixed the latitude and longitude thresholds to 0.1 degree (which corresponds to about 11.1 km). Thus, for our experiments we only vary the name threshold of F1_tr.
In summary, we tested a variety of match and merge func-
tions, each representing different application semantics.
For each scenario we may get fewer or more matches, and
the runtime may vary. As mentioned in the Introduction,
our focus is on how the different scenarios perform, not on
how well the match and merge functions capture application
semantics.
5.3 Match selectivity impact
We measured the performance of G-Swoosh, R-Swoosh, and F-Swoosh using the MM_shop strategy by varying the selectivity of the match function on 3,000 random iPod-related records. We did not test on the entire block of 5,000 records because there were too many matching records, making G-Swoosh extremely slow. Varying the threshold of the match function affects its selectivity. In our experiments, we varied the title threshold of F2_sh to capture a variety of match criteria. However, instead of plotting all graphs with the title threshold of F2_sh as the horizontal axis, we plotted against the number of actual merges that occurred, i.e., the number of records in the initial set minus that in the result set. We think that the number of merged records is a more intuitive parameter, one that captures the selectivity of an application. The higher the number of merged records, the more selective the match function is (allowing more records to match and merge).
[Fig. 3 MM_shop feature value comparisons: feature value comparisons (millions) vs. number of merges for G-Swoosh, R-Swoosh, and F-Swoosh]
Figure 3 shows the number of feature value comparisons for each algorithm as the number of merges increases for the MM_shop scenario. The comparisons of G-Swoosh rapidly increase even if a moderate number of records merge. This is because the complexity of G-Swoosh is exponential in the number of records that match each other. R-Swoosh also increases rapidly compared to F-Swoosh because, as merged records become larger, their numbers of feature values increase linearly for both F1_sh and F2_sh. As a result, R-Swoosh does many feature value comparisons when comparing large records. F-Swoosh, on the other hand, saves many of these comparisons by using the P_{f_i} hash tables and N_{f_i} sets. Another noticeable trend is that the comparisons of R-Swoosh start to increase rapidly after a certain point where many titles start to match each other and result in a burst of new matches.
To illustrate how the number of feature value comparisons relates to the actual performance, Fig. 4 shows the runtime for each algorithm (same MM_shop scenario). The runtime results mainly depend on the number of feature value comparisons. The most dominant factor of the runtime for any algorithm in our application turns out to be the total time for matching titles during the feature value comparisons. This is because title matching involves string similarity measurement, which takes much longer than numeric comparison (used for matching prices) and exact string matching (used for matching categories). The total merge time is negligible because the number of merges is much smaller than the number of matches done.
Next, we compared G-Swoosh, R-Swoosh, and F-Swoosh using the MM_trav strategy on the block of 14,574 US hotel records. This time, we varied the name threshold of F1_tr to capture different selectivities of the application. One notable feature of the hotel dataset was that most of the merged records consisted of exactly two base records. The reason is
twofold. First, the hotel records came from four different data sources where each data source did not contain duplicates in itself. Hence, the only way for duplicates to occur was for different sources to share the same hotels. Second, each hotel usually came from at most two different sources. Since the merged records were so small, F-Swoosh was not significantly better than R-Swoosh, whereas G-Swoosh actually performed reasonably well.

[Fig. 4 MM_shop runtime: runtime (hrs) vs. number of merges for G-Swoosh, R-Swoosh, and F-Swoosh]

[Fig. 5 MM_trav feature value comparisons: feature value comparisons (millions) vs. number of merges for G-Swoosh, R-Swoosh, and F-Swoosh]
Figure 5 shows the number of feature value comparisons for each algorithm as the number of merges increases for the MM_trav scenario. Although the differences among the algorithms are minor compared to Fig. 3, the performance gap between the algorithms steadily increases as the number of merges grows. The reason is that, as we lower the comparison threshold, some merged records do get larger and have many feature values to compare.
Figure 6 compares the actual performance of the three algorithms using MM_trav. F-Swoosh is slightly faster than R-Swoosh, while G-Swoosh has a reasonable performance and is only about twice as slow as F-Swoosh. Compared to Fig. 5, there is a larger gap between G-Swoosh and the other algorithms. This additional runtime is spent in the last stage of removing dominated records.

[Fig. 6 MM_trav runtime: runtime (hrs) vs. number of merges for G-Swoosh, R-Swoosh, and F-Swoosh]
In summary, the actual number of comparisons and runtime depend on the application semantics, i.e., on how selective the match function is, how merged records grow (e.g., by storing all the feature values of the base records), and how efficient the value comparisons are. However, in general:

- F-Swoosh is 1.1 to 11.4 times faster than R-Swoosh in the scenarios we considered.
- G-Swoosh is extremely expensive and not practical (even when we assume commutativity and idempotence) when many merges occur, but performs reasonably well when most of the merged records are very small.

Of course, even when G-Swoosh is impractical, it is an important algorithm to understand, as it represents the cost of exact ER when the ICAR properties do not hold.
5.4 Scalability
We conducted scalability tests for G-Swoosh, R-Swoosh, and F-Swoosh using MM_shop and MM_trav. We first tested MM_shop on randomly selected shopping data (regardless of product types) ranging from 2,000 to 16,000 records. We then tested MM_trav on randomly selected hotel data (regardless of the countries) ranging from 2,000 to 20,000 records. The string comparison thresholds for MM_shop and MM_trav were set to values (0.9 and 0.85, respectively) that give good results.
Figure 7 shows the runtime scalability test results for each algorithm. Both R-Swoosh and F-Swoosh illustrate the quadratic cost of ER. As the dataset gets larger, F-Swoosh outperforms R-Swoosh by up to 49% (for the 16,000 shopping records). G-Swoosh increases more rapidly than the other algorithms. While G-Swoosh cannot handle more than 3,900 shopping records in a reasonable amount of time, it scales better on the hotel data.

[Fig. 7 Runtime scalability: runtime (hrs) vs. number of records for G-Swoosh, R-Swoosh, and F-Swoosh under MM_shop and MM_trav]
In summary, ER is an inherently expensive process, so that only relatively small sets of records can be handled. Thus, large data sets need to be partitioned into smaller sets that can be resolved in detail. How large a data set can be exhaustively resolved depends on the application. It is also possible to distribute the ER computations across multiple processors, in order to handle larger data sets. In [7] we study various strategies for distributing the work done by R-Swoosh.
5.5 Without the properties
So far, we have only studied scenarios where the ICAR properties of Sect. 2.2 hold. We now consider a scenario where the properties do not hold. In this case, we need to run G-Swoosh to get a correct answer. From our previous results, however, we know that G-Swoosh can be very expensive, especially when there are many large merges. The alternatives are (1) to modify the match and merge functions so that the ICAR properties hold while still capturing reasonably what the application intends, or (2) to run R-Swoosh and F-Swoosh even though we will not get correct answers. It would be interesting to see what results we get for the second alternative.
We used a variant of MM_shop for our strategy. Instead of saving all the titles when merging records, we simply choose the longer string. If two strings have the same length, however, we choose the string that lexicographically precedes the other, in order to satisfy commutativity. For the price attribute, we choose the numerically larger value. We did not take the average of the values because, otherwise, G-Swoosh would have produced an infinite number of new records with slightly differing prices. We used the same match function as that of MM_shop. In order to provide a partial order for domination, we used a variation of Definition 2.4 where r1 is dominated by r2 if r1's base records are included in r2's base records. This new strategy does not satisfy the ICAR properties because the ER result now depends on the order of record merges.
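The following sketch shows one way the variant's merge rules could be coded; the class and method names are our own illustration.

// Minimal sketch (our illustration) of the non-ICAR merge variant: keep the
// longer title, break length ties lexicographically (making the merge
// commutative), and keep the larger price.
class VariantMerge {
    static String mergeTitle(String t1, String t2) {
        if (t1.length() != t2.length())
            return t1.length() > t2.length() ? t1 : t2;
        // Same length: keep the lexicographically preceding string, so that
        // mergeTitle(a, b) equals mergeTitle(b, a).
        return t1.compareTo(t2) <= 0 ? t1 : t2;
    }

    static double mergePrice(double p1, double p2) {
        // The max (rather than the average) keeps the set of derivable
        // merged records finite.
        return Math.max(p1, p2);
    }
}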
Our results show that, at least for our product resolution application, the answer produced by R-Swoosh is very similar to the answer G-Swoosh generates. Since R-Swoosh is so much more efficient, it is thus attractive to use R-Swoosh even when the ICAR properties do not hold. Note that even the G-Swoosh answer may not be 100% accurate (some resolved records may not represent true real-world entities) because the match and merge functions themselves may not be perfect. Thus, the difference between the G-Swoosh and R-Swoosh answers can also be viewed as additional error beyond whatever error may occur in the G-Swoosh answer.
Before presenting our results, we give a simple example that helps interpret the results. Consider an initial instance with three records I = {r1, r2, r3}. Suppose that r1 and r2 match, yielding ⟨r1, r2⟩ = r12. Similarly, r2 and r3 match and ⟨r2, r3⟩ = r23. However, there are no other matches. (Representativity does not hold. If it did, for instance, r3 would match r12, yielding r123.) In this case, G-Swoosh computes the set I′ = {r12, r23}, assuming that r1, r2, r3 are dominated. On the other hand, R-Swoosh computes I′ = {r12, r3}, assuming the r1, r2 comparison is done first. (After r12 is found, r1 and r2 are discarded, so r3 has nothing to match with.) If r2 and r3 are compared first, then R-Swoosh computes I′ = {r23, r1}. Thus, we see that R-Swoosh may miss some records in the correct answer, and can generate records not in the G-Swoosh answer.
Figure 8 shows the result size comparison between G-Swoosh and R-Swoosh tested on 2,000 random iPod records, varying the title threshold of F2_sh. (Figure 8 shows the results for a particular record ordering, based on the order of records in the input file. Results for other orderings are similar.) Again, we did not test on the entire 5,000 records because G-Swoosh was too expensive for low thresholds. Surprisingly, the result size of G-Swoosh is slightly smaller than that of R-Swoosh. This is because G-Swoosh considers all possible merges and tends to produce large records that dominate smaller records (like r3 in our example) that are also part of the result of R-Swoosh.
Figure 9 shows the intersection between G-Swoosh and R-Swoosh. On average, 99.49% of all the records produced by R-Swoosh are also produced by G-Swoosh. The remaining 0.51% are those that have been discarded by domination in G-Swoosh. The result is also similar the other way: on average, 99.64% of the records produced by G-Swoosh are also produced by R-Swoosh. This time, the missing records are the ones that R-Swoosh was not rigorous enough to come up with (like r23 in our example).
[Fig. 8 Result sizes of G-Swoosh and R-Swoosh: result size vs. title threshold (0.93–0.99)]
[Fig. 9 Intersection between G-Swoosh and R-Swoosh: portion of records also in the other result (%) vs. title threshold]
Returning to our example, G-Swoosh computes the set {r12, r23} while R-Swoosh computes {r12, r3}. In this case, only 50% of the R-Swoosh records are in the G-Swoosh answer. One can argue that this low number is misleading because the incorrect r3 record is closely related to the r23 record in the G-Swoosh set, and hence is not completely wrong. To capture this intuition, we define a precision metric that takes into account similarity between records. For each record r in the R-Swoosh answer, we find the maximum value of the Jaccard similarity coefficients between r and each record in the G-Swoosh answer. The Jaccard similarity coefficient between two records is defined as the size of the intersection of their sets of base records divided by the size of the union of their sets of base records (e.g., records r12 and r23 have the Jaccard similarity coefficient |{r2}| / |{r1, r2, r3}| = 1/3). The precision of R-Swoosh is then the sum of the maximum Jaccard similarity coefficients for all r's divided by the size of the R-Swoosh answer. In our example, the sum of the maximum Jaccard coefficients for {r23, r1} is 2/2 + 1/2 = 1.5, and the precision is 1.5/2 = 75%. Figure 10 shows the precision of R-Swoosh compared to G-Swoosh. The average precision is 99.77%, showing that the R-Swoosh answer is very similar to the G-Swoosh answer.

[Fig. 10 Precision of R-Swoosh compared to G-Swoosh: precision (%) vs. title threshold]

Fig. 11 Result size and precision on the hotel data:

Name threshold   G-Swoosh   R-Swoosh   Precision (%)
0.99             14095      14095      100.0
0.95             13843      13843      100.0
0.90             13385      13385      99.996
0.85             12579      12580      99.984
0.80             11316      11324      99.890
0.75             10327      10363      99.389
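The precision metric is straightforward to compute once each resolved record is identified by the set of ids of its base records; the following is a minimal sketch (our own code, not the paper's).

import java.util.*;

// Minimal sketch (our illustration) of the precision of an R-Swoosh answer
// with respect to a G-Swoosh answer, with records represented by their sets
// of base-record ids.
class PrecisionMetric {
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    // Sum over R-Swoosh records of the best Jaccard coefficient against any
    // G-Swoosh record, divided by the R-Swoosh answer size. On the example
    // above (rSwoosh = {{r2,r3},{r1}}, gSwoosh = {{r1,r2},{r2,r3}}) this
    // returns 0.75.
    static double precision(List<Set<String>> rSwoosh, List<Set<String>> gSwoosh) {
        double sum = 0.0;
        for (Set<String> r : rSwoosh) {
            double best = 0.0;
            for (Set<String> g : gSwoosh)
                best = Math.max(best, jaccard(r, g));
            sum += best;
        }
        return sum / rSwoosh.size();
    }
}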
We have also done similar experiments on the hotel data using a variant of MM_trav for our strategy. When merging names and street addresses, we choose the longer strings. If two strings have the same length, we choose the string that lexicographically precedes the other. For numerical attributes such as latitude and longitude, we choose the numerically larger values. We used the same match function as that of MM_trav. Finally, we used the same domination order as the one used in our MM_shop variant.
Figure 11 shows the result size comparison between G-Swoosh and R-Swoosh tested on the block of 14,574 US hotel records, varying the name threshold of F1_tr. We show the result sizes in a table because the results of G-Swoosh and R-Swoosh are almost identical for all the thresholds we tested, and the sizes are hard to distinguish in a graph presentation. Indeed, even in the worst case, the sizes differ only by 0.35%.
Figure 11 also shows the precision of R-Swoosh using the Jaccard similarity coefficient. The average precision is 99.88%, making the R-Swoosh result almost the same as that of G-Swoosh.

The experiments show that R-Swoosh and F-Swoosh are indeed reasonable ways to do ER even if the match and merge functions do not satisfy all the ICAR properties. They solve
the scalability problem of G-Swoosh while producing most
of the records G-Swoosh does.
6 Related work
Originally introduced by Newcombe et al. [29] as record linkage, entity resolution was then studied under various names, such as merge/purge [19], deduplication [31], reference reconciliation [14], object identification [36], and others.
Several works have addressed the issue of performance for ER algorithms. However, most of them make strong assumptions on the data and/or the comparison functions to make their algorithms efficient. For example, references [19,20] assume that records can be represented by one or multiple alphanumeric keys, and that most matches occur between records whose keys are lexicographically close. A blocking key can be used to split records into buckets [22] or canopies [25]. Reference [23] proposed mapping the records' values into a multi-dimensional Euclidean space, then performing a similarity join. An overview of such blocking methods can be found in [4]. Since they do not compare all records, such techniques make ER algorithms approximate, to an extent that depends on properties of the data. More recently, reference [2] proposed efficient algorithms for set similarity joins using string similarity functions. In contrast, we view the match and merge functions as black-boxes and provide exact ER algorithms that are optimal in the number of black-box invocations.
Reference [27] proposes a generic and exact approach, where the ER problem is viewed as an instance of the classical set-union problem [35], for which efficient data structures and algorithms were extensively studied. However, their work requires the record comparison function to be transitive, a property we believe is constraining and difficult to satisfy.
Iterative approaches [8,14] identified the need to transitively compare merged records to discover more matches, for merges that are simple groupings of the data in merged records. Our approach allows richer, custom merges. More importantly, it eliminates redundant comparisons by tightly integrating merges and comparisons, and naturally yields incremental algorithms.
Our match functions are specified as logical formulas of smaller feature-level match functions, in the style of the equational theory of [19], and similar in spirit to works on declarative data cleaning (e.g., [16]). Such a specification has the benefits of (1) having clear semantics, and (2) allowing the kinds of optimizations we perform.
While our work focuses on performance, there has also been a significant amount of work on enhancing the precision and recall of the ER process. The first formalization, by Fellegi and Sunter [15], optimizes the relative importance of numerical similarity functions between records, in a probabilistic setting. In this paper and most follow-ups (see [18,38] for recent surveys), the assessment of ER is in terms of precision and recall of the obtained classification. Many string comparison techniques based on edit-distances [34], TF-IDF [13], or adaptive techniques such as q-grams [11,17] are used for matching records. Reference [28] removes attribute-level conflicts of matching records by comparing the quality of their data sources. Reference [32] provides user-defined grouping as part of an SQL extension. As domain-independent techniques may not be suitable for some domains, one may need domain-specific value comparison functions [1]. Any of these techniques can fill in the black-boxes, which we decouple from our match and merge process.
Finally, there has also been a great amount of research on non-pairwise ER, including clustering techniques [3,12,27], classifiers such as Bayesian networks [37], decision trees, SVMs, or conditional random fields [33]. The parameters of these models are learned either from a (hopefully representative) set of labeled examples, possibly with the help of a user [31], or in an unsupervised way [12,39]. A recent line of work focuses on the relationships among records [5,14,24,30]. Reference [9] proposed a technique to resolve entities collectively based on the relationship graph among records. Such techniques are not pairwise because they generally examine all or part of the dataset to learn match decisions. In contrast, our focus is on pairwise ER because of its practical value, such as easier coding and efficiency. As we mentioned in the introduction, however, we can use non-pairwise techniques during a prior training phase.
7 Conclusion
Entity resolution (ER) is an important information integration problem arising when data from diverse sources is combined. We have divided the problem into two aspects: the black-box functions that match and merge records, and the ER algorithm that invokes these functions. In our opinion, this division has two important advantages: (1) it yields generic ER algorithms that can be used, with well-defined semantics, in many applications, and (2) it lets us focus on our performance measure, the number of black-box invocations. While there may be other factors that affect the overall runtime performance, we assume that the black-box invocations are potentially expensive and thus are the critical runtime factors. We have presented three algorithms, G-Swoosh, R-Swoosh, and F-Swoosh, that make as few calls as possible, thus yielding significant performance improvements over naive algorithms. We have also presented four important yet natural ICAR properties for match and merge functions, that lead to the significantly more efficient R-Swoosh and F-Swoosh. We have argued that these properties should guide the development of merge and match functions. If the properties do not hold because of application semantics, the designers will know that ER will be inherently expensive.
Acknowledgments The authors acknowledge the useful comments
made by Jeff Jonas and Tanton Gibbs.
A Appendix: Proofs
A.1 Basic model
Proposition 2.1 For any instance I, the entity resolution of I exists and is unique. We denote it ER(I).

Proof Existence: starting from the merge closure Ī, we can remove dominated records one at a time, until no more dominated records exist. The obtained set satisfies the conditions for being an ER(I) solution.

Unicity: suppose there are two ER(I) solutions, I′1 and I′2. Say some record r1 is in I′1 but not in I′2. There exists r2 in I′2 s.t. r1 ≤ r2, and r3 in I′1 s.t. r2 ≤ r3. Hence r1 ≤ r3, and they are distinct since r1 ≠ r2. I′1 ∖ {r1}, a strict subset of I′1, satisfies conditions 1 and 2, a contradiction. □
A.2 Match and merge properties
Proposition 2.2 Merge domination is a partial order on records.

Proof We show that merge domination is reflexive, transitive, and anti-symmetric:

Reflexivity: r ≤ r follows from idempotence.

Transitivity: Suppose r1 ≤ r2 and r2 ≤ r3. In particular, ⟨r2, r3⟩ = r3 and r1 ≈ r2 hold, which implies, by representativity, that r1 ≈ r3. This ensures the existence of ⟨r1, r3⟩. ⟨r1, r3⟩ = ⟨r1, ⟨r2, r3⟩⟩ also equals, by associativity, ⟨⟨r1, r2⟩, r3⟩ = ⟨r2, r3⟩ = r3. Therefore r1 ≤ r3 holds.

Anti-symmetry: Suppose r1 ≤ r2 and r2 ≤ r1 hold. This means that ⟨r1, r2⟩ = r2 and ⟨r2, r1⟩ = r1. Commutativity ensures that the two left terms are equal, and therefore that r1 = r2. □
Proposition 2.3 Given match and merge functions such that the match function is reflexive and commutative, if a domination order ⊑ exists such that the four monotonicity conditions [that follow Proposition 2.2] are satisfied with ≤ replaced by ⊑, then the ICAR properties of Sect. 2.2 are also satisfied, and ⊑ coincides with the merge domination order ≤.

Proof We first prove that all four properties of Sect. 2.2 hold, and then that the orders coincide.

Idempotence: The match function is reflexive, i.e., r ≈ r always holds. Since r ⊑ r, by the first monotonicity property, r ⊑ ⟨r, r⟩, and by the fourth ⟨r, r⟩ ⊑ r, and hence r = ⟨r, r⟩ follows.

Commutativity: Suppose r1 ≈ r2. The match function is commutative so r2 ≈ r1. Hence both ⟨r1, r2⟩ and ⟨r2, r1⟩ are defined. By applying the fourth monotonicity property, it follows that ⟨r1, r2⟩ ⊑ ⟨r2, r1⟩ and symmetrically that ⟨r2, r1⟩ ⊑ ⟨r1, r2⟩, hence ⟨r1, r2⟩ = ⟨r2, r1⟩.

Representativity: Suppose ⟨r1, r2⟩ = r3 and r1 ≈ r4. Since r1 ⊑ r3, it follows from the second monotonicity property that r3 ≈ r4.

Associativity: Suppose ⟨⟨r1, r2⟩, r3⟩ and ⟨r2, r3⟩ are defined. Since r3 ⊑ ⟨r2, r3⟩, we have that ⟨⟨r1, r2⟩, r3⟩ ⊑ ⟨⟨r1, r2⟩, ⟨r2, r3⟩⟩ by the third monotonicity property. By the same argument, ⟨r2, r3⟩ ⊑ ⟨⟨r1, r2⟩, r3⟩. Therefore, ⟨⟨r1, r2⟩, ⟨r2, r3⟩⟩ ⊑ ⟨⟨r1, r2⟩, r3⟩, and since ⊑ is anti-symmetric, ⟨⟨r1, r2⟩, ⟨r2, r3⟩⟩ is equal to ⟨⟨r1, r2⟩, r3⟩. By a symmetrical argument, we obtain that ⟨⟨r1, r2⟩, ⟨r2, r3⟩⟩ equals ⟨r1, ⟨r2, r3⟩⟩.

We now show that r ≤ s iff r ⊑ s. Recall that r ≤ s means that ⟨r, s⟩ = s. The forward implication directly follows from the property r ⊑ ⟨r, s⟩. For the backward implication, suppose r ⊑ s. Since r ≈ r, it follows that r ≈ s, and that ⟨r, s⟩ ⊑ ⟨s, s⟩, i.e., that ⟨r, s⟩ ⊑ s. We also know that s ⊑ ⟨r, s⟩, and since ⊑ is anti-symmetric, we have that ⟨r, s⟩ = s, which is the definition of r ≤ s. □
A.3 ER with ICAR properties
In order to prove the results of Theorem 2.1, we first need to define some terminology: the notions of a derivation tree, and the base and depth of a record.

Definition A.1 Given an instance I, for any record r ∈ Ī, a derivation tree d_r of r represents a hierarchy of records in Ī ∖ {r} that were merged to produce r. The base of r for derivation d_r, denoted B(r, d_r), is the set of I records at the leaves of d_r. The depth of r for derivation d_r, denoted D(r, d_r), is the size of the longest path in d_r. The depth of r, denoted D(r), is its smallest depth across derivation trees of r in Ī.
We can now show an important intermediary result:
Proposition A.1 For any two records r, s, if B(r, d_r) ⊆ B(s, d_s) for some derivation trees d_r, d_s of r, s respectively, then r ≤ s.

Proof The proof is by induction on the depth of the record with the smallest base.

Induction hypothesis: Given two records r, s such that B(r, d_r) ⊆ B(s, d_s) (for some d_r, d_s) and an integer n, 0 ≤ n: if D(r, d_r) ≤ n then r ≤ s.

Base: n = 0. In this case, r is a base record that belongs to the derivation tree of s. Following the path from s to r in this derivation tree, each record is dominated by its parent node and therefore, by transitivity of ≤, we have that r ≤ s.

Induction step: We now show that if the hypothesis holds for n = k, then it also holds for n = k + 1. If D(r, d_r) ≤ k, we can directly apply the induction hypothesis. The case to deal with is D(r, d_r) = k + 1. Consider the children of r in its d_r derivation tree: r = ⟨r1, r2⟩. Clearly, D(r1, d_r1) ≤ k and D(r2, d_r2) ≤ k (with d_r1, d_r2 being the derivation trees of r1, r2 in d_r). Since B(r1, d_r1) ⊆ B(s, d_s) and B(r2, d_r2) ⊆ B(s, d_s), we can apply the induction hypothesis to r1 and r2. It follows that r1 ≤ s and r2 ≤ s. By the monotonicity properties of ≤, we can inject r2 to obtain that ⟨r1, r2⟩ ≤ ⟨s, r2⟩. Similarly, by injecting s we obtain that ⟨r2, s⟩ ≤ ⟨s, s⟩. By merge commutativity and idempotence, and since ≤ is transitive, it follows that r ≤ s. □
The finiteness of ER when the properties hold is a direct corollary of the above result:

Corollary A.1 When the match and merge properties are satisfied, for any instance I, Ī (hence ER(I)) is finite.

Proof A direct consequence of the previous proposition is that any two records with the same base are equal. The possible bases for records derivable from I are the elements of the powerset of I, a finite set. Therefore Ī is finite. Since ER(I) ⊆ Ī, ER(I) is finite as well. □
To prove the second part of Theorem 2.1, we first need to show a confluence result on derivation sequences:

Proposition A.2 (Confluence) Given an instance I and a derivation sequence I →* I′, for any other derivation sequence I →* I″, there exists a continuation I″ →* I‴ such that I′ ≤ I‴.

Proof (sketch) The key observation is that derivation steps are monotonic, i.e., if I → I′ then I ≤ I′: in the case of a merge step, clearly I ⊆ I′, and in the case of a purge step, the only record in I which is not in I′ is dominated by another record in I (and therefore in I′). Now, consider I′ and I″. The only records of I′ that may not be dominated by records in I″ are those that are not in the initial instance I, and therefore were produced through merge steps. Since I ≤ I″, ∀ri ∈ I, ∃r″i ∈ I″ s.t. ri ≤ r″i. The idea is to continue I″ by applying the same merge steps as those that led to I′, but using instead of each ri an r″i ∈ I″ that dominates it (skipping the steps that do not modify the current instance). The obtained records, because merges are also monotonic, dominate the corresponding records in I′, hence I′ ≤ I‴. □
We can now prove the second part of Theorem 2.1:

Proposition A.3 Every derivation sequence is finite. Every maximal derivation sequence starting from an instance I computes ER(I).

Proof For finiteness, observe that for each derivation step I → I′, I ≠ I′. Since derivation sequences are monotonic (w.r.t. instance domination), all instances involved in a derivation sequence are distinct. Moreover, all these instances are subsets of Ī. There is a finite number of them, since we showed Ī to be finite. Therefore every derivation sequence is finite.

We now construct a maximal derivation sequence that computes ER(I): starting from I, perform all necessary merge steps to produce the records in ER(I). This is possible since all the needed records are in Ī. Then, perform purge steps to remove all records which are not in ER(I). Each step is a valid purge step, since Ī ≤ ER(I). No additional purge steps are possible, since ER(I) does not contain dominated records, and no additional merge steps are possible, since Ī ≤ ER(I). Therefore the derivation is maximal.

To conclude the proof, we show that all maximal derivations of an instance I compute the same result. By contradiction, suppose an instance I has two maximal derivations I →* I1 and I →* I2. Then, by Proposition A.2, there exists I3 s.t. I1 →* I3 and I2 ≤ I3. Since I1 is maximal, it has no derivation but itself, and therefore we have that I3 = I1, hence I2 ≤ I1. Symmetrically, we obtain that I1 ≤ I2. Now, if I1 ≠ I2, some record in one of the instances (say, r1 ∈ I1) is not present in the other instance (here, I2). Since I1 ≤ I2, r1 is dominated by some record r2 ∈ I2, and we know that r2 ≠ r1. Similarly, since I2 ≤ I1, r2 is dominated by some record r3 in I1, and by transitivity r1 ≤ r3 (with r1 ≠ r3). Removing r1 from I1 would therefore be a valid purge step, which contradicts the fact that I1 is a maximal derivation. □
Proposition 2.4 Given match and merge functions such that the match function is reflexive and commutative, if the match and merge functions are in the Union Class, the ICAR properties are satisfied.

Proof Since the match and merge functions are in the Union Class, each record can be represented as a set of base records (records with single values), i.e., r = {b1, b2, ..., bn}. The merge of two records is ⟨r1, r2⟩ = r1 ∪ r2 (i.e., the set union of the records). Given a match function BM for base records, the match of two records is M(r1, r2) = true iff ∃b1 ∈ r1, ∃b2 ∈ r2 s.t. BM(b1, b2) = true.

We now prove that all four properties of Sect. 2.2 hold.

Idempotence: The match function is reflexive, i.e., r ≈ r always holds. Since set union is idempotent, the merge function guarantees r = ⟨r, r⟩.

Commutativity: Suppose r1 ≈ r2. The match function is commutative, so r2 ≈ r1. Hence, both ⟨r1, r2⟩ and ⟨r2, r1⟩ are defined. Since set union is commutative, the merge function guarantees ⟨r1, r2⟩ = ⟨r2, r1⟩.

Associativity: Suppose that for r1, r2, r3, there exist ⟨r1, ⟨r2, r3⟩⟩ and ⟨⟨r1, r2⟩, r3⟩. Since set union is associative, the merge function guarantees ⟨r1, ⟨r2, r3⟩⟩ = ⟨⟨r1, r2⟩, r3⟩.

Representativity: Let r3 = ⟨r1, r2⟩. Suppose we have r4 such that r1 ≈ r4. Since we use an existential match, there exists a pair of values from r1 and r4 that match. Since the merge function guarantees r3 to have all the possible values of r1, there also exists a pair of values from r3 and r4 that match. Hence, according to the match function, r3 ≈ r4. □
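The Union Class maps directly onto code; the following is a minimal sketch (our illustration) with records as sets of base values, merge as set union, and the existential match over base values.

import java.util.*;
import java.util.function.BiPredicate;

// Minimal sketch (our illustration) of a Union Class record.
class UnionRecord {
    final Set<String> baseValues;

    UnionRecord(Set<String> baseValues) { this.baseValues = baseValues; }

    // Merge: the union of the two sets of base values.
    UnionRecord merge(UnionRecord other) {
        Set<String> union = new HashSet<>(baseValues);
        union.addAll(other.baseValues);
        return new UnionRecord(union);
    }

    // Match: true iff some pair of base values matches under BM.
    boolean matches(UnionRecord other, BiPredicate<String, String> bm) {
        for (String b1 : baseValues)
            for (String b2 : other.baseValues)
                if (bm.test(b1, b2)) return true;
        return false;
    }
}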
B Appendix: Record-level algorithms
Proposition 3.1 For any instance I such that Ī is finite, BFA terminates and correctly computes ER(I).

Proof (sketch) BFA essentially computes Ī by recursively adding to the set I merges of all matching records, until a fixpoint is reached, then removes dominated records. The set I increases by at least one record at each iteration, therefore BFA terminates if Ī is finite, and returns ER(I). □
Proposition 3.2 For any instance I such that Ī is finite, G-Swoosh terminates and computes ER(I).

Proof Similarly to BFA, G-Swoosh also first computes Ī, then removes dominated records (line 17). Observe that all records added to I′ are either records initially in I or records derived from them by successive merges, therefore I′ ⊆ Ī. I′ increases by at least one record at each iteration, therefore G-Swoosh necessarily terminates if Ī is finite.

To prove that G-Swoosh computes ER(I), all we need to show is that Ī ⊆ I′ before dominated records are removed. We prove this inclusion by recursion on the depth of records in Ī. (Recall that the depth of a record was introduced in Definition A.1.) More precisely, our induction hypothesis is that all records of Ī of depth less than or equal to k (k ≥ 0) appear in the working set I at some point in the algorithm. Since all records that appear in I are added to I′ later on, this will establish that Ī ⊆ I′ before dominated records are removed.

Records of depth 0 are the records of the initial dataset, which are present in I at the start of the algorithm. Suppose the hypothesis holds for all records of Ī of depth less than or equal to k (k ≥ 0), and let us show that the hypothesis is also verified for any record r ∈ Ī of depth k + 1. Since r ∈ Ī, r is derived either (1) from merging a record r′ of depth k with itself, or (2) from merging two records r1 and r2 of depth less than or equal to k. For case (1), by our induction hypothesis, r′ appears in I, therefore when r′ is processed (line 7), r′ matches itself and r = ⟨r′, r′⟩ is added to I (line 11). For case (2), both r1 and r2 are of depth ≤ k, and therefore appear in I. G-Swoosh picks these two records (line 5) in some order. W.l.o.g., say r1 is picked before r2. When r1 is processed, it is added to I′ (line 15), and when r2's turn comes, a match is found and r = ⟨r1, r2⟩ is added to I (line 11). All records in I are added to I′, therefore Ī ⊆ I′ when the algorithm reaches line 17. At that point, dominated records are removed, hence G-Swoosh computes ER(I). □
Theorem 3.1 G-Swoosh is optimal, in the sense that no algorithm that computes ER(I) makes fewer comparisons in the worst case.

Proof It is fairly immediate to see that for any pair of records r1, r2 (resp. for any single record r) in Ī, G-Swoosh checks whether r1 ≈ r2 (resp. whether r ≈ r) exactly once. Suppose there exists an algorithm A that generates ER(I) but performs fewer comparisons than G-Swoosh. Then for some run of A, there must exist two records r1, r2 in Ī such that A does not check whether r1 ≈ r2. (The case of a single record r can be represented by taking r1 = r2 = r.) Now, we construct new match and merge functions. Functions M′ and ⟨·,·⟩′ are the same as the original functions M and ⟨·,·⟩, unless the two records are r1 and r2. In this case, M′(r1, r2) returns true and ⟨r1, r2⟩′ returns a new record r3 that is not dominated, and is not in ER(I) using the original match and merge functions.

Using M′ and ⟨·,·⟩′, r3 ∈ ER(I). But the run of algorithm A never checks whether r1 ≈ r2, so it cannot merge them to obtain r3. Therefore, algorithm A does not generate ER(I). This is a contradiction, so no algorithm that generates ER(I) can perform fewer comparisons than G-Swoosh. □
Proposition 3.3 Given an instance I, the R-Swoosh algorithm computes ER(I).

Proof Consider the instance J = I ∪ I′ ∪ {currentRecord}. What the algorithm computes is indeed a derivation sequence of J. Whenever two records r and r′ match (line 9), a merge step is performed by adding r″ = ⟨r, r′⟩ to I (and hence to J) (line 19). Immediately after that, two purge steps are performed: record r (removed from I at line 6) is not added to I′, while record r′ is removed from I′ (line 19). Therefore, r and r′ are removed from J. These two purge steps are valid, since both r and r′ are dominated by r″. When a record r from I has no match in I′, it is added to I′, leaving J unchanged.

Moreover, observe that the following is an invariant for I′: for any two records r, r′ ∈ I′, r ≉ r′. This is because a record is added to I′ iff it does not match any record already there. The algorithm stops when I is empty. Since no two records in I′ match, the derivation sequence is maximal. □
Proposition 3.4 For any instance I of n records such that entity resolution yields j records, R-Swoosh performs at most (n−1)^2 − (j−1)(j−2)/2 record comparisons. There exists an instance (with n records, yielding j records) on which any algorithm performs at least as many record comparisons.

Proof We first count the maximal number of record comparisons performed by R-Swoosh. For each record removed from I, at most |I′| comparisons are conducted. When no merges occur, at each iteration |I| decreases by 1, while |I′| increases by 1. If the original size of I is n, this gives us at most n(n−1)/2 comparisons. Whenever a merge occurs, |I′| is decreased by 1, while |I| is not decreased. Therefore one extra round of comparisons is added, but the number of records to compare to is decreased. For the first merge, the number of added comparisons is at most the maximal size of I′ minus 1, i.e., (n−2), and for k merges, we obtain that the maximal number of comparisons performed by R-Swoosh is n(n−1)/2 + (n−2) + ··· + (n−k−1). For R-Swoosh, it holds that j = n − k. Therefore, at most (n−1)^2 − (j−1)(j−2)/2 comparisons are performed.

We now prove that any exhaustive algorithm will do the same number of comparisons in the worst case. We consider a dataset consisting of n distinct records, and construct adversarial match and merge functions (satisfying the four properties of Sect. 2.2) that behave as follows: the match function returns true only after it was asked to compare all pairwise combinations of the original records, thus effectively forcing the algorithm to perform n(n−1)/2 comparisons. For these two matching records, the merge function creates a new record, and the match function waits for this record to be compared against all the original records (except the two that were merged) before declaring a match, thus forcing the algorithm to perform n−2 additional record comparisons. The merge function will again create a new record for the new match that was found. At the next step, n−3 comparisons are incurred, and so on. After k merges, j = n−k records remain, and the algorithm was forced to perform at least (n−1)^2 − (j−1)(j−2)/2 record comparisons. □
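As a sanity check, the bound follows from the counting in the proof; spelling out the arithmetic (our derivation, with j = n − k):

\[
\frac{n(n-1)}{2} + \sum_{i=2}^{k+1} (n-i)
= \frac{n(n-1)}{2} + \frac{(n-1)(n-2)}{2} - \frac{(j-1)(j-2)}{2}
= (n-1)^2 - \frac{(j-1)(j-2)}{2}.
\]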
C Appendix: Feature-level algorithms
Proposition 4.1 Given an instance I, the F-Swoosh algorithm computes the maximal derivation of I, and therefore solves the ER problem.

Proof (sketch) The algorithm essentially computes a derivation of the instance J = I ∪ I′ ∪ {currentRecord}. It is easy to see that the derivation is correct: records are merged only if one of their feature values for one feature match. The tricky part is to show that the derivation is maximal, because the feature values of currentRecord which already appear in some N_{f_i} are not compared to the values of the records in I′. However, as we mentioned earlier, the feature values in N_{f_i} do not match each other (except when they are from the same record), and therefore if one of them appears in the current record, it cannot match another feature value of a different record in N_{f_i}. Since the feature values of the records in I′ are a subset of the values in N_{f_i}, it follows that no match can be missed. Just like in R-Swoosh, records in I′ are non-dominated, and the set I ∪ {currentRecord} is empty at the end of the algorithm, therefore the derivation is maximal. □
References
1. Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Proceedings of VLDB, pp. 586–597 (2002)
2. Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB, pp. 918–929 (2006)
3. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. In: FOCS, p. 238 (2002)
4. Baxter, R., Christen, P., Churches, T.: A comparison of fast blocking methods for record linkage. In: Proceedings of ACM SIGKDD'03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003). http://citeseer.ist.psu.edu/article/baxter03comparison.html
5. Bekkerman, R., McCallum, A.: Disambiguating web appearances of people in a social network. In: WWW, pp. 463–470 (2005)
6. Benjelloun, O., Garcia-Molina, H., Jonas, J., Menestrina, D., Whang, S., Su, Q., Widom, J.: Swoosh: a generic approach to entity resolution. Technical Report, Stanford University (2006). http://dbpubs.stanford.edu/pub/2005-5
7. Benjelloun, O., Garcia-Molina, H., Kawai, H., Larson, T.E., Menestrina, D., Thavisomboon, S.: D-Swoosh: a family of algorithms for generic, distributed entity resolution. In: ICDCS (2007)
8. Bhattacharya, I., Getoor, L.: Iterative record linkage for cleaning and integration. In: Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (2004)
9. Bhattacharya, I., Getoor, L.: A latent Dirichlet model for unsupervised entity resolution. In: Sixth SIAM Conference on Data Mining (2006)
10. Blume, M.: Automatic entity disambiguation: benefits to NER, relation extraction, link analysis, and inference. In: International Conference on Intelligence Analysis (2005). https://analysis.mitre.org/
11. Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: Proceedings of ACM SIGMOD, pp. 313–324 (2003)
12. Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: Proceedings of ICDE, Tokyo, Japan (2005)
13. Cohen, W.: Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. 18, 288–321 (2000)
14. Dong, X., Halevy, A.Y., Madhavan, J.: Reference reconciliation in complex information spaces. In: Proceedings of ACM SIGMOD (2005)
15. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
16. Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.A.: Declarative data cleaning: language, model, and algorithms. In: Proceedings of VLDB, pp. 371–380 (2001)
17. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)
18. Gu, L., Baxter, R., Vickers, D., Rainsford, C.: Record linkage: current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences (2003)
19. Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Proceedings of ACM SIGMOD, pp. 127–138 (1995)
20. Hernández, M.A., Stolfo, S.J.: Real-world data is dirty: data cleansing and the merge/purge problem. Data Min. Knowl. Discov. 2(1), 9–37 (1998)
21. IBM: DB2 Entity Analytic Solutions. http://www-306.ibm.com/software/data/db2/eas/
22. Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)
23. Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Proceedings of International Conference on Database Systems for Advanced Applications, p. 137 (2003)
24. Kalashnikov, D.V., Mehrotra, S., Chen, Z.: Exploiting relationships for domain-independent data cleaning. In: Proceedings of the SIAM International Conference on Data Mining, Newport Beach, CA (2005)
25. McCallum, A.K., Nigam, K., Ungar, L.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of KDD, pp. 169–178, Boston, MA (2000)
26. Menestrina, D., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with data confidences. In: CleanDB (2006)
27. Monge, A.E., Elkan, C.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 23–29 (1997)
28. Motro, A., Anokhin, P.: Fusionplex: resolution of data inconsistencies in the integration of heterogeneous information sources. Inf. Fusion 7(2), 176–196 (2006)
29. Newcombe, H.B., Kennedy, J.M., Axford, S.J., James, A.P.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)
30. Parag, D.P.: Multi-relational record linkage. In: Proceedings of the KDD-2004 Workshop on Multi-Relational Data Mining, pp. 31–48 (2004)
31. Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of ACM SIGKDD, Edmonton, Alberta (2002)
32. Schallehn, E., Sattler, K.U., Saake, G.: Extensible and similarity-based grouping for data integration. In: ICDE, p. 277 (2002)
33. Singla, P., Domingos, P.: Object identification with attribute-mediated dependences. In: Proceedings of PKDD, pp. 297–308 (2005)
34. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)
35. Tarjan, R.E.: Efficiency of a good but not linear set union algorithm. J. ACM 22(2), 215–225 (1975)
36. Tejada, S., Knoblock, C.A., Minton, S.: Learning object identification rules for information integration. Inf. Syst. J. 26(8), 635–656 (2001)
37. Verykios, V.S., Moustakides, G.V., Elfeky, M.G.: A Bayesian decision model for cost optimal record matching. VLDB J. 12(1), 28–40 (2003). http://www.cs.purdue.edu/homes/mgelfeky/Papers/vldbj12(1).pdf
38. Winkler, W.: Overview of record linkage and current research directions. Technical Report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC (2006)
39. Winkler, W.E.: Using the EM algorithm for weight computation in the Fellegi–Sunter model of record linkage. In: American Statistical Association, Proceedings of the Section on Survey Research Methods, pp. 667–671 (1988)