Entity Resolution: Tutorial
Entity Resolution: Tutorial
Entity Resolution: Tutorial
EntityResolution:Tutorial
Li Getoor
Lise
G t
Ashwin Machanavajjhala
UniversityofMaryland
College Park MD
CollegePark,MD
DukeUniversity
Durham,NC
,
http://www.cs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
http://goo.gl/f5eym
WhatisEntityResolution?
Problemofidentifyingandlinking/groupingdifferent
manifestationsofthesamerealworldobject.
f
f
j
Examplesofmanifestationsandobjects:
p
j
Differentwaysofaddressing(names,emailaddresses,FaceBook
accounts)thesamepersonintext.
Webpageswithdifferingdescriptionsofthesamebusiness.
Differentphotosofthesameobject.
Ironically,EntityResolutionhasmanyduplicatenames
Recordlinkage
Coreference resolution
Duplicatedetection
Referencereconciliation
Fuzzymatch
y
Object consolidation
Objectconsolidation
Objectidentification
Deduplication
Approximatematch
Entity clustering
Entityclustering
Identityuncertainty
Hardeningsoftdatabases
Doubles
Merge/purge
Household matching
Householdmatching
Householdingg
Referencematchingg
ERMotivatingExamples
LinkingCensusRecords
Public Health
PublicHealth
Websearch
Comparison shopping
Comparisonshopping
Counterterrorism
Spam detection
Spamdetection
MachineReading
ERandNetworkAnalysis
before
after
Motivation:NetworkScience
Measuringthetopologyoftheinternetusing
traceroute
IPAliasingProblem[Willinger etal.2009]
IPAliasingProblem[Willinger etal.2009]
TraditionalChallengesinER
Name/Attributeambiguity
Thomas Cruise
ThomasCruise
MichaelJordan
TraditionalChallengesinER
Name/Attributeambiguity
Errors due to data entry
Errorsduetodataentry
TraditionalChallengesinER
Name/Attributeambiguity
Errors due to data entry
Errorsduetodataentry
MissingValues
[Gilletal;Univ ofOxford2003]
TraditionalChallengesinER
Name/Attributeambiguity
Errors due to data entry
Errorsduetodataentry
MissingValues
Changing Attributes
ChangingAttributes
Dataformatting
Abbreviations/DataTruncation
/
BigDataERChallenges
BigDataERChallenges
LargerandmoreDatasets
Needefficientparalleltechniques
MoreHeterogeneity
M
H t
it
Unstructured,UncleanandIncompletedata.Diversedatatypes.
Nolongerjustmatchingnameswithnames,butAmazonprofileswith
No longer just matching names with names, but Amazon profiles with
browsinghistoryonGoogleandfriendsnetworkinFacebook.
BigDataERChallenges
LargerandmoreDatasets
Needefficientparalleltechniques
MoreHeterogeneity
M
H t
it
Unstructured,UncleanandIncompletedata.Diversedatatypes.
Morelinked
More linked
Needtoinferrelationshipsinadditiontoequality
MultiRelational
Dealwithstructureofentities(AreWalmart andWalmart Pharmacy
thesame?)
Multidomain
l d
Customizablemethodsthatspanacrossdomains
Multipleapplications
Multiple applications (websearchversuscomparisonshopping)
(web search versus comparison shopping)
Servediverseapplicationwithdifferentaccuracyrequirements
Outline
1.
2
2.
3.
4
4.
AbstractProblemStatement
Algorithmic Foundations of ER
AlgorithmicFoundationsofER
ScalingERtoBigData
Challenges & Future Directions
Challenges&FutureDirections
Outline
1. AbstractProblemStatement
2 AlgorithmicFoundationsofER
2.
Algorithmic Foundations of ER
a)
b)
c)
d)
DataPreparationandMatchFeatures
Pairwise ER
ConstraintsinER
Algorithms
RecordLinkage
Record
Linkage
Deduplication
CollectiveER
3. ScalingERtoBigData
4. Challenges&FutureDirections
5minutebreak
Outline
1. AbstractProblemStatement
2 AlgorithmicFoundationsofER
2.
Algorithmic Foundations of ER
3. ScalingERtoBigData
a) Blocking/CanopyGeneration
Blocking/Canopy Generation
b) DistributedER
4. Challenges&FutureDirections
Challenges & Future Directions
Outline
1.
2
2.
3.
4
4.
AbstractProblemStatement
Algorithmic Foundations of ER
AlgorithmicFoundationsofER
ScalingERtoBigData
Challenges & Future Directions
Challenges&FutureDirections
ScopeoftheTutorial
Whatwecover:
FundamentalalgorithmicconceptsinER
ScalingERtobigdatasets
TaxonomyofcurrentERalgorithms
Whatwedonotcover:
Schema/ontologyresolution
Schema/ontology
resolution
Datafusion/integration/exchange/cleaning
Entity/InformationExtraction
PrivacyaspectsofEntityResolution
Detailsonsimilaritymeasures
Technical details and proofs
Technicaldetailsandproofs
ERReferences
Book/SurveyArticles
DataQualityandRecordLinkageTechniques
[T Herzog F Scheuren W Winkler Springer 07]
[T.Herzog,F.Scheuren,W.Winkler,Springer,
07]
DuplicateRecordDetection[A.Elmagrid,P.Ipeirotis,V.Verykios,TKDE07]
AnIntroductiontoDuplicateDetection[F.Naumann,M.Herschel,M&P
y
]
synthesislectures2010]
EvaluationofEntityResolutionApproachedonRealworldMatchProblems
[H.Kopke,A.Thor,E.Rahm,PVLDB2010]
DataMatching[P.Christen,Springer2012]
Tutorials
RecordLinkage:SimilaritymeasuresandAlgorithms
[N.Koudas,S.Sarawagi,D.Srivatsava SIGMOD06]
DatafusionResolvingdataconflictsforintegration
[X.Dong,F.Naumann VLDB09]
EntityResolution:Theory,PracticeandOpenChallenges
Entity Resolution: Theory Practice and Open Challenges
http://goo.gl/Ui38o [L.Getoor,A.Machanavajjhala AAAI12]
PART 1
PART1
ABSTRACT PROBLEM STATEMENT
ABSTRACTPROBLEMSTATEMENT
AbstractProblemStatement
RealWorld
DigitalWorld
Records/
/
Mentions
Deduplication ProblemStatement
Clustertherecords/mentionsthatcorrespondtosame
entity
y
Deduplication ProblemStatement
Clustertherecords/mentionsthatcorrespondtosame
entity
y
Intensional Variant:Computeclusterrepresentative
RecordLinkageProblemStatement
Linkrecordsthatmatchacrossdatabases
A
ReferenceMatchingProblem
Matchnoisyrecordstocleanrecordsinareferencetable
Reference
Table
bl
AbstractProblemStatement
RealWorld
DigitalWorld
Deduplication ProblemStatement
Deduplication withCanonicalization
GraphAlignment(&motifsearch)
Graph1
Graph2
Relationshipsarecrucial
Relationshipsarecrucial
Notation
R:setofrecords/mentions(typed)
H: set of relations / hyperedges (typed)
H:setofrelations/hyperedges
M:setofmatches(recordpairsthatcorrespondtosameentity)
N: set of non matches (recordpairscorrespondingtodifferententities)
N:setofnonmatches
(
d i
di t diff
t titi )
E:setofentities
L set of links
L:setoflinks
TTrue(M
(Mtrue,N
Ntrue,EEtrue,LLtrue):accordingtorealworld
)
di
l
ld
vs Predicted(Mpred,Npred,Epred,Lpred ):byalgorithm
RelationshipbetweenMtrue andMpred
Mtrue (SameAs ,Equivalence)
(Similar representations and similar attributes)
Mpred (Similarrepresentationsandsimilarattributes)
Mtrue
RxR
Mpred
Metrics
Pairwise metrics
Precision/Recall,F1
#ofpredictedmatchingpairs
Clusterlevelmetrics
purity,completeness,complexity
Precision/Recall/F1:Clusterlevel,closestcluster,MUC,B
Precision/Recall/F1: Cluster level closest cluster MUC B3 ,
RandIndex
Generalizedmergedistance[Menestrina etal,PVLDB10]
Littleworkthatevaluationscorrectpredictionoflinks
TypicalAssumptionsMade
Eachrecord/mentionisassociatedwithasinglereal
worldentity.
y
Inrecordlinkage,noduplicatesinthesamesource
Iftworecords/mentionsareidentical,thentheyaretrue
If two records/mentions are identical then they are true
matches
(
(,)
) Mtrue
ERversusClassification
Findingmatchesvs nonmatchesisaclassificationproblem
Imbalanced:typicallyO(R)matches,O(R^2)nonmatches
Instancesarepairsofrecords.PairsarenotIID
(,) Mtrue
AND
( , ) Mtrue
(,)
(
(,)
) Mtrue
ERvs (Multirelational)Clustering
Computingentitiesfromrecordsisaclusteringproblem
Intypicalclusteringalgorithms(kmeans,LDA,etc.)
number of clusters is a constant or sub linear in R.
numberofclustersisaconstantorsublinearinR.
In
InER:numberofclustersislinearinR,andaverage
ER: number of clusters is linear in R and average
clustersizeisaconstant.Significantfractionofclusters
aresingletons.
g
PART 2
PART2
ALGORITHMIC FOUNDATIONS OF ER
ALGORITHMICFOUNDATIONSOFER
OutlineofPart2
a) DataPreparationandMatchFeatures
b) Pairwise ER
Determiningwhetherornotapairofrecordsmatch
c) ConstraintsinER
5minutebreak
d) Algorithms
Recordlinkage(Propagationthroughexclusitivity negativeconstraint),
Deduplication (Propagationthroughtransitivitypositiveconstraint),
Collective(PropagationthroughGeneralConstraints)
MOTIVATINGEXAMPLE:
MOTIVATING
EXAMPLE:
BIBLIOGRAPHICDOMAIN
Entities&RelationsinBibliographicDomain
Wrote
Paper
p
Title
# of Authors
Topic
Word1
Word 2
WordN
Cites
Author
Name
Research Area
WorksAt
I tit ti
Institution
Name
Author Mention
NameString
Institute Mention
NameString
Paper Mention
TitleString
AppearsIn
pp
Venue
:entityrelationships
:cooccurrencerelationships
:resolutionrelationships
Name
Venue Mention
NameString
g
PART 2
PART2a
DATAPREPARATION&
DATA
PREPARATION &
MATCHFEATURES
Normalization
Schemanormalization
SchemaMatching e.g.,contactnumberandphonenumber
Compoundattributes
C
d tt ib t
f ll dd
fulladdressvs
str,city,state,zip
t it t t i
Nestedattributes
Listoffeaturesinonedataset(airconditioning,parking)vs eachfeaturea
boolean attribute
Setvaluedattributes
Setofphonesvs primary/secondaryphone
Recordsegmentationfromtext
Record segmentation from text
Datanormalization
Oftenconverttoalllower/allupper;removewhitespace
detectingandcorrectingvaluesthatcontainknowntypographicalerrorsor
d
i
d
i
l
h
i k
hi l
variations,
expandingabbreviationsandreplacingthemwithstandardforms;replacing
nicknames with their proper name forms
nicknameswiththeirpropernameforms
Usuallydonebasedondictionaries(e.g.,commercialdictionaries,postaladdresses,
etc.)
MatchingFeatures
Fortworeferencesxandy,computeacomparisonvectorof
similarityscoresofcomponentattribute.
[1stauthormatchscore,
papermatchscore,
venuematchscore,
yearmatchscore,.]
Similarityscores
y
Boolean(matchornotmatch)
Realvaluesbasedondistancefunctions
SummaryofMatchingFeatures
Handle
H
dl
Typographicalerrors
Equalityonaboolean predicate
Editdistance
Levenstein,SmithWaterman,Affine
VectorBased
Cosine
Cosinesimilarity,TFIDF
similarity, TFIDF
GoodforTextlike
reviews/tweets
AlignmentbasedorTwotiered
JaroWinkler,SoftTFIDF,MongeElkan
PhoneticSimilarity
Soundex
Setsimilarity
Jaccard,Dice
GoodforNames
Translationbased
Numericdistancebetweenvalues
Domainspecific
p
Useful packages
Usefulpackages:
SecondString,http://secondstring.sourceforge.net/
Simmetrics:http://sourceforge.net/projects/simmetrics/
LingPipe,http://aliasi.com/lingpipe/index.html
Li Pi
htt // li i
/li
i /i d ht l
Usefulfor
abbreviations,
alternate names.
alternatenames.
RelationalMatchingFeatures
Relationalfeaturesareoftensetbased
Setofcoauthorsforapaper
Setofcitiesinacountry
f ii i
Setofproductsmanufacturedbymanufacturer
Canusesetsimilarityfunctionsmentionedearlier
CommonNeighbors:Intersectionsize
Jaccard
Jaccardss Coefficient:Normalizebyunionsize
Coefficient: Normalize by union size
AdarCoefficient:Weightedsetsimilarity
Canreasonaboutsimilarityinsetsofvalues
C
b t i il it i
t f l
AverageorMax
Otheraggregates
PART 2 b
PART2b
PAIRWISE MATCHING
Pairwise MatchScore
Problem:Givenavectorofcomponentwisesimilaritiesforapairof
records(x,y),computeP(xandymatch).
Solutions:
1. Weightedsumoraverageofcomponentwisesimilarityscores.
Thresholddeterminesmatchornonmatch.
Hardtopickweights.
Matchonlastnamematchmorepredictivethanloginname.
Match on Smith
Matchon
Smith lesspredictivethanmatchon
less predictive than match on Getoor
Getoor or
or Machanavajjhala.
Hardtotuneathreshold.
Pairwise MatchScore
Problem:Givenavectorofcomponentwisesimilaritiesforapairof
records(x,y),computeP(xandymatch).
Solutions:
1. Weightedsumoraverageofcomponentwisesimilarityscores.
Thresholddeterminesmatchornonmatch.
2 Formulaterulesaboutwhatconstitutesamatch.
2.
Formulate rules about what constitutes a match
(1stauthormatchscore>0.7ANDvenuematchscore>0.8)
OR(papermatchscore>0.9ANDvenuematchscore>0.9)
M
Manuallyformulatingtherightsetofrulesishard.
ll f
l ti th i ht t f l i h d
BasicMLApproach
r =(x,y)isrecordpair, iscomparisonvector,M matches,U non
matches
Decisionrule
P( | r M )
R
P( | r U )
R t r Match
R t r Non - Match
P( | r M )
R
P( | r U )
R tl r Match
tl R tu r Potential Match
R tu r Non - Match
NaveBayes Assumption: P( | r M ) i P( i | r M )
MLPairwise Approaches
Supervisedmachinelearningalgorithms
Decisiontrees
[Cochinwala
[Co hin ala etal,IS01]
et al IS01]
Supportvectormachines
[Bilenko &Mooney,KDD03];[Christen,KDD08]
Ensemblesofclassifiers
Ensembles of classifiers
[Chenetal.,SIGMOD09]
ConditionalRandomFields(CRF)
[Gupta&Sarawagi,VLDB09]
[Gupta & Sarawagi VLDB09]
Issues:
Training
Trainingsetgeneration
set generation
Imbalancedclasses manymorenegativesthanpositives(evenafter
eliminatingobviousnonmatchesusingBlocking)
Misclassificationcost
Misclassification cost
CreatingaTrainingSetisakeyissue
Constructingatrainingsetishard sincemostpairsof
recordsareeasynonmatches.
y
100recordsfrom100cities.
Only106 pairsoutoftotal108 (1%)comefromthesamecity
Somepairsarehardtojudgeevenbyhumans
Inherentlyambiguous
E.g.,ParisHilton(personorbusiness)
Missingattributes
Starbucks,Torontovs Starbucks,QueenStreet,Toronto
AvoidingTrainingSetGeneration
Unsupervised/SemisupervisedTechniques
EMbasedtechniquestolearnparameters
[Winkler06,Herzogetal07]
GenerativeModels
[Ravikumar &Cohen,UAI04]
& Cohen UAI04]
ActiveLearning
CommitteeofClassifiers
[Sarawagi etalKDD00,Tajeda etalIS01]
Provablyoptimizingprecision/recall
Provably optimizing precision/recall
[Arasu etalSIGMOD10,Bellare etalKDD12]
Crowdsourcing
[WangetalVLDB12,MarcusetalVLDB12,]
[W
t l VLDB 12 M
t l VLDB 12 ]
ActiveLearningwithProvableGuarantees
Mostactivelearningtechniquesminimize01loss
[Beygelzimer etalNIPS2010].
However,ERisveryimbalanced:
Numberofnonmatches>100*numberofmatches.
Classifyingallpairsasnonmatcheshaslow01loss(<1%).
y g p
(
)
Hence,needactivelearningtechniquesthatminimize
precision/recall.
precision/recall
ActiveLearningwithProvableGuarantees
Monotonicity ofPrecision[Arasu etalSIGMOD10]
Thereisalargerfractionof
matchesinC1thaninC2.
Algorithmsearchesforthe
p
g
y
optimalclassifierusingbinary
searchoneachdimension
ActiveLearningwithProvableGuarantees
[Bellare etalKDD12]
O (log2 n)callstoablackbox
O(log
n) calls to a blackbox 01lossactivelearningalgorithm.
0 1 loss active learning algorithm
Exponentiallysmallerlabelcomplexitythan[Arasu etalSIGMOD10]
(in the worst case).
(intheworstcase).
1.
2.
3
3.
PrecisionConstrained Weighted01LossProblem
(using a Lagrange Multiplier )
(usingaLagrangeMultiplier).
Givenafixedvaluefor,weighted01Losscanbeoptimizedby(onecallto)a
blackbox activelearningclassifier.
Ri ht l
Rightvalueof
f iscomputedbysearchingoveralloptimalclassifiers.
i
t db
hi
ll ti l l ifi
Classifiersareembeddedina2dplane(precision/recall)
Searchisalongtheconvexhulloftheembeddedclassifiers
Crowdsourcing
Growinginterestinintegratinghumancomputationindeclarative
workflowengines.
ERisanimportantproblem(e.g.,forevaluatingfuzzyjoins)
[WangetalVLDB12,MarcusetalVLDB12,]
Opportunity:utilizecrowdsourcing forcreatingtrainingsets,
orforactivelearning.
Keyopenissue:Handlingerrorsinhumanjudgments
InanexperimentonAmazonMechanicalTurk:
Pairwise matchingjudgment,eachgivento5differentpeople
Majorityofworkersagreedontruthononly90%ofpairwise judgments.
SummaryofSingleEntityERAlgorithms
Manyalgorithmsforindependentclassificationofpairsofrecords
asmatch/nonmatch
MLbasedclassification&FellegiSunter
Pro:Advancedstateoftheart
P Ad
d
f h
Con:Buildinghighfidelitytrainingsetsisahardproblem
ActiveLearning&Crowdsourcing forERareactiveareasof
research.
PART 2
PART2c
CONSTRAINTS
Constraints
Importantformsofconstraints:
Transitivity:IfM1andM2match,M2andM3match,thenM1and
y
M3match
Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2
FunctionalDependency:IfM1andM2match,thenM3andM4must
Functional Dependency: If M1 and M2 match then M3 and M4 must
match
Transitivityiskeytodeduplication
Exclusivityiskeytorecordlinkage
Functionaldependenciesfordatacleaning,e.g.,
[Ananthakrishna etal.,VLDB02][Fan,PODS08][Bohannonet
al ICDE07]
al,ICDE07]
Positive&NegativeEvidence
Positive
Transitivity:IfM1andM2match,M2andM3match,thenM1and
y
M3match
Exclusivity:IfM1doesntmatchwithM2,thenM3canmatchwith
M2
FunctionalDependency:IfM1andM2match,thenM3andM4must
match
Negative
Transitivity:IfM1andM2match,M2andM3donotmatch,thenM1
and M3 do not match
andM3donotmatch
Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2
FunctionalDependency:IfM1andM2donotmatch,thenM3and
M4cannotmatch
Positive&NegativeEvidence
Positive
Transitivity:IfM1andM2match,M2andM3match,thenM1and
y
M3match
Exclusivity:IfM1doesntmatchwithM2,thenM3canmatchwith
M2
FunctionalDependency:IfM1andM2match,thenM3andM4must
match
Negative
Transitivity:IfM1andM2match,M2andM3donotmatch,thenM1
and M3 do not match
andM3donotmatch
Exclusivity:IfM1matcheswithM2,thenM3cannotmatchwithM2
FunctionalDependency:IfM1andM2donotmatch,thenM3and
M4cannotmatch
ConstraintTypes
HardConstraint
SoftConstraint
PositiveEvidence
IfM1,M2matchthenM3,M4must
match
If t
Iftwopapersmatch,theirvenues
t h th i
match
IfM1,M2matchthenM3,
M4morelikelytomatch
If t
Iftwovenuesmatch,then
t h th
theirpapersaremorelikely
tomatch
NegativeEvidence
MentionM1andM2mustreferto
distinctentities(Uniqueness)
Coauthors are distinct
Coauthorsaredistinct
IfM1,M2dontmatchthen
M3,M4lesslikelytomatch
If institutions donttmatch,
Ifinstitutionsdon
match
thenauthorslesslikelyto
match
IfM1,M2dontmatchthenM3,M4
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch
ConstraintTypes
HardConstraint
SoftConstraint
PositiveEvidence
IfM1,M2matchthenM3,M4must
match
If t
Iftwopapersmatch,theirvenues
t h th i
match
IfM1,M2matchthenM3,
M4morelikelytomatch
If t
Iftwovenuesmatch,then
t h th
theirpapersaremorelikely
tomatch
NegativeEvidence
MentionM1andM2mustreferto
distinctentities(Uniqueness)
Coauthors are distinct
Coauthorsaredistinct
IfM1,M2dontmatchthen
M3,M4lesslikelytomatch
If institutions donttmatch,
Ifinstitutionsdon
match
thenauthorslesslikelyto
match
IfM1,M2dontmatchthenM3,M4
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch
ConstraintTypes
HardConstraint
SoftConstraint
PositiveEvidence
IfM1,M2matchthenM3,M4must
match
If t
Iftwopapersmatch,theirvenues
t h th i
match
IfM1,M2matchthenM3,
M4morelikelytomatch
If t
Iftwovenuesmatch,then
t h th
theirpapersaremorelikely
tomatch
NegativeEvidence
MentionM1andM2mustreferto
distinctentities(Uniqueness)
Coauthors are distinct
Coauthorsaredistinct
IfM1,M2dontmatchthen
M3,M4lesslikelytomatch
If institutions donttmatch,
Ifinstitutionsdon
match
thenauthorslesslikelyto
match
IfM1,M2dontmatchthenM3,M4
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch
ConstraintTypes
HardConstraint
SoftConstraint
PositiveEvidence
IfM1,M2matchthenM3,M4must
match
If t
Iftwopapersmatch,theirvenues
t h th i
match
IfM1,M2matchthenM3,
M4morelikelytomatch
If t
Iftwovenuesmatch,then
t h th
theirpapersaremorelikely
tomatch
NegativeEvidence
MentionM1andM2mustreferto
distinctentities(Uniqueness)
Coauthors are distinct
Coauthorsaredistinct
IfM1,M2dontmatchthen
M3,M4lesslikelytomatch
If institutions donttmatch,
Ifinstitutionsdon
match
thenauthorslesslikelyto
match
IfM1,M2dontmatchthenM3,M4
cannotmatch
Iftwovenuesdontmatch,thentheir
papersdontmatch
ConstraintTypes
HardConstraint
PositiveEvidence
SoftConstraint
IfM1,M2matchthenM3,
IfM1,M2matchthenM3,M4must
Notethatsomeofthe
M4morelikelytomatch
match
constraintsmayberelational
y
If t
Iftwopapersmatch,theirvenues
t h th i
If t
Iftwovenuesmatch,then
t h th
andrequirejoins
match
theirpapersaremorelikely
tomatch
Maybedirectional
May
be directional
orbidirectional
NegativeEvidence
MentionM1andM2mustreferto
IfM1,M2dontmatchthen
distinctentities(Uniqueness)
M3,M4lesslikelytomatch
Constraints can
be recursive
Coauthors are distinct Constraintscanberecursive,
Coauthorsaredistinct
If institutions
Ifinstitutionsdon
donttmatch,
match
thenauthorslesslikelyto
e.g.,iftwoauthorshave
match
IfM1,M2dontmatchthenM3,M4
matchingcoauthors,then
cannotmatch
theymatch
h
h
Iftwovenuesdontmatch,thentheir
papersdontmatch
AdditionalConstraints
AggregateConstraints[Chaudhuri etal.SIGMOD07]
countconstraints
count constraints
EntityAcanlinktoatmostNBs
Authorshaveatmost5papersatanyconference
Otheraggregateslikesum,averagemorecomplex
Oth
t lik
l
Again,
Again,thesecanbeeitherhardorsoftconstraints,
these can be either hard or soft constraints,
providepositiveornegativeevidence
MatchDependencies
Whenmatchingdecisionsdependonother
matching decisions (in other words, matching
matchingdecisions(inotherwords,matching
decisionsarenotmadeindependently),we
refer to the approach as collective
refertotheapproachascollective
MatchExtent
Global: Iftwopapersmatch,thentheirvenuesmatch
This
Thisconstraintcanbeappliedtoallinstancesofvenue
constraint can be applied to all instances of venue
mentions
AlloccurrencesofSIGMODcanbematchedtoInternational
Conference on Management of Data
ConferenceonManagementofData
Local:Iftwopapersmatch,thentheirauthorsmatch
Thisconstraintcanonlybeappliedlocally
This constraint can only be applied locally
DontwanttomatchalloccurrencesofJ.SmithwithJeffSmith,onlyin
thecontextofthecurrentpaper
Ex.SemanticIntegrityConstraints
Type
Example
Aggregate
C1=Noresearcher haspublishedmorethanfiveAAAIpapersinayear
Subsumption
C2=IfacitationXfromDBLP matchesacitationYinahomepage,then
eachauthormentionedinYmatchessomeauthormentionedinX
Neighborhood
C3=IfauthorsXandYsharesimilarnamesandsomecoauthors,they
arelikelytomatch
lik l t
t h
Incompatible
C4 =NoresearcherexistswhohaspublishedinbothHCIandnumerical
analysis
Layout
C5=Iftwomentionsinthesamedocumentsharesimilarnames,they
f
i
i h
d
h
i il
h
arelikelytomatch
Key/Uniqueness
C6=MentionsinthePClistingofaconferenceisto different
researchers
Ordering
C7=Iftwo citationsmatch,thentheirauthorswillbematchedinorder
Individual
C8=TheresearcherwiththenameMayssam Sariahasfewerthan
fi
fivementionsinDBLP(newgraduatestudent)
ti
i DBLP (
d t t d t)
[Shen,Li&Doan,AAAI05]
AlgorithmsforHandlingConstraints
Recordlinkage propagationthroughexclusivity
Deduplication propagationthroughtransitivity
Weightedkpartitematching
C
Correlationclustering
l i
l
i
Collective propagationthroughgeneralconstraints
Collective
propagation through general constraints
Similaritypropagation
P b bili ti
Probabilisticapproaches
h
Dependencygraphs,CollectiveRelationalClustering
LDA,CRFs,MarkovLogicNetworks,ProbabilisticRelationalModels,
Hybridapproaches
Dedupalog
PART 2 d
PART2d
ALGORITHMS
RECORD LINKAGE
RECORDLINKAGE
11assumption
Matchingbetween(almost)deduplicated databases.
Each record in one database matches at most one record
Eachrecordinonedatabasematchesatmostonerecord
inanotherdatabase.
Pairwise ERmaymatcharecordinonedatabasewith
morethanonerecordinseconddatabase
WeightedKPartiteMatching
Weighted
Edges
Weighted
Edges
Edgesbetweenpairsofrecordsfromdifferentdatabases
Edgeweights
o Pairwise matchscore
o Logoddsofmatching
L
dd f
hi
WeightedKPartiteMatching
Findamatching(eachrecordmatchesatmostoneotherrecord
fromotherdatabase)thatmaximizethesumofweights.
)
GeneralproblemisNPhard(3Dmatching)
Successivebipartitematchingistypicallyused.
Successive bipartite matching is typically used [Gupta&Sarawagi,VLDB
[G pta & Sara agi VLDB
09]
DEDUPLICATION
Deduplication =>Transitivity
Oftenpairwise ERalgorithmoutputinconsistentresults
(x,y) Mpred ,(y,z) Mpred ,but(x,z) Mpred
Idea:Correctthisbyaddingadditionalmatchesusingtransitive
closure
l
Incertaincases,thisisabadidea.
In certain cases this is a bad idea
Graphsresultingfrompairwise ERhave
diameter>20
[Rastogi etalCorr
et al Corr 12]
12]
Addedby
Transitive
Transitive
Closure
Needclusteringsolutionsthatdealwiththisproblemdirectlyby
reasoningaboutrecordsjointly.
ClusteringbasedER
Resolutiondecisionsarenotmadeindependentlyfor
eachpairofrecords
Basedonvarietyofclusteringalgorithms,but
Numberofclustersunknownaprioiri
Many,manysmall(possiblysingleton)clusters
Oftentakeapairwisesimilaritygraphasinput
Mayrequiretheconstructionofaclusterrepresentative
orcanonicalentity
ClusteringMethodsforER
HierarchicalClustering
[[Bilenko etal,ICDM05]
,
]
NearestNeighborbasedmethods
[Chaudhuri etal,ICDE05]
CorrelationClustering
[SoonetalCL01,Bansal etalML04,NgetalACL02,
Ailon etalJACM08,Elsner etalACL08,Elsner etalILPNLP09]
IntegerLinearProgrammingviewofER
rxy {0,1},rxy =1ifrecordsx andyareinthesamecluster.
w+xy [0,1],costofclusteringxandytogether
[ ]
g
y g
w xy [0,1],costofplacingxandyindifferentclusters
Transitive
closure
CorrelationClustering
Clustermentionssuchthat
totalcostisminimized
total cost is minimi ed
Solidedgescontributew+
totheobjective
Dashededgescontributew xyy totheobjective
xy
2
1
3
Costbasedonpairwise similarities
Additive:w+xy =pxy andw xy =(1pxy)
oga t
c +xy =log(p
og(pxy))andw
a d xy =log(1p
og( pxy)
Logarithmic:w
CorrelationClustering
SolvingtheILPisNPhard[Ailon etal2008JACM]
Anumberofheuristics[Elsner etal2009ILPNLP]
GreedyBEST/FIRST/VOTEalgorithms
Greedy BEST/FIRST/VOTE algorithms
GreedyPIVOTalgorithm(5approximation)
LocalSearch
GreedyAlgorithms
SStep1:Permutethenodesaccordingarandom
1 P
h
d
di
d
Step2:AssignrecordxtotheclusterthatmaximizesQuality
Start
a new cluster if Quality < 0
StartanewclusterifQuality<0
Quality:
BEST:Clustercontainingtheclosestmatch
[Ngetal2002ACL]
FIRST:Clustercontainsthemostrecentvertexywithw+xy >0
[Soonetal2001CL]
[Soon et al 2001 CL]
VOTE:Assigntoclusterthatminimizesobjectivefunction.
[Elsner etal08ACL]
PracticalNote:
Runthealgorithmformanyrandompermutations,andpicktheclusteringwith
bestobjectivevalue(betterthanaveragerun)
Greedywithapproximationguarantees
PIVOTAlgorithm
[Ailon etal2008JACM]
Pickarandom(pivot)recordp.
(p
)
p
Newcluster=
2
={1,2,3,4}C={{1,2,3,4}}
={2,4,1,3}C={{1,2},{4},{3}}
{ , , , }
{{ , }, { }, { }}
={3,2,4,1}C={{1,3},{2},{4}}
Whenweightsare0/1,
Forw+xy +wxy =1,
1
3
4
E(cost(greedy))<3 OPT
E(cost(greedy))<5 OPT
[Elsner etal,ILPNLP09]:Comparisonofvariouscorrelationclusteringalgorithms
PART2d
CANONICALIZATION
Canonicalization
Mergeinformationfromduplicatementionstoconstruct
aclusterrepresentativewithmaximalinformation
p
Starbucks,
3457HillsboroughRoad
Starbucks
Durham,NC
3457HillsboroughRoad,Durham,NC
Ph:null
Ph:(919)3334444
Starbacks,
HillsboroughRd,Durham
Ph:(919)3334444
CriticallyimportantinWebportalswhere
usersmust beshownaconsolidatedview
Eachmentiononlycontainsasubsetofthe
attributes
Mentionscontainvariations(ofnames,
addresses)
Someofthementionshaveincorrectvalues
CanonicalizationAlgorithms
Rulebased:
Fornames:typicallylongestnamesareused.
Forsetvaluesattributes:UNIONisused.
FForstrings,[Culotta
ti
[C l tt etalKDD07]learnaneditdistanceforfinding
t l KDD07] l
dit di t
f fi di
themostrepresentativecentroid.
Canusemajorityruletofixerrors
(if4outof5sayabusinessisclosed,thenbusinessisclosed).
Thismaynotalwaysworkduetocopying[DongetalVLDB09],orwhen
underlyingdatachanges[PaletalWWW11]
CanonicalizationforEfficiency
StanfordEntityResolutionFramework[Benjelloun VLDBJ09]
Considerablackbox matchandmergefunction
Matchisapairwise boolean operator
Merge:constructcanonicalversionofamatchingpair
Canminimizetimetocomputematchesbyinterleavingmatching
andmerging
esp.,whenmatchandmergefunctions
satisfymonotonicity properties.
r345
r12
r1
r45
r2
r3
r4
r5
CollectiveApproaches
Decisionsforclustermembershipdependsonotherclusters
Nonprobabilisticapproaches
p
pp
SimilarityPropagation
ProbabilisticModels
GenerativeModels
G
i M d l
UndirectedModels
HybridApproaches
y
pp
SIMILARITY PROPAGATION
SIMILARITYPROPAGATION
SimilarityPropagationApproaches
Similaritypropagationalgorithmsdefineagraphwhichencodes
thesimilaritybetweenentitymentionsandmatchingdecisions,
andcomputematchingdecisionsbypropagatingsimilarityvalues.
Detailsofconstructedgraphandhowthesimilarityiscomputedvaries
Algorithmsareusuallydefinedprocedurally
Algorithms are usually defined procedurally
Whileprobabilitiesmaybeencodedinvariouswaysinthealgorithms,there
isnoglobalprobabilisticmodeldefined
Approachesoftenmorescalablethanglobalprobabilisticmodels
DependencyGraph
[Dong et al SIGMOD05 ]
[Dongetal.,SIGMOD05]
Constructagraphwherenodesrepresentsimilaritycomparisons
between attribute values (realvalued)
betweenattributevalues(real
valued)andmatchdecisionsbased
and match decisions based
onmatchingdecisionsofassociatednodes(booleanvalued)
Asmentionsareresolved,enrichedtocontainassociatednodesof
allmatchedmentions
Similaritypropagateduntilfixedpointisreached
Negativeconstraints(notmatchnodes)arecheckedaftersimilarity
propagationisperformed,andinconsistenciesarefixed
ExploittheDependencyGraph
Slid f
Slidesfrom[Dongetal,SIGMOD05]
[D
t l SIGMOD05]
(a1,a4)
(Distributed,Distributed )
(RobertS.Epstein,Epstein,R.S.)
(169180,169180)
(a2,a5)
(p1,p2)
(MichaelStonebraker,Stonebraker,M.)
(a3,a6)
(EugeneWong,Wong,E.)
Referencesimilarity
(v1,v2)
(ACM,ACMSIGMOD)
(1978,1978)
Attributesimilarity
ExploittheDependencyGraph
(a1,a4)
(Distributed,Distributed )
(RobertS.Epstein,Epstein,R.S.)
(169180,169180)
(a2,a5)
(p1,p2)
(MichaelStonebraker,Stonebraker,M.)
(v1,v2)
(a3,a6)
(EugeneWong,Wong,E.)
Reconciled
(ACM,ACMSIGMOD)
Similar
(1978,1978)
CollectiveRelationalClustering
[Bh
[Bhattacharya&Getoor,TKDD07]
h
&G
TKDD07]
Constructagraphwhereleafnodesareindividual
mentions
i
Performhierarchicalagglomerativeclusteringtomerge
clustersofmentions
l t
f
ti
Similaritycomputedbasedonacombinationofattribute
and relational similarity
andrelationalsimilarity
Whenclustersaremerged,updatethesimilaritiesofany
related clusters (clusters corresponding to mentions
relatedclusters(clusterscorrespondingtomentions
whichcooccurwithmergedmentions)
ObjectiveFunction
Mi i i
Minimize:
w sim
A
weightfor
attributes
b
similarityof
attributes
b
weightfor
relations
l
Similaritybasedonrelationaledges
b
betweenc
d j
i andc
simA (ci , c j )
and
sim(c , c )
aAttributes
*
i
*
j
RelationalClusteringAlgorithm
1.
2.
3.
Findsimilarreferencesusingblocking
Bootstrapclustersusingattributesandrelations
Computesimilaritiesforclusterpairsandinsertintopriorityqueue
4.
5.
6.
7
7.
8.
9.
10.
Repeatuntilpriorityqueueisempty
p
p
yq
py
Findclosestclusterpair
Stopifsimilaritybelowthreshold
If no negative constraints violated
Ifnonegativeconstraintsviolated
Mergetocreatenewcluster
Constructcanonicalclusterrepresentative
Updatesimilarityforrelatedclusters
O(nklogn)algorithmw/efficientimplementation
SimilaritypropagationApproaches
Method
Notes
Constraints
Evaluation
RelDC
[Kalashnikovet
al,TODS06]
l TODS06]
Reference
disambiguation
usingusing
i
i
Relationship
baseddata
cleaning(RelDC)
g(
)
Modelchoice
nodesidentified
usingfeature
i f
basedsimilarity
Context
attraction
measuresthe
h
relational
similarity
Accuracyand
runtime forAuthor
resolutionand
l i
d
directorresolution
inMovie database
Reference
Reconciliation
[Dongetal,
SIGMOD05]
Dependency
Graphfor
propagating
similarities+
enforcenon
match
constraints
Reference
Both positive
enrichment
andnegative
Explicitly handle constraints
missingvalues
Parametersset
byhand
Precision/Recall,
F1onpersonal
information
managementdata
(PIM),Coradataset
Collective
Relational
Clustering
g
[Bhattacharya&
Getoor,TKDD07]
Modified
hierarchical
agglomerative
gg
clustering
approach
Constructs
canonical entity
asmergesare
g
made
Precision/Recall,
F1onthree
bibliographic
g p
datasets:CiteSeer,
ArXiv,andBioBase,
andsyntheticdata
Focuson
coauthor
resolution and
propagation
PROBABILISTICMODELS:
PROBABILISTIC
MODELS:
GENERATIVEAPPROACHES
GenerativeProbabilisticApproaches
ProbabilisticsemanticsbasedonDirectedModels
Model
Modeldependenciesbetweenmatchdecisionsinagenerative
dependencies between match decisions in a generative
manner
Disadvantage:acyclicity requirement
Varietyofapproaches
BasedonLatentDirichlet Allocation,BayesianNetworks
Examples
LatentDirichlet Allocation[Bhattacharya&Getoor,SDM07]
ProbabilisticRelationalModels[Pasula etal,NIPS02]
LDAforEntityResolution:Discovering
Groups from CoOccurrence
GroupsfromCo
OccurrenceRelations
Relations
Stephen P Johnson
Chris Walshaw
Kevin McManus
M k Cross
Mark
C ss
M tin E
Martin
Everett
tt
Stephen C Johnson
Alfred V Aho
Ravi Sethi
J ff
Jeffrey
D Ullman
Ull
Bell Labs Group
LDAERModel
T
r
V
R
InferenceusingblockedGibbssampling
forefficiency(andimprovedaccuracy)
GenerativeApproaches
Method
Learning/Inference
Method
Evaluation
[Li,Morie,&
[Li
Morie &
Generative
Generative
Roth,AAAI04] modelfor
mentionsin
documents
Truncated EMtolearn
EM to learn
parametersandMAP
inferenceforentities
(unsupervised)
F1on
F1
on person
person
names,
locationsand
organizationsin
TRECdataset
Probabilistic
Probabilistic
Relational
Relational
M d l [P l Models
Models[Pasula
M d l
etal.,NIPS03]
Parameterslearned
onseparatedcorpora,
i f
inferencedoneusing
d
i
MCMC
%ofcorrectly
identified
clusters
l
on
subsetsof
CiteSeer data
LatentDirichlet
Latent
Dirichlet Latent
LatentDirichlet
Dirichlet BlockedGibbs
Blocked Gibbs
Allocation
Allocation
Sampling
Unsupervised
[Bhattacharya Model
&Getoor,
approach
SDM06]
Precision/Recall
/F1onCiteSeer
andHEPdata
PROBABILISTICMODELS:
PROBABILISTIC
MODELS:
UNDIRECTEDAPPROACHES
UndirectedProbabilisticApproaches
ProbabilisticsemanticsbasedonMarkovNetworks
Advantage:noacyclicity
Advantage: no acyclicity requirements
Insomecases,syntaxbasedonfirstorderlogic
Advantage:declarative
g
Examples
ConditionalRandomFields(CRFs)[McCallum&Wellner,
NIPS04]
MarkovLogicNetworks(MLNs)[Singla &Domingos,ICDM06]
ProbabilisticSimilarityLogic[Broecheler &Getoor,UAI10]
MarkovLogic
AlogicalKBisasetofhardconstraints onthesetof
possibleworlds
Makethemsoftconstraints;whenaworldviolatesa
formula,itbecomeslessprobablebutnotimpossible
Giveeachformulaaweight
Higherweight
Higher weight Strongerconstraint
Stronger constraint
[Richardson&Domingos,06]
MarkovLogic
AMarkovLogicNetwork(MLN) isasetofpairs(F,w)
where
F isaformulainfirstorderlogic
w isarealnumber
# true groundings
of ith clause
P( X ) exp wi ni ( x)
Z
iF
Normalization Constant
[Richardson&Domingos,06]
ERProblemFormulationinMLNs
Given
A
ADBofrecordsrepresentingmentionsofentitiesinthereal
DB of records representing mentions of entities in the real
world,e.g.papermentions
Asetoffieldse.g.author,title,venue
A set of fields e g author title venue
Eachrecordrepresentedasasetoftypedpredicatese.g.
HasAuthor(paper,author),HasVenue(paper,venue)
(p p ,
),
(p p ,
)
Goal
Todeterminewhichoftherecords/fieldsrefertothesame
d
h h f h
d /f ld f
h
underlyingentity
Slidesfrom[Singla &Domingos,ICDM06]
HandlingEquality
IntroduceEquals(x,y) orx=y
Introducetheaxiomsofequality
I t d
th
i
f
lit
Reflexivity: x=x
Symmetry: x=y y=x
Transitivity: x=y y=z z=x
PredicateEquivalence:
x11 =xx2 y11 y22 (R(x1,y
, y1)
) R(x2,y2))
Positive,SoftEvidence
Introducereversepredicateequivalence
SSamerelationwiththesameentitygivesevidenceabout
l ti
ith th
tit i
id
b t
twoentitiesbeingsame
R(x1,y1) R(x2,y2) x1 =x2 y2 =y2
Nottruelogically,butgivesusefulinformation
Example
HasAuthor(C1,J.Cox)
HasAuthor(C1
J Cox) HasAuthor(C2,CoxJ.)
HasAuthor(C2 Cox J )
(J.Cox=CoxJ.)
C1 = C2
C1=C2
FieldComparison
Eachfieldisastringcomposedoftokens
Introduce HasWord(field word)
IntroduceHasWord(field,word)
Usereversepredicateequivalence
HasWord(f1,w1) HasWord(f2,w2) w1= w2 f1=f2
Example
HasWord(J.Cox,Cox) HasWord(CoxJ.,Cox) (Cox=Cox)
(J.Cox=CoxJ.)
Canhavedifferentweightforeachword
h
d ff
h f
h
d
TwolevelSimilarity
Individualwordsasunits:Cantdealwithspelling
mistakes
Breakeachwordintongrams:Introduce
HasNgram(word,ngram)
Usereversepredicateequivalenceforwordcomparisons
RecordMatching
SimplestVersion:Fieldsimilaritiesmeasuredby
presence/absenceofwordsincommon
HasWord(f1,w1) HasWord(f2,w2) HasField(r1,f1)
HasField(r2,f2) w1= w2 r1=r2
Example
E
l
HasWord(J.Cox,Cox) HasWord(CoxJ.,Cox) HasAuthor(P1,
J.Cox)
) HasAuthor(P2,CoxJ.)
( ,
) ((Cox=Cox)) ((P1=P2))
Transitivity
(f11 =ff2)
) (f2 =ff3)
) ( f3 =ff1)
AdditionalConstraints
HasAuthor(c,a
HasAuthor(c
a1)
) HasAuthor(c,a
HasAuthor(c a2)
) Coauthor(a1,a
a2)
Coauthor(a1,a2) Coauthor(a3,a4) a1=a3 a2=a4
Inference
Usecheapheuristics(e.g.TFIDFbasedsimilarity)to
identifyplausiblepairs
yp
p
Inference/learningoverplausiblepairs
Inferencemethod:lazygrounding+MaxWalkSAT
Learning:supervisedandtransfer(learn/handsetonone
g p
(
/
domainandtransferred)
ProbabilisticSoftLogic
[Broecheler &Getoor,UAI10]
& Getoor UAI10]
Declarativelanguagefordefiningconstrainedcontinuous
Markov random field (CCMRF) using first order logic
Markovrandomfield(CCMRF)usingfirstorderlogic
(FOL)
Softlogic:truthvaluesin[0,1]
Soft logic: truth values in [0 1]
LogicaloperatorsrelaxedusingLukasiewicz tnorms
Mechanismsforincorporatingsimilarityfunctions,and
Mechanisms for incorporating similarity functions and
reasoningaboutsets
MAPinferenceisaconvexoptimization
MAP inference is a convex optimization
Efficientsamplingmethodformarginalinference
FOLtoCCMRF
PSLconvertsaweightedruleintopotentialfunctionsby
penalizing its distance to satisfaction
penalizingitsdistancetosatisfaction,
isthetruthvalueofgroundruleunder
interpretationx
Thedistributionovertruthvaluesis
::weightofruler
weight of rule r
:allgroundingsofruler
p g
:PSLprogram
UndirectedApproaches
Method
Learning/Inference
Method
Evaluation
[McCallum&
[McCallum
&
Wellner,
NIPS04]
Conditional
Conditional
RandomFields
(CRFs)
capturing
transitivity
constraints
Graphpartitioning
Graph
partitioning
(Boykov etal.1999),
performedvia
correlationclustering
F1on
F1
on DARPA
DARPA
MUC&ACE
datasets
[Singla &
D i
Domingos,
ICDM06]
MarkovLogic
N
Networks
k
(MLNs)
Supervisedlearning
ConditionalLog
andinferenceusing
di f
i
lik lih d and
likelihood
d
MaxWalkSAT &MCMC AUConCora
andBibServ
data
Supervisedlearning
andinferenceusing
continuous
optimization
Precision/Recall
/F1Ontology
Alignment
HYBRID APPROACHES
HYBRIDAPPROACHES
HybridApproaches
Constraintbasedapproachesexplicitlyencoderelational
constraints
Theycanbeformulatedashybridofconstraintsand
probabilisticmodels
Orasconstraintoptimizationproblem
Examples
ConstraintbasedEntityMatching[Shen,Li&Doan,AAAI05]
Dedupalog [Arasu,Re,Suciu,ICDE09]
Datatobe
deduplicated
(Thresholded)Fuzzy
J i O t t
JoinOutput
Step(0)Createinitialapproximatematches;thisisinputtoDedupalog.
Step(1)Declaretheentities
ClusterPapers,Publishers,&Authors
Paper!(id) :- PaperRef(id,-,-,-)
Publisher!(p) :- PaperRef(-,-,-,p,-)
Author!(a)
( ) :- Wrote(-,a,-)
(, ,)
Dedupalog isflexible:
UniqueNamesAssumption(UNA)
P bli h (UNA) d P
Publishers(UNA)andPapers(NOTUNA)
(NOT UNA)
Slidesbasedon[Arasu,Re,Suciu,ICDE09]
Step(2)DeclareClusters
InputintheDB
Paper!(id) :- PaperRef(id,-,-,-)
Publisher!(p) :- PaperRef(-,-,-,p,-)
PaperRef(- - - p -)
Author!(a) :- Wrote(-,a,-)
Clustersaredeclared using*(likeIDBsorViews):Theseareoutput
Author1
AA
Author2
Arvind Arasu
ADedupalog
A
D d
l program isa
i
setofdataloglikerules
152
SimpleConstraints
Paperswithsimilartitlesshouldlikelybeclusteredtogether
Paper*(id1,id2) <-> PaperRef(id1,t1,-), PaperRef(id2,t2,-),TitleSimilar(t1,t2)
Author*(a1,a2) <-> AuthorSimilar(a1,a2)
Paper*(id
Paper
(id1,id
id2) <= PaperEq(id1,id
id2 )
Paper*(id1,id2) <= PaperNeq(id1,id2)
((<>)Softconstraints:
)
Payacostifviolated.
(<=)Hardconstraints:Any
clusteringmustsatisfythese
Additional Constraints
AdditionalConstraints
Clusteringtwopapers,thenmustclustertheirfirstauthors
if
iftwoauthorsdonotsharecoauthors,thendonotclusterthem
two authors do not share coauthors, then do not cluster them
Author (x, y) <- (Wrote(x, p1,), Wrote(y, p2,), Wrote(z, p1,),
Wrote(z, p2,), Author(x, y))
Dedupalog viaCC
Semantics:TranslateaDedupalog Programtoasetofgraphs
Nodesarereferences(inthe!Relation)
EntityReferences:Conference!(c)
VLDBJ
VLDB
Positiveedges
[ ] Negativeedgesareimplicit
[]
Negative edges are implicit
VLDBconf
ICDE
ICDT
InternationalConf.DE
Forasinglegraphw.o.hardconstraints
we can reuse prior work for O(1) apx
wecanreusepriorworkforO(1)apx.
156
CorrelationClustering
Soft
Conference*(c
Conference
(c1,cc2) << ConfSim(c1,cc2)
Conference*(c1,c2) <= ConfEQ(c1,c2)
Conference*(c
C f
*( 1,c2) <=
< ConfNEQ(c
C fNEQ( 1,c2)
Hard
Equal
Positive
[]Negative
NotEqual
VLDBJ
VLDB
VLDBconf
ICDE
ICDT
1. Pick a random order of edges
2. While there is a soft edge do
1. Pick first soft edge in order
2. If
turn into
[-]
] turn into
3. Else is [
4. Deduce labels
3. Return Transitively closed subsets
InternationalConf.DE
Voting
Extendalgorithmtowhole languageviavotingtechnique.
Support different entity types recursive programs etc
Supportdifferententitytypes,recursiveprograms,etc.
Manydedupalog
y
p gp
programs
g
haveanO(1)apx
Thm:AllsoftprogramsO(1)
Thm: Arecursivehard
constraintsnoO(1)apx
Expert:multiwaycuthard
Systemproperties:
(1)Streamingalgorithm
(2)linearin#ofmatches(notn2)
(3)Userinteraction
Features:Supportforweights,referencetables
(partially),andcorrespondinghardnessresults.
HybridApproaches
Method
Evaluation
Constraint
basedEntity
Matching
[Shen,Li&
Doan,
AAAI05];
buildson(Li,
Morie,&Roth,
AIMag 2004)
Twolayermodel:
Layer1:Generativemodelfordatasetsthatsatisfy
constraints;
Layer2:EMalgorithmandtherelaxationlabeling
algorithmtoperformmatching.Ineachiteration,use
EM to estimate parameters of the generative model
EMtoestimateparametersofthegenerativemodel
andamatchingassignment,thenemploysrelaxation
labelingtoexploittheconstraints
Researchers
andIMDBwith
noiseadded
Dedupalog
[Arasu,Re,
Suciu,ICDE09]
Declarativespecificationforrichcollectionof
constraintswithnicesyntactic sugaraddedtodatalog
forER.Inference:Correlationclustering+voting
Precision/Recall
onCora,subset
ofACMdataset
Summary:CollectiveApproaches
Decisionsforclustermembershipdependsonotherclusters
Similaritypropagationapproaches
yp p g
pp
ProbabilisticModels
GenerativeModels
UndirectedModels
U di
dM d l
HybridApproaches
Nonprobabilisticapproachesoftenscalebetterthangenerative
probabilisticapproaches
Undirected/constraintbasedmodelsareofteneasiertospecify
Scalingundirectedmodelsactiveareaofresearch
PART 3
PART3
SCALING ER TO BIGDATA
SCALINGERTOBIG
DATA
ScalingERtoBigData
Blocking/CanopyGeneration
Distributed ER
DistributedER
PART 3
PART3a
BLOCKING/CANOPY GENERATION
BLOCKING/CANOPYGENERATION
Blocking:Motivation
Navepairwise:|R|2 pairwise comparisons
1000
1000businesslistingseachfrom1,000differentcitiesacross
business listings each from 1,000 different cities across
theworld
1trillioncomparisons
11.6days (ifeachcomparisonis1s)
Mentionsfromdifferentcitiesareunlikelytobematches
BlockingCriterion:City
1billioncomparisons
16minutes(ifeachcomparisonis1s)
Blocking:Motivation
Mentionsfromdifferentcitiesareunlikelytobematches
Maymisspotentialmatches
May miss potential matches
Blocking:Motivation
MatchingPairs
ofRecords
PairsofRecords
satisfying
Blockingcriterion
SetofallPairs
ofRecords
BlockingAlgorithms1
Hashbasedblocking
EachblockCi isassociatedwithahashkeyhi.
Mentionx ishashedtoCi ifhash(x)=hi.
Withinablock,allpairsarecompared.
Each hash function results in disjoint blocks.
Eachhashfunctionresultsindisjointblocks.
Whathashfunction?
Deterministicfunctionofattributevalues
BooleanFunctionsoverattributevalues
[[Bilenko etalICDM06,MichelsonetalAAAI06,
,
,
DasSarma etalCIKM12]
minHash (minwiseindependentpermutations)
[Broder etalSTOC98]
BlockingAlgorithms2
Pairwise Similarity/Neighborhoodbasedblocking
Nearby
Nearbynodesaccordingtoasimilaritymetricareclustered
nodes according to a similarity metric are clustered
together
Resultsinnondisjointcanopies.
Techniques
SortedNeighborhoodApproach[HernandezetalSIGMOD95]
CanopyClustering[McCallumetalKDD00]
SimpleBlocking:InvertedIndexonaKey
Examplesofblockingkeys:
Firstthreecharactersoflastname
First
three characters of last name
City+State+Zip
CharacterorTokenngrams
Minimuminfrequentngrams
LearningOptimalBlockingFunctions
Usingoneormoreblockingkeysmaybeinsufficient
2,376,206American
2,376,206 AmericansssharedthesurnameSmithinthe2000US
shared the surname Smith in the 2000 US
NULLvaluesmaycreatelargeblocks.
Solution:Constructblockingfunctionsbycombining
simplefunctions
ComplexBlockingFunctions
Conjunctionoffunctions[MichelsonetalAAAI06,Bilenko etalICDM06]
{City}AND{lastfourdigitsofphone}
Chaintrees[DasSarma etalCIKM12]
If
If({City}=NULLorLA)then
({Ci } NULL LA) h {lastfourdigitsofphone}AND{areacode}
{l f
di i f h
} AND {
d }
else {lastfourdigitsofphone}AND{City}
LearninganOptimalfunction[Bilenko etalICDM06]
Findkblockingfunctionsthateliminatethemostnon
matches,whileretainingalmostallmatches.
,
g
Needatrainingsetofpositiveandnegativepairs
AlgorithmIdea:RedBlueSetCover
PositiveExamples
Blocking Keys
BlockingKeys
NegativeExamples
PickkBlockingkeyssuchthat
(a)Atmost bluenodesare
notcovered
(b)Numberofrednodes
coveredisminimized
LearninganOptimalfunction[Bilenko etalICDM06]
AlgorithmIdea:RedBlueSetCover
PositiveExamples
p
BlockingKeys
NegativeExamples
GreedyAlgorithm:
Greedy Algorithm
Constructgoodconjunctionsofblockingkeys{p1,p2,}.
Pick
k conjunctions {pi1,p
pi2,,p
pik}},suchthatthefollowingis
such that the following is
Pickkconjunctions{p
minimized
Let bearandompermutationoffeaturesinFx
E.g.,orderimposedbyarandomhashfunction
minHash(x)=minimumelementinFx accordingto
WhyminHash works?
Surprisingproperty:Forarandompermutation,
Howtobuildablockingschemesuchthatonlypairswith
How to build a blocking scheme such that only pairs with
Jacquardsimilarity>sfallinthesameblock(withhighprob)?
Probabilitythat
(x,y)mentionsare
bl k d t th
blockedtogether
Similarity(x,y)
BlockingusingminHashes
ComputeminHashes usingr*kpermutations(hash
functions)
)
Band
of r minHashes
Bandofr
k blocks
Signatures
Signature sthatmatchon1outofk
that match on 1 out of k bands,gotothe
bands go to the
sameblock.
minHash Analysis
FalseNegatives:(missingmatches)
P(pair x y notinthesameblock
P(pairx,y
not in the same block
withJacquardsim =s)
shouldbeverylowforhighsimilaritypairs
should be very low for high similarity pairs
False
Positives: (blocking nonmatches)
FalsePositives:(blockingnonmatches)
P(pairx,y inthesameblock
with Jacquard sim =s)
withJacquardsim
s)
Sim(s)
P(notsame
block)
0.9
108
0.8
0.00035
0.7
0.025
0.6
0.2
0.5
0.52
0.4
0.81
0.3
0.95
0.2
0.994
0.1
0.9998
CanopyClustering[McCallumetalKDD00]
Input:MentionsM,
d(x,y),adistancemetric,
thresholdsT1 >T2
Algorithm:
1 PickarandomelementxfromM
1.
Pi k
d
l
t f
M
2. CreatenewcanopyCx using
mentionsys.t.d(x,y)<T
y
( ,y) 1
3. Deleteallmentionsy fromM
s.t.d(x,y)<T2(fromconsiderationinthisalgorithm)
4. ReturntoStep1ifM isnotempty.
Inmultiple
canopies
Eachelement
h
hasasingle
i l
primarycanopy
PART 3 b
PART3b
DISTRIBUTED ER
DISTRIBUTEDER
DistributedER
Mapreduceisverypopularforlargetasks
M
d
i
l f l
k
Simpleprogrammingmodelformassivelydistributeddata
Hadoop providesfaulttoleranceandisopensource
MapPhase
(perrecordcomputation)
Shuffle
ReducePhase
(globalcomputation)
ERwithDisjointBlocking
ComputeBlocksinMap
Map Phase
MapPhase
(perrecordcomputation)
RemainingERinReduce
Reduce Phase
ReducePhase
(globalcomputation)
Shuffle
BlockID
Noneedtocompare
recordsacross
reducers
NondisjointBlocking
Howtoblock?
Hashbased:
Hash based:needanefficienttechniquetogrouprecordsif
need an efficient technique to group records if
theymatchonnoutofkblockingkeys[Vernica etalSIGMOD10]
Distancebased:canopyclusteringonmapreduce[Mahout]
IterativeBlocking[Whang etalSIGMOD09]
Problem:Informationneededforarecordisin
p
multiplereducers.
Informationneededforarecordisinmultiplereducers.
Example1:
Example 1:
Reducer1:amatcheswithb
Reducer2:amatcheswithc
Needtocommunicateinordertocorrectlyresolvea,b,c
N dt
i t i
d t
tl
l b
Problem:Informationneededforarecordisin
p
multiplereducers.
Example2:Dedup papersandauthors
Id
Author1
Author2
Paper
A1
JohnSmith
RichardJohnson IndicesandViews
A2
JSmith
RJohnson
SQLQueries
A3
Dr Smyth
Dr.Smyth
R Johnson
RJohnson
Slideadaptedfrom[Rastogi etalVLDB11]talk
Problem:Informationneededforarecordisin
p
multiplereducers.
Canopyclusteringresultsinnondisjointclusters
[McCallumetalKDD00]
J.Smith
Canopy
py
for
Richard
JohnS.
John
Richard
Smith JohnJacob
Johnson Richard
J h J b
Smith
RichardM.
R.Smith
Johnson
C
Canopy
forSmith
Canopy
f
for
John
Slideadaptedfrom[Rastogi etalVLDB11]talk
Problem:Informationneededforarecordisin
p
multiplereducers.
CoAuthor(A1,B1) CoAuthor(A2,B2) match(B1,B2) match(A1,A2)
CoAuthor rulegroundstothecorrelation
match(RichardJohnson,RJohnson)=>match(J.Smith,JohnSmith)
J.Smith
Richard
Johnson
Canopy
Canopy
for
Johnson
R
Johnson
Steve
Johnson
JohnS.
John
Smith JohnJacob
R.Smith
Canopy
forSmith
Canopy
for
John
Slideadaptedfrom[Rastogi etalVLDB11]talk
Problem:Informationneededforarecordisin
p
multiplereducers.
Solution1:EfficientlyfindConnectedComponents[Rastogi etal2012,
KangetalICDM2009]
+CorrelationClustering/CollectiveERineachcomponent
C
l i Cl
i / C ll i ER i
h
Solution2:CorrelationClustering/CollectiveERineachcanopy
g/
py
+MessagePassing[Rastogi etalVLDB11]
Problem:Informationneededforarecordisin
p
multiplereducers.
Solution1:EfficientlyfindConnectedComponents[Rastogi etal2012,
KangetalICDM2009]
+CorrelationClustering/CollectiveERineachcomponent
C
l i Cl
i / C ll i ER i
h
Connectedcomponentscanbelargeinrelational/multientityER.
p
g
/
y
Solution2:CorrelationClustering/CollectiveERineachcanopy
+MessagePassing[Rastogi etalVLDB11]
MessagePassing
Simple Message Passing (SMP)
SimpleMessagePassing(SMP)
1. RunentitymatcherM locallyineachcanopy
2. IfM findsamatch(r1,r2)insomecanopy,passitas
evidencetoallcanopies
3. RerunM withineachcanopyusingnewevidence
4. Repeatuntilnonewmatchesfoundineachcanopy
il
h f
di
h
Runtime:O(k
Runtime:
O(k2 f(k)c)
f(k) c)
k:maximumsizeofacanopy
f(k):TimetakenbyERoncanopyofsizek
c:numberofcanopies
Slideadaptedfrom[Rastogi etalVLDB11]talk
FormalProperties
forawellbehavedERmethod
Convergence:No.ofstepsno.ofmatches
Consistency:Outputindependentofthecanopyorder
y
p
p
py
Soundness:Eachoutputmatchisactuallyatruematch
Completeness:Eachtruematchisalsoaoutputmatch
J.Smith
JohnS.
John
Richard
Smith JohnJacob
Johnson Richard
Smith
RichardM.
R.Smith
Johnson
Slideadaptedfrom[Rastogi etalVLDB11]talk
Completeness
Papers2and3matchonlyifacanopy
knows that
knowsthat
match(a1,a2)
match(b2,b3)
match(c2,c3)
t h( 2 3)
Simplemessagepassingwillnotfindanymatches
thus,nomessagesarepassed,noprogress
Solution:Maximalmessagepassing
Sendamessageifthereisapotentialformatch
Slideadaptedfrom[Rastogi etalVLDB11]talk
SummaryofScalability
O(|R|2)pairwise computationscanbeprohibitive.
Blockingeliminatescomparisonsonalargefractionofnonmatches.
EqualitybasedBlocking:
Construct(oneormore)blockingkeysfromfeatures
Recordsnotmatchingonanykeyarenotcompared.
Records not matching on any key are not compared
Neighbohood basedBlocking:
Formoverlappingcanopiesofrecordsbasedonsimilarity.
Onlycomparerecordswithinacluster.
Computingconnectedcomponents/MessagePassinginadditionto
blocking can help distribute ER
blockingcanhelpdistributeER.
P t4
Part4
CHALLENGESANDFUTURE
CHALLENGES
AND FUTURE
DIRECTIONS
Challenges
Sofar,wehaveviewedERasaonetimeprocessappliedtoentire
database;noneoftheseholdinrealworld.
TemporalER
Temporal ER
ERalgorithmsneedtoaccountforchangeinrealworld
Reasoningaboutmultiplesources[Pal&M etal.WWW12]
Modeltransitions[LietalVLDB11]
Reasoningaboutsourcequality
Sourcesarenotindependent
CopyingProblem[DongetalVLDB09]
QueryTimeER
Howdoweselectivelydeterminethesmallestnumberofrecordstoresolve,so
wegetaccurateresultsforaparticularquery?
Collectiveresolutionforqueries
Collective resolution for queries [Bhattacharya&GetoorJAIR07]
[Bhattacharya & Getoor JAIR07]
ER&Usergenerateddata
Deduplicated entitiesinteractwithusersintherealworld
Userstag/associatephotos/reviewswithbusinessesonGoogle/Yahoo
Whatshouldbedonetosupportinteractions?
OpenIssues
ERisoftenpartofbiggerinferenceproblem
Pipelinedapproachesandjointapproachestoinformationextraction
and graph identification
andgraphidentification
HowcanwecharacterizehowERerrorsaffectoverallqualityof
results?
ERTheory
ER Theory
Needbettersupportfortheorywhichcangiverelationallearning
bounds
ER&Privacy
ER & Privacy
ERenablesrecordreidentification
HowdowedevelopatheoryofprivacypreservingER?
ERBenchmarks
ER B h
k
NeedforlargescalerealworldERdatasetswithgroundtruth
Syntheticdatausefulforscalingbuthardtocapturerichcomplexities
ofrealworld
f
l
ld
Summary
Growingomnipresenceofmassivelinkeddata,andtheneed
forcreatingknowledgebasesfromtextandunstructureddata
motivate a number of challenges in ER
motivateanumberofchallengesinER
EspeciallyinterestingchallengesandopportunitiesforERand
socialmedia/usergenerateddata
/
As
Asdata,noise,andknowledgegrows,greaterneeds&
data noise and knowledge grows greater needs &
opportunitiesforintelligentreasoningaboutentityresolution
Manyotherchallenges
M
th
h ll
Largescaleidentitymanagement
Understandingtheoreticalpotentials&limitsofER
THANK YOU!
THANKYOU!
References Intro
W.Willinger etal,MathematicsandtheInternet:ASourceofEnormousConfusionand
GreatPotential,NoticesoftheAMS56(5),2009
L Gill and M Goldcare English
L.GillandM.Goldcare,
EnglishNationalRecordLinkageofHospitalEpisodeStatisticsand
National Record Linkage of Hospital Episode Statistics and
DeathRegistrationRecords,ReporttotheDepartmentofHealth,2003
T.Herzogetal,DataQualityandRecordLinkageTechniques,Springer2007
A. Elmagrid etal,
A.Elmagrid
et al, Duplicate
DuplicateRecordDetection
Record Detection,,TKDE2007
TKDE 2007
P.Christen,DataMatching,Springer2012
N.Koudas etal,RecordLinkage:SimilaritymeasuresandAlgorithms,SIGMOD2006
X. Dong & F. Naumann, Data
X.Dong&F.Naumann,
Datafusion
fusionResolving
Resolvingdataconflictsforintegration
data conflicts for integration,,VLDB2009
VLDB 2009
L.Getoor &A.Machanavajjhala,EntityResolution:Theory,PracticeandOpenChallenges,
AAAI2012
References SingleEntityER
D.Menestrina etal,EvaluationEntityResolutionResults,PVLDB3(12),2010
M.Cochinwala etal,Efficientdatareconciliation,InformationSciences137(14),2001
M Bilenko &R.Mooney,
M.Bilenko
& R Mooney Adaptive
AdaptiveDuplicateDetectionUsingLearnableStringSimilarity
Duplicate Detection Using Learnable String Similarity
Measures,KDD2003
P.Christen,Automaticrecordlinkageusingseedednearestneighbour andsupportvector
machineclassification.,KDD2008
Z.Chenetal,Exploitingcontextanalysisforcombiningmultipleentityresolutionsystems,
SIGMOD2009
A.McCallum&B.Wellner,ConditionalModelsofIdentityUncertaintywithApplicationtoNoun
Coreference,NIPS2004
f
H.Newcombe etal,Automaticlinkageofvitalrecords,Science1959
I.Fellegi &A.Sunter,ATheoryforRecordLinkage,JASA1969
W.Winkler,OverviewofRecordLinkageandCurrentResearchDirections,ResearchReport
Series,USCensus,2006
T.Herzogetal,DataQualityandRecordLinkageTechniques,Springer,2007
P R ik
P.Ravikumar
&W C h
&W.Cohen,AHierarchicalGraphicalModelforRecordLinkage,UAI2004
A Hi
hi l G hi l M d l f R
d Li k UAI 2004
References SingleEntityER(contd.)
S.Sarawagi etal,InteractiveDeduplication usingActiveLearning,KDD2000
S.Tejada etal,LearningObjectIdentificationRulesforInformationIntegration,IS2001
A Arasu etal,
A.Arasu
et al On
Onactivelearningofrecordmatchingpackages
active learning of record matching packages,SIGMOD2010
SIGMOD 2010
K.Bellare etal,Activesamplingforentitymatching,KDD2012
A.Beygelzimer etal,AgnosticActiveLearningwithoutConstraints,NIPS2010
J Wang et al CrowdER:
J.Wangetal,
CrowdER:Crowdsourcing
Crowdsourcing EntityResolution
Entity Resolution,PVLDB5(11),2012
PVLDB 5(11) 2012
A.Marcusetal,HumanpoweredSortsandJoins,PVLDB5(1),2011
References SingleEntityER(contd.)
R.Gupta&S.Sarawagi,
R.
Gupta & S. Sarawagi, Answering
AnsweringTableAugmentationQueriesfromUnstructuredListsontheWeb
Table Augmentation Queries from Unstructured Lists on the Web,,
PVLDB2(1),2009
A.DasSarma etal,AnAutomaticBlockingMechanismforLargeScaleDeduplicationTasks,CIKM
2012
M Bilenko etal,
M.Bilenko
et al Adaptive
AdaptiveProductNormalization:UsingOnlineLearningforRecordLinkagein
Product Normalization: Using Online Learning for Record Linkage in
ComparisonShopping,ICDM2005
S.Chaudhuri etal,RobustIdentificationofFuzzyDuplicates,ICDE2005
W.Soonetal,Amachinelearningapproachtocoreference resolutionofnounphrases,
C
ComputationalLinguistics27(4)2001
t ti
l Li
i ti 27(4) 2001
N.Bansal etal,CorrelationClustering,MachineLearning56(13),2004
V.Ng&C.Cardie,Improvingmachinelearningapproachestocoreference resolution,ACL2002
,
g
p
g
M.Elsner &E.Charnaik,Youtalkingtome?acorpusandalgorithmforconversation
disentanglement,ACLHLT2008
M.Elsner &W.Schudy,BoundingandComparingMethodsforCorrelationClusteringBeyondILP,
ILPNLP2009
N Ailon etal,
N.Ailon
et al Aggregating
Aggregatinginconsistentinformation:Rankingandclustering
inconsistent information: Ranking and clustering,JACM55(5),2008
JACM 55(5) 2008
X.Dongetal,IntegratingConflictingData:TheRoleofSourceDependence,PVLDB2(1),2009
A.Paletal,InformationIntegrationoverTimeinUnreliableandUncertainEnvironments,WWW
2012
A.Culotta etal,CanonicalizationofDatabaseRecordsusingAdaptiveSimilarityMeasures,KDD2007
O.Benjelloun etal,Swoosh:AgenericapproachtoEntityResolution,VLDBJ18(1),2009
References Constraints&MultiRelationalER
R.Ananthakrishna
A
h k i h et.al,Eliminatingfuzzyduplicatesindatawarehouses,VLDB2002
l li i i f
d li
i d
h
VL 2002
A.Arasu etal,LargeScaleDeduplication withConstraintsusingDedupalog,ICDE2009
S.Chaudhuri etal.,Leveragingaggregateconstraintsfordeduplication,SIGMOD07
X.Dongetal,ReferenceRecounciliation inComplexInformationSpaces,SIGMOD2005
I.Bhattacharya&L.Getoor,CollectiveEntityResolutioninRelationalData,TKDD2007
I.Bhattacharya&L.Getoor,ALatentDirichlet ModelforUnsupervisedEntityResolution,SDM2007
P.Bohannonetal.,ConditionalFunctionalDependenciesforDataCleaning,ICDE 2007
M.Broecheler &L.Getoor,ProbabilisticSimilarityLogic,UAI2010
,
y g ,
W.Fan,Dependenciesrevisitedforimprovingdataquality,PODS2008
H.Pasula etal,IdentityUncertaintyandCitationMatching,NIPS2002
D.Kalashnikovetal,DomainIndependentDataCleaningviaAnalysisofEntityRelationshipGraph,
TODS06
J.Laffertyetal,ConditionalRandomFields:ProbabilisticModelsforSegmentingandLabeling
SequenceData.,ICML2001
X.Lietal,IdentificationandTracingofAmbiguousNames:DiscriminativeandGenerative
Approaches,AAAI2004
A.McCallum&B.Wellner,ConditionalModelsofIdentityUncertaintywithApplicationtoNoun
Coreference,NIPS2004
M.Richardson&P.Domingos,MarkovLogic,MachineLearning62,2006
W.Shen etal.,ConstraintbasedEntityMatching,AAAI2005
P.Singla &P.Domingos,EntityResolutionwithMarkovLogic,ICDM2006
Whang etal.,GenericEntityResolutionwithNegativeRules,VLDBJ2009
Whang elal.,JointEntityResolution,ICDE2012
References Blocking
M.Bilenko etal,AdaptiveBlocking:LearningtoScaleUpRecordLinkageandClustering,ICDM
2006
M.Michelson&C.Knoblock,LearningBlockingSchemesforRecordLinkage,AAAI2006
g
g
g
A.DasSarma etal,AnAutomaticBlockingMechanismforLargeScaleDeduplicationTasks,
CIKM2012
A.Broder etal,MinWiseIndependentPermutations,STOC1998
G P di etal,Beyond100millionentities:largescaleblockingbasedresolutionfor
G.Papadias
l B
d 100 illi
ii l
l bl ki b d
l i f
heterogenous data,WSDM2012
M.Hernandez&S.Stolfo,Themerge/purgeproblemforlargedatabases,SIGMOD1995
A. McCallum et al, Efficient
A.McCallumetal,
Efficientclusteringofhigh
clustering of highdimensional
dimensionaldatasetswithapplicationto
data sets with application to
referencematching,KDD2000
L.Kolbetal,Dedoop:Efficientdeduplication withHadoop,(demo)PVLDB5(12),2012
R.Vernica etal,EfficientParallelSetSimilarityJoinsUsingMapReduce,SIGMOD2010
ApacheMahout:ScalableMachineLearningandDataMining,http://mahout.apache.org/
S.Whang etal,EntityResolutionwithIterativeBlocking,SIGMOD2009
U.Kangetal,PEGASUS:APetaScaleGraphMiningSystem Implementationand
Observations,ICDM2009
Observations
ICDM 2009
V.Rastogi etal,FindingConnectedComponentsonMapreduceinPolyLogRounds,Corr 2012
V.Rastogi etal,LargeScaleCollectiveEntityMatching,PVLDB4(4),2011
References Challenges&FutureDirections
I.BhattacharyaandL.Getoor,"QuerytimeEntityResolution",JAIR2007
X.Dong,L.BertiEquille,D.Srivastava,Truthdiscoveryandcopyingdetectioninadynamic
world,VLDB2009
world
VLDB 2009
P.Li,X.Dong,A.Maurino,D.Srivastava,LinkingTemporalRecords,VLDB2011
A. Pal,V.Rastogi,A.Machanavajjhala,P.Bohannon,Informationintegrationovertimein
unreliable and uncertain environments,,WWW2012
unreliableanduncertainenvironments
WWW 2012