Entity Resolution: Tutorial

Lise Getoor
University of Maryland, College Park, MD
Ashwin Machanavajjhala
Duke University, Durham, NC
http://www.cs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
http://goo.gl/f5eym
What is Entity Resolution?
Problem of identifying and linking/grouping different manifestations of the same real-world object.
Examples of manifestations and objects:
  Different ways of addressing (names, email addresses, Facebook accounts) the same person in text.
  Web pages with differing descriptions of the same business.
  Different photos of the same object.

Ironically, Entity Resolution has many duplicate names:
  Record linkage
  Coreference resolution
  Duplicate detection
  Reference reconciliation
  Fuzzy match
  Object consolidation
  Object identification
  Deduplication
  Approximate match
  Entity clustering
  Identity uncertainty
  Hardening soft databases
  Doubles
  Merge/purge
  Household matching
  Householding
  Reference matching
ER Motivating Examples
  Linking Census Records
  Public Health
  Web search
  Comparison shopping
  Counter-terrorism
  Spam detection
  Machine Reading
ER and Network Analysis
[Figure: a network before and after entity resolution]
Motivation: Network Science
Measuring the topology of the internet using traceroute
IP Aliasing Problem [Willinger et al. 2009]
Traditional Challenges in ER
Name/Attribute ambiguity
  Thomas Cruise
  Michael Jordan
Errors due to data entry
Missing Values
  [Gill et al; Univ of Oxford 2003]
Changing Attributes
Data formatting
Abbreviations / Data Truncation
Big-Data ER Challenges
Larger and more Datasets
  Need efficient parallel techniques
More Heterogeneity
  Unstructured, Unclean and Incomplete data. Diverse data types.
  No longer just matching names with names, but Amazon profiles with browsing history on Google and friends network in Facebook.
More linked
  Need to infer relationships in addition to "equality"
Multi-Relational
  Deal with structure of entities (Are Walmart and Walmart Pharmacy the same?)
Multi-domain
  Customizable methods that span across domains
Multiple applications (web search versus comparison shopping)
  Serve diverse applications with different accuracy requirements
Outline
1. Abstract Problem Statement
2. Algorithmic Foundations of ER
   a) Data Preparation and Match Features
   b) Pairwise ER
   c) Constraints in ER
   d) Algorithms
      Record Linkage
      Deduplication
      Collective ER
3. Scaling ER to Big-Data
   a) Blocking/Canopy Generation
   b) Distributed ER
4. Challenges & Future Directions
Scope of the Tutorial
What we cover:
  Fundamental algorithmic concepts in ER
  Scaling ER to big datasets
  Taxonomy of current ER algorithms
What we do not cover:
  Schema/ontology resolution
  Data fusion/integration/exchange/cleaning
  Entity/Information Extraction
  Privacy aspects of Entity Resolution
  Details on similarity measures
  Technical details and proofs
ER References
Book / Survey Articles
  Data Quality and Record Linkage Techniques [T. Herzog, F. Scheuren, W. Winkler, Springer, 07]
  Duplicate Record Detection [A. Elmagarmid, P. Ipeirotis, V. Verykios, TKDE 07]
  An Introduction to Duplicate Detection [F. Naumann, M. Herschel, M&P synthesis lectures 2010]
  Evaluation of Entity Resolution Approaches on Real-world Match Problems [H. Köpcke, A. Thor, E. Rahm, PVLDB 2010]
  Data Matching [P. Christen, Springer 2012]
Tutorials
  Record Linkage: Similarity measures and Algorithms [N. Koudas, S. Sarawagi, D. Srivastava, SIGMOD 06]
  Data fusion: Resolving data conflicts for integration [X. Dong, F. Naumann, VLDB 09]
  Entity Resolution: Theory, Practice and Open Challenges, http://goo.gl/Ui38o [L. Getoor, A. Machanavajjhala, AAAI 12]
PART 1
ABSTRACT PROBLEM STATEMENT
[Figure: entities in the Real World and their Records/Mentions in the Digital World]
Deduplication Problem Statement
  Cluster the records/mentions that correspond to the same entity
  Intensional Variant: Compute cluster representative
Record Linkage Problem Statement
  Link records that match across databases
Reference Matching Problem
  Match noisy records to clean records in a reference table
  [Figure: Real World, Digital World, Reference Table]
Deduplication with Canonicalization
Graph Alignment (& motif search)
  [Figure: Graph 1 and Graph 2]
  Relationships are crucial
Notation
R: set of records/mentions (typed)
H: set of relations/hyperedges (typed)
M: set of matches (record pairs that correspond to same entity)
N: set of non-matches (record pairs corresponding to different entities)
E: set of entities
L: set of links
True (Mtrue, Ntrue, Etrue, Ltrue): according to real world
vs Predicted (Mpred, Npred, Epred, Lpred): by algorithm
Relationship between Mtrue and Mpred
  Mtrue (SameAs, Equivalence)
  Mpred (Similar representations and similar attributes)
  [Figure: Mtrue and Mpred as subsets of R x R]
Metrics
Pairwise metrics
  Precision/Recall, F1
  # of predicted matching pairs
Cluster-level metrics
  purity, completeness, complexity
  Precision/Recall/F1: Cluster-level, closest cluster, MUC, B3, Rand Index
  Generalized merge distance [Menestrina et al, PVLDB 10]
Little work evaluates correct prediction of links
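The pairwise metrics can be made concrete with a small sketch. It assumes that both the true and the predicted resolution are given as lists of clusters and derives the match sets (Mtrue, Mpred) from them; the function names and the toy clusters are illustrative, not part of the tutorial.

```python
from itertools import combinations

def pairs_from_clusters(clusters):
    """Return the set of unordered record pairs placed in the same cluster."""
    pairs = set()
    for cluster in clusters:
        for x, y in combinations(sorted(cluster), 2):
            pairs.add((x, y))
    return pairs

def pairwise_prf(m_true, m_pred):
    """Pairwise precision, recall and F1 of predicted matches against true matches."""
    tp = len(m_true & m_pred)
    precision = tp / len(m_pred) if m_pred else 0.0
    recall = tp / len(m_true) if m_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy data (made up): the algorithm splits c from {a, b, c} and glues it to d.
true_clusters = [{"a", "b", "c"}, {"d"}]
pred_clusters = [{"a", "b"}, {"c", "d"}]
m_true = pairs_from_clusters(true_clusters)
m_pred = pairs_from_clusters(pred_clusters)
precision, recall, f1 = pairwise_prf(m_true, m_pred)
```

Here one of the two predicted pairs is correct and one of the three true pairs is found, so precision is 1/2 and recall is 1/3.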
Typical Assumptions Made
Each record/mention is associated with a single real-world entity.
In record linkage, no duplicates in the same source
If two records/mentions x and y are identical, then they are true matches: (x, y) ∈ Mtrue
ER versus Classification
Finding matches vs non-matches is a classification problem
Imbalanced: typically O(R) matches, O(R^2) non-matches
Instances are pairs of records. Pairs are not IID:
  (x, y) ∈ Mtrue AND (y, z) ∈ Mtrue ⇒ (x, z) ∈ Mtrue
ER vs (Multi-relational) Clustering
Computing entities from records is a clustering problem
In typical clustering algorithms (k-means, LDA, etc.) the number of clusters is a constant or sub-linear in R.
In ER: the number of clusters is linear in R, and the average cluster size is a constant. A significant fraction of clusters are singletons.
PART 2
ALGORITHMIC FOUNDATIONS OF ER
Outline of Part 2
a) Data Preparation and Match Features
b) Pairwise ER
   Determining whether or not a pair of records match
c) Constraints in ER
5 minute break
d) Algorithms
   Record linkage (Propagation through exclusivity negative constraint),
   Deduplication (Propagation through transitivity positive constraint),
   Collective (Propagation through General Constraints)
MOTIVATING EXAMPLE: BIBLIOGRAPHIC DOMAIN
Entities & Relations in Bibliographic Domain
[Figure: schema of the bibliographic domain]
Entities and attributes:
  Paper: Title, # of Authors, Topic, Word1, Word2, ..., WordN
  Author: Name, Research Area
  Institution: Name
  Venue: Name
Mentions:
  Paper Mention: TitleString
  Author Mention: NameString
  Institute Mention: NameString
  Venue Mention: NameString
Relations: Wrote, Cites, WorksAt, AppearsIn
Edge legend: entity relationships, co-occurrence relationships, resolution relationships
PART 2a
DATA PREPARATION & MATCH FEATURES
Normalization
Schema normalization
  Schema Matching, e.g., contact number and phone number
  Compound attributes: full address vs str, city, state, zip
  Nested attributes
    List of features in one dataset (air conditioning, parking) vs each feature a boolean attribute
  Set-valued attributes
    Set of phones vs primary/secondary phone
  Record segmentation from text
Data normalization
  Often convert to all lower/all upper; remove whitespace
  Detecting and correcting values that contain known typographical errors or variations
  Expanding abbreviations and replacing them with standard forms; replacing nicknames with their proper name forms
  Usually done based on dictionaries (e.g., commercial dictionaries, postal addresses, etc.)
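As a rough illustration of dictionary-based data normalization, the sketch below lower-cases a value, strips punctuation, collapses whitespace, and expands tokens found in two lookup tables. The tables `ABBREVIATIONS` and `NICKNAMES` are tiny hypothetical stand-ins for the curated dictionaries mentioned above, and blind token replacement is a simplification (for example, "st" may mean "street" or "saint").

```python
import re

# Hypothetical lookup tables; real systems would use curated dictionaries.
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}
NICKNAMES = {"bob": "robert", "bill": "william"}

def normalize(value):
    """Lower-case, strip punctuation, collapse whitespace, expand known tokens."""
    text = value.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    tokens = text.split()                   # split() also collapses whitespace
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    tokens = [NICKNAMES.get(t, t) for t in tokens]
    return " ".join(tokens)

address = normalize("  Hillsborough   Rd., Durham ")
person = normalize("Bob  Smith")
```

With these tables, the address becomes "hillsborough road durham" and the name becomes "robert smith".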
Matching Features
For two references x and y, compute a "comparison" vector of similarity scores of component attributes.
  [1st-author-match-score, paper-match-score, venue-match-score, year-match-score, ...]
Similarity scores
  Boolean (match or not-match)
  Real values based on distance functions
Summary of Matching Features
Equality on a boolean predicate
Edit distance: Levenshtein, Smith-Waterman, Affine (handle typographical errors)
Set similarity: Jaccard, Dice
Vector-based: Cosine similarity, TF-IDF (good for text like reviews/tweets)
Alignment-based or Two-tiered: Jaro-Winkler, Soft-TFIDF, Monge-Elkan (good for names)
Phonetic similarity: Soundex
Translation-based (useful for abbreviations, alternate names)
Numeric distance between values
Domain-specific
Useful packages:
  SecondString, http://secondstring.sourceforge.net/
  Simmetrics: http://sourceforge.net/projects/simmetrics/
  LingPipe, http://aliasi.com/lingpipe/index.html
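Two of the listed measures are easy to sketch without any library: Levenshtein edit distance (turned into a similarity by dividing by the longer string's length, which is one common normalization and an assumption here) and Jaccard similarity over token sets. The packages above provide tuned implementations; this is only a minimal illustration.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def edit_similarity(s, t):
    """Map edit distance into [0, 1]; 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return 1.0 if not a and not b else len(a & b) / len(a | b)

typo_distance = levenshtein("starbucks", "starbacks")
token_overlap = jaccard("j cox".split(), "cox j".split())
```

The edit distance absorbs the single-character typo, while the token-set Jaccard score ignores word order, which is why it treats "j cox" and "cox j" as identical.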
Relational Matching Features
Relational features are often set-based
  Set of coauthors for a paper
  Set of cities in a country
  Set of products manufactured by manufacturer
Can use set similarity functions mentioned earlier
  Common Neighbors: Intersection size
  Jaccard's Coefficient: Normalize by union size
  Adar Coefficient: Weighted set similarity
Can reason about similarity in sets of values
  Average or Max
  Other aggregates
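The three set-based relational features can be sketched as follows. The Adar coefficient is implemented in its usual Adamic/Adar form, where each shared neighbor z contributes 1/log(degree(z)); the `degree` table and the coauthor sets are made-up illustration data.

```python
import math

def common_neighbors(n_x, n_y):
    """Number of neighbors shared by the two references."""
    return len(n_x & n_y)

def jaccard_coefficient(n_x, n_y):
    """Shared neighbors normalized by the size of the union."""
    union = n_x | n_y
    return len(n_x & n_y) / len(union) if union else 0.0

def adar_coefficient(n_x, n_y, degree):
    """Shared neighbors weighted by 1/log(degree); rare neighbors count more.

    Neighbors with degree <= 1 are skipped to avoid dividing by log(1) = 0.
    """
    return sum(1.0 / math.log(degree[z]) for z in n_x & n_y if degree[z] > 1)

# Hypothetical coauthor sets of two author mentions and coauthor degrees.
coauthors_x = {"aho", "sethi", "ullman"}
coauthors_y = {"aho", "ullman", "kernighan"}
degree = {"aho": 4, "ullman": 2, "sethi": 3, "kernighan": 5}

shared = common_neighbors(coauthors_x, coauthors_y)
jac = jaccard_coefficient(coauthors_x, coauthors_y)
adar = adar_coefficient(coauthors_x, coauthors_y, degree)
```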
PART 2b
PAIRWISE MATCHING
Pairwise Match Score
Problem: Given a vector of component-wise similarities for a pair of records (x, y), compute P(x and y match).
Solutions:
1. Weighted sum or average of component-wise similarity scores. Threshold determines match or non-match.
   0.5 * 1st-author-match-score + 0.2 * venue-match-score + 0.3 * paper-match-score
   Hard to pick weights.
     Match on last name is more predictive than match on login name.
     Match on "Smith" is less predictive than match on "Getoor" or "Machanavajjhala".
   Hard to tune a threshold.
2. Formulate rules about what constitutes a match.
   (1st-author-match-score > 0.7 AND venue-match-score > 0.8)
   OR (paper-match-score > 0.9 AND venue-match-score > 0.9)
   Manually formulating the right set of rules is hard.
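Both solutions can be written down directly. The sketch below uses the weights and rule thresholds from the slide; the dictionary keys, the 0.75 decision threshold for the weighted sum, and the example scores are assumptions made for illustration.

```python
# Weights taken from the example above; the keys are illustrative names.
WEIGHTS = {"first_author": 0.5, "venue": 0.2, "paper": 0.3}

def weighted_score(scores, weights=WEIGHTS):
    """Solution 1: weighted sum of component-wise similarity scores."""
    return sum(weights[k] * scores[k] for k in weights)

def weighted_match(scores, threshold=0.75):
    """Declare a match when the weighted score reaches a (hypothetical) threshold."""
    return weighted_score(scores) >= threshold

def rule_match(scores):
    """Solution 2: the hand-written rule from the example above."""
    return ((scores["first_author"] > 0.7 and scores["venue"] > 0.8)
            or (scores["paper"] > 0.9 and scores["venue"] > 0.9))

# Made-up comparison vector for one record pair.
pair = {"first_author": 0.9, "venue": 0.85, "paper": 0.6}
score = weighted_score(pair)
```

For this pair the weighted score is 0.45 + 0.17 + 0.18 = 0.80, and the rule fires through its first clause. Changing either the weights or the threshold flips borderline pairs, which is the tuning difficulty the slide points out.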
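Basic ML Approach
r = (x, y) is record pair, γ is comparison vector, M matches, U non-matches
Decision rule:
  $R = \frac{P(\gamma \mid r \in M)}{P(\gamma \mid r \in U)}$
  $R \ge t \Rightarrow r \to$ Match
  $R < t \Rightarrow r \to$ Non-Match
Fellegi & Sunter Model [FS, Science 69]
r = (x, y) is record pair, γ is comparison vector, M matches, U non-matches
Decision rule:
  $R = \frac{P(\gamma \mid r \in M)}{P(\gamma \mid r \in U)}$
  $R \ge t_l \Rightarrow r \to$ Match
  $t_l > R > t_u \Rightarrow r \to$ Potential Match
  $R \le t_u \Rightarrow r \to$ Non-Match
Naive Bayes Assumption: $P(\gamma \mid r \in M) = \prod_i P(\gamma_i \mid r \in M)$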
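A minimal sketch of the likelihood-ratio rule under the naive Bayes assumption, for a binary comparison vector. The per-feature agreement probabilities and the two thresholds are made-up numbers; in practice they would be estimated (for example with EM) rather than set by hand, and the names `match_threshold` and `non_match_threshold` are illustrative.

```python
import math

# Hypothetical per-feature probabilities of agreement:
# M_PROBS[f] = P(feature f agrees | pair is a match),
# U_PROBS[f] = P(feature f agrees | pair is a non-match).
M_PROBS = {"name": 0.95, "venue": 0.80, "year": 0.90}
U_PROBS = {"name": 0.01, "venue": 0.10, "year": 0.20}

def log_likelihood_ratio(gamma):
    """log R for a binary comparison vector, multiplying per-feature ratios."""
    total = 0.0
    for feature, agrees in gamma.items():
        m, u = M_PROBS[feature], U_PROBS[feature]
        total += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return total

def classify(gamma, match_threshold=math.log(100), non_match_threshold=math.log(0.1)):
    """Two-threshold decision rule; both thresholds are illustrative."""
    log_r = log_likelihood_ratio(gamma)
    if log_r >= match_threshold:
        return "match"
    if log_r <= non_match_threshold:
        return "non-match"
    return "potential match"

all_agree = classify({"name": True, "venue": True, "year": True})
name_only = classify({"name": True, "venue": False, "year": False})
none_agree = classify({"name": False, "venue": False, "year": False})
```

Working in log space turns the product over features into a sum and avoids underflow when there are many features.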
ML Pairwise Approaches
Supervised machine learning algorithms
  Decision trees [Cochinwala et al, IS 01]
  Support vector machines [Bilenko & Mooney, KDD 03]; [Christen, KDD 08]
  Ensembles of classifiers [Chen et al., SIGMOD 09]
  Conditional Random Fields (CRF) [Gupta & Sarawagi, VLDB 09]
Issues:
  Training set generation
  Imbalanced classes: many more negatives than positives (even after eliminating obvious non-matches using Blocking)
  Misclassification cost
Creating a Training Set is a key issue
Constructing a training set is hard, since most pairs of records are "easy non-matches".
  100 records from each of 100 cities (10^4 records in total).
  Only 10^6 pairs out of total 10^8 (1%) come from the same city
Some pairs are hard to judge even by humans
  Inherently ambiguous
    E.g., Paris Hilton (person or business)
  Missing attributes
    Starbucks, Toronto vs Starbucks, Queen Street, Toronto
Avoiding Training Set Generation
Unsupervised/Semi-supervised Techniques
  EM-based techniques to learn parameters [Winkler 06, Herzog et al 07]
  Generative Models [Ravikumar & Cohen, UAI 04]
Active Learning
  Committee of Classifiers [Sarawagi et al KDD 00, Tejada et al IS 01]
  Provably optimizing precision/recall [Arasu et al SIGMOD 10, Bellare et al KDD 12]
  Crowdsourcing [Wang et al VLDB 12, Marcus et al VLDB 12, ...]
Committee of Classifiers [Tejada et al, IS 01]
Active Learning with Provable Guarantees
Most active learning techniques minimize 0-1 loss [Beygelzimer et al NIPS 2010].
However, ER is very imbalanced:
  Number of non-matches > 100 * number of matches.
  Classifying all pairs as "non-matches" has low 0-1 loss (< 1%).
Hence, need active learning techniques that optimize precision/recall.
Monotonicity of Precision [Arasu et al SIGMOD 10]
  [Figure: regions C1 and C2 of the similarity space]
  There is a larger fraction of matches in C1 than in C2.
  Algorithm searches for the optimal classifier using binary search on each dimension
[Bellare et al KDD 12]
  O(log^2 n) calls to a blackbox 0-1 loss active learning algorithm.
  Exponentially smaller label complexity than [Arasu et al SIGMOD 10] (in the worst case).
  1. Precision Constrained → Weighted 0-1 Loss Problem (using a Lagrange Multiplier λ).
  2. Given a fixed value for λ, weighted 0-1 Loss can be optimized by (one call to) a blackbox active learning classifier.
  3. Right value of λ is computed by searching over all optimal classifiers.
     Classifiers are embedded in a 2-d plane (precision/recall)
     Search is along the convex hull of the embedded classifiers
Crowdsourcing
Growing interest in integrating human computation in declarative workflow engines.
  ER is an important problem (e.g., for evaluating fuzzy joins) [Wang et al VLDB 12, Marcus et al VLDB 12, ...]
Opportunity: utilize crowdsourcing for creating training sets, or for active learning.
Key open issue: Handling errors in human judgments
  In an experiment on Amazon Mechanical Turk:
    Pairwise matching judgment, each given to 5 different people
    Majority of workers agreed on truth on only 90% of pairwise judgments.
Summary of Single-Entity ER Algorithms
Many algorithms for independent classification of pairs of records as match/non-match
ML-based classification & Fellegi-Sunter
  Pro: Advanced state of the art
  Con: Building high-fidelity training sets is a hard problem
Active Learning & Crowdsourcing for ER are active areas of research.
PART 2c
CONSTRAINTS
Constraints
Important forms of constraints:
  Transitivity: If M1 and M2 match, M2 and M3 match, then M1 and M3 match
  Exclusivity: If M1 matches with M2, then M3 cannot match with M2
  Functional Dependency: If M1 and M2 match, then M3 and M4 must match
Transitivity is key to deduplication
Exclusivity is key to record linkage
Functional dependencies for data cleaning, e.g., [Ananthakrishna et al., VLDB 02] [Fan, PODS 08] [Bohannon et al, ICDE 07]
Positive & Negative Evidence
Positive
  Transitivity: If M1 and M2 match, M2 and M3 match, then M1 and M3 match
  Exclusivity: If M1 doesn't match with M2, then M3 can match with M2
  Functional Dependency: If M1 and M2 match, then M3 and M4 must match
Negative
  Transitivity: If M1 and M2 match, M2 and M3 do not match, then M1 and M3 do not match
  Functional Dependency: If M1 and M2 do not match, then M3 and M4 cannot match
Constraint Types
Positive Evidence
  Hard Constraint: If M1, M2 match then M3, M4 must match
    If two papers match, their venues match
  Soft Constraint: If M1, M2 match then M3, M4 more likely to match
    If two venues match, then their papers are more likely to match
Negative Evidence
  Hard Constraint: Mention M1 and M2 must refer to distinct entities (Uniqueness)
    Coauthors are distinct
  Hard Constraint: If M1, M2 don't match then M3, M4 cannot match
    If two venues don't match, then their papers don't match
  Soft Constraint: If M1, M2 don't match then M3, M4 less likely to match
    If institutions don't match, then authors less likely to match
Note that some of the constraints may be relational and require joins
May be directional or bidirectional
Constraints can be recursive, e.g., if two authors have matching coauthors, then they match
Additional Constraints
Aggregate Constraints [Chaudhuri et al. SIGMOD 07]
  count constraints
    Entity A can link to at most N Bs
    Authors have at most 5 papers at any conference
  Other aggregates like sum, average are more complex
Again, these can be either hard or soft constraints, and provide positive or negative evidence
Match Dependencies
When matching decisions depend on other matching decisions (in other words, matching decisions are not made independently), we refer to the approach as collective
Match Extent
Global: If two papers match, then their venues match
  This constraint can be applied to all instances of venue mentions
  All occurrences of "SIGMOD" can be matched to "International Conference on Management of Data"
Local: If two papers match, then their authors match
  This constraint can only be applied locally
  Don't want to match all occurrences of "J. Smith" with "Jeff Smith", only in the context of the current paper
Ex. Semantic Integrity Constraints [Shen, Li & Doan, AAAI 05]
Type: Example
  Aggregate: C1 = No researcher has published more than five AAAI papers in a year
  Subsumption: C2 = If a citation X from DBLP matches a citation Y in a homepage, then each author mentioned in Y matches some author mentioned in X
  Neighborhood: C3 = If authors X and Y share similar names and some coauthors, they are likely to match
  Incompatible: C4 = No researcher exists who has published in both HCI and numerical analysis
  Layout: C5 = If two mentions in the same document share similar names, they are likely to match
  Key/Uniqueness: C6 = Mentions in the PC listing of a conference refer to different researchers
  Ordering: C7 = If two citations match, then their authors will be matched in order
  Individual: C8 = The researcher with the name "Mayssam Saria" has fewer than five mentions in DBLP (new graduate student)
Algorithms for Handling Constraints
Record linkage: propagation through exclusivity
  Weighted k-partite matching
Deduplication: propagation through transitivity
  Correlation clustering
Collective: propagation through general constraints
  Similarity propagation
    Dependency graphs, Collective Relational Clustering
  Probabilistic approaches
    LDA, CRFs, Markov Logic Networks, Probabilistic Relational Models
  Hybrid approaches
    Dedupalog
PART 2d
ALGORITHMS
RECORD LINKAGE
1-1 assumption
Matching between (almost) deduplicated databases.
Each record in one database matches at most one record in another database.
Pairwise ER may match a record in one database with more than one record in second database
Weighted K-Partite Matching
[Figure: records from several databases connected by weighted edges]
Edges between pairs of records from different databases
Edge weights
  o Pairwise match score
  o Log odds of matching
Find a matching (each record matches at most one other record from other database) that maximizes the sum of weights.
General problem is NP-hard (3D matching)
Successive bipartite matching is typically used. [Gupta & Sarawagi, VLDB 09]
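For the two-database (bipartite) case, the 1-1 assumption can be illustrated with an exhaustive search over matchings. This is not the algorithm of any cited paper: it is a brute-force sketch that only works for tiny inputs, and the score matrix is made up. It drops pairs with non-positive weight so that a record may stay unmatched.

```python
from itertools import permutations

def best_bipartite_matching(weights):
    """Exhaustively find the 1-1 matching with maximum total weight.

    weights[i][j] is the match score between record i of database A and
    record j of database B; only pairs with positive weight are kept.
    Exponential time, so only suitable for tiny illustrative inputs.
    """
    n_a, n_b = len(weights), len(weights[0])
    assert n_a <= n_b, "put the smaller database first"
    best_total, best_pairs = float("-inf"), []
    for cols in permutations(range(n_b), n_a):
        pairs = [(i, j) for i, j in enumerate(cols) if weights[i][j] > 0]
        total = sum(weights[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs

# Made-up log-odds style scores: both A-records prefer B-record 0.
scores = [
    [2.0, 1.5, -1.0],
    [1.8, -0.5, -2.0],
]
total, matching = best_bipartite_matching(scores)
```

Independent pairwise decisions would link both A-records to B-record 0, violating exclusivity; the matching instead assigns A0 to B1 and A1 to B0. For realistic sizes a polynomial-time assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual choice.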
DEDUPLICATION
Deduplication => Transitivity
Often pairwise ER algorithms output "inconsistent" results
  (x, y) ∈ Mpred, (y, z) ∈ Mpred, but (x, z) ∉ Mpred
Idea: Correct this by adding additional matches using transitive closure
In certain cases, this is a bad idea.
  Graphs resulting from pairwise ER have diameter > 20 [Rastogi et al CoRR 12]
  [Figure: match graph with edges added by transitive closure]
Need clustering solutions that deal with this problem directly by reasoning about records jointly.
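Transitive closure over predicted matches amounts to finding connected components of the match graph. A union-find sketch (the record names and matches are illustrative):

```python
def transitive_closure_clusters(records, matches):
    """Group records into connected components of the predicted-match graph."""
    parent = {r: r for r in records}

    def find(r):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for x, y in matches:
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(clusters.values(), key=lambda c: sorted(c))

# (x, y) and (y, z) are predicted matches, (x, z) is not.
records = ["x", "y", "z", "w"]
m_pred = [("x", "y"), ("y", "z")]
clusters = transitive_closure_clusters(records, m_pred)
```

The closure puts x and z in the same cluster even though no pairwise decision supported it. On a long chain of weak matches the same mechanism merges records that are many hops apart, which is the failure mode described above.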
Clustering-based ER
Resolution decisions are not made independently for each pair of records
Based on variety of clustering algorithms, but
  Number of clusters unknown a priori
  Many, many small (possibly singleton) clusters
Often take a pairwise similarity graph as input
May require the construction of a cluster representative or canonical entity
Clustering Methods for ER
Hierarchical Clustering [Bilenko et al, ICDM 05]
Nearest Neighbor based methods [Chaudhuri et al, ICDE 05]
Correlation Clustering [Soon et al CL 01, Bansal et al ML 04, Ng et al ACL 02, Ailon et al JACM 08, Elsner et al ACL 08, Elsner et al ILP-NLP 09]
Integer Linear Programming view of ER
$r_{xy} \in \{0,1\}$, $r_{xy} = 1$ if records x and y are in the same cluster.
$w^{+}_{xy} \in [0,1]$, cost of clustering x and y together
$w^{-}_{xy} \in [0,1]$, cost of placing x and y in different clusters
[Figure: ILP objective with transitive closure constraints; example graph on nodes 1, 2, 3]
Cluster mentions such that total cost is minimized
  Solid edges contribute $w^{+}_{xy}$ to the objective
  Dashed edges contribute $w^{-}_{xy}$ to the objective
Cost based on pairwise similarities $p_{xy}$
  Additive: $w^{+}_{xy} = p_{xy}$ and $w^{-}_{xy} = (1 - p_{xy})$
  Logarithmic: $w^{+}_{xy} = \log(p_{xy})$ and $w^{-}_{xy} = \log(1 - p_{xy})$
Solving the ILP is NP-hard [Ailon et al 2008 JACM]
A number of heuristics [Elsner et al 2009 ILP-NLP]
  Greedy BEST/FIRST/VOTE algorithms
  Greedy PIVOT algorithm (5-approximation)
  Local Search
Greedy Algorithms
Step 1: Permute the nodes according to a random permutation π
Step 2: Assign record x to the cluster that maximizes Quality
  Start a new cluster if Quality < 0
Quality:
  BEST: Cluster containing the closest match [Ng et al 2002 ACL]
  FIRST: Cluster contains the most recent vertex y with $w^{+}_{xy} > 0$ [Soon et al 2001 CL]
  VOTE: Assign to cluster that minimizes objective function. [Elsner et al 08 ACL]
Practical Note:
  Run the algorithm for many random permutations, and pick the clustering with best objective value (better than average run)
Greedy with approximation guarantees
PIVOT Algorithm [Ailon et al 2008 JACM]
  Pick a random (pivot) record p.
  New cluster = p together with the unclustered records that match p
[Figure: example graph on records 1, 2, 3, 4]
  π = {1,2,3,4} → C = {{1,2,3,4}}
  π = {2,4,1,3} → C = {{1,2},{4},{3}}
  π = {3,2,4,1} → C = {{1,3},{2},{4}}
When weights are 0/1, E(cost(greedy)) < 3 OPT
For $w^{+}_{xy} + w^{-}_{xy} = 1$, E(cost(greedy)) < 5 OPT
[Elsner et al, ILP-NLP 09]: Comparison of various correlation clustering algorithms
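A minimal sketch of PIVOT for 0/1 weights, where each pair of records is either a positive (match) edge or implicitly negative. The pivot order is passed in explicitly so that the runs are reproducible; the star-shaped example graph around record 1 is an assumption chosen because it is consistent with the three clusterings listed above. The cost function counts disagreements: positive edges cut by the clustering plus negative pairs placed together.

```python
import random

def pivot_clustering(order, positive_edges):
    """Visit records in the given order; each still-unclustered record becomes
    a pivot and absorbs its still-unclustered positive neighbours."""
    neighbours = {r: set() for r in order}
    for x, y in positive_edges:
        neighbours[x].add(y)
        neighbours[y].add(x)
    unclustered = set(order)
    clusters = []
    for pivot in order:
        if pivot not in unclustered:
            continue
        cluster = {pivot} | (neighbours[pivot] & unclustered)
        unclustered -= cluster
        clusters.append(cluster)
    return clusters

def disagreements(clusters, records, positive_edges):
    """0/1 cost: positive edges across clusters plus negative pairs inside one."""
    label = {r: i for i, c in enumerate(clusters) for r in c}
    positive = {frozenset(e) for e in positive_edges}
    cost = 0
    for i, x in enumerate(records):
        for y in records[i + 1:]:
            same_cluster = label[x] == label[y]
            if same_cluster != (frozenset((x, y)) in positive):
                cost += 1
    return cost

records = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (1, 4)]                  # assumed star around record 1
random_order = random.sample(records, len(records))  # a random permutation
random_result = pivot_clustering(random_order, edges)
```

Following the practical note on greedy algorithms, one would run this for several random orders and keep the clustering with the lowest `disagreements` value.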
PART 2d
CANONICALIZATION
Canonicalization
Merge information from duplicate mentions to construct a cluster representative with maximal information
  Mentions:
    Starbucks, 3457 Hillsborough Road, Durham, NC; Ph: null
    Starbacks, Hillsborough Rd, Durham; Ph: (919) 333-4444
  Cluster representative:
    Starbucks, 3457 Hillsborough Road, Durham, NC; Ph: (919) 333-4444
Critically important in Web portals where users must be shown a consolidated view
  Each mention only contains a subset of the attributes
  Mentions contain variations (of names, addresses)
  Some of the mentions have incorrect values
Canonicalization Algorithms
Rule-based:
  For names: typically longest names are used.
  For set-valued attributes: UNION is used.
For strings, [Culotta et al KDD 07] learn an edit distance for finding the most representative "centroid".
Can use "majority rule" to fix errors (if 4 out of 5 say a business is closed, then business is closed).
  This may not always work due to copying [Dong et al VLDB 09], or when underlying data changes [Pal et al WWW 11]
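The rule-based heuristics can be combined into a small merge function. This sketch assumes mentions are dictionaries with hypothetical fields `name`, `phones` and `closed`; the three mentions are made-up data, and the majority vote deliberately ignores the copying and staleness problems noted above.

```python
from collections import Counter

def canonicalize(mentions):
    """Build a cluster representative from a list of mention dicts.

    Rules: longest name, union of set-valued attributes, majority vote for
    single-valued attributes. Ties go to the value seen first.
    """
    names = [m["name"] for m in mentions if m.get("name")]
    phones = set()
    for m in mentions:
        phones |= set(m.get("phones", ()))
    closed_votes = Counter(m["closed"] for m in mentions
                           if m.get("closed") is not None)
    return {
        "name": max(names, key=len) if names else None,
        "phones": phones,
        "closed": closed_votes.most_common(1)[0][0] if closed_votes else None,
    }

# Made-up duplicate mentions of one business.
mentions = [
    {"name": "Starbucks", "phones": [], "closed": False},
    {"name": "Starbucks Coffee", "phones": ["(919) 333-4444"], "closed": False},
    {"name": "Starbacks", "phones": ["(919) 333-4444"], "closed": True},
]
representative = canonicalize(mentions)
```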
Canonicalization for Efficiency
Stanford Entity Resolution Framework [Benjelloun VLDBJ 09]
  Consider a blackbox match and merge function
  Match is a pairwise boolean operator
  Merge: construct canonical version of a matching pair
Can minimize time to compute matches by interleaving matching and merging
  esp., when match and merge functions satisfy monotonicity properties.
[Figure: merge tree over records r1, r2, r3, r4, r5 with merged records r12, r45, r345]
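One way to interleave matching and merging with black-box functions is sketched below. It loosely follows the R-Swoosh idea from that line of work: a merged record replaces its two sources and goes back on the work list, so it gets another chance to match. The record representation (frozensets of attribute values), the `match` and `merge` functions, and the data are all illustrative assumptions.

```python
def resolve(records, match, merge):
    """Interleave matching and merging with black-box match/merge functions.

    match(a, b) returns a bool and merge(a, b) returns a new record.
    A merged record is put back on the work list so it can match again.
    """
    todo = list(records)
    resolved = []
    while todo:
        current = todo.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            resolved.remove(partner)
            todo.append(merge(current, partner))
    return resolved

# Toy records: sets of attribute values; two records match if they share one.
r1 = frozenset({"J. Smith", "jsmith@x.org"})
r2 = frozenset({"jsmith@x.org", "John Smith"})
r3 = frozenset({"John Smith", "555-0100"})
r4 = frozenset({"A. Jones"})

result = resolve([r1, r2, r3, r4],
                 match=lambda a, b: bool(a & b),
                 merge=lambda a, b: a | b)
```

r1 and r3 share no value, so they never match directly, yet they end up in one record because each matches the merged form of the other with r2. Comparing against merged records also means fewer comparisons than testing all original pairs.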
COLLECTIVE ENTITY RESOLUTION
Collective Approaches
Decisions for cluster membership depend on other clusters
  Non-probabilistic approaches
    Similarity Propagation
  Probabilistic Models
    Generative Models
    Undirected Models
  Hybrid Approaches
SIMILARITY PROPAGATION
Similarity Propagation Approaches
Similarity propagation algorithms define a graph which encodes the similarity between entity mentions and matching decisions, and compute matching decisions by propagating similarity values.
  Details of constructed graph and how the similarity is computed vary
  Algorithms are usually defined procedurally
  While probabilities may be encoded in various ways in the algorithms, there is no global probabilistic model defined
Approaches often more scalable than global probabilistic models
Dependency Graph [Dong et al., SIGMOD 05]
Construct a graph where nodes represent similarity comparisons between attribute values (real-valued) and match decisions based on matching decisions of associated nodes (boolean-valued)
As mentions are resolved, enriched to contain associated nodes of all matched mentions
Similarity propagated until fixed point is reached
Negative constraints (not-match nodes) are checked after similarity propagation is performed, and inconsistencies are fixed
Exploit the Dependency Graph
Slides from [Dong et al, SIGMOD 05]
[Figure: dependency graph for two paper mentions]
  Reference similarity nodes: (p1, p2), (a1, a4), (a2, a5), (a3, a6), (v1, v2)
  Attribute similarity nodes: (Distributed, Distributed), (169-180, 169-180), (Robert S. Epstein, Epstein, R.S.), (Michael Stonebraker, Stonebraker, M.), (Eugene Wong, Wong, E.), (ACM, ACM SIGMOD), (1978, 1978)
[Figure: the same graph after propagation, with nodes marked Reconciled or Similar]
Collective Relational Clustering [Bhattacharya & Getoor, TKDD 07]
Construct a graph where leaf nodes are individual mentions
Perform hierarchical agglomerative clustering to merge clusters of mentions
Similarity computed based on a combination of attribute and relational similarity
When clusters are merged, update the similarities of any related clusters (clusters corresponding to mentions which co-occur with merged mentions)
Objective Function
Minimize: $\sum_{i} \sum_{j} w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, \mathrm{sim}_R(c_i, c_j)$
  $w_A$: weight for attributes
  $\mathrm{sim}_A$: similarity of attributes
  $w_R$: weight for relations
  $\mathrm{sim}_R$: similarity based on relational edges between $c_i$ and $c_j$
Greedy clustering algorithm: merge cluster pair with max reduction in objective function
where for example
  $\mathrm{sim}_A(c_i, c_j) = \sum_{a \in \mathrm{Attributes}} \mathrm{sim}(c_i^{*}, c_j^{*})$ for cluster representative $c^{*}$
and
  $\mathrm{sim}_R(c_i, c_j) = \mathrm{sim}_{jaccard}(N(c_i), N(c_j))$ where N(c) are the relational neighbors of c
Relational Clustering Algorithm
1. Find similar references using "blocking"
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert into priority queue
4. Repeat until priority queue is empty
5.   Find "closest" cluster pair
6.   Stop if similarity below threshold
7.   If no negative constraints violated
8.     Merge to create new cluster
9.     Construct canonical cluster representative
10.    Update similarity for "related" clusters
O(n k log n) algorithm w/ efficient implementation
Similarity Propagation Approaches
Method / Notes / Constraints / Evaluation
RelDC [Kalashnikov et al, TODS 06]: Reference disambiguation using Relationship-based data cleaning (RelDC)
  Notes: Model choice nodes identified using feature-based similarity
  Constraints: Context attraction measures the relational similarity
  Evaluation: Accuracy and runtime for Author resolution and director resolution in Movie database
Reference Reconciliation [Dong et al, SIGMOD 05]: Dependency Graph for propagating similarities + enforce non-match constraints
  Notes: Reference enrichment; Explicitly handle missing values; Parameters set by hand
  Constraints: Both positive and negative constraints
  Evaluation: Precision/Recall, F1 on personal information management data (PIM), Cora dataset
Collective Relational Clustering [Bhattacharya & Getoor, TKDD 07]: Modified hierarchical agglomerative clustering approach
  Notes: Constructs canonical entity as merges are made
  Constraints: Focus on coauthor resolution and propagation
  Evaluation: Precision/Recall, F1 on three bibliographic datasets: CiteSeer, ArXiv, and BioBase, and synthetic data
PROBABILISTIC MODELS: GENERATIVE APPROACHES
Generative Probabilistic Approaches
Probabilistic semantics based on Directed Models
  Model dependencies between match decisions in a generative manner
  Disadvantage: acyclicity requirement
Variety of approaches
  Based on Latent Dirichlet Allocation, Bayesian Networks
Examples
  Latent Dirichlet Allocation [Bhattacharya & Getoor, SDM 07]
  Probabilistic Relational Models [Pasula et al, NIPS 02]
LDA for Entity Resolution: Discovering Groups from Co-Occurrence Relations
[Figure: two author groups and the paper mentions they generate]
Parallel Processing Research Group: Stephen P Johnson, Chris Walshaw, Kevin McManus, Mark Cross, Martin Everett
Bell Labs Group: Stephen C Johnson, Alfred V Aho, Ravi Sethi, Jeffrey D Ullman
P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P3: C. Walshaw, M. Cross, M. G. Everett
P4: Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: A. Aho, S. Johnson, J. Ullman
P6: A. Aho, R. Sethi, J. Ullman
LDA-ER Model
[Figure: plate diagram of the LDA-ER model]
Entity label a and group label z for each reference r
θ: "mixture" of groups for each co-occurrence
Φz: multinomial for choosing entity a for each group z
Va: multinomial for choosing reference r from entity a
Dirichlet priors with α and β
Inference using blocked Gibbs sampling for efficiency (and improved accuracy)
Generative Approaches
Method / Learning/Inference Method / Evaluation
[Li, Morie, & Roth, AAAI 04]: Generative model for mentions in documents
  Learning/Inference: Truncated EM to learn parameters and MAP inference for entities (unsupervised)
  Evaluation: F1 on person names, locations and organizations in TREC dataset
Probabilistic Relational Models [Pasula et al., NIPS 03]: Probabilistic Relational Models
  Learning/Inference: Parameters learned on separated corpora, inference done using MCMC
  Evaluation: % of correctly identified clusters on subsets of CiteSeer data
Latent Dirichlet Allocation [Bhattacharya & Getoor, SDM 06]: Latent Dirichlet Model
  Learning/Inference: Blocked Gibbs Sampling; Unsupervised approach
  Evaluation: Precision/Recall/F1 on CiteSeer and HEP data
PROBABILISTIC MODELS: UNDIRECTED APPROACHES
Undirected Probabilistic Approaches
Probabilistic semantics based on Markov Networks
  Advantage: no acyclicity requirements
In some cases, syntax based on first-order logic
  Advantage: declarative
Examples
  Conditional Random Fields (CRFs) [McCallum & Wellner, NIPS 04]
  Markov Logic Networks (MLNs) [Singla & Domingos, ICDM 06]
  Probabilistic Similarity Logic [Broecheler & Getoor, UAI 10]
MarkovLogic
AlogicalKBisasetofhardconstraints onthesetof
possibleworlds
Makethemsoftconstraints;whenaworldviolatesa
formula,itbecomeslessprobablebutnotimpossible
Giveeachformulaaweight
Higherweight
Higher weight Strongerconstraint
Stronger constraint
P(world) exp weights o f formula s it sat isfies
[Richardson&Domingos,06]
Markov Logic
A Markov Logic Network (MLN) is a set of pairs (F, w) where
  F is a formula in first-order logic
  w is a real number
$$P(X = x) = \frac{1}{Z} \exp\left(\sum_{i \in F} w_i \, n_i(x)\right)$$
  $n_i(x)$: number of true groundings of the i-th clause
  $Z$: normalization constant
  The sum iterates over all first-order MLN formulas
[Richardson & Domingos, 06]
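The MLN distribution can be checked by brute force on a tiny domain. The sketch below is only an illustration: the two ground atoms, the two formulas and their weights are made up, and each formula has a single grounding so n_i(x) is 0 or 1.

```python
import math
from itertools import product

def mln_distribution(atoms, weighted_formulas):
    """Enumerate all worlds and return P(X = x) = exp(sum_i w_i * n_i(x)) / Z."""
    worlds = [dict(zip(atoms, values))
              for values in product([False, True], repeat=len(atoms))]
    scores = [math.exp(sum(w * n(world) for w, n in weighted_formulas))
              for world in worlds]
    z = sum(scores)  # normalization constant Z
    return {tuple(world[a] for a in atoms): s / z
            for world, s in zip(worlds, scores)}

# Hypothetical ground atoms for one pair of records.
atoms = ["Similar", "Same"]
# Hypothetical weighted formulas; each function returns the number of true groundings.
weighted_formulas = [
    (1.5, lambda w: int((not w["Similar"]) or w["Same"])),  # Similar => Same
    (0.5, lambda w: int(not w["Same"])),                    # weak prior against merging
]
dist = mln_distribution(atoms, weighted_formulas)
```

A world that violates the first formula (Similar but not Same) keeps a non-zero probability, only a smaller one than the world that satisfies it, which is the "soft constraint" reading above.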
ER Problem Formulation in MLNs
Given
  A DB of records representing mentions of entities in the real world, e.g. paper mentions
  A set of fields, e.g. author, title, venue
  Each record represented as a set of typed predicates, e.g. HasAuthor(paper, author), HasVenue(paper, venue)
Goal
  To determine which of the records/fields refer to the same underlying entity
Slides from [Singla & Domingos, ICDM06]
Handling Equality
Introduce Equals(x, y) or x = y
Introduce the axioms of equality
  Reflexivity: x = x
  Symmetry: x = y ⇒ y = x
  Transitivity: x = y ∧ y = z ⇒ z = x
  Predicate Equivalence: x1 = x2 ∧ y1 = y2 ⇒ (R(x1, y1) ⇔ R(x2, y2))
Positive, Soft Evidence
Introduce reverse predicate equivalence
Same relation with the same entity gives evidence about two entities being same
  R(x1, y1) ∧ R(x2, y2) ∧ x1 = x2 ⇒ y1 = y2
Not true logically, but gives useful information
Example
  HasAuthor(C1, J. Cox) ∧ HasAuthor(C2, Cox J.) ∧ (J. Cox = Cox J.) ⇒ C1 = C2
Field Comparison
Each field is a string composed of tokens
Introduce HasWord(field, word)
Use reverse predicate equivalence
  HasWord(f1, w1) ∧ HasWord(f2, w2) ∧ w1 = w2 ⇒ f1 = f2
Example
  HasWord(J. Cox, Cox) ∧ HasWord(Cox J., Cox) ∧ (Cox = Cox) ⇒ (J. Cox = Cox J.)
Can have different weight for each word
Two-level Similarity
Individual words as units: can't deal with spelling mistakes
Break each word into n-grams: introduce HasNgram(word, ngram)
Use reverse predicate equivalence for word comparisons
Record Matching
Simplest version: field similarities measured by presence/absence of words in common
  HasWord(f1, w1) ∧ HasWord(f2, w2) ∧ HasField(r1, f1) ∧ HasField(r2, f2) ∧ w1 = w2 ⇒ r1 = r2
Example
  HasWord(J. Cox, Cox) ∧ HasWord(Cox J., Cox) ∧ HasAuthor(P1, J. Cox) ∧ HasAuthor(P2, Cox J.) ∧ (Cox = Cox) ⇒ (P1 = P2)
Transitivity
  (f1 = f2) ∧ (f2 = f3) ⇒ (f3 = f1)
  HasAuthor(c, a1) ∧ HasAuthor(c, a2) ⇒ Coauthor(a1, a2)
  Coauthor(a1, a2) ∧ Coauthor(a3, a4) ∧ a1 = a3 ⇒ a2 = a4
Inference
Use cheap heuristics (e.g. TF-IDF based similarity) to identify plausible pairs
Inference/learning over plausible pairs
Inference method: lazy grounding + MaxWalkSAT
Learning: supervised and transfer (learn/hand-set on one domain and transferred)
Probabilistic Soft Logic [Broecheler & Getoor, UAI10]
Declarative language for defining constrained continuous Markov random field (CCMRF) using first-order logic (FOL)
Soft logic: truth values in [0, 1]
Logical operators relaxed using Lukasiewicz t-norms
Mechanisms for incorporating similarity functions, and reasoning about sets
MAP inference is a convex optimization
Efficient sampling method for marginal inference
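The Lukasiewicz relaxation of the logical operators can be written out directly. This is a minimal sketch of the standard Lukasiewicz operators on truth values in [0, 1]; the function names are illustrative and are not taken from any PSL implementation.

```python
def l_and(a, b):
    """Lukasiewicz t-norm (soft conjunction)."""
    return max(0.0, a + b - 1.0)

def l_or(a, b):
    """Lukasiewicz t-conorm (soft disjunction)."""
    return min(1.0, a + b)

def l_not(a):
    """Soft negation."""
    return 1.0 - a

def l_implies(body, head):
    """Soft implication body => head."""
    return min(1.0, 1.0 - body + head)

def distance_to_satisfaction(body, head):
    """How far a ground rule body => head is from being fully true."""
    return 1.0 - l_implies(body, head)  # equals max(0, body - head)
```

On the endpoints 0 and 1 these operators agree with Boolean logic; in between, a rule whose body is truer than its head is only partially satisfied, for example `distance_to_satisfaction(0.9, 0.4)` is about 0.5.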
FOL to CCMRF
PSL converts a weighted rule into potential functions by penalizing its distance to satisfaction, $d_r(x) = 1 - T_r(x)$,
  where $T_r(x)$ is the truth value of ground rule r under interpretation x
The distribution over truth values is
$$f(x) = \frac{1}{Z} \exp\left(-\sum_{r \in P} \sum_{g \in G(r)} w_r \, d_g(x)\right)$$
  $w_r$: weight of rule r
  $G(r)$: all groundings of rule r
  $P$: PSL program
Undirected Approaches
Method: Conditional Random Fields (CRFs) capturing transitivity constraints [McCallum & Wellner, NIPS04]
  Learning/Inference Method: Graph partitioning (Boykov et al. 1999), performed via correlation clustering
  Evaluation: F1 on DARPA MUC & ACE datasets
Method: Markov Logic Networks (MLNs) [Singla & Domingos, ICDM06]
  Learning/Inference Method: Supervised learning and inference using MaxWalkSAT & MCMC
  Evaluation: Conditional log-likelihood and AUC on Cora and BibServ data
Method: Probabilistic Similarity Logic (PSL) [Broecheler & Getoor, UAI10]
  Learning/Inference Method: Supervised learning and inference using continuous optimization
  Evaluation: Precision/Recall/F1 on Ontology Alignment
HYBRID APPROACHES
Hybrid Approaches
Constraint-based approaches explicitly encode relational constraints
  They can be formulated as a hybrid of constraints and probabilistic models
  Or as a constraint optimization problem
Examples
  Constraint-based Entity Matching [Shen, Li & Doan, AAAI05]
  Dedupalog [Arasu, Re, Suciu, ICDE09]
Dedupalog [Arasu et al., ICDE09]
Data to be deduplicated:
  PaperRef(id, title, conference, publisher, year)
  Wrote(id, authorName, Position)
(Thresholded) Fuzzy-Join Output:
  TitleSimilar(title1, title2)
  AuthorSimilar(author1, author2)
Step (0) Create initial approximate matches; this is input to Dedupalog.
Step (1) Declare the entities: cluster Papers, Publishers, & Authors
  Paper!(id) :- PaperRef(id,-,-,-,-)
  Publisher!(p) :- PaperRef(-,-,-,p,-)
  Author!(a) :- Wrote(-,a,-)
Dedupalog is flexible: Unique Names Assumption (UNA)
  Publishers (UNA) and Papers (NOT UNA)
Slides based on [Arasu, Re, Suciu, ICDE09]
Step (2) Declare Clusters
Input in the DB:
  PaperRef(id, title, conference, publisher, year)
  Wrote(id, authorName, Position)
  TitleSimilar(title1, title2)
  AuthorSimilar(author1, author2)
Cluster papers, publishers, and authors:
  Paper!(id) :- PaperRef(id,-,-,-,-)
  Publisher!(p) :- PaperRef(-,-,-,p,-)
  Author!(a) :- Wrote(-,a,-)
Clusters are declared using * (like IDBs or Views): these are output
  Author*(a1,a2) <-> AuthorSimilar(a1,a2)
  Cluster authors with similar names
*IDBs are equivalence relations: Symmetric, Reflexive, & Transitively-Closed Relations, i.e., Clusters
Example AuthorSimilar pairs (Author1, Author2): (AA, Arvind Arasu), (Arvind A, Arvind Arasu)
A Dedupalog program is a set of datalog-like rules
Simple Constraints
Papers with similar titles should likely be clustered together
  Paper*(id1,id2) <-> PaperRef(id1,t1,-), PaperRef(id2,t2,-), TitleSimilar(t1,t2)
  Author*(a1,a2) <-> AuthorSimilar(a1,a2)
(<->) Soft constraints: pay a cost if violated.
Papers in PaperEQ must be clustered together, those in PaperNEQ must not be clustered together
  Paper*(id1,id2) <= PaperEQ(id1,id2)
  ¬Paper*(id1,id2) <= PaperNEQ(id1,id2)
(<=) Hard constraints: any clustering must satisfy these
1. PaperEQ, PaperNEQ are relations (EDBs)
2. ¬ denotes negation here.
Additional Constraints
Clustering two papers, then must cluster their first authors
  Author*(a1,a2) <= Paper*(id1,id2), Wrote(id1,a1,1), Wrote(id2,a2,1)
Clustering two papers makes it likely we should cluster their publishers
  Publisher*(x,y) <- Publishes(x,p1), Publishes(y,p2), Paper*(p1,p2)
If two authors do not share coauthors, then do not cluster them
  ¬Author*(x,y) <- ¬(Wrote(x,p1,-), Wrote(y,p2,-), Wrote(z,p1,-), Wrote(z,p2,-), Author*(x,y))
Dedupalog via CC
Semantics: translate a Dedupalog program to a set of graphs
  Nodes are references (in the ! relation)
    Entity references: Conference!(c)
  Positive edges [+]
    Conference*(c1,c2) <-> ConfSim(c1,c2)
  [-] Negative edges are implicit
For a single graph w.o. hard constraints we can reuse prior work for O(1) apx.
[Figure: example graph over the conference references VLDBJ, VLDB, VLDB conf, ICDE, ICDT, International Conf. DE]
Edge types in the graph
  Soft: [+] Positive, [-] Negative
    Conference*(c1,c2) <-> ConfSim(c1,c2)
  Hard: Equal, Not Equal
    Conference*(c1,c2) <= ConfEQ(c1,c2)
    ¬Conference*(c1,c2) <= ConfNEQ(c1,c2)
Algorithm
  1. Pick a random order of edges
  2. While there is a soft edge do
     1. Pick first soft edge in order
     2. If it is [+], turn it into a hard Equal edge
     3. Else it is [-]; turn it into a hard Not Equal edge
     4. Deduce labels
  3. Return transitively closed subsets
[Figure: the algorithm run on the references VLDBJ, VLDB, VLDB conf, ICDE, ICDT, International Conf. DE]
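The random-order procedure above can be sketched with a union-find structure. This is a simplified illustration in the spirit of the slide, not the Dedupalog system itself: it handles one entity type, assumes the hard constraints are mutually consistent, and treats every pair without a soft positive edge as an implicit soft negative edge. All function and parameter names are hypothetical.

```python
import random
from itertools import combinations

def cluster(nodes, soft_positive, hard_equal=(), hard_not_equal=(), seed=0):
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    not_equal = list(hard_not_equal)       # pairs that must stay apart

    def forbidden(a, b):
        ra, rb = find(a), find(b)
        return any({find(u), find(v)} == {ra, rb} for u, v in not_equal)

    for a, b in hard_equal:                # hard Equal edges come first
        parent[find(a)] = find(b)

    positive = {frozenset(e) for e in soft_positive}
    pairs = list(combinations(nodes, 2))
    random.Random(seed).shuffle(pairs)     # 1. random order of edges
    for a, b in pairs:                     # 2. process soft edges in order
        if find(a) == find(b) or forbidden(a, b):
            continue                       # label already deduced
        if frozenset((a, b)) in positive:
            parent[find(a)] = find(b)      # [+] becomes hard Equal
        else:
            not_equal.append((a, b))       # [-] becomes hard Not Equal
    clusters = {}
    for n in nodes:                        # 3. transitively closed subsets
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())

nodes = ["VLDBJ", "VLDB", "VLDB conf", "ICDE", "ICDT", "International Conf. DE"]
result = cluster(
    nodes,
    soft_positive=[("VLDB", "VLDBJ"), ("VLDB", "VLDB conf"),
                   ("ICDE", "ICDT"), ("ICDE", "International Conf. DE")],
    hard_equal=[("VLDB", "VLDB conf")],
    hard_not_equal=[("VLDB", "VLDBJ"), ("ICDE", "ICDT")],
)
```

Whatever order the seed produces, the hard constraints hold in the output: VLDB and VLDB conf share a cluster, while VLDB/VLDBJ and ICDE/ICDT are kept apart. Which soft edges end up honored depends on the random order.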
Voting
Extend algorithm to whole language via voting technique.
  Support different entity types, recursive programs, etc.
Many Dedupalog programs have an O(1) apx
  Thm: All "soft" programs have an O(1) apx
  Thm: With recursive hard constraints, no O(1) apx
  Expert: multiway-cut hard
System properties:
  (1) Streaming algorithm
  (2) Linear in # of matches (not n^2)
  (3) User interaction
Features: Support for weights, reference tables (partially), and corresponding hardness results.
Hybrid Approaches
Method: Constraint-based Entity Matching [Shen, Li & Doan, AAAI05]; builds on (Li, Morie, & Roth, AI Mag 2004)
  Two-layer model:
    Layer 1: Generative model for data sets that satisfy constraints
    Layer 2: EM algorithm and the relaxation labeling algorithm to perform matching. In each iteration, use EM to estimate parameters of the generative model and a matching assignment, then employ relaxation labeling to exploit the constraints
  Evaluation: Researchers and IMDB with noise added
Method: Dedupalog [Arasu, Re, Suciu, ICDE09]
  Declarative specification for rich collection of constraints with nice syntactic sugar added to datalog for ER. Inference: Correlation clustering + voting
  Evaluation: Precision/Recall on Cora, subset of ACM dataset
Summary: Collective Approaches
Decisions for cluster membership depend on other clusters
  Similarity propagation approaches
  Probabilistic Models
    Generative Models
    Undirected Models
  Hybrid Approaches
Non-probabilistic approaches often scale better than generative probabilistic approaches
Undirected/constraint-based models are often easier to specify
Scaling undirected models is an active area of research
PART 3
SCALING ER TO BIG DATA
Scaling ER to Big Data
  Blocking/Canopy Generation
  Distributed ER
PART 3a
BLOCKING/CANOPY GENERATION
Blocking: Motivation
Naïve pairwise: |R|^2 pairwise comparisons
  1000 business listings each from 1,000 different cities across the world
  1 trillion comparisons
  11.6 days (if each comparison is 1 μs)
Mentions from different cities are unlikely to be matches
  Blocking Criterion: City
  1 billion comparisons
  16 minutes (if each comparison is 1 μs)
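The figures on this slide follow from simple arithmetic, reproduced below. The only assumption is the cost of 1 microsecond per comparison, which is what makes the quoted 11.6 days and 16 minutes come out.

```python
records_per_city = 1_000
cities = 1_000
seconds_per_comparison = 1e-6             # assumed: 1 microsecond each

total_records = records_per_city * cities  # 1 million listings
naive_comparisons = total_records ** 2     # |R|^2 = 10^12 (1 trillion)
blocked_comparisons = cities * records_per_city ** 2  # 10^9 (1 billion)

naive_days = naive_comparisons * seconds_per_comparison / 86_400
blocked_minutes = blocked_comparisons * seconds_per_comparison / 60

print(f"naive:   {naive_comparisons:.0e} comparisons, {naive_days:.1f} days")
print(f"blocked: {blocked_comparisons:.0e} comparisons, {blocked_minutes:.1f} minutes")
```

Blocking by city cuts the work by a factor equal to the number of cities, because comparisons are only made inside each city's block.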
Blocking: Motivation
Mentions from different cities are unlikely to be matches
  May miss potential matches
[Figure: Venn diagram of the set of all pairs of records, the pairs of records satisfying the blocking criterion, and the matching pairs of records]
Blocking Algorithms 1
Hash-based blocking
  Each block Ci is associated with a hash key hi.
  Mention x is hashed to Ci if hash(x) = hi.
  Within a block, all pairs are compared.
  Each hash function results in disjoint blocks.
What hash function?
  Deterministic function of attribute values
  Boolean functions over attribute values [Bilenko et al ICDM06, Michelson et al AAAI06, Das Sarma et al CIKM12]
  minHash (min-wise independent permutations) [Broder et al STOC98]
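Hash-based blocking amounts to grouping mentions by a key and comparing only within each group. The sketch below assumes a made-up record layout (`id`, `last`, `city`) and a made-up key function (city plus the first three letters of the last name).

```python
from collections import defaultdict
from itertools import combinations

def build_blocks(mentions, key_fn):
    """Hash every mention to exactly one block, so blocks are disjoint."""
    blocks = defaultdict(list)
    for mention in mentions:
        blocks[key_fn(mention)].append(mention)
    return blocks

def candidate_pairs(blocks):
    """Within a block, all pairs are compared; across blocks, none."""
    for members in blocks.values():
        yield from combinations(members, 2)

# Hypothetical mentions and blocking key.
mentions = [
    {"id": 1, "last": "Smith", "city": "Durham"},
    {"id": 2, "last": "Smithe", "city": "Durham"},
    {"id": 3, "last": "Smith", "city": "College Park"},
    {"id": 4, "last": "Jones", "city": "Durham"},
]
key_fn = lambda m: (m["city"], m["last"][:3].lower())
blocks = build_blocks(mentions, key_fn)
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
```

Here only mentions 1 and 2 are compared. Mention 3 is never compared with them because its city differs, which illustrates how a blocking key can miss potential matches.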
Blocking Algorithms 2
Pairwise Similarity/Neighborhood-based blocking
  Nearby nodes according to a similarity metric are clustered together
  Results in non-disjoint canopies.
Techniques
  Sorted Neighborhood Approach [Hernandez et al SIGMOD95]
  Canopy Clustering [McCallum et al KDD00]
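The sorted neighborhood idea is commonly described as: sort the records by a key, then compare each record only with the records inside a sliding window. The sketch below follows that common description; the function name, the key and the window size are illustrative choices, not details given on the slide.

```python
def sorted_neighborhood_pairs(records, key_fn, window):
    """Compare each record with the next (window - 1) records in sorted order."""
    ordered = sorted(records, key=key_fn)
    pairs = []
    for i, record in enumerate(ordered):
        for neighbor in ordered[i + 1:i + window]:
            pairs.append((record, neighbor))
    return pairs

# Hypothetical name strings, sorted by the string itself.
names = ["smith j", "smyth j", "jones a", "smith john"]
pairs = sorted_neighborhood_pairs(names, key_fn=lambda s: s, window=2)
```

Because windows overlap, the resulting neighborhoods are non-disjoint, unlike the blocks produced by a single hash key.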
Simple Blocking: Inverted Index on a Key
Examples of blocking keys:
  First three characters of last name
  City + State + Zip
  Character or Token n-grams
  Minimum infrequent n-grams
Learning Optimal Blocking Functions
Using one or more blocking keys may be insufficient
  2,376,206 Americans shared the surname Smith in the 2000 US Census
  NULL values may create large blocks.
Solution: Construct blocking functions by combining simple functions
Complex Blocking Functions
Conjunction of functions [Michelson et al AAAI06, Bilenko et al ICDM06]
  {City} AND {last four digits of phone}
Chain-trees [Das Sarma et al CIKM12]
  If ({City} = NULL or LA) then {last four digits of phone} AND {area code}
  else {last four digits of phone} AND {City}
BlkTrees [Das Sarma et al CIKM12]
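The chain-tree example above reads naturally as a branching key function. The sketch below assumes a hypothetical record layout in which `phone` is a 10-digit string whose first three digits are the area code and `city` may be `None`.

```python
def chain_tree_key(record):
    """Blocking key for the chain-tree example: branch on the City value."""
    city = record.get("city")
    phone = record["phone"]          # assumed: 10-digit string
    last_four = phone[-4:]
    if city is None or city == "LA":
        # City is missing or too common to be selective: use the area code.
        return (last_four, "area", phone[:3])
    return (last_four, "city", city)

# Hypothetical records.
r1 = {"city": None, "phone": "2135550142"}
r2 = {"city": "LA", "phone": "2135550142"}
r3 = {"city": "Durham", "phone": "9195550142"}
```

Records `r1` and `r2` land in the same block even though one has a NULL city, which is the situation the branch is meant to handle.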
Learning an Optimal function [Bilenko et al ICDM06]
Find k blocking functions that eliminate the most non-matches, while retaining almost all matches.
  Need a training set of positive and negative pairs
Algorithm Idea: Red-Blue Set Cover
  [Figure: bipartite graph linking blocking keys to positive examples (blue nodes) and negative examples (red nodes)]
  Pick k blocking keys such that
    (a) At most ε blue nodes are not covered
    (b) Number of red nodes covered is minimized
Greedy Algorithm:
  Construct "good" conjunctions of blocking keys {p1, p2, …}.
  Pick k conjunctions {pi1, pi2, …, pik}, such that the following is minimized
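A greedy selection for the red-blue set cover idea can be sketched as follows. The scoring rule used here (newly covered blue nodes divided by one plus newly covered red nodes) is an assumed heuristic chosen for illustration; it is not the objective from the cited paper, and the key names and pair ids are made up.

```python
def greedy_blocking_keys(coverage, blue, k, max_uncovered_blue=0):
    """coverage maps key name -> (blue pairs covered, red pairs covered)."""
    chosen, covered_blue, covered_red = [], set(), set()
    candidates = dict(coverage)

    def gain(name):
        b, r = candidates[name]
        return len(b - covered_blue) / (1 + len(r - covered_red))

    while (candidates and len(chosen) < k
           and len(blue - covered_blue) > max_uncovered_blue):
        best = max(sorted(candidates), key=gain)  # sorted for determinism
        b, r = candidates.pop(best)
        chosen.append(best)
        covered_blue |= b
        covered_red |= r
    return chosen, covered_red, blue - covered_blue

# Hypothetical training pairs: blue ids 1-5 are matches, red ids 10-13 are non-matches.
coverage = {
    "zip": ({1, 2, 3}, {10}),
    "name3": ({3, 4}, {11, 12, 13}),
    "phone4": ({4, 5}, set()),
}
chosen, red_covered, blue_missed = greedy_blocking_keys(coverage, {1, 2, 3, 4, 5}, k=2)
```

With these toy sets the sketch picks `phone4` and then `zip`, covering every blue pair while admitting a single red pair.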
minHash (Min-wise Independent Permutations)
Let Fx be a set of features for mention x
  (functions of) attribute values
  character n-grams
  optimal blocking functions
Let π be a random permutation of features in Fx
  E.g., order imposed by a random hash function
minHash(x) = minimum element in Fx according to π
Why does minHash work?
Surprising property: for a random permutation π,
$$P\left(\text{minHash}(x) = \text{minHash}(y)\right) = \frac{|F_x \cap F_y|}{|F_x \cup F_y|}$$
How to build a blocking scheme such that only pairs with Jaccard similarity > s fall in the same block (with high prob)?
[Figure: probability that mentions (x, y) are blocked together, plotted against Similarity(x, y)]
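The minHash property can be checked empirically: the fraction of hash functions on which two feature sets share a minHash approximates their Jaccard similarity. In this sketch each "random permutation" is simulated by a seeded SHA-1 hash, which is an assumption standing in for a true min-wise independent family.

```python
import hashlib

def feature_rank(seed, feature):
    """Order imposed on features by a seeded hash function."""
    digest = hashlib.sha1(f"{seed}:{feature}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(features, seed):
    """Minimum element of the feature set under the seeded ordering."""
    return min(features, key=lambda f: feature_rank(seed, f))

def estimate_jaccard(fx, fy, num_hashes=500):
    agree = sum(minhash(fx, s) == minhash(fy, s) for s in range(num_hashes))
    return agree / num_hashes

def jaccard(fx, fy):
    return len(fx & fy) / len(fx | fy)

# Hypothetical feature sets sharing 4 of 8 distinct features.
fx = {"a", "b", "c", "d", "e", "f"}
fy = {"c", "d", "e", "f", "g", "h"}
estimate = estimate_jaccard(fx, fy)
```

The estimate should land close to the exact value `jaccard(fx, fy)`, which is 0.5 for these sets, and it tightens as `num_hashes` grows.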
Blocking using minHashes
Compute minHashes using r * k permutations (hash functions)
  [Figure: each signature is split into k bands, each a band of r minHashes]
Signatures that match on 1 out of k bands go to the same block.
minHash Analysis
False Negatives: (missing matches)
  P(pair x, y not in the same block with Jaccard sim = s) should be very low for high similarity pairs
False Positives: (blocking non-matches)
  P(pair x, y in the same block with Jaccard sim = s)

Sim(s)   P(not same block)
0.9      10^-8
0.8      0.00035
0.7      0.025
0.6      0.2
0.5      0.52
0.4      0.81
0.3      0.95
0.2      0.994
0.1      0.9998
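With k bands of r minHashes each, a pair with Jaccard similarity s agrees on one band with probability s^r, so it misses every band with probability (1 - s^r)^k. The slide does not state r and k; the values r = 5 and k = 20 below are an inference, chosen because they approximately reproduce the table.

```python
def p_not_same_block(s, r=5, k=20):
    """Probability that a pair with Jaccard similarity s shares no band."""
    return (1 - s ** r) ** k

# Recompute the table under the assumed r = 5, k = 20.
table = {s / 10: p_not_same_block(s / 10) for s in range(9, 0, -1)}
for sim, prob in table.items():
    print(f"{sim:.1f}  {prob:.2g}")
```

The curve is a steep S-shape: highly similar pairs are almost never separated, while pairs below roughly 0.3 similarity almost never share a block. Raising r pushes the threshold higher, and raising k pulls it lower.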
Canopy Clustering [McCallum et al KDD00]
Input: Mentions M, d(x, y), a distance metric, thresholds T1 > T2
Algorithm:
  1. Pick a random element x from M
  2. Create new canopy Cx using mentions y s.t. d(x, y) < T1
  3. Delete all mentions y from M s.t. d(x, y) < T2 (from consideration in this algorithm)
  4. Return to Step 1 if M is not empty.
[Figure: overlapping canopies; each element has a single primary canopy but may be in multiple canopies]
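The four steps above translate into a short loop. This sketch makes one assumption the slide leaves open: canopy members are drawn from all mentions, not only the ones still under consideration, which is what lets canopies overlap. The 1-D points and thresholds are toy values.

```python
import random

def canopy_clustering(mentions, distance, t1, t2, seed=0):
    """Return a list of (center, canopy members); requires t1 > t2 > 0."""
    assert t1 > t2 > 0
    rng = random.Random(seed)
    remaining = list(mentions)
    canopies = []
    while remaining:                                   # step 4
        center = rng.choice(remaining)                 # step 1
        members = [y for y in mentions if distance(center, y) < t1]  # step 2
        canopies.append((center, members))
        # Step 3: drop mentions within t2 of the center as future centers.
        remaining = [y for y in remaining if distance(center, y) >= t2]
    return canopies

# Toy 1-D mentions with absolute difference as the (cheap) distance.
points = [0.0, 1.0, 2.0, 10.0, 11.0]
canopies = canopy_clustering(points, lambda a, b: abs(a - b), t1=3.0, t2=1.5)
```

The loop always terminates because the chosen center is at distance 0 from itself and is therefore removed in step 3. Far-apart mentions such as 0.0 and 10.0 can never share a canopy with these thresholds.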
PART 3b
DISTRIBUTED ER
Distributed ER
MapReduce is very popular for large tasks
  Simple programming model for massively distributed data
  Hadoop provides fault tolerance and is open source
[Figure: Map Phase (per-record computation), then Shuffle, then Reduce Phase (global computation)]
ER with Disjoint Blocking
Compute blocks in Map: Map Phase (per-record computation)
Shuffle by Block ID
Remaining ER in Reduce: Reduce Phase (global computation)
No need to compare records across reducers
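The map/shuffle/reduce flow for disjoint blocking can be simulated in a single process. This is a plain-Python sketch of the data flow only, not Hadoop code; the record fields, blocking key and match rule are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(records, block_key):
    """Per-record computation: emit (block id, record)."""
    return [(block_key(r), r) for r in records]

def shuffle(mapped):
    """Group records by block id, one group per reducer."""
    groups = defaultdict(list)
    for block_id, record in mapped:
        groups[block_id].append(record)
    return groups

def reduce_phase(groups, is_match):
    """Run pairwise ER independently inside each block."""
    matches = []
    for records in groups.values():
        for a, b in combinations(records, 2):
            if is_match(a, b):
                matches.append((a["id"], b["id"]))
    return matches

# Hypothetical records, blocking key (city) and match rule (same 3-letter prefix).
records = [
    {"id": 1, "name": "Smith Cafe", "city": "Durham"},
    {"id": 2, "name": "Smiths Cafe", "city": "Durham"},
    {"id": 3, "name": "Smith Cafe", "city": "Austin"},
]
is_match = lambda a, b: a["name"][:3].lower() == b["name"][:3].lower()
matches = reduce_phase(shuffle(map_phase(records, lambda r: r["city"])), is_match)
```

Because each record has exactly one block id, every reducer can work without talking to the others.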
Non-disjoint Blocking
How to block?
  Hash-based: need an efficient technique to group records if they match on n out of k blocking keys [Vernica et al SIGMOD10]
  Distance-based: canopy clustering on MapReduce [Mahout]
  Iterative Blocking [Whang et al SIGMOD09]
Problem: Information needed for a record is in multiple reducers.
Example 1:
  Reducer 1: a matches with b
  Reducer 2: a matches with c
  Need to communicate in order to correctly resolve a, b, c
Example 2: Dedup papers and authors
  Id   Author1      Author2           Paper
  A1   John Smith   Richard Johnson   Indices and Views
  A2   J Smith      R Johnson         SQL Queries
  A3   Dr. Smyth    R Johnson         Indices and Views
Slide adapted from [Rastogi et al VLDB11] talk
Canopy clustering results in non-disjoint clusters [McCallum et al KDD00]
[Figure: overlapping canopies (for Smith, John, Richard, Johnson) over author mentions such as J. Smith, R. Smith, Richard Johnson, R Johnson and Steve Johnson]
CoAuthor(A1, B1) ∧ CoAuthor(A2, B2) ∧ match(B1, B2) ⇒ match(A1, A2)
CoAuthor rule grounds to the correlation
  match(Richard Johnson, R Johnson) => match(J. Smith, John Smith)
Solution 1: Efficiently find Connected Components [Rastogi et al 2012, Kang et al ICDM 2009] + Correlation Clustering / Collective ER in each component
  Connected components can be large in relational/multi-entity ER.
Solution 2: Correlation Clustering / Collective ER in each canopy + Message Passing [Rastogi et al VLDB11]
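For Solution 1, the connected components of the match graph can be found with a union-find structure once the matches from all reducers are collected. This single-machine sketch only illustrates the idea; the cited work is about computing components efficiently on MapReduce, which is not shown here.

```python
def connected_components(nodes, matches):
    """Group nodes linked, directly or transitively, by pairwise matches."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)

    components = {}
    for n in nodes:
        components.setdefault(find(n), set()).add(n)
    return list(components.values())

# Example 1 from the slides: reducer 1 reports (a, b), reducer 2 reports (a, c).
components = connected_components(["a", "b", "c", "d"], [("a", "b"), ("a", "c")])
```

Records a, b and c end up in one component even though no single reducer saw all three together.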
Message Passing
Simple Message Passing (SMP)
  1. Run entity matcher M locally in each canopy
  2. If M finds a match (r1, r2) in some canopy, pass it as evidence to all canopies
  3. Rerun M within each canopy using new evidence
  4. Repeat until no new matches found in each canopy
Runtime: O(k^2 f(k) c)
  k: maximum size of a canopy
  f(k): Time taken by ER on canopy of size k
  c: number of canopies
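The SMP loop can be sketched with the matcher treated as a black box that takes a canopy and the evidence so far. The toy matcher below is purely hypothetical: it accepts one pair on name similarity and otherwise applies a coauthor rule like the one in the earlier example.

```python
from itertools import combinations

def simple_message_passing(canopies, matcher):
    """Rerun the matcher in every canopy until no new matches appear."""
    evidence = set()
    while True:
        new_matches = set()
        for canopy in canopies:                      # steps 1 and 3
            new_matches |= matcher(canopy, evidence) - evidence
        if not new_matches:                          # step 4
            return evidence
        evidence |= new_matches                      # step 2: broadcast

# Hypothetical toy data: one name-based match and a coauthor relation.
NAME_MATCHES = {frozenset(("Richard Johnson", "R Johnson"))}
COAUTHOR = {"J. Smith": "Richard Johnson", "John Smith": "R Johnson",
            "Richard Johnson": "J. Smith", "R Johnson": "John Smith"}

def toy_matcher(canopy, evidence):
    found = set()
    for a, b in combinations(sorted(canopy), 2):
        coauthors = frozenset((COAUTHOR[a], COAUTHOR[b]))
        if frozenset((a, b)) in NAME_MATCHES or coauthors in evidence:
            found.add(frozenset((a, b)))
    return found

canopies = [{"Richard Johnson", "R Johnson"}, {"J. Smith", "John Smith"}]
result = simple_message_passing(canopies, toy_matcher)
```

The Smith pair is found only in the second round, after the Johnson match from the other canopy has been passed along as evidence.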
Formal Properties for a well-behaved ER method
  Convergence: No. of steps ≤ no. of matches
  Consistency: Output independent of the canopy order
  Soundness: Each output match is actually a true match
  Completeness: Each true match is also an output match
Completeness
Papers 2 and 3 match only if a canopy knows that
  match(a1, a2)
  match(b2, b3)
  match(c2, c3)
Simple message passing will not find any matches
  thus, no messages are passed, no progress
Solution: Maximal message passing
  Send a message if there is a potential for match
Summary of Scalability
O(|R|^2) pairwise computations can be prohibitive.
  Blocking eliminates comparisons on a large fraction of non-matches.
Equality-based Blocking:
  Construct (one or more) blocking keys from features
  Records not matching on any key are not compared.
Neighborhood-based Blocking:
  Form overlapping canopies of records based on similarity.
  Only compare records within a cluster.
Computing connected components/Message Passing in addition to blocking can help distribute ER.
PART 4
CHALLENGES AND FUTURE DIRECTIONS
Challenges
So far, we have viewed ER as a one-time process applied to entire database; none of these hold in real world.
Temporal ER
  ER algorithms need to account for change in real world
  Reasoning about multiple sources [Pal & M et al. WWW12]
  Model transitions [Li et al VLDB11]
Reasoning about source quality
  Sources are not independent
  Copying Problem [Dong et al VLDB09]
Query Time ER
  How do we selectively determine the smallest number of records to resolve, so we get accurate results for a particular query?
  Collective resolution for queries [Bhattacharya & Getoor JAIR07]
ER & User-generated data
  Deduplicated entities interact with users in the real world
    Users tag/associate photos/reviews with businesses on Google/Yahoo
  What should be done to support interactions?
Open Issues
ER is often part of bigger inference problem
  Pipelined approaches and joint approaches to information extraction and graph identification
  How can we characterize how ER errors affect overall quality of results?
ER Theory
  Need better support for theory which can give relational learning bounds
ER & Privacy
  ER enables record re-identification
  How do we develop a theory of privacy-preserving ER?
ER Benchmarks
  Need for large-scale real-world ER datasets with ground truth
  Synthetic data useful for scaling but hard to capture rich complexities of real world
Summary
Growing omnipresence of massive linked data, and the need for creating knowledge bases from text and unstructured data motivate a number of challenges in ER
Especially interesting challenges and opportunities for ER and social media/user-generated data
As data, noise, and knowledge grow, greater needs & opportunities for intelligent reasoning about entity resolution
Many other challenges
  Large-scale identity management
  Understanding theoretical potentials & limits of ER
THANK YOU!
References: Intro
W. Willinger et al, Mathematics and the Internet: A Source of Enormous Confusion and Great Potential, Notices of the AMS 56(5), 2009
L. Gill and M. Goldacre, English National Record Linkage of Hospital Episode Statistics and Death Registration Records, Report to the Department of Health, 2003
T. Herzog et al, Data Quality and Record Linkage Techniques, Springer 2007
A. Elmagarmid et al, Duplicate Record Detection, TKDE 2007
P. Christen, Data Matching, Springer 2012
N. Koudas et al, Record Linkage: Similarity measures and Algorithms, SIGMOD 2006
X. Dong & F. Naumann, Data fusion: Resolving data conflicts for integration, VLDB 2009
L. Getoor & A. Machanavajjhala, Entity Resolution: Theory, Practice and Open Challenges, AAAI 2012
References: Single Entity ER
D. Menestrina et al, Evaluating Entity Resolution Results, PVLDB 3(1-2), 2010
M. Cochinwala et al, Efficient data reconciliation, Information Sciences 137(1-4), 2001
M. Bilenko & R. Mooney, Adaptive Duplicate Detection Using Learnable String Similarity Measures, KDD 2003
P. Christen, Automatic record linkage using seeded nearest neighbour and support vector machine classification, KDD 2008
Z. Chen et al, Exploiting context analysis for combining multiple entity resolution systems, SIGMOD 2009
A. McCallum & B. Wellner, Conditional Models of Identity Uncertainty with Application to Noun Coreference, NIPS 2004
H. Newcombe et al, Automatic linkage of vital records, Science 1959
I. Fellegi & A. Sunter, A Theory for Record Linkage, JASA 1969
W. Winkler, Overview of Record Linkage and Current Research Directions, Research Report Series, US Census, 2006
T. Herzog et al, Data Quality and Record Linkage Techniques, Springer, 2007
P. Ravikumar & W. Cohen, A Hierarchical Graphical Model for Record Linkage, UAI 2004
References: Single Entity ER (contd.)
S. Sarawagi et al, Interactive Deduplication using Active Learning, KDD 2000
S. Tejada et al, Learning Object Identification Rules for Information Integration, IS 2001
A. Arasu et al, On active learning of record matching packages, SIGMOD 2010
K. Bellare et al, Active sampling for entity matching, KDD 2012
A. Beygelzimer et al, Agnostic Active Learning without Constraints, NIPS 2010
J. Wang et al, CrowdER: Crowdsourcing Entity Resolution, PVLDB 5(11), 2012
A. Marcus et al, Human-powered Sorts and Joins, PVLDB 5(1), 2011
References: Single Entity ER (contd.)
R. Gupta & S. Sarawagi, Answering Table Augmentation Queries from Unstructured Lists on the Web, PVLDB 2(1), 2009
A. Das Sarma et al, An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks, CIKM 2012
M. Bilenko et al, Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping, ICDM 2005
S. Chaudhuri et al, Robust Identification of Fuzzy Duplicates, ICDE 2005
W. Soon et al, A machine learning approach to coreference resolution of noun phrases, Computational Linguistics 27(4), 2001
N. Bansal et al, Correlation Clustering, Machine Learning 56(1-3), 2004
V. Ng & C. Cardie, Improving machine learning approaches to coreference resolution, ACL 2002
M. Elsner & E. Charniak, You talking to me? A corpus and algorithm for conversation disentanglement, ACL-HLT 2008
M. Elsner & W. Schudy, Bounding and Comparing Methods for Correlation Clustering Beyond ILP, ILP-NLP 2009
N. Ailon et al, Aggregating inconsistent information: Ranking and clustering, JACM 55(5), 2008
X. Dong et al, Integrating Conflicting Data: The Role of Source Dependence, PVLDB 2(1), 2009
A. Pal et al, Information Integration over Time in Unreliable and Uncertain Environments, WWW 2012
A. Culotta et al, Canonicalization of Database Records using Adaptive Similarity Measures, KDD 2007
O. Benjelloun et al, Swoosh: A generic approach to Entity Resolution, VLDBJ 18(1), 2009
References: Constraints & Multi-Relational ER
R. Ananthakrishna et al, Eliminating fuzzy duplicates in data warehouses, VLDB 2002
A. Arasu et al, Large-Scale Deduplication with Constraints using Dedupalog, ICDE 2009
S. Chaudhuri et al, Leveraging aggregate constraints for deduplication, SIGMOD 07
X. Dong et al, Reference Reconciliation in Complex Information Spaces, SIGMOD 2005
I. Bhattacharya & L. Getoor, Collective Entity Resolution in Relational Data, TKDD 2007
I. Bhattacharya & L. Getoor, A Latent Dirichlet Model for Unsupervised Entity Resolution, SDM 2006
P. Bohannon et al, Conditional Functional Dependencies for Data Cleaning, ICDE 2007
M. Broecheler & L. Getoor, Probabilistic Similarity Logic, UAI 2010
W. Fan, Dependencies revisited for improving data quality, PODS 2008
H. Pasula et al, Identity Uncertainty and Citation Matching, NIPS 2002
D. Kalashnikov et al, Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph, TODS06
J. Lafferty et al, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, ICML 2001
X. Li et al, Identification and Tracing of Ambiguous Names: Discriminative and Generative Approaches, AAAI 2004
A. McCallum & B. Wellner, Conditional Models of Identity Uncertainty with Application to Noun Coreference, NIPS 2004
M. Richardson & P. Domingos, Markov Logic, Machine Learning 62, 2006
W. Shen et al, Constraint-based Entity Matching, AAAI 2005
P. Singla & P. Domingos, Entity Resolution with Markov Logic, ICDM 2006
Whang et al, Generic Entity Resolution with Negative Rules, VLDBJ 2009
Whang et al, Joint Entity Resolution, ICDE 2012
References: Blocking
M. Bilenko et al, Adaptive Blocking: Learning to Scale Up Record Linkage and Clustering, ICDM 2006
M. Michelson & C. Knoblock, Learning Blocking Schemes for Record Linkage, AAAI 2006
A. Das Sarma et al, An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks, CIKM 2012
A. Broder et al, Min-Wise Independent Permutations, STOC 1998
G. Papadakis et al, Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data, WSDM 2012
M. Hernandez & S. Stolfo, The merge/purge problem for large databases, SIGMOD 1995
A. McCallum et al, Efficient clustering of high-dimensional data sets with application to reference matching, KDD 2000
L. Kolb et al, Dedoop: Efficient deduplication with Hadoop, (demo) PVLDB 5(12), 2012
R. Vernica et al, Efficient Parallel Set-Similarity Joins Using MapReduce, SIGMOD 2010
Apache Mahout: Scalable Machine Learning and Data Mining, http://mahout.apache.org/
S. Whang et al, Entity Resolution with Iterative Blocking, SIGMOD 2009
U. Kang et al, PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations, ICDM 2009
V. Rastogi et al, Finding Connected Components on Map-reduce in Poly-Log Rounds, CoRR 2012
V. Rastogi et al, Large-Scale Collective Entity Matching, PVLDB 4(4), 2011
References: Challenges & Future Directions
I. Bhattacharya and L. Getoor, "Query-time Entity Resolution", JAIR 2007
X. Dong, L. Berti-Equille, D. Srivastava, Truth discovery and copying detection in a dynamic world, VLDB 2009
P. Li, X. Dong, A. Maurino, D. Srivastava, Linking Temporal Records, VLDB 2011
A. Pal, V. Rastogi, A. Machanavajjhala, P. Bohannon, Information integration over time in unreliable and uncertain environments, WWW 2012
