article

Free access

A Bayesian decision model for cost optimal record matching

Authors:

V. S. Verykios,

G. V. Moustakides,

M. G. ElfekyAuthors Info & Claims

The VLDB Journal — The International Journal on Very Large Data Bases, Volume 12, Issue 1

Pages 28 - 40

https://doi.org/10.1007/s00778-002-0072-y

Published: 01 May 2003 Publication History

Abstract

In an error-free system with perfectly clean data, the construction of a global view of the data consists of linking - in relational terms, joining - two or more tables on their key fields. Unfortunately, most of the time, these data are neither carefully controlled for quality nor necessarily defined commonly across different data sources. As a result, the creation of such a global data view resorts to approximate joins. In this paper, an optimal solution is proposed for the matching or the linking of database record pairs in the presence of inconsistencies, errors or missing values in the data. Existing models for record matching rely on decision rules that minimize the probability of error, that is the probability that a sample (a measurement vector) is assigned to the wrong class. In practice though, minimizing the probability of error is not the best criterion to design a decision rule because the misclassifications of different samples may have different consequences. In this paper we present a decision model that minimizes the cost of making a decision. In particular: (a) we present a decision rule: (b) we prove that this rule is optimal with respect to the cost of a decision: and (c) we compute the probabilities of the two types of errors (Type I and Type II) that incur when this rule is applied. We also present a closed form decision model for a certain class of record comparison pairs along with an example, and results from comparing the proposed cost-based model to the error-based model, for large record comparison spaces.

References

[1]

1. W. Alvey, B. Jamerson (1997) Record linkage techniques-1997. Proc. International Workshop and Exposition, March, Federal Committee on Statistical Methodology, Office of Management and Budget.

[2]

2. D. Bitton, D.J. DeWitt (1983) Duplicate record elimination in large data files. ACM Trans Database Syst 8(2):255-265.

[3]

3. M.G. Elfeky, V.S. Verykios, A.K. Elmagarmid (2002) TAILOR: A record linkage toolbox. Proc. 18th International Conference on Data Engineering, pp 17-28.

[4]

4. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (1996) Advances in knowledge discovery and data mining. AAAI/MIT

[5]

5. I.P. Fellegi, A.B. Sunter (1969) A theory for record linkage. J Am Stat Assoc 64(328):1183-1210.

[6]

6. H. Galhardas, D. Florescu, D. Shasha, E. Simon, C.-A. Saita (2001) Declarative data cleaning: language, model, and algorithms. Proc. 27th International Conference on Very Large Data Bases, pp 371-380.

[7]

7. L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava (2001) Approximate string joins in a database (almost) for free. Proc. 27th International Conference on Very Large Data Bases, pp 491-500.

[8]

8. M.A. Hernadez (1996) A generalization of band joins and the merge/purge problem. Ph.D. thesis, Department of Computer Sciences, Columbia University.

[9]

9. M.A. Hernadez, S.J. Stolfo (1998) Real-world data is dirty: data cleansing and the merge/purge problem. Data Mining Knowl Discovery 2(1):9-37.

[10]

10. J.A. Hylton (1996) Identifying and merging related bibliographic records. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Mass., USA.

[11]

11. M.A. Jaro (1989) Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc 84, no. 406, 414-420.

[12]

12. P.R. Kelley (1985) Advances in record linkage methodology: a method for determining the best blocking strategy. Proc. Workshop on Exact Matching Methodologies, pp 199-203.

[13]

13. B. Kliss, W. Alvey (1985) Record linkage techniques - 1985. Proc. Workshop on Exact Matching Methodologies, May, Department of the Treasury, Internal Revenue Service, Statistics Income Division.

[14]

14. M.L. Lee, H. Lu, T.W. Ling, Y.T. Ko (1999) Cleansing data for mining and warehousing. Proc. 10th International Conference on Databases and Expert Systems Applications.

[15]

15. U. Manber (1989) Introduction to algorithms. Addison-Wesley, Reading, Mass., USA.

[16]

16. A.E. Monge, C.P. Elkan (1996) The field matching problem: algorithms and applications. Proc. 2nd International Conference on Knowledge Discovery and Data Mining, pp 267-270.

[17]

17. A.E. Monge, C.P. Elkan (1997)An efficient domain-independent algorithm for detecting approximately duplicate database records. Proc. SIGMOD Workshop on Research Issues on DMKD, pp 23-29.

[18]

18. A.E. Monge (1997) Adaptive detection of approximately duplicate records and the database integration approach to information discovery. Ph.D. thesis, Department of Computer Science and Engineering, University of California, San Diego, Calif., USA.

[19]

19. G. Nathan (1967) Outcome probabilities for a record matching process with complete invariant information. J Am Stat Assoc 62(318):454-469.

[20]

20. H.B. Newcombe, J.M. Kennedy (1962) Record linkage: making maximum use of the discriminating power of identifying information. Comm ACM 5:563-566.

[21]

21. H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James (1959) Automatic linkage of vital records. Science 130(3381):954-959.

[22]

22. H.B. Newcombe (1967) Record linking: the design of efficient systems for linking records into individual and family histories. Am J Hum Genet 19(3).

[23]

23. N.S. D'Andrea Du Bois (1969) A solution to the problem of linking multivariate documents. J Am Stat Assoc 64(325):163- 174.

[24]

24. J.C. Pinheiro, D.X. Sun (1998) Methods for linking and mining heterogeneous databases. Proc. 4th International Conference on Knowledge Discovery and Data Mining, pp 309-313.

[25]

25. B.J. Tepping (1968) A model for optimum linkage of records. J Am Stat Assoc 63:1321-1332.

[26]

26. P.D. Turney (2000) Types of cost in inductive concept learning. Workshop on Cost-Sensitive Learning at the 17th International Conference on Machine Learning, Stanford, Calif., USA.

Cited By

Zhao WHuang JFan TWu YLiu K(2022)A Novel Compressed Sensing-Based Graph Isomorphic Network for Key Node Recognition and Entity AlignmentInternational Journal on Semantic Web & Information Systems10.4018/IJSWIS.31560018:2(1-17)Online publication date: 16-Dec-2022
https://dl.acm.org/doi/10.4018/IJSWIS.315600
Yan YMeyles SHaghighi ASuciu DMaier DPottinger RDoan ATan WAlawini ANgo H(2020)Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial ApplicationsProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3386143(2287-2301)Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3318464.3386143
Wang HDing XLi JGao H(2018)Rule-Based Entity Resolution on Database with Hidden Temporal InformationIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2018.281601830:11(2199-2212)Online publication date: 4-Oct-2018
https://dl.acm.org/doi/10.1109/TKDE.2018.2816018
Show More Cited By

Index Terms

A Bayesian decision model for cost optimal record matching
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Relational database model
    2. Query languages
      1. Relational database query languages
  2. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction
2. Mathematics of computing
  1. Probability and statistics
    1. Stochastic processes

Recommendations

A generalized cost optimal decision model for record matching
IQIS '04: Proceedings of the 2004 international workshop on Information quality in information systems

Record (or entity) matching or linkage is the process of identifying records in one or more data sources, that refer to the same real world entity or object. In record linkage, the ultimate goal of a decision model is to provide the decision maker with ...
Decision models for record linkage
Data Mining

The process of identifying record pairs that represent the same real-world entity in multiple databases, commonly known as record linkage, is one of the important steps in many data mining applications. In this paper, we address one of the sub-tasks in ...
Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques

This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image The VLDB Journal — The International Journal on Very Large Data Bases

The VLDB Journal — The International Journal on Very Large Data Bases Volume 12, Issue 1

May 2003

85 pages

ISSN:1066-8888

Issue’s Table of Contents

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 May 2003

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

24
Total Citations
View Citations
876
Total Downloads

Downloads (Last 12 months)29
Downloads (Last 6 weeks)2

Reflects downloads up to 14 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Zhao WHuang JFan TWu YLiu K(2022)A Novel Compressed Sensing-Based Graph Isomorphic Network for Key Node Recognition and Entity AlignmentInternational Journal on Semantic Web & Information Systems10.4018/IJSWIS.31560018:2(1-17)Online publication date: 16-Dec-2022
https://dl.acm.org/doi/10.4018/IJSWIS.315600
Yan YMeyles SHaghighi ASuciu DMaier DPottinger RDoan ATan WAlawini ANgo H(2020)Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial ApplicationsProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3386143(2287-2301)Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3318464.3386143
Wang HDing XLi JGao H(2018)Rule-Based Entity Resolution on Database with Hidden Temporal InformationIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2018.281601830:11(2199-2212)Online publication date: 4-Oct-2018
https://dl.acm.org/doi/10.1109/TKDE.2018.2816018
Ding X(2018)Multiple Data Quality Evaluation and Data Cleaning on Imprecise Temporal DataAdvances in Conceptual Modeling10.1007/978-3-030-01391-2_14(69-75)Online publication date: 22-Oct-2018
https://dl.acm.org/doi/10.1007/978-3-030-01391-2_14
Reyes-Galaviz OPedrycz WHe ZPizzi N(2017)A supervised gradient-based learning algorithm for optimized entity resolutionData & Knowledge Engineering10.1016/j.datak.2017.10.004112:C(106-129)Online publication date: 1-Nov-2017
https://dl.acm.org/doi/10.1016/j.datak.2017.10.004
Enríquez JBlanco RDomínguez-Mayo FTuya JEscalona MVos TEldh SPrasetya W(2016)Towards an MDE-based approach to test entity reconciliation applicationsProceedings of the 7th International Workshop on Automating Test Case Design, Selection, and Evaluation10.1145/2994291.2994303(74-77)Online publication date: 18-Nov-2016
https://dl.acm.org/doi/10.1145/2994291.2994303
Li LLi JGao H(2016)Evaluating entity-description conflict on duplicated dataJournal of Combinatorial Optimization10.1007/s10878-014-9801-631:2(918-941)Online publication date: 1-Feb-2016
https://dl.acm.org/doi/10.1007/s10878-014-9801-6
Schewe KWang QRady M(2015)Knowledge-Based Entity Resolution with Contextual Information Defined over a MonoidProceedings of the 5th International Conference on Model and Data Engineering - Volume 934410.1007/978-3-319-23781-7_11(128-135)Online publication date: 26-Sep-2015
https://dl.acm.org/doi/10.1007/978-3-319-23781-7_11
Batjargal BKuyama TKimura FMaeda ABuchanan GKlein MRauber ACunningham S(2014)Identifying the same records across multiple Ukiyo-e image databases using textual data in different languagesProceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries10.5555/2740769.2740801(193-196)Online publication date: 8-Sep-2014
https://dl.acm.org/doi/10.5555/2740769.2740801
Song SChen LCheng H(2014)On concise set of relative candidate keysProceedings of the VLDB Endowment10.14778/2732977.27329917:12(1179-1190)Online publication date: 1-Aug-2014
https://dl.acm.org/doi/10.14778/2732977.2732991
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents