Nothing Special   »   [go: up one dir, main page]

skip to main content
article
Free access

A Bayesian decision model for cost optimal record matching

Published: 01 May 2003 Publication History

Abstract

In an error-free system with perfectly clean data, the construction of a global view of the data consists of linking - in relational terms, joining - two or more tables on their key fields. Unfortunately, most of the time, these data are neither carefully controlled for quality nor necessarily defined commonly across different data sources. As a result, the creation of such a global data view resorts to approximate joins. In this paper, an optimal solution is proposed for the matching or the linking of database record pairs in the presence of inconsistencies, errors or missing values in the data. Existing models for record matching rely on decision rules that minimize the probability of error, that is the probability that a sample (a measurement vector) is assigned to the wrong class. In practice though, minimizing the probability of error is not the best criterion to design a decision rule because the misclassifications of different samples may have different consequences. In this paper we present a decision model that minimizes the cost of making a decision. In particular: (a) we present a decision rule: (b) we prove that this rule is optimal with respect to the cost of a decision: and (c) we compute the probabilities of the two types of errors (Type I and Type II) that incur when this rule is applied. We also present a closed form decision model for a certain class of record comparison pairs along with an example, and results from comparing the proposed cost-based model to the error-based model, for large record comparison spaces.

References

[1]
1. W. Alvey, B. Jamerson (1997) Record linkage techniques-1997. Proc. International Workshop and Exposition, March, Federal Committee on Statistical Methodology, Office of Management and Budget.
[2]
2. D. Bitton, D.J. DeWitt (1983) Duplicate record elimination in large data files. ACM Trans Database Syst 8(2):255-265.
[3]
3. M.G. Elfeky, V.S. Verykios, A.K. Elmagarmid (2002) TAILOR: A record linkage toolbox. Proc. 18th International Conference on Data Engineering, pp 17-28.
[4]
4. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (1996) Advances in knowledge discovery and data mining. AAAI/MIT
[5]
5. I.P. Fellegi, A.B. Sunter (1969) A theory for record linkage. J Am Stat Assoc 64(328):1183-1210.
[6]
6. H. Galhardas, D. Florescu, D. Shasha, E. Simon, C.-A. Saita (2001) Declarative data cleaning: language, model, and algorithms. Proc. 27th International Conference on Very Large Data Bases, pp 371-380.
[7]
7. L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, D. Srivastava (2001) Approximate string joins in a database (almost) for free. Proc. 27th International Conference on Very Large Data Bases, pp 491-500.
[8]
8. M.A. Hernadez (1996) A generalization of band joins and the merge/purge problem. Ph.D. thesis, Department of Computer Sciences, Columbia University.
[9]
9. M.A. Hernadez, S.J. Stolfo (1998) Real-world data is dirty: data cleansing and the merge/purge problem. Data Mining Knowl Discovery 2(1):9-37.
[10]
10. J.A. Hylton (1996) Identifying and merging related bibliographic records. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Mass., USA.
[11]
11. M.A. Jaro (1989) Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc 84, no. 406, 414-420.
[12]
12. P.R. Kelley (1985) Advances in record linkage methodology: a method for determining the best blocking strategy. Proc. Workshop on Exact Matching Methodologies, pp 199-203.
[13]
13. B. Kliss, W. Alvey (1985) Record linkage techniques - 1985. Proc. Workshop on Exact Matching Methodologies, May, Department of the Treasury, Internal Revenue Service, Statistics Income Division.
[14]
14. M.L. Lee, H. Lu, T.W. Ling, Y.T. Ko (1999) Cleansing data for mining and warehousing. Proc. 10th International Conference on Databases and Expert Systems Applications.
[15]
15. U. Manber (1989) Introduction to algorithms. Addison-Wesley, Reading, Mass., USA.
[16]
16. A.E. Monge, C.P. Elkan (1996) The field matching problem: algorithms and applications. Proc. 2nd International Conference on Knowledge Discovery and Data Mining, pp 267-270.
[17]
17. A.E. Monge, C.P. Elkan (1997)An efficient domain-independent algorithm for detecting approximately duplicate database records. Proc. SIGMOD Workshop on Research Issues on DMKD, pp 23-29.
[18]
18. A.E. Monge (1997) Adaptive detection of approximately duplicate records and the database integration approach to information discovery. Ph.D. thesis, Department of Computer Science and Engineering, University of California, San Diego, Calif., USA.
[19]
19. G. Nathan (1967) Outcome probabilities for a record matching process with complete invariant information. J Am Stat Assoc 62(318):454-469.
[20]
20. H.B. Newcombe, J.M. Kennedy (1962) Record linkage: making maximum use of the discriminating power of identifying information. Comm ACM 5:563-566.
[21]
21. H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James (1959) Automatic linkage of vital records. Science 130(3381):954-959.
[22]
22. H.B. Newcombe (1967) Record linking: the design of efficient systems for linking records into individual and family histories. Am J Hum Genet 19(3).
[23]
23. N.S. D'Andrea Du Bois (1969) A solution to the problem of linking multivariate documents. J Am Stat Assoc 64(325):163- 174.
[24]
24. J.C. Pinheiro, D.X. Sun (1998) Methods for linking and mining heterogeneous databases. Proc. 4th International Conference on Knowledge Discovery and Data Mining, pp 309-313.
[25]
25. B.J. Tepping (1968) A model for optimum linkage of records. J Am Stat Assoc 63:1321-1332.
[26]
26. P.D. Turney (2000) Types of cost in inductive concept learning. Workshop on Cost-Sensitive Learning at the 17th International Conference on Machine Learning, Stanford, Calif., USA.

Cited By

View all
  • (2022)A Novel Compressed Sensing-Based Graph Isomorphic Network for Key Node Recognition and Entity AlignmentInternational Journal on Semantic Web & Information Systems10.4018/IJSWIS.31560018:2(1-17)Online publication date: 16-Dec-2022
  • (2020)Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial ApplicationsProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3386143(2287-2301)Online publication date: 11-Jun-2020
  • (2018)Rule-Based Entity Resolution on Database with Hidden Temporal InformationIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2018.281601830:11(2199-2212)Online publication date: 4-Oct-2018
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image The VLDB Journal — The International Journal on Very Large Data Bases
The VLDB Journal — The International Journal on Very Large Data Bases  Volume 12, Issue 1
May 2003
85 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 May 2003

Author Tags

  1. Cost optimal statistical model
  2. Data cleaning
  3. Record linkage

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)29
  • Downloads (Last 6 weeks)2
Reflects downloads up to 14 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2022)A Novel Compressed Sensing-Based Graph Isomorphic Network for Key Node Recognition and Entity AlignmentInternational Journal on Semantic Web & Information Systems10.4018/IJSWIS.31560018:2(1-17)Online publication date: 16-Dec-2022
  • (2020)Entity Matching in the Wild: A Consistent and Versatile Framework to Unify Data in Industrial ApplicationsProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3386143(2287-2301)Online publication date: 11-Jun-2020
  • (2018)Rule-Based Entity Resolution on Database with Hidden Temporal InformationIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2018.281601830:11(2199-2212)Online publication date: 4-Oct-2018
  • (2018)Multiple Data Quality Evaluation and Data Cleaning on Imprecise Temporal DataAdvances in Conceptual Modeling10.1007/978-3-030-01391-2_14(69-75)Online publication date: 22-Oct-2018
  • (2017)A supervised gradient-based learning algorithm for optimized entity resolutionData & Knowledge Engineering10.1016/j.datak.2017.10.004112:C(106-129)Online publication date: 1-Nov-2017
  • (2016)Towards an MDE-based approach to test entity reconciliation applicationsProceedings of the 7th International Workshop on Automating Test Case Design, Selection, and Evaluation10.1145/2994291.2994303(74-77)Online publication date: 18-Nov-2016
  • (2016)Evaluating entity-description conflict on duplicated dataJournal of Combinatorial Optimization10.1007/s10878-014-9801-631:2(918-941)Online publication date: 1-Feb-2016
  • (2015)Knowledge-Based Entity Resolution with Contextual Information Defined over a MonoidProceedings of the 5th International Conference on Model and Data Engineering - Volume 934410.1007/978-3-319-23781-7_11(128-135)Online publication date: 26-Sep-2015
  • (2014)Identifying the same records across multiple Ukiyo-e image databases using textual data in different languagesProceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries10.5555/2740769.2740801(193-196)Online publication date: 8-Sep-2014
  • (2014)On concise set of relative candidate keysProceedings of the VLDB Endowment10.14778/2732977.27329917:12(1179-1190)Online publication date: 1-Aug-2014
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media