Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1247480.1247521acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Benchmarking declarative approximate selection predicates

Published: 11 June 2007 Publication History

Abstract

Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Over the last few years several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc) and have been fully expressed using declarative SQL statements. In this paper we propose new similarity predicates along with their declarative realization, based on notions of probabilistic information retrieval. In particular we show how language models and hidden Markov models can be utilized as similarity predicates for data quality and present their full declarative instantiation. We also show how other scoring methods from information retrieval, can be utilized in a similar setting. We then present full declarative specifications of previously proposed similarity predicates in the literature, grouping them into classes according to their primary characteristics. Finally, we present a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations. We quantify both their runtime performance as well as their accuracy for several types of common quality problems encountered in operational databases.

References

[1]
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB '02.
[2]
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB '06.
[3]
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 2000.
[4]
S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD '03.
[5]
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE '06.
[6]
W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In SIGMOD '98.
[7]
W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWeb '03.
[8]
J. B. Copas and F. J. Hilton. Record linkage: statistical models for matching computer records. Journal of the Royal Statistical Society, 1990.
[9]
I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 1969.
[10]
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. In VLDB '01.
[11]
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB '01.
[12]
L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins for web data integration. In WWW 2003: 90--101.
[13]
D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997.
[14]
M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 1998.
[15]
M. A. Jaro. Advances in record linkage methodology as applied to matching the 1985 census of tampa. Journal of the American Statistical Association, 1984.
[16]
N. Koudas, A. Marathe, and D. Srivastava. Flexible string matching against large databases in practice. In VLDB '04.
[17]
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06.
[18]
C. D. Manning and H. Schütze. Foundations of statistical natural language processing, 1999.
[19]
D. R. H. Miller, T. Leek, and R. M. Schwartz. A hidden Markov model information retrieval system. In SIGIR '99.
[20]
J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR '98.
[21]
L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, 1989.
[22]
S. Robertson. Understanding inverse document frequency: on theoretical arguments. Journal of Documentation, 2004.
[23]
S. E. Robertson, S. Walker, M. Hancock-Beaulieu, M. Gatford, and A. Payne. Okapi at trec-4. In TREC '95.
[24]
G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 1988.
[25]
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1986.
[26]
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD '04.
[27]
W. E. Winkler. The state of record linkage and current research problems. US Bureau of the Census, 1999.

Cited By

View all
  • (2023)BIGQA: Declarative Big Data Quality AssessmentJournal of Data and Information Quality10.1145/360370615:3(1-30)Online publication date: 22-Aug-2023
  • (2020)Social-minded Measures of Data QualityJournal of Data and Information Quality10.1145/340419312:3(1-8)Online publication date: 16-Jul-2020
  • (2020)Identifying Child Users via Touchscreen InteractionsACM Transactions on Sensor Networks10.1145/340357416:4(1-25)Online publication date: 28-Jul-2020
  • Show More Cited By

Index Terms

  1. Benchmarking declarative approximate selection predicates

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
    June 2007
    1210 pages
    ISBN:9781595936868
    DOI:10.1145/1247480
    • General Chairs:
    • Lizhu Zhou,
    • Tok Wang Ling,
    • Program Chair:
    • Beng Chin Ooi
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 June 2007

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. SQL
    2. accuracy
    3. data cleaning
    4. declarative data quality
    5. performance

    Qualifiers

    • Article

    Conference

    SIGMOD/PODS07
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)8
    • Downloads (Last 6 weeks)5
    Reflects downloads up to 17 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)BIGQA: Declarative Big Data Quality AssessmentJournal of Data and Information Quality10.1145/360370615:3(1-30)Online publication date: 22-Aug-2023
    • (2020)Social-minded Measures of Data QualityJournal of Data and Information Quality10.1145/340419312:3(1-8)Online publication date: 16-Jul-2020
    • (2020)Identifying Child Users via Touchscreen InteractionsACM Transactions on Sensor Networks10.1145/340357416:4(1-25)Online publication date: 28-Jul-2020
    • (2020)STARSACM Transactions on Intelligent Systems and Technology10.1145/339746311:5(1-25)Online publication date: 24-Jul-2020
    • (2020)Practical Privacy Preserving POI RecommendationACM Transactions on Intelligent Systems and Technology10.1145/339413811:5(1-20)Online publication date: 5-Jul-2020
    • (2020)Understanding the Long-Term Evolution of Electric Taxi NetworksACM Transactions on Intelligent Systems and Technology10.1145/339367111:4(1-27)Online publication date: 28-May-2020
    • (2020)Data Preparation for Duplicate DetectionJournal of Data and Information Quality10.1145/337787812:3(1-24)Online publication date: 13-Jun-2020
    • (2020)Anatomy of Metadata for Data CurationJournal of Data and Information Quality10.1145/337192512:3(1-30)Online publication date: 13-Jun-2020
    • (2020)Characterizing Disinformation Risk to Open Data in the Post-Truth EraJournal of Data and Information Quality10.1145/332874712:3(1-13)Online publication date: 3-Jun-2020
    • (2020)ZeroER: Entity Resolution using Zero Labeled ExamplesProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389743(1149-1164)Online publication date: 11-Jun-2020
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media