Article

Benchmarking declarative approximate selection predicates

Authors:

Oktie Hassanzadeh,

Mohammad Sadoghi,

Divesh Srivastava

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data

Pages 353 - 364

https://doi.org/10.1145/1247480.1247521

Published: 11 June 2007

Abstract

Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is its ease of use and integration with existing applications. Over the last few years, several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc.) and have been fully expressed using declarative SQL statements. In this paper, we propose new similarity predicates along with their declarative realization, based on notions of probabilistic information retrieval. In particular, we show how language models and hidden Markov models can be utilized as similarity predicates for data quality and present their full declarative instantiation. We also show how other scoring methods from information retrieval can be utilized in a similar setting. We then present full declarative specifications of previously proposed similarity predicates in the literature, grouping them into classes according to their primary characteristics. Finally, we present a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations. We quantify both their runtime performance and their accuracy for several types of common quality problems encountered in operational databases.
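
To make the idea of a similarity predicate "fully expressed using declarative SQL statements" concrete, the following is a minimal sketch of an approximate selection and not the paper's own specification. It assumes a Jaccard coefficient over padded 3-grams as the similarity predicate, and the table names (`base_tokens`, `query_tokens`), the function names, and the sample records are all hypothetical. Python's standard `sqlite3` module is used only to host the SQL.

```python
import sqlite3


def qgrams(s, q=3):
    """Return the set of q-grams of s, padded so that prefixes and suffixes count."""
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def build_db(records, q=3):
    """Tokenize each record into q-grams and store them in a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE base_tokens (tid INTEGER, token TEXT)")
    conn.execute("CREATE TABLE query_tokens (token TEXT)")
    for tid, text in enumerate(records):
        conn.executemany(
            "INSERT INTO base_tokens VALUES (?, ?)",
            [(tid, g) for g in qgrams(text, q)],
        )
    return conn


# Jaccard = |A ∩ B| / (|A| + |B| - |A ∩ B|), computed entirely in SQL.
JACCARD_SQL = """
WITH sizes AS (
    SELECT tid, COUNT(*) AS n FROM base_tokens GROUP BY tid
),
overlap AS (
    SELECT b.tid AS tid, COUNT(*) AS c
    FROM base_tokens b JOIN query_tokens qt ON b.token = qt.token
    GROUP BY b.tid
),
scored AS (
    SELECT o.tid AS tid,
           1.0 * o.c / (s.n + (SELECT COUNT(*) FROM query_tokens) - o.c) AS score
    FROM overlap o JOIN sizes s ON s.tid = o.tid
)
SELECT tid, score FROM scored WHERE score >= ? ORDER BY score DESC, tid
"""


def approx_select(conn, query, threshold, q=3):
    """Return (tid, score) pairs whose Jaccard similarity to query meets the threshold."""
    conn.execute("DELETE FROM query_tokens")
    conn.executemany(
        "INSERT INTO query_tokens VALUES (?)", [(g,) for g in qgrams(query, q)]
    )
    return conn.execute(JACCARD_SQL, (threshold,)).fetchall()


if __name__ == "__main__":
    # Hypothetical records: one clean string, one with typos, one unrelated.
    db = build_db(["Morgan Stanley Group Inc.", "Morgan Stanley Grop Inc", "Silicon Graphics"])
    for tid, score in approx_select(db, "Morgan Stanley Group Inc.", 0.5):
        print(tid, round(score, 3))
```

Because the scoring logic lives in a single SQL statement over ordinary tables, the same pattern could in principle be run on any relational engine; only the tokenization step here is done outside the database for brevity.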

Index Terms

Benchmarking declarative approximate selection predicates
1. Information systems
  1. Information retrieval

Information & Contributors

Information

Published In

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data

June 2007

1210 pages

ISBN:9781595936868

DOI:10.1145/1247480

General Chairs: Lizhu Zhou (Tsinghua University, China), Tok Wang Ling (National University of Singapore, Singapore)

Program Chair: Beng Chin Ooi (National University of Singapore, Singapore)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMOD/PODS07

Sponsor:

SIGMOD/PODS07: International Conference on Management of Data

June 11 - 14, 2007

Beijing, China

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 63
Total Downloads: 60
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 5

Reflects downloads up to 17 Feb 2025

Cited By

Fadlallah H, Kilany R, Dhayne H, El Haddad R, Haque R, Taher Y, Jaber A (2023). BIGQA: Declarative Big Data Quality Assessment. Journal of Data and Information Quality 15:3 (1-30). DOI: 10.1145/3603706. Online publication date: 22-Aug-2023
https://dl.acm.org/doi/10.1145/3603706
Pitoura E (2020). Social-minded Measures of Data Quality. Journal of Data and Information Quality 12:3 (1-8). DOI: 10.1145/3404193. Online publication date: 16-Jul-2020
https://dl.acm.org/doi/10.1145/3404193
Cheng Y, Ji X, Li X, Zhang T, Malebary S, Qu X, Xu W (2020). Identifying Child Users via Touchscreen Interactions. ACM Transactions on Sensor Networks 16:4 (1-25). DOI: 10.1145/3403574. Online publication date: 28-Jul-2020
https://dl.acm.org/doi/10.1145/3403574
Liu R, Liu R, Pugliese A, Subrahmanian V (2020). STARS. ACM Transactions on Intelligent Systems and Technology 11:5 (1-25). DOI: 10.1145/3397463. Online publication date: 24-Jul-2020
https://dl.acm.org/doi/10.1145/3397463
Chen C, Zhou J, Wu B, Fang W, Wang L, Qi Y, Zheng X (2020). Practical Privacy Preserving POI Recommendation. ACM Transactions on Intelligent Systems and Technology 11:5 (1-20). DOI: 10.1145/3394138. Online publication date: 5-Jul-2020
https://dl.acm.org/doi/10.1145/3394138
Wang G, Zhang F, Sun H, Wang Y, Zhang D (2020). Understanding the Long-Term Evolution of Electric Taxi Networks. ACM Transactions on Intelligent Systems and Technology 11:4 (1-27). DOI: 10.1145/3393671. Online publication date: 28-May-2020
https://dl.acm.org/doi/10.1145/3393671
Koumarelas I, Jiang L, Naumann F (2020). Data Preparation for Duplicate Detection. Journal of Data and Information Quality 12:3 (1-24). DOI: 10.1145/3377878. Online publication date: 13-Jun-2020
https://dl.acm.org/doi/10.1145/3377878
Visengeriyeva L, Abedjan Z (2020). Anatomy of Metadata for Data Curation. Journal of Data and Information Quality 12:3 (1-30). DOI: 10.1145/3371925. Online publication date: 13-Jun-2020
https://dl.acm.org/doi/10.1145/3371925
Colborne A, Smit M (2020). Characterizing Disinformation Risk to Open Data in the Post-Truth Era. Journal of Data and Information Quality 12:3 (1-13). DOI: 10.1145/3328747. Online publication date: 3-Jun-2020
https://dl.acm.org/doi/10.1145/3328747
Wu R, Chaba S, Sawlani S, Chu X, Thirumuruganathan S, Maier D, Pottinger R, Doan A, Tan W, Alawini A, Ngo H (2020). ZeroER: Entity Resolution using Zero Labeled Examples. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (1149-1164). DOI: 10.1145/3318464.3389743. Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3318464.3389743
