DOI: 10.1145/3340531.3412062
research-article

Efficient Detection of Data Dependency Violations

Published: 19 October 2020

Abstract

Research on data dependencies has experienced a revival as dependency violations can reveal errors in data. Several data cleaning systems use a DBMS to detect such violations. While DBMSs are efficient for some kinds of data dependencies (e.g., unique constraints), they are likely to fall short of satisfactory performance for more complex ones, such as order dependencies.
We present a novel system to efficiently detect violations of denial constraints (DCs), a well-known formalism that generalizes many kinds of data dependencies. We describe its execution model, which operates on one compressed block of tuples at a time, and we present various algorithms that take advantage of the form of the predicates in the DCs to provide effective code patterns. Our experimental evaluation includes comparisons with DBMS-based and DC-specific approaches, real-world and synthetic data, and various kinds of DCs. It shows that our system is up to three orders of magnitude faster than the other solutions, especially for datasets with a large number of tuples and for DCs that identify a large number of violations.
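To make the notion of a DC violation concrete, the following is a minimal sketch of a naive, quadratic detector. It is not the system described in the abstract: it uses no compressed blocks and no predicate-specific algorithms. The function name `find_violations`, the `(attribute, operator, attribute)` predicate encoding, and the sample rows are all assumptions made for illustration. A DC is taken here as a conjunction of predicates over a pair of tuples (t, s) that must never all hold together.

```python
import operator
from itertools import permutations

def find_violations(rows, predicates):
    """Naive baseline: test every ordered pair of distinct rows.

    `predicates` is a list of (left_attr, op, right_attr) triples
    (a hypothetical encoding). A pair (t, s) violates the DC when
    every predicate op(t[left_attr], s[right_attr]) is true.
    """
    violations = []
    for (i, t), (j, s) in permutations(enumerate(rows), 2):
        if all(op(t[a], s[b]) for a, op, b in predicates):
            violations.append((i, j))
    return violations

# Illustrative data (made up for this sketch).
rows = [
    {"state": "NY", "salary": 5000, "tax": 300},
    {"state": "NY", "salary": 4000, "tax": 400},
    {"state": "CA", "salary": 6000, "tax": 100},
]

# Illustrative DC: no two people in the same state may be such that
# one earns more but pays less tax than the other.
dc = [
    ("state", operator.eq, "state"),
    ("salary", operator.gt, "salary"),
    ("tax", operator.lt, "tax"),
]

violations = find_violations(rows, dc)
print(violations)
```

In this sample only the ordered pair (0, 1) satisfies all three predicates. The sketch compares every pair of tuples, so its cost grows quadratically with the number of tuples, which is the kind of cost the abstract's system is designed to avoid on large datasets.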

Supplementary Material

MP4 File (3340531.3412062.mp4)
In this video, we present our paper entitled Efficient Detection of Data Dependency Violations.


Published In

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
October 2020
3619 pages
ISBN: 9781450368599
DOI: 10.1145/3340531

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data cleaning
  2. error detection
  3. integrity constraints

Qualifiers

  • Research-article

Conference

CIKM '20

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
