Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

On generating near-optimal tableaux for conditional functional dependencies

Published: 01 August 2008 Publication History

Abstract

Conditional functional dependencies (CFDs) have recently been proposed as a useful integrity constraint to summarize data semantics and identify data inconsistencies. A CFD augments a functional dependency (FD) with a pattern tableau that defines the context (i.e., the subset of tuples) in which the underlying FD holds. While many aspects of CFDs have been studied, including static analysis and detecting and repairing violations, there has not been prior work on generating pattern tableaux, which is critical to realize the full potential of CFDs.
This paper is the first to formally characterize a "good" pattern tableau, based on naturally desirable properties of support, confidence and parsimony. We show that the problem of generating an optimal tableau for a given FD is NP-complete but can be approximated in polynomial time via a greedy algorithm. For large data sets, we propose an "on-demand" algorithm providing the same approximation bound, that outperforms the basic greedy algorithm in running time by an order of magnitude. For ordered attributes, we propose the range tableau as a generalization of a pattern tableau, which can achieve even more parsimony. The effectiveness and efficiency of our techniques are experimentally demonstrated on real data.

References

[1]
D. Agarwal, D. Barman, D. Gunopulos, N. Young, F. Korn, and D. Srivastava. Efficient and effective explanation of change in hierarchical summaries. In KDD, pp 6--15, 2007.
[2]
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pp 207--216, 1993.
[3]
P. Berman and M. Karpinski. On Some Tighter Inapproximability Results (Extended Abstract). In ICALP, pp. 200--209, 1999.
[4]
P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, 2007.
[5]
L. Bravo, W. Fan, and S. Ma. Extending dependencies with conditions. In VLDB, 2007.
[6]
P. Brown, P. Haas. BHUNT: Automatic Discovery of Fuzzy Algebraic Constraints in Relational Data. In VLDB, 2003.
[7]
T. Calders, R. Ng, and J. Wijsen. Searching for dependencies at multiple abstraction levels. TODS, 27(3):229--260, 2002.
[8]
S. Ceri, F. D. Guinta, and P. Lanzi. Mining constraint violations. TODS, 32(1), 2007.
[9]
G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: consistency and accuracy. In VLDB, 2007.
[10]
M. Dalkilic and E. Robertson. Information dependencies. In PODS, pp 245--253, 2000.
[11]
R. Gandhi, S. Khuller, and A. Srinivasan. Approximation algorithms for partial covering problems. J. Algorithms, 53(1):55--84, 2004.
[12]
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP---Completeness. Freeman, San Francisco, 1979.
[13]
C. Giannella and E. Robertson. On approximation measures for functional dependencies. Information Systems, 29(6):483--507, 2004.
[14]
Y. Huhtala, J. Karkkainen, P. Porkka, and H. Toivonen. Tane: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100--111, 1999.
[15]
I. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. Cords: Automatic discovery of correlations and soft functional dependencies. In SIGMOD, pp 647--658, 2004.
[16]
M. J. Kearns and U. V. Vazirani. An introduction to computational learning theory. MIT Press, 1994.
[17]
R. King and J. Legendre. Discovery of functional and approximate functional dependencies in relational databases. Journal of Applied Mathematics and Decision Sciences, 7(1):49--59, 2003.
[18]
J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. Theor. Comput. Sci., 149(1):129--149, 1995.
[19]
U. Nambiar and S. Kambhampati. Mining approximate functional dependencies and concept similarities to answer imprecise queries. In WebDB, 2004.
[20]
S. Sarawagi. Explaining differences in multidimensional aggregates. In VLDB, pp 42--53, 1999.
[21]
R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD Rec., 25(2):1--12, 1996.
[22]
Y. Xu and R. Motwani. Random sampling based algorithms for efficient semi-key discovery. Stanford University Technical Report.

Cited By

View all
  • (2025)Data Quality-Driven Top-k Rule DiscoveryProceedings of the 3rd International Conference on Machine Learning, Cloud Computing and Intelligent Mining (MLCCIM2024)10.1007/978-981-96-1698-5_3(24-38)Online publication date: 6-Feb-2025
  • (2024)Making It Tractable to Detect and Correct Errors in GraphsACM Transactions on Database Systems10.1145/370231549:4(1-75)Online publication date: 16-Dec-2024
  • (2024)Discovering Top-k Relevant and Diversified RulesProceedings of the ACM on Management of Data10.1145/36771312:4(1-28)Online publication date: 30-Sep-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 1, Issue 1
August 2008
1216 pages

Publisher

VLDB Endowment

Publication History

Published: 01 August 2008
Published in PVLDB Volume 1, Issue 1

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)14
  • Downloads (Last 6 weeks)1
Reflects downloads up to 13 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Data Quality-Driven Top-k Rule DiscoveryProceedings of the 3rd International Conference on Machine Learning, Cloud Computing and Intelligent Mining (MLCCIM2024)10.1007/978-981-96-1698-5_3(24-38)Online publication date: 6-Feb-2025
  • (2024)Making It Tractable to Detect and Correct Errors in GraphsACM Transactions on Database Systems10.1145/370231549:4(1-75)Online publication date: 16-Dec-2024
  • (2024)Discovering Top-k Relevant and Diversified RulesProceedings of the ACM on Management of Data10.1145/36771312:4(1-28)Online publication date: 30-Sep-2024
  • (2024)Rock: Cleaning Data by Embedding ML in Logic RulesCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3653372(106-119)Online publication date: 9-Jun-2024
  • (2024)Decentralized and Incremental Discovery of Relaxed Functional Dependencies Using Bitwise SimilarityIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.340392836:12(7380-7398)Online publication date: 1-Dec-2024
  • (2024)Revolutionizing protection dynamics in microgrids: Local validation environment and a novel global management control through multi-agent systemsComputers and Electrical Engineering10.1016/j.compeleceng.2024.109748120(109748)Online publication date: Dec-2024
  • (2024)ARDN: Attention Re-distribution Network for Visual Question AnsweringArabian Journal for Science and Engineering10.1007/s13369-024-09067-6Online publication date: 1-May-2024
  • (2023)Making It Tractable to Catch Duplicates and Conflicts in GraphsProceedings of the ACM on Management of Data10.1145/35889401:1(1-28)Online publication date: 30-May-2023
  • (2023)SD-INT: Towards Lightweight Network-Wide Passive INT in the Self-Driving Way2023 IEEE 31st International Conference on Network Protocols (ICNP)10.1109/ICNP59255.2023.10355635(1-12)Online publication date: 10-Oct-2023
  • (2023)Detection of Groups with Biased Representation in Ranking2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00168(2167-2179)Online publication date: Apr-2023
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media