research-article

Efficient discovery of approximate dependencies

Editors: Jian Pei, Sihem Amer-Yahia Authors:

Sebastian Kruse,

Felix NaumannAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 11, Issue 7

Pages 759 - 772

https://doi.org/10.14778/3192965.3192968

Published: 01 March 2018 Publication History

Abstract

Functional dependencies (FDs) and unique column combinations (UCCs) form a valuable ingredient for many data management tasks, such as data cleaning, schema recovery, and query optimization. Because these dependencies are unknown in most scenarios, their automatic discovery has been well researched. However, existing methods mostly discover only exact dependencies, i.e., those without violations. Real-world dependencies, in contrast, are frequently approximate due to data exceptions, ambiguities, or data errors. This relaxation to approximate dependencies renders their discovery an even harder task than the already challenging exact dependency discovery. To this end, we propose the novel and highly efficient algorithm Pyro to discover both approximate FDs and approximate UCCs. Pyro combines a separate-and-conquer search strategy with sampling-based guidance that quickly detects dependency candidates and verifies them. In our broad experimental evaluation, Pyro outperforms existing discovery algorithms by a factor of up to 33, scales to larger datasets, and at the same time requires the least main memory.

References

[1]

Z. Abedjan, L. Golab, and F. Naumann. Profiling relational data: a survey. VLDB Journal, 24(4):557--581, 2015.

Digital Library

[2]

Z. Abedjan, P. Schulze, and F. Naumann. DFD: efficient functional dependency discovery. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 949--958, 2014.

Digital Library

[3]

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 487--499, 1994.

Digital Library

[4]

W. W. Armstrong. Dependency structures of data base relationships. In IFIP Congress, pages 580--583, 1974.

[5]

J. Atoum. Mining approximate functional dependencies from databases based on minimal cover and equivalent classes. European Journal of Scientific Research, 33(2):338--346, 2009.

[6]

C. Beeri, M. Dowd, R. Fagin, and R. Statman. On the structure of Armstrong relations for functional dependencies. Journal of the ACM, 31(1):30--46, 1984.

Digital Library

[7]

G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1--2):197--207, 2010.

Digital Library

[8]

T. Bleifuß, S. Bülow, J. Frohnhofen, J. Risch, G. Wiese, S. Kruse, T. Papenbrock, and F. Naumann. Approximate discovery of functional dependencies for large datasets. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 1803--1812, 2016.

Digital Library

[9]

L. Caruccio, V. Deufemia, and G. Polese. Relaxed functional dependencies - a survey of approaches. IEEE Transactions on Knowledge and Data Engineering (TKDE), 28(1):147--165, 2016.

Digital Library

[10]

G. Chandrashekar and F. Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16--28, 2014.

Digital Library

[11]

P. A. Flach and I. Savnik. FDEP source code. http://www.cs.bris.ac.uk/~flach/fdep/.

[12]

P. A. Flach and I. Savnik. Database dependency discovery: a machine learning approach. AI Communications, 12(3):139--160, 1999.

Digital Library

[13]

F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The LLUNATIC data-cleaning framework. PVLDB, 6(9):625--636, 2013.

Digital Library

[14]

J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 1--12, 2000.

Digital Library

[15]

A. Heise, J.-A. Quiané-Ruiz, Z. Abedjan, A. Jentzsch, and F. Naumann. Scalable discovery of unique column combinations. PVLDB, 7(4):301--312, 2013.

Digital Library

[16]

Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: an efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100--111, 1999.

[17]

I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. CORDS: automatic discovery of correlations and soft functional dependencies. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 647--658, 2004.

Digital Library

[18]

R. M. Karp. Reducibility among combinatorial problems. In Proceedings of a symposium on the Complexity of Computer Computations, pages 85--103, 1972.

[19]

Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and S. Yin. Bigdansing: A system for big data cleansing. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 1215--1230, 2015.

Digital Library

[20]

R. S. King and J. J. Legendre. Discovery of functional and approximate functional dependencies in relational databases. Journal of Applied Mathematics and Decision Sciences, 7(1):49--59, 2003.

[21]

J. Kivinen and H. Mannila. Approximate dependency inference from relations. Proceedings of the International Conference on Database Theory (ICDT), pages 86--98, 1992.

Digital Library

[22]

J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. Theoretical Computer Science, 149(1):129--149, 1995.

Digital Library

[23]

H. Köhler, S. Link, and X. Zhou. Possible and certain SQL keys. PVLDB, 8(11):1118--1129, 2015.

Digital Library

[24]

S. Kolahi and L. V. Lakshmanan. On approximating optimum repairs for functional dependency violations. In Proceedings of the International Conference on Database Theory (ICDT), pages 53--62, 2009.

Digital Library

[25]

S. Kruse, D. Hahn, M. Walter, and F. Naumann. Metacrate: organize and analyze millions of data profiles. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2017.

Digital Library

[26]

S. Kruse, T. Papenbrock, H. Harmouch, and F. Naumann. Data anamnesis: admitting raw data into an organization. IEEE Data Engineering Bulletin, 39(2):8--20, 2016.

[27]

V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann. How good are query optimizers, really? PVLDB, 9(3):204--215, 2015.

Digital Library

[28]

W. Li, Z. Li, Q. Chen, T. Jiang, and Z. Yin. Discovering approximate functional dependencies from distributed big data. In Asia-Pacific Web Conference, pages 289--301, 2016.

[29]

J. Liu, J. Li, C. Liu, and Y. Chen. Discover dependencies from data - A review. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(2):251--264, 2012.

Digital Library

[30]

S. Lopes, J.-M. Petit, and L. Lakhal. Functional and approximate dependency mining: database and FCA points of view. Journal of Experimental & Theoretical Artificial Intelligence, 14(2--3):93--114, 2002.

[31]

P. Mandros, M. Boley, and J. Vreeken. Discovering reliable approximate functional dependencies. In Proceedings of the International Conference on Knowledge discovery and data mining (SIGKDD), pages 355--363, 2017.

Digital Library

[32]

H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241--258, 1997.

Digital Library

[33]

V. Matos and B. Grasser. SQL-based discovery of exact and approximate functional dependencies. ACM SIGCSE Bulletin, 36(4):58--63, 2004.

Digital Library

[34]

R. J. Miller, M. A. Hernández, L. M. Haas, L.-L. Yan, C. H. Ho, R. Fagin, and L. Popa. The Clio project: managing heterogeneity. SIGMOD Record, 30(1):78--83, 2001.

Digital Library

[35]

U. Nambiar and S. Kambhampati. Mining approximate functional dependencies and concept similarities to answer imprecise queries. In Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB), pages 73--78, 2004.

Digital Library

[36]

T. Papenbrock, T. Bergmann, M. Finke, J. Zwiener, and F. Naumann. Data profiling with Metanome. PVLDB, 8(12):1860--1863, 2015.

Digital Library

[37]

T. Papenbrock, J. Ehrlich, J. Marten, T. Neubert, J.-P. Rudolph, M. Schönberg, J. Zwiener, and F. Naumann. Functional dependency discovery: an experimental evaluation of seven algorithms. PVLDB, 8(10):1082--1093, 2015.

Digital Library

[38]

T. Papenbrock and F. Naumann. A hybrid approach to functional dependency discovery. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 821--833, 2016.

Digital Library

[39]

T. Papenbrock and F. Naumann. Data-driven schema normalization. In Proceedings of the International Conference on Extending Database Technology (EDBT), pages 342--353, 2017.

[40]

D. Sánchez, J. M. Serrano, I. Blanco, M. J. Martín-Bautista, and M.-A. Vila. Using association rules to mine for strong approximate dependencies. Data Mining and Knowledge Discovery, 16(3):313--348, 2008.

Digital Library

[41]

Y. Sismanis, P. Brown, P. J. Haas, and B. Reinwald. GORDIAN: efficient and scalable discovery of composite keys. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 691--702, 2006.

Digital Library

[42]

S. Thirumuruganathan, L. Berti-Equille, M. Ouzzani, J. A. Quianaé-Ruiz, and N. Tang. UGuide: user-guided discovery of FD-detectable errors. In Proceedings of the International Conference on Management of Data (SIGMOD), 2017.

Digital Library

[43]

J. Zwiener. Quicker ways of doing fewer things: improved index structures and algorithms for data profiling. Master's thesis, Hasso Plattner Institute, University of Potsdam, 2015.

Cited By

Ding XLu YWang HWang CLiu YWang J(2024)DAFDiscover: Robust Mining Algorithm for Dynamic Approximate Functional Dependencies on Dirty DataProceedings of the VLDB Endowment10.14778/3681954.368201517:11(3484-3496)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682015
Kakaraparthy APatel J(2024)SplitDF: Splitting Dataframes for Memory-Efficient Data AnalysisProceedings of the VLDB Endowment10.14778/3665844.366584917:9(2175-2184)Online publication date: 1-May-2024
https://dl.acm.org/doi/10.14778/3665844.3665849
Kuang SYang HTan ZMa S(2024)Efficient Differential Dependency DiscoveryProceedings of the VLDB Endowment10.14778/3654621.365462417:7(1552-1564)Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.14778/3654621.3654624
Show More Cited By

Recommendations

Approximate Discovery of Functional Dependencies for Large Datasets
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

Functional dependencies (FDs) are an important prerequisite for various data management tasks, such as schema normalization, query optimization, and data cleansing. However, automatic FD discovery entails an exponentially growing search and solution ...
Formalization and Discovery of Approximate Conditional Functional Dependencies
DEXA 2013: Proceedings of the 24th International Conference on Database and Expert Systems Applications - Volume 8055

We propose efficient and precise discoveries of approximate Conditional Functional Dependencies CFDs, by providing a precise formalization of approximate CFDs and presenting three discovery algorithms approxCFDMiner, approxCTANE and approxFastCFD as ...
Discovery of Approximate Lexicographical Order Dependencies
Lexicographical order dependencies (LODs) specify orders between list of attributes, and are proven useful in optimizing SQL queries with order by clauses. To discover hidden dependencies from dirty data in practice, approximate dependency discoveries are ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 11, Issue 7

March 2018

107 pages

ISSN:2150-8097

Editors:
Jian Pei
Simon Fraser University
,
Sihem Amer-Yahia
University of Grenoble Alpes, CNRS

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 March 2018

Published in PVLDB Volume 11, Issue 7

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

31
Total Citations
View Citations
334
Total Downloads

Downloads (Last 12 months)64
Downloads (Last 6 weeks)8

Reflects downloads up to 28 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Ding XLu YWang HWang CLiu YWang J(2024)DAFDiscover: Robust Mining Algorithm for Dynamic Approximate Functional Dependencies on Dirty DataProceedings of the VLDB Endowment10.14778/3681954.368201517:11(3484-3496)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.14778/3681954.3682015
Kakaraparthy APatel J(2024)SplitDF: Splitting Dataframes for Memory-Efficient Data AnalysisProceedings of the VLDB Endowment10.14778/3665844.366584917:9(2175-2184)Online publication date: 1-May-2024
https://dl.acm.org/doi/10.14778/3665844.3665849
Kuang SYang HTan ZMa S(2024)Efficient Differential Dependency DiscoveryProceedings of the VLDB Endowment10.14778/3654621.365462417:7(1552-1564)Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.14778/3654621.3654624
Fan WHan ZXie MZhang G(2024)Discovering Top-k Relevant and Diversified RulesProceedings of the ACM on Management of Data10.1145/36771312:4(1-28)Online publication date: 30-Sep-2024
https://dl.acm.org/doi/10.1145/3677131
Carmeli NGrohe MKimelfeld BLivshits ETibi M(2024)Database Repairing with Soft Functional DependenciesACM Transactions on Database Systems10.1145/365115649:2(1-34)Online publication date: 10-Apr-2024
https://dl.acm.org/doi/10.1145/3651156
Sudhir STao WLaptev NHabis CCafarella MMadden S(2023)Pando: Enhanced Data Skipping with Logical Data PartitioningProceedings of the VLDB Endowment10.14778/3598581.359860116:9(2316-2329)Online publication date: 1-May-2023
https://dl.acm.org/doi/10.14778/3598581.3598601
Fan WHan ZWang YXie M(2023)Discovering Top-k Rules using Subjective and Objective CriteriaProceedings of the ACM on Management of Data10.1145/35889241:1(1-29)Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588924
Qian CLi MTan ZRan AMa S(2023)Incremental discovery of denial constraintsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00788-y32:6(1289-1313)Online publication date: 17-Mar-2023
https://dl.acm.org/doi/10.1007/s00778-023-00788-y
Zhou GKwashie SZhang YBewong MNofong VHu JCheng DHe KLiu SFeng Z(2023)FastAGEDs: Fast Approximate Graph Entity Dependency DiscoveryWeb Information Systems Engineering – WISE 202310.1007/978-981-99-7254-8_35(451-465)Online publication date: 25-Oct-2023
https://dl.acm.org/doi/10.1007/978-981-99-7254-8_35
Xiao RTan ZWang HMa S(2022)Fast approximate denial constraint discoveryProceedings of the VLDB Endowment10.14778/3565816.356582816:2(269-281)Online publication date: 1-Oct-2022
https://dl.acm.org/doi/10.14778/3565816.3565828
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents