Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Efficient discovery of approximate dependencies

Published: 01 March 2018 Publication History

Abstract

Functional dependencies (FDs) and unique column combinations (UCCs) form a valuable ingredient for many data management tasks, such as data cleaning, schema recovery, and query optimization. Because these dependencies are unknown in most scenarios, their automatic discovery has been well researched. However, existing methods mostly discover only exact dependencies, i.e., those without violations. Real-world dependencies, in contrast, are frequently approximate due to data exceptions, ambiguities, or data errors. This relaxation to approximate dependencies renders their discovery an even harder task than the already challenging exact dependency discovery. To this end, we propose the novel and highly efficient algorithm Pyro to discover both approximate FDs and approximate UCCs. Pyro combines a separate-and-conquer search strategy with sampling-based guidance that quickly detects dependency candidates and verifies them. In our broad experimental evaluation, Pyro outperforms existing discovery algorithms by a factor of up to 33, scales to larger datasets, and at the same time requires the least main memory.

References

[1]
Z. Abedjan, L. Golab, and F. Naumann. Profiling relational data: a survey. VLDB Journal, 24(4):557--581, 2015.
[2]
Z. Abedjan, P. Schulze, and F. Naumann. DFD: efficient functional dependency discovery. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 949--958, 2014.
[3]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 487--499, 1994.
[4]
W. W. Armstrong. Dependency structures of data base relationships. In IFIP Congress, pages 580--583, 1974.
[5]
J. Atoum. Mining approximate functional dependencies from databases based on minimal cover and equivalent classes. European Journal of Scientific Research, 33(2):338--346, 2009.
[6]
C. Beeri, M. Dowd, R. Fagin, and R. Statman. On the structure of Armstrong relations for functional dependencies. Journal of the ACM, 31(1):30--46, 1984.
[7]
G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1--2):197--207, 2010.
[8]
T. Bleifuß, S. Bülow, J. Frohnhofen, J. Risch, G. Wiese, S. Kruse, T. Papenbrock, and F. Naumann. Approximate discovery of functional dependencies for large datasets. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 1803--1812, 2016.
[9]
L. Caruccio, V. Deufemia, and G. Polese. Relaxed functional dependencies - a survey of approaches. IEEE Transactions on Knowledge and Data Engineering (TKDE), 28(1):147--165, 2016.
[10]
G. Chandrashekar and F. Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16--28, 2014.
[11]
P. A. Flach and I. Savnik. FDEP source code. http://www.cs.bris.ac.uk/~flach/fdep/.
[12]
P. A. Flach and I. Savnik. Database dependency discovery: a machine learning approach. AI Communications, 12(3):139--160, 1999.
[13]
F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The LLUNATIC data-cleaning framework. PVLDB, 6(9):625--636, 2013.
[14]
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 1--12, 2000.
[15]
A. Heise, J.-A. Quiané-Ruiz, Z. Abedjan, A. Jentzsch, and F. Naumann. Scalable discovery of unique column combinations. PVLDB, 7(4):301--312, 2013.
[16]
Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: an efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100--111, 1999.
[17]
I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. CORDS: automatic discovery of correlations and soft functional dependencies. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 647--658, 2004.
[18]
R. M. Karp. Reducibility among combinatorial problems. In Proceedings of a symposium on the Complexity of Computer Computations, pages 85--103, 1972.
[19]
Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and S. Yin. Bigdansing: A system for big data cleansing. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 1215--1230, 2015.
[20]
R. S. King and J. J. Legendre. Discovery of functional and approximate functional dependencies in relational databases. Journal of Applied Mathematics and Decision Sciences, 7(1):49--59, 2003.
[21]
J. Kivinen and H. Mannila. Approximate dependency inference from relations. Proceedings of the International Conference on Database Theory (ICDT), pages 86--98, 1992.
[22]
J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. Theoretical Computer Science, 149(1):129--149, 1995.
[23]
H. Köhler, S. Link, and X. Zhou. Possible and certain SQL keys. PVLDB, 8(11):1118--1129, 2015.
[24]
S. Kolahi and L. V. Lakshmanan. On approximating optimum repairs for functional dependency violations. In Proceedings of the International Conference on Database Theory (ICDT), pages 53--62, 2009.
[25]
S. Kruse, D. Hahn, M. Walter, and F. Naumann. Metacrate: organize and analyze millions of data profiles. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2017.
[26]
S. Kruse, T. Papenbrock, H. Harmouch, and F. Naumann. Data anamnesis: admitting raw data into an organization. IEEE Data Engineering Bulletin, 39(2):8--20, 2016.
[27]
V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann. How good are query optimizers, really? PVLDB, 9(3):204--215, 2015.
[28]
W. Li, Z. Li, Q. Chen, T. Jiang, and Z. Yin. Discovering approximate functional dependencies from distributed big data. In Asia-Pacific Web Conference, pages 289--301, 2016.
[29]
J. Liu, J. Li, C. Liu, and Y. Chen. Discover dependencies from data - A review. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(2):251--264, 2012.
[30]
S. Lopes, J.-M. Petit, and L. Lakhal. Functional and approximate dependency mining: database and FCA points of view. Journal of Experimental & Theoretical Artificial Intelligence, 14(2--3):93--114, 2002.
[31]
P. Mandros, M. Boley, and J. Vreeken. Discovering reliable approximate functional dependencies. In Proceedings of the International Conference on Knowledge discovery and data mining (SIGKDD), pages 355--363, 2017.
[32]
H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241--258, 1997.
[33]
V. Matos and B. Grasser. SQL-based discovery of exact and approximate functional dependencies. ACM SIGCSE Bulletin, 36(4):58--63, 2004.
[34]
R. J. Miller, M. A. Hernández, L. M. Haas, L.-L. Yan, C. H. Ho, R. Fagin, and L. Popa. The Clio project: managing heterogeneity. SIGMOD Record, 30(1):78--83, 2001.
[35]
U. Nambiar and S. Kambhampati. Mining approximate functional dependencies and concept similarities to answer imprecise queries. In Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB), pages 73--78, 2004.
[36]
T. Papenbrock, T. Bergmann, M. Finke, J. Zwiener, and F. Naumann. Data profiling with Metanome. PVLDB, 8(12):1860--1863, 2015.
[37]
T. Papenbrock, J. Ehrlich, J. Marten, T. Neubert, J.-P. Rudolph, M. Schönberg, J. Zwiener, and F. Naumann. Functional dependency discovery: an experimental evaluation of seven algorithms. PVLDB, 8(10):1082--1093, 2015.
[38]
T. Papenbrock and F. Naumann. A hybrid approach to functional dependency discovery. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 821--833, 2016.
[39]
T. Papenbrock and F. Naumann. Data-driven schema normalization. In Proceedings of the International Conference on Extending Database Technology (EDBT), pages 342--353, 2017.
[40]
D. Sánchez, J. M. Serrano, I. Blanco, M. J. Martín-Bautista, and M.-A. Vila. Using association rules to mine for strong approximate dependencies. Data Mining and Knowledge Discovery, 16(3):313--348, 2008.
[41]
Y. Sismanis, P. Brown, P. J. Haas, and B. Reinwald. GORDIAN: efficient and scalable discovery of composite keys. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 691--702, 2006.
[42]
S. Thirumuruganathan, L. Berti-Equille, M. Ouzzani, J. A. Quianaé-Ruiz, and N. Tang. UGuide: user-guided discovery of FD-detectable errors. In Proceedings of the International Conference on Management of Data (SIGMOD), 2017.
[43]
J. Zwiener. Quicker ways of doing fewer things: improved index structures and algorithms for data profiling. Master's thesis, Hasso Plattner Institute, University of Potsdam, 2015.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 11, Issue 7
March 2018
107 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 March 2018
Published in PVLDB Volume 11, Issue 7

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)64
  • Downloads (Last 6 weeks)8
Reflects downloads up to 28 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)DAFDiscover: Robust Mining Algorithm for Dynamic Approximate Functional Dependencies on Dirty DataProceedings of the VLDB Endowment10.14778/3681954.368201517:11(3484-3496)Online publication date: 1-Jul-2024
  • (2024)SplitDF: Splitting Dataframes for Memory-Efficient Data AnalysisProceedings of the VLDB Endowment10.14778/3665844.366584917:9(2175-2184)Online publication date: 1-May-2024
  • (2024)Efficient Differential Dependency DiscoveryProceedings of the VLDB Endowment10.14778/3654621.365462417:7(1552-1564)Online publication date: 1-Mar-2024
  • (2024)Discovering Top-k Relevant and Diversified RulesProceedings of the ACM on Management of Data10.1145/36771312:4(1-28)Online publication date: 30-Sep-2024
  • (2024)Database Repairing with Soft Functional DependenciesACM Transactions on Database Systems10.1145/365115649:2(1-34)Online publication date: 10-Apr-2024
  • (2023)Pando: Enhanced Data Skipping with Logical Data PartitioningProceedings of the VLDB Endowment10.14778/3598581.359860116:9(2316-2329)Online publication date: 1-May-2023
  • (2023)Discovering Top-k Rules using Subjective and Objective CriteriaProceedings of the ACM on Management of Data10.1145/35889241:1(1-29)Online publication date: 30-May-2023
  • (2023)Incremental discovery of denial constraintsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00788-y32:6(1289-1313)Online publication date: 17-Mar-2023
  • (2023)FastAGEDs: Fast Approximate Graph Entity Dependency DiscoveryWeb Information Systems Engineering – WISE 202310.1007/978-981-99-7254-8_35(451-465)Online publication date: 25-Oct-2023
  • (2022)Fast approximate denial constraint discoveryProceedings of the VLDB Endowment10.14778/3565816.356582816:2(269-281)Online publication date: 1-Oct-2022
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media