research-article

Scalable discovery of unique column combinations

Editors: H. V. Jagadish, Aoying Zhou Authors:

Jorge-Arnulfo Quiané-Ruiz,

Ziawasch Abedjan,

Felix NaumannAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 7, Issue 4

Pages 301 - 312

https://doi.org/10.14778/2732240.2732248

Published: 01 December 2013 Publication History

Abstract

The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires to verify an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself.

In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows Ducc to typically depend on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several millions rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of Ducc to scale up and out.

References

[1]

Z. Abedjan and F. Naumann. Advancing the discovery of unique column combinations. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 1565--1570, 2011.

Digital Library

[2]

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 487--499, 1994.

Digital Library

[3]

J. Bauckmann, Z. Abedjan, U. Leser, H. Müller, and F. Naumann. Discovering conditional inclusion dependencies. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 2094--2098, 2012.

Digital Library

[4]

J. Bauckmann, U. Leser, F. Naumann, and V. Tietz. Efficiently detecting inclusion dependencies. In Proceedings of the International Conference on Data Engineering (ICDE), pages 1448--1450, 2007.

[5]

L. Bravo, W. Fan, and S. Ma. Extending dependencies with conditions. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 243--254, 2007.

Digital Library

[6]

P. G. Brown and P. J. Haas. BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 668--679, 2003.

Digital Library

[7]

G. Cormode, L. Golab, K. Flip, A. McGregor, D. Srivastava, and X. Zhang. Estimating the confidence of conditional functional dependencies. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 469--482, 2009.

Digital Library

[8]

W. Fan. Dependencies revisited for improving data quality. In Proceedings of the Symposium on Principles of Database Systems (PODS), pages 159--170, 2008.

Digital Library

[9]

W. Fan, F. Geerts, J. Li, and M. Xiong. Discovering conditional functional dependencies. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(5): 683--698, 2011.

Digital Library

[10]

C. Giannella and C. Wyss. Finding minimal keys in a relation instance. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.7086, 1999. Technical report. Last accessed on 2013-02-21.

[11]

C. Goble and R. Stevens. State of the nation in data integration for bioinformatics. J. of Biomedical Informatics, 41(5): 687--693, 2008.

Digital Library

[12]

L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. Proceedings of the VLDB Endowment (PVLDB), 1(1): 376--390, 2008.

Digital Library

[13]

G. Grahne and J. Zhu. Discovering approximate keys in XML data. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 453--460, 2002.

Digital Library

[14]

D. Gunopulos, R. Khardon, H. Mannila, and R. S. Sharma. Discovering all most specific sentences. ACM Transactions on Database Systems (TODS), 28: 140--174, 2003.

Digital Library

[15]

Y. Huhtala, J. Kaerkkaeinen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of the International Conference on Database Theory (ICDT), pages 392--401, 1998.

Digital Library

[16]

Y. Huhtala, J. Kaerkkaeinen, P. Porkka, and H. Toivonen. TANE: an efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2): 100--111, 1999.

[17]

IBM InfoSphere Information Analyzer. http://www-01.ibm.com/software/data/infosphere/information-analyzer.

[18]

I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. CORDS: automatic discovery of correlations and soft functional dependencies. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 647--658, 2004.

Digital Library

[19]

M. Kantola, H. Mannila, K.-J. Rih, and H. Siirtola. Discovering functional and inclusion dependencies in relational databases. International Journal of Intelligent Systems, 12: 591--607, 1992.

[20]

Z. Lacroix and T. Critchlow. Bioinformatics: managing scientific data. Morgan Kaufmann, Burlington, MA, 2003.

[21]

F. D. Marchi, S. Lopes, and J.-M. Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32(1): 53--73, 2009.

Digital Library

[22]

Microsoft Data Profiling Task, 03/2013. http://msdn.microsoft.com/en-us/library/bb895263.aspx.

[23]

Oracle 11g Data Profiling. http://www.oracle.com/technetwork/middleware/data-integration/index-082810.html.

[24]

H. Saiedian and T. Spencer. An efficient algorithm to compute the candidate keys of a relational database schema. The Computer Journal, 39(2): 124--132, 1996.

[25]

Y. Sismanis, P. Brown, P. J. Haas, and B. Reinwald. Gordian: efficient and scalable discovery of composite keys. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 691--702, 2006.

Digital Library

[26]

M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era (it's time for a complete rewrite). In Proceedings of the International Conference on Very Large Databases (VLDB), pages 1150--1160, 2007.

Digital Library

[27]

M. Zhang, M. Hadjieleftheriou, B. C. Ooi, C. M. Procopiuc, and D. Srivastava. On multi-column foreign key discovery. Proceedings of the VLDB Endowment (PVLDB), 3(1-2): 805--814, 2010.

Digital Library

Cited By

Wang TZhi CZhou XWu JYin JDeng SJust RFraser G(2023)Data Constraint Mining for Automatic Reconciliation Scripts GenerationProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598122(1119-1130)Online publication date: 12-Jul-2023
https://dl.acm.org/doi/10.1145/3597926.3598122
Shrestha RHabibelahian OTermehchy APapotti P(2023)Exploratory Training: When Annonators Learn About DataProceedings of the ACM on Management of Data10.1145/35892801:2(1-25)Online publication date: 20-Jun-2023
https://dl.acm.org/doi/10.1145/3589280
Skavantzos PLeck UZhao KLink S(2023)Uniqueness Constraints for Object StoresJournal of Data and Information Quality10.1145/358175815:2(1-29)Online publication date: 19-Jan-2023
https://dl.acm.org/doi/10.1145/3581758
Show More Cited By

Recommendations

Advancing the discovery of unique column combinations
CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

Unique column combinations of a relational database table are sets of columns that contain only unique values. Discovering such combinations is a fundamental research problem and has many different data management and knowledge discovery applications. ...
SAKey: Scalable Almost Key Discovery in RDF Data
The Semantic Web – ISWC 2014
Abstract
Exploiting identity links among RDF resources allows applications to efficiently integrate data. Keys can be very useful to discover these identity links. A set of properties is considered as a key when its values uniquely identify resources. ...
Accuracy-Preserving and Scalable Column-Based Low-Rank Matrix Approximation
Knowledge Science, Engineering and Management
Abstract
Column-based low-rank matrix approximation is a useful method to analyze and interpret data in machine learning and data mining. However existing methods will face some accuracy and scalability problems when dealing with large-scale data. In this ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 7, Issue 4

December 2013

112 pages

ISSN:2150-8097

Editors:
H. V. Jagadish
University of Michigan
,
Aoying Zhou
East Normal University, China

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 December 2013

Published in PVLDB Volume 7, Issue 4

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

40
Total Citations
View Citations
267
Total Downloads

Downloads (Last 12 months)43
Downloads (Last 6 weeks)6

Reflects downloads up to 20 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Wang TZhi CZhou XWu JYin JDeng SJust RFraser G(2023)Data Constraint Mining for Automatic Reconciliation Scripts GenerationProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598122(1119-1130)Online publication date: 12-Jul-2023
https://dl.acm.org/doi/10.1145/3597926.3598122
Shrestha RHabibelahian OTermehchy APapotti P(2023)Exploratory Training: When Annonators Learn About DataProceedings of the ACM on Management of Data10.1145/35892801:2(1-25)Online publication date: 20-Jun-2023
https://dl.acm.org/doi/10.1145/3589280
Skavantzos PLeck UZhao KLink S(2023)Uniqueness Constraints for Object StoresJournal of Data and Information Quality10.1145/358175815:2(1-29)Online publication date: 19-Jan-2023
https://dl.acm.org/doi/10.1145/3581758
Huang HShen BZhong LZhou YAamodt TJerger NSwift M(2023)Protecting Data Integrity of Web Applications with Database Constraints Inferred from Application CodeProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3575693.3575699(632-645)Online publication date: 27-Jan-2023
https://dl.acm.org/doi/10.1145/3575693.3575699
Hannula MZhang ZSong BLink S(2023)Discovery of Cross JoinsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.319284235:7(6839-6851)Online publication date: 1-Jul-2023
https://dl.acm.org/doi/10.1109/TKDE.2022.3192842
Qian CLi MTan ZRan AMa S(2023)Incremental discovery of denial constraintsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00788-y32:6(1289-1313)Online publication date: 17-Mar-2023
https://dl.acm.org/doi/10.1007/s00778-023-00788-y
Gottschalk SDemidova EAlam MBuscaldi DCochez MOsborne FRefogiato Recupero DSack H(2022)Tab2KGSemantic Web10.3233/SW-22299313:3(571-597)Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.3233/SW-222993
Xiao RTan ZWang HMa S(2022)Fast approximate denial constraint discoveryProceedings of the VLDB Endowment10.14778/3565816.356582816:2(269-281)Online publication date: 1-Oct-2022
https://dl.acm.org/doi/10.14778/3565816.3565828
Esmailoghli MQuiané-Ruiz JAbedjan Z(2022)MATEProceedings of the VLDB Endowment10.14778/3529337.352935315:8(1684-1696)Online publication date: 22-Jun-2022
https://dl.acm.org/doi/10.14778/3529337.3529353
Bläsius TFriedrich TLischeid JMeeks KSchirneck M(2022)Efficiently enumerating hitting sets of hypergraphs arising in data profilingJournal of Computer and System Sciences10.1016/j.jcss.2021.10.002124:C(192-213)Online publication date: 1-Mar-2022
https://dl.acm.org/doi/10.1016/j.jcss.2021.10.002
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents