Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Scalable discovery of unique column combinations

Published: 01 December 2013 Publication History

Abstract

The discovery of all unique (and non-unique) column combinations in a given dataset is at the core of any data profiling effort. The results are useful for a large number of areas of data management, such as anomaly detection, data integration, data modeling, duplicate detection, indexing, and query optimization. However, discovering all unique and non-unique column combinations is an NP-hard problem, which in principle requires to verify an exponential number of column combinations for uniqueness on all data values. Thus, achieving efficiency and scalability in this context is a tremendous challenge by itself.
In this paper, we devise Ducc, a scalable and efficient approach to the problem of finding all unique and non-unique column combinations in big datasets. We first model the problem as a graph coloring problem and analyze the pruning effect of individual combinations. We then present our hybrid column-based pruning technique, which traverses the lattice in a depth-first and random walk combination. This strategy allows Ducc to typically depend on the solution set size and hence to prune large swaths of the lattice. Ducc also incorporates row-based pruning to run uniqueness checks in just few milliseconds. To achieve even higher scalability, Ducc runs on several CPU cores (scale-up) and compute nodes (scale-out) with a very low overhead. We exhaustively evaluate Ducc using three datasets (two real and one synthetic) with several millions rows and hundreds of attributes. We compare Ducc with related work: Gordian and HCA. The results show that Ducc is up to more than 2 orders of magnitude faster than Gordian and HCA (631x faster than Gordian and 398x faster than HCA). Finally, a series of scalability experiments shows the efficiency of Ducc to scale up and out.

References

[1]
Z. Abedjan and F. Naumann. Advancing the discovery of unique column combinations. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 1565--1570, 2011.
[2]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 487--499, 1994.
[3]
J. Bauckmann, Z. Abedjan, U. Leser, H. Müller, and F. Naumann. Discovering conditional inclusion dependencies. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 2094--2098, 2012.
[4]
J. Bauckmann, U. Leser, F. Naumann, and V. Tietz. Efficiently detecting inclusion dependencies. In Proceedings of the International Conference on Data Engineering (ICDE), pages 1448--1450, 2007.
[5]
L. Bravo, W. Fan, and S. Ma. Extending dependencies with conditions. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 243--254, 2007.
[6]
P. G. Brown and P. J. Haas. BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 668--679, 2003.
[7]
G. Cormode, L. Golab, K. Flip, A. McGregor, D. Srivastava, and X. Zhang. Estimating the confidence of conditional functional dependencies. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 469--482, 2009.
[8]
W. Fan. Dependencies revisited for improving data quality. In Proceedings of the Symposium on Principles of Database Systems (PODS), pages 159--170, 2008.
[9]
W. Fan, F. Geerts, J. Li, and M. Xiong. Discovering conditional functional dependencies. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(5): 683--698, 2011.
[10]
C. Giannella and C. Wyss. Finding minimal keys in a relation instance. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.7086, 1999. Technical report. Last accessed on 2013-02-21.
[11]
C. Goble and R. Stevens. State of the nation in data integration for bioinformatics. J. of Biomedical Informatics, 41(5): 687--693, 2008.
[12]
L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. Proceedings of the VLDB Endowment (PVLDB), 1(1): 376--390, 2008.
[13]
G. Grahne and J. Zhu. Discovering approximate keys in XML data. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 453--460, 2002.
[14]
D. Gunopulos, R. Khardon, H. Mannila, and R. S. Sharma. Discovering all most specific sentences. ACM Transactions on Database Systems (TODS), 28: 140--174, 2003.
[15]
Y. Huhtala, J. Kaerkkaeinen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of the International Conference on Database Theory (ICDT), pages 392--401, 1998.
[16]
Y. Huhtala, J. Kaerkkaeinen, P. Porkka, and H. Toivonen. TANE: an efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2): 100--111, 1999.
[17]
IBM InfoSphere Information Analyzer. http://www-01.ibm.com/software/data/infosphere/information-analyzer.
[18]
I. F. Ilyas, V. Markl, P. Haas, P. Brown, and A. Aboulnaga. CORDS: automatic discovery of correlations and soft functional dependencies. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 647--658, 2004.
[19]
M. Kantola, H. Mannila, K.-J. Rih, and H. Siirtola. Discovering functional and inclusion dependencies in relational databases. International Journal of Intelligent Systems, 12: 591--607, 1992.
[20]
Z. Lacroix and T. Critchlow. Bioinformatics: managing scientific data. Morgan Kaufmann, Burlington, MA, 2003.
[21]
F. D. Marchi, S. Lopes, and J.-M. Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems, 32(1): 53--73, 2009.
[22]
Microsoft Data Profiling Task, 03/2013. http://msdn.microsoft.com/en-us/library/bb895263.aspx.
[23]
Oracle 11g Data Profiling. http://www.oracle.com/technetwork/middleware/data-integration/index-082810.html.
[24]
H. Saiedian and T. Spencer. An efficient algorithm to compute the candidate keys of a relational database schema. The Computer Journal, 39(2): 124--132, 1996.
[25]
Y. Sismanis, P. Brown, P. J. Haas, and B. Reinwald. Gordian: efficient and scalable discovery of composite keys. In Proceedings of the International Conference on Very Large Databases (VLDB), pages 691--702, 2006.
[26]
M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era (it's time for a complete rewrite). In Proceedings of the International Conference on Very Large Databases (VLDB), pages 1150--1160, 2007.
[27]
M. Zhang, M. Hadjieleftheriou, B. C. Ooi, C. M. Procopiuc, and D. Srivastava. On multi-column foreign key discovery. Proceedings of the VLDB Endowment (PVLDB), 3(1-2): 805--814, 2010.

Cited By

View all
  • (2023)Data Constraint Mining for Automatic Reconciliation Scripts GenerationProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598122(1119-1130)Online publication date: 12-Jul-2023
  • (2023)Exploratory Training: When Annonators Learn About DataProceedings of the ACM on Management of Data10.1145/35892801:2(1-25)Online publication date: 20-Jun-2023
  • (2023)Uniqueness Constraints for Object StoresJournal of Data and Information Quality10.1145/358175815:2(1-29)Online publication date: 19-Jan-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 7, Issue 4
December 2013
112 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 December 2013
Published in PVLDB Volume 7, Issue 4

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)43
  • Downloads (Last 6 weeks)6
Reflects downloads up to 20 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Data Constraint Mining for Automatic Reconciliation Scripts GenerationProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598122(1119-1130)Online publication date: 12-Jul-2023
  • (2023)Exploratory Training: When Annonators Learn About DataProceedings of the ACM on Management of Data10.1145/35892801:2(1-25)Online publication date: 20-Jun-2023
  • (2023)Uniqueness Constraints for Object StoresJournal of Data and Information Quality10.1145/358175815:2(1-29)Online publication date: 19-Jan-2023
  • (2023)Protecting Data Integrity of Web Applications with Database Constraints Inferred from Application CodeProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 210.1145/3575693.3575699(632-645)Online publication date: 27-Jan-2023
  • (2023)Discovery of Cross JoinsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2022.319284235:7(6839-6851)Online publication date: 1-Jul-2023
  • (2023)Incremental discovery of denial constraintsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00788-y32:6(1289-1313)Online publication date: 17-Mar-2023
  • (2022)Tab2KGSemantic Web10.3233/SW-22299313:3(571-597)Online publication date: 1-Jan-2022
  • (2022)Fast approximate denial constraint discoveryProceedings of the VLDB Endowment10.14778/3565816.356582816:2(269-281)Online publication date: 1-Oct-2022
  • (2022)MATEProceedings of the VLDB Endowment10.14778/3529337.352935315:8(1684-1696)Online publication date: 22-Jun-2022
  • (2022)Efficiently enumerating hitting sets of hypergraphs arising in data profilingJournal of Computer and System Sciences10.1016/j.jcss.2021.10.002124:C(192-213)Online publication date: 1-Mar-2022
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media