research-article

Memory-efficient frequent-itemset mining

Authors:

Benjamin Schlegel,

Rainer Gemulla,

Wolfgang LehnerAuthors Info & Claims

EDBT/ICDT '11: Proceedings of the 14th International Conference on Extending Database Technology

Pages 461 - 472

https://doi.org/10.1145/1951365.1951420

Published: 21 March 2011 Publication History

Abstract

Efficient discovery of frequent itemsets in large datasets is a key component of many data mining tasks. In-core algorithms---which operate entirely in main memory and avoid expensive disk accesses---and in particular the prefix tree-based algorithm FP-growth are generally among the most efficient of the available algorithms. Unfortunately, their excessive memory requirements render them inapplicable for large datasets with many distinct items and/or itemsets of high cardinality. To overcome this limitation, we propose two novel data structures---the CFP-tree and the CFP-array---, which reduce memory consumption by about an order of magnitude. This allows us to process significantly larger datasets in main memory than previously possible. Our data structures are based on structural modifications of the prefix tree that increase compressability, an optimized physical representation, lightweight compression techniques, and intelligent node ordering and indexing. Experiments with both real-world and synthetic datasets show the effectiveness of our approach.

References

[1]

R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207--216, New York, NY, USA, 1993. ACM.

Digital Library

[2]

R. Agrawal and R. Srikant. Quest synthetic data generator. IBM Almaden Research Center.

[3]

R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB, pages 487--499, 1994.

Digital Library

[4]

C. Bienia, S. Kumar, J. P. Singh, and K. Li. The parsec benchmark suite: Characterization and architectural implications. In PACT, October 2008.

Digital Library

[5]

T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, pages 254--260, 1999.

Digital Library

[6]

G. Buehrer, S. Parthasarathy, and A. Ghoting. Out-of-core frequent pattern mining on a commodity pc. In SIGKDD, pages 86--95, New York, NY, USA, 2006. ACM.

Digital Library

[7]

G. Buehrer, S. Parthasarathy, S. Tatikonda, T. Kurc, and J. Saltz. Toward terabyte pattern mining: an architecture-conscious solution. In PPoPP, pages 2--12, New York, NY, USA, 2007. ACM.

Digital Library

[8]

Frequent Itemset Mining Implementations Repository: http://fimi.cs.helsinki.fi/.

[9]

K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board, Washington DC. (USA), January 12--16, page 18pp, 2003.

[10]

A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, A. Nguyen, Y.-K. Chen, and P. Dubey. Cache-conscious frequent pattern mining on a modern processor. In VLDB, pages 577--588, 2005.

Digital Library

[11]

B. Goethals and M. J. Zaki. Advances in frequent itemset mining implementations: report on fimi'03. SIGKDD Explorations, 6(1):109--117, 2004.

Digital Library

[12]

G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets. In FIMI, 2003.

[13]

J. Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.

Digital Library

[14]

J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, pages 1--12, New York, NY, USA, 2000. ACM.

Digital Library

[15]

D. E. Knuth. The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, 1973.

[16]

E. Li and L. Liu. Optimization of frequent itemset mining on multiple-core processor. In VLDB, pages 1275--1285, 2007.

Digital Library

[17]

H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang. Pfp: parallel fp-growth for query recommendation. In RecSys, pages 107--114, New York, NY, USA, 2008. ACM.

Digital Library

[18]

G. Liu, H. Lu, and J. X. Yu. Afopt: An efficient implementation of pattern growth approach. In In Proceedings of the ICDM workshop, 2003.

[19]

C. Lucchese, S. Orlando, R. Perego, and F. Silvestri. Webdocs: a real-life huge transactional dataset. In FIMI, 2004.

[20]

E. Özkural and C. Aykanat. A space optimization for fp-growth. In FIMI, 2004.

[21]

A. Pietracaprina and D. Zandolin. Mining frequent itemsets using patricia tries. In FIMI, 2003.

[22]

J. R. Punin, M. S. Krishnamoorthy, and M. J. Zaki. Web usage mining - languages and algorithms. In In Studies in Classification, Data Analysis, and Knowledge Organization, pages 88--112. Springer-Verlag, 2001.

[23]

B. Rácz. nonordfp: An fp-growth variation without rebuilding the fp-tree. In FIMI, 2004.

[24]

H. K. Reghbati. An overview of data compression techniques. IEEE Computer, 14(4):71--75, 1981.

Digital Library

[25]

D. D. Sleator and R. E. Tarjan. Self-adjusting binary search trees. J. ACM, 32(3):652--686, 1985.

Digital Library

[26]

Y. G. Sucahyo and R. P. Gopalan. Ct-itl: Efficient frequent item set mining using a compressed prefix tree with pattern growth. In ADC, pages 95--104, 2003.

Digital Library

[27]

Y. G. Sucahyo and R. P. Gopalan. Ct-pro: A bottom-up non recursive frequent itemset mining algorithm using compressed fp-tree data structure. In FIMI, 2004.

[28]

H. Toivonen. Sampling large databases for association rules. In VLDB, pages 134--145, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.

Digital Library

[29]

T. Uno, M. Kiyomi, and H. Arimura. Lcm ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets. In FIMI, 2004.

[30]

J. T.-L. Wang, M. J. Zaki, H. Toivonen, and D. Shasha, editors. Data Mining in Bioinformatics. Springer, 2005.

Digital Library

[31]

T. Westmann, D. Kossmann, S. Helmer, and G. Moerkotte. The implementation and performance of compressed databases. SIGMOD Record, 29(3):55--67, 2000.

Digital Library

[32]

M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. Technical report, Rochester, NY, USA, 1997.

Digital Library

Cited By

Chaturvedi SSaritha SChaturvedi A(2023)Spark based Parallel Frequent Pattern Rules for Social Media Data Analytics2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW)10.1109/CCGridW59191.2023.00039(168-175)Online publication date: May-2023
https://doi.org/10.1109/CCGridW59191.2023.00039
Köppl DPuglisi SRaman R(2022)Fast and Simple Compact Hashing via BucketingAlgorithmica10.1007/s00453-022-00996-y84:9(2735-2766)Online publication date: 29-Jun-2022
https://doi.org/10.1007/s00453-022-00996-y
Chen JLi PFang WZhou NYin YZheng HXu HWang R(2021)Fuzzy Frequent Pattern Mining Algorithm Based on Weighted Sliding Window and Type-2 Fuzzy Sets over Medical Data StreamWireless Communications and Mobile Computing10.1155/2021/66622542021(1-17)Online publication date: 26-Dec-2021
https://doi.org/10.1155/2021/6662254
Show More Cited By

Index Terms

Memory-efficient frequent-itemset mining
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Memory issues in frequent itemset mining
SAC '04: Proceedings of the 2004 ACM symposium on Applied computing

During the past decade, many algorithms have been proposed to solve the frequent itemset mining problem, i.e. find all sets of items that frequently occur together in a given database of transactions. Although very efficient techniques have been ...
Memory Efficient Frequent Itemset Mining
Machine Learning and Data Mining in Pattern Recognition
Abstract
Frequent itemset mining has been one of the most popular data mining techniques. Despite a large number of algorithms developed to implement this functionality, there is still room for improvement of their efficiency. In this paper, we focus on ...
Approximate Inverse Frequent Itemset Mining: Privacy, Complexity, and Approximation
ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining

In order to generate synthetic basket datasets for better benchmark testing, it is important to integrate characteristics from real-life databases into the synthetic basket datasets. The characteristics that could be used for this purpose include the ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

EDBT/ICDT '11: Proceedings of the 14th International Conference on Extending Database Technology

March 2011

587 pages

ISBN:9781450305280

DOI:10.1145/1951365

Editors:
Anastasia Ailamaki
EPFL, Switzerland
,
Sihem Amer-Yahia
Yahoo! Research
,
Jignesh Pate
University of Wisconsin-Madison
,
Tore Risch
Uppsala University, Sweden
,
Pierre Senellart
Télécom ParisTech, France
,
Julia Stoyanovich
University of Pennsylvania

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Microsoft Research: Microsoft Research

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 March 2011

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Research-article

Conference

EDBT/ICDT '11

Sponsor:

Microsoft Research

EDBT/ICDT '11: EDBT/ICDT '11 joint conference

March 21 - 24, 2011

Uppsala, Sweden

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

24
Total Citations
View Citations
409
Total Downloads

Downloads (Last 12 months)15
Downloads (Last 6 weeks)3

Reflects downloads up to 13 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Chaturvedi SSaritha SChaturvedi A(2023)Spark based Parallel Frequent Pattern Rules for Social Media Data Analytics2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW)10.1109/CCGridW59191.2023.00039(168-175)Online publication date: May-2023
https://doi.org/10.1109/CCGridW59191.2023.00039
Köppl DPuglisi SRaman R(2022)Fast and Simple Compact Hashing via BucketingAlgorithmica10.1007/s00453-022-00996-y84:9(2735-2766)Online publication date: 29-Jun-2022
https://doi.org/10.1007/s00453-022-00996-y
Chen JLi PFang WZhou NYin YZheng HXu HWang R(2021)Fuzzy Frequent Pattern Mining Algorithm Based on Weighted Sliding Window and Type-2 Fuzzy Sets over Medical Data StreamWireless Communications and Mobile Computing10.1155/2021/66622542021(1-17)Online publication date: 26-Dec-2021
https://doi.org/10.1155/2021/6662254
Shahbazi NGryz J(2021)Upper bounds for can-tree and FP-treeJournal of Intelligent Information Systems10.1007/s10844-021-00673-658:1(197-222)Online publication date: 21-Sep-2021
https://doi.org/10.1007/s10844-021-00673-6
Shahbazi NGryz J(2020)Upper Bound on the Size of FP-TreeAdvances in Databases and Information Systems10.1007/978-3-030-54832-2_4(23-33)Online publication date: 17-Aug-2020
https://doi.org/10.1007/978-3-030-54832-2_4
Lin CLi WChen JChung WChung SLin K(2019)A Distributed Algorithm for Fast Mining Frequent Patterns in Limited and Varying Network Bandwidth EnvironmentsApplied Sciences10.3390/app90918599:9(1859)Online publication date: 6-May-2019
https://doi.org/10.3390/app9091859
Guidotti RRossetti GPappalardo LGiannotti FPedreschi D(2019)Personalized Market Basket Prediction with Temporal Annotated Recurring SequencesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2018.287258731:11(2151-2163)Online publication date: 1-Nov-2019
https://doi.org/10.1109/TKDE.2018.2872587
Ahmed SNath B(2019)Modified FP-Growth: An Efficient Frequent Pattern Mining Approach from FP-TreePattern Recognition and Machine Intelligence10.1007/978-3-030-34869-4_6(47-55)Online publication date: 25-Nov-2019
https://doi.org/10.1007/978-3-030-34869-4_6
Lin CChung SChen JYu YLin K(2018)A fast and low idle time method for mining frequent patterns in distributed and many-task computing environmentsDistributed and Parallel Databases10.1007/s10619-018-7221-936:4(613-641)Online publication date: 1-Dec-2018
https://dl.acm.org/doi/10.1007/s10619-018-7221-9
Elgohary ABoehm MHaas PReiss FReinwald B(2018)Compressed linear algebra for large-scale machine learningThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-017-0478-127:5(719-744)Online publication date: 1-Oct-2018
https://dl.acm.org/doi/10.1007/s00778-017-0478-1
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents