Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/1951365.1951420acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article

Memory-efficient frequent-itemset mining

Published: 21 March 2011 Publication History

Abstract

Efficient discovery of frequent itemsets in large datasets is a key component of many data mining tasks. In-core algorithms---which operate entirely in main memory and avoid expensive disk accesses---and in particular the prefix tree-based algorithm FP-growth are generally among the most efficient of the available algorithms. Unfortunately, their excessive memory requirements render them inapplicable for large datasets with many distinct items and/or itemsets of high cardinality. To overcome this limitation, we propose two novel data structures---the CFP-tree and the CFP-array---, which reduce memory consumption by about an order of magnitude. This allows us to process significantly larger datasets in main memory than previously possible. Our data structures are based on structural modifications of the prefix tree that increase compressability, an optimized physical representation, lightweight compression techniques, and intelligent node ordering and indexing. Experiments with both real-world and synthetic datasets show the effectiveness of our approach.

References

[1]
R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207--216, New York, NY, USA, 1993. ACM.
[2]
R. Agrawal and R. Srikant. Quest synthetic data generator. IBM Almaden Research Center.
[3]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB, pages 487--499, 1994.
[4]
C. Bienia, S. Kumar, J. P. Singh, and K. Li. The parsec benchmark suite: Characterization and architectural implications. In PACT, October 2008.
[5]
T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, pages 254--260, 1999.
[6]
G. Buehrer, S. Parthasarathy, and A. Ghoting. Out-of-core frequent pattern mining on a commodity pc. In SIGKDD, pages 86--95, New York, NY, USA, 2006. ACM.
[7]
G. Buehrer, S. Parthasarathy, S. Tatikonda, T. Kurc, and J. Saltz. Toward terabyte pattern mining: an architecture-conscious solution. In PPoPP, pages 2--12, New York, NY, USA, 2007. ACM.
[8]
Frequent Itemset Mining Implementations Repository: http://fimi.cs.helsinki.fi/.
[9]
K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board, Washington DC. (USA), January 12--16, page 18pp, 2003.
[10]
A. Ghoting, G. Buehrer, S. Parthasarathy, D. Kim, A. Nguyen, Y.-K. Chen, and P. Dubey. Cache-conscious frequent pattern mining on a modern processor. In VLDB, pages 577--588, 2005.
[11]
B. Goethals and M. J. Zaki. Advances in frequent itemset mining implementations: report on fimi'03. SIGKDD Explorations, 6(1):109--117, 2004.
[12]
G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets. In FIMI, 2003.
[13]
J. Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[14]
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, pages 1--12, New York, NY, USA, 2000. ACM.
[15]
D. E. Knuth. The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley, 1973.
[16]
E. Li and L. Liu. Optimization of frequent itemset mining on multiple-core processor. In VLDB, pages 1275--1285, 2007.
[17]
H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang. Pfp: parallel fp-growth for query recommendation. In RecSys, pages 107--114, New York, NY, USA, 2008. ACM.
[18]
G. Liu, H. Lu, and J. X. Yu. Afopt: An efficient implementation of pattern growth approach. In In Proceedings of the ICDM workshop, 2003.
[19]
C. Lucchese, S. Orlando, R. Perego, and F. Silvestri. Webdocs: a real-life huge transactional dataset. In FIMI, 2004.
[20]
E. Özkural and C. Aykanat. A space optimization for fp-growth. In FIMI, 2004.
[21]
A. Pietracaprina and D. Zandolin. Mining frequent itemsets using patricia tries. In FIMI, 2003.
[22]
J. R. Punin, M. S. Krishnamoorthy, and M. J. Zaki. Web usage mining - languages and algorithms. In In Studies in Classification, Data Analysis, and Knowledge Organization, pages 88--112. Springer-Verlag, 2001.
[23]
B. Rácz. nonordfp: An fp-growth variation without rebuilding the fp-tree. In FIMI, 2004.
[24]
H. K. Reghbati. An overview of data compression techniques. IEEE Computer, 14(4):71--75, 1981.
[25]
D. D. Sleator and R. E. Tarjan. Self-adjusting binary search trees. J. ACM, 32(3):652--686, 1985.
[26]
Y. G. Sucahyo and R. P. Gopalan. Ct-itl: Efficient frequent item set mining using a compressed prefix tree with pattern growth. In ADC, pages 95--104, 2003.
[27]
Y. G. Sucahyo and R. P. Gopalan. Ct-pro: A bottom-up non recursive frequent itemset mining algorithm using compressed fp-tree data structure. In FIMI, 2004.
[28]
H. Toivonen. Sampling large databases for association rules. In VLDB, pages 134--145, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc.
[29]
T. Uno, M. Kiyomi, and H. Arimura. Lcm ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets. In FIMI, 2004.
[30]
J. T.-L. Wang, M. J. Zaki, H. Toivonen, and D. Shasha, editors. Data Mining in Bioinformatics. Springer, 2005.
[31]
T. Westmann, D. Kossmann, S. Helmer, and G. Moerkotte. The implementation and performance of compressed databases. SIGMOD Record, 29(3):55--67, 2000.
[32]
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. Technical report, Rochester, NY, USA, 1997.

Cited By

View all
  • (2023)Spark based Parallel Frequent Pattern Rules for Social Media Data Analytics2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW)10.1109/CCGridW59191.2023.00039(168-175)Online publication date: May-2023
  • (2022)Fast and Simple Compact Hashing via BucketingAlgorithmica10.1007/s00453-022-00996-y84:9(2735-2766)Online publication date: 29-Jun-2022
  • (2021)Fuzzy Frequent Pattern Mining Algorithm Based on Weighted Sliding Window and Type-2 Fuzzy Sets over Medical Data StreamWireless Communications and Mobile Computing10.1155/2021/66622542021(1-17)Online publication date: 26-Dec-2021
  • Show More Cited By

Index Terms

  1. Memory-efficient frequent-itemset mining

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Other conferences
    EDBT/ICDT '11: Proceedings of the 14th International Conference on Extending Database Technology
    March 2011
    587 pages
    ISBN:9781450305280
    DOI:10.1145/1951365
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    • Microsoft Research: Microsoft Research

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 March 2011

    Permissions

    Request permissions for this article.

    Check for updates

    Qualifiers

    • Research-article

    Conference

    EDBT/ICDT '11
    Sponsor:
    • Microsoft Research
    EDBT/ICDT '11: EDBT/ICDT '11 joint conference
    March 21 - 24, 2011
    Uppsala, Sweden

    Acceptance Rates

    Overall Acceptance Rate 7 of 10 submissions, 70%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)15
    • Downloads (Last 6 weeks)3
    Reflects downloads up to 13 Nov 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Spark based Parallel Frequent Pattern Rules for Social Media Data Analytics2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW)10.1109/CCGridW59191.2023.00039(168-175)Online publication date: May-2023
    • (2022)Fast and Simple Compact Hashing via BucketingAlgorithmica10.1007/s00453-022-00996-y84:9(2735-2766)Online publication date: 29-Jun-2022
    • (2021)Fuzzy Frequent Pattern Mining Algorithm Based on Weighted Sliding Window and Type-2 Fuzzy Sets over Medical Data StreamWireless Communications and Mobile Computing10.1155/2021/66622542021(1-17)Online publication date: 26-Dec-2021
    • (2021)Upper bounds for can-tree and FP-treeJournal of Intelligent Information Systems10.1007/s10844-021-00673-658:1(197-222)Online publication date: 21-Sep-2021
    • (2020)Upper Bound on the Size of FP-TreeAdvances in Databases and Information Systems10.1007/978-3-030-54832-2_4(23-33)Online publication date: 17-Aug-2020
    • (2019)A Distributed Algorithm for Fast Mining Frequent Patterns in Limited and Varying Network Bandwidth EnvironmentsApplied Sciences10.3390/app90918599:9(1859)Online publication date: 6-May-2019
    • (2019)Personalized Market Basket Prediction with Temporal Annotated Recurring SequencesIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2018.287258731:11(2151-2163)Online publication date: 1-Nov-2019
    • (2019)Modified FP-Growth: An Efficient Frequent Pattern Mining Approach from FP-TreePattern Recognition and Machine Intelligence10.1007/978-3-030-34869-4_6(47-55)Online publication date: 25-Nov-2019
    • (2018)A fast and low idle time method for mining frequent patterns in distributed and many-task computing environmentsDistributed and Parallel Databases10.1007/s10619-018-7221-936:4(613-641)Online publication date: 1-Dec-2018
    • (2018)Compressed linear algebra for large-scale machine learningThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-017-0478-127:5(719-744)Online publication date: 1-Oct-2018
    • Show More Cited By

    View Options

    Get Access

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media