research-article

Maximally informative k-itemset mining from massively distributed data streams

Authors: Reza Akbarinia, Sadok Ben Yahia, Florent Masseglia

SAC '18: Proceedings of the 33rd Annual ACM Symposium on Applied Computing

Pages 502 - 509

https://doi.org/10.1145/3167132.3167187

Published: 09 April 2018

Abstract

We address the problem of mining maximally informative k-itemsets (miki) in data streams based on joint entropy. We propose PentroS, a highly scalable parallel miki mining algorithm. PentroS makes mining large volumes of incoming data very efficient. It is designed to account for the continuous nature of data streams, in particular by reducing the computation needed to update the miki results after transactions arrive in or depart from the sliding window. PentroS has been extensively evaluated using massive real-world data streams. Our experimental results confirm the effectiveness of our proposal, which sustains excellent throughput even for long itemsets.
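To make the notion of joint entropy over a sliding window concrete, the following minimal sketch may help. It is not the PentroS algorithm, whose details are not given in this abstract; it only illustrates the standard definition, in which each transaction is projected onto a candidate itemset as a presence/absence pattern and the entropy is computed over the pattern frequencies. The class name `SlidingWindowEntropy` and its methods are hypothetical names introduced for illustration, and the sketch assumes a count-based window on a single machine.

```python
from collections import Counter, deque
from math import log2


class SlidingWindowEntropy:
    """Joint entropy of one candidate itemset over a count-based sliding window.

    Illustrative only: a single-machine sketch, not the parallel PentroS method.
    """

    def __init__(self, itemset, window_size):
        self.itemset = tuple(sorted(itemset))
        self.window_size = window_size
        self.window = deque()      # transactions currently in the window
        self.counts = Counter()    # presence/absence pattern -> frequency

    def _project(self, transaction):
        # Pattern of which itemset members occur in the transaction.
        return tuple(item in transaction for item in self.itemset)

    def add(self, transaction):
        """Insert a transaction; evict the oldest one if the window is full."""
        transaction = frozenset(transaction)
        self.window.append(transaction)
        self.counts[self._project(transaction)] += 1
        if len(self.window) > self.window_size:
            departed = self.window.popleft()
            key = self._project(departed)
            self.counts[key] -= 1
            if self.counts[key] == 0:
                del self.counts[key]

    def entropy(self):
        """Joint entropy in bits: the sum over patterns of -p * log2(p)."""
        n = len(self.window)
        if n == 0:
            return 0.0
        return sum(-(c / n) * log2(c / n) for c in self.counts.values())


tracker = SlidingWindowEntropy({"a", "b"}, window_size=4)
for t in [{"a", "b"}, {"a"}, {"b"}, set()]:
    tracker.add(t)
print(tracker.entropy())
```

Only the pattern counts touched by the arriving and departing transactions change on each update, so the window never has to be rescanned to maintain the counts. A miki is then the size-k itemset whose entropy is largest among all candidates.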

Index Terms

1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms
    2. Parallel programming languages

Published In

SAC '18: Proceedings of the 33rd Annual ACM Symposium on Applied Computing

April 2018

2327 pages

ISBN:9781450351911

DOI:10.1145/3167132

Conference Chairs: Hisham M. Haddad (Kennesaw State University), Roger L. Wainwright (University of Tulsa), Richard Chbeir (University of Pau & Pays Adour, France)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SAC 2018

Sponsor:

SIGAPP

SAC 2018: Symposium on Applied Computing

April 9 - 13, 2018

Pau, France

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
