Abstract
In many real-world applications, information such as web click data, stock ticker data, sensor network data, phone call records, and traffic monitoring data appear in the form of data streams. Online monitoring of data streams has emerged as an important research undertaking. Estimating the frequency of the items on these streams is an important aggregation and summary technique for both stream mining and data management systems with a broad range of applications. This paper reviews the state-of-the-art progress on methods of identifying frequent items from data streams. It describes different kinds of models for frequent items mining task. For general models such as cash register and Turnstile, we classify existing algorithms into sampling-based, counting-based, and hashing-based categories. The processing techniques and data synopsis structure of each algorithm are described and compared by evaluation measures. Accordingly, as an extension of the general data stream model, four more specific models including time-sensitive model, distributed model, hierarchical and multi-dimensional model, and skewed data model are introduced. The characteristics and limitations of the algorithms of each model are presented, and open issues waiting for study and improvement are discussed.
Similar content being viewed by others
References
Adamic L (2009) Zipf, power-law, pareto—a ranking tutorial. http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html. Accessed 1 Mar 2009
Aggarwal C (2009) On classification and segmentation of massive audio data streams. Knowl Inf Syst 20(2): 137–156
Alon N, Matias Y, Szegedy M (1999) The space complexity of approximating the frequent moments. J Comput Syst Sci 58(1): 137–147
Arasu A, Babu S, Widom J (2003) CQL: a language for continuous queries over streams and relations. In: Proceedings of 9th conference on database programming languages, pp 1–11
Arasu A, Manku GS (2004) Approximate counts and quantiles over sliding windows. In: Proceedings of 23rd ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, pp 286–296
Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of 21st ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS’ 02), pp 1–16
Babcock B, Olston C (2003) Distributed top-k monitoring. In: Proceedings of 22nd ACM SIGMOD international conference on management of data, pp 28–39
Babcock B, Datar M, Motwani R (2002) Sampling from a moving window over streaming data. In: Proceedings of 13th SIAM-ACM symposium on discrete algorithm. San Francisco, pp 633–634
Baumes J, Goldberg M, Magdon-Ismail M, Wallace W (2004) Discovering hidden groups in communication networks. In: Proceedings of 2nd NSF/NIJ symposium on intelligence and security informatics, pp 126–137
Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. Commun ACM 13(7): 422–426
Bose P, Kranakis E, Morin P, Tang Y (2003) Bounds for frequency estimation of packet streams. In: Proceedings of 10th SIROCCO international colloquium on structural information and communication complexity, pp 33–42
Boyer B, Moore J (1982) A fast majority vote algorithm. Technical report 35, Institute for Computer Science, University of Texas
Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. In: Proceedings of 29th international colloquium on automata, languages and programming, pp 693–703
Chen Y, Dong G, Han J, Wah BW, Wang J (2002) Multi-dimensional regression analysis of time-series data streams. In: Proceedings of 28th international conference on very large databases (VLDB’ 02), pp 323–334
Cohen E, Strauss M (2006) Maintaining time-decaying stream aggregates. J Algorithm 59(1): 19–36
Cormode G (2009) Fundermentals of analyzing and mining data streams. http://www.rocq.inria.fr/axis/modulad/numero-36/Cormode/Cormode-slide-36.pdf. Accessed 1 Mar 2009
Cormode G, Marios H (2008) Finding frequent items in data streams. The Proceedings of the VLDB endowment, vol 1, issue 2, pp 1530–1541
Cormode G, Muthukrishnan S (2003) Whats hot and what’s not: tracking most frequent items dynamically. In: Proceedings of 22nd ACM SIGMOD-SIGACT-SIGART (PODS’ 03), pp 296–306
Cormode G, Muthukrishnan S (2004) An improved data stream summary: the count- min sketch and its applications. Lecture notes in computer science, vol 2976. Springer, Berlin, pp 29–38
Cormode G, Muthukrishnan S (2005) Summarizing and mining skewed data streams. In: Proceedings of 5th SIAM international conference on data mining (SDM’ 05), pp 44–55
Cormode G, Korn F, Muthukrishnan S, Srivastava D (2003) Finding hierarchical heavy hitters in data streams. In: Proceedings of 29th ACM international conference on very large databases (VLDB’ 03), pp 464–475
Cormode G, Korn F, Muthukrishnan S, Srivastava D (2004) Diamond in the rough: finding hierarchical heavy hitters in multi-dimensional data. In: Proceedings of 23rd ACM SIGMOD ICMD, pp 155–166
Cormode G, Johnson T, Korn F, Muthukrishnan S, Spatscheck O, Srivastava D (2004) Holistic UDAFs at streaming speeds. In: Proceedings of 23rd ACM SIGMOD ICMD, pp 35–46
Datar M, Gionis A, Indyk P, Motwani R (2002) Maintaining stream statistics over sliding windows. In: Proceedings of 2002 annual ACM-SIAM symposium on discrete algorithms (SODA’ 02), pp 635–644
Demaine ED, L’opez-Ortiz A, Munro JI (2002) Frequency estimation of Internet packet streams with limited space. In: Proceedings of 10th European symposium on algorithms (ESA 2002), pp 348–360
Estan C, Varghese G (2001) New directions in traffic measurement and accounting. In: Proceedings of 1st ACM SIGCOMM workshop on Internet measurement, pp 75–80
Estan C, Savage S, Varghese G (2003) Automatically inferring patterns of resource consumption in network traffic. In: Proceedings of ACM SIGCOMM, pp 137–148
Fang M, Shivakumar N, Garcia-Molina H, Motwani R, Ullman J (1998) Computing iceberg queries efficiently. In: Proceedings of 24th ACM international conference on very large databases (VLDB’98), pp 299–310
Fischer MJ, Salzberg SL (1982) Finding a majority among N votes: solution to problem 81–5. J Algorithm 3(4): 376–379
Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. SIGMOD Record 2:18–26
Garofalakis M, Gehrke J, Rastogi R (2002) Querying and mining data streams: you only get one look. Tutorial at proceeding of the ACM SIGMOD international conference on management of data(SIGMOD’ 02), pp 635
Giannella C, Han J, Pei J, Yan X, Yu PS (2003) Mining frequent patterns in data streams at multiple time granularities: next generation data mining, pp 191–212
Gibbons PB, Matias Y (1998) New sampling-based summary statistics for improving approximate query answers. In: Proceedings of ACM SIGMOD international conference on management of data (SIGMOD’98), pp 331–342
Gibbons PB, Matias Y (1999) Synopsis data structures for massive data set. DIMACS series in discrete mathematics and theoretical computer science. pp 39–70
Gilbert AC, Kotidis Y, Muthukrishnana S, Strauss M (2002) How to summarize the universe: dynamic maintenance of quantiles. In: Proceedings of 28th ACM international conference on very large databases (VLDB’ 02), pp 454–465
Gilbert AC, Kotidis Y, Muthukrishnana S, Strauss MJ (2001) QuickSAND: quick summary and analysis of network data. Technical Report 2001-43, DIMACS
Golab L, DeHaan D, Demaine ED, L’opez-Ortiz A, Munro JI (2003) Identifying frequent items in sliding windows over online packet streams. In: Proceedings of 3rd ACM SIGCOMM Internet measurement conference (IMC’ 03), pp 173–178
Golab L, Ozsu MT (2003) Issues in data stream management. ACM SIGMOD Record 32(2): 5–14
Hershberger J, Shrivastava N, Suri S, Toth CD (2006) Adaptive spatial partitioning for multidimensional data streams. Algorithmica 46(1): 97–117
Hershberger J, Shrivastava N, Suri S, Toth CD (2005) Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In: Proceedings of 24th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS’ 05), pp 338–347
Huang Z, Sun S, Wang W (2009) Efficient mining of skyline objects in subspaces over data streams. Knowl Inf Syst. doi:10.1007/s10115-008-0185-8
Jin C, Qian W, Sha C, Yu JX, Zhou A (2003) Dynamically maintaining frequent items over a data Stream. In: Proceedings of 12th ACM conference on information and knowledge management (CIKM’ 03), pp 287–294
Johnson T, Muthukrishnan S, Rozenbaum I (2005) Sampling algorithms in a stream operator. In: Proceedings of 24th ACM SIGMOD international conference on management of data (SIGMOD’ 05), pp 1–12
Kargupta H, Park B, Pittie S, Liu L, Kushraj D, Sarkar K (2002) MobiMine: monitoring the stock market from a PDA. ACM SIGKDD explorations, vol 3, issue 2. pp 37–46. ACM Press, New York
Kargupta H, Bhargava R, Liu K, et al (2004) VEDAS: A mobile and distributed data stream mining system for real-time vehicle monitoring. In: Proceedings of 4th SIAM international conference on data mining (SDM’ 04)
Karp R, Papadimitriou C, Shenker S (2003) A simple algorithm for finding frequent elements in sets and bags. ACM Trans Database Syst 28(1): 51–55
Koudas N, Srivastava D (2003) Data stream query processing: a tutorial. In: Proceedings of 29th ACM international conference on very large databases (VLDB’ 03), pp 1149
Krishnamurthy B, Sen S, Zhang Y, Chen Y (2003) Sketch-based change detection: methods, evaluation, and applications. In: Proceedings of 3rd ACM SIGCOMM Internet measurement conference (IMC’ 03), pp 234–247
Lee H, Lee W (2009) Consistent collective evaluation of multiple continuous queries for filtering heterogeneous data streams. Knowl Inf Syst. doi:10.1007/s10115-008-0186-7
Lee GM, Liu H, Yoon Y, Zhang Y (2005) Improving sketch reconstruction accuracy using linear least squares method. In: Proceedings of 5th ACM SIGCOMM Internet measurement conference (IMC’ 05), pp 273–278
Lin Y, Liu H (2006) Separator: sifting hierarchical heavy hitters accurately from data streams. In: Proceedings of the 3rd international conference on advanced data mining and applications. Lecture notes in artificial intelligence, vol 4632, pp 170–182
Liu HY, Lu Y, Han JW, He J (2006) Error-adaptive and time-aware maintenance of frequency counts over data streams. In: Yu JX, Kitsuregawa M, Leong H (eds) International conference on advances in web-age information management (WAIM’ 06). Lecture notes in computer science, vol 4016, Springer, Berlin, pp 484–495
Luo G, Wu K, Yu P (2009) Answering linear optimization queries with an approximate stream index. Knowl Inf Syst 20(1): 95–121
Madden S, Franklin MJ (2002) Fording the stream: an architecture for queries over streaming sensor data. Proceedings of the 18th IEEE international conference on data engineerring (ICDE’ 02), pp 555–566
Manjhi A, Shkapenyuk V, Dhamdhere K, Olston C (2004) Finding (recently) frequent items in distributed data streams. Technical Report, CMU-CS-04-121, Carnegie Mellon University
Manku G, Motwani R (2002) Approximate frequency counts over data streams. In: Proceedings of 28th ACM international conference on very large databases (VLDB’ 02), pp 346–357
Metwally A, Agrawal D, Abbadi AE (200) Efficient computation of frequent and Top-k elements in data streams. In: Proceedings of 10th international conference on database theory (ICDT’ 05), pp 398–412
Metwally A, Agrawal D, Abbadi AE (2005) Using association rules for fraud detection in web advertising networks. In: Proceedings of 31st international conference on very large databases (VLDB’ 05), pp 169–180
Misra J, Gries D (1982) Finding repeated elements. Science of computer programming, vol 2. pp 143–152
Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge
Muthukrishnan S (2003) Data streams: algorithms and applications. In: Proceedings of 14th ACM symposium discrete algorithms (SDA ’03). http://athos.rutgers.edu/~muthu/stream-1-1.ps
Schweller R, Gupta A, Chen Y, Parsons E (2004) Reversible sketches for efficient and accurate change detection over network data streams. In: Proceedings of ACM SIGCOMM Internet measurement conference (IMC ’04), pp 207–212
Shrivastava N, Buragohain C, Agrawal D, Suri S (2004) Medians and beyond: new aggregation techniques for sensor networks. In: Proceedings of 2nd ACM conference on embedded networked sensor systems (SenSys ’04), pp 239–249
Srivastava A, Stroeve J (2003) Onboard detection of snow, ice, clouds and other geophysical processes using kernel methods. In: Proceedings of 12th ICML workshop on machine learning technologies for autonomous space applications, pp 673–685
Toivonen H (1996) Sampling large database for association rules. In: Proceedings of 22nd international conference on very large databases (VLDB’96), pp 134–145
Vitter JS (1985) Random sampling with a reservoir. ACM Trans Math Softw 11(1): 37–57
Wang XY, Liu HY, Han JW (2007) Finding frequent items in data streams using hierarchical information. IEEE international conference on system, man and cybernetics (SMC’ 07), Canada
Yu JX, Chong ZH, Lu HJ, Zhou AY (2004) False positive or false negative: mining frequent itemsets from high speed transactional data streams. In: Proceedings of 30th international conference on very large databases (VLDB’ 04), pp 204–215
Yang B, Huang H (2009) TOPSIL-Miner: an efficient algorithm for mining top-K significant itemsets over data streams. Knowl Inf Syst. doi:10.1007/s10115-009-0211-5
Zhang Y, Singh S, Sen S, Duffield N, Lund C (2004) Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications. In: Proceedings of ACM SIGCOMM Internet measurement conference (IMC’ 04), pp 101–114
Zhu Y, Shasha D (2002) StatStream: statistical monitoring of thousands of data streams in real time. In: Proceedings of 28th international conference on very Large databases (VLDB’ 02), pp 358–369
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Liu, H., Lin, Y. & Han, J. Methods for mining frequent items in data streams: an overview. Knowl Inf Syst 26, 1–30 (2011). https://doi.org/10.1007/s10115-009-0267-2
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-009-0267-2