Nothing Special   »   [go: up one dir, main page]

Skip to main content

Advertisement

Log in

Methods for mining frequent items in data streams: an overview

  • Survey Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

In many real-world applications, information such as web click data, stock ticker data, sensor network data, phone call records, and traffic monitoring data appear in the form of data streams. Online monitoring of data streams has emerged as an important research undertaking. Estimating the frequency of the items on these streams is an important aggregation and summary technique for both stream mining and data management systems with a broad range of applications. This paper reviews the state-of-the-art progress on methods of identifying frequent items from data streams. It describes different kinds of models for frequent items mining task. For general models such as cash register and Turnstile, we classify existing algorithms into sampling-based, counting-based, and hashing-based categories. The processing techniques and data synopsis structure of each algorithm are described and compared by evaluation measures. Accordingly, as an extension of the general data stream model, four more specific models including time-sensitive model, distributed model, hierarchical and multi-dimensional model, and skewed data model are introduced. The characteristics and limitations of the algorithms of each model are presented, and open issues waiting for study and improvement are discussed.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Adamic L (2009) Zipf, power-law, pareto—a ranking tutorial. http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html. Accessed 1 Mar 2009

  2. Aggarwal C (2009) On classification and segmentation of massive audio data streams. Knowl Inf Syst 20(2): 137–156

    Article  MathSciNet  Google Scholar 

  3. Alon N, Matias Y, Szegedy M (1999) The space complexity of approximating the frequent moments. J Comput Syst Sci 58(1): 137–147

    Article  MATH  MathSciNet  Google Scholar 

  4. Arasu A, Babu S, Widom J (2003) CQL: a language for continuous queries over streams and relations. In: Proceedings of 9th conference on database programming languages, pp 1–11

  5. Arasu A, Manku GS (2004) Approximate counts and quantiles over sliding windows. In: Proceedings of 23rd ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, pp 286–296

  6. Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of 21st ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS’ 02), pp 1–16

  7. Babcock B, Olston C (2003) Distributed top-k monitoring. In: Proceedings of 22nd ACM SIGMOD international conference on management of data, pp 28–39

  8. Babcock B, Datar M, Motwani R (2002) Sampling from a moving window over streaming data. In: Proceedings of 13th SIAM-ACM symposium on discrete algorithm. San Francisco, pp 633–634

  9. Baumes J, Goldberg M, Magdon-Ismail M, Wallace W (2004) Discovering hidden groups in communication networks. In: Proceedings of 2nd NSF/NIJ symposium on intelligence and security informatics, pp 126–137

  10. Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. Commun ACM 13(7): 422–426

    Article  MATH  Google Scholar 

  11. Bose P, Kranakis E, Morin P, Tang Y (2003) Bounds for frequency estimation of packet streams. In: Proceedings of 10th SIROCCO international colloquium on structural information and communication complexity, pp 33–42

  12. Boyer B, Moore J (1982) A fast majority vote algorithm. Technical report 35, Institute for Computer Science, University of Texas

  13. Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. In: Proceedings of 29th international colloquium on automata, languages and programming, pp 693–703

  14. Chen Y, Dong G, Han J, Wah BW, Wang J (2002) Multi-dimensional regression analysis of time-series data streams. In: Proceedings of 28th international conference on very large databases (VLDB’ 02), pp 323–334

  15. Cohen E, Strauss M (2006) Maintaining time-decaying stream aggregates. J Algorithm 59(1): 19–36

    Article  MATH  MathSciNet  Google Scholar 

  16. Cormode G (2009) Fundermentals of analyzing and mining data streams. http://www.rocq.inria.fr/axis/modulad/numero-36/Cormode/Cormode-slide-36.pdf. Accessed 1 Mar 2009

  17. Cormode G, Marios H (2008) Finding frequent items in data streams. The Proceedings of the VLDB endowment, vol 1, issue 2, pp 1530–1541

  18. Cormode G, Muthukrishnan S (2003) Whats hot and what’s not: tracking most frequent items dynamically. In: Proceedings of 22nd ACM SIGMOD-SIGACT-SIGART (PODS’ 03), pp 296–306

  19. Cormode G, Muthukrishnan S (2004) An improved data stream summary: the count- min sketch and its applications. Lecture notes in computer science, vol 2976. Springer, Berlin, pp 29–38

  20. Cormode G, Muthukrishnan S (2005) Summarizing and mining skewed data streams. In: Proceedings of 5th SIAM international conference on data mining (SDM’ 05), pp 44–55

  21. Cormode G, Korn F, Muthukrishnan S, Srivastava D (2003) Finding hierarchical heavy hitters in data streams. In: Proceedings of 29th ACM international conference on very large databases (VLDB’ 03), pp 464–475

  22. Cormode G, Korn F, Muthukrishnan S, Srivastava D (2004) Diamond in the rough: finding hierarchical heavy hitters in multi-dimensional data. In: Proceedings of 23rd ACM SIGMOD ICMD, pp 155–166

  23. Cormode G, Johnson T, Korn F, Muthukrishnan S, Spatscheck O, Srivastava D (2004) Holistic UDAFs at streaming speeds. In: Proceedings of 23rd ACM SIGMOD ICMD, pp 35–46

  24. Datar M, Gionis A, Indyk P, Motwani R (2002) Maintaining stream statistics over sliding windows. In: Proceedings of 2002 annual ACM-SIAM symposium on discrete algorithms (SODA’ 02), pp 635–644

  25. Demaine ED, L’opez-Ortiz A, Munro JI (2002) Frequency estimation of Internet packet streams with limited space. In: Proceedings of 10th European symposium on algorithms (ESA 2002), pp 348–360

  26. Estan C, Varghese G (2001) New directions in traffic measurement and accounting. In: Proceedings of 1st ACM SIGCOMM workshop on Internet measurement, pp 75–80

  27. Estan C, Savage S, Varghese G (2003) Automatically inferring patterns of resource consumption in network traffic. In: Proceedings of ACM SIGCOMM, pp 137–148

  28. Fang M, Shivakumar N, Garcia-Molina H, Motwani R, Ullman J (1998) Computing iceberg queries efficiently. In: Proceedings of 24th ACM international conference on very large databases (VLDB’98), pp 299–310

  29. Fischer MJ, Salzberg SL (1982) Finding a majority among N votes: solution to problem 81–5. J Algorithm 3(4): 376–379

    Google Scholar 

  30. Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. SIGMOD Record 2:18–26

    Google Scholar 

  31. Garofalakis M, Gehrke J, Rastogi R (2002) Querying and mining data streams: you only get one look. Tutorial at proceeding of the ACM SIGMOD international conference on management of data(SIGMOD’ 02), pp 635

  32. Giannella C, Han J, Pei J, Yan X, Yu PS (2003) Mining frequent patterns in data streams at multiple time granularities: next generation data mining, pp 191–212

  33. Gibbons PB, Matias Y (1998) New sampling-based summary statistics for improving approximate query answers. In: Proceedings of ACM SIGMOD international conference on management of data (SIGMOD’98), pp 331–342

  34. Gibbons PB, Matias Y (1999) Synopsis data structures for massive data set. DIMACS series in discrete mathematics and theoretical computer science. pp 39–70

  35. Gilbert AC, Kotidis Y, Muthukrishnana S, Strauss M (2002) How to summarize the universe: dynamic maintenance of quantiles. In: Proceedings of 28th ACM international conference on very large databases (VLDB’ 02), pp 454–465

  36. Gilbert AC, Kotidis Y, Muthukrishnana S, Strauss MJ (2001) QuickSAND: quick summary and analysis of network data. Technical Report 2001-43, DIMACS

  37. Golab L, DeHaan D, Demaine ED, L’opez-Ortiz A, Munro JI (2003) Identifying frequent items in sliding windows over online packet streams. In: Proceedings of 3rd ACM SIGCOMM Internet measurement conference (IMC’ 03), pp 173–178

  38. Golab L, Ozsu MT (2003) Issues in data stream management. ACM SIGMOD Record 32(2): 5–14

    Article  Google Scholar 

  39. Hershberger J, Shrivastava N, Suri S, Toth CD (2006) Adaptive spatial partitioning for multidimensional data streams. Algorithmica 46(1): 97–117

    Article  MATH  MathSciNet  Google Scholar 

  40. Hershberger J, Shrivastava N, Suri S, Toth CD (2005) Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In: Proceedings of 24th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS’ 05), pp 338–347

  41. Huang Z, Sun S, Wang W (2009) Efficient mining of skyline objects in subspaces over data streams. Knowl Inf Syst. doi:10.1007/s10115-008-0185-8

  42. Jin C, Qian W, Sha C, Yu JX, Zhou A (2003) Dynamically maintaining frequent items over a data Stream. In: Proceedings of 12th ACM conference on information and knowledge management (CIKM’ 03), pp 287–294

  43. Johnson T, Muthukrishnan S, Rozenbaum I (2005) Sampling algorithms in a stream operator. In: Proceedings of 24th ACM SIGMOD international conference on management of data (SIGMOD’ 05), pp 1–12

  44. Kargupta H, Park B, Pittie S, Liu L, Kushraj D, Sarkar K (2002) MobiMine: monitoring the stock market from a PDA. ACM SIGKDD explorations, vol 3, issue 2. pp 37–46. ACM Press, New York

  45. Kargupta H, Bhargava R, Liu K, et al (2004) VEDAS: A mobile and distributed data stream mining system for real-time vehicle monitoring. In: Proceedings of 4th SIAM international conference on data mining (SDM’ 04)

  46. Karp R, Papadimitriou C, Shenker S (2003) A simple algorithm for finding frequent elements in sets and bags. ACM Trans Database Syst 28(1): 51–55

    Article  Google Scholar 

  47. Koudas N, Srivastava D (2003) Data stream query processing: a tutorial. In: Proceedings of 29th ACM international conference on very large databases (VLDB’ 03), pp 1149

  48. Krishnamurthy B, Sen S, Zhang Y, Chen Y (2003) Sketch-based change detection: methods, evaluation, and applications. In: Proceedings of 3rd ACM SIGCOMM Internet measurement conference (IMC’ 03), pp 234–247

  49. Lee H, Lee W (2009) Consistent collective evaluation of multiple continuous queries for filtering heterogeneous data streams. Knowl Inf Syst. doi:10.1007/s10115-008-0186-7

  50. Lee GM, Liu H, Yoon Y, Zhang Y (2005) Improving sketch reconstruction accuracy using linear least squares method. In: Proceedings of 5th ACM SIGCOMM Internet measurement conference (IMC’ 05), pp 273–278

  51. Lin Y, Liu H (2006) Separator: sifting hierarchical heavy hitters accurately from data streams. In: Proceedings of the 3rd international conference on advanced data mining and applications. Lecture notes in artificial intelligence, vol 4632, pp 170–182

  52. Liu HY, Lu Y, Han JW, He J (2006) Error-adaptive and time-aware maintenance of frequency counts over data streams. In: Yu JX, Kitsuregawa M, Leong H (eds) International conference on advances in web-age information management (WAIM’ 06). Lecture notes in computer science, vol 4016, Springer, Berlin, pp 484–495

  53. Luo G, Wu K, Yu P (2009) Answering linear optimization queries with an approximate stream index. Knowl Inf Syst 20(1): 95–121

    Article  Google Scholar 

  54. Madden S, Franklin MJ (2002) Fording the stream: an architecture for queries over streaming sensor data. Proceedings of the 18th IEEE international conference on data engineerring (ICDE’ 02), pp 555–566

  55. Manjhi A, Shkapenyuk V, Dhamdhere K, Olston C (2004) Finding (recently) frequent items in distributed data streams. Technical Report, CMU-CS-04-121, Carnegie Mellon University

  56. Manku G, Motwani R (2002) Approximate frequency counts over data streams. In: Proceedings of 28th ACM international conference on very large databases (VLDB’ 02), pp 346–357

  57. Metwally A, Agrawal D, Abbadi AE (200) Efficient computation of frequent and Top-k elements in data streams. In: Proceedings of 10th international conference on database theory (ICDT’ 05), pp 398–412

  58. Metwally A, Agrawal D, Abbadi AE (2005) Using association rules for fraud detection in web advertising networks. In: Proceedings of 31st international conference on very large databases (VLDB’ 05), pp 169–180

  59. Misra J, Gries D (1982) Finding repeated elements. Science of computer programming, vol 2. pp 143–152

  60. Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge

    MATH  Google Scholar 

  61. Muthukrishnan S (2003) Data streams: algorithms and applications. In: Proceedings of 14th ACM symposium discrete algorithms (SDA ’03). http://athos.rutgers.edu/~muthu/stream-1-1.ps

  62. Schweller R, Gupta A, Chen Y, Parsons E (2004) Reversible sketches for efficient and accurate change detection over network data streams. In: Proceedings of ACM SIGCOMM Internet measurement conference (IMC ’04), pp 207–212

  63. Shrivastava N, Buragohain C, Agrawal D, Suri S (2004) Medians and beyond: new aggregation techniques for sensor networks. In: Proceedings of 2nd ACM conference on embedded networked sensor systems (SenSys ’04), pp 239–249

  64. Srivastava A, Stroeve J (2003) Onboard detection of snow, ice, clouds and other geophysical processes using kernel methods. In: Proceedings of 12th ICML workshop on machine learning technologies for autonomous space applications, pp 673–685

  65. Toivonen H (1996) Sampling large database for association rules. In: Proceedings of 22nd international conference on very large databases (VLDB’96), pp 134–145

  66. Vitter JS (1985) Random sampling with a reservoir. ACM Trans Math Softw 11(1): 37–57

    Article  MATH  MathSciNet  Google Scholar 

  67. Wang XY, Liu HY, Han JW (2007) Finding frequent items in data streams using hierarchical information. IEEE international conference on system, man and cybernetics (SMC’ 07), Canada

  68. Yu JX, Chong ZH, Lu HJ, Zhou AY (2004) False positive or false negative: mining frequent itemsets from high speed transactional data streams. In: Proceedings of 30th international conference on very large databases (VLDB’ 04), pp 204–215

  69. Yang B, Huang H (2009) TOPSIL-Miner: an efficient algorithm for mining top-K significant itemsets over data streams. Knowl Inf Syst. doi:10.1007/s10115-009-0211-5

  70. Zhang Y, Singh S, Sen S, Duffield N, Lund C (2004) Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications. In: Proceedings of ACM SIGCOMM Internet measurement conference (IMC’ 04), pp 101–114

  71. Zhu Y, Shasha D (2002) StatStream: statistical monitoring of thousands of data streams in real time. In: Proceedings of 28th international conference on very Large databases (VLDB’ 02), pp 358–369

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hongyan Liu.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Liu, H., Lin, Y. & Han, J. Methods for mining frequent items in data streams: an overview. Knowl Inf Syst 26, 1–30 (2011). https://doi.org/10.1007/s10115-009-0267-2

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-009-0267-2

Keywords

Navigation