Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3178876.3186082acmotherconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
research-article
Free access

Finding Subcube Heavy Hitters in Analytics Data Streams

Published: 10 April 2018 Publication History

Abstract

Modern data streams typically have high dimensionality. For example, digital analytics streams consist of user online activities (e.g., web browsing activity, commercial site activity, apps and social behavior, and response to ads). An important problem is to find frequent joint values (heavy hitters) of subsets of dimensions.
Formally, the data stream consists of d-dimensional items and a \em k-dimensional subcube T is a subset of k distinct coordinates. Given a theshold γ, a \em subcube heavy hitter query $\rm Query (T,v)$ outputs YES if $f_T(v) \geq γ$ and NO if $f_T(v) ∠ γ/4$ where $f_T$ is the ratio of the number of stream items whose coordinates T have joint values v. The \em all subcube heavy hitters query $\rm AllQuery(T)$ outputs all joint values v that return YES to $\rm Query (T,v)$. The problem is to answer these queries correctly for all T and v.
We present a simple one-pass sampling algorithm to solve the subcube heavy hitters problem in $\tildeO (kd/γ)$ space. $\tildeO(\cdot)$ suppresses polylogarithmic factors. This is optimal up to polylogarithmic factors based on the lower bound of Liberty et al. In the worst case, this bound becomes Θ(d^2/γ)$ which is prohibitive for large d. Our main contribution is to circumvent this quadratic bottleneck via a model-based approach. In particular, we assume that the dimensions are related to each other via the Naive Bayes model. We present a new two-pass, $\tildeO (d/γ)$-space algorithm for our problem, and a fast algorithm for answering $\rm AllQuery (T)$ in $\tO((k/γ)^2)$ time.
We demonstrate the effectiveness of our approach on a synthetic dataset as well as real datasets from Adobe and Yandex. Our work shows the potential of model-based approach to data streams.

References

[1]
Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Georgios Sakkis, Constantine D. Spyropoulos, and Panagiotis Stamatopoulos. 2000. Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach. CoRR cs.CL/0009009 (2000).
[2]
Vladimir Braverman, Kai-Min Chung, Zhenming Liu, Michael Mitzenmacher, and Rafail Ostrovsky. 2010. AMS Without 4-Wise Independence on Product Domains. In STACS (LIPIcs), Vol. 5. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 119--130.
[3]
Vladimir Braverman and Rafail Ostrovsky. 2010. Measuring independence of datasets. In STOC. ACM, 271--280.
[4]
Graham Cormode. 2008. Finding Frequent Items in Data Streams. http: //dmac.rutgers.edu/Workshops/WGUnifyingTheory/Slides/cormode.pdf. (2008). DIMACS Workshop.
[5]
Graham Cormode and S. Muthukrishnan. 2004. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. In LATIN (Lecture Notes in Computer Science), Vol. 2976. Springer, 29--38.
[6]
Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. 2016. Testing Ising Models. CoRR abs/1612.03147 (2016). arXiv:1612.03147 http://arxiv.org/ abs/1612.03147
[7]
Marco F. Duarte, Volkan Cevher, and Richard G. Baraniuk. 2009. Model-based compressive sensing for signal ensembles. In Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton Conference on. IEEE, 244--250.
[8]
Amit Goyal, Hal Daumé III, and Suresh Venkatasubramanian. 2009. Streaming for large scale NLP: Language Modeling. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA. The Association for Computational Linguistics, 512--520. http://www.aclweb.org/anthology/ N09--1058
[9]
Matthew Hamilton, Rhonda Chaytor, and Todd Wareham. 2006. The Parameterized Complexity of Enumerating Frequent Itemsets. In IWPEC (Lecture Notes in Computer Science), Vol. 4169. Springer, 227--238.
[10]
Piotr Indyk and Andrew McGregor. 2008. Declaring independence via the sketching of sketches. In SODA. SIAM, 737--745.
[11]
Branislav Kveton, Hung Hai Bui, Mohammad Ghavamzadeh, Georgios Theocharous, S. Muthukrishnan, and Siqi Sun. 2016. Graphical Model Sketch. In ECML/PKDD (1) (Lecture Notes in Computer Science), Vol. 9851. Springer, 81--97.
[12]
David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research 5 (2004), 361--397.
[13]
Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan Ullman. 2016. Space Lower Bounds for Itemset Frequency Sketches. In PODS. ACM, 441--454.
[14]
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
[15]
Andrew McGregor and Hoa T. Vu. 2015. Evaluating Bayesian Networks via Data Streams. In COCOON (Lecture Notes in Computer Science), Vol. 9198. Springer, 731--743.
[16]
Jayadev Misra and David Gries. 1982. Finding Repeated Elements. Sci. Comput. Program. 2, 2 (1982), 143--152.
[17]
Stuart J. Russell and Peter Norvig. 2010. Artificial Intelligence - A Modern Approach. Pearson Education.
[18]
Jeffrey Scott Vitter. 1985. Random Sampling with a Reservoir. ACM Trans. Math. Softw. 11, 1 (1985), 37--57.
[19]
David P. Woodruff. 2016. New Algorithms for Heavy Hitters in Data Streams (Invited Talk). In ICDT (LIPIcs), Vol. 48. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 4:1--4:12.
[20]
Yandex 2013. Yandex Personalized Web Search Challenge. https://www.kaggle.com/c/yandex-personalized-web-search-challenge. (2013).
[21]
Guizhen Yang. 2004. The complexity of mining maximal frequent itemsets and maximal frequent patterns. In KDD. ACM, 344--353.

Cited By

View all
  • (2022)Box queries over multi-dimensional streamsInformation Systems10.1016/j.is.2022.102086109:COnline publication date: 1-Nov-2022
  • (2021)Box queries over multi-dimensional streamsProceedings of the 15th ACM International Conference on Distributed and Event-based Systems10.1145/3465480.3466925(90-101)Online publication date: 28-Jun-2021
  • (2021)Subspace ExplorationProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458312(273-284)Online publication date: 20-Jun-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
WWW '18: Proceedings of the 2018 World Wide Web Conference
April 2018
2000 pages
ISBN:9781450356398
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

  • IW3C2: International World Wide Web Conference Committee

In-Cooperation

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland

Publication History

Published: 10 April 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. algorithms
  2. data streams
  3. graphical models
  4. heavy hitters

Qualifiers

  • Research-article

Conference

WWW '18
Sponsor:
  • IW3C2
WWW '18: The Web Conference 2018
April 23 - 27, 2018
Lyon, France

Acceptance Rates

WWW '18 Paper Acceptance Rate 170 of 1,155 submissions, 15%;
Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)84
  • Downloads (Last 6 weeks)17
Reflects downloads up to 16 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2022)Box queries over multi-dimensional streamsInformation Systems10.1016/j.is.2022.102086109:COnline publication date: 1-Nov-2022
  • (2021)Box queries over multi-dimensional streamsProceedings of the 15th ACM International Conference on Distributed and Event-based Systems10.1145/3465480.3466925(90-101)Online publication date: 28-Jun-2021
  • (2021)Subspace ExplorationProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458312(273-284)Online publication date: 20-Jun-2021
  • (2019)Fast Identification of Heavy Hitters by Cached and Packed Group TestingString Processing and Information Retrieval10.1007/978-3-030-32686-9_17(241-257)Online publication date: 3-Oct-2019

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media