Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Towards Optimal Moment Estimation in Streaming and Distributed Models

Published: 24 June 2023 Publication History

Abstract

One of the oldest problems in the data stream model is to approximate the pth moment \(\Vert \mathbf {X}\Vert _p^p = \sum _{i=1}^n \mathbf {X}_i^p\) of an underlying non-negative vector \(\mathbf {X}\in \mathbb {R}^n\), which is presented as a sequence of \(\mathrm{poly}(n)\) updates to its coordinates. Of particular interest is when \(p \in (0,2]\). Although a tight space bound of \(\Theta (\epsilon ^{-2} \log n)\) bits is known for this problem when both positive and negative updates are allowed, surprisingly, there is still a gap in the space complexity of this problem when all updates are positive. Specifically, the upper bound is \(O(\epsilon ^{-2} \log n)\) bits, while the lower bound is only \(\Omega (\epsilon ^{-2} + \log n)\) bits. Recently, an upper bound of \(\tilde{O}(\epsilon ^{-2} + \log n)\) bits was obtained under the assumption that the updates arrive in a random order.
We show that for \(p \in (0, 1]\), the random order assumption is not needed. Namely, we give an upper bound for worst-case streams of \(\tilde{O}(\epsilon ^{-2} + \log n)\) bits for estimating \(\Vert \mathbf {X}\Vert _p^p\). Our techniques also give new upper bounds for estimating the empirical entropy in a stream. However, we show that for \(p \in (1,2]\), in the natural coordinator and blackboard distributed communication topologies, there is an \(\tilde{O}(\epsilon ^{-2})\) bit max-communication upper bound based on a randomized rounding scheme. Our protocols also give rise to protocols for heavy hitters and approximate matrix product. We generalize our results to arbitrary communication topologies G, obtaining an \(\tilde{O}(\epsilon ^{2} \log d)\) max-communication upper bound, where d is the diameter of G. Interestingly, our upper bound rules out natural communication complexity-based approaches for proving an \(\Omega (\epsilon ^{-2} \log n)\) bit lower bound for \(p \in (1,2]\) for streaming algorithms. In particular, any such lower bound must come from a topology with large diameter.

References

[3]
Noga Alon, Yossi Matias, and Mario Szegedy. 1996. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing. ACM, 20–29.
[4]
Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. 2011. Streaming algorithms via precision sampling. In Proceedings of the IEEE 52nd Annual Symposium on Foundations of Computer Science. IEEE, 363–372.
[5]
Chrisil Arackaparambil, Joshua Brody, and Amit Chakrabarti. 2009. Functional monitoring without monotonicity. In Proceedings of the International Colloquium on Automata, Languages, and Programming. Springer, 95–106.
[6]
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. 2002. Models and issues in data stream systems. In Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 1–16.
[7]
Maria Florina Balcan, Yingyu Liang, Le Song, David Woodruff, and Bo Xie. 2016. Communication efficient distributed kernel principal component analysis. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 725–734.
[8]
Ziv Bar-Yossef, Thathachar S. Jayram, Ravi Kumar, and D. Sivakumar. 2004. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci. 68, 4 (2004), 702–732.
[9]
Srinadh Bhojanapalli, Prateek Jain, and Sujay Sanghavi. 2015. Tighter low-rank approximation via sampling the leveraged element. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 902–920.
[10]
Jarosław Błasiok, Jian Ding, and Jelani Nelson. 2017. Continuous monitoring of lp norms in data streams. arXiv preprint arXiv:1704.06710 (2017).
[11]
Christos Boutsidis, David P. Woodruff, and Peilin Zhong. 2016. Optimal principal component analysis in distributed and streaming models. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing. ACM, 236–249.
[12]
Mark Braverman, Faith Ellen, Rotem Oshman, Toniann Pitassi, and Vinod Vaikuntanathan. 2013. A tight bound for set disjointness in the message-passing model. In Proceedings of the IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 668–677.
[13]
Mark Braverman, Sumegha Garg, and David Woodruff. 2020. The coin problem with applications to data streams. In Proceedings of the IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS).
[14]
Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, Jelani Nelson, Zhengyu Wang, and David P. Woodruff. 2016. BPTree: An L2 heavy hitters algorithm using constant memory. arXiv preprint arXiv:1603.00759 (2016).
[15]
Vladimir Braverman, Stephen R. Chestnut, David P. Woodruff, and Lin F. Yang. 2016. Streaming space complexity of nearly all functions of one variable on frequency vectors. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 261–276.
[16]
Vladimir Braverman, Jonathan Katzman, Charles Seidell, and Gregory Vorsanger. 2014. An optimal algorithm for large frequency moments using O (n^2303(1-2/k)) bits. In Proceedings of the Conference on Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM’14). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
[17]
Vladimir Braverman and Rafail Ostrovsky. 2010. Recursive sketching for frequency moments. arXiv preprint arXiv:1011.2571 (2010).
[18]
Vladimir Braverman, Emanuele Viola, David Woodruff, and Lin F. Yang. 2018. Revisiting frequency moment estimation in random order streams. arXiv preprint arXiv:1803.02270 (2018).
[19]
Amit Chakrabarti, Graham Cormode, and Andrew McGregor. 2010. A near-optimal algorithm for estimating the entropy of a stream. ACM Trans. Algor. 6, 3 (2010), 1–21.
[20]
Amit Chakrabarti, Graham Cormode, and Andrew McGregor. 2016. Robust lower bounds for communication and stream computation. Theory Comput. 12, 1 (2016), 1–35.
[21]
Amit Chakrabarti, Subhash Khot, and Xiaodong Sun. 2003. Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In Proceedings of the 18th IEEE Annual Conference on Computational Complexity. IEEE, 107–117.
[22]
Amit Chakrabarti and Oded Regev. 2012. An optimal lower bound on the communication complexity of gap-Hamming-distance. SIAM J. Comput. 41, 5 (2012), 1299–1317.
[23]
Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In Automata, Languages and Programming: 29th International Colloquium (ICALP’02). Springer Berlin Heidelberg.
[24]
Arkadev Chattopadhyay, Jaikumar Radhakrishnan, and Atri Rudra. 2014. Topology matters in communication. In Proceedings of the IEEE 55th Annual Symposium on Foundations of Computer Science. IEEE, 631–640.
[25]
Jiecao Chen, He Sun, David Woodruff, and Qin Zhang. 2016. Communication-optimal distributed clustering. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 3727–3735.
[26]
Peter Clifford and Ioana Cosma. 2013. A simple sketching algorithm for entropy estimation over streaming data. Artificial Intelligence and Statistics. PMLR.
[27]
Edith Cohen. 2013. TAU CS 0368.3239, Leveraging Big Data Lecture Notes. (Fall2013).
[28]
Edith Cohen. 2017. HyperLogLog hyperextended: Sketches for concave sublinear frequency statistics. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 105–114.
[29]
Graham Cormode, Piotr Indyk, Nick Koudas, and S. Muthukrishnan. 2002. Fast mining of massive tabular data via approximate distance computations. In Proceedings of the 18th International Conference on Data Engineering. IEEE, 605–614.
[30]
Graham Cormode and Hossein Jowhari. 2019. Lp samplers and their applications: A survey. ACM Comput. Surv. 52, 1 (2019), 1–31.
[31]
Graham Cormode, S. Muthukrishnan, and Irina Rozenbaum. 2005. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 25–36.
[32]
Graham Cormode, S. Muthukrishnan, and Ke Yi. 2011. Algorithms for distributed functional monitoring. ACM Trans. Algor. 7, 2 (2011), 21.
[33]
Dan Feldman, Melanie Schmidt, and Christian Sohler. 2013. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 1434–1453.
[34]
Philippe Flajolet. 1985. Approximate counting: A detailed analysis. BIT Numer. Math. 25, 1 (1985), 113–134.
[35]
Mina Ghashami and Jeff M. Phillips. 2014. Relative errors for deterministic low-rank matrix approximations. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 707–717.
[36]
Phillip B. Gibbons and Yossi Matias. 1998. New sampling-based summary statistics for improving approximate query answers. In ACM SIGMOD Record, Vol. 27. ACM, 331–342.
[37]
Phillip B. Gibbons, Yossi Matias, and Viswanath Poosala. Fast incremental maintenance of approximate histograms.
[38]
Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. 2002. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the 28th International Conference on Very Large Databases. Elsevier, 454–465.
[39]
Andre Gronemeier. 2009. Asymptotically optimal lower bounds on the NIH-multi-party information. arXiv preprint arXiv:0902.1609 (2009).
[40]
Nicholas J. A. Harvey, Jelani Nelson, and Krzysztof Onak. 2008. Sketching and streaming entropy via approximation theory. In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science. IEEE, 489–498.
[41]
Nicholas J. A. Harvey, Jelani Nelson, and Krzysztof Onak. 2008. Streaming algorithms for estimating entropy. In Proceedings of the IEEE Information Theory Workshop. IEEE, 227–231.
[42]
Ling Huang, XuanLong Nguyen, Minos Garofalakis, Joseph M. Hellerstein, Michael I. Jordan, Anthony D. Joseph, and Nina Taft. 2007. Communication-efficient online detection of network-wide anomalies. In Proceedings of the 26th IEEE International Conference on Computer Communications. IEEE, 134–142.
[43]
Zengfeng Huang, Ke Yi, and Qin Zhang. 2012. Randomized algorithms for tracking distributed count, frequencies, and ranks. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. ACM, 295–306.
[44]
Piotr Indyk. 2006. Stable distributions, pseudorandom generators, embeddings, and data stream computation. J. ACM 53, 3 (2006), 307–323.
[45]
Piotr Indyk and David Woodruff. 2005. Optimal approximations of the frequency moments of data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing. ACM, 202–208.
[46]
Rajesh Jayaram, Gokarna Sharma, Srikanta Tirthapura, and David P. Woodruff. 2019. Weighted reservoir sampling from distributed streams. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 218–235.
[47]
Rajesh Jayaram and David P. Woodruff. 2018. Data streams with bounded deletions. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 341–354.
[48]
Rajesh Jayaram and David P. Woodruff. 2018. PerfectLp sampling in a data stream. In Proceedings of the IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 544–555.
[49]
T. S. Jayram. 2009. Hellinger strikes back: A note on the multi-party information complexity of AND. In Proceedings of the Conference on Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer, 562–573.
[50]
Thathachar S. Jayram, Ravi Kumar, and D. Sivakumar. 2008. The one-way communication complexity of Hamming distance. Theor. Comput. 4, 1 (2008), 129–135.
[51]
Thathachar S. Jayram and David P. Woodruff. 2009. The data stream space complexity of cascaded norms. In Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science. IEEE, 765–774.
[52]
Hossein Jowhari, Mert Sağlam, and Gábor Tardos. 2011. Tight bounds for Lp samplers, finding duplicates in streams, and related problems. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’11). ACM, New York, NY, 49–58. DOI:
[53]
Philo Juang, Hidekazu Oki, Yong Wang, Margaret Martonosi, Li Shiuan Peh, and Daniel Rubenstein. 2002. Energy-efficient computing for wildlife tracking: Design tradeoffs and early experiences with ZebraNet. ACM SIGARCH Comput. Archit. News 30, 5 (2002), 96–107.
[54]
Daniel M. Kane and Jelani Nelson. 2014. Sparser Johnson-Lindenstrauss transforms. J. ACM 61, 1 (2014), 4.
[55]
Daniel M. Kane, Jelani Nelson, Ely Porat, and David P. Woodruff. 2011. Fast moment estimation in data streams in optimal space. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing. ACM, 745–754.
[56]
Daniel M. Kane, Jelani Nelson, and David P. Woodruff. 2010. On the exact space complexity of sketching and streaming small norms. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 1161–1178.
[57]
Ravi Kannan, Santosh Vempala, and David Woodruff. 2014. Principal component analysis and higher correlations for distributed data. In Proceedings of the Conference on Learning Theory. 1040–1057.
[58]
Michael Kapralov, Jelani Nelson, Jakub Pachocki, Zhengyu Wang, David P. Woodruff, and Mobin Yahyazadeh. 2017. Optimal lower bounds for universal relation, and for samplers and finding duplicates in streams. arXiv preprint arXiv:1704.00633 (2017).
[59]
Ping Li. 2009. Compressed counting. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 412–421.
[60]
Ping Li and Cun-Hui Zhang. 2011. A new algorithm for compressed counting with applications in Shannon entropy estimation in dynamic data. In Proceedings of the 24th Annual Conference on Learning Theory. 477–496.
[61]
Yi Li and David P. Woodruff. 2013. A tight lower bound for high frequency moment estimation with small error. In Proceedings of the Conference on Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer, 623–638.
[62]
Yingyu Liang, Maria-Florina F. Balcan, Vandana Kanchanapally, and David Woodruff. 2014. Improved distributed principal component analysis. In Proceedings of the International Conference on Advances in Neural Information Processing Systems. 3113–3121.
[63]
Edo Liberty. 2013. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 581–588.
[64]
Samuel R. Madden, Michael J. Franklin, Joseph M. Hellerstein, and Wei Hong. 2005. TinyDB: An acquisitional query processing system for sensor networks. ACM Trans. Datab. Syst. 30, 1 (2005), 122–173.
[65]
Andrew McGregor, A. Pavan, Srikanta Tirthapura, and David P. Woodruff. 2016. Space-efficient estimation of statistics over sub-sampled streams. Algorithmica 74, 2 (2016), 787–811.
[66]
Morteza Monemizadeh and David P. Woodruff. 2010. 1-pass relative-error lp-sampling with applications. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 1143–1160.
[67]
Robert Morris. 1978. Counting large numbers of events in small registers. Commun. ACM 21, 10 (1978), 840–842.
[68]
Shanmugavelayutham Muthukrishnan et al. 2005. Data streams: Algorithms and applications. Found. Trends Theoret. Comput. Sci. 1, 2 (2005), 117–236.
[69]
Noam Nisan. 1992. Pseudorandom generators for space-bounded computation. Combinatorica 12, 4 (1992), 449–461.
[70]
John P. Nolan. 2020. Univariate stable distributions. Springer Series in Operations Research and Financial Engineering 10 (2020), 978–3.
[71]
Frank Olken. 1993. Random Sampling from Databases. Ph.D. Dissertation. University of California, Berkeley.
[72]
Srikanta Tirthapura and David P. Woodruff. 2011. Optimal random sampling from distributed streams revisited. In Proceedings of the International Symposium on Distributed Computing. Springer, 283–297.
[73]
Omri Weinstein and David P. Woodruff. 2015. The simultaneous communication of disjointness with applications to data streams. In Proceedings of the International Colloquium on Automata, Languages, and Programming. Springer, 1082–1093.
[74]
David Woodruff. 2004. Optimal space lower bounds for all frequency moments. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 167–175.
[75]
David P. Woodruff et al. 2014. Sketching as a tool for numerical linear algebra. Found. Trends Theoret. Comput. Sci. 10, 1–2 (2014), 1–157.
[76]
David P. Woodruff and Qin Zhang. 2012. Tight bounds for distributed functional monitoring. In Proceedings of the 44th Annual ACM Symposium on Theory of Computing. ACM, 941–960.
[77]
David P. Woodruff and Qin Zhang. 2018. Distributed statistical estimation of matrix products with applications. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 383–394.
[78]
David P. Woodruff and Peilin Zhong. 2016. Distributed low rank approximation of implicit functions of a matrix. In Proceedings of the IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 847–858.
[79]
Ke Yi and Qin Zhang. 2013. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica 65, 1 (2013), 206–223.

Cited By

View all
  • (2023)Improved algorithms for white-box adversarial streamsProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3618807(9962-9975)Online publication date: 23-Jul-2023

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Algorithms
ACM Transactions on Algorithms  Volume 19, Issue 3
July 2023
281 pages
ISSN:1549-6325
EISSN:1549-6333
DOI:10.1145/3592471
  • Editor:
  • Edith Cohen
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 June 2023
Online AM: 10 May 2023
Accepted: 09 April 2023
Revised: 06 September 2022
Received: 15 July 2021
Published in TALG Volume 19, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Streaming
  2. moment estimation
  3. message passing

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)139
  • Downloads (Last 6 weeks)16
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Improved algorithms for white-box adversarial streamsProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3618807(9962-9975)Online publication date: 23-Jul-2023

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Full Text

View this article in Full Text.

Full Text

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media