Numerous massive data sets , ranging from flows of Internet traffic to logs of supermarket transactions, have emerged during the past few years. Their overwhelming size and the typically restricted access to them call for new computational models. This thesis studies three such models: sampling computations, data stream computations, and sketch computations.
While most of the previous work focused on designing algorithms in the new models, this thesis revolves around the limitations of the models. We develop a suite of lower bound techniques that characterize the complexity of functions in these models, indicating which problems can be solved efficiently in them. We derive specific bounds for a multitude of practical problems, arising from applications in database, networking, and information retrieval, such as frequency statistics, selection functions, statistical moments, and distance estimation.
We present general, powerful, and easy to use lower bound techniques for the sampling model. The techniques apply to all functions and address both oblivious and adaptive sampling. They frequently produce optimal bounds for a wide range of functions. They are stated in terms of new combinatorial and statistical properties of functions, which are easy to calculate.
We obtain lower bounds for the data stream and sketch models through one-way and simultaneous communication complexity. We develop lower bounds for the latter via a new information-theoretic view of communication complexity. A highlight of this work is an optimal simultaneous communication complexity lower bound for the important multi-party set-disjointness problem.
Finally, we present a powerful method for proving lower bounds for general communication complexity. The method is based on a direct sum property of a new measure of complexity for communication complexity protocols and on a novel statistical view of communication complexity. We use the technique to obtain improved communication complexity and data stream lower bounds for several problems, including multi-party set-disjointness, frequency moments, and Lp distance estimation. These results solve open problems of Alon, Matias, and Szegedy and of Saks and Sun.
Cited By
- Gupta S, Lee J, Price E and Valiant P Finite-sample maximum likelihood estimation of location Proceedings of the 36th International Conference on Neural Information Processing Systems, (30139-30149)
- Vempala S and Woodruff D Adaptive matrix vector product Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, (2044-2060)
- Braverman V, Chestnut S, Woodruff D and Yang L Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, (261-276)
- Acharya J, Diakonikolas I, Hegde C, Li J and Schmidt L Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, (249-263)
- Pagh R, Stöckel M and Woodruff D Is min-wise hashing optimal for summarizing set intersection? Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, (109-120)
- Li Y, Nguyen H and Woodruff D Turnstile streaming algorithms might as well be linear sketches Proceedings of the forty-sixth annual ACM symposium on Theory of computing, (174-183)
- Yi K, Wang L and Wei Z (2014). Indexing for summary queries, ACM Transactions on Database Systems, 39:1, (1-39), Online publication date: 1-Jan-2014.
- Huang Z, Yi K and Zhang Q Randomized algorithms for tracking distributed count, frequencies, and ranks Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems, (295-306)
- McGregor A, Pavan A, Tirthapura S and Woodruff D Space-efficient estimation of statistics over sub-sampled streams Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems, (273-282)
- Woodruff D and Zhang Q Tight bounds for distributed functional monitoring Proceedings of the forty-fourth annual ACM symposium on Theory of computing, (941-960)
- Kane D, Nelson J, Porat E and Woodruff D Fast moment estimation in data streams in optimal space Proceedings of the forty-third annual ACM symposium on Theory of computing, (745-754)
- Guha S, McGregor A and Venkatasubramanian S (2009). Sublinear estimation of entropy and information distances, ACM Transactions on Algorithms, 5:4, (1-16), Online publication date: 1-Oct-2009.
- Woodruff D The average-case complexity of counting distinct elements Proceedings of the 12th International Conference on Database Theory, (284-295)
- Mironov I, Naor M and Segev G Sketching in adversarial environments Proceedings of the fortieth annual ACM symposium on Theory of computing, (651-660)
- Fogaras D and Racz B (2007). Practical Algorithms and Lower Bounds for Similarity Search in Massive Graphs, IEEE Transactions on Knowledge and Data Engineering, 19:5, (585-598), Online publication date: 1-May-2007.
- Guha S, McGregor A and Venkatasubramanian S Streaming and sublinear approximation of entropy and information distances Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, (733-742)
- Kaushik R, Naughton J, Ramakrishnan R and Chakravarthy V (2005). Synopses for query optimization, ACM Transactions on Database Systems, 30:4, (1102-1127), Online publication date: 1-Dec-2005.
- Drineas P, Kannan R and Mahoney M Sampling sub-problems of heterogeneous max-cut problems and approximation algorithms Proceedings of the 22nd annual conference on Theoretical Aspects of Computer Science, (57-68)
- Drineas P, Frieze A, Kannan R, Vempala S and Vinay V (2004). Clustering Large Graphs via the Singular Value Decomposition, Machine Language, 56:1-3, (9-33), Online publication date: 25-Jun-2004.
- Kaushik R, Ramakrishnan R and Chakaravarthy V Synopses for query optimization Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, (201-209)
- Bar-Yossef Z Sampling lower bounds via information theory Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, (335-344)
- Akella A, Seshan S, Karp R, Shenker S and Papadimitriou C (2002). Selfish behavior and stability of the internet:, ACM SIGCOMM Computer Communication Review, 32:4, (117-130), Online publication date: 1-Oct-2002.
- Akella A, Seshan S, Karp R, Shenker S and Papadimitriou C Selfish behavior and stability of the internet: Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications, (117-130)
Index Terms
- The complexity of massive data set computations
Recommendations
Kolmogorov Complexity for Possibly Infinite Computations
In this paper we study the Kolmogorov complexity for non-effective computations, that is, either halting or non-halting computations on Turing machines. This complexity function is defined as the length of the shortest input that produce a desired ...
The multiparty communication complexity of set disjointness
STOC '12: Proceedings of the forty-fourth annual ACM symposium on Theory of computingWe study the set disjointness problem in the number-on-the-forehead model of multiparty communication.
(i) We prove that k-party set disjointness has communication complexity Omega(n/4k)1/4 in the randomized and nondeterministic models and Omega(n/4k)1/...
On the Parameterized Complexity of Approximating Dominating Set
Distributed Computing, Algorithms and Data Structures, Algorithms, Scientific Computing, Derandomizing Algorithms, Online Algorithms and Algorithmic Information TheoryWe study the parameterized complexity of approximating the k-Dominating Set (DomSet) problem where an integer k and a graph G on n vertices are given as input, and the goal is to find a dominating set of size at most F(k) ⋅ k whenever the graph G has a ...