Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/1347082.1347163acmconferencesArticle/Chapter ViewAbstractPublication PagessodaConference Proceedingsconference-collections
research-article

Declaring independence via the sketching of sketches

Published: 02 July 2019 Publication History

Abstract

We consider the problem of identifying correlations in data streams. Surprisingly, our work seems to be the first to consider this natural problem. In the centralized model, we consider a stream of pairs (i,j) ∈ [n]2 whose frequencies define a joint distribution (X,Y). In the distributed model, each coordinate of the pair may appear separately in the stream. We present a range of algorithms for approximating to what extent X and Y are independent, i.e., how close the joint distribution is to the product of the marginals. We consider various measures of closeness including ℓ1, ℓ2, and the mutual information between X and Y. Our algorithms are based on "sketching sketches", i.e., composing small-space linear synopses of the distributions. Perhaps ironically, the biggest technical challenges that arise relate to ensuring that different components of our estimates are sufficiently independent.

References

[1]
N. Alon, A. Andoni, T. Kaufman, K. Matulef, R. Rubinfeld, and N. Xie, Testing k-wise and almost k-wise independence., in STOC, 2007, pp. 496--505.
[2]
N. Alon, Y. Matias, and M. Szegedy, The space complexity of approximating the frequency moments, Journal of Computer and System Sciences, 58 (1999), pp. 137--147.
[3]
R. Ananthakrishna, A. Das, J. Gehrke, F. Korn, S. Muthukrishnan, and D. Srivastava, Efficient approximation of correlated sums on data streams., IEEE Trans. Knowl. Data Eng., 15 (2003), pp. 569--572.
[4]
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, Models and issues in data stream systems, ACM Symposium on Principles of Database Systems, (2002), pp. 1--16.
[5]
Z. Bar-Yossef, R. Kumar, and D. Sivakumar, Reductions in streaming algorithms, with an application to counting triangles in graphs, in ACM-SIAM Symposium on Discrete Algorithms, 2002, pp. 623--632.
[6]
T. Batu, L. Fortnow, E. Fischer, R. Kumar, R. Rubinfeld, and P. White, Testing random variables for independence and identity., in FOCS, 2001, pp. 442--451.
[7]
L. Bhuvanagiri, S. Ganguly, D. Kesh, and C. Saha, Simpler algorithm for estimating frequency moments of data streams., in ACM-SIAM Symposium on Discrete Algorithms, 2006, pp. 708--713.
[8]
A. Chakrabarti, G. Cormode, and A. McGregor, A near-optimal algorithm for computing the entropy of a stream, in ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 328--335.
[9]
A. Chakrabarti, S. Khot, and X. Sun, Near-optimal lower bounds on the multi-party communication complexity of set disjointness., in IEEE Conference on Computational Complexity, 2003, pp. 107--117.
[10]
K. L. Chang and R. Kannan, The space complexity of pass-efficient algorithms for clustering., in ACM-SIAM Symposium on Discrete Algorithms, 2006, pp. 1157--1166.
[11]
M. Charikar, C. Chekuri, T. Feder, and R. Motwani, Incremental clustering and dynamic information retrieval., SIAM J. Comput., 33 (2004), pp. 1417--1440.
[12]
M. Charikar, L. O'Callaghan, and R. Panigrahy, Better streaming algorithms for clustering problems., in ACM Symposium on Theory of Computing, 2003, pp. 30--39.
[13]
G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan, Comparing data streams using hamming norms (How to zero in)., IEEE Trans. Knowl. Data Eng., 15 (2003), pp. 529--540.
[14]
G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava, Space- and time-efficient deterministic algorithms for biased quantiles over data streams., in ACM Symposium on Principles of Database Systems, 2006, pp. 263--272.
[15]
T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991.
[16]
J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang, Graph distances in the streaming model: the value of space., in ACM-SIAM Symposium on Discrete Algorithms, 2005, pp. 745--754.
[17]
J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang, On graph problems in a semi-streaming model, Theoretical Computer Science, 348 (2005), pp. 207--216.
[18]
J. Gehrke, F. Korn, and D. Srivastava, On computing correlated aggregates over continual data streams., in SIGMOD Conference, 2001, pp. 13--24.
[19]
M. Greenwald and S. Khanna, Efficient online computation of quantile summaries., in ACM International Conference on Management of Data, 2001, pp. 58--66.
[20]
S. Guha, P. Indyk, and A. McGregor, Sketching information divergences, in Conference on Learning Theory, 2007, pp. 424--438.
[21]
S. Guha, N. Koudas, and K. Shim, Approximation and streaming algorithms for histogram construction problems, ACM Trans. Database Syst., 31 (2006), pp. 396--438.
[22]
S. Guha and A. McGregor, Lower bounds for quantile estimation in random-order and multipass streaming, in International Colloquium on Automata, Languages and Programming, 2007, pp. 704--715.
[23]
S. Guha and A. McGregor, Space-efficient sampling, in AISTATS, 2007, pp. 169--176.
[24]
P. Indyk, Stable distributions, pseudorandom generators, embeddings, and data stream computation., J. ACM, 53 (2006), pp. 307--323.
[25]
P. Indyk and D. P. Woodruff, Optimal approximations of the frequency moments of data streams., in ACM Symposium on Theory of Computing, 2005, pp. 202--208.
[26]
T. S. Jayram, R. Kumar, and D. Sivakumar, Simple lower bound on one-way Gap-Hamming., in www.cse.iitk.ac.in/users/sganguly/slides/ravikumar.pdf, 2007.
[27]
C. D. Manning and H. Schtze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[28]
S. Muthukrishnan, Data streams: Algorithms and applications, Now Publishers, (2006).
[29]
D. P. Woodruff, Optimal space lower bounds for all frequency moments., in ACM-SIAM Symposium on Discrete Algorithms, 2004, pp. 167--175.

Cited By

View all
  • (2019)Tight dimensionality reduction for sketching low degree polynomial kernelsProceedings of the 33rd International Conference on Neural Information Processing Systems10.5555/3454287.3455137(9475-9486)Online publication date: 8-Dec-2019
  • (2018)Finding Subcube Heavy Hitters in Analytics Data StreamsProceedings of the 2018 World Wide Web Conference10.1145/3178876.3186082(1705-1714)Online publication date: 10-Apr-2018
  • (2014)On sketching matrix norms and the top singular vectorProceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms10.5555/2634074.2634188(1562-1581)Online publication date: 5-Jan-2014
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SODA '08: Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
January 2008
1289 pages
  • Program Chair:
  • Shang-Hua Teng

Sponsors

Publisher

Society for Industrial and Applied Mathematics

United States

Publication History

Published: 02 July 2019

Check for updates

Qualifiers

  • Research-article

Conference

SODA08
Sponsor:
SODA08: 19th ACM-SIAM Symposium on Discrete Algorithms
January 20 - 22, 2008
California, San Francisco

Acceptance Rates

Overall Acceptance Rate 411 of 1,322 submissions, 31%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)2
  • Downloads (Last 6 weeks)0
Reflects downloads up to 16 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2019)Tight dimensionality reduction for sketching low degree polynomial kernelsProceedings of the 33rd International Conference on Neural Information Processing Systems10.5555/3454287.3455137(9475-9486)Online publication date: 8-Dec-2019
  • (2018)Finding Subcube Heavy Hitters in Analytics Data StreamsProceedings of the 2018 World Wide Web Conference10.1145/3178876.3186082(1705-1714)Online publication date: 10-Apr-2018
  • (2014)On sketching matrix norms and the top singular vectorProceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms10.5555/2634074.2634188(1562-1581)Online publication date: 5-Jan-2014
  • (2013)Compressed matrix multiplicationACM Transactions on Computation Theory10.1145/2493252.24932545:3(1-17)Online publication date: 22-Aug-2013
  • (2013)Testing Closeness of Discrete DistributionsJournal of the ACM10.1145/2432622.243262660:1(1-25)Online publication date: 1-Feb-2013
  • (2012)A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional dataProceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining10.1145/2339530.2339669(877-885)Online publication date: 12-Aug-2012
  • (2012)Compressed matrix multiplicationProceedings of the 3rd Innovations in Theoretical Computer Science Conference10.1145/2090236.2090271(442-451)Online publication date: 8-Jan-2012
  • (2011)Periodicity and cyclic shifts via linear sketchesProceedings of the 14th international workshop and 15th international conference on Approximation, randomization, and combinatorial optimization: algorithms and techniques10.5555/2033252.2033267(158-170)Online publication date: 17-Aug-2011
  • (2011)Fast moment estimation in data streams in optimal spaceProceedings of the forty-third annual ACM symposium on Theory of computing10.1145/1993636.1993735(745-754)Online publication date: 6-Jun-2011
  • (2011)Near-optimal private approximation protocols via a black box transformationProceedings of the forty-third annual ACM symposium on Theory of computing10.1145/1993636.1993733(735-744)Online publication date: 6-Jun-2011
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media