research-article

Distance queries from sampled data: accurate and efficient

Author:

Edith CohenAuthors Info & Claims

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 681 - 690

https://doi.org/10.1145/2623330.2623680

Published: 24 August 2014 Publication History

Abstract

Distance queries are a basic tool in data analysis. They are used for detection and localization of change for the purpose of anomaly detection, monitoring, or planning. Distance queries are particularly useful when data sets such as measurements, snapshots of a system, content, traffic matrices, and activity logs are collected repeatedly. Random sampling, which can be efficiently performed over streamed or distributed data, is an important tool for scalable data analysis. The sample constitutes an extremely flexible summary, which naturally supports domain queries and scalable estimation of statistics, which can be specified after the sample is generated. The effectiveness of a sample as a summary, however, hinges on the estimators we have.

We derive novel estimators for estimating $L_p$ distance from sampled data. Our estimators apply with the most common weighted sampling schemes: Poisson Probability Proportional to Size (PPS) and its fixed sample size variants. They also apply when the samples of different data sets are independent or coordinated. Our estimators are admissible (Pareto optimal in terms of variance) and have compelling properties.

We study the performance of our Manhattan and Euclidean distance (p=1,2) estimators on diverse datasets, demonstrating scalability and accuracy even when a small fraction of the data is sampled. Our work, for the first time, facilitates effective distance estimation over sampled data.

Supplementary Material

MP4 File (p681-sidebyside.mp4)

Download
297.22 MB

References

[1]

N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58:137--147, 1999.

Digital Library

[2]

B. Bahmani, A. Goel, and R. Shinde. efficient distributed locality sensitive hashing. In CIKM. ACM, 2012.

Digital Library

[3]

K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD, pages 199--210. ACM, 2007.

Digital Library

[4]

K. R. W. Brewer, L. J. Early, and S. F. Joyce. Selecting several samples from a single population. Australian Journal of Statistics, 14(3):231--239, 1972.

[5]

A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21--29. IEEE, 1997.

Digital Library

[6]

A. Z. Broder. Identifying and filtering near-duplicate documents. In Proc.of the 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848 of LNCS, pages 1--10. Springer, 2000.

Digital Library

[7]

S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in hierarchically structured information. In Proceedings of the 1996 ACM SIGMOD international conference on Management of data, 1996.

Digital Library

[8]

E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997.

Digital Library

[9]

E. Cohen. Estimation for monotone sampling: Competitiveness and customization. In PODC. ACM, 2014. full version http://arxiv.org/abs/1212.0243.

Digital Library

[10]

E. Cohen, N. Duffield, C. Lund, M. Thorup, and H. Kaplan. Efficient stream sampling for variance-optimal estimation of subset sums. SIAM J. Comput., 40(5), 2011.

Digital Library

[11]

E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM PODC, 2007.

Digital Library

[12]

E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008.

Digital Library

[13]

E. Cohen and H. Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. In ACM SIGMETRICS, 2009.

Digital Library

[14]

E. Cohen and H. Kaplan. Get the most out of your sample: Optimal unbiased estimators using partial information. In Proc. of the 2011 ACM Symp. on Principles of Database Systems (PODS 2011). ACM, 2011. full version: http://arxiv.org/abs/1203.4903.

Digital Library

[15]

E. Cohen and H. Kaplan. What you can do with coordinated samples. In The 17th. International Workshop on Randomization and Computation (RANDOM), 2013. full version: http://arxiv.org/abs/1206.5637.

[16]

E. Cohen, H. Kaplan, and S. Sen. Coordinated weighted sampling for estimating aggregates over multiple weight assignments. Proceedings of the VLDB Endowment, 2(1--2), 2009. full version: http://arxiv.org/abs/0906.4560.

Digital Library

[17]

T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In Proc. SIGMOD Conference, pages 240--251, 2002.

Digital Library

[18]

W. Dong, M. Charikar, and K. Li. Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. In SIGIR, 2008.

Digital Library

[19]

N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007.

Digital Library

[20]

P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proceedings of the 13th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 2001.

Digital Library

[21]

P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In International Conference on Very Large Databases (VLDB), pages 541--550, 2001.

Digital Library

[22]

M. Hadjieleftheriou, X. Yu, N. Koudas, and D. Srivastava. Hashed samples: Selectivity estimators for set similarity selection queries. In Proceedings of the 34th VLDB Conference, 2008.

Digital Library

[23]

J. Hájek. Sampling from a finite population. Marcel Dekker, New York, 1981.

[24]

D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663--685, 1952.

[25]

P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annual ACM Symposium on Theory of Computing, pages 604--613. ACM, 1998.

Digital Library

[26]

D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In VLDB, 2004.

Digital Library

[27]

D.E. Knuth. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley, 1969.

Digital Library

[28]

P. Li, K. W. Church, and T. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, 2006.

[29]

J. B. Michel, Y. K. Shen, A. Presser Aiden, A. Veres, M. K. Gray, W. Brockman, The Google Books Team, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak, and E. L. Aiden. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176--182, 2011. http://www.sciencemag.org/content/331/6014/176$.

[30]

E. Ohlsson. Sequential poisson sampling. J. Official Statistics, 14(2):149--162, 1998.

[31]

E. Ohlsson. Coordination of pps samples over time. In The 2nd International Conference on Establishment Surveys, pages 255--264. American Statistical Association, 2000.

[32]

B. Rosén. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373--397, 1972.

[33]

B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135--158, 1997.

[34]

P. J. Saavedra. Fixed sample size pps approximations with a permanent random number. In Proc. of the Section on Survey Research Methods, Alexandria VA, pages 697--700. American Statistical Association, 1995.

[35]

M. Szegedy. The DLT priority sampling is essentially optimal. In Proc. 38th Annual ACM Symposium on Theory of Computing. ACM, 2006.

Digital Library

[36]

J.S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37--57, 1985.

Digital Library

Cited By

Cohen EPagh RWoodruff DLarochelle HRanzato MHadsell RBalcan MLin H(2020)WOR and p'sProceedings of the 34th International Conference on Neural Information Processing Systems10.5555/3495724.3497495(21092-21104)Online publication date: 6-Dec-2020
https://dl.acm.org/doi/10.5555/3495724.3497495
Rachkovskij D(2016)Real-Valued Embeddings and Sketches for Fast Distance and Similarity EstimationCybernetics and Systems Analysis10.1007/s10559-016-9899-x52:6(967-988)Online publication date: 1-Nov-2016
https://dl.acm.org/doi/10.1007/s10559-016-9899-x
Cohen E(2016)Coordinated SamplingEncyclopedia of Algorithms10.1007/978-1-4939-2864-4_576(449-454)Online publication date: 22-Apr-2016
https://doi.org/10.1007/978-1-4939-2864-4_576
Show More Cited By

Index Terms

Distance queries from sampled data: accurate and efficient
1. Mathematics of computing
  1. Probability and statistics
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
        Data types and structures

Recommendations

Analysis of large data logs: an application of Poisson sampling on excite web queries

Search engines are the gateway for users to retrieve information from the Web. There is a crucial need for tools that allow effective analysis of search engine queries to provide a greater understanding of Web users' information seeking behavior. The ...
Consistent estimation of non-bandlimited spectral density from uniformly spaced samples

In the matter of selection of sample time points for the estimation of the power spectral density of a continuous time stationary stochastic process, irregular sampling schemes such as Poisson sampling are often preferred over regular (uniform) ...
A new resampling method for sampling designs without replacement: the doubled half bootstrap

A new and very fast method of bootstrap for sampling without replacement from a finite population is proposed. This method can be used to estimate the variance in sampling with unequal inclusion probabilities and does not require artificial populations ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2014

2028 pages

ISBN:9781450329569

DOI:10.1145/2623330

General Chairs:
Sofus Macskassy
Facebook
,
Claudia Perlich
Dstillery
,
Program Chairs:
Jure Leskovec
Stanford University
,
Wei Wang
UCLA
,
Rayid Ghani
University of Chicago

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 August 2014

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

KDD '14

Sponsor:

KDD '14: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24 - 27, 2014

New York, New York, USA

Acceptance Rates

KDD '14 Paper Acceptance Rate 151 of 1,036 submissions, 15%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

4
Total Citations
View Citations
511
Total Downloads

Downloads (Last 12 months)42
Downloads (Last 6 weeks)4

Reflects downloads up to 20 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Cohen EPagh RWoodruff DLarochelle HRanzato MHadsell RBalcan MLin H(2020)WOR and p'sProceedings of the 34th International Conference on Neural Information Processing Systems10.5555/3495724.3497495(21092-21104)Online publication date: 6-Dec-2020
https://dl.acm.org/doi/10.5555/3495724.3497495
Rachkovskij D(2016)Real-Valued Embeddings and Sketches for Fast Distance and Similarity EstimationCybernetics and Systems Analysis10.1007/s10559-016-9899-x52:6(967-988)Online publication date: 1-Nov-2016
https://dl.acm.org/doi/10.1007/s10559-016-9899-x
Cohen E(2016)Coordinated SamplingEncyclopedia of Algorithms10.1007/978-1-4939-2864-4_576(449-454)Online publication date: 22-Apr-2016
https://doi.org/10.1007/978-1-4939-2864-4_576
Cohen E(2014)Coordinated SamplingEncyclopedia of Algorithms10.1007/978-3-642-27848-8_576-1(1-7)Online publication date: 19-Nov-2014
https://doi.org/10.1007/978-3-642-27848-8_576-1

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents