Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2623330.2623680acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Distance queries from sampled data: accurate and efficient

Published: 24 August 2014 Publication History

Abstract

Distance queries are a basic tool in data analysis. They are used for detection and localization of change for the purpose of anomaly detection, monitoring, or planning. Distance queries are particularly useful when data sets such as measurements, snapshots of a system, content, traffic matrices, and activity logs are collected repeatedly. Random sampling, which can be efficiently performed over streamed or distributed data, is an important tool for scalable data analysis. The sample constitutes an extremely flexible summary, which naturally supports domain queries and scalable estimation of statistics, which can be specified after the sample is generated. The effectiveness of a sample as a summary, however, hinges on the estimators we have.
We derive novel estimators for estimating $L_p$ distance from sampled data. Our estimators apply with the most common weighted sampling schemes: Poisson Probability Proportional to Size (PPS) and its fixed sample size variants. They also apply when the samples of different data sets are independent or coordinated. Our estimators are admissible (Pareto optimal in terms of variance) and have compelling properties.
We study the performance of our Manhattan and Euclidean distance (p=1,2) estimators on diverse datasets, demonstrating scalability and accuracy even when a small fraction of the data is sampled. Our work, for the first time, facilitates effective distance estimation over sampled data.

Supplementary Material

MP4 File (p681-sidebyside.mp4)

References

[1]
N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58:137--147, 1999.
[2]
B. Bahmani, A. Goel, and R. Shinde. efficient distributed locality sensitive hashing. In CIKM. ACM, 2012.
[3]
K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD, pages 199--210. ACM, 2007.
[4]
K. R. W. Brewer, L. J. Early, and S. F. Joyce. Selecting several samples from a single population. Australian Journal of Statistics, 14(3):231--239, 1972.
[5]
A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21--29. IEEE, 1997.
[6]
A. Z. Broder. Identifying and filtering near-duplicate documents. In Proc.of the 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848 of LNCS, pages 1--10. Springer, 2000.
[7]
S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in hierarchically structured information. In Proceedings of the 1996 ACM SIGMOD international conference on Management of data, 1996.
[8]
E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997.
[9]
E. Cohen. Estimation for monotone sampling: Competitiveness and customization. In PODC. ACM, 2014. full version http://arxiv.org/abs/1212.0243.
[10]
E. Cohen, N. Duffield, C. Lund, M. Thorup, and H. Kaplan. Efficient stream sampling for variance-optimal estimation of subset sums. SIAM J. Comput., 40(5), 2011.
[11]
E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM PODC, 2007.
[12]
E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008.
[13]
E. Cohen and H. Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. In ACM SIGMETRICS, 2009.
[14]
E. Cohen and H. Kaplan. Get the most out of your sample: Optimal unbiased estimators using partial information. In Proc. of the 2011 ACM Symp. on Principles of Database Systems (PODS 2011). ACM, 2011. full version: http://arxiv.org/abs/1203.4903.
[15]
E. Cohen and H. Kaplan. What you can do with coordinated samples. In The 17th. International Workshop on Randomization and Computation (RANDOM), 2013. full version: http://arxiv.org/abs/1206.5637.
[16]
E. Cohen, H. Kaplan, and S. Sen. Coordinated weighted sampling for estimating aggregates over multiple weight assignments. Proceedings of the VLDB Endowment, 2(1--2), 2009. full version: http://arxiv.org/abs/0906.4560.
[17]
T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or, how to build a data quality browser. In Proc. SIGMOD Conference, pages 240--251, 2002.
[18]
W. Dong, M. Charikar, and K. Li. Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. In SIGIR, 2008.
[19]
N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007.
[20]
P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proceedings of the 13th Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 2001.
[21]
P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In International Conference on Very Large Databases (VLDB), pages 541--550, 2001.
[22]
M. Hadjieleftheriou, X. Yu, N. Koudas, and D. Srivastava. Hashed samples: Selectivity estimators for set similarity selection queries. In Proceedings of the 34th VLDB Conference, 2008.
[23]
J. Hájek. Sampling from a finite population. Marcel Dekker, New York, 1981.
[24]
D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663--685, 1952.
[25]
P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annual ACM Symposium on Theory of Computing, pages 604--613. ACM, 1998.
[26]
D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In VLDB, 2004.
[27]
D.E. Knuth. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley, 1969.
[28]
P. Li, K. W. Church, and T. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, 2006.
[29]
J. B. Michel, Y. K. Shen, A. Presser Aiden, A. Veres, M. K. Gray, W. Brockman, The Google Books Team, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak, and E. L. Aiden. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176--182, 2011. http://www.sciencemag.org/content/331/6014/176$.
[30]
E. Ohlsson. Sequential poisson sampling. J. Official Statistics, 14(2):149--162, 1998.
[31]
E. Ohlsson. Coordination of pps samples over time. In The 2nd International Conference on Establishment Surveys, pages 255--264. American Statistical Association, 2000.
[32]
B. Rosén. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373--397, 1972.
[33]
B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135--158, 1997.
[34]
P. J. Saavedra. Fixed sample size pps approximations with a permanent random number. In Proc. of the Section on Survey Research Methods, Alexandria VA, pages 697--700. American Statistical Association, 1995.
[35]
M. Szegedy. The DLT priority sampling is essentially optimal. In Proc. 38th Annual ACM Symposium on Theory of Computing. ACM, 2006.
[36]
J.S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37--57, 1985.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2014
2028 pages
ISBN:9781450329569
DOI:10.1145/2623330
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 August 2014

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. bottom-k sampling
  2. distance queries
  3. euclidean distance
  4. manhattan distance
  5. poisson sampling

Qualifiers

  • Research-article

Conference

KDD '14
Sponsor:

Acceptance Rates

KDD '14 Paper Acceptance Rate 151 of 1,036 submissions, 15%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)42
  • Downloads (Last 6 weeks)4
Reflects downloads up to 20 Nov 2024

Other Metrics

Citations

Cited By

View all

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media