Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2611462.2611485acmconferencesArticle/Chapter ViewAbstractPublication PagespodcConference Proceedingsconference-collections
research-article

Estimation for monotone sampling: competitiveness and customization

Published: 15 July 2014 Publication History

Abstract

Random samples are lossy summaries which allow queries posed over the data to be approximated by applying an appropriate estimator to the sample. The effectiveness of sampling, however, hinges on estimator selection. The choice of estimators is subjected to global requirements, such as unbiasedness and range restrictions on the estimate value, and ideally, we seek estimators that are both efficient to derive and apply and admissible (not dominated, in terms of variance, by other estimators). Nevertheless, for a given data domain, sampling scheme, and query, there are many admissible estimators. We define monotone sampling, which is implicit in many applications of massive data set analysis, and study the choice of admissible nonnegative and unbiased estimators. Our main contribution is general derivations of admissible estimators with desirable properties. We present a construction of order-optimal estimators, which minimize variance according to {\em any} specified priorities over the data domain. Order-optimality allows us to customize the derivation to common patterns that we can learn or observe in the data. When we prioritize lower values (e.g., more similar data sets when estimating difference), we obtain the L* estimator, which is the unique monotone admissible estimator and dominates the classic Horvitz-Thompson estimator. We show that the L* estimator is 4-competitive, meaning that the expectation of the square, on any data, is at most $4$ times the minimum possible for that data. These properties make the L* estimator a natural default choice. We also present the U$^*$ estimator, which prioritizes large values (e.g., less similar data sets). Our estimator constructions are general, natural, and practical, allowing us to make the most from our summarized data.

References

[1]
K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD, pages 199--210. ACM, 2007.
[2]
K. R. W. Brewer, L. J. Early, and S. F. Joyce. Selecting several samples from a single population. Australian Journal of Statistics, 14(3):231--239, 1972.
[3]
A. Z. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21--29. IEEE, 1997.
[4]
A. Z. Broder. Identifying and filtering near-duplicate documents. In Proc.of the 11th Annual Symposium on Combinatorial Pattern Matching, volume 1848 of LNCS, pages 1--10. Springer, 2000.
[5]
J. W. Byers, J. Considine, M. Mitzenmacher, and S. Rost. Informed content delivery across adaptive overlay networks. IEEE/ACM Trans. Netw., 12(5):767--780, October 2004.
[6]
E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441--453, 1997.
[7]
E. Cohen. Distance queries from sampled data: Accurate and efficient. Technical Report cs.DS/1203.4903, arXiv, 2012.
[8]
E. Cohen. All-distances sketches, revisited: Hip estimators for massive graphs analysis. In PODS. ACM, 2014.
[9]
E. Cohen. Estimation for monotone sampling: Competitiveness and customization. Technical Report cs.ST/1212.0243, arXiv, 2014.
[10]
E. Cohen, D. Delling, F. Fuchs, A. Goldberg, M. Goldszmidt, and R. Werneck. Scalable similarity estimation in social networks: Closeness, node labels, and random edge lengths. In COSN, 2013.
[11]
E. Cohen and H. Kaplan. Spatially-decaying aggregation over a network: model and algorithms. JCSS., 73:265--288, 2007. Full version of a SIGMOD 2004 paper.
[12]
E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM PODC, 2007.
[13]
E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008.
[14]
E. Cohen and H. Kaplan. Leveraging discarded samples for tighter estimation of multiple-set aggregates. In ACM SIGMETRICS, 2009.
[15]
E. Cohen and H. Kaplan. Get the most out of your sample: Optimal unbiased estimators using partial information. In ACM PODS, 2011. http://arxiv.org/abs/1203.4903.
[16]
E. Cohen and H. Kaplan. What you can do with coordinated samples. In RANDOM, 2013. http://arxiv.org/abs/1206.5637.
[17]
E. Cohen, H. Kaplan, and S. Sen. Coordinated weighted sampling for estimating aggregates over multiple weight assignments. VLDB 2009. full version: http://arxiv.org/abs/0906.4560.
[18]
E. Cohen, Y.-M. Wang, and G. Suri. When piecewise determinism is almost true. In Proc. Pacific Rim International Symposium on Fault-Tolerant Systems, pages 66--71, December 1995.
[19]
A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, 2007.
[20]
N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. ACM, 54(6), 2007.
[21]
P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In ACM SPAA 2001.
[22]
P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB 2001.
[23]
M. Hadjieleftheriou, X. Yu, N. Koudas, and D. Srivastava. Hashed samples: Selectivity estimators for set similarity selection queries. In VLDB 2008.
[24]
J. Hájek. Sampling from a finite population. Marcel Dekker, New York, 1981.
[25]
D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663--685, 1952.
[26]
D. E. Knuth. The Art of Computer Programming, Vol 2, Seminumerical Algorithms. Addison-Wesley, 1st edition, 1968.
[27]
J. Lanke. On UMV-estimators in survey sampling. Metrika, 20(1):196--202, 1973.
[28]
P. Li, K. W. Church, and T. Hastie. One sketch for all: Theory and application of conditional random sampling. In NIPS, 2008.
[29]
D. Mosk-Aoyama and D. Shah. Computing separable functions via gossip. In ACM PODC, 2006.
[30]
P. Mukhopandhyay. Theory and Methods of Survey Sampling. PHI learning, New Delhi, 2 edition, 2008.
[31]
E. Ohlsson. Sequential poisson sampling. J. Official Statistics, 14(2):149--162, 1998.
[32]
E. Ohlsson. Coordination of pps samples over time. In The 2nd International Conference on Establishment Surveys, pages 255--264. American Statistical Association, 2000.
[33]
B. Rosén. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373--397, 1972.
[34]
B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135--158, 1997.
[35]
P. J. Saavedra. Fixed sample size pps approximations with a permanent random number. In Proc. of the Section on Survey Research Methods, Alexandria VA, pages 697--700. American Statistical Association, 1995.
[36]
J.S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37--57, 1985.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PODC '14: Proceedings of the 2014 ACM symposium on Principles of distributed computing
July 2014
444 pages
ISBN:9781450329446
DOI:10.1145/2611462
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 July 2014

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. L* estimator
  2. coordinated sampling
  3. estimators
  4. monotone sampling
  5. order optimal estimators
  6. variance competitiveness

Qualifiers

  • Research-article

Conference

PODC '14
Sponsor:

Acceptance Rates

PODC '14 Paper Acceptance Rate 39 of 141 submissions, 28%;
Overall Acceptance Rate 740 of 2,477 submissions, 30%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)3
  • Downloads (Last 6 weeks)1
Reflects downloads up to 20 Nov 2024

Other Metrics

Citations

Cited By

View all

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media