poster

Sampling hidden objects using nearest-neighbor oracles

Authors:

Ashwin Machanavajjhala,

Vibhor RastogiAuthors Info & Claims

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 1325 - 1333

https://doi.org/10.1145/2020408.2020606

Published: 21 August 2011 Publication History

Abstract

Given an unknown set of objects embedded in the Euclidean plane and a nearest-neighbor oracle, how to estimate the set size and other properties of the objects? In this paper we address this problem. We propose an efficient method that uses the Voronoi partitioning of the space by the objects and a nearest-neighbor oracle. Our method can be used in the hidden web/databases context where the goal is to estimate the number of certain objects of interest. Here, we assume that each object has a geographic location and the nearest-neighbor oracle can be realized by applications such as maps, local, or store-locator APIs. We illustrate the performance of our method on several real-world datasets.

References

[1]

J. Bar-Ilan. Size of the web, search engine coverage and overlap -- Methodological issues. Unpublished, 2006.

[2]

Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz. Approximating aggregate queries about web pages via random walks. In Proc. 26th VLDB, pages 535--544, 2000.

Digital Library

[3]

Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. J. ACM, 55(5), 2008.

Digital Library

[4]

K. Bharat and A. Z. Broder. A technique for measuring the relative size and overlap of public web search engines. Computer Networks, 30(1--7):379--388, 1998.

Digital Library

[5]

A. Z. Broder, M. Fontoura, V. Josifovski, R. Kumar, R. Motwani, S. U. Nabar, R. Panigrahy, A. Tomkins, and Y. Xu. Estimating corpus size via queries. In Proc. 15th CIKM, pages 594--603, 2006.

Digital Library

[6]

J. Callan and M. Connell. Query-based sampling of text databases. ACM TOIS, 19(2):97--130, 2001.

Digital Library

[7]

S. Clarke and P. Willett. Estimating the recall performance of web search engines. Aslib Proceedings, 49(7), 1997.

[8]

A. Dasgupta, G. Das, and H. Mannila. A random walk approach to sampling hidden databases. In Proc. SIGMOD, pages 629--640, 2007.

Digital Library

[9]

A. Dasgupta, X. Jin, B. Jewell, N. Zhang, and G. Das. Unbiased estimation of size and other aggregates over hidden web databases. In Proc. SIGMOD, pages 855--866, 2010.

Digital Library

[10]

A. Dasgupta, N. Zhang, and G. Das. Leveraging COUNT information in sampling hidden databases. In Proc. 25th ICDE, pages 329--340, 2009.

Digital Library

[11]

M. E. Dyer, A. M. Frieze, and R. Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. J. ACM, 38(1):1--17, 1991.

Digital Library

[12]

M. Fricke. Measuring recall. Journal of Information Science, 24(6), 1998.

Digital Library

[13]

A. Gulli and A. Signorini. The indexable web is more than 11.5 billion pages. In Proc. 14th WWW (Special interest tracks & posters), pages 902--903, 2005.

Digital Library

[14]

M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform URL sampling. Computer Networks, 33(1--6):295--308, 2000.

Digital Library

[15]

M. Jerrum. Counting, Sampling, and Integrating: Algorithms and Complexity. Birkhauser, 2003.

[16]

S. Lawrence and C. Giles. Accessibility of information on the web. Intelligence, 11(1):32--39, 2000.

Digital Library

[17]

S. Lawrence and C. L. Giles. Searching the world wide web. Science, 280(5360):98--100, 1998.

[18]

T. Liu, F. Wang, and G. Agrawal. Stratified sampling for data mining on the deep web. In Proc. 10th ICDM, pages 324--333, 2010.

Digital Library

[19]

L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures and Algorithms, 4:359--412, 1993.

[20]

L. Lovász and S. Vempala. Simulated annealing in convex bodies and an ~O(n4) volume algorithm. JCSS, 72(2):392--417, 2006.

Digital Library

[21]

J. Lu. Efficient estimation of the size of text deep web data source. In Proc. 17th CIKM, pages 1485--1486, 2008.

Digital Library

[22]

J. Lu. Ranking bias in deep web size estimation using capture recapture method. Data Knowl. Eng., 69(8):866--879, 2010.

Digital Library

[23]

P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for sampling pages uniformly from the world wide web. In AAAI Fall Symposium on Using Uncertainty Within Computation, pages 121--128, 2001.

[24]

S. Wu, F. Gibb, and F. Crestani. Experiments with document archive size detection. In Proc. 25th ECIR, pages 294--304, 2003.

Digital Library

Cited By

Buskirk TBlakely BEck AMcGrath RSingh RYu Y(2022)Sweet tweets! Evaluating a new approach for probability-based sampling of TwitterEPJ Data Science10.1140/epjds/s13688-022-00321-111:1Online publication date: 19-Feb-2022
https://doi.org/10.1140/epjds/s13688-022-00321-1
Mishra PKhurana A(2018)Accuracy Crawler: An Accurate Crawler for Deep Web Data Extraction2018 International Conference on Control, Power, Communication and Computing Technologies (ICCPCCT)10.1109/ICCPCCT.2018.8574286(25-29)Online publication date: Mar-2018
https://doi.org/10.1109/ICCPCCT.2018.8574286
Zhou ZZhao PSheng VXu JLi ZWu JCui Z(2018)Efficient sampling methods for characterizing POIs on maps based on road networksFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-016-6146-612:3(582-592)Online publication date: 1-Jun-2018
https://dl.acm.org/doi/10.1007/s11704-016-6146-6
Show More Cited By

Index Terms

Sampling hidden objects using nearest-neighbor oracles
1. Information systems
  1. Information retrieval

Recommendations

Fast k-Nearest Neighbor Searching in Static Objects

The k-nearest neighbor searching is a classical problem that has been seriously studied, due to its many important applications. The paper proposes an efficient algorithm to search the k-nearest neighbors for static objects. Since locations of static ...
Confirmation Sampling for Exact Nearest Neighbor Search
Similarity Search and Applications
Abstract
Locality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC ’98, has been an extremely influential framework for nearest neighbor search in high-dimensional data sets. While theoretical work has focused on the approximate nearest ...
K-Nearest Neighbor Finding Using MaxNearestDist

Similarity searching often reduces to finding the k nearest neighbors to a query object. Finding the k nearest neighbors is achieved by applying either a depth- first or a best-first algorithm to the search hierarchy containing the data. These ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2011

1446 pages

ISBN:9781450308137

DOI:10.1145/2020408

General Chair:
Chid Apte
IBM Research
,
Program Chairs:
Joydeep Ghosh
UT Austin
,
Padhraic Smyth
UC Irvine

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 August 2011

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Poster

Conference

KDD '11

Sponsor:

KDD '11: The 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 21 - 24, 2011

California, San Diego, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Upcoming Conference

KDD '25

Sponsor:
sigkdd
sigkdd

The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 3 - 7, 2025

Toronto , ON , Canada

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

9
Total Citations
View Citations
274
Total Downloads

Downloads (Last 12 months)1
Downloads (Last 6 weeks)0

Reflects downloads up to 13 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Buskirk TBlakely BEck AMcGrath RSingh RYu Y(2022)Sweet tweets! Evaluating a new approach for probability-based sampling of TwitterEPJ Data Science10.1140/epjds/s13688-022-00321-111:1Online publication date: 19-Feb-2022
https://doi.org/10.1140/epjds/s13688-022-00321-1
Mishra PKhurana A(2018)Accuracy Crawler: An Accurate Crawler for Deep Web Data Extraction2018 International Conference on Control, Power, Communication and Computing Technologies (ICCPCCT)10.1109/ICCPCCT.2018.8574286(25-29)Online publication date: Mar-2018
https://doi.org/10.1109/ICCPCCT.2018.8574286
Zhou ZZhao PSheng VXu JLi ZWu JCui Z(2018)Efficient sampling methods for characterizing POIs on maps based on road networksFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-016-6146-612:3(582-592)Online publication date: 1-Jun-2018
https://dl.acm.org/doi/10.1007/s11704-016-6146-6
Bhakare UChatur P(2017)A review on extracting underlying content from deep web interfaces2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA)10.1109/ICIMIA.2017.7975609(234-237)Online publication date: Feb-2017
https://doi.org/10.1109/ICIMIA.2017.7975609
Bhakare UChatur P(2017)Content extraction from deep web interfaces2017 International conference of Electronics, Communication and Aerospace Technology (ICECA)10.1109/ICECA.2017.8203702(349-353)Online publication date: Apr-2017
https://doi.org/10.1109/ICECA.2017.8203702
Rahman MLiu WSuhaim SThirumuruganathan SZhang NDas G(2017)Density Based Clustering over Location Based Services2017 IEEE 33rd International Conference on Data Engineering (ICDE)10.1109/ICDE.2017.103(461-472)Online publication date: Apr-2017
https://doi.org/10.1109/ICDE.2017.103
Zhao FZhou JNie CHuang HJin H(2016)SmartCrawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web InterfacesIEEE Transactions on Services Computing10.1109/TSC.2015.24149319:4(608-620)Online publication date: 1-Jul-2016
https://doi.org/10.1109/TSC.2015.2414931
Wang PHe WLiu X(2016)Efficiently Estimating Statistics of Points of Interests on MapsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2015.248039728:2(425-438)Online publication date: 1-Feb-2016
https://dl.acm.org/doi/10.1109/TKDE.2015.2480397
Wang PHe WLiu X(2014)An efficient sampling method for characterizing points of interests on maps2014 IEEE 30th International Conference on Data Engineering10.1109/ICDE.2014.6816719(1012-1023)Online publication date: Mar-2014
https://doi.org/10.1109/ICDE.2014.6816719

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten