Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2020408.2020606acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
poster

Sampling hidden objects using nearest-neighbor oracles

Published: 21 August 2011 Publication History

Abstract

Given an unknown set of objects embedded in the Euclidean plane and a nearest-neighbor oracle, how to estimate the set size and other properties of the objects? In this paper we address this problem. We propose an efficient method that uses the Voronoi partitioning of the space by the objects and a nearest-neighbor oracle. Our method can be used in the hidden web/databases context where the goal is to estimate the number of certain objects of interest. Here, we assume that each object has a geographic location and the nearest-neighbor oracle can be realized by applications such as maps, local, or store-locator APIs. We illustrate the performance of our method on several real-world datasets.

References

[1]
J. Bar-Ilan. Size of the web, search engine coverage and overlap -- Methodological issues. Unpublished, 2006.
[2]
Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz. Approximating aggregate queries about web pages via random walks. In Proc. 26th VLDB, pages 535--544, 2000.
[3]
Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. J. ACM, 55(5), 2008.
[4]
K. Bharat and A. Z. Broder. A technique for measuring the relative size and overlap of public web search engines. Computer Networks, 30(1--7):379--388, 1998.
[5]
A. Z. Broder, M. Fontoura, V. Josifovski, R. Kumar, R. Motwani, S. U. Nabar, R. Panigrahy, A. Tomkins, and Y. Xu. Estimating corpus size via queries. In Proc. 15th CIKM, pages 594--603, 2006.
[6]
J. Callan and M. Connell. Query-based sampling of text databases. ACM TOIS, 19(2):97--130, 2001.
[7]
S. Clarke and P. Willett. Estimating the recall performance of web search engines. Aslib Proceedings, 49(7), 1997.
[8]
A. Dasgupta, G. Das, and H. Mannila. A random walk approach to sampling hidden databases. In Proc. SIGMOD, pages 629--640, 2007.
[9]
A. Dasgupta, X. Jin, B. Jewell, N. Zhang, and G. Das. Unbiased estimation of size and other aggregates over hidden web databases. In Proc. SIGMOD, pages 855--866, 2010.
[10]
A. Dasgupta, N. Zhang, and G. Das. Leveraging COUNT information in sampling hidden databases. In Proc. 25th ICDE, pages 329--340, 2009.
[11]
M. E. Dyer, A. M. Frieze, and R. Kannan. A random polynomial time algorithm for approximating the volume of convex bodies. J. ACM, 38(1):1--17, 1991.
[12]
M. Fricke. Measuring recall. Journal of Information Science, 24(6), 1998.
[13]
A. Gulli and A. Signorini. The indexable web is more than 11.5 billion pages. In Proc. 14th WWW (Special interest tracks & posters), pages 902--903, 2005.
[14]
M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform URL sampling. Computer Networks, 33(1--6):295--308, 2000.
[15]
M. Jerrum. Counting, Sampling, and Integrating: Algorithms and Complexity. Birkhauser, 2003.
[16]
S. Lawrence and C. Giles. Accessibility of information on the web. Intelligence, 11(1):32--39, 2000.
[17]
S. Lawrence and C. L. Giles. Searching the world wide web. Science, 280(5360):98--100, 1998.
[18]
T. Liu, F. Wang, and G. Agrawal. Stratified sampling for data mining on the deep web. In Proc. 10th ICDM, pages 324--333, 2010.
[19]
L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume algorithm. Random Structures and Algorithms, 4:359--412, 1993.
[20]
L. Lovász and S. Vempala. Simulated annealing in convex bodies and an ~O(n4) volume algorithm. JCSS, 72(2):392--417, 2006.
[21]
J. Lu. Efficient estimation of the size of text deep web data source. In Proc. 17th CIKM, pages 1485--1486, 2008.
[22]
J. Lu. Ranking bias in deep web size estimation using capture recapture method. Data Knowl. Eng., 69(8):866--879, 2010.
[23]
P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for sampling pages uniformly from the world wide web. In AAAI Fall Symposium on Using Uncertainty Within Computation, pages 121--128, 2001.
[24]
S. Wu, F. Gibb, and F. Crestani. Experiments with document archive size detection. In Proc. 25th ECIR, pages 294--304, 2003.

Cited By

View all
  • (2022)Sweet tweets! Evaluating a new approach for probability-based sampling of TwitterEPJ Data Science10.1140/epjds/s13688-022-00321-111:1Online publication date: 19-Feb-2022
  • (2018)Accuracy Crawler: An Accurate Crawler for Deep Web Data Extraction2018 International Conference on Control, Power, Communication and Computing Technologies (ICCPCCT)10.1109/ICCPCCT.2018.8574286(25-29)Online publication date: Mar-2018
  • (2018)Efficient sampling methods for characterizing POIs on maps based on road networksFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-016-6146-612:3(582-592)Online publication date: 1-Jun-2018
  • Show More Cited By

Index Terms

  1. Sampling hidden objects using nearest-neighbor oracles

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2011
    1446 pages
    ISBN:9781450308137
    DOI:10.1145/2020408
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 21 August 2011

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. nearest-neighbors
    2. sampling
    3. size estimation
    4. voronoi cell

    Qualifiers

    • Poster

    Conference

    KDD '11
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Upcoming Conference

    KDD '25

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)1
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 13 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)Sweet tweets! Evaluating a new approach for probability-based sampling of TwitterEPJ Data Science10.1140/epjds/s13688-022-00321-111:1Online publication date: 19-Feb-2022
    • (2018)Accuracy Crawler: An Accurate Crawler for Deep Web Data Extraction2018 International Conference on Control, Power, Communication and Computing Technologies (ICCPCCT)10.1109/ICCPCCT.2018.8574286(25-29)Online publication date: Mar-2018
    • (2018)Efficient sampling methods for characterizing POIs on maps based on road networksFrontiers of Computer Science: Selected Publications from Chinese Universities10.1007/s11704-016-6146-612:3(582-592)Online publication date: 1-Jun-2018
    • (2017)A review on extracting underlying content from deep web interfaces2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA)10.1109/ICIMIA.2017.7975609(234-237)Online publication date: Feb-2017
    • (2017)Content extraction from deep web interfaces2017 International conference of Electronics, Communication and Aerospace Technology (ICECA)10.1109/ICECA.2017.8203702(349-353)Online publication date: Apr-2017
    • (2017)Density Based Clustering over Location Based Services2017 IEEE 33rd International Conference on Data Engineering (ICDE)10.1109/ICDE.2017.103(461-472)Online publication date: Apr-2017
    • (2016)SmartCrawler: A Two-Stage Crawler for Efficiently Harvesting Deep-Web InterfacesIEEE Transactions on Services Computing10.1109/TSC.2015.24149319:4(608-620)Online publication date: 1-Jul-2016
    • (2016)Efficiently Estimating Statistics of Points of Interests on MapsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2015.248039728:2(425-438)Online publication date: 1-Feb-2016
    • (2014)An efficient sampling method for characterizing points of interests on maps2014 IEEE 30th International Conference on Data Engineering10.1109/ICDE.2014.6816719(1012-1023)Online publication date: Mar-2014

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media