DOI: 10.1145/1148170.1148230
Article

Distributed query sampling: a quality-conscious approach

Published: 06 August 2006

Abstract

We present an adaptive distributed query-sampling framework that is quality-conscious for extracting high-quality text database samples. The framework divides the query-based sampling process into an initial seed sampling phase and a quality-aware iterative sampling phase. In the second phase the sampling process is dynamically scheduled based on estimated database size and quality parameters derived during the previous sampling process. The unique characteristic of our adaptive query-based sampling framework is its self-learning and self-configuring ability based on the overall quality of all text databases under consideration. We introduce three quality-conscious sampling schemes for estimating database quality, and our initial results show that the proposed framework supports higher-quality document sampling than existing approaches.
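
The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the three quality-conscious schemes are not specified on this page, so the `score` callback below is a hypothetical stand-in for whatever combination of estimated database size and quality a real scheme would compute, and `sample` is a hypothetical callback that draws documents from a database. The sketch only shows the control flow: an equal seed allocation, followed by rounds in which the remaining budget is split in proportion to the current scores.

```python
from typing import Callable, Dict, List


def proportional_split(budget: int, weights: Dict[str, float]) -> Dict[str, int]:
    """Split an integer budget across keys in proportion to non-negative weights."""
    total = sum(weights.values())
    if total <= 0:
        # No usable signal: fall back to an even split.
        weights = {k: 1.0 for k in weights}
        total = float(len(weights))
    exact = {k: budget * w / total for k, w in weights.items()}
    alloc = {k: int(v) for k, v in exact.items()}
    # Hand out the units lost to rounding, largest fractional part first.
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(weights, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


def two_phase_sampling(
    dbs: List[str],
    total_budget: int,
    seed_per_db: int,
    rounds: int,
    sample: Callable[[str, int], None],   # hypothetical: draw n documents from db
    score: Callable[[str, int], float],   # hypothetical: size/quality estimate for db
) -> Dict[str, int]:
    """Return how many documents were requested from each database."""
    sampled = {db: 0 for db in dbs}

    # Phase 1: seed sampling, the same number of documents from every database.
    for db in dbs:
        sample(db, seed_per_db)
        sampled[db] += seed_per_db

    # Phase 2: iterative sampling, re-scored before every round.
    remaining = total_budget - seed_per_db * len(dbs)
    for r in range(rounds):
        round_budget = remaining // (rounds - r)
        weights = {db: score(db, sampled[db]) for db in dbs}
        for db, n in proportional_split(round_budget, weights).items():
            sample(db, n)
            sampled[db] += n
        remaining -= round_budget
    return sampled


if __name__ == "__main__":
    fixed_scores = {"a": 1.0, "b": 3.0}  # made-up scores for illustration only
    result = two_phase_sampling(
        ["a", "b"], total_budget=100, seed_per_db=10, rounds=2,
        sample=lambda db, n: None,
        score=lambda db, n: fixed_scores[db],
    )
    print(result)
```

In a real setting the scores would change between rounds as more documents are seen, which is what makes the schedule adaptive; the fixed scores in the usage example are there only to keep the run deterministic.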


    Published In

    SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
    August 2006
    768 pages
    ISBN:1595933697
    DOI:10.1145/1148170
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. adaptive
    2. distributed IR
    3. quality
    4. sampling

    Qualifiers

    • Article

    Conference

    SIGIR06: The 29th Annual International SIGIR Conference
    August 6 - 11, 2006
    Seattle, Washington, USA

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%


    Cited By

    • (2018) Aggregated Search. Foundations and Trends in Information Retrieval 10:5 (365-502). DOI: 10.1561/1500000052. Online publication date: 14-Dec-2018
    • (2017) Integration of deep web sources. Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics (1-4). DOI: 10.1145/3102254.3102291. Online publication date: 19-Jun-2017
    • (2014) Optimal top-K queries processing: Sampling and dynamic scheduling approach. 2014 International Conference on Computer Communication and Informatics (1-3). DOI: 10.1109/ICCCI.2014.6921736. Online publication date: Jan-2014
    • (2013) Reducing the uncertainty in resource selection. Proceedings of the 35th European conference on Advances in Information Retrieval (507-519). DOI: 10.1007/978-3-642-36973-5_43. Online publication date: 24-Mar-2013
    • (2010) Examining the information retrieval process from an inductive perspective. Proceedings of the 19th ACM international conference on Information and knowledge management (89-98). DOI: 10.1145/1871437.1871453. Online publication date: 26-Oct-2010
    • (2009) Robust result merging using sample-based score estimates. ACM Transactions on Information Systems 27:3 (1-29). DOI: 10.1145/1508850.1508852. Online publication date: 19-May-2009
    • (2009) Extracting Output Metadata from Scientific Deep Web Data Sources. Proceedings of the 2009 Ninth IEEE International Conference on Data Mining (552-561). DOI: 10.1109/ICDM.2009.41. Online publication date: 6-Dec-2009
    • (2009) Weighted Rank Correlation in Information Retrieval Evaluation. Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology (75-86). DOI: 10.1007/978-3-642-04769-5_7. Online publication date: 1-Oct-2009
    • (2007) Federated text retrieval from uncooperative overlapped collections. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (495-502). DOI: 10.1145/1277741.1277827. Online publication date: 23-Jul-2007
    • (2007) On rank correlation in information retrieval evaluation. ACM SIGIR Forum 41:1 (18-33). DOI: 10.1145/1273221.1273223. Online publication date: 1-Jun-2007
