Query-based sampling of text databases

Published: 01 April 2001

Abstract

The proliferation of searchable text databases on corporate networks and the Internet causes a database selection problem for many people. Algorithms such as gGLOSS and CORI can automatically select which text databases to search for a given information need, but only if given a set of resource descriptions that accurately represent the contents of each database. The existing techniques for acquiring resource descriptions have significant limitations when used in wide-area networks controlled by many parties. This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperation of resource providers, nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
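The abstract does not spell out the sampling procedure, so the following is only a rough sketch of the general idea, not the paper's algorithm. It assumes the sampler sees a database solely through a search interface, issues one-term queries, keeps a few documents per query, and picks the next query term at random from the vocabulary learned so far. The names `query_based_sample`, `make_toy_search`, `docs_per_query`, and `max_docs` are hypothetical, as are the default values and the use of document frequencies as the resource description.

```python
import random
from collections import Counter


def make_toy_search(docs):
    """Return a stand-in search function over an in-memory list of texts.

    This toy engine is an assumption for illustration; a real database
    would be reached through whatever query interface it exposes.
    """
    def search(term, k):
        hits = [(i, d) for i, d in enumerate(docs) if term in d.lower().split()]
        return hits[:k]
    return search


def query_based_sample(search, seed_term, docs_per_query=4, max_docs=300, rng=None):
    """Build a term -> document-frequency description from sampled documents.

    `search(term, k)` must return up to k (doc_id, text) pairs.
    """
    rng = rng or random.Random(0)
    doc_freq = Counter()  # term -> number of sampled documents containing it
    seen = set()          # ids of documents already sampled
    tried = set()         # query terms already issued
    term = seed_term
    # The document budget is checked between queries, not inside one.
    while term is not None and len(seen) < max_docs:
        tried.add(term)
        for doc_id, text in search(term, docs_per_query):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            doc_freq.update(set(text.lower().split()))
        # Choose the next query term from the vocabulary learned so far.
        candidates = sorted(t for t in doc_freq if t not in tried)
        term = rng.choice(candidates) if candidates else None
    return doc_freq, len(seen)


toy = make_toy_search(["apple banana", "banana cherry", "cherry date", "egg fig"])
description, sampled = query_based_sample(toy, "apple")
```

In this toy run the document "egg fig" is never sampled, because no learned term leads to it. That illustrates why a description built this way is an estimate of the database contents rather than a complete index.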

Information & Contributors

Information

Published In

ACM Transactions on Information Systems, Volume 19, Issue 2
April 2001
119 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/382979

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 April 2001
Published in TOIS Volume 19, Issue 2

Author Tags

  1. distributed information retrieval
  2. query-based sampling
  3. resource ranking
  4. resource selection
  5. server selection

Qualifiers

  • Article
