
A framework for effective retrieval

Published: 01 June 1989

Abstract

The aim of an effective retrieval system is to yield high recall and precision (retrieval effectiveness). The nonbinary independence model, which takes into consideration the number of occurrences of terms in documents, is introduced. It is shown to be optimal under the assumption that terms are independent. It is verified by experiments to yield significant improvement over the binary independence model. The nonbinary model is extended to normalized vectors and is applicable to more general queries.
Various ways to alleviate the consequences of the term independence assumption are discussed. Estimation of parameters required for the nonbinary independence model is provided, taking into consideration that a term may have different meanings.
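As a rough illustration of the idea in the abstract, the sketch below ranks documents by summing, over query terms, the log ratio of the probability of the observed occurrence count among relevant documents to that among nonrelevant documents. This is the usual log-likelihood-ratio form that follows from assuming independent terms; it is not taken from the paper itself. The names `nonbinary_score`, `rel_dist`, and `nonrel_dist`, the `FLOOR` value, and the toy probabilities are all hypothetical.

```python
import math
from collections import Counter

FLOOR = 1e-6  # assumed floor for counts with no estimate; avoids log(0)

def nonbinary_score(doc_terms, rel_dist, nonrel_dist):
    """Sum of log P(count | relevant) / P(count | nonrelevant) over terms.

    rel_dist and nonrel_dist map term -> {count: probability}.
    A count of 0 (term absent) is scored like any other count.
    """
    counts = Counter(doc_terms)
    score = 0.0
    for term in rel_dist:
        k = counts.get(term, 0)
        p = rel_dist[term].get(k, FLOOR)
        q = nonrel_dist.get(term, {}).get(k, FLOOR)
        score += math.log(p / q)
    return score

# Made-up count distributions for a single term (illustration only).
rel_dist = {"retrieval": {0: 0.2, 1: 0.3, 2: 0.5}}
nonrel_dist = {"retrieval": {0: 0.7, 1: 0.2, 2: 0.1}}

docs = {
    "d1": ["retrieval", "retrieval", "model"],
    "d2": ["retrieval", "system"],
    "d3": ["database", "system"],
}

# Rank documents from highest to lowest score.
ranking = sorted(
    docs,
    key=lambda d: nonbinary_score(docs[d], rel_dist, nonrel_dist),
    reverse=True,
)
print(ranking)
```

With these toy numbers, a document containing the term twice scores higher than one containing it once, a distinction a purely binary (present or absent) representation cannot make.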



Reviews

Robert G Crawford

In their conclusion, the authors note four reasons for the mediocre retrieval performance of existing systems: “(i) Terms are not independent. (ii) Some relevant documents may not have any terms in common with a given query and therefore they cannot be retrieved. (iii) Term weights in documents are not assigned optimally. (iv) Previous parameter estimation techniques are incapable of estimating the parameter values correctly. . . . ” They propose a framework to address these four problems. The nonbinary independence model is introduced as a way of taking into consideration the frequency of occurrences of terms in documents. The authors also suggest methods to alleviate the term independence assumption, present an approach to estimating the parameters required by the model, and give a method to collect statistics, allowing for different usages of any particular term. The paper notes, “As in all other probabilistic models, parameters . . . need to be estimated from previously retrieved relevant and irrelevant documents” and “it is assumed that the document collection has been used for some time by a given user.” It is critical that these assumptions be understood in reading the work, and the authors have been careful in this regard. With these underlying requirements, the authors analyze results for small collections that have known queries and relevance assessments. The paper is clearly written. The mathematics is carefully done and is presented in a way that the reader can follow. Those interested in retrieval models, whether theoretically or for practical implementation, should read this solid piece of work.
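The point quoted above, that parameters are estimated from previously retrieved relevant and irrelevant documents, can be pictured with a small sketch. It tabulates, for one term, how often each occurrence count appears in a set of judged documents and turns the tallies into probabilities. The add-one smoothing, the `max_count` cap, the function name `estimate_count_dist`, and the toy data are assumptions for illustration; this is not the paper's estimation procedure, which also allows for different meanings of a term.

```python
from collections import Counter

def estimate_count_dist(judged_docs, term, max_count=3):
    """Estimate P(count of `term` = k) for k = 0..max_count.

    judged_docs is a list of token lists. Counts above max_count are
    clipped to max_count. Add-one smoothing keeps every estimate nonzero.
    """
    tally = Counter(min(doc.count(term), max_count) for doc in judged_docs)
    total = len(judged_docs) + (max_count + 1)
    return {k: (tally[k] + 1) / total for k in range(max_count + 1)}

# Toy feedback data: documents a user previously judged (made up).
relevant = [
    ["retrieval", "retrieval"],
    ["retrieval", "model"],
    ["retrieval", "retrieval", "index"],
]
nonrelevant = [
    ["database"],
    ["retrieval", "database"],
    ["query"],
    ["storage"],
]

p_rel = estimate_count_dist(relevant, "retrieval")
p_nonrel = estimate_count_dist(nonrelevant, "retrieval")
print(p_rel)
print(p_nonrel)
```

With so few judged documents the estimates are dominated by the smoothing, which is one reason the assumption that the collection has been in use for some time matters.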


Published In

ACM Transactions on Database Systems, Volume 14, Issue 2
June 1989
144 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/63500
Editor: Gio Wiederhold

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article
