Abstract
Due to the skewed nature of the frequency distribution of term occurrence (e.g., Zipf's law) it is unlikely that any single technique for indexing text can do well in all situations. In this paper we propose a hybrid approach to indexing text, and show how it can outperform the traditional inverted B-tree index both in storage overhead, in time to perform a retrieval, and, for dynamic databases, in time for an insertion, both for single term and for multiple term queries. We demonstrate the benefits of our technique on a database of stories from the Associated Press news wire, and we provide formulae and guidelines on how to make optimal choices of the design parameters in real applications.
This research was sponsored in pan by the University of Maryland Institute for Advanced Computer Studies, by the National Science Foundation under grants DCR-86-16833, IRI-8719458 and IRI-8938546 and by the Air Force Office of Scientific Research under Grant AFOSR-89-0303.
Preview
Unable to display preview. Download preview PDF.
References
Bourne, C.P., “Frequency and Impact of Spelling Errors in Bibliographic Databases,” Information Processing and Management 13(1), pp. 1–12 (1977).
Cardenas, A.F., “Analysis and Performance of Inverted Data Base Structures,” CACM 18(5), pp. 253–263 (May 1975).
Christodoulakis, S., M. Theodoridou, F. Ho, M. Papa, and A. Pathria, “Multimedia Document Presentation. Information Extraction and Document Formation in MINOS: A Model and a System,” ACM TOOIS 4(4) (Oct. 1986).
Comer, D., “The Ubiquitous B-Tree,” Computing Surveys 11(2), pp. 121–137 (June 1979).
Faloutsos, C., “Access Methods for Text,” ACM Computing Surveys 17(1), pp. 49–74 (March 1985).
Faloutsos, C. and R. Chan, “Fast Text Access Methods for Optical and Large Magnetic Disks: Designs and Performance Comparison,” Proc. 14th International Conf. on VLDB, Long Beach, California, pp. 280–293 (Aug. 1988).
Faloutsos, C., “Signature-Based Text Retrieval Methods: A Survey,” IEEE Data Engineering 13(1), pp. 25–32 (March 1990).
Haskin, R.L., “Special-Purpose Processors for Text Retrieval,” Database Engineering 4(1), pp. 16–29 (Sept. 1981).
Hollaar, L.A., K.F. Smith, W.H. Chow, P.A. Emrath, and R.L. Haskin, “Architecture and Operation of a Large, Full-Text Information-Retrieval System,” pp. 256–299 in Advanced Database Machine Architecture, ed. D.K. Hsiao, Prentice-Hall, Englewood Cliffs, New Jersey (1983).
King, D. R., “The Binary Vector as the Basis of an Inverted Index File,” J. Lib. Autom. 7(4), p. 307 (1974).
Lesk, M.E., “Some Applications of Inverted Indexes on the UNIX System,” UNIX Programmer's Manual, Bell Laboratories, Murray Hill, New Jersey (1978).
Lin, Z. and C. Faloutsos, “Frame Sliced Signature Files,” CS-TR-2146 and UMIACS-TR-88-88. Dept. of Computer Science, Univ. of Maryland (Dec. 1988).
Rothnic, J.B. and T. Lozano, “Attribute Based File Organization in a Paged Memory Environment,” CACM 17(2), pp. 63–69 (Feb. 1974).
Salton, G. and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983).
Schuegraf, E.J., “Compression of Large Inverted Files with Hyperbolic Term Distribution,” Information Processing and Management 12, pp. 377–384 (1976).
Zipf, G.K., Human Behavior and Principle of Least Effort: An Introduction to Human Ecology, Addison Wesley, Cambridge, Massachusetts (1949).
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 1992 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Faloutsos, C., Jagadish, H.V. (1992). Hybrid index organizations for text databases. In: Pirotte, A., Delobel, C., Gottlob, G. (eds) Advances in Database Technology — EDBT '92. EDBT 1992. Lecture Notes in Computer Science, vol 580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0032439
Download citation
DOI: https://doi.org/10.1007/BFb0032439
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-55270-3
Online ISBN: 978-3-540-47003-8
eBook Packages: Springer Book Archive