Nothing Special   »   [go: up one dir, main page]

skip to main content
article
Free access

Description and performance analysis of signature file methods for office filing

Published: 01 July 1987 Publication History

Abstract

Signature files have attracted a lot of interest as an access method for text and specifically for messages in the office environment. Messages are stored sequentially in the message file, whereas their hash-coded abstractions (signatures) are stored sequentially in the signature file. To answer a query, the signature file is examined first, and many nonqualifying messages are immediately rejected. In this paper we examine the problem of designing signature extraction methods and studying their performance. We describe two old methods, generalize another one, and propose a new method and its variation. We provide exact and approximate formulas for the dependency between the false drop probability and the signature size for all the methods, and we show that the proposed method (VBC) achieves approximately ten times smaller false drop probability than the old methods, whereas it is well suited for collections of documents with variable document sizes.

References

[1]
AHO, A. V., AND CORASICK, M. J. Fast pattern matching: An aid to bibliographic search. Commun. ACM, 18, 6 (June 1975), 333-340.
[2]
BIRD, R. M., Tu, J. C., AND WORTHY, R.M. Associative parallel processors for searching very large textual data bases. In Proceedings of the 3rd A CM Workshop on Computer Architecture for Non-numeric Processing, ACM, New York, May 1977, pp. 8-16.
[3]
BOYER, R. S., AND MOORE, J. S. A fast string searching algorithm. Commun. ACM 20, 10 (Oct. 1977), 762-772.
[4]
CHRISTODOULAKIS, S., AND FALOUTSOS, C. Design considerations for a message file server. IEEE Trans. Softw. Eng. SE-IO, 2 (Mar. 1984), 201-210.
[5]
CHRISTODOULAKIS, S., H O, F., AND THEODORIDOU, M. The multimedia object presentation manager of MINOS: A symmetric approach, In Proceedings o{ ACM SIGMOD '86 (Washington, D.C., May 28-30). ACM, New York, 1986, pp. 295-310.
[6]
FALOUTSOS, C. Integrated access methods for messages using signature files. In IFIP Conference on Methods and Tools for O{fice systems (Pisa, Italy, Oct.), 1986, pp. 135-157.
[7]
FALOUTSOS, C., AND CHRISTODOULAKIS, S. Signature files: An access method for documents and its analytical performance evaluation. ACM Trans. Off. Inf. Syst. 2, 4 (Oct. 1984), 267-288.
[8]
FUJITANI, L. Laser optical disk: The coming revolution in on-line storage. Commun. A CM 27, 6 (June 1984), 546-554.
[9]
GALLAGER, R. G., AND VAN VOORHIS, D.C. Optimal source codes for geometrically distributed integer alphabets. IEEE Trans. Inf. Theory IT-21 (Mar. 1975), 228-230.
[10]
GOLOMB, S.W. Run length encodings. IEEE Trans. Inf. Theory IT-12 (July 1966), 399-401.
[11]
GRAVINA, C.M. National Westminster Bank mass storage archiving. IBM Syst. J. 17, 4 (1978), 344-358.
[12]
GRAY, P., ED. Special issue on Meridian system information. Telesis 12, 2, 1985.
[13]
GUSTAFSON, R.A. Elements of the randomized combinatorial file structure. In ACM SIGIR, Proceedings of the Symposium on In{ormation Storage and Retrieval, (University of Maryland, Apr.). ACM, New York, 1971, pp. 163-174.
[14]
HILLIS, D. The Connection Machine. MIT Press, Cambridge, Mass., 1985.
[15]
HOLLAAR, L.A. Text retrieval computers. Computer 12, 3 (Mar. 1979), 40-50.
[16]
HOLLAAR, L. A., SMITH, K. F., CHOW, W. H., EMRATH, P. A., AND HASKIN, R.a. Architecture and operation of a large, full-text information-retrieval system, in Advanced Database Machine Architecture, D. K. Hsiao, Ed. Prentice-Hall, Englewood Cliffs, New Jersey, 1983, pp. 256-299.
[17]
IBM. STAIRS/VS: Reference Manual. IBM System Manual, IBM Corp., Armonk, N.Y.
[18]
KNUTH, D. E., MORRIS, J. H., AND PRATT, V. R. Fast pattern matching in strings. SIAM J. Comput. 6, 2 {June 1977), 323-350.
[19]
LARSON, P.A. A method for speeding up text retrieval. In Proceedings of Annual Meeting: Databases for Business and Office Applications (San Jose, Calif., May 23-26). ACM, New York, 1983, pp. 117-123.
[20]
MCILROY, M.D. Development of a spelling list. IEEE Trans. Commun. COM-30, 1 (Jan. 1982), 91-99.
[21]
MOCKAPETRIS, P. The domain name server. ISI/RS-84-133, University of Southern California Information Science Institute, Los Angeles, Calif., June 1984.
[22]
MOOERS, C. Application of random codes to the gathering of statistical information. Bulletin 31, Zator Co., Cambridge, Mass., 1949.
[23]
NAFFAH, N., AND KARMOUCH, A. Agora--An experiment in multimedia message systems. Computer 19, 5 (May 1986), 56-66.
[24]
PFALTZ, J. L., BERMAN, W. H., AND CAGLEY, E.M. Partial match retrieval using indexed descriptor files. Commun. ACM 23, 9 (Sept. 1980), 522-528.
[25]
POGGIO, A., ACEVES, G. L., CRAGHILL, E., MORAN, D., AGUILAR, L., WORTHINGTON, D., AND HIGHT, J. CCWS: A computer based multimedia information system. Computer (Oct. 1985), 92-103.
[26]
ROBERTS, C.S. Partial-match retrieval via the method of superimposed codes. In Proceedings of the IEEE 67, 12 (Dec. 1979), 1624-1642.
[27]
SACKS-DAVIS, R., AND RAMAMOHANARAO, K. A two level superimposed coding scheme for partial match retrieval. Inf. Syst. 8, 4, 1983, 273-280.
[28]
SALTON, G., AND MCGILL, M.J. Introduction to Modern Information Retrieval. McGraw-Hill, N.Y., 1983.
[29]
SOLOMON, M., LANDWEBER, L., AND NEUHENGEN, D. The CSNET name server. Comput. Networks 6, 3, (July 1982), 161-172.
[30]
STANDISH, W.A. An essay on software reuse. IEEE Trans. Softw. Eng. SE-IO, 5 (Sept. 1984), 494-497.
[31]
STANFmL, C., AND KAHLE, B. Parallel free-text search on the connection machine system. Commun. ACM 29, 12 (Dec. 1986), 1229-1239.
[32]
STIASSNY, S. Mathematical analysis of various superimposed coding methods. Am. Documentation 11, 2 (Feb. 1960), 155-169.
[33]
TSICHRITZIS, D., AND CHRISTODOULAKIS, S. Message files. A CM Trans. 0{{. Inf. Syst. 1, 1 (J an. 1983), 88-98.
[34]
TSlCHRITZlS, D., CHRISTODOULAKIS, S., ECONOMOPOULOS, P., FALOUTSOS, C., LEE, A., LEE, D., VANDENBROEK, J., AND Woo, C. A multimedia office filing system. In Proceedings of 9th International Conference on VLDB (Florence, Italy, Oct.-Nov.). VLDB Endowment. 1983.
[35]
VAN-RIJSBERGEN, C.J. Information retrieval. 2d ed. Butterworths, London, England, 1979.

Cited By

View all
  • (2017)Efficient indexing for semantic searchExpert Systems with Applications10.1016/j.eswa.2016.12.03373(92-114)Online publication date: May-2017
  • (2016)A Study on Improved Detection Signature System in Hacking Response of One-Line GamesThe Journal of Society for e-Business Studies10.7838/jsebs.2016.21.1.10521:1(105-118)Online publication date: 28-Feb-2016
  • (2015)Fast Forward Index Methods for Pseudo-Relevance Feedback RetrievalACM Transactions on Information Systems10.1145/274419933:4(1-33)Online publication date: 13-May-2015
  • Show More Cited By

Index Terms

  1. Description and performance analysis of signature file methods for office filing

                            Recommendations

                            Reviews

                            Dennis Roy Thompson

                            Christos and Stavros stuck in their thumbs and pulled out a signature file method plum. Access method classifications are full text scanning, inversion, signature files, clustering, and multi-attribute hashing. Signature file methods are concerned with creating a terse encoded index from an abstract of the text. Both the text and the “signature” are then just appended to their respective files. To search, the entire index must be scanned and comparisons and/or calculations made according to the search criteria. The object is to create a signature that is terse, never misses, and has few false matches (false drops). The methods that the authors look at are Word Signature (WS), Superimposed Coding (SC), Bit-block Compression (BC), Run Length encoding (RL), and Variable Bit-block Compression (VBC). They present a detailed design of the BC method and a proposal of the VBC method with supporting derivations. I would like to see them do research in combinations of access methods such as using daily (or some reasonable length of time) signature methods with batching into inversion. Also, the authors should not assume that the average user is not familiar with Boolean logic or the authors may become familiar with Boolean egg. I've seen too many users use it without knowing its name.

                            Access critical reviews of Computing literature here

                            Become a reviewer for Computing Reviews.

                            Comments

                            Please enable JavaScript to view thecomments powered by Disqus.

                            Information & Contributors

                            Information

                            Published In

                            cover image ACM Transactions on Information Systems
                            ACM Transactions on Information Systems  Volume 5, Issue 3
                            July 1987
                            77 pages
                            ISSN:1046-8188
                            EISSN:1558-2868
                            DOI:10.1145/27641
                            • Editor:
                            • Robert B. Allen
                            Issue’s Table of Contents

                            Publisher

                            Association for Computing Machinery

                            New York, NY, United States

                            Publication History

                            Published: 01 July 1987
                            Published in TOIS Volume 5, Issue 3

                            Permissions

                            Request permissions for this article.

                            Check for updates

                            Qualifiers

                            • Article

                            Contributors

                            Other Metrics

                            Bibliometrics & Citations

                            Bibliometrics

                            Article Metrics

                            • Downloads (Last 12 months)68
                            • Downloads (Last 6 weeks)12
                            Reflects downloads up to 24 Sep 2024

                            Other Metrics

                            Citations

                            Cited By

                            View all
                            • (2017)Efficient indexing for semantic searchExpert Systems with Applications10.1016/j.eswa.2016.12.03373(92-114)Online publication date: May-2017
                            • (2016)A Study on Improved Detection Signature System in Hacking Response of One-Line GamesThe Journal of Society for e-Business Studies10.7838/jsebs.2016.21.1.10521:1(105-118)Online publication date: 28-Feb-2016
                            • (2015)Fast Forward Index Methods for Pseudo-Relevance Feedback RetrievalACM Transactions on Information Systems10.1145/274419933:4(1-33)Online publication date: 13-May-2015
                            • (2012)A new index model based on inverted index2012 IEEE International Conference on Computer Science and Automation Engineering10.1109/ICSESS.2012.6269429(157-160)Online publication date: Jun-2012
                            • (2012)A Novel Context Based Indexing of Web DocumentsProceedings of the 2012 International Conference on Communication Systems and Network Technologies10.1109/CSNT.2012.103(448-452)Online publication date: 11-May-2012
                            • (2010)A constraint-based tool for data integrity management on the webProceedings of the 4th International Conference on Uniquitous Information Management and Communication10.1145/2108616.2108643(1-3)Online publication date: 14-Jan-2010
                            • (2009)An Index Structure for Fast Query Retrieval in Object Oriented Data Bases Using Signature Weight DeclusteringInformation Technology Journal10.3923/itj.2009.275.2838:3(275-283)Online publication date: 1-Mar-2009
                            • (2008)On the SD-tree construction for optimal signature operationsProceedings of the 1st Bangalore Annual Compute Conference10.1145/1341771.1341786(1-8)Online publication date: 18-Jan-2008
                            • (2007)A Keyword Searching Algorithm For Search Engines2007 Innovations in Information Technologies (IIT)10.1109/IIT.2007.4430467(203-207)Online publication date: Nov-2007
                            • (2005)An index design in topic-focused search engineProceedings. 2005 IEEE Networking, Sensing and Control, 2005.10.1109/ICNSC.2005.1461190(220-224)Online publication date: 2005
                            • Show More Cited By

                            View Options

                            View options

                            PDF

                            View or Download as a PDF file.

                            PDF

                            eReader

                            View online with eReader.

                            eReader

                            Get Access

                            Login options

                            Full Access

                            Media

                            Figures

                            Other

                            Tables

                            Share

                            Share

                            Share this Publication link

                            Share on social media