Nothing Special   »   [go: up one dir, main page]

skip to main content
article
Free access

Partitioned signature files: design issues and performance evaluation

Published: 01 April 1989 Publication History

Abstract

A signature file acts as a filtering mechanism to reduce the amount of text that needs to be searched for a query. Unfortunately, the signature file itself must be exhaustively searched, resulting in degraded performance for a large file size. We propose to use a deterministic algorithm to divide a signature file into partitions, each of which contains signatures with the same “key.” The signature keys in a partition can be extracted and represented as the partition's key. The search can then be confined to the subset of partitions whose keys match the query key. Our main concern here is to study methods for obtaining the keys and their performance in terms of their ability to reduce the search space.
Owing to the reduction of search space, partitioning a signature file has a direct benefit in a sequential search (single-processor) environment. In a parallel environment, search can be conducted in parallel effectively by allocating one or more partitions to a processor. Partitioning the signature tile with a deterministic method (as opposed to a random partitioning scheme) provides intraquery parallelism as well as interquery parallelism.
In this paper, we outline the criteria for evaluating partitioning schemes. Three algorithms are described and studied. An analytical study of the performance of the algorithms is provided and the results are verified with simulation.

References

[1]
AHUJA, S. R., AND ROBERTS, C.S. An associative/parallel processor for partial match retrieval using superimposed codes. In Proceedings of the 7th Annual Symposium on Computer Architecture (France, May). IEEE, New York, 1980, pp. 218-227.
[2]
BERRA, B., CHUNG, S. M., AND HACHEM, N.I. Computer architecture for a surrogate file to a very large data/knowledge base. IEEE Computer 20, 3 (Mar. 1987), 25-32.
[3]
CHRISTODOULAKIS, S., AND FALOUTSOS, C. Design considerations for a message file server. IEEE Trans. Soft. Eng. SE-IO, 2 (Mar. 1984), 201-210.
[4]
DEPPISCH, U. S-Tree: A dynamic balanced signature index for office retrieval. In Proceedings of the ACM Conference on Research and Development in Information Retrieval (Pisa, Italy, Sept. 8-10). ACM, New York, 1986, pp. 77-87.
[5]
FALOUTSOS, C. Access methods for text. ACM Comput. Surv. 17, 1 (Mar. 1985), 49-74.
[6]
FALOUTSOS, C., AND CHRISTODOULAKIS, S. Description and performance analysis of signature file methods for office filing. ACM Trans. Off. Inf. Syst. 5, 3 (July 1987), 237-257.
[7]
FALOUTSOS, C., AND CHRISTODOULAKIS, S. Optimal signature extraction and information loss. ACM Trans. Database Syst. 12, 3 (Sept. 1987), 395-428.
[8]
GUSTAFSON, R.A. Elements of the randomized combinatorial file structure. In Proceedings of the Symposium on Information Storage and Retrieval (College Park, Md., Apr.). ACM, New York, 1971, pp. 163-174.
[9]
LEE, D.L. A word-parallel, bit-serial signature processor for superimposed coding. In Proceedings of the 2nd International Conference on Data Engineering (Los Angeles, Calif., Feb.). IEEE, New York, 1986, pp. 352-359.
[10]
LEE, D. L., AND LENG, C.-W. A fast access method based on partitioned signature files. Submitted for publication, 1988.
[11]
LEE, D. L., AND LOCHOVSKY, F. H. Text retrieval machines. In Office Automation, D. C. Tsichritzis, Ed. Springer-Verlag, New York, 1985, pp. 339-375.
[12]
OROSZ, G., AND TACKACS, L. Some probability problems concerning the marking of codes in the superimposition field. J. Doc. 12, 4 (Dec. 1956), 231-234.
[13]
PFALTZ, J. L., BERMAN, W. J., AND CAGLEY, E. M. Partial match retrieval using indexed descriptor files. Commun. ACM 23, 9 (Sept. 1980), 522-528.
[14]
RIVEST, R.L. Partial-match retrieval algorithms. SIAM J. Comput. 5, 1 (1976), 19-50.
[15]
ROBERTS, C.S. Partial-match retrieval via a method of superimposed codes. In Proceedings of the IEEE 67, 12 (Dec. 1979), 1624-1642.
[16]
SACKS-DAVIS, R., AND RAMAMOHANARAO, K. A two level superimposed coding scheme for partial match retrieval. Inf. Syst. 8, 4 (1983), 273-280.
[17]
SACKS-DAVIS, R., KENT, A., AND RAMAMOHANARAO, K. Multikey access methods based on superimposed coding techniques. ACM Trans. Database Syst. 12, 4 (Dec. 1987), 655-696.
[18]
STANFILL, C., AND KAHLE, B. Parallel free-text search on the connection machine system. Commun. ACM 29, 12 (Dec. 1986), 1229-1239.
[19]
TSICHRrTZIS, D. C., AND CHRISTODOULAKIS, S. Message files. ACM Trans. Off. Inf. Syst. i, 1 (Jan. 1983), 88-98.

Cited By

View all
  • (2024)On Mining Most Popular PackagesAdvances in Science, Technology and Engineering Systems Journal10.25046/aj0904079:4(60-72)Online publication date: Aug-2024
  • (2018)On the Designing of Popular Packages2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData)10.1109/Cybermatics_2018.2018.00180(937-944)Online publication date: Jul-2018
  • (2018)Incorporating String Search in a Hypertext System: User Interface and Signature File Design IssuesHypermedia10.1080/09558543.1990.120311822:3(183-200)Online publication date: 29-Oct-2018
  • Show More Cited By

Recommendations

Reviews

Esen A. Ozkarahan

The search of very large databases requires special hardware and software architectures to achieve acceptable performance. Text (document) databases are among the largest, which further complicates the search problem. In order to achieve efficient searching, many systems use fast access paths such as indices, partitioning, and signatures. This paper discusses efficient organization and search of signatures. As the authors point out, the compression factor of signature techniques (due to hash coding and superimposing) is 10–20 percent of the size of the original database. Such low compression would not mean much for a database size of 300 gigabytes, and the gigabyte range is common for document databases. The authors recognize this fact and propose the partitioning of signature files and their sequential or parallel search, which is a relevant goal. Partitions must be generated so that they cannot be searched for irrelevant queries, and the sizes of the partitions must be unskewed in order to balance loads in case of parallel processing. Randomly assigning signatures to partitions would defeat the first goal. Also, in the proposed partitioning schemes, a partition key with all binary ones should participate in every query search because such a key would include all possible query keys regardless of the relevance of the corresponding documents. The authors propose three partitioning schemes—fixed prefix, extended prefix, and floating key. All are based on extracting a subset of the signature bits and grouping signatures on the basis of matching keys. The goal is to further reduce the search space and the number of signatures searched. An even signature distribution is important, especially when partitions are searched in parallel. Fixed prefix partitioning is the simplest approach, and floating key partitioning gives the best results. Incoming queries should be processed for signature generation and query key extraction and query keys should be extracted in the same way that partition keys are generated. Any partition whose key includes the query key is further processed, otherwise, it is unqualified for further search. The paper gives detailed performance comparisons of the proposed partitioning methods (with respect to the aforementioned criteria) as well as proofs of the correctness of the proposed methodologies. The definition of n and the query and signature weights are carelessly omitted. The argument that all signatures must be searched in the bit-sliced (transposed) organization of signatures is imprecise. Such an organization resembles an inverted index with hashing. An inverted index is not exhaustively searched. The authors should have compared the execution performance of the transposed organization with that of the sequential search of the proposed partitioned approach. Several recent studies have established that the signature methodology does not yield high quality information retrieval in terms of recall and precision. Also, a document with a longer signature has a better chance of being retrieved. One cannot easily assign nonbinary term weights to documents and queries. On the other hand, the signature methodology addresses the needs of office information systems for fast searching of unformatted texts.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Information Systems
ACM Transactions on Information Systems  Volume 7, Issue 2
April 1989
78 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/65935
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 April 1989
Published in TOIS Volume 7, Issue 2

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)131
  • Downloads (Last 6 weeks)36
Reflects downloads up to 14 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)On Mining Most Popular PackagesAdvances in Science, Technology and Engineering Systems Journal10.25046/aj0904079:4(60-72)Online publication date: Aug-2024
  • (2018)On the Designing of Popular Packages2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData)10.1109/Cybermatics_2018.2018.00180(937-944)Online publication date: Jul-2018
  • (2018)Incorporating String Search in a Hypertext System: User Interface and Signature File Design IssuesHypermedia10.1080/09558543.1990.120311822:3(183-200)Online publication date: 29-Oct-2018
  • (2014)A Robust Framework for Web Information Extraction and RetrievalInternational Journal of Machine Learning and Computing10.7763/IJMLC.2014.V4.4034:2(146-150)Online publication date: Apr-2014
  • (2013)Hypergraph index: an index for context-aware nearest neighbor query on social networksSocial Network Analysis and Mining10.1007/s13278-013-0095-y3:4(813-828)Online publication date: 3-Feb-2013
  • (2011)Context-aware nearest neighbor query on social networksProceedings of the Third international conference on Social informatics10.5555/2050728.2050748(98-112)Online publication date: 6-Oct-2011
  • (2011)Internet nuggetsACM SIGARCH Computer Architecture News10.1145/2024716.202472239:2(36-52)Online publication date: 31-Aug-2011
  • (2011)Survey and analysis of disk scheduling methodsACM SIGARCH Computer Architecture News10.1145/2024716.202471939:2(8-25)Online publication date: 31-Aug-2011
  • (2011)The gem5 simulatorACM SIGARCH Computer Architecture News10.1145/2024716.202471839:2(1-7)Online publication date: 31-Aug-2011
  • (2011)LeakChaserACM SIGPLAN Notices10.1145/1993316.199353046:6(270-282)Online publication date: 4-Jun-2011
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media