article

Free access

Partitioned signature files: design issues and performance evaluation

Editor: Robert B. Allen Authors:

Dik Lun Lee,

Chun-Wu LengAuthors Info & Claims

ACM Transactions on Information Systems (TOIS), Volume 7, Issue 2

Pages 158 - 180

https://doi.org/10.1145/65935.65937

Published: 01 April 1989 Publication History

PDF eReader

Abstract

A signature file acts as a filtering mechanism to reduce the amount of text that needs to be searched for a query. Unfortunately, the signature file itself must be exhaustively searched, resulting in degraded performance for a large file size. We propose to use a deterministic algorithm to divide a signature file into partitions, each of which contains signatures with the same “key.” The signature keys in a partition can be extracted and represented as the partition's key. The search can then be confined to the subset of partitions whose keys match the query key. Our main concern here is to study methods for obtaining the keys and their performance in terms of their ability to reduce the search space.

Owing to the reduction of search space, partitioning a signature file has a direct benefit in a sequential search (single-processor) environment. In a parallel environment, search can be conducted in parallel effectively by allocating one or more partitions to a processor. Partitioning the signature tile with a deterministic method (as opposed to a random partitioning scheme) provides intraquery parallelism as well as interquery parallelism.

In this paper, we outline the criteria for evaluating partitioning schemes. Three algorithms are described and studied. An analytical study of the performance of the algorithms is provided and the results are verified with simulation.

References

[1]

AHUJA, S. R., AND ROBERTS, C.S. An associative/parallel processor for partial match retrieval using superimposed codes. In Proceedings of the 7th Annual Symposium on Computer Architecture (France, May). IEEE, New York, 1980, pp. 218-227.

Crossref

Google Scholar

[2]

BERRA, B., CHUNG, S. M., AND HACHEM, N.I. Computer architecture for a surrogate file to a very large data/knowledge base. IEEE Computer 20, 3 (Mar. 1987), 25-32.

Crossref

Google Scholar

[3]

CHRISTODOULAKIS, S., AND FALOUTSOS, C. Design considerations for a message file server. IEEE Trans. Soft. Eng. SE-IO, 2 (Mar. 1984), 201-210.

Google Scholar

[4]

DEPPISCH, U. S-Tree: A dynamic balanced signature index for office retrieval. In Proceedings of the ACM Conference on Research and Development in Information Retrieval (Pisa, Italy, Sept. 8-10). ACM, New York, 1986, pp. 77-87.

Crossref

Google Scholar

[5]

FALOUTSOS, C. Access methods for text. ACM Comput. Surv. 17, 1 (Mar. 1985), 49-74.

Crossref

Google Scholar

[6]

FALOUTSOS, C., AND CHRISTODOULAKIS, S. Description and performance analysis of signature file methods for office filing. ACM Trans. Off. Inf. Syst. 5, 3 (July 1987), 237-257.

Crossref

Google Scholar

[7]

FALOUTSOS, C., AND CHRISTODOULAKIS, S. Optimal signature extraction and information loss. ACM Trans. Database Syst. 12, 3 (Sept. 1987), 395-428.

Crossref

Google Scholar

[8]

GUSTAFSON, R.A. Elements of the randomized combinatorial file structure. In Proceedings of the Symposium on Information Storage and Retrieval (College Park, Md., Apr.). ACM, New York, 1971, pp. 163-174.

Crossref

Google Scholar

[9]

LEE, D.L. A word-parallel, bit-serial signature processor for superimposed coding. In Proceedings of the 2nd International Conference on Data Engineering (Los Angeles, Calif., Feb.). IEEE, New York, 1986, pp. 352-359.

Crossref

Google Scholar

[10]

LEE, D. L., AND LENG, C.-W. A fast access method based on partitioned signature files. Submitted for publication, 1988.

Google Scholar

[11]

LEE, D. L., AND LOCHOVSKY, F. H. Text retrieval machines. In Office Automation, D. C. Tsichritzis, Ed. Springer-Verlag, New York, 1985, pp. 339-375.

Google Scholar

[12]

OROSZ, G., AND TACKACS, L. Some probability problems concerning the marking of codes in the superimposition field. J. Doc. 12, 4 (Dec. 1956), 231-234.

Google Scholar

[13]

PFALTZ, J. L., BERMAN, W. J., AND CAGLEY, E. M. Partial match retrieval using indexed descriptor files. Commun. ACM 23, 9 (Sept. 1980), 522-528.

Crossref

Google Scholar

[14]

RIVEST, R.L. Partial-match retrieval algorithms. SIAM J. Comput. 5, 1 (1976), 19-50.

Google Scholar

[15]

ROBERTS, C.S. Partial-match retrieval via a method of superimposed codes. In Proceedings of the IEEE 67, 12 (Dec. 1979), 1624-1642.

Google Scholar

[16]

SACKS-DAVIS, R., AND RAMAMOHANARAO, K. A two level superimposed coding scheme for partial match retrieval. Inf. Syst. 8, 4 (1983), 273-280.

Google Scholar

[17]

SACKS-DAVIS, R., KENT, A., AND RAMAMOHANARAO, K. Multikey access methods based on superimposed coding techniques. ACM Trans. Database Syst. 12, 4 (Dec. 1987), 655-696.

Crossref

Google Scholar

[18]

STANFILL, C., AND KAHLE, B. Parallel free-text search on the connection machine system. Commun. ACM 29, 12 (Dec. 1986), 1229-1239.

Crossref

Google Scholar

[19]

TSICHRrTZIS, D. C., AND CHRISTODOULAKIS, S. Message files. ACM Trans. Off. Inf. Syst. i, 1 (Jan. 1983), 88-98.

Crossref

Google Scholar

Cited By

View all

Chen YChen B(2024)On Mining Most Popular PackagesAdvances in Science, Technology and Engineering Systems Journal10.25046/aj0904079:4(60-72)Online publication date: Aug-2024
https://doi.org/10.25046/aj090407
Chen YShi W(2018)On the Designing of Popular Packages2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData)10.1109/Cybermatics_2018.2018.00180(937-944)Online publication date: Jul-2018
https://doi.org/10.1109/Cybermatics_2018.2018.00180
Faloutsos CLee RPlaisant CShneiderman B(2018)Incorporating String Search in a Hypertext System: User Interface and Signature File Design IssuesHypermedia10.1080/09558543.1990.120311822:3(183-200)Online publication date: 29-Oct-2018
https://doi.org/10.1080/09558543.1990.12031182
Show More Cited By

Index Terms

Partitioned signature files: design issues and performance evaluation
1. Information systems

Recommendations

Design and performance evaluation of signature files for text retrieval systems
Document ranking on weight-partitioned signature files
Parallel Search using Partitioned Inverted Files
SPIRE '00: Proceedings of the Seventh International Symposium on String Processing Information Retrieval (SPIRE'00)

Examines the searching of partitioned inverted files with particular emphasis on issues that arise from different types of partitioning methods. Two types of index partitions are investigated, namely term identifier (TermId) partitioning and document ...

Reviews

Reviewer: Esen A. Ozkarahan

The search of very large databases requires special hardware and software architectures to achieve acceptable performance. Text (document) databases are among the largest, which further complicates the search problem. In order to achieve efficient searching, many systems use fast access paths such as indices, partitioning, and signatures. This paper discusses efficient organization and search of signatures. As the authors point out, the compression factor of signature techniques (due to hash coding and superimposing) is 10–20 percent of the size of the original database. Such low compression would not mean much for a database size of 300 gigabytes, and the gigabyte range is common for document databases. The authors recognize this fact and propose the partitioning of signature files and their sequential or parallel search, which is a relevant goal. Partitions must be generated so that they cannot be searched for irrelevant queries, and the sizes of the partitions must be unskewed in order to balance loads in case of parallel processing. Randomly assigning signatures to partitions would defeat the first goal. Also, in the proposed partitioning schemes, a partition key with all binary ones should participate in every query search because such a key would include all possible query keys regardless of the relevance of the corresponding documents. The authors propose three partitioning schemes—fixed prefix, extended prefix, and floating key. All are based on extracting a subset of the signature bits and grouping signatures on the basis of matching keys. The goal is to further reduce the search space and the number of signatures searched. An even signature distribution is important, especially when partitions are searched in parallel. Fixed prefix partitioning is the simplest approach, and floating key partitioning gives the best results. Incoming queries should be processed for signature generation and query key extraction and query keys should be extracted in the same way that partition keys are generated. Any partition whose key includes the query key is further processed, otherwise, it is unqualified for further search. The paper gives detailed performance comparisons of the proposed partitioning methods (with respect to the aforementioned criteria) as well as proofs of the correctness of the proposed methodologies. The definition of n and the query and signature weights are carelessly omitted. The argument that all signatures must be searched in the bit-sliced (transposed) organization of signatures is imprecise. Such an organization resembles an inverted index with hashing. An inverted index is not exhaustively searched. The authors should have compared the execution performance of the transposed organization with that of the sequential search of the proposed partitioned approach. Several recent studies have established that the signature methodology does not yield high quality information retrieval in terms of recall and precision. Also, a document with a longer signature has a better chance of being retrieved. One cannot easily assign nonbinary term weights to documents and queries. On the other hand, the signature methodology addresses the needs of office information systems for fast searching of unformatted texts.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

ACM Transactions on Information Systems Volume 7, Issue 2

April 1989

78 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/65935

Editor:
Robert B. Allen

Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 April 1989

Published in TOIS Volume 7, Issue 2

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

62
Total Citations
View Citations
702
Total Downloads

Downloads (Last 12 months)131
Downloads (Last 6 weeks)36

Reflects downloads up to 14 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

View all

Chen YChen B(2024)On Mining Most Popular PackagesAdvances in Science, Technology and Engineering Systems Journal10.25046/aj0904079:4(60-72)Online publication date: Aug-2024
https://doi.org/10.25046/aj090407
Chen YShi W(2018)On the Designing of Popular Packages2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData)10.1109/Cybermatics_2018.2018.00180(937-944)Online publication date: Jul-2018
https://doi.org/10.1109/Cybermatics_2018.2018.00180
Faloutsos CLee RPlaisant CShneiderman B(2018)Incorporating String Search in a Hypertext System: User Interface and Signature File Design IssuesHypermedia10.1080/09558543.1990.120311822:3(183-200)Online publication date: 29-Oct-2018
https://doi.org/10.1080/09558543.1990.12031182
Alfred RSoon GOn CAnthony P(2014)A Robust Framework for Web Information Extraction and RetrievalInternational Journal of Machine Learning and Computing10.7763/IJMLC.2014.V4.4034:2(146-150)Online publication date: Apr-2014
https://doi.org/10.7763/IJMLC.2014.V4.403
Wang YZheng B(2013)Hypergraph index: an index for context-aware nearest neighbor query on social networksSocial Network Analysis and Mining10.1007/s13278-013-0095-y3:4(813-828)Online publication date: 3-Feb-2013
https://doi.org/10.1007/s13278-013-0095-y
Wang YZheng B(2011)Context-aware nearest neighbor query on social networksProceedings of the Third international conference on Social informatics10.5555/2050728.2050748(98-112)Online publication date: 6-Oct-2011
https://dl.acm.org/doi/10.5555/2050728.2050748
Thorson M(2011)Internet nuggetsACM SIGARCH Computer Architecture News10.1145/2024716.202472239:2(36-52)Online publication date: 31-Aug-2011
https://dl.acm.org/doi/10.1145/2024716.2024722
Thomasian A(2011)Survey and analysis of disk scheduling methodsACM SIGARCH Computer Architecture News10.1145/2024716.202471939:2(8-25)Online publication date: 31-Aug-2011
https://dl.acm.org/doi/10.1145/2024716.2024719
Binkert NBeckmann BBlack GReinhardt SSaidi ABasu AHestness JHower DKrishna TSardashti SSen RSewell KShoaib MVaish NHill MWood D(2011)The gem5 simulatorACM SIGARCH Computer Architecture News10.1145/2024716.202471839:2(1-7)Online publication date: 31-Aug-2011
https://dl.acm.org/doi/10.1145/2024716.2024718
Xu GBond MQin FRountev A(2011)LeakChaserACM SIGPLAN Notices10.1145/1993316.199353046:6(270-282)Online publication date: 4-Jun-2011
https://dl.acm.org/doi/10.1145/1993316.1993530
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Abstract

References

Cited By

Index Terms

Recommendations

Design and performance evaluation of signature files for text retrieval systems

Document ranking on weight-partitioned signature files

Parallel Search using Partitioned Inverted Files

Reviews

Access critical reviews of Computing literature here

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Login options

Full Access

Share

Share this Publication link

Share on social media

Affiliations