Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/2361354.2361404acmconferencesArticle/Chapter ViewAbstractPublication PagesdocengConference Proceedingsconference-collections
research-article

Full-text search on multi-byte encoded documents

Published: 04 September 2012 Publication History

Abstract

The Burrows Wheeler transform (BWT) has become popular in text compression, full-text search, XML representation, and DNA sequence matching. It is very efficient to perform a full-text search on BWT encoded text using backward search. This paper aims to study different approaches for applying BWT on multi-byte encoded (e.g. UTF-16) text documents. While previous work has studied BWT on word-based models, and BWT can be applied directly on multi-byte encodings (by treating the document as single-byte coded), there has been no extensive study on how to utilize BWT on multi-byte encoded documents for efficient full-text search. Therefore, in this paper, we propose several ways to efficiently backward search multi-byte text documents. We demonstrate our findings using Chinese text documents. Our experiment results show that our extensions to the standard BWT method offer faster search performance and use less runtime memory.

References

[1]
A. Andersson, N. Larsson, and K. Swanson. Suffix trees on words. Algorithmica, 23(3):246--260, 1999.
[2]
M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation, Palo Alto, CA, 1994.
[3]
P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and searching XML data via two zips. In WWW 2006, Edinburgh, Scotland, 2006. ACM.
[4]
P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, FOCS '00, pages 390--398, Washington, DC, USA, 2000. IEEE Computer Society.
[5]
P. Ferragina and G. Manzini. Indexing compressed text. J. ACM, 52(4):552--581, July 2005.
[6]
W. B. Frakes and R. A. Baeza-Yates, editors. Information Retrieval: Data Structures & Algorithms. Prentice-Hall, 1992.
[7]
S. Inenaga and M. Takeda. On-line linear-time construction of word suffix trees. In in Proc. 17th Ann. Symp. on Combinatorial Pattern Matching (CPM'06), pages 60--71. Springer-Verlag, 2006.
[8]
R. Y. K. Isal, A. Moffat, and A. C. Ngai. Enhanced word-based block-sorting text compression. In ACSC '02 Proceedings of the twenty-fifth Australasian conference on Computer Science. ACS, 2002.
[9]
J. Karkkainen and E. Ukkonen. Sparse suffix trees. In J.-Y. Cai and C. Wong, editors, Computing and Combinatorics, volume 1090 of Lecture Notes in Computer Science, pages 219--230. Springer Berlin / Heidelberg, 1996.
[10]
H. Li and R. Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589--595, 2010.
[11]
V. Makinen and G. Navarro. Succinct suffix arrays based on run-length encoding. Nordic J. of Computing, 12(1):40--66, Mar. 2005.
[12]
U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM J. Computing, 22(5):935--948, 1993.
[13]
E. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2):262--272, 1976.
[14]
G. H. Ong and S. Y. Huang. A data compression scheme for Chinese text files using Huffman coding and a two-level dictionary. Information Sciences, 84(1-2):85--99, 1995.
[15]
K. Sadakane. A modified Burrows-Wheeler transformation for case-insensitive search with application to suffix array compression. In DCC: Data Compression Conference. IEEE Computer Society, 1999.
[16]
K. Sadakane. Unifying Text Search and Compression: Suffix Sorting, Block Sorting and Suffix Arrays. PhD thesis, The University of Tokyo, 2000.
[17]
I. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, CA, 1999.
[18]
S. Yoshida, T. Morihara, H. Yahagi, and N. Satoh. Application of a word-based text compression method to Japanese and Chinese texts. In Data Compression Conference, DCC '99. IEEE, 1999.

Cited By

View all
  • (2023)GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593706(348-359)Online publication date: 21-Jun-2023
  • (2016)Frequent Multi-Byte Character Subtring Extraction using a Succinct Data StructureProceedings of the 2016 ACM Symposium on Document Engineering10.1145/2960811.2967161(103-106)Online publication date: 13-Sep-2016

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
DocEng '12: Proceedings of the 2012 ACM symposium on Document engineering
September 2012
256 pages
ISBN:9781450311168
DOI:10.1145/2361354
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 September 2012

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Burrows Wheeler transform
  2. full-text search
  3. multi-byte encodings

Qualifiers

  • Research-article

Conference

DocEng '12
Sponsor:
DocEng '12: ACM Symposium on Document Engineering
September 4 - 7, 2012
Paris, France

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)6
  • Downloads (Last 6 weeks)1
Reflects downloads up to 27 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2023)GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593706(348-359)Online publication date: 21-Jun-2023
  • (2016)Frequent Multi-Byte Character Subtring Extraction using a Succinct Data StructureProceedings of the 2016 ACM Symposium on Document Engineering10.1145/2960811.2967161(103-106)Online publication date: 13-Sep-2016

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media