Application of a Word-Based Text Compression Method to Japanese and Chinese Texts

Shigeru YOSHIDA
Takashi MORIHARA
Hironori YAHAGI
Noriko ITANI

Publication
IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer Sciences Vol.E85-A No.12 pp.2933-2938
Publication Date: 2002/12/01
Print ISSN: 0916-8508
Type of Manuscript: PAPER
Category: Information Theory
Keyword:
lossless, text compression, language, word-based

Summary:
Conventional text compression schemes that sample 8 bits at a time cannot compress 16-bit Asian language codes well. Previously, we reported the application of a word-based text compression method that uses 16-bit sampling to the compression of Japanese texts. This paper describes our further efforts in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method was proposed to support a multilingual environment: the word dictionary and the canonical Huffman code table are replaced with those appropriate to the respective language. A computer simulation showed that this method is effective for both languages. The obtained compression ratio was a little less than 0.5 when the Markov context was not taken into account, and around 0.4 when the first-order Markov context was taken into account.
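
As a rough illustration of the static canonical Huffman step mentioned in the summary, the following minimal Python sketch builds a canonical code table from word frequencies and encodes a pre-segmented word sequence. It is not the authors' implementation: the function names (code_lengths, canonical_codes), the toy token list, and the ordering rule (code length, then symbol) are assumptions made for illustration. The sketch also ignores word segmentation, the cost of storing the dictionary, and the Markov context.

```python
import heapq
from collections import Counter
from itertools import count


def code_lengths(freqs):
    """Return {symbol: Huffman code length in bits} for a frequency table."""
    if len(freqs) == 1:
        # A single symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    tie = count()  # tie-breaker so the heap never compares the dicts
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def canonical_codes(lengths):
    """Assign canonical codes: sort by (length, symbol) and count upward."""
    codes = {}
    code = 0
    prev_len = 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len  # append zeros when the length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


# Hypothetical, already segmented word sequence (illustration only).
tokens = ["圧縮", "の", "方法", "の", "圧縮", "の"]

table = canonical_codes(code_lengths(Counter(tokens)))
bits = "".join(table[t] for t in tokens)

# Toy ratio: encoded bits over 16 bits per original character.
original_bits = 16 * sum(len(t) for t in tokens)
print(table)
print(len(bits) / original_bits)
```

Because a canonical table is fully determined by the code lengths and the symbol order, a decoder only needs the per-word code lengths alongside the dictionary. This sketch assumes that property is what makes it practical to swap one language's dictionary and code table for another's.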
