Publication IEICE TRANSACTIONS on Fundamentals of Electronics, Communications and Computer SciencesVol.E85-ANo.12pp.2933-2938 Publication Date: 2002/12/01 Online ISSN: DOI: Print ISSN: 0916-8508 Type of Manuscript: PAPER Category: Information Theory Keyword: lossless, text compression, language, word-based,
Full Text: PDF(330.9KB)>>
Summary: 16-bit Asian language codes can not be compressed well by conventional 8-bit sampling text compression schemes. Previously, we reported the application of a word-based text compression method that uses 16-bit sampling for the compression of Japanese texts. This paper describes our further efforts in applying a word-based method with a static canonical Huffman encoder to both Japanese and Chinese texts. The method was proposed to support a multilingual environment, as we replaced the word-dictionary and the canonical Huffman code table for the respective language appropriately. A computer simulation showed that this method is effective for both languages. The obtained compression ratio was a little less than 0.5 without regarding the Markov context, and around 0.4 when accounting for the first order Markov context.