KR101023536B1

KR101023536B1 - Lossless data compression method

Info

Publication number: KR101023536B1
Application number: KR1020080069704A
Authority: KR
Inventors: 김형중
Original assignee: 고려대학교 산학협력단
Priority date: 2008-07-17
Filing date: 2008-07-17
Publication date: 2011-03-21
Also published as: KR20100009032A

Abstract

본 발명은 특정 심볼 빈도 수가 0으로 줄어들 때마다 허프만 테이블을 갱신함과 동시에 허프만 코딩 방법을 적용하여 데이터를 압축함으로써 압축효율을 높일 수 있는 데이터 무손실 압축 방법을 제시한다.The present invention proposes a data lossless compression method that can improve the compression efficiency by compressing data by applying the Huffman coding method while updating the Huffman table whenever a specific symbol frequency is reduced to zero.

Description

Data Lossless Compression Method {LOSSLESS DATA COMPRESSION METHOD}

본 발명은 데이터 무손실 압축 방법에 관한 것으로, 더욱 상세하게는 특정 심볼 빈도 수가 0으로 줄어들 때마다 허프만 테이블을 갱신함과 동시에 허프만 코딩 방법을 적용하여 데이터를 압축함으로써 압축효율을 높일 수 있는 데이터 무손실 압축 방법에 관한 것이다.The present invention relates to a data lossless compression method, and more particularly, to a data lossless compression that improves the compression efficiency by applying a Huffman coding method while updating a Huffman table whenever a specific symbol frequency is reduced to zero. It is about a method.

일반적으로, 데이터를 무손실 압축하는 방법으로는 엔트로피 코딩(Entropy coding) 방법이 가장 널리 사용되고 있으며, 데이터를 압축했다가 원래의 데이터와 완전히 동일하게 복원할 수 있는 압축 방법을 무손실 압축이라고 한다.In general, as a method of lossless compression of data, the entropy coding method is most widely used. A compression method that compresses data and restores the data exactly the same as the original data is called lossless compression.

상기 엔트로피 코딩방법은 출현빈도가 높은 심볼에 짧은 코드를 배정하고 빈도가 낮은 심볼에 긴 코드를 배정하므로, 가변길이코딩(Variable-length coding)이라고 불린다.The entropy coding method is called variable-length coding because a short code is assigned to a symbol with high occurrence frequency and a long code is assigned to a symbol with low frequency.

엔트로피 코딩에서는 심볼의 빈도만을 고려하는데, 대표적인 방법으로는 허프만(Huffman) 코딩이 사용된다.In entropy coding, only the frequency of symbols is taken into consideration. As a representative method, Huffman coding is used.

즉, 엔트로피 코딩은 미리 각 심볼의 출현 빈도를 알아야 최적의 압축이 가능하며, 허프만 코딩에서는 각각의 심볼에 하나씩 코드워드를 배정하므로 정수 길 이의 코드워드만 허용한다.In other words, entropy coding requires optimal knowledge of the appearance of each symbol beforehand, and Huffman coding allows only one codeword of integer length because one codeword is assigned to each symbol.

예를 들면, X₇ = {A A A B B C D}에 대하여, 종래의 허프만 테이블은 도 1을 기반으로 하여 그림 2와 같이 만들어진다.For example, for X ₇ = { AAABBCD }, a conventional Huffman table is created as shown in Figure 2 based on FIG.

도 1은 종래의 허프만 트리를 나타내는 도면이고, 도 2는 상기 도 1의 허프만 트리를 기반으로 하여 만들어진 허프만 테이블을 나타낸다.1 is a diagram illustrating a conventional Huffman tree, and FIG. 2 is a Huffman table created based on the Huffman tree of FIG. 1.

즉, 그림 2에 나타낸 허프만 테이블에 의해 X₇을 인코딩하면, 그 결과는 Y₇ = {000 10 10 110 111}이 되어 13비트가 필요하다.That is, if X ₇ is encoded by the Huffman table shown in Figure 2, the result is Y ₇ = {000 10 10 110 111}, which requires 13 bits.

여기서, A는 0으로, B는 10으로, C는 110으로 D는 111로 각각 변환된다.Here, A is converted to 0, B to 10, C to 110, and D to 111, respectively.

상기와 같이, 종래의 허프만 코딩은 일단 허프만 테이블이 만들어지면 코딩이 끝날 때까지 모든 심볼에 동일한 코드워드가 배정되고, 배정된 코드워드는 변하지 않는다.As described above, in the conventional Huffman coding, once the Huffman table is created, the same codeword is assigned to all symbols until the coding is completed, and the assigned codeword is not changed.

또한, 종래의 엔트로피 코딩은 압축의 한계에 거의 근접했고, 비록 압축효율을 더 높인다고 해도 무손실 압축의 한계인 엔트로피에 더 가까이 접근할 뿐 그 한계를 뛰어넘어 엔트로피 이하로 압축할 수 없는 문제점이 있다.In addition, the conventional entropy coding is almost close to the limit of compression, even if the compression efficiency is higher, there is a problem that can not be compressed to less than entropy beyond the limit to approach closer to the entropy of the limit of lossless compression.

즉, 엔트로피 이하로 압축할 수 있는 새로운 방법의 개발이 필요하다.In other words, there is a need to develop a new method that can compress below entropy.

본 발명은 상기와 같은 문제점을 해결하기 위한 것으로서, 그 목적은 특정 심볼 빈도 수가 0으로 줄어들 때마다 허프만 테이블을 갱신함과 동시에 허프만 코 딩 방법을 적용하여 데이터를 압축함으로써 압축효율을 높일 수 있는 데이터 무손실 압축 방법을 제공하는 것이다.SUMMARY OF THE INVENTION The present invention has been made to solve the above problems, and an object thereof is to update the Huffman table each time a specific symbol frequency is reduced to 0, and simultaneously compress data by applying the Huffman coding method to improve data compression efficiency. It is to provide a lossless compression method.

즉, 본 발명의 목적은, 코딩을 진행하면서 허프만 테이블을 갱신하여 심볼의 빈도 수를 변화시켜서, 변화된 빈도 수에 따라 허프만 테이블을 다시 만들어 압축효율을 높일 수 있는 데이터 무손실 압축 방법을 제공하는 것이다.That is, an object of the present invention is to provide a data lossless compression method that can improve the compression efficiency by updating the Huffman table while changing coding to change the frequency of symbols, and regenerating the Huffman table according to the changed frequency.

상기한 목적을 달성하기 위한 본 발명의 일 측면에서 압축을 위한 입력 심볼 열을 구성하는 심볼들의 종류별로의 빈도 수와, 코드 워드를 정의하는 허프만 테이블을 이용하여 데이터를 무손실 압축하는 방법은, 압축 대상 심볼 열을 구성하는 심볼들을 배열 순서에 따라 순차적으로 선택하고, 상기 선택한 심볼을 상기 허프만 테이블에서 상기 선택한 심볼의 종류에 대응하여 정의된 코드워드로 치환하는 과정과, 상기 압축 대상 심볼 열에서 특정 종류에 해당하는 심볼의 치환이 완료되면, 상기 허프만 테이블에서 상기 특정 종류에 해당하는 심볼을 제외한 나머지 종류의 심볼에 대한 코드워드의 비트 수를 1비트씩 감소시켜 상기 허프만 테이블을 새로 구성하는 과정과, 상기 압축 대상 심볼 열 내에 더 이상 선택할 심볼이 존재하지 않으면, 상기 치환하는 과정에 의해 생성한 코드워드 열을 무손실 압축된 심볼 열로 출력하는 과정을 포함한다.In one aspect of the present invention for achieving the above object, a method of lossless compression of data using a Huffman table defining a frequency and a frequency of each symbol constituting an input symbol string for compression, Sequentially selecting symbols constituting the target symbol string according to an arrangement order, replacing the selected symbol with a codeword defined corresponding to the type of the selected symbol in the Huffman table; When the substitution of the symbol corresponding to the type is completed, reconstructing the Huffman table by reducing the number of bits of codewords for the remaining symbols except for the symbol corresponding to the specific type by one bit in the Huffman table; If there are no more symbols to select in the compression target symbol column, the substitution Comprises the step of lossless compressed symbol column outputs the code words generated by the heat process.

바람직하게, 상기 압축 대상 심볼 열 내에서 연속하여 존재하는 심볼 종류별 개수를 카운트하고, 상기 카운트 값에 의해 각 심볼 종류별 빈도 수를 계산하며, 상기 계산된 각 심볼 종류별 빈도 수에 의해 초기 허프만 테이블을 구성하는 과정을 더 포함한다.Preferably, the number of symbol types consecutively present in the compression target symbol string is counted, the frequency count for each symbol type is calculated based on the count value, and an initial Huffman table is configured by the calculated frequency for each symbol type. The process further includes.

상기한 목적을 달성하기 위한 본 발명의 다른 측면에서 압축을 위한 입력 심볼 열을 구성하는 심볼들의 종류별로의 빈도 수와, 코드워드를 정의하는 허프만 테이블을 이용하여 데이터를 무손실 복원하는 방법은, 무손실 압축된 심볼 열에서 상기 허프만 테이블에 등록된 코드워드를 배열 순서에 따라 순차적으로 선택하고, 상기 선택한 코드워드를 상기 허프만 테이블에서 상기 선택한 코드워드에 대응하여 정의된 심볼로 치환하는 과정과, 상기 치환하는 과정에 의해 상기 허프만 테이블에서 특정 종류에 대해 정의하고 있는 빈도 수에 대응하는 특정 종류의 심볼에 대한 치환이 완료되면, 상기 허프만 테이블에서 상기 특정 종류의 심볼을 제외한 나머지 종류의 심볼에 대한 코드워드의 비트 수를 1비트씩 감소시켜 상기 허프만 테이블을 새로 구성하는 과정과, 상기 무손실 압축된 심볼 열 내에 더 이상 선택할 코드워드가 존재하지 않으면, 상기 치환하는 과정에 의해 생성한 심볼 열을 복원된 심볼 열로 출력하는 과정을 포함한다.In another aspect of the present invention for achieving the above object, a method of losslessly restoring data using a Huffman table defining a frequency and a type of symbols constituting an input symbol string for compression and a codeword is lossless. Sequentially selecting codewords registered in the Huffman table in a compressed symbol string according to an arrangement order, and replacing the selected codewords with symbols defined corresponding to the selected codewords in the Huffman table; When the substitution for the specific type of symbol corresponding to the frequency defined for the specific type in the Huffman table is completed, the codeword for the other types of symbols except for the specific type of symbol in the Huffman table Reconfigure the Huffman table by reducing the number of bits by 1 bit. And if the codeword to be selected no longer exists in the lossless compressed symbol string, outputting the symbol string generated by the replacing process as a restored symbol string.

바람직하게, 상기 허프만 테이블은 상기 복원된 심볼 열을 구성할 수 있는 심볼의 종류별 또는 이미 복원된 심볼을 제외한 나머지 심볼의 종류별로 빈도 수와 코드워드를 기록함을 특징으로 한다.Preferably, the Huffman table records frequency and codewords for each type of symbols constituting the restored symbol string or for each type of symbols except for symbols that have already been restored.

본 발명에 따른 데이터 무손실 압축 방법에 따르면, 코딩을 진행하면서 허프만 테이블을 갱신하여 심볼의 빈도 수를 변화시킨 후, 변화된 빈도 수에 따라 허프만 테이블을 다시 만들므로 압축효율을 높일 수 있는 효과가 있다.According to the data lossless compression method according to the present invention, since the Huffman table is changed by changing the frequency of symbols while coding, the Huffman table is re-created according to the changed frequency, thereby increasing the compression efficiency.

상기 목적 외에 본 발명의 다른 목적 및 이점들은 첨부한 도면을 참조한 실시 예에 대한 상세한 설명을 통하여 명백하게 드러나게 될 것이다.Other objects and advantages of the present invention in addition to the above object will be apparent from the detailed description of the embodiments with reference to the accompanying drawings.

이하, 첨부한 도면을 참조하여 본 발명에 따른 데이터 무손실 압축 방법을 상세하게 설명하기로 한다.Hereinafter, a data lossless compression method according to the present invention will be described in detail with reference to the accompanying drawings.

도 3은 본 발명의 일 실시 예에 따른 데이터 무손실 압축 및 복원하는 방법에 적용되는 시스템을 나타내는 구성도이다.3 is a block diagram illustrating a system applied to a method for compressing and restoring data lossless according to an embodiment of the present invention.

도 3에 나타낸 바와 같이, 본 발명의 일 실시 예에 따른 데이터 무손실 압축 및 복원 방법에 적용되는 시스템은, 입력 심볼 열(X_n)을 인코딩하는 인코더(100)와, 상기 인코더(100)를 통해 인코딩된 심볼 열을 디코딩하는 디코더(200)를 포함한다.As shown in FIG. 3, a system applied to a data lossless compression and decompression method according to an embodiment of the present invention includes an encoder 100 encoding an input symbol string X _n and an encoder 100. A decoder 200 for decoding the encoded symbol string.

구체적으로, 상기 인코더(100)는 입력되는 심볼 열(X_n)에 의해 작성되는 허프만 테이블(110)과, 상기 허프만 테이블(110)을 기반으로 상기 입력되는 심볼 열을 인코딩하여 출력하는 엔트로피 부호화부(120)를 포함한다.Specifically, the encoder 100 encodes and outputs the Huffman table 110 created by the input symbol string X _n and the input symbol string based on the Huffman table 110. 120.

한편, 디코더(200)는 상기 엔트로피 부호화부(120)로부터 출력되는 압축 심볼 열(Y_n)을 복호하기 위해 요구되는 정보를 저장하는 허프만 테이블(220)과, 상기 허프만 테이블(220)을 기반으로 상기 압축 심볼 열을 디코딩하여 출력하는 엔트로피 역부호화부(230)를 포함한다.On the other hand, the decoder 200 is based on the Huffman table 220 for storing the information required to decode the compressed symbol string (Y _n ) output from the entropy encoder 120, and based on the Huffman table 220 An entropy decoding unit 230 for decoding and outputting the compressed symbol string.

이때, 상기 인코더(100) 및 디코더(200)는 심볼의 정확한 빈도 수를 바탕으로 허프만 테이블을 갱신하는 구조를 갖는다.In this case, the encoder 100 and the decoder 200 have a structure of updating the Huffman table based on the exact frequency of the symbols.

한편, 상기와 같이 구성된 시스템을 통해 데이터 무손실 압축 및 복호하는 방법을 이하에 설명한다.On the other hand, a method for data lossless compression and decoding through the system configured as described above will be described below.

여기서, 인코더(100)의 허프만 테이블(110)에 입력되는 입력 심볼 열(X_n )은 m개의 소스 심볼 s₀, s₁, ..., s_m-1의 조합으로 구성되어 있고, 상기 입력 심볼 열(X_n )은 {x₀, x₁, ..., x_n-1}로 표현된다.Here, the input symbol string X _n input to the Huffman table 110 of the encoder 100 is composed of a combination of _m source symbols s ₀ , s ₁ , ..., s _m-1 , and the input The symbol string X _n is represented by {x ₀ , x ₁ , ..., x _n-1 }.

즉, 상기 입력 심볼 열(X_n )은 m개의 소스 심볼 s₀, s₁, ..., s_m-1의 조합에 의해 구성되므로, 상기 입력 심볼 열(X_n )의 총 심볼 수 (n)는 상기 소스 심볼 수 m보다는 크거나 같다. 따라서 상기 입력 심볼 열(X_n )을 구성하는 각각의 입력 심볼은 상기 소스 심볼들 중 하나에 대응하며, 상기 소스 심볼은 상기 입력 심볼 열(X_n ) 내에서 복수개가 존재할 수 있다.That is, since the input symbol string X _n is constituted by a combination of _m source symbols s ₀ , s ₁ , ..., s _m-1 , the total number of symbols _{n of} the input symbol string X n ) Is greater than or equal to the source symbol number m. Accordingly, each input symbol constituting the input symbol string X _n corresponds to one of the source symbols, and a plurality of source symbols may exist in the input symbol string X _n .

또한, 소스 심볼 집합은 S_m이며, 상기 Sm은 {s₀, s₁, ..., s_m-1}로 나타낼 수 있다. 그리고 압축할 대상 심볼 열 (X_n ) 내에서 소스 심볼 s _k 의 빈도는 n _k 로 나타내며, 심볼 x _i 에 해당하는 심볼의 개수는 N(x_i)로 나타낸다. 따라서, 입력 심볼 x _i 가 S_m의 한 원소이므로, s _k = x _j 라면 N(x_i) = n _k 가 된다.In addition, the source symbol set is S _m , and Sm may be represented by {s ₀ , s ₁ , ..., s _m-1 }. And the source symbols within a target symbol sequence (X _n) to compress the frequency of s _k is denoted by _k n, the number of symbols corresponding to symbol x _i is denoted by N (x _i). Therefore, since the input symbol x _i is an element of S _m , if s _k = x _j, then N (x _i ) = n _k .

한편, 인코더(100)의 엔트로피 부호화부(120)를 통해 출력되는 압축 심볼 열은 Y _n 으로 표현되며, 상기 압축 심볼 열(Y _n ) = {y₀, y₁, ..., y_n-1}을 구성하는 압축 심볼 y _i 는 소스 심볼 s _k 를 코드워드 c _k 로 매핑함에 의해 생성할 수 있다. 이때 코드워드 집합은 C_m이고, 상기 C_m은 {c₀, c₁, ..., c_m-1}로 나타낼 수 있다.Meanwhile, the compressed symbol string output through the entropy encoder 120 of the encoder 100 is represented by Y _n , and the compressed symbol string Y _n = (y ₀ , y ₁ , ..., y _{n − The} compressed symbol y _i constituting ₁ } may be generated by mapping the source symbol s _k to the codeword c _k . In this case, the codeword set is C _m , and C _m may be represented by {c ₀ , c ₁ , ..., c _m-1 }.

여기서, 상기 인코더(100)로 입력되는 입력 심볼 열(X_n )을 인코딩하는 방법을 도 4를 참조하여 설명한다.Here, a method of encoding the input symbol string X _n input to the encoder 100 will be described with reference to FIG. 4.

도 4는 본 발명의 일 실시 예에 따라 입력 심볼 열을 인코딩하는 방법을 나타내는 순서도이다.4 is a flowchart illustrating a method of encoding an input symbol string according to an embodiment of the present invention.

먼저, 입력 심볼 열((X_n ) = {x₀, x₁, ..., x_n-1}) 내에서 소스 심볼 별로의 빈도 수(n₀, n₁, ..., n_m-1)를 계산한다(S310).First, the frequency per source symbol (n ₀ , n ₁ , ..., n _{m- in the} input symbol column (( x _n ) = (x ₀ , x ₁ , ..., x _n-1 })) ₁ ) is calculated (S310).

다음에, 상기 S310 단계를 통해 계산된 소스 심볼 별로의 빈도 수(n₀, n₁, ..., n_m-1)를 기반으로 초기 허프만 테이블을 구성한다(S312). 상기 초기 허프만 테이블은 도 2에서 보여지는 바와 같이 각 소스 심볼 별로 빈도 수와 대응하는 코드워드로 구성됨을 확인할 수 있다.Next, an initial Huffman table is configured based on the frequency number n ₀ , n ₁ , ..., n _{m-1 for} each source symbol calculated through the step S310 (S312). As shown in FIG. 2, the initial Huffman table may be configured as a codeword corresponding to a frequency number for each source symbol.

이때, 변수(i)를 초기화한다(i = 0)(S314). 상기 변수 i는 상기 입력 심볼 열 내에서 압축할 대상 심볼을 지정하는 인덱스로써, 입력 심볼 열을 구성하는 심볼들의 순서에 따라 코드워드로의 치환을 위한 심볼을 순차적으로 선택하기 위해 사용된다.At this time, the variable i is initialized ( i = 0) (S314). The variable i is an index for designating a target symbol to be compressed in the input symbol string, and is used to sequentially select a symbol for substitution by a codeword according to a sequence of symbols constituting the input symbol string.

그 다음, 입력 심볼 x _i 가 소스 심볼 집합 내의 s _k 에 해당(즉, x _i = s _k )하면, 허프만 테이블로부터 해당 소스 심볼의 빈도 수 n _k 를 확인(즉, N(x _i ) = n _k )한다(S316).Then, if the input symbol x _i corresponds to s _k in the source symbol set (that is, x _i = s _k ), then check the frequency number n _k of that source symbol from the Huffman table (that is, N (x _i ) = n _k ) (S316).

그리고 허프만 테이블에서 상기 소스 심볼 s _k 에 대응하여 등록된 코드워드 c _k 를 획득한 후, 상기 입력 심볼 x _i 를 상기 획득한 c _k 로 치환하여 압축 심볼 y _i 를 출력(즉, y _i = c _k )한다(S318).After acquiring the codeword c _k registered corresponding to the source symbol s _k in the Huffman table, the input symbol x _i is replaced with the obtained c _k to output the compressed symbol y _i (that is, y _i = c). _k ) (S318).

이후, 상기 입력 심볼 x _i 에 대한 압축이 이루어짐에 따라 상기 입력 심볼 열에서 앞서 확인한 s _k 의 빈도 수 n _k 가 감소하였으므로, 상기 입력 심볼 열에서 상기 s _k 의 개수를 의미하는 N(x _i )를 1 감소시킨다 (S320).
그리고 상기 S320 단계의 결과에 의해 N(x _i )가 0인지를 판단한다 (S322). 상기 N(x _i )가 0이라는 것은 입력 심볼 열 내에 존재하는 s _k 와 동일한 값을 가지는 모든 심볼에 대한 압축이 완료되었음을 의미한다.
따라서 상기 N(x _i )가 0이라고 판단되면, 기존의 허프만 테이블 내에서 해당 소스 심볼 s _k 를 제거하고, 나머지 소스 심볼들의 빈도 수를 고려하여 코드워드를 새로이 정의하는 허프만 테이블을 새로 구성한다(S324).Thereafter, as the compression of the input symbol x _i is performed, the frequency n _k of the previously identified s _k in the input symbol string decreases, so that N (x _i ) means the number of s _k in the input symbol string. Reduces 1 (S320).
Then, it is determined whether N (x _i ) is 0 based on the result of step S320 (S322). When N (x _i ) is 0, compression of all symbols having the same value as s _k existing in the input symbol string is completed.
Therefore, when N (x _i ) is determined to be 0, the Huffman table is newly formed by removing the corresponding source symbol s _k from the existing Huffman table, and newly defining a codeword in consideration of the frequency of the remaining source symbols. S324).

상기 N(x _i )가 0이 아니라고 판단되거나 새로운 허프만 테이블이 구성되면, 변수(i)를 가산(i=i+1)한다 (S326). 그 후, 변수(i)가 총 심볼 수(n)와 동일한지를 통해 입력 심볼 열에 대한 압축이 완료되었는지를 판단(S328)한다. 만약 압축이 완료되었다면 상기 S318에서 획득한 압축 심볼들에 의한 압축 심볼 열 (Y _n = {y₀, y₁, ..., y _n-1 })을 출력한다 (S330). 하지만 압축이 완료되지 않았다면, 상기 S316 단계로 리턴한다.If it is determined that N (x _i ) is not zero or a new Huffman table is configured, the variable i is added (i = i + 1) (S326). Thereafter, it is determined whether or not the compression for the input symbol string is completed based on whether the variable i is equal to the total number of symbols n (S328). If the compression is completed, a compressed symbol string (Y _n = {y ₀ , y ₁ , ..., y _n-1 }) based on the compressed symbols obtained in S318 is output (S330). However, if the compression is not completed, return to the step S316.

한편, 상기 디코더(200)를 통해 상기 인코더(100)로부터 출력되는 압축 심볼 열(Y _n )을 디코딩하는 방법을 도 5를 참조하여 설명한다.Meanwhile, a method of decoding the compressed symbol string Y _n output from the encoder 100 through the decoder 200 will be described with reference to FIG. 5.

도 5는 본 발명의 일 실시 예에 따른 출력 심볼 열을 디코딩하는 방법을 나타내는 순서도이다.5 is a flowchart illustrating a method of decoding an output symbol string according to an embodiment of the present invention.

먼저, 입력 심볼 열((X_n ) = {x₀, x₁, ..., x_n-1}) 내에서 소스 심볼 별로의 빈도 수(n₀, n₁, ..., n_m-1)를 계산한다(S410).First, the frequency per source symbol (n ₀ , n ₁ , ..., n _{m- in the} input symbol column (( x _n ) = (x ₀ , x ₁ , ..., x _n-1 })) ₁ ) is calculated (S410).

다음에, 상기 S410 단계를 통해 계산된 소스 심볼 별로의 빈도 수(n₀, n₁, ..., n_m-1)를 기반으로 초기 허프만 테이블을 구성한다(S412). 상기 초기 허프만 테이블은 도 2에서 보이고 있는 바와 같이 각 소스 심볼 별로 빈도 수와 대응하는 코드워드로 구성됨을 확인할 수 있다.Next, an initial Huffman table is configured based on the frequency number n ₀ , n ₁ , ..., n _{m-1 for} each source symbol calculated through the step S410 (S412). As shown in FIG. 2, the initial Huffman table may include a codeword corresponding to a frequency number for each source symbol.

그 다음, 압축 심볼 열(Y _n = {y₀, y₁, ..., y _n-1 })에 대하여 디코딩을 시작한다.Then, decoding starts on the compressed symbol string (Y _n = {y ₀ , y ₁ , ..., y _n-1 }).

이때, 변수(i)를 초기화한다(i = 0)(S414). 상기 변수 i는 상기 압축 심볼 열 내에서 복원할 대상 심볼을 지정하는 인덱스이다.At this time, the variable i is initialized ( i = 0) (S414). The variable i is an index specifying a target symbol to be restored in the compressed symbol string.

다음에, 압축 심볼 y _i 가 소스 심볼 집합 내의 s _k 에 해당(즉, y _i = s _k 또는 x _i = s _k )하면, 허프만 테이블로부터 해당 소스 심볼의 빈도 수 n _k 를 확인(즉, N(x _i ) = n _k )한다(S416).Next, if the compressed symbol y _i corresponds to s _k in the set of source symbols (that is, y _i = s _k or x _i = s _k ), then check the frequency number n _k of that source symbol from the Huffman table (that is, N (x _i ) = n _k ) (S416).

그리고 허프만 테이블에서 상기 압축 심볼 y _i 에 상응하는 해당 코드워드 c _k 에 대응하여 등록된 심볼 x _i 를 획득한 후, 상기 압축 심볼 y _i 를 상기 획득한 심볼 x _i 로 치환하여 출력(즉, x _i = s _k )한다(S418).And by substituting in the Huffman table as the compressed symbol y _i by the corresponding acquisition of the code symbol x _i registered in correspondence to the word c _k, which then, the symbol x _i by the acquiring the compressed symbol y _i (namely, x _i = s _k ) (S418).

이후, 상기 압축 심볼 y _i 에 대한 복원이 이루어짐에 따라 앞서 확인한 s _k 의 빈도 수 n _k 가 감소하였으므로, 출력할 심볼 열에서 상기 s _k 의 개수를 의미하는 N(x _i )를 1 감소시킨다 (S420).
그리고 상기 S420 단계의 결과에 의해 N(x _i )가 0인지를 판단한다 (S422). 상기 N(x _i )가 0이라는 것은 출력할 심볼 열 내에 존재하는 s _k 와 동일한 값을 가지는 모든 심볼에 대한 복원이 완료되었음을 의미한다.
따라서 상기 N(x _i )가 0이라고 판단되면, 기존의 허프만 테이블 내에서 해당 소스 심볼 s _k 를 제거하고, 나머지 소스 심볼들의 빈도 수를 고려하여 코드워드를 새로이 정의하는 허프만 테이블을 새로 구성한다(S424).Thereafter, as the compression symbol y _i is restored, the frequency n _k of the previously identified s _k decreases, so that N (x _i ) representing the number of s _k in the symbol string to be output is decreased by 1 ( S420).
Then, it is determined whether N (x _i ) is 0 based on the result of step S420 (S422). When N (x _i ) is 0, the restoration of all symbols having the same value as s _k existing in the symbol string to be output is completed.
Therefore, when N (x _i ) is determined to be 0, the Huffman table is newly formed by removing the corresponding source symbol s _k from the existing Huffman table, and newly defining a codeword in consideration of the frequency of the remaining source symbols. S424).

상기 N(x _i )가 0이 아니라고 판단되거나 새로운 허프만 테이블이 구성되면, 변수(i)를 가산(i=i+1)한다 (S326). 그 후 변수(i)가 총 심볼 수(n)와 동일한지를 통해 압축 심볼 열에 대한 복원이 완료되었는지를 판단(S428)한다. 만약 복원이 완료되었다면 상기 S318에서 획득한 압축 심볼들에 의한 압축 심볼 열 (X _n = {x₀, x₁, ..., x _n-1 })을 출력한다 (S430). 하지만 복원이 완료되지 않았다면, 상기 S416단계로 리턴한다.If it is determined that N (x _i ) is not zero or a new Huffman table is configured, the variable i is added (i = i + 1) (S326). Thereafter, it is determined whether the restoration of the compressed symbol string is completed through whether the variable i is equal to the total number of symbols n (S428). If the restoration is completed, a compressed symbol string (X _n = {x ₀ , x ₁ , ..., x _n-1 }) based on the compressed symbols obtained in S318 is output (S430). However, if the restoration is not completed, the flow returns to step S416.

한편, 이하에서는 상술한 인코딩 및 디코딩 방법의 일 예를 설명한다.Meanwhile, an example of the above-described encoding and decoding method will be described.

여기서, 일 예의 허프만 테이블은 도 2의 허프만 테이블을 동일하게 적용하여 사용하기로 하며, 상기 도 2의 허프만 테이블은 인코더(100)에 저장되어 있다.Here, the Huffman table of the example will be used by applying the Huffman table of FIG. 2, and the Huffman table of FIG. 2 is stored in the encoder 100.

먼저, 도 2의 허프만 테이블에 의해 세 개의 심볼 A를 3비트의 0으로 변환한다. 따라서 심볼 A를 세 번 변환하면, A의 빈도 수는 0이 된다.First, three symbols A are converted into three bits of zeros by the Huffman table of FIG. Therefore, if symbol A is converted three times, the frequency of A becomes zero.

따라서, 입력 심볼 열은 X₄ = {B B C D}와 같이 되며, 이때 상기 입력 심볼 열 X₄를 구성하는 심볼 별로의 빈도 수를 기준으로 새로이 만들어진 허프만 테이블은 아래의 표 1과 같다.Accordingly, the input symbol string becomes X ₄ = {BBCD}, and the Huffman table newly created based on the frequency count for each symbol constituting the input symbol string X ₄ is shown in Table 1 below.

심볼symbol 빈도frequency 코드워드Codeword S₀ S ₀ AA 00 c₀ c ₀ S₁ S ₁ BB 22 c₁ c ₁ 00 S₂ S ₂ CC 1One c₂ c ₂ 1010 S₃ S ₃ DD 1One c₃ c ₃ 1111

상기 표 1에서 보이고 있는 허프만 테이블에 의하면, 심볼 B에 대해서는 코드워드로 1비트인 0을, C에 대해서는 코드워드로 2비트인 10을, D에 대해서는 코드워드로 2비트인 11을 배정할 수 있게 된다. 즉, 코드워드를 구성하는 비트들이 도 2에 정의된 비트들에 비해 한 비트씩 감소한다.According to the Huffman table shown in Table 1, symbol B can be assigned 0, which is 1 bit as a codeword, 10 as 2 bits as a codeword for C, and 11 which is 2 bits as a codeword for D. Will be. That is, the bits constituting the codeword are reduced by one bit compared to the bits defined in FIG.

따라서, B에 대응하는 빈도 수가 2이므로, 두 개의 B 각각에 대해 1비트인 0을 배정한다.Therefore, since the frequency corresponding to B is 2, 0, which is 1 bit, is assigned to each of the two Bs.

상기 표 1을 기반으로 하여 만든 허프만 트리는 도 6과 같다.The Huffman tree created based on Table 1 is shown in FIG. 6.

상기 두 개의 B 각각을 1비트인 0으로 변환하면, 상기 B의 빈도 수가 0으로 줄어든다.When each of the two Bs is converted to 0, which is 1 bit, the frequency of the B is reduced to zero.

따라서, 빈도 수가 0인 심볼이 나타났으므로 허프만 테이블을 다시 만든다. 즉 입력 심볼 열은 X₂ = {C D}와 같이 되며, 이때 상기 입력 심볼 열 X₂를 구성하는 심볼 별로의 빈도 수를 기준으로 새로이 만들어진 허프만 테이블은 아래의 표 2와 같다.Thus, a symbol with a frequency of zero appears, so we rebuild the Huffman table. That is, the input symbol string becomes X ₂ = {CD}, and the newly generated Huffman table based on the frequency for each symbol constituting the input symbol string X ₂ is shown in Table 2 below.

심볼symbol 빈도frequency 코드워드Codeword S₀ S ₀ AA 00 c₀ c ₀ S₁ S ₁ BB 00 c₁ c ₁ S₂ S ₂ CC 1One c₂ c ₂ 00 S₃ S ₃ DD 1One c₃ c ₃ 1One

상기 표 2에서 보이고 있는 허프만 테이블에 의하면, 심볼 C에 대해서는 코드워드로 1비트인 0을, D에 대해서는 코드워드로 1비트인 1을 배정할 수 있게 된다. 즉, 코드워드를 구성하는 비트들이 표 1에 정의된 비트들에 비해 다시 한 비트씩 감소한다.According to the Huffman table shown in Table 2, the symbol C can be assigned 0, which is 1 bit in the codeword, and 1, which is 1 bit in the codeword. That is, the bits constituting the codeword decrease by one bit as compared with the bits defined in Table 1.

따라서, 심볼 C에 대응하는 빈도 수가 1이므로, 하나의 C에 대해 1비트인 0을 배정한다.Therefore, since the frequency corresponding to the symbol C is 1, 0, which is 1 bit, is assigned to one C.

상기 표 2를 기반으로 하여 만든 허프만 트리는 도 7과 같다.The Huffman tree created based on Table 2 is shown in FIG. 7.

상기한 동작에 의해 빈도 수가 0인 심볼이 나타냈으므로, 허프만 테이블을 다시 만든다. 즉 입력 심볼 열은 X₁ = {D}와 같이 되며, 이때 상기 입력 심볼 열 X₁를 구성하는 심볼의 빈도 수를 기준으로 새로이 만들어진 허프만 테이블은 아래의 표 3과 같다.Since the above operation showed a symbol having a frequency of 0, the Huffman table is recreated. That is, the input symbol string becomes X ₁ = {D}, and the Huffman table newly created based on the frequency of the symbols constituting the input symbol string X ₁ is shown in Table 3 below.

심볼symbol 빈도frequency 코드워드Codeword S₀ S ₀ AA 00 c₀ c ₀ S₁ S ₁ BB 00 c₁ c ₁ S₂ S ₂ CC 00 c₂ c ₂ S₃ S ₃ DD 1One c₃ c ₃ 00

따라서, 상술한 인코딩 방법으로 만든 코드워드는 Y₇ = {0 0 0 0 0 0 0}이 된다. 즉, 7개의 입력 심볼에 대한 압축을 통해 출력되는 압축 심볼 열은 7비트만으로 구성된다. 이때 상기 압축 심볼 열을 표현하는 7비트는 모두 0이므로 추가의 압축이 더 가능하다.Therefore, the codeword made by the above-described encoding method is Y ₇ = {0 0 0 0 0 0 0}. That is, the compressed symbol string output through compression on seven input symbols includes only 7 bits. In this case, since 7 bits representing the compressed symbol string are all 0, further compression is possible.

한편, 상술한 방법으로 인코딩된 심볼 열을 디코딩하는 방법은 다음과 같다.Meanwhile, a method of decoding a symbol string encoded by the above-described method is as follows.

여기서, 일 예의 허프만 테이블은 도 2의 허프만 테이블을 동일하게 적용하여 사용하기로 하며, 상기 도 2의 허프만 테이블은 디코더(200)에 저장되어 있다.Here, the Huffman table of the example will be used by applying the Huffman table of FIG. 2, and the Huffman table of FIG. 2 is stored in the decoder 200.

먼저, 디코더(200)는 압축 심볼 열 Y₇ = {0 0 0 0 0 0 0}에 대하여 디코딩을 시작한다. 도 2를 참조하면, 최초의 0은 심볼 A에 해당된다. 또한 허프만 테이블에 의하면 상기 심볼 A에 대응하는 빈도 수는 3임을 확인할 수 있다.First, the decoder 200 starts decoding on the compressed symbol string Y ₇ = {0 0 0 0 0 0 0}. Referring to FIG. 2, the first zero corresponds to symbol A. In addition, the Huffman table confirms that the frequency corresponding to the symbol A is three.

따라서, 상기 압축 심볼 열에서 최초의 0부터 연속하는 3개의 0을 대신하여 3개의 심볼 A를 연속으로 출력한다. 상기 3개의 심볼 A를 연속하여 출력한 후 상기 심볼 A의 빈도 수는 0이 된다.
빈도 수가 0인 심볼이 발생하였음에 따라, 나머지 심볼 별 빈도 수를 고려하여 허프만 테이블을 다시 만들면 상기 표 1과 같이 된다. 이때 상기 새로 만들어진 허프만 테이블에서 나머지 심볼 별로 부여된 코드워드는 이전 허프만 테이블에서의 코드워드에 비해 1비트씩이 감소하였음을 확인할 수 있다.
그 다음의 코드워드 0은 심볼 B에 해당함을 알 수 있다. 또한, 상기 새로 만들어진 허프만 테이블에 의하면 상기 심볼 B에 대응한 빈도 수가 2이므로, 연속하여 0의 값을 가지는 코드워드 2개를 두 개의 심볼 B로 출력한다. 따라서 심볼 B의 빈도 수는 0이 된다.
그리고 빈도 수가 0인 심볼이 발생하였음에 따라, 나머지 심볼 별 빈도 수를 고려하여 다시 허프만 테이블을 만들면 상기 표 2와 같이 된다.Therefore, in the compressed symbol string, three symbols A are successively output instead of three consecutive zeros. After outputting the three symbols A continuously, the frequency of the symbol A becomes zero.
As a symbol having a frequency of 0 is generated, the Huffman table is reconstructed in consideration of the remaining frequency of each symbol, as shown in Table 1 above. In this case, it can be seen that the codewords assigned to the remaining symbols in the newly created Huffman table are reduced by one bit compared to the codewords in the previous Huffman table.
It can be seen that the next codeword 0 corresponds to symbol B. In addition, according to the newly generated Huffman table, since the frequency corresponding to the symbol B is 2, two codewords having a value of 0 are successively output as two symbols B. Therefore, the frequency of symbol B becomes zero.
As a symbol having a frequency of 0 is generated, when the Huffman table is again created in consideration of the remaining frequency of each symbol, it becomes as shown in Table 2 above.

상술한 과정을 반복하면 정확히 원래의 심볼 열 X₇ = {A A A B B C D}이 복원된다.Repeating the above process restores exactly the original symbol string X ₇ = { AAABBCD }.

즉, 본 발명에 따른 데이터 무손실 압축 방법을 통해 제시한 방법은 이미 코딩한 심볼은 제외시키고 앞으로 코딩할 심볼을 이용하여 테이블을 만든다.That is, the method proposed through the data lossless compression method according to the present invention excludes the coded symbols and creates a table using the symbols to be coded in the future.

한편, 종래의 적응형 허프만 코딩의 비트율을 R_A(X)로 나타낼 때 R_H(X_n) ≤ R_A(X_n)의 관계가 성립한다. 즉, 적응형 허프만 코딩방법은 비트율에서 어느 정도의 감소를 감수하면서 알지 못하는 확률분포를 감안해 최고의 성능을 달성하기 위해 사용한다.On the other hand, when the bit rate of the conventional adaptive Huffman coding is represented by R _A (X), a relationship of R _H (X _n ) ≦ R _A (X _n ) is established. That is, the adaptive Huffman coding method is used to achieve the best performance in consideration of unknown probability distribution while taking some reduction in bit rate.

그러나, 본 발명에서 제시한 방법의 비트율을 R_K(X_n)으로 정의할 때 R_K(X_n) ≤ R_H(X_n)의 관계가 형성된다.However, the relationship between _{_{R K (X n) ≤ R}} H (X n) is formed to define the bit rate of the method proposed in the present invention as R _K (X _n).

참고로, 상기에서 예를 든 X₇ = {A A A B B C D}에 대해 코딩할 경우 엔트로피 H(X₇)는 1.8424 비트/심볼이다.For reference, when coding for the above example X ₇ = { AAABBCD }, entropy H (X ₇ ) is 1.8424 bits / symbol.

이에 비해, 종래의 허프만 코딩의 결과, 비트율 R_H(X₇)는 1.8575 비트/심볼이 된다. 이때의 출력은 Y = {0 0 0 10 10 110 111}가 된다. 즉, 7개의 입력 알파벳에 대한 출력은 13비트가 된다.In comparison, as a result of conventional Huffman coding, the bit rate R _H (X ₇ ) becomes 1.8575 bits / symbol. The output at this time is Y = {0 0 0 10 10 110 111}. In other words, the output for the seven input alphabets is 13 bits.

그러나, 본 발명에서 제시한 방법으로 코딩하면 출력은 Y₇ = {0 0 0 0 0 0 0}으로 7비트이면 충분하다.However, when coding with the method proposed in the present invention, the output is 7 bits with Y ₇ = {0 0 0 0 0 0 0}.

따라서, R_K(X₇)는 1비트/심볼에 불과하며 엔트로피 이하로도 압축이 가능함을 개시한다. 참고로, 3개의 심볼 A, 2개의 심볼 B, 1개의 심볼 C, 1개의 심볼 D로 구성된 입력에 대한 다양한 출력은 아래의 표 4와 같다.Therefore, R _K (X ₇ ) discloses that only 1 bit / symbol can be compressed even below entropy. For reference, various outputs of the input consisting of three symbols A, two symbols B, one symbol C, and one symbol D are shown in Table 4 below.

입력input 본 발명의 출력Output of the present invention 비트율Bit rate 종래의 코딩 출력Conventional coding output 비트율Bit rate AAABBCDAAABBCD 00000000000000 1.00001.0000

0001010110110

0001010110110

1.8571

1.8571

ABBAACDABBAACD 010100000010100000 1.28571.2857 BBAAACDBBAAACD 101000000101000000 1.28571.2857 CBAABADCBAABAD 1101000100111010001001 1.57141.5714 CDBBAAACDBBAAA 11011110001101111000 1.42861.4286

상기 표 4에 나타낸 바와 같이, 비트율이 입력 스트링의 종류에 따라 달라지지만, 어느 경우나 엔트로피보다도 낮다는 것을 알 수 있다.As shown in Table 4 above, although the bit rate varies depending on the type of the input string, it can be seen that it is lower than entropy in any case.

이상에서는 본 발명의 일 실시 예에 따라 본 발명을 설명하였지만, 본 발명이 속하는 기술분야에서 통상의 지식을 가진 자가 본 발명의 기술적 사상을 벗어나지 않는 범위 내에서 변경 및 변형한 것도 본 발명에 속함은 당연하다.Although the present invention has been described above according to an embodiment of the present invention, a person having ordinary skill in the art to which the present invention pertains has been changed and modified without departing from the technical spirit of the present invention. Of course.

도 1은 도 1은 종래의 허프만 트리를 나타내는 도면.1 is a diagram illustrating a conventional Huffman tree.

도 2는 상기 도 1의 허프만 트리를 기반으로 하여 만들어진 허프만 테이블을 나타낸 도면.FIG. 2 illustrates a Huffman table based on the Huffman tree of FIG. 1. FIG.

도 3은 본 발명의 일 실시 예에 따른 데이터 무손실 압축 방법에 적용되는 시스템을 나타내는 구성도.3 is a block diagram showing a system applied to the data lossless compression method according to an embodiment of the present invention.

도 4는 본 발명의 일 실시 예에 따른 입력 심볼열을 인코딩하는 방법을 나타내는 순서도.4 is a flowchart illustrating a method of encoding an input symbol string according to an embodiment of the present invention.

도 5는 본 발명의 일 실시 예에 따른 출력 심볼열을 디코딩하는 방법을 나타내는 순서도.5 is a flowchart illustrating a method of decoding an output symbol string according to an embodiment of the present invention.

도 6은 표 1을 기반으로 하여 만든 허프만 트리를 나타내는 도면.6 shows a Huffman tree created based on Table 1;

도 7은 표 2를 기반으로 하여 만든 허프만 트리를 나타내는 도면.7 shows a Huffman tree created based on Table 2. FIG.

*도면의 주요부분에 대한 부호의 설명** Description of the symbols for the main parts of the drawings *

100: 인코더 110: 허프만 테이블100: encoder 110: Huffman table

120: 엔트로피 부호화부 200: 디코더120: entropy encoding unit 200: decoder

220: 허프만 테이블 230: 엔트로피 역부호화부220: Huffman table 230: entropy decoding unit

Claims

A method of losslessly compressing data using a Huffman table defining a frequency and a codeword for each type of symbols constituting an input symbol string for compression,

Sequentially selecting symbols constituting the compression target symbol sequence according to an arrangement order, replacing the selected symbols with codewords defined corresponding to the type of the selected symbol in the Huffman table;

When the substitution of a symbol corresponding to a specific type in the compression target symbol string is completed, the Huffman is reduced by one bit in the Huffman table by decreasing the number of bits of codewords for the symbols of the other types except for the symbol corresponding to the specific type by one bit. Reorganizing the table,

And if a symbol to be selected no longer exists in the symbol string to be compressed, outputting a codeword string generated by the substituting process as a lossless compressed symbol string.

The method of claim 1,

Counting the number of symbol types consecutively present in the compression target symbol string, calculating the frequency for each symbol type based on the count value, and constructing an initial Huffman table based on the calculated frequency for each symbol type Data lossless compression method further comprising.

A method for losslessly restoring data by using a frequency count for each type of symbols constituting an input symbol string for compression and a Huffman table defining a codeword,

Sequentially selecting codewords registered in the Huffman table in the lossless compressed symbol string according to an arrangement order, and replacing the selected codewords with symbols defined corresponding to the selected codewords in the Huffman table;

When the substitution for the specific type of symbol corresponding to the frequency defined for the specific type in the Huffman table is completed by the substituting process, for the other types of symbols except for the specific type of symbol in the Huffman table Reconfiguring the Huffman table by reducing the number of bits of a codeword by one bit;

And if the codeword to be selected is no longer present in the lossless compressed symbol string, outputting the symbol string generated by the substituting process as a restored symbol string.

The method of claim 3, wherein

And the Huffman table records frequency numbers and codewords for each type of symbols constituting the restored symbol string or for each type of symbols except for symbols that have already been restored.