WO2014147671A1 - 圧縮装置、圧縮方法、伸張装置、伸張方法および情報処理システム - Google Patents
圧縮装置、圧縮方法、伸張装置、伸張方法および情報処理システム Download PDFInfo
- Publication number
- WO2014147671A1 WO2014147671A1 PCT/JP2013/001976 JP2013001976W WO2014147671A1 WO 2014147671 A1 WO2014147671 A1 WO 2014147671A1 JP 2013001976 W JP2013001976 W JP 2013001976W WO 2014147671 A1 WO2014147671 A1 WO 2014147671A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- compression
- delimiter
- code
- unit
- Prior art date
Links
Images
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
- G06F16/1744—Redundancy elimination performed by the file system using compression, e.g. sparse files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/146—Coding or compression of tree-structured data
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3068—Precoding preceding compression, e.g. Burrows-Wheeler transformation
- H03M7/3071—Prediction
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3084—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
- H03M7/3088—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing the use of a dictionary, e.g. LZ78
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/70—Type of the data to be coded, other than image and sound
- H03M7/705—Unicode
Definitions
- a space symbol included in a character string constituting a document indicates a word break that is a unit constituting the document.
- one compression code is assigned to a word including a plurality of characters, while a compression code is also assigned to a space symbol. Since a compression code is assigned to a space symbol as well as a word, the number of compression codes used for compression increases, which causes a reduction in compression rate.
- An object of one aspect of the present invention is to improve a compression rate in compression of data in which a delimiter between units constituting data is indicated by a symbol.
- the compression apparatus corresponds to a combination of a first element that is one of elements constituting data and a first delimiter that indicates a delimiter between the elements in the data.
- a storage unit that stores one compression code in association with the first element, and a compression code that is stored in association with the first element read from the data is acquired from the storage unit.
- An acquisition unit; and a writing unit that writes the acquired compressed code in a storage area for storing the compressed data of the data.
- the decompressing device combines a combination of a first element that is one of elements constituting data and a first delimiter that indicates a delimiter between the elements in the data into one A storage unit that stores the compressed code in association with the storage unit; an acquisition unit that acquires the combination corresponding to the compressed code read from the compressed data obtained by compressing the data from the storage unit; And a writing unit for writing both the first element and the first delimiter included in the combination into a storage area for decompressed data obtained by decompressing the compressed data.
- an information processing system includes a storage device and an information processing device, and the information processing device is a delimiter between the elements in the data and a first element that is one of elements constituting data.
- a storage unit that stores a conversion dictionary in which a combination of delimiters indicating one compression code is associated, a reception unit that receives compressed data obtained by compressing data from the storage device, and the compressed data
- a first acquisition unit that acquires the combination corresponding to the compression code read from the conversion dictionary, and both the first element and the delimiter included in the acquired combination.
- a first writing unit for writing in a first storage area of decompressed data obtained by decompressing data, and performing information processing on the decompressed data written in the first storage area A second acquisition unit that acquires, from the conversion dictionary, the compressed code stored in association with the first element read from the decompressed data subjected to the information processing; A first writing unit that writes the acquired compressed code in a second storage area that stores compressed data of the data, and the compressed data written in the second storage area are transmitted to the storage unit And a transmission unit.
- the computer-readable storage medium stores one compression code corresponding to a combination of a delimiter indicating a delimiter between elements in data and one data element delimited by the delimiter.
- Compression dictionary data having a structure including information for compression processing associated with each type of data element, wherein the one compression code is referred to based on the one data element in a compression process by a computer
- the compressed dictionary data is stored.
- a computer-readable storage medium includes a combination of a delimiter indicating a delimiter between elements in data and one data element delimited by the delimiter as one compression code corresponding to the combination.
- the decompression dictionary data having a structure including the decompression processing information associated with each data element type, and in the decompression processing by the computer, the decompression in which the combination is referred to based on the one compression code Store dictionary data.
- the compression efficiency of document data using space symbols can be improved.
- FIG. 1 shows an example of a compression dictionary.
- FIG. 2 shows an example of conversion to a compression code.
- FIG. 3 shows an example of conversion into decompressed data.
- FIG. 4 shows an example of the functional block configuration.
- FIG. 5 shows a processing procedure example of the compression function.
- FIG. 6 shows an example of a processing procedure for generating a compression dictionary.
- FIG. 7 shows a processing procedure example of statistical processing.
- FIG. 8 shows an example of the data structure of the statistical table T1.
- FIG. 9 shows a processing procedure example of the compressed data generation processing.
- FIG. 10 shows a processing procedure example of the compressed data generation processing.
- FIG. 11 shows an example of the data structure of the compression dictionary.
- FIG. 12 shows a processing procedure example of the decompression function.
- FIG. 13 shows an example of a processing procedure for decompression dictionary generation.
- FIG. 14 shows a processing procedure example of the decompressed data generation processing.
- FIG. 15 shows an example of the data structure of the expansion dictionary.
- FIG. 16 shows a hardware configuration example of the computer 1.
- FIG. 17 shows an example of a program configuration that runs on the computer 1.
- FIG. 18 shows a system configuration example using the computer 1.
- FIG. 19 shows a system configuration example using the computer 1.
- FIG. 20 shows an example of the data structure of the compression dictionary.
- FIG. 21 shows an example of the data structure of the expansion dictionary.
- FIG. 22 shows an example of a data structure in the csv format.
- FIG. 23 shows an example of the data structure of the compression dictionary.
- FIG. 24 shows an example of the data structure of the expansion dictionary.
- FIG. 1 shows an example of a compression dictionary.
- the compression dictionary D0 includes a symbol compression dictionary D01 and a symbol string compression dictionary D02.
- the symbol compression dictionary D01 indicates a correspondence relationship between a character code indicating a symbol such as a character or a number and a compression code.
- the symbol string compression dictionary D02 indicates a correspondence relationship between a character code string indicating a symbol string such as a word or a tag and a compression code.
- the compression dictionary D1 includes a symbol compression dictionary D11, a symbol string compression dictionary D12, and a control symbol compression dictionary D13.
- the symbol compression dictionary D11 indicates a correspondence relationship between a character code indicating a symbol such as a character or a number and a compression code.
- the symbol string compression dictionary D12 indicates a correspondence relationship between a character code string indicating a combination of a symbol string such as a word or a tag and a space symbol and a compression code.
- the control symbol compression dictionary D13 indicates a correspondence relationship between a compression code and a character code string indicating a combination of a control symbol and a delimiter to delete the immediately preceding space symbol.
- the compression code included in the compression dictionary D0 is indicated by “c ′ ()”.
- c ′ () a symbol or a symbol string corresponding to the compression code is shown in parentheses.
- the compression code included in the compression dictionary D1 is indicated by “c ()”.
- c () a symbol or a symbol string corresponding to the compression code is shown in parentheses.
- a compression code corresponding to “a” is indicated as “compression code c (a)”
- the compression code corresponding to “about ⁇ ” is indicated as “compression code c (about ⁇ )”.
- the space symbol is a symbol indicated by 0x20 in the ASCII code system, and is indicated as “ ⁇ ” in the description of the present embodiment.
- control symbol for deleting the space symbol expanded immediately before is indicated as “[- ⁇ ]”.
- the compression code c ([- ⁇ ];) corresponds to the combination of the control symbol [- ⁇ ] and the delimiter “;”.
- the symbol string compression dictionary D02 in the compression dictionary D0 associates a compression code with each symbol string and space symbol
- the symbol string compression dictionary D12 in the compression dictionary D1 is not a symbol string alone but a symbol string.
- One compression code is associated with a combination of a space symbol.
- the compression code in the symbol string compression dictionary D12 and the compression code in the control symbol compression dictionary D13 are combined and encoded. Is done.
- the compression code registered in the control symbol compression dictionary D13 is used in common for each word registered in the symbol string compression dictionary D12, and a symbol string compression dictionary D12 is provided for each delimiter symbol individually.
- the data size of the compression dictionary is less likely to increase.
- Example (1) in FIG. 2 shows an example in which the part “about” in the English example sentence E1 is compressed based on the compression dictionary D0.
- the compressed code c ′ (about) and the compressed code c ′ ( ⁇ ) corresponding to the word “about” and the symbol “ ⁇ ” included in “about” are used for the compressed data. That is, in the example (1), the compressed data corresponding to the portion “about ⁇ ” in the English example sentence E1 is “c ′ (about) c ′ ( ⁇ )”.
- Example (2) in FIG. 2 shows an example in which the “about ⁇ ” portion in the English example sentence E1 is compressed based on the compression dictionary D1. Since the word “about” is registered in the symbol string compression dictionary D12, the compressed data corresponding to the portion “about” in the English example sentence E1 is “c (about ⁇ )”.
- the compressed data of the “about ⁇ ” portion is composed of two compression codes, whereas in the example (2), the compressed code constituting the compressed data of the “about ⁇ ” portion is 1. One.
- Example (3) in FIG. 2 shows an example in which the “invention.”
- Portion in the English example sentence E1 is compressed based on the compression dictionary D0.
- the compressed code c ′ (invention) and the compressed code c ′ (.) Corresponding to each of the word “invention” and the symbol “.” Included in “invention.” are used for the compressed data. That is, in the example (3), the compressed data corresponding to the portion of “invention.”
- the English example sentence E1 is “c ′ (invention) c ′ (.)”.
- Example (4) in FIG. 2 shows an example when the “invention.”
- Portion in the English example sentence E1 is compressed based on the compression dictionary D1.
- the word “invention ⁇ ” is registered in the symbol string compression dictionary D12. If the last space symbol “ ⁇ ” of the symbol string registered in the symbol string compression dictionary D12 is a delimiter other than a space symbol such as “.”, “,”, “;” And “:”, a space A compression code (“invention ⁇ ”) including the symbol “ ⁇ ” is used for the compressed data. That is, even if the portion to be compressed is “invention.”, The compression code c (invention ⁇ ) is used. In this case, the compression code c ([ ⁇ ].) Registered in the control symbol compression dictionary D13 is further used.
- the compression code c ([- ⁇ ].) Is a compression code corresponding to a combination of the control symbol [- ⁇ ] and the symbol “.”.
- the control symbol [- ⁇ ] is a control symbol indicating that the immediately preceding space is cancelled.
- An empty code in the character code system used by the data to be compressed is assigned to the code indicating the control symbol.
- Example (4) the compressed data corresponding to the “invention.” Portion in the English example sentence E1 is “c (invention ⁇ ) c ([ ⁇ ].)”. In both the example (3) and the example (4), the number of compression codes constituting the compressed data corresponding to “invention.” Is two.
- the number of compressed codes included in the compressed data is smaller when the compression dictionary D1 is used than when the compression dictionary D0 is used.
- the number of compression codes included in the compressed data is the same regardless of whether the compression dictionary D0 or the compression dictionary D1 is used. Therefore, if the combination of the symbol string and the space symbol registered in the symbol string compression dictionary D12 exists in the data to be compressed, the compression code used for the compressed data is more compressed using the compression dictionary D1 than the compression dictionary D0.
- the number of compression codes in the compression dictionary D1 is larger than the number of compression codes registered in the control symbol compression dictionary D13.
- the number of compression codes in the control symbol compression dictionary D13 is overwhelmingly smaller than the types of characters. For example, even if compression codes related to exclamation, commas, periods, colons, semicolons, and questions are registered in the control symbol dictionary D13, the number of types of compression codes handled is only increased by six.
- the compression code length is almost unchanged for the entire compression dictionary. For example, even if the number of types of compression codes included in the compression dictionary is doubled, the compression code length is increased by 1 bit for each compression code.
- the registration of a space symbol that exists at a frequency similar to the number of words in the compression dictionary in combination with the word has a greater effect than the increase in the compression code length due to the addition of the control symbol compression dictionary D13. Therefore, the compression rate is expected to be improved by compression using the compression dictionary D1.
- FIG. 3 shows an example of conversion into decompressed data.
- FIG. 3 shows a process (1) of generating decompressed data corresponding to the compression code “c (invention) c ([ ⁇ ⁇ ].)”, which is an example of the compressed data generated using the compression dictionary D1 in FIG. (5) are illustrated.
- the storage area A3 is a storage area for storing compressed data.
- the storage area A5 is a buffer area in which an expansion code corresponding to the compression code stored in the storage area A3 is stored.
- the storage area A4 is a storage area for storing decompressed data generated based on the decompression code stored in the storage area A5.
- Step (1) in FIG. 3 shows a state in which compressed data is stored in the storage area A3.
- the decompression code corresponding to the compression code c (invention ⁇ ) is generated.
- Step (2) in FIG. 3 is a state in which the decompressed code generated in step (1) is written in the storage area A4.
- the decompression code corresponding to the compression code c ([ ⁇ ].) Is generated.
- Step (3) in FIG. 3 shows a state in which the decompression codes “[ ⁇ ]” and “.” Generated in step (2) are written. Control based on the control symbol [- ⁇ ] stored in the storage area A5 is performed on the storage area A4.
- Step (4) in FIG. 3 shows a state in which control based on the control symbol [- ⁇ ] is performed.
- the write position to the storage area A4 may be shifted without deleting the space symbol ⁇ in the storage area A4.
- the symbol following the control symbol in this case “.” Is overwritten at the position where the space symbol was written.
- the conversion process according to the control symbol [- ⁇ ] is performed.
- the control symbol [- ⁇ ] inconsistency by generating compressed data using a compression code corresponding to the symbol string registered in the compression dictionary D1 even for a symbol string whose trailing space symbol does not match Is resolved.
- the number of compression codes used for the compressed data can be reduced while suppressing the increase in the types of compression codes to a small amount, so that the compression rate is improved. Furthermore, when decompressing the compressed data compressed by the above-described compression processing, the number of compression codes to be processed is reduced, so that the decompression speed can be improved.
- the compression code corresponding to the space symbol is not generated in the compression process, and the space is generated each time a word is generated in the expansion process. It is conceivable to generate symbols automatically.
- compression code generation processing is performed in character units. Since the compressed data includes a compression code indicating a word and a compression code indicating a symbol such as a character, the determination whether the expansion code corresponding to the compression code is a word or a symbol is generated every time an expansion code is generated. To be done. Further, a mechanism for identifying whether the compression code is a compression code corresponding to a word or a compression code corresponding to a character is required.
- the decompression process of this embodiment it is only determined whether or not the decompression code is a control symbol. For example, it is realized by performing determination processing between a control symbol written in a register and a register in which an expansion code read from the expansion dictionary is written. Multiple determination processes for one decompression code and creation of an decompression program with a complicated algorithm are avoided. In addition, since it is possible to suppress processing across a plurality of decompression codes, the number of registers to be used can also be suppressed.
- FIG. 4 shows an example of the functional block configuration.
- the computer 1 includes a compression unit 11, an expansion unit 12, a generation unit 13, a generation unit 14, and a storage unit 15.
- the storage unit 15 stores, for example, a compression target file F1, a compression file F2, an expansion file F3, a compression dictionary D2, an expansion dictionary D3, and the like. Further, the storage unit 15 stores a word list L1 used for generating the compression dictionary D2 and the expansion dictionary D3, for example.
- the word list L1 is a list of word groups to which a compression code is assigned.
- the storage unit 15 includes storage areas such as storage areas A1, A2, A3, A4, and A5, and is used as a work area for processing of the compression unit 11, the expansion unit 12, the generation unit 13, and the generation unit 14.
- the compression unit 11 performs a compression process on the file F1 stored in the storage unit 15, and generates a compressed file F2.
- the decompression unit 12 executes decompression processing of the compressed file F2 stored in the storage unit 15, and generates an decompressed file F3.
- the generation unit 13 generates a compression dictionary D2 used in the compression process of the compression unit 11.
- the generation unit 14 generates an expansion dictionary D3 used in the expansion process of the expansion unit 12.
- the compression unit 11 includes a control unit 111, a search unit 112, a reading unit 113, and a writing unit 114.
- the control unit 111 controls the search unit 112, the reading unit 113, and the writing unit 114, and executes a compression process for the file F1.
- the control unit 111 loads the file F1 into the storage area A1.
- the reading unit 113 reads data from the file F1.
- the search unit 112 searches the compression dictionary D2 for the data read by the reading unit 113.
- the writing unit 114 writes a compression code corresponding to the search result of the search unit 112 in the storage area A2.
- the control unit 111 manages the reading position of the reading unit 113, the writing position of the writing unit 114, and the like.
- control unit 111 causes the reading unit 113 and the writing unit 114 to execute the symbol or symbol string included in the file F1. Causes sequential processing to be executed. Further, the control unit 111 generates a compressed file F2 based on the compressed data stored in the storage area A2, and stores the compressed file F2 in the storage unit 15.
- the decompression unit 12 includes a control unit 121, a search unit 122, a reading unit 123, and a writing unit 124.
- the control unit 121 controls the search unit 122, the reading unit 123, and the writing unit 124, and executes decompression processing of the compressed file F2.
- the control unit 121 loads the compressed file F2 into the storage area A3.
- the reading unit 123 reads the compressed code from the compressed file F2.
- the search unit 122 searches the expansion dictionary D3 for the compressed code read by the reading unit 123.
- the writing unit 124 writes the decompression code corresponding to the search result of the search unit 122 in the storage area A4.
- the control unit 121 adjusts the write position where writing by the writing unit 124 is performed by one space symbol (for example, 1 byte). To do.
- the control unit 121 manages the reading position of the reading unit 123, the writing position of the writing unit 124, and the like. For example, the control unit 121 causes the reading unit 123 and the writing unit 124 to sequentially perform the compression codes included in the compressed file F2. Execute the process.
- the control unit 121 generates an expanded file F3 based on the expanded data stored in the storage area A4, and stores the expanded file F3 in the storage unit 15.
- the generation unit 13 includes a control unit 131, a statistics unit 132, an allocation unit 133, and a sorting unit 134.
- the generation unit 13 generates a compression dictionary D2 in response to an instruction from the compression unit 11.
- the control unit 131 controls the statistics unit 132, the allocation unit 133, and the sorting unit 134, and generates a compression dictionary D2 used for compression of the file F1.
- the statistical unit 132 generates statistical information of each character and word by counting the number of appearances of characters and words included in the file F1.
- the sorting unit 134 sorts each character or word for which statistical information is generated based on the statistical information generated by the statistical unit 132.
- the generation unit 14 includes a control unit 141, an allocation unit 142, a duplication unit 143, and a sorting unit 144.
- the generation unit 14 generates an expansion dictionary D3 in response to an instruction from the expansion unit 12.
- the control unit 141 controls the assigning unit 142, the duplicating unit 143, and the sorting unit 144, and generates an expansion dictionary D3 used for decompressing the compressed file F2.
- the assigning unit 142 uses the statistical information stored in the storage unit 15 to generate a compressed code corresponding to each character or word.
- the sorting unit 144 sorts the character information assigned with the compression code according to the value of the compression code.
- the duplication unit 143 duplicates a character code indicating a character or a word corresponding to the compression code according to the code length of each of the sorted compression codes.
- the control unit 141 generates the expansion dictionary D3 by placing the character code copied by the copying unit 143 at the offset position corresponding to the compression code generated by the assigning unit 142.
- the control unit 141 further stores the expansion dictionary D3 in the storage unit 15.
- FIG. 5 shows a processing procedure example of the compression function.
- the control unit 111 executes preprocessing for compression processing (S101).
- the call of the compression function includes designation of the file F1 to be compressed.
- the control unit 111 secures the storage area A1 and the storage area A2, loads the word list L1 from the storage unit 15, and secures the storage area for the statistical table T1 and the compression dictionary D2.
- the control unit 111 loads the file F1 into the storage area A1 (S102).
- the control unit 111 divides the file F1 into blocks, and performs the following compression processing for each block obtained by the division. Subsequently, the control unit 111 instructs the generation unit 13 to generate the compression dictionary D2 (S103).
- FIG. 6 shows an example of a processing procedure for generating a compression dictionary.
- the control unit 131 causes the statistical unit 132 to execute statistical processing of the file F1 (S201).
- FIG. 7 shows an example of a statistical processing procedure.
- the statistical unit 132 starts statistical processing on the file F1 loaded in the storage area A1.
- the read position indicates the beginning of the file F1 loaded in the storage area A1.
- the statistics unit 132 acquires a character code from the reading position of the storage area A1 (S301). In the process of S301, the reading position is advanced by the character code acquired in S301.
- the statistical unit 132 determines whether or not the character code acquired in S301 is a delimiter (S302).
- the determination in S302 is made according to whether a character code to be processed as a delimiter is set in advance and the character code acquired in S301 corresponds to one of the preset character codes.
- the delimiter is, for example, a space symbol (0x20 in ASCII code system), exclamation (0x21 in ASCII code system), comma (0x2C in ASCII code system), period (0x2E in ASCII code system), colon (ASCII code system). 0x3A), semicolon (0x3B in ASCII code system) and question (0x3F in ASCII code system).
- the determination in S302 may be made according to whether or not the character code acquired in S301 is within a predetermined numerical range (for example, 0x20 to 0x3F).
- the statistical unit 132 stores the character code acquired in S301 in the buffer (S303). The process proceeds to S309 in which the process of S303 ends.
- the statistical unit 132 determines whether the character code acquired in S301 is a delimiter (S302: YES). As an example, the determination in S304 may be performed before the determination in S302. In this case, the statistical unit 132 performs the determination of S302 when the condition of S304 is not satisfied, proceeds to the procedure of S305 if the determination condition of S302 is satisfied, and proceeds to the procedure of S303 if the determination condition of S302 is not satisfied.
- S304 space symbol
- FIG. 8 shows an example of the statistics table T1.
- the statistics table T1 stores the number of appearances for each symbol such as letters and numbers, each word included in the word list L1, and each combination of a control symbol and each delimiter.
- the statistical unit 132 may register a character string delimited by satisfying the determination condition of S302 as a word in the statistical table T1.
- the statistical unit 132 when there is no corresponding character string in the statistical table T1, the statistical unit 132 newly registers the character string as a word in the statistical table T1.
- a compression code is assigned to words included in the word list L1 but not included in the file F1.
- a compression code is assigned to a word that is not included in the word list L1 but is included in the file F1.
- words that are included in the word list L1 and are not included in the file F1 are suppressed from occupying a storage area by being registered in the statistical table T1.
- the statistical unit 132 counts combinations of control symbols and delimiters (S306). In S306, the number of appearances for the combination of the control symbol for canceling the space symbol and the delimiter acquired in S301 is incremented. When the process of S306 ends, the statistics unit 132 proceeds to the procedure of S309.
- the statistical unit 132 determines whether or not the reading position is the end of the file F1 loaded in the storage area A1 (S309). If it is not the end in the determination of S309 (S309: NO), the statistical unit 132 proceeds to the procedure of S301. If the end of the determination in S309 (S309: YES), the statistical unit 132 ends the statistical processing.
- the control unit 131 returns to the procedure of FIG. 6 and causes the sorting unit 134 to execute the sorting process (S202).
- the sort unit 134 uses the statistical information (each character) generated by the statistical unit 132 using the character information (symbols such as characters, symbol strings such as words, combinations of control symbols and separators) registered in the statistical table T1. Sort based on the number of occurrences of information). For example, the statistic unit 132 rearranges the character information registered in the statistic table T1 in one of the order of increasing or decreasing number of appearances.
- the control unit 131 causes the allocating unit 133 to execute allocation of a compression code (S203).
- the assigning unit 133 assigns a compression code to the character information group rearranged in the order of frequency in S202 based on an algorithm that assigns a shorter compression code to the character information having a higher frequency, such as Huffman coding or arithmetic coding.
- the control unit 131 causes the sorting unit 134 to sort the set of the character information and the compression code assigned to the character information based on the character information (S204). For example, the sorting unit 134 sorts the character information in ascending order of character codes.
- the sorting unit 134 arranges the first character code value of the character information in ascending order, and arranges character information having the same first character code value in ascending order of the second character code value.
- the state rearranged by the process of S204 is the compression dictionary D2 shown in FIG.
- the control unit 131 When the processing of S204 is completed, the control unit 131 performs index generation processing (S205).
- the control unit 131 generates an index by associating the character information with information indicating the position (offset) in the character information group in which the character information is sorted in S204. For example, the offset “0x126” is associated with the character “i” in the compression dictionary D2 illustrated in FIG.
- the search for the compression code corresponding to the word starting with “i” is started from “0x126”.
- the generation unit 13 ends the generation process of the compression dictionary D2.
- FIG. 11 shows an example of the data structure of the compression dictionary.
- the compression dictionary D2 shown in FIG. 11 character information and a compression code are stored in association with each other.
- the storage position of the set of character information and compression code is indicated by an offset starting from the storage position of the compression dictionary D2.
- the information of the character information “invent ⁇ ” is stored at the offset 0x0140.
- the index generated in S205 uses this offset to narrow down the search range.
- “c ()” indicates a compressed code corresponding to character information in parentheses.
- the compression dictionary D2 is generated by the generation unit 13, but as another example, the compression dictionary D2 may be stored in the storage unit 15 in advance. In this case, the compression dictionary D2 is used in common for a plurality of files. For example, in the compression dictionary D2 stored in the storage unit 15 in advance, a compression code is assigned based on, for example, statistical information of character information in a previously compressed file or a plurality of files existing in the database.
- control unit 111 When the generation unit 13 finishes generating the compression dictionary D2, the control unit 111 returns to the procedure of FIG. 5 and executes the compressed data generation processing (S104).
- FIG. 9 shows a processing procedure example of the compressed data generation processing.
- the read position is set at the start point of the file F1 loaded in the storage area A1, the write position is set at a predetermined position in the storage area A2, and the buffer is cleared.
- the reading unit 113 acquires a character code from the reading position (S401).
- the control unit 111 updates the reading position after obtaining the character code in S401. Further, the control unit 111 stores the character code acquired by the reading unit 113 in S401 in a buffer (S402).
- the control unit 111 determines whether or not the character code acquired in S401 is a space symbol (S403).
- the process returns to S401, and the reading unit 113 acquires the character code from the reading position. That is, the procedures of S401 and S402 are repeated until the space symbol is read.
- the search unit 112 searches the compression dictionary D2 with the character code (or character code string) stored in the buffer (S404).
- the control unit 111 determines whether matching character information matching the character code (or character code string) stored in the buffer exists in the compression dictionary D2 (S405). The process when the matched character information does not exist in the compression dictionary D2 (S405: NO) will be described later with reference to FIG. If the matching character information exists (S405: YES), the writing unit 114 writes the compression code associated with the matching character information in the compression dictionary D2 at the writing position in the storage area A2 (S406).
- control unit 111 updates the writing position and deletes (clears) the character code (or character code string) stored in the buffer (S407). Further, the control unit 111 determines whether or not the reading position is the end of the file F1 loaded in the storage area A1 (S408).
- the process returns to S401, and the output unit 113 acquires the character code from the reading position.
- the control unit 111 ends the compressed data generation process.
- FIG. 10 shows a processing procedure example of the compressed data generation processing. If there is no matching character information in the compression dictionary D2 in the processing of S405 (S405: NO), the control unit 111 determines that the character code at the end of the character code (or character code string) delimited by the space symbol is It is determined whether it is a delimiter (S409). That is, it is determined whether or not the character code immediately before the space symbol indicates a delimiter symbol in the character code string stored in the buffer. Whether or not it is a delimiter is determined using the same determination condition as in S302 of FIG.
- the control unit 111 replaces the end delimiter with a space symbol in the character code string up to the end delimiter in the buffer.
- a character code string is generated (S410).
- the search unit 112 searches the compression dictionary D2 for the character code string generated in S410 (S411). Based on the search result of S411, the control unit 111 determines whether there is matching character information that matches the character code string generated in S410 in the compression dictionary D2 (S412).
- the writing unit 114 writes the compression code associated with the matching character information in the compression dictionary D2 at the writing position of the storage area A2 (S413).
- the control unit 111 updates the writing position in accordance with the writing by the writing unit 114. Further, the writing unit 114 stores, in the compression dictionary D2, the compression code (control code) associated with the combination of the control symbol and the delimiter symbol and the compression code associated with the space symbol at the write position in the storage area A2. Write (S414).
- the control unit 111 updates the writing position in accordance with the writing by the writing unit 114.
- the control unit 111 When there is no matching character information (S412: NO) or when the last character code is not a delimiter (S409: NO), the control unit 111 performs processing for each character code in the buffer (S415). To S418). For each character code, the control unit 111 causes the search unit 112 to search the compression dictionary D2 (S416), and causes the writing unit 114 to write the compression code obtained as a result of the search at the write position (S417). When the processing of S416 and S417 is performed for each character code stored in the buffer, the procedure returns to S407, and the control unit 111 clears the character code string stored in the buffer.
- the control unit 111 When the above compressed data generation process is completed, the procedure returns to S105 shown in FIG.
- the control unit 111 generates a compressed file F2 using the compressed data stored in the storage area A2, and stores it in the storage unit 15 (S105).
- the compressed file F2 includes, for example, a header, compressed data in the storage area A2, and trailer information.
- the header includes, for example, identification information for identifying the compression algorithm and information such as the data size of each of the header, the compressed data, and the trailer information.
- the trailer information includes, for example, an expansion dictionary D3 corresponding to the statistical table T1 or the compression dictionary D2. The expansion dictionary D3 will be described later with reference to FIG.
- the control unit 111 When the process of S105 is completed, the control unit 111 notifies the call destination of the compression function that the compression process has been completed (S106).
- the notification in S106 includes, for example, information indicating the storage destination of the compressed file F2.
- FIG. 12 shows an example of the processing procedure of the decompression function.
- the control unit 121 executes preprocessing for decompression processing (S501).
- the call of the expansion function includes designation of the compressed file F2 to be expanded.
- the control unit 111 secures the storage area A3 and the storage area A4, loads the statistical table T1 from the compressed file F2, and further secures a storage area for the expansion dictionary D3.
- control unit 121 loads the compressed file F2 into the storage area A3 (S502).
- control unit 111 causes the generation unit 14 to generate an expansion dictionary (S503).
- FIG. 13 shows an example of a processing procedure for generating a decompression dictionary.
- the control unit 141 acquires the statistics table T1 from the trailer information of the compressed file F2 loaded in the storage area A3 (S601).
- the allocation unit 142 allocates a compression code to each character information in the statistical table T1 (S602).
- a compression code is assigned by the same algorithm as in S203.
- the sorting unit 144 sorts the character information to which the compression code is assigned according to the value of the compression code (S603).
- the control unit 141 associates the code length of the assigned compression code with each character information to which the compression code is assigned (S604).
- the duplicating unit 143 duplicates the character information and the code length information to the number corresponding to the code length associated with the character information (S605). Further, the control unit 141 stores the copied information at a position in the storage area of the expansion dictionary D3 secured in the storage unit 15 and at an offset position based on the compression code (S606). As a result of S606, the expansion dictionary D3 is generated, and the procedure proceeds to S504 in FIG.
- FIG. 15 shows an example of the data structure of the expansion dictionary.
- the expansion dictionary D3 has a data structure in which information indicating an expansion code (character information) and a code length is stored at an offset position based on a corresponding compression code.
- the expansion dictionary D3 in FIG. 15 illustrates a case where the maximum code length of the compression code is 12 bits.
- the expansion dictionary D3 is used for processing of reading fixed-length data from compressed data subjected to variable-length encoding and extracting an expansion code corresponding to the read fixed-length data.
- the decompression speed can be increased compared to determining the boundary of the code bit by bit.
- extra data is read out from the compressed data, so the read position from the compressed data is adjusted based on the code length. Since the expansion dictionary D3 is an expansion dictionary used for such expansion processing, information having the same expansion code and code length is registered in duplicate.
- control symbol [- ⁇ ]. For example, control symbol [- ⁇ ].
- the symbol string around ⁇ is also the control symbol [- ⁇ ].
- the number is duplicated according to the code length of the compression code and stored at the offset position according to the compression code.
- c (about ⁇ ) is 10-bit data “0110101001”.
- FIG. 14 shows an example of a processing procedure for decompression data generation processing.
- the control unit 121 starts a process for generating expanded data corresponding to the compressed data included in the compressed file F2.
- the read position from the storage area A3 is set to the start point of the compressed data of the compressed file F2
- the write position to the storage area A4 is set to a predetermined position in the storage area A4, and the storage area A5 (buffer) is cleared. Yes.
- the writing unit 124 writes the decompression code obtained in S702 to the write position in the storage area A4 (S704).
- the control unit 121 adjusts the writing position (S705). In S705, the control unit 121 returns the writing position in the storage area A4 by one character code. Subsequent to S705, the writing unit writes the delimiter stored in the storage area A5 in the writing position of the storage area A4 (S706).
- the control unit 121 updates the reading position from the storage area A3 (S707).
- the read position from the storage area A3 is updated based on the code length obtained by referring to S702.
- the read position is advanced by the number of bits indicated by the code length information.
- the control unit 121 updates the write position in the storage area A4 (S708).
- the write position in the storage area A4 is updated based on the length of the decompressed code (expanded code length) read in S702.
- the processing order of S707 and S708 may be switched. *
- the expansion dictionary D3 may further include information indicating the expansion code length in addition to the expansion code and the code length. In that case, for example, the control unit 121 updates the writing position based on the decompression code length associated with each decompression code in the processing of S708.
- the control unit 121 When the decompressed data generation process is completed, the control unit 121 creates the decompressed file F3 based on the decompressed data stored in the storage area A4, and stores the created decompressed file F3 in the storage unit 15 (S505). Further, the control unit 121 notifies the callee of the decompression function that the decompression process has been completed (S506).
- the notification in S506 includes, for example, information indicating the storage destination of the decompressed file F3.
- FIG. 16 shows a hardware configuration example of the computer 1.
- the computer 1 includes, for example, a processor 301, a RAM (Random Access Memory) 302, a ROM (Read Only Memory) 303, a drive device 304, a storage medium 305, an input interface (I / F) 306, an input device 307, an output interface (I / F) 308, an output device 309, a communication interface (I / F) 310, a SAN (Storage Area Network) interface (I / F) 311, a bus 312 and the like. Each piece of hardware is connected via a bus 312.
- a processor 301 includes, for example, a processor 301, a RAM (Random Access Memory) 302, a ROM (Read Only Memory) 303, a drive device 304, a storage medium 305, an input interface (I / F) 306, an input device 307, an output interface (I / F) 308, an output device 309, a communication interface (I / F
- the RAM 302 is a readable / writable memory device.
- a semiconductor memory such as SRAM (Static RAM) or DRAM (Dynamic RAM), or a flash memory other than the RAM may be used.
- the ROM 303 may be a PROM (Programmable ROM).
- the drive device 304 is a device that performs at least one of reading and writing of information recorded in the storage medium 305.
- the storage medium 305 stores information written by the drive device 304.
- the storage medium 305 is a storage medium such as a hard disk, a flash memory such as an SSD (Solid State Drive), a CD (Compact Disc), a DVD (Digital Versatile Disc), or a Blu-ray disc.
- the computer 1 includes a drive device 304 and a storage medium 305 for each of a plurality of types of storage media.
- the input interface 306 is connected to the input device 307 and is a circuit that transmits an input signal received from the input device 307 to the processor 301.
- the output interface 308 is a circuit that is connected to the output device 309 and causes the output device 309 to execute an output in accordance with an instruction from the processor 301.
- the communication interface 310 is a circuit that controls communication via the network 3.
- the communication interface 310 is, for example, a network interface card (NIC).
- the SAN interface 311 is a circuit that controls communication with a storage device connected to the computer 1 via the storage area network 4.
- the SAN interface 311 is, for example, a host bus adapter (HBA).
- the processor 301 reads a program stored in the ROM 303 or the storage medium 305 to the RAM 302, and performs at least one process of the compression unit 11, the expansion unit 12, the generation unit 13, and the generation unit 14 according to the procedure of the read program. Do. At that time, the RAM 302 is used as a work area of the processor 301.
- the function of the storage unit 15 is that the ROM 303 and the storage medium 305 store program files (such as an application program 24, middleware 23, and OS 22 described later) and data files (such as a file F1, a compressed file F2, and an expanded file F3). This is realized by being used as a work area of the processor 301.
- the program read by the processor 301 will be described with reference to FIG.
- FIG. 17 shows an example of a program configuration that runs on the computer 1.
- the application program 24 or the middleware 23 is a program in which the processing procedure of the compression function or decompression function of this embodiment is defined.
- the application program 24 or the middleware 23 is a program in which the compression dictionary generation or decompression dictionary generation processing procedure of this embodiment is defined.
- the compression program in which the processing procedure of the compression function is defined and the decompression program in which the processing procedure of the decompression function is defined may be an integrated program or a separate program.
- the compression dictionary generation program in which the compression dictionary generation procedure is defined may be included in the compression program, or may be a separate program called by the compression program.
- the expansion dictionary generation program in which the procedure for generating the expansion dictionary is defined may be included in the expansion program, or may be a separate program read by the expansion program.
- OS Operating System 22
- At least one of the above-described compression function and expansion function, at least one of the compression program, the expansion program, the compression dictionary generation program, and the expansion dictionary generation program is stored in a storage medium.
- the storage medium is read by the drive device 304 and installed, the program stored in the storage medium becomes executable.
- Each processing procedure defined in the installed program is executed by controlling the hardware group 21 (301 to 312) based on the OS 23.
- the functions of the functional blocks included in the computer 1 shown in FIG. 4 are provided when the processor 301 executes a compression program or an expansion program.
- the processing procedures shown in FIGS. 5, 6, 7, 9, and 10 are executed by the processor 301 to provide the functions of the functional blocks included in the compression unit 11 and the generation unit 13. 12, 13, and 14 are executed by the processor 301, the functions of the functional blocks included in the decompression unit 12 and the generation unit 14 are provided.
- the function of the control unit 111 is that the processor 301 accesses the RAM 302 (allocating a storage area, loads a file, etc.) and manages the processing status (reading position, writing position, etc.) in the register. It is provided by making a match determination with the information held in the file.
- the function of the reading unit 113 is provided by the processor 301 accessing the RAM 302 according to the processing status in the register.
- the function of the search unit 112 is provided when the processor 301 accesses the RAM 302 and performs collation determination based on the access result.
- the function of the writing unit 114 is provided by the processor 301 accessing the RAM 302 according to the processing status in the register.
- the function of the control unit 121 is that the processor 301 accesses the RAM 302 (allocating a storage area, loads a file, etc.), manages the processing status (reading position, writing position, etc.) in the register, and stores it in the register. It is provided by making a match determination with the held information.
- the function of the reading unit 123 is provided by the processor 301 accessing the RAM 302 according to the processing status in the register.
- the function of the search unit 122 is provided when the processor 301 accesses the RAM 302 and performs collation determination based on the access result.
- the function of the writing unit 124 is provided by the processor 301 accessing the RAM 302 according to the processing status in the register.
- the functional blocks in the generation unit 13 are executed using the hardware group 21 as follows.
- the function of the control unit 131 is provided when the processor 301 performs area management of the RAM 302 and accesses the RAM 302, and the processor 301 calls a routine according to the routine processing result.
- the function of the statistical unit 132 is provided by an access process to the RAM 302 by the processor 301 and an arithmetic process according to the result of the access process.
- the function of the sorting unit 134 is provided by the processor 301 accessing the RAM 302 and the arithmetic processing according to the access result.
- the function of the allocating unit 133 is provided when the processor 301 performs arithmetic processing based on access to the RAM 302.
- the functional blocks in the generation unit 14 are executed using the hardware group 21 as follows.
- the function of the control unit 141 is provided when the processor 301 performs area management of the RAM 302 and accesses the RAM 302, and the processor 301 calls a routine according to the routine processing result.
- the function of the duplication unit 143 is provided by access processing to the RAM 302 by the processor 301.
- the function of the sorting unit 144 is provided by the processor 301 accessing the RAM 302 and the arithmetic processing according to the access result.
- the function of the allocation unit 142 is provided by the processor 301 performing arithmetic processing based on access to the RAM 302.
- the number of compression codes is suppressed. For this reason, the number of memory accesses required for writing the compressed code is suppressed. Further, since the compression rate is improved, the number of I / Os when storing the compressed file is also suppressed.
- FIG. 18 shows a system configuration example using the computer 1.
- the information processing system illustrated in FIG. 18 includes a base station 2, a network 3, a computer 1a, and a computer 1b.
- the computer 1a is connected to the network 3 connected to the computer 1b by at least one of wireless and wired.
- the compression unit 11, the decompression unit 12, the generation unit 13, and the generation unit 14 illustrated in FIG. 4 may be included only in the computer 1a, or may be included in both the computer 1a and the computer 1b.
- the computer 1a may include the compression unit 11 and the generation unit 13, and the computer 1b may include the decompression unit 12 and the generation unit 14.
- the computer 1b includes the compression unit 11 and the generation unit 13, and the computer 1a
- the decompression unit 12 and the generation unit 14 may be included.
- the compressed file F2 generated by the computer 1a is transmitted to the computer 1b by communication via the network 3, and the compressed file F2 is expanded by the computer 1b to generate the expanded file F3.
- the compressed file F2 may be transmitted to the base station 2 wirelessly and transmitted from the base station 2 to the computer 1b.
- the compression rate is improved, so that the amount of compressed data communicated is reduced. Thereby, the use of the hardware resources of the system illustrated in FIG. 18 for the communication processing is suppressed.
- FIG. 19 shows a system configuration example using the computer 1.
- the information processing system illustrated in FIG. 19 includes a computer 1, a network 3, a client device 6, a storage area network (SAN) 4, and a storage device 5.
- the computer 1 performs information processing in response to a request from the client device 6. Data subject to information processing is compressed and stored in the storage device 5, for example.
- the computer 1 acquires and decompresses data to be processed that is compressed and stored in the storage device 5.
- the computer 1 executes the information processing requested from the client device 6 on the decompressed data, further compresses the data after the information processing, and stores it in the storage device 5.
- the information processing includes, for example, update processing of data stored in the storage device 5, analysis / analysis processing of data stored in the storage device 5, and the like.
- a compression process or decompression dictionary may be executed based on In that case, the processing of S103 in FIG. 5 and the processing of S503 in FIG. 12 may simply load the compression dictionary and decompression dictionary that are held.
- the target of the compression process may be a monitoring message output from the system in addition to the file.
- the monitoring message sequentially stored in the buffer is compressed by the above-described compression processing and stored as a log file.
- compression may be performed in units of pages in the database, or compression may be performed in units of a plurality of pages.
- a common compression dictionary may be used for a plurality of monitoring messages, or a common compression dictionary may be used for a plurality of pages.
- space symbols are often used, but other delimiters are also used.
- the compression dictionary D2 When the compression dictionary D2 is used, a symbol string with a delimiter, a delimiter and a space symbol, a compression code corresponding to a combination of the symbol string and the space symbol, a compression code corresponding to a control symbol and a delimiter, Three compression codes, corresponding to the space symbol, are used.
- a delimiter other than the space symbol is used, a space symbol is often used after the delimiter.
- a compression dictionary D4 in which compression codes are assigned to combinations of control symbols, delimiters (other than space symbols), and space symbols may be used instead of the compression dictionary D2.
- FIG. 20 shows an example of the data structure of the compression dictionary.
- the compression dictionary D4 In the compression dictionary D4, combinations of control symbols and delimiters (other than space symbols) are registered.
- the compression dictionary D4 shown in FIG. 20 combinations of control symbols, delimiters (other than space symbols), and space symbols are registered.
- the control symbol counting process of S306 in FIG. 7 it is confirmed whether a space symbol follows a delimiter other than a space symbol. If a space symbol follows, the control symbol, delimiter (other than space symbol), and space symbol If the combination is registered, the compression dictionary D4 of FIG. 20 is generated.
- FIG. 21 shows an example of the data structure of the expansion dictionary. Also in FIG. 21, combinations of control symbols, delimiters (other than space symbols), and space symbols are registered.
- the expansion dictionary D5 is generated according to the statistical table. The expansion dictionary D5 is used instead of the expansion dictionary D3.
- csv code-separated values
- a comma included in the data string indicates a field delimiter that is an element constituting the csv format data.
- FIG. 22 shows an example of the data structure in the csv format.
- the csv format data E1 illustrated in FIG. 22 is data including six fields in each row, and the title of each field is a header row (application number, filing date, inventor, country name, route, name of invention). Shown in Each line delimiter is indicated by a line feed code.
- the data of the inventor field in the data E1 shows the inventor's last name and first name, and the last name and the first name are separated by a space symbol.
- FIG. 23 shows an example of the data structure of the compression dictionary.
- the compression dictionary D6 shown in FIG. 23 combinations of symbol strings and commas are registered. For example, character information in which a comma is added to a word such as “INDEX,” or “INFORMATION,” is registered. While the compression dictionary D6 for registering character information including a comma at the end of the character information is used, a delimiter other than a comma is present after the word in the data E1, and thus a control symbol [- ,] Are used. For example, a combination of a control symbol [ ⁇ ,] and a space symbol ⁇ is registered in the compression dictionary D6.
- a combination of a control symbol [-,] and a line feed code (the line feed code is shown as [Line feed]) is registered.
- a compression code corresponding to each registered character information is stored in association with the character information.
- FIG. 24 shows an example of the data structure of the expansion dictionary.
- each character information registered in the compression dictionary D6 is duplicated and stored in an offset position corresponding to the corresponding compression code in a number corresponding to the code length of the corresponding compression code. ing. Information indicating the code length of the compression code is also stored in the expansion dictionary D7 in association with each character information.
- each element of the data is separated by a comma, but a line feed code is used at the line boundary. Therefore, for example, the combination of the control symbol [ ⁇ ,] and the line feed code tends to increase the number of appearances and shorten the code length of the compression code.
- space symbols may be included when multiple words are included in the field.
- the data of the inventor field of the data E1 shown in FIG. 22 is “KATAOKA ⁇ MASAHIRO,”
- the compression code corresponding to this data is c (KATAOKA,) c ([ ⁇ ,] ⁇ ) c (MASAHIRO ,).
- a combination of a symbol string such as a word and a delimiter symbol such as a space symbol existing immediately before the symbol string may be registered in the compression dictionary.
- a control symbol for deleting the following space symbol is used.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Description
図1は、圧縮辞書の例を示す。圧縮辞書D0は、記号圧縮辞書D01および記号列圧縮辞書D02を含む。記号圧縮辞書D01は、文字や数字などの記号を示す文字コードと圧縮符号との対応関係を示す。記号列圧縮辞書D02は、単語やタグなどの記号列を示す文字コード列と圧縮符号との対応関係を示す。圧縮辞書D1は、記号圧縮辞書D11、記号列圧縮辞書D12および制御記号圧縮辞書D13を含む。記号圧縮辞書D11は、文字や数字などの記号を示す文字コードと、圧縮符号との対応関係を示す。記号列圧縮辞書D12は、単語やタグなどの記号列およびスペース記号の組み合わせを示す文字コード列と圧縮符号との対応関係を示す。制御記号圧縮辞書D13は、直前のスペース記号を削除する旨の制御記号および区切り記号の組み合わせを示す文字コード列と、圧縮符号との対応関係を示す。
図3は、伸張データへの変換例を示す。図3には、図2で圧縮辞書D1を用いて生成された圧縮データの例である圧縮符号「c(invention)c([-△].)」に対応する伸張データの生成過程(1)~(5)が例示されている。記憶領域A3は圧縮データが格納される記憶領域である。記憶領域A5は記憶領域A3に格納された圧縮符号に対応する伸張コードが格納されるバッファ領域である。また、記憶領域A4は、記憶領域A5に格納される伸張コードに基づき生成される伸張データが格納される記憶領域である。
図4は、機能ブロックの構成例を示す。コンピュータ1は、圧縮部11、伸張部12、生成部13、生成部14および記憶部15を含む。記憶部15は、例えば、圧縮対象のファイルF1、圧縮ファイルF2、伸張ファイルF3、圧縮辞書D2や伸張辞書D3などを記憶する。また、記憶部15は、例えば圧縮辞書D2や伸張辞書D3の生成に用いられる単語リストL1を記憶する。単語リストL1は、圧縮符号を割り当てる対象の単語群のリストである。また、記憶部15は、記憶領域A1、A2、A3、A4およびA5などの記憶領域を設け、圧縮部11、伸張部12、生成部13および生成部14の処理のワークエリアとして用いられる。圧縮部11は、記憶部15に記憶されたファイルF1の圧縮処理を実行し、圧縮ファイルF2を生成する。伸張部12は、記憶部15に記憶された圧縮ファイルF2の伸張処理を実行し、伸張ファイルF3を生成する。生成部13は、圧縮部11の圧縮処理で用いられる圧縮辞書D2を生成する。生成部14は、伸張部12の伸張処理で用いられる伸張辞書D3を生成する。
以下に、上述の圧縮処理や伸張処理を実施する構成について説明する。
以下、上述の実施形態における変形例の一部を説明する。
また、上述の実施形態において、単語などの記号列と、その直前に存在するスペース記号などの区切り記号の組み合わせを圧縮辞書に登録することとしてもよい。その場合には、例えば、後続するスペース記号を削除する旨の制御記号が用いられる。
1a コンピュータ
1b コンピュータ
2 基地局
3 ネットワーク
4 ストレージエリアネットワーク
5 ストレージ装置
6 クライアント装置
11 圧縮部
12 伸張部
13 生成部
14 生成部
15 記憶部
Claims (17)
- データを構成する要素のうちの1つである第1の要素と、前記データにおいて前記要素間の区切りを示す第1の区切り記号との組み合わせに対応する1つの圧縮符号を、前記第1の要素と対応付けて記憶する記憶部と、
前記データから読み出された前記第1の要素と対応付けて記憶された前記圧縮符号を、前記記憶部から取得する取得部と、
取得した前記圧縮符号を、前記データの圧縮データを格納する格納領域に書き込む書込部と、
を含むことを特徴とする圧縮装置。 - 前記記憶部は、さらに、前記第1の区切り記号を取り消す指示を示す制御記号と、前記データにおいて前記要素間の区切りを示す区切り記号であって前記第1の区切り記号と種類が異なる第2の区切り記号との組み合わせに対応する制御符号を、前記第2の区切り記号と対応付けて記憶し、
前記取得部は、前記データから前記第1の要素に後続する前記第2の区切り記号を読み出した場合に、前記圧縮符号に続いて、さらに前記制御符号を前記記憶部から取得し、
前記書込部は、さらに、取得した前記制御符号を、前記格納領域の前記圧縮符号に後続する位置に書き込む、
ことを特徴とする請求項1に記載の圧縮装置。 - 前記制御符号に対応する組み合わせに、前記制御記号および前記第2の区切り符号に加えて前記第1の区切り符号が含まれる
ことを特徴とする請求項2に記載の圧縮装置。 - 前記データは、テキストファイルであり、
前記要素は、単語単位のデータである、
ことを特徴とする請求項1~3のいずれか1項に記載の圧縮装置。 - 前記データは、テーブル構造に対応するテキストファイルであり、
前記要素は、前記テーブル構造における1つのフィールドを構成する、
ことを特徴とする請求項1~3のいずれか1項に記載の圧縮装置。 - コンピュータに、
データを構成する要素のうちの1つである第1の要素と、前記データにおいて前記要素間の区切りを示す第1の区切り記号との組み合わせに対応する1つの圧縮符号を、前記第1の要素と対応付けて記憶する記憶装置から、前記データから読み出された前記第1の要素と対応付けて記憶された前記圧縮符号を取得し、
取得した前記圧縮符号を、前記データの圧縮データを格納する格納領域に書き込む、
ことを特徴とする圧縮方法。 - コンピュータに、
データを構成する要素のうちの1つである第1の要素と、前記データにおいて前記要素間の区切りを示す第1の区切り記号との組み合わせに対応する1つの圧縮符号を、前記第1の要素と対応付けて記憶する記憶装置から、前記データから読み出された前記第1の要素と対応付けて記憶された前記圧縮符号を取得し、
取得した前記圧縮符号を、前記データの圧縮データを格納する格納領域に書き込む、
処理を特徴とする圧縮プログラム。 - データを構成する要素のうちの1つである第1の要素と、前記データにおいて前記要素間の区切りを示す第1の区切り記号との組み合わせを、1つの圧縮符号と対応付けて記憶する記憶部と、
前記データを圧縮して得られる圧縮データから読み出された前記圧縮符号に対応する前記組み合わせを、前記記憶部から取得する取得部と、
取得した前記組み合わせに含まれる前記第1の要素及び前記第1の区切り記号の双方を、前記圧縮データを伸張して得られる伸張データの格納領域に書き込む書込部と、
を含むことを特徴とする伸張装置。 - 前記記憶部は、さらに、前記区切り記号を取り消す指示を示す制御記号と、前記データにおいて前記要素間の区切りを示す区切り記号であって前記第1の区切り記号と種類が異なる第2の区切り記号との組み合わせを、前記圧縮符号と異なる他の圧縮符号と対応付けて記憶し、
前記取得部は、前記圧縮データから前記圧縮符号に後続する前記他の圧縮符号を読み出した場合に、前記第1の要素および前記第1の区切り記号に続いて、さらに前記制御記号および前記第2の区切り記号を前記記憶部から取得し、
前記書込部は、さらに、前記制御記号および前記第2の区切り記号を取得すると、取得した前記第2の区切り記号を、前記格納領域内の前記第1の区切り記号の格納位置に書き込む、
ことを特徴とする請求項8に記載の伸張装置。 - 前記他の圧縮符号に対応づけられる組み合わせに、前記制御記号および前記第2の区切り記号に加えて、前記第1の区切り符号が含まれる、
ことを特徴とする請求項9に記載の伸張装置。 - 前記データは、テキストファイルであり、
前記要素は、単語単位のデータである、
ことを特徴とする請求項8~10のいずれか1項に記載の伸張装置。 - 前記データは、テーブル構造に対応するテキストファイルであり、
前記要素は、前記テーブル構造における1つのフィールドを構成する、
ことを特徴とする請求項8~10のいずれか1項に記載の伸張装置。 - コンピュータに、
データを構成する要素のうちの1つである第1の要素と、前記データにおいて前記要素間の区切りを示す第1の区切り記号との組み合わせを、1つの圧縮符号と対応付けて記憶する記憶装置から、前記データを圧縮して得られる圧縮データから読み出された前記圧縮符号に対応する前記組み合わせを取得し、
取得した前記組み合わせに含まれる前記第1の要素及び前記第1の区切り記号の双方を、前記圧縮データを伸張して得られる伸張データの格納領域に書き込む、
ことを実行させることを特徴とする伸張方法。 - コンピュータに、
データを構成する要素のうちの1つである第1の要素と、前記データにおいて前記要素間の区切りを示す第1の区切り記号との組み合わせを、1つの圧縮符号と対応付けて記憶する記憶装置から、前記データを圧縮して得られる圧縮データから読み出された前記圧縮符号に対応する前記組み合わせを取得し、
取得した前記組み合わせに含まれる前記第1の要素及び前記第1の区切り記号の双方を、前記圧縮データを伸張して得られる伸張データの格納領域に書き込む、
処理を実行させることを特徴とする伸張プログラム。 - 記憶装置と、
データを構成する要素のうちの1つである第1の要素と前記データにおいて前記要素間の区切りを示す区切り記号の組み合わせと、1つの圧縮符号とが対応付けられた変換辞書を記憶する記憶部と、
前記記憶装置から、データを圧縮して得られる圧縮データを受ける受信部と、
前記圧縮データから読み出された前記圧縮符号に対応する前記組み合わせを、前記変換辞書から取得する第1の取得部と、
取得した前記組み合わせに含まれる前記第1の要素及び前記区切り記号の双方を、前記圧縮データを伸張して得られる伸張データの第1の格納領域に書き込む第1の書込部と、
前記第1の格納領域に書き込まれた前記伸長データに対して情報処理を行なう処理部と、
前記情報処理が行なわれた前記伸長データから読み出された前記第1の要素に対応付けて記憶された前記圧縮符号を、前記変換辞書から取得する第2の取得部と、
取得した前記圧縮符号を、前記データの圧縮データを格納する第2の格納領域に書き込む第1の書込部と、
前記第2の格納領域に書き込まれた圧縮データを、前記記憶部に送信する送信部と、を含む情報処理装置と、
を含むことを特徴とする情報処理システム。 - データにおいて要素間の区切りを示す区切り記号と、前記区切り記号によって区切られる1つのデータ要素との組み合わせに対応する1つの圧縮符号を前記データ要素に対応づけた圧縮処理用情報を、前記データ要素の種類ごとに含む構造を有する圧縮辞書データであって、 コンピュータによる圧縮処理において、前記1つのデータ要素に基づいて前記1つの圧縮符号が参照される前記圧縮辞書データ、
を記憶したコンピュータ読み取り可能な記憶媒体。 - データにおいて要素間の区切りを示す区切り記号と、前記区切り記号によって区切られる1つのデータ要素との組み合わせを、前記組み合わせに対応する1つの圧縮符号と対応づけた伸長処理用情報を、前記データ要素の種類ごとに含む構造を有する伸長辞書データであって、
コンピュータによる伸張処理において、前記1つの圧縮符号に基づいて前記組み合わせが参照される前記伸張辞書データ、
を記憶したコンピュータ読み取り可能な記憶媒体。
Priority Applications (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2013382910A AU2013382910B2 (en) | 2013-03-22 | 2013-03-22 | Compression device, compression method, decompression device, decompression method, and information processing system |
KR1020157025611A KR101750646B1 (ko) | 2013-03-22 | 2013-03-22 | 압축 장치, 압축 방법, 신장 장치, 신장 방법 및 정보 처리 시스템 |
JP2015506367A JP6527462B2 (ja) | 2013-03-22 | 2013-03-22 | 圧縮装置、圧縮方法、記録媒体および伸張装置 |
EP13879073.8A EP2978135A4 (en) | 2013-03-22 | 2013-03-22 | COMPRESSION DEVICE, COMPRESSION METHOD, DECOMPRESSION DEVICE, DECOMPRESSION PROCESS AND INFORMATION PROCESSING SYSTEM |
PCT/JP2013/001976 WO2014147671A1 (ja) | 2013-03-22 | 2013-03-22 | 圧縮装置、圧縮方法、伸張装置、伸張方法および情報処理システム |
CN201380074817.7A CN105191144B (zh) | 2013-03-22 | 2013-03-22 | 压缩装置、压缩方法、解压装置、解压方法以及信息处理系统 |
US14/857,553 US9509333B2 (en) | 2013-03-22 | 2015-09-17 | Compression device, compression method, decompression device, decompression method, information processing system, and recording medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2013/001976 WO2014147671A1 (ja) | 2013-03-22 | 2013-03-22 | 圧縮装置、圧縮方法、伸張装置、伸張方法および情報処理システム |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/857,553 Continuation US9509333B2 (en) | 2013-03-22 | 2015-09-17 | Compression device, compression method, decompression device, decompression method, information processing system, and recording medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014147671A1 true WO2014147671A1 (ja) | 2014-09-25 |
Family
ID=51579413
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2013/001976 WO2014147671A1 (ja) | 2013-03-22 | 2013-03-22 | 圧縮装置、圧縮方法、伸張装置、伸張方法および情報処理システム |
Country Status (7)
Country | Link |
---|---|
US (1) | US9509333B2 (ja) |
EP (1) | EP2978135A4 (ja) |
JP (1) | JP6527462B2 (ja) |
KR (1) | KR101750646B1 (ja) |
CN (1) | CN105191144B (ja) |
AU (1) | AU2013382910B2 (ja) |
WO (1) | WO2014147671A1 (ja) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9779312B2 (en) * | 2015-01-30 | 2017-10-03 | Honda Motor Co., Ltd. | Environment recognition system |
TWI625715B (zh) * | 2016-05-31 | 2018-06-01 | 瑞鼎科技股份有限公司 | 顯示驅動裝置及其運作方法 |
US10432217B2 (en) | 2016-06-28 | 2019-10-01 | International Business Machines Corporation | Page filtering via compression dictionary filtering |
CN109582653B (zh) * | 2018-11-14 | 2020-12-08 | 网易(杭州)网络有限公司 | 文件的压缩、解压缩方法及设备 |
US10476518B1 (en) * | 2018-12-06 | 2019-11-12 | Nyquist Semiconductor Limited | Hardware friendly data compression |
US12118041B2 (en) * | 2019-10-13 | 2024-10-15 | Thoughtspot, Inc. | Query execution on compressed in-memory data |
WO2021108787A1 (en) * | 2019-11-30 | 2021-06-03 | Bytedance Inc. | Dictionary based coding of video data |
CN113138968A (zh) * | 2020-01-20 | 2021-07-20 | 普天信息技术有限公司 | 日志压缩方法及日志解压缩方法 |
CN114492322A (zh) * | 2020-10-23 | 2022-05-13 | 晶晨半导体(上海)股份有限公司 | 文本压缩方法、模块、芯片、电子设备和存储介质 |
US20230004533A1 (en) * | 2021-07-01 | 2023-01-05 | Microsoft Technology Licensing, Llc | Hybrid intermediate stream format |
CN115587076B (zh) * | 2022-12-12 | 2023-05-16 | 北京象帝先计算技术有限公司 | 数据解压系统、图形处理系统、组件、设备及解压方法 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH04167821A (ja) * | 1990-10-31 | 1992-06-15 | Fujitsu Ltd | データ符号化及び復号化方法 |
JPH05324427A (ja) * | 1992-05-27 | 1993-12-07 | Hitachi Ltd | 文書情報圧縮装置 |
JPH09212395A (ja) * | 1996-01-30 | 1997-08-15 | Sharp Corp | テキスト圧縮用辞書作成装置およびテキスト圧縮装置 |
JPH11177438A (ja) * | 1997-12-12 | 1999-07-02 | Toyota Central Res & Dev Lab Inc | 情報変換装置 |
JP2010093414A (ja) | 2008-10-06 | 2010-04-22 | Fujitsu Ltd | 情報処理プログラム、情報処理装置、および情報処理方法 |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0160672A4 (en) * | 1983-10-19 | 1986-05-12 | Text Sciences Corp | METHOD AND DEVICE FOR COMPRESSING DATA. |
US5590317A (en) | 1992-05-27 | 1996-12-31 | Hitachi, Ltd. | Document information compression and retrieval system and document information registration and retrieval method |
US5325091A (en) * | 1992-08-13 | 1994-06-28 | Xerox Corporation | Text-compression technique using frequency-ordered array of word-number mappers |
US5710919A (en) * | 1995-09-29 | 1998-01-20 | Electronic Data Systems Corporation | Record compression |
US5999949A (en) * | 1997-03-14 | 1999-12-07 | Crandall; Gary E. | Text file compression system utilizing word terminators |
JP3337633B2 (ja) | 1997-12-03 | 2002-10-21 | 富士通株式会社 | データ圧縮方法及びデータ復元方法並びにデータ圧縮プログラム又はデータ復元プログラムを記録したコンピュータ読み取り可能な記録媒体 |
JP3511901B2 (ja) | 1998-07-01 | 2004-03-29 | 株式会社日立製作所 | 情報処理装置および情報処理システム |
JP4774145B2 (ja) * | 2000-11-24 | 2011-09-14 | 富士通株式会社 | 構造化文書圧縮装置および構造化文書復元装置並びに構造化文書処理システム |
JP2004518327A (ja) * | 2001-01-11 | 2004-06-17 | コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ | 逆進的にストリングを参照する為の識別子を用いたデータ圧縮方法 |
US20030123841A1 (en) * | 2001-12-27 | 2003-07-03 | Sylvie Jeannin | Commercial detection in audio-visual content based on scene change distances on separator boundaries |
US7624021B2 (en) * | 2004-07-02 | 2009-11-24 | Apple Inc. | Universal container for audio data |
US8554745B2 (en) * | 2009-04-27 | 2013-10-08 | Netapp, Inc. | Nearstore compression of data in a storage system |
CN101807926B (zh) * | 2010-01-21 | 2013-01-23 | 上海电力学院 | 低功耗soc测试数据压缩编码方法 |
CN101944362B (zh) * | 2010-09-14 | 2012-05-30 | 北京大学 | 一种基于整形小波变换的音频无损压缩编码、解码方法 |
CN102480335B (zh) * | 2010-11-30 | 2015-08-05 | 金蝶软件(中国)有限公司 | 一种业务数据的发送方法及系统 |
US8560508B2 (en) * | 2011-07-22 | 2013-10-15 | International Business Machines Corporation | Real-time compression of tabular data |
-
2013
- 2013-03-22 WO PCT/JP2013/001976 patent/WO2014147671A1/ja active Application Filing
- 2013-03-22 AU AU2013382910A patent/AU2013382910B2/en not_active Ceased
- 2013-03-22 JP JP2015506367A patent/JP6527462B2/ja active Active
- 2013-03-22 KR KR1020157025611A patent/KR101750646B1/ko active IP Right Grant
- 2013-03-22 CN CN201380074817.7A patent/CN105191144B/zh active Active
- 2013-03-22 EP EP13879073.8A patent/EP2978135A4/en not_active Ceased
-
2015
- 2015-09-17 US US14/857,553 patent/US9509333B2/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH04167821A (ja) * | 1990-10-31 | 1992-06-15 | Fujitsu Ltd | データ符号化及び復号化方法 |
JPH05324427A (ja) * | 1992-05-27 | 1993-12-07 | Hitachi Ltd | 文書情報圧縮装置 |
JPH09212395A (ja) * | 1996-01-30 | 1997-08-15 | Sharp Corp | テキスト圧縮用辞書作成装置およびテキスト圧縮装置 |
JPH11177438A (ja) * | 1997-12-12 | 1999-07-02 | Toyota Central Res & Dev Lab Inc | 情報変換装置 |
JP2010093414A (ja) | 2008-10-06 | 2010-04-22 | Fujitsu Ltd | 情報処理プログラム、情報処理装置、および情報処理方法 |
Non-Patent Citations (1)
Title |
---|
See also references of EP2978135A4 * |
Also Published As
Publication number | Publication date |
---|---|
KR20150119402A (ko) | 2015-10-23 |
AU2013382910B2 (en) | 2017-01-05 |
US9509333B2 (en) | 2016-11-29 |
JP6527462B2 (ja) | 2019-06-05 |
EP2978135A1 (en) | 2016-01-27 |
US20160006454A1 (en) | 2016-01-07 |
AU2013382910A1 (en) | 2015-10-08 |
JPWO2014147671A1 (ja) | 2017-02-16 |
EP2978135A4 (en) | 2016-06-01 |
CN105191144A (zh) | 2015-12-23 |
KR101750646B1 (ko) | 2017-06-23 |
CN105191144B (zh) | 2019-01-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2014147671A1 (ja) | 圧縮装置、圧縮方法、伸張装置、伸張方法および情報処理システム | |
AU2016200550B2 (en) | Encoding program, decompression program, compression method, decompression method, compression device and decompression device | |
CN107506153B (zh) | 一种数据压缩方法、数据解压方法及相关系统 | |
WO2014147672A1 (ja) | 圧縮装置、圧縮方法、辞書生成装置、辞書生成方法、伸長装置、伸長方法、伸長プログラムおよび情報処理システム | |
US10671306B2 (en) | Chunk-based data deduplication | |
US20170302292A1 (en) | Computer-readable recording medium, encoding device, and encoding method | |
US10521414B2 (en) | Computer-readable recording medium, encoding method, encoding device, retrieval method, and retrieval device | |
JP6032292B2 (ja) | 圧縮プログラム、圧縮装置、伸張プログラムおよび伸張装置 | |
US9479195B2 (en) | Non-transitory computer-readable recording medium, compression method, decompression method, compression device, and decompression device | |
US9424293B2 (en) | Row, table, and index compression | |
US9219497B2 (en) | Compression device, compression method, and recording medium | |
US8463759B2 (en) | Method and system for compressing data | |
US9496895B2 (en) | Compression method and decompression method | |
US12038888B2 (en) | Storage control apparatus and method | |
JP2010287167A (ja) | アーカイブストレージ装置、ストレージシステム、データ格納方法、およびデータ格納プログラム | |
US20160210304A1 (en) | Computer-readable recording medium, information processing apparatus, and conversion process method | |
CN109271463A (zh) | 一种恢复MySQL数据库的innodb压缩数据的方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 201380074817.7 Country of ref document: CN |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13879073 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 20157025611 Country of ref document: KR Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2013879073 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2015506367 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2013382910 Country of ref document: AU Date of ref document: 20130322 Kind code of ref document: A |