JP4081236B2

JP4081236B2 - Database processing method

Info

Publication number: JP4081236B2
Application number: JP2000396198A
Authority: JP
Inventors: 宏二伊藤
Original assignee: Digital Works Inc
Current assignee: Digital Works Inc
Priority date: 2000-12-26
Filing date: 2000-12-26
Publication date: 2008-04-23
Anticipated expiration: 2020-12-26
Also published as: JP2002197099A

Description

【０００１】
【産業上の利用分野】
本発明は、データベースの高速処理、とくに検索エンジンや膨大なキー項目を備えるクロス集計データ管理システムに適用可能なデータ処理技術に関する。
【０００２】
【技術の背景】
汎用コンピュータを用いたデータベース作成とその運営は、近年とくにデータ件数の増大から、従来公知の基本的な処理方式では対応が難しくなっており、より高速な演算処理を可能とする技術が望まれている。
【０００３】
膨大なデータを取り扱う際の技術的な困難性は、例えば、▲１▼取り扱うデータの種類によって処理すべきアルゴリズムが異なること、▲２▼多次元処理をするか否かによって処理方式の基本原理を異にすること、▲３▼使用すべきアルゴリズム（原理）はコンピュータの性能に依存すること、等の点にある。
【０００４】
▲１▼に関して云えば、取り扱うデータが文字列を主とするか、数字を主とするかによって処理技術を異にする。例えば人事データのように、部署名、氏名、性別、趣味、家族構成、社内評価内容の詳細など、主として文字（ひらがな、漢字、カタカナ）をもってレコード・データを作成し必要に応じて加工する場合と、チェーン店の各店舗における品名ごとの売り上げ台数、売上金額といった数字データを扱うケースでは、コンピュータ内部における処理アルゴリズムを異にすることは当然である。これはデータベースの基本であり、例えば階層型、関係型、網型などのデータの基本レコードの格納方法からしても構築方法を異にする。
【０００５】
▲２▼の点は、膨大なデータファイルのレコードを、どのように取り扱うかの問題である。例えば、データ作成時に数値の加減乗除といった演算処理を行い、単純なレコード・ファイルとしてメモリに格納した後は、随時当該レコード・ファイルを呼び出して確認する等の処理をするだけで良いならば、その場合の課題はデータ作成時の処理の高速化に絞られる。一方、どのような加工をするかを前提とせず、とりあえず事実（ファクト；主として数字）データだけをレコードし、この事実ファイルに基づいて後日の多様な加工（射影／選択／結合等）を施すのであれば、データの処理のための原理は自ずから異なり、より高速なキー項目の検索と数値集計を実現し、任意の項目次元において自由な集計結果を高速で取得する点に絞られる。その場合には、基本となるレコード・ファイルの作成はより単純であることが望まれる。
【０００６】
▲３▼の点は、従来提案された特許技術および今後期待されるデータベース処理との相関に於いて重要である。データ種別がどのような種類であれ、データの加工処理速度は高速の並列型ＣＰＵを多数搭載したり巨大なメモリを搭載した大型コンピュータを使用すれば、ほぼ問題なく解決できる。しかしながら、いわゆるスーパーコンピュータはきわめて高額であり、一般の実用レベルではそれを常時使用する環境は期待できない。むしろ現在の汎用コンピュータを利用し、膨大なレコード・ファイル（オリジナルデータ）を応答性の高いレベルで加工し、任意の要求に応答する自在な多次元加工結果を得ることが望まれる。
【０００７】
【従来の技術】
本発明は、例えば１００万件を越える膨大なオリジナルのデータファイルを自在に加工して任意の次元におけるクロス集計結果を短時間で得ることを目的とする。この場合、処理すべきデータファイルは、文字を主とするキー項目（本明細書においてキー値；店舗名等である）と、数字を主とするキー項目（本明細書において項目；数字を主とするファクト）を扱うことになる。
【０００８】
つまり店舗名や商品名といった文字列から構成されるキー項目と、売上台数、金額といった数字から構成されるキー項目を、自由な形式で記録ファイルとして蓄積し、蓄積した膨大なファイルを自由な次元で加工して集計結果を即時に得ることが目的である。
【０００９】
このような技術傾向をもったデータベースの処理方式は、もちろん、従来においてもいくつかの提案がある。第一に、クロス集計処理に目的を絞った処理方式としては、重みコードを使用して処理の高速化を図る技術がある（特開平８−７７２６６号）。これは、あるテーブルのデータの意味を包含する他のテーブルに相対的に低いレベルの重みコードを付与しておき、検索条件が与えられたときに重みコードを利用して必要なファクトを抽出する。具体的には、集団の中で、誰が、いつ、なにを食べ、その材料は何であったか、その値段はいくらであったか、といったデータを集計するときに、人名、日付、メニュー、その金額、材料、その金額、といった具合に、データの意味を階層化し、それらを事実データとして格納しておき、必要に応じて適当な階層領域のデータを重みコードを用いて抽出するものである。
【００１０】
しかし、この技術はデータの意味に上下の関連づけを行うため、システム設計者の主観に起因する勝手な上下関係が発生したり、もともと関連づけの出来ない項目についての処理が出来なくなる等、汎用システムとして利用しにくい面が残る。
【００１１】
第二に、クロス集計処理を直接の目的とするものではないが、膨大な文字データの中からキーワードを高速検索する技術は、多数存在する。例えば、
▲１▼特開平１１−０３１０９６号
インデックスを作成することなくデータを検索する技術
▲２▼特開平１１−００３３５４号
項目の従属関係を作り、従属関係から検索を行う技術
▲３▼特開平１０−１４３５３８号
インデックス・キー値をデータの識別子に関連づける方式
▲４▼特開平１０−００３４１４号
利用頻度の高いキー値を予め定義し格納しておく方式
▲５▼特開平９−１６７１１１号
重複するインデックスのキー値をクラスタリング処理する技術
等である。しかしながら、これらの技術は、例えばインターネット上に存在する膨大な文字データから検索キーを用いて、効率の良い結果を得るには適しているが、階層化されたデータファイルのすべてを検索するといった深度検索の能力はないし、オペレータの主観操作によるデータ保存がなされている場合は、非常に使い勝手の悪いシステムとなりやすい。
【００１２】
これらのデータ処理技術をふまえて、オリジナルデータを自在に加工しうる従来の公知技術を検討すると、特開平５−２５７７７４号を挙げることが出来る。これは、検索キー値と圧縮されたIRN（インデックス・レコード番号）をメモリ上に保有し、検索条件（キー項目の指定）に基づいて、検索キー値からIRNを抽出し、圧縮されたIRNを復元するものである。
【００１３】
【発明が解決しようとする課題】
問題は、項目の検索やクロス集計処理を行う場合の処理時間である。
たしかに特開平５−２５７７７４号の技術によれば、数千件〜１万件程度の比較的少ない件数のオリジナルデータを扱う場合には、圧縮したIRNを用いて任意の項目を比較的高速で検索し、項目に対応する数字の加減乗除演算も比較的高速で処理できる。しかしながら、オリジナルデータが膨大な数になったとき、例えば各チェーン店からリアルタイムで刻々と報告される多数の商品の売り上げデータを長期に渡って継続管理するようなケースを考える場合、単にIRNを圧縮するだけでは、処理能力には限界が生ずる。IRNの圧縮は、主としてファクト（数字）の圧縮であり、新商品の名称など、膨大に広がる可能性のある文字データ（項目）の部分については圧縮処理を施さないからである。
【００１４】
もちろん文字データの圧縮は容易であるし、特開平５−２５７７７４号の発明においても文字データを圧縮することも可能である。しかし、その場合の問題は、圧縮した文字データを高速で検索し且つ復元する手段が存在しない点にある。従って、この特開平５−２５７７７４号の発明では、数千件単位の少ないレコードを管理できるにとどまり、例えば１００万件を越えるような膨大な数のオリジナルデータを自由な視点で高速処理するパワーをもたない。
【００１５】
そこで本発明の目的は、第一に、多数のキー値と項目からなる膨大なオリジナルデータを効率よく格納する点、第二に、格納したデータに基づいて高速でオリジナルデータを復元可能とする点にある。
【００１６】
【課題を解決するための手段】
前記目的を達成するため、本発明に係るデータベースの処理方法は、第一に、オリジナルデータに基づき、項目毎にキー値を圧縮して圧縮復元インデックスを作成し保存する段階と、圧縮復元インデックスに基づき、項目毎に圧縮キーを書き込んだ圧縮結果セットキャッシュを作成し保存する段階とを備える。
第二に、検索条件が与えられたときに、前記圧縮結果セットキャッシュに基づいて圧縮キーを検索（探索）する段階と、検索した圧縮キーに基づいて前記圧縮復元インデックスからオリジナルのキー値を読み出す段階とを備える。尚、圧縮結果セットキャッシュに基づいて圧縮キーを検索する段階において、圧縮キーの階層構造を作成すれば多次元のクロス集計をきわめて高速で処理することが出来る。
【００１７】
【作用】
本発明は、刻々追加／更新されるような膨大なオリジナルデータを、効率よく格納し、随時高速検索（または集計）可能とするため以下の処理を行う。
【００１８】
▲１▼ オリジナルデータに基づき、キー値を項目毎に圧縮して圧縮復元インデックスを作成し保存する段階。これは実際の多次元集計時に絶対に必要となるオリジナルのキー値を、検索可能な状態で加工圧縮し、高速検索と復元が可能な状態で保存しておくための処理である。この圧縮復元インデックスは、検索用の階層構造としては例えばＰ木階層構造、Ｂ木階層構造、Ｔ木階層構造、ハッシュ階層構造など、従来公知のいずれの階層手段でも構わない。次段処理および検索の高速のという点で云えばＰ木構造が好ましい。
【００１９】
▲２▼ 圧縮復元インデックスに基づき、項目毎に圧縮キーを書き込んだ圧縮結果セットキャッシュを作成し保存する段階。これは、▲１▼の圧縮復元インデックスで得られた圧縮キー値を項目毎に書き込んでおき、爾後における演算の自由性と検索の高速を実現するための処理段階である。多数の項目（店舗名、商品名、日付、金額等）を、▲１▼の処理段で得た圧縮キー（ノードＩＤ）を用いて保存しておくことにより、データ容量を著しく減らし、保存効率を最適化する。また、これらの処理によって、検索効率も飛躍的に向上する。
【００２０】
▲３▼ 検索条件が与えられたときには、前記▲１▼および▲２▼の処理に加えて、以下の処理を行う。すなわち、▲２▼の圧縮結果セットキャッシュに基づいて圧縮キーを検索する段階。圧縮キーの検索は高速に行うことが出来るため、単一のまたは複数の項目の検索は非常に容易であり、簡単な構成で検索エンジンを生み出すことが出来る。また、この処理段において、検索するべき圧縮キーを階層構造をもって処理したときには（請求項２）、後段処理と相俟って任意次元の集計を高速で行うことが可能となる。圧縮結果セットキャッシュのデータは圧縮キー（ノードＩＤ）なので、検索用階層手段を作成する所要時間は現在の汎用コンピュータにおいてもせいぜい１秒以下である。尚、▲１▼の圧縮復元インデックスや▲２▼の圧縮結果セットキャッシュからダイレクトにオリジナルデータを復元させることも原理的には可能であるが、圧縮データとオリジナルデータとの間の呼び出し回数が増加するなど、高速処理は困難となる。
【００２１】
▲４▼ 探索した圧縮キーに基づいて、圧縮復元インデックスからオリジナルのキー値を読み出す段階。▲１▼の圧縮復元インデックスに含まれるオリジナルのキー値は、▲３▼の圧縮キーの検索によって随時復元可能なデータとして処理できる。圧縮復元インデックスは項目毎に作成されているので、項目が指定されればオリジナルのキー値は直ちに復元さるからである。
【００２２】
以上のように本発明は、各項目毎にの圧縮キーを格納した圧縮結果セットキャッシュを必須の条件として、その前段にキー値を項目毎に圧縮した圧縮復元インデックスを備えてなる。検索時には、圧縮キーを検索した上で、圧縮復元インデックスからオリジナルのキー値を復元する。このような処理方法は、従来の多次元クロス集計等の処理方法に較べると処理段が増加するが、大量データを効率よく保存し、実行時における演算処理速度を確実に高速化できる。二段階の圧縮処理と、各段における生成ファイルの関係が汎用コンピュータの処理能力範囲内において実現でき、検索時に扱うデータ量が飛躍的に減少するからである。
【００２３】
【発明の実施の形態】
図１は、本発明に係るデータベースの高速処理方法の全体の原理を示すものである。
【００２４】
１は、多数の項目を含んだオリジナルデータ、１０は、オリジナルデータに基づいて作成する圧縮復元インデックス、２０は、圧縮復元インデックス１０に基づいて作成した圧縮結果セットキャッシュ、３０は、検索条件Ｃが与えられたときに、圧縮結果セットキャッシュ２０に基づいて作成される検索用階層構造である。
【００２５】
図２は、オリジナルデータ１と、圧縮復元インデックス１０との関係を例示するものである。この実施形態では、オリジナルデータ１は、店舗、日付、商品、数量、金額という５つの項目（カラム）をもっており、それぞれの項目は、それぞれユニークな値（キー値；カラム値）をもつ。また、この実施形態では例えば店舗を項目１、日付を項目２、商品を項目３・・として項目を区別してある。そして項目１の店舗Ａを例にとると、このキー値は文字「Ａ」である。さらに具体的に云えば店舗名が「霞が関商店」の場合、この項目のキー値は文字「霞が関商店」である。同様に、店舗「Ａ」の日付（項目２）のキー値は数字列「００，０４，０１」、商品（項目３）のキー値は文字「１」である。この例では商品１としてあるが実際には具体的な商標名を構成する文字／数字等がキー値となる。尚、オリジナルデータ１は、例えば各種の記録媒体やオンライン転送データの形式で受けとる。必要に応じてデータ形式を整えるための通常の前処理を施す。
【００２６】
オリジナルデータ１に基づいて作成する圧縮復元インデックス１０は、項目毎に作成する。つまり項目１であれば店舗、項目２であれば日付について、オリジナルのキー値を圧縮し保存する。ここで作成する圧縮復元インデックス１０は、少なくとも、オリジナルのキー値と圧縮キーを備え、復元時の高速性を確保するために適宜の階層構造をもっていることが望まれる。例えばＰ木構造、Ｂ木構造、Ｔ木構造、ハッシュ構造等である。これらの作成原理は公知であが、高速性の点ではＰ木構造が望ましい。作成した圧縮復元インデックスは、サーバ装置の適宜の格納場所（メモリ）にデータ保存しておく。書き換え可能としておくことが望ましい。
【００２７】
ところで、検索用の各種のツリー構造は従来公知であるが、任意データ長のオリジナルキーを検索用階層構造に配置するための手段は従来知られていない。しかし例えば、オリジナルのキー値をビット添字評価手法に基づいて各ノードに圧縮格納すると同時に、ビット添字の評価順序をビットストリームのＬＳＢからＭＳＢに向かって昇順で処理することにより無限ビット長の処理が可能になる。勿論、扱う各項目のビット長が十分短い場合には、このような処理を行う必要はない。
【００２８】
図３は、圧縮復元インデックス１０と、圧縮結果セットキャッシュ２０との関係を例示する図である。圧縮結果セットキャッシュ２０は、圧縮復元インデックス１０の作成段階で取得できる圧縮キー値を、項目毎に格納する手段である。図示した圧縮結果セットキャッシュ２０の項目１のファイルには、項目１の圧縮キー値が書き込まれている。例えば圧縮キー値「０」が「店舗Ａ」に該当し、圧縮キー値「１」が「店舗Ｂ」に該当する。この項目１のファイルデータは、項目１だけを格納しているものであるが、他のファイルには項目２、項目３・・・が同様の手法で書き込まれている。この圧縮結果セットキャッシュ２０は、サーバ装置等の適宜の格納フィールドに永久保存しておく。尚、書き換え可能とすることが望ましい。
【００２９】
図４は、ある検索条件が与えられた場合に、圧縮結果セットキャッシュ２０に基づいて作成する検索用階層構造３０を例示するものである。この検索用階層構造３０は、実行時演算方式を採用する本発明において必須の処理段階である。
【００３０】
検索用階層構造３０は、圧縮結果セットキャッシュ２０を構成する各項目の圧縮キーに基づいて、任意の次元で検索用階層構造を作成する。ここで用いる圧縮キーは、オリジナルのキー値を前段処理において格段に圧縮させたものであるから、階層構造（３０）の作成は通常瞬時（１秒以下）で行うことが出来る。
【００３１】
検索用階層構造３０も、圧縮復元インデックス１０と同様に、検索の信頼性と高速性の条件を満たす限り構造の原理を問わない。例えばＰ木構造、Ｂ木構造、Ｔ木構造、ハッシュ構造等が利用できる。但し、作成処理と検索効率を考慮すると現在のところはＰ木構造が優れる。
【００３２】
また検索用階層構造３０は、当該項目（例えば店舗）の圧縮キーに基づいて、指定条件に含まれる他項目（例えば日付と商品）の圧縮キーを同時に呼び出す機能を備えることが望ましい。勿論、検索用階層構造３０の本質的機能は、作成された階層構造（３０）から圧縮復元インデックス１０にアクセスし、オリジナルのキー値を復元させることであるから、他の項目は時分割処理等によって別途に行うことも可能であり、同時処理を絶対の条件とするものではない。
【００３３】
しかし、検索用階層構造３０の作成時に取り扱うデータ量は極めて少ないので、指定条件に含まれる他の次元項目を同時に呼び出して階層構造内に配置し（総合圧縮項目値）、圧縮復元インデックス１０から必要なオリジナルのキー値を得ることは極めて容易である。
【００３４】
従って、かかる多次元データベースの高速処理方法によれば、刻々増加する多数の項目とキー値を、圧縮結果セットキャッシュ２０の作成段階を経て集積し、随時、任意の次元で高速に集計できる。本発明の特徴は、オリジナルデータから圧縮復元インデックス１０を作成し、これに基づいて項目毎の圧縮結果セットキャッシュを作成する点、および、検索条件が与えられた時には圧縮結果セットキャッシュを用いて適宜の検索用階層を作成し、これを介してオリジナルのキー値を備える圧縮復元インデックス１０にアクセスして、高速でオリジナルのキー値を再現する点にある。
【００３５】
これらの処理の流れは、すべて本発明の要旨を構成するが、圧縮キー値を項目毎にファイルする圧縮結果セットキャッシュ２０の処理を除けば、他の処理段階における具体的手法は特に限定されない。例えば圧縮復元インデックス１０は、オリジナルのユニークなキー値を適宜の形式でデータ圧縮できれば良く、検索用階層構造３０も、圧縮結果セットキャッシュ２０に含まれる圧縮キー値の性質に応じて適宜の検索構造をとることが出来る。
【００３６】
尚、本発明において圧縮加工の処理を施すべきオリジナルデータ１のフォーム（データ形式）が統一されていない場合には、圧縮復元インデックス１０の作成に先立って、データ形式を本発明に係る処理システムにおいて取り扱い可能な形式に変換処理しておくことが望ましい。
【００３７】
【発明の効果】
以上説明したように本発明によれば、膨大な件数のレコードファイルを効率よく保存し、高速検索および任意次元でのクロス集計を高速で実現できる。
【図面の簡単な説明】
【図１】本発明に係る処理方法の全体の原理を示す図である。
【図２】本発明に係る圧縮復元インデックスの作成を例示する図である。
【図３】本発明に係る圧縮結果セットキャッシュの作成を例示する図である。
【図４】本発明に係る検索用階層構造の作成を例示する図である。
【符号の説明】
１オリジナルデータ
１０圧縮復元インデックス
２０圧縮結果セットキャッシュ
３０検索用階層構造[0001]
[Industrial application fields]
The present invention relates to high-speed database processing, and more particularly to a data processing technique applicable to a search engine and a cross tabulation data management system having a large number of key items.
[0002]
[Technical background]
The creation and management of databases using general-purpose computers has become difficult to handle with the conventionally known basic processing methods due to the increase in the number of data in recent years, and a technology that enables higher-speed arithmetic processing is desired. Yes.
[0003]
The technical difficulties in handling huge amounts of data are, for example, (1) the algorithm to be processed differs depending on the type of data to be handled, and (2) the basic principle of the processing method depending on whether or not to perform multidimensional processing. The difference is that (3) the algorithm (principle) to be used depends on the performance of the computer.
[0004]
Regarding (1), the processing technique differs depending on whether the data handled is mainly a character string or a number. For example, record data is created mainly with characters (Hiragana, Kanji, Katakana) such as department data, department name, name, gender, hobby, family structure, details of internal evaluation contents, etc. In the case of handling numerical data such as the number of units sold for each product name and the sales amount at each chain store, it is natural that the processing algorithm inside the computer is different. This is the basis of a database. For example, the construction method differs depending on the storage method of the basic record of data such as hierarchical type, relational type, and network type.
[0005]
The point {circle around (2)} is how to handle a huge data file record. For example, after performing arithmetic processing such as addition, subtraction, multiplication and division at the time of data creation and storing it in a memory as a simple record file, if you only need to perform processing such as calling and checking the record file at any time, The problem in this case is limited to speeding up the processing at the time of data creation. On the other hand, since it is not premised on what kind of processing is performed, only fact (facts; mainly numbers) data is recorded for the time being, and various processing (projection / selection / combination, etc.) is performed at a later date based on this fact file. If there is, the principle for data processing is naturally different, and it can be narrowed down to achieve faster key item retrieval and numerical aggregation, and to obtain free aggregation results at any item dimension at high speed. In that case, the creation of the basic record file is desired to be simpler.
[0006]
The point (3) is important in the correlation with the conventionally proposed patent technology and the database processing expected in the future. Regardless of the type of data, the data processing speed can be solved almost without any problem by using a large computer equipped with a large number of high-speed parallel CPUs or a huge memory. However, so-called supercomputers are extremely expensive, and an environment where they are always used cannot be expected at a general practical level. Rather, it is desired to use a current general-purpose computer to process a huge amount of record files (original data) at a highly responsive level and obtain a free multidimensional processing result that responds to any request.
[0007]
[Prior art]
An object of the present invention is to obtain a cross tabulation result in an arbitrary dimension in a short time by freely processing, for example, an enormous amount of original data files exceeding 1 million cases. In this case, the data file to be processed includes key items mainly including characters (in this specification, key values; store names, etc.) and key items mainly including numbers (in this specification, items; Will be handled.
[0008]
In other words, key items composed of character strings such as store names and product names, and key items composed of numbers such as the number of units sold and the amount of money are stored as recording files in a free format. The purpose is to obtain the total results immediately after processing.
[0009]
There are, of course, several proposals for database processing methods with such technical trends. First, as a processing method focused on the cross tabulation process, there is a technique for speeding up the process using a weight code (Japanese Patent Laid-Open No. 8-77266). This is because a weight code of a relatively low level is assigned to another table including the meaning of data of a certain table, and a necessary fact is extracted using the weight code when a search condition is given. . Specifically, when collecting data such as who, what, what was eaten, what the ingredients were, and how much the price was in the group, the name, date, menu, amount, material The meaning of the data is hierarchized and stored as fact data, and data in an appropriate hierarchical area is extracted using a weight code as necessary.
[0010]
However, since this technology associates the meaning of data up and down, as a general-purpose system, such as a self-reliable vertical relationship caused by the system designer's subjectivity, or the processing of items that cannot be associated originally becomes impossible It remains difficult to use.
[0011]
Secondly, although there is no direct purpose of cross tabulation processing, there are many techniques for high-speed keyword search from a large amount of character data. For example,
(1) Japanese Patent Laid-Open No. 11-031096 A technique for retrieving data without creating an index (2) A technique for creating a subordinate relationship of items of Japanese Patent Laid-Open No. 11-003354 and performing a search from the subordinate relationship (3) Japanese Patent Laid-Open No. 10- No. 143538 Index key value associating with data identifier (4) Japanese Patent Laid-Open No. 10-003414 A method for predefining and storing key values that are frequently used (5) Japanese Patent Laid-Open No. 9-167111 A technique for clustering key values. However, these techniques are suitable for obtaining an efficient result using a search key from a large amount of character data existing on the Internet, for example, but the depth of searching all hierarchical data files. There is no search capability, and if data is stored by the subjective operation of the operator, the system tends to be very inconvenient.
[0012]
Considering these data processing techniques, a conventional publicly known technique capable of freely processing original data can be cited as disclosed in JP-A-5-257774. This holds the search key value and the compressed IRN (index record number) in the memory, extracts the IRN from the search key value based on the search condition (key item specification), and uses the compressed IRN It is something to restore.
[0013]
[Problems to be solved by the invention]
The problem is the processing time when searching for items and performing cross tabulation processing.
Certainly, according to the technique disclosed in Japanese Patent Laid-Open No. 5-257774, when handling a relatively small number of original data of about several thousand to 10,000, any item can be searched at a relatively high speed using a compressed IRN. In addition, addition / subtraction / multiplication / division calculation of numbers corresponding to items can be processed at a relatively high speed. However, when there is a huge amount of original data, for example, when considering the case of continuously managing sales data for a large number of products reported in real time from each chain store over a long period of time, simply compress the IRN. Simply doing this will limit your processing power. This is because compression of IRN is mainly compression of facts (numbers), and compression processing is not performed on portions of character data (items) that may spread enormously, such as names of new products.
[0014]
Of course, the compression of the character data is easy, and the character data can also be compressed in the invention of JP-A-5-257774. However, the problem in that case is that there is no means for searching and restoring compressed character data at high speed. Therefore, in the invention of Japanese Patent Laid-Open No. 5-257774, only a few thousand records can be managed. For example, the power to process a huge number of original data exceeding 1 million at high speed from a free viewpoint. There is no waste.
[0015]
Therefore, the object of the present invention is to firstly store an enormous amount of original data consisting of a large number of key values and items efficiently, and secondly to be able to restore the original data at high speed based on the stored data. It is in.
[0016]
[Means for Solving the Problems]
In order to achieve the above object, a database processing method according to the present invention includes firstly a step of compressing a key value for each item based on original data, creating a compressed decompression index, and storing the compressed decompression index. And generating and storing a compression result set cache in which a compression key is written for each item.
Second, when a search condition is given, a step of searching (searching) a compression key based on the compression result set cache, and reading an original key value from the compression / decompression index based on the searched compression key. Stages. Note that, when a compression key is searched based on the compression result set cache, a multi-dimensional cross tabulation can be processed at a very high speed by creating a hierarchical structure of the compression keys.
[0017]
[Action]
The present invention performs the following processing in order to efficiently store a large amount of original data that is constantly added / updated and enable high-speed search (or aggregation) as needed.
[0018]
(1) A step of creating and storing a compression / decompression index by compressing a key value for each item based on original data. This is a process for processing and compressing original key values that are absolutely necessary at the time of actual multidimensional tabulation in a state where they can be searched and stored in a state where they can be quickly searched and restored. The compression / decompression index may be any conventionally known hierarchical means such as a P-tree hierarchical structure, a B-tree hierarchical structure, a T-tree hierarchical structure, or a hash hierarchical structure as a search hierarchical structure. A P-tree structure is preferable in terms of high-speed subsequent processing and retrieval.
[0019]
(2) A step of creating and storing a compression result set cache in which a compression key is written for each item based on a compression / decompression index. This is a processing stage for writing the compression key value obtained with the compression / decompression index of (1) for each item to realize the freedom of computation and the high speed of retrieval after the operation. By storing a large number of items (store name, product name, date, amount, etc.) using the compression key (node ID) obtained in the processing stage (1), the data capacity is significantly reduced and the storage efficiency is reduced. To optimize. In addition, the search efficiency is greatly improved by these processes.
[0020]
(3) When a search condition is given, the following processing is performed in addition to the processing of (1) and (2). That is, a step of retrieving a compression key based on the compression result set cache of (2). Since the search for the compression key can be performed at high speed, searching for a single item or a plurality of items is very easy, and a search engine can be created with a simple configuration. Further, in this processing stage, when the compressed key to be searched is processed with a hierarchical structure (claim 2), it becomes possible to perform summarization in an arbitrary dimension at a high speed in combination with the subsequent stage processing. Since the data of the compression result set cache is a compression key (node ID), the time required for creating the search hierarchy means is at most 1 second even in the current general-purpose computer. Although it is theoretically possible to restore the original data directly from the compression / decompression index (1) or the compression result set cache (2), the number of calls between the compressed data and the original data increases. For example, high-speed processing becomes difficult.
[0021]
(4) A step of reading the original key value from the compression / decompression index based on the searched compression key. The original key value included in the compression / decompression index (1) can be processed as data that can be restored at any time by searching the compression key (3). This is because the compression / decompression index is created for each item, and if the item is specified, the original key value is immediately restored.
[0022]
As described above, the present invention is provided with a compression / decompression index in which the key value is compressed for each item in the preceding stage, with the compression result set cache storing the compression key for each item as an essential condition. At the time of retrieval, the original key value is restored from the compression / decompression index after retrieving the compression key. Such a processing method increases the number of processing steps as compared with a conventional processing method such as multidimensional cross tabulation, but can efficiently store a large amount of data and reliably increase the processing speed at the time of execution. This is because the relationship between the two-stage compression processing and the generated file at each stage can be realized within the processing capability range of the general-purpose computer, and the amount of data handled at the time of search is dramatically reduced.
[0023]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 shows the overall principle of a high-speed database processing method according to the present invention.
[0024]
1 is original data including many items, 10 is a compression / decompression index created based on the original data, 20 is a compression result set cache created based on the compression / decompression index 10, and 30 is a search condition C. This is a search hierarchical structure created based on the compression result set cache 20 when given.
[0025]
FIG. 2 illustrates the relationship between the original data 1 and the compression / decompression index 10. In this embodiment, the original data 1 has five items (columns) of store, date, product, quantity, and amount, and each item has a unique value (key value; column value). Further, in this embodiment, for example, the item is distinguished by setting the store as item 1, the date as item 2, and the merchandise as item 3. Taking the store A of item 1 as an example, this key value is the letter “A”. More specifically, when the store name is “Kasumigaseki Shoten”, the key value of this item is the character “Kasumigaseki Shoten”. Similarly, the key value of the date of the store “A” (item 2) is the numeric string “00, 04, 01”, and the key value of the product (item 3) is the character “1”. In this example, the product 1 is used, but in actuality, letters / numbers or the like constituting a specific trade name are key values. The original data 1 is received in various recording media or online transfer data formats, for example. Perform normal preprocessing to arrange the data format as needed.
[0026]
The compression / decompression index 10 created based on the original data 1 is created for each item. That is, the original key value is compressed and stored for the store in the case of item 1 and the date in the case of item 2. The compression / decompression index 10 created here is preferably provided with at least an original key value and a compression key, and has an appropriate hierarchical structure in order to ensure high speed at the time of decompression. For example, there are a P tree structure, a B tree structure, a T tree structure, a hash structure, and the like. Although these creation principles are known, a P-tree structure is desirable in terms of high speed. The created compression / decompression index is stored in an appropriate storage location (memory) of the server device. It is desirable to make it rewritable.
[0027]
By the way, various tree structures for search are conventionally known, but means for arranging an original key having an arbitrary data length in a search hierarchical structure has not been known. However, for example, the original key value is compressed and stored in each node based on the bit index evaluation method, and at the same time, the process of infinite bit length is achieved by processing the bit index evaluation order in ascending order from the LSB to the MSB of the bit stream. It becomes possible. Of course, when the bit length of each item to be handled is sufficiently short, it is not necessary to perform such processing.
[0028]
FIG. 3 is a diagram illustrating the relationship between the compression / decompression index 10 and the compression result set cache 20. The compression result set cache 20 is means for storing, for each item, a compression key value that can be acquired at the stage of creating the compression / decompression index 10. The compression key value of item 1 is written in the file of item 1 of the compression result set cache 20 shown in the figure. For example, the compression key value “0” corresponds to “Store A”, and the compression key value “1” corresponds to “Store B”. The file data of item 1 stores only item 1, but items 2, item 3,... Are written in the same manner in other files. The compression result set cache 20 is permanently stored in an appropriate storage field such as a server device. Note that it is desirable to be able to rewrite.
[0029]
FIG. 4 illustrates a search hierarchical structure 30 created based on the compression result set cache 20 when a certain search condition is given. This search hierarchical structure 30 is an indispensable processing step in the present invention that employs a runtime calculation method.
[0030]
The search hierarchical structure 30 creates a search hierarchical structure in an arbitrary dimension based on the compression key of each item constituting the compression result set cache 20. Since the compression key used here is an original key value that has been remarkably compressed in the preceding process, the hierarchical structure (30) can be created in a normal instant (1 second or less).
[0031]
Similar to the compression / decompression index 10, the search hierarchical structure 30 may be of any structure principle as long as the search reliability and high-speed conditions are satisfied. For example, a P tree structure, a B tree structure, a T tree structure, a hash structure, or the like can be used. However, considering the creation process and search efficiency, the P-tree structure is excellent at present.
[0032]
The search hierarchical structure 30 preferably has a function of simultaneously calling the compression keys of other items (for example, date and product) included in the designated condition based on the compression key of the item (for example, store). Of course, the essential function of the search hierarchical structure 30 is to access the compression / decompression index 10 from the created hierarchical structure (30) and restore the original key value. However, the simultaneous processing is not an absolute condition.
[0033]
However, since the amount of data handled at the time of creating the search hierarchical structure 30 is extremely small, other dimension items included in the specified condition are simultaneously called and placed in the hierarchical structure (total compressed item value), and required from the compression / decompression index 10 It is very easy to obtain a unique original key value.
[0034]
Therefore, according to such a high-speed processing method for a multi-dimensional database, a large number of items and key values that increase every time are accumulated through the creation stage of the compression result set cache 20, and can be tallied at any speed at any time. A feature of the present invention is that a compression / decompression index 10 is created from original data, and a compression result set cache for each item is created based on the index, and a compression result set cache is appropriately used when a search condition is given. Is created, and the compression / decompression index 10 having the original key value is accessed through this search hierarchy, and the original key value is reproduced at high speed.
[0035]
The flow of these processes constitutes the gist of the present invention, but the specific technique in other processing stages is not particularly limited except for the process of the compression result set cache 20 that files the compression key value for each item. For example, the compression / decompression index 10 only needs to be able to compress the original unique key value in an appropriate format, and the search hierarchical structure 30 also has an appropriate search structure depending on the nature of the compression key value included in the compression result set cache 20. Can be taken.
[0036]
If the form (data format) of the original data 1 to be subjected to compression processing is not unified in the present invention, the data format is set in the processing system according to the present invention prior to the creation of the compression / decompression index 10. It is desirable to convert to a format that can be handled.
[0037]
【The invention's effect】
As described above, according to the present invention, a huge number of record files can be efficiently stored, and high-speed search and cross tabulation in an arbitrary dimension can be realized at high speed.
[Brief description of the drawings]
FIG. 1 is a diagram showing the overall principle of a processing method according to the present invention.
FIG. 2 is a diagram illustrating creation of a compression / decompression index according to the present invention.
FIG. 3 is a diagram illustrating creation of a compressed result set cache according to the present invention.
FIG. 4 is a diagram illustrating creation of a search hierarchical structure according to the present invention.
[Explanation of symbols]
1 Original data 10 Compression / decompression index 20 Compression result set cache 30 Search hierarchical structure

Claims

In a database processing method executed by a computer,
The computer acquires original data including a plurality of records having at least one item having a plurality of key values from a recording medium or as online transfer data, and the original data includes the plurality of items in the item. An ID is uniquely assigned to a key value, the key value and the ID are set as one node, a hierarchical structure composed of the nodes is created, and a node ID list having pointers to the nodes is created. Storing and storing a compressed / decompressed index comprising the node ID list and the hierarchical structure in a storage medium of the computer;
The computer replaces the key value in the item with the corresponding ID for each item of the original data based on the compression / decompression index, creates a compression result set cache, and stores the compression result set cache in a storage medium And storing and storing a compressed result set cache,
When the computer is given a search condition with an arbitrary ID, the computer searches for a matching ID in the compressed result set cache; and
The database processing method comprising: the computer searching the compression / decompression index based on the matched ID and reading the corresponding original key value.

When a search condition is given to the computer to call a plurality of key values at the same time, the computer uniquely assigns an ID to the plurality of keys, and all IDs corresponding to the plurality of key values to be called at the same time are all assigned. Searching for the records we have,
The database processing method according to claim 1, further comprising: searching the compressed decompression index based on the retrieved record and reading the corresponding original key value.