CN102521303B

CN102521303B - Single-table multi-column sequence storage method for a column database

Info

Publication number: CN102521303B
Application number: CN201110392033.1A
Authority: CN
Inventors: 杨尚; 王鸿翔; 冯玉; 李祥凯; 冷建全
Original assignee: Beijing Kingbase Information Technologies Co Ltd
Current assignee: China Electronics Technology Group Jincang Beijing Technology Co ltd
Priority date: 2011-11-30
Filing date: 2011-11-30
Publication date: 2016-08-10
Anticipated expiration: 2031-11-30
Also published as: CN102521303A

Abstract

The invention discloses a single-table multi-column sequence storage method for a column database. The column database includes multiple data tables made up of rows and columns. A data table is divided into multiple column sets; each column set includes one or more columns, and the union of the column sets constitutes the data table. A join index is established between every two column sets. For the two column sets it connects, the join index records the one-to-one correspondence between the storage positions of the columns that belong to the same tuple of the data table. The storage method provided by the invention can improve the query efficiency of a column database and reduce storage space.

Description

Single-table multi-column sequence storage method for a column database

Technical field

The present invention relates to a database storage method, and in particular to a single-table multi-column sequence storage method for a column database. It belongs to the technical field of database storage.

Background art

A relational database is a software system for storing and processing structured data. Its data are divided into two levels: logical data, made up of data tables, records, and so on; and physical data, which represent how the database stores the logical data. There are two ways to implement the physical data of a database: one is based on row storage, the other on column storage.

In the row-storage implementation, each whole record of the logical data is stored in a data block, and indexes such as B+ trees are built on some columns to improve query speed. In the column-storage implementation, records are not mapped one by one onto physical data. Instead, each record is split by column, and the values of the same column of all records are stored together. Join data are also provided so that the column values belonging to one record can be recombined into that record.

As enterprise and e-government applications keep deepening, database applications grow ever more complex. These demands drive database technology toward massive scale and intelligence. Meanwhile, applications such as data warehousing and online analysis urgently need efficient real-time data processing. Traditional database technology based on row storage has reached a technical bottleneck. How to execute complex queries quickly while also reducing storage space and cost is a hot topic in current database research.

A column database is a relational database based on column-storage technology and aimed mainly at business decision analysis. Column storage is characterized by efficient queries, few disk reads, and a small storage footprint, which makes it an ideal architecture for building data warehouses. The application value of a column database comes from its fast response to complex queries and the storage advantage brought by data compression, which give it good prospects in applications such as enterprise decision analysis, data warehousing, and business intelligence. According to an analysis report on data warehouses issued by Gartner (USA) in January 2010, column databases show outstanding performance in data analysis compared with traditional relational databases. The technical research and product development of column databases have therefore received wide attention in academia and industry.

At present, open-source column databases include C-Store, rasdaman, and MonetDB; commercial column databases include Sybase IQ, Vertica Analytic Database, ParAccel Analytic Database, and EXASOL EXA Solution. Over the past five years, outstanding papers on column databases have appeared frequently at first-rate international database conferences such as VLDB, SIGMOD, and ICDE.

Chinese invention patent application No. 200810187227.6 discloses a method and device for implementing a relational database based on column storage. The method includes: establishing a data file and numbering the data blocks that make up the data file in order; defining a table segment; inserting records into the table segment; generating, for each record inserted into the table segment, a record identification number that is unique within the table segment, and splitting the record by column; and, for each column of the record, performing the following operations: storing the column value and the record identification number in a data block as value data, sorted by column value; storing the record identification number and the serial number of the data block holding the value data in a new data block as join data, sorted by record identification number; and building an index on the data blocks that store value data and the data blocks that store join data, generating index data blocks. That method builds indexes on the data blocks storing value data and the data blocks storing join data; it does not build indexes between different columns or column sets that belong to the same tuple.

Summary of the invention

The technical problem to be solved by the present invention is to provide a single-table multi-column sequence storage method for a column database. The storage method can improve the query efficiency of a column database and reduce storage space.

To achieve the above object, the present invention adopts the following technical solution:

A single-table multi-column sequence storage method for a column database, wherein the column database includes multiple data tables made up of rows and columns, each data table is divided into multiple column sets, each column set includes one or more columns, and the union of the column sets constitutes the data table, characterized in that:

A join index is established between every two column sets. For the two column sets it connects, the join index records the one-to-one correspondence between the storage positions of the columns that belong to the same tuple of the data table.

Preferably, for each column set, the join index is established according to the storage position values of the columns, in the two column sets, that belong to the same tuple of the data table;

the join index is stored in each corresponding column set and corresponds to the columns in that column set.

Preferably, if two column sets contain a repeated column, the columns of the two column sets are sorted and stored according to the repeated column;

if two column sets have no repeated column, the columns of the two column sets are each sorted according to the query conditions and stored.

Preferably, for two column sets with no repeated column, a join index is not established if their logical order is the same, and is established if their logical order differs.

Preferably, for two column sets that have a repeated column, no join index is established.

Preferably, the cost of each execution plan is estimated, and the join index is selected according to the optimal cost.

Preferably, when it is determined that all columns of the data table appear in the union of the multiple column sets, a materialized view is created for each column set, completing the creation of the column sets.

Preferably, the process of establishing join indexes also includes the following column-set loading steps:

Step 1: load data into the data table;

Step 2: materialize all column-set materialized views, including materializing the storage positions of each column set;

Step 3: establish the join indexes;

Step 4: delete the data of the data table;

Step 5: delete the storage position values that are no longer needed.

The present invention removes the restriction that column storage must keep the column values belonging to the same logical tuple at the same position in every column, which makes the storage method more flexible in use. The invention can divide the table into projections with optimal sort orders according to the queries of report-type applications, improving performance, and can also process ad hoc queries through join indexes without losing performance.

Brief description of the drawings

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 shows an example of a database project to which the storage method is applied;

Fig. 2 is a schematic diagram showing storage position values in the database project of Fig. 1;

Fig. 3 is a schematic diagram showing a join index in the database project of Fig. 1;

Fig. 4 is a schematic diagram of the operating steps for creating column sets in the storage method;

Fig. 5 is a schematic diagram of the join indexes established in the embodiment of Fig. 1;

Fig. 6 is a schematic diagram of restoring logical tuples using join indexes in the embodiment of Fig. 1.

Detailed description of the invention

The present invention uses the logical data model of a relational database: each relation is a two-dimensional table made up of rows (also called tuples) and columns (attributes, also called fields). On this basis, the invention uses a physical organization based on column sets to implement the logical data model. The meaning of "column set" is introduced first.

Column set: each column set belongs to one relation. Logically, a column set is a vertical subset of the relation it belongs to; physically, it contains one or more columns of that relation and has the same number of rows as the relation. If the relation to which the column set belongs has a many-to-one relationship with another relation, the column set may also contain columns of that other relation.

Column sets may contain the same column repeatedly, or may have no column in common. In other words, columns may overlap between column sets, and a column set may contain columns from multiple data tables. Taken over all column sets that belong to the same relation, the union of their columns is exactly the set of columns of that relation, and these columns constitute the relation. A column set is stored by column and can be sorted on one or several of its columns. This organization saves the storage overhead of indexes and leaves room for query optimization. During storage, various column-storage compression schemes can be used, such as RLE (run-length encoding). Segmented storage can also be used to improve compression efficiency.
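The benefit of sorting a column set before applying RLE can be illustrated with a minimal Python sketch. The column values and the helper names `rle_encode` and `rle_decode` are hypothetical and are not taken from the patent; the sketch only shows that a sorted column collapses into fewer runs.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a column as a list of (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical "department" column of a column set, before and after sorting.
department = ["CS", "Math", "CS", "Math", "CS", "CS"]
unsorted_runs = rle_encode(department)        # five short runs
sorted_runs = rle_encode(sorted(department))  # two long runs
```

Because each column set can choose its own sort column, a column that compresses poorly in one column set may compress well in another.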

The present invention provides a column-storage physical organization for OLAP (online analytical processing) scenarios that offers more optimization flexibility while using less storage space. To this end, in the storage method provided by the invention, the data table to be stored is first taken as the base table and divided into multiple column sets. Join indexes are then established by joining on storage position values (storage keys). A join index records the one-to-one correspondence of column values between different column sets. Through the join index that connects it to another column set, a column set can obtain the other column values in the base table that correspond to its own column values, so that a logical tuple can be rebuilt. The storage position value can be set according to actual needs; for example, it can be the hash value of a hash column of the base table.

A column set is a vertical partition of the base table: it contains one or more columns and has the same number of rows as the base table. A column set may also contain columns of other data tables that have a many-to-one relationship with the base table.

Because the present invention removes the restriction that column storage must keep the column values belonging to the same logical tuple at the same position in every column, the storage method is more flexible in use. The invention can divide the table into projections with optimal sort orders according to the queries of report-type applications, improving performance. For example, in Fig. 6 the projection on name is performed first, and the corresponding department is then found through the join index. The precondition for this process is that the join index records, for the column set containing the department, the storage position in the column set containing the name. Each column set therefore stores join indexes that record the position, in another column set, of the entries of this column set, that is, which entry of the other column set each entry of this column set corresponds to. Ad hoc queries can likewise be processed through the join index without losing performance.

In the storage method of the present invention, a join index is established between every two column sets. For the two column sets it connects, the join index records the one-to-one correspondence between the storage positions of the columns that belong to the same tuple of the data table. For each column set, the join index is established according to the storage position values of the columns, in the two column sets, that belong to the same tuple of the data table. The join index is stored in each corresponding column set and corresponds to the column values in that column set. As a result, even if a column set is re-sorted, the values of the join index are re-sorted along with it, so the correspondence between the columns of the column sets that belong to the same tuple is not affected.

The implementation steps and effects of the invention are explained below, taking a database project that applies the storage method as an example. The example is a database for teaching management in institutions of higher learning. As shown in Fig. 1, column set 1 (student number, name, sex, age), sorted by student number, and column set 2 (student number, department), sorted by department, constitute the student relation; column set 3 contains columns of the course relation. As shown in Fig. 2, every value of every column in a column set has a storage position value, and the column values with the same storage position value constitute one logical tuple. The storage position value need not be physically stored. As shown in Fig. 3, because the column sets that make up a relation can be sorted on different columns, column values that have the same storage position value in the base table may be located at different positions in different column sets. For example, the record with student number 20070026 in Fig. 3 is in row 1 of column set 2 but in row 2 of column set 1. The join index marks this positional relationship, so that the column values with the same storage position value can be found even when the physical storage position cannot be determined. By establishing join indexes between different column sets, logical tuples can then be obtained in the order of a given column. For example, the join index (the "correspondence position" at the top right of Fig. 3) marks that row 1 of column set 2 corresponds to row 2 of column set 1.
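The relationship between storage position values and a join index can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the storage key is assumed to be the row's position in the base table, the student data other than student number 20070026 are invented, and the helper names `build_column_set` and `build_join_index` are hypothetical.

```python
def build_column_set(base_rows, columns, sort_by):
    """Project the base rows onto `columns`, sorted by `sort_by`.

    Each entry keeps its storage key (assumed here to be the row's
    position in the base table).
    """
    projected = [(key, {c: row[c] for c in columns})
                 for key, row in enumerate(base_rows)]
    projected.sort(key=lambda item: item[1][sort_by])
    return projected


def build_join_index(source, target):
    """For each row position in `source`, give the row position in
    `target` that carries the same storage key."""
    position_in_target = {key: pos for pos, (key, _) in enumerate(target)}
    return [position_in_target[key] for key, _ in source]


# Hypothetical base table (student relation).
base = [
    {"student_no": 20070026, "name": "Li", "department": "Automation"},
    {"student_no": 20070013, "name": "Wang", "department": "Computer Science"},
    {"student_no": 20070041, "name": "Zhao", "department": "Mathematics"},
]

set1 = build_column_set(base, ["student_no", "name"], sort_by="student_no")
set2 = build_column_set(base, ["student_no", "department"], sort_by="department")

# Zero-based positions: entry 0 of set2 corresponds to entry 1 of set1.
join_2_to_1 = build_join_index(set2, set1)
```

In this sketch student 20070026 sits in the first row of column set 2 and the second row of column set 1, matching the example in the text (with zero-based positions).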

As the above explanation shows, no join index is needed between two column sets whose logical order is the same. If two column sets contain a repeated column, the columns of the two column sets are sorted and stored according to that repeated column; if two column sets have no repeated column, the columns of the two column sets are each sorted according to the query conditions and stored. For two column sets with no repeated column, a join index is not established if their logical order is the same and is established if their logical order differs. Two column sets with a repeated column have the same sort order and therefore need no join index.
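The rule above can be written as a small decision function. The following Python sketch is an illustration only; the function name `needs_join_index` and its parameters are hypothetical, and whether two column sets share a logical order is assumed to be known by the caller.

```python
def needs_join_index(columns_a, columns_b, same_logical_order):
    """Return True when the rule in the text calls for a join index.

    columns_a, columns_b: iterables of column names of two column sets.
    same_logical_order: whether the two sets store tuples in the same order.
    """
    if set(columns_a) & set(columns_b):
        # A repeated column exists: both sets are sorted on it,
        # so their order matches and no join index is needed.
        return False
    # No repeated column: a join index is needed only if the orders differ.
    return not same_logical_order


shared = needs_join_index(["student_no", "name"], ["student_no", "age"], True)
disjoint_same = needs_join_index(["name"], ["age"], True)
disjoint_diff = needs_join_index(["name"], ["department"], False)
```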

Fig. 4 shows the operating steps for creating column sets in the storage method. First, according to the column-set definitions, it is determined whether the list is empty. If the list is empty, the procedure returns. If the list is not empty, a column set is taken, and it is further determined whether its base-table columns are in the base-table column set A. If so, the columns are merged into column set B; if not, an error message is returned. Next, the following are checked: whether the columns of each column set that does not reference other tables belong to the base table; whether the columns of each column set that references other tables belong to the referenced table, and whether the referenced table and the base table have a primary key/foreign key relationship; whether the included columns belong to list X; and whether the other column sets contain the columns of the base table (that is, whether all columns of the base table appear in the column sets). If the results of these checks are all yes, the included columns are merged into column set B. If the result of any check is no, an error message is returned according to the circumstances. When column set B is identical to the column set of the base table, a materialized view is created for each column set, and the next column set is taken and the above steps are repeated.
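The core of the checks described above, that every column of a column set must belong to the base table or to a referenced table and that the column sets together must cover the base table, can be sketched in Python. This is a simplified illustration, not the full flow of Fig. 4; the function name `validate_column_sets`, its parameters, and the example schema are all hypothetical.

```python
def validate_column_sets(base_columns, column_sets, referenced_columns=frozenset()):
    """Check column-set definitions against a base table.

    base_columns: column names of the base table.
    column_sets: {column_set_name: [column names]}.
    referenced_columns: columns of tables the base table references
        through a primary key/foreign key relationship (assumed known).
    Raises ValueError on the first failed check; returns True otherwise.
    """
    covered = set()
    for name, columns in column_sets.items():
        for column in columns:
            if column in base_columns:
                covered.add(column)
            elif column not in referenced_columns:
                raise ValueError(
                    f"{name}: column {column!r} is in neither the base "
                    "table nor a referenced table")
    missing = set(base_columns) - covered
    if missing:
        raise ValueError(f"base-table columns not covered: {sorted(missing)}")
    return True


base_columns = {"student_no", "name", "sex", "age", "department"}
definitions = {
    "set1": ["student_no", "name", "sex", "age"],
    "set2": ["student_no", "department"],
    "set3": ["student_no", "course_name"],  # course_name comes from another table
}
ok = validate_column_sets(base_columns, definitions, {"course_name"})
```

Only after such a check succeeds would a materialized view be created for each column set.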

Examples of the SQL statements used to create a base table that uses the column-set storage mode are shown below:

Explanation of the statement:

CREATE TABLE table_name specifies a base table whose name is table_name.

WITH (ORIENTATION=COLUMN) specifies that the base table is stored by column.

The PROJECTIONS clause creates the column sets; it specifies the column-set name, the columns contained in the set, and the columns used for sorting.

For a column set that uses only columns of the base table, the following query statement is used:

SELECT column_name...FROM table_name ORDER BY column_name;

For a column set that uses columns of other tables, the following query statement is used:

SELECT column_name...FROM table_name JOIN other_table USING (primary/foreign key column) ORDER BY column_name.

Fig. 5 is a schematic diagram of the join indexes established in the embodiment of Fig. 1. The SQL statement used to establish a join index is as follows:

CREATE JOIN INDEX index_name FROM projection_a TO projection_b;

This SQL statement creates the join index index_name from column set projection_a to column set projection_b.

During the establishment of join indexes, data are loaded into the column sets as follows:

1) load data into the base table and into the other tables referenced by the column sets;

2) materialize all column-set materialized views, including materializing the storage positions of each column set;

3) establish the join indexes;

4) delete the data of the base table and of the other tables referenced by the column sets;

5) delete the storage positions that are no longer needed.
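The five loading steps can be sketched end to end in Python. This is a minimal in-memory illustration under stated assumptions: the storage position is taken to be the row's position in the base table, join indexes are built for every ordered pair of column sets, and the function name `load_column_sets` and the example data are hypothetical.

```python
from itertools import permutations


def load_column_sets(base_rows, definitions):
    """definitions: {name: (columns, sort_column)}.

    Returns (column_sets, join_indexes), where join_indexes[(a, b)][i]
    is the row position in column set b matching row i of column set a.
    """
    # Step 1: load the data into the base table (a staging copy here).
    staged = list(base_rows)

    # Step 2: materialize every column set, keeping the storage position.
    keyed = {}
    for name, (columns, sort_column) in definitions.items():
        rows = [(key, tuple(row[c] for c in columns))
                for key, row in enumerate(staged)]
        rows.sort(key=lambda item: item[1][columns.index(sort_column)])
        keyed[name] = rows

    # Step 3: establish the join indexes from matching storage positions.
    join_indexes = {}
    for a, b in permutations(keyed, 2):
        pos_in_b = {key: pos for pos, (key, _) in enumerate(keyed[b])}
        join_indexes[(a, b)] = [pos_in_b[key] for key, _ in keyed[a]]

    # Step 4: delete the base-table data.
    staged.clear()

    # Step 5: delete the storage positions that are no longer needed.
    column_sets = {name: [values for _, values in rows]
                   for name, rows in keyed.items()}
    return column_sets, join_indexes


students = [
    {"student_no": 20070026, "name": "Li", "department": "Automation"},
    {"student_no": 20070013, "name": "Wang", "department": "Computer Science"},
    {"student_no": 20070041, "name": "Zhao", "department": "Mathematics"},
]
column_sets, join_indexes = load_column_sets(students, {
    "by_no": (["student_no", "name"], "student_no"),
    "by_dept": (["student_no", "department"], "department"),
})
```

After step 5 only the sorted column sets and the join indexes remain, which is where the storage saving described in the text comes from.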

The join index can be selected using a rule-based optimization approach: the cost of each execution plan is estimated, this cost quantifies the resources consumed by the plan, and the optimal join index is selected according to the cost. Because using a join index causes random I/O in the database, join indexes should be used as little as possible.
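A toy cost comparison in Python illustrates why plans with fewer join-index lookups tend to win. The cost formula, the weights `seq_cost` and `random_cost`, and the plan figures are assumptions made for illustration; the patent does not specify a cost model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    rows_scanned: int        # rows read sequentially
    join_index_lookups: int  # lookups that cause random I/O


def estimated_cost(plan, seq_cost=1.0, random_cost=10.0):
    """Assumed model: random I/O is weighted more heavily than sequential reads."""
    return plan.rows_scanned * seq_cost + plan.join_index_lookups * random_cost


def choose_plan(plans):
    """Pick the plan with the lowest estimated cost."""
    return min(plans, key=estimated_cost)


plans = [
    Plan("single column set, full scan", rows_scanned=1000, join_index_lookups=0),
    Plan("two column sets via join index", rows_scanned=200, join_index_lookups=200),
]
best = choose_plan(plans)
```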

Fig. 6 shows the operating process of restoring logical tuples using join indexes in the embodiment of Fig. 1. The process includes the following steps:

Step 10: look for a column set that contains all target columns required by the query. If there is one, use that column set to restore the tuples; if not, go to the next step;

In Fig. 6, specifically: look for a column set that contains all the query targets, namely "name" and "department".

Step 11: look for multiple column sets that have the same column order and together contain all target columns required by the query;

Step 12: among the column sets obtained in step 11, select the combination of column sets that uses the fewest join indexes (for example, column set 1 and column set 2 in Fig. 6);

Step 13: after the projection of the target columns is completed, restore the tuples using the join index.

Postponing the use of the join index as long as possible, until the projection of the target columns has been completed, improves query performance.
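Steps 10 to 13 can be sketched in Python as follows. This is a simplified illustration: it assumes that the smallest group of column sets covering the target columns is the one needing the fewest join indexes, it ignores the sort-order condition of step 11, and the names `pick_column_sets` and `reconstruct` as well as the example data are hypothetical.

```python
from itertools import combinations


def pick_column_sets(targets, column_sets):
    """column_sets: {name: set of column names}.

    Returns the smallest group of column sets covering all targets;
    a group of size n needs n - 1 join indexes (steps 10 to 12).
    """
    targets = set(targets)
    for size in range(1, len(column_sets) + 1):
        for group in combinations(sorted(column_sets), size):
            if targets <= set().union(*(column_sets[n] for n in group)):
                return group
    raise LookupError("no group of column sets covers the target columns")


def reconstruct(first, second, join_index):
    """Step 13: after projection, row i of `first` is paired with
    row join_index[i] of `second`."""
    return [(value, second[join_index[i]]) for i, value in enumerate(first)]


catalog = {
    "set1": {"student_no", "name", "sex", "age"},
    "set2": {"student_no", "department"},
}
group = pick_column_sets(["name", "department"], catalog)

# Projected target columns, each in its own column set's order.
names = ["Wang", "Li", "Zhao"]                                    # set1, by student number
departments = ["Automation", "Computer Science", "Mathematics"]  # set2, by department
join_1_to_2 = [1, 0, 2]  # zero-based positions in set2 for each row of set1
tuples = reconstruct(names, departments, join_1_to_2)
```

The join index is touched only in `reconstruct`, after both target columns have been projected, which mirrors the late use of the join index described above.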

The single-table multi-column sequence storage method for a column database provided by the present invention has been described in detail above. For those of ordinary skill in the art, any obvious change made to it without departing from the true spirit of the invention will constitute an infringement of the patent right of the invention and will incur the corresponding legal liability.

Claims

1. A single-table multi-column sequence storage method for a column database, the column database being a relational database that includes multiple data tables made up of rows and columns, characterized by comprising the following steps:

taking the data table to be stored as a base table and dividing it into multiple column sets, each column set being a vertical partition of the base table and including one or more columns;

establishing join indexes by joining on storage position values, the join index being used to record the one-to-one correspondence of column values between different column sets and the position, in another column set, of the columns of the present column set;

for each column set, establishing the join index according to the storage position values of the columns, in the two column sets, that belong to the same tuple of the data table;

storing the join index in each column set, the column set obtaining, through the join index that connects it to another column set, the other column values in the base table that correspond to the column values of the present column set; wherein the join index is stored in each corresponding column set and corresponds to the columns in that column set.

2. The single-table multi-column sequence storage method for a column database as claimed in claim 1, characterized by comprising the following step:

if two column sets contain a repeated column, sorting and storing the columns of the two column sets according to the repeated column;

3. The single-table multi-column sequence storage method for a column database as claimed in claim 2, characterized in that:

for two column sets with no repeated column, a join index is not established if their logical order is the same, and is established if their logical order differs.

4. The single-table multi-column sequence storage method for a column database as claimed in claim 2, characterized in that:

for two column sets that have a repeated column, no join index is established.

5. The single-table multi-column sequence storage method for a column database as claimed in claim 1, characterized by comprising the following step of creating column sets:

when it is determined that all columns of the data table appear in the union of the multiple column sets, creating a materialized view for each column set, completing the creation of the column sets.

6. The single-table multi-column sequence storage method for a column database as claimed in claim 1, characterized in that the process of establishing join indexes also includes the following column-set loading steps:

Step 1: load data into the data table;

Step 3: establish the join indexes;

Step 4: delete the data of the data table;

Step 5: delete the storage position values that are no longer needed.

7. The single-table multi-column sequence storage method for a column database as claimed in claim 1, characterized by also comprising the following steps of restoring logical tuples using join indexes:

Step 1: look for a column set that contains all target columns required by the query; if there is one, use that column set to restore the tuples; if not, go to the next step;

Step 2: look for multiple column sets that have the same column order and together contain all target columns required by the query;

Step 3: among the column sets obtained in step 2, select the combination of column sets that uses the fewest join indexes;

Step 4: after the projection of the target columns is completed, restore the tuples using the join index.

8. The single-table multi-column sequence storage method for a column database as claimed in claim 1, characterized in that:

the columns belonging to the same tuple of the data table refer to multiple columns in the column sets.