CN107402988B

CN107402988B - Distributed NewSQL database system and semi-structured data query method

Info

Publication number: CN107402988B
Application number: CN201710580456.3A
Authority: CN
Inventors: 晋彤; 谭恒亮
Original assignee: Yunrun Da Data Service Co Ltd
Current assignee: Yunrun Da Data Service Co Ltd
Priority date: 2016-09-21
Filing date: 2017-07-17
Publication date: 2020-01-03
Anticipated expiration: 2037-07-17
Also published as: CN107491485A; CN107480198B; CN107451221A; CN107329837A; CN107451219B; CN106446153A; CN107368575A; CN107402987A; CN107491485B; CN107368575B; CN107463635A; CN107463632A; CN107491345B; CN107402990A; CN107402995B; CN107480198A; CN107402991B; CN107463635B; CN107402991A; CN107451220B

Abstract

The invention discloses a distributed NewSQL database system comprising: a control unit for accepting a user request through a database interface and sending the user request to a planning unit, the user request comprising a query condition for the JSON data to be queried; the planning unit, for parsing the user request and compiling and customizing a corresponding execution plan; an execution unit for starting a cooperative processing module to acquire index data according to the execution plan, querying a data table according to the index data to obtain a query result, and returning the query result to the control unit; and an Hbase unit storing the data table and an index table, the Hbase unit comprising the cooperative processing module, which queries the index table according to the query condition to obtain the corresponding index data. The invention also discloses a semi-structured data query method. The invention enables querying of data in JSON format and solves the problem of poor effect and performance when processing semi-structured data.

Description

Distributed NewSQL database system and semi-structured data query method

Technical Field

The invention relates to the technical field of big data, in particular to a distributed NewSQL database system and a semi-structured data query method.

Background

HBase is currently one of the best-known distributed NoSQL databases in the Hadoop ecosystem. Its main components are the HMaster and the HRegionServer. It provides users with a table-style data model in which a table is divided into multiple Regions by primary key range; the HMaster is responsible for managing and assigning Regions, and the HRegionServer is responsible for reading and writing Region data. Data stored in existing HBase has no data type and is held as byte arrays, so querying becomes problematic when semi-structured data such as JSON is stored. To store JSON-format data in HBase, the entire JSON object is conventionally stored as a string. This approach has the following drawbacks:

When records are to be filtered, all records must be read out and then filtered at the client, and the performance is unacceptable when the data volume is large.

When a record needs to be updated, the record needs to be read out, updated according to a specific field, and then rewritten into the Hbase unit for overwriting.
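The two drawbacks above can be illustrated with a minimal sketch. It assumes a plain in-memory dictionary as a stand-in for a table whose values are opaque serialized strings; the names `table`, `filter_on_client`, and `update_field` are hypothetical and are not part of any HBase API.

```python
import json

# Hypothetical stand-in for a table whose values are opaque strings:
# row key -> serialized JSON document.
table = {
    "row1": json.dumps({"name": "alice", "city": "Guangzhou"}),
    "row2": json.dumps({"name": "bob", "city": "Shenzhen"}),
    "row3": json.dumps({"name": "carol", "city": "Guangzhou"}),
}


def filter_on_client(table, field, value):
    """Read every record, parse it, and filter on the client side."""
    matches = []
    rows_read = 0
    for row_key, raw in table.items():
        rows_read += 1                 # every row is transferred and parsed
        doc = json.loads(raw)
        if doc.get(field) == value:
            matches.append(row_key)
    return matches, rows_read


def update_field(table, row_key, field, value):
    """Read-modify-write: the whole record is rewritten to change one field."""
    doc = json.loads(table[row_key])
    doc[field] = value
    table[row_key] = json.dumps(doc)
```

In this sketch the number of rows read always equals the table size, regardless of how few rows match, which is the cost the invention aims to avoid.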

Disclosure of Invention

The embodiment of the invention aims to provide a distributed NewSQL database system and a data query method, which can realize data query in a JSON format and solve the problems of poor effect and poor performance when processing semi-structured data.

In order to achieve the above object, an embodiment of the present invention provides a distributed NewSQL database, including:

the control unit is used for accepting a user request through a database interface and sending the user request to the planning unit; the control unit is also used for returning the query result to the user; the user request comprises a query condition of JSON data to be queried, and the query result is the JSON data obtained according to the query condition;

the planning unit is used for analyzing the user request, compiling and customizing a corresponding execution plan;

the execution unit is used for starting a cooperative processing module to acquire index data corresponding to the query condition requested by the user according to an execution plan; inquiring a data table according to the acquired index data so as to acquire the corresponding inquiry result; and returning the query result to the control unit;

the Hbase unit is used for storing the data table and the index table, JSON type data are added to the bottom layer of the Hbase unit, and the JSON data are integrally stored in the bottom layer HFile;

the Hbase unit further comprises the cooperative processing module, and the cooperative processing module is used for inquiring an index table according to the inquiry condition to obtain the corresponding index data; wherein, the index table stores the index data in the form of inverted index generated by the JSON data as a nested type.

Compared with the prior art, the distributed NewSQL database system disclosed by the invention first accepts a user request through a database interface via the control unit and sends the user request to the planning unit; the planning unit then parses the user request and compiles and generates a corresponding execution plan; next, according to the execution plan, the execution unit starts the cooperative processing module to obtain, from the index table of the Hbase unit, the index data corresponding to the query condition of the user request, queries the data table of the Hbase unit according to the acquired index data to obtain the corresponding query result, namely the JSON data, and returns the query result to the control unit; finally, the control unit returns the query result to the user. With this technical solution, data in JSON format can be queried, and the problems of poor performance and poor effect when processing semi-structured data are solved.

Further, the distributed NewSQL database system further includes: and the distributed transaction manager is used for coordinating multiple parties in the execution plan to finish distributed transaction management when distributed transactions are involved in the execution plan.

Further, the Hbase unit further includes an Hbase unit API interface, and the execution unit is configured to query a data table through the Hbase unit API interface according to the acquired index data, so as to obtain the corresponding query result.

Further, the database interface is JDBC or ODBC.

The embodiment of the invention also discloses a semi-structured data query method, based on the distributed NewSQL database system, which comprises the following steps:

accessing a user request in a database interface mode through a control unit, and sending the user request to a planning unit; the user request comprises a query condition of JSON data to be queried, and the query result is the JSON data obtained according to the query condition;

analyzing the user request through a planning unit, and compiling and customizing a corresponding execution plan;

starting, by the execution unit according to the execution plan, the cooperative processing module of the Hbase unit to query the index table according to the query condition, so as to obtain the corresponding index data; wherein the index table stores the index data in the form of an inverted index generated from the JSON data as a nested type; the index table is stored in the Hbase unit;

inquiring a data table according to the acquired index data through the execution unit so as to acquire the corresponding inquiry result; and returning the query result to the control unit; wherein the data tables are stored in the Hbase unit; JSON type data are added to the bottom layer of the Hbase unit, and the JSON data are integrally stored in the bottom layer HFile;

and returning the query result to the user through the control unit.

Compared with the prior art, the semi-structured data query method disclosed by the invention first accepts a user request through a database interface via the control unit and sends the user request to the planning unit; the planning unit then parses the user request and compiles and generates a corresponding execution plan; next, according to the execution plan, the execution unit starts the cooperative processing module to obtain, from the index table of the Hbase unit, the index data corresponding to the query condition of the user request, queries the data table of the Hbase unit according to the acquired index data to obtain the corresponding query result, namely the JSON data, and returns the query result to the control unit; finally, the control unit returns the query result to the user. With this technical solution, data in JSON format can be queried, and the problems of poor performance and poor effect when processing semi-structured data are solved.

Further, when the distributed transaction is involved in the execution plan, the distributed transaction manager coordinates multiple parties in the execution plan to complete distributed transaction management.

Further, when the execution unit queries the data table, the execution unit queries the data table through an Hbase unit API interface of the Hbase unit, so as to obtain a corresponding query result.

Further, the database interface is JDBC or ODBC.

Drawings

Fig. 1 is a schematic structural diagram of a distributed NewSQL database provided in embodiment 1 of the present invention;

Fig. 2 is a schematic flowchart of a semi-structured data query method provided in embodiment 2 of the present invention;

Fig. 3 is a schematic flowchart of generating an execution plan in step S2 of the semi-structured data query method according to embodiment 2 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a distributed NewSQL database system according to embodiment 1 of the present invention; the present embodiment 1 includes the following structure:

the control unit 1 is used for accepting a user request through a database interface and sending the user request to the planning unit 2; the control unit 1 is also used for returning the query result to the user; the user request comprises a query condition of JSON data to be queried, and the query result is the JSON data obtained according to the query condition;

a planning unit 2, configured to parse the user request, compile and customize a corresponding execution plan;

the execution unit 3 is configured to start the cooperative processing module 41 to obtain index data corresponding to the query condition requested by the user according to an execution plan; inquiring a data table according to the acquired index data so as to acquire the corresponding inquiry result; and returning the query result to the control unit 1;

the Hbase unit 4 is used for storing the data table and the index table, wherein JSON type data are added to the bottom layer of the Hbase unit 4, and the JSON data are integrally stored in bottom layer HFile;

the Hbase unit 4 further includes the cooperative processing module 41, where the cooperative processing module 41 is configured to query an index table according to the query condition to obtain the corresponding index data; wherein, the index table stores the index data in the form of inverted index generated by the JSON data as a nested type.

In the embodiment, JSON type data is added to the bottom layer of the Hbase unit 4, in the bottom layer HFile, the JSON data is stored as a whole, and the index column of the JSON is also used as a nested type for indexing when a secondary index is constructed, so that any field query, index creation and deletion for the JSON can be supported.
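The idea of indexing nested JSON fields with an inverted index can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the storage format of the invention: the dotted field paths, the in-memory dictionaries, and the function names `flatten` and `build_inverted_index` are all hypothetical.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Yield (dotted_path, value) pairs for every scalar leaf of a nested dict.

    Lists are expanded element by element; lists nested directly inside
    lists are not handled in this sketch.
    """
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                yield from flatten(item, path)
            else:
                yield path, item


def build_inverted_index(data_table):
    """Map each (field path, value) pair to the set of row keys containing it."""
    index = defaultdict(set)
    for row_key, doc in data_table.items():
        for path, value in flatten(doc):
            index[(path, value)].add(row_key)
    return index
```

With such a structure, a condition on any nested field (for example `user.name = "alice"`) resolves to a set of row keys without scanning the stored documents.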

Further, the distributed NewSQL database system further includes a distributed transaction manager, which is used for coordinating the multiple parties in the execution plan to complete distributed transaction management when a transaction is involved in the execution plan. The distributed transaction manager implements distributed transaction processing and transaction management by using the Java Transaction API (JTA), which allows an application to perform a distributed transaction, that is, to access and update data on two or more networked computer resources.
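The source does not state which commit protocol the distributed transaction manager uses. As an assumption for illustration only, the sketch below shows a two-phase commit, a protocol commonly used by transaction managers to coordinate multiple resources; the class `Participant` and the function `run_distributed_transaction` are hypothetical names and do not correspond to JTA interfaces.

```python
class Participant:
    """Hypothetical resource taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote on whether this participant is able to commit.
        if self.can_commit:
            self.state = "prepared"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def run_distributed_transaction(participants):
    """Commit on all participants only if every one of them votes yes."""
    for participant in participants:
        if not participant.prepare():
            # Any "no" vote aborts the transaction everywhere.
            for other in participants:
                other.rollback()
            return False
    # Phase 2: all voted yes, so commit everywhere.
    for participant in participants:
        participant.commit()
    return True
```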

Specifically, after receiving the user request from the control unit 1, the planning unit 2 parses the user request, compiles the SQL with a high-speed SQL engine, and then generates an execution plan. The planning unit 2 is also configured to return the generated execution plan to the control unit 1. After receiving the execution plan, the control unit 1 is further configured to determine, according to the content of the execution plan, whether intervention of the distributed transaction manager is required, and if so, to start the distributed transaction manager.

Further, the Hbase unit 4 further includes an Hbase unit API interface, and the execution unit 3 is configured to query a data table through the Hbase unit API interface according to the obtained index data, so as to obtain the corresponding query result.

Further, the database interface is JDBC or ODBC.

Further, the control unit 1 is also connected to a monitor, which is responsible for metadata management and for monitoring the load of the underlying HBase Regions, so as to avoid overloading any particular Region, and which redistributes Regions by using the cooperative processing module 41.
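The source does not specify how the monitor decides that a Region is overloaded. One possible criterion, shown here purely as an assumption, is to flag any Region whose load exceeds a multiple of the mean load; the function name `find_overloaded_regions`, the `factor` parameter, and the load figures are hypothetical.

```python
def find_overloaded_regions(region_load, factor=2.0):
    """Return the names of Regions whose load exceeds factor * mean load.

    region_load maps a Region name to a load figure such as requests per second.
    """
    if not region_load:
        return []
    mean = sum(region_load.values()) / len(region_load)
    return sorted(name for name, load in region_load.items() if load > factor * mean)
```

The Regions returned by such a check would be the candidates for redistribution.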

In addition, the control unit 1 is also configured to coordinate data communication among a plurality of roles and manage the overall process.

The process by which the planning unit 2 generates an execution plan specifically includes:

judging whether a pre-stored SQL statement corresponding to the SQL statement exists in the shared cache pool; if so, outputting the execution plan corresponding to the SQL statement; if not,

performing a syntax check on the SQL statement; if there is a syntax error, returning error information to the user; otherwise,

performing a semantic check on the SQL statement; if there is a semantic error, returning error information to the user; otherwise,

carrying out view and expression conversion on the SQL statement to obtain a corresponding conversion result;

selecting an optimizer according to the conversion result to obtain a corresponding optimizer selection result;

selecting a corresponding data connection mode (join method) and a corresponding connection sequence (join order) according to the selection result of the optimizer;

selecting a search path according to the connection mode and the connection sequence;

and generating an execution plan according to the search path, and outputting the execution plan.
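The plan-generation process above can be sketched as a small pipeline with a plan cache. This is a minimal sketch under stated assumptions: the toy grammar (`SELECT <columns> FROM <table> [...]`), the class names `Planner` and `PlanError`, and the plan fields are hypothetical, and the conversion, optimizer selection, join selection, and path selection stages are collapsed into a single step.

```python
class PlanError(Exception):
    """Raised when the syntax or semantic check fails (reported to the user)."""


class Planner:
    def __init__(self, known_tables):
        self.known_tables = set(known_tables)
        self.plan_cache = {}      # stands in for the shared cache pool
        self.compilations = 0     # counts plans built from scratch

    def plan(self, sql):
        # Cache check: reuse the plan stored for an identical statement.
        if sql in self.plan_cache:
            return self.plan_cache[sql]

        # Syntax check against a toy grammar: SELECT <cols> FROM <table> [...]
        tokens = sql.strip().rstrip(";").split()
        upper = [token.upper() for token in tokens]
        if len(tokens) < 4 or upper[0] != "SELECT" or "FROM" not in upper:
            raise PlanError("syntax error")
        from_pos = upper.index("FROM")
        if from_pos < 2 or from_pos + 1 >= len(tokens):
            raise PlanError("syntax error")
        table = tokens[from_pos + 1]

        # Semantic check: the referenced table must exist.
        if table not in self.known_tables:
            raise PlanError(f"unknown table: {table}")

        # Remaining stages, collapsed: pick an access path for the statement.
        plan = {
            "table": table,
            "optimizer": "rule_based",
            "access_path": "index_lookup" if "WHERE" in upper else "full_scan",
        }
        self.compilations += 1
        self.plan_cache[sql] = plan
        return plan
```

Submitting the same statement twice returns the cached plan without repeating the checks, which is the purpose of the first step.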

In a specific implementation, the control unit 1 accepts a user request through a database interface and sends the user request to the planning unit 2; the planning unit 2 then parses the user request and compiles and generates a corresponding execution plan; the control unit 1 then judges, according to the content of the execution plan, whether intervention of the distributed transaction manager is needed, and if so, starts the distributed transaction manager, which coordinates the multiple parties in the execution plan to complete distributed transaction management; next, according to the execution plan, the execution unit 3 starts the cooperative processing module 41 to obtain, from the index table of the Hbase unit 4, the index data corresponding to the query condition of the user request, queries the data table of the Hbase unit 4 according to the acquired index data to obtain the corresponding query result, namely the JSON data, and returns the query result to the control unit 1; finally, the control unit 1 returns the query result to the user.

The distributed NewSQL database system of the embodiment can realize data query in a JSON format, and solves the problems of poor effect and performance when processing semi-structured data.

Referring to fig. 2, fig. 2 is a schematic flowchart of a semi-structured data query method provided in embodiment 2 of the present invention; based on the distributed NewSQL database system provided in embodiment 1 of the present invention, embodiment 2 includes the following steps:

S1, accepting a user request through a database interface by the control unit 1, and sending the user request to the planning unit 2; the user request comprises a query condition of JSON data to be queried, and the query result is the JSON data obtained according to the query condition;

S2, analyzing the user request through the planning unit 2, and compiling and customizing a corresponding execution plan;

S3, starting, by the execution unit 3 according to the execution plan, the cooperative processing module 41 of the Hbase unit 4 to query the index table according to the query condition, so as to obtain the corresponding index data; wherein the index table stores the index data in the form of an inverted index generated from the JSON data as a nested type; the index table is stored in the Hbase unit 4;

S4, querying a data table according to the acquired index data through the execution unit 3, thereby acquiring the corresponding query result, and returning the query result to the control unit 1; wherein the data table is stored in the Hbase unit 4; JSON type data are added to the bottom layer of the Hbase unit 4, and the JSON data are stored as a whole in the bottom layer HFile;

S5, returning the query result to the user through the control unit 1.
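Steps S1 to S5 can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the index table and data table are in-memory dictionaries, the execution plan is reduced to a single field path and value, and the function names `coprocessor_lookup`, `execute`, and `handle_request` are hypothetical stand-ins for the cooperative processing module, the execution unit, and the control unit respectively.

```python
def coprocessor_lookup(index_table, path, value):
    """Stand-in for the cooperative processing module: query the index table."""
    return sorted(index_table.get((path, value), ()))


def execute(plan, index_table, data_table):
    """Stand-in for the execution unit (steps S3 and S4)."""
    # S3: obtain the index data (row keys) matching the query condition.
    row_keys = coprocessor_lookup(index_table, plan["path"], plan["value"])
    # S4: fetch only the matching rows from the data table.
    return [data_table[row_key] for row_key in row_keys]


def handle_request(request, index_table, data_table):
    """Stand-in for the control unit (steps S1, S2 and S5)."""
    # S1/S2: accept the request and derive a trivial execution plan from it.
    plan = {"path": request["path"], "value": request["value"]}
    # S5: the query result (JSON documents) is returned to the caller.
    return execute(plan, index_table, data_table)
```

Only the rows named by the index data are read from the data table, in contrast to a full read followed by client-side filtering.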

Further, after the execution plan is generated in step S2 of this embodiment, the method further includes returning the execution plan to the control unit 1. After receiving the execution plan, the control unit 1 determines, according to the content of the execution plan, whether intervention of the distributed transaction manager is needed. If so, the control unit 1 starts the distributed transaction manager; specifically, when the execution plan involves a transaction, the distributed transaction manager coordinates the multiple parties in the execution plan to complete distributed transaction management. If not, step S3 is executed directly.

Further, when the execution unit 3 queries the data table, the data table is queried through the Hbase unit API interface of the Hbase unit 4, so as to obtain a corresponding query result.

Further, the database interface is JDBC or ODBC.

Referring to fig. 3, fig. 3 is a schematic flow chart of the step S2 for generating the execution plan through the planning unit 2, and specifically includes:

S201, judging whether a pre-stored SQL statement corresponding to the SQL statement exists in the shared cache pool; if so, outputting the execution plan corresponding to the SQL statement; if not,

S202, performing a syntax check on the SQL statement; if there is a syntax error, returning error information to the user; otherwise,

S203, performing a semantic check on the SQL statement; if there is a semantic error, returning error information to the user; otherwise,

S204, carrying out view and expression conversion on the SQL statement to obtain a corresponding conversion result;

S205, selecting an optimizer according to the conversion result to obtain a corresponding optimizer selection result;

S206, selecting a corresponding data connection mode (join method) and a corresponding connection sequence (join order) according to the selection result of the optimizer;

S207, selecting a search path according to the connection mode and the connection sequence;

S208, generating an execution plan according to the search path and outputting the execution plan.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A distributed NewSQL database system, comprising:

the control unit is used for accessing a user request in a database interface mode and sending the user request to the planning unit; the system is also used for returning the query result to the user; the user request comprises a query condition of JSON data to be queried, and the query result is the JSON data obtained according to the query condition; the database interface is JDBC or ODBC;

the Hbase unit further comprises the cooperative processing module, and the cooperative processing module is used for inquiring an index table according to the inquiry condition to obtain the corresponding index data; wherein, the index table stores the index data in the form of inverted index generated by the JSON data as a nested type;

the Hbase unit further comprises an Hbase unit API interface, and the execution unit is used for inquiring a data table through the Hbase unit API interface according to the acquired index data so as to acquire the corresponding inquiry result.

2. The distributed NewSQL database system according to claim 1, further comprising:

and the distributed transaction manager is used for coordinating multiple parties in the execution plan to finish distributed transaction management when distributed transactions are involved in the execution plan.

3. A semi-structured data query method based on the distributed NewSQL database system of any one of claims 1-2, which is characterized by comprising the following steps:

accessing a user request in a database interface mode through a control unit, and sending the user request to a planning unit; the user request comprises a query condition of JSON data to be queried, and the query result is the JSON data obtained according to the query condition; the database interface is JDBC or ODBC;

returning the query result to the user through the control unit;

and when the execution unit queries the data table, querying the data table through an Hbase unit API (application programming interface) of the Hbase unit so as to obtain a corresponding query result.

4. A semi-structured data query method as claimed in claim 3, wherein distributed transaction management is accomplished by coordinating multiple parties in the execution plan when distributed transactions are involved in the execution plan by a distributed transaction manager.