CN112395308A

CN112395308A - Data query method based on HDFS database

Info

Publication number: CN112395308A
Application number: CN202011226066.4A
Authority: CN
Inventors: 李发明
Original assignee: Shenzhen Zhongbo Kechuang Information Co ltd
Current assignee: Shenzhen Zhongbo Kechuang Information Co ltd
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2021-02-23

Abstract

The invention provides a data query method based on an HDFS (Hadoop Distributed File System) database, which addresses the low efficiency and slow speed of data queries in the prior art. In this method, a user first accesses the HDFS database system through a user interface and, after obtaining authorization, sends a query request to a scheduler. The scheduler splits the received query request according to the metadata or the query summary stored in a metadata server and issues the resulting subtasks to sub-engines; the invoked sub-engines execute the query fragments in parallel on their own HDFS child nodes to complete the query task. The invention combines the distributed infrastructure of the underlying HDFS with structured data processing, completes the query by splitting the request task, and returns the integrated query result to the client, thereby improving query efficiency and accuracy and ensuring the scalability of the system.

Description

Data query method based on HDFS database

Technical Field

The invention belongs to the field of database management, and particularly relates to a data query method based on an HDFS (Hadoop Distributed File System) database.

Background

The Hadoop Distributed File System (HDFS) is commonly used in large-scale distributed relational database systems, which hold massive amounts of business data and contain a large number of processing nodes for data storage, organization, and analysis. When a client queries data in a database, the database management system generally handles the query in an interpreted manner: it parses and optimizes the query, creates a query plan, and executes it. With massive data, the number of client queries grows by orders of magnitude. If the original query mode is still used, interpreted query execution introduces excessive overhead and increases the load on the scheduling node, so query speed and accuracy decline and user requirements cannot be met.

Disclosure of Invention

In view of the above defects in the prior art, the present invention aims to provide a data query method based on an HDFS database in which a scheduler and query sub-engines cooperate through a retrieval grammar set and its subsets. The scheduler uses the grammar set to allocate tasks to the query sub-engines, thereby splitting query tasks, increasing the number of query tasks executed in parallel, and meeting query requirements under massive data.

To achieve the above purpose, the embodiment of the invention adopts the following technical solution:

The embodiment of the invention provides a data query method based on an HDFS (Hadoop Distributed File System) database, wherein a scheduler, after receiving a query request, splits the request task according to the metadata or the query summary in a metadata server and issues the split subtasks to sub-engines running on HDFS sub-nodes.

In the above solution, the data query method includes the following steps:

Step S1: the user accesses the HDFS database system through the user interface;

Step S2: the accessing user sends a query request to the scheduler in the SQL query language;

Step S3: the scheduler performs syntax translation and decomposition on the received SQL query according to the metadata or the query summary in the metadata server, splits one query task into a plurality of subtasks, issues the subtasks to the corresponding sub-engines, and saves the splitting strategy;

Step S4: after receiving a subtask, each sub-engine scheduled by the scheduler obtains the source data required by the query from the current HDFS database, executes the query operation on the source data according to its grammar subset, and sends the query result corresponding to the subtask to the scheduler;

Step S5: the scheduler integrates the query results fed back by the sub-engines according to the splitting strategy of the query task to generate a final query result, sends the final query result to the client, and records the query summary in the metadata server.
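The control flow of steps S3 to S5 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function `handle_query`, its callable parameters, and the toy values are all hypothetical, and the subtasks are run one after another here only to keep the example short.

```python
from typing import Callable, List, Tuple


def handle_query(
    sql: str,
    split: Callable[[str], Tuple[list, str]],
    execute: Callable[[object], object],
    merge: Callable[[str, list], object],
    summary_log: List[Tuple[str, str]],
):
    """Hypothetical scheduler routine covering steps S3 to S5."""
    subtasks, strategy = split(sql)                  # S3: split, keep the strategy
    partials = [execute(task) for task in subtasks]  # S4: one result per subtask
    result = merge(strategy, partials)               # S5: integrate by strategy
    summary_log.append((sql, strategy))              # S5: record the query summary
    return result


# Toy stand-ins for the scheduler's collaborators (all hypothetical).
log: List[Tuple[str, str]] = []
total = handle_query(
    "SELECT COUNT(*) FROM t",
    split=lambda sql: (["node-1", "node-2"], "sum"),
    execute=lambda node: {"node-1": 3, "node-2": 4}[node],
    merge=lambda strategy, parts: sum(parts),
    summary_log=log,
)
```

In this toy run the two partial counts are added, and the pair of query text and strategy is appended to `log`, standing in for the record kept in the metadata server.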

In the above solution, the user interface in step S1 is a JDBC/ODBC application program interface or a Shell command line interface connected to the API.

In the above solution, the user's access to the HDFS database system includes authenticating the user's identity as an ordinary user or a programmer and assigning corresponding access rights according to the authentication result.

In the above solution, splitting a query task into a plurality of subtasks in step S3 means that, after the SQL query request is matched to the meta information in the metadata, the query request is decomposed into subtasks based on that meta information, and each subtask is assigned to its corresponding child node, so that the distribution of data resources and computing resources remains consistent.

In the above solution, the metadata server records the scheduler's query summaries; when the scheduler receives the same query task again, it directly invokes the splitting strategy matching the query summary to split the query task.

In the above solution, the scheduler tracks the health status of the sub-engines through a state monitoring process and monitors the query process; when a child node goes offline because of a hardware failure, network error, or software failure, the state monitoring process notifies all nodes so that subsequent queries avoid the unreachable node.

In the above scheme, the called sub-engines execute different query segments in parallel to complete the whole query process.

In the scheme, when the HDFS subnodes in the HDFS database are expanded, the subengines are expanded based on the protocol of the HDFS, so that the capacity of the database is improved, and the query capability of the system is correspondingly improved.

In the above solution, the redundancy mechanism provided by the HDFS underlying the HDFS database is used to cope with software, hardware, and network failures of a single node.

The invention has the following beneficial effects:

In the data query method based on the HDFS database provided by the embodiment of the invention, a user first accesses the HDFS database system through a user interface and, after obtaining authorization, sends a query request to a scheduler. The scheduler splits the received query request according to the metadata or the query summary in the metadata server and issues the resulting subtasks to the sub-engines; the invoked sub-engines execute the query fragments in parallel on their own HDFS child nodes to complete the query task. The invention combines the distributed infrastructure of the underlying HDFS with structured data processing, completes the query by splitting the request task, and returns the integrated query result to the client, thereby improving query efficiency and accuracy and ensuring the scalability of the system.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:

FIG. 1 is a schematic diagram illustrating a data query method based on an HDFS database according to an embodiment of the present invention;

FIG. 2 is a timing chart of a data query method based on an HDFS database according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.

It should be noted that the embodiments and the features of the embodiments may be combined with each other provided they do not conflict. The present invention will be described in detail below with reference to the attached drawings and in conjunction with the embodiments.

The embodiment of the invention provides a data query method based on an HDFS database. As shown in FIG. 2, in the HDFS database, resource data is stored in the HDFS instances that serve as child nodes. Each HDFS child node runs a query sub-engine, and each sub-engine stores a retrieval grammar subset corresponding to the data in its HDFS node. All sub-engines are connected to a scheduler, which is connected to a metadata server and provides a user interface; the scheduler stores a grammar set table. After receiving a query request from the user interface, the scheduler translates the request into retrieval grammar according to the metadata in the metadata server, decomposes it into a plurality of retrieval grammars according to the retrieval grammar set, and issues them to the corresponding sub-engines. Each sub-engine starts a query task according to the retrieval grammar it receives, performs the retrieval in its HDFS node, and feeds the query result back to the scheduler. The scheduler integrates the results fed back by all sub-engines and finally returns the result to the client through the user interface.

By means of the retrieval grammar set and its subsets, the query request is split into a plurality of subtasks, and no fine-grained insert or update operations need to be performed on the data in the data tables. Specific query tasks are distributed to the sub-engines according to the metadata, executed in parallel on the underlying HDFS data, and their results are finally integrated. The method thus accommodates both massive structured data processing and the underlying HDFS infrastructure, improving query accuracy and speed and the user experience.
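The relationship between the scheduler's grammar set and the sub-engines' grammar subsets can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify how the grammar set table is built, so the sketch assumes it is the union of all sub-engine subsets, and the class names, method names, and operation labels are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class SubEngine:
    """Query sub-engine on one HDFS child node (hypothetical model)."""
    node: str
    grammar_subset: FrozenSet[str]  # retrieval operations this engine supports


class Scheduler:
    def __init__(self, engines: Iterable[SubEngine]):
        self.engines = list(engines)
        # Assumption: the grammar set table is the union of all subsets.
        self.grammar_set = frozenset().union(
            *(e.grammar_subset for e in self.engines)
        )

    def accepts(self, operations: Iterable[str]) -> bool:
        """A request is accepted only if it stays within the grammar set."""
        return frozenset(operations) <= self.grammar_set

    def eligible_engines(self, operations: Iterable[str]) -> List[str]:
        """Nodes whose grammar subset covers every requested operation."""
        ops = frozenset(operations)
        return [e.node for e in self.engines if ops <= e.grammar_subset]


scheduler = Scheduler([
    SubEngine("node-1", frozenset({"SELECT", "WHERE"})),
    SubEngine("node-2", frozenset({"SELECT", "WHERE", "GROUP BY"})),
])
```

With these toy engines, a request using `GROUP BY` would only be routed to `node-2`, while a request using an operation outside the grammar set would be rejected before any subtask is issued.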

FIG. 2 shows the data query method based on an HDFS database according to an embodiment of the present invention. As shown in FIG. 2, the data query method includes the following steps:

Step S1: the user accesses the HDFS database system through the user interface.

In this step, the user interface is a JDBC/ODBC application program interface or a Shell command line interface connected to the API. Through the API, an ordinary user can retrieve data from the HDFS database, and a programmer can call the data in the database directly through the interface for secondary development of applications.

Preferably, the step of accessing the HDFS database system by the user includes authenticating the identity of the user. The identities to be authenticated include user identities and programmer identities; on this basis, the different identities are graded appropriately, and corresponding rights are assigned to each grade.

Step S2: the accessing user sends a query request to the scheduler in the SQL query language.
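The graded authorization described above can be sketched as a lookup from identity and grade to a set of rights. The grades, the permission names, and the `authorize` function are hypothetical; the patent only states that identities are graded and that rights are assigned per grade.

```python
from enum import Enum
from typing import FrozenSet


class Identity(Enum):
    USER = "user"
    PROGRAMMER = "programmer"


# Hypothetical grades and rights; the source does not enumerate them.
PERMISSIONS = {
    (Identity.USER, 1): {"query"},
    (Identity.USER, 2): {"query", "export"},
    (Identity.PROGRAMMER, 1): {"query", "api_call"},
    (Identity.PROGRAMMER, 2): {"query", "api_call", "schema_read"},
}


def authorize(identity: Identity, grade: int, authenticated: bool) -> FrozenSet[str]:
    """Return the rights for an authenticated identity, or none at all."""
    if not authenticated:
        return frozenset()
    # Unknown identity/grade combinations receive no rights.
    return frozenset(PERMISSIONS.get((identity, grade), ()))
```

Returning an empty set for failed authentication or an unknown grade is a design choice of this sketch, so that a missing entry never grants access by accident.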

In this step, the syntax used in the SQL query language is a subset of the syntax set in the scheduler.

Step S3: the scheduler performs syntax translation and decomposition on the received SQL query according to the metadata or the query summary in the metadata server, splits one query task into a plurality of subtasks, issues the subtasks to the corresponding sub-engines, and saves the splitting strategy.

In this step, the metadata server stores meta information about the sub-engine on each child node and about the corresponding HDFS database. After the SQL query request is matched to the meta information in the metadata, the query request is decomposed into subtasks based on that meta information, and each subtask is assigned to its corresponding child node. In this way, the distribution of data resources and computing resources stays consistent while the task is split, which enables efficient data query. A subtask distributed to a child node corresponds both to the information in that node's sub-engine and to the information in its HDFS database: the retrieval grammar set in the scheduler is matched against the grammar subset in the sub-engine, and, after the SQL query language is mapped to the meta information, the subtask is matched against the resource data or business data in the HDFS database.

The metadata server stores metadata about the system, such as which databases are available and what the table structures of those databases are. When the resource information in the HDFS data is updated, the table structure content in the metadata server is updated accordingly. The metadata server also stores the scheduler's query summaries; when the scheduler receives the same query task again, it directly invokes the splitting strategy matching the query summary to split the query task.

Preferably, the scheduler also tracks the health status of the sub-engines through a state monitoring process and monitors the query process. When a child node goes offline because of a hardware failure, network error, software failure, or another cause, the state monitoring process notifies all nodes so that subsequent queries avoid the unreachable node. Because the state monitoring process only assists when a problem arises, it is not on the critical path of normal query operations. If the process is not running normally or becomes unreachable, the other nodes can still execute query tasks normally; if a node fails while the state monitoring process is unreachable, the system merely becomes less robust, and the execution of normal tasks is not affected. When the state monitoring process comes back online, it re-establishes contact with the other nodes and resumes monitoring.
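The splitting of one query into per-node subtasks can be sketched as follows, assuming the metadata records which child node holds which files of a table. The `METADATA` layout, the table name, the HDFS paths, and the `Subtask` type are all hypothetical; the sketch only shows how a subtask ends up on the node that already holds its data.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical metadata: table name -> {child node -> HDFS paths on that node}.
METADATA: Dict[str, Dict[str, List[str]]] = {
    "orders": {
        "node-1": ["/warehouse/orders/part-0000"],
        "node-2": ["/warehouse/orders/part-0001"],
    },
}


@dataclass(frozen=True)
class Subtask:
    node: str                # child node whose sub-engine runs the fragment
    table: str
    paths: Tuple[str, ...]   # data already local to that node
    fragment: str            # retrieval grammar to execute on that data


def split_query(table: str, fragment: str, metadata=METADATA) -> List[Subtask]:
    """Create one subtask per child node that stores part of the table."""
    placement = metadata.get(table)
    if placement is None:
        raise KeyError(f"table not found in metadata: {table}")
    return [
        Subtask(node, table, tuple(paths), fragment)
        for node, paths in sorted(placement.items())
    ]


subtasks = split_query("orders", "SELECT * WHERE amount > 100")
```

Because each subtask names only paths stored on its own node, computation is placed where the data lives, which is one way to read the stated consistency between data resources and computing resources.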
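Reusing a saved splitting strategy for a repeated query can be sketched with a small cache keyed by a digest of the query text. The patent does not say how the query summary is computed, so the SHA-256 digest over whitespace-normalized SQL is an assumption, as are the names `query_summary` and `StrategyCache`.

```python
import hashlib
from typing import Callable, Dict


def query_summary(sql: str) -> str:
    """Digest of the query text (assumed form of the 'query summary')."""
    # Simplification: collapsing whitespace also alters string literals.
    normalized = " ".join(sql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class StrategyCache:
    """Stand-in for the summaries kept in the metadata server."""

    def __init__(self) -> None:
        self._strategies: Dict[str, object] = {}
        self.hits = 0

    def get_or_plan(self, sql: str, planner: Callable[[str], object]):
        key = query_summary(sql)
        if key in self._strategies:
            self.hits += 1                 # same query seen before: reuse
            return self._strategies[key]
        strategy = planner(sql)            # first time: plan and remember
        self._strategies[key] = strategy
        return strategy
```

A real system would presumably also need to discard saved strategies when the table structure in the metadata server changes; the source does not describe that step, and the sketch omits it.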
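The monitoring behaviour described above can be sketched with a heartbeat table. The heartbeat mechanism, the timeout value, and the names `HealthMonitor` and `usable_nodes` are assumptions; the patent only states that unreachable nodes are avoided and that queries continue when the monitor itself is down.

```python
import time
from typing import Callable, Iterable, List, Optional, Set


class HealthMonitor:
    """Tracks the last heartbeat of each child node (hypothetical design)."""

    def __init__(self, nodes: Iterable[str], timeout: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout
        self._clock = clock
        now = clock()
        self._last_seen = {node: now for node in nodes}

    def heartbeat(self, node: str) -> None:
        self._last_seen[node] = self._clock()

    def unreachable(self) -> Set[str]:
        now = self._clock()
        return {n for n, seen in self._last_seen.items()
                if now - seen > self._timeout}


def usable_nodes(candidates: Iterable[str],
                 monitor: Optional[HealthMonitor]) -> List[str]:
    """Filter out unreachable nodes; if the monitor is down, use them all."""
    if monitor is None:
        # Monitor unavailable: queries still run, only less robustly.
        return list(candidates)
    bad = monitor.unreachable()
    return [n for n in candidates if n not in bad]
```

Passing `None` for the monitor models the case in which the state monitoring process is unreachable: scheduling proceeds with every candidate node instead of blocking.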

Step S4: after receiving a subtask, each sub-engine scheduled by the scheduler obtains the source data required by the query from the current HDFS database, executes the query operation on the source data according to its grammar subset, and sends the query result corresponding to the subtask to the scheduler.

The invoked sub-engines run on the HDFS child nodes. When a query request is split, a plurality of sub-engines are invoked and execute different query fragments in parallel to complete the whole query process. Because HDFS storage is distributed, the child nodes in the HDFS database and the sub-engines running on them are correspondingly scalable: when the database is expanded, the sub-engines are expanded based on the HDFS protocol, so that the query capability of the system increases together with the capacity of the database.

The data query method of this embodiment is based on the HDFS database and relies on the redundancy mechanism provided by HDFS to cope with software, hardware, and network failures of a single node. Table data in the HDFS database system is stored as HDFS data files and uses the file formats and compression strategies of HDFS. When a data file is added to a new table, the mapping between the data file in HDFS and the table name in the system is managed uniformly by the system itself.

Step S5: the scheduler integrates the query results fed back by the sub-engines according to the splitting strategy of the query task to generate a final query result, sends the final query result to the client, and records the query summary in the metadata server.
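Parallel execution of query fragments can be sketched with a thread pool in which each worker stands in for one sub-engine. The in-memory `NODE_DATA`, the threshold filter, and the function names are hypothetical; real sub-engines would run on separate child nodes and read from HDFS.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical data held by each child node.
NODE_DATA: Dict[str, List[int]] = {
    "node-1": [3, 8, 15],
    "node-2": [4, 23, 42],
}


def run_fragment(node: str, threshold: int) -> List[int]:
    """Query fragment executed by one sub-engine on its local data."""
    return [value for value in NODE_DATA[node] if value > threshold]


def run_parallel(nodes: List[str], threshold: int) -> Dict[str, List[int]]:
    """Submit one fragment per node and collect the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        futures = {node: pool.submit(run_fragment, node, threshold)
                   for node in nodes}
        return {node: future.result() for node, future in futures.items()}


partials = run_parallel(["node-1", "node-2"], threshold=7)
```

Adding a node to `NODE_DATA` and to the node list adds one more worker without changing the code, which loosely mirrors how query capacity is said to grow with the number of child nodes.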
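The system-managed mapping between table names and HDFS data files can be sketched as a small catalog. The class `TableCatalog`, its methods, and the example paths are hypothetical; the patent only states that the system itself maintains this mapping.

```python
from typing import Dict, List, Tuple


class TableCatalog:
    """Maps table names to the HDFS data files backing them (hypothetical)."""

    def __init__(self) -> None:
        self._files: Dict[str, List[str]] = {}

    def register(self, table: str, hdfs_path: str) -> None:
        """Attach a data file to a table, creating the table entry if new."""
        paths = self._files.setdefault(table, [])
        if hdfs_path not in paths:   # ignore repeated registration
            paths.append(hdfs_path)

    def files_for(self, table: str) -> Tuple[str, ...]:
        """Return the data files of a table, or an empty tuple if unknown."""
        return tuple(self._files.get(table, ()))


catalog = TableCatalog()
catalog.register("orders", "/warehouse/orders/part-0000")
catalog.register("orders", "/warehouse/orders/part-0001")
catalog.register("orders", "/warehouse/orders/part-0000")  # duplicate, ignored
```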
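Integrating partial results according to the saved splitting strategy can be sketched with two illustrative strategies. The strategy names `concat` and `sum` are assumptions chosen for the example (row selection versus a distributed count); the patent does not list the strategies it supports.

```python
from typing import Dict


def merge_results(strategy: str, partials: Dict[str, object]):
    """Combine per-node results according to the splitting strategy."""
    if strategy == "concat":
        # Row-returning query: append each node's rows in node-name order.
        rows = []
        for node in sorted(partials):
            rows.extend(partials[node])
        return rows
    if strategy == "sum":
        # Aggregate such as COUNT: add the partial values of all nodes.
        return sum(partials.values())
    raise ValueError(f"unknown splitting strategy: {strategy}")


rows = merge_results("concat", {"node-2": [23, 42], "node-1": [8, 15]})
count = merge_results("sum", {"node-1": 3, "node-2": 4})
```

The merge step needs the strategy because the same partial results combine differently depending on how the query was split, which is presumably why step S3 saves it.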

According to the above technical solution, in the data query method based on the HDFS database provided by this embodiment, a user first accesses the HDFS database system through a user interface and, after obtaining authorization, sends a query request to a scheduler. The scheduler splits the received query request according to the metadata or the query summary in the metadata server and issues the resulting subtasks to the sub-engines; the invoked sub-engines execute the query fragments in parallel on their own HDFS child nodes to complete the query task. The invention combines the distributed infrastructure of the underlying HDFS with structured data processing, completes the query by splitting the request task, and returns the integrated query result to the client, thereby improving query efficiency and accuracy and ensuring the scalability of the system.

The foregoing description covers only the preferred embodiments of the invention and illustrates the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention disclosed herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, a technical solution may be formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present invention.

Claims

1. A data query method based on an HDFS database, characterized in that a scheduler, after receiving a query request, splits the request task according to metadata or a query summary in a metadata server and issues the split subtasks to sub-engines running on HDFS sub-nodes.

2. The HDFS database-based data query method according to claim 1, comprising the steps of:

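step S1, the user accesses the HDFS database system through the user interface;

step S2, the accessing user sends a query request to the scheduler in the SQL query language;

step S3, the scheduler performs syntax translation and decomposition on the received SQL query according to the metadata or the query summary in the metadata server, splits one query task into a plurality of subtasks, issues the subtasks to the corresponding sub-engines, and saves the splitting strategy;

step S4, after receiving a subtask, each sub-engine scheduled by the scheduler obtains the source data required by the query from the current HDFS database, executes the query operation on the source data according to its grammar subset, and sends the query result corresponding to the subtask to the scheduler;

and step S5, the scheduler integrates the query results fed back by the sub-engines according to the splitting strategy of the query task to generate a final query result, sends the final query result to the client, and records the query summary in the metadata server.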

3. The HDFS database-based data query method according to claim 2, wherein the user interface in step S1 is JDBC/ODBC application program interface or Shell command line interface connected to API.

4. The HDFS database-based data query method according to claim 3, wherein the accessing of the user to the HDFS database system comprises authenticating the user identity or the programmer identity of the user and allocating corresponding access rights according to the authentication result.

5. The HDFS database-based data query method according to claim 2, wherein splitting a query task into a plurality of subtasks in step S3 comprises: after the SQL query request is matched to the meta information in the metadata, decomposing the query request into subtasks based on the corresponding meta information, each subtask corresponding to its child node, so that the distribution of data resources and computing resources remains consistent.

6. The HDFS database-based data query method according to claim 2, wherein the metadata server records a query summary of the scheduler, and when the scheduler receives the same query task again, the scheduler directly calls a splitting policy matching the query summary to split the query task.

7. The HDFS database-based data query method according to claim 2, wherein the scheduler tracks the health status of the sub-engines through a state monitoring process and monitors the query process; when a child node goes offline because of a hardware failure, network error, or software failure, the state monitoring process notifies all nodes so that subsequent queries avoid the unreachable node.

8. The HDFS database-based data query method according to claim 2, wherein the invoked sub-engine executes different query fragments in parallel to complete the whole query process.

9. The HDFS database-based data query method according to claim 2, wherein when the HDFS sub-nodes in the HDFS database are expanded, the sub-engines are expanded based on the HDFS protocol, so that the query capability of the system increases together with the capacity of the database.

10. The HDFS database-based data query method according to claim 2, wherein the redundancy mechanism provided by the HDFS underlying the HDFS database is used to cope with software, hardware, and network failures of a single node.