CN112395308A - Data query method based on HDFS database - Google Patents
Data query method based on HDFS database
- Publication number
- CN112395308A (application CN202011226066.4A)
- Authority
- CN
- China
- Prior art keywords
- query
- hdfs
- database
- sub
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24532—Query optimisation of parallel queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data query method based on an HDFS (Hadoop Distributed File System) database, which addresses the low efficiency and low speed of data queries in the prior art. In this method, a user first accesses the HDFS database system through a user interface and, after obtaining authorization, sends a query request to a scheduler. The scheduler splits the received query request according to the metadata or the query summary in the metadata server and sends the resulting subtasks to the sub-engines; the invoked sub-engines execute the query fragments in parallel in their current HDFS nodes to complete the query task. The invention combines the distributed infrastructure of the underlying HDFS with structured data processing, completes the query task by splitting the request, and feeds the integrated query result back to the client, thereby improving query efficiency and accuracy and ensuring the scalability of the system.
Description
Technical Field
The invention belongs to the field of database management, and particularly relates to a data query method based on an HDFS (Hadoop Distributed File System) database.
Background
HDFS (the Hadoop Distributed File System) is commonly used as the storage layer of large-scale distributed relational database systems. Such a system holds massive amounts of business data and a large number of processing nodes for data storage, organization, and analysis. When a client queries data in the database, the database management system generally answers the query in an interpreted manner: it parses and optimizes the query, creates a query plan, and executes the plan. With massive data, the number of client queries grows sharply. If the original query mode is still used, interpreted query execution introduces excessive execution overhead and increases the burden on the scheduling node, which reduces query speed and accuracy so that user requirements cannot be met.
Disclosure of Invention
In view of the above defects or shortcomings of the prior art, the present invention aims to provide a data query method based on an HDFS database. The scheduler and the query sub-engines cooperate through a grammar set and its subsets: the scheduler allocates tasks to the query sub-engines according to the grammar set, thereby splitting query tasks, increasing the number of query tasks that run in parallel, and meeting query requirements under massive data.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
The embodiment of the invention provides a data query method based on an HDFS (Hadoop Distributed File System) database, wherein, after receiving a query request, a scheduler splits the request task according to the metadata or the query summary in the metadata server and issues the resulting subtasks to sub-engines running on the HDFS child nodes.
In the above scheme, the data query method includes the following steps:
step S1, the user accesses the HDFS database system through the user interface;
step S2, the user who has gained access sends a query request to the scheduler in the SQL query language;
step S3, the scheduler performs syntax translation and decomposition on the received SQL query according to the metadata or the query summary in the metadata server, splits one query task into a plurality of subtasks, issues the subtasks to the corresponding sub-engines, and saves the splitting strategy;
step S4, after a sub-engine scheduled by the scheduler receives a subtask, it obtains the source data required by the query from the current HDFS database, executes the query operation on the source data according to its grammar subset, and sends the query result corresponding to the subtask to the scheduler;
and step S5, the scheduler integrates the query results fed back by the sub-engines according to the splitting strategy of the query task to generate a final query result, sends the final query result to the client, and records the query summary in the metadata server.
In the above solution, the user interface in step S1 is a JDBC/ODBC application program interface or a Shell command line interface connected to the API.
In the above scheme, accessing the HDFS database system comprises authenticating the identity of the user as either a user or a programmer and assigning corresponding access rights according to the authentication result.
In the above scheme, in step S2, splitting a query task into a plurality of subtasks comprises: after the SQL query request is matched to the meta information in the metadata, decomposing the query request into subtasks based on the matched meta information, wherein the subtasks correspond to the corresponding child nodes, so that the distribution of data resources and computing resources remains consistent.
In the above scheme, the query summary of the scheduler is recorded in the metadata server; when the scheduler receives the same query task again, it directly invokes the splitting strategy that matches the query summary to split the query task.
In the above scheme, the scheduler tracks the health state of the sub-engines through a state monitoring process and monitors the query process; when a child node is disconnected because of a hardware failure, network error, or software failure, the state monitoring process notifies all nodes so that subsequent queries avoid the unreachable node.
In the above scheme, the called sub-engines execute different query segments in parallel to complete the whole query process.
In the above scheme, when the HDFS child nodes in the HDFS database are expanded, the sub-engines are expanded based on the HDFS protocol, so that the query capability of the system improves together with the capacity of the database.
In the above scheme, the redundancy mechanism provided by HDFS, on which the HDFS database is based, is used to cope with software, hardware, and network failures of a single node.
The invention has the following beneficial effects:
According to the data query method based on the HDFS database provided by the embodiment of the invention, a user first accesses the HDFS database system through a user interface and, after obtaining authorization, sends a query request to a scheduler. The scheduler splits the received query request according to the metadata or the query summary in the metadata server and sends the resulting subtasks to the sub-engines; the invoked sub-engines execute the query fragments in parallel in their current HDFS nodes to complete the query task. The invention combines the distributed infrastructure of the underlying HDFS with structured data processing, completes the query task by splitting the request, and feeds the integrated query result back to the client, thereby improving query efficiency and accuracy and ensuring the scalability of the system.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a schematic diagram illustrating a data query method based on an HDFS database according to an embodiment of the present invention;
FIG. 2 is a timing chart of a data query method based on an HDFS database according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
The embodiment of the invention provides a data query method based on an HDFS database. As shown in FIG. 1, in the HDFS database, resource data is stored in the HDFS instances that serve as child nodes. Each HDFS child node runs a query sub-engine, and the sub-engine stores the retrieval grammar subset that corresponds to the data in its current HDFS node. All sub-engines are connected to a scheduler; the scheduler is connected to a metadata server and is provided with a user interface, and a grammar set table is stored in the scheduler. After receiving a query request from the user interface, the scheduler translates the request into retrieval grammar according to the metadata in the metadata server, decomposes it into a plurality of retrieval statements according to the retrieval grammar set, and issues them to the corresponding sub-engines. Each sub-engine starts a query task according to the received retrieval statement, searches its current HDFS node, and feeds the query result back to the scheduler. The scheduler integrates the results fed back by all sub-engines and finally returns the result to the client through the user interface.
Through the retrieval grammar set and its subsets, the query request is split into a plurality of subtasks without requiring fine-grained insert and update operations on the data in the data tables. Specific query tasks are assigned to the sub-engines according to the metadata, the query tasks are completed in parallel on the underlying HDFS data, and the returned results are finally integrated. The method thus accommodates both massive structured data processing and the underlying HDFS infrastructure, which improves query accuracy and speed and improves the user experience.
FIG. 2 shows the data query method based on an HDFS database according to an embodiment of the present invention. As shown in FIG. 2, the data query method includes the following steps:
Step S1, the user accesses the HDFS database system through the user interface.
In this step, the user interface is a JDBC/ODBC application program interface or a Shell command line interface connected to the API. Through the API, a user can retrieve data in the HDFS database, and a programmer can call the data in the database directly through the interface for secondary development of applications.
Preferably, the step of accessing the HDFS database system includes authenticating the identity of the user. Identity authentication distinguishes user identities from programmer identities; on this basis, the identities are assigned to appropriate grades, and corresponding rights are allocated to each grade.
Step S2, the user who has gained access sends a query request to the scheduler in the SQL query language.
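As a non-limiting illustration, the grading of identities and the allocation of rights might be sketched as follows. The grade numbers, the right names, and the plain credential comparison are assumptions made only for this sketch; the embodiment names just the two identities, user and programmer.

```python
from dataclasses import dataclass

# Hypothetical grades and rights; only the "user" and "programmer"
# identities come from the description above.
RIGHTS_BY_GRADE = {
    1: {"query"},
    2: {"query", "call_api"},
}
GRADE_BY_IDENTITY = {"user": 1, "programmer": 2}


@dataclass(frozen=True)
class Session:
    name: str
    identity: str
    rights: frozenset


def authenticate(name: str, identity: str, secret: str, credentials: dict) -> Session:
    """Verify the credential, then grant the rights of the identity's grade."""
    # Plain comparison for illustration only; a real system would not
    # store or compare secrets in clear text.
    if credentials.get(name) != secret:
        raise PermissionError(f"authentication failed for {name!r}")
    grade = GRADE_BY_IDENTITY[identity]
    return Session(name, identity, frozenset(RIGHTS_BY_GRADE[grade]))
```

A session created this way carries only the rights of its grade, so a later query request can be checked against `session.rights` before it reaches the scheduler.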
In this step, the syntax used in the SQL query language is a subset of the syntax set in the scheduler.
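As a non-limiting illustration, a scheduler could check that a request stays within its grammar set before translating it. The keyword sets below are hypothetical, and the check is a simple token scan rather than a real SQL parser.

```python
import re

# Hypothetical grammar set held by the scheduler.
SCHEDULER_GRAMMAR = {"SELECT", "FROM", "WHERE", "GROUP", "BY", "ORDER", "LIMIT", "AND", "OR"}
# Hypothetical list of keywords the check knows about.
KNOWN_KEYWORDS = SCHEDULER_GRAMMAR | {"JOIN", "UNION", "HAVING", "INSERT", "UPDATE", "DELETE"}


def unsupported_keywords(sql: str) -> set:
    """Return the known SQL keywords in the request that fall outside the grammar set."""
    # Naive scan: keywords inside string literals would also be reported.
    tokens = {token.upper() for token in re.findall(r"[A-Za-z_]+", sql)}
    return (tokens & KNOWN_KEYWORDS) - SCHEDULER_GRAMMAR
```

An empty result means the request uses only syntax that belongs to the scheduler's grammar set.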
Step S3, the scheduler performs syntax translation and decomposition on the received SQL query language according to the metadata or query abstract in the metadata server, splits one query task into a plurality of subtasks, issues the subtasks to the corresponding sub-engines, and saves the splitting policy.
In this step, the metadata server stores the meta information of the sub-engine and of the corresponding HDFS database in each child node. After the SQL query request is matched to the meta information in the metadata, the query request is decomposed into subtasks based on the matched meta information, and each subtask corresponds to a child node. Data resources and computing resources thus remain consistently distributed while the task is split, which enables efficient data query. A subtask distributed to a child node corresponds both to the information in the sub-engine of that node and to the information in its HDFS database: the retrieval grammar set in the scheduler is matched against the grammar subset in the sub-engine, and, after the SQL query language has been mapped to the meta information, the subtask is matched to the resource data or business data in the HDFS database.
The metadata server stores metadata about the system, such as which databases are available and the specific table structures of those databases; when the resource information in the HDFS data is updated, the table structure content in the metadata server is updated accordingly. The metadata server also stores the scheduler's query summaries; when the scheduler receives the same query task again, it directly invokes the splitting strategy that matches the query summary to split the query task.
Preferably, the scheduler also tracks the health status of the sub-engines and monitors the query process. When a child node is disconnected because of a hardware failure, network error, software failure, or other reason, the state monitoring process notifies all nodes so that subsequent queries avoid the unreachable node. Because the role of the state monitoring process is to assist when problems arise, it is not critical to normal query operation. If the state monitoring process is not running or becomes unreachable, the other nodes can still execute query tasks normally; the system merely becomes less robust if a node fails while the monitoring process is unreachable, and the execution of normal tasks is not affected. When the state monitoring process comes back online, it re-establishes contact with the other nodes and resumes its monitoring function.
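As a non-limiting illustration, the decomposition of one query into per-node subtasks might look like the sketch below. The metadata layout (table name mapped to the child nodes that hold its data), the node names, and the `Subtask` type are assumptions for this sketch and are not specified by the embodiment.

```python
from dataclasses import dataclass

# Hypothetical metadata: table name -> child nodes that store part of the table.
METADATA = {
    "orders": ["node-1", "node-2", "node-3"],
    "users": ["node-2"],
}


@dataclass(frozen=True)
class Subtask:
    node: str      # child node whose sub-engine will run the fragment
    fragment: str  # query fragment to execute on that node's local data


def split_query(table: str, where: str, metadata: dict) -> list:
    """Create one subtask for every child node that stores data of the table."""
    if table not in metadata:
        raise KeyError(f"unknown table {table!r}")
    fragment = f"SELECT * FROM {table} WHERE {where}"
    return [Subtask(node, fragment) for node in metadata[table]]
```

Because every subtask is addressed to a node that already holds the relevant data, computation is placed where the data resides, which is the consistency between data resources and computing resources described above.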
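As a non-limiting illustration, reuse of a saved splitting strategy could be sketched as a lookup keyed by a digest of the query text. Treating the query summary as a SHA-256 hash of the whitespace-normalized SQL is an assumption of this sketch; the embodiment does not define how the summary is computed.

```python
import hashlib


def query_summary(sql: str) -> str:
    """Digest of the query text with runs of whitespace collapsed (assumed form of the summary)."""
    normalized = " ".join(sql.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class StrategyCache:
    """Stands in for the part of the metadata server that stores splitting strategies."""

    def __init__(self):
        self._strategies = {}
        self.hits = 0

    def get_or_compute(self, sql, planner):
        """Return the saved strategy for this query, planning it only on first sight."""
        key = query_summary(sql)
        if key in self._strategies:
            self.hits += 1
        else:
            self._strategies[key] = planner(sql)
        return self._strategies[key]
```

Here `planner` is any callable that produces a splitting strategy for a query; it runs once per distinct summary.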
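As a non-limiting illustration, the behaviour described above could be sketched with a heartbeat table. The heartbeat mechanism, the timeout, and the names below are assumptions for this sketch; the embodiment states only that the state monitoring process tracks node health and lets queries avoid unreachable nodes.

```python
class StateMonitor:
    """Tracks when each child node was last heard from (hypothetical heartbeat scheme)."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self._last_seen = {}

    def heartbeat(self, node: str, now: float) -> None:
        self._last_seen[node] = now

    def reachable(self, now: float) -> set:
        """Nodes whose last heartbeat is no older than the timeout."""
        return {n for n, t in self._last_seen.items() if now - t <= self.timeout}


def pick_nodes(candidates, monitor, now):
    """Drop unreachable nodes; with no monitor available, keep all candidates."""
    if monitor is None:
        # Monitoring is offline: queries still run, only failure avoidance is lost.
        return list(candidates)
    alive = monitor.reachable(now)
    return [n for n in candidates if n in alive]
```

Passing `monitor=None` models the case in which the state monitoring process is unreachable: scheduling continues, with less robustness.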
Step S4, after receiving the subtasks, the sub-engine scheduled by the scheduler acquires the source data required for the query from the current HDFS database, then performs the query operation in the source data according to the syntax subset, and sends the query result corresponding to the subtasks to the scheduler.
The invoked sub-engines run on the HDFS child nodes. When a query request is split, a plurality of sub-engines are invoked and execute different query fragments in parallel to complete the whole query process. Because HDFS storage is distributed, the child nodes in the HDFS database and the sub-engines running on them are correspondingly scalable; when the database is expanded, the sub-engines are expanded based on the HDFS protocol, so that the query capability of the system improves together with the capacity of the database.
The data query method of this embodiment is based on the HDFS database, and software, hardware, and network failures of a single node are handled by the redundancy mechanism provided by HDFS. Table data in the HDFS database system is stored as HDFS data files and uses the file formats and compression strategies of HDFS. When data files are added to a new table, the mapping between the data files in HDFS and the table names in the system is managed uniformly by the system itself.
Step S5, the scheduler integrates the query results fed back by the sub-engines according to the splitting strategy of the query task to generate a final query result, sends the final query result to the client, and records the query summary in the metadata server.
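As a non-limiting illustration, parallel execution of query fragments might be sketched with a thread pool, where each worker stands in for the sub-engine of one child node. The in-memory `NODE_DATA` table, the node names, and the threshold filter are hypothetical stand-ins for HDFS data and a real query fragment.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local data of three child nodes.
NODE_DATA = {
    "node-1": [5, 12, 30],
    "node-2": [7, 41],
    "node-3": [],
}


def run_fragment(node: str, threshold: int) -> list:
    """Stand-in for a sub-engine: filter the rows stored on its own node."""
    return [value for value in NODE_DATA[node] if value > threshold]


def run_parallel(nodes, threshold):
    """Run one fragment per node concurrently and collect the results by node."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        futures = {node: pool.submit(run_fragment, node, threshold) for node in nodes}
        return {node: future.result() for node, future in futures.items()}
```

Adding a node in this sketch only requires adding an entry to the node table, which mirrors how the sub-engines scale with the HDFS child nodes.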
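As a non-limiting illustration, integration of partial results according to the saved splitting strategy might look like the following. The three strategy names and the shape of the partial results are assumptions for this sketch.

```python
def merge_results(strategy: str, partials: list):
    """Combine per-node partial results according to the splitting strategy."""
    if strategy == "concat":
        # Each partial is a list of rows; the final result is their concatenation.
        return [row for part in partials for row in part]
    if strategy == "count":
        # Each partial is a local row count.
        return sum(partials)
    if strategy == "avg":
        # Each partial is a (sum, count) pair, so the average stays exact
        # even when nodes hold different numbers of rows.
        total = sum(s for s, _ in partials)
        count = sum(c for _, c in partials)
        return total / count if count else None
    raise ValueError(f"unknown strategy {strategy!r}")
```

The average case shows why the splitting strategy must be saved: averaging the per-node averages would give a wrong answer, whereas merging sums and counts does not.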
According to the above technical scheme, in the data query method based on the HDFS database provided by this embodiment, a user first accesses the HDFS database system through a user interface and, after obtaining authorization, sends a query request to a scheduler. The scheduler splits the received query request according to the metadata or the query summary in the metadata server and sends the resulting subtasks to the sub-engines; the invoked sub-engines execute the query fragments in parallel in their current HDFS nodes to complete the query task. The invention combines the distributed infrastructure of the underlying HDFS with structured data processing, completes the query task by splitting the request, and feeds the integrated query result back to the client, thereby improving query efficiency and accuracy and ensuring the scalability of the system.
The foregoing describes only preferred embodiments of the invention and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention disclosed herein is not limited to the particular combination of features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present invention.
Claims (10)
1. A data query method based on an HDFS database, characterized in that, after receiving a query request, a scheduler splits the request task according to metadata or a query summary in a metadata server, and issues the resulting subtasks to sub-engines running on HDFS child nodes.
2. The HDFS database-based data query method according to claim 1, comprising the steps of:
step S1, the user accesses the HDFS database system through the user interface;
step S2, the user who has gained access sends a query request to the scheduler in the SQL query language;
step S3, the scheduler performs syntax translation and decomposition on the received SQL query according to the metadata or the query summary in the metadata server, splits one query task into a plurality of subtasks, issues the subtasks to the corresponding sub-engines, and saves the splitting strategy;
step S4, after a sub-engine scheduled by the scheduler receives a subtask, it obtains the source data required by the query from the current HDFS database, executes the query operation on the source data according to its grammar subset, and sends the query result corresponding to the subtask to the scheduler;
and step S5, the scheduler integrates the query results fed back by the sub-engines according to the splitting strategy of the query task to generate a final query result, sends the final query result to the client, and records the query summary in the metadata server.
3. The HDFS database-based data query method according to claim 2, wherein the user interface in step S1 is JDBC/ODBC application program interface or Shell command line interface connected to API.
4. The HDFS database-based data query method according to claim 3, wherein accessing the HDFS database system comprises authenticating the identity of the user as either a user or a programmer and assigning corresponding access rights according to the authentication result.
5. The HDFS database-based data query method according to claim 2, wherein in step S2, splitting a query task into a plurality of subtasks comprises: after the SQL query request is matched to the meta information in the metadata, decomposing the query request into subtasks based on the matched meta information, wherein the subtasks correspond to the corresponding child nodes and the distribution of data resources and computing resources remains consistent.
6. The HDFS database-based data query method according to claim 2, wherein the metadata server records a query summary of the scheduler, and when the scheduler receives the same query task again, the scheduler directly calls a splitting policy matching the query summary to split the query task.
7. The HDFS database-based data query method according to claim 2, wherein the scheduler tracks the health status of the sub-engines through a state monitoring process and monitors the query process; when a child node is disconnected because of a hardware failure, network error, or software failure, the state monitoring process notifies all nodes so that subsequent queries avoid the unreachable node.
8. The HDFS database-based data query method according to claim 2, wherein the invoked sub-engine executes different query fragments in parallel to complete the whole query process.
9. The HDFS database-based data query method according to claim 2, wherein when the HDFS child nodes in the HDFS database are expanded, the sub-engines are expanded based on the HDFS protocol, so that the query capability of the system improves together with the capacity of the database.
10. The HDFS database-based data query method according to claim 2, wherein the redundancy mechanism provided by HDFS, on which the HDFS database is based, is used to cope with software, hardware, and network failures of a single node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011226066.4A CN112395308A (en) | 2020-11-05 | 2020-11-05 | Data query method based on HDFS database |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011226066.4A CN112395308A (en) | 2020-11-05 | 2020-11-05 | Data query method based on HDFS database |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112395308A true CN112395308A (en) | 2021-02-23 |
Family
ID=74598206
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011226066.4A Pending CN112395308A (en) | 2020-11-05 | 2020-11-05 | Data query method based on HDFS database |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112395308A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113886458A (en) * | 2021-09-23 | 2022-01-04 | 浙江至元数据科技有限公司 | Distributed hiding query method and system based on task aggregation |
CN114020744A (en) * | 2021-11-03 | 2022-02-08 | 北京沃东天骏信息技术有限公司 | Data transmission method, device, electronic equipment and computer readable medium |
CN114610746A (en) * | 2022-03-15 | 2022-06-10 | 云粒智慧科技有限公司 | SQL merging execution system and method of multi-relational data engine |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102779183A (en) * | 2012-07-02 | 2012-11-14 | 华为技术有限公司 | Data inquiry method, equipment and system |
CN103246749A (en) * | 2013-05-24 | 2013-08-14 | 北京立新盈企信息技术有限公司 | Matrix data base system for distributed computing and query method thereof |
CN103678520A (en) * | 2013-11-29 | 2014-03-26 | 中国科学院计算技术研究所 | Multi-dimensional interval query method and system based on cloud computing |
CN104903894A (en) * | 2013-01-07 | 2015-09-09 | 脸谱公司 | System and method for distributed database query engines |
CN107784103A (en) * | 2017-10-27 | 2018-03-09 | 北京人大金仓信息技术股份有限公司 | A kind of standard interface of access HDFS distributed memory systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||