CN112395308A - Data query method based on HDFS database - Google Patents

Info

Publication number
CN112395308A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011226066.4A
Other languages
Chinese (zh)
Inventor
李发明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhongbo Kechuang Information Co ltd
Original Assignee
Shenzhen Zhongbo Kechuang Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-11-05
Filing date
2020-11-05
Publication date
2021-02-23
Application filed by Shenzhen Zhongbo Kechuang Information Co ltd
Priority to CN202011226066.4A
Publication of CN112395308A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data query method based on an HDFS (Hadoop Distributed File System) database, intended to address the low efficiency and slow speed of data queries in the prior art. In this method, a user first accesses the HDFS database system through a user interface and, after obtaining permission, sends a query request to a scheduler. The scheduler splits the received query request according to the metadata or the query summary in the metadata server and sends the resulting subtasks to sub-engines; the invoked sub-engines execute the query fragments in parallel on their local HDFS nodes to complete the query task. The invention combines the distributed infrastructure of the underlying HDFS with structured data processing, completes the query by splitting the requested task, and returns the integrated query result to the client, thereby improving query efficiency and accuracy while keeping the system scalable.

Description

Data query method based on HDFS database
Technical Field
The invention belongs to the field of database management, and in particular relates to a data query method based on an HDFS (Hadoop Distributed File System) database.
Background
The Hadoop Distributed File System (HDFS) is commonly used in large-scale distributed relational database systems, which hold massive amounts of business data and contain a large number of processing nodes for data storage, organization, and analysis. When a client queries data in a database, the database management system generally answers the query in an interpreted manner: it parses and optimizes the query, creates a query plan, and then executes it. With massive data, the number of client queries grows by orders of magnitude. If the original query mode is kept, interpreted query execution introduces excessive execution overhead and increases the load on the scheduling node, so query speed and accuracy drop and user requirements can no longer be met.
Disclosure of Invention
In view of the above defects in the prior art, the present invention aims to provide a data query method based on an HDFS database. A scheduler and the query sub-engines cooperate through a grammar set and its subsets: the scheduler uses the grammar set to allocate tasks to the query sub-engines, thereby splitting query tasks, increasing the number of query tasks that can run in parallel, and meeting query requirements over massive data.
To achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
The embodiment of the invention provides a data query method based on an HDFS (Hadoop Distributed File System) database, in which a scheduler, after receiving a query request, splits the requested task according to the metadata or the query summary in a metadata server and issues the resulting subtasks to sub-engines running on HDFS child nodes.
In the above scheme, the data query method includes the following steps:
step S1, the user accesses the HDFS database system through the user interface;
step S2, the accessing user sends a query request to the scheduler in the SQL query language;
step S3, the scheduler translates and decomposes the received SQL query according to the metadata or the query summary in the metadata server, splits the query task into a plurality of subtasks, issues the subtasks to the corresponding sub-engines, and saves the splitting strategy;
step S4, each sub-engine scheduled by the scheduler, after receiving its subtask, obtains the source data required by the query from the current HDFS database, executes the query operation on the source data according to its grammar subset, and sends the query result for the subtask to the scheduler;
and step S5, the scheduler integrates the query results returned by the sub-engines according to the splitting strategy of the query task to generate a final query result, sends the final query result to the client, and records the query summary in the metadata server.
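As a rough illustration of steps S3 to S5, the following Python sketch splits one filter query into per-node subtasks, runs them in parallel, and merges the partial results. It is a minimal sketch under stated assumptions, not the claimed implementation: the node names, the in-memory NODE_DATA table, the strategy dictionary, and all function names are hypothetical, and a thread pool stands in for sub-engines running on separate HDFS child nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the rows stored on three HDFS child nodes.
NODE_DATA = {
    "node-a": [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}],
    "node-b": [{"id": 3, "amount": 40}],
    "node-c": [{"id": 4, "amount": 5}, {"id": 5, "amount": 30}],
}


def split_query(min_amount, nodes):
    """Step S3: build one subtask per child node and keep the split strategy."""
    subtasks = [{"node": node, "min_amount": min_amount} for node in nodes]
    strategy = {"merge": "concat", "order_by": "id"}
    return subtasks, strategy


def run_sub_engine(subtask):
    """Step S4: a sub-engine filters only the rows held on its own node."""
    rows = NODE_DATA[subtask["node"]]
    return [row for row in rows if row["amount"] >= subtask["min_amount"]]


def merge_results(partials, strategy):
    """Step S5: integrate the partial results according to the split strategy."""
    merged = [row for part in partials for row in part]
    return sorted(merged, key=lambda row: row[strategy["order_by"]])


def query(min_amount):
    """Scheduler entry point: split, dispatch in parallel, then merge."""
    subtasks, strategy = split_query(min_amount, sorted(NODE_DATA))
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        partials = list(pool.map(run_sub_engine, subtasks))
    return merge_results(partials, strategy)
```

Calling query(25) in this sketch returns the rows with ids 2, 3, and 5, each found by a different simulated sub-engine.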
In the above scheme, the user interface in step S1 is a JDBC/ODBC application program interface or a Shell command line interface connected to the API.
In the above scheme, the user's access to the HDFS database system includes authenticating the user identity or programmer identity of the user and assigning the corresponding access permissions according to the authentication result.
In the above scheme, in step S2, a query task is split into a plurality of subtasks as follows: after the SQL query request is matched against the meta information in the metadata, the query request is decomposed into subtasks based on the matched meta information, and each subtask is assigned to the corresponding child node, which keeps the distribution of data resources and computing resources consistent.
In the above scheme, the query summary of the scheduler is recorded in the metadata server; when the scheduler receives the same query task again, it directly invokes the splitting strategy matching the query summary to split the query task.
In the above scheme, the scheduler tracks the health of the sub-engines through a state monitoring process and monitors the query process; when a child node goes offline because of a hardware failure, network error, or software failure, the state monitoring process notifies all nodes so that subsequent queries avoid the unreachable node.
In the above scheme, the invoked sub-engines execute different query fragments in parallel to complete the whole query process.
In the above scheme, when HDFS child nodes are added to the HDFS database, the sub-engines are scaled out accordingly based on the HDFS protocol, so that the query capability of the system grows together with the capacity of the database.
In the above scheme, the redundancy mechanism provided by the HDFS underlying the database is used to cope with software, hardware, and network failures of a single node.
The invention has the following beneficial effects:
In the data query method based on the HDFS database provided by the embodiment of the invention, a user first accesses the HDFS database system through a user interface and, after obtaining permission, sends a query request to a scheduler. The scheduler splits the received query request according to the metadata or the query summary in the metadata server and sends the resulting subtasks to sub-engines; the invoked sub-engines execute the query fragments in parallel on their local HDFS nodes to complete the query task. The invention combines the distributed infrastructure of the underlying HDFS with structured data processing, completes the query by splitting the requested task, and returns the integrated query result to the client, thereby improving query efficiency and accuracy while keeping the system scalable.
Drawings
Other features, objects, and advantages of the invention will become more apparent from the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 is a schematic diagram of a data query method based on an HDFS database according to an embodiment of the present invention;
FIG. 2 is a sequence diagram of a data query method based on an HDFS database according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and the features of the embodiments may be combined with each other where they do not conflict. The present invention is described in detail below with reference to the embodiments and the attached drawings.
The embodiment of the invention provides a data query method based on an HDFS database. As shown in FIG. 2, in the HDFS database, resource data is stored on the HDFS instances that serve as child nodes. Each child node runs a query sub-engine, and each sub-engine stores the retrieval grammar subset corresponding to the data on its node. All sub-engines are connected to a scheduler, which is in turn connected to a metadata server and exposes a user interface; the scheduler stores a grammar set table. After receiving a query request from the user interface, the scheduler translates the request into retrieval grammar according to the metadata in the metadata server, decomposes it into a plurality of retrieval statements according to the retrieval grammar set, and issues them to the corresponding sub-engines. Each sub-engine starts a query task according to the retrieval statement it receives, searches its local HDFS node, and returns the query result to the scheduler. The scheduler integrates the results returned by all sub-engines and finally returns the integrated result to the client through the user interface.
Through the retrieval grammar set and its subsets, the query request is split into a plurality of subtasks, so no fine-grained insert or update operations on the data tables are needed. Specific query tasks are assigned to the sub-engines according to the metadata, executed in parallel on the underlying HDFS data, and their results are then integrated. This accommodates both the processing of massive structured data and the underlying HDFS infrastructure, improving query accuracy and speed as well as the user experience.
FIG. 2 illustrates a data query method based on an HDFS database according to an embodiment of the present invention. As shown in FIG. 2, the data query method includes the following steps:
Step S1, the user accesses the HDFS database system through the user interface.
In this step, the user interface is a JDBC/ODBC application program interface or a Shell command line interface connected to the API. Through this connection to the API, a user can retrieve data from the HDFS database via the API, and a programmer can access the data in the database directly through the interface for secondary development of applications.
Preferably, the step in which the user accesses the HDFS database system includes authenticating the identity of the user. Identities are divided into user identities and programmer identities; on this basis, the different identities are graded appropriately, and corresponding permissions are assigned to each grade.
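The identity check and graded permissions described above might look like the following Python sketch. It is illustrative only: the embodiment does not specify how credentials are stored or which permissions each grade receives, so the USERS table, the PERMISSIONS table, the permission names, and the authenticate function are all assumptions, and a real system would store password hashes rather than plain text.

```python
import hmac

# Hypothetical credential store; the identity values mirror the two
# identity types named in the text (user and programmer).
USERS = {
    "alice": {"password": "s3cret", "identity": "user"},
    "bob": {"password": "hunter2", "identity": "programmer"},
}

# Assumed permission grades: which operations each identity may perform.
PERMISSIONS = {
    "user": {"query"},
    "programmer": {"query", "call_api"},
}


def authenticate(name, password):
    """Return the permissions granted to the caller, or an empty set."""
    record = USERS.get(name)
    if record is None:
        return set()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(record["password"].encode(), password.encode()):
        return set()
    return set(PERMISSIONS[record["identity"]])
```

In this sketch a request would be forwarded to the scheduler only if "query" is in the returned set.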
Step S2, the accessing user sends a query request to the scheduler in the SQL query language.
In this step, the syntax used in the SQL query language is a subset of the syntax set in the scheduler.
Step S3, the scheduler translates and decomposes the received SQL query according to the metadata or the query summary in the metadata server, splits the query task into a plurality of subtasks, issues the subtasks to the corresponding sub-engines, and saves the splitting strategy.
In this step, the metadata server stores the meta information of the sub-engines and of the HDFS data on the corresponding child nodes. After the SQL query request is matched against the meta information in the metadata, the query request is decomposed into subtasks based on the matched meta information, and each subtask is assigned to the corresponding child node. In this way the distribution of data resources and computing resources stays consistent while the task is split, which enables efficient data query. A subtask assigned to a child node corresponds both to the information in that node's sub-engine and to the information in its HDFS database: the retrieval grammar set in the scheduler is matched against the grammar subset in the sub-engine, and the SQL query, once mapped to the meta information, is matched against the resource data or business data in the HDFS database.
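One way to picture the metadata-driven split is a lookup from table name to the child nodes that hold the table's data, with one subtask created per node so that each computation runs where its data resides. The sketch below assumes a hypothetical METADATA dictionary and subtask layout; neither is taken from the embodiment, and the predicate is carried as an opaque string rather than parsed SQL.

```python
# Hypothetical metadata: which child nodes hold data files for each table.
METADATA = {
    "orders": ["node-a", "node-b"],
    "customers": ["node-c"],
}


def split_by_metadata(table, predicate):
    """Create one subtask per child node that stores part of the table."""
    if table not in METADATA:
        raise KeyError(f"unknown table: {table}")
    return [
        {"node": node, "table": table, "predicate": predicate}
        for node in METADATA[table]
    ]
```

Because a subtask is only ever addressed to a node listed for its table, data resources and computing resources stay co-located in this sketch.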
The metadata server stores metadata about the system, such as which databases are available and what the table structures of those databases are; when the resource information in the HDFS data is updated, the table structure content in the metadata server is updated accordingly. The metadata server also stores the scheduler's query summaries; when the scheduler receives the same query task again, it directly invokes the splitting strategy matching the query summary to split the query task.
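The reuse of a saved splitting strategy can be sketched as a small cache keyed by a digest of the query text. The text does not say how a query summary is computed, so hashing the whitespace-normalized SQL with SHA-256 is an assumption, as are the class and method names. Collapsing whitespace would also alter string literals that contain repeated spaces, so this is a simplification rather than a safe normalization.

```python
import hashlib


class SplitStrategyCache:
    """Maps a query digest to the splitting strategy saved for that query."""

    def __init__(self):
        self._by_digest = {}

    @staticmethod
    def digest(sql):
        # Assumed summary: SHA-256 of the query with runs of whitespace collapsed.
        normalized = " ".join(sql.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def lookup(self, sql):
        """Return the saved strategy for this query, or None if it is new."""
        return self._by_digest.get(self.digest(sql))

    def record(self, sql, strategy):
        """Save the strategy so a repeated query can skip re-planning."""
        self._by_digest[self.digest(sql)] = strategy
```

A scheduler built on this sketch would call lookup first and fall back to a full split, followed by record, only when lookup returns None.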
Preferably, the scheduler also tracks the health of the sub-engines and monitors the query process. When a child node goes offline because of a hardware failure, network error, software failure, or other reasons, the state monitoring process notifies all nodes so that subsequent queries avoid the unreachable node. Because the task of the state monitoring process is to assist when a problem arises, it plays no leading role in normal query operations. If the process does not run normally or becomes unreachable, the other nodes can still execute query tasks as usual; if a node fails while the state monitoring process is unreachable, the system merely becomes less robust, and the execution of normal tasks is not affected. When the state monitoring process comes back online, it re-establishes contact with the other nodes and resumes its monitoring function.
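A heartbeat table is one common way to realize this kind of health tracking, and the sketch below uses it purely as an assumption: the embodiment does not say how the state monitoring process detects a failed node. The HealthMonitor class, its timeout value, and the injectable clock are hypothetical, and the broadcast that notifies all nodes is left out.

```python
import time


class HealthMonitor:
    """Tracks sub-engine heartbeats; a node is unreachable after a timeout."""

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the timeout can be tested
        self.last_seen = {}

    def heartbeat(self, node):
        """Record that the sub-engine on this node is alive right now."""
        self.last_seen[node] = self.clock()

    def reachable(self):
        """Nodes that reported within the timeout; safe targets for subtasks."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout_s
        )

    def unreachable(self):
        """Nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A scheduler using this sketch would restrict dispatch to the nodes returned by reachable(), which matches the stated goal that subsequent queries avoid the unreachable node.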
Step S4, each sub-engine scheduled by the scheduler, after receiving its subtask, obtains the source data required by the query from the current HDFS database, executes the query operation on the source data according to its grammar subset, and sends the query result for the subtask to the scheduler.
The invoked sub-engines run on the HDFS child nodes. When a query request is split, a plurality of sub-engines are invoked and execute different query fragments in parallel to complete the whole query process. Because HDFS storage is distributed, the child nodes and the sub-engines running on them are correspondingly scalable: when the database is expanded, the sub-engines are scaled out based on the HDFS protocol, so that the query capability of the system grows together with the capacity of the database.
The data query method of this embodiment is based on the HDFS database, and software, hardware, and network failures of a single node are handled by the redundancy mechanism provided by HDFS. Table data in the HDFS database system is stored as HDFS data files and uses the file formats and compression strategies of HDFS. When data files are added to a new table, the mapping between the data files in HDFS and the table names in the system is managed centrally by the system itself.
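The centrally managed mapping between table names and HDFS data files can be pictured as a two-way catalog. The following sketch is an assumption about its shape only: the TableCatalog class, its method names, and the example paths are hypothetical, and no HDFS client is involved.

```python
class TableCatalog:
    """Keeps the mapping between table names and HDFS data file paths."""

    def __init__(self):
        self._files_by_table = {}
        self._table_by_file = {}

    def add_file(self, table, hdfs_path):
        """Register a data file as part of a table; a file has one owner."""
        if hdfs_path in self._table_by_file:
            raise ValueError(f"{hdfs_path} already belongs to a table")
        self._table_by_file[hdfs_path] = table
        self._files_by_table.setdefault(table, []).append(hdfs_path)

    def files_for(self, table):
        """Return the data files that make up a table, in insertion order."""
        return list(self._files_by_table.get(table, []))

    def table_of(self, hdfs_path):
        """Return the table a data file belongs to, or None if unregistered."""
        return self._table_by_file.get(hdfs_path)
```

Keeping both directions in one place is what lets a sub-engine resolve a table name in a subtask to the files it must read.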
Step S5, the scheduler integrates the query results returned by the sub-engines according to the splitting strategy of the query task to generate a final query result, sends the final query result to the client, and records the query summary in the metadata server.
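Why the splitting strategy must be kept until the merge can be seen from aggregates: row sets can simply be concatenated, partial sums can be added, but an average has to be rebuilt from per-node sums and counts rather than from per-node averages. The sketch below shows this under assumed strategy names ("concat", "sum", "avg") that do not appear in the embodiment.

```python
def merge_partials(partials, strategy):
    """Combine per-node results in the way the split strategy requires."""
    if strategy == "concat":
        # Each partial is a list of rows; the result is all rows together.
        return [row for part in partials for row in part]
    if strategy == "sum":
        # Each partial is a per-node subtotal.
        return sum(partials)
    if strategy == "avg":
        # Each partial is a (sum, count) pair; averaging the per-node
        # averages would weight small nodes too heavily.
        total = sum(part_sum for part_sum, _ in partials)
        count = sum(part_count for _, part_count in partials)
        return total / count if count else None
    raise ValueError(f"unknown merge strategy: {strategy}")
```

For example, two nodes reporting (10, 2) and (20, 3) merge to an average of 6.0, whereas averaging their individual averages would give a different value.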
As the above technical scheme shows, in the data query method based on the HDFS database provided by this embodiment, a user first accesses the HDFS database system through a user interface and, after obtaining permission, sends a query request to a scheduler. The scheduler splits the received query request according to the metadata or the query summary in the metadata server and sends the resulting subtasks to sub-engines; the invoked sub-engines execute the query fragments in parallel on their local HDFS nodes to complete the query task. The invention combines the distributed infrastructure of the underlying HDFS with structured data processing, completes the query by splitting the requested task, and returns the integrated query result to the client, thereby improving query efficiency and accuracy while keeping the system scalable.
The foregoing description covers only the preferred embodiments of the invention and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention disclosed herein is not limited to the particular combination of features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the invention, for example, technical solutions formed by replacing the above features with features of similar function disclosed in (but not limited to) the present invention.

Claims (10)

1. A data query method based on an HDFS database, characterized in that, after receiving a query request, a scheduler splits the requested task according to metadata or a query summary in a metadata server and issues the resulting subtasks to sub-engines running on HDFS child nodes.
2. The HDFS database-based data query method according to claim 1, comprising the steps of:
step S1, the user accesses the HDFS database system through the user interface;
step S2, the accessing user sends a query request to the scheduler in the SQL query language;
step S3, the scheduler translates and decomposes the received SQL query according to the metadata or the query summary in the metadata server, splits the query task into a plurality of subtasks, issues the subtasks to the corresponding sub-engines, and saves the splitting strategy;
step S4, each sub-engine scheduled by the scheduler, after receiving its subtask, obtains the source data required by the query from the current HDFS database, executes the query operation on the source data according to its grammar subset, and sends the query result for the subtask to the scheduler;
and step S5, the scheduler integrates the query results returned by the sub-engines according to the splitting strategy of the query task to generate a final query result, sends the final query result to the client, and records the query summary in the metadata server.
3. The HDFS database-based data query method according to claim 2, wherein the user interface in step S1 is a JDBC/ODBC application program interface or a Shell command line interface connected to the API.
4. The HDFS database-based data query method according to claim 3, wherein the user's access to the HDFS database system comprises authenticating the user identity or programmer identity of the user and assigning the corresponding access permissions according to the authentication result.
5. The HDFS database-based data query method according to claim 2, wherein in step S2, a query task is split into a plurality of subtasks in that, after the SQL query request is matched against the meta information in the metadata, the query request is decomposed into subtasks based on the matched meta information, and each subtask is assigned to the corresponding child node, keeping the distribution of data resources and computing resources consistent.
6. The HDFS database-based data query method according to claim 2, wherein the metadata server records a query summary of the scheduler, and when the scheduler receives the same query task again, the scheduler directly invokes the splitting strategy matching the query summary to split the query task.
7. The HDFS database-based data query method according to claim 2, wherein the scheduler tracks the health of the sub-engines through a state monitoring process and monitors the query process; when a child node goes offline because of a hardware failure, network error, or software failure, the state monitoring process notifies all nodes so that subsequent queries avoid the unreachable node.
8. The HDFS database-based data query method according to claim 2, wherein the invoked sub-engines execute different query fragments in parallel to complete the whole query process.
9. The HDFS database-based data query method according to claim 2, wherein when HDFS child nodes are added to the HDFS database, the sub-engines are scaled out based on the HDFS protocol, so that the query capability of the system grows together with the capacity of the database.
10. The HDFS database-based data query method according to claim 2, wherein the redundancy mechanism provided by the HDFS underlying the HDFS database is used to cope with software, hardware, and network failures of a single node.
CN202011226066.4A 2020-11-05 2020-11-05 Data query method based on HDFS database Pending CN112395308A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011226066.4A CN112395308A (en) 2020-11-05 2020-11-05 Data query method based on HDFS database

Publications (1)

Publication Number Publication Date
CN112395308A true CN112395308A (en) 2021-02-23

Family

ID=74598206

Country Status (1)

Country Link
CN (1) CN112395308A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113886458A (en) * 2021-09-23 2022-01-04 浙江至元数据科技有限公司 Distributed hiding query method and system based on task aggregation
CN114020744A (en) * 2021-11-03 2022-02-08 北京沃东天骏信息技术有限公司 Data transmission method, device, electronic equipment and computer readable medium
CN114610746A (en) * 2022-03-15 2022-06-10 云粒智慧科技有限公司 SQL merging execution system and method of multi-relational data engine

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779183A (en) * 2012-07-02 2012-11-14 华为技术有限公司 Data inquiry method, equipment and system
CN103246749A (en) * 2013-05-24 2013-08-14 北京立新盈企信息技术有限公司 Matrix data base system for distributed computing and query method thereof
CN103678520A (en) * 2013-11-29 2014-03-26 中国科学院计算技术研究所 Multi-dimensional interval query method and system based on cloud computing
CN104903894A (en) * 2013-01-07 2015-09-09 脸谱公司 System and method for distributed database query engines
CN107784103A (en) * 2017-10-27 2018-03-09 北京人大金仓信息技术股份有限公司 A kind of standard interface of access HDFS distributed memory systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination