CN114356964A

CN114356964A - Data lineage construction method and device, storage medium and electronic equipment

Info

Publication number: CN114356964A
Application number: CN202210001562.2A
Authority: CN
Inventors: 刘俊杰; 余利华; 郭忆; 李卓豪; 汪源
Original assignee: Netease Hangzhou Network Co Ltd
Current assignee: Netease Hangzhou Network Co Ltd
Priority date: 2022-01-04
Filing date: 2022-01-04
Publication date: 2022-04-15

Abstract

The invention relates to a data lineage construction method and device, a storage medium and electronic equipment. According to embodiments of the disclosure, the collection point for data lineage is placed in the data processing engine of a database system so as to obtain the parsed access plan generated by the data processing engine, and a data lineage relationship is constructed based on the parsed access plan. Because the collection point operates on the parsed access plan produced by the data processing engine, the accuracy of the resulting data lineage relationship is guaranteed by the correctness of that plan. In addition, placing the collection point in the data processing engine replaces the approach in which a platform layer parses received SQL commands with a syntax parsing tool to obtain data lineage: the syntax parsing tool is no longer needed, the various syntax rules of different database commands need not be handled, the impact of engine upgrades on a syntax parsing tool is avoided, and development cost and implementation difficulty are reduced. The method can also cover various ETL scenarios.

Description

Data lineage construction method and device, storage medium and electronic equipment

Technical Field

Embodiments of the present disclosure relate to the field of data processing technologies, and in particular, to a data lineage construction method and apparatus, a storage medium, and an electronic device.

Background

This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims and the description herein is not admitted to be prior art by inclusion in this section.

With the advent of the big data era, massive volumes of data must be processed and stored. Data flows are accordingly becoming more complex, and accurately determining the dependencies among data generated as data flows is becoming more important; these dependencies bear on troubleshooting, traceability, and the accurate construction of organizational and/or user relationships.

Disclosure of Invention

Although current database systems (including traditional databases and distributed databases) have made great progress in data processing performance, their handling of data dependencies remains inadequate. In a database system, a data processing process, i.e. an ETL process, extracts, transforms, and loads data from a source end to a destination end, and the data dependencies generated in this process are referred to as "data lineage". Current ways of extracting data lineage have several disadvantages. For example, data lineage relationships are typically extracted from an Abstract Syntax Tree (AST) obtained by parsing database commands. An AST is a tree representation of the abstract syntactic structure of source code written in a programming language; each node of the tree represents a construct that appears in the source code (e.g., an SQL command). In practice, because the syntax rules of database commands are complex, parsing them into an abstract syntax tree is difficult and error-prone, so the accuracy of data lineage extracted from the abstract syntax tree cannot be guaranteed. In addition, when the data processing engine is upgraded, the syntax rules of the corresponding SQL commands may change; the way SQL commands are parsed into an abstract syntax tree and the way data lineage is extracted must then change accordingly, which incurs further development and maintenance costs.

Therefore, there is a need in the art for a data lineage extraction scheme that efficiently extracts accurate data lineage relationships so as to solve the above problems.

In this context, embodiments of the present disclosure provide a data lineage construction method, apparatus, storage medium, and electronic device.

According to a first aspect of the present disclosure, a data lineage construction method is provided, which is applied to a database system; the database system includes a data processing engine that generates an access plan corresponding to a database command, the access plan including a plurality of nodes, at least some of the plurality of nodes relating to data objects; the method comprises the following steps: obtaining a parsed access plan generated by the data processing engine; constructing a data lineage relationship based on the parsed access plan; wherein the data lineage relationship represents an association relationship between the data objects referred to by the plurality of nodes.

In some embodiments of the first aspect of the present disclosure, the resolved access plan comprises at least one of: a logical access plan generated after metadata matching; a physical access plan generated based on the logical access plan.

In some embodiments of the first aspect of the present disclosure, the data processing engine comprises Spark.

In some embodiments of the first aspect of the present disclosure, the data processing engine is configured with a listener; the obtaining the parsed access plan generated by the data processing engine comprises: after the parsed access plan is successfully executed, calling the listener to acquire the parsed access plan.

In some embodiments of the first aspect of the present disclosure, the constructing a data lineage relationship based on the parsed access plan comprises: traversing each of the nodes in the parsed access plan to determine output nodes and input nodes; and acquiring first identification information of the data object referred to by the output node and second identification information of the data object referred to by the input node, and associating the first identification information with the second identification information to construct the data lineage relationship.
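The association step described above can be illustrated with a minimal sketch. It assumes the identification information is simply a table name and that every input is associated with every output; the `LineageEdge` type and the `associate` function are hypothetical names introduced here for illustration and do not appear in the disclosure.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class LineageEdge:
    """One dependency: data flows from `source` into `target`."""
    source: str  # second identification information (input table name)
    target: str  # first identification information (output table name)


def associate(output_ids, input_ids):
    """Pair every input identifier with every output identifier."""
    return {LineageEdge(source=i, target=o)
            for o, i in product(output_ids, input_ids)}


# Hypothetical table names, for illustration only.
edges = associate(["db.output_table"], ["db.input_a", "db.input_b"])
```

Storing the edges as a set keeps the result free of duplicates when the same input table appears on several branches of the plan.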

In some embodiments of the first aspect of the present disclosure, said traversing each of said nodes in said parsed access plan to determine output nodes and input nodes comprises: traversing each node, and executing a preset judgment method on the current node in the traversing process, wherein the preset judgment method comprises the following steps: determining whether the description sentence of the current node contains preset information, wherein the preset information comprises: first characteristic information corresponding to the output node, second characteristic information corresponding to the input node or third characteristic information corresponding to the intermediate node; if the description statement contains first characteristic information, determining that the current node is an output node, and executing the preset judgment method on a lower node of the output node in the analyzed access plan; if the description statement contains third feature information, determining that the current node is an intermediate node, and executing the preset judgment method on a lower node of the intermediate node in the analyzed access plan; and if the description statement contains second characteristic information, determining the current node as an input node.
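As a rough illustration of this three-way judgment method, the sketch below models the plan as a small tree and recurses only below output and intermediate nodes. The marker strings, the `PlanNode` type and the `classify` function are assumptions made for this example; a real engine exposes its own node types and description statements.

```python
from dataclasses import dataclass, field

# Hypothetical marker strings standing in for the characteristic information.
OUTPUT_MARKERS = ("InsertInto", "CreateTable")   # first characteristic information
INPUT_MARKERS = ("Relation",)                    # second characteristic information
INTERMEDIATE_MARKERS = ("Union", "Intersect")    # third characteristic information


@dataclass
class PlanNode:
    description: str          # the node's description statement
    table: str = ""           # table name, when the node refers to a data object
    children: list = field(default_factory=list)


def classify(node, outputs, inputs):
    """Apply the judgment method to `node` and, where required, to its lower nodes."""
    if any(m in node.description for m in OUTPUT_MARKERS):
        outputs.append(node.table)
        for child in node.children:
            classify(child, outputs, inputs)
    elif any(m in node.description for m in INTERMEDIATE_MARKERS):
        for child in node.children:
            classify(child, outputs, inputs)
    elif any(m in node.description for m in INPUT_MARKERS):
        inputs.append(node.table)  # input node: traversal stops on this branch


plan = PlanNode("InsertInto", "output_table", [
    PlanNode("Union", children=[
        PlanNode("Relation", "input_a"),
        PlanNode("Relation", "input_b"),
    ]),
])
outputs, inputs = [], []
classify(plan, outputs, inputs)
```

In this sketch a node that matches none of the three marker sets ends the traversal of its branch, so the set of intermediate markers has to be complete for the inputs below it to be found.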

In some embodiments of the first aspect of the present disclosure, said traversing each of said nodes in said parsed access plan to determine output nodes and input nodes comprises: traversing each node, and executing a preset judgment method on the current node in the traversing process, wherein the preset judgment method comprises the following steps: determining whether the description sentence of the current node contains preset information, wherein the preset information comprises: first characteristic information corresponding to the output node or second characteristic information corresponding to the input node; if the description statement contains first characteristic information, determining that the current node is an output node, and executing the preset judgment method on a lower node of the output node in the analyzed access plan; if the description statement does not contain the first characteristic information and the second characteristic information, executing the preset judgment method on a lower node of the current node in the analyzed access plan; and if the description statement contains second characteristic information, determining the current node as an input node.
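The two-way variant above can be sketched without recursion. Here any node that is neither an output nor an input is passed over and its lower nodes are visited. The plan is modeled as nested dictionaries with a `kind` field compared by exact match; both the dictionary layout and the kind names are assumptions for illustration only.

```python
def find_endpoints(root):
    """Return (output_tables, input_tables) for a plan given as nested dicts."""
    outputs, inputs = [], []
    stack = [root]
    while stack:
        node = stack.pop()
        kind = node["kind"]
        if kind in ("InsertInto", "CreateTable"):   # first characteristic information
            outputs.append(node["table"])
        elif kind == "Relation":                    # second characteristic information
            inputs.append(node["table"])
            continue                                # input node: nothing below to visit
        # Output nodes and unmatched nodes both hand their lower nodes on.
        stack.extend(reversed(node.get("children", [])))
    return outputs, inputs


plan = {"kind": "InsertInto", "table": "output_table", "children": [
    {"kind": "Project", "children": [
        {"kind": "Filter", "children": [
            {"kind": "Relation", "table": "input_table"},
        ]},
    ]},
]}
outputs, inputs = find_endpoints(plan)
```

Because unmatched nodes are skipped rather than enumerated, this variant does not need a list of intermediate node types, at the cost of visiting every node above the inputs.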

In some embodiments of the first aspect of the present disclosure, the parsed access plan is described by Scala language or Java language; the preset judgment method respectively determines whether the description sentence of the current node contains preset information through an instance matching command.

In some embodiments of the first aspect of the present disclosure, the first characteristic information comprises at least one of: a create command; an insert command. The third characteristic information comprises at least one of the following commands acting on data objects: a union command; an intersect command; or, the third characteristic information comprises information other than the first characteristic information and the second characteristic information. The second characteristic information comprises: a preset statement describing the relation type of a data object.

In some embodiments of the first aspect of the present disclosure, the data processing engine comprises Hive.

In some embodiments of the first aspect of the present disclosure, the data processing engine is configured with a hook program; the obtaining the resolved access plan generated by the data processing engine comprises: after the parsed access plan is generated, a hook program is invoked to extract the parsed access plan.

In some embodiments of the first aspect of the present disclosure, the constructing a data lineage relationship based on the parsed access plan comprises: calling an information extraction method pre-configured for the parsed access plan to extract first identification information and second identification information; wherein the first identification information corresponds to a data object referred to by an output node and the second identification information corresponds to a data object referred to by an input node; and associating the first identification information with the second identification information to obtain the data lineage relationship.

In some embodiments of the first aspect of the present disclosure, the data lineage construction method further includes: determining whether the database command is of a preset command type, the preset command type being one that gives rise to a data lineage relationship; in response to the database command being of the preset command type, attempting to construct a data lineage relationship based on the parsed access plan; and in response to a failure to construct the data lineage relationship, feeding back error information for the parsed access plan.
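The check described above can be sketched as a small gate around the lineage builder. The command type labels, the `check_lineage` function and the toy builder below are all hypothetical; the sketch only shows the control flow of skipping irrelevant commands and reporting an error when construction fails.

```python
LINEAGE_COMMAND_TYPES = {"INSERT", "CREATE_AND_INSERT"}  # hypothetical labels


def check_lineage(command_type, plan, build_lineage):
    """Return (lineage, error). Both are None when the command type is skipped."""
    if command_type not in LINEAGE_COMMAND_TYPES:
        return None, None
    try:
        lineage = build_lineage(plan)
    except Exception as exc:
        return None, f"lineage construction failed for plan {plan!r}: {exc}"
    if not lineage:
        return None, f"no lineage found for plan {plan!r}"
    return lineage, None


def toy_builder(plan):
    # Stand-in builder: expects a dict with 'output' and 'inputs' keys.
    return [(src, plan["output"]) for src in plan["inputs"]]


ok = check_lineage("INSERT", {"output": "t_out", "inputs": ["t_in"]}, toy_builder)
skipped = check_lineage("SELECT", {"output": None, "inputs": ["t_in"]}, toy_builder)
failed = check_lineage("INSERT", {"inputs": ["t_in"]}, toy_builder)
```

Returning the error text together with the plan makes it possible to feed the failing plan back for inspection, which is the purpose of the error feedback in this embodiment.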

In some embodiments of the first aspect of the present disclosure, the preset command type relates to an operation of a data object referred to by the output node, and the preset command type includes at least one of: insert command, create and insert command.

In some embodiments of the first aspect of the present disclosure, the database command is an SQL command.

According to a second aspect of the present disclosure, a data lineage construction device is provided, which is applied to a database system; the database system includes a data processing engine that generates an access plan corresponding to a database command, the access plan including a plurality of nodes, at least some of the plurality of nodes relating to data objects; the device comprises: a plan acquisition module for acquiring the parsed access plan generated by the data processing engine; and a lineage construction module for constructing a data lineage relationship based on the parsed access plan; wherein the data lineage relationship represents an association relationship between the data objects referred to by the plurality of nodes.

According to a third aspect of the present disclosure, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the data lineage construction method according to any one of the embodiments of the first aspect.

According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform, via execution of the executable instructions, the data lineage construction method according to any one of the embodiments of the first aspect.

According to embodiments of the present disclosure, the collection point for data lineage is placed in the data processing engine of a database system so as to obtain the parsed access plan generated by the data processing engine, and a data lineage relationship is constructed based on the parsed access plan. Because the collection point is placed in the data processing engine, the accuracy of the resulting data lineage relationship is guaranteed by the correctness of the engine's parsed access plan. In addition, the database commands no longer need to be parsed into an abstract syntax tree outside the data processing engine in order to extract data lineage. Even if the syntax rules of the database commands change, upgrading the lineage collection tool only requires attention to changes in the node types of the parsed access plan, not to changes in the syntax rules of every database command, which effectively reduces development and maintenance costs.

Drawings

The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

Fig. 1 is a schematic diagram illustrating the placement of the data lineage collection point in an embodiment of the present disclosure.

Fig. 2 shows a flowchart of a data lineage construction method in an embodiment of the present disclosure.

Fig. 3 is a schematic diagram illustrating the principle of extracting data lineage based on the Spark engine in an embodiment of the present disclosure.

Fig. 4 is a flowchart illustrating the construction of data lineage relationships based on a parsed access plan in an embodiment of the present disclosure.

Fig. 5A shows a schematic flowchart of the implementation of step S401 in an embodiment of the present disclosure.

Fig. 5B shows a schematic flowchart of the implementation of step S401 in another embodiment of the present disclosure.

Fig. 6 is a schematic diagram illustrating the principle of extracting data lineage based on Hive in an embodiment of the present disclosure.

Fig. 7 is a schematic diagram of an application scenario in which the data lineage construction method is used to detect whether data lineage is correctly extracted, in an embodiment of the present disclosure.

Fig. 8 is a flowchart of a method that uses the data lineage construction method to detect whether data lineage is correctly extracted, in an embodiment of the present disclosure.

Fig. 9 shows a block diagram of a data lineage construction apparatus according to an embodiment of the present disclosure.

Fig. 10 shows a schematic diagram of a storage medium in an embodiment of the present disclosure; and

Fig. 11 shows a block diagram of an electronic device in an embodiment of the present disclosure.

In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Detailed Description

The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present disclosure, and are not intended to limit the scope of the present disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

According to embodiments of the disclosure, a data lineage construction method, a data lineage construction device, a storage medium and an electronic device are provided.

In this document, any number of elements in the drawings is by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.

The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.

Summary of The Invention

In the ETL process, data tables are the objects operated on, so the data lineage relationship can be expressed as the relationship between input data tables and output data tables. The data lineage relationship may be constructed by acquiring information about the data tables involved in the ETL process, determining the dependencies between those tables, and storing the dependencies.

In some examples, the ETL process may be controlled by database commands, which may be written in Structured Query Language (SQL). The data may be stored in the form of data tables. A data system typically includes a platform layer (e.g., a big data platform, a data middle platform, etc.) and an engine layer (including data processing engines for data computation, such as Hive, Spark, etc.). The platform layer obtains the original SQL command submitted by a user and sends it to the engine layer for processing. After the data processing engine in the engine layer parses the SQL command into an Abstract Syntax Tree (AST), it generates an access plan for the SQL command from the abstract syntax tree and accesses the corresponding data tables according to the access plan so as to carry out the SQL command.

In some related technologies, when constructing a data lineage relationship based on a data table, a platform layer may parse, using a syntax parsing tool (e.g., ANTLR, etc.) other than a data processing engine, an SQL command submitted by a user to obtain an abstract syntax tree, and analyze the obtained abstract syntax tree to extract the data lineage relationship between the data tables.

For example, for the following SQL commands submitted by the user:

INSERT INTO output_table SELECT * FROM input_table;

the syntax parsing tool parses the SQL statement into an abstract syntax tree, from which the input data table "input_table" and the output data table "output_table" can be extracted; the data lineage relationship input_table -> output_table is thereby obtained and stored.

However, extracting data lineage from abstract syntax trees has many problems.

On the one hand, the extracted data lineage relationships are not sufficiently accurate. In real scenarios, the number of SQL commands is huge and the syntax rules are complex, so an abstract syntax tree produced by a data processing engine or a syntax parsing tool is quite likely to contain errors. When the abstract syntax tree is used as the source of data lineage, the accuracy of the resulting data lineage relationships is therefore difficult to guarantee.

On the other hand, the technical implementation is difficult. First, because the syntax of different types of SQL commands differs, parsing them into an abstract syntax tree with a syntax parsing tool (e.g., ANTLR) requires the corresponding syntax rules to be implemented case by case, which leads to a high development cost. Second, SQL commands can be extremely complex (e.g., multiple nested sub-queries), which tests both the parsing capability of the syntax parsing tool and the skill of the developers. Third, when the data processing engine is upgraded, the corresponding SQL syntax rules may change, requiring a substantial upgrade of the syntax parsing tool and incurring further development and maintenance costs.

In yet another aspect, scenario coverage is incomplete, so data lineage relationships are missed. Obtaining data lineage by parsing SQL commands with a syntax parsing tool, as in the example above, only works in scenarios where the SQL command can be obtained, for example when a user submits the SQL command to the platform layer. In scenarios where the SQL statement cannot be obtained, for example when the user bypasses the platform layer and submits the SQL command directly to the engine layer (e.g., by invoking the data processing engine from a script), the platform layer never sees the SQL statement and cannot parse it into an abstract syntax tree, so the data lineage is missed. In addition, in non-SQL scenarios (e.g., API calls to a data processing engine), no abstract syntax tree is parsed from an SQL command at the platform layer, which also results in missing data lineage.

In view of this, in order to solve the above technical problems, in embodiments of the present disclosure the information collection point for data lineage is placed in the data processing engine, and the parsed access plan is used as the source from which accurate data lineage relationships are constructed.

To illustrate the above process, fig. 1 is a schematic diagram illustrating the placement of the data lineage collection point in an embodiment of the present disclosure.

As shown in fig. 1, the data processing engine 105 parses the SQL command 101 to obtain an abstract syntax tree 102, from which a logical access plan 103 (Logical Query Plan) can be generated. The logical access plan 103 shown is generated after metadata matching (schema match) against the abstract syntax tree 102. The schema may include database information, table information, column information, and the like, so the logical access plan 103 generated after metadata matching contains the complete table information. Further, if the data processing engine 105 is able to generate the logical access plan 103 through metadata matching, then that logical access plan 103 is correct, and the physical access plan 104 (Physical Query Plan) derived from it is also correct; this in turn means that the original SQL command 101 and the abstract syntax tree 102 corresponding to the logical access plan 103 are correct as well.
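To make the role of metadata matching concrete, the following minimal sketch resolves bare table names against a catalog and fails when a table is unknown. The catalog contents, the `AnalysisError` type and the `resolve` function are assumptions for illustration; they are not the engine's actual interfaces.

```python
# Hypothetical catalog: table name -> (database, columns).
CATALOG = {
    "input_table": ("demo_db", ["id", "name"]),
    "output_table": ("demo_db", ["id", "name"]),
}


class AnalysisError(Exception):
    """Raised when a referenced table is unknown to the catalog."""


def resolve(table_names, catalog=CATALOG):
    """Map each bare table name to a fully qualified, schema-checked entry."""
    resolved = {}
    for name in table_names:
        if name not in catalog:
            raise AnalysisError(f"table not found: {name}")
        database, columns = catalog[name]
        resolved[name] = {"qualified": f"{database}.{name}", "columns": columns}
    return resolved


resolved = resolve(["input_table", "output_table"])
```

A plan that survives this step refers only to tables that exist in the metadata, which is why a plan obtained after metadata matching is treated as a reliable source of table information.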

In embodiments of the present disclosure, the data lineage collection point 106 is placed in the data processing engine 105. The parsed access plan that the data processing engine 105 produces after metadata matching, and whose correctness is therefore ensured (such as the metadata-matched logical access plan 103 or the physical access plan 104), is used as the source of data lineage, so the accuracy of the data lineage relationships can be ensured. Moreover, unlike the related art, in which the platform layer parses the received SQL command with a syntax parsing tool and derives data lineage from the parsing result, the scheme in embodiments of the present disclosure obtains data lineage at the data processing engine 105. It is therefore not restricted to scenarios in which the SQL command is received at the platform layer, and it uses no syntax parsing tool outside the data processing engine, so problems such as the difficulty of developing a syntax parsing tool and incomplete coverage are avoided.

Exemplary method embodiments

Referring to fig. 2, a flowchart of a data lineage construction method according to an embodiment of the disclosure is shown. The method is applied to a database system that includes a data processing engine. The data processing engine generates access plans corresponding to database commands, such as the logical access plan and the physical access plan described above. The access plan includes a plurality of nodes, at least some of which relate to data objects, for example data tables.

As shown in fig. 2, the method includes:

step S201: obtaining a resolved access plan generated by the data processing engine;

step S202: constructing a data lineage relationship based on the parsed access plan.

The data lineage relationship represents an association relationship between the data objects referred to by the plurality of nodes. Illustratively, the association relationship between data objects can be expressed as a dependency between data tables, more specifically a dependency between an output data table and an input data table.

The technical effects achieved by the data lineage construction method provided in embodiments of the present disclosure are explained below.

On the one hand, the accuracy of the data lineage relationships is guaranteed. The parsed access plan includes the logical access plan that the data processing engine generates after metadata matching and the physical access plan generated from that logical access plan, so the correctness of the parsed access plan is guaranteed. Using the parsed access plan as the source of data lineage therefore guarantees the accuracy of the constructed data lineage relationships.

On the other hand, the technical difficulty is reduced. Placing the data lineage collection point in the data processing engine replaces the approach in which the platform layer parses received SQL commands with a syntax parsing tool to obtain data lineage. The syntax parsing tool is no longer needed, and the effect of version upgrades on the many syntax rules of different database commands need not be considered; only changes to the engine's parsed access plan after an upgrade (for example, a change in the description syntax of its nodes) need attention. Development cost and implementation difficulty can therefore be greatly reduced.

In yet another aspect, scenario coverage is comprehensive. Only the parsed access plan generated by the data processing engine matters, so data lineage relationships can be constructed whether or not the ETL scenario involves a platform layer and whether or not it involves an SQL command, and all scenarios can be covered.

In some embodiments, the data lineage construction method may be implemented as a plug-in that is attached inside the data processing engine; after the data processing engine generates the parsed access plan, the plug-in executes the above method flow to construct the data lineage relationship. Because the method is attached to the data processing engine as a plug-in, the plug-in can access the access plans of SQL commands executed on the data processing engine as well as those of the related Application Program Interfaces (APIs), and various ETL scenarios can therefore be supported.

In some embodiments, the data processing engine is, for example, Spark or Hive. Hive is an open-source big data warehouse processing engine with which a big data ETL flow can be implemented, for example, by writing SQL commands. Spark is a data processing engine similar to Hive; its in-memory processing model gives it a performance advantage over Hive.

The following embodiments take Spark and Hive as examples to illustrate specific implementations of the data lineage construction method for different data processing engines.

In some embodiments, the parsed access plan may be obtained by a Listener (Listener) or Hook (Hook) configured by the data processing engine, depending on the type of data processing engine.

Spark can be configured with a listener through spark.sql.queryExecutionListeners. The listener's monitoring function can be realized either by inheritance or by implementation, so that the listener can obtain the parsed access plan. Inheritance means that a class inherits the functionality of a parent class; implementation means that a class implements an interface.

The SQL parsing and execution process of Spark is shown in fig. 3. Spark 305 parses the original SQL command 301 into an abstract syntax tree 302 and then performs metadata matching to form a complete analyzed logical access plan 303 (Analyzed Logical Plan). An optimized logical access plan (Optimized Logical Plan) may then be generated through optimization and combined with actual environment information to form the final physical access plan 304; alternatively, the physical access plan 304 may be obtained directly from the analyzed logical access plan 303. After the physical access plan 304 is executed, the corresponding listening stage 306 is entered. The analyzed logical access plan 303, and the optimized logical access plan (optional, not shown in the figure) and physical access plan 304 derived from it, are all correct, and any of them can serve as the parsed access plan from which data lineage is obtained.

In the listening stage 306, for example, the onSuccess method of Spark's QueryExecutionListener is called (onSuccess indicates that the listener observed a successfully executed access plan), and a QueryExecution object is passed in. From it, the parsed access plan corresponding to the current SQL command can be obtained, namely any one of the analyzed logical access plan produced by metadata matching, the optimized logical access plan, and the physical access plan. The description here takes the analyzed logical access plan as an example. The parsed access plan is then traversed to determine the table names of the input data tables and of the output data tables, and the data lineage relationships between them are constructed.
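The listening stage can be pictured with a toy model written in plain Python rather than against Spark itself. The `ToyEngine` and `LineageListener` classes are hypothetical stand-ins, the `on_success` signature only loosely mirrors the shape of the engine's success callback, and the plan is represented as a plain string.

```python
class LineageListener:
    """Toy listener that captures the analyzed plan of each successful execution."""

    def __init__(self):
        self.captured = []

    def on_success(self, func_name, query_execution, duration_ns):
        # Take the analyzed logical plan from the execution object.
        self.captured.append(query_execution["analyzed"])


class ToyEngine:
    """Stand-in engine that notifies listeners only after a plan executes successfully."""

    def __init__(self):
        self.listeners = []

    def register(self, listener):
        self.listeners.append(listener)

    def run(self, analyzed_plan, action):
        action()  # an exception here means no success notification is sent
        execution = {"analyzed": analyzed_plan}
        for listener in self.listeners:
            listener.on_success("command", execution, 0)


engine = ToyEngine()
listener = LineageListener()
engine.register(listener)
engine.run("InsertInto(output_table) <- Relation(input_table)", lambda: None)
```

Collecting the plan only in the success callback means that failed commands, which write no data, contribute no lineage.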

Fig. 4 is a schematic flow chart illustrating construction of the data lineage relationship based on a parsed access plan according to an embodiment of the present disclosure. The process comprises the following steps:

step S401: traversing each of the nodes in the parsed access plan to determine output nodes and input nodes.

In some embodiments, the parsed access plan is also a tree structure that includes a plurality of nodes. Taking the case where the data object is a data table as an example, the parsed access plan corresponds to a database command that performs an operation from an input data table to an output data table. The plurality of nodes includes: an output node containing first identification information (e.g., a table name) of the output data table, an input node containing second identification information (e.g., a table name) of the input data table, and intermediate nodes between the output node and the input node.

The input nodes, intermediate nodes and output nodes may each have distinct characteristic information, such as code strings. The output nodes and input nodes are determined by traversing each node and matching the currently traversed node against the characteristic information of the input nodes, intermediate nodes and output nodes, respectively. In the tree structure of the parsed access plan, the highest level is generally an output node, the middle levels are intermediate nodes, and the lowest level consists of input nodes. The traversal may therefore determine the output node first and then proceed downward level by level through each node until the input nodes are finally determined.

As shown in fig. 5A, a schematic flow chart of the implementation of step S401 in an embodiment of the disclosure is shown.

The flow in fig. 5A includes: traversing each node, and executing a preset judgment method on the current node in the traversing process, wherein the preset judgment method comprises the following steps:

step S501A: determining whether the description sentence of the current node contains preset information, wherein the preset information comprises: first characteristic information corresponding to the output node, second characteristic information corresponding to the input node or third characteristic information corresponding to the intermediate node;

if the descriptive statement includes the first feature information, step S502A is executed: determining the current node as an output node, and executing the preset judgment method on a lower node of the output node in the analyzed access plan; that is, the next-level node is a new current node, and the process returns to step S501A;

if the descriptive statement includes the third feature information, step S503A is executed: determining that the current node is an intermediate node, and executing the preset judgment method on a lower node of the intermediate node in the analyzed access plan; that is, the next-level node is a new current node, and the process returns to step S501A;

if the descriptive statement includes the second feature information, step S504A is executed: determining the current node as an input node.

There may be only one output node. The lower nodes of the output node may be intermediate nodes. The lower nodes of an intermediate node may be intermediate nodes or input nodes. The output node and each intermediate node may have a plurality of lower nodes, for example, two lower nodes corresponding to the left and right downward branches of a binary tree. There may be one or more input nodes.

In some embodiments, the first characteristic information comprises at least one of the following acting on the data object: a create command (e.g., CreateAsSelect); an insert command (e.g., InsertInto). In some embodiments, the third characteristic information comprises at least one of the following acting on the data object: a merge command (e.g., Union); an intersection command (e.g., Join). Alternatively, the intermediate node may be determined by exclusion, that is, the third characteristic information comprises any information other than the first characteristic information and the second characteristic information, so that whatever remains after the input nodes and output nodes are excluded is an intermediate node. In some embodiments, the second characteristic information comprises a preset statement describing the relationship type between data objects, such as HiveTableRelation, which usually appears in the form HiveTableRelation(X), where X is the table name of the input data table.

In another example, if only the input nodes and output nodes are of interest, they may be determined based on the first characteristic information and the second characteristic information alone. All other nodes are simply passed over, so the third characteristic information can be omitted altogether rather than being defined by excluding the first characteristic information and the second characteristic information.
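The matching of a node's description statement against the three kinds of characteristic information can be sketched as follows. This is a minimal illustration under stated assumptions: the feature strings are the examples named in the text, while the function name classify, the use_exclusion flag and the plain substring test are hypothetical choices made for the sketch.

```python
# Example feature strings taken from the description above.
FIRST_FEATURES = ("CreateAsSelect", "InsertInto")   # output node
SECOND_FEATURES = ("HiveTableRelation",)            # input node
THIRD_FEATURES = ("Union", "Join")                  # intermediate node


def classify(description: str, use_exclusion: bool = False) -> str:
    """Return 'output', 'input', 'intermediate' or 'unknown' for one node.

    With use_exclusion=True, anything that is neither an output node nor an
    input node is treated as an intermediate node (the exclusion method).
    """
    if any(feature in description for feature in FIRST_FEATURES):
        return "output"
    if any(feature in description for feature in SECOND_FEATURES):
        return "input"
    if use_exclusion or any(feature in description for feature in THIRD_FEATURES):
        return "intermediate"
    return "unknown"
```

With the exclusion method the list of third features never needs to be maintained, at the cost of not distinguishing merge or intersection nodes from other operators.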

As shown in fig. 5B, a schematic flow chart of the implementation of step S401 in another embodiment of the disclosure is shown.

The flow in fig. 5B includes: traversing each node, and executing a preset judgment method on the current node in the traversing process, wherein the preset judgment method comprises the following steps:

step S501B: determining whether the description sentence of the current node contains preset information, wherein the preset information comprises: first characteristic information corresponding to the output node or second characteristic information corresponding to the input node;

if the descriptive statement includes the first feature information, step S502B is executed: determining the current node as an output node, and executing the preset judgment method on a lower node of the output node in the analyzed access plan; that is, the next-level node is a new current node, and the process returns to step S501B;

if the descriptive statement includes neither the first feature information nor the second feature information, step S503B is executed: executing the preset judgment method on the lower node of the current node in the analyzed access plan; that is, the next-level node is a new current node, and the process returns to step S501B;

if the descriptive statement includes the second feature information, step S504B is executed: determining the current node as an input node.

Compared with the embodiment of fig. 5A, the logic in the embodiment of fig. 5B may be applied to a scenario in which the data blood relationship does not need the intermediate node to participate, and the third feature information for determining the intermediate node is omitted, which is beneficial to reducing the complexity of code implementation and improving the speed and efficiency of constructing the data blood relationship.
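The simplified flow of fig. 5B can be sketched as a short recursive traversal. The sketch assumes a hypothetical PlanNode structure (a description string, an optional table name and a list of lower nodes) and uses the example feature strings from the text; it is not the plan representation of any particular engine.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PlanNode:
    """Hypothetical node of a parsed access plan."""
    description: str
    table: Optional[str] = None
    children: List["PlanNode"] = field(default_factory=list)


OUTPUT_FEATURES = ("CreateAsSelect", "InsertInto")
INPUT_FEATURES = ("HiveTableRelation",)


def find_tables(node: PlanNode, outputs: List[str], inputs: List[str]) -> None:
    """Collect output and input table names without classifying intermediate nodes."""
    if any(f in node.description for f in INPUT_FEATURES):
        inputs.append(node.table)        # step S504B: input node, stop descending
        return
    if any(f in node.description for f in OUTPUT_FEATURES):
        outputs.append(node.table)       # step S502B: output node
    # Steps S502B and S503B both continue with the lower nodes.
    for child in node.children:
        find_tables(child, outputs, inputs)
```

Because every node that is neither an input nor an output is handled by the same fall-through branch, operators that the sketch does not know about do not interrupt the traversal.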

Returning again to fig. 4, see step S402: and acquiring first identification information of the data object related to the output node and second identification information of the data object related to the input node, and associating the first identification information and the second identification information to construct the data consanguinity relationship.

In some embodiments, the table names of the input data table and the table names of the output data table are obtained, and the correlation between the input data table and the output data table is constructed according to the table names of the input data table and the output data table so as to obtain the data consanguinity relationship.

In a possible example, the data relationship between the input data table and the output data table can be stored in the visualization graph database to form a visualization data relationship graph, which is more convenient for developers to intuitively understand.
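Associating the table names can be sketched as building directed edges from each input table to each output table and grouping them per output table. The function names build_edges and to_graph and the in-memory dictionary are assumptions made for illustration; a real deployment would write the edges to whatever graph database is in use.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_edges(input_tables: List[str], output_tables: List[str]) -> List[Tuple[str, str]]:
    """Associate every input table with every output table as (input, output) edges."""
    return [(src, dst) for dst in output_tables for src in input_tables]


def to_graph(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group the edges by output table, giving each table's upstream tables."""
    upstream: Dict[str, Set[str]] = defaultdict(set)
    for src, dst in edges:
        upstream[dst].add(src)
    return dict(upstream)
```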

In some embodiments, the parsed access plan is described by the Scala language or the Java language. For complex SQL commands, the analyzed access plan generated by the data processing engine is a huge tree structure, and if the data table information is extracted by simply adopting lexical analysis, character string matching and other modes, the accuracy and robustness of the code are difficult to guarantee. Considering that the parsed access plan is a tree structure described by a Scala language or a Java language, each node in the tree can be traversed, and in the process of traversing the node, the node is matched with the characteristic information by combining with a Scala or Java instance matching command, so as to determine the output node and the input node. For example, the parsed access plan described in the following Scala language is taken as an example, and a complete traversal of the parsed access plan and a matched data lineage extraction algorithm are achieved by using the feature of instance matching (case match) of Scala and using a recursive algorithm to traverse nodes, so as to extract a table name of a complete output data table and a table name of an input data table. It is noted that in other embodiments, for a parsed access plan described by the Java language, the corresponding instance match command may be a switch case.

The following illustratively provides a brief logic of the data-blood-margin extraction algorithm of FIG. 5A:

ParseLineage() is the data lineage extraction function, and the argument in parentheses is the node currently passed in. Through case match, the function judges whether the node satisfies conditions such as containing the first characteristic information or the second characteristic information (or the third characteristic information); for clarity, the individual case match statements are not shown in detail in the above logic. By recursively calling ParseLineage, each node of the parsed access plan is traversed from top to bottom and matched by case match to determine the output nodes and input nodes.

The algorithm proceeds as follows. Case matching starts from the parent node at the top of the parsed access plan. When the current node is determined to contain first characteristic information such as CreateAsSelect or InsertInto, the current node is determined to be an output node 'outputNode'. ParseLineage is then called recursively to perform case matching on the lower node 'outputNode.child' of 'outputNode'; if this lower node contains third characteristic information such as 'Union' or 'Join', 'outputNode.child' is determined to be an intermediate node 'intermediateNode'.

ParseLineage is then called recursively again to perform case matching on the child node 'intermediateNode.leftchild' on the lower left branch of the intermediate node and the child node 'intermediateNode.rightchild' on the lower right branch. If a child node is matched as containing third characteristic information, it is determined to be an intermediate node and ParseLineage is called on its lower nodes in turn; if a child node is matched as containing second characteristic information such as 'HiveTableRelation', it is determined to be an input node 'leafNode', and the logic ends for that branch.
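The recursive matching described above can be sketched in Python, with isinstance checks standing in for Scala's case match. The node classes below are hypothetical models named after the example feature strings in the text, not the engine's own plan classes, and raising an error for an unmatched node is a choice made for the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HiveTableRelation:      # input (leaf) node
    table: str


@dataclass
class Join:                   # intermediate node with two branches
    left: object
    right: object


@dataclass
class Union:                  # intermediate node with two branches
    left: object
    right: object


@dataclass
class InsertInto:             # output node
    table: str
    child: object


@dataclass
class CreateAsSelect:         # output node
    table: str
    child: object


def parse_lineage(node: object, outputs: List[str], inputs: List[str]) -> None:
    """Recursively match each node and collect output and input table names."""
    if isinstance(node, (CreateAsSelect, InsertInto)):
        outputs.append(node.table)                  # first characteristic information
        parse_lineage(node.child, outputs, inputs)
    elif isinstance(node, (Union, Join)):
        parse_lineage(node.left, outputs, inputs)   # third characteristic information
        parse_lineage(node.right, outputs, inputs)
    elif isinstance(node, HiveTableRelation):
        inputs.append(node.table)                   # second characteristic information
    else:
        raise TypeError(f"unmatched plan node: {node!r}")
```

Matching on node types rather than on printed plan text is what keeps the extraction robust when the plan becomes a large tree.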

To more intuitively and specifically explain the above principle, the following description will be made by taking an SQL command as an example.

An exemplary SQL command is: INSERT INTO output SELECT * FROM input;

The logical access plan obtained by parsing the SQL command is as follows, wherein the nodes of each layer are exemplarily shown as Scala class instances, and the layers have a parent-child relationship from top to bottom:

for the above logical access plan, the following method is performed to extract the table names of the input data table and the output data table:

In some embodiments, for the scenario where the data processing engine is Hive, the hook function (Hook) in Hive may also be utilized to extract the data lineage relationship.

Referring to fig. 6, a schematic diagram of extracting the data lineage relationship based on Hive according to an embodiment of the disclosure is shown.

In fig. 6, the principle of Hive 605 processing SQL command 601 is similar to Spark, and includes parsing abstract syntax tree 602 by Hive 605 according to SQL command 601, obtaining logical access plan 603 through metadata matching according to abstract syntax tree 602, and generating and executing physical access plan 604 according to logical access plan 603.

Hook functions (Hook) are provided in Hive 605; a hook is a mechanism for intercepting events, messages, or function calls during processing. In some embodiments, the native interface ExecuteWithHookContext of the Hive hook function may be inherited and implemented. The run method of ExecuteWithHookContext is passed an object of the HookContext class, a class under the Hive hook function. For example, the HookContext.getQueryPlan method is used to obtain the parsed logical access plan from the context; the QueryPlan.getInputs method is used to obtain the second identification information of the input data table, that is, the table name of the input data table; and a corresponding QueryPlan method is used to obtain the first identification information of the output data table, that is, the table name of the output data table. The table name of the input data table is then associated with the table name of the output data table to construct the data lineage relationship. It is to be understood that in other embodiments, the physical access plan may be extracted in a similar manner for obtaining the data lineage relationship, and the extraction is not limited to the parsed logical access plan of the above example.
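The hook-based extraction can be modelled with a small sketch. All classes and method names below (QueryPlanStub, HookContextStub, LineageHook, get_inputs, get_outputs, get_query_plan) are hypothetical Python stand-ins that loosely mirror the hook flow described above; they are not Hive's API.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class QueryPlanStub:
    """Hypothetical stand-in for the plan object reachable from the hook context."""
    inputs: Set[str]
    outputs: Set[str]

    def get_inputs(self) -> Set[str]:
        return self.inputs

    def get_outputs(self) -> Set[str]:
        return self.outputs


@dataclass
class HookContextStub:
    """Hypothetical stand-in for the context object passed to the hook."""
    query_plan: QueryPlanStub

    def get_query_plan(self) -> QueryPlanStub:
        return self.query_plan


class LineageHook:
    """Hook whose run method associates input tables with output tables."""

    def __init__(self) -> None:
        self.edges: List[Tuple[str, str]] = []

    def run(self, hook_context: HookContextStub) -> None:
        plan = hook_context.get_query_plan()
        # Sorting only makes the order of the recorded edges deterministic.
        for dst in sorted(plan.get_outputs()):
            for src in sorted(plan.get_inputs()):
                self.edges.append((src, dst))
```

Unlike the tree traversal used with Spark, this model never walks the plan itself: the engine is assumed to have already summarized the input and output objects.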

In some embodiments, based on the data lineage construction method, the present disclosure may further provide a method for detecting whether the data lineage relationship is correctly extracted. The types of SQL command that can generate a data lineage relationship are determined in advance. On the one hand, this avoids missing data lineage relationships; on the other hand, no extraction is attempted for SQL commands that do not generate a data lineage relationship, which saves resources. In this way, the data lineage relationships of SQL commands can be acquired with full coverage and are not lost.

Referring to fig. 7, a reference point 707 is set before the original SQL command 701 is parsed into the abstract syntax tree 702. At this point it is determined whether the SQL command 701 belongs to a preset command type that can generate a data lineage relationship, so that it is known in advance whether a data lineage relationship should be extractable. As in the foregoing embodiments, a collection point 706 of the data lineage relationship is set in the data processing engine 705 to obtain the corresponding data lineage relationship from the parsed access plan generated by the data processing engine (such as the parsed logical access plan 703 or the physical access plan 704). If the SQL command type indicates that a data lineage relationship should be generated but none can actually be obtained, it may be determined that an error has occurred and error checking may be performed, thereby avoiding omission of the data lineage relationship.

In some embodiments, the preset command type is a command type that corresponds to generation of a data lineage relationship. The preset command type may be related to an operation on the data object (e.g., an output data table) referred to by an output node, and includes at least one of: an insert command, a create command, and a create and insert command. Examples are: an INSERT command, such as INSERT INTO/OVERWRITE TABLE output SELECT ... FROM input; a CREATE command, such as CREATE TABLE output AS SELECT ... FROM input; and a CREATE and INSERT command, such as WITH t AS (SELECT FROM input) CREATE/INSERT INTO TABLE output (AS) SELECT.

Fig. 8 is a schematic flow chart showing a method for detecting whether the data blood relationship is correctly extracted based on the data blood relationship construction method in an embodiment of the present disclosure. The process comprises the following steps:

step S801: determining whether the database command is a preset command type.

Step S802: in response to the database command being a preset command type, attempting to construct a data consanguinity relationship based on the parsed access plan;

step S803: in response to a failure to construct the data lineage relationship, error information for the parsed access plan is fed back.

If the construction of the data relationship is successful, no action is required.
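Steps S801 to S803 can be sketched as follows. The regular expressions are only a rough heuristic for the preset command types named in the text, and the function names is_preset_command and check_lineage, the build_lineage callback and the returned status strings are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Tuple

Edge = Tuple[str, str]

# Rough heuristics for commands that write a table from a query result.
_INSERT_SELECT = re.compile(r"\bINSERT\s+(INTO|OVERWRITE)\b.*\bSELECT\b",
                            re.IGNORECASE | re.DOTALL)
_CREATE_AS_SELECT = re.compile(r"\bCREATE\s+TABLE\b.*\bAS\b.*\bSELECT\b",
                               re.IGNORECASE | re.DOTALL)


def is_preset_command(sql: str) -> bool:
    """Step S801: does the command belong to a type expected to yield lineage?"""
    return bool(_INSERT_SELECT.search(sql) or _CREATE_AS_SELECT.search(sql))


def check_lineage(sql: str, build_lineage: Callable[[], List[Edge]]) -> Tuple[str, List[Edge]]:
    """Return ('skipped' | 'ok' | 'error', edges) for one command."""
    if not is_preset_command(sql):
        return "skipped", []          # no lineage expected, nothing to extract
    try:
        edges = build_lineage()       # step S802: attempt the construction
    except Exception:
        edges = []
    if not edges:
        return "error", []            # step S803: lineage expected but missing
    return "ok", edges
```

The caller decides how an "error" status is reported; the sketch only separates the three outcomes.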

According to the embodiments of the present disclosure, the collection point of the data lineage relationship is set at the data processing engine of the database system so as to obtain the parsed access plan generated by the data processing engine, and the data lineage relationship is constructed based on the parsed access plan. Because the collection point is placed on the parsed access plan obtained by the data processing engine, the accuracy of the obtained data lineage relationship is guaranteed by the correctness of the parsed access plan. In addition, setting the collection point in the data processing engine replaces the approach of parsing received SQL commands with a syntax analysis tool at the platform layer to obtain the data lineage relationship: the syntax analysis tool is omitted, the various syntax rules of different database commands need not be considered, upgrades no longer affect a syntax analysis tool, and development cost and implementation difficulty are reduced. Furthermore, the method is not restricted by the platform layer and is applicable to various ETL scenarios. Finally, determining in advance the SQL command types that can generate a data lineage relationship helps avoid omission of data lineage relationships.

Exemplary devices

Having described exemplary method embodiments of the present disclosure, a data blood margin construction apparatus of an exemplary embodiment of the present disclosure is described next with reference to fig. 9.

Referring to fig. 9, an exemplary embodiment of the present disclosure provides a data blood margin construction apparatus 900 applied to a database system; the database system includes a data processing engine that generates an access plan corresponding to a database command, the access plan including a plurality of nodes, at least some of the plurality of nodes relating to data objects; the device comprises:

a plan obtaining module 901, configured to obtain the parsed access plan generated by the data processing engine;

a blood margin construction module 902 for constructing a data blood margin relationship based on the parsed access plan; wherein the data consanguinity relationship represents an associative relationship between data objects referred to by the plurality of nodes.

In some embodiments, the parsed access plan includes at least one of: a logical access plan generated after metadata matching; a physical access plan generated based on the logical access plan.

In some embodiments, the data processing engine comprises a SPARK.

In some embodiments, the data processing engine is configured with a listener; the plan obtaining module 901 is configured to call the execution listener to obtain the resolved access plan after the resolved access plan is successfully executed.

In some embodiments, the blood margin construction module 902 comprises:

a node determination module to traverse each of the nodes in the parsed access plan to determine output nodes and input nodes;

and the information acquisition module is used for acquiring first identification information of the data object related to the output node and second identification information of the data object related to the input node, and associating the first identification information and the second identification information to construct the data consanguinity relationship.

In some embodiments, the node determining module is configured to traverse each of the nodes and perform a preset determination method on a current node in the traversal process, and includes:

a feature matching module, configured to determine whether the description statement of the current node contains preset information, where the preset information includes: first characteristic information corresponding to the output node, third characteristic information corresponding to the intermediate node or second characteristic information corresponding to the input node;

an execution module, configured to determine that the current node is an output node if the descriptive statement includes first feature information, and execute the preset determination method on a subordinate node of the output node in the resolved access plan; if the description statement does not contain the first characteristic information and the second characteristic information, executing the preset judgment method on a lower node of the current node in the analyzed access plan; and if the description statement contains second characteristic information, determining the current node as an input node.

a feature matching module, configured to determine whether the description statement of the current node contains preset information, where the preset information includes: first characteristic information corresponding to the output node or second characteristic information corresponding to the input node;

In some embodiments, the parsed access plan is described by the Scala language or the Java language; the preset judgment method respectively determines whether the description sentence of the current node contains preset information through an instance matching command.

In some embodiments, the first characteristic information comprises at least one of the following acting on the data object: creating a command; inserting a command;

the third characteristic information includes at least one of the following acting on the data object: combining the commands; intersection commands; or, the third feature information includes: information other than the first characteristic information and the second characteristic information;

the second feature information includes: and the preset statements describe the relationship types between the data objects.

In some embodiments, the data processing engine comprises a Hive.

In some embodiments, the data processing engine is configured with a hook program; the plan obtaining module 901 is configured to invoke a hook program to extract the resolved access plan after the resolved access plan is generated.

In some embodiments, the blood margin construction module 902 comprises:

an information extraction module, configured to call an information extraction method preconfigured for the parsed access plan so as to extract first identification information and second identification information; wherein the first identification information corresponds to a data object referred to by an output node and the second identification information corresponds to a data object referred to by an input node; and to associate the first identification information and the second identification information to obtain the data lineage relationship.

In some embodiments, the data blood margin construction apparatus 900 further comprises:

the command type determining module is used for determining whether the database command is a preset command type; the preset command type corresponds to generation of data blood relationship; in response to the database command being a preset command type, attempting to construct a data blood relationship through the plan acquisition module 901 and the blood relationship construction module 902;

and the judging module is used for feeding back error information of the analyzed access plan in response to failure of constructing the data blood relationship.

In some embodiments, the preset command type relates to an operation of a data object related to an output node, and the preset command type includes: at least one of insert command, create command, and create and insert command.

In some embodiments, the database command is an SQL command.

Since the principle of each functional module of the data blood margin construction apparatus 900 according to the embodiment of the present disclosure is the same as that of the corresponding step of the data blood margin construction method in the foregoing embodiments, details are not repeated here.

Exemplary storage Medium

Having described the data blood margin construction method and the data blood margin construction apparatus according to the exemplary embodiment of the present disclosure, a storage medium according to the exemplary embodiment of the present disclosure will be described with reference to fig. 10.

Referring to fig. 10, a storage medium 1000 according to an embodiment of the present disclosure is described, which may contain program code and may be run on a device, such as a computer or a mobile terminal, to implement the execution of each step and sub-step of the data blood margin construction method in the above embodiments of the present disclosure. In the context of this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program code may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Exemplary electronic device

Having described the storage medium of the exemplary embodiment of the present disclosure, next, an electronic device of the exemplary embodiment of the present disclosure is explained with reference to fig. 11.

The electronic device 1100 shown in fig. 11 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.

As shown in fig. 11, electronic device 1100 is embodied in the form of a general purpose computing device. The components of the electronic device 1100 may include, but are not limited to: the at least one processing unit 1110, the at least one memory unit 1120, and a bus 1130 that couples various system components including the memory unit 1120 and the processing unit 1110.

Wherein the storage unit stores program code, which can be executed by the processing unit 1110, so that the processing unit 1110 performs the steps and sub-steps of the data blood margin construction method described in the above embodiments of the present disclosure. For example, the processing unit 1110 may perform the steps as shown in fig. 2, 3, 5, etc.

In some embodiments, the memory unit 1120 may include volatile memory units such as a random access memory unit (RAM)11201 and/or a cache memory unit 11202, and may further include a read only memory unit (ROM) 11203.

In some embodiments, storage unit 1120 may also include a program/utility 11204 having a set (at least one) of program modules 11205, such program modules 11205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

In some embodiments, bus 1130 may include a data bus, an address bus, and a control bus.

In some embodiments, the electronic device 1100 may also communicate with one or more external devices 1200 (e.g., keyboard, pointing device, bluetooth device, etc.), such communication may be through an input/output (I/O) interface 1150. Optionally, the electronic device 1100 further comprises a display unit 1140 connected to the input/output (I/O) interface 1150 for display. Also, the electronic device 1100 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via the network adapter 1160. As shown, the network adapter 1160 communicates with the other modules of the electronic device 1100 over the bus 1130. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1100, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

It should be noted that although several modules or sub-modules of the data blood margin construction apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module, in accordance with embodiments of the present disclosure. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.

Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects mean that features in those aspects cannot be combined to advantage; that division is for convenience of presentation only. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A data lineage construction method, characterized by being applied to a database system; the database system includes a data processing engine that generates an access plan corresponding to a database command, the access plan including a plurality of nodes, at least some of the plurality of nodes relating to data objects; the method comprises the following steps:

obtaining a parsed access plan generated by the data processing engine;

constructing a data lineage relationship based on the parsed access plan; wherein the data lineage relationship represents an association between the data objects to which the plurality of nodes relate.

2. The data lineage construction method according to claim 1, wherein the parsed access plan includes at least one of: a logical access plan generated after metadata matching; a physical access plan generated based on the logical access plan.

3. The data lineage construction method of claim 1, characterized in that the data processing engine comprises SPARK.

4. The data lineage construction method of claim 1, wherein said constructing a data lineage relationship based on said parsed access plan comprises:

traversing each of the nodes in the parsed access plan to determine output nodes and input nodes;

and acquiring first identification information of the data object related to the output node and second identification information of the data object related to the input node, and associating the first identification information with the second identification information to construct the data lineage relationship.

5. The data lineage construction method according to claim 4, wherein traversing each of the nodes in the parsed access plan to determine output nodes and input nodes, comprises:

traversing each node, and executing a preset judgment method on the current node in the traversing process, wherein the preset judgment method comprises the following steps:

determining whether the description statement of the current node contains preset information, wherein the preset information comprises: first characteristic information corresponding to the output node, second characteristic information corresponding to the input node, or third characteristic information corresponding to an intermediate node;

if the description statement contains the first characteristic information, determining that the current node is an output node, and executing the preset judgment method on a lower node of the output node in the parsed access plan;

if the description statement contains the third characteristic information, determining that the current node is an intermediate node, and executing the preset judgment method on a lower node of the intermediate node in the parsed access plan;

and if the description statement contains the second characteristic information, determining that the current node is an input node.
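The traversal of claims 4 and 5 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The `PlanNode` class, its `table` field, and the marker strings are hypothetical; the markers are only loosely modeled on Spark plan node names, since the source does not specify concrete characteristic information. A node is classified by searching its description statement for a marker, output and intermediate nodes cause a descent into their lower nodes, and input nodes end the descent. Nodes that match no marker are simply not descended into here, which is a choice of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical characteristic information (assumed values, not from the source).
OUTPUT_MARKERS = ("InsertInto",)                    # first characteristic information
INPUT_MARKERS = ("Relation",)                       # second characteristic information
INTERMEDIATE_MARKERS = ("Project", "Filter", "Join")  # third characteristic information


@dataclass
class PlanNode:
    """Hypothetical node of a parsed access plan."""
    description: str                 # description statement of the node
    table: Optional[str] = None      # identification of the related data object
    children: List["PlanNode"] = field(default_factory=list)


def classify(node: PlanNode) -> Optional[str]:
    """Return 'output', 'intermediate', 'input', or None for an unknown node."""
    if any(m in node.description for m in OUTPUT_MARKERS):
        return "output"
    if any(m in node.description for m in INTERMEDIATE_MARKERS):
        return "intermediate"
    if any(m in node.description for m in INPUT_MARKERS):
        return "input"
    return None


def build_lineage(root: PlanNode) -> List[Tuple[str, str]]:
    """Return (input_table, output_table) pairs found under root."""
    outputs: List[str] = []
    inputs: List[str] = []

    def visit(node: PlanNode) -> None:
        kind = classify(node)
        if kind == "output":
            outputs.append(node.table)
            for child in node.children:   # continue with the lower nodes
                visit(child)
        elif kind == "intermediate":
            for child in node.children:   # continue with the lower nodes
                visit(child)
        elif kind == "input":
            inputs.append(node.table)     # input node: stop descending
        # Unknown nodes are not descended into in this sketch.

    visit(root)
    # Associate every input identifier with every output identifier.
    return [(src, dst) for dst in outputs for src in inputs]


# Example plan: INSERT INTO db.target SELECT ... FROM db.orders JOIN db.users
plan = PlanNode("InsertIntoHiveTable", "db.target", [
    PlanNode("Project [id, name]", children=[
        PlanNode("Join Inner", children=[
            PlanNode("HiveTableRelation", "db.orders"),
            PlanNode("HiveTableRelation", "db.users"),
        ]),
    ]),
])
lineage = build_lineage(plan)
```

For the example plan, `lineage` links each of the two source tables to `db.target`.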

6. The data lineage construction method according to claim 4, wherein traversing each of the nodes in the parsed access plan to determine output nodes and input nodes, comprises:

determining whether the description statement of the current node contains preset information, wherein the preset information comprises: first characteristic information corresponding to the output node or second characteristic information corresponding to the input node;

if the description statement does not contain the first characteristic information and the second characteristic information, executing the preset judgment method on a lower node of the current node in the parsed access plan.

7. The data lineage construction method of claim 1, further comprising:

determining whether the database command is of a preset command type; the preset command type corresponds to the generation of a data lineage relationship;

in response to the database command being of the preset command type, attempting to construct the data lineage relationship based on the parsed access plan;

in response to a failure to construct the data lineage relationship, feeding back error information for the parsed access plan.
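The gating and error feedback of claim 7 might look like the following Python sketch. The set of preset command types, the function name `collect_lineage`, and the `builder` and `report_error` callbacks are all assumptions made for illustration; the source does not name which command types produce lineage or how errors are reported.

```python
from typing import Any, Callable, List, Optional, Tuple

Lineage = List[Tuple[str, str]]

# Hypothetical preset command types assumed to generate lineage.
LINEAGE_COMMAND_TYPES = {"INSERT", "CREATE_TABLE_AS_SELECT"}


def collect_lineage(
    command_type: str,
    parsed_plan: Any,
    builder: Callable[[Any], Lineage],
    report_error: Callable[[str], None],
) -> Optional[Lineage]:
    """Build lineage only for preset command types; report failures."""
    if command_type not in LINEAGE_COMMAND_TYPES:
        return None  # e.g. a plain query: no lineage is generated
    try:
        return builder(parsed_plan)
    except Exception as exc:
        # Feed back error information together with the plan that caused it.
        report_error(f"lineage construction failed: {exc}; plan={parsed_plan!r}")
        return None


def failing_builder(plan: Any) -> Lineage:
    """Stand-in builder that always fails, to exercise the error path."""
    raise ValueError("unknown node")


errors: List[str] = []
skipped = collect_lineage("SELECT", "plan-1", failing_builder, errors.append)
built = collect_lineage("INSERT", "plan-2", lambda p: [("db.a", "db.b")], errors.append)
failed = collect_lineage("INSERT", "plan-3", failing_builder, errors.append)
```

Reporting the failed plan rather than raising keeps the database command itself unaffected by a lineage failure, which is one possible design choice and not something the source prescribes.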

8. A data lineage construction device, characterized by being applied to a database system; the database system includes a data processing engine that generates an access plan corresponding to a database command, the access plan including a plurality of nodes, at least some of the plurality of nodes relating to data objects; the device comprises:

a plan acquisition module for acquiring the parsed access plan generated by the data processing engine;

a lineage construction module for constructing a data lineage relationship based on the parsed access plan; wherein the data lineage relationship represents an association between the data objects to which the plurality of nodes relate.

9. A storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing:

the data lineage construction method according to any one of claims 1 to 7.

10. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform, via execution of the executable instructions: