CN113535528B

CN113535528B - Log management system, method and medium for distributed graph iterative computation job

Info

Publication number: CN113535528B
Application number: CN202110728761.9A
Authority: CN
Inventors: 王志刚; 涂懿磊; 殷波; 王宁; 聂婕; 宋德海; �田�浩
Original assignee: Ocean University of China
Current assignee: Ocean University of China
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2023-08-08
Anticipated expiration: 2041-06-29
Also published as: CN113535528A

Abstract

The invention discloses a log management system, method and medium for distributed graph iterative computation jobs. After a distributed graph iterative computation job starts, a fault that occurs is traced with a log incremental change analysis tracing method based on a unified time measurement standard: the incremental change of each node's log is monitored continuously, and the order in which the nodes' logs stop updating is judged with the time of the master node as the reference, which yields the candidate fault source nodes. After fault tracing, log analysis during program debugging is optimized: key log information for debugging is collected by migrating the search command and executing it in a distributed manner. During distributed graph iterative computation, iteration step information is checked in real time through an incremental retrieval method. With the invention, once the node hosting the fault source has been determined, a user can quickly track and analyze the running details of the program and thereby complete program debugging.

Description

Log management system, method and medium for distributed graph iterative computation job

Technical Field

The invention belongs to the technical field of data processing, relates to a log management method, and in particular relates to a log management system, a log management method and a log management medium for distributed graph iterative computation operation.

Background

A distributed graph iterative computing system adopts a master-slave architecture. As shown in fig. 1, a job is divided into a number of tasks that are completed by several machines in a cluster, where one machine is selected as the master node (Master) and the rest are working nodes (Slaves). Each working node periodically reports the processing progress of the data it is responsible for to the master node, and the master node summarizes these reports and displays the progress of the whole job to the user. Such a periodic reporting mechanism, commonly called a "heartbeat" mechanism, can be used for management and monitoring in a master-slave architecture.

Because a distributed large-graph computing platform is well encapsulated, a user cannot analyze the running process of a submitted job program with tools such as single-step debugging and variable value monitoring, as would be possible in a single-machine programming environment. In addition, the instability of cross-machine network connections on a distributed platform and the nondeterminism of multi-threaded concurrent computation results make graph iterative computation jobs even harder to debug.

At present, the main means of debugging a distributed program is to print log information while the program runs. However, the logs of a distributed system are scattered across the working nodes, and at each iteration step of a graph computation the corresponding information may need to be checked and the correctness of the program analyzed promptly, so debugging requires frequent inspection of logs on many nodes. The complexity of cross-node log retrieval, together with the redundant information that the relevant log files carry across iteration steps, reduces debugging efficiency. Furthermore, because graph algorithms generally access vertices along outgoing edges, the subgraphs distributed on different working nodes are strongly coupled: when one working node becomes abnormal, error reports may appear in the logs of several physical machines. These multi-node error reports interfere with identifying the first abnormal report, so the abnormal machine cannot be located quickly and the fault cannot be traced.

Fault tracing in distributed systems is mainly divided into rule-based and modeling-based methods, both of which need to extract relevant knowledge by analyzing a system's long-term operation logs in order to establish rules or models. Among current mainstream distributed computing platforms, few log management systems are applicable to graph computing systems, and they are not designed specifically for graph computing jobs. Existing methods target general distributed systems; applying their log storage to a distributed graph computing system has the following drawbacks. When a graph computing system fails, only the current log records of the single job, distributed across the cluster, need to be checked to find the error; once the error has been eliminated the logs can be deleted and need not be stored. Storing all logs takes a certain amount of space, and collecting them incurs communication overhead. In addition, graph algorithm jobs differ from other distributed jobs in that they iterate many times and each iteration produces some information. During debugging, the log information output in real time must be checked during the iterations rather than from stored logs afterwards, and it must be checked many times; if all log information of the job is transmitted at each collection, logs are transmitted redundantly.

In summary, existing distributed graph computing systems provide only a simple log recording function and do not support fault tracing or log management during program debugging. For fault tracing in distributed systems, the prior art needs to accumulate a large amount of fault data and identifies anomalies by modeling that data, so these methods depend heavily on prior knowledge. Therefore, to address the low efficiency of program debugging in distributed graph iterative computing systems, the invention provides a log management method for distributed graph iterative computation jobs that treats each job independently, requires no large volume of data, and requires no modeling.

Disclosure of Invention

To address the defects of the prior art, the invention provides a log management system, method and medium for distributed graph iterative computation jobs, offering a solution from two angles: fault tracing and log retrieval. Fault tracing requires neither modeling nor agreed rules; the location of the anomaly source is judged solely from the update state of the logs. After fault tracing, log analysis during program debugging is optimized: the logs are searched on each working node and only the required content is sent to the Master, while useless content is not sent, which saves a large amount of transmission time. Once the node hosting the fault source has been determined, a user can quickly track and analyze the running details of the program and complete program debugging.

In order to solve the technical problems, the invention adopts the following technical scheme:

firstly, the invention provides a log management method for distributed graph iterative computation operation, after the distributed graph iterative computation operation is started, a Master coordinates each node to load graph data to the local and starts computation, and log management is realized by the following method:

(1) Firstly, tracing faults by using a log incremental change analysis tracing method based on a unified time measurement standard: continuously monitoring the incremental change condition of the logs of each node through a heartbeat mechanism between the master node and the slave node, and judging the updating stopping sequence of the logs of each node by taking the time of the master node as a reference, thereby giving out candidate fault source nodes;

(2) After fault tracing, optimizing log analysis in debugging of the program, and collecting key log information for debugging by migrating and executing search commands in a distributed manner;

and during iterative computation of the distributed graph, checking iteration step information in real time through an incremental retrieval method.

Further, the specific operation steps of the log incremental change analysis tracing method based on the unified time measurement standard in the step (1) are as follows:

Step1, on each working node, the update state of that node's log record is collected and reported using the graph computing system's periodic reporting mechanism, whose period is n milliseconds;

Step2, after each heartbeat interval, each node compares its current local log with the log at the end of the previous heartbeat to obtain an incremental change value $\Delta_{log}$, and reports $\Delta_{log}$ to the Master;

Step3, judging whether an abnormality has occurred: when the current graph computation job becomes abnormal, the $\Delta_{log}$ values recorded by the Master are checked first. If working node i reports $\Delta_{log} = 0$, working node i is considered to be faulty; if no $\Delta_{log} = 0$ has been captured, the heartbeat interval n is turned down to increase fault tracing sensitivity, and the job is run again until the fault source's log is captured no longer updating.
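The Master-side bookkeeping implied by Step3 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class `MasterLogMonitor` and its methods are hypothetical names, and stamping each report with the Master's own clock is assumed here as one way to realize the "unified time measurement standard".

```python
from dataclasses import dataclass, field


@dataclass
class MasterLogMonitor:
    """Hypothetical Master-side record of the delta_log values reported by workers."""

    # node id -> list of (master_time, delta_log). master_time is read from the
    # Master's clock when the report arrives, so all nodes share one time base.
    reports: dict = field(default_factory=dict)

    def record(self, node_id, delta_log, master_time):
        self.reports.setdefault(node_id, []).append((master_time, delta_log))

    def stop_time(self, node_id):
        """Master time at which the node's trailing run of zero deltas began, or None."""
        stopped_at = None
        for master_time, delta_log in self.reports.get(node_id, []):
            if delta_log == 0:
                if stopped_at is None:
                    stopped_at = master_time
            else:
                stopped_at = None  # the log grew again, so the pause was not final
        return stopped_at

    def candidate_fault_sources(self):
        """Nodes whose logs stopped updating, ordered by when they stopped."""
        stopped = []
        for node_id in self.reports:
            t = self.stop_time(node_id)
            if t is not None:
                stopped.append((t, node_id))
        stopped.sort()
        return [node_id for _, node_id in stopped]


monitor = MasterLogMonitor()
for t, deltas in [(100.0, {"w1": 40, "w2": 35, "w3": 50}),
                  (200.0, {"w1": 30, "w2": 0, "w3": 20}),
                  (300.0, {"w1": 0, "w2": 0, "w3": 0})]:
    for node, delta in deltas.items():
        monitor.record(node, delta, t)
print(monitor.candidate_fault_sources())
```

In this made-up trace the log of `w2` stops growing one heartbeat before the others, so it is listed first among the candidates.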

Further, in step (2), migrating and executing the search command in a distributed manner means that, when the distributed graph iterative computing system performs program debugging, each node first searches locally and then transmits the retrieved key log information to the Master, specifically:

a search command is sent from the Master to the Slave working nodes; after receiving the command, each Slave searches its local log according to the command, so that the search command is executed by the nodes in a distributed manner;

the key portion of the log information is returned, and the result is finally presented on the Master.
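The idea of moving the search command to the data, instead of moving the logs to the Master, can be sketched as below. This is an illustrative simulation only: the function names are hypothetical, a regular expression stands in for the search command, and threads stand in for remote Slaves, so no real network transfer takes place.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def search_local_log(log_lines, pattern):
    """Runs on a Slave: scan the local log and keep only the matching lines."""
    regex = re.compile(pattern)
    return [line for line in log_lines if regex.search(line)]


def distributed_search(slave_logs, pattern):
    """Runs on the Master: send the pattern to every Slave and gather the hits.

    slave_logs maps a node id to that node's local log lines (assumed layout).
    """
    with ThreadPoolExecutor() as pool:
        futures = {node: pool.submit(search_local_log, lines, pattern)
                   for node, lines in slave_logs.items()}
        return {node: future.result() for node, future in futures.items()}


slave_logs = {
    "slave-1": ["step 3 INFO vertex 7 updated",
                "step 3 ERROR message queue overflow"],
    "slave-2": ["step 3 INFO vertex 9 updated"],
}
print(distributed_search(slave_logs, r"ERROR"))
```

Only the matching lines travel back to the Master; the non-matching lines never leave the node that holds them.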

Further, in step (2), checking iteration step information through the incremental retrieval method means the following. For one job, when the information of iteration step n is checked for the first time, the required information of iteration step n is output directly, and an integer variable outoperation is set to n to record that the information of iteration step n has been output. If the information of iteration step m is requested next, m is first compared with n: when m is greater than n, only the log information of iteration steps n+1 to m is output and outoperation is set to m; when m is less than n, a prompt states that the log information of iteration step n has already been output and that the log information of iteration step n can be consulted.
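The bookkeeping around outoperation can be illustrated with a short sketch. The class name and the dictionary layout of the logs are hypothetical, and the case m = n, which the text does not spell out, is treated here like m less than n purely as an assumption.

```python
class IncrementalStepViewer:
    """Hypothetical viewer that outputs each iteration step's log at most once."""

    def __init__(self, step_logs):
        # step_logs maps an iteration step number to its log lines (assumed layout).
        self.step_logs = step_logs
        self.outoperation = None  # last iteration step whose information was output

    def view(self, m):
        if self.outoperation is None:
            # First query: output the requested step directly.
            self.outoperation = m
            return list(self.step_logs.get(m, []))
        n = self.outoperation
        if m > n:
            # Output only the steps that have not been shown yet: n+1 .. m.
            lines = []
            for step in range(n + 1, m + 1):
                lines.extend(self.step_logs.get(step, []))
            self.outoperation = m
            return lines
        # m <= n (the m == n branch is an assumption): nothing new to output.
        return [f"log information of iteration step {n} has already been output"]


viewer = IncrementalStepViewer({1: ["a"], 2: ["b"], 3: ["c"], 4: ["d"]})
print(viewer.view(2))  # first query
print(viewer.view(4))  # only steps 3 and 4
print(viewer.view(3))  # prompt only
```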

The invention also provides a log management system for distributed graph iterative computation jobs, which is used for managing logs and comprises the following modules:

the log increment change analysis tracing module is used for tracing faults, continuously monitoring log increment change conditions of all nodes through a heartbeat mechanism between master nodes and slave nodes, judging the order of stopping updating the logs of all nodes by taking the time of the master control node as a reference, and further giving out candidate fault source nodes;

the distributed migration retrieval module is used for optimizing log analysis in debugging of a program after fault tracing, and collecting key log information for debugging through migration and distributed execution of retrieval commands;

and the increment retrieval module is used for checking iteration step information in real time through an increment retrieval command when the distributed graph is subjected to iterative computation.

The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a log management method for distributed graph oriented iterative computing jobs as described above.

Compared with the prior art, the invention has the advantages that:

(1) To address the difficulty of tracing the node where the true fault lies when multiple strongly coupled nodes report faults, the invention provides a log incremental change analysis tracing method based on a unified time measurement standard. It requires neither modeling nor agreed rules; the location of the anomaly is judged solely from the update state of the logs.

(2) To address the low efficiency of frequent cross-node log retrieval, the invention provides a data redundancy deletion method based on distributed incremental retrieval. Log collection is replaced by migrating the search command and executing it in a distributed manner: the logs are searched on each working node and only the required content is sent to the Master, while useless content is not sent. This saves a large amount of transmission time, reduces the network transmission overhead of redundant information, and improves retrieval efficiency, so that once the node hosting the fault source has been determined, a user can quickly track and analyze the running details of the program and complete program debugging.

(3) During distributed graph iterative computation, the iteration step information is checked in real time through an incremental retrieval method, so that repeated scanning of log information among different iteration steps is avoided.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a prior art distributed graph iterative computing system architecture diagram;

FIG. 2 is a diagram illustrating log record updating and reporting according to the present invention;

FIG. 3 is a diagram comparing the log transmission strategy of the prior art with that of the present invention;

FIG. 4 is a flowchart of a log management method for distributed graph iterative computation operation according to the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific examples.

The embodiment provides a log management method for a distributed graph iterative computation job, after the distributed graph iterative computation job is started, a Master coordinates each node to load graph data to the local and starts computation, and log management is realized by the following method:

(1) Firstly, tracing faults by using a log incremental change analysis tracing method based on a unified time measurement standard: by utilizing the characteristic that the local log of the node is not updated any more after the fault occurs, the incremental change condition of the log of each node is continuously monitored through a heartbeat mechanism between the master node and the slave node, and the order of stopping updating the log of each node is judged by taking the time of the master node as a reference, so that the candidate fault source node is provided.

(2) After fault tracing, log analysis during program debugging is optimized: log collection is replaced by migrating the search command and executing it in a distributed manner, which reduces the network transmission overhead of redundant information, collects the key log information for debugging, and improves retrieval efficiency.

During distributed graph iterative computation, iteration step information is checked in real time through an incremental retrieval method, and repeated scanning of log information among different iteration steps is avoided.

The method is described below from two aspects, fault tracing and log retrieval:

1. Log incremental change analysis tracing method based on unified time measurement standard

During computation in a distributed graph computing system, the working nodes must communicate over the network, and the output value of one sending working node is often the input value that another receiving working node needs for its next iteration. Consequently, when working node i fails, the other working nodes cannot receive the input values they need or cannot compute, and in subsequent iterations the whole system fails in a chain reaction, causing the computation task to fail. When fault records appear in the logs of working nodes over a large area, the fault cannot be traced quickly. Using the log incremental change analysis tracing method based on the system's unified time measurement standard, the invention considers that the working node that first reports an abnormal $\Delta_{log}$ is typically the node from which the whole failure started. The fault source can thus be located quickly.

The key point is to determine where the fault source of the current job lies from the update state of the log records. While a task of the distributed graph computing system is running, each working node keeps appending to its log as the task proceeds, and when the task ends the log stops updating; the log content therefore grows with the task, in proportion to the work done by the working node. The invention accordingly judges abnormality from the update state of the log records. With reference to fig. 2, the specific operation steps are as follows:

Step1, on each working node, the update state of that node's log record is collected and reported using the graph computing system's periodic reporting mechanism with a period of n milliseconds, namely the "heartbeat". The aim is to collect and report the log update states of all nodes at the same moments, so that every node is treated alike and no node under-reports or fails to report.

Step2, after each "heartbeat" interval, each node compares its current local log with the log at the end of the previous "heartbeat" to obtain an incremental change value $\Delta_{log}$, and reports $\Delta_{log}$ to the Master.

Step3, judging whether an abnormality has occurred: when the current graph computation job becomes abnormal, the $\Delta_{log}$ values recorded by the Master are checked first. If working node i reports $\Delta_{log} = 0$, which differs greatly from the $\Delta_{log}$ values of the other nodes, working node i is considered to be faulty; if no $\Delta_{log} = 0$ has been captured, the "heartbeat" interval n is turned down to increase fault tracing sensitivity, and the job is run again until the fault source's log is captured no longer updating.
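On the working-node side, Step2 amounts to remembering how large the log was at the previous heartbeat. The sketch below measures the log by file size in bytes, which is an assumption; the text only speaks of comparing the current log with the previous one. The class name `LogDeltaReporter` is hypothetical, and sending the value to the Master is left out.

```python
import os


class LogDeltaReporter:
    """Hypothetical worker-side helper that measures log growth per heartbeat."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.previous_size = self._current_size()

    def _current_size(self):
        # A log file that does not exist yet counts as empty.
        if os.path.exists(self.log_path):
            return os.path.getsize(self.log_path)
        return 0

    def heartbeat(self):
        """Return delta_log: the number of bytes appended since the last heartbeat."""
        size = self._current_size()
        delta_log = size - self.previous_size
        self.previous_size = size
        return delta_log
```

A worker would call `heartbeat()` once per interval and attach the returned value to its regular progress report; a value of 0 means the local log did not grow during that interval.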

The choice of the "heartbeat" interval is a trade-off between tracing sensitivity and detection time complexity and must be made according to specific requirements.

Step3 above involves the sensitivity of fault tracing. When n is set too small, log updates are recorded and reported to the master node frequently; this improves the accuracy of fault tracing but occupies limited computing resources and increases communication cost. When n is set too large, accuracy decreases. As a compromise, n is set larger during routine computation to reduce resource consumption; if no $\Delta_{log} = 0$ is captured, n can be reduced to increase sensitivity and the job is run again. In that rerun the main purpose is fault tracing rather than job computation, so computing resources are sacrificed in exchange for tracing accuracy.
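The compromise described above can be pictured as a retry loop around the job. In this sketch `run_job` is a hypothetical callback that reruns the job with a given heartbeat interval and returns the last $\Delta_{log}$ reported by each node; halving the interval and the lower bound `min_interval_ms` are illustrative choices that the text does not prescribe.

```python
def trace_with_adaptive_interval(run_job, initial_interval_ms, min_interval_ms=1):
    """Rerun the job with a shrinking heartbeat interval until a stalled log is seen.

    run_job(interval_ms) is assumed to return {node_id: last reported delta_log}.
    Returns the interval that was used and the nodes whose last delta_log was 0.
    """
    interval_ms = initial_interval_ms
    while True:
        last_deltas = run_job(interval_ms)
        stalled = sorted(node for node, delta in last_deltas.items() if delta == 0)
        if stalled or interval_ms <= min_interval_ms:
            return interval_ms, stalled
        # Nothing captured: turn n down to raise sensitivity and run again.
        interval_ms = max(min_interval_ms, interval_ms // 2)


def fake_job(interval_ms):
    # Made-up behaviour: the stall on w2 is only visible with n <= 250 ms.
    if interval_ms <= 250:
        return {"w1": 12, "w2": 0}
    return {"w1": 40, "w2": 5}


print(trace_with_adaptive_interval(fake_job, 1000))
```

With the made-up `fake_job`, the loop tries 1000 ms and 500 ms without capturing a stalled log and stops at 250 ms with `w2` as the candidate.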

2. Data redundancy deletion method based on distributed incremental retrieval

When program debugging is performed on current distributed graph iterative computing systems, log information is scattered across the computing nodes, so logs must be inspected on different working nodes through repeated cross-node operations. Existing distributed log management systems are not suited to distributed graph computing jobs, and distributed graph computing systems provide only a simple log recording function. In addition, while a job is being computed in a distributed graph iterative computing system, the output of the iteration steps must be checked many times; after the information of iteration step n has been checked, it is still output again when iteration step n+i is checked, so the retrieved iteration step information is redundant.

The invention solves these two problems by migrating and executing the search command in a distributed manner and by an incremental retrieval command, as follows:

1. Migration and distributed execution of search commands

The left side of fig. 3 shows the existing log transmission scheme, which cannot perform log analysis efficiently, and the right side shows the efficient method adopted by the invention. When the distributed graph iterative computing system performs program debugging, to avoid going back and forth between the computing nodes, each node first searches locally and then transmits the retrieved key log information to the Master: a search command is sent from the Master to the Slave working nodes; after receiving it, each Slave searches its local log according to the command, so that the search command is executed by the nodes in a distributed manner. The key portion of the log information is returned, and the result is finally presented on the Master. Because the log records also contain useless information, transmitting only the key logs greatly reduces communication cost.

2. Incremental retrieval commands

Redundant display of iteration step information is avoided by not displaying information that has already been output. For one job, when the information of iteration step n is checked for the first time there is no redundant output, so the required information of iteration step n is output directly, and an integer variable outoperation is set to n to record that the information of iteration step n has been output. If the information of iteration step m is requested next, m is first compared with n: when m is greater than n, only the log information of iteration steps n+1 to m is output and outoperation is set to m; when m is less than n, a prompt states that the log information of iteration step n has already been output and that the log information of iteration step n can be consulted.

With reference to the flowchart in fig. 4, after the distributed graph computation job begins, the Master coordinates the nodes to load the graph data locally and begin computing. After each "heartbeat" interval, each Slave compares its current local log volume with the log volume at the previous "heartbeat" to obtain a change value $\Delta_{log}$ and reports $\Delta_{log}$ to the Master. The Master records these $\Delta_{log}$ values. If the job becomes abnormal, the $\Delta_{log}$ values recorded by the Master are examined: if the $\Delta_{log}$ last reported by some working node is 0, that node is judged to be the fault source. This part is the fault tracing technique.

If the iteration step information is to be checked during calculation, the incremental search technology is utilized to reduce the redundant output of the log information.

When the system is abnormal, checking the log to debug the program, and collecting key log information to debug by utilizing a migration command technology.

As another embodiment of the present invention, there is further provided a log management system for a distributed graph iterative computation job, for managing logs, including:

The functions and implementation manners of the modules are the same as those of the log management method for the distributed graph iterative computation operation, and are not repeated here.

As another embodiment of the present invention, there is further provided a computer readable storage medium having a computer program stored thereon, where the computer program when executed by a processor implements a log management method for iterative computation of a distributed graph as described above, and will not be described herein.

In summary, the invention provides a solution from two angles, fault tracing and log retrieval. Fault tracing requires neither modeling nor agreed rules; where the root of the anomaly lies is judged solely from the update state of the logs. After fault tracing, log analysis during program debugging is optimized: the logs are searched on each working node and only the required content is sent to the Master, while useless content is not sent, which saves a large amount of transmission time. Once the node hosting the fault source has been determined, a user can quickly track and analyze the running details of the program and complete program debugging.

The same or similar parts of the embodiments of the present invention will be referred to each other, and each embodiment focuses on the differences from the other embodiments. Moreover, the architecture of the system embodiments is merely illustrative, and the program modules illustrated by the separable components may or may not be physically separate, and in actual practice, some or all of the modules may be selected as desired to achieve the objectives of the embodiments.

The steps of the present invention may be implemented by general-purpose computer means, or alternatively, they may be implemented by program code executable by computing means, so that they may be stored in memory means and executed by computing means, or they may be fabricated into individual integrated circuit modules, respectively, or a plurality of modules or steps in them may be fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.

It should be understood that the above description is not intended to limit the invention, and the invention is not limited to the particular embodiments disclosed; various changes, modifications, additions and substitutions can be made by those skilled in the art without departing from the spirit and scope of the invention.

Claims

1. The log management method for the distributed graph iterative computation job is characterized in that after the distributed graph iterative computation job is started, a Master coordinates each node to load graph data to the local and start computation, and log management is realized by the following method:

(1) Firstly, tracing faults by using a log incremental change analysis tracing method based on a unified time measurement standard: continuously monitoring the incremental change condition of the logs of each node through a heartbeat mechanism between the master node and the slave node, and judging the updating stopping sequence of the logs of each node by taking the time of the master node as a reference, thereby giving out candidate fault source nodes; the log incremental change analysis tracing method based on the unified time measurement standard in the step (1) comprises the following specific operation steps:

step3, judging whether an abnormality has occurred: when the current graph computation job becomes abnormal, the $\Delta_{log}$ values recorded by the Master are checked first; if working node i reports $\Delta_{log} = 0$, working node i is considered to be faulty; if no $\Delta_{log} = 0$ has been captured, the heartbeat interval n is turned down to increase fault tracing sensitivity, and the job is run again until the fault source's log is captured no longer updating;

(2) After fault tracing, optimizing log analysis in debugging of the program, and collecting key log information for debugging by migrating and executing search commands in a distributed manner; the migration and distributed execution search command refers to a method that when a distributed graph iterative computing system performs program debugging, each node is firstly used for local search, and then the searched key log information is transmitted to a Master, and the method specifically comprises the following steps:

returning part of key log information, and finally, presenting results on a Master;

2. The log management method for distributed graph iterative computation job according to claim 1, wherein in step (2), checking iteration step information through the incremental retrieval method means that: for one job, when the information of iteration step n is checked for the first time, the required information of iteration step n is output directly, and an integer variable outoperation is set to n to record that the information of iteration step n has been output; if the information of iteration step m is requested next, m is first compared with n: when m is greater than n, only the log information of iteration steps n+1 to m is output and outoperation is set to m; when m is less than n, a prompt states that the log information of iteration step n has already been output and that the log information of iteration step n can be consulted.

3. A log management system for a distributed graph iterative computation job, configured to manage a log, and characterized by implementing the log management method for a distributed graph iterative computation job according to claim 1 or 2, including:

4. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements a log management method for iterative computation of a distributed graph according to any of claims 1-2.