
CN113535528B - Log management system, method and medium for distributed graph iterative computation job - Google Patents


Info

Publication number
CN113535528B
Authority
CN
China
Prior art keywords
log
node
distributed
information
master
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110728761.9A
Other languages
Chinese (zh)
Other versions
CN113535528A (en)
Inventor
王志刚
涂懿磊
殷波
王宁
聂婕
宋德海
�田�浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202110728761.9A priority Critical patent/CN113535528B/en
Publication of CN113535528A publication Critical patent/CN113535528A/en
Application granted granted Critical
Publication of CN113535528B publication Critical patent/CN113535528B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466Performance evaluation by tracing or monitoring
    • G06F11/3476Data logging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/3003Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a log management system, method and medium for distributed graph iterative computation jobs. After a distributed graph iterative computation job starts and a fault occurs, the fault is traced using a log incremental change analysis tracing method based on a unified time measurement standard: the incremental change of each node's log is continuously monitored, and the order in which the nodes' logs stop updating is judged with the time of the master node as the reference, thereby giving candidate fault source nodes. After fault tracing, log analysis during program debugging is optimized: key log information for debugging is collected by migrating search commands to the nodes and executing them in a distributed manner. During distributed graph iterative computation, iteration step information is checked in real time through an incremental retrieval method. Once the node where the fault source is located has been determined, the invention enables a user to quickly track and analyze the running details of the program and thereby complete program debugging.

Description

Log management system, method and medium for distributed graph iterative computation job
Technical Field
The invention belongs to the technical field of data processing, relates to a log management method, and in particular relates to a log management system, a log management method and a log management medium for distributed graph iterative computation operation.
Background
A distributed graph iterative computing system adopts a master-slave architecture, as shown in FIG. 1: a job is divided into a plurality of tasks that are completed by a plurality of machines in a cluster, where one machine is selected as the master node (Master) and the rest are working nodes (Slaves). Each working node periodically reports the processing progress of the data it is responsible for to the master node, and the master node summarizes these reports and displays the progress of the whole job to the user. This periodic reporting mechanism, commonly called the "heartbeat" mechanism, can be used for management and monitoring in a master-slave architecture.
Because distributed large-graph computing platforms are highly encapsulated, a user cannot analyze the running process of a submitted job program with tools such as single-step debugging and variable value monitoring, as is possible in a single-machine programming environment. In addition, the instability of cross-machine network connections on a distributed platform and the nondeterminism of multi-threaded concurrent computation make graph iterative computation jobs even harder to debug.
At present, the main means of debugging a distributed program is to print log information while the program runs. However, the logs of a distributed system are scattered across the working nodes, and in graph computation the corresponding information may need to be checked at every iteration step so that the correctness of the program can be analyzed in time; debugging therefore requires frequently checking the logs of many nodes. The complexity of cross-node log retrieval, together with the redundant information that the log files carry across different iteration steps, reduces debugging efficiency. Furthermore, because graph algorithms generally access vertices along outgoing edges, the subgraphs distributed on different working nodes are strongly coupled: when one working node becomes abnormal, error reports may appear in the logs of many physical machines. These multi-node error reports interfere with identifying the first abnormal report, so the abnormal machine cannot be located quickly and the fault cannot be traced.
Fault tracing for distributed systems is mainly divided into rule-based methods and model-based methods, both of which must extract relevant knowledge by analyzing the long-term running log information of a system in order to establish rules or models. Current mainstream distributed computing platforms offer few log management systems suited to graph computing systems, and none designed specifically for graph computing jobs. The existing methods apply only to general distributed systems, and using them to store logs for a distributed graph computing system has the following defects. When a graph computing system fails, only the current log records of the single job, distributed across the cluster, need to be checked to find the error; the logs can be deleted once the error has been eliminated and do not need to be stored. Storing all logs requires storage space, and collecting them incurs communication overhead. In addition, unlike other distributed jobs, a graph algorithm job is computed over many iterations, each of which produces some information. During debugging, the log information must be checked in real time as it is output during iteration, rather than after it has been stored, and it must be checked many times; if all log information of the job is transmitted on every collection, logs are transmitted redundantly.
In summary, existing distributed graph computing systems provide only a simple log recording function and do not support fault tracing or log management during program debugging. For distributed system fault tracing, the prior art must accumulate a large amount of fault data and identify anomalies by modeling that data, so these methods depend heavily on prior knowledge. Therefore, aiming at the low program debugging efficiency of distributed graph iterative computing systems, the invention provides a log management method for distributed graph iterative computation jobs that treats each job independently, requires no large volume of data, and requires no modeling.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a log management system, method and medium for distributed graph iterative computation jobs, which offer a solution from two angles, fault tracing and log retrieval. Fault tracing requires no modeling or predefined rules; the location of the fault source is judged only from the update state of the logs. After fault tracing, log analysis during program debugging is optimized: the logs are searched on each working node and only the required content is sent to the Master, while useless content is not sent, which saves a large amount of transmission time. Once the node where the fault source is located has been determined, the user can quickly track and analyze the running details of the program to complete program debugging.
In order to solve the technical problems, the invention adopts the following technical scheme:
First, the invention provides a log management method for distributed graph iterative computation jobs. After a distributed graph iterative computation job is started, the Master coordinates each node to load graph data locally and start computation, and log management is realized by the following method:
(1) Firstly, tracing faults by using a log incremental change analysis tracing method based on a unified time measurement standard: continuously monitoring the incremental change condition of the logs of each node through a heartbeat mechanism between the master node and the slave node, and judging the updating stopping sequence of the logs of each node by taking the time of the master node as a reference, thereby giving out candidate fault source nodes;
(2) After fault tracing, optimizing log analysis in debugging of the program, and collecting key log information for debugging by migrating and executing search commands in a distributed manner;
and during iterative computation of the distributed graph, checking iteration step information in real time through an incremental retrieval method.
Further, the specific operation steps of the log incremental change analysis tracing method based on the unified time measurement standard in the step (1) are as follows:
Step 1. On each working node, the periodic reporting mechanism of the graph computing system, with a period of n milliseconds, is used to collect and report the update state of each node's log records;
Step 2. At each heartbeat, each node compares the current local log with the log at the end of the previous heartbeat to obtain an incremental change value $\Delta_{log}$, and reports $\Delta_{log}$ to the Master;
Step 3. Whether an abnormality has occurred is judged. When the current graph computation job becomes abnormal, the $\Delta_{log}$ values recorded by the Master are checked first. If working node i reports $\Delta_{log} = 0$, working node i is considered to be faulty; if no $\Delta_{log} = 0$ is captured, the "heartbeat" interval n is reduced to increase the fault tracing sensitivity, and the job is run again until the log of the fault source is captured as no longer being updated.
Further, in step (2), migrating and executing search commands in a distributed manner means that, when program debugging is performed on the distributed graph iterative computing system, each node first searches locally and then transmits the retrieved key log information to the Master, specifically as follows:
the Master sends a search command to the Slave working nodes; after receiving the command, each Slave searches its log locally according to the command, so that the nodes execute the search command in a distributed manner;
only the key portion of the log information is returned, and the result is finally presented on the Master.
Further, in step (2), checking iteration step information through the incremental retrieval method means the following. For a job, when the information of iteration step n is checked for the first time, the required information of iteration step n is output directly, and an integer variable outoperation is then set to n to record that the information of iteration step n has been output. If the information up to iteration step m is to be output next, whether m is greater than n is checked first: when m is greater than n, only the log information of iteration steps n+1 to m is output and outoperation is set to m; when m is smaller than n, the user is prompted that the log information of iteration step n has already been output and is referred to that information.
The invention also provides a log management system for distributed graph iterative computation jobs, which is used for managing logs and comprises the following modules:
the log increment change analysis tracing module is used for tracing faults, continuously monitoring log increment change conditions of all nodes through a heartbeat mechanism between master nodes and slave nodes, judging the order of stopping updating the logs of all nodes by taking the time of the master control node as a reference, and further giving out candidate fault source nodes;
the distributed migration retrieval module is used for optimizing log analysis in debugging of a program after fault tracing, and collecting key log information for debugging through migration and distributed execution of retrieval commands;
and the increment retrieval module is used for checking iteration step information in real time through an increment retrieval command when the distributed graph is subjected to iterative computation.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a log management method for distributed graph oriented iterative computing jobs as described above.
Compared with the prior art, the invention has the advantages that:
(1) Aiming at the problem of difficult tracing of the node where the true fault is located when the multi-node reports the fault under the strong coupling correlation background, the invention provides a log increment change analysis tracing method based on a unified time measurement standard.
No modeling or predefined rules are needed; the location of the abnormality is judged only from the update state of the logs.
(2) Aiming at the low efficiency of frequent cross-node log retrieval, the invention provides a data redundancy elimination method based on distributed incremental retrieval. Log collection is replaced by migrating search commands and executing them in a distributed manner: the logs are searched on each working node, and only the required content is sent to the Master after the search, while useless content is not sent. This saves a large amount of transmission time, reduces the network transmission overhead of redundant information, and improves retrieval efficiency through distributed execution of the search command, so that once the node where the fault source is located has been determined, the user can quickly track and analyze the running details of the program and complete program debugging.
(3) During distributed graph iterative computation, the iteration step information is checked in real time through an incremental retrieval method, so that repeated scanning of log information among different iteration steps is avoided.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a prior art distributed graph iterative computing system architecture diagram;
FIG. 2 is a diagram illustrating log record updating and reporting according to the present invention;
FIG. 3 is a diagram comparing two types of log transmission strategies according to the present invention with the prior art;
FIG. 4 is a flowchart of a log management method for distributed graph iterative computation operation according to the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific examples.
The embodiment provides a log management method for a distributed graph iterative computation job, after the distributed graph iterative computation job is started, a Master coordinates each node to load graph data to the local and starts computation, and log management is realized by the following method:
(1) Firstly, tracing faults by using a log incremental change analysis tracing method based on a unified time measurement standard: by utilizing the characteristic that the local log of the node is not updated any more after the fault occurs, the incremental change condition of the log of each node is continuously monitored through a heartbeat mechanism between the master node and the slave node, and the order of stopping updating the log of each node is judged by taking the time of the master node as a reference, so that the candidate fault source node is provided.
(2) After fault tracing, log analysis during program debugging is optimized: log collection is replaced by migrating search commands and executing them in a distributed manner, which reduces the network transmission overhead of redundant information, collects the key log information needed for debugging, and improves retrieval efficiency.
During distributed graph iterative computation, iteration step information is checked in real time through an incremental retrieval method, and repeated scanning of log information among different iteration steps is avoided.
The two aspects, fault tracing and log retrieval, are introduced in turn below:
1. log incremental change analysis tracing method based on unified time measurement standard
During computation in a distributed graph computing system, the working nodes must communicate over the network, and the output value of a sending working node is often the input value that a receiving working node needs for its next iteration. Consequently, when working node i fails, other working nodes cannot receive the input values they need and cannot compute, the whole system fails in a chain reaction in subsequent iterations, and the computation task fails. When the logs of a large number of working nodes record faults, the fault cannot be traced quickly. Using the log incremental change analysis tracing method based on the system's unified time measurement standard, the invention considers that the working node that first reports an abnormal $\Delta_{log}$ is typically the working node where the whole failure started. The fault source can thus be located quickly.
The key point is to determine where the fault source of the current job is from the update state of the log records. While a job runs on the distributed graph computing system, each working node continuously appends to its log as its task progresses, and when the task ends the log stops updating; the log content therefore grows with the task, roughly in proportion to the work done by the working node. The invention therefore judges abnormality from the update state of the log records. With reference to FIG. 2, the specific operation steps are as follows:
Step 1. On each working node, the periodic reporting mechanism of the graph computing system, with a period of n milliseconds (the "heartbeat"), is used to collect and report the update state of each node's log records. The aim is to collect and report the log update states of all nodes at the same moments, so that all nodes are treated alike and no node's reports are under-reported or missed.
Step 2. At each "heartbeat", each node compares the current local log with the log at the end of the previous "heartbeat" to obtain an incremental change value $\Delta_{log}$, and reports $\Delta_{log}$ to the Master.
Step 3. Whether an abnormality has occurred is judged. When the current graph computation job becomes abnormal, the $\Delta_{log}$ values recorded by the Master are checked first. If the $\Delta_{log}$ reported by working node i differs greatly from the $\Delta_{log}$ values reported by the other nodes, namely $\Delta_{log} = 0$, working node i is considered to be faulty; if no $\Delta_{log} = 0$ is captured, the "heartbeat" interval n is reduced to increase the fault tracing sensitivity, and the job is run again until the log of the fault source is captured as no longer being updated.
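Steps 1 to 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class names `Worker` and `Master`, the use of log size as the measured quantity, the sample data, and the Master-side tick counter standing in for the unified time measurement standard are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Worker-side state: remembers the log size seen at the previous heartbeat."""
    node_id: str
    last_size: int = 0

    def heartbeat(self, current_size: int) -> int:
        """Step 2: return the log increment since the previous heartbeat."""
        delta = current_size - self.last_size
        self.last_size = current_size
        return delta


@dataclass
class Master:
    """Master-side record of (tick, delta) per node, stamped with the Master's own tick."""
    history: dict = field(default_factory=dict)

    def record(self, tick: int, node_id: str, delta: int) -> None:
        self.history.setdefault(node_id, []).append((tick, delta))

    def candidate_sources(self) -> list:
        """Step 3: nodes whose log stopped growing, ordered by the Master tick of the stop."""
        stopped = []
        for node_id, reports in self.history.items():
            stop_tick = None
            # Walk backwards over the trailing run of zero increments.
            for tick, delta in reversed(reports):
                if delta != 0:
                    break
                stop_tick = tick
            if stop_tick is not None:
                stopped.append((stop_tick, node_id))
        return [node_id for _, node_id in sorted(stopped)]


# Made-up log sizes observed at four consecutive heartbeats.
sizes = {
    "slave1": [10, 20, 30, 30],
    "slave2": [10, 20, 20, 20],
    "slave3": [10, 20, 30, 40],
}
master = Master()
workers = {name: Worker(name) for name in sizes}
for tick in range(4):
    for name, worker in workers.items():
        master.record(tick, name, worker.heartbeat(sizes[name][tick]))

candidates = master.candidate_sources()
print(candidates)  # ['slave2', 'slave1']
```

In this sample, slave2 stops updating one heartbeat before slave1, so it is listed first as the likelier fault source; slave3 never reports a zero increment and is not a candidate.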
The size of the "heartbeat" interval is a trade-off between tracing sensitivity and detection time overhead, and must be chosen according to specific requirements.
Step 3 above involves the sensitivity of fault tracing. When n is set too small, log updates are recorded and reported to the master node frequently; this improves the accuracy of fault tracing but occupies limited computing resources and increases communication cost. When n is set too large, accuracy decreases. A compromise is as follows: during routine computation, n is set larger to reduce resource consumption; if no $\Delta_{log} = 0$ is captured, n is reduced to increase sensitivity and the job is run again. The main purpose of this re-run is fault tracing rather than job computation, so the accuracy of fault tracing is obtained at the expense of computing resources.
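The effect of the interval n can be illustrated with a small simulation. The sketch below is an assumption-laden example rather than the method of the invention: the function names, the log-write timestamps, the job teardown time, and the choice to halve n on each retry are all hypothetical (the text only says that n is reduced), and each pass of the loop stands in for one re-run of the job on the same timeline.

```python
def zero_delta_nodes(events, job_end_ms, n_ms):
    """Return nodes that show a zero increment in some heartbeat window of length n_ms.

    events maps a node name to the timestamps (ms) at which it wrote a log line.
    Heartbeats fire at n_ms, 2*n_ms, ... up to the moment the job is torn down.
    """
    found = []
    for node, stamps in events.items():
        t = n_ms
        while t <= job_end_ms:
            delta = sum(1 for s in stamps if t - n_ms < s <= t)
            if delta == 0:
                found.append(node)
                break
            t += n_ms
    return found


def trace_with_retry(events, job_end_ms, n_ms, min_n_ms=1):
    """Reduce n (here: halve it) until a zero increment is captured."""
    while n_ms >= min_n_ms:
        found = zero_delta_nodes(events, job_end_ms, n_ms)
        if found:
            return n_ms, found
        n_ms //= 2
    return None, []


# Node "a" stops logging at 100 ms; node "b" keeps logging until 150 ms.
# The job is torn down at 160 ms.
events = {
    "a": list(range(10, 101, 10)),
    "b": list(range(10, 151, 10)),
}
result = trace_with_retry(events, job_end_ms=160, n_ms=100)
print(result)  # (50, ['a'])
```

With n = 100 ms only one heartbeat fires before the job ends and every node still shows growth, so nothing is captured; at n = 50 ms the window from 100 ms to 150 ms exposes the stalled log of node "a".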
2. Data redundancy deleting method based on distributed incremental retrieval
When program debugging is performed on current distributed graph iterative computing systems, the log information is scattered over the computing nodes, so cross-node operations are needed and logs must be checked on different working nodes, which involves repeated operations. Existing distributed log management systems are not suited to distributed graph computing jobs, and distributed graph computing systems themselves provide only a simple log recording function. In addition, when a job is computed on a distributed graph iterative computing system, the iteration step output must be checked many times: after the information of iteration step n has been checked, it is output again when iteration step n+i is checked, so the retrieved iteration step information is redundant.
The invention solves these two problems by migrating and executing search commands in a distributed manner and by an incremental retrieval command, as follows:
1. migration and distributed execution of search commands
The left side of FIG. 3 shows the existing log transmission scheme, which cannot analyze logs efficiently; the right side shows the efficient method adopted by the invention. When program debugging is performed on the distributed graph iterative computing system, to avoid going back and forth between the computing nodes, each node first searches locally and then transmits the retrieved key log information to the Master: the Master sends a search command to the Slave working nodes; after receiving the command, each Slave searches its log locally according to the command, so that the nodes execute the search command in a distributed manner. Only the key portion of the log information is returned, and the result is finally presented on the Master. Because log records also contain useless information, transmitting only the key logs greatly reduces communication overhead.
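The idea of moving the search to the data can be sketched as below. This is a single-process illustration under assumptions: the in-memory `WORKER_LOGS` table, the function names, and the use of a thread pool in place of real Slave processes and network transport are hypothetical; a regular expression stands in for whatever search command the system actually ships to the nodes.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-node logs; in a real cluster each list lives on its own Slave.
WORKER_LOGS = {
    "slave1": [
        "INFO iteration step 3 done",
        "ERROR vertex 42 missing message",
        "INFO iteration step 4 done",
    ],
    "slave2": [
        "INFO iteration step 3 done",
        "INFO iteration step 4 done",
    ],
    "slave3": [
        "WARN slow link",
        "ERROR timeout waiting for slave1",
    ],
}


def local_search(node_id, pattern):
    """Runs on the Slave: filter the local log and return only matching lines."""
    regex = re.compile(pattern)
    return node_id, [line for line in WORKER_LOGS[node_id] if regex.search(line)]


def master_search(pattern):
    """Runs on the Master: send the command to every Slave and gather the hits."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda node: local_search(node, pattern), WORKER_LOGS))
    return {node: hits for node, hits in results if hits}


hits = master_search(r"ERROR")
lines_sent = sum(len(lines) for lines in hits.values())
lines_total = sum(len(lines) for lines in WORKER_LOGS.values())
print(hits)
print(f"{lines_sent} of {lines_total} log lines crossed the network")
```

Only the two matching lines reach the Master here, whereas collecting the logs first would have moved all seven.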
2. Incremental retrieval commands
Redundant display of iteration step information is avoided by not displaying information that has already been output. For a job, when the information of iteration step n is checked for the first time, there is no redundant output: the required information of iteration step n is output directly, and an integer variable outoperation is then set to n to record that the information of iteration step n has been output. If the information up to iteration step m is to be output next, whether m is greater than n is checked first: when m is greater than n, only the log information of iteration steps n+1 to m is output and outoperation is set to m; when m is smaller than n, the user is prompted that the log information of iteration step n has already been output and is referred to that information.
With reference to the flowchart in FIG. 4, after the distributed graph computation job starts, the Master coordinates the nodes to load the graph data locally and begin computation. After each "heartbeat" interval, each Slave compares its current local log volume with the log volume at the previous "heartbeat" to obtain a change value $\Delta_{log}$ and reports $\Delta_{log}$ to the Master. The Master records these $\Delta_{log}$ values. If the job becomes abnormal, the $\Delta_{log}$ values recorded on the Master are examined: if the $\Delta_{log}$ last reported by a working node is 0, that node is determined to be the fault source. This part is the fault tracing technique.
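The incremental retrieval rule can be sketched as follows. The class name `IncrementalViewer`, the `log_by_step` mapping and the wording of the prompt are hypothetical; only the variable name outoperation comes from the description. The text does not say what happens when m equals n, so this sketch assumes that case is treated like m smaller than n.

```python
class IncrementalViewer:
    """Outputs iteration step logs without repeating steps already shown."""

    def __init__(self, log_by_step):
        self.log_by_step = log_by_step  # hypothetical: {step number: [log lines]}
        self.outoperation = None        # last iteration step already output

    def view(self, m):
        if self.outoperation is None:
            # First request: output the requested step directly.
            steps = [m]
        elif m > self.outoperation:
            # Later request: output only steps n+1 .. m.
            steps = range(self.outoperation + 1, m + 1)
        else:
            # Assumption: m == n is handled the same way as m < n.
            return [f"iteration step {self.outoperation} has already been output"]
        self.outoperation = m
        return [line for s in steps for line in self.log_by_step.get(s, [])]


logs = {1: ["s1"], 2: ["s2"], 3: ["s3"], 4: ["s4"]}
viewer = IncrementalViewer(logs)
first = viewer.view(2)    # ['s2']
second = viewer.view(4)   # ['s3', 's4']
third = viewer.view(3)    # ['iteration step 4 has already been output']
print(first, second, third)
```

Keeping a single integer per job is enough because iteration steps are viewed in increasing order; nothing already shown has to be rescanned.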
If the iteration step information is to be checked during calculation, the incremental search technology is utilized to reduce the redundant output of the log information.
When the system is abnormal, checking the log to debug the program, and collecting key log information to debug by utilizing a migration command technology.
As another embodiment of the present invention, there is further provided a log management system for a distributed graph iterative computation job, for managing logs, including:
the log increment change analysis tracing module is used for tracing faults, continuously monitoring log increment change conditions of all nodes through a heartbeat mechanism between master nodes and slave nodes, judging the order of stopping updating the logs of all nodes by taking the time of the master control node as a reference, and further giving out candidate fault source nodes;
the distributed migration retrieval module is used for optimizing log analysis in debugging of a program after fault tracing, and collecting key log information for debugging through migration and distributed execution of retrieval commands;
and the increment retrieval module is used for checking iteration step information in real time through an increment retrieval command when the distributed graph is subjected to iterative computation.
The functions and implementation manners of the modules are the same as those of the log management method for the distributed graph iterative computation operation, and are not repeated here.
As another embodiment of the present invention, there is further provided a computer readable storage medium having a computer program stored thereon, where the computer program when executed by a processor implements a log management method for iterative computation of a distributed graph as described above, and will not be described herein.
In summary, the invention offers a solution from two angles, fault tracing and log retrieval. First, fault tracing requires no modeling or predefined rules; the root of the abnormality is judged only from the update state of the logs. After fault tracing, log analysis during program debugging is optimized: the logs are searched on each working node and only the required content is sent to the Master, while useless content is not sent, which saves a large amount of transmission time. Once the node where the fault source is located has been determined, the user can quickly track and analyze the running details of the program to complete program debugging.
The same or similar parts of the embodiments of the present invention will be referred to each other, and each embodiment focuses on the differences from the other embodiments. Moreover, the architecture of the system embodiments is merely illustrative, and the program modules illustrated by the separable components may or may not be physically separate, and in actual practice, some or all of the modules may be selected as desired to achieve the objectives of the embodiments.
The steps of the present invention may be implemented by general-purpose computer means, or alternatively, they may be implemented by program code executable by computing means, so that they may be stored in memory means and executed by computing means, or they may be fabricated into individual integrated circuit modules, respectively, or a plurality of modules or steps in them may be fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed, and that various changes, modifications, additions and substitutions can be made by those skilled in the art without departing from the spirit and scope of the invention.

Claims (4)

1. A log management method for a distributed graph iterative computation job, characterized in that, after the distributed graph iterative computation job is started, a Master coordinates each node to load graph data locally and start computation, and log management is realized by the following method:
(1) First, faults are traced using a log incremental change analysis tracing method based on a unified time measurement standard: the incremental change of each node's log is continuously monitored through the heartbeat mechanism between the master node and the slave nodes, and the order in which the nodes' logs stop updating is judged with the master node's time as the reference, thereby giving candidate fault source nodes; the specific operation steps of the log incremental change analysis tracing method based on the unified time measurement standard in step (1) are as follows:
Step 1: on each working node, the update state of that node's log records is collected and reported using the graph computing system's periodic reporting mechanism, with a period of n milliseconds;
Step 2: at each heartbeat interval, each node compares its current local log with the log at the end of the previous heartbeat to obtain an incremental change value $\Delta_{log}$, and reports $\Delta_{log}$ to the Master;
Step 3: judging whether an abnormality has occurred: when an abnormality occurs in the current graph computation job, the $\Delta_{log}$ values recorded by the Master are checked first; if working node i reports $\Delta_{log} = 0$, working node i is considered to be faulty; if no $\Delta_{log} = 0$ has been reported, the heartbeat interval n is reduced to increase the fault-tracing sensitivity, and the job is run again until the fault source's log is captured as no longer updating;
(2) After fault tracing, log analysis during program debugging is optimized, and key log information for debugging is collected by migrating the search command and executing it in a distributed manner; migrating and executing the search command in a distributed manner means that, when the distributed graph iterative computing system performs program debugging, each node first searches locally and then transmits the key log information it finds to the Master, which specifically comprises:
sending a search command from the Master host to the Slave working nodes; after receiving the command, each Slave searches its local log according to the command, so that the search command is run by the nodes in a distributed manner;
returning the key portion of the log information, and finally presenting the results on the Master;
and, during distributed graph iterative computation, checking iteration step information in real time through an incremental retrieval method.
2. The log management method for a distributed graph iterative computation job according to claim 1, wherein in step (2), checking iteration step information through the incremental retrieval method means: for one job, when the information of iteration step n is checked for the first time, the required information of iteration step n is output directly, and an integer variable outoperation is then set to n to record that the information of iteration step n has been output; if the information of iteration step m is to be output next, it is first checked whether m is greater than n; when m is greater than n, only the log information of iteration steps n+1 to m is output and outoperation is set to m; when m is smaller than n, a prompt indicates that the log information of iteration step n has already been output and that the log information of iteration step n may be viewed.
3. A log management system for a distributed graph iterative computation job, configured to manage a log, and characterized by implementing the log management method for a distributed graph iterative computation job according to claim 1 or 2, including:
the log increment change analysis tracing module is used for tracing faults, continuously monitoring log increment change conditions of all nodes through a heartbeat mechanism between master nodes and slave nodes, judging the order of stopping updating the logs of all nodes by taking the time of the master control node as a reference, and further giving out candidate fault source nodes;
the distributed migration retrieval module is used for optimizing log analysis in debugging of a program after fault tracing, and collecting key log information for debugging through migration and distributed execution of retrieval commands;
and the increment retrieval module is used for checking iteration step information in real time through an increment retrieval command when the distributed graph is subjected to iterative computation.
4. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements a log management method for iterative computation of a distributed graph according to any of claims 1-2.
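For illustration only, and without limiting the claims, the incremental retrieval of iteration step information recited in claim 2 could be sketched in Python as follows. The class name `IncrementalViewer`, its `view` method, and the in-memory mapping from iteration step to log lines are hypothetical; only the variable name outoperation is taken from the text. The text does not state what happens when m equals n, so the sketch assumes that case is handled like m smaller than n.

```python
class IncrementalViewer:
    """Outputs iteration-step log information without repeating earlier steps."""

    def __init__(self, logs_by_step):
        self.logs_by_step = logs_by_step  # iteration step -> list of log lines
        self.outoperation = None          # last iteration step already output

    def view(self, m):
        if self.outoperation is None:
            # First check: output the requested step directly.
            steps = [m]
        elif m > self.outoperation:
            # Later check: output only the steps not yet shown (n+1 to m).
            steps = range(self.outoperation + 1, m + 1)
        else:
            # Assumption: m == n is treated the same as m < n.
            return ["log information of iteration step %d has already been output"
                    % self.outoperation]
        self.outoperation = m
        return [line for step in steps
                for line in self.logs_by_step.get(step, [])]
```

Each call therefore returns only the log lines that have not been shown before, which keeps repeated checks during a long-running job proportional to the newly completed iteration steps.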
CN202110728761.9A 2021-06-29 2021-06-29 Log management system, method and medium for distributed graph iterative computation job Active CN113535528B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110728761.9A CN113535528B (en) 2021-06-29 2021-06-29 Log management system, method and medium for distributed graph iterative computation job

Publications (2)

Publication Number Publication Date
CN113535528A CN113535528A (en) 2021-10-22
CN113535528B CN113535528B (en) 2023-08-08

Family

ID=78126198

Country Status (1)

Country Link
CN (1) CN113535528B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7721152B1 (en) * 2004-12-21 2010-05-18 Symantec Operating Corporation Integration of cluster information with root cause analysis tool
CN105975604A (en) * 2016-05-12 2016-09-28 清华大学 Distribution iterative data processing program abnormity detection and diagnosis method
CN106227727A (en) * 2016-06-30 2016-12-14 乐视控股(北京)有限公司 Daily record update method, device and the system of a kind of distributed system
CN110134714A (en) * 2019-05-22 2019-08-16 东北大学 A kind of distributed computing framework caching index suitable for big data iterative calculation
CN110489302A (en) * 2019-08-22 2019-11-22 贵州电网有限责任公司 Fault judgment method based on plurality of devices log multiple analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110083123A1 (en) * 2009-10-05 2011-04-07 Microsoft Corporation Automatically localizing root error through log analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
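Survey of fault diagnosis for distributed software systems based on log data (基于日志数据的分布式软件系统故障诊断综述); Jia Tong; Li Ying; Wu Zhonghai; Journal of Software (软件学报); Vol. 31, No. 7; pp. 1997-2018 *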

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant