
CN110493025B - Fault root cause diagnosis method and device based on multilayer digraphs - Google Patents

Fault root cause diagnosis method and device based on multilayer digraphs Download PDF

Info

Publication number
CN110493025B
Authority
CN
China
Prior art keywords
service
node
root cause
multilayer
directed graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810461456.6A
Other languages
Chinese (zh)
Other versions
CN110493025A (en
Inventor
乔柏林
叶晓龙
任赣
唐涛
蒋通通
胡林熙
蒋健
竺士杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Zhejiang Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Zhejiang Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201810461456.6A priority Critical patent/CN110493025B/en
Publication of CN110493025A publication Critical patent/CN110493025A/en
Application granted granted Critical
Publication of CN110493025B publication Critical patent/CN110493025B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0677Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)
  • Monitoring And Testing Of Exchanges (AREA)

Abstract

The embodiment of the invention discloses a fault root cause diagnosis method and device based on a multilayer directed graph. The method determines the call relationship of each service node from original service data and attribute information, so that service nodes or call relationships newly added in practice are taken into account. This ensures that every service node can be added to the multilayer directed graph model when the model is built from the call relationships, and lays a foundation for accurately and quickly locating the root cause node that generates abnormal service data. Because the service nodes in the multilayer directed graph model are generated from the actual service nodes, the node coverage is complete, which avoids the situation in which a root cause query cannot be performed for a newly appearing fault. In addition, the data are analyzed not only on the basis of the call relationships but comprehensively on the basis of the constructed multilayer directed graph model.

Description

Fault root cause diagnosis method and device based on multilayer directed graph
Technical Field
The embodiment of the invention relates to the technical field of computer software, in particular to a fault root cause diagnosis method and device based on a multilayer directed graph.
Background
The popularity of cloud computing and container clouds has led to a large number of IT application systems being deployed in virtualized, containerized environments. The continuing proliferation of service scenarios and the explosive growth of service volume pose great challenges to the maintainability of systems and applications. In the telecommunication industry in particular, operators build a great number of application systems to provide a wide range of services to consumers, and some system functions involve sub-functions of several business systems and work normally only when those systems cooperate. This architectural evolution further increases the complexity of the service systems and places higher demands on the ability of operation and maintenance teams to locate and resolve faults.
Current fault diagnosis methods fall into three types: fault diagnosis based on alarm and plan libraries, fault diagnosis based on offline index analysis tools, and fault diagnosis and repair based on a decision tree model.
Fault diagnosis based on alarm and plan libraries: most operation and maintenance departments compile fault handling manuals from fault phenomena and handling records, and some equipment suppliers provide a similar simple fault location capability, so that faults can be initially located and resolved. In addition to historical fault experience, other dimensions such as QoE (quality of experience) are also used for fault diagnosis. Once a fault occurs, the key alarm information is collected and the corresponding diagnosis manual is retrieved to produce a diagnosis result. The alarm-based approach can therefore complete routine fault location and repair simply and quickly, but it cannot be used once a fault, such as an unknown fault, does not match the known alarm information.
Fault diagnosis based on offline index analysis tools: the offline index analysis tool covers service indexes and system operation indexes. Service indexes are mainly reflected in the service data stored in the database. System operation indexes are mainly analyzed after external data such as logs are imported into a database, and the health of the system is judged from information such as system performance, success rate and failure distribution. The database-based approach makes it easy to extract key system indexes and effectively monitors the running state of each stage of a program, but it introduces considerable delay, which reduces the timeliness of system monitoring.
Fault diagnosis and repair based on a decision tree model: most system designs adopt a multilayer topology, a tree-shaped topology graph is built on the principle of layered calls, and a decision tree for business and system faults is built on that tree topology. Once a fault occurs, the key fault information is collected and the corresponding decision tree is retrieved to produce a diagnosis result. The decision-tree-based method can therefore complete routine fault location and repair simply and quickly, but it cannot be used once a call relationship that is not tree-structured is encountered.
However, in virtualized and containerized environments based on big data platforms, DCOS platforms, modular systems, microservice systems and the like, the existing solutions cannot meet the requirement for fast response and efficient analysis and resolution when diagnosing and repairing cluster node failures or abnormalities. The shortcomings are mainly the following:
(1) The usage scenarios are narrow and unknown scenarios cannot be handled. For example, the first scheme, fault diagnosis and repair based on an alarm and plan library, relies mainly on accumulated experience with known fault information and depends heavily on the fault scenario. The same fault phenomenon may call for different handling in different fault scenarios, which is beyond what a simple plan library can cover. In particular, when the fault information is unknown, existing means such as manuals fail completely, and the problem has to be checked, located and repaired manually step by step, so operation and maintenance efficiency is low.
(2) Index timeliness is poor and information cannot be fed back in time. The existing schemes improve fault location only by strengthening the collection of fault information, and the final location and repair of a fault still depend on the judgment and actions of operation and maintenance personnel. Massive monitoring index data greatly expand the sources of fault information, but the indexes are collected with high latency and the data cannot be processed and analyzed automatically well enough, so the key information and the root of a problem cannot be presented in time.
(3) Massive historical data are needed, which does not suit agile modes. The existing scheme mainly trains a decision tree to improve analysis capability, but training requires a large amount of historical data. Because of the service characteristics of the system, a large share of the problems are new, so sufficient effective training data cannot be provided; the decision tree model is therefore not accurate enough, its fault root cause analysis capability is insufficient, and it cannot provide effective support.
In the process of implementing the embodiment of the invention, the inventors found that existing methods for finding a fault root cause adapt poorly to their environment and cannot perform a root cause query for a newly appearing fault. Moreover, existing methods search only along the call relationships of the service nodes, so the data analysis is one-dimensional and the data analysis capability is weak.
Disclosure of Invention
The technical problems to be solved by the invention are that existing methods for finding a fault root cause adapt poorly to their environment and cannot perform a root cause query for a newly appearing fault, and that they search only along the call relationships of the service nodes, so the data analysis is one-dimensional and the data analysis capability is weak.
In view of the above technical problems, an embodiment of the present invention provides a method for diagnosing a fault root cause based on a multilayer directed graph, including:
acquiring original service data generated at each service node of a preset service, and determining a calling relationship of each service node according to the original service data and pre-stored attribute information of each service node;
establishing a multi-layer directed graph model of each service node according to the calling relation of each service node determined by the original service data and the attribute information and the layers to which each service node is divided in advance;
obtaining abnormal service data in the original service data, determining, according to the multilayer directed graph model, at least one root cause node which causes the service to generate the abnormal service data, and determining, from the root cause nodes, a target root cause node which causes the preset service to be abnormal.
The embodiment of the invention provides a fault root cause diagnosis device based on a multilayer directed graph, comprising:
the acquisition module is used for acquiring original service data generated at each service node of a preset service and determining the calling relationship of each service node according to the original service data and pre-stored attribute information of each service node;
the establishing module is used for establishing a multilayer directed graph model of each service node according to the calling relation of each service node determined by the original service data and the attribute information and the layers to which each service node is divided in advance;
and the root cause determining module is used for acquiring abnormal service data in the original service data, determining at least one root cause node which causes the service to generate the abnormal service data according to the multilayer directed graph model, and determining a target root cause node which causes the preset service to be abnormal from the root cause nodes.
The embodiment provides an electronic device, including:
at least one processor, at least one memory, a communication interface, and a bus; wherein,
the processor, the memory and the communication interface complete mutual communication through the bus;
the communication interface is used for information transmission between the electronic device and a communication device of a terminal device;
the memory stores program instructions executable by the processor, which when called by the processor are capable of performing the methods described above.
The present embodiments provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method described above.
The embodiment of the invention provides a fault root cause diagnosis method and device based on a multilayer directed graph. The method determines the call relationship of each service node from original service data and attribute information, so that service nodes or call relationships newly added in practice are taken into account. This ensures that every service node can be added to the multilayer directed graph model when the model is built from the call relationships, and lays a foundation for accurately and quickly locating the root cause node that generates abnormal service data. Because the service nodes in the multilayer directed graph model are generated from the actual service nodes, the node coverage is complete, which avoids the situation in which a root cause query cannot be performed for a newly appearing fault. In addition, the data are analyzed not only on the basis of the call relationships but comprehensively on the basis of the constructed multilayer directed graph model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart illustrating a method for fault root cause diagnosis based on a multilayer directed graph according to an embodiment of the present invention;
Fig. 2 is an architectural diagram of fault root cause diagnosis based on a multilayer directed graph according to another embodiment of the present invention;
Fig. 3 is a flow diagram illustrating a fault root cause query according to another embodiment of the present invention;
Fig. 4 is a block diagram of an apparatus for fault root cause diagnosis based on a multilayer directed graph according to another embodiment of the present invention;
Fig. 5 is a block diagram of an electronic device according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flowchart of a method for diagnosing a fault root cause based on a multilayer directed graph according to this embodiment, and referring to fig. 1, the method includes:
101: acquiring original service data generated at each service node of a preset service, and determining the calling relationship of each service node according to the original service data and attribute information of each service node stored in advance;
102: establishing a multi-layer directed graph model of each service node according to the calling relation of each service node determined by the original service data and the attribute information and the layers to which each service node is divided in advance;
103: obtaining abnormal service data in the original service data, determining, according to the multilayer directed graph model, at least one root cause node which causes the service to generate the abnormal service data, and determining, from the root cause nodes, a target root cause node which causes the preset service to be abnormal.
The method provided in this embodiment is generally executed by a device, such as a server, that diagnoses and repairs faults affecting the normal operation of a service; this embodiment does not specifically limit the executing device. The method performs a root cause query for a service in which a fault has occurred. The service nodes are the nodes involved in running the preset service, and the data collected at each service node are the original service data of the service. The attribute information is predefined for each service node and reflects the call relationships among the service nodes. The service nodes may also be divided into layers according to the attribute information, for example into nodes of the application layer, nodes of the transport layer, and so on. When the multilayer directed graph model is created, the layer to which each service node has been assigned in advance is taken into account. When the multilayer directed graph is searched for the root cause nodes that cause the preset service to be abnormal, the search proceeds layer by layer along the call relationships of the service nodes. The target root cause node is usually obtained by calculation; a specific calculation method may be chosen, and this embodiment does not specifically limit it.
The embodiment provides a fault root cause diagnosis method based on a multilayer directed graph. The method determines the call relationship of each service node from original service data and attribute information, so that service nodes or call relationships newly added in practice are taken into account. This ensures that every service node can be added to the multilayer directed graph model when the model is built from the call relationships, and lays a foundation for accurately and quickly locating the root cause node that generates abnormal service data. Because the service nodes in the multilayer directed graph model are generated from the actual service nodes, the node coverage is complete, which avoids the situation in which a root cause query cannot be performed for a newly appearing fault. In addition, the data are analyzed not only on the basis of the call relationships but comprehensively on the basis of the constructed multilayer directed graph model.
Further, on the basis of the above embodiment, the obtaining original service data generated at each service node of a preset service, and determining a call relationship of each service node according to the original service data and pre-stored attribute information of each service node includes:
Acquiring original service data generated at each service node of a preset service and attribute information of each service node stored in a CMDB database, and obtaining an original calling relationship between each service node according to the attribute information of each service node;
and analyzing the actual call relation of each service node according to the original service data, and adjusting the original call relation according to the actual call relation to obtain the call relation of each service node determined by the original service data and the attribute information.
The CMDB database is a database for storing and managing the configuration information of the equipment in an enterprise IT architecture. For the service nodes of the preset service, the original call relationship is first obtained from the attribute information defined in the CMDB database. However, service nodes may be added to the preset service in practice, and the CMDB database may not yet contain them. After the original call relationship is determined, the call relationships between the added nodes and the other service nodes are therefore supplemented from the original service data, which finally yields the call relationship of each service node determined by the original service data and the attribute information.
The embodiment provides a fault root cause diagnosis method based on a multilayer directed graph, which adjusts the original calling relationship obtained according to a CMDB database, ensures that the finally determined calling relationship comprises all calling relationships in the actual operation process of a service, and provides a guarantee for root cause query of a newly-appeared fault.
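The adjustment described above can be sketched in Python as follows. This is a minimal illustration, assuming call relationships are represented as (caller, callee) pairs; the function names and the sample node names are hypothetical and do not come from the patent. The sketch only supplements the CMDB relationships with observed ones; whether stale CMDB relationships should also be removed is not specified in this description.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (caller, callee); an assumed representation


def merge_call_relations(cmdb_edges: Iterable[Edge],
                         observed_edges: Iterable[Edge]) -> Set[Edge]:
    """Supplement the original (CMDB) call relations with observed ones."""
    merged = set(cmdb_edges)
    merged.update(observed_edges)
    return merged


def newly_added_nodes(cmdb_edges: Iterable[Edge],
                      observed_edges: Iterable[Edge]) -> Set[str]:
    """Nodes seen in the original service data but absent from the CMDB relations."""
    known = {node for edge in cmdb_edges for node in edge}
    seen = {node for edge in observed_edges for node in edge}
    return seen - known


cmdb = [("web", "order"), ("order", "db")]
observed = [("web", "order"), ("order", "cache")]
call_relations = merge_call_relations(cmdb, observed)
added = newly_added_nodes(cmdb, observed)
```

Here `added` contains the node that appears only in the observed data, which is the case in which the CMDB database would need to be updated.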
Further, on the basis of the foregoing embodiments, the establishing a multilayer directed graph model of each service node according to the calling relationship of each service node determined by the original service data and the attribute information and the layers to which each service node divided in advance belongs includes:
correcting the service nodes stored in the CMDB database according to the calling relationship of each service node determined by the original service data and the attribute information;
obtaining the service nodes pre-assigned to the i-th layer in the corrected CMDB database and, for each service node $v_n$ of the i-th layer in the CMDB database, obtaining, according to the calling relation of each service node determined by the original service data and the attribute information, the target service nodes that are reachable from the service node $v_n$ and that can also reach the service node $v_n$;
and adding the target service nodes corresponding to each service node of the i-th layer in the CMDB database into the node set of the i-th layer, wherein the nodes in the node set of the i-th layer are the nodes of the i-th layer in the multilayer directed graph model.
For example, if a service node is added during actual operation of a service, the service node needs to be added to the CMDB database, and the CMDB database needs to be updated in time. Each service node is divided into layers in the CMDB database in advance according to the attribute of each service node, for example, the service nodes belonging to the application layer are divided into the same layer, and the service nodes belonging to the transport layer are divided into the same layer.
In the multilayer directed graph model, the node set of the i-th layer can be expressed as $L_i = \{R_1 \cap A_1, R_2 \cap A_2, \ldots, R_n \cap A_n\}$, where $R_n$ denotes the set of all nodes reachable from $v_n$ and $A_n$ denotes the set of all nodes that can reach $v_n$. The i-th layer in the CMDB database contains $n$ service nodes, $v_1, v_2, \ldots, v_n$.
The embodiment provides a fault root cause diagnosis method based on a multilayer directed graph, which obtains the service nodes of each layer in the multilayer directed graph model according to the layer to which each service node in the CMDB database belongs.
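The construction of one layer's node set can be sketched in Python as follows: for each service node of the layer, the set of nodes reachable from it is intersected with the set of nodes that can reach it. The function names, the edge-list representation and the choice to count a node as reachable from itself are assumptions made for illustration, not details given in the patent.

```python
from collections import defaultdict, deque


def reachable(adjacency, start):
    """Breadth-first search: all nodes reachable from start (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def layer_node_set(edges, layer_nodes):
    """Map each node v of the layer to R(v) & A(v)."""
    forward = defaultdict(set)   # caller -> callees
    backward = defaultdict(set)  # callee -> callers
    for caller, callee in edges:
        forward[caller].add(callee)
        backward[callee].add(caller)
    result = {}
    for v in layer_nodes:
        reach_from_v = reachable(forward, v)   # R: nodes reachable from v
        reach_to_v = reachable(backward, v)    # A: nodes that can reach v
        result[v] = reach_from_v & reach_to_v
    return result


edges = [("a", "b"), ("b", "a"), ("b", "c")]
layer = layer_node_set(edges, ["a", "c"])
```

In this small example, `a` and `b` call each other, so both end up in the set computed for `a`, while `c` is only called and contributes a set containing itself.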
Further, on the basis of the foregoing embodiments, the acquiring abnormal service data in the original service data, determining at least one root cause node causing the service to generate the abnormal service data according to the multilayer directed graph model, and determining a target root cause node causing the preset service abnormality from the root cause nodes includes:
Judging whether the original service data generated at each service node is abnormal according to a preset threshold interval, and acquiring all abnormal service data in the original service data;
mapping each item of abnormal service data to the service node which generates it in the multilayer directed graph model, and searching, according to the calling relation of each service node in the multilayer directed graph model and the layer of each service node in the multilayer directed graph model, for at least one root cause node which causes the service to generate the abnormal service data;
constructing time series data $\langle m, k, T, E_{m \times k} \rangle$, taking $x_i(t)$ as the independent variable and $E_{m \times k} - x_i(t)$ as the dependent variable, constructing the function $f[x_i(t)] = E_{m \times k} - x_i(t)$, perturbing, for each root cause node, the values $x_i(t) \sim x_i(t-k)$ over all time series to obtain the fluctuation value $y[\delta, f[x_i(t)]]$ of each root cause node, and taking a root cause node whose fluctuation value is smaller than a preset fluctuation value as the target root cause node;
wherein $m$ is the number of service nodes in the multilayer directed graph model, $k$ is the number of time lags of each service node, $T$ is the length of the time series, $E_{m \times k}$ is the set of all service nodes in the multilayer directed graph model over all time lags, $\delta$ is a parameter related to the multilayer directed graph model, the total number of root cause nodes is $j$, and $x_i(t)$ is the service data of the $i$-th service node at time $t$ of the time series.
Whether the service data is abnormal or not may be determined according to a set threshold range, or after the service data is subjected to operation processing, whether the service data is abnormal or not may be determined according to a result of the operation processing, which is not specifically limited in this embodiment. In the process of searching the root cause, only the abnormal business data is mapped into the multilayer directed graph model.
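A minimal Python sketch of the threshold-interval check, assuming each item of service data is a (node, metric, value) triple and each metric has a closed interval of normal values. The metric names, node names and interval bounds are hypothetical.

```python
def find_abnormal(samples, thresholds):
    """Return the samples whose value lies outside the preset interval.

    samples: iterable of (node, metric, value) triples (assumed shape).
    thresholds: dict mapping metric -> (low, high), both inclusive.
    """
    abnormal = []
    for node, metric, value in samples:
        low, high = thresholds[metric]
        if not (low <= value <= high):
            abnormal.append((node, metric, value))
    return abnormal


thresholds = {"latency_ms": (0, 200), "success_rate": (0.99, 1.0)}
samples = [
    ("web", "latency_ms", 150),
    ("order", "latency_ms", 950),
    ("db", "success_rate", 0.92),
]
abnormal = find_abnormal(samples, thresholds)
```

Only the triples returned by `find_abnormal` would then be mapped onto nodes of the multilayer directed graph model.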
When searching for root cause nodes, the search follows the layer to which each service node belongs and the call relationships among the service nodes. For example, if a group of service nodes having call relationships contains an abnormal node in every layer, the service node located in the bottom layer is usually the root cause node; if such a group contains no abnormal service node in a certain layer, the service nodes above and below that layer are treated as two independent parts, and the root cause search is carried out on each part.
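The layer-by-layer rule in the preceding paragraph can be sketched in Python as follows. The sketch assumes that one group of call-related nodes has already been selected and that its abnormal nodes are given per layer, ordered from the top layer to the bottom layer; the function name and data shape are hypothetical simplifications.

```python
def candidate_root_causes(abnormal_by_layer):
    """Pick root cause candidates from per-layer abnormal node sets.

    abnormal_by_layer: list of sets, top layer first, bottom layer last.
    A layer with no abnormal node splits the group into independent parts;
    within each part, the abnormal nodes of its lowest layer are candidates.
    """
    candidates = []
    lowest_in_part = None
    for abnormal in abnormal_by_layer:
        if abnormal:
            lowest_in_part = abnormal      # keep descending within this part
        elif lowest_in_part is not None:
            candidates.append(set(lowest_in_part))  # gap closes the part
            lowest_in_part = None
    if lowest_in_part is not None:
        candidates.append(set(lowest_in_part))
    return candidates


unbroken = candidate_root_causes([{"web"}, {"order"}, {"db"}])
split = candidate_root_causes([{"web"}, {"order"}, set(), {"db"}])
```

With an abnormal node in every layer, only the bottom-layer node is returned; when the third layer is clean, the parts above and below it each yield their own candidate.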
After the root cause nodes are found, they are ranked by the fluctuation value calculated for each of them. The smaller the fluctuation value, the more likely it is that the root cause node causes the service abnormality, and the most likely root cause nodes are taken as the target root cause nodes.
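A minimal Python sketch of this ranking step, assuming the fluctuation value of each root cause node has already been computed by the perturbation procedure (which is not reproduced here). The function name and the sample values are hypothetical.

```python
def select_target_root_causes(fluctuations, preset_fluctuation):
    """Rank root cause nodes by fluctuation value, smallest first, and keep
    those whose value is below the preset fluctuation value."""
    ranked = sorted(fluctuations.items(), key=lambda item: item[1])
    return [node for node, value in ranked if value < preset_fluctuation]


fluctuations = {"order": 0.9, "db": 0.1, "cache": 0.4}
targets = select_target_root_causes(fluctuations, preset_fluctuation=0.5)
```

The returned list is ordered from most to least likely target root cause node under the stated rule that a smaller fluctuation value indicates a more likely cause.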
The embodiment provides a fault root cause diagnosis method based on a multilayer directed graph in which the target root cause node is determined from the root cause nodes. This narrows the range of nodes that need to be considered when the service is repaired and improves the efficiency of the repair.
Further, on the basis of the foregoing embodiments, before the obtaining the original service data generated at each service node of the preset service, the method further includes:
and performing KEI index evaluation on each service, judging whether the service is in a healthy state, if not, taking the service as the preset service, and acquiring original service data generated at each service node of the preset service.
KEI (key performance indicators) are used to evaluate whether a service is in a healthy state, and the method provided by the present embodiment performs root cause diagnosis only for services that are in an unhealthy state.
The embodiment provides a fault root cause diagnosis method based on a multilayer directed graph, which screens out the services in an unhealthy state by means of a KEI index, performs root cause diagnosis on those services, and avoids unnecessary diagnosis of services in a healthy state.
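A minimal Python sketch of this screening step. The patent does not state how the KEI index is computed or compared, so the sketch assumes a single numeric score per service where a higher score is healthier; the function name, service names and cutoff are hypothetical.

```python
def services_to_diagnose(kei_scores, healthy_minimum):
    """Return the services whose KEI score is below the healthy minimum.

    Assumption: one numeric KEI score per service, higher means healthier.
    """
    return [name for name, score in kei_scores.items() if score < healthy_minimum]


kei_scores = {"billing": 0.97, "recharge": 0.71, "query": 0.99}
preset_services = services_to_diagnose(kei_scores, healthy_minimum=0.9)
```

Each service in `preset_services` would then be treated as the preset service whose original service data are collected for root cause diagnosis.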
Further, on the basis of the foregoing embodiments, after determining, from the root nodes, a target root node that causes the preset service exception, the method further includes:
and judging whether a fault processing plan for repairing the target root cause node is stored; if so, repairing the target root cause node according to the fault processing plan and sending first prompt information indicating that the target root cause node has been repaired; otherwise, sending node information of the target root cause node together with second prompt information indicating that the target root cause node has not been repaired.
After the target root cause node is determined, the target root cause node needs to be repaired, and normal operation of the system is guaranteed. The first prompt message and the second prompt message may be sent by email or by short message, which is not specifically limited in this embodiment.
The embodiment provides a fault root cause diagnosis method based on a multilayer directed graph, which repairs the fault promptly when a stored fault processing plan allows it to be repaired, and otherwise promptly sends prompt information so that staff are notified to repair the fault, thereby ensuring normal operation of the service.
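The branch between automatic repair and notification can be sketched as follows. The function name `handle_root_cause`, the dictionary of plans, and the `notify` callback are hypothetical; the embodiment only states that a stored plan is executed if present and that prompt information is sent (for example by email or short message) in either case.

```python
def handle_root_cause(node_id, plans, notify):
    """Repair the node through a stored plan if one exists, else only notify.

    plans:  dict mapping node id -> zero-argument callable (hypothetical plan store)
    notify: callable taking one message string (stand-in for email or SMS)
    Returns True when a plan was executed, False otherwise.
    """
    plan = plans.get(node_id)
    if plan is not None:
        plan()  # execute the stored fault processing plan
        notify(f"[repaired] {node_id}: repaired according to the stored plan")
        return True
    # No plan stored: send the node information and report that nothing was repaired.
    notify(f"[not repaired] {node_id}: no stored plan, manual repair required")
    return False


sent = []      # collects outgoing prompt messages
executed = []  # records which plans actually ran
plans = {"db-01": lambda: executed.append("restart db-01")}

handle_root_cause("db-01", plans, sent.append)
handle_root_cause("cache-07", plans, sent.append)
```

Passing the notification channel as a callback keeps the sketch independent of any particular mail or messaging system.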
As a more specific embodiment, fig. 2 is a schematic diagram of fault root cause diagnosis based on a multilayer directed graph provided in this embodiment. Referring to fig. 2, the scheme mainly involves a CMDB database, application topology relationship management, a directed graph model converter, a model library, an index management device, a fault root cause analysis device, a fault automation processing device, and the like. The directed graph model converter generates a fault multilayer directed graph model (FSDG) by continuously analyzing the existing asset data, and the fault root cause analysis device uses the FSDG model to evaluate and compute on real-time KEI indexes and finally locates the fault root cause.
As shown in fig. 2, (1) the application production system processes user operations in real time; when an exception occurs in a business process, there is necessarily an abnormal point in the application production system. The application production system is connected with the application topology management system: when call relations arise among application services, the topology management system collects the call relation data.
(2) The application topology relation management mainly comprises six components: call data acquisition, data cleaning, rule conversion, call relation analysis, call behavior analysis, and continuous rule learning. Through call data acquisition it analyzes the call relations among all nodes in the system, provides data support for the subsequent directed graph, and submits this data together with the CMDB data to the model converter to produce the multilayer directed graph model.
(3) The CMDB database stores the attributes of various CI items in the application system, and various relations among the CI items are defined. Through CMDB data, an FSDG hierarchical model in the multilayer directed graph can be defined, and the model is submitted to a model converter to produce the multilayer directed graph model.
(4) The model converter processes the input data and converts it into corresponding codes according to the data attributes. Using the application topological relation data and the CMDB data, it converts the complex calling relationships of the system into a multilayer directed graph model. The model converter is connected with the model library and submits the data to the FSDG model library after code conversion:
obtaining the node set $V = \{v_i \mid v_i \text{ is an asset node managed in the CMDB}\}$ from the CMDB data;
obtaining the edge set $E = \{e_{i,j} \mid e_{i,j} \text{ is a directed edge from node } v_i \text{ to node } v_j\}$ from the application topological relation data;
in the multilayer directed graph model, all service nodes of the $i$-th layer are represented by the set $L_i = \{R_1 \cap A_1, R_2 \cap A_2, \ldots, R_n \cap A_n\}$.
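The construction of the node set, the edge set, and the per-layer sets can be sketched as below. The sketch assumes that $R_n$ is the set of nodes reachable from $v_n$ and $A_n$ is the set of nodes that can reach $v_n$; this reading follows the wording elsewhere in this document that target nodes are those a service node reaches and that can reach it, but the symbols are not defined explicitly in the source. The function names, the `layer_of` mapping, and the example nodes are hypothetical.

```python
from collections import defaultdict


def reachable(start, adjacency):
    """Return every node reachable from start (start included) by depth-first search."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def build_layers(nodes, edges, layer_of):
    """Build L_i = {R_1 ∩ A_1, ..., R_n ∩ A_n} for every pre-divided layer i.

    nodes:    node set V (asset nodes taken from the CMDB)
    edges:    edge set E as (caller, callee) pairs from the call topology
    layer_of: dict mapping each node to its pre-divided layer number
    """
    forward = defaultdict(set)   # caller -> callees
    backward = defaultdict(set)  # callee -> callers
    for src, dst in edges:
        forward[src].add(dst)
        backward[dst].add(src)

    layers = defaultdict(list)
    for v in nodes:
        r = reachable(v, forward)    # assumed meaning of R: nodes v can reach
        a = reachable(v, backward)   # assumed meaning of A: nodes that can reach v
        layers[layer_of[v]].append(r & a)
    return dict(layers)


nodes = ["web", "order", "pay", "db"]
edges = [("web", "order"), ("order", "pay"), ("pay", "order"), ("order", "db")]
layer_of = {"web": 1, "order": 2, "pay": 2, "db": 3}
layers = build_layers(nodes, edges, layer_of)
```

In this example `order` and `pay` call each other, so each of them yields the same intersection `{"order", "pay"}` in layer 2, while `web` and `db` each form a set of their own.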
(5) The model library contains known system topology models, which are classified by service and system and can be divided into CRM (customer relationship management), channel, CBOSS models, and the like; different systems have different topology levels and calling relationships. The model library is connected with the fault root cause analysis device: after the model library inputs the model information into the fault root cause analysis device, that information and the index data from the index management device are supplied to the analysis module to analyze the fault source.
(6) The index management device manages the index data of services, systems, and the like, including the index data of each node in the multilayer directed graph model, such as key indexes like health degree. It is connected with the fault root cause analysis device: it pushes the indexes to the analysis device, where they are matched with the model in the index library to analyze the fault source.
(7) The fault root cause analysis device is based on the Storm big-data stream computing framework and, through real-time data computation, shortens the time needed to compute the fault source to the order of seconds. It judges whether the system is abnormal according to the multilayer directed graph model and the node index data and, if so, computes the root cause node according to the multilayer directed graph algorithm, that is, analyzes the root cause of the system fault. It is connected with the fault automatic processing device: once the fault source has been identified, it is sent to the processing device for fault handling.
Fig. 3 is a schematic flow chart of performing a fault root cause query according to this embodiment, and referring to fig. 3, the process includes:
evaluating the index data of the highest layer of the FSDG model by using a KEI model; if the evaluation result is a healthy state, the system performs no subsequent analysis; if the evaluation result is an unhealthy state, the FSDG fault root cause analysis process is triggered and the fault source is computed.
Processing the FSDG fault node set by adopting a naive causal mining algorithm, and constructing a fault causal mining object FCS, wherein the FCS is all time series data generated by each element in the system and is formally expressed as a quadruple $\langle m, k, T, E_{m \times k} \rangle$, where $m$ is the number of elements in the FSDG, $k$ is the number of time lags of each element, $T$ is the length of the time series, and $E_{m \times k}$ is the set of all elements in the system at all time lags. The FSDG graph may contain links composed of multiple service nodes that require fault root cause diagnosis; the associated links of different service nodes are split, and $C_1, \ldots, C_n$ denote the FSDG graphs corresponding to the individual links.
In the fluctuation value calculation, let target $= x_i(t)$ and variables $= E_{m \times k} - x_i(t)$. GEP-based function fitting is performed with target as the dependent variable and variables as the independent variables to obtain a function $f_{x_i(t)}$. Each element of the variables set is then perturbed in turn in $f_{x_i(t)}$: since the time lag of the system is $k$, for each element $x_j$ the values over all time lags, $x_j(t) \sim x_j(t-k)$, are all perturbed. The fluctuation value $\delta f_{x_i(t)}(x_j, \Delta)$ of each element is calculated from the perturbation, and the causal judgment is then made according to the fluctuation value, the element with the smaller fluctuation value being the fault source.
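The perturbation step can be sketched as below. The sketch assumes that a fitted function is already available: the source obtains it by GEP-based fitting, whereas here a hand-written function stands in, so no fitting is performed. The names `fluctuation_values` and `select_targets`, the use of the mean absolute change as the fluctuation measure, and the column naming are all hypothetical; lagged values $x_j(t-1), \ldots, x_j(t-k)$ are assumed to be supplied as separate columns of `series`.

```python
def fluctuation_values(f, series, delta):
    """Perturb each variable by delta and measure how much f's output moves.

    f:      fitted function taking a dict {variable name: value} (stand-in for
            the GEP-fitted function of the source)
    series: dict {variable name: list of samples}, all lists of equal length;
            lagged copies of a variable are assumed to be separate columns
    delta:  size of the perturbation added to every sample of one variable
    Returns {variable name: mean absolute change of f over the series}.
    """
    names = list(series)
    length = len(next(iter(series.values())))
    baseline = [f({n: series[n][t] for n in names}) for t in range(length)]

    result = {}
    for name in names:
        total = 0.0
        for t in range(length):
            inputs = {n: series[n][t] for n in names}
            inputs[name] += delta  # perturb only this variable
            total += abs(f(inputs) - baseline[t])
        result[name] = total / length
    return result


def select_targets(values, preset):
    """Keep variables whose fluctuation value is below the preset value,
    following the source's rule that the smaller fluctuation marks the fault source."""
    return [name for name, value in values.items() if value < preset]


def fitted(v):
    # Hypothetical stand-in for the fitted function f_{x_i(t)}.
    return 2.0 * v["a"] + 0.5 * v["b"]


series = {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}
values = fluctuation_values(fitted, series, delta=1.0)
targets = select_targets(values, preset=1.0)
```

With this stand-in function, perturbing `a` moves the output by 2.0 and perturbing `b` by 0.5, so only `b` falls below the preset value of 1.0.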
(8) The fault automatic processing device is used for automatically processing a fault source, if a corresponding fault processing plan exists, the device automatically executes according to the plan, repairs the system in time and informs a related person in charge of the system.
Existing schemes are limited to analyzing the root causes of known faults, cannot respond flexibly to newly discovered faults, and cannot provide real-time computing capability. To address these defects, the fault root cause diagnosis method based on the multilayer directed graph builds on Storm stream computing technology and combines the fault directed graph algorithm FSDG with the naive causal mining algorithm NCM, thereby providing real-time, efficient, and flexible fault root cause analysis capability. In addition, existing modeling methods for IT operation and maintenance systems have difficulty obtaining massive data for training; the method provided by this embodiment therefore offers a rapid modeling approach that generates the FSDG model from CMDB data and the application topological relation management module, which makes model building more convenient and avoids the large model errors caused by insufficient training data.
The method for diagnosing the root cause of a fault based on the multilayer directed graph is not limited to locating and handling known faults; root cause analysis can also be carried out automatically on new faults according to the model. Fault data analysis capability is enhanced, and real-time data computation avoids the data backlog caused by information explosion and the like. Automatic fault handling capability is improved: by introducing the automatic processing device, closed-loop management from automatic discovery and location of a fault through to its final handling is realized.
Fig. 4 is a block diagram of a device for diagnosing fault root cause based on a multilayer directed graph according to the present embodiment, and referring to fig. 4, the device includes an obtaining module 401, an establishing module 402, and a root cause determining module 403, wherein,
an obtaining module 401, configured to obtain original service data generated at each service node of a preset service, and determine a call relationship of each service node according to the original service data and pre-stored attribute information of each service node;
an establishing module 402, configured to establish a multi-layer directed graph model of each service node according to a calling relationship of each service node determined by the original service data and the attribute information and a layer to which each service node divided in advance belongs;
a root cause determining module 403, configured to obtain abnormal service data in the original service data, determine, according to the multilayer directed graph model, at least one root cause node that causes the service to generate the abnormal service data, and determine, from the root cause nodes, a target root cause node that causes the preset service to be abnormal.
The device for diagnosing the fault root based on the multilayer directed graph provided in this embodiment is suitable for the method for diagnosing the fault root based on the multilayer directed graph provided in the above embodiment, and is not described herein again.
The embodiment provides a fault root cause diagnosis device based on a multilayer directed graph. The device determines the calling relationship of each service node according to the original service data and the attribute information, so that newly added service nodes and newly added calling relationships that arise in practice are fully taken into account; this ensures that every service node can be added to the multilayer directed graph model when the model is established according to the calling relationship, and lays a foundation for accurately and quickly finding the root cause node that generates abnormal service data. Because the service nodes in the multilayer directed graph model of this embodiment are generated from the actual service nodes, the nodes are comprehensive, which avoids the situation in which a newly appearing fault has no basis on which to be queried; meanwhile, the data is analyzed not only on the basis of the calling relationship but comprehensively on the basis of the established multilayer directed graph model.
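The work of the root cause determining module can be sketched in two steps: flag service data that lies outside a preset threshold interval, then search the call graph for candidate root cause nodes. The search rule used here (an abnormal node none of whose direct callees is abnormal) is an assumed heuristic chosen for illustration; the source states only that the search follows the calling relationships and layers of the model. All function names, metric values, and intervals are hypothetical.

```python
def find_abnormal(metrics, intervals):
    """Return the nodes whose metric lies outside its preset [low, high] interval."""
    return {
        node
        for node, value in metrics.items()
        if not (intervals[node][0] <= value <= intervals[node][1])
    }


def find_root_candidates(abnormal, calls):
    """Return abnormal nodes that have no abnormal direct callee.

    Assumed heuristic: an abnormal node whose callees are all normal cannot
    have inherited its fault from downstream, so it is kept as a candidate.
    Nodes in a cycle of mutually abnormal callers are not resolved by this rule.
    """
    return sorted(
        node
        for node in abnormal
        if not any(callee in abnormal for callee in calls.get(node, ()))
    )


# Hypothetical call graph: caller -> list of callees.
calls = {"web": ["order"], "order": ["pay", "db"], "pay": [], "db": []}
# Hypothetical latency metrics in milliseconds and their allowed intervals.
metrics = {"web": 900, "order": 800, "pay": 50, "db": 700}
intervals = {node: (0, 200) for node in metrics}

abnormal = find_abnormal(metrics, intervals)
candidates = find_root_candidates(abnormal, calls)
```

Here `web`, `order`, and `db` are all out of range, but only `db` has no abnormal callee, so it is the single candidate that would be handed to the later fluctuation-based selection of the target root cause node.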
Fig. 5 is a block diagram showing the structure of the electronic apparatus provided in the present embodiment.
Referring to fig. 5, the electronic device includes: a processor 501, a memory 502, a communication interface 503, and a bus 504;
wherein,
The processor 501, the memory 502 and the communication interface 503 complete mutual communication through the bus 504;
the communication interface 503 is used for information transmission between the electronic device and communication devices of other electronic devices;
the processor 501 is configured to call program instructions in the memory 502 to perform the methods provided by the above-mentioned method embodiments, for example, including: acquiring original service data generated at each service node of a preset service, and determining a calling relationship of each service node according to the original service data and pre-stored attribute information of each service node; establishing a multi-layer directed graph model of each service node according to the calling relation of each service node determined by the original service data and the attribute information and the layers to which each service node is divided in advance; obtaining abnormal business data in the original business data, determining at least one root cause node which causes the business to generate the abnormal business data according to the multilayer directed graph model, and determining a target root cause node which causes the preset business to be abnormal from the root cause nodes.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: acquiring original service data generated at each service node of a preset service, and determining a calling relationship of each service node according to the original service data and pre-stored attribute information of each service node; establishing a multi-layer directed graph model of each service node according to the calling relation of each service node determined by the original service data and the attribute information and the layers to which each service node is divided in advance; obtaining abnormal business data in the original business data, determining at least one root cause node which causes the business to generate the abnormal business data according to the multilayer directed graph model, and determining a target root cause node which causes the preset business to be abnormal from the root cause nodes.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments, for example, comprising: acquiring original service data generated at each service node of a preset service, and determining a calling relationship of each service node according to the original service data and pre-stored attribute information of each service node; establishing a multilayer directed graph model of each service node according to the calling relationship of each service node determined by the original service data and the attribute information and the layers to which each service node is divided in advance; obtaining abnormal business data in the original business data, determining at least one root cause node which causes the business to generate the abnormal business data according to the multilayer directed graph model, and determining a target root cause node which causes the preset business to be abnormal from the root cause nodes.
Those of ordinary skill in the art will understand that: all or part of the steps of implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer-readable storage medium, and when executed, executes the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above-described embodiments of electronic devices and the like are only illustrative, where the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on the understanding, the above technical solutions substantially or otherwise contributing to the prior art may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the various embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the embodiments of the present invention, and are not limited thereto; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for diagnosing fault root causes based on a multilayer directed graph is characterized by comprising the following steps:
acquiring original service data generated at each service node of a preset service and attribute information of each service node stored in a CMDB database, and obtaining an original calling relationship between each service node according to the attribute information of each service node;
analyzing the actual calling relationship of each service node according to the original service data, and adjusting the original calling relationship according to the actual calling relationship to obtain the calling relationship of each service node determined by the original service data and the attribute information;
Establishing a multilayer directed graph model of each service node according to the calling relationship of each service node determined by the original service data and the attribute information and the layers of the service nodes divided in advance;
and acquiring abnormal service data in the original service data, determining at least one root cause node which causes the preset service to generate the abnormal service data according to the multilayer directed graph model, and determining a target root cause node which causes the preset service to be abnormal from the root cause nodes.
2. The method according to claim 1, wherein the building of the multi-layer directed graph model of each service node according to the calling relationship of each service node determined by the original service data and the attribute information and the pre-partitioned layers to which each service node belongs includes:
correcting the service nodes stored in the CMDB database according to the calling relationship of each service node determined by the original service data and the attribute information;
obtaining the pre-divided service nodes of the $i$-th layer in the corrected CMDB database, and, for each service node $v_n$ of the $i$-th layer in the CMDB database, obtaining, according to the calling relationship of each service node determined by the original service data and the attribute information, the target service nodes that the service node $v_n$ reaches and that can reach the service node $v_n$;
and adding a target service node corresponding to each service node of the ith layer in the CMDB database into the node set of the ith layer, wherein the points in the node set of the ith layer are the nodes of the ith layer in the multilayer directed graph model.
3. The method according to claim 2, wherein the obtaining abnormal service data in the original service data, determining at least one root cause node causing the preset service to generate the abnormal service data according to the multilayer directed graph model, and determining a target root cause node causing the preset service to be abnormal from the root cause nodes comprises:
judging whether the original service data generated at each service node is abnormal according to a preset threshold interval, and acquiring all abnormal service data in the original service data;
mapping each abnormal service data to a service node which generates the abnormal service data in the multilayer directed graph model, and searching at least one root cause node which causes the preset service abnormality according to the calling relation of each service node in the multilayer directed graph model and the layer of the multilayer directed graph model to which each service node belongs;
constructing time series data $\langle m, k, T, E_{m \times k} \rangle$, taking $x_i(t)$ as the independent variable and $E_{m \times k} - x_i(t)$ as the dependent variable, constructing the function $f[x_i(t)] = E_{m \times k} - x_i(t)$, and, for each root cause node, perturbing the values $x_i(t) \sim x_i(t-k)$ over all time series to obtain the fluctuation value $y[\delta, f[x_i(t)]]$ of each root cause node, and taking the root cause node whose fluctuation value is smaller than a preset fluctuation value as the target root cause node;
wherein $m$ is the number of service nodes in the multilayer directed graph model, $k$ is the number of time lags existing in each service node, $T$ is the length of the time sequence, $E_{m \times k}$ is the set of all service nodes in the multilayer directed graph model over all time lags, $\delta$ is a parameter related to the multilayer directed graph model, the total number of root cause nodes is $j$, and $x_i(t)$ is the service data corresponding to the $i$-th service node when the time sequence length is $t$.
4. The method according to claim 1, wherein before the obtaining the original service data generated at each service node of the preset service, further comprising:
and performing KEI index evaluation on each service, judging whether the service is in a healthy state, if not, taking the service as the preset service, and acquiring original service data generated at each service node of the preset service.
5. The method according to claim 1, wherein after determining a target root cause node causing the preset traffic anomaly from the root cause nodes, the method further comprises:
and judging whether a fault processing plan for repairing the target root cause node is stored, if so, repairing the target root cause node according to the fault processing plan, and sending first prompt information for repairing the target root cause node, otherwise, sending node information of the target root cause node and second prompt information for not repairing the target root cause node.
6. An apparatus for fault root cause diagnosis based on a multilayer directed graph, comprising:
the acquisition module is used for acquiring original service data generated at each service node of a preset service and attribute information of each service node stored in the CMDB database, and acquiring an original calling relationship between each service node according to the attribute information of each service node; analyzing the actual calling relationship of each service node according to the original service data, and adjusting the original calling relationship according to the actual calling relationship to obtain the calling relationship of each service node determined by the original service data and the attribute information;
The establishing module is used for establishing a multilayer directed graph model of each service node according to the calling relationship of each service node determined by the original service data and the attribute information and the layers to which each service node is divided in advance;
and the root cause determining module is used for acquiring abnormal service data in the original service data, determining at least one root cause node which causes the preset service to generate the abnormal service data according to the multilayer directed graph model, and determining a target root cause node which causes the preset service to be abnormal from the root cause nodes.
7. An electronic device, comprising:
at least one processor, at least one memory, a communication interface, and a bus; wherein,
the processor, the memory and the communication interface complete mutual communication through the bus;
the communication interface is used for information transmission between the electronic equipment and communication equipment of other electronic equipment;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-5.
8. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 5.
CN201810461456.6A 2018-05-15 2018-05-15 Fault root cause diagnosis method and device based on multilayer digraphs Active CN110493025B (en)

Publications (2)

Publication Number Publication Date
CN110493025A CN110493025A (en) 2019-11-22
CN110493025B true CN110493025B (en) 2022-06-14

Family

ID=68545155

Country Status (1)

Country Link
CN (1) CN110493025B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112887108A (en) * 2019-11-29 2021-06-01 中兴通讯股份有限公司 Fault positioning method, device, equipment and storage medium
CN111107158B (en) * 2019-12-26 2023-02-17 远景智能国际私人投资有限公司 Alarm method, device, equipment and medium for Internet of things equipment cluster
CN111639115A (en) * 2020-04-29 2020-09-08 国家电网有限公司客户服务中心 Five-dimensional model-based analysis method for operation and maintenance data abnormity of power grid information system
CN111913824B (en) * 2020-06-23 2024-03-05 中国建设银行股份有限公司 Method for determining data link fault cause and related equipment
CN113970913A (en) * 2020-07-24 2022-01-25 华为技术有限公司 Fault diagnosis method and device
CN111858123B (en) * 2020-07-29 2023-09-26 中国工商银行股份有限公司 Fault root cause analysis method and device based on directed graph network
CN112506763A (en) * 2020-11-30 2021-03-16 清华大学 Automatic positioning method and device for database system fault root
CN114629776B (en) * 2020-12-11 2023-05-30 中国联合网络通信集团有限公司 Fault analysis method and device based on graph model
CN112541098A (en) * 2020-12-17 2021-03-23 杉数科技(北京)有限公司 Directed graph drawing method and chemical material planning method
CN112580810A (en) * 2020-12-22 2021-03-30 济南中科成水质净化有限公司 Sewage treatment process analysis and diagnosis method based on directed acyclic graph
CN112711493A (en) * 2020-12-25 2021-04-27 上海精鲲计算机科技有限公司 Scenario root cause analysis application
CN113282884B (en) * 2021-04-28 2023-09-26 沈阳航空航天大学 Universal root cause analysis method
CN113793128A (en) * 2021-09-18 2021-12-14 北京京东振世信息技术有限公司 Method, device, equipment and computer readable medium for generating business fault reason information
CN117061332B (en) * 2023-10-11 2023-12-29 中国人民解放军国防科技大学 Fault diagnosis method and system based on probability directed graph deep learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106330501A (en) * 2015-06-26 2017-01-11 中兴通讯股份有限公司 Fault correlation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8301755B2 (en) * 2007-12-14 2012-10-30 Bmc Software, Inc. Impact propagation in a directed acyclic graph

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
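A method for fault root cause diagnosis based on multilayer directed graphs (一种基于多层有向图的故障根因诊断的方法); 赵靓; 《中国优秀硕士学位论文期刊网》; 20150915; pp. 19-39 *
Perturbation-based causality mining in sub-complex dynamical systems (基于扰动的亚复杂动力系统因果关系挖掘); 郑皎凌; Chinese Journal of Computers (《计算机学报》); 20141231; Vol. 37, No. 12; pp. 2549-2560 *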

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant