CN110493025B - A method and device for fault root cause diagnosis based on a multi-layer directed graph
- Publication number: CN110493025B
- Application: CN201810461456.6A
- Authority: CN (China)
- Prior art keywords: service, node, root cause, directed graph, original
- Prior art date
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- H: ELECTRICITY
- H04: ELECTRIC COMMUNICATION TECHNIQUE
- H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06: Management of faults, events, alarms or notifications
- H04L41/0631: Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/065: Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
- H04L41/0677: Localisation of faults
- H04L41/14: Network analysis or design
- H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Debugging And Monitoring (AREA)
- Monitoring And Testing Of Exchanges (AREA)
Abstract
An embodiment of the present invention discloses a method and device for fault root cause diagnosis based on a multi-layer directed graph. The method determines the calling relationships among service nodes jointly from the original service data and the attribute information, so that service nodes or calling relationships newly added in practice are taken into account. This ensures that every service node can be added to the multi-layer directed graph model when the model is built from the calling relationships, which lays the foundation for accurately and quickly finding, based on the model, the root cause nodes that generate abnormal service data. Because the service nodes in the multi-layer directed graph model of this embodiment are generated from the actual service nodes, the completeness of the node set avoids the situation in which a root cause query cannot be performed for a newly occurring fault. In addition, the data are analyzed not only on the basis of the calling relationships but comprehensively, on the basis of the constructed multi-layer directed graph model.
Description
Technical Field
Embodiments of the present invention relate to the technical field of computer software, and in particular to a method and device for fault root cause diagnosis based on a multi-layer directed graph.
Background
The spread of cloud computing and container clouds has led to a large number of IT application systems being deployed in virtualized and containerized environments. Meanwhile, the continuous enrichment of service scenarios and the explosive growth of service volume pose great challenges to the maintainability of systems and applications. This is especially true in the telecommunications industry, where operators have built a large number of application systems to provide consumers with a variety of services; some system functions involve sub-functions of multiple service systems and work properly only when those systems cooperate. The evolution of architectures further increases the complexity of such service systems and places higher demands on the ability to locate and resolve faults in operation and maintenance.
Current fault diagnosis methods fall into three types: scheme one is fault diagnosis based on a library of alarms and contingency plans; scheme two is fault diagnosis based on offline indicator analysis tools; scheme three is fault diagnosis and repair based on a decision tree model.
Fault diagnosis based on a library of alarms and contingency plans: most operation and maintenance departments compile fault symptoms and handling records into fault handling manuals, and some equipment suppliers provide similar simple fault location capabilities, so that faults can be preliminarily located and resolved. Besides historical fault experience, other dimensions such as QoE (quality of experience) are also used for fault diagnosis. Once a fault occurs, the key alarm information is collected and the corresponding diagnostic manual is looked up to produce a diagnosis result. The alarm-based approach can therefore locate and repair routine faults simply and quickly, but it is powerless against unknown faults that do not match any known alarm information.
Fault diagnosis based on offline indicator analysis tools: such tools cover service indicators and system operation indicators. The former mainly reflect service volume through service data written to the database; the latter are mainly obtained by importing external data such as logs into a database and analyzing them. By analyzing system operation indicators such as system performance, success rate and failure distribution, the health of the running system is assessed. The database-based approach makes it easy to extract key system indicators and to monitor the running state of each stage of a program, but its latency is relatively high, which affects the timeliness of system monitoring.
Fault diagnosis and repair based on a decision tree model: most system designs adopt a multi-layer system topology. Following the principle of layered invocation, a tree-shaped topology graph is established, and decision trees for service and system faults are built on this tree topology. Once a fault occurs, the key fault information is collected and the corresponding decision tree is searched to produce a diagnosis result. The decision-tree-based approach can therefore locate and repair routine faults simply and quickly, but it is powerless when the calling relationships do not form a tree.
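However, in virtualized and containerized environments such as big data platforms, DCOS platforms, modular systems and microservice systems, the existing schemes cannot support the rapid response and efficient analysis required to diagnose and repair faults or abnormalities of cluster nodes. This shows mainly in the following aspects.
(1) Narrow usage scenarios and no handling of unknown scenarios. The fault diagnosis and repair method of scheme one, based on a library of alarms and contingency plans, relies mainly on accumulated experience with known fault information and places strict requirements on the fault scenario. The same fault symptom may need different handling in different fault scenarios, which is beyond the scope of a simple plan library. In particular, when unknown fault information is encountered, existing manuals and similar means fail completely, and staff must investigate step by step to locate the fault and repair the problem, which makes operation and maintenance inefficient.
(2) Poor indicator timeliness and no timely feedback. Existing schemes improve fault location only by strengthening the collection of fault information, while the final location and repair of a fault still depend on the judgment and actions of operation and maintenance staff. Massive monitoring indicator data greatly broaden the sources of fault information, but indicator collection latency is high and the capacity for automatic processing and analysis of these data is insufficient, so the relevant information and the root of a problem cannot be presented in time.
(3) Dependence on massive historical data, which does not suit agile operation. Existing schemes mainly train decision trees to improve analysis capability, but training a decision tree requires a large amount of historical data. Owing to the service characteristics of the applicant's systems, newly occurring problems account for a large share of all problems, so sufficient valid training data cannot be provided. As a result, the decision tree model is not accurate enough, its root cause analysis capability is insufficient, and it cannot provide effective support.
In the course of implementing the embodiments of the present invention, the inventors found that existing methods for finding the root cause of a fault adapt poorly to their environment and cannot perform a root cause query for a newly occurring fault. Moreover, existing methods search only according to the calling relationships of the service nodes, so the data analysis is one-dimensional and the data analysis capability is weak.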
Summary of the Invention
The technical problem to be solved by the present invention is that existing methods for finding the root cause of a fault adapt poorly to their environment and cannot perform a root cause query for a newly occurring fault, and that they search only according to the calling relationships of the service nodes, so that the data analysis is one-dimensional and the data analysis capability is weak.
In view of the above technical problem, an embodiment of the present invention provides a method for fault root cause diagnosis based on a multi-layer directed graph, comprising:
acquiring original service data generated at each service node of a preset service, and determining the calling relationships of the service nodes according to the original service data and pre-stored attribute information of each service node;
building a multi-layer directed graph model of the service nodes according to the calling relationships of the service nodes determined from the original service data and the attribute information, and according to the pre-divided layer to which each service node belongs;
acquiring abnormal service data from the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining, from the root cause nodes, a target root cause node that causes the preset service to be abnormal.
An embodiment of the present invention provides a device for fault root cause diagnosis based on a multi-layer directed graph, comprising:
an acquisition module, configured to acquire original service data generated at each service node of a preset service, and to determine the calling relationships of the service nodes according to the original service data and pre-stored attribute information of each service node;
a building module, configured to build a multi-layer directed graph model of the service nodes according to the calling relationships of the service nodes determined from the original service data and the attribute information, and according to the pre-divided layer to which each service node belongs;
a root cause determination module, configured to acquire abnormal service data from the original service data, to determine, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and to determine, from the root cause nodes, a target root cause node that causes the preset service to be abnormal.
This embodiment provides an electronic device, comprising:
at least one processor, at least one memory, a communication interface and a bus; wherein
the processor, the memory and the communication interface communicate with one another through the bus;
the communication interface is used for information transmission between the electronic device and a communication device of a terminal device;
the memory stores program instructions executable by the processor, and the processor, by invoking the program instructions, is able to perform the method described above.
This embodiment provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method described above.
Embodiments of the present invention provide a method and device for fault root cause diagnosis based on a multi-layer directed graph. The method determines the calling relationships among service nodes jointly from the original service data and the attribute information, so that service nodes or calling relationships newly added in practice are taken into account. This ensures that every service node can be added to the multi-layer directed graph model when the model is built from the calling relationships, which lays the foundation for accurately and quickly finding, based on the model, the root cause nodes that generate abnormal service data. Because the service nodes in the multi-layer directed graph model of this embodiment are generated from the actual service nodes, the completeness of the node set avoids the situation in which a root cause query cannot be performed for a newly occurring fault. In addition, the data are analyzed not only on the basis of the calling relationships but comprehensively, on the basis of the constructed multi-layer directed graph model.
Description of Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for fault root cause diagnosis based on a multi-layer directed graph provided by an embodiment of the present invention;
FIG. 2 is a schematic architecture diagram of fault root cause diagnosis based on a multi-layer directed graph provided by another embodiment of the present invention;
FIG. 3 is a schematic flowchart of a fault root cause query provided by another embodiment of the present invention;
FIG. 4 is a structural block diagram of a device for fault root cause diagnosis based on a multi-layer directed graph provided by another embodiment of the present invention;
FIG. 5 is a structural block diagram of an electronic device provided by another embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
FIG. 1 is a schematic flowchart of the method for fault root cause diagnosis based on a multi-layer directed graph provided by this embodiment. Referring to FIG. 1, the method includes:
101: acquiring original service data generated at each service node of a preset service, and determining the calling relationships of the service nodes according to the original service data and pre-stored attribute information of each service node;
102: building a multi-layer directed graph model of the service nodes according to the calling relationships of the service nodes determined from the original service data and the attribute information, and according to the pre-divided layer to which each service node belongs;
103: acquiring abnormal service data from the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining, from the root cause nodes, a target root cause node that causes the preset service to be abnormal.
The method provided by this embodiment is usually executed by a device that diagnoses and repairs faults in the running of a service, for example a server; this embodiment places no specific limitation on this. The method is used to query the root cause of a service in which a fault has occurred. The service nodes are the nodes involved in running the preset service, and the data collected at each service node are the original service data of that service. The attribute information is predefined for each service node and reflects the calling relationships among the service nodes. The service nodes can also be divided into layers according to the attribute information, for example into nodes at the application layer, nodes at the transport layer, and so on. When the multi-layer directed graph model of the service nodes is created, the pre-divided layer of each service node must be consulted. When the multi-layer directed graph is used to find the root cause nodes that cause the preset service to be abnormal, the search proceeds layer by layer according to the calling relationships of the service nodes. The target root cause node is usually obtained by calculation; the specific calculation method can be configured and is not specifically limited in this embodiment.
This embodiment provides a method for fault root cause diagnosis based on a multi-layer directed graph. The method determines the calling relationships among service nodes jointly from the original service data and the attribute information, so that service nodes or calling relationships newly added in practice are taken into account. This ensures that every service node can be added to the multi-layer directed graph model when the model is built from the calling relationships, which lays the foundation for accurately and quickly finding, based on the model, the root cause nodes that generate abnormal service data. Because the service nodes in the multi-layer directed graph model of this embodiment are generated from the actual service nodes, the completeness of the node set avoids the situation in which a root cause query cannot be performed for a newly occurring fault. In addition, the data are analyzed not only on the basis of the calling relationships but comprehensively, on the basis of the constructed multi-layer directed graph model.
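The search along the calling relationships described above can be illustrated with a minimal Python sketch. It is not the patented procedure: the function name `find_root_cause_candidates`, the dictionary representation of the call graph, and the rule "an abnormal node with no abnormal callee is a candidate" are all assumptions made for illustration.

```python
from collections import deque


def find_root_cause_candidates(calls, abnormal, entry):
    """Walk the call graph from `entry`, following only abnormal callees.

    calls:    dict mapping a caller node to an iterable of callee nodes
              (hypothetical representation of the calling relationships).
    abnormal: set of nodes at which abnormal service data were observed.
    entry:    node at which the abnormal service was first noticed.

    Assumed rule: an abnormal node none of whose callees is abnormal
    is reported as a root cause candidate.
    """
    candidates = []
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        abnormal_callees = [c for c in calls.get(node, ()) if c in abnormal]
        if node in abnormal and not abnormal_callees:
            candidates.append(node)
        for callee in abnormal_callees:
            if callee not in seen:  # guard against cycles in the call graph
                seen.add(callee)
                queue.append(callee)
    return candidates


# Hypothetical example: the portal calls order and auth, both call db.
calls = {"portal": ["order", "auth"], "order": ["db"], "auth": ["db"], "db": []}
abnormal = {"portal", "order", "db"}
print(find_root_cause_candidates(calls, abnormal, "portal"))
```

With this rule the search narrows from the node where the abnormality was noticed to the deepest abnormal nodes; selecting the target root cause node among several candidates is a separate calculation that this sketch does not attempt.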
Further, on the basis of the above embodiment, acquiring the original service data generated at each service node of the preset service and determining the calling relationships of the service nodes according to the original service data and the pre-stored attribute information of each service node includes:
acquiring the original service data generated at each service node of the preset service and the attribute information of each service node stored in a CMDB database, and obtaining the original calling relationships among the service nodes according to the attribute information of each service node;
analyzing the actual calling relationships of the service nodes according to the original service data, and adjusting the original calling relationships according to the actual calling relationships, to obtain the calling relationships of the service nodes determined from the original service data and the attribute information.
The CMDB database stores and manages the configuration information of the equipment in an enterprise IT architecture. For the service nodes of the preset service, the calling relationships are first obtained from the attribute information defined in the CMDB database. In practice, however, a service node may have been added to the preset service without being recorded in the CMDB database. Therefore, after the original calling relationships are determined, the calling relationships between the newly added node and the other service nodes must be supplemented from the original service data, which finally yields calling relationships, determined from the original service data and the attribute information, that match reality.
This embodiment provides a method for fault root cause diagnosis based on a multi-layer directed graph. The method adjusts the original calling relationships obtained from the CMDB database so that the finally determined calling relationships include all calling relationships that occur while the service is actually running, which guarantees that a root cause query can also be performed for newly occurring faults.
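The supplementing of CMDB-derived calling relationships with relationships observed in the original service data can be sketched as a set union. The record fields `caller` and `callee` and both function names are hypothetical; the source does not specify the format of the original service data.

```python
def observed_edges(records):
    """Extract (caller, callee) pairs from raw service records.

    Assumes each record is a dict with hypothetical 'caller' and
    'callee' fields.
    """
    return {(r["caller"], r["callee"]) for r in records}


def merge_call_relationships(cmdb_edges, records):
    """Supplement the CMDB-derived edges with edges seen in the raw data.

    Returns the merged edge set and the subset of edges that the CMDB
    did not contain, so that the CMDB can be updated accordingly.
    """
    merged = set(cmdb_edges)
    added = observed_edges(records) - merged
    merged |= added
    return merged, added


# Hypothetical example: the 'cache' node is missing from the CMDB.
cmdb_edges = {("portal", "order")}
records = [
    {"caller": "portal", "callee": "order"},
    {"caller": "order", "callee": "cache"},
]
merged, added = merge_call_relationships(cmdb_edges, records)
print(sorted(merged))
print(sorted(added))
```

This sketch only adds edges. Whether the adjustment should also remove CMDB edges that are never observed is not stated in the source, so that case is left out.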
Further, on the basis of the above embodiments, building the multi-layer directed graph model of the service nodes according to the calling relationships of the service nodes determined from the original service data and the attribute information, and according to the pre-divided layer to which each service node belongs, includes:
correcting the service nodes stored in the CMDB database according to the calling relationships of the service nodes determined from the original service data and the attribute information;
obtaining the pre-divided service nodes of the i-th layer in the corrected CMDB database and, for each service node $v_n$ of the i-th layer in the CMDB database, obtaining, according to the calling relationships of the service nodes determined from the original service data and the attribute information, the target service nodes that can be reached from $v_n$ and that can also reach $v_n$;
adding the target service nodes corresponding to every service node of the i-th layer in the CMDB database to the i-th layer node set, the nodes in the i-th layer node set then being the nodes of the i-th layer in the multi-layer directed graph model.
For example, if a service node is added while the service is actually running, that service node must be added to the CMDB database so that the CMDB database is updated in time. In the CMDB database the service nodes are divided into layers in advance according to their attributes; for example, service nodes belonging to the application layer are assigned to one layer, and service nodes belonging to the transport layer are assigned to another.
在多层有向图模型中,第i层节点集合可以通过公式Li={R1∩A1,R2∩A2,……,Rn∩An}表示。其中,Rn表示所有从vn到达的节点的集合,An表示所有能够到达vn的节点的集合。CMDB数据库中第i层的业务节点共有n各,分别为v1,v2,……vn。In the multi-layer directed graph model, the i -th layer node set can be represented by the formula Li = {R 1 ∩ A 1 , R 2 ∩ A 2 , ..., R n ∩ A n }. Among them, R n represents the set of all nodes reachable from v n , and A n represents the set of all nodes that can reach v n . There are a total of n business nodes in the ith layer in the CMDB database, namely v 1 , v 2 , ... v n .
This embodiment provides a method for fault root cause diagnosis based on a multi-layer directed graph. The method obtains the service nodes of each layer of the multi-layer directed graph model from the layer to which each service node belongs in the CMDB database.
Further, on the basis of the above embodiments, acquiring the abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining, from the root cause nodes, the target root cause node that causes the preset service abnormality includes:
Judging, according to a preset threshold interval, whether the original service data generated at each service node is abnormal, and obtaining all abnormal service data in the original service data;
Mapping each item of abnormal service data onto the service node that generated it in the multi-layer directed graph model, and searching for at least one root cause node responsible for the preset service abnormality according to the calling relationships of the service nodes in the model and the layer to which each service node belongs;
Constructing the time series data $\langle m, k, T, E_{m\times k}\rangle$; with $x_i(t)$ as the independent variable and $E_{m\times k} - x_i(t)$ as the dependent variable, constructing the function $f[x_i(t)] = E_{m\times k} - x_i(t)$; perturbing the values $x_i(t)$ to $x_i(t-k)$ of each root cause node over all time series to obtain the fluctuation value $y[\delta, f[x_i(t)]]$ of each root cause node; and taking the root cause nodes whose fluctuation value is smaller than a preset fluctuation value as the target root cause nodes;
Here, m is the number of service nodes in the multi-layer directed graph model, k is the number of time lags of each service node, T is the length of the time series, $E_{m\times k}$ is the set of all service nodes in the model over all time lags, $\delta$ is a parameter related to the model, the total number of root cause nodes is j, and $x_i(t)$ is the service data of the i-th service node when the time series length is t.
Whether service data is abnormal can be judged against a set threshold range, or the service data can first be processed and the judgment made on the processed result; this embodiment places no specific limitation on this. During the root cause search, only the abnormal service data needs to be mapped into the multi-layer directed graph model.
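A minimal sketch of the threshold-based judgment follows. The record layout `(node, metric, value)`, the metric names, and the function name `find_abnormal` are assumptions made for illustration; the source only states that a preset threshold interval is used.

```python
def find_abnormal(records, thresholds):
    """Return the records whose value falls outside the preset interval for its metric.

    records    -- iterable of (node, metric, value) tuples (hypothetical layout)
    thresholds -- dict mapping metric name to an inclusive (low, high) interval
    """
    abnormal = []
    for node, metric, value in records:
        low, high = thresholds[metric]
        if not (low <= value <= high):
            abnormal.append((node, metric, value))
    return abnormal


# Hypothetical sample data for two nodes.
thresholds = {"latency_ms": (0, 200), "error_rate": (0.0, 0.01)}
records = [
    ("order-service", "latency_ms", 850),
    ("order-service", "error_rate", 0.002),
    ("db-1", "latency_ms", 120),
]
abnormal = find_abnormal(records, thresholds)
```

Only the abnormal records are kept, which matches the statement that only abnormal service data is mapped into the model.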
When searching for root cause nodes, the search follows both the layer to which each service node belongs and the calling relationships between the service nodes. For example, if a group of nodes linked by calling relationships has an abnormal node in every layer, the service node in the lowest layer is usually the root cause node. If such a group has no abnormal service node in some layer, the service nodes above that layer and those below it should be treated as two independent parts, and the root cause search is performed on each part separately.
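The rule described above can be illustrated for a single call chain. The sketch assumes one node per layer, listed from the top layer to the bottom layer, with a flag saying whether that node is abnormal; the function name `root_causes_on_chain` and the sample chain are hypothetical.

```python
def root_causes_on_chain(chain):
    """Return the lowest abnormal node of each unbroken run of abnormal layers.

    chain -- list of (node, is_abnormal) pairs ordered from top layer to bottom layer
    """
    roots = []
    lowest_abnormal = None
    for node, is_abnormal in chain:
        if is_abnormal:
            lowest_abnormal = node          # keep descending inside the abnormal run
        elif lowest_abnormal is not None:
            roots.append(lowest_abnormal)   # a healthy layer closes the run above it
            lowest_abnormal = None
    if lowest_abnormal is not None:
        roots.append(lowest_abnormal)       # run that reaches the bottom layer
    return roots


# Hypothetical chain: the middleware layer is healthy, so the chain splits in two.
chain = [
    ("app", True),
    ("service", True),
    ("middleware", False),
    ("database", True),
    ("host", True),
]
roots = root_causes_on_chain(chain)
```

With an abnormal node in every layer the function returns only the bottom node; a healthy layer splits the chain into two parts that are searched independently.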
After the root cause nodes are found, they are sorted by the fluctuation value calculated for each of them. The smaller the fluctuation value, the more likely it is that the root cause node caused the service abnormality; the several most likely root cause nodes are taken as the target root cause nodes.
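As a sketch of this ranking step, the snippet below sorts root cause nodes by ascending fluctuation value and keeps those below a preset limit. The parameter `top_n`, the function name, and the sample values are assumptions; the source does not say how many nodes are kept.

```python
def pick_target_root_causes(fluctuations, max_fluctuation, top_n=3):
    """Rank root cause nodes by ascending fluctuation value and keep the most likely ones.

    fluctuations    -- dict mapping node name to its calculated fluctuation value
    max_fluctuation -- preset fluctuation value; only nodes strictly below it qualify
    top_n           -- hypothetical cap on how many target nodes to return
    """
    ranked = sorted(fluctuations.items(), key=lambda item: item[1])
    return [node for node, value in ranked if value < max_fluctuation][:top_n]


# Hypothetical fluctuation values for three candidate nodes.
targets = pick_target_root_causes({"db": 0.02, "cache": 0.4, "mq": 0.1}, 0.2, top_n=2)
```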
This embodiment provides a method for fault root cause diagnosis based on a multi-layer directed graph. The method performs the root cause query through the multi-layer directed graph model and analyzes the data in multiple dimensions, which improves the accuracy of the root cause search. Determining the target root cause nodes from among the root cause nodes narrows the range of nodes that must be considered when repairing the service and improves repair efficiency.
Further, on the basis of the above embodiments, before acquiring the original service data generated at each service node of the preset service, the method further includes:
Performing a KEI indicator evaluation on each service to judge whether the service is in a healthy state; if the service is not in a healthy state, taking the service as the preset service and acquiring the original service data generated at each service node of the preset service.
KEI (key performance indicator) is used to evaluate whether a service is in a healthy state; the method provided in this embodiment performs root cause diagnosis only on services in an unhealthy state.
This embodiment provides a method for fault root cause diagnosis based on a multi-layer directed graph. The method uses the KEI indicator to filter out the services in an unhealthy state and performs root cause diagnosis only on those services, avoiding unnecessary diagnosis of healthy services.
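The health gate can be sketched as a simple filter. The source does not specify how a KEI evaluation is computed, so the snippet assumes it yields a numeric score per service, with higher meaning healthier; the threshold and names are hypothetical.

```python
def select_services_for_diagnosis(kei_scores, healthy_threshold):
    """Return the services whose KEI score is below the healthy threshold.

    kei_scores        -- dict mapping service name to a numeric KEI score (assumed form)
    healthy_threshold -- scores at or above this value count as healthy (assumption)
    """
    return [name for name, score in kei_scores.items() if score < healthy_threshold]


# Hypothetical scores: only the unhealthy service becomes a preset service.
preset_services = select_services_for_diagnosis({"CRM": 0.97, "channel": 0.62}, 0.9)
```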
Further, on the basis of the above embodiments, after the target root cause node that causes the preset service abnormality is determined from the root cause nodes, the method further includes:
Judging whether a fault handling plan for repairing the target root cause node is stored; if so, repairing the target root cause node according to the fault handling plan and sending first prompt information stating that the target root cause node has been repaired; otherwise, sending the node information of the target root cause node together with second prompt information stating that the target root cause node has not been repaired.
After the target root cause node is determined, it must be repaired to keep the system running normally. The first prompt information and the second prompt information may be sent by email or by short message; this embodiment places no specific limitation on this.
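The branch between automatic repair and notification might look like the sketch below. Keying the stored plans by a fault type, the node dictionary layout, and the `notify` callback are assumptions for illustration; the source only requires checking whether a plan is stored and sending the first or second prompt accordingly.

```python
def handle_root_cause(node, plans, notify):
    """Repair the node if a stored plan exists, then send the matching prompt.

    node   -- dict with hypothetical keys "name" and "type"
    plans  -- dict mapping a fault type to a callable that performs the repair
    notify -- callable that delivers a prompt message (e.g. email or SMS sender)
    """
    plan = plans.get(node["type"])
    if plan is not None:
        plan(node)
        notify(f"repaired: {node['name']}")                       # first prompt information
        return True
    notify(f"not repaired: {node['name']} ({node['type']})")      # second prompt information
    return False


# Hypothetical plan store and an in-memory notifier.
sent = []
plans = {"disk_full": lambda node: None}
first = handle_root_cause({"name": "db-1", "type": "disk_full"}, plans, sent.append)
second = handle_root_cause({"name": "mq-1", "type": "unknown"}, plans, sent.append)
```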
This embodiment provides a method for fault root cause diagnosis based on a multi-layer directed graph. When a fault can be repaired automatically, the method repairs it promptly; when it cannot, the method promptly sends prompt information so that staff can apply a repair plan, ensuring normal operation of the service.
As a more specific embodiment, FIG. 2 is a schematic diagram of the architecture for fault root cause diagnosis based on a multi-layer directed graph provided in this embodiment. Referring to FIG. 2, the architecture mainly involves a CMDB database, application topology relationship management, a directed graph model converter, a model library, an indicator management device, a fault root cause analysis device, an automated fault handling device, and so on. The directed graph converter analyzes the existing asset data to generate the fault multi-layer directed graph model (FSDG); the fault root cause diagnosis device uses the FSDG model to evaluate the real-time KEI indicators and ultimately uncovers the fault root cause.
Among the parts shown in FIG. 2: (1) The application production system processes user operations in real time; when service processing becomes abnormal, there must be an abnormal point in the application production system. The application production system is connected to the application topology management system: when calling relationships arise between application services, the topology management system obtains the calling relationship data.
(2) Application topology relationship management consists mainly of six devices: call data collection, data cleaning, rule conversion, call relationship analysis, call behavior analysis, and continuous rule learning. Through call data collection it analyzes the calling relationships between the nodes in the system, provides data support for the subsequent directed graph, and submits this data together with the CMDB data to the model converter to produce the multi-layer directed graph model.
(3) The CMDB database stores the attributes of each CI item in the application system, as well as the definitions of the various relationships between CI items. The FSDG layering model of the multi-layer directed graph can be defined from the CMDB data and submitted to the model converter to produce the multi-layer directed graph model.
(4) The model converter processes and converts the input data, converting it into the corresponding encoding according to the data attributes. Using the application topology relationship data and the CMDB data, it converts the complex calling relationships of the system into a multi-layer directed graph model. The model converter is connected to the model library and submits the encoded data to the FSDG model library;
From the CMDB data, the node set $V = \{v_i \mid v_i \text{ is an asset node managed in the CMDB}\}$ is obtained;
From the application topology relationship data, the edge set $E = \{e_{i,j} \mid e_{i,j} \text{ is a directed edge from node } v_i \text{ to node } v_j\}$ is obtained;
In the multi-layer directed graph model, all service nodes of the i-th layer are represented by the set $L_i = \{R_1 \cap A_1, R_2 \cap A_2, \ldots, R_n \cap A_n\}$.
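The conversion of CMDB data and call records into a node set, an edge set, and a layer assignment could be sketched as follows. The `LayeredGraph` class, the record fields `id` and `layer`, and the `default_layer` fallback for nodes that appear only at runtime are all hypothetical; the source does not describe the converter's data structures.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredGraph:
    layer_of: dict = field(default_factory=dict)  # node id -> layer name (the set V with layers)
    edges: set = field(default_factory=set)       # (caller, callee) pairs (the set E)


def build_graph(cmdb_items, call_records, default_layer="unknown"):
    """Combine CMDB asset nodes and observed call records into one layered graph."""
    graph = LayeredGraph()
    for item in cmdb_items:
        graph.layer_of[item["id"]] = item["layer"]
    for caller, callee in call_records:
        for node in (caller, callee):
            # A node seen in the call data but missing from the CMDB is still added.
            graph.layer_of.setdefault(node, default_layer)
        graph.edges.add((caller, callee))
    return graph


# Hypothetical inputs: cache-1 is called at runtime but is not yet in the CMDB.
cmdb_items = [
    {"id": "crm-web", "layer": "application"},
    {"id": "crm-db", "layer": "database"},
]
call_records = [("crm-web", "crm-db"), ("crm-web", "cache-1")]
graph = build_graph(cmdb_items, call_records)
```

Adding nodes that only appear in the call data mirrors the correction of the CMDB by the observed calling relationships described earlier.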
(5) The model library contains the known system topology models, classified by service and system into CRM, channel, CBOSS models, and so on; the topology levels and calling relationships differ between systems. It is connected to the fault root cause analysis device: once the model library has passed its information to the fault root cause analysis device, the analysis module uses it together with the indicator data from the indicator management device to analyze the fault root cause.
(6) The indicator management device manages the service and system indicator data, including the indicator data of each node in the multi-layer directed graph model and key indicators such as health. It is connected to the fault root cause analysis device: it pushes the indicators to the analysis device, which analyzes the fault root cause in combination with the models in the indicator library.
(7) The fault root cause analysis device is based on the big data Storm stream computing architecture and, through real-time data computation, shortens the time needed to calculate the fault root cause to the order of seconds. Based on the multi-layer directed graph model and the node indicator data, it judges whether the system is abnormal; if so, it calculates the root node according to the multi-layer directed graph algorithm, that is, it identifies the root cause of the system fault. It is connected to the automated fault handling device: when the fault root cause has been identified, it is sent to the handling device for fault handling.
FIG. 3 is a schematic flowchart of the fault root cause query provided in this embodiment. Referring to FIG. 3, the process includes:
The KEI model is used to evaluate the indicator data of the top layer of the FSDG model. If the evaluation result is healthy, the system performs no further analysis; if the evaluation result is unhealthy, the FSDG fault root cause analysis process is triggered to calculate the fault source.
The set of FSDG fault nodes is processed with the naive causality mining algorithm to construct the fault causality mining object FCS. FCS is all the time series data produced by the elements of the system, formally expressed as the four-tuple $\langle m, k, T, E_{m\times k}\rangle$, where m is the number of elements in the FSDG, k is the number of time lags of each element, T is the length of the time series, and $E_{m\times k}$ is the set of all elements of the system over all time lags. The FSDG graph may contain several links composed of service nodes that require fault root cause diagnosis; $C_1, \ldots, C_n$ denote the FSDG graphs, one per link, obtained by splitting the graph according to the associated links formed by different service nodes.
In the fluctuation calculation, $\text{target} = x_i(t)$ and $\text{variables} = E_{m\times k} - x_i(t)$. With target as the dependent variable and variables as the independent variables, GEP-based function fitting is performed to obtain the function $f_{x_i(t)}$. Each element in the independent variable set variables of $f_{x_i(t)}$ is then perturbed in turn. Because the time lag of the system is k, the values $x_j(t)$ to $x_j(t-k)$ of each element $x_j$ over all time series are perturbed. The fluctuation value $\delta f_{x_i(t)}(x_i, \delta)$ of each element is calculated from the perturbation, and a causal judgment is then made according to the size of the fluctuation: the element with the smaller fluctuation value is the fault root cause.
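The perturbation step can be illustrated with a small sketch. It assumes the fitted function is already available as a Python callable (the GEP fitting itself is not shown), that each variable is perturbed by adding a fixed $\delta$, and that the fluctuation is measured as the absolute change in the function value; these choices and the names below are assumptions, not the patent's definition.

```python
def fluctuation_values(f, variables, delta):
    """Perturb each variable in turn and record how much the fitted function changes.

    f         -- callable taking a dict of variable values and returning a number
                 (stands in for the fitted function; the fitting step is assumed done)
    variables -- dict mapping a variable key, e.g. (element, lag), to its value
    delta     -- size of the perturbation added to one variable at a time
    """
    baseline = f(variables)
    result = {}
    for key, value in variables.items():
        perturbed = dict(variables)      # copy so that only one variable changes
        perturbed[key] = value + delta
        result[key] = abs(f(perturbed) - baseline)
    return result


# Hypothetical fitted function over two lagged variables of other elements.
def fitted(values):
    return 2.0 * values[("db", 0)] + 0.5 * values[("cache", 1)]


observed = {("db", 0): 1.0, ("cache", 1): 4.0}
fluct = fluctuation_values(fitted, observed, delta=1.0)
```

The resulting dictionary can then be compared across elements to make the causal judgment described above.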
(8) The automated fault handling device handles the fault root cause automatically. If a corresponding fault handling plan exists, the device executes it automatically, repairs the system promptly, and notifies the people responsible for the system.
Existing solutions are limited to analyzing the root causes of known faults, cannot respond flexibly to newly discovered faults, and cannot provide real-time computing capability. To address these shortcomings, the method provided in this embodiment is based on Storm stream computing technology and combines the fault directed graph algorithm FSDG with the naive causality mining algorithm NCM, providing real-time, efficient, and flexible fault root cause analysis. In addition, because current IT operation and maintenance modeling methods have difficulty supplying massive data for training, the method proposes a rapid modeling approach that generates the FSDG model from CMDB data and the application topology relationship management module. This makes model building more convenient and avoids the large model errors caused by insufficient training data.
The method provided in this embodiment is not limited to locating and handling known faults; for new faults it can perform root cause analysis automatically according to the model. It strengthens fault data analysis: computing on the data in real time avoids the data backlog caused by information explosion and similar effects. It also improves automatic fault handling: introducing the automated handling device achieves closed-loop management of faults, from automatic discovery and location through to final handling.
FIG. 4 is a structural block diagram of the device for fault root cause diagnosis based on a multi-layer directed graph provided in this embodiment. Referring to FIG. 4, the device includes an acquisition module 401, an establishing module 402, and a root cause determination module 403, wherein:
The acquisition module 401 is configured to acquire the original service data generated at each service node of the preset service and to determine the calling relationship of each service node according to the original service data and the pre-stored attribute information of each service node;
The establishing module 402 is configured to establish the multi-layer directed graph model of the service nodes according to the calling relationship of each service node, determined from the original service data and the attribute information, and the pre-divided layer to which each service node belongs;
The root cause determination module 403 is configured to acquire the abnormal service data in the original service data, determine, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determine, from the root cause nodes, the target root cause node that causes the preset service abnormality.
The device for fault root cause diagnosis based on a multi-layer directed graph provided in this embodiment is applicable to the method provided in the above embodiments, and the details are not repeated here.
This embodiment provides a device for fault root cause diagnosis based on a multi-layer directed graph. The device determines the calling relationship of each service node jointly from the original service data and the attribute information, so that service nodes or calling relationships newly added in practice are fully taken into account. This ensures that every service node can be added to the multi-layer directed graph model when the model is built from the calling relationships, and lays the foundation for accurately and quickly finding the root cause nodes that produce abnormal service data. Because the service nodes in the model are generated from the actual service nodes, the completeness of the nodes avoids situations in which no root cause query can be performed for a newly occurring fault. In addition, the data analysis does not rely on the calling relationships alone but analyzes the data comprehensively on the basis of the constructed multi-layer directed graph model.
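FIG. 5 is a structural block diagram of the electronic device provided in this embodiment.
Referring to FIG. 5, the electronic device includes a processor 501, a memory 502, a communications interface 503, and a bus 504;
wherein:
The processor 501, the memory 502, and the communications interface 503 communicate with one another through the bus 504;
The communications interface 503 is used for information transmission between this electronic device and the communication devices of other electronic devices;
The processor 501 is configured to call the program instructions in the memory 502 to execute the methods provided by the above method embodiments, for example including: acquiring the original service data generated at each service node of the preset service, and determining the calling relationship of each service node according to the original service data and the pre-stored attribute information of each service node; establishing the multi-layer directed graph model of the service nodes according to the calling relationship of each service node, determined from the original service data and the attribute information, and the pre-divided layer to which each service node belongs; acquiring the abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining, from the root cause nodes, the target root cause node that causes the preset service abnormality.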
This embodiment provides a non-transitory computer-readable storage medium that stores computer instructions, the computer instructions causing the computer to execute the methods provided by the above method embodiments, for example including: acquiring the original service data generated at each service node of the preset service, and determining the calling relationship of each service node according to the original service data and the pre-stored attribute information of each service node; establishing the multi-layer directed graph model of the service nodes according to the calling relationship of each service node, determined from the original service data and the attribute information, and the pre-divided layer to which each service node belongs; acquiring the abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining, from the root cause nodes, the target root cause node that causes the preset service abnormality.
This embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions which, when executed by a computer, enable the computer to execute the methods provided by the above method embodiments, for example including: acquiring the original service data generated at each service node of the preset service, and determining the calling relationship of each service node according to the original service data and the pre-stored attribute information of each service node; establishing the multi-layer directed graph model of the service nodes according to the calling relationship of each service node, determined from the original service data and the attribute information, and the pre-divided layer to which each service node belongs; acquiring the abnormal service data in the original service data, determining, according to the multi-layer directed graph model, at least one root cause node that causes the preset service to generate the abnormal service data, and determining, from the root cause nodes, the target root cause node that causes the preset service abnormality.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
The embodiments described above, such as the electronic device, are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the description of the above embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware. On this understanding, the essence of the above technical solutions, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the embodiments of the present invention, not to limit them. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810461456.6A CN110493025B (en) | 2018-05-15 | 2018-05-15 | A method and device for fault root cause diagnosis based on multi-layer directed graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110493025A CN110493025A (en) | 2019-11-22 |
CN110493025B true CN110493025B (en) | 2022-06-14 |
Family
ID=68545155
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810461456.6A Active CN110493025B (en) | 2018-05-15 | 2018-05-15 | A method and device for fault root cause diagnosis based on multi-layer directed graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110493025B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112887108A (en) * | 2019-11-29 | 2021-06-01 | 中兴通讯股份有限公司 | Fault positioning method, device, equipment and storage medium |
CN111107158B (en) * | 2019-12-26 | 2023-02-17 | 远景智能国际私人投资有限公司 | Alarm method, device, equipment and medium for Internet of things equipment cluster |
CN111639115A (en) * | 2020-04-29 | 2020-09-08 | 国家电网有限公司客户服务中心 | Five-dimensional model-based analysis method for operation and maintenance data abnormity of power grid information system |
CN111913824B (en) * | 2020-06-23 | 2024-03-05 | 中国建设银行股份有限公司 | Method for determining data link fault cause and related equipment |
CN113970913A (en) * | 2020-07-24 | 2022-01-25 | 华为技术有限公司 | Fault diagnosis method and device |
CN111858123B (en) * | 2020-07-29 | 2023-09-26 | 中国工商银行股份有限公司 | Fault root cause analysis method and device based on directed graph network |
CN112506763A (en) * | 2020-11-30 | 2021-03-16 | 清华大学 | Automatic positioning method and device for database system fault root |
CN114629776B (en) * | 2020-12-11 | 2023-05-30 | 中国联合网络通信集团有限公司 | Fault analysis method and device based on graph model |
CN112541098A (en) * | 2020-12-17 | 2021-03-23 | 杉数科技(北京)有限公司 | Directed graph drawing method and chemical material planning method |
CN112580810A (en) * | 2020-12-22 | 2021-03-30 | 济南中科成水质净化有限公司 | Sewage treatment process analysis and diagnosis method based on directed acyclic graph |
CN112711493A (en) * | 2020-12-25 | 2021-04-27 | 上海精鲲计算机科技有限公司 | Scenario root cause analysis application |
CN113282884B (en) * | 2021-04-28 | 2023-09-26 | 沈阳航空航天大学 | Universal root cause analysis method |
CN113793128A (en) * | 2021-09-18 | 2021-12-14 | 北京京东振世信息技术有限公司 | Method, apparatus, device and computer-readable medium for generating service failure cause information |
CN117061332B (en) * | 2023-10-11 | 2023-12-29 | 中国人民解放军国防科技大学 | Fault diagnosis method and system based on probability directed graph deep learning |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106330501A (en) * | 2015-06-26 | 2017-01-11 | 中兴通讯股份有限公司 | A fault correlation method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8301755B2 (en) * | 2007-12-14 | 2012-10-30 | Bmc Software, Inc. | Impact propagation in a directed acyclic graph |
Non-Patent Citations (2)
Title |
---|
一种基于多层有向图的故障根因诊断的方法;赵靓;《中国优秀硕士学位论文期刊网》;20150915;第19-39 * |
基于扰动的亚复杂动力系统因果关系挖掘;郑皎凌;《计算机学报》;20141231;第37卷(第12期);第2549-2560页 * |
Also Published As
Publication number | Publication date |
---|---|
CN110493025A (en) | 2019-11-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110493025B (en) | A method and device for fault root cause diagnosis based on multi-layer directed graph | |
KR102483025B1 (en) | Operational maintenance systems and methods | |
CN112152830B (en) | An intelligent fault root cause analysis method and system | |
US10902368B2 (en) | Intelligent decision synchronization in real time for both discrete and continuous process industries | |
WO2016090929A1 (en) | Method, server and system for software system fault diagnosis | |
CN113935497A (en) | Intelligent operation and maintenance fault processing method, device and equipment and storage medium thereof | |
CN107770797A (en) | Correlation analysis method and system for wireless network alarm management | |
KR20180108446A (en) | System and method for management of ict infra | |
CN117422434A (en) | Wisdom fortune dimension dispatch platform | |
CN118313812A (en) | A method for collecting and processing power big data based on machine learning | |
CN117041029A (en) | Network equipment fault processing method and device, electronic equipment and storage medium | |
CN116861708B (en) | Method and device for constructing multidimensional model of production equipment | |
WO2024066683A1 (en) | Industrial internet operating system and product processing method | |
CN112241424A (en) | Air traffic control equipment application system and method based on knowledge graph | |
CN114647558A (en) | A method and device for log anomaly detection | |
CN118967147A (en) | An after-sales trigger management method and system based on multi-field analysis and fusion | |
CN112148347A (en) | Method and device for full-process traceability management | |
CN117591887A (en) | Prediction model training method and hazardous waste monitoring method | |
CN117194092A (en) | Root cause locating method, root cause locating device, computer equipment and storage medium | |
CN112147974B (en) | An alarm root cause diagnosis method based on chemical process knowledge automation | |
CN114676021A (en) | Job log monitoring method and device, computer equipment and storage medium | |
Peng et al. | Research on data quality detection technology based on ubiquitous state grid internet of things platform | |
CN118760572A (en) | A fault alarm method, system and terminal device for power grid dispatching platform | |
Xu et al. | Application of Edge Computing in the Quality Control of Cable Production Process | |
CN118572890A (en) | Automatic checking method and device for operation information of equipment start and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||