CN117221087A - Alarm root cause positioning method, device and medium - Google Patents
- Publication number
- CN117221087A (application CN202311286937.5A)
- Authority
- CN
- China
- Prior art keywords
- alarm
- root cause
- cause
- data
- nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an alarm root cause positioning method, device and medium, relating to the technical field of network operation and maintenance, to solve the problem that existing root cause positioning analysis makes insufficient use of information, which results in insufficient root cause positioning accuracy and insufficient robustness of the scheme. The method comprises: obtaining first root cause alarm information of alarm data to be analyzed based on alarm association rules; obtaining second root cause alarm information of the alarm data to be analyzed based on an alarm knowledge graph, which comprises: obtaining the alarm knowledge graph, in which every two connected nodes are a cause node and an effect node and each connecting edge carries a causal edge weight, matching the alarm data to be analyzed with the nodes of the alarm knowledge graph to obtain an alarm causal graph, and obtaining the second root cause alarm information according to the alarm causal graph and the causal edge weights; and obtaining final root cause alarm information based on the first root cause alarm information and the second root cause alarm information. By adding alarm knowledge graph analysis, the invention improves the accuracy of root cause positioning.
Description
Technical Field
The present invention relates to the field of network operation and maintenance technologies, and in particular, to an alarm root cause positioning method, an alarm root cause positioning device, and a computer readable storage medium.
Background
With the development of network technology, network systems are becoming increasingly complex, and a large amount of alarm data is received when a network fails. In some cases, root cause positioning analysis of the alarm data is performed by means of association rule mining. However, existing analysis methods based on association rule mining make insufficient use of the available information, which may lead to insufficient root cause positioning accuracy and insufficient robustness of the scheme.
Disclosure of Invention
In view of the above deficiencies, the technical problem to be solved by the invention is to provide an alarm root cause positioning method, an alarm root cause positioning device and a computer readable storage medium, so as to solve the problem that existing root cause positioning analysis makes insufficient use of information, resulting in insufficient root cause positioning accuracy and insufficient robustness of the scheme.
In a first aspect, the present invention provides a method for positioning an alarm root cause, including:
obtaining first root cause alarm information of alarm data to be analyzed based on alarm association rules;
obtaining second root cause alarm information of the alarm data to be analyzed based on an alarm knowledge graph, which comprises:
obtaining the alarm knowledge graph, wherein every two connected nodes in the alarm knowledge graph are a cause node and an effect node, and each connecting edge carries a causal edge weight,
matching the alarm data to be analyzed with the nodes of the alarm knowledge graph to obtain an alarm causal graph,
obtaining the second root cause alarm information according to the alarm causal graph and the causal edge weights;
and obtaining final root cause alarm information based on the first root cause alarm information and the second root cause alarm information.
In a second aspect, the present invention provides an alarm root cause positioning device, including:
a first analysis module, configured to obtain first root cause alarm information of alarm data to be analyzed based on alarm association rules;
a second analysis module, configured to obtain second root cause alarm information of the alarm data to be analyzed based on an alarm knowledge graph, and including:
an alarm knowledge graph unit, configured to obtain the alarm knowledge graph, wherein every two connected nodes in the alarm knowledge graph are a cause node and an effect node, and each connecting edge carries a causal edge weight,
a second matching unit, connected with the alarm knowledge graph unit and configured to match the alarm data to be analyzed with the nodes of the alarm knowledge graph to obtain an alarm causal graph,
a second positioning unit, connected with the second matching unit and configured to obtain the second root cause alarm information according to the alarm causal graph and the causal edge weights;
a third analysis module, connected with the first analysis module and the second analysis module and configured to obtain final root cause alarm information based on the first root cause alarm information and the second root cause alarm information.
In a third aspect, the present invention provides a computer readable storage medium having a computer program stored therein which, when executed by a processor, implements the alarm root cause positioning method described above.
According to the alarm root cause positioning method, device and computer readable storage medium provided by the invention, the first root cause alarm information of the alarm data to be analyzed, obtained based on the alarm association rules, is combined with the second root cause alarm information of the alarm data to be analyzed, obtained based on the alarm knowledge graph, to obtain the final root cause alarm information. The second root cause alarm information is obtained by jointly judging the causal relationships between cause nodes and effect nodes in the alarm knowledge graph and the causal edge weights. Adding alarm knowledge graph analysis on top of association rule mining makes fuller use of the available information and thereby improves the accuracy of root cause positioning and the robustness of the scheme.
Drawings
FIG. 1 is a flowchart of an alarm root cause positioning method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another alarm root cause positioning method according to an embodiment of the present invention;
FIG. 3 is a flowchart of an alarm root cause positioning method based on alarm association rules according to an embodiment of the present invention;
FIG. 4 shows how the graph changes in an alarm root cause positioning method based on an alarm knowledge graph according to an embodiment of the present invention, wherein (a) is a schematic diagram of an alarm knowledge graph, (b) and (c) are schematic diagrams of an alarm causal graph, and (d) is a schematic diagram of another alarm causal graph;
FIG. 5 is a flowchart of an alarm root cause positioning method based on an alarm knowledge graph according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an alarm root cause positioning device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a second analysis module of the alarm root cause positioning device according to an embodiment of the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the following detailed description of the embodiments of the present invention will be given with reference to the accompanying drawings.
It is to be understood that the specific embodiments and figures described herein are merely illustrative of the invention, and are not limiting of the invention.
It is to be understood that the various embodiments of the invention and the features of the embodiments may be combined with each other without conflict.
It is to be understood that only the portions relevant to the present invention are shown in the drawings for convenience of description, and the portions irrelevant to the present invention are not shown in the drawings.
It should be understood that each unit and module in the embodiments of the present invention may correspond to only one physical structure, may be formed by a plurality of physical structures, or may be integrated into one physical structure.
It is to be understood that the functions and steps noted in the flowcharts and block diagrams of the present invention may occur out of the order noted in the figures, provided there is no conflict.
It is to be understood that the flowcharts and block diagrams of the present invention illustrate the architecture, functionality, and operation of possible implementations of systems, apparatuses, devices and methods according to various embodiments of the present invention. Each block in the flowcharts or block diagrams may represent a unit, module, segment, or portion of code that comprises executable instructions for implementing the specified functions. Moreover, each block or combination of blocks in the block diagrams and flowcharts can be implemented by hardware-based systems that perform the specified functions, or by combinations of hardware and computer instructions.
It should be understood that the units and modules related in the embodiments of the present invention may be implemented by software, or may be implemented by hardware, for example, the units and modules may be located in a processor.
To facilitate an understanding of the present invention, some techniques and concepts related to the present invention are first described.
With the integration of operators' cloud networks and the new deployment form of the communication cloud based on NFV (Network Functions Virtualization), all of an operator's cloud network platforms are deployed in the cloud, and service platform applications are completely decoupled from the underlying cloud resources. The state sensing and monitoring capabilities for the underlying cloud resources are weak, and the mechanisms for cooperation between specialties are immature, which makes cross-layer faults of cloud network platforms difficult to locate and fault handling slow. Intelligent means are therefore needed to support cross-layer operation and maintenance cooperation and to ensure the safe, reliable and continuous operation of cloud network services.
In the context of cloud construction, because alarm association and root cause positioning capabilities are lacking, operation and maintenance personnel face the following pain points in practice: 1) cross-layer information between the service layer and the resource layer of the cloud network platform is not synchronized, and information is difficult to collect; 2) after a cloud network platform fails, the number of alarms is large, work order dispatch is inaccurate, and the load on operation and maintenance personnel is heavy; 3) the systems involved in the cloud network platform are increasingly numerous and complex, faults are difficult to locate manually, and fault location takes a long time; 4) historical fault experience with the cloud network platform is difficult to accumulate. The greatest challenge is therefore how to locate the root cause of cross-layer alarm faults: quickly finding the root cause among massive cloud network alarms, improving the accuracy of root cause positioning, supporting accurate work order dispatch by the operator's operation and maintenance platform, and promoting the construction of a centralized maintenance system for the cloud platform.
Association analysis is expected to become a mainstream means in the industry for solving alarm association and fault root cause positioning. Association analysis first requires association rule mining, which discovers association relations among items in a data set and extracts from them the strong association rules that satisfy certain conditions, so as to guide services. Existing association rule analysis methods mainly use traditional data mining algorithms such as Apriori or FP-Growth to mine alarm associations, wherein:
the Apriori algorithm is one of the most influential algorithms for mining the frequent item sets of Boolean association rules. It uses an iterative method called layer-by-layer search, in which the k-item sets are used to explore the (k+1)-item sets. The algorithm flow can be described as follows: first, the frequent 1-item sets are found and denoted L1; L1 is then used to generate the candidate item set C2, and the items in C2 are checked to obtain L2, the frequent 2-item sets; this continues until no more frequent k-item sets can be found. To improve the efficiency of generating frequent item sets layer by layer, an important property called the Apriori property is used to compress the search space. The property states that, on the one hand, all non-empty subsets of a frequent item set must also be frequent and, on the other hand, all supersets of a non-frequent item set are non-frequent;
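The layer-by-layer search described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the invention; the function name `apriori`, the use of an absolute support count for `min_support`, and the example alarm names are assumptions made for the sketch.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent item sets.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        # Join step: combine frequent k-item sets into (k+1)-item candidates
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1:
                # Prune step (Apriori property): every k-subset must be frequent
                if all(frozenset(sub) in current for sub in combinations(union, k)):
                    candidates.add(union)
        # Count the support of each surviving candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical alarm transactions, one set of alarm names per time window
windows = [{"link_down", "timeout", "cpu_high"}, {"link_down", "timeout"},
           {"link_down", "cpu_high"}, {"timeout", "cpu_high"},
           {"link_down", "timeout", "cpu_high"}]
result = apriori(windows, min_support=3)
```

With these windows every pair of alarms is frequent, while the full triple occurs only twice and is pruned at the support check.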
the FP-Growth algorithm is essentially a depth-first search algorithm built on a relatively compact data structure, the frequent pattern tree (FP-tree). The algorithm is divided into two main processes, constructing the frequent pattern tree and mining the frequent patterns in the tree, and the frequent pattern set can be obtained by scanning the transaction database only twice. The algorithm flow can be described as follows: (1) the first scan of the data set counts the occurrences of each item, records the counts in a header pointer table, and deletes the table entries that do not meet the support requirement; (2) the second scan of the data set deletes from each record the items that do not meet the support requirement and sorts the remaining items of each record in descending order of occurrence count; (3) the sorted records are inserted into the FP-tree and the header pointer table is completed; steps (1) to (3) are then iterated to collect the frequent item sets.
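The tree construction described above can be sketched as follows, assuming the standard two-scan form of FP-Growth. The names `FPNode`, `build_fp_tree` and `conditional_pattern_base` are hypothetical, and `min_support` is taken to be an absolute count.

```python
from collections import Counter

class FPNode:
    """One node of the frequent pattern tree."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count each item and keep those that meet the support threshold
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    root = FPNode(None, None)
    header = {item: [] for item in frequent}  # item -> nodes holding that item
    # Scan 2: filter each record, sort by descending count, insert into the tree
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def conditional_pattern_base(item, header):
    """Prefix paths that end at `item`, each paired with the count of the item node."""
    base = []
    for node in header[item]:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((path[::-1], node.count))
    return base
```

Each prefix path inherits the count of the node it ends at, which is the quantity the mining phase works with when it builds conditional trees.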
However, conventional association rule analysis methods generally make insufficient use of key information such as the service topology and the resource topology. Even if the two algorithms above are used to mine association rules and root cause positioning analysis is then performed on those rules, most of the rules tend to have low confidence and poor effectiveness. This is because the mined rules depend on the occurrence frequency of the item sets and use only the time and frequency information in the historical alarm data, so factors such as the alarm level and the importance of the equipment in a specific operation and maintenance scenario cannot be taken into account.
In addition to association rule analysis, graph deduction methods such as causal graphs and fault trees may also be considered. For example, root cause alarms can be located with the causal graph method: different types of causal graphs are first constructed for different scenarios and data characteristics, and different types of algorithms are then used to search the causal graphs for the root cause alarm. Typical examples include constructing a causal graph model from conditional probability or co-occurrence characteristics using the PC algorithm, and constructing a causal graph model from call chain characteristics. However, root cause alarm positioning techniques based on causal graphs use only one of the alarm association relationship, the service topology data and the device connection relationship when constructing the causal graph, and no mature scheme that considers multiple kinds of information together has yet been formed. The application range of such schemes is therefore limited, and their root cause positioning performance is strongly affected by irregular network changes.
Therefore, in view of the above deficiencies and problems, the invention provides a cross-layer root cause alarm positioning scheme for cross-domain alarm analysis scenarios of cloud network service platforms, such as operators' video ring-back tone and 5G messaging, which mainly comprises the following:
1) An improved FP-Growth algorithm is proposed. The improvements include: a weight coefficient is set for each frequent item set according to importance factors such as alarm device association and alarm level, which improves the confidence and effectiveness of the association rules; and, while mining frequent item sets, a "processed" label is added to each child node path that has been traversed, which avoids repeated traversal and improves calculation efficiency. Association rule mining of cross-layer alarms is completed based on the improved FP-Growth algorithm, and cross-layer alarm root cause analysis is performed;
2) A method for constructing an alarm knowledge graph and a software and hardware knowledge graph for the cloud network platform is proposed. Historical alarm indicators, service indicators, resource indicators, performance indicators and other data are fully used to construct the alarm knowledge graph and the software and hardware knowledge graph, and the knowledge graphs are used for cross-layer alarm root cause analysis. Through the combined use of multiple knowledge graphs, the scheme can be optimized and iterated, and operation and maintenance cases and experience can be accumulated;
3) The analysis result of the association rule algorithm based on the improved FP-Growth algorithm and the analysis result of the algorithm based on the knowledge graph are jointly evaluated by fusion methods such as voting to determine the final root cause alarm, which effectively improves the accuracy of root cause positioning and the robustness of the scheme.
Example 1:
As shown in FIG. 1, the invention provides an alarm root cause positioning method, which comprises the following steps:
S1, obtaining first root cause alarm information of alarm data to be analyzed based on alarm association rules;
S2, obtaining second root cause alarm information of the alarm data to be analyzed based on an alarm knowledge graph, which comprises:
obtaining the alarm knowledge graph, wherein every two connected nodes in the alarm knowledge graph are a cause node and an effect node, and each connecting edge carries a causal edge weight,
matching the alarm data to be analyzed with the nodes of the alarm knowledge graph to obtain an alarm causal graph,
obtaining the second root cause alarm information according to the alarm causal graph and the causal edge weights;
S3, obtaining final root cause alarm information based on the first root cause alarm information and the second root cause alarm information.
Specifically, in the present embodiment, steps S1 and S2 are performed first, in either order, and step S3 is performed after steps S1 and S2 are completed. Steps S1 and S3 may be carried out in various ways, for example by the specific steps provided later in this embodiment. Step S2 includes obtaining an alarm knowledge graph in which the nodes are connected according to their causal relationships and a causal edge weight is set between each pair of cause and effect nodes to indicate the probability that the occurrence of the cause node event leads to the occurrence of the effect node event. The alarm data to be analyzed can thus be matched with the nodes of the alarm knowledge graph to determine the causal relationships among the alarm data to be analyzed, and the second root cause alarm information is determined by combining the causal relationships obtained with the corresponding causal edge weights. Through this flow, the first root cause alarm information of the alarm data to be analyzed, obtained based on the alarm association rules, is combined with the second root cause alarm information of the alarm data to be analyzed, obtained based on the alarm knowledge graph, to obtain the final root cause alarm information. The second root cause alarm information is obtained by jointly judging the causal relationships between cause nodes and effect nodes in the alarm knowledge graph and the causal edge weights. Adding alarm knowledge graph analysis on top of association rule mining makes fuller use of the available information and thereby improves the accuracy of root cause positioning and the robustness of the scheme.
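As an illustration of step S2, the matching of alarms against the knowledge graph and the use of the causal edge weights could look like the sketch below. The text here does not fix a scoring rule, so the rule used (candidates are matched nodes with no incoming causal edge, scored by the summed best-path probability to the alarms they can explain) is an assumption, as are the function names and the example graph.

```python
def build_causal_graph(knowledge_graph, observed):
    """Keep only the edges whose cause node and effect node both match observed alarms."""
    return {(c, e): w for (c, e), w in knowledge_graph.items()
            if c in observed and e in observed}

def rank_root_causes(causal_graph):
    """Rank candidate root causes; the scoring rule is an assumption of this sketch."""
    nodes = {n for edge in causal_graph for n in edge}
    effects = {e for (_, e) in causal_graph}
    candidates = nodes - effects  # matched nodes without an incoming causal edge
    children = {}
    for (c, e), w in causal_graph.items():
        children.setdefault(c, []).append((e, w))

    def best_reach(start):
        # Highest product of edge weights from start to each reachable node
        best = {start: 1.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for child, w in children.get(node, []):
                p = best[node] * w
                if p > best.get(child, 0.0):
                    best[child] = p
                    stack.append(child)
        return best

    scores = {}
    for c in candidates:
        reach = best_reach(c)
        scores[c] = sum(p for n, p in reach.items() if n != c)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical knowledge graph: (cause, effect) -> causal edge weight
kg = {("power_fail", "link_down"): 0.9,
      ("link_down", "service_timeout"): 0.8,
      ("disk_full", "db_error"): 0.7}
observed = {"power_fail", "link_down", "service_timeout", "db_error"}
ranking = rank_root_causes(build_causal_graph(kg, observed))
```

In this example `disk_full` was not observed, so its edge is dropped and `power_fail` is the only candidate. Observed alarms that match no edge do not appear in the alarm causal graph and would need separate handling.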
The specific application scenario of this embodiment is cross-layer alarm analysis for cloud network service platforms such as operators' video ring-back tone and 5G messaging. The overall technical scheme, as shown in fig. 2, mainly includes three parts: association rule mining and first root cause positioning, specifically cross-layer alarm association rule mining and root cause positioning based on the improved FP-Growth; alarm knowledge graph construction and second root cause positioning, which also includes constructing a software and hardware knowledge graph and using it to assist the alarm knowledge graph in root cause positioning; and fusion and judgment of the root cause positioning results, specifically fusing the root cause positioning results of the knowledge graphs (including the alarm knowledge graph) and of the rule base (which stores the alarm association rules) and determining the final root cause positioning result.
The three parts are executed as follows. Alarm data is required first, comprising historical alarm data and alarm data to be analyzed; the alarm data to be analyzed may be the real-time alarm data in a real-time alarm slice. The historical alarm data is used for association rule mining and for constructing the alarm knowledge graph. First root cause positioning analysis and second root cause positioning analysis are then performed on the alarm data to be analyzed, based respectively on the mined alarm association rules and on the constructed alarm knowledge graph. Finally, the result of the first root cause positioning analysis (the first root cause alarm information) and the result of the second root cause positioning analysis (the second root cause alarm information) are fused to output the final root cause alarm information, which may include the root alarm (here, the final root cause alarm) and may also include the root cause path corresponding to the root alarm (the final root cause alarm path).
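The fusion step can be illustrated with a simple vote over the candidate lists produced by the two analyses. The text names voting only as one possible fusion method, so the tie-breaking rule below (best rank in any list, then name) and the function name `fuse_by_voting` are assumptions of this sketch.

```python
from collections import Counter

def fuse_by_voting(*candidate_lists):
    """Order candidates by number of votes, then by their best rank in any list."""
    votes = Counter()
    best_rank = {}
    for candidates in candidate_lists:
        # dict.fromkeys removes duplicates while keeping the ranked order
        for rank, alarm in enumerate(dict.fromkeys(candidates)):
            votes[alarm] += 1
            best_rank[alarm] = min(best_rank.get(alarm, rank), rank)
    return sorted(votes, key=lambda a: (-votes[a], best_rank[a], a))

# Hypothetical ranked outputs of the rule-based and graph-based analyses
first = ["link_down", "power_fail"]
second = ["power_fail", "disk_full"]
final = fuse_by_voting(first, second)  # "power_fail" is named by both analyses
```

A candidate proposed by both analyses is ranked ahead of candidates proposed by only one, which is the robustness benefit the fusion is meant to provide.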
In one possible implementation, before obtaining the first root cause alarm information of the alarm data to be analyzed based on the alarm association rules, the method further includes:
constructing the alarm association rules based on an improved FP-Growth algorithm according to the historical alarm data, wherein the improved FP-Growth algorithm adds a weight coefficient to improve the support calculation and/or the confidence calculation of the FP-Growth algorithm, and every two connected nodes in the alarm association rules are a root cause alarm node and a derived alarm node.
Specifically, in this embodiment, as shown in fig. 3, association rule mining and root cause positioning based on the improved FP-Growth algorithm can be divided into two parts. The first part is the construction of the association rule base, which mainly uses the improved FP-Growth algorithm to mine the historical alarm data and the resource topology data for cross-layer alarm association rules, while also allowing experts to enter association rules manually based on their operation and maintenance experience. The second part is rule-based alarm root cause analysis, which relies on the constructed rule base (the association rule base, including the alarm association rules) and outputs root alarms and a root cause list (that is, one specific form of the first root cause alarm information). The main improvement to the FP-Growth algorithm is that importance factors such as alarm device association and alarm level are determined according to the resource topology data, and a weight coefficient based on these factors is added to the support calculation and/or confidence calculation of the FP-Growth algorithm, so as to improve the confidence and effectiveness of the alarm association rules.
In one possible implementation, constructing the alarm association rules based on the improved FP-Growth algorithm according to the historical alarm data specifically includes:
obtaining historical alarm data that has undergone first preprocessing, wherein the first preprocessing comprises: extracting key fields from the historical alarm data; generating first dictionary values, each of which indicates the alarm node and the key fields of a piece of historical alarm data; representing the historical alarm data by the first dictionary values; classifying the historical alarm data into different first alarm classes according to the key fields; and obtaining the frequency with which each first alarm class occurs in each historical time window by setting a time window and a sliding step size;
calculating a first weighted support $\omega s_1$ for each first alarm class from a first weight coefficient $\omega_1$ and the frequency according to formula (1), sorting the first alarm classes whose $\omega s_1$ is greater than a second preset value by frequency within each historical time window, and building a frequent tree FP-tree with the first alarm classes as child nodes in the sorted order of each historical time window, where, in formula (1), $X$ denotes a given first alarm class, $N(X)$ is the frequency of the first alarm class $X$, and $N$ is the frequency of all first alarm classes;
obtaining the conditional pattern base of each child node in the FP-tree, calculating a second weighted support $\omega s_2$ for each conditional pattern base path from a second weight coefficient $\omega_2$ and the frequency according to formula (2), and taking the conditional pattern base paths whose $\omega s_2$ is greater than a third preset value as frequent item sets, where, in formula (2), $X_1$ denotes the first alarm class of the end node of a given conditional pattern base path, $Y_1$ denotes a first alarm class on that conditional pattern base path, and $N(X_1 \cup Y_1)$ is the sum of the frequencies of the first alarm classes of that conditional pattern base path;
calculating, from a third weight coefficient $\omega_3$ and the frequency according to formula (3), the weighted confidence $\omega c$ between any two adjacent first alarm classes $X_2$ and $Y_2$ in each frequent item set, and, in response to $\omega c(X_2 \to Y_2) < \omega c(Y_2 \to X_2)$, arranging $Y_2$ as the root cause of $X_2$, until all first alarm classes in each frequent item set have been arranged, thereby obtaining the alarm association rules, where, in formula (3), $\sigma(X_2 \cup Y_2)$ is the frequency with which $X_2$ and $Y_2$ occur together and $\sigma(X_2)$ is the frequency with which $X_2$ occurs.
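The text defines the symbols of formulas (1) to (3), but the formulas themselves are not reproduced here. The sketch below therefore assumes the usual weighted forms, a weighted support of weight × N(X) / N and a weighted confidence of weight × σ(X ∪ Y) / σ(X); these forms, the function names and the example frequencies are assumptions, not the patent's formulas.

```python
def weighted_support(freq, total, weight):
    """Assumed form of formulas (1) and (2): weight * N(X) / N."""
    return weight * freq / total

def weighted_confidence(co_freq, antecedent_freq, weight):
    """Assumed form of formula (3): weight * sigma(X u Y) / sigma(X)."""
    return weight * co_freq / antecedent_freq

def root_cause_of_pair(x, y, freq, co_freq, weight=1.0):
    """Return (root_cause, derived_alarm) for two adjacent alarm classes.

    Follows the stated rule: if wc(x -> y) < wc(y -> x), y is the root cause of x.
    Using one weight for both directions is an assumption of this sketch.
    """
    wc_xy = weighted_confidence(co_freq, freq[x], weight)
    wc_yx = weighted_confidence(co_freq, freq[y], weight)
    return (y, x) if wc_xy < wc_yx else (x, y)

# Hypothetical frequencies: link_down is rare but almost always followed by timeouts
freq = {"link_down": 10, "service_timeout": 40}
pair = root_cause_of_pair("service_timeout", "link_down", freq, co_freq=8)
```

Here the confidence of service_timeout → link_down is 8/40 and that of link_down → service_timeout is 8/10, so `link_down` is arranged as the root cause of `service_timeout`.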
Specifically, in this embodiment, the step of constructing the cross-layer alarm association rule base based on the improved FP-Growth, see fig. 3, includes:
data is obtained from the data warehouse, including obtaining historical alert data in an alert history data file and obtaining resource topology data.
Data preprocessing, including, for example: alarm denoising: removing redundant and invalid alarms, such as engineering alarms, cut-over identification alarms and alarms with missing key fields, and keeping only one alarm when several alarms share the same alarm name and the same alarm position; key information extraction: extracting the useful fields of the alarm data, which generally include the alarm title, device type, alarm code, specialty, alarm object and alarm level; alarm dictionary integration: combining alarm fields into a dictionary value, for example concatenating "network element ID + alarm title", so as to ensure the uniqueness of the alarm data; classification by topology: classifying the alarms of devices that have an upstream-downstream relationship according to the topology of the alarm objects, the upstream-downstream relationship being judged from the resource device relationships provided by the resource center (including relationships with transmission links, racks and the like); event window sliding: converting discrete alarm data into an alarm data transaction set by setting a time window and a sliding step.
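The dictionary integration and window sliding steps can be sketched as follows. This is a minimal illustration only: the field names `ne_id`, `title` and `time`, and the two-minute window and step, are assumptions that do not come from the description. Using a set per window also drops alarms with the same name and position inside that window.

```python
from datetime import datetime, timedelta

def dictionary_value(alarm):
    # "network element ID + alarm title" concatenated into one key.
    return f"{alarm['ne_id']}+{alarm['title']}"

def to_transactions(alarms, window, step):
    """Slide a time window over the alarms; each non-empty window is one transaction."""
    alarms = sorted(alarms, key=lambda a: a["time"])
    if not alarms:
        return []
    start, end = alarms[0]["time"], alarms[-1]["time"]
    transactions = []
    while start <= end:
        items = {dictionary_value(a) for a in alarms
                 if start <= a["time"] < start + window}
        if items:
            transactions.append(items)
        start += step
    return transactions

t0 = datetime(2023, 1, 1, 0, 0)
alarms = [
    {"ne_id": "NE1", "title": "link down", "time": t0},
    {"ne_id": "NE1", "title": "link down", "time": t0 + timedelta(seconds=30)},
    {"ne_id": "NE2", "title": "port error", "time": t0 + timedelta(minutes=1)},
    {"ne_id": "NE3", "title": "cpu high", "time": t0 + timedelta(minutes=5)},
]
transactions = to_transactions(alarms, timedelta(minutes=2), timedelta(minutes=2))
```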
Association rule mining is performed by an AI engine based on the improved FP-Growth algorithm, including: building the alarm frequent tree from top to bottom: the alarm set is first traversed, the frequency of occurrence of each type of alarm (first alarm class) is counted according to field classifications such as device type, alarm title and alarm code, and the alarms are sorted from high to low frequency; a minimum support is set, and alarms below the minimum support are discarded and not treated as frequent items; a weight coefficient ω (here the first weight coefficient $\omega_1$) is added to the initial frequency of each transaction item (first alarm class) according to key influence factors such as alarm device association and alarm level, where the higher the importance, the larger the weight coefficient, and the value can be set manually; the minimum support then becomes the weighted minimum support (second preset value), the support of each transaction item is multiplied by the weight coefficient and compared with the weighted minimum support, the transaction items above the weighted minimum support are retained as frequent items, the alarms below it are removed, and the frequent tree FP-tree is generated by traversing the frequent items in order of frequency; mining the frequent item sets of the child nodes from bottom to top: each lowest-layer child node item of the FP-tree is traversed upwards to obtain its conditional pattern base, that is, the child node is traversed upwards to find its upper nodes, the frequency of each upper node being kept consistent with that of the child node; the conditional FP-tree of the child node item is then traversed from bottom to top according to the conditional pattern base, with the child node item as suffix: if the conditional FP-tree is a single path, the frequent item set is generated directly, and if branches exist, the traversal is repeated upwards, and the items whose weighted support (the weight here can be set as the second weight coefficient $\omega_2$) is less than the weighted minimum support (second preset value, which can be the same as the first preset value) are eliminated, finally giving the frequent item sets of the child node items; calculating the weighted confidence to generate strong association rules: the confidence of each alarm is calculated from the obtained frequent item sets, and the frequent items whose confidence is less than the minimum confidence (third preset value) are eliminated to generate strong association rules; a third weight coefficient $\omega_3$ is set for each frequent item set according to key influence factors such as alarm device association and alarm level (the higher the importance and the stronger the resource relevance of the frequent item set, the larger the weight coefficient, which can be set manually), and the confidence of the frequent item set multiplied by $\omega_3$ is the weighted confidence; the root alarm can be determined from the weighted confidence of the frequent item set, for example, if a frequent item set includes only X and Y and $\omega c(X \to Y) < \omega c(Y \to X)$, the root alarm of the frequent item set {X, Y} is Y; when there are more items, pairwise comparison is performed in the same way.
Checking and warehousing association rules: the mined cross-layer alarm association rules are checked for validity and reviewed by experts; after verification and review, they are uniformly formed into a cross-professional (cross-layer) alarm association rule base, fault propagation chains are generated based on this rule base, and the fault propagation chains are one-hot encoded to facilitate the subsequent first root cause positioning analysis.
In one possible implementation manner, obtaining the conditional pattern base of each child node in the FP-tree specifically includes:
obtaining each node located at an end of the FP-tree as a child node, traversing upwards from a child node through all of its upper nodes in the FP-tree, taking the path formed by the successively connected upper and lower nodes as a conditional pattern base path, and marking each conditional pattern base path so as to avoid repetition when subsequent child nodes are traversed.
Specifically, in this embodiment, in the step of constructing the conditional pattern base of a child node item by bottom-up traversal, the execution strategy of the upward traversal is improved: a processed mark is added to each traversed child node path, which avoids repeated traversal and improves calculation efficiency.
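The upward traversal with a processed mark can be sketched roughly as below. This is an illustrative reading of the text, not the claimed implementation: the `FPNode` class, the `processed` flag and the helper names are hypothetical, and the transactions are assumed to be already filtered and sorted by frequency.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}
        self.processed = False  # set once this node's upward path has been collected

def build_fp_tree(transactions):
    """Insert each (already ordered) transaction as a path below the root."""
    root = FPNode(None)
    for items in transactions:
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

def end_nodes(node):
    """Yield the nodes located at the ends of the tree."""
    if not node.children:
        yield node
    for child in node.children.values():
        yield from end_nodes(child)

def conditional_pattern_base(root):
    """For each unprocessed end node, collect its prefix path and its count."""
    base = []
    for leaf in end_nodes(root):
        if leaf.processed or leaf.parent is None:
            continue
        path, node = [], leaf.parent
        while node.parent is not None:  # stop at the root
            path.append(node.item)
            node = node.parent
        leaf.processed = True  # mark the path so it is not traversed again
        base.append((leaf.item, path[::-1], leaf.count))
    return base

tree = build_fp_tree([["A", "B", "C"], ["A", "B"], ["A", "D"]])
first_pass = conditional_pattern_base(tree)
second_pass = conditional_pattern_base(tree)
```

The second call returns nothing because every end node already carries the processed mark, which is the repetition-avoidance effect the embodiment describes.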
In one possible implementation manner, the method for obtaining the first root cause alarm information of the alarm data to be analyzed based on the alarm association rule specifically includes:
performing second preprocessing on the alarm data to be analyzed, wherein the second preprocessing comprises the following steps: extracting key fields from the alarm data to be analyzed, generating a second dictionary value comprising alarm nodes indicating the alarm data to be analyzed and the key fields, and representing the alarm data to be analyzed by using the second dictionary value;
Acquiring a first dictionary value of historical alarm data corresponding to each node in the alarm association rule, calculating cosine similarity between a second dictionary value and each acquired first dictionary value, and marking the second dictionary value at the corresponding node in the alarm association rule in response to the cosine similarity being larger than a fourth preset value so as to acquire an alarm association sub-rule hit by alarm data to be analyzed;
according to the sum of cosine similarity of hit nodes in each alarm association sub-rule, obtaining TopN alarm association sub-rules with the maximum sum of cosine similarity, and taking TopN final root cause alarm nodes of the TopN alarm association sub-rules as first root cause alarms.
Specifically, in this embodiment, based on the constructed association rule base, the rule-based alarm root cause analysis, see fig. 3, includes:
acquiring real-time alarm data: monitoring the active alarm queue of the cloud resource pool and the service platform in real time, and acquiring M alarms in the active alarm queue as the alarm data to be analyzed;
inputting the real-time alarm monitoring data into a rule matcher, which compares the input data with the rule base, for example by one-hot encoding the active alarm queue, calculating the cosine similarity between this one-hot encoding and the one-hot encoding of each fault propagation chain, and sorting the results in descending order;
obtaining a root cause list: if a rule is hit (cosine similarity greater than the fourth preset value), the hit rule is parsed and its root alarm is returned (for example, if the rule Z-Y-X is hit, the final root cause alarm node Z is returned as the root alarm; root alarm and root cause alarm have the same meaning); if several rules are hit, the TopN root causes (the root cause alarms obtained here are referred to as first root cause alarms) and their similarities are returned.
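The rule matcher of this embodiment can be sketched as follows. The sketch assumes that each fault propagation chain is stored as a list whose first element is the final root cause alarm node (as in Z-Y-X); the function names, the threshold 0.5 and the sample alarms are hypothetical.

```python
import math

def one_hot(items, vocabulary):
    return [1 if v in items else 0 for v in vocabulary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_rules(active_alarms, chains, threshold, top_n):
    """Return up to top_n (root alarm, similarity) pairs, best match first."""
    vocabulary = sorted({a for chain in chains for a in chain} | set(active_alarms))
    query = one_hot(set(active_alarms), vocabulary)
    hits = []
    for chain in chains:
        score = cosine(query, one_hot(set(chain), vocabulary))
        if score > threshold:  # the rule is hit
            hits.append((chain[0], score))  # chain[0]: final root cause node
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits[:top_n]

chains = [["Z", "Y", "X"], ["P", "Q"]]
hits = match_rules(["X", "Y", "Z", "W"], chains, threshold=0.5, top_n=3)
```

Here only the chain Z-Y-X is hit, so Z is returned as the first root cause alarm together with its similarity.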
In one possible implementation manner, the method for obtaining the second root cause alarm information according to the alarm causal graph and the causal edge weight specifically includes:
acquiring an alarm causal subgraph with a continuous causal edge weight larger than a first preset value in each alarm causal graph;
and taking the final cause node of each alarm causal subgraph as a second root cause alarm and taking each alarm causal subgraph as a root cause alarm path.
Specifically, in this embodiment, the execution details of step S2 can be understood with reference to fig. 4 (a simplified example only). First, an alarm knowledge graph as shown in fig. 4 (a) is obtained, whose nodes include A1, A2, B1, B2, C1, C2 and C3; the connecting lines represent causal relationships and their arrows indicate the causal direction, that is, A1 and A2 are the cause nodes of B1 and B2, B1 and B2 are the effect nodes of A1 and A2, B1 and B2 are the cause nodes of C1, C2 and C3, and C1, C2 and C3 are the effect nodes of B1 and B2, and each connecting line carries a corresponding causal edge weight. The alarm data to be analyzed is then matched with the nodes of the alarm knowledge graph shown in fig. 4 (a) to obtain the alarm causal graph shown in fig. 4 (b), that is, nodes A1, B1, B2, C1, C2 and C3 of the alarm knowledge graph are hit by the alarm data to be analyzed. Assuming a first preset value of 0.5 (by way of example only), the two alarm causal subgraphs shown in fig. 4 (c) and (d) can be obtained from the causal edge weights. From these two alarm causal subgraphs the second root cause alarm information can be obtained: for example, the final cause node of each alarm causal subgraph is taken as a second root cause alarm, giving two second root cause alarms, A1 and B2, and each alarm causal subgraph is a root cause path, that is, the root cause alarm A1 triggers the three derived alarms B1, C1 and C2, and B2 triggers the derived alarm C3. It can be understood that the order of obtaining the alarm causal graph and the alarm causal subgraphs is not limited to the above; alternatively, the alarm knowledge graph may first be divided into several subgraphs according to the causal edge weights, and the alarm causal subgraphs then obtained directly by node matching.
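The example of fig. 4 can be reproduced with a small Python sketch. The edge weights below are invented for illustration, since the description does not list the values shown in the figure; the function name is hypothetical. Edges are kept only when both endpoints are hit and the weight exceeds the first preset value, the remaining edges are grouped into connected subgraphs, and the node without an incoming edge in each subgraph is taken as the second root cause alarm.

```python
def causal_subgraphs(edges, hit_nodes, threshold):
    """Map each final cause node to the edges of its alarm causal subgraph."""
    kept = [(c, e) for (c, e), w in edges.items()
            if c in hit_nodes and e in hit_nodes and w > threshold]
    parent = {}

    def find(n):  # union-find lookup with path halving
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for c, e in kept:
        parent[find(c)] = find(e)
    groups = {}
    for c, e in kept:
        groups.setdefault(find(c), []).append((c, e))
    result = {}
    for sub_edges in groups.values():
        effects = {e for _, e in sub_edges}
        for root in {c for c, _ in sub_edges} - effects:
            result[root] = sorted(sub_edges)
    return result

# Hypothetical causal edge weights for the knowledge graph of fig. 4 (a).
edges = {
    ("A1", "B1"): 0.8, ("A1", "B2"): 0.3, ("A2", "B2"): 0.6,
    ("B1", "C1"): 0.7, ("B1", "C2"): 0.6, ("B1", "C3"): 0.2,
    ("B2", "C3"): 0.9,
}
hit_nodes = {"A1", "B1", "B2", "C1", "C2", "C3"}
subgraphs = causal_subgraphs(edges, hit_nodes, threshold=0.5)
```

With these assumed weights the result matches the text: A1 is the root of the path through B1, C1 and C2, and B2 is the root of the path to C3. Hit nodes left without any qualifying edge are ignored in this sketch.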
In a possible implementation manner, the nodes in the alarm knowledge graph are arranged in layers according to alarm node levels, and each node corresponds to one alarm fault type of one alarm node level;
matching the alarm data to be analyzed with the nodes of the alarm knowledge graph to obtain an alarm causal graph, which comprises the following steps:
and dividing the alarm data to be analyzed into different third alarm classes according to the alarm node level and the alarm fault type, and matching the alarm data to be analyzed with the nodes of the alarm knowledge graph according to the third alarm classes to obtain an alarm causal graph.
Specifically, in this embodiment, the alarm knowledge graph is built in layers, the layers being set according to the levels of the alarm nodes. Alarm nodes are the software and hardware that issue alarms, such as hosts, virtual machines and databases, and each alarm node level has several alarm fault types, so the alarm fault types serve as the basis both for classifying alarm data and for building the alarm knowledge graph. Classification can be performed with a trained alarm data classification model, which is itself not the subject of this application: the model classifies the alarm data to be analyzed into classification results (third alarm classes) corresponding to the nodes of the alarm knowledge graph, and the alarm causal graph is then obtained by matching according to these classification results.
In one possible implementation manner, before acquiring the alarm knowledge graph, the method further includes:
and constructing an alarm knowledge graph according to the alarm node level and the alarm fault type of the historical alarm data, wherein the alarm node level comprises three layers, namely physical machine HOST, virtual machine VM and software SOFTWARE, and each node in the alarm knowledge graph corresponds to one HOST alarm fault type, VM alarm fault type or SOFTWARE alarm fault type.
Specifically, in this embodiment, as shown in fig. 5, the alarm knowledge graph framework and the second root cause positioning analysis may be divided into two parts. The first part is the construction of the cloud network platform knowledge graphs, which makes full use of data such as cloud resource pool hardware resources, cloud platform software resources, call chains and historical alarms to construct the cloud network platform software and hardware knowledge graph and the alarm knowledge graph respectively. The second part is root cause analysis based on the knowledge graphs, that is, alarm root cause positioning analysis is performed based on the constructed cloud network platform knowledge graphs, and root alarms and root cause paths are output. The alarm knowledge graph is constructed from historical alarm data, the historical alarm data and the alarm data to be analyzed are classified by the same method, and the alarm node level is currently divided into three layers, namely physical machine HOST, virtual machine VM and software SOFTWARE.
In one possible implementation manner, an alarm knowledge graph is constructed according to an alarm node level and an alarm fault type of historical alarm data, and the method specifically comprises the following steps:
the historical alarm data are divided into different second alarm classes according to the alarm node level and the alarm fault types, wherein the second alarm classes comprise HOST alarm fault types, VM alarm fault types and SOFTWARE alarm fault types;
taking each second alarm class as a node of an alarm knowledge graph, taking HOST alarm fault type as a causative node of VM alarm fault type, taking VM alarm fault type as causative node of SOFTWARE alarm fault type, and arranging and connecting all the second alarm classes in a layered manner;
and calculating the ratio of the number of simultaneous occurrence of historical alarms of each pair of causal nodes to the total number of occurrence of historical alarms of the causal nodes in each pair of causal nodes, and taking the ratio as the causal edge weight of the connecting line between each pair of causal nodes.
Specifically, in this embodiment, as shown in fig. 5, the construction of the cloud network platform alarm knowledge graph includes: acquiring the historical alarm data of the cloud resource pool and the service platform; alarm data classification (alarm classification, training the classification model): the historical alarm information (historical alarm data) is segmented into words, word vectors are calculated and input into a model for training, and the alarms are classified into the physical machine layer, the virtual machine layer and the software layer; causal node selection (determining nodes according to the classification results): a causal node does not refer to an alarm on a specific physical machine or virtual machine IP, but is an abstract summary of an alarm type; there are currently three layers which, as can be understood in conjunction with fig. 4, are A for alarms at the physical machine level, B for alarms at the virtual machine level and C for alarms at the software level; for example, downtime alarms on any physical machine are classified into the node [physical machine-downtime] (A1) of the causal graph; the final causal nodes are selected after alarm data classification, output of causal edges by the causal algorithm, and manual screening and confirmation, and may also be nodes such as A2 [physical machine-network card traffic alarm], B1 [VM-downtime] and C1 [JBoss-log OutOfMemory]; constructing causal discovery samples: after alarm classification, each alarm record has been assigned an alarm type (physical machine-xx alarm, virtual machine-xx alarm or software-xx alarm); for example, taking each virtual machine alarm record as the center and given an alarm time slice (1 min, 2 min, etc.), the set of related alarm records within the time slice of each virtual machine alarm (related alarms include alarms on the physical machine to which the virtual machine belongs and alarms on other virtual machines belonging to that physical machine) is searched for and used as a causal discovery sample; acquiring causal edges (the causal algorithm finds causal edges): causal edges are calculated and discovered from the constructed causal discovery samples using the traditional PC causal algorithm (the Peter-Clark algorithm, a common algorithm for discovering causal relationships); calculating causal edge weights (according to conditional probability): based on the causal discovery sample data and the causal edges provided by the causal discovery algorithm (each connecting a cause node and an effect node), the ratio of the number of times the effect node alarms given that the cause node alarms to the number of times the cause node alarms is used as the weight of the causal edge; forming the alarm knowledge graph: all causal edges and their weights are finally generated through [alarm classification] -> [constructing causal discovery samples] -> [causal algorithm discovers causal edges] -> [causal edge weight calculation], completing the construction of the unified alarm knowledge graph.
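The conditional probability weighting of causal edges can be sketched as below. The sketch takes the causal edges as given input (the PC algorithm itself is not implemented here), treats each causal discovery sample as a set of alarm types, and uses hypothetical node names.

```python
def causal_edge_weights(samples, causal_edges):
    """Weight of (cause, effect) = count(cause and effect) / count(cause)."""
    weights = {}
    for cause, effect in causal_edges:
        cause_count = sum(1 for s in samples if cause in s)
        both_count = sum(1 for s in samples if cause in s and effect in s)
        weights[(cause, effect)] = both_count / cause_count if cause_count else 0.0
    return weights

# Hypothetical causal discovery samples, one set of alarm types per time slice.
samples = [
    {"HOST-downtime", "VM-downtime"},
    {"HOST-downtime", "VM-downtime"},
    {"HOST-downtime"},
    {"VM-downtime", "JBoss-OutOfMemory"},
]
causal_edges = [("HOST-downtime", "VM-downtime"), ("VM-downtime", "JBoss-OutOfMemory")]
weights = causal_edge_weights(samples, causal_edges)
```

An edge whose cause never occurs in the samples gets weight 0.0 in this sketch, which is a design choice made here to avoid division by zero rather than something stated in the description.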
In a possible implementation manner, the alarm data to be analyzed comprises a plurality of pieces of real-time alarm data of a cloud network platform in a current time window, wherein the pieces of real-time alarm data are acquired by setting the time window and a sliding step length, and the cloud network platform comprises a plurality of systems, and each system comprises HOST, VM and SOFTWARE;
matching the alarm data to be analyzed with the nodes of the alarm knowledge graph according to a third alarm class to obtain an alarm causal graph, which comprises the following steps:
inquiring a software and hardware knowledge graph of the cloud network platform according to alarm nodes of alarm data to be analyzed, dividing all third alarm classes according to the cloud network platform system to which the alarm nodes belong in the software and hardware knowledge graph, and respectively matching the third alarm classes divided into different systems of the cloud network platform with the nodes of the alarm knowledge graph so as to obtain an alarm causal graph of the alarm data to be analyzed in different systems of the cloud network platform.
Specifically, in this embodiment, the alarm root cause analysis of the cloud network platform uses the alarm knowledge graph in combination with the software and hardware knowledge graph, so that multiple knowledge graphs are applied together and the root cause positioning result is improved; the software and hardware knowledge graph of the cloud network platform therefore needs to be constructed. More specifically, as shown in fig. 5, the software and hardware knowledge graph presents, from a global perspective, the internal logic among the applications, software, virtual machines and physical machines of the cloud network platform, the calling relationships among systems, the physical connection relationships of network devices, and the like. Nodes in the graph include systems, DUs (deployment units), groups (host instance groups), software, virtual machines, physical machines, access switches, core switches, aggregation switches, routers, and the like. Relationships include composition, call, logical connection, cluster, bearer, host, connection, and the like. The data sources for constructing the software and hardware knowledge graph mainly include cloud resource pool data (mainly cloud resource pool hardware data), call chain data and physical device network connection relationship data (that is, the device connection relationships need to be acquired). The software and hardware knowledge graph is first initialized from offline data; as the business changes and expands, old systems go offline and new systems come online, so the software and hardware knowledge graph can be updated at fixed times or periodically to follow these changes. The cloud resource pool data format is assumed to be as shown in Table 1 below:
Table 1 cloud resource pool data format examples
From the above data, a relationship graph of HOST (physical machine) -> VM (virtual machine) -> SOFTWARE (software) -> GROUP, and of GROUP (JBoss application server) -> GROUP (Nginx, a high-performance HTTP and reverse proxy web server), that is, a single-system relationship graph, can be constructed. Knowledge construction from call chain and physical device network connection data: the call chain data is used to acquire the calling relationships among DUs (deployment units), the calling relationships among systems, the DU/IP (Internet Protocol) mapping relationships, the logical connection relationships among middleware (that is, acquiring the cloud platform software resource data and the system and DU calling relationships) and the like, so as to construct a service calling relationship graph. Merging knowledge graphs: the single-system relationship graph and the calling relationship graph obtained above are merged using the Python NetworkX package (a Python package for creating, manipulating and studying the structure, dynamics and functions of complex networks) to form the unified cloud network platform software and hardware knowledge graph. The software and hardware knowledge graph and the alarm knowledge graph are stored in a graph database (the graph data is imported into the graph database), so that the graphs can be queried and called in real time to support the subsequent root cause positioning analysis.
In one possible implementation manner, the obtaining the final root cause alarm information based on the first root cause alarm information and the second root cause alarm information specifically includes:
and in response to determining a plurality of second root cause alarms for a certain cloud network platform system, determining alarm data to be analyzed, which simultaneously corresponds to one of the first root cause alarms and the plurality of second root cause alarms, as a final root cause alarm of the certain cloud network platform system, and obtaining a root cause alarm path corresponding to the final root cause alarm.
Specifically, in this embodiment, the alarm root cause analysis based on the cloud network platform knowledge graph includes: real-time alarm monitoring of the cloud resource pool and the service platform, obtaining the alarms within a slice window: the time slice granularity is set (1 min, 5 min, etc.), and the alarm data within the time slice is acquired in real time; calling the classification model to classify alarms: the original alarm data is classified at the three levels HOST, VM and SOFTWARE using a trained classification model (the same model as used in constructing the alarm knowledge graph can be used), for example into: vm_network card traffic high, host_disk usage too high, software_web page access failed, and the like; alarm convergence (the software and hardware knowledge graph nodes are obtained by querying the cloud network platform knowledge graph database, and the alarms are converged onto the related nodes): the software and hardware knowledge graph is queried to converge the alarms by system, in the following format: system 1: {software and hardware knowledge graph node 1: [alarm type 1, alarm type 2, …], software and hardware knowledge graph node 2: [alarm type 1, alarm type 2, …]}; acquiring the alarm causal graph (the alarm causal relationships are obtained by querying the cloud network platform knowledge graph database): based on the alarm convergence result, the connected subgraph among all nodes under each system is queried in the graph database at the system level, giving the final connection relationships among all nodes under a given system, that is, the alarm causal graph; deriving possible root cause paths: suspected paths are calculated from the generated alarm causal graph and the weights and then ranked, giving the root alarm and the root cause path.
Finally, a fusion algorithm such as voting is used to comprehensively evaluate the result of the root cause positioning analysis based on the improved FP-Growth algorithm and the result of the root cause positioning analysis based on the cloud network platform knowledge graph, so as to determine the final root cause. The final root cause may consist of a final root cause alarm and a corresponding final root cause alarm path for each system of the cloud network platform; specifically, real-time alarm data determined to be both a first root cause alarm and a second root cause alarm is taken as the final root cause alarm, and the final root cause alarm path is obtained from the root cause path of the second root cause alarm.
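Alarm convergence by system and the final fusion step can be illustrated with the following sketch. It implements only the simple agreement rule stated above (an alarm is final when it is both a first and a second root cause alarm), not a general voting scheme; the function names, node names and alarm types are hypothetical.

```python
def converge_by_system(alarms, node_to_system):
    """Group (node, alarm type) pairs as {system: {node: [alarm types]}}."""
    converged = {}
    for node, alarm_type in alarms:
        system = node_to_system[node]
        converged.setdefault(system, {}).setdefault(node, []).append(alarm_type)
    return converged

def fuse_root_causes(first_roots, second_roots_by_system, root_paths):
    """Keep, per system, the second root cause alarms that are also first root cause alarms."""
    final = {}
    for system, second_roots in second_roots_by_system.items():
        final[system] = [(r, root_paths[r]) for r in second_roots if r in first_roots]
    return final

node_to_system = {"host-01": "system 1", "vm-07": "system 1"}
alarms = [
    ("host-01", "host_disk usage too high"),
    ("vm-07", "vm_network card traffic high"),
]
converged = converge_by_system(alarms, node_to_system)

first_roots = {"host_disk usage too high"}
second_roots_by_system = {
    "system 1": ["host_disk usage too high", "vm_network card traffic high"],
}
root_paths = {
    "host_disk usage too high": ["host_disk usage too high", "vm_network card traffic high"],
    "vm_network card traffic high": ["vm_network card traffic high"],
}
final = fuse_root_causes(first_roots, second_roots_by_system, root_paths)
```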
Example 2:
as shown in fig. 6 and 7, embodiment 2 of the present invention provides an alarm root cause positioning device, including:
the first analysis module 1 is used for obtaining first root cause alarm information of alarm data to be analyzed based on alarm association rules;
the second analysis module 2 is configured to obtain second root cause alarm information of alarm data to be analyzed based on an alarm knowledge graph, and includes:
an alarm knowledge graph unit 21, configured to obtain an alarm knowledge graph, where two nodes connected in the alarm knowledge graph are cause nodes and effect nodes, and each connecting line is provided with a cause-effect edge weight,
a second matching unit 22 connected with the alarm knowledge graph unit 21 for matching the alarm data to be analyzed with the nodes of the alarm knowledge graph to obtain an alarm causal graph,
the second positioning unit 23 is connected with the second matching unit 22 and is used for obtaining second root cause alarm information according to the alarm cause and effect graph and the cause and effect edge weight;
the third analysis module 3 is connected with the first analysis module 1 and the second analysis module 2, and obtains final root cause alarm information based on the first root cause alarm information and the second root cause alarm information.
In one possible embodiment, the first analysis module 1 further comprises:
And the alarm association rule unit is used for constructing an alarm association rule based on an improved FP-Growth algorithm according to the historical alarm data, wherein the improved FP-Growth algorithm comprises the steps of adding a weight coefficient to improve the support calculation and/or the confidence calculation of the FP-Growth algorithm, and two nodes connected in the alarm association rule are root cause alarm nodes and derivative alarm nodes.
In one possible implementation manner, the alarm association rule unit specifically includes:
a preprocessing subunit, configured to obtain history alarm data subjected to a first preprocessing, where the first preprocessing includes: extracting key fields from the historical alarm data, generating first dictionary values comprising alarm nodes indicating the historical alarm data and the key fields, representing the historical alarm data by using the first dictionary values, classifying the historical alarm data into different first alarm classes according to the key fields, and acquiring the occurrence frequency of each first alarm class in each historical time window by setting time windows and sliding step sizes;
a frequent tree subunit, connected with the first preprocessing subunit, for calculating, according to a first weight coefficient $\omega_1$ and the frequency, a first weighted support $\omega s_1$ of each first alarm class according to formula (1), sorting the first alarm classes whose $\omega s_1$ is greater than a second preset value by frequency within each historical time window, and building a frequent tree FP-tree with the first alarm classes as child nodes in the sorted order of each historical time window, wherein:
$$\omega s_1(X) = \omega_1 \cdot \frac{N(X)}{N} \tag{1}$$
wherein X refers to a certain first alarm class, N(X) is the frequency of the first alarm class X, and N is the frequency of all the first alarm classes;
a frequent item set subunit, connected with the frequent tree subunit, for acquiring the conditional pattern base of each child node in the FP-tree, calculating, according to a second weight coefficient $\omega_2$ and the frequency, a second weighted support $\omega s_2$ of each conditional pattern base path according to formula (2), and taking the conditional pattern base paths whose $\omega s_2$ is greater than a third preset value as frequent item sets, wherein:
$$\omega s_2(X_1 \cup Y_1) = \omega_2 \cdot \frac{N(X_1 \cup Y_1)}{N} \tag{2}$$
wherein $X_1$ refers to the first alarm class of the end node of a certain conditional pattern base path, $Y_1$ refers to a first alarm class of that conditional pattern base path, and $N(X_1 \cup Y_1)$ refers to the sum of the frequencies of the first alarm classes of that conditional pattern base path;
an association rule subunit, connected with the frequent item set subunit, for calculating, according to a third weight coefficient $\omega_3$ and the frequency, a weighted confidence $\omega c$ between any two adjacent first alarm classes $X_2$ and $Y_2$ in each frequent item set according to formula (3), and, in response to $\omega c(X_2 \to Y_2) < \omega c(Y_2 \to X_2)$, arranging $Y_2$ as the root cause of $X_2$, until all the first alarm classes in each frequent item set are arranged, so as to obtain the alarm association rule, wherein:
$$\omega c(X_2 \to Y_2) = \omega_3 \cdot \frac{\sigma(X_2 \cup Y_2)}{\sigma(X_2)} \tag{3}$$
wherein $\sigma(X_2 \cup Y_2)$ refers to the frequency with which $X_2$ and $Y_2$ occur simultaneously, and $\sigma(X_2)$ refers to the frequency with which $X_2$ occurs.
In one possible implementation, the frequent item set subunit includes a conditional pattern base subunit therein for:
obtaining each node located at an end of the FP-tree as a child node, traversing upwards from a child node through all of its upper nodes in the FP-tree, taking the path formed by the successively connected upper and lower nodes as a conditional pattern base path, and marking each conditional pattern base path so as to avoid repetition when subsequent child nodes are traversed.
In one possible embodiment, the first analysis module 1 specifically comprises:
the second preprocessing unit is used for carrying out second preprocessing on the alarm data to be analyzed, the second preprocessing comprising: extracting key fields from the alarm data to be analyzed, generating a second dictionary value comprising the alarm node indicating the alarm data to be analyzed and the key fields, and representing the alarm data to be analyzed by the second dictionary value;
the first matching unit is connected with the second preprocessing unit and is used for acquiring a first dictionary value of the historical alarm data corresponding to each node in the alarm association rule, calculating the cosine similarity between the second dictionary value and each acquired first dictionary value, and, in response to the cosine similarity being greater than a fourth preset value, marking the second dictionary value at the corresponding node in the alarm association rule so as to obtain the alarm association sub-rules hit by the alarm data to be analyzed;
the first positioning unit is connected with the first matching unit and is used for acquiring TopN alarm association sub-rules with the maximum sum of cosine similarity according to the sum of cosine similarity of hit nodes in each alarm association sub-rule, and taking TopN final root cause alarm nodes of the TopN alarm association sub-rules as first root cause alarms.
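The matching and ranking performed by the first matching unit and the first positioning unit might be sketched as below. The text does not say how a dictionary value is turned into a vector, so the bag-of-tokens encoding in `to_vector`, the rule layout (a list of nodes with the root cause node first), and all names are assumptions made purely for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def to_vector(dictionary_value: Dict[str, str]) -> Counter:
    """Assumed encoding: one token per 'field=word' found in the key fields."""
    tokens = []
    for key, value in dictionary_value.items():
        tokens.extend(f"{key}={word}" for word in str(value).lower().split())
    return Counter(tokens)


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_n_rules(incoming: List[Dict[str, str]],
                rules: Dict[str, List[Tuple[str, Dict[str, str]]]],
                threshold: float, n: int) -> List[Tuple[str, str, float]]:
    """Return up to n tuples (rule id, root cause node, sum of hit similarities)."""
    scored = []
    for rule_id, nodes in rules.items():
        total = 0.0
        for _name, first_value in nodes:
            reference = to_vector(first_value)
            best = max((cosine_similarity(to_vector(alarm), reference) for alarm in incoming),
                       default=0.0)
            if best > threshold:             # the node is hit by an incoming alarm
                total += best
        if total > 0.0:
            # Assumption: the first node of each rule is its final root cause node.
            scored.append((rule_id, nodes[0][0], total))
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:n]
```

Summing only the similarities of hit nodes means a rule with more hit nodes ranks above a rule with a single near-perfect hit, which matches the "sum of cosine similarity" criterion in the text.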
In a possible embodiment, the second positioning unit 23 specifically comprises:
the weight judging subunit is used for acquiring, in each alarm causal graph, the alarm causal subgraph whose consecutive causal edge weights are greater than a first preset value;
the root cause result subunit is connected with the weight judging subunit and is used for taking the final cause node of each alarm causal subgraph as a second root cause alarm and taking each alarm causal subgraph as a root cause alarm path.
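One possible reading of the weight judging and root cause result subunits is sketched below: edges whose weight does not exceed the threshold are discarded, and each remaining chain is followed from its topmost cause node down to its last effect node. The edge tuple layout and the function name are hypothetical, and the sketch assumes an acyclic graph, which is consistent with the layered arrangement described in this document.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, float]  # (cause node, effect node, causal edge weight)


def root_cause_paths(edges: List[Edge], threshold: float) -> List[List[str]]:
    """Return every chain of edges above the threshold, root cause node first."""
    strong: Dict[str, List[str]] = {}
    has_cause: Set[str] = set()
    for cause, effect, weight in edges:
        if weight > threshold:
            strong.setdefault(cause, []).append(effect)
            has_cause.add(effect)

    # A final cause node starts a retained edge and is not the effect of another one.
    roots = [node for node in strong if node not in has_cause]

    paths: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        successors = strong.get(node, [])
        if not successors:
            paths.append(path)
            return
        for nxt in successors:
            walk(nxt, path + [nxt])

    for root in roots:
        walk(root, [root])
    return paths
```

The first element of each returned path plays the role of the second root cause alarm, and the path itself plays the role of the root cause alarm path.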
In a possible implementation manner, the nodes in the alarm knowledge graph are arranged in layers according to alarm node levels, and each node corresponds to one alarm fault type of one alarm node level;
the second matching unit 22 specifically includes:
a classifying subunit for classifying the alarm data to be analyzed into different third alarm classes according to the alarm node level and the alarm fault type;
a second matching subunit, connected with the classifying subunit, for matching the alarm data to be analyzed with the nodes of the alarm knowledge graph according to the third alarm classes so as to obtain an alarm causal graph.
In a possible implementation, the alarm knowledge-graph unit 21 is further configured to:
constructing the alarm knowledge graph according to the alarm node level and the alarm fault type of the historical alarm data, wherein the alarm node level comprises three layers, namely physical machine HOST, virtual machine VM and software SOFTWARE, and each node in the alarm knowledge graph corresponds to one HOST alarm fault type, VM alarm fault type or SOFTWARE alarm fault type.
In one possible implementation manner, the alarm knowledge-graph unit 21 specifically includes:
a classification subunit, configured to classify the historical alarm data into different second alarm classes according to the alarm node level and the alarm fault type, where the second alarm classes include HOST alarm fault types, VM alarm fault types, and SOFTWARE alarm fault types;
the node arrangement connection subunit is connected with the classification subunit and is used for taking each second alarm class as one node of the alarm knowledge graph, taking each HOST alarm fault type as a cause node of the VM alarm fault types, taking each VM alarm fault type as a cause node of the SOFTWARE alarm fault types, and arranging and connecting all the second alarm classes in layers;
the causal edge weight acquisition subunit is connected with the node arrangement connection subunit and is used for calculating, for each pair of cause and effect nodes, the ratio of the number of times the historical alarms of the pair occur simultaneously to the total number of times the historical alarms of the cause node of the pair occur, and taking the ratio as the causal edge weight of the connecting line between the pair of cause and effect nodes.
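The ratio used as the causal edge weight can be illustrated with a short sketch. Two points are assumptions rather than statements from the text: occurrences are counted once per historical time window, and the denominator is the occurrence count of the cause node of each pair. The function name and the data layout are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def causal_edge_weights(windows: List[Set[str]],
                        pairs: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Weight of (cause, effect) = windows containing both / windows containing the cause."""
    weights: Dict[Tuple[str, str], float] = {}
    for cause, effect in pairs:
        cause_total = sum(1 for window in windows if cause in window)
        together = sum(1 for window in windows if cause in window and effect in window)
        # A cause that never occurred gets weight 0.0 instead of a division by zero.
        weights[(cause, effect)] = together / cause_total if cause_total else 0.0
    return weights
```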
In a possible implementation manner, the alarm data to be analyzed comprises a plurality of pieces of real-time alarm data of a cloud network platform in a current time window, wherein the pieces of real-time alarm data are acquired by setting the time window and a sliding step length, and the cloud network platform comprises a plurality of systems, and each system comprises HOST, VM and SOFTWARE;
the second matching subunit is specifically configured to:
querying a software and hardware knowledge graph of the cloud network platform according to the alarm nodes of the alarm data to be analyzed, dividing all the third alarm classes according to the system of the cloud network platform to which each alarm node belongs in the software and hardware knowledge graph, and matching the third alarm classes divided into each system of the cloud network platform with the nodes of the alarm knowledge graph respectively, so as to obtain an alarm causal graph of the alarm data to be analyzed for each system of the cloud network platform.
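The windowing of real-time alarms and their division by system might look like the following sketch. The `Alarm` record, the plain dictionary standing in for the software and hardware knowledge graph lookup, and the half-open window boundaries are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, NamedTuple, Set


class Alarm(NamedTuple):
    timestamp: float
    node: str          # alarm node that raised the alarm
    alarm_class: str   # third alarm class, e.g. "HOST:disk"


def sliding_windows(alarms: List[Alarm], window: float, step: float) -> Iterator[List[Alarm]]:
    """Yield the alarms falling in [start, start + window) for each sliding start."""
    if step <= 0:
        raise ValueError("step must be positive")
    if not alarms:
        return
    ordered = sorted(alarms, key=lambda a: a.timestamp)
    start, last = ordered[0].timestamp, ordered[-1].timestamp
    while start <= last:
        yield [a for a in ordered if start <= a.timestamp < start + window]
        start += step


def group_by_system(alarms: List[Alarm], node_to_system: Dict[str, str]) -> Dict[str, Set[str]]:
    """Divide the third alarm classes by the platform system their alarm node belongs to."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for alarm in alarms:
        system = node_to_system.get(alarm.node)  # stand-in for the knowledge graph query
        if system is not None:
            groups[system].add(alarm.alarm_class)
    return dict(groups)
```

Each per-system set of alarm classes would then be matched against the nodes of the alarm knowledge graph separately.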
In a possible embodiment, the third analysis module 3 is specifically configured to:
in response to a plurality of second root cause alarms being determined for a certain system of the cloud network platform, determining the alarm data to be analyzed that corresponds both to a first root cause alarm and to one of the plurality of second root cause alarms as the final root cause alarm of that system, and obtaining the root cause alarm path corresponding to the final root cause alarm.
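The combination step of the third analysis module can be sketched as an intersection. The text only covers the case in which several second root cause alarms exist for one system; keeping a single second root cause alarm unchanged is an assumption of this sketch, as are the function name and the data layout.

```python
from typing import Dict, List, Set


def final_root_causes(first: Set[str],
                      second_paths: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each final root cause alarm of one system to its root cause alarm path.

    first: first root cause alarms obtained from the alarm association rule.
    second_paths: second root cause alarm -> root cause alarm path for this system.
    """
    if len(second_paths) > 1:
        # Several candidates: keep those that the association rule analysis also reported.
        chosen = [alarm for alarm in second_paths if alarm in first]
    else:
        # Assumption: a single candidate is accepted without cross-checking.
        chosen = list(second_paths)
    return {alarm: second_paths[alarm] for alarm in chosen}
```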
Example 3:
Embodiment 3 of the present invention provides a computer readable storage medium having a computer program stored therein, which, when executed by a processor, implements the alarm root cause positioning method described in embodiment 1.
Computer-readable storage media include volatile or nonvolatile, removable or non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, computer program modules or other data. Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other memory technology, CD-ROM (Compact Disc Read-Only Memory), digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
In addition, the present invention may also provide a computer apparatus including a memory and a processor, where the memory stores a computer program and, when the processor runs the computer program stored in the memory, the processor executes the alarm root cause positioning method described in embodiment 1.
The memory is connected with the processor; the memory may be a flash memory, a read-only memory or another memory, and the processor may be a central processing unit or a single-chip microcomputer.
Embodiments 1 to 3 of the present invention provide an alarm root cause positioning method, an alarm root cause positioning device, and a computer readable storage medium. First root cause alarm information of the alarm data to be analyzed, obtained based on the alarm association rule, is combined with second root cause alarm information of the alarm data to be analyzed, obtained based on the alarm knowledge graph, to obtain the final root cause alarm information. When the second root cause alarm information is obtained based on the alarm knowledge graph, it is determined comprehensively from the causal relationship between the cause nodes and the effect nodes in the alarm knowledge graph and from the causal edge weights. Adding alarm knowledge graph analysis on top of association rule mining makes fuller use of the available information, thereby improving the accuracy of root cause positioning and the robustness of the scheme.
It is to be understood that the above embodiments are merely illustrative of the application of the principles of the present invention, but not in limitation thereof. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the invention, and are also considered to be within the scope of the invention.
Claims (13)
1. The alarm root cause positioning method is characterized by comprising the following steps:
acquiring first root cause alarm information of alarm data to be analyzed based on an alarm association rule;
obtaining second root cause alarm information of the alarm data to be analyzed based on an alarm knowledge graph, comprising the following steps:
obtaining the alarm knowledge graph, wherein two connected nodes in the alarm knowledge graph are a cause node and an effect node, and each connecting line has a causal edge weight,
matching the alarm data to be analyzed with the nodes of the alarm knowledge graph to obtain an alarm causal graph,
obtaining the second root cause alarm information according to the alarm causal graph and the causal edge weights;
and obtaining final root cause alarm information based on the first root cause alarm information and the second root cause alarm information.
2. The method of claim 1, wherein before obtaining the first root cause alarm information of the alarm data to be analyzed based on the alarm association rule, the method further comprises:
constructing the alarm association rule based on an improved FP-Growth algorithm according to historical alarm data, wherein the improved FP-Growth algorithm comprises adding a weight coefficient to improve the support calculation and/or the confidence calculation of the FP-Growth algorithm, and two connected nodes in the alarm association rule are a root cause alarm node and a derivative alarm node.
3. The method of claim 2, wherein constructing the alarm association rule based on the improved FP-Growth algorithm according to the historical alarm data specifically comprises:
acquiring historical alarm data subjected to first preprocessing, wherein the first preprocessing comprises: extracting key fields from the historical alarm data, generating first dictionary values comprising alarm nodes indicating the historical alarm data and the key fields, representing the historical alarm data by using the first dictionary values, classifying the historical alarm data into different first alarm classes according to the key fields, and acquiring the occurrence frequency of each first alarm class in each historical time window by setting time windows and sliding step sizes;
calculating a first weighted support ωs₁ of each first alarm class according to formula (1) from the first weight coefficient ω₁ and the frequency, sorting the first alarm classes whose ωs₁ is greater than a second preset value by frequency within each historical time window, and establishing a frequent tree FP-tree with the first alarm classes as child nodes according to the sorted order of each historical time window, wherein:
wherein X refers to a certain first alarm class, N(X) is the frequency of the first alarm class X, and N is the frequency of all the first alarm classes;
acquiring the conditional pattern base of each child node in the FP-tree, calculating a second weighted support ωs₂ of each conditional pattern base path according to formula (2) from the second weight coefficient ω₂ and the frequency, and taking each conditional pattern base path whose ωs₂ is greater than a third preset value as a frequent item set, wherein:
wherein X₁ refers to the first alarm class of the end node of a certain conditional pattern base path, Y₁ refers to a first alarm class, and N(X₁∪Y₁) refers to the sum of the frequencies of the first alarm classes of that conditional pattern base path;
calculating, according to formula (3) from the third weight coefficient ω₃ and the frequency, a weighted confidence ωc between any two adjacent first alarm classes X₂ and Y₂ in each frequent item set, and, in response to ωc(X₂→Y₂) < ωc(Y₂→X₂), arranging Y₂ as the root cause of X₂, until all the first alarm classes in each frequent item set have been arranged, so as to obtain the alarm association rule, wherein:
wherein σ(X₂∪Y₂) refers to the frequency with which X₂ and Y₂ occur simultaneously, and σ(X₂) refers to the frequency with which X₂ occurs.
4. The method of claim 3, wherein obtaining the conditional pattern base of each child node in the FP-tree comprises:
obtaining each node located at an end of the FP-tree as a child node; for each child node, traversing upwards through all of its upper nodes in the FP-tree and taking the path whose nodes are successively connected from upper to lower as a conditional pattern base path; and marking each conditional pattern base path so as to avoid repetition when subsequently traversing the child nodes.
5. The method of claim 3, wherein obtaining the first root cause alarm information of the alarm data to be analyzed based on the alarm association rule comprises:
performing second preprocessing on the alarm data to be analyzed, wherein the second preprocessing comprises the following steps: extracting key fields from the alarm data to be analyzed, generating a second dictionary value comprising alarm nodes indicating the alarm data to be analyzed and the key fields, and representing the alarm data to be analyzed by using the second dictionary value;
acquiring a first dictionary value of historical alarm data corresponding to each node in the alarm association rule, calculating cosine similarity between a second dictionary value and each acquired first dictionary value, and marking the second dictionary value at the corresponding node in the alarm association rule in response to the cosine similarity being larger than a fourth preset value so as to acquire an alarm association sub-rule hit by alarm data to be analyzed;
According to the sum of cosine similarity of hit nodes in each alarm association sub-rule, obtaining TopN alarm association sub-rules with the maximum sum of cosine similarity, and taking TopN final root cause alarm nodes of the TopN alarm association sub-rules as first root cause alarms.
6. The method according to any one of claims 1-5, wherein obtaining the second root cause alarm information according to the alarm causal graph and the causal edge weights comprises:
acquiring, in each alarm causal graph, the alarm causal subgraph whose consecutive causal edge weights are greater than a first preset value;
and taking the final cause node of each alarm causal subgraph as a second root cause alarm and taking each alarm causal subgraph as a root cause alarm path.
7. The method of claim 6, wherein the nodes in the alarm knowledge graph are arranged in layers according to an alarm node hierarchy, each node corresponding to an alarm fault type of an alarm node hierarchy;
matching the alarm data to be analyzed with the nodes of the alarm knowledge graph to obtain an alarm causal graph, which comprises the following steps:
and dividing the alarm data to be analyzed into different third alarm classes according to the alarm node level and the alarm fault type, and matching the alarm data to be analyzed with the nodes of the alarm knowledge graph according to the third alarm classes to obtain an alarm causal graph.
8. The method of claim 7, wherein before obtaining the alarm knowledge graph, the method further comprises:
constructing the alarm knowledge graph according to the alarm node level and the alarm fault type of the historical alarm data, wherein the alarm node level comprises three layers, namely physical machine HOST, virtual machine VM and software SOFTWARE, and each node in the alarm knowledge graph corresponds to one HOST alarm fault type, VM alarm fault type or SOFTWARE alarm fault type.
9. The method according to claim 8, wherein constructing the alarm knowledge graph according to the alarm node level and the alarm fault type of the historical alarm data specifically comprises:
the historical alarm data are divided into different second alarm classes according to the alarm node level and the alarm fault types, wherein the second alarm classes comprise HOST alarm fault types, VM alarm fault types and SOFTWARE alarm fault types;
taking each second alarm class as a node of an alarm knowledge graph, taking HOST alarm fault type as a causative node of VM alarm fault type, taking VM alarm fault type as causative node of SOFTWARE alarm fault type, and arranging and connecting all the second alarm classes in a layered manner;
and calculating, for each pair of cause and effect nodes, the ratio of the number of times the historical alarms of the pair occur simultaneously to the total number of times the historical alarms of the cause node of the pair occur, and taking the ratio as the causal edge weight of the connecting line between the pair of cause and effect nodes.
10. The method of claim 8, wherein the alarm data to be analyzed comprises a plurality of pieces of real-time alarm data of a cloud network platform within a current time window, obtained by setting a time window and a sliding step, the cloud network platform comprising a plurality of systems, each system comprising three layers of HOST, VM and SOFTWARE;
matching the alarm data to be analyzed with the nodes of the alarm knowledge graph according to the third alarm classes to obtain an alarm causal graph comprises the following steps:
querying a software and hardware knowledge graph of the cloud network platform according to the alarm nodes of the alarm data to be analyzed, dividing all the third alarm classes according to the system of the cloud network platform to which each alarm node belongs in the software and hardware knowledge graph, and matching the third alarm classes divided into each system of the cloud network platform with the nodes of the alarm knowledge graph respectively, so as to obtain an alarm causal graph of the alarm data to be analyzed for each system of the cloud network platform.
11. The method according to claim 10, wherein obtaining the final root cause alarm information based on the first root cause alarm information and the second root cause alarm information comprises:
in response to a plurality of second root cause alarms being determined for a certain system of the cloud network platform, determining the alarm data to be analyzed that corresponds both to a first root cause alarm and to one of the plurality of second root cause alarms as the final root cause alarm of that system, and obtaining the root cause alarm path corresponding to the final root cause alarm.
12. An alarm root cause positioning device, comprising:
the first analysis module is used for obtaining first root cause alarm information of alarm data to be analyzed based on an alarm association rule;
the second analysis module is used for obtaining second root cause alarm information of the alarm data to be analyzed based on an alarm knowledge graph, and includes:
an alarm knowledge graph unit for acquiring the alarm knowledge graph, wherein two connected nodes in the alarm knowledge graph are a cause node and an effect node, and each connecting line has a causal edge weight,
a second matching unit connected with the alarm knowledge graph unit for matching the alarm data to be analyzed with the nodes of the alarm knowledge graph to obtain an alarm causal graph,
a second positioning unit connected with the second matching unit for obtaining the second root cause alarm information according to the alarm causal graph and the causal edge weights;
the third analysis module is connected with the first analysis module and the second analysis module and is used for obtaining final root cause alarm information based on the first root cause alarm information and the second root cause alarm information.
13. A computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, which, when executed by a processor, implements the alarm root cause positioning method as claimed in any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311286937.5A CN117221087A (en) | 2023-10-07 | 2023-10-07 | Alarm root cause positioning method, device and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117221087A (en) | 2023-12-12 |
Family
ID=89035277
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311286937.5A Pending CN117221087A (en) | 2023-10-07 | 2023-10-07 | Alarm root cause positioning method, device and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117221087A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117806916A (en) * | 2024-02-29 | 2024-04-02 | 中国人民解放军国防科技大学 | Multi-unit server lightweight alarm correlation mining and converging method and system |
CN117807589A (en) * | 2023-12-26 | 2024-04-02 | 电子科技大学 | Correlation analysis method based on intrusion detection of industrial control system |
CN118227847A (en) * | 2024-05-22 | 2024-06-21 | 北京启明星辰信息安全技术有限公司 | Filter tree generation method, alarm log processing method, storage medium and terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||