WO2024174135A1

WO2024174135A1 - Method for determining abnormal mode of log slice, apparatus, device and storage medium

Info

Publication number: WO2024174135A1
Application number: PCT/CN2023/077709
Authority: WO
Inventors: 刁海洋; 陈蓓华
Original assignee: 西门子股份公司; 西门子（中国）有限公司
Priority date: 2023-02-22
Filing date: 2023-02-22
Publication date: 2024-08-29

Abstract

Disclosed in the embodiments of the present invention are a method for determining an abnormal mode of a log slice, an apparatus, a device and a storage medium. The method comprises: on the basis of character strings contained in logs in a log set, determining the types of the logs; determining an abnormal log slice from amongst a plurality of log slices, the plurality of log slices being obtained by dividing the log set on the basis of a time sliding window; on the basis of the type of a log contained in the abnormal log slice, determining a log slice abnormal mode; comparing the type of a log contained in a log slice to be tested with the log slice abnormal mode; and, on the basis of a comparison result, determining an abnormal mode of said log slice. Automatically determining the abnormal mode on the basis of log character strings overcomes the defects of low accuracy and large workload in manual log analysis. Performing clustering on the basis of the similarity between log character strings enables clustering to be completed without manual labeling, thus improving convenience. Sliding window-based log slice partitioning has a broader application range.

Description

Method, apparatus, device and storage medium for determining an abnormal pattern of a log slice

Technical Field

The present invention relates to the field of log management technology, and in particular to a method, device, equipment and storage medium for determining an abnormal mode of a log piece.

Background Art

Network devices, systems, and service programs usually generate event records called logs when they are in operation. Logs can record descriptions of related operations such as date, time, user, and action. System development and operation and maintenance personnel can detect abnormal behavior and errors in the system based on logs.

As computing becomes increasingly complex and applications become more diverse, the types and volume of logs are increasing. Traditional manual log analysis methods struggle to meet the requirements of daily analysis.

Summary of the invention

The embodiments of the present invention provide a method, an apparatus, a device and a storage medium for determining an abnormal mode of a log piece.

A method for determining an abnormal pattern of log slices, comprising:

determining the type of each log based on the character string of the log contained in the log set;

determining an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by partitioning the log set based on a sliding time window;

determining a log slice abnormal pattern based on the types of logs included in the abnormal log slice;

comparing the types of logs contained in a log slice to be tested with the log slice abnormal pattern; and

determining the abnormal pattern of the log slice to be tested based on the comparison result.

Therefore, the embodiments of the present invention automatically determine the abnormal pattern based on the log string in the log set, overcoming the shortcomings of low accuracy and heavy workload in manual log analysis. Moreover, compared with the session-based partitioning, the log slice partitioning based on the sliding window has a wider range of applications.

In one embodiment, determining the type of the log based on the character string of the log contained in the log set includes:

extracting log content characterized as unstructured information from the log;

Divide the log content into word sequences using spaces as delimiters;

Grouping the logs based on the length of the word sequence;

For each group:

Based on the characters at the predetermined positions of each word in each word sequence contained in the group, determine the log string for each word sequence;

Clustering is performed based on the similarity between any two log strings to obtain the types of logs included in the group;

The types of all grouped logs are combined into a log type set.

It can be seen that the embodiment of the present invention also performs clustering based on the similarity between log character strings, and clustering can be completed without manual labeling, thereby improving convenience.

In one embodiment, the predetermined position is the first character or the last character of the word.

Therefore, the difficulty of clustering is reduced by extracting characters at predetermined positions.

In one embodiment, the any two log strings include a first log string and a second log string, and the method further includes: determining the similarity between characters at the same character position in the first log string and the second log string; and determining the similarity between the first log string and the second log string based on the similarity between the characters at the same character position.

It can be seen that by comparing the log strings character by character, the classification accuracy is guaranteed and the clustering difficulty is reduced.

In one embodiment, determining an abnormal log slice from a plurality of log slices comprises:

determining, based on principal component analysis, log sheets with an unbalanced log type ratio from the plurality of log sheets;

The log slice with an unbalanced log type ratio is determined as the abnormal log slice.

It can be seen that the abnormal log slices can be easily determined through principal component analysis.

In one embodiment, determining the log piece abnormal mode based on the type of logs included in the abnormal log piece includes:

Determine the number N of abnormal patterns in the log slice;

When the number m of the abnormal log slices is less than or equal to N, based on the type sequence of logs contained in each abnormal log slice, a corresponding log slice abnormal pattern is determined to obtain m log slice abnormal patterns;

When the number m of the abnormal log pieces is greater than N, clustering is performed based on the similarity between the type sequences of the logs contained in the abnormal log pieces to obtain N log piece abnormal patterns.

Therefore, the convenience of implementation is improved by considering the preset number to distinctively determine the log sheet abnormality pattern.

In one embodiment, clustering based on the similarity between the type sequences of logs contained in the abnormal log slices to obtain N log slice abnormal patterns includes:

The type sequence of the logs contained in the abnormal log piece is taken as a set element to form a log piece abnormal pattern set;

In the log slice anomaly pattern set, any two set elements with the greatest similarity are set merged to update the log slice anomaly pattern set until the number of set elements of the log slice anomaly pattern set is equal to N.

It can be seen that by merging the sets to update the log slice exception pattern set, the type sequence of each exception log slice is retained and the number of exception patterns is reduced.

In one embodiment, the method further includes:

Determine the longest common subsequence length of a first set element of a first abnormal log piece and a second set element of a second abnormal log piece, wherein the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces;

Based on the number of logs in the first abnormal log piece, the number of logs in the second abnormal log piece, and the longest common subsequence length, the similarity between the first set element of the first abnormal log piece and the second set element of the second abnormal log piece is determined.

It can be seen that based on the longest common subsequence and hierarchical clustering, the abnormal patterns of log slices can be accurately mined.

In one embodiment, comparing the type of logs contained in the log piece to be tested with the log piece abnormal pattern includes: determining the similarity between the log type sequence in the log piece to be tested and each log piece abnormal pattern in the log piece abnormal pattern set;

Determining the abnormal pattern of the log piece to be tested based on the comparison result includes: determining the log piece abnormal pattern in the log piece abnormal pattern set that has the highest similarity with the log type sequence in the log piece to be tested and is greater than a predetermined threshold value as the abnormal pattern of the log piece to be tested.

Therefore, based on the similarity comparison, the abnormal pattern analysis of the log slice to be tested is achieved.

In one embodiment, when the log slice anomaly pattern in the log slice anomaly pattern set is a set element that has been merged, the method further includes:

Determine a sub-similarity between a log type sequence included in the log piece to be tested and each log type sequence participating in the set merging in the log piece anomaly pattern;

The weighted sum of the sub-similarity values is determined as the similarity between the log type sequence in the log piece to be tested and the abnormal pattern of the log piece.

It can be seen that by weighted summing the similarities of each log type sequence participating in the set merging, each log type sequence participating in the set merging is comprehensively considered, thereby improving the accuracy of similarity calculation.
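The comparison and weighted-sum steps above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: `stand_in_similarity` uses Python's `difflib.SequenceMatcher` purely as a placeholder for the similarity between type sequences, the equal weights and the threshold value 0.6 are assumptions, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a, b):
    """Placeholder similarity between two type sequences (assumed, not the patent's formula)."""
    return SequenceMatcher(None, a, b).ratio()


def pattern_similarity(test_sequence, pattern, weights=None):
    """Weighted sum of sub-similarities over the type sequences merged into `pattern`."""
    if weights is None:
        # Assumption: every merged type sequence contributes equally.
        weights = [1 / len(pattern)] * len(pattern)
    return sum(
        w * stand_in_similarity(test_sequence, member)
        for w, member in zip(weights, pattern)
    )


def match_pattern(test_sequence, patterns, threshold=0.6):
    """Index of the most similar pattern if its similarity exceeds `threshold`, else None."""
    if not patterns:
        return None
    scores = [pattern_similarity(test_sequence, p) for p in patterns]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > threshold else None


# Each pattern is a list of type sequences; the first one results from a set merge.
patterns = [
    [["a", "b", "c"], ["a", "b", "d"]],
    [["x", "y", "z"]],
]
matched = match_pattern(["a", "b", "c"], patterns)
unmatched = match_pattern(["q", "r"], patterns)
```

A pattern that was never merged is simply a list with one type sequence, so the same function covers both cases.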

An apparatus for determining an abnormal pattern of a log sheet, comprising:

A first determination module is configured to determine the type of the log based on a character string of the log included in the log set;

A second determination module is configured to determine an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by dividing the log set based on a sliding time window;

A third determination module is configured to determine a log slice abnormal mode based on the type of logs included in the abnormal log slice;

a comparison module configured to compare the type of log included in the log sheet to be tested with the log sheet abnormality pattern;

The fourth determination module is configured to determine the abnormal mode of the log sheet to be tested based on the comparison result.

Therefore, the embodiments of the present invention automatically determine the abnormal pattern based on the log string, overcoming the shortcomings of low accuracy and heavy workload in manual log analysis. Moreover, compared with the session-based partitioning, the log slice partitioning based on the sliding window has a wider application range.

In one embodiment, the first determining module is configured to:

extracting log content characterized as unstructured information from the log;

Divide the log content into word sequences using spaces as delimiters;

Grouping the logs based on the length of the word sequence;

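For each group:

Based on the characters at the predetermined positions of each word in each word sequence contained in the group, determine the log string for each word sequence;

Clustering is performed based on the similarity between any two log strings to obtain the types of logs included in the group;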

The types of all grouped logs are combined into a log type set.

In one embodiment, the any two log strings include a first log string and a second log string:

The first determination module is configured to: determine the similarity between characters at the same character position in the first log string and the second log string; and determine the similarity between the first log string and the second log string based on the similarity between the characters at the same character position.

In one embodiment, the third determination module is configured to: determine the number N of log slice abnormal patterns; when the number m of abnormal log slices is less than or equal to N, determine, based on the type sequence of logs contained in each abnormal log slice, the corresponding log slice abnormal pattern, to obtain m log slice abnormal patterns; when the number m of abnormal log slices is greater than N, perform clustering based on the similarity between the type sequences of logs contained in the abnormal log slices, to obtain N log slice abnormal patterns.

In one embodiment, the third determination module is configured to: use the type sequence of logs contained in the abnormal log piece as set elements to combine into a log piece abnormal pattern set; in the log piece abnormal pattern set, any two set elements with the greatest similarity are merged to update the log piece abnormal pattern set until the number of set elements of the log piece abnormal pattern set is equal to N.

In one embodiment, the third determination module is configured to: determine the longest common subsequence length of the first set elements of the first abnormal log piece and the second set elements of the second abnormal log piece, wherein the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces; based on the number of logs in the first abnormal log piece, the number of logs in the second abnormal log piece and the longest common subsequence length, determine the similarity between the first set elements of the first abnormal log piece and the second set elements of the second abnormal log piece.

In one embodiment, the comparison module is configured to: determine the similarity between the log type sequence in the log slice to be tested and each log slice abnormal pattern in the log slice abnormal pattern set;

The fourth determination module is configured to determine, in the log slice abnormal pattern set, a log slice abnormal pattern having the highest similarity with the log type sequence in the log slice to be tested and greater than a predetermined threshold value as the abnormal pattern of the log slice to be tested.

In one embodiment, the comparison module is configured to: When the log piece anomaly pattern in the set is a set element after set merging, determine the sub-similarity between the log type sequence contained in the log piece to be tested and each log type sequence participating in the set merging in the log piece anomaly pattern; and determine the weighted sum value of each sub-similarity as the similarity between the log type sequence in the log piece to be tested and the log piece anomaly pattern.

An electronic device, comprising:

A processor;

A memory, configured to store executable instructions of the processor;

The processor is configured to read the executable instructions from the memory, and execute the executable instructions to implement the method for determining an abnormal mode of a log slice as described in any one of the above items.

A computer-readable storage medium stores computer instructions thereon, wherein when the computer instructions are executed by a processor, the method for determining an abnormal mode of a log sheet as described in any one of the above items is implemented.

A computer program product comprises a computer program, wherein when the computer program is executed by a processor, the method for determining an abnormal pattern of a log slice as described in any one of the above items is implemented.

BRIEF DESCRIPTION OF THE DRAWINGS

The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that those skilled in the art can better understand the above and other features and advantages of the present invention. In the accompanying drawings:

FIG. 1 is a flow chart of a method for determining an abnormal pattern of a log sheet according to an embodiment of the present invention.

FIG. 2 is an exemplary flow chart of a log grouping process according to an embodiment of the present invention.

FIG. 3 is an exemplary flow chart of a log slice preprocessing process according to an embodiment of the present invention.

FIG. 4 is an exemplary flow chart of a log slice analysis process according to an embodiment of the present invention.

FIG. 5 is an exemplary flow chart of determining a log slice anomaly pattern according to an embodiment of the present invention.

FIG. 6 is a structural diagram of an apparatus for determining an abnormal pattern of a log sheet according to an embodiment of the present invention.

FIG. 7 is a structural diagram of an electronic device according to an embodiment of the present invention.

The reference numerals are as follows:

DETAILED DESCRIPTION

In order to make the purpose, technical solutions and advantages of the present invention more clear, the present invention is further described in detail with reference to the following embodiments.

For the sake of simplicity and intuitiveness, the following describes several representative embodiments to illustrate the solution of the present invention. A large number of details in the embodiments are only used to help understand the solution of the present invention. It is obvious that the technical solution of the present invention may not be limited to these details when implemented. In order to avoid unnecessarily obscuring the solution of the present invention, some implementation methods are not described in detail, but only a framework is given. In the following, "including" means "including but not limited to", and "according to..." means "at least according to..., but not limited to only according to...". Due to the language habits of Chinese, when the number of a component is not specifically specified below, it means that the component can be one or more, or can be understood as at least one.

The embodiment of the present invention proposes a method for determining abnormal patterns of log pieces based on clustering and principal component analysis, which overcomes the shortcomings of low accuracy and high workload in manual log analysis. Moreover, the embodiment of the present invention does not need to label the log data, realizes unsupervised log analysis, and reduces the workload of labeling. In addition, the log piece division based on the sliding window in the embodiment of the present invention has a wider range of applications compared with the log piece division based on the session. In addition, the embodiment of the present invention can mine deep-level log piece abnormal patterns through the longest common subsequence and hierarchical clustering, thereby improving the recognition efficiency of abnormal patterns.

FIG. 1 is a flow chart of a method for determining an abnormal pattern of a log slice according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

Step 101: Determine the type of the log based on the character string of the log contained in the log set.

A log set may contain a large number of logs. For example, a log set may be obtained from historical log data: all logs within a predetermined historical time period (e.g., the past week, month, or year) may be combined into a log set. A log $m_i$ can generally be represented as $m_i = \{m_{ih}, m_{ic}\}$, where $m_{ih}$ is the log header, generally containing structured information such as IP address and time, and $m_{ic}$ is the specific content of the log, generally unstructured information. The content is usually the output of a statement similar to printf("user %s connected", user), such as "user bob connected" or "user bob disconnected".

In one embodiment, step 101 specifically includes: extracting log content characterized as unstructured information from the log; dividing the log content into word sequences using spaces as delimiters; grouping the logs based on the length of the word sequences; for each group: determining the log string of each word sequence based on the characters at the predetermined position of each word in each word sequence contained in the group, and clustering based on the similarity between any two log strings to obtain the types of logs contained in the group; and combining the types of logs in all groups into a log type set. Preferably, the predetermined position is the first character or the last character of the word, or the character at any specified position.

For example, $m_{ic}$ is divided, using spaces as delimiters, into the word sequence $T_i = [t_{i1}, t_{i2}, \ldots, t_{iL_i}]$, where $L_i$ represents the length of the sequence, i.e., the number of words, and $t_{ij}$ represents the j-th word of the i-th log. For example, when $m_{ic}$ is "user bob connected", $T_i$ = [user, bob, connected] and $L_i$ is equal to 3 (i.e., there are 3 words). All logs whose word sequences have the same length L are placed in the same group $G_L = \{T_i \mid L_i = L\}$, thereby roughly classifying all logs by log length.

FIG. 2 is an exemplary flow chart of a log grouping process according to an embodiment of the present invention. As shown in FIG. 2, the process includes:

Step 201: Get a log set M, where $M = \{m_1, m_2, \ldots, m_{|M|}\}$. The i-th log $m_i$ in M is represented by $m_i = \{m_{ih}, m_{ic}\}$, where $m_{ih}$ is the log header, which generally contains structured information such as IP address and time, and $m_{ic}$ is the specific content of the log, which is generally unstructured information. The value range of i is [1, |M|].

Step 202: For each log in the log set M, that is, the i-th log $m_i$: divide the log content into a word sequence $T_i = [t_{i1}, t_{i2}, \ldots, t_{iL_i}]$ using spaces as delimiters, where $L_i$ represents the length of the sequence, that is, the number of words, and $t_{ij}$ represents the j-th word in the i-th log.

Step 203: Place word sequences with the same length L in the same group $G_L$, that is, $G_L = \{T_i \mid L_i = L\}$.

In addition, the concept of a log string is introduced in the embodiment of the present invention. The log string is a string composed of the characters at a specified position (such as the first character) of each word in $T_i$, and is recorded as $T'_i = t'_{i1} t'_{i2} \cdots t'_{iL_i}$, where $t'_{ij}$ represents the first letter of $t_{ij}$. For example, when the specified position is the first character, the log string of "user bob connected" is "ubc". For each group, clustering can be performed based on the similarity between the log strings of the word sequences contained in the group to obtain the types of logs contained in the group. Then, the types of logs contained in all groups are combined into a log type set.

For example, assume that the log set contains log 1, log 2, log 3, log 4, and log 5. The content of log 1 is "user bob connected"; the content of log 2 is "user tom connected"; the content of log 3 is "user bob disconnected"; the content of log 4 is "user tom disconnected"; the content of log 5 is "client disconnected the license server". The word sequence of log 1 is "user bob connected" and the sequence length is 3. The word sequence of log 2 is "user tom connected" and the sequence length is 3. The word sequence of log 3 is "user bob disconnected" and the sequence length is 3. The word sequence of log 4 is "user tom disconnected" and the sequence length is 3. The word sequence of log 5 is "client disconnected the license server" and the sequence length is 5. The sequence lengths of log 1, log 2, log 3 and log 4 are the same (all 3), so these four logs are placed in the group corresponding to sequence length 3. Log 5 is placed in another group corresponding to sequence length 5. The log string of log 1 is "ubc"; the log string of log 2 is "utc"; the log string of log 3 is "ubd"; the log string of log 4 is "utd"; and the log string of log 5 is "cdtls". For each group, clustering is performed based on the similarity between any two log strings to obtain the types of logs contained in the group. For example, in the group with sequence length 3, 2 log types (called log type a and log type b) are clustered. In the group with sequence length 5, since only the log string of log 5 is included, 1 log type (called log type c) is clustered. The log type set determined based on the log set then includes log type a, log type b and log type c.
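The grouping and log string construction in this example can be reproduced with a short sketch. The function names are hypothetical, and `str.split()` splits on any run of whitespace, a slight generalization of the space delimiter described above.

```python
from collections import defaultdict


def word_sequence(content):
    """Split the unstructured log content into its word sequence T_i."""
    return content.split()


def log_string(words, position=0):
    """Join the character at `position` of every word (0 = first, -1 = last).

    Raises IndexError if a word is too short for the chosen position.
    """
    return "".join(word[position] for word in words)


def group_by_length(contents):
    """Group word sequences by their length L, giving the groups G_L."""
    groups = defaultdict(list)
    for content in contents:
        words = word_sequence(content)
        groups[len(words)].append(words)
    return dict(groups)


logs = [
    "user bob connected",
    "user tom connected",
    "user bob disconnected",
    "user tom disconnected",
    "client disconnected the license server",
]
groups = group_by_length(logs)
strings = {
    length: [log_string(words) for words in sequences]
    for length, sequences in groups.items()
}
print(strings)  # {3: ['ubc', 'utc', 'ubd', 'utd'], 5: ['cdtls']}
```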

The process of clustering based on the similarity between any two log strings in each group involves a process of determining the similarity between any two log strings.

For example, any two log strings include a first log string and a second log string. The process of determining the similarity between them includes: determining the similarity between characters at the same character position in the first log string and the second log string; and determining the similarity between the first log string and the second log string based on the similarity between the characters at the same character position. The first log string and the second log string belong to the same group, so they contain the same number of characters, and that number is usually greater than one. There are therefore multiple character pairs, each formed by the two characters at the same character position in the two strings, and a similarity for each pair. The similarity between the first log string and the second log string can be determined as the weighted sum of the similarities of all character pairs (the weights of the individual character similarities can be equal or unequal).

In a specific implementation of determining the similarity between log strings, any two log strings include a first log string $T'_i$ and a second log string $T'_j$, and the process of determining the similarity between them includes determining the similarity $\mathrm{sim}(T'_i, T'_j) = \frac{1}{L}\sum_{k=1}^{L} F(t'_{ik}, t'_{jk})$, where $t'_{ik}$ is the k-th character in $T'_i$; $t'_{jk}$ is the k-th character in $T'_j$; L is the number of characters in $T'_i$ and $T'_j$; and $F(t'_{ik}, t'_{jk})$ is 1 when $t'_{ik}$ is equal to $t'_{jk}$ and 0 when $t'_{ik}$ is not equal to $t'_{jk}$.

Specifically, for each group $G_L$ corresponding to a given word sequence length, the similarity between any two log strings $T'_i$ and $T'_j$ is defined as $\mathrm{sim}(T'_i, T'_j) = \frac{1}{L}\sum_{k=1}^{L} F(t'_{ik}, t'_{jk})$. When $\mathrm{sim}(T'_i, T'_j) \ge \lambda$, the two log strings are considered similar, where $\lambda$ is the set similarity threshold, for example, $\lambda = 0.4$. A log string set, denoted $S_L$, can be created in each $G_L$ to store all types of logs in the group (each type can be represented by a log string). For each log $T_i$ in each $G_L$ in turn, generate its log string $T'_i$ and calculate its similarity with all log strings in the $S_L$ corresponding to that $G_L$. If there is a log string s in $S_L$ that is similar to $T'_i$, add the log $T_i$ to the log set $G_s$ of this type; if not, create a new log set $G_s$ and add $T'_i$ to $S_L$. Then, the sets $S_L$ of all groups are combined into a log type set S, denoted as $S = \{s_1, s_2, \ldots, s_{|S|}\}$, where $s_1, s_2, \ldots, s_{|S|}$ are each implemented as a log string representing the respective log type.
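The per-group procedure can be sketched as below, assuming equal weights for all character positions. Taking the first similar representative when several qualify is an assumption of this sketch, since the text only requires that a similar log string exists in the type set. The function names are hypothetical.

```python
def similarity(a, b):
    """Fraction of character positions at which two equal-length log strings agree."""
    if len(a) != len(b):
        raise ValueError("log strings in one group must have equal length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_group(log_strings, threshold=0.4):
    """Build the type set S_L for one group.

    Returns a dict mapping each representative log string to the indices
    of the logs assigned to that type.
    """
    types = {}
    for index, candidate in enumerate(log_strings):
        for representative in types:
            if similarity(candidate, representative) >= threshold:
                # Assumption: join the first similar type found.
                types[representative].append(index)
                break
        else:
            # No similar type exists yet, so the candidate opens a new one.
            types[candidate] = [index]
    return types


group_3 = ["ubc", "utc", "ubd", "utd"]
types_3 = cluster_group(group_3, threshold=0.4)
print(types_3)
```

The result depends on both the threshold and the order in which logs are processed, so the types obtained this way need not coincide with the two types named in the example above.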

The above exemplary description is a typical example of determining the similarity between log character strings. Those skilled in the art may appreciate that such description is merely exemplary and is not intended to limit the protection scope of the embodiments of the present invention.

Step 102: determining an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by partitioning a log set based on a sliding time window.

Here, the log slice preprocessing process can be performed in advance. In the log slice preprocessing process, the log set is divided based on the sliding time window to obtain multiple log slices. For example, the log set M is divided, using a sliding window with a length of 9 hours and a step of 1 hour, into m log slices $W_1, W_2, \ldots, W_m$. The logs contained in the individual log slices may be the same or different.

FIG. 3 is an exemplary flow chart of a log slice preprocessing process according to an embodiment of the present invention. As shown in FIG. 3, the process includes:

Step 301: Divide the log set M into m log slices $W_1, W_2, \ldots, W_m$ using a sliding window of predetermined length and step size.

Step 302: For each log slice $W_i$, use $\xi_{ij}$ to represent the number of occurrences of logs of type $s_j$ (each type in S) in the log slice.

Step 303: Let $w_i = (\xi_{i1}, \xi_{i2}, \ldots, \xi_{i|S|})$ be the vector of the log slice. Let n = |S|; then all $w_i$ form a matrix A with m rows and n columns.
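Steps 301 to 303 can be sketched as follows. The sketch assumes each log is already reduced to a (timestamp, type) pair, that windows are half-open intervals starting at the earliest timestamp, and that windows are generated until the window start passes the latest timestamp; none of these conventions is fixed by the text, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def slice_logs(logs, length, step):
    """Cut (timestamp, log_type) pairs into overlapping windows [start, start + length)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    if not logs:
        return []
    ordered = sorted(logs, key=lambda entry: entry[0])
    start, last = ordered[0][0], ordered[-1][0]
    slices = []
    while start <= last:
        end = start + length
        slices.append([kind for stamp, kind in ordered if start <= stamp < end])
        start += step
    return slices


def count_matrix(slices, types):
    """Row i is w_i: the number of occurrences of each type s_j in slice W_i."""
    return [[window.count(kind) for kind in types] for window in slices]


base = datetime(2023, 2, 22)
logs = [
    (base, "a"),
    (base + timedelta(hours=1), "b"),
    (base + timedelta(hours=1, minutes=30), "a"),
    (base + timedelta(hours=3), "c"),
]
slices = slice_logs(logs, length=timedelta(hours=2), step=timedelta(hours=1))
A = count_matrix(slices, ["a", "b", "c"])
print(A)  # [[2, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
```

Because the step is shorter than the window length, consecutive slices overlap and a single log can be counted in several rows of A.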

In one embodiment, step 102 specifically includes: based on principal component analysis (PCA), determining a log piece with an unbalanced log type ratio from a plurality of log pieces; and determining the log piece with an unbalanced log type ratio as an abnormal log piece.

In statistics, principal component analysis is a technique for simplifying a data set. It is a linear transformation that maps the data into a new coordinate system such that the greatest variance of any projection of the data lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate (called the second principal component), and so on. Principal component analysis is often used to reduce the dimensionality of a data set while keeping the features that contribute the most to the variance of the data set.

Specifically, for each log slice $W_i$, $\xi_{ij}$ is used to represent the number of logs of type $s_j$ (each type in S) in the log slice. Let $w_i = (\xi_{i1}, \xi_{i2}, \ldots, \xi_{i|S|})$ be the vector of the log slice. Let n = |S|; then all $w_i$ form a matrix A with m rows and n columns. Under normal circumstances, the proportion of each type of log tends to be stable. If the proportions of the log types become markedly unbalanced in certain time periods (certain log slices), an abnormal situation has very likely occurred in those time periods, and these log slices need to be analyzed closely. Principal component analysis can find the time periods with unbalanced proportions: it finds a low-dimensional space such that the sum of the distances between the high-dimensional data and their projections onto that space is minimized.

FIG. 4 is an exemplary flow chart of a log slice analysis process according to an embodiment of the present invention. As shown in FIG. 4, the process includes:

Step 401: Center matrix A (subtract the mean of each column) to obtain matrix B.

Step 402: Obtain the eigenvalues $\lambda_i$ (i = 1, 2, ..., n), sorted so that $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$, and the corresponding eigenvectors $V_1, V_2, \ldots, V_n$ of $B^T B$.

Step 403: Set the proportion of variance retained after dimensionality reduction to 90%, and calculate k such that $\sum_{i=1}^{k}\lambda_i / \sum_{i=1}^{n}\lambda_i \ge 0.9$.

Step 404: Combine the first k eigenvectors $V_1, V_2, \ldots, V_k$ into a matrix $P = [V_1, V_2, \ldots, V_k]$, and calculate the orthogonal projection matrix $V_P = P(P^T P)^{-1}P^T = PP^T$.

Step 405: For an original vector y, the Euclidean distance from the vector to the subspace onto which it is mapped can be obtained by calculating the squared prediction error $\mathrm{SPE} = \|y_a\|^2$, where $y_a$ is the projection of y onto the residual subspace (the orthogonal complement of the principal subspace), and $y_a = (I - V_P)y = (I - PP^T)y$.

Step 406: Calculate the distance from each point to the subspace according to the above algorithm and compare it with the detection threshold $Q_\alpha$, that is, determine whether $\mathrm{SPE} = \|y_a\|^2 > Q_\alpha$ holds; if yes, execute step 407; otherwise, execute step 408.

Step 407: Mark the point y as abnormal and exit this process. Here $Q_\alpha$ represents the threshold statistic of the SPE residual function at the (1-α) confidence level; it is computed from $\lambda_j$, the eigenvalue of the j-th principal component of the sample data covariance matrix projected on the subspace, and $c_\alpha$, the 1-α percentile of the standard normal distribution.

Step 408: Determine that it is a normal log slice and exit this process.

So far, normal and abnormal rows and their corresponding log slices have been obtained.
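Steps 401 to 408 can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function name `detect_abnormal_slices` and the defaults `var_ratio=0.90` and `alpha=0.001` are hypothetical; B ^T B is scaled by 1/(m-1) so that its eigenvalues are covariance eigenvalues (the eigenvectors are unchanged); and the threshold Q _α is assumed to take the commonly used Jackson-Mudholkar form, since the formula itself is not reproduced in the text above.

```python
from statistics import NormalDist
import numpy as np

def detect_abnormal_slices(A, var_ratio=0.90, alpha=0.001):
    """Return (spe, q_alpha, flagged_row_indices) for the m x n count matrix A."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    B = A - A.mean(axis=0)                         # Step 401: center each column
    # Step 402: eigen-decomposition of B^T B, scaled by 1/(m-1) (assumption).
    lam, V = np.linalg.eigh(B.T @ B / (m - 1))
    order = np.argsort(lam)[::-1]                  # sort eigenvalues in descending order
    lam, V = np.clip(lam[order], 0.0, None), V[:, order]
    # Step 403: smallest k whose cumulative variance ratio reaches var_ratio.
    cum = np.cumsum(lam) / lam.sum()
    k = min(int(np.searchsorted(cum, var_ratio)) + 1, n)
    P = V[:, :k]                                   # Step 404: P = [V_1, ..., V_k]
    resid = B - B @ P @ P.T                        # each row is ((I - P P^T) y)^T
    spe = (resid ** 2).sum(axis=1)                 # Step 405: SPE = ||y_a||^2
    if k == n:                                     # no residual subspace, nothing to flag
        return spe, float("inf"), []
    # Steps 406-407: Q_alpha in the assumed Jackson-Mudholkar form, with
    # phi_p = sum of lam_j**p over the discarded components j = k+1..n.
    phi1, phi2, phi3 = (float(np.sum(lam[k:] ** p)) for p in (1, 2, 3))
    h0 = 1.0 - 2.0 * phi1 * phi3 / (3.0 * phi2 ** 2)
    c = NormalDist().inv_cdf(1.0 - alpha)          # (1 - alpha) percentile of N(0, 1)
    q = phi1 * (c * np.sqrt(2.0 * phi2 * h0 ** 2) / phi1
                + 1.0 + phi2 * h0 * (h0 - 1.0) / phi1 ** 2) ** (1.0 / h0)
    flagged = [int(i) for i in np.flatnonzero(spe > q)]
    return spe, float(q), flagged

# Toy data: 19 slices with a stable 1:1 type ratio and one imbalanced slice (row 19).
A = [[t, t] for t in range(1, 20)] + [[6, 14]]
spe, q, flagged = detect_abnormal_slices(A)
print(flagged)  # [19]
```

In this toy example the first principal component explains about 97% of the variance, so k = 1, and only the imbalanced last row has an SPE above the threshold.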

The above exemplary description is a typical example of determining an abnormal log piece from a plurality of log pieces. Those skilled in the art may appreciate that such description is merely exemplary and is not intended to limit the protection scope of the embodiments of the present invention.

Step 103: Based on the type of logs included in the abnormal log piece, determine the abnormal mode of the log piece.

In one embodiment, step 103 specifically includes: determining the number N of log slice abnormal patterns; when the number m of abnormal log slices is less than or equal to N, determining the corresponding log slice abnormal pattern based on the type sequence of logs contained in each abnormal log slice to obtain m log slice abnormal patterns; when the number m of abnormal log slices is greater than N, clustering is performed based on the similarity between the type sequences of logs contained in the abnormal log slices to obtain N log slice abnormal patterns.

Here, the number N of log slice abnormal patterns is preset. When the number m of abnormal log slices is less than or equal to N, the type sequence of logs contained in each abnormal log slice can be determined as the corresponding log slice abnormal pattern. The log slice abnormal pattern includes the type sequence of logs contained in the abnormal log slice, wherein each type in the type sequence can be characterized as a log string corresponding to the type. When the number m of abnormal log slices is greater than N, clustering is performed based on the similarity between the type sequences of logs contained in the abnormal log slice to obtain N log slice abnormal patterns.

In one embodiment, clustering is performed based on the similarity between the type sequences of logs included in the abnormal log piece to obtain N log piece abnormal patterns, including: taking the type sequences of logs included in the abnormal log piece as set elements to combine into a log piece abnormal pattern set; in the log piece abnormal pattern set, merging any two set elements with the greatest similarity to update the log piece abnormal pattern set, until the number of set elements of the log piece abnormal pattern set is equal to N.

Before merging any two set elements with the greatest similarity, a process of determining the similarity between any two set elements is involved.

Assuming that the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces, the process of determining the similarity between any two set elements includes: determining the longest common subsequence length of the first set element of the first abnormal log piece and the second set element of the second abnormal log piece, wherein the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces; based on the number of logs in the first abnormal log piece, the number of logs in the second abnormal log piece and the longest common subsequence length, determining the similarity of the first set element of the first abnormal log piece and the second set element of the second abnormal log piece.

In a specific implementation of determining the similarity between any two set elements, the similarity d(y _i , y _j ) between the first set element y _i of the first abnormal log piece and the second set element y _j of the second abnormal log piece is determined as $d(y_i, y_j) = \dfrac{|y_i| + |y_j|}{t}$, where t = LCS(y _i , y _j ); |y _i | is the number of logs in the first abnormal log piece; |y _j | is the number of logs in the second abnormal log piece; LCS(y _i , y _j ) is the longest common subsequence length of the first set element y _i and the second set element y _j ; and the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces.

Specifically, let the set of type sequences of abnormal log pieces be Y = {y ₁ , y ₂ , ...}. Each element is the log type sequence contained in one abnormal log piece. For example, y ₁ = [s ₆ , s ₆ , s ₃ , s ₂ , s ₁₀ , s ₆ , s ₃ , s ₂ , s ₃ ] for the first abnormal log piece and y ₂ = [s ₆ , s ₁ , s ₅ , s ₃ , s ₂ , s ₁₃ , s ₁₀ ] for the second abnormal log piece indicate that the logs in y ₁ are, in order, of types 6, 6, 3, 2, 10, 6, 3, 2 and 3, and the logs in y ₂ are, in order, of types 6, 1, 5, 3, 2, 13 and 10. Each log type in the log type sequence can be represented as the log string corresponding to the log type. The similarity between y _i and y _j is defined as the ratio of the total number of log entries |y _i |+|y _j | to the length of the longest common subsequence LCS(y _i , y _j ) of the log type sequences, that is, $d(y_i, y_j) = \dfrac{|y_i| + |y_j|}{t}$, where t = LCS(y _i , y _j ). A longer common subsequence gives a smaller d, so d acts as a distance: the smaller its value, the more similar the two log pieces. For the y ₁ and y ₂ above, the longest common subsequence is [s ₆ , s ₃ , s ₂ , s ₁₀ ], so t = 4 and d = (9 + 7)/4 = 4. The longest common subsequence is used to measure the similarity of two log pieces because the ordered arrangement of log types can determine an abnormal log piece pattern. Based on this distance measure, hierarchical clustering is performed to obtain abnormal log pieces of different categories, namely, log piece abnormal patterns.
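The distance described above can be sketched as follows. This is an illustrative sketch that assumes the formula d = (|y _i | + |y _j |)/t as read from the verbal definition; the function names `lcs_length` and `lcs_distance` are hypothetical, log types are represented by their integer indices, and returning infinity when the two sequences share no type is an added assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match diagonally, or carry over the best length so far.
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_distance(a, b):
    """d(a, b) = (|a| + |b|) / LCS(a, b); infinite when nothing is shared (assumption)."""
    t = lcs_length(a, b)
    return float("inf") if t == 0 else (len(a) + len(b)) / t

# The two type sequences from the example above, written as type indices.
y1 = [6, 6, 3, 2, 10, 6, 3, 2, 3]
y2 = [6, 1, 5, 3, 2, 13, 10]
print(lcs_length(y1, y2), lcs_distance(y1, y2))  # 4 4.0
```

Two identical sequences give the smallest possible value, d = 2, and the value grows as the shared ordered part shrinks.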

FIG5 is an exemplary flow chart of determining an abnormal mode of a log sheet according to an embodiment of the present invention. As shown in FIG5 , the method includes:

Step 501: Determine an abnormal log piece set Y={y ₁ , y ₂ , ...}, where y ₁ , y ₂ , ... are respective abnormal log pieces.

Step 502: P←{{y ₁ }, {y ₂ }, {y ₃ }...}, where “←” represents an assignment operation. P is a set of log slice exception patterns, and {y ₁ }, {y ₂ }, {y ₃ }, ... contain the type sequences of the exception log slices y ₁ , y ₂ , y ₃ , ... respectively. For example, y ₁ = [s ₆ , s ₆ , s ₃ , s ₂ , s ₁₀ , s ₆ , s ₃ , s ₂ , s ₃ ] means that the logs in the type sequence y ₁ are, in order, of types 6, 6, 3, 2, 10, 6, 3, 2 and 3. Moreover, each type in the type sequence y ₁ (i.e., any one of s ₆ , s ₆ , s ₃ , s ₂ , s ₁₀ , s ₆ , s ₃ , s ₂ , s ₃ ) can be represented by a log string that conforms to the type.

Step 503: Determine whether |P|＞N is true, where N is a preset value (for example, N is 2) and |P| is the number of set elements in P. If true (corresponding to the "Y" branch), execute step 505; otherwise (corresponding to the "N" branch), execute step 504.

Step 504: Take each set element in P as a log slice exception mode and end this process.

Step 505: Select P _i , P _j from P such that the distance d(P _i , P _j ) is minimized, and execute P←P-{P _i }-{P _j }+{P _i ∪P _j }.

Step 505 is repeatedly executed until |P| is less than or equal to N.
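Steps 501 to 505 can be sketched as follows. This is a minimal sketch, not the implementation of the embodiment: the function names are hypothetical, the distance is assumed to be d = (|y _i | + |y _j |)/LCS(y _i , y _j ), and, because the text does not say how the distance between two already merged set elements is computed, single linkage (the smallest pairwise distance between their members) is assumed.

```python
from itertools import combinations

def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_distance(a, b):
    """Assumed distance: (|a| + |b|) / LCS(a, b), infinite when nothing is shared."""
    t = lcs_length(a, b)
    return float("inf") if t == 0 else (len(a) + len(b)) / t

def cluster_distance(ca, cb):
    """Single linkage (assumption): closest pair of member sequences."""
    return min(lcs_distance(a, b) for a in ca for b in cb)

def merge_patterns(Y, N):
    """Merge the closest set elements until at most N abnormal patterns remain."""
    P = [[y] for y in Y]                           # Step 502: one set element per sequence
    while len(P) > max(N, 1):                      # Step 503: is |P| > N ?
        i, j = min(combinations(range(len(P)), 2),
                   key=lambda ij: cluster_distance(P[ij[0]], P[ij[1]]))
        merged = P[i] + P[j]                       # Step 505: P_i united with P_j
        P = [c for idx, c in enumerate(P) if idx not in (i, j)] + [merged]
    return P                                       # Step 504: each element is one pattern

# Four hypothetical abnormal type sequences (type indices), reduced to N = 2 patterns.
Y = [[6, 3, 2, 10], [6, 3, 2, 13], [1, 5, 1], [1, 5, 7]]
patterns = merge_patterns(Y, 2)
print(patterns)  # [[[6, 3, 2, 10], [6, 3, 2, 13]], [[1, 5, 1], [1, 5, 7]]]
```

Each returned pattern keeps the member sequences that were merged into it, which is what the later comparison against a log slice to be tested needs.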

For example, assume that there are 10 abnormal log pieces, namely abnormal log piece 1 to abnormal log piece 10. The log type sequences of abnormal log piece 1 to abnormal log piece 10 are y ₁ to y ₁₀ respectively.

(1): When N is greater than or equal to 10, y ₁ to y ₁₀ are determined as 10 log slice anomaly patterns, each of which contains the corresponding log type sequence. For example, assuming y ₁ = [s ₆ , s ₆ , s ₃ , s ₂ , s ₁₀ , s ₆ , s ₃ , s ₂ , s ₃ ], the log slice anomaly pattern corresponding to y ₁ is [s ₆ , s ₆ , s ₃ , s ₂ , s ₁₀ , s ₆ , s ₃ , s ₂ , s ₃ ], where each log type in the log type sequence can be represented as a log string corresponding to the log type.

(2): When N is less than 10 (for example, N is equal to 8), the type sequences of the logs contained in the 10 abnormal log slices are respectively taken as set elements and combined into a log slice abnormal pattern set. The log slice abnormal pattern set is {{y ₁ }, {y ₂ }, {y ₃ }, ..., {y ₁₀ }}. In the log slice abnormal pattern set, the two set elements with the greatest similarity are merged to update the log slice abnormal pattern set, until the number of set elements of the log slice abnormal pattern set is equal to N. For example, when the similarity between y ₁ and y ₂ is the greatest, y ₁ and y ₂ are merged, and the log slice abnormal pattern set becomes {{{y ₁ }, {y ₂ }}, {y ₃ }, ..., {y ₁₀ }}. It can be seen that the number of elements in the set is reduced by one, and {{y ₁ }, {y ₂ }} becomes a single set element. The two set elements with the greatest similarity in the updated set continue to be merged until the number of set elements is 8. Assume that the final log slice abnormal pattern set is {{{{y ₁ }, {y ₂ }}, {y ₃ }}, ..., {y ₁₀ }}. In this set, {{{y ₁ }, {y ₂ }}, {y ₃ }} corresponds to one abnormal pattern.

Step 104: Compare the type of the log included in the log piece to be tested with the log piece abnormal pattern.

Here, the log piece to be tested is a log piece whose abnormal pattern needs to be determined, such as a log piece obtained in real time.

Step 105: Determine the abnormal mode of the log sheet to be tested based on the comparison result.

In one embodiment, step 104 specifically includes: determining the similarity between the log type sequence in the log piece to be tested and each log piece abnormal pattern in the log piece abnormal pattern set; step 105 specifically includes: determining the log piece abnormal pattern in the log piece abnormal pattern set that has the highest similarity with the log type sequence in the log piece to be tested and is greater than a predetermined threshold value as the abnormal pattern of the log piece to be tested. When the similarity of the log piece abnormal pattern with the highest similarity to the log type sequence in the log piece to be tested is not greater than the predetermined threshold value, the log piece to be tested is determined to be normal.

Here, a log piece to be tested is obtained, and then the unstructured information of each log in the log piece to be tested is divided into word sequences with spaces as separators, and then the log string corresponding to the word sequence is determined based on each word sequence (the character position is equal to the character position in step 203), and then each log string is compared with each type in the log type set for similarity (the similarity determination method can refer to the above formula for calculating sim(T′ _i , T′ _j )) to determine the type of each log, thereby obtaining the log type sequence in the log piece to be tested.

When the log type set does not contain the type of a log in the log piece to be tested, the log piece to be tested can be added to the log set in step 101, and the process shown in FIG1 is executed again to update the log piece abnormal pattern set. Then, the updated log piece abnormal pattern set is applied to execute steps 104 and 105.

In one embodiment, when a log slice anomaly pattern in a log slice anomaly pattern set is a set element that has undergone set merging, the method further includes: determining a sub-similarity between a log type sequence contained in the log slice to be tested and each log type sequence in the log slice anomaly pattern that participates in the set merging; and determining a weighted sum of the sub-similarity values as the similarity between the log type sequence in the log slice to be tested and the log slice anomaly pattern.

For example, suppose the log slice exception pattern set is {{{{y ₁ }, {y ₂ }}, {y ₃ }}, ..., {y ₁₀ }}, in which {{{y ₁ }, {y ₂ }}, {y ₃ }} is a set element obtained by set merging. When calculating the similarity between the log type sequence in the log piece to be tested and {{{y ₁ }, {y ₂ }}, {y ₃ }}, the similarity (called sub-similarity) between the log type sequence in the log piece to be tested and each of {y ₁ }, {y ₂ } and {y ₃ } can be calculated to obtain three sub-similarities, and then the weighted sum of the three sub-similarities (the weights can be set) is determined as the similarity between the log type sequence in the log piece to be tested and the log piece abnormal pattern.
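The comparison in steps 104 and 105, including the weighted sum of sub-similarities for merged patterns, can be sketched as follows. This is a sketch under stated assumptions, not the implementation of the embodiment: the function names are hypothetical; the score compared with the threshold is assumed to be 2·LCS/(|a| + |b|), a value between 0 and 1 that grows with similarity, because the text does not state how the score is scaled; and equal weights are assumed when none are supplied.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Assumed score in [0, 1]: 2 * LCS(a, b) / (|a| + |b|)."""
    total = len(a) + len(b)
    return 0.0 if total == 0 else 2.0 * lcs_length(a, b) / total

def pattern_similarity(seq, pattern, weights=None):
    """Weighted sum of the sub-similarities between seq and each member of a pattern."""
    if weights is None:                            # equal weights (assumption)
        weights = [1.0 / len(pattern)] * len(pattern)
    return sum(w * similarity(seq, member) for w, member in zip(weights, pattern))

def classify_slice(seq, patterns, threshold):
    """Return the index of the best-matching abnormal pattern, or None if normal."""
    if not patterns:
        return None
    scores = [pattern_similarity(seq, p) for p in patterns]
    best = max(range(len(patterns)), key=scores.__getitem__)
    return best if scores[best] > threshold else None

# Two hypothetical patterns, each holding the type sequences merged into it.
patterns = [[[6, 3, 2, 10], [6, 3, 2, 13]], [[1, 5, 1], [1, 5, 7]]]
print(classify_slice([6, 3, 2, 10, 4], patterns, 0.5))  # 0
print(classify_slice([9, 9, 9], patterns, 0.5))         # None
```

A pattern that was never merged is simply a list with one member, so the same function covers both cases.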

FIG6 is a structural diagram of an apparatus for determining an abnormal mode of a log sheet according to an embodiment of the present invention. As shown in FIG6 , the apparatus 600 for determining an abnormal mode of a log sheet includes:

A first determination module 601 is configured to determine the type of a log based on a character string of a log included in a log set;

A second determination module 602 is configured to determine an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by partitioning the log set based on a sliding time window;

The third determination module 603 is configured to determine the log piece abnormality mode based on the type of logs included in the abnormal log piece;

A comparison module 604 is configured to compare the type of log included in the log piece to be tested with the log piece abnormality pattern;

The fourth determination module 605 is configured to determine the abnormal mode of the log slice to be tested based on the comparison result.

In one embodiment, the first determination module 601 is configured to: extract log content characterized as unstructured information from the log; divide the log content into word sequences with spaces as delimiters; group the logs based on the length of the word sequences; for each group: determine the log string of each word sequence based on the characters at a predetermined position of each word in each word sequence contained in the group; perform clustering based on the similarity between any two log strings to obtain the type of log contained in the group; and combine the types of the logs of all groups into a log type set.

In one embodiment, any two log strings include a first log string and a second log string; the first determination module 601 is configured to: determine the similarity between the characters at the same character position in the first log string and the second log string; and, based on the similarity between the characters at the same character position, determine the similarity between the first log string and the second log string.

In one embodiment, the third determination module 603 is configured to: determine the number N of log slice abnormal patterns; when the number m of abnormal log slices is less than or equal to N, determine the corresponding log slice abnormal pattern based on the type sequence of logs contained in each abnormal log slice to obtain m log slice abnormal patterns; when the number m of abnormal log slices is greater than N, perform clustering based on the similarity between the type sequences of logs contained in the abnormal log slices to obtain N log slice abnormal patterns.

In one embodiment, the third determination module 603 is configured to: use the type sequence of logs contained in the abnormal log piece as set elements to combine into a log piece abnormal pattern set; in the log piece abnormal pattern set, any two set elements with the greatest similarity are merged to update the log piece abnormal pattern set until the number of set elements of the log piece abnormal pattern set is equal to N.

In one embodiment, the third determination module 603 is configured to: determine the longest common subsequence length of the first set elements of the first abnormal log piece and the second set elements of the second abnormal log piece, wherein the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces; based on the number of logs in the first abnormal log piece, the number of logs in the second abnormal log piece and the longest common subsequence length, determine the similarity between the first set elements of the first abnormal log piece and the second set elements of the second abnormal log piece.

In one embodiment, the comparison module 604 is configured to determine the similarity between the log type sequence in the log piece to be tested and each log piece abnormal pattern in the log piece abnormal pattern set; the fourth determination module 605 is configured to determine the log piece abnormal pattern in the log piece abnormal pattern set that has the highest similarity with the log type sequence in the log piece to be tested and is greater than a predetermined threshold value as the abnormal pattern of the log piece to be tested.

In one embodiment, the comparison module 604 is configured to: when a log slice anomaly pattern in a log slice anomaly pattern set is a set element that has undergone a set merge, determine the sub-similarity between the log type sequence contained in the log slice to be tested and each log type sequence participating in the set merge in the log slice anomaly pattern; and determine the weighted sum of the sub-similarity values as the similarity between the log type sequence in the log slice to be tested and the log slice anomaly pattern.

The embodiment of the present invention further provides an electronic device having a processor-memory architecture. FIG. 7 is a structural diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 7 , the electronic device 700 includes a processor 701, a memory 702, and a computer program stored in the memory 702 and executable on the processor 701. When the computer program is executed by the processor 701, any of the above methods for determining the abnormal mode of the log slice is implemented. The memory 702 can be implemented as a variety of storage media such as an electrically erasable programmable read-only memory (EEPROM), a flash memory, and a programmable read-only memory (PROM). The processor 701 can be implemented as including one or more central processing units or one or more field programmable gate arrays, wherein the field programmable gate array integrates one or more central processing unit cores. Specifically, the central processing unit or the central processing unit core can be implemented as a CPU, an MCU or a DSP, etc.

It should be noted that not all steps and modules in the above processes and structure diagrams are necessary, and some steps or modules can be ignored according to actual needs. The execution order of each step is not fixed and can be adjusted as needed. The division of each module is only for the convenience of describing the functional division adopted. In actual implementation, a module can be implemented by multiple modules, and the functions of multiple modules can also be implemented by the same module. These modules can be located in the same device or in different devices.

The hardware modules in each embodiment can be implemented mechanically or electronically. For example, a hardware module may include a specially designed permanent circuit or logic device (such as a dedicated processor, such as an FPGA or ASIC) to perform a specific operation. The hardware module may also include a programmable logic device or circuit (such as a general-purpose processor or other programmable processor) temporarily configured by software to perform a specific operation. As for whether to implement the hardware module mechanically, or using a dedicated permanent circuit, or using a temporarily configured circuit (such as configured by software), it can be decided based on cost and time considerations.

The above are only preferred embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims

A method for determining an abnormal pattern of a log sheet, characterized by comprising:

determining (101) the type of each log based on the character string of the log included in a log set;

determining (102) an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by partitioning the log set based on a sliding time window;

determining (103) a log slice abnormal pattern based on the type of logs included in the abnormal log slice;

comparing (104) the type of logs contained in a log slice to be tested with the log slice abnormal pattern; and

determining (105) the abnormal pattern of the log slice to be tested based on the comparison result.
The method according to claim 1, characterized in that the determining (101) of the type of the log based on the character string of the log included in the log set comprises:

extracting log content characterized as unstructured information from the log;

Divide the log content into word sequences using spaces as delimiters;

Grouping the logs based on the length of the word sequence;

For each group:

Determine a log string for each word sequence based on the characters at a predetermined position of each word in each word sequence included in the group;

Clustering is performed based on the similarity between any two of the log strings to obtain the type of logs included in the group;

The types of all grouped logs are combined into a log type set.
The method according to any one of the above claims is characterized in that the predetermined position is the first character or the last character of the word.
The method according to any one of the above claims is characterized in that the arbitrary two log strings include a first log string and a second log string: the method further comprises:

Determine the similarity between the characters at the same character position between the first log character string and the second log character string;

Based on the similarity between the characters at the same character position, the similarity between the first log character string and the second log character string is determined.
The method according to any one of the preceding claims, characterized in that the step of determining (102) an abnormal log slice from a plurality of log slices comprises:

determining, based on principal component analysis, log sheets with an unbalanced log type ratio from the plurality of log sheets;

The log slice with an unbalanced log type ratio is determined as the abnormal log slice.
The method according to any one of the preceding claims, characterized in that the determining (103) of the log slice abnormality pattern based on the type of log contained in the abnormal log slice comprises:

Determine the number N of abnormal patterns in the log slice;

When the number m of the abnormal log slices is less than or equal to N, based on the type sequence of logs contained in each abnormal log slice, a corresponding log slice abnormal pattern is determined to obtain m log slice abnormal patterns;

When the number m of the abnormal log pieces is greater than N, clustering is performed based on the similarity between the type sequences of the logs contained in the abnormal log pieces to obtain N log piece abnormal patterns.
The method according to any one of the preceding claims is characterized in that clustering based on similarity between type sequences of logs contained in the abnormal log slices to obtain N log slice abnormal patterns comprises:

The type sequence of the logs contained in the abnormal log piece is taken as a set element to form a log piece abnormal pattern set;

In the log slice anomaly pattern set, any two set elements with the greatest similarity are set merged to update the log slice anomaly pattern set until the number of set elements of the log slice anomaly pattern set is equal to N.
The method according to any one of the above claims, further comprising:

Determine the longest common subsequence length of a first set element of a first abnormal log piece and a second set element of a second abnormal log piece, wherein the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces;

Based on the number of logs in the first abnormal log piece, the number of logs in the second abnormal log piece, and the longest common subsequence length, the similarity between the first set element of the first abnormal log piece and the second set element of the second abnormal log piece is determined.
The method according to any one of the preceding claims, characterized in that

The comparing the type of logs contained in the log piece to be tested with the log piece abnormal pattern (104) includes: determining the similarity between the log type sequence in the log piece to be tested and each log piece abnormal pattern in the log piece abnormal pattern set;

The determining (105) of the abnormal pattern of the log piece to be tested based on the comparison result comprises: determining the log piece abnormal pattern in the log piece abnormal pattern set, which has the highest similarity with the log type sequence in the log piece to be tested and is greater than a predetermined threshold value, as the abnormal pattern of the log piece to be tested.
The method according to any one of the preceding claims is characterized in that, when the log slice anomaly pattern in the log slice anomaly pattern set is a set element that has been merged, the method further comprises:

Determine a sub-similarity between a log type sequence included in the log piece to be tested and each log type sequence participating in the set merging in the log piece anomaly pattern;

The weighted sum of the sub-similarity values is determined as the similarity between the log type sequence in the log piece to be tested and the abnormal pattern of the log piece.
A device for determining an abnormal pattern of a log sheet, characterized by comprising:

A first determination module (601) is configured to determine the type of a log based on a character string of a log included in a log set;

A second determination module (602) is configured to determine an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by dividing the log set based on a sliding time window;

A third determination module (603) is configured to determine a log slice abnormality pattern based on the type of logs included in the abnormal log slice;

A comparison module (604) is configured to compare the type of log contained in the log slice to be tested with the log slice abnormality pattern;

The fourth determination module (605) is configured to determine the abnormal mode of the log sheet to be tested based on the comparison result.
The device according to claim 10, characterized in that the first determining module (601) is configured to:

extracting log content characterized as unstructured information from the log;

Divide the log content into word sequences using spaces as delimiters;

Grouping the logs based on the length of the word sequence;

For each group:

Determine a log string for each word sequence based on the characters at a predetermined position of each word in each word sequence included in the group;

Clustering is performed based on the similarity between any two of the log strings to obtain the type of logs included in the group;

The types of all grouped logs are combined into a log type set.
The device according to any one of the preceding claims, characterized in that the any two log strings include a first log string and a second log string;

The first determination module (601) is configured to: determine the similarity between the characters at the same character position between the first log string and the second log string; and determine the similarity between the first log string and the second log string based on the similarity between the characters at the same character position.
The device according to any one of the preceding claims, characterized in that

The third determination module (603) is configured to: determine the number N of log slice abnormal patterns; when the number m of the abnormal log slices is less than or equal to N, determine the corresponding log slice abnormal pattern based on the type sequence of the logs contained in each abnormal log slice to obtain m log slice abnormal patterns; when the number m of the abnormal log slices is greater than N, perform clustering based on the similarity between the type sequences of the logs contained in the abnormal log slices to obtain N log slice abnormal patterns.
The device according to any one of the preceding claims, characterized in that

The third determination module (603) is configured to: use the type sequence of the logs contained in the abnormal log piece as set elements to combine into a log piece abnormal pattern set; in the log piece abnormal pattern set, any two set elements with the greatest similarity are combined to update the log piece abnormal pattern set until the number of set elements of the log piece abnormal pattern set is equal to N.
The device according to any one of the preceding claims, characterized in that

The third determination module (603) is configured to: determine the longest common subsequence length of the first set elements of the first abnormal log piece and the second set elements of the second abnormal log piece, wherein the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces; based on the number of logs in the first abnormal log piece, the number of logs in the second abnormal log piece and the longest common subsequence length, determine the similarity between the first set elements of the first abnormal log piece and the second set elements of the second abnormal log piece.
The device according to any one of the preceding claims, characterized in that

The comparison module (604) is configured to: determine the similarity between the log type sequence in the log slice to be tested and each log slice abnormal pattern in the log slice abnormal pattern set;

The fourth determination module (605) is configured to: determine the log slice abnormality pattern in the log slice abnormality pattern set, which has the highest similarity with the log type sequence in the log slice to be tested and is greater than a predetermined threshold value, as the abnormality pattern of the log slice to be tested.
The device according to any one of the preceding claims, characterized in that

The comparison module (604) is configured to: when a log slice abnormality pattern in the log slice abnormality pattern set is a set element that has undergone set merging, determine the sub-similarity between the log type sequence contained in the log piece to be tested and each log type sequence participating in the set merging in the log piece anomaly pattern; and determine the weighted sum value of each sub-similarity as the similarity between the log type sequence in the log piece to be tested and the log piece anomaly pattern.
An electronic device, comprising:

Processor (701);

A memory (702), configured to store executable instructions of the processor (701);

The processor (701) is configured to read the executable instructions from the memory (702), and execute the executable instructions to implement the method for determining an abnormal pattern of a log slice according to any one of claims 1 to 9.
A computer-readable storage medium stores computer instructions thereon, wherein when the computer instructions are executed by a processor, the method for determining an abnormal mode of a log sheet according to any one of claims 1 to 9 is implemented.
A computer program product, characterized in that it comprises a computer program, and when the computer program is executed by a processor, the method for determining an abnormal pattern of a log sheet according to any one of claims 1 to 9 is implemented.