WO2024174135A1 - Method for determining abnormal mode of log slice, apparatus, device and storage medium - Google Patents
- Publication number
- WO2024174135A1 (PCT/CN2023/077709)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- log
- abnormal
- piece
- slice
- similarity
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
Definitions
- The present invention relates to the field of log management technology, and in particular to a method, apparatus, device and storage medium for determining an abnormal mode of a log slice.
- Logs can record descriptions of related operations such as date, time, user, and action.
- System development and operation and maintenance personnel can detect abnormal behavior and errors in the system based on logs.
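- The embodiments of the present invention provide a method, an apparatus, a device and a storage medium for determining an abnormal mode of a log slice.
- a method for determining an abnormal pattern of a log slice, comprising:
- the type of each log is determined based on the character string of the log contained in a log set;
- an abnormal log slice is determined from a plurality of log slices, wherein the plurality of log slices are obtained by dividing the log set based on a sliding time window;
- a log slice abnormal pattern is determined based on the types of the logs contained in the abnormal log slice;
- the types of the logs contained in a log slice to be tested are compared with the log slice abnormal pattern;
- the abnormal mode of the log slice to be tested is determined based on the comparison result.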
- the embodiments of the present invention automatically determine the abnormal pattern based on the log string in the log set, overcoming the shortcomings of low accuracy and heavy workload in manual log analysis. Moreover, compared with the session-based partitioning, the log slice partitioning based on the sliding window has a wider range of applications.
- determining the type of the log based on the character string of the log contained in the log set includes:
- clustering is performed based on the similarity between any two log strings to obtain the types of the logs contained in the group;
- the types of the logs in all groups are combined into a log type set.
- the embodiment of the present invention also performs clustering based on the similarity between log character strings, and clustering can be completed without manual labeling, thereby improving convenience.
- the predetermined position is the first character or the last character of the word.
- any two log strings include a first log string and a second log string; the method also includes: determining the similarity between the characters at the same character position in the first log string and the second log string; and determining the similarity between the first log string and the second log string based on the similarity between the characters at the same character position.
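- determining an abnormal log slice from a plurality of log slices comprises:
- a log slice with an unbalanced log type ratio is determined from the plurality of log slices based on principal component analysis;
- the log slice with an unbalanced log type ratio is determined as the abnormal log slice.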
- abnormal log slices can be easily determined through principal component analysis.
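- determining the log slice abnormal mode based on the type of logs included in the abnormal log slice includes:
- the number N of log slice abnormal patterns is determined;
- when the number m of abnormal log slices is less than or equal to N, the corresponding log slice abnormal pattern is determined based on the type sequence of the logs contained in each abnormal log slice, to obtain m log slice abnormal patterns;
- when the number m of abnormal log slices is greater than N, clustering is performed based on the similarity between the type sequences of the logs contained in the abnormal log slices to obtain N log slice abnormal patterns.
- the convenience of implementation is improved by taking the preset number N into account when determining the log slice abnormal patterns.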
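- clustering based on the similarity between the type sequences of logs contained in the abnormal log slices to obtain N log slice abnormal patterns includes:
- the type sequence of the logs contained in each abnormal log slice is taken as a set element to form a log slice abnormal pattern set;
- in the log slice abnormal pattern set, any two set elements with the greatest similarity are merged to update the log slice abnormal pattern set, until the number of set elements of the log slice abnormal pattern set is equal to N.
- it also includes:
- the longest common subsequence length of the first set element of the first abnormal log slice and the second set element of the second abnormal log slice is determined, wherein the first abnormal log slice and the second abnormal log slice are any two different abnormal log slices;
- based on the number of logs in the first abnormal log slice, the number of logs in the second abnormal log slice and the longest common subsequence length, the similarity between the first set element of the first abnormal log slice and the second set element of the second abnormal log slice is determined.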
- comparing the type of logs contained in the log piece to be tested with the log piece abnormal pattern includes: determining the similarity between the log type sequence in the log piece to be tested and each log piece abnormal pattern in the log piece abnormal pattern set;
- Determining the abnormal pattern of the log piece to be tested based on the comparison result includes: determining the log piece abnormal pattern in the log piece abnormal pattern set that has the highest similarity with the log type sequence in the log piece to be tested and is greater than a predetermined threshold value as the abnormal pattern of the log piece to be tested.
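- when a log slice abnormal pattern in the log slice abnormal pattern set is a set element obtained by set merging, the method further includes:
- the sub-similarity between the log type sequence contained in the log slice to be tested and each log type sequence participating in the set merging in the log slice abnormal pattern is determined;
- the weighted sum of the sub-similarity values is determined as the similarity between the log type sequence in the log slice to be tested and the log slice abnormal pattern.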
- each log type sequence participating in the set merging is comprehensively considered, thereby improving the accuracy of similarity calculation.
- An apparatus for determining an abnormal pattern of a log slice, comprising:
- a first determination module configured to determine the type of the log based on a character string of the log included in the log set;
- a second determination module configured to determine an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by dividing the log set based on a sliding time window;
- a third determination module configured to determine a log slice abnormal mode based on the type of logs included in the abnormal log slice;
- a comparison module configured to compare the type of logs included in the log slice to be tested with the log slice abnormal pattern;
- a fourth determination module configured to determine the abnormal mode of the log slice to be tested based on the comparison result.
- the embodiments of the present invention automatically determine the abnormal pattern based on the log string, overcoming the shortcomings of low accuracy and heavy workload in manual log analysis. Moreover, compared with the session-based partitioning, the log slice partitioning based on the sliding window has a wider application range.
- the first determination module is configured to:
- perform clustering based on the similarity between any two log strings to obtain the types of the logs contained in the group;
- combine the types of the logs in all groups into a log type set.
- the embodiment of the present invention also performs clustering based on the similarity between log character strings, and clustering can be completed without manual labeling, thereby improving convenience.
- any two log strings include a first log string and a second log string;
- the first determination module is configured to: determine the similarity between the characters at the same character position in the first log string and the second log string; and determine the similarity between the first log string and the second log string based on the similarity between the characters at the same character position.
- the third determination module is configured to: determine the number N of log slice abnormal patterns; when the number m of abnormal log slices is less than or equal to N, determine the corresponding log slice abnormal pattern based on the type sequence of logs contained in each abnormal log slice, to obtain m log slice abnormal patterns; when the number m of abnormal log slices is greater than N, perform clustering based on the similarity between the type sequences of logs contained in the abnormal log slices to obtain N log slice abnormal patterns.
- the convenience of implementation is improved by taking the preset number N into account when determining the log slice abnormal patterns.
- the third determination module is configured to: use the type sequence of logs contained in the abnormal log piece as set elements to combine into a log piece abnormal pattern set; in the log piece abnormal pattern set, any two set elements with the greatest similarity are merged to update the log piece abnormal pattern set until the number of set elements of the log piece abnormal pattern set is equal to N.
- the third determination module is configured to: determine the longest common subsequence length of the first set elements of the first abnormal log piece and the second set elements of the second abnormal log piece, wherein the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces; based on the number of logs in the first abnormal log piece, the number of logs in the second abnormal log piece and the longest common subsequence length, determine the similarity between the first set elements of the first abnormal log piece and the second set elements of the second abnormal log piece.
- the comparison module is configured to: determine the similarity between the log type sequence in the log slice to be tested and each log slice abnormal pattern in the log slice abnormal pattern set;
- the fourth determination module is configured to determine, in the log slice abnormal pattern set, a log slice abnormal pattern having the highest similarity with the log type sequence in the log slice to be tested and greater than a predetermined threshold value as the abnormal pattern of the log slice to be tested.
- the comparison module is configured to: When the log piece anomaly pattern in the set is a set element after set merging, determine the sub-similarity between the log type sequence contained in the log piece to be tested and each log type sequence participating in the set merging in the log piece anomaly pattern; and determine the weighted sum value of each sub-similarity as the similarity between the log type sequence in the log piece to be tested and the log piece anomaly pattern.
- each log type sequence participating in the set merging is comprehensively considered, thereby improving the accuracy of similarity calculation.
- An electronic device comprising:
- a memory configured to store executable instructions of the processor
- the processor is configured to read the executable instructions from the memory, and execute the executable instructions to implement the method for determining an abnormal mode of a log slice as described in any one of the above items.
- a computer-readable storage medium stores computer instructions thereon, wherein when the computer instructions are executed by a processor, the method for determining an abnormal mode of a log slice as described in any one of the above items is implemented.
- a computer program product comprises a computer program, wherein when the computer program is executed by a processor, the method for determining an abnormal pattern of a log slice as described in any one of the above items is implemented.
- FIG. 1 is a flow chart of a method for determining an abnormal pattern of a log slice according to an embodiment of the present invention.
- FIG. 2 is an exemplary flow chart of a log grouping process according to an embodiment of the present invention.
- FIG. 3 is an exemplary flow chart of a log slice preprocessing process according to an embodiment of the present invention.
- FIG. 4 is an exemplary flow chart of a log slice analysis process according to an embodiment of the present invention.
- FIG. 5 is an exemplary flow chart of determining a log slice anomaly pattern according to an embodiment of the present invention.
- FIG. 6 is a structural diagram of an apparatus for determining an abnormal pattern of a log slice according to an embodiment of the present invention.
- FIG. 7 is a structural diagram of an electronic device according to an embodiment of the present invention.
- the embodiment of the present invention proposes a method for determining abnormal patterns of log pieces based on clustering and principal component analysis, which overcomes the shortcomings of low accuracy and high workload in manual log analysis. Moreover, the embodiment of the present invention does not need to label the log data, realizes unsupervised log analysis, and reduces the workload of labeling. In addition, the log piece division based on the sliding window in the embodiment of the present invention has a wider range of applications compared with the log piece division based on the session. In addition, the embodiment of the present invention can mine deep-level log piece abnormal patterns through the longest common subsequence and hierarchical clustering, thereby improving the recognition efficiency of abnormal patterns.
- FIG. 1 is a flow chart of a method for determining an abnormal mode of a log slice according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
- Step 101: Determine the type of the log based on the character string of the log contained in the log set.
- a log set may contain a large number of logs.
- a log set may be implemented as a log set obtained based on historical log data. For example, all logs within a predetermined historical time period (e.g., the past week, month, or year, etc.) may be combined into a log set.
- step 101 specifically includes: extracting log content characterized as unstructured information from the log; dividing the log content into word sequences using spaces as delimiters; grouping the logs based on the length of the word sequences; for each group: determining the log string of each word sequence based on the characters at the predetermined position of each word in each word sequence contained in the group; clustering based on the similarity between any two log strings to obtain the type of log contained in the group; and combining the types of logs in all groups into a log type set.
- the predetermined position is the first character or the last character of the word, or the character at any specified position.
- FIG. 2 is an exemplary flow chart of a log grouping process according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
- the log header of the log $m_i$, which generally contains structured information such as IP address and time.
- $m_{ic}$: the specific content of the log $m_i$, which is generally represented as unstructured information.
- the value range of i is [1,
- Step 202: For each log in the log set M, that is, the i-th log $m_i$: divide the log content into a word sequence $T_i$ using spaces as delimiters, where $T_i = (t_{i1}, t_{i2}, \ldots, t_{iL_i})$, $L_i$ represents the length of the sequence, that is, the number of words, and $t_{ij}$ represents the j-th word in the i-th log.
- the logs are grouped by sequence length: the logs with $L_i = L$ form the group $G_L$.
- the log string is a string composed of the characters (such as the first character) at a specified position of each word in $T_i$, and is recorded as $T'_i = t'_{i1} t'_{i2} \ldots t'_{iL_i}$, where $t'_{ij}$ represents the first letter of $t_{ij}$.
- For example, the log string of "user bob connected" is "ubc".
- clustering can be performed based on the similarity between the log strings of each word sequence contained in the group to obtain the type of logs contained in the group. Then, the types of logs contained in all groups are combined into a log type set.
- the log set contains log 1, log 2, log 3, log 4, and log 5.
- the content of log 1 is "user bob connected"; the content of log 2 is "user tom connected";
- the content of log 3 is "user bob disconnected"; the content of log 4 is "user tom disconnected"; the content of log 5 is "client disconnected the license server".
- the word sequence of log 1 is "user bob connected" and the sequence length is 3.
- the word sequence of log 2 is "user tom connected" and the sequence length is 3.
- the word sequence of log 3 is "user bob disconnected" and the sequence length is 3.
- the word sequence of log 4 is "user tom disconnected" and the sequence length is 3.
- the word sequence of log 5 is "client disconnected the license server" and the sequence length is 5.
- the sequence lengths of log 1, log 2, log 3 and log 4 are the same (all 3), so log 1, log 2, log 3 and log 4 are divided into a group corresponding to the sequence length of 3.
- Log 5 is divided into another group corresponding to the sequence length of 5.
- the log string of log 1 is "ubc”; the log string of log 2 is “utc”; the log string of log 3 is “ubd”; the log string of log 4 is “utd”; and the log string of log 5 is "cdtls”.
- clustering is performed based on the similarity between any two log strings to obtain the type of logs contained in the group.
- For example, in a group with a sequence length of 3, 2 log types (respectively called log type a and log type b) are clustered. In a group with a sequence length of 5, since only the log string of log 5 is included, 1 log type (called log type c) is clustered. Then, the log type set determined based on the log set includes log type a, log type b and log type c.
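- The grouping and log-string steps of the worked example can be sketched as follows. This is a minimal illustration, not the patented implementation; the function names `log_string` and `group_log_strings` are hypothetical, and the first character of each word is assumed as the predetermined position.

```python
from collections import defaultdict


def log_string(content, position=0):
    """Join the character at `position` of every word (0 = first, -1 = last)."""
    words = [w for w in content.split(" ") if w]
    return "".join(word[position] for word in words)


def group_log_strings(contents, position=0):
    """Map each word-sequence length to the log strings of that group."""
    groups = defaultdict(list)
    for content in contents:
        length = len([w for w in content.split(" ") if w])
        groups[length].append(log_string(content, position))
    return dict(groups)


logs = [
    "user bob connected",
    "user tom connected",
    "user bob disconnected",
    "user tom disconnected",
    "client disconnected the license server",
]
groups = group_log_strings(logs)
print(groups)  # {3: ['ubc', 'utc', 'ubd', 'utd'], 5: ['cdtls']}
```

- Grouping by length first means that only log strings with the same number of characters are ever compared with each other.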
- the process of clustering based on the similarity between any two log strings in each group involves a process of determining the similarity between any two log strings.
- any two log strings include a first log string and a second log string.
- the process of determining the similarity between any two log strings includes: determining the similarity between the characters at the same character position in the first log string and the second log string; determining the similarity between the first log string and the second log string based on the similarity between the characters at the same character position.
- the first log string and the second log string belong to the same group, so the number of characters in the first log string and the second log string is the same, and the number of characters is usually plural.
- the number of character pairs formed by two characters at the same character position between the first log string and the second log string is plural, and the similarity between the characters at the same character position is plural.
- the similarity between the first log string and the second log string can be determined based on the weighted sum of all the similarities between characters at the same character position (the weights of the individual character similarities can be equal or unequal).
- any two log strings include a first log string $T'_i$ and a second log string $T'_j$; the process of determining the similarity between any two log strings includes: determining the similarity $\mathrm{sim}(T'_i, T'_j)$ between $T'_i$ and $T'_j$ as $\mathrm{sim}(T'_i, T'_j) = \frac{1}{L}\sum_{k=1}^{L} F(t'_{ik}, t'_{jk})$; wherein $t'_{ik}$ is the k-th character in $T'_i$; $t'_{jk}$ is the k-th character in $T'_j$; L is the number of characters in $T'_i$ and $T'_j$; when $t'_{ik}$ is equal to $t'_{jk}$, $F(t'_{ik}, t'_{jk})$ is 1; when $t'_{ik}$ is not equal to $t'_{jk}$, $F(t'_{ik}, t'_{jk})$ is 0.
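- A minimal sketch of this position-wise comparison is given below. It assumes equal weights of 1/L when no weights are supplied, which is one of the options the description allows; the function names and the optional `weights` parameter are hypothetical.

```python
def char_match(a, b):
    """F(a, b): 1 when the two characters are equal, otherwise 0."""
    return 1 if a == b else 0


def log_string_similarity(first, second, weights=None):
    """Weighted sum of per-position character matches of two log strings.

    Both strings must come from the same length group. With weights=None,
    every position gets the weight 1/L (an assumption of this sketch).
    """
    if len(first) != len(second) or not first:
        raise ValueError("log strings must be non-empty and of equal length")
    if weights is None:
        weights = [1.0 / len(first)] * len(first)
    return sum(w * char_match(a, b) for w, a, b in zip(weights, first, second))


print(log_string_similarity("ubc", "utc"))  # two of three positions match
print(log_string_similarity("ubc", "utd"))  # one of three positions matches
```

- Passing unequal weights, for example a larger weight for the last word, would make the position that distinguishes "connected" from "disconnected" dominate the comparison.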
- a log string set can be created in each group $G_L$ to store all types of logs in the group (each type can be represented by a log string), denoted as $S_L$.
- S L the set similarity threshold
- Step 102: Determine an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by partitioning the log set based on a sliding time window.
- the log slice preprocessing process can be performed in advance.
- the log set is divided based on the sliding time window to obtain multiple log slices.
- For example, the log set M is divided, by a sliding window with a length of 9 hours and a step length of 1 hour, into m log slices $W_1, W_2, \ldots, W_m$.
- the logs contained in each log slice may be the same or different.
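- The sliding-window partition can be sketched as follows. The half-open window [start, start + length), the stopping rule (windows are generated while the window start does not exceed the last timestamp) and the function name `sliding_window_slices` are assumptions of this sketch, since the description does not fix these details.

```python
from datetime import datetime, timedelta


def sliding_window_slices(entries, length, step):
    """Split time-ordered (timestamp, log_type) entries into overlapping slices."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    if not entries:
        return []
    slices = []
    start = entries[0][0]
    last = entries[-1][0]
    while start <= last:
        stop = start + length
        # Each slice keeps the entries whose timestamp falls in [start, stop).
        slices.append([e for e in entries if start <= e[0] < stop])
        start += step
    return slices


t0 = datetime(2023, 1, 1, 0, 0)
entries = [
    (t0, "a"),
    (t0 + timedelta(minutes=30), "b"),
    (t0 + timedelta(hours=2), "a"),
]
slices = sliding_window_slices(entries, timedelta(hours=2), timedelta(hours=1))
print([len(s) for s in slices])  # [2, 1, 1]
```

- Because the step is shorter than the window, the same log can appear in several slices, which is why the logs contained in different slices may coincide.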
- FIG. 3 is an exemplary flow chart of a log slice preprocessing process according to an embodiment of the present invention. As shown in FIG. 3, the method includes:
- Step 301: Divide the log set M into m log slices $W_1, W_2, \ldots, W_m$ using a sliding window of predetermined length and step size.
- Step 302: For each log slice $W_i$, use $\omega_{ij}$ to represent the number of occurrences of logs of type $s_j$ (each type in S) in the log slice. Let $w_i = (\omega_{i1}, \omega_{i2}, \ldots, \omega_{in})$ be the vector of the log slice, where n is the number of log types in S. The vectors of the m log slices form the rows of the matrix A.
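- As a hedged illustration of Step 302, the per-slice type counts can be assembled into one row per log slice, and the columns can then be centered as in Step 401. The function names are hypothetical, and column-wise mean centering is assumed as the meaning of centering.

```python
from collections import Counter


def count_matrix(slices, types):
    """One row per log slice; column j holds the count of logs of types[j]."""
    rows = []
    for slice_types in slices:
        counts = Counter(slice_types)
        rows.append([counts[t] for t in types])
    return rows


def center_columns(matrix):
    """Subtract each column mean so that every column sums to zero."""
    m = len(matrix)
    means = [sum(column) / m for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


types = ["a", "b", "c"]
slices = [["a", "a", "b"], ["b", "c"], ["a"]]
A = count_matrix(slices, types)
B = center_columns(A)
print(A)  # [[2, 1, 0], [0, 1, 1], [1, 0, 0]]
```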
- step 102 specifically includes: based on principal component analysis (PCA), determining a log slice with an unbalanced log type ratio from a plurality of log slices; and determining the log slice with an unbalanced log type ratio as an abnormal log slice.
- principal component analysis is a technique for simplifying a data set. It is a linear transformation. This transformation transforms the data into a new coordinate system so that the first largest variance of any data projection is on the first coordinate (called the first principal component), the second largest variance is on the second coordinate (called the second principal component), and so on. Principal component analysis is often used to reduce the dimensionality of a data set while keeping the features that contribute the most to the variance of the data set.
- the proportion of each type of log tends to be stable. If the proportions of the various log types become heavily imbalanced in certain time periods (certain log slices), an abnormal situation has very likely occurred in those time periods, and these log slices need to be focused on and analyzed. Principal component analysis can find the time periods with imbalanced proportions: it finds a low-dimensional space such that the sum of distances after projecting the data from the high-dimensional space onto the low-dimensional space is minimized.
- FIG. 4 is an exemplary flow chart of a log slice analysis process according to an embodiment of the present invention. As shown in FIG. 4, the method includes:
- Step 401: Center matrix A to obtain matrix B.
- Step 403: Select 90% as the proportion of variance retained after dimensionality reduction, and calculate k such that $\sum_{j=1}^{k}\lambda_j \big/ \sum_{j=1}^{n}\lambda_j \ge 90\%$.
- Step 407: Mark the point y as abnormal, and exit this process.
- $Q_\alpha$ represents the threshold statistic of the SPE residual function at the $(1-\alpha)$ confidence level, which can be expressed as $Q_\alpha = \varphi_1\left[\frac{C_\alpha\sqrt{2\varphi_2 h_0^2}}{\varphi_1} + 1 + \frac{\varphi_2 h_0 (h_0 - 1)}{\varphi_1^2}\right]^{1/h_0}$, with $\varphi_i = \sum_{j=k+1}^{n}\lambda_j^i$ (i = 1, 2, 3) and $h_0 = 1 - \frac{2\varphi_1\varphi_3}{3\varphi_2^2}$, where $\lambda_j$ represents the eigenvalue of the j-th principal component of the sample data covariance matrix projected on the subspace, and $C_\alpha$ represents the $1-\alpha$ percentile of the standard normal distribution.
- Step 408: Determine that it is a normal log slice and exit this process.
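- The decision between Step 407 and Step 408 can be illustrated with the sketch below. It assumes that the squared prediction error (SPE) of a centered slice vector is its squared distance to the subspace spanned by the first k principal axes, and that the threshold is the Q-statistic commonly used with PCA residuals; both are assumptions about the steps that are not legible in this text, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def spe(y, principal_axes):
    """Squared norm of the part of y outside the span of orthonormal axes."""
    residual = list(y)
    for axis in principal_axes:
        projection = sum(a * b for a, b in zip(y, axis))
        residual = [r - projection * a for r, a in zip(residual, axis)]
    return sum(r * r for r in residual)


def q_threshold(residual_eigenvalues, alpha=0.05):
    """Q-statistic threshold from the eigenvalues of the discarded components."""
    phi1 = sum(residual_eigenvalues)
    phi2 = sum(l ** 2 for l in residual_eigenvalues)
    phi3 = sum(l ** 3 for l in residual_eigenvalues)
    h0 = 1 - 2 * phi1 * phi3 / (3 * phi2 ** 2)
    c_alpha = NormalDist().inv_cdf(1 - alpha)  # 1 - alpha percentile of N(0, 1)
    base = (c_alpha * math.sqrt(2 * phi2 * h0 ** 2) / phi1
            + 1 + phi2 * h0 * (h0 - 1) / phi1 ** 2)
    return phi1 * base ** (1 / h0)


def is_abnormal(y, principal_axes, residual_eigenvalues, alpha=0.05):
    """Flag the slice vector when its SPE exceeds the threshold."""
    return spe(y, principal_axes) > q_threshold(residual_eigenvalues, alpha)


print(spe([3.0, 4.0, 0.0], [[1.0, 0.0, 0.0]]))  # 16.0
```

- A smaller alpha raises the threshold, so fewer log slices are flagged as abnormal.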
- Step 103: Based on the type of logs included in the abnormal log slice, determine the log slice abnormal mode.
- step 103 specifically includes: determining the number N of log slice abnormal patterns; when the number m of abnormal log slices is less than or equal to N, determining the corresponding log slice abnormal pattern based on the type sequence of logs contained in each abnormal log slice to obtain m log slice abnormal patterns; when the number m of abnormal log slices is greater than N, clustering is performed based on the similarity between the type sequences of logs contained in the abnormal log slices to obtain N log slice abnormal patterns.
- the number N of log slice abnormal patterns is preset.
- the type sequence of logs contained in each abnormal log slice can be determined as the corresponding log slice abnormal pattern.
- the log slice abnormal pattern includes the type sequence of logs contained in the abnormal log slice, wherein each type in the type sequence can be characterized as a log string corresponding to the type.
- clustering is performed based on the similarity between the type sequences of logs contained in the abnormal log slice to obtain N log slice abnormal patterns.
- clustering is performed based on the similarity between the type sequences of logs included in the abnormal log piece to obtain N log piece abnormal patterns, including: taking the type sequences of logs included in the abnormal log piece as set elements to combine into a log piece abnormal pattern set; in the log piece abnormal pattern set, merging any two set elements with the greatest similarity to update the log piece abnormal pattern set, until the number of set elements of the log piece abnormal pattern set is equal to N.
- the process of determining the similarity between any two set elements includes: determining the longest common subsequence length of the first set element of the first abnormal log piece and the second set element of the second abnormal log piece, wherein the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces; based on the number of logs in the first abnormal log piece, the number of logs in the second abnormal log piece and the longest common subsequence length, determining the similarity of the first set element of the first abnormal log piece and the second set element of the second abnormal log piece.
- any two set elements include a first set element $y_i$ and a second set element $y_j$.
- LCS(y i , y j ) is the longest common subsequence length of the first set element y i and the second set element y j ; the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces.
- Each log type in the log type sequence can be represented as the log string corresponding to the log type.
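- the similarity measure between $y_i$ and $y_j$ is defined as the ratio of the sum of the numbers of log entries in the two abnormal log slices to the length of the longest common subsequence $\mathrm{LCS}(y_i, y_j)$ of the log type sequences, that is, $d(y_i, y_j) = \frac{|y_i| + |y_j|}{\mathrm{LCS}(y_i, y_j)}$, where $|y_i|$ and $|y_j|$ are the numbers of logs in the first and second abnormal log slices. The longer the common subsequence, the smaller this ratio, so a smaller value indicates more similar log slices.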
- the longest common subsequence is used to measure the similarity of two log pieces, considering that the ordered arrangement of log types can determine an abnormal log piece pattern. Based on the distance measurement of the longest common subsequence, hierarchical clustering is performed to obtain abnormal log pieces of different categories, namely, log piece abnormal patterns.
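- The longest-common-subsequence measure can be sketched with the standard dynamic program below. The ratio (|a| + |b|) / LCS follows the wording of the description; returning infinity when the two sequences share no type, and the comparison sequence `y2`, are assumptions made for illustration.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence, using one row of memory."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[len(b)]


def lcs_ratio(a, b):
    """(|a| + |b|) / LCS(a, b); smaller means the sequences are closer."""
    common = lcs_length(a, b)
    if common == 0:
        return math.inf
    return (len(a) + len(b)) / common


y1 = [6, 6, 3, 2, 10, 6, 3, 2, 3]   # type indices of the example sequence y1
y2 = [6, 3, 2, 10, 3]               # hypothetical second sequence
print(lcs_length(y1, y2))  # 5
print(lcs_ratio(y1, y2))   # 2.8
```

- Two identical sequences give the minimum ratio of 2, because the common subsequence is then as long as each sequence.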
- FIG. 5 is an exemplary flow chart of determining an abnormal mode of a log slice according to an embodiment of the present invention. As shown in FIG. 5, the method includes:
- Step 502: $P \leftarrow \{\{y_1\}, \{y_2\}, \{y_3\}, \ldots\}$; where "←" represents an assignment operation.
- P is the set of log slice abnormal patterns.
- $y_1, y_2, y_3, \ldots$ are the type sequences of the abnormal log slices.
- For example, $y_1 = [s_6, s_6, s_3, s_2, s_{10}, s_6, s_3, s_2, s_3]$, which means that the logs in the type sequence $y_1$ are of the 6th, 6th, 3rd, 2nd, 10th, 6th, 3rd, 2nd and 3rd types respectively.
- Each type in the type sequence $y_1$ (i.e., any one of $s_6, s_6, s_3, s_2, s_{10}, s_6, s_3, s_2, s_3$) can be represented as the log string corresponding to that type.
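- Step 503: Determine whether the number of set elements in P is less than or equal to N; if so, execute step 504; otherwise, execute step 505.
- Step 504: Take each set element in P as a log slice abnormal pattern and end this process.
- Step 505: Select $P_i$, $P_j$ from P such that the distance between $P_i$ and $P_j$ is minimized, and execute $P \leftarrow P - \{P_i\} - \{P_j\} + \{P_i \cup P_j\}$.
- Step 505 is repeatedly executed until the number of set elements in P is equal to N.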
- For example, assume that there are 10 abnormal log slices, namely abnormal log slice 1 to abnormal log slice 10.
- the log type sequences of abnormal log slice 1 to abnormal log slice 10 are $y_1$ to $y_{10}$ respectively.
- When N is greater than or equal to 10, $y_1$ to $y_{10}$ are determined as 10 log slice abnormal patterns, each of which contains a corresponding log type sequence.
- Each log type in the log type sequence can be represented as a log string corresponding to the log type.
- For example, the log slice abnormal pattern corresponding to $y_1$ is $[s_6, s_6, s_3, s_2, s_{10}, s_6, s_3, s_2, s_3]$.
- When N is less than 10 (for example, N is equal to 8), the type sequences of the logs contained in the 10 abnormal log slices are respectively taken as set elements and combined into a log slice abnormal pattern set.
- For example, the log slice abnormal pattern set is $\{\{y_1\}, \{y_2\}, \{y_3\}, \ldots, \{y_{10}\}\}$.
- any two set elements with the greatest similarity are merged to update the log slice abnormal pattern set until the number of set elements of the log slice abnormal pattern set is equal to N. For example, when the similarity between y 1 and y 2 is the greatest, y 1 and y 2 are merged.
- the log slice abnormal pattern set is changed to: $\{\{y_1, y_2\}, \{y_3\}, \ldots, \{y_{10}\}\}$. It can be seen that the number of elements in the set is reduced by one, and $\{y_1, y_2\}$ becomes a set element in the set. Continue to merge any two set elements with the greatest similarity in the changed set to continue updating the log slice abnormal pattern set until the number of set elements is 8. Assume that the final log slice abnormal pattern set is: $\{\{y_1, y_2, y_3\}, \{y_4\}, \ldots, \{y_{10}\}\}$. Among them, $\{y_1, y_2, y_3\}$ corresponds to one log slice abnormal pattern.
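- The merging loop can be sketched as a small agglomerative clustering. The distance between two merged set elements is taken here as the average of the pairwise (|a| + |b|) / LCS ratios of their member sequences; this average linkage is an assumption, since the description does not state how merged elements are compared during merging. All function names and the example sequences are hypothetical.

```python
import math
from itertools import combinations


def lcs_length(a, b):
    """Length of the longest common subsequence of two type sequences."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            current.append(previous[j - 1] + 1 if x == y
                           else max(previous[j], current[j - 1]))
        previous = current
    return previous[len(b)]


def distance(a, b):
    """(|a| + |b|) / LCS(a, b); infinite when nothing is shared."""
    common = lcs_length(a, b)
    return (len(a) + len(b)) / common if common else math.inf


def element_distance(first, second):
    """Average pairwise distance between the members of two set elements."""
    pairs = [(a, b) for a in first for b in second]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def merge_patterns(sequences, n):
    """Merge the closest set elements until at most n elements remain."""
    if n < 1:
        raise ValueError("n must be at least 1")
    elements = [[seq] for seq in sequences]   # P <- {{y1}, {y2}, ...}
    while len(elements) > n:
        i, j = min(combinations(range(len(elements)), 2),
                   key=lambda p: element_distance(elements[p[0]], elements[p[1]]))
        merged = elements[i] + elements[j]
        elements = [e for k, e in enumerate(elements) if k not in (i, j)]
        elements.append(merged)               # P <- P - {Pi} - {Pj} + {Pi U Pj}
    return elements


sequences = [[1, 2, 3], [1, 2, 3, 4], [7, 8, 9], [7, 8]]
patterns = merge_patterns(sequences, 2)
print(patterns)  # [[[1, 2, 3], [1, 2, 3, 4]], [[7, 8, 9], [7, 8]]]
```

- When the number of sequences is already less than or equal to n, the loop does not run and every sequence remains its own pattern.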
- Step 104: Compare the type of the logs included in the log slice to be tested with the log slice abnormal pattern.
- the log slice to be tested is a log slice whose abnormal pattern needs to be determined, such as a log slice obtained in real time.
- Step 105: Determine the abnormal mode of the log slice to be tested based on the comparison result.
- step 104 specifically includes: determining the similarity between the log type sequence in the log piece to be tested and each log piece abnormal pattern in the log piece abnormal pattern set;
- step 105 specifically includes: determining the log piece abnormal pattern in the log piece abnormal pattern set that has the highest similarity with the log type sequence in the log piece to be tested and is greater than a predetermined threshold value as the abnormal pattern of the log piece to be tested.
- the log piece to be tested is determined to be normal.
- a log piece to be tested is obtained, and then the unstructured information of each log in the log piece to be tested is divided into word sequences with spaces as separators, and then the log string corresponding to the word sequence is determined based on each word sequence (the character position is equal to the character position in step 203), and then each log string is compared with each type in the log type set for similarity (the similarity determination method can refer to the above formula for calculating sim(T′ i , T′ j )) to determine the type of each log, thereby obtaining the log type sequence in the log piece to be tested.
- the log slice to be tested can be added to the log set in step 101, and the process shown in FIG. 1 is executed again to update the log slice abnormal pattern set. The updated log slice abnormal pattern set is then used to execute steps 104 and 105.
- when a log slice abnormal pattern in the log slice abnormal pattern set is a set element that has undergone set merging, the method further includes: determining a sub-similarity between the log type sequence contained in the log slice to be tested and each log type sequence that took part in the set merging in the log slice abnormal pattern; and determining the weighted sum of the sub-similarities as the similarity between the log type sequence in the log slice to be tested and the log slice abnormal pattern.
- the log slice abnormal pattern set is: {y1, y2, y3}, {y4}, ..., {y10}.
- {y1, y2, y3} is a set element obtained by set merging.
- the similarity (called a sub-similarity) between the log type sequence in the log slice to be tested and each of y1, y2 and y3 can be calculated to obtain three sub-similarities, and the weighted sum of the three sub-similarities (the weights can be set) is then determined as the similarity between the log type sequence in the log slice to be tested and the log slice abnormal pattern.
- FIG6 is a structural diagram of an apparatus for determining an abnormal mode of a log sheet according to an embodiment of the present invention.
- the apparatus 600 for determining an abnormal mode of a log sheet includes:
- a first determination module 601 is configured to determine the type of a log based on a character string of a log included in a log set;
- a second determination module 602 is configured to determine an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by partitioning the log set based on a sliding time window;
- the third determination module 603 is configured to determine the log piece abnormality mode based on the type of logs included in the abnormal log piece;
- a comparison module 604 is configured to compare the type of log included in the log piece to be tested with the log piece abnormality pattern
- the fourth determination module 605 is configured to determine the abnormal mode of the log slice to be tested based on the comparison result.
- the first determination module 601 is configured to: extract log content characterized as unstructured information from the log; divide the log content into word sequences with spaces as delimiters; group the logs based on the length of the word sequences; for each group: determine the log string of each word sequence based on the characters at a predetermined position of each word in each word sequence contained in the group; perform clustering based on the similarity between any two log strings to obtain the type of log contained in the group; and combine the types of the logs of all groups into a log type set.
- any two log strings include a first log string and a second log string: the first determination module 601 is configured to: determine the similarity between characters at the same character position in the first log string and the second log string; and, based on the similarity between the characters at the same character position, determine the similarity between the first log string and the second log string.
- the third determination module 603 is configured to: determine the number N of log slice abnormal patterns; when the number m of abnormal log slices is less than or equal to N, determine the corresponding log slice abnormal pattern based on the type sequence of logs contained in each abnormal log slice to obtain m log slice abnormal patterns; when the number m of abnormal log slices is greater than N, perform clustering based on the similarity between the type sequences of logs contained in the abnormal log slices to obtain N log slice abnormal patterns.
- the third determination module 603 is configured to: use the type sequence of logs contained in the abnormal log piece as set elements to combine into a log piece abnormal pattern set; in the log piece abnormal pattern set, any two set elements with the greatest similarity are merged to update the log piece abnormal pattern set until the number of set elements of the log piece abnormal pattern set is equal to N.
- the third determination module 603 is configured to: determine the longest common subsequence length of the first set elements of the first abnormal log piece and the second set elements of the second abnormal log piece, wherein the first abnormal log piece and the second abnormal log piece are any two different abnormal log pieces; based on the number of logs in the first abnormal log piece, the number of logs in the second abnormal log piece and the longest common subsequence length, determine the similarity between the first set elements of the first abnormal log piece and the second set elements of the second abnormal log piece.
- the comparison module 604 is configured to: when a log slice anomaly pattern in a log slice anomaly pattern set is a set element that has undergone a set merge, determine the sub-similarity between the log type sequence contained in the log slice to be tested and each log type sequence participating in the set merge in the log slice anomaly pattern; and determine the weighted sum of the sub-similarity values as the similarity between the log type sequence in the log slice to be tested and the log slice anomaly pattern.
- FIG. 7 is a structural diagram of an electronic device according to an embodiment of the present invention.
- the electronic device 700 includes a processor
- the memory 702 can be implemented as a variety of storage media such as an electrically erasable programmable read-only memory (EEPROM), a flash memory, and a programmable read-only memory (PROM).
- the processor 701 can be implemented as including one or more central processing units or one or more field programmable gate arrays, wherein the field programmable gate array integrates one or more central processing unit cores.
- the central processing unit or the central processing unit core can be implemented as a CPU, an MCU or a DSP, etc.
- a hardware module may include a specially designed permanent circuit or logic device (such as a dedicated processor, such as an FPGA or ASIC) to perform a specific operation.
- the hardware module may also include a programmable logic device or circuit (such as a general-purpose processor or other programmable processor) temporarily configured by software to perform a specific operation.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Disclosed in the embodiments of the present invention are a method for determining an abnormal mode of a log slice, an apparatus, a device and a storage medium. The method comprises: on the basis of character strings contained in logs in a log set, determining the types of the logs; determining an abnormal log slice from amongst a plurality of log slices, the plurality of log slices being obtained by dividing the log set on the basis of a time sliding window; on the basis of the type of a log contained in the abnormal log slice, determining a log slice abnormal mode; comparing the type of a log contained in a log slice to be tested with the log slice abnormal mode; and, on the basis of a comparison result, determining an abnormal mode of said log slice. Automatically determining the abnormal mode on the basis of log character strings overcomes the defects of low accuracy and large workload in manual log analysis. Performing clustering on the basis of the similarity between log character strings enables clustering to be completed without manual labeling, thus improving convenience. Sliding window-based log slice partitioning has a broader application range.
Description
The present invention relates to the field of log management technology, and in particular to a method, an apparatus, a device and a storage medium for determining an abnormal mode of a log slice.
Network devices, systems and service programs usually generate event records, called logs, while they operate. A log can record the date, the time, the user, the action and descriptions of related operations. From the logs, system developers and operations staff can detect abnormal system behavior and errors.
As computing grows more complex and applications more diverse, the types and volume of logs keep increasing. Traditional manual log analysis can hardly meet the requirements of daily analysis.
Summary of the invention
Embodiments of the present invention provide a method, an apparatus, a device and a storage medium for determining an abnormal mode of a log slice.
A method for determining an abnormal mode of a log slice, comprising:
determining the type of a log based on the character string of the log contained in a log set;
determining an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by partitioning the log set based on a sliding time window;
determining a log slice abnormal pattern based on the types of the logs contained in the abnormal log slice;
comparing the types of the logs contained in a log slice to be tested with the log slice abnormal pattern; and
determining the abnormal mode of the log slice to be tested based on the comparison result.
Therefore, the embodiments of the present invention automatically determine the abnormal pattern based on the log strings in the log set, which overcomes the low accuracy and heavy workload of manual log analysis. Moreover, partitioning log slices by a sliding window applies more widely than partitioning by session.
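The sliding-time-window partitioning named above can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function name `slice_logs`, the `(timestamp, log_type)` tuple layout, and the `window` and `step` parameters are hypothetical, and the source does not state a window length or stride.

```python
from typing import List, Tuple

# Hypothetical record layout: (timestamp in seconds, log type).
Log = Tuple[float, str]


def slice_logs(logs: List[Log], window: float, step: float) -> List[List[str]]:
    """Partition logs into (possibly overlapping) slices of log types.

    Each slice covers the half-open interval [t, t + window); the window
    start t advances by `step`. Empty windows are skipped.
    """
    if not logs or window <= 0 or step <= 0:
        return []
    ordered = sorted(logs, key=lambda entry: entry[0])
    first, last = ordered[0][0], ordered[-1][0]
    slices: List[List[str]] = []
    t = first
    while t <= last:
        types = [kind for ts, kind in ordered if t <= ts < t + window]
        if types:
            slices.append(types)
        t += step
    return slices
```

With `step` smaller than `window`, neighbouring slices overlap, which is what distinguishes a sliding window from a fixed partition; no session identifier is needed, in line with the remark that this partitioning applies more widely than session-based partitioning.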
In one embodiment, determining the type of the log based on the character string of the log contained in the log set includes:
extracting, from the log, the log content characterized as unstructured information;
dividing the log content into a word sequence using spaces as delimiters;
grouping the logs based on the length of the word sequence;
for each group: determining the log string of each word sequence based on the character at a predetermined position of each word in each word sequence contained in the group; and performing clustering based on the similarity between any two of the log strings to obtain the types of the logs contained in the group;
combining the log types of all groups into a log type set.
It can be seen that the embodiments of the present invention also cluster based on the similarity between log strings, so clustering is completed without manual labeling, which improves convenience.
In one embodiment, the predetermined position is the first character or the last character of the word.
Therefore, extracting the characters at predetermined positions reduces the difficulty of clustering.
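The word splitting, length-based grouping and log string construction described above can be illustrated with a short Python sketch. The function names `log_string` and `group_by_length` and the `position` parameter are assumptions made for illustration; the clustering step is left out here.

```python
from collections import defaultdict
from typing import Dict, List


def log_string(content: str, position: int = 0) -> str:
    """Build a log string from one character per word.

    position=0 takes the first character of each word, position=-1 the last.
    """
    return "".join(word[position] for word in content.split(" ") if word)


def group_by_length(contents: List[str], position: int = 0) -> Dict[int, List[str]]:
    """Group log strings by the length of the word sequence they came from."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for content in contents:
        words = [word for word in content.split(" ") if word]
        groups[len(words)].append(log_string(content, position))
    return dict(groups)
```

For example, "user bob connected" and "user bob disconnected" both have three words and fall into the same group, with log strings "ubc" and "ubd" when the first character is used.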
In one embodiment, the any two log strings include a first log string and a second log string, and the method further includes: determining the similarity between characters at the same character position in the first log string and the second log string; and determining the similarity between the first log string and the second log string based on the similarity between the characters at the same character positions.
It can be seen that comparing the log strings position by position both preserves classification accuracy and reduces the difficulty of clustering.
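A position-by-position comparison of this kind might look like the sketch below. The scoring is an assumption: a character pair scores 1 when equal and 0 otherwise, and the string similarity is the mean of those scores. The formula actually used by the embodiment is defined elsewhere in the description and may differ.

```python
def char_similarity(a: str, b: str) -> float:
    """Assumed per-position score: 1.0 for identical characters, else 0.0."""
    return 1.0 if a == b else 0.0


def string_similarity(first: str, second: str) -> float:
    """Mean per-position similarity of two log strings of equal length.

    Logs are grouped by word-sequence length before comparison, so log
    strings in one group have the same length; unequal lengths score 0.0.
    """
    if len(first) != len(second) or not first:
        return 0.0
    total = sum(char_similarity(a, b) for a, b in zip(first, second))
    return total / len(first)
```

Under this assumed scoring, "ubc" and "ubd" agree at two of three positions.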
In one embodiment, determining an abnormal log slice from a plurality of log slices includes:
determining, based on principal component analysis, a log slice with an unbalanced log type ratio from the plurality of log slices; and
determining the log slice with the unbalanced log type ratio as the abnormal log slice.
It can be seen that principal component analysis makes it convenient to determine abnormal log slices.
In one embodiment, determining the log slice abnormal pattern based on the types of the logs contained in the abnormal log slice includes:
determining the number N of log slice abnormal patterns;
when the number m of abnormal log slices is less than or equal to N, determining a corresponding log slice abnormal pattern based on the type sequence of the logs contained in each abnormal log slice, to obtain m log slice abnormal patterns;
when the number m of abnormal log slices is greater than N, performing clustering based on the similarity between the type sequences of the logs contained in the abnormal log slices, to obtain N log slice abnormal patterns.
Therefore, taking the preset number into account and determining the log slice abnormal patterns differently in the two cases improves ease of implementation.
In one embodiment, performing clustering based on the similarity between the type sequences of the logs contained in the abnormal log slices to obtain N log slice abnormal patterns includes:
taking the type sequence of the logs contained in each abnormal log slice as a set element, and combining the set elements into a log slice abnormal pattern set;
in the log slice abnormal pattern set, merging the two set elements with the greatest similarity to update the log slice abnormal pattern set, until the number of set elements of the log slice abnormal pattern set is equal to N.
It can be seen that updating the log slice abnormal pattern set by set merging both retains the type sequence of every abnormal log slice and reduces the number of abnormal patterns.
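The merge-until-N procedure can be sketched in Python as below. Several parts are assumptions made only to keep the example self-contained: the `jaccard` function is a placeholder sequence similarity, and the similarity between two already-merged set elements is taken as the best pairwise match, which the source does not specify. The names `build_patterns` and `cluster_similarity` are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence

# One pattern (set element) holds every type sequence merged into it.
Pattern = List[Sequence[str]]


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Placeholder sequence similarity (an assumption, not the patent's)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


def cluster_similarity(p: Pattern, q: Pattern, sim: Callable) -> float:
    """Similarity of two set elements: best pairwise match (assumed linkage)."""
    return max(sim(a, b) for a in p for b in q)


def build_patterns(sequences: List[Sequence[str]], n: int,
                   sim: Callable = jaccard) -> List[Pattern]:
    """Return at most n patterns; if m <= n each sequence is its own pattern."""
    patterns: List[Pattern] = [[seq] for seq in sequences]
    while len(patterns) > max(n, 1):
        # Find the pair of set elements with the greatest similarity.
        i, j = max(
            combinations(range(len(patterns)), 2),
            key=lambda ij: cluster_similarity(patterns[ij[0]], patterns[ij[1]], sim),
        )
        merged = patterns[i] + patterns[j]
        patterns = [p for k, p in enumerate(patterns) if k not in (i, j)]
        patterns.append(merged)
    return patterns
```

Because a merged pattern keeps the original type sequences rather than a summary of them, every abnormal log slice's sequence remains available for later comparison, as the paragraph above notes.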
In one embodiment, the method further includes:
determining the length of the longest common subsequence of a first set element of a first abnormal log slice and a second set element of a second abnormal log slice, wherein the first abnormal log slice and the second abnormal log slice are any two different abnormal log slices; and
determining, based on the number of logs in the first abnormal log slice, the number of logs in the second abnormal log slice and the longest common subsequence length, the similarity between the first set element of the first abnormal log slice and the second set element of the second abnormal log slice.
It can be seen that, based on the longest common subsequence and hierarchical clustering, log slice abnormal patterns can be mined accurately.
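A similarity built from the three quantities named above (the two log counts and the longest common subsequence length) can be sketched as follows. The normalisation `2 * LCS / (n1 + n2)` is an assumption chosen for illustration; this passage names the inputs but does not give the formula.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalisation: twice the LCS length over the total log count."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

Unlike a set-based measure, the longest common subsequence respects the order in which log types occur, so two slices with the same types in a different order score lower than two slices that share the same ordering.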
In one embodiment, comparing the types of the logs contained in the log slice to be tested with the log slice abnormal pattern includes: determining the similarity between the log type sequence in the log slice to be tested and each log slice abnormal pattern in the log slice abnormal pattern set;
determining the abnormal mode of the log slice to be tested based on the comparison result includes: determining, as the abnormal mode of the log slice to be tested, the log slice abnormal pattern in the log slice abnormal pattern set whose similarity with the log type sequence in the log slice to be tested is the highest and is greater than a predetermined threshold.
Therefore, similarity comparison makes it possible to resolve the abnormal pattern of the log slice to be tested.
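The selection rule (highest similarity, and greater than a threshold) is small enough to show directly. In this sketch the function name `pick_pattern`, the dictionary keyed by pattern name, and the use of `None` for "no pattern matched" are all assumptions for illustration.

```python
from typing import Dict, Optional


def pick_pattern(similarities: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the pattern with the highest similarity if it exceeds the threshold.

    `similarities` maps a (hypothetical) pattern name to the similarity between
    that pattern and the log slice under test. None means no pattern matched.
    """
    if not similarities:
        return None
    best = max(similarities, key=similarities.get)
    return best if similarities[best] > threshold else None
```

The comparison is strict, following the wording "greater than a predetermined threshold"; a slice whose best score only equals the threshold is not assigned a pattern.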
In one embodiment, when a log slice abnormal pattern in the log slice abnormal pattern set is a set element that has undergone set merging, the method further includes:
determining a sub-similarity between the log type sequence contained in the log slice to be tested and each log type sequence that took part in the set merging in the log slice abnormal pattern; and
determining the weighted sum of the sub-similarities as the similarity between the log type sequence in the log slice to be tested and the log slice abnormal pattern.
It can be seen that the weighted sum of the similarities to each log type sequence that took part in the set merging takes every such sequence into account, which improves the accuracy of the similarity calculation.
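The weighted sum of sub-similarities can be sketched as below. The source only says that the weights can be set; the equal-weight default and the name `pattern_similarity` are assumptions of this example.

```python
from typing import Optional, Sequence


def pattern_similarity(sub_similarities: Sequence[float],
                       weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of the sub-similarities to one merged pattern.

    When no weights are given, equal weights summing to 1 are assumed.
    """
    if not sub_similarities:
        raise ValueError("at least one sub-similarity is required")
    if weights is None:
        weights = [1.0 / len(sub_similarities)] * len(sub_similarities)
    if len(weights) != len(sub_similarities):
        raise ValueError("weights and sub-similarities must have equal length")
    return sum(w * s for w, s in zip(weights, sub_similarities))
```

For a merged pattern holding three type sequences, the three sub-similarities are combined into one score, which can then be compared with the scores of the other patterns in the set.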
An apparatus for determining an abnormal mode of a log slice, comprising:
a first determination module, configured to determine the type of a log based on the character string of the log contained in a log set;
a second determination module, configured to determine an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by partitioning the log set based on a sliding time window;
a third determination module, configured to determine a log slice abnormal pattern based on the types of the logs contained in the abnormal log slice;
a comparison module, configured to compare the types of the logs contained in a log slice to be tested with the log slice abnormal pattern; and
a fourth determination module, configured to determine the abnormal mode of the log slice to be tested based on the comparison result.
Therefore, the embodiments of the present invention automatically determine the abnormal pattern based on log strings, which overcomes the low accuracy and heavy workload of manual log analysis. Moreover, partitioning log slices by a sliding window applies more widely than partitioning by session.
In one embodiment, the first determination module is configured to:
extract, from the log, the log content characterized as unstructured information;
divide the log content into a word sequence using spaces as delimiters;
group the logs based on the length of the word sequence;
for each group: determine the log string of each word sequence based on the character at a predetermined position of each word in each word sequence contained in the group; and perform clustering based on the similarity between any two of the log strings to obtain the types of the logs contained in the group;
combine the log types of all groups into a log type set.
It can be seen that the embodiments of the present invention also cluster based on the similarity between log strings, so clustering is completed without manual labeling, which improves convenience.
In one embodiment, the any two log strings include a first log string and a second log string;
the first determination module is configured to: determine the similarity between characters at the same character position in the first log string and the second log string; and determine the similarity between the first log string and the second log string based on the similarity between the characters at the same character positions.
It can be seen that comparing the log strings position by position both preserves classification accuracy and reduces the difficulty of clustering.
In one embodiment, the third determination module is configured to: determine the number N of log slice abnormal patterns; when the number m of abnormal log slices is less than or equal to N, determine a corresponding log slice abnormal pattern based on the type sequence of the logs contained in each abnormal log slice, to obtain m log slice abnormal patterns; and when the number m of abnormal log slices is greater than N, perform clustering based on the similarity between the type sequences of the logs contained in the abnormal log slices, to obtain N log slice abnormal patterns.
Therefore, taking the preset number into account and determining the log slice abnormal patterns differently in the two cases improves ease of implementation.
In one embodiment, the third determination module is configured to: take the type sequence of the logs contained in each abnormal log slice as a set element, and combine the set elements into a log slice abnormal pattern set; and, in the log slice abnormal pattern set, merge the two set elements with the greatest similarity to update the log slice abnormal pattern set, until the number of set elements of the log slice abnormal pattern set is equal to N.
It can be seen that updating the log slice abnormal pattern set by set merging both retains the type sequence of every abnormal log slice and reduces the number of abnormal patterns.
In one embodiment, the third determination module is configured to: determine the length of the longest common subsequence of a first set element of a first abnormal log slice and a second set element of a second abnormal log slice, wherein the first abnormal log slice and the second abnormal log slice are any two different abnormal log slices; and determine, based on the number of logs in the first abnormal log slice, the number of logs in the second abnormal log slice and the longest common subsequence length, the similarity between the first set element of the first abnormal log slice and the second set element of the second abnormal log slice.
It can be seen that, based on the longest common subsequence and hierarchical clustering, log slice abnormal patterns can be mined accurately.
In one embodiment, the comparison module is configured to: determine the similarity between the log type sequence in the log slice to be tested and each log slice abnormal pattern in the log slice abnormal pattern set;
the fourth determination module is configured to: determine, as the abnormal mode of the log slice to be tested, the log slice abnormal pattern in the log slice abnormal pattern set whose similarity with the log type sequence in the log slice to be tested is the highest and is greater than a predetermined threshold.
Therefore, similarity comparison makes it possible to resolve the abnormal pattern of the log slice to be tested.
In one embodiment, the comparison module is configured to: when a log slice abnormal pattern in the log slice abnormal pattern set is a set element that has undergone set merging, determine a sub-similarity between the log type sequence contained in the log slice to be tested and each log type sequence that took part in the set merging in the log slice abnormal pattern; and determine the weighted sum of the sub-similarities as the similarity between the log type sequence in the log slice to be tested and the log slice abnormal pattern.
It can be seen that the weighted sum of the similarities to each log type sequence that took part in the set merging takes every such sequence into account, which improves the accuracy of the similarity calculation.
An electronic device, comprising:
a processor;
a memory, configured to store executable instructions of the processor;
the processor being configured to read the executable instructions from the memory and execute the executable instructions to implement the method for determining an abnormal mode of a log slice as described in any one of the above.
A computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the method for determining an abnormal mode of a log slice as described in any one of the above.
A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the method for determining an abnormal mode of a log slice as described in any one of the above.
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that those of ordinary skill in the art can better understand the above and other features and advantages of the present invention. In the accompanying drawings:
FIG. 1 is a flow chart of a method for determining an abnormal mode of a log slice according to an embodiment of the present invention.
FIG. 2 is an exemplary flow chart of a log grouping process according to an embodiment of the present invention.
FIG. 3 is an exemplary flow chart of a log slice preprocessing process according to an embodiment of the present invention.
FIG. 4 is an exemplary flow chart of a log slice analysis process according to an embodiment of the present invention.
FIG. 5 is an exemplary flow chart of determining a log slice abnormal pattern according to an embodiment of the present invention.
FIG. 6 is a structural diagram of an apparatus for determining an abnormal mode of a log slice according to an embodiment of the present invention.
FIG. 7 is a structural diagram of an electronic device according to an embodiment of the present invention.
The reference numerals are as follows:
To make the purpose, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to embodiments.
For conciseness and clarity, the solution of the present invention is explained below by describing several representative embodiments. The many details in the embodiments serve only to aid understanding of the solution. Obviously, however, the technical solution of the present invention need not be limited to these details when implemented. To avoid unnecessarily obscuring the solution, some embodiments are not described in detail, and only a framework is given. Hereinafter, "including" means "including but not limited to", and "according to..." means "at least according to..., but not limited to only according to...". Owing to Chinese language conventions, when the number of a component is not specified below, the component may be one or more, or may be understood as at least one.
The embodiments of the present invention propose a method for determining an abnormal mode of a log slice based on clustering and principal component analysis, which overcomes the low accuracy and heavy workload of manual log analysis. Moreover, the embodiments do not require the log data to be labeled, so log analysis is unsupervised and the labeling workload is reduced. In addition, the embodiments partition log slices by a sliding window, which applies more widely than partitioning log slices by session. Furthermore, through the longest common subsequence and hierarchical clustering, the embodiments can mine deep-level log slice abnormal patterns, which improves the efficiency of identifying abnormal patterns.
FIG. 1 is a flowchart of a method for determining an abnormal pattern of a log slice according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
Step 101: Determine the types of logs based on the character strings of the logs contained in a log set.
A log set may contain a large number of logs. For example, the log set may be obtained from historical log data: all logs within a predetermined historical period (e.g., the past week, month, or year) may be combined into the log set. A log $m_i$ can generally be represented as $m_i = \{m_{ih}, m_{ic}\}$, where $m_{ih}$ is the log header, which generally contains structured information such as the IP address and time, and $m_{ic}$ is the specific content of the log, which is generally unstructured information. The content is usually the output of a statement such as printf("user %s connected", user), for example "user bob connected" or "user bob disconnected".
In one embodiment, step 101 specifically includes: extracting, from each log, the log content represented as unstructured information; dividing the log content into a word sequence using spaces as delimiters; grouping the logs based on the lengths of the word sequences; for each group, determining a log string for each word sequence based on the character at a predetermined position of each word in that word sequence, and clustering based on the similarity between any two log strings to obtain the types of the logs contained in the group; and combining the log types of all groups into a log type set. Preferably, the predetermined position is the first character or the last character of the word, or a character at any specified position.
For example, $m_{ic}$ is divided into a word sequence $T_i$ using spaces as delimiters, where $T_i = [t_{i1}, t_{i2}, \ldots, t_{iL_i}]$, $L_i$ denotes the length of the sequence (i.e., the number of words), and $t_{ij}$ denotes the $j$-th word of the $i$-th log. For instance, when $m_{ic}$ is "user bob connected", $T_i = [\text{user}, \text{bob}, \text{connected}]$ and $L_i = 3$ (i.e., there are 3 words). All logs whose word sequences have the same length $L$ are placed in the same group $G_L = \{T_i \mid L_i = L\}$, so that all logs are roughly classified by log length.
FIG. 2 is an exemplary flowchart of a log grouping process according to an embodiment of the present invention. As shown in FIG. 2, the process includes:
Step 201: Obtain a log set $M = \{m_1, m_2, \ldots, m_{|M|}\}$. The $i$-th log $m_i$ in $M$ is represented as $m_i = \{m_{ih}, m_{ic}\}$, where $m_{ih}$ is the log header, which generally contains structured information such as the IP address and time, and $m_{ic}$ is the specific content of the log, which is generally unstructured information. The value range of $i$ is $[1, |M|]$.
Step 202: For each log in the log set $M$, that is, for the $i$-th log $m_i$, divide the log content into a word sequence $T_i = [t_{i1}, t_{i2}, \ldots, t_{iL_i}]$ using spaces as delimiters, where $L_i$ denotes the length of the sequence (i.e., the number of words) and $t_{ij}$ denotes the $j$-th word of the $i$-th log.
Step 203: Place the word sequences that have the same word sequence length $L$ in the same group $G_L$. That is, $G_L = \{T_i \mid L_i = L\}$.
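Steps 201 to 203 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `group_by_length` and the sample contents are assumptions, and the log header is assumed to have been stripped already so that each input string is the content $m_{ic}$.

```python
from collections import defaultdict

def group_by_length(contents):
    """Split each log content on spaces and group the word sequences by length."""
    groups = defaultdict(list)                  # L -> word sequences T_i with L_i == L
    for content in contents:
        # Space-delimited word sequence T_i; empty tokens from repeated spaces are dropped.
        words = [w for w in content.split(" ") if w]
        groups[len(words)].append(words)
    return dict(groups)

# Hypothetical sample contents (headers already removed).
contents = [
    "user bob connected",
    "user tom disconnected",
    "client disconnected the license server",
]
groups = group_by_length(contents)
```

Here `groups[3]` holds the two three-word sequences and `groups[5]` holds the single five-word sequence, which mirrors the rough classification by length described above.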
In addition, the embodiments of the present invention introduce the concept of a log string. A log string is the string formed by concatenating the character at a specified position (such as the first character) of each word in $T_i$, denoted $T'_i = t'_{i1} t'_{i2} \cdots t'_{iL_i}$, where $t'_{ij}$ denotes the first letter of $t_{ij}$. For example, when the specified position is the first character, the log string of "user bob connected" is "ubc". For each group, clustering can be performed based on the similarity between the log strings of the word sequences contained in the group, to obtain the types of the logs contained in the group. The log types of all groups are then combined into a log type set.
For example, assume that the log set contains log 1, log 2, log 3, log 4 and log 5. The content of log 1 is "user bob connected"; the content of log 2 is "user tom connected"; the content of log 3 is "user bob disconnected"; the content of log 4 is "user tom disconnected"; and the content of log 5 is "client disconnected the license server". The word sequence of log 1 is "user bob connected", with sequence length 3. The word sequence of log 2 is "user tom connected", with sequence length 3. The word sequence of log 3 is "user bob disconnected", with sequence length 3. The word sequence of log 4 is "user tom disconnected", with sequence length 3. The word sequence of log 5 is "client disconnected the license server", with sequence length 5. Logs 1 to 4 have the same sequence length (3) and are therefore placed in the group corresponding to sequence length 3, while log 5 is placed in another group corresponding to sequence length 5. The log string of log 1 is "ubc"; the log string of log 2 is "utc"; the log string of log 3 is "ubd"; the log string of log 4 is "utd"; and the log string of log 5 is "cdtls". For each group, clustering is then performed based on the similarity between any two log strings to obtain the types of the logs contained in the group. For example, 2 log types (called log type a and log type b) are clustered in the group with sequence length 3. In the group with sequence length 5, which contains only the log string of log 5, 1 log type (called log type c) is clustered. The log type set determined from this log set therefore contains log type a, log type b and log type c.
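The log strings in this example can be reproduced with a small helper. The sketch below is illustrative only: `log_string` and its `position` parameter are hypothetical names, and the fallback for words shorter than the requested position is an assumption that the text does not address.

```python
def log_string(content, position=0):
    """Concatenate the character at `position` of every word (0 = first, -1 = last)."""
    words = [w for w in content.split(" ") if w]
    # Assumption: a word too short for `position` contributes its last character.
    return "".join(
        w[position] if -len(w) <= position < len(w) else w[-1] for w in words
    )

logs = [
    "user bob connected",
    "user tom connected",
    "user bob disconnected",
    "user tom disconnected",
    "client disconnected the license server",
]
strings = [log_string(c) for c in logs]
print(strings)  # ['ubc', 'utc', 'ubd', 'utd', 'cdtls']
```

Passing `position=-1` selects the last character of each word instead, which corresponds to the other preferred predetermined position.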
Clustering based on the similarity between any two log strings within each group involves determining the similarity between any two log strings.
For example, let the two log strings be a first log string and a second log string. Determining the similarity between them includes: determining the similarity between the characters at each same character position of the first log string and the second log string; and determining the similarity between the first log string and the second log string based on those per-position similarities. Because the first log string and the second log string belong to the same group, they have the same number of characters, and that number is usually more than one. Accordingly, there are usually multiple character pairs, each formed by the two characters at the same character position, and multiple per-position similarities. The similarity between the first log string and the second log string can be determined from the weighted sum of all per-position similarities, where the weights of the individual per-position similarities may be equal or unequal.
In one specific implementation of determining the similarity between log strings, the two log strings are a first log string $T'_i$ and a second log string $T'_j$, and determining the similarity includes determining $\mathrm{sim}(T'_i, T'_j) = \frac{1}{L}\sum_{k=1}^{L} F(t'_{ik}, t'_{jk})$, where $t'_{ik}$ is the $k$-th character of $T'_i$; $t'_{jk}$ is the $k$-th character of $T'_j$; $L$ is the number of characters in each of $T'_i$ and $T'_j$; and $F(t'_{ik}, t'_{jk})$ is 1 when $t'_{ik}$ equals $t'_{jk}$ and 0 when $t'_{ik}$ does not equal $t'_{jk}$.
Specifically, for each group $G_L$ corresponding to a given sequence length, the similarity of any two log strings $T'_i$ and $T'_j$ is defined as $\mathrm{sim}(T'_i, T'_j)$ above. When $\mathrm{sim}(T'_i, T'_j) \ge \lambda$, the two log strings are considered similar, where $\lambda$ is a preset similarity threshold, for example $\lambda = 0.4$. A log string set, denoted $S_L$, can be created for each $G_L$ to store all the types of the logs in that group (each type can be represented by the log string of one log of that type). For each log $T_i$ in each $G_L$ in turn, the log string $T'_i$ is generated and its similarity to every log string in the $S_L$ corresponding to that $G_L$ is calculated. If $S_L$ contains a log string $s$ that is similar to $T'_i$, the log $T_i$ is added to the log set $G_s$ of that type; if not, a new log set is created for $T_i$ and $T'_i$ is added to $S_L$. The $S_L$ of all groups are then combined into the log type set $S = \{s_1, s_2, \ldots, s_{|S|}\}$, where $s_1, s_2, \ldots, s_{|S|}$ are log strings that each represent a log type.
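The type-building pass described above can be sketched as follows. The sketch assumes equal weights, so that the similarity is the fraction of matching character positions; that normalization is an assumption chosen to be compatible with a threshold such as 0.4. The names `sim` and `cluster_types` are illustrative, and the input is assumed to be the log strings of a single group (all of equal length).

```python
def sim(a, b):
    """Fraction of positions at which two equal-length log strings agree."""
    if len(a) != len(b) or not a:
        raise ValueError("log strings must be non-empty and of equal length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def cluster_types(log_strings, lam=0.4):
    """Greedy single pass: map each representative type string to its members."""
    types = {}                               # representative s -> members of G_s
    for t in log_strings:
        for s in types:                      # the first sufficiently similar type wins
            if sim(s, t) >= lam:
                types[s].append(t)
                break
        else:
            types[t] = [t]                   # no similar type: t starts a new one
    return types

types = cluster_types(["ubc", "utc", "ubd", "utd"])
```

With the four three-character log strings of the earlier example and a threshold of 0.4, this pass yields two types, represented by "ubc" and "utd". Because the pass is greedy, the result can depend on the order in which the logs are visited.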
The above describes a typical example of determining the similarity between log strings. Those skilled in the art will appreciate that this description is merely exemplary and is not intended to limit the scope of protection of the embodiments of the present invention.
Step 102: Determine abnormal log slices from a plurality of log slices, where the plurality of log slices are obtained by partitioning the log set with a sliding time window.
Here, a log slice preprocessing process can be performed in advance, in which the log set is partitioned with a sliding time window to obtain a plurality of log slices. For example, the log set $M$ is divided into $m$ log slices $W_1, W_2, \ldots, W_m$ using a sliding window with a length of 9 hours and a step of 1 hour. The logs contained in different log slices may be the same or different.
FIG. 3 is an exemplary flowchart of a log slice preprocessing process according to an embodiment of the present invention. As shown in FIG. 3, the process includes:
Step 301: Divide the log set $M$ into $m$ log slices $W_1, W_2, \ldots, W_m$ using a sliding window of predetermined length and step.
Step 302: For each log slice $W_i$, let $\xi_{ij}$ denote the number of times logs of type $s_j$ (for each type in $S$) occur in that log slice.
Step 303: Let $w_i = (\xi_{i1}, \xi_{i2}, \ldots, \xi_{i|S|})$ be the vector of that log slice. With $n = |S|$, all the vectors $w_i$ form a matrix $A$ with $m$ rows and $n$ columns.
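Steps 301 to 303 can be illustrated with the sketch below. It assumes that each log has already been reduced to a `(time, type)` pair, that windows are half-open intervals starting at the earliest timestamp, and that windows are generated until the start passes the last timestamp; these boundary choices and the names `slice_logs` and `count_matrix` are assumptions, since the text only fixes the window length and step.

```python
def slice_logs(logs, length, step):
    """Partition (time, type) pairs into overlapping slices [start, start + length)."""
    logs = sorted(logs)                                   # chronological order
    t_min, t_max = logs[0][0], logs[-1][0]
    slices, start = [], t_min
    while start <= t_max:                                 # assumption: stop after the last log
        slices.append([typ for t, typ in logs if start <= t < start + length])
        start += step
    return slices

def count_matrix(slices, types):
    """Row i holds xi_ij: how often type s_j occurs in slice W_i."""
    return [[s.count(typ) for typ in types] for s in slices]

logs = [(0, "a"), (1, "b"), (2, "a"), (3, "c")]           # hypothetical (hour, log type) pairs
slices = slice_logs(logs, length=2, step=1)
A = count_matrix(slices, ["a", "b", "c"])
```

Each slice keeps its log types in time order, so the same slices can later supply the type sequences used for pattern clustering, while `A` is the count matrix passed to principal component analysis.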
In one embodiment, step 102 specifically includes: determining, based on principal component analysis (PCA), the log slices whose log type proportions are imbalanced among the plurality of log slices; and determining the log slices with imbalanced log type proportions to be abnormal log slices.
In statistics, principal component analysis is a technique for simplifying a data set by means of a linear transformation. The transformation maps the data into a new coordinate system such that the greatest variance of any projection of the data lies on the first coordinate (called the first principal component), the second greatest variance lies on the second coordinate (called the second principal component), and so on. Principal component analysis is often used to reduce the dimensionality of a data set while retaining the features that contribute most to its variance.
Specifically, for each log slice $W_i$, $\xi_{ij}$ denotes the number of times logs of type $s_j$ (for each type in $S$) occur in that log slice. Let $w_i = (\xi_{i1}, \xi_{i2}, \ldots, \xi_{i|S|})$ be the vector of that log slice. With $n = |S|$, all the vectors $w_i$ form a matrix $A$ with $m$ rows and $n$ columns. Under normal circumstances, the proportions in which the log types occur tend to be stable. If the proportions of the log types become heavily imbalanced in certain time periods (certain log slices), an abnormal situation has very likely occurred in those periods, and those log slices need focused attention and analysis. Principal component analysis can find the time periods with imbalanced proportions: it finds a low-dimensional space such that the sum of the distances obtained after projecting the high-dimensional data into that space is minimized.
FIG. 4 is an exemplary flowchart of a log slice analysis process according to an embodiment of the present invention. As shown in FIG. 4, the process includes:
Step 401: Center the matrix $A$ to obtain the matrix $B$.
Step 402: Solve for the eigenvalues $\lambda_i$ ($i = 1, 2, \ldots, n$) and the eigenvectors $V_1, V_2, \ldots, V_n$ of $B^T B$.
Step 403: Set the proportion of variance retained after dimensionality reduction to 90%, and compute $k$ such that $\sum_{i=1}^{k} \lambda_i \big/ \sum_{i=1}^{n} \lambda_i \ge 90\%$.
Step 404: Form the matrix $P = [V_1, V_2, \ldots, V_k]$ from the first $k$ eigenvectors $V_1, V_2, \ldots, V_k$, and compute the orthogonal projection matrix of this matrix as $V_P = P(P^T P)^{-1} P^T = P P^T$.
Step 405: For an original vector $y$, the distance from the vector to the subspace onto which it is mapped can be obtained by computing the squared prediction error $\mathrm{SPE} = \|y_a\|^2$, where $y_a$ is the component of $y$ that lies outside that subspace (its projection onto the residual subspace), and $y_a = (I - V_P)y = (I - PP^T)y$.
Step 406: Compute the distance from each point to the subspace in this way and compare it with the detection threshold $Q_\alpha$; that is, determine whether $\mathrm{SPE} = \|y_a\|^2 > Q_\alpha$ holds. If so, execute step 407; otherwise, execute step 408.
Step 407: Mark the point $y$ as abnormal, and exit this process. Here $Q_\alpha$ denotes the threshold statistic of the SPE residual function at the $(1-\alpha)$ confidence level; it can be obtained from a formula in which $\lambda_j$ denotes the eigenvalue corresponding to the projection of the covariance matrix of the sample data onto the $j$-th principal component, and $C_\alpha$ denotes the $1-\alpha$ percentile of the standard normal distribution.
Step 408: Determine that the log slice is normal, and exit this process.
At this point, the normal and abnormal rows and their corresponding log slices have been obtained.
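The numerical core of steps 403 to 406 can be sketched without any linear algebra library, provided the eigenvalues and an orthonormal set of leading eigenvectors are already available from some eigen-solver. The function names below are illustrative. The threshold formula does not survive in this text, so `q_alpha` assumes the Jackson-Mudholkar form of the Q-statistic, which is a commonly used choice that matches the symbols $\lambda_j$ and $C_\alpha$ defined in step 407; it should be read as an assumption, not as the formula of the embodiment.

```python
from statistics import NormalDist

def choose_k(eigenvalues, ratio=0.9):
    """Smallest k whose leading eigenvalues carry at least `ratio` of the total variance."""
    lam = sorted(eigenvalues, reverse=True)
    total, running = sum(lam), 0.0
    for k, value in enumerate(lam, start=1):
        running += value
        if running / total >= ratio:
            return k
    return len(lam)

def spe(y, basis):
    """Squared norm of y_a = (I - P P^T) y, with the orthonormal columns of P in `basis`."""
    residual = list(y)
    for v in basis:
        coeff = sum(vi * yi for vi, yi in zip(v, y))      # v^T y
        residual = [r - coeff * vi for r, vi in zip(residual, v)]
    return sum(r * r for r in residual)

def q_alpha(residual_eigenvalues, alpha=0.001):
    """Assumed Jackson-Mudholkar threshold from the eigenvalues lambda_{k+1}..lambda_n."""
    phi1, phi2, phi3 = (sum(l ** p for l in residual_eigenvalues) for p in (1, 2, 3))
    h0 = 1 - 2 * phi1 * phi3 / (3 * phi2 ** 2)
    c = NormalDist().inv_cdf(1 - alpha)                   # 1 - alpha percentile of N(0, 1)
    inner = c * (2 * phi2 * h0 ** 2) ** 0.5 / phi1 + 1 + phi2 * h0 * (h0 - 1) / phi1 ** 2
    return phi1 * inner ** (1 / h0)

k = choose_k([8.0, 1.5, 0.3, 0.2])                        # hypothetical eigenvalues
error = spe([3, 4, 2], [(1, 0, 0), (0, 1, 0)])            # hypothetical centered vector and basis
threshold = q_alpha([0.3, 0.2], alpha=0.001)
```

In this sketch a slice vector would be flagged as abnormal when `spe(...)` exceeds `q_alpha(...)`. The vector passed to `spe` is assumed to be centered with the same column means used to obtain $B$, and the eigenvalues passed to `q_alpha` are assumed to be on the same scale as the SPE values.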
The above describes a typical example of determining abnormal log slices from a plurality of log slices. Those skilled in the art will appreciate that this description is merely exemplary and is not intended to limit the scope of protection of the embodiments of the present invention.
Step 103: Determine the log slice abnormal patterns based on the types of the logs contained in the abnormal log slices.
In one embodiment, step 103 specifically includes: determining the number $N$ of log slice abnormal patterns; when the number $m$ of abnormal log slices is less than or equal to $N$, determining the corresponding log slice abnormal pattern based on the type sequence of the logs contained in each abnormal log slice, to obtain $m$ log slice abnormal patterns; and when the number $m$ of abnormal log slices is greater than $N$, clustering based on the similarity between the type sequences of the logs contained in the abnormal log slices, to obtain $N$ log slice abnormal patterns.
Here, the number $N$ of log slice abnormal patterns is set in advance. When the number $m$ of abnormal log slices is less than or equal to $N$, the type sequence of the logs contained in each abnormal log slice can be determined as the corresponding log slice abnormal pattern. A log slice abnormal pattern contains the type sequence of the logs contained in an abnormal log slice, where each type in the type sequence can be represented by the log string corresponding to that type. When the number $m$ of abnormal log slices is greater than $N$, clustering is performed based on the similarity between the type sequences of the logs contained in the abnormal log slices, to obtain $N$ log slice abnormal patterns.
In one embodiment, clustering based on the similarity between the type sequences of the logs contained in the abnormal log slices to obtain $N$ log slice abnormal patterns includes: taking the type sequences of the logs contained in the abnormal log slices as set elements and combining them into a log slice abnormal pattern set; and, in the log slice abnormal pattern set, merging the two set elements with the greatest similarity to update the set, until the number of set elements in the log slice abnormal pattern set equals $N$.
Before the two set elements with the greatest similarity are merged, the similarity between any two set elements must be determined.
Let the first abnormal log slice and the second abnormal log slice be any two different abnormal log slices. Determining the similarity between any two set elements includes: determining the longest common subsequence length of the first set element of the first abnormal log slice and the second set element of the second abnormal log slice; and determining the similarity of the first set element and the second set element based on the number of logs in the first abnormal log slice, the number of logs in the second abnormal log slice, and the longest common subsequence length.
Assume that any two set elements include a first log string $T'_i$ and a second log string $T'_j$. In one specific implementation of determining the similarity between any two set elements, the similarity $d(y_i, y_j)$ between the first set element $y_i$ of the first abnormal log slice and the second set element $y_j$ of the second abnormal log slice is determined as $d(y_i, y_j) = \frac{|y_i| + |y_j|}{t}$, where $t = \mathrm{LCS}(y_i, y_j)$; $|y_i|$ is the number of logs in the first abnormal log slice; $|y_j|$ is the number of logs in the second abnormal log slice; $\mathrm{LCS}(y_i, y_j)$ is the longest common subsequence length of the first set element $y_i$ and the second set element $y_j$; and the first abnormal log slice and the second abnormal log slice are any two different abnormal log slices.
Specifically, let the set of type sequences of the abnormal log slices be $Y = \{y_1, y_2, \ldots\}$. Each element can be represented as the sequence of log types contained in one abnormal log slice. For example, $y_1 = [s_6, s_6, s_3, s_2, s_{10}, s_6, s_3, s_2, s_3]$ and $y_2 = [s_6, s_1, s_5, s_3, s_2, s_{13}, s_{10}]$ indicate that the logs in the type sequence $y_1$ of the first abnormal log slice are, in order, of the 6th, 6th, 3rd, 2nd, 10th, 6th, 3rd, 2nd and 3rd types, and that the logs in the type sequence $y_2$ of the second abnormal log slice are, in order, of the 6th, 1st, 5th, 3rd, 2nd, 13th and 10th types. Each log type in a type sequence can be represented by the log string corresponding to that type. The similarity of $y_i$ and $y_j$ is defined as the ratio of the total number of logs they contain, $|y_i| + |y_j|$, to the longest common subsequence length $\mathrm{LCS}(y_i, y_j)$ of their log type sequences, that is, $d(y_i, y_j) = \frac{|y_i| + |y_j|}{t}$ with $t = \mathrm{LCS}(y_i, y_j)$. Under this definition a longer common subsequence gives a smaller ratio, so a smaller value of $d$ indicates that the two sequences are more alike. The longest common subsequence is used to measure the similarity of two log slices because the ordered arrangement of log types can determine an abnormal log slice pattern. Hierarchical clustering is performed based on this longest-common-subsequence distance measure to obtain abnormal log slices of different categories, namely the log slice abnormal patterns.
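The measure can be checked on the two example sequences with the sketch below, in which log types are written as plain integers (6 stands for $s_6$, and so on). The function names are illustrative, and returning infinity when two sequences share no log type is an assumption, since the text does not cover a zero-length common subsequence.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def d(yi, yj):
    """(|y_i| + |y_j|) / LCS(y_i, y_j); infinite when nothing is shared (assumption)."""
    t = lcs_length(yi, yj)
    return (len(yi) + len(yj)) / t if t else float("inf")

y1 = [6, 6, 3, 2, 10, 6, 3, 2, 3]
y2 = [6, 1, 5, 3, 2, 13, 10]
print(lcs_length(y1, y2), d(y1, y2))  # 4 4.0
```

The longest common subsequence here is $[s_6, s_3, s_2, s_{10}]$, so $t = 4$ and the ratio is $(9 + 7)/4 = 4$.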
FIG. 5 is an exemplary flowchart of determining log slice abnormal patterns according to an embodiment of the present invention. As shown in FIG. 5, the process includes:
Step 501: Determine the abnormal log slice set $Y = \{y_1, y_2, \ldots\}$, where $y_1, y_2, \ldots$ are the respective abnormal log slices.
Step 502: $P \leftarrow \{\{y_1\}, \{y_2\}, \{y_3\}, \ldots\}$, where "$\leftarrow$" denotes assignment and $P$ is the log slice abnormal pattern set. $\{y_1\}, \{y_2\}, \{y_3\}, \ldots$ are the respective type sequences of the abnormal log slices $y_1, y_2, y_3, \ldots$. For example, $y_1 = [s_6, s_6, s_3, s_2, s_{10}, s_6, s_3, s_2, s_3]$ means that the logs in the type sequence $y_1$ are, in order, of the 6th, 6th, 3rd, 2nd, 10th, 6th, 3rd, 2nd and 3rd types. Moreover, each type in the type sequence $y_1$ (that is, any one of $s_6, s_6, s_3, s_2, s_{10}, s_6, s_3, s_2, s_3$) can be represented by the log string of a log that conforms to that type.
Step 503: Determine whether $|P| > N$ holds, where $N$ is a preset value (for example, $N = 2$) and $|P|$ is the number of set elements in $P$. If it holds (the "Y" branch), execute step 505; otherwise (the "N" branch), execute step 504.
Step 504: Take each set element in $P$ as a log slice abnormal pattern, and end this process.
Step 505: Select two set elements $P_i$ and $P_j$ from $P$ such that the measure between them is minimal, and execute $P \leftarrow P - \{P_i\} - \{P_j\} + \{P_i \cup P_j\}$.
Step 505 is repeated until $|P|$ is less than or equal to $N$.
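The loop formed by steps 502 to 505 can be sketched as a small agglomerative clustering routine. The text does not state how the measure is extended from two sequences to two merged sets, so this sketch assumes average linkage (the mean pairwise ratio); that choice, the helper names, and the integer-coded sample sequences are all assumptions.

```python
from itertools import combinations

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def d(yi, yj):
    """(|y_i| + |y_j|) / LCS(y_i, y_j); infinite when nothing is shared (assumption)."""
    t = lcs_length(yi, yj)
    return (len(yi) + len(yj)) / t if t else float("inf")

def linkage(ci, cj):
    """Assumed average linkage: mean pairwise d between the members of two clusters."""
    return sum(d(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

def cluster_patterns(sequences, n):
    """Merge the closest pair of clusters until at most n clusters remain."""
    P = [[tuple(s)] for s in sequences]                   # step 502: one cluster per slice
    while len(P) > n:                                     # step 503
        i, j = min(combinations(range(len(P)), 2),
                   key=lambda ij: linkage(P[ij[0]], P[ij[1]]))
        merged = P[i] + P[j]                              # step 505: union of P_i and P_j
        P = [c for k, c in enumerate(P) if k not in (i, j)] + [merged]
    return P                                              # step 504: each cluster is a pattern

patterns = cluster_patterns([[1, 2, 3], [1, 2, 3, 4], [7, 8, 9], [7, 8]], n=2)
```

With the four hypothetical sequences above and $N = 2$, the two sequences built from types 1 to 4 end up in one pattern and the two built from types 7 to 9 in the other.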
For example, assume that there are 10 abnormal log slices, namely abnormal log slice 1 to abnormal log slice 10, whose log type sequences are $y_1$ to $y_{10}$ respectively.
(1): When $N$ is greater than or equal to 10, $y_1$ to $y_{10}$ are determined as 10 log slice abnormal patterns, each of which contains the corresponding log type sequence. Each log type in a log type sequence can be represented by the log string corresponding to that log type. For example, if $y_1 = [s_6, s_6, s_3, s_2, s_{10}, s_6, s_3, s_2, s_3]$, the log slice abnormal pattern corresponding to $y_1$ is $[s_6, s_6, s_3, s_2, s_{10}, s_6, s_3, s_2, s_3]$.
(2): When $N$ is less than 10 (for example, $N$ equals 8), the type sequences of the logs contained in the 10 abnormal log slices are each taken as a set element and combined into the log slice abnormal pattern set, which is $\{\{y_1\}, \{y_2\}, \{y_3\}, \ldots, \{y_{10}\}\}$. In the log slice abnormal pattern set, the two set elements with the greatest similarity are merged to update the set, until the number of set elements equals $N$. For example, when $y_1$ and $y_2$ are the most similar pair, $y_1$ and $y_2$ are merged, and the set becomes $\{\{\{y_1\}, \{y_2\}\}, \{y_3\}, \ldots, \{y_{10}\}\}$. The number of elements in the set is thus reduced by one, and $\{\{y_1\}, \{y_2\}\}$ becomes a single set element. Merging of the two most similar set elements then continues on the updated set until it contains 8 set elements. Assume that the final log slice abnormal pattern set is $\{\{\{\{y_1\}, \{y_2\}\}, \{y_3\}\}, \ldots, \{y_{10}\}\}$, where $\{\{\{y_1\}, \{y_2\}\}, \{y_3\}\}$ is a merged set element that corresponds to one abnormal pattern.
Step 104: Compare the types of the logs contained in a log slice under test with the log slice abnormal patterns.
Here, the log slice under test is a log slice whose abnormal pattern needs to be determined, for example a log slice obtained in real time.
Step 105: Determine the abnormal pattern of the log slice under test based on the comparison result.
In one embodiment, step 104 specifically includes: determining the similarity between the log type sequence of the log slice under test and each log slice abnormal pattern in the log slice abnormal pattern set; and step 105 specifically includes: determining, as the abnormal pattern of the log slice under test, the log slice abnormal pattern in the set whose similarity to the log type sequence of the log slice under test is the highest and is greater than a predetermined threshold. When the highest similarity is not greater than the predetermined threshold, the log slice under test is determined to be normal.
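The decision rule of step 105 can be sketched independently of how the similarities are computed. In the sketch below, `decide` is an illustrative name, the pattern identifiers and scores are hypothetical, and a larger score is assumed to mean a closer match.

```python
def decide(similarities, threshold):
    """Return the best-matching pattern name, or None when the slice is judged normal."""
    if not similarities:
        return None
    best = max(similarities, key=similarities.get)
    # The pattern is accepted only if its similarity strictly exceeds the threshold.
    return best if similarities[best] > threshold else None

# Hypothetical similarities between one slice under test and three patterns.
matched = decide({"pattern_1": 0.42, "pattern_2": 0.81, "pattern_3": 0.10}, threshold=0.6)
normal = decide({"pattern_1": 0.42, "pattern_3": 0.10}, threshold=0.6)
```

Here `matched` names the second pattern, while `normal` is `None` because no pattern exceeds the threshold.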
Here, the log slice under test is obtained; the unstructured information of each log in the log slice under test is divided into a word sequence using spaces as delimiters; the log string corresponding to each word sequence is determined (using the same character position as in step 203); and each log string is compared for similarity with each type in the log type set (the similarity can be determined with the formula for $\mathrm{sim}(T'_i, T'_j)$ given above) to determine the type of each log, thereby obtaining the log type sequence of the log slice under test.
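Deriving the log type sequence of a slice under test can be sketched as follows. The sketch assumes first-character log strings, an equal-weight similarity, comparison only against types of the same length, and selection of the most similar type at or above the threshold; these choices and all function names are assumptions. A log that matches no type is reported as `None`.

```python
def log_string(content):
    """First character of every space-delimited word."""
    return "".join(w[0] for w in content.split(" ") if w)

def sim(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def classify(content, type_set, lam=0.4):
    """Most similar type of the same length, or None if no type reaches `lam`."""
    s = log_string(content)
    candidates = [t for t in type_set if len(t) == len(s)]
    best = max(candidates, key=lambda t: sim(s, t), default=None)
    return best if best is not None and sim(s, best) >= lam else None

def type_sequence(slice_contents, type_set, lam=0.4):
    """Log type sequence of a slice, in log order."""
    return [classify(c, type_set, lam) for c in slice_contents]

type_set = ["ubc", "utd", "cdtls"]                        # hypothetical log type set S
sequence = type_sequence(
    ["user alice connected", "user tom disconnected", "server down"], type_set
)
```

In this hypothetical run the third log has a two-character log string that matches no known type, which corresponds to the case where the log type set has to be updated before the slice can be evaluated.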
When the log type set contains no type corresponding to a log in the log slice under test, the log slice under test can be added to the log set of step 101, and the process shown in FIG. 1 can be executed again to update the abnormal log slice model set. Steps 104 and 105 are then executed using the updated abnormal log slice model set.
In one embodiment, when a log slice abnormal pattern in the log slice abnormal pattern set is a set element obtained by set merging, the method further includes: determining the sub-similarity between the log type sequence of the log slice under test and each log type sequence that took part in the set merging of that log slice abnormal pattern; and determining the weighted sum of the sub-similarities as the similarity between the log type sequence of the log slice under test and that log slice abnormal pattern.
For example, assume that the log slice abnormal pattern set is {{{{y1},{y2}},{y3}}, ..., {y10}}, where {{{y1},{y2}},{y3}} is a set element that has undergone set merging. To calculate the similarity between the log type sequence in the log slice to be tested and {{{y1},{y2}},{y3}}, the similarities between that log type sequence and each of {y1}, {y2} and {y3} (called sub-similarities) are calculated separately, giving three sub-similarities. The weighted sum of these three sub-similarities (the weights are configurable) is then determined as the similarity between the log type sequence in the log slice to be tested and the log slice abnormal pattern.
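The weighted sum of sub-similarities can be illustrated with the short sketch below. It assumes that the merged pattern has been flattened into the list of its member type sequences (y1, y2, y3 in the example above), that equal weights are used when none are supplied, and that `difflib.SequenceMatcher` serves as a stand-in for the sequence similarity. All names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Sequence


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Stand-in sub-similarity in [0, 1] (assumption)."""
    return SequenceMatcher(None, list(a), list(b)).ratio()


def pattern_similarity(
    type_seq: Sequence[str],
    members: List[Sequence[str]],
    weights: Optional[List[float]] = None,
) -> float:
    """Weighted sum of the sub-similarities to every merged member sequence."""
    if not members:
        raise ValueError("a merged pattern needs at least one member")
    if weights is None:
        # Assumed default: every member sequence counts equally.
        weights = [1.0 / len(members)] * len(members)
    if len(weights) != len(members):
        raise ValueError("one weight per member sequence is required")
    return sum(w * similarity(type_seq, m) for w, m in zip(weights, members))
```

Because the weights are configurable, an operator could, for instance, give more weight to the member sequences that were observed most recently.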
FIG. 6 is a structural diagram of an apparatus for determining an abnormal pattern of a log slice according to an embodiment of the present invention. As shown in FIG. 6, the apparatus 600 for determining an abnormal pattern of a log slice includes:
a first determination module 601, configured to determine the type of a log based on the character string of the log included in a log set;
a second determination module 602, configured to determine an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by partitioning the log set based on a sliding time window;
a third determination module 603, configured to determine a log slice abnormal pattern based on the types of the logs contained in the abnormal log slice;
a comparison module 604, configured to compare the types of the logs contained in the log slice to be tested with the log slice abnormal pattern;
a fourth determination module 605, configured to determine the abnormal pattern of the log slice to be tested based on the comparison result.
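The sliding-time-window partitioning on which the second determination module 602 relies can be sketched as follows. The representation of a log as a `(timestamp, type)` pair and the parameters `window` and `step` are assumptions made for this example; the description does not prescribe particular values.

```python
from typing import List, Tuple

Log = Tuple[float, str]  # (timestamp in seconds, log type): assumed layout


def slice_logs(logs: List[Log], window: float, step: float) -> List[List[str]]:
    """Partition logs into slices of `window` seconds, advancing by `step`."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if not logs:
        return []
    ordered = sorted(logs)                    # order by timestamp
    start, end = ordered[0][0], ordered[-1][0]
    slices: List[List[str]] = []
    t = start
    while t <= end:
        # Each slice holds the types of the logs inside [t, t + window).
        slices.append([typ for ts, typ in ordered if t <= ts < t + window])
        t += step
    return slices
```

Choosing `step` smaller than `window` yields overlapping slices, while `step == window` yields disjoint ones.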
In one embodiment, the first determination module 601 is configured to: extract, from the log, log content characterized as unstructured information; divide the log content into a word sequence using spaces as delimiters; group the logs based on the length of the word sequence; for each group, determine the log string of each word sequence based on the character at a predetermined position of each word in each word sequence contained in the group, and perform clustering based on the similarity between any two log strings to obtain the types of the logs contained in the group; and combine the log types of all groups into a log type set.
In one embodiment, the any two log strings include a first log string and a second log string, and the first determination module 601 is configured to: determine the similarity between characters at the same character position in the first log string and the second log string; and determine the similarity between the first log string and the second log string based on the similarity between the characters at the same character position.
In one embodiment, the third determination module 603 is configured to: determine the number N of log slice abnormal patterns; when the number m of abnormal log slices is less than or equal to N, determine the corresponding log slice abnormal pattern based on the type sequence of the logs contained in each abnormal log slice, to obtain m log slice abnormal patterns; and when the number m of abnormal log slices is greater than N, perform clustering based on the similarity between the type sequences of the logs contained in the abnormal log slices, to obtain N log slice abnormal patterns.
In one embodiment, the third determination module 603 is configured to: combine the type sequences of the logs contained in the abnormal log slices, as set elements, into a log slice abnormal pattern set; and, in the log slice abnormal pattern set, merge the two set elements with the greatest similarity to update the log slice abnormal pattern set, repeating until the number of set elements in the log slice abnormal pattern set equals N.
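The merging procedure can be sketched as a simple agglomerative loop. In this illustration a merged set element is stored as a flat list of its member type sequences, the similarity between two set elements is taken to be the mean pairwise similarity of their members, and `difflib.SequenceMatcher` stands in for the sequence similarity. These choices, like all names used, are assumptions rather than the definitions given in this description.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Sequence

Cluster = List[Sequence[str]]  # member type sequences of one abnormal pattern


def seq_sim(a: Sequence[str], b: Sequence[str]) -> float:
    """Stand-in sequence similarity in [0, 1] (assumption)."""
    return SequenceMatcher(None, list(a), list(b)).ratio()


def cluster_sim(c1: Cluster, c2: Cluster) -> float:
    """Assumed linkage: mean pairwise similarity between member sequences."""
    return sum(seq_sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def build_patterns(abnormal_seqs: List[Sequence[str]], n: int) -> List[Cluster]:
    """Merge the two most similar set elements until at most n remain."""
    if n < 1:
        raise ValueError("n must be at least 1")
    clusters: List[Cluster] = [[s] for s in abnormal_seqs]
    while len(clusters) > n:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

When the number m of abnormal log slices does not exceed N, the loop body never runs and each abnormal log slice yields its own pattern, which matches the case m ≤ N described above.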
In one embodiment, the third determination module 603 is configured to: determine the length of the longest common subsequence of a first set element of a first abnormal log slice and a second set element of a second abnormal log slice, wherein the first abnormal log slice and the second abnormal log slice are any two different abnormal log slices; and determine the similarity between the first set element of the first abnormal log slice and the second set element of the second abnormal log slice based on the number of logs in the first abnormal log slice, the number of logs in the second abnormal log slice, and the length of the longest common subsequence.
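A possible realization of this similarity is sketched below. The longest common subsequence length is computed with the standard dynamic-programming recurrence; the normalization 2·LCS/(n1 + n2) is an assumption chosen for illustration, since this passage only states that the similarity depends on the two log counts and the LCS length.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalization: 2 * LCS / (len(a) + len(b)), in [0, 1]."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty sequences are treated as identical
    return 2 * lcs_length(a, b) / total
```

With this normalization, identical type sequences score 1 and sequences that share no log type score 0.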
In one embodiment, the comparison module 604 is configured to determine the similarity between the log type sequence in the log slice to be tested and each log slice abnormal pattern in the log slice abnormal pattern set; and the fourth determination module 605 is configured to determine, as the abnormal pattern of the log slice to be tested, the log slice abnormal pattern in the log slice abnormal pattern set whose similarity to the log type sequence in the log slice to be tested is the highest and is greater than a predetermined threshold.
In one embodiment, the comparison module 604 is configured to: when a log slice abnormal pattern in the log slice abnormal pattern set is a set element that has undergone set merging, determine a sub-similarity between the log type sequence contained in the log slice to be tested and each log type sequence that participated in the set merging of the log slice abnormal pattern; and determine the weighted sum of the sub-similarities as the similarity between the log type sequence in the log slice to be tested and the log slice abnormal pattern.
An embodiment of the present invention further provides an electronic device having a processor-memory architecture. FIG. 7 is a structural diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 7, the electronic device 700 includes a processor 701, a memory 702, and a computer program stored in the memory 702 and executable on the processor 701. When executed by the processor 701, the computer program implements any of the above methods for determining an abnormal pattern of a log slice. The memory 702 may be implemented as any of a variety of storage media, such as an electrically erasable programmable read-only memory (EEPROM), a flash memory, or a programmable read-only memory (PROM). The processor 701 may be implemented as one or more central processing units or one or more field programmable gate arrays, where a field programmable gate array integrates one or more central processing unit cores. Specifically, the central processing unit or central processing unit core may be implemented as a CPU, an MCU, a DSP, and so on.
It should be noted that not all steps and modules in the above processes and structural diagrams are necessary; some steps or modules may be omitted according to actual needs. The execution order of the steps is not fixed and may be adjusted as needed. The division into modules is merely a functional division adopted for ease of description. In an actual implementation, one module may be implemented by multiple modules, and the functions of multiple modules may be implemented by a single module; these modules may be located in the same device or in different devices.
The hardware modules in each embodiment may be implemented mechanically or electronically. For example, a hardware module may include specially designed permanent circuits or logic devices (such as a dedicated processor, for example an FPGA or an ASIC) for performing specific operations. A hardware module may also include programmable logic devices or circuits temporarily configured by software (for example, including a general-purpose processor or another programmable processor) for performing specific operations. Whether a hardware module is implemented mechanically, with dedicated permanent circuits, or with temporarily configured circuits (for example, configured by software) may be decided based on cost and time considerations.
The above are merely preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (21)
- A method for determining an abnormal pattern of a log slice, characterized by comprising: determining (101) the type of a log based on the character string of the log included in a log set; determining (102) an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by partitioning the log set based on a sliding time window; determining (103) a log slice abnormal pattern based on the types of the logs contained in the abnormal log slice; comparing (104) the types of the logs contained in a log slice to be tested with the log slice abnormal pattern; and determining (105) the abnormal pattern of the log slice to be tested based on the comparison result.
- The method according to claim 1, characterized in that determining (101) the type of the log based on the character string of the log included in the log set comprises: extracting, from the log, log content characterized as unstructured information; dividing the log content into a word sequence using spaces as delimiters; grouping the logs based on the length of the word sequence; for each group, determining the log string of each word sequence based on the character at a predetermined position of each word in each word sequence contained in the group, and performing clustering based on the similarity between any two of the log strings to obtain the types of the logs contained in the group; and combining the log types of all groups into a log type set.
- The method according to any one of the preceding claims, characterized in that the predetermined position is the first character or the last character of the word.
- The method according to any one of the preceding claims, characterized in that the any two log strings comprise a first log string and a second log string, and the method further comprises: determining the similarity between characters at the same character position in the first log string and the second log string; and determining the similarity between the first log string and the second log string based on the similarity between the characters at the same character position.
- The method according to any one of the preceding claims, characterized in that determining (102) the abnormal log slice from the plurality of log slices comprises: determining, based on principal component analysis, a log slice with an unbalanced proportion of log types from the plurality of log slices; and determining the log slice with the unbalanced proportion of log types as the abnormal log slice.
- The method according to any one of the preceding claims, characterized in that determining (103) the log slice abnormal pattern based on the types of the logs contained in the abnormal log slice comprises: determining the number N of log slice abnormal patterns; when the number m of abnormal log slices is less than or equal to N, determining the corresponding log slice abnormal pattern based on the type sequence of the logs contained in each abnormal log slice, to obtain m log slice abnormal patterns; and when the number m of abnormal log slices is greater than N, performing clustering based on the similarity between the type sequences of the logs contained in the abnormal log slices, to obtain N log slice abnormal patterns.
- The method according to any one of the preceding claims, characterized in that performing clustering based on the similarity between the type sequences of the logs contained in the abnormal log slices to obtain N log slice abnormal patterns comprises: combining the type sequences of the logs contained in the abnormal log slices, as set elements, into a log slice abnormal pattern set; and, in the log slice abnormal pattern set, merging the two set elements with the greatest similarity to update the log slice abnormal pattern set, until the number of set elements in the log slice abnormal pattern set equals N.
- The method according to any one of the preceding claims, characterized by further comprising: determining the length of the longest common subsequence of a first set element of a first abnormal log slice and a second set element of a second abnormal log slice, wherein the first abnormal log slice and the second abnormal log slice are any two different abnormal log slices; and determining the similarity between the first set element of the first abnormal log slice and the second set element of the second abnormal log slice based on the number of logs in the first abnormal log slice, the number of logs in the second abnormal log slice, and the length of the longest common subsequence.
- The method according to any one of the preceding claims, characterized in that comparing (104) the types of the logs contained in the log slice to be tested with the log slice abnormal pattern comprises: determining the similarity between the log type sequence in the log slice to be tested and each log slice abnormal pattern in the log slice abnormal pattern set; and determining (105) the abnormal pattern of the log slice to be tested based on the comparison result comprises: determining, as the abnormal pattern of the log slice to be tested, the log slice abnormal pattern in the log slice abnormal pattern set whose similarity to the log type sequence in the log slice to be tested is the highest and is greater than a predetermined threshold.
- The method according to any one of the preceding claims, characterized in that, when a log slice abnormal pattern in the log slice abnormal pattern set is a set element that has undergone set merging, the method further comprises: determining a sub-similarity between the log type sequence contained in the log slice to be tested and each log type sequence that participated in the set merging of the log slice abnormal pattern; and determining the weighted sum of the sub-similarities as the similarity between the log type sequence in the log slice to be tested and the log slice abnormal pattern.
- An apparatus for determining an abnormal pattern of a log slice, characterized by comprising: a first determination module (601), configured to determine the type of a log based on the character string of the log included in a log set; a second determination module (602), configured to determine an abnormal log slice from a plurality of log slices, wherein the plurality of log slices are obtained by partitioning the log set based on a sliding time window; a third determination module (603), configured to determine a log slice abnormal pattern based on the types of the logs contained in the abnormal log slice; a comparison module (604), configured to compare the types of the logs contained in a log slice to be tested with the log slice abnormal pattern; and a fourth determination module (605), configured to determine the abnormal pattern of the log slice to be tested based on the comparison result.
- The apparatus according to claim 10, characterized in that the first determination module (601) is configured to: extract, from the log, log content characterized as unstructured information; divide the log content into a word sequence using spaces as delimiters; group the logs based on the length of the word sequence; for each group, determine the log string of each word sequence based on the character at a predetermined position of each word in each word sequence contained in the group, and perform clustering based on the similarity between any two of the log strings to obtain the types of the logs contained in the group; and combine the log types of all groups into a log type set.
- The apparatus according to any one of the preceding claims, characterized in that the any two log strings comprise a first log string and a second log string; and the first determination module (601) is configured to: determine the similarity between characters at the same character position in the first log string and the second log string; and determine the similarity between the first log string and the second log string based on the similarity between the characters at the same character position.
- The apparatus according to any one of the preceding claims, characterized in that the third determination module (603) is configured to: determine the number N of log slice abnormal patterns; when the number m of abnormal log slices is less than or equal to N, determine the corresponding log slice abnormal pattern based on the type sequence of the logs contained in each abnormal log slice, to obtain m log slice abnormal patterns; and when the number m of abnormal log slices is greater than N, perform clustering based on the similarity between the type sequences of the logs contained in the abnormal log slices, to obtain N log slice abnormal patterns.
- The apparatus according to any one of the preceding claims, characterized in that the third determination module (603) is configured to: combine the type sequences of the logs contained in the abnormal log slices, as set elements, into a log slice abnormal pattern set; and, in the log slice abnormal pattern set, merge the two set elements with the greatest similarity to update the log slice abnormal pattern set, until the number of set elements in the log slice abnormal pattern set equals N.
- The apparatus according to any one of the preceding claims, characterized in that the third determination module (603) is configured to: determine the length of the longest common subsequence of a first set element of a first abnormal log slice and a second set element of a second abnormal log slice, wherein the first abnormal log slice and the second abnormal log slice are any two different abnormal log slices; and determine the similarity between the first set element of the first abnormal log slice and the second set element of the second abnormal log slice based on the number of logs in the first abnormal log slice, the number of logs in the second abnormal log slice, and the length of the longest common subsequence.
- The apparatus according to any one of the preceding claims, characterized in that the comparison module (604) is configured to determine the similarity between the log type sequence in the log slice to be tested and each log slice abnormal pattern in the log slice abnormal pattern set; and the fourth determination module (605) is configured to determine, as the abnormal pattern of the log slice to be tested, the log slice abnormal pattern in the log slice abnormal pattern set whose similarity to the log type sequence in the log slice to be tested is the highest and is greater than a predetermined threshold.
- The apparatus according to any one of the preceding claims, characterized in that the comparison module (604) is configured to: when a log slice abnormal pattern in the log slice abnormal pattern set is a set element that has undergone set merging, determine a sub-similarity between the log type sequence contained in the log slice to be tested and each log type sequence that participated in the set merging of the log slice abnormal pattern; and determine the weighted sum of the sub-similarities as the similarity between the log type sequence in the log slice to be tested and the log slice abnormal pattern.
- An electronic device, characterized by comprising: a processor (701); and a memory (702), configured to store executable instructions of the processor (701); wherein the processor (701) is configured to read the executable instructions from the memory (702) and execute the executable instructions to implement the method for determining an abnormal pattern of a log slice according to any one of claims 1 to 9.
- A computer-readable storage medium having computer instructions stored thereon, characterized in that the computer instructions, when executed by a processor, implement the method for determining an abnormal pattern of a log slice according to any one of claims 1 to 9.
- A computer program product, characterized by comprising a computer program which, when executed by a processor, implements the method for determining an abnormal pattern of a log slice according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2023/077709 WO2024174135A1 (en) | 2023-02-22 | 2023-02-22 | Method for determining abnormal mode of log slice, apparatus, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024174135A1 true WO2024174135A1 (en) | 2024-08-29 |
Family
ID=92500137
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/077709 WO2024174135A1 (en) | 2023-02-22 | 2023-02-22 | Method for determining abnormal mode of log slice, apparatus, device and storage medium |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024174135A1 (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110296244A1 (en) * | 2010-05-25 | 2011-12-01 | Microsoft Corporation | Log message anomaly detection |
CN110750412A (en) * | 2019-09-02 | 2020-02-04 | 北京云集智造科技有限公司 | Log abnormity detection method |
CN110888849A (en) * | 2019-11-06 | 2020-03-17 | 国网上海市电力公司 | Online log analysis method and system and electronic terminal equipment thereof |
CN114742051A (en) * | 2022-04-25 | 2022-07-12 | 京东科技信息技术有限公司 | Log processing method, device, computer system and readable storage medium |
US20220405592A1 (en) * | 2022-03-10 | 2022-12-22 | University Of Electronic Science And Technology Of China | Multi-feature log anomaly detection method and system based on log full semantics |
CN115509848A (en) * | 2022-08-17 | 2022-12-23 | 中国电信股份有限公司 | Log analysis method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23923316 Country of ref document: EP Kind code of ref document: A1 |