Network access behavior characteristic group dynamic mining method and system under degradation condition
Technical Field
The invention belongs to the field of data mining, and particularly relates to a dynamic mining method and system for network access behavior feature groups under a degradation condition.
Background
At present, the relation maps are widely applied in the scientific fields of social relation networks, gene biology, cognitive radio and the like. In many large data fields, there is a need to search for populations or targets with maximized common characteristics. The groups or targets and their features are usually abstractly expressed in the form of various graphs, wherein the groups or targets with maximized common features are expressed in the form of some special graphs, including: maximum clique, maximum bipartite clique, quasi-bipartite clique, maximum edge bipartite clique, maximum balance bipartite clique, and frequent item set, etc.
The invention mainly aims at online network access relations, and a group with maximized common access relations is searched. The population with the maximized co-visit relationship is essentially the largest bipartite. It has been proven that the maximum two-cluster search problem is equivalent to the maximum frequent closed item set search problem, so in recent years, the maximum two-cluster search technology has been rapidly developed in the fields of various databases and relational maps, and the main algorithms include: DCI-CLOSED algorithm, D-Miner algorithm, LCM-MBC algorithm and the like. Among them, the DCI-CLOSED algorithm focuses on enumerating the largest bipartite cluster from a large bipartite graph. Bimax and D-Miner algorithms generate all bi-directional clusters that represent gene expression data. The DataPeeler algorithm efficiently mines the closed frequent item sets corresponding to the largest two clusters one by one from the three-dimensional data set. The LCM-MBC algorithm searches for the largest bipartite cluster from a symmetric undirected large graph. The cube miner-MBC algorithm enumerates the 3D maximum bipartite blob from the 3D symmetric matrix using symmetry enumeration of the graph. The EMBS algorithm searches the maximum two clusters with limited characteristics by using a dynamic threshold, and can output all the maximum two clusters under the condition of no limitation, and the efficiency is slightly higher than that of the LCM-MBC algorithm. The above algorithm searches for the largest blob if the input data remains static. However, in many application scenarios, when the external environment changes, the input data also changes, including the case of adding or deleting edges or vertices of the graph.
Aiming at the scene that the input data can be dynamically changed, the maximum binary cluster is searched in the dynamically changed data mainly by adopting a method based on a sliding window at present, and the main algorithms comprise a Max-FISM algorithm, a VSW algorithm, an MWFIM algorithm and the like. Wherein the Max-FISM algorithm mines a frequent set of items in a sliding window of a continuous data stream. The VSW algorithm may continuously mine frequent patterns over a sliding window of variable size. The MWFIM algorithm prunes weighted infrequent patterns from the transactional database and uses a prefix tree with a decreasing order. The TKC-DS algorithm is used to efficiently mine the set of top-K closed terms in the data stream. Although these methods are capable of searching for the largest blobs in dynamically changing data, such sliding window based methods are inherently limited by the size of the window, and the results tend to be coarse rather than precise.
In the dynamic change process of input data, two conditions of data degradation and enhancement are included: data degradation refers to the situation where a point or edge in the input data disappears; data enhancement refers to the situation where points or edges in the input data are increased. Two different types of dynamic changes and search techniques thereof are completely different, and no accurate and efficient solution is provided at present. The invention provides a dynamic mining method for accurately and efficiently searching a network access behavior characteristic group aiming at the condition of data degradation.
Disclosure of Invention
The invention solves the problems: aiming at statistical data of an individual access webpage, under the continuous change condition that an individual (namely a point) and an access relation (namely an edge) disappear, an intelligent model which can quickly and efficiently search the maximum clustering in the changed data can be established, all groups with maximized common access characteristics are determined, and a user can accurately and quickly lock, track or monitor a target group.
The technical scheme adopted by the invention is as follows:
a dynamic mining method for network access behavior feature group under degradation condition includes providing an input interface for user, inputting effective frequency statistic data of each type of web page accessed by individual by user, converting frequency statistic data into 0,1 matrix, executing one-time scanning search algorithm based on said matrix to obtain all maximum binary groups in said matrix and storing all maximum binary groups in internal memory, providing an interface for inputting matrix point or edge deletion data for user, normalizing deletion data inputted by user into edge deletion data, executing maximum binary group iterative search process for each deletion data and outputting all maximum binary groups obtained by last iteration.
According to the dynamic mining method for the network access behavior feature group under the degradation condition, a user inputs effective frequency statistical data of each type of webpage accessed by an individual through the input interface, the individual is an internet user, the effective frequency statistical data refers to the time of day as a unit between the time of the individual accessing the webpage of the type and the current time, the total frequency of the individual accessing the webpage of the type is divided by the time of the individual accessing the webpage of the type, the effective frequency statistical data of the individual accessing the webpage of the type is finally normalized to be 0 or 1, wherein 0 represents that the frequency is insufficient, and 1 represents that the frequency is sufficient.
In the method for dynamically mining the network access behavior feature group under the degradation condition, converting the frequency statistical data into a 0,1 matrix means that the frequency statistical data input by the user is processed and expressed as a matrix, wherein one row of the matrix represents an individual, one column of the matrix represents a type of web page, and elements of the matrix represent the access frequency of the individual to the corresponding type of web page.
In the method for dynamically mining the network access behavior feature population under the degradation condition, the step of executing the scanning search algorithm once on the basis of the matrix to obtain all the maximum two clusters in the matrix and storing the maximum two clusters in the memory means that the EMBS algorithm is executed on the converted matrix to search and obtain all the maximum two clusters, wherein one maximum two cluster represents the most users with the same access webpage types.
In the method for dynamically mining the network access behavior feature group under the degradation condition, an interface for inputting matrix points or deleting data at the edges is provided for the user, wherein the interface for deleting data at the points refers to which individuals represented in the user input matrix are deleted, and the interface for deleting data at the edges refers to which individuals represented in the user input matrix have the frequency of accessing the webpage changed from 1 to 0.
In the above method for dynamically mining network access behavior feature groups under degraded conditions, the normalization of the pruned data input by the user into pruned data of edges refers to a pruning situation in which the pruned individual input by the user is converted into a plurality of edges, for example, if the user prunes an individual, the method is equivalent to completely pruning all access frequency data corresponding to the individual, and finally converting all pruning points or edges input by the user into a plurality of edges pruning situation.
The method for dynamically mining the network access behavior feature population under the degradation condition includes executing a maximum clustering iterative search process on each piece of deleted data, and outputting all maximum clustering obtained through the last iteration, where executing an iterative search process on the basis of the maximum clustering obtained through the first search, that is, deciding on each deleted edge for the maximum clustering obtained through each search, and if the maximum clustering contains the deleted edge, performing decomposition and judgment, where the decomposition is to divide the maximum clustering into a plurality of clustering according to the deleted edge, and judging whether the clustering is still the maximum clustering, and if the decomposed result is the maximum clustering, storing the maximum clustering obtained through decomposition. Each time a deleted edge is processed, a new set of maximum binary clusters is obtained, and when the next deleted edge is processed, the processing procedure is repeated based on the newly obtained maximum binary clusters.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, all maximum groups can be searched in the changed data quickly and efficiently under the continuous change condition that the individual (point) and the access relation (edge) disappear, all groups with maximized common access characteristics are determined, and a user can lock, track or monitor the target group accurately and quickly. Compared with the prior art that the specific group cannot be quickly and accurately searched, the invention provides the iterative search method, and for dynamically changed data, only the changed data needs to be searched, and the whole data does not need to be searched, so that the specific group can be quickly and accurately searched.
Drawings
FIG. 1 is a flow chart of the method implementation of the present invention.
Detailed Description
Embodiments of the present invention are further provided below in conjunction with the appended drawings and this summary.
As shown in fig. 1, the method of the present invention develops a prototype system, which includes a user data input interface, a data-matrix conversion module, an EMBS search module, an input interface for point or edge pruning data, a normalized edge processing module, and an iterative search module: inputting effective frequency statistical data of each type of webpage accessed by an individual through a data input interface by a user; the data-matrix conversion module converts effective frequency statistical data input by a user into a 0,1 matrix; the EMBS searching module performs one-time scanning on the matrix according to an EMBS searching method to search and store all the maximum two clusters in the matrix; inputting a point or an edge to be deleted in a matrix through an input interface of the point or edge deletion data by a user; the normalized edge processing module is used for converting all deletion points or edge conditions input by a user into deletion conditions of a plurality of edges and recording the deletion conditions; the iterative search module processes each deleted side in sequence and carries out the next search processing on the basis of the processing result of the previous side.
As shown in FIG. 1, the method of the present invention has specific operation processes.
(1) The user inputs effective frequency statistical data of each type of webpage accessed by an individual through the input interface, the individual is an internet user, the effective frequency statistical data refers to the total times of the individual accessing a certain type of webpage divided by the time of the individual accessing the certain type of webpage to the current time in days, and the user determines that the effective frequency is 0 or 1.
(2) The system converts the effective frequency statistic data input by the user into a 0,1 matrix through the data-matrix conversion module, namely, the effective frequency statistic data input by the user is processed and expressed into a matrix M, wherein one row of the matrix represents an individual, one column of the matrix represents a type of webpage, and elements of the matrix represent the access frequency of the individual to the webpage of the corresponding type. An example of the matrix is shown in table 1, which contains effective frequency data of five types of web pages, i.e., 0,1, 2, 3, and 4, accessed by five individuals (internet users), a, b, c, d, and e.
TABLE 1
|
0
|
1
|
2
|
3
|
4
|
a
|
0
|
1
|
0
|
1
|
1
|
b
|
1
|
0
|
1
|
1
|
1
|
c
|
0
|
1
|
0
|
1
|
1
|
d
|
1
|
1
|
1
|
0
|
0
|
e
|
1
|
1
|
1
|
0
|
0 |
(3) The system searches all the maximum two clusters in the matrix M by performing one-time scanning on the matrix M according to a publicly published EMBS searching method through the EMBS searching module and stores the maximum two clusters in the matrix B. For example, by searching the matrix represented by table 1 according to the EMBS, the largest bipartite (i.e., the most users with the most same visited web page types) can be obtained as { (a, c) - (1,3,4), (a, b, c) - (3,4), (a, c, d, e) -1 }.
(4) The system deletes the data input interface through the point or edge, which means that the user inputs the point or edge to be deleted in the M. For example, for Table 1, the user may delete point a or delete edge a-1.
(5) The system converts all deletion points or edge conditions input by the user into deletion conditions of a plurality of edges through the normalization edge processing module, and records the deletion conditions into E. As shown in table 1, when the user deletes point a, the system will automatically convert the deletion to delete all edges corresponding to point a, i.e. delete three edges a-1, a-3, a-4 at the same time.
(6) The system executes search through the iterative search module, specifically, the iterative search is performed according to the following process.
(6.1) set B' to null.
(6.2) taking out one side E from E.
(6.3) taking out a maximum micelle G from B.
(6.4) if G does not contain e, then put G to B'; if G contains e, then decompose G into left sub-graph G1And G2Left drawing G1G-a, right diagram G2If G is G-b1Is the largest two clusters, G is1Put into B', if G2Is the largest two clusters, G is2Put into B'. If G is the last maximum two blobs, put the maximum two blobs in B 'into B, namely B ← B', and then return to(6.2), otherwise, directly returning to (6.3).
(6) And outputting a set B'.
The effectiveness comparison is carried out by using an EMBS algorithm repeated search method and the iterative search method, the comparison result of the search efficiency under the conditions of matrixes with different sizes and different matrix densities is shown in a table 2, and the results show that the method has high efficiency on the premise of keeping accuracy, and the search time is far shorter than that of the repeated search method.
TABLE 2
Matrix size
|
EMBS method (ms)
|
Iterative method (ms)
|
Maximum number of two clusters
|
Density of matrix
|
10*10
|
10
|
9
|
36
|
0.48
|
12*12
|
40
|
10
|
63
|
0.44
|
16*16
|
85
|
12
|
190
|
0.45
|
20*20
|
204
|
15
|
355
|
0.46
|
24*24
|
280
|
24
|
1465
|
0.49
|
32*32
|
3450
|
180
|
7595
|
0.5
|
40*40
|
38317
|
220
|
17041
|
0.47
|
48*48
|
181397
|
246
|
41872
|
0.46 |
Aiming at the statistical data of the individual access webpage, the method can accurately and efficiently search the maximum binary group in the changed data under the continuous change condition that the individual (namely point) and the access relation (namely edge) disappear, and determine all groups with the maximized common access characteristic.
The invention has not been described in detail and is part of the common general knowledge of a person skilled in the art.
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.