CN113157957A - Attribute graph document clustering method based on graph convolution neural network - Google Patents
Attribute graph document clustering method based on graph convolution neural network
- Publication number
- CN113157957A (application number CN202110244762.6A)
- Authority
- CN
- China
- Prior art keywords
- graph
- clustering
- neural network
- attribute
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an attribute graph document clustering method based on a graph convolutional neural network, belonging to the field of graph data mining. Specifically, a cross-layer linked graph convolutional neural network is used for feature learning on the document attribute graph; a deep cluster-estimation model estimates the optimal number of clusters from the node features; these two steps are executed alternately until training is complete. The trained model then yields the features of all document attribute graph nodes to be clustered and the estimated number of clusters; finally, with these features and the estimated cluster number as input, the k-means clustering method produces the document attribute graph clustering result. When training the cross-layer linked graph convolutional neural network, a self-separation regularization term based on pairwise node similarity encourages the features of nodes in the same cluster to be similar and the features of nodes in different clusters to be far apart, which effectively improves graph clustering performance. The cluster-estimation module realizes data-driven estimation of the cluster number, making the whole system better suited to real, unlabeled data environments.
Description
Technical Field
The invention belongs to the field of graph data mining, and particularly relates to an attribute graph document clustering method based on a graph convolutional neural network.
Background
Attribute graph clustering is a basic task in graph data mining: it aims to divide the nodes of a graph into mutually disjoint clusters according to both node attributes and graph structure information. Compared with traditional graph clustering methods that use only structural information, attribute graph clustering is better suited to scenarios in which nodes carry rich content information. It has wide practical application in community discovery, protein functional module detection, financial network fraud detection, and other fields.
A number of graph clustering methods based on deep models have been proposed. Compared with shallow methods, deep graph clustering better captures the nonlinear, complex relations among nodes and thus helps improve clustering performance. Most existing deep graph clustering methods adopt a two-step framework: a feature learning step uses a deep model to learn low-dimensional node features, and a clustering step runs a conventional clustering method, such as k-means or spectral clustering, to complete the graph clustering task. Whether the feature learning step can learn the true features of the attribute graph is critical for the clustering task. Early deep models usually used Graph Autoencoders (GAE) to capture graph structure information, but GAEs train the neural network from structural features alone and ignore the node attribute information in the attribute graph, which limits their performance on attribute graph clustering.
In recent years, attribute graph clustering methods have generally used Graph Neural Networks (GNNs) for node feature learning. GNNs aggregate the attribute information of adjacent nodes with learned weights and update node features iteratively; this forward propagation fuses the structural features and node attributes of the attribute graph, improves data utilization, and therefore applies naturally to attribute graph clustering tasks. Moreover, the goal of graph clustering is to detect local substructures with dense intra-cluster and sparse inter-cluster connections, and the node features learned by GNNs preserve the local similarity of the graph, which benefits the clustering task. However, current methods have two limitations. First, the feature learning process lacks guidance from the clustering task, so it is difficult to learn cluster-friendly node features; node distributions in the feature space easily overlap, which hinders subsequent clustering. Second, these methods require the number of clusters to be set manually in advance; in real applications, network data is large and complex, and the number of clusters is usually hard to estimate by hand. Moreover, the actual cluster number is highly task-dependent, and the optimal number should be determined by the node features themselves. A parameter-free attribute graph clustering method based on graph convolutional neural networks is therefore important for graph data mining.
Document clustering aims to divide a document collection into groups of documents with similar content. Existing document clustering methods adopt hierarchy-, partition-, or density-based clustering, with the main idea of gathering documents with similar features into the same cluster. However, current methods consider only the similarity between document contents during clustering and ignore the citation relations that exist among documents. Documents that cite each other generally have higher similarity, so citation relations can also provide valuable information for clustering.
Disclosure of Invention
Addressing the problems in the prior art, the invention provides an attribute graph document clustering method based on a graph convolutional neural network. It exploits the citation relations that existing document clustering ignores, copes with the unbalanced cluster structures found in real graph data and learns cluster-friendly node features from them, estimates the number of clusters from the node features, and thus realizes parameter-free attribute graph clustering.
An undirected attribute graph may be represented as $G = (V, E, X)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of nodes and $E$ is the set of edges. The adjacency matrix of the graph is denoted $A$: if nodes $v_i$ and $v_j$ are connected, then $A_{ij} = 1$, otherwise $A_{ij} = 0$. $X \in \mathbb{R}^{n \times m}$ is the node attribute matrix of graph $G$, where $n$ is the number of nodes and $m$ the dimension of the node attributes. The purpose of attribute graph clustering is to divide the nodes of $G$ into $k$ mutually disjoint clusters; in the invention, $k$ is estimated from the node features by the cluster-estimation module.
To this end, the invention adopts the following technical scheme: first, a cross-layer linked graph convolutional neural network is proposed that copes with the unbalanced cluster structures in real graph data and learns valuable node features from them with strong robustness; second, a regularization term encouraging node-feature self-separation is proposed, realizing joint optimization of feature learning and graph clustering; finally, a cluster-estimation module is proposed to realize data-driven estimation of the cluster number.
Concretely: first, the cross-layer linked graph convolutional neural network is trained on the attribute graph data, with node-feature self-separation encouraged during training, and outputs the graph node features; second, the cluster-estimation module is trained on these node features and outputs the optimal cluster number; the two steps alternate until the maximum number of iterations is reached; finally, the graph node features are clustered with the k-means algorithm.
step (1) attribute graph feature learning:
step (1.1) attribute graph data encoding: the method uses a cross-layer linked graph convolutional neural network to encode the attribute graph data. Encoding the graph data significantly reduces the computational complexity of clustering and avoids the overfitting that graph data sparsity may cause. The propagation rule of the graph convolutional network from layer $l-1$ to layer $l$ is:

$$h_i^{(l)} = \mathrm{ReLU}\Big(W^{(l)} \sum_{v_j \in N(v_i|A)} \frac{h_j^{(l-1)}}{\sqrt{\deg(v_i)\,\deg(v_j)}}\Big) \qquad (1)$$

where $N(v_i|A)$ denotes, in the citation network represented by adjacency matrix $A$, the set comprising document $v_i$ and the documents having a citation relation with $v_i$, i.e., its neighbor documents, $i = 1, \ldots, n$; $W^{(l)}$ is the parameter matrix of the $l$-th layer; $\deg(v)$ denotes the degree of node $v$. When $l = 1$, $h_i^{(0)} = x_i$ in equation (1), i.e., the first graph convolution layer aggregates the original features of the neighbor documents. $\mathrm{ReLU}(\cdot)$ is a nonlinear activation function.
The cross-layer linked graph convolutional network concatenates the output vectors of every graph convolution layer: denoting by $h_i^{(l)}$ the output of the $l$-th graph convolution layer for node $v_i$, the encoding result $d_i$ of the cross-layer linked network for node $v_i$ is the concatenation of the per-layer outputs:

$$d_i = \big[\, h_i^{(1)} \,\|\, h_i^{(2)} \,\|\, \cdots \,\|\, h_i^{(L)} \,\big] \qquad (2)$$
coding result diFor the spliced vector output by each layer of graph convolution neural network, the coding result is subjected to linear mapping operation, and the node characteristic z learned by the graph convolution neural network is outputi. In the encoding step, the method uses a 6-layer cross-chain convolutional neural network.
Step (1.2) node feature data decoding:
The method uses a multilayer perceptron to decode the attribute matrix:

$$\hat{x}_i = \mathrm{MLP}_s(z_i; W_D) \qquad (3)$$

where $\hat{x}_i$ denotes the decoded output of node feature $z_i$, $z_i \in \mathbb{R}^{d_e}$ with $d_e$ the dimension of the encoding vector, and $\mathrm{MLP}_s$ denotes an $s$-layer multilayer perceptron; the method uses a 2-layer perceptron. $W_D$ are the decoder parameters. For the node feature $z_i$ to preserve the true original graph information and guarantee clustering accuracy, the decoder output $\hat{x}_i$ should retain the original graph attribute information $x_i$ as much as possible, so the method uses the mean squared error (MSE) loss as the optimization target of the graph convolutional network:

$$L_{\mathrm{res}} = \frac{1}{n}\sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2 \qquad (4)$$

The mean squared error loss measures the difference between $x_i$ and $\hat{x}_i$. The method optimizes this objective during training so that the node feature $z_i$ can be decoded back into the original graph attribute information $x_i$, thereby ensuring that $z_i$ contains the graph attribute information.
Step (1.3) clustering friendly graph feature learning:
the method provides a regularization term to encourage the separation of node natural cluster structures in the feature space, the regularization term effectively relieves the overlapping problem in graph clustering, and the clustering performance is further improved. Specifically, the pairwise similarity of graph node features is first modeled using student t-distribution Q:
wherein z isiRepresenting a node viCharacteristic vector of (q)ijCan be seen as node viAnd vjProbability of having similar features in the feature space. As the student t distribution has a heavy tail characteristic, nodes with lower similarity are farther away in a feature space, and the natural cluster structures in the graph are macroscopically separated and the problem of crowding in the cluster is relieved. To increase the tendency for inter-cluster separation, further separating nodes in the feature space, the model encourages the distribution Q to approach another degree of freedomThe higher degree of student t distribution P realizes cluster-friendly feature learning:
where the parameter $\theta$ controls the degree of freedom of the student-t distribution and is set to $d_e - 1$, $d_e$ being the dimension of the feature $z$. The higher degree of freedom makes the target distribution $P$ more concentrated than $Q$: $P$ assigns higher probability to similar nodes in the feature space, making them more compact, and lower probability to dissimilar nodes, making them more dispersed, thereby achieving self-separating regularization. The self-separation regularization term is defined as:

$$L_{\mathrm{sep}} = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \qquad (7)$$
where KL is the Kullback-Leibler divergence, which measures the asymmetric distance between distributions $P$ and $Q$, and $p_{ij}$ and $q_{ij}$ are the pairwise similarities of node features. By optimizing the KL divergence between $P$ and $Q$, the encoder increases inter-cluster distance and decreases intra-cluster distance. This helps the model learn cluster-friendly features and thereby improves graph clustering performance. In summary, the feature learning optimization objective of the graph convolutional network is:

$$L = L_{\mathrm{res}} + \alpha L_{\mathrm{sep}} = \frac{1}{n}\sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2 + \alpha\, \mathrm{KL}(P \,\|\, Q) \qquad (8)$$

where $\hat{x}_i$ is the output of the encoding-decoding process for input node attribute $x_i$, and $\alpha = 0.01$ is a hyperparameter. Optimizing this objective lets the graph convolutional network learn the node features of the graph while optimizing the node distribution with clustering as the target.
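A sketch of the self-separation term of equations (5)-(8) under the pairwise student-t formulation above; treating $P$ as a fixed target in each step is our assumption, as the patent does not specify this detail:

```python
def pairwise_t_distribution(z, dof):
    """Pairwise student-t similarities over node features (eqs. 5-6)."""
    dist_sq = torch.cdist(z, z).pow(2)
    sim = (1.0 + dist_sq / dof).pow(-(dof + 1.0) / 2.0)
    mask = 1.0 - torch.eye(z.size(0), device=z.device)  # exclude self-pairs
    sim = sim * mask
    return sim / sim.sum()

def self_separation_loss(z, theta):
    q = pairwise_t_distribution(z, dof=1.0)              # heavy-tailed Q (eq. 5)
    p = pairwise_t_distribution(z, dof=theta).detach()   # higher-dof target P (eq. 6)
    eps = 1e-12
    # KL(P || Q), eq. (7); zero entries of P contribute nothing.
    return (p * (p.add(eps).log() - q.add(eps).log())).sum()

def feature_learning_loss(x, x_hat, z, theta, alpha=0.01):
    # Overall feature-learning objective, eq. (8).
    return reconstruction_loss(x, x_hat) + alpha * self_separation_loss(z, theta)
```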
Step (2): estimating the number of clustering clusters:
most of the traditional clustering methods need a user to specify the number of clustering clusters, and in order to expand the model to a parameter-free condition, the method provides a deep clustering estimation model. The model aims to estimate the optimal cluster number from the node features z.
The model uses a softmax autoencoder for cluster estimation. The softmax autoencoder encodes and decodes the node features z from step (1), with a softmax nonlinear function in the output layer of the encoder; this nonlinearity converts the node features into a soft clustering probability distribution. The number of hidden-layer neurons represents the upper limit of the number of clusters. The output of the softmax encoder is $y_i \in \mathbb{R}^{d_c}$, where $d_c$ is the number of hidden units. Because softmax activation is used, the entries of the cluster assignment $y_i$ sum to 1. The method sets $d_c$ to the total number of nodes, so the cluster estimation is entirely parameter-free.
The softmax autoencoder estimates the number of clusters by counting cluster labels, but by itself it tends to generate uniformly distributed cluster assignments, which is disadvantageous for cluster estimation. The model therefore learns a concentrated soft cluster distribution by introducing an additional Gini-index regularization, which can be expressed as:

$$\mathrm{Gini}(y_i) = 1 - \sum_{j=1}^{d_c} y_{ij}^2 \qquad (9)$$

Ignoring the constant 1, the regularization term can be written as:

$$L_{\mathrm{gini}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d_c} y_{ij}^2 \qquad (10)$$
optimizing the kini coefficient loss can facilitate low entropy distribution of cluster assignments. Combining the reconstruction loss with the regularization of the kini coefficients, the overall loss function of the softmax autoencoder can be expressed as:
wherein z isiAndrepresents the input and output of the softmax self-encoder, and β ═ 0.1 is a hyperparameter.
The softmax autoencoder is trained with this objective to obtain the cluster assignment result $Y$, where $y_{ij}$ is the probability of assigning node $v_i$ to the $j$-th cluster. Finally, the model counts the number of distinct labels over all nodes as the estimate of the cluster number $k$:

$$k = \mathrm{Card}\big(\{\arg\max_j Y_{ij} : i = 1, \ldots, n\}\big) \qquad (12)$$

where the Card function counts the number of distinct elements in the set.
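The cluster-estimation module of equations (9)-(12) could be sketched as follows; this continues the illustrative code above and is an assumption, not the inventors' implementation:

```python
class SoftmaxAutoencoder(nn.Module):
    """Softmax autoencoder for data-driven cluster-number estimation."""

    def __init__(self, z_dim, d_c):
        super().__init__()
        self.encoder = nn.Linear(z_dim, d_c)  # d_c = upper bound on cluster count (here n)
        self.decoder = nn.Linear(d_c, z_dim)

    def forward(self, z):
        y = torch.softmax(self.encoder(z), dim=1)  # soft assignments, rows sum to 1
        return y, self.decoder(y)

def cluster_estimation_loss(z, z_hat, y, beta=0.1):
    mse = ((z - z_hat) ** 2).sum(dim=1).mean()  # reconstruction of z
    gini = -(y ** 2).sum(dim=1).mean()          # eq. (10): favors low-entropy assignments
    return mse + beta * gini                    # eq. (11)

def estimate_k(y):
    # eq. (12): number of distinct arg-max labels over all nodes.
    return int(torch.unique(y.argmax(dim=1)).numel())
```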
Step (3): alternately training the graph convolution and cluster-estimation modules: feature learning and cluster estimation are optimized simultaneously by alternating training. In each iteration, the parameters of the softmax autoencoder are first fixed and feature learning is optimized with equation (8); then the parameters of the feature learning model are fixed and the parameters of the softmax autoencoder are optimized with equation (11); finally, the cluster number $k$ is computed from the cluster assignment $Y$ via equation (12).
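Putting the pieces together, the alternating schedule might be implemented as below; the optimizer and learning rate follow the embodiment described later, while the loop structure and function names are assumptions:

```python
def train(x, adj, encoder, decoder, softmax_ae, theta, iters=200):
    opt_feat = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    opt_clus = torch.optim.Adam(softmax_ae.parameters(), lr=1e-3)

    for _ in range(iters):
        # Step A: fix the softmax autoencoder, optimize feature learning (eq. 8).
        z = encoder(x, adj)
        loss_f = feature_learning_loss(x, decoder(z), z, theta)
        opt_feat.zero_grad()
        loss_f.backward()
        opt_feat.step()

        # Step B: fix feature learning, optimize the softmax autoencoder (eq. 11).
        with torch.no_grad():
            z = encoder(x, adj)
        y, z_hat = softmax_ae(z)
        loss_c = cluster_estimation_loss(z, z_hat, y)
        opt_clus.zero_grad()
        loss_c.backward()
        opt_clus.step()

    return z, estimate_k(y)  # node features and estimated cluster number (eq. 12)
```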
Step (4): outputting the clustering result: the maximum number of optimization rounds is set to 200; once reached, the model outputs the node features $z$ and the estimated cluster number $k_e$. Finally, with $z$ and $k_e$ as input, the k-means clustering algorithm is run to obtain the graph clustering result.
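For this final step, with $z$ and $k_e$ as returned by the training sketch above (illustrative names), plain k-means, here via scikit-learn, completes the clustering:

```python
from sklearn.cluster import KMeans

z_np = z.detach().cpu().numpy()
labels = KMeans(n_clusters=k_e, n_init=10).fit_predict(z_np)
```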
Advantageous effects
The invention effectively improves the clustering accuracy.
Drawings
FIG. 1: flow chart of the attribute graph clustering method based on a graph convolutional neural network.
FIG. 2: structure of the attribute graph clustering model based on a graph convolutional neural network.
FIG. 3: visualization of the clustering results on the Cora data set.
Detailed Description
The validity of the method is verified taking the Cora, Citeseer and Pubmed literature databases (together with the Wiki data set) as examples. Document attribute graph data is first constructed from these databases. A document attribute graph may be expressed as $G = (A, X)$, where $A$ is the adjacency matrix: if documents $v_i$ and $v_j$ have a citation relation, then $A_{ij} = 1$, otherwise $A_{ij} = 0$. $X$ is the document attribute matrix; the $i$-th row vector $x_i$ of $X$ contains a description of the contents of document $v_i$. $X$ is constructed as follows: (1) eliminate the function words in the documents, i.e., adverbs, prepositions, conjunctions, auxiliary words, etc.; (2) eliminate words with frequency less than 10; (3) construct the word-vector feature of each document from the remaining words: if the $j$-th word occurs in document $v_i$, then $x_{ij} = 1$, otherwise $x_{ij} = 0$. The parameters of the constructed document attribute graphs are as follows:
TABLE 1

| Data set | Number of nodes | Number of edges | Number of real clusters | Feature dimension |
|---|---|---|---|---|
| Cora | 2708 | 5429 | 7 | 1433 |
| Citeseer | 3327 | 4732 | 6 | 3703 |
| Pubmed | 19717 | 44338 | 3 | 500 |
| Wiki | 2405 | 17981 | 17 | 4973 |
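The construction of X described above can be sketched as follows; the tokenization and stopword list are assumptions, and only the frequency threshold of 10 and the binary word-occurrence encoding come from the text:

```python
from collections import Counter
import numpy as np

def build_attribute_matrix(docs, stopwords, min_freq=10):
    """docs: list of token lists, one per document. Returns binary X (n x m)."""
    counts = Counter(t for doc in docs for t in doc if t not in stopwords)
    vocab = sorted(t for t, c in counts.items() if c >= min_freq)
    index = {t: j for j, t in enumerate(vocab)}
    x = np.zeros((len(docs), len(vocab)), dtype=np.float32)
    for i, doc in enumerate(docs):
        for t in doc:
            if t in index:
                x[i, index[t]] = 1.0  # x_ij = 1 iff word j occurs in document v_i
    return x, vocab
```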
The following describes the specific implementation steps of the invention on the four attribute graph data sets:
step (1) attribute graph feature learning:
step (1.1) attribute graph data encoding: as shown in the encoder portion of FIG. 2, the invention uses a 6-layer graph convolutional neural network to encode the four data sets in Table 1. The output of each graph convolution layer is used as the input of the next layer, and the forward propagation from layer $l-1$ to layer $l$ follows equation (1). Taking the Cora attribute graph as an example, for document $v_i$ in the graph, $N(v_i|A)$ denotes the set comprising document $v_i$ and the documents having a citation relation with $v_i$ (its neighbor documents), $i = 1, 2, \ldots, 2708$, i.e., there are 2708 documents in total. $W^{(l)}$ is the parameter matrix of the $l$-th layer, and $\deg(v_i)$ is the degree of node $v_i$. When $l = 1$, $h_i^{(0)} = x_i$ in equation (1), i.e., the first graph convolution layer aggregates the original features of the neighbor documents of $v_i$. $\mathrm{ReLU}(\cdot)$ is the nonlinear activation function, $\mathrm{ReLU}(x) = \max(0, x)$.
As shown in FIG. 2, besides forward propagation, the output of each graph convolution layer is passed to an inter-layer aggregation module, which concatenates the node features from the different layers as in equation (2). A linear mapping is then applied to the concatenated vector to further reduce its dimension, yielding the node feature vector $z$: for the Cora data set the linear mapping produces a 16-dimensional node feature vector, and for the larger Pubmed data set a 32-dimensional node feature vector.
Step (1.2) node feature data decoding: as shown in the decoder portion of FIG. 2, the invention uses a two-layer multilayer perceptron to decode the encoded features $z_i$ of the four data sets, following equation (3), where $\hat{x}_i$ denotes the decoded output of node feature $z_i$ and $d_e$ is the dimension of the encoding vector ($d_e$ is 16 for Cora, Citeseer and Wiki, and 32 for Pubmed); $\mathrm{MLP}_2$ denotes the two-layer perceptron and $W_D$ the decoder parameters. The decoder attempts to reconstruct the node attribute information from the feature $z_i$, so the method sets the hidden-layer structure of the multilayer perceptron to $d_e$-500-1000-$m$, where $m$ is the dimension of the node attributes; the specific values of $m$ for the four data sets are given in the fifth column of Table 1.
Step (1.3) cluster-friendly graph feature learning: as shown in the middle of FIG. 2, the student-t distribution $Q$ of equation (5) is first used to model the pairwise similarity of the node features $z_i$ of the four data sets. The distribution $Q$ is represented as a matrix $Q \in \mathbb{R}^{n \times n}$ whose element in row $i$, column $j$ is $q_{ij}$, where $n$ is the number of nodes of the data set; the values of $n$ for Cora, Citeseer, Wiki and Pubmed are given in the second column of Table 1. The student-t distribution $P$ with higher degree of freedom is then computed in the same way via equation (6).
In equation (6), the parameter $\theta$ controls the degree of freedom of the student-t distribution and is set to $d_e - 1$, with $d_e$ the dimension of the feature $z$; for the Cora, Citeseer, Wiki and Pubmed data sets, $\theta$ is set to 1432, 3702, 499 and 4972 respectively. The higher degree of freedom makes the target distribution $P$ more concentrated than $Q$, so $P$ assigns higher probability to similar nodes in the feature space, making them more compact, and lower probability to dissimilar nodes, making them more dispersed, thereby achieving self-separating regularization. The method optimizes the graph convolutional network with the self-separation term $L_{\mathrm{sep}} = \mathrm{KL}(P \,\|\, Q)$ of equation (7), where KL is the Kullback-Leibler divergence measuring the asymmetric distance between distributions $P$ and $Q$, and $p_{ij}$, $q_{ij}$ are the pairwise similarities of node features. In summary, the feature learning optimization objective of the graph convolutional network is equation (8), $L = L_{\mathrm{res}} + \alpha L_{\mathrm{sep}}$, where $\hat{x}_i$ is the output of the encoding-decoding process for input node attribute $x_i$ and $\alpha = 0.01$ is a hyperparameter.
Step (2): estimating the number of clusters: as shown in the cluster-estimation module of FIG. 2, the invention uses a softmax autoencoder for cluster estimation. The softmax autoencoder encodes and decodes the node features $z$ from step (1), with a softmax nonlinear function in the output layer of the encoder that converts the node features into a soft clustering probability distribution. The output of the softmax encoder is $y_i \in \mathbb{R}^{d_c}$, where $d_c$ is the number of hidden units. For the Cora, Citeseer, Wiki and Pubmed data sets, the invention sets $d_c$ to the total number of nodes $n$.
The Gini-index regularization of the cluster-estimation module is given by equation (9); ignoring the constant 1, it reduces to equation (10). With equation (10) as the regularization term and the MSE loss as the main optimization objective, the overall loss function of the softmax autoencoder is equation (11), where $z_i$ and $\hat{z}_i$ are the input and output of the softmax autoencoder and $\beta$ is set to 0.1 uniformly for the four data sets. The softmax autoencoder is trained with this objective to obtain the cluster assignment result $Y$, where $y_{ij}$ is the probability of assigning node $v_i$ to the $j$-th cluster. Finally, the model counts the number of distinct labels over all nodes as the estimate of the cluster number $k$, computed via equation (12), where the Card function counts the number of distinct elements in the set.
Step (3): alternately training the graph convolution and cluster-estimation modules: feature learning and cluster estimation are optimized simultaneously by alternating training. In each iteration, the parameters of the softmax autoencoder are first fixed and feature learning is optimized with equation (8); then the parameters of the feature learning model are fixed and the softmax autoencoder is optimized with equation (11); finally, the cluster number $k$ is computed from the cluster assignment $Y$ via equation (12). The neural network in each module is optimized with the Adam optimizer, with a fixed learning rate of $10^{-3}$ and a dropout rate of 0.2.
Step (4): outputting the clustering result: the maximum number of optimization rounds is set to 200; once reached, the model outputs the node features $z$ and the estimated cluster number $k_e$. Finally, with $z$ and $k_e$ as input, the k-means clustering algorithm is run to obtain the graph clustering result.
To illustrate the beneficial effects of the method, comparison experiments were performed against a number of different algorithms. k-means is a traditional distance-based clustering method whose main idea is to assign samples to the nearest cluster centers. DeepWalk performs truncated random walks on the graph to generate node sequences, learns node representations with a Skip-gram model, and finally clusters them with k-means. GraphEncoder learns node representations by training stacked sparse autoencoders and applies spectral clustering to obtain the clustering result. MGAE trains a graph convolutional neural network with an unsupervised mapping loss and performs spectral clustering on the node features.
The experimental results of the above comparison algorithms on the Cora, Citeseer, Wiki and Pubmed data sets are shown in Table 2:
TABLE 2
As shown in Table 2, the method of the invention achieves the best results on the real attribute graph data sets, performing better on the ACC, F1, NMI and ARI clustering indexes; the method is thus reasonable and reliable.
To verify that the proposed self-separation regularization term improves the distribution of node features in high-dimensional space, the feature vectors are visualized: the t-SNE algorithm is run on the feature vectors $z$ of the Cora data set, with the visualization shown in FIG. 3. Comparing the first row (a)-(e) with the second row (f)-(j) of FIG. 3 reveals a strong contrast between not using and using the self-separation regularization term: the cluster structures in the second row are denser and the gaps between different clusters more pronounced, indicating that self-separation regularization substantially increases inter-cluster distance and decreases intra-cluster distance.
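A visualization like FIG. 3 can be reproduced with standard t-SNE, e.g. as follows, where `z_np` and `labels` come from the sketches above:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

emb = TSNE(n_components=2).fit_transform(z_np)  # 2-D embedding of node features
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap='tab10')
plt.show()
```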
Based on a graph convolutional neural network, the invention first encodes and decodes the attribute graph, proposing a cross-layer linked graph convolutional network that learns graph node features without supervision; during this process, a self-separation regularization term optimizes the feature distribution of the nodes so that they exhibit a more pronounced cluster structure. A cluster-estimation module is then proposed to estimate the optimal cluster number from the node features. The experimental results and the visualization experiment on real data sets show that the method is reasonable and reliable and can provide dependable help for attribute graph clustering tasks.
Claims (9)
1. An attribute graph document clustering method based on a graph convolution neural network is characterized in that:
step (1), performing document attribute graph feature learning with a cross-layer linked graph convolutional neural network, comprising two stages, encoding and decoding, to obtain the features z of all graph nodes, the features z completing the separation of the natural cluster structures of the graph nodes in the feature space;
step (2), estimating the optimal cluster number from the node features z with a deep cluster-estimation model;
step (3), alternately executing the above two steps until the maximum number of iterations is reached to finish training;
step (4), obtaining, with the trained cross-layer linked graph convolutional neural network and deep cluster-estimation model, the features of all document attribute graph nodes to be clustered and the estimated number of clusters; and, with these features and the estimated cluster number as input, obtaining the document attribute graph clustering result with the k-means clustering method.
2. The method for clustering the attribute graph documents based on the graph convolution neural network as claimed in claim 1, wherein: the step (1) further comprises the steps of,
step (1.1) attribute graph data encoding: the attribute graph data is encoded; let the input document attribute graph be $G = (A, X)$, where $A$ is the adjacency matrix: if documents $v_i$ and $v_j$ have a citation relation, then $A_{ij} = 1$, otherwise $A_{ij} = 0$; $X$ is the document attribute matrix, each row vector representing a description of the contents of a document, the $i$-th row vector $x_i$ of $X$ representing the description of the contents of document $v_i$; the propagation rule of the graph convolutional neural network from layer $l-1$ to layer $l$ is:

$$h_i^{(l)} = \mathrm{ReLU}\Big(W^{(l)} \sum_{v_j \in N(v_i|A)} \frac{h_j^{(l-1)}}{\sqrt{\deg(v_i)\,\deg(v_j)}}\Big) \qquad (1)$$

where $N(v_i|A)$ denotes, in the citation network represented by adjacency matrix $A$, the set comprising document $v_i$ and the documents having a citation relation with $v_i$, i.e., the neighbor documents, $i = 1, \ldots, n$ for $n$ documents; $W^{(l)}$ is the parameter matrix of the $l$-th layer; $\deg(v)$ denotes the degree of node $v$; when $l = 1$, $h_i^{(0)} = x_i$ in equation (1), i.e., the first graph convolution layer aggregates the original features of the neighbor documents; $\mathrm{ReLU}(\cdot)$ is a nonlinear activation function;
the cross-layer linked graph convolutional neural network concatenates the output vectors of every graph convolution layer: denoting by $h_i^{(l)}$ the output of the $l$-th graph convolution layer for node $v_i$, the encoding result $d_i$ of the cross-layer linked network for node $v_i$ is the concatenation of the per-layer outputs:

$$d_i = \big[\, h_i^{(1)} \,\|\, h_i^{(2)} \,\|\, \cdots \,\|\, h_i^{(L)} \,\big] \qquad (2)$$

a linear mapping is applied to the encoding result, outputting the node feature $z_i$ of node $v_i$ learned by the graph convolutional neural network;
step (1.2) node feature data decoding:
decoding of the attribute matrix is achieved with a multilayer perceptron: $\hat{x}_i = \mathrm{MLP}_s(z_i; W_D)$.
3. The method for clustering the attribute graph documents based on the graph convolution neural network as claimed in claim 2, wherein:
the construction method of X comprises the following steps: (1) eliminating the fictitious words in all the document documents; (2) eliminating all words with frequencies less than 10 in the literature documents; (3) constructing word vector characteristics of each document by using residual words if the jth word is in the document viIn (b) is present, then xij1, otherwise xij=0。
4. The method for clustering attribute graph documents based on a graph convolutional neural network as claimed in claim 1, wherein: the feature learning optimization objective of the cross-layer linked graph convolutional neural network is:

$$L = \frac{1}{n}\sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2 + \alpha\, \mathrm{KL}(P \,\|\, Q)$$

where $\hat{x}_i$ is the output of the encoding-decoding process for input node attribute $x_i$ and $\alpha$ is a hyperparameter; $q_{ij}$ represents the probability that nodes $v_i$ and $v_j$ have similar features in the feature space:

$$q_{ij} = \frac{(1 + \|z_i - z_j\|^2)^{-1}}{\sum_{i' \neq j'} (1 + \|z_{i'} - z_{j'}\|^2)^{-1}}$$

and $p_{ij}$ is used to approach another student-t distribution with higher degree of freedom to realize cluster-friendly feature learning:

$$p_{ij} = \frac{(1 + \|z_i - z_j\|^2/\theta)^{-\frac{\theta+1}{2}}}{\sum_{i' \neq j'} (1 + \|z_{i'} - z_{j'}\|^2/\theta)^{-\frac{\theta+1}{2}}}$$

where the parameter $\theta$ controls the degree of freedom of the student-t distribution and is set to $d_e - 1$;
optimizing this objective function realizes, while the graph convolutional neural network learns the node features of the graph, the optimization of the node distribution with clustering as the target.
5. The method for clustering the attribute graph documents based on the graph convolution neural network as claimed in claim 1, wherein: the step (2) further comprises the following steps:
the deep cluster-estimation model adopts a softmax autoencoder for cluster estimation; the softmax autoencoder encodes and decodes the node features z of step (1), with a softmax nonlinear function in the output layer of the encoder that converts the node features into a soft clustering probability distribution; the number of hidden-layer neurons represents the upper limit of the number of clusters; the output of the softmax encoder is the cluster assignment result $y_i \in \mathbb{R}^{d_c}$, where $d_c$ is the number of hidden units; because softmax activation is used, the entries of the cluster assignment $y_i$ sum to 1; $d_c$ is set to the total number of nodes, realizing completely parameter-free cluster estimation.
6. The method according to claim 5, wherein: the overall loss function of the softmax autoencoder is expressed as:

$$L_c = \frac{1}{n}\sum_{i=1}^{n} \| z_i - \hat{z}_i \|^2 + \beta L_{\mathrm{gini}}, \qquad L_{\mathrm{gini}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d_c} y_{ij}^2$$

where $z_i$ and $\hat{z}_i$ are the input and output of the softmax autoencoder and $\beta$ is a hyperparameter.
7. The method for clustering attribute graph documents based on a graph convolutional neural network as claimed in claim 6, wherein the softmax autoencoder counts the number of distinct labels of all nodes as the estimate of the cluster number k, the computation of k being described as:

$$k = \mathrm{Card}\big(\{\arg\max_j Y_{ij} : i = 1, \ldots, n\}\big)$$

where the Card function counts the number of distinct elements in the set, and $Y_{ij}$ denotes the probability that document $v_i$ belongs to the $j$-th cluster.
8. The method for clustering attribute graph documents based on a graph convolutional neural network as claimed in claim 1, wherein step (3) comprises: simultaneously optimizing the cross-layer linked graph convolutional neural network and the deep cluster-estimation model by alternating training; in each iteration, the parameters of the softmax autoencoder are first fixed and feature learning is optimized with equation (8); then the parameters of the feature learning model are fixed and the parameters of the softmax autoencoder are optimized with equation (11); then the cluster number k is computed from the cluster assignment Y via equation (12).
9. The method for clustering attribute graph documents based on a graph convolutional neural network as claimed in claim 1, wherein: the attribute graph data encoding step uses a 6-layer cross-layer linked graph convolutional neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110244762.6A CN113157957A (en) | 2021-03-05 | 2021-03-05 | Attribute graph document clustering method based on graph convolution neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110244762.6A CN113157957A (en) | 2021-03-05 | 2021-03-05 | Attribute graph document clustering method based on graph convolution neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113157957A true CN113157957A (en) | 2021-07-23 |
Family
ID=76884221
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110244762.6A Pending CN113157957A (en) | 2021-03-05 | 2021-03-05 | Attribute graph document clustering method based on graph convolution neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113157957A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190228312A1 (en) * | 2018-01-25 | 2019-07-25 | SparkCognition, Inc. | Unsupervised model building for clustering and anomaly detection |
CN110399493A (en) * | 2019-07-29 | 2019-11-01 | 中南大学 | A kind of author's disambiguation method based on incremental learning |
CN110889015A (en) * | 2019-10-31 | 2020-03-17 | 天津工业大学 | Independent decoupling convolutional neural network characterization algorithm for graph data |
CN111259979A (en) * | 2020-02-10 | 2020-06-09 | 大连理工大学 | Deep semi-supervised image clustering method based on label self-adaptive strategy |
CN111985623A (en) * | 2020-08-28 | 2020-11-24 | 复旦大学 | Attribute graph group discovery method based on maximized mutual information and graph neural network |
CN112380435A (en) * | 2020-11-16 | 2021-02-19 | 北京大学 | Literature recommendation method and recommendation system based on heterogeneous graph neural network |
Non-Patent Citations (1)
Title |
---|
李亚芳 (Li Yafang) et al.: "Deep network embedding method based on community optimization" (基于社区优化的深度网络嵌入方法), 《计算机应用》 (Journal of Computer Applications), 25 January 2021 (2021-01-25), pages 1957-1963 *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113688386A (en) * | 2021-07-26 | 2021-11-23 | 中国人民解放军陆军工程大学 | Graph structure-based intelligent detection method and system for malicious PDF (Portable document Format) document |
CN113505849A (en) * | 2021-07-27 | 2021-10-15 | 电子科技大学 | Multilayer network clustering method based on comparison learning |
CN113505849B (en) * | 2021-07-27 | 2023-09-19 | 电子科技大学 | Multi-layer network clustering method based on contrast learning |
CN113869404A (en) * | 2021-09-27 | 2021-12-31 | 北京工业大学 | Self-adaptive graph volume accumulation method for thesis network data |
CN113869404B (en) * | 2021-09-27 | 2024-05-28 | 北京工业大学 | Self-adaptive graph roll accumulation method for paper network data |
CN114462524A (en) * | 2022-01-19 | 2022-05-10 | 北京工业大学 | Clustering method for data center batch processing operation |
WO2024055677A1 (en) * | 2022-09-15 | 2024-03-21 | 华为技术有限公司 | Deep clustering method, apparatus and system |
CN115545098A (en) * | 2022-09-23 | 2022-12-30 | 青海师范大学 | Node classification method of three-channel graph neural network based on attention mechanism |
CN115545098B (en) * | 2022-09-23 | 2023-09-08 | 青海师范大学 | Node classification method of three-channel graph neural network based on attention mechanism |
CN115905617A (en) * | 2023-03-02 | 2023-04-04 | 南京邮电大学 | Video scoring prediction method based on deep neural network and double regularization |
CN117971354A (en) * | 2024-03-29 | 2024-05-03 | 苏州元脑智能科技有限公司 | Heterogeneous acceleration method, device, equipment and storage medium based on end-to-end learning |
CN117971354B (en) * | 2024-03-29 | 2024-06-14 | 苏州元脑智能科技有限公司 | Heterogeneous acceleration method, device, equipment and storage medium based on end-to-end learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113157957A (en) | Attribute graph document clustering method based on graph convolution neural network | |
CN110929029A (en) | Text classification method and system based on graph convolution neural network | |
CN109389151B (en) | Knowledge graph processing method and device based on semi-supervised embedded representation model | |
CN111785329B (en) | Single-cell RNA sequencing clustering method based on countermeasure automatic encoder | |
CN112417219A (en) | Hyper-graph convolution-based hyper-edge link prediction method | |
CN112966114B (en) | Literature classification method and device based on symmetrical graph convolutional neural network | |
CN108108854A (en) | City road network link prediction method, system and storage medium | |
CN112765352A (en) | Graph convolution neural network text classification method based on self-attention mechanism | |
CN113693563B (en) | Brain function network classification method based on hypergraph attention network | |
CN111861756B (en) | Group partner detection method based on financial transaction network and realization device thereof | |
CN113065649A (en) | Complex network topology graph representation learning method, prediction method and server | |
CN111476261A (en) | Community-enhanced graph convolution neural network method | |
CN114118369B (en) | Image classification convolutional neural network design method based on group intelligent optimization | |
CN113220897A (en) | Knowledge graph embedding model based on entity-relation association graph | |
CN116469561A (en) | Breast cancer survival prediction method based on deep learning | |
CN117349494A (en) | Graph classification method, system, medium and equipment for space graph convolution neural network | |
CN115640842A (en) | Network representation learning method based on graph attention self-encoder | |
CN114329233A (en) | Cross-region cross-scoring collaborative filtering recommendation method and system | |
CN108805280B (en) | Image retrieval method and device | |
CN117036760A (en) | Multi-view clustering model implementation method based on graph comparison learning | |
CN116959588A (en) | Biochemical passage crosstalk identification method | |
CN114037014A (en) | Reference network clustering method based on graph self-encoder | |
CN117408336A (en) | Entity alignment method for structure and attribute attention mechanism | |
CN114880538A (en) | Attribute graph community detection method based on self-supervision | |
CN115691817A (en) | LncRNA-disease association prediction method based on fusion neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |