CN113887698B - Integral knowledge distillation method and system based on graph neural network - Google Patents
Integral knowledge distillation method and system based on graph neural network
- Publication number
- CN113887698B (application number CN202110982472.1A)
- Authority
- CN
- China
- Prior art keywords
- graph
- knowledge
- model
- attribute
- teacher
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
Abstract
The invention aims to provide a holistic knowledge distillation method based on a graph neural network, which comprises the following steps: given the feature representations and classification predictions learned by the teacher and student networks, each sample is taken as a node, the features learned by the network are taken as the node attributes, and the K-nearest-neighbor (KNN) relationships between the classification predictions are taken as edges, so that an attribute graph is constructed for each network; a topology-adaptive graph convolutional network is used to aggregate the node attributes and topology information of neighborhood samples in the attribute graph to extract the holistic knowledge, which is expressed as a unified graph-based embedding vector; the mutual information between the graph-embedded representations of the student network and the teacher network is maximized using an InfoNCE estimate, and a feature memory bank technique is used to accelerate training. The method can integrate the individual-level knowledge and the relation-level knowledge in the teacher network at the same time, so that the student network learns the holistic knowledge and its performance is improved.
Description
Technical Field
The invention relates to the field of deep learning and computer vision, in particular to a global knowledge distillation method and system based on a graph neural network.
Background
Deep Neural Networks (DNNs) have enjoyed great success in a variety of applications. However, their success depends to a large extent on large amounts of computing and storage resources, which are often not available on embedded and mobile devices. In order to reduce costs while maintaining satisfactory results, model compression techniques have become a hot research topic. Knowledge distillation is one such technique: it transfers knowledge from a large, already-trained teacher network to a small student network to be learned, so that the student network's accuracy improves while it remains small and fast.
Knowledge extracted from the teacher network plays a central role in knowledge distillation. Existing knowledge distillation methods fall into two types according to the kind of knowledge extracted: individual-level knowledge distillation and relation-level knowledge distillation. Individual-level knowledge distillation extracts knowledge from each data instance independently using the teacher network and provides richer supervision than discrete labels, including probabilistic outputs (logits), feature representations, feature maps, and so on. Relation-level knowledge distillation extracts knowledge from the relationships between pairs of samples, and the student network is trained so that these relationships are preserved between student and teacher networks of different architectures.
Although both of the above kinds of knowledge distillation have been successful, existing methods employ the two techniques independently and ignore their inherent relevance, especially when the teacher network's capacity is limited and one type of knowledge extracted on its own is insufficient for the student network to learn from. Intuitively, individual-level knowledge and relation-level knowledge can be seen as two naturally correlated views of the same teacher network: two similar instances often have similar individual features and similar relational patterns, and exploiting this correlation is critical for training a more discriminative student network. Integrating individual-level and relation-level knowledge while preserving their inherent consistency is therefore of great importance for knowledge distillation.
Disclosure of Invention
The present invention has been made to overcome the above-mentioned drawbacks of the prior art, and provides a Holistic Knowledge Distillation (HKD) method and system based on a graph neural network that can integrate individual-level knowledge and relation-level knowledge at the same time.
The invention aims at realizing the following technical scheme:
an overall knowledge distillation method based on a graph neural network, comprising:
Step 1: construct attribute graphs for the teacher model and the student model respectively. The images are input into the teacher model and the student model to obtain feature representations f_t and f_s (where d_t and d_s are the dimensions of the feature representations output by the teacher model and the student model, respectively) and classification predictions p_t and p_s. Attribute graphs G_t = {A_t, F_t} and G_s = {A_s, F_s} are then constructed for the teacher model and the student model respectively, where each node represents an instance and the node attribute matrices F_t and F_s collect the learned feature representations. A_t and A_s are the adjacency matrices of the attribute graphs, constructed from p_t and p_s as A = φ_KNN(p), where φ_KNN(·) is a graph construction function based on K nearest neighbors (KNN). Throughout training, G_t is fixed while the attributes and structure of G_s change dynamically.
The attribute graph defined above has the following characteristics. First, compared with the fully connected inter-instance graphs constructed by existing relation-level knowledge distillation methods, the KNN graph filters out the least relevant sample pairs; this matters because only a few samples in a randomly sampled batch are truly correlated and provide sufficient information for learning the node representations (these concepts are well known to researchers in the field and are not described in detail here). Second, since edges are constructed from the predicted probabilities, the graph can model both inter-class and intra-class information. Finally, the graph neural network can extract individual-level knowledge and relation-level knowledge from the attribute graph jointly and very efficiently.
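For concreteness, the following is a minimal sketch, assuming a PyTorch implementation, of how the KNN attribute graph of Step 1 could be built; the helper name knn_attribute_graph, the dot-product similarity between prediction vectors, and the neighborhood size k are illustrative choices rather than details given by the patent.

```python
import torch

def knn_attribute_graph(feats: torch.Tensor, probs: torch.Tensor, k: int = 8):
    """feats: (N, d) node attributes F; probs: (N, C) classification predictions p.
    Returns the adjacency matrix A of the KNN graph and the attribute matrix F."""
    sim = probs @ probs.t()                      # (N, N) similarity between prediction vectors
    sim.fill_diagonal_(float('-inf'))            # exclude self when picking neighbors
    idx = sim.topk(k, dim=1).indices             # k most similar samples per node
    A = torch.zeros_like(sim)
    A.scatter_(1, idx, 1.0)                      # directed KNN edges
    A = ((A + A.t()) > 0).float()                # symmetrize
    A.fill_diagonal_(1.0)                        # self-loops for the graph convolution
    return A, feats
```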
Step 2: aggregate the node attributes and topology information of the neighborhood samples in the attribute graph using a Topology Adaptive Graph Convolutional Network (TAGCN) to extract the holistic knowledge, thereby obtaining a graph-based embedding vector.
As shown in formulas (1) and (2) below, TAGCN learns from the attribute information and the topology information of the nodes simultaneously to extract the holistic knowledge, yielding the graph-based representations H_t and H_s of the teacher model and the student model.
where Θ_s^l and Θ_t^l are learnable parameters, g_t and g_s are the dimensions of the above graph-based representations, and D_t and D_s are the diagonal degree matrices of the attribute graphs, as shown in equation (3).
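Formulas (1)-(3) appear only as images in the source; a plausible reconstruction, assuming the standard TAGCN polynomial graph filter with L hops, is:

H_t = \sum_{l=0}^{L} \big( D_t^{-1/2} A_t D_t^{-1/2} \big)^{l} F_t \, \Theta_t^{l}   (1)

H_s = \sum_{l=0}^{L} \big( D_s^{-1/2} A_s D_s^{-1/2} \big)^{l} F_s \, \Theta_s^{l}   (2)

(D_t)_{ii} = \sum_{j} (A_t)_{ij}, \qquad (D_s)_{ii} = \sum_{j} (A_s)_{ij}   (3)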
Step 3: maximize the mutual information between the graph-embedded representations of the student model and the teacher model, estimated with InfoNCE, and accelerate training using a feature memory bank technique.
In order for the student model to learn the teacher model's holistic knowledge as fully as possible, the similarity between H_t and H_s should be maximized. Many existing vector-wise similarity metrics (e.g., cosine similarity, Euclidean distance) are not suitable for distilling the holistic knowledge: they are limited by the difference in representational capacity caused by the structural differences between the teacher model and the student model, and directly aligning H_t and H_s may force the student to learn overly fine-grained knowledge. To overcome these limitations, mutual information is used to measure how well the student model distills information from the teacher model, i.e., the mutual information between H_t and H_s is maximized, as shown in equation (4) below.
I(·) denotes the mutual information between two random variables. Inspired by recent research on mutual information estimation, InfoNCE is adopted to estimate the mutual information; the relationship between InfoNCE and the mutual information is given in formula (5):
where f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
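Formulas (4) and (5) are likewise images in the source; a plausible reconstruction, assuming the usual InfoNCE lower bound on mutual information computed over N samples, is:

\max_{\theta_s} \; I(H_t; H_s)   (4)

I(h_t; h_s) \;\geq\; \log N + \mathbb{E}\left[ \log \frac{f(h_t^{i}, h_s^{i})}{\sum_{j=1}^{N} f(h_t^{i}, h_s^{j})} \right]   (5)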
Besides learning from the teacher model, the student model also needs to learn from the data itself (e.g., the label information); cross entropy is the loss function commonly used for classification, and the final loss function is shown in formula (6) below.
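Formula (6) is also an image in the source; a plausible reconstruction, assuming a linear combination of the classification cross-entropy loss and the holistic distillation loss defined above, is:

\mathcal{L} = \mathcal{L}_{CE}(p_s, y) + \beta \, \mathcal{L}_{HKD}, \qquad \mathcal{L}_{HKD} = -\,\hat{I}_{\mathrm{InfoNCE}}(H_t, H_s)   (6)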
Where β is the weight of the linear combination.
Because InfoNCE requires all samples in the dataset as negative samples, computing the holistic distillation loss exactly is prohibitively expensive for larger datasets. To avoid recomputing the representations of all samples during training, feature memory bank techniques are widely used. In the method of this embodiment, G_t and G_s are each constructed within a randomly selected mini-batch; the holistic knowledge reflected by the graph-based representations H_t and H_s depends on the particular attribute graphs and therefore cannot itself be cached with a feature memory bank, so the memory bank is used only to store F_t and F_s. The final approximated holistic distillation loss is defined as shown in equation (7):
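Equation (7) itself is an image in the source. As a rough illustration of the feature memory bank it relies on, the following sketch assumes a momentum-updated bank; the class name, momentum update, and negative sampling scheme are illustrative choices rather than details given by the patent.

```python
import torch
import torch.nn.functional as F

class FeatureMemoryBank:
    """Caches one feature vector per training sample so InfoNCE negatives
    do not have to be recomputed by a forward pass at every step."""
    def __init__(self, num_samples: int, dim: int, momentum: float = 0.5):
        self.bank = F.normalize(torch.randn(num_samples, dim), dim=1)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, indices: torch.Tensor, feats: torch.Tensor):
        # exponential-moving-average update for the samples in this mini-batch
        new = self.momentum * self.bank[indices] + (1.0 - self.momentum) * feats
        self.bank[indices] = F.normalize(new, dim=1)

    def sample_negatives(self, num_neg: int) -> torch.Tensor:
        idx = torch.randint(0, self.bank.size(0), (num_neg,))
        return self.bank[idx]
```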
In step 3, InfoNCE is used to maximize the mutual information between the graph-embedded representations of the student model and the teacher model; the relationship between InfoNCE and the mutual information is shown in formula (8):
where f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
The system for implementing the above holistic knowledge distillation method based on a graph neural network comprises, connected in sequence: a teacher-model and student-model attribute graph construction module, a holistic knowledge extraction module, and a module for maximizing the mutual information between the graph-embedded representations of the student model and the teacher model.
The working principle of the invention is as follows: on top of the individual-level knowledge and the relation-level knowledge of the teacher model, the knowledge distillation method further extracts holistic knowledge from the teacher model using a graph neural network, and the student model learns this holistic knowledge from the teacher model, so that model performance improves more markedly than with other knowledge distillation methods.
The invention has the advantage that the student model learns not only the individual-level and relation-level knowledge from the teacher model but also the more complex holistic knowledge, so the performance improvement of the student model is more pronounced than with other knowledge distillation methods.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a classification of a knowledge distillation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an overall knowledge distillation method based on a graph neural network according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of an overall knowledge distillation method based on a graph neural network according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides a global knowledge distillation method based on a graph neural network, which can simultaneously distill knowledge on an individual and knowledge on a relation.
In the knowledge distillation framework provided by the embodiment of the invention, given the characteristic representation learned by a teacher model and a student model and the classification prediction result, firstly, an attribute graph is constructed for each model, wherein each node represents an example, the node attribute represents the learned characteristic representation, and the edges between the examples are constructed by using the K Nearest Neighbor (KNN) relationship of the classification prediction result.
The node attributes and topology information of neighborhood samples in the attribute graph are aggregated using a Topology Adaptive Graph Convolutional Network (TAGCN) to extract the holistic knowledge, which is represented as a unified graph-based embedding vector. InfoNCE estimation is used to maximize the mutual information between the graph-embedded representations of the teacher model and the student model, and feature memory bank techniques are used to accelerate training. By distilling the complex knowledge in the teacher model's attribute graph, the student model's performance can be improved more markedly than with current knowledge distillation methods.
Specifically, as shown in FIG. 3, the holistic knowledge distillation method based on a graph neural network provided by the embodiment of the present invention mainly includes the following steps:
Step 1: construct attribute graphs for the teacher model and the student model respectively. The images are input into the teacher model and the student model to obtain feature representations f_t and f_s (where d_t and d_s are the dimensions of the feature representations output by the teacher model and the student model, respectively) and classification predictions p_t and p_s. Attribute graphs G_t = {A_t, F_t} and G_s = {A_s, F_s} are then constructed for the teacher model and the student model respectively, where each node represents an instance and the node attribute matrices F_t and F_s collect the learned feature representations. A_t and A_s are the adjacency matrices of the attribute graphs, constructed from p_t and p_s as A = φ_KNN(p), where φ_KNN(·) is a graph construction function based on K nearest neighbors (KNN). Throughout training, G_t is fixed while the attributes and structure of G_s change dynamically.
The attribute graph defined above has the following characteristics. First, compared with the fully connected inter-instance graphs constructed by existing relation-level knowledge distillation methods, the KNN graph filters out the least relevant sample pairs; this matters because only a few samples in a randomly sampled batch are truly correlated and provide sufficient information for learning the node representations (these concepts are well known to researchers in the field and are not described in detail here). Second, since edges are constructed from the predicted probabilities, the graph can model both inter-class and intra-class information. Finally, the graph neural network can extract individual-level knowledge and relation-level knowledge from the attribute graph jointly and very efficiently.
Step 2: aggregate the node attributes and topology information of the neighborhood samples in the attribute graph using a Topology Adaptive Graph Convolutional Network (TAGCN) to extract the holistic knowledge, thereby obtaining a graph-based embedding vector.
As shown in formulas (1) and (2) below, TAGCN learns from the attribute information and the topology information of the nodes simultaneously to extract the holistic knowledge, yielding the graph-based representations H_t and H_s of the teacher model and the student model.
where Θ_s^l and Θ_t^l are learnable parameters, g_t and g_s are the dimensions of the above graph-based representations, and D_t and D_s are the diagonal degree matrices of the attribute graphs, as shown in equation (3).
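For illustration, a minimal PyTorch sketch of a topology-adaptive graph convolution of this form follows; it is an assumption about the exact layer, and an off-the-shelf implementation such as TAGConv from torch_geometric could be substituted.

```python
import torch
import torch.nn as nn

class TAGConvLayer(nn.Module):
    """Sum of powers of the normalized adjacency applied to node features,
    each hop with its own learnable weight matrix Theta_l."""
    def __init__(self, in_dim: int, out_dim: int, num_hops: int = 2):
        super().__init__()
        self.num_hops = num_hops
        self.thetas = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_hops + 1)]
        )

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(dim=1)
        d_inv_sqrt = deg.clamp(min=1e-12).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)  # D^-1/2 A D^-1/2
        out = self.thetas[0](feats)                 # 0-hop term
        h = feats
        for l in range(1, self.num_hops + 1):
            h = norm_adj @ h                        # propagate one more hop
            out = out + self.thetas[l](h)
        return out
```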
Step 3: maximize the mutual information between the graph-embedded representations of the student model and the teacher model, estimated with InfoNCE, and accelerate training using a feature memory bank technique.
In order for the student model to learn the teacher model's holistic knowledge as fully as possible, the similarity between H_t and H_s should be maximized. Many existing vector-wise similarity metrics (e.g., cosine similarity, Euclidean distance) are not suitable for distilling the holistic knowledge: they are limited by the difference in representational capacity caused by the structural differences between the teacher model and the student model, and directly aligning H_t and H_s may force the student to learn overly fine-grained knowledge. To overcome these limitations, mutual information is used to measure how well the student model distills information from the teacher model, i.e., the mutual information between H_t and H_s is maximized, as shown in equation (4) below.
I(·) denotes the mutual information between two random variables. Inspired by recent research on mutual information estimation, InfoNCE is adopted to estimate the mutual information; the relationship between InfoNCE and the mutual information is given in formula (5):
where f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
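A minimal sketch of such an InfoNCE-style distillation loss in PyTorch is given below; the temperature tau and the use of in-batch negatives are illustrative assumptions, not choices stated by the patent.

```python
import torch
import torch.nn.functional as F

def holistic_infonce_loss(h_s: torch.Tensor, h_t: torch.Tensor, tau: float = 0.07):
    """h_s, h_t: (N, g) graph-based representations of the same N samples."""
    h_s = F.normalize(h_s, dim=1)
    h_t = F.normalize(h_t, dim=1)
    logits = h_s @ h_t.t() / tau                   # (N, N) pairwise similarities
    targets = torch.arange(h_s.size(0), device=h_s.device)
    # the matching (i, i) teacher representation is the positive,
    # all other columns in the row act as negatives
    return F.cross_entropy(logits, targets)
```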
Besides learning from the teacher model, the student model also needs to learn from the data itself (e.g., the label information); cross entropy is the loss function commonly used for classification, and the final loss function is shown in formula (6) below.
Where β is the weight of the linear combination.
Because InfoNCE requires all samples in the dataset as negative samples, computing the holistic distillation loss exactly is prohibitively expensive for larger datasets. To avoid recomputing the representations of all samples during training, feature memory bank techniques are widely used. In the method of this embodiment, G_t and G_s are each constructed within a randomly selected mini-batch; the holistic knowledge reflected by the graph-based representations H_t and H_s depends on the particular attribute graphs and therefore cannot itself be cached with a feature memory bank, so the memory bank is used only to store F_t and F_s. The final approximated holistic distillation loss is defined as shown in equation (7):
In step 3, InfoNCE is used to maximize the mutual information between the graph-embedded representations of the student model and the teacher model; the relationship between InfoNCE and the mutual information is shown in formula (8):
where f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
According to the scheme provided by the embodiment of the invention, the knowledge distillation method can enable the student model to learn the integral knowledge of the teacher model more effectively, and compared with other knowledge distillation methods, the performance of the student model is improved more obviously.
The system for implementing the above holistic knowledge distillation method based on a graph neural network comprises, connected in sequence: a teacher-model and student-model attribute graph construction module, a holistic knowledge extraction module, and a module for maximizing the mutual information between the graph-embedded representations of the student model and the teacher model. These three modules correspond to the contents of steps 1, 2 and 3, respectively.
In order to illustrate the effects of the above schemes of the embodiments of the present invention, experiments conducted on classical image classification datasets are described below.
1. Experimental data set
The experiments involve two benchmark datasets, whose relevant statistics are given in the following table:
Data set | Number of classes | Training set size | Test set size | Image size |
---|---|---|---|---|
Tiny-ImageNet | 200 | 100000 | 10000 | 224*224*3 |
Cifar-100 | 20 | 50000 | 10000 | 32*32*3 |
2. Model structure
Four architectures are used for the teacher and student models: ResNet, VGG, ShuffleNet, and MobileNet. All are network structures familiar to researchers in the field and are not described in detail here.
3. Baseline method
In order to demonstrate the superiority of this method, this embodiment compares it with recent knowledge distillation methods, which fall into the two categories shown in the knowledge distillation classification diagram (FIG. 1) provided by this embodiment.
Specifically:
The first category consists of individual-level knowledge distillation methods, including Vanilla KD, which learns logits; AT, which learns attention maps; PKT, which learns feature representations; CRD; and SSKD.
The second category consists of relational knowledge distillation methods that learn pairwise relational knowledge, including RKD and CCKD.
All of the above comparison methods are reproduced using the authors' open-source code; to keep the training samples consistent across methods, the data augmentation in the SSKD code is removed.
4. Conclusion of the experiment
The embodiment of the invention and the various comparison methods are trained with different teacher model and student model combinations. The performance on the CIFAR-100 dataset is as follows (the evaluation metric is accuracy).
The performance on the Tiny-ImageNet dataset is as follows (the evaluation metric is accuracy).
The comparative experiments on both datasets show that, across different teacher model and student model combinations, the performance of the present method (HKD) is significantly better than that of the comparison methods.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
The invention is applicable to image classification. Deep Neural Networks (DNNs) have enjoyed great success in a variety of applications. However, their success depends to a large extent on large amounts of computing and storage resources that are often not available on embedded and mobile devices; that is, under limited computational resources, some large models with high computational demands cannot be deployed. In order to keep the deployable small models satisfactory under limited computational resources, knowledge distillation methods (learning methods that let a student model learn both the knowledge already learned by a teacher model and the knowledge in the data itself) have been widely studied.
An image classification method to which the method of the present invention is applied is described as follows:
1. first, a data set required for the image classification model is prepared, and the preparation of the data set is a process well known to those skilled in the art, and will not be described herein.
2. A teacher model with a larger volume and excellent performance is trained, using cross entropy as the loss function.
After the teacher model is trained, its parameters are no longer updated in the subsequent training process. The student model training method, which is the core of the present method, is described below.
3. Training the student model:
Step 3.1: construct attribute graphs for the teacher model and the student model respectively. The images are input into the teacher model and the student model to obtain feature representations f_t and f_s (where d_t and d_s are the dimensions of the feature representations output by the teacher model and the student model, respectively) and classification predictions p_t and p_s. Attribute graphs G_t = {A_t, F_t} and G_s = {A_s, F_s} are then constructed for the teacher model and the student model respectively, where each node represents an instance and the node attribute matrices F_t and F_s collect the learned feature representations. A_t and A_s are the adjacency matrices of the attribute graphs, constructed from p_t and p_s as A = φ_KNN(p), where φ_KNN(·) is a graph construction function based on K nearest neighbors (KNN). Throughout training, G_t is fixed while the attributes and structure of G_s change dynamically.
Step 3.2: aggregate the node attributes and topology information of the neighborhood samples in the attribute graph using a Topology Adaptive Graph Convolutional Network (TAGCN) to extract the holistic knowledge, thereby obtaining a graph-based embedding vector.
As shown in formulas (1) and (2) below, TAGCN learns from the attribute information and the topology information of the nodes simultaneously to extract the holistic knowledge, yielding the graph-based representations H_t and H_s of the teacher model and the student model.
where Θ_s^l and Θ_t^l are learnable parameters, g_t and g_s are the dimensions of the above graph-based representations, and D_t and D_s are the diagonal degree matrices of the attribute graphs, as shown in equation (3).
Step 3.3: maximize the mutual information between the graph-embedded representations of the student model and the teacher model, estimated with InfoNCE, and accelerate training using a feature memory bank technique.
In order for the student model to learn the teacher model's holistic knowledge as fully as possible, the similarity between H_t and H_s should be maximized. Many existing vector-wise similarity metrics (e.g., cosine similarity, Euclidean distance) are not suitable for distilling the holistic knowledge: they are limited by the difference in representational capacity caused by the structural differences between the teacher model and the student model, and directly aligning H_t and H_s may force the student to learn overly fine-grained knowledge. To overcome these limitations, mutual information is used to measure how well the student model distills information from the teacher model, i.e., the mutual information between H_t and H_s is maximized, as shown in equation (4) below.
I(·) denotes the mutual information between two random variables. Inspired by recent research on mutual information estimation, InfoNCE is adopted to estimate the mutual information; the relationship between InfoNCE and the mutual information is given in formula (5):
where f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
Besides learning from the teacher model, the student model also needs to learn from the data itself (e.g., the label information); cross entropy is the loss function commonly used for classification, and the final loss function is shown in formula (6) below.
Where β is the weight of the linear combination.
Because InfoNCE requires all samples in the dataset as negative samples, computing the holistic distillation loss exactly is prohibitively expensive for larger datasets. To avoid recomputing the representations of all samples during training, feature memory bank techniques are widely used. In the method of this embodiment, G_t and G_s are each constructed within a randomly selected mini-batch; the holistic knowledge reflected by the graph-based representations H_t and H_s depends on the particular attribute graphs and therefore cannot itself be cached with a feature memory bank, so the memory bank is used only to store F_t and F_s. The final approximated holistic distillation loss is defined as shown in equation (7):
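Putting steps 3.1-3.3 together, one training iteration might look like the following sketch; teacher, student, knn_attribute_graph, TAGConvLayer and holistic_infonce_loss refer to the hypothetical helpers sketched earlier, and beta is the linear-combination weight of formula (6). This is an assumed arrangement, not code from the patent.

```python
import torch
import torch.nn.functional as F

def train_step(images, labels, teacher, student, gnn_t, gnn_s, optimizer, beta=1.0, k=8):
    # teacher is frozen; it only provides features f_t and predictions p_t
    with torch.no_grad():
        f_t, p_t = teacher(images)
    f_s, p_s = student(images)

    # Step 3.1: build the attribute graphs G_t and G_s inside the mini-batch
    a_t, _ = knn_attribute_graph(f_t, p_t.softmax(dim=1), k)
    a_s, _ = knn_attribute_graph(f_s, p_s.softmax(dim=1), k)

    # Step 3.2: aggregate with the topology-adaptive graph convolutions
    h_t = gnn_t(f_t, a_t)
    h_s = gnn_s(f_s, a_s)

    # Step 3.3: classification loss plus the holistic distillation loss
    loss = F.cross_entropy(p_s, labels) + beta * holistic_infonce_loss(h_s, h_t)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```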
4. Deploy the model: the trained student model is deployed in the required scenario. In this way, a student model with better performance can be used to classify images under limited computational conditions.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (2)
1. An overall knowledge distillation method based on a graph neural network applied to image classification comprises the following steps:
Step 1, constructing attribute graphs for a teacher model and a student model respectively; inputting images into the teacher model and the student model to obtain feature representations f_t and f_s and classification predictions p_t and p_s, where d_t and d_s are the dimensions of the feature representations output by the teacher model and the student model, respectively; then constructing attribute graphs G_t = {A_t, F_t} and G_s = {A_s, F_s} for the teacher model and the student model respectively, wherein each node represents an instance and the node attributes are the learned feature representations; A_t and A_s are adjacency matrices of the attribute graphs constructed from p_t and p_s as A = φ_KNN(p), wherein φ_KNN(·) is a graph construction function based on K nearest neighbors (KNN);
in the whole training process, gt is fixed, and the attribute and structure of Gs are dynamically changed;
the attribute graph defined above has the following characteristics: first, compared with the fully connected inter-instance graphs constructed by existing relational knowledge distillation methods, the KNN graph filters out the least relevant sample pairs; second, since edges are constructed from predicted probabilities, the graph can model inter-class and intra-class information; finally, individual-level knowledge and relation-level knowledge can be extracted jointly and very efficiently from the attribute graph using the graph neural network;
Step 2, aggregating node attributes and topology information of neighborhood samples in the attribute graph using a graph convolutional neural network to extract overall knowledge, thereby obtaining a graph-based embedding vector;
as shown in formulas (1) and (2) below, the attribute information and topology information of the nodes are learned simultaneously using a Topology Adaptive Graph Convolutional Network (TAGCN) to extract the overall knowledge, obtaining graph-based representations H_t and H_s of the teacher model and the student model;
where Θ_s^l and Θ_t^l are learnable parameters, g_t and g_s are the dimensions of the above graph-based representations, and D_t and D_s are the diagonal degree matrices of the attribute graphs, as shown in equation (3);
Step 3, maximizing the mutual information between the graph-embedded representations of the student model and the teacher model, and accelerating training using a feature memory bank technique;
the mutual information is used to measure how well the student model distills information from the teacher model, i.e., the mutual information between H_t and H_s is maximized, as shown in the following equation (4):
I(·) represents the mutual information between two random variables; InfoNCE is used to estimate the mutual information, and the relationship between InfoNCE and the mutual information is shown in formula (5):
wherein f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models;
the student model needs to learn from the data itself in addition to learning the teacher model's knowledge, and the final loss function is shown in formula (6):
Where β is the weight of the linear combination;
G_t and G_s are each constructed within a randomly selected mini-batch; the overall knowledge reflected by the graph-based representations H_t and H_s is specific to the different attribute graphs and cannot itself be cached, so the feature memory bank technique is used only to store F_t and F_s; the final approximated overall distillation loss is defined as shown in equation (7):
2. The overall knowledge distillation method based on a graph neural network applied to image classification according to claim 1, wherein in step 3, the mutual information between the graph-embedded representations of the student model and the teacher model is maximized using InfoNCE, and the relationship between InfoNCE and the mutual information is as shown in formula (8):
wherein f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110982472.1A CN113887698B (en) | 2021-08-25 | 2021-08-25 | Integral knowledge distillation method and system based on graph neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110982472.1A CN113887698B (en) | 2021-08-25 | 2021-08-25 | Integral knowledge distillation method and system based on graph neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113887698A CN113887698A (en) | 2022-01-04 |
CN113887698B true CN113887698B (en) | 2024-06-14 |
Family
ID=79011512
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110982472.1A Active CN113887698B (en) | 2021-08-25 | 2021-08-25 | Integral knowledge distillation method and system based on graph neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113887698B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115101119B (en) * | 2022-06-27 | 2024-05-17 | 山东大学 | Isochrom function prediction system based on network embedding |
CN117058437B (en) * | 2023-06-16 | 2024-03-08 | 江苏大学 | Flower classification method, system, equipment and medium based on knowledge distillation |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112116030A (en) * | 2020-10-13 | 2020-12-22 | 浙江大学 | Image classification method based on vector standardization and knowledge distillation |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA3076424A1 (en) * | 2019-03-22 | 2020-09-22 | Royal Bank Of Canada | System and method for knowledge distillation between neural networks |
CN110472730A (en) * | 2019-08-07 | 2019-11-19 | 交叉信息核心技术研究院(西安)有限公司 | A kind of distillation training method and the scalable dynamic prediction method certainly of convolutional neural networks |
CN112861936B (en) * | 2021-01-26 | 2023-06-02 | 北京邮电大学 | Graph node classification method and device based on graph neural network knowledge distillation |
CN113095480A (en) * | 2021-03-24 | 2021-07-09 | 重庆邮电大学 | Interpretable graph neural network representation method based on knowledge distillation |
-
2021
- 2021-08-25 CN CN202110982472.1A patent/CN113887698B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112116030A (en) * | 2020-10-13 | 2020-12-22 | 浙江大学 | Image classification method based on vector standardization and knowledge distillation |
Non-Patent Citations (1)
Title |
---|
Face Recognition Based on Deep Feature Distillation; Ge Shiming; Zhao Shengwei; Liu Wenyu; Li Chenyu; Journal of Beijing Jiaotong University; 2017-12-15 (Issue 06); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113887698A (en) | 2022-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109271522B (en) | Comment emotion classification method and system based on deep hybrid model transfer learning | |
CN110942091B (en) | Semi-supervised few-sample image classification method for searching reliable abnormal data center | |
CN110110100A (en) | Across the media Hash search methods of discrete supervision decomposed based on Harmonious Matrix | |
CN109063112B (en) | Rapid image retrieval method, model and model construction method based on multitask learning deep semantic hash | |
CN111950594A (en) | Unsupervised graph representation learning method and unsupervised graph representation learning device on large-scale attribute graph based on sub-graph sampling | |
CN111310074B (en) | Method and device for optimizing labels of interest points, electronic equipment and computer readable medium | |
CN109753589A (en) | A kind of figure method for visualizing based on figure convolutional network | |
CN113255892B (en) | Decoupled network structure searching method, device and readable storage medium | |
CN113887698B (en) | Integral knowledge distillation method and system based on graph neural network | |
CN114299362B (en) | A small sample image classification method based on k-means clustering | |
CN117992805B (en) | Zero sample cross-modal retrieval method and system based on tensor product graph fusion diffusion | |
CN115293919A (en) | Graph neural network prediction method and system oriented to social network distribution generalization | |
CN115761275A (en) | Unsupervised community discovery method and system based on graph neural network | |
CN116304367B (en) | Algorithm and device for obtaining communities based on graph self-encoder self-supervision training | |
CN111241326A (en) | Image visual relation referring and positioning method based on attention pyramid network | |
CN117349494A (en) | Graph classification method, system, medium and equipment for space graph convolution neural network | |
CN112115971B (en) | Method and system for carrying out student portrait based on heterogeneous academic network | |
CN109815335A (en) | A kind of paper domain classification method suitable for document network | |
CN113434815A (en) | Community detection method based on similar and dissimilar constraint semi-supervised nonnegative matrix factorization | |
CN113515519A (en) | Method, device and equipment for training graph structure estimation model and storage medium | |
Saha et al. | Novel randomized feature selection algorithms | |
CN111914108A (en) | Discrete supervision cross-modal Hash retrieval method based on semantic preservation | |
CN115408531A (en) | Knowledge graph reasoning method with induction capability | |
Zhang et al. | Color clustering using self-organizing maps | |
Dennis et al. | Autoencoder-enhanced sum-product networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |