CN113887698B - Integral knowledge distillation method and system based on graph neural network - Google Patents
Integral knowledge distillation method and system based on graph neural network
- Publication number
- CN113887698B (application number CN202110982472.1A)
- Authority
- CN
- China
- Prior art keywords
- graph
- knowledge
- model
- attribute
- teacher
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
Abstract
The invention aims to provide a holistic knowledge distillation method based on a graph neural network, which comprises the following steps: given the feature representations and classification predictions learned by the teacher and student networks, each sample is taken as a node, the features learned by the network are taken as the node attributes, and the K-nearest-neighbor (KNN) relationships between the classification predictions are taken as edges, so that an attribute graph is constructed for each network; a topology-adaptive graph convolutional network is used to aggregate the node attributes and topology information of neighborhood samples in the attribute graph to extract the holistic knowledge, which is expressed as a unified graph-based embedding vector; the mutual information between the graph-embedded representations of the student network and the teacher network is maximized using an InfoNCE estimate, and a feature memory bank technique is used to accelerate training. The method can integrate the individual-level knowledge and the relation-level knowledge in the teacher network at the same time, so that the student network learns the holistic knowledge and its performance is improved.
Description
Technical Field
The invention relates to the field of deep learning and computer vision, in particular to a global knowledge distillation method and system based on a graph neural network.
Background
Deep Neural Networks (DNNs) have enjoyed great success in a variety of applications. However, their success depends to a large extent on large amounts of computing and storage resources, which are often not available on embedded and mobile devices. In order to reduce costs while maintaining satisfactory results, model compression techniques have become a hot research topic. Knowledge distillation is one such technique: it transfers knowledge from a large, already-trained teacher network to a small student network to be learned, so that the student network's accuracy improves while it remains small and fast.
Knowledge extracted from the teacher network plays a central role in knowledge distillation. Existing knowledge distillation methods fall into two types according to the kind of knowledge extracted: individual-level knowledge distillation and relation-level knowledge distillation. Individual-level knowledge distillation extracts knowledge from each data instance independently using the teacher network and provides richer supervision than discrete labels, including probabilistic outputs (logits), feature representations, feature maps, and so on. Relation-level knowledge distillation extracts knowledge from the relationships between pairs of samples, and the student network is trained so that these relationships are preserved between student and teacher networks of different architectures.
Although both of the above kinds of knowledge distillation have been successful, existing methods employ the two techniques independently and ignore their inherent relevance, especially when the teacher network's capacity is limited and one type of knowledge extracted on its own is insufficient for the student network to learn from. Intuitively, individual-level knowledge and relation-level knowledge can be seen as two naturally correlated views of the same teacher network: two similar instances often have similar individual features and similar relational patterns, and exploiting this correlation is critical for training a more discriminative student network. Integrating individual-level and relation-level knowledge while preserving their inherent consistency is therefore of great importance for knowledge distillation.
Disclosure of Invention
The present invention has been made to overcome the above-mentioned drawbacks of the prior art, and provides a Holistic Knowledge Distillation (HKD) method and system based on a graph neural network that can integrate individual-level knowledge and relation-level knowledge at the same time.
The invention aims at realizing the following technical scheme:
an overall knowledge distillation method based on a graph neural network, comprising:
Step 1: construct attribute graphs for the teacher model and the student model respectively. The images are input into the teacher model and the student model to obtain feature representations f_t and f_s (where d_t and d_s are the dimensions of the feature representations output by the teacher model and the student model, respectively) and classification predictions p_t and p_s. Attribute graphs G_t = {A_t, F_t} and G_s = {A_s, F_s} are then constructed for the teacher model and the student model respectively, where each node represents an instance and the node attribute matrices F_t and F_s collect the learned feature representations. A_t and A_s are the adjacency matrices of the attribute graphs, constructed from p_t and p_s as A = φ_KNN(p), where φ_KNN(·) is a graph construction function based on K nearest neighbors (KNN). Throughout training, G_t is fixed while the attributes and structure of G_s change dynamically.
The attribute graph defined above has the following characteristics. First, compared with the fully connected inter-instance graphs constructed by existing relation-level knowledge distillation methods, the KNN graph filters out the least relevant sample pairs; this matters because only a few samples in a randomly sampled batch are truly correlated and provide sufficient information for learning the node representations (these concepts are well known to researchers in the field and are not described in detail here). Second, since edges are constructed from the predicted probabilities, the graph can model both inter-class and intra-class information. Finally, the graph neural network can extract individual-level knowledge and relation-level knowledge from the attribute graph jointly and very efficiently.
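For concreteness, the following is a minimal sketch, assuming a PyTorch implementation, of how the KNN attribute graph of Step 1 could be built; the helper name knn_attribute_graph, the dot-product similarity between prediction vectors, and the neighborhood size k are illustrative choices rather than details given by the patent.

```python
import torch

def knn_attribute_graph(feats: torch.Tensor, probs: torch.Tensor, k: int = 8):
    """feats: (N, d) node attributes F; probs: (N, C) classification predictions p.
    Returns the adjacency matrix A of the KNN graph and the attribute matrix F."""
    sim = probs @ probs.t()                      # (N, N) similarity between prediction vectors
    sim.fill_diagonal_(float('-inf'))            # exclude self when picking neighbors
    idx = sim.topk(k, dim=1).indices             # k most similar samples per node
    A = torch.zeros_like(sim)
    A.scatter_(1, idx, 1.0)                      # directed KNN edges
    A = ((A + A.t()) > 0).float()                # symmetrize
    A.fill_diagonal_(1.0)                        # self-loops for the graph convolution
    return A, feats
```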
Step 2: aggregate the node attributes and topology information of the neighborhood samples in the attribute graph using a Topology Adaptive Graph Convolutional Network (TAGCN) to extract the holistic knowledge, thereby obtaining a graph-based embedding vector.
As shown in formulas (1) and (2) below, TAGCN learns from the attribute information and the topology information of the nodes simultaneously to extract the holistic knowledge, yielding the graph-based representations H_t and H_s of the teacher model and the student model.
where Θ_s^l and Θ_t^l are learnable parameters, g_t and g_s are the dimensions of the above graph-based representations, and D_t and D_s are the diagonal degree matrices of the attribute graphs, as shown in equation (3).
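Formulas (1)-(3) appear only as images in the source; a plausible reconstruction, assuming the standard TAGCN polynomial graph filter with L hops, is:

H_t = \sum_{l=0}^{L} \big( D_t^{-1/2} A_t D_t^{-1/2} \big)^{l} F_t \, \Theta_t^{l}   (1)

H_s = \sum_{l=0}^{L} \big( D_s^{-1/2} A_s D_s^{-1/2} \big)^{l} F_s \, \Theta_s^{l}   (2)

(D_t)_{ii} = \sum_{j} (A_t)_{ij}, \qquad (D_s)_{ii} = \sum_{j} (A_s)_{ij}   (3)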
Step 3: maximize the mutual information between the graph-embedded representations of the student model and the teacher model, estimated with InfoNCE, and accelerate training using a feature memory bank technique.
In order for the student model to learn the teacher model's holistic knowledge as fully as possible, the similarity between H_t and H_s should be maximized. Many existing vector-wise similarity metrics (e.g., cosine similarity, Euclidean distance) are not suitable for distilling the holistic knowledge: they are limited by the difference in representational capacity caused by the structural differences between the teacher model and the student model, and directly aligning H_t and H_s may force the student to learn overly fine-grained knowledge. To overcome these limitations, mutual information is used to measure how well the student model distills information from the teacher model, i.e., the mutual information between H_t and H_s is maximized, as shown in equation (4) below.
I(·) denotes the mutual information between two random variables. Inspired by recent research on mutual information estimation, InfoNCE is adopted to estimate the mutual information; the relationship between InfoNCE and the mutual information is given in formula (5):
where f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
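Formulas (4) and (5) are likewise images in the source; a plausible reconstruction, assuming the usual InfoNCE lower bound on mutual information computed over N samples, is:

\max_{\theta_s} \; I(H_t; H_s)   (4)

I(h_t; h_s) \;\geq\; \log N + \mathbb{E}\left[ \log \frac{f(h_t^{i}, h_s^{i})}{\sum_{j=1}^{N} f(h_t^{i}, h_s^{j})} \right]   (5)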
Besides learning from the teacher model, the student model also needs to learn from the data itself (e.g., the label information); cross entropy is the loss function commonly used for classification, and the final loss function is shown in formula (6) below.
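Formula (6) is also an image in the source; a plausible reconstruction, assuming a linear combination of the classification cross-entropy loss and the holistic distillation loss defined above, is:

\mathcal{L} = \mathcal{L}_{CE}(p_s, y) + \beta \, \mathcal{L}_{HKD}, \qquad \mathcal{L}_{HKD} = -\,\hat{I}_{\mathrm{InfoNCE}}(H_t, H_s)   (6)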
Where β is the weight of the linear combination.
Because InfoNCE requires all samples in the dataset as negative samples, computing the holistic distillation loss exactly is prohibitively expensive for larger datasets. To avoid recomputing the representations of all samples during training, feature memory bank techniques are widely used. In the method of this embodiment, G_t and G_s are each constructed within a randomly selected mini-batch; the holistic knowledge reflected by the graph-based representations H_t and H_s depends on the particular attribute graphs and therefore cannot itself be cached with a feature memory bank, so the memory bank is used only to store F_t and F_s. The final approximated holistic distillation loss is defined as shown in equation (7):
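Equation (7) itself is an image in the source. As a rough illustration of the feature memory bank it relies on, the following sketch assumes a momentum-updated bank; the class name, momentum update, and negative sampling scheme are illustrative choices rather than details given by the patent.

```python
import torch
import torch.nn.functional as F

class FeatureMemoryBank:
    """Caches one feature vector per training sample so InfoNCE negatives
    do not have to be recomputed by a forward pass at every step."""
    def __init__(self, num_samples: int, dim: int, momentum: float = 0.5):
        self.bank = F.normalize(torch.randn(num_samples, dim), dim=1)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, indices: torch.Tensor, feats: torch.Tensor):
        # exponential-moving-average update for the samples in this mini-batch
        new = self.momentum * self.bank[indices] + (1.0 - self.momentum) * feats
        self.bank[indices] = F.normalize(new, dim=1)

    def sample_negatives(self, num_neg: int) -> torch.Tensor:
        idx = torch.randint(0, self.bank.size(0), (num_neg,))
        return self.bank[idx]
```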
In step 3, InfoNCE is used to maximize the mutual information between the graph-embedded representations of the student model and the teacher model; the relationship between InfoNCE and the mutual information is shown in formula (8):
where f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
The system for implementing the above holistic knowledge distillation method based on a graph neural network comprises, connected in sequence: a teacher-model and student-model attribute graph construction module, a holistic knowledge extraction module, and a module for maximizing the mutual information between the graph-embedded representations of the student model and the teacher model.
The working principle of the invention is as follows: on top of the individual-level knowledge and the relation-level knowledge of the teacher model, the knowledge distillation method further extracts holistic knowledge from the teacher model using a graph neural network, and the student model learns this holistic knowledge from the teacher model, so that model performance improves more markedly than with other knowledge distillation methods.
The invention has the advantage that the student model learns not only the individual-level and relation-level knowledge from the teacher model but also the more complex holistic knowledge, so the performance improvement of the student model is more pronounced than with other knowledge distillation methods.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a classification of a knowledge distillation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an overall knowledge distillation method based on a graph neural network according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of an overall knowledge distillation method based on a graph neural network according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides a global knowledge distillation method based on a graph neural network, which can simultaneously distill knowledge on an individual and knowledge on a relation.
In the knowledge distillation framework provided by the embodiment of the invention, given the characteristic representation learned by a teacher model and a student model and the classification prediction result, firstly, an attribute graph is constructed for each model, wherein each node represents an example, the node attribute represents the learned characteristic representation, and the edges between the examples are constructed by using the K Nearest Neighbor (KNN) relationship of the classification prediction result.
The node attributes and topology information of neighborhood samples in the attribute graph are aggregated using a Topology Adaptive Graph Convolutional Network (TAGCN) to extract the holistic knowledge, which is represented as a unified graph-based embedding vector. InfoNCE estimation is used to maximize the mutual information between the graph-embedded representations of the teacher model and the student model, and feature memory bank techniques are used to accelerate training. By distilling the complex knowledge in the teacher model's attribute graph, the student model's performance can be improved more markedly than with current knowledge distillation methods.
Specifically, as shown in FIG. 3, the holistic knowledge distillation method based on a graph neural network provided by the embodiment of the present invention mainly includes the following steps:
Step 1: construct attribute graphs for the teacher model and the student model respectively. The images are input into the teacher model and the student model to obtain feature representations f_t and f_s (where d_t and d_s are the dimensions of the feature representations output by the teacher model and the student model, respectively) and classification predictions p_t and p_s. Attribute graphs G_t = {A_t, F_t} and G_s = {A_s, F_s} are then constructed for the teacher model and the student model respectively, where each node represents an instance and the node attribute matrices F_t and F_s collect the learned feature representations. A_t and A_s are the adjacency matrices of the attribute graphs, constructed from p_t and p_s as A = φ_KNN(p), where φ_KNN(·) is a graph construction function based on K nearest neighbors (KNN). Throughout training, G_t is fixed while the attributes and structure of G_s change dynamically.
The attribute graph defined above has the following characteristics. First, compared with the fully connected inter-instance graphs constructed by existing relation-level knowledge distillation methods, the KNN graph filters out the least relevant sample pairs; this matters because only a few samples in a randomly sampled batch are truly correlated and provide sufficient information for learning the node representations (these concepts are well known to researchers in the field and are not described in detail here). Second, since edges are constructed from the predicted probabilities, the graph can model both inter-class and intra-class information. Finally, the graph neural network can extract individual-level knowledge and relation-level knowledge from the attribute graph jointly and very efficiently.
Step 2: aggregate the node attributes and topology information of the neighborhood samples in the attribute graph using a Topology Adaptive Graph Convolutional Network (TAGCN) to extract the holistic knowledge, thereby obtaining a graph-based embedding vector.
As shown in formulas (1) and (2) below, TAGCN learns from the attribute information and the topology information of the nodes simultaneously to extract the holistic knowledge, yielding the graph-based representations H_t and H_s of the teacher model and the student model.
where Θ_s^l and Θ_t^l are learnable parameters, g_t and g_s are the dimensions of the above graph-based representations, and D_t and D_s are the diagonal degree matrices of the attribute graphs, as shown in equation (3).
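For illustration, a minimal PyTorch sketch of a topology-adaptive graph convolution of this form follows; it is an assumption about the exact layer, and an off-the-shelf implementation such as TAGConv from torch_geometric could be substituted.

```python
import torch
import torch.nn as nn

class TAGConvLayer(nn.Module):
    """Sum of powers of the normalized adjacency applied to node features,
    each hop with its own learnable weight matrix Theta_l."""
    def __init__(self, in_dim: int, out_dim: int, num_hops: int = 2):
        super().__init__()
        self.num_hops = num_hops
        self.thetas = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_hops + 1)]
        )

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(dim=1)
        d_inv_sqrt = deg.clamp(min=1e-12).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)  # D^-1/2 A D^-1/2
        out = self.thetas[0](feats)                 # 0-hop term
        h = feats
        for l in range(1, self.num_hops + 1):
            h = norm_adj @ h                        # propagate one more hop
            out = out + self.thetas[l](h)
        return out
```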
Step 3: maximize the mutual information between the graph-embedded representations of the student model and the teacher model, estimated with InfoNCE, and accelerate training using a feature memory bank technique.
In order for the student model to learn the teacher model's holistic knowledge as fully as possible, the similarity between H_t and H_s should be maximized. Many existing vector-wise similarity metrics (e.g., cosine similarity, Euclidean distance) are not suitable for distilling the holistic knowledge: they are limited by the difference in representational capacity caused by the structural differences between the teacher model and the student model, and directly aligning H_t and H_s may force the student to learn overly fine-grained knowledge. To overcome these limitations, mutual information is used to measure how well the student model distills information from the teacher model, i.e., the mutual information between H_t and H_s is maximized, as shown in equation (4) below.
I(·) denotes the mutual information between two random variables. Inspired by recent research on mutual information estimation, InfoNCE is adopted to estimate the mutual information; the relationship between InfoNCE and the mutual information is given in formula (5):
where f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
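A minimal sketch of such an InfoNCE-style distillation loss in PyTorch is given below; the temperature tau and the use of in-batch negatives are illustrative assumptions, not choices stated by the patent.

```python
import torch
import torch.nn.functional as F

def holistic_infonce_loss(h_s: torch.Tensor, h_t: torch.Tensor, tau: float = 0.07):
    """h_s, h_t: (N, g) graph-based representations of the same N samples."""
    h_s = F.normalize(h_s, dim=1)
    h_t = F.normalize(h_t, dim=1)
    logits = h_s @ h_t.t() / tau                   # (N, N) pairwise similarities
    targets = torch.arange(h_s.size(0), device=h_s.device)
    # the matching (i, i) teacher representation is the positive,
    # all other columns in the row act as negatives
    return F.cross_entropy(logits, targets)
```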
Besides learning from the teacher model, the student model also needs to learn from the data itself (e.g., the label information); cross entropy is the loss function commonly used for classification, and the final loss function is shown in formula (6) below.
Where β is the weight of the linear combination.
Because InfoNCE requires all samples in the dataset as negative samples, computing the holistic distillation loss exactly is prohibitively expensive for larger datasets. To avoid recomputing the representations of all samples during training, feature memory bank techniques are widely used. In the method of this embodiment, G_t and G_s are each constructed within a randomly selected mini-batch; the holistic knowledge reflected by the graph-based representations H_t and H_s depends on the particular attribute graphs and therefore cannot itself be cached with a feature memory bank, so the memory bank is used only to store F_t and F_s. The final approximated holistic distillation loss is defined as shown in equation (7):
In step 3, InfoNCE is used to maximize the mutual information between the graph-embedded representations of the student model and the teacher model; the relationship between InfoNCE and the mutual information is shown in formula (8):
where f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
According to the scheme provided by the embodiment of the invention, the knowledge distillation method can enable the student model to learn the integral knowledge of the teacher model more effectively, and compared with other knowledge distillation methods, the performance of the student model is improved more obviously.
The system for implementing the above holistic knowledge distillation method based on a graph neural network comprises, connected in sequence: a teacher-model and student-model attribute graph construction module, a holistic knowledge extraction module, and a module for maximizing the mutual information between the graph-embedded representations of the student model and the teacher model. These three modules correspond to the contents of steps 1, 2 and 3, respectively.
In order to illustrate the effects of the above schemes of the embodiments of the present invention, experiments conducted on classical image classification datasets are described below.
1. Experimental data set
The experiments involve two benchmark datasets, whose relevant statistics are given in the following table:
Data set | Number of classes | Training set size | Test set size | Image size |
---|---|---|---|---|
Tiny-ImageNet | 200 | 100000 | 10000 | 224*224*3 |
Cifar-100 | 20 | 50000 | 10000 | 32*32*3 |
2. Model structure
Four architectures are used for the teacher and student models: ResNet, VGG, ShuffleNet, and MobileNet. All are network structures familiar to researchers in the field and are not described in detail here.
3. Baseline method
In order to demonstrate the superiority of this method, this embodiment compares it with recent knowledge distillation methods, which fall into the two categories shown in the knowledge distillation classification diagram (FIG. 1) provided by this embodiment.
Specifically:
The first category consists of individual-level knowledge distillation methods, including Vanilla KD, which learns logits; AT, which learns attention maps; PKT, which learns feature representations; CRD; and SSKD.
The second category consists of relational knowledge distillation methods that learn pairwise relational knowledge, including RKD and CCKD.
All of the above comparison methods are reproduced using the authors' open-source code; to keep the training samples consistent across methods, the data augmentation in the SSKD code is removed.
4. Conclusion of the experiment
The embodiment of the invention and the various comparison methods are trained with different teacher model and student model combinations. The performance on the CIFAR-100 dataset is as follows (the evaluation metric is accuracy).
The performance on the Tiny-ImageNet dataset is as follows (the evaluation metric is accuracy).
The comparative experiments on both datasets show that, across different teacher model and student model combinations, the performance of the present method (HKD) is significantly better than that of the comparison methods.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
The invention is applicable to image classification. Deep Neural Networks (DNNs) have enjoyed great success in a variety of applications. However, their success depends to a large extent on large amounts of computing and storage resources that are often not available on embedded and mobile devices; that is, under limited computational resources, some large models with high computational demands cannot be deployed. In order to keep the deployable small models satisfactory under limited computational resources, knowledge distillation methods (learning methods that let a student model learn both the knowledge already learned by a teacher model and the knowledge in the data itself) have been widely studied.
An image classification method to which the method of the present invention is applied is described as follows:
1. first, a data set required for the image classification model is prepared, and the preparation of the data set is a process well known to those skilled in the art, and will not be described herein.
2. A teacher model with a larger volume and excellent performance is trained, using cross entropy as the loss function.
After the teacher model is trained, its parameters are no longer updated in the subsequent training process. The student model training method, which is the core of the present method, is described below.
3. Training the student model:
Step 3.1: construct attribute graphs for the teacher model and the student model respectively. The images are input into the teacher model and the student model to obtain feature representations f_t and f_s (where d_t and d_s are the dimensions of the feature representations output by the teacher model and the student model, respectively) and classification predictions p_t and p_s. Attribute graphs G_t = {A_t, F_t} and G_s = {A_s, F_s} are then constructed for the teacher model and the student model respectively, where each node represents an instance and the node attribute matrices F_t and F_s collect the learned feature representations. A_t and A_s are the adjacency matrices of the attribute graphs, constructed from p_t and p_s as A = φ_KNN(p), where φ_KNN(·) is a graph construction function based on K nearest neighbors (KNN). Throughout training, G_t is fixed while the attributes and structure of G_s change dynamically.
Step 3.2: aggregate the node attributes and topology information of the neighborhood samples in the attribute graph using a Topology Adaptive Graph Convolutional Network (TAGCN) to extract the holistic knowledge, thereby obtaining a graph-based embedding vector.
As shown in formulas (1) and (2) below, TAGCN learns from the attribute information and the topology information of the nodes simultaneously to extract the holistic knowledge, yielding the graph-based representations H_t and H_s of the teacher model and the student model.
where Θ_s^l and Θ_t^l are learnable parameters, g_t and g_s are the dimensions of the above graph-based representations, and D_t and D_s are the diagonal degree matrices of the attribute graphs, as shown in equation (3).
Step 3.3: maximize the mutual information between the graph-embedded representations of the student model and the teacher model, estimated with InfoNCE, and accelerate training using a feature memory bank technique.
In order for the student model to learn the teacher model's holistic knowledge as fully as possible, the similarity between H_t and H_s should be maximized. Many existing vector-wise similarity metrics (e.g., cosine similarity, Euclidean distance) are not suitable for distilling the holistic knowledge: they are limited by the difference in representational capacity caused by the structural differences between the teacher model and the student model, and directly aligning H_t and H_s may force the student to learn overly fine-grained knowledge. To overcome these limitations, mutual information is used to measure how well the student model distills information from the teacher model, i.e., the mutual information between H_t and H_s is maximized, as shown in equation (4) below.
I(·) denotes the mutual information between two random variables. Inspired by recent research on mutual information estimation, InfoNCE is adopted to estimate the mutual information; the relationship between InfoNCE and the mutual information is given in formula (5):
where f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
Besides learning from the teacher model, the student model also needs to learn from the data itself (e.g., the label information); cross entropy is the loss function commonly used for classification, and the final loss function is shown in formula (6) below.
Where β is the weight of the linear combination.
Because InfoNCE requires all samples in the dataset as negative samples, computing the holistic distillation loss exactly is prohibitively expensive for larger datasets. To avoid recomputing the representations of all samples during training, feature memory bank techniques are widely used. In the method of this embodiment, G_t and G_s are each constructed within a randomly selected mini-batch; the holistic knowledge reflected by the graph-based representations H_t and H_s depends on the particular attribute graphs and therefore cannot itself be cached with a feature memory bank, so the memory bank is used only to store F_t and F_s. The final approximated holistic distillation loss is defined as shown in equation (7):
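Putting steps 3.1-3.3 together, one training iteration might look like the following sketch; teacher, student, knn_attribute_graph, TAGConvLayer and holistic_infonce_loss refer to the hypothetical helpers sketched earlier, and beta is the linear-combination weight of formula (6). This is an assumed arrangement, not code from the patent.

```python
import torch
import torch.nn.functional as F

def train_step(images, labels, teacher, student, gnn_t, gnn_s, optimizer, beta=1.0, k=8):
    # teacher is frozen; it only provides features f_t and predictions p_t
    with torch.no_grad():
        f_t, p_t = teacher(images)
    f_s, p_s = student(images)

    # Step 3.1: build the attribute graphs G_t and G_s inside the mini-batch
    a_t, _ = knn_attribute_graph(f_t, p_t.softmax(dim=1), k)
    a_s, _ = knn_attribute_graph(f_s, p_s.softmax(dim=1), k)

    # Step 3.2: aggregate with the topology-adaptive graph convolutions
    h_t = gnn_t(f_t, a_t)
    h_s = gnn_s(f_s, a_s)

    # Step 3.3: classification loss plus the holistic distillation loss
    loss = F.cross_entropy(p_s, labels) + beta * holistic_infonce_loss(h_s, h_t)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```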
4. Deploy the model: the trained student model is deployed in the required scenario. In this way, a student model with better performance can be used to classify images under limited computational conditions.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (2)
1. An overall knowledge distillation method based on a graph neural network applied to image classification comprises the following steps:
Step 1, constructing attribute graphs for a teacher model and a student model respectively; inputting images into the teacher model and the student model to obtain feature representations f_t and f_s and classification predictions p_t and p_s, where d_t and d_s are the dimensions of the feature representations output by the teacher model and the student model, respectively; then constructing attribute graphs G_t = {A_t, F_t} and G_s = {A_s, F_s} for the teacher model and the student model respectively, wherein each node represents an instance and the node attributes are the learned feature representations; A_t and A_s are adjacency matrices of the attribute graphs constructed from p_t and p_s as A = φ_KNN(p), wherein φ_KNN(·) is a graph construction function based on K nearest neighbors (KNN);
in the whole training process, gt is fixed, and the attribute and structure of Gs are dynamically changed;
the attribute graph defined above has the following characteristics: first, compared with the fully connected inter-instance graphs constructed by existing relational knowledge distillation methods, the KNN graph filters out the least relevant sample pairs; second, since edges are constructed from predicted probabilities, the graph can model inter-class and intra-class information; finally, individual-level knowledge and relation-level knowledge can be extracted jointly and very efficiently from the attribute graph using the graph neural network;
Step 2, aggregating node attributes and topology information of neighborhood samples in the attribute graph using a graph convolutional neural network to extract overall knowledge, thereby obtaining a graph-based embedding vector;
as shown in formulas (1) and (2) below, the attribute information and topology information of the nodes are learned simultaneously using a Topology Adaptive Graph Convolutional Network (TAGCN) to extract the overall knowledge, obtaining graph-based representations H_t and H_s of the teacher model and the student model;
where Θ_s^l and Θ_t^l are learnable parameters, g_t and g_s are the dimensions of the above graph-based representations, and D_t and D_s are the diagonal degree matrices of the attribute graphs, as shown in equation (3);
Step 3, maximizing the mutual information between the graph-embedded representations of the student model and the teacher model, and accelerating training using a feature memory bank technique;
the mutual information is used to measure how well the student model distills information from the teacher model, i.e., the mutual information between H_t and H_s is maximized, as shown in the following equation (4):
I(·) represents the mutual information between two random variables; InfoNCE is used to estimate the mutual information, and the relationship between InfoNCE and the mutual information is shown in formula (5):
wherein f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models;
the student model needs to learn from the data itself in addition to learning the teacher model's knowledge, and the final loss function is shown in formula (6):
Where β is the weight of the linear combination;
G_t and G_s are each constructed within a randomly selected mini-batch; the overall knowledge reflected by the graph-based representations H_t and H_s is specific to the different attribute graphs and cannot itself be cached, so the feature memory bank technique is used only to store F_t and F_s; the final approximated overall distillation loss is defined as shown in equation (7):
2. The overall knowledge distillation method based on a graph neural network applied to image classification according to claim 1, wherein in step 3, the mutual information between the graph-embedded representations of the student model and the teacher model is maximized using InfoNCE, and the relationship between InfoNCE and the mutual information is as shown in formula (8):
wherein f(·) is a vector-wise similarity measure function and h_t^i, h_s^i are the graph-based representations of sample i learned by the teacher and student models.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110982472.1A CN113887698B (en) | 2021-08-25 | 2021-08-25 | Integral knowledge distillation method and system based on graph neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110982472.1A CN113887698B (en) | 2021-08-25 | 2021-08-25 | Integral knowledge distillation method and system based on graph neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113887698A CN113887698A (en) | 2022-01-04 |
CN113887698B true CN113887698B (en) | 2024-06-14 |
Family
ID=79011512
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110982472.1A Active CN113887698B (en) | 2021-08-25 | 2021-08-25 | Integral knowledge distillation method and system based on graph neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113887698B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115101119B (en) * | 2022-06-27 | 2024-05-17 | 山东大学 | Isochrom function prediction system based on network embedding |
CN117058437B (en) * | 2023-06-16 | 2024-03-08 | 江苏大学 | Flower classification method, system, equipment and medium based on knowledge distillation |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112116030A (en) * | 2020-10-13 | 2020-12-22 | 浙江大学 | Image classification method based on vector standardization and knowledge distillation |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA3076424A1 (en) * | 2019-03-22 | 2020-09-22 | Royal Bank Of Canada | System and method for knowledge distillation between neural networks |
CN110472730A (en) * | 2019-08-07 | 2019-11-19 | 交叉信息核心技术研究院(西安)有限公司 | A kind of distillation training method and the scalable dynamic prediction method certainly of convolutional neural networks |
CN112861936B (en) * | 2021-01-26 | 2023-06-02 | 北京邮电大学 | Graph node classification method and device based on graph neural network knowledge distillation |
CN113095480A (en) * | 2021-03-24 | 2021-07-09 | 重庆邮电大学 | Interpretable graph neural network representation method based on knowledge distillation |
-
2021
- 2021-08-25 CN CN202110982472.1A patent/CN113887698B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112116030A (en) * | 2020-10-13 | 2020-12-22 | 浙江大学 | Image classification method based on vector standardization and knowledge distillation |
Non-Patent Citations (1)
Title |
---|
Face Recognition Based on Deep Feature Distillation; Ge Shiming; Zhao Shengwei; Liu Wenyu; Li Chenyu; Journal of Beijing Jiaotong University; 2017-12-15 (Issue 06); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113887698A (en) | 2022-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109271522B (en) | Comment emotion classification method and system based on deep hybrid model transfer learning | |
CN110942091B (en) | Semi-supervised few-sample image classification method for searching reliable abnormal data center | |
CN110110100A (en) | Across the media Hash search methods of discrete supervision decomposed based on Harmonious Matrix | |
CN109063112B (en) | Rapid image retrieval method, model and model construction method based on multitask learning deep semantic hash | |
CN111950594A (en) | Unsupervised graph representation learning method and unsupervised graph representation learning device on large-scale attribute graph based on sub-graph sampling | |
CN111310074B (en) | Method and device for optimizing labels of interest points, electronic equipment and computer readable medium | |
CN109753589A (en) | A kind of figure method for visualizing based on figure convolutional network | |
CN113255892B (en) | Decoupled network structure searching method, device and readable storage medium | |
CN113887698B (en) | Integral knowledge distillation method and system based on graph neural network | |
CN114299362B (en) | A small sample image classification method based on k-means clustering | |
CN117992805B (en) | Zero sample cross-modal retrieval method and system based on tensor product graph fusion diffusion | |
CN115293919A (en) | Graph neural network prediction method and system oriented to social network distribution generalization | |
CN115761275A (en) | Unsupervised community discovery method and system based on graph neural network | |
CN116304367B (en) | Algorithm and device for obtaining communities based on graph self-encoder self-supervision training | |
CN111241326A (en) | Image visual relation referring and positioning method based on attention pyramid network | |
CN117349494A (en) | Graph classification method, system, medium and equipment for space graph convolution neural network | |
CN112115971B (en) | Method and system for carrying out student portrait based on heterogeneous academic network | |
CN109815335A (en) | A kind of paper domain classification method suitable for document network | |
CN113434815A (en) | Community detection method based on similar and dissimilar constraint semi-supervised nonnegative matrix factorization | |
CN113515519A (en) | Method, device and equipment for training graph structure estimation model and storage medium | |
Saha et al. | Novel randomized feature selection algorithms | |
CN111914108A (en) | Discrete supervision cross-modal Hash retrieval method based on semantic preservation | |
CN115408531A (en) | Knowledge graph reasoning method with induction capability | |
Zhang et al. | Color clustering using self-organizing maps | |
Dennis et al. | Autoencoder-enhanced sum-product networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |