CN113723596A - Training method and training device for fixed-point model - Google Patents
Training method and training device for fixed-point model
- Publication number
- Publication number: CN113723596A (application number CN202111033355.7A)
- Authority
- CN
- China
- Prior art keywords
- network model
- model
- determining
- student
- fixed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06N3/045—Combinations of networks (under G—Physics › G06—Computing; Calculating or Counting › G06N—Computing arrangements based on specific computational models › G06N3/00—Computing arrangements based on biological models › G06N3/02—Neural networks › G06N3/04—Architecture, e.g. interconnection topology)
- G06N20/00—Machine learning (under G › G06 › G06N)
- G06N3/047—Probabilistic or stochastic networks (under G › G06 › G06N › G06N3/00 › G06N3/02 › G06N3/04)
- G06N3/048—Activation functions (under G › G06 › G06N › G06N3/00 › G06N3/02 › G06N3/04)
- G06N3/08—Learning methods (under G › G06 › G06N › G06N3/00 › G06N3/02—Neural networks)
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Image Analysis (AREA)
Abstract
Embodiments of the present disclosure disclose a training method and a training apparatus for a fixed-point model. The training method includes: determining output features of a target network layer in a teacher network model based on training data and the teacher network model; determining features of a target network layer in a student network model before a quantization operation based on the training data and the student network model; determining a loss error based on the output features and the features before the quantization operation; and updating parameters of the student network model by back propagation based on the loss error. Embodiments of the present disclosure enable knowledge distillation from a floating-point model to a fixed-point model and improve the prediction accuracy of the fixed-point model.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a training method and a training device for a fixed-point model.
Background
At present, mobile terminals generally have limited computing power, so the neural networks deployed on them are mostly quantized, and their performance is worse than that of floating-point models.
In the related art, knowledge distillation is used to improve model performance: the output of a large model is transferred as knowledge, which can effectively improve the performance of a lightweight model. However, existing knowledge distillation techniques mainly target floating-point models, and existing fixed-point distillation work generally suffers from high training cost and limited performance gain.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. Embodiments of the present disclosure provide a training method and a training device for a fixed-point model.
According to a first aspect of the embodiments of the present disclosure, there is provided a training method of a fixed-point model, including:
determining output characteristics of a target network layer in a teacher network model based on training data and the teacher network model, wherein the teacher network model is a floating point model;
determining the characteristics of a target network layer in the student network model before quantization operation based on the training data and the student network model, wherein the student network model is a fixed point model, and the target network layer in the teacher network model is matched with the target network layer in the student network model;
determining a loss error based on the output characteristic and the characteristic before the quantization operation;
updating parameters of the student network model by back propagation based on the loss error.
According to a second aspect of the embodiments of the present disclosure, there is provided a training apparatus for a fixed-point model, including:
a first characteristic determination module, used for determining the output characteristics of a target network layer in a teacher network model based on training data and the teacher network model, wherein the teacher network model is a floating point model;
the second characteristic determination module is used for determining the characteristics of a target network layer in the student network model before quantization operation based on the training data and the student network model, the student network model is a fixed point model, and the target network layer in the teacher network model is matched with the target network layer in the student network model;
a loss error determination module for determining a loss error based on the output characteristic and the characteristic before the quantization operation;
a first parameter updating module for updating parameters of the student network model by back propagation based on the loss error.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the method for training a fixed-point model according to the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the fixed-point model training method according to the first aspect.
According to the training method and the training device for a fixed-point model provided by the embodiments of the present disclosure, a pre-trained floating-point model is used as the teacher network model and an untrained fixed-point model is used as the student network model. The student network model can be trained by knowledge distillation using the floating-point features output by the target network layer in the teacher network model and the floating-point features, not yet subjected to the quantization operation, of the matching target network layer in the student network model. Embodiments of the present disclosure can thus train a fixed-point model with a floating-point model and improve the prediction accuracy of the fixed-point model.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 is a schematic flow chart diagram of a training method of a fixed-point model according to an embodiment of the present disclosure;
FIG. 2 is a schematic illustration of distillation of a network layer of a student network model by a network layer of a teacher network model in one example of the present disclosure;
FIG. 3 is a schematic diagram of the feature distribution of a floating point model and a fixed point model in one example of the present disclosure;
FIG. 4 is a block diagram of an apparatus for training a fixed-point model according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of the structure of the loss error determination module 430 in one embodiment of the present disclosure;
FIG. 6 is a block diagram of a training apparatus for fixed-point models according to another embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those skilled in the art that terms such as "first" and "second" in the embodiments of the present disclosure are used merely to distinguish one element from another, and do not imply any particular technical meaning or any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure only describes an association relationship between associated objects, and indicates that three kinds of relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
In the course of implementing the present disclosure, the inventor found that existing knowledge distillation techniques for network models usually use a floating-point network model to distill another floating-point model; that is, an untrained floating-point network model undergoes knowledge distillation guided by a trained floating-point network model. Because the network layers of a fixed-point network model output fixed-point data features while the network layers of a floating-point model output floating-point data features, it is difficult to perform knowledge distillation on a fixed-point model directly from a floating-point model.
Exemplary method
Fig. 1 is a schematic flow chart diagram of a training method of a fixed-point model according to an embodiment of the present disclosure. As shown in fig. 1, the training method of the fixed-point model according to the embodiment of the present disclosure includes:
s1: based on the training data and the teacher network model, output characteristics of a target network layer in the teacher network model are determined. Wherein, the teacher network model is a floating point model. The floating point model represents that the data characteristics output by each network layer of the network model are floating point numerical values.
In this embodiment, the teacher network model is trained in advance. The sample data may be divided into a training set and a test set: the teacher network model is trained on the training set, and the test set is then used to check whether the teacher network model meets the convergence condition. Once the convergence condition is reached, the trained teacher network model is obtained.
In the present embodiment, the teacher network model may adopt a Convolutional Neural Network (CNN) model, in which the output feature values of each convolution layer are floating-point numbers. The last network layer of the teacher network model is connected to a softmax function, and the prediction result is output through the softmax function. What the teacher network model predicts is determined, according to user requirements, by training it on the corresponding sample data.
The target network layer in the teacher network model performs convolution on the training data, or on the output feature map of the preceding network layer, to obtain the output features of the target network layer. The training data may be image data acquired by a camera placed in a specific physical scene. The specific physical scene may be the scene in a vehicle cabin, in which case the training data may be in-cabin image data acquired by an in-cabin camera; alternatively, the specific physical scene may be a traffic scene outside the vehicle, in which case the training data may be image data of the environment outside the vehicle collected by a camera mounted outside the vehicle.
It should be noted that the training data in the embodiment of the present disclosure may be determined according to the user requirement.
When the training data is image data, the feature map output by each convolution layer of the target network layer is a feature with dimensions A × B × C, where A, B, and C are all integers greater than 0.
In this embodiment, the target network layer is a network layer specified in the teacher network model. For example, if the teacher network model has 50 network layers, the target network layer is one specified layer among the 50 layers, such as the 49th, 48th, or 45th layer.
S2: determining the features of a target network layer in the student network model before the quantization operation based on the training data and the student network model.
In this embodiment, the student network model is a fixed-point model, and the number of network layers of the student network model is the same as that of the teacher network model; for example, if the teacher network model has 50 layers, the student network model also has 50 layers. A fixed-point model is one in which the output features of each network layer are fixed-point values, while the features of each network layer before quantization are floating-point values; the quantization operation is the operation that converts floating-point features into fixed-point features. In the embodiments of the present disclosure, the teacher network model has a larger number of parameters and higher precision than the student network model; for example, the teacher network model stores and operates on 32-bit floating-point values, while the student network model stores and operates on 8-bit integer values.
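By way of illustration only (this sketch is not part of the original disclosure), one possible way to expose a student layer's floating-point features before the quantization operation is to cache them inside a fake-quantization layer. The module name QuantConv2d, the 8-bit symmetric per-tensor scheme, and the straight-through estimator are assumptions made for this example:

```python
import torch
import torch.nn as nn

class QuantConv2d(nn.Module):
    """Convolution followed by a fake quantization step (illustrative sketch).

    The attribute `pre_quant_feature` caches the floating-point output of the
    convolution *before* the quantization operation, so that it can later be
    matched against the teacher's floating-point features during distillation.
    """

    def __init__(self, in_ch, out_ch, kernel_size, num_bits=8, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, **kwargs)
        self.num_bits = num_bits          # 8-bit integer student (assumption)
        self.pre_quant_feature = None

    def forward(self, x):
        float_feat = self.conv(x)              # floating-point feature before quantization
        self.pre_quant_feature = float_feat    # cached for the distillation loss
        # Symmetric per-tensor fake quantization to `num_bits` bits (assumption).
        qmax = 2 ** (self.num_bits - 1) - 1
        scale = float_feat.detach().abs().max().clamp(min=1e-8) / qmax
        q = torch.clamp(torch.round(float_feat / scale), -qmax - 1, qmax)
        # Straight-through estimator so gradients still flow to the float weights.
        return float_feat + (q * scale - float_feat).detach()
```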
In this embodiment, the target network layer in the teacher network model matches the target network layer in the student network model. For example, if the target network layer of the teacher network model is layer 49, the target network layer of the student network model is also layer 49, and the feature dimensions of the corresponding network layers of the teacher network model and the student network model are the same.
S3: based on the output characteristic and the characteristic before the quantization operation, a loss error is determined. The loss error is determined based on the floating point characteristics output by the teacher network model target network layer and the floating point characteristics before quantization of the student network model target network layer.
S4: parameters of the student network model are updated by back propagation based on the loss error.
Following the exemplary scenario in step S1, when the training data is in-cabin image data, the trained student network model is executed, by way of instructions, on an artificial intelligence processor or a general-purpose processor in the cabin to recognize the in-cabin scene, for example to monitor whether the driver is driving safely or the personal safety status of passengers. In addition, because the structure of the student network model is simpler than that of the teacher network model, the computing efficiency of the artificial intelligence processor or general-purpose processor can be improved while its computing power consumption is reduced.
In this embodiment, a pre-trained floating-point model is used as the teacher network model and an untrained fixed-point model is used as the student network model. The student network model can be trained by knowledge distillation using the floating-point features output by the target network layer in the teacher network model and the floating-point features, not yet subjected to the quantization operation, of the matching target network layer in the student network model. Embodiments of the present disclosure can thus train a fixed-point model with a floating-point model and improve the prediction accuracy of the fixed-point model.
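A minimal sketch of one training step along the lines of steps S1 to S4 might look as follows; the helper features_at, the layers container, the cached pre_quant_feature attribute (see the QuantConv2d sketch above), and the choice of plain MSE as the loss error are assumptions for the example rather than details prescribed by the disclosure:

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, images, target_layer_idx):
    """One knowledge-distillation step for a fixed-point student (sketch).

    `teacher` is a pre-trained floating-point model; `student` is a
    quantization-aware model whose layers cache `pre_quant_feature`.
    `features_at` and `layers` are hypothetical helpers assumed for the example.
    """
    # S1: floating-point output features of the teacher's target layer.
    with torch.no_grad():
        teacher_feat = teacher.features_at(target_layer_idx, images)

    # S2: student's floating-point features before the quantization operation.
    student(images)  # forward pass populates the cached pre-quantization features
    student_feat = student.layers[target_layer_idx].pre_quant_feature

    # S3: loss error between the two floating-point feature maps (MSE here).
    loss = F.mse_loss(student_feat, teacher_feat)

    # S4: update the student's parameters by back propagation.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice this step would be repeated over the training data until the student network model converges.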
FIG. 2 is a schematic illustration of distillation of a network layer of a student network model by a network layer of a teacher network model in one example of the disclosure. As shown in fig. 2, in this example, the target network layer of the teacher network model is the Nth layer of the teacher network model, where N is a natural number greater than 1 and N is less than or equal to the total number of network layers of the teacher network model. Accordingly, the target network layer of the student network model is the Nth layer of the student network model; for example, when the teacher network model has 50 layers, N may be 49. After the loss error is determined from the floating-point features output by the Nth layer of the teacher network model and the floating-point features of the Nth layer of the student network model before quantization, the parameters of the student network model are updated by back propagation, based on the loss error, in the order of the (N-1)th layer, the (N-2)th layer, ..., the 2nd layer, and the 1st layer of the student network model.
In one embodiment of the present disclosure, step S3 includes:
s3-1: and determining a target error function based on the mean square error function and the original error function of the student network model.
In one example of the present disclosure, the mean square error function is as follows:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$

where $MSE$ represents the mean square error, $\hat{y}_i$ represents an element in the feature matrix of the target network layer in the student network model before the quantization operation, $y_i$ represents the corresponding element in the feature matrix output by the target network layer in the teacher network model, and $n$ represents the number of elements in the feature matrix output by the target network layer in the teacher network model.
In this embodiment, the mean square error function and the original error function of the student network model are used to calculate the target error function.
S3-2: a loss error is determined based on the output characteristic, the characteristic prior to the quantization operation, and a target error function.
Specifically, the loss error can be calculated by inputting the feature matrix output by the target network layer of the teacher network model and the feature matrix of the target network layer of the student network model before the quantization operation into the target error function.
In this embodiment, the target error function is obtained from the mean square error function and the original error function of the student network model. Based on the target error function, a reasonable loss error between the features of the target network layer of the teacher network model and the pre-quantization features of the target network layer of the student network model can be accurately calculated. Back propagation is then performed based on this loss error, so that knowledge distillation from the teacher network model to the student network model is realized and the prediction accuracy of the student network model is improved.
In an embodiment of the present disclosure, step S3-1 specifically includes: weighting the mean square error function and the original error function of the student network model to obtain the target error function. For example, the following weighting formula may be adopted:
$$Loss_{Total} = A \times MSE + (1 - A) \times Loss_{Original}$$

where $Loss_{Total}$ represents the loss error, $A$ represents the weighting coefficient of the mean square error, and $Loss_{Original}$ represents the original error function of the student network model.
In this embodiment, the loss error can be effectively adjusted through the weighting coefficient, which in turn adjusts how the parameters change when back propagation is performed on the student network model, thereby improving the effect of knowledge distillation from the teacher network model to the student network model.
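As an illustrative sketch of the weighted target error function described above (not taken from the disclosure), the combination might be computed as follows; the value of the weighting coefficient and the use of cross entropy against ground-truth labels as the student's original error function are assumptions for the example:

```python
import torch.nn.functional as F

def target_error(student_feat, teacher_feat, logits, labels, a=0.5):
    """Loss_Total = A * MSE + (1 - A) * Loss_Original (illustrative sketch)."""
    mse = F.mse_loss(student_feat, teacher_feat)   # distillation term on pre-quantization features
    original = F.cross_entropy(logits, labels)     # student's original task loss (assumed form)
    return a * mse + (1.0 - a) * original
```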
FIG. 3 is a schematic diagram of the feature distributions of a floating point model and a fixed point model in one example of the present disclosure. In fig. 3(a) to 3(c), the abscissa of each graph represents the activation value of the feature, and the ordinate represents the number of neurons corresponding to that activation value, where the activation value is the value obtained by applying the activation function after convolution. In the present example, the Rectified Linear Unit (ReLU) is employed as the activation function.
Comparing fig. 3(a) with fig. 3(c), the neuron distribution of the activation values after fixed-point-model quantization is sparse and differs greatly from the neuron distribution of the activation values of the floating-point model; if knowledge distillation were performed between the quantized features of the fixed-point model and the output features of the floating-point model, the effect would be poor. Fig. 3(b) shows the neuron number distribution before fixed-point-model quantization, which is much closer to that of the floating-point model in fig. 3(c). Therefore, in the embodiments of the present disclosure, it is more reasonable to perform knowledge distillation between the pre-quantization features of the fixed-point model and the output features of the floating-point model.
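To make this comparison concrete, a small sketch such as the following could be used to collect the activation histograms of a layer's features before and after quantization; the function name, bin count, and use of torch.histc are assumptions made for illustration only:

```python
import torch

def activation_histograms(float_feat, quant_feat, bins=50):
    """Histograms of activation values before and after quantization (sketch).

    The pre-quantization (floating-point) histogram is expected to resemble the
    teacher's floating-point distribution, while the post-quantization histogram
    is sparse, as in the FIG. 3 comparison.
    """
    lo = float(min(float_feat.min(), quant_feat.min()))
    hi = float(max(float_feat.max(), quant_feat.max()))
    pre = torch.histc(float_feat.flatten(), bins=bins, min=lo, max=hi)
    post = torch.histc(quant_feat.flatten(), bins=bins, min=lo, max=hi)
    return pre, post
```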
In one embodiment of the present disclosure, the training method of the fixed-point model further includes:
s5: based on the training data and the teacher network model, a probability distribution result of the teacher network model is determined.
In this embodiment, the last layer of the teacher network model is connected with a softmax function, and after passing through the softmax function, a probability distribution result of the teacher network model can be obtained.
S6: and determining a probability distribution result of the student network model based on the training data and the student network model.
In this embodiment, the last layer of the network layer of the student network model is connected with a softmax function, and after passing through the softmax function, a probability distribution result of the student network model can be obtained.
S7: and determining the cross entropy between the probability distribution result of the teacher network model and the probability distribution result of the student network model.
In one example of the present disclosure, the cross entropy function is as follows:
$$H(p, q) = -\sum_{i} p(x_i)\log q(x_i)$$

where $H(p, q)$ represents the cross entropy, $p(x_i)$ represents the probability of a classification item in the probability distribution result of the teacher network model, and $q(x_i)$ represents the probability, in the probability distribution result of the student network model, of the classification item corresponding to that item in the teacher network model's probability distribution result.
Based on the cross entropy function, the cross entropy between the probability distribution result of the teacher network model and the probability distribution result of the student network model can be calculated.
S8: parameters of the student network model are updated by back propagation based on the cross entropy.
In this embodiment, on the basis of performing knowledge distillation on the features of the target network layer in the student network model before quantization operation through the output features of the target network layer in the teacher network model, the prediction accuracy of the student network model can be further improved by performing knowledge distillation again on the probability distribution result of the teacher network model and the probability distribution result of the student network model.
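By way of illustration, the additional probability-distribution distillation of steps S5 to S8 could be sketched as follows; the absence of a temperature parameter and the batch-mean reduction are choices made for this example, not details taken from the disclosure:

```python
import torch
import torch.nn.functional as F

def probability_distillation_step(teacher, student, optimizer, images):
    """Distill the teacher's softmax output into the student via cross entropy (sketch)."""
    # S5: probability distribution result of the teacher network model.
    with torch.no_grad():
        p = F.softmax(teacher(images), dim=1)

    # S6: probability distribution result of the student network model.
    log_q = F.log_softmax(student(images), dim=1)

    # S7: cross entropy H(p, q) = -sum_i p(x_i) * log q(x_i), averaged over the batch.
    loss = -(p * log_q).sum(dim=1).mean()

    # S8: update the student's parameters by back propagation.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```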
Any of the methods for training a fixed-point model provided by embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, a server and the like. Alternatively, any of the training methods for the fixed-point models provided by the embodiments of the present disclosure may be executed by a processor, for example, the processor may execute any of the training methods for the fixed-point models mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. And will not be described in detail below.
Exemplary devices
FIG. 4 is a block diagram illustrating an exemplary training apparatus for a fixed-point model according to an embodiment of the present disclosure. As shown in fig. 4, the training apparatus for a fixed-point model according to an embodiment of the present disclosure includes: a first characteristic determination module 410, a second characteristic determination module 420, a loss error determination module 430, and a first parameter update module 440.
The first feature determination module 410 is configured to determine output features of a target network layer in a teacher network model based on training data and the teacher network model, where the teacher network model is a floating point model and the training data is at least one of image data, text data, and audio data. The second feature determination module 420 is configured to determine features of a target network layer in the student network model before a quantization operation based on the training data and the student network model, where the student network model is a fixed point model and the target network layer in the teacher network model is matched with the target network layer in the student network model. The loss error determination module 430 is configured to determine a loss error based on the output features and the features before the quantization operation. The first parameter updating module 440 is configured to update the parameters of the student network model through back propagation based on the loss error.
Fig. 5 is a block diagram of the structure of the loss error determination module 430 in one embodiment of the present disclosure. As shown in fig. 5, in one embodiment of the present disclosure, the loss error determination module 430 includes:
and the target error function determining unit 4301 is configured to determine a target error function based on the mean square error function and the original error function of the student network model.
A loss error determination unit 4302 configured to determine the loss error based on the output feature, the feature before the quantization operation, and the target error function.
In an embodiment of the present disclosure, the target error function determining unit 4301 is configured to perform weighting processing on the mean square error function and an original error function of the student network model to obtain the target error function.
FIG. 6 is a block diagram of a training apparatus for fixed-point models according to another embodiment of the present disclosure. As shown in fig. 6, in another embodiment of the present disclosure, the training apparatus for the fixed-point model further includes:
a first probability distribution result determination module 450 for determining a probability distribution result of the teacher network model based on the training data and the teacher network model;
a second probability distribution result determining module 460, configured to determine a probability distribution result of the student network model based on the training data and the student network model;
a relative entropy determination module 470, configured to determine a cross entropy between the probability distribution result of the teacher network model and the probability distribution result of the student network model;
and a second parameter updating module 480, configured to update parameters of the student network model through back propagation based on the cross entropy.
It should be noted that the specific implementation of the training apparatus for a fixed-point model in the embodiments of the present disclosure is similar to the specific implementation of the training method for a fixed-point model; for details, refer to the description of the training method, which is not repeated here in order to reduce redundancy.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 7. FIG. 7 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
As shown in fig. 7, the electronic device includes one or more processors 710 and memory 720.
The processor 710 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
Memory 720 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 710 to implement the fixed-point model training methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device may further include: an input device 730 and an output device 740, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 730 may include, for example, a keyboard, a mouse, and the like. The output device 740 may include, for example, a display, speakers, a printer, a communication network and the remote output devices connected to it, and so on.
Of course, for simplicity, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 7, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in a method of training a fixed-point model according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer program product may write program code for carrying out operations of embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method for training a fixed-point model according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, or configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, or configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
Claims (10)
1. A method of training a fixed-point model, comprising:
determining output characteristics of a target network layer in a teacher network model based on training data and the teacher network model, wherein the teacher network model is a floating point model;
determining the characteristics of a target network layer in the student network model before quantization operation based on the training data and the student network model, wherein the student network model is a fixed point model, and the target network layer in the teacher network model is matched with the target network layer in the student network model;
determining a loss error based on the output characteristic and the characteristic before the quantization operation;
updating parameters of the student network model by back propagation based on the loss error.
2. The training method of the fixed-point model according to claim 1, wherein the determining a loss error based on the output feature and the feature before the quantization operation comprises:
determining a target error function based on a mean square error function and an original error function of the student network model;
determining the loss error based on the output characteristic, the characteristic prior to the quantization operation, and the target error function.
3. The training method of the fixed-point model according to claim 2, wherein the obtaining a target error function based on a mean square error function and an original error function of the student network model comprises:
and weighting the mean square error function and the original error function of the student network model to obtain the target error function.
4. The training method of the fixed-point model according to any one of claims 1 to 3, further comprising:
determining a probability distribution result of the teacher network model based on the training data and the teacher network model;
determining a probability distribution result of the student network model based on the training data and the student network model;
determining cross entropy between the probability distribution result of the teacher network model and the probability distribution result of the student network model;
updating parameters of the student network model by back propagation based on the cross entropy.
5. A training apparatus for a fixed-point model, comprising:
a first characteristic determination module, used for determining the output characteristics of a target network layer in a teacher network model based on training data and the teacher network model, wherein the teacher network model is a floating point model;
the second characteristic determination module is used for determining the characteristics of a target network layer in the student network model before quantization operation based on the training data and the student network model, the student network model is a fixed point model, and the target network layer in the teacher network model is matched with the target network layer in the student network model;
a loss error determination module for determining a loss error based on the output characteristic and the characteristic before the quantization operation;
a first parameter updating module for updating parameters of the student network model by back propagation based on the loss error.
6. The training apparatus for the fixed-point model according to claim 5, wherein the loss error determination module comprises:
a target error function determination unit for determining a target error function based on a mean square error function and an original error function of the student network model;
a loss error determination unit for determining the loss error based on the output characteristic, the characteristic before quantization operation, and the target error function.
7. The training device of the fixed-point model according to claim 6, wherein the objective error function determining unit is configured to perform weighting processing on the mean square error function and the original error function of the student network model to obtain the objective error function.
8. The training device for a fixed-point model according to any one of claims 5 to 7, further comprising:
a first probability distribution result determination module for determining a probability distribution result of the teacher network model based on the training data and the teacher network model;
a second probability distribution result determination module for determining a probability distribution result of the student network model based on the training data and the student network model;
the relative entropy determining module is used for determining the cross entropy between the probability distribution result of the teacher network model and the probability distribution result of the student network model;
and the second parameter updating module is used for updating the parameters of the student network model through back propagation based on the cross entropy.
9. A computer-readable storage medium, storing a computer program for executing the method for training a fixed-point model according to any one of claims 1 to 4.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is used for reading the executable instructions from the memory and executing the instructions to realize the training method of the fixed point model of any one of the claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111033355.7A CN113723596A (en) | 2021-09-03 | 2021-09-03 | Training method and training device for fixed-point model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111033355.7A CN113723596A (en) | 2021-09-03 | 2021-09-03 | Training method and training device for fixed-point model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113723596A (en) | 2021-11-30
Family
ID=78681514
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111033355.7A Pending CN113723596A (en) | 2021-09-03 | 2021-09-03 | Training method and training device for fixed-point model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113723596A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180365564A1 (en) * | 2017-06-15 | 2018-12-20 | TuSimple | Method and device for training neural network |
CN111611377A (en) * | 2020-04-22 | 2020-09-01 | 淮阴工学院 | Knowledge distillation-based multi-layer neural network language model training method and device |
CN111985523A (en) * | 2020-06-28 | 2020-11-24 | 合肥工业大学 | Knowledge distillation training-based 2-exponential power deep neural network quantification method |
Non-Patent Citations (1)
Title |
---|
Dan Alistarh et al.: "Model Compression via Distillation and Quantization", arXiv, pages 1-21 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11030997B2 (en) | Slim embedding layers for recurrent neural language models | |
CN108959482B (en) | Single-round dialogue data classification method and device based on deep learning and electronic equipment | |
US11928601B2 (en) | Neural network compression | |
US20220230048A1 (en) | Neural Architecture Scaling For Hardware Accelerators | |
CN111414987A (en) | Training method and training device for neural network and electronic equipment | |
CN110647893A (en) | Target object identification method, device, storage medium and equipment | |
CN113610232B (en) | Network model quantization method and device, computer equipment and storage medium | |
CN114780727A (en) | Text classification method and device based on reinforcement learning, computer equipment and medium | |
CN113987187B (en) | Public opinion text classification method, system, terminal and medium based on multi-label embedding | |
CN115526320A (en) | Neural network model inference acceleration method, apparatus, electronic device and medium | |
CN113705809A (en) | Data prediction model training method, industrial index prediction method and device | |
CN111461862B (en) | Method and device for determining target characteristics for service data | |
CN111242162A (en) | Training method and device of image classification model, medium and electronic equipment | |
CN113449840A (en) | Neural network training method and device and image classification method and device | |
CN115482141A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN113902114A (en) | Quantization method, device and system of neural network model, electronic device and storage medium | |
CN115803806A (en) | Systems and methods for training dual-mode machine-learned speech recognition models | |
CN114463553A (en) | Image processing method and apparatus, electronic device, and storage medium | |
WO2022154829A1 (en) | Neural architecture scaling for hardware accelerators | |
CN111062203B (en) | Voice-based data labeling method, device, medium and electronic equipment | |
CN111587441B (en) | Generating output examples using regression neural networks conditioned on bit values | |
CN117610608A (en) | Knowledge distillation method, equipment and medium based on multi-stage feature fusion | |
CN113723596A (en) | Training method and training device for fixed-point model | |
CN116701411A (en) | Multi-field data archiving method, device, medium and equipment | |
CN116432705A (en) | Text generation model construction method, text generation device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |