JP2020071883A - Model training method, data recognition method and data recognition device

Info

Publication number: JP2020071883A
Application number: JP2019195406A
Authority: JP
Inventors: ワン・モンジアオ; Mengjiao Wang; リィウ・ルゥジエ; Rujie Liu
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-10-29
Filing date: 2019-10-28
Publication date: 2020-05-07
Also published as: CN111105008A; US20200134506A1; EP3648014A1

Abstract

To provide a model training method. SOLUTION: The model training method trains a student model corresponding to a teacher model. The teacher model has been trained using first input data as input data and first output data as an output target. The method includes a step of training the student model using second input data as input data and the first output data as the output target. The second input data is obtained by modifying the first input data. SELECTED DRAWING: Figure 2

Description

The present disclosure relates to a model training method, a data recognition method, and a data recognition device, and more particularly to learning an effective data recognition model by using knowledge distillation.

Recently, the accuracy of data recognition has been greatly improved by deep learning networks. However, speed is an important factor in many application scenarios: the computation must be fast while still achieving the accuracy the scenario requires. Advances in data recognition, such as object detection, rely on ever deeper network structures, and these deeper structures increase the computational overhead at run time. The concept of knowledge distillation has been proposed for this reason.

A complex deep learning network model may be an ensemble of several independent models or a single large network model trained under several constraints. Once training of the complex network model is complete, another training method may be used to extract from it a small model to be deployed on the application side; this is knowledge distillation. Knowledge distillation is a practical way to train a fast neural network model under the supervision of a large model. The most common procedure is to extract the output of a layer of the large neural network and force the small neural network to produce the same output. In this way, the small neural network can learn the expressive power of the large model. Here, the small neural network is also called the "student" model, and the large neural network is also called the "teacher" model.

In conventional knowledge distillation, the input of the "student" model and the input of the "teacher" model are usually the same. However, if the original training data set is changed, for example if the training data in it is modified by a certain amount, the conventional method requires retraining the "teacher" model and then training the "student" model by knowledge distillation. Because the "teacher" model is large and difficult to train, retraining it imposes a heavy computational load.

Accordingly, the present invention provides a new method of training a student model.

Note that the above description of the technical background is provided only so that the technical solutions of the present invention can be described clearly and completely and understood by those skilled in the art. These technical solutions should not be regarded as known to those skilled in the art merely because they are described in the background section of the present invention.

The following gives a brief summary of the present disclosure in order to provide a basic understanding of some of its aspects. This summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical parts of the disclosure, nor to limit its scope. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description given later.

To achieve the object of the present disclosure, one aspect of the present disclosure provides a method of training a student model corresponding to a teacher model, the teacher model having been trained with first input data as input data and first output data as an output target, the method including a step of training the student model with second input data as input data and the first output data as the output target, wherein the second input data is data obtained by modifying the first input data.

Another aspect of the present disclosure provides a data recognition method including a step of performing data recognition using a student model trained by the method of training a student model corresponding to a teacher model.

Another aspect of the present disclosure provides a data recognition device including at least one processor that executes the data recognition method.

The present disclosure provides a new model training method that improves the robustness of the trained student model without requiring the teacher model to be retrained. According to the present disclosure, the training input of the teacher model is still the original data, whereas the training input of the student model is data obtained by modifying the original data. The output of the student model thus remains the same as that of the teacher model; that is, the student model can be trained without retraining the teacher model, regardless of the difference in the data.

To make the above and other objects, features, and advantages of the present disclosure easier to understand, embodiments of the present disclosure are described below with reference to the drawings.
FIG. 1 is a schematic diagram showing a conventional student model training method.
FIG. 2 is a schematic diagram showing a student model training method according to an embodiment of the present disclosure.
FIG. 3 is a flowchart of a learning model training method according to an embodiment of the present disclosure.
FIG. 4 is a flowchart showing a data recognition method according to an embodiment of the present disclosure.
FIG. 5 is a schematic diagram showing a data recognition device according to an embodiment of the present disclosure.
FIG. 6 is a diagram showing the configuration of a general-purpose machine capable of implementing the student model training method or the data recognition method according to an embodiment of the present disclosure.

Exemplary embodiments of the present disclosure are described below with reference to the drawings. For convenience of explanation, not all features of an actual implementation are described in the specification. Note that a person skilled in the art may make implementation-specific decisions when realizing an embodiment, and these decisions may vary from one implementation to another.

Note also that, to keep the disclosure clear, the drawings show only the components closely related to the present disclosure and omit details unrelated to it.

Exemplary embodiments of the present disclosure are described below with reference to the drawings. For clarity, the drawings and the description omit parts and processes that are known to those skilled in the art and unrelated to the exemplary embodiments.

Aspects of the exemplary embodiments may be implemented as a system, a method, or a computer program product. Accordingly, each aspect may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, and the like), or an embodiment combining software and hardware, any of which may generally be referred to herein as a "circuit", a "module", or a "system". Furthermore, aspects of the exemplary embodiments may take the form of a computer program product embodied in one or more computer-readable media on which computer-readable program code is recorded. The computer program may, for example, be distributed over a computer network, located on one or more remote servers, or embedded in the memory of a device.

Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example and without limitation, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of these. More specific examples (a non-exhaustive list) of computer-readable storage media include an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of these. In this specification, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in connection with, an instruction execution system, apparatus, or device.

A computer-readable signal medium may include, for example, a propagated data signal carrying computer-readable program code, in baseband or as part of a carrier wave. Such a propagated signal may take any suitable form, including but not limited to electromagnetic, optical, or any suitable combination of these.

A computer-readable signal medium may be any computer-readable medium, other than a computer-readable storage medium, that can communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device.

Program code on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical cable, radio frequency, or any suitable combination of these.

Computer program code for carrying out the operations of the aspects of the exemplary embodiments disclosed herein may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java (registered trademark), Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.

Aspects of the exemplary embodiments disclosed herein are described below with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to the exemplary embodiments. Each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the computer or other programmable data processing apparatus, create means for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable medium produce a product including instructions that implement the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on it, such that the instructions executed on the computer or other programmable apparatus provide processes for implementing the functions/operations specified in the blocks of the flowcharts and/or block diagrams.

FIG. 1 is a schematic diagram showing a conventional student model training method.

In the conventional student model training method, knowledge distillation is performed using the difference between the output of the teacher model and the output of the student model, and a small, fast student model is trained. In this way, the student model can learn the expressive power of the teacher model.

Usually, in the conventional student model training process, every sample is treated in the same way; that is, the loss produced by each sample has the same weight. This approach has a drawback: because the teacher model has different confidence for different samples, the losses should accordingly be weighted differently. The method according to the embodiments of the present disclosure is proposed to solve this problem.

FIG. 2 is a schematic diagram showing a student model training method according to an embodiment of the present disclosure.

The student model training method according to the embodiment of the present disclosure likewise performs knowledge distillation using the difference between the output of the teacher model and the output of the student model, training a small, fast student model so that it learns the expressive power of the teacher model. Unlike the conventional method shown in FIG. 1, however, a variation Δ is added to the input of the student model, while the student model is still trained with the same output target as the teacher model. A student model trained in this way can handle modified input data and can therefore be used in more application scenarios.

The learning model training method according to the embodiment of the present disclosure trains the student model using a neural network. A neural network uses artificial neurons, which are simplified models of the function of biological neurons, and the artificial neurons may be connected to each other by edges having connection weights. A connection weight (a parameter of the neural network) is a specific value associated with an edge and is also called the connection strength. Through its artificial neurons, a neural network can carry out cognitive functions or learning processes of the human brain. Artificial neurons are also called nodes.

A neural network may include multiple layers. For example, a neural network may include an input layer, a hidden layer, and an output layer. The input layer may receive the input used for training and pass it to the hidden layer, and the output layer may generate the output of the neural network based on the signals received from the nodes of the hidden layer. The hidden layer may be located between the input layer and the output layer and may transform the training data received from the input layer into values that are easier to predict. The nodes of the input layer and the hidden layer may be connected to each other by edges having connection weights, and the nodes of the hidden layer and the output layer may likewise be connected to each other by edges having connection weights. The input layer, the hidden layer, and the output layer may each include multiple nodes.
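The layer structure described above can be illustrated with a minimal sketch, assuming one hidden layer, a tanh activation, and weights stored as nested lists. The function names `dense` and `forward` and all numeric values are hypothetical and are not taken from the disclosure.

```python
import math


def dense(inputs, weights, biases):
    """Weighted sum per node: each row of `weights` holds the edge weights into one node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer (tanh, an assumed activation) -> output layer."""
    hidden = [math.tanh(v) for v in dense(inputs, hidden_w, hidden_b)]
    return dense(hidden, out_w, out_b)


# Two input nodes, one hidden node, one output node (illustrative values only).
result = forward([1.0, 2.0], [[1.0, 0.0]], [0.0], [[2.0]], [1.0])
```

Here the hidden node sees only the first input because its second edge weight is zero, so the output equals `2 * tanh(1) + 1`.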

A neural network may include multiple hidden layers. A neural network that includes multiple hidden layers may be called a deep neural network, and training a deep neural network may be called deep learning. The nodes of a hidden layer may be called hidden nodes. The number of hidden layers in a deep neural network is not limited to any particular number.

A neural network may be trained by supervised learning. Supervised learning is a method in which input data and the corresponding output data are provided to the neural network and the connection weights of the edges are updated so that the network outputs the output data corresponding to the input data. For example, a model training device may update the connection weights of the edges between artificial neurons using the delta rule and error backpropagation learning.
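As an illustration of the weight update mentioned above, the following is a minimal sketch of the delta rule for a single linear neuron. The helper `delta_rule_step`, the learning rate, and the toy samples are assumptions made for the example and are not part of the disclosure.

```python
def delta_rule_step(weights, bias, inputs, target, lr=0.1):
    """One delta-rule update: move each weight in proportion to error * input."""
    output = sum(w * x for w, x in zip(weights, inputs)) + bias
    error = target - output
    new_weights = [w + lr * error * x for w, x in zip(weights, inputs)]
    new_bias = bias + lr * error
    return new_weights, new_bias, error


def predict(weights, bias, inputs):
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Toy supervised data: (input, desired output) pairs.
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
weights, bias = [0.0, 0.0], 0.0
for _ in range(100):
    for inputs, target in samples:
        weights, bias, _ = delta_rule_step(weights, bias, inputs, target)
```

After the loop, `predict` returns values close to the desired outputs for both samples. Backpropagation extends the same error-driven update to the hidden layers of a multilayer network.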

A deep network is a deep neural network. Its structure is similar to that of a conventional multilayer perceptron, and the algorithm used for supervised learning is also the same. The only difference is that this network performs unsupervised learning before supervised learning and uses the weights learned by unsupervised learning as the initial values for supervised learning. This change corresponds to a reasonable assumption. Let P(X) denote the representation of the data obtained by pre-training the network with unsupervised learning; the network is then trained with supervised learning (for example, with the BP algorithm) to obtain P(Y|X), where Y is the output (for example, a category label). The assumption is that learning P(X) helps in learning P(Y|X). Because this approach learns not only the conditional probability distribution P(Y|X) but also the joint probability distribution of X and Y, it reduces the risk of overfitting compared with purely supervised learning.

The learning model training method according to the embodiment of the present disclosure uses a deep neural network, in particular a convolutional neural network (CNN). The CNN, proposed in recent years, is a feedforward neural network whose artificial neurons respond to surrounding units within a limited coverage area and which performs very well on large-scale image processing. A CNN includes convolutional layers and pooling layers. CNNs are mainly used to recognize two-dimensional images in a manner invariant to displacement, scaling, and other forms of distortion. Because the feature detection layers of a CNN learn from the training data, using a CNN avoids explicit feature extraction; features are learned implicitly from the training data. Furthermore, because the neurons on the same feature map share the same weights, the network can learn in parallel, which is a major advantage of convolutional networks over networks in which the neurons are connected to each other. With its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing. Its layout is closer to that of an actual biological neural network, weight sharing reduces the complexity of the network, and, in particular, images with multidimensional input vectors can be fed directly into the network, which avoids the complexity of data reconstruction during feature extraction and classification. For these reasons, the learning model training method according to the embodiment of the present disclosure preferably uses a convolutional neural network and trains the student model by iteratively reducing the difference between the output of the teacher model and the output of the student model. Because convolutional neural networks are well known to those skilled in the art, a detailed description of their principles is omitted here.
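To make the weight sharing of a convolutional layer and the role of a pooling layer concrete, here is a minimal sketch in plain Python. It assumes a single-channel image stored as a list of rows, a "valid" convolution with stride 1, and non-overlapping max pooling; the function names and the example kernel are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (cross-correlation form, as is
    customary in deep learning frameworks); every output position reuses the same weights."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in cols]
            for r in rows]


# A 5x5 image containing a vertical bar, and a 2x2 vertical-edge kernel.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]
feature_map = conv2d_valid(image, kernel)   # 4x4 map, each row is [0, 2, 0, -2]
pooled = max_pool2d(feature_map)            # 2x2 map, [[2, 0], [2, 0]]
```

The kernel has only four weights regardless of image size, which is the reduction in network complexity that weight sharing provides.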

FIG. 3 is a flowchart of a learning model training method according to an embodiment of the present disclosure.

As shown in FIG. 3, in step 301, a trained teacher model is obtained in advance, or the teacher model is trained at that time. The teacher model has been trained with unmodified samples, the first input data, as input data and with first output data as the output target. In step 302, the student model is trained with modified samples, the second input data, as input data and with the same first output data as used for the teacher model as the output target. Here, the second input data is obtained by modifying the first input data, and the modification is performed by a signal processing method corresponding to the type of the first input data. The training in steps 301 and 302 is performed using a convolutional neural network.
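The relationship between the training data of steps 301 and 302 can be sketched as follows. This is only an illustration of how the two sets of (input, target) pairs relate; the helper names `build_training_pairs` and `drop_every_other`, and the use of subsampling as the modification, are assumptions for the example.

```python
def build_training_pairs(first_inputs, first_outputs, modify):
    """Teacher pairs use the unmodified inputs; student pairs use modified
    inputs but keep exactly the same output targets."""
    teacher_pairs = list(zip(first_inputs, first_outputs))
    student_pairs = [(modify(x), y) for x, y in teacher_pairs]
    return teacher_pairs, student_pairs


def drop_every_other(signal):
    """A stand-in signal processing step: keep every second value."""
    return signal[::2]


first_inputs = [[1, 2, 3, 4], [5, 6, 7, 8]]
first_outputs = ["a", "b"]
teacher_pairs, student_pairs = build_training_pairs(
    first_inputs, first_outputs, drop_every_other)
```

Because only the student pairs depend on `modify`, changing the modification leaves the teacher's training data, and hence the trained teacher, untouched.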

In the conventional student model training step, the student model is trained with the same samples of the first input data as the teacher model as input data and with the same first output data as the teacher model as the output target. This process may be represented by the following equation (1).

$$S(x_i) = T(x_i) \tag{1}$$

In equation (1), $S$ represents the student model, $T$ represents the teacher model, and $x_i$ represents a training sample. That is, in the conventional student model training method, the student model and the teacher model receive the same input samples. Consequently, when the input samples change, the teacher model must be retrained before a new student model can be obtained by knowledge distillation.

The difference between the outputs of the teacher model and the student model may be expressed as a loss function. Commonly used loss functions include 1) the Logit loss, 2) the feature L2 loss, and 3) the softmax loss of the student model. These three loss functions are described in detail below.

1) Logit loss
The Logit loss represents the difference between the probability distributions produced by the teacher model and the student model. Here the loss is calculated using the KL divergence, that is, the relative entropy, which is a common measure of the difference between two probability distributions. The Logit loss function is given by the following equation.

$$L_L = \sum_{i=1}^{m} x^{t}(i) \log \frac{x^{t}(i)}{x^{s}(i)} \tag{2}$$

In equation (2), $L_L$ represents the Logit loss, $x^{t}(i)$ represents the probability that the teacher model classifies the sample into the $i$-th category, $x^{s}(i)$ represents the probability that the student model classifies the sample into the $i$-th category, and $m$ represents the total number of categories.
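The Logit loss described above, the KL divergence from the teacher's class probabilities to the student's, can be sketched in a few lines. The function name `logit_loss` and the small constant `eps`, which guards against division by zero, are assumptions of this example.

```python
import math


def logit_loss(teacher_probs, student_probs, eps=1e-12):
    """KL divergence sum_i t_i * log(t_i / s_i) over the m categories.
    Terms with t_i == 0 contribute nothing and are skipped."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher_probs, student_probs)
               if t > 0)


same = logit_loss([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])   # 0 when the distributions match
diff = logit_loss([0.5, 0.5], [0.25, 0.75])           # positive when they differ
```

The loss is zero only when the student reproduces the teacher's distribution, which is why minimizing it pulls the student's predictions toward the teacher's.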

2) Feature L2 loss
The feature L2 loss is given by the following equation.

$$L_F = \sum_{i=1}^{m} \left\| f^{s}_{x_i} - f^{t}_{x_i} \right\|_2^2 \tag{3}$$

In equation (3), $L_F$ represents the feature L2 loss, $m$ represents the total number of categories (the total number of samples $x_i$), $f^{s}_{x_i}$ represents the output feature of sample $x_i$ output by the student model, and $f^{t}_{x_i}$ represents the output feature of sample $x_i$ output by the teacher model.
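A feature L2 loss of the kind described above compares the feature vectors that the student and the teacher produce for the same samples. The sketch below assumes the loss is the sum over samples of the squared Euclidean distance between the two feature vectors; the function name and the example features are hypothetical.

```python
def feature_l2_loss(student_features, teacher_features):
    """Sum over samples of the squared L2 distance between the student's
    and the teacher's output features for that sample."""
    return sum(sum((s - t) ** 2 for s, t in zip(f_student, f_teacher))
               for f_student, f_teacher in zip(student_features, teacher_features))


student_features = [[1.0, 2.0], [0.0, 0.0]]
teacher_features = [[1.0, 0.0], [3.0, 4.0]]
loss = feature_l2_loss(student_features, teacher_features)   # 4 + 25 = 29
```

Note that the two models must expose features of the same dimension for this comparison to be defined.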

3) Softmax loss of the student model
The softmax loss of the student model is given by the following equation.

$$L_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{\top} f^{s}_{x_i} + b_{y_i}}}{\sum_{j} e^{W_{j}^{\top} f^{s}_{x_i} + b_{j}}} \tag{4}$$

In equation (4), $L_S$ represents the softmax loss, $m$ represents the total number of categories (the total number of samples $x_i$), $y_i$ represents the label of $x_i$, and $f^{s}_{x_i}$ represents the output feature of sample $x_i$ output by the student model. The remaining parameters, $W$ and $b$, are the usual softmax parameters: $W$ is the coefficient matrix and $b$ is the offset. Both are determined by training.
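The softmax loss of the student model can be sketched as the summed negative log-probability of the correct label, computed from the student's output features, a coefficient matrix W, and an offset b. In this example W is assumed to be stored with one row per category, and the function name and test values are hypothetical.

```python
import math


def softmax_loss(features, labels, W, b):
    """Sum over samples of -log softmax(W f + b)[label].
    W[j] is the weight row for category j; b[j] is its offset."""
    total = 0.0
    for f, y in zip(features, labels):
        scores = [sum(w_k * f_k for w_k, f_k in zip(W[j], f)) + b[j]
                  for j in range(len(W))]
        max_score = max(scores)  # subtract the maximum for numerical stability
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[y]
    return total


# With all-zero parameters, both categories are equally likely: loss = log 2.
uniform = softmax_loss([[0.0, 0.0]], [1], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
# A feature strongly aligned with the correct category gives a loss near zero.
confident = softmax_loss([[10.0, 0.0]], [0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Unlike the Logit loss and the feature L2 loss, this term depends on the ground-truth labels rather than on the teacher's outputs.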

Based on the above three loss functions, the total loss may be expressed by the following equation.

$$L = \lambda_L L_L + \lambda_F L_F + \lambda_S L_S \tag{5}$$

Here, $\lambda_L$, $\lambda_F$, and $\lambda_S$ are all obtained by training.
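A weighted combination of the three losses can be written as a one-line function. The sketch below takes the three coefficients as plain arguments with default values chosen only for illustration; how the coefficients are actually obtained is outside the scope of this example.

```python
def total_loss(logit, feature, softmax,
               lambda_l=1.0, lambda_f=1.0, lambda_s=1.0):
    """Weighted sum of the Logit loss, the feature L2 loss, and the softmax loss."""
    return lambda_l * logit + lambda_f * feature + lambda_s * softmax


combined = total_loss(1.0, 2.0, 3.0, lambda_l=0.5, lambda_f=0.25, lambda_s=2.0)
```

With these illustrative values the result is 0.5 + 0.5 + 6.0 = 7.0; raising one coefficient makes the corresponding term dominate the gradient during training.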

The following describes training step 302, which differs from the conventional student model training step described above.

上記従来の生徒モデルの訓練ステップとは異なり、本開示の実施形態に係る図３に示すステップ３０２において、生徒モデルの入力に変化量Δを追加し、このプロセスは以下の式（６）で表されてもよい。

Unlike the conventional training step of the student model described above, in step 302 shown in FIG. 3 according to the embodiment of the present disclosure, the variation Δ is added to the input of the student model, and this process is represented by the following equation (6). May be done.

In Equation (6) above, S denotes the student model, T denotes the teacher model, x^i denotes a training sample, and Δ denotes the amount by which x^i is changed. The variation is produced by a signal processing method corresponding to the type of the input data, that is, of the sample. For example, when the training sample is an image, Δ may be the variation produced by downsampling the image. The types of input data include, but are not limited to, image data, audio data, and text data. Consequently, in the student model training method according to the embodiment of the present disclosure, the input samples of the student model differ from the input samples of the teacher model.
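As an illustration of the image case, the following sketch builds the modified student inputs by average-pool downsampling. The text names downsampling only as an example and does not specify a method, so the pooling scheme, the function names, and the `factor` parameter are assumptions.

```python
import numpy as np

def downsample(image, factor=2):
    """Average-pool a 2-D image by an integer factor (one possible change)."""
    h, w = image.shape
    # Crop so that both dimensions are divisible by the factor.
    h2, w2 = h - h % factor, w - w % factor
    blocks = image[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor)
    return blocks.mean(axis=(1, 3))

def make_training_pairs(images, factor=2):
    # The teacher sees the original sample; the student sees the modified one.
    return [(img, downsample(img, factor)) for img in images]
```

Each pair holds the teacher input and the corresponding student input, so both models can be driven toward the same output target.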

Adding the variation Δ to the training data makes the training sample domain of the student model differ from that of the teacher model. In the student model training method according to the embodiment of the present disclosure, a student model trained directly with the logit loss and the feature L2 loss of the conventional method cannot accurately recognize data or objects. Based on the data relationship between the original input samples and the modified data samples, the multi-kernel maximum mean discrepancy (MK-MMD), a domain similarity metric, may be used as the loss function. Changing the inter-domain distance metric to MK-MMD allows the inter-domain distances of multiple adaptation layers to be measured simultaneously, and parameter learning for MK-MMD does not increase the training time of the deep neural network. A model trained by the student model learning method based on the MK-MMD loss function can achieve good classification performance in various types of tasks. The MK-MMD function used is expressed by the following Equation (7).

In Equation (7) above, N and M denote the numbers of samples in one category for the sample sets x and y, respectively. In the student model training method according to the embodiment of the present disclosure, the number of samples in one category for the student model is preferably the same as the number of samples in one category for the teacher model. That is, in each of the following equations, N and M preferably have the same value.
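A multi-kernel MMD can be sketched as follows. Because the image of Equation (7) is not reproduced here, this is the standard biased estimate of the squared MMD with a convex combination of Gaussian kernels, not necessarily the exact form used in the disclosure; the bandwidths, the equal kernel weights, and the function names are assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    # Pairwise Gaussian kernel values between rows of a (N x d) and b (M x d).
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mk_mmd2(x, y, bandwidths=(0.5, 1.0, 2.0), weights=None):
    """Biased estimate of the squared MMD under a weighted sum of Gaussian kernels."""
    if weights is None:
        # Equal weights over the kernels (an assumption).
        weights = np.full(len(bandwidths), 1.0 / len(bandwidths))
    total = 0.0
    for beta, bw in zip(weights, bandwidths):
        k_xx = gaussian_kernel(x, x, bw).mean()  # (1/N^2) sum over pairs within x
        k_yy = gaussian_kernel(y, y, bw).mean()  # (1/M^2) sum over pairs within y
        k_xy = gaussian_kernel(x, y, bw).mean()  # (1/(N*M)) sum over cross pairs
        total += beta * (k_xx + k_yy - 2.0 * k_xy)
    return float(total)
```

The estimate is zero when both sample sets are identical and grows as the two sets move apart, which is the property that makes it usable as a loss between teacher and student outputs.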

The logit loss is modified using the above MK-MMD function (denoted MMD in the following equation); that is, the logit loss is changed as follows.

In Equation (8) above, L_L denotes the modified logit loss, x^t(i) denotes the probability with which the teacher model classifies a sample into the i-th category, x^s(i) denotes the probability with which the student model classifies the sample into the i-th category, and m denotes the total number of categories.

Next, the feature loss is modified using the above MK-MMD function (denoted MMD in the following equation); that is, the feature loss is changed as follows.

In Equation (9), L_F denotes the modified feature loss, m denotes the total number of categories (the total number of samples x_i), [external character 4] denotes the output feature of sample x_i output by the student model, and [external character 5] denotes the output feature of sample x_i output by the teacher model.

The softmax loss of the student model is the same as the softmax loss of the student model shown in FIG. 1, and is expressed as follows.

In Equation (10) above, L_S denotes the softmax loss, m denotes the total number of categories (the total number of samples x_i), y^i denotes the label of x_i, and [external character 6] denotes the output feature of sample x_i output by the student model. As for the other parameters, W and b are ordinary softmax parameters: W is the coefficient matrix and b is the offset, and both are determined by training.

Here, λ_L, λ_F, and λ_S are all obtained by training. The student model is trained by iteratively reducing the total loss.
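Since the equation images are not reproduced in this text, the relationship between the modified losses and the total loss can only be written out as one reading consistent with the surrounding definitions, not as the equations as filed. The symbols f^s and f^t are assumed names for the student and teacher output features, and whether the MMD term is squared cannot be recovered from the text.

```latex
\begin{align}
L_L &= \mathrm{MMD}\left(x^{t}, x^{s}\right), \\
L_F &= \mathrm{MMD}\left(f^{s}, f^{t}\right), \\
L   &= \lambda_L L_L + \lambda_F L_F + \lambda_S L_S .
\end{align}
```

Under this reading, only the logit and feature terms change relative to the conventional method, while the softmax term L_S and the weighted-sum structure stay the same.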

FIG. 4 is a flowchart showing a data recognition method according to the embodiment of the present disclosure.

As shown in FIG. 4, in step 401, a trained teacher model is obtained in advance, or a teacher model is trained on the spot. The teacher model is trained with unmodified samples of the first input data as input data and the first output data as the output target. In step 402, the student model is trained with the modified samples of the second input data as input data and the same first output data as used for the teacher model as the output target. The second input data is data obtained by modifying the first input data, and the modification is performed by a signal processing method corresponding to the type of the first input data. The training in steps 401 and 402 is performed using a convolutional neural network. In step 403, data recognition is performed using the student model obtained in step 402.
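The flow of steps 401 to 403 can be walked through with a toy example. This is only a sketch under strong simplifying assumptions: a fixed random linear map stands in for the trained teacher, the student is a linear model rather than the convolutional neural network named in the text, the modification is pairwise averaging of input features, and the difference being reduced is a plain mean squared error rather than the MK-MMD-based loss. All names are hypothetical.

```python
import numpy as np

def train_student(x_modified, teacher_out, steps=500, lr=0.1):
    """Fit a linear student so that its output on x_modified approaches teacher_out."""
    n, d = x_modified.shape
    w = np.zeros((d, teacher_out.shape[1]))
    for _ in range(steps):
        residual = x_modified @ w - teacher_out   # student output minus teacher output
        w -= lr * x_modified.T @ residual / n     # gradient step on the squared difference
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))               # first input data (teacher input)
teacher_w = rng.normal(size=(8, 3))        # step 401: stand-in for a trained teacher
teacher_out = x @ teacher_w                # output that the student should reproduce
x_mod = x.reshape(64, 4, 2).mean(axis=2)   # second input data: modified first input data
student_w = train_student(x_mod, teacher_out)    # step 402: train the student
prediction = (x_mod @ student_w).argmax(axis=1)  # step 403: recognition with the student
```

The student never sees the original inputs; it only receives the modified samples and is pushed toward the teacher's outputs, which mirrors the division of inputs between the two models described above.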

In step 402 shown in FIG. 4 according to the embodiment of the present disclosure, a variation Δ is added to the input of the student model. This process may be expressed by the following Equation (12).

In Equation (12) above, S denotes the student model, T denotes the teacher model, x^i denotes a training sample, and Δ denotes the amount by which x^i is changed. The variation is produced by a signal processing method corresponding to the type of the input data, that is, of the sample. For example, when the training sample is an image, Δ may be the variation produced by downsampling the image. The types of input data include, but are not limited to, image data, audio data, and text data.

Adding the variation Δ to the training data makes the training sample domain of the student model differ from that of the teacher model. In the student model training method according to the embodiment of the present disclosure, a student model trained directly with the logit loss and the feature L2 loss of the conventional method shown in FIG. 1 cannot accurately recognize data or objects, so the method of the present disclosure cannot use the original logit loss and feature L2 loss as they are. Based on the data relationship between the original input samples and the modified data samples, the multi-kernel maximum mean discrepancy (MK-MMD), a domain similarity metric, may be used as the loss function.

Changing the inter-domain distance metric to MK-MMD allows the inter-domain distances of multiple adaptation layers to be measured simultaneously, and parameter learning for MK-MMD does not increase the training time of the deep neural network. A model trained by the student model learning method based on the MK-MMD loss function can achieve good classification performance in various types of tasks. The MK-MMD function used is expressed by the following Equation (13).

In Equation (13) above, N and M denote the numbers of samples in one category for the sample sets x and y, respectively. In the student model training method according to the embodiment of the present disclosure, the number of samples in one category for the student model is preferably the same as the number of samples in one category for the teacher model. That is, in each of the following equations, N and M preferably have the same value.

In Equation (14) above, L_L denotes the modified logit loss, x^t(i) denotes the probability with which the teacher model classifies a sample into the i-th category, x^s(i) denotes the probability with which the student model classifies the sample into the i-th category, and m denotes the total number of categories.

In Equation (15), L_F denotes the modified feature loss, m denotes the total number of categories (the total number of samples x_i), [external character 7] denotes the output feature of sample x_i output by the student model, and [external character 8] denotes the output feature of sample x_i output by the teacher model.

In Equation (16) above, L_S denotes the softmax loss, m denotes the total number of categories (the total number of samples x_i), y^i denotes the label of x_i, and [external character 9] denotes the output feature of sample x_i output by the student model. As for the other parameters, W and b are ordinary softmax parameters: W is the coefficient matrix and b is the offset, and both are determined by training.

FIG. 5 is a schematic diagram showing a data recognition device according to the embodiment of the present disclosure.

The data recognition device 500 shown in FIG. 5 includes at least one processor 501 that executes the data recognition method. The data recognition device 500 may further include a storage unit 503 and/or a communication unit 502. The storage unit 503 stores data to be recognized and/or data obtained by recognition, and the communication unit 502 receives data to be recognized and/or transmits data obtained by recognition.

In each embodiment of the present disclosure, the input data of the teacher model and the student model may include any of image data, audio data, or text data.

FIG. 6 is a diagram showing the configuration of a general-purpose device 700 capable of implementing the student model training method or the data recognition method according to the embodiment of the present disclosure. The general-purpose device 700 may be, for example, a computer system. The general-purpose device 700 is merely an example and does not limit the scope of use or the functions of the method and apparatus of the present disclosure. Nor should the general-purpose device 700 be construed as depending on any component, or combination of components, of the model training method and model training apparatus described above.

In FIG. 6, a central processing unit (CPU) 701 executes various processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 also stores, as necessary, the data that the CPU 701 needs to execute the various processes. The CPU 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output interface 705 is also connected to the bus 704.

An input unit 706 (including a keyboard, a mouse, and the like), an output unit 707 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like), the storage unit 708 (including, for example, a hard disk), and a communication unit 709 (including a network interface card such as a LAN card, a modem, and the like) are connected to the input/output interface 705. The communication unit 709 executes communication processing via a network such as the Internet. A drive unit 710 may also be connected to the input/output interface 705 as necessary. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive unit 710 as necessary, and a computer program read from it is installed in the storage unit 708 as necessary.

When the above processing is implemented by software, the program constituting the software is installed via a network such as the Internet or from a storage medium such as the removable medium 711.

Such storage media are not limited to the removable medium 711 shown in FIG. 6, which stores the program and is distributed separately from the device in order to provide the program to the user. The removable medium 711 includes, for example, a magnetic disk (including a floppy disk), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 702, a hard disk included in the storage unit 708, or the like, which stores the program and is provided to the user together with the device that contains it.

The present disclosure further provides a computer program product in which computer-readable program instructions are stored. When the program instructions are read and executed by a computer, the method of the present disclosure described above can be carried out. Accordingly, the various storage media described above that carry such program instructions also fall within the scope of the present disclosure.

The foregoing has described specific embodiments of the apparatus and/or method of the present disclosure in detail by means of block diagrams, flowcharts, and/or embodiments. Where these block diagrams, flowcharts, and/or embodiments include one or more functions and/or operations, each function and/or operation may be implemented individually and/or collectively by hardware, software, firmware, or any combination thereof. In one embodiment, portions of the subject matter described herein may be implemented by an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), or another integrated form. All or some aspects of the embodiments described herein may equivalently be implemented in an integrated circuit in the form of one or more computer programs executed by one or more computers (for example, one or more computer programs executed by one or more computer systems), in the form of one or more programs executed by one or more processors (for example, one or more programs executed by one or more microprocessors), in the form of firmware, or in the form of substantially any combination of these. In light of the content disclosed herein, designing circuits for the present disclosure and/or writing code for the software and/or firmware of the present disclosure is well within the ability of those skilled in the art.

The terms "include" and "have" indicate the presence of the features, elements, steps, or components described herein, but do not exclude the presence or addition of one or more other features, elements, steps, or components. Ordinal terms do not indicate the order of implementation or the level of importance of the features, elements, steps, or components to which they refer; they serve merely to distinguish those features, elements, steps, or components from one another.

With regard to the embodiments, including the examples described above, the following appendices are further disclosed, but the disclosure is not limited to these appendices.
(Appendix 1)
A method of training a student model corresponding to a teacher model, wherein
the teacher model has been trained with first input data as input data and first output data as an output target,
the method includes a step of training the student model with second input data as input data and the first output data as the output target, and
the second input data is data obtained by modifying the first input data.
(Appendix 2)
The method according to appendix 1, wherein the step of training the student model includes a step of training the student model by iteratively reducing the difference between the output of the teacher model and the output of the student model.
(Appendix 3)
The method according to appendix 2, wherein a difference function for calculating the difference is determined based on the data relationship between the first input data and the second input data.
(Appendix 4)
The method according to appendix 3, wherein the difference function is MK-MMD.
(Appendix 5)
The method according to appendix 3 or 4, wherein a logit loss function and a feature loss function are calculated using the difference function when training the student model.
(Appendix 6)
The method according to appendix 3 or 4, wherein a softmax loss function is calculated when training the student model.
(Appendix 7)
The method according to appendix 6, wherein the teacher model and the student model have the same softmax loss function.
(Appendix 8)
The method according to any one of appendices 1 to 4, wherein the first input data includes any of image data, audio data, or text data.
(Appendix 9)
The method according to appendix 5, wherein the modification is performed by a signal processing method corresponding to the type of the first input data.
(Appendix 10)
The method according to any one of appendices 1 to 4, wherein the number of samples of the first input data is the same as the number of samples of the second input data.
(Appendix 11)
The method according to any one of appendices 1 to 4, wherein the difference function for calculating the difference is determined using a plurality of trained weights, one for each of a plurality of loss functions.
(Appendix 12)
The method according to any one of appendices 1 to 4, wherein the student model is trained using a convolutional neural network.
(Appendix 13)
A data recognition method including a step of performing data recognition using a student model trained by the method according to any one of appendices 1 to 8.
(Appendix 14)
A data recognition device including at least one processor that executes the data recognition method according to appendix 13.
(Appendix 15)
A computer-readable storage medium storing program instructions that, when executed by a computer, carry out the method according to any one of appendices 1 to 13.

Although specific embodiments of the present disclosure have been described above, those skilled in the art may make various changes, improvements, or equivalents to the present disclosure within the spirit and scope of the appended claims. Such changes, improvements, and equivalents fall within the scope of protection of the present disclosure.

Claims

1. A method of training a student model corresponding to a teacher model, wherein the teacher model has been trained with first input data as input data and first output data as an output target, the method comprising:
training the student model with second input data as input data and the first output data as the output target,
wherein the second input data is data obtained by modifying the first input data.

2. The method according to claim 1, wherein the step of training the student model comprises training the student model by iteratively reducing the difference between the output of the teacher model and the output of the student model.

3. The method according to claim 2, wherein a difference function for calculating the difference is determined based on a data relationship between the first input data and the second input data.

4. The method according to claim 3, wherein the difference function is MK-MMD.

5. The method according to claim 3 or 4, wherein a logit loss function and a feature loss function are calculated using the difference function when training the student model.

6. The method according to claim 3 or 4, wherein a softmax loss function is calculated when training the student model.

7. The method according to claim 1, wherein the first input data includes any one of image data, audio data, and text data.

8. The method according to claim 5, wherein the modification is performed by a signal processing method corresponding to the type of the first input data.

9. A data recognition method comprising a step of performing data recognition using a student model trained by the method according to any one of claims 1 to 8.

10. A data recognition device comprising at least one processor that executes the data recognition method according to claim 9.