JP2021103519A

JP2021103519A - Method and system for normalizing smoothing feature of time space for behavior recognition

Info

Publication number: JP2021103519A
Application number: JP2020213564A
Authority: JP
Inventors: Jinhyung Kim; Kwangjin Oh; You Jin Kim; Dongyoon Wee; Soonmin Bae
Original assignee: A Holdings Corp; Naver Corp
Current assignee: A Holdings Corp; Naver Corp
Priority date: 2019-12-24
Filing date: 2020-12-23
Publication date: 2021-07-15
Anticipated expiration: 2040-12-23
Also published as: KR102235784B1; JP7159276B2

Abstract

To disclose a method and a system for normalizing smoothing features of a time space for behavior recognition.

SOLUTION: A feature normalizing method includes a step of calculating a low-frequency component from an input feature, a step of calculating a high-frequency component by using a residual between the input feature and the low-frequency component, and a step of adding noise to the low-frequency component.

SELECTED DRAWING: Figure 3

Description

The following description relates to a feature regularization technique for action recognition.

Human action recognition technology is applied in many fields, including security-related fields such as intelligent video surveillance systems, intelligent robots capable of interacting with humans, and intelligent home appliances.

For example, Patent Document 1 (registered October 20, 2015) discloses a technique that recognizes actions from video by extracting the data needed for action recognition using Kinect, then layering that data and learning features from it.

3D convolutional neural networks (3D ConvNets) are widely used in the field of action recognition. A 3D ConvNet extends a 2D convolutional neural network (2D ConvNet) with an additional dimension for processing spatiotemporal streams, and leverages knowledge learned in the image domain by inflating 2D kernels trained on large-scale image recognition datasets.

Patent Document 1: Korean Registered Patent No. 10-1563297

A simple and efficient regularization method can be provided to solve the overfitting problem of 3D convolutional neural networks (3D ConvNets).

Random mean scaling (RMS), which regularizes the internal representation by randomly changing the magnitude of the low-frequency component of a feature, can be applied.

Provided is a feature regularization method performed by a computer system, wherein the computer system includes at least one processor configured to execute computer-readable instructions contained in a memory, and the feature regularization method includes: obtaining, by the at least one processor, a low-frequency component from an input feature; obtaining, by the at least one processor, a high-frequency component using a residual between the input feature and the low-frequency component; and adding, by the at least one processor, noise to the low-frequency component.

According to one aspect, obtaining the low-frequency component may include separating the low-frequency component from the input feature using a low-pass filter.

According to another aspect, obtaining the low-frequency component may include separating the low-frequency component from the input feature using average pooling or a Gaussian filter.

According to yet another aspect, adding the noise may include adding the noise by applying random scaling to a local mean of the input feature.

According to yet another aspect, adding the noise may include randomly modulating the magnitude of the low-frequency component by multiplying it by a scalar sampled from a given probability distribution.

According to yet another aspect, the random mean scaling that adds the noise to the low-frequency component may be applied within a residual branch of a network model.

According to yet another aspect, the random mean scaling may be applied before at least one of a convolution layer, a batch normalization layer, and a nonlinear activation layer of the network model.

According to yet another aspect, when the network model is a network with a basic block structure, the random mean scaling may be applied before each of the batch normalization layers included in some stages of the network model.

According to yet another aspect, when the network model is a network with a bottleneck block structure, the random mean scaling may be applied before the last of the batch normalization layers included in some stages of the network model.

Provided is a computer program recorded on a non-transitory computer-readable recording medium to cause the computer system to perform the feature regularization method.

Provided is a non-transitory computer-readable recording medium on which a program for causing a computer to perform the feature regularization method is recorded.

Provided is a computer system including at least one processor configured to execute computer-readable instructions contained in a memory, wherein the at least one processor obtains a low-frequency component from an input feature, obtains a high-frequency component using a residual between the input feature and the low-frequency component, and adds noise to the low-frequency component.

According to embodiments of the present invention, a simple and efficient regularization method can be provided to solve the overfitting problem of 3D convolutional neural networks.

According to embodiments of the present invention, random mean scaling (RMS), which regularizes the internal representation by randomly changing the magnitude of the low-frequency component of a feature, can effectively solve the overfitting problem of 3D residual networks.

According to embodiments of the present invention, applying perturbation to the low-frequency component can improve the accuracy and regularization effect of a 3D convolutional neural network more than applying it to the entire feature or to the high-frequency component.

FIG. 1 is a block diagram illustrating an example of the internal configuration of a computer system according to an embodiment of the present invention.

FIG. 2 is a graph showing the accuracy of a 3D convolutional neural network with respect to scaling factors according to an embodiment of the present invention.

FIG. 3 is a diagram showing an example of a random mean scaling module for feature regularization according to an embodiment of the present invention.

FIG. 4 is a diagram showing an example of a random mean scaling module for feature regularization according to an embodiment of the present invention.

FIG. 5 is a diagram showing an example of a network with a basic block structure to which the random mean scaling module is added according to an embodiment of the present invention.

FIG. 6 is a diagram showing an example of a network with a bottleneck block structure to which the random mean scaling module is added according to an embodiment of the present invention.

FIG. 7 is a diagram showing experimental results on regularization for each position of the random mean scaling module according to an embodiment of the present invention.

FIG. 8 is a diagram showing an example of components that a processor of the computer system may include according to an embodiment of the present invention.

FIG. 9 is a flowchart showing an example of a feature regularization method that the computer system may perform according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

Embodiments of the present invention relate to a feature regularization technique for action recognition.

Embodiments, including those specifically disclosed herein, can provide a simple and efficient regularization method through random mean scaling (RMS), which regularizes the internal representation by randomly changing the low-frequency component of a feature, and can thereby improve both the accuracy of action recognition and regularization performance.

FIG. 1 is a block diagram showing an example of a computer system according to an embodiment of the present invention. For example, a feature regularization system according to embodiments of the present invention may be implemented by the computer system 100 shown in FIG. 1.

As shown in FIG. 1, the computer system 100 may include a memory 110, a processor 120, a communication interface 130, and an input/output interface 140 as components for performing the feature regularization method according to embodiments of the present invention.

The memory 110 is a computer-readable recording medium and may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive. Here, a permanent mass storage device such as a ROM or a disk drive may also be included in the computer system 100 as a separate permanent storage device distinct from the memory 110. An operating system and at least one program code may be stored in the memory 110. Such software components may be loaded into the memory 110 from a computer-readable recording medium separate from the memory 110. Such a separate computer-readable recording medium may include a floppy (registered trademark) drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, or the like. In other embodiments, software components may be loaded into the memory 110 through the communication interface 130 rather than from a computer-readable recording medium. For example, software components may be loaded into the memory 110 of the computer system 100 based on a computer program installed from files received over the network 160.

The processor 120 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processor 120 by the memory 110 or the communication interface 130. For example, the processor 120 may be configured to execute instructions received according to program code stored in a storage device such as the memory 110.

The communication interface 130 may provide a function that allows the computer system 100 to communicate with other devices over the network 160. As one example, requests, instructions, data, files, and the like that the processor 120 of the computer system 100 generates according to program code stored in a storage device such as the memory 110 may be transmitted to other devices over the network 160 under the control of the communication interface 130. Conversely, signals, instructions, data, files, and the like from other devices may be received by the computer system 100 via the network 160 and through the communication interface 130. Signals, instructions, data, and the like received through the communication interface 130 may be passed to the processor 120 or the memory 110, and files and the like may be stored in a recording medium (the permanent storage device described above) that the computer system 100 may further include.

The communication method is not limited, and may include not only communication methods that use a communication network the network 160 may include (for example, a mobile communication network, the wired Internet, the wireless Internet, or a broadcasting network) but also short-range wireless communication between devices. For example, the network 160 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), the Internet, and the like. The network 160 may also include any one or more network topologies, including but not limited to a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network.

The input/output interface 140 may be a means for interfacing with an input/output device 150. For example, the input device may include a device such as a microphone, a keyboard, a camera, or a mouse, and the output device may include a device such as a display or a speaker. As another example, the input/output interface 140 may be a means for interfacing with a device in which input and output functions are integrated, such as a touch screen. The input/output device 150 may be configured as a single device together with the computer system 100.

In other embodiments, the computer system 100 may include fewer or more components than those shown in FIG. 1. However, most conventional components need not be explicitly illustrated. For example, the computer system 100 may be implemented to include at least some of the input/output devices 150 described above, or may further include other components such as a transceiver, a camera, various sensors, or a database.

In video action recognition, deep neural networks (DNNs) often require 3D convolution filters, and overfitting often occurs because of the large number of parameters.

Overfitting, one of the problems facing DNNs, is particularly severe in video action recognition, where 3D convolutional neural networks (3D ConvNets) are the preferred approach for encoding spatio-temporal representations.

Although 3D ConvNets are capable of processing video streams, they frequently suffer from overfitting because of their large number of parameters.

Regularization in the input space and the feature space is a well-known approach to mitigating overfitting, but previous studies have overlooked the direction in which features should be perturbed (the direction to perturb).

On the assumption that this direction is a critical factor in regularization, it is necessary to analyze what information is important for the action recognition task.

FIG. 2 shows how the accuracy of a 3D convolutional neural network changes under various scaling factors that modulate the magnitudes of the low-frequency and high-frequency components of a feature. FIG. 2 indicates that action recognition performance is more sensitive to the high-frequency component than to the low-frequency component.
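The kind of probe described for FIG. 2 can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the experiment was implemented, so the 1D signal, the window size of 3, the clipped border windows, and the function names `local_mean_1d` and `rescale_bands` are all hypothetical.

```python
def local_mean_1d(x, window=3):
    """Box-filter mean around each index; windows are clipped at the borders (assumption)."""
    half = window // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def rescale_bands(x, low_gain, high_gain):
    """Scale the low-frequency part (local mean) and the high-frequency part (residual) separately."""
    x_bar = local_mean_1d(x)
    return [low_gain * m + high_gain * (v - m) for v, m in zip(x, x_bar)]


x = [0.0, 0.0, 1.0, 1.0]                   # a step edge
low_scaled = rescale_bands(x, 2.0, 1.0)    # modulate only the low-frequency component
high_scaled = rescale_bands(x, 1.0, 2.0)   # modulate only the high-frequency component
```

With both gains set to 1 the input is recovered, and with the high-frequency gain set to 0 only the smoothed signal remains, which makes it easy to feed each variant to a model and compare accuracy.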

Based on the observation that selective perturbation of features is an effective way to regularize a network, embodiments of the present invention apply random mean scaling (RMS) as a regularization method.

The RMS method according to embodiments of the present invention selectively adds perturbation by multiplying spatio-temporally smoothed features by a random scalar. To separate low-frequency information, a 3D average filter, the simplest low-pass filter in image processing (the 3D average pooling operation in most deep learning frameworks), may be used. As with other regularization methods, RMS is needed only during training and requires no additional work during inference.

First, action recognition and regularization are described.

Action recognition
A 3D convolutional neural network extends a 2D convolutional neural network (2D ConvNet) with an additional dimension for processing spatiotemporal streams, and is widely used in the field of action recognition. A 3D convolutional neural network may leverage knowledge learned in the image domain by inflating 2D kernels trained on large-scale image recognition datasets.

However, 3D convolutional neural networks have the drawback of a large number of parameters. To overcome this problem, a 3D kernel may be decomposed into a cascade of a 2D kernel and a 1D kernel. As one example, a 3D filter may be decomposed into 2D filters that can be applied simultaneously over the H-W, T-H, and T-W planes, and only 3D convolution filters may be used in the subsequent stage.
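To make the parameter saving concrete, the sketch below counts the weights of a full 3D kernel and of a 2D-plus-1D cascade. It assumes one particular factorization (a 1 x k x k spatial convolution followed by a t x 1 x 1 temporal convolution, with a hypothetical intermediate channel count `c_mid`); the text only states that a cascade of a 2D kernel and a 1D kernel may be used, so the layer shapes and function names here are illustrative.

```python
def conv3d_params(c_in, c_out, t, k):
    """Weights in a full t x k x k 3D convolution (bias ignored)."""
    return c_in * c_out * t * k * k


def factorized_params(c_in, c_mid, c_out, t, k):
    """Weights in a 1 x k x k spatial conv followed by a t x 1 x 1 temporal conv."""
    spatial = c_in * c_mid * k * k
    temporal = c_mid * c_out * t
    return spatial + temporal


full = conv3d_params(64, 64, 3, 3)            # 64 * 64 * 27
split = factorized_params(64, 64, 64, 3, 3)   # 64 * 64 * 9 + 64 * 64 * 3
```

For 64 input and output channels and a 3 x 3 x 3 kernel, the cascade uses fewer than half the weights of the full 3D kernel.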

Meanwhile, some studies propose multi-stage models that use information separately according to frequency. For example, a two-stream model consisting of a slow branch for static spatial features and a fast branch for dynamic motion features may be used. Octave convolution may also be used to process multi-frequency signals in a single-stream model.

Capturing global dependencies is another approach to improving action recognition models. For example, a non-local module may be added as a way to reduce the computation of a 3D convolutional neural network.

Regularization
Regularization is effective at resolving model overfitting, but it has been studied less actively in the video domain than in the image domain. In the image domain, regularization techniques such as data augmentation, weight decay, dropout, label smoothing, and batch normalization are mainly used.

Recent studies enable data augmentation in the input data space through methods such as random occlusion, interpolating two images, or transplanting an image patch onto another image.

Internal representations are another target of regularization in recent studies. Shake-Shake regularization regularizes multi-branch ResNets by adding randomly scaled branches to the forward and backward passes, but it cannot be applied to 2-branch ResNets. The stochastic depth technique (also known as RandomDrop) randomly switches the residual branch that is joined to the main branch. ShakeDrop, which combines these two techniques, adopts the switching mechanism of RandomDrop for Shake-Shake and is also compatible with 2-branch ResNets.

The random scaling method for spatio-temporally smoothed features according to embodiments of the present invention is described below.

Random mean scaling (RMS)
Random scaling is a simple regularization method that can be applied to any layer of a convolutional neural network.

Random scaling randomly modulates the magnitude of a feature by multiplying it by a scalar α sampled from a given probability distribution (for example, a Gaussian).
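Plain random scaling, as described above, can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the feature is a flat list, and the Gaussian parameters (`mean=1.0`, `std=0.5`) and the function name `random_scale` are assumptions chosen for the example.

```python
import random


def random_scale(feature, rng, mean=1.0, std=0.5):
    """Multiply every element of `feature` by one scalar alpha drawn from N(mean, std)."""
    alpha = rng.gauss(mean, std)
    return [alpha * v for v in feature], alpha


rng = random.Random(0)          # seeded only to make the example repeatable
x = [0.2, -1.0, 3.5, 0.0]
y, alpha = random_scale(x, rng)
```

A single α is shared by the whole feature here, so the perturbation changes its magnitude without changing the relative pattern of its elements.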

In this embodiment, random scaling is applied not to the feature directly but to the local mean of the feature (random mean scaling). Specifically, to reduce overfitting, the random mean scaling method separates the input into frequency components (a high-frequency component and a low-frequency component) and randomly adds noise to the low-frequency component.

In images, the high-frequency component generally contains edge information, which is important for classification. The regularization effect can therefore be improved by randomly scaling only the low-frequency component while leaving the high-frequency component, which matters for classification, unchanged.

A low-pass filter may be used to separate the low-frequency component from the input; for example, average pooling (a box filter) or a Gaussian filter may be used.
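The two low-pass filters named above differ only in their weights, which the following sketch shows on a 1D signal. It is an illustration under assumptions: the text works with 3D features, while this example uses one dimension for brevity, an odd kernel size, and border handling that renormalizes the weights inside the valid range. The function names are hypothetical.

```python
import math


def box_kernel(size):
    """Uniform weights, as used by average pooling."""
    return [1.0 / size] * size


def gaussian_kernel(size, sigma):
    """Normalized Gaussian weights centered on the middle tap."""
    center = (size - 1) / 2.0
    weights = [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(signal, kernel):
    """Apply an odd-sized kernel; border windows are clipped and renormalized (assumption)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc, norm = 0.0, 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
                norm += w
        out.append(acc / norm)
    return out


signal = [0.0, 0.0, 1.0, 0.0, 0.0]
low_box = smooth(signal, box_kernel(3))
low_gauss = smooth(signal, gaussian_kernel(3, sigma=1.0))
```

Both filters leave a constant signal unchanged and spread an isolated spike over its neighbors; the Gaussian keeps more weight on the center sample than the box filter does.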

The local mean may be calculated as in formula (1).

$$\bar{x}_i = \frac{1}{|W_i|} \sum_{j \in W_i} x_j \qquad (1)$$

Here, x denotes the input feature, and W_i denotes the 3D local window around the current index i.
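The local mean over a 3D window can be sketched directly from this definition: for each index i, average the input over the window W_i around it. The sketch assumes a cubic window of side 3 over a (T, H, W) volume stored as nested lists, and it clips windows at the borders, which differs from zero-padded average pooling in a deep learning framework; `local_mean_3d` is a hypothetical name.

```python
def local_mean_3d(x, window=3):
    """Mean of x over a cubic window around each (t, h, w) index; windows are clipped at borders."""
    T, H, W = len(x), len(x[0]), len(x[0][0])
    half = window // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for h in range(H):
            for w in range(W):
                total, count = 0.0, 0
                for tt in range(max(0, t - half), min(T, t + half + 1)):
                    for hh in range(max(0, h - half), min(H, h + half + 1)):
                        for ww in range(max(0, w - half), min(W, w + half + 1)):
                            total += x[tt][hh][ww]
                            count += 1
                out[t][h][w] = total / count  # count plays the role of |W_i|
    return out


# A 3 x 3 x 3 volume whose value at (t, h, w) is t + h + w.
x = [[[float(t + h + w) for w in range(3)] for h in range(3)] for t in range(3)]
x_bar = local_mean_3d(x)
```

In practice this loop would be replaced by the framework's 3D average pooling; the explicit version is shown only to make the window and the normalization visible.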

The input feature x is separated into the mean x̄ and the residual r, as in formula (2).

$$x = \bar{x} + r \qquad (2)$$

The modulated output y of random mean scaling may be defined as in formula (3).

$$y = \alpha \bar{x} + r \qquad (3)$$

The perturbation may be applied only during training. If the mean of the probability distribution of α is 1, then y = x during inference.
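The training and inference behavior can be sketched as below. This is a hedged illustration rather than the disclosed implementation: it uses a 1D feature for brevity (the text uses a 3D window), a Gaussian with mean 1 and an assumed standard deviation of 0.5 for α, clipped border windows, and hypothetical function names. The output follows the form y = α·x̄ + r with r = x − x̄.

```python
import random


def local_mean_1d(x, window=3):
    """Box-filter mean around each index; windows are clipped at the borders (assumption)."""
    half = window // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def random_mean_scaling(x, training, rng=None, std=0.5):
    """Training: y = alpha * x_bar + (x - x_bar). Inference: identity."""
    if not training:
        return list(x)
    rng = rng or random.Random()
    x_bar = local_mean_1d(x)
    alpha = rng.gauss(1.0, std)  # distribution mean is 1, so the expected output equals x
    return [alpha * m + (v - m) for v, m in zip(x, x_bar)]


x = [1.0, 2.0, 4.0, 8.0]
y_eval = random_mean_scaling(x, training=False)
y_train = random_mean_scaling(x, training=True, rng=random.Random(0))
```

Because only the local mean is scaled, the difference between the training output and the input is a multiple of x̄, and the residual r passes through untouched.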

The random mean scaling method described above can be applied at any level of the layer hierarchy, such as before convolution, batch normalization (BN), or nonlinear activation.

The position of random mean scaling may be chosen so as to improve network performance. Applying random scaling to the local mean can also improve performance more than applying it to the residual or to the entire input. The local mean is interpreted as the low-frequency component of the input, while the residual represents the remaining high-frequency component.

As described with reference to FIG. 2, modulating the high-frequency component significantly degrades performance, so random mean scaling is presumed to yield a model that relies more on the high-frequency component than on the low-frequency component.

FIG. 3 is a diagram showing the random mean scaling module 300 according to an embodiment of the present invention. As shown in FIG. 3, random mean scaling may be implemented as a network module that performs a few basic operations.

In FIG. 3, x denotes the input, x̄ the input mean, r the residual, and y the output; the two operator symbols in the figure denote the element-wise sum and the element-wise multiplication, respectively.

For convenience of explanation, FIG. 3 shows the random mean scaling module 300 in a structure in which the input mean x̄ and the residual r are separated. The local mean of formula (1) can be obtained with the 3D average pooling provided by most deep learning frameworks, and formula (3) is represented as shown in FIG. 3.

For a practical implementation, formula (3) may be rewritten in a simpler form using formula (2), giving formula (4).

$$y = x + \alpha' \bar{x} \qquad (4)$$

Here, α' = α − 1.
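The simplification can be checked step by step. The derivation below assumes that formula (2) has the form x = x̄ + r and formula (3) the form y = αx̄ + r, which is what the surrounding definitions of the mean, the residual, and α' imply.

```latex
\begin{aligned}
y &= \alpha \bar{x} + r                 && \text{formula (3)} \\
  &= \alpha \bar{x} + (x - \bar{x})     && \text{since } r = x - \bar{x} \text{ by formula (2)} \\
  &= x + (\alpha - 1)\,\bar{x} \\
  &= x + \alpha' \bar{x}                && \text{with } \alpha' = \alpha - 1
\end{aligned}
```

If α is drawn from a distribution with mean 1, then α' has mean 0, so the expected value of y is x. This is consistent with the statement that no operation is needed during inference.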

図４には、ランダム平均スケーリングモジュールを実現するための単純化された形態を示しており、数式（４）に該当するランダム平均スケーリングモジュール３００を示すブロック図を示している。 FIG. 4 shows a simplified form for realizing the random average scaling module, and shows a block diagram showing the random average scaling module 300 corresponding to the mathematical formula (4).

ランダム平均スケーリングモジュール３００は、上述したように、平均プーリングとスカラーとの乗算のように簡単な演算だけを必要とするため、パラメータがなく、訓練中には少量の追加演算だけを必要とし、さらに推論中に追加の演算を必要としない。 As mentioned above, the random average scaling module 300 requires only simple operations such as multiplication of average pooling and scalars, so it has no parameters and requires only a small amount of additional operations during training. No additional operations are required during inference.

ランダム平均スケーリングによって効果が向上するネットワークの一例として、３ＤＲｅｓＮｅｔ系列がある。例えば、ＳｌｏｗＯｎｌｙ、ＣＳＮ（ｃｈａｎｎｅｌ−ｓｅｐａｒａｔｅｄｃｏｎｖｏｌｕｔｉｏｎａｌｎｅｔｗｏｒｋ）などがこれに該当するが、ＳｌｏｗＯｎｌｙには、一般的な２ＤＲｅｓＮｅｔを３Ｄに確張した形態で３Ｄ畳み込みをｒｅｓ４とｒｅｓ５段階だけで使用するため時間軸による次元縮小がないという特徴があり、ＣＳＮは、ｌｉｇｈｔ−ｗｅｉｇｈｔ（パラメータが少ない）３ＤＲｅｓＮｅｔであると言える。 An example of a network whose effect is improved by random average scaling is the 3D ResNet series. For example, SlowOnly, CSN (channel-separated parameter network), etc. correspond to this, but in SlowOnly, 3D convolution is used only in res4 and res5 stages in a form in which general 2D ResNet is firmly set in 3D. It can be said that the CSN is a light-weight (fewer parameters) 3D ResNet because there is no dimension reduction due to the axis.

The random average scaling module 300 can be placed at any position within the residual branch, for example before a convolution layer or a ReLU layer.

As an example, FIG. 5 shows a network with a basic block structure. In this case, a random average scaling module 300 may be added before every BN (batch normalization) layer included in the res4 and res5 stages.

As another example, FIG. 6 shows a network with a bottleneck block structure. In this case, the random average scaling module 300 may be added before the last of the BN (batch normalization) layers included in the res4 and res5 stages.
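The two placement rules can be sketched as a small helper that rewrites a list of layer names. The layer sequences below are assumed, typical ResNet residual-branch layouts rather than the exact contents of FIGS. 5 and 6, the bottleneck rule is read here as "the last BN of each block" (an interpretation), and `insert_rms` is a hypothetical name.

```python
def insert_rms(layers, mode):
    """Return a copy of `layers` with "rms" placed before BN layers.

    mode="all_bn":  before every "bn" entry (basic block case).
    mode="last_bn": before only the last "bn" entry (bottleneck case).
    """
    bn_positions = [i for i, name in enumerate(layers) if name == "bn"]
    if mode == "all_bn":
        targets = set(bn_positions)
    elif mode == "last_bn":
        targets = set(bn_positions[-1:])
    else:
        raise ValueError(f"unknown mode: {mode}")
    out = []
    for i, name in enumerate(layers):
        if i in targets:
            out.append("rms")
        out.append(name)
    return out


# Assumed residual-branch layouts (typical ResNet blocks, for illustration only).
BASIC_BLOCK = ["conv", "bn", "relu", "conv", "bn"]
BOTTLENECK_BLOCK = ["conv", "bn", "relu", "conv", "bn", "relu", "conv", "bn"]

basic_with_rms = insert_rms(BASIC_BLOCK, "all_bn")
bottleneck_with_rms = insert_rms(BOTTLENECK_BLOCK, "last_bn")
```

With these assumed layouts, the basic block receives two modules per block while the bottleneck block receives one.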

The random average scaling module 300 can be applied at any level, such as before each convolution layer, before each BN (batch normalization) layer, or before each ReLU layer. A random average scaling module 300 placed before the last ReLU layer may be located before the summation of the main branch and the residual branch.

FIG. 7 shows experimental results for the normalization effect at each position of the random average scaling module 300.

FIG. 7 reports the results of adding the random average scaling module 300 at every possible position in SlowOnly-34. The difference in normalization effect across positions is not large, but among the configurations that use a single random average scaling module 300, the one placed before the first BN shows the highest accuracy.

Furthermore, for SlowOnly-50, which has a bottleneck block structure, the random average scaling module 300 improved network performance in every case, and the module placed before the last BN showed the highest accuracy. Placing the module before the first BN is also a reasonable choice because the performance difference is small. For computational efficiency, however, one may choose not to use multiple random average scaling modules 300 in the bottleneck block structure.

Thus, the random average scaling module 300 was found to outperform previous models in the bottleneck block structure as well as in the basic block structure.

FIG. 8 illustrates an example of components that a processor of a computer system may include in an embodiment of the present invention, and FIG. 9 is a flowchart illustrating an example of a feature normalization method that the computer system may perform in an embodiment of the present invention.

As shown in FIG. 8, the processor 120 may include a frequency separation unit 801 and a normalization unit 802. These components of the processor 120 may be representations of different functions performed by the processor 120 according to control instructions provided by at least one program code. For example, the frequency separation unit 801 may be used as a functional representation of the processor 120 operating to control the computer system 100 so that an input is separated by frequency characteristics.

The processor 120 and its components may perform steps 910 to 920 of the feature normalization method of FIG. 9. For example, the processor 120 and its components may be implemented to execute instructions according to the code of the operating system included in the memory 110 and the at least one program code described above. Here, the at least one program code may correspond to the code of a program implemented to process the feature normalization method.

The steps of the feature normalization method need not occur in the order shown in the figure; some steps may be omitted, and additional processes may be included.

The processor 120 may load, into the memory 110, program code stored in a program file for the feature normalization method. For example, the program file may be stored in a persistent storage device distinct from the memory 110, and the processor 120 may control the computer system 100 so that the program code is loaded from that program file into the memory 110 via a bus. In this case, the processor 120 and the frequency separation unit 801 and normalization unit 802 included in it may each be different functional representations of the processor 120 for executing the instructions of the corresponding portion of the program code loaded into the memory 110 to perform steps 910 to 920 below. To perform steps 910 to 920, the processor 120 and its components may directly process operations according to control instructions or may control the computer system 100.

The processor 120 may include the random average scaling module 300 described with reference to FIGS. 3 and 4.

The feature normalization method according to the present invention may include the following two steps.

In step 910, the frequency separation unit 801 may separate the input feature by frequency characteristics. The frequency separation unit 801 may obtain a low-frequency component from the input feature using a low-pass filter and obtain a high-frequency component using the residual between the input feature and the low-frequency component. As an example, the frequency separation unit 801 may apply average pooling to a feature map to separate it into a high-frequency component (the residual) and a low-frequency component (the local mean).

In step 920, the normalization unit 802 may normalize the input feature by randomly adding noise to the low-frequency component separated from it. To address overfitting in 3D residual networks, the normalization unit 802 may apply random average scaling (RMS): of the high-frequency and low-frequency components separated from the input feature, the high-frequency component is kept unchanged and the low-frequency component is randomly scaled. For example, the normalization unit 802 may randomly vary the low-frequency part of the feature using a uniform distribution, a normal distribution, or the like.
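Steps 910 and 920 can be sketched in their separated form on a one-dimensional feature, which keeps the two components visible as distinct values. This is an illustration under assumptions: a 1D moving average stands in for the low-pass filter, the function names are hypothetical, and the range parameter `beta` of the uniform distribution is chosen arbitrarily.

```python
import random


def moving_average(x, k=3):
    """Low-pass filter: stride-1 moving average with windows clipped at the ends."""
    r = k // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - r): i + r + 1]
        out.append(sum(window) / len(window))
    return out


def separate_frequencies(x, low_pass=moving_average):
    """Step 910: low = low_pass(x), high = x - low (the residual)."""
    low = low_pass(x)
    high = [xi - li for xi, li in zip(x, low)]
    return low, high


def perturb_low_frequency(low, high, rng, beta=0.5):
    """Step 920: scale only the low-frequency part by a random scalar.

    alpha ~ U(1 - beta, 1 + beta); `beta` is a hypothetical choice.
    Returns the perturbed feature and the sampled alpha.
    """
    alpha = rng.uniform(1.0 - beta, 1.0 + beta)
    return [alpha * li + hi for li, hi in zip(low, high)], alpha


x = [1.0, 4.0, 2.0, 8.0, 5.0]
low, high = separate_frequencies(x)
x_hat, alpha = perturb_low_frequency(low, high, random.Random(0))
```

Because the high-frequency residual enters the output unscaled, subtracting the scaled low-frequency part from `x_hat` recovers `high`. A normal distribution could be substituted by drawing the scalar with `rng.gauss` instead of `rng.uniform`.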

Accordingly, the processor 120 can address overfitting simply and efficiently by separating the low-frequency component of a feature map and randomly varying it to normalize the internal representation. Ultimately, normalization on smoothed features can improve performance by selectively treating the low-frequency and high-frequency components separately.

Thus, according to embodiments of the present invention, random average scaling (RMS), which normalizes the internal representation by randomly varying the magnitude of a feature's low-frequency component, can effectively address overfitting in 3D residual networks. In addition, according to embodiments of the present invention, applying the perturbation to the low-frequency component improves the accuracy and normalization effect of a 3D convolutional neural network more than applying it to the whole feature or to the high-frequency component.

The apparatus described above may be implemented by hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an ALU (arithmetic logic unit), a digital signal processor, a microcomputer, an FPGA (field programmable gate array), a PLU (programmable logic unit), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the OS. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but those skilled in the art will understand that a processing device may include multiple processing elements and/or multiple types of processing elements. For example, a processing device may include multiple processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over computer systems connected by a network and may be stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The medium may store a computer-executable program persistently or store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or a combination of several pieces of hardware; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by application stores that distribute applications, or by sites and servers that supply or distribute various other software.

Although the embodiments have been described above with reference to limited embodiments and drawings, those skilled in the art can make various modifications and variations based on the above description. For example, appropriate results can be achieved even if the described techniques are performed in an order different from the described method, and/or the described components such as systems, structures, devices, and circuits are coupled or combined in a form different from the described method, or are substituted or replaced by other components or equivalents.

Therefore, other embodiments that are equivalent to the claims also fall within the scope of the appended claims.

120: Processor
801: Frequency separation unit
802: Normalization unit

Claims

A feature normalization method performed by a computer system,
wherein the computer system includes at least one processor configured to execute computer-readable instructions contained in a memory,
the feature normalization method comprising:
a step of obtaining, by the at least one processor, a low-frequency component from an input feature;
a step of obtaining, by the at least one processor, a high-frequency component using the residual between the input feature and the low-frequency component; and
a step of adding, by the at least one processor, noise to the low-frequency component.

The feature normalization method according to claim 1, wherein the step of obtaining the low-frequency component separates the low-frequency component from the input feature using a low-pass filter.

The feature normalization method according to claim 1, wherein the step of obtaining the low-frequency component separates the low-frequency component from the input feature using average pooling or a Gaussian filter.

The feature normalization method according to claim 1, wherein the step of adding the noise adds the noise by applying random scaling to the local average of the input feature.

The feature normalization method according to claim 1, wherein the step of adding the noise includes a step of randomly modulating the magnitude of the low-frequency component by multiplying it by a scalar sampled from a given probability distribution.

The feature normalization method according to claim 1, wherein random average scaling, which adds the noise to the low-frequency component, is applied within a residual branch of a network model.

The feature normalization method according to claim 6, wherein the random average scaling is applied before at least one of a convolution layer, a batch normalization layer, and a nonlinear activation layer of the network model.

The feature normalization method according to claim 6, wherein, when the network model is a network with a basic block structure, the random average scaling is applied before each of the batch normalization layers included in some stages of the network model.

The feature normalization method according to claim 6, wherein, when the network model is a network with a bottleneck block structure, the random average scaling is applied before the last of the batch normalization layers included in some stages of the network model.

A computer program that causes the computer system to execute the feature normalization method according to any one of claims 1 to 9.

A non-transitory computer-readable recording medium in which a program for causing a computer to execute the feature normalization method according to any one of claims 1 to 9 is recorded.

A computer system comprising
at least one processor configured to execute computer-readable instructions contained in a memory,
wherein the at least one processor
obtains a low-frequency component from an input feature,
obtains a high-frequency component using the residual between the input feature and the low-frequency component, and
adds noise to the low-frequency component.

The computer system according to claim 12, wherein the at least one processor separates the low-frequency component from the input feature using a low-pass filter.

The computer system according to claim 12, wherein the at least one processor separates the low-frequency component from the input feature using average pooling or a Gaussian filter.

The computer system according to claim 12, wherein the at least one processor adds the noise by applying random scaling to the local average of the input feature.

The computer system according to claim 12, wherein the at least one processor randomly modulates the magnitude of the low-frequency component by multiplying it by a scalar sampled from a given probability distribution.

The computer system according to claim 12, wherein the at least one processor includes a random average scaling module that adds the noise to the low-frequency component, and the random average scaling module is located within a residual branch of a network model.

The computer system according to claim 17, wherein the random average scaling module is located before at least one of a convolution layer, a batch normalization layer, and a nonlinear activation layer.

The computer system according to claim 17, wherein, when the network model is a network with a basic block structure, a random average scaling module is located before each of the batch normalization layers included in some stages of the network model.

The computer system according to claim 17, wherein, when the network model is a network with a bottleneck block structure, the random average scaling module is located before the last of the batch normalization layers included in some stages of the network model.