CN111178115A

CN111178115A - Training method and system of object recognition network

Info

Publication number: CN111178115A
Application number: CN201811340221.8A
Authority: CN
Inventors: 袁培江; 史震云; 李建民; 任鹏远
Original assignee: Beijing Sensing Tech Co ltd
Current assignee: Beijing Sensing Tech Co ltd
Priority date: 2018-11-12
Filing date: 2018-11-12
Publication date: 2020-05-19
Anticipated expiration: 2038-11-12
Also published as: CN111178115B

Abstract

The training method comprises: inputting a plurality of sample images in a training set into a first recognition network for processing, and obtaining teacher features of a plurality of first views of each sample image; inputting each sample image into a second recognition network for processing, obtaining a first network loss of the second recognition network, and obtaining a second network loss and a third network loss of the second recognition network according to the teacher features; and training the second recognition network according to the first network loss, the second network loss and the third network loss to obtain a trained second recognition network. The second recognition network trained by this method can accurately recognize a target object, and the training method is readily extensible: additional first recognition networks can be added to train the second recognition network.

Description

Training method and system of object recognition network

Technical Field

The present disclosure relates to the field of machine learning technologies, and in particular, to a method and a system for training an object recognition network.

Background

Pedestrian re-identification (ReID) takes one or more images of a pedestrian as a query and searches a gallery for images of the same person captured by other cameras at other viewing angles.

Early ReID techniques relied on hand-crafted image features and had poor accuracy; accuracy improved greatly once deep learning was adopted. Deep-learning-based ReID is now the mainstream approach.

Therefore, a new training method is urgently needed to improve the recognition accuracy and the working efficiency of the deep learning-based ReID technology.

Disclosure of Invention

According to an aspect of the present disclosure, there is provided a training method of an object recognition network, the method including:

inputting a plurality of sample images in a training set into a first recognition network for processing, and obtaining teacher features of a plurality of first views of each sample image;

inputting each sample image into a second recognition network for processing, obtaining a first network loss of the second recognition network, and obtaining a second network loss and a third network loss of the second recognition network according to the teacher features;

training the second recognition network according to the first network loss, the second network loss and the third network loss to obtain a trained second recognition network,

wherein the second recognition network is used for recognizing the identity of a target object in an image to be processed, and the first recognition network is a teacher network used for training the second recognition network.

In one possible implementation, the first recognition network comprises a first view decomposition sub-network, a first image augmentation sub-network, a first convolution sub-network, a first pooling sub-network, and a first embedding sub-network,

wherein inputting the plurality of sample images in the training set into the first recognition network for processing, and obtaining teacher features of a plurality of first views of each sample image, comprises:

inputting a target sample image into the first view decomposition sub-network for view decomposition processing to obtain a plurality of first views of the target sample image, wherein the target sample image is any one of the plurality of sample images;

inputting the plurality of first views into the first image augmentation sub-network for augmentation processing to obtain a plurality of augmented second views;

performing convolution, pooling and embedding processing on the plurality of first views and the plurality of second views sequentially through the first convolution sub-network, the first pooling sub-network and the first embedding sub-network to obtain first feature vectors of the plurality of first views and second feature vectors of the plurality of second views of the target sample image;

and determining the teacher feature of the target sample image according to the first feature vectors and the second feature vectors.

In one possible implementation, the second recognition network comprises a feature extraction network, a feature map mapping network, and a feature vector mapping network, the feature extraction network comprises a second image augmenting subnetwork, a second convolution subnetwork, a second pooling subnetwork, a second embedding subnetwork, and a classification subnetwork,

wherein inputting each sample image into the second recognition network for processing, obtaining a first network loss of the second recognition network, and obtaining a second network loss and a third network loss of the second recognition network according to the teacher features, comprises:

inputting a target sample image into the second image augmentation sub-network for augmentation processing to obtain an augmented third view, wherein the target sample image is any one of the plurality of sample images;

inputting the third view into the second convolution sub-network for convolution processing to obtain a feature map of the target sample image;

inputting the feature map of the target sample image into the feature map mapping network for processing to obtain first predicted values of a plurality of third views of the target sample image;

determining the second network loss of the second recognition network according to the teacher features and the first predicted values of the plurality of sample images.

In one possible implementation, inputting each sample image into the second recognition network for processing, obtaining a first network loss of the second recognition network, and obtaining a second network loss and a third network loss of the second recognition network according to the teacher features, further includes:

performing pooling and embedding processing on the feature map of the target sample image sequentially through the second pooling sub-network and the second embedding sub-network to obtain a third feature vector of the target sample image;

inputting the third feature vector into the feature vector mapping network for processing to obtain second predicted values of a plurality of first views of the target sample image;

determining the third network loss of the second recognition network according to the teacher features and the second predicted values of the plurality of sample images.

In one possible implementation, the feature map mapping network includes a first view extraction sub-network, a third pooling sub-network, and a third embedding sub-network, the first view extraction sub-network being configured to map a feature map of the target sample image into feature maps of a plurality of first views of the target sample image;

the feature vector mapping network comprises a second view extraction sub-network and a mapping sub-network, the second view extraction sub-network being configured to map the third feature vector of the target sample image to feature vectors of a plurality of first views of the target sample image.

In one possible implementation, inputting each sample image into the second recognition network for processing, and obtaining a first network loss of the second recognition network, further includes:

inputting the third feature vector of the target sample image into the classification sub-network for processing to obtain the classification information of the target sample image;

determining a first network loss of the second recognition network based on the classification information and the labeling information of the plurality of sample images,

wherein the first network loss comprises a cross-entropy loss function.
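As an illustration of the first network loss, the following minimal sketch computes a softmax cross-entropy between classification scores and an annotated identity label. The function names `cross_entropy` and `batch_cross_entropy` and the averaging over the batch are assumptions made for illustration; the disclosure only states that the first network loss comprises a cross-entropy loss function.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample.

    logits: raw class scores, e.g. from a classification sub-network.
    label:  index of the annotated identity.
    """
    # Subtract the maximum score for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Negative log-probability assigned to the annotated class.
    return log_sum_exp - logits[label]


def batch_cross_entropy(batch_logits, labels):
    """Mean cross-entropy over a batch (averaging is an assumption)."""
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)
```

A sample whose annotated class receives the highest score yields a small loss, while a confident wrong prediction yields a large one.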

In a possible implementation manner, training the second recognition network according to the first network loss, the second network loss, and the third network loss to obtain a trained second recognition network includes:

determining a weighted sum of the first network loss, the second network loss, and the third network loss as an overall network loss of the second recognition network;

and performing back-propagation training on the second recognition network according to the overall network loss to obtain the trained second recognition network.
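The weighted sum described above can be sketched as follows. The function name `overall_loss` and the default weights of 1.0 are hypothetical; the disclosure does not specify the weight values in this passage.

```python
def overall_loss(first_loss, second_loss, third_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three network losses.

    weights: (w1, w2, w3), hypothetical hyperparameters that balance the
    classification loss against the two teacher-guided losses.
    """
    w1, w2, w3 = weights
    return w1 * first_loss + w2 * second_loss + w3 * third_loss
```

For example, `overall_loss(1.0, 2.0, 3.0, weights=(1.0, 0.5, 0.0))` disables the third loss and halves the contribution of the second.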

In one possible implementation, inputting the plurality of first views into the first image augmentation sub-network for augmentation processing to obtain a plurality of augmented second views includes:

performing at least one of random flipping processing, random occlusion processing, random matting processing, random color processing and random rotation processing on each of the plurality of first views to obtain the plurality of second views.

In one possible embodiment, the plurality of first views includes a whole-body view and a partial view.

According to another aspect of the present disclosure, a training system of an object recognition network is provided, the system comprising:

the first processing module is used for inputting a plurality of sample images in a training set into a first recognition network for processing, and obtaining teacher features of a plurality of first views of each sample image;

the second processing module is connected to the first processing module and used for inputting each sample image into a second recognition network for processing, obtaining a first network loss of the second recognition network, and obtaining a second network loss and a third network loss of the second recognition network according to the teacher features;

a training module connected to the second processing module for training the second recognition network according to the first network loss, the second network loss and the third network loss to obtain a trained second recognition network,

wherein the first processing module comprises:

the first decomposition sub-module is used for inputting a target sample image into a first view decomposition sub-network for view decomposition processing to obtain a plurality of first views of the target sample image, wherein the target sample image is any one of the plurality of sample images;

the first augmentation sub-module is connected to the first decomposition sub-module and used for performing augmentation processing on the plurality of first views through the first image augmentation sub-network to obtain a plurality of augmented second views;

the first processing sub-module is connected to the first augmentation sub-module and is used for performing convolution, pooling and embedding processing on the plurality of first views and the plurality of second views sequentially through a first convolution sub-network, a first pooling sub-network and a first embedding sub-network to obtain first feature vectors of the plurality of first views and second feature vectors of the plurality of second views of the target sample image;

and the first determining sub-module is connected to the first processing sub-module and is used for determining the teacher feature of the target sample image according to the first feature vectors and the second feature vectors.

wherein the second processing module comprises:

the second augmentation sub-module is used for inputting a target sample image into a second image augmentation sub-network for augmentation processing to obtain an augmented third view, wherein the target sample image is any one of the plurality of sample images;

the convolution sub-module is connected to the second augmentation sub-module and used for inputting the third view into a second convolution sub-network for convolution processing to obtain a feature map of the target sample image;

the feature map mapping sub-module is connected to the convolution sub-module and used for inputting the feature map of the target sample image into a feature map mapping network for processing to obtain first predicted values of a plurality of third views of the target sample image;

and the second determining sub-module is connected to the feature map mapping sub-module and is used for determining the second network loss of the second recognition network according to the teacher features and the first predicted values of the plurality of sample images.

In a possible implementation, the second processing module further includes:

the second processing submodule is connected with the convolution submodule and is used for performing pooling and embedding processing on the feature map of the target sample image through a second pooling sub-network and a second embedding sub-network in sequence to obtain a third feature vector of the target sample image;

the feature vector mapping sub-module is connected to the second processing sub-module and is used for inputting the third feature vector into the feature vector mapping network for processing to obtain second predicted values of the multiple first views of the target sample image;

and the third determining sub-module is connected to the feature vector mapping sub-module and is used for determining the third network loss of the second recognition network according to the teacher features and the second predicted values of the plurality of sample images.

In a possible implementation, the feature map mapping network includes a first view extraction sub-network, a third pooling sub-network, and a third embedding sub-network, and the feature map mapping sub-module is further configured to map the feature map of the target sample image into feature maps of a plurality of first views of the target sample image through the first view extraction sub-network;

the feature vector mapping sub-module is further configured to map a third feature vector of the target sample image to a feature vector of a plurality of first views of the target sample image through the second view extraction sub-network.

In a possible implementation, the second processing module further includes:

the classification sub-module is connected to the second processing sub-module and is used for inputting the third feature vector of the target sample image into the classification sub-network for processing to obtain the classification information of the target sample image;

a fourth determining sub-module, connected to the classifying sub-module, for determining the first network loss of the second recognition network according to the classification information and the labeling information of the plurality of sample images,

wherein the first network loss comprises a cross-entropy loss function.

In one possible embodiment, the training module includes:

an operation sub-module for determining a weighted sum of the first network loss, the second network loss and the third network loss as an overall network loss of the second recognition network;

and the training sub-module is connected to the operation sub-module and used for performing back-propagation training on the second recognition network according to the overall network loss to obtain the trained second recognition network.

According to another aspect of the present disclosure, there is provided a training apparatus for an object recognition network, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the training method of the object recognition network.

According to another aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, implement the above-mentioned training method of an object recognition network.

The second recognition network trained by this method can accurately recognize the target object, and the training method is readily extensible: additional first recognition networks can be added to train the second recognition network.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.

Fig. 1 shows a flow chart of a training method of an object recognition network according to an embodiment of the present disclosure.

FIG. 2 shows a schematic diagram of a single view training network according to an embodiment of the present disclosure.

Fig. 3 shows a flowchart of step S110 of the training method of the object recognition network according to an aspect of the present disclosure.

Figs. 4a-4g show schematic diagrams of multiple views according to an embodiment of the present disclosure.

Fig. 5 shows a schematic diagram of a second recognition network according to an embodiment of the present disclosure.

Fig. 6 shows a flowchart of step S120 of a training method of an object recognition network according to an embodiment of the present disclosure.

FIG. 7 illustrates a block diagram of a training system for an object recognition network in accordance with an embodiment of the present disclosure.

FIG. 8 illustrates a block diagram of a training system for an object recognition network in accordance with an embodiment of the present disclosure.

FIG. 9 shows a block diagram of a training apparatus for an object recognition network according to an embodiment of the present disclosure.

FIG. 10 shows a block diagram of a training apparatus for an object recognition network according to an embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

Referring to fig. 1, fig. 1 is a flowchart illustrating a training method of an object recognition network according to an embodiment of the present disclosure.

As shown in fig. 1, the method includes:

step S110, inputting a plurality of sample images in a training set into a first recognition network for processing, and obtaining teacher features of a plurality of first views of each sample image;

step S120, inputting each sample image into a second recognition network for processing, obtaining a first network loss of the second recognition network, and obtaining a second network loss and a third network loss of the second recognition network according to the teacher features;

step S130, training the second recognition network according to the first network loss, the second network loss and the third network loss to obtain a trained second recognition network.

According to the training method provided by the present disclosure, teacher features of a plurality of first views of each sample image are obtained from the first recognition network; the first network loss of the second recognition network is obtained from the sample images, and the second network loss and the third network loss are obtained according to the teacher features; and the second recognition network is trained according to the three network losses. The second recognition network trained by this method can accurately recognize the target object, and the training method is readily extensible: additional first recognition networks can be added to train the second recognition network.

For step S110:

in one possible embodiment, a training set may be preset, and the training set may include a plurality of sample images for training of the object recognition network.

In one possible embodiment, the first recognition network may be a teacher network (also referred to as a teacher model) in knowledge distillation (also referred to as knowledge transfer).

A system adopting knowledge distillation (hereinafter referred to as a knowledge distillation system) may further include a student network (also referred to as a student model). Through knowledge distillation, the teacher network converts hard knowledge into soft knowledge and transfers the soft knowledge to the student network to train it, thereby improving the accuracy and execution efficiency of the student network.

In one possible implementation, the first recognition network may be a deep learning network trained in advance, for example, a single-view training network.

Referring to fig. 2, fig. 2 is a schematic diagram of a single-view training network according to an embodiment of the present disclosure.

As shown in fig. 2, the first recognition network (single view training network) may include a first view decomposition subnetwork 401, a first image augmentation subnetwork 402, a first convolution subnetwork 403, a first pooling subnetwork 404, a first embedding subnetwork 405, a first classification subnetwork 406, and so on.

In one possible embodiment, the first recognition network may be a deep learning network built according to model data of a single-view training network, for example, after the network model is built, the network model of the first recognition network may be initialized by using the model data of the trained single-view training network, wherein the model data may include, for example, weight parameters of each sub-network.

In one possible implementation, the first recognition network may include all of the subnetworks of the single-view training network, or may include some of the subnetworks of the single-view training network, for example, the first recognition network may include other subnetworks of the single-view training network except the first classification subnetwork 406.

Referring to fig. 3, fig. 3 is a flowchart illustrating a step S110 of a training method of an object recognition network according to an embodiment of the present disclosure.

As shown in fig. 3, the step S110 of inputting the plurality of sample images in the training set into the first recognition network respectively for processing, and acquiring the teacher feature of the plurality of first views of each sample image may include:

step S111, inputting a target sample image into a first view decomposition sub-network for view decomposition processing, and obtaining a plurality of first views of the target sample image, wherein the target sample image is any one of the plurality of sample images;

step S112, inputting the plurality of first views into a first image augmentation subnetwork for augmentation processing to obtain a plurality of augmented second views;

step S113, the plurality of first views and the plurality of second views are sequentially convolved, pooled and embedded through a first convolution sub-network, a first pooling sub-network and a first embedding sub-network to obtain first feature vectors of the plurality of first views and second feature vectors of the plurality of second views of the target sample image;

step S114, determining the teacher feature of the target sample image according to the first feature vectors and the second feature vectors.

For step S111:

in one possible implementation, the first view decomposition subnetwork 401 may decompose an input target sample image into multiple views, and may further perform scaling and other processing on the target sample image.

Referring to figs. 4a-4g, figs. 4a-4g illustrate schematic diagrams of a plurality of views according to an embodiment of the present disclosure.

Fig. 4a may be a whole-body view of a target sample image, and figs. 4b-4g are a plurality of partial views of the target sample image.

The partial views of figs. 4b-4g may be obtained as follows:

The whole-body view shown in fig. 4a is divided into 4 parts from top to bottom; the first and second parts, the second and third parts, and the third and fourth parts may respectively be used as the partial views shown in figs. 4b-4d.

The whole-body view shown in fig. 4a is divided into 7 parts from top to bottom; the first to third parts may be used as the partial view shown in fig. 4e, the third to fifth parts as the partial view shown in fig. 4f, and the fifth to seventh parts as the partial view shown in fig. 4g.

It should be understood that other methods may be used to obtain different numbers of partial views, and the present disclosure is not limited to the manner in which the views are divided and the number of views.
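The division described for figs. 4a-4g can be sketched as row slicing. This is a minimal illustration, not the disclosed implementation: the function name `decompose_views` is hypothetical, the image is represented as a plain list of pixel rows, and the parts are assumed to be of equal height, which the disclosure does not state.

```python
def decompose_views(image_rows):
    """Return the whole-body view plus six partial views of an image.

    image_rows: list of pixel rows ordered from top to bottom.
    """
    h = len(image_rows)

    def band(parts, start, count):
        # Rows covering `count` consecutive parts, beginning at the 0-based
        # part `start`, when the image is split into `parts` equal parts.
        top = h * start // parts
        bottom = h * (start + count) // parts
        return image_rows[top:bottom]

    views = [image_rows]                          # fig. 4a: whole-body view
    views += [band(4, s, 2) for s in (0, 1, 2)]   # figs. 4b-4d: parts 1-2, 2-3, 3-4 of 4
    views += [band(7, s, 3) for s in (0, 2, 4)]   # figs. 4e-4g: parts 1-3, 3-5, 5-7 of 7
    return views
```

With this scheme adjacent partial views overlap, so each body region appears in more than one view.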

For step S112:

in a possible implementation, the inputting of the plurality of first views into the first image augmenting subnetwork for augmentation processing may include at least one of random flipping processing, random occlusion processing, random matting processing, random color processing, and random rotation processing of the input first views.
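Two of the listed augmentations, random flipping and random occlusion, can be sketched as follows. The function names, the probability `p`, the `max_fraction` bound and the constant fill value are all assumptions for illustration; views are represented as lists of pixel rows.

```python
import random


def random_flip(view, rng, p=0.5):
    """Mirror the view left-to-right with probability p; returns a copy."""
    if rng.random() < p:
        return [row[::-1] for row in view]
    return [row[:] for row in view]


def random_occlusion(view, rng, max_fraction=0.3, fill=0):
    """Overwrite one randomly placed rectangle with a constant fill value."""
    h, w = len(view), len(view[0])
    occ_h = rng.randint(1, max(1, int(h * max_fraction)))
    occ_w = rng.randint(1, max(1, int(w * max_fraction)))
    top = rng.randint(0, h - occ_h)
    left = rng.randint(0, w - occ_w)
    out = [row[:] for row in view]
    for r in range(top, top + occ_h):
        for c in range(left, left + occ_w):
            out[r][c] = fill
    return out


def augment(first_views, seed=None):
    """Produce one augmented second view per first view."""
    rng = random.Random(seed)
    return [random_occlusion(random_flip(v, rng), rng) for v in first_views]
```

The input views are left unmodified, so the first views and the second views can both be passed on to the convolution sub-network.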

For step S113:

In one possible embodiment, the first convolution sub-network 403 may employ various deep convolutional neural networks, including but not limited to ResNet, DenseNet and SqueezeNet; the first convolution sub-network 403 may process the input first view or second view to obtain the corresponding feature map.

In one possible implementation, the first pooling sub-network 404 may employ various global pooling methods, including but not limited to global average pooling (GAP) and global max pooling (GMP), to process the feature map output by the first convolution sub-network 403 and output a globally pooled feature vector.
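Global pooling reduces each channel of a feature map to a single value. The sketch below assumes a feature map stored as nested lists of shape channels x height x width; the function names are chosen for illustration.

```python
def global_average_pool(feature_map):
    """Average every channel of a C x H x W feature map to one value."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def global_max_pool(feature_map):
    """Take the maximum of every channel of a C x H x W feature map."""
    return [max(v for row in channel for v in row) for channel in feature_map]
```

Either function turns a feature map with C channels into a feature vector of length C, regardless of the spatial size of the input view.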

In one possible implementation, the first embedding sub-network 405 may include one or more fully-connected layers and BN (batch normalization) layers to reduce the dimension of the features output by the first pooling sub-network 404. For example, when the dimension of the feature output by the first pooling sub-network 404 is 2048, the first embedding sub-network 405 may output a feature vector of dimension 256, 512, 1024 or 2048.
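A fully-connected layer followed by batch normalization can be sketched as below. This is a simplified illustration under stated assumptions: the weights are supplied by the caller, the batch normalization uses batch statistics only and omits the learned scale and shift, and the function names are hypothetical.

```python
import math


def fully_connected(x, weight, bias):
    """Compute y = W x + b, where weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature over the batch to zero mean and unit variance."""
    n, dims = len(batch), len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        column = [sample[d] for sample in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            out[i][d] = (batch[i][d] - mean) / math.sqrt(var + eps)
    return out


def embed(batch, weight, bias):
    """Reduce the dimension of pooled features, then normalize them."""
    return batch_norm([fully_connected(x, weight, bias) for x in batch])
```

With a weight matrix of shape (256, 2048), for instance, this would map 2048-dimensional pooled features to 256-dimensional feature vectors.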

For step S114:

in one possible implementation, the teacher feature may be obtained by the following formula:

where j represents the index of the target sample image in the training set, and m represents the index of the first view of the target sample image. The teacher feature of the target sample image j in the first view m is determined from θ^m(view^m(I_j)), which represents the first feature vector of the target sample image j in the first view m, and θ^m(flip(view^m(I_j))), which represents the second feature vector of the target sample image j in the first view m; view^m(I_j) represents the first view m of the target sample image I_j.

In this embodiment, flip() may represent performing random flipping processing on the first view view^m(I_j) to obtain a second view.

Through step S114, the teacher feature obtained by the first recognition network can be enhanced, and the accuracy of the feature output by the first recognition network can be improved.

Teacher features of the target sample image under different first views can be obtained through first recognition networks. For example, a plurality of first recognition networks can operate simultaneously to obtain a plurality of teacher features of the target sample image: when there are 7 types of first views, 7 first recognition networks can process the 7 first views of the target sample image simultaneously to obtain the teacher features of the 7 first views.
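Running one first recognition network per first view can be sketched with a thread pool. Here each teacher is assumed to be a Python callable that maps a view to a feature vector; the function name `teacher_features_for_image` and the use of threads are illustrative choices, not part of the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def teacher_features_for_image(first_views, teachers):
    """Apply teacher m to first view m and return features in view order."""
    if not teachers or len(first_views) != len(teachers):
        raise ValueError("expected one teacher network per first view")
    with ThreadPoolExecutor(max_workers=len(teachers)) as pool:
        # Submit all views at once so the teachers can run concurrently.
        futures = [pool.submit(t, v) for t, v in zip(teachers, first_views)]
        return [f.result() for f in futures]
```

Adding a further first view then only requires appending one more view and one more teacher to the two lists.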

For step S120:

referring to fig. 5, fig. 5 is a schematic diagram illustrating a second recognition network according to an embodiment of the disclosure.

As shown in fig. 5, the second recognition network may include a feature extraction network 60, a feature map mapping network (FeatureMaps mapping Branch) 61, and a feature vector mapping network (registration mapping Branch) 62.

In one possible implementation, the feature extraction network 60 includes a second image augmenting sub-network 602, a second convolution sub-network 603, a second pooling sub-network 604, a second embedding sub-network 605, and a classification sub-network 606.

In the present embodiment, the feature extraction network 60 may be established according to the first recognition network, for example, the weight parameter of the first recognition network may be used as the initial weight parameter of the feature extraction network 60, so as to establish the feature extraction network 60.

In one possible embodiment, the feature extraction network 60 may be a student network in a knowledge distillation system.

In one possible implementation, the feature map mapping network 61 may include a first view extraction sub-network 611, a third pooling sub-network 612, and a third embedding sub-network 613.

In one possible implementation, the feature vector mapping network 62 may include a second view extraction subnetwork 621 and a mapping subnetwork 622.

Referring to fig. 6, fig. 6 is a flowchart illustrating a step S120 of a training method of an object recognition network according to an embodiment of the present disclosure.

As shown in fig. 6, the step S120 of inputting each sample image into a second identification network for processing, acquiring a first network loss of the second identification network, and acquiring a second network loss and a third network loss of the second identification network according to the teacher characteristic may include:

step S231, inputting a target sample image into the second image augmenting sub-network for augmentation processing to obtain an augmented third view, wherein the target sample image is any one of the plurality of sample images;

step S232, inputting the third view into a second convolution sub-network for convolution processing to obtain a feature map of the target sample image;

step S233, inputting the feature map of the target sample image into a feature map mapping network for processing, and obtaining first predicted values of a plurality of third views of the target sample image;

step S234, determining a second network loss of the second recognition network according to the teacher features and the first predicted values of the plurality of sample images.

In step S120, the second recognition network may process the whole-body view of each sample image to obtain the first network loss, the second network loss and the third network loss of the second recognition network. However, the present disclosure does not limit the object processed by the second recognition network: the second recognition network may also process other views of each sample image to obtain the above network losses. The second recognition network may further include a second view decomposition sub-network 601 (as shown in fig. 6), which may be the first view decomposition sub-network 401 of the first recognition network or a variation thereof; a whole-body view or another local view of each sample image may be obtained through the second view decomposition sub-network 601.

For step S231:

In one possible implementation, inputting the target sample image into the second image augmenting sub-network for augmentation processing may include performing at least one of random flipping, random occlusion, random cropping (matting), random color processing and random rotation on the input first view.
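
Two of the listed operations, random flipping and random occlusion, can be sketched as follows. This is an illustration only: the image is modeled as a list of pixel rows, and the function names, the rectangle size limits and the fill value are assumptions rather than details from the disclosure.

```python
import random

def random_flip(image, rng, p=0.5):
    """Mirror the image left-to-right with probability p; always returns a copy."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]

def random_occlusion(image, rng, max_h=2, max_w=2, fill=0):
    """Overwrite one randomly placed rectangle with a constant fill value."""
    height, width = len(image), len(image[0])
    h = rng.randint(1, min(max_h, height))
    w = rng.randint(1, min(max_w, width))
    top = rng.randint(0, height - h)
    left = rng.randint(0, width - w)
    out = [row[:] for row in image]
    for r in range(top, top + h):
        for c in range(left, left + w):
            out[r][c] = fill
    return out

def augment(image, rng):
    """Apply the two augmentations in sequence to produce an augmented view."""
    return random_occlusion(random_flip(image, rng), rng)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
third_view = augment(sample, rng)
```

Both helpers return copies, so the original sample image remains available for the other branches of the pipeline.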

For step S232:

When the second convolution sub-network 603 receives the third view output by the second image augmenting sub-network 602, the second convolution sub-network 603 may perform convolution processing on the third view, thereby extracting the feature map of the third view.

For step S233:

In one possible implementation, the first view extraction sub-network 611 may be configured to map the feature map of the target sample image to the feature maps of the plurality of first views of the target sample image.

In this embodiment, the first view extraction sub-network 611 may also implement a dimension reduction process for the feature map of the first view.

In this embodiment, the first view extraction sub-network 611 may include at least one convolution layer, a batch normalization (BN) layer, and a rectified linear unit (ReLU) layer.

In one possible implementation, the third pooling sub-network 612 may perform global pooling on the feature map of the first view.

In one possible embodiment, the third embedding sub-network 613 may include a fully connected (FC) layer and a BN layer; it receives the feature vector output from the third pooling sub-network 612 and performs embedding processing on the feature vector to obtain the first predicted value.
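
The data flow through the feature map mapping network can be sketched as follows. The sketch makes several simplifying assumptions that are not in the disclosure: the convolution is a 1x1 convolution, global pooling is average pooling, the BN layers are omitted, and the view names and all weights are toy values chosen for illustration.

```python
def conv1x1_relu(feature_map, weights):
    """1x1 convolution followed by ReLU: mixes channels at every spatial position."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    out = []
    for w_row in weights:  # one row of weights per output channel
        plane = [
            [max(0.0, sum(w_row[c] * feature_map[c][y][x] for c in range(channels)))
             for x in range(width)]
            for y in range(height)
        ]
        out.append(plane)
    return out

def global_average_pool(feature_map):
    """Reduce each channel plane to its mean value."""
    return [sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
            for plane in feature_map]

def fully_connected(vector, weights, bias):
    """Plain FC layer: one output per weight row."""
    return [sum(w * v for w, v in zip(w_row, vector)) + b
            for w_row, b in zip(weights, bias)]

def feature_map_branch(feature_map, view_params):
    """View extraction -> global pooling -> embedding, once per first view."""
    predictions = {}
    for view, p in view_params.items():
        view_map = conv1x1_relu(feature_map, p["conv"])
        pooled = global_average_pool(view_map)
        predictions[view] = fully_connected(pooled, p["fc_w"], p["fc_b"])
    return predictions

# Toy feature map: 2 channels of size 2x2.
feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
view_params = {
    "whole_body": {"conv": [[1.0, 0.0], [0.0, 1.0]], "fc_w": [[1.0, 1.0]], "fc_b": [0.0]},
    "upper_body": {"conv": [[1.0, -1.0]], "fc_w": [[2.0]], "fc_b": [1.0]},
}
first_predictions = feature_map_branch(feature_map, view_params)
```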

For step S234:

In one possible implementation, the first predicted value and the teacher feature may be fitted using a regression loss function to obtain the second network loss.

In this embodiment, the second network loss may be computed separately for each first view k. The loss is taken over the N target sample images in the training set and fits the first predicted value of target sample image i under first view k to the teacher feature of target sample image i under first view k.

In this way, the second network loss can be obtained from the teacher features output by the first recognition network and the first predicted values.
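
A minimal sketch of such a fit for one first view k. It assumes the regression loss is the squared Euclidean distance averaged over the samples; the text names only "a regression loss function", so this choice is an assumption, and `regression_loss`, `teacher_k` and `predicted_k` are illustrative names.

```python
def regression_loss(teacher_features, predictions):
    """Mean over samples of the squared Euclidean distance (assumed loss form)."""
    n = len(teacher_features)
    total = 0.0
    for teacher, predicted in zip(teacher_features, predictions):
        total += sum((t - p) ** 2 for t, p in zip(teacher, predicted))
    return total / n

# Two sample images, two-dimensional features, one first view k.
teacher_k = [[1.0, 0.0], [0.0, 2.0]]
predicted_k = [[0.5, 0.0], [0.0, 1.0]]
second_loss_k = regression_loss(teacher_k, predicted_k)
```

The loss is zero only when every predicted value matches its teacher feature exactly, which is what drives the second recognition network toward the teacher's per-view representation.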

Step S241, performing pooling and embedding processing on the feature map of the target sample image through a second pooling sub-network and a second embedding sub-network in sequence to obtain a third feature vector of the target sample image;

step S242, inputting the third feature vector into the feature vector mapping network for processing, and obtaining second predicted values of the multiple first views of the target sample image;

step S243, determining a third network loss of the second identification network according to the teacher characteristic and the second predicted value of the plurality of sample images.

For step S241:

After the second convolution sub-network 603 obtains the feature map of the third view, the second pooling sub-network 604 of the feature extraction network 60 performs pooling processing on the feature map; the pooled features are input into the second embedding sub-network 605 for embedding processing, which outputs the third feature vector of the target sample image.

In this embodiment, the second pooling sub-network 604 may perform global pooling of the feature map.

For step S242:

In one possible implementation, the second view extraction sub-network 621 is configured to map the third feature vector of the target sample image to the feature vectors of the plurality of first views of the target sample image.

In one possible implementation, the mapping sub-network 622 in the feature vector mapping network 62 may perform mapping processing on the feature vectors of the multiple first views to obtain the second predicted values of the multiple first views.

In this embodiment, mapping subnetwork 622 may include at least one FC layer and a BN layer.

In this embodiment, when the first view is a whole-body view, the second view extraction sub-network 621 may directly input the third feature vector into the mapping sub-network 622, and the mapping sub-network 622 maps the third feature vector, so as to obtain a second predicted value of the whole-body view.

In this embodiment, for the whole-body view, the second view extraction sub-network 621 may pass the third feature vector directly to the mapping sub-network 622. Alternatively, the second view extraction sub-network 621 may be omitted for the whole-body view, in which case the mapping sub-network 622 is connected to the second embedding sub-network 605 and obtains the third feature vector directly from it. For a local view, the second view extraction sub-network 621 may extract, from the third feature vector, the feature vector of that local view of the target sample image and input it into the mapping sub-network 622 as the feature vector of the first view.
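
The two cases can be sketched as follows. How a local view is extracted from the third feature vector is not specified in the text; this sketch assumes, purely for illustration, that a local view corresponds to a contiguous slice of the vector. The mapping sub-network is reduced to a single FC layer without BN, and all names and weights are hypothetical.

```python
def fully_connected(vector, weights, bias):
    """Plain FC layer: one output per weight row."""
    return [sum(w * v for w, v in zip(w_row, vector)) + b
            for w_row, b in zip(weights, bias)]

def extract_view_vector(third_vector, view_slice):
    """Whole-body view (view_slice is None): pass the vector through unchanged.
    Local view: take a slice of the vector (an assumed extraction rule)."""
    if view_slice is None:
        return list(third_vector)
    start, stop = view_slice
    return list(third_vector[start:stop])

def feature_vector_branch(third_vector, view_params):
    """View extraction followed by mapping, once per first view."""
    predictions = {}
    for view, p in view_params.items():
        view_vector = extract_view_vector(third_vector, p["slice"])
        predictions[view] = fully_connected(view_vector, p["fc_w"], p["fc_b"])
    return predictions

third_vector = [1.0, 2.0, 3.0, 4.0]
view_params = {
    "whole_body": {"slice": None, "fc_w": [[1.0, 1.0, 1.0, 1.0]], "fc_b": [0.0]},
    "head": {"slice": (0, 2), "fc_w": [[1.0, -1.0]], "fc_b": [0.5]},
}
second_predictions = feature_vector_branch(third_vector, view_params)
```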

For step S243, in one possible implementation, the fitting process may be performed on the second predicted value and the teacher characteristic by using a regression loss function, so as to obtain a third network loss.

In the present embodiment, the third network loss may likewise be computed separately for each first view k. The loss is taken over the N sample images in the training set and fits the second predicted value of target sample image i under first view k to the teacher feature of target sample image i under first view k.

In this way, the third network loss can be obtained from the second predicted values and the teacher features output by the first recognition network.
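
The text specifies only that both the second and the third network loss fit a predicted value to the teacher feature with a regression loss function. As an illustration, assuming a squared Euclidean distance averaged over the N sample images, the two losses for first view k would take parallel forms. The loss form and the symbols $L_{2}^{k}$, $L_{3}^{k}$, $t_i^{k}$, $p_i^{k}$ and $q_i^{k}$ are assumptions introduced here and do not appear in the source text.

```latex
L_{2}^{k} = \frac{1}{N}\sum_{i=1}^{N} \left\lVert t_i^{k} - p_i^{k} \right\rVert_2^2,
\qquad
L_{3}^{k} = \frac{1}{N}\sum_{i=1}^{N} \left\lVert t_i^{k} - q_i^{k} \right\rVert_2^2
```

Here $t_i^{k}$ stands for the teacher feature of sample image i under first view k, $p_i^{k}$ for the first predicted value output by the feature map mapping network, and $q_i^{k}$ for the second predicted value output by the feature vector mapping network. The two losses differ only in which branch of the second recognition network supplies the prediction.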

Step S251, inputting the third feature vector of the target sample image into a classification sub-network for processing, and obtaining the classification information of the target sample image;

step S252, determining a first network loss of the second identification network according to the classification information and the labeling information of the plurality of sample images, where the first network loss includes a cross entropy loss function.

For step S251:

In one possible implementation, the classification sub-network 606 may be implemented by an FC layer.

In this embodiment, the third feature vector may be classified by the FC layer, so that classification information of the target sample image is obtained.

In this embodiment, the classification information may be the probability that the target sample image belongs to the identity indicated by its annotation information. The annotation information of the target sample image may be pre-annotated identity information of the person corresponding to the target sample image, and may include, for example, an ID of the person or an annotation number of the target sample image.

For step S252:

In one possible implementation, the first network loss of the feature extraction network may be obtained by the following formula:

$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(Z_{i,y_i})}{\sum_{c=1}^{C}\exp(Z_{i,c})}$$

where $L_{cls}$ represents the first network loss, N represents the number of sample images in the training set, C represents the total number of classes in the data set to which the training set belongs, $Z_{i,c}$ represents the classification information output by the feature extraction network for target sample image i with respect to class c, $Z_{i,y_i}$ represents the classification information of target sample image i with respect to its labeled class, and $y_i$ represents the labeling information (for example, a labeled ID or serial number) of sample image i in the training set.
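
A minimal sketch of a softmax cross-entropy loss of this kind. It assumes that the classification outputs are unnormalized class scores (logits) and that the loss is averaged over the N samples; the function name and toy values are illustrative.

```python
import math

def cross_entropy_loss(logits, labels):
    """Average softmax cross-entropy over a batch.

    logits[i][c] is the score of sample i for class c; labels[i] is the
    index of the labeled class of sample i.
    """
    n = len(logits)
    total = 0.0
    for scores, label in zip(logits, labels):
        # Subtract the maximum before exponentiating for numerical stability.
        peak = max(scores)
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_sum_exp - scores[label]
    return total / n

# Two classes with equal scores: the loss is log(2).
uniform_loss = cross_entropy_loss([[0.0, 0.0]], [0])
# A very confident, correct prediction: the loss approaches 0.
confident_loss = cross_entropy_loss([[1000.0, 0.0]], [0])
```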

For step S130:

As shown in fig. 3, the step S130 of training the second recognition network according to the first network loss, the second network loss, and the third network loss to obtain a trained second recognition network may include:

step S311, determining a weighted sum of the first network loss, the second network loss, and the third network loss as the total network loss of the second recognition network;

and step S312, performing back-propagation training on the second recognition network according to the total network loss to obtain a trained second recognition network.

For step S311:

In one possible implementation, the total network loss $L_{total}$ may be obtained as a weighted sum of the first network loss $L_{cls}$, the second network loss of the k-th first view, and the third network loss of the k-th first view, taken over the K first views, where K is the number of first views and α and β are a first constant and a second constant, respectively, serving as weights.

In this embodiment, the first constant and the second constant may take values of 4 and 2, respectively, and in other embodiments, other values may be selected for the first constant and the second constant, which is not limited herein.
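
One way to form such a weighted sum is sketched below. The exact arrangement of the weights is an assumption: here the first network loss has unit weight, the first constant scales the sum of the per-view second network losses, and the second constant scales the sum of the per-view third network losses. Only the default values 4 and 2 come from the text.

```python
def total_loss(cls_loss, second_losses, third_losses, alpha=4.0, beta=2.0):
    """Weighted sum of the three loss groups (assumed arrangement of weights).

    second_losses and third_losses each hold one value per first view.
    """
    return cls_loss + alpha * sum(second_losses) + beta * sum(third_losses)

# Toy values for K = 2 first views.
loss_value = total_loss(1.0, [0.5, 0.25], [0.1, 0.1])
```

With these toy values the regression terms dominate the total, which illustrates how the constants control the balance between fitting the labels and fitting the teacher features.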

For step S312:

The weight parameters of the second recognition network may be updated through a gradient descent algorithm such as stochastic gradient descent (SGD) or Adam, thereby realizing back-propagation training of the second recognition network.
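
The basic SGD update rule can be sketched as follows. The sketch applies it to a toy quadratic loss whose gradient is written by hand, standing in for the gradients that back-propagation of the total network loss would supply; the learning rate and all names are illustrative, and Adam would use a different update rule.

```python
def sgd_step(weights, gradients, learning_rate):
    """One plain SGD update: move each weight against its gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

def toy_loss_gradient(weights, target):
    """Gradient of the toy loss sum((w - t)^2); a stand-in for back-propagation."""
    return [2.0 * (w - t) for w, t in zip(weights, target)]

weights = [0.0, 0.0]
target = [3.0, -1.0]
for _ in range(100):
    gradients = toy_loss_gradient(weights, target)
    weights = sgd_step(weights, gradients, learning_rate=0.1)
```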

The trained second recognition network can be used for pedestrian re-identification and can be applied in fields such as security, human-computer interaction and unmanned retail (unmanned stores) to recognize objects such as persons in images or videos.

With the training method of the present disclosure, an efficient second recognition network can be obtained that offers both high accuracy and high efficiency in object recognition.

Referring to fig. 7, fig. 7 illustrates a training system of an object recognition network according to an embodiment of the present disclosure.

As shown in fig. 7, the system may include:

the first processing module 10 is configured to input a plurality of sample images in a training set into a first recognition network respectively for processing, and obtain teacher characteristics of a plurality of first views of each sample image;

the second processing module 20 is connected to the first processing module 10, and is configured to input each sample image into a second identification network respectively for processing, obtain a first network loss of the second identification network, and obtain a second network loss and a third network loss of the second identification network according to the teacher characteristic;

a training module 30, connected to the second processing module 20, for training the second recognition network according to the first network loss, the second network loss and the third network loss to obtain a trained second recognition network.

It should be noted that the training system of the object recognition network is the system counterpart of the training method of the object recognition network; for details, refer to the foregoing description of the method, which is not repeated here.

Through the cooperation of its modules, the training system of the object recognition network yields a second recognition network that can accurately recognize the target object. The training method of the present disclosure also has good extensibility: additional first recognition networks can be added to train the second recognition network.
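
The connection between the three modules can be sketched as a simple pipeline. This is a structural illustration only: each module is modeled as a callable, the class name `TrainingSystem` and the stub functions are hypothetical, and the stubs return fixed toy values instead of running real networks.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TrainingSystem:
    first_processing: Callable   # sample images -> teacher features
    second_processing: Callable  # (sample images, teacher features) -> three losses
    training: Callable           # three losses -> training result

    def run(self, sample_images):
        """Pass data through the modules in the order they are connected."""
        teacher_features = self.first_processing(sample_images)
        losses = self.second_processing(sample_images, teacher_features)
        return self.training(*losses)

# Stub modules standing in for modules 10, 20 and 30.
def stub_first(images):
    return {index: [float(index)] for index in range(len(images))}

def stub_second(images, teacher_features):
    return (1.0, 0.5, 0.25)  # first, second and third network loss (toy values)

def stub_training(first_loss, second_loss, third_loss):
    return first_loss + 4.0 * second_loss + 2.0 * third_loss

system = TrainingSystem(stub_first, stub_second, stub_training)
result = system.run(["image_a", "image_b"])
```

Because the first processing module is just one stage of the pipeline, it could be swapped for a version that wraps several first recognition networks without changing the other two modules.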

Referring to fig. 8, fig. 8 illustrates a training system of an object recognition network according to an embodiment of the present disclosure.

In a possible implementation, as shown in fig. 8, the first processing module 10 may include:

a first decomposition sub-module 101, configured to input a target sample image into a first view decomposition sub-network for view decomposition processing, so as to obtain a plurality of first views of the target sample image, where the target sample image is any one of the plurality of sample images;

a first augmentation sub-module 102, connected to the first decomposition sub-module 101, configured to perform augmentation processing on the multiple first views through the first image augmenting sub-network to obtain multiple augmented second views;

a first processing sub-module 103, connected to the first augmentation sub-module 102, configured to sequentially perform convolution, pooling and embedding on the multiple first views and the multiple second views through a first convolution sub-network, a first pooling sub-network and a first embedding sub-network, so as to obtain first feature vectors of the multiple first views and second feature vectors of the multiple second views of the target sample image;

and a first determining sub-module 104, connected to the first processing sub-module 103, configured to determine the teacher feature of the target sample image according to the first feature vector and the second feature vector.

In a possible implementation, the second processing module 20 includes:

a second augmentation sub-module 201, configured to input a target sample image into the second image augmenting sub-network for augmentation processing, so as to obtain an augmented third view, where the target sample image is any one of the multiple sample images;

a convolution sub-module 202, connected to the second augmentation sub-module 201, configured to input the third view into a second convolution sub-network for convolution processing, so as to obtain a feature map of the target sample image;

a feature map mapping sub-module 203, connected to the convolution sub-module 202, configured to input the feature map of the target sample image into a feature map mapping network for processing, so as to obtain first predicted values of a plurality of third views of the target sample image;

and a second determining sub-module 204, connected to the feature map mapping sub-module 203, for determining a second network loss of the second identification network according to the teacher feature of the plurality of sample images and the first predicted value.

a second processing sub-module 211, connected to the convolution sub-module 202, configured to perform pooling and embedding processing on the feature map of the target sample image sequentially through a second pooling sub-network and a second embedding sub-network, so as to obtain a third feature vector of the target sample image;

a feature vector mapping sub-module 212, connected to the second processing sub-module 211, configured to input the third feature vector into the feature vector mapping network for processing, so as to obtain second predicted values of the multiple first views of the target sample image;

a third determining sub-module 213, connected to the feature vector mapping sub-module 212, configured to determine a third network loss of the second recognition network according to the teacher features and the second predicted values of the plurality of sample images.

a classification sub-module 221, connected to the second processing sub-module 211, configured to input the third feature vector of the target sample image into a classification sub-network for processing, so as to obtain classification information of the target sample image;

a fourth determining sub-module 222, connected to the classifying sub-module 221, configured to determine a first network loss of the second identification network according to the classification information and the labeling information of the plurality of sample images, where the first network loss includes a cross entropy loss function.

In a possible implementation, the feature vector mapping network comprises a second view extraction sub-network and a mapping sub-network, and the feature vector mapping sub-module is further configured to map a third feature vector of the target sample image to a feature vector of the plurality of first views of the target sample image through the second view extraction sub-network.

In a possible embodiment, the training module 30 includes:

an operation sub-module 301, configured to determine a weighted sum of the first network loss, the second network loss, and the third network loss as the total network loss of the second recognition network;

and a training sub-module 302, connected to the operation sub-module 301, configured to perform back-propagation training on the second recognition network according to the total network loss to obtain a trained second recognition network.

Referring to fig. 9, fig. 9 is a block diagram illustrating a training apparatus for an object recognition network according to an embodiment of the present disclosure. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 9, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, images, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.

The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as a display and keypad of the device 800. The sensor assembly 814 may also detect a change in the position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the device 800 to perform the above-described methods.

Referring to fig. 10, fig. 10 is a block diagram illustrating a training apparatus for an object recognition network according to an embodiment of the present disclosure. For example, the apparatus 1900 may be provided as a server.

Referring to FIG. 10, the device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.

The device 1900 may also include a power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the apparatus 1900 to perform the above-described methods.

The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device, such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry that can execute the computer-readable program instructions implements aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other devices to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
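As a non-limiting illustration, the way the second recognition network may be trained on three losses can be sketched as follows. Only the cross-entropy form of the first network loss is taken from the disclosure. The mean-squared-error form of the second and third network losses, their pairing with the feature map mapping network and the feature vector mapping network, the weighted-sum combination, and every function and parameter name (`cross_entropy`, `mean_squared_error`, `total_loss`, `w1`, `w2`, `w3`) are assumptions made for this sketch only.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample computed from raw class scores.

    Uses the log-sum-exp form for numerical stability.
    """
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def mean_squared_error(student, teacher):
    """Mean squared difference between two feature vectors of equal length.

    Assumption: the disclosure does not state the distance in this section.
    """
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def total_loss(logits, label, map_feats, vec_feats, teacher_feats,
               w1=1.0, w2=1.0, w3=1.0):
    """Hypothetical weighted sum of the three network losses.

    map_feats:     per-view features assumed to come from the feature map
                   mapping network
    vec_feats:     per-view features assumed to come from the feature vector
                   mapping network
    teacher_feats: per-view teacher characteristics from the first
                   recognition network
    """
    n_views = len(teacher_feats)
    first = cross_entropy(logits, label)
    second = sum(mean_squared_error(s, t)
                 for s, t in zip(map_feats, teacher_feats)) / n_views
    third = sum(mean_squared_error(s, t)
                for s, t in zip(vec_feats, teacher_feats)) / n_views
    return w1 * first + w2 * second + w3 * third


if __name__ == "__main__":
    # Two classes, one view, two-dimensional features (toy values).
    print(total_loss([0.0, 0.0], 0, [[1.0, 2.0]], [[1.0, 2.0]], [[1.0, 4.0]]))
```

In this sketch, a student feature that already matches its teacher characteristic contributes nothing to the second or third loss, so only the classification term remains; the weights are placeholders that would need to be chosen for a particular embodiment.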

Claims

1. A method of training an object recognition network, the method comprising:

2. The method of claim 1, wherein the first recognition network comprises a first view resolution sub-network, a first image augmentation sub-network, a first convolution sub-network, a first pooling sub-network, and a first embedding sub-network,

3. The method of claim 1, wherein the second recognition network comprises a feature extraction network, a feature map mapping network, and a feature vector mapping network, the feature extraction network comprising a second image augmentation sub-network, a second convolution sub-network, a second pooling sub-network, a second embedding sub-network, and a classification sub-network,

4. The method of claim 3, wherein inputting each sample image into the second recognition network for processing, to obtain the first network loss of the second recognition network and to obtain the second network loss and the third network loss of the second recognition network according to the teacher characteristics, further comprises:

5. The method of claim 4,

the feature map mapping network comprises a first view extraction subnetwork, a third pooling subnetwork, and a third embedding subnetwork, the first view extraction subnetwork for mapping a feature map of the target sample image to feature maps of a plurality of first views of the target sample image;

6. The method of claim 4, wherein inputting each sample image into the second recognition network for processing to obtain the first network loss of the second recognition network further comprises:

wherein the first network loss comprises a cross-entropy loss function.

7. The method of claim 1, wherein training the second recognition network according to the first network loss, the second network loss, and the third network loss to obtain a trained second recognition network comprises:

8. The method of claim 2, wherein inputting the plurality of first views into the first image augmentation sub-network for augmentation processing to obtain a plurality of augmented second views comprises:

9. The method of claim 2, wherein the plurality of first views comprise a whole-body view and a local view.

10. A system for training an object recognition network, the system comprising:

11. The system of claim 10, wherein the first recognition network comprises a first view resolution subnetwork, a first image augmentation subnetwork, a first convolution subnetwork, a first pooling subnetwork, and a first embedding subnetwork,

wherein the first processing module comprises:

12. The system of claim 10, wherein the second recognition network comprises a feature extraction network, a feature map mapping network, and a feature vector mapping network, the feature extraction network comprising a second image augmentation subnetwork, a second convolution subnetwork, a second pooling subnetwork, a second embedding subnetwork, and a classification subnetwork,

wherein the second processing module comprises:

13. The system of claim 12, wherein the second processing module further comprises:

14. The system of claim 13,

the feature map mapping sub-module is further configured to map a feature map of the target sample image through the first view extraction sub-network into feature maps of a plurality of first views of the target sample image;

15. The system of claim 13, wherein the second processing module further comprises:

wherein the first network loss comprises a cross-entropy loss function.

16. The system of claim 10, wherein the training module comprises: