CN112329826B - Training method of image recognition model, image recognition method and device - Google Patents
- Publication number
- CN112329826B CN112329826B CN202011150035.5A CN202011150035A CN112329826B CN 112329826 B CN112329826 B CN 112329826B CN 202011150035 A CN202011150035 A CN 202011150035A CN 112329826 B CN112329826 B CN 112329826B
- Authority
- CN
- China
- Prior art keywords
- image
- sample
- sample image
- similarity
- recognition model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The application discloses a training method for an image recognition model, an image recognition method, an image recognition device, equipment, and a storage medium, and belongs to the field of image processing. In the embodiment of the application, the server trains the image recognition model with three images: a reference sample image, a positive sample image, and a negative sample image. Each of the three images contains a sample object; the reference sample image and the positive sample image correspond to the same sample object, while the reference sample image and the negative sample image correspond to different sample objects. During training, the server drives the model to make the similarity between the reference sample image and the positive sample image as high as possible, and the similarity between the reference sample image and the negative sample image as low as possible, so that in subsequent use it can accurately determine whether a captured image and a stored image contain the same face.
Description
Technical Field
The present application relates to the field of image processing, and in particular to a training method for an image recognition model, an image recognition method, an image recognition device, a server, and a storage medium.
Background
With the development of computer technology, artificial intelligence has advanced rapidly, and image recognition, as a branch of artificial intelligence, is applied ever more widely. For example, in face recognition scenarios, an image recognition model can recognize an image containing a face and obtain the identity information corresponding to that face.
The recognition accuracy of the image recognition model in the related art is limited by the number of sample images, which is often small, resulting in a poor training effect for the image recognition model.
Disclosure of Invention
The embodiments of the application provide a training method for an image recognition model, an image recognition method, an image recognition device, a server, and a storage medium, which can improve the training effect of the image recognition model. The technical scheme is as follows:
In one aspect, a training method of an image recognition model is provided, the method comprising:
In one iteration process, a reference sample image, a positive sample image and a negative sample image are acquired, wherein the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects;
Inputting the reference sample image, the positive sample image and the negative sample image into an image recognition model, and determining a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image through the image recognition model;
and, in response to the difference between the first similarity and the second similarity meeting a target condition, taking the image recognition model as the trained image recognition model.
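The three steps above can be sketched as a training loop. The patent does not fix the exact form of the target condition, so the version below assumes one plausible reading — the first similarity must exceed the second by a threshold — and the `ToyModel` stand-in is entirely hypothetical:

```python
class ToyModel:
    """Hypothetical stand-in for the image recognition model."""
    def __init__(self):
        self.gap = 0.0  # proxy for how well the model separates the pairs

    def similarities(self, reference, positive, negative):
        # Returns (first_similarity, second_similarity) for the triplet.
        return 0.5 + self.gap, 0.5 - self.gap

    def adjust_parameters(self, first_sim, second_sim):
        self.gap += 0.1  # each adjustment widens the similarity gap

def train(model, batches, threshold=0.5):
    for reference, positive, negative in batches:
        first_sim, second_sim = model.similarities(reference, positive, negative)
        if first_sim - second_sim >= threshold:
            return model  # target condition met: use as the trained model
        model.adjust_parameters(first_sim, second_sim)
    return model

model = train(ToyModel(), [(None, None, None)] * 10)
```

In this sketch the loop stops as soon as the similarity gap satisfies the assumed condition, mirroring the "in response to ... meeting a target condition" step.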
In one possible implementation, the determining, by the image recognition model, a first similarity between the reference sample image and the positive sample image, and a second similarity between the reference sample image and the negative sample image includes:
Extracting reference characteristic information of the reference sample image, positive sample characteristic information of the positive sample image and negative sample characteristic information of the negative sample image through the image recognition model;
Obtaining the first similarity according to the reference characteristic information and the positive sample characteristic information;
and obtaining the second similarity according to the reference characteristic information and the negative sample characteristic information.
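The patent does not commit to a particular similarity measure over the feature information; cosine similarity is a common choice and is sketched below with made-up feature vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature vectors, standing in for the model's extracted features.
reference_feat = np.array([1.0, 0.0, 1.0])
positive_feat = np.array([0.9, 0.1, 1.1])    # same sample object: expect high similarity
negative_feat = np.array([-1.0, 1.0, 0.0])   # different sample object: expect low similarity

first_similarity = cosine_similarity(reference_feat, positive_feat)
second_similarity = cosine_similarity(reference_feat, negative_feat)
```

With these toy vectors the first similarity comes out close to 1 and the second is negative, which is the separation the training objective pushes toward.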
In one possible embodiment, before the acquiring the reference sample image, the positive sample image, and the negative sample image, the method further includes:
Extracting features of a plurality of first sample images in the first sample image set to obtain a plurality of first sample image features;
Determining the similarity between the plurality of first sample images in the first sample image set according to the plurality of first sample image features;
determining two first sample images with the lowest similarity as a first image and a second image;
Extracting features of a plurality of second sample images in the second sample image set to obtain a plurality of second sample image features;
determining, from the second sample image set, a third image having the lowest similarity to the first image and a fourth image having the lowest similarity to the second image, according to the plurality of second sample image features and the first sample image features of the first image and the second image;
in response to a first similarity between the first image and the third image being less than a second similarity between the second image and the fourth image, determining the first image as the reference sample image, determining the second image as the positive sample image, and determining the third image as the negative sample image.
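The hard-sample mining steps above can be sketched with cosine similarity over toy feature sets; the vectors and set sizes below are illustrative only, not from the patent:

```python
import numpy as np

def least_similar_pair(feats):
    """Indices (i, j) of the two features with the lowest cosine similarity."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, np.inf)  # exclude self-similarity from the minimum
    return np.unravel_index(np.argmin(sims), sims.shape)

def hardest_match(anchor, candidates):
    """Index of the candidate feature least similar to the anchor."""
    anchor = anchor / np.linalg.norm(anchor)
    normed = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return int(np.argmin(normed @ anchor))

# Toy feature sets (hypothetical outputs of the feature extraction step).
first_set = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
second_set = np.array([[-1.0, 0.2], [0.5, 0.5]])

i, j = least_similar_pair(first_set)         # candidate first/second images
k = hardest_match(first_set[i], second_set)  # candidate third image for the first
```

Picking the least-similar pair within one set and the least-similar counterpart from the other set yields the hard triplet candidates that the "in response to ..." comparison then assigns to reference, positive, and negative roles.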
In a possible implementation manner, after the determining, by the image recognition model, the first similarity between the reference sample image and the positive sample image, and the second similarity between the reference sample image and the negative sample image, the method further includes:
And adjusting model parameters of the image recognition model in response to the difference between the first similarity and the second similarity not meeting the target condition.
In a possible implementation manner, the adjusting the model parameters of the image recognition model includes:
Determining a loss function according to the difference between the first similarity and the second similarity;
determining a generated gradient of the image recognition model according to the loss function;
and adjusting model parameters of the image recognition model according to a gradient descent method.
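The patent names the gradient descent method without fixing the loss; a hinge-style triplet loss over the two similarities is one common choice. The one-parameter model and finite-difference gradient below are toy stand-ins for illustration:

```python
import numpy as np

def triplet_loss(sim_pos: float, sim_neg: float, margin: float = 0.2) -> float:
    """Zero once the positive-pair similarity beats the negative-pair
    similarity by at least `margin`; positive (worth descending) otherwise."""
    return max(0.0, sim_neg - sim_pos + margin)

def loss_of(w: float) -> float:
    # Similarities as simple functions of a single parameter w (toy stand-ins).
    sim_pos = np.tanh(w)          # grows as w grows
    sim_neg = np.tanh(-0.5 * w)   # shrinks as w grows
    return triplet_loss(sim_pos, sim_neg)

w, lr, eps = 0.0, 0.1, 1e-6
for _ in range(100):
    grad = (loss_of(w + eps) - loss_of(w - eps)) / (2 * eps)  # finite-difference gradient
    w -= lr * grad  # gradient descent update
```

After the descent, the loss reaches zero: the difference between the two similarities satisfies the margin, which corresponds to the target condition being met.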
In one possible embodiment, before using the image recognition model as the trained image recognition model, the method further includes:
Acquiring a first positive sample image, a second positive sample image, a first negative sample image and a second negative sample image, wherein the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects;
inputting the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image into an image recognition model, determining a third similarity between the first positive sample image and the second positive sample image and a fourth similarity between the first negative sample image and the second negative sample image by the image recognition model;
And adjusting model parameters of the image recognition model according to the difference information between the third similarity and the fourth similarity.
In one aspect, there is provided an image recognition method, the method comprising:
Acquiring a first image to be identified and a second image to be identified, wherein the first image to be identified comprises a first object to be identified, and the second image to be identified comprises a second object to be identified;
Inputting the first image to be identified and the second image to be identified into an image identification model, and extracting first image features of the first image to be identified and second image features of the second image to be identified through the image identification model;
The image recognition model is trained based on a plurality of reference sample images, positive sample images and negative sample images, wherein the reference sample images and the positive sample images correspond to a first sample object, the negative sample images correspond to a second sample object, and the first sample object and the second sample object are different sample objects;
And outputting the similarity between the first object to be identified and the second object to be identified according to the first image feature and the second image feature.
In one aspect, there is provided a training apparatus for an image recognition model, the apparatus comprising:
A sample image acquisition module, configured to acquire a reference sample image, a positive sample image, and a negative sample image in an iterative process, where the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects;
a first input module for inputting the reference sample image, the positive sample image, and the negative sample image into an image recognition model, determining a first similarity between the reference sample image and the positive sample image, and a second similarity between the reference sample image and the negative sample image by the image recognition model;
And the training module is used for responding to the difference value between the first similarity and the second similarity to meet a target condition, and taking the image recognition model as a trained image recognition model.
In a possible implementation manner, the first input module is configured to extract, through the image recognition model, reference feature information of the reference sample image, positive sample feature information of the positive sample image, and negative sample feature information of the negative sample image; obtaining the first similarity according to the reference characteristic information and the positive sample characteristic information; and obtaining the second similarity according to the reference characteristic information and the negative sample characteristic information.
In one possible embodiment, the apparatus further comprises:
the feature extraction module is used for extracting features of a plurality of first sample images in the first sample image set to obtain a plurality of first sample image features;
A similarity determining module, configured to determine, according to the plurality of first sample image features, a similarity between a plurality of first sample images in the first sample image set;
An image determining module, configured to determine two first sample images with the lowest similarity as a first image and a second image;
The similarity determining module is further configured to perform feature extraction on a plurality of second sample images in the second sample image set to obtain a plurality of second sample image features;
the image determining module is further configured to determine, from the second sample image set, a third image and a fourth image with lowest similarity to the first image and the second image according to the plurality of second sample image features and the first sample image features of the first image and the second image, respectively;
The image determination module is further configured to determine the first image as the reference sample image, the second image as the positive sample image, and the third image as the negative sample image in response to a first similarity between the first image and the third image being less than a second similarity between the second image and the fourth image.
In one possible embodiment, the apparatus further comprises:
And the parameter adjustment module is used for adjusting model parameters of the image recognition model in response to the fact that the difference value between the first similarity and the second similarity does not meet the target condition.
In a possible implementation manner, the parameter adjustment module is configured to determine a loss function according to a difference between the first similarity and the second similarity; determining a generated gradient of the image recognition model according to the loss function; and adjusting model parameters of the image recognition model according to a gradient descent method.
In a possible implementation manner, the sample image acquisition module is further configured to acquire a first positive sample image, a second positive sample image, a first negative sample image, and a second negative sample image, where the first positive sample image and the second positive sample image correspond to a same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects;
The first input module is further configured to input the first positive sample image, the second positive sample image, the first negative sample image, and the second negative sample image into an image recognition model, determine a third similarity between the first positive sample image and the second positive sample image, and a fourth similarity between the first negative sample image and the second negative sample image through the image recognition model;
the training module is further configured to adjust model parameters of the image recognition model according to difference information between the third similarity and the fourth similarity.
In one aspect, there is provided an image recognition apparatus, the apparatus comprising:
The image acquisition module is used for acquiring a first image to be identified and a second image to be identified, wherein the first image to be identified comprises a first object to be identified, and the second image to be identified comprises a second object to be identified;
The second input module is used for inputting the first image to be identified and the second image to be identified into an image identification model, and extracting first image features of the first image to be identified and second image features of the second image to be identified through the image identification model;
The image recognition model is trained based on a plurality of reference sample images, a plurality of positive sample images, and a plurality of negative sample images, wherein the reference sample images and the positive sample images correspond to a first sample object, the negative sample images correspond to a second sample object, and the first sample object and the second sample object are different sample objects;
and the output module is used for outputting the similarity between the first object to be identified and the second object to be identified according to the first image feature and the second image feature.
In one aspect, a computer device is provided that includes one or more processors and one or more memories having at least one instruction stored therein, the instruction being loaded and executed by the one or more processors to implement the operations performed by the training method of the image recognition model or by the image recognition method.
In one aspect, a computer-readable storage medium is provided having at least one instruction stored therein, the instruction being loaded and executed by a processor to implement the operations performed by the training method of the image recognition model or by the image recognition method.
In the embodiment of the application, the server trains the image recognition model with three images: a reference sample image, a positive sample image, and a negative sample image. Each of the three images contains a sample object; the reference sample image and the positive sample image correspond to the same sample object, while the reference sample image and the negative sample image correspond to different sample objects. During training, the server drives the model to make the similarity it computes between the reference sample image and the positive sample image as high as possible, and the similarity between the reference sample image and the negative sample image as low as possible, so that in subsequent use it can accurately determine whether a captured image and a stored image contain the same face.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an image recognition model according to an embodiment of the present application;
FIG. 2 is a flowchart of a training method for an image recognition model according to an embodiment of the present application;
FIG. 3 is a flowchart of a training method for an image recognition model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a training method of an image recognition model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a method for acquiring cross-batch hard triplet images according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a training method of an image recognition model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a network updating method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a training method of an image recognition model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a training device structure of an image recognition model according to an embodiment of the present application;
FIG. 10 is a schematic view of an image recognition apparatus according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a terminal according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used to distinguish between identical or similar items whose functions are substantially the same. It should be understood that "first," "second," and "nth" imply no logical or chronological dependency, and do not limit the number of items or the order of execution.
The term "at least one" in the present application means one or more, and "plurality" means two or more, for example, a plurality of reference face images means two or more reference face images.
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
A monotonic function is either monotonically increasing or monotonically decreasing. In a monotonically increasing function, the dependent variable increases as the independent variable increases: for a monotonically increasing function F_i, if a < b, then F_i(a) < F_i(b). Likewise, in a monotonically decreasing function, the dependent variable decreases as the independent variable increases: for a monotonically decreasing function F_d, if a < b, then F_d(a) > F_d(b).
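The definitions can be checked numerically with two simple example functions (chosen here purely for illustration):

```python
def F_i(x):
    return 2 * x + 1   # monotonically increasing: F_i(a) < F_i(b) when a < b

def F_d(x):
    return 5 - 3 * x   # monotonically decreasing: F_d(a) > F_d(b) when a < b

a, b = 1.0, 2.0        # any pair with a < b
increasing_holds = F_i(a) < F_i(b)
decreasing_holds = F_d(a) > F_d(b)
```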
Cloud computing is a computing model that distributes computing tasks across a large resource pool formed of many computers, enabling various application systems to acquire computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". From the user's perspective, the resources in the cloud are infinitely expandable and can be acquired at any time, used on demand, expanded at any time, and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (abbreviated as a cloud platform, generally referred to as an IaaS (Infrastructure as a Service) platform) is established, in which multiple types of virtual resources are deployed for external clients to select and use.
According to logical function division, a PaaS (Platform as a Service) layer can be deployed on the IaaS (Infrastructure as a Service) layer, and a SaaS (Software as a Service) layer can be deployed above the PaaS layer; SaaS can also be deployed directly on IaaS. PaaS is a platform on which software runs, such as a database or a web container. SaaS is the wide variety of business software, such as web portals and SMS mass-sending services. Generally, SaaS and PaaS are upper layers relative to IaaS.
In the embodiment of the present application, the technical solution provided herein may be implemented by a server or a terminal as the execution body, or through interaction between the terminal and the server; the embodiment of the present application is not limited in this respect. The following description takes the server as the execution body:
In the embodiment of the application, the image recognition model can be used to recognize whether the objects contained in two images are the same object, for example, whether the vehicles in two images are the same vehicle, or whether the fruits in two images are the same type of fruit. Of course, the image recognition model can also compare whether the faces in two images are the same face, for example comparing whether the face in an identity-card image stored in a database and the face in an image captured in real time are the same face.
In order to describe the technical scheme provided by the application more clearly, the training method of the image recognition model provided by the application is introduced first:
It should be noted that, in training the image recognition model provided by the present application, the parameters of the model may be initialized by pre-training, for example by training the model on a public face recognition data set (such as MS-Celeb-1M) so that the network obtains a good initialization. In this way, the image recognition model can be trained well and achieves a good image recognition effect.
The pre-training steps are as follows:
Step 1, the server preprocesses all the pre-training images, including steps such as detecting whether a face is contained, alignment, and cropping. The server divides the preprocessed images into a plurality of batches, and one batch of images is used in each training pass.
Step 2, the server randomly picks n images from a batch and inputs them into the network for training, where n is a positive integer.
Step 3, the server performs forward propagation, obtains the network output, and performs L2 normalization on the features.
Step 4, the server calculates the AM-Softmax loss function based on formula (1):

L_{AMS} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}} \quad (1)

wherein n represents the number of pre-training images; y_i represents the true attribute label of the i-th training image; j ranges over the attribute labels other than y_i; s is a scaling factor, set to 30; cos θ_j represents the cosine of the angle between the feature vector and the weight vector; and m is the cosine margin term.
Step 5, the server judges whether the training loss has converged; if so, training is terminated and the pre-trained model is obtained; otherwise, the next step is performed.
Step 6, the server calculates the gradients of the network parameters and updates the network parameters using the stochastic gradient descent (Stochastic Gradient Descent, SGD) algorithm.
Step 7, the server returns to step 2.
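The loss of pre-training step 4 can be sketched in NumPy as follows. This is a minimal illustration, assuming the features and class weights have already been L2-normalized (step 3) so that the logits are cosine values; s=30 is from the text, while the margin m=0.35 is an assumed value the text does not fix.

```python
import numpy as np

def am_softmax_loss(cos_theta, labels, s=30.0, m=0.35):
    """AM-Softmax loss over a batch of n pre-training images.

    cos_theta: (n, C) cosines between the L2-normalized features and the
               L2-normalized class weight vectors.
    labels:    (n,) true attribute labels y_i.
    s:         scaling factor (30, as in the text).
    m:         cosine margin term (assumed value).
    """
    n = cos_theta.shape[0]
    rows = np.arange(n)
    logits = s * cos_theta.astype(float)
    # subtract the cosine margin from the true-class logit only
    logits[rows, labels] = s * (cos_theta[rows, labels] - m)
    # numerically stable log-softmax
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())
```

The margin makes correctly classified samples still pay a cost until their true-class cosine beats every other class by m, which tightens the intra-class clusters.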
Of course, the server can also acquire another open-source image recognition model from the network to serve as the pre-trained model for training the image recognition model provided by the application, which reduces the computation load of the server and improves model training efficiency.
In order to more clearly describe the training method of the image recognition model provided by the present application, first, description is made on the structure of the image recognition model provided by the embodiment of the present application, referring to fig. 1, the image recognition model may include: an input layer 101, a feature extraction layer 102, and an output layer 103.
Wherein the input layer 101 is used for inputting images into the model. The feature extraction layer 102 is used to extract feature information of an image, and the feature information may include, but is not limited to, geometric feature information and size feature information of an object in the image. The output layer 103 is used for processing the image feature information to obtain difference information for model training. Three or four input layers 101 may be present in the image recognition model. If there are three input layers 101, they are used to input a reference sample image, a positive sample image, and a negative sample image, that is, a triplet of images, respectively. If there are four input layers 101, they are used to input a first positive sample image, a second positive sample image, a first negative sample image, and a second negative sample image, that is, a quadruple of images, respectively. Accordingly, each input layer 101 may be followed by a feature extraction layer 102 for extracting, respectively, the reference feature information of the reference sample image, the positive sample feature information of the positive sample image, and the negative sample feature information of the negative sample image, or the first positive sample feature information of the first positive sample image, the second positive sample feature information of the second positive sample image, the first negative sample feature information of the first negative sample image, and the second negative sample feature information of the second negative sample image; the feature extraction layers may share parameters.
The number of output layers 103 may be two, with one output layer 1031 for outputting a first similarity between the positive sample image and the reference sample image, or for outputting a third similarity between the first positive sample image and the second positive sample image, and one output layer 1032 for outputting a second similarity between the reference sample image and the negative sample image, or for outputting a fourth similarity between the first negative sample image and the second negative sample image.
Of course, the structure of the image recognition model is shown for exemplary description, and in other possible embodiments, there may be other structures of the image recognition model, which are not limited in the embodiment of the present application.
Fig. 2 is a flowchart of a training method of an image recognition model according to an embodiment of the present application, referring to fig. 2, the method includes:
201. In one iteration process, the server acquires a reference sample image, a positive sample image and a negative sample image, wherein the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects.
202. The server inputs the reference sample image, the positive sample image, and the negative sample image into an image recognition model, and determines a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image through the image recognition model.
203. In response to the difference between the first similarity and the second similarity meeting the target condition, the server takes the image recognition model as a trained image recognition model.
In one possible embodiment, determining, by the image recognition model, a first similarity between the reference sample image and the positive sample image, and a second similarity between the reference sample image and the negative sample image comprises:
And extracting the reference characteristic information of the reference sample image, the positive sample characteristic information of the positive sample image and the negative sample characteristic information of the negative sample image through the image identification model.
And obtaining the first similarity according to the reference characteristic information and the positive sample characteristic information.
And obtaining second similarity according to the reference characteristic information and the negative sample characteristic information.
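The two similarities above can be sketched directly over the extracted feature vectors. This is a minimal illustration assuming cosine similarity as the metric (the 128-dimensional-feature embodiment later in this section uses the same choice):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors; with L2-normalized
    features this reduces to a dot product."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_similarities(ref_feat, pos_feat, neg_feat):
    """First similarity (reference vs. positive sample feature information)
    and second similarity (reference vs. negative sample feature information)."""
    first = cosine_similarity(ref_feat, pos_feat)
    second = cosine_similarity(ref_feat, neg_feat)
    return first, second
```

Training then drives the first value up and the second value down.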
In one possible embodiment, before acquiring the reference sample image, the positive sample image and the negative sample image, the method further comprises:
And extracting the characteristics of the plurality of first sample images in the first sample image set to obtain the characteristics of the plurality of first sample images.
A similarity between the plurality of first sample images in the first set of sample images is determined based on the plurality of first sample image features.
The two first sample images with the lowest similarity are determined as a first image and a second image.
And extracting the characteristics of a plurality of second sample images in the second sample image set to obtain a plurality of second sample image characteristics.
And determining, from the second sample image set, a third image with the lowest similarity to the first image and a fourth image with the lowest similarity to the second image, according to the plurality of second sample image features and the first sample image features of the first image and the second image respectively.
In response to the first similarity between the first image and the third image being less than the second similarity between the second image and the fourth image, the first image is determined to be the reference sample image, the second image is determined to be the positive sample image, and the third image is determined to be the negative sample image.
In one possible embodiment, after determining the first similarity between the reference sample image and the positive sample image and the second similarity between the reference sample image and the negative sample image by the image recognition model, the method further comprises:
And adjusting model parameters of the image recognition model in response to the difference between the first similarity and the second similarity not meeting the target condition.
In one possible implementation, adjusting the model parameters of the image recognition model includes:
A loss function is determined based on a difference between the first similarity and the second similarity.
The gradient generated by the image recognition model is determined from the loss function.
And adjusting model parameters of the image recognition model according to a gradient descent method.
In one possible embodiment, before using the image recognition model as the trained image recognition model, the method further includes:
a first positive sample image, a second positive sample image, a first negative sample image, and a second negative sample image are acquired, the first positive sample image and the second positive sample image corresponding to the same sample object, the first negative sample image and the second negative sample image corresponding to different sample objects.
The first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image are input into an image recognition model, and a third similarity between the first positive sample image and the second positive sample image and a fourth similarity between the first negative sample image and the second negative sample image are determined through the image recognition model.
And adjusting model parameters of the image recognition model according to the difference information between the third similarity and the fourth similarity.
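The quadruple-based adjustment above can be sketched as a margin term that pushes the third similarity (positive pair) above the fourth similarity (negative pair). This is a hedged illustration rather than the patent's exact loss: the hinge form, the margin value 0.1, and the use of cosine similarity are assumptions.

```python
import numpy as np

def quadruplet_term(f_p1, f_p2, f_n1, f_n2, margin=0.1):
    """Hinge on the gap between the third similarity (first vs. second
    positive image features) and the fourth similarity (first vs. second
    negative image features); zero once the positive pair is more similar
    than the negative pair by at least the margin."""
    def cos(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    third = cos(f_p1, f_p2)    # similarity of the positive pair
    fourth = cos(f_n1, f_n2)   # similarity of the negative pair
    return max(fourth - third + margin, 0.0)
```

A zero value means the model already separates the two pairs well enough; a positive value yields a gradient that adjusts the model parameters.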
In the embodiment of the application, the server can train the image recognition model by adopting three images, namely, a reference sample image, a positive sample image and a negative sample image, wherein the three images comprise sample objects, the reference sample image and the positive sample image correspond to the same sample object, and the reference sample image and the negative sample image correspond to different sample objects. During the training process, the server aims at identifying that the similarity between the reference sample image and the positive sample image is as high as possible through the model, and identifying that the similarity between the reference sample image and the negative sample image is as low as possible, so that in the subsequent use process, whether the shot image and the stored image contain the same human face can be accurately determined.
Since the training of the image recognition model may comprise a plurality of iterative processes, the following steps 301-304 are illustrated by way of example as one iterative process. Fig. 3 is a flowchart of a training method of an image recognition model according to an embodiment of the present application, referring to fig. 3, the method includes:
301. In one iteration process, the server acquires a reference sample image, a positive sample image and a negative sample image, wherein the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects.
Wherein, the sample object can be determined according to the purpose of the image recognition model, for example, the image recognition model is used for recognizing the human face in the image, and then the sample object can be the human face; if the image recognition model is used for recognizing the cell nucleus in the image, the sample object can be the cell nucleus; if the image recognition model is used to recognize a vehicle in an image, the sample object may be a vehicle. The embodiment of the present application is not limited thereto. The following description will take a sample object as a face as an example:
In one possible implementation, the server may obtain a first set of images and a second set of images from the network, wherein the images in the first set of images correspond to different faces than the images in the second set of images. The server can divide the images in the first image set and the second image set into a plurality of batches, and one batch of images is adopted for training in each training process. The server may obtain a reference sample image and a positive sample image from a first set of images and a negative sample image from a second set of images. The server can add labels to the reference sample image, the positive sample image and the negative sample image to indicate the corresponding faces.
After the server acquires the reference sample image, the positive sample image and the negative sample image, it can crop them to obtain sample images of the same size. A technician can screen the cropped sample images and reject those that do not contain a human face. Training the image recognition model on sample images of the same size allows all numerical values in the model parameters to be learned through a large amount of training, which improves the accuracy of the image recognition model in recognizing images.
In addition, in order to further improve the recognition capability of the image recognition model, a technician may manually screen the sample images in the two image sets and confirm that the sample images within each image set correspond to the same face. In this implementation, the model trained with such sample images can achieve a more accurate recognition effect.
Of course, the server may also acquire a plurality of images from the network, perform image recognition on the plurality of images, and discard an image if there is no face in the image. The server may detect key points of the image including the face to obtain a position of the face in the image, and cut the image according to the position of the face, for example, cut the image to a predetermined size of 120×120. The server may classify the images according to their faces, generating at least two image sets, the images in each image set corresponding to the same face. The server may determine a first image set and a second image set from the at least two image sets. The server may obtain a reference sample image and a positive sample image from a first set of images and a negative sample image from a second set of images.
The method by which the server acquires the reference sample image, the positive sample image, and the negative sample image from the first image set and the second image set will be described below:
The reference sample image, the positive sample image, and the negative sample image may be referred to as a triplet, wherein the reference sample image is also referred to as an anchor sample. The positive sample and the anchor sample come from the same image set, that is, they correspond to the same face, and are paired to form a positive sample pair. The negative sample comes from a different identity, that is, it corresponds to a different face from the anchor sample and the positive sample, and is paired with the anchor sample to form a negative sample pair.
In order to train the image recognition model more comprehensively and obtain an image recognition model with a better image recognition effect, the technical scheme provided by the application adopts a Batch Hard Mining (BHM) algorithm when acquiring the reference sample image, the positive sample image and the negative sample image. In the sample images acquired through the BHM algorithm, the similarity between the reference sample image and the positive sample image is relatively low, and the similarity between the reference sample image and the negative sample image is relatively high, which increases the difficulty of training the image recognition model and ultimately improves its image recognition effect.
For example, the server performs feature extraction on a plurality of first sample images in the first sample image set to obtain a plurality of first sample image features. The server determines the similarity between the plurality of first sample images in the first sample image set based on the plurality of first sample image features, and determines the two first sample images with the lowest similarity as the first image and the second image. The server then performs feature extraction on a plurality of second sample images in the second sample image set to obtain a plurality of second sample image features, and determines, from the second sample image set, a third image with the lowest similarity to the first image and a fourth image with the lowest similarity to the second image, according to the plurality of second sample image features and the first sample image features of the first image and the second image respectively. In response to the first similarity between the first image and the third image being less than the second similarity between the second image and the fourth image, the server determines the first image as the reference sample image, the second image as the positive sample image, and the third image as the negative sample image.
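The selection steps above can be sketched in NumPy. Cosine similarity over feature rows is an assumption, and the fallback branch (swapping roles when the comparison fails) is not stated in the text and is added here as a hypothetical completion.

```python
import numpy as np

def cos_sim_matrix(a, b):
    """Pairwise cosine similarities between the rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def select_triplet(first_feats, second_feats):
    """Pick (reference, positive, negative) indices per the steps above.

    first_feats:  (n1, d) first sample image features (same face).
    second_feats: (n2, d) second sample image features (another face).
    """
    s11 = cos_sim_matrix(first_feats, first_feats)
    np.fill_diagonal(s11, np.inf)                        # exclude self-pairs
    i, j = np.unravel_index(np.argmin(s11), s11.shape)   # least similar pair

    s12 = cos_sim_matrix(first_feats[[i, j]], second_feats)
    third = int(np.argmin(s12[0]))   # least similar to the first image
    fourth = int(np.argmin(s12[1]))  # least similar to the second image

    if s12[0, third] < s12[1, fourth]:
        return int(i), int(j), third     # reference, positive, negative
    return int(j), int(i), fourth        # assumed fallback (not in the text)
```

The returned indices point into the first and second sample image sets respectively.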
In addition to the Batch Hard Mining (BHM) algorithm, the present application also provides a cross-batch hard-example mining algorithm to obtain the reference sample image, the positive sample image, and the negative sample image. The method is as follows:
Since training is performed in batches, it can be considered that, although the network parameters change as training progresses, the features produced in adjacent batch iterations do not change too much. Thus, when selecting sample images, samples from past batches can be used as an important reference. In other words, the selection of sample images is not limited to the current batch, but extends to the past M batches. Compared with Batch Hard Mining (BHM), this algorithm has a larger sample selection space and, correspondingly, can better obtain hard examples for training. Referring to fig. 5, the principle is as follows: the sample image set is acquired and hard-example triplets are mined to obtain cross-batch triplets.
1. The server determines the distance between any two samples within the current batch, forming a distance matrix D^current, where D_ij^current represents the distance between the i-th sample and the j-th sample. Based on the distance matrix and the labels of the batch samples, the server selects the hardest positive sample image pair, i.e., the pair P1^tri, P2^tri that is farthest apart (two samples with the same label can be regarded as a positive sample image pair; a positive sample image pair consists of one reference sample image and one positive sample image).
2. The server calculates the distance matrix D^cross between the selected hard positive sample images P1^tri and P2^tri and the samples Q_X of the past batches, and selects the corresponding hardest negative samples N1^tri, N2^tri according to the label queue of Q_X (two samples with different labels can be regarded as a negative sample image pair). The distances between P1^tri and N1^tri, and between P2^tri and N2^tri, are then compared, and only the harder negative sample, i.e., the one at the smaller distance, is retained and marked as N^tri, while the positive sample image corresponding to N^tri is used as the reference sample image.
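Steps 1 and 2 can be sketched with a feature queue over the past M batches. This is a hedged sketch: Euclidean distance and M=3 are assumptions, and the hardest negative is taken as the closest differently-labeled sample, following the usual hard-negative-mining convention.

```python
import numpy as np
from collections import deque

class CrossBatchMiner:
    """Sketch of the cross-batch hard-example steps above: keep the
    features and labels of the past M batches in a queue and mine the
    hardest negative for the current batch's hardest positive pair."""

    def __init__(self, m_batches=3):
        self.queue = deque(maxlen=m_batches)  # past (features, labels)

    def mine(self, feats, labels):
        # step 1: distance matrix D^current within the current batch
        d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
        same = labels[:, None] == labels[None, :]
        d_pos = np.where(same, d, -np.inf)
        np.fill_diagonal(d_pos, -np.inf)
        # hardest (farthest) positive pair P1, P2
        p1, p2 = np.unravel_index(np.argmax(d_pos), d_pos.shape)

        # step 2: hardest (closest) negatives over current + past batches
        all_feats = np.concatenate([feats] + [f for f, _ in self.queue])
        all_labels = np.concatenate([labels] + [l for _, l in self.queue])
        best = None
        for p in (p1, p2):
            dn = np.linalg.norm(all_feats - feats[p], axis=-1)
            dn[all_labels == labels[p]] = np.inf   # keep negatives only
            n = int(np.argmin(dn))
            if best is None or dn[n] < best[2]:
                best = (p, n, dn[n])               # keep the harder negative

        self.queue.append((feats, labels))
        anchor, negative, _ = best
        return anchor, negative                    # indices: anchor in batch,
                                                   # negative in concatenated pool
```

On the second and later calls, the negative may come from a past batch, which is exactly the enlarged selection space described above.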
302. The server inputs the reference sample image, the positive sample image, and the negative sample image into an image recognition model, and determines a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image through the image recognition model.
In one possible implementation, the server extracts reference feature information of the reference sample image, positive sample feature information of the positive sample image, and negative sample feature information of the negative sample image through the image recognition model. And the server obtains the first similarity according to the reference characteristic information and the positive sample characteristic information. And the server obtains a second similarity according to the reference characteristic information and the negative sample characteristic information.
For example, the server may input the reference sample image, the positive sample image and the negative sample image into the image recognition model through the input layers 101 of the image recognition model, and convolve them through the feature extraction layers 102 to obtain the reference feature information of the reference sample image, the positive sample feature information of the positive sample image and the negative sample feature information of the negative sample image. The server may input this feature information into the output layers 1031 and 1032 of the image recognition model, and obtain the first similarity and the second similarity through the output layers.
For example, the server may perform convolution processing on the reference sample image, the positive sample image, and the negative sample image through the feature extraction layer 102 of the image recognition model, to obtain a 128-dimensional reference feature vector, a 128-dimensional positive sample feature vector, and a 128-dimensional negative sample feature vector, respectively. The server may determine a first similarity from the reference feature vector and the positive sample feature vector and a second similarity from the reference feature vector and the negative sample feature vector, i.e. a cosine similarity between the vectors.
It should be noted that, after the server performs step 302, it may determine whether the difference between the first similarity and the second similarity meets the target condition. In response to the difference not meeting the target condition, the server may perform step 303; in response to the difference meeting the target condition, the server may perform step 304.
303. In response to the difference between the first similarity and the second similarity not meeting the target condition, the server adjusts the model parameters of the image recognition model according to the difference, re-acquires a reference sample image, a positive sample image, and a negative sample image, and performs steps 301 and 302 again.
In one possible implementation, the server determines the loss function based on the difference between the first similarity and the second similarity, determines the gradient generated by the image recognition model based on the loss function, and adjusts the model parameters of the image recognition model according to a gradient descent method.
For example, the server may construct a first loss function from the first similarity and the second similarity, and adjust the model parameters of the image recognition model through the first loss function. The reference sample image and the positive sample image correspond to the same face, that is, the features of the faces in the positive sample image and the reference sample image are close, while the negative sample image corresponds to a different face from the reference sample image and the positive sample image. The server trains the image recognition model so that the first similarity is as large as possible and the second similarity is as small as possible, i.e., the difference between the first similarity and the second similarity is as large as possible. In this implementation, the server adjusts the model parameters of the image recognition model through the first similarity and the second similarity, thereby improving the model's ability to recognize the sample object in an image.
For example, the server may construct the first loss function from the first similarity and the second similarity through formula (2):

$$L_{tri} = \left[\, d\big(f(x_a), f(x_p)\big) - d\big(f(x_a), f(x_n)\big) + m_1 \,\right]_+ \tag{2}$$

Wherein L_tri is the triplet loss function, that is, the first loss function; x_a is the reference sample image, x_p is the positive sample image, and x_n is the negative sample image; y_a, y_p and y_n are the faces to which the reference sample image, the positive sample image and the negative sample image correspond; f(·) denotes the features extracted by the model; d(r_1, r_2) is a metric function measuring the distance between the vectors r_1 and r_2; m_1 is a margin hyper-parameter controlling the gap between classes; and [z]_+ = max(z, 0), that is, the larger of z and 0 is taken: if z is larger than 0, z is taken, and if z is smaller than 0, 0 is taken. Under this setting, the triplet loss function expands the inter-class gap while reducing the intra-class gap.
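Formula (2) can be computed directly on the extracted feature vectors. A minimal sketch in which d is the Euclidean distance and the margin m1=0.2 is an assumed value (the text does not fix it):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, m1=0.2):
    """L_tri = [ d(f_a, f_p) - d(f_a, f_n) + m1 ]_+  with Euclidean d.

    f_a, f_p, f_n: features of the reference (anchor), positive and
    negative sample images; m1 is the margin hyper-parameter.
    """
    f_a, f_p, f_n = (np.asarray(v, float) for v in (f_a, f_p, f_n))
    d_ap = np.linalg.norm(f_a - f_p)   # intra-class distance
    d_an = np.linalg.norm(f_a - f_n)   # inter-class distance
    return max(d_ap - d_an + m1, 0.0)  # [z]_+ = max(z, 0)
```

The loss is zero once the negative is farther from the anchor than the positive by at least m1, which is exactly the inter-class/intra-class behavior described above.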
It should be noted that the server may adjust the model parameters of the image recognition model according to the first loss function using a gradient descent method, where the gradient descent method may be stochastic gradient descent (Stochastic Gradient Descent, SGD), batch gradient descent (Batch Gradient Descent), or mini-batch gradient descent (Mini-Batch Gradient Descent); the embodiment of the present application is not limited thereto. In addition, the server can also combine the gradient descent method with a polynomial learning rate decay strategy to adjust the model parameters of the image recognition model. In this implementation, the server can dynamically adjust the learning rate as training progresses, improving the training effect of the image recognition model.
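The combination of gradient descent with a polynomial learning-rate decay strategy can be sketched as follows; the decay power 0.9 and the parameter-dictionary interface are assumptions, since the text only names the strategy.

```python
import numpy as np

def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial learning-rate decay: base_lr * (1 - step/total)^power."""
    return base_lr * (1.0 - step / total_steps) ** power

def sgd_step(params, grads, lr):
    """One stochastic-gradient-descent update of the model parameters."""
    return {name: params[name] - lr * grads[name] for name in params}
```

In a training loop, the server would recompute the learning rate each iteration and feed it into the update, so the step size shrinks smoothly toward zero at the end of training.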
304. In response to the difference between the first similarity and the second similarity meeting the target condition, the server takes the image recognition model as a trained image recognition model.
It should be noted that the foregoing steps 201 to 203 are described with the server as the execution subject. In other possible embodiments, the terminal may serve as the execution subject to perform the foregoing method, or the terminal may interact with the server to perform it: the user inputs an image through the terminal, the terminal sends the image to the server, the server performs the training process of the image recognition model and sends the trained image recognition model to the terminal, and the user can then use the image recognition model to perform image recognition on the terminal.
In the embodiment of the application, the server can train the image recognition model by adopting three images, namely, a reference sample image, a positive sample image and a negative sample image, wherein the three images comprise sample objects, the reference sample image and the positive sample image correspond to the same sample object, and the reference sample image and the negative sample image correspond to different sample objects. During the training process, the server aims at identifying that the similarity between the reference sample image and the positive sample image is as high as possible through the model, and identifying that the similarity between the reference sample image and the negative sample image is as low as possible, so that in the subsequent use process, whether the shot image and the stored image contain the same human face can be accurately determined.
The above steps 301-304 are described taking a triplet, that is, a reference sample image, a positive sample image, and a negative sample image, as an example. In addition to the above steps 301-304, a method for training the image recognition model based on a quadruple is also provided; see fig. 4, where the method includes:
401. The server acquires a first positive sample image, a second positive sample image, a first negative sample image and a second negative sample image, wherein the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects.
Wherein, the sample object can be determined according to the purpose of the image recognition model, for example, the image recognition model is used for recognizing the human face in the image, and then the sample object can be the human face; if the image recognition model is used for recognizing the cell nucleus in the image, the sample object can be the cell nucleus; if the image recognition model is used to recognize a vehicle in an image, the sample object may be a vehicle. The embodiment of the present application is not limited thereto. The following description will take a sample object as a face as an example:
In one possible implementation, the server may obtain a first set of sample images, a second set of images, and a third set of images from the network, wherein the images in the first set of sample images, the images in the second set of images, and the images in the third set of images correspond to different faces. The server can divide the images in the first sample image set, the second image set and the third image set into a plurality of batches, and training is performed by adopting images of one batch in each training process. The server may obtain a first positive sample image and a second positive sample image from the first set of sample images, a first negative sample image from the second set of images, and a second negative sample image from the third set of images. The server can add labels to the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image to indicate the corresponding faces.
After the server acquires the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image, it can crop them to obtain sample images of the same size. A technician can screen the cropped sample images and reject those that do not contain a human face. Training the image recognition model on sample images of the same size allows all numerical values in the model parameters to be learned through a large amount of training, which improves the accuracy of the image recognition model in recognizing images.
In addition, in order to further improve the recognition capability of the image recognition model, a technician may manually screen the sample images in the image sets and confirm that the sample images within each image set correspond to the same face. In this implementation, the model trained with such sample images can achieve a more accurate recognition effect.
Of course, the server may also acquire a plurality of images from the network, perform image recognition on the plurality of images, and discard an image if there is no face in the image. The server may detect key points of the image including the face to obtain a position of the face in the image, and cut the image according to the position of the face, for example, cut the image to a predetermined size of 120×120. The server may classify the images according to their faces, generating at least two image sets, the images in each image set corresponding to the same face. The server may determine a first set of sample images, a second set of images, and a third set of images from the at least three sets of images. The server may obtain a first positive sample image and a second positive sample image from the first set of sample images, a first negative sample image from the second set of images, and a second negative sample image from the third set of images.
The method for the server to acquire the first positive sample image, the second positive sample image, the first negative sample image, and the second negative sample image from the first sample image set, the second image set, and the third image set will be described as follows:
the first positive sample image, the second positive sample image, the first negative sample image, and the second negative sample image may be referred to as a quadruple, wherein the first positive sample and the second positive sample are from the same image set, i.e. correspond to the same face, and the first negative sample and the second negative sample are from different image sets, i.e. correspond to different faces. Any one negative sample and any one positive sample are from different sets of images, corresponding to different faces.
In order to train the image recognition model more comprehensively and obtain a model with a better image recognition effect, the technical scheme provided by the application adopts a Batch Hard Mining (BHM) algorithm when acquiring the reference sample image, the positive sample image and the negative sample image. Among the sample images acquired through the BHM algorithm, the similarity between the selected first positive sample image and second positive sample image is relatively low, while the similarity between the first negative sample image and the second negative sample image is relatively high. This increases the training difficulty of the image recognition model and ultimately improves its image recognition effect.
For example, the server performs feature extraction on a plurality of first sample images in the first sample image set to obtain a plurality of first sample image features. The server determines the similarity between the first sample images according to these features, and determines the two first sample images with the lowest similarity as the first image and the second image. The server then performs feature extraction on a plurality of second sample images in the second sample image set to obtain a plurality of second sample image features, and determines from the second sample image set, according to the second sample image features and the first sample image features of the first image and the second image, a third image with the lowest similarity to the first image and the second image. Likewise, the server performs feature extraction on a plurality of third sample images in the third sample image set, and determines from the third sample image set, according to the third sample image features, the first sample image features of the first image and the second image, and the second sample image features of the third image, a fourth image with the lowest similarity to the first image and the second image and the highest similarity to the third image. The server determines the first image as the first positive sample image, the second image as the second positive sample image, the third image as the first negative sample image, and the fourth image as the second negative sample image.
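The batch hard mining selection described above can be sketched as follows, assuming cosine similarity over the extracted features. The function names are illustrative; the text only specifies that the least-similar pair within one identity is chosen as the positive pair, and that the chosen negative pair has relatively high similarity.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between two feature matrices."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def hardest_positive_pair(feats: np.ndarray) -> tuple:
    """Within one identity's image set, return the indices of the two
    images whose features are least similar (hardest positive pair)."""
    sim = cosine_sim(feats, feats)
    np.fill_diagonal(sim, np.inf)  # ignore self-similarity
    return tuple(np.unravel_index(np.argmin(sim), sim.shape))

def hardest_negative_pair(feats_a: np.ndarray, feats_b: np.ndarray) -> tuple:
    """Across two different identities, return the index pair with the
    highest cross-set similarity (hardest negative pair)."""
    sim = cosine_sim(feats_a, feats_b)
    return tuple(np.unravel_index(np.argmax(sim), sim.shape))
```

The hardest positive pair yields the first and second positive sample images, while the hardest negative pair across the second and third sets yields the first and second negative sample images.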
402. The server inputs the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image into an image recognition model, and determines a third similarity between the first positive sample image and the second positive sample image and a fourth similarity between the first negative sample image and the second negative sample image through the image recognition model.
In one possible embodiment, the server extracts, through the image recognition model, first positive sample feature information of the first positive sample image, second positive sample feature information of the second positive sample image, first negative sample feature information of the first negative sample image, and second negative sample feature information of the second negative sample image. The server obtains the third similarity according to the first positive sample feature information and the second positive sample feature information, and obtains the fourth similarity according to the first negative sample feature information and the second negative sample feature information.
For example, the server may input the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image into the image recognition model through the input layer 101 of the image recognition model, and convolve the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image through the feature extraction layer 102 of the image recognition model to obtain the first positive sample feature information, the second positive sample feature information, the first negative sample feature information and the second negative sample feature information. The server may input the first positive sample feature information, the second positive sample feature information, the first negative sample feature information, and the second negative sample feature information into the output layer 1031 of the image recognition model, and obtain a third similarity and a fourth similarity through the output layer 1031.
For example, the server may perform convolution processing on the first positive sample image, the second positive sample image, the first negative sample image, and the second negative sample image through the feature extraction layer 102 of the image recognition model to obtain a 128-dimensional first positive sample feature vector, a 128-dimensional second positive sample feature vector, a 128-dimensional first negative sample feature vector, and a 128-dimensional second negative sample feature vector, respectively. The server may determine the third similarity from the first positive sample feature vector and the second positive sample feature vector, and the fourth similarity from the first negative sample feature vector and the second negative sample feature vector, that is, determine the cosine similarity between the vectors.
It should be noted that, after the server finishes executing step 402, it may determine whether the difference between the third similarity and the fourth similarity meets the target condition. Responsive to the difference not meeting the target condition, the server may perform step 403; in response to the difference meeting the target condition, the server may perform step 404.
403. And the server adjusts model parameters of the image recognition model according to the difference information between the third similarity and the fourth similarity.
In one possible implementation, the server determines the loss function based on a difference between the third similarity and the fourth similarity. The server determines the gradient of the image recognition model generation based on the loss function. And the server adjusts model parameters of the image recognition model according to a gradient descent method.
For example, the server may construct a second loss function from the third similarity and the fourth similarity, and adjust the model parameters of the image recognition model through this second loss function. The first positive sample image and the second positive sample image correspond to the same face, so the features of the faces in these two images are close; the first negative sample image and the second negative sample image input into the image recognition model correspond to different faces, so the features of the faces in these two images are not close. The server therefore trains the image recognition model so that the third similarity is as large as possible and the fourth similarity is as small as possible, that is, any third similarity is greater than any fourth similarity; in other words, the difference between the third similarity and the fourth similarity is as large as possible. In this implementation, adjusting the model parameters through the third similarity and the fourth similarity improves the recognition capability of the image recognition model for sample objects in an image.
For example, the server may construct the second loss function through the third similarity and the fourth similarity by formula (3), which, based on the symbol definitions below, takes the hinge form:

L_qui = Σ_{y_i = y_j, y_l ≠ y_k} [ d(x_i, x_j) − d(x_l, x_k) + m_2 ]_+    (3)
Where L_qui is the quadruplet loss function, i.e., the second loss function; x_i is the first positive sample image, x_j is the second positive sample image, x_l is the first negative sample image, and x_k is the second negative sample image; y_i, y_j, y_l and y_k are the face labels corresponding to the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image, respectively; d(r_1, r_2) is a metric function measuring the distance between vectors r_1 and r_2; and m_2 is a margin hyperparameter that controls the gap between the intra-class and inter-class distances. Here [z]_+ = max(z, 0), i.e., the larger of z and 0 is selected: z is selected if z is greater than 0, and 0 is selected if z is less than or equal to 0. Under such an arrangement, the quadruplet loss function expands the inter-class gap while reducing the intra-class gap.
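As one possible reading of the quadruplet term, a hinge loss over the positive-pair and negative-pair distances can be computed as below. The exact form of formula (3) is not reproduced in the text, so this sketch assumes the common single-margin hinge that is consistent with the symbol description above; the margin value is illustrative.

```python
import numpy as np

def quadruplet_loss(d_pos: np.ndarray, d_neg: np.ndarray, margin: float = 0.5) -> float:
    """Hinge form of the quadruplet term: push the positive-pair
    distance d(x_i, x_j) below the negative-pair distance d(x_l, x_k)
    by at least `margin`, then average over the batch.

    d_pos: distances of positive pairs, d(x_i, x_j)
    d_neg: distances of negative pairs, d(x_l, x_k)
    """
    # [z]_+ = max(z, 0): only pairs that violate the margin contribute.
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())
```

When the positive pair is already closer than the negative pair by more than the margin, the term vanishes, which matches the goal of expanding the inter-class gap while reducing the intra-class gap.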
404. In response to the difference between the third similarity and the fourth similarity meeting the target condition, the server takes the image recognition model as a trained image recognition model.
Through the steps 801-805 described above, the server may train the image recognition model using two images: a positive sample image and a negative sample image of smaller size than the positive sample image, both of which include sample objects, the negative sample image being derived from the positive sample image. The server may amplify the negative sample image to the same size as the positive sample image, so that the amplified negative sample image contains a number of sample objects smaller than or equal to the number of sample objects in the positive sample image; that is, a labeled sample image is generated in which the number of sample objects in the negative sample image is naturally smaller than the number in the positive sample image. The positive sample image and the amplified negative sample image are input into the image recognition model, the quantity information of the sample objects in each is determined through the model, and the model is trained so that the difference between the two quantities is large enough. When the difference is large enough, the model can recognize the number of sample objects, i.e., the quantity information of the sample objects, which provides prior knowledge for subsequent model training and improves the training effect of the image recognition model.
In the embodiment of the application, the server can train the image recognition model by adopting four images, wherein the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image comprise sample objects, the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects. In the training process, the server aims at identifying that the similarity between the first positive sample image and the second positive sample image is as high as possible through the model, and identifying that the similarity between the first negative sample image and the second negative sample image is as low as possible, so that in the subsequent use process, whether the shot image and the stored image contain the same human face or not can be accurately determined.
Based on the above steps 301-304 and 401-404, the present application further provides a joint training method based on triplets and quadruples, in which the reference sample image in the triplets corresponds to the first positive sample image in the quadruples, the positive sample image in the triplets corresponds to the second positive sample image in the quadruples, and the negative sample image in the triplets corresponds to the first negative sample image in the quadruples, see fig. 5, and the method is as follows.
601. The server acquires a first positive sample image, a second positive sample image, a first negative sample image and a second negative sample image, wherein the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects.
The image acquisition process is described in reference to 401 above and will not be described again.
602. The server inputs the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image into an image recognition model, determines a third similarity between the first positive sample image and the second positive sample image and a fourth similarity between the first negative sample image and the second negative sample image through the image recognition model, and determines a loss function according to a difference value between the third similarity and the fourth similarity.
In one possible implementation, the server initializes the feature extraction network F(·; θ), where θ is a network parameter, and initializes the pseudo-large-batch sequence X_plb to null, i.e., X_plb = [ ]. The server randomly selects sample images to form a set I = [I_1, …, I_n], and extracts the corresponding batch features X = F(I; θ). The server detaches the sample features X extracted in the current iteration from the computation graph (in PyTorch this is done with detach, which removes information such as the gradient of X so that it does not occupy video memory) and appends them to the sequence X_plb. The server selects hard triplets and quadruplets from the current iteration batch (according to the batch hard mining manner) and calculates the corresponding triplet and quadruplet losses respectively. In addition, the server selects cross-iteration hard triplets from the current batch sample images X and the previous batch sample images X_plb, and calculates the loss function according to formula (4).
Where x_a is the anchor sample image, i.e., the first positive sample image, and x_p is the positive sample, i.e., the second positive sample image. Both of these samples come only from the current iteration, because if a sample from a previous iteration were taken as part of a positive pair, the calculated gradient could not be back-propagated to it (samples from previous iterations have been detached), making the training meaningless. x_n is the first negative sample image, and it can be selected from previous iterations (i.e., from X_plb), so the selection space is larger and hard negative pairs are easier to find. In addition, y_i is the label of the corresponding sample, and N is the number of samples.
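The cross-iteration mining described above relies on a queue of detached features from earlier batches. A minimal sketch (in plain NumPy, standing in for detached PyTorch tensors) might look like this; the class name and the use of Euclidean distance are assumptions.

```python
import numpy as np

class PseudoLargeBatch:
    """Keeps detached features from previous iterations so that hard
    negatives can be mined across batches without storing gradients.
    (In PyTorch, the stored tensors would be `x.detach()`.)"""

    def __init__(self):
        self.feats = np.empty((0, 0))
        self.labels = np.empty((0,), dtype=int)

    def append(self, feats: np.ndarray, labels: np.ndarray) -> None:
        """Append one batch of (detached) features and their labels."""
        if self.feats.size == 0:
            self.feats = feats.copy()
        else:
            self.feats = np.vstack([self.feats, feats])
        self.labels = np.concatenate([self.labels, labels])

    def hardest_negative(self, anchor_feat: np.ndarray, anchor_label: int) -> np.ndarray:
        """Among stored features with a different label, return the one
        closest to the anchor (the hardest cross-iteration negative)."""
        mask = self.labels != anchor_label
        cands = self.feats[mask]
        d = np.linalg.norm(cands - anchor_feat, axis=1)
        return cands[np.argmin(d)]
```

Because only x_n is drawn from the queue, gradients still flow through the anchor and positive samples of the current batch, as the text requires.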
The server calculates the joint loss of the current batch according to the following formula (5), and the schematic diagram of the joint loss is shown in fig. 7.
603. The server determines the gradient of the image recognition model generation based on the loss function. And the server adjusts model parameters of the image recognition model according to a gradient descent method.
In one possible implementation, the server performs gradient back-propagation and accumulates the gradients generated by the current iteration according to the following formula (6).
The server judges whether the number of iterations has reached the fixed number PseudoN; if so, it calculates the network parameter gradient and updates the network parameters using the SGD algorithm, see formula (7).
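Since formulas (6) and (7) are not reproduced in the text, the following sketch assumes plain gradient averaging over the PseudoN accumulated iterations followed by a vanilla SGD step; any momentum or weight-decay terms in the actual scheme are omitted.

```python
import numpy as np

def sgd_update(theta: np.ndarray, grads: list, lr: float = 0.01) -> np.ndarray:
    """Average the gradients accumulated over PseudoN iterations
    (the role of formula (6)) and apply one SGD step (the role of
    formula (7)): theta <- theta - lr * mean(grads)."""
    g = np.mean(grads, axis=0)
    return theta - lr * g
```

In a PyTorch training loop, the same effect is usually achieved by calling `loss.backward()` each iteration (which accumulates into `.grad`) and invoking the optimizer step only every PseudoN iterations.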
604. In response to the difference between the third similarity and the fourth similarity meeting the target condition, the server takes the image recognition model as a trained image recognition model.
In one possible implementation, the server determines whether the training loss converges, and if so, terminates the training to obtain the image recognition model.
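The convergence criterion is not specified in the text; one simple assumed criterion is to stop when the training loss has stopped improving over a window of iterations:

```python
def converged(losses: list, patience: int = 5, tol: float = 1e-4) -> bool:
    """Return True when the loss has improved by less than `tol` over
    the last `patience` iterations (criterion assumed, not specified
    by the text)."""
    if len(losses) <= patience:
        return False
    return losses[-patience - 1] - losses[-1] < tol
```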
In the embodiment of the application, the server can train the image recognition model by adopting four images, wherein the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image comprise sample objects, the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects. In the training process, the server aims at identifying that the similarity between the first positive sample image and the second positive sample image is as high as possible through the model, and identifying that the similarity between the first negative sample image and the second negative sample image is as low as possible, so that in the subsequent use process, whether the shot image and the stored image contain the same human face or not can be accurately determined.
In addition to the above-mentioned image recognition model training method, the embodiment of the application also provides an image recognition method, which is implemented by an image recognition model trained by the image recognition model training method, and comprises the following steps:
801. The server acquires a first image to be identified and a second image to be identified, wherein the first image to be identified includes a first object to be identified, and the second image to be identified includes a second object to be identified.
The first image to be identified may be an image acquired in real time, and the second image to be identified may be an image stored on the server. For example, the second image to be identified is an identity card photograph of a person to be identified, and the first image to be identified is an image of that person photographed on site by a camera.
802. The server inputs the first image to be identified and the second image to be identified into an image identification model, and extracts the first image feature of the first image to be identified and the second image feature of the second image to be identified through the image identification model.
The image recognition model is trained based on a plurality of reference sample images, positive sample images and negative sample images, wherein the reference sample images and the positive sample images correspond to a first sample object, the negative sample images correspond to a second sample object, and the first sample object and the second sample object are different sample objects.
803. The server outputs the similarity between the first object to be identified and the second object to be identified according to the first image feature and the second image feature.
804. In response to the similarity between the first object to be identified and the second object to be identified meeting the similarity condition, the server determines that the first object to be identified and the second object to be identified are the same object.
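The decision in steps 803-804 can be sketched as follows; the use of cosine similarity and the threshold value are assumptions, as the text only states that the similarity must meet a similarity condition.

```python
import numpy as np

def same_identity(feat1: np.ndarray, feat2: np.ndarray, threshold: float = 0.6):
    """Decide whether two images contain the same face by comparing the
    cosine similarity of their extracted features against a threshold.
    The threshold value is illustrative, not specified by the text."""
    sim = float(np.dot(feat1, feat2) /
                (np.linalg.norm(feat1) * np.linalg.norm(feat2)))
    return sim >= threshold, sim
```

In the identity-card scenario above, `feat1` and `feat2` would be the 128-dimensional features extracted from the live photograph and the stored photograph, respectively.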
Fig. 9 is a schematic structural diagram of an image recognition model training device according to an embodiment of the present application, and referring to fig. 9, the device includes a sample image acquisition module 901, a first input module 902, and a training module 903.
The sample image obtaining module 901 is configured to obtain, in an iterative process, a reference sample image, a positive sample image, and a negative sample image, where the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects.
A first input module 902 for inputting the reference sample image, the positive sample image and the negative sample image into an image recognition model, determining a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image by the image recognition model.
The training module 903 is configured to use the image recognition model as a trained image recognition model in response to the difference between the first similarity and the second similarity meeting a target condition.
In one possible implementation, the first input module is configured to extract, through the image recognition model, reference feature information of the reference sample image, positive sample feature information of the positive sample image, and negative sample feature information of the negative sample image; to obtain the first similarity according to the reference feature information and the positive sample feature information; and to obtain the second similarity according to the reference feature information and the negative sample feature information.
In one possible embodiment, the apparatus further comprises:
And the feature extraction module is used for extracting features of a plurality of first sample images in the first sample image set to obtain a plurality of first sample image features.
And the similarity determining module is used for determining the similarity between the plurality of first sample images in the first sample image set according to the characteristics of the plurality of first sample images.
And the image determining module is used for determining the two first sample images with the lowest similarity as a first image and a second image.
The similarity determining module is further configured to perform feature extraction on a plurality of second sample images in the second sample image set, so as to obtain a plurality of second sample image features.
The image determining module is further configured to determine a third image and a fourth image with the lowest similarity to the first image and the second image from the second sample image set according to the plurality of second sample image features and the first sample image features of the first image and the second image, respectively.
The image determination module is further configured to determine the first image as a reference image, determine the second image as a positive sample image, and determine the third image as a negative sample image in response to the first similarity between the first image and the third image being less than the second similarity between the second image and the fourth image.
In one possible embodiment, the apparatus further comprises:
and the parameter adjustment module is used for adjusting the model parameters of the image recognition model in response to the fact that the difference between the first similarity and the second similarity does not meet the target condition.
In one possible implementation, the parameter adjustment module is configured to determine the loss function according to a difference between the first similarity and the second similarity. The gradient of the image recognition model generation is determined from the loss function. And adjusting model parameters of the image recognition model according to a gradient descent method.
In a possible embodiment, the sample image obtaining module is further configured to obtain a first positive sample image, a second positive sample image, a first negative sample image, and a second negative sample image, where the first positive sample image and the second positive sample image correspond to a same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects.
The first input module is further configured to input the first positive sample image, the second positive sample image, the first negative sample image, and the second negative sample image into an image recognition model, and determine a third similarity between the first positive sample image and the second positive sample image, and a fourth similarity between the first negative sample image and the second negative sample image through the image recognition model.
The training module is further used for adjusting model parameters of the image recognition model according to difference information between the third similarity and the fourth similarity.
In the embodiment of the application, the server can train the image recognition model by adopting three images, namely, a reference sample image, a positive sample image and a negative sample image, wherein the three images comprise sample objects, the reference sample image and the positive sample image correspond to the same sample object, and the reference sample image and the negative sample image correspond to different sample objects. During the training process, the server aims at identifying that the similarity between the reference sample image and the positive sample image is as high as possible through the model, and identifying that the similarity between the reference sample image and the negative sample image is as low as possible, so that in the subsequent use process, whether the shot image and the stored image contain the same human face can be accurately determined.
Fig. 10 is a schematic structural diagram of an image recognition device according to an embodiment of the present application, referring to fig. 10, the device includes: an image acquisition module 1001, a second input module 1002, and an output module 1003.
An image obtaining module 1001, configured to obtain a first image to be identified and a second image to be identified, where the first image to be identified includes a first object to be identified, and the second image to be identified includes a second object to be identified;
A second input module 1002, configured to input the first image to be identified and the second image to be identified into an image identification model, and extract, through the image identification model, a first image feature of the first image to be identified and a second image feature of the second image to be identified;
The image recognition model is trained based on a plurality of reference sample images, a plurality of positive sample images and a plurality of negative sample images, wherein the reference sample images and the positive sample images correspond to a first sample object, the negative sample images correspond to a second sample object, and the first sample object and the second sample object are different sample objects;
And an output module 1003, configured to output the similarity between the first object to be identified and the second object to be identified according to the first image feature and the second image feature.
The embodiment of the application provides a computer device, which is used for executing the methods provided by the above embodiments, and the computer device can be implemented as a terminal or a server, and the structure of the terminal is described below:
Fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal 1100 may be: tablet, notebook or desktop. Terminal 1100 may also be referred to by other names of user devices, portable terminals, laptop terminals, desktop terminals, and the like.
Generally, the terminal 1100 includes: one or more processors 1101, and one or more memories 1102.
The processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 1101 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit); a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1101 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 1101 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1102 is used to store at least one instruction for execution by processor 1101 to implement the methods provided by the method embodiments of the present application.
In some embodiments, the terminal 1100 may further optionally include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102, and peripheral interface 1103 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1103 by buses, signal lines or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, a display screen 1105, a camera assembly 1106, audio circuitry 1107, a positioning assembly 1108, and a power supply 1109.
A peripheral interface 1103 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 1101 and memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, memory 1102, and peripheral interface 1103 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1104 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1104 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1105 is a touch display, the display 1105 also has the ability to collect touch signals at or above the surface of the display 1105. The touch signal may be input to the processor 1101 as a control signal for processing. At this time, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard.
The camera assembly 1106 is used to capture images or video. Optionally, the camera assembly 1106 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal.
The audio circuit 1107 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert them into electrical signals, and input the electrical signals to the processor 1101 for processing, or to the radio frequency circuit 1104 for voice communication.
The location component 1108 is used to locate the current geographic location of the terminal 1100 for navigation or LBS (Location Based Service, location-based services).
The power supply 1109 is used to supply power to the various components in the terminal 1100. The power supply 1109 may be alternating current, direct current, a disposable battery, or a rechargeable battery.
In some embodiments, terminal 1100 also includes one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensor 1111, gyroscope sensor 1112, pressure sensor 1113, fingerprint sensor 1114, optical sensor 1115, and proximity sensor 1116.
The acceleration sensor 1111 may detect the magnitudes of accelerations on three coordinate axes of a coordinate system established with the terminal 1100.
The gyroscope sensor 1112 may detect the body orientation and rotation angle of the terminal 1100, and may cooperate with the acceleration sensor 1111 to collect the user's 3D motions on the terminal 1100.
The pressure sensor 1113 may be disposed on a side frame of the terminal 1100 and/or beneath the display screen 1105. When the pressure sensor 1113 is disposed on a side frame of the terminal 1100, it can detect the user's grip signal on the terminal 1100, and the processor 1101 performs left/right-hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed beneath the display screen 1105, the processor 1101 controls the operable controls on the UI according to the pressure the user applies to the display screen 1105.
The fingerprint sensor 1114 is used to collect the user's fingerprint; either the processor 1101 identifies the user based on the fingerprint collected by the fingerprint sensor 1114, or the fingerprint sensor 1114 itself identifies the user based on the collected fingerprint.
The optical sensor 1115 is used to collect the ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 based on the intensity of ambient light collected by the optical sensor 1115. The proximity sensor 1116 is used to collect a distance between the user and the front surface of the terminal 1100.
Those skilled in the art will appreciate that the structure shown in fig. 11 is not limiting and that terminal 1100 may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
The computer device may also be implemented as a server, and the following describes the structure of the server:
Fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application, where the server 1200 may include one or more processors (Central Processing Units, CPU) 1201 and one or more memories 1202, where the one or more memories 1202 store at least one instruction, and the at least one instruction is loaded and executed by the one or more processors 1201 to implement the methods provided in the foregoing method embodiments. Of course, the server 1200 may also have a wired or wireless network interface, a keyboard, an input/output interface, etc. for performing input/output, and the server 1200 may also include other components for implementing device functions, which are not described herein.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory comprising instructions executable by a processor to perform the methods provided by the various method embodiments described above. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the above storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements within the spirit and principles of the present application.
Claims (10)
1. A training method of an image recognition model, characterized by comprising:
In one iteration process, a reference sample image, a positive sample image and a negative sample image are acquired, wherein the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects;
Inputting the reference sample image, the positive sample image and the negative sample image into an image recognition model, and determining a first similarity between the reference sample image and the positive sample image and a second similarity between the reference sample image and the negative sample image through the image recognition model;
and in response to the difference between the first similarity and the second similarity meeting a target condition, taking the image recognition model as a trained image recognition model.
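The iteration described in claim 1 can be sketched as follows. The cosine similarity measure, the margin value of 0.2, the toy embeddings, and all function names are illustrative assumptions rather than elements of the claim, which does not fix a particular similarity measure or target condition:

```python
# Sketch of one training iteration from claim 1: compare the anchor
# (reference sample) against a positive and a negative sample and test
# whether the similarity gap meets the target condition.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_satisfied(anchor: np.ndarray, positive: np.ndarray,
                      negative: np.ndarray, margin: float = 0.2) -> bool:
    """True when the first similarity (anchor vs. positive) exceeds the
    second similarity (anchor vs. negative) by at least `margin` -- a
    plausible instance of the claim's 'target condition'."""
    first_sim = cosine_similarity(anchor, positive)
    second_sim = cosine_similarity(anchor, negative)
    return (first_sim - second_sim) >= margin

# Toy embeddings standing in for the model's outputs on the three images.
anchor = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])   # same object -> similar embedding
negative = np.array([0.0, 1.0, 0.0])   # different object -> dissimilar
print(triplet_satisfied(anchor, positive, negative))  # True for this toy data
```

When the condition holds for the sampled triplets, training stops and the current model is kept; otherwise the parameters are adjusted (claims 4 and 5).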
2. The method of claim 1, wherein the determining, by the image recognition model, a first similarity between the reference sample image and the positive sample image, and a second similarity between the reference sample image and the negative sample image comprises:
Extracting reference characteristic information of the reference sample image, positive sample characteristic information of the positive sample image and negative sample characteristic information of the negative sample image through the image recognition model;
Obtaining the first similarity according to the reference characteristic information and the positive sample characteristic information;
and obtaining the second similarity according to the reference characteristic information and the negative sample characteristic information.
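Claim 2's two-step structure (extract feature information, then derive the similarities from it) can be illustrated with a stand-in extractor. The normalized intensity histogram below is a placeholder assumption; the patent's extractor is a learned network:

```python
# Minimal illustration of claim 2: extract feature information from each
# sample image, then compute the first and second similarities from the
# features alone.
import numpy as np

def extract_features(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Stand-in feature extractor: normalized intensity histogram."""
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    hist = hist.astype(float)
    return hist / (hist.sum() + 1e-12)

def similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12))

rng = np.random.default_rng(0)
ref = rng.beta(2, 5, (16, 16))                           # dark-skewed "image"
pos = np.clip(ref + rng.normal(0, 0.05, ref.shape), 0, 1)  # perturbed copy
neg = rng.beta(5, 2, (16, 16))                           # bright-skewed "image"

f_ref, f_pos, f_neg = map(extract_features, (ref, pos, neg))
first_similarity = similarity(f_ref, f_pos)   # reference vs. positive
second_similarity = similarity(f_ref, f_neg)  # reference vs. negative
```

Because the positive is a perturbed copy of the reference, its feature vector stays close, so the first similarity comes out higher than the second.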
3. The method of claim 1, wherein prior to the acquiring the reference sample image, the positive sample image, and the negative sample image, the method further comprises:
Extracting features of a plurality of first sample images in the first sample image set to obtain a plurality of first sample image features;
Determining the similarity between the plurality of first sample images in the first sample image set according to the plurality of first sample image features;
determining the two first sample images with the lowest mutual similarity as a first image and a second image;
Extracting features of a plurality of second sample images in the second sample image set to obtain a plurality of second sample image features;
determining, from the second sample image set and according to the plurality of second sample image features and the first sample image features of the first image and the second image, a third image having the lowest similarity with the first image and a fourth image having the lowest similarity with the second image;
in response to a first similarity between the first image and the third image being less than a second similarity between the second image and the fourth image, determining the first image as the reference sample image, determining the second image as the positive sample image, and determining the third image as the negative sample image.
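The hard-sample mining of claim 3 can be sketched as below. The toy feature vectors, the cosine measure, and the mirrored fallback branch (the claim only recites the case where the first similarity is smaller) are all assumptions:

```python
# Hedged sketch of claim 3's hard-sample mining: pick the least-similar
# pair inside the same-object set, find each one's hardest counterpart in
# the other-object set, then assign anchor/positive/negative roles.
import numpy as np
from itertools import combinations

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mine_triplet(first_feats, second_feats):
    # Least-similar pair inside the first (same-object) set.
    i, j = min(combinations(range(len(first_feats)), 2),
               key=lambda p: cos(first_feats[p[0]], first_feats[p[1]]))
    # For each of the two, the least-similar image in the second set.
    k = min(range(len(second_feats)), key=lambda t: cos(first_feats[i], second_feats[t]))
    m = min(range(len(second_feats)), key=lambda t: cos(first_feats[j], second_feats[t]))
    if cos(first_feats[i], second_feats[k]) < cos(first_feats[j], second_feats[m]):
        return i, j, k   # (anchor, positive) from first set, negative from second
    return j, i, m       # mirrored assignment -- an assumption, not in the claim

first_feats = [np.array([1.0, 0.0, 0.0]),
               np.array([0.9, 0.1, 0.0]),
               np.array([0.5, 0.5, 0.0])]
second_feats = [np.array([0.1, 1.0, 0.0]),
                np.array([0.0, 0.1, 1.0])]
anchor_idx, positive_idx, negative_idx = mine_triplet(first_feats, second_feats)
```

Selecting the least-similar same-object pair and the hardest cross-object counterpart yields triplets the model currently gets wrong, which is what makes this mining step informative.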
4. The method of claim 1, wherein after the determining, by the image recognition model, a first similarity between the reference sample image and the positive sample image, and a second similarity between the reference sample image and the negative sample image, the method further comprises:
And adjusting model parameters of the image recognition model in response to the difference between the first similarity and the second similarity not meeting the target condition.
5. The method of claim 4, wherein said adjusting model parameters of said image recognition model comprises:
Determining a loss function according to the difference between the first similarity and the second similarity;
determining a gradient of the image recognition model according to the loss function;
and adjusting the model parameters of the image recognition model according to the gradient by a gradient descent method.
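The loss-to-update pipeline of claim 5 can be worked through with a hinge-style triplet loss, L = max(0, margin − (sim(a,p) − sim(a,n))). The dot-product similarity, the margin of 0.5, and treating the anchor embedding as the adjustable "parameters" are simplifying assumptions; in the claim the update applies to the model's parameters via backpropagation:

```python
# One manual gradient-descent step on a hinge-style triplet loss built
# from the difference between the two similarities (claim 5).
import numpy as np

def triplet_loss(a, p, n, margin=0.5):
    """Loss from the similarity difference: zero once sim(a,p) - sim(a,n) >= margin."""
    return max(0.0, margin - (np.dot(a, p) - np.dot(a, n)))

a = np.array([0.2, 0.1])   # anchor embedding, standing in for the parameters
p = np.array([1.0, 0.0])   # positive embedding
n = np.array([0.0, 1.0])   # negative embedding

loss_before = triplet_loss(a, p, n)
# d/da of (margin - a.p + a.n) is (n - p) while the hinge is active.
grad_a = n - p
lr = 0.1
a = a - lr * grad_a        # gradient-descent update
loss_after = triplet_loss(a, p, n)
```

One step moves the anchor toward the positive and away from the negative, so the loss drops (here from 0.4 to 0.2), which is exactly the direction the claimed adjustment drives the model.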
6. The method of claim 1, wherein prior to using the image recognition model as a trained image recognition model, the method further comprises:
Acquiring a first positive sample image, a second positive sample image, a first negative sample image and a second negative sample image, wherein the first positive sample image and the second positive sample image correspond to the same sample object, and the first negative sample image and the second negative sample image correspond to different sample objects;
inputting the first positive sample image, the second positive sample image, the first negative sample image and the second negative sample image into an image recognition model, determining a third similarity between the first positive sample image and the second positive sample image and a fourth similarity between the first negative sample image and the second negative sample image by the image recognition model;
And adjusting model parameters of the image recognition model according to the difference information between the third similarity and the fourth similarity.
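Claim 6 adds a calibration pass over image pairs rather than triplets: the similarity on a same-object pair (third similarity) should exceed the similarity on a different-object pair (fourth similarity). The toy embeddings, the hinge form, and the 0.3 gap threshold below are illustrative assumptions:

```python
# Hedged sketch of claim 6's pairwise pass: compute the third and fourth
# similarities and turn their difference into a further training signal.
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings for the four sample images of claim 6.
pos1, pos2 = np.array([1.0, 0.1]), np.array([0.9, 0.2])   # same object
neg1, neg2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # different objects

third_similarity = cos(pos1, pos2)    # should be high
fourth_similarity = cos(neg1, neg2)   # should be low
gap = third_similarity - fourth_similarity
# A simple hinge on the gap; a real trainer would backpropagate this
# difference information into the model parameters.
calibration_loss = max(0.0, 0.3 - gap)
```

Here the same-object pair is already far more similar than the different-object pair, so the calibration loss is zero and no adjustment is needed.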
7. An image recognition method, characterized in that the image recognition method comprises:
Acquiring a first image to be identified and a second image to be identified, wherein the first image to be identified comprises a first object to be identified, and the second image to be identified comprises a second object to be identified;
Inputting the first image to be identified and the second image to be identified into an image identification model, and extracting first image features of the first image to be identified and second image features of the second image to be identified through the image identification model;
The image recognition model is trained based on a plurality of reference sample images, positive sample images and negative sample images, wherein the reference sample images and the positive sample images correspond to a first sample object, the negative sample images correspond to a second sample object, and the first sample object and the second sample object are different sample objects;
And outputting the similarity between the first object to be identified and the second object to be identified according to the first image feature and the second image feature.
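Inference per claim 7 reduces to embedding both images and comparing the features. The flatten-and-normalize "model", the 0.9 threshold, and the thresholded match decision are assumptions added for illustration; the claim itself only outputs the similarity:

```python
# Minimal inference sketch for claim 7: embed two images with a stand-in
# model, compare with cosine similarity, and (optionally) threshold the
# similarity to decide whether the two objects match.
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    """Placeholder for the trained model's feature extraction."""
    v = image.astype(float).ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def recognize(img1: np.ndarray, img2: np.ndarray, threshold: float = 0.9):
    f1, f2 = embed(img1), embed(img2)
    sim = float(np.dot(f1, f2))
    return sim, sim >= threshold

img_a = np.eye(4)                     # stand-in "image" of object A
img_b = np.eye(4) * 0.8               # scaled copy: same object
img_c = np.ones((4, 4)) - np.eye(4)   # different pattern: another object

sim_same, match_same = recognize(img_a, img_b)
sim_diff, match_diff = recognize(img_a, img_c)
```

Cosine similarity ignores overall intensity scaling, so the scaled copy still matches while the structurally different image does not.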
8. An image recognition model training apparatus, characterized in that the image recognition model training apparatus comprises:
A sample image acquisition module, configured to acquire a reference sample image, a positive sample image, and a negative sample image in an iterative process, where the reference sample image and the positive sample image correspond to a first sample object, the negative sample image corresponds to a second sample object, and the first sample object and the second sample object are different sample objects;
a first input module for inputting the reference sample image, the positive sample image, and the negative sample image into an image recognition model, determining a first similarity between the reference sample image and the positive sample image, and a second similarity between the reference sample image and the negative sample image by the image recognition model;
And the training module is used for responding to the difference value between the first similarity and the second similarity to meet a target condition, and taking the image recognition model as a trained image recognition model.
9. An image recognition apparatus, characterized in that the image recognition apparatus comprises:
The image acquisition module is used for acquiring a first image to be identified and a second image to be identified, wherein the first image to be identified comprises a first object to be identified, and the second image to be identified comprises a second object to be identified;
The second input module is used for inputting the first image to be identified and the second image to be identified into an image identification model, and extracting first image features of the first image to be identified and second image features of the second image to be identified through the image identification model;
The image recognition model is trained based on a plurality of reference sample images, a plurality of positive sample images and a plurality of negative sample images, wherein the reference sample images and the positive sample images correspond to a first sample object, the negative sample images correspond to a second sample object, and the first sample object and the second sample object are different sample objects;
and the output module is used for outputting the similarity between the first object to be identified and the second object to be identified according to the first image feature and the second image feature.
10. A computer device comprising one or more processors and one or more memories, the one or more memories having stored therein at least one instruction that is loaded and executed by the one or more processors to implement the operations performed by the training method of the image recognition model of any of claims 1-6; or operations performed by the image recognition method of claim 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011150035.5A CN112329826B (en) | 2020-10-24 | 2020-10-24 | Training method of image recognition model, image recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112329826A CN112329826A (en) | 2021-02-05 |
CN112329826B (en) | 2024-10-18
Family
ID=74311536
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011150035.5A Active CN112329826B (en) | 2020-10-24 | 2020-10-24 | Training method of image recognition model, image recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112329826B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112861474B (en) * | 2021-04-23 | 2021-07-02 | 腾讯科技(深圳)有限公司 | Information labeling method, device, equipment and computer readable storage medium |
CN113255575B (en) * | 2021-06-17 | 2024-03-29 | 深圳市商汤科技有限公司 | Neural network training method and device, computer equipment and storage medium |
CN113361497B (en) * | 2021-08-09 | 2021-12-07 | 北京惠朗时代科技有限公司 | Intelligent tail box application method and device based on training sample fingerprint identification |
CN113468365B (en) * | 2021-09-01 | 2022-01-25 | 北京达佳互联信息技术有限公司 | Training method of image type recognition model, image retrieval method and device |
CN113821623A (en) * | 2021-09-29 | 2021-12-21 | 平安普惠企业管理有限公司 | Model training method, device, equipment and storage medium |
CN113705589A (en) * | 2021-10-29 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Data processing method, device and equipment |
CN114155388B (en) * | 2022-02-10 | 2022-05-13 | 深圳思谋信息科技有限公司 | Image recognition method and device, computer equipment and storage medium |
CN114372538B (en) * | 2022-03-22 | 2023-04-18 | 中国海洋大学 | Method for convolution classification of scale vortex time series in towed sensor array |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111598144A (en) * | 2020-04-27 | 2020-08-28 | 腾讯科技(深圳)有限公司 | Training method and device of image recognition model |
CN111598025A (en) * | 2020-05-20 | 2020-08-28 | 腾讯科技(深圳)有限公司 | Training method and device of image recognition model |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108009528B (en) * | 2017-12-26 | 2020-04-07 | 广州广电运通金融电子股份有限公司 | Triple Loss-based face authentication method and device, computer equipment and storage medium |
CN108596277B (en) * | 2018-05-10 | 2020-07-07 | 腾讯科技(深圳)有限公司 | Vehicle identity recognition method and device and storage medium |
CN109902665A (en) * | 2019-03-28 | 2019-06-18 | 北京达佳互联信息技术有限公司 | Similar face retrieval method, apparatus and storage medium |
CN111382807B (en) * | 2020-06-01 | 2020-09-01 | 腾讯科技(深圳)有限公司 | Image processing method, image processing device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||