CN116958674A - Image recognition method and device, electronic equipment and storage medium

Info

Publication number: CN116958674A
Application number: CN202310906958.6A
Authority: CN
Inventors: 陈明翔
Original assignee: Ant Blockchain Technology Shanghai Co Ltd
Current assignee: Ant Blockchain Technology Shanghai Co Ltd
Priority date: 2023-07-21
Filing date: 2023-07-21
Publication date: 2023-10-27

Abstract

This specification discloses a method, an apparatus, an electronic device, and a storage medium for image recognition. A pre-trained multi-mode pre-training model and a classification model are deployed on a server, where the multi-mode pre-training model contains more model parameters than the classification model. First, an image to be identified is acquired. Second, the image to be identified is input into the classification model, and whether it is a violation image is determined based on the classification result output by the classification model. Finally, if the image to be identified is determined to be a violation image, it is further input into the multi-mode pre-training model for violation identification. The method can reduce the computing resources required for image recognition and improve the efficiency of image recognition.

Description

Image recognition method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method and apparatus for image recognition, an electronic device, and a storage medium.

Background

With the rapid development of the Internet, images have become one of the main forms in which users of Internet platforms publish information, owing to advantages such as intuitiveness and large information-carrying capacity. To attract traffic, lawbreakers may generate or spread a large number of violation images, for example images containing pornography or violence. These violation images can cause adverse social effects or interfere with the normal operation of the Internet platform. For this reason, the Internet platform needs to identify the images uploaded by users, so as to prevent users from spreading violation images.

At present, a common method is to train a multi-mode pre-training model on a large-scale data set and use it to identify the images uploaded by users. However, the multi-mode pre-training model requires substantial computing resources, so the image recognition efficiency is low.

Therefore, how to reduce the computing resources required for image recognition and improve the image recognition efficiency is an urgent problem to be solved.

Disclosure of Invention

The present disclosure provides a method, an apparatus, a storage medium, and an electronic device for image recognition, so as to reduce operation resources required for image recognition and improve efficiency of image recognition.

The technical scheme adopted in the specification is as follows:

the present disclosure provides a method for image recognition, where the method is applied to a server, where a pre-trained multi-mode pre-training model and a classification model are deployed on the server, where the number of model parameters included in the multi-mode pre-training model is greater than the number of model parameters included in the classification model, and the method includes:

acquiring an image to be identified;

inputting the image to be identified into the classification model, and determining whether the image to be identified is a violation image or not based on a classification result output by the classification model;

If the image to be identified is determined to be the violation image, the image to be identified is further input into the multi-mode pre-training model to conduct violation identification.

Optionally, the input image size corresponding to the classification model is smaller than the input image size of the multi-mode pre-training model;

inputting the image to be identified into the classification model, and determining whether the image to be identified is a violation image based on a classification result output by the classification model, comprising:

scaling the image to be identified according to the input image size corresponding to the classification model, inputting the scaled image to be identified into the classification model, and determining whether the image to be identified is a violation image based on a classification result output by the classification model.

Optionally, a plurality of classification models with differences in the sizes of the corresponding input images are deployed on the server;

scaling the image to be identified according to the input image size corresponding to the classification model, inputting the scaled image to be identified into the classification model, and determining whether the image to be identified is a violation image based on the classification result output by the classification model, including:

Sorting the plurality of classification models in order of the input image size from small to large;

scaling the image to be identified according to the input image size corresponding to the first classification model among the sorted classification models, inputting the scaled image to be identified into the first classification model, and determining whether the image to be identified is a violation image based on a classification result output by the first classification model;

if so, scaling the image to be identified according to the input image size corresponding to the second classification model among the sorted classification models, inputting the scaled image to be identified into the second classification model, and determining whether the image to be identified is a violation image based on the classification result output by the second classification model, and so on, until the image to be identified is scaled according to the input image size corresponding to the last classification model among the sorted classification models and the scaled image to be identified is input into the last classification model.

Optionally, if the image to be identified is determined to be a violation image, the image to be identified is further input into the multimodal pre-training model to perform violation identification, including:

Determining whether the image to be identified is a violation image or not based on a classification result output by the last classification model; if yes, the image to be identified is further input into the multi-mode pre-training model to conduct violation identification.

Optionally, the method further comprises:

if any classification model among the plurality of classification models outputs a classification result indicating that the image to be identified is a non-violation image, the scaled image to be identified is not input into the classification models that come after that classification model in the sorted order.

Optionally, the multi-modal pre-training model includes: a visual question-answering model;

inputting the image to be identified into the multi-mode pre-training model for violation identification, wherein the method comprises the following steps:

acquiring a plurality of text questions for violation identification;

inputting the image to be identified and the plurality of text questions into a visual question-answering model, and determining whether the image to be identified is a violation image based on reply texts aiming at the plurality of text questions.

Optionally, the classification model is tuned and trained with minimizing the omission ratio as the optimization target, where the omission ratio is the ratio of the number of violation images classified as non-violation images to the total number of violation images.

Optionally, before inputting the image to be identified into the classification model, the method further comprises:

judging whether the user blacklist contains the user information corresponding to the image to be identified, and if so, determining that the image to be identified is a violation image;

and judging whether the user whitelist contains the user information corresponding to the image to be identified, and if so, determining that the image to be identified is a non-violation image.

performing edge detection on the image to be identified, and determining edge information of an image target contained in the image to be identified;

and determining whether the image to be identified is a violation image according to the quantity of the edge information of the image target contained in the image to be identified.

Optionally, the visual question-answering model includes: BLIP model.

Optionally, the image to be identified includes: an image drawn by a user using an AI-based drawing program.

The present disclosure provides an image recognition device, which is applied to a server, where a pre-trained multi-mode pre-training model and a classification model are deployed on the server, where the number of model parameters included in the multi-mode pre-training model is greater than the number of model parameters included in the classification model, and includes:

The acquisition module is used for acquiring the image to be identified;

the execution module is used for inputting the image to be identified into the classification model and determining whether the image to be identified is a violation image or not based on a classification result output by the classification model;

and the input module is used for further inputting the image to be recognized into the multi-mode pre-training model for violation recognition if the image to be recognized is determined to be the violation image.

The specification provides an electronic device, which comprises a communication interface, a processor, a memory and a bus, wherein the communication interface, the processor and the memory are connected with each other through the bus;

the memory stores machine readable instructions and the processor performs the method of image recognition by invoking the machine readable instructions.

The present specification provides a machine-readable storage medium storing machine-readable instructions that, when invoked and executed by a processor, implement the method of image recognition described above.

The above-mentioned at least one technical scheme that this specification adopted can reach following beneficial effect:

In the image recognition method provided by this specification, the multi-mode pre-training model contains more model parameters than the classification model. The method first inputs the image to be identified into the classification model and determines whether it is a violation image based on the classification result output by the classification model. In this way, images that are easy to recognize are handled by the classification model, and images that are harder to recognize are screened out. If the image to be identified is determined to be a violation image, this harder-to-recognize image is further input into the multi-mode pre-training model for violation identification, so that the computing resources required for image recognition are reduced and the efficiency of image recognition is improved.

Drawings

The accompanying drawings are included to provide a further understanding of the specification and form a part of it. The exemplary embodiments of the specification and their descriptions serve to explain the specification and are not intended to limit it unduly. In the drawings:

FIG. 1 is a flow chart illustrating a method of image recognition in accordance with an exemplary embodiment;

FIG. 2 is a flow chart illustrating the identification of an image to be identified by a plurality of classification models in accordance with an exemplary embodiment;

FIG. 3 is a flow chart illustrating a method of determining whether an image to be identified is a violation image in accordance with an exemplary embodiment;

FIG. 4 is a schematic diagram of an electronic device in which an image recognition apparatus according to an exemplary embodiment is located;

FIG. 5 is a block diagram illustrating an apparatus for image recognition in accordance with an exemplary embodiment.

Detailed Description

In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.

It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.

In order to make the technical solution in the embodiments of the present specification better understood by those skilled in the art, the related art related to the embodiments of the present specification will be briefly described below.

The multi-mode pre-training model can be a deep learning model, pre-training is performed on a large-scale data set through a self-supervision learning method, representations of multiple modes (such as text, images, voice and the like) are effectively learned, and specific application is realized through fine tuning or training of specific tasks. The multi-mode pre-training model achieves better performance on a plurality of classical multi-mode tasks, such as visual question answering, picture title generation, picture-text retrieval and the like.

In practical applications, the images uploaded by the user are typically identified by a multimodal pre-training model. However, the multi-mode pre-training model requires more operation resources, and the image recognition efficiency is low.

Based on the above, the present specification proposes a technical scheme in which an image to be identified is first input into a classification model with a smaller number of model parameters, which handles images that are easy to recognize, and images that are harder to recognize are then input into a multi-mode pre-training model for identification, thereby reducing the computing resources required for image recognition and improving the efficiency of image recognition.

The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.

FIG. 1 is a flow chart illustrating a method of image recognition according to an exemplary embodiment, comprising the steps of:

S100: Acquire an image to be identified.

In the embodiment of the present disclosure, the method for image recognition is applied to a server. The physical carrier of the server may be a server, a server cluster, etc. For convenience of description, a method of image recognition provided in the present specification will be described below with only a server as an execution subject.

The server is provided with a pre-trained multi-mode pre-training model and a classification model, wherein the number of model parameters contained in the multi-mode pre-training model is larger than that of model parameters contained in the classification model.

In the embodiment of the present disclosure, the server may acquire the image to be identified. The image to be identified mentioned here includes: an image drawn by a user using an AI-based drawing program.

S102: Input the image to be identified into the classification model, and determine whether the image to be identified is a violation image based on a classification result output by the classification model.

S104: If the image to be identified is determined to be a violation image, further input the image to be identified into the multi-mode pre-training model for violation identification.

In the embodiment of the present disclosure, the server may train the classification model before inputting the image to be identified into the classification model. The server may train the classification model in various ways. For example, the server may prepare a dataset according to the service requirement and divide the images in the dataset into a training set and a validation set according to a set proportion. The server may then scale the training images according to the input image size corresponding to the classification model, input the scaled images into the classification model, and train the classification model with the optimization target of minimizing the deviation between the classification result output by the classification model and the actual result. The present specification limits neither the method of training the classification model nor the model structure of the classification model. The service requirement referred to herein may be, for example, a requirement to identify pornographic images or to identify images of public figures.
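The division of a dataset into a training set and a validation set "according to a set proportion" can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_dataset`, the default proportion of 0.8, and the fixed shuffle seed are hypothetical and are not specified by this specification.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    samples: Sequence[T], train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and split them into (training set, validation set).

    train_ratio and seed are illustrative defaults, not values from the source.
    """
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Every sample lands in exactly one of the two sets, so the validation set can be used to measure the classification model on images it was not trained on.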

In practical applications, the multi-mode pre-training model contains a large number of model parameters and recognizes images with high accuracy, but it requires substantial computing resources, so its image recognition efficiency is low. The classification model contains fewer model parameters and recognizes images with lower accuracy, but it requires fewer computing resources, so its image recognition efficiency is high.

Therefore, the server can identify the easily-identified non-offending image in the image to be identified through the classification model, so that the operation resource required by image identification is reduced, the image identification efficiency is improved, and the offending image which is not easy to identify in the image to be identified is input into the multi-mode pre-training model for offending identification, so that the image to be identified is accurately identified.

In the embodiment of the present disclosure, the server may input the image to be identified into the classification model, and determine whether the image to be identified is a violation image based on the classification result output by the classification model.

If the image to be identified is determined to be the violation image, the image to be identified is further input into a multi-mode pre-training model to conduct violation identification.
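Steps S102 and S104 can be sketched as a two-stage cascade. This is a minimal sketch, not the implementation of this specification: the two models are represented by hypothetical callables that return `True` when they judge the image to be a violation image, and the image is assumed to be a nested list of pixel values.

```python
from typing import Callable, List

Image = List[List[int]]  # assumed representation: rows of pixel values
# Hypothetical model interface: returns True for "violation image".
Model = Callable[[Image], bool]


def recognize(image: Image, classifier: Model, multimodal: Model) -> bool:
    """Two-stage recognition: small classifier first, large model only if flagged."""
    # S102: the small classification model screens the image.
    if not classifier(image):
        # Judged non-violation: the large model is never invoked.
        return False
    # S104: only flagged images reach the multi-mode pre-training model,
    # whose answer is taken as the final result.
    return multimodal(image)
```

The saving comes from the early return: every image that the classification model judges to be a non-violation image skips the expensive model entirely.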

In order to effectively reduce the computational complexity and improve the image recognition efficiency, the classification model is trained on a training set consisting of images with smaller image sizes. Correspondingly, the input image size corresponding to the classification model is smaller than that of the multi-mode pre-training model. Based on the above, the server needs to scale the image to be identified before inputting the image to be identified into the classification model.

In this embodiment of the present disclosure, the server may perform scaling processing on the image to be identified according to the size of the input image corresponding to the classification model, input the scaled image to be identified into the classification model, and determine whether the image to be identified is a violation image based on the classification result output by the classification model.
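The scaling step can be illustrated with a dependency-free nearest-neighbor resize. This specification does not name a scaling algorithm, so nearest-neighbor sampling is an assumption chosen for brevity, as are the function name `scale_nearest` and the nested-list image representation.

```python
from typing import List

Image = List[List[int]]  # assumed representation: rows of grayscale pixel values


def scale_nearest(image: Image, out_w: int, out_h: int) -> Image:
    """Resize `image` to out_w x out_h using nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        # Map each output coordinate back to the nearest source pixel.
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]
```

In a real deployment an image library would normally perform this step with a higher-quality interpolation method; the sketch only shows that the image is brought to the classifier's input size before inference.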

Further, the classification model can be tuned and trained with minimizing the omission ratio as the optimization target. The omission ratio referred to herein is the ratio of the number of violation images classified as non-violation images to the total number of violation images. For example, if there are 100 violation images and the classification model identifies only 90 of them, so that 10 violation images are missed, the omission ratio is 10%.
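The omission ratio defined above can be computed as in the following sketch. The function name and the boolean encoding (`True` means violation image) are assumptions made for illustration.

```python
from typing import Sequence


def omission_ratio(labels: Sequence[bool], predictions: Sequence[bool]) -> float:
    """Fraction of true violation images that were classified as non-violation."""
    # Keep only the predictions made for images that really are violations.
    on_violations = [pred for label, pred in zip(labels, predictions) if label]
    if not on_violations:
        raise ValueError("labels contain no violation images")
    missed = sum(1 for pred in on_violations if not pred)
    return missed / len(on_violations)
```

With 100 violation images of which 90 are detected, the function returns 10 / 100, matching the 10% in the example above. Non-violation images do not enter the ratio at all, which is why a classifier tuned this way may still flag many non-violation images.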

Therefore, provided that the omission ratio of the classification model is kept low, even if the classification model incorrectly identifies a non-violation image as a violation image, the multi-mode pre-training model still performs the final violation identification, so that the image is correctly identified as a non-violation image.

In practical applications, a single classification model identifies only a small number of non-violation images, so the multi-mode pre-training model still has to perform violation identification on a large number of images to be identified, and the gain in image recognition efficiency is limited. Based on this, multiple classification models can be deployed on the server to increase the number of non-violation images identified.

However, having a plurality of classification models recognize the image to be identified simultaneously would consume a lot of computing resources. Therefore, the server can sort the classification models by input image size and input the image to be identified into them one at a time in the sorted order. If a classification model determines that the image to be identified is a violation image, the image is input into the next classification model in the sorted order for further identification. If a classification model determines that the image to be identified is a non-violation image, the subsequent classification models no longer process it, and the image to be identified is output as a non-violation image.

In the embodiment of the present specification, a plurality of classification models with differences in the sizes of corresponding input images are deployed on a server side. The server may sort the plurality of classification models in order of the input image size from small to large. The image size referred to herein may refer to the number of pixels of the image in the horizontal and vertical directions, typically expressed in terms of width and height. For example, 16 x 16 images, 64 x 64 images, etc.

Then, the server can scale the image to be identified according to the input image size corresponding to the first classification model among the sorted classification models, input the scaled image to be identified into the first classification model, and determine whether the image to be identified is a violation image based on the classification result output by the first classification model. The server may scale the image to be identified according to a preset scaling ratio, or according to the input image size corresponding to the classification model. The number of scaling levels may also correspond to the number of classification models.

If so, the server scales the image to be identified according to the input image size corresponding to the second classification model among the sorted classification models, inputs the scaled image to be identified into the second classification model, and determines whether the image to be identified is a violation image based on the classification result output by the second classification model, and so on, until the image to be identified is scaled according to the input image size corresponding to the last classification model among the sorted classification models and the scaled image to be identified is input into the last classification model.

Then, the server may determine whether the image to be identified is a violation image based on the classification result output by the last classification model. If yes, the image to be identified is further input into the multi-mode pre-training model for violation identification.

If any classification model among the plurality of classification models outputs a classification result indicating that the image to be identified is a non-violation image, the scaled image to be identified is not input into the classification models that come after that classification model in the sorted order. This is shown in FIG. 2.

FIG. 2 is a flow chart illustrating the identification of an image to be identified by a plurality of classification models in accordance with an exemplary embodiment.

In FIG. 2, the image to be identified, scaled for the first classification model, is input into the first classification model, which judges whether the image to be identified is a violation image. If not, the image to be identified is output as a non-violation image. If yes, the image to be identified, scaled for the second classification model, is input into the second classification model, which judges whether the image to be identified is a violation image. By analogy, the image to be identified, scaled for the Nth classification model, is input into the Nth classification model, which judges whether the image to be identified is a violation image.

Therefore, provided that the omission ratio of the classification models is kept low, even if one of the classification models erroneously identifies a non-violation image as a violation image, the subsequent classification models can identify that image again, and the multi-mode pre-training model performs the final violation identification. In this way, the computing resources required for image recognition are reduced and the image recognition efficiency is improved while the identification accuracy for the image to be identified is maintained.
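The multi-classifier flow of FIG. 2 can be sketched as a sorted cascade with early exit. The sketch rests on several assumptions not fixed by this specification: each classification model is a hypothetical callable paired with a square input size, scaling uses nearest-neighbor sampling, and the multi-mode pre-training model receives the original, unscaled image.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]  # assumed representation: rows of pixel values
Model = Callable[[Image], bool]  # hypothetical: True means "violation image"
SizedClassifier = Tuple[int, Model]  # (square input size, model)


def scale_nearest(image: Image, size: int) -> Image:
    """Nearest-neighbor resize to size x size (an illustrative choice)."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // size][x * in_w // size] for x in range(size)]
        for y in range(size)
    ]


def cascade(
    image: Image, classifiers: Sequence[SizedClassifier], multimodal: Model
) -> bool:
    """Run classifiers from smallest to largest input size, stopping early."""
    # Sort by input size only, so the callables themselves are never compared.
    for size, model in sorted(classifiers, key=lambda item: item[0]):
        if not model(scale_nearest(image, size)):
            # Any "non-violation" verdict ends the cascade immediately.
            return False
    # Every classifier flagged the image: the large model makes the final call.
    return multimodal(image)
```

Ordering by input size means the cheapest model sees every image, while each larger model sees only the images that all smaller models flagged.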

In an embodiment of the present description, the multi-modal pre-training model comprises: visual question-answering model. The server may obtain a plurality of text questions for violation identification. The text question mentioned here may be text input in advance by the user.

Different service requirements correspond to different text questions. For example, if the service requirement is to detect pornographic images, the text questions may include: whether a certain part of a male appears in the image, whether a certain part of a female appears in the image, and the like. For another example, if the service requirement is to detect images of public figures, the text questions may include: whether a flag representing a country or region appears in the image, whether a public figure appears in the image, and the like. The specification does not limit the specific content of the text questions.

Then, the server may input the image to be identified and the plurality of text questions into the visual question-answering model, and determine whether the image to be identified is a violation image based on the reply texts for the plurality of text questions. The reply text referred to herein may be a positive answer or a negative answer, e.g., yes or no, present or absent, correct or incorrect. The specification does not limit the specific content of the reply text.

Specifically, if at least one of the reply texts for the plurality of text questions is a positive answer, the image to be identified is determined to be a violation image. If the reply texts for all of the text questions are negative answers, the image to be identified is determined to be a non-violation image.

Further, the server may arrange the plurality of text questions into a text question list, input the image to be identified and the text question list into the visual question-answering model, and determine the reply text of each text question in turn, following the order of the text questions in the list. If the reply text of a text question is a negative answer, the server goes on to determine the reply text of the next text question. If the reply text of any text question in the list is a positive answer, the image to be identified is determined to be a violation image, and the reply texts of the remaining text questions are not determined.
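The sequential questioning described above can be sketched as a loop that stops at the first positive answer. The visual question-answering model is represented by a hypothetical callable `answer(image, question)` that returns a reply text; its name, its signature, and the set of strings treated as positive answers are all assumptions, not the interface of any particular model.

```python
from typing import Callable, Iterable

# Assumed vocabulary of positive reply texts; the source does not fix one.
POSITIVE_ANSWERS = {"yes", "present", "correct"}


def is_violation(
    image: object,
    questions: Iterable[str],
    answer: Callable[[object, str], str],
) -> bool:
    """Ask the questions in list order; stop at the first positive answer."""
    for question in questions:
        reply = answer(image, question).strip().lower()
        if reply in POSITIVE_ANSWERS:
            # One positive answer is enough; later questions are never asked.
            return True
    # Every reply was negative.
    return False
```

Placing the questions most likely to receive a positive answer at the front of the list would reduce the average number of model calls per violation image; this ordering heuristic is a suggestion and is not stated in this specification.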

In practical applications, some users frequently send violation images or engage in other violations. The server can record these users in a user blacklist and directly determine that the images to be identified from these users are violation images. Other users are trusted. The server can record the trusted users in a user whitelist and directly determine that the images to be identified from these users are non-violation images. This improves the efficiency of image recognition.

In the embodiment of the present disclosure, the server may store the user blacklist and the user whitelist in advance. The server side can judge whether the user blacklist contains the user information of the image to be identified or not, and if so, the image to be identified is determined to be the illegal image.

Of course, the server may also determine whether the user whitelist contains the user information of the image to be identified, and if so, determine that the image to be identified is a non-violation image.
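The blacklist and whitelist checks can be sketched as a lookup that either settles the result or defers to the models. The function name, the use of a user ID string as the "user information", and the choice to check the blacklist before the whitelist are assumptions; this specification does not say which list takes precedence if a user appears in both.

```python
from typing import Optional, Set


def check_user_lists(
    user_id: str, blacklist: Set[str], whitelist: Set[str]
) -> Optional[bool]:
    """Return True (violation), False (non-violation), or None (undecided)."""
    # Assumption: the blacklist is consulted first.
    if user_id in blacklist:
        return True
    if user_id in whitelist:
        return False
    # The user is on neither list, so the image must go through the models.
    return None
```

A caller would invoke the classification model only when the result is `None`, so images from listed users cost a single set lookup.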

In practical applications, some images uploaded by users contain too little image information, for example a pure white image or a pure black image, and running recognition on such images wastes computing resources. Based on this, the server can first perform edge detection on the image to be identified and determine the edge information in the image to be identified, so as to determine whether the image to be identified is a violation image.

In the embodiment of the present disclosure, the server may perform edge detection on the image to be identified, and determine edge information of an image target included in the image to be identified.

Then, the server side can determine whether the image to be identified is a violation image according to the quantity of the edge information of the image target contained in the image to be identified.

Specifically, the logic by which the server determines whether the image to be identified is a violation image differs between service requirements. For example, under one service requirement, an image containing little edge information of image targets is treated as a meaningless upload, and the server can directly determine that the image to be identified is a violation image. Under this service requirement, if the amount of edge information of the image targets contained in the image to be identified is smaller than a set amount, the image to be identified is determined to be a violation image.

For another example, under a different service requirement, an image containing little edge information of image targets is considered to contain no violation content, and the server can directly determine that the image to be identified is a non-violation image. Under this service requirement, if the amount of edge information of the image targets contained in the image to be identified is smaller than the set amount, the image to be identified is determined to be a non-violation image.

There are various algorithms for edge detection, such as the Canny algorithm and the Roberts operator. The present description does not limit the algorithm used for edge detection.
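As an illustration of the edge-based pre-check, the sketch below applies the Roberts cross operator to a grayscale image stored as a list of rows and counts the pixels whose gradient magnitude exceeds a threshold. Using the edge-pixel count as the "amount of edge information", the function names, and the default values of `threshold` and `min_edge_pixels` are all assumptions made for the example; the specification leaves both the algorithm and the set amount open.

```python
def count_edge_pixels(gray, threshold=30.0):
    """Count pixels whose Roberts-cross gradient magnitude exceeds `threshold`.

    `gray` is a list of rows, each row a list of intensities in 0..255.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    count = 0
    for y in range(height - 1):
        for x in range(width - 1):
            # The two diagonal differences of the Roberts cross operator.
            gx = gray[y][x] - gray[y + 1][x + 1]
            gy = gray[y][x + 1] - gray[y + 1][x]
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                count += 1
    return count


def has_too_few_edges(gray, min_edge_pixels=10, threshold=30.0):
    """True when the image carries less edge information than the set amount."""
    return count_edge_pixels(gray, threshold) < min_edge_pixels
```

Whether a `True` result maps to a violation image or a non-violation image depends on the service requirement, as described above. A pure white or pure black image yields a count of zero under this sketch.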

It should be noted that the visual question-answering model includes a BLIP model.

In the embodiment of the present disclosure, the server may determine whether the image to be identified is a violation image through multiple steps, as shown in fig. 3.

FIG. 3 is a flow chart illustrating a method of determining whether an image to be identified is a violation image in accordance with an exemplary embodiment.

In fig. 3, the server may determine whether the user blacklist and the user whitelist include user information of the image to be identified, and if not, perform edge detection on the image to be identified, and determine whether the image to be identified is a violation image.

If yes, the image to be identified is scaled to the input image size corresponding to the first classification model and input into the first classification model, which judges whether the image to be identified is a violation image. If not, the image to be identified is output as a non-violation image. If yes, the image to be identified is scaled to the input image size corresponding to the second classification model and input into the second classification model, which judges whether the image to be identified is a violation image. And so on, until the image to be identified is scaled to the input image size corresponding to the Nth classification model and input into the Nth classification model, which judges whether the image to be identified is a violation image.

If yes, the image to be identified is further input into the multi-mode pre-training model for violation identification, which judges whether the image to be identified is a violation image.
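The overall flow of fig. 3 can be read as a chain of cheap stages that short-circuits as soon as one of them reaches a verdict, with the multi-mode pre-training model as the last resort. The sketch below is one hedged way to express that chain. The function `recognize` and the convention that a stage returns `True` (violation), `False` (non-violation) or `None` (undecided, continue) are assumptions introduced for illustration.

```python
from typing import Callable, Optional, Sequence

# A stage returns True (violation), False (non-violation), or None (undecided).
Stage = Callable[[object], Optional[bool]]


def recognize(image: object,
              stages: Sequence[Stage],
              multimodal_model: Callable[[object], bool]) -> bool:
    """Run the cheap stages in order; call the large model only if none decides."""
    for stage in stages:
        verdict = stage(image)
        if verdict is not None:
            # A definite verdict ends the flow; later stages are never executed.
            return verdict
    # Every stage left the image as a suspected violation: final judgment here.
    return multimodal_model(image)
```

Under this convention a classification model that flags the image returns `None` so that the next stage runs, and returns `False` when it judges the image to be a non-violation image.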

In the above method, the image to be identified is input into the classification model, and whether it is a violation image is determined based on the classification result output by the classification model. Images that are easy to recognize are thereby resolved at this stage, and images that are harder to recognize are screened out for further processing. Only if the image to be identified is determined to be a violation image is it, as an image that is harder to recognize, further input into the multi-mode pre-training model for violation identification, which reduces the computing resources required for image recognition and improves its efficiency.

Corresponding to the embodiment of the method for image recognition described above, the present specification also provides an embodiment of an apparatus for image recognition.

Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device in which an image recognition apparatus is located, according to an exemplary embodiment. At the hardware level, the device includes a processor 402, an internal bus 404, a network interface 406, a memory 408, and a non-volatile memory 410, and may of course also include other required hardware. One or more embodiments of the present description may be implemented in software, for example by the processor 402 reading a corresponding computer program from the non-volatile memory 410 into the memory 408 and then running it. Of course, in addition to software implementation, one or more embodiments of the present disclosure do not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic units, but may also be hardware or a logic device.

Referring to fig. 5, fig. 5 is a block diagram illustrating an apparatus for image recognition according to an exemplary embodiment. The image recognition device can be applied to the electronic equipment shown in fig. 4 to realize the technical scheme of the specification. Wherein the image recognition device may include:

an acquisition module 500, configured to acquire an image to be identified;

the input module 502 is configured to input the image to be identified into the classification model, and determine whether the image to be identified is a violation image based on a classification result output by the classification model;

and the recognition module 504 is configured to, if the image to be recognized is determined to be a violation image, further input the image to be recognized into the multimodal pre-training model to perform violation recognition.

Optionally, the size of the input image corresponding to the classification model is smaller than the size of the input image of the multimodal pre-training model, and the input module 502 is specifically configured to perform scaling processing on the image to be identified according to the size of the input image corresponding to the classification model, input the scaled image to be identified into the classification model, and determine whether the image to be identified is an offending image based on a classification result output by the classification model.

Optionally, a plurality of classification models with different corresponding input image sizes are deployed on the server, and the input module 502 is specifically configured to: sort the plurality of classification models in ascending order of input image size; scale the image to be identified according to the input image size corresponding to the first classification model among the sorted classification models, input the scaled image into the first classification model, and determine whether the image to be identified is a violation image based on the classification result output by the first classification model; if yes, scale the image to be identified according to the input image size corresponding to the second classification model among the sorted classification models, input the scaled image into the second classification model, and determine whether the image to be identified is a violation image based on the classification result output by the second classification model; and so on, until the image to be identified has been scaled according to the input image size corresponding to the last classification model among the sorted classification models and the scaled image has been input into the last classification model.

Optionally, the identifying module 504 is specifically configured to determine whether the image to be identified is a violation image based on a classification result output by the last classification model; if yes, the image to be identified is further input into the multi-mode pre-training model to conduct violation identification.

Optionally, the identifying module 504 is further specifically configured to, if it is determined that any one of the multiple classification models outputs a classification result that the image to be identified is a non-offending image, not input the scaled image to be identified into other classification models located after the sorting location where the classification model is located.
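The sorted cascade with early exit described above might look like the following sketch. `SizedClassifier`, `run_cascade` and the injected `resize` callable are hypothetical names, and ordering the models by pixel area (width times height) is an assumption, since the specification only says the models are sorted by input image size.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class SizedClassifier:
    name: str
    input_size: Tuple[int, int]        # (width, height) the model expects
    predict: Callable[[object], bool]  # True means "suspected violation"


def run_cascade(image: object,
                classifiers: Sequence[SizedClassifier],
                resize: Callable[[object, Tuple[int, int]], object]) -> Tuple[bool, List[str]]:
    """Return (still_suspected, names_of_models_actually_run)."""
    # Smallest input first, so the cheapest model sees every image.
    ordered = sorted(classifiers, key=lambda c: c.input_size[0] * c.input_size[1])
    visited: List[str] = []
    for clf in ordered:
        visited.append(clf.name)
        scaled = resize(image, clf.input_size)  # scale to this model's input size
        if not clf.predict(scaled):
            # Non-violation verdict: models later in the ordering are skipped.
            return False, visited
    # Every model, including the last one, flagged the image.
    return True, visited
```

Only when the first element of the result is `True` would the image go on to the multi-mode pre-training model. The list of visited models is returned purely to make the early exit observable.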

Optionally, the multi-modal pre-training model includes a visual question-answering model, and the recognition module 504 is specifically configured to obtain a plurality of text questions for violation recognition, input the image to be recognized and the plurality of text questions into the visual question-answering model, and determine whether the image to be recognized is a violation image based on the reply texts for the plurality of text questions.
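A hedged sketch of the question-based check follows. The `vqa_answer` callable stands in for whatever visual question-answering model is deployed and is not a real library API. The rule that any affirmative reply marks the image as a violation image is an assumption, because this passage only states that the decision is based on the reply texts.

```python
from typing import Callable, Iterable, Sequence


def is_violation_by_vqa(image: object,
                        questions: Iterable[str],
                        vqa_answer: Callable[[object, str], str],
                        positive_replies: Sequence[str] = ("yes",)) -> bool:
    """Ask each text question about the image and inspect the reply text."""
    for question in questions:
        reply = vqa_answer(image, question).strip().lower()
        # Assumption: a single affirmative reply is enough to flag the image.
        if reply in positive_replies:
            return True
    return False
```

The questions themselves would be prepared per violation category. A stricter policy, such as requiring several affirmative replies, fits the same shape by counting replies instead of returning on the first match.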

Optionally, the classification model is tuned and trained with minimizing the miss rate as the optimization target, where the miss rate is the ratio of the number of violation images classified as non-violation images to the total number of violation images.
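The miss rate defined above can be computed directly from labels and predictions, as in the sketch below. The second function shows one illustrative way of favouring a low miss rate, namely choosing a decision threshold on the classifier's score. It is an assumption for the example and not the training procedure of the specification, and the names `miss_rate` and `pick_threshold` are hypothetical.

```python
def miss_rate(labels, predictions):
    """Missed violations divided by all violations (True means violation)."""
    total_violations = sum(1 for y in labels if y)
    if total_violations == 0:
        raise ValueError("miss rate is undefined without any violation images")
    missed = sum(1 for y, p in zip(labels, predictions) if y and not p)
    return missed / total_violations


def pick_threshold(labels, scores, candidates):
    """Return (threshold, miss_rate) with the lowest miss rate.

    Ties go to the highest threshold, which passes the fewest images on
    to the more expensive later stages.
    """
    best = None
    for t in sorted(candidates):
        rate = miss_rate(labels, [s >= t for s in scores])
        if best is None or rate <= best[1]:
            best = (t, rate)
    return best
```

A low miss rate matters here because an image wrongly judged non-violation by a classification model never reaches the multi-mode pre-training model, whereas a false alarm is still re-examined by the later stages.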

Optionally, the obtaining module 500 is further specifically configured to determine whether the user blacklist includes user information of the image to be identified and, if yes, determine that the image to be identified is a violation image; and to determine whether the user whitelist includes user information of the image to be identified and, if yes, determine that the image to be identified is a non-violation image.

Optionally, the obtaining module 500 is specifically further configured to perform edge detection on the image to be identified, determine edge information of an image target included in the image to be identified, and determine whether the image to be identified is a violation image according to the number of edge information of the image target included in the image to be identified.

Optionally, the visual question-answering model includes: BLIP model.

The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.

Since the device embodiments essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present description. Those of ordinary skill in the art can understand and implement this without creative effort.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.

In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

The user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of related data is required to comply with the relevant laws and regulations and standards of the relevant country and region, and is provided with corresponding operation entries for the user to select authorization or rejection.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when … …" or "upon … …" or "in response to determining", depending on the context.

The foregoing description of the preferred embodiment(s) is (are) merely intended to illustrate the embodiment(s) of the present invention, and it is not intended to limit the embodiment(s) of the present invention to the particular embodiment(s) described.

Claims

1. An image recognition method, the method is applied to a server, a pre-trained multi-mode pre-training model and a classification model are deployed on the server, the multi-mode pre-training model comprises a larger number of model parameters than the classification model, and the method comprises the following steps:

acquiring an image to be identified;

2. The method of claim 1, wherein the classification model corresponds to an input image size that is smaller than an input image size of the multimodal pre-training model;

3. The method of claim 2, wherein a plurality of classification models with corresponding differences in input image size are deployed on the server;

4. The method of claim 3, further inputting the image to be identified into the multimodal pre-training model for violation identification if the image to be identified is determined to be a violation image, comprising:

5. A method as claimed in claim 3, the method further comprising:

6. The method of claim 1, the multimodal pre-training model comprising: a visual question-answering model;

acquiring a plurality of text questions for violation identification;

7. The method of claim 1, wherein the classification model is optimally tuned with a minimum miss rate, which is a ratio of the number of offending images classified as non-offending images to the number of offending images, as an optimization objective.

8. The method of claim 1, prior to inputting the image to be identified into the classification model, the method further comprising:

judging whether the user blacklist contains the user information of the image to be identified, and if so, determining that the image to be identified is a violation image;

9. The method of claim 1, prior to inputting the image to be identified into the classification model, the method further comprising:

10. The method of claim 6, the visual question-answering model comprising: BLIP model.

11. The method of claim 1, the image to be identified comprising: an image drawn by a user using an AI-based drawing program.

12. An image recognition device, the device is applied to a server, a multi-mode pre-training model and a classification model which are trained in advance are deployed on the server, the number of model parameters contained in the multi-mode pre-training model is greater than the number of model parameters contained in the classification model, and the device comprises:

The acquisition module is used for acquiring the image to be identified;

the input module is used for inputting the image to be identified into the classification model and determining whether the image to be identified is a violation image or not based on a classification result output by the classification model;

and the identification module is used for further inputting the image to be identified into the multi-mode pre-training model for violation identification if the image to be identified is determined to be the violation image.

13. An electronic device comprises a communication interface, a processor, a memory and a bus, wherein the communication interface, the processor and the memory are connected with each other through the bus;

the memory stores machine readable instructions, the processor executing the method of any of claims 1 to 11 by invoking the machine readable instructions.

14. A machine-readable storage medium storing machine-readable instructions which, when invoked and executed by a processor, implement the method of any one of claims 1 to 11.