CN111898544A - Text image matching method, device and equipment, and computer storage medium
- Publication number
- CN111898544A (application number CN202010757678.XA)
- Authority
- CN
- China
- Prior art keywords
- character
- images
- similarity
- image
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V30/418 — Document matching, e.g. of document images
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/22 — Matching criteria, e.g. proximity measures
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
- G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V30/153 — Segmentation of character regions using recognition of characters or words
- G06V30/10 — Character recognition
Abstract
The application discloses a text image matching method, apparatus, device, and computer storage medium, relates to the field of computer technology, and is used to improve the accuracy of homologous matching of text images. The method comprises the following steps: performing text region detection on each of two text images to be matched to obtain a text region feature matrix of each text image; for each text image, superimposing the pixel matrix of the text image with its text region feature matrix to obtain a superposition matrix corresponding to that text image; obtaining the visual similarity between the two text images according to the superposition matrices respectively corresponding to the two text images; and determining whether the two text images are homologous images according to the visual similarity. Because the visual similarity between the two text images is judged in combination with the positions of the text regions, the attention paid to the text regions is increased, which improves the reliability of the visual similarity and the accuracy of homologous matching.
Description
Technical Field
The application relates to the field of computer technology, and in particular to the field of image processing, and provides a text image matching method, apparatus, device, and computer storage medium.
Background
At present, a large amount of data in the internet era is generated by users on demand. For example, a text image published on a website may be smeared, modified, or otherwise edited by different users, so that many similar text images are generated and spread across the network. A user may sometimes need to obtain other homologous images from a given image, such as retrieving the matching source image of a smeared or modified text image, or retrieving the website on which the text image was published. The correct text image can therefore be retrieved only if homologous matching of text images is performed accurately. In the actual retrieval process, however, the text image has often been produced by operations such as smearing or modification, and these operations add interference information to the text image, which interferes with homologous matching and increases its difficulty.
Therefore, how to improve the accuracy of homologous matching of text images is a problem that currently needs to be considered.
Disclosure of Invention
The embodiments of the application provide a text image matching method, apparatus, device, and computer storage medium, which are used to improve the accuracy of homologous matching of text images.
In one aspect, a text image matching method is provided, the method comprising:
performing text region detection on each of two text images to be matched to obtain a text region feature matrix of each text image, wherein the text region feature matrix represents the region where the text is located;
for each text image, superimposing the pixel matrix of the text image with its text region feature matrix to obtain a superposition matrix corresponding to that text image;
obtaining the visual similarity between the two text images according to the superposition matrices respectively corresponding to the two text images;
and determining whether the two text images are homologous images according to the visual similarity.
In one aspect, a text image matching apparatus is provided, the apparatus comprising:
a text detection unit, configured to perform text region detection on each of two text images to be matched to obtain a text region feature matrix of each text image, the text region feature matrix representing the text region where the text is located;
a superposition unit, configured to superimpose, for each text image, the pixel matrix of the text image with its text region feature matrix to obtain a superposition matrix corresponding to that text image;
a visual similarity obtaining unit, configured to obtain the visual similarity between the two text images according to the superposition matrices respectively corresponding to the two text images;
and a homologous image determining unit, configured to determine whether the two text images are homologous images according to the visual similarity.
Optionally, the text similarity obtaining unit is configured to: obtain the edit distance between the pieces of text information respectively corresponding to the two text images, the edit distance being the number of editing steps required to convert one piece of text information into the other; and determine the text similarity between the two text images based on the edit distance.
Optionally, the text detection unit is configured to: respectively extracting the characteristics of the two character images to obtain the image characteristics corresponding to each character image; detecting characters included in each character image according to the image characteristics corresponding to each character image to obtain a plurality of text detection boxes included in each character image; the text detection box represents a text area where the text is located; and respectively obtaining a character area characteristic matrix corresponding to each character image according to the plurality of text detection boxes included in each character image.
Optionally, the visual similarity obtaining unit is configured to:
determining the visual similarity between the two character images through a trained visual similarity determination model according to the superposition matrixes respectively corresponding to the two character images;
the visual similarity determination model is obtained by training a plurality of image sample pairs, each image sample pair comprises two character and image samples, each image sample pair is marked with the visual similarity between the two character and image samples, and the character and image samples included in the image sample pairs are images obtained by adding interference information on corresponding source character and image.
Optionally, the visual similarity determination model includes two feature extraction submodels with the same model structure and a similarity determination submodel; the apparatus further comprises a model training unit for:
training the visual similarity determination model for multiple times through the multiple image sample pairs to obtain the visual similarity determination model; wherein, each training process comprises the following steps:
respectively extracting the features of the two character image samples included in each image sample pair through the two feature extraction submodels to obtain the feature vectors of the two character image samples included in each image sample pair;
determining the visual similarity between the two text image samples included in each image sample pair through the similarity determination sub-model based on the feature vectors of the two text image samples included in each image sample pair;
determining whether a loss value of the visual similarity determination model is smaller than a preset loss threshold, wherein the loss value represents the sum, over the image sample pairs, of the differences between the visual similarity determined by the similarity determination sub-model for each image sample pair and the annotated visual similarity of that image sample pair;
if the loss value of the visual similarity determination model is not smaller than a preset loss threshold value, adjusting model parameters of the visual similarity determination model according to the loss value, and performing the next training process on the visual similarity determination model after adjustment; or,
and if the loss value of the visual similarity determination model is smaller than a preset loss threshold value, ending the training process.
Optionally, the visual similarity obtaining unit is configured to: determining Euclidean distances between the feature vectors respectively corresponding to the two character images; and determining the visual similarity between the two character images according to the Euclidean distance.
In one aspect, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the above methods when executing the computer program.
In one aspect, a computer storage medium is provided having computer program instructions stored thereon that, when executed by a processor, implement the steps of any of the above-described methods.
In one aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of any of the methods described above.
In the embodiment of the application, when text images are matched for homology, text region detection is performed on each of the two text images to obtain the text region features of each text image; after the pixel matrix of each text image is superimposed with its text region feature matrix, the visual similarity between the two text images is judged according to the superposition matrices, and it is then determined whether the two text images are homologous images. In this way, when the visual similarity is measured, the judgment combines the image features of the text image with the position of the text region within it. Since most of the content of a text image is text, judging the visual similarity between the two images in combination with the positions of the text regions improves the reliability of the obtained visual similarity and thus the accuracy of homologous matching.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only the embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIGS. 1 a-1 c are schematic diagrams illustrating comparison of a text image to generate a plurality of homologous text images according to an embodiment of the present application;
fig. 2 is a schematic view of a scenario provided in an embodiment of the present application;
fig. 3 is a schematic view of another scenario provided in the embodiment of the present application;
fig. 4 is a schematic flowchart of a text-image matching method according to an embodiment of the present application;
fig. 5 is a schematic flowchart of performing text region detection according to an embodiment of the present application;
fig. 6 is a schematic diagram illustrating a pixel matrix and a character region feature matrix according to an embodiment of the present disclosure;
fig. 7 is a network structure diagram of a visual similarity determination model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a training process of a visual similarity determination model according to an embodiment of the present application;
fig. 9 is another schematic flow chart of a text-image matching method according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a text-image matching apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
For the convenience of understanding the technical solutions provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained first:
character and image: the image is mainly the text content in the image.
Homologous image: in the embodiments of the present application, a plurality of text images having the same source image are referred to. For a character image A, after image processing means such as smearing, adding a filter or blurring processing are carried out, a character image B, a character image C and a character image D are obtained, and then the character image B, the character image C and the character image D can be called as homologous images. As shown in fig. 1a to 1c, which are schematic diagrams showing the comparison of a plurality of homologous character images generated from a character image, fig. 1a is a source character image, fig. 1b is a smeared character image obtained by smearing on the basis of the source character image, fig. 1c is a watermarked character image obtained by adding a watermark on the basis of the source character image, and the smeared character image and the watermarked character image are homologous images.
Visual similarity: the similarity between two images can be measured from the visual sense of human, and particularly, when the visual similarity between two images is determined, the similarity is measured according to visual features, such as gray scale, brightness, chroma and the like, extracted from the images.
Character similarity: the similarity of the text contents of the two images is measured from the perspective of the text contents included in the two text images.
Convolutional Neural Networks (CNN): the method is a Feed-Forward neural network (Feed-Forward neural Networks) containing convolution calculation and having a deep structure, the convolution neural network can learn grid-like topologic features such as pixels with small calculation amount, the effect is stable, and no additional feature engineering (feature engineering) requirement is required on data. In general, convolutional Neural networks can be Residual Neural networks (resnets) and google nets, among other Network structures.
Feature matrix (feature map): or called a feature map, is obtained by extracting through the above feature extraction network, such as a convolutional neural network, and is also a pixel matrix in essence, each element in the pixel matrix can be regarded as a pixel point on the feature map, and the value of the position of the pixel point is the feature value of an area or a pixel point in the original image.
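As a purely illustrative sketch of how a convolution turns a pixel matrix into a feature matrix (the function name, kernel values, and sizes below are assumptions not taken from the embodiment):

```python
import numpy as np

def conv2d_valid(pixels, kernel):
    """Slide the kernel over the pixel matrix ('valid' padding); each output
    element is the feature value of the image region the kernel covered."""
    kh, kw = kernel.shape
    h, w = pixels.shape
    feature_map = np.zeros((h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(feature_map.shape[0]):
        for j in range(feature_map.shape[1]):
            feature_map[i, j] = np.sum(pixels[i:i + kh, j:j + kw] * kernel)
    return feature_map

# A 4x4 pixel matrix convolved with a 3x3 averaging kernel yields a 2x2 feature matrix.
example = conv2d_valid(np.arange(16, dtype=np.float32).reshape(4, 4), np.ones((3, 3)) / 9.0)
```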
However, in the actual retrieval process, the text image has often been produced by operations such as smearing or modification, and these operations add interference information to the text image, which interferes with homologous matching of the text image and increases its difficulty.
Considering that many text images have no particularly distinctive visual features, homologous images may differ greatly in appearance after image processing such as smearing or modification, and matching directly on the visual features of the text images is likely to produce mismatches. Taking smearing as an example, differently smeared text images of the same source can differ greatly in visual appearance, and the smearing is generally located in the middle region of the image. In order to recognize such visually different images as the same class, a model learned by machine learning tends to place its focus on the edge regions of the images; but the edge regions of most such images are almost the same, so performing homologous matching only according to the visual features of the text images easily produces mismatches. On the other hand, no matter what image processing is applied, the distribution of the text regions in a text image does not change much, so homologous matching can be performed in combination with the distribution of the text regions in the text images.
In view of this, an embodiment of the present application provides a text image matching method. When text images are matched for homology, text region detection is performed on each of the two text images to obtain the text region features of each image; after the pixel matrix of each text image is superimposed with its text region feature matrix, the visual similarity between the two text images is determined according to the superposition matrices, and it is then determined whether the two text images are homologous images. In this way, when the visual similarity is measured, the judgment combines the image features of the text image with the position of the text region within it. Since most of the content of a text image is text, judging the visual similarity between the two images in combination with the positions of the text regions improves the reliability of the obtained visual similarity and thus the accuracy of homologous matching.
In addition, considering that most of the content of a text image is text, the text content can be recognized, so that whether two text images are homologous images is judged comprehensively in combination with the text similarity between the two text images, achieving more accurate homologous matching and reducing the mismatching rate.
After introducing the design concept of the embodiment of the present application, some simple descriptions are provided below for application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In a specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The scheme provided by the embodiment of the application can be applied to a scene of image retrieval, and is a scene schematic diagram provided by the embodiment of the application as shown in fig. 2. A database 201 and a text-image matching device 202 may be included in the scene.
The database 201 is used as an image database for storing text images, and when a text image needs to be retrieved, the text image can be read from the database 201 to match with a homologous image.
The text image matching device 202 may be, for example, a computer device with certain processing capability, such as a Personal Computer (PC), a notebook computer, or a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
The text image matching device 202 may include one or more processors 2021, a memory 2022, and an I/O interface 2023 for interacting with other devices, among other components. In addition, the text image matching device 202 may further be configured with a database 2024, which may be used to store model data and the like involved in the scheme provided in the embodiment of the present application. The memory 2022 of the text image matching device 202 may store program instructions of the text image matching method provided in the embodiment of the present application; when executed by the processor 2021, these program instructions can be used to implement the steps of the text image matching method provided in the embodiment of the present application, so as to determine whether two text images are homologous images. The text image matching device 202 may further include a display panel 2025 for enabling visual interaction.
In practical applications, a text image may be uploaded through the text image matching device 202, and the text image matching device 202 may read text images from the database 201 and match the uploaded text image against them, so as to find homologous images of the uploaded text image in the database 201.
The database 201 and the text-image matching device 202 may be in direct or indirect communication connection via one or more networks 203. The network 203 may be a wired network or a Wireless network, for example, the Wireless network may be a mobile cellular network, or may be a Wireless-Fidelity (WIFI) network, or may also be other possible networks, which is not limited in this embodiment of the present application.
Fig. 3 is a schematic view of another scenario provided in the embodiment of the present application. A terminal 301, a server 302 and a database 303 may be included in the scenario.
The terminal 301 may be a terminal used for image retrieval; a client may be installed in the terminal 301, and the client may be, for example, an image retrieval client or a social platform client. The server 302 may be the backend server corresponding to the client installed in the terminal 301. The database 303 is an image database used for storing text images; when a text image needs to be retrieved, text images can be read from the database 303 for homologous-image matching.
For example, the client installed in the terminal 301 may be an image retrieval client, and a text image to be retrieved is input through the image retrieval client and confirmed to be retrieved, and then a retrieval request may be submitted to the server 302, and the server 302 retrieves a text image belonging to a homologous image with the text image from the database 303 based on the text image provided by the terminal 301.
The terminal 301 may be, for example, a Personal Computer (PC), a notebook computer, a tablet computer (PAD), or the like. The server 302 may be a computer device with certain processing capability, for example, a Personal Computer (PC), a notebook computer, or a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
The terminals 301, servers 302, and databases 303 may be communicatively coupled directly or indirectly through one or more networks 303. The network 303 may be a wired network or a Wireless network, for example, the Wireless network may be a mobile cellular network, or may be a Wireless-Fidelity (WIFI) network, and of course, may also be other possible networks, which is not limited in this embodiment of the present application.
Of course, the method provided in the embodiment of the present application is not limited to be used in the application scenarios shown in fig. 2 and 3, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenarios shown in fig. 2 and 3 will be described in the following method embodiments, and will not be described in detail herein. Hereinafter, the technology related to the embodiments of the present application will be briefly described.
In an alternative implementation, the text image matching process may be implemented by using a physical device in combination with an Artificial Intelligence (AI) technology, and in another alternative implementation, the text image matching process may also be implemented by using a Cloud technology (Cloud technology) in combination with an AI technology.
Cloud technology is a hosting technology that unifies a series of resources such as hardware, software, and networks within a wide area network or a local area network to realize the computation, storage, processing, and sharing of data. Cloud technology is the general term for the network technology, information technology, integration technology, management platform technology, application technology, and so on applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important supporting technology, because the background services of network systems, such as video websites, picture websites, and many portal websites, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, every article may have its own identification mark that needs to be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industry data need strong backend system support, which can only be realized through cloud computing. Specifically, in the embodiment of the application, besides executing the program flow on physical computing resources and storing data on physical storage resources, text image matching may also be performed using computing resources provided by the cloud, and the data involved in the text image matching process may be stored using storage resources provided by the cloud.
AI is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes directions such as computer vision, speech processing, Natural Language Processing (NLP), and machine learning/deep learning. The technical scheme provided by the embodiment of the application mainly relates to the machine learning/deep learning technologies of artificial intelligence.
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning. Specifically, in the embodiment of the present application, text images may be matched using a model obtained by machine learning.
Referring to fig. 4, a flowchart of a text image matching method according to an embodiment of the present application is schematically shown, where the text image matching method can be executed by the text image matching device 202 in fig. 2 or the server 302 in fig. 3, and the flow of the method is described as follows.
Step 401: and acquiring a character image pair to be matched.
In the embodiment of the application, when a homologous image of a character image needs to be retrieved and acquired, the character image and one of the character images in the image library can be used as a text image pair to perform homologous image matching, so as to determine the character image in the image library which is the homologous image with the character image. Since the matching process of any text image pair is the same, the matching process of a text image pair is described as an example.
Step 402: and respectively carrying out character area detection on the two character images to be matched to obtain a character area characteristic matrix of each character image, wherein the character area characteristic matrix represents the character area where the character is located. In the embodiment of the application, considering that most contents in the character images are character contents, no matter what image processing is performed, the change of the distribution of the character areas in the character images is not large, so that homologous matching can be performed by combining the distribution of the character areas in the character images, and therefore, for two character images included in one character image pair to be matched, character area detection can be performed on the two character images, and the character area feature matrix of each character image is obtained. The character area feature matrix can represent the character area where the characters are located.
For two text images included in one text image pair, the text region detection process is substantially the same, so that the text region detection process is described here by taking one text image as an example, and the text region detection process of the other text image is not described again.
For a text image, the text region where the text in the image is located can be detected by a text detection algorithm, thereby achieving text region detection. The text detection algorithm may be, for example, the EAST (Efficient and Accurate Scene Text detector) algorithm, the PixelLink algorithm, or the SOTD (Self-Organized Text Detection) algorithm; of course, other possible algorithms may also be used, which is not limited in this embodiment of the present application.
Specifically, as shown in fig. 5, a schematic flow chart of text region detection on a text image by taking EAST algorithm as an example is shown. The character image may be first subjected to feature extraction to obtain a feature map corresponding to the character image, the feature map includes the extracted image features of the character image, and the process of feature extraction may be implemented by CNN, for example.
And detecting characters included in the character image according to the acquired feature map corresponding to the character image to obtain a plurality of text detection boxes included in the character image, wherein the text detection boxes represent character areas where the characters are located. The process of detecting the characters in the character image based on the feature map can also be understood as a process of classifying the character image, namely dividing pixels into character pixels and non-character pixels, further determining which are character areas and which are non-character areas in the character image, and further outputting a text detection box where the character areas are located.
Furthermore, a character area feature matrix corresponding to the character image can be obtained according to a plurality of text detection boxes included in the character image. Based on the text detection box, it can be known which areas are text areas and which are non-text areas, and different feature values are used to represent pixels of the non-text areas and pixels of the text areas, so that a text area feature matrix corresponding to the text image can be obtained, for example, the feature value of a text area pixel is 1, the feature value of a non-text area pixel is 0, or the feature value of a text area pixel is 0, and the feature value of a non-text area pixel is 1. As shown in fig. 5, to distinguish different feature values, different colors are used for distinction, and a white portion represents a feature value of a character region and a black portion represents a feature value of the remaining portion of the text image except the character region.
Through the above process, the character area feature matrix of two character images included in one character image pair can be obtained.
Step 403: and respectively superposing the pixel matrix of the character image and the character area characteristic matrix aiming at each character image to obtain a superposition matrix corresponding to each character image.
In the embodiment of the application, in order to increase the reliability of the visual similarity of character image matching, when the visual similarity between two character images is measured, character region characteristics are combined for comprehensive judgment. Therefore, when the visual similarity between two text images is judged, the pixel matrix of the text images and the character area characteristic matrix can be superposed respectively to obtain a superposition matrix corresponding to each text image.
Fig. 6 is a schematic diagram illustrating a pixel matrix of a text image and a text region feature matrix being superimposed. For a computer, the text image is substantially stored in the form of a pixel matrix, and the pixel matrix may be a pixel matrix including three color channels of red (R), green (G), and blue (B), which is specifically shown in fig. 6, or may also be a pixel matrix including a brightness-chrominance (YUV) channel.
In a specific application, the text region feature matrix is a single-channel binary matrix, i.e., its values are 0 or 1. In order to be consistent with the value range of the text region feature matrix, the pixel matrix of the text image may be preprocessed in advance. Taking an RGB three-channel matrix as an example, its pixel values generally range from 0 to 255, and the preprocessing may normalize this 0-255 range into the range 0 to 1.
After preprocessing, the pixel matrix and the character region feature matrix can be superposed. The superposition of the pixel matrix and the character area feature matrix is performed by splicing on the channel layer, as shown in fig. 6, after the RGB three-channel matrix and the character area feature matrix of the single channel are superposed, a four-channel superposition matrix can be obtained.
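A minimal sketch of this superposition, assuming an H x W x 3 RGB pixel matrix with values in 0-255 and the binary text region matrix from the previous step (channel ordering and the normalization constant are illustrative assumptions):

```python
import numpy as np

def build_superposition_matrix(rgb_image, region_matrix):
    """Normalize the RGB pixel matrix to [0, 1] and stack the single-channel
    text region feature matrix onto it, giving an H x W x 4 superposition matrix."""
    normalized = rgb_image.astype(np.float32) / 255.0           # value range 0-255 -> 0-1
    region = region_matrix.astype(np.float32)[..., np.newaxis]  # H x W -> H x W x 1
    return np.concatenate([normalized, region], axis=-1)        # splice on the channel axis

# superposed = build_superposition_matrix(rgb_image, region_matrix)  # shape (H, W, 4)
```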
Step 404: and acquiring the visual similarity between the two character images according to the superposition matrixes respectively corresponding to the two character images.
In the embodiment of the application, after the superposition matrix is obtained, the visual similarity between the two text images can be obtained according to the superposition matrix corresponding to the two text images respectively.
The superposition matrix corresponding to the two character images can be used as the input of the trained visual similarity determination model to determine the visual similarity between the two character images. The model training process of the visual similarity determination model will be specifically described later, and therefore, the description thereof is not repeated herein.
The visual similarity determination model may be a twin (Siamese) network model, and its model structure may adopt, for example, a Residual Network (ResNet) structure, a VGG (Visual Geometry Group) network structure, a SqueezeNet network structure, or a ShuffleNet network structure; of course, other possible network structures may also be adopted, which is not limited in this embodiment of the present application.
As shown in fig. 7, for the network structure diagram of the visual similarity determination model provided in the embodiment of the present application, the visual similarity determination model may include two identical feature extraction submodels, that is, a feature extraction submodel 1 and a feature extraction submodel 2 shown in fig. 7, and a similarity determination submodel. The structure of the feature extraction submodel 1 is completely the same as that of the feature extraction submodel 2, and the weight parameters are shared.
As shown in fig. 7, the feature extraction submodel 1 and the feature extraction submodel 2 each include several convolutional layers and a fully connected layer. The number of convolutional layers may be set according to actual conditions or experimental data, which is not limited in this embodiment of the present application.
Specifically, for a superposition matrix corresponding to one text image, feature extraction can be performed on the superposition matrix through a feature extraction submodel to obtain a feature vector corresponding to the text image. And the feature extraction is carried out on the superposition matrix layer by layer through each convolution layer included by the feature extraction submodel, then the feature graph output by the last convolution layer is input into the full-connection layer, and the feature vector corresponding to the text and the image is obtained after the processing of the full-connection layer. The size of the feature vector may be adjusted according to actual conditions or experimental data, and may be, for example, a 1000 × 1 vector.
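The embodiment does not fix a framework or exact layer configuration; the following PyTorch sketch shows one possible feature extraction submodel that takes the four-channel superposition matrix and outputs a 1000-dimensional feature vector. The class name, layer sizes, and input resolution are assumptions for illustration:

```python
import torch
import torch.nn as nn

class FeatureExtractionSubmodel(nn.Module):
    """One branch of the twin (Siamese) network: convolutional layers followed
    by a fully connected layer that outputs a 1000-dimensional feature vector.
    The two branches share this module and therefore share weight parameters."""
    def __init__(self, out_dim=1000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),          # makes the head independent of input size
        )
        self.fc = nn.Linear(128, out_dim)          # fully connected layer -> feature vector

    def forward(self, x):                          # x: (batch, 4, H, W) superposition matrices
        features = self.conv(x).flatten(start_dim=1)
        return self.fc(features)

extractor = FeatureExtractionSubmodel()
vec_a = extractor(torch.randn(1, 4, 224, 224))     # feature vector of text image A
vec_b = extractor(torch.randn(1, 4, 224, 224))     # same shared weights reused for text image B
```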
After the feature vectors corresponding to the two text images are respectively obtained through the above processes, the visual similarity between the two text images can be determined through the similarity determination submodel. Specifically, the similarity determination submodel may calculate the Euclidean distance between the feature vectors corresponding to the two text images, and then determine the visual similarity between the two text images according to this Euclidean distance. The Euclidean distance can be used directly as the measure of visual similarity; for example, if the Euclidean distance is 0.4, then 0.4 can be used directly as the visual similarity value. Alternatively, the visual similarity value may be obtained by applying some conversion to the Euclidean distance; generally speaking, the smaller the Euclidean distance, the higher the visual similarity.
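A small sketch of this computation in the similarity determination submodel; converting the distance to a similarity score via 1/(1+d) is only one possible conversion and is not specified by the embodiment:

```python
import torch
import torch.nn.functional as F

def visual_similarity(vec_a, vec_b):
    """Euclidean distance between the two feature vectors, converted so that
    a smaller distance yields a higher similarity (illustrative mapping)."""
    distance = F.pairwise_distance(vec_a, vec_b, p=2)   # Euclidean distance
    return 1.0 / (1.0 + distance)
```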
Of course, in addition to calculating the euclidean distance between feature vectors to obtain the visual similarity between text images, the visual similarity may also be obtained by other similarity measurement methods, such as a cosine similarity calculation method.
In the embodiment of the application, the degree of difference between corresponding pixel points of the superposition matrices of the two text images can also be calculated pixel by pixel, so that the visual similarity is measured according to the sum of these differences.
Step 405: and determining whether the two character images are homologous images according to the visual similarity.
In the embodiment of the application, after the visual similarity is determined, whether the two character images are homologous images can be determined according to the visual similarity. For example, a visual similarity threshold is preset, and when the visual similarity between two text images is greater than the visual similarity threshold, the two text images are determined to be homologous images. Or, when the euclidean distance is used as data for measuring the visual similarity, a distance threshold value can be set, and when the euclidean distance between two text images is smaller than the distance threshold value, the two text images are determined to be homologous images.
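In sketch form, the thresholding just described might look as follows; the threshold value 0.8 is an assumed example and is not a value fixed by the embodiment:

```python
def is_homologous(similarity, similarity_threshold=0.8):
    """Two text images are judged homologous when their visual similarity
    exceeds the preset visual similarity threshold (example value only)."""
    return similarity > similarity_threshold
```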
In the embodiment of the present application, the visual similarity determination model is obtained by performing multiple training on a plurality of image sample pairs, and a training process of the visual similarity determination model is described below. As shown in fig. 8, a schematic diagram of a training process of the visual similarity determination model is shown.
Step 801: a plurality of image sample pairs for model training are acquired.
In this embodiment of the application, each image sample pair in the plurality of image sample pairs includes two text image samples, and a label is manually added to each image sample pair, where the label may indicate whether the two text image samples are homologous images, for example, a label value of 1 indicates that the two text image samples are homologous images, and a label value of 0 indicates that the two text image samples are non-homologous images; or, the label value 0 indicates that the two text image samples are homologous images, and the label value 1 indicates that the two text image samples are non-homologous images, which is not limited in this embodiment of the application. The label of each image sample pair may also be understood as a visual similarity between two text image samples included in each image sample pair, for example, if a label value 1 indicates that the two text image samples are homologous images, the similarity between the two text image samples is 100%, and if a label value 0 indicates that the two text image samples are non-homologous images, the similarity between the two text image samples is 0.
In the embodiment of the application, because most of the text images subjected to text image matching are text images subjected to image processing in actual application, in order to improve the stability of the model, the text image samples included in the image samples used for model training may be images obtained by adding interference information to corresponding source text images, where the interference information is added by processing the source text images through a certain image processing means, such as increased smearing in the text images.
During training, the superposition matrix of each character image sample can be obtained in advance, so that the superposition matrix can be directly used as model input, and the situation that the superposition matrix is repeatedly obtained during each training is avoided. The process of acquiring the superposition matrix is already described in the embodiment section shown in fig. 4, and thus is not described herein again.
Step 802: and respectively carrying out feature extraction on the two character image samples included in each image sample pair through the two feature extraction submodels to obtain the feature vectors of the two character image samples included in each image sample pair.
Step 803: and determining the visual similarity between the two text image samples included in each image sample pair through the similarity determination submodel based on the feature vectors of the two text image samples included in each image sample pair.
For each image sample pair, the processes of step 802 and step 803 are similar to the process of step 404 in the embodiment corresponding to fig. 4; therefore, the process of obtaining the visual similarity between the two text image samples included in each image sample pair is not described redundantly here.
Step 804: and determining whether the loss value of the visual similarity determination model is smaller than a preset loss threshold value.
The model training process is a process of continuously optimizing the model. In each round of training, in order to measure whether the model needs further training, after the visual similarity between the two text image samples included in each image sample pair is obtained, the loss value (loss) of the visual similarity determination model for this round of training is obtained. The loss value represents the sum, over the image sample pairs, of the differences between the visual similarity determined by the similarity determination model for each image sample pair and the annotated visual similarity of that pair; that is, the loss value measures whether the current model is sufficiently accurate.
In the embodiment of the present application, the loss value may be, for example, a contrastive loss value (Contrastive Loss), and the contrastive loss value of each image sample pair may be calculated by a contrastive loss formula in which:
Loss denotes the contrastive loss value, y denotes the label of the image sample pair (i.e., the annotated visual similarity of the pair), d denotes the Euclidean distance between the feature vectors of the two input text image samples, and m is a control parameter used to control the range of the Euclidean distance of negative sample pairs; for example, when d takes values between 0 and 1, m may be set to 1.
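The standard contrastive loss is consistent with the symbols described above, assuming y = 1 marks a homologous (positive) pair and y = 0 a non-homologous pair: Loss = y·d² + (1−y)·max(m−d, 0)². A sketch under that assumption (the label convention and function name are assumptions, since the embodiment leaves the label convention open):

```python
import torch

def contrastive_loss(d, y, m=1.0):
    """Contrastive loss for one image sample pair (assuming y = 1 for a
    homologous pair and y = 0 for a non-homologous pair).
    d: Euclidean distance between the two feature vectors (a tensor)
    m: margin controlling the distance range of negative sample pairs
    """
    positive_term = y * d.pow(2)
    negative_term = (1 - y) * torch.clamp(m - d, min=0).pow(2)
    return positive_term + negative_term
```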
The contrast loss value of each image sample pair can be obtained through the formula, so that the loss value of the similarity determination model of the training is obtained, namely the sum of the contrast loss values of all the image sample pairs.
When the method is used specifically, besides the comparison loss, other loss calculation modes can be adopted for the loss value, and the method is not limited in the embodiment of the application.
Step 805: if the result of the step 804 is negative, adjusting the model parameters of the visual similarity determination model according to the loss value.
Step 806: if the result of step 804 is yes, the training process is ended.
In the embodiment of the application, if it is determined that the loss value of the visual similarity determination model is not less than the preset loss threshold, it indicates that the current visual similarity determination model does not meet the accuracy requirement, and the model needs to be trained continuously.
Specifically, the model parameters of the visual similarity determination model may be adjusted according to the loss value, and the next round of training is then continued with the adjusted visual similarity determination model, i.e., the adjusted model continues the processes of steps 802 to 806. The model parameters may be adjusted by an optimization algorithm such as gradient descent or Newton's method.
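In a framework such as PyTorch, this parameter adjustment corresponds to a standard optimization step; the optimizer choice, learning rate, and function name below are illustrative assumptions, and the sketch reuses FeatureExtractionSubmodel and contrastive_loss from the earlier sketches:

```python
import torch
import torch.nn.functional as F

model = FeatureExtractionSubmodel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient descent; lr is an assumed example

def training_step(batch_a, batch_b, labels, margin=1.0):
    """One adjustment of the model parameters according to the loss value."""
    optimizer.zero_grad()
    d = F.pairwise_distance(model(batch_a), model(batch_b))   # distances for a batch of sample pairs
    loss = contrastive_loss(d, labels, m=margin).mean()
    loss.backward()    # gradients of the loss w.r.t. the shared-weight parameters
    optimizer.step()   # adjust the model parameters
    return loss.item()
```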
If the loss value of the visual similarity determination model is determined to be less than the preset loss threshold, the current visual similarity determination model meets the accuracy requirement and can be applied to the actual visual similarity determination process, so the training process can be ended.
In the embodiment of the present application, matching only according to the visual similarity may still produce mismatches in some situations. For example, when the text layouts of two text images are very close, the two text images look visually very similar and are easily regarded as homologous images. Therefore, in the embodiment of the present application, whether the two text images are homologous images may further be determined in combination with the text similarity between the text contents of the text images. Fig. 9 is a schematic flow chart of matching text images in combination with the text similarity.
Step 901: and acquiring a character image pair to be matched.
Step 902: and respectively carrying out character area detection on the two character images to be matched to obtain a character area characteristic matrix of each character image, wherein the character area characteristic matrix represents the character area where the character is located.
Step 903: and respectively superposing the pixel matrix of the character image and the character area characteristic matrix aiming at each character image to obtain a superposition matrix corresponding to each character image.
Step 904: and acquiring the visual similarity between the two character images according to the superposition matrixes respectively corresponding to the two character images.
Steps 901 to 904 correspond one to one to steps 401 to 404 in the embodiment shown in fig. 4, so reference may be made to the description of the corresponding parts of that embodiment; details are not repeated here.
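To make steps 902 and 903 concrete, the following is a minimal sketch of how a text-region feature matrix could be built from the detection boxes and superposed onto the pixel matrix; the (x1, y1, x2, y2) box format and the channel-stacking form of superposition are assumptions, since the embodiment only requires that the two matrices be superposed:

```python
import numpy as np


def text_region_matrix(image_shape, boxes):
    # Build a text-region feature matrix for one text image: pixels inside a
    # detected text box are marked 1, all other pixels 0. image_shape is
    # (height, width); each box is assumed to be an axis-aligned
    # (x1, y1, x2, y2) rectangle in pixel coordinates.
    mask = np.zeros(image_shape, dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1.0
    return mask


def superposition_matrix(pixels, mask):
    # One possible form of superposition: stack the text-region feature
    # matrix onto the pixel matrix as an extra channel,
    # (H, W, C) -> (H, W, C + 1).
    return np.concatenate([pixels, mask[..., None]], axis=-1)
```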
Step 905: and respectively carrying out character recognition on the two character images to obtain character information corresponding to each character image.
Specifically, for each text image, a text recognition algorithm may be used to recognize the text in the image and obtain the text information it contains. The text recognition may be implemented, for example, with an Optical Character Recognition (OCR) algorithm; other algorithms may of course also be used, for example the RARE (Robust Scene Text Recognition with Automatic Rectification) algorithm, which is not limited in the embodiments of the present application.
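As an illustration only, the recognition could be performed with the open-source Tesseract engine through pytesseract; this is merely one possible recognizer, not the specific algorithm prescribed by the embodiment, and the language packs named here are assumed to be installed:

```python
from PIL import Image
import pytesseract


def recognize_text(image_path, lang="chi_sim+eng"):
    # Run the Tesseract OCR engine on a text image and return the
    # recognized text information.
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)
```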
Step 906: and acquiring the character similarity between the two character images according to the character information respectively corresponding to the two character images.
In the embodiments of the present application, the text similarity between text images may be measured by the edit distance (Edit Distance) between their text information. After the text information in the two text images is obtained through text recognition, the edit distance between the two pieces of text information can be calculated, and the text similarity between the two text images is then determined based on this edit distance. The edit distance is the number of editing steps required to convert one piece of text information into the other. The normalized edit distance, i.e. the edit distance divided by the text length, may be used as the text similarity index; for example, if the normalized edit distance is 0.6, then 0.6 may be used directly as the index. In general, the smaller the edit distance, the higher the text similarity. The edit distance may of course also be converted into a text similarity value by some other calculation.
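The following is a minimal sketch of the edit distance and one possible length normalization; dividing by the length of the longer text is an assumption, as the embodiment only specifies division by the text length:

```python
def edit_distance(a, b):
    # Levenshtein distance: minimum number of single-character insertions,
    # deletions and substitutions needed to turn string a into string b.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]


def normalized_edit_distance(a, b):
    # Edit distance divided by the text length (here, the longer of the two
    # texts), used as the text similarity index.
    longest = max(len(a), len(b)) or 1
    return edit_distance(a, b) / longest
```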
In practical applications, the text similarity between the text information contained in the text images may also be obtained in ways other than the edit distance; for example, the text information in each text image may be segmented and represented by word vectors, the word vectors of each image may be aggregated into a single vector, and the similarity between the two resulting vectors may be taken as the text similarity.
Step 907: and determining whether the two character images are homologous images according to the visual similarity and the character similarity.
If matching relies on the visual similarity alone, plain text images with similar layouts are prone to being mismatched; conversely, interference such as smearing can make the text recognition results of genuinely homologous images differ greatly, so matching on the text similarity alone would cause many missed matches. The visual similarity and the text similarity are therefore used together to determine whether the two text images are homologous images, which achieves better matching and greatly reduces mismatches.
Specifically, if the visual similarity and the text similarity of the two text images satisfy a visual similarity condition and a text similarity condition respectively, the two text images may be determined to be homologous images. For example, if the visual similarity is greater than a set visual similarity threshold and the text similarity also lies within a certain threshold range, or the text similarity is greater than a set text similarity threshold and the visual similarity also lies within a certain threshold range, the two text images may be determined to be homologous images; otherwise they are non-homologous images.
In addition, the larger of the visual similarity and the text similarity may be considered first: when the visual similarity is greater than the text similarity, the two text images are determined to be homologous images if the visual similarity is greater than a first similarity threshold and the text similarity is greater than a second similarity threshold, where the first similarity threshold is greater than the second similarity threshold. Equivalently, when similarity is measured by distance, i.e. the visual similarity by the Euclidean distance and the text similarity by the edit distance, the two text images are determined to be homologous images if the Euclidean distance is less than or equal to a first distance threshold th1 and the edit distance is less than or equal to a second distance threshold th2. In practical applications, th1 may be set to 0.4 and th2 to 0.8; of course, th1 and th2 may also take other values, which is not limited in the embodiments of the present application.
When the visual similarity is less than the text similarity, the two text images are determined to be homologous images if the visual similarity is greater than a third similarity threshold and the text similarity is greater than a fourth similarity threshold, where the third similarity threshold is less than the fourth similarity threshold. Equivalently, when similarity is measured by distance, the two text images are determined to be homologous images if the Euclidean distance is less than or equal to a fourth distance threshold th4 and the edit distance is less than or equal to a third distance threshold th3. In practical applications, th3 may be set to 0.1 and th4 to 0.8; of course, th3 and th4 may also take other values, which is not limited in the embodiments of the present application.
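As an illustration of the two distance-based conditions above, the following sketch combines the Euclidean distance and the edit distance with the example thresholds th1 = 0.4, th2 = 0.8, th3 = 0.1 and th4 = 0.8; the function name and structure are illustrative only:

```python
def is_homologous(euclidean_dist, edit_dist,
                  th1=0.4, th2=0.8, th3=0.1, th4=0.8):
    # Both similarities are measured as distances here, so smaller values
    # mean higher similarity.
    # Case 1: the images are visually very close (tight Euclidean-distance
    # threshold), so a looser edit-distance threshold is accepted.
    if euclidean_dist <= th1 and edit_dist <= th2:
        return True
    # Case 2: the recognized texts are very close (tight edit-distance
    # threshold), so a looser Euclidean-distance threshold is accepted.
    if edit_dist <= th3 and euclidean_dist <= th4:
        return True
    return False
```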
To sum up, when determining and measuring the visual similarity, the positions of the text regions in the text images are incorporated so that the network pays more attention to the text regions during training; this avoids the overfitting caused by attending only to edge regions, reduces the probability of mismatching and improves the stability of the visual similarity model. In addition, the text similarity is introduced on top of the visual similarity so that homologous matching of text images is performed jointly, achieving more accurate homologous matching and a lower mismatching rate.
Referring to fig. 10, based on the same inventive concept, an embodiment of the present application further provides a text-image matching apparatus 100, including:
a text detection unit 1001, configured to perform text region detection on two text images to be matched respectively to obtain a text region feature matrix of each text image, where the text region feature matrix represents a text region where a text is located;
a superposition unit 1002, configured to superpose a pixel matrix of each text image and a text region feature matrix of each text image, respectively, to obtain a superposition matrix corresponding to each text image;
a visual similarity obtaining unit 1003, configured to obtain a visual similarity between two text images according to the corresponding superposition matrices of the two text images respectively;
and a homologous image determining unit 1004 for determining whether the two character images are homologous images according to the visual similarity.
Optionally, the apparatus further includes a text similarity obtaining unit 1005;
a character similarity obtaining unit configured to: respectively carrying out character recognition on the two character images to obtain character information corresponding to each character image; acquiring character similarity between the two character images according to the character information respectively corresponding to the two character images;
a homologous image determining unit 1004 for: and determining whether the two character images are homologous images or not according to the visual similarity and the character similarity.
Optionally, the text similarity obtaining unit 1005 is configured to:
acquiring the editing distance between the character information corresponding to the two character images respectively; the editing distance is the number of editing steps required for converting one character message into another character message; a text similarity between the two text images is determined based on the edit distance.
Optionally, the homologous image determining unit 1004 is configured to:
if the visual similarity is greater than the character similarity, the visual similarity is greater than a first similarity threshold, and the character similarity is greater than a second similarity threshold, determining that the two character images are homologous images; the first similarity threshold is greater than the second similarity threshold; or,
if the visual similarity is smaller than the character similarity and larger than a third similarity threshold value and the character similarity is larger than a fourth similarity threshold value, determining that the two character images are homologous images; the third similarity threshold is less than the fourth similarity threshold.
Optionally, the text detection unit 1001 is configured to: respectively extracting the characteristics of the two character images to obtain the image characteristics corresponding to each character image; detecting characters included in each character image according to the image characteristics corresponding to each character image to obtain a plurality of text detection boxes included in each character image; the text detection box represents a text area where the text is located; and respectively obtaining a character area characteristic matrix corresponding to each character image according to the plurality of text detection boxes included in each character image.
Optionally, the visual similarity obtaining unit 1003 is configured to:
determining the visual similarity between the two character images through a trained visual similarity determination model according to the superposition matrixes respectively corresponding to the two character images;
the visual similarity determination model is obtained by training with a plurality of image sample pairs; each image sample pair includes two text image samples and is labeled with the visual similarity between the two text image samples, and the text image samples included in the image sample pairs are images obtained by adding interference information to corresponding source text images.
Optionally, the visual similarity determination model includes two feature extraction submodels with the same model structure and a similarity determination submodel; the apparatus further comprises a model training unit 1006 for:
training the visual similarity determination model for multiple times through the multiple image sample pairs to obtain a visual similarity determination model; wherein, each training process comprises the following steps:
respectively extracting the features of the two character image samples included in each image sample pair through the two feature extraction submodels to obtain the feature vectors of the two character image samples included in each image sample pair;
determining the visual similarity between the two text image samples included in each image sample pair through a similarity determination sub-model based on the feature vectors of the two text image samples included in each image sample pair;
determining whether the loss value of the visual similarity determination model is smaller than a preset loss threshold, wherein the loss value represents the sum, over all image sample pairs, of the differences between the visual similarity determined by the similarity determination sub-model and the labeled visual similarity of each image sample pair;
if the loss value of the visual similarity determination model is determined to be not less than the preset loss threshold, adjusting the model parameters of the visual similarity determination model according to the loss value, and performing the next training process with the adjusted visual similarity determination model; or,
and if the loss value of the visual similarity determination model is smaller than the preset loss threshold value, ending the training process.
Optionally, the visual similarity obtaining unit 1003 is configured to:
respectively extracting the features of the superposition matrixes respectively corresponding to the two character images through two feature extraction submodels included in the visual similarity determination model to obtain a feature vector corresponding to each character image;
and determining the visual similarity between the two character images through the similarity determination submodel according to the feature vectors respectively corresponding to the two character images.
Optionally, the visual similarity obtaining unit 1003 is configured to: determining Euclidean distances between the feature vectors respectively corresponding to the two character images; and determining the visual similarity between the two character images according to the Euclidean distance.
The apparatus may be configured to execute the methods of the embodiments shown in fig. 4 to fig. 9; for the functions that can be realized by each functional module of the apparatus, reference may therefore be made to the description of those embodiments, which is not repeated here. The text similarity obtaining unit 1005 and the model training unit 1006 are optional functional modules and are therefore shown with dashed lines in fig. 10.
Referring to fig. 11, based on the same technical concept, an embodiment of the present application further provides a computer device 110, which may include a memory 1101 and a processor 1102.
The memory 1101 is used for storing the computer programs executed by the processor 1102. The memory 1101 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required for at least one function, and the like, and the data storage area may store data created according to the use of the computer device, and the like. The processor 1102 may be a central processing unit (CPU), a digital processing unit, or the like. The specific connection medium between the memory 1101 and the processor 1102 is not limited in the embodiments of the present application. In fig. 11, the memory 1101 and the processor 1102 are connected by a bus 1103, indicated by a thick line; the connection manner between other components is merely illustrative and not limiting. The bus 1103 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 11, but this does not mean that there is only one bus or one type of bus.
The memory 1101 may be a volatile memory, such as a random-access memory (RAM); the memory 1101 may also be a non-volatile memory, such as, but not limited to, a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1101 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1101 may also be a combination of the above memories.
A processor 1102 for executing the method performed by the apparatus of the embodiments shown in fig. 4-9 when invoking the computer program stored in the memory 1101.
In some possible embodiments, various aspects of the methods provided by the present application may also be implemented in the form of a program product including program code; when the program product is run on a computer device, the program code causes the computer device to perform the steps of the methods according to the various exemplary embodiments of the present application described above in this specification. For example, the computer device may perform the methods performed by the devices in the embodiments shown in fig. 4 to fig. 9.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (15)
1. A character image matching method is characterized by comprising the following steps:
respectively carrying out character area detection on two character images to be matched to obtain a character area characteristic matrix of each character image, wherein the character area characteristic matrix represents a character area where characters are located;
respectively superposing a pixel matrix of the character image and a character area characteristic matrix aiming at each character image to obtain a superposition matrix corresponding to each character image;
acquiring visual similarity between the two character images according to the superposition matrixes respectively corresponding to the two character images;
and determining whether the two character images are homologous images according to the visual similarity.
2. The method of claim 1, wherein the method further comprises:
respectively carrying out character recognition on the two character images to obtain character information corresponding to each character image;
acquiring character similarity between the two character images according to the character information respectively corresponding to the two character images;
determining whether the two character images are homologous images according to the visual similarity corresponding to the two character images respectively comprises the following steps:
and determining whether the two character images are homologous images or not according to the visual similarity and the character similarity.
3. The method of claim 2, wherein the obtaining the text similarity between the two text images according to the text information respectively corresponding to the two text images comprises:
acquiring the editing distance between the character information corresponding to the two character images respectively; the editing distance is the number of editing steps required for converting one character message into another character message;
determining a text similarity between the two text images based on the edit distance.
4. The method of claim 2, wherein determining whether the two text images are homologous images based on the visual similarity and the text similarity comprises:
if the visual similarity is greater than the character similarity, the visual similarity is greater than a first similarity threshold, and the character similarity is greater than a second similarity threshold, determining that the two character images are homologous images; the first similarity threshold is greater than the second similarity threshold; or,
if the visual similarity is smaller than the character similarity, the visual similarity is larger than a third similarity threshold, and the character similarity is larger than a fourth similarity threshold, determining that the two character images are homologous images; the third similarity threshold is less than the fourth similarity threshold.
5. The method of claim 1, wherein the performing text region detection on the two text images to be matched respectively to obtain a text region feature matrix of each text image comprises:
respectively extracting the characteristics of the two character images to obtain the image characteristics corresponding to each character image;
detecting characters included in each character image according to the image characteristics corresponding to each character image to obtain a plurality of text detection boxes included in each character image; the text detection box represents a text area where the text is located;
and respectively obtaining a character area characteristic matrix corresponding to each character image according to the plurality of text detection boxes included in each character image.
6. The method of claim 1, wherein obtaining the visual similarity between the two text images according to the superposition matrices corresponding to the two text images respectively comprises:
determining the visual similarity between the two character images through a trained visual similarity determination model according to the superposition matrixes respectively corresponding to the two character images;
the visual similarity determination model is obtained by training with a plurality of image sample pairs; each image sample pair comprises two text image samples and is labeled with the visual similarity between the two text image samples, and the text image samples included in the image sample pairs are images obtained by adding interference information to corresponding source text images.
7. The method of claim 6, wherein the visual similarity determination model comprises two feature extraction submodels and a similarity determination submodel with the same model structure;
training the visual similarity determination model for multiple times through the multiple image sample pairs to obtain the visual similarity determination model; wherein, each training process comprises the following steps:
respectively extracting the features of the two character image samples included in each image sample pair through the two feature extraction submodels to obtain the feature vectors of the two character image samples included in each image sample pair;
determining the visual similarity between the two text image samples included in each image sample pair through the similarity determination sub-model based on the feature vectors of the two text image samples included in each image sample pair;
determining whether a loss value of the visual similarity determination model is smaller than a preset loss threshold, wherein the loss value represents the sum, over all image sample pairs, of the differences between the visual similarity determined by the similarity determination sub-model and the labeled visual similarity of each image sample pair;
if the loss value of the visual similarity determination model is not smaller than a preset loss threshold value, adjusting model parameters of the visual similarity determination model according to the loss value, and performing the next training process on the visual similarity determination model after adjustment; or,
and if the loss value of the visual similarity determination model is smaller than a preset loss threshold value, ending the training process.
8. The method of claim 7, wherein determining the visual similarity between the two text images according to the superimposed matrices corresponding to the two text images, respectively, by using the trained visual similarity determination model comprises:
respectively extracting the features of the superposition matrixes respectively corresponding to the two character images through two feature extraction submodels included in the visual similarity determination model to obtain a feature vector corresponding to each character image;
and determining the visual similarity between the two character images through the similarity determination submodel according to the feature vectors respectively corresponding to the two character images.
9. The method of claim 8, wherein determining the visual similarity between the two text images through the similarity determination submodel according to the feature vectors corresponding to the two text images respectively comprises:
determining Euclidean distances between the feature vectors respectively corresponding to the two character images;
and determining the visual similarity between the two character images according to the Euclidean distance.
10. A character-image matching apparatus, comprising:
the character detection unit is used for respectively detecting character areas of the two character images to be matched to obtain a character area characteristic matrix of each character image, and the character area characteristic matrix represents the character area where the characters are located;
the superposition unit is used for superposing the pixel matrix of the character image and the character area characteristic matrix aiming at each character image to obtain a superposition matrix corresponding to each character image;
the visual similarity obtaining unit is used for obtaining the visual similarity between the two character images according to the superposition matrixes respectively corresponding to the two character images;
and the homologous image determining unit is used for determining whether the two character images are homologous images according to the visual similarity.
11. The apparatus of claim 10, further comprising a text similarity obtaining unit;
a character similarity obtaining unit configured to: respectively carrying out character recognition on the two character images to obtain character information corresponding to each character image; acquiring character similarity between the two character images according to the character information respectively corresponding to the two character images;
the homologous image determining unit is configured to: and determining whether the two character images are homologous images or not according to the visual similarity and the character similarity.
12. The apparatus of claim 11, wherein the homologous image determining unit is to:
if the visual similarity is greater than the character similarity, the visual similarity is greater than a first similarity threshold, and the character similarity is greater than a second similarity threshold, determining that the two character images are homologous images; the first similarity threshold is greater than the second similarity threshold; or,
if the visual similarity is smaller than the character similarity, the visual similarity is larger than a third similarity threshold, and the character similarity is larger than a fourth similarity threshold, determining that the two character images are homologous images; the third similarity threshold is less than the fourth similarity threshold.
13. The apparatus of claim 10, wherein the visual similarity obtaining unit is to:
respectively extracting the features of the superposition matrixes respectively corresponding to the two character images through two feature extraction submodels included in the visual similarity determination model to obtain a feature vector corresponding to each character image;
and determining the visual similarity between the two character images through the similarity determination submodel according to the feature vectors respectively corresponding to the two character images.
14. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor,
the processor, when executing the computer program, realizes the steps of the method of any one of claims 1 to 9.
15. A computer storage medium having computer program instructions stored thereon, wherein,
the computer program instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010757678.XA CN111898544B (en) | 2020-07-31 | 2020-07-31 | Text image matching method, device and equipment and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010757678.XA CN111898544B (en) | 2020-07-31 | 2020-07-31 | Text image matching method, device and equipment and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111898544A true CN111898544A (en) | 2020-11-06 |
CN111898544B CN111898544B (en) | 2023-08-08 |
Family
ID=73182923
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010757678.XA Active CN111898544B (en) | 2020-07-31 | 2020-07-31 | Text image matching method, device and equipment and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111898544B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113159206A (en) * | 2021-04-28 | 2021-07-23 | 北京达佳互联信息技术有限公司 | Image comparison method and device, electronic equipment and computer readable storage medium |
CN113449726A (en) * | 2021-07-08 | 2021-09-28 | 中国工商银行股份有限公司 | Character comparison and identification method and device |
CN115186775A (en) * | 2022-09-13 | 2022-10-14 | 北京远鉴信息技术有限公司 | Method and device for detecting matching degree of image description characters and electronic equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108875828A (en) * | 2018-06-19 | 2018-11-23 | 太原学院 | A kind of fast matching method and system of similar image |
CN108875935A (en) * | 2018-06-11 | 2018-11-23 | 兰州理工大学 | Based on the natural image target materials visual signature mapping method for generating confrontation network |
US20180365536A1 (en) * | 2017-06-19 | 2018-12-20 | Adobe Systems Incorporated | Identification of fonts in an application |
US20190034758A1 (en) * | 2017-07-30 | 2019-01-31 | Fuji Xerox Co., Ltd. | Systems and methods for clustering of near-duplicate images in very large image collections |
CN109919157A (en) * | 2019-03-28 | 2019-06-21 | 北京易达图灵科技有限公司 | A kind of vision positioning method and device |
CN110569850A (en) * | 2019-08-20 | 2019-12-13 | 北京旷视科技有限公司 | character recognition template matching method and device and text recognition equipment |
CN110825901A (en) * | 2019-11-11 | 2020-02-21 | 腾讯科技(北京)有限公司 | Image-text matching method, device and equipment based on artificial intelligence and storage medium |
WO2020045714A1 (en) * | 2018-08-31 | 2020-03-05 | 망고슬래브 주식회사 | Method and system for recognizing contents |
CN111027563A (en) * | 2019-12-09 | 2020-04-17 | 腾讯云计算(北京)有限责任公司 | Text detection method, device and recognition system |
2020-07-31: Application CN202010757678.XA filed (CN); patent granted as CN111898544B; status: Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180365536A1 (en) * | 2017-06-19 | 2018-12-20 | Adobe Systems Incorporated | Identification of fonts in an application |
US20190034758A1 (en) * | 2017-07-30 | 2019-01-31 | Fuji Xerox Co., Ltd. | Systems and methods for clustering of near-duplicate images in very large image collections |
CN108875935A (en) * | 2018-06-11 | 2018-11-23 | 兰州理工大学 | Based on the natural image target materials visual signature mapping method for generating confrontation network |
CN108875828A (en) * | 2018-06-19 | 2018-11-23 | 太原学院 | A kind of fast matching method and system of similar image |
WO2020045714A1 (en) * | 2018-08-31 | 2020-03-05 | 망고슬래브 주식회사 | Method and system for recognizing contents |
CN109919157A (en) * | 2019-03-28 | 2019-06-21 | 北京易达图灵科技有限公司 | A kind of vision positioning method and device |
CN110569850A (en) * | 2019-08-20 | 2019-12-13 | 北京旷视科技有限公司 | character recognition template matching method and device and text recognition equipment |
CN110825901A (en) * | 2019-11-11 | 2020-02-21 | 腾讯科技(北京)有限公司 | Image-text matching method, device and equipment based on artificial intelligence and storage medium |
CN111027563A (en) * | 2019-12-09 | 2020-04-17 | 腾讯云计算(北京)有限责任公司 | Text detection method, device and recognition system |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113159206A (en) * | 2021-04-28 | 2021-07-23 | 北京达佳互联信息技术有限公司 | Image comparison method and device, electronic equipment and computer readable storage medium |
CN113449726A (en) * | 2021-07-08 | 2021-09-28 | 中国工商银行股份有限公司 | Character comparison and identification method and device |
CN115186775A (en) * | 2022-09-13 | 2022-10-14 | 北京远鉴信息技术有限公司 | Method and device for detecting matching degree of image description characters and electronic equipment |
CN115186775B (en) * | 2022-09-13 | 2022-12-16 | 北京远鉴信息技术有限公司 | Method and device for detecting matching degree of image description characters and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111898544B (en) | 2023-08-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109816009B (en) | Multi-label image classification method, device and equipment based on graph convolution | |
CN108304835B (en) | character detection method and device | |
CN113343982B (en) | Entity relation extraction method, device and equipment for multi-modal feature fusion | |
CN110874618B (en) | OCR template learning method and device based on small sample, electronic equipment and medium | |
CN112215171B (en) | Target detection method, device, equipment and computer readable storage medium | |
CN111898544B (en) | Text image matching method, device and equipment and computer storage medium | |
CN113704531A (en) | Image processing method, image processing device, electronic equipment and computer readable storage medium | |
CN111324696A (en) | Entity extraction method, entity extraction model training method, device and equipment | |
KR101963404B1 (en) | Two-step optimized deep learning method, computer-readable medium having a program recorded therein for executing the same and deep learning system | |
CN116452810A (en) | Multi-level semantic segmentation method and device, electronic equipment and storage medium | |
CN113343981A (en) | Visual feature enhanced character recognition method, device and equipment | |
CN115205546A (en) | Model training method and device, electronic equipment and storage medium | |
CN110717405A (en) | Face feature point positioning method, device, medium and electronic equipment | |
CN114529750A (en) | Image classification method, device, equipment and storage medium | |
CN114266901A (en) | Document contour extraction model construction method, device, equipment and readable storage medium | |
CN114299304A (en) | Image processing method and related equipment | |
CN115909336A (en) | Text recognition method and device, computer equipment and computer-readable storage medium | |
CN113515920B (en) | Method, electronic device and computer readable medium for extracting formulas from tables | |
CN115546554A (en) | Sensitive image identification method, device, equipment and computer readable storage medium | |
CN111881778B (en) | Method, apparatus, device and computer readable medium for text detection | |
CN114818627A (en) | Form information extraction method, device, equipment and medium | |
CN115861605A (en) | Image data processing method, computer equipment and readable storage medium | |
CN116994019B (en) | Model training method, related equipment, storage medium and computer product | |
CN115294333B (en) | Image processing method, related device, storage medium and program product | |
CN113449559A (en) | Table identification method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |