
CN112016543B - Text recognition network, neural network training method and related equipment - Google Patents

Text recognition network, neural network training method and related equipment

Info

Publication number
CN112016543B
CN112016543B (application CN202010723541.2A)
Authority
CN
China
Prior art keywords
character
feature
image
recognition
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010723541.2A
Other languages
Chinese (zh)
Other versions
CN112016543A (en)
Inventor
刘志广
王靓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010723541.2A
Publication of CN112016543A
Priority to PCT/CN2021/106397 (published as WO2022017245A1)
Application granted
Publication of CN112016543B
Legal status: Active


Classifications

    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)

Abstract

The application relates to text recognition technology in the field of artificial intelligence, and discloses a text recognition network, a neural network training method, and related equipment, where the text recognition network is a neural network for recognizing characters in an image. The image feature extraction module is used for performing feature extraction on the image to be recognized to generate a first feature corresponding to a first character in the image. The text feature acquisition module is used for acquiring a preset character corresponding to the first character in the image to be recognized, and for performing text prediction according to the preset character to generate a semantic feature of a first predicted character. The recognition module is used for performing a recognition operation according to the first feature and the semantic feature of the first predicted character to generate a recognition result corresponding to the image to be recognized, so that the recognition operation is performed on features of more dimensions; and because the accuracy of the predicted character is not affected by image-quality problems, the accuracy of the text recognition result is improved.

Description

Text recognition network, neural network training method and related equipment
Technical Field
The application relates to the field of artificial intelligence, in particular to a text recognition network, a neural network training method and related equipment.
Background
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Currently, neural networks based on deep learning are a common artificial intelligence approach to recognizing characters in images.
However, in practice, when the quality of the image to be recognized is low, for example when the image is blurred or some characters in it are occluded, the neural network may output an incorrect recognition result, which reduces the accuracy of the text recognition result. A scheme for improving the accuracy of the text recognition result is therefore needed.
Disclosure of Invention
The embodiments of the application provide a text recognition network, a neural network training method, and related equipment, which generate a recognition result according to both the semantic feature of a predicted character and the image feature of the image to be recognized, so that the recognition operation is performed on features of more dimensions; and because the accuracy of the predicted character is not affected by blurring of the image or occlusion of some characters in the image to be recognized, the accuracy of the text recognition result is improved.
In order to solve the technical problems, the embodiment of the application provides the following technical scheme:
In a first aspect, an embodiment of the present application provides a text recognition network that may be used in the text recognition field within the field of artificial intelligence. The text recognition network is a neural network for recognizing characters in an image and comprises an image feature extraction module, a text feature acquisition module, and a recognition module. The image feature extraction module is used for acquiring the image to be recognized and performing feature extraction on it to generate a first feature corresponding to a first character in the image to be recognized. The first character is a character to be recognized in the image to be recognized, and the image feature extraction module in the text recognition network may specifically be a convolutional neural network, a histogram of oriented gradients, or a local binary pattern. The text feature acquisition module is used for acquiring a preset character corresponding to the first character in the image to be recognized, and for performing text prediction according to the preset character to generate a semantic feature of a first predicted character. The preset character may be a start-flag character, which may be represented in a computer program as a <BOS> character and instructs the text feature acquisition module to begin text prediction. The recognition module is used for combining the first feature with the semantic feature of the first predicted character and performing a recognition operation according to the combined feature to generate a recognition result corresponding to the first character in the image to be recognized. The recognition module may be a classification network; the classification network may be a classifier, which may be a multi-layer perceptron or may be composed of a linear transformation matrix and a classification function.
In this implementation, not only is the image feature of the image to be recognized obtained, but the semantic feature of a predicted character is also generated according to the second characters corresponding to the already-recognized characters among the first characters, and the recognition operation is performed on features of more dimensions, which improves the accuracy of the text recognition result. When the image to be recognized is blurred or some of its characters are occluded, the accuracy of the features of the blurred or occluded characters included in the first feature can be greatly reduced; the semantic feature of the predicted character, however, is generated based on the semantic information of the recognized characters, so its accuracy is not affected by blurring of the image or occlusion of some characters. Generating the recognition result from both the semantic feature of the predicted character and the image feature therefore improves the accuracy of the text recognition result.
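As an illustrative aid (not part of the claimed embodiments), the following is a minimal sketch of how the three modules could fit together; the module architectures, feature dimensions, pooling, and number of character classes are all assumptions made for illustration:

```python
# Minimal sketch of the three-module text recognition network; all module
# architectures, dimensions, and the number of character classes are
# illustrative assumptions, not the patent's implementation.
import torch
import torch.nn as nn

class TextRecognitionNet(nn.Module):
    def __init__(self, feat_dim=512, num_classes=97):
        super().__init__()
        # Image feature extraction module (a small CNN stand-in).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 26)),  # one feature per horizontal slot
        )
        # Text feature acquisition module: predicts the next character's
        # semantic feature from the characters decoded so far.
        self.text_module = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Recognition module: classifier over the combined features.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, image, prev_char_embeddings):
        # First feature: image features of the first character.
        f = self.image_encoder(image)             # (B, C, 1, 26)
        first_feature = f.squeeze(2).mean(dim=2)  # (B, C), pooled for simplicity
        # Semantic feature of the predicted character.
        _, h = self.text_module(prev_char_embeddings)
        semantic = h[-1]                          # (B, C)
        # Combine (here by concatenation) and perform the recognition operation.
        return self.classifier(torch.cat([first_feature, semantic], dim=-1))
```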
In one possible implementation manner of the first aspect, when the recognition operation is performed on the image to be recognized for the first time, the text feature acquisition module is specifically configured to acquire the preset character corresponding to the first character in the image to be recognized, and to perform text prediction according to the preset character to generate the semantic feature of the first predicted character. If the execution device performs image segmentation on the whole image to be recognized, performing the recognition operation on the first character for the first time means performing the recognition operation on a segmented image to be recognized (i.e., a text region of the image to be recognized) for the first time. If the execution device does not perform image segmentation on the whole image to be recognized, performing the recognition operation on the first character for the first time means performing the recognition operation on the whole image to be recognized for the first time. When the recognition operation has already been performed on at least one character among the first characters, the text feature acquisition module is specifically configured to determine the preset character together with the at least one recognition result corresponding to the at least one recognized character as second characters, and to perform text prediction according to the second characters to generate the semantic feature of a second predicted character corresponding to the second characters.
In this implementation, when the recognition operation is performed on the first character in the image to be recognized for the first time, the execution device generates the semantic feature of the first predicted character according to the preset character; and when the recognition operation has already been performed on at least one of the first characters, the execution device determines the preset character together with the at least one recognition result corresponding to the recognized characters as the second characters. This ensures the completeness of the scheme, and the whole recognition process needs no manual intervention, which improves the user stickiness of the scheme.
In a possible implementation manner of the first aspect, the recognition module is further configured to perform a recognition operation according to the first feature and the semantic feature of the second predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
In this implementation, when the recognition operation has been performed on at least one of the characters, the text recognition network may have obtained the recognition results of only some of the first characters. The execution device therefore performs text prediction according to the at least one recognition result corresponding to the at least one recognized character to generate the semantic feature of the second predicted character, and performs the recognition operation according to the first feature and the semantic feature of the second predicted character, which further improves the completeness of the scheme.
In one possible implementation manner of the first aspect, the text feature acquisition module includes a first generation sub-module and a combination sub-module. The first generation sub-module is used for performing vectorization processing on each of the at least one preset character to generate the character code of each preset character, and for generating the position code of each preset character according to the position of that preset character within the first characters in the image to be recognized. The combination sub-module is used for combining the character code and the position code of each preset character to obtain the initial feature of that preset character, and for performing a self-attention encoding operation and a self-attention decoding operation according to the initial features of the preset characters to generate the semantic feature of the first predicted character. The character code and the position code of a preset character may be combined in any one of the following ways: concatenation, addition, fusion, or multiplication.
In this implementation, text prediction is performed by executing the self-attention encoding operation and the self-attention decoding operation on the initial features of the preset characters to generate the semantic feature of the first predicted character; this approach is fast to compute and low in complexity.
In one possible implementation manner of the first aspect, the recognition module includes a computing sub-module used for computing the similarity between the first feature and the semantic feature of the first predicted character. The similarity may be obtained by calculating the cosine similarity, the Euclidean distance, the Mahalanobis distance, or the like between the first feature and the semantic feature of the first predicted character, or by performing a dot-product operation on the two; the similarity may comprise one similarity value or two mutually transposed similarity values. The recognition module further includes a second generation sub-module used for generating a second feature and a third feature according to the first feature, the semantic feature of the first predicted character, and the similarity, where the second feature is the first feature with the semantic feature of the first predicted character merged in, and the third feature is the semantic feature of the first predicted character with the first feature merged in. The second generation sub-module is further configured to combine the second feature and the third feature and perform the recognition operation according to the combined feature to generate the recognition result.
In this implementation, the similarity between the first feature and the semantic feature of the first predicted character is calculated, and the second feature and the third feature are then generated according to that similarity: the second feature merges the semantic feature of the first predicted character into the first feature, and the third feature merges the first feature into the semantic feature of the first predicted character. That is, the image feature of the character to be recognized is enhanced according to the semantic feature of the predicted character, and the image feature is blended into the semantic feature of the predicted character, which facilitates the full fusion of the image feature and the predicted-character feature and improves the accuracy of the text recognition result.
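The following is a hedged sketch of this bidirectional fusion, assuming a dot-product similarity (one of the options named above, yielding two mutually transposed similarity matrices) and softmax-weighted mixing; the tensor shapes and the normalization are illustrative assumptions:

```python
# Sketch of the fusion in the recognition module: a dot-product similarity
# between the first feature and the semantic feature of the predicted
# character, then mutual enhancement in both directions. Shapes assumed.
import torch
import torch.nn.functional as F

def fuse_features(first_feat, semantic_feat):
    """first_feat: (B, N, C) image features; semantic_feat: (B, M, C)."""
    # Similarity; its transpose is the second of the two similarity values.
    sim = torch.bmm(first_feat, semantic_feat.transpose(1, 2))  # (B, N, M)
    # Second feature: semantic features merged into the first feature.
    second = first_feat + torch.bmm(F.softmax(sim, dim=-1), semantic_feat)
    # Third feature: the first feature merged into the semantic features.
    third = semantic_feat + torch.bmm(
        F.softmax(sim.transpose(1, 2), dim=-1), first_feat)
    # The recognition operation would then combine `second` and `third`
    # (e.g. by concatenation) before classification.
    return second, third
```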
In a possible implementation manner of the first aspect, the text recognition network further includes a feature update module configured to combine the features of the preset characters with the first feature to generate an updated first feature; the features of the preset characters may be the initial features of the preset characters or the updated features of the preset characters. The first feature includes the image features of a plurality of first characters, at least one of which is a character on which the recognition operation has already been performed; when the preset characters include the recognition results corresponding to a plurality of recognized characters, the features of the preset characters include the features of those recognition results. Relative to the first feature, the updated first feature is enhanced with respect to the features of the recognized characters. The recognition module is specifically configured to perform the recognition operation according to the updated first feature and the semantic feature of the first predicted character to generate the recognition result corresponding to the first character in the image to be recognized.
In this implementation, the semantic features of the recognized characters are blended into the image features, so that the features of the recognized characters within the image features become more salient, and the recognition module can concentrate more on the characters that have not yet been recognized; this reduces the difficulty of a single recognition pass of the recognition module and improves the accuracy of text recognition.
In a possible implementation manner of the first aspect, the feature update module is specifically configured to perform a self-attention encoding operation according to the initial features of the preset characters to obtain the updated features of the preset characters, and to perform a self-attention encoding operation according to the first feature and the updated features of the preset characters to generate the updated first feature. In this implementation, self-attention encoding is used to combine the features of the preset characters with the first feature, which facilitates their full combination, has low complexity, and is easy to implement.
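A minimal sketch of such a feature update step, assuming standard transformer encoder layers as the self-attention encoders and assuming the joint encoding is realized by concatenating the character features and the image features along the sequence dimension:

```python
# Sketch of the feature update module: self-attention encoding of the
# preset characters' initial features, then a joint self-attention encoding
# with the first feature. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

feat_dim = 512
char_encoder = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                          batch_first=True)
joint_encoder = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)

def update_first_feature(first_feat, preset_char_feat):
    """first_feat: (B, N, C); preset_char_feat: (B, M, C) initial features."""
    # Self-attention encoding of the preset characters' initial features.
    updated_chars = char_encoder(preset_char_feat)         # (B, M, C)
    # Joint self-attention encoding over character and image features; the
    # image-feature positions of the output are the updated first feature.
    joint = torch.cat([updated_chars, first_feat], dim=1)  # (B, M+N, C)
    encoded = joint_encoder(joint)
    return encoded[:, updated_chars.size(1):, :]           # (B, N, C)
```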
In one possible implementation manner of the first aspect, when the granularity at which the text recognition network performs the recognition operation is characters, one first character includes at least one character, and the recognition result output by one recognition operation includes one character. When the granularity at which the text recognition network performs the recognition operation is words, one first character includes one or more words, and the recognition result output by one recognition operation is a word comprising one or more characters.
In the implementation manner, the granularity of the text recognition network for executing the recognition operation can be characters or words, so that the application scene of the scheme is expanded, and the implementation flexibility of the scheme is improved.
In a second aspect, an embodiment of the present application provides a training method for a text recognition network, which may be used in the text recognition field within the field of artificial intelligence. The text recognition network is a neural network for recognizing characters in an image and comprises an image feature extraction module, a text feature acquisition module, and a recognition module. The method comprises the following steps. The training device inputs an image to be recognized into the image feature extraction module and performs feature extraction on it to generate a first feature corresponding to a first character in the image to be recognized, where the first character is a character to be recognized in the image; it inputs a preset character corresponding to the first character into the text feature acquisition module and performs text prediction according to the preset character to generate a semantic feature of a first predicted character. The training device then performs, through the recognition module, a recognition operation according to the first feature and the semantic feature of the first predicted character to generate a recognition result corresponding to the first character in the image to be recognized. Finally, the training device trains the text recognition network according to the correct result corresponding to the first character in the image to be recognized, the recognition result, and a loss function. The loss function indicates the similarity between the correct result and the recognition result corresponding to the first character in the image to be recognized, and the training objective is to increase that similarity. The loss function may specifically be a cross-entropy loss function, a focal loss function, or a center loss function.
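A minimal sketch of one such training step with the cross-entropy loss, reusing the illustrative TextRecognitionNet class from the earlier sketch; the optimizer, learning rate, and tensor contents are assumptions:

```python
# Sketch of one training step with the cross-entropy loss named above.
# `TextRecognitionNet` is the illustrative class from the earlier sketch;
# optimizer, learning rate, and tensor contents are assumptions.
import torch
import torch.nn as nn

model = TextRecognitionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(image, prev_char_embeddings, target_char_ids):
    logits = model(image, prev_char_embeddings)   # recognition result (B, K)
    # The loss indicates how far the recognition result is from the correct
    # result; minimizing it pulls the two closer together.
    loss = loss_fn(logits, target_char_ids)
    optimizer.zero_grad()
    loss.backward()                               # backpropagation
    optimizer.step()
    return loss.item()
```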
The second aspect of the embodiments of the present application may also perform the steps in each possible implementation manner of the first aspect. For the specific implementation steps of the second aspect and its various possible implementation manners, and the beneficial effects of each possible implementation manner, reference may be made to the descriptions of the possible implementation manners of the first aspect, which are not repeated here.
In a third aspect, an embodiment of the present application provides a text recognition method, which may be used in the text recognition field within the field of artificial intelligence. The method comprises the following steps. The execution device inputs an image to be recognized into the image feature extraction module and performs feature extraction on it to generate a first feature corresponding to a first character in the image to be recognized, where the first character is a character to be recognized in the image; it inputs a preset character corresponding to the first character into the text feature acquisition module and performs text prediction according to the preset character to generate a semantic feature of a first predicted character. The execution device performs, through the recognition module, a recognition operation according to the first feature and the semantic feature of the first predicted character to generate a recognition result corresponding to the first character in the image to be recognized. The image feature extraction module, the text feature acquisition module, and the recognition module belong to the same text recognition network.
The third aspect of the embodiments of the present application may further perform steps in each possible implementation manner of the first aspect, and for specific implementation steps of the third aspect of the embodiments of the present application and each possible implementation manner of the third aspect, and beneficial effects brought by each possible implementation manner, reference may be made to descriptions in each possible implementation manner of the first aspect, which are not described herein in detail.
In a fourth aspect, an embodiment of the present application provides a training device for a text recognition network, where the text recognition network is a neural network for recognizing characters in an image, the text recognition network includes an image feature extraction module, a text feature acquisition module, and a recognition module, and the training device for the text recognition network includes: the input unit is used for inputting the image to be identified into the image feature extraction module, and extracting the features of the image to be identified to generate first features corresponding to first characters in the image to be identified, wherein the first characters are characters to be identified in the image to be identified; the input unit is also used for inputting preset characters corresponding to the first characters in the image to be recognized into the text feature acquisition module, and carrying out text prediction according to the preset characters so as to generate semantic features of the first predicted characters; the recognition unit is used for executing recognition operation through the recognition module according to the first features and the semantic features of the first predicted characters so as to generate a recognition result corresponding to the first characters in the image to be recognized; the training unit is used for training the text recognition network according to the correct result corresponding to the first character in the image to be recognized, the recognition result and the loss function, wherein the loss function indicates the similarity between the correct result corresponding to the first character in the image to be recognized and the recognition result corresponding to the first character in the image to be recognized.
The fourth aspect of the embodiments of the present application may further perform the steps in each possible implementation manner of the second aspect, and for specific implementation steps of the fourth aspect of the embodiments of the present application and each possible implementation manner of the fourth aspect, and beneficial effects caused by each possible implementation manner, reference may be made to descriptions in each possible implementation manner of the second aspect, which are not repeated herein.
In a fifth aspect, an embodiment of the present application provides an execution device, which may include a processor, where the processor is coupled to a memory, and where the memory stores program instructions, and where the program instructions stored in the memory, when executed by the processor, implement the steps performed by the text recognition network according to the first aspect.
In a sixth aspect, an embodiment of the present application provides a training device, which may include a processor, and a memory coupled to the processor, where the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, implement the training method of the text recognition network according to the second aspect.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored, which when executed on a computer causes the computer to perform the steps performed by the text recognition network described in the first aspect, or causes the computer to perform the training method of the text recognition network described in the second aspect.
In an eighth aspect, an embodiment of the present application provides a circuit system, where the circuit system includes a processing circuit configured to perform the steps performed by the text recognition network described in the first aspect, or perform the training method of the text recognition network described in the second aspect.
In a ninth aspect, embodiments of the present application provide a computer program, which when run on a computer, causes the computer to perform the steps performed by the text recognition network described in the first aspect or to perform the training method of the text recognition network described in the second aspect.
In a tenth aspect, embodiments of the present application provide a chip system, which includes a processor for implementing the functions involved in the above aspects, for example, transmitting or processing data and/or information involved in the above method. In one possible design, the chip system further includes a memory for holding program instructions and data necessary for the server or the communication device. The chip system can be composed of chips, and can also comprise chips and other discrete devices.
Drawings
FIG. 1 is a schematic diagram of the artificial intelligence main body framework according to an embodiment of the present application;
FIG. 2 is a system architecture diagram of a text recognition system according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a workflow of a text recognition network according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of generating a fourth feature in the workflow of the text recognition network according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of generating fifth and sixth features in the workflow of the text recognition network according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a network architecture of a text recognition network according to an embodiment of the present application;
FIG. 7 is a schematic flowchart of a training method of a text recognition network according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a beneficial effect of the text recognition network according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a text recognition network according to an embodiment of the present application;
FIG. 10 is a schematic diagram of another structure of a text recognition network according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a training device of a text recognition network according to an embodiment of the present application;
FIG. 12 is a schematic diagram of another structure of a training device of a text recognition network according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a training device according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
The embodiments of the application provide a text recognition network, a neural network training method, and related equipment, which generate a recognition result according to both the semantic feature of a predicted character and the image feature of the image to be recognized, so that the recognition operation is performed on features of more dimensions; and because the accuracy of the predicted character is not affected by blurring of the image or occlusion of some characters in the image to be recognized, the accuracy of the text recognition result is improved.
Embodiments of the present application are described below with reference to the accompanying drawings. As one of ordinary skill in the art can know, with the development of technology and the appearance of new scenes, the technical scheme provided by the embodiment of the application is also applicable to similar technical problems.
The terms "first", "second", and the like in the description, the claims, and the above figures are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It should be understood that the terms so used are interchangeable under appropriate circumstances; they are merely a way of distinguishing objects with the same attributes when describing the embodiments of the application. Furthermore, the terms "comprise" and "have", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to FIG. 1, FIG. 1 shows a schematic structural diagram of the artificial intelligence main body framework, which is described below from the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to processing; for example, there may be the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (provision and processing technology implementation) of human intelligence to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing-capability support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the base platform. Communication with the outside is carried out through sensors; computing power is provided by smart chips, including but not limited to hardware accelerator chips such as central processing units (CPU), neural-network processing units (NPU), graphics processing units (GPU), application-specific integrated circuits (ASIC), and field-programmable gate arrays (FPGA); the base platform comprises a distributed computing framework, networks, and other related platform guarantees and support, and may include cloud storage and computing, interconnection and interworking networks, and the like. For example, a sensor communicates with the outside to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the base platform for computation.
(2) Data
The data of the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also relate to the internet of things data of the traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capability
After the data has been processed as described above, further general capabilities may be formed based on the results of the data processing, such as algorithms or a general system, e.g., translation, analysis of text, processing of computer vision, speech recognition, etc.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, turning intelligent information decision-making into products and realizing practical deployment. The application fields mainly include: intelligent terminals, intelligent manufacturing, intelligent transportation, smart home, intelligent healthcare, intelligent security, autonomous driving, safe cities, and the like.
The embodiments of the application can be applied to various fields of artificial intelligence, in particular to any scenario in which characters in an image need to be recognized, where the image is acquired through a device such as a camera, printer, or scanner. As an example, in one application scenario, in fields such as finance, accounting, and tax, an enterprise needs to scan documents such as receipts or invoices to obtain image documents and recognize the characters in them to extract text information, enabling functions such as digital archiving, quick document indexing, or document analysis. In another application scenario, a user needs to input the information on a certificate such as an identity card, driver's license, vehicle license, or passport; the user can capture an image of the certificate with a camera and recognize the characters in the image to extract the key information. It should be understood that these examples are only for convenience in understanding the application scenarios of the embodiments of the present application and are not exhaustive. In the foregoing scenarios the image quality may be low, and the text recognition network provided by the embodiments of the present application can recognize such images and improve the accuracy of the recognition result.
In order to facilitate understanding of the present solution, in the embodiment of the present application, a text recognition system provided in the embodiment of the present application is first described with reference to fig. 2, and referring to fig. 2, fig. 2 is a system architecture diagram of the text recognition system provided in the embodiment of the present application. In fig. 2, the text recognition system 200 includes an execution device 210, a training device 220, a database 230, and a data storage system 240, with the execution device 210 including a computing module 211.
During the training phase, a training data set is stored in the database 230, where the training data set may include a plurality of images to be recognized and the correct result corresponding to the first character in each image to be recognized. The training device 220 generates a target model/rule 201 and iteratively trains it using the training data set in the database to obtain a mature target model/rule 201.
In the inference phase, the execution device 210 may call data, code, etc. in the data storage system 240, or may store data, instructions, etc. in the data storage system 240. The data storage system 240 may be configured in the execution device 210, or the data storage system 240 may be an external memory with respect to the execution device 210. The computing module 211 may perform the recognition operation on the image to be recognized input by the execution device 210 through the mature target model/rule 201, so as to obtain a recognition result of the first character in the image to be recognized.
In some embodiments of the present application, such as in FIG. 2, the "user" may interact directly with the execution device 210, i.e., the execution device 210 and the client device are integrated in the same device. However, FIG. 2 is only a schematic architecture diagram of the text recognition system provided by an embodiment of the present application, and the positional relationships between the devices, apparatuses, and modules shown in the figure do not constitute any limitation. In other embodiments of the present application, the execution device 210 and the client device may be separate devices: the execution device 210 is configured with an input/output interface for data interaction with the client device, the "user" may input the acquired image to the input/output interface through the client device, and the execution device 210 returns the processing result to the client device through the input/output interface.
Based on the above description, an embodiment of the application provides a text recognition network comprising an image feature extraction module, a text feature acquisition module, and a recognition module. The image feature extraction module is used for extracting the image feature of a first character in an image to be recognized; the text feature acquisition module is used for performing text prediction according to the semantic information of a preset character corresponding to the first character, so as to obtain the semantic feature of a predicted character; and the recognition module then performs a recognition operation according to the image feature of the first character and the semantic feature of the predicted character to generate a recognition result. As described with reference to FIG. 2, the embodiments of the present application include an inference phase and a training phase with different flows, and the two phases are described separately below.
1. Inference phase
In an embodiment of the present application, the reasoning phase describes how the execution device 210 performs character recognition on the image to be recognized using a sophisticated text recognition network. Referring to fig. 3, fig. 3 is a schematic flow chart of a workflow of a text recognition network according to an embodiment of the present application, and the method may include:
301. The execution device inputs the image to be identified to an image feature extraction module, and performs feature extraction on the image to be identified to generate a first feature corresponding to a first character in the image to be identified.
In the embodiment of the application, after acquiring the image to be recognized, the execution device inputs it into the image feature extraction module of the text recognition network, which performs feature extraction on the image to generate a first feature corresponding to a first character in the image to be recognized, where the first character is a character to be recognized in the image.
The image feature extraction module in the text recognition network may specifically be a convolutional neural network, a histogram of oriented gradients (HOG), a local binary pattern (LBP), or another neural network for extracting features from an image.
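As a hedged sketch, a convolutional image feature extraction module for a text line could look as follows; the layer configuration and the height-collapsing pooling are assumptions, not the patent's architecture:

```python
# Sketch of a convolutional image feature extraction module for a text line:
# it maps an image of a row of first characters to a sequence of first
# features, one per horizontal position. The architecture is an assumption.
import torch
import torch.nn as nn

class ImageFeatureExtractor(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width
        )

    def forward(self, line_image):           # (B, 3, H, W)
        f = self.backbone(line_image)        # (B, feat_dim, 1, W/4)
        return f.squeeze(2).transpose(1, 2)  # (B, W/4, feat_dim)
```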
One image to be recognized may include one or more rows of first characters, or one or more columns of first characters. If the granularity at which the text recognition network performs the recognition operation is characters, that is, if the execution device obtains the recognition result of one character in the image to be recognized each time it performs a recognition operation through the text recognition network, then one first character comprises one or more characters. As an example, if a first character included in the image to be recognized is "cat", the text recognition network obtains the recognition result of one character, such as "c", each time it performs a recognition operation. As another example, if a first character included in the image to be recognized is a Chinese phrase meaning "the weather is really nice today", the text recognition network generates the feature of a single Chinese character per recognition operation and outputs the recognition result corresponding to that character.
If the granularity at which the text recognition network performs the recognition operation is words, that is, if the execution device obtains the recognition result of one word in the image to be recognized each time it performs a recognition operation through the text recognition network, then one first character includes one or more words. As an example, if a first character included in the image to be recognized is "how are you", the text recognition network performs one recognition operation to obtain the recognition result of one word, such as "how". As another example, if a first character included in the image to be recognized is a Chinese phrase meaning "the weather is really nice today", the text recognition network obtains the recognition result of one word, such as the word meaning "today", per recognition operation. It should be understood that the above examples are only for convenience in understanding the present solution and do not limit it.
Specifically, in one implementation, after the image to be identified is acquired, the executing device performs image segmentation on the image to be identified to generate at least one segmented image to be identified (i.e., segments the image to be identified into at least one text region). If one image to be recognized comprises one or more rows of first characters, each segmented image to be recognized (namely, each text region) comprises one row of first characters; if one or more columns of first characters are included in one image to be recognized, each segmented image to be recognized includes one column of first characters.
More specifically, in one case, the text recognition network is further configured with an image segmentation module, and the execution device performs image segmentation on the image to be recognized through this module to obtain at least one segmented image to be recognized. In another case, the execution device may be configured with a first neural network for image segmentation in addition to the text recognition network, and performs image segmentation on the image to be recognized through this first neural network to obtain at least one segmented image to be recognized. Further, the image segmentation module in the text recognition network or the first neural network for image segmentation may specifically be a shape-robust text detection network based on a progressive scale expansion network (PSENet), rCTPN, ASTER, or another neural network for image segmentation, which is not limited here.
Correspondingly, step 301 may include: the execution device inputs the segmented image to be identified to the image feature extraction module, and performs feature extraction on the segmented image to be identified to generate a first feature corresponding to a first character in the segmented image to be identified, wherein the segmented image is a text region in the image to be identified. A first feature refers to a feature of a segmented image to be recognized, which includes image features of a row of first characters (i.e., a text region in the image to be recognized), or image features of a column of first characters.
In another implementation manner, in the case that one image to be recognized includes a plurality of rows of first characters or includes a plurality of columns of first characters, after the image to be recognized is acquired, the executing device inputs the whole image to be recognized into an image feature extraction module of the text recognition network, and performs feature extraction on the whole image to be recognized to generate first features corresponding to the first characters in the image to be recognized. Wherein, a first feature refers to the feature of the whole image to be recognized, if a row of first characters or a column of first characters are included in the image to be recognized, the first feature is the image feature of the row of first characters or the column of first characters in the image to be recognized; if the image to be recognized comprises a plurality of rows or columns of first characters, the first feature is the image feature of the plurality of rows or columns of first characters in the image to be recognized.
302. The execution device inputs preset characters corresponding to the first characters in the image to be recognized to the text feature acquisition module, and performs text prediction according to the preset characters to generate semantic features of the first predicted characters.
In the embodiment of the application, under the condition that the execution equipment carries out the recognition operation on the first character in the image to be recognized for the first time, the execution equipment acquires the preset character corresponding to the first character in the image to be recognized, and inputs the preset character corresponding to the first character in the image to be recognized into the text feature acquisition module of the text recognition network, so that text prediction is carried out through the text feature acquisition module according to the preset character, and semantic features of the first predicted character are generated.
If the execution device performs image segmentation on the entire image to be identified, the first character in the image to be identified is first identified when the identification operation is performed on the segmented image to be identified (i.e., a text region of the image to be identified). If the execution device does not perform image segmentation on the whole image to be recognized, performing recognition operation on the first character in the image to be recognized for the first time refers to performing recognition operation on the whole image to be recognized for the first time.
The preset character may be a start-flag character, which may be represented in a computer program as a <BOS> character and instructs the text feature acquisition module to begin text prediction. The preset character is expressed in a predefined form and may specifically be a vector comprising N elements, where each of the N elements is a fixed numerical value. Further, N is an integer greater than or equal to 1. As an example, the preset character may be a vector of 32 elements all equal to 1, or a vector of 64 elements all equal to 2, and so on; the possibilities are not exhaustively listed here.
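A minimal sketch of such a predefined start-flag character, following the vector-of-ones example above (N = 32 is an assumed value):

```python
# Sketch of the preset start-flag (<BOS>) character as a predefined vector
# of N elements with fixed values; N = 32 follows the example above.
import torch

N = 32
bos_token = torch.ones(N)  # a vector of 32 elements, each equal to 1
# This vector is fed to the text feature acquisition module when the
# recognition operation is performed on the image for the first time.
```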
The text feature acquisition module of the text recognition network may include an encoding module for extracting the text features of the input characters and a decoding module for generating the text feature of the predicted character from the text features of the input characters. Further, the encoding module may be the encoder in a recurrent neural network (RNN) and the decoding module the decoder in the recurrent neural network; as an example, the encoding module and decoding module may be those of a long short-term memory network (LSTM). The encoding module may also be a self-attention encoding module, with the decoding module a self-attention decoding module; as an example, these may be the self-attention encoding and decoding modules of a neural network based on bidirectional encoder representations from transformers (BERT). They may also be the encoding and decoding modules of other neural networks for text prediction, which are not exhaustively listed here.
Specifically, in one implementation, the encoding module and the decoding module in the text feature acquisition module are a self-attention encoding module and a self-attention decoding module, respectively. Step 302 may include: the execution device converts the preset characters from character form to tensor form through the text feature acquisition module to generate character codes of the preset characters, and generates position codes of the preset characters according to the positions of the preset characters in the first characters in the image to be identified; and combining the character codes of the preset characters and the position codes of the preset characters to obtain initial characteristics of the preset characters. And the execution device further executes the self-attention encoding operation and the self-attention decoding operation according to the initial characteristics of the preset characters through the text characteristic acquisition module so as to generate semantic characteristics of the first predicted characters.
In the embodiment of the application, the text prediction is performed by executing the self-attention encoding operation and the self-attention decoding operation on the initial characteristics of the preset characters so as to generate the semantic characteristics of the first predicted character, so that the calculation speed is high and the complexity is low.
More specifically, regarding the generation process of the character encoding: the execution device may perform vectorization (embedding) on the preset character through the text feature acquisition module to generate the character code of the preset character. The execution device may also acquire a one-hot code of the preset character and determine the one-hot code as the character code of the preset character; the process of generating the character code of the preset character is not limited here. The character code of the preset character may be a vector including M elements, where the value of M depends on which neural network the text feature acquisition module of the text recognition network adopts, and is not limited herein.
Regarding the generation process of the position code: the position of the preset character among the first characters in the image to be recognized is the first position, and the position code of the preset character indicates that the position of the preset character is the first position. Optionally, the position code of the preset character may be a vector including M elements. As an example, if the value of M is 512, the position code of the preset character may be a vector including one 1 and 511 0s, where the 1 is located at the first position, indicating that the position of the preset character among the first characters in the image to be recognized is the first position. Optionally, the execution device may further perform a secondary conversion on the foregoing 512 elements through a cosine function. It should be understood that the example of the value of M and the expression form of the position code is only for facilitating understanding of the present scheme, and is not a limitation of the scheme. The manner in which the character code and the position code are combined includes, but is not limited to, concatenation (concat), addition (add), fusion, multiplication, and the like.
Regarding the process of generating the semantic feature of the first predicted character: after obtaining the initial feature of the preset character, the execution device performs text prediction through the text feature acquisition module of the text recognition network; that is, it performs a self-attention encoding operation on the initial feature of the preset character to generate an updated feature of the preset character, and performs a self-attention decoding operation on the updated feature of the preset character to generate the semantic feature of the first predicted character.
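The following PyTorch sketch illustrates one possible realization of this self-attention variant of step 302. The module name, dimensions (d_model = 512, 8 heads), vocabulary size, and the use of learned position embeddings combined by addition are all assumptions for illustration; the patent does not mandate this implementation:

```python
import torch
import torch.nn as nn

class TextFeatureAcquisition(nn.Module):
    """Sketch of the text feature acquisition module (self-attention variant)."""
    def __init__(self, vocab_size=100, d_model=512, max_len=64):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, d_model)  # character code
        self.pos_embed = nn.Embedding(max_len, d_model)      # position code
        self.encoder = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) indices of the preset/second characters
        pos = torch.arange(char_ids.size(1), device=char_ids.device)
        init_feat = self.char_embed(char_ids) + self.pos_embed(pos)  # combine by addition
        updated = self.encoder(init_feat)            # self-attention encoding operation
        semantic = self.decoder(init_feat, updated)  # self-attention decoding operation
        return semantic  # semantic feature of the (first) predicted character

bos_ids = torch.zeros(1, 1, dtype=torch.long)  # index 0 stands for the <BOS> character
semantic_feature = TextFeatureAcquisition()(bos_ids)
```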
In another implementation, the encoding and decoding modules in the text feature acquisition module are selected from a recurrent neural network. Step 302 may include: the execution device converts the preset characters from character form to tensor form through the text feature acquisition module to generate character codes of the preset characters, and determines the character codes of the preset characters as initial features of the preset characters. And the execution device further executes the encoding operation and the decoding operation according to the initial characteristics of the preset characters through the text characteristic acquisition module so as to generate semantic characteristics of the first predicted characters.
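A minimal sketch of this recurrent variant, assuming an LSTM encoder/decoder pair and hypothetical dimensions; here, consistent with the text, the character code alone serves as the initial feature:

```python
import torch
import torch.nn as nn

# Sketch of the recurrent variant of step 302; sizes are hypothetical.
d_model = 512
char_embed = nn.Embedding(100, d_model)
encoder = nn.LSTM(d_model, d_model, batch_first=True)
decoder = nn.LSTM(d_model, d_model, batch_first=True)

bos_ids = torch.zeros(1, 1, dtype=torch.long)
init_feat = char_embed(bos_ids)                # no position code in this variant
enc_out, state = encoder(init_feat)            # encoding operation
semantic_feature, _ = decoder(enc_out, state)  # decoding operation
```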
It should be noted that, when the encoding module and the decoding module in the text feature obtaining module are selected from other types of neural networks for text prediction, the step 302 may be modified correspondingly, which is not exhaustive herein.
In addition, the execution sequence of step 301 and step 302 is not limited in the embodiment of the present application: step 301 may be executed first and then step 302, step 302 may be executed first and then step 301, or step 301 and step 302 may be executed simultaneously.
303. The execution device combines the features of the preset characters with the first features through the feature updating module to generate fourth features.
In some embodiments of the present application, after the image feature extraction module of the text recognition network generates the first feature corresponding to the first character in the image to be recognized, the execution device may further combine the feature of the preset character with the first feature to generate a fourth feature, where the fourth feature is the updated first feature. The characteristics of the preset character can be updated characteristics of the preset character or initial characteristics of the preset character.
Specifically, in one implementation manner, the executing device executes, through a feature updating module of the text recognition network, a self-attention encoding operation according to an initial feature of a preset character, obtains an updated feature of the preset character, and executes the self-attention encoding operation according to the first feature and the updated feature of the preset character to generate a fourth feature.
For a more intuitive understanding of the generation process of the updated feature of the preset character in the self-attention encoding process, the formula for performing the self-attention encoding operation on the preset character is disclosed as follows:
$$Q'_{char} = \mathrm{Norm}\big(\mathrm{softmax}(Q_{char} K_{char})\, V_{char} + Q_{char}\big) \qquad (1)$$
where $Q_{char}$ is obtained by multiplying the initial feature of the preset character by a first conversion matrix, $K_{char}$ by a second conversion matrix, and $V_{char}$ by a third conversion matrix; $Q_{char} K_{char}$ denotes the dot multiplication of $Q_{char}$ and $K_{char}$, and $\mathrm{softmax}(Q_{char} K_{char})\, V_{char}$ denotes the dot multiplication of $\mathrm{softmax}(Q_{char} K_{char})$ and $V_{char}$; $Q'_{char}$ denotes the updated feature of the preset character. The first, second and third conversion matrices may be the same or different. It should be understood that the example in formula (1) is only for facilitating understanding of the scheme and is not a limitation of the scheme.
For the process of generating the fourth feature, and in order to more intuitively understand the self-attention encoding performed according to the first feature and the updated feature of the preset character, the corresponding formula is disclosed as follows:
$$Q'_{img} = \mathrm{Norm}\big(\mathrm{softmax}(Q'_{char} K_{img})\, V_{img} + Q'_{char}\big) \qquad (2)$$
where $Q'_{img}$ represents the fourth feature, $Q'_{char}$ represents the updated feature of the preset character, $K_{img}$ is obtained by multiplying the first feature by a fourth conversion matrix, and $V_{img}$ by a fifth conversion matrix; the fourth and fifth conversion matrices may be the same or different. It should be understood that the example in formula (2) is only for facilitating understanding of the present scheme, and is not a limitation of the scheme.
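A minimal sketch of formulas (1) and (2), assuming single-head attention, that "dot multiplication" of Q and K corresponds to QKᵀ over sequences, and that Norm denotes layer normalization; none of these details are fixed by the patent:

```python
import torch
import torch.nn.functional as F

def attention_update(q_in, k_in, v_in, w_q, w_k, w_v, norm):
    """Generic form of formulas (1) and (2): Norm(softmax(QK)V + Q)."""
    q = q_in @ w_q  # query = input x conversion matrix
    k = k_in @ w_k  # key   = input x conversion matrix
    v = v_in @ w_v  # value = input x conversion matrix
    attn = F.softmax(q @ k.transpose(-2, -1), dim=-1)
    return norm(attn @ v + q)

d = 512
norm = torch.nn.LayerNorm(d)
w1, w2, w3, w4, w5 = (torch.randn(d, d) for _ in range(5))

# Formula (1): update the preset character's initial feature against itself.
char_init = torch.randn(1, 1, d)       # initial feature of the preset character
q_char = attention_update(char_init, char_init, char_init, w1, w2, w3, norm)

# Formula (2): attend from the updated character feature to the image feature;
# Q'_char is used directly as the query, so its conversion matrix is identity.
first_feature = torch.randn(1, 10, d)  # first feature (image feature)
fourth_feature = attention_update(q_char, first_feature, first_feature,
                                  torch.eye(d), w4, w5, norm)
```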
For a more intuitive understanding of the present solution, please refer to fig. 4. Fig. 4 is a schematic flow chart of generating the fourth feature in the workflow of the text recognition network according to the embodiment of the present application; fig. 4 illustrates the case where the text recognition network performs one recognition operation to obtain the recognition result of one character, that is, one second character includes one character. As shown in fig. 4, the execution device inputs the image to be recognized to the image feature extraction module of the text recognition network to obtain the image feature of the first character in the image to be recognized (i.e., the first feature of the first character in the image to be recognized); fig. 4 takes as an example an image feature extraction module including a plurality of convolution layers and a plurality of pooling layers, where max pool refers to maximum pooling. As shown in fig. 4, the execution device generates the character code and the position code of the preset character to obtain the initial feature of the preset character, and generates the updated feature $Q'_{char}$ of the preset character by the above formula (1). After obtaining the image feature of the first character in the image to be recognized and the updated feature of the preset character, the execution device performs the self-attention encoding operation by the above formula (2) to generate the fourth feature. It should be noted that, in practice, more neural networks may be provided in the feature update module of the text recognition network; for example, the feature update module may further be provided with a feedforward neural network, a regularization module, and the like. Fig. 4 is only an example for facilitating understanding of the present scheme, and is not a limitation of the scheme.
In another implementation, the execution device performs, through a feature update module of the text recognition network, a self-attention encoding operation according to the first feature and an initial feature of the preset character to generate a fourth feature.
In another implementation manner, the execution device executes the encoding operation according to the initial feature of the preset character through the feature updating module of the text recognition network to obtain an updated feature of the preset character, and executes the encoding operation according to the first feature and the updated feature of the preset character to generate the fourth feature. Further, the feature update module of the text recognition network performs the encoding operation by an encoder, which is an encoder in the recurrent neural network.
In another implementation, the execution device performs, through a feature update module of the text recognition network, an encoding operation according to the first feature and an initial feature of the preset character to generate a fourth feature.
304. The execution device executes a recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character to generate a first recognition result.
In the embodiment of the application, the execution device combines the first feature and the semantic feature of the first predicted character through the recognition module, and executes recognition operation according to the combined feature to generate a first recognition result. If the granularity at which the text recognition network performs the recognition operation is character, a first recognition result is a character recognized by the text recognition network. If the granularity at which the text recognition network performs the recognition operation is word, a first recognition result is a word recognized by the text recognition network.
Specifically, step 303 is an optional step. If step 303 is performed, step 304 includes: the execution device combines the fourth feature (i.e., the updated first feature) and the semantic feature of the first predicted character through the recognition module, and performs the recognition operation according to the combined feature to generate the first recognition result.
Regarding the process of combining the fourth feature and the semantic feature of the first predicted character: in one implementation, the execution device directly combines the fourth feature (i.e., the updated first feature) and the semantic feature of the first predicted character through the recognition module, by way of splicing, matrix multiplication, or the like.
In another implementation, the execution device performs, through the recognition module, the combining operation of the fourth feature and the semantic feature of the first predicted character according to the similarity between them. The execution device calculates a first similarity between the fourth feature and the semantic feature of the first predicted character through the recognition module; generates a fifth feature according to the fourth feature, the semantic feature of the first predicted character and the first similarity; and generates a sixth feature according to the fourth feature, the semantic feature of the first predicted character and the first similarity. The fifth feature and the sixth feature are then combined by the recognition module.
The first similarity may be obtained by calculating a cosine similarity, a Euclidean distance, a Mahalanobis distance, or the like between the fourth feature and the semantic feature of the first predicted character, or by performing a dot product operation on the two. Further, the first similarity may include one similarity value, or two similarity values that are transposes of each other. The fifth feature combines the semantic feature of the first predicted character on the basis of the fourth feature, and the sixth feature combines the fourth feature on the basis of the semantic feature of the first predicted character. The manner in which the fifth feature and the sixth feature are combined includes, but is not limited to, splicing, addition, multiplication, or other combination manners; the possibilities are not exhausted here.
More specifically, for a more intuitive understanding of the process of generating the fifth feature and the sixth feature, please refer to fig. 5. Fig. 5 is a schematic flow chart of generating the fifth feature and the sixth feature in the workflow of the text recognition network according to the embodiment of the present application, and illustrates the first similarity generated by dot multiplication. In fig. 5, K_vis represents the fourth feature (i.e., the updated first feature) and Q_lin represents the semantic feature of the first predicted character. P_lin is obtained by dot multiplying Q_lin by a first weight, and P_vis is obtained by dot multiplying K_vis by a second weight, where the first weight and the second weight are determined in the training stage of the text recognition network. S_vis represents the similarity of the fourth feature to the first predicted character and S_lin represents the similarity of the first predicted character to the fourth feature; both are calculated by formulas based on the dot products of P_vis and P_lin, scaled by d, where d represents the number of dimensions of the feature, i.e., the number of elements included in the fourth feature or the fifth feature. The fifth feature, which combines the semantic feature of the first predicted character on the basis of the fourth feature, is obtained in fig. 5 by splicing, with one operand obtained as the dot product of S_lin, K_lin and a learned weight. The sixth feature, which combines the fourth feature on the basis of the semantic feature of the first predicted character, is likewise obtained by splicing, with one operand obtained as the dot product of S_vis, K_vis and a learned weight. It should be understood that fig. 5 is only an example for facilitating understanding of the present solution, and is not intended to limit the present solution.
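Since the exact formulas behind fig. 5 are not fully recoverable from the text, the following sketch shows one plausible reading of this similarity-based combination, assuming scaled dot-product similarity and splicing (concatenation); the scaling by √d and all shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def cross_combine(fourth, semantic, d):
    """Fuse the updated image feature (fourth feature) with the semantic
    feature of the first predicted character via mutual similarity."""
    # First similarity, here as two scaled dot products that are transposes
    # of each other (one of the options named in the text).
    s_vis = F.softmax(fourth @ semantic.transpose(-2, -1) / d ** 0.5, dim=-1)
    s_lin = F.softmax(semantic @ fourth.transpose(-2, -1) / d ** 0.5, dim=-1)
    fifth = torch.cat([fourth, s_vis @ semantic], dim=-1)  # image side + attended text
    sixth = torch.cat([semantic, s_lin @ fourth], dim=-1)  # text side + attended image
    return fifth, sixth

d = 512
fourth_feature = torch.randn(1, 10, d)   # updated first feature
semantic_feature = torch.randn(1, 1, d)  # semantic feature of first predicted char
fifth, sixth = cross_combine(fourth_feature, semantic_feature, d)
```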
Regarding the process of performing the recognition operation according to the combined features: the execution device combines the fifth feature and the sixth feature through the recognition module, and then inputs the combined feature to the classification network in the recognition module, so as to perform the recognition operation through the classification network and obtain the first recognition result output by the whole recognition module.
The classification network may specifically be a classifier, where the classifier may be a multi-layer perceptron (MLP), or may be composed of a linear transformation matrix and a softmax classification function; the specific form of the classification network is not limited herein.
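For instance, the linear-transformation-plus-softmax option named above could be sketched as follows; the vocabulary size and input dimension are hypothetical:

```python
import torch.nn as nn

# Sketch of the classification network inside the recognition module:
# a linear transformation matrix followed by a softmax classification
# function over the character vocabulary (sizes are hypothetical).
vocab_size = 100
classifier = nn.Sequential(
    nn.Linear(2 * 512, vocab_size),  # input: the combined (spliced) feature
    nn.Softmax(dim=-1),
)
```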
If step 303 is not performed, step 304 includes: the execution device combines the first feature obtained in step 301 and the semantic feature of the first predicted character through the recognition module, and performs the recognition operation according to the combined feature to generate the first recognition result. For the specific implementation process, refer to the foregoing description of the case where step 303 is performed; details are not repeated here.
305. The execution device inputs a second character corresponding to the recognized character in the first character to the text feature acquisition module, and performs text prediction according to the second character to generate semantic features of the first predicted character.
In the embodiment of the present application, the implementation of step 305 is similar to that of step 302. In the case where the recognition operation has been performed on at least one of the first characters, the execution device acquires at least one second character corresponding to the recognized characters among the first characters. Specifically, the execution device determines the preset character together with the at least one recognition result corresponding to the recognized characters as the second characters corresponding to the recognized characters among the first characters. Since the execution device generates the semantic feature of the first predicted character from the preset character when the recognition operation is performed on the first character for the first time, and from the recognized results thereafter, the completeness of the scheme is ensured, manual intervention is avoided in the whole recognition process, and the usability of the scheme is improved.
More specifically, if the execution device proceeds to step 305 through step 304, step 305 includes: the execution device determines the preset character and the first recognition result as the second characters corresponding to the recognized characters among the first characters. If the execution device proceeds to step 305 through step 307, step 305 includes: the execution device determines the preset character, the first recognition result and the at least one second recognition result as the plurality of second characters corresponding to the recognized characters among the first characters.
Wherein, in the case that the granularity of the text recognition network for performing the recognition operation is character, the first character is a word including at least one character, one character is included in one recognition result, and one character is included in each second character. In the case that the granularity at which the text recognition network performs the recognition operation is a word, at least one word is included in the first character, one recognition result is a word including one or more characters, and each second character is a word including one or more characters. In the embodiment of the application, the granularity of the text recognition network for executing the recognition operation can be characters or words, so that the application scene of the scheme is expanded, and the implementation flexibility of the scheme is improved.
The execution device inputs all second characters corresponding to all recognized characters in the first characters into a text feature acquisition module of the text recognition network, so that text prediction is performed through an encoding module and a decoding module in the text feature acquisition module according to all the second characters, and semantic features of the first predicted characters are generated.
Specifically, in one implementation, the encoding module and the decoding module in the text feature acquisition module are a self-attention encoding module and a self-attention decoding module, respectively. Step 305 may include: the execution device converts any one of the at least one second character from character form to tensor form through the text feature acquisition module to generate the character code of that second character, and generates the position code of that second character according to the position, among the first characters in the image to be recognized, of the character to which that second character corresponds. The execution device combines the character code and the position code of that second character through the text feature acquisition module to obtain the initial feature of that second character, and performs the foregoing operation on each of the at least one second character to generate the initial feature of each second character. The execution device then performs the self-attention encoding operation and the self-attention decoding operation according to the initial features of the second characters through the text feature acquisition module, so as to generate the semantic feature of the first predicted character.
In another implementation, the encoding and decoding modules in the text feature acquisition module are selected from a recurrent neural network. Step 305 may include: the execution device converts each of the at least one second character from character form to tensor form through the text feature acquisition module to generate the character code of each second character, and determines the character code of each second character as the initial feature of that second character. The execution device then performs the encoding operation and the decoding operation according to the initial features of all of the at least one second character through the text feature acquisition module, so as to generate the semantic feature of the first predicted character.
The specific implementation manner of both the above two implementation manners may be referred to the description in step 302, and will not be repeated here.
306. The execution device combines the feature of the second character with the first feature through the feature update module to generate a seventh feature.
In this embodiment of the present application, the specific implementation of step 306 is similar to that of step 303. After the image feature extraction module of the text recognition network generates the first feature corresponding to the first character in the image to be recognized, the execution device may further combine the features of the second characters with the first feature to generate a seventh feature, where the seventh feature is the updated first feature. The features of the second characters may be the updated features of the second characters or the initial features of the second characters. The first feature includes the image features of a plurality of first characters, at least one of which has had the recognition operation performed on it; in the case where the second characters include the recognition results corresponding to a plurality of recognized characters, the features of the second characters include the features of those recognition results. Relative to the first feature, the seventh feature is enhanced in the features of the recognized characters.
In the embodiment of the application, the semantic features of the recognized characters are blended into the image features, so that the features of the recognized characters in the image features are more obvious, and the recognition module can more intensively recognize the not-recognized characters, thereby reducing the difficulty of the single recognition process of the recognition module and being beneficial to improving the accuracy of text recognition.
Specifically, in one implementation, the executing device executes, through a feature update module of the text recognition network, a self-attention encoding operation according to an initial feature of the second character, to obtain an updated feature of the second character, and executes the self-attention encoding operation according to the first feature and the updated feature of the second character to generate a seventh feature (i.e., the updated first feature). In the embodiment of the application, the characteristic of the second character is combined with the first characteristic by adopting a self-attention coding mode, so that the full combination of the characteristic of the second character and the first characteristic is facilitated, the complexity is low, and the realization is easy.
In another implementation, the execution device performs, via a feature update module of the text recognition network, a self-attention encoding operation based on the first feature and the initial feature of the second character to generate a seventh feature.
In another implementation, the execution device performs, through the feature update module of the text recognition network, the encoding operation according to the initial feature of the second character to obtain the updated feature of the second character, and performs the encoding operation according to the first feature and the updated feature of the second character to generate the seventh feature. Further, the feature update module of the text recognition network performs the encoding operation through an encoder, which is an encoder in a recurrent neural network.
In another implementation, the execution device performs, via a feature update module of the text recognition network, an encoding operation based on the first feature and the initial feature of the second character to generate a seventh feature.
Various embodiments of step 306 may refer to the description of step 303, and will not be described herein.
307. The execution device executes the recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character to generate a second recognition result.
In the embodiment of the present application, the specific implementation manner of step 307 is similar to the specific implementation manner of step 304, and the execution device combines the first feature and the semantic feature of the first predicted character through the recognition module, and executes the recognition operation according to the combined feature, so as to generate a second recognition result.
Specifically, step 306 is an optional step, and if step 306 is performed, step 307 includes: the execution device combines the seventh feature (i.e., the updated first feature) with the semantic feature of the first predicted character via the recognition module, and performs a recognition operation according to the combined feature to generate a second recognition result.
Regarding the process of combining the seventh feature and the semantic feature of the first predicted character: in one implementation, the execution device directly combines the seventh feature (i.e., the updated first feature) and the semantic feature of the first predicted character through the recognition module, by way of splicing, matrix multiplication, or the like.
In another implementation, the executing device executes, through the identifying module, a combining operation of the seventh feature and the semantic feature of the first predicted character according to a similarity between the seventh feature and the semantic feature of the first predicted character. The execution device calculates the similarity between the seventh feature (namely the updated first feature) and the semantic feature of the first predicted character through the identification module; and generating a second feature and a third feature according to the seventh feature, the semantic feature and the similarity of the first predicted character. Wherein the second feature is a semantic feature of the first predicted character combined on the basis of the seventh feature, and the third feature is a seventh feature combined on the basis of the semantic feature of the first predicted character; and executing the identification operation according to the second characteristic and the third characteristic to generate a second identification result.
In the embodiment of the application, the similarity between the first feature and the semantic feature of the first predicted character is calculated, and the second feature and the third feature are then generated according to that similarity. The second feature combines the semantic feature of the first predicted character on the basis of the first feature, and the third feature combines the first feature on the basis of the semantic feature of the first predicted character; that is, the image feature of the character to be recognized is enhanced according to the semantic feature of the predicted character, and the image feature of the character to be recognized is merged into the semantic feature of the predicted character. This facilitates a full fusion of the image feature and the predicted-character feature, and thus improves the accuracy of the text recognition result.
Regarding the process of performing the recognition operation according to the combined features: the execution device combines the second feature and the third feature through the recognition module, and then inputs the combined feature into the classification network in the recognition module, so as to perform the recognition operation through the classification network and obtain the second recognition result output by the whole recognition module.
If step 306 is not performed, step 307 includes: the execution device combines the first feature obtained in step 301 and the semantic feature of the first predicted character through the recognition module, and performs a recognition operation according to the combined feature to generate a second recognition result. The specific implementation manner of step 307 may refer to the description in step 304, which is not described herein.
It should be noted that, the number of times of execution of steps 301 to 304 and steps 305 to 307 is not limited in the embodiment of the present application, and steps 305 to 307 may be repeatedly executed after steps 301 to 304 are executed once to obtain a plurality of second recognition results.
Specifically, if the granularity at which the text recognition network performs one recognition operation is character, each time the execution device performs steps 305 to 307, the recognition result of one character among the first characters is obtained, and the execution device repeatedly performs steps 305 to 307 to obtain the recognition results of all characters among the first characters. If the granularity at which the text recognition network performs one recognition operation is word, each time the execution device performs steps 305 to 307, the recognition result of one word among the first characters is obtained, and the execution device repeatedly performs steps 305 to 307 to obtain the recognition results of all words among the first characters. The recognition result of the entire first character can then be output (see the sketch below).
Further, if one first character includes only one character to be recognized, or one first character includes only one word to be recognized, the executing device may directly output the recognition result of the entire first character after executing steps 301 to 304.
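The repetition of steps 305 to 307 amounts to an autoregressive decoding loop. The following sketch assumes character granularity and a hypothetical end-flag character for stopping; the helper methods stand in for the modules described above and are not the patent's API:

```python
def recognize(image, text_net, max_len=32, bos_id=0, eos_id=1):
    """Sketch of the overall loop: steps 301-304 once, then 305-307 repeated."""
    first_feature = text_net.extract_image_feature(image)           # step 301
    recognized = [bos_id]                                           # preset character
    while len(recognized) <= max_len:
        semantic = text_net.predict_semantic(recognized)            # steps 302 / 305
        fused = text_net.update_feature(first_feature, recognized)  # steps 303 / 306
        char_id = text_net.recognize_one(fused, semantic)           # steps 304 / 307
        if char_id == eos_id:                                       # hypothetical end flag
            break
        recognized.append(char_id)
    return recognized[1:]  # drop the preset <BOS> character
```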
For a more intuitive understanding of the present solution, refer to fig. 6; fig. 6 is a schematic diagram of a network architecture of a text recognition network according to an embodiment of the application. The text recognition network comprises an image feature extraction module, A1, A2 and a recognition module, where A1 represents the text feature acquisition module and A2 represents the feature update module. As shown in fig. 6, the execution device inputs the image to be recognized into the image feature extraction module to obtain the image feature of the first character in the image to be recognized (i.e., the first feature), and inputs the characters corresponding to the first character in the image to be recognized into A1 (i.e., the text feature acquisition module); these may be the preset character alone, or the preset character together with second characters. A1 generates the initial features of the characters and performs the self-attention encoding operation and the self-attention decoding operation on them to obtain the semantic feature of the predicted character. After the first feature is obtained, the execution device performs self-attention encoding on the initial features of the characters to obtain their updated features, and further performs the self-attention encoding operation according to the first feature and the updated features to generate the updated first feature. The execution device inputs the updated first feature and the semantic feature of the predicted character into the recognition module, so as to perform the recognition operation through the recognition module and output the recognition result. For the specific implementation of each step in fig. 6, refer to the foregoing description, which is not repeated here. It should be understood that, in practice, more or fewer neural network layers may be provided in the text recognition network; fig. 6 is only an example for facilitating understanding of the present solution, and is not a limitation of the scheme.
In this implementation, not only are the image features of the image to be recognized obtained, but the semantic feature of the predicted character is also generated according to the second characters corresponding to the recognized characters among the first characters, and the recognition operation is performed according to features of more dimensions, which improves the accuracy of the text recognition result. Moreover, when the image to be recognized is blurred or some characters in it are occluded, the accuracy of the features of the blurred or occluded characters included in the first feature may be greatly reduced; the semantic feature of the predicted character, however, is generated from the semantic information of the recognized characters, so its accuracy is not affected by image blurring or partial occlusion. Generating the recognition result according to both the semantic feature of the predicted character and the image features therefore improves the accuracy of the text recognition result.
2. Training phase
In an embodiment of the present application, the training phase describes the process of how the training device 220 trains the text recognition network. Referring to fig. 7, fig. 7 is a flowchart of a training method of a text recognition network according to an embodiment of the present application, where the method may include:
701. the training device acquires an image to be identified from the training data set.
In the embodiment of the application, the training device is pre-configured with a training data set. The training data set includes a plurality of images to be recognized together with the correct result corresponding to the first character in each image to be recognized, and the training device randomly acquires one image to be recognized from the training data set.
702. The training device inputs the image to be recognized to an image feature extraction module, and performs feature extraction on the image to be recognized to generate first features corresponding to first characters in the image to be recognized.
703. The training equipment inputs preset characters corresponding to the first characters in the image to be recognized to the text feature acquisition module, and performs text prediction according to the preset characters so as to generate semantic features of the first predicted characters.
704. The training device combines the features of the preset characters with the first features through the feature updating module to generate fourth features.
705. The training device performs a recognition operation by the recognition module according to the first feature and the semantic feature of the first predicted character to generate a first recognition result.
706. The training device inputs a second character corresponding to the recognized character in the first character to the text feature acquisition module, and performs text prediction according to the second character to generate semantic features of the first predicted character.
707. The training device combines the features of the second character with the first features through the feature update module to generate seventh features.
708. The training device performs recognition operation through the recognition module according to the first feature and the semantic feature of the first predicted character to generate a second recognition result.
In the embodiment of the present application, the specific implementation manner of the training device to execute steps 702 to 708 is similar to the specific implementation manner of steps 301 to 307 in the corresponding embodiment of fig. 3, and the description of steps 301 to 307 in the corresponding embodiment of fig. 3 may be referred to, and will not be repeated here.
709. The training device trains the text recognition network according to the correct result, the recognition result and the loss function corresponding to the first character in the image to be recognized.
In the embodiment of the application, after the training device obtains the recognition result of one first character in the image to be recognized, it calculates the function value of the loss function according to the correct result corresponding to the first character in the image to be recognized and the recognition result of that first character, and performs gradient derivation on the function value of the loss function so as to reversely update the weight parameters of the text recognition network, thereby completing one training iteration of the text recognition network. The training device repeatedly performs the foregoing steps until a preset condition is met, so as to achieve iterative training of the text recognition network.
Specifically, if one first character includes only one character to be recognized, or one first character includes only one word to be recognized, the training device may directly output the recognition result of the entire first character after performing steps 701 to 705, and the training device calculates the function value of the loss function according to the correct result corresponding to the first character in the image to be recognized and the first recognition result output in step 705.
If one first character includes a plurality of characters to be recognized, or one first character includes a plurality of words to be recognized, the training device may output the recognition result of the entire first character after performing steps 701 to 705 once and steps 706 to 708 at least once, and the training device calculates the function value of the loss function according to the correct result corresponding to the first character in the image to be recognized, the first recognition result output in step 705, and the at least one second recognition result obtained in step 708.
The loss function indicates the similarity between the correct result corresponding to the first character in the image to be recognized and the recognition result of that first character, and the training aim is to increase this similarity, i.e., to bring the recognition result closer to the correct result. The loss function may be embodied as a cross entropy loss function, a focal loss function, a center loss function, or another type of loss function, which is not limited here.
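As an illustration, one training iteration using the cross entropy option named above could look like the following; all names and the stacking of per-character logits are assumptions:

```python
import torch.nn.functional as F

def train_step(text_net, optimizer, image, target_ids):
    """Sketch of step 709: one training iteration of the text recognition network."""
    logits = text_net(image, target_ids)  # steps 702-708, stacked per character
    loss = F.cross_entropy(               # similarity to the correct result
        logits.view(-1, logits.size(-1)), target_ids.view(-1))
    optimizer.zero_grad()
    loss.backward()                       # gradient derivation on the loss value
    optimizer.step()                      # reversely update the weight parameters
    return loss.item()
```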
The preset condition may be that the loss function satisfies a convergence condition, or may be that the iteration number reaches a preset number.
In the embodiment of the application, a training method of a text recognition network is provided, improving the completeness of the scheme. Due to factors such as blurring of the image to be recognized or occlusion of part of the characters in it, the accuracy of the features of the blurred or occluded characters included in the first feature may be greatly reduced. In the training stage, the semantic features of the predicted characters are generated based on the semantic information of the recognized characters, so their accuracy is not affected by image blurring or partial occlusion, and the recognition result is generated according to both the semantic features of the predicted characters and the image features; the accuracy of the text recognition results output by the trained text recognition network is thereby improved.
In order to more intuitively understand the beneficial effects of the embodiments of the present application, the beneficial effects of the embodiments of the present application are shown by experimental data in table 1 below.
                                    svt       SVTP      CT80
OCR                                 88.2%     77.67%    84.98%
Embodiment of the application       92.4%     84.2%     89.9%

TABLE 1
Referring to table 1, svt, SVTP and CT80 are three public data sets. The first row of data in table 1 indicates the accuracy of the recognition results obtained by performing text recognition on the images in data sets svt, SVTP and CT80 using optical character recognition (OCR) technology. The second row of data in table 1 indicates the accuracy of the recognition results obtained by performing text recognition on the images in the same data sets using the text recognition network provided by the embodiment of the application. Obviously, the accuracy of the recognition results obtained by the text recognition network provided by the embodiment of the application is higher.
In addition, referring to fig. 8, fig. 8 is a schematic diagram illustrating the beneficial effect of the text recognition network according to the embodiment of the present application. For the first line of data in fig. 8, when the characters in the image to be recognized are recognized only according to the image features of the image to be recognized, the recognition result is shcct, whereas with the text recognition network provided by the embodiment of the application, the recognition result is sheet. The second and third lines of data in fig. 8 can be understood analogously; obviously, the accuracy of the recognition results obtained by the text recognition network provided by the embodiment of the application is higher.
In order to better implement the above-described scheme of the embodiment of the present application on the basis of the embodiments corresponding to fig. 1 to 8, related devices for implementing the scheme are provided below. Referring specifically to fig. 9, fig. 9 is a schematic structural diagram of a text recognition network according to an embodiment of the present application. The text recognition network 900 includes a text feature acquisition module 902 and a recognition module 903, and may include an image feature extraction module 901. The image feature extraction module 901 is configured to obtain an image to be recognized and perform feature extraction on it to generate a first feature corresponding to a first character in the image to be recognized, where the first character is a character to be recognized in the image to be recognized. The text feature acquisition module 902 is configured to obtain a preset character corresponding to the first character in the image to be recognized and perform text prediction according to the preset character, so as to generate the semantic feature of a first predicted character. The recognition module 903 is configured to perform a recognition operation according to the first feature and the semantic feature of the first predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
In one possible design, the text feature obtaining module 902 is specifically configured to obtain, in a case where an identification operation is performed on an image to be identified for the first time, a preset character corresponding to a first character in the image to be identified, and perform text prediction according to the preset character, so as to generate a semantic feature of a second predicted character; the text feature obtaining module 902 is further configured to determine, as the second character, a recognition result corresponding to the recognized character in the first character, and generate a semantic feature of a second predicted character corresponding to the second character, in a case where the recognition operation has been performed on at least one character in the first characters.
In one possible design, the recognition module 903 is further configured to perform a recognition operation according to the first feature and the semantic feature of the second predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
In one possible design, referring to fig. 10, fig. 10 is a schematic structural diagram of a text recognition network according to an embodiment of the present application. The text feature acquisition module 902 includes: the first generating sub-module 9021 is configured to perform vectorization processing on the preset character to generate a character code of the preset character, and generate a position code of the preset character according to a position of the preset character in the first character in the image to be identified; the combination submodule 9022 is configured to combine the character encoding of the preset character and the position encoding of the preset character to obtain an initial feature of the preset character, and execute a self-attention encoding operation and a self-attention decoding operation according to the initial feature of the preset character to generate a semantic feature of the first predicted character.
In one possible design, referring to fig. 10, the recognition module 903 includes: a computation submodule 9031 for computing the similarity between the first feature and the semantic feature of the first predicted character; and a second generating submodule 9032 for generating a second feature and a third feature according to the first feature, the semantic feature of the first predicted character and the similarity, where the second feature combines the semantic feature of the first predicted character on the basis of the first feature, and the third feature combines the first feature on the basis of the semantic feature of the first predicted character. The second generating submodule 9032 is further configured to perform the recognition operation according to the second feature and the third feature, so as to generate the recognition result.
In one possible design, referring to fig. 10, the text recognition network further includes a feature update module 904, the feature update module 904 configured to: combining the features of the preset characters with the first features to generate updated first features; the recognition module 903 is specifically configured to perform a recognition operation according to the updated first feature and the semantic feature of the first predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
In one possible design, feature update module 904 is specifically configured to: and executing the self-attention encoding operation according to the initial characteristics of the preset characters to obtain updated characteristics of the preset characters, and executing the self-attention encoding operation according to the first characteristics and the updated characteristics of the preset characters to generate updated first characteristics.
In one possible design, in the case that the granularity at which the text recognition network performs the recognition operation is character, one first character includes at least one character, and one recognition result output by the text recognition network performing the recognition operation once includes one character; in the case that the granularity at which the text recognition network performs the recognition operation is word, one first character includes at least one word, and one recognition result output by the text recognition network performing the recognition operation once is a word including one or more characters.
It should be noted that, content such as information interaction and execution process between each module/unit in the text recognition network 900, and each method embodiment corresponding to fig. 3 to 6 in the present application are based on the same concept, and specific content may be referred to the description in the foregoing method embodiment of the present application, which is not repeated herein.
The embodiment of the application also provides a training device of the text recognition network, and particularly referring to fig. 11, fig. 11 is a schematic structural diagram of the training device of the text recognition network. The text recognition network is a neural network for recognizing characters in an image and comprises an image feature extraction module, a text feature acquisition module and a recognition module. The training apparatus 1100 of the text recognition network includes: an input unit 1101, a recognition unit 1102, and a training unit 1103. The input unit 1101 is configured to input an image to be identified to the image feature extraction module, and perform feature extraction on the image to be identified to generate a first feature corresponding to a first character in the image to be identified, where the first character is a character to be identified in the image to be identified; the input unit 1101 is further configured to input a preset character corresponding to a first character in the image to be recognized to the text feature acquisition module, and perform text prediction according to the preset character, so as to generate semantic features of the first predicted character; a recognition unit 1102, configured to perform a recognition operation according to the first feature and the semantic feature of the first predicted character through a recognition module, so as to generate a recognition result corresponding to the first character in the image to be recognized; the training unit 1103 is configured to train the text recognition network according to a correct result corresponding to the first character in the image to be recognized, a recognition result, and a loss function, where the loss function indicates a similarity between the correct result corresponding to the first character in the image to be recognized and the recognition result corresponding to the first character in the image to be recognized.
In one possible design, referring to fig. 12, fig. 12 is a schematic structural diagram of a training device for a text recognition network according to an embodiment of the present application. An input unit 1101, specifically configured to input, to the text feature acquisition module, a preset character corresponding to a first character in the image to be recognized when the recognition operation is performed on the image to be recognized for the first time; the training apparatus 1100 of the text recognition network further includes a generating unit 1104 for determining, by the text feature acquisition module, a recognition result corresponding to a recognized character of the first characters as a second character in a case where a recognition operation has been performed on at least one character of the first characters, and generating semantic features of a second predicted character corresponding to the second character.
In one possible design, the recognition unit 1102 is further configured to perform a recognition operation by the recognition module according to the first feature and the semantic feature of the second predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
In one possible design, the input unit 1101 is specifically configured to perform vectorization processing on a preset character through the text feature acquisition module, so as to generate a character code of the preset character, and generate a position code of the preset character according to a position of the preset character in a first character in the image to be identified; and combining the character codes of the preset characters and the position codes of the preset characters through a text characteristic acquisition module to obtain initial characteristics of the preset characters, and executing self-attention coding operation and self-attention decoding operation according to the initial characteristics of the preset characters to generate semantic characteristics of the first predicted characters.
In one possible design, the identification unit 1102 is specifically configured to: calculating the similarity between the first feature and the semantic feature of the first predicted character by the identification module; generating a second feature and a third feature through the identification module according to the first feature, the semantic feature and the similarity of the first predicted character, wherein the second feature is the semantic feature of the first predicted character combined on the basis of the first feature, and the third feature is the first feature combined on the basis of the semantic feature of the first predicted character; and executing the identification operation according to the second characteristic and the third characteristic by the identification module so as to generate an identification result.
In one possible design, referring to FIG. 12, the text recognition network further includes a feature update module. The training device 1100 of the text recognition network further includes a combining unit 1105, configured to combine, by the feature updating module, the feature of the preset character with the first feature to generate an updated first feature; the recognition unit 1102 is specifically configured to perform, by using a recognition module, a recognition operation according to the updated first feature and the semantic feature of the first predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
In one possible design, the combining unit 1105 is specifically configured to: perform, through the feature updating module, a self-attention encoding operation according to the initial feature of the preset character to obtain an updated feature of the preset character; and perform, through the feature updating module, a self-attention encoding operation according to the first feature and the updated feature of the preset character to generate the updated first feature.
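As a sketch only, both self-attention encoding steps of the feature update module can be expressed with standard transformer encoder layers; concatenating the image positions with the character positions for the joint pass is an assumption of this sketch:

    import torch
    import torch.nn as nn

    class FeatureUpdateSketch(nn.Module):
        # Illustrative sketch of the feature update module; widths are assumed.
        def __init__(self, d_model=256, nhead=8):
            super().__init__()
            char_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            joint_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.char_encoder = nn.TransformerEncoder(char_layer, num_layers=1)
            self.joint_encoder = nn.TransformerEncoder(joint_layer, num_layers=1)

        def forward(self, first_feat, char_initial):
            # self-attention encoding of the preset character's initial feature
            char_updated = self.char_encoder(char_initial)
            # self-attention encoding over the first feature together with the
            # updated character feature (a joint sequence is an assumed realization)
            joint = torch.cat([first_feat, char_updated], dim=1)
            out = self.joint_encoder(joint)
            # keep the image positions as the updated first feature
            return out[:, : first_feat.size(1)]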
In one possible design, when the granularity at which the text recognition network performs the recognition operation is characters, one first character includes at least one character, and one recognition result output by one recognition operation of the text recognition network includes one character; when the granularity at which the text recognition network performs the recognition operation is words, one first character includes at least one word, and one recognition result output by one recognition operation of the text recognition network is a word including one or more characters. For example, for an image containing the word "cat", character-granularity recognition performs three recognition operations that each output one character, whereas word-granularity recognition outputs "cat" in a single recognition operation.
It should be noted that the information interaction and execution processes between the modules/units in the training device 1100 of the text recognition network are based on the same concept as the method embodiment corresponding to fig. 7 in the present application; for specific content, reference may be made to the description in the foregoing method embodiment, and details are not repeated herein.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an execution device according to an embodiment of the present application. The text recognition network 900 described in the embodiment corresponding to fig. 9 or fig. 10 may be deployed on the execution device 1300 to implement the functions of the execution device in the embodiments corresponding to fig. 3 to fig. 6. Specifically, the execution device 1300 includes a receiver 1301, a transmitter 1302, a processor 1303, and a memory 1304 (the number of processors 1303 in the execution device 1300 may be one or more; one processor is taken as an example in fig. 13), where the processor 1303 may include an application processor 13031 and a communication processor 13032. In some embodiments of the application, the receiver 1301, the transmitter 1302, the processor 1303, and the memory 1304 may be connected by a bus or in other manners.
The memory 1304 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1303. A portion of the memory 1304 may also include a non-volatile random access memory (NVRAM). The memory 1304 stores processor-executable operating instructions, executable modules, or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
The processor 1303 controls operations of the execution device. In a specific application, the individual components of the execution device are coupled together by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, etc. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The method disclosed in the foregoing embodiments of the present application may be applied to the processor 1303 or implemented by the processor 1303. The processor 1303 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the foregoing method may be completed by an integrated logic circuit of hardware in the processor 1303 or by instructions in the form of software. The processor 1303 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1303 may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed with reference to the embodiments of the present application may be directly embodied as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1304, and the processor 1303 reads the information in the memory 1304 and completes the steps of the foregoing method in combination with its hardware.
The receiver 1301 may be used to receive input numeric or character information and to generate signal inputs related to performing relevant settings and function control of the device. The transmitter 1302 may be configured to output numeric or character information via a first interface; the transmitter 1302 may also be configured to send instructions to the disk group through the first interface to modify data in the disk group; the transmitter 1302 may also include a display device such as a display screen.
In the embodiment of the present application, the application processor 13031 is configured to perform the functions of the execution device in the embodiments corresponding to fig. 3 to fig. 6. It should be noted that, for the specific implementation manner in which the application processor 13031 performs the functions of the execution device in the embodiments corresponding to fig. 3 to fig. 6 and the beneficial effects thereof, reference may be made to the descriptions in the method embodiments corresponding to fig. 3 to fig. 6, and details are not repeated here.
Referring to fig. 14, fig. 14 is a schematic structural diagram of a training device provided in an embodiment of the present application. The training apparatus 1100 of the text recognition network described in the embodiment corresponding to fig. 11 or fig. 12 may be deployed on the training device 1400 to implement the functions of the training device corresponding to fig. 7. Specifically, the training device 1400 is implemented by one or more servers, and may vary widely in configuration or performance. It may include one or more central processing units (CPUs) 1422 (for example, one or more processors), a memory 1432, and one or more storage media 1430 (for example, one or more mass storage devices) that store application programs 1442 or data 1444. The memory 1432 and the storage medium 1430 may be transitory or persistent storage. The program stored on the storage medium 1430 may include one or more modules (not shown), each of which may include a series of instruction operations for the training device. Further, the central processing unit 1422 may be configured to communicate with the storage medium 1430 and perform, on the training device 1400, the series of instruction operations in the storage medium 1430.
The training device 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, and/or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
In an embodiment of the present application, the central processor 1422 is configured to implement the functions of the training apparatus in the corresponding embodiment of fig. 7. It should be noted that, for the specific implementation manner and the beneficial effects of executing the functions of the training device in the corresponding embodiment of fig. 7 by the central processor 1422, reference may be made to the descriptions of the respective method embodiments corresponding to fig. 7, and no further description is given here.
Embodiments of the present application also provide a computer-readable storage medium having a program stored therein; when the program is run on a computer, the computer is caused to perform the steps performed by the execution device in the embodiments corresponding to fig. 3 to fig. 6, or the steps performed by the training device in the embodiment corresponding to fig. 7.

Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the steps performed by the execution device in the embodiments corresponding to fig. 3 to fig. 6, or the steps performed by the training device in the embodiment corresponding to fig. 7.

An embodiment of the present application further provides a circuit system, where the circuit system includes a processing circuit configured to perform the steps performed by the execution device in the embodiments corresponding to fig. 3 to fig. 6, or the steps performed by the training device in the embodiment corresponding to fig. 7.
The execution device or the training device provided by the embodiments of the present application may be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip performs the steps performed by the execution device in the embodiments corresponding to fig. 3 to fig. 6, or the steps performed by the training device in the embodiment corresponding to fig. 7. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in a wireless access device, such as a read-only memory (ROM) or another type of static storage device that may store static information and instructions, or a random access memory (RAM).
Specifically, referring to fig. 15, fig. 15 is a schematic structural diagram of a chip provided in an embodiment of the present application, where the chip may be represented as a neural network processor NPU 150, and the NPU 150 is mounted as a coprocessor on a main CPU (Host CPU), and the Host CPU distributes tasks. The core part of the NPU is an operation circuit 1503, and the controller 1504 controls the operation circuit 1503 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 1503 includes a plurality of processing units (PEs) inside. In some implementations, the operation circuit 1503 is a two-dimensional systolic array. The operation circuit 1503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1503 is a general-purpose matrix processor.
For example, assume that there are an input matrix A, a weight matrix B, and an output matrix C. The operation circuit 1503 fetches the data corresponding to the matrix B from the weight memory 1502 and buffers it on each PE in the operation circuit. The operation circuit 1503 fetches the data of matrix A from the input memory 1501, performs a matrix operation with matrix B, and stores the obtained partial result or final result of the matrix in an accumulator 1508.
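For illustration, the accumulate-as-you-go behaviour described above can be mimicked in plain Python with NumPy; the tile size and the NumPy realization are assumptions and do not model the actual PE array of the NPU:

    import numpy as np

    def tiled_matmul(a, b, tile=16):
        # c plays the role of accumulator 1508: partial results of A x B
        # are summed into it tile by tile along the shared dimension.
        m, k = a.shape
        k2, n = b.shape
        assert k == k2, "inner dimensions must match"
        c = np.zeros((m, n), dtype=a.dtype)
        for k0 in range(0, k, tile):
            c += a[:, k0:k0 + tile] @ b[k0:k0 + tile, :]
        return c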
The unified memory 1506 is used to store input data and output data. The weight data is transferred directly to the weight memory 1502 through a direct memory access controller (DMAC) 1505. The input data is also transferred into the unified memory 1506 through the DMAC.
The bus interface unit (BIU) 1510 is used for interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 1509. Specifically, the bus interface unit 1510 is used by the instruction fetch buffer 1509 to obtain instructions from an external memory, and is further used by the direct memory access controller 1505 to obtain the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 1506 or to transfer weight data to the weight memory 1502 or to transfer input data to the input memory 1501.
The vector calculation unit 1507 includes a plurality of operation processing units, and, if necessary, performs further processing on the output of the operation circuit 1503, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. The vector calculation unit 1507 is mainly used for non-convolution/fully-connected layer computation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of a feature plane.
In some implementations, the vector computation unit 1507 can store the vector of processed outputs to the unified memory 1506. For example, the vector calculation unit 1507 may apply a linear function and/or a nonlinear function to the output of the operation circuit 1503, for example, linearly interpolate the feature plane extracted by the convolution layer, and further, for example, accumulate a vector of values to generate an activation value. In some implementations, the vector calculation unit 1507 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 1503, for example for use in subsequent layers in a neural network.
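A small sketch of such post-processing follows, assuming a batch-normalization-style step followed by a ReLU activation; the concrete operations the vector calculation unit 1507 applies are workload-dependent, so this is only one illustrative combination:

    import numpy as np

    def postprocess(x, gamma, beta, eps=1e-5):
        # x: (batch, features) output of the operation circuit
        # normalize per feature, then apply an example nonlinearity (ReLU)
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        y = gamma * (x - mean) / np.sqrt(var + eps) + beta
        return np.maximum(y, 0.0)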
The instruction fetch buffer 1509 connected to the controller 1504 is used to store instructions used by the controller 1504. The unified memory 1506, the input memory 1501, the weight memory 1502, and the instruction fetch buffer 1509 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The operations of the layers in the recurrent neural network may be performed by the operation circuit 1503 or the vector calculation unit 1507.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the program of the method of the first aspect.
It should be further noted that the apparatus embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the application, the connection relationship between modules indicates that they have a communication connection, which may be specifically implemented as one or more communication buses or signal lines.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus necessary general-purpose hardware, or, of course, by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structure used to implement the same function may take various forms, such as an analog circuit, a digital circuit, or a dedicated circuit. For the present application, however, a software program implementation is a better embodiment in most cases. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Claims (26)

1. A text recognition network, which is characterized in that the text recognition network is a neural network for recognizing characters in an image, and comprises an image feature extraction module, a text feature acquisition module and a recognition module;
The image feature extraction module is used for acquiring an image to be identified and extracting features of the image to be identified to generate a first feature corresponding to a first character in the image to be identified, wherein the first character is a character to be identified in the image to be identified;
The text feature acquisition module is used for acquiring a preset character corresponding to the first character in the image to be identified, and carrying out text prediction according to the preset character so as to generate a semantic feature of a first predicted character, wherein the preset character comprises a start flag character which is used for instructing the text feature acquisition module to start text prediction;
The recognition module is used for executing recognition operation according to the first characteristics and the semantic characteristics of the first predicted characters so as to generate a recognition result corresponding to the first characters in the image to be recognized.
2. The network according to claim 1, wherein
The text feature acquisition module is specifically configured to acquire a preset character corresponding to a first character in the image to be recognized under the condition that the recognition operation is performed on the image to be recognized for the first time, and perform text prediction according to the preset character so as to generate semantic features of the first predicted character;
The text feature obtaining module is further configured to determine, when a recognition operation has been performed on at least one of the first characters, a recognition result corresponding to the recognized character of the first characters as a second character, and perform text prediction according to the second character, so as to generate a semantic feature of a second predicted character corresponding to the second character.
3. The network according to claim 2, wherein
The recognition module is further configured to perform a recognition operation according to the first feature and the semantic feature of the second predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
4. A network according to any one of claims 1 to 3, wherein the text feature acquisition module comprises:
The first generation sub-module is used for carrying out vectorization processing on the preset characters to generate character codes of the preset characters, and generating position codes of the preset characters according to the positions of the preset characters in the first characters in the image to be recognized;
And the combination sub-module is used for combining the character codes of the preset characters and the position codes of the preset characters to obtain initial characteristics of the preset characters, and executing self-attention coding operation and self-attention decoding operation according to the initial characteristics of the preset characters to generate semantic characteristics of the first predicted characters.
5. A network according to any one of claims 1 to 3, wherein the identification module comprises:
A computing sub-module for computing a similarity between the first feature and a semantic feature of the first predicted character;
A second generation sub-module, configured to generate a second feature and a third feature according to the first feature, the semantic feature of the first predicted character, and the similarity, wherein the second feature is the semantic feature of the first predicted character combined on the basis of the first feature, and the third feature is the first feature combined on the basis of the semantic feature of the first predicted character;
the second generating sub-module is further configured to perform a recognition operation according to the second feature and the third feature, so as to generate a recognition result.
6. A network according to any one of claims 1 to 3, wherein the text recognition network further comprises a feature update module for:
Combining the features of the preset characters with the first features to generate updated first features;
The recognition module is specifically configured to perform a recognition operation according to the updated first feature and the semantic feature of the first predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
7. The network according to claim 6, wherein
The feature updating module is specifically configured to perform a self-attention encoding operation according to an initial feature of the preset character, obtain an updated feature of the preset character, and perform the self-attention encoding operation according to the first feature and the updated feature of the preset character, so as to generate the updated first feature.
8. The network according to any one of claims 1 to 3, wherein,
When the granularity of the text recognition network for executing the recognition operation is character, at least one character is included in one first character, and one recognition result output by the text recognition network for executing the recognition operation once includes one character;
and under the condition that the granularity of the text recognition network for executing the recognition operation is words, at least one word is included in one first character, and one recognition result output by the text recognition network for executing the recognition operation once is words including one or more characters.
9. A training method of a text recognition network, wherein the text recognition network is a neural network for recognizing characters in an image, the text recognition network includes an image feature extraction module, a text feature acquisition module, and a recognition module, the method includes:
Inputting an image to be identified into the image feature extraction module, and carrying out feature extraction on the image to be identified to generate a first feature corresponding to a first character in the image to be identified, wherein the first character is a character to be identified in the image to be identified;
Inputting a preset character corresponding to a first character in the image to be recognized to the text feature acquisition module, and carrying out text prediction according to the preset character to generate semantic features of a first predicted character, wherein the preset character comprises a start flag character which is used for instructing the text feature acquisition module to start text prediction;
executing recognition operation through the recognition module according to the first features and the semantic features of the first predicted characters to generate a recognition result corresponding to the first characters in the image to be recognized;
Training the text recognition network according to a correct result corresponding to the first character in the image to be recognized, a recognition result and a loss function, wherein the loss function indicates the similarity between the correct result corresponding to the first character in the image to be recognized and the recognition result corresponding to the first character in the image to be recognized.
10. The method according to claim 9, wherein
The text feature acquisition module is specifically configured to acquire a preset character corresponding to a first character in the image to be recognized under the condition that the recognition operation is performed on the image to be recognized for the first time, and perform text prediction according to the preset character so as to generate semantic features of the first predicted character;
The text feature obtaining module is further configured to determine, when a recognition operation has been performed on at least one of the first characters, a recognition result corresponding to the recognized character of the first characters as a second character, and perform text prediction according to the second character, so as to generate a semantic feature of a second predicted character corresponding to the second character.
11. The method according to claim 10, wherein
The recognition module is further configured to perform a recognition operation according to the first feature and the semantic feature of the second predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
12. A method of text recognition, the method comprising:
inputting an image to be identified into an image feature extraction module, and performing feature extraction on the image to be identified to generate a first feature corresponding to a first character in the image to be identified, wherein the first character is a character to be identified in the image to be identified;
Inputting a preset character corresponding to a first character in the image to be recognized into a text feature acquisition module, and carrying out text prediction according to the preset character to generate semantic features of a first predicted character, wherein the preset character comprises a start flag character which is used for instructing the text feature acquisition module to start text prediction;
Executing recognition operation through a recognition module according to the first feature and the semantic feature of the first predicted character to generate a recognition result corresponding to the first character in the image to be recognized;
The image feature extraction module, the text feature acquisition module and the recognition module belong to the same text recognition network.
13. The method of claim 12, wherein the inputting a preset character corresponding to a first character in the image to be recognized to a text feature acquisition module comprises:
Under the condition that the identification operation is executed for the first time on the image to be identified, inputting a preset character corresponding to a first character in the image to be identified into a text feature acquisition module;
the method further comprises the steps of:
In the case that the recognition operation has been performed on at least one of the first characters, determining, by the text feature acquisition module, a recognition result corresponding to the recognized character of the first characters as a second character, and performing text prediction according to the second character to generate semantic features of a second predicted character corresponding to the second character.
14. The method of claim 13, wherein the method further comprises:
and executing recognition operation through the recognition module according to the first characteristic and the semantic characteristic of the second predicted character so as to generate a recognition result corresponding to the first character in the image to be recognized.
15. The method according to any one of claims 12 to 14, wherein inputting a preset character corresponding to a first character in the image to be recognized to a text feature acquisition module, and performing text prediction according to the preset character to generate a semantic feature of the first predicted character, includes:
Vectorizing the preset character through the text feature acquisition module to generate a character code of the preset character, and generating a position code of the preset character according to the position of the preset character in the first character in the image to be recognized;
And combining the character codes of the preset characters and the position codes of the preset characters through the text characteristic acquisition module to obtain initial characteristics of the preset characters, and executing self-attention coding operation and self-attention decoding operation according to the initial characteristics of the preset characters to generate semantic characteristics of the first predicted characters.
16. The method according to any one of claims 12 to 14, wherein the performing, by the recognition module, a recognition operation according to the first feature and the semantic feature of the first predicted character to generate a recognition result corresponding to the first character in the image to be recognized, includes:
calculating the similarity between the first feature and the semantic feature of the first predicted character by the identification module;
generating, by the recognition module, a second feature and a third feature according to the first feature, the semantic feature of the first predicted character, and the similarity, wherein the second feature is the semantic feature of the first predicted character combined on the basis of the first feature, and the third feature is the first feature combined on the basis of the semantic feature of the first predicted character;
And executing the identification operation according to the second characteristic and the third characteristic by the identification module so as to generate an identification result.
17. The method of any of claims 12 to 14, wherein the text recognition network further comprises a feature update module, the method further comprising:
Combining, by the feature updating module, the features of the preset character with the first features to generate updated first features;
the identifying module performs an identifying operation according to the first feature and the semantic feature of the first predicted character to generate an identifying result corresponding to the first character in the image to be identified, including:
and executing a recognition operation according to the updated first feature and the semantic feature of the first predicted character by the recognition module so as to generate a recognition result corresponding to the first character in the image to be recognized.
18. The method of claim 17, wherein the combining, by the feature updating module, the features of the preset character with the first features to generate updated first features comprises:
Executing self-attention coding operation according to the initial characteristics of the preset characters through the characteristic updating module to obtain updated characteristics of the preset characters;
And executing self-attention coding operation according to the first feature and the updated feature of the preset character by the feature updating module so as to generate the updated first feature.
19. The method according to any one of claims 12 to 14, wherein,
When the granularity of the text recognition network for executing the recognition operation is character, at least one character is included in one first character, and one recognition result output by the text recognition network for executing the recognition operation once includes one character;
and under the condition that the granularity of the text recognition network for executing the recognition operation is words, at least one word is included in one first character, and one recognition result output by the text recognition network for executing the recognition operation once is words including one or more characters.
20. A training device for a text recognition network, wherein the text recognition network is a neural network for recognizing characters in an image, the text recognition network includes an image feature extraction module, a text feature acquisition module, and a recognition module, the device includes:
The input unit is used for inputting an image to be identified into the image feature extraction module, and carrying out feature extraction on the image to be identified to generate a first feature corresponding to a first character in the image to be identified, wherein the first character is a character to be identified in the image to be identified;
The input unit is further configured to input a preset character corresponding to a first character in the image to be identified to the text feature acquisition module, and perform text prediction according to the preset character to generate semantic features of a first predicted character, where the preset character includes a start flag character, and the start flag character is used to instruct the text feature acquisition module to start text prediction;
The recognition unit is used for executing recognition operation through the recognition module according to the first characteristics and the semantic characteristics of the first predicted characters so as to generate a recognition result corresponding to the first characters in the image to be recognized;
The training unit is used for training the text recognition network according to the correct result corresponding to the first character in the image to be recognized, the recognition result and the loss function, wherein the loss function indicates the similarity between the correct result corresponding to the first character in the image to be recognized and the recognition result corresponding to the first character in the image to be recognized.
21. The apparatus according to claim 20, wherein
The input unit is specifically configured to input, to the text feature acquisition module, a preset character corresponding to a first character in the image to be recognized under the condition that the recognition operation is performed on the image to be recognized for the first time;
The input unit is further configured to, when a recognition operation has been performed on at least one of the first characters, determine, by the text feature acquisition module, a recognition result corresponding to the recognized character of the first characters as a second character, and perform text prediction according to the second character, so as to generate semantic features of a second predicted character corresponding to the second character.
22. The apparatus according to claim 21, wherein
The recognition unit is further configured to perform a recognition operation through the recognition module according to the first feature and the semantic feature of the second predicted character, so as to generate a recognition result corresponding to the first character in the image to be recognized.
23. An execution device comprising a processor coupled to a memory, the memory storing program instructions that when executed by the processor implement the steps performed by the text recognition network of any of claims 1 to 8.
24. A training device comprising a processor coupled to a memory, the memory storing program instructions that when executed by the processor implement the method of any of claims 9 to 11.
25. A computer readable storage medium comprising a program which, when run on a computer, causes the computer to perform the steps performed by the text recognition network of any one of claims 1 to 8 or causes the computer to perform the method of any one of claims 9 to 11.
26. Circuitry, characterized in that it comprises processing circuitry configured to perform the steps performed by the text recognition network according to any of claims 1 to 8 or to perform the method according to any of claims 9 to 11.
CN202010723541.2A 2020-07-24 2020-07-24 Text recognition network, neural network training method and related equipment Active CN112016543B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010723541.2A CN112016543B (en) 2020-07-24 2020-07-24 Text recognition network, neural network training method and related equipment
PCT/CN2021/106397 WO2022017245A1 (en) 2020-07-24 2021-07-15 Text recognition network, neural network training method, and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010723541.2A CN112016543B (en) 2020-07-24 2020-07-24 Text recognition network, neural network training method and related equipment

Publications (2)

Publication Number Publication Date
CN112016543A CN112016543A (en) 2020-12-01
CN112016543B true CN112016543B (en) 2024-09-20

Family

ID=73499014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010723541.2A Active CN112016543B (en) 2020-07-24 2020-07-24 Text recognition network, neural network training method and related equipment

Country Status (2)

Country Link
CN (1) CN112016543B (en)
WO (1) WO2022017245A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016543B (en) * 2020-07-24 2024-09-20 华为技术有限公司 Text recognition network, neural network training method and related equipment
CN113011246A (en) * 2021-01-29 2021-06-22 招商银行股份有限公司 Bill classification method, device, equipment and storage medium
CN112819684B (en) * 2021-03-02 2022-07-26 成都视海芯图微电子有限公司 Accelerating device for image text recognition
CN112801228B (en) * 2021-04-06 2021-08-06 北京世纪好未来教育科技有限公司 Text recognition method, electronic equipment and storage medium thereof
CN113762050B (en) * 2021-05-12 2024-05-24 腾讯云计算(北京)有限责任公司 Image data processing method, device, equipment and medium
CN113610081A (en) * 2021-08-12 2021-11-05 北京有竹居网络技术有限公司 Character recognition method and related equipment thereof
CN113657390B (en) * 2021-08-13 2022-08-12 北京百度网讯科技有限公司 Training method of text detection model and text detection method, device and equipment
CN113837965B (en) * 2021-09-26 2024-06-18 北京百度网讯科技有限公司 Image definition identification method and device, electronic equipment and storage medium
CN114140802B (en) * 2022-01-29 2022-04-29 北京易真学思教育科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN114818738B (en) * 2022-03-01 2024-08-02 达观数据有限公司 Method and system for identifying intention track of customer service hotline user
CN114661904B (en) * 2022-03-10 2023-04-07 北京百度网讯科技有限公司 Method, apparatus, device, storage medium, and program for training document processing model
CN115035538B (en) * 2022-03-22 2023-04-07 北京百度网讯科技有限公司 Training method of text recognition model, and text recognition method and device
CN114743020B (en) * 2022-04-02 2024-05-14 华南理工大学 Food identification method combining label semantic embedding and attention fusion
CN114495106A (en) * 2022-04-18 2022-05-13 电子科技大学 MOCR (metal-oxide-semiconductor resistor) deep learning method applied to DFB (distributed feedback) laser chip
CN115565186B (en) * 2022-09-26 2023-09-22 北京百度网讯科技有限公司 Training method and device for character recognition model, electronic equipment and storage medium
CN116071759B (en) * 2023-03-06 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Optical character recognition method fusing GPT2 pre-training large model
CN116311271B (en) * 2023-03-22 2023-12-26 北京百度网讯科技有限公司 Text image processing method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898137A (en) * 2018-05-25 2018-11-27 黄凯 A kind of natural image character identifying method and system based on deep neural network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10558749B2 (en) * 2017-01-30 2020-02-11 International Business Machines Corporation Text prediction using captured image from an image capture device
CN109117846B (en) * 2018-08-22 2021-11-16 北京旷视科技有限公司 Image processing method and device, electronic equipment and computer readable medium
TWI685695B (en) * 2018-09-20 2020-02-21 友達光電股份有限公司 Display panel
CN109389091B (en) * 2018-10-22 2022-05-03 重庆邮电大学 Character recognition system and method based on combination of neural network and attention mechanism
CN111126410B (en) * 2019-12-31 2022-11-18 讯飞智元信息科技有限公司 Character recognition method, device, equipment and readable storage medium
CN112016543B (en) * 2020-07-24 2024-09-20 华为技术有限公司 Text recognition network, neural network training method and related equipment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898137A (en) * 2018-05-25 2018-11-27 黄凯 A kind of natural image character identifying method and system based on deep neural network

Also Published As

Publication number Publication date
WO2022017245A1 (en) 2022-01-27
CN112016543A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN112016543B (en) Text recognition network, neural network training method and related equipment
CN110020620B (en) Face recognition method, device and equipment under large posture
CN111797893B (en) Neural network training method, image classification system and related equipment
CN115203380B (en) Text processing system and method based on multi-mode data fusion
CN111401406B (en) Neural network training method, video frame processing method and related equipment
Wang et al. Joint object and part segmentation using deep learned potentials
CN112183577A (en) Training method of semi-supervised learning model, image processing method and equipment
US11755889B2 (en) Method, system and apparatus for pattern recognition
WO2021238333A1 (en) Text processing network, neural network training method, and related device
CN109840531A (en) The method and apparatus of training multi-tag disaggregated model
CN109840530A (en) The method and apparatus of training multi-tag disaggregated model
CN110222718B (en) Image processing method and device
CN111414915B (en) Character recognition method and related equipment
CN111695596A (en) Neural network for image processing and related equipment
US11574500B2 (en) Real-time facial landmark detection
CN111931002A (en) Matching method and related equipment
CN113011568B (en) Model training method, data processing method and equipment
EP4401007A1 (en) Neural network acquisition method, data processing method and related device
JP6107531B2 (en) Feature extraction program and information processing apparatus
CN111950700A (en) Neural network optimization method and related equipment
CN114821096A (en) Image processing method, neural network training method and related equipment
Alphonse et al. Novel directional patterns and a Generalized Supervised Dimension Reduction System (GSDRS) for facial emotion recognition
CN114462290A (en) Method and device for generating pre-training artificial intelligence model
CN113627421B (en) Image processing method, training method of model and related equipment
CN115091445B (en) Object texture recognition method, device and equipment for manipulator grabbing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant