CN113065339A - Automatic error correction method, device and equipment for Chinese text and storage medium - Google Patents
- Publication number
- CN113065339A (application CN202110390732.6A)
- Authority
- CN
- China
- Prior art keywords
- character
- chinese text
- pinyin
- text
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses an automatic error correction method for Chinese text, comprising: performing a preset character conversion operation and a preset pinyin conversion operation on a target Chinese text to obtain a character sequence and a pinyin sequence covering each character in the target Chinese text; inputting the character sequence and the pinyin sequence into a preset character pinyin embedding layer for analysis to obtain a text feature vector corresponding to the target Chinese text; inputting the text feature vector into a preset BERT network for analysis to obtain a candidate character list corresponding to each character in the target Chinese text; inputting the candidate character list corresponding to each character into a preset language analysis model for analysis to obtain a corrected Chinese text corresponding to the target Chinese text; and replacing the target Chinese text with the corrected Chinese text. The method can therefore take the pinyin of the Chinese text into account when correcting it, which improves the accuracy of automatic error correction. The invention also relates to the technical field of blockchain.
Description
Technical Field
The present invention relates to the field of natural language processing, and in particular to an automatic error correction method and apparatus for Chinese text, a computer device, and a storage medium.
Background
Automatic error correction of Chinese text is a technique for checking whether a Chinese text contains grammatical or semantic errors and automatically correcting any errors found. It is widely used in keyboard input methods, document editing, search engines, speech recognition, and similar fields. When existing automatic error correction methods embed a Chinese text, they commonly use three forms of embedding: character embedding, paragraph embedding, and position embedding. As a result, they correct text using only the character, paragraph, and position factors of the Chinese text and cannot take other factors into account. For example, when an error arises from pinyin factors such as homophones or fuzzy (near-homophone) sounds, as when a name pronounced "shi hua luo shi qi" is written with a wrong character that has the same pronunciation, existing methods usually cannot correct it accurately. The accuracy of existing automatic error correction methods for Chinese text therefore still has room for improvement.
Disclosure of Invention
The technical problem to be solved by the invention is that existing automatic error correction methods for Chinese text cannot accurately correct errors caused by pinyin factors, so their error correction accuracy is low.
In order to solve the above technical problem, a first aspect of the present invention discloses an automatic error correction method for Chinese text, including:
executing preset character conversion operation on a target Chinese text to obtain a character sequence containing each character in the target Chinese text;
executing preset pinyin conversion operation on the target Chinese text to obtain a pinyin sequence containing pinyin characters corresponding to each character in the target Chinese text;
inputting the character sequence and the pinyin sequence into a preset character pinyin embedding layer for analysis to obtain a text characteristic vector corresponding to the target Chinese text;
inputting the text feature vector into a preset BERT network for analysis to obtain a candidate character list corresponding to each character in the target Chinese text;
inputting a candidate character list corresponding to each character in the target Chinese text into a preset language analysis model for analysis to obtain an error correction Chinese text corresponding to the target Chinese text;
and replacing the target Chinese text with the error correction Chinese text to finish automatic error correction of the target Chinese text.
A second aspect of the present invention discloses an automatic error correction device for Chinese text, the device comprising:
the conversion module is used for executing preset character conversion operation on the target Chinese text to obtain a character sequence containing each character in the target Chinese text;
the conversion module is further used for executing a preset pinyin conversion operation on the target Chinese text to obtain a pinyin sequence containing pinyin characters corresponding to each character in the target Chinese text;
the analysis module is used for inputting the character sequence and the pinyin sequence into a preset character pinyin embedding layer for analysis to obtain a text characteristic vector corresponding to the target Chinese text;
the analysis module is further used for inputting the text feature vector into a preset BERT network for analysis to obtain a candidate character list corresponding to each character in the target Chinese text;
the analysis module is further used for inputting the candidate character list corresponding to each character in the target Chinese text into a preset language analysis model for analysis to obtain an error correction Chinese text corresponding to the target Chinese text;
and the replacing module is used for replacing the target Chinese text with the error correction Chinese text so as to finish automatic error correction of the target Chinese text.
A third aspect of the present invention discloses a computer apparatus, comprising:
a memory storing executable program code;
a processor coupled to the memory;
the processor calls the executable program codes stored in the memory to execute part or all of the steps in the automatic Chinese text correction method disclosed by the first aspect of the invention.
In a fourth aspect of the present invention, a computer storage medium is disclosed, wherein the computer storage medium stores computer instructions, and when the computer instructions are called, the computer instructions are used for executing part or all of the steps in the method for automatically correcting the Chinese text disclosed in the first aspect of the present invention.
In the embodiments of the invention, a character conversion operation and a pinyin conversion operation are performed on a target Chinese text to obtain the corresponding character sequence and pinyin sequence. The character sequence and the pinyin sequence are input into a character pinyin embedding layer for analysis to obtain a text feature vector corresponding to the target Chinese text. The text feature vector is input into a BERT network for analysis to obtain a candidate character list corresponding to each character in the target Chinese text, and the candidate character lists are input into a language analysis model for analysis to obtain an error correction Chinese text corresponding to the target Chinese text. Finally, the target Chinese text is replaced with the error correction Chinese text to complete automatic error correction. Pinyin embedding is thereby introduced into existing automatic error correction methods for Chinese text, so that the text can be corrected with its pinyin taken into account, which improves the accuracy of automatic error correction.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for automatically correcting errors of Chinese texts according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an automatic error correction device for Chinese texts according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computer storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or device that comprises a list of steps or elements is not limited to the listed steps or elements, but may optionally include other steps or elements that are not listed or that are inherent to such process, method, product, or device.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The invention discloses an automatic error correction method and apparatus for Chinese text, a computer device, and a storage medium. A character conversion operation and a pinyin conversion operation are performed on a target Chinese text to obtain the corresponding character sequence and pinyin sequence; the character sequence and the pinyin sequence are input into a character pinyin embedding layer for analysis to obtain a text feature vector corresponding to the target Chinese text; the text feature vector is input into a BERT network for analysis to obtain a candidate character list corresponding to each character in the target Chinese text; the candidate character lists are input into a language analysis model for analysis to obtain an error correction Chinese text corresponding to the target Chinese text; and finally the target Chinese text is replaced with the error correction Chinese text to complete automatic error correction. Pinyin embedding is thereby introduced into existing automatic error correction methods for Chinese text, so that the text can be corrected with its pinyin taken into account, which improves the accuracy of automatic error correction. Details are given below.
Example one
Referring to FIG. 1, FIG. 1 is a schematic flow chart illustrating an automatic error correction method for Chinese text according to an embodiment of the present invention. As shown in FIG. 1, the automatic error correction method for Chinese text may include the following operations:
101. executing preset character conversion operation on a target Chinese text to obtain a character sequence containing each character in the target Chinese text;
In step 101, the character conversion operation may segment the target Chinese text character by character to obtain each character it contains, and then arrange the characters into a character sequence according to their positions in the target Chinese text. For example, if the target Chinese text is the 11-character sentence meaning "I want a shi hua luo shi qi necklace", the character sequence obtained after the character conversion operation is the list of those 11 characters in their original order.
102. Executing preset pinyin conversion operation on the target Chinese text to obtain a pinyin sequence containing pinyin characters corresponding to each character in the target Chinese text;
In step 102, the pinyin conversion operation may segment the target Chinese text character by character, look up the pinyin corresponding to each character in a pre-stored character-to-pinyin mapping (for example, a pre-stored pinyin database that records the mapping between every Chinese character and its pinyin), and arrange the resulting pinyin into a pinyin sequence according to the positions of the corresponding characters in the target Chinese text. If the target Chinese text is the sentence meaning "I want a shi hua luo shi qi necklace", the pinyin sequence obtained after the pinyin conversion operation is [wo xiang yao shi hua luo shi qi de xiang lian].
103. Inputting the character sequence and the pinyin sequence into a preset character pinyin embedding layer for analysis to obtain a text characteristic vector corresponding to the target Chinese text;
In step 103, the character pinyin embedding layer is a new embedding layer obtained by adding pinyin embedding to the embedding layer of the BERT model. The embedding layer of the BERT model usually comprises character embedding, paragraph embedding, and position embedding. Adding pinyin embedding to this original embedding layer yields a layer that performs both pinyin embedding and character embedding, so the input character sequence and pinyin sequence can be analyzed to obtain the corresponding text feature vector (the specific analysis process is described later).
104. Inputting the text feature vector into a preset bert network for analysis to obtain a candidate character list corresponding to each character in the target Chinese text;
In step 104, the BERT network is an existing, mature artificial neural network that usually comprises a multi-layer transformer structure. It dispenses with conventional RNN and CNN structures and uses an attention mechanism to predict, from the text feature vector, a candidate character list corresponding to each character. For example (characters are given here by their English glosses or pinyin), the candidate character list corresponding to the character "poem" in the target Chinese text is ["poem", "luxury", "shi"], and the candidate character list corresponding to the character "odd" is ["odd", "even", "qi"].
105. Inputting a candidate character list corresponding to each character in the target Chinese text into a preset language analysis model for analysis to obtain an error correction Chinese text corresponding to the target Chinese text;
In step 105, the language analysis model may enumerate all possible paths from the candidate character list corresponding to each character, calculate the probability of each path, and take the path with the highest occurrence probability as the error correction Chinese text (the specific analysis process is described later). For example, for the target Chinese text meaning "I want a shi hua luo shi qi necklace", each sentence formed by choosing one candidate at every character position is a path. If the path in which the name is written with the correct characters has the highest occurrence probability, that path may be taken as the error correction Chinese text corresponding to the target Chinese text.
106. And replacing the target Chinese text with the error correction Chinese text to finish automatic error correction of the target Chinese text.
After the corrected Chinese text is determined, the target Chinese text may be directly replaced with the corrected Chinese text in step 106, thereby completing the automatic correction. For example, a user enters a target Chinese text in an input box, and after automatic correction the text in the input box is automatically changed to the corrected Chinese text.
It can be seen that, by implementing this embodiment of the present invention, a character conversion operation and a pinyin conversion operation are performed on a target Chinese text to obtain the corresponding character sequence and pinyin sequence; the character sequence and the pinyin sequence are input into a character pinyin embedding layer for analysis to obtain a text feature vector corresponding to the target Chinese text; the text feature vector is input into a BERT network for analysis to obtain a candidate character list corresponding to each character in the target Chinese text; the candidate character lists are input into a language analysis model for analysis to obtain an error correction Chinese text corresponding to the target Chinese text; and finally the target Chinese text is replaced with the error correction Chinese text to complete automatic error correction. Pinyin embedding is thereby introduced into existing automatic error correction methods for Chinese text, so that the text can be corrected with its pinyin taken into account, which improves the accuracy of automatic error correction.
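Steps 101 and 102 can be illustrated with a minimal sketch. The Chinese characters used below are an assumed reconstruction of the example sentence, chosen only so that they match the pinyin sequence given in the text, and `PINYIN_TABLE`, `to_char_sequence`, and `to_pinyin_sequence` are hypothetical names. A real system would rely on a complete character-to-pinyin database rather than this small table.

```python
# Hypothetical character-to-pinyin mapping; a real table would cover all Chinese characters.
PINYIN_TABLE = {
    "我": "wo", "想": "xiang", "要": "yao", "诗": "shi", "华": "hua",
    "洛": "luo", "世": "shi", "奇": "qi", "的": "de", "项": "xiang", "链": "lian",
}


def to_char_sequence(text):
    """Step 101: split the text into single characters, preserving their order."""
    return list(text)


def to_pinyin_sequence(text, table):
    """Step 102: look up the pinyin of each character, preserving character order."""
    return [table[ch] for ch in text]


# Assumed reconstruction of the example text (the translated source does not preserve it).
target = "我想要诗华洛世奇的项链"
chars = to_char_sequence(target)
pinyin = to_pinyin_sequence(target, PINYIN_TABLE)
print(chars)
print(" ".join(pinyin))
```

Both sequences keep one entry per character position, which is what allows the embedding layer to combine them position by position.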
In an optional embodiment, the character pinyin embedding layer comprises a preset character vector table and a pinyin vector table;
and inputting the character sequence and the pinyin sequence into a preset character pinyin embedding layer for analysis to obtain a text feature vector corresponding to the target Chinese text, wherein the method comprises the following steps:
searching a character vector value corresponding to each character in the character sequence from the character vector table;
generating a character vector corresponding to the character sequence according to a character vector value corresponding to each character in the character sequence;
searching a pinyin vector value corresponding to each pinyin character in the pinyin sequence from the pinyin vector table;
generating a pinyin vector corresponding to the pinyin sequence according to the pinyin vector value corresponding to each pinyin character in the pinyin sequence;
and calculating the weighted sum of the character vector and the pinyin vector to serve as a text characteristic vector corresponding to the target Chinese text.
In this optional embodiment, the character vector table and the pinyin vector table in the character pinyin embedding layer may each take the form of a matrix, and the matrix may be a weight matrix obtained through BERT pre-training. Looking up the character vector value of each character in the character vector table, and the pinyin vector value of each pinyin character in the pinyin vector table, is in effect an indexing operation. Specifically, the character vector table may record a character index id for each character; the character index id (that is, the character vector value) of each character in the character sequence is obtained through the indexing operation, and the character index ids are then combined into a character vector according to the positions of the corresponding characters in the character sequence. Similarly, the pinyin vector table may record a pinyin index id (that is, the pinyin vector value) for each pinyin character; the pinyin index id of each pinyin character in the pinyin sequence is obtained through the indexing operation, and the pinyin index ids are then combined into a pinyin vector according to the positions of the corresponding pinyin characters in the pinyin sequence. For example, the character sequence of the target Chinese text meaning "I want a shi hua luo shi qi necklace" may be converted into the character vector [00001 00013 00022 00124 00216 00127 00318 00129 00234 00321 00412], and the pinyin sequence [wo xiang yao shi hua luo shi qi de xiang lian] may be converted into the pinyin vector [00101 00213 00122 00224 00236 00147 00348 00119 00334 00421 00512]. After the character vector and the pinyin vector are obtained, they can be weighted and summed to obtain the text feature vector of the target Chinese text.
The weights of the character vector and the pinyin vector can be preset empirical values or values obtained in model training.
Therefore, by implementing the optional embodiment, the pinyin sequence and the character sequence are converted into the pinyin vector and the character vector by searching the character vector table and the pinyin vector table in the character pinyin embedding layer, and finally the weighted sum of the pinyin vector and the character vector is calculated to be used as the text characteristic vector of the target Chinese text, so that the pinyin vector and the character vector can be converted into the text characteristic vector capable of being analyzed subsequently through the character pinyin embedding layer, the pinyin embedding is introduced into the automatic error correction method of the Chinese text, and the accuracy of automatic error correction is improved.
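The lookup and weighted sum described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes that each index id selects a row of a weight matrix (as in a standard BERT embedding), and the 3-dimensional row values, the equal weights of 0.5, and the names `CHAR_VECTORS`, `PINYIN_VECTORS`, and `embed` are all hypothetical. Only the index ids (124 and 129 for two characters, 224 and 119 for their pinyin) are taken from the example vectors in the text.

```python
# Hypothetical rows of the character and pinyin weight matrices, keyed by index id.
CHAR_VECTORS = {124: [0.2, 0.4, 0.6], 129: [1.0, 0.0, 0.5]}
PINYIN_VECTORS = {224: [0.6, 0.2, 0.2], 119: [0.0, 1.0, 0.5]}


def embed(char_ids, pinyin_ids, w_char=0.5, w_pinyin=0.5):
    """Return one feature vector per position: a weighted sum of the two lookups."""
    features = []
    for char_id, pinyin_id in zip(char_ids, pinyin_ids):
        char_vec = CHAR_VECTORS[char_id]        # indexing into the character vector table
        pinyin_vec = PINYIN_VECTORS[pinyin_id]  # indexing into the pinyin vector table
        features.append(
            [w_char * c + w_pinyin * p for c, p in zip(char_vec, pinyin_vec)]
        )
    return features


# Two positions from the example: the 4th character (pinyin shi) and the 8th (pinyin qi).
features = embed([124, 129], [224, 119])
print(features)
```

The weights are passed as parameters because the text allows them to be either preset empirical values or values learned during training.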
In an optional embodiment, the inputting the candidate character list corresponding to each character in the target Chinese text into a preset language analysis model for analysis to obtain the error correction Chinese text corresponding to the target Chinese text includes:
determining a plurality of candidate error correction Chinese texts corresponding to the target Chinese text according to a candidate character list corresponding to each character in the target Chinese text;
calculating the occurrence probability value of each candidate error correction Chinese text based on a preset calculation mode;
and determining the candidate error correction Chinese text corresponding to the maximum occurrence probability value as the error correction Chinese text corresponding to the target Chinese text.
In this optional embodiment, each character in the target Chinese text has its own candidate character list. Take the target Chinese text meaning "I want a shi hua luo shi qi necklace" as an example, with characters referred to by their English glosses or pinyin. The candidate character list for "I" contains one candidate; the lists for "want" (xiang) and "want" (yao) contain two candidates each; the list for "poem" (shi) contains three candidates ("poem", "luxury", "shi"); the list for "hua" contains one; the list for "luo" contains three; the list for "world" (shi) contains one; the list for "odd" (qi) contains three; the list for the particle "de" contains one; the list for "item" (xiang) contains two ("item", "fragrance"); and the list for "chain" (lian) contains two ("chain", "lotus"). For each character position in the target Chinese text, one character is chosen from the candidate character list of that position and used as the character at that position, which determines one possible path (that is, one candidate error correction Chinese text). For example, choosing in turn "I", "want", "want", "shi", "hua", "luo", "world", "odd", "de", "item", "chain" gives a candidate error correction Chinese text in which the name is written with the candidate "shi" in place of "poem". Choosing in turn "I", "can", "buy", "poem", "hua", "luo", "world", "odd", "de", "item", "chain" gives another candidate, in which the sentence begins "I can buy" and the character "poem" is kept. After the possible paths (that is, the candidate error correction Chinese texts) are determined, the occurrence probability value of each candidate error correction Chinese text is calculated, and the candidate with the highest occurrence probability value is taken as the error correction Chinese text. The specific calculation of the occurrence probability value is described later.
For example, if the calculated occurrence probability value of the first candidate above is 0.011 and that of the second candidate is 0.005, the first candidate may be determined as the error correction Chinese text.
Therefore, by implementing the optional embodiment, a plurality of candidate error correction Chinese texts which are possible to appear are determined according to the candidate character list corresponding to each character, then the occurrence probability value of each candidate error correction Chinese text is calculated, and finally the candidate error correction Chinese text with the maximum occurrence probability value is determined as the error correction Chinese text corresponding to the target Chinese text, so that the analysis aiming at each candidate character list can be completed, the accurate error correction Chinese text can be obtained, and the accuracy of automatic error correction is improved.
In an optional embodiment, the calculating the occurrence probability value of each candidate corrected Chinese text based on a preset calculation manner includes:
calculating an occurrence probability value of each of the candidate corrected chinese texts by the following formula:
$$P_{\text{path}} = P(c_1 c_2 \cdots c_n) = P(c_1)\,P(c_2 \mid c_1)\,P(c_3 \mid c_2) \cdots P(c_n \mid c_{n-1})$$
where $P_{\text{path}}$ is the occurrence probability value of the candidate corrected Chinese text; $c_1$ to $c_n$ are the characters in the candidate corrected Chinese text; $P(c_1)$ is the ratio of the number of candidate corrected Chinese texts beginning with the character $c_1$ to the total number of candidate corrected Chinese texts; and $P(c_n \mid c_{n-1})$ is the ratio of the number of candidate corrected Chinese texts in which the characters $c_{n-1}$ and $c_n$ appear consecutively to the total number of candidate corrected Chinese texts.
In this optional embodiment, taking the candidate corrected Chinese text meaning "I want a shi hua luo shi qi necklace" as an example, its occurrence probability value equals the product P(I) · P(want | I) · … · P(chain | item). Here P(I) is the ratio of the number of candidate corrected Chinese texts beginning with the character "I" to the total number of candidate corrected Chinese texts, and P(want | I) is the ratio of the number of candidate corrected Chinese texts in which the characters "I" and "want" appear consecutively to the total number of candidate corrected Chinese texts.
Therefore, by implementing the optional embodiment, the occurrence probability value of the candidate error-corrected Chinese text can be calculated according to the occurrence probability value of the characters in the candidate error-corrected Chinese text, so that the error-corrected Chinese text determined according to the occurrence probability value is more accurate, and the accuracy of automatic error correction is improved.
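The probability calculation and the selection of the most probable candidate can be sketched as follows. This is a minimal illustration that follows the ratios exactly as the text defines them (counts taken over the set of candidate texts); the function name `path_probability` and the single-letter placeholder characters are hypothetical and stand in for real candidate characters.

```python
def path_probability(path, candidates):
    """Chain-rule probability of `path`, using the ratios defined in the text."""
    total = len(candidates)
    # P(c1): share of candidate texts that begin with the first character of the path.
    prob = sum(1 for text in candidates if text[0] == path[0]) / total
    # P(c_i | c_{i-1}): share of candidate texts containing the two characters consecutively.
    for prev, cur in zip(path, path[1:]):
        pair_count = sum(
            1
            for text in candidates
            if any(a == prev and b == cur for a, b in zip(text, text[1:]))
        )
        prob *= pair_count / total
    return prob


# Placeholder candidate texts; each tuple is one path (one character per position).
candidates = [("a", "b", "c"), ("a", "b", "d"), ("e", "b", "c")]

# The candidate with the highest occurrence probability becomes the corrected text.
best = max(candidates, key=lambda text: path_probability(text, candidates))
print(best, path_probability(best, candidates))
```

In this toy set, the path ("a", "b", "c") scores (2/3) × (2/3) × (2/3) = 8/27, which is higher than the other two paths, so it is the one selected.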
In an optional embodiment, the determining, according to the candidate character list corresponding to each character in the target Chinese text, a plurality of candidate corrected Chinese texts corresponding to the target Chinese text includes:
and carrying out Cartesian product calculation on the candidate character list corresponding to each character in the target Chinese text to obtain a plurality of candidate error correction Chinese texts corresponding to the target Chinese text.
In this optional embodiment, performing a Cartesian product calculation on the candidate character lists exhausts every possible path through them, which yields every possible candidate corrected Chinese text. Continuing the example above (with characters referred to by their English glosses or pinyin), the numbers of candidates at the 11 character positions are, in order: "I" 1, "want" (xiang) 2, "want" (yao) 2, "poem" (shi) 3, "hua" 1, "luo" 3, "world" (shi) 1, "odd" (qi) 3, "de" 1, "item" (xiang) 2, and "chain" (lian) 2. The Cartesian product calculation therefore yields 1 × 2 × 2 × 3 × 1 × 3 × 1 × 3 × 1 × 2 × 2 = 432 different paths, that is, 432 candidate corrected Chinese texts in total. Specifically, the Cartesian product of the candidate character lists can be computed with two nested for loops: the outer loop traverses each character in the target Chinese text, and the inner loop traverses the candidate character list of that character.
Therefore, by implementing the optional embodiment, the error-corrected Chinese text is screened from all the candidate error-corrected Chinese texts, all possible candidate error-corrected Chinese texts are calculated by performing Cartesian product calculation on the candidate character list corresponding to each character, the coverage range of the candidate error-corrected Chinese texts can be expanded as much as possible, the finally determined error-corrected Chinese text is more accurate, and the accuracy of automatic error correction is improved.
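The nested-loop enumeration described in this embodiment can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the function name `cartesian_texts` and the sample candidate lists are hypothetical, and `itertools.product` from the Python standard library is used only as a cross-check.

```python
from itertools import product


def cartesian_texts(candidate_lists):
    """Enumerate every candidate text, taking one candidate per character position."""
    texts = [""]
    for candidates in candidate_lists:      # outer loop: each character position
        extended = []
        for cand in candidates:             # inner loop: candidates for that position
            extended.extend(prefix + cand for prefix in texts)
        texts = extended
    return texts


# Hypothetical candidate lists for a three-character text (not taken from the patent).
example_lists = [["我"], ["想", "像"], ["买", "卖"]]
example_texts = cartesian_texts(example_lists)

# The standard-library Cartesian product yields the same set of paths.
reference_texts = ["".join(path) for path in product(*example_lists)]
```

With candidate lists of sizes 1, 2, 2, 3, 1, 3, 1, 3, 1, 2 and 2, as in the example above, the function returns 432 texts. Because the number of paths grows multiplicatively with the list sizes, a practical system would likely need to cap the number of candidates per character.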
In an alternative embodiment, the bert network consists of 12 layers of transformer structures in tandem.
In this alternative embodiment, the transformer structure is an encoder-decoder structure formed by stacking several encoders and decoders. The encoder consists of Multi-Head Attention and a fully connected layer, and is used for converting the input corpus into feature vectors. The decoder, whose input is the output of the encoder together with the predicted result, consists of Masked Multi-Head Attention, Multi-Head Attention and a fully connected layer, and is used for outputting the conditional probability of the final result. After the text feature vectors are analyzed through the 12-layer transformer structure, a candidate character list corresponding to each character in the target Chinese text can be obtained. In the embodiment of the invention, a better analysis result can be obtained by analyzing the text feature vector with a bert network formed by connecting 12 layers of transformer structures in series.
Therefore, by implementing this optional embodiment, the text feature vectors are analyzed by a bert network formed by connecting 12 layers of transformer structures in series, so that the obtained candidate character list is more accurate, the finally determined corrected Chinese text is more accurate, and the accuracy of automatic error correction is improved.
In an optional embodiment, before performing the preset character conversion operation on the target Chinese text to obtain a character sequence including each character in the target Chinese text, the method further includes:
judging whether the target Chinese text contains target characters, wherein the target characters comprise numeric characters and/or alphabetic characters;
and when the target Chinese text is judged to contain the target characters, deleting the target characters from the target Chinese text, and then triggering execution of the preset character conversion operation on the target Chinese text to obtain a character sequence containing each character in the target Chinese text.
In this alternative embodiment, the target characters may be preset; for example, the 9 numeric characters "1" to "9" and the 26 alphabetic characters "a" to "z" are set as the target characters. Automatic error correction of Chinese text can generally only process Chinese characters, so if the target Chinese text contains non-Chinese characters, the automatic error correction may fail to execute normally. Therefore, whether the target Chinese text contains non-Chinese characters (namely the target characters) can be detected first. When the target Chinese text contains the target characters, they are deleted from the target Chinese text and the automatic error correction is then triggered; when the target Chinese text does not contain the target characters, the automatic error correction is triggered directly. This ensures that the automatic error correction of the Chinese text proceeds smoothly.
Therefore, by implementing the optional embodiment, the target characters in the target Chinese text are deleted before the automatic error correction of the target Chinese text is executed, so that the smooth proceeding of the automatic error correction of the Chinese text is ensured.
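The pre-filtering step of this embodiment can be sketched with a regular expression. This is only an illustrative sketch: the function name `strip_target_characters` is hypothetical, and the target character set is assumed here to be the ASCII digits "0" to "9" and the ASCII letters in both cases, which is slightly broader than the preset example given above.

```python
import re

# Assumed target character set: ASCII digits and ASCII letters (hypothetical choice).
TARGET_CHARS = re.compile(r"[0-9A-Za-z]")


def strip_target_characters(text):
    """Delete target characters if any are present; otherwise return the text unchanged."""
    if TARGET_CHARS.search(text):           # judge whether target characters are present
        return TARGET_CHARS.sub("", text)   # delete them before character conversion
    return text


# Hypothetical input mixing Chinese characters with letters and digits.
cleaned = strip_target_characters("我想买iPhone12手机")
```

The cleaned text would then be passed to the preset character conversion operation. Note that deleting characters changes the text length, so a caller that needs to restore the deleted characters after correction would have to record their positions separately.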
Optionally, the automatic error correction information of the Chinese text produced by the automatic error correction method may also be uploaded to a blockchain.
Specifically, the automatic error correction information of the Chinese text is obtained by running the automatic error correction method for the Chinese text, and records how the automatic error correction was carried out, for example, the target Chinese text being corrected, the character sequence and pinyin sequence obtained by conversion, the text feature vector obtained by analysis, the candidate character lists, and the finally obtained error-corrected Chinese text. Uploading this information to the blockchain ensures its security as well as fairness and transparency to the user. The user can download the automatic error correction information from the blockchain to check whether it has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of an automatic error correction device for Chinese text according to an embodiment of the present invention. As shown in fig. 2, the automatic error correction device for Chinese text may include:
the conversion module 201 is configured to perform a preset character conversion operation on a target Chinese text to obtain a character sequence including each character in the target Chinese text;
the conversion module 201 is further configured to perform a preset pinyin conversion operation on the target Chinese text to obtain a pinyin sequence including pinyin characters corresponding to each character in the target Chinese text;
the analysis module 202 is configured to input the character sequence and the pinyin sequence to a preset character pinyin embedding layer for analysis, so as to obtain a text feature vector corresponding to the target Chinese text;
the analysis module 202 is further configured to input the text feature vector to a preset bert network for analysis, so as to obtain a candidate character list corresponding to each character in the target Chinese text;
the analysis module 202 is further configured to input the candidate character list corresponding to each character in the target Chinese text to a preset language analysis model for analysis, so as to obtain an error correction Chinese text corresponding to the target Chinese text;
a replacing module 203, configured to replace the target Chinese text with the error correction Chinese text, so as to complete automatic error correction of the target Chinese text.
In an optional embodiment, the character pinyin embedding layer comprises a preset character vector table and a pinyin vector table;
and the specific manner in which the analysis module 202 inputs the character sequence and the pinyin sequence into the preset character pinyin embedding layer for analysis to obtain the text feature vector corresponding to the target Chinese text is as follows:
searching a character vector value corresponding to each character in the character sequence from the character vector table;
generating a character vector corresponding to the character sequence according to a character vector value corresponding to each character in the character sequence;
searching a pinyin vector value corresponding to each pinyin character in the pinyin sequence from the pinyin vector table;
generating a pinyin vector corresponding to the pinyin sequence according to the pinyin vector value corresponding to each pinyin character in the pinyin sequence;
and calculating the weighted sum of the character vector and the pinyin vector to serve as a text characteristic vector corresponding to the target Chinese text.
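The table lookup and weighted sum performed by the character pinyin embedding layer can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `embed_text`, the two-dimensional vector tables, and the equal weights of 0.5 are all hypothetical, since the source does not specify the vector dimension or the weights.

```python
def embed_text(chars, pinyins, char_table, pinyin_table, w_char=0.5, w_pinyin=0.5):
    """Return one feature vector per character as a weighted sum of two lookups."""
    features = []
    for ch, py in zip(chars, pinyins):
        char_vec = char_table[ch]        # look up the character vector value
        pinyin_vec = pinyin_table[py]    # look up the pinyin vector value
        features.append(
            [w_char * a + w_pinyin * b for a, b in zip(char_vec, pinyin_vec)]
        )
    return features


# Hypothetical 2-dimensional vector tables (real tables would be learned and far larger).
char_table = {"我": [1.0, 0.0], "想": [0.0, 1.0]}
pinyin_table = {"wo": [0.0, 2.0], "xiang": [2.0, 0.0]}

features = embed_text(["我", "想"], ["wo", "xiang"], char_table, pinyin_table)
```

Summing the two vectors element-wise requires the character vectors and pinyin vectors to share the same dimension, which is assumed here.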
For a detailed description of the automatic error correction device for Chinese text, reference may be made to the detailed description of the automatic error correction method for Chinese text; to avoid repetition, the details are omitted here.
Example three
Referring to fig. 3, fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 3, the computer apparatus may include:
a memory 301 storing executable program code;
a processor 302 connected to the memory 301;
the processor 302 calls the executable program code stored in the memory 301 to execute the steps of the method for automatically correcting the Chinese text according to the embodiment of the present invention.
Example four
The embodiment of the invention discloses a computer storage medium 401, wherein a computer instruction is stored in the computer storage medium 401, and when the computer instruction is called, the computer instruction is used for executing the steps in the automatic error correction method for the Chinese text disclosed by the embodiment of the invention.
The above-described embodiments of the apparatus are merely illustrative, and the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above detailed description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. Based on such understanding, the above technical solutions may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, where the storage medium includes a Read-Only Memory (ROM), a Random Access Memory (RAM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-time Programmable Read-Only Memory (OTPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, a magnetic disk memory, a magnetic tape memory, or any other computer-readable medium that can be used to carry or store data.
Finally, it should be noted that the method, apparatus, computer device and storage medium for automatic error correction of Chinese text disclosed in the embodiments of the present invention are only preferred embodiments of the present invention, and are only used for illustrating the technical solutions of the present invention, not for limiting them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. An automatic error correction method for Chinese text, the method comprising:
executing preset character conversion operation on a target Chinese text to obtain a character sequence containing each character in the target Chinese text;
executing preset pinyin conversion operation on the target Chinese text to obtain a pinyin sequence containing pinyin characters corresponding to each character in the target Chinese text;
inputting the character sequence and the pinyin sequence into a preset character pinyin embedding layer for analysis to obtain a text characteristic vector corresponding to the target Chinese text;
inputting the text feature vector into a preset bert network for analysis to obtain a candidate character list corresponding to each character in the target Chinese text;
inputting a candidate character list corresponding to each character in the target Chinese text into a preset language analysis model for analysis to obtain an error correction Chinese text corresponding to the target Chinese text;
and replacing the target Chinese text with the error correction Chinese text to finish automatic error correction of the target Chinese text.
2. The automatic error correction method for Chinese text according to claim 1, wherein the character pinyin embedding layer includes a preset character vector table and a pinyin vector table;
and inputting the character sequence and the pinyin sequence into a preset character pinyin embedding layer for analysis to obtain a text feature vector corresponding to the target Chinese text, wherein the method comprises the following steps:
searching a character vector value corresponding to each character in the character sequence from the character vector table;
generating a character vector corresponding to the character sequence according to a character vector value corresponding to each character in the character sequence;
searching a pinyin vector value corresponding to each pinyin character in the pinyin sequence from the pinyin vector table;
generating a pinyin vector corresponding to the pinyin sequence according to the pinyin vector value corresponding to each pinyin character in the pinyin sequence;
and calculating the weighted sum of the character vector and the pinyin vector to serve as a text characteristic vector corresponding to the target Chinese text.
3. The automatic error correction method for Chinese text according to claim 1 or 2, wherein the inputting a candidate character list corresponding to each character in the target Chinese text into a preset language analysis model for analysis to obtain an error correction Chinese text corresponding to the target Chinese text comprises:
determining a plurality of candidate error correction Chinese texts corresponding to the target Chinese text according to a candidate character list corresponding to each character in the target Chinese text;
calculating the occurrence probability value of each candidate error correction Chinese text based on a preset calculation mode;
and determining the candidate error correction Chinese text corresponding to the maximum occurrence probability value as the error correction Chinese text corresponding to the target Chinese text.
4. The automatic error correction method for Chinese text according to claim 3, wherein the calculating the occurrence probability value of each candidate error correction Chinese text based on a preset calculation mode comprises:
calculating the occurrence probability value of each candidate error correction Chinese text by the following formula:
$P_{path} = P(c_1 c_2 \ldots c_n) = P(c_1)\,P(c_2 \mid c_1)\,P(c_3 \mid c_2) \ldots P(c_n \mid c_{n-1})$
wherein $P_{path}$ is the occurrence probability value of the candidate error correction Chinese text; $c_1$ to $c_n$ are the characters in the candidate error correction Chinese text; $P(c_1)$ is the ratio of the number of candidate error correction Chinese texts beginning with the character $c_1$ to the total number of candidate error correction Chinese texts; and $P(c_n \mid c_{n-1})$ is the ratio of the number of candidate error correction Chinese texts in which the character $c_{n-1}$ and the character $c_n$ appear consecutively to the total number of candidate error correction Chinese texts.
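The formula above can be illustrated with a short sketch. It is an interpretation, not the claimed implementation: the function names `path_probability` and `best_candidate` and the sample texts are hypothetical, and all counts are taken over the list of candidate texts itself, which is one reading of the definitions given above.

```python
def path_probability(text, candidates):
    """Chain P(c1) * P(c2|c1) * ... * P(cn|cn-1), with counts taken over candidates."""
    if not text or not candidates:
        return 0.0
    total = len(candidates)
    # P(c1): share of candidate texts that begin with the first character.
    prob = sum(1 for t in candidates if t.startswith(text[0])) / total
    # P(ck | ck-1): share of candidate texts containing the adjacent pair ck-1 ck.
    for prev, cur in zip(text, text[1:]):
        prob *= sum(1 for t in candidates if prev + cur in t) / total
    return prob


def best_candidate(candidates):
    """Pick the candidate text with the maximum occurrence probability value."""
    return max(candidates, key=lambda t: path_probability(t, candidates))


# Hypothetical candidate texts (not taken from the patent).
sample = ["我想买", "我想卖", "我像买"]
p_first = path_probability("我想买", sample)   # 1 * (2/3) * (1/3)
p_third = path_probability("我像买", sample)   # 1 * (1/3) * (1/3)
```

In this sample the first text scores higher than the third because the pair "我想" occurs in two of the three candidates while "我像" occurs in only one. Selecting the maximum corresponds to the determination step of claim 3.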
5. The automatic error correction method for Chinese text according to claim 3, wherein the determining a plurality of candidate error correction Chinese texts corresponding to the target Chinese text according to the candidate character list corresponding to each character in the target Chinese text comprises:
and carrying out Cartesian product calculation on the candidate character list corresponding to each character in the target Chinese text to obtain a plurality of candidate error correction Chinese texts corresponding to the target Chinese text.
6. The automatic error correction method for Chinese text according to any one of claims 1-5, wherein the bert network is composed of 12 layers of transformer structures connected in series.
7. The automatic error correction method for Chinese text according to any one of claims 1-5, wherein before the executing preset character conversion operation on the target Chinese text to obtain a character sequence containing each character in the target Chinese text, the method further comprises:
judging whether the target Chinese text contains target characters, wherein the target characters comprise numeric characters and/or alphabetic characters;
and when the target Chinese text is judged to contain the target characters, deleting the target characters from the target Chinese text, and then triggering execution of the preset character conversion operation on the target Chinese text to obtain a character sequence containing each character in the target Chinese text.
8. An automatic error correction device for Chinese text, the device comprising:
the conversion module is used for executing preset character conversion operation on the target Chinese text to obtain a character sequence containing each character in the target Chinese text;
the conversion module is further used for executing a preset pinyin conversion operation on the target Chinese text to obtain a pinyin sequence containing pinyin characters corresponding to each character in the target Chinese text;
the analysis module is used for inputting the character sequence and the pinyin sequence into a preset character pinyin embedding layer for analysis to obtain a text characteristic vector corresponding to the target Chinese text;
the analysis module is further used for inputting the text feature vector to a preset bert network for analysis to obtain a candidate character list corresponding to each character in the target Chinese text;
the analysis module is further used for inputting a candidate character list corresponding to each character in the target Chinese text to a preset language analysis model for analysis to obtain an error correction Chinese text corresponding to the target Chinese text;
and the replacing module is used for replacing the target Chinese text with the error correction Chinese text so as to finish automatic error correction of the target Chinese text.
9. A computer device, characterized in that the computer device comprises:
a memory storing executable program code;
a processor coupled to the memory;
the processor calls the executable program code stored in the memory to execute the automatic error correction method for Chinese text according to any one of claims 1-7.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the automatic error correction method for Chinese text according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110390732.6A CN113065339B (en) | 2021-04-12 | 2021-04-12 | Automatic error correction method, device and equipment for Chinese text and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110390732.6A CN113065339B (en) | 2021-04-12 | 2021-04-12 | Automatic error correction method, device and equipment for Chinese text and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113065339A (en) | 2021-07-02 |
CN113065339B (en) | 2023-06-30 |
Family
ID=76566398
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110390732.6A Active CN113065339B (en) | 2021-04-12 | 2021-04-12 | Automatic error correction method, device and equipment for Chinese text and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113065339B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190103097A1 (en) * | 2017-09-29 | 2019-04-04 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for correcting input speech based on artificial intelligence, and storage medium |
CN109165384A (en) * | 2018-08-23 | 2019-01-08 | 成都四方伟业软件股份有限公司 | A kind of name entity recognition method and device |
CN111444705A (en) * | 2020-03-10 | 2020-07-24 | 中国平安人寿保险股份有限公司 | Error correction method, device, equipment and readable storage medium |
CN112199945A (en) * | 2020-08-19 | 2021-01-08 | 宿迁硅基智能科技有限公司 | Text error correction method and device |
CN112016310A (en) * | 2020-09-03 | 2020-12-01 | 平安科技(深圳)有限公司 | Text error correction method, system, device and readable storage medium |
CN112016305A (en) * | 2020-09-09 | 2020-12-01 | 平安科技(深圳)有限公司 | Text error correction method, device, equipment and storage medium |
CN112231480A (en) * | 2020-10-23 | 2021-01-15 | 中电科大数据研究院有限公司 | Character and voice mixed error correction model based on bert |
CN112287670A (en) * | 2020-11-18 | 2021-01-29 | 北京明略软件系统有限公司 | Text error correction method, system, computer device and readable storage medium |
CN112395861A (en) * | 2020-11-18 | 2021-02-23 | 平安普惠企业管理有限公司 | Method and device for correcting Chinese text and computer equipment |
CN112507695A (en) * | 2020-12-01 | 2021-03-16 | 平安科技(深圳)有限公司 | Text error correction model establishing method, device, medium and electronic equipment |
Non-Patent Citations (1)
Title |
---|
张庆恒: "智能机器外呼系统的设计与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑 (月刊)》, no. 08, pages 138 - 197 * |
Also Published As
Publication number | Publication date |
---|---|
CN113065339B (en) | 2023-06-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||