CN113449510A - Text recognition method, device, equipment and storage medium - Google Patents
- Publication number: CN113449510A
- Application number: CN202110720540.7A
- Authority: CN (China)
- Prior art keywords: text, target, word, sequence, sentence vector
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F16/3344—Query execution using natural language analysis
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/30—Semantic analysis
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science; Theoretical Computer Science; Physics & Mathematics; General Engineering & Computer Science; Artificial Intelligence; Computational Linguistics; General Physics & Mathematics; Health & Medical Sciences; General Health & Medical Sciences; Audiology, Speech & Language Pathology; Data Mining & Analysis; Biomedical Technology; Computing Systems; Molecular Biology; Evolutionary Computation; Mathematical Physics; Software Systems; Biophysics; Life Sciences & Earth Sciences; Databases & Information Systems; Machine Translation
Abstract
The invention relates to the field of artificial intelligence, and provides a text recognition method, device, equipment, and storage medium for improving the accuracy of sensitive text recognition. The text recognition method comprises the following steps: acquiring a text to be processed, and performing maximum entropy-based word sequence conversion on the text to be processed to obtain a target word sequence; performing similar word replacement based on a similar word candidate set on the target word sequence to obtain a similar word sequence; performing pinyin conversion on the target word sequence through a preset deep learning network model to obtain a pinyin sequence; performing sentence vector conversion on the text to be processed, the similar word sequence, and the pinyin sequence respectively to obtain a target text sentence vector, a target similar sentence vector, and a target pinyin sentence vector; and performing sensitive text classification on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector through a preset binary classification neural network model to obtain a target text. In addition, the invention also relates to blockchain technology: the text to be processed can be stored in a blockchain.
Description
Technical Field
The invention relates to the field of intelligent decision-making in artificial intelligence, and in particular to a text recognition method, device, equipment, and storage medium.
Background
With the continuous development and innovation of internet technology, online public opinion has penetrated every aspect of social life. If users input non-compliant words in a dialogue system, a large amount of manpower is still consumed subsequently to achieve effective data reflow.
To strengthen supervision of the network environment, the industry identifies non-compliant keywords (i.e., sensitive words) by constructing a keyword dictionary: if such keywords appear in the utterance a user inputs into the dialogue system, the utterance is judged to be non-compliant. However, this method has poor extensibility, and texts in which keywords are replaced by pinyin, homophones, or visually similar characters cannot be identified, so the accuracy of sensitive text recognition is low.
Disclosure of Invention
The invention provides a text recognition method, a text recognition device, text recognition equipment and a storage medium, which are used for improving the accuracy of sensitive text recognition.
The invention provides a text recognition method in a first aspect, which comprises the following steps:
acquiring a text to be processed, and performing maximum entropy-based word sequence conversion on the text to be processed to obtain a target word sequence;
performing similar word replacement based on a similar word candidate set on the target word sequence to obtain a similar word sequence;
performing pinyin conversion on the target word sequence through a preset deep learning network model to obtain a pinyin sequence;
sentence vector conversion is respectively carried out on the text to be processed, the similar word sequence and the pinyin sequence to obtain a target text sentence vector, a target similar sentence vector and a target pinyin sentence vector;
and performing sensitive text classification on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector through a preset binary classification neural network model to obtain a target text.
Optionally, in a first implementation manner of the first aspect of the present invention, the obtaining a text to be processed and performing maximum entropy-based word sequence conversion on the text to be processed to obtain a target word sequence includes:
acquiring a text to be processed, and performing text preprocessing and word segmentation on the text to be processed to obtain initial segmented words;
performing ambiguous word extraction and segmentation recombination on the initial segmented words to obtain recombined segmented words;
performing probability screening on the recombined segmented words through a preset maximum entropy model to obtain effective segmented words;
and replacing the corresponding initial segmented words with the effective segmented words to obtain a target word sequence.
Optionally, in a second implementation manner of the first aspect of the present invention, the performing similar word replacement based on a similar word candidate set on the target word sequence to obtain a similar word sequence includes:
carrying out context-based error word detection and word position recognition on the target word sequence to obtain an error word set and target position data, wherein the target position data is position data of each error word in the error word set;
performing word extension and sequence truncation on the error word set through the target position data to obtain a word set to be corrected;
and acquiring a similar candidate set corresponding to the target word sequence, and performing comparative analysis and word replacement between the set of words to be corrected and the similar candidate set to obtain a similar word sequence.
Optionally, in a third implementation manner of the first aspect of the present invention, the performing pinyin conversion on the target word sequence through a preset deep learning network model to obtain a pinyin sequence includes:
performing sequence truncation on the target word sequence according to a preset length through a coding layer in a preset deep learning network model to obtain a cut word sequence, wherein the deep learning network model comprises the coding layer and a decoding layer;
splicing the cut word sequence based on the starting characters to obtain a processed sequence;
and decoding the processed sequence through the decoding layer to obtain a pinyin sequence.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the decoding, by the decoding layer, the processed sequence to obtain a pinyin sequence includes:
performing parameter learning on the processed sequence through a training decoding layer in the decoding layers to obtain a parameter to be predicted, wherein the decoding layers comprise the training decoding layer and a prediction decoding layer;
performing pinyin prediction on the processed sequence through the prediction decoding layer based on the parameter to be predicted to obtain a predicted sequence;
and splicing and circularly decoding the prediction sequence based on the cut word sequence to obtain a pinyin sequence.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the performing sentence vector conversion on the text to be processed, the similar word sequence, and the pinyin sequence respectively to obtain a target text sentence vector, a target similar sentence vector, and a target pinyin sentence vector includes:
respectively performing word vector conversion and weighted average on the text to be processed, the similar word sequence and the pinyin sequence through a machine translation coding layer in the deep learning network model to obtain an initial text sentence vector, an initial similar sentence vector and an initial pinyin sentence vector;
respectively calculating the vector weights of the initial text sentence vector, the initial similar sentence vector and the initial pinyin sentence vector to obtain a text weight, a similar weight and a pinyin weight;
and carrying out weighted summation on the initial text sentence vector through the text weight to obtain a target text sentence vector, carrying out weighted summation on the initial similar sentence vector through the similar weight to obtain a target similar sentence vector, and carrying out weighted summation on the initial pinyin sentence vector through the pinyin weight to obtain a target pinyin sentence vector.
Optionally, in a sixth implementation manner of the first aspect of the present invention, the performing sensitive text classification on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector through a preset binary classification neural network model to obtain a target text includes:
performing sensitive word probability calculation on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector respectively through a preset binary classification neural network model to obtain a text sensitive probability, a similar sensitive probability, and a pinyin sensitive probability;
and performing sensitive text judgment on the text to be processed according to the text sensitive probability, the similar sensitive probability, and the pinyin sensitive probability to obtain a target text.
A second aspect of the present invention provides a text recognition apparatus, including:
the first conversion module is used for acquiring a text to be processed, and performing maximum entropy-based word sequence conversion on the text to be processed to obtain a target word sequence;
the second conversion module is used for carrying out similar word replacement based on a similar word candidate set on the target word sequence to obtain a similar word sequence;
the third conversion module is used for performing pinyin conversion on the target word sequence through a preset deep learning network model to obtain a pinyin sequence;
a fourth conversion module, configured to perform sentence vector conversion on the text to be processed, the similar word sequence, and the pinyin sequence, respectively, to obtain a target text sentence vector, a target similar sentence vector, and a target pinyin sentence vector;
and the classification module is used for performing sensitive text classification on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector through a preset binary classification neural network model to obtain a target text.
Optionally, in a first implementation manner of the second aspect of the present invention, the first conversion module includes:
the word segmentation unit is used for acquiring a text to be processed, and performing text preprocessing and word segmentation on the text to be processed to obtain initial segmented words;
the recombination unit is used for performing ambiguous word extraction and segmentation recombination on the initial segmented words to obtain recombined segmented words;
the screening unit is used for performing probability screening on the recombined segmented words through a preset maximum entropy model to obtain effective segmented words;
and the replacing unit is used for replacing the corresponding initial segmented words with the effective segmented words to obtain a target word sequence.
Optionally, in a second implementation manner of the second aspect of the present invention, the second conversion module is specifically configured to:
carrying out context-based error word detection and word position recognition on the target word sequence to obtain an error word set and target position data, wherein the target position data is position data of each error word in the error word set;
performing word extension and sequence truncation on the error word set through the target position data to obtain a word set to be corrected;
and acquiring a similar candidate set corresponding to the target word sequence, and performing comparative analysis and word replacement between the set of words to be corrected and the similar candidate set to obtain a similar word sequence.
Optionally, in a third implementation manner of the second aspect of the present invention, the third converting module includes:
the truncation unit is used for performing sequence truncation on the target word sequence according to a preset length through a coding layer in a preset deep learning network model to obtain a cut word sequence, and the deep learning network model comprises a coding layer and a decoding layer;
the splicing unit is used for carrying out splicing processing based on starting characters on the cut word sequences to obtain processed sequences;
and the decoding unit is used for decoding the processed sequence through the decoding layer to obtain a pinyin sequence.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the decoding unit is specifically configured to:
performing parameter learning on the processed sequence through a training decoding layer in the decoding layers to obtain a parameter to be predicted, wherein the decoding layers comprise the training decoding layer and a prediction decoding layer;
performing pinyin prediction on the processed sequence through the prediction decoding layer based on the parameter to be predicted to obtain a predicted sequence;
and splicing and circularly decoding the prediction sequence based on the cut word sequence to obtain a pinyin sequence.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the fourth converting module is specifically configured to:
respectively performing word vector conversion and weighted average on the text to be processed, the similar word sequence and the pinyin sequence through a machine translation coding layer in the deep learning network model to obtain an initial text sentence vector, an initial similar sentence vector and an initial pinyin sentence vector;
respectively calculating the vector weights of the initial text sentence vector, the initial similar sentence vector and the initial pinyin sentence vector to obtain a text weight, a similar weight and a pinyin weight;
and carrying out weighted summation on the initial text sentence vector through the text weight to obtain a target text sentence vector, carrying out weighted summation on the initial similar sentence vector through the similar weight to obtain a target similar sentence vector, and carrying out weighted summation on the initial pinyin sentence vector through the pinyin weight to obtain a target pinyin sentence vector.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the classification module is specifically configured to:
performing sensitive word probability calculation on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector respectively through a preset binary classification neural network model to obtain a text sensitive probability, a similar sensitive probability, and a pinyin sensitive probability;
and performing sensitive text judgment on the text to be processed according to the text sensitive probability, the similar sensitive probability, and the pinyin sensitive probability to obtain a target text.
A third aspect of the present invention provides a text recognition apparatus comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the text recognition device to perform the text recognition method described above.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the text recognition method described above.
According to the technical scheme provided by the invention, a text to be processed is obtained, and maximum entropy-based word sequence conversion is performed on the text to be processed to obtain a target word sequence; similar word replacement based on a similar word candidate set is performed on the target word sequence to obtain a similar word sequence; pinyin conversion is performed on the target word sequence through a preset deep learning network model to obtain a pinyin sequence; sentence vector conversion is performed on the text to be processed, the similar word sequence, and the pinyin sequence respectively to obtain a target text sentence vector, a target similar sentence vector, and a target pinyin sentence vector; and sensitive text classification is performed on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector through a preset binary classification neural network model to obtain a target text. In the embodiment of the invention, sensitive text classification is performed on the text to be processed by combining the target word sequence, the similar word sequence, and the pinyin sequence; the scheme has higher extensibility and can identify both sensitive words and texts in which keywords are replaced by pinyin, homophones, or visually similar characters, thereby improving the accuracy of sensitive text recognition.
Drawings
FIG. 1 is a diagram of an embodiment of a text recognition method according to an embodiment of the present invention;
FIG. 2 is a diagram of another embodiment of a text recognition method according to an embodiment of the present invention;
FIG. 3 is a diagram of an embodiment of a text recognition apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of a text recognition apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a text recognition device in the embodiment of the present invention.
Detailed Description
The embodiments of the present invention provide a text recognition method, device, equipment, and storage medium, which improve the accuracy of sensitive text recognition.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of understanding, a specific flow of an embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a text recognition method in an embodiment of the present invention includes:
101. and acquiring a text to be processed, and performing maximum entropy-based word sequence conversion on the text to be processed to obtain a target word sequence.
It is to be understood that the executing subject of the present invention may be a text recognition device, and may also be a terminal or a server, which is not limited herein. The embodiment of the present invention is described by taking a server as an execution subject.
The text to be processed may be question-and-answer information (a query). The server obtains the text to be processed by receiving user-input text information sent through an input interface; the server may also obtain the text to be processed by reading user-input text information stored in a blockchain. The server performs text preprocessing on the text to be processed to obtain a preprocessed text, where the text preprocessing includes case conversion, full-width to half-width conversion, length truncation, and traditional-to-simplified Chinese conversion; calls a preset maximum entropy model to perform probability distribution prediction on the preprocessed text to obtain word entropy values, where a word entropy value indicates the probability of a candidate segmentation; and calls a preset Language Technology Platform (LTP) word segmentation tool to perform word segmentation on the preprocessed text based on the word entropy values to obtain a target word sequence. The LTP word segmentation tool is an open-source language processing system from the Harbin Institute of Technology, used here to perform word segmentation, part-of-speech tagging, named entity recognition, dependency syntactic analysis, and semantic role labeling on the initial sequence to obtain the target word sequence.
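As a rough, minimal sketch of the preprocessing step above (the function name and the length limit are illustrative assumptions, not values from the patent; traditional-to-simplified conversion would typically be delegated to an external library such as OpenCC):

```python
def preprocess(text: str, max_len: int = 128) -> str:
    """Case conversion, full-width to half-width conversion, length truncation."""
    chars = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                  # full-width space
            chars.append(' ')
        elif 0xFF01 <= code <= 0xFF5E:      # full-width ASCII block maps down by 0xFEE0
            chars.append(chr(code - 0xFEE0))
        else:
            chars.append(ch)
    return ''.join(chars).lower()[:max_len]
```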
102. And performing similar word replacement based on the similar word candidate set on the target word sequence to obtain a similar word sequence.
The server extracts a similar word candidate set from a preset database; calculates the glyph similarity and semantic similarity between each segmented word in the target word sequence and each word in the similar word candidate set; performs weighted summation of the glyph similarity and the semantic similarity to obtain a target similarity; and judges whether the target similarity is greater than a preset threshold. If so, the corresponding word in the similar word candidate set is determined as a standby word and the corresponding segmented word in the target word sequence is determined as a word to be replaced; otherwise no processing is performed and the next segmented word is judged, until all segmented words in the target word sequence have been matched against the similar word candidate set, yielding the standby words and the words to be replaced. The words to be replaced are then replaced with the standby words, thereby obtaining a similar word sequence.
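The replacement logic can be sketched as follows; glyph_sim and semantic_sim are hypothetical scoring callables standing in for the glyph and semantic similarity calculations, and the weights and threshold are illustrative assumptions the patent does not specify:

```python
def build_similar_sequence(target_words, candidates, glyph_sim, semantic_sim,
                           w_glyph=0.5, w_sem=0.5, threshold=0.8):
    """Replace each segmented word with the best candidate whose weighted
    glyph + semantic similarity exceeds the threshold; otherwise keep it."""
    result = []
    for word in target_words:
        best_word, best_score = word, threshold
        for cand in candidates:
            score = w_glyph * glyph_sim(word, cand) + w_sem * semantic_sim(word, cand)
            if score > best_score:          # candidate becomes the standby word
                best_word, best_score = cand, score
        result.append(best_word)
    return result
```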
103. And performing pinyin conversion on the target word sequence through a preset deep learning network model to obtain a pinyin sequence.
The preset deep learning network model may be a deep learning network model Seq2Seq, and the deep learning network model includes two Recurrent Neural Networks (RNN), one of which is used as an encoding layer and the other is used as a decoding layer. The server encodes the target word sequence through a coding layer in a preset deep learning network model to obtain an initial coding vector; retrieving pinyin data in a preset database according to the initial coding vector to obtain target pinyin data, wherein the target pinyin data comprises pinyin data of polyphones; carrying out context-based coding conversion on the initial coding vector based on the target pinyin data to obtain pinyin coding vectors; and decoding the pinyin coding vector through a decoding layer in a preset deep learning network model to obtain a pinyin sequence.
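A minimal encoder-decoder sketch in the spirit of the Seq2Seq model described above, written with PyTorch GRUs; the class name, hidden dimension, and vocabulary parameters are assumptions for illustration, not the patent's actual model:

```python
import torch
import torch.nn as nn

class PinyinSeq2Seq(nn.Module):
    """One RNN encodes the character sequence (encoding layer); a second RNN
    decodes it into a pinyin-token sequence (decoding layer)."""
    def __init__(self, char_vocab: int, pinyin_vocab: int, dim: int = 128):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, dim)
        self.pinyin_emb = nn.Embedding(pinyin_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, pinyin_vocab)

    def forward(self, chars: torch.Tensor, pinyin_in: torch.Tensor) -> torch.Tensor:
        _, state = self.encoder(self.char_emb(chars))        # encoding layer
        dec_out, _ = self.decoder(self.pinyin_emb(pinyin_in), state)
        return self.out(dec_out)                             # per-step pinyin logits
```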
104. And respectively carrying out sentence vector conversion on the text to be processed, the similar word sequence and the pinyin sequence to obtain a target text sentence vector, a target similar sentence vector and a target pinyin sentence vector.
The server performs word vector conversion on the text to be processed, the similar word sequence, and the pinyin sequence respectively to obtain text word vectors, similar word vectors, and pinyin word vectors; it then calls a preset sentence vector conversion algorithm to perform sentence vector conversion on the text word vectors, the similar word vectors, and the pinyin word vectors respectively to obtain the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector, where the sentence vector conversion algorithm may be any one of an accumulation method, an averaging method, a term frequency-inverse document frequency (TF-IDF) weighted averaging method, and a smooth inverse frequency (SIF) embedding method.
The accumulation method is performed, for example, as follows, taking sentence vector conversion of the text word vectors as an example: the server performs non-stop-word recognition, non-stop-word filtering, and non-stop-word counting on the text word vectors to obtain the filtered text word vectors and the number of non-stop words, and superimposes the filtered text word vectors in sequence to obtain the target text sentence vector.
The averaging method is performed, for example, as follows, taking sentence vector conversion of the pinyin word vectors as an example: the server performs non-stop-word recognition and non-stop-word filtering to obtain the filtered pinyin word vectors, superimposes the filtered pinyin word vectors in sequence to obtain an initial pinyin sentence vector, and divides the initial pinyin sentence vector by the number of non-stop words to obtain the target pinyin sentence vector.
The TF-IDF weighted averaging method is performed, for example, as follows, taking sentence vector conversion of the similar word vectors as an example: the server performs non-stop-word recognition and non-stop-word filtering on the similar word vectors to obtain the filtered similar word vectors, superimposes the filtered similar word vectors in sequence to obtain an initial similar sentence vector, calculates the term frequency-inverse document frequency index of each filtered similar word vector in the initial similar sentence vector and determines it as that word vector's weight, and performs weighted average calculation on the filtered similar word vectors in the initial similar sentence vector to obtain the target similar sentence vector.
The SIF embedding method is performed, for example, as follows, taking sentence vector conversion of the text word vectors as an example: the server traverses the sentences in a preset corpus with the text word vectors, performs sentence-based weighted average calculation of the word vectors through a preset SIF embedding calculation formula to obtain an initial text sentence vector, performs principal component analysis on the initial text sentence vector to obtain principal component information, and removes the common component from the initial text sentence vector according to the principal component information to obtain the target text sentence vector.
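The averaging and TF-IDF weighted averaging methods above can be sketched with NumPy as follows; the array shapes and argument names are illustrative assumptions (word_vecs is an (n, d) matrix of word vectors, stopword_mask a boolean array marking stop words, tfidf_weights the per-word TF-IDF scores):

```python
import numpy as np

def average_sentence_vector(word_vecs: np.ndarray, stopword_mask: np.ndarray) -> np.ndarray:
    """Averaging method: sum the non-stop-word vectors, divide by their count."""
    kept = word_vecs[~stopword_mask]
    return kept.sum(axis=0) / len(kept)

def tfidf_sentence_vector(word_vecs: np.ndarray, tfidf_weights) -> np.ndarray:
    """TF-IDF weighted average: each filtered word vector is weighted by its
    term frequency-inverse document frequency score."""
    w = np.asarray(tfidf_weights)
    return (word_vecs * w[:, None]).sum(axis=0) / w.sum()
```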
105. And performing sensitive text classification on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector through a preset binary classification neural network model to obtain a target text.
The server calls a preset binary classification neural network model to perform sensitive text probability calculation on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector respectively to obtain a text sensitive value, a similar sensitive value, and a pinyin sensitive value, and performs weighted summation of the three values to obtain a target probability value; judges whether the target probability value is greater than a preset sensitive threshold, and if so determines the corresponding text to be processed as a sensitive text, otherwise as a non-sensitive text; and determines the sensitive text and the non-sensitive text as target texts.
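A minimal sketch of this weighted fusion and thresholding; the fusion weights and the sensitive threshold are illustrative assumptions, since the patent does not specify their values:

```python
def classify_sensitive(p_text: float, p_similar: float, p_pinyin: float,
                       weights=(0.4, 0.3, 0.3), threshold=0.5) -> str:
    """Weighted summation of the three sensitive values, then thresholding."""
    target_prob = (weights[0] * p_text
                   + weights[1] * p_similar
                   + weights[2] * p_pinyin)
    return "sensitive" if target_prob > threshold else "non-sensitive"
```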
In the embodiment of the invention, sensitive text classification is performed on the text to be processed by combining the target word sequence, the similar word sequence, and the pinyin sequence; the scheme has higher extensibility and can identify both sensitive words and texts in which keywords are replaced by pinyin, homophones, or visually similar characters, thereby improving the accuracy of sensitive text recognition.
Referring to fig. 2, another embodiment of the text recognition method according to the embodiment of the present invention includes:
201. and acquiring a text to be processed, and performing text preprocessing and word segmentation on the text to be processed to obtain initial word segmentation.
After obtaining the text to be processed, the server performs case conversion, full-width to half-width conversion, length truncation, and traditional-to-simplified Chinese conversion on it to obtain the preprocessed text. The server then calls a preset word segmentation tool to perform word segmentation, part-of-speech tagging, named entity recognition, dependency syntactic analysis, and semantic role labeling on the preprocessed text to obtain the initial segmented words, where the preset word segmentation tool may be the Language Technology Platform (LTP) word segmentation tool.
202. And carrying out ambiguous word extraction and word segmentation recombination on the initial word segmentation to obtain recombined word segmentation.
The server performs context-based extraction of segmentation ambiguity points on the initial segmented words to obtain target ambiguous words, and performs segmentation recombination on the target ambiguous words to obtain the recombined segmented words, where the segmentation recombination can be realized in at least one of the following three ways: (1) combining the segmentation ambiguity point with at least one character immediately following it; (2) combining at least one character immediately preceding the segmentation ambiguity point with the ambiguity point; (3) combining at least one character immediately preceding the segmentation ambiguity point, the ambiguity point itself, and at least one character immediately following it.
203. And carrying out probability screening on the recombined word segmentation through a preset maximum entropy model to obtain effective word segmentation.
The server performs maximum entropy score calculation on the recombined segmented words through a preset maximum entropy model to obtain segmentation probabilities, where a segmentation probability is the probability that the recombined segmented word occurs. The recombined segmented words are arranged in descending order of segmentation probability, and the one ranked first is taken as the effective segmented word, i.e., the correct segmentation.
For example, the Language Technology Platform (LTP) word segmentation tool is used to segment the processed sequence "working in such environment too afraid", giving the initial segmented words "working in/such/environment/working/yes/too/afraid/out"; "working" and "work" are extracted as segmentation ambiguity points, and the three recombination methods above yield the recombined segmented words "working" and "work". The maximum entropy model computes maximum entropy scores for these two recombined words and sorts them in descending order (i.e., in descending order of segmentation probability); "work" is ranked first, meaning its segmentation probability is highest and it is more likely the correct segmentation result, so "work" is taken as the effective segmented word.
204. And performing word segmentation replacement on the initial word segmentation through effective word segmentation to obtain a target word sequence.
The server replaces, in the initial segmented words, the segmented words corresponding to the effective segmented words with the effective segmented words to obtain the target word sequence. For example, replacing the ambiguous segmentation in the initial segmented words "work/at/this/environment/work/yes/too/afraid" with the effective segmented word "work" yields the correct segmentation result (i.e., the target word sequence) "at/this/environment/work/yes/too/afraid/out".
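Steps 202-204 can be sketched as follows; the score function is a hypothetical stand-in for the maximum entropy model's probability, and the index is assumed to be interior (not the first or last word):

```python
def resegment(initial_words, ambiguous_idx, score):
    """Build the three candidate recombinations around an ambiguity point and
    keep the one the maximum entropy model scores highest."""
    w, i = initial_words, ambiguous_idx
    candidates = [
        w[i] + w[i + 1][:1],                   # (1) ambiguity point + following character
        w[i - 1][-1:] + w[i],                  # (2) preceding character + ambiguity point
        w[i - 1][-1:] + w[i] + w[i + 1][:1],   # (3) both sides combined
    ]
    # Descending sort by maximum entropy probability; the first-ranked
    # candidate is taken as the effective segmented word.
    return sorted(candidates, key=score, reverse=True)[0]
```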
205. And performing similar word replacement based on the similar word candidate set on the target word sequence to obtain a similar word sequence.
Specifically, the server performs context-based error word detection and word position recognition on the target word sequence to obtain an error word set and target position data, wherein the target position data is position data of each error word in the error word set; carrying out word extension and sequence truncation on the wrong word set through the target position data to obtain a word set to be corrected; and acquiring a similar candidate set corresponding to the target word sequence, and performing comparative analysis and word replacement on the word set to be corrected and the similar candidate set to obtain a similar word sequence.
Context-based error word detection means that the server recognizes the contextual coherence of the target word sequence through preset context relationships to obtain the error word set, i.e., words that do not fit the preset context relationships are found in the target word sequence. The server then performs word position recognition on the error word set to obtain the target position data. For example, context-based error word detection on the target word sequence "past A1/west A2/very A3/hot A4" finds that "west" may be an error word (the error word set) and that its position is A2, where A1-A4 denote the position information corresponding to each word; the target position data is thus "west, A2".
The server performs forward or backward word extension and sequence truncation on the error word set according to the target position data to obtain the set of words to be corrected. For example, extending the error word "west" forward or backward gives "east-west day" and "west-sky", and performing sequence truncation on these gives the set of words to be corrected "east-west/day" and "west/sky".
The server performs similar candidate word acquisition on the target word sequence to obtain the similar candidate set corresponding to the target word sequence, where the similar candidate set includes semantically similar words and glyph-similar words. For example, for the target word sequence "past/western day/very/hot", the candidates for "past" include "weekday", "previous", and "previous day"; the similar words for "western day" include "two days" and "unitary day"; the similar words for "very" include "heel", "hate", and "very"; and the similar words for "hot" include "dry", "stuffy", and "hot". These similar candidate words form the similar candidate set of the target word sequence.
The server performs similarity comparison (comparative analysis) and word replacement between the set of words to be corrected and the similar candidate set to obtain the similar word sequence. For example, for the set of words to be corrected "past/west day/abhor/hot", context-based similarity comparison and word replacement using the similar candidate set can yield the similar word sequences "past/weather/very/hot" and "past/two days/very/hot".
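The overall correction pipeline of step 205 can be sketched as follows; detect_errors, extend, candidates, and similarity are hypothetical stand-ins for the detection, extension, candidate-set, and comparison steps described above, and the threshold is an illustrative assumption:

```python
def build_corrected_sequences(target_words, detect_errors, extend, candidates,
                              similarity, threshold=0.8):
    """Detect suspicious words with their positions, extend them with
    neighbouring characters, then swap in sufficiently similar candidates."""
    corrected = []
    for pos, wrong in detect_errors(target_words):      # error word set + positions
        for fragment in extend(target_words, pos):      # word extension / truncation
            for cand in candidates.get(wrong, []):
                if similarity(fragment, cand) > threshold:
                    seq = list(target_words)
                    seq[pos] = cand                     # word replacement
                    corrected.append(seq)
    return corrected
```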
206. And performing pinyin conversion on the target word sequence through a preset deep learning network model to obtain a pinyin sequence.
Specifically, the server performs sequence truncation on a target word sequence according to a preset length through a coding layer in a preset deep learning network model to obtain a cut word sequence, wherein the deep learning network model comprises the coding layer and a decoding layer; splicing the cut word sequence based on the starting characters to obtain a processed sequence; and decoding the processed sequence through a decoding layer to obtain a pinyin sequence.
Specifically, the server performs parameter learning on the processed sequence through a training decoding layer in the decoding layers to obtain a parameter to be predicted, wherein the decoding layers comprise the training decoding layer and a prediction decoding layer; performing pinyin prediction on the processed sequence through a prediction decoding layer based on the parameter to be predicted to obtain a predicted sequence; and splicing and circularly decoding the predicted sequence based on the cut word sequence to obtain a pinyin sequence.
The server sequentially encodes each word of the target word sequence through the coding layer in the preset deep learning network model to obtain a coded word sequence, and performs sequence truncation on the coded word sequence according to a preset length to obtain the cut word sequence. The server then splices the cut word sequence with the start character (<go>) to obtain the processed sequence. The decoding layer comprises a training decoding layer and a prediction decoding layer that share parameters, i.e., the prediction decoding layer predicts using the parameters learned by the training decoding layer.
The server performs parameter learning on the processed sequence through the training decoding layer to obtain the parameter to be predicted, which is a pinyin prediction parameter, and performs pinyin prediction on the processed sequence through the prediction decoding layer based on this parameter to obtain a predicted sequence. The predicted sequence is then maximized based on maximum likelihood estimation to obtain a target sequence: specifically, the server performs joint probability calculation on the predicted sequence to obtain its joint probability, and maximizes that joint probability based on maximum likelihood estimation. Finally, the target sequence is spliced and cyclically decoded based on the cut word sequence to obtain the pinyin sequence: the server splices the target sequence output by the decoding layer at the previous moment with the cut word sequence to obtain a spliced sequence, uses the spliced sequence as the input of the decoding layer at the next moment, and repeats the above steps (sequence truncation through the coding layer, start-character splicing, and decoding through the decoding layer) until the decoding layer outputs an end symbol, at which point decoding stops and the pinyin sequence is obtained; the number of cycles is bounded by the preset maximum length of the pinyin sequence.
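Building on the earlier Seq2Seq sketch, the cyclic decoding just described might look like this; the <go>/<eos> token handling, the greedy argmax selection, and the maximum length are illustrative assumptions (cut_seq is assumed to be a (1, n) LongTensor of character ids):

```python
import torch

def decode_pinyin(model, cut_seq, go_id: int, eos_id: int, max_len: int = 64):
    """Start from <go>, re-feed each prediction spliced onto the sequence
    decoded so far, and stop at <eos> or the maximum pinyin length."""
    decoded = [go_id]
    for _ in range(max_len):                      # preset maximum pinyin length
        inp = torch.tensor([decoded])
        logits = model(cut_seq, inp)              # prediction decoding layer
        next_id = int(logits[0, -1].argmax())     # maximum-likelihood token
        if next_id == eos_id:                     # stop at the end symbol
            break
        decoded.append(next_id)
    return decoded[1:]
```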
207. And respectively carrying out sentence vector conversion on the text to be processed, the similar word sequence and the pinyin sequence to obtain a target text sentence vector, a target similar sentence vector and a target pinyin sentence vector.
Specifically, the server performs word vector conversion and weighted average on a text to be processed, a similar word sequence and a pinyin sequence respectively through a machine translation coding layer in the deep learning network model to obtain an initial text sentence vector, an initial similar sentence vector and an initial pinyin sentence vector; respectively calculating vector weights of the initial text sentence vector, the initial similar sentence vector and the initial pinyin sentence vector to obtain a text weight, a similar weight and a pinyin weight; and carrying out weighted summation on the initial text sentence vector through the text weight to obtain a target text sentence vector, carrying out weighted summation on the initial similar sentence vector through the similar weight to obtain a target similar sentence vector, and carrying out weighted summation on the initial pinyin sentence vector through the pinyin weight to obtain a target pinyin sentence vector.
The preset deep learning network model may be the deep learning network model BERT, which comprises 12 machine translation coding layers. Word vector conversion is performed on the text to be processed, the similar word sequence, and the pinyin sequence respectively through the 12 machine translation coding layers to obtain text word vectors, similar word vectors, and pinyin word vectors; weighted average calculation is then performed on the text word vectors, the similar word vectors, and the pinyin word vectors, i.e., the weighted arithmetic mean of all word vectors generated by the 12 machine translation coding layers is calculated, to obtain the initial text sentence vector, the initial similar sentence vector, and the initial pinyin sentence vector.
The server can perform this weighted average through a preset formula, for example V_k = (1/n) * sum_{i=1}^{n} h_{k,i}, where h_{k,i} represents the text word vector, similar word vector, or pinyin word vector of the i-th character output by the k-th machine translation coding layer, k = 1, 2, ..., 12, n represents the number of characters in the text to be processed, the similar word sequence, or the pinyin sequence, and V_k represents the initial text sentence vector, initial similar sentence vector, or initial pinyin sentence vector.
The server performs sentence similarity calculation against a preset corpus for the initial text sentence vector, the initial similar sentence vector, and the initial pinyin sentence vector respectively to obtain their similarity scores, i.e., cosine similarity is calculated between preset sentence vectors (sentences in the preset corpus) and each of the three initial sentence vectors. The three similarity scores are then normalized respectively to obtain the text weight, the similar weight, and the pinyin weight. Finally, the initial text sentence vector is weighted and summed with the text weight to obtain the target text sentence vector, the initial similar sentence vector with the similar weight to obtain the target similar sentence vector, and the initial pinyin sentence vector with the pinyin weight to obtain the target pinyin sentence vector.
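A NumPy sketch of step 207's per-layer averaging and similarity-based weighting; the array shapes are illustrative assumptions (layer_outputs holds the (12, n, d) hidden states of the 12 coding layers, corpus_vectors an (m, d) matrix of corpus sentence vectors), and aggregating the corpus similarity by max rather than mean is an assumption as well:

```python
import numpy as np

def sentence_vector(layer_outputs: np.ndarray, corpus_vectors: np.ndarray) -> np.ndarray:
    """Average over characters per layer (V_k), score each layer vector against
    the corpus by cosine similarity, normalize into weights, weighted-sum."""
    layer_vecs = layer_outputs.mean(axis=1)                     # V_k, shape (12, d)
    sims = np.array([
        np.max(corpus_vectors @ v / (np.linalg.norm(corpus_vectors, axis=1)
                                     * np.linalg.norm(v) + 1e-8))
        for v in layer_vecs
    ])
    weights = sims / sims.sum()                                 # normalization
    return (layer_vecs * weights[:, None]).sum(axis=0)          # target sentence vector
```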
208. And performing sensitive text classification on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector through a preset binary classification neural network model to obtain a target text.
Specifically, the server performs sensitive word probability calculation on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector respectively through a preset binary classification neural network model to obtain a text sensitive probability, a similar sensitive probability, and a pinyin sensitive probability, and then performs sensitive text judgment on the text to be processed according to these three probabilities to obtain the target text.
Based on a preset sensitive word bank, the server calculates the probability of sensitive words in the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector through the preset binary classification neural network model to obtain the text sensitive probability, the similar sensitive probability, and the pinyin sensitive probability; subtracts each of the three probabilities from the preset sensitive word probability threshold to obtain a text sensitive difference, a similar sensitive difference, and a pinyin sensitive difference; counts the differences to obtain statistical data indicating how many of the three differences are negative; and judges whether the statistical data is greater than or equal to 1. If so, the corresponding text to be processed is determined as a sensitive text, otherwise as a non-sensitive text, and the sensitive text and the non-sensitive text are determined as target texts.
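The decision rule just described reduces to a few lines; the threshold value is an illustrative assumption:

```python
def judge_sensitive(p_text: float, p_similar: float, p_pinyin: float,
                    threshold: float = 0.5) -> str:
    """Subtract each probability from the threshold and count negative
    differences; one or more negatives means the text is sensitive."""
    diffs = [threshold - p for p in (p_text, p_similar, p_pinyin)]
    negatives = sum(1 for d in diffs if d < 0)
    return "sensitive" if negatives >= 1 else "non-sensitive"
```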
In the embodiment of the invention, sensitive text classification is performed on the text to be processed by combining the target word sequence, the similar word sequence, and the pinyin sequence; the scheme has higher extensibility and can identify both sensitive words and texts in which keywords are replaced by pinyin, homophones, or visually similar characters, thereby improving the accuracy of sensitive text recognition.
Having described the text recognition method in the embodiment of the present invention above, the text recognition apparatus in the embodiment of the present invention is described below. Referring to fig. 3, an embodiment of the text recognition apparatus in the embodiment of the present invention includes:
the first conversion module 301 is configured to obtain a text to be processed, and perform maximum entropy-based word sequence conversion on the text to be processed to obtain a target word sequence;
the second conversion module 302 is configured to perform similar word replacement based on the similar word candidate set on the target word sequence to obtain a similar word sequence;
the third conversion module 303 is configured to perform pinyin conversion on the target word sequence through a preset deep learning network model to obtain a pinyin sequence;
a fourth conversion module 304, configured to perform sentence vector conversion on the text to be processed, the similar word sequence, and the pinyin sequence, respectively, to obtain a target text sentence vector, a target similar sentence vector, and a target pinyin sentence vector;
the classification module 305 is configured to perform sensitive text classification on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector through a preset binary classification neural network model to obtain a target text.
The function implementation of each module in the text recognition device corresponds to each step in the text recognition method embodiment, and the function and implementation process thereof are not described in detail herein.
In the embodiment of the invention, sensitive text classification is performed on the text to be processed by combining the target word sequence, the similar word sequence, and the pinyin sequence; the scheme has higher extensibility and can identify both sensitive words and texts in which keywords are replaced by pinyin, homophones, or visually similar characters, thereby improving the accuracy of sensitive text recognition.
Referring to fig. 4, another embodiment of the text recognition apparatus according to the embodiment of the present invention includes:
the first conversion module 301 is configured to obtain a text to be processed, and perform maximum entropy-based word sequence conversion on the text to be processed to obtain a target word sequence;
the first conversion module 301 specifically includes:
the word segmentation unit 3011 is configured to obtain a text to be processed, perform text preprocessing and word segmentation on the text to be processed, and obtain initial segmented words;
a recombination unit 3012, configured to perform ambiguous word extraction and segmentation recombination on the initial segmented words to obtain recombined segmented words;
the screening unit 3013 is configured to perform probability screening on the recombined segmented words through a preset maximum entropy model to obtain effective segmented words;
a replacing unit 3014, configured to replace the corresponding initial segmented words with the effective segmented words to obtain a target word sequence;
the second conversion module 302 is configured to perform similar word replacement based on the similar word candidate set on the target word sequence to obtain a similar word sequence;
the third conversion module 303 is configured to perform pinyin conversion on the target word sequence through a preset deep learning network model to obtain a pinyin sequence;
a fourth conversion module 304, configured to perform sentence vector conversion on the text to be processed, the similar word sequence, and the pinyin sequence, respectively, to obtain a target text sentence vector, a target similar sentence vector, and a target pinyin sentence vector;
the classification module 305 is configured to perform sensitive text classification on the target text sentence vector, the target similar sentence vector, and the target pinyin sentence vector through a preset binary classification neural network model to obtain a target text.
Optionally, the second conversion module 302 may be further specifically configured to:
carrying out context-based error word detection and word position identification on the target word sequence to obtain an error word set and target position data, wherein the target position data is position data of each error word in the error word set;
carrying out word extension and sequence truncation on the wrong word set through the target position data to obtain a word set to be corrected;
and acquiring a similar candidate set corresponding to the target word sequence, and performing comparative analysis and word replacement on the word set to be corrected and the similar candidate set to obtain a similar word sequence.
Optionally, the third converting module 303 includes:
a truncation unit 3031, configured to perform sequence truncation on the target word sequence according to a preset length through a coding layer in a preset deep learning network model, to obtain a cut word sequence, where the deep learning network model includes the coding layer and a decoding layer;
a concatenation unit 3032, configured to perform concatenation processing based on the starting character on the sequence of cut words, so as to obtain a processed sequence;
a decoding unit 3033, configured to decode the processed sequence through the decoding layer to obtain a pinyin sequence.
Optionally, the decoding unit 3033 may be further specifically configured to:
performing parameter learning on the processed sequence through a training decoding layer in the decoding layer to obtain a parameter to be predicted, wherein the decoding layer comprises the training decoding layer and a prediction decoding layer;
performing pinyin prediction on the processed sequence through a prediction decoding layer based on the parameter to be predicted to obtain a predicted sequence;
and splicing and circularly decoding the predicted sequence based on the cut word sequence to obtain a pinyin sequence.
Optionally, the fourth conversion module 304 may be further specifically configured to:
respectively carrying out word vector conversion and weighted average on a text to be processed, a similar word sequence and a pinyin sequence through a machine translation coding layer in a deep learning network model to obtain an initial text sentence vector, an initial similar sentence vector and an initial pinyin sentence vector;
respectively calculating vector weights of the initial text sentence vector, the initial similar sentence vector and the initial pinyin sentence vector to obtain a text weight, a similar weight and a pinyin weight;
and carrying out weighted summation on the initial text sentence vector through the text weight to obtain a target text sentence vector, carrying out weighted summation on the initial similar sentence vector through the similar weight to obtain a target similar sentence vector, and carrying out weighted summation on the initial pinyin sentence vector through the pinyin weight to obtain a target pinyin sentence vector.
Optionally, the classification module 305 may be further specifically configured to:
sensitive word probability calculation is respectively carried out on the target text sentence vector, the target similar sentence vector and the target pinyin sentence vector through a preset two-classification neural network model to obtain a text sensitive probability, a similar sensitive probability and a pinyin sensitive probability;
and judging the sensitive text of the text to be processed according to the text sensitive probability, the similar sensitive probability and the pinyin sensitive probability to obtain the target text.
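A hedged sketch of this final step, assuming the two-classification model reduces each sentence vector to a single sigmoid probability and that the three probabilities are combined by an any-channel threshold; the patent does not fix this decision rule, so both the weights and the rule are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sensitive_probability(sent_vec, W, b):
    """Reduce one sentence vector to a sensitive-class probability."""
    return float(sigmoid(sent_vec @ W + b))

def judge_sensitive(text_vec, similar_vec, pinyin_vec, W, b, threshold=0.5):
    """Flag the text if any of the three channels crosses the threshold --
    one plausible rule, since a disguised key word may surface in only one
    channel (original text, similar words, or pinyin)."""
    probs = [sensitive_probability(v, W, b)
             for v in (text_vec, similar_vec, pinyin_vec)]
    return max(probs) >= threshold, probs

rng = np.random.default_rng(1)
W, b = rng.standard_normal(8), 0.0
flagged, probs = judge_sensitive(rng.standard_normal(8),
                                 rng.standard_normal(8),
                                 rng.standard_normal(8), W, b)
print(flagged, [round(p, 3) for p in probs])
```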
The functions and implementation processes of the modules and units in the text recognition apparatus correspond to the steps in the text recognition method embodiments described above and are therefore not repeated here.
In the embodiment of the invention, sensitive text classification is performed on the text to be processed by combining the target word sequence, the similar word sequence, and the pinyin sequence. This makes the scheme highly extensible: it can identify not only sensitive words themselves, but also text in which key words have been replaced with pinyin or with homophonic or visually similar characters, thereby improving the accuracy of sensitive text recognition.
Figs. 3 and 4 above describe the text recognition apparatus in the embodiment of the present invention in detail from the perspective of modular functional entities; the text recognition device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a text recognition device 500 according to an embodiment of the present invention. The text recognition device 500 may vary considerably in configuration and performance, and may include one or more central processing units (CPUs) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 533 or data 532. The memory 520 and the storage media 530 may be transient or persistent storage. A program stored on the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the text recognition device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the text recognition device 500.
The present application also provides a text recognition device, including a memory and at least one processor, the memory having instructions stored therein and being interconnected with the at least one processor by a line; the at least one processor invokes the instructions in the memory to cause the text recognition device to perform the steps of the text recognition method described above.

The present invention also provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium, having instructions stored thereon; when the instructions are run on a computer, they cause the computer to perform the steps of the text recognition method.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain (Blockchain) is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A text recognition method, characterized in that the text recognition method comprises:
acquiring a text to be processed, and performing maximum entropy-based word sequence conversion on the text to be processed to obtain a target word sequence;
performing similar word replacement based on a similar word candidate set on the target word sequence to obtain a similar word sequence;
performing pinyin conversion on the target word sequence through a preset deep learning network model to obtain a pinyin sequence;
sentence vector conversion is respectively carried out on the text to be processed, the similar word sequence and the pinyin sequence to obtain a target text sentence vector, a target similar sentence vector and a target pinyin sentence vector;
and classifying the sensitive texts of the target text sentence vector, the target similar sentence vector and the target pinyin sentence vector through a preset two-classification neural network model to obtain a target text.
2. The text recognition method of claim 1, wherein the obtaining of the text to be processed and the maximum entropy-based word sequence conversion of the text to be processed to obtain the target word sequence comprises:
acquiring a text to be processed, and performing text preprocessing and word segmentation processing on the text to be processed to obtain initial word segmentation;
performing ambiguous word extraction and word segmentation recombination on the initial word segmentation to obtain recombined word segmentation;
carrying out probability screening on the recombined segmented words through a preset maximum entropy model to obtain effective segmented words;
and performing word segmentation replacement on the initial word segmentation through the effective word segmentation to obtain a target word sequence.
3. The text recognition method of claim 1, wherein the performing similar word replacement based on a similar word candidate set on the target word sequence to obtain a similar word sequence comprises:
carrying out context-based error word detection and word position recognition on the target word sequence to obtain an error word set and target position data, wherein the target position data is position data of each error word in the error word set;
performing word extension and sequence truncation on the error word set through the target position data to obtain a word set to be corrected;
and acquiring a similar candidate set corresponding to the target word sequence, and performing comparative analysis and word replacement on the word set to be corrected and the similar candidate set to obtain a similar word sequence.
4. The method of claim 1, wherein the performing pinyin conversion on the target word sequence through a preset deep learning network model to obtain a pinyin sequence comprises:
performing sequence truncation on the target word sequence according to a preset length through a coding layer in a preset deep learning network model to obtain a cut word sequence, wherein the deep learning network model comprises the coding layer and a decoding layer;
splicing the cut word sequence based on the starting characters to obtain a processed sequence;
and decoding the processed sequence through the decoding layer to obtain a pinyin sequence.
5. The text recognition method of claim 4, wherein the decoding the processed sequence through the decoding layer to obtain a pinyin sequence comprises:
performing parameter learning on the processed sequence through a training decoding layer in the decoding layers to obtain a parameter to be predicted, wherein the decoding layers comprise the training decoding layer and a prediction decoding layer;
performing pinyin prediction on the processed sequence through the prediction decoding layer based on the parameter to be predicted to obtain a predicted sequence;
and splicing and circularly decoding the prediction sequence based on the cut word sequence to obtain a pinyin sequence.
6. The text recognition method of claim 1, wherein the performing sentence vector conversion on the text to be processed, the similar word sequence, and the pinyin sequence to obtain a target text sentence vector, a target similar sentence vector, and a target pinyin sentence vector comprises:
respectively performing word vector conversion and weighted average on the text to be processed, the similar word sequence and the pinyin sequence through a machine translation coding layer in the deep learning network model to obtain an initial text sentence vector, an initial similar sentence vector and an initial pinyin sentence vector;
respectively calculating the vector weights of the initial text sentence vector, the initial similar sentence vector and the initial pinyin sentence vector to obtain a text weight, a similar weight and a pinyin weight;
and carrying out weighted summation on the initial text sentence vector through the text weight to obtain a target text sentence vector, carrying out weighted summation on the initial similar sentence vector through the similar weight to obtain a target similar sentence vector, and carrying out weighted summation on the initial pinyin sentence vector through the pinyin weight to obtain a target pinyin sentence vector.
7. The text recognition method of any one of claims 1-6, wherein the classifying the target text sentence vector, the target similar sentence vector and the target pinyin sentence vector by a preset two-classification neural network model to obtain a target text comprises:
sensitive word probability calculation is respectively carried out on the target text sentence vector, the target similar sentence vector and the target pinyin sentence vector through a preset two-classification neural network model, and text sensitive probability, similar sensitive probability and pinyin sensitive probability are obtained;
and judging the sensitive text of the text to be processed according to the text sensitive probability, the similar sensitive probability and the pinyin sensitive probability to obtain a target text.
8. A text recognition apparatus, characterized in that the text recognition apparatus comprises:
the first conversion module is used for acquiring a text to be processed, and performing maximum entropy-based word sequence conversion on the text to be processed to obtain a target word sequence;
the second conversion module is used for carrying out similar word replacement based on a similar word candidate set on the target word sequence to obtain a similar word sequence;
the third conversion module is used for performing pinyin conversion on the target word sequence through a preset deep learning network model to obtain a pinyin sequence;
a fourth conversion module, configured to perform sentence vector conversion on the text to be processed, the similar word sequence, and the pinyin sequence, respectively, to obtain a target text sentence vector, a target similar sentence vector, and a target pinyin sentence vector;
and the classification module is used for classifying the sensitive texts of the target text sentence vector, the target similar sentence vector and the target pinyin sentence vector through a preset two-classification neural network model to obtain a target text.
9. A text recognition apparatus characterized by comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invokes the instructions in the memory to cause the text recognition device to perform the text recognition method of any of claims 1-7.
10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the text recognition method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110720540.7A CN113449510B (en) | 2021-06-28 | 2021-06-28 | Text recognition method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113449510A (en) | 2021-09-28
CN113449510B CN113449510B (en) | 2022-12-27 |
Family
ID=77813467
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110720540.7A Active CN113449510B (en) | 2021-06-28 | 2021-06-28 | Text recognition method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113449510B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020076109A1 (en) * | 1999-01-25 | 2002-06-20 | Andy Hertzfeld | Method and apparatus for context sensitive text recognition |
CN106227719A (en) * | 2016-07-26 | 2016-12-14 | 北京智能管家科技有限公司 | Chinese word segmentation disambiguation method and system |
CN109670041A (en) * | 2018-11-29 | 2019-04-23 | 天格科技(杭州)有限公司 | A kind of band based on binary channels text convolutional neural networks is made an uproar illegal short text recognition methods |
CN109858039A (en) * | 2019-03-01 | 2019-06-07 | 北京奇艺世纪科技有限公司 | A kind of text information identification method and identification device |
CN111368535A (en) * | 2018-12-26 | 2020-07-03 | 珠海金山网络游戏科技有限公司 | Sensitive word recognition method, device and equipment |
CN111460793A (en) * | 2020-03-10 | 2020-07-28 | 平安科技(深圳)有限公司 | Error correction method, device, equipment and storage medium |
CN112016310A (en) * | 2020-09-03 | 2020-12-01 | 平安科技(深圳)有限公司 | Text error correction method, system, device and readable storage medium |
CN112036273A (en) * | 2020-08-19 | 2020-12-04 | 泰康保险集团股份有限公司 | Image identification method and device |
CN112036167A (en) * | 2020-08-25 | 2020-12-04 | 腾讯科技(深圳)有限公司 | Data processing method, device, server and storage medium |
CN112084337A (en) * | 2020-09-17 | 2020-12-15 | 腾讯科技(深圳)有限公司 | Training method of text classification model, and text classification method and equipment |
US10878124B1 (en) * | 2017-12-06 | 2020-12-29 | Dataguise, Inc. | Systems and methods for detecting sensitive information using pattern recognition |
US20210124800A1 (en) * | 2019-10-28 | 2021-04-29 | Paypal, Inc. | Systems and methods for automatically scrubbing sensitive data |
Also Published As
Publication number | Publication date |
---|---|
CN113449510B (en) | 2022-12-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||