WO2022166808A1

WO2022166808A1 - Text restoration method and apparatus, and electronic device

Info

Publication number: WO2022166808A1
Application number: PCT/CN2022/074583
Authority: WO
Inventors: 佟禹
Original assignee: 维沃移动通信有限公司
Priority date: 2021-02-04
Filing date: 2022-01-28
Publication date: 2022-08-11
Also published as: CN112949261A

Abstract

The present application relates to the technical field of language recognition and discloses a text restoration method and apparatus, and an electronic device. The text restoration method comprises: obtaining first candidate words and second candidate words according to a first character group; determining a first confusion degree and a second confusion degree, the first confusion degree being the confusion degree corresponding to a first sentence obtained by replacing the first character group and a second character group in a target sentence with the first candidate words, and the second confusion degree being the confusion degree corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate words; if the first confusion degree is smaller than the second confusion degree, obtaining restored target text according to the first candidate words; or if the second confusion degree is smaller than the first confusion degree, obtaining restored target text according to the second candidate words. The present method is applied in text restoration scenarios.

Description

Text restoration method, device and electronic device

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese Patent Application No. 202110158872.0 filed in China on Feb. 04, 2021, the entire contents of which are incorporated herein by reference.

Technical field

The present application belongs to the technical field of language recognition, and in particular relates to a text restoration method, device and electronic device.

Background art

When text is edited, a Western character group (such as an English character group) at the end of a line may not fit on that line in full. In that case, the character group can be broken at the automatic line-wrap position and a separator added at the break, as at mark 1, mark 2, mark 3, mark 4, mark 5, and mark 6 in Figure 1.

Currently, if the above text is copied into another file, the character groups can be automatically restored based on these separators. Specifically, the separator at the end of a text line is removed directly, so that the character groups before and after the separator form a single character group, which is displayed in the copied text. For example, the text shown in Figure 2 is the result of copying the text shown in Figure 1.

However, some character groups are compound words, that is, the character group itself includes a separator. Directly removing the separator can therefore produce a wrong character group in the restored text, for example the character groups at mark 3, mark 5, and mark 6 in Figure 2. How to restore the text accurately has thus become an urgent problem to be solved.

SUMMARY OF THE INVENTION

The purpose of the embodiments of the present application is to provide a text restoration method, device and electronic device, which can solve the problem of inaccurate text restoration by existing electronic devices.

In order to solve the above technical problems, this application is implemented as follows:

In a first aspect, an embodiment of the present application provides a text restoration method, the method comprising: obtaining a first candidate word and a second candidate word according to a first character group, where the first character group is the character group that is located at the end of the Nth line in the target text to be restored and ends with a separator, the first candidate word is a word obtained by combining the first character group and a second character group, the second candidate word is a word obtained by combining a third character group and the second character group, the second character group is the first character group of the (N+1)th line in the target text to be restored, and the third character group is the character group obtained by removing the separator from the first character group; determining a first perplexity and a second perplexity (also referred to below as the degree of confusion), where the first perplexity is the perplexity corresponding to a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word; and, when the first perplexity is less than the second perplexity, obtaining the restored target text according to the first candidate word, or, when the second perplexity is less than the first perplexity, obtaining the restored target text according to the second candidate word.

In a second aspect, an embodiment of the present application provides a text restoration device, where the text restoration device includes an acquisition module, a determination module, and a restoration module. The acquisition module is configured to acquire a first candidate word and a second candidate word according to a first character group, where the first character group is the character group that is located at the end of the Nth line in the target text to be restored and ends with a separator, the first candidate word is a word obtained by combining the first character group and a second character group, the second candidate word is a word obtained by combining a third character group and the second character group, the second character group is the first character group of the (N+1)th line in the target text to be restored, and the third character group is the character group obtained by removing the separator from the first character group. The determination module is configured to determine a first perplexity and a second perplexity, where the first perplexity is the perplexity corresponding to a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word. The restoration module is configured to obtain the restored target text according to the first candidate word when the first perplexity is less than the second perplexity, or to obtain the restored target text according to the second candidate word when the second perplexity is less than the first perplexity.

In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor, a memory, and a program or instruction stored in the memory and executable on the processor. When the program or instruction is executed by the processor, the steps of the text restoration method according to the first aspect are implemented.

In a fourth aspect, an embodiment of the present application provides a readable storage medium on which a program or instruction is stored. When the program or instruction is executed by a processor, the steps of the text restoration method according to the first aspect are implemented.

In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or instruction to implement the steps of the text restoration method according to the first aspect.

In this embodiment of the present application, a first candidate word and a second candidate word may be obtained according to a first character group, where the first character group is the character group that is located at the end of the Nth line in the target text to be restored and ends with a separator, the first candidate word is a word obtained by combining the first character group and a second character group, the second candidate word is a word obtained by combining a third character group and the second character group, the second character group is the first character group of the (N+1)th line in the target text to be restored, and the third character group is the character group obtained by removing the separator from the first character group. A first perplexity and a second perplexity are then determined, where the first perplexity is the perplexity corresponding to a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word. When the first perplexity is less than the second perplexity, the restored target text is obtained according to the first candidate word; when the second perplexity is less than the first perplexity, the restored target text is obtained according to the second candidate word. In this solution, the smaller the perplexity corresponding to a sentence, the more fluent, and hence the more accurate, the sentence is. Therefore, by comparing the perplexity corresponding to the first sentence obtained according to the first candidate word with the perplexity corresponding to the second sentence obtained according to the second candidate word, it can be determined which of the first candidate word and the second candidate word is correct, that is, the correct word formed by the first character group and the second character group in the target text can be determined, so that the text can be accurately restored.

Description of drawings

FIG. 1 is a schematic diagram of a text to be restored provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a restored text provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of a text restoration method provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a text restoration device provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;

FIG. 6 is a schematic hardware diagram of an electronic device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.

The terms "first", "second" and the like in the description and claims of the present application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the present application can be practiced in sequences other than those illustrated or described herein. Objects distinguished by "first", "second", and the like are usually of one type, and the number of objects is not limited; for example, there may be one or more first objects. In addition, "and/or" in the description and claims indicates at least one of the connected objects, and the character "/" generally indicates that the associated objects are in an "or" relationship.

In the embodiments of the present application, words such as "exemplary" or "for example" are used to represent examples, illustrations, or explanations. Any embodiment or design described in the embodiments of the present application as "exemplary" or "such as" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present the related concepts in a specific manner.

The text restoration method provided by the embodiments of the present application will be described in detail below through specific embodiments and application scenarios with reference to the accompanying drawings.

As shown in FIG. 3, an embodiment of the present application provides a text restoration method, and the method includes the following steps 201 to 204, or steps 201 to 203 and step 205.

It should be noted that the execution body of the text restoration method provided by the embodiment of the present application may be a text restoration apparatus, or a control module in the text restoration apparatus for executing the text restoration method, or an electronic device. The text restoration method provided by the embodiments of the present application will be exemplarily described below by taking a text restoration apparatus as an example.

Optionally, in the embodiment of the present application, when the execution body of the text restoration method provided by the embodiment of the present application is an electronic device, the electronic device may include the text restoration apparatus provided in the embodiment of the present application, or externally connect the text restoration apparatus. Specifically, it can be determined according to actual use requirements, and is not limited in the embodiments of the present application.

Step 201, the electronic device obtains the first candidate word and the second candidate word according to the first character group.

The above-mentioned first character group may be the character group that is located at the end of the Nth line in the target text to be restored and ends with a separator. The first candidate word is a word obtained by combining the first character group and the second character group, and the second candidate word is a word obtained by combining the third character group and the second character group. The second character group is the first character group of the (N+1)th line in the target text to be restored, the third character group is the character group obtained by removing the separator from the first character group, and N is a positive integer.

In this embodiment of the present application, after the electronic device acquires the target text to be restored, the electronic device may acquire the first candidate word and the second candidate word according to the first character group, so that the target text can be restored using whichever of the two candidate words is correct.

Optionally, the text restoration method provided in this embodiment of the present application may be applied to the following two possible scenarios:

Scenario 1: The electronic device copies the target text from one location to another, for example, copying the target text from one document to another document.

Scenario 2: The target text is the text in the target image, and the electronic device recognizes the text in the target image through optical character recognition (OCR) technology.

Optionally, in the above-mentioned second scenario, the text in the target image may be typeset horizontally or vertically. When the text in the target image is typeset vertically, the above-mentioned first character group may be the character group that is located at the end of the Mth column in the target text to be restored and ends with a separator, and the second character group may be the first character group of the (M+1)th column in the target text to be restored, where M is a positive integer.

Of course, in actual implementation, the text restoration method provided by the embodiments of the present application may also be applied to any other possible scenarios, which may be determined according to actual usage requirements, which are not limited in the embodiments of the present application.

Optionally, the character group involved in the embodiment of the present application may be a Western character group, such as an English character group, a French character group, a German character group, a Russian character group, or a Portuguese character group. This may be determined according to actual use requirements and is not limited in the embodiments of the present application. The embodiments of the present application take an English character group as an example.

In this embodiment of the present application, after the electronic device obtains the target text to be restored, the electronic device can check line by line whether each line of text in the target text ends with a separator or a specific separator (such as "-"). If so, the electronic device may take the character group that includes the separator as the above-mentioned first character group. If not, the electronic device can proceed to check the next line of text.

Optionally, in this embodiment of the present application, the manner in which the electronic device obtains the first candidate word and the second candidate word may be:

Step 1. The electronic device groups the character group before the separator at the end of the current line (that is, the above-mentioned third character group), the separator (such as "-"), and the first character group of the line following the current line into one unit (hereinafter referred to as a candidate set to be processed).

Exemplarily, taking the text shown in FIG. 1 as an example, the electronic device can obtain the candidate set to be processed {representa,-,tion} from the first line, {repre,-,sentation} from the fourth line, {pre,-,train} from the sixth line, {re,-,sult} from the ninth line, {fine,-,tuned} from the tenth line, and {task,-,specific} from the fourteenth line.

Step 2. For each candidate set to be processed, the electronic device combines the character groups before and after the separator to obtain candidate words, such as {representation}, {representation}, {pretrain}, {result}, {finetuned}, and {taskspecific}; that is, the above-mentioned first candidate word is obtained.

Step 3. For each candidate set to be processed, the electronic device generates compound words that retain the separator, such as {representa-tion}, {repre-sentation}, {pre-train}, {re-sult}, {fine-tuned}, and {task-specific}; that is, the above-mentioned second candidate word is obtained.

It can be understood that, in the embodiment of the present application, the second candidate word is a compound word. The electronic device can therefore check both the word formed by joining the character groups before and after the separator and the compound word that retains the separator, so as to ensure the accuracy of the restored target text.
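Steps 1 to 3 above can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names `find_candidate_sets` and `make_candidates`, the constant `SEPARATOR`, and the sample lines are assumptions introduced here, and character groups are assumed to be delimited by whitespace.

```python
from typing import List, Tuple

SEPARATOR = "-"  # assumed line-break separator, as in the "-" example above


def find_candidate_sets(lines: List[str]) -> List[Tuple[int, str, str]]:
    """Return (line_index, head, tail) for every line that ends with the separator.

    head is the character group before the separator (the third character group);
    tail is the first character group of the following line (the second character group).
    """
    found = []
    for i in range(len(lines) - 1):
        current = lines[i].rstrip()
        following = lines[i + 1].split()
        if not current.endswith(SEPARATOR) or not following:
            continue  # this line does not end with a separator: check the next one
        head = current.split()[-1][: -len(SEPARATOR)]
        found.append((i, head, following[0]))
    return found


def make_candidates(head: str, tail: str) -> Tuple[str, str]:
    """Return the joined word and the compound word that keeps the separator."""
    return head + tail, head + SEPARATOR + tail


# Hypothetical sample text, loosely modelled on the FIG. 1 examples.
lines = [
    "We introduce a new language representa-",
    "tion model called BERT, which is fine-",
    "tuned for each task.",
]
for index, head, tail in find_candidate_sets(lines):
    print(index, make_candidates(head, tail))
```

Both candidates are kept for every line-end separator, so the choice between them can be deferred to the scoring in step 202.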

Step 202, the electronic device determines a first degree of confusion and a second degree of confusion.

The first degree of confusion may be the degree of confusion corresponding to the first sentence obtained by replacing the first character group and the second character group in the target sentence with the first candidate word, and the second degree of confusion may be the degree of confusion corresponding to the second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word.

In this embodiment of the present application, after the electronic device acquires the first candidate word and the second candidate word, the electronic device can determine the first degree of confusion and the second degree of confusion, so that the correct word can be determined from the first candidate word and the second candidate word and used to restore the above target text.

Optionally, in this embodiment of the present application, for the above step 202, the electronic device may perform the following steps 202a and 202b on each of the first candidate word and the second candidate word, so as to determine the first degree of confusion and the second degree of confusion.

It can be understood that the following steps 202a and 202b are exemplified by one candidate word (for example, the target candidate word in the embodiment of the present application) among the above-mentioned first candidate word and second candidate word.

Step 202a, the electronic device determines the target parameter based on the probability that each character in the target candidate word appears in the target text.

The target candidate word may be the first candidate word or the second candidate word.

Step 202b, the electronic device determines the confusion degree corresponding to the target candidate word according to the target parameter.

The above-mentioned target parameters may include: the validity value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence. The target phrase may include the target candidate word, a fourth character group, and a fifth character group, where the fourth character group may be the character group located before the first character group in the target text, and the fifth character group may be the character group located after the second character group in the target text.

In this embodiment of the present application, the electronic device may determine the validity value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence based on the probability that each character in the target candidate word (the first candidate word or the second candidate word) appears in the target text, thereby obtaining the above target parameters. The electronic device can then determine the degree of confusion corresponding to the target candidate word (for example, the above-mentioned first degree of confusion or second degree of confusion) according to the target parameters.

In this embodiment of the present application, the electronic device can input the target candidate word (the first candidate word or the second candidate word) and the target text into a language model, and the language model can then calculate the validity value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence, thereby obtaining the above target parameters.

In the embodiment of the present application, the validity value of the above target candidate word may be the probability that the target candidate word appears in the target text (denoted as Score_1).

Optionally, in this embodiment of the present application, the validity value of the target candidate word may be the product of the probabilities that each character in the target candidate word appears in the target text.

The probability that the Kth character in the target candidate word appears in the target text refers to the probability that the Kth character appears given that a sixth character group appears in the target text, where the sixth character group consists of the first character to the (K-1)th character of the target candidate word, and K is an integer greater than 1.

It should be noted that "another character (denoted as B) appears given that a certain character group or character (denoted as A) appears", as used in the embodiments of the present application, means that B is located after A in the text, with no separator between A and B.

Specifically, the validity value of the target candidate word can be expressed as:

$$P(W) = p(C_1) \times p(C_2 \mid C_1) \times \cdots \times p(C_K \mid C_1, C_2, \ldots, C_{K-1})$$

Here, $P(W)$ represents the validity value of the target candidate word, $p(C_1)$ represents the probability that the first character in the target candidate word appears in the target text, and $p(C_K \mid C_1, C_2, \ldots, C_{K-1})$ represents the probability that the Kth character appears given that the sixth character group appears in the target text, where the sixth character group consists of the 1st character to the (K-1)th character of the target candidate word.

Exemplarily, the judgment is made by a language model, as expressed in formulas (1) to (4) below. In formula (1), W represents a candidate word, $C_1$ represents the first character in the candidate word, and $C_K$ represents the last character in the candidate word. Whether W is a valid word is determined by calculating the probability that the candidate word W is composed of the characters $C_1$ to $C_K$. The probability of the word is calculated by formula (2), where $p(C_1)$ represents the probability that the character $C_1$ appears in the target text and is calculated by formula (3). Exemplarily, if $C_1$ represents the character "r", the total number of characters in the target text is 100, and the character "r" appears 10 times, then the probability of "r" appearing is 10/100 = 0.1, that is, $p(C_1) = 0.1$.

In formula (4), $p(C_2 \mid C_1)$ indicates that the occurrence of $C_2$ depends on $C_1$, that is, it is the probability that $C_2$ appears given that $C_1$ appears. Exemplarily, if $C_1$ represents the character "w" and $C_2$ represents the character "e", then the probability that the character "e" appears given that the character "w" appears is $p(e \mid w) = p(we) / p(w)$.

$$W = C_1, C_2, C_3, \ldots, C_K \tag{1}$$

$$P(W) = P(C_1, C_2, C_3, \ldots, C_K) = p(C_1) \times p(C_2 \mid C_1) \times \cdots \times p(C_K \mid C_1, C_2, \ldots, C_{K-1}) \tag{2}$$

$$p(C_k) = \frac{\text{number of times the character } C_k \text{ appears}}{\text{total number of characters in the document}} \tag{3}$$

$$p(C_2 \mid C_1) = \frac{p(C_1 C_2)}{p(C_1)} \tag{4}$$
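The chain-rule calculation in formulas (2) to (4) can be illustrated with a small count-based sketch. The function names `count_occurrences` and `validity` and the sample text are assumptions made for illustration only. The conditional probability is approximated here as the count of the extended prefix divided by the count of the prefix, which is a simplification of formula (4); the language model of the embodiment is not specified and may estimate these probabilities differently.

```python
def count_occurrences(text: str, piece: str) -> int:
    """Count (possibly overlapping) occurrences of piece in text."""
    return sum(
        1 for i in range(len(text) - len(piece) + 1) if text.startswith(piece, i)
    )


def validity(word: str, text: str) -> float:
    """Chain-rule product p(C1) * p(C2|C1) * ... estimated from counts in text."""
    if not word or not text:
        return 0.0
    # Formula (3): relative frequency of the first character.
    prob = count_occurrences(text, word[0]) / len(text)
    for k in range(2, len(word) + 1):
        prefix_count = count_occurrences(text, word[: k - 1])
        if prefix_count == 0:
            return 0.0  # the prefix never occurs, so the word cannot occur
        # Formula (4), approximated as count(prefix + character) / count(prefix).
        prob *= count_occurrences(text, word[:k]) / prefix_count
    return prob


sample_text = "we were wet"  # hypothetical target text
print(validity("we", sample_text))
print(validity("wx", sample_text))
```

In the sample text, "w" occurs 3 times among 11 characters and is always followed by "e", so the value for "we" is 3/11, while "wx" never occurs and yields 0.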

In the embodiment of the present application, the fluency value of the target phrase may be the probability that the phrase composed of the target candidate word, the fourth character group and the fifth character group appears in the target text (referred to as Score_2).

In the embodiment of the present application, the fluency value of the target phrase can be calculated according to the following formula (5).

Here, S represents a sentence or phrase consisting of the words $W_1, \ldots, W_N$. Generally, the lower the perplexity, the more fluent the sentence or phrase.
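Formula (5) itself is not reproduced in this text. Assuming it is the standard perplexity of an N-word sequence, which is consistent with the description of S and $W_1, \ldots, W_N$ above, it could be written as follows; the symbol $PP$ is a notation chosen here, not taken from the application.

```latex
PP(S) = P(W_1 W_2 \cdots W_N)^{-\frac{1}{N}}
      = \sqrt[N]{\frac{1}{P(W_1 W_2 \cdots W_N)}}
```

Under this definition, a sequence to which the language model assigns a higher probability has a lower perplexity, which matches the statement that lower perplexity means a more fluent sentence or phrase.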

Exemplarily, as shown at mark 1 in Figure 1, assuming that the above target candidate word is "representation", the word before "representa-" is "language", and the word after "tion" is "model", the fluency value of the phrase "language representation model" can be obtained by formula (5).

In the embodiment of the present application, the fluency value of the target sentence may be the probability of the target sentence appearing in the target text (referred to as Score_3).

In the embodiment of the present application, the fluency value of the above target sentence may be calculated according to the above formula (5).
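As a rough illustration of how such a fluency value might be computed, the sketch below evaluates the perplexity of a word sequence under a toy unigram model with add-one smoothing. The unigram model, the smoothing, the function name `perplexity`, and the sample corpus are all assumptions made for this sketch; the application does not specify which language model is used.

```python
import math
from collections import Counter
from typing import List


def perplexity(sentence: List[str], corpus: List[str]) -> float:
    """Perplexity of sentence under an add-one-smoothed unigram model of corpus."""
    if not sentence:
        raise ValueError("sentence must contain at least one word")
    counts = Counter(corpus)
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    total = len(corpus)
    log_prob = 0.0
    for word in sentence:
        # Smoothed unigram probability; unseen words get a small non-zero mass.
        log_prob += math.log((counts[word] + 1) / (total + vocab_size))
    # Perplexity is the inverse probability normalised by the sequence length.
    return math.exp(-log_prob / len(sentence))


corpus = "we introduce a new language representation model".split()
joined = ["language", "representation", "model"]
hyphenated = ["language", "representa-tion", "model"]
print(perplexity(joined, corpus) < perplexity(hyphenated, corpus))
```

With this toy corpus the joined form is assigned the lower perplexity because "representation" has been seen and "representa-tion" has not; a real language model would draw on far more context.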

Exemplarily, as shown at mark 1 in Figure 1, assuming that the above target candidate word is "representation" and the sentence containing "representa-" is "We introduce a new language representation model called BERT", the fluency value of this sentence can be obtained by formula (5).

Optionally, in this embodiment of the present application, the foregoing step 202b may be specifically implemented by the following step 202b1.

Step 202b1, the electronic device obtains the perplexity corresponding to the target candidate word according to the product of the validity value of the target candidate word and a first coefficient, the product of the fluency value of the target phrase and a second coefficient, and the product of the fluency value of the target sentence and a third coefficient.

The sum of the first coefficient, the second coefficient, and the third coefficient is equal to 1.

In the embodiment of the present application, after the electronic device determines the validity value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence, the electronic device can calculate the sum of three products: the validity value of the target candidate word multiplied by the first coefficient (denoted as α), the fluency value of the target phrase multiplied by the second coefficient (denoted as β), and the fluency value of the target sentence multiplied by the third coefficient (denoted as γ). This sum is the perplexity corresponding to the target candidate word (denoted as Score).

That is, Score=α×Score_1+β×Score_2+γ×Score_3.

Optionally, in this embodiment of the present application, the values of the first coefficient, the second coefficient and the third coefficient may be any possible positive numbers, and the sum of the first coefficient, the second coefficient and the third coefficient is equal to 1.
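The weighted combination Score = α×Score_1 + β×Score_2 + γ×Score_3 can be sketched as below. The function name `combined_score` and the default coefficient values are hypothetical; the application only requires the coefficients to be positive and to sum to 1.

```python
def combined_score(
    score_1: float,
    score_2: float,
    score_3: float,
    alpha: float = 0.2,  # hypothetical first coefficient
    beta: float = 0.3,   # hypothetical second coefficient
    gamma: float = 0.5,  # hypothetical third coefficient
) -> float:
    """Return alpha*score_1 + beta*score_2 + gamma*score_3."""
    # The coefficients must be positive and sum to 1.
    if min(alpha, beta, gamma) <= 0 or abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("coefficients must be positive and sum to 1")
    return alpha * score_1 + beta * score_2 + gamma * score_3


print(combined_score(1.0, 2.0, 4.0))
```

Checking the constraint inside the function keeps the scores of the two candidate words comparable, since both are then computed with the same normalised weights.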

Step 203, the electronic device determines whether the first degree of confusion is less than the second degree of confusion.

In this embodiment of the present application, after the electronic device determines the first degree of confusion and the second degree of confusion, the electronic device may compare the magnitudes of the first degree of confusion and the second degree of confusion, thereby determining which of the first candidate word and the second candidate word is correct.

In this embodiment of the present application, if the first degree of confusion is less than the second degree of confusion, the electronic device can obtain the restored target text according to the first candidate word, that is, the electronic device may perform step 204 described below. If the second degree of confusion is less than the first degree of confusion, the electronic device can obtain the restored target text according to the second candidate word, that is, the electronic device may perform step 205 described below.

It can be understood that, in this embodiment of the present application, the following steps 204 and 205 are executed alternatively.

Step 204, the electronic device obtains the restored target text according to the first candidate word.

In the embodiment of the present application, when the first degree of confusion is less than the second degree of confusion, the electronic device can restore the target text according to the first candidate word, so that the restored target text can be obtained.

In a possible implementation manner, the electronic device can directly use the first candidate word to replace the first character group and the second character group in the target text, so that the restored target text can be obtained.

In another possible implementation manner, the electronic device can use the above-mentioned first sentence (the first sentence includes the first candidate word) to replace the target sentence in the target text, so that the restored target text can be obtained.

Step 205, the electronic device obtains the restored target text according to the second candidate word.

In this embodiment of the present application, when the second degree of confusion is less than the first degree of confusion, the electronic device can restore the target text according to the second candidate word, so that the restored target text can be obtained.

In a possible implementation manner, the electronic device can directly use the second candidate word to replace the first character group and the second character group in the target text, so that the restored target text can be obtained.

In another possible implementation manner, the electronic device can use the above-mentioned second sentence (the second sentence includes the second candidate word) to replace the target sentence in the target text, so that the restored target text can be obtained.
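The two implementation manners described for steps 204 and 205 can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: the function names are hypothetical, and the first and second character groups are assumed to be separated by exactly one newline character in the target text.

```python
def restore_by_word(target_text, first_group, second_group, candidate):
    """Replace the two character groups directly with the chosen candidate word."""
    # Assumption: the first group ends line N and the second group starts
    # line N+1, so they are joined by a single newline in the target text.
    broken = first_group + "\n" + second_group
    return target_text.replace(broken, candidate, 1)


def restore_by_sentence(target_text, target_sentence, chosen_sentence):
    """Replace the whole target sentence with the sentence containing the candidate."""
    return target_text.replace(target_sentence, chosen_sentence, 1)


text = "We plan to co-\noperate with them."
by_word = restore_by_word(text, "co-", "operate", "cooperate")
by_sentence = restore_by_sentence(text, text, "We plan to cooperate with them.")
```

Both calls produce the same restored text here; the second manner reuses the sentence that was already built when the degree of confusion was computed.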

In the text restoration method provided by the embodiment of the present application, the lower the degree of confusion corresponding to a sentence, the smoother the sentence is, that is, the smaller the degree of confusion corresponding to the sentence, the more accurate the sentence is. Therefore, by comparing the degree of confusion corresponding to the first sentence obtained according to the first candidate word with the degree of confusion corresponding to the second sentence obtained according to the second candidate word, it can be determined which of the first candidate word and the second candidate word is correct, that is, the correct word composed of the first character group and the second character group in the target text can be determined, so that the text can be accurately restored.

Optionally, in the embodiment of the present application, after the electronic device obtains the restored target text, the text restoration method provided by the embodiment of the present application may further include the following step 206.

Step 206, the electronic device acquires the keywords of the restored target text based on the keyword recognition model.

The content type of the above keyword may be the same as the content type preset in the keyword identification model.

In the embodiment of the present application, after the electronic device obtains the restored target text, the electronic device may input the restored target text into the keyword recognition model, so that the keywords in the restored target text can be obtained based on the keyword recognition model. Because the text has already been restored, the keywords obtained are accurate, which improves the accuracy of keyword recognition.

Optionally, in this embodiment of the present application, after the keyword recognition model identifies the keywords in the restored target text, the keyword recognition model may output a keyword list to the electronic device, where the keyword list may include all keywords in the restored target text.

Exemplarily, assuming that the content type of the keyword preset in the keyword recognition model is "place name", after the electronic device inputs the restored target text into the keyword recognition model, the keyword recognition model can extract and output all words related to "place name" from the restored target text, so as to obtain the above keywords.

In the embodiment of the present application, after the electronic device inputs the restored target text into the keyword recognition model, the keyword recognition model can perform keyword recognition on the restored target text, so as to obtain the keywords in the restored target text, and output the list of these keywords to the electronic device, so that the keywords in the target text can be accurately obtained.
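Step 206 can be illustrated with a toy stand-in for the keyword recognition model. The description does not specify how the model works, so the sketch below simply assumes a fixed vocabulary of place names; the names `PLACE_NAMES` and `extract_keywords`, the example vocabulary, and the choice to list each keyword only once are all assumptions.

```python
import re

# Hypothetical vocabulary standing in for a model whose preset content type is "place name".
PLACE_NAMES = {"Beijing", "Shanghai", "Shenzhen"}


def extract_keywords(restored_text, vocabulary=PLACE_NAMES):
    """Return the keyword list for the restored text, in order of first appearance."""
    keywords = []
    for token in re.findall(r"\w+", restored_text):
        # Keep only words of the preset content type, without duplicates.
        if token in vocabulary and token not in keywords:
            keywords.append(token)
    return keywords


keyword_list = extract_keywords(
    "The train runs from Beijing to Shanghai and back to Beijing."
)
```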

The text restoration apparatus provided by the embodiment of the present application will be described below by taking the text restoration method performed by the text restoration apparatus in the embodiment of the present application as an example.

As shown in FIG. 4, an embodiment of the present application provides a text restoration apparatus 300. The text restoration apparatus 300 includes an acquisition module 301, a determination module 302 and a restoration module 303. The acquisition module 301 is used to obtain the first candidate word and the second candidate word according to the first character group, where the first character group is the character group that is at the end of the Nth line in the target text to be restored and ends with a delimiter, the first candidate word is the word obtained by combining the first character group and the second character group, the second candidate word is the word obtained by combining the third character group and the second character group, the second character group is the first character group of the (N+1)th line in the target text to be restored, and the third character group is the character group obtained by removing the delimiter from the first character group. The determination module 302 is used to determine the first degree of confusion and the second degree of confusion, where the first degree of confusion is the degree of confusion corresponding to the first sentence obtained by replacing the first character group and the second character group in the target sentence with the first candidate word, and the second degree of confusion is the degree of confusion corresponding to the second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word. The restoration module 303 is used to obtain the restored target text according to the first candidate word when the first degree of confusion is less than the second degree of confusion, or to obtain the restored target text according to the second candidate word when the second degree of confusion is less than the first degree of confusion.
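How the two candidate words are formed from the character groups can be shown with a minimal sketch. The function name `build_candidates` and the use of a hyphen as the delimiter are assumptions for illustration; the description refers only to "a delimiter".

```python
def build_candidates(first_group, second_group, delimiter="-"):
    """Form both candidate words from the character groups around a line break.

    first_group:  character group at the end of line N, ending with the delimiter
    second_group: first character group of line N+1
    """
    if not first_group.endswith(delimiter):
        raise ValueError("the first character group must end with the delimiter")
    # Third character group: the first group with the delimiter removed.
    third_group = first_group[: -len(delimiter)]
    first_candidate = first_group + second_group    # delimiter kept
    second_candidate = third_group + second_group   # delimiter removed
    return first_candidate, second_candidate


candidates = build_candidates("co-", "operate")
```

For a line ending in "co-" followed by a line starting with "operate", this yields "co-operate" and "cooperate"; the degrees of confusion then decide between them.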

Optionally, the determination module is specifically configured to perform the following steps on the first candidate word and the second candidate word respectively: determine a target parameter based on the probability that each character in a target candidate word appears in the target text, where the target candidate word is the first candidate word or the second candidate word; and determine, according to the target parameter, the degree of confusion corresponding to the target candidate word. The target parameter includes: the legitimacy value of the target candidate word, the fluency value of a target phrase and the fluency value of the target sentence. The target phrase includes the target candidate word, a fourth character group and a fifth character group, where the fourth character group is the character group located before the first character group in the target text, and the fifth character group is the character group located after the second character group in the target text.

Optionally, the determination module is specifically configured to obtain the degree of confusion corresponding to the target candidate word according to the sum of the product of the legitimacy value of the target candidate word and a first coefficient, the product of the fluency value of the target phrase and a second coefficient, and the product of the fluency value of the target sentence and a third coefficient, where the sum of the first coefficient, the second coefficient and the third coefficient is equal to 1.
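The weighted sum described above can be sketched as follows. This is only an illustration: the function name `degree_of_confusion` and the example coefficient values (0.2, 0.3, 0.5) are assumptions, since the description requires only that the three coefficients sum to 1. The three input values are taken as given.

```python
import math


def degree_of_confusion(legitimacy, phrase_fluency, sentence_fluency,
                        coefficients=(0.2, 0.3, 0.5)):
    """Combine the three target parameters into one value by a weighted sum."""
    first, second, third = coefficients
    # The description requires the three coefficients to sum to 1.
    if not math.isclose(first + second + third, 1.0):
        raise ValueError("the coefficients must sum to 1")
    return (first * legitimacy
            + second * phrase_fluency
            + third * sentence_fluency)


# Hypothetical parameter values for one target candidate word.
value = degree_of_confusion(0.5, 0.4, 0.2)
```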

Optionally, the legitimacy value of the target candidate word is the probability that the target candidate word appears in the target text; the fluency value of the target phrase is the probability that the phrase consisting of the target candidate word, the fourth character group and the fifth character group appears in the target text; and the fluency value of the target sentence is the probability that the target sentence appears in the target text.

Optionally, the legitimacy value of the target candidate word is the product of the probabilities that each character in the target candidate word appears in the target text, where the probability that the Kth character in the target candidate word appears in the target text refers to the probability that the Kth character appears given that a sixth character group appears in the target text, the sixth character group consists of the first character to the (K-1)th character in the target candidate word, and K is an integer greater than 1.
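The product of conditional character probabilities can be illustrated with a short sketch. The function name and the two probability callables are hypothetical; how the probabilities are estimated is not part of this sketch. The description defines the conditional probability only for K greater than 1, so treating the first character with an unconditional probability is an assumption.

```python
def legitimacy_value(candidate, first_char_prob, conditional_prob):
    """Multiply the probability of each character given the characters before it.

    first_char_prob(ch):          probability of the first character (assumed unconditional)
    conditional_prob(ch, prefix): probability of ch given that prefix has appeared
    """
    if not candidate:
        raise ValueError("the candidate word must not be empty")
    value = first_char_prob(candidate[0])
    for k in range(1, len(candidate)):
        # candidate[:k] plays the role of the sixth character group:
        # the 1st to (K-1)th characters, where K = k + 1 in 1-based counting.
        value *= conditional_prob(candidate[k], candidate[:k])
    return value


# Toy probabilities: every character has probability 0.5 regardless of context.
toy_value = legitimacy_value("abc", lambda ch: 0.5, lambda ch, prefix: 0.5)
```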

Optionally, the determination module is further configured to obtain the keywords of the restored target text based on the keyword recognition model, where the content type of the keywords is the same as the content type preset in the keyword recognition model.

An embodiment of the present application provides a text restoration apparatus. The lower the degree of confusion corresponding to a sentence, the smoother the sentence is, that is, the smaller the degree of confusion corresponding to the sentence, the more accurate the sentence is. Therefore, by comparing the degree of confusion corresponding to the first sentence obtained according to the first candidate word with the degree of confusion corresponding to the second sentence obtained according to the second candidate word, it can be determined which of the first candidate word and the second candidate word is correct, that is, the correct word composed of the first character group and the second character group in the target text can be determined, so that the text can be accurately restored.

The text restoration apparatus in this embodiment of the present application may be an apparatus, or may be a component, an integrated circuit, or a chip in an electronic device. The apparatus may be a mobile electronic device or a non-mobile electronic device. Exemplarily, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, an in-vehicle electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like, and the non-mobile electronic device may be a personal computer (PC), a television (TV), a teller machine, a self-service machine, or the like, which are not specifically limited in the embodiments of the present application.

The text restoration apparatus in this embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.

The text restoration apparatus provided in the embodiment of the present application can implement each process implemented by the foregoing method embodiment. To avoid repetition, details are not described here again.

Optionally, as shown in FIG. 5, an embodiment of the present application further provides an electronic device 500, including a processor 501, a memory 502, and a program or instruction stored in the memory 502 and executable on the processor 501. When the program or instruction is executed by the processor 501, each process of the foregoing text restoration method embodiment can be implemented, and the same technical effect can be achieved. To avoid repetition, details are not described here again.

It should be noted that the electronic devices in the embodiments of the present application include the above-mentioned mobile electronic devices and non-mobile electronic devices.

FIG. 6 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.

The electronic device 100 includes but is not limited to: a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, a processor 110, and other components.

Those skilled in the art can understand that the electronic device 100 may also include a power source (such as a battery) for supplying power to the various components, and the power source may be logically connected to the processor 110 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system. The structure of the electronic device shown in FIG. 6 does not constitute a limitation on the electronic device, and the electronic device may include more or fewer components than those shown in the figure, or combine some components, or have a different arrangement of components, which will not be described in detail here.

The processor 110 may be configured to obtain the first candidate word and the second candidate word according to the first character group, where the first character group is the character group that is at the end of the Nth line in the target text to be restored and ends with a delimiter, the first candidate word is the word obtained by combining the first character group and the second character group, the second candidate word is the word obtained by combining the third character group and the second character group, the second character group is the first character group of the (N+1)th line in the target text to be restored, and the third character group is the character group obtained by removing the delimiter from the first character group; to determine the first degree of confusion and the second degree of confusion, where the first degree of confusion is the degree of confusion corresponding to the first sentence obtained by replacing the first character group and the second character group in the target sentence with the first candidate word, and the second degree of confusion is the degree of confusion corresponding to the second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word; and to obtain the restored target text according to the first candidate word when the first degree of confusion is less than the second degree of confusion, or to obtain the restored target text according to the second candidate word when the second degree of confusion is less than the first degree of confusion.

Optionally, the processor 110 is specifically configured to perform the following steps on the first candidate word and the second candidate word respectively: determine a target parameter based on the probability that each character in a target candidate word appears in the target text, where the target candidate word is the first candidate word or the second candidate word; and determine, according to the target parameter, the degree of confusion corresponding to the target candidate word. The target parameter includes: the legitimacy value of the target candidate word, the fluency value of a target phrase and the fluency value of the target sentence. The target phrase includes the target candidate word, a fourth character group and a fifth character group, where the fourth character group is the character group located before the first character group in the target text, and the fifth character group is the character group located after the second character group in the target text.

Optionally, the processor 110 is specifically configured to obtain the degree of confusion corresponding to the target candidate word according to the sum of the product of the legitimacy value of the target candidate word and a first coefficient, the product of the fluency value of the target phrase and a second coefficient, and the product of the fluency value of the target sentence and a third coefficient, where the sum of the first coefficient, the second coefficient and the third coefficient is equal to 1.

Optionally, the processor 110 is further configured to acquire, based on the keyword recognition model, the keyword of the restored target text, where the content type of the keyword is the same as the preset content type in the keyword recognition model.

The embodiment of the present application provides an electronic device. The lower the degree of confusion corresponding to a sentence, the smoother the sentence is, that is, the smaller the degree of confusion corresponding to the sentence, the more accurate the sentence is. Therefore, by comparing the degree of confusion corresponding to the first sentence obtained according to the first candidate word with the degree of confusion corresponding to the second sentence obtained according to the second candidate word, it can be determined which of the first candidate word and the second candidate word is correct, that is, the correct word composed of the first character group and the second character group in the target text can be determined, so that the text can be accurately restored.

It should be noted that, in this embodiment of the present application, the acquisition module, the determination module, the restoration module, and the input module in the above-mentioned text restoration apparatus may all be implemented by the above-mentioned processor 110 .

It should be understood that, in this embodiment of the present application, the radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. The electronic device provides the user with wireless broadband Internet access through the network module 102, for example helping the user to send and receive emails, browse web pages, and access streaming media. The audio output unit 103 may include a speaker, a buzzer, a receiver, and the like. The input unit 104 may include a graphics processing unit (GPU) 1041 and a microphone 1042, and the graphics processing unit 1041 processes image data of still pictures or videos obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode. The display unit 106 may include a display panel 1061, which may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 107 includes a touch panel 1071 and other input devices 1072. The touch panel 1071 is also called a touch screen and may include two parts, a touch detection device and a touch controller. Other input devices 1072 may include, but are not limited to, physical keyboards, function keys (such as volume control keys and switch keys), trackballs, mice, and joysticks, which are not described in detail here. The memory 109 may be used to store software programs as well as various data including, but not limited to, application programs and operating systems. The processor 110 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the above-mentioned modem processor may alternatively not be integrated into the processor 110.

Embodiments of the present application further provide a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or instruction is executed by a processor, each process of the foregoing text restoration method embodiment can be implemented, and the same technical effect can be achieved. To avoid repetition, details are not described here again.

The above-mentioned processor is the processor in the electronic device in the above-mentioned embodiment. The readable storage medium may include a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.

The embodiment of the present application further provides a chip. The chip includes a processor and a communication interface, the communication interface is coupled with the processor, and the processor is used for running a program or an instruction to implement each process of the above text restoration method embodiment, and the same technical effect can be achieved. To avoid repetition, details are not described here again.

It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, a system-on-chip, or the like.

It should be noted that, herein, the terms "comprising", "including" or any other variation thereof are intended to encompass non-exclusive inclusion, such that a process, method, article or apparatus comprising a series of elements includes not only those elements, but also other elements not expressly listed or inherent to such a process, method, article or apparatus. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article or apparatus that includes the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in the reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to some examples may be combined in other examples.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or the part thereof that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or a CD-ROM) and includes several instructions that cause an electronic device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the various embodiments of the present application.

The embodiments of the present application have been described above in conjunction with the accompanying drawings, but the present application is not limited to the above-mentioned specific embodiments, which are merely illustrative rather than restrictive. Inspired by this application, those of ordinary skill in the art can devise many other forms without departing from the purpose of this application and the scope of protection of the claims, all of which fall within the protection of this application.

Claims

A text restoration method, comprising:

Obtain a first candidate word and a second candidate word according to a first character group, where the first character group is a character group that is at the end of the Nth line in the target text to be restored and ends with a delimiter, the first candidate word is a word obtained by combining the first character group and a second character group, the second candidate word is a word obtained by combining a third character group and the second character group, the second character group is the first character group of the (N+1)th line in the target text to be restored, and the third character group is a character group obtained by removing the delimiter from the first character group;

Determine a first degree of confusion and a second degree of confusion, where the first degree of confusion is the degree of confusion corresponding to a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second degree of confusion is the degree of confusion corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word;

In the case that the first degree of confusion is less than the second degree of confusion, obtain the restored target text according to the first candidate word; or in the case that the second degree of confusion is less than the first degree of confusion, obtain the restored target text according to the second candidate word.
The method of claim 1, wherein determining the first degree of confusion and the second degree of confusion comprises:

The following steps are respectively performed on the first candidate word and the second candidate word:

Determine target parameters based on the probability that each character in the target candidate word appears in the target text, and the target candidate word is the first candidate word or the second candidate word;

According to the target parameter, determine the degree of confusion corresponding to the target candidate word;

Wherein, the target parameter includes: the legitimacy value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence; the target phrase includes the target candidate word, the fourth character group and the fifth character group, the fourth character group is the character group located before the first character group in the target text, and the fifth character group is the character group located after the second character group in the target text.
The method according to claim 2, wherein determining the degree of confusion corresponding to the target candidate word according to the target parameter comprises:

According to the sum of the product of the legitimacy value of the target candidate word and the first coefficient, the product of the fluency value of the target phrase and the second coefficient, and the product of the fluency value of the target sentence and the third coefficient, obtain the degree of confusion corresponding to the target candidate word;

Wherein, the sum of the first coefficient, the second coefficient and the third coefficient is equal to 1.
The method according to claim 2 or 3, wherein the legitimacy value of the target candidate word is the probability that the target candidate word appears in the target text;

The fluency value of the target phrase is the probability that the phrase composed of the target candidate word, the fourth character group and the fifth character group appears in the target text;

The fluency value of the target sentence is the probability that the target sentence appears in the target text.
The method according to claim 4, wherein the legitimacy value of the target candidate word is the product of the probabilities that each character in the target candidate word appears in the target text;

Wherein, the probability that the Kth character in the target candidate word appears in the target text refers to the probability that the Kth character appears when the sixth character group appears in the target text, and the sixth character group consists of the 1st character to the (K-1)th character in the target candidate word, where K is an integer greater than 1.
A text restoration device, comprising an acquisition module, a determination module and a restoration module;

The acquisition module is used to obtain a first candidate word and a second candidate word according to a first character group, where the first character group is a character group that is at the end of the Nth line in the target text to be restored and ends with a delimiter, the first candidate word is a word obtained by combining the first character group and a second character group, the second candidate word is a word obtained by combining a third character group and the second character group, the second character group is the first character group of the (N+1)th line in the target text to be restored, and the third character group is a character group obtained by removing the delimiter from the first character group;

The determination module is used to determine a first degree of confusion and a second degree of confusion, where the first degree of confusion is the degree of confusion corresponding to a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second degree of confusion is the degree of confusion corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word;

The restoration module is used to obtain the restored target text according to the first candidate word when the first degree of confusion is less than the second degree of confusion; or to obtain the restored target text according to the second candidate word when the second degree of confusion is less than the first degree of confusion.
The apparatus according to claim 6, wherein the determining module is specifically configured to perform the following steps on the first candidate word and the second candidate word respectively:

Determine target parameters based on the probability that each character in the target candidate word appears in the target text, and the target candidate word is the first candidate word or the second candidate word;

According to the target parameter, determine the degree of confusion corresponding to the target candidate word;

Wherein, the target parameter includes: the legitimacy value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence; the target phrase includes the target candidate word, the fourth character group and the fifth character group, the fourth character group is the character group located before the first character group in the target text, and the fifth character group is the character group located after the second character group in the target text.
The apparatus according to claim 7, wherein the determining module is specifically configured to obtain the degree of confusion corresponding to the target candidate word according to the sum of the product of the legitimacy value of the target candidate word and the first coefficient, the product of the fluency value of the target phrase and the second coefficient, and the product of the fluency value of the target sentence and the third coefficient;

Wherein, the sum of the first coefficient, the second coefficient and the third coefficient is equal to 1.
The apparatus according to claim 7 or 8, wherein the legitimacy value of the target candidate word is the probability that the target candidate word appears in the target text;

The fluency value of the target phrase is the probability that the phrase composed of the target candidate word, the fourth character group and the fifth character group appears in the target text;

The fluency value of the target sentence is the probability that the target sentence appears in the target text.
The apparatus according to claim 9, wherein the legitimacy value of the target candidate word is the product of the probabilities that each character in the target candidate word appears in the target text;

Wherein, the probability that the Kth character in the target candidate word appears in the target text refers to the probability that the Kth character appears when the sixth character group appears in the target text, and the sixth character group consists of the 1st character to the (K-1)th character in the target candidate word, where K is an integer greater than 1.
An electronic device, comprising a processor, a memory, and a program or instruction stored on the memory and executable on the processor, wherein when the program or instruction is executed by the processor, the steps of the text restoration method according to any one of claims 1-5 are implemented.
A readable storage medium, on which a program or an instruction is stored, wherein when the program or the instruction is executed by a processor, the steps of the text restoration method according to any one of claims 1-5 are implemented.
A computer program product executed by at least one processor to implement the steps of the text restoration method according to any one of claims 1-5.
A chip, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used for running a program or an instruction to implement the steps of the text restoration method according to any one of claims 1-5.
An electronic device configured to perform the steps of the text restoration method according to any one of claims 1-5.