WO2022166808A1 - Text restoration method and apparatus, and electronic device - Google Patents

Info

Publication number
WO2022166808A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
candidate word
character group
text
character
Prior art date
Application number
PCT/CN2022/074583
Other languages
French (fr)
Chinese (zh)
Inventor
佟禹
Original Assignee
维沃移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 维沃移动通信有限公司
Publication of WO2022166808A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Definitions

  • the present application belongs to the technical field of language recognition, and in particular relates to a text restoration method, device and electronic device.
  • When a Western character group (such as an English character group) at the end of a line of text cannot be displayed in full on that line, the character group can be broken at the automatic line-wrap position, and a separator is added at the position of the break, such as mark 1, mark 2, mark 3, mark 4, mark 5 and mark 6 in Figure 1.
  • When such text is copied, the character groups can be automatically restored based on these separators.
  • Specifically, the separator at the end of the text line can be directly removed, so that the character groups before and after the separator form a single character group, which is displayed in the copied text.
  • For example, the text shown in Figure 2 is the text obtained after copying the text shown in Figure 1.
  • the purpose of the embodiments of the present application is to provide a text restoration method, device and electronic device, which can solve the problem of inaccurate text restoration by existing electronic devices.
  • In a first aspect, an embodiment of the present application provides a text restoration method, the method comprising: obtaining a first candidate word and a second candidate word according to a first character group, where the first character group is the character group that is at the end of the Nth line of the target text to be restored and ends with a separator, the first candidate word is the word obtained by combining the first character group and a second character group, the second candidate word is the word obtained by combining a third character group and the second character group, the second character group is the first character group of the (N+1)th line of the target text to be restored, and the third character group is the character group obtained by removing the separator from the first character group;
  • determining a first perplexity and a second perplexity, where the first perplexity is the perplexity corresponding to a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word; and, when the first perplexity is less than the second perplexity, obtaining the restored target text according to the first candidate word, or, when the second perplexity is less than the first perplexity, obtaining the restored target text according to the second candidate word.
  • an embodiment of the present application provides a text restoration device, where the text restoration device includes an acquisition module, a determination module, and a restoration module.
  • The acquisition module is used to acquire a first candidate word and a second candidate word according to a first character group, where the first character group is the character group that is at the end of the Nth line of the target text to be restored and ends with a separator, the first candidate word is the word obtained by combining the first character group and a second character group, the second candidate word is the word obtained by combining a third character group and the second character group, the second character group is the first character group of the (N+1)th line of the target text to be restored, and the third character group is the character group obtained by removing the separator from the first character group.
  • The determination module is used to determine a first perplexity and a second perplexity, where the first perplexity is the perplexity corresponding to a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word.
  • The restoration module is used to obtain the restored target text according to the first candidate word when the first perplexity is less than the second perplexity, or to obtain the restored target text according to the second candidate word when the second perplexity is less than the first perplexity.
  • An embodiment of the present application provides an electronic device that includes a processor, a memory, and a program or instruction stored in the memory and executable on the processor; when the program or instruction is executed by the processor, the steps of the text restoration method of the first aspect above are implemented.
  • An embodiment of the present application provides a readable storage medium on which a program or instruction is stored; when the program or instruction is executed by a processor, the steps of the text restoration method described in the first aspect above are implemented.
  • An embodiment of the present application provides a chip that includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is used to run a program or instruction to implement the steps of the text restoration method of the first aspect above.
  • In the embodiments of the present application, a first candidate word and a second candidate word may be obtained according to a first character group, where the first character group is the character group that is at the end of the Nth line of the target text to be restored and ends with a separator, the first candidate word is the word obtained by combining the first character group and a second character group, the second candidate word is the word obtained by combining a third character group and the second character group, the second character group is the first character group of the (N+1)th line of the target text to be restored, and the third character group is the character group obtained by removing the separator from the first character group.
  • A first perplexity and a second perplexity are then determined, where the first perplexity is the perplexity corresponding to the first sentence obtained by replacing the first character group and the second character group in the target sentence with the first candidate word, and the second perplexity is the perplexity corresponding to the second sentence obtained by replacing them with the second candidate word.
  • When the first perplexity is less than the second perplexity, the restored target text is obtained according to the first candidate word; when the second perplexity is less than the first perplexity, the restored target text is obtained according to the second candidate word.
  • By comparing the perplexity corresponding to the first sentence with the perplexity corresponding to the second sentence, it can be determined which of the first candidate word and the second candidate word is correct, that is, the correct word formed by the first character group and the second character group in the target text can be determined, so that the text can be accurately restored.
  • FIG. 1 is a schematic diagram of a text to be restored provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a restored text provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a text restoration method provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a text restoration device provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 6 is a schematic hardware diagram of an electronic device provided by an embodiment of the present application.
  • The terms "first", "second" and the like in the description and claims of the present application are used to distinguish similar objects, and are not used to describe a specific order or sequence. It is to be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the present application can be practiced in sequences other than those illustrated or described herein.
  • In addition, the objects distinguished by "first", "second", etc. are usually of one type, and the number of objects is not limited; for example, the first object may be one or more than one.
  • "And/or" in the description and claims indicates at least one of the connected objects, and the character "/" generally indicates that the associated objects are in an "or" relationship.
  • Words such as "exemplary" or "for example" are used to represent examples, instances or illustrations. Any embodiment or design described in the embodiments of the present application as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of such words is intended to present the related concepts in a specific manner.
  • An embodiment of the present application provides a text restoration method, and the method includes the following steps 201 to 204, or steps 201 to 203 and step 205.
  • the execution body of the text restoration method provided by the embodiment of the present application may be a text restoration apparatus, or a control module in the text restoration apparatus for executing the text restoration method, or an electronic device.
  • the text restoration method provided by the embodiments of the present application will be exemplarily described below by taking a text restoration apparatus as an example.
  • When the execution body of the text restoration method provided by the embodiment of the present application is an electronic device, the electronic device may include the text restoration apparatus provided in the embodiment of the present application, or may be externally connected to the text restoration apparatus. This can be determined according to actual use requirements and is not limited in the embodiments of the present application.
  • Step 201: the electronic device obtains a first candidate word and a second candidate word according to a first character group.
  • The first character group may be the character group that is at the end of the Nth line of the target text to be restored and ends with a separator; the first candidate word is the word obtained by combining the first character group and a second character group; the second candidate word is the word obtained by combining a third character group and the second character group; the second character group is the first character group of the (N+1)th line of the target text to be restored; the third character group is the character group obtained by removing the separator from the first character group; and N is a positive integer.
  • The electronic device may acquire the first candidate word and the second candidate word according to the first character group, so that the target text can be restored using whichever of the two candidate words is correct.
  • the text restoration method provided in this embodiment of the present application may be applied to the following two possible scenarios:
  • Scenario 1: The electronic device copies the target text from one location to another, for example, copying the target text from one document to another document.
  • Scenario 2: The target text is the text in a target image, and the electronic device recognizes the text in the target image through optical character recognition (OCR) technology.
  • the text in the target image may be typeset horizontally or vertically.
  • the above-mentioned first character group may be the character group at the end of the M-th column in the target text to be restored and ending with a separator
  • the second character group is the character group to be restored
  • the text restoration method provided by the embodiments of the present application may also be applied to any other possible scenarios, which may be determined according to actual usage requirements, which are not limited in the embodiments of the present application.
  • The character group involved in the embodiments of the present application may be a Western character group, such as an English, French, German, Russian or Portuguese character group, which is not limited in the embodiments of the present application. The embodiments of the present application are described taking an English character group as an example.
  • The electronic device can detect, line by line, whether each line of the target text ends with a separator or a specific separator (such as "-"). If so, the electronic device may take the character group that includes the separator as the above-mentioned first character group; if not, the electronic device proceeds to detect the next line of text.
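  • The line-by-line detection described above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation; the function name `find_split_lines` and the sample text are hypothetical, and the separator is assumed to be the single character "-".

```python
def find_split_lines(lines, separator="-"):
    """Return the indices of lines whose last character group ends with the separator.

    The final line is skipped because there is no following line to join with.
    """
    hits = []
    for i, line in enumerate(lines[:-1]):
        # Ignore trailing whitespace left over from copying or OCR.
        if line.rstrip().endswith(separator):
            hits.append(i)
    return hits


# Hypothetical three-line sample in the spirit of the text to be restored.
sample = [
    "learning a good representa-",
    "tion of language is hard, so we pre-",
    "train a model first.",
]
print(find_split_lines(sample))
```

  • Each returned index identifies a line N whose final character group would be treated as the first character group.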
  • the manner in which the electronic device obtains the first candidate word and the second candidate word may be:
  • Step 1: The electronic device forms a unit (hereinafter referred to as a candidate set to be processed) from the character group before the separator at the end of the current line (that is, the above-mentioned third character group), the separator (such as "-"), and the first character group of the line following the current line.
  • For example, the electronic device can obtain the candidate set to be processed {representa,-,tion} from the first line, {repre,-,sentation} from the fourth line, {pre,-,train} from the sixth line, {re,-,sult} from the ninth line, {fine,-,tuned} from the tenth line, and {task,-,specific} from the fourteenth line.
  • Step 2: For each candidate set to be processed, combine the character groups before and after the separator to obtain candidate words, such as {representation}, {representation}, {pretrain}, {result}, {finetuned} and {taskspecific}; that is, the first candidate word can be obtained.
  • Step 3: For each candidate set to be processed, generate compound words that retain the separator, such as {representa-tion}, {repre-sentation}, {pre-train}, {re-sult}, {fine-tuned} and {task-specific}; that is, the above-mentioned second candidate word can be obtained.
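  • Steps 1 to 3 can be sketched in a few lines. The sketch assumes whitespace-delimited character groups and a non-empty following line; `build_candidates` and its return layout are hypothetical names chosen for illustration.

```python
def build_candidates(line_n, line_n1, separator="-"):
    """Build the candidate set and the two candidate words for one line break.

    Assumes line_n ends with the separator and line_n1 is not empty.
    """
    head = line_n.rstrip().split()[-1]   # character group ending with the separator, e.g. "pre-"
    tail = line_n1.split()[0]            # first character group of the next line, e.g. "train"
    stem = head[: -len(separator)]       # character group with the separator removed, e.g. "pre"

    unit = (stem, separator, tail)       # Step 1: candidate set to be processed
    joined = stem + tail                 # Step 2: separator dropped
    hyphenated = stem + separator + tail # Step 3: separator retained (compound word)
    return unit, joined, hyphenated


unit, joined, hyphenated = build_candidates("so we pre-", "train a model first.")
print(unit, joined, hyphenated)
```

  • Both outputs are kept because, at this point, nothing in the text itself says whether the hyphen was inserted by line wrapping or belongs to the word.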
  • The second candidate word is a compound word.
  • In this way, the electronic device can perform detection on both the word formed by combining the character groups before and after the separator and the compound word that retains the separator, thereby ensuring the accuracy of the restored target text.
  • Step 202: the electronic device determines a first perplexity and a second perplexity (also referred to as the first degree of confusion and the second degree of confusion).
  • The first perplexity may be the perplexity corresponding to the first sentence obtained by replacing the first character group and the second character group in the target sentence with the first candidate word, and the second perplexity is the perplexity corresponding to the second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word.
  • After the electronic device acquires the first candidate word and the second candidate word, it can determine the first perplexity and the second perplexity, so that the correct word for restoring the target text can be determined from the first candidate word and the second candidate word.
  • The electronic device may perform the following steps 202a and 202b on the first candidate word and the second candidate word respectively, so as to determine the first perplexity and the second perplexity.
  • Steps 202a and 202b are described below taking one of the first candidate word and the second candidate word (the target candidate word in the embodiments of the present application) as an example.
  • Step 202a: the electronic device determines a target parameter based on the probability that each character in a target candidate word appears in the target text.
  • The target candidate word may be the first candidate word or the second candidate word.
  • Step 202b: the electronic device determines the perplexity corresponding to the target candidate word according to the target parameter.
  • The target parameter may include the validity value (also rendered as the legitimacy value) of the target candidate word, the fluency value of a target phrase, and the fluency value of the target sentence.
  • The target phrase may include the target candidate word, a fourth character group and a fifth character group, where the fourth character group is the character group located before the first character group in the target text, and the fifth character group is the character group located after the second character group in the target text.
  • The electronic device may determine the validity value of the target candidate word, the fluency value of the target phrase and the fluency value of the target sentence based on the probability that each character in the target candidate word (the first candidate word or the second candidate word) appears in the target text, thereby obtaining the target parameter; the electronic device can then determine the perplexity corresponding to the target candidate word (for example, the first perplexity or the second perplexity) according to the target parameter.
  • For example, the electronic device can input the target candidate word (the first candidate word or the second candidate word) and the target text into a language model, which then calculates the validity value of the target candidate word, the fluency value of the target phrase and the fluency value of the target sentence, so that the target parameter is obtained.
  • The validity value of the target candidate word may be the probability that the target candidate word appears in the target text (denoted as Score_1).
  • Specifically, the validity value of the target candidate word may be the product of the probabilities that each character in the target candidate word appears in the target text.
  • The probability that the Kth character in the target candidate word appears in the target text refers to the probability that the Kth character appears given that a sixth character group appears in the target text, where the sixth character group consists of the first character to the (K-1)th character of the target candidate word, and K is an integer greater than 1.
  • The validity value of the target candidate word can be expressed as:
    $$P(W) = p(C_1) \prod_{K=2}^{k} p(C_K \mid C_1, C_2, \ldots, C_{K-1})$$
  • Here $P(W)$ represents the validity value of the target candidate word, $p(C_1)$ represents the probability that the first character of the target candidate word appears in the target text, $k$ is the number of characters in the target candidate word, and the product runs over the probabilities $p(C_K \mid C_1, C_2, \ldots, C_{K-1})$ that the Kth character appears given that the sixth character group, consisting of the first to the (K-1)th characters of the target candidate word, appears in the target text.
  • The language model is given by formula (1), where $W$ represents a candidate word, $C_1$ represents the first character of the candidate word, and $C_k$ represents the last character of the candidate word. Whether $W$ is a valid word is determined by calculating the probability that the candidate word $W$ is composed of the characters $C_1$ to $C_k$.
  • The probability of the word is calculated according to formula (2):
    $$P(W) = p(C_1)\, p(C_2 \mid C_1) \cdots p(C_k \mid C_1, C_2, \ldots, C_{k-1}) \qquad (2)$$
    where $p(C_1)$ represents the probability that the character $C_1$ appears in the target text, calculated according to formula (3); for example, $C_1$ may represent the character "r".
  • $p(C_2 \mid C_1)$ indicates that the occurrence of $C_2$ is related to $C_1$, that is, it is the probability that $C_2$ appears given that $C_1$ appears. For example, if $C_1$ represents the character "w" and $C_2$ represents the character "e", the probability that the character "e" appears given that the character "w" appears is $P(e \mid w) = P(we) / P(w)$.
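  • The chain-rule product for the validity value can be illustrated with a small count-based estimate. This is only a sketch under stated assumptions: probabilities are estimated from raw substring counts in the target text, $p(C_1)$ is taken to be the relative frequency of the first character, and the helper names `count_occurrences` and `validity` are hypothetical. A practical system would more likely use a trained language model.

```python
def count_occurrences(text, sub):
    """Count occurrences of sub in text, allowing overlaps."""
    return sum(1 for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i))


def validity(word, text):
    """Estimate P(W) = p(C1) * prod over K >= 2 of p(C_K | C_1 ... C_{K-1}).

    Each conditional is estimated as count(prefix of length K) / count(prefix of length K-1),
    mirroring P(e | w) = P(we) / P(w).
    """
    if not word or not text:
        return 0.0
    prob = count_occurrences(text, word[0]) / len(text)  # assumed estimate of p(C1)
    for k in range(2, len(word) + 1):
        prefix_count = count_occurrences(text, word[: k - 1])
        if prefix_count == 0:
            return 0.0  # the prefix never occurs, so the word cannot occur
        prob *= count_occurrences(text, word[:k]) / prefix_count
    return prob


text = "we were here"
print(validity("we", text))   # p(w) * p(e | w)
print(validity("xyz", text))  # a character group that never appears
```

  • With these count-based estimates the product telescopes to count(W) divided by the length of the text, which makes it easy to check the arithmetic by hand.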
  • The fluency value of the target phrase may be the probability that the phrase composed of the target candidate word, the fourth character group and the fifth character group appears in the target text (denoted as Score_2).
  • The fluency value of the target phrase can be calculated according to formula (5), where S represents a sentence or phrase consisting of the words W_1 ... W_N, and the formula gives the perplexity of W_1 ... W_N.
  • The fluency value of the target sentence may be the probability that the target sentence appears in the target text (denoted as Score_3).
  • The fluency value of the target sentence may likewise be calculated according to formula (5).
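  • Formula (5) itself is not reproduced here. As a point of reference only, the sketch below uses the standard definition of perplexity, PP(S) = P(W_1 ... W_N)^(-1/N), which may or may not match the applicant's formula. The function name `perplexity` and the input format (one conditional probability per word) are assumptions made for illustration.

```python
import math


def perplexity(word_probs):
    """Standard perplexity of a sequence given per-word conditional probabilities.

    word_probs[i] is assumed to be p(W_{i+1} | W_1 ... W_i); their product is P(S).
    Computed in log space to avoid underflow on long sentences.
    Raises ValueError if the list is empty or contains a zero probability.
    """
    if not word_probs:
        raise ValueError("at least one word probability is required")
    n = len(word_probs)
    log_p = sum(math.log(p) for p in word_probs)
    return math.exp(-log_p / n)


fluent = perplexity([0.5, 0.5, 0.5])
awkward = perplexity([0.5, 0.01, 0.5])
print(fluent < awkward)
```

  • Under this definition a more probable sequence has a lower perplexity, which is the property the comparison in step 203 relies on.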
  • step 202b may be specifically implemented by the following step 202b1.
  • Step 202b1: the electronic device obtains the perplexity corresponding to the target candidate word as the sum of the product of the validity value of the target candidate word and a first coefficient, the product of the fluency value of the target phrase and a second coefficient, and the product of the fluency value of the target sentence and a third coefficient.
  • That is, denoting the first coefficient as α, the second coefficient as β, the third coefficient as γ, and the perplexity corresponding to the target candidate word as Score:
    $$\mathrm{Score} = \alpha \cdot \mathrm{Score\_1} + \beta \cdot \mathrm{Score\_2} + \gamma \cdot \mathrm{Score\_3}$$
  • The first coefficient, the second coefficient and the third coefficient may take any positive values, provided that their sum is equal to 1.
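  • The weighted sum in step 202b1 can be sketched as below. The parameter names `alpha`, `beta` and `gamma` stand for the first, second and third coefficients, and the default weights are hypothetical values chosen only so that they are positive and sum to 1.

```python
def combined_score(score_1, score_2, score_3, alpha=0.2, beta=0.3, gamma=0.5):
    """Weighted sum of the three target parameters.

    score_1: validity value of the target candidate word
    score_2: fluency value of the target phrase
    score_3: fluency value of the target sentence
    The coefficients must be positive and sum to 1.
    """
    if min(alpha, beta, gamma) <= 0:
        raise ValueError("coefficients must be positive")
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    return alpha * score_1 + beta * score_2 + gamma * score_3


print(combined_score(10.0, 20.0, 30.0))
```

  • Validating the coefficients up front keeps a mis-specified weighting from silently skewing the comparison between the two candidates.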
  • Step 203: the electronic device determines whether the first perplexity (degree of confusion) is less than the second perplexity.
  • The electronic device may compare the first perplexity with the second perplexity, thereby determining which of the first candidate word and the second candidate word is correct.
  • If the first perplexity is less than the second perplexity, the electronic device can obtain the restored target text according to the first candidate word, that is, it may perform step 204 described below. If the second perplexity is less than the first perplexity, the electronic device can obtain the restored target text according to the second candidate word, that is, it may perform step 205 described below.
  • Step 204: the electronic device obtains the restored target text according to the first candidate word.
  • When the first perplexity is less than the second perplexity, the electronic device can restore the target text according to the first candidate word, so that the restored target text is obtained.
  • the electronic device can directly use the first candidate word to replace the first character group and the second character group in the target text, so that the restored target text can be obtained.
  • the electronic device can use the above-mentioned first sentence (the first sentence includes the first candidate word) to replace the target sentence in the target text, so that the restored target text can be obtained.
  • Step 205: the electronic device obtains the restored target text according to the second candidate word.
  • When the second perplexity is less than the first perplexity, the electronic device can restore the target text according to the second candidate word, so that the restored target text is obtained.
  • the electronic device can directly use the second candidate word to replace the first character group and the second character group in the target text, so that the restored target text can be obtained.
  • the electronic device can use the above-mentioned second sentence (the second sentence includes the second candidate word) to replace the target sentence in the target text, so that the restored target text can be obtained.
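  • Steps 203 to 205 amount to scoring two versions of the sentence and keeping the one with the lower perplexity. The sketch below assumes, for simplicity, that the target sentence is just lines N and N+1 joined together and that a scoring function is supplied by the caller; `restore`, `perplexity_of` and the toy scorer are all hypothetical stand-ins, not a real language model.

```python
def restore(lines, n, first_candidate, second_candidate, perplexity_of):
    """Join lines n and n+1, choosing the candidate whose sentence has lower perplexity."""
    head_words = lines[n].rstrip().split()
    tail_words = lines[n + 1].split()
    before = head_words[:-1]  # words preceding the split character group
    after = tail_words[1:]    # words following the split character group

    def sentence(word):
        return " ".join(before + [word] + after)

    first_sentence = sentence(first_candidate)
    second_sentence = sentence(second_candidate)
    # The equal case is not specified in the description; this sketch
    # arbitrarily falls back to the second candidate on a tie.
    if perplexity_of(first_sentence) < perplexity_of(second_sentence):
        return first_sentence
    return second_sentence


def toy_scorer(sentence):
    """Hypothetical scorer that simply prefers the hyphenated spelling."""
    return 1.0 if "pre-train" in sentence else 5.0


restored = restore(["so we pre-", "train a model"], 0, "pretrain", "pre-train", toy_scorer)
print(restored)  # so we pre-train a model
```

  • Passing the scorer in as a function keeps the selection logic independent of how the perplexity is actually computed.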
  • The lower the perplexity corresponding to a sentence, the more fluent, and hence the more accurate, the sentence is. Therefore, by comparing the perplexity corresponding to the first sentence obtained from the first candidate word with the perplexity corresponding to the second sentence obtained from the second candidate word, it can be determined which of the first candidate word and the second candidate word is correct, that is, the correct word formed by the first character group and the second character group in the target text can be determined, so that the text can be accurately restored.
  • the text restoration method provided by the embodiment of the present application may further include the following step 206 .
  • Step 206: the electronic device acquires the keywords of the restored target text based on a keyword recognition model.
  • The content type of the keywords may be the same as the content type preset in the keyword recognition model.
  • The electronic device may input the restored target text into the keyword recognition model, so that the keywords in the restored target text can be obtained based on the keyword recognition model. In this way, accurate keywords can be obtained, which improves the accuracy of keyword recognition.
  • The keyword recognition model may output a keyword list to the electronic device.
  • The keyword list may include all keywords in the restored target text.
  • For example, where the preset content type is "place name", the keyword recognition model can extract and output all words related to "place name" from the text, so as to obtain the above keywords.
  • The keyword recognition model can perform keyword recognition on the restored target text to obtain the keywords in the restored target text, and output the list of these keywords to the electronic device, so that the keywords in the target text can be accurately obtained.
  • the text restoration apparatus provided by the embodiment of the present application will be described below by taking the text restoration method performed by the text restoration apparatus in the embodiment of the present application as an example.
  • an embodiment of the present application provides a text restoration apparatus 300 .
  • the text restoration apparatus 300 includes an acquisition module 301 , a determination module 302 and a restoration module 303 .
  • The acquisition module 301 is used to obtain a first candidate word and a second candidate word according to a first character group, where the first character group is the character group that is at the end of the Nth line of the target text to be restored and ends with a separator, the first candidate word is the word obtained by combining the first character group and a second character group, the second candidate word is the word obtained by combining a third character group and the second character group, the second character group is the first character group of the (N+1)th line of the target text to be restored, and the third character group is the character group obtained by removing the separator from the first character group.
  • The determination module 302 is used to determine a first perplexity and a second perplexity, where the first perplexity is the perplexity corresponding to the first sentence obtained by replacing the first character group and the second character group in the target sentence with the first candidate word, and the second perplexity is the perplexity corresponding to the second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word.
  • The restoration module 303 is used to obtain the restored target text according to the first candidate word when the first perplexity is less than the second perplexity, or to obtain the restored target text according to the second candidate word when the second perplexity is less than the first perplexity.
  • the determination module is specifically configured to perform the following steps respectively on the first candidate word and the second candidate word: based on the probability that each character in the target candidate word appears in the target text, determine the target parameter, and the target candidate word is The first candidate word or the second candidate word; according to the target parameter, determine the degree of confusion corresponding to the target candidate word; wherein, the target parameter includes: the legitimacy value of the target candidate word, the fluency value of the target phrase and the fluency value of the target sentence ;
  • the target phrase includes a target candidate word, the fourth character group and the fifth character group, the fourth character group is the character group located before the first character group in the target text, and the fifth character group is located in the target text after the second character group character group.
  • the determination module is specifically used for the product of the validity value of the target candidate word and the first coefficient, the product of the fluency value of the target phrase and the second coefficient, and the product of the fluency value of the target sentence and the third coefficient.
  • the sum is obtained to obtain the perplexity corresponding to the target candidate word; wherein, the sum of the first coefficient, the second coefficient and the third coefficient is equal to 1.
  • the legitimacy value of the target candidate word is the probability that the target candidate word appears in the target text; the fluency value of the target phrase is the probability that the phrase consisting of the target candidate word, the fourth character group, and the fifth character group appears in the target text; the fluency value of the target sentence is the probability that the target sentence appears in the target text.
  • the legitimacy value of the target candidate word is the product of the probabilities that each character in the target candidate word appears in the target text, where the probability that the Kth character of the target candidate word appears in the target text is the probability that the Kth character appears given that a sixth character group appears in the target text. The sixth character group consists of the first through (K-1)th characters of the target candidate word, and K is an integer greater than 1.
  • the determination module is further configured to obtain, based on a keyword recognition model, keywords of the restored target text, where the content type of each keyword is the same as a preset content type in the keyword recognition model.
  • An embodiment of the present application provides a text restoration apparatus. The lower the perplexity of a sentence, the more fluent and therefore the more accurate the sentence is. Therefore, by comparing the perplexity of the first sentence, obtained from the first candidate word, with the perplexity of the second sentence, obtained from the second candidate word, it can be determined which of the two candidate words is correct. That is, the correct word formed by the first character group and the second character group in the target text can be determined, so that the text can be restored accurately.
  • the text restoration apparatus in this embodiment of the present application may be an apparatus, or may be a component, an integrated circuit, or a chip in an electronic device.
  • the apparatus may be a mobile electronic device or a non-mobile electronic device.
  • the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, an in-vehicle electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA), etc.
  • the non-mobile electronic device may be a personal computer (PC), a television (TV), a teller machine, or a self-service machine, etc., which is not specifically limited in the embodiments of the present application.
  • the text restoration apparatus in this embodiment of the present application may be an apparatus having an operating system.
  • the operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.
  • the text restoration device provided in the embodiment of the present application can implement each process implemented by the foregoing method embodiment, which is not repeated here to avoid repetition.
  • an embodiment of the present application further provides an electronic device 500, including a processor 501, a memory 502, and a program or instruction stored in the memory 502 and executable on the processor 501. When the program or instruction is executed by the processor 501, each process of the foregoing text restoration method embodiment is implemented and the same technical effect can be achieved. To avoid repetition, details are not repeated here.
  • the electronic devices in the embodiments of the present application include the above-mentioned mobile electronic devices and non-mobile electronic devices.
  • FIG. 6 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
  • the electronic device 100 includes but is not limited to: a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, and a processor 110, among other components.
  • the electronic device 100 may also include a power source (such as a battery) that supplies power to the various components; the power source may be logically connected to the processor 110 through a power management system, so that charging, discharging, and power consumption management are handled through the power management system.
  • the structure of the electronic device shown in FIG. 6 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than those shown in the figure, combine some components, or arrange the components differently, which will not be repeated here.
  • the processor 110 may be configured to: obtain a first candidate word and a second candidate word according to a first character group, where the first character group is the character group that is at the end of the Nth line of the target text to be restored and ends with a delimiter, the first candidate word is the word obtained by combining the first character group with a second character group, the second candidate word is the word obtained by combining a third character group with the second character group, the second character group is the first character group of the (N+1)th line of the target text to be restored, and the third character group is the first character group with the delimiter removed; determine a first perplexity and a second perplexity, where the first perplexity is the perplexity of a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity of a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word; and, when the first perplexity is less than the second perplexity, obtain the restored target text according to the first candidate word, or, when the second perplexity is less than the first perplexity, obtain the restored target text according to the second candidate word.
  • the processor 110 is specifically configured to perform the following steps on each of the first candidate word and the second candidate word: determine target parameters based on the probability that each character in a target candidate word appears in the target text, where the target candidate word is the first candidate word or the second candidate word; and determine, according to the target parameters, the perplexity corresponding to the target candidate word. The target parameters include the legitimacy value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence. The target phrase includes the target candidate word, a fourth character group, and a fifth character group; the fourth character group is the character group located before the first character group in the target text, and the fifth character group is the character group located after the second character group in the target text.
  • the processor 110 is specifically configured to sum the product of the legitimacy value of the target candidate word and a first coefficient, the product of the fluency value of the target phrase and a second coefficient, and the product of the fluency value of the target sentence and a third coefficient, to obtain the perplexity corresponding to the target candidate word; the first coefficient, the second coefficient, and the third coefficient sum to 1.
  • the legitimacy value of the target candidate word is the probability that the target candidate word appears in the target text; the fluency value of the target phrase is the probability that the phrase consisting of the target candidate word, the fourth character group, and the fifth character group appears in the target text; the fluency value of the target sentence is the probability that the target sentence appears in the target text.
  • the legitimacy value of the target candidate word is the product of the probabilities that each character in the target candidate word appears in the target text, where the probability that the Kth character of the target candidate word appears in the target text is the probability that the Kth character appears given that a sixth character group appears in the target text. The sixth character group consists of the first through (K-1)th characters of the target candidate word, and K is an integer greater than 1.
  • the processor 110 is further configured to acquire, based on the keyword recognition model, the keyword of the restored target text, where the content type of the keyword is the same as the preset content type in the keyword recognition model.
  • the embodiment of the present application provides an electronic device. The lower the perplexity of a sentence, the more fluent and therefore the more accurate the sentence is. Therefore, by comparing the perplexity of the first sentence, obtained from the first candidate word, with the perplexity of the second sentence, obtained from the second candidate word, it can be determined which of the two candidate words is correct. That is, the correct word formed by the first character group and the second character group in the target text can be determined, so that the text can be restored accurately.
  • the acquisition module, the determination module, the restoration module, and the input module in the above-mentioned text restoration apparatus may all be implemented by the above-mentioned processor 110 .
  • the radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like.
  • the electronic device provides the user with wireless broadband Internet access through the network module 102, such as helping the user to send and receive emails, browse web pages, and access streaming media.
  • the audio output unit 103 may include a speaker, a buzzer, a receiver, and the like.
  • the input unit 104 may include a graphics processing unit (GPU) 1041 and a microphone 1042; the graphics processing unit 1041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode.
  • the display unit 106 may include a display panel 1061, which may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like.
  • the user input unit 107 includes a touch panel 1071 and other input devices 1072 .
  • the touch panel 1071 is also called a touch screen.
  • the touch panel 1071 may include two parts, a touch detection device and a touch controller.
  • Other input devices 1072 may include, but are not limited to, physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, and joysticks, which will not be described herein again.
  • Memory 109 may be used to store software programs as well as various data including, but not limited to, application programs and operating systems.
  • the processor 110 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 110.
  • Embodiments of the present application further provide a readable storage medium storing a program or an instruction. When the program or instruction is executed by a processor, each process of the foregoing text restoration method embodiment is implemented and the same technical effect can be achieved. To avoid repetition, details are not repeated here.
  • the above-mentioned processor is the processor in the electronic device in the above-mentioned embodiment.
  • the readable storage medium may include a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
  • the embodiment of the present application further provides a chip that includes a processor and a communication interface coupled with the processor. The processor is configured to run a program or an instruction to implement each process of the foregoing text restoration method embodiment, and the same technical effect can be achieved. To avoid repetition, details are not repeated here.
  • the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip, or the like.
  • the method of the above embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, and of course can also be implemented in hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or a CD-ROM) and includes several instructions that cause an electronic device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods described in the various embodiments of the present application.
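The perplexity computation described above for the determination module (a weighted sum of three values, with the legitimacy value built as a product of per-character conditional probabilities) can be sketched as follows. This is only an illustration of the stated arithmetic: the function names, the `cond_prob` callback, and the coefficient values 0.2, 0.3, and 0.5 are assumptions, since the text only requires that the three coefficients sum to 1 and does not say how the probabilities are estimated.

```python
import math
from typing import Callable, Tuple


def legitimacy(word: str, cond_prob: Callable[[str, str], float]) -> float:
    """Product over characters of P(character | preceding characters of the word).

    cond_prob(prefix, char) is a hypothetical estimator; for the first
    character the prefix is the empty string.
    """
    return math.prod(cond_prob(word[:k], word[k]) for k in range(len(word)))


def weighted_perplexity(
    legit: float,
    phrase_fluency: float,
    sentence_fluency: float,
    coeffs: Tuple[float, float, float] = (0.2, 0.3, 0.5),  # assumed values
) -> float:
    """Weighted sum of the three target parameters; coefficients must sum to 1."""
    a, b, c = coeffs
    if not math.isclose(a + b + c, 1.0):
        raise ValueError("coefficients must sum to 1")
    return a * legit + b * phrase_fluency + c * sentence_fluency


# Toy estimator that returns the same probability for every character.
uniform = lambda prefix, char: 0.5
legit_value = legitimacy("abc", uniform)
print(legit_value, weighted_perplexity(legit_value, 0.5, 0.25))
```

In practice the conditional probabilities and the two fluency values would come from a language model; the toy estimator here only makes the sketch runnable.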

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the technical field of language recognition and discloses a text restoration method and apparatus, and an electronic device. The text restoration method comprises: obtaining first candidate words and second candidate words according to a first character group; determining a first confusion degree and a second confusion degree, the first confusion degree being the confusion degree corresponding to a first sentence obtained by replacing the first character group and a second character group in a target sentence with the first candidate words, and the second confusion degree being the confusion degree corresponding to a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate words; if the first confusion degree is smaller than the second confusion degree, obtaining restored target text according to the first candidate words; or if the second confusion degree is smaller than the first confusion degree, obtaining restored target text according to the second candidate words. The present method is applied in text restoration scenarios.

Description

Text restoration method and apparatus, and electronic device
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 202110158872.0, filed in China on February 4, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the technical field of language recognition, and in particular relates to a text restoration method and apparatus, and an electronic device.
Background
When text is edited, if a Western-script character group (for example, an English character group) at the end of a line cannot be displayed in full on that line, the character group may be broken at the automatic line-wrap position and a delimiter added at the break, as at marks 1, 2, 3, 4, 5, and 6 in FIG. 1.
Currently, if such text is copied into another file, the character groups can be restored automatically based on these delimiters. Specifically, the delimiter at the end of each text line is simply removed, so that the character groups before and after the delimiter are joined into a single character group and displayed in the copied text. For example, the text shown in FIG. 2 is the text obtained by copying the text shown in FIG. 1.
However, some character groups are compound words, that is, the character group itself contains a delimiter. Removing the delimiter unconditionally may therefore produce incorrect character groups in the restored text, such as the character groups at marks 3, 5, and 6 in FIG. 2. How to restore text accurately has thus become an urgent problem to be solved.
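The failure mode described above can be reproduced with a minimal sketch of the unconditional approach. The function name `naive_restore` and the sample lines are hypothetical and are not taken from the figures; a hyphen is assumed as the delimiter.

```python
def naive_restore(lines):
    """Join wrapped lines by always dropping a trailing hyphen."""
    out = ""
    for line in lines:
        line = line.strip()
        if out.endswith("-"):
            out = out[:-1] + line  # drop the hyphen and glue the two halves
        elif out:
            out = out + " " + line
        else:
            out = line
    return out


wrapped = ["This is a well-", "known exam-", "ple."]
print(naive_restore(wrapped))  # This is a wellknown example.
```

The split word "exam-ple" is restored correctly, but the compound "well-known" loses its own hyphen, which is the kind of error the method is meant to avoid.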
Summary
The purpose of the embodiments of the present application is to provide a text restoration method and apparatus, and an electronic device, which can solve the problem that existing electronic devices restore text inaccurately.
To solve the above technical problem, the present application is implemented as follows:
In a first aspect, an embodiment of the present application provides a text restoration method. The method includes: obtaining a first candidate word and a second candidate word according to a first character group, where the first character group is the character group that is at the end of the Nth line of the target text to be restored and ends with a delimiter, the first candidate word is the word obtained by combining the first character group with a second character group, the second candidate word is the word obtained by combining a third character group with the second character group, the second character group is the first character group of the (N+1)th line of the target text to be restored, and the third character group is the first character group with the delimiter removed; determining a first perplexity and a second perplexity, where the first perplexity is the perplexity of a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity of a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word; and, when the first perplexity is less than the second perplexity, obtaining the restored target text according to the first candidate word, or, when the second perplexity is less than the first perplexity, obtaining the restored target text according to the second candidate word.
In a second aspect, an embodiment of the present application provides a text restoration apparatus that includes an acquisition module, a determination module, and a restoration module. The acquisition module is configured to obtain a first candidate word and a second candidate word according to a first character group, where the first character group is the character group that is at the end of the Nth line of the target text to be restored and ends with a delimiter, the first candidate word is the word obtained by combining the first character group with a second character group, the second candidate word is the word obtained by combining a third character group with the second character group, the second character group is the first character group of the (N+1)th line of the target text to be restored, and the third character group is the first character group with the delimiter removed. The determination module is configured to determine a first perplexity and a second perplexity, where the first perplexity is the perplexity of a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity of a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word. The restoration module is configured to obtain the restored target text according to the first candidate word when the first perplexity is less than the second perplexity, or according to the second candidate word when the second perplexity is less than the first perplexity.
In a third aspect, an embodiment of the present application provides an electronic device that includes a processor, a memory, and a program or instruction stored in the memory and executable on the processor. When the program or instruction is executed by the processor, the steps of the text restoration method of the first aspect are implemented.
In a fourth aspect, an embodiment of the present application provides a readable storage medium storing a program or an instruction. When the program or instruction is executed by a processor, the steps of the text restoration method of the first aspect are implemented.
In a fifth aspect, an embodiment of the present application provides a chip that includes a processor and a communication interface coupled with the processor. The processor is configured to run a program or an instruction to implement the steps of the text restoration method of the first aspect.
In the embodiments of the present application, a first candidate word and a second candidate word may be obtained according to a first character group, where the first character group is the character group that is at the end of the Nth line of the target text to be restored and ends with a delimiter, the first candidate word is the word obtained by combining the first character group with a second character group, the second candidate word is the word obtained by combining a third character group with the second character group, the second character group is the first character group of the (N+1)th line of the target text to be restored, and the third character group is the first character group with the delimiter removed. A first perplexity and a second perplexity are then determined, where the first perplexity is the perplexity of a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity of a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word. When the first perplexity is less than the second perplexity, the restored target text is obtained according to the first candidate word; when the second perplexity is less than the first perplexity, the restored target text is obtained according to the second candidate word. With this solution, because a lower perplexity indicates a more fluent and therefore more accurate sentence, comparing the perplexity of the first sentence with that of the second sentence determines which of the two candidate words is correct. That is, the correct word formed by the first character group and the second character group in the target text can be determined, so that the text can be restored accurately.
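The comparison step summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring used in the application: `choose_candidate`, `toy_perplexity`, and the `KNOWN` vocabulary are hypothetical names, the toy score simply counts out-of-vocabulary words in place of a real language model, and falling back to the first candidate on a tie is an assumption because the text does not say what happens when the two perplexities are equal.

```python
from typing import Callable

# Hypothetical vocabulary used only by the toy scoring function below.
KNOWN = {"this", "is", "a", "well-known", "example"}


def toy_perplexity(sentence: str) -> float:
    """Stand-in score: number of words not in KNOWN (lower is more fluent)."""
    words = sentence.lower().strip(".").split()
    return float(sum(1 for w in words if w not in KNOWN))


def choose_candidate(
    template: str,
    first_candidate: str,
    second_candidate: str,
    perplexity: Callable[[str], float],
) -> str:
    """Return the candidate whose sentence has the lower perplexity.

    template is the target sentence with '{}' where the split word belongs.
    """
    first_sentence = template.format(first_candidate)
    second_sentence = template.format(second_candidate)
    if perplexity(second_sentence) < perplexity(first_sentence):
        return second_candidate
    return first_candidate  # also returned on a tie (assumption)


print(choose_candidate("This is a {} example.", "well-known", "wellknown", toy_perplexity))
print(choose_candidate("This is a well-known {}.", "exam-ple", "example", toy_perplexity))
```

With the toy score, the first call keeps the hyphenated compound and the second call drops the line-break delimiter.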
Description of Drawings
FIG. 1 is a schematic diagram of a text to be restored provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a restored text provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a text restoration method provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a text restoration apparatus provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
FIG. 6 is a schematic hardware diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present application can be practiced in sequences other than those illustrated or described herein. Objects distinguished by "first", "second", and the like are usually of one type, and the number of objects is not limited; for example, there may be one first object or more than one. In addition, "and/or" in the description and claims indicates at least one of the connected objects, and the character "/" generally indicates that the associated objects are in an "or" relationship.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of the present application should not be construed as preferred or advantageous over other embodiments or designs. Rather, such words are intended to present the related concepts in a concrete manner.
The text restoration method provided by the embodiments of the present application is described in detail below through specific embodiments and application scenarios, with reference to the accompanying drawings.
As shown in FIG. 3, an embodiment of the present application provides a text restoration method that includes steps 201 to 204 below, or steps 201 to 203 and step 205.
It should be noted that the text restoration method provided by the embodiments of the present application may be executed by a text restoration apparatus, by a control module in the text restoration apparatus that is used to execute the text restoration method, or by an electronic device. The method is described below by way of example, with the text restoration apparatus as the executing entity.
Optionally, in the embodiments of the present application, when the text restoration method is executed by an electronic device, the electronic device may include the text restoration apparatus provided by the embodiments of the present application, or may be externally connected to the text restoration apparatus. This can be determined according to actual use requirements and is not limited in the embodiments of the present application.
Step 201: The electronic device obtains a first candidate word and a second candidate word according to a first character group.
The first character group may be a character group that is located at the end of the N-th line of the target text to be restored and that ends with a separator. The first candidate word is a word obtained by combining the first character group with a second character group, and the second candidate word is a word obtained by combining a third character group with the second character group. The second character group is the first character group of the (N+1)-th line of the target text to be restored, the third character group is the character group obtained by removing the separator from the first character group, and N is a positive integer.
In this embodiment of the present application, after the electronic device acquires the target text to be restored, it may obtain the first candidate word and the second candidate word according to the first character group, so that the target text can be restored using whichever of the two candidate words is correct.
Optionally, the text restoration method provided by the embodiments of the present application may be applied in the following two possible scenarios:
Scenario 1: The electronic device copies the target text from one location to another, for example from one document to another document.
Scenario 2: The target text is text in a target image, and the electronic device recognizes the text in the target image using optical character recognition (OCR) technology.
Optionally, in scenario 2, the text in the target image may be typeset horizontally or vertically. When the text is typeset vertically, the first character group may be a character group that is located at the end of the M-th column of the target text to be restored and that ends with a separator, and the second character group is the first character group of the (M+1)-th column of the target text to be restored, where M is a positive integer.
Of course, in actual implementation, the text restoration method provided by the embodiments of the present application may also be applied in any other possible scenario. This may be determined according to actual use requirements and is not limited in the embodiments of the present application.
Optionally, the character groups involved in the embodiments of the present application may be Western-language character groups, for example English, French, German, Russian, or Portuguese character groups. This may be determined according to actual use requirements and is not limited in the embodiments of the present application. The embodiments of the present application use English character groups as an example.
In this embodiment of the present application, after the electronic device obtains the target text to be restored, it may check line by line whether each line of the target text ends with a separator or a specific delimiter (such as "-"). If so, the electronic device may take the character group that includes the separator as the first character group. If not, the electronic device may continue with the next line.
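The line-by-line check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `find_line_end_breaks`, the use of whitespace to delimit character groups, and the default separator "-" are all assumptions made for the example.

```python
def find_line_end_breaks(lines, separator="-"):
    """Scan lines and report every line whose last character group ends with the separator.

    Returns a list of (line_index, first_group, second_group) tuples, where
    first_group is the group at the end of line N (separator included) and
    second_group is the first group of line N+1.
    Assumption: character groups are separated by whitespace.
    """
    breaks = []
    # The last line has no following line, so it cannot start a split word.
    for n in range(len(lines) - 1):
        groups = lines[n].split()
        if not groups:
            continue
        last = groups[-1]
        # Require at least one character before the separator.
        if last.endswith(separator) and len(last) > len(separator):
            next_groups = lines[n + 1].split()
            if next_groups:
                breaks.append((n, last, next_groups[0]))
    return breaks
```

Lines that do not end with the separator are simply skipped, which mirrors the "continue with the next line" branch in the text.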
Optionally, in this embodiment of the present application, the electronic device may obtain the first candidate word and the second candidate word as follows:
Step 1: The electronic device groups the character group before the separator at the end of the current line (that is, the third character group), the separator (such as "-"), and the first character group of the line following the current line into one unit (hereinafter referred to as a to-be-processed candidate set).
Exemplarily, taking the text shown in FIG. 1 as an example, the electronic device may obtain the to-be-processed candidate set {representa,-,tion} from the first line, {repre,-,sentation} from the fourth line, {pre,-,train} from the sixth line, {re,-,sult} from the ninth line, {fine,-,tuned} from the tenth line, and {task,-,specific} from the fourteenth line.
Step 2: For every candidate item in each to-be-processed candidate set, merge the character groups before and after the separator to obtain a candidate word, for example {representation}, {representation}, {pretrain}, {result}, {finetuned}, and {taskspecific}. This yields the first candidate word.
Step 3: For every candidate item in each to-be-processed candidate set, generate a compound word that retains the separator, for example {representa-tion}, {repre-sentation}, {pre-train}, {re-sult}, {fine-tuned}, and {task-specific}. This yields the second candidate word.
It can be understood that, in this embodiment of the present application, the second candidate word is a compound word. The electronic device can therefore take both the word obtained by merging the character groups before and after the separator and the compound word formed with the separator, and perform word legitimacy detection and sentence fluency detection on each of them, which helps ensure the accuracy of the restored target text.
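Steps 1 to 3 above can be sketched in a few lines. The function name `build_candidates` and the returned tuple layout are assumptions made for illustration; the example inputs are taken from the candidate sets listed in the text.

```python
def build_candidates(first_group, second_group, separator="-"):
    """Build the two candidate words for one line-end break.

    first_group:  group at the end of line N, separator included (e.g. "representa-")
    second_group: first group of line N+1 (e.g. "tion")
    Returns (merged, hyphenated):
      merged     - groups joined with the separator dropped (step 2)
      hyphenated - compound word that keeps the separator (step 3)
    """
    if not first_group.endswith(separator):
        raise ValueError("first_group must end with the separator")
    # Third character group: the first group with the separator removed (step 1).
    third_group = first_group[: -len(separator)]
    merged = third_group + second_group
    hyphenated = third_group + separator + second_group
    return merged, hyphenated
```

For example, `build_candidates("fine-", "tuned")` yields the pair `("finetuned", "fine-tuned")`, matching the fifth candidate set in the example above.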
Step 202: The electronic device determines a first perplexity and a second perplexity.
The first perplexity may be the perplexity of a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word. The second perplexity is the perplexity of a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word.
In this embodiment of the present application, after the electronic device obtains the first candidate word and the second candidate word, it may determine the first perplexity and the second perplexity, so that the correct word can be identified from the two candidate words and used to restore the target text.
Optionally, in this embodiment of the present application, for step 202, the electronic device may perform the following steps 202a and 202b on the first candidate word and on the second candidate word respectively, thereby determining the first perplexity and the second perplexity.
It can be understood that steps 202a and 202b below are described by way of example for one of the two candidate words (referred to in the embodiments of the present application as the target candidate word).
Step 202a: The electronic device determines target parameters based on the probability that each character in the target candidate word appears in the target text.
The target candidate word may be the first candidate word or the second candidate word.
Step 202b: The electronic device determines the perplexity corresponding to the target candidate word according to the target parameters.
The target parameters may include a legitimacy value of the target candidate word, a fluency value of a target phrase, and a fluency value of the target sentence. The target phrase may include the target candidate word, a fourth character group, and a fifth character group, where the fourth character group may be the character group located before the first character group in the target text, and the fifth character group is the character group located after the second character group in the target text.
In this embodiment of the present application, the electronic device may determine the legitimacy value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence based on the probability that each character in the target candidate word (the first candidate word or the second candidate word) appears in the target text, thereby obtaining the target parameters. The electronic device may then determine, according to the target parameters, the perplexity corresponding to the target candidate word (for example, the first perplexity or the second perplexity).
In this embodiment of the present application, the electronic device may input the target candidate word (the first candidate word or the second candidate word) and the target text into a language model, which then computes the legitimacy value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence, thereby obtaining the target parameters.
In this embodiment of the present application, the legitimacy value of the target candidate word may be the probability that the target candidate word appears in the target text (denoted Score_1).
Optionally, in this embodiment of the present application, the legitimacy value of the target candidate word may be the product of the probabilities that each character in the target candidate word appears in the target text.
Here, the probability that the K-th character of the target candidate word appears in the target text means the probability that the K-th character appears given that a sixth character group appears in the target text, where the sixth character group consists of the 1st to (K-1)-th characters of the target candidate word and K is an integer greater than 1.
It should be noted that, in the embodiments of the present application, "another character (denoted B) appears given that a certain character group or character (denoted A) appears" means that, in the text, B follows A and there is no separator between A and B.
Specifically, the legitimacy value of the target candidate word can be expressed as:
$$P(W) = p(C_1) \times p(C_2 \mid C_1) \times \cdots \times p(C_K \mid C_1, C_2, \ldots, C_{K-1});$$
where $P(W)$ denotes the legitimacy value of the target candidate word, $p(C_1)$ denotes the probability that the first character of the target candidate word appears in the target text, and $p(C_K \mid C_1, C_2, \ldots, C_{K-1})$ denotes the probability that the K-th character appears given that the sixth character group appears in the target text. The sixth character group consists of the 1st to (K-1)-th characters of the target candidate word.
Exemplarily, the judgment is made by a language model. In formula (1) below, $W$ denotes a candidate word, $C_1$ denotes its first character, and $C_K$ denotes its last character. Whether $W$ is a legitimate word is judged by computing the probability that the characters $C_1$ to $C_K$ form the candidate word $W$. The probability of the word is given by formula (2), in which $p(C_1)$ denotes the probability that the character $C_1$ appears in the target text, computed by formula (3). For example, if $C_1$ is the character "r", the target text contains 100 characters in total, and "r" appears 10 times, then the probability of "r" is 10/100 = 0.1, that is, $p(C_1) = 0.1$.
In formula (4), $p(C_2 \mid C_1)$ indicates that the occurrence of $C_2$ depends on $C_1$, that is, it is the probability that $C_2$ appears given that $C_1$ appears. For example, if $C_1$ is the character "w" and $C_2$ is the character "e", then the probability that "e" appears given that "w" appears is $P(e \mid w) = P(we) / P(w)$.
$$W = C_1, C_2, C_3, \ldots, C_K \tag{1}$$
$$P(W) = P(C_1, C_2, C_3, \ldots, C_K) = p(C_1) \times p(C_2 \mid C_1) \times \cdots \times p(C_K \mid C_1, C_2, \ldots, C_{K-1}) \tag{2}$$
$$p(C_k) = \frac{\text{number of occurrences of character } C_k}{\text{total number of characters in the document}} \tag{3}$$
$$P(C_2 \mid C_1) = \frac{P(C_1 C_2)}{P(C_1)} \tag{4}$$
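Formulas (2) to (4) can be illustrated with plain substring counts over the target text. The sketch below is an assumption-laden illustration rather than the language model the application refers to: the function names are hypothetical, probabilities are estimated by counting non-overlapping substring occurrences and dividing by the total number of characters, and no smoothing is applied.

```python
def char_probability(text, ch):
    """Formula (3): occurrences of the character divided by total characters."""
    return text.count(ch) / len(text)


def conditional_probability(text, prefix, ch):
    """Formula (4) generalised to a prefix: P(ch | prefix) = P(prefix + ch) / P(prefix)."""
    p_prefix = text.count(prefix) / len(text)
    if p_prefix == 0:
        return 0.0
    return (text.count(prefix + ch) / len(text)) / p_prefix


def legitimacy(text, word):
    """Formula (2): chain-rule product over the characters of the candidate word."""
    if not text or not word:
        return 0.0
    p = char_probability(text, word[0])
    for k in range(1, len(word)):
        # word[:k] plays the role of the "sixth character group".
        p *= conditional_probability(text, word[:k], word[k])
    return p
```

With these count-based estimates the product telescopes to the relative frequency of the whole word in the text, so a candidate that never occurs scores 0. A practical system would presumably use a trained language model with smoothing instead.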
In this embodiment of the present application, the fluency value of the target phrase may be the probability that the phrase composed of the target candidate word, the fourth character group, and the fifth character group appears in the target text (denoted Score_2).
In this embodiment of the present application, the fluency value of the target phrase may be calculated according to formula (5) below.
[Formula (5): image PCTCN2022074583-appb-000001]
Here, S denotes a sentence or phrase consisting of the words W_1 … W_N. In general, the lower the perplexity, the more fluent the sentence or phrase.
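Formula (5) is available here only as an image reference, so its exact form cannot be confirmed from the text. As a hedged illustration, the sketch below uses the commonly used definition of perplexity, PP(S) = P(W_1 … W_N)^(-1/N), which may or may not match formula (5). The function name and the choice to pass per-word probabilities directly are assumptions.

```python
import math


def perplexity(word_probs):
    """Perplexity of a sentence or phrase S = W_1 ... W_N.

    word_probs: one probability per word, p(W_i | W_1 ... W_{i-1}), each > 0.
    Uses the standard definition PP(S) = (prod p_i) ** (-1 / N), computed in
    log space to avoid underflow on long sentences.
    """
    if not word_probs:
        raise ValueError("at least one word probability is required")
    if any(p <= 0 for p in word_probs):
        raise ValueError("probabilities must be positive")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))
```

Under this definition a sentence whose words are all assigned higher probability gets a lower perplexity, consistent with the statement that lower perplexity indicates a more fluent sentence or phrase.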
Exemplarily, as shown by mark 1 in FIG. 1, assume that the target candidate word is "representation", the word before "representa-" is "language", and the word after "tion" is "model". Formula (5) then gives:
[Formula image: PCTCN2022074583-appb-000002]
In this embodiment of the present application, the fluency value of the target sentence may be the probability that the target sentence appears in the target text (denoted Score_3).
In this embodiment of the present application, the fluency value of the target sentence may be calculated according to formula (5) above.
Exemplarily, as shown by mark 1 in FIG. 1, assume that the target candidate word is "representation" and that the sentence containing "representa-" is "We introduce a new language representation model called BERT". Formula (5) then gives:
[Formula image: PCTCN2022074583-appb-000003]
Optionally, in this embodiment of the present application, step 202b may be implemented by the following step 202b1.
Step 202b1: The electronic device obtains the perplexity corresponding to the target candidate word as the sum of the product of the legitimacy value of the target candidate word and a first coefficient, the product of the fluency value of the target phrase and a second coefficient, and the product of the fluency value of the target sentence and a third coefficient.
The sum of the first coefficient, the second coefficient, and the third coefficient is equal to 1.
In this embodiment of the present application, after the electronic device determines the legitimacy value of the target candidate word, the fluency value of the target phrase, and the fluency value of the target sentence, it may calculate the sum of the product of the legitimacy value and the first coefficient (denoted α), the product of the fluency value of the target phrase and the second coefficient (denoted β), and the product of the fluency value of the target sentence and the third coefficient (denoted γ), thereby obtaining the perplexity corresponding to the target candidate word (denoted Score).
That is, Score = α × Score_1 + β × Score_2 + γ × Score_3.
Optionally, in this embodiment of the present application, the first coefficient, the second coefficient, and the third coefficient may take any positive values, provided that their sum is equal to 1.
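The weighted sum in step 202b1 can be sketched directly. The function name and the default coefficient values below are assumptions chosen only so the example runs; the text requires only that the three coefficients be positive and sum to 1.

```python
def combined_score(score_1, score_2, score_3, alpha=0.2, beta=0.3, gamma=0.5):
    """Score = alpha * Score_1 + beta * Score_2 + gamma * Score_3.

    score_1: legitimacy value of the target candidate word
    score_2: fluency value of the target phrase
    score_3: fluency value of the target sentence
    The default coefficients are illustrative placeholders, not values from the source.
    """
    if min(alpha, beta, gamma) <= 0:
        raise ValueError("coefficients must be positive")
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    return alpha * score_1 + beta * score_2 + gamma * score_3
```

Validating the coefficients up front makes a misconfigured weighting fail loudly instead of silently skewing the comparison between the two candidates.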
Step 203: The electronic device determines whether the first perplexity is less than the second perplexity.
In this embodiment of the present application, after the electronic device determines the first perplexity and the second perplexity, it may compare the two values to determine which of the first candidate word and the second candidate word is correct.
In this embodiment of the present application, if the first perplexity is less than the second perplexity, the electronic device may obtain the restored target text according to the first candidate word; that is, it may perform step 204 below. If the second perplexity is less than the first perplexity, the electronic device may obtain the restored target text according to the second candidate word; that is, it may perform step 205 below.
It can be understood that, in this embodiment of the present application, only one of step 204 and step 205 is performed.
Step 204: The electronic device obtains the restored target text according to the first candidate word.
In this embodiment of the present application, when the first perplexity is less than the second perplexity, the electronic device may restore the target text according to the first candidate word, thereby obtaining the restored target text.
In one possible implementation, the electronic device may directly replace the first character group and the second character group in the target text with the first candidate word, thereby obtaining the restored target text.
In another possible implementation, the electronic device may replace the target sentence in the target text with the first sentence (which includes the first candidate word), thereby obtaining the restored target text.
Step 205: The electronic device obtains the restored target text according to the second candidate word.
In this embodiment of the present application, when the second perplexity is less than the first perplexity, the electronic device may restore the target text according to the second candidate word, thereby obtaining the restored target text.
In one possible implementation, the electronic device may directly replace the first character group and the second character group in the target text with the second candidate word, thereby obtaining the restored target text.
In another possible implementation, the electronic device may replace the target sentence in the target text with the second sentence (which includes the second candidate word), thereby obtaining the restored target text.
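Steps 203 to 205, together with the direct-replacement implementation, can be sketched as below. The function names are hypothetical. Two behaviours are assumptions not stated in the text: on a tie the second candidate is returned, and lines N and N+1 are joined into a single line after replacement.

```python
def pick_candidate(first_candidate, first_perplexity,
                   second_candidate, second_perplexity):
    """Return the candidate whose sentence has the lower perplexity (step 203).

    Assumption: the source does not say what happens on a tie; this sketch
    falls back to the second candidate in that case.
    """
    if first_perplexity < second_perplexity:
        return first_candidate   # step 204
    return second_candidate      # step 205


def restore_break(line_n, line_n1, chosen_word):
    """Replace the last group of line N and the first group of line N+1
    with the chosen word, joining both lines into one (an assumption)."""
    head = line_n.split()[:-1]   # drop the first character group
    tail = line_n1.split()[1:]   # drop the second character group
    return " ".join(head + [chosen_word] + tail)
```

A caller would compute both perplexities first, pass them to `pick_candidate`, and then hand the result to `restore_break` for each line-end break found in the target text.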
In the text restoration method provided by the embodiments of the present application, a lower perplexity indicates a more fluent, and therefore more accurate, sentence. By comparing the perplexity of the first sentence, obtained from the first candidate word, with the perplexity of the second sentence, obtained from the second candidate word, it is possible to determine which of the two candidate words is correct, that is, to determine the correct word formed by the first character group and the second character group in the target text. The text can thus be restored accurately.
Optionally, in this embodiment of the present application, after the electronic device obtains the restored target text, the text restoration method may further include the following step 206.
Step 206: The electronic device obtains keywords of the restored target text based on a keyword recognition model.
The content type of the keywords may be the same as the content type preset in the keyword recognition model.
In this embodiment of the present application, after the electronic device obtains the restored target text, it may input the restored target text into the keyword recognition model and obtain the keywords in the restored target text based on that model. Because the input text has been restored, the keywords obtained are accurate, which improves the accuracy of keyword recognition.
Optionally, in this embodiment of the present application, after the keyword recognition model recognizes the keywords in the restored target text, it may output a keyword list to the electronic device. The keyword list may include all keywords in the restored target text.
Exemplarily, assume that the content type preset in the keyword recognition model is "place name". After the electronic device inputs the restored target text into the keyword recognition model, the model may extract and output all words related to "place name" from the restored target text, thereby obtaining the keywords.
In this embodiment of the present application, after the electronic device inputs the restored target text into the keyword recognition model, the model may perform keyword recognition on the restored target text, obtain the keywords in it, and output a list of these keywords to the electronic device, so that the keywords in the target text are obtained accurately.
下面将以本申请实施例中以文本还原装置执行文本还原方法为例,说明本申请实施例提供的文本还原装置。The text restoration apparatus provided by the embodiment of the present application will be described below by taking the text restoration method performed by the text restoration apparatus in the embodiment of the present application as an example.
如图4所示,本申请实施例提供一种文本还原装置300,文本还原装置300包括获取模块301,确定模块302和还原模块303。获取模块301,用于根据第一字符组,获取第一候选词和第二候选词,第一字符组为处于待还原的目标文本中的第N行的行末、且以分隔符结尾的字符组,第一候选词为第一字符组与第二字符组组合得到的词,第二候选词为第三字符组与第二字符组组合得到的词,第二字符组为待还原的目标文本中的第N+1行的第一个字符组,第三字符组为第一字符组除去分隔符后得到的字符组;确定模块302,用于确定第一困惑度和第二困惑度,第一困惑度为第一候选词替换目标语句中的第一字符组和第二字符组得到的第一语句对应的困惑度,第二困惑度为第二候选词替换目标语句中的第一字符组和第二字符组得到的第二语句对应的困惑度;还原模块303,用于在第一困惑度小于第二困惑度的情况下,根据第一候选词,得到还原后的目标文本;或在第二困惑度小于第一困惑度的情况下,根据第二候选词,得到还原后的目标文本。As shown in FIG. 4 , an embodiment of the present application provides a text restoration apparatus 300 . The text restoration apparatus 300 includes an acquisition module 301 , a determination module 302 and a restoration module 303 . The obtaining module 301 is used to obtain the first candidate word and the second candidate word according to the first character group, where the first character group is the character group at the end of the Nth line in the target text to be restored and ending with a delimiter , the first candidate word is the word obtained by combining the first character group and the second character group, the second candidate word is the word obtained by combining the third character group and the second character group, and the second character group is the target text to be restored. The first character group in the N+1th line of the The perplexity degree is the perplexity degree corresponding to the first sentence obtained by the first candidate word replacing the first character group and the second character group in the target sentence, and the second perplexity degree is the second candidate word replacing the first character group and the second character group in the target sentence. 
The perplexity degree corresponding to the second sentence obtained by the second character group; the restoration module 303 is used to obtain the restored target text according to the first candidate word when the first perplexity degree is less than the second perplexity degree; When the second perplexity degree is less than the first perplexity degree, the restored target text is obtained according to the second candidate word.
可选地,确定模块,具体用于对第一候选词和第二候选词分别执行以下步骤:基于目标候选词中的每个字符在目标文本中出现的概率,确定目标参数,目标候选词为第一候选词或第二候选词;根据目标参数,确定目标候选词对应的困惑度;其中,目标参数包括:目标候选词的合法性值、目标词组的流畅度值和目标语句的流畅度值;目标词组包括目标候选词、第四字符组和第五字符组,第四字符组为目标文本中位于第一字符组之前的字符组,第五字符组为目标文本中位于第二字符组之后的字符组。Optionally, the determination module is specifically configured to perform the following steps respectively on the first candidate word and the second candidate word: based on the probability that each character in the target candidate word appears in the target text, determine the target parameter, and the target candidate word is The first candidate word or the second candidate word; according to the target parameter, determine the degree of confusion corresponding to the target candidate word; wherein, the target parameter includes: the legitimacy value of the target candidate word, the fluency value of the target phrase and the fluency value of the target sentence ; The target phrase includes a target candidate word, the fourth character group and the fifth character group, the fourth character group is the character group located before the first character group in the target text, and the fifth character group is located in the target text after the second character group character group.
可选地,确定模块,具体用于根据目标候选词的合法性值与第一系数的乘积、目标词组的流畅度值与第二系数的乘积、目标语句的流畅度值与第三系数的乘积之和,得到目标候选词对应的困惑度;其中,第一系数、第二系数和第三系数之和等于1。Optionally, the determination module is specifically used for the product of the validity value of the target candidate word and the first coefficient, the product of the fluency value of the target phrase and the second coefficient, and the product of the fluency value of the target sentence and the third coefficient. The sum is obtained to obtain the perplexity corresponding to the target candidate word; wherein, the sum of the first coefficient, the second coefficient and the third coefficient is equal to 1.
可选地,目标候选词的合法性值为目标文本中,目标候选词在目标文本中出现的概率;目标词组的流畅度值为目标候选词、第四字符组和第五字符组组成的词组在目标文本中出现的概率;目标语句的流畅度值为目标语句在目标文本中出现的概率。Optionally, the legitimacy value of the target candidate word is the probability that the target candidate word appears in the target text in the target text; the fluency value of the target phrase is the phrase consisting of the target candidate word, the fourth character group and the fifth character group. The probability of appearing in the target text; the fluency value of the target sentence is the probability that the target sentence appears in the target text.
Optionally, the legitimacy value of the target candidate word is the product of the probabilities that each character in the target candidate word appears in the target text. The probability that the K-th character of the target candidate word appears in the target text is the probability that the K-th character appears given that a sixth character group appears in the target text, where the sixth character group consists of the 1st through (K-1)-th characters of the target candidate word, and K is an integer greater than 1.
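The character-by-character product above is a chain-rule factorization, and a minimal Python sketch follows. The conditional probability source is passed in as a callback, because the application does not specify the underlying language model; `toy_char_prob` and its table are invented values used only to make the example runnable. Treating the first character as conditioned on an empty prefix is also an assumption, since the source defines the conditional form only for K greater than 1.

```python
def legitimacy_value(word, char_prob):
    """Multiply P(k-th character | preceding characters) over the word.

    char_prob(prefix, ch) is a caller-supplied estimate; for the first
    character the prefix is the empty string (an assumption).
    """
    value = 1.0
    for k, ch in enumerate(word):
        value *= char_prob(word[:k], ch)
    return value


# Hypothetical probability table, for illustration only.
TOY_TABLE = {("", "c"): 0.5, ("c", "a"): 0.4, ("ca", "t"): 0.25}


def toy_char_prob(prefix, ch):
    # Unseen (prefix, character) pairs get a small fallback probability.
    return TOY_TABLE.get((prefix, ch), 0.01)


cat_value = legitimacy_value("cat", toy_char_prob)  # 0.5 * 0.4 * 0.25
```

For long words a practical implementation would usually sum log-probabilities instead of multiplying raw probabilities, to avoid floating-point underflow; that variant is not described in the source.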
Optionally, the determination module is further configured to obtain, based on a keyword recognition model, keywords of the restored target text, where the content type of each keyword is the same as a content type preset in the keyword recognition model.
An embodiment of the present application provides a text restoration apparatus. A lower perplexity for a sentence indicates that the sentence is more fluent and therefore more accurate. Accordingly, by comparing the perplexity of the first sentence, obtained using the first candidate word, with the perplexity of the second sentence, obtained using the second candidate word, it can be determined which of the two candidate words is correct, that is, which word the first character group and the second character group in the target text correctly form, so that the text can be restored accurately.
The text restoration apparatus in the embodiments of the present application may be an apparatus, or may be a component, an integrated circuit, or a chip in an electronic device. The apparatus may be a mobile electronic device or a non-mobile electronic device. For example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, an in-vehicle electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a personal digital assistant (PDA), and the non-mobile electronic device may be a personal computer (PC), a television (TV), a teller machine, or a self-service machine. The embodiments of the present application impose no specific limitation in this regard.
The text restoration apparatus in the embodiments of the present application may be an apparatus having an operating system. The operating system may be the Android operating system, the iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.
The text restoration apparatus provided in the embodiments of the present application can implement each process implemented by the foregoing method embodiments. To avoid repetition, details are not described here again.
Optionally, as shown in FIG. 5, an embodiment of the present application further provides an electronic device 500, including a processor 501, a memory 502, and a program or instructions stored in the memory 502 and executable on the processor 501. When executed by the processor 501, the program or instructions implement each process of the foregoing text restoration method embodiments and achieve the same technical effect. To avoid repetition, details are not described here again.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and non-mobile electronic devices described above.
FIG. 6 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 100 includes, but is not limited to, components such as a radio frequency unit 101, a network module 102, an audio output unit 103, an input unit 104, a sensor 105, a display unit 106, a user input unit 107, an interface unit 108, a memory 109, and a processor 110.
Those skilled in the art will understand that the electronic device 100 may further include a power supply (such as a battery) that supplies power to the components. The power supply may be logically connected to the processor 110 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The structure shown in FIG. 6 does not limit the electronic device: the electronic device may include more or fewer components than shown, combine certain components, or arrange the components differently. Details are not described here.
The processor 110 may be configured to: obtain a first candidate word and a second candidate word according to a first character group, where the first character group is a character group that is at the end of the N-th line of the target text to be restored and ends with a separator, the first candidate word is the word obtained by combining the first character group with a second character group, the second candidate word is the word obtained by combining a third character group with the second character group, the second character group is the first character group of the (N+1)-th line of the target text to be restored, and the third character group is the character group obtained by removing the separator from the first character group; determine a first perplexity and a second perplexity, where the first perplexity is the perplexity of a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity of a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word; and, when the first perplexity is less than the second perplexity, obtain the restored target text according to the first candidate word, or, when the second perplexity is less than the first perplexity, obtain the restored target text according to the second candidate word.
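The candidate construction and selection performed by the processor can be sketched as follows. This is a simplified illustration, not the applicant's implementation: the function names, the default separator `"-"`, and the `perplexity_of` callback are hypothetical, and the callback stands in for whatever model scores the sentence with the candidate substituted. The source does not say what happens when the two perplexities are equal; the sketch falls back to the second candidate in that case, which is an assumption.

```python
def build_candidates(first_group, second_group, separator="-"):
    """Return (first candidate, second candidate).

    first candidate  = first group (separator kept) + second group
    second candidate = third group (separator removed) + second group
    """
    if not first_group.endswith(separator):
        raise ValueError("first character group must end with the separator")
    third_group = first_group[:-len(separator)]
    return first_group + second_group, third_group + second_group


def restore_word(first_group, second_group, perplexity_of, separator="-"):
    """Pick the candidate whose sentence has the lower perplexity."""
    first, second = build_candidates(first_group, second_group, separator)
    if perplexity_of(first) < perplexity_of(second):
        return first
    # Also reached on a tie, which the source leaves unspecified.
    return second


# Made-up scores standing in for a real perplexity model.
toy_scores = {"self-aware": 0.1, "selfaware": 0.9,
              "restor-ation": 0.8, "restoration": 0.2}
kept_hyphen = restore_word("self-", "aware", toy_scores.__getitem__)
dropped_hyphen = restore_word("restor-", "ation", toy_scores.__getitem__)
```

With these toy scores the first call keeps the hyphen and the second call drops it, which mirrors the two outcomes the method has to distinguish.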
Optionally, the processor 110 is specifically configured to perform the following steps on each of the first candidate word and the second candidate word: determining target parameters based on the probability that each character in a target candidate word appears in the target text, where the target candidate word is the first candidate word or the second candidate word; and determining, according to the target parameters, the perplexity corresponding to the target candidate word. The target parameters include a legitimacy value of the target candidate word, a fluency value of a target phrase, and a fluency value of the target sentence. The target phrase includes the target candidate word, a fourth character group, and a fifth character group, where the fourth character group is a character group located before the first character group in the target text, and the fifth character group is a character group located after the second character group in the target text.
Optionally, the processor 110 is specifically configured to obtain the perplexity corresponding to the target candidate word as the sum of three products: the legitimacy value of the target candidate word multiplied by a first coefficient, the fluency value of the target phrase multiplied by a second coefficient, and the fluency value of the target sentence multiplied by a third coefficient, where the first coefficient, the second coefficient, and the third coefficient sum to 1.
Optionally, the legitimacy value of the target candidate word is the probability that the target candidate word appears in the target text; the fluency value of the target phrase is the probability that the phrase composed of the target candidate word, the fourth character group, and the fifth character group appears in the target text; and the fluency value of the target sentence is the probability that the target sentence appears in the target text.
Optionally, the legitimacy value of the target candidate word is the product of the probabilities that each character in the target candidate word appears in the target text. The probability that the K-th character of the target candidate word appears in the target text is the probability that the K-th character appears given that a sixth character group appears in the target text, where the sixth character group consists of the 1st through (K-1)-th characters of the target candidate word, and K is an integer greater than 1.
Optionally, the processor 110 is further configured to obtain, based on a keyword recognition model, keywords of the restored target text, where the content type of each keyword is the same as a content type preset in the keyword recognition model.
An embodiment of the present application provides an electronic device. A lower perplexity for a sentence indicates that the sentence is more fluent and therefore more accurate. Accordingly, by comparing the perplexity of the first sentence, obtained using the first candidate word, with the perplexity of the second sentence, obtained using the second candidate word, it can be determined which of the two candidate words is correct, that is, which word the first character group and the second character group in the target text correctly form, so that the text can be restored accurately.
It should be noted that, in the embodiments of the present application, the acquisition module, the determination module, the restoration module, and the input module of the foregoing text restoration apparatus may all be implemented by the processor 110.
It should be understood that, in the embodiments of the present application, the radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, and a duplexer. The electronic device provides the user with wireless broadband Internet access through the network module 102, for example helping the user send and receive email, browse web pages, and access streaming media. The audio output unit 103 may include a speaker, a buzzer, a receiver, and the like. The input unit 104 may include a graphics processing unit (GPU) 1041 and a microphone 1042. The graphics processing unit 1041 processes image data of still pictures or video obtained by an image capture device (such as a camera) in video capture mode or image capture mode. The display unit 106 may include a display panel 1061, which may be configured as a liquid crystal display, an organic light-emitting diode display, or the like. The user input unit 107 includes a touch panel 1071 and other input devices 1072. The touch panel 1071, also called a touch screen, may include two parts: a touch detection device and a touch controller. The other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, and a joystick, which are not described in detail here. The memory 109 may be used to store software programs and various data, including but not limited to application programs and an operating system. The processor 110 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 110.
An embodiment of the present application further provides a readable storage medium storing a program or instructions. When executed by a processor, the program or instructions implement each process of the foregoing text restoration method embodiments and achieve the same technical effect. To avoid repetition, details are not described here again.
The processor is the processor of the electronic device in the foregoing embodiments. The readable storage medium may include a computer-readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
An embodiment of the present application further provides a chip. The chip includes a processor and a communication interface coupled to the processor, and the processor is configured to run a program or instructions to implement each process of the foregoing text restoration method embodiments and achieve the same technical effect. To avoid repetition, details are not described here again.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip.
It should be noted that, herein, the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "including a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element. Furthermore, the scope of the methods and apparatus in the embodiments of the present application is not limited to performing functions in the order shown or discussed; functions may also be performed substantially simultaneously or in the reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to some examples may be combined in other examples.
From the description of the foregoing embodiments, those skilled in the art will clearly understand that the methods of the foregoing embodiments can be implemented by software together with a necessary general-purpose hardware platform, and can also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause an electronic device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific embodiments described, which are merely illustrative rather than restrictive. In light of the present application, those of ordinary skill in the art can devise many other forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.

Claims (15)

  1. A text restoration method, comprising:
    obtaining a first candidate word and a second candidate word according to a first character group, wherein the first character group is a character group that is at the end of the N-th line of a target text to be restored and ends with a separator, the first candidate word is a word obtained by combining the first character group with a second character group, the second candidate word is a word obtained by combining a third character group with the second character group, the second character group is the first character group of the (N+1)-th line of the target text to be restored, and the third character group is a character group obtained by removing the separator from the first character group;
    determining a first perplexity and a second perplexity, wherein the first perplexity is the perplexity of a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity of a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word; and
    when the first perplexity is less than the second perplexity, obtaining the restored target text according to the first candidate word; or, when the second perplexity is less than the first perplexity, obtaining the restored target text according to the second candidate word.
  2. The method according to claim 1, wherein determining the first perplexity and the second perplexity comprises:
    performing the following steps on each of the first candidate word and the second candidate word:
    determining target parameters based on the probability that each character in a target candidate word appears in the target text, wherein the target candidate word is the first candidate word or the second candidate word; and
    determining, according to the target parameters, the perplexity corresponding to the target candidate word;
    wherein the target parameters comprise a legitimacy value of the target candidate word, a fluency value of a target phrase, and a fluency value of the target sentence; and the target phrase comprises the target candidate word, a fourth character group, and a fifth character group, the fourth character group being a character group located before the first character group in the target text, and the fifth character group being a character group located after the second character group in the target text.
  3. The method according to claim 2, wherein determining, according to the target parameters, the perplexity corresponding to the target candidate word comprises:
    obtaining the perplexity corresponding to the target candidate word as the sum of the product of the legitimacy value of the target candidate word and a first coefficient, the product of the fluency value of the target phrase and a second coefficient, and the product of the fluency value of the target sentence and a third coefficient;
    wherein the sum of the first coefficient, the second coefficient, and the third coefficient is equal to 1.
  4. The method according to claim 2 or 3, wherein the legitimacy value of the target candidate word is the probability that the target candidate word appears in the target text;
    the fluency value of the target phrase is the probability that the phrase composed of the target candidate word, the fourth character group, and the fifth character group appears in the target text; and
    the fluency value of the target sentence is the probability that the target sentence appears in the target text.
  5. The method according to claim 4, wherein the legitimacy value of the target candidate word is the product of the probabilities that each character in the target candidate word appears in the target text;
    wherein the probability that the K-th character of the target candidate word appears in the target text is the probability that the K-th character appears given that a sixth character group appears in the target text, the sixth character group consists of the 1st through (K-1)-th characters of the target candidate word, and K is an integer greater than 1.
  6. A text restoration apparatus, comprising an acquisition module, a determination module, and a restoration module, wherein:
    the acquisition module is configured to obtain a first candidate word and a second candidate word according to a first character group, wherein the first character group is a character group that is at the end of the N-th line of a target text to be restored and ends with a separator, the first candidate word is a word obtained by combining the first character group with a second character group, the second candidate word is a word obtained by combining a third character group with the second character group, the second character group is the first character group of the (N+1)-th line of the target text to be restored, and the third character group is a character group obtained by removing the separator from the first character group;
    the determination module is configured to determine a first perplexity and a second perplexity, wherein the first perplexity is the perplexity of a first sentence obtained by replacing the first character group and the second character group in a target sentence with the first candidate word, and the second perplexity is the perplexity of a second sentence obtained by replacing the first character group and the second character group in the target sentence with the second candidate word; and
    the restoration module is configured to, when the first perplexity is less than the second perplexity, obtain the restored target text according to the first candidate word; or, when the second perplexity is less than the first perplexity, obtain the restored target text according to the second candidate word.
  7. The apparatus according to claim 6, wherein the determination module is specifically configured to perform the following steps on each of the first candidate word and the second candidate word:
    determining target parameters based on the probability that each character in a target candidate word appears in the target text, wherein the target candidate word is the first candidate word or the second candidate word; and
    determining, according to the target parameters, the perplexity corresponding to the target candidate word;
    wherein the target parameters comprise a legitimacy value of the target candidate word, a fluency value of a target phrase, and a fluency value of the target sentence; and the target phrase comprises the target candidate word, a fourth character group, and a fifth character group, the fourth character group being a character group located before the first character group in the target text, and the fifth character group being a character group located after the second character group in the target text.
  8. The apparatus according to claim 7, wherein the determination module is specifically configured to obtain the perplexity corresponding to the target candidate word as the sum of the product of the legitimacy value of the target candidate word and a first coefficient, the product of the fluency value of the target phrase and a second coefficient, and the product of the fluency value of the target sentence and a third coefficient;
    wherein the sum of the first coefficient, the second coefficient, and the third coefficient is equal to 1.
  9. The apparatus according to claim 7 or 8, wherein the legitimacy value of the target candidate word is the probability that the target candidate word appears in the target text;
    the fluency value of the target phrase is the probability that the phrase composed of the target candidate word, the fourth character group, and the fifth character group appears in the target text; and
    the fluency value of the target sentence is the probability that the target sentence appears in the target text.
  10. The apparatus according to claim 9, wherein the legitimacy value of the target candidate word is the product of the probabilities that each character in the target candidate word appears in the target text;
    wherein the probability that the K-th character of the target candidate word appears in the target text is the probability that the K-th character appears given that a sixth character group appears in the target text, the sixth character group consists of the 1st through (K-1)-th characters of the target candidate word, and K is an integer greater than 1.
  11. An electronic device, comprising a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the text restoration method according to any one of claims 1-5.
  12. A readable storage medium storing a program or instructions, wherein the program or instructions, when executed by a processor, implement the steps of the text restoration method according to any one of claims 1-5.
  13. A computer program product, wherein the computer program product is executed by at least one processor to implement the steps of the text restoration method according to any one of claims 1-5.
  14. A chip, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement the steps of the text restoration method according to any one of claims 1-5.
  15. An electronic device, configured to perform the steps of the text restoration method according to any one of claims 1-5.
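
The scoring described in claims 8 and 10 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names `legitimacy` and `perplexity`, the `char_prob` callback and the coefficient values used in the usage note are hypothetical; the claims do not say how the probabilities are estimated or which coefficient values are used, and they define the conditional probability only for K greater than 1, so passing an empty prefix for the first character is an assumption.

```python
import math
from typing import Callable, Tuple


def legitimacy(word: str, char_prob: Callable[[str, str], float]) -> float:
    """Legitimacy value in the sense of claim 10.

    Multiplies, for each character, the probability of that character
    given the characters before it. `char_prob(prefix, ch)` is a
    hypothetical estimator supplied by the caller; for the first
    character the prefix is the empty string (an assumption).
    """
    value = 1.0
    for k, ch in enumerate(word):
        value *= char_prob(word[:k], ch)
    return value


def perplexity(
    legitimacy_value: float,
    phrase_fluency: float,
    sentence_fluency: float,
    coefficients: Tuple[float, float, float],
) -> float:
    """Weighted sum in the sense of claim 8.

    The three coefficients must sum to 1, as the claim requires.
    """
    first, second, third = coefficients
    if not math.isclose(first + second + third, 1.0):
        raise ValueError("coefficients must sum to 1")
    return (
        first * legitimacy_value
        + second * phrase_fluency
        + third * sentence_fluency
    )
```

For example, with a toy estimator that returns 0.5 for every character, `legitimacy("ab", lambda prefix, ch: 0.5)` gives 0.25, and `perplexity(0.25, 0.5, 1.0, (0.5, 0.25, 0.25))` gives 0.5.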
PCT/CN2022/074583 2021-02-04 2022-01-28 Text restoration method and apparatus, and electronic device WO2022166808A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110158872.0 2021-02-04
CN202110158872.0A CN112949261A (en) 2021-02-04 2021-02-04 Text restoration method and device and electronic equipment

Publications (1)

Publication Number Publication Date
WO2022166808A1 (en) 2022-08-11

Family

ID=76244023

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074583 WO2022166808A1 (en) 2021-02-04 2022-01-28 Text restoration method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN112949261A (en)
WO (1) WO2022166808A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115690806A (en) * 2022-10-11 2023-02-03 杭州瑞成信息技术股份有限公司 Unstructured document format identification method based on image data processing

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949261A (en) * 2021-02-04 2021-06-11 维沃移动通信有限公司 Text restoration method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108899016A (en) * 2018-08-02 2018-11-27 科大讯飞股份有限公司 A kind of regular method, apparatus of speech text, equipment and readable storage medium storing program for executing
US10402490B1 (en) * 2015-08-14 2019-09-03 Shutterstock, Inc. Edit distance based spellcheck
CN111401004A (en) * 2020-03-28 2020-07-10 苏州机数芯微科技有限公司 Article sentence-breaking method based on machine learning
CN112949261A (en) * 2021-02-04 2021-06-11 维沃移动通信有限公司 Text restoration method and device and electronic equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626065A (en) * 2019-02-26 2020-09-04 株式会社理光 Training method and device of neural machine translation model and storage medium
CN110852087B (en) * 2019-09-23 2022-02-22 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN112269475A (en) * 2020-10-23 2021-01-26 维沃移动通信有限公司 Character display method and device and electronic equipment

Also Published As

Publication number Publication date
CN112949261A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN109522538B (en) Automatic listing method, device, equipment and storage medium for table contents
WO2022166808A1 (en) Text restoration method and apparatus, and electronic device
US9411801B2 (en) General dictionary for all languages
US20170109435A1 (en) Apparatus and method for searching for information
US8843493B1 (en) Document fingerprint
JPH11203311A (en) Device for extracting related word and method therefor and computer readable recording medium for recording related word extraction program
CN111061383B (en) Text detection method and electronic equipment
CN113518026B (en) Message processing method and device and electronic equipment
WO2022083750A1 (en) Text display method and apparatus and electronic device
CN110889265A (en) Information processing apparatus, information processing method, and computer program
WO2022135474A1 (en) Information recommendation method and apparatus, and electronic device
US20230306765A1 (en) Recognition method and apparatus, and electronic device
US11501504B2 (en) Method and apparatus for augmented reality
CN111538830B (en) French searching method, device, computer equipment and storage medium
WO2022105754A1 (en) Character input method and apparatus, and electronic device
WO2024179519A1 (en) Semantic recognition method and apparatus
RU2608470C2 (en) User data update method and device
CN113359999A (en) Candidate word updating method and device and electronic equipment
KR102327790B1 (en) Information processing methods, devices and storage media
CN112148135A (en) Input method processing method and device and electronic equipment
WO2022100622A1 (en) Candidate word display method and apparatus, and electronic device
WO2022233275A1 (en) Input correction method and apparatus
WO2022161307A1 (en) Text translation method and apparatus, and device and medium
WO2022156817A1 (en) Content extraction method and apparatus
CN112036135B (en) Text processing method and related device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22749087

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22749087

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 240124)
