WO2020000764A1 - Hindi-oriented multi-language mixed input method and device - Google Patents
Hindi-oriented multi-language mixed input method and device
- Publication number
- WO2020000764A1 (PCT/CN2018/109507)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- hindi
- vocabulary
- input
- language model
- latin
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/02—Input arrangements using manually operated switches, e.g. using keyboards or dials
- G06F3/023—Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
- G06F3/0233—Character input methods
- G06F3/0237—Character input methods using prediction or retrieval techniques
Definitions
- the invention relates to the technical field of input methods, and in particular, to a multilingual mixed input method and device for Hindi.
- In the prior art, multilingual mixed input is achieved by switching input modes. For example, when the user is typing Latin characters on the English keyboard and wants to enter a Hindi character, the user needs to switch to the Hindi input method, enter the character, and then switch back to the English keyboard to continue typing Latin characters.
- The present invention provides a Hindi-oriented multilingual mixed input method and device, intended to solve the technical problem that, in the prior art, multilingual mixed input is achieved by switching input modes, which makes such input inefficient and extremely time-consuming.
- An embodiment of one aspect of the present invention provides a multilingual mixed input method for Hindi, including:
- obtaining the first candidate character string list of Latin character forms corresponding to the Latin character sequence according to the first language model includes:
- when the Latin character sequence is a Hindi word in complete Latin character spelling form, adding the Hindi word corresponding to the Latin character sequence to the first candidate character string list;
- when the Latin character sequence is a Hindi word in incomplete Latin character spelling form, obtaining an extended option, where the extended option includes a Hindi word or word segment whose Latin character spelling form contains the Latin character sequence, and adding the extended option to the first candidate character string list.
- obtaining the first candidate character string list of the Latin character form corresponding to the Latin character sequence according to the first language model further includes:
- the method further includes:
- predicting a subsequent vocabulary of the input vocabulary according to a language model corresponding to the input vocabulary, and generating a second candidate word list according to the prediction result including:
- the subsequent input vocabulary is predicted according to a second language model, which is a pre-established language model that spells Hindi in the form of Hindi characters.
- the first candidate character string list in Latin character form corresponding to the Latin character sequence is obtained according to the first language model, and the first language model is a language model that spells Hindi in Latin character form, where,
- the pre-establishment of the first language model includes:
- the constructing a language model using the collated corpus includes:
- using the collated corpus to construct a language model in the form of N-Gram, and calculating the parameters of the language model, where the parameters of the language model include: the words in the language model and, for each N-gram word arrangement, the conditional probability of the Nth word given the preceding N-1 words, where N is a positive integer; and
- The multilingual mixed input method for Hindi obtains the Latin character sequence of the current input word typed on the input method interface, and then obtains, according to a first language model, a first candidate character string list in Latin character form corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters. Next, according to the pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary, it obtains the Hindi character spelling form corresponding to each Hindi word in Latin character spelling form in the first candidate character string list, and generates a first candidate word list containing words in both the Latin character spelling form and the Hindi character spelling form. Finally, it displays the first candidate word list on the input method interface, obtains a selection operation on a word in the first candidate word list, and inputs the selected word as the input word.
- Because the Hindi character spelling form is determined through the pre-established mapping relationship, the accuracy of the output result can be improved.
- An embodiment of another aspect of the present invention provides a multilingual mixed input device for Hindi, including:
- an input character acquisition module configured to acquire the Latin character sequence of the current input vocabulary typed on the input method interface;
- a first candidate character string generating module configured to obtain, according to a first language model, a first candidate character string list in Latin character form corresponding to the Latin character sequence, where the first language model is a language model that spells Hindi in Latin character form;
- a vocabulary mapping module configured to obtain a target Hindi vocabulary list according to a pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary, where the target Hindi vocabulary list includes the Hindi character spelling form corresponding to each Hindi word in Latin character spelling form in the first candidate character string list;
- a first candidate word list generating module configured to generate, according to the first candidate character string list and the target Hindi vocabulary list, a first candidate word list including words in Latin character spelling form and words in Hindi character spelling form;
- a first candidate word list display module configured to display the first candidate word list on an input method interface
- the first candidate word input module is configured to obtain a selection operation of a word in the first candidate word list, and input the selected word as an input word.
- the first candidate string generating module is specifically configured to:
- when the Latin character sequence is a Hindi word in complete Latin character spelling form, add the Hindi word corresponding to the Latin character sequence to the first candidate character string list;
- when the Latin character sequence is a Hindi word in incomplete Latin character spelling form, obtain an extended option, where the extended option includes a Hindi word or word segment whose Latin character spelling form contains the Latin character sequence, and add the extended option to the first candidate character string list.
- the first candidate string generating module is further configured to:
- the device further includes:
- a second candidate word list generating module configured to predict a subsequent vocabulary of the input vocabulary according to the language model corresponding to the input vocabulary, and generate a second candidate word list according to the prediction result;
- a second candidate word list display module configured to display the second candidate word list on an input method interface
- a second candidate word input module is configured to obtain a selection operation of a vocabulary of the second candidate word list, and input the selected vocabulary as a next input vocabulary.
- the second candidate word list generating module is specifically configured to:
- the subsequent input vocabulary is predicted according to a second language model, which is a pre-established language model that spells Hindi in the form of Hindi characters.
- the device further includes:
- a first language model creation module is used to establish a first language model.
- the first language model creation module includes:
- a corpus acquisition unit configured to acquire corpus data spelling Hindi in the form of Latin characters, and preprocess the corpus data to remove the erroneous corpus and low-frequency corpus to obtain valid corpus;
- a corpus deduplication unit for removing redundant parts in the valid corpus data to obtain a collated corpus
- a language model building unit is used to build a language model using the collated corpus.
- the language model construction unit is specifically configured to:
- use the collated corpus to construct a language model in the form of N-Gram, and calculate the parameters of the language model, where the parameters of the language model include: the words in the language model and, for each N-gram word arrangement, the conditional probability of the Nth word given the preceding N-1 words, where N is a positive integer; and
- The multilingual mixed input device for Hindi obtains the Latin character sequence of the current input word typed on the input method interface, and then obtains, according to the first language model, a first candidate character string list in Latin character form corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters. Next, according to the pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary, it obtains the Hindi character spelling form corresponding to each Hindi word in Latin character spelling form in the first candidate character string list, and generates a first candidate word list containing words in both the Latin character spelling form and the Hindi character spelling form. Finally, it displays the first candidate word list on the input method interface, obtains a selection operation on a word in the first candidate word list, and inputs the selected word as the input word.
- Because the Hindi character spelling form is determined through the pre-established mapping relationship, the accuracy of the output result can be improved.
- Another embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the Hindi-oriented multilingual mixed input method proposed in the above embodiment of the present invention is implemented.
- An embodiment of the fourth aspect of the present invention provides a computer program product; when instructions in the computer program product are executed by a processor, the Hindi-oriented multilingual mixed input method according to the foregoing embodiment of the present invention is implemented.
- An embodiment of the fifth aspect of the present invention provides a computing device including a memory, a processor, and a computer program stored on the memory and executable on the processor. When the processor executes the program, the Hindi-oriented multilingual mixed input method according to the above embodiment of the present invention is implemented.
- The computer program product and the computing device have beneficial effects similar to those of the Hindi-oriented multilingual mixed input method and device according to the first and second aspects of the present invention, which are not repeated here.
- FIG. 1 is a schematic flowchart of a Hindi-oriented multilingual mixed input method according to a first embodiment of the present invention
- FIG. 2 is a schematic flowchart of lexical association input in a Hindi-oriented multilingual mixed input method according to an embodiment of the present invention
- FIG. 3 is a schematic flowchart of establishing a language model according to an embodiment of the present invention.
- FIG. 4 is a structural block diagram of a multi-lingual mixed input device for Hindi according to an embodiment of the present invention.
- FIG. 5 is a structural block diagram of a Hindi-oriented multilingual mixed input device according to an embodiment of the present invention.
- The first method is to switch the input mode to achieve multilingual mixed input. For example, when the user is typing Latin characters on the English keyboard and wants to enter a Hindi character, the user needs to switch to the Hindi input method, enter the character, and then switch back to the English keyboard to continue typing Latin characters.
- the second method is to enter the temporary input mode through a preset operation, and the user can type characters in the second language in the temporary input mode. For example, in Chinese and English input methods, the user can switch the input method by clicking the Shift key.
- In the third method, some input methods support two encoding schemes within the language model; that is, according to the user input, the most suitable encoding rule is selected automatically and the characters are displayed.
- With the first method, the efficiency of mixed-language input is low.
- With the second method, characters require special processing after the temporary input mode is entered, which lengthens the development cycle.
- With the third method, when the encoding differences between the two languages are small, the accuracy of the language model's output is low.
- the present invention mainly aims at the technical problems of low efficiency of multilingual mixed input and low accuracy of output results in the prior art, and proposes a multilingual mixed input method oriented to Hindi.
- The multilingual mixed input method for Hindi obtains the Latin character sequence of the current input word typed on the input method interface, and then obtains, according to a first language model, a first candidate character string list in Latin character form corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters. Next, according to the pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary, it obtains the Hindi character spelling form corresponding to each Hindi word in Latin character spelling form in the first candidate character string list, and generates a first candidate word list containing words in both the Latin character spelling form and the Hindi character spelling form. Finally, it displays the first candidate word list on the input method interface, obtains a selection operation on a word in the first candidate word list, and inputs the selected word as the input word.
- Because the Hindi character spelling form is determined through the pre-established mapping relationship, the accuracy of the output result can be improved.
- the language model in the form of N-Gram is based on the following assumptions: the occurrence of the nth vocabulary is related to the first n-1 vocabulary, but not related to any other vocabulary. Among them, the probability of occurrence of each vocabulary can be obtained through statistical calculation of corpus data.
- That is, the probability of the Nth word is determined by the words w1, w2, w3, ..., wN-1 that have appeared before it.
- The preceding words are used to predict the next word, and observation of a large amount of text yields an estimate of how likely the predicted word is to follow those existing words. Therefore, the constructed language model can be an (N-1)-order Markov model, that is, an N-gram language model.
- the value of N can be 2, 3, 4, etc.
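- The N-gram assumption described above can be written compactly as follows. This is the standard textbook formulation rather than text from the application; the symbols $w_i$ (the i-th word), $m$ (sentence length) and $C(\cdot)$ (a count over the corpus) are notational assumptions introduced here.

```latex
P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P\left(w_i \mid w_{i-N+1}, \ldots, w_{i-1}\right),
\qquad
P\left(w_i \mid w_{i-N+1}, \ldots, w_{i-1}\right)
  = \frac{C(w_{i-N+1}, \ldots, w_{i-1}, w_i)}{C(w_{i-N+1}, \ldots, w_{i-1})}
```

- With N = 2 this reduces to a bigram model in which each word depends only on the word immediately before it.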
- FIG. 1 is a schematic flowchart of a Hindi-oriented multilingual mixed input method according to an embodiment of the present invention.
- the Hindi-oriented multilingual mixed input method provided by the embodiment of the present invention may be implemented by the Hindi-oriented multilingual mixed input device provided by the embodiment of the present invention, and the device may be configured in any computing device so that the The computing device implements a multilingual mixed input function for Hindi.
- the computing device may be a hardware device such as a personal computer (PC), a cloud device, or a mobile device.
- The mobile device may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or another hardware device with a display.
- the multilingual mixed input method for Hindi includes the following steps:
- Step 101 Obtain a Latin character sequence of a current input vocabulary typed by an input method interface.
- the computing device may be provided with an input method interface, and a user may enter a Latin character sequence through the input method interface.
- For example, when the computing device is a mobile phone, the user can manually type the Latin character sequence through the touch screen; when the computing device is a PC, the user can manually type the Latin character sequence through the keyboard.
- a computing device may be provided with a listener to monitor a user-typed input operation.
- When the listener detects an input operation, the Latin character sequence of the current input word typed by the user on the input method interface may be obtained according to that operation. For example, when the user wants to enter "mobile phone", he can type "mobile" in the input method interface.
- Step 102 Obtain a first candidate character string list of Latin character forms corresponding to the Latin character sequence according to the first language model.
- the first language model is a pre-established language model that spells Hindi in the form of Latin characters.
- corpus data that spells Hindi in the form of Latin characters can be obtained, and then a language model is constructed based on the corpus data to obtain a first language model.
- When a Latin character sequence is acquired, it may be input to the first language model to obtain the first candidate character string list in Latin character form corresponding to the Latin character sequence.
- When the Latin character sequence is a Hindi word in complete Latin character spelling form, the Hindi word corresponding to the Latin character sequence may be added directly to the first candidate character string list.
- When the Latin character sequence corresponds to a Hindi word in incomplete Latin character spelling form, an extended option can be obtained in order to improve the user's input efficiency, or to correct and complete the Latin character sequence entered by the user.
- the extended option includes: a Hindi word or a vocabulary segment of a Latin character spelling form containing a Latin character sequence, and then the extended option is added to the first candidate character string list.
- The input method may also provide an error correction function. That is, obtaining the first candidate character string list in Latin character form corresponding to the Latin character sequence according to the first language model may further include: when the first language model contains no Hindi word in Latin character spelling form that contains the Latin character sequence, obtaining the Hindi word in Latin character spelling form with the highest similarity to the Latin character sequence, and adding it to the first candidate character string list as an extended option.
- the extension options can be: Mai, Nai, Main, Maine.
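- The three cases of Step 102 (complete word, extended options, error correction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the vocabulary `LATIN_VOCAB`, its counts, the function name `first_candidate_strings`, and the use of `difflib` similarity as the "highest similarity" measure are all hypothetical choices.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of Hindi words in Latin spelling. The counts are
# made-up stand-ins for scores a real first language model would provide.
LATIN_VOCAB = {"main": 50, "maine": 30, "nai": 20, "mai": 15, "nahi": 40, "meri": 10}


def first_candidate_strings(latin_seq, vocab=LATIN_VOCAB, limit=4):
    """Return Latin-form candidate strings for the typed sequence."""
    seq = latin_seq.lower()
    candidates = []
    # Case 1: the sequence is already a complete Latin-spelled Hindi word.
    if seq in vocab:
        candidates.append(seq)
    # Case 2: extended options, i.e. words that contain the typed sequence,
    # ordered by descending count.
    extended = [w for w in vocab if seq in w and w != seq]
    extended.sort(key=lambda w: -vocab[w])
    candidates.extend(extended)
    # Case 3: error correction when no word contains the sequence
    # (difflib's ratio is an assumed similarity measure).
    if not candidates:
        candidates = get_close_matches(seq, list(vocab), n=limit, cutoff=0.6)
    return candidates[:limit]


print(first_candidate_strings("mai"))
print(first_candidate_strings("nahee"))
```

- In this toy setup the mistyped "nahee" falls through to the similarity search, which is one simple way to realise the error correction described above.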
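- Step 103 Obtain a target Hindi vocabulary list according to the pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary, where the target Hindi vocabulary list includes the Hindi character spelling form corresponding to each Hindi word in Latin character spelling form in the first candidate character string list.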
- a mapping relationship between the spelling form of the Latin characters of the Hindi vocabulary and the spelling form of the Hindi characters may be established in advance.
- The Latin character spelling form of Hindi vocabulary takes two forms. One is a Latin spelling transliterated directly from the pronunciation of the Hindi character spelling. For example, the Latin spelling of one Hindi word is "dena"; "dena" has no practical meaning in other contexts and is meaningful only as the input used to obtain the corresponding Hindi characters. The other is an English word that does not exist in Hindi itself, for example the English word "mobile".
- By establishing a mapping between the Latin character spellings and the Hindi character spellings of the Hindi vocabulary, for example between "mobile" and its Hindi character spelling, the mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary can be kept one-to-one.
- the Hindi character spelling form corresponding to the Hindi vocabulary of the Latin character spelling form in the first candidate character string list can be obtained by querying the above mapping relationship, and the operation is simple and easy to implement. And through the mapping relationship established in advance, the corresponding spelling form of the Hindi character can be determined, which can further improve the accuracy of the output result.
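- The lookup in Step 103 can be sketched as a plain dictionary query. The table `LATIN_TO_HINDI`, its entries, and the function name `target_hindi_list` are illustrative assumptions; the application does not specify how the mapping is stored.

```python
# Illustrative one-to-one mapping from Latin spelling to Hindi (Devanagari)
# spelling. The entries are assumptions chosen for this sketch.
LATIN_TO_HINDI = {
    "dena": "देना",
    "mobile": "मोबाइल",
    "main": "मैं",
    "nahi": "नहीं",
}


def target_hindi_list(candidate_strings, mapping=LATIN_TO_HINDI):
    """Look up the Hindi character spelling for each Latin-form candidate.

    Candidates without a mapping entry are skipped rather than guessed.
    """
    return [mapping[c] for c in candidate_strings if c in mapping]


print(target_hindi_list(["main", "maine", "dena"]))
```

- Because each Latin spelling has at most one entry, the query is a constant-time lookup per candidate, which matches the "simple and easy to implement" claim above.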
- Step 104 Generate a first candidate word list of words including Latin character spelling form and Hindi character spelling form according to the first candidate character string list and the target Hindi vocabulary list.
- After the first candidate character string list and the Hindi character spelling forms corresponding to the Latin-spelled Hindi words in that list are obtained, a first candidate word list containing words in both the Latin character spelling form and the Hindi character spelling form can be generated.
- The first candidate word list may include all the Hindi words in Latin character spelling form in the first candidate character string list together with the corresponding words in Hindi character spelling form.
- Alternatively, a first number of Hindi words in Latin character spelling form and a second number of words in Hindi character spelling form can be selected from these for the first candidate word list.
- the first and second numbers can be the same or different.
- the first number can be two and the second number can be three.
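- Step 104 with the first and second numbers can be sketched as below. The function name `first_candidate_word_list` and the choice to list Latin-form words before Hindi-form words are assumptions; the text does not fix an ordering.

```python
def first_candidate_word_list(latin_candidates, hindi_candidates,
                              first_number=2, second_number=3):
    """Combine Latin-form and Hindi-form words into one display list.

    Keeps at most `first_number` Latin-form words and at most
    `second_number` Hindi-form words, preserving their incoming order.
    """
    latin_part = list(latin_candidates[:first_number])
    hindi_part = list(hindi_candidates[:second_number])
    return latin_part + hindi_part


print(first_candidate_word_list(["mai", "main", "maine"], ["मैं", "मैंने"]))
```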
- Step 105 Display the first candidate word list on the input method interface.
- the first candidate word list may be displayed on the input method interface.
- the first candidate word list displayed on the input method interface may be: Nai, Main, Maine.
- Step 106 Acquire a selection operation of a word in the first candidate word list, and input the selected word as an input word.
- the selection operation is triggered by a user, and the selection operation may be, for example, a user's click operation, or the user triggers an operation corresponding to a number or a space key on the keyboard, which is not limited.
- the user may select a word from the first candidate word list for input according to actual needs.
- A computing device may be provided with a listener to monitor the selection operation triggered by the user. When such a selection operation is detected, the selected word may be determined according to the selection operation and then entered as the input word.
- the user can select "Main" as an input word for input.
- The present invention takes mixed input of Hindi and Latin as an example, but is not limited thereto; those skilled in the art can implement mixed input of any two languages based on the present invention, so it has strong scalability.
- The multilingual mixed input method for Hindi obtains the Latin character sequence of the current input word typed on the input method interface, and then obtains, according to a first language model, a first candidate character string list in Latin character form corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters. Next, according to the pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary, it obtains the Hindi character spelling form corresponding to each Hindi word in Latin character spelling form in the first candidate character string list, and generates a first candidate word list containing words in both the Latin character spelling form and the Hindi character spelling form. Finally, it displays the first candidate word list on the input method interface, obtains a selection operation on a word in the first candidate word list, and inputs the selected word as the input word.
- Because the Hindi character spelling form is determined through the pre-established mapping relationship, the accuracy of the output result can be improved.
- the subsequent vocabulary of the input vocabulary can be predicted, so that the user can input the next vocabulary according to the prediction result. Therefore, there is no need for the user to manually type the next vocabulary, which further improves the user's multilingual mixed input efficiency.
- FIG. 2 is a schematic flowchart of lexical association input in a Hindi-oriented multilingual mixed input method according to an embodiment of the present invention.
- the Hindi-oriented multilingual mixed input method may further include the following steps:
- Step 201 Predict the subsequent vocabulary of the input vocabulary according to the language model corresponding to the input vocabulary, and generate a second candidate word list according to the prediction result.
- When the spelling form of the input vocabulary is Latin characters, the subsequent input vocabulary can be predicted according to the first language model; when the spelling form of the input vocabulary is Hindi characters, the subsequent input vocabulary is predicted according to the second language model.
- The second language model is a pre-established language model that spells Hindi in the form of Hindi characters. For example, Hindi corpus data spelled with Hindi characters can be obtained, and then a language model is constructed based on the corpus data to obtain the second language model.
- For example, when the input vocabulary is "Main", its spelling form is Latin characters, so the subsequent input vocabulary is predicted according to the first language model. The prediction result can be: bhi, ne, to, nahi, khud, hi.
- When the spelling form of the input vocabulary is Hindi characters, the subsequent input vocabulary is predicted according to the second language model.
- the prediction result can be
- The second candidate word list may include all words in the prediction result. Alternatively, because the display interface of the computing device is limited, the second candidate word list may include only a third number of words from the prediction result, where the third number is preset.
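- Step 201 can be sketched as choosing a model by script and returning the most frequent followers. Everything here is an assumption made for illustration: the bigram counters stand in for the first and second language models, script detection uses the Devanagari Unicode block, and the names `is_devanagari`, `train_bigram` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def is_devanagari(word):
    """True if any character lies in the Devanagari block (U+0900 to U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in word)


def train_bigram(sentences):
    """Count which word follows which; a stand-in for a real language model."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def predict_next(word, latin_model, hindi_model, third_number=3):
    """Pick the model by script, then return up to `third_number` followers."""
    model = hindi_model if is_devanagari(word) else latin_model
    followers = model.get(word, Counter())
    return [w for w, _ in followers.most_common(third_number)]


# Toy corpora; the sentences are invented for the example.
latin_model = train_bigram(["main bhi", "main bhi", "main ne"])
hindi_model = train_bigram(["मैं भी"])
print(predict_next("main", latin_model, hindi_model))
print(predict_next("मैं", latin_model, hindi_model))
```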
- Step 202 Display the second candidate word list on the input method interface.
- the second candidate word list may be displayed on the input method interface.
- Step 203 Acquire a vocabulary selection operation of the second candidate word list, and input the selected vocabulary as the next input vocabulary.
- the user may select a word from the second candidate word list for input according to actual needs.
- A computing device may be provided with a listener to monitor the selection operation triggered by the user. When such a selection operation is detected, the selected word may be determined according to the selection operation and then entered as the next input word.
- The multi-language mixed input method for Hindi can thus perform error correction, completion and prediction of the input vocabulary while the user is entering words.
- the first candidate word list obtained can be:
- the user can select the vocabulary "Main”, and then predict the subsequent input vocabulary according to the first language model.
- the second candidate word list obtained can be:
- the user can select the vocabulary "bhi", and then predict the subsequent input vocabulary based on the first language model, and the obtained second candidate word list can be:
- Next, the word the user wants to output is the Hindi word, spelled in Hindi characters, that corresponds to "nahi". At this time, the user can enter "nahi".
- The first candidate list obtained can be:
- After applying the first language model and querying the mapping relationship, the first candidate word list obtained can be:
- The user can select the word "meri". After that, the word the user wants to output is the Hindi word, spelled in Hindi characters, that corresponds to "kahani". At this time, the user can enter "kahani"; after applying the first language model and querying the mapping relationship, the first candidate word list obtained can be:
- When a user wants to enter a Hindi word spelled in Hindi characters, the user may not know the spelling rules of the word and may only know part of the Latin character spelling form corresponding to the word.
- For example, the Latin character spelling form of the word the user wants to enter is "Abhishek", but the user only remembers the first part of that spelling, "Abhis".
- The user can enter "Abhis"; after the first language model completes and corrects it and the mapping relationship is queried, the first candidate word list obtained can be:
- FIG. 3 is a schematic flowchart of establishing a language model according to an embodiment of the present invention.
- the process of establishing the first language model may include the following steps:
- Step 301 Obtain corpus data that spells Hindi in the form of Latin characters, and preprocess the corpus data to remove erroneous corpus and low-frequency corpus to obtain a valid corpus.
- corpus data spelling Hindi in the form of Latin characters in India can be collected, and then the corpus data is pre-processed to remove the erroneous corpus and low-frequency corpus to obtain an effective corpus.
- The corpus data can be subjected to preprocessing operations such as removal of interfering non-text information, spell checking and correction, data cleaning, data formatting, and selection of high-frequency words, so as to ensure the performance of the first language model after learning.
- Step 302 Remove redundant parts in the valid corpus data to obtain a collated corpus.
- the redundant part in the effective corpus data can be removed to obtain a collated corpus, thereby reducing the redundancy of the corpus data and the storage space occupied by it, and improving the learning efficiency of the first language model.
- Step 303 Construct a language model by using the corpus.
- the collated corpus when the collated corpus is obtained, the collated corpus may be used to construct a language model.
- a language model in order to avoid data overflow and improve the performance of the language model, logarithms can be used, and addition operations can be used instead of multiplication operations.
- the language model can A language model in the form of N-Gram is an N-gram language model.
- step 303 may specifically include: constructing a language model in the form of N-Gram using the compiled corpus, and calculating the parameters of the language model, wherein the parameters of the language model include: vocabulary in the language model and N-ary vocabulary arrangement In N, the conditional probability of the Nth word with respect to the first N-1 words, where N is a positive integer.
- step 303 may further include: smoothing the conditional probability data, so that the conditional probability corresponding to the N-ary vocabulary arrangement that does not appear in the collated corpus is not zero.
- data smoothing technology can be used to smooth the conditional probability data to reduce the conditional probability corresponding to the N-ary vocabulary arrangement that has appeared in the collated corpus, so that the conditional probability corresponding to the N-ary vocabulary arrangement that does not appear Not zero.
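Step 303 can be illustrated with a minimal sketch. The patent does not name a smoothing technique or fix N, so this example assumes N = 2 (a bigram model) and add-one (Laplace) smoothing; the class name `BigramModel`, the `<s>` start marker, and the tiny corpus are hypothetical. Log-probabilities are summed instead of multiplying raw probabilities, as described above.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram (N = 2) model with add-one smoothing; an assumed instance of step 303."""

    def __init__(self, sentences):
        self.vocab = set()        # vocabulary of the language model
        self.context = Counter()  # how often each word occurs as a left context
        self.bigrams = Counter()  # how often each (previous, word) pair occurs
        for tokens in sentences:
            padded = ["<s>"] + list(tokens)  # "<s>" is a hypothetical start marker
            self.vocab.update(tokens)
            self.context.update(padded[:-1])
            self.bigrams.update(zip(padded, padded[1:]))

    def log_prob(self, prev, word):
        # Add-one smoothing: seen pairs lose a little probability mass,
        # unseen pairs receive a non-zero share.
        numerator = self.bigrams[(prev, word)] + 1
        denominator = self.context[prev] + len(self.vocab)
        return math.log(numerator / denominator)

    def sentence_log_prob(self, tokens):
        # Sum of logarithms replaces a product of probabilities.
        padded = ["<s>"] + list(tokens)
        return sum(self.log_prob(p, w) for p, w in zip(padded, padded[1:]))


model = BigramModel([["meri", "kahani"], ["meri", "maa"]])
print(math.exp(model.log_prob("meri", "kahani")))  # seen pair
print(math.exp(model.log_prob("kahani", "maa")))   # unseen pair, still non-zero
```

With this choice the smoothed probabilities for a given context still sum to one over the vocabulary; other smoothing methods would serve the same stated purpose.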
- the present invention also proposes a multilingual mixed input device oriented to Hindi.
- the implementation of the device may include one or more computing devices.
- the computing device includes a processor and a memory, and the memory stores an application program including computer program instructions executable on the processor.
- the application program can be divided into a plurality of program modules for corresponding functions of each component of the system.
- the division of program modules is logical rather than physical.
- Each program module can run on one or more computing devices, and one computing device can also run one or more program modules.
- the device of the present invention is described in detail according to the functional logic division of the program module.
- FIG. 4 is a schematic structural diagram of a Hindi-oriented multilingual mixed input device according to an embodiment of the present invention.
- the multilingual mixed input device 100 for Hindi may be implemented by using a computing device including a processor and a memory.
- the memory stores program modules that can be executed by the processor. When each program module is executed, the computing device is controlled to implement the corresponding operations and functions.
- the multilingual mixed input device 100 for Hindi includes: an input character acquisition module 101, a first candidate character string generation module 102, a vocabulary mapping module 103, a first candidate word list generation module 104, a first candidate word list display module 105, and a first candidate word input module 106, wherein:
- the input character acquisition module 101 is configured to acquire a Latin character sequence of a current input vocabulary typed in an input method interface.
- the first candidate character string generating module 102 is configured to obtain, according to a first language model, a first candidate character string list in the form of Latin characters corresponding to the Latin character sequence.
- the first language model is a language model that spells Hindi in the form of Latin characters.
- the vocabulary mapping module 103 is configured to obtain a target Hindi vocabulary list according to a mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary, and the target Hindi vocabulary list includes: the Hindi character spelling form corresponding to the Hindi vocabulary in Latin character spelling form in the first candidate character string list.
- the first candidate word list generating module 104 is configured to generate a first candidate word list including vocabulary in the Latin character spelling form and the Hindi character spelling form according to the first candidate character string list and the target Hindi vocabulary list.
- the first candidate word list display module 105 is configured to display the first candidate word list on an input method interface.
- the first candidate word input module 106 is configured to obtain a selection operation of a word in the first candidate word list, and input the selected word as an input word.
- the Hindi-oriented multilingual mixed input device 100 may further include:
- the first candidate character string generating module 102 is specifically configured to: when the Latin character sequence is a Hindi vocabulary in a complete Latin character spelling form, add the Hindi vocabulary corresponding to the Latin character sequence to the first candidate character string list; and obtain an extended option.
- the extended option includes a Hindi word or vocabulary segment in Latin character spelling form that contains the Latin character sequence, and the extended option is added to the first candidate character string list.
- the first candidate character string generating module 102 may be further configured to: when the first language model contains no Hindi vocabulary in Latin character spelling form that contains the Latin character sequence, obtain the Hindi vocabulary in Latin character spelling form with the highest similarity to the Latin character sequence, and add it as an extended option to the first candidate character string list.
- a second candidate word list generating module 107 is configured to predict a subsequent vocabulary of the input vocabulary according to a language model corresponding to the input vocabulary, and generate a second candidate word list according to the prediction result.
- the second candidate word list display module 108 is configured to display the second candidate word list on the input method interface.
- the second candidate word input module 109 is configured to obtain a selection operation of a vocabulary in the second candidate word list, and input the selected vocabulary as a next input vocabulary.
- the second candidate word list generating module 107 is specifically configured to determine whether the spelling form of the input vocabulary is Latin characters or Hindi characters; when the spelling form of the input vocabulary is Latin characters, predict the subsequent input vocabulary according to the first language model; when the spelling form of the input vocabulary is Hindi characters, predict the subsequent input vocabulary according to the second language model, which is a pre-established language model that spells Hindi in the form of Hindi characters.
- the first language model creation module 110 is configured to establish a first language model.
- the first language model creation module 110 includes:
- a corpus acquisition unit 111 is configured to acquire corpus data spelling Hindi in the form of Latin characters, and preprocess the corpus data to remove the erroneous corpus and low-frequency corpus therein to obtain a valid corpus.
- the corpus de-redundancy unit 112 is configured to remove redundant parts in the valid corpus data to obtain a collated corpus.
- the language model constructing unit 113 is configured to construct a language model using the collated corpus.
- the language model constructing unit 113 is specifically configured to: use the collated corpus to construct a language model in the form of N-Gram, and calculate the parameters of the language model, wherein the parameters of the language model include: the vocabulary in the language model and, for each N-gram vocabulary arrangement, the conditional probability of the Nth word given the first N-1 words, N being a positive integer; and smooth the conditional probability data so that the conditional probability corresponding to an N-gram vocabulary arrangement that does not appear in the collated corpus is not zero.
- the multilingual mixed input device for Hindi obtains a Latin character sequence of a current input vocabulary typed in an input method interface, and then obtains, according to a first language model, a first candidate character string list in the form of Latin characters corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters.
- then, according to the pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary, the device obtains the Hindi character spelling form corresponding to the Hindi vocabulary in Latin character spelling form in the first candidate character string list.
- the device generates a first candidate word list including vocabulary in the Latin character spelling form and the Hindi character spelling form, and finally displays the first candidate word list on the input method interface, obtains a selection operation on a word in the first candidate word list, and inputs the selected word as the input vocabulary.
- determining the Hindi character spelling form according to the mapping relationship can improve the accuracy of the output result.
- the present invention also provides a non-transitory computer-readable storage medium.
- the non-transitory computer-readable storage medium stores executable instructions thereon.
- when the executable instructions are run on a processor, the Hindi-oriented multilingual mixed input method proposed in the foregoing embodiments of the present invention is implemented.
- the storage medium may be provided on the device as part of the device; or, when the device can be remotely controlled by a server, the storage medium may be provided on a remote server that controls the device.
- the computer instructions for implementing the method of the present invention may be carried in any combination of one or more computer-readable media.
- the so-called non-transitory computer-readable medium may include any computer-readable medium, except for a transitory propagating signal itself.
- the computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
- a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
- the present invention also provides a computer program product.
- Computer program code for performing the operations of the present invention may be written in one or more programming languages, or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
- the program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
- the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
- the present invention also provides a computing device.
- a computing device includes a memory, a processor, and a computer program stored on the memory and executable on the processor.
- when the processor executes the program, the Hindi-oriented multilingual mixed input method according to the foregoing embodiments of the present invention is implemented.
- the computing device may be implemented by a central control unit of a computer device as part of the function of the central control unit of the computer device. It can also be implemented by a separate computing device, which is communicatively connected with the central control unit of the computer device.
- the implementation of the computing device may include, but is not limited to, a single-chip microcomputer, a programmable logic controller (PLC), a complex programmable logic device (CPLD), a programmable gate array (PGA), a field programmable gate array (FPGA), a dedicated neural network chip, etc.
- the non-transitory computer-readable storage medium, computer program product, and computing device according to the embodiments of the present invention may be implemented with reference to the content specifically described in the foregoing embodiments of the present invention, and have beneficial effects similar to those of the Hindi-oriented multilingual mixed input method proposed in the foregoing embodiments, which are not repeated here.
- the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Therefore, the features defined as "first" and "second" may explicitly or implicitly include at least one of the features. In the description of the present invention, "plurality" means two or more, such as two, three, etc., unless specifically defined otherwise.
- any process or method description in a flowchart or otherwise described herein may be understood to represent a module, fragment, or section of code that includes one or more executable instructions for implementing a particular logical function or process step.
- the scope of the preferred embodiments of the present invention includes additional implementations, in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention pertain.
- a sequenced list of executable instructions that can be considered to implement a logical function can be embodied in any computer-readable medium,
- for use by an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or other system that can fetch and execute instructions from the instruction execution system, apparatus, or device), or in combination with such an instruction execution system, apparatus, or device.
- a "computer-readable medium” may be any device that can contain, store, communicate, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
- each part of the present invention may be implemented by hardware, software, firmware, or a combination thereof.
- multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system.
- for example, if implemented in hardware, as in another embodiment, it may be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits, application-specific integrated circuits with suitable combinational logic gate circuits, programmable gate arrays (PGA), field programmable gate arrays (FPGA), etc.
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
A Hindi-oriented multi-language mixed input method and device, wherein the method comprises: acquiring a Latin character sequence of currently inputted vocabulary entered by means of an input method interface; according to a first language model, acquiring a first candidate character string list in the form of Latin characters that corresponds to the Latin character sequence; acquiring a Hindi character spelling form corresponding to Hindi vocabulary that is in the Latin character spelling form in the first candidate character string list according to the mapping between the Latin character spelling form of Hindi vocabulary and the Hindi character spelling form; generating a first candidate word list comprising the vocabulary in the Latin character spelling form and the Hindi character spelling form; displaying the first candidate word list on the input method interface; acquiring a selection operation for the vocabulary in the first candidate word list, and inputting the selected vocabulary as inputted vocabulary. The method may increase the efficiency of multi-language mixed input, thereby improving user input experience.
Description
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 201810713058.9, entitled "A Multi-Language Mixed Input Method and Device for Hindi", filed by Beijing Jinshan Security Software Co., Ltd. on June 29, 2018.
The present invention relates to the technical field of input methods, and in particular to a Hindi-oriented multilingual mixed input method and device.
With increasingly frequent international exchanges, mixed input of two or even more languages has become increasingly common. India currently has two official languages, English and Hindi, which are written in the Latin alphabet and the Devanagari script, respectively. Indian users therefore need to mix Latin characters and Hindi in their input.
In the prior art, multilingual mixed input is achieved by switching input modes. For example, when a user is typing Latin characters on an English keyboard and wants to enter a Hindi character, the user must switch to the Hindi input method, enter the character, and then switch back to the English keyboard to continue typing Latin characters.
With this approach, the user has to switch input modes back and forth, so multilingual mixed input is inefficient and extremely time-consuming.
Summary of the invention
The present invention provides a Hindi-oriented multilingual mixed input method and device, to solve the technical problem in the prior art that achieving multilingual mixed input by switching input modes is inefficient and extremely time-consuming.
An embodiment of one aspect of the present invention provides a Hindi-oriented multilingual mixed input method, including:
acquiring a Latin character sequence of a current input vocabulary typed in an input method interface;
obtaining, according to a first language model, a first candidate character string list in the form of Latin characters corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in the form of Latin characters;
obtaining a target Hindi vocabulary list according to a pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary, where the target Hindi vocabulary list includes: the Hindi character spelling form corresponding to the Hindi vocabulary in Latin character spelling form in the first candidate character string list;
generating, according to the first candidate character string list and the target Hindi vocabulary list, a first candidate word list including vocabulary in the Latin character spelling form and the Hindi character spelling form;
displaying the first candidate word list on the input method interface;
acquiring a selection operation on a word in the first candidate word list, and inputting the selected word as the input vocabulary.
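The steps above, up to generating the first candidate word list, can be sketched as follows. This is only an illustration under stated assumptions: the first language model is replaced by a frequency-ranked prefix lookup, and the lexicon, its frequencies, the mapping table entries, and both function names are hypothetical rather than taken from the patent.

```python
# Hypothetical stand-in for the first language model: Latin-spelled Hindi
# words with made-up frequencies.
LATIN_LEXICON = {"meri": 120, "mera": 95, "kahani": 80}

# Hypothetical excerpt of the pre-established mapping relationship
# (Latin character spelling form -> Hindi character spelling form).
LATIN_TO_HINDI = {"meri": "मेरी", "mera": "मेरा", "kahani": "कहानी"}


def first_candidate_strings(latin_sequence: str) -> list[str]:
    """Return Latin-spelled candidates for the typed sequence, most frequent first."""
    prefix = latin_sequence.lower()
    matches = [word for word in LATIN_LEXICON if word.startswith(prefix)]
    return sorted(matches, key=lambda word: (-LATIN_LEXICON[word], word))


def first_candidate_words(latin_sequence: str) -> list[str]:
    """Build the first candidate word list with both spelling forms."""
    candidates = []
    for latin in first_candidate_strings(latin_sequence):
        candidates.append(latin)           # Latin character spelling form
        hindi = LATIN_TO_HINDI.get(latin)  # query the mapping relationship
        if hindi is not None:
            candidates.append(hindi)       # Hindi character spelling form
    return candidates


print(first_candidate_words("kahani"))
print(first_candidate_words("me"))
```

Displaying the list and handling the user's selection depend on the input method interface and are left out of the sketch.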
As a first possible implementation of the present invention, obtaining, according to the first language model, the first candidate character string list in the form of Latin characters corresponding to the Latin character sequence includes:
when the Latin character sequence is a Hindi vocabulary in a complete Latin character spelling form, adding the Hindi vocabulary corresponding to the Latin character sequence to the first candidate character string list; and
obtaining an extended option, where the extended option includes: a Hindi word or vocabulary segment in Latin character spelling form that contains the Latin character sequence, and adding the extended option to the first candidate character string list.
As a second possible implementation of the present invention, obtaining, according to the first language model, the first candidate character string list in the form of Latin characters corresponding to the Latin character sequence further includes:
when the first language model contains no Hindi vocabulary in Latin character spelling form that contains the Latin character sequence, obtaining the Hindi vocabulary in Latin character spelling form with the highest similarity to the Latin character sequence, and adding it as an extended option to the first candidate character string list.
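The two implementations above can be sketched together. The patent does not say how similarity is measured, so this example assumes the ratio computed by Python's standard `difflib` module; the lexicon and the function name `candidate_strings` are hypothetical.

```python
import difflib

# Hypothetical lexicon of Hindi words in Latin character spelling form.
LEXICON = ["abhishek", "abhinav", "kahani", "meri"]


def candidate_strings(latin_sequence: str, lexicon=LEXICON) -> list[str]:
    """Build the first candidate character string list for a typed sequence."""
    seq = latin_sequence.lower()
    candidates = []
    # Complete spelling: add the word itself.
    if seq in lexicon:
        candidates.append(seq)
    # Extended options: words that contain the typed sequence.
    candidates.extend(w for w in lexicon if seq in w and w != seq)
    # Fallback: nothing contains the sequence, so take the most similar word.
    # cutoff=0.0 disables the similarity threshold (an assumed choice).
    if not candidates:
        candidates.extend(difflib.get_close_matches(seq, lexicon, n=1, cutoff=0.0))
    return candidates


print(candidate_strings("Abhis"))  # partial spelling, found by containment
print(candidate_strings("kahni"))  # misspelling, found by similarity
```

A production input method would likely rank extended options by language model probability rather than lexicon order; the sketch keeps lexicon order for simplicity.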
As a third possible implementation of the present invention, after acquiring the selection operation on a word in the first candidate word list and inputting the selected word as the input vocabulary, the method further includes:
predicting a subsequent vocabulary of the input vocabulary according to the language model corresponding to the input vocabulary, and generating a second candidate word list according to the prediction result;
displaying the second candidate word list on the input method interface;
acquiring a selection operation on a word in the second candidate word list, and inputting the selected word as the next input vocabulary.
As a fourth possible implementation of the present invention, predicting the subsequent vocabulary of the input vocabulary according to the language model corresponding to the input vocabulary, and generating the second candidate word list according to the prediction result, includes:
determining whether the spelling form of the input vocabulary is Latin characters or Hindi characters;
when the spelling form of the input vocabulary is Latin characters, predicting the subsequent input vocabulary according to the first language model;
when the spelling form of the input vocabulary is Hindi characters, predicting the subsequent input vocabulary according to a second language model, where the second language model is a pre-established language model that spells Hindi in the form of Hindi characters.
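A minimal sketch of this model selection follows. It assumes that "Hindi characters" can be recognized as code points in the Unicode Devanagari block (U+0900 to U+097F) and that each language model is reduced to a table of follower counts; the function names and the sample tables are hypothetical.

```python
def spelling_script(word: str) -> str:
    """Classify a word as 'hindi' if it contains any Devanagari code point."""
    for ch in word:
        if "\u0900" <= ch <= "\u097f":  # Unicode Devanagari block
            return "hindi"
    return "latin"


def predict_next(word: str, latin_model: dict, hindi_model: dict, k: int = 3) -> list[str]:
    """Pick the model matching the word's script and return its top-k followers."""
    model = hindi_model if spelling_script(word) == "hindi" else latin_model
    followers = model.get(word, {})
    return sorted(followers, key=lambda w: (-followers[w], w))[:k]


# Hypothetical follower counts standing in for the two language models.
first_model = {"meri": {"kahani": 5, "maa": 3}}
second_model = {"मेरी": {"कहानी": 4}}

print(predict_next("meri", first_model, second_model))
print(predict_next("मेरी", first_model, second_model))
```

The returned list plays the role of the second candidate word list; how mixed-script words should be classified is not addressed in the source and is left to the simple rule above.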
As a fifth possible implementation of the present invention, in obtaining, according to the first language model, the first candidate character string list in the form of Latin characters corresponding to the Latin character sequence, the first language model is a pre-established language model that spells Hindi in the form of Latin characters, wherein
the pre-establishment of the first language model includes:
acquiring corpus data spelling Hindi in the form of Latin characters, and preprocessing the corpus data to remove erroneous corpus and low-frequency corpus therein to obtain a valid corpus;
removing redundant parts in the valid corpus data to obtain a collated corpus;
constructing a language model using the collated corpus.
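The first two steps of this pre-establishment can be sketched as below. The sketch makes several assumptions not stated in the source: non-text interference is anything other than the letters a to z and whitespace, "low-frequency" means fewer than `min_count` occurrences, and redundancy means exact duplicate sentences. Spell checking is omitted, and the function name `build_collated_corpus` is hypothetical.

```python
import re
from collections import Counter


def build_collated_corpus(raw_sentences, min_count=2):
    """Clean, filter, and de-duplicate Latin-spelled Hindi sentences."""
    # Remove non-text interference and normalize case.
    cleaned = []
    for sentence in raw_sentences:
        text = re.sub(r"[^a-z\s]", " ", sentence.lower())
        tokens = text.split()
        if tokens:
            cleaned.append(tokens)

    # Drop low-frequency words to obtain the valid corpus.
    counts = Counter(token for tokens in cleaned for token in tokens)
    valid = [[t for t in tokens if counts[t] >= min_count] for tokens in cleaned]

    # Remove redundant (duplicate or emptied) sentences.
    seen = set()
    collated = []
    for tokens in valid:
        key = tuple(tokens)
        if tokens and key not in seen:
            seen.add(key)
            collated.append(tokens)
    return collated


raw = ["Meri kahani!!", "meri kahani", "meri zzzq kahani", "http://x 123"]
print(build_collated_corpus(raw))
```

The resulting token lists are in the shape an N-gram model builder would typically consume.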
As a sixth possible implementation of the present invention, constructing the language model using the collated corpus includes:
constructing a language model in the form of N-Gram using the collated corpus, and calculating the parameters of the language model, where the parameters of the language model include: the vocabulary in the language model and, for each N-gram vocabulary arrangement, the conditional probability of the Nth word given the first N-1 words, N being a positive integer; and
smoothing the conditional probability data, so that the conditional probability corresponding to an N-gram vocabulary arrangement that does not appear in the collated corpus is not zero.
The Hindi-oriented multilingual mixed input method according to the embodiment of the present invention acquires a Latin character sequence of a current input vocabulary typed in an input method interface, and then obtains, according to a first language model, a first candidate character string list in the form of Latin characters corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in the form of Latin characters. Next, according to a pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary, the method obtains the Hindi character spelling form corresponding to the Hindi vocabulary in Latin character spelling form in the first candidate character string list, and generates, from the first candidate character string list and those Hindi character spelling forms, a first candidate word list including vocabulary in the Latin character spelling form and the Hindi character spelling form. Finally, the first candidate word list is displayed on the input method interface, a selection operation on a word in the first candidate word list is acquired, and the selected word is input as the input vocabulary. As a result, the user's need to input Hindi and Latin characters together is met without frequently switching input modes, which improves the efficiency of multilingual mixed input and the user's input experience. In addition, determining the Hindi character spelling form according to the mapping relationship can improve the accuracy of the output result.
An embodiment of another aspect of the present invention provides a Hindi-oriented multilingual mixed input device, including:
an input character acquisition module, configured to acquire a Latin character sequence of a current input vocabulary typed in an input method interface;
a first candidate character string generating module, configured to obtain, according to a first language model, a first candidate character string list in the form of Latin characters corresponding to the Latin character sequence, where the first language model is a language model that spells Hindi in the form of Latin characters;
a vocabulary mapping module, configured to obtain a target Hindi vocabulary list according to a pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi vocabulary, where the target Hindi vocabulary list includes: the Hindi character spelling form corresponding to the Hindi vocabulary in Latin character spelling form in the first candidate character string list;
a first candidate word list generating module, configured to generate, according to the first candidate character string list and the Hindi character spelling form corresponding to the Hindi vocabulary in Latin character spelling form in the first candidate character string list, a first candidate word list including vocabulary in the Latin character spelling form and the Hindi character spelling form;
a first candidate word list display module, configured to display the first candidate word list on the input method interface;
a first candidate word input module, configured to acquire a selection operation on a word in the first candidate word list, and input the selected word as the input vocabulary.
As a first possible implementation of the present invention, the first candidate character string generating module is specifically configured to:
when the Latin character sequence is a Hindi vocabulary in a complete Latin character spelling form, add the Hindi vocabulary corresponding to the Latin character sequence to the first candidate character string list; and
obtain an extended option, where the extended option includes: a Hindi word or vocabulary segment in Latin character spelling form that contains the Latin character sequence, and add the extended option to the first candidate character string list.
As a second possible implementation of the present invention, the first candidate character string generating module is further configured to:
when the first language model contains no Hindi vocabulary in Latin character spelling form that contains the Latin character sequence, obtain the Hindi vocabulary in Latin character spelling form with the highest similarity to the Latin character sequence, and add it as an extended option to the first candidate character string list.
As a third possible implementation of the present invention, the device further includes:
A second candidate word list generating module, configured to predict the words that follow the input word according to the language model corresponding to the input word, and to generate a second candidate word list from the prediction result;
A second candidate word list display module, configured to display the second candidate word list on the input method interface;
A second candidate word input module, configured to obtain a selection operation on a word in the second candidate word list and to input the selected word as the next input word.
As a fourth possible implementation of the present invention, the second candidate word list generating module is specifically configured to:
determine whether the input word is spelled in Latin characters or in Hindi characters;
when the input word is spelled in Latin characters, predict the subsequent input word according to the first language model;
when the input word is spelled in Hindi characters, predict the subsequent input word according to a second language model, the second language model being a pre-established language model of Hindi spelled in Hindi characters.
As a fifth possible implementation of the present invention, the device further includes:
A first language model creation module, configured to establish the first language model, the first language model creation module including:
A corpus acquisition unit, configured to acquire corpus data of Hindi spelled in Latin characters and to preprocess the corpus data to remove erroneous and low-frequency entries, yielding a valid corpus;
A corpus de-redundancy unit, configured to remove redundant parts of the valid corpus, yielding a collated corpus;
A language model building unit, configured to build a language model from the collated corpus.
As a sixth possible implementation of the present invention, the language model building unit is specifically configured to:
build an N-gram language model from the collated corpus and compute its parameters, where the parameters include the words in the language model and, for each N-word sequence, the conditional probability of the N-th word given the preceding N-1 words, N being a positive integer; and
smooth the conditional probabilities so that the conditional probability of any N-word sequence that does not occur in the collated corpus is non-zero.
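The two operations above (estimating conditional probabilities and smoothing them) can be sketched for the bigram case, N = 2. The source does not name a smoothing technique; add-one (Laplace) smoothing is used here purely as an assumption, and the function names and sentence boundary markers are hypothetical.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical sentence boundary markers

def train_bigram(sentences):
    """Count contexts and bigrams over tokenised sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), {EOS}
    for tokens in sentences:
        padded = [BOS] + list(tokens) + [EOS]
        vocab.update(tokens)
        contexts.update(padded[:-1])             # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # adjacent word pairs
    return contexts, bigrams, vocab

def cond_prob(word, prev, contexts, bigrams, vocab):
    """P(word | prev) with add-one smoothing, so unseen pairs stay non-zero."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

corpus = [["main", "bhi", "nahi"], ["main", "nahi"]]
contexts, bigrams, vocab = train_bigram(corpus)
p_seen = cond_prob("bhi", "main", contexts, bigrams, vocab)
p_unseen = cond_prob("main", "bhi", contexts, bigrams, vocab)
```

With this toy corpus the pair ("bhi", "main") never occurs, yet `p_unseen` is still positive, which is the property the smoothing step is meant to guarantee.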
In the Hindi-oriented multilingual mixed input device of the embodiments of the present invention, the Latin character sequence of the current input word typed on the input method interface is obtained; a first candidate character string list in Latin-character form corresponding to the Latin character sequence is then obtained according to a first language model, the first language model being a pre-established language model of Hindi spelled in Latin characters; next, according to a pre-established mapping between the Latin-character spellings and the Hindi-character spellings of Hindi words, the Hindi-character spellings corresponding to the Latin-spelled Hindi words in the first candidate character string list are obtained; a first candidate word list containing words in both Latin-character and Hindi-character spelling is generated from the first candidate character string list and those Hindi-character spellings; finally, the first candidate word list is displayed on the input method interface, a selection operation on a word in the list is obtained, and the selected word is input as the input word. The user's need to mix Hindi and Latin-character input is thus met without frequent switching of input modes, which improves the efficiency of multilingual mixed input and the user's input experience. In addition, determining the Hindi-character spelling from the mapping can improve the accuracy of the output.
An embodiment of a further aspect of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the Hindi-oriented multilingual mixed input method proposed by the above embodiments of the present invention.
To achieve the above object, an embodiment of a fourth aspect of the present invention provides a computer program product; when instructions in the computer program product are executed by a processor, the Hindi-oriented multilingual mixed input method proposed by the above embodiments of the present invention is implemented.
To achieve the above object, an embodiment of a fifth aspect of the present invention provides a computing device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the Hindi-oriented multilingual mixed input method proposed by the above embodiments of the present invention is implemented.
The non-transitory computer-readable storage medium, computer program product, and computing device according to the third to fifth aspects of the present invention have beneficial effects similar to those of the Hindi-oriented multilingual mixed input method and device according to the first and second aspects, which are not repeated here.
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of the Hindi-oriented multilingual mixed input method provided by Embodiment 1 of the present invention;
FIG. 2 is a schematic flowchart of word-association input in the Hindi-oriented multilingual mixed input method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of establishing a language model according to an embodiment of the present invention;
FIG. 4 is a structural block diagram of a Hindi-oriented multilingual mixed input device according to an embodiment of the present invention;
FIG. 5 is a structural block diagram of a Hindi-oriented multilingual mixed input device according to an embodiment of the present invention.
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
At present, a user's need for multilingual mixed input can be met in the following three ways.
In the first way, multilingual mixed input is achieved by switching input modes. For example, when the user is entering Latin characters on the English keyboard and wants to enter a Hindi character, the user must switch to the Hindi input method, enter the character, and then switch back to the English keyboard to continue entering Latin characters.
In the second way, a preset operation enters a temporary input mode in which the user can type characters of the second language. For example, in Chinese-English input methods, the user can switch the input method by pressing the Shift key.
In the third way, some input methods support two encoding schemes within the language model at the same time; that is, the most suitable encoding rule is selected automatically according to the user's input, and the characters are displayed accordingly.
In the first way, multilingual mixed input is inefficient. In the second way, characters require special handling after the temporary input mode is entered, which lengthens the development cycle. In the third way, when the encoding schemes of the two languages differ only slightly, the accuracy of the language model's output is low.
The present invention mainly addresses the technical problems of inefficient multilingual mixed input and inaccurate output in the prior art, and proposes a Hindi-oriented multilingual mixed input method.
In the Hindi-oriented multilingual mixed input method of the embodiments of the present invention, the Latin character sequence of the current input word typed on the input method interface is obtained; a first candidate character string list in Latin-character form corresponding to the Latin character sequence is then obtained according to a first language model, the first language model being a pre-established language model of Hindi spelled in Latin characters; next, according to a pre-established mapping between the Latin-character spellings and the Hindi-character spellings of Hindi words, the Hindi-character spellings corresponding to the Latin-spelled Hindi words in the first candidate character string list are obtained; a first candidate word list containing words in both Latin-character and Hindi-character spelling is generated from the first candidate character string list and those Hindi-character spellings; finally, the first candidate word list is displayed on the input method interface, a selection operation on a word in the list is obtained, and the selected word is input as the input word. The user's need to mix Hindi and Latin-character input is thus met without frequent switching of input modes, which improves the efficiency of multilingual mixed input and the user's input experience. In addition, determining the Hindi-character spelling from the mapping can improve the accuracy of the output.
The Hindi-oriented multilingual mixed input method and device of the embodiments of the present invention are described in detail below with reference to the drawings. Before the embodiments are described, a commonly used technical term is introduced for ease of understanding:
An N-gram language model is based on the following assumption: the occurrence of the n-th word depends only on the preceding n-1 words and on no other word, and the probability of a whole sentence equals the product of the probabilities of its individual words. The probability of each word can be obtained by statistical calculation over corpus data.
Assuming that a sentence T consists of the word sequence $w_1, w_2, w_3, \ldots, w_N$, the N-gram language model can be expressed by the following formula:
$$P(w_N \mid w_1, \ldots, w_{N-1})$$
The above formula indicates that the probability of the N-th word is determined by the words $w_1, w_2, w_3, \ldots, w_{N-1}$ that have already appeared. In this process, the preceding words are used to predict the next word, and observation of a large amount of text gives the likelihood that a predicted word follows the words that have already appeared. The resulting language model can therefore be regarded as an (n-1)-order Markov model, that is, an N-gram language model. For input method applications, unlike applications such as machine translation, understanding long sentences and predicting word order is usually unnecessary, so N generally takes a value such as 2, 3, or 4.
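The N-gram assumption can be written out as a worked equation. The form below is the standard chain-rule factorisation followed by the Markov approximation; using lowercase $n$ for the model order and dropping indices below 1 are notational assumptions, not taken from the source.

```latex
P(T) = \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1})
     \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
```

For a bigram model ($n = 2$), each factor reduces to $P(w_i \mid w_{i-1})$.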
FIG. 1 is a schematic flowchart of a Hindi-oriented multilingual mixed input method according to an embodiment of the present invention.
The Hindi-oriented multilingual mixed input method provided by the embodiments of the present invention may be implemented by the Hindi-oriented multilingual mixed input device provided by the embodiments of the present invention. The device may be configured in any computing device so that the computing device provides Hindi-oriented multilingual mixed input.
The computing device may be, for example, a hardware device such as a personal computer (PC), a cloud device, or a mobile device. The mobile device may be, for example, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or another hardware device with an operating system, a touch screen, and/or a display screen.
As shown in FIG. 1, the Hindi-oriented multilingual mixed input method includes the following steps:
Step 101: obtain the Latin character sequence of the current input word typed on the input method interface.
In the embodiments of the present invention, the computing device may provide an input method interface through which the user types the Latin character sequence. For example, when the computing device is a mobile phone, the user can type the Latin character sequence on the touch screen; when the computing device is a PC, the user can type it on the keyboard.
Optionally, a listener may be set up in the computing device to monitor typing operations triggered by the user. When a typing operation is detected, the Latin character sequence of the current input word typed on the input method interface is obtained from that operation. For example, when the user wants to input the word for "mobile phone", the user can type "mobile" on the input method interface.
Step 102: according to a first language model, obtain a first candidate character string list in Latin-character form corresponding to the Latin character sequence, the first language model being a pre-established language model of Hindi spelled in Latin characters.
In the embodiments of the present invention, the first language model is a pre-established language model of Hindi spelled in Latin characters. For example, corpus data of Hindi spelled in Latin characters can be acquired, and a language model can then be built from that corpus data to obtain the first language model.
In the embodiments of the present invention, once the Latin character sequence has been obtained, it can be fed into the first language model to obtain the first candidate character string list in Latin-character form corresponding to the Latin character sequence.
Specifically, when the Latin character sequence is a complete Latin-character spelling of a Hindi word, the Hindi word corresponding to the Latin character sequence can be added directly to the first candidate character string list. When the Latin character sequence is an incomplete Latin-character spelling of a Hindi word, extension options can be obtained in order to improve the user's input efficiency or to correct and complete the typed sequence. The extension options include Latin-spelled Hindi words or word fragments that contain the Latin character sequence; the extension options are then added to the first candidate character string list.
Users sometimes make spelling mistakes, so in some embodiments the input method may also provide error correction. That is, obtaining the first candidate character string list in Latin-character form corresponding to the Latin character sequence according to the first language model may further include: when the first language model contains no Latin-spelled Hindi word that contains the Latin character sequence, obtaining the Latin-spelled Hindi word with the highest similarity to the Latin character sequence and adding it to the first candidate character string list as an extension option.
For example, suppose the sentence the user wants to input is "Main bhi [Hindi-script word] meri [Hindi-script word] [Hindi-script word]", whose fully Latin-spelled form is "Main bhi nahi meri kahani hai". If the first word the user types is "Mai", the output of the first language model, that is, the extension options, may be: Mai, Nai, Main, Maine.
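The candidate generation of step 102 (exact match, extension options, and the similarity fallback) can be sketched in Python as follows. This is an illustration under stated assumptions: the first language model is reduced to a plain word list `LEXICON`, the function name is hypothetical, and similarity is measured with the standard library's `difflib`, whereas the source does not specify a similarity measure.

```python
import difflib

# Hypothetical stand-in for the vocabulary of the first language model.
LEXICON = ["main", "maine", "nai", "nahi", "bhi", "meri", "kahani", "hai"]

def first_candidate_strings(typed, lexicon=LEXICON):
    """Return Latin-spelled candidates for the typed Latin character sequence."""
    typed = typed.lower()
    candidates = []
    if typed in lexicon:
        candidates.append(typed)  # complete spelling: add it directly
    # Extension options: lexicon words that contain the typed sequence.
    candidates += [w for w in lexicon if typed in w and w != typed]
    if not candidates:
        # Error correction: fall back to the single most similar word.
        candidates = difflib.get_close_matches(typed, lexicon, n=1, cutoff=0.0)
    return candidates

prefix_result = first_candidate_strings("mai")
typo_result = first_candidate_strings("kahni")
```

In this sketch "mai" is extended to the words that contain it, while the misspelling "kahni" matches no word and falls through to the similarity branch.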
Step 103: according to a pre-established mapping between the Latin-character spellings and the Hindi-character spellings of Hindi words, obtain a target Hindi word list, which may include the Hindi-character spellings corresponding to the Latin-spelled Hindi words in the first candidate character string list.
In the embodiments of the present invention, a mapping between the Latin-character spellings and the Hindi-character spellings of Hindi words can be established in advance. The Latin-character spelling of a Hindi word takes one of two forms. The first is a Latin-character spelling obtained by transliterating the pronunciation of the Hindi-character spelling. For example, the Hindi-script word [Hindi-script word] corresponds to the Latin spelling "dena"; "dena" has no meaning in other contexts and is meaningful as input only when the user wants to obtain that Hindi-script word. The second form is certain English words that do not occur in Hindi; for example, Hindi has no native counterpart of the English word "mobile".
Establishing the mapping between the Latin-character spellings and the Hindi-character spellings of Hindi words, for example the mapping between "mobile" and [Hindi-script word], ensures that the relationship between the two spellings is one-to-one. Once the first candidate character string list in Latin-character form has been determined, the Hindi-character spellings corresponding to the Latin-spelled Hindi words in that list can be obtained by querying the mapping, which is simple to operate and easy to implement. Determining the corresponding Hindi-character spelling from a pre-established mapping can also further improve the accuracy of the output.
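The lookup in step 103 amounts to a dictionary query. In the minimal sketch below, the table `LATIN_TO_HINDI`, its two Devanagari entries, and both function names are illustrative assumptions; the source names the Latin spellings "dena" and "mobile" but its Hindi-script counterparts are not reproduced in this text.

```python
# Hypothetical mapping table; the Devanagari values are illustrative
# assumptions supplied for this sketch, not quoted from the source.
LATIN_TO_HINDI = {
    "dena": "देना",
    "mobile": "मोबाइल",
}

def is_one_to_one(mapping):
    """Check that no two Latin spellings share one Hindi-script spelling."""
    return len(set(mapping.values())) == len(mapping)

def to_hindi_spellings(latin_candidates, mapping=LATIN_TO_HINDI):
    """Return the Hindi-script form of each candidate that has a mapping entry."""
    return [mapping[w] for w in latin_candidates if w in mapping]

target_list = to_hindi_spellings(["dena", "xyz", "mobile"])
```

Candidates without an entry (here the made-up string "xyz") are simply skipped, so the target Hindi word list may be shorter than the candidate character string list.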
Step 104: according to the first candidate character string list and the target Hindi word list, generate a first candidate word list containing words in both Latin-character and Hindi-character spelling.
In the embodiments of the present invention, after the Hindi-character spellings corresponding to the Latin-spelled Hindi words in the first candidate character string list have been obtained, the first candidate word list containing words in both spellings can be generated from the first candidate character string list and those Hindi-character spellings.
Optionally, the first candidate word list may include all of the Latin-spelled Hindi words in the first candidate character string list together with the corresponding words in Hindi-character spelling.
Further, because the display area of the computing device is limited, a first number of Latin-spelled Hindi words and a second number of the corresponding Hindi-spelled words may be selected from the first candidate character string list, and the first candidate word list is then generated from the selected words. The first number and the second number may be the same or different; for example, the first number may be 2 and the second number may be 3.
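The selection described above can be sketched as follows. The function name, the parameter names `n_latin` and `n_hindi` (standing for the first and second numbers), the demo mapping with its Devanagari values, and the choice to list Latin spellings before Hindi spellings are all assumptions made for illustration; the source does not fix the display order.

```python
def build_first_candidate_list(latin_candidates, mapping, n_latin=2, n_hindi=3):
    """Merge a limited number of Latin and Hindi-script spellings into one list."""
    latin_part = latin_candidates[:n_latin]
    hindi_part = [mapping[w] for w in latin_candidates if w in mapping][:n_hindi]
    return latin_part + hindi_part  # ordering is an assumption of this sketch

# Illustrative mapping entries, not taken from the source.
demo_mapping = {"main": "मैं", "nahi": "नहीं", "hai": "है"}
demo_list = build_first_candidate_list(
    ["main", "maine", "nahi", "hai"], demo_mapping, n_latin=2, n_hindi=3
)
```

With the default limits the result holds at most five entries, which matches the example values of 2 and 3 given above.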
Step 105: display the first candidate word list on the input method interface.
In the embodiments of the present invention, to meet the user's need to mix Hindi and Latin-character input, the first candidate word list can be displayed on the input method interface once it has been obtained.
Continuing the above example, when the Latin character sequence typed by the user is "Mai", the first candidate word list displayed on the input method interface may be: Mai, [Hindi-script word], Nai, [Hindi-script word], Main, Maine.
Step 106: obtain a selection operation on a word in the first candidate word list, and input the selected word as the input word.
In the embodiments of the present invention, the selection operation is triggered by the user. It may be, for example, a click, or the operation corresponding to pressing a number key or the space bar on the keyboard; no limitation is imposed here.
Specifically, after the first candidate word list is displayed on the input method interface, the user can select a word from it according to actual need. A listener may be set up in the computing device to monitor selection operations triggered by the user. When a selection operation is detected, the selected word is determined from that operation and is then input as the input word.
Continuing the above example, the user can select "Main" as the input word.
It should be noted that the present invention takes mixed input of Hindi and Latin-character text as an example but is not limited to it; those skilled in the art can, on the basis of the present invention, implement mixed input of any two languages, so the approach is highly extensible.
In the Hindi-oriented multilingual mixed input method of the embodiments of the present invention, the Latin character sequence of the current input word typed on the input method interface is obtained; a first candidate character string list in Latin-character form corresponding to the Latin character sequence is then obtained according to a first language model, the first language model being a pre-established language model of Hindi spelled in Latin characters; next, according to a pre-established mapping between the Latin-character spellings and the Hindi-character spellings of Hindi words, the Hindi-character spellings corresponding to the Latin-spelled Hindi words in the first candidate character string list are obtained; a first candidate word list containing words in both Latin-character and Hindi-character spelling is generated from the first candidate character string list and those Hindi-character spellings; finally, the first candidate word list is displayed on the input method interface, a selection operation on a word in the list is obtained, and the selected word is input as the input word. The user's need to mix Hindi and Latin-character input is thus met without frequent switching of input modes, which improves the efficiency of multilingual mixed input and the user's input experience. In addition, determining the Hindi-character spelling from the mapping can improve the accuracy of the output.
As a possible implementation, to improve the user's input efficiency, the words that follow the input word can be predicted after the selected word has been input. The user can then enter the next word from the prediction result without typing it manually, which further improves the efficiency of multilingual mixed input. This process is described in detail below with reference to FIG. 2.
FIG. 2 is a schematic flowchart of word-association input in the Hindi-oriented multilingual mixed input method according to an embodiment of the present invention.
As shown in FIG. 2, on the basis of the embodiment shown in FIG. 1, after step 106 the Hindi-oriented multilingual mixed input method may further include the following steps:
Step 201: according to the language model corresponding to the input word, predict the words that follow the input word, and generate a second candidate word list from the prediction result.
Specifically, when the input word is spelled in Latin characters, the subsequent input word can be predicted according to the first language model; when the input word is spelled in Hindi characters, the subsequent input word is predicted according to a second language model, the second language model being a pre-established language model of Hindi spelled in Hindi characters. For example, corpus data of Hindi spelled in Hindi characters can be acquired, and a language model can then be built from that corpus data to obtain the second language model.
For example, when the input word is "Main", its spelling is in Latin characters, so the subsequent input word is predicted according to the first language model, and the prediction result may be: bhi, ne, to, nahi, khud, hi.
When the input word is a word spelled in Hindi characters [Hindi-character example not reproduced], the following words are predicted with the second language model, and the prediction result is likewise a list of words spelled in Hindi characters [not reproduced].
In the embodiment of the present invention, the second candidate word list may include all the words in the prediction result. Further, because the display area of the computing device is limited, the second candidate word list may instead include only a third number of words from the prediction result, where the third number is preset.
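Step 201 can be sketched as follows. This is a minimal illustration and not the implementation of the present invention: the function names, the use of the Devanagari Unicode block (U+0900 to U+097F) to detect Hindi characters, and the toy prediction tables are all assumptions made for the example. Only the "Main" prediction list is taken from the text above.

```python
from typing import Callable, List


def is_hindi_script(word: str) -> bool:
    """Assumption: a word counts as Hindi-character if it contains any
    code point from the Devanagari block (U+0900..U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in word)


def second_candidate_list(
    input_word: str,
    first_model: Callable[[str], List[str]],   # Hindi spelled in Latin characters
    second_model: Callable[[str], List[str]],  # Hindi spelled in Hindi characters
    third_number: int = 6,                     # preset display limit
) -> List[str]:
    """Pick the model that matches the script of the input word, then
    keep at most `third_number` predictions."""
    model = second_model if is_hindi_script(input_word) else first_model
    return model(input_word)[:third_number]


# Toy stand-ins for the two language models (hypothetical data,
# except the "Main" row, which repeats the example in the text).
latin_table = {"Main": ["bhi", "ne", "to", "nahi", "khud", "hi"]}
hindi_table = {"कहानी": ["है"]}


def first_model(word: str) -> List[str]:
    return latin_table.get(word, [])


def second_model(word: str) -> List[str]:
    return hindi_table.get(word, [])
```

Calling `second_candidate_list("Main", first_model, second_model, third_number=3)` routes the word to the Latin-character model and truncates the result to three entries.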
Step 202: display the second candidate word list on the input method interface.
In the embodiment of the present invention, after the second candidate word list is generated, it can be displayed on the input method interface.
Step 203: obtain a selection operation on a word in the second candidate word list, and enter the selected word as the next input word.
In the embodiment of the present invention, once the second candidate word list is displayed on the input method interface, the user can select a word from it according to actual needs. The computing device may be provided with a listener that monitors selection operations triggered by the user. When a selection operation is detected, the selected word is determined from the operation and is then entered as the next input word.
As an application scenario, when a user wants to efficiently enter a sentence that mixes words spelled in Latin characters with words spelled in Hindi characters, the Hindi-oriented multilingual mixed input method of the embodiment of the present invention can correct, complete, and predict words while the user types.
Suppose the user wants to enter the sentence "Main bhi nahi meri kahani hai" with "Main", "bhi", and "meri" in Latin characters and "nahi", "kahani", and "hai" in Hindi characters. (The Hindi-character words and the candidate lists containing them are not reproduced in the steps below.)
1) When the user types "Mai", the first language model completes and corrects it and the mapping relationship is queried, giving a first candidate word list [list not reproduced].
2) The user can select "Main". The following words are then predicted with the first language model, and the resulting second candidate word list may be:
bhi, ne, to, nahi, khud, hi
3) The user can select "bhi". The following words are then predicted with the first language model, and the resulting second candidate word list may be:
nahi, bhi, to, ho, hai, na
4) The user wants the next word to be the Hindi-character spelling of "nahi". The user can type "nahi"; after the first language model is applied and the mapping relationship is queried, a first candidate word list is obtained [list not reproduced].
5) The user can select the Hindi-character spelling of "nahi". The user then types "meri"; after the first language model is applied and the mapping relationship is queried, a first candidate word list is obtained [list not reproduced].
6) The user can select "meri". Next, the user wants the Hindi-character spelling of "kahani". The user can type "kahani"; after the first language model is applied and the mapping relationship is queried, a first candidate word list is obtained [list not reproduced].
7) The user can select the Hindi-character spelling of "kahani". The following words are then predicted with the second language model, giving a second candidate word list [list not reproduced].
8) The user can select the Hindi-character spelling of "hai", which completes the input. In this way, the user's input efficiency can be effectively improved.
As another application scenario, suppose the user wants to enter a Hindi word spelled in Hindi characters but does not know how the word is spelled and knows only part of its Latin-character spelling. For example, the user wants to enter the Hindi-character word whose Latin-character spelling is "Abhishek", but remembers only the first part of that spelling, "Abhis".
1) The user can type "Abhis". The first language model completes and corrects it and the mapping relationship is queried, giving a first candidate word list [list not reproduced].
2) The user can select the Hindi-character spelling of "Abhishek", which completes the input. In this way, the user's input efficiency can be effectively improved, and the character string can be entered without interruption.
As a possible implementation, refer to FIG. 3, which is a schematic flowchart of establishing a language model according to an embodiment of the present invention. The process of establishing the first language model may include the following steps:
Step 301: obtain corpus data of Hindi spelled in Latin characters, and preprocess the corpus data to remove erroneous and low-frequency entries, obtaining a valid corpus.
In the embodiment of the present invention, corpus data of Hindi spelled in Latin characters can be collected in India and then preprocessed to remove erroneous and low-frequency entries, yielding a valid corpus. For example, the preprocessing may include removing non-text noise, spell checking and correction, data cleaning, format normalization, and selection of high-frequency words, so as to safeguard the performance of the trained first language model.
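A rough sketch of the preprocessing in step 301 is given below. It covers only two of the listed operations, removal of non-text noise and selection of high-frequency words. Spell checking and correction are omitted. The function name `preprocess`, the letters-only tokenizer, and the `min_count` threshold are assumptions made for the example.

```python
import re
from collections import Counter
from typing import List


def preprocess(lines: List[str], min_count: int = 2) -> List[List[str]]:
    """Tokenize romanized-Hindi lines, strip non-letter noise, and drop
    words seen fewer than `min_count` times in the whole corpus."""
    token_re = re.compile(r"[a-z]+")  # keeps Latin letters only
    tokenized = [token_re.findall(line.lower()) for line in lines]

    # Corpus-wide word frequencies, used to filter low-frequency words.
    counts = Counter(tok for toks in tokenized for tok in toks)

    cleaned = []
    for toks in tokenized:
        kept = [t for t in toks if counts[t] >= min_count]
        if kept:  # discard lines that become empty after filtering
            cleaned.append(kept)
    return cleaned


raw = ["Main bhi nahi!!", "main bhi 123 http", "xyzzy"]
valid_corpus = preprocess(raw)
```

With this toy input only "main" and "bhi" occur at least twice, so every other token is discarded and the third line disappears entirely.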
Step 302: remove the redundant parts of the valid corpus data to obtain a collated corpus.
It should be understood that the valid corpus data often contains a large amount of redundant information. Building the language model directly from it would seriously reduce the learning efficiency of the first language model. Therefore, in the present invention, the redundant parts of the valid corpus data can be removed to obtain a collated corpus, which reduces the redundancy of the corpus data and the storage space it occupies and improves the learning efficiency of the first language model.
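The text does not say what counts as redundant. As one possible reading, the sketch below treats exact duplicate sentences as the redundant part and keeps the first occurrence of each; the function name `deduplicate` and this interpretation are assumptions.

```python
from typing import List


def deduplicate(sentences: List[List[str]]) -> List[List[str]]:
    """Keep the first occurrence of each tokenized sentence, preserving order."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = tuple(sentence)  # lists are unhashable, tuples are not
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


collated = deduplicate([["main", "bhi"], ["main", "bhi"], ["meri", "kahani"]])
```

Note that dropping duplicates also changes the word counts a later model would see, so a real system would have to decide whether frequency information should be preserved separately.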
Step 303: build a language model from the collated corpus.
In the embodiment of the present invention, once the collated corpus is obtained, it can be used to build the language model. When building the language model, to avoid data overflow and improve the performance of the language model, the probabilities can be converted to logarithms so that additions replace multiplications.
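The effect of working with logarithms can be seen in a small experiment. The sketch below multiplies one hundred small probabilities directly, which falls below the range of a double-precision float, and compares that with summing their logarithms. The probability values and the function name are invented for the illustration.

```python
import math
from typing import Iterable


def sequence_log_prob(probs: Iterable[float]) -> float:
    """Score a word sequence by adding log-probabilities instead of
    multiplying the probabilities themselves."""
    return sum(math.log(p) for p in probs)


probs = [1e-5] * 100  # hypothetical per-word conditional probabilities

product = 1.0
for p in probs:
    product *= p      # 1e-500 is not representable, so this collapses to 0.0

log_score = sequence_log_prob(probs)  # stays finite
```

Because the logarithm is monotonic, ranking candidate sequences by `log_score` gives the same order as ranking them by the true product.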
As a possible implementation, the following words must be predicted from the language model and the input word, and the occurrence of a following word is taken to depend only on the words that precede it and not on any other words. The language model can therefore be an N-gram language model. Step 303 may then specifically include: building an N-gram language model from the collated corpus and calculating the parameters of the language model, where the parameters include the words in the language model and, for each N-word sequence, the conditional probability of the Nth word given the preceding N-1 words, N being a positive integer.
Assuming the words in the language model are $w_1, w_2, w_3, \ldots, w_N$, the conditional probability of the Nth word given the preceding N-1 words is:
$$P(w_N \mid w_1, \ldots, w_{N-1})$$
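The text does not state how the conditional probability is estimated. A common choice, assumed here, is the relative frequency of the N-word sequence among all sequences sharing the same first N-1 words. The sketch below does this for a toy corpus; `train_ngram` and `cond_prob` are hypothetical names.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def train_ngram(sentences: List[List[str]], n: int = 2) -> Tuple[Counter, Counter]:
    """Count every n-word sequence and every (n-1)-word context."""
    ngram_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            gram = tuple(sentence[i:i + n])
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def cond_prob(ngram_counts: Counter, context_counts: Counter,
              context: Sequence[str], word: str) -> float:
    """Relative-frequency estimate of P(word | context); 0.0 if the
    context was never observed."""
    total = context_counts[tuple(context)]
    if total == 0:
        return 0.0
    return ngram_counts[tuple(context) + (word,)] / total


corpus = [["main", "bhi", "nahi"], ["main", "bhi", "to"], ["main", "ne"]]
ngrams, contexts = train_ngram(corpus, n=2)
```

In this toy corpus "main" is followed by "bhi" twice and by "ne" once, so `cond_prob(ngrams, contexts, ["main"], "bhi")` is 2/3, while any pair that never occurs gets probability zero.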
Note that, assuming the language model contains 1000 words, a bigram language model forms a 1000*1000 matrix and a trigram language model forms a 1000*1000*1000 matrix. The resulting matrix contains a large number of zero values, that is, it is a sparse matrix, so the sparse data in the matrix needs to be smoothed. Step 303 may therefore further include: smoothing the conditional probability data so that the conditional probability of any N-word sequence that does not appear in the collated corpus is not zero.
Optionally, a data smoothing technique can be applied to the conditional probability data: it lowers the conditional probabilities of the N-word sequences that do appear in the collated corpus so that the conditional probabilities of the sequences that do not appear become non-zero.
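The text does not name a particular smoothing technique. Add-one (Laplace) smoothing is used below purely as an illustration of the stated behaviour: probabilities of observed bigrams go down, and unobserved bigrams receive a small non-zero probability. The counts and the function name are invented for the example.

```python
from collections import Counter


def laplace_prob(bigram_counts: Counter, unigram_counts: Counter,
                 vocab_size: int, prev: str, word: str) -> float:
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)


vocab = ["main", "bhi", "ne", "nahi"]
bigram_counts = Counter({("main", "bhi"): 2, ("main", "ne"): 1})
unigram_counts = Counter({"main": 3})  # times "main" appears as a context

seen = laplace_prob(bigram_counts, unigram_counts, len(vocab), "main", "bhi")
unseen = laplace_prob(bigram_counts, unigram_counts, len(vocab), "main", "nahi")
total = sum(laplace_prob(bigram_counts, unigram_counts, len(vocab), "main", w)
            for w in vocab)
```

Here the unsmoothed estimate for "bhi" after "main" would be 2/3; smoothing lowers it to 3/7 and gives the unseen word "nahi" a probability of 1/7, while the distribution over the vocabulary still sums to one.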
To implement the above embodiments, the present invention also proposes a Hindi-oriented multilingual mixed input device.
The device may be implemented with one or more computing devices. A computing device includes a processor and a memory, and the memory stores an application program comprising computer program instructions executable on the processor. The application program can be divided into multiple program modules that provide the functions of the system's components. The division into program modules is logical rather than physical: each program module can run on one or more computing devices, and one computing device can run one or more program modules. The device of the present invention is described in detail below according to the functional logic division of its program modules.
FIG. 4 is a schematic structural diagram of a Hindi-oriented multilingual mixed input device according to an embodiment of the present invention.
The Hindi-oriented multilingual mixed input device 100 may be implemented with a computing device that includes a processor and a memory. The memory stores program modules executable by the processor, and each program module, when executed, controls the computing device to perform the corresponding function.
As shown in FIG. 4, the Hindi-oriented multilingual mixed input device 100 includes: an input character acquisition module 101, a first candidate character string generation module 102, a vocabulary mapping module 103, a first candidate word list generation module 104, a first candidate word list display module 105, and a first candidate word input module 106. Specifically:
The input character acquisition module 101 is configured to acquire the Latin character sequence of the current input word typed on the input method interface.
The first candidate character string generation module 102 is configured to obtain, according to a first language model, a first candidate character string list in Latin characters corresponding to the Latin character sequence, where the first language model is a language model of Hindi spelled in Latin characters.
The vocabulary mapping module 103 is configured to obtain a target Hindi word list according to a pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words. The target Hindi word list includes the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list.
The first candidate word list generation module 104 is configured to generate, from the first candidate character string list and the target Hindi word list, a first candidate word list that includes words in both Latin-character and Hindi-character spellings.
The first candidate word list display module 105 is configured to display the first candidate word list on the input method interface.
The first candidate word input module 106 is configured to obtain a selection operation on a word in the first candidate word list and enter the selected word as the input word.
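The cooperation of modules 103 and 104 can be sketched as follows. This is only an illustration: the function name, the ordering of the merged list (Latin-character candidates first, then their Hindi-character spellings), and the toy mapping table are assumptions, since the text does not specify how the two lists are interleaved.

```python
from typing import Dict, List


def first_candidate_list(latin_candidates: List[str],
                         latin_to_hindi: Dict[str, str]) -> List[str]:
    """Look up the Hindi-character spelling of each Latin-character
    candidate (module 103) and merge both lists (module 104)."""
    # Target Hindi word list: only candidates present in the mapping.
    target_hindi = [latin_to_hindi[c] for c in latin_candidates if c in latin_to_hindi]

    merged: List[str] = []
    for item in latin_candidates + target_hindi:
        if item not in merged:  # avoid showing the same entry twice
            merged.append(item)
    return merged


# Hypothetical mapping table between the two spelling forms.
mapping = {"nahi": "नहीं", "kahani": "कहानी"}
candidates = first_candidate_list(["nahi", "nahin"], mapping)
```

A candidate with no entry in the mapping, such as "nahin" here, simply contributes no Hindi-character form.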
Further, in a possible implementation of the embodiment of the present invention, referring to FIG. 5, on the basis of the embodiment shown in FIG. 4, the Hindi-oriented multilingual mixed input device 100 may further include the following:
The first candidate character string generation module 102 is specifically configured to: when the Latin character sequence is a complete Latin-character spelling of a Hindi word, add the Hindi word corresponding to the Latin character sequence to the first candidate character string list; and obtain extended options, which include Latin-character Hindi words or word fragments that contain the Latin character sequence, and add the extended options to the first candidate character string list.
The first candidate character string generation module 102 may also be configured to: when the first language model contains no Latin-character Hindi word that contains the Latin character sequence, obtain the Latin-character Hindi word most similar to the Latin character sequence and add it to the first candidate character string list as an extended option.
The second candidate word list generation module 107 is configured to predict the words that follow the input word according to the language model corresponding to the input word, and to generate a second candidate word list from the prediction result.
The second candidate word list display module 108 is configured to display the second candidate word list on the input method interface.
The second candidate word input module 109 is configured to obtain a selection operation on a word in the second candidate word list and enter the selected word as the next input word.
As a possible implementation, the second candidate word list generation module 107 is specifically configured to: determine whether the input word is spelled in Latin characters or in Hindi characters; when the input word is spelled in Latin characters, predict the following words with the first language model; and when the input word is spelled in Hindi characters, predict the following words with the second language model, which is a pre-established language model of Hindi spelled in Hindi characters.
The first language model creation module 110 is configured to establish the first language model.
As a possible implementation, the first language model creation module 110 includes:
A corpus acquisition unit 111, configured to obtain corpus data of Hindi spelled in Latin characters and preprocess the corpus data to remove erroneous and low-frequency entries, obtaining a valid corpus.
A corpus de-redundancy unit 112, configured to remove the redundant parts of the valid corpus data to obtain a collated corpus.
A language model construction unit 113, configured to build a language model from the collated corpus.
As a possible implementation, the language model construction unit 113 is specifically configured to: build an N-gram language model from the collated corpus and calculate the parameters of the language model, where the parameters include the words in the language model and, for each N-word sequence, the conditional probability of the Nth word given the preceding N-1 words, N being a positive integer; and smooth the conditional probability data so that the conditional probability of any N-word sequence that does not appear in the collated corpus is not zero.
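The behaviour described above for the first candidate character string generation module 102 (exact match, extended options, and a similarity fallback) can be sketched as below. The text does not say how similarity is measured. Python's standard `difflib.get_close_matches` is used here as a stand-in, and the function name and toy vocabulary are assumptions.

```python
import difflib
from typing import List


def candidate_strings(typed: str, vocabulary: List[str]) -> List[str]:
    """Build the first candidate character string list for a typed
    Latin character sequence."""
    candidates: List[str] = []

    # 1. The sequence is itself a complete Latin-character Hindi word.
    if typed in vocabulary:
        candidates.append(typed)

    # 2. Extended options: vocabulary words containing the sequence.
    candidates.extend(w for w in vocabulary if typed in w and w != typed)

    # 3. Fallback: nothing contains the sequence, so take the single
    #    most similar word (cutoff=0.0 always returns the best match).
    if not candidates:
        candidates.extend(difflib.get_close_matches(typed, vocabulary, n=1, cutoff=0.0))
    return candidates


# Hypothetical vocabulary of the first language model (lowercased).
vocabulary = ["main", "meri", "nahi", "kahani", "abhishek"]
```

With this vocabulary, "abhis" is completed to "abhishek", "mai" to "main", and the misspelling "kahni", which no word contains, falls back to its closest match "kahani".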
For details of how the functions and roles of the modules in the Hindi-oriented multilingual mixed input device 100 of the present invention are implemented, refer to the implementation of the corresponding steps in the method described above. Because the device embodiment essentially corresponds to the method embodiment, the foregoing explanation of the method embodiment also applies to the device embodiment. To avoid redundancy, the details are not all repeated for the device embodiment; for anything not covered here, refer to the description of the Hindi-oriented multilingual mixed input method given above with reference to FIG. 1 to FIG. 3.
The Hindi-oriented multilingual mixed input device of the embodiment of the present invention acquires the Latin character sequence of the current input word typed on the input method interface, and then obtains, according to the first language model, a first candidate character string list in Latin characters corresponding to that sequence, where the first language model is a pre-established language model of Hindi spelled in Latin characters. Next, according to the pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words, it obtains the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list. From the first candidate character string list and those Hindi-character spellings, it generates a first candidate word list that includes words in both Latin-character and Hindi-character spellings. Finally, it displays the first candidate word list on the input method interface and obtains a selection operation on a word in the list, so that the selected word is entered as the input word. As a result, the user's need to enter Hindi characters and Latin characters together is met without frequent switching of input modes, which improves the efficiency of multilingual mixed input and the user's input experience. In addition, determining the Hindi-character spelling from the mapping relationship improves the accuracy of the output.
To implement the above embodiments, the present invention also proposes a non-transitory computer-readable storage medium.
The non-transitory computer-readable storage medium of the embodiment of the present invention stores executable instructions which, when run on a processor, implement the Hindi-oriented multilingual mixed input method proposed in the foregoing embodiments of the present invention. The storage medium may be provided on a device as part of that device; or, when the device can be remotely controlled by a server, the storage medium may be provided on the remote server that controls the device.
The computer instructions for implementing the method of the present invention may be carried on any combination of one or more computer-readable media. A non-transitory computer-readable medium may include any computer-readable medium except a transitorily propagating signal itself. The computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in combination with, an instruction execution system, apparatus, or device.
To implement the above embodiments, the present invention also proposes a computer program product.
In the computer program product of the embodiment of the present invention, when the instructions in the computer program product are executed by a processor, the Hindi-oriented multilingual mixed input method proposed in the foregoing embodiments of the present invention is implemented.
The computer program code for performing the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
To implement the above embodiments, the present invention also proposes a computing device.
The computing device of the embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, the Hindi-oriented multilingual mixed input method proposed in the foregoing embodiments of the present invention is implemented.
The computing device may be implemented by the central control unit of a computer device, as part of the functions of that central control unit. It may also be implemented by a separate computing device that is communicatively connected to the central control unit of the computer device. Implementations of the computing device may include, but are not limited to, a microcontroller, a programmable logic controller (PLC), a complex programmable logic device (CPLD), a programmable gate array (PGA), a field programmable gate array (FPGA), a dedicated neural network chip, and so on.
For the above storage medium and computing device, the specific implementation of the relevant parts can be derived from the corresponding embodiments of the Hindi-oriented multilingual mixed input method or device of the present invention, and they have beneficial effects similar to those of that method or device, which are not repeated here.
The non-transitory computer-readable storage medium, computer program product, and computing device of the embodiments of the present invention can be implemented with reference to the content specifically described in the foregoing embodiments of the present invention, and have beneficial effects similar to those of the Hindi-oriented multilingual mixed input method proposed in those embodiments, which are not repeated here.
It should be noted that, in this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means two or more, for example two or three, unless specifically defined otherwise.
本技术领域的普通技术人员可以理解实现上述实施例的方法携带的全部或部分步骤是可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,该程序在执行时,包括方法实施例的步骤之一或其组合。Those of ordinary skill in the art may understand that all or part of the steps carried by the method for implementing the foregoing embodiments may be completed by a program instructing related hardware. The program may be stored in a computer-readable storage medium. The program When executed, one or a combination of the steps of the method embodiments is included.
In this specification, any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the present invention pertain.
The logic and/or steps represented in a flowchart, or otherwise described herein, may for example be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.
It should be understood that the parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logical functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Although embodiments of the present invention have been shown and described above, it will be understood that these embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Claims (17)
- A Hindi-oriented multi-language mixed input method, characterized by comprising:
  acquiring a Latin character sequence of a current input word typed in an input method interface;
  acquiring, according to a first language model, a first candidate character string list in Latin character form corresponding to the Latin character sequence, the first language model being a pre-established language model of Hindi spelled in Latin characters;
  acquiring a target Hindi word list according to a pre-established mapping between the Latin-character spellings and the Hindi-character spellings of Hindi words, the target Hindi word list comprising the Hindi-character spellings corresponding to the Latin-character-spelled Hindi words in the first candidate character string list;
  generating, according to the first candidate character string list and the target Hindi word list, a first candidate word list comprising words in Latin-character spelling and words in Hindi-character spelling;
  displaying the first candidate word list in the input method interface; and
  acquiring a selection operation on a word in the first candidate word list, and inputting the selected word as the input word.
- The Hindi-oriented multi-language mixed input method according to claim 1, characterized in that acquiring, according to the first language model, the first candidate character string list in Latin character form corresponding to the Latin character sequence comprises:
  when the Latin character sequence is a complete Hindi word in Latin-character spelling, adding the Hindi word corresponding to the Latin character sequence to the first candidate character string list; and
  acquiring extended options, the extended options comprising Hindi words or word fragments in Latin-character spelling that contain the Latin character sequence, and adding the extended options to the first candidate character string list.
- The Hindi-oriented multi-language mixed input method according to claim 2, characterized in that acquiring, according to the first language model, the first candidate character string list in Latin character form corresponding to the Latin character sequence further comprises:
  when the first language model contains no Hindi word in Latin-character spelling that contains the Latin character sequence, acquiring the Hindi word in Latin-character spelling that has the highest similarity to the Latin character sequence, and adding it to the first candidate character string list as an extended option.
- The Hindi-oriented multi-language mixed input method according to claim 1, characterized in that, after acquiring the selection operation on a word in the first candidate word list and inputting the selected word as the input word, the method further comprises:
  predicting subsequent words of the input word according to the language model corresponding to the input word, and generating a second candidate word list according to the prediction result;
  displaying the second candidate word list in the input method interface; and
  acquiring a selection operation on a word in the second candidate word list, and inputting the selected word as the next input word.
- The Hindi-oriented multi-language mixed input method according to claim 4, characterized in that predicting subsequent words of the input word according to the language model corresponding to the input word, and generating the second candidate word list according to the prediction result, comprises:
  determining whether the input word is spelled in Latin characters or in Hindi characters;
  when the input word is spelled in Latin characters, predicting the subsequent input word according to the first language model; and
  when the input word is spelled in Hindi characters, predicting the subsequent input word according to a second language model, the second language model being a pre-established language model of Hindi spelled in Hindi characters.
- The Hindi-oriented multi-language mixed input method according to claim 1, characterized in that, in acquiring, according to the first language model, the first candidate character string list in Latin character form corresponding to the Latin character sequence, the first language model is a pre-established language model of Hindi spelled in Latin characters, wherein pre-establishing the first language model comprises:
  acquiring corpus data of Hindi spelled in Latin characters, and preprocessing the corpus data to remove erroneous corpus entries and low-frequency corpus entries, thereby obtaining a valid corpus;
  removing the redundant parts of the valid corpus data to obtain a collated corpus; and
  constructing the language model using the collated corpus.
- The Hindi-oriented multi-language mixed input method according to claim 6, characterized in that constructing the language model using the collated corpus comprises:
  constructing a language model in N-gram form using the collated corpus, and calculating the parameters of the language model, the parameters comprising the words in the language model and, for each N-word sequence, the conditional probability of the N-th word given the preceding N-1 words, N being a positive integer; and
  smoothing the conditional probability data so that the conditional probability of any N-word sequence that does not appear in the collated corpus is non-zero.
- A Hindi-oriented multi-language mixed input device, characterized by comprising:
  an input character acquisition module, configured to acquire a Latin character sequence of a current input word typed in an input method interface;
  a first candidate character string generation module, configured to acquire, according to a first language model, a first candidate character string list in Latin character form corresponding to the Latin character sequence, the first language model being a language model of Hindi spelled in Latin characters;
  a word mapping module, configured to acquire a target Hindi word list according to a pre-established mapping between the Latin-character spellings and the Hindi-character spellings of Hindi words, the target Hindi word list comprising the Hindi-character spellings corresponding to the Latin-character-spelled Hindi words in the first candidate character string list;
  a first candidate word list generation module, configured to generate, according to the first candidate character string list and the target Hindi word list, a first candidate word list comprising words in Latin-character spelling and words in Hindi-character spelling;
  a first candidate word list display module, configured to display the first candidate word list in the input method interface; and
  a first candidate word input module, configured to acquire a selection operation on a word in the first candidate word list, and to input the selected word as the input word.
- The Hindi-oriented multi-language mixed input device according to claim 8, characterized in that the first candidate character string generation module is specifically configured to:
  when the Latin character sequence is a complete Hindi word in Latin-character spelling, add the Hindi word corresponding to the Latin character sequence to the first candidate character string list; and
  acquire extended options, the extended options comprising Hindi words or word fragments in Latin-character spelling that contain the Latin character sequence, and add the extended options to the first candidate character string list.
- The Hindi-oriented multi-language mixed input device according to claim 9, characterized in that the first candidate character string generation module is further configured to:
  when the first language model contains no Hindi word in Latin-character spelling that contains the Latin character sequence, acquire the Hindi word in Latin-character spelling that has the highest similarity to the Latin character sequence, and add it to the first candidate character string list as an extended option.
- The Hindi-oriented multi-language mixed input device according to claim 8, characterized by further comprising:
  a second candidate word list generation module, configured to predict subsequent words of the input word according to the language model corresponding to the input word, and to generate a second candidate word list according to the prediction result;
  a second candidate word list display module, configured to display the second candidate word list in the input method interface; and
  a second candidate word input module, configured to acquire a selection operation on a word in the second candidate word list, and to input the selected word as the next input word.
- The Hindi-oriented multi-language mixed input device according to claim 11, characterized in that the second candidate word list generation module is specifically configured to:
  determine whether the input word is spelled in Latin characters or in Hindi characters;
  when the input word is spelled in Latin characters, predict the subsequent input word according to the first language model; and
  when the input word is spelled in Hindi characters, predict the subsequent input word according to a second language model, the second language model being a pre-established language model of Hindi spelled in Hindi characters.
- The Hindi-oriented multi-language mixed input device according to claim 8, characterized by further comprising a first language model creation module configured to establish the first language model, the first language model creation module comprising:
  a corpus acquisition unit, configured to acquire corpus data of Hindi spelled in Latin characters, and to preprocess the corpus data to remove erroneous corpus entries and low-frequency corpus entries, thereby obtaining a valid corpus;
  a corpus redundancy removal unit, configured to remove the redundant parts of the valid corpus data to obtain a collated corpus; and
  a language model construction unit, configured to construct the language model using the collated corpus.
- The Hindi-oriented multi-language mixed input device according to claim 13, characterized in that the language model construction unit is specifically configured to:
  construct a language model in N-gram form using the collated corpus, and calculate the parameters of the language model, the parameters comprising the words in the language model and, for each N-word sequence, the conditional probability of the N-th word given the preceding N-1 words, N being a positive integer; and
  smooth the conditional probability data so that the conditional probability of any N-word sequence that does not appear in the collated corpus is non-zero.
- A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the Hindi-oriented multi-language mixed input method according to any one of claims 1-7.
- A computer program product, characterized in that the instructions in the computer program product, when executed by a processor, implement the Hindi-oriented multi-language mixed input method according to any one of claims 1-7.
- A computing device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the Hindi-oriented multi-language mixed input method according to any one of claims 1-7.
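As an illustration only, the candidate-generation steps recited in the method claims (exact match, extended options containing the typed sequence, highest-similarity fallback, and mapping to Hindi-character spellings) might be sketched as follows. The word list, the frequencies, the mapping table, and all function names are hypothetical stand-ins for the pre-established first language model and mapping; the use of `difflib` as the similarity measure is likewise an assumption, since the claims do not specify one.

```python
from __future__ import annotations

from difflib import get_close_matches

# Hypothetical toy data standing in for the first language model (romanized
# Hindi words with made-up frequencies) and the pre-established mapping table.
ROMAN_FREQ = {"namaste": 50, "naam": 40, "nahin": 30, "kya": 60}
ROMAN_TO_DEVANAGARI = {
    "namaste": "नमस्ते",
    "naam": "नाम",
    "nahin": "नहीं",
    "kya": "क्या",
}


def roman_candidates(typed: str) -> list[str]:
    """Return Latin-script candidates for the typed Latin character sequence."""
    candidates = []
    # The typed sequence is already a complete romanized word.
    if typed in ROMAN_FREQ:
        candidates.append(typed)
    # Extended options: other words that contain the typed sequence,
    # most frequent first.
    extended = [w for w in ROMAN_FREQ if typed in w and w != typed]
    extended.sort(key=lambda w: -ROMAN_FREQ[w])
    candidates.extend(extended)
    # Fallback: nothing contains the sequence, so take the most similar word.
    # (difflib's ratio is an assumed similarity measure.)
    if not candidates:
        candidates = get_close_matches(typed, list(ROMAN_FREQ), n=1, cutoff=0.0)
    return candidates


def first_candidate_word_list(typed: str) -> list[str]:
    """Pair each Latin-script candidate with its Hindi-character spelling."""
    mixed = []
    for word in roman_candidates(typed):
        mixed.append(word)
        devanagari = ROMAN_TO_DEVANAGARI.get(word)
        if devanagari is not None:
            mixed.append(devanagari)
    return mixed
```

Placing each Hindi-character spelling directly after its Latin-character counterpart is one possible ordering of the mixed list; the claims only require that both spelling forms appear in the first candidate word list.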
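The claims on next-word prediction choose between the first and second language models according to whether the selected word is spelled in Latin or Hindi characters. A minimal sketch of that dispatch is given below, assuming that "Hindi characters" means code points in the Unicode Devanagari block (U+0900 to U+097F). The two lookup tables are hypothetical placeholders for the two language models, and the function names are not taken from the patent.

```python
# Hypothetical next-word tables standing in for the first (Latin-spelled)
# and second (Hindi-character-spelled) language models.
LATIN_MODEL = {"kya": ["haal", "hai"]}
DEVANAGARI_MODEL = {"क्या": ["हाल", "है"]}


def is_devanagari(ch: str) -> bool:
    """True if the character lies in the Unicode Devanagari block."""
    return "\u0900" <= ch <= "\u097f"


def spelling_form(word: str) -> str:
    """Classify a word as 'devanagari' if any character is Devanagari, else 'latin'."""
    return "devanagari" if any(is_devanagari(ch) for ch in word) else "latin"


def second_candidate_word_list(selected: str) -> list:
    """Predict follow-up words using the model that matches the word's script."""
    if spelling_form(selected) == "devanagari":
        model = DEVANAGARI_MODEL
    else:
        model = LATIN_MODEL
    # Unknown words simply yield no predictions in this sketch.
    return model.get(selected, [])
```

For example, `second_candidate_word_list("kya")` consults the Latin-spelled table, while the same word typed in Hindi characters consults the other one.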
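The language-model claims call for an N-gram model whose conditional probabilities are smoothed so that unseen N-word sequences still receive non-zero probability. The sketch below shows one way this could look for N = 2 with add-one (Laplace) smoothing. The claims do not name a smoothing technique, so add-one smoothing is an assumption, and the three-sentence romanized corpus and the function names are invented for illustration.

```python
from collections import Counter


def train_bigram(sentences):
    """Count vocabulary, context occurrences, and bigrams in a collated corpus."""
    vocab = set()
    contexts = Counter()  # how often each word appears as the first word of a bigram
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return vocab, contexts, bigrams


def cond_prob(prev, word, vocab, contexts, bigrams):
    """P(word | prev) with add-one smoothing, so unseen bigrams are non-zero."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


# Hypothetical collated corpus of Hindi spelled in Latin characters.
CORPUS = ["aap kaise hain", "aap kahan hain", "main theek hoon"]
VOCAB, CONTEXTS, BIGRAMS = train_bigram(CORPUS)
```

With this corpus, the seen bigram "aap kaise" gets probability (1 + 1) / (2 + 7) = 2/9, while the unseen bigram "aap hoon" gets 1/9 rather than zero, and the probabilities over the vocabulary still sum to one for a given preceding word.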
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810713058.9 | 2018-06-29 | ||
CN201810713058.9A CN108897438A (en) | 2018-06-29 | 2018-06-29 | Multi-language mixed input method and device for hindi |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020000764A1 (en) | 2020-01-02 |
Family
ID=64348154
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/109507 WO2020000764A1 (en) | 2018-06-29 | 2018-10-09 | Hindi-oriented multi-language mixed input method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108897438A (en) |
WO (1) | WO2020000764A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109739367A (en) * | 2018-12-28 | 2019-05-10 | 北京金山安全软件有限公司 | Candidate word list generation method and device |
CN112506359B (en) * | 2020-12-21 | 2023-07-21 | 北京百度网讯科技有限公司 | Method and device for providing candidate long sentences in input method and electronic equipment |
CN112764551A (en) * | 2020-12-31 | 2021-05-07 | 维沃移动通信有限公司 | Vocabulary display method and device and electronic equipment |
CN112987943B (en) * | 2021-03-10 | 2023-03-14 | 江西航智信息技术有限公司 | Cloud architecture system for remotely controlling student mobile terminal input method |
CN112987940B (en) * | 2021-04-27 | 2021-08-27 | 广州智品网络科技有限公司 | Input method and device based on sample probability quantization and electronic equipment |
WO2022241640A1 (en) * | 2021-05-18 | 2022-11-24 | Citrix Systems, Inc. | A split keyboard with different languages as input |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1983129A (en) * | 2005-12-12 | 2007-06-20 | 北京优耐数码科技有限公司 | Technology for inputting Hindi in digital keyboard intelligently |
CN101882025A (en) * | 2010-06-29 | 2010-11-10 | 汉王科技股份有限公司 | Hand input method and system |
CN102929571A (en) * | 2012-10-15 | 2013-02-13 | 深圳市视得安罗格朗电子股份有限公司 | Multi-language configuration display system and device |
CN106156014A (en) * | 2016-07-29 | 2016-11-23 | 宇龙计算机通信科技(深圳)有限公司 | A kind of information processing method and device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7636083B2 (en) * | 2004-02-20 | 2009-12-22 | Tegic Communications, Inc. | Method and apparatus for text input in various languages |
CN101493732A (en) * | 2009-02-27 | 2009-07-29 | 广东国笔科技股份有限公司 | Language input system for Indo-European |
CN102193643B (en) * | 2010-03-15 | 2014-07-02 | 北京搜狗科技发展有限公司 | Word input method and input method system having translation function |
US20140278349A1 (en) * | 2013-03-14 | 2014-09-18 | Microsoft Corporation | Language Model Dictionaries for Text Predictions |
US9785252B2 (en) * | 2015-07-28 | 2017-10-10 | Fitnii Inc. | Method for inputting multi-language texts |
-
2018
- 2018-06-29 CN CN201810713058.9A patent/CN108897438A/en active Pending
- 2018-10-09 WO PCT/CN2018/109507 patent/WO2020000764A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN108897438A (en) | 2018-11-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020000764A1 (en) | Hindi-oriented multi-language mixed input method and device | |
JP7301922B2 (en) | Semantic retrieval method, device, electronic device, storage medium and computer program | |
US20210312139A1 (en) | Method and apparatus of generating semantic feature, method and apparatus of training model, electronic device, and storage medium | |
WO2020062770A1 (en) | Method and apparatus for constructing domain dictionary, and device and storage medium | |
WO2018205389A1 (en) | Voice recognition method and system, electronic apparatus and medium | |
US10789431B2 (en) | Method and system of translating a source sentence in a first language into a target sentence in a second language | |
TWI629601B (en) | System for providing translation and classification of translation results, computer-readable storage medium, file distribution system and method thereof | |
JP5513898B2 (en) | Shared language model | |
JP2016218995A (en) | Machine translation method, machine translation system and program | |
US20210209472A1 (en) | Method and apparatus for determining causality, electronic device and storage medium | |
US11907671B2 (en) | Role labeling method, electronic device and storage medium | |
JP7413630B2 (en) | Summary generation model training method, apparatus, device and storage medium | |
KR102456535B1 (en) | Medical fact verification method and apparatus, electronic device, and storage medium and program | |
US20210200813A1 (en) | Human-machine interaction method, electronic device, and storage medium | |
JP7093825B2 (en) | Man-machine dialogue methods, devices, and equipment | |
US11321370B2 (en) | Method for generating question answering robot and computer device | |
JP2021192283A (en) | Information query method, device and electronic apparatus | |
CN110874536A (en) | Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method | |
JP7241122B2 (en) | Smart response method and device, electronic device, storage medium and computer program | |
JP2023007369A (en) | Translation method, classification model training method, apparatus, device and storage medium | |
CN109710952B (en) | Translation history retrieval method, device, equipment and medium based on artificial intelligence | |
KR102523034B1 (en) | Method and apparatus for detecting adverseness in medicine, electronic device, storage medium and program | |
WO2022227166A1 (en) | Word replacement method and apparatus, electronic device, and storage medium | |
JP2014010634A (en) | Paginal translation expression extraction device, paginal translation expression extraction method and computer program for extracting paginal translation expression | |
JP2022017173A (en) | Method and device for outputting information, electronic device, computer-readable storage medium, and computer program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18924071; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 12.05.2021 DATED 12/05/2021) |
122 | Ep: pct application non-entry in european phase | Ref document number: 18924071; Country of ref document: EP; Kind code of ref document: A1 |