Summary of the invention
The objective of the invention is to overcome the defect that existing statistical machine translation systems cannot adapt simultaneously to texts from different domains. It provides a method that selects a translation model according to the text to be translated, so that better translation results can be obtained for input from different domains.
To achieve this objective, the invention provides a method for generating candidate translation models in statistical machine translation, comprising the following steps:
Step 101): collect a bilingual parallel corpus and divide it by type into different sub-corpora, thereby building sub-corpora of different types;
Step 102): train candidate translation models on the sub-corpora obtained in step 101);
Step 103): build an index for the sub-corpora obtained in step 101), obtaining a corpus index file.
In the above technical solution, when the bilingual parallel corpus is divided in step 101), a classification or clustering method is applied according to the domain, topic, and wording of the data, so that data with similar domain, topic, and wording is placed in the same sub-corpus.
The classification or clustering method includes k-means clustering, k-nearest-neighbor classification, or maximum entropy classification.
In the above technical solution, in step 102), a translation model is trained on each sub-corpus to obtain the corresponding sub-translation model; in addition, the entire bilingual parallel corpus is used to train a general translation model.
In the above technical solution, in step 103), an index is built over the source-language sentence of every translation sentence pair in the bilingual parallel corpus; the index records which sub-corpus the source-language sentence of each pair belongs to.
The index is built with the Lemur information retrieval toolkit.
The invention also provides a method for translating with candidate translation models in statistical machine translation, comprising the following steps:
Step 201): input the text to be translated, and retrieve from the corpus index file sentences similar to the sentences in the text to be translated;
Step 202): according to the retrieval result of step 201), obtain the candidate translation models corresponding to the sub-corpora that contain the similar sentences, and select the final translation model from all candidate translation models;
Step 203): translate the input text with the final translation model determined in step 202) to obtain the final translation result.
In the above technical solution, in step 201), a similarity retrieval model is used to compute the similarity between the text to be translated and every indexed document in the corpus index file; the results are then sorted in descending order of similarity, and at least one sentence with the highest similarity is selected. Each selected sentence carries the information of the sub-corpus it belongs to.
Similar sentences are retrieved using the vector space model and the TF-IDF similarity measure.
In the above technical solution, in step 202), a selection strategy is set, and according to it one candidate translation model, or a combination of several candidate translation models, is selected from all candidate translation models as the final translation model.
The selection strategy includes determining the candidate translation model according to the number of similar sentences contained in each sub-corpus, or additionally taking the similarity values into account.
The invention further provides an online translation model selection method for statistical machine translation, comprising a training stage and a translation stage, characterized in that the training stage comprises the following steps:
Step 101): collect a bilingual parallel corpus and divide it by type into different sub-corpora, thereby building sub-corpora of different types;
Step 102): train candidate translation models on the sub-corpora obtained in step 101);
Step 103): build an index for the sub-corpora obtained in step 101), obtaining a corpus index file;
The translation stage comprises the following steps:
Step 201): input the text to be translated, and retrieve from the corpus index file obtained in step 103) sentences similar to the sentences in the text to be translated;
Step 202): according to the retrieval result of step 201), obtain the candidate translation models corresponding to the sub-corpora that contain the similar sentences, and select the final translation model from all candidate translation models;
Step 203): translate the input text with the final translation model determined in step 202) to obtain the final translation result.
The invention further provides an online translation model selection system for statistical machine translation, comprising a training module and a translation module. The training module comprises a corpus collection unit, a candidate translation model training unit, and an index building unit; the translation module comprises a retrieval unit, a candidate translation model selection unit, and a translation unit; wherein:
The corpus collection unit collects a bilingual parallel corpus and builds sub-corpora according to the type of the collected data;
The candidate translation model training unit trains candidate translation models on the sub-corpora;
The index building unit builds an index for the sub-corpora, obtaining a corpus index file;
The retrieval unit retrieves from the corpus index file, according to the input text to be translated, sentences similar to the sentences in that text;
The candidate translation model selection unit obtains, according to the retrieval result, the candidate translation models corresponding to the sub-corpora that contain the similar sentences, and selects the final translation model from all candidate translation models;
The translation unit translates the text to be translated with the selected final translation model.
The advantages of the invention are:
1. The online translation model selection method provided by the invention lets a statistical machine translation system select a suitable translation model online according to the input text to be translated. This addresses the problem that statistical machine translation systems adapt poorly to input from different domains, can effectively improve translation quality, and offers a feasible route to practical deployment of such systems.
2. The online translation model selection method provided by the invention is independent of the modeling, training, and decoding procedures of any particular statistical machine translation method. It can therefore be applied to various methods, such as word-based, phrase-based, and syntax-based statistical machine translation. The invention thus has the advantages of good adaptability and simple implementation.
Embodiment
The invention is described in further detail below with reference to the drawings and specific embodiments.
The online translation model selection method for statistical machine translation of the invention comprises two parts, model training and online translation, which are described in turn below.
As shown in Figure 1, the model training process of the invention comprises the following steps:
Step 101: collect a bilingual parallel corpus and divide it by type into different sub-corpora, thereby building sub-corpora of different types. The collected bilingual parallel corpus is generally sentence-aligned, that is, it contains sentences paired with their translations. When the corpus is divided into sub-corpora of different types, the data within one sub-corpus should be as similar as possible in domain, topic, and wording, while the data in different sub-corpora should differ as much as possible in these respects. The division can be performed by classification or clustering; any existing classification or clustering method is applicable to the invention, such as the commonly used k-means clustering, k-nearest-neighbor classification, and maximum entropy classification. In addition, the source and domain of a corpus are often known when it is collected, in which case the corpus can be divided directly into sub-corpora of different domains according to its source and domain.
Through the above operations, the collected bilingual parallel corpus is divided into several sub-corpora. The number of sub-corpora should not be too large: each sub-corpus must contain a sufficient amount of data (that is, translation sentence pairs), so that translation quality does not suffer from sub-corpora that are too small. In addition, a translation sentence pair in the original corpus may be assigned to several sub-corpora at the same time; in other words, the resulting sub-corpora are allowed to contain identical translation sentence pairs.
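The simplest division described above, by known source or domain, can be sketched as follows. This is only an illustration under stated assumptions: the function name `partition_by_domain`, the domain tags, and the toy sentence pairs are hypothetical, and a clustering or classification method could take the place of the tags.

```python
from collections import defaultdict

def partition_by_domain(sentence_pairs):
    """Group (source, target, domains) triples into sub-corpora keyed by domain.

    A pair tagged with several domains is copied into each of them, which
    mirrors the allowance for identical pairs in different sub-corpora.
    """
    sub_corpora = defaultdict(list)
    for source, target, domains in sentence_pairs:
        for domain in domains:
            sub_corpora[domain].append((source, target))
    return dict(sub_corpora)

# Hypothetical toy data; the domain tags are assumed to come from corpus metadata.
pairs = [
    ("source sentence A", "please show your passport", ["travel"]),
    ("source sentence B", "stock prices rose", ["finance", "news"]),
    ("source sentence C", "the president gave a speech", ["news"]),
]
sub_corpora = partition_by_domain(pairs)
```

In this sketch the second pair lands in both the "finance" and the "news" sub-corpus. A real system would also check that each sub-corpus is large enough before training on it.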
Step 102: train candidate translation models on the sub-corpora obtained in step 101. A translation model is trained on each sub-corpus to obtain the corresponding sub-translation model. In addition, the entire bilingual parallel corpus is used to train a general translation model.
Translation model training is mature prior art, and any commonly used training method can be adopted in this step. In the present embodiment, for example, the EM training method disclosed in reference 1, the maximum likelihood training method disclosed in reference 2, or the discriminative training method disclosed in reference 3 can be used.
The translation models obtained in this step are the candidate translation models used in the subsequent translation stage.
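The structure of this step, one sub-model per sub-corpus plus one general model, can be sketched as below. The trainer `train_toy_model` is a deliberately crude stand-in (relative co-occurrence frequencies) and is not one of the training methods cited above; all names and data are hypothetical.

```python
from collections import Counter, defaultdict

def train_toy_model(pairs):
    """Toy stand-in for a real trainer: relative co-occurrence frequencies.

    Returns {source_word: {target_word: probability}}. A real system would use
    EM, maximum likelihood, or discriminative training instead.
    """
    counts = defaultdict(Counter)
    for source_tokens, target_tokens in pairs:
        for s in source_tokens:
            counts[s].update(target_tokens)
    return {
        s: {t: c / sum(tc.values()) for t, c in tc.items()}
        for s, tc in counts.items()
    }

def train_candidates(sub_corpora, full_corpus, train=train_toy_model):
    """Train one sub-model per sub-corpus plus a general model on all pairs."""
    models = {name: train(pairs) for name, pairs in sub_corpora.items()}
    models["general"] = train(full_corpus)
    return models

# Hypothetical pre-tokenized data; the source tokens are placeholders.
travel = [(["hu", "zhao"], ["passport"])]
finance = [(["gu", "piao"], ["stock"])]
models = train_candidates({"travel": travel, "finance": finance}, travel + finance)
```

Passing the trainer as a parameter reflects the point that the selection method does not depend on any particular training procedure.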
Step 103: build an index for the sub-corpora, obtaining a corpus index file. The index is built over the source-language sentence of every translation sentence pair in the sub-corpora, and it records which sub-corpus each source-language sentence belongs to. The purpose of the index is to make it easy and fast, during subsequent translation, to retrieve the N sentences most similar to a given text and at the same time to know which sub-corpus or sub-corpora these sentences come from. Indexing the sub-corpora can rely on mature prior art; in the present embodiment the Lemur information retrieval toolkit can be used. During indexing, the source-language sentence of each translation sentence pair is treated as one document, and the sub-corpus the document belongs to is recorded in the document's path information.
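The idea of treating each source sentence as a document whose path encodes its sub-corpus can be sketched in plain Python. This is not the Lemur API; `build_index` and the entry fields are hypothetical names chosen for illustration.

```python
def build_index(sub_corpora):
    """Return a list of index entries, one per source sentence.

    Each entry treats the source sentence as a document and stores the name of
    the sub-corpus it came from, playing the role of the document path.
    """
    index = []
    for name, pairs in sub_corpora.items():
        for position, (source_tokens, _target_tokens) in enumerate(pairs):
            index.append({
                "doc_id": f"{name}/{position}",  # path-like id: sub-corpus/position
                "sub_corpus": name,
                "tokens": list(source_tokens),
            })
    return index

# Hypothetical pre-tokenized sub-corpora.
sub_corpora = {
    "travel": [(["show", "passport"], ["target", "a"])],
    "finance": [(["stock", "rose"], ["target", "b"]),
                (["bond", "fell"], ["target", "c"])],
}
index = build_index(sub_corpora)
```

Because the sub-corpus is part of every entry, any retrieval result built on this index directly reveals which candidate model the sentence points to.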
The above operations complete the training of the translation models. The online translation process is described in detail below.
As shown in Figure 2, the online translation part of the online translation model selection method for statistical machine translation of the invention comprises the following steps:
Step 201: input the text to be translated, and retrieve from the corpus index file training sentences similar to the sentences in the text to be translated.
When similar sentences are retrieved for the text to be translated, a similarity retrieval method can be used to retrieve the N most similar sentences from the index of the training corpus. Each retrieved sentence carries its sub-corpus information, that is, which sub-corpus the sentence belongs to.
The similarity retrieval method can be implemented in many ways, for example with the Dice coefficient, edit distance, or the cosine measure. In the present embodiment, the vector space model and the TF-IDF similarity measure commonly used in information retrieval can be adopted to retrieve similar sentences, as detailed below:
In vector space retrieval, both the query entered by the user and the documents in the system are represented as vectors. Assuming there are n words in total, each document (or query) $D_i$ can be regarded as an n-dimensional vector $(w_{i1}, w_{i2}, \ldots, w_{in})$, where $w_{ij}$ is the weight of document $D_i$ in the j-th dimension. These weights can be computed with the following TF-IDF method:
$$w_{ij} = tf_{ij} \times \log(idf_j)$$
Here $tf_{ij}$ is the frequency with which word j occurs in document $D_i$; the larger $tf_{ij}$ is, the more important word j is to document $D_i$. $idf_j$ is called the inverse document frequency and is based on the reciprocal of the number of documents containing word j; in practice it is generally computed as the total number of documents divided by the number of documents containing word j. The smaller $idf_j$ is, the more documents contain word j, and the less useful word j is for measuring document similarity.
When the user inputs the text to be translated, the retrieval system first computes the similarity between the text and every indexed document vector, and then sorts all results in descending order of similarity. The similarity is usually expressed as the cosine of the angle between the two vectors or as their inner product.
Step 103 mentioned that the Lemur information retrieval toolkit can be used to build the index; in this step the same toolkit can be used to retrieve similar sentences based on the vector space model and TF-IDF similarity. Retrieval yields the N training sentences most similar to the text to be translated, together with the training sub-corpus each sentence belongs to.
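The retrieval step can be sketched as a ranking by TF-IDF cosine similarity. This is a plain Python illustration rather than Lemur's actual retrieval code; `retrieve_similar`, the index layout, and the toy sentences are assumptions made for the example.

```python
import math
from collections import Counter

def retrieve_similar(query_tokens, index, top_n=5):
    """Rank indexed sentences by TF-IDF cosine similarity to the query.

    `index` is a list of (sub_corpus_name, tokens) entries. Returns up to
    `top_n` (similarity, sub_corpus_name, tokens) tuples, most similar first.
    """
    n_docs = len(index)
    df = Counter()
    for _name, tokens in index:
        df.update(set(tokens))

    def vectorize(tokens):
        tf = Counter(t for t in tokens if t in df)  # words unseen in the index are ignored
        return {w: c * math.log(n_docs / df[w]) for w, c in tf.items()}

    def norm(vector):
        return math.sqrt(sum(x * x for x in vector.values()))

    def cosine(u, v):
        dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
        denominator = norm(u) * norm(v)
        return dot / denominator if denominator else 0.0

    query_vec = vectorize(query_tokens)
    scored = [(cosine(query_vec, vectorize(tokens)), name, tokens)
              for name, tokens in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]

# Hypothetical index entries.
index = [
    ("finance", ["stock", "price", "rose"]),
    ("finance", ["bond", "price", "fell"]),
    ("travel", ["show", "passport", "please"]),
]
hits = retrieve_similar(["stock", "price"], index, top_n=2)
```

Each hit carries its sub-corpus name, so the output can be passed directly to the model selection step.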
Step 202: select the translation model according to the retrieval result of step 201. Retrieving the similar sentences in step 201 also yields the sub-corpus each similar sentence belongs to. As described for step 102, each sub-corpus corresponds to one candidate translation model. The similar sentences obtained in step 201 may belong to different sub-corpora and therefore correspond to different candidate translation models, so this step selects one candidate model, or a combination of several candidate models, as the final translation model according to a selection strategy. The selection strategy can be chosen according to actual needs: it may be based on the number of similar sentences in each sub-corpus, or it may also take the similarity values into account. Suppose a sentence to be translated has 5 similar sentences, of which 3 belong to sub-corpus 1, 1 belongs to sub-corpus 2, and 1 belongs to sub-corpus 3. Under the strategy based on the number of similar sentences, the candidate translation model corresponding to sub-corpus 1 is used as the final translation model. Suppose instead that the 5 similar sentences have similarities 0.9, 0.7, 0.5, 0.3, and 0.1, where the 1st and 2nd belong to sub-corpus 1 and the 3rd, 4th, and 5th belong to sub-corpus 2. Under the strategy based on similarity values, the total similarity of sub-corpus 1 is 1.6 (0.9 + 0.7) and that of sub-corpus 2 is 0.9 (0.5 + 0.3 + 0.1); therefore, although sub-corpus 2 contains more similar sentences, the candidate translation model corresponding to sub-corpus 1 is still chosen as the final translation model.
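The two selection strategies and the two worked examples can be reproduced with a short sketch. The function names are hypothetical, and the similarity values in the first example are arbitrary placeholders, since that example only specifies how many sentences fall in each sub-corpus.

```python
from collections import Counter, defaultdict

def select_by_count(hits):
    """Pick the sub-corpus that contributes the most similar sentences."""
    counts = Counter(sub_corpus for _similarity, sub_corpus in hits)
    return counts.most_common(1)[0][0]

def select_by_similarity_sum(hits):
    """Pick the sub-corpus whose similar sentences have the largest total similarity."""
    totals = defaultdict(float)
    for similarity, sub_corpus in hits:
        totals[sub_corpus] += similarity
    return max(totals, key=totals.get)

# First example: 3, 1 and 1 similar sentences in sub-corpora 1, 2 and 3.
# The similarity values here are placeholders and do not affect the count.
hits_a = [(0.8, 1), (0.7, 1), (0.6, 2), (0.5, 1), (0.4, 3)]
# Second example: similarities 0.9, 0.7 in sub-corpus 1 and 0.5, 0.3, 0.1 in sub-corpus 2.
hits_b = [(0.9, 1), (0.7, 1), (0.5, 2), (0.3, 2), (0.1, 2)]

choice_a = select_by_count(hits_a)
choice_b = select_by_similarity_sum(hits_b)
```

On the second example the two strategies disagree: counting favors sub-corpus 2, while the similarity totals favor sub-corpus 1.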
A simple model selection strategy is used below to illustrate how this step can be implemented:
if Proportion(max_model) > 0.5
    δ_0 = 0; δ_i = 1 for i = max_model; δ_i = 0 for i ≠ max_model;
else
    δ_0 = 1; δ_i = 0;
Here $\delta_0$ is the weight of the general translation model and $\delta_i$ is the weight of the i-th sub-translation model, i = 1, ..., M. max_model is the sub-model that accounts for the largest proportion, and the function Proportion(max_model) gives the proportion of the retrieved similar sentences that belong to the sub-corpus corresponding to max_model.
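The simple strategy can be written as a small function that returns the weight of the general model and the weights of the sub-models. The function name and argument layout are hypothetical; the list of hits is assumed to be non-empty.

```python
from collections import Counter

def simple_weights(hit_sub_corpora, model_names):
    """Return (delta_0, {model: delta_i}) under the simple strategy.

    `hit_sub_corpora` lists the sub-corpus of each retrieved similar sentence
    and is assumed to be non-empty.
    """
    counts = Counter(hit_sub_corpora)
    max_model, max_count = counts.most_common(1)[0]
    proportion = max_count / len(hit_sub_corpora)
    if proportion > 0.5:
        # One sub-model dominates: use it alone.
        return 0.0, {m: (1.0 if m == max_model else 0.0) for m in model_names}
    # No sub-model dominates: fall back to the general model.
    return 1.0, {m: 0.0 for m in model_names}

dominant = simple_weights([1, 1, 1, 2, 3], [1, 2, 3])
fallback = simple_weights([1, 1, 2, 2, 3], [1, 2, 3])
```

With three of five hits in sub-corpus 1 the proportion is 0.6, so sub-model 1 receives weight 1; with at most two of five hits in any sub-corpus, the general model is used.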
Once the model weights $\delta_0$ and $\delta_i$ have been determined, the final translation model is the log-linear interpolation of the candidate models:
$$\hat{e} = \arg\max_{e} \left[ \delta_0 \log p_0(e \mid c) + \sum_{i=1}^{M} \delta_i \log p_i(e \mid c) \right]$$
Here c is the Chinese sentence to be translated, e is a candidate translation, and $\hat{e}$ is the translation with the highest probability. $p_0$ is the translation probability given by the general translation model, and $p_i$ is the translation probability given by the i-th sub-translation model.
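A log-linear interpolation of this kind can be sketched as a weighted sum of log probabilities. The sketch assumes the standard form in which the score of a candidate e is the sum over models of $\delta_i \log p_i(e \mid c)$, and it assumes that every model with a non-zero weight assigns a strictly positive probability. The function name and the probability values are hypothetical.

```python
import math

def best_translation(candidates, weights):
    """Return the candidate e maximizing sum_i delta_i * log p_i(e | c).

    `candidates` maps each candidate translation to its probabilities
    [p_0, p_1, ..., p_M]; `weights` is the matching list [delta_0, ..., delta_M].
    Models with zero weight are skipped, so their probabilities may be zero.
    """
    def score(probabilities):
        return sum(d * math.log(p)
                   for d, p in zip(weights, probabilities) if d > 0)
    return max(candidates, key=lambda e: score(candidates[e]))

# Hypothetical probabilities: [general model, sub-model 1, sub-model 2].
candidates = {
    "stock prices rose": [0.2, 0.6, 0.1],
    "shares went up": [0.3, 0.2, 0.4],
}
with_sub_model_1 = best_translation(candidates, [0.0, 1.0, 0.0])
with_general = best_translation(candidates, [1.0, 0.0, 0.0])
```

When a single weight is 1 and the rest are 0, the interpolation reduces to using that one model alone, which is the case produced by the simple strategy.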
According to this formula and the model selection strategy above, when the proportion of max_model, the model with the largest proportion, exceeds 0.5, max_model is used as the final translation model; otherwise the general model is used. More elaborate model selection strategies can of course be defined. The following strategy, for example, sets the weight of each sub-model according to the proportion of the retrieved similar sentences that fall in each sub-corpus:
if Proportion(max_model) > 0.5
    δ_0 = 0;
    δ_i = Proportion(model_i);
else
    δ_0 = 0.5;
    δ_i = 0.5 × Proportion(model_i);
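The proportional strategy can be sketched in the same way as the simple one. The function name and argument layout are hypothetical, and the list of hits is assumed to be non-empty.

```python
from collections import Counter

def proportional_weights(hit_sub_corpora, model_names):
    """Return (delta_0, {model: delta_i}) under the proportional strategy.

    Each sub-model is weighted by its share of the retrieved similar
    sentences; when no sub-model exceeds 0.5, half of the total weight is
    given to the general model and the shares are scaled by 0.5.
    """
    counts = Counter(hit_sub_corpora)
    total = len(hit_sub_corpora)
    proportions = {m: counts[m] / total for m in model_names}
    if max(proportions.values()) > 0.5:
        return 0.0, proportions
    return 0.5, {m: 0.5 * p for m, p in proportions.items()}

dominant = proportional_weights([1, 1, 1, 2, 3], [1, 2, 3])
mixed = proportional_weights([1, 1, 2, 2, 3], [1, 2, 3])
```

In both branches the weights sum to 1 when every hit belongs to one of the listed models, so the interpolation stays comparable across input sentences.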
Step 203: translate the input text with the translation model determined in step 202 to obtain the final translation result.
This step is similar to the translation procedure of existing statistical machine translation systems and is therefore not described in further detail here.
The above is a detailed description of the implementation of the online translation model selection method for statistical machine translation of the invention. Compared with the prior art, the invention divides the collected bilingual parallel corpus by category, builds a corresponding translation model for each sub-corpus and a general model for the entire corpus, and builds an index over the source-language sentences. After the text to be translated is input, similar sentences are first retrieved and the translation model is selected according to them. This avoids the defects of the prior art, in which a single translation model yields low translation accuracy and adapts poorly to texts from different domains.
Based on the online model selection method for statistical machine translation proposed above, the invention also provides a corresponding online translation model selection system. The system comprises a training module and a translation module; the training module comprises a corpus collection unit, a candidate translation model training unit, and an index building unit, and the translation module comprises a retrieval unit, a candidate translation model selection unit, and a translation unit.
The corpus collection unit collects a bilingual parallel corpus and builds sub-corpora according to the type of the collected data.
The candidate translation model training unit trains candidate translation models on the sub-corpora.
The index building unit builds an index for the sub-corpora, obtaining a corpus index file.
The retrieval unit retrieves from the corpus index file, according to the input text to be translated, sentences similar to the sentences in that text.
The candidate translation model selection unit obtains, according to the retrieval result, the candidate translation models corresponding to the sub-corpora that contain the similar sentences, and selects the final translation model from all candidate translation models.
The translation unit translates the text to be translated with the selected final translation model.
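The division into two modules of three units each can be sketched as two small classes whose units are passed in as functions. This is one possible arrangement for illustration only; the class names, field names, and the trivial stand-in units are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TrainingModule:
    collect: Callable      # corpus collection unit: () -> sub-corpora
    train: Callable        # candidate model training unit: sub-corpora -> models
    build_index: Callable  # index building unit: sub-corpora -> index

    def run(self):
        sub_corpora = self.collect()
        return self.train(sub_corpora), self.build_index(sub_corpora)

@dataclass
class TranslationModule:
    retrieve: Callable     # retrieval unit: (text, index) -> hit sub-corpora
    select: Callable       # model selection unit: (hits, models) -> final model
    translate: Callable    # translation unit: (text, model) -> translation

    def run(self, text, models, index):
        hits = self.retrieve(text, index)
        model = self.select(hits, models)
        return self.translate(text, model)

# Trivial stand-in units, only to show how the modules are wired together.
training = TrainingModule(
    collect=lambda: {"travel": ["passport"], "finance": ["stock"]},
    train=lambda sc: {name: name.upper() for name in sc},
    build_index=lambda sc: [(name, s) for name, sents in sc.items() for s in sents],
)
models, index = training.run()
translation = TranslationModule(
    retrieve=lambda text, idx: [name for name, s in idx if s in text],
    select=lambda hits, ms: ms[hits[0]] if hits else None,
    translate=lambda text, model: f"[{model}] {text}",
)
result = translation.run("stock price", models, index)
```

Keeping each unit behind a plain function boundary matches the point that the selection method is independent of the underlying translation method: any trainer or decoder with the same inputs and outputs could be substituted.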
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention and not to limit it. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solution that do not depart from its spirit and scope shall all fall within the scope of the claims of the invention.