CN101079028A

CN101079028A - Online translation model selection method for statistical machine translation

Info

Publication number: CN101079028A
Application number: CN 200710099724
Authority: CN
Inventors: 吕雅娟; 刘群; 黄瑾
Original assignee: Institute of Computing Technology of CAS
Current assignee: Huawei Technologies Co Ltd
Priority date: 2007-05-29
Filing date: 2007-05-29
Publication date: 2007-11-28
Anticipated expiration: 2027-05-29
Also published as: CN100527125C

Abstract

The invention discloses an online translation model selection method for statistical machine translation, comprising a training stage and a translation stage. In the training stage, a bilingual parallel corpus is collected and divided into different sub-corpora; a candidate translation model is trained for each sub-corpus; and an index is built over the sub-corpora to obtain a corpus index file. In the translation stage, the text to be translated is input; sentences similar to the sentences of that text are retrieved from the corpus index file; the candidate translation models corresponding to the sub-corpora containing the similar sentences are obtained; a final translation model is selected from all candidate translation models; and the input text is translated with the final translation model to obtain the final translation result. The invention can improve the translation quality of a machine translation system.

Description

An online translation model selection method in statistical machine translation

Technical field

The present invention relates to the technical field of statistical machine translation, and in particular to an online translation model selection method for a statistical machine translation system.

Background technology

With the arrival of the information age and the rapid development of the Internet, exchanges between countries are becoming ever more extensive, and the demand for machine translation is increasingly urgent. In recent years, machine translation research has developed considerably; new techniques, represented above all by statistical machine translation, have achieved a degree of breakthrough and have become the mainstream of current machine translation research.

Machine translation methods can be divided into rule-based methods and statistics-based methods (statistical machine translation). In traditional rule-based machine translation, translation knowledge takes the form of dictionaries and rules, which are mainly written by human experts. The main problems of this approach are: writing linguistic knowledge by hand consumes a great deal of manpower, material resources, and time; hand-written knowledge can hardly cover all the problems encountered in real translation environments; hand-written knowledge offers no good way of resolving conflicts between rules; and hand-written knowledge is difficult to port to other languages and domains. In statistical machine translation, by contrast, all translation knowledge comes from real bilingual parallel corpora: translation knowledge is learned automatically from the corpus through statistical modeling. This overcomes the main problems of hand-written knowledge and makes the system easy to port to new domains and languages. Because the approach rests on a rigorous statistical model and resolves knowledge conflicts in a more principled way, it generally yields better translation results. This is the main reason why statistics-based machine translation currently surpasses rule-based machine translation in translation quality.

Building a statistical machine translation system generally involves two main processes: training and decoding. Training is the automatic estimation of the parameters of a statistical translation model from corpus resources according to some algorithm; decoding is the process of translating input text using the model parameters obtained in training, and is therefore often simply called translation. Training and decoding in the prior art are described in Reference 1, "Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2, pages 263-311"; Reference 2, "Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics annual meeting 2003, pages 127-133"; and Reference 3, "Franz J. Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics 2002, pages 295-302".

An important resource in the training of a statistical machine translation system is the bilingual parallel corpus, that is, a collection of texts together with their translations in a second language. Because all translation knowledge in a statistical machine translation system derives from the bilingual parallel corpus, the size and quality of that corpus directly affect the translation quality of the system. In general, the larger the bilingual parallel corpus used to train the translation model, the more stable the estimated model parameters, the closer they are to the true values, and the higher the translation quality. Many researchers have therefore proposed methods for collecting bilingual corpora automatically, for example by extracting bilingual parallel corpora from comparable texts or from the Web. However, the bilingual parallel corpora collected so far are often strongly domain-specific. For example, the larger bilingual parallel corpora commonly used today for training Chinese-English statistical machine translation come from widely differing domains such as Hong Kong parliamentary proceedings, Hong Kong laws, and Xinhua News Agency news. Simply merging corpora from such different domains for training does not noticeably improve translation quality. A translation model trained on the corpus of one domain gives good results within that domain, but translation quality drops considerably when the model is applied to other domains; in other words, a statistical machine translation system is very sensitive to the domains of the training corpus and of the text being translated. In practical applications, the system usually cannot predict the domain of the text the user will input, and translating texts from different domains with a single unified model inevitably degrades translation quality. How to improve the domain adaptability of a statistical machine translation system to different input texts, raise its translation quality, and advance its practical use is therefore a problem in urgent need of a solution.

Summary of the invention

The object of the invention is to overcome the defect that existing statistical machine translation systems cannot adapt to texts from different domains at the same time, by providing a method that selects the translation model according to the text to be translated, so that better translation results can be obtained for input from different domains.

To achieve this object, the invention provides a method for generating candidate translation models in statistical machine translation, comprising the following steps:

Step 101), collecting a bilingual parallel corpus and dividing it by type into different sub-corpora, thereby constructing sub-corpora of different types;

Step 102), training candidate translation models on the sub-corpora obtained in step 101);

Step 103), building an index over the sub-corpora obtained in step 101) to obtain a corpus index file.

In the above technical solution, in step 101), the bilingual parallel corpus is divided according to the domain, topic, and wording of its data: a classification or clustering method is used to place data with similar domain, topic, and wording in the same sub-corpus.

The classification or clustering method includes k-means clustering, k-nearest-neighbor classification, or maximum entropy classification.

In the above technical solution, in step 102), a translation model is trained on each of the sub-corpora to obtain the corresponding sub-translation model; in addition, the entire bilingual parallel corpus is used for training to obtain a general translation model.

In the above technical solution, in step 103), an index is built over the source-language sentence of each translation sentence pair in the bilingual parallel corpus, and the index includes information on the sub-corpus to which that source-language sentence belongs.

The Lemur information retrieval toolkit is used to build the index.

The invention also provides a method for translating with candidate translation models in statistical machine translation, comprising the following steps:

Step 201), inputting the text to be translated, and retrieving from the corpus index file sentences similar to the sentences in the text to be translated;

Step 202), according to the retrieval result of step 201), obtaining the candidate translation models corresponding to the sub-corpora containing the similar sentences, and selecting a final translation model from all candidate translation models;

Step 203), translating the input text with the final translation model determined in step 202) to obtain the final translation result.

In the above technical solution, in step 201), a similarity retrieval model is used to compute the similarity between the text to be translated and every indexed document in the corpus index file; the results are then sorted in descending order of similarity, and at least one sentence with the highest similarity is selected, each selected sentence carrying information on the sub-corpus to which it belongs.

The retrieval of similar sentences is implemented with a vector space model and the TF-IDF similarity measure.

In the above technical solution, in step 202), a selection strategy is set, and according to this strategy one candidate translation model, or a combination of several candidate translation models, is selected from all candidate translation models as the final translation model.

The selection strategy includes determining the candidate translation model according to the number of similar sentences contained in each sub-corpus, or additionally taking the similarity values into account.

The invention further provides an online translation model selection method for statistical machine translation, comprising a training stage and a translation stage, characterized in that the training stage comprises the following steps:

Step 103), building an index over the sub-corpora obtained in step 101) to obtain a corpus index file;

The translation stage comprises the following steps:

Step 201), inputting the text to be translated, and retrieving from the corpus index file obtained in step 103) sentences similar to the sentences in the text to be translated;

The invention further provides an online translation model selection system for statistical machine translation, comprising a training module and a translation module. The training module comprises a corpus collection unit, a candidate translation model training unit, and an index building unit; the translation module comprises a retrieval unit, a candidate translation model selection unit, and a translation unit; wherein:

The corpus collection unit collects a bilingual parallel corpus and constructs sub-corpora according to the type of the collected data;

The candidate translation model training unit trains candidate translation models on the sub-corpora;

The index building unit builds an index over the sub-corpora to obtain a corpus index file;

The retrieval unit retrieves from the corpus index file, for the input text to be translated, sentences similar to the sentences in that text;

The candidate translation model selection unit obtains, according to the retrieval result, the candidate translation models corresponding to the sub-corpora containing the similar sentences, and selects a final translation model from all candidate translation models;

The translation unit translates the text to be translated with the selected final translation model.

The advantages of the invention are:

1. The online translation model selection method provided by the invention allows a statistical machine translation system to select a suitable translation model online according to the input text. This solves the problem that statistical machine translation systems adapt poorly to input texts from different domains, effectively improves translation quality, and provides a feasible route to practical statistical machine translation.

2. The online translation model selection method provided by the invention is independent of the modeling, training, and decoding procedures of any particular statistical machine translation method, and can be applied to a variety of such methods, for example word-based, phrase-based, and syntax-based statistical machine translation. The invention therefore has the advantages of good adaptability and simple implementation.

Description of drawings

Fig. 1 is a schematic diagram of the model training part of the online translation model selection method for statistical machine translation of the invention;

Fig. 2 is a schematic diagram of the online translation part of the online translation model selection method for statistical machine translation.

Embodiment

The invention is described in further detail below with reference to the drawings and specific embodiments.

The online translation model selection method for statistical machine translation of the invention comprises two main parts, model training and online translation, which are described in turn below.

As shown in Fig. 1, the model training process of the invention comprises the following steps:

Step 101, collect a bilingual parallel corpus and divide it by type into different sub-corpora, thereby constructing sub-corpora of different types. The bilingual parallel corpus collected in this step is generally sentence-aligned, that is, it contains sentences paired with their translations. When the corpus is divided into sub-corpora of different types, the data within one sub-corpus should be as similar as possible in domain, topic, and wording, while the data in different sub-corpora should differ as much as possible in these respects. The division can be performed by classification or clustering; any existing classification or clustering method can be applied, such as the commonly used k-means clustering, k-nearest-neighbor classification, or maximum entropy classification. In addition, the source and domain of a corpus are often known when it is collected, in which case the corpus can be divided directly into sub-corpora of different domains according to its source and domain.

Through the above operations, the collected bilingual parallel corpus is divided into several sub-corpora. The number of sub-corpora should not be too large: each sub-corpus must contain a sufficient amount of data (translation sentence pairs), so that translation quality does not suffer from sub-corpora that are too small. In addition, a translation sentence pair of the original corpus may be assigned to several sub-corpora at the same time; that is, the resulting sub-corpora are allowed to share identical translation sentence pairs.
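As a rough illustration of the source-based division described above, the following Python sketch groups sentence pairs by a known domain label and flags sub-corpora that fall below a minimum size. The function names, the (source, target, domain) triple layout, and the `min_pairs` threshold are assumptions made for this example; the description does not prescribe a data format or a size limit.

```python
from collections import defaultdict


def partition_by_domain(labeled_pairs):
    """Group (source, target, domain) triples into sub-corpora keyed by domain."""
    subcorpora = defaultdict(list)
    for source, target, domain in labeled_pairs:
        subcorpora[domain].append((source, target))
    return dict(subcorpora)


def undersized(subcorpora, min_pairs):
    """Return the names of sub-corpora holding fewer than min_pairs sentence pairs."""
    return sorted(name for name, pairs in subcorpora.items() if len(pairs) < min_pairs)


# Hypothetical toy data: the domain label stands in for the known corpus source.
pairs = [
    ("法院 裁定", "the court ruled", "law"),
    ("法院 上诉", "court appeal", "law"),
    ("市场 报告", "market report", "news"),
]
subcorpora = partition_by_domain(pairs)
too_small = undersized(subcorpora, min_pairs=2)
```

When no source labels are available, the same dictionary of sub-corpora could instead be filled from the output of a clustering or classification step.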

Step 102, train candidate translation models on the sub-corpora obtained in step 101. A translation model is trained on each of the sub-corpora to obtain the corresponding sub-translation model. In addition, the entire bilingual parallel corpus is used for training to obtain a general translation model.

Training a translation model is mature prior art, and any commonly used training method can be adopted in this step. In this embodiment, for example, the EM training method disclosed in Reference 1, the maximum likelihood training method disclosed in Reference 2, or the discriminative training method disclosed in Reference 3 can be used.

The translation models obtained in this step are the candidate translation models used in the subsequent translation stage.
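The bookkeeping of step 102 might be sketched as follows. The trainer is passed in as `train_fn`, a placeholder for whatever real training routine is used; `toy_train`, the `"general"` key, and the removal of pairs shared between sub-corpora before training the general model are assumptions of this sketch, not details given in the description.

```python
def train_candidate_models(subcorpora, train_fn):
    """Train one sub-model per sub-corpus plus a general model on all pairs.

    train_fn is a placeholder for a real training routine (an assumption).
    The key "general" is reserved for the general model in this sketch.
    """
    models = {name: train_fn(pairs) for name, pairs in subcorpora.items()}
    # A pair may appear in several sub-corpora, so drop repeats (keeping
    # first-seen order) to approximate training on the whole original corpus.
    all_pairs = list(dict.fromkeys(
        pair for pairs in subcorpora.values() for pair in pairs
    ))
    models["general"] = train_fn(all_pairs)
    return models


def toy_train(pairs):
    """Toy stand-in trainer: the "model" is the set of target-side words seen."""
    return {word for _source, target in pairs for word in target.split()}


subcorpora = {
    "law": [("法院 裁定", "the court ruled"), ("法院", "court")],
    "news": [("市场 报告", "market report"), ("法院", "court")],
}
models = train_candidate_models(subcorpora, toy_train)
```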

Step 103, build an index over the sub-corpora to obtain a corpus index file. The index is built over the source-language sentence of each translation sentence pair in the sub-corpora, and it records the sub-corpus to which each source-language sentence belongs. The purpose of the index is to allow the N sentences most similar to a given text to be retrieved quickly and conveniently during translation, while also showing which sub-corpus or sub-corpora those sentences come from. Indexing the sub-corpora can be done with mature prior art; in this embodiment the Lemur information retrieval toolkit can be used. During indexing, the source-language sentence of each translation sentence pair is treated as one document, and the sub-corpus to which the document belongs is recorded in the document's path information.
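The idea of indexing each source sentence as its own document, tagged with its sub-corpus, can be illustrated with a small in-memory inverted index. This is only a sketch in plain Python and does not use or imitate the Lemur toolkit; `build_index`, the whitespace tokenization, and the returned structures are assumptions of the example.

```python
from collections import defaultdict


def build_index(subcorpora):
    """Index every source sentence as one document tagged with its sub-corpus.

    Returns (documents, postings):
      documents[doc_id] = (subcorpus_name, tokens)
      postings[word]    = set of doc_ids whose source sentence contains word
    """
    documents = []
    postings = defaultdict(set)
    for name, pairs in subcorpora.items():
        for source, _target in pairs:
            doc_id = len(documents)
            tokens = source.lower().split()  # assumption: whitespace-tokenized text
            documents.append((name, tokens))
            for token in tokens:
                postings[token].add(doc_id)
    return documents, dict(postings)


documents, postings = build_index({
    "law": [("court ruling", "x")],
    "news": [("market report", "y"), ("court report", "z")],
})
```

Because each entry keeps its sub-corpus name, any sentence found through the index can be traced back to the candidate model trained on that sub-corpus.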

The above operations complete the training of the translation models. The online translation process is described in detail below.

As shown in Fig. 2, the online translation part of the online translation model selection method for statistical machine translation of the invention comprises the following steps:

Step 201, input the text to be translated, and retrieve from the corpus index file training sentences similar to the sentences in the text to be translated.

To retrieve similar sentences for the text to be translated, a similarity retrieval method can be used to fetch the N most similar sentences from the index of the training corpus. Each retrieved sentence carries its sub-corpus information, that is, which sub-corpus the sentence belongs to.

The similarity retrieval method can be implemented in several ways, for example with the Dice coefficient, edit distance, or the cosine measure. In this embodiment, the vector space model and the TF-IDF similarity measure commonly used in information retrieval can be adopted to retrieve similar sentences, as explained below:

In vector space model retrieval, both the query input by the user and the documents in the system are represented as vectors. Assuming there are n words in total, each document (or query) D_i can be regarded as an n-dimensional vector (w_{i1}, w_{i2}, ..., w_{in}), where w_{ij} is the weight of document D_i in the j-th dimension. These weights can be computed by the following TF-IDF method:

w_{ij} = tf_{ij} \times \log(idf_{j})

where tf_{ij} is the frequency with which word j occurs in document D_i; the larger tf_{ij} is, the more important word j is for document D_i. idf_j is called the inverse document frequency and is the reciprocal of the number of documents containing word j; in practice it is generally computed as the total number of documents divided by the number of documents containing word j. The smaller idf_j is, the more documents contain word j, and the less useful word j is for measuring document similarity.
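The weighting formula above can be worked through in a few lines of Python. The sketch takes idf_j as the total number of documents divided by the number of documents containing word j, as stated in the text; the function name `tfidf_vectors`, the use of raw counts for tf, and the sparse dictionary representation are assumptions of the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse vector {word: w_ij} per tokenized document.

    w_ij = tf_ij * log(idf_j), with idf_j = N / (number of docs containing j).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)
        vectors.append({
            word: term_freq[word] * math.log(n_docs / doc_freq[word])
            for word in term_freq
        })
    return vectors


docs = [["court", "ruling", "court"], ["court", "report"]]
vectors = tfidf_vectors(docs)
```

In this toy input, "court" occurs in every document, so idf is 1 and its weight is 0 regardless of how often it occurs, which matches the remark that widely shared words contribute little to measuring similarity.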

When the user inputs the text to be translated, the retrieval system first computes the similarity between the text to be translated and the vector of every indexed document, and then sorts all results in descending order of similarity. The similarity is usually expressed as the cosine of the angle between the vectors or as their inner product.

As noted in step 103, the Lemur information retrieval toolkit can be used to build the index; in this step the same toolkit can be used to retrieve similar sentences based on the vector space model and TF-IDF similarity. Retrieval yields the top N training sentences most similar to the text to be translated, together with the training sub-corpus to which each sentence belongs.
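A minimal, self-contained sketch of this retrieval step is given below: it weighs the query and every indexed sentence with TF-IDF, ranks by cosine similarity, and returns the top N hits with their sub-corpus labels. It is written in plain Python rather than with the Lemur toolkit; `retrieve_similar`, the (sub-corpus, tokens) input layout, and the decision to ignore query words absent from the index are assumptions of the example.

```python
import math
from collections import Counter


def retrieve_similar(query_tokens, indexed, top_n):
    """Return the top_n (similarity, subcorpus, tokens) hits for a query.

    indexed is a list of (subcorpus_name, tokens), one entry per source sentence.
    """
    n_docs = len(indexed)
    doc_freq = Counter()
    for _name, tokens in indexed:
        doc_freq.update(set(tokens))

    def weigh(tokens):
        # TF-IDF weights; words unseen in the index are skipped (assumption).
        term_freq = Counter(tokens)
        return {w: term_freq[w] * math.log(n_docs / doc_freq[w])
                for w in term_freq if doc_freq[w]}

    def cosine(u, v):
        dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    query_vec = weigh(query_tokens)
    scored = [(cosine(query_vec, weigh(tokens)), name, tokens)
              for name, tokens in indexed]
    scored.sort(key=lambda hit: hit[0], reverse=True)  # most similar first
    return scored[:top_n]


indexed = [
    ("law", ["court", "ruling"]),
    ("news", ["market", "report"]),
    ("law", ["court", "appeal"]),
]
results = retrieve_similar(["court", "ruling"], indexed, top_n=2)
```

The sub-corpus label carried by each hit is what the model selection in the next step operates on.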

Step 202, select the translation model according to the retrieval result of step 201. Retrieving the similar sentences in step 201 also yields the sub-corpus to which each similar sentence belongs. As described in step 102, each sub-corpus corresponds to one candidate translation model; the similar sentences obtained in step 201 may belong to different sub-corpora and therefore correspond to different candidate translation models. This step selects one of these candidate models, or a combination of several of them, as the final translation model according to a selection strategy. The selection strategy can be chosen according to actual needs: it can be based on the number of similar sentences in each sub-corpus, or it can additionally take the similarity values into account.

Suppose a sentence to be translated has 5 similar sentences, of which 3 belong to sub-corpus 1, 1 belongs to sub-corpus 2, and 1 belongs to sub-corpus 3. Under the strategy based on the number of similar sentences per sub-corpus, the candidate translation model corresponding to sub-corpus 1 is taken as the final translation model.

Suppose instead that a sentence to be translated has 5 similar sentences with similarities 0.9, 0.7, 0.5, 0.3, and 0.1, where the 1st and 2nd similar sentences belong to sub-corpus 1 and the 3rd, 4th, and 5th belong to sub-corpus 2. Under the strategy based on similarity values, the total similarity of sub-corpus 1 is 1.6 (0.9 + 0.7) and that of sub-corpus 2 is 0.9 (0.5 + 0.3 + 0.1). Therefore, although sub-corpus 2 contains more similar sentences, the candidate translation model corresponding to sub-corpus 1 is still selected as the final translation model.
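The two worked examples above can be reproduced with the following Python sketch. The function names and the (similarity, sub-corpus) tuple layout are assumptions of the example, and ties are broken simply by whichever sub-corpus is encountered first.

```python
from collections import Counter, defaultdict


def select_by_count(similar):
    """Pick the sub-corpus that contributes the most similar sentences."""
    counts = Counter(subcorpus for _similarity, subcorpus in similar)
    return counts.most_common(1)[0][0]


def select_by_similarity_sum(similar):
    """Pick the sub-corpus whose similar sentences have the largest total similarity."""
    totals = defaultdict(float)
    for similarity, subcorpus in similar:
        totals[subcorpus] += similarity
    return max(totals, key=totals.get)


# First example: 3 similar sentences from sub-corpus 1, one each from 2 and 3.
# (The similarity values here are arbitrary; only the counts matter.)
example_1 = [(0.8, 1), (0.7, 1), (0.6, 1), (0.5, 2), (0.4, 3)]

# Second example: similarities 0.9 and 0.7 from sub-corpus 1; 0.5, 0.3, 0.1 from 2.
example_2 = [(0.9, 1), (0.7, 1), (0.5, 2), (0.3, 2), (0.1, 2)]

choice_1 = select_by_count(example_1)
choice_2 = select_by_similarity_sum(example_2)
```

On the second example the two strategies disagree: counting favors sub-corpus 2, while summing similarities favors sub-corpus 1, as in the text.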

The concrete implementation of this step is illustrated below with a simple model selection strategy:

if Proportion(max_model) > 0.5
    δ_0 = 0; δ_i = 1 for i = max_model; δ_i = 0 for i ≠ max_model;
else
    δ_0 = 1; δ_i = 0;

where δ_0 denotes the weight of the general translation model and δ_i denotes the weight of the i-th sub-translation model, i = 1, ..., M. max_model is the sub-model that accounts for the largest proportion. The function Proportion(max_model) gives the proportion of the retrieved similar sentences that belong to the sub-corpus corresponding to max_model.
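A direct transcription of this simple strategy into Python might look as follows. The input is the list of sub-corpus labels of the retrieved similar sentences; the function name, the return layout (δ_0 plus a dictionary of δ_i), and the fallback to the general model when nothing was retrieved are assumptions of the sketch.

```python
from collections import Counter


def simple_weights(similar_subcorpora, model_names, threshold=0.5):
    """Return (delta_0, {model: delta_i}) under the all-or-nothing strategy."""
    deltas = {name: 0.0 for name in model_names}
    if not similar_subcorpora:
        # Assumption: with no retrieved sentences, use the general model alone.
        return 1.0, deltas
    max_model, max_count = Counter(similar_subcorpora).most_common(1)[0]
    if max_count / len(similar_subcorpora) > threshold:
        deltas[max_model] = 1.0  # the dominant sub-model translates alone
        return 0.0, deltas
    return 1.0, deltas           # no clear winner: general model only


names = ["m1", "m2", "m3"]
dominant = simple_weights(["m1", "m1", "m1", "m2", "m3"], names)
no_winner = simple_weights(["m1", "m1", "m2", "m3"], names)
```

In the second call the largest proportion is exactly 0.5, which does not exceed the threshold, so the general model is chosen.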

After the model weights δ_0 and δ_i have been determined, the final translation model is the log-linear interpolation of these candidate models:

\hat{e} = \underset{e}{\arg \max} (δ_{0} \log (p_{0} (e | c)) + Σ_{i = 1}^{M} δ_{i} \log (p_{i} (e | c)))

where c denotes the Chinese sentence to be translated, e denotes a candidate translation, and \hat{e} denotes the translation with the highest probability. p_0 is the translation probability given by the general translation model, and p_i is the translation probability given by the i-th sub-translation model.
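To make the interpolation concrete, the sketch below scores an explicit list of candidate translations with the weighted sum of log probabilities and returns the best one. A real decoder searches a far larger hypothesis space; the enumerated candidate list, the probability tables, the function name, and the choice to skip zero-weight terms (so that a zero probability under an unused model never reaches the logarithm) are all assumptions of this example.

```python
import math


def best_translation(candidates, general_probs, sub_probs, delta_0, deltas):
    """Return the candidate e maximizing
    delta_0 * log p_0(e|c) + sum_i delta_i * log p_i(e|c).

    general_probs: {e: p_0(e|c)}
    sub_probs:     {model_name: {e: p_i(e|c)}}
    deltas:        {model_name: delta_i}
    """
    def score(e):
        total = 0.0
        if delta_0:
            total += delta_0 * math.log(general_probs[e])
        for name, delta in deltas.items():
            if delta:  # zero-weight models are skipped entirely (assumption)
                total += delta * math.log(sub_probs[name][e])
        return total

    return max(candidates, key=score)


candidates = ["x", "y"]
general_probs = {"x": 0.6, "y": 0.4}
sub_probs = {"law": {"x": 0.2, "y": 0.8}}

with_sub_model = best_translation(candidates, general_probs, sub_probs,
                                  0.0, {"law": 1.0})
with_general = best_translation(candidates, general_probs, sub_probs,
                                1.0, {"law": 0.0})
```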

According to this formula and the model selection strategy above, when the proportion of the largest model max_model exceeds 0.5, max_model is used as the final translation model; otherwise, the general model is used as the final translation model. More complex model selection strategies can of course be defined. The following strategy, for example, determines the weight of each sub-model from the proportion of the retrieved similar sentences that belongs to each sub-corpus:

if Proportion(max_model) > 0.5
    δ_0 = 0;
    δ_i = Proportion(model_i);
else
    δ_0 = 0.5;
    δ_i = 0.5 × Proportion(model_i);
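The proportional strategy can be sketched in the same way. As before, the input is the list of sub-corpus labels of the retrieved similar sentences; the function name, the return layout, and the fallback to the general model when nothing was retrieved are assumptions of the example.

```python
from collections import Counter


def proportional_weights(similar_subcorpora, model_names, threshold=0.5):
    """Return (delta_0, {model: delta_i}) with weights proportional to the
    share of retrieved similar sentences belonging to each sub-corpus."""
    total = len(similar_subcorpora)
    if total == 0:
        # Assumption: with no retrieved sentences, use the general model alone.
        return 1.0, {name: 0.0 for name in model_names}
    counts = Counter(similar_subcorpora)
    proportions = {name: counts[name] / total for name in model_names}
    if max(proportions.values(), default=0.0) > threshold:
        return 0.0, proportions
    # No dominant sub-model: split the weight between the general model
    # and the sub-models, each scaled by its proportion.
    return 0.5, {name: 0.5 * p for name, p in proportions.items()}


names = ["m1", "m2", "m3"]
dominant = proportional_weights(["m1", "m1", "m1", "m2", "m3"], names)
mixed = proportional_weights(["m1", "m1", "m2", "m3"], names)
```

With three of five similar sentences in m1, the sub-models receive weights 0.6, 0.2, and 0.2 and the general model receives 0; with two of four, the general model receives 0.5 and the sub-models 0.25, 0.125, and 0.125.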

Step 203, translate the input text with the translation model determined in step 202 to obtain the final translation result.

This step is similar to the translation procedure of existing statistical machine translation systems and is therefore not described further here.

The above is a detailed description of the implementation of the online translation model selection method for statistical machine translation of the invention. Compared with the prior art, the invention divides the collected bilingual parallel corpus by category, builds a corresponding translation model for each sub-corpus, builds a general model for the entire bilingual parallel corpus, and builds an index over the source-language sentences. After the text to be translated is input, similar sentences are first retrieved and the translation model is selected according to them. This avoids the defects of the prior art, in which using a single translation model leads to low translation accuracy and weak adaptability to texts from different domains.

In line with the online model selection method for statistical machine translation proposed above, the invention also proposes a corresponding online translation model selection system. The system comprises a training module and a translation module. The training module comprises a corpus collection unit, a candidate translation model training unit, and an index building unit; the translation module comprises a retrieval unit, a candidate translation model selection unit, and a translation unit.

The corpus collection unit collects a bilingual parallel corpus and constructs sub-corpora according to the type of the collected data.

The candidate translation model training unit trains candidate translation models on the sub-corpora.

The index building unit builds an index over the sub-corpora to obtain a corpus index file.

The retrieval unit retrieves from the corpus index file, for the input text to be translated, sentences similar to the sentences in that text.

The candidate translation model selection unit obtains, according to the retrieval result, the candidate translation models corresponding to the sub-corpora containing the similar sentences, and selects a final translation model from all candidate translation models.

The translation unit translates the text to be translated with the selected final translation model.

Finally, it should be noted that the above embodiments only illustrate the technical solution of the invention and do not limit it. Although the invention has been described in detail with reference to the embodiments, those of ordinary skill in the art should understand that modifications of, or equivalent substitutions in, the technical solution of the invention that do not depart from its spirit and scope are all covered by the claims of the invention.

Claims

1. A method for generating candidate translation models in statistical machine translation, comprising the following steps:

Step 102), training candidate translation models on said sub-corpora;

Step 103), building an index over said sub-corpora to obtain a corpus index file.

2. The method for generating candidate translation models in statistical machine translation according to claim 1, characterized in that, in said step 101), dividing the bilingual parallel corpus into different sub-corpora means: dividing the bilingual parallel corpus according to the domain, topic, and wording of its data, using a classification or clustering method to place data with similar domain, topic, and wording in the same sub-corpus.

3. The method for generating candidate translation models in statistical machine translation according to claim 2, characterized in that said classification or clustering method is k-means clustering, k-nearest-neighbor classification, or maximum entropy classification.

4. The method for generating candidate translation models in statistical machine translation according to claim 1, characterized in that said step 102) further comprises the following steps:

training a translation model on each sub-corpus to obtain the corresponding sub-translation model;

training a translation model on all sub-corpora to obtain a general translation model.

5. The method for generating candidate translation models in statistical machine translation according to claim 1, characterized in that, in said step 103), building an index over said sub-corpora means:

building an index over the source-language sentence of each translation sentence pair in the sub-corpora, said index including information on the sub-corpus to which that source-language sentence belongs.

6. The method for generating candidate translation models in statistical machine translation according to claim 5, characterized in that the Lemur information retrieval toolkit is used to build the index.

7. A method for translating with candidate translation models in statistical machine translation, comprising the following steps:

Step 201), inputting the text to be translated, and retrieving from the corpus index file sentences similar to the sentences in the text to be translated, to obtain a retrieval result;

Step 202), according to said retrieval result, obtaining the candidate translation models corresponding to the sub-corpora containing the similar sentences, and selecting a final translation model from all candidate translation models;

Step 203), translating the input text to be translated with said final translation model to obtain the final translation result.

8. The method for translating with candidate translation models in statistical machine translation according to claim 7, characterized in that, in said step 201), retrieving from the corpus index file sentences similar to the sentences in the text to be translated means:

using a similarity retrieval method to compute the similarity between said text to be translated and every indexed document in the corpus index file, then sorting all results in descending order of similarity and selecting at least one sentence with the highest similarity, each selected sentence carrying information on the sub-corpus to which it belongs.

9. The method for translating with candidate translation models in statistical machine translation according to claim 8, characterized in that said similarity retrieval method is the vector space model with the TF-IDF similarity measure.

10. The method for translating with candidate translation models in statistical machine translation according to claim 7, characterized in that, in said step 202), selecting a final translation model from all candidate translation models means:

setting a selection strategy and, according to the selection strategy, selecting from all candidate translation models one candidate translation model or a combination of several candidate translation models as said final translation model.

11. The method for translating with candidate translation models in statistical machine translation according to claim 10, characterized in that said selection strategy determines the candidate translation model according to the number of similar sentences contained in each sub-corpus, or additionally takes the similarity values into account.

12. An online translation model selection method in statistical machine translation, comprising a training stage and a translation stage, characterized in that said training stage comprises the following steps:

Step 103), building an index over said sub-corpora to obtain a corpus index file;

said translation stage comprises the following steps:

Step 201), inputting the text to be translated, and retrieving from said corpus index file sentences similar to the sentences in the text to be translated, to obtain a retrieval result;

13. An online translation model selection system in statistical machine translation, comprising a training module and a translation module, characterized in that said training module comprises a corpus collection unit, a candidate translation model training unit, and an index building unit, and said translation module comprises a retrieval unit, a candidate translation model selection unit, and a translation unit; wherein: