CN101079028A - Online translation model selection method for statistical machine translation

Info

Publication number
CN101079028A
Authority
CN
China
Prior art keywords
translation
corpus
candidate
sub
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200710099724
Other languages
Chinese (zh)
Other versions
CN100527125C (en)
Inventor
吕雅娟
刘群
黄瑾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CNB2007100997246A (CN100527125C)
Publication of CN101079028A
Application granted
Publication of CN100527125C
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an online translation model selection method for statistical machine translation, comprising a training stage and a translation stage. In the training stage, a bilingual parallel corpus is collected and divided into different sub-corpora; candidate translation models are trained on the sub-corpora; and an index is built for the sub-corpora to obtain a corpus index file. In the translation stage, the text to be translated is input; sentences similar to that text are retrieved from the corpus index file; the candidate translation models corresponding to the sub-corpora containing the similar sentences are obtained; a final translation model is selected from all candidate translation models; and the input text is translated with the final translation model to obtain the final translation result. The invention can improve the translation quality of a machine translation system.

Description

Online translation model selection method for statistical machine translation
Technical field
The present invention relates to the field of statistical machine translation, and in particular to an online translation model selection method for statistical machine translation systems.
Background art
With the arrival of the information age and the rapid development of the Internet, exchanges between countries are increasingly extensive, and the demand for machine translation is increasingly urgent. In recent years machine translation research has advanced considerably. In particular, new techniques represented by statistical machine translation have achieved a degree of breakthrough and have become the mainstream of current machine translation research.
Machine translation methods can be divided into rule-based methods and statistics-based methods (statistical machine translation). In traditional rule-based machine translation, translation knowledge takes the form of dictionaries and rules, which are mainly written by human experts. The main problems with this approach are: writing linguistic knowledge by hand consumes a great deal of labor, resources, and time; hand-written knowledge can hardly cover all the problems encountered in real translation environments; hand-written knowledge offers no good way to resolve conflicts between rules; and hand-written knowledge is hard to port to other languages and domains. In statistical machine translation, by contrast, all translation knowledge comes from real bilingual parallel corpora and is learned automatically through statistical modeling. This overcomes the main problems of expert-written knowledge and makes the approach easy to port to new domains and languages. Because it rests on a rigorous statistical model and resolves conflicts between pieces of knowledge more soundly, it generally achieves better translation results. This is the main reason why statistics-based machine translation currently surpasses rule-based machine translation in translation quality.
Building a statistical machine translation system usually involves two main processes: training and decoding. Training automatically estimates the parameters of the statistical translation model from corpus resources according to some algorithm. Decoding translates the input text using the model parameters obtained in training, and is therefore often simply called translation. Training and decoding in the prior art are described in Reference 1: Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2, pages 263-311; Reference 2: Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics Annual Meeting 2003, pages 127-133; and Reference 3: Franz J. Och and Hermann Ney. 2002. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics 2002, pages 295-302.
An important resource in the training of statistical machine translation is the bilingual parallel corpus, that is, a collection of texts paired with their translations in a second language. Because all translation knowledge in a statistical machine translation system comes from the bilingual parallel corpus, the size and quality of the corpus directly affect the translation quality of the system. In general, the larger the bilingual parallel corpus used to train the translation model, the more stable the trained model parameters, the closer they are to the true values, and the higher the translation quality. Many researchers have therefore proposed methods for collecting bilingual corpora automatically, such as obtaining bilingual parallel corpora from comparable texts or from the Web.
However, the bilingual parallel corpora collected at present are often strongly domain-specific. For example, several of the larger bilingual parallel corpora commonly used for training Chinese-English statistical machine translation come from widely differing domains such as Hong Kong parliamentary proceedings, Hong Kong law, and Xinhua News Agency news. Simply merging corpora from such widely differing domains for training does not noticeably improve translation quality. A translation model trained on a corpus from one domain achieves good results within that domain, but its translation quality drops considerably when it is applied to other domains; that is, a statistical machine translation system is very sensitive to the domains of the training corpus and of the text being translated. In practical applications the system usually cannot predict the domain of the text a user will input, and translating texts from different domains with a single unified model inevitably harms translation quality. How to improve the ability of a statistical machine translation system to adapt to texts from different domains, and thereby improve its translation quality and practicality, is therefore a problem in urgent need of a solution.
Summary of the invention
The object of the present invention is to overcome the inability of existing statistical machine translation systems to adapt to texts from different domains at the same time, by providing a method that selects the translation model according to the text to be translated, so that better translation results are obtained for inputs from different domains.
To achieve this object, the invention provides a method for generating candidate translation models in statistical machine translation, comprising the following steps:
Step 101), collecting a bilingual parallel corpus and dividing it by type into different sub-corpora, thereby building sub-corpora of different types;
Step 102), training candidate translation models on the sub-corpora obtained in step 101);
Step 103), building an index for the sub-corpora obtained in step 101) to obtain a corpus index file.
In the above technical solution, in step 101), the bilingual parallel corpus is divided according to the domain, topic, and word usage of its data: a classification or clustering method is used to place data with similar domain, topic, and word usage in the same sub-corpus.
The classification or clustering method includes k-means clustering, k-nearest-neighbor classification, or maximum entropy classification.
In the above technical solution, in step 102), a translation model is trained on each sub-corpus to obtain the corresponding sub-translation model; in addition, the entire bilingual parallel corpus is used for training to obtain a general translation model.
In the above technical solution, in step 103), an index is built for the source-language sentence of every translation sentence pair in the bilingual parallel corpus, and the index records which sub-corpus each source-language sentence belongs to.
The Lemur information retrieval toolkit is used to build the index.
The invention also provides a method for translating with candidate translation models in statistical machine translation, comprising the following steps:
Step 201), inputting the text to be translated, and retrieving from the corpus index file sentences similar to the sentences in the text to be translated;
Step 202), according to the retrieval result of step 201), obtaining the candidate translation models corresponding to the sub-corpora containing the similar sentences, and selecting a final translation model from all candidate translation models;
Step 203), translating the input text with the final translation model determined in step 202) to obtain the final translation result.
In the above technical solution, in step 201), a similarity retrieval model is used to compute the similarity between the text to be translated and every indexed document in the corpus index file; the results are then sorted in descending order of similarity, and at least one sentence with the highest similarity is selected. Each selected sentence carries the information of the sub-corpus it belongs to.
The retrieval of similar sentences is implemented with the vector space model and the TF-IDF similarity measure.
In the above technical solution, in step 202), a selection strategy is set, and according to that strategy one candidate translation model, or a combination of several candidate translation models, is selected from all candidate translation models as the final translation model.
The selection strategy includes determining the candidate translation model according to the number of similar sentences contained in each sub-corpus, or determining it by also taking the similarity values into account.
The invention further provides an online translation model selection method for statistical machine translation, comprising a training stage and a translation stage, characterized in that the training stage comprises the following steps:
Step 101), collecting a bilingual parallel corpus and dividing it by type into different sub-corpora, thereby building sub-corpora of different types;
Step 102), training candidate translation models on the sub-corpora obtained in step 101);
Step 103), building an index for the sub-corpora obtained in step 101) to obtain a corpus index file;
and the translation stage comprises the following steps:
Step 201), inputting the text to be translated, and retrieving from the corpus index file obtained in step 103) sentences similar to the sentences in the text to be translated;
Step 202), according to the retrieval result of step 201), obtaining the candidate translation models corresponding to the sub-corpora containing the similar sentences, and selecting a final translation model from all candidate translation models;
Step 203), translating the input text with the final translation model determined in step 202) to obtain the final translation result.
The invention further provides an online translation model selection system for statistical machine translation, comprising a training module and a translation module. The training module comprises a corpus collection unit, a candidate translation model training unit, and an index building unit; the translation module comprises a retrieval unit, a candidate translation model selection unit, and a translation unit; wherein:
the corpus collection unit collects a bilingual parallel corpus and builds sub-corpora according to the type of the collected data;
the candidate translation model training unit trains candidate translation models on the sub-corpora;
the index building unit builds an index for the sub-corpora to obtain a corpus index file;
the retrieval unit retrieves from the corpus index file sentences similar to the sentences in the input text to be translated;
the candidate translation model selection unit obtains, according to the retrieval result, the candidate translation models corresponding to the sub-corpora containing the similar sentences, and selects a final translation model from all candidate translation models;
the translation unit translates the text to be translated with the selected final translation model.
The advantages of the invention are:
1. The online translation model selection method provided by the invention enables a statistical machine translation system to select a suitable translation model online according to the input text. This addresses the poor adaptation of statistical machine translation systems to input texts from different domains, can effectively improve translation quality, and offers a feasible route to making such systems practical.
2. The online translation model selection method provided by the invention is independent of the modeling, training, and decoding procedures of any particular statistical machine translation method. It is applicable to various statistical machine translation methods, such as word-based, phrase-based, and syntax-based methods. The invention therefore adapts well and is simple to implement.
Description of drawings
Fig. 1 is a schematic diagram of the model training part of the online translation model selection method for statistical machine translation of the present invention;
Fig. 2 is a schematic diagram of the online translation part of the online translation model selection method for statistical machine translation.
Embodiment
The invention is described in further detail below with reference to the drawings and a specific embodiment:
The online translation model selection method for statistical machine translation of the present invention comprises two parts, model training and online translation, which are described in turn below.
As shown in Fig. 1, the model training process of the invention comprises the following steps:
Step 101, collect a bilingual parallel corpus and divide it by type into different sub-corpora, thereby building sub-corpora of different types. The collected bilingual parallel corpus is generally sentence-aligned, that is, it contains sentences paired with their translations. When the corpus is divided, the data within one sub-corpus should be as similar as possible in domain, topic, word usage, and so on, while the differences in these respects between different sub-corpora should be as large as possible. The division can use classification or clustering; any existing classification or clustering method is applicable to the invention, such as the commonly used k-means clustering, k-nearest-neighbor classification, and maximum entropy classification. In addition, the source and domain of a corpus are often known when it is collected, in which case the corpus can be divided directly into sub-corpora by source and domain.
Through the above operations the collected bilingual parallel corpus is divided into several sub-corpora. The number of sub-corpora should not be too large: each sub-corpus must contain a sufficient amount of data (translation sentence pairs), so that translation quality does not suffer from sub-corpora that are too small. In addition, during the division a translation sentence pair in the original corpus may be assigned to several sub-corpora at once; that is, the resulting sub-corpora are allowed to contain identical translation sentence pairs.
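The simplest case described above, dividing directly by known source or domain, can be sketched as follows. This is only an illustrative sketch: the function name `partition_by_domain`, the triple layout, and the toy sentence pairs are assumptions for the example and do not come from the patent.

```python
from collections import defaultdict

def partition_by_domain(sentence_pairs):
    """Group (source, target, domains) triples into sub-corpora keyed by domain label.

    Hypothetical helper: the domain labels are assumed to be known from the
    corpus source, as in the direct-division case of step 101.
    """
    sub_corpora = defaultdict(list)
    for source, target, domains in sentence_pairs:
        # A pair may be placed in several sub-corpora at once.
        for domain in domains:
            sub_corpora[domain].append((source, target))
    return dict(sub_corpora)

# Toy data; tokens and labels are placeholders.
pairs = [
    ("src sentence a", "tgt sentence a", ["parliament"]),
    ("src sentence b", "tgt sentence b", ["law", "parliament"]),
    ("src sentence c", "tgt sentence c", ["news"]),
]
sub_corpora = partition_by_domain(pairs)
```

Here the second pair lands in both the "law" and "parliament" sub-corpora, which mirrors the allowance that sub-corpora may share translation sentence pairs.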
Step 102, train candidate translation models on the sub-corpora obtained in step 101. A translation model is trained on each sub-corpus to obtain the corresponding sub-translation model. In addition, the entire bilingual parallel corpus is used for training to obtain a general translation model.
Translation model training is mature prior art, and any commonly used training method can be adopted in this step. In this embodiment, for example, the EM training method disclosed in Reference 1, the maximum likelihood training method disclosed in Reference 2, or the discriminative training method disclosed in Reference 3 can be used.
The translation models obtained in this step are the candidate translation models used in the subsequent translation stage.
Step 103, build an index for the sub-corpora to obtain the corpus index file. An index is built for the source-language sentence of every translation sentence pair in the sub-corpora, and the index records which sub-corpus each source-language sentence belongs to. The purpose of the index is to make it easy and fast, during subsequent translation, to retrieve the N sentences most similar to a given text and at the same time to know which sub-corpus or sub-corpora those sentences come from. Indexing the sub-corpora can rely on mature prior art; in this embodiment the Lemur information retrieval toolkit can be used. When the index is built, the source-language sentence of each translation sentence pair is treated as one document, and the sub-corpus to which the document belongs is recorded in the document's path information.
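The idea of treating each source-language sentence as a document that carries its sub-corpus label can be sketched in plain Python. This is not the Lemur toolkit's API; `build_index` and the dictionary layout are hypothetical stand-ins chosen for illustration, and whitespace tokenization is an assumption.

```python
from collections import Counter

def build_index(sub_corpora):
    """Return one index entry per source sentence.

    Each entry stores the sub-corpus name (playing the role of the document
    path information) and the term counts of the source sentence.
    """
    index = []
    for name, pairs in sub_corpora.items():
        for source, _target in pairs:
            index.append({"sub_corpus": name, "terms": Counter(source.split())})
    return index

# Toy sub-corpora; sentences are placeholders.
sub_corpora = {
    "news": [("stock market rises", "t1")],
    "law": [("court ruling", "t2"), ("new law passed", "t3")],
}
index = build_index(sub_corpora)
```

Only the source side is indexed, since retrieval at translation time compares the input text against source-language sentences.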
The above operations complete the training of the translation models. The online translation process is described in detail below.
As shown in Fig. 2, the online translation part of the online translation model selection method for statistical machine translation of the present invention comprises the following steps:
Step 201, input the text to be translated, and retrieve from the corpus index file training sentences similar to the sentences in the text to be translated.
When similar sentences are retrieved for the text to be translated, a similarity retrieval method is used to retrieve the N most similar sentences from the index of the training corpus. Each retrieved sentence carries its sub-corpus information, that is, which sub-corpus the sentence belongs to.
The similarity retrieval method can be implemented in many ways, such as the Dice coefficient, edit distance, or cosine similarity. In this embodiment, the retrieval of similar sentences can be implemented with the vector space model and the TF-IDF similarity measure commonly used in information retrieval, as follows:
In vector space model retrieval, both the query input by the user and the documents in the system are represented as vectors. Suppose there are n words in total; then each document (or query) $D_i$ can be regarded as an n-dimensional vector $(w_{i1}, w_{i2}, \ldots, w_{in})$, where $w_{ij}$ is the weight of document $D_i$ in dimension j. These weights can be computed with the following TF-IDF method:
$w_{ij} = tf_{ij} \times \log(idf_j)$
where $tf_{ij}$ is the frequency with which word j occurs in document $D_i$; the larger $tf_{ij}$ is, the more important word j is to document $D_i$. $idf_j$ is called the inverse document frequency and reflects the reciprocal of the number of documents that contain word j; it is generally computed as the total number of documents divided by the number of documents containing word j. The smaller $idf_j$ is, the more documents contain word j, and the less useful word j is for measuring document similarity.
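As a rough illustration of the weighting formula, the sketch below computes $w_{ij} = tf_{ij} \times \log(idf_j)$ with $idf_j$ taken as the total number of documents divided by the number of documents containing word j. The function name `tfidf_vectors`, the use of raw counts for $tf_{ij}$, and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Map each tokenized document to a sparse {word: weight} vector."""
    n_docs = len(documents)
    # Number of documents containing each word.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts stand in for tf_ij
        vectors.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return vectors

# Toy documents; each one stands for an indexed source sentence.
docs = [["stock", "market", "news"], ["court", "law", "news"], ["stock", "price"]]
vectors = tfidf_vectors(docs)
```

In this toy set "market" occurs in one of three documents and "news" in two, so "market" receives the larger weight in the first vector, which matches the remark that widely shared words contribute less to similarity.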
When the user inputs the text to be translated, the retrieval system first computes the similarity between the text to be translated and the vectors of all indexed documents, and then sorts the results in descending order of similarity. The similarity is usually measured by the cosine of the angle between the vectors or by their inner product.
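The ranking step can be sketched with cosine similarity over sparse vectors. The names `cosine` and `top_n`, the `(sub_corpus, vector)` layout, and the toy weights are hypothetical; a real system would take the vectors from its retrieval index.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_n(query, indexed, n):
    """Return the n best (similarity, sub_corpus) pairs, most similar first."""
    scored = [(cosine(query, vec), name) for name, vec in indexed]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n]

# Toy index: (sub-corpus label, weight vector) per indexed sentence.
indexed = [
    ("news", {"stock": 1.0, "market": 2.0}),
    ("law", {"court": 2.0, "law": 1.0}),
    ("news", {"stock": 1.0, "price": 1.0}),
]
query = {"stock": 1.0, "market": 1.0}
results = top_n(query, indexed, 2)
```

Each result keeps the sub-corpus label next to its score, so the later model selection step can work from the ranked list alone.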
Step 103 mentions that the Lemur information retrieval toolkit can be used to build the index; in this step the same toolkit can be used to retrieve similar sentences based on the vector space model and TF-IDF similarity. The retrieval yields the top N training sentences most similar to the text to be translated, together with the sub-corpus to which each sentence belongs.
Step 202, select the translation model according to the retrieval result of step 201. Once the similar sentences have been retrieved in step 201, the sub-corpus to which each similar sentence belongs is also known. As described in step 102, each sub-corpus corresponds to one candidate translation model. The similar sentences obtained in step 201 may belong to different sub-corpora and therefore correspond to different candidate translation models. This step selects one of these candidate models, or a combination of several, as the final translation model according to a selection strategy. The selection strategy can be chosen according to actual needs: it can be based on the number of similar sentences in each sub-corpus, or it can also take the similarity values into account.
Suppose a sentence to be translated has 5 similar sentences, of which 3 belong to sub-corpus 1, 1 belongs to sub-corpus 2, and 1 belongs to sub-corpus 3. Under the strategy based on the number of similar sentences per sub-corpus, the candidate translation model corresponding to sub-corpus 1 is taken as the final translation model. Suppose instead that a sentence to be translated has 5 similar sentences with similarities 0.9, 0.7, 0.5, 0.3, and 0.1, where the 1st and 2nd similar sentences belong to sub-corpus 1 and the 3rd, 4th, and 5th belong to sub-corpus 2. Under the strategy based on similarity values, the total similarity of sub-corpus 1 is 1.6 (0.9 + 0.7) and that of sub-corpus 2 is 0.9 (0.5 + 0.3 + 0.1). Therefore, although sub-corpus 2 contains more similar sentences, the candidate translation model corresponding to sub-corpus 1 is still selected as the final translation model.
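The two strategies can be compared on the second example above (similarities 0.9, 0.7, 0.5, 0.3, 0.1). The sketch below is illustrative only; the function names `select_by_count` and `select_by_similarity` are assumptions, not terms from the patent.

```python
from collections import Counter, defaultdict

def select_by_count(hits):
    """Pick the sub-corpus that contributes the most similar sentences."""
    counts = Counter(name for _sim, name in hits)
    return counts.most_common(1)[0][0]

def select_by_similarity(hits):
    """Pick the sub-corpus with the largest total similarity."""
    totals = defaultdict(float)
    for sim, name in hits:
        totals[name] += sim
    return max(totals, key=totals.get)

# (similarity, sub-corpus id) for the five retrieved sentences.
hits = [(0.9, 1), (0.7, 1), (0.5, 2), (0.3, 2), (0.1, 2)]
by_count = select_by_count(hits)            # sub-corpus 2 has three sentences
by_similarity = select_by_similarity(hits)  # sub-corpus 1 totals 1.6 versus 0.9
```

On this input the count-based strategy would favor sub-corpus 2, while the similarity-based strategy favors sub-corpus 1, which is the outcome the text describes.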
A simple model selection strategy is used below to illustrate how this step can be implemented:
if Proportion(max_model) > 0.5
    δ_0 = 0; δ_i = 1 for i = max_model; δ_i = 0 for i ≠ max_model;
else
    δ_0 = 1; δ_i = 0 for all i;
Here δ_0 is the weight of the general translation model and δ_i is the weight of the i-th sub-translation model, i = 1, ..., M. max_model is the model with the largest proportion, and the function Proportion(max_model) gives the proportion of retrieved similar sentences that belong to the sub-corpus corresponding to max_model.
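A minimal Python rendering of this simple strategy might look as follows. The function name `simple_weights` and the convention of numbering sub-corpora from 1 to M are assumptions made for the sketch.

```python
from collections import Counter

def simple_weights(hit_sub_corpora, num_models):
    """Return (delta_0, [delta_1, ..., delta_M]) for the simple strategy.

    hit_sub_corpora lists the sub-corpus id (1..M) of each retrieved sentence.
    """
    counts = Counter(hit_sub_corpora)
    max_model, max_count = counts.most_common(1)[0]
    deltas = [0.0] * num_models
    if max_count / len(hit_sub_corpora) > 0.5:
        # The dominant sub-model is used alone.
        deltas[max_model - 1] = 1.0
        return 0.0, deltas
    # No sub-corpus dominates: fall back to the general model.
    return 1.0, deltas

dominant = simple_weights([1, 1, 1, 2, 3], 3)
fallback = simple_weights([1, 1, 2, 2, 3], 3)
```

With three of five hits in sub-corpus 1 the proportion is 0.6, so only sub-model 1 is used; with a 2/2/1 split no proportion exceeds 0.5 and the general model is used.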
After the model weights $\delta_0$ and $\delta_i$ have been determined, the final translation model is the log-linear interpolation of the candidate models:
$\hat{e} = \arg\max_{e} \left( \delta_0 \log p_0(e \mid c) + \sum_{i=1}^{M} \delta_i \log p_i(e \mid c) \right)$
where c denotes the Chinese sentence to be translated, e denotes a candidate translation, and $\hat{e}$ denotes the translation with the maximum probability. $p_0$ is the translation probability given by the general translation model, and $p_i$ is the translation probability given by the i-th sub-translation model.
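The interpolation can be illustrated over a small, fixed list of candidate translations. A real decoder searches a much larger space; the function name `best_translation`, the dictionary-based probability tables, and the numbers below are hypothetical and chosen only to show how the weights change the result.

```python
import math

def best_translation(candidates, delta_0, p_general, deltas, p_subs):
    """Return the candidate maximizing the weighted sum of log-probabilities."""
    def score(e):
        # Terms with zero weight are skipped, which also avoids log(0).
        total = delta_0 * math.log(p_general[e]) if delta_0 else 0.0
        for delta_i, p_i in zip(deltas, p_subs):
            if delta_i:
                total += delta_i * math.log(p_i[e])
        return total
    return max(candidates, key=score)

# Toy probabilities p(e | c) for two candidates under each model.
candidates = ["a", "b"]
p_general = {"a": 0.6, "b": 0.4}
p_subs = [{"a": 0.2, "b": 0.8}, {"a": 0.5, "b": 0.5}]

with_sub_model = best_translation(candidates, 0.0, p_general, [1.0, 0.0], p_subs)
with_general = best_translation(candidates, 1.0, p_general, [0.0, 0.0], p_subs)
```

With all weight on sub-model 1 the sketch picks "b"; with all weight on the general model it picks "a".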
According to this formula and the model selection strategy above, when the proportion of the largest model max_model is greater than 0.5, max_model is used as the final translation model; otherwise the general model is used as the final translation model. More complex model selection strategies can of course be defined. The following strategy, for example, sets the weight of each sub-model according to the proportion of each sub-corpus among the retrieved similar sentences:
if Proportion(max_model) > 0.5
    δ_0 = 0;
    δ_i = Proportion(model_i);
else
    δ_0 = 0.5;
    δ_i = 0.5 × Proportion(model_i);
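The proportional strategy can be sketched in the same style. The function name `proportional_weights` and the 1-to-M numbering of sub-corpora are assumptions for illustration.

```python
from collections import Counter

def proportional_weights(hit_sub_corpora, num_models):
    """Return (delta_0, [delta_1, ..., delta_M]) for the proportional strategy."""
    counts = Counter(hit_sub_corpora)
    total = len(hit_sub_corpora)
    # Share of retrieved sentences that belong to each sub-corpus.
    proportions = [counts.get(i, 0) / total for i in range(1, num_models + 1)]
    if max(proportions) > 0.5:
        return 0.0, proportions
    # Otherwise the general model takes half the weight and the
    # sub-models share the other half by proportion.
    return 0.5, [0.5 * p for p in proportions]

delta_0_a, deltas_a = proportional_weights([1, 1, 1, 2, 3], 3)
delta_0_b, deltas_b = proportional_weights([1, 1, 2, 2, 3], 3)
```

In both branches the weights sum to 1, so the interpolation stays a convex combination of log-probabilities.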
Step 203, translate the input text with the translation model determined in step 202 to obtain the final translation result.
This step is similar to the translation procedure in existing statistical machine translation systems and is therefore not described further here.
The above is a detailed description of the online translation model selection method for statistical machine translation of the present invention. In contrast to the prior art, the invention divides the collected bilingual parallel corpus by category, builds a translation model for each sub-corpus and a general model for the entire bilingual parallel corpus, and builds an index file for the source-language sentences. After the text to be translated is input, similar sentences are first retrieved and the translation model is selected according to them. This avoids the drawbacks of using a single translation model as in the prior art, namely low translation accuracy and weak adaptation to texts from different domains.
Corresponding to the online model selection method for statistical machine translation proposed above, the invention also proposes an online translation model selection system. The system comprises a training module and a translation module. The training module comprises a corpus collection unit, a candidate translation model training unit, and an index building unit; the translation module comprises a retrieval unit, a candidate translation model selection unit, and a translation unit.
The corpus collection unit collects a bilingual parallel corpus and builds sub-corpora according to the type of the collected data.
The candidate translation model training unit trains candidate translation models on the sub-corpora.
The index building unit builds an index for the sub-corpora to obtain the corpus index file.
The retrieval unit retrieves from the corpus index file sentences similar to the sentences in the input text to be translated.
The candidate translation model selection unit obtains, according to the retrieval result, the candidate translation models corresponding to the sub-corpora containing the similar sentences, and selects the final translation model from all candidate translation models.
The translation unit translates the text to be translated with the selected final translation model.
Finally, it should be noted that the above embodiment only illustrates the technical solution of the present invention and does not limit it. Although the invention has been described in detail with reference to the embodiment, those of ordinary skill in the art should understand that modifications or equivalent substitutions of the technical solution that do not depart from its spirit and scope all fall within the scope of the claims of the present invention.

Claims (13)

1. A method for generating candidate translation models in statistical machine translation, comprising the following steps:
Step 101), collecting a bilingual parallel corpus and dividing it by type into different sub-corpora, thereby building sub-corpora of different types;
Step 102), training candidate translation models on the sub-corpora;
Step 103), building an index for the sub-corpora to obtain a corpus index file.
2. The method for generating candidate translation models in statistical machine translation according to claim 1, characterized in that, in step 101), dividing the bilingual parallel corpus into different sub-corpora means: according to the domain, topic, and wording of the data in the bilingual parallel corpus, using a classification or clustering method to place bilingual parallel data with similar domain, topic, and wording in the same sub-corpus.
3. The method for generating candidate translation models in statistical machine translation according to claim 2, characterized in that the classification or clustering method is k-means clustering, k-nearest-neighbor classification, or maximum entropy classification.
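The division into sub-corpora described in claims 2 and 3 can be illustrated with a minimal sketch of the k-nearest-neighbor option. Everything below is an assumption made for illustration and is not taken from the patent: the function names (`bag_of_words`, `cosine`, `knn_label`, `partition`), the whitespace tokenization, the raw bag-of-words features, and the small set of hand-labelled seed sentences.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Map lower-cased whitespace tokens to their counts (assumed tokenization)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_label(sentence, labelled, k=3):
    """Return the majority label among the k labelled sentences most similar to `sentence`."""
    query = bag_of_words(sentence)
    ranked = sorted(labelled, key=lambda item: cosine(query, bag_of_words(item[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def partition(pairs, labelled, k=3):
    """Group (source, target) sentence pairs into sub-corpora keyed by the predicted label."""
    sub_corpora = {}
    for source, target in pairs:
        sub_corpora.setdefault(knn_label(source, labelled, k), []).append((source, target))
    return sub_corpora
```

Calling `partition(pairs, labelled)` returns a dictionary from sub-corpus label to sentence pairs; a real system would presumably use richer domain and topic features than raw word counts.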
4. The method for generating candidate translation models in statistical machine translation according to claim 1, characterized in that step 102) further comprises the following steps:
training a translation model on each sub-corpus to obtain the corresponding sub-translation model;
training a translation model on all sub-corpora together to obtain a general translation model.
5. The method for generating candidate translation models in statistical machine translation according to claim 1, characterized in that, in step 103), building an index for the sub-corpora means:
indexing the source-language sentence of each translation sentence pair in the sub-corpora, wherein the index includes information identifying the sub-corpus that contains the source-language sentence.
6. The method for generating candidate translation models in statistical machine translation according to claim 5, characterized in that the index is built with the Lemur information retrieval toolkit.
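The index of claim 5, in which every indexed source sentence records the sub-corpus it came from, can be sketched as a small in-memory inverted index. This is a stand-in written for illustration and does not use or imitate the Lemur toolkit named in claim 6; the function name `build_index`, the data layout, and the whitespace tokenization are all assumptions.

```python
from collections import defaultdict


def build_index(sub_corpora):
    """Index the source side of every sentence pair.

    sub_corpora: {sub_corpus_id: [(source_sentence, target_sentence), ...]}
    Returns (documents, postings), where documents[doc_id] is
    (sub_corpus_id, source_sentence) and postings[term] is the set of
    doc_ids whose source sentence contains that term.
    """
    documents = []
    postings = defaultdict(set)
    for sub_id, pairs in sub_corpora.items():
        for source, _target in pairs:
            doc_id = len(documents)
            # Store the sub-corpus id next to the sentence so that a
            # retrieval hit can be traced back to its sub-corpus.
            documents.append((sub_id, source))
            for term in source.lower().split():
                postings[term].add(doc_id)
    return documents, postings
```

Because each entry in `documents` carries its sub-corpus identifier, any sentence returned by a later similarity search can be mapped directly to the candidate translation model trained on that sub-corpus.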
7. A method for translating with candidate translation models in statistical machine translation, comprising the following steps:
Step 201): inputting the text to be translated and retrieving, from the corpus index file, sentences similar to the sentences in the text to be translated, to obtain a retrieval result;
Step 202): according to the retrieval result, obtaining the candidate translation models corresponding to the sub-corpora that contain the similar sentences, and selecting a final translation model from all candidate translation models;
Step 203): translating the input text according to the final translation model to obtain the final translation result.
8. The method for translating with candidate translation models in statistical machine translation according to claim 7, characterized in that, in step 201), retrieving from the corpus index file sentences similar to the sentences in the text to be translated means:
using a similarity retrieval method to calculate the similarity between the text to be translated and every indexed sentence in the corpus index file, sorting all results in descending order of similarity, and selecting at least one sentence with the highest similarity, wherein each selected sentence carries information identifying the sub-corpus that contains it.
9. The method for translating with candidate translation models in statistical machine translation according to claim 8, characterized in that the similarity retrieval method is a vector space model with TF-IDF similarity calculation.
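The retrieval of claims 8 and 9 can be sketched as TF-IDF weighting in a vector space model with cosine similarity. The claims do not fix a particular TF-IDF variant, so the smoothed formula below is an assumption, as are the function names (`tfidf_vectors`, `cosine`, `retrieve`), the whitespace tokenization, and the choice to ignore query terms that never occur in the index.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector per sentence, plus the IDF table."""
    tokenised = [s.lower().split() for s in sentences]
    n = len(tokenised)
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenised]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, indexed, top_n=2):
    """Rank indexed sentences by similarity to the query, highest first.

    indexed: list of (sub_corpus_id, source_sentence)
    Returns the top_n hits as (similarity, sub_corpus_id, source_sentence).
    """
    vectors, idf = tfidf_vectors([sentence for _, sentence in indexed])
    # Query terms that never occur in the index are dropped.
    q = {t: c * idf[t] for t, c in Counter(query.lower().split()).items() if t in idf}
    scored = [(cosine(q, v), sub_id, s) for v, (sub_id, s) in zip(vectors, indexed)]
    scored.sort(key=lambda hit: hit[0], reverse=True)
    return scored[:top_n]
```

Each hit keeps its sub-corpus identifier, which is the information the selection step needs in order to look up the corresponding candidate translation model.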
10. The method for translating with candidate translation models in statistical machine translation according to claim 7, characterized in that, in step 202), selecting a final translation model from all candidate translation models means:
setting a selection strategy and, according to the selection strategy, selecting one candidate translation model or a combination of several candidate translation models from all candidate translation models as the final translation model.
11. The method for translating with candidate translation models in statistical machine translation according to claim 10, characterized in that the selection strategy determines the candidate translation model according to the number of similar sentences contained in the same sub-corpus, or determines the candidate translation model in combination with the similarity values.
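The two selection strategies named in claim 11 can be sketched as a vote over the retrieved sentences: either each similar sentence counts once for its sub-corpus, or each contributes its similarity value. The function name `select_model`, the `use_similarity` flag, and the fallback to a model labelled "general" when nothing is retrieved are assumptions for illustration; the fallback is only loosely motivated by the general translation model of claim 4. Selecting a combination of several models, which claim 10 also allows, is not shown.

```python
from collections import defaultdict


def select_model(hits, use_similarity=False, fallback="general"):
    """Pick the sub-corpus whose candidate translation model should be used.

    hits: list of (similarity, sub_corpus_id) for the retrieved similar sentences.
    use_similarity=False: count how many similar sentences fall in each sub-corpus.
    use_similarity=True:  sum the similarity values per sub-corpus instead.
    """
    if not hits:
        # Assumed behaviour: fall back to a general model when nothing is retrieved.
        return fallback
    scores = defaultdict(float)
    for similarity, sub_id in hits:
        scores[sub_id] += similarity if use_similarity else 1.0
    return max(scores, key=scores.get)
```

The two strategies can disagree: one very close match in one sub-corpus may outweigh several weak matches in another when similarity values are summed, but not when sentences are merely counted.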
12. An online translation model selection method in statistical machine translation, comprising a training stage and a translation stage, characterized in that the training stage comprises the following steps:
Step 101): collecting a bilingual parallel corpus and dividing it by type into different sub-corpora, thereby building sub-corpora of different types;
Step 102): training candidate translation models on the sub-corpora;
Step 103): building an index for the sub-corpora to obtain a corpus index file;
and the translation stage comprises the following steps:
Step 201): inputting the text to be translated and retrieving, from the corpus index file, sentences similar to the sentences in the text to be translated, to obtain a retrieval result;
Step 202): according to the retrieval result, obtaining the candidate translation models corresponding to the sub-corpora that contain the similar sentences, and selecting a final translation model from all candidate translation models;
Step 203): translating the input text according to the final translation model to obtain the final translation result.
13. An online translation model selection system in statistical machine translation, comprising a training module and a translation module, characterized in that the training module comprises a corpus collection unit, a candidate translation model training unit, and an index building unit, and the translation module comprises a retrieval unit, a candidate translation model selection unit, and a translation unit; wherein:
the corpus collection unit collects a bilingual parallel corpus and builds sub-corpora according to the type of the collected bilingual parallel corpus;
the candidate translation model training unit trains candidate translation models on the sub-corpora;
the index building unit builds an index for the sub-corpora to obtain a corpus index file;
the retrieval unit retrieves, from the corpus index file, sentences similar to the sentences in the input text to be translated;
the candidate translation model selection unit obtains, according to the retrieval result, the candidate translation models corresponding to the sub-corpora that contain the similar sentences, and selects a final translation model from all candidate translation models;
the translation unit translates the text to be translated according to the selected final translation model.
CNB2007100997246A 2007-05-29 2007-05-29 On-line translation model selection method of statistic machine translation Expired - Fee Related CN100527125C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100997246A CN100527125C (en) 2007-05-29 2007-05-29 On-line translation model selection method of statistic machine translation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007100997246A CN100527125C (en) 2007-05-29 2007-05-29 On-line translation model selection method of statistic machine translation

Publications (2)

Publication Number Publication Date
CN101079028A true CN101079028A (en) 2007-11-28
CN100527125C CN100527125C (en) 2009-08-12

Family

ID=38906508

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100997246A Expired - Fee Related CN100527125C (en) 2007-05-29 2007-05-29 On-line translation model selection method of statistic machine translation

Country Status (1)

Country Link
CN (1) CN100527125C (en)

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193912A (en) * 2010-03-12 2011-09-21 富士通株式会社 Phrase division model establishing method, statistical machine translation method and decoder
CN102227723A (en) * 2008-11-27 2011-10-26 国际商业机器公司 Device and method for supporting detection of mistranslation
CN102270196A (en) * 2010-06-04 2011-12-07 中国科学院软件研究所 Machine translation method
CN101714136B (en) * 2008-10-06 2012-04-11 株式会社东芝 Method and device for adapting machine translation system based on corpus to new field
CN102591857A (en) * 2011-01-10 2012-07-18 富士通株式会社 Bilingual corpus resource acquisition method and bilingual corpus resource acquisition system
CN102591858A (en) * 2011-11-11 2012-07-18 东莞康明电子有限公司 A method and device for machine translation
CN102662935A (en) * 2012-04-08 2012-09-12 北京语智云帆科技有限公司 Interactive machine translation method and machine translation system
CN102789451A (en) * 2011-05-16 2012-11-21 北京百度网讯科技有限公司 Individualized machine translation system, method and translation model training method
CN102955819A (en) * 2011-08-31 2013-03-06 镇江诺尼基智能技术有限公司 Method for acquiring shortened form in Chinese from Web page
CN102999483A (en) * 2011-09-16 2013-03-27 北京百度网讯科技有限公司 Method and device for correcting text
CN103631773A (en) * 2013-12-16 2014-03-12 哈尔滨工业大学 Statistical machine translation method based on field similarity measurement method
CN103729350A (en) * 2013-12-30 2014-04-16 武汉传神信息技术有限公司 Multi-dimension preprocessing method for files to be translated
CN104166644A (en) * 2014-07-09 2014-11-26 苏州市职业大学 Term translation mining method based on cloud computing
CN104391838A (en) * 2014-08-18 2015-03-04 武汉传神信息技术有限公司 Method for improving translation accuracy of legal documents
CN104750676A (en) * 2013-12-31 2015-07-01 橙译中科信息技术(北京)有限公司 Machine translation processing method and device
CN104933038A (en) * 2014-03-20 2015-09-23 株式会社东芝 Machine translation method and machine translation device
CN105095192A (en) * 2014-05-05 2015-11-25 武汉传神信息技术有限公司 Double-mode translation equipment
CN105808529A (en) * 2016-03-10 2016-07-27 武汉传神信息技术有限公司 Method and device of corpora division field
CN101989287B (en) * 2009-07-31 2016-12-14 富士通株式会社 Generate the regular method and apparatus for machine translation based on statistics
CN106484682A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Based on the machine translation method of statistics, device and electronic equipment
CN106484681A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 A kind of method generating candidate's translation, device and electronic equipment
CN106503153A (en) * 2016-10-21 2017-03-15 江苏理工学院 Computer text classification system, system and text classification method thereof
CN106598959A (en) * 2016-12-23 2017-04-26 北京金山办公软件股份有限公司 Method and system for determining intertranslation relationship of bilingual sentence pairs
CN106844358A (en) * 2017-01-19 2017-06-13 中译语通科技(北京)有限公司 The natural language statistical machine translation method of mass data model in system level chip
CN107545036A (en) * 2017-07-28 2018-01-05 深圳前海微众银行股份有限公司 Customer service robot Knowledge Database method, customer service robot and readable storage medium storing program for executing
CN107644085A (en) * 2017-09-22 2018-01-30 百度在线网络技术(北京)有限公司 The generation method and device of competitive sports news
CN108228576A (en) * 2017-12-29 2018-06-29 科大讯飞股份有限公司 Text interpretation method and device
CN108628841A (en) * 2017-03-22 2018-10-09 湖南本来文化发展有限公司 The APP of Guangdong language accent and English is translated based on BIRCH clustering algorithms
CN108628848A (en) * 2017-03-22 2018-10-09 湖南本来文化发展有限公司 The method that Sichuan accent and English are translated with BIRCH clustering algorithms
CN108628847A (en) * 2017-03-22 2018-10-09 湖南本来文化发展有限公司 A kind of simultaneous interpretation case for translating mandarin and English using BIRCH clustering algorithms
CN108664477A (en) * 2016-06-28 2018-10-16 大连民族大学 The interpretation method of the multi-lingual machine translation subsystem of Transaction Information
CN108920473A (en) * 2018-07-04 2018-11-30 中译语通科技股份有限公司 A kind of data enhancing machine translation method based on similar word and synonym replacement
CN109543194A (en) * 2018-11-21 2019-03-29 传神语联网网络科技股份有限公司 Interpretation method and system are merged based on ICAT and TRADOS
CN109829550A (en) * 2019-02-01 2019-05-31 北京金山数字娱乐科技有限公司 Model evaluation method and apparatus, model evaluation system and its training method and device
CN109977207A (en) * 2019-03-21 2019-07-05 网易(杭州)网络有限公司 Talk with generation method, dialogue generating means, electronic equipment and storage medium
CN110705320A (en) * 2019-10-08 2020-01-17 中国船舶工业综合技术经济研究院 State-defense military-industry-field machine translation method and system for subdivision field
CN111177412A (en) * 2019-12-30 2020-05-19 成都信息工程大学 Public logo bilingual parallel corpus system
CN111368563A (en) * 2020-03-03 2020-07-03 新疆大学 Clustering algorithm fused dimension-Chinese machine translation system
CN113204977A (en) * 2021-04-29 2021-08-03 北京有竹居网络技术有限公司 Information translation method, device, equipment and storage medium
CN114218965A (en) * 2021-12-31 2022-03-22 语联网(武汉)信息技术有限公司 Automatic selection method for machine translation engine in similar field

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003263433A (en) * 2002-03-07 2003-09-19 Advanced Telecommunication Research Institute International Method of generating translation model in statistical machine translator
JP2005521952A (en) * 2002-03-27 2005-07-21 ユニバーシティ・オブ・サザン・カリフォルニア Inter-phrase coupling probability model for statistical machine translation
US20070043553A1 (en) * 2005-08-16 2007-02-22 Microsoft Corporation Machine translation models incorporating filtered training data
CN100474301C (en) * 2005-09-08 2009-04-01 富士通株式会社 System and method for obtaining words or phrases unit translation information based on data excavation

Cited By (64)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101714136B (en) * 2008-10-06 2012-04-11 株式会社东芝 Method and device for adapting machine translation system based on corpus to new field
CN102227723B (en) * 2008-11-27 2013-10-09 国际商业机器公司 Device and method for supporting detection of mistranslation
CN102227723A (en) * 2008-11-27 2011-10-26 国际商业机器公司 Device and method for supporting detection of mistranslation
US8676791B2 (en) 2008-11-27 2014-03-18 International Business Machines Corporation Apparatus and methods for providing assistance in detecting mistranslation
CN101989287B (en) * 2009-07-31 2016-12-14 富士通株式会社 Generate the regular method and apparatus for machine translation based on statistics
CN102193912A (en) * 2010-03-12 2011-09-21 富士通株式会社 Phrase division model establishing method, statistical machine translation method and decoder
CN102193912B (en) * 2010-03-12 2013-11-06 富士通株式会社 Phrase division model establishing method, statistical machine translation method and decoder
CN102270196A (en) * 2010-06-04 2011-12-07 中国科学院软件研究所 Machine translation method
CN102591857A (en) * 2011-01-10 2012-07-18 富士通株式会社 Bilingual corpus resource acquisition method and bilingual corpus resource acquisition system
CN102591857B (en) * 2011-01-10 2015-06-24 富士通株式会社 Bilingual corpus resource acquisition method and bilingual corpus resource acquisition system
CN102789451A (en) * 2011-05-16 2012-11-21 北京百度网讯科技有限公司 Individualized machine translation system, method and translation model training method
CN102955819A (en) * 2011-08-31 2013-03-06 镇江诺尼基智能技术有限公司 Method for acquiring shortened form in Chinese from Web page
CN102999483A (en) * 2011-09-16 2013-03-27 北京百度网讯科技有限公司 Method and device for correcting text
CN102999483B (en) * 2011-09-16 2016-04-27 北京百度网讯科技有限公司 The method and apparatus that a kind of text is corrected
CN102591858A (en) * 2011-11-11 2012-07-18 东莞康明电子有限公司 A method and device for machine translation
CN102591858B (en) * 2011-11-11 2016-06-22 张生麟 A kind of method and apparatus of machine translation
CN102662935A (en) * 2012-04-08 2012-09-12 北京语智云帆科技有限公司 Interactive machine translation method and machine translation system
CN103631773A (en) * 2013-12-16 2014-03-12 哈尔滨工业大学 Statistical machine translation method based on field similarity measurement method
CN103729350A (en) * 2013-12-30 2014-04-16 武汉传神信息技术有限公司 Multi-dimension preprocessing method for files to be translated
CN103729350B (en) * 2013-12-30 2017-01-04 语联网(武汉)信息技术有限公司 The preprocess method of various dimensions waiting for translating shelves
CN104750676A (en) * 2013-12-31 2015-07-01 橙译中科信息技术(北京)有限公司 Machine translation processing method and device
CN104750676B (en) * 2013-12-31 2017-10-24 橙译中科信息技术(北京)有限公司 Machine translation processing method and processing device
CN104933038A (en) * 2014-03-20 2015-09-23 株式会社东芝 Machine translation method and machine translation device
CN105095192A (en) * 2014-05-05 2015-11-25 武汉传神信息技术有限公司 Double-mode translation equipment
CN104166644A (en) * 2014-07-09 2014-11-26 苏州市职业大学 Term translation mining method based on cloud computing
CN104391838B (en) * 2014-08-18 2017-08-29 武汉传神信息技术有限公司 A kind of method for improving legal document translation accuracy
CN104391838A (en) * 2014-08-18 2015-03-04 武汉传神信息技术有限公司 Method for improving translation accuracy of legal documents
US10810379B2 (en) 2015-08-25 2020-10-20 Alibaba Group Holding Limited Statistics-based machine translation method, apparatus and electronic device
CN106484682B (en) * 2015-08-25 2019-06-25 阿里巴巴集团控股有限公司 Machine translation method, device and electronic equipment based on statistics
CN106484681B (en) * 2015-08-25 2019-07-09 阿里巴巴集团控股有限公司 A kind of method, apparatus and electronic equipment generating candidate translation
US10268685B2 (en) 2015-08-25 2019-04-23 Alibaba Group Holding Limited Statistics-based machine translation method, apparatus and electronic device
CN106484681A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 A kind of method generating candidate's translation, device and electronic equipment
US10255275B2 (en) 2015-08-25 2019-04-09 Alibaba Group Holding Limited Method and system for generation of candidate translations
CN106484682A (en) * 2015-08-25 2017-03-08 阿里巴巴集团控股有限公司 Based on the machine translation method of statistics, device and electronic equipment
US10860808B2 (en) 2015-08-25 2020-12-08 Alibaba Group Holding Limited Method and system for generation of candidate translations
CN105808529A (en) * 2016-03-10 2016-07-27 武汉传神信息技术有限公司 Method and device of corpora division field
CN105808529B (en) * 2016-03-10 2018-06-08 语联网(武汉)信息技术有限公司 The method and apparatus that a kind of language material divides field
CN108664477A (en) * 2016-06-28 2018-10-16 大连民族大学 The interpretation method of the multi-lingual machine translation subsystem of Transaction Information
CN108664477B (en) * 2016-06-28 2022-04-01 大连民族大学 Translation method of transaction information multi-language machine translation subsystem
CN106503153A (en) * 2016-10-21 2017-03-15 江苏理工学院 Computer text classification system, system and text classification method thereof
CN106503153B (en) * 2016-10-21 2019-05-10 江苏理工学院 Computer text classification system
CN106598959A (en) * 2016-12-23 2017-04-26 北京金山办公软件股份有限公司 Method and system for determining intertranslation relationship of bilingual sentence pairs
CN106844358A (en) * 2017-01-19 2017-06-13 中译语通科技(北京)有限公司 The natural language statistical machine translation method of mass data model in system level chip
CN108628848A (en) * 2017-03-22 2018-10-09 湖南本来文化发展有限公司 The method that Sichuan accent and English are translated with BIRCH clustering algorithms
CN108628847A (en) * 2017-03-22 2018-10-09 湖南本来文化发展有限公司 A kind of simultaneous interpretation case for translating mandarin and English using BIRCH clustering algorithms
CN108628841A (en) * 2017-03-22 2018-10-09 湖南本来文化发展有限公司 The APP of Guangdong language accent and English is translated based on BIRCH clustering algorithms
CN107545036A (en) * 2017-07-28 2018-01-05 深圳前海微众银行股份有限公司 Customer service robot Knowledge Database method, customer service robot and readable storage medium storing program for executing
CN107644085B (en) * 2017-09-22 2020-12-11 百度在线网络技术(北京)有限公司 Method and device for generating sports event news
CN107644085A (en) * 2017-09-22 2018-01-30 百度在线网络技术(北京)有限公司 The generation method and device of competitive sports news
CN108228576B (en) * 2017-12-29 2021-07-02 科大讯飞股份有限公司 Text translation method and device
CN108228576A (en) * 2017-12-29 2018-06-29 科大讯飞股份有限公司 Text interpretation method and device
CN108920473B (en) * 2018-07-04 2022-08-09 中译语通科技股份有限公司 Data enhancement machine translation method based on same-class word and synonym replacement
CN108920473A (en) * 2018-07-04 2018-11-30 中译语通科技股份有限公司 A kind of data enhancing machine translation method based on similar word and synonym replacement
CN109543194A (en) * 2018-11-21 2019-03-29 传神语联网网络科技股份有限公司 Interpretation method and system are merged based on ICAT and TRADOS
CN109829550B (en) * 2019-02-01 2022-03-04 北京金山数字娱乐科技有限公司 Model evaluation method and device, model evaluation system and training method and device thereof
CN109829550A (en) * 2019-02-01 2019-05-31 北京金山数字娱乐科技有限公司 Model evaluation method and apparatus, model evaluation system and its training method and device
CN109977207A (en) * 2019-03-21 2019-07-05 网易(杭州)网络有限公司 Talk with generation method, dialogue generating means, electronic equipment and storage medium
CN110705320A (en) * 2019-10-08 2020-01-17 中国船舶工业综合技术经济研究院 State-defense military-industry-field machine translation method and system for subdivision field
CN111177412A (en) * 2019-12-30 2020-05-19 成都信息工程大学 Public logo bilingual parallel corpus system
CN111177412B (en) * 2019-12-30 2023-03-31 成都信息工程大学 Public logo bilingual parallel corpus system
CN111368563A (en) * 2020-03-03 2020-07-03 新疆大学 Clustering algorithm fused dimension-Chinese machine translation system
CN113204977A (en) * 2021-04-29 2021-08-03 北京有竹居网络技术有限公司 Information translation method, device, equipment and storage medium
CN113204977B (en) * 2021-04-29 2023-09-26 北京有竹居网络技术有限公司 Information translation method, device, equipment and storage medium
CN114218965A (en) * 2021-12-31 2022-03-22 语联网(武汉)信息技术有限公司 Automatic selection method for machine translation engine in similar field

Also Published As

Publication number Publication date
CN100527125C (en) 2009-08-12

Similar Documents

Publication Publication Date Title
CN101079028A (en) On-line translation model selection method of statistic machine translation
CN1159661C (en) System for Chinese tokenization and named entity recognition
Botha et al. Compositional morphology for word representations and language modelling
CN111061862B (en) Method for generating abstract based on attention mechanism
CN104199965B (en) Semantic information retrieval method
CN1253820C (en) Device and method for intercrossing language information retrieval
CN108334495A (en) Short text similarity calculating method and system
CN1426561A (en) Computer-aided reading system and method with cross-languige reading wizard
CN1475907A (en) Machine translation system based on examples
CN101042692A (en) translation obtaining method and apparatus based on semantic forecast
CN104391842A (en) Translation model establishing method and system
CN1855090A (en) Apparatus and method for translating japanese into chinese, and computer program product therefor
CN1465018A (en) Machine translation mothod
CN1924858A (en) Method and device for fetching new words and input method system
CN110879834B (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN1940915A (en) Corpus expansion system and method
CN101055588A (en) Method for catching limit word information, optimizing output and input method system
CN1629837A (en) Method and apparatus for processing, browsing and classified searching of electronic document and system thereof
CN106407184B (en) Coding/decoding method, statistical machine translation method and device for statistical machine translation
Prabhakar et al. Machine transliteration and transliterated text retrieval: a survey
KR20210062934A (en) Text document cluster and topic generation apparatus and method thereof
KR20110112192A (en) A syntactic analysis and hierarchical phrase model based machine translation system and method
CN1158621C (en) Information processing device and information processing method, and recording medium
Lopez et al. Improved HMM alignment models for languages with scarce resources
CN1567297A (en) Method for extracting multi-word translation equivalent cells from bilingual corpus automatically

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: HUAWEI TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES

Effective date: 20130528

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100080 HAIDIAN, BEIJING TO: 518129 SHENZHEN, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20130528

Address after: 518129 Bantian HUAWEI headquarters office building, Longgang District, Shenzhen

Patentee after: HUAWEI TECHNOLOGIES Co.,Ltd.

Address before: 100080 Haidian District, Zhongguancun Academy of Sciences, South Road, No. 6, No.

Patentee before: Institute of Computing Technology, Chinese Academy of Sciences

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090812