
CN111680529A - Machine translation algorithm and device based on layer aggregation - Google Patents

Machine translation algorithm and device based on layer aggregation

Info

Publication number
CN111680529A
CN111680529A (application CN202010527099.6A)
Authority
CN
China
Prior art keywords
module
atransformer
layer
translation
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010527099.6A
Other languages
Chinese (zh)
Inventor
汪金玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202010527099.6A priority Critical patent/CN111680529A/en
Publication of CN111680529A publication Critical patent/CN111680529A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G06F 40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of text translation, and discloses a machine translation algorithm and device based on layer aggregation. The algorithm comprises the following steps: acquiring a Chinese sentence to be translated and preprocessing the Chinese sentence; extracting, by an ATransformer encoder, multi-layer semantic feature information from the preprocessed sentence based on a multi-layer information extraction algorithm; decoding, by an ATransformer decoder, the multi-layer semantic feature information and outputting a translation target language sequence; and judging the translation target language sequence with a discrimination model D: if the translation target language sequence is judged to be a translation result, it is taken as the final machine translation result and output; otherwise, the parameters of the ATransformer model are updated based on a policy gradient algorithm, the preprocessed sentence to be translated is input into the updated ATransformer encoder, and the machine translation algorithm is executed again. The invention also provides a device for the layer aggregation based machine translation algorithm. The invention realizes intelligent translation of text.

Description

Machine translation algorithm and device based on layer aggregation
Technical Field
The invention relates to the technical field of text translation, in particular to a machine translation algorithm and a machine translation device based on layer aggregation.
Background
With the development of deep learning in the field of natural language processing, machine translation has transitioned from early studies of statistical machine translation, which is mainly centered on shallow machine learning, to neural machine translation, which is centered on deep learning techniques.
Traditional statistical machine translation has the disadvantages that human experts are needed to design features and the corresponding translation process, that long-distance dependencies are difficult to handle, and that data dispersion causes a serious data sparsity problem. Neural machine translation models, by incorporating an attention mechanism, effectively relieve the long-distance dependency problem, and on large-scale parallel corpora their effect is far better than that of statistical machine translation models. However, research shows that different layers in a neural machine translation model capture different types of syntactic and semantic information, while existing neural machine translation models only use the information of the last layer of the model, treating it as a summary of the whole network's response to the input, and make no use of the information propagated through the intermediate layers. Meanwhile, existing neural machine translation models usually adopt a single-model training method based on the maximum likelihood principle, i.e. the current translation model is taken as the training target and is trained by maximizing the conditional probability of generating the target-language translation given the source language, so the naturalness and accuracy of the translation result are difficult to guarantee.
In view of this, how to deeply capture the feature information between model layers, take the relationship between layers into account, and thereby effectively improve the quality of machine translation is a problem that those skilled in the art need to solve.
Disclosure of Invention
The invention provides a machine translation algorithm and device based on layer aggregation, which can deeply capture characteristic information between model layers and consider the relation between the layers, and can effectively improve the quality of machine translation.
In order to achieve the above object, the present invention provides a layer aggregation based machine translation algorithm, including:
acquiring a Chinese sentence to be translated, and performing text preprocessing operation on the Chinese sentence;
inputting the preprocessed statement into a preset ATransformer encoder, wherein the ATransformer encoder performs multilayer semantic feature information extraction on the statement based on a multilayer information extraction algorithm;
inputting the multilayer semantic feature information into a preset ATransformer decoder, wherein the preset ATransformer decoder decodes the multilayer semantic feature information and outputs a translation target language sequence;
inputting a translation target language sequence into a pre-trained discrimination model D, and judging the translation target language sequence by the discrimination model D;
and if the translation target language sequence is judged to be a translation result, taking the translation target language sequence as the final machine translation result and outputting it; otherwise, updating the parameters of the ATransformer model based on a policy gradient algorithm, inputting the preprocessed sentence to be translated into the updated ATransformer encoder, and executing the machine translation algorithm again.
Optionally, the text preprocessing operation includes:
matching the constructed stop word list with words in the text data one by one, and deleting the words if matching succeeds;
finding out all possible words in the word string by constructing a prefix dictionary and a self-defined dictionary; and
according to all the possible words found, each word corresponds to one directed edge in the graph and is assigned a corresponding edge-length weight; then, for the segmentation graph, among all paths from the starting point to the end point, the sets of paths whose length values rank 1st, 2nd, ..., i-th, ..., N-th in strictly ascending order are solved and taken as the corresponding rough segmentation result sets, and the rough segmentation result sets are the word segmentation result sets of the Chinese sentence to be translated.
Optionally, the ATransformer encoder performs multi-layer semantic feature information extraction on the sentence based on a multi-layer information extraction algorithm, where the method includes:
the main layer of the first module in the ATransformer encoder receives the preprocessed sentence to be translated; the first sub-layer ESL1 in the main layer calculates the sentence to be translated based on a self-attention mechanism, and the calculation result is input into the second sub-layer ESL2 in the main layer; the second sub-layer ESL2 performs residual connection on the output result based on a feed-forward fully-connected neural network; a merging layer in the module merges the output results of the two sub-layers by using a Joint function, and the merged result is taken as the input value of the next module;
the other modules in the ATransformer encoder sequentially receive the output value of the previous module, the output value of the previous module is used as the input value of the current module for calculation and output, and the output value of the last module is the extracted multi-layer semantic feature information; the ATransformer encoder thus extracts 12 layers of feature information and performs feature fusion, and the whole network structure is combined iteratively, from the shallowest and smallest structure, into a deeper and larger hierarchical structure;
the calculation formula of the first sub-layer based on the self-attention mechanism is:
ESL1_i = LayerNorm(Attention(Q_{i-1}, K_{i-1}, V_{i-1}) + ESL2_{i-1})
wherein:
LayerNorm(·) is a normalization function;
i is the i-th module in the encoder;
Attention(·) is the self-attention mechanism, Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V;
d_k is the dimension of the sentence to be translated;
Q_{i-1}, K_{i-1}, V_{i-1} are the three vector parameters obtained by the (i-1)-th module, respectively representing the query, key and value weights of the sentence to be translated; when i = 1, the three vector parameters are preset in the invention;
ESL2_{i-1} is the output value of the second sub-layer in the main layer of the (i-1)-th module; when i = 1, it is the preprocessed sentence to be translated;
the calculation formula by which the second sub-layer ESL2 performs residual connection on the output result based on the feed-forward fully-connected neural network is:
FC(x) = W_E·x + b_E
ESL2_i = LayerNorm(FC(ESL1_i) + ESL1_i)
wherein:
FC(·) is a feed-forward fully-connected network;
W_E is a preset encoder training weight;
b_E is a preset encoder bias parameter;
the merging layer of the module merges the output results of the sub-layers in the main layer and outputs a result L_i, where i denotes the i-th module; the merged fusion information is used as the input of the next module, so that the fusion of 12 layers of information is realized through 6 modules, and the final information fusion result is taken as the multi-layer semantic feature information; and
when i = 1, the formula for merging the sub-layer output results is:
L_1 = Joint(ESL1_1, ESL2_1)
wherein:
ESL1_1 is the output result of the first sub-layer of the first module;
ESL2_1 is the output result of the second sub-layer of the first module;
Joint(·) is a Joint function, Joint(a, b) = LayerNorm(FC([a; b]) + a + b);
when i > 1, the formula for merging the sub-layer output results is:
L_i = Joint(ESL1_i, ESL2_i, L_{i-1})
wherein:
ESL1_i is the first sub-layer of the main layer of the i-th module;
ESL2_i is the second sub-layer of the main layer of the i-th module;
L_{i-1} is the output result of the merging layer of the (i-1)-th module;
Joint(·) is a Joint function, Joint(a, b, c) = LayerNorm(FC([a; b; c]) + a + b + c).
Optionally, the decoding, by the preset ATransformer decoder, the multi-layer semantic feature information, and outputting a translation target language sequence, where the decoding includes:
the ATransformer decoder is formed by stacking 3 same modules, wherein each module is divided into a multi-head attention mechanism layer, a DSL1 sublayer and a DSL2 sublayer;
a first module of the ATransformer decoder receives multilayer semantic feature information, a multi-head attention mechanism layer, a DSL1 sublayer and a DSL2 sublayer in the module sequentially carry out output calculation on the multilayer semantic feature information, and an output result of the DSL2 sublayer is used as the input of the next module;
other modules in the ATransformer decoder receive the output value of the previous module, the output value of the previous module is used as the input value of the current module for calculation and output, and the output value of the DSL2 sublayer in the last module is the translation target language sequence obtained through final translation;
the calculation formula of the multi-head attention mechanism layer MHA_i of the decoder module is:
MHA_i = LayerNorm(Attention(Q_{i-1}, K_{i-1}, V_{i-1}) + DSL2_{i-1})
wherein:
LayerNorm(·) is a normalization function;
Attention(·) is the self-attention mechanism, Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V;
d_k is the dimension of the sentence to be translated;
Q_{i-1}, K_{i-1}, V_{i-1} are the three vector parameters obtained by the (i-1)-th decoder module, respectively representing the query, key and value weights of the sentence to be translated; when i = 1, the three vector parameters are obtained by training from the last module in the encoder;
DSL2_{i-1} is the output value of the DSL2 sub-layer of the (i-1)-th decoder module; when i = 1, it is the multi-layer semantic feature information;
the calculation formula of the DSL1 sub-layer of the decoder module is:
DSL1_i = LayerNorm(Attention(MHA_i, K_E, V_E) + MHA_i)
wherein:
K_E, V_E are the parameters obtained by the last module of the encoder;
the calculation formula of the DSL2 sub-layer of the decoder module is:
FC(x) = W_D·x + b_D
DSL2_i = LayerNorm(FC(DSL1_i) + DSL1_i)
wherein:
FC(·) is a feed-forward fully-connected network;
W_D is a preset decoder training weight;
b_D is a preset decoder bias parameter;
the output result of the DSL2 sub-layer in the last module is taken as the translation target language sequence T_Pred.
Optionally, the training process of the discriminant model D is:
(1) splicing a source language sentence S and a reference translation T in a training set into a two-dimensional matrix representation in a splicing mode, and inputting a splicing result as a positive sample;
(2) inputting a source language sentence S into an ATransformer model to obtain a translation target language sequence T _ p, splicing the source language sentence S and the T _ p, and inputting a splicing result as a negative sample so as to set an optimization target function of a discrimination model D:
max V(D) = log D((S, T)) + log(1 - D(ATransformer(S, T_p)))
(3) performing convolution operation on currently obtained positive and negative sample input to obtain positive and negative sample characteristics, wherein an activation function sigma of the convolution operation is a softmax function, and a formula of the convolution operation is as follows:
F_{i,j} = σ(W_F * f_{i,j} + b_F)
wherein:
W_F is a preset convolutional layer training weight;
b_F is a preset convolutional layer bias parameter;
f_{i,j} is the positive and negative sample input;
(4) obtaining the probability distribution of the positive and negative sample classes with a sigmoid function at the fully connected layer; if the probability difference between the positive and negative sample classes is less than the error rate set by the invention and the optimization objective function reaches its maximum, the discrimination model is considered to be successfully trained; otherwise, the convolution operation is performed again.
Optionally, the updating parameters of the ATransformer model based on the policy gradient algorithm includes:
1) according to the source language sentence S and the model D, the invention provides the following loss function to train the ATransformer model:
Loss = log(1 - D(S, T_Pred))
wherein:
D is the discrimination model;
T_Pred is the translation target language sequence output by the ATransformer model according to the source language sentence S;
2) performing gradient calculation on the parameters of the ATransformer model that need to be updated, and updating the parameters of the ATransformer model by gradient descent with the calculated gradient, so as to complete the training and optimization of the ATransformer model, where the gradient is calculated as:
∇_θ Loss = log(1 - D(S, T_Pred)) · ∇_θ log ATransformer(T_Pred | S; θ)
wherein:
ATransformer(·|S) is the conditional distribution generated by the ATransformer model;
θ is the parameter of the ATransformer model.
In addition, the present invention also provides an apparatus for a layer aggregation based machine translation algorithm, the apparatus comprising:
the text acquisition module is used for acquiring the Chinese sentences to be translated and preprocessing the Chinese sentences;
the encoding module is used for encoding the preprocessed Chinese sentences by using an ATransformer encoder and extracting multilayer semantic feature information of the Chinese sentences;
the decoding module is used for decoding the multi-layer semantic feature information by using the ATransformer decoder so as to output the translation target language sequence;
and the translation result judging module is used for judging the translation target language sequence.
Optionally, the translation result distinguishing module includes:
judging the translation target language sequence by the discrimination model;
if the translation target language sequence is judged to be a translation result, taking the translation target language sequence as a final machine translation result and outputting the final machine translation result;
otherwise, updating the parameters of the encoding module and the decoding module based on the policy gradient algorithm, inputting the preprocessed sentence to be translated into the updated encoding module, and executing the machine translation algorithm again.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon the model training program instructions of the ATransformer model, which are executable by one or more processors to implement the steps of a layer aggregation based machine translation algorithm as described above.
Compared with the prior art, the invention provides a machine translation algorithm and a device based on layer aggregation, and the technology has the following advantages:
Firstly, the encoder of the ATransformer model provided by the invention is provided with 6 identical modules, where the first module of the encoder receives the sentence to be translated and performs feature extraction on it, and the feature extraction result is input into the next module. Each module contains two sub-layers, ESL1_i and ESL2_i, together with a merging layer, where i denotes the i-th module. The ESL1_i sub-layer receives the output value of the previous module, performs a similarity calculation between each word in that output and all words in the sentence based on the self-attention mechanism to obtain the respective weights, and normalizes the weights with a softmax function; the feed-forward fully-connected neural network in the ESL2_i sub-layer then applies a non-linear mapping to the output values of the ESL1_i sub-layer so as to extract feature information; finally, the merging layer merges the output values of the two sub-layers with the Joint function Joint(·), which is based on a feed-forward neural network, and the merged result is taken as the input value of the next module, the output value of the last module being the multi-layer semantic feature information. In the prior art, 12 sub-layers are simply stacked directly, comprising 6 attention layers and 6 feature extraction layers, and a feature extraction algorithm sequentially extracts the feature information of the sentence to be translated in each sub-layer, which is equivalent to extracting the feature information of 6 sub-layers and superimposing it linearly. The algorithm of the invention instead places the attention-based ESL1_i sub-layer and the feed-forward fully-connected ESL2_i sub-layer within one module, merges the calculation results of the two layers for the sentence to be translated with the Joint function Joint(·), further extracts feature information with a feed-forward neural network, and takes the final result as the input value of the next module. In this way the feature information of 12 layers is extracted and fused, the whole network structure is combined iteratively, from the shallowest and smallest structure, into a deeper and larger hierarchical structure, and more and more detailed language structure features and inter-layer information available to the network can be obtained. Compared with the 6 layers of linguistic feature information obtained by linear superposition in the prior art, the linguistic feature information obtained by the algorithm of the invention comprises the feature information of 12 layers; meanwhile, the invention fuses the obtained feature information by feature fusion, and the fused feature information reflects the language features in the sentence to be translated more accurately, so that the machine translation algorithm has higher translation accuracy than the prior art.
Secondly, compared with the prior art, the invention proposes a strategy of judging the machine translation result and updating the machine translation model according to the judgment result. In detail, the discrimination model D judges the translation target language sequence; if the judgment is that it is a translation result, the translation target language sequence is output directly, otherwise the parameters of the ATransformer model are updated. For this purpose the invention proposes the following loss function to train the ATransformer model: Loss = log(1 - D(S, T_Pred)), where D is the discrimination model, T_Pred is the translation target language sequence output by the ATransformer model according to the source language sentence S, and (S, T_Pred) denotes the translation pair formed by the ATransformer model. The discrimination model D receives the translation pair (S, T_Pred) formed by the ATransformer model and outputs the similarity probability between T_Pred and the reference translation T. If the similarity probability between T_Pred and the reference translation T is maximal, i.e. the difference between T_Pred and the reference translation T is minimal, then the loss function proposed by the invention is minimal; therefore, when the ATransformer model is trained with the proposed loss function and the loss function is minimal, the ATransformer model obtained by training is the optimal model. Compared with the prior art, the algorithm of the invention can update the machine translation model in real time according to the judgment result of the discrimination model during the machine translation process, and therefore achieves higher translation accuracy.
Finally, the invention analyses the time complexity of the proposed algorithm. In each layer of the algorithm, for the main layer, each word in the sentence to be translated needs to be encoded as a d-dimensional vector of fixed length, so the input calculation for the whole sentence to be translated depends on its sentence length n; and since the ESL1_i sub-layer of the ATransformer model proposed by the invention needs to calculate the weight information of every word in the sentence to be translated, the time complexity of the ESL2_i sub-layer is likewise determined by the sentence length n and is O(n), so the time complexity of the main layer in the model of the invention is O(d·n²). For the sequential calculation operations in the model, the prior art generally connects multiple attention layers in series and performs the attention calculations within the sentence to be translated one after another, so the time complexity of the sequential operations of prior-art models is O(n); compared with the prior art, the invention is based on a self-attention parallel processing mechanism in which each main layer is modularised, the modules of the main layers are independent of one another, and the attention-based weight calculations can be performed independently and simultaneously, so the sequential operations are reduced from O(n) in the prior art to O(1). The merging layer proposed by the invention mainly performs an aggregation calculation on the calculation results of the main layer, and its time complexity is O(1). Thus the overall time complexity of the algorithm of the invention is O(d·n²); compared with O(d·n³) in the prior art, the algorithm reduces a certain amount of time overhead and can obtain the machine translation result more quickly.
Drawings
Fig. 1 is a flow chart of a layer aggregation-based machine translation algorithm according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an internal structure of a layer aggregation-based machine translation algorithm apparatus according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a machine translation algorithm and device based on layer aggregation, which can deeply capture characteristic information between model layers and consider the relationship between the layers and effectively improve the quality of machine translation. Referring to fig. 1, a flowchart of a layer aggregation-based machine translation algorithm according to an embodiment of the present invention is shown.
In this embodiment, the layer aggregation based machine translation algorithm includes:
and S1, acquiring the Chinese sentence to be translated and preprocessing the Chinese sentence.
Firstly, the invention obtains the Chinese sentence to be translated and carries out a preprocessing operation on the Chinese sentence, wherein in one embodiment of the invention the preprocessing operation comprises stop-word removal and word segmentation;
the method for removing stop words selected by the invention is to filter with a stop-word list: the constructed stop-word list is matched one by one with the words in the text data, and if the matching succeeds the word is a stop word and needs to be deleted. Stop words are function words with little practical meaning in the text data; they do not affect the classification of the text but occur with high frequency, and include common pronouns, prepositions and the like.
Further, because words have the capability of truly reflecting text content, but Chinese text, unlike English text, does not separate words by spaces, a word segmentation operation is required for the Chinese text. In the embodiment of the invention, the word segmentation operation is carried out on the text by using a word segmentation algorithm based on the N-shortest path;
the basic idea of the N-shortest path word segmentation algorithm is to find out all possible words in a character string according to a word segmentation dictionary and construct a word segmentation directed acyclic graph. In the embodiment of the invention, all possible words in the word string are found by constructing a prefix dictionary and a self-defined dictionary. The prefix dictionary includes prefixes of each participle in the statistical dictionary, for example, prefixes of a word "Beijing university" in the statistical dictionary are "Beijing", "Beijing Dada", respectively; the word "university" is prefixed by "big", etc.; the self-defined dictionary, which may also be called a proper noun dictionary, is a word that is not present in the statistical dictionary but is specific and exclusive in a certain field, such as resume, work experience, etc.
According to all the possible words found above, each word corresponds to one directed edge in the graph and is assigned a corresponding edge-length weight; then, for the segmentation graph, among all paths from the starting point to the end point, the sets of paths whose length values rank 1st, 2nd, ..., i-th, ..., N-th in strictly ascending order (the values at any two different positions being unequal) are solved as the corresponding rough segmentation result sets. If two or more paths have equal length, they are tied at the i-th rank and are all included in the rough segmentation result set without affecting the sequence numbers of the other paths, so the size of the final rough segmentation result set is greater than or equal to N; the rough segmentation result set is the word segmentation result set of the Chinese sentence to be translated.
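As an illustration of this preprocessing step, the sketch below builds a word graph from a toy dictionary and keeps the first N complete segmentation paths in order of increasing length; the stop-word list and dictionary shown are hypothetical placeholders, and the strict rank-grouping of equal-length paths described above is simplified to a plain N-shortest enumeration.

```python
# Sketch of the preprocessing step: stop-word removal followed by
# N-shortest-path word segmentation over a word graph.
# The word lists below are illustrative placeholders, not the patent's dictionaries.
import heapq

STOP_WORDS = {"的", "了", "和"}                                    # hypothetical stop-word list
DICTIONARY = {"北京", "北京大学", "大学", "生", "学生", "大学生"}  # hypothetical dictionary

def build_word_graph(sentence):
    """edges[i] lists (j, word) meaning sentence[i:j] is a candidate word."""
    edges = {i: [] for i in range(len(sentence))}
    for i in range(len(sentence)):
        edges[i].append((i + 1, sentence[i]))           # a single character is always a candidate
        for j in range(i + 2, len(sentence) + 1):
            if sentence[i:j] in DICTIONARY:
                edges[i].append((j, sentence[i:j]))
    return edges

def n_shortest_segmentations(sentence, n=3):
    """Return up to n segmentations in order of increasing path length
    (number of edges), i.e. a rough segmentation result set."""
    edges = build_word_graph(sentence)
    heap = [(0, 0, [])]                                  # (path length, position, words so far)
    results = []
    while heap and len(results) < n:
        length, pos, words = heapq.heappop(heap)
        if pos == len(sentence):
            results.append(words)
            continue
        for nxt, word in edges[pos]:
            heapq.heappush(heap, (length + 1, nxt, words + [word]))
    return results

# Stop-word removal (simplified to character filtering) followed by segmentation.
sentence = "".join(ch for ch in "北京大学生的宿舍" if ch not in STOP_WORDS)
print(n_shortest_segmentations(sentence))
```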
And S2, inputting the preprocessed sentence into a preset ATransformer encoder, wherein the ATransformer encoder performs multilayer semantic feature information extraction on the sentence based on a multilayer information extraction algorithm.
Further, the invention inputs the preprocessed statements into a preset ATransformer encoder, wherein the ATransformer encoder is formed by stacking 6 same modules, each module is divided into a main layer and a merging layer, each main layer is provided with two sub-layers, the first sub-layer contains a self-attention mechanism, and the second sub-layer is a fully-connected feedforward network layer;
in an embodiment of the present invention, a main layer of a first module of the ATransformer encoder receives a preprocessed sentence to be translated, two sublayers in the main layer sequentially perform calculation and output on the preprocessed sentence to be translated, and input an output result to a merging layer for merging calculation, where the merging result is simultaneously input to a next module for calculation and output, and a result of final information fusion is used as multilayer semantic feature information;
wherein the first sub-layer ESL1_i of the main layer of the i-th module calculates the sentence to be translated based on a self-attention mechanism, and the calculation formula based on the self-attention mechanism is:
ESL1_i = LayerNorm(Attention(Q_{i-1}, K_{i-1}, V_{i-1}) + ESL2_{i-1})
wherein:
LayerNorm(·) is a normalization function;
i is the i-th module in the encoder;
Attention(·) is the self-attention mechanism, Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V;
d_k is the dimension of the sentence to be translated;
Q_{i-1}, K_{i-1}, V_{i-1} are the three vector parameters obtained by the (i-1)-th module, respectively representing the query, key and value weights of the sentence to be translated; when i = 1, the three vector parameters are preset in the invention;
ESL2_{i-1} is the output value of the second sub-layer in the main layer of the (i-1)-th module; when i = 1, it is the preprocessed sentence to be translated.
The second sub-layer ESL2_i performs residual connection on the output result based on the feed-forward fully-connected neural network, and its calculation formula is:
FC(x) = W_E·x + b_E
ESL2_i = LayerNorm(FC(ESL1_i) + ESL1_i)
wherein:
FC(·) is a feed-forward fully-connected network;
W_E is a preset encoder training weight;
b_E is a preset encoder bias parameter.
Further, the merging layer of the module merges the output results of the sub-layers in the main layer and outputs a result L_i, where i denotes the i-th module; the merged fusion information is used as the input of the next module, so that through 6 modules the invention realizes the fusion of 12 layers of information and takes the final information fusion result as the multi-layer semantic feature information; by extracting the feature information once every two layers, more and more detailed language structure features and network-available information can be obtained compared with the prior art;
when i = 1, the formula for merging the sub-layer output results is:
L_1 = Joint(ESL1_1, ESL2_1)
wherein:
ESL1_1 is the output result of the first sub-layer of the first module;
ESL2_1 is the output result of the second sub-layer of the first module;
Joint(·) is a Joint function, Joint(a, b) = LayerNorm(FC([a; b]) + a + b).
When i > 1, the formula for merging the sub-layer output results is:
L_i = Joint(ESL1_i, ESL2_i, L_{i-1})
wherein:
ESL1_i is the first sub-layer of the main layer of the i-th module;
ESL2_i is the second sub-layer of the main layer of the i-th module;
L_{i-1} is the output result of the merging layer of the (i-1)-th module;
Joint(·) is a Joint function, Joint(a, b, c) = LayerNorm(FC([a; b; c]) + a + b + c).
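The encoder computation above can be sketched in PyTorch as follows. The single-head attention, the one-layer FC, and the symbol names ESL1/ESL2 are simplifying assumptions for illustration; only the overall structure of a main layer plus a Joint-style merging layer follows the description.

```python
# Sketch of one ATransformer encoder module: a main layer with a self-attention
# sub-layer (ESL1) and a feed-forward sub-layer (ESL2), plus a merging layer that
# fuses both sub-layer outputs with the Joint function.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderModule(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.fc_sub = nn.Linear(d_model, d_model)        # FC(x) = W_E x + b_E
        self.fc_join2 = nn.Linear(2 * d_model, d_model)  # FC([a; b]) inside Joint
        self.fc_join3 = nn.Linear(3 * d_model, d_model)  # FC([a; b; c]) inside Joint
        self.norm = nn.LayerNorm(d_model)

    def attention(self, q, k, v):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
        return torch.matmul(F.softmax(scores, dim=-1), v)

    def forward(self, x, l_prev=None):
        # ESL1_i = LayerNorm(Attention(Q, K, V) + x)
        esl1 = self.norm(self.attention(self.wq(x), self.wk(x), self.wv(x)) + x)
        # ESL2_i = LayerNorm(FC(ESL1_i) + ESL1_i)
        esl2 = self.norm(self.fc_sub(esl1) + esl1)
        # Merging layer: L_1 = Joint(ESL1, ESL2); L_i = Joint(ESL1, ESL2, L_{i-1})
        if l_prev is None:
            joined = self.fc_join2(torch.cat([esl1, esl2], dim=-1)) + esl1 + esl2
        else:
            joined = self.fc_join3(torch.cat([esl1, esl2, l_prev], dim=-1)) + esl1 + esl2 + l_prev
        return self.norm(joined)

# Six stacked modules fuse 12 sub-layers of feature information.
x = torch.randn(2, 10, 512)                    # (batch, sentence length, d_model)
modules = nn.ModuleList(EncoderModule(512) for _ in range(6))
l_prev = None
for m in modules:
    l_prev = m(x if l_prev is None else l_prev, l_prev)
print(l_prev.shape)                            # torch.Size([2, 10, 512])
```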
And S3, inputting the multilayer semantic feature information into a preset ATransformer decoder, and decoding the multilayer semantic feature information by the preset ATransformer decoder to output a translation target language sequence.
Furthermore, the invention inputs the multi-layer semantic feature information into a preset ATransformer decoder, wherein the ATransformer decoder is composed of three identical modules, and each module is composed of a multi-head attention mechanism layer MHA_i and two sub-layers, DSL1 and DSL2;
in an embodiment of the present invention, a first module of the ATransformer decoder receives multiple layers of semantic feature information, and a multi-head attention mechanism layer, a DSL1 sublayer and a DSL2 sublayer in the first module sequentially perform output calculation on the received information, and an output result is used as an input of a next module;
the calculation formula of the multi-head attention mechanism layer MHA_i of the i-th decoder module is:
MHA_i = LayerNorm(Attention(Q_{i-1}, K_{i-1}, V_{i-1}) + DSL2_{i-1})
wherein:
LayerNorm(·) is a normalization function;
Attention(·) is the self-attention mechanism, Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V;
d_k is the dimension of the sentence to be translated;
Q_{i-1}, K_{i-1}, V_{i-1} are the three vector parameters obtained by the (i-1)-th decoder module, respectively representing the query, key and value weights of the sentence to be translated; when i = 1, the three vector parameters are obtained by training from the last module in the encoder;
DSL2_{i-1} is the output value of the DSL2 sub-layer of the (i-1)-th decoder module; when i = 1, it is the multi-layer semantic feature information.
Further, the calculation formula of the DSL1 sub-layer of the i-th decoder module is:
DSL1_i = LayerNorm(Attention(MHA_i, K_E, V_E) + MHA_i)
wherein:
K_E, V_E are the parameters obtained by the last module of the encoder.
The calculation formula of the DSL2 sub-layer of the i-th decoder module is:
FC(x) = W_D·x + b_D
DSL2_i = LayerNorm(FC(DSL1_i) + DSL1_i)
wherein:
FC(·) is a feed-forward fully-connected network;
W_D is a preset decoder training weight;
b_D is a preset decoder bias parameter.
The invention sequentially carries out the calculation and output of the three modules of the decoder, and takes the output result of the DSL2 sub-layer in the last module as the translation target language sequence T_Pred.
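A corresponding sketch of the ATransformer decoder modules is shown below. The multi-head splitting and any masking of target positions are omitted, and treating the DSL1 sub-layer as cross-attention over the encoder parameters K_E and V_E is an assumption consistent with, but not spelled out by, the formulas above.

```python
# Sketch of one ATransformer decoder module: multi-head attention layer (MHA),
# DSL1 sub-layer attending over the encoder output, and DSL2 feed-forward sub-layer.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v):
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), v)

class DecoderModule(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.fc = nn.Linear(d_model, d_model)    # FC(x) = W_D x + b_D
        self.norm = nn.LayerNorm(d_model)

    def forward(self, y, enc_k, enc_v):
        # MHA_i = LayerNorm(Attention(Q, K, V) + y)
        mha = self.norm(attention(self.wq(y), self.wk(y), self.wv(y)) + y)
        # DSL1_i = LayerNorm(Attention(MHA_i, K_E, V_E) + MHA_i)
        dsl1 = self.norm(attention(mha, enc_k, enc_v) + mha)
        # DSL2_i = LayerNorm(FC(DSL1_i) + DSL1_i)
        return self.norm(self.fc(dsl1) + dsl1)

# Three stacked decoder modules; the last DSL2 output stands for the T_Pred
# representation, which a projection + softmax would turn into target-language tokens.
enc_out = torch.randn(2, 10, 512)                # multi-layer semantic feature information
decoder = nn.ModuleList(DecoderModule(512) for _ in range(3))
y = enc_out                                      # the first decoder module receives the multi-layer features
for m in decoder:
    y = m(y, enc_out, enc_out)
print(y.shape)                                   # torch.Size([2, 10, 512])
```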
S4, inputting the translation target language sequence into a pre-trained discrimination model D, where the discrimination model D judges the translation target language sequence; if the translation target language sequence is judged to be a translation result, it is taken as the final machine translation result and output; otherwise, the parameters of the ATransformer model are updated based on a policy gradient algorithm, the preprocessed sentence to be translated is input into the updated ATransformer encoder, and the machine translation algorithm is executed again.
Furthermore, the translation target language sequence T_Pred is input into the pre-trained discrimination model D, the discrimination model D judges the translation target language sequence T_Pred, and the discrimination model D is a convolutional neural network model;
the training process of the discriminant model D is as follows:
(1) for a source language sentence S and a reference translation T in the training set, the word vectors in S and T are spliced into a two-dimensional matrix representation: for the word vector s_i of the i-th word in S and the word vector t_j of the j-th word in the reference translation T, the following feature map, which is the positive sample input of the discrimination model D, can be obtained:
f_{i,j} = [s_i; t_j]
(2) inputting a source language sentence S into an ATransformer model to obtain a translation target language sequence T _ p, splicing the source language sentence S and the T _ p, and inputting a splicing result as a negative sample so as to set an optimization target function of a discrimination model D:
max V(D) = log D((S, T)) + log(1 - D(ATransformer(S, T_p)))
(3) performing convolution operation on currently obtained positive and negative sample inputs by adopting a convolution kernel of 5 x 5 to obtain positive and negative sample characteristics, wherein an activation function sigma of the convolution operation is a softmax function, and a formula of the convolution operation is as follows:
F_{i,j} = σ(W_F * f_{i,j} + b_F)
wherein:
W_F is a preset convolutional layer training weight;
b_F is a preset convolutional layer bias parameter.
(4) Obtaining the probability distribution of the positive and negative sample classes with a sigmoid function at the fully connected layer; if the probability difference between the positive and negative sample classes is less than the error rate set by the invention and the optimization objective function reaches its maximum, the discrimination model is considered to be successfully trained; otherwise, the convolution operation is performed again.
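The training procedure of the discrimination model D can be sketched as below. The embedding size, the way the source and target word vectors are spliced into a two-dimensional map, and the realisation of max V(D) by minimising its negative are illustrative assumptions; only the positive-pair versus negative-pair structure and the 5×5 convolution follow the steps above.

```python
# Sketch of the CNN discrimination model D: the source sentence and a translation are
# spliced into a 2-D feature map, convolved with a 5x5 kernel, and scored by a
# fully connected layer with a sigmoid output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    def __init__(self, d_model=64, max_len=20):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=5, padding=2)     # 5x5 convolution
        self.fc = nn.Linear(8 * max_len * max_len, 1)
        self.max_len = max_len

    def splice(self, src, tgt):
        # f_{i,j}: one simple way to splice S and a translation into a 2-D matrix,
        # here a similarity map between the i-th source word and the j-th target word.
        sim = torch.matmul(src, tgt.transpose(-2, -1))             # (batch, |S|, |T|)
        out = torch.zeros(src.size(0), self.max_len, self.max_len)
        out[:, : sim.size(1), : sim.size(2)] = sim
        return out.unsqueeze(1)                                    # (batch, 1, L, L)

    def forward(self, src, tgt):
        feat = F.softmax(self.conv(self.splice(src, tgt)), dim=1)  # F_{i,j} = sigma(W_F * f_{i,j} + b_F)
        return torch.sigmoid(self.fc(feat.flatten(1)))             # probability of being a real pair

# One training step: maximise log D(S, T) + log(1 - D(S, T_Pred)).
D = Discriminator()
opt = torch.optim.Adam(D.parameters(), lr=1e-3)
S = torch.randn(4, 12, 64)          # source word vectors
T = torch.randn(4, 15, 64)          # reference translation word vectors
T_pred = torch.randn(4, 15, 64)     # ATransformer output (stand-in)
loss = -(torch.log(D(S, T)).mean() + torch.log(1 - D(S, T_pred)).mean())
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```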
Further, if the translation target language sequence T_Pred is judged to be a translation result, the translation target language sequence T_Pred is output as the final machine translation result; otherwise, the invention updates the parameters of the ATransformer model based on the policy gradient algorithm, inputs the preprocessed sentence to be translated into the updated ATransformer encoder, and executes the machine translation algorithm again;
because the translation result generated by the ATransformer model is not a continuous value, the error signal generated in the discrimination model cannot be propagated back to the ATransformer model, so the parameter update is performed with a policy gradient algorithm. The process of updating the parameters of the ATransformer model based on the policy gradient algorithm comprises the following steps:
1) according to the source language sentence S and the model D, the invention provides the following loss function to train the ATransformer model:
Loss = log(1 - D(S, T_Pred))
wherein:
D is the discrimination model;
T_Pred is the translation target language sequence output by the ATransformer model according to the source language sentence S.
2) Performing gradient calculation on the parameters of the ATransformer model that need to be updated, and updating the parameters of the ATransformer model by gradient descent with the calculated gradient, so as to complete the training and optimization of the ATransformer model; gradient descent itself is prior art and is not described here, and the gradient is calculated as:
∇_θ Loss = log(1 - D(S, T_Pred)) · ∇_θ log ATransformer(T_Pred | S; θ)
wherein:
ATransformer(·|S) is the conditional distribution generated by the ATransformer model;
θ is the parameter of the ATransformer model.
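The parameter update can be sketched as a REINFORCE-style step, as below. The toy generator standing in for the ATransformer model, the sampling of T_Pred and the absence of a variance-reduction baseline are simplifying assumptions; only the weighting of the log-probability gradient by Loss = log(1 - D(S, T_Pred)) follows the formulas above.

```python
# Sketch of the policy-gradient update: the discriminator score of the sampled
# translation T_Pred weights the gradient of log ATransformer(T_Pred | S; theta).
import torch
import torch.nn as nn

vocab, d_model, tgt_len = 1000, 32, 6

class ToyATransformer(nn.Module):               # stand-in for the full encoder/decoder
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab)
    def forward(self, src_repr):
        return torch.log_softmax(self.proj(src_repr), dim=-1)   # per-position log-probs

model = ToyATransformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

src_repr = torch.randn(1, tgt_len, d_model)                   # encoded source sentence S (stand-in)
log_probs = model(src_repr)                                    # (1, tgt_len, vocab)
t_pred = torch.multinomial(log_probs.exp().squeeze(0), 1).squeeze(-1)    # sample T_Pred
log_p_tpred = log_probs[0, torch.arange(tgt_len), t_pred].sum()          # log ATransformer(T_Pred | S)

with torch.no_grad():
    d_score = torch.rand(())                                   # stand-in for D(S, T_Pred)

reward = torch.log(1.0 - d_score)                              # Loss = log(1 - D(S, T_Pred))
loss = reward * log_p_tpred                                    # grad = Loss * grad log ATransformer(T_Pred|S)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```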
The following describes an embodiment of the invention through a simulation experiment that tests the algorithm of the invention. The training and testing of the machine translation algorithm of the invention use the deep learning framework PyTorch; all models are trained on 6 NVIDIA K80 GPUs, each GPU is allocated 4000 tokens, and the newstest2017 data set is used as the test set. The comparison models selected by the invention comprise a Transformer model trained by the traditional method, a statistical machine translation model RNN-embed based on a deep neural network, and a statistical machine translation model NNPR based on a neural network.
In order to verify the effectiveness of the algorithm proposed by the invention, the accuracy of the translation result T_Pred of each model is analysed with the machine translation evaluation metric BLEU. For a source language sentence S_i, the translation result is T_Pred_i and the reference translations in the corresponding target language are T_i = {T_i1, ..., T_im}; n-grams is the set of phrases of length n, w_k denotes the k-th possible n-gram, h_k(T_Pred_i) denotes the number of times w_k occurs in T_Pred_i, and h_k(T_ij) denotes the number of times w_k occurs in the reference translation T_ij. The calculation formula of the BLEU overlap accuracy between the translation results and the reference translations is then:
P_n = Σ_i Σ_k min(h_k(T_Pred_i), max_j h_k(T_ij)) / Σ_i Σ_k h_k(T_Pred_i)
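A small sketch of this n-gram overlap precision is given below; clipping each candidate n-gram count by its maximum count over the reference translations follows the formula above, while the brevity penalty and the geometric mean over different n, which full BLEU also uses, are omitted.

```python
# Sketch of the BLEU n-gram overlap precision P_n between translation results
# and their reference translations (clipped count / candidate count).
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))

def overlap_precision(candidates, references, n=2):
    """candidates: list of token lists T_Pred_i; references: list of lists of token lists T_i."""
    clipped, total = 0, 0
    for t_pred, refs in zip(candidates, references):
        cand = ngram_counts(t_pred, n)
        max_ref = Counter()
        for ref in refs:                        # max_j h_k(T_ij)
            for gram, cnt in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped += sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
        total += sum(cand.values())             # sum_k h_k(T_Pred_i)
    return clipped / total if total else 0.0

cands = [["the", "cat", "sat", "on", "the", "mat"]]
refs = [[["the", "cat", "is", "on", "the", "mat"]]]
print(overlap_precision(cands, refs, n=2))      # 0.6
```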
according to the simulation experiment result, the Transformer model trained by the traditional method completes the translation of the data set in 8 hours, and the BLEU of the model is equally 18; the RNN-embed machine translation model completes the translation of the data set in 7 hours, and the BLEU of the model is 25; the NNPR machine translation model completes the translation of the data set in 9 hours, and the BLEU of the NNPR machine translation model is 27; the machine translation model based on layer aggregation completes the translation of the data set within 5 hours, and the BLEU of the translation model is equal to 32. Therefore, compared with the existing algorithm, the machine translation algorithm can finish the translation of the text more quickly, and has higher translation precision.
The invention also provides a device of the machine translation algorithm based on layer aggregation. Referring to fig. 2, a schematic diagram of an internal structure of an apparatus for a layer aggregation based machine translation algorithm according to an embodiment of the present invention is provided.
In this embodiment, the apparatus 1 for layer aggregation based machine translation algorithm at least includes a text obtaining module 11, an encoding module 12, a decoding module 13, a translation result judging module 14, and a communication bus 15.
The text acquiring module 11 may be a PC (Personal Computer), a terminal device such as a smart phone, tablet computer or portable computer, or a server.
The encoding module 12, which may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or other data Processing chip, is used to perform encoding operation on the preprocessed chinese sentence by using the ATransformer encoder and perform multi-layer semantic feature information extraction on the chinese sentence.
The decoding module 13 may, in some embodiments, be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip, and is configured to decode the multi-layer semantic feature information by using the ATransformer decoder, so as to output the translation target language sequence.
The translation result discrimination module 14 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., an SD or DX memory card), a magnetic memory, a magnetic disk, an optical disk, and the like. The translation result discrimination module 14 may in some embodiments be an internal storage unit of the apparatus 1 for the layer aggregation based machine translation algorithm, for example a hard disk of the apparatus 1. In other embodiments, the translation result discrimination module 14 may also be an external storage device of the apparatus 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the apparatus 1. Further, the translation result discrimination module 14 may also include both an internal storage unit and an external storage device of the apparatus 1. The translation result discrimination module 14 may be configured not only to store application software installed in the apparatus 1 and various types of data, such as model training program instructions, but also to temporarily store data that has been output or is to be output.
The communication bus 15 is used to realize connection communication between these components.
Optionally, the apparatus 1 may further comprise a user interface, which may comprise a display (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface may also comprise a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the apparatus 1 based on the layer aggregation machine translation algorithm and for displaying a visualized user interface.
Fig. 2 only shows the apparatus 1 with the components 11-15 and the layer aggregation based machine translation algorithm, and it will be understood by those skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the apparatus 1 of the layer aggregation based machine translation algorithm, and may comprise fewer or more components than those shown, or combine certain components, or a different arrangement of components.
In the embodiment of the apparatus 1 shown in fig. 2, the translation result determining module 14 stores a model training program instruction of the ATransformer model; the process of the device executing the layer aggregation based machine translation algorithm is the same as the process of executing the layer aggregation based machine translation algorithm, and the description is not repeated here.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a model training program instruction of an ATransformer model is stored on the computer-readable storage medium, where the model training program instruction is executable by one or more processors to implement the following operations:
acquiring a Chinese sentence to be translated, and preprocessing the Chinese sentence;
utilizing an ATransformer encoder to perform encoding operation on the preprocessed Chinese sentences, and performing multi-layer semantic feature information extraction on the Chinese sentences;
decoding the multi-layer semantic feature information by using the ATransformer decoder, thereby outputting the translation target language sequence;
judging the translation target language sequence, and if the translation target language sequence is judged to be a translation result, outputting it as the final machine translation result; otherwise, updating the parameters of the encoding module and the decoding module based on the policy gradient algorithm, inputting the preprocessed sentence to be translated into the updated encoding module, and executing the machine translation algorithm again.
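Taken together, the stored program instructions amount to the loop sketched below. The function names (preprocess, encoder, discriminator and so on) are placeholders standing for the modules described above, and the bounded retry loop is an illustrative simplification of executing the machine translation algorithm again after a parameter update.

```python
# Sketch of the overall layer-aggregation machine translation pipeline:
# preprocess -> encode -> decode -> discriminate -> (accept or update and retry).
def translate(sentence, preprocess, encoder, decoder, discriminator, update_params,
              threshold=0.5, max_rounds=5):
    tokens = preprocess(sentence)                       # stop-word removal + segmentation
    for _ in range(max_rounds):
        features = encoder(tokens)                      # multi-layer semantic feature information
        t_pred = decoder(features)                      # translation target language sequence
        if discriminator(sentence, t_pred) >= threshold:
            return t_pred                               # accepted as the final translation
        update_params(sentence, t_pred)                 # policy-gradient update of the ATransformer
    return t_pred                                       # fall back to the last candidate

# Example wiring with trivial stand-ins:
result = translate(
    "待翻译的中文句子",
    preprocess=lambda s: list(s),
    encoder=lambda toks: toks,
    decoder=lambda feats: "translated sentence",
    discriminator=lambda s, t: 0.9,
    update_params=lambda s, t: None,
)
print(result)
```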
The embodiment of the computer-readable storage medium of the present invention is substantially the same as that of the above-mentioned machine translation algorithm based on layer aggregation, and will not be described herein again.
It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better embodiment. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the present specification and drawings, or used directly or indirectly in other related fields, are included in the scope of the present invention.

Claims (9)

1. A layer aggregation based machine translation algorithm, the method comprising:
acquiring a Chinese sentence to be translated, and performing text preprocessing operation on the Chinese sentence;
inputting the preprocessed statement into a preset ATransformer encoder, wherein the ATransformer encoder performs multilayer semantic feature information extraction on the statement based on a multilayer information extraction algorithm;
inputting the multilayer semantic feature information into a preset ATransformer decoder, wherein the preset ATransformer decoder decodes the multilayer semantic feature information and outputs a translation target language sequence;
inputting a translation target language sequence into a pre-trained discrimination model D, and judging the translation target language sequence by the discrimination model D;
and if the translation target language sequence is judged to be a translation result, taking the translation target language sequence as the final machine translation result and outputting it; otherwise, updating the parameters of the ATransformer model based on a policy gradient algorithm, inputting the preprocessed sentence to be translated into the updated ATransformer encoder, and executing the machine translation algorithm again.
2. The layer aggregation-based machine translation algorithm of claim 1, wherein the text pre-processing operation comprises:
matching the constructed stop word list with words in the text data one by one, and deleting the words if the matching is successful;
finding out all possible words in the word string by constructing a prefix dictionary and a self-defined dictionary; and
and according to all the possible words found, each word corresponds to one directed edge in the graph and is assigned a corresponding edge-length weight; then, for the segmentation graph, among all paths from the starting point to the end point, the sets of paths whose length values rank 1st, 2nd, ..., i-th, ..., N-th in strictly ascending order are solved and taken as the corresponding rough segmentation result sets, and the rough segmentation result sets are the word segmentation result sets of the Chinese sentence to be translated.
3. The layer aggregation-based machine translation algorithm of claim 2, wherein the ATransformer encoder performs multi-layer semantic feature information extraction on a sentence based on a multi-layer information extraction algorithm, comprising:
the ATransformer encoder is formed by stacking 6 same modules, each module is divided into a main layer and a merging layer, each main layer is provided with two sub-layers, the first sub-layer contains a self-attention mechanism, the second sub-layer is a fully-connected feedforward network layer, the merging layer merges sub-layer results by using a Joint function, and the merged result is used as an output value of the next module;
the main layer of the first module in the ATransformer encoder receives the preprocessed sentence to be translated; the first sub-layer ESL1 in the main layer calculates the sentence to be translated based on a self-attention mechanism and inputs the calculation result into the second sub-layer ESL2 in the main layer; the second sub-layer ESL2 performs residual connection on the output result based on a feed-forward fully-connected neural network; a merging layer in the module merges the output results of the two sub-layers by using a Joint function, and the merged result is taken as the input value of the next module;
other modules in the ATransformer coder sequentially receive the output value of the previous module, the output value of the previous module is used as the input value of the current module for calculation and output, the output value of the last module is the extracted multilayer semantic feature information, the ATransformer coder extracts 12 layers of feature information and performs feature fusion, and the whole network structure is combined into a deeper hierarchical structure and a larger hierarchical structure from the shallowest structure and the smallest structure in an iterative manner;
the sub-layer
Figure FDA0002533892610000024
The calculation formula based on the self-attention mechanism in (1) is as follows:
Figure FDA0002533892610000025
wherein:
LayerNorm (·) is a normalization function;
i is the ith module in the encoder;
attention (. cndot.) is the self-Attention mechanism,
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V;
d_k is the dimension of the input sentence to be translated;
Q_{i-1}, K_{i-1}, V_{i-1} are the three vector parameters obtained from the (i-1)-th module, representing the query, key and value weights of the sentence to be translated; when i = 1, the three vector parameters are preset in the invention;
SL2_{i-1} is the output value of the second sub-layer in the main layer of the (i-1)-th module; when i = 1, it is the preprocessed sentence to be translated;
the calculation formula of the residual connection based on the feed-forward fully-connected neural network in the sub-layer SL2_i is:
SL2_i = LayerNorm(SL1_i + FC(SL1_i))
FC(SL1_i) = W_E·SL1_i + b_E
wherein:
FC (-) is a feed-forward fully-connected network;
W_E is the preset encoder training weight;
b_E is the preset encoder bias parameter.
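A minimal NumPy sketch of one encoder module as claim 3 describes it: a self-attention sub-layer and a feed-forward sub-layer, each wrapped in a LayerNorm residual connection, followed by a merging layer. The Joint function is not spelled out in the claim, so averaging the two sub-layer outputs is an assumption, as are the dimensions and weight initializations.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm(.): normalize each position over the feature dimension."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

class EncoderModule:
    """One encoder module: a main layer with two sub-layers, plus a merging layer."""
    def __init__(self, d_model, rng):
        self.w_q, self.w_k, self.w_v = (rng.normal(0, 0.02, (d_model, d_model)) for _ in range(3))
        self.w_e = rng.normal(0, 0.02, (d_model, d_model))  # encoder training weight W_E
        self.b_e = np.zeros(d_model)                         # encoder bias parameter b_E

    def forward(self, x):
        # First sub-layer: self-attention with a LayerNorm residual connection.
        sl1 = layer_norm(x + attention(x @ self.w_q, x @ self.w_k, x @ self.w_v))
        # Second sub-layer: feed-forward fully-connected network with a LayerNorm residual.
        sl2 = layer_norm(sl1 + (sl1 @ self.w_e + self.b_e))
        # Merging layer: Joint(.) is unspecified in the claim; averaging is an assumption.
        return 0.5 * (sl1 + sl2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(7, 64))                                # 7 tokens, d_model = 64
    for module in [EncoderModule(64, rng) for _ in range(6)]:   # 6 stacked modules
        x = module.forward(x)
    print("multi-layer semantic feature information:", x.shape)
```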
4. The layer aggregation based machine translation algorithm of claim 3, wherein the preset ATransformer decoder decodes the multi-layer semantic feature information and outputs a translation target language sequence, comprising:
the ATransformer decoder is formed by stacking 3 identical modules, each of which is divided into a multi-head attention mechanism layer, a DSL1 sub-layer and a DSL2 sub-layer;
the first module of the ATransformer decoder receives the multi-layer semantic feature information; the multi-head attention mechanism layer, the DSL1 sub-layer and the DSL2 sub-layer in the module compute on the multi-layer semantic feature information in turn, and the output result of the DSL2 sub-layer is used as the input of the next module;
the other modules in the ATransformer decoder each receive the output value of the previous module, use it as the input value of the current module for computation and output, and the output value of the DSL2 sub-layer in the last module is the finally translated target-language sequence;
the calculation formula of the multi-head attention mechanism layer MHA_i of the i-th decoder module is:
MHA_i = LayerNorm(DSL2_{i-1} + Attention(Q_{i-1}, K_{i-1}, V_{i-1}))
wherein:
LayerNorm (·) is a normalization function;
attention (. cndot.) is the self-Attention mechanism,
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V;
d_k is the dimension of the sentence to be translated;
Q_{i-1}, K_{i-1}, V_{i-1} are the three vector parameters obtained from the (i-1)-th decoder module, representing the query, key and value weights of the sentence to be translated; when i = 1, they are obtained from the training of the last module in the encoder;
DSL2_{i-1} is the output value of the DSL2 sub-layer of the (i-1)-th decoder module; when i = 1, it is the multi-layer semantic feature information;
the calculation formula of the DSL1 sub-layer of the decoder module is:
DSL1_i = LayerNorm(MHA_i + Attention(MHA_i, K_E, V_E))
wherein:
i is the ith module;
K_E, V_E are parameters obtained from the last module of the encoder;
the calculation formula of the DSL2 sub-layer of the decoder module is:
DSL2_i = LayerNorm(DSL1_i + FC(DSL1_i))
FC(DSL1_i) = W_D·DSL1_i + b_D
wherein:
FC (-) is a feed-forward fully-connected network;
W_D is the preset decoder training weight;
b_D is the preset decoder bias parameter;
the output result of the DSL2 sub-layer in the last module is taken as the translation target language sequence T_Pred.
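Under the same assumptions, a decoder module of claim 4 can be sketched as an attention layer over the decoder state, a DSL1 sub-layer that attends to the encoder side through K_E and V_E, and a DSL2 feed-forward sub-layer. Using the encoder output itself as K_E and V_E, a single attention head, and the chosen dimensions are illustrative choices, not details taken from the patent.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V"""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

class DecoderModule:
    """One decoder module: attention layer, DSL1 sub-layer (encoder-side attention), DSL2 sub-layer."""
    def __init__(self, d_model, rng):
        self.w_q, self.w_k, self.w_v = (rng.normal(0, 0.02, (d_model, d_model)) for _ in range(3))
        self.w_d = rng.normal(0, 0.02, (d_model, d_model))  # decoder training weight W_D
        self.b_d = np.zeros(d_model)                         # decoder bias parameter b_D

    def forward(self, y, k_enc, v_enc):
        # Attention layer over the previous decoder output, with a LayerNorm residual.
        mha = layer_norm(y + attention(y @ self.w_q, y @ self.w_k, y @ self.w_v))
        # DSL1 sub-layer: attention against the encoder-side parameters K_E and V_E.
        dsl1 = layer_norm(mha + attention(mha, k_enc, v_enc))
        # DSL2 sub-layer: feed-forward fully-connected network with a LayerNorm residual.
        return layer_norm(dsl1 + (dsl1 @ self.w_d + self.b_d))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    enc = rng.normal(size=(7, 64))             # multi-layer semantic feature information
    y = enc                                    # input of the first decoder module
    for module in [DecoderModule(64, rng) for _ in range(3)]:   # 3 stacked modules
        y = module.forward(y, enc, enc)        # K_E, V_E taken as the encoder features here
    print("T_Pred representation:", y.shape)
```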
5. The layer aggregation-based machine translation algorithm of claim 4, wherein the discriminant model D is trained by:
(1) concatenating the source-language sentence S and the reference translation T from the training set into a two-dimensional matrix representation, and using the concatenation result as a positive-sample input;
(2) inputting the source-language sentence S into the ATransformer model to obtain a translation target language sequence T_p, concatenating the source-language sentence S with T_p, and using the concatenation result as a negative-sample input, thereby setting the optimization objective function of the discrimination model D:
max V(D) = log D((S, T)) + log(1 − D(ATransformer(S, T_p)))
(3) performing a convolution operation on the currently obtained positive- and negative-sample inputs to obtain the positive- and negative-sample features, where the activation function σ of the convolution operation is a softmax function and the convolution operation is computed as:
F_{i,j} = σ(W_F * f_{i,j} + b_F)
wherein:
W_F is the preset convolutional-layer training weight;
b_F is the preset convolutional-layer bias parameter;
f_{i,j} is the positive- or negative-sample input;
(4) obtaining the probability distribution over the positive and negative sample categories with a sigmoid function at the fully-connected layer; if the probability difference between the positive and negative sample categories is smaller than the error rate set by the method and the optimization objective function reaches its maximum, the discriminant model is considered to be successfully trained; otherwise, the convolution operation is performed again.
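The sketch below mirrors the claim-5 training step with a small PyTorch discriminator: a source/translation pair is concatenated into a two-dimensional matrix, a convolution with a softmax activation extracts features, and a sigmoid output scores the pair, after which the objective V(D) is maximized. The channel sizes, kernel shape, embedding dimensions and random inputs are illustrative only.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """CNN discriminator D scoring a concatenated (source, translation) matrix."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)  # F = sigma(W_F * f + b_F)
        self.pool = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Linear(16, 1)

    def forward(self, pair_matrix):
        h = torch.softmax(self.conv(pair_matrix), dim=-1)        # softmax activation, per the claim
        h = self.pool(h).flatten(1)
        return torch.sigmoid(self.fc(h))                         # probability that the pair is real

def pair(source, target):
    """Concatenate source and target embeddings into one single-channel 2-D matrix."""
    return torch.cat([source, target], dim=0).unsqueeze(0).unsqueeze(0)

if __name__ == "__main__":
    d = Discriminator()
    optim = torch.optim.Adam(d.parameters(), lr=1e-3)
    src, ref, fake = (torch.randn(7, 64) for _ in range(3))      # hypothetical embeddings
    pos, neg = d(pair(src, ref)), d(pair(src, fake))
    v_d = torch.log(pos) + torch.log(1 - neg)   # max V(D) = log D((S,T)) + log(1 - D((S,T_p)))
    (-v_d).mean().backward()                    # ascend V(D) by descending -V(D)
    optim.step()
    print(float(pos), float(neg))
```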
6. The layer aggregation-based machine translation algorithm of claim 5, wherein updating the parameters of the ATransformer model based on the policy gradient algorithm comprises:
1) training the ATransformer model with the following loss function:
Loss=log(1-D(S,T_Pred))
wherein:
D is the discrimination model;
S is the source-language sentence;
T_Pred is the translation target language sequence output by the ATransformer model for the source-language sentence S;
2) computing the gradient for the parameters of the ATransformer model that need to be updated, and updating the parameters of the ATransformer model by gradient descent using the computed gradient, thereby completing the training and optimization of the ATransformer model; the gradient is calculated as follows:
∇_θ Loss = E_{T_Pred ~ ATransformer(·|S)}[ log(1 − D(S, T_Pred)) · ∇_θ log ATransformer(T_Pred | S) ]
wherein:
ATransformer(·|S) is the conditional distribution generated by the ATransformer model;
θ denotes the parameters of the ATransformer model, including the model weights, bias parameters and learning rate.
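A REINFORCE-style sketch of the claim-6 update: the generator samples T_Pred, the discriminator supplies the signal log(1 − D(S, T_Pred)), and that signal weighted by the log-probability of the sample gives the surrogate whose gradient is the policy gradient. The toy generator, the stand-in discriminator score and the optimizer settings are placeholders for the ATransformer model and the trained discrimination model D.

```python
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Placeholder for the ATransformer model: one linear layer over a small vocabulary."""
    def __init__(self, d_model=64, vocab=100):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, src):
        return torch.log_softmax(self.proj(src), dim=-1)   # log ATransformer(. | S)

def policy_gradient_step(generator, d_score, src, optimizer):
    """One update: sample T_Pred, weight its log-probability by Loss = log(1 - D(S, T_Pred)),
    and take a gradient-descent step on the resulting surrogate."""
    dist = torch.distributions.Categorical(logits=generator(src))
    t_pred = dist.sample()                                  # sampled target sequence
    with torch.no_grad():
        loss_signal = torch.log(1 - d_score(src, t_pred))   # Loss = log(1 - D(S, T_Pred))
    surrogate = (loss_signal * dist.log_prob(t_pred)).mean()
    optimizer.zero_grad()
    surrogate.backward()                                    # gradient equals the policy gradient
    optimizer.step()
    return float(surrogate)

if __name__ == "__main__":
    gen = ToyGenerator()
    opt = torch.optim.SGD(gen.parameters(), lr=0.01)
    src = torch.randn(7, 64)                                # hypothetical source representation
    d_score = lambda s, t: torch.sigmoid(torch.randn(()))   # stand-in discriminator score in (0, 1)
    print(policy_gradient_step(gen, d_score, src, opt))
```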
7. An apparatus for a layer aggregation-based machine translation algorithm, the apparatus comprising a text acquisition module, an encoding module, a decoding module and a translation result discrimination module, wherein:
the text acquisition module is used for acquiring the Chinese sentences to be translated and preprocessing the Chinese sentences;
the encoding module is used for encoding the preprocessed Chinese sentences by using an ATransformer encoder and extracting multilayer semantic feature information of the Chinese sentences;
the decoding module is used for decoding the multi-layer semantic feature information by using an ATransformer decoder so as to output a translation target language sequence;
and the translation result judging module is used for judging the translation target language sequence.
8. The apparatus of the layer aggregation-based machine translation algorithm of claim 7, wherein the translation result discrimination module is configured for:
judging the translation target language sequence by the discrimination model;
if the translation target language sequence is judged to be a translation result, taking the translation target language sequence as a final machine translation result and outputting the final machine translation result;
otherwise, updating the parameters of the encoding module and the decoding module based on the policy gradient algorithm, inputting the preprocessed sentence to be translated into the updated encoding module, and carrying out the machine translation algorithm again.
9. A computer readable storage medium having stored thereon model training program instructions of an ATransformer model, the model training program instructions being executable by one or more processors to implement the steps of a layer aggregation based machine translation algorithm of any of claims 1 to 6.
CN202010527099.6A 2020-06-11 2020-06-11 Machine translation algorithm and device based on layer aggregation Withdrawn CN111680529A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010527099.6A CN111680529A (en) 2020-06-11 2020-06-11 Machine translation algorithm and device based on layer aggregation

Publications (1)

Publication Number Publication Date
CN111680529A true CN111680529A (en) 2020-09-18

Family

ID=72454558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010527099.6A Withdrawn CN111680529A (en) 2020-06-11 2020-06-11 Machine translation algorithm and device based on layer aggregation

Country Status (1)

Country Link
CN (1) CN111680529A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364665A (en) * 2020-10-11 2021-02-12 广州九四智能科技有限公司 Semantic extraction method and device, computer equipment and storage medium
CN112269610A (en) * 2020-10-26 2021-01-26 南京燚麒智能科技有限公司 Method and device for executing batch model algorithm
CN112395408A (en) * 2020-11-19 2021-02-23 平安科技(深圳)有限公司 Stop word list generation method and device, electronic equipment and storage medium
CN112395408B (en) * 2020-11-19 2023-11-07 平安科技(深圳)有限公司 Stop word list generation method and device, electronic equipment and storage medium
CN113591498A (en) * 2021-08-03 2021-11-02 北京有竹居网络技术有限公司 Translation processing method, device, equipment and medium
CN113591498B (en) * 2021-08-03 2023-10-03 北京有竹居网络技术有限公司 Translation processing method, device, equipment and medium
WO2024207863A1 (en) * 2023-04-03 2024-10-10 腾讯科技(深圳)有限公司 Training method for translation model, text translation method, and apparatus, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20200918