1 Introduction

In the real world, many natural language texts written in the Latin script are case sensitive, including English, French, and German. For many natural language processing (NLP) tasks, case information is an important feature that helps algorithms distinguish sentence structures, identify the part of speech of a word, and recognize named entities. However, most existing machine translation approaches pay little attention to the capitalization correctness of the generated words, which fails to meet practical requirements and may introduce noise into downstream NLP applications [9, 20].

In fact, there is a contradiction in preprocessing the training corpus: lowercasing the corpus limits the growth of the vocabulary but discards some morphological information, while keeping the original surface forms enlarges the vocabulary and breaks the connection between a word and its lowercase form. Figure 1 gives an example to illustrate this contradiction. Using a true-cased corpus seems to balance the unnecessary vocabulary growth against the loss of case morphology. However, restoring cases from a true-cased corpus is not as easy as the reverse process. Table 1 shows that using the corpus in lowercase and in regular case yields the highest case-insensitive and case-sensitive BLEU scores, respectively, which reflects the difficulty of case restoration.

Fig. 1.

Two examples of Zh-En translation. The Chinese side is presented in pinyin. “ ” and “apple” are an aligned word pair, identical on the source side but written in different cases on the target side in our examples. The contradiction is that using the lowercased “apple” in the second example loses the information that it is a proper noun, while using an individual word “Apple” loses the semantic connection with the parallel pair (“ ”, “apple”).

Table 1. Case-insensitive/sensitive BLEU scores on Zh-En translation. \(\varDelta \) represents the reduction in BLEU score compared with the “insensitive” column. NRC is a rule-based case restoration method; more experimental setup details are described in Sect. 5.2.

In this paper, we introduce case-sensitive neural machine translation (NMT) approaches to alleviate the above problems. In our approaches, we apply a lowercased vocabulary to both the source input and the target output side of the NMT model, and the model is trained to jointly generate the translation and determine the capitalization of the generated words. During decoding, the model predicts the case of each output word while generating the translation.

Specifically, we propose two kinds of methods to this end: i) mixing case tokens into the lowercased corpus to indicate the real case of the adjacent word; ii) extending the NMT architecture with an additional network layer that performs case prediction. We evaluate on pairs of linguistically disparate corpora in three translation tasks: Chinese-English (Zh-En), English-German (En-De) and English-French (En-Fr), and observe that the proposed techniques improve translation quality measured by case-sensitive BLEU [16]. We also study model performance on case restoration tasks, and experimental results show that our proposed methods improve P, R and \(F_1\) scores.

2 Related Work

Recently, neural machine translation (NMT) with the encoder-decoder framework [6] has shown promising results on many language pairs [8, 21], and incorporating linguistic knowledge into neural machine translation has been extensively studied [7, 12, 17]. However, the NMT decoding procedure rarely considers the case correctness of the generated words, and existing approaches instead perform case restoration on the machine-generated texts [9, 20].

Recent efforts have demonstrated that incorporating linguistic information can be useful in NMT [7, 12, 15, 17, 22, 23]. Since the source sentence is fully observed and extra information is easy to attach to it, using source-side features is a straightforward way to improve translation performance [12, 17]. For example, Sennrich and Haddow incorporate linguistic features by appending feature vectors to word embeddings [17], and source-side hierarchical syntactic structures have also been used to achieve promising improvements [7, 12]. Leveraging target-side syntactic information is less straightforward, because target words are not available in advance during decoding. Niehues and Cho apply multi-task learning, where the encoder of the NMT model is trained to support additional tasks such as POS tagging and named-entity recognition [15]. There are also works that directly model the syntax of the target sentence during decoding [22,23,24].

Word case information is a kind of lexical morphology that is deterministic and can be obtained without any additional annotation or parsing of the training corpus. Recently, a joint decoder has been proposed for predicting words and their cases synchronously [25], which shares a similar spirit with part of our approach (see Sect. 4.2). The main distinction of our work is that we propose two families of case-sensitive NMT methods and study various model setups.

3 Neural Machine Translation

Given a source sentence \({\varvec{x}}=\{x_1,x_2,...,x_{T_x}\}\) and a target sentence \({\varvec{y}}=\{y_1,y_2,...,y_{T_y}\}\), most popular neural machine translation approaches [3, 8, 21] directly model the conditional probability:

$$\begin{aligned} p({\varvec{y}}|{\varvec{x}};{\varvec{\theta }})=\prod _{t=1}^{T_y} p(y_t|{\varvec{y}}_{<t},{\varvec{x}};{\varvec{\theta }}), \end{aligned}$$
(1)

where \({\varvec{y}}_{<t}\) is the partial translation before decoding step t and \(\varvec{\theta }\) is a set of parameters of the NMT model.
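For illustration, the following minimal sketch (with hypothetical per-step distributions) shows how the factorization in Eq. (1) turns into the negative log-likelihood commonly used for training.

```python
import math

def sequence_nll(step_distributions, target_ids):
    """Minimal sketch of Eq. (1): the sentence probability factorizes into
    per-step conditionals p(y_t | y_<t, x). `step_distributions[t]` is a
    hypothetical probability vector over the target vocabulary at decoding
    step t, and `target_ids[t]` is the index of the reference word y_t."""
    return -sum(math.log(dist[y]) for dist, y in zip(step_distributions, target_ids))

# Toy example: a vocabulary of size 3 and a two-step target sentence.
dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(sequence_nll(dists, [0, 1]))  # = -log(0.7) - log(0.8)
```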

In this paper, the proposed approaches make few assumptions about the specific NMT model and can be applied to any popular encoder-decoder based NMT architecture [8, 21]. To simplify the experiments and highlight our contributions, we take the Transformer [21], one of the popular state-of-the-art NMT models, as the implementation of the baseline NMT model. Specifically, the encoder contains a stack of six identical layers. Each layer consists of two sub-layers: i) a multi-head self-attention mechanism, and ii) a position-wise fully connected feed-forward network. A residual connection is applied around each of the two sub-layers, followed by layer normalization [2]. The decoder is also composed of a stack of six identical layers. Besides the two sub-layers stated above, a third sub-layer is inserted in each layer to perform multi-head attention over the output of the encoder. The implementations of our approaches are all based on this architecture. Following the base model setup of the Transformer [21], we use 8 attention heads, 512-dimensional output vectors for each layer, and a 2048-dimensional inner layer in the feed-forward network.
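For concreteness, the baseline setup can be summarized as the following configuration sketch; the field names are ours and are not tied to any particular toolkit.

```python
from dataclasses import dataclass

@dataclass
class TransformerBaseConfig:
    """Hyperparameters of the baseline Transformer described above
    (a summary sketch only; the field names are our own)."""
    encoder_layers: int = 6
    decoder_layers: int = 6
    attention_heads: int = 8
    model_dim: int = 512        # output dimension of each (sub-)layer
    ffn_inner_dim: int = 2048   # inner dimension of the feed-forward network

baseline_config = TransformerBaseConfig()
```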

4 Approaches

4.1 Adding Case Token

The technique of adding artificial tokens is a straightforward and practical way to incorporate additional knowledge into NMT [4, 18], since it barely modifies the model architecture and does not increase the number of model parameters.

In our approach, we add two artificial tokens, “<ca>” and “<ab>”, to indicate capitalized words and abbreviations in a sequence, respectively. The special token can be inserted to the left (LCT) or the right (RCT) of the capitalized word. For the target sequence, LCT means the case of a word is predicted before the word itself is generated, while RCT applies the opposite order.

For corpora segmented into subword units [11, 19], we insert the LCT to the left of the first subword unit of a capitalized word and the RCT to the right of its last subword unit. For instance, Fig. 2 shows the sentences modified by LCT and RCT, given the original sentence and the sentence encoded with subword units.

Fig. 2.

Examples of modifying the original sentence (or the sentence encoded with subwords) with LCT and RCT. “<ca>” and “<ab>” are two additional artificial tokens indicating capitalized words and abbreviations.
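As an illustration of the LCT/RCT preprocessing described above, the sketch below operates on whole words; the heuristics used to detect capitalized words and abbreviations are our assumptions, and subword-level insertion would attach the token to the first or last unit of the word as stated in Sect. 4.1.

```python
def add_case_tokens(words, side="left"):
    """Hedged sketch of LCT/RCT preprocessing: the corpus is lowercased and
    an artificial token marks capitalized words ("<ca>") or abbreviations
    ("<ab>"). `side="left"` gives LCT, `side="right"` gives RCT."""
    out = []
    for w in words:
        token = None
        if len(w) > 1 and w.isupper():      # e.g. "USA": treat as abbreviation
            token = "<ab>"
        elif w[:1].isupper():               # e.g. "Apple": capitalized word
            token = "<ca>"
        lw = w.lower()
        if token is None:
            out.append(lw)
        elif side == "left":                # LCT: the token precedes the word
            out.extend([token, lw])
        else:                               # RCT: the token follows the word
            out.extend([lw, token])
    return out

print(add_case_tokens("The USA unveiled Apple shares".split(), side="left"))
# ['<ca>', 'the', '<ab>', 'usa', 'unveiled', '<ca>', 'apple', 'shares']
```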

4.2 NMT Jointly Learning Case Prediction

In this approach, we add an additional case prediction output to the decoder of the encoder-decoder based NMT model at each decoding step. Given a source sentence \({\varvec{x}}=\{x_1,x_2,...,x_{T_x}\}\), its target translation \({\varvec{y}}=\{y_1,y_2,...,y_{T_y}\}\), and the case category sequence of the target side \({\varvec{c}}=\{c_1,c_2,...,c_{T_c}\}\), the goal of the extension is to enable the NMT model to compute the joint probability \(P({\varvec{y}},{\varvec{c}}|{\varvec{x}})\). The overall joint model can be computed as:

$$\begin{aligned} P({\varvec{y}},{\varvec{c}}|{\varvec{x}})=\prod _{t=1}^{T_y}p(y_t|{\varvec{y}}_{<t},{\varvec{c}}_{< t},{\varvec{x}})p(c_t|{\varvec{y}}_{< t},{\varvec{c}}_{<t},{\varvec{x}}). \end{aligned}$$
(2)
Fig. 3.

Graphical illustrations of the proposed approaches. The hollow circle with black dashed lines represents the next word/case to be generated. (a) shows adding a case token without modifying the decoder (see Sect. 4.1 for more details); (b), (c) and (d) are three implementations of jointly predicting a word and its case (see Sect. 4.2 for more details).

Intuitively, there are three assumptions about jointly predicting \(c_t\) at time step t: i) predicting \(c_t\) before generating the word \(y_t\) (\({CP}_{pre}\)), ii) predicting \(c_t\) after the word \(y_t\) has been generated (\({CP}_{pos}\)), and iii) predicting the probabilities of \(c_t\) and \(y_t\) synchronously (\({CP}_{syn}\)).

\({\varvec{CP}}_{pos}\): At time step t, the model first predicts \(y_t\) and then predicts the case \(c_t\) of the known word \(y_t\), which is consistent with most case restoration processes (as shown in Fig. 3(b)). Under this assumption, the conditional probabilities in Eq. (2) can be computed as:

$$\begin{aligned} p(y_t|{\varvec{y}}_{<t},{\varvec{c}}_{<t},{\varvec{x}})=g_y(y_{t-1}, z_t, s_t, h_t) \end{aligned}$$
(3)

and

$$\begin{aligned} p(c_t|{\varvec{y}}_{\le t},{\varvec{c}}_{<t},{\varvec{x}})=g_c(y_t, c_{t-1}, z_t, s_t, h_t), \end{aligned}$$
(4)

respectively, where \(s_t\) and \(z_t\) are self-attention based context vectors over the previously generated \({\varvec{y}}_{< t}\) and \({\varvec{c}}_{< t}\), and \(h_t\) is the output of the encoder. \(g_y(\cdot )\) is the output layer of the Transformer [21] decoder, and \(g_c(\cdot )\) is the additional output layer that performs case prediction. \(z_t\) and \(g_c(\cdot )\) form an additional one-layer Transformer-based decoder with one attention head and 32-dimensional output vectors and feed-forward inner layer, which works in parallel with the original NMT decoder.
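A possible realization of this auxiliary case decoder is sketched below in PyTorch. The way the previous cases, the main-decoder states and the encoder outputs are combined is our simplification (causal masking and positional information are omitted for brevity), and the three-way case label set (lowercase, capitalized, abbreviation) is an assumption, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CaseDecoder(nn.Module):
    """Sketch of the auxiliary case-prediction decoder used by CP_pos: one
    Transformer decoder layer with a single attention head and 32-dimensional
    states, running alongside the main NMT decoder."""

    def __init__(self, num_cases=3, case_dim=32, model_dim=512):
        super().__init__()
        self.case_emb = nn.Embedding(num_cases, case_dim)   # embeds c_<t
        self.proj_dec = nn.Linear(model_dim, case_dim)      # maps main-decoder states
        self.proj_enc = nn.Linear(model_dim, case_dim)      # maps encoder outputs h
        self.layer = nn.TransformerDecoderLayer(
            d_model=case_dim, nhead=1, dim_feedforward=32, batch_first=True)
        self.out = nn.Linear(case_dim, num_cases)           # g_c(.)

    def forward(self, prev_cases, decoder_states, encoder_out):
        # prev_cases:     (batch, t)            previously predicted labels c_<t
        # decoder_states: (batch, t, model_dim) main-decoder states (for CP_pos
        #                 these already encode the freshly generated word y_t)
        # encoder_out:    (batch, src_len, model_dim) encoder outputs h
        z = self.case_emb(prev_cases) + self.proj_dec(decoder_states)
        z = self.layer(tgt=z, memory=self.proj_enc(encoder_out))
        return torch.log_softmax(self.out(z), dim=-1)       # log p(c_t | ...)
```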

\({\varvec{CP}}_{pre}\): In the case of \({CP}_{pre}\), the decoder first predicts the case category of the probable word, narrowing the candidate vocabulary, and then generates the output word itself (as shown in Fig. 3(c)). Under this assumption, the conditional probabilities in Eq. (2) can be computed as:

$$\begin{aligned} p(c_t|{\varvec{y}}_{<t},{\varvec{c}}_{<t},{\varvec{x}})=g_c(c_{t-1}, z_t, s_t, h_t) \end{aligned}$$
(5)

and

$$\begin{aligned} p(y_t|{\varvec{y}}_{<t},{\varvec{c}}_{\le t},{\varvec{x}})=g_y(y_{t-1}, c_t, z_t, s_t, h_t). \end{aligned}$$
(6)

\({\varvec{CP}}_{syn}\): Under this assumption, the two generation processes are simultaneous and independent of each other (as shown in Fig. 3(d)); the decoder predicts the probabilities of \(c_t\) and \(y_t\) synchronously. The conditional probabilities in Eq. (2) can be computed as:

$$\begin{aligned} p(c_t|{\varvec{y}}_{<t},{\varvec{c}}_{<t},{\varvec{x}})=g_c(c_{t-1}, z_t, s_t, h_t) \end{aligned}$$
(7)

and

$$\begin{aligned} p(y_t|{\varvec{y}}_{<t},{\varvec{c}}_{<t},{\varvec{x}})=g_y(y_{t-1}, z_t, s_t, h_t). \end{aligned}$$
(8)

4.3 Adaptive Scaling Algorithm

In the training corpus, capitalized words account for a small percentage of the tokens, which leads to a class imbalance problem when training the case classification model. Reported results indicate that simply applying the standard classification paradigm to imbalanced tasks results in deficient performance [1, 13]. To alleviate this problem, we apply Adaptive Scaling (AS) [13] to the case prediction training.

We treat words containing uppercase letters as positive instances and lowercase words as negative instances. Formally, given P positive training instances \(\mathcal {P}\) and N negative instances \(\mathcal {N}\), let \(TP(\theta )\) and \(TN(\theta )\) be the numbers of correctly predicted positive and negative instances on the training data with respect to the \(\theta \)-parameterized model. Then, taking the loss of \({CP}_{pre}\) as an example, the loss function is modified as:

$$\begin{aligned} \mathcal {L}_{AS}(\theta )=-\sum _{(c_j, y_j)\in \mathcal {P}}\mathrm{log}~p(c_j|y_j;\theta )-\sum _{(c_j, y_j)\in \mathcal {N}}w(\theta )\cdot \mathrm{log}~p(c_j|y_j;\theta ), \end{aligned}$$
(9)

where

$$\begin{aligned} w(\theta )=\frac{TP(\theta )}{P+N-TN(\theta )}. \end{aligned}$$
(10)

Batch-wise Adaptive Scaling Algorithm. In practice, most NMT models are trained with batch-wise gradient based algorithms, so we apply the batch-wise version of the adaptive scaling algorithm [13] in our work. Let \(\mathcal {P}^B\) denote the \(P^B\) positive instances and \(\mathcal {N}^B\) the \(N^B\) negative instances in a batch; \({TP}^{B}\) and \({TN}^{B}\) are estimated as:

$$\begin{aligned} {TP}^{B}(\theta )=\sum _{(c_i,y_i)\in \mathcal {P}^B}p(c_i|y_i;\theta ) \end{aligned}$$
(11)

and

$$\begin{aligned} {TN}^{B}(\theta )=\sum _{(c_i,y_i)\in \mathcal {N}^B}p(c_i|y_i;\theta ). \end{aligned}$$
(12)

Then \(w(\theta )^B\) is estimated as:

$$\begin{aligned} w(\theta )^{B}=\frac{TP^B(\theta )}{P^B+N^B-TN^B(\theta )}. \end{aligned}$$
(13)
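The batch-wise estimates above can be implemented directly on the case classifier's output distribution, as in the following PyTorch sketch; the tensor shapes and the decision to treat the weight as a constant within a batch are our assumptions.

```python
import torch

def adaptive_scaling_nll(case_log_probs, case_targets, positive_mask):
    """Sketch of the batch-wise adaptive scaling loss (Eqs. 9-13).
    case_log_probs: (num_instances, num_cases) log p(c_i | y_i; theta)
    case_targets:   (num_instances,) gold case labels (long tensor)
    positive_mask:  (num_instances,) True for capitalized (positive) words."""
    gold_logp = case_log_probs.gather(1, case_targets.unsqueeze(1)).squeeze(1)
    gold_prob = gold_logp.exp()
    pos, neg = positive_mask, ~positive_mask
    tp_b = gold_prob[pos].sum()                         # Eq. (11)
    tn_b = gold_prob[neg].sum()                         # Eq. (12)
    p_b = pos.sum().float()
    n_b = neg.sum().float()
    w_b = (tp_b / (p_b + n_b - tn_b)).detach()          # Eq. (13)
    return -(gold_logp[pos].sum() + w_b * gold_logp[neg].sum())  # Eq. (9)
```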

5 Experiment

5.1 Datasets and Setups

To verify the effectiveness of the proposed methods, we evaluate them on three typical translation tasks: Chinese-English (Zh-En), English-French (En-Fr) and English-German (En-De). These three language pairs represent three typical application scenarios: i) the source language does not share any word capitalization information with the target language (Zh-En); ii) the capitalization rules of the source and target languages are similar (En-Fr); iii) the capitalization rules of the source and target languages differ (En-De). These typical translation tasks help us study the effects of word case on NMT performance.

Chinese-English. For Chinese-English translation, our training data are extracted from three LDC corpora (see Footnote 1). The training set contains about 1.3M parallel sentence pairs. For preprocessing, the Chinese side of both the training and test sets is segmented with the LTP Chinese word segmenter [5]. After applying the unigram language model encoding [11], we obtain a Chinese vocabulary of about 39K tokens and an English vocabulary of about 40K tokens. We use NIST02 as our validation set and NIST2003–NIST2005 as our test sets.

English-German and English-French. For English-German and English-French translation, we conduct our experiments on the publicly available WMT’14 corpora. The En-De dataset contains 4.5M sentence pairs, and the significantly larger En-Fr dataset consists of 18M sentence pairs. We encode the corpora with the unigram language model [11]; the source and target vocabularies contain about 37K tokens for En-De and about 30K tokens for En-Fr. We report results on newstest2014 and use newstest2013 for validation.

For all translation tasks, we tokenize all corpora with the Moses tokenizer (see Footnote 2) before applying subword units, and sentences longer than 200 words are discarded.
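For reference, the unigram language model encoding [11] can be applied with a SentencePiece-style tool roughly as follows; whether this exact tool was used is an assumption, and the file names and vocabulary size below are placeholders rather than the paper's settings.

```python
import sentencepiece as spm

# Train a unigram-LM subword model on an already tokenized corpus.
spm.SentencePieceTrainer.train(
    input="train.tok.en",        # placeholder path to the tokenized corpus
    model_prefix="unigram_en",
    vocab_size=40000,            # placeholder; see the per-task sizes above
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="unigram_en.model")
print(sp.encode("Apple unveiled a new phone .", out_type=str))
```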

Evaluation. Following [21], we report the result of a single model obtained by averaging the five checkpoints around the best model selected on the development set. We apply beam search during decoding with a beam size of 6. The translation results in this paper are measured by both case-insensitive and case-sensitive BLEU scores [16], evaluated with the multi-bleu.perl script (see Footnote 2). We also analyze model performance on word case restoration and evaluate the results with P, R and \(F_1\) scores.
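For the case restoration metrics, a simple counting scheme such as the following could be used; treating capitalized reference words as the positive class and the exact matching criterion are our assumptions, and the paper's scoring script may differ.

```python
def case_restoration_prf(hyp_words, ref_words):
    """Hedged sketch of P/R/F1 for case restoration over aligned
    hypothesis/reference words (positives = capitalized reference words)."""
    tp = fp = fn = 0
    for h, r in zip(hyp_words, ref_words):
        if r != r.lower():          # reference word carries case information
            if h == r:
                tp += 1             # case correctly restored
            else:
                fn += 1             # missed or wrongly cased
        elif h != h.lower():
            fp += 1                 # spurious capitalization in the hypothesis
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

print(case_restoration_prf("Apple shares Rose".split(), "Apple shares rose".split()))
# (0.5, 1.0, 0.666...)
```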

5.2 Baselines

Regular Case (RC) and Lowercase (LC). RC and LC denote using the original corpus in regular case and lowercasing the entire training corpus, respectively.

Table 2. Case-insensitive/case-sensitive BLEU scores on the Zh-En, En-Fr and En-De translation tasks. In the “Models” column, bold and italics indicate the target word cases and the applied methods, respectively.

Truecase (TC). We truecase the target-language side of the corpora using the Moses [10] script truecase.perl (see Footnote 2). It keeps words in their natural case and only changes words at the beginning of a sentence to their most frequent form.

Naive Re-case (NRC). For comparison, we also apply a rule-based method that restores the model output to regular case. We first build a dictionary of capitalized words based on the target side of the translation corpus, collecting the words that usually appear in capitalized form (i.e., whose frequency of occurrence in capitalized form is greater than 50%). Output words found in this dictionary are then restored to their capitalized form.
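A minimal sketch of this baseline is given below; the 50% threshold follows the description above, while capitalizing the sentence-initial word is an added assumption of ours.

```python
from collections import Counter

def build_caps_dict(target_sentences):
    """Collect words that appear capitalized more than 50% of the time in the
    target-side training corpus, mapped to their most frequent cased form."""
    total = Counter()
    cased_forms = {}
    for sent in target_sentences:
        for w in sent.split():
            lw = w.lower()
            total[lw] += 1
            if w != lw:
                cased_forms.setdefault(lw, Counter())[w] += 1
    return {lw: forms.most_common(1)[0][0]
            for lw, forms in cased_forms.items()
            if sum(forms.values()) / total[lw] > 0.5}

def naive_recase(lowercased_output, caps_dict):
    """Restore case in a lowercased model output using the dictionary;
    the sentence-initial word is additionally capitalized (our assumption)."""
    words = [caps_dict.get(w, w) for w in lowercased_output.split()]
    if words:
        words[0] = words[0][:1].upper() + words[0][1:]
    return " ".join(words)

caps = build_caps_dict(["Apple shares rose .", "He bought an Apple phone ."])
print(naive_recase("apple shares rose again .", caps))  # "Apple shares rose again ."
```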

Joint Prediction Model (JPM) [25]. This work shares a similar motivation with our approach. It proposes an NMT model that jointly predicts English words and their cases by employing two output layers on one decoder.

Table 3. Case restoration results on the Zh-En, En-Fr and En-De translation tasks. In the “Models” column, bold and italics indicate the target word cases and the applied methods, respectively.

5.3 Main Results

Table 2 shows the experimental results on the three translation tasks. Rows #1–#4 list the results of the baseline methods, and rows #5–#12 list the results of our proposed methods. From Table 2 we can observe that, for every experimental setup, the model achieves a higher case-insensitive BLEU score than its case-sensitive counterpart. The drop in case-sensitive BLEU is most pronounced for Zh-En translation, since the source language does not provide any relevant morphological information. For En-Fr translation, the target language shares similar capitalization rules with the source input, so the case-sensitive performance drops less. The phenomenon is not very prominent for En-De translation, probably because the capitalization rules of German differ from those of the other two languages (En and Fr).

The results show that our proposed CP methods outperform the various baseline setups on the three translation tasks, but the translation quality of LCT and RCT decreases in some cases. The reasons for the negative results of the “adding case token” approaches may be: i) the additional case tokens increase the average length of the generated sequences by more than 5 tokens; ii) decoding with case tokens inside the sequence may dilute the influence of previously generated words. In contrast, the CP methods use a relatively independent decoder, and the case information acts as an additional feature input for NMT decoding.

One interesting finding in the overall results is that LCT and \({CP}_{pre}\) usually obtain better BLEU scores than RCT and \(CP_{pos}\). In our approaches, LCT and \(CP_{pre}\) predict the case label before generating the target word, while RCT and \(CP_{pos}\) follow the reverse order. We suspect that the generated case label narrows the search space of the target words. \(CP_{pos}\) and \(CP_{syn}\) also improve over the baseline methods. We also study the impact of applying the adaptive scaling algorithm [13]; the results are listed in rows #10–#12 of Table 2. The experimental results show that the proposed methods with the adaptive scaling algorithm (AS) perform better under case-sensitive measurement, which indicates that applying AS enhances the prediction of word cases.

5.4 Case Restoration

In this section, we analyze the impact of the different methods on case restoration. We conduct experiments on the test sets of the three tasks: NIST2003–NIST2005 for Zh-En, and newstest2014 and newstest2015 for En-Fr and En-De. The results are evaluated with P, R and \(F_1\) scores (see Footnote 3), as shown in Table 3. From Table 3, we can see that our proposed CP methods gain higher \(F_1\) scores under most model setups. Compared with adding case tokens, the CP approaches separate case prediction from word prediction and introduce an additional network block to handle this task. Since the separate case prediction decoder learns more lexical information, it leads to improvements on case restoration. As shown in rows #10–#12, applying the adaptive scaling algorithm [13], which adaptively scales the influence of negative instances in the loss function, is also effective for case prediction.

Comparing the three translation tasks, the P, R and \(F_1\) scores on En-Fr are significantly better than on the other language pairs. As mentioned above, French shares similar capitalization rules with English, so the decoder can capture more lexical information from the source side. The NRC method works much better on the En-De task than on the others, since the capitalization rules for German words are relatively fixed.

5.5 Decoding Efficiency

We analyze the decoding time of our methods compared with the baseline approach on the NIST2002 Zh-En validation set, using one NVIDIA GeForce GTX 1080Ti GPU and a batch size of 32 sentences. Table 4 shows that the proposed methods have lower decoding efficiency than the baseline. For NMT with the CP methods, the additional decoder creates extra decoding overhead; in particular, \(CP_{pre}\) and \(CP_{pos}\) require additional autoregressive steps to predict word cases, which reduces decoding efficiency. The approach of adding case tokens does not increase the number of NMT model parameters, but the additional tokens increase the length of the generated target sequence.

Table 4. Comparison of model parameters and decoding efficiency. “#parameters” is the number of free parameters of the NMT model. “Speed” is the decoding speed (sentences per second), evaluated on the NIST2002 Zh-En validation set.

6 Conclusion

Word case information, as a linguistic feature, is deterministic and easily obtainable. Incorporating case information into machine translation also meets the needs of practical applications. In this paper, we propose two types of approaches for case-sensitive neural machine translation: i) directly adding a case token to the word sequence to indicate the case of the nearby word; ii) applying an additional decoder to the conventional NMT model to perform case prediction along with generating the translation. We test our approaches with multiple setups on three typical translation tasks (Zh-En, En-Fr, En-De). Experimental results show that our approaches outperform the baselines on case-sensitive BLEU. Specifically, adding a case token is easy to apply to any NMT model without modifying the network architecture but loses some accuracy, while applying a case prediction decoder offers more reliable results but increases the number of model parameters. In the future, we will apply our approaches to other natural language generation tasks such as dialogue generation, text generation and automatic speech recognition.