
Learning to Generate Move-by-Move Commentary for Chess Games from Large-Scale Social Forum Data

Harsh Jhamtani∗, Varun Gangal∗, Eduard Hovy, Graham Neubig, Taylor Berg-Kirkpatrick
Language Technologies Institute
Carnegie Mellon University
{jharsh,vgangal,hovy,gneubig,tberg}@cs.cmu.edu

Abstract

This paper examines the problem of generating natural language descriptions of chess games. We introduce a new large-scale chess commentary dataset and propose methods to generate commentary for individual moves in a chess game. The introduced dataset consists of more than 298K chess move-commentary pairs across 11K chess games. We highlight how this task poses unique research challenges in natural language generation: the data contain a large variety of styles of commentary and frequently depend on pragmatic context. We benchmark various baselines and propose an end-to-end trainable neural model which takes into account multiple pragmatic aspects of the game state that may be commented upon to describe a given chess move. Through a human study on predictions for a subset of the data which deals with direct move descriptions, we observe that outputs from our models are rated similar to ground truth commentary texts in terms of correctness and fluency.1

* HJ and VG contributed equally to this paper.
1 We will make the code-base (including data collection and processing) publicly available at https://github.com/harsh19/ChessCommentaryGeneration

1 Introduction
A variety of work in NLP has sought to produce fluent natural language descriptions conditioned on a contextual grounding. For example, several lines of work explore methods for describing images of scenes and videos (Karpathy and Fei-Fei, 2015), while others have conditioned on structured sources like Wikipedia infoboxes (Lebret et al., 2016). In most cases, progress has been driven by the availability of large training corpora that pair natural language with examples from the grounding (Lin et al., 2014). One line of work has investigated methods for producing and interpreting language in the context of a game, a space that has rich pragmatic structure, but where training data has been hard to come by. In this paper, we introduce a new large-scale resource for learning to correlate natural language with individual moves in the game of chess. We collect a dataset of more than 298K chess move/commentary pairs across ≈11K chess games from online chess forums. To the best of our knowledge, this is the first such dataset of this scale for a game commentary generation task. We provide an analysis of the dataset and highlight the large variety in commentary texts by categorizing them into six different aspects of the game that they respectively discuss.

Figure 1: Move commentary generated from our method (Game-aware neural commentary generation, GAC) and some baseline methods for a sample move.

Automated game commentary generation can be a useful learning aid. Novices and experts alike can learn more about the game by hearing explanations of the motivations behind moves, or of their quality. In fact, on sites for game aficionados, such commentaries are standard features, speaking to their interestingness and utility as complements to concrete descriptions of the game boards themselves.

Game commentary generation poses a number of interesting challenges for existing approaches to language generation. First, modeling human commentary is challenging because human commentators rely both on their prior knowledge of game rules and on their knowledge of effective strategy when interpreting and referring to the game state. Second, there are multiple aspects of the game state that can be talked about for a given move – the commentator's choice depends on the pragmatic context of the game. For example, for the move shown in Figure 1, one can comment simply that the pawn was moved, or one may comment on how the check was blocked by that move. Both descriptions are true, but the latter is more salient given the player's goal. However, sometimes none of the aspects may stand out as most salient, and the most salient aspect may even change from commentator to commentator. Moreover, a human commentator may introduce variation in the aspects he or she chooses to talk about, in order to reduce monotony in the commentary. This makes the dataset a useful testbed not only for NLG but also for related work on modeling pragmatics in language (Liu et al., 2016).

Prior work has explored game commentary generation. Liao and Chang (1990) and Sadikov et al. (2006) have explored chess commentary generation, but for lack of large-scale training data their methods have been mainly rule-based. Kameko et al. (2015) have explored commentary generation for the game of Shogi, proposing a two-step process where salient terms are generated from the game state and then composed in a language model. In contrast, given the larger amount of training data available to us, our proposed model uses an end-to-end trainable neural architecture to predict commentaries given the game state. Our model conditions on semantic and pragmatic information about the current state and explicitly learns to compose, conjoin, and select these features in a recurrent decoder module. We perform an experimental evaluation comparing against baselines and variants of our model that ablate various aspects of our proposed architecture. Outputs on the 'Move Description' subset of the data from our final model were judged by humans to be as good as human-written ground truth commentaries on measures of fluency and correctness.
2 Chess Commentary Dataset
In this section we introduce our new large-scale Chess Commentary dataset, share some statistics about the data, and discuss the variety in types of commentaries. The data is collected from the online chess discussion forum gameknot.com, which features multiple games self-annotated with move-by-move commentary.

The dataset consists of 298K aligned game-move/commentary pairs. Some commentaries are written for a sequence of moves (Figure 2) while others correspond to a single move. For the purpose of initial analysis and modeling, we limit ourselves to only those data points where the commentary text corresponds to a single move. Additionally, we split multi-sentence commentary texts to create multiple data points with the same chess board and move inputs.

Figure 2: A multi-move, single-commentary example from our data. Here, the sequence of moves Ba4 → b5 → Nd6 → bxa4 → e5 is commented upon.

Statistic | Value
Total Games | 11,578
Total Moves | 298,008
Average no. of recorded steps in a game | 25.73
Frequent Word Types | 39,424
Rare Word Types | 167,321
Word Tokens | 6,125,921
Unigram Entropy | 6.88
Average Comment Length (in #words) | 20.55
Long Comments (#words) | 230,745 (77%)

Table 1: Dataset and Vocabulary Statistics

What are commentaries about? We observe that there is a large variety in the commentary texts. To analyze this variety, we consider labelling the commentary texts in the data with a predefined set of categories. The choice of these categories is made based on a manual inspection of a sub-sample of the data. We consider the following set of commentary categories (also shown in Table 2):

• Direct move description (MoveDesc3): Explicitly or implicitly describes the current move.

• Quality of move (Quality4): Describes the quality of the current move.

• Comparative: Compares multiple possible moves.

• Move Rationale or Planning (Planning): Describes the rationale for the current move, in terms of future gameplay, advantage over other potential moves, etc.

• Contextual game information: Describes not the current move alone, but the overall game state – such as the possibility of a win/loss, overall aggression/defence, etc.

• General information: General idioms and advice about chess, information about players/tournament, emotional remarks, retorts, etc.

3 MoveDesc & 'Move Description' used interchangeably.
4 Quality & 'Move Quality' used interchangeably.

Category | Example | % in data | Val acc.
Direct Move Description | An attack on the queen | 31.4% | 71%
Move Quality | A rook blunder. | 8.0% | 90%
Comparative | At this stage I figured I better move my knight. | 3.7% | 77.7%
Planning / Rationale | Trying to force a way to eliminate d5 and prevent Bb5. | 31.2% | 65%
Contextual Game Info | Somehow, the game I should have lost turned around in my favor. | 12.6% | 87%
General Comment | Protect Calvin, Hobbs | 29.9% | 78%

Table 2: Commentary texts have a large variety, making the problem of content selection an important challenge in our dataset. We classify the commentaries into 6 different categories using a classifier trained on some hand-labelled data, a fraction of which is kept for validation. '% in data' refers to the percentage of commentary sentences in the tagged data belonging to the respective category.

The examples in Table 2 illustrate these classes. Note that the commentary texts are not necessarily limited to one tag, though that is true for most of the data. A total of 1K comments are annotated by two annotators. An SVM classifier (Pedregosa et al., 2011a) is trained for each comment class, considering the annotation as ground truth and using word unigrams as features. This classifier is then used to predict tags for the train, validation and test sets. For the 'Comparative' category, we found that a classifier with manually defined rules, such as the presence of the word "better", performs better than the SVM, perhaps due to the paucity of the data, and thus we use this instead. As can be observed in Table 2, the classifiers used are able to generalize well on the held-out data.
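For concreteness, the tagging pipeline described above can be sketched as follows. This is an illustrative reconstruction, with `annotated_comments` and `labels` standing in for the 1K hand-annotated comments and their per-category binary tags:

```python
# Illustrative sketch of the comment tagger: one binary SVM per category,
# word-unigram features, plus a hand-written rule for "Comparative".
# annotated_comments and labels are placeholders for the annotated data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

vectorizer = CountVectorizer(ngram_range=(1, 1))   # word unigram features
X = vectorizer.fit_transform(annotated_comments)

taggers = {}
for category in ["MoveDesc", "Quality", "Planning", "Contextual", "General"]:
    clf = LinearSVC()
    clf.fit(X, labels[category])   # binary ground-truth tags per category
    taggers[category] = clf

def tag(comment):
    """Predict category tags for an unlabeled comment."""
    x = vectorizer.transform([comment])
    tags = [c for c, clf in taggers.items() if clf.predict(x)[0] == 1]
    if "better" in comment.lower():   # manually defined "Comparative" rule
        tags.append("Comparative")
    return tags
```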
3 Game Aware Neural Commentary Generation (GAC)

Our dataset D consists of data points of the form (Si, Mi, Gi), i ∈ {1, 2, ..., |D|}, where Si is the commentary text for move Mi and Gi is the corresponding chess game. Si is a sequence of m tokens Si1, Si2, ..., Sim. We want to model P(Si | Mi, Gi). For simplicity, we use only current board (Ci) and previous board (Ri) information from the game, so that P(Si | Mi, Gi) = P(Si | Mi, Ci, Ri).

We model this using an end-to-end trainable neural model, which models conjunctions of features using feature encoders. Our model employs a selection mechanism to select the salient features for a given chess move. Finally, an LSTM recurrent neural network (Hochreiter and Schmidhuber, 1997) is used to generate the commentary text based on the selected features from the encoder.

3.1 Incorporating Domain Knowledge

Past work shows that acquiring domain knowledge is critical for NLG systems (Reiter et al., 2003b; Mahamood and Reiter, 2012). Commentary texts cover a range of perspectives, including criticism or goodness of the current move, possible alternate moves, quality of alternate moves, etc. To be able to make such comments, the model must learn about the quality of moves, as well as the set of valid moves for a given chess board state. We consider the following features to provide our model with the necessary information to generate commentary texts (Figure 3):

Move features fmove(Mi, Ci, Ri) encode the current move information, such as which piece moved, the position of the moved piece before and after the move was made, the type and position of the captured piece (if any), whether the current move is castling or not, and whether there was a check or not.
Figure 3: Some features extracted using the chess board states before (left) and after (right) a chess move. Our method uses various semantic and pragmatic features of the move, including the location and type of the piece being moved, which opposing pieces attack the moved piece before as well as after the move, the change in the Stockfish UCI engine score, etc.

Threat features fthreat(Mi, Ci, Ri) encode information about the pieces of the opposing player attacking the moved piece before and after the move, and the pieces of the opposing player being attacked by the piece being moved. To extract this information, we use the python-chess library.5

Score features fscore(Mi, Ci, Ri) capture the quality of the move and the general progress of the game. This is done using the game evaluation score before and after the move, and the average rank of the pawns of both players. We use the Stockfish evaluation engine to obtain the game evaluation scores.6

5 https://pypi.org/project/python-chess/
6 https://stockfishchess.org/about/
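As an illustration of this feature extraction, the sketch below uses the current python-chess and Stockfish interfaces (our implementation predates these exact APIs); the feature names, the search depth and the engine path are placeholders of our own:

```python
# Illustrative extraction of Move/Threat/Score features with python-chess
# and Stockfish. Feature names, depth and engine path are assumptions; the
# exact feature set used in the paper may differ.
import chess
import chess.engine

def move_and_threat_features(board, move):
    piece = board.piece_at(move.from_square)
    captured = board.piece_at(move.to_square)      # ignores en passant for brevity
    after = board.copy()
    after.push(move)
    feats = {
        "piece": piece.symbol(),                       # which piece moved
        "from": chess.square_name(move.from_square),   # position before the move
        "to": chess.square_name(move.to_square),       # position after the move
        "captured": captured.symbol() if captured else None,
        "is_castling": board.is_castling(move),
        "gives_check": board.gives_check(move),
        # opposing pieces attacking the moved piece, before and after the move
        "attacked_by_before": [board.piece_at(s).symbol()
                               for s in board.attackers(not piece.color, move.from_square)],
        "attacked_by_after": [after.piece_at(s).symbol()
                              for s in after.attackers(not piece.color, move.to_square)],
    }
    return feats, after

def score_features(board, after, engine):
    # game evaluation score before and after the move, from Stockfish
    limit = chess.engine.Limit(depth=10)               # depth is an assumption
    s_before = engine.analyse(board, limit)["score"].white().score(mate_score=10000)
    s_after = engine.analyse(after, limit)["score"].white().score(mate_score=10000)
    return {"score_before": s_before, "score_after": s_after}

engine = chess.engine.SimpleEngine.popen_uci("/path/to/stockfish")  # placeholder path
```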
3.2 Feature Representation

In our simplest conditioned language generation model, GAC-sparse, we represent the above described features using sparse representations through binary-valued features: gsparse(Mi, Ci, Ri) = SparseRep(fmove, fthreat, fscore).

For our full GAC model we consider representing features through embeddings. This has the advantage of allowing for a shared embedding space, which is pertinent for our problem since attribute values can be shared, e.g. the same piece type can occur as the moved piece as well as the captured piece. For categorical features, such as those indicating which piece was moved, we directly look up the embedding using the corresponding token. For real-valued features, such as game scores, we first bin them and then use the corresponding bin number for the embedding lookup. Let E represent the embedding matrix. Then E[fmove(j)] represents the embedding of the j-th move feature, and in general E[fmove] represents the concatenated embeddings of all move features. Similarly, E(fmove, fthreat, fscore) represents the concatenated embeddings of all the features.

3.3 Feature Conjunctions

We conjecture that explicitly modeling feature conjunctions might improve performance. We therefore need an encoder which can handle input sets of features of variable length (features such as the pieces attacking the moved piece can be of variable length). One way to handle this is to pick a canonical ordering of the features and consider a bidirectional LSTM encoder over the feature embeddings. As shown in Figure 4, this generates conjunctions of features:

g_enc = BiLSTM*(E(fmove, fthreat, fscore))

Here E() represents the embedding matrix as described earlier and BiLSTM* represents a sequential application of the BiLSTM function. Thus, if there are a total of m feature keys and the embedding dimension is d, then E(fmove, fthreat, fscore) is a matrix of size m × d. If the hidden size of the BiLSTM is x, then g_enc is of dimensionality m × x. We observe that different orderings gave similar performance. We also experimented with running k encoders, each on a different ordering of the features, and then letting the decoder access each of the k encodings. This did not yield any significant gain in performance.
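A minimal PyTorch sketch of this encoder, assuming the extracted feature values have already been mapped to integer ids in a shared vocabulary (with real-valued scores binned as described in Section 3.2); all sizes are illustrative:

```python
# Minimal sketch of the GAC feature encoder: a shared embedding table over
# all feature values, with a BiLSTM over a canonical ordering to form
# feature conjunctions. All dimensions are illustrative.
import torch.nn as nn

class FeatureConjunctionEncoder(nn.Module):
    def __init__(self, feat_vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        # shared space: e.g. "pawn" has one vector whether it appears as the
        # moved piece or as the captured piece
        self.embed = nn.Embedding(feat_vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, feature_ids):
        # feature_ids: (batch, m) integer ids of the m feature values in a
        # fixed canonical order; real-valued scores are pre-binned
        emb = self.embed(feature_ids)    # (batch, m, emb_dim)
        g_enc, _ = self.bilstm(emb)      # (batch, m, 2 * hidden_dim)
        return g_enc                     # one encoding g_i per feature key
```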
Figure 4: Model overview. We first extract various semantic and pragmatic features from the previous and current chess board states. We represent features through embeddings in a shared space. We observe that feeding in feature conjunctions helps a lot. We consider a selection mechanism for the model to choose salient attributes from the input at every decoder step.

The GAC model, unlike GAC-sparse, has some advantages: it uses a shared, continuous space to embed attribute values of different features, and it can perform arbitrary feature conjunctions before passing a representation to the decoder, thereby sharing the burden of learning the necessary feature conjunctions. Our experiments confirm this intuition: GAC produces commentaries with higher BLEU as well as more diversity compared to GAC-sparse.

3.4 Decoder

We use an LSTM decoder to generate the sentence given the chess move and the features g. At every output step t, the LSTM decoder predicts a distribution over vocabulary words taking into account the current hidden state ht, the input token it, and an additional selection vector ct. For GAC-sparse, the selection vector is simply an affine transformation of the features g. For the GAC model, the selection vector is derived via a selection mechanism.

ot, ht_dec = LSTM(h(t-1)_dec, concat(Edec(it), ct))
pt = softmax(Wo · concat(ot, ct) + bs)

where pt represents the probability distribution over the vocabulary, Edec() represents the decoder word embedding matrix, and the elements of the Wo matrix are trainable parameters.

Selection/Attention Mechanism: As there are different salient attributes across different chess moves, we also equip the GAC model with a mechanism to select and identify these attributes. We first transform ht_dec by multiplying it with a trainable matrix Wc, and then take the dot product of the result with each gi_enc:

at(i) = dot(Wc · ht_dec, gi_enc)
αt = softmax(at)
ct = Σ_{i=1}^{|g|} αt(i) · gi_enc

We use cross-entropy loss over the decoding outputs to train the model.
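The decoder step just described might be sketched in PyTorch as follows; this is a simplified reconstruction (the LSTM cell's hidden state plays the role of ot), with illustrative sizes:

```python
# Simplified sketch of one GAC decoder step with the selection mechanism.
# The LSTMCell hidden state plays the role of o_t; sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectionDecoderStep(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, dec_dim=256, enc_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)       # E_dec
        self.cell = nn.LSTMCell(emb_dim + enc_dim, dec_dim)
        self.W_c = nn.Linear(dec_dim, enc_dim, bias=False)   # W_c
        self.W_o = nn.Linear(dec_dim + enc_dim, vocab_size)  # W_o and b_s

    def forward(self, token, h, c, ctx, g_enc):
        # token: (batch,) previous word id; g_enc: (batch, m, enc_dim)
        h, c = self.cell(torch.cat([self.embed(token), ctx], dim=-1), (h, c))
        a = torch.bmm(g_enc, self.W_c(h).unsqueeze(-1)).squeeze(-1)   # a_t
        alpha = F.softmax(a, dim=-1)                                  # alpha_t
        ctx = torch.bmm(alpha.unsqueeze(1), g_enc).squeeze(1)         # c_t
        logits = self.W_o(torch.cat([h, ctx], dim=-1))                # pre-softmax p_t
        return logits, h, c, ctx
```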
4 Experiments

We split each of the data subsets in a 70:10:20 ratio into train, validation and test sets. All our models are implemented in PyTorch version 0.3.1 (Paszke et al., 2017). We use the ADAM optimizer (Kingma and Ba, 2014) with its default parameters and a mini-batch size of 32. Validation set perplexity is used for early stopping. At test time, we use greedy search to generate the model output. We observed that beam decoding does not lead to any significant improvement in terms of validation BLEU score.
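A minimal sketch of this training configuration, with `train_one_epoch` and `evaluate_nll` as assumed helpers and a patience value of our own choosing (the paper does not specify one):

```python
# Sketch of the training configuration above: Adam with default settings,
# batch size 32, early stopping on validation perplexity. The helpers
# train_one_epoch / evaluate_nll and the patience value are assumptions.
import math
import torch

optimizer = torch.optim.Adam(model.parameters())   # defaults: lr=1e-3, betas=(0.9, 0.999)

best_ppl, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(50):
    train_one_epoch(model, train_loader, optimizer, batch_size=32)
    val_ppl = math.exp(evaluate_nll(model, val_loader))   # mean per-token NLL
    if val_ppl < best_ppl:
        best_ppl, bad_epochs = val_ppl, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break   # early stopping on validation perplexity
```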
We report BLEU (Papineni et al., 2002) and BLEU-2 (Vedantam et al., 2015) scores to measure the performance of the models. Additionally, we consider a measure to quantify the diversity in the generated outputs. Finally, we also conduct a human evaluation study. In the remainder of this section, we discuss baselines along with various experiments and results.

4.1 Baselines

In this subsection we discuss the various baseline methods.

Manually-defined template (TEMP): We devise manually defined templates (Reiter, 1995) for the 'Move Description' and 'Move Quality' categories. Note that template-based outputs tend to be repetitive as they lack diversity, drawing from a small, fixed vocabulary and using a largely static sentence structure. We define templates for a fixed set of cases which cover our data (for exact template specifications, refer to Appendix B).

Nearest Neighbor (NN): We observe that the same move on similar board states often leads to similar commentary texts. To construct a simple baseline, we find the most similar move N_MCR from among the training data points for a given previous board state R, current board state C and move M. The commentary text corresponding to N_MCR is selected as the output. Thus, we need a scoring function to find the closest matching data point in the training set. We use the Move, Threat and Score features to compute similarity. Using a sparse representation, we consider a total of 148 Move features, 18 Threat features, and 19 Score features. We use sklearn's (Pedregosa et al., 2011b) NearestNeighbors module to find the closest matching game move.

Raw Board Information Only (RAW): The RAW baseline ablates our pragmatic feature functions to assess their importance. This architecture is similar to GAC, except that instead of our custom features A(f(Ri, Ci)), the encoder encodes the raw board information of the current and previous board states:

A_RAW(Ri, Ci) = [Lin(Ri), Lin(Ci)]

Lin() for a board denotes its representation in row-linear fashion. Each element of Lin() is a piece name (e.g. pawn) denoting the piece at that square, with special symbols for empty squares.
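Both baselines are straightforward to implement. Below is a sketch of the NN retrieval with sklearn and of one plausible Lin() linearization with python-chess; `train_vectors` and `train_comments` are placeholders for the precomputed 185-dimensional feature matrix and its aligned commentary texts, and piece symbols stand in for piece names:

```python
# Sketch of the NN baseline (sparse 148 + 18 + 19 = 185-dimensional feature
# vectors) and of the Lin() board linearization used by RAW. The names
# train_vectors / train_comments are illustrative placeholders.
import chess
from sklearn.neighbors import NearestNeighbors

nn_index = NearestNeighbors(n_neighbors=1).fit(train_vectors)

def nn_commentary(query_vector):
    _, idx = nn_index.kneighbors(query_vector.reshape(1, -1))
    return train_comments[idx[0][0]]   # copy the closest move's commentary

def lin(board):
    # row-linear, 64-token reading of the board (a1, b1, ..., h8), with a
    # special symbol for empty squares
    return [board.piece_at(sq).symbol() if board.piece_at(sq) else "<empty>"
            for sq in chess.SQUARES]

# RAW encoder input: previous board followed by current board
# a_raw = lin(previous_board) + lin(current_board)
```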
Dataset | Features | BLEU | BLEU-2 | Diversity
MoveDesc | TEMP | 0.72 | 20.77 | 4.43
MoveDesc | NN (M+T+S) | 1.28 | 21.07 | 7.85
MoveDesc | RAW | 1.13 | 13.74 | 2.37
MoveDesc | GAC-sparse | 1.76 | 21.49 | 4.29
MoveDesc | GAC (M+T) | 1.85 | 23.35 | 4.72
Quality | TEMP | 16.17 | 47.29 | 1.16
Quality | NN (M+T) | 5.98 | 42.97 | 4.52
Quality | RAW | 16.92 | 47.72 | 1.07
Quality | GAC-sparse | 14.98 | 51.46 | 2.63
Quality | GAC (M+T+S) | 16.94 | 47.65 | 1.01
Comparative | NN (M) | 1.28 | 24.49 | 6.97
Comparative | RAW | 2.80 | 23.26 | 3.03
Comparative | GAC-sparse | 3.58 | 25.28 | 2.18
Comparative | GAC (M+T) | 3.51 | 29.48 | 3.64

Table 3: Performance of baselines and our model with different subsets of features as per various quantitative measures (S = Score, M = Move, T = Threat features). On all data subsets, our model outperforms the TEMP and NN baselines. Among the proposed models, GAC performs better than GAC-sparse and RAW in general. For the NN, GAC-sparse and GAC methods, we experiment with multiple feature combinations and report only the best as per BLEU scores.

4.2 Comment Category Models

As shown earlier, we categorize comments into six different categories. Among these, in this paper we consider only the first three, as the amount of variance in the last three categories indicates that it would be extremely difficult for a model to learn to reproduce them accurately. The number of data points, as tagged by the trained classifiers, in the subsets 'Move Description', 'Move Quality' and 'Comparative' is 28,228, 793 and 5,397 respectively. We consider separate commentary generation models for each of the three categories. Each model is tuned separately on the corresponding validation set. Table 3 shows the BLEU and BLEU-2 scores for the proposed model under different subsets of features. Overall BLEU scores are low, likely due to the inherent variance in the language generation task (Novikova et al., 2017), although a cursory examination of outputs for randomly selected test data points indicated that they were reasonable. Figure 5 illustrates commentaries generated by our models through an example (a larger list of qualitative examples can be found in Appendix C).

Which features are useful? In general, adding Threat features improves performance, though the same is not always true for Score features. The Quality subset has higher BLEU scores than the other datasets due to its smaller vocabulary and lower variation in commentary. As can be observed in Table 4, Threat features are useful for both the 'Move Quality' and 'Move Description' subsets of the data. Adding Score features helps for the 'Move Quality' subset. This intuitively makes sense, since Score features directly encode proxies for move quality as per a chess evaluation engine.
Figure 5: Outputs from various models on a test example from the MoveDesc subset.

Dataset | Features | BLEU | BLEU-2 | Diversity
MoveDesc | GAC (M) | 1.41 | 19.06 | 4.32
MoveDesc | GAC (M+T) | 1.85 | 23.35 | 4.72
MoveDesc | GAC (M+T+S) | 1.64 | 22.82 | 4.29
Quality | GAC (M) | 13.05 | 48.37 | 1.61
Quality | GAC (M+T) | 14.22 | 49.57 | 1.54
Quality | GAC (M+T+S) | 14.44 | 51.79 | 1.48
Comparative | GAC (M) | 3.10 | 19.84 | 2.88
Comparative | GAC (M+T) | 3.51 | 29.48 | 3.64
Comparative | GAC (M+T+S) | 1.15 | 25.44 | 3.14

Table 4: Performance of the GAC model with different feature sets (S = Score, M = Move, T = Threat features). Different subsets of features work best for different data subsets. For instance, Score features seem to help only in the Quality category. Note that the results for Quality are from 5-fold cross-validation, since the number of data points in that category is much smaller than in the other two.

Dataset | Features | BLEU | BLEU-2 | Diversity
All | COMB (M) | 2.07 | 20.13 | 4.50
All | COMB (M+T) | 2.43 | 25.37 | 4.88
All | COMB (M+T+S) | 1.83 | 28.86 | 4.33
All | GAC-all (M) | 1.69 | 20.66 | 4.67
All | GAC-all (M+T) | 1.94 | 24.11 | 5.16
All | GAC-all (M+T+S) | 2.02 | 24.70 | 4.97
All | CAT (M) | 1.90 | 19.96 | 3.82

Table 5: The COMB approaches show the combined performance of separately trained models on the respective test subsets.

4.3 A Single Model For All Categories

In this experiment, we merge the training and validation data of the first three categories and tune a single model on this merged data. We then compare its performance on all test sentences in our data. COMB denotes using the best GAC model for a test example based on its original class (e.g. Desc) and computing the BLEU of the sentences so generated against the ground truth. GAC-all represents the GAC model learnt on the merged training data.

As can be seen from Table 5, this does not lead to any performance improvements. We investigate this issue further by analyzing whether the board states are predictive of the category type or not. To this end, we construct a multi-class classifier using all the Move, Threat and Score features to predict the three categories under consideration. However, we observe an accuracy of around 33.4%, which is very close to the performance of a random prediction model. This partially explains why a single model did not fare better even though it had the opportunity to learn from a larger dataset.

Category-aware model (CAT): We observed above that with the considered features, it is not possible to predict the type of comment to be made, and the GAC-all model results are better than the COMB results. Hence, we extend the GAC-all model to explicitly provide it with information about the comment category. We achieve this by adding a one-hot representation of the comment's category to the input of the RNN decoder at every time step. As can be seen in Table 5, CAT(M) performs better than GAC-all(M) in terms of BLEU-4, while performing slightly worse on BLEU-2. This demonstrates that explicitly providing information about the comment category can help the model.

4.4 Diversity In Generated Commentaries

Humans use some variety in their choice of words and sentence structure. As such, outputs from rule-based templates, which demonstrate low variety, may seem repetitive and boring. To capture this quantitatively, and to demonstrate the variety in texts from our method, we calculate the entropy (Shannon, 1951) of the distributions of word unigrams, bigrams and trigrams in the predicted outputs, and report the geometric mean of these values. Using only a small set of words in similar counts will lead to lower entropy and is undesirable. As can be observed from Table 3, the template baseline performs worse on this measure compared to our methods for the 'MoveDesc' subset of the data.
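The diversity measure can be sketched as follows; the paper does not state the logarithm base, so base-2 (bits) is assumed here:

```python
# Sketch of the diversity measure: geometric mean of the entropies of the
# unigram, bigram and trigram distributions over all generated outputs
# (base-2 entropy is our assumption).
import math
from collections import Counter

def ngram_entropy(token_lists, n):
    counts = Counter(tuple(toks[i:i + n])
                     for toks in token_lists
                     for i in range(len(toks) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def diversity(generated_texts):
    token_lists = [text.split() for text in generated_texts]
    return math.prod(ngram_entropy(token_lists, n) for n in (1, 2, 3)) ** (1 / 3)
```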
4.5 Human Evaluation Study

As discussed in the qualitative examples above, we often found the outputs to be good, though BLEU scores are low. BLEU is known to correlate poorly (Reiter and Belz, 2009; Wiseman et al., 2017; Novikova et al., 2017) with human relevance scores for NLG tasks. Hence, we conduct a human evaluation study for the best two neural methods (GAC, GAC-sparse) and the best two non-neural methods (TEMP, NN).

Setup: Annotators are shown a chess move through snapshots of the previous board and the resulting board, along with information on which piece moved (a snapshot of a HIT7 is provided in Appendix D). With this context, they were shown a text commentary based on this move and were asked to judge the commentary via three questions, shortened versions of which can be seen in the first column of Table 6.

We randomly select 100 data points from the test split of the 'Move Description' category and collect the predictions from each of the methods under consideration. We hired two Anglophone annotators (lifetime HIT acceptance > 80%) for every human-evaluated test example. We additionally assess the chess proficiency of the annotators using questions from the Chess Q&A dataset of Cirik et al. (2015): within each HIT, we ask two randomly selected questions from that dataset. Finally, we consider only those HITs wherein the annotator was able to answer the proficiency questions correctly.

7 Human Intelligence Task
Results: We conducted a human evaluation study for the MoveDesc subset of the data. As can be observed from Table 6, outputs from our method attain slightly more favorable scores compared to the ground truth commentaries. This shows that the predicted outputs from our model are not worse than the ground truth on the said measures. This is in spite of the fact that the BLEU-4 score of the predicted outputs is only ∼2 w.r.t. the ground truth outputs. One reason for the slightly lower performance of the ground truth outputs on these measures is that some of the human-written commentaries are either very ungrammatical or too concise. A more surprising observation is that around 30% of the human-written ground truth outputs were also marked as not valid for the given board move. On inspection, it seems that commentary often contains extraneous game information beyond that of the move alone, which indicates that an ideal comparison should be over commentary for an entire game, although this is beyond the scope of the current work.

Question | GT | GAC(M) | GAC(M+T) | GAC(M+T+S) | GAC-sparse | TEMP | NN
Is the commentary correct for the given move? (%Yes) | 70.4 | 42.3 | 64.8 | 67.6 | 56.3 | 91.5 | 52.1
Can the move be inferred from the commentary? (%Yes) | 45.1 | 25.3 | 42.3 | 36.7 | 40.8 | 92.9 | 42.3
Fluency (scale of 1 (least) to 5 (most)), mean (std. dev.) | 4.03 (1.31) | 4.15 (1.20) | 4.44 (1.02) | 4.54 (0.89) | 4.15 (1.26) | 4.69 (0.64) | 3.72 (1.36)

Table 6: Human study results on the MoveDesc data category. Outputs from GAC are in general better than ground truth, NN and GAC-sparse. TEMP outperforms other methods, though as shown earlier, outputs from TEMP lack diversity.

The inter-annotator agreement for our experiments (Cohen's κ (Cohen, 1968)) is 0.45 for Q1 and 0.32 for Q2. We notice some variation in κ coefficients across the different systems. While TEMP and GAC responses had coefficients in the 0.5–0.7 range, the responses for CLM had a much lower coefficient. In our setup, each HIT consists of 7 comments, one from each system. For Q3 (fluency), which is on an ordinal scale, we measure the rank-order consistency between the responses of the two annotators of a HIT. The mean Kendall τ (Kendall, 1938) across all HITs was found to be 0.39.

To measure the significance of the results, we perform bootstrap tests on 1000 subsets of size 50 with a significance threshold of p = 0.05 for each pair of systems. For Q1, we observe that the GAC(M), GAC(M+T) and GAC(M+T+S) methods are significantly better than the baselines NN and GAC-sparse. We find that neither of GAC(M+T) and GT significantly outperforms the other on Q1 or Q2. But we do find that GAC(M+T) does better than GAC(M) on both Q1 and Q2. For fluency scores, we find that GAC(M+T) is more fluent than GT, NN, GAC-sparse and GAC(M). Neither of GAC(M) and GAC(M+T+S) is significantly more fluent than the other.
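These agreement statistics are standard and can be computed as sketched below; the variable names are placeholders for the collected annotator responses:

```python
# Sketch of the agreement statistics: Cohen's kappa for the yes/no questions
# and mean Kendall's tau over HITs for the ordinal fluency ratings.
# annotator1_q1 / annotator2_q1 / hits are illustrative placeholders.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kendalltau

kappa_q1 = cohen_kappa_score(annotator1_q1, annotator2_q1)   # parallel yes/no lists

taus = [kendalltau(a1_fluency, a2_fluency).correlation       # 7 ratings per HIT
        for a1_fluency, a2_fluency in hits]
mean_tau = sum(taus) / len(taus)
```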
5 Related Work

NLG research has a long history, with systems ranging from completely rule-based to learning-based ones (Reiter et al., 2005, 2003a), which have had both practical successes (Reiter et al., 2005) and failures (Reiter et al., 2003a). Recently, there have been numerous works which propose text generation given structured records, biographies (Lebret et al., 2016), recipes (Yang et al., 2016; Kiddon et al., 2016), etc. A key difference in generation given a game state, compared to these inputs, is that the game state is an evolving description at a point in a process, as opposed to recipes (which are independent of each other), records (which are static) and biographies (which are one per person, and again independent). Moreover, our proposed method effectively uses various types of semantic and pragmatic information about the game state.

In this paper we have introduced a new large-scale dataset for game commentary generation. The commentaries cover a variety of aspects like move description, quality of move, and alternative moves. This leads to a content selection challenge, similar to that noted in Wiseman et al. (2017). Unlike Wiseman et al. (2017), however, our focus is on generating commentary for individual moves in a game, as opposed to game summaries from aggregate statistics as in their task.

One of the first NLG datasets was the SUMTIME-METEO corpus (Reiter et al., 2005), with ≈500 record-text pairs for technical weather forecast generation. Liang et al. (2009) worked on common weather forecast generation using the WEATHERGOV dataset, which has ≈10K record-text pairs. A criticism of the WEATHERGOV dataset (Reiter, 2017) is that the weather records themselves may have been produced using templates and rules, with optional human post-editing. There have been prior works on generating commentary for ROBOCUP matches (Chen and Mooney, 2008; Mei et al., 2015). The ROBOCUP dataset, however, is collected from 4 games and contains about 1K events in total. Our dataset is two orders of magnitude larger than the ROBOCUP dataset, and we hope that it provides a promising setting for future NLG research.

6 Conclusions

In this paper, we curate a dataset for the task of chess commentary generation and propose methods to perform generation on this dataset. Our proposed method effectively utilizes information related to the rules and pragmatics of the game. A human evaluation study judges outputs from the proposed methods to be as good as human-written commentary texts for the 'Move Description' subset of the data.

Our dataset also contains multi-move/single-commentary pairs in addition to single-move/single-commentary pairs. Generating commentary for such multi-move sequences is a potential direction for future work. We anticipate this task to require an even deeper understanding of the game pragmatics than the single-move/single-commentary case.

Recent work (Silver et al., 2016) has proposed reinforcement learning based game-playing agents which learn to play board games from scratch, learning end-to-end from both recorded games and self-play. An interesting point to explore is whether such pragmatically trained game state representations can be leveraged for the task of game commentary generation.

Acknowledgements

We thank Volkan Cirik, Daniel Clothiaux, Hiroaki Hayashi and the anonymous reviewers for providing valuable comments and feedback.

References

David L Chen and Raymond J Mooney. 2008. Learning to sportscast: a test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning. ACM, pages 128–135.

Volkan Cirik, Louis-Philippe Morency, and Eduard Hovy. 2015. Chess Q&A: Question Answering on Chess Games. In Reasoning, Attention, Memory (RAM) Workshop, Neural Information Processing Systems.

Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70(4):213.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Hirotaka Kameko, Shinsuke Mori, and Yoshimasa Tsuruoka. 2015. Learning a game commentary generator with grounded move expressions. In Computational Intelligence and Games (CIG), 2015 IEEE Conference on. IEEE, pages 177–184.

Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 3128–3137.

Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika 30(1/2):81–93.

Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally Coherent Text Generation with Neural Checklist Models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 329–339.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771.

Percy Liang, Michael I Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics, pages 91–99.

Jen-Wen Liao and Jason S Chang. 1990. Computer Generation of Chinese Commentary on Othello Games. In Proceedings of Rocling III Computational Linguistics Conference III. pages 393–415.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, pages 740–755.

Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.

Saad Mahamood and Ehud Reiter. 2012. Working with clinicians to improve a patient-information NLG system. In Proceedings of the Seventh International Natural Language Generation Conference. Association for Computational Linguistics, pages 100–104.

Hongyuan Mei, Mohit Bansal, and Matthew R Walter. 2015. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. arXiv preprint arXiv:1509.00838.

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 2241–2252. https://www.aclweb.org/anthology/D17-1238.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 311–318.

Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. 2017. PyTorch.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011a. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12:2825–2830.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011b. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct):2825–2830.

Ehud Reiter. 1995. NLG vs. templates. arXiv preprint cmp-lg/9504013.

Ehud Reiter. 2017. You Need to Understand Your Corpora - the Weathergov Example. Blog post, https://ehudreiter.com/2017/05/09/weathergov/.

Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics 35(4):529–558.

Ehud Reiter, Roma Robertson, and Liesl M Osman. 2003a. Lessons from a failure: Generating tailored smoking cessation letters. Artificial Intelligence 144(1-2):41–58.

Ehud Reiter, Somayajulu Sripada, Jim Hunter, Jin Yu, and Ian Davy. 2005. Choosing words in computer-generated weather forecasts. Artificial Intelligence 167(1-2):137–169.

Ehud Reiter, Somayajulu G Sripada, and Roma Robertson. 2003b. Acquiring correct knowledge for natural language generation. Journal of Artificial Intelligence Research 18:491–516.

Aleksander Sadikov, Martin Moina, Matej Guid, Jana Krivec, and Ivan Bratko. 2006. Automated chess tutor. In International Conference on Computers and Games. Springer, pages 13–25.

Claude E Shannon. 1951. Prediction and entropy of printed English. Bell Labs Technical Journal 30(1):50–64.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 4566–4575.

Sam Wiseman, Stuart M Shieber, and Alexander M Rush. 2017. Challenges in Data-to-Document Generation. arXiv preprint arXiv:1707.08052.

Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2016. Reference-Aware Language Models. arXiv preprint arXiv:1611.01628.
