The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Story Realization: Expanding Plot Events into Sentences

Prithviraj Ammanabrolu, Ethan Tien, Wesley Cheung,


Zhaochen Luo, William Ma, Lara J. Martin, Mark O. Riedl
School of Interactive Computing
Georgia Institute of Technology
{raj.ammanabrolu, etien, wcheung8, zluo, wma61, ljmartin, riedl}@gatech.edu

Abstract

Neural network based approaches to automated story plot generation attempt to learn how to generate novel plots from a corpus of natural language plot summaries. Prior work has shown that a semantic abstraction of sentences called events improves neural plot generation and allows one to decompose the problem into: (1) the generation of a sequence of events (event-to-event) and (2) the transformation of these events into natural language sentences (event-to-sentence). However, typical neural language generation approaches to event-to-sentence can ignore the event details and produce grammatically-correct but semantically-unrelated sentences. We present an ensemble-based model that generates natural language guided by events. We provide results—including a human subjects study—for a full end-to-end automated story generation system showing that our method generates more coherent and plausible stories than baseline approaches.¹

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
¹ Code to reproduce our experiments is available at https://github.com/rajammanabrolu/StoryRealization

1 Introduction

Automated story plot generation is the problem of creating a sequence of main plot points for a story in a given domain. Generated plots must remain consistent across the entire story, preserve long-term dependencies, and make use of commonsense and schematic knowledge (Wiseman, Shieber, and Rush 2017). Early work focused on symbolic planning and case-based reasoning (Meehan 1977; Turner and Dyer 1986; Lebowitz 1987; Pérez y Pérez and Sharples 2001; Gervás et al. 2005; Porteous and Cavazza 2009; Riedl and Young 2010; Ware and Young 2011; Farrell, Ware, and Baker 2019) at the expense of manually engineering world domain knowledge.

In contrast, neural-based approaches to story and plot generation train a neural language model on a corpus of stories to predict the next character, word, or sentence in a sequence based on a history of tokens (Jain et al. 2017; Clark, Ji, and Smith 2018; Fan, Lewis, and Dauphin 2018; Martin et al. 2018; Peng et al. 2018; Roemmele 2018). The advantage of neural-based approaches is that there is no need for explicit domain modeling beyond providing a corpus of example stories. The primary pitfall of neural language model approaches for story generation is that the space of stories that can be generated is huge, which in turn implies that, in a textual story corpus, any given sentence will likely only be seen once.

Martin et al. (2018) propose the use of a semantic abstraction called an event, reducing the sparsity in a dataset that comes from an abundance of unique sentences. They define an event to be a unit of a story that creates a change in the story world's state. Technically, an event is a tuple containing a subject, verb, direct object, and some additional disambiguation token(s).

The event representation enables the decomposition of the plot generation task into two sub-problems: event-to-event and event-to-sentence. Event-to-event is broadly the problem of generating the sequence of events that together comprise a plot. Models used to address this problem are also responsible for maintaining plot coherence and consistency. Once new events are generated, however, they are still not human-readable. Thus the second sub-problem, event-to-sentence, focuses on transforming these events into natural language sentences.

Martin et al. (2017; 2018) further propose that this latter, event-to-sentence problem can be thought of as a translation task—translating from the language of events into natural language. We find, however, that the sequence-to-sequence LSTM networks (Sutskever, Vinyals, and Le 2014) that they used frequently ignore the input event and only generate text based on the original corpus, overwriting the plot-based decisions made during event-to-event. There are two contributing factors. Firstly, event-to-event models tend to produce previously-unseen events, which, when fed into the event-to-sentence model, result in unpredictable behavior. A basic sequence-to-sequence model is unable to learn how to map these unseen events to sentences. Secondly, sentences are often only seen once in the entire corpus. Despite the conversion into events, the sparsity of the data means that each event is still likely seen a limited number of times. For these reasons, we framed the event-to-sentence task as guided language generation, using a generated event as a guide.
The contributions of this paper are twofold. We present an ensemble-based system for the event-to-sentence problem that balances between retaining the event's original semantic meaning and being an interesting continuation of the story. We demonstrate that our system for guided language generation outperforms a baseline sequence-to-sequence approach. Additionally, we present the results of a full end-to-end story generation pipeline (Figure 1), showing how all of the sub-systems can be integrated.

Figure 1: The full automated story generation pipeline, illustrating an example where the event-to-event module generates only a single following event.

2 Related Work and Background

2.1 Story Generation via Machine Learning

Machine learning approaches to story and plot generation attempt to learn domain information from a corpus of story examples (Swanson and Gordon 2012; Li et al. 2013). Recent work has looked at using recurrent neural networks (RNNs) for story and plot generation. Roemmele and Gordon (2018) use LSTMs with skip-thought vector embeddings (Kiros et al. 2015) to generate stories. Khalifa, Barros, and Togelius (2017) train an RNN on a highly-specialized corpus, such as work from a single author. Fan, Lewis, and Dauphin (2018) introduce a form of hierarchical story generation in which a premise is first generated by the model and then transformed into a passage. This last example is a form of guided generation wherein a single sentence provides guidance. Similarly, Yao et al. (2019) decompose story generation into planning out a storyline and then generating a story from it. Our work differs in that we use the event-to-event process to provide guidance to event-to-sentence. Ammanabrolu et al. (2019) look at narrative generation as a form of quest generation in interactive fiction and use a knowledge graph to ground their generative models.

2.2 Event Representation and Generation

Martin et al. (2018) showed that the performance on both event-to-event and event-to-sentence problems improves when using an abstraction—known as an event—instead of natural language sentences. We use a variation of this event structure. In our work, events are defined as a 5-tuple ⟨s, v, o, p, m⟩, as opposed to the 4-tuples used in Martin et al. (2018). Here v is a verb, s is the subject of the verb, o is the object, p is the corresponding preposition, and m can be a modifier, prepositional object, or indirect object. Any of these elements can be ∅, denoting the absence of the element. All elements are stemmed and generalized with the exception of the preposition.

The generalization process involves finding the VerbNet (Schuler and Kipper-Schuler 2005) v3.3 class of the verb and finding the WordNet (Miller 1995) v3.1 Synset that is two levels higher in the hypernym tree for all of the nouns in the event. This process also includes the identification of named entities in the event tuple, extracting people, organizations, locations, etc. through named entity recognition (NER) and numbering them as the story goes on. For example, "PERSON" names are replaced by the tag <PERSON>n, where n indicates the n-th "PERSON" in the story. Similarly, the other NER categories are replaced with tags that indicate their category and their number within the story. This maintains consistency in named entities for a given story in the corpora.

We further process the corpus by "splitting" sentences akin to the "split-and-prune" methodology of Martin et al. (2018). This is done to decrease the number of events generated from a single sentence—reducing the number of mappings of a single sentence to multiple events. The splitting process starts with extracting the parse trees of each sentence using the Stanford Parser. Sentences are then split on S's (SBARs) and conjunctions before nested sentences. This process can result in incomplete sentences where the S-bar phrase is nested inside of a sentence, acting as the direct object. For example, a sentence like "She says that he is upset." becomes "She says. He is upset." Then the split sentences are sorted to reflect the original ordering of subjects or phrases as closely as possible.

For this paper, the event-to-event system is the policy gradient deep reinforcement learner from Tambwekar et al. (2019). This system has been tested to ensure that the resulting events are of high quality, to minimize error in that portion of the pipeline. Our event-to-sentence system is agnostic to the choice of the event-to-event system; all it requires is a sequence of events to turn into sentences. The event-to-event network is placed into the pipeline as the "Event2Event" module, seen in Figure 1, and its output is fed into the event-to-sentence models during testing.
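To make the eventification above concrete, the following is a minimal sketch assuming NLTK's WordNet and VerbNet interfaces. The two-level hypernym climb, the VerbNet class lookup, and the ∅ placeholder follow the description in Section 2.2; stemming and the named-entity numbering (<PERSON>0, <LOCATION>1, ...) are omitted, and the function names are ours rather than those of the released code.

```python
# Minimal sketch of eventification: nouns are generalized to the WordNet Synset
# two levels up the hypernym tree, and the verb is mapped to a VerbNet class.
# Assumes the NLTK WordNet and VerbNet corpora are installed.
from nltk.corpus import wordnet as wn
from nltk.corpus import verbnet

EMPTY = u"\u2205"  # the empty-slot symbol used in the paper


def generalize_noun(noun):
    """Return the Synset two levels up the hypernym tree of the noun's first sense."""
    synsets = wn.synsets(noun, pos=wn.NOUN)
    if not synsets:
        return noun
    synset = synsets[0]
    for _ in range(2):               # climb two levels, as described in the paper
        parents = synset.hypernyms()
        if not parents:
            break
        synset = parents[0]
    return synset.name()             # e.g. a name of the form 'event.n.01'


def generalize_verb(verb):
    """Return a VerbNet class id such as 'send-11.1' when one exists."""
    classes = verbnet.classids(lemma=verb)
    return classes[0] if classes else verb


def eventify(s=None, v=None, o=None, p=None, m=None):
    """Build the 5-tuple <s, v, o, p, m>; the preposition is kept verbatim."""
    return (generalize_noun(s) if s else EMPTY,
            generalize_verb(v) if v else EMPTY,
            generalize_noun(o) if o else EMPTY,
            p if p else EMPTY,
            generalize_noun(m) if m else EMPTY)

# Example: eventify(s="doctor", v="send", o="letter", p="through", m="door")
```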
3 Event-to-Sentence

We define event-to-sentence to be the problem of selecting a sequence of words s_t = s_t^0, s_t^1, ..., s_t^k—that form a sentence—given the current input event e_t, i.e. the current sentence is generated by maximizing Pr(s_t | e_t; θ), where θ refers to the parameters of the generative system. The eventification in Section 2.2 is a lossy process in which some of the information from the original sentence is dropped. Thus, the task of event-to-sentence involves filling in this missing information. There is also no guarantee that the event-to-event process will produce an event that is part of the event-to-sentence training corpus, simply due to the fact that the space of potentially-generated events is very large; the correct mapping from the generated event to a natural language sentence would be unknown.

In prior work, Martin et al. (2018) use a sequence-to-sequence LSTM neural network to translate events into sentences. We observe that "vanilla" sequence-to-sequence networks end up operating as simple language models, often ignoring the input event when generating a sentence. The generated sentence is usually grammatically correct but retains little of the semantic meaning given by the event.

We thus look for other forms of guided neural language generation, with the goals of preserving the semantic meaning from the event in addition to keeping the generated sentences interesting. We propose four different models—each optimized towards a different point on the spectrum between the two objectives—and a baseline fifth model that is used as a fallthrough. The task of each model is to translate events into "generalized" sentences, wherein nouns are replaced by WordNet Synsets. If a model does not pass a specific threshold (determined individually for each model), the system continues onto the next model in the ensemble. In order, the models are: (1) a retrieve-and-edit model based on Hashimoto et al. (2018); (2) template filling; (3) sequence-to-sequence with Monte Carlo beam decoding; (4) sequence-to-sequence with a finite state machine decoder; and (5) vanilla (beam-decoding) sequence-to-sequence. We find that none of these models by themselves can successfully find a balance between the goals of retaining all of the event tokens and generating interesting output. However, each of the models possesses its own strengths and weaknesses—each model is essentially optimized towards a different point on the spectrum between the two goals. We combine these models into an ensemble in an attempt to minimize the weaknesses of each individual model and to achieve a balance.

3.1 Retrieve-and-Edit

The first model is based on the retrieve-and-edit (RetEdit) framework for predicting structured outputs (Hashimoto et al. 2018). We first learn a task-specific similarity between event tuples by training an encoder-decoder to map each event onto an embedding that can reconstruct the output sentence; this is our retriever model. Next, we train an editor model which maximizes the likelihood of generating the target sentence given both the input event and a retrieved event-sentence example pair. We used a standard sequence-to-sequence model with attention and copying (Gu et al. 2016) to stand in as our editor architecture. Although this framework was initially applied to the generation of GitHub Python code and Hearthstone cards, we extend this technique to generate sentences from our event tuples. Specifically, we first initialize a new set of GLoVe word embeddings (Pennington, Socher, and Manning 2014), using random initialization for out-of-vocabulary words. We use our training set to learn weights for the retriever and editor models, set confidence thresholds for the model with the validation set, and evaluate performance using the test set.

In order to generate a sentence from a given input event, there are two key phases: a "retrieve" phase and an "edit" phase. With respect to the input event, we first retrieve the nearest-neighbor event and its corresponding sentence in the training set using the retriever model. Passing both the retrieved event-sentence pair and the input event as inputs, we use the editor model to generate a sentence using beam search.

Many of the successes produced by the model stem from its ability to retain the complex sentence structures that appear in our training corpus; it thus attempts to balance between maintaining coherence and being interesting. However, this interaction with the training data can also prove to be a major drawback of the method; target events that are distant in the embedding space from training examples typically result in poor sentence quality. Since RetEdit relies heavily on having good examples, we set the confidence of the retrieve-and-edit model to be proportional to 1 − retrieval distance when generating sentences, as a lower retrieval distance implies greater confidence. However, the mapping from event to sentence is not a one-to-one function. There are occasionally multiple sentences that map to a single event, resulting in a retrieval distance of 0, in which case the example sentence is returned without modifications.

3.2 Sentence Templating

As mentioned earlier, the baseline sequence-to-sequence network operates as a simple language model and can often ignore the input event when generating a sentence. However, we know that our inputs, event tuples, will have known parts of speech. We created a simplified grammar for the syntax of sentences generated from events:

S → NP v (NP) (PP)
NP → d n
PP → p NP

where d is a determiner that will be added and the rest of the terminal symbols correspond to an argument in the event, with n being s, o, or m, depending on its position in the sentence. The resulting sentence would be [_ s] {v [_ o] [p _ m]}, where blanks indicate where words should be added to make a complete sentence.

First, our algorithm predicts the most likely VerbNet frame based on the contents of the input event (how many and which arguments are filled). VerbNet provides a number of syntactic structures for different verb classes based on how the verb is being used. For example, if the input event contains 2 nouns and a verb without a preposition, we assume that the output sentence takes the form of [NP V NP], but if it has 2 nouns, a verb, and a preposition, then it should be [NP V PP].

Second, we apply a Bidirectional LSTM language model trained on the generalized sentences in our training corpus. Given a word, we can generate words before and after it, within a particular phrase as given by some of the rules above, and concatenate the generated sentence fragments together. Specifically, we use the AWD-LSTM (Merity, Keskar, and Socher 2018) architecture as our language model since it is currently state-of-the-art.
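The retrieve-then-edit control flow of Section 3.1, including the 1 − retrieval distance confidence, can be sketched as follows; the retriever and editor stand in for the trained models, and the function names are ours.

```python
# Schematic of the retrieve-and-edit flow: retrieve the nearest training pair,
# then edit it conditioned on the input event; confidence = 1 - retrieval distance.
def retedit_generate(event, retriever, editor):
    """retriever(event) -> (nearest_event, nearest_sentence, distance)."""
    nearest_event, nearest_sentence, distance = retriever(event)
    confidence = 1.0 - distance
    if distance == 0.0:
        # Multiple training sentences can map to one event; when the retrieval
        # distance is zero the retrieved example is returned unmodified.
        return nearest_sentence, confidence
    sentence = editor(event, nearest_event, nearest_sentence)  # beam-search decoding
    return sentence, confidence
```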

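As a concrete illustration of the frame-selection heuristic and the blank-containing skeleton described above, here is a minimal sketch; the AWD-LSTM in-filling and the top-k sampling with grammar checks that follow are omitted, and the frame labels are illustrative rather than the full VerbNet frame inventory.

```python
# Sketch of frame selection and template layout from Section 3.2.
EMPTY = u"\u2205"


def pick_frame(event):
    """Choose a syntactic frame based on which event slots are filled."""
    s, v, o, p, m = event
    has = lambda x: x != EMPTY
    if has(p) and (has(o) or has(m)):
        return "[NP V NP PP]" if has(o) else "[NP V PP]"
    if has(o):
        return "[NP V NP]"
    return "[NP V]"


def template(event):
    """Lay out the blank-containing skeleton '[_ s] {v [_ o] [p _ m]}'."""
    s, v, o, p, m = event
    parts = ["_ " + s, v]                 # determiner blank before the subject noun
    if o != EMPTY:
        parts.append("_ " + o)
    if p != EMPTY and m != EMPTY:
        parts.append(p + " _ " + m)       # preposition kept verbatim, blank before its object
    return " ".join(parts)

# e.g. template(("<PERSON>2", "send-11.1", "<PERSON>6", "through", "<LOCATION>1"))
#   -> "_ <PERSON>2 send-11.1 _ <PERSON>6 through _ <LOCATION>1"
```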
At decode time, we continue to generate words in each phrase until we reach a stopping condition: (1) reaching a maximum length (to prevent run-on sentences); or (2) generating a token that is indicative of an element in the next phrase, for example seeing a verb being generated in a noun phrase. When picking words from the language model, we noticed that the words "the" and "and" were extremely common. To increase the variety of the sentences, we sample from the top k most-likely next words and enforce a number of grammar-related rules in order to keep the coherence of the sentence. For example, we do not allow two determiners nor two nouns to be generated next to each other.

One can expect that many of the results will look structurally similar. However, we can guarantee that the provided tokens in the event will appear in the generated sentence—this model is optimized towards maintaining coherence. To determine the confidence of the model for each sentence, we sum the loss after each generated token, normalize to sentence length, and subtract from 1, as higher loss translates to lower confidence.

3.3 Monte-Carlo Beam Search

Our third method is an adaptation of Monte Carlo Beam Search (Cazenave 2012) for event-to-sentence. We train a sequence-to-sequence model on pairs of events and generalized sentences and run Monte Carlo beam search at decode time. This method differs from traditional beam search in that it introduces another scoring term that is used to re-weight all the beams at each timestep.

After top-scoring words are outputted by the model at each timestep, playouts are done from each word, or node. A node is the final token of a partially-generated sequence currently on the beam and the start of a new playout. During each playout, one word is sampled from the current step's softmax over all words in the vocabulary. The decoder network is unrolled until it reaches the "end-of-story" tag. Then, the previously-generated sequence and the sequence generated from the current playout are concatenated together and passed into a scoring function that computes the current playout's score.

The scoring function is a combination of (1) BLEU scores up to 4-grams between the input event and generated sentence, as well as (2) a weighted 1-gram BLEU score between each item in the input event and the generated sentence. The weights combining the 1-gram BLEU scores are learned during validation time, where the weight for each word in the event that does not appear in the final generated sequence gets bumped up. Multiple playouts are done from each word and the score s for the current word is computed as:

s_t = α · s_(t−1) + (1 − α) · AVG(playout_t)    (1)

where α is a constant.

In the end, the k partial sequences with the highest playout scores are kept as the current beam. For the ensemble, this model's confidence score is the final score of the highest-scoring end node. Monte Carlo beam search excels at creating diverse output—i.e. it skews towards generating interesting sentences. Since the score for each word is based on playouts that sample based on weights at each timestep, it is possible for the output to be different across runs. The Monte Carlo beam decoder has been shown to generate better sentences that are more grammatically correct than the other techniques in our ensemble, while sticking more to the input than a traditional beam decoder. However, there is no guarantee that all input event tokens will be included in the final output sentence.

3.4 Finite State Machine Constrained Beams

Various forms of beam search, including Monte Carlo playouts, cannot ensure that the tokens from an input event appear in the outputted sentence. As such, we adapted the algorithm to fit such lexical constraints, similar to Anderson et al. (2017), who adapted beam search to fit captions for images, with the lexical constraints coming from sets of image tags. Their Constrained Beam Search used finite state machines to guide the beam search toward generating the desired tokens. Their approach, which we have co-opted for event-to-sentence, attempts to achieve a balance between the flexibility and sentence quality typical of a beam search approach, while also adhering to the context and story encoded in the input events, which more direct approaches (e.g. Section 3.2) would achieve.

The algorithm works on a per-event basis, beginning by generating a finite state machine. This finite state machine consists of states that enforce the presence of input tokens in the generated sentence. As an example, assume we have an n-token input event, {t_1, t_2, t_3, ..., t_n}. The corresponding machine consists of 2^n states. Each state s maintains a search beam B_s with at most b output sequences, corresponding to the configured beam size. At each time step, every state (barring the initial state) receives from predecessor states those output sequences whose last generated token matches an input event token. The state then adds to its beam the b most likely output sequences from those received. Generating token t_1 moves the current state from the initial state to the state corresponding to t_1, t_3 to a state for t_3, and so on. The states t_1 and t_3 then, after generating tokens t_1 and t_3 respectively, transmit said sequences to the state t_{1,3}. The states and transitions proceed as such until reaching the final state, wherein they have matched every token in the input event. Completed sequences in the final state contain all input event tokens, thus providing us with the ability to retain the semantic meaning of the event.

As much as the algorithm is based around balancing generating good sentences with satisfying lexical constraints, it does not perform particularly well at either. It is entirely possible, if infrequent, for generated sentences to contain all input tokens but lose proper grammar and syntax, or even to fail to reach the final state within a fixed time horizon. This is exacerbated by larger tuples of tokens, seen even at just five tokens per tuple. To compensate, we relax our constraint to permit output sequences that have matched at least three out of five tokens from the input event.
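To make the constraint automaton above concrete, here is a minimal sketch in which states are the subsets of input-event tokens matched so far; beam bookkeeping, scoring, and decoding are omitted, and the helper names are ours.

```python
# Sketch of the 2^n-state constraint automaton from Section 3.4: a state records
# which input-event tokens have already been emitted by a hypothesis.
from itertools import combinations


def make_states(tokens):
    """Enumerate all 2^n states as frozensets of already-matched tokens."""
    states = []
    for r in range(len(tokens) + 1):
        states.extend(frozenset(c) for c in combinations(tokens, r))
    return states


def transition(state, emitted_word, tokens):
    """Move to the successor state when the emitted word is one of the constraints."""
    if emitted_word in tokens:
        return state | {emitted_word}
    return state


def is_final(state, tokens, required=None):
    """Final state = all constraints matched (or a relaxed count, e.g. 3 of 5)."""
    required = len(tokens) if required is None else required
    return len(state) >= required

# e.g. tokens = {"person.n.01", "send-11.1", "location.n.01"}; decoding starts at frozenset()
```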

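The playout scoring and the update in Eq. (1) of Section 3.3 also lend themselves to a short sketch, assuming NLTK's sentence-level BLEU; the per-token 1-gram weights that the paper learns on the validation set are replaced here by fixed placeholder weights, and the decoder that produces the playouts is not shown.

```python
# Sketch of the Monte Carlo playout scoring (Section 3.3) and the Eq. (1) update.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def playout_score(event_tokens, generated_tokens):
    """Combine a 4-gram BLEU term with a 1-gram BLEU term against the input event."""
    smooth = SmoothingFunction().method1
    bleu4 = sentence_bleu([event_tokens], generated_tokens,
                          weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
    bleu1 = sentence_bleu([event_tokens], generated_tokens,
                          weights=(1.0, 0, 0, 0), smoothing_function=smooth)
    return 0.5 * bleu4 + 0.5 * bleu1   # placeholder mix; the paper learns these weights


def update_node_score(prev_score, playout_scores, alpha=0.5):
    """s_t = alpha * s_(t-1) + (1 - alpha) * AVG(playout_t), as in Eq. (1)."""
    avg = sum(playout_scores) / len(playout_scores)
    return alpha * prev_score + (1 - alpha) * avg
```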
3.5 Ensemble

The entire event-to-sentence ensemble is designed as a cascading sequence of models: (1) retrieve-and-edit, (2) sentence templating, (3) Monte Carlo beam search, (4) finite state constrained beam search, and (5) standard beam search. We use the confidence scores generated by each of the models in order to re-rank the outputs of the individual models. This is done by setting a confidence threshold for each of the models such that if a confidence threshold fails, the next model in the ensemble is tried. The thresholds are tuned on the confidence scores generated from the individual models on the validation set of the corpus. This ensemble saves on computation as it sequentially queries each model, terminating early and returning an output sentence if the confidence threshold for any of the individual models is met.

An event first goes through the retrieve-and-edit framework, which generates a sentence and corresponding confidence score. This framework performs well when it is able to retrieve a sample from the training set that is relatively close in terms of retrieval distance to the input. Given the sparsity of the dataset, this happens with a relatively low probability, and so we place this model first in the sequence.

The next two models are each optimized towards one of our two main goals. The sentence templating approach retains all of the tokens within the event and so loses none of its semantic meaning, at the expense of generating a more interesting sentence. The Monte-Carlo approach, on the other hand, makes no guarantees regarding retaining the original tokens within the event but is capable of generating a diverse set of sentences. We thus cascade first to the sentence templating model and then the Monte-Carlo approach, implicitly placing greater importance on the goal of retaining the semantic meaning of the event.

The final model queried is the finite-state-machine-constrained beam search. This model has no confidence score; either the model is successful in producing a sentence within the given length with the event tokens or it is not. In the case that the finite state machine based model is unsuccessful in producing a sentence, the final fallthrough model—the baseline sequence-to-sequence model with standard beam search decoding—is used.

4 Dataset

To aid in the performance of our story generation, we select a single genre: science fiction. We scraped long-running science fiction TV show plot summaries from the fandom wiki service wikia.com. This dataset contains longer and more detailed plot summaries than the dataset used in Martin et al. (2018) and Tambwekar et al. (2019), which we believe to be important for the overall story generation process. The corpus contains 2,276 stories in total, each story being an episode of a TV show. The average story length is 89.23 sentences. There are stories from 11 shows, with an average of 207 stories per show, from shows like Doctor Who, Futurama, and The X-Files. The data was pre-processed to simplify alien names in order to aid the parser. Then the sentences were split, partially following the "split-and-prune" methodology of Martin et al. (2018) as described in Section 2.2.

Once the sentences were split, they were "eventified" as described in Section 2.2. One benefit of having split sentences is that there is a higher chance of having a 1:1 correspondence between a sentence and an event, instead of a single sentence becoming multiple events. After the data is fully prepared, it is split in an 8:1:1 ratio to create the training, validation, and testing sets, respectively.

5 Experiments

We perform two sets of experiments: one set evaluating our models on the event-to-sentence problem by itself, and another set intended to evaluate the full storytelling pipeline. Each of the models in the event-to-sentence ensemble is trained on the training set of the sci-fi corpus. The training details for each of the models are as described above. All of the models in the ensemble slot-fill the verb automatically—filling a VerbNet class with a verb of appropriate conjugation—except for the sentence templating model, which does verb slot-filling during post-processing.

After the models are trained, we pick the cascading thresholds for the ensemble by running the validation set through each of the models and generating confidence scores. This is done by running a grid search through a limited set of thresholds such that the overall BLEU-4 score (Papineni et al. 2002) of the generated sentences in the validation set is maximized. These thresholds are then frozen when running the final set of evaluations on the test set. For the baseline sequence-to-sequence method, we decode our output with a beam size of 5. We report perplexity, BLEU-4, and ROUGE-4 scores, comparing against the gold standard from the test set.

Perplexity = 2^(−Σ_x p(x) log_2 p(x))    (2)

where x is a token in the text, and

p(x) = count(x) / Σ_(v∈V) count(v)    (3)

where V is the vocabulary. Our BLEU-4 scores are naturally low (where higher is better) because of the creative nature of the task—good sentences may not use any of the ground-truth n-grams. Even though we frame Event2Sentence as a translation task, BLEU-4 and ROUGE-4 are not reliable metrics for creative generation tasks.

The first experiment takes plots in the test set, eventifies them, and then uses our event-to-sentence ensemble to convert them back to sentences. In addition to using the full ensemble, we further experiment with using different combinations of models along the spectrum between maintaining coherence and being interesting. We then evaluate the generated sentences, using the original sentences from the test set as a gold standard.

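Tying Sections 3.5 and 5 together, the cascade can be summarized in a compact control-flow sketch; the model objects and threshold values below are placeholders, with the real thresholds coming from the grid search over the validation set described above.

```python
# Sketch of the cascading ensemble: query models in order, return the first output
# whose confidence clears its tuned threshold, fall through to beam-search seq2seq.
def ensemble_generate(event, models, fallback):
    """models: list of (name, generate_fn, threshold); generate_fn -> (sentence, confidence) or None."""
    for name, generate, threshold in models:
        result = generate(event)
        if result is None:            # e.g. the FSM decoder failed to reach a final state
            continue
        sentence, confidence = result
        if confidence >= threshold:
            return sentence, name
    return fallback(event), "seq2seq-beam"

# Usage sketch (thresholds are illustrative placeholders):
# models = [("retedit", retedit_fn, 0.85), ("templates", template_fn, 0.60),
#           ("monte-carlo", mc_fn, 0.55), ("fsm", fsm_fn, 0.0)]
# sentence, used = ensemble_generate(event, models, fallback=beam_seq2seq_fn)
```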
Table 1: Event-to-sentence examples for each model. ∅ represents an empty parameter; <PRP> is a pronoun.

Input Event: <PRP>, act-114-1-1, to, ∅, event.n.01
  RetEdit: <PRP> and <PERSON>0 move to the event.n.01 of the natural object.n.01.
  Templates: <PRP> act-114-1-1 to event.n.01.
  Monte Carlo: <PRP> moves to the nearest natural object.n.01.
  FSM: physical entity.n.01 move back to the phenomenon.n.01 of the craft.n.02...
  Gold Standard: <PRP> move to the event.n.01.

Input Event: <PERSON>2, send-11.1, through, <PERSON>6, <LOCATION>1
  RetEdit: <PERSON>2 sends <PERSON>6 through the <LOCATION>1.
  Templates: The <PERSON>2 send-11.1 the <PERSON>6 through <LOCATION>1.
  Monte Carlo: <PERSON>2 passes this undercover in the body part.n.01 and collapses.
  FSM: ∅
  Gold Standard: In activity.n.01 to avoid <PRP> out.n.01 <PERSON>2 would transport <PERSON>6 through the <LOCATION>1.

Table 2: End-to-end pipeline examples on previously-unseen input data. The Event-to-Sentence model used is the full ensemble. Sentences are generated using both the extracted and generated events.

Input Sentence: On Tatooine, Jabba the Hutt inspects the drone barge recently delivered to him.
  Extracted Event: <ORG>0, assessment-34.1, ∅, vessel.n.02, ∅
  Generated Events (Event-to-Event): <PERSON>1, settle-36.1.2, indicator.n.03, ∅, ∅; music.n.01, escape-51.1-1, from, ∅, ∅; <PRP>, discover-84, to, run-51.3.2, progenitor.n.01
  Generated Sentences (Event-to-Sentence): The <ORG>0 can not scan the vessel.n.02 indicator.n.03 of the <VESSEL>0. <PERSON>1 decides to be a little person.n.01 at the structure.n.01. the music.n.01 arrives. <PRP> finds a lonely person.n.01 on the upper one of the craft.n.02 which is not a personal letter.n.01 but does not respond to hails.
  Slot-filled Sentences: The Jabba the Hutt can not scan the bareboat of the Uss Lakota. O Yani decides to be a little mailer at the airdock. The Music arrives. She finds a lonely mailer on the upper one of the bareboat which is not a love letter but does not respond to hails.

Input Sentence: Boba Fett has just chased down another bounty, a Rodian art dealer who sold fake works to Gebbu the Hutt.
  Extracted Event: <PERSON>0, chase-51.6, ∅, bounty.n.04, ∅
  Generated Events (Event-to-Event): <PERSON>0, chase-51.6, to, magnitude.n.01, ∅; magnitude.n.01, comprehend-87.2, off, craft.n.02, magnitude.n.01; <PERSON>2, amuse-31.1, off, ∅, ∅; <PERSON>2, discover-84, off, change of integrity.n.01, ∅
  Generated Sentences (Event-to-Sentence): <PERSON>0 enters the bounty.n.04 and tells <PRP>. <PERSON>0 attaches the explosive.a.01 to the person.n.01 who is trying to fix the device.n.01. the magnitude.n.01 doesn't know the craft.n.02 off the craft.n.02. <PERSON>2 is surprised when <PRP> learns that the person.n.01 is actually <PERSON>7. <PERSON>2 sees the change of integrity.n.01 and tells <PRP>.
  Slot-filled Sentences: Boba Fett enters the bounty and tells it. Boba Fett attaches the explosive to the peer who is trying to fix the toy. The multiplicity doesn't know the bounty off the bounty. Dark Jedi Lomi Plo is surprised when it learns that the peer is actually Mrs Conners. Dark Jedi Lomi Plo sees the combination off the Orbs and tells them.

The second experiment uses event sequences generated by an event-to-event system such as Tambwekar et al. (2019) and is designed to demonstrate how our system integrates into the larger pipeline described in Figure 1. We then transform these generated event sequences into generalized sentences using both the ensemble and the baseline sequence-to-sequence approach. As the last step, the generalized sentences are passed into the "slot filler" (see Figure 1) such that the categories are filled. As the story goes on, the "memory" maintains a dynamic graph that keeps track of which entities (e.g. people, items) are mentioned at which event and what their tag was (e.g. <PERSON>5, Synset('instrument.n.01')). When the slot filler sees a new sentence, it first tries to see if it can fill it in with an entity it has seen before. This includes the case where the current Synset it is looking at is a descendant of a Synset already stored in memory. If a new word has to be selected, named entities are randomly chosen from a list collected from the original science fiction corpus, with entities paired with their respective tags (PERSON, ORG, NUMBER, etc.). Synsets are selected by finding a descendant 1 or 2 levels down. The word is currently selected randomly, but this will soon be improved by the addition of a language model guiding it. To fill a pronoun (<PRP>), the slot filler refers to the memory to select a recently-mentioned entity. Person names are run through US Census data to determine the "gender" of the name in order to select an appropriate pronoun. If no pronoun can be found, it defaults to they/them, and if no previous entity can be found, it defaults to it. Organizations are always they. For the purpose of this study, stories that came from the same events (story pairs across both conditions) were filled with the same entities.

Once the sentences from both experiments were complete, we conducted a human participant study on Amazon Mechanical Turk. Each participant was presented with a single story and given a list of 5-point Likert scale questions, validated by Purdy et al. (2018) and used by Tambwekar et al. (2019) in their evaluation. We exclude categories assessing the long-term coherence of a story, as these categories are designed to evaluate an event-to-event system and not event-to-sentence, which is conditioned to map an event to a single sentence at a time. Participants were also asked to provide a summary of the story and to indicate which of the attributes from the Likert questions they thought to be most important for stories. If a participant's English was not deemed fluent enough in the open-ended questions, their data was discarded. This left us with 64 participants in the ensemble condition and 58 in the baseline condition.

Table 3: Test set perplexity, BLEU, & ROUGE (F1) scores, with average sentence lengths for event-to-sentence models.

Model          Perplexity  BLEU    ROUGE  Length
RetEdit        71.354      0.041   11.25  9.27
Templates      203.629     0.0034  6.21   5.43
Monte Carlo    71.385      0.0453  10.01  7.91
FSM            104.775     0.0125  1.29   10.98
Seq2seq        83.410      0.040   10.66  6.59
RetEdit+MC     72.441      0.0468  10.97  9.41
Templ.+MC      79.295      0.0409  10.10  6.92
Templ.+FSM     79.238      0.0296  6.36   9.09
RE+Tmpl.+MC    73.637      0.0462  10.96  9.35
Full Ensemble  70.179      0.0481  11.18  9.22

6 Results and Discussion

Table 3 shows the perplexity, BLEU-4 scores, ROUGE-4 scores, and average sentence length for event-to-sentence on the testing set for each of the models, the ensemble, and the baseline. Note that some of the models, such as the sentence templates, make no use of gold standard sentences and are thus poorly optimized to use perplexity, BLEU, and ROUGE scores.

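The slot-filling heuristics described in Section 5 above (entity memory, tag reuse, and pronoun defaults) can be summarized in a rough sketch; the memory graph, the US-Census gender lookup, and the Synset-descendant sampling are reduced to placeholders, and all names here are ours rather than the released code's.

```python
# Rough sketch of the slot filler: reuse remembered entities for repeated tags,
# otherwise draw a fresh named entity from a genre-specific list; pronouns default
# to they/them, then "it" when no antecedent is known.
import random


class SlotFiller:
    def __init__(self, entity_lists):
        self.entity_lists = entity_lists   # e.g. {"PERSON": [...], "ORG": [...]}
        self.memory = {}                   # tag -> surface form already used in this story

    def fill_tag(self, tag, category):
        """Fill a tag like <PERSON>5 consistently across the whole story."""
        if tag not in self.memory:
            self.memory[tag] = random.choice(self.entity_lists[category])
        return self.memory[tag]

    def fill_pronoun(self, antecedent_gender=None):
        """Pick a pronoun from the most recent antecedent, with the paper's defaults."""
        if antecedent_gender == "male":
            return "he"
        if antecedent_gender == "female":
            return "she"
        if antecedent_gender == "organization":
            return "they"
        return "they" if self.memory else "it"
```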
Table 4: Utilization percentages for each model combination on both events from the test set and from the full pipeline.
RetEdit Templates Monte Carlo FSM Seq2seq
Test Pipeline Test Pipeline Test Pipeline Test Pipeline Test Pipeline
RetEdit+MC 82.58 31.74 - - 9.95 48.4 - - 7.46 19.86
Templates+MC - - 6.14 5.48 65.7 66.67 - - 28.16 27.85
Templates+FSM - - 6.14 5.48 - - 56.77 32.65 37.09 61.87
RetEdit+Templates+MC 82.58 31.74 1.49 3.88 9.1 45.21 - - 6.82 19.18
Full Ensemble 94.91 55.71 0.22 0.91 4.29 41.10 0.15 0.68 0.43 1.60

In addition to running each model in the ensemble individually, we experiment with multiple combinations of the models to assess which combination makes the most effective ensemble. The full ensemble performs better than any of the individual models with regard to perplexity, as it is designed to combine the models such that each of their weaknesses is minimized. The average sentence length metric highlights the differences between the models, with the templates producing the shortest sentences and the finite state machine taking longer to generate sentences due to the constraints it needs to satisfy.

We also noted how often each model in the ensemble is used, shown in Table 4, when generating sentences from the eventified testing corpus or from the event-to-event model within the pipeline, across different combinations of ensembles. Utilization percentages show us how often each model was picked in the respective ensembles based on the corresponding confidence score thresholds. RetEdit was heavily used on the test set, likely due to the train and test sets having a similar distribution of data. On the pipeline events, RetEdit is used much less—events generated by event-to-event are often very different from those in the training set. A majority of the events that fall through RetEdit are caught by our Monte Carlo beam search, irrespective of the fact that RetEdit—and sentence templates—are most likely to honor the event tokens. This is partially due to the fact that satisfying the constraint of maintaining the event tokens makes it difficult for these models to meet the required threshold. The small portion of remaining events are transformed using the templates and the finite state machine.

Table 1 shows examples of generalized sentence outputs of each of the event-to-sentence models, illustrating some of the trends we alluded to in Section 3. Retrieve-and-edit focuses on semantics at the expense of sentence quality. The sentence templates produce output that matches the input event but is very formulaic. Monte Carlo generates entertaining and grammatically-correct sentences but occasionally loses the semantics of the input event. The finite state machine attempts to achieve a balance between semantics and generating entertaining output; however, it sometimes fails to produce an output given the constraints of the state machine itself. All of these can be compared to the original next event from the testing set. We also provide examples of the entire pipeline in Table 2, which demonstrates our ensemble's ability to work with an existing plot generator.

Figure 2: Human participant study results, where a higher score is better (scale of 1-5). Confidence values are 0.29 and 0.32 for genre and enjoyability, respectively; α=0.1. The confidence values for other metrics lie between 0.27-0.35.

For the human participant study comparing a seq-to-seq baseline to our full ensemble (Figure 2), most metrics were similar in score, which is understandable given that both conditions produced stories that were at times confusing. However, the ensemble consistently outperformed the baseline in terms of quality, maintaining the genre (space opera), grammar, and enjoyability. Enjoyability and genre were significant at p < .10 using a two-tailed independent t-test.

7 Conclusions

Event representations improve the performance of plot generation and allow for planning toward plot points. However, they are unreadable and abstract, needing to be translated into syntactically- and semantically-sound sentences that can both keep the meaning of the original event and be an interesting continuation of the story. We present an ensemble of four event-to-sentence models, in addition to a simple beam search model, that balances between these two problems. Each of the models in the ensemble is calibrated toward a different point in the spectrum between the two issues, and they are thus designed to cover each other's weaknesses. The ensemble is integrated into a full story generation pipeline, demonstrating that our ensemble can work with generated events drawn from a realistic distribution.
References

Ammanabrolu, P.; Broniec, W.; Mueller, A.; Paul, J.; and Riedl, M. O. 2019. Toward automated quest generation in text-adventure games. CoRR abs/1909.06283.

Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2017. Guided open vocabulary image captioning with constrained beam search. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 936–945.

Cazenave, T. 2012. Monte Carlo Beam Search. IEEE Transactions on Computational Intelligence and AI in Games 4(1):68–72.

Clark, E.; Ji, Y.; and Smith, N. A. 2018. Neural Text Generation in Stories Using Entity Representations as Context. In NAACL-HLT.

Fan, A.; Lewis, M.; and Dauphin, Y. 2018. Hierarchical Neural Story Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 889–898.

Farrell, R.; Ware, S. G.; and Baker, L. J. 2019. Manipulating narrative salience in interactive stories using Indexter's pairwise event salience hypothesis. IEEE Transactions on Games.

Gervás, P.; Díaz-Agudo, B.; Peinado, F.; and Hervás, R. 2005. Story plot generation based on CBR. Knowledge-Based Systems 18(4-5):235–242.

Gu, J.; Lu, Z.; Li, H.; and Li, V. O. K. 2016. Incorporating copying mechanism in sequence-to-sequence learning. Association for Computational Linguistics (ACL).

Hashimoto, T. B.; Guu, K.; Oren, Y.; and Liang, P. 2018. A Retrieve-and-Edit Framework for Predicting Structured Outputs. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018).

Jain, P.; Agrawal, P.; Mishra, A.; Sukhwani, M.; Laha, A.; and Sankaranarayanan, K. 2017. Story generation from sequence of independent short descriptions. In SIGKDD Workshop on Machine Learning for Creativity (ML4Creativity).

Khalifa, A.; Barros, G. A. B.; and Togelius, J. 2017. DeepTingle. In International Conference on Computational Creativity.

Kiros, R.; Zhu, Y.; Salakhutdinov, R. R.; Zemel, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, 3294–3302.

Lebowitz, M. 1987. Planning Stories. In Proceedings of the 9th Annual Conference of the Cognitive Science Society, 234–242.

Li, B.; Lee-Urban, S.; Johnston, G.; and Riedl, M. O. 2013. Story generation with crowdsourced plot graphs. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI'13, 598–604. AAAI Press.

Martin, L. J.; Ammanabrolu, P.; Wang, X.; Singh, S.; Harrison, B.; Dhuliawala, M.; Tambwekar, P.; Mehta, A.; Arora, R.; Dass, N.; Purdy, C.; and Riedl, M. O. 2017. Improvisational Storytelling Agents. In Workshop on Machine Learning for Creativity and Design (NeurIPS 2017).

Martin, L. J.; Ammanabrolu, P.; Wang, X.; Hancock, W.; Singh, S.; Harrison, B.; and Riedl, M. O. 2018. Event Representations for Automated Story Generation with Deep Neural Nets. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 868–875.

Meehan, J. R. 1977. TALE-SPIN, an interactive program that writes stories. Proceedings of the 5th International Joint Conference on Artificial Intelligence 1:91–98.

Merity, S.; Keskar, N. S.; and Socher, R. 2018. Regularizing and Optimizing LSTM Language Models. In 6th International Conference on Learning Representations, ICLR 2018.

Miller, G. A. 1995. WordNet: A Lexical Database for English. Communications of the ACM 38(11):39–41.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. Association for Computational Linguistics.

Peng, N.; Ghazvininejad, M.; May, J.; and Knight, K. 2018. Towards Controllable Story Generation. In Proceedings of the First Workshop on Storytelling, 43–49. New Orleans, Louisiana: Association for Computational Linguistics.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Pérez y Pérez, R., and Sharples, M. 2001. MEXICA: A computer model of a cognitive account of creative writing. Journal of Experimental & Theoretical Artificial Intelligence 13(2001):119–139.

Porteous, J., and Cavazza, M. 2009. Controlling narrative generation with planning trajectories: The role of constraints. In Joint International Conference on Interactive Digital Storytelling, volume 5915 LNCS, 234–245. Springer.

Purdy, C.; Wang, X.; He, L.; and Riedl, M. 2018. Predicting generated story quality with quantitative measures. In Fourteenth Artificial Intelligence and Interactive Digital Entertainment Conference.

Riedl, M. O., and Young, R. M. 2010. Narrative Planning: Balancing Plot and Character. Journal of Artificial Intelligence Research 39:217–267.

Roemmele, M., and Gordon, A. S. 2018. An Encoder-decoder Approach to Predicting Causal Relations in Stories. In Proceedings of the First Workshop on Storytelling, 50–59. New Orleans, Louisiana: Association for Computational Linguistics.

Roemmele, M. 2018. Neural Networks for Narrative Continuation. Ph.D. Dissertation, University of Southern California.

Schuler, K. K., and Kipper-Schuler, K. 2005. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. Ph.D. Dissertation, University of Pennsylvania.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, 3104–3112.

Swanson, R., and Gordon, A. 2012. Say Anything: Using textual case-based reasoning to enable open-domain interactive storytelling. ACM Transactions on Interactive Intelligent Systems 2(3):16:1–16:35.

Tambwekar, P.; Dhuliawala, M.; Martin, L. J.; Mehta, A.; Harrison, B.; and Riedl, M. O. 2019. Controllable Neural Story Plot Generation via Reward Shaping. In Proceedings of the 28th International Joint Conference on Artificial Intelligence.

Turner, S. R., and Dyer, M. G. 1986. Thematic knowledge, episodic memory and analogy in MINSTREL, a story invention system. University of California, Computer Science Department.

Ware, S., and Young, R. M. 2011. CPOCL: A narrative planner supporting conflict. In Proceedings of the 7th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment.

Wiseman, S.; Shieber, S. M.; and Rush, A. M. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Yao, L.; Peng, N.; Weischedel, R.; Knight, K.; Zhao, D.; and Yan, R. 2019. Plan-And-Write: Towards Better Automatic Storytelling. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19).