
1 Introduction

Natural Language Generation (NLG) and data-to-text approaches have gained increasing attention in recent years due to their wide applicability and the challenges they pose. By automatically generating high-quality texts from relevant information extracted from source data, these approaches can bridge the gap between raw data and human users.

Several data-to-text solutions have been proposed for specific domains, for example, the generation of weather reports from meteorological data in several languages [23, 24], the creation of custom letters that answer customers’ questions [25], the generation of reports on the state of neonatal babies from intensive care data [26], and the generation of project management reports [27] and air quality reports [28].

With the growing popularity of sports such as football and basketball, thousands of games attract billions of viewers every year. In this paper, we propose an Automatic Sport News Generating System, which aims to produce sports news immediately after each match is over, saving both time and labor in writing news articles. However, there has been little research in this area so far, especially with machine learning models, which leaves room for generating articles of better quality. One of the latest studies simplifies the task into a learning-to-rank problem: it pieces together sentences from past news based on live comments instead of actually generating sentences from the match data [3]. In this paper, our system generates National Basketball Association (NBA) game news directly from the stats. NBA game news usually consists of three parts: the first summarizes the overall game result; the second describes each player’s performance; the last describes the situation and events that occurred in each of the four quarters. This paper focuses on the first two parts, which are closely related to the game stats.

Recently, Generative Adversarial Networks (GAN) [1] have been introduced as a novel way to train a generative model. They have significantly advanced several machine learning research fields, such as image generation [13], video prediction [7] and other domains [6, 8, 14]. The newly proposed WGAN, which addresses the convergence problem, has enabled GAN to be applied to fields based on discrete data, such as Natural Language Processing (NLP).

In this work, our Automatic Sport News Generating System is based on WGAN. Two models in the system fulfil different generating tasks. First, we leverage a WGAN model to generate the most important and variable phrases, which accurately describe the match or a player’s performance. Then, by putting these phrases together with other constant information into a template-based sentence generator, fluent sentences can be produced. This procedure ensures that the generated sentences are well written and clear to the reader, which is important in sports news. By concatenating the outputs of the Summary Sentence Generating Model (SSGM) and the Player Performance Generating Model (PPGM), the system finally generates the most important parts of NBA sports news.

The main contribution of our system is that it can generate proper phrases to describe different matches without the intervention of experts (sports writers) or hand-crafted judgment rules. For example, the system can judge which player played well in a match and generate corresponding phrases to describe him. This process is performed automatically by WGAN based on real news and stats. Furthermore, we define a series of rules for the sentence generator so that the generated sentences are varied and closer to idiomatic expression. To the best of our knowledge, this is the first framework that successfully generates sports news from game stats with the participation of a GAN model. It demonstrates the potential of GAN models for NLP applications.

The rest of the paper is structured as follows: Sect. 2 describes related work; Sect. 3 gives a detailed description of our model; Sect. 4 analyzes the experimental results; and Sect. 5 summarizes this work and future directions.

2 Related Work

Automatic sports news generation is a promising task that aims to generate sports news shortly after each game finishes, so that readers receive the news quickly and human labor is saved. Currently, there is little research in this area. A recent study [3] attempts to generate football news from live comments, using a learning-to-rank method to extract the previously seen sentence most likely to fit a new live comment. Essentially, the authors reduce the problem to a ranking problem.

GAN has recently drawn significant attention as a new machine learning model [1]. The task of GAN is to produce a generator through a minimax game. In the last few years, many variants have been proposed to improve the original GAN, such as f-GAN [15], Energy-based GAN [20], InfoGAN [17], and WGAN [21], which shows that GAN is a promising model.

A model combining deep convolutional neural networks and GAN (DCGAN) [8] has been proposed, showing that GAN has great potential in image generation. Inspired by DCGAN, researchers have made much progress on image generation. A model that applies DCGAN within a Laplacian pyramid framework [6] has been shown to produce higher-quality images than the original DCGAN. Meanwhile, DCGAN has been used to generate photo-realistic natural images, mainly by modifying the loss function.

Besides image generation, NLP is another important task in machine learning. However, GAN is considered difficult to apply to NLP [1], since the updates of GAN are based on a continuous space, whereas languages are discrete.

Although applying GAN to NLP is difficult, there have been several attempts. An effort that learns word representations from documents [9] bypasses the problem of handling natural language sentences directly. In addition, researchers have tried to solve the discreteness problem by modelling the generator as a Reinforcement Learning (RL) agent, as in SeqGAN [18]; its policy-gradient formulation avoids the non-differentiability of discrete generator outputs. On the other hand, the discreteness problem can also be addressed by mapping discrete natural language into continuous vector spaces. Combining LSTM and CNN with generated vectors for text generation [19], researchers have successfully generated some realistic sentences. Another work established a model that generates sentences directly from a continuous space using GAN [5], which is also progress for GAN on NLP.

Traditional GAN and its applications only take noise as input to the generator and produce the expected output. We cannot decide which class to output if the expected output spans many categories. As a solution, the Conditional Generative Adversarial Network (CGAN) [4] was designed. It adds conditions as part of the input to both the generator and the discriminator. GAN can thus map conditions to expected outputs, and we can generate more accurate outputs by feeding conditions at test time. A typical application of CGAN [13] takes descriptions as conditions and the corresponding images as output, so the generator can produce figures according to the descriptive words we input.

WGAN, and its refined version WGAN with gradient penalty (WGAN-GP), are improved forms of GAN that aim to solve the problem that traditional GAN is hard to train. The authors analyzed why GAN often fails to converge during training and made specific modifications. The modification is very successful: WGAN even works on discrete embeddings such as one-hot vectors.

In our research, we apply WGAN to the sports news generation task. Due to the limited use of GAN in NLP, we reduce the goal of WGAN from generating a whole sentence to generating the few phrases whose wording varies across news articles. This idea is applicable and reasonable: sports news contains many similar sentences, so we can safely ignore the repeated words and focus only on the variable ones, which are mostly the judgmental words carrying the information readers care about most. By combining the output words with other constant information, such as team names and player names, our model can generate summary sentences and player performance sentences very similar to those in real sports news.

3 Method

In this section, we describe the method for generating NBA news. Two models fulfil different tasks: the Summary Sentence Generating Model (SSGM) and the Player Performance Generating Model (PPGM).

3.1 Summary Sentence Generating Model (SSGM)

SSGM is designed to summarize the overall game situation. It takes the basic overall information (date, teams and score) as input, and outputs the summary sentence for the whole match.

The summary sentence contains the most abstract but important information about a match. Its structure is relatively stable and simple. However, the phrases vary according to how one team beats another. For example, if the Houston Rockets beat the Dallas Mavericks 131-102, the summary sentence in the news might be “Rockets at home court whup Mavericks 131-102”. “Whup” is used here instead of the neutral word “beat” to show that the Rockets played an excellent game. Such word choices are usually made by news writers, or more precisely, by experts who can judge the expression from experience.

In our work, we want SSGM to learn these expressions automatically and generate appropriate sentences without involving experts. Therefore, instead of producing the whole sentence directly with an algorithm, which would be inefficient and unnecessary, we focus on generating the subjective phrases, such as “whup” in this example. Then, combining the phrases and the other constant information in a template, SSGM outputs the generated summary sentence.

To fulfil this task, we need a generator that can produce phrases suitable for different situations. GAN, a powerful model that trains such a generator by confronting it with a discriminator, is very popular in recent research and produces impressive results. To overcome the problem that GAN is hard to converge during training, a conditional WGAN with gradient penalty is used in SSGM.

Compared to traditional GAN, which trains the discriminator to classify whether an input is real or fake, WGAN uses the Wasserstein distance in the discriminator to measure the difference between the two kinds of inputs, which ensures the model converges to a plausible result even for one-hot word embedding data. In addition, the gradient penalty is used to stabilize the training process, so we can train the model without spending effort tuning it for convergence.

In SSGM, the most important factor influencing the generated phrase is the score of each team. Therefore, a neural network is designed for the generator, with the scores and noise as input and the phrase embedding as output. In the discriminator, another neural network takes the scores and the corresponding phrase vector as input and outputs a single value indicating how likely the phrase-score combination is to be real.

Given:

  • \(p_{g}\): the probability distribution of the generated phrase vector, produced by the generator function G; in SSGM the function is \(G(s+z)\), where s is the score vector and z is the noise vector.

  • \(p_{r}\): the probability distribution of real phrase vector.

  • \(p_{\widehat{x}}\): the probability distribution of the random interpolate \(\widehat{x}=\epsilon x_{r}+(1-\epsilon )x_{g}\), where \(\widehat{x} \sim p _{\widehat{x}}\), \(x_{r} \sim p_{r}\), \(x_{g} \sim p_{g}\) and \(\epsilon \sim Uniform[0,1]\).
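In code, the interpolate \(\widehat{x}\) can be sampled in a few lines; the sketch below is a toy illustration with list-based vectors and a function name of our own choosing:

```python
import random

def interpolate(x_real, x_gen):
    """Sample x_hat = eps*x_r + (1 - eps)*x_g with eps ~ Uniform[0, 1].

    x_real and x_gen are equal-length lists standing in for the real
    and generated phrase vectors.
    """
    eps = random.random()
    return [eps * r + (1 - eps) * g for r, g in zip(x_real, x_gen)]

x_r, x_g = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
x_hat = interpolate(x_r, x_g)
# every coordinate of x_hat lies between the corresponding endpoints
assert all(min(a, b) <= h <= max(a, b) for a, b, h in zip(x_r, x_g, x_hat))
```

The gradient penalty in Eq. (2) is evaluated at samples drawn this way, which enforces the Lipschitz constraint along lines between real and generated points.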

The loss function of WGAN in SSGM is:

$$\begin{aligned}&L_{G} = -E _{x \sim p _{g}} [D(x+s)] \end{aligned}$$
(1)
$$\begin{aligned} L_{D} = -E_{x \sim p_{r}} [D(x+s)] + E_{x \sim p_{g}} [D(x+s)] + \lambda E_{\widehat{x} \sim p_{\widehat{x}}} [(\Vert \nabla _{\widehat{x}} D(\widehat{x}+s) \Vert _{2} - 1)^{2}] \end{aligned}$$
(2)

where \(L_{G}\) and \(L_{D}\) are the losses of the generator and discriminator in WGAN with gradient penalty, and D(x) is the discriminator function.
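As an illustration of how the two losses are estimated on mini-batches, the sketch below is a hypothetical, framework-free toy (the function names are ours, and the condition s is folded into the samples for brevity). It uses a linear critic, whose input gradient is its weight vector, so the standard WGAN-GP penalty term \(\lambda (\Vert \nabla D \Vert - 1)^{2}\) can be evaluated in closed form:

```python
def make_linear_critic(w):
    """Toy critic D(x) = <w, x>; its gradient w.r.t. x is w everywhere."""
    def D(x):
        return sum(wi * xi for wi, xi in zip(w, x))
    return D

def wgan_gp_losses(D, real_batch, fake_batch, grad_norm, lam=10.0):
    """Monte-Carlo estimates of the generator and discriminator losses.

    grad_norm is ||grad D|| at the interpolates; for a linear critic
    it is simply ||w||, so no autodiff is needed in this toy.
    """
    e_real = sum(D(x) for x in real_batch) / len(real_batch)
    e_fake = sum(D(x) for x in fake_batch) / len(fake_batch)
    loss_g = -e_fake
    loss_d = -e_real + e_fake + lam * (grad_norm - 1.0) ** 2
    return loss_g, loss_d

w = [0.6, 0.8]                        # ||w|| = 1, so the penalty vanishes
D = make_linear_critic(w)
L_G, L_D = wgan_gp_losses(D, [[1.0, 0.0]], [[0.0, 1.0]], grad_norm=1.0)
assert abs(L_G - (-0.8)) < 1e-9 and abs(L_D - 0.2) < 1e-9
```

In a real implementation the gradient at the interpolates is obtained by automatic differentiation; the closed-form shortcut here only works because the critic is linear.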

While training WGAN, we minimize the two losses alternately until equilibrium is reached; the generator is then expected to produce plausible results. After obtaining the phrases, we feed them, together with the constant information, into the Sentence Generator to produce the final summary sentences. The constant information includes the team names, the score, and home/away status.

The Sentence Generator in SSGM combines the constant information and the generated phrases according to templates. For example, a template can be:

On [Date], [Team(A)] made a [Score(A)]-[Score(B)] [SSGM phrase] [Team(B)].

where [Team(K)] is the name of team K, [Score(K)] is the score of team K in the match, [Date] is the date of the match, and [SSGM phrase] is the phrase generated by WGAN. Figure 1 shows the flow chart of the SSGM.
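The template-filling step can be sketched as follows (the template text and function name are illustrative, not the system's exact wording):

```python
def fill_summary_template(date, team_a, team_b, score_a, score_b, phrase):
    """Slot the WGAN-generated phrase and the constant information
    (date, teams, scores) into the summary template."""
    return "On {}, {} made a {}-{} {} {}.".format(
        date, team_a, score_a, score_b, phrase, team_b)

sentence = fill_summary_template(
    "2016-2-22", "Cavaliers", "Thunder", 115, 92, "victory over")
assert sentence == "On 2016-2-22, Cavaliers made a 115-92 victory over Thunder."
```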

Fig. 1. Flow chart of SSGM

In the flow chart, the team scores are turned into score vectors, called the condition. Concatenated with noise, they form the input to the generator network. The output of the generator is a phrase representation: embeddings that try to imitate the subjective phrase an expert would use for the given scores. In the discriminator, the neural network takes the phrase vector (real or fake) combined with the score vector as input, and outputs whether it believes the phrase is real or produced by the generator.

3.2 Player Performance Generating Model (PPGM)

Unlike the overall description of the match, whose sentence structure is relatively stable and simple, players’ performance is more variable. For instance, there are ten starting players and several substitutes in each basketball game, but not all players appear in the news. News writers tend to report the players who are more eye-catching, such as star players and those who performed surprisingly well in that game. In addition, which statistics of a player should be reported (points, rebounds, assists, etc.) is also worth considering.

Fig. 2. Flow chart of PPGM

In PPGM, as shown in Fig. 2, we consider all the aspects above. The basic structure of PPGM is also a WGAN with gradient penalty. In short, the goal of the WGAN is to determine which statistics of a player in this match are worth reporting, and which phrases or words should be used to describe them.

First, each player is represented as a vector called the player embedding, so the model can learn the phrase distribution for different players with different statistics. For example, if an up-and-coming youngster scores 20 points in his first match, we should report it with a phrase like “delivered a sensational”. However, if LeBron James (an NBA superstar) scores 20 points, that is normal for him, and we would more likely use “scored” rather than “delivered a sensational”.

Furthermore, a player’s performance is represented as a performance matrix, which combines the player embedding and the corresponding performance data as follows:

$$\begin{aligned} PM_{ij}=py_{i}^{T} \times pf_{ij} \end{aligned}$$
(3)

where \(PM_{ij}\) is the performance matrix of the \(i^{th}\) player in the \(j^{th}\) match, \(py_{i}\) is the player vector of the \(i^{th}\) player, and \(pf_{ij}\) is the performance vector of the \(i^{th}\) player in the \(j^{th}\) match.
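Equation (3) is an outer product of the player embedding and the performance vector; a toy sketch with small dimensions (the function name and values are ours):

```python
def performance_matrix(player_vec, perf_vec):
    """Outer product py_i^T * pf_ij from Eq. (3): one row per
    player-embedding dimension, one column per statistic."""
    return [[p * f for f in perf_vec] for p in player_vec]

# a 2-dim player embedding and a 3-stat performance vector
PM = performance_matrix([1.0, -0.5], [20.0, 5.0, 11.0])
assert PM == [[20.0, 5.0, 11.0], [-10.0, -2.5, -5.5]]
```

Because the same statistics are scaled by each player's embedding, identical stat lines from different players yield different conditions, which is what lets the model treat a star's 20 points differently from a rookie's.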

Next, the performance matrix is used as the condition in both the generator and the discriminator. The generator takes the performance matrix and noise as input and outputs a word/phrase embedding for each statistic. These phrases describe how the player performed in terms of each statistic.

Additionally, there is a special embedding that represents “no report” for each statistic. If the model outputs this embedding with the highest probability, the statistic is not worth reporting in the news. Consequently, if all statistics of a player are output as “no report”, that player’s performance does not appear in the news at all.
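The "no report" mechanism amounts to an argmax over the phrase slots per statistic; the sketch below uses an invented phrase list and function name purely for illustration:

```python
NO_REPORT = 4                        # index of the special "no report" slot
PHRASES = ["scored", "poured in", "erupted for", "managed only", None]

def decode_player(phrase_scores_per_stat):
    """Pick the most probable phrase for each statistic; a statistic
    whose argmax is the "no report" slot is dropped. Returns None when
    every statistic is suppressed, i.e. the player is left out."""
    picked = []
    for scores in phrase_scores_per_stat:
        idx = max(range(len(scores)), key=scores.__getitem__)
        if idx != NO_REPORT:
            picked.append(PHRASES[idx])
    return picked or None

# one reportable statistic, one suppressed statistic
assert decode_player([[0.1, 0.7, 0.1, 0.05, 0.05],
                      [0.0, 0.0, 0.0, 0.0, 1.0]]) == ["poured in"]
# everything suppressed -> the player is omitted entirely
assert decode_player([[0.0, 0.0, 0.0, 0.0, 1.0]]) is None
```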

In the discriminator, the input consists of the performance matrix together with either real or generated phrase data. The output of the discriminator decides whether these phrases come from real news or were produced by the generator.

The WGAN in PPGM is similar to that in SSGM, but with different conditions and generator output. The generator function becomes \(G(z+v_{p}^{T} v_{s})\), where \(v_{p}\) is the player vector and \(v_{s}\) is the vector of statistics obtained by the corresponding player. The discriminator function becomes \(D(x+v_{p}^{T} v_{s})\), where x is either the output of the generator or a real phrase vector.

The loss function of WGAN in PPGM is:

$$\begin{aligned}&L_G= -E_{x\sim p_{g}} [D(x+v_{p}^{T}v_{s})] \end{aligned}$$
(4)
$$\begin{aligned} L_{D} = -E_{x\sim p_{r}} [D(x+v_{p}^{T}v_{s})] + E_{x\sim p_{g}} [D(x+v_{p}^{T}v_{s})] + \lambda E_{\widehat{x}\sim p_{\widehat{x}}} [(\Vert \nabla _{\widehat{x}} D(\widehat{x}+v_{p}^{T}v_{s}) \Vert _{2} - 1)^{2}] \end{aligned}$$
(5)

Since each player’s performance phrases are generated separately, the Sentence Generator in PPGM assembles the whole paragraph for all the reported players, combining the generated phrases with constant information such as player names and the teams they belong to. It combines them according to specific rules, where different rules map to different templates, so that the expressions are diversified and closer to real news. For example:

If a player performed outstandingly on two indexes, such as points and assists, it is reported as “[Player(i)] [PPGM phrase] points and [PPGM phrase] assists”, where the connecting word “and” makes the sentence smooth.
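Such a rule can be sketched as a small function (the template wording and function name are illustrative):

```python
def two_index_sentence(player, points_phrase, assists_phrase,
                       points, assists):
    """Rule for a player outstanding on two indexes: join the two
    PPGM-generated phrases with the connective "and"."""
    return "{} {} {} points and {} {} assists.".format(
        player, points_phrase, points, assists_phrase, assists)

s = two_index_sentence("Westbrook", "had", "added", 20, 11)
assert s == "Westbrook had 20 points and added 11 assists."
```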

As a result, the outputs of SSGM and PPGM constitute the paragraphs of an NBA news article, produced fully automatically.

4 Experiments

In this section, we test the two models on real game stats.

4.1 Dataset

The data is collected from the NetEase NBA website, one of the most popular NBA websites in China. It contains the full data for each match, including the news report and detailed statistics for the match and each player; examples of the detailed statistics and the corresponding news report are provided in supplementary material Appendix A.

For example, for a match between the Cleveland Cavaliers and the Oklahoma City Thunder, the detailed statistics are shown in Fig. 3 and the corresponding news report in Fig. 4. The summary sentence is shown in the green box and the players’ performance sentences in the red box.

The translation:

The Cavaliers scored a 115-92 victory over the Oklahoma City Thunder. Love scored 29 points, grabbed 11 rebounds and 4 assists; LeBron had 25 points, added 11 assists, 7 rebounds and 3 steals but had 5 turnovers; Jefferson got 15 points and 6 rebounds; Smith got 15 points; Thompson got 14 points and 14 rebounds; Mozgov got 11 points and 5 rebounds. Durant finished with 26 points, 5 rebounds and 3 assists; Westbrook had 20 points, added 11 assists and 9 rebounds; Ibaka added 12 points.

The data is then further processed for the two tasks in the subsequent parts.

In the following subsections, to make the news reports easier for non-Chinese readers to understand, we translate them into English for illustration.

Fig. 3. Match stats

Fig. 4. News report (Color figure online)

Table 1. SSGM training dataset

4.2 Experiment Settings and Results

In this subsection, we describe the experiment settings and analyze the results.

SSGM Settings. In SSGM, the WGAN model takes the scores as input and generates phrases describing how one team beat the other. We extract the summary sentences and related data from the news reports, then process and organize them into the form shown in Table 1.

In this experiment, each of the two scores is duplicated 10 times into a 10-dimensional vector and normalized, to guarantee that WGAN fully uses the score information. The two score vectors are then concatenated to form the condition. In the generator, the condition is concatenated with a 5-dimensional noise vector as input. The output phrases are embedded using one-hot encoding.
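The construction of the generator input can be sketched as below; the normalization constant and the function name are our own assumptions for illustration:

```python
import random

def build_generator_input(score_a, score_b, noise_dim=5, max_score=150.0):
    """Duplicate each score 10 times into a 10-dimensional vector,
    normalize, concatenate the two score vectors into the condition,
    then append the noise vector (max_score is an assumed normalizer)."""
    cond = [score_a / max_score] * 10 + [score_b / max_score] * 10
    noise = [random.uniform(-1.0, 1.0) for _ in range(noise_dim)]
    return cond + noise

x = build_generator_input(115, 92)
assert len(x) == 25                    # 10 + 10 + 5
assert all(abs(v) <= 1.0 for v in x)
```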

In the discriminator, the condition vector is concatenated with the phrase vector to form the input, and a single value in [0, 1] indicates whether the discriminator believes the condition-phrase combination comes from real news.

After training, the generator can generate phrases given the two scores. SSGM then puts them, together with the other information, into templates to obtain the summary sentence.

SSGM Result. Given the data from the NBA game on 2016-2-22, Cavaliers vs. Thunder, Table 2 lists several results generated by our approach, compared with the news on the website.

Table 2. SSGM generated result compared with real report

From the results, it should be noted that the generated summary sentence is remarkably similar to the real news. It is scarcely possible for people to distinguish whether the report was generated automatically or written by a sports writer. This shows that the WGAN in SSGM learned the pattern for reporting the summary result for different score combinations, without an expert getting involved.

PPGM Settings. In PPGM, the situation is more complicated than in SSGM. The data is processed as shown in Table 3. First, we set each player vector to length 50, with random values in \([-1, 1]\). Next, statistics show that only 5 kinds of data appear in the player performance sentences: “points”, “rebounds”, “assists”, “turnovers” and “first start”. The first 4 are integers and are normalized; “first start” is Boolean, where 1 represents a first start and 0 otherwise. The performance information is therefore a vector of length 5.
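The length-5 performance vector can be sketched as follows (the normalization constant is an assumption, since the exact scheme is not stated):

```python
def performance_vector(points, rebounds, assists, turnovers, started,
                       scale=50.0):
    """Length-5 performance vector: four normalized integer statistics
    plus a Boolean "first start" flag encoded as 1.0 or 0.0."""
    stats = [points / scale, rebounds / scale,
             assists / scale, turnovers / scale]
    return stats + [1.0 if started else 0.0]

v = performance_vector(25, 7, 11, 5, True)
assert len(v) == 5
assert v[-1] == 1.0 and abs(v[0] - 0.5) < 1e-9
```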

For each kind of data, we use one-hot embeddings to represent the corresponding frequently used phrases. Each output is embedded as a vector of length 5, with the last dimension representing “no report”. The output of the generator in WGAN is therefore a \(5\times 5\) matrix, in which the phrases for all statistics are generated at once. The discriminator takes the concatenation of the performance matrix and the phrase embedding matrix as input, and outputs a single value representing whether it believes they come from real news.

Table 3. PPGM training dataset

After training, each player in each game is evaluated for whether his performance is worth reporting and which phrases should be used. Finally, all the WGAN results, together with the corresponding names, are inserted into templates, and PPGM generates the player performance sentences using the rules for all the reported players.

PPGM Result. Given the data from the NBA game on 2016-2-22, Cavaliers vs. Thunder, the generated paragraph is shown in Table 4.

From the results, we can see that the player performance sentences are clear and plausible. Furthermore, the sentence generator leveraged various templates and rules to generate the whole paragraph, fully utilizing the input data and avoiding monotonous descriptions of the players. This can even read better than the original news.

Among all the input data, PPGM automatically chose which statistics to report, and the chosen statistics are all prominent and worth reporting. For example, Tristan Thompson got 14 rebounds, which is outstanding in an NBA match, and our model reported it as “Tristan Thompson grabbed 14 rebounds”. As another example, Kyrie Irving got only 2 points in the match, yet PPGM still reported him, because the model can learn from his history, as represented in the performance matrix, that he is an NBA star followed by many fans. The corresponding report, “Kyrie Irving performed poorly, 5 hit 1, and only got 2 scores”, gives a vivid description of his performance (Table 4).

Table 4. PPGM generated results compared with real reports

In sum, PPGM takes all the players’ data in a match and is capable of deciding which statistics are worth reporting and generating diverse descriptions of the players’ performances.

Based on the results of both SSGM and PPGM, we are confident that our model can take over this part of the work from human writers and save the time needed to produce NBA news.

5 Conclusion and Future Work

In this work, we propose an Automatic Sport News Generating System, which aims to produce sports news immediately after each match is over. We use WGAN combined with templates to generate the summary sentences and player performance sentences of NBA match news. This is the first work to apply GAN to automatic sports news generation. The system not only fulfils the task of generating sports news based on the WGAN model, but also offers a new angle on leveraging GAN in the NLP area.

As future work, we will continue with the remaining parts of NBA match news. For the description of the events within the match, we will train the WGAN model on sequences of data, a new attempt at combining GAN and sequence models on NLP tasks.