For humans and robots to collaborate more in the real world, robots need to understand human intentions from the different manner of their behaviors. In our study, we focus on the meaning of adverbs which describe human motions. We propose a topic model, Hierarchical Dirichlet Process-Spectral Mixture Latent Dirichlet Allocation, which concurrently learns the relationship between those human motions and those adverbs by capturing the frequency kernels that represent motion characteristics and the shared topics of adverbs that depict such motions. We trained the model on datasets we made from movies about “walking” and “dancing”, and found that our model outperforms representative neural network models in terms of perplexity score. We also demonstrate our model’s ability to determine the adverbs for a given motion and confirmed that the model predicts more appropriate adverbs.
This paper focuses on the task of open-domain live commentary generation. Compared to domain-specific work in this task, this setting proved particularly challenging due to the absence of domain-specific features. Aiming to bridge this gap, we integrate spatial information by proposing an utterance generation model with a novel spatial graph that is flexible to deal with the open-domain characteristics of the commentaries and significantly improves performance. Furthermore, we propose a novel evaluation scheme, more suitable for live commentary generation, that uses LLMs to automatically check whether generated utterances address essential aspects of the video via the answerability of questions extracted directly from the videos using LVLMs. Our results suggest that using a combination of our answerability score and a standard machine translation metric is likely a more reliable way to evaluate the performance in this task.
We propose a method that extends a BART-based language generator using a plug-and-play model to control the rhetorical structure of generated text. Our approach considers rhetorical relations between clauses and generates sentences that reflect this structure using plug-and-play language models. We evaluated our method using the Newsela corpus, which consists of texts at various levels of English proficiency. Our experiments demonstrated that our method outperforms the vanilla BART in terms of the correctness of output discourse and rhetorical structures. In existing methods, the rhetorical structure tends to deteriorate when compared to the baseline, the vanilla BART, as measured by n-gram overlap metrics such as BLEU. However, our proposed method does not exhibit this significant deterioration, demonstrating its advantage.
This paper critically examines the arithmetic capabilities of Large Language Models (LLMs), uncovering significant limitations in their performance. Our research reveals a notable decline in accuracy for complex calculations involving large numbers, with addition and subtraction tasks showing varying degrees of proficiency. Additionally, we challenge the notion that arithmetic is language-independent, finding up to a 10% difference in performance across twenty languages. The study also compares self-verification methods with cross-agent collaborations, showing that a single model often outperforms collaborative approaches in basic arithmetic tasks. These findings suggest a need to reassess the effectiveness of LLMs in tasks requiring numerical accuracy and precision.
We introduce a simple yet effective Prior Knowledge-Guided ADVersarial Training (PKG-ADV) algorithm to improve adversarial training for natural language understanding. Our method simply utilizes task-specific label distribution to guide the training process. By prioritizing the use of prior knowledge of labels, we aim to generate more informative adversarial perturbations. We apply our model to several challenging temporal reasoning tasks. Our method enables a more reliable and controllable data training process than relying on randomized adversarial perturbation. Albeit simple, our method achieved significant improvements in these tasks. To facilitate further research, we will release the code and models.
The task of quote attribution seeks to pair textual utterances with the name of their speakers. Despite continuing research efforts on the task, models are rarely evaluated systematically against previous models in comparable settings on the same datasets. This has resulted in a poor understanding of the relative strengths and weaknesses of various approaches. In this work we formalize the task of quote attribution, and in doing so, establish a basis of comparison across existing models. We present an exhaustive benchmark of known models, including natural extensions to larger LLM base models, on all available datasets in both English and Chinese. Our benchmarking results reveal that the CEQA model attains state-of-the-art performance among all supervised methods, and ChatGPT, operating in a four-shot setting, demonstrates performance on par with or surpassing that of supervised methods on some datasets. Detailed error analysis identify several key factors contributing to prediction errors.
Numbers have unique characteristics to words. Teaching models to understand numbers in text is an open-ended research question. Instead of discussing the required calculation skills, this paper focuses on a more fundamental topic: understanding numerals. We point out that innumeracy—the inability to handle basic numeral concepts—exists in most pretrained language models (LMs), and we propose a method to solve this issue by exploring the notation of numbers. Further, we discuss whether changing notation and pre-finetuning along with the comparing-number task can improve performance in three benchmark datasets containing quantitative-related tasks. The results of this study indicate that input reframing and the proposed pre-finetuning task is useful for RoBERTa.
This paper explores the task of Temporal Video Grounding (TVG) where, given an untrimmed video and a query sentence, the goal is to recognize and determine temporal boundaries of action instances in the video described by natural language queries. Recent works tackled this task by improving query inputs with large pre-trained language models (PLM), at the cost of more expensive training. However, the effects of this integration are unclear, as these works also propose improvements in the visual inputs. Therefore, this paper studies the role of query sentence representation with PLMs in TVG and assesses the applicability of parameter-efficient training with NLP adapters. We couple popular PLMs with a selection of existing approaches and test different adapters to reduce the impact of the additional parameters. Our results on three challenging datasets show that, with the same visual inputs, TVG models greatly benefited from the PLM integration and fine-tuning, stressing the importance of the text query representation in this task. Furthermore, adapters were an effective alternative to full fine-tuning, even though they are not tailored to our task, allowing PLM integration in larger TVG models and delivering results comparable to SOTA models. Finally, our results shed light on which adapters work best in different scenarios.
We explore the idea of incorporating concepts from writing skills curricula into human-machine collaborative writing scenarios, focusing on adding writing modes as a control for text generation models. Using crowd-sourced workers, we annotate a corpus of narrative text paragraphs with writing mode labels. Classifiers trained on this data achieve an average accuracy of ~87% on held-out data. We fine-tune a set of large language models to condition on writing mode labels, and show that the generated text is recognized as belonging to the specified mode with high accuracy. To study the ability of writing modes to provide fine-grained control over generated text, we devise a novel turn-based text reconstruction game to evaluate the difference between the generated text and the author’s intention. We show that authors prefer text suggestions made by writing mode-controlled models on average 61.1% of the time, with satisfaction scores 0.5 higher on a 5-point ordinal scale. When evaluated by humans, stories generated via collaboration with writing mode-controlled models achieve high similarity with the professionally written target story. We conclude by identifying the most common mistakes found in the generated stories.
Live commentaries are essential for enhancing spectators’ enjoyment and understanding during sports events or e-sports streams. We introduce a live audio commentator system designed specifically for a racing game, driven by the high demand in the e-sports field. While a player is playing a racing game, our system tracks real-time user play data including speed and steer rotations, and generates commentary to accompany the live stream. Human evaluation suggested that generated commentary enhances enjoyment and understanding of races compared to streams without commentary. Incorporating additional modules to improve diversity and detect irregular events, such as course-outs and collisions, further increases the preference for the output commentaries.
Live commentary plays an important role in sports broadcasts and video games, making spectators more excited and immersed. In this context, though approaches for automatically generating such commentary have been proposed in the past, they have been generally concerned with specific fields, where it is possible to leverage domain-specific information. In light of this, we propose the task of generating video commentary in an open-domain fashion. We detail the construction of a new large-scale dataset of transcribed commentary aligned with videos containing various human actions in a variety of domains, and propose approaches based on well-known neural architectures to tackle the task. To understand the strengths and limitations of current approaches, we present an in-depth empirical study based on our data. Our results suggest clear trade-offs between textual and visual inputs for the models and highlight the importance of relying on external knowledge in this open-domain setting, resulting in a set of robust baselines for our task.
In recent years, many studies using deep learning have been conducted to elucidate the mechanism of information representation in the brain under stimuli evoked by various modalities. On the other hand, it has not yet been clarified how we humans link information of different modalities in the brain. In this study, to elucidate the relationship between visual and language information in the brain, we constructed encoding models that predict brain activity based on features extracted from the hidden layers of VGG16 for visual information and BERT for language information. We investigated the hierarchical characteristics of cortical localization and representational content of visual and semantic information in the cortex based on the brain activity predicted by the encoding model. The results showed that the cortical localization modeled by VGG16 is getting close to that of BERT as VGG16 moves to higher layers, while the representational contents differ significantly between the two modalities.
In this paper, we specifically look at the image-text retrieval problem. Recent multimodal frameworks have shown that structured inputs and fine-tuning lead to consistent performance improvement. However, this paradigm has been challenged recently with newer Transformer-based models that can reach zero-shot state-of-the-art results despite not explicitly using structured data during pre-training. Since such strategies lead to increased computational resources, we seek to better understand their role in image-text retrieval by analyzing visual and text representations extracted with three multimodal frameworks – SGM, UNITER, and CLIP. To perform such analysis, we represent a single image or text as low-dimensional linear subspaces and perform retrieval based on subspace similarity. We chose this representation as subspaces give us the flexibility to model an entity based on feature sets, allowing us to observe how integrating or reducing information changes the representation of each entity. We analyze the performance of the selected models’ features on two standard benchmark datasets. Our results indicate that heavily pre-training models can already lead to features with critical information representing each entity, with zero-shot UNITER features performing consistently better than fine-tuned features. Furthermore, while models can benefit from structured inputs, learning representations for objects and relationships separately, such as in SGM, likely causes a loss of crucial contextual information needed to obtain a compact cluster that can effectively represent a single entity.
The ability to capture temporal commonsense relationships for time-related events expressed in text is a very important task in natural language understanding. On the other hand, pre-trained language models such as BERT, which have recently achieved great success in a wide range of natural language processing tasks, are still considered to have poor performance in temporal reasoning. In this paper, we focus on the development of language models for temporal commonsense inference over several pre-trained language models. Our model relies on multi-step fine-tuning using multiple corpora, and masked language modeling to predict masked temporal indicators that are crucial for temporal commonsense reasoning. We also experimented with multi-task learning and build a language model that can improve performance on multiple time-related tasks. In our experiments, multi-step fine-tuning using the general commonsense reading task as auxiliary task produced the best results. This result showed a significant improvement in accuracy over standard fine-tuning in the temporal commonsense inference task.
In Japanese, there are different expressions used in speech depending on the speaker’s and listener’s social status, called honorifics. Unlike other languages, Japanese has many types of honorific expressions, and it is vital for machine translation and dialogue systems to handle the differences in meaning correctly. However, there is still no corpus that deals with honorific expressions based on social status. In this study, we developed an honorific corpus (KeiCO corpus) that includes social status information based on Systemic Functional Linguistics, which expresses language use in situations from the social group’s values and common understanding. As a general-purpose language resource, it filled in the Japanese honorific blanks. We expect the KeiCO corpus could be helpful for various tasks, such as improving the accuracy of machine translation, automatic evaluation, correction of Japanese composition and style transformation. We also verified the accuracy of our corpus by a BERT-based classification task.
We propose a multilingual adversarial training model for determining whether a sentence contains an idiomatic expression. Given that a key challenge with this task is the limited size of annotated data, our model relies on pre-trained contextual representations from different multi-lingual state-of-the-art transformer-based language models (i.e., multilingual BERT and XLM-RoBERTa), and on adversarial training, a training method for further enhancing model generalization and robustness. Without relying on any human-crafted features, knowledgebase, or additional datasets other than the target datasets, our model achieved competitive results and ranked 6thplace in SubTask A (zero-shot) setting and 15thplace in SubTask A (one-shot) setting
We propose an ensemble model for predicting the lexical complexity of words and multiword expressions (MWEs). The model receives as input a sentence with a target word or MWE and outputs its complexity score. Given that a key challenge with this task is the limited size of annotated data, our model relies on pretrained contextual representations from different state-of-the-art transformer-based language models (i.e., BERT and RoBERTa), and on a variety of training methods for further enhancing model generalization and robustness: multi-step fine-tuning and multi-task learning, and adversarial training. Additionally, we propose to enrich contextual representations by adding hand-crafted features during training. Our model achieved competitive results and ranked among the top-10 systems in both sub-tasks.
Temporal commonsense reasoning is a challenging task as it requires temporal knowledge usually not explicit in text. In this work, we propose an ensemble model for temporal commonsense reasoning. Our model relies on pre-trained contextual representations from transformer-based language models (i.e., BERT), and on a variety of training methods for enhancing model generalization: 1) multi-step fine-tuning using carefully selected auxiliary tasks and datasets, and 2) a specifically designed temporal masked language model task aimed to capture temporal commonsense knowledge. Our model greatly outperforms the standard fine-tuning approach and strong baselines on the MC-TACO dataset.
We present a simple yet effective Targeted Adversarial Training (TAT) algorithm to improve adversarial training for natural language understanding. The key idea is to introspect current mistakes and prioritize adversarial training steps to where the model errs the most. Experiments show that TAT can significantly improve accuracy over standard adversarial training on GLUE and attain new state-of-the-art zero-shot results on XNLI. Our code will be released upon acceptance of the paper.
We propose a simple yet effective Multi-Layer RAndom Perturbation Training algorithm (RAPT) to enhance model robustness and generalization. The key idea is to apply randomly sampled noise to each input to generate label-preserving artificial input points. To encourage the model to generate more diverse examples, the noise is added to a combination of the model layers. Then, our model regularizes the posterior difference between clean and noisy inputs. We apply RAPT towards robust and efficient BERT training, and conduct comprehensive fine-tuning experiments on GLUE tasks. Our results show that RAPT outperforms the standard fine-tuning approach, and adversarial training method, yet with 22% less training time.
We propose the task of automatically generating commentaries for races in a motor racing game, from vision, structured numerical, and textual data. Commentaries provide information to support spectators in understanding events in races. Commentary generation models need to interpret the race situation and generate the correct content at the right moment. We divide the task into two subtasks: utterance timing identification and utterance generation. Because existing datasets do not have such alignments of data in multiple modalities, this setting has not been explored in depth. In this study, we introduce a new large-scale dataset that contains aligned video data, structured numerical data, and transcribed commentaries that consist of 129,226 utterances in 1,389 races in a game. Our analysis reveals that the characteristics of commentaries change over time or from viewpoints. Our experiments on the subtasks show that it is still challenging for a state-of-the-art vision encoder to capture useful information from videos to generate accurate commentaries. We make the dataset and baseline implementation publicly available for further research.
Existing models for data-to-text tasks generate fluent but sometimes incorrect sentences e.g., “Nikkei gains” is generated when “Nikkei drops” is expected. We investigate models trained on contrastive examples i.e., incorrect sentences or terms, in addition to correct ones to reduce such errors. We first create rules to produce contrastive examples from correct ones by replacing frequent crucial terms such as “gain” or “drop”. We then use learning methods with several losses that exploit contrastive examples. Experiments on the market comment generation task show that 1) exploiting contrastive examples improves the capability of generating sentences with better lexical choice, without degrading the fluency, 2) the choice of the loss function is an important factor because the performances on different metrics depend on the types of loss functions, and 3) the use of the examples produced by some specific rules further improves performance. Human evaluation also supports the effectiveness of using contrastive examples.
We apply small perturbations to word embeddings and minimize the resultant adversarial risk to regularize the model. We exploit a novel combination of two different approaches to estimate these perturbations: 1) using the true label and 2) using the model prediction. Without relying on any human-crafted features, knowledge bases, or additional datasets other than the target datasets, our model boosts the fine-tuning performance of RoBERTa, achieving competitive results on multiple reading comprehension datasets that require commonsense inference.
The Differentiable Neural Computer (DNC), a neural network model with an addressable external memory, can solve algorithmic and question answering tasks. There are various improved versions of DNC, such as rsDNC and DNC-DMS. However, how to integrate structured knowledge into these DNC models remains a challenging research question. We incorporate an architecture for knowledge into such DNC models, i.e. DNC, rsDNC and DNC-DMS, to improve the ability to generate correct responses using both contextual information and structured knowledge. Our improved rsDNC model improves the mean accuracy by approximately 20% to the original rsDNC on tasks requiring knowledge in the dialog bAbI tasks. In addition, our improved rsDNC and DNC-DMS models also yield better performance than their original models in the Movie Dialog dataset.
End-to-end models on data-to-text learn the mapping of data and text from the aligned pairs in the dataset. However, these alignments are not always obtained reliably, especially for the time-series data, for which real time comments are given to some situation and there might be a delay in the comment delivery time compared to the actual event time. To handle this issue of possible noisy alignments in the dataset, we propose a neural network model with multi-timestep data and a copy mechanism, which allows the models to learn the correspondences between data and text from the dataset with noisier alignments. We focus on generating market comments in Japanese that are delivered each time an event occurs in the market. The core idea of our approach is to utilize multi-timestep data, which is not only the latest market price data when the comment is delivered, but also the data obtained at several timesteps earlier. On top of this, we employ a copy mechanism that is suitable for referring to the content of data records in the market price data. We confirm the superiority of our proposal by two evaluation metrics and show the accuracy improvement of the sentence generation using the time series data by our proposed method.
Temporal relation classification is the pair-wise task for identifying the relation of a temporal link (TLINKs) between two mentions, i.e. event, time and document creation time (DCT). It leads to two crucial limits: 1) Two TLINKs involving a common mention do not share information. 2) Existing models with independent classifiers for each TLINK category (E2E, E2T and E2D) hinder from using the whole data. This paper presents an event centric model that allows to manage dynamic event representations across multiple TLINKs. Our model deals with three TLINK categories with multi-task learning to leverage the full size of data. The experimental results show that our proposal outperforms state-of-the-art models and two strong transfer learning baselines on both the English and Japanese data.
We propose a data-to-text generation model with two modules, one for tracking and the other for text generation. Our tracking module selects and keeps track of salient information and memorizes which record has been mentioned. Our generation module generates a summary conditioned on the state of tracking module. Our proposed model is considered to simulate the human-like writing process that gradually selects the information by determining the intermediate variables while writing the summary. In addition, we also explore the effectiveness of the writer information for generations. Experimental results show that our proposed model outperforms existing models in all evaluation metrics even without writer information. Incorporating writer information further improves the performance, contributing to content planning and surface realization.
We propose a data-to-document generator that can easily control the contents of output texts based on a neural language model. Conventional data-to-text model is useful when a reader seeks a global summary of data because it has only to describe an important part that has been extracted beforehand. However, because depending on users, it differs what they are interested in, so it is necessary to develop a method to generate various summaries according to users’ interests. We develop a model to generate various summaries and to control their contents by providing the explicit targets for a reference to the model as controllable factors. In the experiments, we used five-minute or one-hour charts of 9 indicators (e.g., Nikkei225), as time-series data, and daily summaries of Nikkei Quick News as textual data. We conducted comparative experiments using two pieces of information: human-designed topic labels indicating the contents of a sentence and automatically extracted keywords as the referential information for generation.
Comments on a stock market often include the reason or cause of changes in stock prices, such as “Nikkei turns lower as yen’s rise hits exporters.” Generating such informative sentences requires capturing the relationship between different resources, including a target stock price. In this paper, we propose a model for automatically generating such informative market comments that refer to external resources. We evaluated our model through an automatic metric in terms of BLEU and human evaluation done by an expert in finance. The results show that our model outperforms the existing model both in BLEU scores and human judgment.