Proceedings of the 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024

Harry Bunt, Nancy Ide, Kiyong Lee, Volha Petukhova, James Pustejovsky, Laurent Romary (Editors)

Anthology ID:: 2024.isa-1
Month:: May
Year:: 2024
Address:: Torino, Italia
Venues:: ISA | WS
SIG:
Publisher:: ELRA and ICCL
URL:: https://aclanthology.org/2024.isa-1
DOI:
Bib Export formats:: BibTeX MODS XML EndNote
PDF:: https://aclanthology.org/2024.isa-1.pdf

pdf bib abs
The MEET Corpus: Collocated, Distant and Hybrid Three-party Meetings with a Ranking Task
Ghazaleh Esfandiari-Baiat | Jens Edlund

We introduce the MEET corpus. The corpus was collected with the aim of systematically studying the effects of collocated (physical), remote (digital) and hybrid work meetings on collaborative decision-making. It consists of 10 sessions, where each session contains three recordings: a collocated, a remote and a hybrid meeting between three participants. The participants are working on a different survival ranking task during each meeting. The duration of each meeting ranges from 10 to 18 minutes, resulting in 380 minutes of conversation altogether. We also present the annotation scheme designed specifically to target our research questions. The recordings are currently being transcribed and annotated in accordance with this scheme

pdf bib abs
MSNER: A Multilingual Speech Dataset for Named Entity Recognition
Quentin Meeus | Marie-Francine Moens | Hugo Van hamme

While extensively explored in text-based tasks, Named Entity Recognition (NER) remains largely neglected in spoken language understanding. Existing resources are limited to a single, English-only dataset. This paper addresses this gap by introducing MSNER, a freely available, multilingual speech corpus annotated with named entities. It provides annotations to the VoxPopuli dataset in four languages (Dutch, French, German, and Spanish). We have also releasing an efficient annotation tool that leverages automatic pre-annotations for faster manual refinement. This results in 590 and 15 hours of silver-annotated speech for training and validation, alongside a 17-hour, manually-annotated evaluation set. We further provide an analysis comparing silver and gold annotations. Finally, we present baseline NER models to stimulate further research on this newly available dataset.

pdf bib abs
Attitudes in Diplomatic Speeches: Introducing the CoDipA UNSC 1.0
Mariia Anisimova | Šárka Zikánová

This paper presents CoDipA UNSC 1.0, a Corpus of Diplomatic Attitudes of the United Nations Security Council annotated with the attitude-part of the Appraisal theory. The speeches were manually selected according to topic-related and temporal criteria. The texts were then annotated according to the predefined annotation scenario. The distinguishing features of the diplomatic texts require a modified approach to attitude evaluation, which was implemented and presented in the current work. The corpus analysis has proven diplomatic speeches to be consistently evaluative, offered an overview of the most prominent means of expressing subjectivity in the corpus, and provided the results of the inter-annotator agreement evaluation.

pdf bib abs
Automatic Alignment of Discourse Relations of Different Discourse Annotation Frameworks
Yingxue Fu

Existing discourse corpora are annotated based on different frameworks, which show significant dissimilarities in definitions of arguments and relations and structural constraints. Despite surface differences, these frameworks share basic understandings of discourse relations. The relationship between these frameworks has been an open research question, especially the correlation between relation inventories utilized in different frameworks. Better understanding of this question is helpful for integrating discourse theories and enabling interoperability of discourse corpora annotated under different frameworks. However, studies that explore correlations between discourse relation inventories are hindered by different criteria of discourse segmentation, and expert knowledge and manual examination are typically needed. Some semi-automatic methods have been proposed, but they rely on corpora annotated in multiple frameworks in parallel. In this paper, we introduce a fully automatic approach to address the challenges. Specifically, we extend the label-anchored contrastive learning method introduced by Zhang et al. (2022b) to learn label embeddings during discourse relation classification. These embeddings are then utilized to map discourse relations from different frameworks. We show experimental results on RST-DT (Carlson et al., 2001) and PDTB 3.0 (Prasad et al., 2018).

pdf bib abs
A New Annotation Scheme for the Semantics of Taste
Teresa Paccosi | Sara Tonelli

This paper introduces a new annotation scheme for the semantics of gustatory language in English, which builds upon a previous framework for olfactory language based on frame semantics. The purpose of this annotation framework is to be used for annotating comparable resources for the study of sensory language and to create training datasets for supervised systems aimed at extracting sensory information. Furthermore, our approach incorporates words from specific historical periods, thereby enhancing the framework’s utility for studying language from a diachronic perspective.

pdf bib abs
What to Annotate: Retrieving Lexical Markers of Conspiracy Discourse from an Italian-English Corpus of Telegram Data
Costanza Marini | Elisabetta Jezek

In this age of social media, Conspiracy Theories (CTs) have become an issue that can no longer be ignored. After providing an overview of CT literature and corpus studies, we describe the creation of a 40,000-token English-Italian bilingual corpus of conspiracy-oriented Telegram comments – the Complotto corpus – and the linguistic analysis we performed using the Sketch Engine online platform (Kilgarriff et al., 2010) on our annotated data to identify statistically relevant linguistic markers of CT discourse. Thanks to the platform’s keywords and key terms extraction functions, we were able to assess the statistical significance of the following lexical and semantic phenomena, both cross-linguistically and cross-CT, namely: (1) evidentiality and epistemic modality markers; (2) debunking vocabulary referring to another version of the truth lying behind the official one; (3) the conceptual metaphor INSTITUTIONS ARE ABUSERS. All these features qualify as markers of CT discourse and have the potential to be effectively used for future semantic annotation tasks to develop automatic systems for CT identification.

pdf bib abs
Lightweight Connective Detection Using Gradient Boosting
Mustafa Erolcan Er | Murathan Kurfalı | Deniz Zeyrek

In this work, we introduce a lightweight discourse connective detection system. Employing gradient boosting trained on straightforward, low-complexity features, this proposed approach sidesteps the computational demands of the current approaches that rely on deep neural networks. Considering its simplicity, our approach achieves competitive results while offering significant gains in terms of time even on CPU. Furthermore, the stable performance across two unrelated languages suggests the robustness of our system in the multilingual scenario. The model is designed to support the annotation of discourse relations, particularly in scenarios with limited resources, while minimizing performance loss.

pdf bib abs
Shallow Discourse Parsing on Twitter Conversations
Berfin Aktas | Burak Özmen

We present our PDTB-style annotations on conversational Twitter data, which was initially annotated by Scheffler et al. (2019). We introduced 1,043 new annotations to the dataset, nearly doubling the number of previously annotated discourse relations. Subsequently, we applied a neural Shallow Discourse Parsing (SDP) model to the resulting corpus, improving its performance through retraining with in-domain data. The most substantial improvement was observed in the sense identification task (+19%). Our experiments with diverse training data combinations underline the potential benefits of exploring various data combinations in domain adaptation efforts for SDP. To the best of our knowledge, this is the first application of Shallow Discourse Parsing on Twitter data

pdf bib abs
Search tool for An Event-Type Ontology
Nataliia Petliak | Cristina Fernandéz Alcaina | Eva Fučíková | Jan Hajič | Zdeňka Urešová

This short demo description paper presents a new tool designed for searching an event-type ontology with rich information, demonstrated on the SynSemClass ontology resource. The tool complements a web browser, created by the authors of the SynSemClass ontology previously. Due to the complexity of the resource, the search tool offers possibilities both for a linguistically-oriented researcher as well as for teams working with the resource from a technical point of view, such as building role labeling tools, automatic annotation tools, etc.

pdf bib abs
Tiny But Mighty: A Crowdsourced Benchmark Dataset for Triple Extraction from Unstructured Text
Muhammad Salman | Armin Haller | Sergio J. Rodriguez Mendez | Usman Naseem

In the context of Natural Language Processing (NLP) and Semantic Web applications, constructing Knowledge Graphs (KGs) from unstructured text plays a vital role. Several techniques have been developed for KG construction from text, but the lack of standardized datasets hinders the evaluation of triple extraction methods. The evaluation of existing KG construction approaches is based on structured data or manual investigations. To overcome this limitation, this work introduces a novel dataset specifically designed to evaluate KG construction techniques from unstructured text. Our dataset consists of a diverse collection of compound and complex sentences meticulously annotated by human annotators with potential triples (subject, verb, object). The annotations underwent further scrutiny by expert ontologists to ensure accuracy and consistency. For evaluation purposes, the proposed F-measure criterion offers a robust approach to quantify the relatedness and assess the alignment between extracted triples and the ground-truth triples, providing a valuable tool for evaluating the performance of triple extraction systems. By providing a diverse collection of high-quality triples, our proposed benchmark dataset offers a comprehensive training and evaluation set for refining the performance of state-of-the-art language models on a triple extraction task. Furthermore, this dataset encompasses various KG-related tasks, such as named entity recognition, relation extraction, and entity linking.

pdf bib abs
Less is Enough: Less-Resourced Multilingual AMR Parsing
Bram Vanroy | Tim Van de Cruys

This paper investigates the efficacy of multilingual models for the task of text-to-AMR parsing, focusing on English, Spanish, and Dutch. We train and evaluate models under various configurations, including monolingual and multilingual settings, both in full and reduced data scenarios. Our empirical results reveal that while monolingual models exhibit superior performance, multilingual models are competitive across all languages, offering a more resource-efficient alternative for training and deployment. Crucially, our findings demonstrate that AMR parsing benefits from transfer learning across languages even when having access to significantly smaller datasets. As a tangible contribution, we provide text-to-AMR parsing models for the aforementioned languages as well as multilingual variants, and make available the large corpora of translated data for Dutch, Spanish (and Irish) that we used for training them in order to foster AMR research in non-English languages. Additionally, we open-source the training code and offer an interactive interface for parsing AMR graphs from text.

This paper presents MoCCA, a Model of Comparative Concepts for Aligning Constructicons under development by a consortium of research groups building Constructicons of different languages including Brazilian Portuguese, English, German and Swedish. The Constructicons will be aligned by using comparative concepts (CCs) providing language-neutral definitions of linguistic properties. The CCs are drawn from typological research on grammatical categories and constructions, and from FrameNet frames, organized in a conceptual network. Language-specific constructions are linked to the CCs in accordance with general principles. MoCCA is organized into files of two types: a largely static CC Database file and multiple Linking files containing relations between constructions in a Constructicon and the CCs. Tools are planned to facilitate visualization of the CC network and linking of constructions to the CCs. All files and guidelines will be versioned, and a mechanism is set up to report cases where a language-specific construction cannot be easily linked to existing CCs.

pdf bib abs
ISO 24617-8 Applied: Insights from Multilingual Discourse Relations Annotation in English, Polish, and Portuguese
Aleksandra Tomaszewska | Purificação Silvano | António Leal | Evelin Amorim

The main objective of this study is to contribute to multilingual discourse research by employing ISO-24617 Part 8 (Semantic Relations in Discourse, Core Annotation Schema – DR-core) for annotating discourse relations. Centering around a parallel discourse relations corpus that includes English, Polish, and European Portuguese, we initiate one of the few ISO-based comparative analyses through a multilingual corpus that aligns discourse relations across these languages. In this paper, we discuss the project’s contributions, including the annotated corpus, research findings, and statistics related to the use of discourse relations. The paper further discusses the challenges encountered in complying with the ISO standard, such as defining the scope of arguments and annotating specific relation types like Expansion. Our findings highlight the necessity for clearer definitions of certain discourse relations and more precise guidelines for argument spans, especially concerning the inclusion of connectives. Additionally, the study underscores the importance of ongoing collaborative efforts to broaden the inclusion of languages and more comprehensive datasets, with the objective of widening the reach of ISO-guided multilingual discourse research.

pdf bib abs
Combining semantic annotation schemes through interlinking
Harry Bunt

This paper explores the possibilities of using combinations of different semantic annotation schemes. This is particularly interesting for annotation schemes developed under the umbrella of the ISO Semantic Annotation Framework (ISO 24617), since these schemes were intended to be complementary, providing ways of indicating different semantic information about the same entities. However, there are certain overlaps between the schemes of SemAF parts, due to overlaps of their semantic domains, which are a potential source of inconsistencies. The paper shows how issues relating to inconsistencies can be addressed at the levels of concrete representation, abstract syntax, and semantic interpretation.

pdf bib abs
Fusing ISO 24617-2 Dialogue Acts and Application-Specific Semantic Content Annotations
Andrei Malchanau | Volha Petukhova | Harry Bunt

Accurately annotated data determines whether a modern high-performing AI/ML model will present a suitable solution to a complex task/application challenge, or time and resources are wasted. The more adequate the structure of the incoming data is specified, the more efficient the data is translated to be used by the application. This paper presents an approach to an application-specific dialogue semantics design which integrates the dialogue act annotation standard ISO 24617-2 and various domain-specific semantic annotations. The proposed multi-scheme design offers a plausible and a rather powerful strategy to integrate, validate, extend and reuse existing annotations, and automatically generate code for dialogue system modules. Advantages and possible trade-offs are discussed.

pdf bib abs
Annotation-Based Semantics for Dialogues in the Vox World
Kiyong Lee

This paper aims at enriching Annotation-Based Semantics (ABS) with the notion of small visual worlds, called the Vox worlds, to interpret dialogues in natural language. It attempts to implement classical set-theoretic models with these Vox worlds that serve as interpretation models. These worlds describe dialogue situations while providing background for the visualization of those situations in which these described dialogues take place interactively among dialogue participants, often triggering actions and emotions. The enriched ABS is based on VoxML, a modeling language for visual object conceptual structures (vocs or vox) that constitute the structural basis of visual worlds.

pdf bib abs
Annotating Evaluative Language: Challenges and Solutions in Applying Appraisal Theory
Jiamei Zeng | Min Dong | Alex Chengyu Fang

This article describes a corpus-based experiment to identify the challenges and solutions in the annotation of evaluative language according to the scheme defined in Appraisal Theory (Martin and White, 2005). Originating from systemic functional linguistics, Appraisal Theory provides a robust framework for the analysis of linguistic expressions of evaluation, stance, and interpersonal relationships. Despite its theoretical richness, the practical application of Appraisal Theory in text annotation presents significant challenges, chiefly due to the intricacies of identifying and classifying evaluative expressions within its sub-system of Attitude, which comprises Affect, Judgement, and Appreciation. This study examines these challenges through the annotation of a corpus of editorials related to the Russian-Ukraine conflict and aims to offer practical solutions to enhance the transparency and consistency of the annotation. By refining the annotation process and addressing the subjective nature in the identification and classification of evaluative language, this work represents some timely effort in the annotation of pragmatic knowledge in language resources.

pdf bib abs
Attractive Multimodal Instructions, Describing Easy and Engaging Recipe Blogs
Ielka van der Sluis | Jarred Kiewiet de Jonge

This paper presents a corpus study that extends and generalises an existing annotation model which integrates functional content descriptions delivered via text, pictures and interactive components. The model is used to describe a new corpus with 20 online vegan recipe blogs in terms of their Attractiveness for at least two types of readers: vegan readers and readers interested in a vegan lifestyle. Arguably, these readers value a blog that shows that the target dish is Easy to Make which can be inferred from the number of ingredients, procedural steps and visualised actions, according to an Easy to Read cooking instruction that displays a coherent use of verbal and visual modalities presenting processes and results of the cooking actions involved. Moreover, added value may be attributed to invitations to Engage with the blog content and functionality through which information about the recipe, the author, diet and nutrition can be accessed. Thus, the corpus study merges generalisable annotations of verbal, visual and interaction phenomena to capture the Attractiveness of online vegan recipe blogs to inform reader and user studies and ultimately offer guidelines for authoring effective online multimodal instructions.