In this paper we present the Kronieken Corpus, a new digital collection of 204 chronicles written in Dutch/Flemish between 1500 and 1850, which have been scanned, transcribed and annotated with named entities, dates and pages, and, for a smaller part, with sources and attributions. The texts belong to 308 physical volumes and contain between 23 and 24 million words. 107 chronicles, or 178 chronicle volumes, had never been transcribed or published before; they were collected from 39 different archives and libraries in the Netherlands and Belgium and transcribed by volunteers. The result is a unique enriched historical text corpus of original handwritten, non-canonical, non-fiction text by lay people from the early modern period.
This paper presents a new web portal with information about the state of the art of natural language processing tasks in Spanish. It provides information about forums, competitions, tasks and datasets in Spanish that would otherwise be scattered across multiple articles and websites. The portal consists of overview pages, where information can be searched and filtered by several criteria, and individual pages with detailed information and hyperlinks to facilitate navigation. Information has been manually curated from publications that describe competitions and NLP tasks from 2013 until 2023, and will be updated as new tasks appear. A total of 185 tasks and 128 datasets from 94 competitions have been included.
We apply computational stylometric techniques to an 18th-century Dutch chronicle to determine which fragments of the manuscript represent the author's own original work and which show signs of external source use through either direct copying or paraphrasing. Using stylometric methods, the majority of text fragments in the chronicle can be correctly labelled as either the author's own words, direct copies from sources, or paraphrasing. Our results show that clustering text fragments based on stylometric measures is an effective methodology for authorship verification of this document; however, this approach is less effective when personal writing style is masked by author-independent styles or when applied to paraphrased text.
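As a rough illustration of the approach, the sketch below clusters text fragments by character trigram profiles, a common stylometric feature; the fragments, feature choice and cluster count are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch: represent each text fragment by character trigram
# frequencies and group fragments by stylistic similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

fragments = [
    "Heden is de stad in rep en roer geweest ...",    # hypothetical fragment
    "Item, op den 12. dezer maand is verschenen ...", # hypothetical fragment
    "Uit de courant: men meldt uit Parijs dat ...",   # hypothetical fragment
]

# Character n-grams capture spelling and morphological habits that tend to
# persist across topics, which is why they are a frequent choice in stylometry.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(fragments).toarray()

# Cluster fragments; in the paper's setting the clusters would ideally
# separate the author's own prose from copied or paraphrased material.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)
```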
In this paper we present a procedure to extract posts that contain experiential knowledge from Facebook discussions in Dutch, using automated filtering, manual annotations and machine learning. We define guidelines to annotate experiential knowledge and test them on a subset of the data. After several rounds of (re-)annotation, we reach an inter-annotator agreement of K=0.69, which reflects the difficulty of the task. We subsequently discuss inclusion and exclusion criteria to cope with the diverse manifestations of experiential knowledge that are relevant to guideline development.
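For readers unfamiliar with the metric, here is a minimal sketch of computing such an agreement score, assuming the reported K is Cohen's kappa; the labels are invented for illustration.

```python
# Compute chance-corrected agreement between two annotators.
from sklearn.metrics import cohen_kappa_score

# 1 = post contains experiential knowledge, 0 = it does not (hypothetical).
annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

print(cohen_kappa_score(annotator_a, annotator_b))
```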
While the production of information in the European early modern period is a well-researched topic, the question of how people engaged with the information explosion that occurred in early modern Europe remains underexposed. This paper presents the annotations and experiments aimed at exploring whether we can automatically extract media-related information (source, perception, and receiver) from a corpus of early modern Dutch chronicles, in order to gain insight into the mediascape of early modern middle-class people from a historical perspective. In a number of classification experiments with Conditional Random Fields, three categories of features are tested: (i) raw and binary word embedding features, (ii) lexicon features, and (iii) character features. Overall, the classifier that uses raw embeddings performs slightly better. However, given that the best F-scores are around 0.60, we conclude that the machine learning approach needs to be combined with a close reading approach for the results to be useful for answering historical research questions.
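A hedged sketch of a CRF token classifier in this spirit, using the sklearn_crfsuite library; the feature set, tags and training example are invented stand-ins for the paper's actual configuration.

```python
# Sketch of CRF-based sequence labelling for media-related categories.
import sklearn_crfsuite

def token_features(sent, i):
    word = sent[i]
    feats = {
        "lower": word.lower(),          # lexical identity
        "prefix3": word[:3],            # character features
        "suffix3": word[-3:],
        "is_capitalized": word[0].isupper(),
    }
    # In the paper, raw word-embedding dimensions are added as real-valued
    # features; here one would extend `feats` with e.g. {"emb_0": 0.12, ...}.
    return feats

# Toy training data: tokens labelled with a media category (hypothetical tags).
X_train = [[token_features(s, i) for i in range(len(s))]
           for s in [["Men", "las", "in", "de", "courant"]]]
y_train = [["O", "O", "O", "O", "B-SOURCE"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```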
Cross-topic stance detection is the task of automatically detecting stances (pro, against, or neutral) on unseen topics. We successfully reproduce state-of-the-art cross-topic stance detection work (Reimers et al., 2019) and systematically analyze its reproducibility. Our attention then turns to the cross-topic aspect of this work, and to the specificity of topics in terms of vocabulary and socio-cultural context. We ask: to what extent is stance detection topic-independent and generalizable across topics? We compare the model's performance on various unseen topics, and find that topic (e.g. abortion, cloning), class (e.g. pro, con), and their interaction affect the model's performance. We conclude that investigating performance on different topics, and addressing topic-specific vocabulary and context, is a future avenue for cross-topic stance detection.

References: Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, and Iryna Gurevych. 2019. Classification and Clustering of Arguments with Contextualized Word Embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 567–578, Florence, Italy. Association for Computational Linguistics.
In this paper we present the Vaccination Corpus, a corpus of texts related to the online vaccination debate that has been annotated with three layers of perspective information: attribution, claims and opinions. Additionally, events related to the vaccination debate are annotated. The corpus contains 294 documents from the Internet that reflect different views on vaccination. It has been compiled to study the language of online debates, with the final goal of experimenting with methodologies to extract and contrast perspectives in the framework of the vaccination debate.
In this paper we analyze the use of modal verbs in a corpus of texts related to the vaccination debate. Broadly speaking, the vaccination debate centers around whether vaccination is safe, and whether it is morally acceptable to enforce mandatory vaccination. In order to successfully intervene and curb the spread of preventable diseases due to low vaccination rates, health practitioners need to be adequately informed on public perception of the safety and necessity of vaccines. Public perception can relate to the strength of conviction that an individual may have towards a proposition (e.g. ‘one must vaccinate’ versus ‘one should vaccinate’), as well as qualify the type of proposition, be it related to morality (‘government should not interfere in my personal choice’) or related to possibility (‘too many vaccines at once could hurt my child’). Text mining and analysis of modal auxiliaries are economically viable means of gaining insights into these perspectives, particularly on a large scale due to the widespread use of social media and blogs as vehicles of communication.
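A small sketch of what such modal-auxiliary mining could look like with spaCy's part-of-speech tags; the sentences are adapted from the propositions quoted above, and the pipeline choice is an assumption.

```python
# Count modal auxiliaries in vaccination-debate texts.
# Assumes the en_core_web_sm model is installed.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

texts = [
    "One must vaccinate their children.",
    "Too many vaccines at once could hurt my child.",
    "The government should not interfere in my personal choice.",
]

modal_counts = Counter()
for doc in nlp.pipe(texts):
    for token in doc:
        # In spaCy's English models, modal auxiliaries carry the
        # fine-grained Penn Treebank tag "MD" (must, should, could, may, ...).
        if token.tag_ == "MD":
            modal_counts[token.text.lower()] += 1

print(modal_counts)  # e.g. Counter({'must': 1, 'could': 1, 'should': 1})
```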
In this work we address the processing of negation in Spanish. We present a machine learning system that processes negation in Spanish, focusing on two tasks: (i) negation cue detection and (ii) scope identification. The corpus used in the experimental framework is the SFU Corpus. The results for cue detection outperform the state of the art, whereas for scope identification this is the first system to perform the task for Spanish. Moreover, we provide a qualitative error analysis aimed at understanding the limitations of the system and showing which negation cues and scopes are straightforward to predict automatically, and which ones are challenging.
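A toy sketch of the two subtasks on a Spanish sentence, with a small assumed cue lexicon and a crude scope heuristic standing in for the learned models.

```python
# Subtask (i): detect negation cues; subtask (ii): identify their scope.
CUE_LEXICON = {"no", "nunca", "sin", "tampoco"}  # toy subset of Spanish cues

def detect_cues(tokens):
    # Lexicon lookup as a stand-in; the real system learns cue detection.
    return [i for i, t in enumerate(tokens) if t.lower() in CUE_LEXICON]

def naive_scope(tokens, cue_index):
    # Crude heuristic: everything from the cue to the end of the sentence;
    # the actual system predicts scope with machine learning.
    return tokens[cue_index + 1:]

tokens = "La película no me gustó nada".split()
for i in detect_cues(tokens):
    print(tokens[i], "->", naive_scope(tokens, i))
```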
Negation is a universal linguistic phenomenon with a great qualitative impact on natural language processing applications. The availability of corpora annotated with negation is essential to training negation processing systems. Currently, most corpora have been annotated for English, but the presence of other languages on the Internet, such as Chinese or Spanish, grows every day. In this study, we present a review of the corpora annotated with negation information in several languages, with the goal of evaluating which aspects of negation have been annotated and how compatible the corpora are. We conclude that it is very difficult to merge the existing corpora because of differences in the annotation schemes used and, most importantly, in the annotation guidelines: the way in which each corpus was tokenized and which negation elements were annotated. Unlike other well-established tasks such as semantic role labeling or parsing, negation has no standard annotation scheme or guidelines, which hampers progress in its treatment.
Research in Computational Linguistics is dependent on text corpora for training and testing new tools and methodologies. While there exists a plethora of annotated linguistic information, these corpora are often not interoperable without significant manual work. Moreover, these annotations might have evolved into different versions, making it challenging for researchers to know the data's provenance. This paper addresses this issue with a case study on event-annotated corpora, creating a new, more interoperable representation of this data in the form of nanopublications. We demonstrate how linguistic annotations from separate corpora can be reliably linked from the start, and thereby accessed and queried as if they were a single dataset. We describe how such nanopublications can be created and demonstrate how SPARQL queries can be performed to extract interesting content from the new representations. The queries show that information from multiple corpora can be retrieved more easily and effectively because information from different corpora is represented in a uniform data format.
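A hedged sketch of such a query with rdflib; the file name, namespace and predicate names are invented placeholders, since the actual nanopublications define their own vocabulary.

```python
# Query linked annotations across corpora with a single SPARQL query.
from rdflib import ConjunctiveGraph

g = ConjunctiveGraph()
g.parse("event_annotations.trig", format="trig")  # hypothetical file

# Find all text mentions annotated as events, across corpora, in one query.
query = """
PREFIX ex: <http://example.org/annotation/>
SELECT ?mention ?corpus WHERE {
    ?ann ex:annotates ?mention ;
         ex:hasLabel "EVENT" ;
         ex:fromCorpus ?corpus .
}
"""
for mention, corpus in g.query(query):
    print(corpus, mention)
```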
The availability of corpora annotated with negation information is essential to develop negation processing systems in any language. However, such corpora are scarce even for languages like English, and the corpora that are available are small, with annotations that are not always compatible across corpora. In this paper we review the existing corpora annotated with negation in Spanish with the purpose of, first, gathering the information to make it available for other researchers and, second, analyzing how compatible the corpora are and how the linguistic phenomenon has been addressed. Our final aim is to develop a supervised negation processing system for Spanish, for which we need training and test data. Our analysis shows that it will not be possible to merge the small corpora existing for Spanish due to the lack of compatibility in the annotations.
This paper reports on a reimplementation of a system for detecting implicit positive meaning in negated statements. In the original regression experiment, different positive interpretations per negation are scored according to their likelihood. We convert the scores to classes and report our results on both the regression and classification tasks. We show that a baseline taking the mean score or most frequent class is hard to beat because of class imbalance in the dataset. Our error analysis indicates that an approach that takes information structure into account (i.e. which information is new or contrastive) may be promising; this requires looking beyond the syntactic and semantic characteristics of negated statements.
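A minimal sketch of the two baselines mentioned above, with invented training values.

```python
# Mean-score baseline (regression) and majority-class baseline (classification).
from collections import Counter
from statistics import mean

train_scores = [3.2, 4.1, 3.8, 3.5, 4.0]          # hypothetical likelihoods
train_classes = ["likely", "likely", "unlikely",  # hypothetical classes
                 "likely", "likely"]

mean_baseline = mean(train_scores)
majority_baseline = Counter(train_classes).most_common(1)[0][0]

# Every test item receives these constant predictions; with a skewed class
# distribution, such baselines are hard to beat.
print(mean_baseline, majority_baseline)
```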
In this paper we present annotation experiments with three different annotation schemes for the identification of argument components in texts related to the vaccination debate. Identifying the claims about vaccination made by participants in the debate is of great societal interest, as the decision to vaccinate or not has an impact on public health and safety. Since most corpora that have been annotated with argumentation information contain texts that belong to a specific genre and have a well-defined argumentation structure, we needed to adjust the annotation schemes to our corpus, which contains heterogeneous texts from the Web. We started with a complex annotation scheme that had to be simplified due to low IAA. In our final experiment, which focused on annotating claims, annotators reached 57.3% IAA.
In this paper we present ongoing work on annotating negation in Spanish clinical documents. A corpus of anamnesis and radiology reports has been annotated by two domain-expert annotators with negation markers and negated events. The Dice coefficient for inter-annotator agreement is higher than 0.94 for negation markers and higher than 0.72 for negated events. The corpus will be publicly released when the annotation process is finished, constituting the first corpus of Spanish clinical reports annotated with negation available to the NLP community.
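For reference, a minimal sketch of Dice-based agreement over annotated spans; the two annotators' marker sets are invented character-offset spans, not data from the corpus.

```python
# Dice coefficient between two annotators' span sets:
# Dice = 2|A ∩ B| / (|A| + |B|)
def dice(set_a, set_b):
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

annotator_1 = {(10, 12), (45, 48), (90, 95)}  # (start, end) of negation markers
annotator_2 = {(10, 12), (45, 48), (70, 73)}

print(dice(annotator_1, annotator_2))  # 0.67 for this toy example
```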
This paper discusses the need for a dictionary of affixal negations and regular antonyms to facilitate their automatic detection in text. Without such a dictionary, affixal negations are very difficult to detect. In addition, we show that the set of affixal negations is not homogeneous, and that different NLP tasks may require different subsets. A dictionary can store the subtypes of affixal negations, making it possible to select a certain subset or to make inferences on the basis of these subtypes. We take a first step towards creating a negation dictionary by annotating all direct antonym pairs in WordNet using an existing typology of affixal negations. By highlighting some of the issues encountered in this annotation experiment, we hope to provide insight into the necessary steps of building a negation dictionary.
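A hedged sketch of the starting point, collecting direct antonym pairs from WordNet with NLTK; whether this matches the paper's exact extraction procedure is an assumption.

```python
# Gather all direct antonym pairs from WordNet as candidates for annotation.
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

pairs = set()
for synset in wn.all_synsets():
    for lemma in synset.lemmas():
        for ant in lemma.antonyms():  # antonyms() yields direct antonyms only
            pairs.add(tuple(sorted((lemma.name(), ant.name()))))

# Each pair would then be labelled with a typology of affixal negation:
# e.g. ('happy', 'unhappy') is affixal, ('hot', 'cold') is not.
print(len(pairs))
```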
This paper presents a framework and methodology for the annotation of perspectives in text. In the last decade, different aspects of the linguistic encoding of perspectives have been targeted as separate phenomena through different annotation initiatives. We propose an annotation scheme that integrates these different phenomena. We use a multilayered annotation approach, splitting the annotation of different aspects of perspectives into small subsequent subtasks in order to reduce the complexity of the task and to better monitor interactions between layers. Currently, we have included four layers of perspective annotation: events, attribution, factuality and opinion. The annotations are integrated in a formal model called GRaSP, which provides the means to represent instances (e.g. events, entities) and propositions in the (real or assumed) world in relation to their mentions in text. The relation between the source and target of a perspective is then characterized by means of perspective annotations. This enables us to place alternative perspectives on the same entity, event or proposition next to each other.
In this paper we present ConanDoyle-neg, a corpus of stories by Conan Doyle annotated with negation information. The negation cues and their scope, as well as the event or property that is negated, have been annotated by two annotators. Inter-annotator agreement is measured in terms of F-scores at scope level. It is higher for cues (94.88 and 92.77) than for scopes (85.04 and 77.31) and for the negated events (79.23 and 80.67). The corpus is publicly available.
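A minimal sketch of span-level agreement as an F-score, treating one annotator as gold and the other as prediction; the cue spans below are invented offsets.

```python
# F-score over exact span matches between two annotators.
def f_score(gold, pred):
    tp = len(gold & pred)                      # spans both annotators marked
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold_cues = {(3, 5), (17, 20), (41, 44)}   # (start, end) offsets, annotator 1
pred_cues = {(3, 5), (17, 20), (60, 63)}   # annotator 2

print(f_score(gold_cues, pred_cues))  # 0.67 for this toy example
```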
Although numerous forms of Internet communication have emerged in recent years, such as e-mail, blogs, chat rooms and social network environments, balanced corpora of Internet speech with trustworthy meta-information (e.g. age and gender) or linguistic annotations are still limited. In this paper we present a large corpus of Flemish Dutch chat posts collected from the Belgian online social network Netlog. For all of these posts we also acquired the users' profile information, making this corpus a unique resource for computational and sociolinguistic research. However, analyzing such a corpus on a large scale requires NLP tools, e.g. for automatic POS tagging or lemmatization. Because many NLP tools fail to correctly analyze the surface forms of chat language, we propose to normalize this 'anomalous' input into a format suitable for existing NLP solutions for standard Dutch. Additionally, we have annotated a substantial part of the corpus (i.e. the Chatty subset) to provide a gold standard for the evaluation of future approaches to automatic (Flemish) chat language normalization.
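A toy sketch of dictionary-based normalization of chat language into standard Dutch; the lookup table is a tiny invented sample, not the paper's actual normalization method.

```python
# Map chat-speak tokens to standard Dutch before running standard NLP tools.
NORMALIZATION_TABLE = {
    "ff": "even",
    "egt": "echt",
    "ni": "niet",
    "w8": "wacht",
}

def normalize(post):
    # Tokens not in the table pass through unchanged.
    return " ".join(NORMALIZATION_TABLE.get(tok, tok) for tok in post.split())

print(normalize("w8 ff dit is egt ni normaal"))
# -> "wacht even dit is echt niet normaal"
```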
In this paper we present a description of negation cues and their scope in biomedical texts, based on the cues that occur in the BioScope corpus. We provide information about the morphological type of the cue, the characteristics of the scope in relation to the morpho-syntactic features of the cue and of the clause, and the ambiguity level of the cue by describing in which cases certain negation cues do not express negation. Additionally, we provide positive and negative examples per cue from the BioScope corpus. We show that the scope depends mostly on the part-of-speech of the cue and on the syntactic features of the clause. Although several studies have focused on processing negation in biomedical texts, we are not aware of publicly available resources that describe the scope of negation cues in detail. This paper aims at providing information for producing guidelines to annotate corpora with a negation layer, and for building resources that find the scope of negation cues automatically.
In this paper we present the Cast3LB-CoNLL-SemRol corpus, currently the only corpus of Spanish annotated with dependency syntax and semantic roles, together with the tools that have been trained on it: an ensemble of parsers and two dependency-based semantic role labelers, the only semantic role labelers based on dependency syntax available for Spanish at this moment. One system uses information from gold-standard syntax, whereas the other uses information from predicted syntax. The results of the first system (F1 = 86) are comparable to current state-of-the-art results for constituent-based semantic role labeling of Spanish; the results of the second are 11 points lower. This work has been carried out as part of the project Técnicas semiautomáticas para el etiquetado de roles semánticos en corpus del español (semi-automatic techniques for semantic role labeling in Spanish corpora).