Dan Tufiş

Also published as: Dan Tufis, Dan Tufiș

2024

Building a corpus for the anonymization of Romanian jurisprudence
Vasile Păiș | Dan Tufis | Elena Irimia | Verginica Barbu Mititelu
Proceedings of The 18th Linguistic Annotation Workshop (LAW-XVIII)

Access to jurisprudence is of paramount importance for both law professionals (judges, lawyers, law students) and for the larger public. In Romania, the Superior Council of Magistracy holds a large database of jurisprudence from different courts in the country, which is updated daily. However, granting public access requires its anonymization. This paper presents the efforts behind building a corpus for the anonymization process. We present the annotation scheme, the manual annotation methods, and the platform used.

2022

This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus includes 7 monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed and the named entities annotated. The annotations are uniformly provided for each language specific corpus while the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The file format is CoNLL-U Plus format, containing the ten columns specific to the CoNLL-U format and three extra columns specific to our corpora as defined by Varádi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.

Running large-scale pre-trained language models in computationally constrained environments remains a challenging problem yet to be addressed, while transfer learning from these models has become prevalent in Natural Language Processing tasks. Several solutions, including knowledge distillation, network quantization, or network pruning have been previously proposed; however, these approaches focus mostly on the English language, thus widening the gap when considering low-resource languages. In this work, we introduce three light and fast versions of distilled BERT models for the Romanian language: Distil-BERT-base-ro, Distil-RoBERT-base, and DistilMulti-BERT-base-ro. The first two models resulted from the individual distillation of knowledge from two base versions of Romanian BERTs available in literature, while the last one was obtained by distilling their ensemble. To our knowledge, this is the first attempt to create publicly available Romanian distilled BERT models, which were thoroughly evaluated on five tasks: part-of-speech tagging, named entity recognition, sentiment analysis, semantic textual similarity, and dialect identification. Our experimental results argue that the three distilled models offer performance comparable to their teachers, while being twice as fast on a GPU and ~35% smaller. In addition, we further test the similarity between the predictions of our students versus their teachers by measuring their label and probability loyalty, together with regression loyalty - a new metric introduced in this work.
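The loyalty measures mentioned in the abstract above compare a student's predictions with its teacher's rather than with gold labels. The following is a minimal sketch, not the authors' implementation: label loyalty is taken to be the share of examples on which both models predict the same label, and probability loyalty is assumed to be 1 minus the square root of the Jensen-Shannon divergence (base 2) between the two output distributions, a formulation used elsewhere in the compression literature. Regression loyalty is not defined in the abstract and is left out. All function names are hypothetical.

```python
import math
from typing import Sequence


def label_loyalty(teacher_labels: Sequence, student_labels: Sequence) -> float:
    """Fraction of examples where student and teacher predict the same label."""
    if len(teacher_labels) != len(student_labels) or not teacher_labels:
        raise ValueError("need two non-empty prediction lists of equal length")
    matches = sum(t == s for t, s in zip(teacher_labels, student_labels))
    return matches / len(teacher_labels)


def _kl_base2(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def probability_loyalty(teacher_probs: Sequence[float],
                        student_probs: Sequence[float]) -> float:
    """Assumed definition: 1 - sqrt(JS divergence) for one example."""
    mixture = [(t + s) / 2 for t, s in zip(teacher_probs, student_probs)]
    js = 0.5 * _kl_base2(teacher_probs, mixture) + 0.5 * _kl_base2(student_probs, mixture)
    return 1 - math.sqrt(max(js, 0.0))  # clamp tiny negative rounding errors


# Made-up predictions, for illustration only.
agreement = label_loyalty(["A", "B", "A", "C"], ["A", "B", "C", "C"])
identical = probability_loyalty([0.5, 0.5], [0.5, 0.5])
opposite = probability_loyalty([1.0, 0.0], [0.0, 1.0])
```

With base-2 logarithms the Jensen-Shannon divergence lies in [0, 1], so the assumed probability loyalty also stays in [0, 1].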

The work in progress on the CEF Action CURLICAT is presented. The general aim of the Action is to compile curated datasets in seven languages of the consortium in domains of relevance to European Digital Service Infrastructures (DSIs) in order to enhance the eTranslation services.

2020

Collection and Annotation of the Romanian Legal Corpus
Dan Tufiș | Maria Mitrofan | Vasile Păiș | Radu Ion | Andrei Coman
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present the Romanian legislative corpus, which is a valuable linguistic asset for the development of machine translation systems, especially for under-resourced languages. The knowledge that can be extracted from this resource is necessary for a deeper understanding of how law terminology is used and how it can be made more consistent. At this moment the corpus contains more than 140k documents representing the legislative body of Romania. The corpus is processed and annotated at several levels: linguistically (tokenized, lemmatized and POS-tagged), dependency parsed, chunked, with named entities identified, and labeled with IATE terms and EUROVOC descriptors. Each annotated document is in CoNLL-U Plus format with 14 columns: in addition to the standard 10 columns, four further types of annotation were added. Moreover, the repository will be periodically updated as new legislative texts are published. These will be automatically collected and passed to the processing and annotation pipeline. Access to the corpus will be provided through the ELRC infrastructure.
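A file with the ten standard CoNLL-U columns plus four extra ones can be read with a few lines of code, because CoNLL-U Plus files declare their columns in a `# global.columns` header line. The sketch below is only an illustration: the four extra column names (`EX:NE`, `EX:CHUNK`, `EX:IATE`, `EX:EUROVOC`) and the sample token are hypothetical and are not taken from the corpus itself.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U column names.
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_conllu_plus(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of {column name: value} dicts."""
    columns = CONLLU_COLUMNS
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # The header lists every column, standard and extra, in order.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence ids, text, ...)
        elif not line.strip():
            if sentence:  # a blank line closes the current sentence
                yield sentence
                sentence = []
        else:
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Hypothetical extra columns and a made-up token line, for illustration only.
EXTRA = ["EX:NE", "EX:CHUNK", "EX:IATE", "EX:EUROVOC"]
SAMPLE = "\n".join([
    "# global.columns = " + " ".join(CONLLU_COLUMNS + EXTRA),
    "# sent_id = 1",
    "\t".join(["1", "Legea", "lege", "NOUN", "_", "_", "0", "root",
               "_", "_", "O", "B-NP", "_", "_"]),
    "",
])
sentences = list(read_conllu_plus(SAMPLE))
```

Reading the column names from the header rather than hard-coding them lets the same reader handle both 13-column and 14-column variants.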

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.

MWSA Task at GlobaLex 2020: RACAI’s Word Sense Alignment System using a Similarity Measurement of Dictionary Definitions
Vasile Pais | Dan Tufiș | Radu Ion
Proceedings of the 2020 Globalex Workshop on Linked Lexicography

This paper describes RACAI’s word sense alignment system, which participated in the Monolingual Word Sense Alignment shared task organized at the GlobaLex 2020 workshop. We discuss the system architecture and some of the challenges that we faced, and present our results on several of the languages available for the task.

A Processing Platform Relating Data and Tools for Romanian Language
Vasile Păiș | Radu Ion | Dan Tufiș
Proceedings of the 1st International Workshop on Language Technology Platforms

This paper presents RELATE (http://relate.racai.ro), a high-performance natural language platform designed for Romanian language. It is meant both for demonstration of available services, from text-span annotations to syntactic dependency trees as well as playing or automatically synthesizing Romanian words, and for the development of new annotated corpora. It also incorporates the search engines for the large COROLA reference corpus of contemporary Romanian and the Romanian wordnet. It integrates multiple text and speech processing modules and exposes their functionality through a web interface designed for the linguist researcher. It makes use of a scheduler-runner architecture, allowing processing to be distributed across multiple computing nodes. A series of input/output converters allows large corpora to be loaded, processed and exported according to user preferences.

2018

The Reference Corpus of the Contemporary Romanian Language (CoRoLa)
Verginica Barbu Mititelu | Dan Tufiș | Elena Irimia
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

BioRo: The Biomedical Corpus for the Romanian Language
Maria Mitrofan | Dan Tufiş
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A Bird’s-eye View of Language Processing Projects at the Romanian Academy
Dan Tufiș | Dan Cristea
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

RACAI’s Natural Language Processing pipeline for Universal Dependencies
Stefan Daniel Dumitrescu | Tiberiu Boros | Dan Tufis
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper presents RACAI’s approach, experiments and results at the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. We handle raw text and we cover tokenization, sentence splitting, word segmentation, tagging, lemmatization and parsing. All results are reported under strict training, development and testing conditions, in which the corpora provided for the shared task are used “as is”, without any modifications to the composition of the train and development sets.

A data-driven approach to verbal multiword expression detection. PARSEME Shared Task system description paper
Tiberiu Boros | Sonia Pipa | Verginica Barbu Mititelu | Dan Tufis
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

“Multiword expressions” are groups of words acting as a morphologic, syntactic and semantic unit in linguistic analysis. Verbal multiword expressions represent the subgroup of multiword expressions, namely that in which a verb is the syntactic head of the group considered in its canonical (or dictionary) form. All multiword expressions are a great challenge for natural language processing, but the verbal ones are particularly interesting for tasks such as parsing, as the verb is the central element in the syntactic organization of a sentence. In this paper we introduce our data-driven approach to verbal multiword expressions which was objectively validated during the PARSEME shared task on verbal multiword expressions identification. We tested our approach on 12 languages, and we provide detailed information about corpora composition, feature selection process, validation procedure and performance on all languages.

2016

The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language
Dan Tufiș | Verginica Barbu Mititelu | Elena Irimia | Ștefan Daniel Dumitrescu | Tiberiu Boroș
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The article describes the current status of a large national project, CoRoLa, aiming at building a reference corpus for the contemporary Romanian language. Unlike many other national corpora, CoRoLa contains only IPR-cleared texts and speech data, obtained from some of the country’s most representative publishing houses, broadcasting agencies, editorial offices, newspapers and popular bloggers. For the written component 500 million tokens are targeted and for the oral one 300 hours of recordings. The choice of texts is made according to their functional style, domain and subdomain, also with an eye to international practice. A metadata file (following the CMDI model) is associated with each text file. Collected texts are cleaned and transformed into a format compatible with the tools for automatic processing (segmentation, tokenization, lemmatization, part-of-speech tagging). The paper also presents up-to-date statistics about the structure of the corpus almost two years before its official launch. The corpus will be freely available for searching. Users will be able to download the results of their searches and the original files, when this does not conflict with the agreements we have with text providers.

2014

Large SMT data-sets extracted from Wikipedia
Dan Tufiş
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The article presents experiments on mining Wikipedia to extract sentence pairs useful for SMT in three language pairs. Each extracted sentence pair is associated with a cross-lingual lexical similarity score, based on which several evaluations were conducted to estimate the similarity thresholds that allow extraction of the most useful data for training SMT systems for the three language pairs. The experiments showed that for a similarity score higher than 0.7 all sentence pairs in the three language pairs were fully parallel. However, including less parallel sentence pairs (that is, pairs with a lower similarity score) in the training sets showed significant improvements in translation quality (BLEU-based evaluations). The optimized SMT systems were evaluated on unseen test sets also extracted from Wikipedia. As one of the main goals of our work was to help Wikipedia contributors translate (with as little post-editing as possible) new articles from major languages into less-resourced languages and vice versa, we call this type of translation experiment in-genre translation. As in the case of in-domain translation, our evaluations showed that using only in-genre training data for translating new texts of the same genre is better than mixing the training data with out-of-genre (even parallel) texts.
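The threshold selection described in the abstract above can be pictured as a filter over scored sentence pairs. The sketch below is a hedged illustration: the sentence pairs, their scores and the function name are made up, and the similarity measure itself is not reproduced.

```python
from typing import List, Tuple

# (source sentence, target sentence, cross-lingual similarity score)
ScoredPair = Tuple[str, str, float]


def select_pairs(pairs: List[ScoredPair], threshold: float) -> List[ScoredPair]:
    """Keep the sentence pairs whose similarity score reaches the threshold."""
    return [pair for pair in pairs if pair[2] >= threshold]


# Made-up scored pairs, for illustration only.
mined = [
    ("Casa este mare.", "The house is big.", 0.91),
    ("El a plecat ieri.", "He left.", 0.62),
    ("Orașul are un muzeu.", "The town has a museum and a park.", 0.48),
    ("Bună ziua.", "See you tomorrow.", 0.15),
]

# Lowering the threshold trades parallelism for training-set size.
sizes = {t: len(select_pairs(mined, t)) for t in (0.7, 0.5, 0.3)}
```

In an experiment of the kind described, one system would be trained per threshold and the threshold with the best BLEU score on held-out data would be kept.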

CoRoLa — The Reference Corpus of Contemporary Romanian Language
Verginica Barbu Mititelu | Elena Irimia | Dan Tufiș
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present the project of creating CoRoLa, a reference corpus of contemporary Romanian (from 1945 onwards). In the international context, the project finds its place among the initiatives of gathering huge collections of texts, of pre-processing and annotating them at several levels, and also of documenting them with metadata (CMDI). Our project is a joint effort of two institutes of the Romanian Academy. We foresee a corpus of more than 500 million word forms, covering all functional styles of the language. Although the vast majority of texts will be in written form, we target about 300 hours of oral texts, too, obligatorily with associated transcripts. Most of the texts will be from books, while the rest will be harvested from newspapers, booklets, technical reports, etc. The pre-processing includes cleaning the data and harmonising the diacritics, sentence splitting and tokenization. Annotation will be done at a morphological level in a first stage, followed by lemmatization, with the possibility of adding syntactic, semantic and discourse annotation in a later stage. A core of CoRoLa is described in the article. The target users of our corpus will be researchers in linguistics and language processing, teachers of Romanian, and students.

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative's work throughout Europe in order to boost progress and innovation in our field.

News about the Romanian Wordnet
Verginica Barbu Mititelu | Ștefan Daniel Dumitrescu | Dan Tufiș
Proceedings of the Seventh Global Wordnet Conference

2013

Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
Tiberiu Boros | Radu Ion | Dan Tufis
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Wikipedia as an SMT Training Corpus
Dan Tufiș | Radu Ion | Ștefan Dumitrescu | Dan Ștefănescu
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

Romanian to English automatic MT experiments at IWSLT12 – system description paper
Ştefan Daniel Dumitrescu | Radu Ion | Dan Ştefănescu | Tiberiu Boroş | Dan Tufiş
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

The paper presents the system developed by RACAI for the IWSLT 2012 competition, TED task, MT track, Romanian to English translation. We describe the starting baseline phrase-based SMT system, the experiments conducted to adapt the language and translation models and our post-translation cascading system designed to improve the translation without external resources. We further present our attempts at creating a better controlled decoder than the open-source Moses system offers.

ROMBAC: The Romanian Balanced Annotated Corpus
Radu Ion | Elena Irimia | Dan Ştefănescu | Dan Tufiș
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This article describes the collecting, processing and validation of a large balanced corpus for Romanian. The annotation types and structure of the corpus are briefly reviewed. It was constructed at the Research Institute for Artificial Intelligence of the Romanian Academy in the context of an international project (METANET4U). The processing covers tokenization, POS-tagging, lemmatization and chunking. The corpus is in XML format generated by our in-house annotation tools; the corpus encoding schema is XCES compliant and the metadata specification is conformant to the METANET recommendations. To the best of our knowledge, this is the first large and richly annotated corpus for Romanian. ROMBAC is intended to be the foundation of a linguistic environment containing a reference corpus for contemporary Romanian and a comprehensive collection of interoperable processing tools.

Romanian TimeBank: An Annotated Parallel Corpus for Temporal Information
Corina Forăscu | Dan Tufiş
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The paper describes the main steps for the construction, annotation and validation of the Romanian version of the TimeBank corpus. Starting from the English TimeBank corpus, the reference annotated corpus in the temporal domain, we have translated all the 183 English news texts into Romanian and mapped the English annotations onto Romanian, with a success rate of 96.53%. Based on ISO-Time, the emerging standard for representing temporal information, which includes many of the previous annotation schemes, we have evaluated the automatic transfer onto Romanian and, when necessary, corrected the Romanian annotations so that in the end we obtained a 99.18% transfer rate for the TimeML annotations. In very few cases, due to language peculiarities, some original annotations could not be transferred. For the portability of the temporal annotation standard to Romanian, we suggested some additions for the ISO-Time standard, concerning especially the EVENT tag, based on linguistic evidence, the Romanian grammar, and also on the localisations of TimeML to other Romance languages. Future improvements to the Ro-TimeBank will take into consideration all temporal expressions, signals and events in texts, even those with a not very clear temporal anchoring.

Lack of sufficient parallel data for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The ACCURAT project is addressing this issue by researching methods how to improve machine translation systems by using comparable corpora. In this paper we present tools and techniques developed in the ACCURAT project that allow additional data needed for statistical machine translation to be extracted from comparable corpora. We present methods and tools for acquisition of comparable corpora from the Web and other sources, for evaluation of the comparability of collected corpora, for multi-level alignment of comparable corpora and for extraction of lexical and terminological data for machine translation. Finally, we present initial evaluation results on the utility of collected corpora in domain-adapted machine translation and real-life applications.

Cascaded Phrase-Based Statistical Machine Translation Systems
Dan Tufiş | Ștefan Daniel Dumitrescu
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

2011

Experiments with a Differential Semantics Annotation for WordNet 3.0
Dan Tufiş | Dan Ştefănescu
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)

2010

A Differential Semantics Approach to the Annotation of Synsets in WordNet
Dan Tufiş | Dan Ştefănescu
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe a new method for sentiment load annotation of the synsets of a wordnet, along the principles of Osgood's Semantic Differential theory and extending the Kamp and Marx calculus, by taking into account not only the WordNet structure but also the SUMO/MILO (Niles & Pease, 2001) and DOMAINS (Bentivogli et al., 2004) knowledge sources. We discuss the method to annotate all the synsets in PWN2.0, irrespective of their part of speech. As the number of possible factors (semantic oppositions, along which the synsets are ranked) is very large, we also developed an application allowing the text analyst to select the most discriminating factors for the type of text to be analyzed. Once the factors have been selected, the underlying wordnet is marked up on the fly and it can be used for the intended textual analysis. We anticipate that these annotations can be imported into other language wordnets, provided they are aligned to PWN2.0. The method for the synsets annotation generalizes the usual subjectivity mark-up (positive, negative and objective) according to a user-based multi-criteria differential semantics model.

Currently, research infrastructures are being designed and established in many disciplines since they all suffer from an enormous fragmentation of their resources and tools. In the domain of language resources and tools the CLARIN initiative has been funded since 2008 to overcome many of the integration and interoperability hurdles. CLARIN can build on knowledge and work from many projects that were carried out during the last years and wants to build stable and robust services that can be used by researchers. Here service centres will play an important role that have the potential of being persistent and that adhere to criteria as they have been established by CLARIN. In the last year of the so-called preparatory phase these centres are currently developing four use cases that can demonstrate how the various pillars CLARIN has been working on can be integrated. All four use cases fulfil the criteria of being cross-national.

2008

DIAC+: a Professional Diacritics Recovering System
Dan Tufiş | Alexandru Ceauşu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In languages that use diacritical characters, if these special signs are stripped off from a word, the resulting string of characters may not exist in the language, and therefore its normative form is, in general, easy to recover. However, this is not always the case, as the presence or absence of a diacritical sign attached to a base letter of a word which exists in both variants may change its grammatical properties or even its meaning, making the recovery of the missing diacritics a difficult task, not only for a program but sometimes even for a human reader. We describe and evaluate an accurate knowledge-based system for automatic recovery of the missing diacritics in MS-Office documents written in Romanian. For the rare cases when the system is not able to make a reliable decision, it either provides the user with a list of words and their recovery suggestions, or probabilistically chooses one of the possible changes, but leaves a trace (a highlighted comment) on each word whose modification was uncertain.
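The distinction drawn in the abstract above, between stripped words that can be recovered by simple lookup and those that remain ambiguous, can be shown with a toy lexicon. This is only a sketch of the lookup step, not of the DIAC+ system; the four-word lexicon and the function names are hypothetical.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def strip_diacritics(word: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(lexicon: Iterable[str]) -> Dict[str, List[str]]:
    """Map each stripped form to the sorted lexicon words it may stand for."""
    index = defaultdict(set)
    for word in lexicon:
        index[strip_diacritics(word)].add(word)
    return {key: sorted(forms) for key, forms in index.items()}


# Tiny hypothetical lexicon: "peste" (over) and "pește" (fish) collide.
LEXICON = ["mașină", "peste", "pește", "țară"]
INDEX = build_index(LEXICON)


def candidates(stripped_word: str) -> List[str]:
    """Return every lexicon form that matches the diacritic-free input."""
    return INDEX.get(stripped_word, [])


easy = candidates("masina")  # one candidate: recoverable by lookup
hard = candidates("peste")   # two candidates: needs context to decide
```

Words with a single candidate are the easy case; words with several candidates are the ones for which a system needs contextual knowledge or must flag the choice for the user.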

RACAI’s Linguistic Web Services
Dan Tufiş | Radu Ion | Alexandru Ceauşu | Dan Ştefănescu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Nowadays, there are hundreds of Natural Language Processing applications and resources for different languages that are developed and/or used, almost exclusively with a few but notable exceptions, by their creators. Assuming that the right to use a particular application or resource is licensed by the rightful owner, the user is faced with the often not so easy task of interfacing it with his/her own systems. Even if standards are defined that provide a unified way of encoding resources, few are the cases when the resources are actually coded in conformance to the standard (and, at present time, there is no such thing as general NLP application interoperability). Semantic Web came with the promise that the web will be a universal medium for information exchange whatever its content. In this context, the present article outlines a collection of linguistic web services for Romanian and English, developed at the Research Institute for AI for the Romanian Academy (RACAI) which are ready to provide a standardized way of calling particular NLP operations and extract the results without caring about what exactly is going on in the background.

Unsupervised Lexical Acquisition for Part of Speech Tagging
Dan Tufiş | Elena Irimia | Radu Ion | Alexandru Ceauşu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

It is known that POS tagging is not very accurate for unknown words (words which the POS tagger has not seen in the training corpora). Thus, a first step to improve the tagging accuracy would be to extend the coverage of the tagger's learned lexicon. It turns out that, through the use of a simple procedure, one can extend this lexicon without using additional, hard-to-obtain, hand-validated training corpora. The basic idea consists of merely adding new words along with their (correct) POS tags to the lexicon and trying to estimate the lexical distribution of these words according to similar ambiguity classes already present in the lexicon. We present a method for automatically acquiring high-quality POS tagging lexicons based on morphological analysis and generation. Currently, this procedure works on Romanian, for which we have the required paradigmatic generation procedure, but the architecture remains general in the sense that, given appropriate substitutes for the morphological generator and POS tagger, one should obtain similar results.
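The idea of estimating a new word's lexical distribution from words with the same ambiguity class can be sketched as follows. The pooling of counts per ambiguity class and the uniform fallback are assumptions made for this illustration, not the procedure reported in the paper; the lexicon, tags and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable

TagCounts = Dict[str, float]


def class_distributions(lexicon: Dict[str, TagCounts]) -> Dict[FrozenSet[str], TagCounts]:
    """Pool tag counts over all known words sharing an ambiguity class."""
    totals = defaultdict(lambda: defaultdict(float))
    for counts in lexicon.values():
        ambiguity_class = frozenset(counts)  # the set of tags the word can take
        for tag, count in counts.items():
            totals[ambiguity_class][tag] += count
    return {
        amb: {tag: c / sum(tag_counts.values()) for tag, c in tag_counts.items()}
        for amb, tag_counts in totals.items()
    }


def estimate(new_word_tags: Iterable[str],
             distributions: Dict[FrozenSet[str], TagCounts]) -> TagCounts:
    """Give a new word the pooled distribution of its ambiguity class."""
    ambiguity_class = frozenset(new_word_tags)
    if ambiguity_class in distributions:
        return dict(distributions[ambiguity_class])
    # Assumed fallback for an unseen class: spread probability uniformly.
    return {tag: 1 / len(ambiguity_class) for tag in ambiguity_class}


# Made-up tagger lexicon: word -> {tag: frequency in the training corpus}.
known = {"a": {"N": 3, "V": 1}, "b": {"N": 1, "V": 3}, "c": {"N": 5}}
dists = class_distributions(known)
seen_class = estimate({"N", "V"}, dists)
unseen_class = estimate({"A", "N"}, dists)
```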

We present the main findings and preliminary results of an ongoing project aimed at developing a system for collocation extraction based on contextual morpho-syntactic properties. We explored two hybrid extraction methods: the first method applies language-independent statistical techniques followed by linguistic filtering, while the second approach, available only for German, is based on a set of lexico-syntactic patterns to extract collocation candidates. To define extraction and filtering patterns, we studied a specific collocation category, the Verb-Noun constructions, using a model inspired by systemic functional grammar and proposing a three-level analysis: lexical, functional and semantic criteria. From a tagged and lemmatized corpus, we identify some contextual morpho-syntactic properties that help to filter the output of the statistical methods and to extract some potentially interesting VN constructions (complex predicates vs complex predicators). The extracted candidates are validated and classified manually.

2007

RACAI: Meaning Affinity Models
Radu Ion | Dan Tufiş
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains so that the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable to carry out all types of cross-language research, as well as to test and benchmark text analysis software across different languages (for instance for alignment, sentence splitting and term extraction).
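The figure of "190+" language pairs in the abstract above follows from counting unordered pairs of the 20 official languages; the "+" reflects the additional candidate-country languages. A one-line check, using only the count given in the abstract:

```python
import math

official_languages = 20

# Number of unordered language pairs: n * (n - 1) / 2.
language_pairs = math.comb(official_languages, 2)
```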

Tagset Mapping and Statistical Training Data Cleaning-up
Felix Pîrvan | Dan Tufiş
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper describes a general method (as well as its implementation and evaluation) for deriving mapping systems for different tagsets available in existing training corpora (gold standards) for a specific language. For each pair of corpora (tagged with different tagsets), one such mapping system is derived. This mapping system is then used to improve the tagging of each of the two corpora with the tagset of the other (this process will be called cross-tagging). By reapplying the algorithm to the newly obtained corpora, the accuracy of the underlying training corpora can also be improved. Furthermore, comparing the results with the gold standards makes it possible to assess the distributional adequacy of various tagsets used in processing the language in question. Other methods, such as those reported in (Brants, 1995) or (Tufis & Dragomirescu, 2004), assume a subsumption relation between the considered tagsets and aim at minimizing the tagsets by eliminating feature-value redundancy; unlike them, this method is applicable to completely unrelated tagsets. Although the experiments were focused on morpho-syntactic (POS) tagging, the method is applicable to other types of tagging as well.

RoCo-News: A Hand Validated Journalistic Corpus of Romanian
Dan Tufiş | Elena Irimia
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The paper briefly describes the RoCo project and, in detail, one of its first outcomes, the RoCo-News corpus. RoCo-News is a middle-sized journalistic corpus of Romanian, abundant in proper names, numerals and named entities. The initially raw text was first segmented with the MtSeg segmenter, then POS annotated with the TNT tagger. RoCo-News was further lemmatized and validated. Because of limited human resources, time constraints and the dimension of the corpus, hand validation of each individual token was out of the question. The validation stage required a coherent methodology for automatically identifying as many POS annotation and lemmatization errors as possible. The hand validation process was focused on these automatically spotted possible errors. This methodology relied on three main techniques for automatic detection of potential errors: 1. when lemmatizing the corpus, we extracted all the triples that were not found in the word-form lexicon; 2. we checked the correctness of POS annotation for closed class lexical categories, a technique described by (Dickinson & Meurers, 2003); 3. we exploited the hypothesis (Tufiş, 1999) according to which an accurately tagged text, re-tagged with the language model learnt from it (biased evaluation), should have more than 98% of its tokens identically tagged.
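The third technique in the abstract above rests on a simple agreement measurement between the original tags and the tags obtained by re-tagging. The sketch below covers only that comparison step, with a made-up tag sequence; training and running the tagger are outside its scope, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def tag_agreement(original: Sequence[str],
                  retagged: Sequence[str]) -> Tuple[float, List[int]]:
    """Return the share of identically tagged tokens and the differing positions."""
    if len(original) != len(retagged) or not original:
        raise ValueError("need two non-empty tag sequences of equal length")
    disagreements = [i for i, (a, b) in enumerate(zip(original, retagged)) if a != b]
    agreement = 1 - len(disagreements) / len(original)
    return agreement, disagreements


# Made-up tags, for illustration only.
agreement, to_check = tag_agreement(["N", "V", "N", "A"], ["N", "V", "A", "A"])

# Under the hypothesis cited in the abstract, agreement at or below 98%
# suggests that the original tagging contains errors.
suspicious = agreement <= 0.98
```

The positions returned in `to_check` are the natural candidates for hand validation.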

pdf bib abs
Aligning Multilingual Thesauri
Dan Ştefănescu | Dan Tufiş
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The aligning and merging of ontologies with overlapping information is currently one of the most active domains of investigation in the Semantic Web community. Multilingual lexical ontologies and thesauri are fundamental knowledge sources for most NLP projects addressing multilinguality. The alignment of multilingual lexical knowledge sources has various applications, ranging from knowledge acquisition to semantic validation of the interlingual equivalence of presumably the same meaning expressed in different languages. In this paper, we present a general method for aligning ontologies, which was used to align a conceptual thesaurus, lexicalized in 20 languages, with a partial version of it lexicalized in Romanian. The objective of our work was to align the existing terms in the Romanian Eurovoc to the terms in the English Eurovoc and to automatically update the Romanian Eurovoc. The general formulation of the ontology alignment problem was set up along the lines established by the Heterogeneity group of the KnowledgeWeb consortium, but the actual case study was motivated by the needs of a specific NLP project.

pdf bib abs
Dependency-Based Phrase Alignment
Radu Ion | Alexandru Ceauşu | Dan Tufiş
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Phrase alignment is the task that requires the constituent phrases of two halves of a bitext to be aligned. In order to align phrases, one must discover them first, and this article presents a method of aligning phrases that are discovered automatically. Here, the notion of a 'phrase' will be understood as being given by a subtree of a dependency-like structure of a sentence called linkage. To discover phrases, we will make use of two distinct, language-independent methods: the IBM-1 model (Brown et al., 1993) adapted to detect linkages and Constrained Lexical Attraction Models (Ion & Barbu Mititelu, 2006). The methods will be combined and the resulting model will be used to annotate the bitext. The accuracy of phrase alignment will be evaluated by obtaining word alignments from link alignments and then by checking the F-measure of the latter word aligner.

pdf bib abs
Acquis Communautaire Sentence Alignment using Support Vector Machines
Alexandru Ceauşu | Dan Ştefănescu | Dan Tufiş
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Sentence alignment is a task that requires not only accuracy, as possible errors can affect further processing, but also small computational resources and language-pair independence. Although many implementations do not use translation equivalents because they are dependent on the language pair, this feature is required for increasing accuracy. The paper presents a hybrid sentence aligner that has two alignment iterations. The first iteration is based mostly on sentence length, and the second is based on a translation equivalents table estimated from the results of the first iteration. The aligner uses a Support Vector Machine classifier to discriminate between positive and negative examples of sentence pairs.

pdf bib
Improved Lexical Alignment by Combining Multiple Reified Alignments
Dan Tufiş | Radu Ion | Alexandru Ceauşu | Dan Ştefănescu
11th Conference of the European Chapter of the Association for Computational Linguistics

2005

pdf bib
Combined Word Alignments
Dan Tufiş | Radu Ion | Alexandru Ceauşu | Dan Ştefănescu
Proceedings of the ACL Workshop on Building and Using Parallel Texts

2004

pdf bib
An evaluation exercise for Romanian Word Sense Disambiguation
Rada Mihalcea | Vivi Năstase | Timothy Chklovski | Doina Tătar | Dan Tufiş | Florentina Hristea
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

pdf bib
Fine-Grained Word Sense Disambiguation Based on Parallel Corpora, Word Alignment, Word Clustering and Aligned Wordnets
Dan Tufis | Radu Ion | Nancy Ide
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Word Sense Disambiguation as a Wordnets’ Validation Method in Balkanet
Dan Tufis | Radu Ion | Nancy Ide
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Term Translations in Parallel Corpora: Discovery and Consistency Check
Dan Tufis
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib abs
Tiered Tagging Revisited
Dan Tufis | Liviu Dragomirescu
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

In this paper we describe a new baseline tagset induction algorithm, which, unlike the one described in previous work, is fully automatic and produces tagsets with better performance than before. The algorithm is an information-lossless transformation of the MULTEXT-EAST compliant lexical tags (MSD) into a reduced tagset that can be mapped back onto the lexicon tagset fully deterministically. From the baseline tagsets, a corpus linguist, expert in the language in question, may further reduce the tagsets, taking into account the distributional properties of the language. As any further reduction of the baseline tagsets entails losing information, adequate recovery rules should be designed to ensure the final tagging in terms of lexicon encoding. The algorithm is described in detail and the generated baseline tagsets for Czech, English, Estonian, Hungarian, Romanian and Slovenian are evaluated. They are much smaller and systematically ensure better tagging accuracy than the corresponding MSDs.

pdf bib abs
A Methodology and Associated Tools for Building Interlingual Wordnets
Dan Tufis | Eduard Barbu
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

The paper describes the methodology and the tools we developed for the purpose of building a Romanian wordnet. The work is carried out within the BalkaNet European project and is concerned with wordnets for Bulgarian, Czech, Greek, Romanian, Serbian and Turkish, all of them aligned via an interlingual index (ILI) to Princeton Wordnet. The structuring of the wordnets follows the principles adopted in EuroWordNet. In order to ensure maximal cross-lingual lexical coverage, the consortium decided to implement the same concepts, represented by a common set of ILI concepts. We describe the selection of concepts to be implemented in all the monolingual wordnets. The methodologies adopted by each partner were different and depended on the language resources and personnel available. For the Romanian wordnet, we decided that it should be based on the reference lexicographic descriptions of Romanian which we had in electronic form: EXPD, a heavily XML annotated explanatory dictionary (developed in the previous CONCEDE project and based on the standard Explanatory Dictionary of Romanian); SYND, a published dictionary of synonyms which we keyboarded, encoded and completed with more than 4000 new synonymy sets extracted from EXPD; and EnRoD, a Romanian-English dictionary, most of it extracted automatically from parallel corpora and further hand validated and extended. Besides these monolingual resources, like all the other members of the consortium, we had at our disposal the interlingual mapping of the Princeton Wordnet. All the above-mentioned resources have been incorporated into a user-friendly system, WnBuilder, which allows for cooperative work by a large number of lexicographers. When the distributed work is put together, the synsets are validated. Several errors show up, the most frequent and difficult to solve being the case of a literal with the same sense number appearing in different synsets. We discuss reasons for such conflicts as well as their correction, supported by another utility program called WnCorrector. The full paper presents WnBuilder and WnCorrector, as well as the status of the Romanian wordnet development.
