2024
Locally Measuring Cross-lingual Lexical Alignment: A Domain and Word Level Perspective
Taelin Karidi | Eitan Grossman | Omri Abend
Findings of the Association for Computational Linguistics: EMNLP 2024
NLP research on aligning lexical representation spaces to one another has so far focused on aligning language spaces in their entirety. However, cognitive science has long focused on a local perspective, investigating whether translation equivalents truly share the same meaning or the extent to which cultural and regional influences result in meaning variations. With recent technological advances and the increasing amounts of available data, the longstanding question of cross-lingual lexical alignment can now be approached in a more data-driven manner. However, developing metrics for the task requires some methodology for comparing metric efficacy. We address this gap and present a methodology for analyzing both synthetic validations and a novel naturalistic validation using lexical gaps in the kinship domain. We further propose new metrics, hitherto unexplored on this task, based on contextualized embeddings. Our analysis spans 16 diverse languages, demonstrating that there is substantial room for improvement with the use of newer language models. Our research paves the way for more accurate and nuanced cross-lingual lexical alignment methodologies and evaluation.
Aligning Alignments: Do Colexification and Distributional Similarity Align as Measures of cross-lingual Lexical Alignment?
Taelin Karidi | Eitan Grossman | Omri Abend
Proceedings of the 28th Conference on Computational Natural Language Learning
The data-driven investigation of the extent to which lexicons of different languages align has mostly fallen into one of two categories: colexification-based and distributional. The two approaches are grounded in distinct methodologies, operate on different assumptions, and are used in diverse ways. This raises two important questions: (a) are there settings in which the predictions of the two approaches can be directly compared? and if so, (b) what is the extent of the similarity and what are its determinants? We offer novel operationalizations for the two approaches in a manner that allows for their direct comparison, and conduct a comprehensive analysis on a diverse set of 16 languages. Our analysis is carried out at different levels of granularity. At the word level, the two methods present different results across the board. However, intriguingly, at the level of semantic domains (e.g., kinship, quantity), the two methods show considerable convergence in their predictions. A detailed comparison of the metrics against a carefully validated dataset of kinship terms shows that the distributional methods likely capture a more fine-grained alignment than their counterpart colexification-based methods, and may thus be more suited for settings where fewer languages are evaluated.
2020
Proceedings of the Second Workshop on Computational Research in Linguistic Typology
Ekaterina Vylomova | Edoardo M. Ponti | Eitan Grossman | Arya D. McCarthy | Yevgeni Berzak | Haim Dubossarsky | Ivan Vulić | Roi Reichart | Anna Korhonen | Ryan Cotterell
SegBo: A Database of Borrowed Sounds in the World’s Languages
Eitan Grossman | Elad Eisen | Dmitry Nikolaev | Steven Moran
Proceedings of the Twelfth Language Resources and Evaluation Conference
Phonological segment borrowing is a process through which languages acquire new contrastive speech sounds as the result of borrowing new words from other languages. Despite the fact that phonological segment borrowing is documented in many of the world’s languages, to date there has been no large-scale quantitative study of the phenomenon. In this paper, we present SegBo, a novel cross-linguistic database of borrowed phonological segments. We describe our data aggregation pipeline and the resulting language sample. We also present two short case studies based on the database. The first deals with the impact of large colonial languages on the sound systems of the world’s languages; the second deals with universals of borrowing in the domain of rhotic consonants.
2018
Coming to Your Senses: on Controls and Evaluation Sets in Polysemy Research
Haim Dubossarsky | Eitan Grossman | Daphna Weinshall
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
The point of departure of this article is the claim that sense-specific vectors provide an advantage over normal vectors due to the polysemy that they presumably represent. This claim is based on performance gains observed in gold standard evaluation tests such as word similarity tasks. We demonstrate that this claim, at least as it is instantiated in prior art, is unfounded in two ways. Furthermore, we provide empirical data and an analytic discussion that may account for the previously reported improved performance. First, we show that ground-truth polysemy degrades performance in word similarity tasks. Therefore word similarity tasks are not suitable as an evaluation test for polysemy representation. Second, random assignment of words to senses is shown to improve performance in the same task. This and additional results point to the conclusion that performance gains as reported in previous work may be an artifact of random sense assignment, which is equivalent to sub-sampling and multiple estimation of word vector representations. Theoretical analysis shows that this may on its own be beneficial for the estimation of word similarity, by reducing the bias in the estimation of the cosine distance.
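The random sense assignment mentioned in the abstract can be pictured with a minimal sketch. It assumes that each occurrence of a target word is relabelled with a pseudo-sense drawn uniformly at random, so that every pseudo-sense vector is later trained on a random sub-sample of the word's contexts. The function name `assign_random_senses`, its parameters, and the toy sentence are hypothetical and are not taken from the paper.

```python
import random


def assign_random_senses(tokens, targets, n_senses=2, seed=0):
    """Relabel each occurrence of a target word with a random pseudo-sense.

    Assumption (not from the paper): senses are drawn uniformly and
    independently for every occurrence.
    """
    rng = random.Random(seed)
    relabelled = []
    for tok in tokens:
        if tok in targets:
            # e.g. "bank" becomes "bank_0" or "bank_1", chosen at random
            relabelled.append(f"{tok}_{rng.randrange(n_senses)}")
        else:
            relabelled.append(tok)
    return relabelled


# Toy usage: only occurrences of "bank" are relabelled.
tokens = "the bank by the river bank was near the bank".split()
relabelled = assign_random_senses(tokens, {"bank"}, n_senses=2, seed=1)
print(relabelled)
```

Training word vectors on the relabelled corpus would then yield several estimates per target word, none of which reflects genuine polysemy, which is the kind of control the abstract describes.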
2017
Outta Control: Laws of Semantic Change and Inherent Biases in Word Representation Models
Haim Dubossarsky | Daphna Weinshall | Eitan Grossman
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
This article evaluates three proposed laws of semantic change. Our claim is that in order to validate a putative law of semantic change, the effect should be observed in the genuine condition but absent or reduced in a suitably matched control condition, in which no change can possibly have taken place. Our analysis shows that the effects reported in recent literature must be substantially revised: (i) the proposed negative correlation between meaning change and word frequency is shown to be largely an artefact of the models of word representation used; (ii) the proposed negative correlation between meaning change and prototypicality is shown to be much weaker than what has been claimed in prior art; and (iii) the proposed positive correlation between meaning change and polysemy is largely an artefact of word frequency. These empirical observations are corroborated by analytical proofs that show that count representations introduce an inherent dependence on word frequency, and thus word frequency cannot be evaluated as an independent factor with these representations.
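The abstract does not say how the matched control condition is constructed. One possible construction, given here only as an assumption, pools the texts of all time periods and redistributes them at random while keeping each period's size, so that no systematic change between periods can remain. The function name `shuffled_control` and the toy corpus are hypothetical.

```python
import random


def shuffled_control(corpus_by_period, seed=0):
    """Build a control corpus in which time periods carry no real signal.

    Assumption (not from the abstract): sentences from all periods are
    pooled, shuffled, and dealt back out with the original period sizes.
    """
    rng = random.Random(seed)
    periods = list(corpus_by_period)
    pooled = [s for p in periods for s in corpus_by_period[p]]
    rng.shuffle(pooled)

    control, start = {}, 0
    for p in periods:
        size = len(corpus_by_period[p])
        control[p] = pooled[start:start + size]
        start += size
    return control


# Toy usage with two periods of different sizes.
genuine = {
    "1900s": ["a gay party", "the mouse ran"],
    "2000s": ["click the mouse", "a broadcast show", "the web page"],
}
control = shuffled_control(genuine, seed=42)
print({p: len(s) for p, s in control.items()})
```

A proposed law would then be tested by measuring the same effect on both `genuine` and `control`: an effect that is just as strong in the control cannot be attributed to semantic change.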