Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Garrett Nicolai, Eleanor Chodroff (Editors)

Anthology ID:: 2022.sigmorphon-1
Month:: July
Year:: 2022
Address:: Seattle, Washington
Venue:: SIGMORPHON
SIG:: SIGMORPHON
Publisher:: Association for Computational Linguistics
URL:: https://aclanthology.org/2022.sigmorphon-1
DOI:
Bib Export formats:: BibTeX MODS XML EndNote
PDF:: https://aclanthology.org/2022.sigmorphon-1.pdf

pdf bib
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Garrett Nicolai | Eleanor Chodroff

pdf bib abs
On Building Spoken Language Understanding Systems for Low Resourced Languages
Akshat Gupta

Spoken dialog systems are slowly becoming an integral part of the human experience due to their various advantages over textual interfaces. Spoken language understanding (SLU) systems are fundamental building blocks of spoken dialog systems. But creating SLU systems for low resourced languages is still a challenge. In a large number of low resourced language, we don’t have access to enough data to build automatic speech recognition (ASR) technologies, which are fundamental to any SLU system. Also, ASR based SLU systems do not generalize to unwritten languages. In this paper, we present a series of experiments to explore extremely low-resourced settings where we perform intent classification with systems trained on as low as one data-point per intent and with only one speaker in the dataset. We also work in a low-resourced setting where we do not use language specific ASR systems to transcribe input speech, which compounds the challenge of building SLU systems to simulate a true low-resourced setting. We test our system on Belgian Dutch (Flemish) and English and find that using phonetic transcriptions to make intent classification systems in such low-resourced setting performs significantly better than using speech features. Specifically, when using a phonetic transcription based system over a feature based system, we see average improvements of 12.37% and 13.08% for binary and four-class classification problems respectively, when averaged over 49 different experimental settings.

pdf bib abs
Unsupervised morphological segmentation in a language with reduplication
Simon Todd | Annie Huang | Jeremy Needle | Jennifer Hay | Jeanette King

We present an extension of the Morfessor Baseline model of unsupervised morphological segmentation (Creutz and Lagus, 2007) that incorporates abstract templates for reduplication, a typologically common but computationally underaddressed process. Through a detailed investigation that applies the model to Maori, the ̄ Indigenous language of Aotearoa New Zealand, we show that incorporating templates improves Morfessor’s ability to identify instances of reduplication, and does so most when there are multiple minimally-overlapping templates. We present an error analysis that reveals important factors to consider when applying the extended model and suggests useful future directions.

pdf bib abs
Investigating phonological theories with crowd-sourced data: The Inventory Size Hypothesis in the light of Lingua Libre
Mathilde Hutin | Marc Allassonnière-Tang

Data-driven research in phonetics and phonology relies massively on oral resources, and access thereto. We propose to explore a question in comparative linguistics using an open-source crowd-sourced corpus, Lingua Libre, Wikimedia’s participatory linguistic library, to show that such corpora may offer a solution to typologists wishing to explore numerous languages at once. For the present proof of concept, we compare the realizations of Italian and Spanish vowels (sample size = 5000) to investigate whether vowel production is influenced by the size of the phonemic inventory (the Inventory Size Hypothesis), by the exact shape of the inventory (the Vowel Quality Hypothesis) or by none of the above. Results show that the size of the inventory does not seem to influence vowel production, thus supporting previous research, but also that the shape of the inventory may well be a factor determining the extent of variation in vowel production. Most of all, these results show that Lingua Libre has the potential to provide valuable data for linguistic inquiry.

pdf bib abs
Logical Transductions for the Typology of Ditransitive Prosody
Mai Ha Vu | Aniello De Santo | Hossep Dolatian

Given the empirical landscape of possible prosodic parses, this paper examines the computations required to formalize the mapping from syntactic structure to prosodic structure. In particular, we use logical tree transductions to define the prosodic mapping of ditransitive verb phrases in SVO languages, building off of the typology described in Kalivoda (2018). Explicit formalization of syntax-prosody mapping revealed a number of unanswered questions relating to the fine details of theoretical assumptions behind prosodic mapping.

pdf bib abs
A Masked Segmental Language Model for Unsupervised Natural Language Segmentation
C.m. Downey | Fei Xia | Gina-Anne Levow | Shane Steinert-Threlkeld

We introduce a Masked Segmental Language Model (MSLM) for joint language modeling and unsupervised segmentation. While near-perfect supervised methods have been developed for segmenting human-like linguistic units in resource-rich languages such as Chinese, many of the world’s languages are both morphologically complex, and have no large dataset of “gold” segmentations for supervised training. Segmental Language Models offer a unique approach by conducting unsupervised segmentation as the byproduct of a neural language modeling objective. However, current SLMs are limited in their scalability due to their recurrent architecture. We propose a new type of SLM for use in both unsupervised and lightly supervised segmentation tasks. The MSLM is built on a span-masking transformer architecture, harnessing a masked bidirectional modeling context and attention, as well as adding the potential for model scalability. In a series of experiments, our model outperforms the segmentation quality of recurrent SLMs on Chinese, and performs similarly to the recurrent model on English.

pdf bib abs
Trees probe deeper than strings: an argument from allomorphy
Hossep Dolatian | Shiori Ikawa | Thomas Graf

Linguists disagree on whether morphological representations should be strings or trees. We argue that tree-based views of morphology can provide new insights into morphological complexity even in cases where the posited tree structure closely matches the surface string. Our argument is based on a subregular case study of morphologically conditioned allomorphy, where the phonological form of some morpheme (the target) is conditioned by the presence of some other morpheme (the trigger) somewhere within the morphosyntactic context. The trigger and target can either be linearly adjacent or non-adjacent, and either the trigger precedes the target (inwardly sensitive) or the target precedes the trigger (outwardly sensitive). When formalized as string transductions, the only complexity difference is between local and non-local allomorphy. Over trees, on the other hand, we also see a complexity difference between inwardly sensitive and outwardly sensitive allomorphy. Just as unboundedness assumptions can sometimes tease apart patterns that are equally complex in the finitely bounded case, tree-based representations can reveal differences that disappear over strings.

pdf bib abs
Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi and Nepali
Niyati Bafna | Zdeněk Žabokrtský

Word embeddings are growing to be a crucial resource in the field of NLP for any language. This work introduces a novel technique for static subword embeddings transfer for Indic languages from a relatively higher resource language to a genealogically related low resource language. We primarily work with HindiMarathi, simulating a low-resource scenario for Marathi, and confirm observed trends on Nepali. We demonstrate the consistent benefits of unsupervised morphemic segmentation on both source and target sides over the treatment performed by fastText. Our best-performing approach uses an EM-style approach to learning bilingual subword embeddings; we also show, for the first time, that a trivial “copyand-paste” embeddings transfer based on even perfect bilingual lexicons is inadequate in capturing language-specific relationships. We find that our approach substantially outperforms the fastText baselines for both Marathi and Nepali on the Word Similarity task as well as WordNetBased Synonymy Tests; on the former task, its performance for Marathi is close to that of pretrained fastText embeddings that use three orders of magnitude more Marathi data.

pdf bib abs
Multidimensional acoustic variation in vowels across English dialects
James Tanner | Morgan Sonderegger | Jane Stuart-Smith

Vowels are typically characterized in terms of their static position in formant space, though vowels have also been long-known to undergo dynamic formant change over their timecourse. Recent studies have demonstrated that this change is highly informative for distinguishing vowels within a system, as well as providing additional resolution in characterizing differences between dialects. It remains unclear, however, how both static and dynamic representations capture the main dimensions of vowel variation across a large number of dialects. This study examines the role of static, dynamic, and duration information for 5 vowels across 21 British and North American English dialects, and observes that vowels exhibit highly structured variation across dialects, with dialects displaying similar patterns within a given vowel, broadly corresponding to a spectrum between traditional ‘monophthong’ and ‘diphthong’ characterizations. These findings highlight the importance of dynamic and duration information in capturing how vowels can systematically vary across a large number of dialects, and provide the first large-scale description of formant dynamics across many dialects of a single language.

pdf bib abs
Domain-Informed Probing of wav2vec 2.0 Embeddings for Phonetic Features
Patrick Cormac English | John D. Kelleher | Julie Carson-Berndsen

In recent years large transformer model architectures have become available which provide a novel means of generating high-quality vector representations of speech audio. These transformers make use of an attention mechanism to generate representations enhanced with contextual and positional information from the input sequence. Previous works have explored the capabilities of these models with regard to performance in tasks such as speech recognition and speaker verification, but there has not been a significant inquiry as to the manner in which the contextual information provided by the transformer architecture impacts the representation of phonetic information within these models. In this paper, we report the results of a number of probing experiments on the representations generated by the wav2vec 2.0 model’s transformer component, with regard to the encoding of phonetic categorization information within the generated embeddings. We find that the contextual information generated by the transformer’s operation results in enhanced capture of phonetic detail by the model, and allows for distinctions to emerge in acoustic data that are otherwise difficult to separate.

pdf bib abs
Morphotactic Modeling in an Open-source Multi-dialectal Arabic Morphological Analyzer and Generator
Nizar Habash | Reham Marzouk | Christian Khairallah | Salam Khalifa

Arabic is a morphologically rich and complex language, with numerous dialectal variants. Previous efforts on Arabic morphology modeling focused on specific variants and specific domains using a range of techniques with different degrees of linguistic modeling transparency. In this paper we propose a new approach to modeling Arabic morphology with an eye towards multi-dialectness, resource openness, and easy extensibility and use. We demonstrate our approach by modeling verbs from Standard Arabic and Egyptian Arabic, within a common framework, and with high coverage.

The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams and the best system averaged 97.29% F1 score across all languages, ranging English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian), received 10 system submissions from 3 teams, and the best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support any type of future studies, we released all system predictions, the evaluation script, and all gold standard datasets.

pdf bib abs
Sharing Data by Language Family: Data Augmentation for Romance Language Morpheme Segmentation
Lauren Levine

This paper presents a basic character level sequence-to-sequence approach to morpheme segmentation for the following Romance languages: French, Italian, and Spanish. We experiment with adding a small set of additional linguistic features, as well as with sharing training data between sister languages for morphological categories with low performance in single language base models. We find that while the additional linguistic features were generally not helpful in this instance, data augmentation between sister languages did help to raise the scores of some individual morphological categories, but did not consistently result in an overall improvement when considering the aggregate of the categories.

pdf bib abs
SIGMORPHON 2022 Shared Task on Morpheme Segmentation Submission Description: Sequence Labelling for Word-Level Morpheme Segmentation
Leander Girrbach

We propose a sequence labelling approach to word-level morpheme segmentation. Segmentation labels are edit operations derived from a modified minimum edit distance alignment. We show that sequence labelling performs well for “shallow segmentation” and “canonical segmentation”, achieving 96.06 f1 score (macroaveraged over all languages in the shared task) and ranking 3rd among all participating teams. Therefore, we conclude that sequence labelling is a promising approach to morpheme segmentation.

pdf bib abs
Beyond Characters: Subword-level Morpheme Segmentation
Ben Peters | Andre F. T. Martins

This paper presents DeepSPIN’s submissions to the SIGMORPHON 2022 Shared Task on Morpheme Segmentation. We make three submissions, all to the word-level subtask. First, we show that entmax-based sparse sequence-tosequence models deliver large improvements over conventional softmax-based models, echoing results from other tasks. Then, we challenge the assumption that models for morphological tasks should be trained at the character level by building a transformer that generates morphemes as sequences of unigram language model-induced subwords. This subword transformer outperforms all of our character-level models and wins the word-level subtask. Although we do not submit an official submission to the sentence-level subtask, we show that this subword-based approach is highly effective there as well.

pdf bib abs
Word-level Morpheme segmentation using Transformer neural network
Tsolmon Zundi | Chinbat Avaajargal

This paper presents the submission of team NUM DI to the SIGMORPHON 2022 Task on Morpheme Segmentation Part 1, word-level morpheme segmentation. We explore the transformer neural network approach to the shared task. We develop monolingual models for world-level morpheme segmentation and focus on improving the model by using various training strategies to improve accuracy and generalization across languages.

pdf bib abs
Morfessor-enriched features and multilingual training for canonical morphological segmentation
Aku Rouhe | Stig-Arne Grönroos | Sami Virpioja | Mathias Creutz | Mikko Kurimo

In our submission to the SIGMORPHON 2022 Shared Task on Morpheme Segmentation, we study whether an unsupervised morphological segmentation method, Morfessor, can help in a supervised setting. Previous research has shown the effectiveness of the approach in semisupervised settings with small amounts of labeled data. The current tasks vary in data size: the amount of word-level annotated training data is much larger, but the amount of sentencelevel annotated training data remains small. Our approach is to pre-segment the input data for a neural sequence-to-sequence model with the unsupervised method. As the unsupervised method can be trained with raw text data, we use Wikipedia to increase the amount of training data. In addition, we train multilingual models for the sentence-level task. The results for the Morfessor-enriched features are mixed, showing benefit for all three sentencelevel tasks but only some of the word-level tasks. The multilingual training yields considerable improvements over the monolingual sentence-level models, but it negates the effect of the enriched features.

pdf bib abs
JB132 submission to the SIGMORPHON 2022 Shared Task 3 on Morphological Segmentation
Jan Bodnár

This paper describes the JB132 submission to the SIGMORPHON 2022 Shared Task 3 on Morpheme Segmentation. In this paper we describe probabilistic model trained with the Expectation-Maximization algorithm, we provide the results and analyze sources of errors and general limitations of our approach. The model was implemented within our own modular probabilistic framework.

pdf bib abs
SIGMORPHON–UniMorph 2022 Shared Task 0: Modeling Inflection in Language Acquisition
Jordan Kodner | Salam Khalifa

This year’s iteration of the SIGMORPHONUniMorph shared task on “human-like” morphological inflection generation focuses on generalization and errors in language acquisition. Systems are trained on data sets extracted from corpora of child-directed speech in order to simulate a natural learning setting, and their predictions are evaluated against what is known about children’s developmental trajectories for three well-studied patterns: English past tense, German noun plurals, and Arabic noun plurals. Three submitted neural systems were evaluated together with two baselines. Performance was generally good, and all systems were prone to human-like over-regularization. However, all systems were also prone to non-human-like over-irregularization and nonsense productions to varying degrees. We situate this behavior in a discussion of the Past Tense Debate.

The 2022 SIGMORPHON–UniMorph shared task on large scale morphological inflection generation included a wide range of typologically diverse languages: 33 languages from 11 top-level language families: Arabic (Modern Standard), Assamese, Braj, Chukchi, Eastern Armenian, Evenki, Georgian, Gothic, Gujarati, Hebrew, Hungarian, Itelmen, Karelian, Kazakh, Ket, Khalkha Mongolian, Kholosi, Korean, Lamahalot, Low German, Ludic, Magahi, Middle Low German, Old English, Old High German, Old Norse, Polish, Pomak, Slovak, Turkish, Upper Sorbian, Veps, and Xibe. We emphasize generalization along different dimensions this year by evaluating test items with unseen lemmas and unseen features separately under small and large training conditions. Across the five submitted systems and two baselines, the prediction of inflections with unseen features proved challenging, with average performance decreased substantially from last year. This was true even for languages for which the forms were in principle predictable, which suggests that further work is needed in designing systems that capture the various types of generalization required for the world’s languages.

pdf bib abs
SIGMORPHON 2022 Task 0 Submission Description: Modelling Morphological Inflection with Data-Driven and Rule-Based Approaches
Tatiana Merzhevich | Nkonye Gbadegoye | Leander Girrbach | Jingwen Li | Ryan Soh-Eun Shim

This paper describes our participation in the 2022 SIGMORPHON-UniMorph Shared Task on Typologically Diverse and AcquisitionInspired Morphological Inflection Generation. We present two approaches: one being a modification of the neural baseline encoderdecoder model, the other being hand-coded morphological analyzers using finite-state tools (FST) and outside linguistic knowledge. While our proposed modification of the baseline encoder-decoder model underperforms the baseline for almost all languages, the FST methods outperform other systems in the respective languages by a large margin. This confirms that purely data-driven approaches have not yet reached the maturity to replace trained linguists for documentation and analysis especially considering low-resource and endangered languages.

pdf bib abs
CLUZH at SIGMORPHON 2022 Shared Tasks on Morpheme Segmentation and Inflection Generation
Silvan Wehrli | Simon Clematide | Peter Makarov

This paper describes the submissions of the team of the Department of Computational Linguistics, University of Zurich, to the SIGMORPHON 2022 Shared Tasks on Morpheme Segmentation and Inflection Generation. Our submissions use a character-level neural transducer that operates over traditional edit actions. While this model has been found particularly wellsuited for low-resource settings, using it with large data quantities has been difficult. Existing implementations could not fully profit from GPU acceleration and did not efficiently implement mini-batch training, which could be tricky for a transition-based system. For this year’s submission, we have ported the neural transducer to PyTorch and implemented true mini-batch training. This has allowed us to successfully scale the approach to large data quantities and conduct extensive experimentation. We report competitive results for morpheme segmentation (including sharing first place in part 2 of the challenge). We also demonstrate that reducing sentence-level morpheme segmentation to a word-level problem is a simple yet effective strategy. Additionally, we report strong results in inflection generation (the overall best result for large training sets in part 1, the best results in low-resource learning trajectories in part 2). Our code is publicly available.

pdf bib abs
OSU at SigMorphon 2022: Analogical Inflection With Rule Features
Micha Elsner | Sara Court

OSU’s inflection system is a transformer whose input is augmented with an analogical exemplar showing how to inflect a different word into the target cell. In addition, alignment-based heuristic features indicate how well the exemplar is likely to match the output. OSU’s scores substantially improve over the baseline transformer for instances where an exemplar is available, though not quite matching the challenge winner. In Part 2, the system shows a tendency to over-apply the majority pattern in English, but not Arabic.

pdf bib abs
Generalizing Morphological Inflection Systems to Unseen Lemmas
Changbing Yang | Ruixin (Ray) Yang | Garrett Nicolai | Miikka Silfverberg

This paper presents experiments on morphological inflection using data from the SIGMORPHON-UniMorph 2022 Shared Task 0: Generalization and Typologically Diverse Morphological Inflection. We present a transformer inflection system, which enriches the standard transformer architecture with reverse positional encoding and type embeddings. We further apply data hallucination and lemma copying to augment training data. We train models using a two-stage procedure: (1) We first train on the augmented training data using standard backpropagation and teacher forcing. (2) We then continue training with a variant of the scheduled sampling algorithm dubbed student forcing. Our system delivers competitive performance under the small and large data conditions on the shared task datasets.

pdf bib abs
HeiMorph at SIGMORPHON 2022 Shared Task on Morphological Acquisition Trajectories
Akhilesh Kakolu Ramarao | Yulia Zinova | Kevin Tang | Ruben van de Vijver

This paper presents the submission by the HeiMorph team to the SIGMORPHON 2022 task 2 of Morphological Acquisition Trajectories. Across all experimental conditions, we have found no evidence for the so-called Ushaped development trajectory. Our submitted systems achieve an average test accuracies of 55.5% on Arabic, 67% on German and 73.38% on English. We found that, bigram hallucination provides better inferences only for English and Arabic and only when the number of hallucinations remains low.

pdf bib abs
Morphology is not just a naive Bayes – UniMelb Submission to SIGMORPHON 2022 ST on Morphological Inflection
Andreas Sherbakov | Ekaterina Vylomova

The paper describes the Flexica team’s submission to the SIGMORPHON 2022 Shared Task 1 Part 1: Typologically Diverse Morphological Inflection. Our team submitted a nonneural system that extracted transformation patterns from alignments between a lemma and inflected forms. For each inflection category, we chose a pattern based on its abstractness score. The system outperformed the non-neural baseline, the extracted patterns covered a substantial part of possible inflections. However, we discovered that such score that does not account for all possible combinations of string segments as well as morphosyntactic features is not sufficient for a certain proportion of inflection cases.