Ben Foley

2024

pdf bib abs
Access Control Framework for Language Collections
Ben Foley | Peter Sefton | Simon Musgrave | Moises Sacal Bonequi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper introduces the licence-based access control framework developed by the Language Data Commons of Australia (LDaCA) for a range of language collections, with examples given of implementation for significant Indigenous and Australian English collections. Language collections may be curated for many reasons, such as documentation for language revival, for research, security or commercial purposes. Some language collections are created with the intention of being “Open Access”; publicly available with no restriction. Other collections require that access be limited to individuals or groups of people, either at the collection level or at the level of individual items, such as a recording. To facilitate access, while respecting the intended access conditions for a collection, or collection items, some form of user identification and authorisation process is typically required. The access control framework described in this paper is based upon descriptions of access conditions in easy-to-read licences which are stored alongside data files in the collections; and is implemented using identity-based authentication and authorisation systems where required. The framework accommodates accessibility needs from unrestricted to extremely limited access, is dynamic, and able to be modified in response to changes in access needs. Storing licences with the data is a significant development in separating language data and access requirements from access infrastructure.

2023

pdf bib abs
Automated speech recognition of Indonesian-English language lessons on YouTube using transfer learning
Zara Maxwell-Smith | Ben Foley
Proceedings of the Second Workshop on NLP Applications to Field Linguistics

Experiments to fine-tune large multilingual models with limited data from a specific domain or setting has potential to improve automatic speech recognition (ASR) outcomes. This paper reports on the use of the Elpis ASR pipeline to fine-tune two pre-trained base models, Wav2Vec2-XLSR-53 and Wav2Vec2-Large-XLSR-Indonesian, with various mixes of data from 3 YouTube channels teaching Indonesian with English as the language of instruction. We discuss our results inferring new lesson audio (22-46% word error rate) in the context of speeding data collection in diverse and specialised settings. This study is an example of how ASR can be used to accelerate natural language research, expanding ethically sourced data in low-resource settings.

2021

pdf bib abs
Developing ASR for Indonesian-English Bilingual Language Teaching
Zara Maxwell-Smith | Ben Foley
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Usage-based analyses of teacher corpora and code-switching (Boztepe, 2003) are an important next stage in understanding language acquisition. Multilingual corpora are difficult to compile and a classroom setting adds pedagogy to the mix of factors which make this data so rich and problematic to classify. Using quantitative methods to understand language learning and teaching is difficult work as the ‘transcription bottleneck’ constrains the size of datasets. We found that using an automatic speech recognition (ASR) toolkit with a small set of training data is likely to speed data collection in this context (Maxwelll-Smith et al., 2020).

2020

pdf bib abs
Applications of Natural Language Processing in Bilingual Language Teaching: An Indonesian-English Case Study
Zara Maxwell-Smith | Simón González Ochoa | Ben Foley | Hanna Suominen
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

Multilingual corpora are difficult to compile and a classroom setting adds pedagogy to the mix of factors which make this data so rich and problematic to classify. In this paper, we set out methodological considerations of using automated speech recognition to build a corpus of teacher speech in an Indonesian language classroom. Our preliminary results (64% word error rate) suggest these tools have the potential to speed data collection in this context. We provide practical examples of our data structure, details of our piloted computer-assisted processes, and fine-grained error analysis. Our study is informed and directed by genuine research questions and discussion in both the education and computational linguistics fields. We highlight some of the benefits and risks of using these emerging technologies to analyze the complex work of language teachers and in education more generally.

pdf bib abs
Scaling Language Data Import/Export with a Data Transformer Interface
Nicholas Buckeridge | Ben Foley
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

This paper focuses on the technical improvement of Elpis, a language technology which assists people in the process of transcription, particularly for low-resource language documentation situations. To provide better support for the diversity of file formats encountered by people working to document the world’s languages, a Data Transformer interface has been developed to abstract the complexities of designing individual data import scripts. This work took place as part of a larger project of code quality improvement and the publication of template code that can be used for development of other language technologies.

2019

pdf bib
Future Directions in Technological Support for Language Documentation
Daan van Esch | Ben Foley | Nay San
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

Venues

lrec1

coling1