Abstract
More than half a century has passed since the first million-word corpus, the Brown Corpus, was constructed. However, a lack of resources still poses a challenge to corpus construction and language processing for low-resource languages. As a result, corpora for most local languages, such as Hokkien in Taiwan, are still under development. In this paper, a Chinese segmenter based on syntactic analysis is adapted with a Hokkien dictionary (from Taiwan’s Ministry of Education) to segment words in Hokkien texts. The proposed approach achieves an accuracy of 87.50% in correctly separating the lemmas from the example sentences in the dictionary, and an average performance of 92.78% when tested on six additional unseen Hokkien e-paper articles. The source code of the proposed Hokkien segmenter is released together with an online corpus of Hokkien word segmentation. The segmenter can be used to construct a larger Hokkien corpus with segmented words, and it can be applied to the segmentation of other less-resourced languages in the Chinese language family, including Hakka.
Acknowledgments
The author would like to thank Professor Yueh-chin Chang of National Tsing Hua University for her discussion of this work. This article was also written with AI assistance.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Shih, Mh. (2024). The Application of Chinese Word Segmentation to Less-Resourced Language Processing. In: Dong, M., Hong, JF., Lin, J., Jin, P. (eds) Chinese Lexical Semantics. CLSW 2023. Lecture Notes in Computer Science, vol 14515. Springer, Singapore. https://doi.org/10.1007/978-981-97-0586-3_14
DOI: https://doi.org/10.1007/978-981-97-0586-3_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0585-6
Online ISBN: 978-981-97-0586-3
eBook Packages: Computer Science, Computer Science (R0)