The Application of Chinese Word Segmentation to Less-Resourced Language Processing

Meng-hsien Shih (ORCID: orcid.org/0000-0003-2016-3520)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14515)

Included in the following conference series:

Workshop on Chinese Lexical Semantics

Abstract

It has been more than half a century since the first million-word corpus, the Brown Corpus, was constructed. However, the lack of resources has posed a challenge to corpus construction and language processing for low-resource languages. As a result, corpus construction for most local languages, such as Hokkien in Taiwan, is still under development. In this paper, a Chinese segmenter based on syntactic analysis is adapted with a Hokkien dictionary (from Taiwan’s Ministry of Education) to segment words in Hokkien texts. The proposed approach achieves an accuracy of 87.50% in correctly separating the lemmas from the example sentences in the dictionary, and an average performance of 92.78% when tested on six additional unseen Hokkien e-paper articles. The source code of the proposed Hokkien segmenter is released together with an online corpus of Hokkien word segmentation. This segmenter can be used to construct a larger Hokkien corpus with segmented words, and can be applied to the segmentation of other less-resourced languages in the Chinese language family, including Hakka.
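
The idea of adapting a segmenter with a dictionary, and of scoring it by whether each dictionary lemma is separated from its example sentence, can be illustrated with a minimal sketch. This is not the chapter's method: the segmenter described above is based on syntactic analysis, whereas the sketch substitutes plain greedy forward maximum matching as a simple stand-in. The function names, the toy lexicon, and the toy example pairs are all hypothetical and chosen only for illustration.

```python
def segment(text, lexicon, max_len=None):
    """Greedy forward maximum matching (an assumed stand-in, not the chapter's
    syntax-based segmenter): at each position take the longest lexicon entry,
    falling back to a single character when nothing matches."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink toward one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def lemma_separation_accuracy(examples, lexicon):
    """Share of (lemma, sentence) pairs in which the lemma appears as its own token."""
    if not examples:
        return 0.0
    hits = sum(1 for lemma, sentence in examples
               if lemma in segment(sentence, lexicon))
    return hits / len(examples)


# Hypothetical toy data: two lexicon entries and one example sentence used twice.
toy_lexicon = {"食飯", "欲"}
toy_examples = [("食飯", "我欲食飯"), ("欲", "我欲食飯")]

tokens = segment("我欲食飯", toy_lexicon)
accuracy = lemma_separation_accuracy(toy_examples, toy_lexicon)
print(tokens, accuracy)
```

In this toy setting every lemma is in the lexicon, so the score is trivially perfect; the figures reported in the abstract come from the author's own segmenter and data, not from anything like this sketch.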

Notes

1. https://github.com/mhshih/Taiwanese-Corpora.github.io.

Acknowledgments

The author would like to thank Professor Yueh-chin Chang of National Tsing Hua University for her discussion of this work. This article was also written with AI assistance.

Author information

Authors and Affiliations

National United University, Miaoli, 360302, Taiwan
Meng-hsien Shih

Corresponding author

Correspondence to Meng-hsien Shih.

Editor information

Editors and Affiliations

Institute for Infocomm Research, Singapore, Singapore
Minghui Dong
National Taiwan Normal University, Taipei, Taiwan
Jia-Fei Hong
Nanyang Technological University, Singapore, Singapore
Jingxia Lin
Leshan Normal University, Leshan, China
Peng Jin

About this paper

Cite this paper

Shih, Mh. (2024). The Application of Chinese Word Segmentation to Less-Resourced Language Processing. In: Dong, M., Hong, JF., Lin, J., Jin, P. (eds) Chinese Lexical Semantics. CLSW 2023. Lecture Notes in Computer Science, vol 14515. Springer, Singapore. https://doi.org/10.1007/978-981-97-0586-3_14

DOI: https://doi.org/10.1007/978-981-97-0586-3_14
Published: 28 February 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0585-6
Online ISBN: 978-981-97-0586-3
eBook Packages: Computer Science, Computer Science (R0)
