The Application of Chinese Word Segmentation to Less-Resourced Language Processing

  • Conference paper
Chinese Lexical Semantics (CLSW 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14515)

Abstract

It has been more than half a century since the first million-word Brown Corpus was constructed. However, a lack of resources has posed a challenge to corpus construction and language processing for low-resource languages. As a result, corpus construction for most local languages, such as Hokkien in Taiwan, is still under development. In this paper, a Chinese segmenter based on syntactic analysis is adapted with a Hokkien dictionary (from Taiwan’s Ministry of Education) to segment words in Hokkien texts. The proposed approach achieves an accuracy of 87.50% in correctly separating the lemmas from the example sentences in the dictionary, and an average performance of 92.78% when tested on six additional unseen Hokkien e-paper articles. The source code of the proposed Hokkien segmenter is released together with an online corpus of Hokkien word segmentation. This segmenter can be used to construct a larger Hokkien corpus with segmented words, and can be applied to the segmentation of other less-resourced languages in the Chinese language family, including Hakka.
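
To make the dictionary-adaptation idea concrete, here is a minimal sketch of segmenting text against a word list and then checking how often a dictionary lemma comes out as its own token in an example sentence. It is not the paper's method: the paper adapts a syntax-based Chinese segmenter, whereas this sketch substitutes plain greedy forward maximum matching. The function names `segment` and `lemma_accuracy`, the toy lexicon, and the example sentences are all assumptions made for illustration, and the accuracy definition is only one plausible reading of "separating the lemmas from the example sentences".

```python
from typing import Iterable, List, Set, Tuple


def segment(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy forward maximum matching (a stand-in, not the paper's segmenter)."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def lemma_accuracy(examples: Iterable[Tuple[str, str]], lexicon: Set[str]) -> float:
    """Share of (lemma, sentence) pairs in which the lemma is its own token."""
    pairs = list(examples)
    if not pairs:
        return 0.0
    hits = sum(1 for lemma, sentence in pairs if lemma in segment(sentence, lexicon))
    return hits / len(pairs)


# Toy lexicon and examples, invented for illustration only.
lexicon = {"食飯", "我", "欲", "去", "學校"}
examples = [("食飯", "我欲食飯"), ("學校", "我欲去學校")]

print(segment("我欲食飯", lexicon))
print(lemma_accuracy(examples, lexicon))
```

Greedy matching is used here only because it is short and easy to follow; a real evaluation would need the actual dictionary entries and would have to handle ambiguous or overlapping words, which this sketch does not.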

Notes

  1. https://github.com/mhshih/Taiwanese-Corpora.github.io

References

  1. Kucera, H., Francis, W.N.: Computational analysis of present-day American English. Brown University Press, Providence (1967)

  2. Chen, P.-J., et al.: Speech-to-Speech Translation for A Real-world Unwritten Language (2022). https://doi.org/10.48550/arxiv.2211.06474

  3. Davies, M.: The corpus of contemporary American English as the first reliable monitor corpus of English. Literary Linguist. Comput. 25, 447–464 (2010)

  4. Jiang, S., Fu, Y., Lin, N.: Construction and Evaluation of Chinese Word Segmentation Datasets in Malay Archipelago. In: Dong, M., Gu, Y., Hong, J.-F. (eds.) Chinese Lexical Semantics, pp. 172–183. Springer International Publishing, Cham (2022)

  5. Gao, Z.-M.: Preliminary Research on the Construction of National Corpus, Taiwan Ministry of Culture (2020)

  6. Su, C.-T.: Corpus-based Hokkien Analysis and Teaching. Unpublished presentation (2022)

  7. Wang, W., Chen, C., Lee, C., Lai, C., Lin, H.: Articut: Chinese Word Segmentation and POS Tagging System, https://api.droidtown.co, Accessed 11 Oct 2022

  8. Lu, S.-E., Lu, B.-H., Lu, C.-Y., Tzong, R., Tsai, H.: Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien. In: Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 6316–6334. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (2022)

Acknowledgments

The author would like to thank Professor Yueh-chin Chang at National Tsing Hua University for her discussion of this work. This article was also written with AI assistance.

Author information

Corresponding author

Correspondence to Meng-hsien Shih.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Shih, Mh. (2024). The Application of Chinese Word Segmentation to Less-Resourced Language Processing. In: Dong, M., Hong, JF., Lin, J., Jin, P. (eds) Chinese Lexical Semantics. CLSW 2023. Lecture Notes in Computer Science, vol 14515. Springer, Singapore. https://doi.org/10.1007/978-981-97-0586-3_14

  • DOI: https://doi.org/10.1007/978-981-97-0586-3_14

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0585-6

  • Online ISBN: 978-981-97-0586-3

  • eBook Packages: Computer Science, Computer Science (R0)
