DOI: 10.1145/3528114.3528124
research-article

Chinese Word Segmentation with Many Rare Terms in Low-Resource Scenarios

Published: 24 June 2022

Abstract

When constructing domain-specific knowledge graphs, the texts accumulated in a domain can serve as data sources for analysis. In many domains, however, these texts contain many rare terms, which makes generic corpora inapplicable, and no domain-specific corpus is available. Existing Chinese word segmentation (CWS) corpora and methods cannot segment such texts effectively. For texts that lack an applicable corpus, this paper proposes a domain-dictionary-based Chinese word segmentation method built on the BiLSTM-CNN-CRF model. We first manually label a portion of the samples, then combine randomly selected words from the dictionary into the manually labeled sentences to generate pseudo-labeled data, and merge the two to obtain a composite training set. We then preprocess the texts, replacing rare terms with non-segmentable strings to further improve segmentation accuracy. Experimental results show that our approach achieves higher accuracy, recall, and F1 score when segmenting texts with many rare terms in low-resource scenarios. The approach can be applied to Chinese word segmentation in specific domains whose texts contain rare terms.
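The two data-side steps described in the abstract (pseudo-labeled data generation and rare-term replacement) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how dictionary words are combined into sentences, which tag scheme is used, or what the non-segmentable string looks like. The sketch assumes insertion at a random word boundary, the common BMES character tags, and a single Unicode private-use character as the placeholder. The function names `to_bmes`, `make_pseudo_sample`, and `mask_rare_terms` are hypothetical.

```python
import random


def to_bmes(words):
    """Convert a segmented sentence (list of words) to per-character BMES tags.

    BMES is an assumed tag scheme; the abstract does not name one.
    """
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def make_pseudo_sample(words, dictionary, rng):
    """Build one pseudo-labeled sentence from a manually labeled one.

    Assumption: "combining" a dictionary word into a sentence means
    inserting it at a random word boundary.
    """
    term = rng.choice(dictionary)
    pos = rng.randint(0, len(words))  # inclusive on both ends
    return words[:pos] + [term] + words[pos:]


def mask_rare_terms(text, dictionary, placeholder="\ue000"):
    """Replace each dictionary term with one placeholder character.

    Longest terms are tried first so that a term is not split by a
    shorter term it contains. Returns the masked text and the terms
    that were replaced, in order, so they can be restored later.
    """
    terms = sorted((t for t in dictionary if t), key=len, reverse=True)
    out, found, i = [], [], 0
    while i < len(text):
        match = next((t for t in terms if text.startswith(t, i)), None)
        if match is not None:
            out.append(placeholder)
            found.append(match)
            i += len(match)
        else:
            out.append(text[i])
            i += 1
    return "".join(out), found


if __name__ == "__main__":
    rng = random.Random(0)
    labeled = ["我们", "使用", "词典"]      # a manually segmented sentence
    domain_terms = ["阿司匹林", "布洛芬"]   # hypothetical domain dictionary
    pseudo = make_pseudo_sample(labeled, domain_terms, rng)
    print(pseudo, to_bmes(pseudo))
    print(mask_rare_terms("患者服用阿司匹林后好转", domain_terms))
```

Under these assumptions, the composite training set would be the manually labeled sentences plus the outputs of `make_pseudo_sample`, and `mask_rare_terms` would run before segmentation so that each rare term reaches the tagger as one indivisible character.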

Published In

DSDE '22: Proceedings of the 2022 5th International Conference on Data Storage and Data Engineering
February 2022
124 pages
ISBN:9781450395724
DOI:10.1145/3528114

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Chinese word segmentation
  2. Domain-specific knowledge graph
  3. Low-resource scenario
  4. Rare terms

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

DSDE 2022
