SCTB-V2: the 2nd version of the Chinese treebank in the scientific domain

Chenhui Chu ORCID: orcid.org/0000-0001-9848-6384¹,
Zhuoyuan Mao¹,
Toshiaki Nakazawa²,
Daisuke Kawahara³ &
Sadao Kurohashi¹

Abstract

Word segmentation, part-of-speech (POS) tagging, and syntactic parsing are three fundamental analysis tasks in Chinese language processing, and they are crucial for downstream tasks such as machine translation and information extraction. Achieving high accuracy on these tasks requires treebanks: collections of sentences manually annotated with word segmentation, POS tags, and phrase structures. Although large-scale Chinese treebanks exist for the news domain, such treebanks are unavailable for the scientific domain, which significantly limits the performance of Chinese language processing on scientific text. To address this problem, we annotate the 2nd version of the Chinese treebank in the scientific domain (SCTB-V2). SCTB-V2 contains 12,175 sentences annotated with word segmentation, POS tags, and phrase structures. We conducted Chinese analysis and machine translation experiments on SCTB-V2, and the results show its effectiveness. We release this treebank to promote research on scientific Chinese language processing: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?A%20Chinese%20Treebank%20in%20Scientific%20Domain%20%28SCTB%29.
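As an illustration of how the three annotation layers named in the abstract fit together, the following minimal sketch reads one bracketed tree and recovers the word segmentation, the POS tags, and the phrase labels from it. It assumes a Penn-Treebank-style bracketed format; the actual SCTB-V2 file format is not stated here, and the sample sentence, its tags, and the function names `parse_bracketed`, `tagged_words`, and `phrase_labels` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical sample in Penn-style bracketing (not taken from SCTB-V2).
SAMPLE = "(IP (NP (NN 实验) (NN 结果)) (VP (VV 验证) (NP (NN 方法))))"


def parse_bracketed(text: str) -> tuple:
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> tuple:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def is_preterminal(node: tuple) -> bool:
    """A preterminal is a POS tag dominating exactly one word."""
    _, children = node
    return len(children) == 1 and isinstance(children[0], str)


def tagged_words(node: tuple) -> List[Tuple[str, str]]:
    """Return (word, POS) pairs in sentence order."""
    if is_preterminal(node):
        return [(node[1][0], node[0])]
    pairs: List[Tuple[str, str]] = []
    for child in node[1]:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


def phrase_labels(node: tuple) -> List[str]:
    """Return the labels of all phrase-level nodes in pre-order."""
    if is_preterminal(node):
        return []
    labels = [node[0]]
    for child in node[1]:
        if isinstance(child, tuple):
            labels.extend(phrase_labels(child))
    return labels


tree = parse_bracketed(SAMPLE)
words = [word for word, _ in tagged_words(tree)]  # segmentation layer
print(" ".join(words))
print(tagged_words(tree))   # POS layer
print(phrase_labels(tree))  # phrase-structure layer
```

Joining the words without spaces gives back the raw, unsegmented sentence, which is why a single tree can supply training data for all three tasks.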

Notes

https://catalog.ldc.upenn.edu/LDC2005T01.
Statistics from Japan Patent Office.
http://syntree.github.io/index.html.
As both workers were well trained over the entire two-year annotation period, we did not observe biases introduced by the baseline systems in the final SCTB-V2.
https://bitbucket.org/msmoshen/kyotomorph-beta.
https://github.com/slavpetrov/berkeleyparser.
https://github.com/nikitakit/self-attentive-parser.
Preliminary experiments show that the union is better than using either one alone.
http://nlp.cs.nyu.edu/evalb/.
http://lotus.kuee.kyoto-u.ac.jp/ASPEC/.
http://orchid.kuee.kyoto-u.ac.jp/WAT/.
http://ntcir.nii.ac.jp/PatentMT-2/.
http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN.
https://github.com/howardchenhd/Syntax-awared-NMT.
https://catalog.ldc.upenn.edu/LDC2016T13.
http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html.
http://universaldependencies.org.
https://jipsti.jst.go.jp/jazh_zhja_mt/en.html.

Acknowledgements

This work was supported by “Project on Practical Implementation of Japanese to Chinese-Chinese to Japanese Machine Translation,” JST. We sincerely thank Ms. Fumio Hirao and Mr. Teruyasu Ueki, who annotated SCTB-V2. We are grateful to Mr. Frederic Bergeron for developing the SynTree toolkit, which sped up the annotation process. Finally, we thank Dr. Mo Shen for valuable discussions regarding annotation standards.

Author information

Authors and Affiliations

Kyoto University, Kyoto, Japan
Chenhui Chu, Zhuoyuan Mao & Sadao Kurohashi
The University of Tokyo, Tokyo, Japan
Toshiaki Nakazawa
Waseda University, Tokyo, Japan
Daisuke Kawahara

Corresponding author

Correspondence to Chenhui Chu.

About this article

Cite this article

Chu, C., Mao, Z., Nakazawa, T. et al. SCTB-V2: the 2nd version of the Chinese treebank in the scientific domain. Lang Resources & Evaluation 57, 1389–1403 (2023). https://doi.org/10.1007/s10579-022-09615-2

Accepted: 06 September 2022
Published: 15 October 2022
Issue Date: September 2023
DOI: https://doi.org/10.1007/s10579-022-09615-2
