Abstract
Word segmentation, part-of-speech (POS) tagging, and syntactic parsing are three fundamental analysis tasks in Chinese language processing, and they are crucial for downstream tasks such as machine translation and information extraction. To achieve high accuracy on these tasks, treebanks that contain sentences manually annotated with word segmentation, POS tags, and phrase structures are essential. Although large-scale Chinese treebanks exist in the news domain, no such treebanks are available in the scientific domain, which significantly limits the performance of Chinese language processing on scientific text. To address this problem, we annotate the 2nd version of the Chinese treebank in the scientific domain (SCTB-V2). SCTB-V2 contains 12,175 sentences annotated with word segmentation, POS tags, and phrase structures. We conducted Chinese analysis and machine translation experiments on SCTB-V2, and the results show its effectiveness. We release this treebank to promote research on scientific Chinese language processing: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?A%20Chinese%20Treebank%20in%20Scientific%20Domain%20%28SCTB%29
Notes
Statistics from Japan Patent Office.
Because both workers were well trained over the entire two-year annotation period, we did not observe biases introduced by the baseline systems in the final SCTB-V2.
Preliminary experiments show that the union performs better than using either one alone.
Acknowledgements
This work was supported by “Project on Practical Implementation of Japanese to Chinese-Chinese to Japanese Machine Translation,” JST. We sincerely thank Ms. Fumio Hirao and Mr. Teruyasu Ueki, who annotated SCTB-V2. We are grateful to Mr. Frederic Bergeron for developing the SynTree toolkit, which sped up the annotation process. Finally, we thank Dr. Mo Shen for valuable discussions regarding annotation standards.
Cite this article
Chu, C., Mao, Z., Nakazawa, T. et al. SCTB-V2: the 2nd version of the Chinese treebank in the scientific domain. Lang Resources & Evaluation 57, 1389–1403 (2023). https://doi.org/10.1007/s10579-022-09615-2
DOI: https://doi.org/10.1007/s10579-022-09615-2