SCTB-V2: the 2nd version of the Chinese treebank in the scientific domain

  • Project Notes
Language Resources and Evaluation

Abstract

Word segmentation, part-of-speech (POS) tagging, and syntactic parsing are three fundamental analysis tasks in Chinese language processing, and they are crucial for downstream tasks such as machine translation and information extraction. To achieve high accuracy on these tasks, treebanks containing sentences manually annotated with word segmentation, POS tags, and phrase structures are essential. Although large-scale Chinese treebanks exist in the news domain, such treebanks are unavailable in the scientific domain, which significantly limits the performance of Chinese language processing on scientific text. To address this problem, we annotate the 2nd version of the Chinese treebank in the scientific domain (SCTB-V2). SCTB-V2 contains 12,175 sentences annotated with word segmentation, POS tags, and phrase structures. We conducted Chinese analysis and machine translation experiments on SCTB-V2, and the results show its effectiveness. We release this treebank to promote research on scientific Chinese language processing: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?A%20Chinese%20Treebank%20in%20Scientific%20Domain%20%28SCTB%29.

Notes

  1. https://catalog.ldc.upenn.edu/LDC2005T01.

  2. Statistics from Japan Patent Office.

  3. http://syntree.github.io/index.html.

  4. Because both workers were well trained over the entire two-year annotation period, we did not observe biases introduced by the baseline systems in the final SCTB-V2.

  5. https://bitbucket.org/msmoshen/kyotomorph-beta.

  6. https://github.com/slavpetrov/berkeleyparser.

  7. https://github.com/nikitakit/self-attentive-parser.

  8. Preliminary experiments showed that using the union performs better than using either one alone.

  9. http://nlp.cs.nyu.edu/evalb/.

  10. http://lotus.kuee.kyoto-u.ac.jp/ASPEC/.

  11. http://orchid.kuee.kyoto-u.ac.jp/WAT/.

  12. http://ntcir.nii.ac.jp/PatentMT-2/.

  13. http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN.

  14. https://github.com/howardchenhd/Syntax-awared-NMT.

  15. https://catalog.ldc.upenn.edu/LDC2016T13.

  16. http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html.

  17. http://universaldependencies.org.

  18. https://jipsti.jst.go.jp/jazh_zhja_mt/en.html.

Acknowledgements

This work was supported by “Project on Practical Implementation of Japanese to Chinese-Chinese to Japanese Machine Translation,” JST. We sincerely thank Ms. Fumio Hirao and Mr. Teruyasu Ueki, who annotated SCTB-V2. We are grateful to Mr. Frederic Bergeron for developing the SynTree toolkit, which sped up the annotation process. Finally, we thank Dr. Mo Shen for valuable discussions regarding annotation standards.

Author information

Corresponding author

Correspondence to Chenhui Chu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chu, C., Mao, Z., Nakazawa, T. et al. SCTB-V2: the 2nd version of the Chinese treebank in the scientific domain. Lang Resources & Evaluation 57, 1389–1403 (2023). https://doi.org/10.1007/s10579-022-09615-2

  • DOI: https://doi.org/10.1007/s10579-022-09615-2
