Abstract
So far, Multimodal Named Entity Recognition (MNER) has been performed almost exclusively on English corpora. Chinese text is not naturally segmented into words, which makes Chinese NER more challenging; nonetheless, Chinese MNER has received little attention. We therefore first construct Wukong-CMNER, a multimodal NER dataset for the Chinese corpus that includes both images and text, comprising 55,423 annotated image-text pairs. Based on this dataset, we propose a lexicon-based prompting visual clue extraction (LPE) module to capture entity-related visual clues from the image. We further introduce a novel cross-modal alignment (CA) module that makes the representations of the two modalities more consistent through contrastive learning. Through extensive experiments, we observe that: (1) Performance improves discernibly as we move from unimodal to multimodal, verifying the necessity of integrating visual clues into Chinese NER. (2) The cross-modal alignment module further improves model performance. (3) Our two modules are decoupled from the subsequent prediction process, which enables a plug-and-play framework for enhancing Chinese NER models on the Chinese MNER task. Combined with W2NER [11], LPE and CA achieve state-of-the-art (SOTA) results on Wukong-CMNER, demonstrating their effectiveness.
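The cross-modal alignment (CA) module uses contrastive learning to pull the text and image representations of matched pairs together. A common formulation of such an objective is the symmetric InfoNCE loss; the sketch below is a minimal NumPy illustration under the assumption that matched (text, image) embedding pairs in a batch are positives and all other pairings are negatives. The function name and temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce(text_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Matched (text_i, image_i) pairs are pulled together; every other
    pairing in the batch serves as a negative.
    """
    # L2-normalize so dot products become cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature           # (batch, batch) similarity matrix
    labels = np.arange(len(logits))          # diagonal entries are positives

    def xent(l):
        # row-wise softmax cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # average the text-to-image and image-to-text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss drives each text embedding toward its paired image embedding and away from the other images in the batch, which is the alignment effect the CA module aims for.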
References
Chen, D., Li, Z., Gu, B., Chen, Z.: Multimodal named entity recognition with image attributes and image knowledge. In: Jensen, C.S., et al. (eds.) DASFAA 2021. LNCS, vol. 12682, pp. 186–201. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73197-7_12
Chen, S., Aguilar, G., Neves, L., Solorio, T.: Can images help recognize entities? A study of the role of images for multimodal NER. arXiv:2010.12712 (2020)
Chen, X., et al.: Good visual guidance makes a better extractor: hierarchical visual prefix for multimodal entity and relation extraction. arXiv:2205.03521 (2022)
Ding, R., Xie, P., Zhang, X., Lu, W., Li, L., Si, L.: A neural multi-digraph model for Chinese NER with gazetteers. In: ACL, pp. 1462–1467 (2019)
Levow, G.A.: The third international Chinese language processing bakeoff: word segmentation and named entity recognition. In: SIGHAN Workshop on Chinese Language Processing, pp. 108–117 (2006)
Gu, J., et al.: Wukong: 100 million large-scale Chinese cross-modal pre-training dataset and a foundation framework. arXiv:2202.06767 (2022)
Gui, T., Ma, R., Zhang, Q., Zhao, L., Jiang, Y.G., Huang, X.: CNN-based Chinese NER with lexicon rethinking. In: IJCAI, pp. 4982–4988 (2019)
Gui, T., et al.: A lexicon-based graph neural network for Chinese NER. In: EMNLP, pp. 1040–1050 (2019)
He, H., Choi, J.D.: The stem cell hypothesis: dilemma behind multi-task learning with transformer encoders. arXiv preprint arXiv:2109.06939 (2021)
He, H., Sun, X.: F-score driven max margin neural network for named entity recognition in Chinese social media. arXiv:1611.04234 (2016)
Li, J., et al.: Unified named entity recognition as word-word relation classification. In: AAAI, vol. 36, pp. 10965–10973 (2022)
Li, X., Yan, H., Qiu, X., Huang, X.: FLAT: Chinese NER using flat-lattice transformer. In: ACL, pp. 6836–6842 (2020)
Liu, W., Fu, X., Zhang, Y., Xiao, W.: Lexicon enhanced Chinese sequence labeling using BERT adapter. In: ACL, pp. 5847–5858 (2021)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV, pp. 10012–10022 (2021)
Lu, D., Neves, L., Carvalho, V., Zhang, N., Ji, H.: Visual attention model for name tagging in multimodal social media. In: ACL, pp. 1990–1999 (2018)
Ma, R., Peng, M., Zhang, Q., Huang, X.: Simplify the usage of lexicon in Chinese NER. In: ACL, pp. 5951–5960 (2020)
Mengge, X., Bowen, Y., Tingwen, L., Yue, Z., Erli, M., Bin, W.: Porous lattice-based transformer encoder for Chinese NER. In: COLING (2019)
Moon, S., Neves, L., Carvalho, V.: Multimodal named entity recognition for short social media posts. In: NAACL-HLT, pp. 852–860 (2018)
Peng, N., Dredze, M.: Named entity recognition for Chinese social media with jointly trained embeddings. In: EMNLP, pp. 548–554 (2015)
Sui, D., Tian, Z., Chen, Y., Liu, K., Zhao, J.: A large-scale Chinese multimodal NER dataset with speech clues. In: ACL, pp. 2807–2818 (2021)
Sun, L., et al.: RIVA: a pre-trained tweet multimodal model based on text-image relation for multimodal NER. In: COLING, pp. 1852–1862 (2020)
Sun, L., Wang, J., Zhang, K., Su, Y., Weng, F.: RpBERT: a text-image relation propagation-based BERT model for multimodal NER. In: AAAI, vol. 35, pp. 13860–13868 (2021)
Sun, Y., et al.: ERNIE: enhanced representation through knowledge integration. arXiv:1904.09223 (2019)
Wang, X., et al.: ITA: image-text alignments for multi-modal named entity recognition. arXiv:2112.06482 (2021)
Wang, X., et al.: Prompt-based entity-related visual clue extraction and integration for multimodal named entity recognition. In: Database Systems for Advanced Applications. DASFAA 2022. LNCS, vol. 13247, pp. 297–305. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-00129-1_24
Wang, X., et al.: CAT-MNER: multimodal named entity recognition with knowledge-refined cross-modal attention. In: ICME, pp. 1–6. IEEE (2022)
Weischedel, R., et al.: OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia (2013)
Wu, S., Song, X., Feng, Z.: MECT: multi-metadata embedding based cross-transformer for Chinese named entity recognition. In: ACL, pp. 1529–1539 (2021)
Wu, Z., Zheng, C., Cai, Y., Chen, J., Leung, H., Li, Q.: Multimodal representation with embedded visual guiding objects for named entity recognition in social media posts. In: MM, pp. 1038–1046 (2020)
Xu, B., Huang, S., Sha, C., Wang, H.: MAF: a general matching and alignment framework for multimodal named entity recognition. In: WSDM, pp. 1215–1223 (2022)
Yamada, I., Asai, A., Shindo, H., Takeda, H., Matsumoto, Y.: LUKE: deep contextualized entity representations with entity-aware self-attention. In: EMNLP (2020)
Yang, Z., Gong, B., Wang, L., Huang, W., Yu, D., Luo, J.: A fast and accurate one-stage approach to visual grounding. In: ICCV, pp. 4683–4693 (2019)
Yu, J., Jiang, J., Yang, L., Xia, R.: Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. In: ACL (2020)
Zhang, D., Wei, S., Li, S., Wu, H., Zhu, Q., Zhou, G.: Multi-modal graph fusion for named entity recognition with targeted visual guidance. In: AAAI, vol. 35, pp. 14347–14355 (2021)
Zhang, H., Koh, J.Y., Baldridge, J., Lee, H., Yang, Y.: Cross-modal contrastive learning for text-to-image generation. In: CVPR, pp. 833–842 (2021)
Zhang, Q., Fu, J., Liu, X., Huang, X.: Adaptive co-attention network for named entity recognition in tweets. In: AAAI (2018)
Zhang, Y., Yang, J.: Chinese NER using lattice LSTM. In: ACL, pp. 1554–1564 (2018)
Zheng, C., Wu, Z., Wang, T., Cai, Y., Li, Q.: Object-aware multimodal named entity recognition in social media posts with adversarial learning. IEEE Trans. Multimedia 23, 2520–2532 (2020)
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China under Grant No. 61772534 and by the Public Computing Cloud, Renmin University of China.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Bao, X., Wang, S., Qi, P., Qin, B. (2023). Wukong-CMNER: A Large-Scale Chinese Multimodal NER Dataset with Images Modality. In: Wang, X., et al. Database Systems for Advanced Applications. DASFAA 2023. Lecture Notes in Computer Science, vol 13945. Springer, Cham. https://doi.org/10.1007/978-3-031-30675-4_43
DOI: https://doi.org/10.1007/978-3-031-30675-4_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30674-7
Online ISBN: 978-3-031-30675-4
eBook Packages: Computer Science (R0)