Abstract
The construction of large open knowledge bases (OKBs) is integral to many knowledge-driven applications on the World Wide Web, such as web search. However, noun phrases in OKBs often suffer from redundancy and ambiguity, which calls for investigation of OKB canonicalization. Current solutions address OKB canonicalization by devising advanced clustering algorithms and by using knowledge graph embedding (KGE) to further facilitate the canonicalization process. Nevertheless, these works fail to fully exploit the synergy between clustering and KGE learning, and the methods designed for these sub-tasks are sub-optimal. To this end, we put forward a multi-task learning framework, namely MulCanon, to tackle OKB canonicalization. Specifically, a diffusion model is used in the soft clustering process to enrich the noun phrase representations with neighboring information, which can make them more accurate. MulCanon unifies the learning objectives of the diffusion model, the KGE model, side information and cluster assignment, and adopts a two-stage multi-task learning paradigm for training. A thorough experimental study on popular OKB canonicalization benchmarks validates that MulCanon can achieve competitive canonicalization results.
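As a rough illustration of what unifying several learning objectives with a soft cluster assignment can look like, the following sketch may help. It is not MulCanon's implementation: the softmax-over-distances assignment, the function names, the loss weights and the stage schedule are all assumptions made for illustration, and the abstract does not state what each of the two training stages optimizes.

```python
import math

def soft_assignments(embedding, centroids):
    """Soft cluster memberships for one noun-phrase embedding.

    Assumption: a softmax over negative squared Euclidean distances,
    a generic choice and not necessarily the one used by MulCanon.
    """
    logits = [
        -sum((e - c) ** 2 for e, c in zip(embedding, centroid))
        for centroid in centroids
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def combined_loss(losses, weights):
    """Weighted sum of per-task losses (hypothetical unification rule)."""
    if losses.keys() != weights.keys():
        raise ValueError("losses and weights must name the same tasks")
    return sum(weights[name] * losses[name] for name in losses)

# Hypothetical two-stage schedule: the cluster-assignment term is
# switched on only in the second stage. This is a placeholder, since
# the abstract does not describe the actual stages.
STAGE_WEIGHTS = {
    "stage1": {"diffusion": 1.0, "kge": 1.0, "side_info": 1.0, "cluster": 0.0},
    "stage2": {"diffusion": 1.0, "kge": 1.0, "side_info": 1.0, "cluster": 1.0},
}

if __name__ == "__main__":
    # Toy values only; they are not results from the paper.
    memberships = soft_assignments([0.0, 0.0], [[0.0, 0.0], [1.0, 0.0]])
    toy_losses = {"diffusion": 0.5, "kge": 0.2, "side_info": 0.1, "cluster": 0.4}
    print(memberships)
    for stage, weights in STAGE_WEIGHTS.items():
        print(stage, combined_loss(toy_losses, weights))
```

Keeping the per-task weights in a single table means a stage change is only a change of weights, while the loss code stays the same.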
Availability of data and materials
All materials, including figures, are owned by the authors, and no permissions are required.
Funding
The authors would like to acknowledge the support provided by the Key R&D Program of Shandong Province, China (No. 2023CXGC010801), the National Natural Science Foundation of China (Nos. 62302513 and 62272469), the “New 20 Regulations for Universities” funding program of Jinan (No. 202228089) and the TaiShan Industrial Experts Programme (No. tscx202312128).
Author information
Contributions
Bingchen Liu wrote the main manuscript text, prepared all figures and tables, and provided the methodology. Weixin Zeng, Xiang Zhao and Huang Peng reviewed and edited the manuscript. Li Pan, Xin Li and Shijun Liu reviewed and edited the manuscript and provided funding support.
Ethics declarations
Competing interests
The authors declare that they have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and discussion reported in this paper.
Ethical approval
This declaration is not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, B., Peng, H., Zeng, W. et al. Open knowledge base canonicalization with multi-task learning. World Wide Web 27, 51 (2024). https://doi.org/10.1007/s11280-024-01288-x
DOI: https://doi.org/10.1007/s11280-024-01288-x