
Incorporating Prior Knowledge into Word Embedding for Chinese Word Similarity Measurement

Published: 02 April 2018

Abstract

Word embedding-based methods have received increasing attention for their flexibility and effectiveness in many natural language processing (NLP) tasks, including word similarity (WS). However, these approaches rely on high-quality corpora and neglect prior knowledge. Lexicon-based methods draw on the human knowledge encoded in semantic resources such as Tongyici Cilin, HowNet, and Chinese WordNet, but they cannot handle unknown words. This article proposes a three-stage framework for measuring Chinese word similarity that incorporates prior knowledge obtained from lexicons and statistics into word embedding. In the first stage, we use retrieval techniques to crawl the contexts of word pairs from web resources and thereby extend the context corpus. In the second stage, we investigate three types of single similarity measurements: lexicon similarities, statistical similarities, and embedding-based similarities. In the final stage, we exploit simple combination strategies based on mathematical operations as well as a counter-fitting combination strategy based on an optimization method. To demonstrate our system's effectiveness, comparative experiments are conducted on the PKU-500 dataset. Our final results are 0.561/0.516 in Spearman's rank/Pearson correlation coefficients, which, to the best of our knowledge, outperform the state of the art. Experimental results on the Chinese MC-30 and SemEval-2012 datasets show that our system also performs well on other Chinese datasets, which indicates its transferability. Moreover, our system is not language-specific and can be applied to other languages, e.g., English.
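As a rough illustration of the last two stages, the sketch below combines several single similarity scores for a word pair with simple math operations and then scores a system against human ratings using Spearman and Pearson correlation. It is not the authors' implementation: the function names (`cosine`, `combine`, `ranks`, `pearson`, `spearman`), the three strategies (mean, max, min), and all numbers are assumptions chosen for illustration, and the counter-fitting strategy is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combine(scores, strategy="mean"):
    """Merge single similarity scores (lexicon, statistical, embedding).

    The three strategies here are hypothetical examples of "simple math
    operations"; the abstract does not list the exact ones used.
    """
    if strategy == "mean":
        return sum(scores) / len(scores)
    if strategy == "max":
        return max(scores)
    if strategy == "min":
        return min(scores)
    raise ValueError(f"unknown strategy: {strategy}")


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Toy data only: each row holds made-up (lexicon, statistical,
    # embedding) similarities for one word pair.
    single_scores = [(0.9, 0.7, 0.8), (0.2, 0.4, 0.3), (0.6, 0.5, 0.9)]
    human_ratings = [8.5, 2.0, 6.0]  # made-up gold ratings
    system = [combine(row, "mean") for row in single_scores]
    print("Spearman:", spearman(system, human_ratings))
    print("Pearson:", pearson(system, human_ratings))
```

Spearman compares only the ordering of the word pairs, whereas Pearson also depends on how linearly the scores track the human ratings, which is why evaluations of this kind usually report both.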

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 17, Issue 3
    September 2018
    196 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3184403

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 02 April 2018
    Accepted: 01 January 2018
    Revised: 01 November 2017
    Received: 01 January 2017
    Published in TALLIP Volume 17, Issue 3

    Author Tags

    1. Chinese word similarity
    2. prior knowledge
    3. word embedding

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China
    • National Social Science Foundation of China
