Incorporating Prior Knowledge into Word Embedding for Chinese Word Similarity Measurement

Authors:

Jianjun Ma

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 17, Issue 3

Article No.: 23, Pages 1 - 21

https://doi.org/10.1145/3182622

Published: 02 April 2018

Abstract

Word embedding-based methods have received increasing attention for their flexibility and effectiveness in many natural language processing (NLP) tasks, including Word Similarity (WS). However, these approaches rely on high-quality corpora and neglect prior knowledge. Lexicon-based methods draw on the human knowledge encoded in semantic resources, e.g., Tongyici Cilin, HowNet, and Chinese WordNet, but they cannot handle unknown words. This article proposes a three-stage framework for measuring Chinese word similarity by incorporating prior knowledge obtained from lexicons and statistics into word embedding. In the first stage, we use retrieval techniques to crawl the contexts of word pairs from web resources and extend the context corpus. In the second stage, we investigate three types of single similarity measurements: lexicon similarities, statistical similarities, and embedding-based similarities. In the final stage, we exploit simple combination strategies based on math operations and a counter-fitting combination strategy based on an optimization method. To demonstrate the effectiveness of our system, comparative experiments are conducted on the PKU-500 dataset. Our final results are Spearman/Pearson correlation coefficients of 0.561/0.516, which, to the best of our knowledge, outperform the state of the art. Experimental results on the Chinese MC-30 and SemEval-2012 datasets show that our system also performs well on other Chinese datasets, which demonstrates its transferability. In addition, our system is not language-specific and can be applied to other languages, e.g., English.
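To make the combination and evaluation steps concrete, the sketch below averages three single similarity scores per word pair and scores the result against human ratings with Spearman and Pearson correlation. It is only an illustration under stated assumptions: the arithmetic mean stands in for one possible "simple combination strategy with math operations", the function names (`combine`, `spearman`, `pearson`, `average_ranks`) are hypothetical, and all numbers are made up. It does not reproduce the article's actual measures, its counter-fitting strategy, or the PKU-500 data.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def combine(scores, weights=None):
    """Weighted arithmetic mean of the single-measure scores of one word pair.

    Assumption: this is just one plausible simple combination, chosen for
    illustration; it is not taken from the article.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Made-up scores in [0, 1] for four word pairs: (lexicon, statistical, embedding).
single_scores = [
    (0.9, 0.8, 0.7),
    (0.2, 0.4, 0.3),
    (0.6, 0.5, 0.7),
    (0.1, 0.1, 0.2),
]
gold = [9.0, 3.5, 6.0, 1.0]  # made-up human similarity ratings

predicted = [combine(s) for s in single_scores]
print("Spearman:", spearman(predicted, gold))
print("Pearson:", pearson(predicted, gold))
```

Spearman correlation compares only the ordering of the predicted and gold scores, while Pearson correlation is sensitive to their linear relationship, which is why the two reported figures can differ for the same system.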

Index Terms

Incorporating Prior Knowledge into Word Embedding for Chinese Word Similarity Measurement
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Lexical semantics

Information & Contributors

Information

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 17, Issue 3

September 2018

196 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3184403

Editor:
Nianwen Xue
Brandeis University, Waltham, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 April 2018

Accepted: 01 January 2018

Revised: 01 November 2017

Received: 01 January 2017

Published in TALLIP Volume 17, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Natural Science Foundation of China
National Social Science Foundation of China

Bibliometrics & Citations

Bibliometrics

Total Citations: 9
Total Downloads: 238
Downloads (Last 12 months): 9
Downloads (Last 6 weeks): 0

Reflects downloads up to 18 Nov 2024

Citations

Cited By

Ma Y (2024) Building an Applied Linguistics Corpus for Chinese Language Teaching and Learning. Applied Mathematics and Nonlinear Sciences 9:1. DOI: 10.2478/amns-2024-1508. Online publication date: 2-Jul-2024
https://doi.org/10.2478/amns-2024-1508
Ismailova L, Kosikov S, Slieptsov I, Wolfengagen V (2022) Semantic Stabilization Tools for Managing the Cognitive Activity of the Subject. Biologically Inspired Cognitive Architectures 2021 (176-181). DOI: 10.1007/978-3-030-96993-6_17. Online publication date: 25-Mar-2022
https://doi.org/10.1007/978-3-030-96993-6_17
Karuppaiah D, Vincent P (2021) Hybrid approach for semantic similarity calculation between Tamil words. International Journal of Innovative Computing and Applications 12:1 (13-23). DOI: 10.1504/ijica.2021.113609. Online publication date: 1-Jan-2021
https://dl.acm.org/doi/10.1504/ijica.2021.113609
Tagliabue J, Yu B, Bianchi F (2020) The Embeddings That Came in From the Cold: Improving Vectors for New and Rare Products with Content-Based Inference. Proceedings of the 14th ACM Conference on Recommender Systems (577-578). DOI: 10.1145/3383313.3411477. Online publication date: 22-Sep-2020
https://dl.acm.org/doi/10.1145/3383313.3411477
Wang Y, Liu J, Wang K, Yin F (2020) A Hybrid Semantic Representation with Internal and External Knowledge for Word Similarity. 2020 3rd International Conference on Artificial Intelligence and Big Data (ICAIBD) (264-268). DOI: 10.1109/ICAIBD49809.2020.9137463. Online publication date: May-2020
https://doi.org/10.1109/ICAIBD49809.2020.9137463
Nguyen K, Tran K, Luu S, Nguyen A, Nguyen N (2020) Enhancing Lexical-Based Approach With External Knowledge for Vietnamese Multiple-Choice Machine Reading Comprehension. IEEE Access 8 (201404-201417). DOI: 10.1109/ACCESS.2020.3035701. Online publication date: 2020
https://doi.org/10.1109/ACCESS.2020.3035701
Huang J, Zhang W, Sun Y, Wang H, Liu T (2018) Improving entity recommendation with search log and multi-task learning. Proceedings of the 27th International Joint Conference on Artificial Intelligence (4107-4114). DOI: 10.5555/3304222.3304341. Online publication date: 13-Jul-2018
https://dl.acm.org/doi/10.5555/3304222.3304341
Huang J, Sun Y, Zhang W, Wang H, Liu T (2018) Entity Highlight Generation as Statistical and Neural Machine Translation. IEEE/ACM Transactions on Audio, Speech and Language Processing 26:10 (1860-1872). DOI: 10.1109/TASLP.2018.2845111. Online publication date: 1-Oct-2018
https://dl.acm.org/doi/10.1109/TASLP.2018.2845111
