CN109960786A - Chinese word similarity measurement based on a fusion strategy - Google Patents
Chinese word similarity measurement based on a fusion strategy
- Publication number
- CN109960786A CN109960786A CN201910236195.2A CN201910236195A CN109960786A CN 109960786 A CN109960786 A CN 109960786A CN 201910236195 A CN201910236195 A CN 201910236195A CN 109960786 A CN109960786 A CN 109960786A
- Authority
- CN
- China
- Prior art keywords
- similarity
- words
- word
- chinese
- calculation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Abstract
The present invention relates to a Chinese word similarity measurement method based on a fusion strategy, which calculates word similarity by combining HowNet, the synonym forest, a Chinese Wikipedia corpus trained with Word2Vec, and the Baidu dictionary. For two input words, the method first judges whether they exist in HowNet or the synonym forest; if so, the similarity is calculated using HowNet or the synonym forest. Otherwise, it judges whether they exist in the Wikipedia corpus or the Baidu dictionary; if so, the similarity of the words is calculated using word2vec or the Baidu dictionary. The fusion strategy comprehensively considers HowNet, the synonym forest, word2vec and the Baidu dictionary, so that the strategies complement one another's strengths. The Spearman and Pearson correlation coefficients obtained are higher than those of other methods, the accuracy of the word similarity results is improved, and the needs of practical application can be met well.
Description
Technical Field
The invention belongs to the technical field of text processing, and particularly relates to a Chinese word similarity calculation method based on a fusion strategy.
Background
Word similarity calculation is a fundamental research topic in Chinese information processing, with wide and deep applications in natural language processing, automatic question answering, knowledge graphs, text classification, text clustering, information retrieval, information extraction, word sense disambiguation, machine translation and other fields, so it has attracted the attention of more and more researchers.
Current word similarity calculation methods can be divided into three types: methods based on an existing ontology, methods based on large-scale corpus statistics, and corpus-based word embedding methods. The first, ontology-based, type uses the hierarchy, density and distance between words in a knowledge tree to calculate similarity. The second type calculates word similarity from statistics over large-scale corpora: it assumes that similar words appear in similar contexts, i.e. the similarity of words is computed from their distributional relatedness; a context vector for each word is obtained by training on a large-scale corpus, and the similarity between the vectors is then taken as the similarity between the two words. The third type is the corpus-based word embedding method: a neural network is trained on a large-scale corpus to obtain distributed representations of words in a vector space, and the cosine measure is then used to calculate the similarity between words.
The knowledge-ontology-based methods are limited by the semantic dictionary: they cannot handle out-of-vocabulary (OOV) words, and improper classification of words during ontology construction introduces errors into the similarity calculation. The methods based on large-scale corpus statistics and the word embedding methods are limited by the scale of the training corpus, involve a large amount of computation, are slow, and suffer considerable interference from corpus sparseness and noise.
The background knowledge related to the prior art in this field is presented below:
HowNet is a common-sense knowledge base whose basic content is the relationships between concepts and between the attributes of concepts; its rich lexical semantic knowledge is a basic resource for research in natural language processing. HowNet includes the notions of "concept", "sememe", "sense item" and "knowledge description language": a "concept" is a description of a word, and a word may have multiple "concepts" (i.e. polysemous words); a "concept" is described with the knowledge description language (KDML); the expression of a "concept" in this language is called a "sense item"; the words used in the description language are called "sememes", and a sememe is the minimal basic unit for describing a "concept". Complex relations exist among sememes; there are 8 of them: hypernym-hyponym, synonym, antonym, converse, attribute-host, part-whole, material-product and event-role. Through these 8 relations the sememes form a tree-shaped hierarchy (shown in figure 1); each sememe is a node in the tree, and this sememe hierarchy forms the basis of word similarity calculation.
The synonym forest was compiled by Mei Jiaju et al. in 1983; the dictionary contains synonyms and a certain number of related words. Because the synonym forest is old and was not updated for a long time afterwards, the Information Retrieval Laboratory of Harbin Institute of Technology invested a great deal of manpower and material resources to complete a new extended version of the synonym forest; 14706 rare words were removed and some new words were added to keep up with the times, for a final total of 77343 words. These words are divided into 12 major classes, 97 medium classes and 1400 minor classes; below the minor classes, word groups and atomic word groups are further divided, so the synonym forest forms a five-layer tree structure, as shown in fig. 2.
Unlike the sememe tree hierarchy of HowNet, where each node represents a sememe, in the synonym forest the leaf nodes are the individual entries, and the four upper layers are abstract classifications. The synonym forest encodes each entry according to the category to which it belongs, as shown in table 1.
TABLE 1 coding structure of words in synonym forest
The first-layer major class and the fourth-layer word group are represented by capital English letters, the second-layer medium class by a lowercase English letter, and the third-layer minor class and the fifth-layer atomic word group by two-digit decimal integers. The code has 8 positions, arranged from left to right, and the 8th position takes one of three values: "=", "#" or "@". "=" means "equal", "synonymous"; "#" means "unequal", "of the same kind", i.e. related words; "@" means "self-contained", "independent": the entry has neither synonyms nor related words in the dictionary.
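To make the coding concrete, the following sketch parses an 8-position code such as "Aa01A01=" into its five layers plus the trailing flag. It is a minimal illustration of the Table 1 layout described above; the function and field names are our own.

```python
def parse_cilin_code(code: str) -> dict:
    """Split an 8-position extended-Cilin code into its five layers and flag.

    Layout per Table 1: capital letter (major class), lowercase letter
    (medium class), two digits (minor class), capital letter (word group),
    two digits (atomic word group), then a final '=', '#' or '@' flag.
    """
    assert len(code) == 8, "extended-Cilin codes have 8 positions"
    return {
        "major_class": code[0],     # layer 1: capital letter
        "medium_class": code[1],    # layer 2: lowercase letter
        "minor_class": code[2:4],   # layer 3: two-digit integer
        "word_group": code[4],      # layer 4: capital letter
        "atomic_group": code[5:7],  # layer 5: two-digit integer
        "flag": code[7],            # '=' synonymous, '#' related, '@' independent
    }

print(parse_cilin_code("Aa01A01="))
```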
Word2Vec is a deep learning tool; Google released the open-source toolkit in 2013. Mikolov et al. proposed two architectures for word vectorization in 2013: CBOW (Continuous Bag-of-Words Model) and Skip-gram.
The CBOW model is a learning framework that learns a continuous bag-of-words model from a corpus. Based mainly on context information, it predicts the probability of the current word from the k words before and the k words after it, i.e. it predicts $p(w_t \mid w_{t-k}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+k})$. The model diagram is shown in fig. 3.
Here w(t) is the word vector currently being predicted, and w(t-2), w(t-1), w(t+1), w(t+2) are the context vectors of the current word; the window size of the context is 2k+1 and SUM is the accumulated sum. INPUT is the input layer, into which the vector representation of each context word is fed; PROJECTION is the hidden layer, where the vectors of these input words are accumulated; OUTPUT is the output layer, which outputs w(t).
The Skip-gram model is the exact opposite of the CBOW model: it predicts the probability of the context words given the current word, i.e. $p(w_{t-k}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+k} \mid w_t)$. The model is shown in FIG. 4.
The Baidu dictionary is an online instant word-explanation service provided by Baidu; the data it searches comes from Dian translation and Chinese dictionary websites, and Baidu encyclopedia query services are integrated into it. Most of the content in the dictionary is collected from the Internet, so with the Baidu dictionary one can query not only commonly used words but also new words, unknown words and Internet phrases; it can therefore satisfy common query and explanation needs.
Disclosure of Invention
In view of the problems in the prior art, the present invention is directed to a Chinese word similarity calculation method based on a fusion strategy, which avoids the above technical defects.
In order to achieve the above object, the present invention provides the following technical solutions:
a Chinese Word similarity calculation method based on a fusion strategy is used for calculating Word similarity based on the combination of HowNet, synonym forest, Word2Vec trained Chinese Wikipedia corpus and Baidu dictionary.
Further, for two input words, it is first judged whether they exist in HowNet or the synonym forest; if so, the similarity is calculated using HowNet or the synonym forest; otherwise, it is judged whether they exist in the Wikipedia corpus or the Baidu dictionary, and if so, the similarity of the words is calculated using word2vec or the Baidu dictionary.
Further, HowNet-based lexical semantic similarity is used as the calculation method, with the formula:

$$Sim(W_1, W_2) = \max_{i=1..n,\; j=1..m} Sim(S_{1i}, S_{2j})$$

where $Sim(W_1, W_2)$ is the similarity of words $W_1$ and $W_2$ based on HowNet; $S_{11}, S_{12}, \dots, S_{1n}$ are the sense items (concepts) of $W_1$; and $S_{21}, S_{22}, \dots, S_{2m}$ are the sense items (concepts) of $W_2$;
in HowNet, the expression of words uses a knowledge description formula composed of sememes and special symbols, and the sememes form a tree-shaped hierarchical system; the sememe similarity calculation formula is:

$$Sim(p_1, p_2) = \frac{\alpha}{Distance(p_1, p_2) + \alpha}$$

where $p_1, p_2$ represent sememes; $Distance(p_1, p_2)$ represents the path distance between $p_1$ and $p_2$ in the sememe hierarchy tree; and $\alpha$ is an adjustable parameter whose meaning is the sememe distance at which the similarity equals 0.5;
the sense-item description of a word in HowNet has four parts: the first basic sememe description, the other basic sememe descriptions, the relational sememe description and the relational symbol description; the other basic sememe descriptions form a set structure composed of sememes, and the relational sememe description and the relational symbol description are both feature structures.
The feature structure is a set of key-value pairs, wherein the key is a relational sememe or a relational symbol and the value is a basic sememe or a specific word; for the calculation of feature-structure similarity, a one-to-one correspondence is first established between features with the same key; if a key has no corresponding feature, its correspondence is null; then the similarity of the values under corresponding keys is calculated;
for the calculation of set similarity: first calculate the pairwise similarity between the elements of the two sets and select the pair with the maximum similarity, putting those two elements in correspondence; then delete the corresponding elements from the sets, and repeat until no further element correspondence can be made; elements without a correspondence correspond to the empty element; finally, the set similarity is the weighted average of the similarities of the element pairs;
the overall similarity of the words is then calculated with the formula:

$$Sim(C_1, C_2) = \sum_{i=1}^{4} \beta_i \prod_{j=1}^{i} sim_j(C_1, C_2)$$

where $C_1, C_2$ represent concepts of content words; $sim_1(C_1, C_2)$ to $sim_4(C_1, C_2)$ respectively represent the similarities of the four sense-item descriptions; and $\beta_1$ to $\beta_4$ represent the weights corresponding to each sense-item similarity.
Further, the calculation of word similarity based on the synonym forest includes: for two given words, look up their corresponding codes in the word forest, and then determine the layer at which the two codes differ; starting from the first layer, multiply by 1 if the layers are judged equal, otherwise multiply by the corresponding branch coefficient; then multiply by an adjusting parameter, where n is the total number of nodes in the branch layer; and finally multiply by the control parameter (n-k+1)/n, where k is the distance between the two branches.
Further, suppose the words whose similarity is to be calculated are denoted $W_1, W_2$ and the similarity is denoted Sim; then
if the two words are not on the same tree:
$Sim(W_1, W_2) = f$;
if the two terms differ in the second level branch, the coefficient is a:
if the two terms differ in the third level branch, the coefficient is b:
if the two terms differ in the fourth level branch, the coefficient is c:
if the two words differ in the fifth level branch, the coefficient is d:
further, word similarity calculation based on word2 vec: firstly, extracting the text content in the Chinese Wikipedia xml file; performing a complex and simple conversion, and then performing word segmentation on the Wikipedia text content by using a jieba word segmentation tool carried by python; then loading a stop word list and removing stop words; and finally, putting the processed corpus into word2vec for training to obtain a final result.
Further, word2vec is used for word similarity calculation using the cosine distance between word vectors, with the calculation formula:

$$Sim(W_1, W_2) = \cos(\vec{v}_1, \vec{v}_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\|\vec{v}_1\| \, \|\vec{v}_2\|}$$

where $\vec{v}_1$ is the vector representation of word $W_1$ and $\vec{v}_2$ is the vector representation of word $W_2$.
Further, calculating the word similarity based on the Baidu dictionary:
inputting words word1 and word2, requesting a query from the Baidu dictionary;
returning the interpretation parts S1 and S2 of the two words from the Baidu dictionary;
extracting keywords from the interpretation parts S1 and S2 using the TextRank algorithm, forming $Set1 = \{k_{11}, k_{12}, k_{13}, \dots, k_{1n}\}$ and $Set2 = \{k_{21}, k_{22}, k_{23}, \dots, k_{2m}\}$;
calculating the similarity between the words in keyword sets Set1 and Set2 and taking the maximum value as the similarity of word1 and word2, with the formula:

$$Sim(word1, word2) = \max_{1 \le i \le n,\; 1 \le j \le m} sim(k_{1i}, k_{2j})$$

where $sim(k_{1i}, k_{2j})$ is calculated using the combination of HowNet and the synonym forest.
Further, the TextRank algorithm is used to extract the keywords from the explanation paragraphs of words in the Baidu dictionary, with the following steps:
(1) segmenting explanatory paragraphs S1 and S2 of words word1 and word2 in a Baidu dictionary according to complete sentences;
(2) for each sentence in S1 and S2, performing operations such as word segmentation, part of speech tagging, stop word filtering and the like, and only leaving words with specific part of speech;
(3) constructing a candidate keyword graph, wherein each word from (2) is a vertex, and an edge exists between two vertices if they co-occur within the set window size;
(4) continuously and iteratively updating the weight value of each vertex until the weights finally converge;
(5) sorting the vertices according to their weight values, setting the number N of keywords to be acquired, and taking the first N vertices with the highest weights as candidate keywords.
Further, the word similarity calculation formula is:
$Sim_H$ represents the word similarity based on HowNet, $Sim_C$ the word similarity based on the synonym forest, $Sim_W$ the word similarity based on Wikipedia, and $Sim_B$ the word similarity based on the Baidu dictionary.
The Chinese word similarity calculation method based on the fusion strategy provided by the invention can basically cover the word pairs in the evaluation set, in particular some noun abbreviations and new Internet words; the proposed sentence keyword extraction algorithm is effective, extracting several keywords from the explanation of a word in the Baidu dictionary so that the word is represented more typically; the fusion strategy comprehensively considers HowNet, the synonym forest, word2vec and the Baidu dictionary, so that the strategies complement one another's strengths; since the calculated Spearman and Pearson correlation coefficients are higher than those of other methods, the accuracy of the word similarity result is improved and the needs of practical application can be met well.
Drawings
FIG. 1 is a tree hierarchy of sememes;
FIG. 2 is a diagram of a five-level tree structure of a synonym forest;
FIG. 3 is a schematic structural diagram of a CBOW model;
FIG. 4 is a schematic diagram of a Skip-gram model;
FIG. 5 is a schematic diagram of the training process for word2vec;
FIG. 6 is a diagram illustrating a word similarity calculation process based on a Baidu dictionary;
fig. 7 is a schematic diagram of a word similarity fusion strategy.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A Chinese word similarity calculation method based on a fusion strategy calculates word similarity based on the combination of HowNet, the synonym forest, a Chinese Wikipedia corpus trained with Word2Vec, and the Baidu dictionary. For two input words, it is first judged whether they exist in HowNet or the synonym forest; if so, HowNet or the synonym forest is used to calculate the similarity; otherwise, it is judged whether they exist in the Wikipedia corpus or the Baidu dictionary, and if so, word2vec or the Baidu dictionary is used to calculate the similarity of the words.
The TextRank algorithm is adapted from Google's PageRank algorithm and can extract abstracts and keywords from a text. It is a graph-based ranking algorithm: essentially a method of recursively deciding vertex importance in a graph from global information. Its basic idea is "voting" or "recommendation": when one vertex in the graph links to another, it casts a vote for it. The more votes a vertex receives, the higher its importance; and the importance of a voting vertex in turn determines the weight of the votes it casts. Thus the score of a vertex is determined by the number of votes cast for it and by the scores of the vertices casting those votes.
The TextRank graph can be represented as a directed weighted graph G = (V, E), consisting of a set of vertices V and a set of edges E, where E is a subset of V × V. The score of a vertex $V_i$ is calculated as:

$$WS(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j) \quad (1)$$

where d is a damping coefficient with range [0, 1], representing the probability of jumping from a given vertex to another random vertex in the graph, generally set to 0.85; $w_{ji}$ is the weight of the edge between vertices $V_j$ and $V_i$; $In(V_i)$ is the set of vertices pointing to $V_i$; and $Out(V_j)$ is the set of vertices that $V_j$ points to.
The method specifically comprises the following steps:
1.1 HowNet-based similarity calculation
HowNet is a common-sense knowledge base whose basic content is the relationships between concepts and between the attributes of concepts; its rich lexical semantic knowledge is a basic resource for research in natural language processing. To calculate word similarity with HowNet, the invention uses HowNet-based lexical semantic similarity as the calculation method, with the formula:

$$Sim(W_1, W_2) = \max_{i=1..n,\; j=1..m} Sim(S_{1i}, S_{2j}) \quad (2)$$

where $Sim(W_1, W_2)$ is the similarity of words $W_1$ and $W_2$ based on HowNet; $S_{11}, S_{12}, \dots, S_{1n}$ are the sense items (concepts) of $W_1$; and $S_{21}, S_{22}, \dots, S_{2m}$ are the sense items (concepts) of $W_2$.
In HowNet, the expressions of words are knowledge descriptions composed of sememes and special symbols. Therefore, all word similarity calculation ultimately reduces to the calculation of sememe similarity. In HowNet the sememes form a tree-shaped hierarchical system, and the sememe similarity is calculated as:

$$Sim(p_1, p_2) = \frac{\alpha}{Distance(p_1, p_2) + \alpha} \quad (3)$$

where $p_1, p_2$ represent sememes; $Distance(p_1, p_2)$ represents the path distance between $p_1$ and $p_2$ in the sememe hierarchy tree; and $\alpha$ is an adjustable parameter whose meaning is the sememe distance at which the similarity equals 0.5.
The sense-item description of a word in HowNet has four parts: the first basic sememe description, the other basic sememe descriptions, the relational sememe description and the relational symbol description. The other basic sememe descriptions form a set structure composed of sememes; the relational sememe description and the relational symbol description are both feature structures.
A feature structure can be understood as a set of key-value pairs, where the key is a relational sememe or a relational symbol and the value is a basic sememe or a specific word. To calculate feature-structure similarity, a one-to-one correspondence is first established between features with the same key; if a key has no corresponding feature, its correspondence is null. Then the similarity of the values under corresponding keys is calculated.
For the set structure, the basic idea of the calculation is similar to that of the feature structure: a one-to-one correspondence must also be established, but the difficulty of set similarity calculation is that the elements in a set are unordered and of equal status. The idea of Professor Liu Qun for set similarity calculation is roughly as follows: first calculate the pairwise similarity between the elements of the two sets, select the pair with the maximum similarity and put those two elements in correspondence, then delete the corresponding elements from the sets; repeat until no further element correspondence can be made, and let elements without a correspondence correspond to the empty element. Finally, the set similarity is the weighted average of the similarities of the element pairs.
In the four-part sense-item description of a word, the first basic sememe description is scored with the sememe similarity, the other basic sememe descriptions with the set similarity, and the relational sememe description and relational symbol description with the feature-structure similarity. Finally these similarities are combined into the overall similarity of the words, with the formula:

$$Sim(C_1, C_2) = \sum_{i=1}^{4} \beta_i \prod_{j=1}^{i} sim_j(C_1, C_2) \quad (4)$$

In this formula, $C_1, C_2$ represent concepts of content words; $sim_1(C_1, C_2)$ to $sim_4(C_1, C_2)$ respectively represent the similarities of the four sense-item descriptions; and $\beta_1$ to $\beta_4$ represent the weights corresponding to each sense-item similarity.
1.2 similarity calculation based on synonym forest
The synonym forest was compiled by Mei Jiaju et al. in 1983; subsequently the Information Retrieval Laboratory of Harbin Institute of Technology invested a great deal of manpower and material resources to complete a new extended version. The new version consists of a five-layer tree structure in which words are divided according to abstract categories, with the entries located at the leaf nodes. The details are described above and are not repeated here.
The main algorithm idea of word similarity calculation based on the synonym forest is as follows: for two given words, their corresponding codes are looked up in the word forest, and the layer at which the two codes differ is determined. For example, Aa01A01 and Aa01B01 differ at the 4th branch layer. Starting from the first layer, multiply by 1 if the layers are judged equal, otherwise multiply by the corresponding branch coefficient; then multiply by an adjusting parameter, where n is the total number of nodes in the branch layer (the purpose of the adjusting parameter is to control the range of the similarity to [0, 1]); finally multiply by the control parameter (n-k+1)/n, where k is the distance between the two branches, the control parameter mainly reflecting that the distance between branches is inversely proportional to the similarity. The specific calculation formulas are as follows:
Suppose the words whose similarity is to be calculated are denoted $W_1, W_2$ and the similarity is denoted Sim.
(1) If the two words are not on the same tree:
$Sim(W_1, W_2) = f$ (5)
(2) if the two terms differ in the second level branch, the coefficient is a:
(3) if the two terms differ in the third level branch, the coefficient is b:
(4) if the two terms differ in the fourth level branch, the coefficient is c:
(5) if the two words differ in the fifth level branch, the coefficient is d:
since the term similarity calculation based on the synonym forest simply calculates the similarity between terms, the context is not considered. Therefore, for the polysemous words, the similarity between each two polysemous words is often calculated, and the final similarity is taken as the maximum similarity.
1.3 word similarity calculation based on word2vec
Word2Vec is a deep learning tool; Google released the open-source toolkit in 2013. word2vec trains on large-scale corpora through a neural network and maps words to K-dimensional real-valued vectors. Using word2vec avoids the high-dimensional sparsity of the traditional one-hot representation and prevents the curse of dimensionality. In the experiment, Chinese Wikipedia is used as the training corpus, with the training parameters set as follows: min_count, the lowest frequency of occurrence for a retained word, and size, the dimension of each word vector, are set to 5 and 200 respectively, and window is the context window value. The final training result contains 669217 words, and the training process is shown in fig. 5.
Because the data released by Chinese Wikipedia is in xml format, the text content is first extracted from the xml file. Because Chinese Wikipedia content is written in traditional characters, traditional-to-simplified conversion is performed next, using the OpenCC (Open Chinese Converter) open-source tool. Then the Wikipedia text is segmented with the jieba word segmentation tool for python; then a stop word list is loaded and stop words are removed; finally, the processed corpus is fed into word2vec for training to obtain the final result. Word2vec-based word similarity calculation generally uses the cosine distance between word vectors; the specific calculation formula is:
$$Sim(W_1, W_2) = \cos(\vec{v}_1, \vec{v}_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\|\vec{v}_1\| \, \|\vec{v}_2\|}$$

where $\vec{v}_1$ is the vector representation of word $W_1$ and $\vec{v}_2$ is the vector representation of word $W_2$.
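A minimal sketch of this preparation and training pipeline using gensim is shown below. It assumes the Wikipedia text has already been extracted from the xml dump into wiki_text.txt and that a stop word file stopwords.txt exists (both file names are placeholders); the final call is gensim's built-in cosine similarity between two word vectors.

```python
import jieba
from opencc import OpenCC
from gensim.models import Word2Vec

cc = OpenCC("t2s")  # traditional -> simplified conversion

with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = set(line.strip() for line in f)

sentences = []
with open("wiki_text.txt", encoding="utf-8") as f:
    for line in f:
        simplified = cc.convert(line.strip())
        words = [w for w in jieba.cut(simplified) if w and w not in stopwords]
        if words:
            sentences.append(words)

# parameters as described above: min_count=5, vector size 200
model = Word2Vec(sentences, vector_size=200, min_count=5)
print(model.wv.similarity("国王", "王后"))  # cosine similarity of two words
```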
1.4 word similarity calculation based on Baidu dictionary
The Baidu dictionary (http://dict.baidu.com/) is an Internet encyclopedia dictionary in which many new Internet words are recorded; making good use of it solves the problem of many unregistered words. Because HowNet and the synonym forest are updated slowly, recent Internet words such as "grand", "donkey friends" (travel buddies) and "gnawing" (living off one's parents), and abbreviations of English terms such as "GRE" and "WTO", are generally not recorded in them; in the Word2Vec training corpus, such words are limited by the corpus scale and cannot be included in time. Therefore HowNet, the synonym forest and Word2Vec alone cannot meet the needs of word similarity calculation, and the invention proposes word similarity calculation based on the Baidu dictionary.
Assuming the two words word1 and word2 exist in neither the synonym forest nor HowNet, their similarity needs to be calculated using the Baidu dictionary, as shown in FIG. 6.
(1) Inputting words word1 and word2, requesting a query from the Baidu dictionary;
(2) returning the interpretation parts S1 and S2 of the two words from the Baidu dictionary;
(3) extracting keywords from the interpretation parts S1 and S2 using the TextRank algorithm, forming $Set1 = \{k_{11}, k_{12}, k_{13}, \dots, k_{1n}\}$ and $Set2 = \{k_{21}, k_{22}, k_{23}, \dots, k_{2m}\}$;
(4) calculating the similarity between the words in keyword sets Set1 and Set2 and taking the maximum value as the similarity of word1 and word2.
The calculation formula is as follows:

$$Sim(word1, word2) = \max_{1 \le i \le n,\; 1 \le j \le m} sim(k_{1i}, k_{2j})$$

where $sim(k_{1i}, k_{2j})$ is calculated using the combination of HowNet and the synonym forest.
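The sketch below mirrors the four steps of FIG. 6 under stated assumptions: fetch_definition stands in for the Baidu dictionary query (no public API is assumed here), textrank_keywords is the keyword extractor sketched in the next subsection, and hownet_cilin_sim is the combined HowNet/synonym-forest similarity; all three are injected by the caller.

```python
from itertools import product
from typing import Callable, List

def baidu_dict_similarity(word1: str, word2: str,
                          fetch_definition: Callable[[str], str],
                          textrank_keywords: Callable[[str, int], List[str]],
                          hownet_cilin_sim: Callable[[str, str], float],
                          top_n: int = 5) -> float:
    """Gloss-based similarity: extract keywords from each word's dictionary
    explanation and return the maximum pairwise keyword similarity."""
    s1 = fetch_definition(word1)         # steps (1)-(2): query, get gloss
    s2 = fetch_definition(word2)
    set1 = textrank_keywords(s1, top_n)  # step (3): TextRank keywords
    set2 = textrank_keywords(s2, top_n)
    if not set1 or not set2:
        return 1.0                       # assumed fallback: scale minimum
    # step (4): maximum over all keyword pairs
    return max(hownet_cilin_sim(k1, k2) for k1, k2 in product(set1, set2))
```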
The invention uses the TextRank algorithm to extract keywords from the explanation paragraphs of words in the Baidu dictionary; the main steps are as follows, with a code sketch after step (5):
(1) the explanatory paragraphs S1 and S2 of words word1 and word2 in the Baidu dictionary are segmented into complete sentences.
(2) For each sentence in S1 and S2, operations such as word segmentation, part-of-speech tagging, and stop word filtering are performed, leaving only words of a specific part-of-speech.
(3) A candidate keyword graph is constructed: each word from (2) is a vertex, and an edge exists between two vertices if they co-occur within the set window size.
(4) According to formula (1), the weight value of each vertex is continuously and iteratively updated until the weights finally converge.
(5) The vertices are sorted according to their weight values; the number N of keywords to be acquired is set according to actual needs, and the first N vertices with the highest weights are taken as candidate keywords.
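Below is a minimal sketch of these five steps, assuming jieba.posseg for segmentation and part-of-speech tagging; the part-of-speech whitelist, window size and iteration count are assumptions, and stop-word filtering is approximated by the whitelist plus a length check.

```python
import jieba.posseg as pseg

def textrank_keywords(paragraph: str, top_n: int = 5, window: int = 5,
                      d: float = 0.85, iterations: int = 30) -> list:
    """Steps (1)-(5): filter words by POS, build a co-occurrence graph,
    run the TextRank update of formula (1), return the top-N vertices."""
    allowed = {"n", "nz", "v", "vn", "a"}  # assumed POS whitelist
    words = [w for w, pos in pseg.cut(paragraph)
             if pos in allowed and len(w) > 1]
    # build undirected co-occurrence edges within the window
    neighbors = {w: set() for w in words}
    for i, w in enumerate(words):
        for u in words[max(0, i - window + 1): i + window]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):  # step (4): iterate toward convergence
        score = {w: (1 - d) + d * sum(score[u] / len(neighbors[u])
                                      for u in neighbors[w] if neighbors[u])
                 for w in neighbors}
    return sorted(score, key=score.get, reverse=True)[:top_n]
```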
The TextRank algorithm takes words as vertices and the relations between words as edges, constructs a directed weighted graph, and finally extracts keywords from it. It differs greatly from TF-IDF: TF-IDF does not consider the relations between words, only term frequency and inverse document frequency, requires large-scale corpora, and works better in specialized fields. Since the keywords extracted here come from Baidu dictionary paraphrases, which belong to the general field, and acquiring such a large corpus is impractical, TextRank is the better fit for keyword extraction in this work.
1.5 fusion strategy
The invention mainly calculates the Word similarity based on the combination of HowNet, synonym forest, Chinese Wikipedia corpus trained by Word2Vec and Baidu dictionary, as shown in figure 7.
The invention uses a multi-method fusion strategy to calculate word similarity: it includes both semantic-dictionary-based methods and large-scale-corpus-based methods, and adds a method based on an Internet dictionary, the Baidu dictionary. In general, for two input words, it is first judged whether they exist in HowNet or the synonym forest; if so, HowNet or the synonym forest is used to calculate the similarity; otherwise it is judged whether they exist in Wikipedia or the Baidu dictionary, and if so, word2vec or the Baidu dictionary is used to calculate the similarity of the words. The detailed algorithm steps are as follows:
Here the similarity range is [1.0, 10.0]: 1.0 means the two words are completely dissimilar, and 10.0 means the two words express the same meaning. F denotes one of a variety of aggregation functions, including the maximum, minimum, arithmetic mean and geometric mean. $Sim_H$ represents the word similarity based on HowNet, $Sim_C$ the word similarity based on the synonym forest, $Sim_W$ the word similarity based on Wikipedia, and $Sim_B$ the word similarity based on the Baidu dictionary; F is applied to whichever of these values are available to produce the fused score.
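A compact sketch of this dispatch logic follows. The resources mapping and its has/sim entries are illustrative names, not the patent's interface; F is left as a parameter because the patent allows the maximum, minimum, arithmetic mean or geometric mean.

```python
def fused_similarity(w1: str, w2: str, resources: dict, F=max) -> float:
    """Fusion strategy: prefer HowNet/synonym forest, fall back to
    word2vec (Wikipedia) and the Baidu dictionary. Each entry in
    `resources` holds a membership test `has` and a scorer `sim`
    returning values on the [1.0, 10.0] scale."""
    scores = []
    for name in ("hownet", "cilin"):        # semantic dictionaries first
        r = resources[name]
        if r["has"](w1) and r["has"](w2):
            scores.append(r["sim"](w1, w2))
    if not scores:
        for name in ("word2vec", "baidu"):  # corpus / web dictionary fallback
            r = resources[name]
            if r["has"](w1) and r["has"](w2):
                scores.append(r["sim"](w1, w2))
    return F(scores) if scores else 1.0     # 1.0 = completely dissimilar
```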
experiments and analyses
Data set
The experimental data used by the invention comprise the 2003 edition of the HowNet semantic dictionary with 66181 words (including polysemous words), the synonym forest with 77343 words, and the word2vec-trained Chinese Wikipedia corpus (containing 598454 words). The standard evaluation sets currently adopted for word similarity are RG-65, Miller & Charles-30, WordSimilarity-353, Words-240, PKU-500, and so on. The RG-65 set comprises 65 English word pairs with manually assigned semantic similarity values; it is not used because of its age. The Miller & Charles-30 set consists of 30 English word pairs released by Miller & Charles, of which 10 pairs are highly related, 10 moderately related and 10 weakly related; with so few word pairs, it is not representative for evaluating word similarity. The WordSimilarity-353 set contains 353 English word pairs with manual ratings, but all of them are nouns and word similarities of other parts of speech are not considered, so it does not examine word similarity comprehensively and deeply. These 3 evaluation sets all target English words, and directly translating them into Chinese as evaluation sets would make the evaluation insufficiently rigorous owing to human or machine translation inaccuracies. The currently known Chinese word similarity evaluation sets in China are Words-240 and PKU-500. The Words-240 set is somewhat similar to WordSimilarity-353 and was organized and rated by staff of the National University of Defense Technology; it contains manually rated semantic relatedness values for 240 Chinese word pairs. Since what is rated there is semantic relatedness rather than semantic similarity, the PKU-500 similarity evaluation set is adopted for this evaluation.
PKU-500 was rated by annotators with linguistics backgrounds organized by Wu Yunfang of the Institute of Computational Linguistics at Peking University, taking the following factors into account in word selection:
1. Domain. The selected Chinese words come from news or microblog short-text corpora, belong to the general field, and are common everyday words.
2. Frequency. Word frequencies are counted in the corpus, and then 30% high-frequency, 50% medium-frequency and 20% low-frequency words are selected.
3. Word length. The selected words include one-character, two-character, three-character and four-character words.
4. Word sense. Some polysemous words and words prone to ambiguity are appropriately included.
These word-selection factors show the rigor of Wu Yunfang's team in word selection and word similarity rating; this set also served as the evaluation set of the NLPCC-ICCPOL 2016 word similarity task. It therefore has high reference value and significance.
Evaluation index
The Spearman correlation coefficient (ρ) and the Pearson correlation coefficient (γ) are mainly used to test the correlation between the results calculated by the machine algorithm and the manually annotated results. Here they are used to evaluate the correlation between the results calculated by the word similarity algorithms in our experiments and the manual annotations of PKU-500.
The Spearman correlation coefficient (ρ) is defined as follows:

$$\rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where $X_i$ is the machine-calculated word similarity score and $Y_i$ is the manually annotated word similarity score (both treated as rank variables); $\sigma_X$ and $\sigma_Y$ are the standard deviations of the variables $X_i$ and $Y_i$ respectively, and N is the number of word pairs.
The Pearson correlation coefficient (γ) is defined as follows:

$$\gamma = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N}(X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{N}(Y_i - \bar{Y})^2}}$$

where $X_i$ is the machine-calculated word similarity score, $Y_i$ is the manually annotated word similarity score, $\bar{X}$ and $\bar{Y}$ are the means of $X_i$ and $Y_i$ respectively, and N is the number of word pairs.
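In practice both coefficients can be computed directly with scipy, as in the short example below; the score lists are made-up placeholder values, not data from the experiments.

```python
from scipy.stats import pearsonr, spearmanr

machine = [7.2, 3.1, 9.4, 5.5, 1.8]  # placeholder machine-calculated scores
human = [6.8, 2.9, 9.9, 6.1, 1.0]    # placeholder PKU-500-style manual scores

rho, _ = spearmanr(machine, human)   # Spearman rank correlation
gamma, _ = pearsonr(machine, human)  # Pearson linear correlation
print(f"spearman={rho:.3f}, pearson={gamma:.3f}")
```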
Results and analysis of the experiments
Having systematically described the word similarity calculation method of the present invention, we label the HowNet-based method H, the synonym-forest-based method C, the Word2Vec-based method W and the Baidu-dictionary-based method B. Max, Min, Arith and Geo respectively denote taking the maximum, minimum, arithmetic mean and geometric mean after fusing several similarity calculation methods; the similarity results range from 1.0 to 10.0. Applying these methods to the 500 word pairs of PKU-500 gives the results shown in table 2:
TABLE 2 PKU-500 Experimental results
It can be seen from the table that when HowNet, the word forest or word2vec is used alone for word similarity calculation, the Spearman and Pearson correlation coefficients are not high; the main reason remains the limits of dictionary size and corpus scale. Statistics show that HowNet contains 66181 words (including polysemous words), the synonym forest 77343 words, and the Chinese Wikipedia corpus trained in the experiment only 598454 words; of the PKU-500 word pairs, the HowNet dictionary covers 380 pairs, the synonym forest 454 pairs and the Chinese Wikipedia corpus 412 pairs, which obviously cannot meet the needs of word similarity evaluation. After the HowNet and synonym forest calculation methods are combined, the Spearman and Pearson correlation coefficients improve by 0.195 and 0.316 respectively over HowNet alone, and by 0.034 and 0.041 respectively over the word forest alone. The improvement is not large, mainly because statistics show that the combination of HowNet and the word forest still covers 454 word pairs in PKU-500, the same number as the word forest covered before combination; after combination, the two word similarity calculation methods complement each other's strengths through averaging. After HowNet, the synonym forest and word2vec are combined, the Spearman and Pearson correlation coefficients improve by 0.334 and 0.327 respectively over word2vec alone, and by 0.014 and 0.016 respectively over the previous combination; together, HowNet, the word forest and the Chinese Wikipedia corpus cover 480 word pairs, 26 more than before the combination. Since word2vec computes relatedness between words more than similarity, the improvement is smaller. When HowNet, the synonym forest, word2vec and the Baidu dictionary are all combined, the Spearman and Pearson correlation coefficients improve by 0.047 and 0.05 respectively compared with the previous method; statistically, 498 of the PKU-500 word pairs are now covered, 18 more than before the combination.
As shown in the table, words such as "GRE", "dongyou" (donkey friend, i.e. travel buddy), "want to face", "workday", "walk a bend", "sexually-sential", "cat greasy", "high-up" and "PC" exist in none of HowNet, the synonym forest and the Chinese Wikipedia corpus, so their calculated similarity is 1.0. For these words we turn to the Baidu dictionary: the explanation of "dongyou" is "generally refers to people who like tourism and often travel together". When TextRank is set to extract 5 keywords, the extracted keywords are [tourism, together, frequently, hobbies, companions], all of which exist in both HowNet and the synonym forest. Each keyword is paired with the word "passenger" for similarity calculation in HowNet and the synonym forest, and the maximum of the results is taken, i.e. max{sim1(passenger, tourism), sim2(passenger, together), sim3(passenger, frequently), sim4(passenger, hobbies), sim5(passenger, companions)}; according to the calculation, the maximum is sim1(passenger, tourism) = 10.0, so 10.0 is the final similarity result. Because the Baidu dictionary covers more of the words in PKU-500 (statistics show that explanations can be queried for 498 of the 500 word pairs), adding the Baidu dictionary to the similarity calculation gives better results.
Table 3 partial word similarity calculation results
Comparative experiment
To better situate the calculation method of the present invention, and since the evaluation data PKU-500 come from the NLPCC-ICCPOL 2016 evaluation task, the experimental results are compared with the 24 systems submitted by the 21 teams participating in that task (the top 5 results are selected), as shown in table 4:
TABLE 4 NLPCC evaluation task ranking table
The final result of this experiment is a Spearman correlation coefficient of 0.508 and a Pearson correlation coefficient of 0.499. Taking the Spearman correlation coefficient as the evaluation standard, the experiment would rank second in the evaluation task, only 0.01 behind the first place, achieving a good result and showing that the method is effective and feasible. The Spearman and Pearson correlation coefficients calculated by the method are higher than those of the other methods, improving the accuracy of word similarity calculation.
The invention achieves the following beneficial technical effects: first, the proposed Baidu dictionary method can basically cover the word pairs in the evaluation set, in particular some noun abbreviations and new Internet words; second, the proposed sentence keyword extraction algorithm is effective, extracting several keywords from the explanation of a word in the Baidu dictionary so that the word is represented more typically; third, the fusion strategy comprehensively considers HowNet, the synonym forest, word2vec and the Baidu dictionary, so that the strategies complement one another's strengths and the accuracy of the word similarity calculation result is further improved.
The above-mentioned embodiments only express the embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. A Chinese word similarity calculation method based on a fusion strategy, characterized in that word similarity is calculated based on the combination of HowNet, the synonym forest, a Chinese Wikipedia corpus trained with Word2Vec, and the Baidu dictionary.
2. The method for calculating the similarity of Chinese words according to claim 1, wherein, for two input words, it is first determined whether they exist in HowNet or the synonym forest; if so, the similarity is calculated using HowNet or the synonym forest; otherwise it is determined whether they exist in Wikipedia or the Baidu dictionary, and if so, the similarity of the words is calculated using word2vec or the Baidu dictionary.
3. The method for calculating the similarity of Chinese words according to claims 1-2, wherein HowNet-based lexical semantic similarity is used as the calculation method, with the formula:

$$Sim(W_1, W_2) = \max_{i=1..n,\; j=1..m} Sim(S_{1i}, S_{2j})$$

where $Sim(W_1, W_2)$ is the similarity of words $W_1$ and $W_2$ based on HowNet; $S_{11}, S_{12}, \dots, S_{1n}$ are the sense items (concepts) of $W_1$; and $S_{21}, S_{22}, \dots, S_{2m}$ are the sense items (concepts) of $W_2$;
in HowNet, the expression of words uses a knowledge description formula composed of sememes and special symbols, and the sememes form a tree-shaped hierarchical system; the sememe similarity calculation formula is:

$$Sim(p_1, p_2) = \frac{\alpha}{Distance(p_1, p_2) + \alpha}$$

where $p_1, p_2$ represent sememes; $Distance(p_1, p_2)$ represents the path distance between $p_1$ and $p_2$ in the sememe hierarchy tree; and $\alpha$ is an adjustable parameter whose meaning is the sememe distance at which the similarity equals 0.5;
the sense-item description of a word in HowNet has four parts: the first basic sememe description, the other basic sememe descriptions, the relational sememe description and the relational symbol description; the other basic sememe descriptions form a set structure composed of sememes, and the relational sememe description and the relational symbol description are both feature structures.
The feature structure is a set of key-value pairs, wherein the key is a relational sememe or a relational symbol and the value is a basic sememe or a specific word; for the calculation of feature-structure similarity, a one-to-one correspondence is first established between features with the same key; if a key has no corresponding feature, its correspondence is null; then the similarity of the values under corresponding keys is calculated;
for the calculation of set similarity: first calculate the pairwise similarity between the elements of the two sets and select the pair with the maximum similarity, putting those two elements in correspondence; then delete the corresponding elements from the sets, and repeat until no further element correspondence can be made; elements without a correspondence correspond to the empty element; finally, the set similarity is the weighted average of the similarities of the element pairs;
the overall similarity of the words is then calculated with the formula:

$$Sim(C_1, C_2) = \sum_{i=1}^{4} \beta_i \prod_{j=1}^{i} sim_j(C_1, C_2)$$

where $C_1, C_2$ represent concepts of content words; $sim_1(C_1, C_2)$ to $sim_4(C_1, C_2)$ respectively represent the similarities of the four sense-item descriptions; and $\beta_1$ to $\beta_4$ represent the weights corresponding to each sense-item similarity.
4. The method for calculating the similarity of Chinese words according to claims 1-3, wherein the calculation of word similarity based on the synonym forest comprises: for two given words, looking up their corresponding codes in the word forest and determining the layer at which the two codes differ; starting from the first layer, multiplying by 1 if the layers are judged equal, otherwise multiplying by the corresponding branch coefficient; then multiplying by an adjusting parameter, where n is the total number of nodes in the branch layer; and finally multiplying by the control parameter (n-k+1)/n, where k is the distance between the two branches.
5. The method for calculating the similarity of Chinese words according to claims 1-4, wherein, supposing the words whose similarity is to be calculated are denoted $W_1, W_2$ and the similarity is denoted Sim, then
if the two words are not on the same tree:
$Sim(W_1, W_2) = f$;
if the two terms differ in the second level branch, the coefficient is a:
if the two terms differ in the third level branch, the coefficient is b:
if the two terms differ in the fourth level branch, the coefficient is c:
if the two words differ in the fifth level branch, the coefficient is d:
6. The method for calculating the similarity of Chinese words according to claims 1-5, wherein word similarity calculation based on word2vec is performed by: first extracting the text content from the Chinese Wikipedia xml file; performing traditional-to-simplified conversion, then segmenting the Wikipedia text with the jieba word segmentation tool for python; then loading a stop word list and removing stop words; and finally feeding the processed corpus into word2vec for training to obtain the final result.
7. The method for calculating the similarity of Chinese words according to claims 1-6, characterized in that word similarity is calculated with word2vec using the cosine distance between word vectors, with the calculation formula:

$$Sim(W_1, W_2) = \cos(\vec{v}_1, \vec{v}_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\|\vec{v}_1\| \, \|\vec{v}_2\|}$$

where $\vec{v}_1$ is the vector representation of word $W_1$ and $\vec{v}_2$ is the vector representation of word $W_2$.
8. The method for calculating similarity of chinese words according to claims 1 to 7, wherein the calculation of similarity of words based on a Baidu dictionary is performed by:
inputting words word1 and word2, requesting a query from the Baidu dictionary;
returning the interpretation parts S1 and S2 of the two words from the Baidu dictionary;
extracting keywords from the interpretation parts S1 and S2 using the TextRank algorithm, forming $Set1 = \{k_{11}, k_{12}, k_{13}, \dots, k_{1n}\}$ and $Set2 = \{k_{21}, k_{22}, k_{23}, \dots, k_{2m}\}$;
calculating the similarity between the words in keyword sets Set1 and Set2 and taking the maximum value as the similarity of word1 and word2; the calculation formula is:

$$Sim(word1, word2) = \max_{1 \le i \le n,\; 1 \le j \le m} sim(k_{1i}, k_{2j})$$

where $sim(k_{1i}, k_{2j})$ is calculated using the combination of HowNet and the synonym forest.
9. The method for calculating the similarity of Chinese words according to claims 1-8, characterized in that the keywords in the explanation paragraphs of words in the Baidu dictionary are extracted using the TextRank algorithm, with the following steps:
(1) segmenting explanatory paragraphs S1 and S2 of words word1 and word2 in a Baidu dictionary according to complete sentences;
(2) for each sentence in S1 and S2, performing operations such as word segmentation, part of speech tagging, stop word filtering and the like, and only leaving words with specific part of speech;
(3) constructing a candidate keyword graph, wherein each word in the (2) is a vertex, and if the two vertices coexist in the set window size, an edge exists between the two vertices;
(4) continuously and iteratively updating the weight value of each vertex until the weight value of each vertex is finally converged;
(5) and sequencing the vertexes according to the weight values of the vertexes, setting the number N of the keywords to be acquired, and taking the first N vertexes with the highest weight values as candidate keywords.
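For illustration only: jieba ships a TextRank keyword extractor that covers steps (2)-(5) (segmentation, POS filtering, window-based graph construction and iterative ranking); a minimal sketch follows, with N and the part-of-speech whitelist as illustrative parameters.

```python
import jieba.analyse

def extract_keywords(interpretation_text, n=10):
    # Returns the N highest-weighted vertices as candidate keywords.
    return jieba.analyse.textrank(
        interpretation_text,
        topK=n,
        withWeight=False,
        allowPOS=("n", "vn", "v"),  # keep only words of specific parts of speech
    )
```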
10. The method for calculating the similarity of Chinese words according to claims 1-9, wherein the fused similarity is calculated from the four component similarities, where SimH represents the word similarity based on HowNet, SimC represents the word similarity based on the synonym forest, SimW represents the word similarity based on Wikipedia (word2vec), and SimB represents the word similarity based on the Baidu dictionary.
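For illustration only: since this page does not reproduce the fusion formula itself, the sketch below assumes a weighted linear combination of the four component similarities; the weights are placeholders, not values from the patent.

```python
def fused_similarity(sim_h, sim_c, sim_w, sim_b,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    # Assumed fusion: weighted sum of SimH, SimC, SimW and SimB.
    return sum(w * s for w, s in zip(weights, (sim_h, sim_c, sim_w, sim_b)))
```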
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910236195.2A CN109960786A (en) | 2019-03-27 | 2019-03-27 | Chinese Measurement of word similarity based on convergence strategy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109960786A true CN109960786A (en) | 2019-07-02 |
Family
ID=67024967
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910236195.2A Pending CN109960786A (en) | 2019-03-27 | 2019-03-27 | Chinese Measurement of word similarity based on convergence strategy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109960786A (en) |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110222745A (en) * | 2019-05-24 | 2019-09-10 | 中南大学 | A kind of cell type identification method based on similarity-based learning and its enhancing |
CN110222745B (en) * | 2019-05-24 | 2021-04-30 | 中南大学 | Similarity learning based and enhanced cell type identification method |
CN110442863B (en) * | 2019-07-16 | 2023-05-05 | 深圳供电局有限公司 | Short text semantic similarity calculation method, system and medium thereof |
CN110442863A (en) * | 2019-07-16 | 2019-11-12 | 深圳供电局有限公司 | Short text semantic similarity calculation method, system and medium |
CN111027315A (en) * | 2019-11-18 | 2020-04-17 | 曲阜师范大学 | Word similarity calculation method in WordNet based on Word2Vec model |
CN111027315B (en) * | 2019-11-18 | 2023-06-09 | 曲阜师范大学 | Word similarity calculation method in WordNet based on Word2Vec model |
CN111062574A (en) * | 2019-11-20 | 2020-04-24 | 南昌大学 | Method for measuring similarity of manufacturing process |
CN111062574B (en) * | 2019-11-20 | 2023-04-18 | 南昌大学 | Method for measuring similarity of manufacturing process |
CN111062218A (en) * | 2019-12-18 | 2020-04-24 | 北京工业大学 | Semantic similarity calculation method combining dependency relationship and synonym forest |
CN111199154A (en) * | 2019-12-20 | 2020-05-26 | 重庆邮电大学 | Fault-tolerant rough set-based polysemous word expression method, system and medium |
CN111199154B (en) * | 2019-12-20 | 2022-12-27 | 重庆邮电大学 | Fault-tolerant rough set-based polysemous word expression method, system and medium |
CN111783418A (en) * | 2020-06-09 | 2020-10-16 | 北京北大软件工程股份有限公司 | Chinese meaning representation learning method and device |
CN111783418B (en) * | 2020-06-09 | 2024-04-05 | 北京北大软件工程股份有限公司 | Chinese word meaning representation learning method and device |
CN111797945A (en) * | 2020-08-21 | 2020-10-20 | 成都数联铭品科技有限公司 | Text classification method |
CN112052679A (en) * | 2020-09-29 | 2020-12-08 | 北京邮电大学 | Fused media information processing method based on MI-CFM-IMC algorithm |
CN112052679B (en) * | 2020-09-29 | 2022-08-02 | 北京邮电大学 | Fused media information processing method based on MI-CFM-IMC algorithm |
CN112380429A (en) * | 2020-11-10 | 2021-02-19 | 武汉天有科技有限公司 | Exercise recommendation method and device |
CN112632970A (en) * | 2020-12-15 | 2021-04-09 | 北京工业大学 | Similarity scoring algorithm combining subject synonyms and word vectors |
CN112364947A (en) * | 2021-01-14 | 2021-02-12 | 北京崔玉涛儿童健康管理中心有限公司 | Text similarity calculation method and device |
CN112364947B (en) * | 2021-01-14 | 2021-06-29 | 北京育学园健康管理中心有限公司 | Text similarity calculation method and device |
CN113486142A (en) * | 2021-04-16 | 2021-10-08 | 华为技术有限公司 | Semantic-based word semantic prediction method and computer equipment |
CN113254638B (en) * | 2021-05-08 | 2022-09-23 | 北方民族大学 | Product image determining method, computer equipment and storage medium |
CN113254638A (en) * | 2021-05-08 | 2021-08-13 | 北方民族大学 | Product image determination method, computer equipment and storage medium |
CN113672695A (en) * | 2021-05-11 | 2021-11-19 | 山西大学 | Chinese short text similarity measurement method based on weighting network |
CN113536807B (en) * | 2021-08-03 | 2023-05-05 | 中国航空综合技术研究所 | Incomplete maximum matching word segmentation method based on semantics |
CN113536807A (en) * | 2021-08-03 | 2021-10-22 | 中国航空综合技术研究所 | Incomplete maximum matching word segmentation method based on semantics |
CN114548124A (en) * | 2022-02-25 | 2022-05-27 | 深圳Tcl新技术有限公司 | Word similarity determining method and device, storage medium and computer equipment |
CN114881022A (en) * | 2022-04-08 | 2022-08-09 | 山东新一代信息产业技术研究院有限公司 | Text similarity calculation method based on word forest and word vector |
CN114707615A (en) * | 2022-04-28 | 2022-07-05 | 吉林大学 | Ancient character similarity quantization method based on duration Chinese character knowledge graph |
CN116090466A (en) * | 2022-12-16 | 2023-05-09 | 上海美嘉林软件科技股份有限公司 | Method and system for constructing semantic units of technical information document |
CN117315665A (en) * | 2023-11-30 | 2023-12-29 | 上海又寸科技有限公司 | Automatic question reading method and system based on original handwriting recognition |
CN117610579A (en) * | 2024-01-19 | 2024-02-27 | 卓世未来(天津)科技有限公司 | Semantic analysis method and system based on long-short-term memory network |
CN117610579B (en) * | 2024-01-19 | 2024-04-16 | 卓世未来(天津)科技有限公司 | Semantic analysis method and system based on long-short-term memory network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109960786A (en) | Chinese Measurement of word similarity based on convergence strategy | |
CN106844658B (en) | Automatic construction method and system of Chinese text knowledge graph | |
CN106997382B (en) | Innovative creative tag automatic labeling method and system based on big data | |
CN107122413B (en) | Keyword extraction method and device based on graph model | |
CN106776711B (en) | Chinese medical knowledge map construction method based on deep learning | |
CN107180045B (en) | Method for extracting geographic entity relation contained in internet text | |
CN108052593A (en) | A kind of subject key words extracting method based on descriptor vector sum network structure | |
CN105528437B (en) | A kind of question answering system construction method extracted based on structured text knowledge | |
CN108681574A (en) | A kind of non-true class quiz answers selection method and system based on text snippet | |
CN110188349A (en) | A kind of automation writing method based on extraction-type multiple file summarization method | |
CN111581364B (en) | Chinese intelligent question-answer short text similarity calculation method oriented to medical field | |
CN114065758A (en) | Document keyword extraction method based on hypergraph random walk | |
CN112559684A (en) | Keyword extraction and information retrieval method | |
Sadr et al. | Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms | |
CN112507109A (en) | Retrieval method and device based on semantic analysis and keyword recognition | |
CN114707516B (en) | Long text semantic similarity calculation method based on contrast learning | |
CN114138979B (en) | Cultural relic safety knowledge map creation method based on word expansion unsupervised text classification | |
CN112417170B (en) | Relationship linking method for incomplete knowledge graph | |
CN113868387A (en) | Word2vec medical similar problem retrieval method based on improved tf-idf weighting | |
Chang et al. | A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING. | |
CN112015907A (en) | Method and device for quickly constructing discipline knowledge graph and storage medium | |
CN115757819A (en) | Method and device for acquiring information of quoting legal articles in referee document | |
CN114997288A (en) | Design resource association method | |
Samih et al. | Enhanced sentiment analysis based on improved word embeddings and XGboost. | |
Karpagam et al. | A framework for intelligent question answering system using semantic context-specific document clustering and Wordnet |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20190702 |
|
WD01 | Invention patent application deemed withdrawn after publication |