CN106126494B

CN106126494B - Synonym discovery method and device, data processing method and device

Info

Publication number: CN106126494B
Application number: CN201610429937.XA
Authority: CN
Inventors: 张昊; 朱频频
Original assignee: Shanghai Zhizhen Intelligent Network Technology Co Ltd
Current assignee: Shanghai Zhizhen Intelligent Network Technology Co Ltd
Priority date: 2016-06-16
Filing date: 2016-06-16
Publication date: 2018-12-28
Anticipated expiration: 2036-06-16
Also published as: CN106126494A

Abstract

A synonym discovery method and device, and a data processing method and device. The synonym discovery method includes: obtaining a word set to be processed, the word set including multiple words; and, for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determining the word to be processed and the corresponding target word to be a synonym pair. The minimum edit distance is calculated by an edit distance method that includes a delete operation; the edit distance of the delete operation is less than that of the other operations, the edit distance of the delete operation is less than the preset threshold, and the edit distance of any single one of the other operations is greater than or equal to the preset threshold. This scheme can improve the accuracy of abbreviation discovery.

Description

Synonym discovery method and device, data processing method and device

Technical field

The present invention relates to the field of data processing, and in particular to a synonym discovery method and device and a data processing method and device.

Background

Synonymy is an important semantic relation and is widely used in natural language processing tasks such as information retrieval and text classification. Specifically, synonyms must be acquired and identified before tasks such as information retrieval or text classification are carried out. For example, in an information retrieval scenario, words that are synonyms of one another can be grouped into one class; when the input text contains a keyword that has synonyms, the search can also be run with those synonyms in place of the original keyword, so that the retrieval system can offer the user more texts to choose from.

Written and everyday Chinese often uses shortened forms of proper names. Such shortened forms are called abbreviations of the proper names; an abbreviation is a part of the original proper name and is also a kind of synonym. For example, "人大" abbreviates the Chinese name of the National People's Congress, "中国" ("China") abbreviates that of the People's Republic of China, and "皇马" abbreviates that of Real Madrid.

However, prior-art synonym discovery methods do not identify abbreviations well, which lowers the accuracy of semantic understanding.

Summary of the invention

The technical problem solved by the present invention is to provide a synonym discovery method and device that improve the accuracy of abbreviation discovery.

To solve the above technical problem, an embodiment of the present invention provides a synonym discovery method, the method including: obtaining a word set to be processed, the word set including multiple words; and, for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determining the word to be processed and the corresponding target word to be a synonym pair; wherein the minimum edit distance is calculated by an edit distance method that includes a delete operation, the edit distance of the delete operation is less than that of the other operations, the edit distance of the delete operation is less than the preset threshold, and the edit distance of any single one of the other operations is greater than or equal to the preset threshold.

Optionally, the method further includes: separately calculating the semantic similarity between the word to be processed and each of the other words in the word set, and selecting as candidate words either the words whose semantic similarity is greater than a similarity threshold or the top N words with the highest semantic similarity;

The target word is determined as follows: separately calculating the minimum edit distance from the word to be processed to each candidate word, and taking as target words the candidate words whose minimum edit distance from the word to be processed is less than the preset threshold.

Optionally, separately calculating the semantic similarity between the word to be processed and each of the other words in the word set includes:

Vectorizing each word in the word set; and, based on the vectorization result, calculating the cosine similarity between the word to be processed and each of the other words, the cosine similarity serving as the semantic similarity.

Optionally, vectorizing each word in the word set includes:

Vectorizing each word in the word set using the word2vec method.

Optionally, obtaining the word set in which synonyms are to be found includes:

Segmenting an input corpus to obtain the word set.

Optionally, the input corpus is segmented using a segmentation dictionary, and the segmentation dictionary is obtained as follows:

Preprocessing the input corpus to obtain text data; splitting the text data into lines to obtain phrase data; segmenting the phrase data according to the independent words contained in a base dictionary, to obtain segmented word data; combining adjacent items of segmented word data to generate candidate data strings; performing judgement processing on the candidate data strings to find new words; and adding the new words to the segmentation dictionary.

Optionally, the other operations include an insert operation and a replace operation; the edit distance of a single insert operation is greater than or equal to the preset threshold, and the edit distance of a single replace operation is greater than or equal to the preset threshold.

An embodiment of the present invention further provides a data processing method, the data processing method including the above synonym discovery method.

An embodiment of the present invention further provides a synonym discovery device, the device including:

An acquiring unit, adapted to obtain a word set to be processed, the word set including multiple words;

A synonym determination unit, adapted to, for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determine the word to be processed and the corresponding target word to be a synonym pair;

Wherein the minimum edit distance is calculated by an edit distance method that includes a delete operation, the edit distance of the delete operation is less than that of the other operations, the edit distance of the delete operation is less than the preset threshold, and the edit distance of any single one of the other operations is greater than or equal to the preset threshold.

Optionally, the synonym discovery device further includes:

A candidate word selection unit, adapted to separately calculate the semantic similarity between the word to be processed and each of the other words in the word set, and to select as candidate words either the words whose semantic similarity is greater than a similarity threshold or the top N words with the highest semantic similarity;

A target word determination unit, adapted to separately calculate the minimum edit distance from the word to be processed to each candidate word, and to take as target words the candidate words whose minimum edit distance from the word to be processed is less than the preset threshold.

Optionally, the candidate word selection unit includes:

A vectorization subunit, adapted to vectorize each word in the word set;

A cosine similarity calculation subunit, adapted to calculate, based on the vectorization result, the cosine similarity between the word to be processed and each of the other words, the cosine similarity serving as the semantic similarity.

Optionally, the vectorization subunit vectorizes each word in the word set using the word2vec method.

Optionally, the acquiring unit includes:

A segmentation subunit, adapted to segment an input corpus to obtain the word set.

Optionally, the segmentation subunit segments the input corpus using a segmentation dictionary, the segmentation dictionary being obtained by a segmentation dictionary acquiring unit.

An embodiment of the present invention further provides a data processing device, the data processing device including the above synonym discovery device.

Compared with the prior art, the technical solution of the embodiments of the present invention has the following advantages:

An embodiment of the present invention obtains a word set to be processed; for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, the word to be processed and the corresponding target word are determined to be a synonym pair; the minimum edit distance is calculated by an edit distance method that includes a delete operation, the edit distance of the delete operation is less than that of the other operations, the edit distance of the delete operation is less than the preset threshold, and the edit distance of any single one of the other operations is greater than or equal to the preset threshold. On the one hand, because the edit distance of the delete operation is limited to be less than that of the other operations, the minimum edit distance is obtained by using delete operations in preference to the others. On the other hand, because the edit distance of the delete operations used in calculating the minimum edit distance is less than the preset threshold while the edit distance of any single other operation is greater than or equal to the preset threshold, a minimum edit distance below the preset threshold means that the target word is obtained from the word to be processed by delete operations alone. This ensures that a synonym obtained by the edit distance method is literally a part of the word to be processed, so the abbreviations obtained are more accurate and the accuracy of abbreviation discovery improves.

Further, by calculating the semantic similarity between the word to be processed and the other words in the word set, multiple candidate words are selected, and the target word can then be determined within the smaller range formed by those candidate words. Because the candidate words are a subset of the word set to be processed, determining the target word from among them improves the efficiency of determining synonym pairs. At the same time, using semantic similarity as a further criterion for synonymy improves the accuracy of synonym pair discovery, and hence the accuracy of abbreviation discovery.
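The candidate selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `cosine_similarity` and `select_candidates` and the hand-written three-dimensional vectors are assumptions made for the example, whereas in practice the vectors would come from a trained model such as word2vec.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)

def select_candidates(word, vectors, sim_threshold=None, top_n=None):
    """Rank the other words by similarity to `word`, then keep those
    above `sim_threshold` and/or the first `top_n` of them."""
    scored = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda item: item[1], reverse=True)
    if sim_threshold is not None:
        scored = [(w, s) for w, s in scored if s > sim_threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [w for w, _ in scored]

# Toy vectors (hypothetical values, for illustration only).
toy_vectors = {
    "工商银行": [0.9, 0.1, 0.0],
    "工行": [0.8, 0.2, 0.1],
    "汇款": [0.0, 0.1, 0.9],
}
candidates = select_candidates("工商银行", toy_vectors, top_n=1)
```

Only the words returned by `select_candidates` would then need a minimum edit distance computation, which is where the efficiency gain described above comes from.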

Brief description of the drawings

Fig. 1 is a flowchart of a synonym discovery method in an embodiment of the present invention;

Fig. 2 is a flowchart of a method for obtaining a segmentation dictionary in an embodiment of the present invention;

Fig. 3 is a flowchart of another synonym discovery method in an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a synonym discovery device in an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of another synonym discovery device in an embodiment of the present invention.

Detailed description

Written and everyday Chinese often uses shortened forms of proper names. Such shortened forms are called abbreviations of the proper names; an abbreviation is a part of the original proper name and is also a kind of synonym. For example, "人大" abbreviates the Chinese name of the National People's Congress, "中国" ("China") abbreviates that of the People's Republic of China, and "皇马" abbreviates that of Real Madrid. However, prior-art synonym discovery methods do not identify abbreviations well, which lowers the accuracy of semantic understanding.

To make the above objects, features, and beneficial effects of the present invention clearer and easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart of a synonym discovery method in an embodiment of the present invention. The method is described below with reference to the steps shown in Fig. 1.

Step S101: obtain a word set to be processed, the word set including multiple words.

The word set to be processed is the word set from which synonym pairs are to be found.

In a specific implementation, the word set to be processed is obtained by segmenting an input corpus. The input corpus may be non-text data such as voice data, or it may be text data. When the input corpus is non-text data, it must first be converted into text data; that is, all subsequent processing operates on text data. The input corpus may be obtained from conversation records between a question answering system and its users, or from manually curated knowledge point data.

In a specific implementation, the input corpus may come from a specific domain, so the words in the word set to be processed are semantic expressions relating to that domain, and may include words with the same meaning but different forms of expression, that is, synonyms. The specific domain may be banking, education, sports, and so on.

For example, suppose the input corpus comes from the banking domain. Some sentences may refer to a bank as "招商银行" (China Merchants Bank), while others use "招行"; "招商银行" and "招行" form a synonym pair. Similarly, "工商银行" (Industrial and Commercial Bank) and "工行" form a synonym pair. In these two pairs, "招行" is the abbreviation of "招商银行" and "工行" is the abbreviation of "工商银行". The corpus may of course also contain synonyms that are not abbreviations: for example, "remittance" and "remit money" are synonyms, but neither is an abbreviation of the other. This embodiment finds abbreviations through steps S101 and S102.

In a specific implementation, the input corpus is segmented using a segmentation dictionary. For the segmentation result to contain abbreviations, in other words for abbreviations to be segmented out of a sentence as words, the segmentation dictionary must contain them. Abbreviations are a kind of new word, however, and may be absent from a general base dictionary. The base dictionary therefore needs to be updated through new word discovery, so that abbreviations are added to it as new words, and the updated base dictionary is then used as the segmentation dictionary to segment the input corpus.

So that the segmentation dictionary contains abbreviations, it is obtained as follows; see the steps shown in Fig. 2.

S11: preprocess the input corpus to obtain text data.

The input corpus may contain many format types. To facilitate subsequent processing, the input corpus is preprocessed to obtain text data.

In a specific implementation, the preprocessing may unify the format of the input corpus into text format and filter out one or more of profanity, sensitive words, and stop words. When unifying the format into text, content that current technology cannot yet convert into text may be filtered out.

S12: split the text data into lines to obtain phrase data.

The line splitting may break the text at punctuation marks, for example starting a new line wherever a full stop, comma, exclamation mark, or question mark occurs. The phrase data obtained here is a preliminary division of the corpus that fixes the scope of the subsequent word segmentation.
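A minimal sketch of the line splitting in S12 is given below. The function name `split_into_phrases` and the exact set of punctuation marks are assumptions chosen for illustration; the source names only full stops, commas, exclamation marks, and question marks as examples.

```python
import re

# Punctuation at which the text is broken (an illustrative set that
# covers both full-width and ASCII marks).
_PUNCTUATION = r"[。，！？；.,!?;]"

def split_into_phrases(text):
    """Break text at punctuation and drop empty fragments."""
    return [part.strip() for part in re.split(_PUNCTUATION, text) if part.strip()]

phrases = split_into_phrases("我要办卡，怎么办？谢谢。")
```

Each element of `phrases` then plays the role of one line of phrase data, which bounds the scope of the segmentation in S13.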

S13: segment the phrase data according to the independent words contained in the base dictionary, to obtain segmented word data.

The base dictionary is so called to distinguish it from the segmentation dictionary; it may not contain abbreviations. The base dictionary contains multiple independent words, which may differ in length. In a specific implementation, segmentation based on the base dictionary may use one or more of dictionary-based bidirectional maximum matching, the HMM method, and the CRF method.

Segmentation is applied to the phrase data of a single line, and every resulting item of word data is an independent word contained in the base dictionary.
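Of the segmentation methods named above, dictionary-based bidirectional maximum matching is the simplest to sketch. The version below is an illustrative assumption rather than the method as implemented by the inventors: it runs forward and backward maximum matching, prefers whichever result has fewer words, and lets any single character stand alone when no dictionary word matches. The function names, the `max_len` parameter, and the toy dictionary are all hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right matching of the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, dictionary, max_len=4):
    """Greedy right-to-left matching of the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def bidirectional_max_match(text, dictionary, max_len=4):
    """Prefer the direction that yields fewer words (ties go to forward)."""
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    return bwd if len(bwd) < len(fwd) else fwd

# Toy base dictionary (hypothetical entries).
base_dictionary = {"办理", "白金", "卡"}
segmented = bidirectional_max_match("办理白金卡", base_dictionary)
```

Because the toy dictionary has no entry for the whole card name, the result keeps "白金" and "卡" as separate words, which is exactly the situation that the new word discovery in S14 to S16 is meant to repair.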

S14: combine adjacent items of segmented word data to generate candidate data strings.

Segmentation according to the base dictionary may split into several items of word data something that, in a given domain, should be a single word; new word discovery is therefore needed. Subsequent steps screen the candidate data strings against set conditions, and the strings that pass are taken as new words. Generating the candidate data strings is the prerequisite for that screening and can be done in various ways.

In a specific implementation, a Bigram model may be used, taking each pair of adjacent words in the phrase data of a single line as a candidate data string.
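Generating Bigram candidates from one segmented line amounts to pairing each word with its right-hand neighbour. A minimal sketch, with the function name `bigram_candidates` assumed for illustration:

```python
def bigram_candidates(words):
    """Pair each word with the word that follows it on the same line."""
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

candidate_strings = bigram_candidates(["办理", "白金", "卡"])
```

A line of n words yields n - 1 candidate data strings, and pairs are never formed across line boundaries.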

Suppose a sentence S can be expressed as a sequence $S = w_1 w_2 \dots w_n$. A language model computes the probability $p(S)$ of the sentence:

$$p(S) = p(w_1, w_2, \dots, w_n) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \dots, w_{n-1}) \tag{1}$$

In formula (1) the probability is computed under the N-gram model, and the amount of computation is too large for practical use. The Markov assumption is therefore adopted: the occurrence of the next word depends only on the one or few words before it. If the next word depends only on the one word before it, then:

$$p(S) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \dots, w_{n-1}) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1}) \tag{2}$$

If the next word depends on the two words before it, then:

$$p(S) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \dots, w_{n-1}) = p(w_1)\,p(w_2 \mid w_1)\,p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_{n-2}, w_{n-1}) \tag{3}$$

Formula (2) gives the Bigram probability and formula (3) the Trigram probability. A larger n imposes more constraints on the occurrence of the next word and so has greater discriminating power; a smaller n means each candidate data string occurs more often during new word discovery, which provides more reliable statistics.

In theory a larger n is preferable, and Trigram is the most widely used in existing processing methods; Bigram, however, requires less computation and gives higher system efficiency.
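As a rough illustration of formula (2), the factors can be estimated from counts over a segmented corpus. The source does not say how the probabilities are estimated, so the relative-frequency estimates below, the function name `bigram_sentence_probability`, and the two-sentence toy corpus are all assumptions made for the example.

```python
from collections import Counter

def bigram_sentence_probability(sentence, corpus):
    """Estimate p(S) = p(w1) * product of p(w_i | w_{i-1}) from counts."""
    unigram = Counter(w for s in corpus for w in s)
    # Occurrences of a word as a history, i.e. in any non-final position.
    history = Counter(w for s in corpus for w in s[:-1])
    bigram = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
    total = sum(unigram.values())
    if not sentence or total == 0:
        return 0.0
    prob = unigram[sentence[0]] / total
    for prev, cur in zip(sentence, sentence[1:]):
        if history[prev] == 0:
            return 0.0  # the history was never followed by anything
        prob *= bigram[(prev, cur)] / history[prev]
    return prob

# Toy corpus of two segmented sentences (hypothetical).
toy_corpus = [["a", "b"], ["a", "c"]]
p_ab = bigram_sentence_probability(["a", "b"], toy_corpus)
```

Here p("a") is 2/4 and p("b" | "a") is 1/2, so `p_ab` is 0.25. Each factor needs only pair counts, which is why the Bigram model is cheaper than conditioning on the full history as in formula (1).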

S15: judge whether the candidate data string is a specific candidate data string. A specific candidate data string contains a base noun, and the word at a specific position relative to the base noun is a noun or an adjective.

The inventors found that when the word at a specific position relative to a base noun is a noun or an adjective, the combination very likely needs to be treated as a new word. Take the base noun "card": when the word to its left is a noun, the two can form "Dragon Card", "elite school card", "platinum card", "business card", and so on. Therefore, to judge whether a candidate data string is a specific candidate data string, one checks whether it contains a base noun and whether the word at the specific position relative to that base noun is a noun or an adjective.

The specific position relative to a base noun can be set according to the base noun and the corpus. For example, when the corpus contains many kinds of "card" and the names of the various cards are to be treated as new words, the condition can be set that the word to the left of the base noun is a noun or an adjective.

In a specific implementation, the specific relative position may be the left side, the right side, or both, and can be configured as needed.

In a specific implementation, the base noun may be determined by frequency, since a base noun occurs repeatedly in the corpus. A base noun may also be selected and set by manual reading.
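The check in S15 can be sketched as below. The part-of-speech tags ("n", "a", "v"), the `pos_tags` lookup table, and the function name `is_specific_candidate` are hypothetical; the source does not say how parts of speech are obtained.

```python
NOUN_OR_ADJ = {"n", "a"}  # hypothetical tags for noun and adjective

def is_specific_candidate(candidate, base_nouns, pos_tags, side="left"):
    """Check whether a two-word candidate contains a base noun with a noun
    or adjective on the configured side ("left", "right", or "both")."""
    first, second = candidate
    # Neighbour on the left of the base noun: the base noun is the second word.
    if side in ("left", "both") and second in base_nouns:
        if pos_tags.get(first) in NOUN_OR_ADJ:
            return True
    # Neighbour on the right of the base noun: the base noun is the first word.
    if side in ("right", "both") and first in base_nouns:
        if pos_tags.get(second) in NOUN_OR_ADJ:
            return True
    return False

# Toy part-of-speech table (hypothetical).
toy_tags = {"白金": "n", "办理": "v"}
specific = is_specific_candidate(("白金", "卡"), {"卡"}, toy_tags)
```

With "卡" ("card") as the base noun and the left side configured, ("白金", "卡") qualifies because "白金" is tagged as a noun, while ("办理", "卡") does not because "办理" is tagged as a verb.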

S16: perform judgement processing on the candidate data strings to find new words. The judgement processing includes:

When the candidate data string is not a specific candidate data string, calculating the information entropy between each word in the candidate data string and the word on its inner side, and removing candidate data strings whose information entropy is outside a preset range;

When the candidate data string is a specific candidate data string, calculating only the information entropy between the word other than the base noun and the word on its inner side, and removing candidate data strings whose information entropy is outside the preset range.

Because a candidate data string consists of two items of word data, judging a candidate data string requires judging the inner-side information entropy of each of the two. Information entropy measures the uncertainty of a random variable and is calculated as:

$$H(X) = -\sum_{i} p(x_i) \log p(x_i)$$

The larger the information entropy, the greater the uncertainty of the variable, that is, the more evenly probability is spread across its possible values. If one value of the variable occurs with probability 1, the entropy is 0, meaning the variable takes only that value and the outcome is certain.

The left-side and right-side information entropy of a word W are calculated as:

$$H_1(W) = -\sum_{x \in X,\ \#(xW) > 0} P(x \mid W) \log P(x \mid W)$$

where X is the set of all word data appearing to the left of W, and $H_1(W)$ is the left-side information entropy of word data W.

$$H_2(W) = -\sum_{y \in Y,\ \#(Wy) > 0} P(y \mid W) \log P(y \mid W)$$

where Y is the set of all word data appearing to the right of W, and $H_2(W)$ is the right-side information entropy of word data W.

Inner-side information entropy is obtained by fixing each item of word data in the candidate data string in turn and calculating the entropy of the other word's occurrence given that the fixed item occurs. For a candidate data string (W1W2), this means calculating the right-side information entropy of W1 and the left-side information entropy of W2.

The entropy between an item of word data and the word data on its inner side reflects how disordered the inner-side word data is. For example, by calculating the right-side entropy of the left word $W_1$ and the left-side entropy of the right word $W_2$ in candidate data string $W_1 W_2$, the degree of disorder on the inner sides of $W_1$ and $W_2$ can be judged, so that screening against a preset range excludes candidate data strings in which the probability characteristic value of a word and its inner-side word forming a new word falls outside that range.
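The left-side and right-side entropy of a word can be estimated from neighbour counts over the segmented lines, as in the sketch below. The function names `side_entropies` and `_entropy`, the use of relative frequencies for P(x | W), and the toy lines are assumptions for illustration.

```python
import math
from collections import Counter

def _entropy(counts):
    """Entropy (natural log) of the distribution given by raw counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def side_entropies(word, segmented_lines):
    """Return (left entropy, right entropy) of `word` over segmented lines."""
    left, right = Counter(), Counter()
    for words in segmented_lines:
        for i, w in enumerate(words):
            if w != word:
                continue
            if i > 0:
                left[words[i - 1]] += 1   # neighbour on the left of W
            if i + 1 < len(words):
                right[words[i + 1]] += 1  # neighbour on the right of W
    return _entropy(left), _entropy(right)

# Toy segmented lines (hypothetical).
toy_lines = [["白金", "卡"], ["龙", "卡"], ["白金", "卡"]]
left_h, right_h = side_entropies("卡", toy_lines)
```

For a candidate data string (W1, W2), the inner-side values are the second element of `side_entropies(W1, ...)` and the first element of `side_entropies(W2, ...)`. In the toy data "卡" is preceded by two different words, so its left-side entropy is positive, whereas "白金" is always followed by "卡", so its right-side entropy is 0.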

In a specific candidate data string, the inner-side entropy of the base noun may fall outside the preset range, causing a string that should be a new word to be excluded. For example, when the specific candidate data strings are "platinum card", "elite school card", "Dragon Card", and other strings containing the base noun "card", the right-side entropy of the words "platinum", "elite school", and "Dragon" is within the preset range, but because the words to the left of "card" are highly varied, its left-side entropy may fall outside the range, so that these candidate data strings would be wrongly excluded.

Therefore, when the candidate data string is a specific candidate data string, only the information entropy between the word other than the base noun and its inner-side word is calculated, and candidate data strings whose entropy is outside the preset range are removed. The inner-side entropy of the base noun is no longer calculated, which avoids wrongly excluding strings because that entropy falls outside the preset range.
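The two branches of the screening can be sketched as a single filter that skips the base noun's inner-side entropy when one is given. The entropy tables, the range bounds `low` and `high`, and the function name `passes_entropy_filter` are hypothetical; the source does not state the preset range.

```python
def passes_entropy_filter(candidate, right_entropy, left_entropy,
                          low, high, base_noun=None):
    """Keep a candidate (w1, w2) if every inner-side entropy that must be
    checked lies in [low, high]. For a specific candidate, pass its base
    noun so that the base noun's own inner-side entropy is skipped."""
    w1, w2 = candidate
    checks = []
    if w1 != base_noun:
        checks.append(right_entropy[w1])  # inner side of w1 is its right
    if w2 != base_noun:
        checks.append(left_entropy[w2])   # inner side of w2 is its left
    return all(low <= h <= high for h in checks)

# Toy entropy values (hypothetical).
toy_right = {"白金": 0.0}
toy_left = {"卡": 1.5}
kept_as_specific = passes_entropy_filter(
    ("白金", "卡"), toy_right, toy_left, low=0.0, high=1.0, base_noun="卡")
kept_as_ordinary = passes_entropy_filter(
    ("白金", "卡"), toy_right, toy_left, low=0.0, high=1.0)
```

With these toy values the candidate survives when treated as a specific candidate data string but is removed when treated as an ordinary one, because only in the latter case is the high left-side entropy of "卡" taken into account.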

S17: add the new words to the segmentation dictionary.

Because abbreviations are a kind of new word, the set of new words obtained from the input corpus also contains abbreviations. Adding the new words to the segmentation dictionary therefore ensures that the dictionary contains abbreviations, and the input corpus can then be segmented with it to obtain the word set of this embodiment.

The steps that follow obtaining the word set to be processed are described next.

Step S102: for any word to be processed in the word set, when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determine the word to be processed and the corresponding target word to be a synonym pair.

In a specific implementation, the other operations may include a replace operation and an insert operation. In this embodiment, the edit distance is the editing cost of converting one word into another through edit operations, that is, the number of operations multiplied by the cost of each operation. The minimum edit distance is the edit distance with the smallest editing cost. Each operation acts on only one character of the word.

Edit distance and minimum edit distance are illustrated below with the conversion of "工商银行" (Industrial and Commercial Bank) to its abbreviation "工行". "工商银行" can be converted to "工行" by different combinations of edit operations. Assume that a single replace operation has an edit distance of 1000, a single delete operation 1, and a single insert operation 1000.

First conversion: three delete operations remove "工", "商", and "银" from "工商银行", and one insert operation then inserts "工" to give "工行"; the edit distance from "工商银行" to "工行" is then 1003 (3 × 1 + 1000);

Second conversion: two delete operations remove "工" and "商" from "工商银行", and one replace operation then replaces "银" with "工" to give "工行"; the edit distance from "工商银行" to "工行" is then 1002 (2 × 1 + 1000);

Third conversion: two delete operations remove "商" and "银" from "工商银行" to give "工行"; the edit distance from "工商银行" to "工行" is then 2 (2 × 1).

Note that the ways of converting the word to be processed "工商银行" to "工行" are not limited to the combinations listed above, and different conversions may have different edit distances. Among all conversions, however, the minimum edit distance is unique. The minimum edit distance from "工商银行" to "工行" is 2, obtained by the third conversion.
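The minimum over all conversions need not be found by enumeration. A standard dynamic programming table over character prefixes computes it directly, as in the sketch below. The source does not specify how the minimum is computed, so the use of dynamic programming and the function name `min_edit_distance` are assumptions; the default costs are the ones from the example above.

```python
def min_edit_distance(source, target,
                      delete_cost=1, insert_cost=1000, replace_cost=1000):
    """Weighted minimum edit distance from `source` to `target`.

    dist[i][j] is the cheapest cost of turning the first i characters of
    `source` into the first j characters of `target`."""
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * delete_cost          # delete every character
    for j in range(1, n + 1):
        dist[0][j] = j * insert_cost          # insert every character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost,                        # delete
                dist[i][j - 1] + insert_cost,                        # insert
                dist[i - 1][j - 1] + (0 if same else replace_cost),  # keep or replace
            )
    return dist[m][n]

distance = min_edit_distance("工商银行", "工行")
```

The distance is asymmetric under these costs: going from "工商银行" to "工行" needs only two deletions, while the reverse direction needs two insertions and so costs 2000.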

Therefore, for any word to be processed in the word set, its minimum edit distance to another word is determinate. The minimum edit distance from any word to be processed to each of the other words in the word set is calculated; when the word set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than the preset threshold, the word to be processed and the corresponding target word are determined to be a synonym pair. For example, let the word set be L = {A, B, C, D, E, F, G, H}. For the word to be processed A, target words come from the subset M = {B, C, D, E, F, G, H}; if M contains a word B such that the minimum edit distance from A to B is less than the preset threshold, then A and B are a synonym pair.
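The pairing rule can be sketched as a loop over ordered pairs of words. To keep the example short, the distance function is passed in as a parameter and the demonstration uses a small table of made-up distances; the names `find_synonym_pairs` and `toy_distance`, the table values, and the threshold of 1000 are all hypothetical.

```python
def find_synonym_pairs(words, distance, threshold):
    """Return (word to be processed, target word) pairs whose minimum
    edit distance is below the threshold."""
    pairs = []
    for word in words:
        for target in words:
            if target != word and distance(word, target) < threshold:
                pairs.append((word, target))
    return pairs

# Hypothetical precomputed minimum edit distances, for illustration only.
_TOY_DISTANCES = {("A", "B"): 2, ("A", "C"): 3, ("D", "B"): 1}

def toy_distance(word, target):
    # Any pair not listed is assumed to need an insert or replace operation.
    return _TOY_DISTANCES.get((word, target), 1000)

pairs = find_synonym_pairs(["A", "B", "C", "D"], toy_distance, threshold=1000)
```

Because the distance is directional, the pair (A, B) is reported while (B, A) is not: the first element is always the longer word from which characters are deleted.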

To guarantee that the target word in each synonym pair found is an abbreviation of the word to be processed, that is, that the abbreviation is necessarily a part of the word to be processed, the edit distance method of this embodiment requires that the edit distance of any single one of the other operations be greater than or equal to the preset threshold and that the edit distance of the delete operation be less than that of the other operations. Moreover, not only must the edit distance of a single delete operation be less than the preset threshold, but so must the total edit distance of multiple delete operations, where the number of deletions is determined by the maximum number of characters deleted between a full name and its abbreviation (for example, if at most 5 characters are deleted, the number is 5).
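These constraints on the operation costs and the threshold can be written as a single consistency check. The function name `costs_are_valid` and the threshold value of 1000 in the usage line are assumptions; the source gives example costs of 1, 1000, and 1000 and a maximum of 5 deletions but does not state a threshold value.

```python
def costs_are_valid(delete_cost, insert_cost, replace_cost,
                    threshold, max_deletions):
    """Check that only pure-deletion conversions can fall below the threshold."""
    return (
        delete_cost < insert_cost
        and delete_cost < replace_cost
        and insert_cost >= threshold                # one insert already reaches it
        and replace_cost >= threshold               # one replace already reaches it
        and max_deletions * delete_cost < threshold # all allowed deletions stay below
    )

valid = costs_are_valid(1, 1000, 1000, threshold=1000, max_deletions=5)
```

With costs that pass this check, a minimum edit distance below the threshold can only arise from at most `max_deletions` delete operations and no other operation.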

In a specific implementation, one or more abbreviations may be found by the above method. It should be noted that the words that each form a synonym pair with the same word to be processed are not necessarily abbreviations of one another. For example, in the phrase set L, the method of this embodiment finds that word B is an abbreviation of the word to be processed A and that word C is another abbreviation of A; that is, the minimum edit distance from A to B and the minimum edit distance from A to C are both less than the preset threshold. Word B and word C are not necessarily in an abbreviation relationship, i.e. there is no guarantee that B is an abbreviation of C or that C is an abbreviation of B, but B and C are synonyms.

It should also be noted that multiple words to be processed that share the same abbreviation under the method of this embodiment are not necessarily synonyms. For example, in the phrase set L, the method finds that the abbreviation of the word to be processed A is B and that the abbreviation of the word to be processed D is also B, but A and D are not necessarily synonyms.

In this embodiment, the edit distance of the deletion operation is limited to be less than that of the remaining operations, so that when the minimum edit distance is calculated, the deletion operation is preferred among the edit operations that convert the word to be processed into another word. In addition, the edit distance of the deletion operation is less than the preset threshold, while the edit distance of a single remaining operation is greater than or equal to the preset threshold. As a result, when the minimum edit distance from the word to be processed to a target word is less than the preset threshold, the target word can only have been obtained from the word to be processed by deletion operations. This ensures that the synonym obtained by the edit distance method is a part of the literal form of the word to be processed, which makes the abbreviations obtained more accurate and improves the accuracy of abbreviation discovery.
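The weighted edit distance described above can be sketched with the standard dynamic-programming recurrence for edit distance, using a separate cost for each operation. The sketch below is one possible implementation under stated assumptions, not the implementation of the embodiment: the function name and the default costs (1 for deletion, 1000 for insertion and replacement) are illustrative, and the strings 工商银行 and 工行 are assumed to be the full name and abbreviation discussed above.

```python
def weighted_edit_distance(source, target, delete_cost=1, insert_cost=1000, replace_cost=1000):
    """Minimum cost of turning source into target with per-operation costs."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] is the minimum cost of turning source[:i] into target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * delete_cost      # delete every character of source[:i]
    for j in range(1, cols):
        dist[0][j] = j * insert_cost      # insert every character of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            if source[i - 1] == target[j - 1]:
                keep_or_replace = dist[i - 1][j - 1]                 # characters match
            else:
                keep_or_replace = dist[i - 1][j - 1] + replace_cost  # replace one character
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost,   # delete source[i - 1]
                dist[i][j - 1] + insert_cost,   # insert target[j - 1]
                keep_or_replace,
            )
    return dist[-1][-1]


print(weighted_edit_distance("工商银行", "工行"))  # 2: two deletions
print(weighted_edit_distance("工行", "工商银行"))  # 2000: two insertions
```

The distance is not symmetric under these costs: shortening a word is cheap, lengthening it is not, which is why the direction from the word to be processed to the target word matters.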

Fig. 3 is a flowchart of a synonym discovery method in an embodiment of the present invention. The method is described below with reference to the steps shown in Fig. 3.

Step S301: obtain a phrase set to be processed, the phrase set including multiple words.

For the implementation of this step, refer to step S101 shown in Fig. 1; details are not repeated here.

Step S302: for any word to be processed in the phrase set, calculate the semantic similarity between the word to be processed and each remaining word in the phrase set, and select as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values.

In one specific implementation, the semantic similarity value between each remaining word and the word to be processed is compared with a similarity threshold, and the words whose semantic similarity value is greater than the similarity threshold are taken as candidate words. The similarity threshold can be preset as required and is not limited here; in this case the number of candidate words varies with the similarity threshold.

In another specific implementation, the number N of candidate words is fixed. Specifically, the semantic similarity values are sorted in descending order and the top N words are taken as candidate words.
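The two selection strategies can be sketched as follows. This is a minimal illustration, assuming the similarity values have already been computed and are held in a dictionary; the function name `select_candidates`, its parameters and the scores are all hypothetical.

```python
def select_candidates(similarities, similarity_threshold=None, top_n=None):
    """Pick candidate words either by threshold or by taking the top N.

    similarities maps each remaining word to its semantic similarity
    with the word to be processed.
    """
    # Sort words by similarity value in descending order.
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    if similarity_threshold is not None:
        # Strategy 1: keep every word above the threshold (count varies).
        return [word for word, score in ranked if score > similarity_threshold]
    if top_n is not None:
        # Strategy 2: keep a fixed number of the most similar words.
        return [word for word, _ in ranked[:top_n]]
    raise ValueError("either similarity_threshold or top_n must be given")


# Hypothetical similarity values for words B, C and D relative to word A.
scores = {"B": 0.41, "C": 0.78, "D": 0.83}
print(select_candidates(scores, top_n=2))
print(select_candidates(scores, similarity_threshold=0.8))
```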

This step selects candidate words from the phrase set so that target words can subsequently be determined from the candidate words. On the one hand, this narrows the range in which target words that form synonym pairs with the word to be processed are sought, which reduces computational complexity and improves the efficiency of abbreviation discovery. On the other hand, using semantic similarity as an additional criterion for judging synonymy further improves the accuracy of synonym pair discovery, and hence of abbreviation discovery.

In a specific implementation, the semantic similarity between the word to be processed and each remaining word in the phrase set can be calculated by the following steps:

First, vectorize each word in the phrase set;

Second, based on the vectorization result, calculate the cosine similarity between the word to be processed and each remaining word, the cosine similarity serving as the semantic similarity. After the cosine similarity is calculated, the words whose cosine similarity value is greater than the similarity threshold, or the top N words with the highest cosine similarity values, can be selected as candidate words.

In a specific implementation, each word in the phrase set can be vectorized using the word2vec method. Other existing methods can also be used to vectorize each word in the phrase set.
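The cosine similarity step can be sketched in plain Python. In practice the vectors would come from word2vec or a comparable method; here they are made-up three-dimensional vectors used purely for illustration, and the function name `cosine_similarity` is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for a zero vector
    return dot / (norm_u * norm_v)


# Hypothetical word vectors; real ones would be produced by the vectorization step.
vectors = {
    "A": [0.9, 0.1, 0.3],
    "B": [0.1, 0.8, 0.2],
    "C": [0.8, 0.2, 0.4],
    "D": [0.9, 0.1, 0.2],
}
similarities = {
    word: cosine_similarity(vectors["A"], vector)
    for word, vector in vectors.items()
    if word != "A"
}
# Rank the remaining words by similarity to word A, highest first.
print(sorted(similarities, key=similarities.get, reverse=True))
```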

Step S303: when the phrase set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than the preset threshold, determine the word to be processed and a corresponding target word to be a synonym pair.

The minimum edit distance is calculated by an edit distance method that includes a deletion operation. The edit distance of the deletion operation is less than that of the remaining operations, and the edit distance of the deletion operation is less than the preset threshold.

The target word is determined as follows: calculate the minimum edit distance between the word to be processed and each candidate word, and take as target words the candidate words whose minimum edit distance from the word to be processed is less than the preset threshold.

In this embodiment, the remaining operations include an insertion operation and a replacement operation. The edit distance of a single insertion operation is greater than or equal to the preset threshold, and the edit distance of a single replacement operation is greater than or equal to the preset threshold.

The implementation of steps S301 to S303 is illustrated below with an example. Each step is shown with one specific implementation, which should not be taken as limiting the present invention.

In step S301, the phrase set to be processed is obtained as Q (A, B, C and D), where each of A, B, C and D can be a word to be processed. Assume that A is "China Merchants Bank" (招商银行), B is 工行 (the abbreviation of "Industrial and Commercial Bank"), C is 招行 (the abbreviation of "China Merchants Bank"), and D is "Industrial and Commercial Bank" (工商银行).

The following steps take word A, "China Merchants Bank" (招商银行), as the word to be processed.

In step S302, each word (A, B, C and D) in the phrase set Q is vectorized using the word2vec method. Based on the vectorization result, the cosine similarity between the word to be processed A and each remaining word B, C and D is calculated; in descending order of cosine similarity value the words are D, C and B. The top 2 words by cosine similarity value are selected as candidate words, i.e. word D (工商银行) and word C (招行).

In step S303, for the word to be processed A (招商银行), the edit distance method is used to calculate the minimum edit distance between A and candidate word D (工商银行) and the minimum edit distance between A and candidate word C (招行).

In the edit distance method of this example, the edit distance of the deletion operation is less than that of the insertion operation and of the replacement operation, the edit distance of the deletion operation is less than the preset threshold, the edit distance of a single insertion operation is greater than or equal to the preset threshold, and the edit distance of a single replacement operation is greater than or equal to the preset threshold. Assume that the edit distance of a single deletion operation is 1, that of a single insertion operation is 1000, that of a single replacement operation is 1000, and the preset threshold is 10. Then:

Among all combinations of edit operations that convert the word to be processed "China Merchants Bank" (招商银行) into candidate word D (工商银行), a single replacement operation, replacing "招" with "工", yields the smallest edit distance, so the minimum edit distance is 1000;

Among all combinations of edit operations that convert the word to be processed "China Merchants Bank" (招商银行) into candidate word C (招行), two deletion operations, deleting "商" and "银", yield the smallest edit distance, so the minimum edit distance is 2;

Of the minimum edit distances calculated above, only 2 is less than the preset threshold of 10, so the corresponding target word is candidate word C (招行). The word to be processed A, "China Merchants Bank" (招商银行), and candidate word C (招行) are determined to be a synonym pair, and 招行 is the abbreviation of 招商银行.

As another example, assume that the phrase set to be processed is P (招商银行, 工行 and 工商银行). For the word to be processed 招商银行, the semantic similarity between 招商银行 and 工行 and the semantic similarity between 招商银行 and 工商银行 are calculated, and both values are greater than the similarity threshold. The minimum edit distances are then calculated. Because the edit distance of the deletion operation is less than that of the replacement operation, the deletion operation is preferred at each step:

Converting 招商银行 into 工行 requires at least 2 deletion operations and 1 replacement operation. Specifically, the cheapest conversion deletes "招" and "商" and replaces "银" with "工". With a single-step edit distance of 1 for deletion and 1000 for replacement, the minimum edit distance from 招商银行 to 工行 is 1002;

Similarly, converting 招商银行 into 工商银行 requires at least 1 replacement operation, replacing "招" with "工". With a single-step edit distance of 1000 for replacement, the minimum edit distance from 招商银行 to 工商银行 is 1000.

It can be seen that the minimum edit distance from 招商银行 to 工行 and the minimum edit distance from 招商银行 to 工商银行 are both greater than the preset threshold of 10, so neither candidate word 工行 nor candidate word 工商银行 is a target word. That is, the phrase set to be processed contains no word that forms a synonym pair with the word to be processed 招商银行.
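The two worked examples above can be reproduced with a short sketch. It assumes that the words involved are 招商银行, 招行, 工商银行 and 工行 (reconstructed from the character-level operations described in the examples) and uses the example costs of 1, 1000 and 1000 with a threshold of 10. The function names are hypothetical, and the row-by-row dynamic programme is one common way to compute an edit distance, not necessarily the one used in the embodiment.

```python
def min_edit_distance(source, target, delete_cost=1, insert_cost=1000, replace_cost=1000):
    """Weighted edit distance computed one row at a time."""
    # previous[j]: cost of turning the processed prefix of source into target[:j].
    previous = [j * insert_cost for j in range(len(target) + 1)]
    for i, source_char in enumerate(source, start=1):
        current = [i * delete_cost]
        for j, target_char in enumerate(target, start=1):
            diagonal = previous[j - 1] + (0 if source_char == target_char else replace_cost)
            current.append(min(previous[j] + delete_cost,      # delete source_char
                               current[j - 1] + insert_cost,   # insert target_char
                               diagonal))                      # keep or replace
        previous = current
    return previous[-1]


def find_targets(word, candidates, threshold=10):
    """Return the candidates whose distance from word is below the threshold."""
    return [c for c in candidates if min_edit_distance(word, c) < threshold]


print(min_edit_distance("招商银行", "工商银行"))  # 1000
print(min_edit_distance("招商银行", "招行"))      # 2
print(min_edit_distance("招商银行", "工行"))      # 1002
print(find_targets("招商银行", ["工商银行", "招行"]))  # ['招行']
print(find_targets("招商银行", ["工行", "工商银行"]))  # []
```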

In this embodiment, the edit distance of the deletion operation is limited to be less than that of the remaining operations, so that when the minimum edit distance is calculated by the edit distance method, the deletion operation is preferred among the edit operations that convert the word to be processed into another word. On this basis, the edit distance of the deletion operation is less than the preset threshold, while the edit distance of a single remaining operation is greater than or equal to the preset threshold. As a result, when the minimum edit distance from the word to be processed to a target word is less than the preset threshold, the target word can only have been obtained from the word to be processed by deletion operations. This ensures that the synonym obtained by the edit distance method is a part of the literal form of the word to be processed, which makes the abbreviations obtained more accurate and improves the accuracy of abbreviation discovery.
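One consequence of these limits can be checked directly: a target word passes the threshold exactly when its characters appear in the word to be processed in the same order (so that it can be reached by deletions alone) and the total deletion cost stays below the threshold. The sketch below illustrates that equivalent test. It is an observation about the cost structure rather than a procedure stated in the embodiment, and the function names, default cost and sample words are assumptions.

```python
def obtainable_by_deletion(full_word, short_word):
    """True if short_word can be produced from full_word by deletions only."""
    remaining = iter(full_word)
    # Each "in" consumes the iterator up to the matched character,
    # so the characters of short_word must appear in order.
    return all(char in remaining for char in short_word)


def passes_threshold(full_word, short_word, delete_cost=1, threshold=10):
    """Deletion-only test equivalent to 'minimum edit distance < threshold',
    assuming insertion and replacement each cost at least the threshold."""
    if not obtainable_by_deletion(full_word, short_word):
        return False
    deletions = len(full_word) - len(short_word)
    return deletions * delete_cost < threshold


print(passes_threshold("招商银行", "招行"))  # True: delete two characters
print(passes_threshold("招商银行", "工行"))  # False: "工" is not in the full word
print(passes_threshold("招商银行", "行招"))  # False: characters are out of order
```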

Further, this embodiment selects multiple candidate words by calculating the semantic similarity between the word to be processed and the remaining words in the phrase set, so that target words can be determined within the smaller range formed by the candidate words. Because the candidate words are a subset of the phrase set to be processed, determining target words from them improves the efficiency of determining synonym pairs. Using semantic similarity as an additional criterion for judging synonymy also further improves the accuracy of synonym pair discovery.

An embodiment of the present invention also provides a data processing method based on the above synonym discovery method. In the data processing method, synonyms are identified by means of a thesaurus, and the thesaurus includes abbreviations obtained with the above synonym discovery method. The data processing method is described below.

The data processing method includes: obtaining a knowledge point, the knowledge point including a question and a corresponding answer; for any keyword obtained by segmenting the question, judging from the thesaurus whether the keyword has a synonym; when the keyword has a synonym, replacing the keyword with the synonym found; and storing the question obtained after replacement and adding it to the knowledge point.

For example, the above synonym discovery method finds that 招行 is the abbreviation of "China Merchants Bank" (招商银行), and the two form a synonym pair in the thesaurus. The data processing method is then carried out as follows:

Obtain a knowledge point in which the question is "How do I activate a China Merchants Bank (招商银行) credit card" and the corresponding answer is S;

The question is segmented, yielding among others the keyword 招商银行, and the thesaurus is consulted to judge whether this keyword has a synonym. Because 招商银行 has a synonym, namely its abbreviation 招行, the keyword 招商银行 in the question is replaced with 招行. The replaced question, "How do I activate a 招行 credit card", is stored and added to the knowledge point. The original knowledge point is thus expanded: its questions are now the original question containing 招商银行 and the replaced question containing 招行, both corresponding to answer S. The synonym 招行 is obtained with the above synonym discovery method, which is not described again here.
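The knowledge point expansion described above can be sketched as follows. Everything in this block is an assumption made for illustration: the data layout (a dictionary with a list of questions and one answer), the function names, the toy forward maximum-matching segmenter that stands in for a real word segmenter, and the Chinese question string, which is a reconstruction of the example question.

```python
def segment(text, vocabulary, max_len=4):
    """Toy forward maximum-matching segmenter (hypothetical stand-in)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest piece first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocabulary:
                tokens.append(piece)
                i += length
                break
    return tokens


def expand_knowledge_point(knowledge_point, thesaurus, vocabulary):
    """Add to the knowledge point every question obtained by swapping one
    keyword for one of its synonyms; the answer is unchanged."""
    questions = list(knowledge_point["questions"])
    for question in knowledge_point["questions"]:
        tokens = segment(question, vocabulary)
        for index, keyword in enumerate(tokens):
            for synonym in thesaurus.get(keyword, []):
                replaced = "".join(tokens[:index] + [synonym] + tokens[index + 1:])
                if replaced not in questions:
                    questions.append(replaced)
    return {"questions": questions, "answer": knowledge_point["answer"]}


vocabulary = {"招商银行", "信用卡", "如何", "开通"}
thesaurus = {"招商银行": ["招行"]}  # pair found by the synonym discovery method
point = {"questions": ["招商银行信用卡如何开通"], "answer": "S"}
expanded = expand_knowledge_point(point, thesaurus, vocabulary)
print(expanded["questions"])
```

Replacing by token position rather than by substring avoids touching other occurrences of the same characters that the segmenter did not identify as that keyword.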

It can be seen that the above synonym discovery method can be used to expand the questions in a knowledge point and thereby expand the knowledge base, so that the corresponding answer can still be returned when a question is phrased differently using an abbreviation. This improves the semantic understanding ability of an intelligent question answering system and the accuracy of its answers. It should be noted that the synonym discovery method can be applied not only to expanding a knowledge base but also to information search. When applied to information search, the search can return not only information related to the keyword but also information related to the keyword's abbreviation or full name.

Fig. 4 is a schematic structural diagram of a synonym discovery device in an embodiment of the present invention. The synonym discovery device may include an acquisition unit 401 and a synonym determination unit 402;

The acquisition unit 401 is adapted to obtain a phrase set to be processed, the phrase set including multiple words;

The synonym determination unit 402 is adapted to, for any word to be processed in the phrase set, when the phrase set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than the preset threshold, determine the word to be processed and a corresponding target word to be a synonym pair;

In a specific implementation, the remaining operations include an insertion operation and a replacement operation; the edit distance of a single insertion operation is greater than or equal to the preset threshold, and the edit distance of a single replacement operation is greater than or equal to the preset threshold.

In a specific implementation, the acquisition unit 401 includes a word segmentation subunit adapted to segment an input corpus to obtain the phrase set. In a specific implementation, the word segmentation subunit segments the input corpus using a segmentation dictionary; the segmentation dictionary is obtained by a segmentation dictionary acquisition unit, which is adapted to:

For the structure and beneficial effects of the synonym discovery device of this embodiment, refer to the description of the steps and beneficial effects of the synonym discovery method of Fig. 1; they are not repeated here.

Fig. 5 is a schematic structural diagram of a synonym discovery device in an embodiment of the present invention. The synonym discovery device shown in Fig. 5 may include an acquisition unit 501, a candidate word selection unit 502, a target word determination unit 503 and a synonym determination unit 504.

The acquisition unit 501 is adapted to obtain a phrase set to be processed, the phrase set including multiple words.

The synonym determination unit 504 is adapted to, for any word to be processed in the phrase set, when the phrase set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than the preset threshold, determine the word to be processed and a corresponding target word to be a synonym pair. The minimum edit distance is calculated by an edit distance method in which the edit distance of the deletion operation is less than that of the remaining operations, the edit distance of the deletion operation is less than the preset threshold, and the edit distance of a single remaining operation is greater than or equal to the preset threshold.

In a specific implementation, the acquisition unit 501 includes a word segmentation subunit 5011 adapted to segment an input corpus to obtain the phrase set.

In a specific implementation, the word segmentation subunit 5011 segments the input corpus using a segmentation dictionary; the segmentation dictionary is obtained by a segmentation dictionary acquisition unit, which is adapted to:

In a specific implementation, the synonym discovery device may further include:

A candidate word selection unit 502, adapted to calculate the semantic similarity between the word to be processed and each remaining word in the phrase set, and to select as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values;

A target word determination unit 503, adapted to calculate the minimum edit distance between the word to be processed and each candidate word, and to take as target words the candidate words whose minimum edit distance from the word to be processed is less than the preset threshold.

In a specific implementation, the candidate word selection unit 502 may include:

A vectorization subunit 5021, adapted to vectorize each word in the phrase set;

A cosine similarity calculation subunit 5022, adapted to calculate, based on the vectorization result, the cosine similarity between the word to be processed and each remaining word, the cosine similarity serving as the semantic similarity.

In a specific implementation, each word in the phrase set can be vectorized using the word2vec method.

For the structure and beneficial effects of the synonym discovery device of this embodiment, refer to the description of the steps and beneficial effects of the synonym discovery method of Fig. 3; they are not repeated here.

An embodiment of the present invention also provides a data processing device that uses the synonym discovery device shown in Fig. 4 or Fig. 5. The data processing device may include:

A knowledge point acquisition unit, adapted to obtain a knowledge point, the knowledge point including a question and a corresponding answer;

A synonym search unit, adapted to, for any keyword obtained by segmenting the question, judge from a thesaurus whether the keyword has a synonym;

A replacement unit, adapted to replace the keyword with the synonym found when the keyword has a synonym;

A knowledge point expansion unit, adapted to store the question obtained after replacement and add it to the knowledge point.

For the structure and beneficial effects of the data processing device, refer to the description of the above data processing method; they are not repeated here.

Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include ROM, RAM, a magnetic disk, an optical disc, or the like.

Although the present invention is disclosed as above, it is not limited thereto. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention, and the scope of protection of the present invention shall therefore be as defined by the claims.

Claims

1. A synonym discovery method, characterized by comprising:

obtaining a phrase set to be processed, the phrase set including multiple words;

for any word to be processed in the phrase set, when the phrase set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determining the word to be processed and a corresponding target word to be a synonym pair, the target word being obtained from the word to be processed by deletion operations only;

wherein the minimum edit distance is calculated by an edit distance method, the edit distance method includes a deletion operation, the edit distance of the deletion operation is less than that of the remaining operations, the edit distance of a single deletion operation and the edit distance of multiple deletion operations are both less than the preset threshold, and the edit distance of a single remaining operation is greater than or equal to the preset threshold.

2. The synonym discovery method according to claim 1, characterized in that the method further comprises: calculating the semantic similarity between the word to be processed and each remaining word in the phrase set, and selecting as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values;

the target word is determined as follows: calculating the minimum edit distance between the word to be processed and each candidate word, and taking as target words the candidate words whose minimum edit distance from the word to be processed is less than the preset threshold.

3. The synonym discovery method according to claim 2, characterized in that calculating the semantic similarity between the word to be processed and each remaining word in the phrase set comprises:

vectorizing each word in the phrase set;

calculating, based on the vectorization result, the cosine similarity between the word to be processed and each remaining word, the cosine similarity serving as the semantic similarity.

4. The synonym discovery method according to claim 3, characterized in that vectorizing each word in the phrase set comprises:

5. The synonym discovery method according to claim 1, characterized in that obtaining the phrase set to be processed comprises:

segmenting an input corpus to obtain the phrase set.

6. The synonym discovery method according to claim 5, characterized in that the input corpus is segmented using a segmentation dictionary, the segmentation dictionary being obtained as follows:

preprocessing the input corpus to obtain text data;

splitting the text data into lines to obtain sentence data;

segmenting the sentence data according to the individual words contained in a basic dictionary to obtain segmented word data;

combining adjacent segmented word data to generate candidate data strings;

judging the candidate data strings to discover new words;

adding the new words to the segmentation dictionary.

7. The synonym discovery method according to claim 1, characterized in that the remaining operations include an insertion operation and a replacement operation, the edit distance of a single insertion operation is greater than or equal to the preset threshold, and the edit distance of a single replacement operation is greater than or equal to the preset threshold.

8. A data processing method, characterized by comprising the synonym discovery method according to any one of claims 1 to 7.

9. A synonym discovery device, characterized by comprising:

a synonym determination unit, adapted to, for any word to be processed in the phrase set, when the phrase set contains one or more target words such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determine the word to be processed and a corresponding target word to be a synonym pair, the target word being obtained from the word to be processed by deletion operations only;

10. The synonym discovery device according to claim 9, characterized in that the device further comprises:

a candidate word selection unit, adapted to calculate the semantic similarity between the word to be processed and each remaining word in the phrase set, and to select as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values;

a target word determination unit, adapted to calculate the minimum edit distance between the word to be processed and each candidate word, and to take as target words the candidate words whose minimum edit distance from the word to be processed is less than the preset threshold.

11. The synonym discovery device according to claim 10, characterized in that the candidate word selection unit comprises:

a cosine similarity calculation subunit, adapted to calculate, based on the vectorization result, the cosine similarity between the word to be processed and each remaining word, the cosine similarity serving as the semantic similarity.

12. The synonym discovery device according to claim 11, characterized in that the vectorization subunit vectorizes each word in the phrase set using the word2vec method.

13. The synonym discovery device according to claim 9, characterized in that the acquisition unit comprises a word segmentation subunit adapted to segment an input corpus to obtain the phrase set.

14. The synonym discovery device according to claim 13, characterized in that the word segmentation subunit segments the input corpus using a segmentation dictionary, the segmentation dictionary is obtained by a segmentation dictionary acquisition unit, and the segmentation dictionary acquisition unit is adapted to:

preprocess the input corpus to obtain text data; split the text data into lines to obtain sentence data; segment the sentence data according to the individual words contained in a basic dictionary to obtain segmented word data; combine adjacent segmented word data to generate candidate data strings; judge the candidate data strings to discover new words; and add the new words to the segmentation dictionary.

15. The synonym discovery device according to claim 9, characterized in that the remaining operations include an insertion operation and a replacement operation, the edit distance of a single insertion operation is greater than or equal to the preset threshold, and the edit distance of a single replacement operation is greater than or equal to the preset threshold.

16. A data processing device, characterized by comprising the synonym discovery device according to any one of claims 9 to 15.