CN106126494B - Synonym discovery method and device, data processing method and device - Google Patents
- Publication number
- CN106126494B (application number CN201610429937.XA)
- Authority
- CN
- China
- Prior art keywords
- word
- processed
- synonym
- phrase set
- distance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
A synonym discovery method and device, and a data processing method and device. The synonym discovery method includes: obtaining a phrase set to be processed, the phrase set including multiple words; and, for any word to be processed in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determining the word to be processed and a corresponding target word to be a synonym pair. The minimum edit distance is computed by an edit distance method that includes a delete operation; the edit distance corresponding to the delete operation is less than the edit distance corresponding to each remaining operation, the edit distance corresponding to the delete operation is less than the preset threshold, and the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold. The scheme improves the accuracy of initialism discovery.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a synonym discovery method and device and a data processing method and device.
Background technique
Synonymy is an important semantic relation that is often used in natural language processing tasks such as information retrieval and text classification. Specifically, synonyms need to be acquired and identified before a task such as information retrieval or text classification is carried out. For example, in an information retrieval scenario, multiple words that are synonyms of one another can be grouped into one class; when the input text contains a keyword that has synonyms, the search can also be run with the synonyms in place of the original keyword, so that the retrieval system can offer the user more texts to choose from.
Written and everyday Chinese often uses shortened forms of proper names. Such a shortened form is called the initialism of the proper name; an initialism is a part of the original proper name and is also a kind of synonym. For example, "人大" is the initialism of "全国人民代表大会" (National People's Congress), "中国" (China) is the initialism of "中华人民共和国" (People's Republic of China), and "皇马" is the initialism of "皇家马德里" (Real Madrid).
However, synonym discovery methods in the prior art do not identify initialisms well, which lowers the accuracy of semantic understanding.
Summary of the invention
The technical problem solved by the present invention is to provide a synonym discovery method and device that improve the accuracy of initialism discovery.
To solve the above technical problem, an embodiment of the present invention provides a synonym discovery method, the method including: obtaining a phrase set to be processed, the phrase set including multiple words; for any word to be processed in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determining the word to be processed and a corresponding target word to be a synonym pair; wherein the minimum edit distance is computed by an edit distance method, the edit distance method includes a delete operation, the edit distance corresponding to the delete operation is less than the edit distance corresponding to each remaining operation, the edit distance corresponding to the delete operation is less than the preset threshold, and the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold.
Optionally, the method further includes: calculating the semantic similarity between the word to be processed and each remaining word in the phrase set, and selecting as candidate words either the words whose semantic similarity is greater than a similarity threshold or the top N words by semantic similarity;
The target word is determined as follows: the minimum edit distance between the word to be processed and each candidate word is calculated, and each candidate word whose minimum edit distance from the word to be processed is less than the preset threshold is taken as a target word.
Optionally, calculating the semantic similarity between the word to be processed and each remaining word in the phrase set includes:
Vectorizing each word in the phrase set; and, based on the vectorization result, calculating the cosine similarity between the word to be processed and each remaining word, the cosine similarity serving as the semantic similarity.
Optionally, vectorizing each word in the phrase set includes:
Vectorizing each word in the phrase set using the word2vec method.
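The candidate selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `cosine_similarity` and `select_candidates` are hypothetical, and the three-dimensional vectors are hand-written stand-ins for vectors that a word2vec model would produce. The example words are "工商银行" (Industrial and Commercial Bank), its short form "工行", and "汇款" (remittance).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_candidates(word, vectors, sim_threshold=None, top_n=None):
    """Rank all other words by similarity to `word`, then keep those
    above `sim_threshold` and/or the `top_n` most similar ones."""
    scored = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if sim_threshold is not None:
        scored = [pair for pair in scored if pair[1] > sim_threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [other for other, _ in scored]

# Toy vectors (assumed values, standing in for word2vec output).
vectors = {
    "工商银行": [0.9, 0.1, 0.0],
    "工行": [0.8, 0.2, 0.1],
    "汇款": [0.0, 0.1, 0.9],
}
candidates = select_candidates("工商银行", vectors, top_n=1)
```

Only the words returned here would then be passed on to the edit distance check, which is what narrows the search range.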
Optionally, obtaining the phrase set to be processed includes:
Segmenting an input corpus to obtain the phrase set.
Optionally, the input corpus is segmented using a segmentation dictionary, and the segmentation dictionary is obtained as follows:
The input corpus is preprocessed to obtain text data; the text data is split into lines to obtain phrase data; the phrase data is segmented according to the individual words contained in a basic dictionary to obtain segmented word data; adjacent segmented word data are combined to generate candidate data strings; the candidate data strings are judged in order to discover new words; and the new words are added to the segmentation dictionary.
Optionally, the remaining operations include an insertion operation and a replacement operation; the edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
An embodiment of the present invention also provides a data processing method, which includes the above synonym discovery method.
An embodiment of the present invention also provides a synonym discovery device, the device including:
An acquiring unit, adapted to obtain a phrase set to be processed, the phrase set including multiple words;
A synonym determination unit, adapted to, for any word to be processed in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determine the word to be processed and a corresponding target word to be a synonym pair;
Wherein the minimum edit distance is computed by an edit distance method, the edit distance method includes a delete operation, the edit distance corresponding to the delete operation is less than the edit distance corresponding to each remaining operation, the edit distance corresponding to the delete operation is less than the preset threshold, and the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold.
Optionally, the synonym discovery device further includes:
A candidate word selection unit, adapted to calculate the semantic similarity between the word to be processed and each remaining word in the phrase set, and to select as candidate words either the words whose semantic similarity is greater than a similarity threshold or the top N words by semantic similarity;
A target word determination unit, adapted to calculate the minimum edit distance between the word to be processed and each candidate word, and to take each candidate word whose minimum edit distance from the word to be processed is less than the preset threshold as a target word.
Optionally, the candidate word selection unit includes:
A vectorization subunit, adapted to vectorize each word in the phrase set;
A cosine similarity calculation subunit, adapted to calculate, based on the vectorization result, the cosine similarity between the word to be processed and each remaining word, the cosine similarity serving as the semantic similarity.
Optionally, the vectorization subunit vectorizes each word in the phrase set using the word2vec method.
Optionally, the acquiring unit includes:
A segmentation subunit, adapted to segment an input corpus to obtain the phrase set.
Optionally, the segmentation subunit segments the input corpus using a segmentation dictionary, the segmentation dictionary is obtained by a segmentation dictionary acquiring unit, and the segmentation dictionary acquiring unit is adapted to:
Preprocess the input corpus to obtain text data; split the text data into lines to obtain phrase data; segment the phrase data according to the individual words contained in a basic dictionary to obtain segmented word data; combine adjacent segmented word data to generate candidate data strings; judge the candidate data strings in order to discover new words; and add the new words to the segmentation dictionary.
Optionally, the remaining operations include an insertion operation and a replacement operation; the edit distance corresponding to a single insertion operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replacement operation is greater than or equal to the preset threshold.
An embodiment of the present invention also provides a data processing device, which includes the above synonym discovery device.
Compared with the prior art, the technical solution of the embodiments of the present invention has the following advantages:
An embodiment of the present invention obtains a phrase set to be processed; for any word to be processed in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, the word to be processed and a corresponding target word are determined to be a synonym pair. The minimum edit distance is computed by an edit distance method that includes a delete operation; the edit distance corresponding to the delete operation is less than the edit distance corresponding to each remaining operation, the edit distance corresponding to the delete operation is less than the preset threshold, and the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold. On the one hand, by requiring that the edit distance of the delete operation be less than that of the remaining operations, the scheme ensures that the minimum edit distance is obtained by preferring delete operations. On the other hand, because the edit distance of the delete operation is less than the preset threshold while the edit distance of any single remaining operation is greater than or equal to the preset threshold, a minimum edit distance from the word to be processed to a target word that is less than the preset threshold means that the target word is obtained from the word to be processed by delete operations alone. This ensures that a synonym obtained by the edit distance method is a part of the literal expression of the word to be processed, so the initialisms obtained are more accurate and the accuracy of initialism discovery improves.
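The reasoning behind the threshold condition can be written out as a short derivation. The symbols below are introduced here for illustration only and do not appear in the original text: $c_d$, $c_i$ and $c_r$ are the costs of a single delete, insertion and replacement operation, $T$ is the preset threshold, and $n_d$, $n_i$, $n_r$ are the numbers of each operation on the cheapest edit path. Costs are assumed to be non-negative.

```latex
% Conditions stated in the scheme:
%   c_d < T, \qquad c_i \ge T, \qquad c_r \ge T
% Cost of the cheapest edit path:
D = n_d\, c_d + n_i\, c_i + n_r\, c_r
% If the path uses at least one insertion or replacement:
n_i + n_r \ge 1 \;\Longrightarrow\; D \ge \min(c_i, c_r) \ge T
% Contrapositive: a distance below the threshold uses deletions only
D < T \;\Longrightarrow\; n_i = n_r = 0 \;\Longrightarrow\; D = n_d\, c_d
```

The converse does not hold automatically: a deletion-only path stays below the threshold only while $n_d\, c_d < T$, so the ratio $T / c_d$ also bounds how many characters may be dropped.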
Further, by calculating the semantic similarity between the word to be processed and the remaining words in the phrase set, multiple candidate words are selected, and the target word can then be determined within the smaller range formed by those candidate words. Because the candidate words are a subset of the phrase set to be processed, determining the target word from among them improves the efficiency of determining synonym pairs. At the same time, using semantic similarity as an additional criterion for synonymy further improves the accuracy of discovering synonym pairs, and therefore the accuracy of discovering initialisms.
Brief description of the drawings
Fig. 1 is a flow chart of a synonym discovery method in an embodiment of the present invention;
Fig. 2 is a flow chart of a method for obtaining a segmentation dictionary in an embodiment of the present invention;
Fig. 3 is a flow chart of another synonym discovery method in an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of a synonym discovery device in an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of another synonym discovery device in an embodiment of the present invention.
Specific embodiment
Written and everyday Chinese often uses shortened forms of proper names. Such a shortened form is called the initialism of the proper name; an initialism is a part of the original proper name and is also a kind of synonym. For example, "人大" is the initialism of "全国人民代表大会" (National People's Congress), "中国" (China) is the initialism of "中华人民共和国" (People's Republic of China), and "皇马" is the initialism of "皇家马德里" (Real Madrid). However, synonym discovery methods in the prior art do not identify initialisms well, which lowers the accuracy of semantic understanding.
An embodiment of the present invention obtains a phrase set to be processed; for any word to be processed in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, the word to be processed and a corresponding target word are determined to be a synonym pair. The minimum edit distance is computed by an edit distance method that includes a delete operation; the edit distance corresponding to the delete operation is less than the edit distance corresponding to each remaining operation, the edit distance corresponding to the delete operation is less than the preset threshold, and the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold. On the one hand, by requiring that the edit distance of the delete operation be less than that of the remaining operations, the scheme ensures that the minimum edit distance is obtained by preferring delete operations. On the other hand, because the edit distance of the delete operation is less than the preset threshold while the edit distance of any single remaining operation is greater than or equal to the preset threshold, a minimum edit distance from the word to be processed to a target word that is less than the preset threshold means that the target word is obtained from the word to be processed by delete operations alone. This ensures that a synonym obtained by the edit distance method is a part of the literal expression of the word to be processed, so the initialisms obtained are more accurate and the accuracy of initialism discovery improves.
To make the above objects, features and beneficial effects of the present invention clearer and easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a synonym discovery method in an embodiment of the present invention. It is described below with reference to the steps shown in Fig. 1.
Step S101: obtain a phrase set to be processed, the phrase set including multiple words.
The phrase set to be processed is the phrase set from which synonym pairs are to be discovered.
In a specific implementation, the phrase set to be processed is obtained by segmenting an input corpus. The input corpus may be non-text data such as voice data, or it may be text data. When the input corpus is non-text data, it must first be converted into text data, so that all subsequent processing operates on text. The input corpus may be obtained from conversation records between a question answering system and its users, or from manually organized knowledge point data.
In a specific implementation, the input corpus may come from one specific field, so the words in the phrase set to be processed are semantic expressions relating to that field, and may include words with the same meaning but different forms, i.e. synonyms. The specific field may be banking, education, sports, and so on.
For example, suppose the input corpus comes from the banking field. Some sentences may refer to a bank by its full name "招商银行" (China Merchants Bank), while others use the short form "招行"; "招商银行" and "招行" are a synonym pair. Similarly, "工商银行" (Industrial and Commercial Bank) and "工行" are a synonym pair. In these two pairs, "招行" is the initialism of "招商银行" and "工行" is the initialism of "工商银行". Other synonyms that are not initialisms may of course also occur; for example, "remittance" and "remit money" are synonyms, but neither is an initialism of the other. The present embodiment discovers initialisms through steps S101 and S102.
In a specific implementation, the input corpus is segmented using a segmentation dictionary. For the segmentation result to contain initialisms, in other words for initialisms to be split out of a sentence as words, the segmentation dictionary must contain them. An initialism is a kind of new word and may not be present in a general basic dictionary, so the basic dictionary needs to be updated through new word discovery: initialisms are added to the updated basic dictionary as new words, and the updated basic dictionary is then used as the segmentation dictionary to segment the input corpus.
So that the segmentation dictionary contains initialisms, it is obtained through the steps shown in Fig. 2.
S11: preprocess the input corpus to obtain text data.
The input corpus may come in many formats; to facilitate subsequent processing, it is preprocessed to obtain text data.
In a specific implementation, the preprocessing may unify the format of the input corpus into text format and filter out one or more of profanity, sensitive words and stop words. When unifying the format into text, content that current technology cannot yet convert into text may be filtered out.
S12: split the text data into lines to obtain phrase data.
The line splitting may break the input corpus at punctuation marks, for example starting a new line wherever a full stop, comma, exclamation mark or question mark occurs. Obtaining phrase data is a preliminary division of the corpus that fixes the range of the subsequent word segmentation.
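A minimal sketch of this punctuation-based line splitting is shown below. The punctuation set and the function name `split_into_lines` are assumptions made for illustration; the text only names full stops, commas, exclamation marks and question marks as examples.

```python
import re

# Chinese and ASCII sentence punctuation (an assumed, non-exhaustive set).
PUNCTUATION = r"[。，！？；.,!?;]"

def split_into_lines(text):
    """Break text at punctuation marks and drop empty fragments."""
    parts = re.split(PUNCTUATION, text)
    return [part.strip() for part in parts if part.strip()]

lines = split_into_lines("我要办卡，怎么办理？谢谢。")
```

Each returned fragment is one line of phrase data, and later segmentation never combines words across two fragments.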
S13: segment the phrase data according to the individual words contained in the basic dictionary to obtain segmented word data.
The term "basic dictionary" is used to distinguish it from the segmentation dictionary; the basic dictionary may not contain initialisms. The basic dictionary contains multiple individual words, which may differ in length. In a specific implementation, segmentation based on the basic dictionary may use one or more of dictionary-based bidirectional maximum matching, the HMM method and the CRF method.
The segmentation is applied to the phrase data of one line at a time, and every resulting piece of word data is an individual word contained in the basic dictionary.
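As a rough illustration of dictionary-based segmentation, the sketch below implements forward maximum matching only, which is one half of the bidirectional maximum matching named above; the backward pass and the HMM and CRF alternatives are not shown. The function name, the `max_len` parameter and the toy dictionary are assumptions. Characters not covered by the dictionary are emitted as single-character words.

```python
def forward_max_match(line, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the
    longest dictionary word, falling back to a single character."""
    words = []
    i = 0
    while i < len(line):
        for size in range(min(max_len, len(line) - i), 0, -1):
            piece = line[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

# Toy basic dictionary (assumed contents).
basic_dictionary = {"白金", "卡", "办理", "怎么"}
segmented = forward_max_match("怎么办理白金卡", basic_dictionary)
```

Note that "白金卡" (platinum card) comes out as two words because the basic dictionary does not contain it, which is exactly the situation that the new word discovery in the following steps is meant to repair.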
S14: combine adjacent segmented word data to generate candidate data strings.
Segmenting with the basic dictionary may split word data that should be a single word in a given field into several pieces of word data, which is why new word discovery is needed. Candidate data strings are subsequently screened against set conditions, and those that pass are taken as new words. Generating the candidate data strings, the prerequisite of that screening, can be done in several ways.
In a specific implementation, a Bigram model may be used to take every two adjacent words in the phrase data of one line as a candidate data string.
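Generating bigram candidates from segmented lines can be sketched as below. The function name `bigram_candidates` and the sample data are illustrative assumptions; counting occurrences is added here because the later screening relies on frequency statistics.

```python
from collections import Counter

def bigram_candidates(segmented_lines):
    """Count every pair of adjacent words within each segmented line."""
    counts = Counter()
    for words in segmented_lines:
        for left, right in zip(words, words[1:]):
            counts[(left, right)] += 1
    return counts

# Two already-segmented lines of phrase data (assumed sample).
segmented_lines = [["办理", "白金", "卡"], ["白金", "卡", "年费"]]
counts = bigram_candidates(segmented_lines)
```

Pairs are formed only inside a line, so words separated by punctuation never become a candidate data string.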
Suppose a sentence S can be expressed as a sequence $S = w_1 w_2 \dots w_n$. A language model computes the probability $p(S)$ of the sentence:

$$p(S) = p(w_1, w_2, \dots, w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \dots, w_{n-1}) \tag{1}$$

Formula (1) is the general N-gram form; computing these probabilities is too expensive to be usable in practice. The Markov assumption simplifies it: the occurrence of the next word depends only on the one or few words before it. If the next word depends only on the one word before it, then:

$$p(S) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1}) \tag{2}$$

If the next word depends on the two words before it, then:

$$p(S) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_{n-2}, w_{n-1}) \tag{3}$$

Formula (2) is the Bigram probability and formula (3) is the Trigram probability. A larger n places more constraints on the next word and so discriminates better; a smaller n makes each candidate data string occur more often during new word discovery, which gives more reliable statistics.
In theory a larger n captures more context, and in existing processing methods Trigram is the most widely used; Bigram, however, requires less computation and gives higher system efficiency.
S15: judge whether the candidate data string is a special candidate data string. A special candidate data string contains a basic noun, and the word at a specific relative position of the basic noun is a noun or an adjective.
The inventors found that when the word at a specific relative position of a basic noun is a noun or an adjective, the combination very likely needs to be treated as a new word. Take the basic noun "card": when the word to the left of "card" is a noun, the two can form "Long Card", "elite school card", "platinum card", "business card" and so on. Judging whether a candidate data string is a special candidate data string therefore means judging whether it contains a basic noun and whether the word at the specific relative position of that basic noun is a noun or an adjective.
The specific relative position can be set according to the basic noun and the corpus. For example, when the corpus contains many kinds of "card" and the names of the various cards need to be treated as new words, the specific relative position can be set to the left of the basic noun, so that the word to its left must be a noun or an adjective.
In a specific implementation, the specific relative position may be the left side, the right side or both, and can be configured as needed.
In a specific implementation, the basic noun may be determined by reference to word frequency, since a basic noun occurs repeatedly in the corpus. The basic noun may of course also be selected and set by manual reading.
S16: judge the candidate data strings in order to discover new words. The judgement includes:
When the candidate data string is a non-special candidate data string, calculating the information entropy between each word in the candidate data string and its inner-side word, and removing candidate data strings whose information entropy falls outside a preset range;
When the candidate data string is a special candidate data string, calculating only the information entropy between the word other than the basic noun and its inner-side word, and removing candidate data strings whose information entropy falls outside the preset range.
Because a candidate data string consists of two pieces of word data, judging it requires judging the inner-side information entropy of each of the two. Information entropy is a measure of the uncertainty of a random variable, calculated as:

$$H(X) = -\sum_i p(x_i) \log p(x_i)$$

The larger the information entropy, the greater the uncertainty of the variable, i.e. the more evenly the probability is spread over its possible values. If one value of the variable occurs with probability 1, the entropy is 0: the variable can only take that value, which is a certain event.
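The entropy formula can be checked numerically with a few lines of code. This is a sketch under one assumption: the natural logarithm is used, since the text does not state the base of the logarithm. The function name `entropy` is illustrative.

```python
import math

def entropy(probabilities):
    """Shannon entropy of a discrete distribution (natural log).
    Zero-probability outcomes contribute nothing to the sum."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

certain = entropy([1.0])             # a single outcome with probability 1
fair_coin = entropy([0.5, 0.5])      # two equally likely outcomes
skewed_coin = entropy([0.9, 0.1])    # two unequally likely outcomes
```

The three values illustrate the two statements above: a certain event has zero entropy, and for a fixed number of outcomes the evenly spread distribution has the larger entropy.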
The left-side and right-side information entropy of a word W are calculated as follows:

$$H_1(W) = -\sum_{x \in X,\ \#(xW) > 0} P(x \mid W) \log P(x \mid W)$$

where X is the set of all word data appearing to the left of W, $\#(xW)$ is the number of times x occurs immediately to the left of W, and $H_1(W)$ is the left-side information entropy of word data W.

$$H_2(W) = -\sum_{y \in Y,\ \#(Wy) > 0} P(y \mid W) \log P(y \mid W)$$

where Y is the set of all word data appearing to the right of W, $\#(Wy)$ is the number of times y occurs immediately to the right of W, and $H_2(W)$ is the right-side information entropy of word data W.
The inner-side information entropy is computed by fixing each individual piece of word data in the candidate data string in turn and calculating the entropy of the other word given that this word data occurs. For a candidate data string (W1 W2), this means calculating the right-side information entropy of word data W1 and the left-side information entropy of word data W2.
The entropy between a piece of word data and its inner-side neighbour in the candidate data string reflects how disordered the word data on its inner side is. For example, calculating the right-side information entropy of the left word data W1 and the left-side information entropy of the right word data W2 in candidate data string W1W2 shows how disordered the inner sides of W1 and W2 are; a preset range can then be used to screen out candidate data strings in which the probability characteristic of a word and its inner-side word forming a new word lies outside the preset range.
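A small sketch of the left-side and right-side entropy computation follows. It estimates the conditional probabilities from neighbour counts in segmented lines; the function name `side_entropy`, the natural logarithm and the sample corpus are assumptions made for illustration. The sample uses "白金" (platinum), "龙" (Long), "名校" (elite school) and "卡" (card).

```python
import math
from collections import Counter

def side_entropy(word, segmented_lines, side):
    """Entropy of the words seen immediately to the 'left' or 'right'
    of `word`, with probabilities estimated from neighbour counts."""
    neighbours = Counter()
    for words in segmented_lines:
        for i, current in enumerate(words):
            if current != word:
                continue
            j = i - 1 if side == "left" else i + 1
            if 0 <= j < len(words):
                neighbours[words[j]] += 1
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbours.values())

# Assumed sample of segmented lines.
corpus = [["白金", "卡"], ["白金", "卡"], ["龙", "卡"], ["名校", "卡"]]
right_of_platinum = side_entropy("白金", corpus, "right")
left_of_card = side_entropy("卡", corpus, "left")
```

In this sample "白金" is always followed by "卡", so its right-side entropy is zero, while "卡" is preceded by three different words, so its left-side entropy is clearly positive.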
In a special candidate data string, the inner-side information entropy of the basic noun may lie outside the preset range and cause a special candidate data string that should be a new word to be excluded. For example, for special candidate data strings containing the basic noun "card", such as "platinum card", "elite school card" and "Long Card", the right-side information entropy of the words "platinum", "elite school" and "Long" is within the preset range, but because the words to the left of "card" are highly varied, its left-side information entropy may lie outside the preset range, so that candidate data strings such as "platinum card", "elite school card" and "Long Card" would be wrongly excluded.
Therefore, when the candidate data string is a special candidate data string, only the information entropy between the word other than the basic noun and its inner-side word is calculated, and candidate data strings whose information entropy is outside the preset range are removed; the inner-side information entropy of the basic noun is no longer calculated, which avoids wrongly excluding candidates because the basic noun's inner-side information entropy lies outside the preset range.
S17: add the new words to the segmentation dictionary.
Because an initialism is a kind of new word, the set of new words obtained from the input corpus also contains initialisms. Adding the new words to the segmentation dictionary therefore makes the segmentation dictionary contain initialisms, and the input corpus can then be segmented with it to obtain the phrase set of this embodiment.
The steps that follow obtaining the phrase set to be processed are described next.
Step S102: for any word to be processed in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, the word to be processed and a corresponding target word are determined to be a synonym pair.
The minimum edit distance is computed by an edit distance method that includes a delete operation; the edit distance corresponding to the delete operation is less than the edit distance corresponding to each remaining operation, the edit distance corresponding to the delete operation is less than the preset threshold, and the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold.
In a specific implementation, the remaining operations may include a replacement operation and an insertion operation. The edit distance referred to in this embodiment is the editing cost of converting one word into another through edit operations, i.e. the number of operations multiplied by the cost required for each operation. The minimum edit distance is the edit distance with the lowest editing cost. Each operation acts on only one character of the word.
Edit distance and minimum edit distance are illustrated below with the conversion of 工商银行 (the full name "Industrial and Commercial Bank") into 工行 (its short form). Different combinations of edit operations can convert 工商银行 into 工行. Assume that the edit distance of a single replace operation is 1000, that of a single delete operation is 1, and that of a single insert operation is 1000.
First conversion of 工商银行 into 工行: three delete operations delete 工, 商 and 银 from 工商银行, and one insert operation then inserts 工 to obtain 工行. The edit distance from 工商银行 to 工行 is then 3 × 1 + 1000 = 1003;
Second conversion of 工商银行 into 工行: two delete operations delete 工 and 商 from 工商银行, and one replace operation then replaces 银 with 工 to obtain 工行. The edit distance from 工商银行 to 工行 is then 2 × 1 + 1000 = 1002;
Third conversion of 工商银行 into 工行: two delete operations delete 商 and 银 from 工商银行, which gives 工行 directly. The edit distance from 工商银行 to 工行 is then 2 × 1 = 2.
It should be noted that the ways of converting the word to be processed 工商银行 into 工行 are not limited to the operation combinations listed above, and different conversions correspond to different edit distances. Among all possible conversions, however, the minimum edit distance is unique. The minimum edit distance from 工商银行 to 工行 is therefore 2, obtained by the third conversion above.
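The minimum edit distance in this example can be computed with the standard dynamic-programming recurrence for edit distance, extended with a separate cost per operation. The following is a minimal sketch rather than the patented implementation: the function name `weighted_edit_distance` and its parameter names are hypothetical, the default costs (1, 1000, 1000) are the ones assumed in the example, and the strings 工商银行 (full name, "Industrial and Commercial Bank") and 工行 (short form) are an assumed reading of the two example words.

```python
def weighted_edit_distance(source, target,
                           delete_cost=1, insert_cost=1000, replace_cost=1000):
    """Minimum total cost of turning `source` into `target`, one character per step."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = minimum cost of converting source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * delete_cost      # delete every character of source[:i]
    for j in range(1, cols):
        dist[0][j] = j * insert_cost      # insert every character of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost,                          # delete
                dist[i][j - 1] + insert_cost,                          # insert
                dist[i - 1][j - 1] + (0 if same else replace_cost),    # keep or replace
            )
    return dist[-1][-1]


print(weighted_edit_distance("工商银行", "工行"))  # 2
```

Because every insert or replace costs 1000, the recurrence settles on the deletion-only path whenever one exists, which matches the third conversion described above.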
Therefore, for any word to be processed in the phrase set, the minimum edit distance to another word is determinate. The minimum edit distance between the word to be processed and each of the other words in the phrase set is calculated; when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than the preset threshold, the word to be processed and the corresponding target word are determined to be a synonym pair. For example, let the phrase set be L = {A, B, C, D, E, F, G, H}. For the word to be processed A, the target word is sought in the subset M = {B, C, D, E, F, G, H}. When M contains a word B such that the minimum edit distance from A to B is less than the preset threshold, A and B form a synonym pair.
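The search over the phrase set described above can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the helper names `edit_distance` and `find_synonym_pairs` are hypothetical, the costs (delete 1, insert 1000, replace 1000) and the threshold 10 are the example values used elsewhere in this description, and the four Chinese strings are assumed readings of the bank names used in the examples (招商银行 and 工商银行 are full names, 招行 and 工行 their short forms).

```python
from itertools import permutations


def edit_distance(src, dst, delete=1, insert=1000, replace=1000):
    """Weighted minimum edit distance from src to dst (row-by-row dynamic programming)."""
    prev = [j * insert for j in range(len(dst) + 1)]
    for i, s in enumerate(src, start=1):
        curr = [i * delete]
        for j, d in enumerate(dst, start=1):
            curr.append(min(prev[j] + delete,
                            curr[j - 1] + insert,
                            prev[j - 1] + (0 if s == d else replace)))
        prev = curr
    return prev[-1]


def find_synonym_pairs(phrase_set, threshold=10):
    """Return (word to be processed, target word) pairs whose distance is below threshold."""
    return [(word, other)
            for word, other in permutations(phrase_set, 2)
            if edit_distance(word, other) < threshold]


pairs = find_synonym_pairs(["招商银行", "工行", "招行", "工商银行"])
print(pairs)
```

The distance is directional: the pair is reported only in the direction from the longer word to its shortened form, because the reverse direction needs insert operations.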
In the present embodiment, to guarantee that the target word in a discovered synonym pair is an initialism of the word to be processed, that is, that the initialism is necessarily a part of the word to be processed, the edit distance method requires that the edit distance of a single remaining operation be greater than or equal to the preset threshold and that the edit distance of the delete operation be less than that of the remaining operations. Moreover, not only is the edit distance of a single delete operation less than the preset threshold, but the total edit distance of multiple delete operations is also less than the preset threshold. The number of delete operations can be determined from the maximum number of characters deleted between a full name and its initialism; for example, if at most 5 characters are deleted, the number is 5.
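These constraints on the operation costs can be written down as a small consistency check. The sketch below is only an illustration; the function name `costs_satisfy_constraints` and the parameter `max_deletions` (the largest number of characters expected to be dropped between a full name and its initialism) are hypothetical names introduced here.

```python
def costs_satisfy_constraints(delete_cost, insert_cost, replace_cost,
                              threshold, max_deletions):
    """Check the cost constraints described in the text for a given threshold."""
    others = (insert_cost, replace_cost)
    return (
        all(delete_cost < cost for cost in others)       # deletion is the cheapest operation
        and max_deletions * delete_cost < threshold      # even the most deletions stay below threshold
        and all(cost >= threshold for cost in others)    # a single insert or replace reaches threshold
    )


print(costs_satisfy_constraints(1, 1000, 1000, threshold=10, max_deletions=5))   # True
print(costs_satisfy_constraints(1, 1000, 1000, threshold=10, max_deletions=10))  # False
```

With the example costs, a threshold of 10 allows at most 9 deletions, so the threshold should be chosen with the longest expected full name in mind.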
In a specific implementation, the above method may find one or more initialisms for a word to be processed. It should be noted that the words that form synonym pairs with the same word to be processed do not necessarily stand in an initialism relationship with one another. For example, in the phrase set L, the method of the present embodiment finds that word B is one initialism of the word to be processed A and that word C is another, i.e. the minimum edit distance from A to B and the minimum edit distance from A to C are both less than the preset threshold. Word B and word C are not necessarily in an initialism relationship, i.e. it cannot be guaranteed that B is an initialism of C or that C is an initialism of B, but B and C are synonyms of each other.
It should also be noted that multiple words to be processed that share the same initialism under the method of the present embodiment are not necessarily synonyms of one another. For example, in the phrase set L, the method of the present embodiment finds that the initialism of the word to be processed A is B and that the initialism of the word to be processed D is also B, but A and D are not necessarily synonyms.
In the present embodiment, the edit distance corresponding to the delete operation is limited to be less than the edit distance corresponding to the remaining operations, so that when the minimum edit distance is calculated by the edit distance method, the delete operation is preferred among the edit operations that convert the word to be processed into another word. In addition, the edit distance corresponding to the delete operation is less than the preset threshold, while the edit distance corresponding to a single one of the remaining operations is greater than or equal to the preset threshold. As a result, when the minimum edit distance from the word to be processed to a target word is less than the preset threshold, the target word can only have been obtained from the word to be processed by delete operations. This ensures that a synonym obtained by the edit distance method is a part of the literal form of the word to be processed, so that the initialisms obtained are more accurate and the accuracy of initialism discovery is improved.
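Stated differently, under these cost constraints a minimum edit distance below the threshold means that the characters of the target word appear in the word to be processed in the same order, with only some characters dropped. The following sketch checks that property directly; it is an equivalent formulation offered for illustration, not the method claimed in the text, and the function name `obtainable_by_deletion_only` is hypothetical. The Chinese strings are assumed readings of the example bank names.

```python
def obtainable_by_deletion_only(source, target):
    """True if `target` can be produced from `source` using delete operations alone."""
    remaining = iter(source)
    # `ch in remaining` advances the iterator until ch is found, so order is preserved.
    return all(ch in remaining for ch in target)


print(obtainable_by_deletion_only("招商银行", "招行"))  # True
print(obtainable_by_deletion_only("招商银行", "工行"))  # False
```

The equivalence with the thresholded distance holds only while the number of deleted characters multiplied by the delete cost stays below the threshold.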
Fig. 3 is a flowchart of a synonym discovery method in an embodiment of the present invention. It is described below with reference to the steps shown in Fig. 3.
Step S301: obtain a phrase set to be processed, the phrase set including multiple words.
For the implementation of this step, refer to step S101 shown in Fig. 1; details are not repeated here.
Step S302: for any word to be processed in the phrase set, calculate the semantic similarity between the word to be processed and each of the remaining words in the phrase set, and select as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the N words with the highest semantic similarity values.
In one specific implementation, the semantic similarity value between each remaining word and the word to be processed is compared with the similarity threshold, and the words whose semantic similarity value is greater than the similarity threshold are taken as candidate words. It should be noted that the similarity threshold may be preset to different values and is not limited here; the number of candidate words then varies with the similarity threshold.
In another specific implementation, the candidate words with the higher semantic similarity values are obtained by fixing the number N of candidate words. Specifically, the semantic similarity values are sorted from high to low, and the N words with the highest semantic similarity values are taken as candidate words.
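The two selection rules of step S302 can be sketched as one helper. This is an illustrative sketch: the function name `select_candidates` is hypothetical, and the similarity scores in the usage example are made-up values, not results from the source.

```python
def select_candidates(similarities, threshold=None, top_n=None):
    """Pick candidate words from a {word: similarity} mapping.

    Exactly one of `threshold` (keep words scoring above it) or
    `top_n` (keep the N highest-scoring words) must be given.
    """
    if (threshold is None) == (top_n is None):
        raise ValueError("give exactly one of threshold or top_n")
    if threshold is not None:
        return [word for word, score in similarities.items() if score > threshold]
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return ranked[:top_n]


scores = {"B": 0.35, "C": 0.80, "D": 0.90}   # hypothetical similarity values
print(select_candidates(scores, threshold=0.5))  # ['C', 'D']
print(select_candidates(scores, top_n=2))        # ['D', 'C']
```

With a threshold the number of candidates depends on the data, whereas with a fixed N the amount of later edit-distance work is bounded in advance.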
This step selects candidate words from the phrase set so that the target word can subsequently be determined from among the candidate words. On the one hand, this narrows the range in which the target word that forms a synonym pair with the word to be processed is sought, which reduces computational complexity and improves the efficiency of initialism discovery. On the other hand, using semantic similarity as a further criterion for judging whether two words are synonyms improves the accuracy of synonym pair discovery, and therefore the accuracy of initialism discovery.
In a specific implementation, the semantic similarity between the word to be processed and each of the remaining words in the phrase set can be calculated by the following steps:
First, each word in the phrase set is vectorized;
Second, based on the vectorization result, the cosine similarity between the word to be processed and each of the remaining words is calculated, and the cosine similarity is used as the semantic similarity. After the cosine similarity is calculated, the words whose cosine similarity value is greater than the similarity threshold, or the N words with the highest cosine similarity values, can be selected as candidate words.
In a specific implementation, each word in the phrase set can be vectorized using the word2vec method. It should be pointed out that other existing methods can also be used to vectorize each word in the phrase set.
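Once each word has a vector, the cosine similarity step is a short calculation. The sketch below assumes the vectors are already available (for instance from a trained word2vec model, which is not shown here); the helper names `cosine_similarity` and `similarities_to` are hypothetical, and the two-dimensional vectors are toy values chosen only to make the arithmetic easy to follow.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarities_to(word, vectors):
    """Similarity between `word` and every other word in a {word: vector} mapping."""
    return {other: cosine_similarity(vectors[word], vec)
            for other, vec in vectors.items() if other != word}


toy_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}  # hypothetical vectors
print(similarities_to("A", toy_vectors))
```

The resulting mapping can be passed to a candidate selection step that applies the similarity threshold or keeps the top N words.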
Step S303: when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than the preset threshold, the word to be processed and the corresponding target word are determined to be a synonym pair.
The minimum edit distance is calculated by an edit distance method. The edit distance method includes a delete operation; the edit distance corresponding to the delete operation is less than the edit distance corresponding to the remaining operations and is less than the preset threshold.
The target word is determined as follows: the minimum edit distance between the word to be processed and each candidate word is calculated, and the candidate words whose minimum edit distance from the word to be processed is less than the preset threshold are taken as the target words.
In the present embodiment, the remaining operations include an insert operation and a replace operation. The edit distance corresponding to a single insert operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replace operation is greater than or equal to the preset threshold.
The implementation of steps S301 to S303 is illustrated with an example below. Each step takes one specific implementation as an example, which should not be construed as limiting the present invention.
In step S301, the phrase set to be processed is obtained as Q = {A, B, C, D}, where each of A, B, C and D may be the word to be processed. Assume that A is 招商银行 (the full name "China Merchants Bank"), B is 工行 (the short form of "Industrial and Commercial Bank"), C is 招行 (the short form of "China Merchants Bank") and D is 工商银行 (the full name "Industrial and Commercial Bank"). The following steps take A, 招商银行, as the word to be processed.
In step S302, each word (A, B, C and D) in the phrase set Q is vectorized using the word2vec method. Based on the vectorization result, the cosine similarity between the word to be processed A and each of the remaining words B, C and D is calculated. Ordered from high to low cosine similarity, the words are D, C and B. The 2 words with the highest cosine similarity values are selected as candidate words, i.e. word D, 工商银行, and word C, 招行.
In step S303, for the word to be processed A, 招商银行, the edit distance method is used to calculate the minimum edit distance between A and candidate word D, 工商银行, and the minimum edit distance between A and candidate word C, 招行.
In the edit distance method of this example, the edit distance corresponding to the delete operation is less than the edit distances corresponding to the insert and replace operations and is less than the preset threshold, the edit distance corresponding to a single insert operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replace operation is greater than or equal to the preset threshold. Assume that the edit distance of a single delete operation is 1, that of a single insert operation is 1000, that of a single replace operation is 1000, and that the preset threshold is 10. Then:
Among all combinations of edit operations that convert the word to be processed 招商银行 (full name, "China Merchants Bank") into candidate word D, 工商银行 (full name, "Industrial and Commercial Bank"), the edit distance obtained with 1 replace operation is the smallest: 招 is replaced with 工, so the minimum edit distance is 1000;
Among all combinations of edit operations that convert the word to be processed 招商银行 into candidate word C, 招行 (short form of "China Merchants Bank"), the edit distance obtained with 2 delete operations is the smallest: 商 and 银 are deleted, so the minimum edit distance is 2;
Of the minimum edit distances calculated above, the one less than the preset threshold 10 is 2, so the corresponding target word is candidate word C, 招行. The word to be processed A, 招商银行, and candidate word C, 招行, are determined to be a synonym pair, and 招行 is the initialism of 招商银行.
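The final step of this worked example, going from the two candidate words to the target word, can be reproduced with a short script. It is a sketch under assumptions: `edit_distance` and `find_targets` are hypothetical helper names, the candidate list is taken as given from step S302 rather than computed from real word vectors, and the Chinese strings are assumed readings of the example words (招商银行 is the full name of China Merchants Bank, 招行 its short form, 工商银行 the full name of Industrial and Commercial Bank).

```python
def edit_distance(src, dst, delete=1, insert=1000, replace=1000):
    """Weighted minimum edit distance from src to dst."""
    prev = [j * insert for j in range(len(dst) + 1)]
    for i, s in enumerate(src, start=1):
        curr = [i * delete]
        for j, d in enumerate(dst, start=1):
            curr.append(min(prev[j] + delete,
                            curr[j - 1] + insert,
                            prev[j - 1] + (0 if s == d else replace)))
        prev = curr
    return prev[-1]


def find_targets(word, candidates, threshold=10):
    """Return the distance to every candidate and the candidates below the threshold."""
    distances = {cand: edit_distance(word, cand) for cand in candidates}
    targets = [cand for cand, dist in distances.items() if dist < threshold]
    return distances, targets


# Candidate words D and C, as selected in step S302 of the example.
distances, targets = find_targets("招商银行", ["工商银行", "招行"])
print(distances)  # {'工商银行': 1000, '招行': 2}
print(targets)    # ['招行']
```

Running the semantic filter first means the edit distance is evaluated for only two candidates here instead of for every other word in the phrase set.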
As another example, assume that the phrase set to be processed is P = {招商银行, 工行, 工商银行}, where 招商银行 is the full name "China Merchants Bank", 工商银行 is the full name "Industrial and Commercial Bank" and 工行 is the short form of the latter. For the word to be processed 招商银行, the semantic similarity between 招商银行 and 工行 and the semantic similarity between 招商银行 and 工商银行 are calculated, and both values are greater than the similarity threshold. The minimum edit distances are then calculated. Because the edit distance of the delete operation is less than that of the replace operation, the delete operation is preferred at each step:
Converting 招商银行 into 工行 takes at least 2 delete operations and 1 replace operation. Specifically, the cheapest conversion deletes 招 and 商 and replaces 银 with 工. The edit distance of a single delete operation is 1 and that of a single replace operation is 1000, so the minimum edit distance from 招商银行 to 工行 is 2 × 1 + 1000 = 1002;
Similarly, converting 招商银行 into 工商银行 takes at least 1 replace operation, which replaces 招 with 工. The edit distance of a single replace operation is 1000, so the minimum edit distance from 招商银行 to 工商银行 is 1000.
Both the minimum edit distance from 招商银行 to 工行 and the minimum edit distance from 招商银行 to 工商银行 are greater than the preset threshold 10, so neither candidate word 工行 nor candidate word 工商银行 is a target word. That is, the phrase set to be processed contains no word that forms a synonym pair with the word to be processed 招商银行.
In the present embodiment, the edit distance corresponding to the delete operation is limited to be less than the edit distance corresponding to the remaining operations, so that when the minimum edit distance is calculated by the edit distance method, the delete operation is preferred among the edit operations that convert the word to be processed into another word. On this basis, the edit distance corresponding to the delete operation is less than the preset threshold, while the edit distance corresponding to a single one of the remaining operations is greater than or equal to the preset threshold. As a result, when the minimum edit distance from the word to be processed to a target word is less than the preset threshold, the target word can only have been obtained from the word to be processed by delete operations. This ensures that a synonym obtained by the edit distance method is a part of the literal form of the word to be processed, so that the initialisms obtained are more accurate and the accuracy of initialism discovery is improved.
Further, the present embodiment calculates the semantic similarity between the word to be processed and the remaining words in the phrase set and selects multiple candidate words, so that the target word can be determined within the smaller range formed by the candidate words. Because the candidate words are a subset of the phrase set to be processed, determining the target word from among them improves the efficiency of determining synonym pairs. At the same time, using semantic similarity as a further criterion for judging whether two words are synonyms improves the accuracy of synonym pair discovery.
An embodiment of the present invention further provides a data processing method based on the above synonym discovery method. In the data processing method, synonyms are looked up in a thesaurus, and the thesaurus includes the initialisms obtained with the above synonym discovery method. The data processing method is described below.
The data processing method includes: obtaining a knowledge point, the knowledge point including a question and a corresponding answer; for any keyword obtained by segmenting the question, judging according to the thesaurus whether the keyword has a synonym; when the keyword has a synonym, replacing the keyword with the synonym found; and storing the question obtained after the replacement and adding it to the knowledge point.
For example, the above synonym discovery method finds that 招行 is the initialism of 招商银行 (the full name "China Merchants Bank"), and the two form one synonym pair in the thesaurus. The data processing method is then carried out as follows:
A knowledge point is obtained in which the question is "How is a 招商银行 credit card activated?" and the corresponding answer is S;
For 招商银行, one of the keywords obtained by segmenting the question "How is a 招商银行 credit card activated?", the thesaurus is consulted to judge whether the keyword has a synonym. Because 招商银行 has the synonym 招行, its initialism, 招行 replaces the keyword 招商银行 in the question, and the resulting question "How is a 招行 credit card activated?" is stored and added to the knowledge point. The original knowledge point is thus expanded to contain the questions "How is a 招商银行 credit card activated?" and "How is a 招行 credit card activated?", with the corresponding answer S. The synonym 招行 is obtained with the above synonym discovery method, which is not described again here.
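The knowledge point expansion described above can be sketched as follows. This is an illustration under assumptions rather than the patented implementation: the function name `expand_knowledge_point`, the dictionary layout of a knowledge point, and the `segment` callable (standing in for whatever word segmenter is used) are all hypothetical, and the toy segmenter in the usage example simply splits on spaces.

```python
def expand_knowledge_point(knowledge_point, segment, thesaurus):
    """Return a new knowledge point whose questions also include synonym-substituted variants.

    knowledge_point: {"questions": [str, ...], "answer": str}
    segment:         callable mapping a question to its list of keywords
    thesaurus:       {keyword: [synonym, ...]}
    """
    questions = list(knowledge_point["questions"])
    for question in knowledge_point["questions"]:
        for keyword in segment(question):
            for synonym in thesaurus.get(keyword, ()):
                variant = question.replace(keyword, synonym)
                if variant not in questions:      # avoid storing duplicates
                    questions.append(variant)
    return {"questions": questions, "answer": knowledge_point["answer"]}


point = {"questions": ["How is a 招商银行 credit card activated?"], "answer": "S"}
thesaurus = {"招商银行": ["招行"]}          # pair produced by synonym discovery
expanded = expand_knowledge_point(point, str.split, thesaurus)
print(expanded["questions"])
```

Returning a new dictionary rather than modifying the input keeps the original knowledge point intact, which is a design choice of this sketch and not something the text prescribes.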
It can thus be seen that the above synonym discovery method can be used to expand the questions in a knowledge point and thereby expand the knowledge base, so that the corresponding answer can still be returned when a question is phrased differently using an initialism. This improves the semantic understanding ability of an intelligent question answering system and the accuracy of its answers. It should be noted that the above synonym discovery method can be used not only to expand a knowledge base but also for information search. When applied to information search, it makes it possible to retrieve not only the information associated with a keyword but also the information associated with the initialism or the full name of that keyword.
Fig. 4 is a schematic structural diagram of a synonym discovery device in an embodiment of the present invention. The synonym discovery device may include an acquisition unit 401 and a synonym determination unit 402;
The acquisition unit 401 is adapted to obtain a phrase set to be processed, the phrase set including multiple words;
The synonym determination unit 402 is adapted to, for any word to be processed in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determine the word to be processed and the corresponding target word to be a synonym pair;
The minimum edit distance is calculated by an edit distance method. The edit distance method includes a delete operation; the edit distance corresponding to the delete operation is less than the edit distance corresponding to the remaining operations and is less than the preset threshold, and the edit distance corresponding to a single one of the remaining operations is greater than or equal to the preset threshold.
In a specific implementation, the remaining operations include an insert operation and a replace operation. The edit distance corresponding to a single insert operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replace operation is greater than or equal to the preset threshold.
In a specific implementation, the acquisition unit 401 includes a word segmentation subunit adapted to segment an input corpus to obtain the phrase set. In a specific implementation, the word segmentation subunit segments the input corpus using a word segmentation dictionary. The word segmentation dictionary is obtained by a word segmentation dictionary acquisition unit, which is adapted to:
preprocess the input corpus to obtain text data; split the text data into lines to obtain phrase data; segment the phrase data according to the independent words contained in a basic dictionary to obtain segmented word data; combine adjacent items of the segmented word data to generate candidate data strings; judge the candidate data strings to find new words; and add the new words to the word segmentation dictionary.
For the structure and beneficial effects of the synonym discovery device of the present embodiment, refer to the description of the steps and beneficial effects of the synonym discovery method of Fig. 1; they are not repeated here.
Fig. 5 is a schematic structural diagram of a synonym discovery device in an embodiment of the present invention. The synonym discovery device shown in Fig. 5 may include an acquisition unit 501, a candidate word selection unit 502, a target word determination unit 503 and a synonym determination unit 504.
The acquisition unit 501 is adapted to obtain a phrase set to be processed, the phrase set including multiple words.
The synonym determination unit 504 is adapted to, for any word to be processed in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determine the word to be processed and the corresponding target word to be a synonym pair. The minimum edit distance is calculated by an edit distance method in which the edit distance corresponding to the delete operation is less than the edit distance corresponding to the remaining operations and is less than the preset threshold, and the edit distance corresponding to a single one of the remaining operations is greater than or equal to the preset threshold.
In a specific implementation, the remaining operations include an insert operation and a replace operation. The edit distance corresponding to a single insert operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replace operation is greater than or equal to the preset threshold.
In a specific implementation, the acquisition unit 501 includes a word segmentation subunit 5011 adapted to segment an input corpus to obtain the phrase set.
In a specific implementation, the word segmentation subunit 5011 segments the input corpus using a word segmentation dictionary. The word segmentation dictionary is obtained by a word segmentation dictionary acquisition unit, which is adapted to:
preprocess the input corpus to obtain text data; split the text data into lines to obtain phrase data; segment the phrase data according to the independent words contained in a basic dictionary to obtain segmented word data; combine adjacent items of the segmented word data to generate candidate data strings; judge the candidate data strings to find new words; and add the new words to the word segmentation dictionary.
In a specific implementation, the synonym discovery device may further include:
a candidate word selection unit 502, adapted to calculate the semantic similarity between the word to be processed and each of the remaining words in the phrase set, and to select as candidate words either the words whose semantic similarity value is greater than a similarity threshold or the N words with the highest semantic similarity values;
a target word determination unit 503, adapted to calculate the minimum edit distance between the word to be processed and each candidate word, and to take as target words the candidate words whose minimum edit distance from the word to be processed is less than the preset threshold.
In a specific implementation, the candidate word selection unit 502 may include:
a vectorization subunit 5021, adapted to vectorize each word in the phrase set;
a cosine similarity calculation subunit 5022, adapted to calculate, based on the vectorization result, the cosine similarity between the word to be processed and each of the remaining words, the cosine similarity being used as the semantic similarity.
In a specific implementation, each word in the phrase set can be vectorized using the word2vec method.
For the structure and beneficial effects of the synonym discovery device of the present embodiment, refer to the description of the steps and beneficial effects of the synonym discovery method of Fig. 3; they are not repeated here.
An embodiment of the present invention further provides a data processing device that uses the synonym discovery device shown in Fig. 4 or Fig. 5. The data processing device may include:
a knowledge point acquisition unit, adapted to obtain a knowledge point, the knowledge point including a question and a corresponding answer;
a synonym search unit, adapted to judge according to a thesaurus, for any keyword obtained by segmenting the question, whether the keyword has a synonym;
a replacement unit, adapted to replace the keyword with the synonym found when the keyword has a synonym;
a knowledge point expansion unit, adapted to store the question obtained after the replacement and to add it to the knowledge point.
For the structure and beneficial effects of the data processing device, refer to the description of the above data processing method; they are not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include ROM, RAM, a magnetic disk, an optical disc, and the like.
Although the present invention is disclosed as above, it is not limited thereto. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the present invention, and the scope of protection of the present invention shall therefore be as defined by the claims.
Claims (16)
1. A synonym discovery method, characterized by comprising:
obtaining a phrase set to be processed, the phrase set comprising a plurality of words;
for any word to be processed in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determining the word to be processed and a corresponding target word as a synonym pair, the target word being obtained from the word to be processed by delete operations only;
wherein the minimum edit distance is calculated by an edit distance method, the edit distance method includes a delete operation, the edit distance corresponding to the delete operation is less than the edit distance corresponding to each remaining operation, the edit distance corresponding to a single delete operation and the edit distance corresponding to multiple delete operations are each less than the preset threshold, and the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold.
2. The synonym discovery method according to claim 1, characterized in that the method further comprises: separately calculating the semantic similarity between the word to be processed and each remaining word in the phrase set, and selecting from them, as candidate words, either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values;
the target word is determined in the following manner: separately calculating the minimum edit distance between the word to be processed and each candidate word, and taking as the target word each candidate word whose minimum edit distance to the word to be processed is less than the preset threshold.
3. The synonym discovery method according to claim 2, characterized in that separately calculating the semantic similarity between the word to be processed and each remaining word in the phrase set comprises:
vectorizing each word in the phrase set;
based on the vectorization result, calculating the cosine similarity between the word to be processed and each remaining word, the cosine similarity serving as the semantic similarity.
4. The synonym discovery method according to claim 3, characterized in that vectorizing each word in the phrase set comprises:
vectorizing each word in the phrase set using the word2vec method.
5. The synonym discovery method according to claim 1, characterized in that obtaining the phrase set in which synonyms are to be discovered comprises:
segmenting an input corpus into words to obtain the phrase set.
6. The synonym discovery method according to claim 5, characterized in that the input corpus is segmented using a word segmentation dictionary, and the word segmentation dictionary is obtained in the following manner:
preprocessing the input corpus to obtain text data;
splitting the text data into lines to obtain sentence data;
segmenting the sentence data according to the independent words contained in a basic dictionary to obtain segmented word data;
combining adjacent items of the segmented word data to generate candidate data strings;
judging the candidate data strings to discover new words;
adding the new words to the word segmentation dictionary.
7. The synonym discovery method according to claim 1, characterized in that the remaining operations include an insert operation and a replace operation, the edit distance corresponding to a single insert operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replace operation is greater than or equal to the preset threshold.
8. A data processing method, characterized by comprising the synonym discovery method according to any one of claims 1-7.
9. A synonym discovery device, characterized by comprising:
an acquiring unit, adapted to obtain a phrase set to be processed, the phrase set comprising a plurality of words;
a synonym determination unit, adapted to, for any word to be processed in the phrase set, when one or more target words exist in the phrase set such that the minimum edit distance from the word to be processed to the target word is less than a preset threshold, determine the word to be processed and a corresponding target word as a synonym pair, the target word being obtained from the word to be processed by delete operations only;
wherein the minimum edit distance is calculated by an edit distance method, the edit distance method includes a delete operation, the edit distance corresponding to the delete operation is less than the edit distance corresponding to each remaining operation, the edit distance corresponding to a single delete operation and the edit distance corresponding to multiple delete operations are each less than the preset threshold, and the edit distance corresponding to a single remaining operation is greater than or equal to the preset threshold.
10. The synonym discovery device according to claim 9, characterized in that the device further comprises:
a candidate word selection unit, adapted to separately calculate the semantic similarity between the word to be processed and each remaining word in the phrase set, and to select from them, as candidate words, either the words whose semantic similarity value is greater than a similarity threshold or the top N words with the highest semantic similarity values;
a target word determination unit, adapted to separately calculate the minimum edit distance between the word to be processed and each candidate word, and to take as the target word each candidate word whose minimum edit distance to the word to be processed is less than the preset threshold.
11. The synonym discovery device according to claim 10, characterized in that the candidate word selection unit comprises:
a vectorization subunit, adapted to vectorize each word in the phrase set;
a cosine similarity calculation subunit, adapted to calculate, based on the vectorization result, the cosine similarity between the word to be processed and each remaining word, the cosine similarity serving as the semantic similarity.
12. The synonym discovery device according to claim 11, characterized in that the vectorization subunit vectorizes each word in the phrase set using the word2vec method.
13. The synonym discovery device according to claim 9, characterized in that the acquiring unit comprises: a word segmentation unit, adapted to segment an input corpus into words to obtain the phrase set.
14. The synonym discovery device according to claim 13, characterized in that the word segmentation subunit segments the input corpus using a word segmentation dictionary, the word segmentation dictionary is obtained by a word segmentation dictionary acquiring unit, and the word segmentation dictionary acquiring unit is adapted to:
preprocess the input corpus to obtain text data; split the text data into lines to obtain sentence data; segment the sentence data according to the independent words contained in a basic dictionary to obtain segmented word data; combine adjacent items of the segmented word data to generate candidate data strings; judge the candidate data strings to discover new words; and add the new words to the word segmentation dictionary.
15. The synonym discovery device according to claim 9, characterized in that the remaining operations include an insert operation and a replace operation, the edit distance corresponding to a single insert operation is greater than or equal to the preset threshold, and the edit distance corresponding to a single replace operation is greater than or equal to the preset threshold.
16. A data processing device, characterized by comprising the synonym discovery device according to any one of claims 9-15.
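The cost scheme of claims 1 and 7 and the candidate filtering of claims 2 and 3 can be illustrated with a minimal Python sketch. It is not the patented implementation. The cost values 0.1 and 1.0, the preset threshold 1.0, the similarity threshold 0.5, all function names and the toy two-dimensional vectors are assumptions chosen for illustration only; in practice the vectors would come from a method such as word2vec (claim 4).

```python
import math

# All numeric values below are assumptions; the claims only require that a
# delete costs less than the threshold and any other single operation does not.
DELETE_COST = 0.1
OTHER_COST = 1.0   # assumed cost of one insertion or one replacement
THRESHOLD = 1.0    # assumed preset threshold


def weighted_edit_distance(source, target,
                           delete_cost=DELETE_COST, other_cost=OTHER_COST):
    """Minimum cost of turning source into target when deletions are cheap."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * delete_cost      # delete every source character
    for j in range(1, cols):
        dist[0][j] = j * other_cost       # insert every target character
    for i in range(1, rows):
        for j in range(1, cols):
            replace = 0.0 if source[i - 1] == target[j - 1] else other_cost
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost,   # delete from source
                dist[i][j - 1] + other_cost,    # insert into source
                dist[i - 1][j - 1] + replace,   # keep or replace
            )
    return dist[-1][-1]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_synonym_pairs(vectors, similarity_threshold=0.5, threshold=THRESHOLD):
    """Return (word, target) pairs; vectors maps each word to its embedding."""
    pairs = []
    for word, vec in vectors.items():
        for other, other_vec in vectors.items():
            if other == word:
                continue
            # Candidate filter: keep only semantically similar words.
            if cosine_similarity(vec, other_vec) <= similarity_threshold:
                continue
            # Target test: cheap deletions alone must reach the other word.
            if weighted_edit_distance(word, other) < threshold:
                pairs.append((word, other))
    return pairs


# Hypothetical toy embeddings, not taken from the patent.
example_vectors = {
    "laboratory": [1.0, 0.0],
    "lab": [0.9, 0.1],
    "library": [0.0, 1.0],
}
print(find_synonym_pairs(example_vectors))
```

Because a single insertion or replacement already costs at least the threshold, any distance below the threshold can only be reached by deletions, which matches the requirement that the target word be obtained from the word to be processed by delete operations only. In the toy data, "laboratory" to "lab" qualifies, while the reverse direction would need insertions and does not.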
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610429937.XA CN106126494B (en) | 2016-06-16 | 2016-06-16 | Synonym finds method and device, data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106126494A (en) | 2016-11-16 |
CN106126494B (en) | 2018-12-28 |
Family
ID=57470670
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201610429937.XA | Active | CN106126494B (en) | 2016-06-16 | 2016-06-16 | Synonym finds method and device, data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106126494B (en) |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776543B (en) * | 2016-11-23 | 2019-09-06 | 上海智臻智能网络科技股份有限公司 | New word discovery method, apparatus, terminal and server |
CN106649783B (en) * | 2016-12-28 | 2022-12-06 | 上海智臻智能网络科技股份有限公司 | Synonym mining method and device |
CN106649816B (en) * | 2016-12-29 | 2020-06-09 | 北京奇虎科技有限公司 | Synonym filtering method and device |
CN106777283B (en) * | 2016-12-29 | 2021-02-26 | 北京奇虎科技有限公司 | Synonym mining method and synonym mining device |
CN106933806A (en) * | 2017-03-15 | 2017-07-07 | 北京大数医达科技有限公司 | The determination method and apparatus of medical synonym |
CN107180026B (en) * | 2017-05-02 | 2020-12-29 | 苏州大学 | Event phrase learning method and device based on word embedding semantic mapping |
CN107577668A (en) * | 2017-09-15 | 2018-01-12 | 电子科技大学 | Social media non-standard word correcting method based on semanteme |
CN107621892B (en) * | 2017-10-18 | 2021-03-09 | 北京百度网讯科技有限公司 | Method and device for acquiring information |
CN108170806B (en) * | 2017-12-28 | 2020-11-20 | 东软集团股份有限公司 | Sensitive word detection and filtering method and device and computer equipment |
CN108255810B (en) * | 2018-01-10 | 2019-04-09 | 北京神州泰岳软件股份有限公司 | Near synonym method for digging, device and electronic equipment |
CN108829799A (en) * | 2018-06-05 | 2018-11-16 | 中国人民公安大学 | Based on the Text similarity computing method and system for improving LDA topic model |
CN110852056B (en) * | 2018-07-25 | 2024-09-24 | 中兴通讯股份有限公司 | Method, device and equipment for obtaining text similarity and readable storage medium |
WO2020061910A1 (en) * | 2018-09-27 | 2020-04-02 | 北京字节跳动网络技术有限公司 | Method and apparatus used for generating information |
CN109741190A (en) * | 2018-12-27 | 2019-05-10 | 清华大学 | A kind of method, system and the equipment of the classification of personal share bulletin |
CN113534972A (en) * | 2020-04-14 | 2021-10-22 | 北京搜狗科技发展有限公司 | Entry prompting method and device and entry prompting device |
CN113689923B (en) * | 2020-05-19 | 2024-06-18 | 北京平安联想智慧医疗信息技术有限公司 | Medical data processing device, system and method |
CN113761905A (en) * | 2020-07-01 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Method and device for constructing domain modeling vocabulary |
CN113761151A (en) * | 2021-05-07 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Synonym mining method, synonym mining device, synonym question answering method, synonym question answering device, computer equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101561813A (en) * | 2009-05-27 | 2009-10-21 | 东北大学 | Method for analyzing similarity of character string under Web environment |
CN102750282A (en) * | 2011-04-19 | 2012-10-24 | 北京百度网讯科技有限公司 | Synonym template mining method and device as well as synonym mining method and device |
CN102760134A (en) * | 2011-04-28 | 2012-10-31 | 北京百度网讯科技有限公司 | Method and device for mining synonyms |
CN104978356A (en) * | 2014-04-10 | 2015-10-14 | 阿里巴巴集团控股有限公司 | Synonym identification method and device |
CN105095204A (en) * | 2014-04-17 | 2015-11-25 | 阿里巴巴集团控股有限公司 | Method and device for obtaining synonym |
CN105183923A (en) * | 2015-10-27 | 2015-12-23 | 上海智臻智能网络科技股份有限公司 | New word discovery method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4862072B2 (en) * | 2009-09-09 | 2012-01-25 | 株式会社日立製作所 | Design check knowledge construction method and system |
Non-Patent Citations (2)
Title |
---|
一种基于无监督学习的词变体识别方法;王宝勋 等;《中文信息学报》;20080531;第22卷(第3期);第32-36、114页 * |
基于动态规划的缩写发现算法;李华 等;《武汉大学学报(工学版)》;20040229;第37卷(第1期);第128-131页 * |
Also Published As
Publication number | Publication date |
---|---|
CN106126494A (en) | 2016-11-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||