CN109543009A - Text similarity assessment system and text similarity assessment method - Google Patents
Text similarity assessment system and text similarity assessment method
- Publication number
- CN109543009A (application number CN201811210881.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- doc1
- word
- module
- formula
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 31
- 239000013598 vector Substances 0.000 claims abstract description 78
- 238000012549 training Methods 0.000 claims abstract description 20
- 230000011218 segmentation Effects 0.000 claims abstract description 11
- 230000006870 function Effects 0.000 claims description 31
- 238000012545 processing Methods 0.000 claims description 10
- 230000015654 memory Effects 0.000 claims description 9
- 230000005055 memory storage Effects 0.000 claims description 4
- 238000012360 testing method Methods 0.000 claims description 3
- 238000004364 calculation method Methods 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- 230000006403 short-term memory Effects 0.000 description 3
- 230000000694 effects Effects 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 230000000306 recurrent effect Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000007787 long-term memory Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 238000013139 quantization Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a text similarity assessment system and a text similarity assessment method. The assessment system includes a corpus acquisition module, a word segmentation module, a word vector training module, a Siamese biLSTM network module, an attention module and a likelihood probability computation module. The invention proposes a Siamese network structure based on an attention mechanism, which can find the key information that determines the degree of similarity between two texts. Compared with the prior art, the network structure proposed by the present invention focuses more on local information, excludes the interference of global information, and improves the accuracy of text similarity computation in different scenarios.
Description
"Technical Field"
The present invention relates to the field of electronic information and data processing, and in particular to a text similarity assessment system and a text similarity assessment method.
"Background Art"
Text similarity computation is an important research topic in the field of natural language processing. According to the calculation method, it can be divided into character-based methods and corpus-based methods. Character-based methods mainly measure similarity from the characters shared by two texts; they do not consider the semantic information of the text, so they are effective for judging unordered character lists but fail on the language of the text itself. Corpus-based methods, in contrast, mine the semantic information of characters from the context of the text in order to judge the similarity of two texts. Representative work in this line of research includes methods that directly compute vector similarity, such as word embedding, and methods that build a model to make the judgement, such as the Siamese network. Corpus-based methods are the mainstream direction of current research.
Corpus-based methods have developed rapidly in recent years, but they still suffer from insufficient information mining. In many application scenarios, the key information that determines the degree of similarity between two texts accounts for only a very small part of the text. Existing work mostly mines global semantic information and judges the degree of similarity of the texts from that global information, which is clearly inaccurate.
"Summary of the Invention"
The first object of the present invention is to provide a text similarity assessment system which adds an attention mechanism on top of a Siamese biLSTM network module, finds the key information that determines the degree of similarity between two texts, and improves the accuracy of text similarity computation in different scenarios. The first object of the present invention is achieved by the following technical scheme:
A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus containing two texts doc1_a and doc1_b;
a word segmentation module, for dividing the text doc1_a and the text doc1_b each into a word sequence of X words;
a word vector training module, for vectorizing the word sequences obtained by segmenting the text doc1_a and the text doc1_b;
a Siamese biLSTM network module, including a biLSTMa network module and a biLSTMb network module, respectively used to input the word vector of each word of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the text doc1_a and the text doc1_b, with i ranging from 1 to X;
an attention module, which provides a regularized weight ai and a regularized weight bi for the encoded information Hai and Hbi respectively, where ∑ai = 1 and ∑bi = 1, and which computes the attention vector sa of the text doc1_a by the formula sa = ∑ai*Hai and the attention vector sb of the text doc1_b by the formula sb = ∑bi*Hbi;
a likelihood probability computation module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector sa and the attention vector sb.
As a specific technical scheme, the corpus takes the form: doc1_a, doc1_b, sim; where sim is a label, sim = 1 indicates that the text doc1_a and the text doc1_b are similar, and sim = 0 indicates that the text doc1_a and the text doc1_b are dissimilar.
As a specific technical scheme, the attention module includes: a Tanh function processing unit, which computes the encoded information Hai by the formula uai = Tanh(W*Hai + b) and outputs a weight uai, and computes the encoded information Hbi by the formula ubi = Tanh(W*Hbi + b) and outputs a weight ubi; a softmax function processing unit, which computes the regularized weight ai of the current word of the text doc1_a by the formula ai = softmax(uai*uw), and computes the regularized weight bi of the current word of the text doc1_b by the formula bi = softmax(ubi*uw); and a weighted summation unit, for completing the calculation of the formula sa = ∑ai*Hai and the calculation of the formula sb = ∑bi*Hbi; where W, uw and b are either set parameters or parameters obtained by training.
As a specific technical scheme, the parameters W, uw and b are parameters obtained by training; the text similarity assessment system further includes a parameter training module, which sets a loss function and optimizes continuously until the loss function converges, thereby determining the parameters W, uw and b.
As a specific technical scheme, the loss function uses the following logloss function or uses a mean square error function; the logloss function is:
logloss = -(1/N) ∑_{l=1}^{N} ∑_{j=1}^{M} y_{l,j}·log(p_{l,j})
where N is the total number of test samples, M is the total number of classes, y_{l,j} is a binary variable whose value 0 or 1 indicates whether the l-th sample belongs to the j-th class, and p_{l,j} is the probability predicted by the model that the label of the l-th sample belongs to the j-th class.
As a specific technical scheme, the likelihood probability p is calculated as:
p = g(sa, sb) = exp(-||sa - sb||_1), with 0 < p ≤ 1,
or p = cosine(sa, sb).
A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus containing two texts doc1_a and doc1_b, the text doc1_a and the text doc1_b each containing Y sentences;
a word segmentation module, for dividing the k-th sentence of the text doc1_a and of the text doc1_b each into a word sequence of X words, with k ranging from 1 to Y;
a word vector training module, for vectorizing the word sequences obtained by segmenting the k-th sentence of the text doc1_a and of the text doc1_b;
a first Siamese biLSTM network module, including a biLSTMa1 network module and a biLSTMb1 network module, respectively used to input the word vector of each word of the k-th sentence of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the k-th sentence of the text doc1_a and of the text doc1_b, with i ranging from 1 to X;
a first attention module, which provides a regularized weight ai and a regularized weight bi for the encoded information Hai and Hbi respectively, where ∑ai = 1 and ∑bi = 1, and which computes the attention vector sak of the k-th sentence of the text doc1_a by the formula sak = ∑ai*Hai and the attention vector sbk of the k-th sentence of the text doc1_b by the formula sbk = ∑bi*Hbi;
a second Siamese biLSTM network module, including a biLSTMa2 network module and a biLSTMb2 network module, respectively used to input the attention vector sak and the attention vector sbk, perform sentence-level encoding on the attention vector sak and the attention vector sbk, and output the corresponding encoded information HAk and HBk;
a second attention module, which provides a regularized weight Ak and a regularized weight Bk for the encoded information HAk and HBk respectively, where ∑Ak = 1 and ∑Bk = 1, and which computes the attention vector Va of the text doc1_a by the formula Va = ∑Ak*HAk and the attention vector Vb of the text doc1_b by the formula Vb = ∑Bk*HBk;
a likelihood probability computation module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector Va and the attention vector Vb.
The present invention also provides a memory, characterized in that the memory stores a program of the above text similarity assessment system.
The second object of the present invention is to provide a text similarity assessment method, which performs similarity assessment on two input texts to be predicted based on the text similarity assessment system described above. The second object of the present invention is achieved by the following technical scheme:
A text similarity assessment method, characterized in that: two texts to be predicted are input into the text similarity assessment system described above, which outputs the likelihood probability p of the two texts to be predicted.
The present invention also provides a computer, including a memory and a processor, characterized in that the memory stores a program supporting the processor in executing the above text similarity assessment method, and the processor is configured to execute the program stored in the memory.
The beneficial effects of the present invention are: a Siamese network structure based on an attention mechanism is proposed, which can find the key information that determines the degree of similarity between two texts. Compared with the prior art, the network structure proposed by the present invention focuses more on local information, excludes the interference of global information, and improves the accuracy of text similarity computation in different scenarios. The present invention performs better at computing sentence similarity.
"Description of the Drawings"
Fig. 1 is a schematic diagram of the text similarity assessment system provided by the present invention.
Fig. 2 is the network structure of the data processing procedure of the text similarity assessment system provided by the present invention.
Fig. 3 is the network structure of the data processing procedure of the text similarity assessment system after a sentence-level Siamese biLSTM network module and a sentence-level attention module are added on the basis of Fig. 2.
"Specific Embodiments"
Embodiment One
As shown in Fig. 1, the text similarity assessment system provided in this embodiment includes: a corpus acquisition module, a word segmentation module, a word vector training module, a Siamese biLSTM network (Bi-directional Long Short Term Memory, abbreviated as biLSTM) module, an attention module, a likelihood probability computation module and a parameter training module. They are described in detail below:
The corpus acquisition module is used to input a corpus containing two texts doc1_a and doc1_b. In this embodiment, the corpus takes the form: doc1_a, doc1_b, sim; where doc1_a and doc1_b are two similar or dissimilar texts, sim is the label, sim = 1 indicates similar, and sim = 0 indicates dissimilar.
The word segmentation module is used to divide the text doc1_a and the text doc1_b into word sequences. As shown in Fig. 2, the text doc1_a is divided into the word sequence of the four words Wa1, Wa2, Wa3, Wa4, and the text doc1_b is divided into the word sequence of the four words Wb1, Wb2, Wb3, Wb4. The jieba segmentation tool is a commonly used word segmentation tool; its effect is as follows:
"The students and teachers of the Institute of Computing Technology, Chinese Academy of Sciences" ---> [Chinese Academy of Sciences, Institute of Computing, of, students, and, teachers].
The word vector training module is used to vectorize the word sequences obtained by segmenting the text doc1_a and the text doc1_b.
Word2vec (i.e. word to vector, also called word embeddings; Chinese name "word vector") is a word vector generation tool developed and open-sourced by Google. It is a method that uses a neural network model to mine the latent semantic associations between words; its core idea is to predict the current word from the words appearing in its context, mining the latent features of a word through its co-occurring words, and its output represents each word as a low-dimensional dense vector. As shown in Fig. 2, the text doc1_a is segmented into the four words Wa1, Wa2, Wa3, Wa4, whose word vectors are denoted Va1, Va2, Va3, Va4; the text doc1_b is segmented into the four words Wb1, Wb2, Wb3, Wb4, whose word vectors are denoted Vb1, Vb2, Vb3, Vb4.
In practice the dimension of the word vectors can be set according to the specific situation; it is conventionally set to 300 dimensions, i.e. a 300*1 vector. For example:
"Chinese Academy of Sciences" -> [0.03, 0.3, 0.423, 0.43, 0.7623, 1.32, 2.34, 0.1323, ...], a 300*1 vector in which the value of each dimension is obtained by training on the corpus.
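A minimal sketch of this step using the gensim implementation of word2vec is given below; gensim is an assumed toolkit (the patent only names word2vec), and the tiny corpus is a placeholder.

```python
# Sketch: train 300-dimensional word vectors on a segmented corpus with gensim's word2vec.
# gensim is an assumed toolkit; the corpus below is a placeholder.
from gensim.models import Word2Vec

segmented_corpus = [
    ["中科院", "计算所", "的", "学生", "和", "老师"],
    # ... one list of words per segmented sentence in the training corpus
]
model = Word2Vec(sentences=segmented_corpus, vector_size=300, window=5, min_count=1)
vec = model.wv["中科院"]    # a 300-dimensional dense vector for this word
print(vec.shape)            # (300,)
```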
It is known that the LSTM network (Long Short Term Memory network, abbreviated as LSTM) is an improved model of the recurrent neural network (Recurrent Neural Network, abbreviated as RNN): a forget gate determines which information needs to be filtered out, an input gate determines the current input information and the current state, and an output gate determines the output. Through this gating mechanism the network learns the contextual information of the text. BiLSTM is a bidirectional structure; it assumes that useful information can be captured from both the forward order and the reverse order of the text, and during training the information from the two directions is usually spliced together and passed to the next layer of computation.
As shown in Fig. 1, the Siamese biLSTM network module includes a biLSTMa network module and a biLSTMb network module, which respectively take as input the word vectors of each word of the text doc1_a and of the text doc1_b (Va1, Va2, Va3, Va4 and Vb1, Vb2, Vb3, Vb4 described above), perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the text doc1_a and of the text doc1_b, where i indexes the segmented words of the text doc1_a and the text doc1_b; in this embodiment i takes 1 to 4, e.g. Ha1 to Ha4 and Hb1 to Hb4 in Fig. 2. Hai and Hbi are obtained by splicing the hidden-layer state vectors of the biLSTM network module, for example Hai = [ha+i, ha-i], where ha+i and ha-i are the hidden-layer state vectors generated by the two different hidden layers (forward and backward) of biLSTMa.
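A minimal PyTorch sketch of this word-level encoding is given below; PyTorch and the dimensions are assumptions for illustration. The output for each word is the concatenation of the forward and backward hidden states, as described above.

```python
# Sketch: word-level encoding with a bidirectional LSTM (PyTorch and dimensions are assumed).
# Each Hai is the concatenation of the forward and backward hidden states for word i.
import torch
import torch.nn as nn

emb_dim, hidden_dim, num_words = 300, 128, 4           # assumed dimensions
bilstm_a = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

Va = torch.randn(1, num_words, emb_dim)                # word vectors Va1..Va4 (placeholder values)
Ha, _ = bilstm_a(Va)                                   # Ha1..Ha4, shape (1, 4, 2*hidden_dim)
print(Ha.shape)                                        # torch.Size([1, 4, 256])
```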
The attention module includes a Tanh function processing unit, a softmax function processing unit and a weighted summation unit, as shown in Fig. 1, described as follows:
The Tanh function processing unit computes the encoded information Hai by the formula uai = Tanh(W*Hai + b) and outputs a weight uai, and computes the encoded information Hbi by the formula ubi = Tanh(W*Hbi + b) and outputs a weight ubi;
The softmax function processing unit computes the regularized weight ai of the current word of the text doc1_a by the formula ai = softmax(uai*uw), and computes the regularized weight bi of the current word of the text doc1_b by the formula bi = softmax(ubi*uw), where ∑ai = 1 and ∑bi = 1;
The weighted summation unit computes the attention vector sa of the text doc1_a by the formula sa = ∑ai*Hai, and computes the attention vector sb of the text doc1_b by the formula sb = ∑bi*Hbi.
The likelihood probability computation module computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector sa and the attention vector sb. Specifically, the likelihood probability p can be calculated by the Manhattan-distance method or by the cosine method; the corresponding calculation formulas are respectively:
p = g(sa, sb) = exp(-||sa - sb||_1), with 0 < p ≤ 1,
or p = cosine(sa, sb).
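A minimal sketch of the two alternatives, continuing the PyTorch assumption above:

```python
# Sketch of the likelihood probability p computed from the two attention vectors sa and sb.
import torch
import torch.nn.functional as F

def manhattan_likelihood(sa, sb):
    # p = exp(-||sa - sb||_1), which lies in (0, 1]
    return torch.exp(-torch.sum(torch.abs(sa - sb), dim=-1))

def cosine_likelihood(sa, sb):
    # p = cosine(sa, sb)
    return F.cosine_similarity(sa, sb, dim=-1)

sa, sb = torch.randn(1, 256), torch.randn(1, 256)   # placeholder attention vectors
print(manhattan_likelihood(sa, sb), cosine_likelihood(sa, sb))
```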
The parameters W, uw and b in the above formulas are either set or obtained by training. The parameter training module in this embodiment sets a loss function and optimizes continuously until the loss function converges, thereby determining the parameters W, uw and b. In this embodiment, the loss function logloss is as follows:
logloss = -(1/N) ∑_{l=1}^{N} ∑_{j=1}^{M} y_{l,j}·log(p_{l,j})
In this formula, N is the total number of test samples, M is the total number of classes, y_{l,j} is a binary variable whose value 0 or 1 indicates whether the l-th sample belongs to the j-th class, and p_{l,j} is the probability predicted by the model that the label of the l-th sample belongs to the j-th class. The optimization method is gradient descent; the input data is fed in and training proceeds until convergence. The loss function can also be chosen as mean square error (mse) or another function according to the actual situation. Training a model with a loss function and determining suitable parameters belongs to the prior art of neural network training and is not repeated here.
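A minimal sketch of such a training step is given below; for the two-class similar/dissimilar case the logloss above reduces to binary cross-entropy. The stand-in parameters, data and learning rate are placeholders for illustration, not the patent's actual configuration.

```python
# Sketch of the parameter training module: logloss (binary cross-entropy for the
# similar/dissimilar case) minimized by gradient descent until convergence.
# The stand-in parameters and data below are placeholders for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

sa = torch.randn(8, 8)                     # placeholder attention vectors for 8 text pairs
sb = torch.randn(8, 8)
sim = torch.randint(0, 2, (8,)).float()    # 0/1 labels from the corpus

proj = nn.Linear(8, 8)                     # stand-in for the trainable parameters (e.g. W, uw, b)
optimizer = torch.optim.SGD(proj.parameters(), lr=0.01)

for step in range(200):                    # iterate until the loss converges
    p = torch.exp(-torch.sum(torch.abs(proj(sa) - proj(sb)), dim=-1))  # p = exp(-||.||_1)
    loss = F.binary_cross_entropy(p, sim)  # logloss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```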
Embodiment Two
The specific length of the text that the present invention can process is not restricted. If the texts are sentences, the word-level (word level) Siamese biLSTM network and attention module structure provided in Embodiment One already yields attention vectors that represent the sentences. If the texts are long texts such as paragraphs or articles, then after the multiple sentence vectors are obtained (i.e. multiple sa and sb obtained as in Embodiment One), one further layer consisting of a sentence-level (sentence level) Siamese biLSTM network module and a sentence-level attention module is added, finally yielding attention vectors that can represent the long texts; the other links remain unchanged.
This Embodiment Two provides a text similarity assessment system that can process two long texts each containing multiple sentences. The assessment system includes a corpus acquisition module, a word segmentation module, a word vector training module, a first Siamese biLSTM network module, a first attention module, a second Siamese biLSTM network module, a second attention module and a likelihood probability computation module, described in detail as follows:
The corpus acquisition module is used to input a corpus containing two texts doc1_a and doc1_b, the text doc1_a and the text doc1_b each containing Y sentences (e.g. 4 sentences each);
The word segmentation module is used to divide the k-th sentence of the text doc1_a and of the text doc1_b (Fig. 3 only shows one sentence of the text doc1_a and of the text doc1_b as an example) each into a word sequence of X words (e.g. 4 words, referring to Wa1, Wa2, Wa3, Wa4 and Wb1, Wb2, Wb3, Wb4 in Fig. 3), with k ranging from 1 to Y;
The word vector training module is used to vectorize the word sequences obtained by segmenting the k-th sentence of the text doc1_a and of the text doc1_b; for example, the k-th sentence of the text doc1_a is segmented into the four words Wa1, Wa2, Wa3, Wa4, whose word vectors are denoted Va1, Va2, Va3, Va4, and the k-th sentence of the text doc1_b is segmented into the four words Wb1, Wb2, Wb3, Wb4, whose word vectors are denoted Vb1, Vb2, Vb3, Vb4.
The first Siamese biLSTM network module includes a biLSTMa1 network module and a biLSTMb1 network module, which respectively take as input the word vector of each word of the k-th sentence of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the k-th sentence of the text doc1_a and of the text doc1_b, with i ranging from 1 to X; see Ha1 to Ha4 and Hb1 to Hb4 in Fig. 3.
The first attention module provides a regularized weight ai and a regularized weight bi for the encoded information Hai and Hbi respectively, where ∑ai = 1 and ∑bi = 1, computes the attention vector sak of the k-th sentence of the text doc1_a (e.g. sa3 in Fig. 3) by the formula sak = ∑ai*Hai, and computes the attention vector sbk of the k-th sentence of the text doc1_b (e.g. sb3 in Fig. 3) by the formula sbk = ∑bi*Hbi.
The second Siamese biLSTM network module includes a biLSTMa2 network module and a biLSTMb2 network module, which respectively take as input the attention vector sak and the attention vector sbk, perform sentence-level encoding on the attention vector sak and the attention vector sbk, and output the corresponding encoded information HAk and HBk; see HA1 to HA4 and HB1 to HB4 in Fig. 3.
The second attention module provides a regularized weight Ak and a regularized weight Bk for the encoded information HAk and HBk respectively (see A1-A4 and B1-B4 in Fig. 3), where ∑Ak = 1 and ∑Bk = 1, computes the attention vector Va of the text doc1_a by the formula Va = ∑Ak*HAk, and computes the attention vector Vb of the text doc1_b by the formula Vb = ∑Bk*HBk.
The likelihood probability computation module computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector Va and the attention vector Vb.
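A minimal sketch of one branch of this hierarchical structure (word-level biLSTM and attention per sentence, then sentence-level biLSTM and attention over the resulting sentence vectors) is given below; PyTorch and all dimensions are assumptions for illustration, and the attention class is the same mechanism as in the earlier word-level sketch.

```python
# Sketch of one branch of the hierarchical encoder of Embodiment Two:
# word-level biLSTM + attention per sentence, then sentence-level biLSTM + attention.
# PyTorch and the dimensions are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """u = tanh(W*H + b), weights = softmax(u * uw), output = weighted sum of H."""
    def __init__(self, enc_dim, att_dim=128):
        super().__init__()
        self.W = nn.Linear(enc_dim, att_dim)
        self.uw = nn.Parameter(torch.randn(att_dim))

    def forward(self, H):                                    # H: (batch, steps, enc_dim)
        a = F.softmax(torch.tanh(self.W(H)) @ self.uw, dim=1)
        return (a.unsqueeze(-1) * H).sum(dim=1)              # (batch, enc_dim)

class HierarchicalEncoder(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=128):
        super().__init__()
        self.word_bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.word_attn = Attention(2 * hidden_dim)           # first attention module (word level)
        self.sent_bilstm = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.sent_attn = Attention(2 * hidden_dim)           # second attention module (sentence level)

    def forward(self, doc):                                  # doc: (Y sentences, X words, emb_dim)
        H, _ = self.word_bilstm(doc)                         # word-level encodings Hai for each sentence
        s = self.word_attn(H)                                # sentence vectors sa1..saY, (Y, 2*hidden_dim)
        HA, _ = self.sent_bilstm(s.unsqueeze(0))             # sentence-level encodings HAk
        return self.sent_attn(HA)                            # document attention vector Va, (1, 2*hidden_dim)

encoder = HierarchicalEncoder()
doc1_a = torch.randn(4, 4, 300)                              # Y=4 sentences of X=4 words (placeholder vectors)
Va = encoder(doc1_a)
print(Va.shape)                                              # torch.Size([1, 256])
```

The doc1_b branch would use the same shared modules to obtain Vb, and the likelihood probability p would then be computed from Va and Vb as in Embodiment One.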
The functions of the first Siamese biLSTM network module and the second Siamese biLSTM network module in Embodiment Two are the same as the function of the Siamese biLSTM network module described in Embodiment One; in this embodiment the qualifiers "first" and "second" are only used to distinguish the word-level processing stage from the sentence-level processing stage. Likewise, the functions of the first attention module and the second attention module in Embodiment Two are the same as the function of the attention module described in Embodiment One, and the qualifiers "first" and "second" again only distinguish the word-level processing stage from the sentence-level processing stage. Furthermore, the likelihood probability computation module in Embodiment Two has the same function as the likelihood probability computation module described in Embodiment One. In addition, the text similarity assessment system provided by this Embodiment Two also includes a parameter training module with the same function as the parameter training module described in Embodiment One.
This embodiment also provides a memory, which stores a program of the above text similarity assessment system.
This embodiment also provides a text similarity assessment method: the trained text similarity assessment system can process texts to be predicted; the two texts to be judged are input into the model, and the model outputs the likelihood probability p of the two texts, which serves as the similarity of the two texts.
This embodiment also provides a computer, including a memory and a processor, the memory storing a program that supports the processor in executing the above text similarity assessment method, and the processor being configured to execute the program stored in the memory.
The above embodiments are only a sufficient disclosure and are not intended to limit the present invention; any replacement with equivalent technical features that can be obtained without creative work on the basis of the inventive idea of the present invention shall be considered within the scope disclosed by this application.
Claims (10)
1. A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus containing two texts doc1_a and doc1_b;
a word segmentation module, for dividing the text doc1_a and the text doc1_b each into a word sequence of X words;
a word vector training module, for vectorizing the word sequences obtained by segmenting the text doc1_a and the text doc1_b;
a Siamese biLSTM network module, including a biLSTMa network module and a biLSTMb network module, respectively used to input the word vector of each word of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the text doc1_a and the text doc1_b, with i ranging from 1 to X;
an attention module, which provides a regularized weight ai and a regularized weight bi for the encoded information Hai and Hbi respectively, where ∑ai = 1 and ∑bi = 1, and which computes the attention vector sa of the text doc1_a by the formula sa = ∑ai*Hai and the attention vector sb of the text doc1_b by the formula sb = ∑bi*Hbi;
a likelihood probability computation module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector sa and the attention vector sb.
2. The text similarity assessment system according to claim 1, characterized in that the corpus takes the form: doc1_a, doc1_b, sim; where sim is a label, sim = 1 indicates that the text doc1_a and the text doc1_b are similar, and sim = 0 indicates that the text doc1_a and the text doc1_b are dissimilar.
3. The text similarity assessment system according to claim 1, characterized in that the attention module includes:
a Tanh function processing unit, which computes the encoded information Hai by the formula uai = Tanh(W*Hai + b) and outputs a weight uai, and computes the encoded information Hbi by the formula ubi = Tanh(W*Hbi + b) and outputs a weight ubi; a softmax function processing unit, which computes the regularized weight ai of the current word of the text doc1_a by the formula ai = softmax(uai*uw), and computes the regularized weight bi of the current word of the text doc1_b by the formula bi = softmax(ubi*uw); and a weighted summation unit, for completing the calculation of the formula sa = ∑ai*Hai and the calculation of the formula sb = ∑bi*Hbi; where W, uw and b are either set parameters or parameters obtained by training.
4. The text similarity assessment system according to claim 3, characterized in that the parameters W, uw and b are parameters obtained by training; the text similarity assessment system further includes a parameter training module, which sets a loss function and optimizes continuously until the loss function converges, thereby determining the parameters W, uw and b.
5. The text similarity assessment system according to claim 1, characterized in that the loss function uses the following logloss function or uses a mean square error function, the logloss function being:
logloss = -(1/N) ∑_{l=1}^{N} ∑_{j=1}^{M} y_{l,j}·log(p_{l,j})
where N is the total number of test samples, M is the total number of classes, y_{l,j} is a binary variable whose value 0 or 1 indicates whether the l-th sample belongs to the j-th class, and p_{l,j} is the probability predicted by the model that the label of the l-th sample belongs to the j-th class.
6. The text similarity assessment system according to claim 1, characterized in that the likelihood probability p is calculated as:
p = g(sa, sb) = exp(-||sa - sb||_1), with 0 < p ≤ 1,
or p = cosine(sa, sb).
7. A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus containing two texts doc1_a and doc1_b, the text doc1_a and the text doc1_b each containing Y sentences;
a word segmentation module, for dividing the k-th sentence of the text doc1_a and of the text doc1_b each into a word sequence of X words, with k ranging from 1 to Y;
a word vector training module, for vectorizing the word sequences obtained by segmenting the k-th sentence of the text doc1_a and of the text doc1_b;
a first Siamese biLSTM network module, including a biLSTMa1 network module and a biLSTMb1 network module, respectively used to input the word vector of each word of the k-th sentence of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the k-th sentence of the text doc1_a and of the text doc1_b, with i ranging from 1 to X;
a first attention module, which provides a regularized weight ai and a regularized weight bi for the encoded information Hai and Hbi respectively, where ∑ai = 1 and ∑bi = 1, and which computes the attention vector sak of the k-th sentence of the text doc1_a by the formula sak = ∑ai*Hai and the attention vector sbk of the k-th sentence of the text doc1_b by the formula sbk = ∑bi*Hbi;
a second Siamese biLSTM network module, including a biLSTMa2 network module and a biLSTMb2 network module, respectively used to input the attention vector sak and the attention vector sbk, perform sentence-level encoding on the attention vector sak and the attention vector sbk, and output the corresponding encoded information HAk and HBk;
a second attention module, which provides a regularized weight Ak and a regularized weight Bk for the encoded information HAk and HBk respectively, where ∑Ak = 1 and ∑Bk = 1, and which computes the attention vector Va of the text doc1_a by the formula Va = ∑Ak*HAk and the attention vector Vb of the text doc1_b by the formula Vb = ∑Bk*HBk;
a likelihood probability computation module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector Va and the attention vector Vb.
8. A memory, characterized in that the memory stores a program of the text similarity assessment system according to any one of claims 1 to 7.
9. A text similarity assessment method, characterized in that: two texts to be predicted are input into the text similarity assessment system according to any one of claims 1 to 7, which outputs the likelihood probability p of the two texts to be predicted.
10. A computer, including a memory and a processor, characterized in that the memory stores a program supporting the processor in executing the text similarity assessment method according to claim 9, and the processor is configured to execute the program stored in the memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811210881.4A CN109543009B (en) | 2018-10-17 | 2018-10-17 | Text similarity assessment system and text similarity assessment method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811210881.4A CN109543009B (en) | 2018-10-17 | 2018-10-17 | Text similarity assessment system and text similarity assessment method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109543009A true CN109543009A (en) | 2019-03-29 |
CN109543009B CN109543009B (en) | 2019-10-25 |
Family
ID=65843947
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811210881.4A Active CN109543009B (en) | 2018-10-17 | 2018-10-17 | Text similarity assessment system and text similarity appraisal procedure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109543009B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110046240A (en) * | 2019-04-16 | 2019-07-23 | 浙江爱闻格环保科技有限公司 | In conjunction with the target domain question and answer method for pushing of keyword retrieval and twin neural network |
CN110211594A (en) * | 2019-06-06 | 2019-09-06 | 杭州电子科技大学 | A kind of method for distinguishing speek person based on twin network model and KNN algorithm |
CN110362681A (en) * | 2019-06-19 | 2019-10-22 | 平安科技(深圳)有限公司 | The recognition methods of question answering system replication problem, device and storage medium |
CN110413988A (en) * | 2019-06-17 | 2019-11-05 | 平安科技(深圳)有限公司 | Method, apparatus, server and the storage medium of text information matching measurement |
CN110610003A (en) * | 2019-08-15 | 2019-12-24 | 阿里巴巴集团控股有限公司 | Method and system for assisting text annotation |
CN110717332A (en) * | 2019-07-26 | 2020-01-21 | 昆明理工大学 | News and case similarity calculation method based on asymmetric twin network |
CN110738059A (en) * | 2019-10-21 | 2020-01-31 | 支付宝(杭州)信息技术有限公司 | text similarity calculation method and system |
CN110941951A (en) * | 2019-10-15 | 2020-03-31 | 平安科技(深圳)有限公司 | Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment |
CN111198939A (en) * | 2019-12-27 | 2020-05-26 | 北京健康之家科技有限公司 | Statement similarity analysis method and device and computer equipment |
CN111209395A (en) * | 2019-12-27 | 2020-05-29 | 铜陵中科汇联科技有限公司 | Short text similarity calculation system and training method thereof |
CN111627566A (en) * | 2020-05-22 | 2020-09-04 | 泰康保险集团股份有限公司 | Indication information processing method and device, storage medium and electronic equipment |
CN111737954A (en) * | 2020-06-12 | 2020-10-02 | 百度在线网络技术(北京)有限公司 | Text similarity determination method, device, equipment and medium |
CN111783419A (en) * | 2020-06-12 | 2020-10-16 | 上海东普信息科技有限公司 | Address similarity calculation method, device, equipment and storage medium |
CN111785287A (en) * | 2020-07-06 | 2020-10-16 | 北京世纪好未来教育科技有限公司 | Speaker recognition method, speaker recognition device, electronic equipment and storage medium |
CN111859988A (en) * | 2020-07-28 | 2020-10-30 | 阳光保险集团股份有限公司 | Semantic similarity evaluation method and device and computer-readable storage medium |
CN112784587A (en) * | 2021-01-07 | 2021-05-11 | 国网福建省电力有限公司泉州供电公司 | Text similarity measurement method and device based on multi-model fusion |
CN112800196A (en) * | 2021-01-18 | 2021-05-14 | 北京明略软件系统有限公司 | FAQ question-answer library matching method and system based on twin network |
CN113743077A (en) * | 2020-08-14 | 2021-12-03 | 北京京东振世信息技术有限公司 | Method and device for determining text similarity |
CN114595687A (en) * | 2021-12-20 | 2022-06-07 | 昆明理工大学 | Laos language text regularization method based on BilSTM |
WO2023065635A1 (en) * | 2021-10-22 | 2023-04-27 | 平安科技(深圳)有限公司 | Named entity recognition method and apparatus, storage medium and terminal device |
CN116776854A (en) * | 2023-08-25 | 2023-09-19 | 湖南汇智兴创科技有限公司 | Online multi-version document content association method, device, equipment and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107168954A (en) * | 2017-05-18 | 2017-09-15 | 北京奇艺世纪科技有限公司 | Text key word generation method and device and electronic equipment and readable storage medium storing program for executing |
CN107291699A (en) * | 2017-07-04 | 2017-10-24 | 湖南星汉数智科技有限公司 | A kind of sentence semantic similarity computational methods |
CN107562812A (en) * | 2017-08-11 | 2018-01-09 | 北京大学 | A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space |
CN108132931A (en) * | 2018-01-12 | 2018-06-08 | 北京神州泰岳软件股份有限公司 | A kind of matched method and device of text semantic |
CN108415977A (en) * | 2018-02-09 | 2018-08-17 | 华南理工大学 | One is read understanding method based on the production machine of deep neural network and intensified learning |
-
2018
- 2018-10-17 CN CN201811210881.4A patent/CN109543009B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107168954A (en) * | 2017-05-18 | 2017-09-15 | 北京奇艺世纪科技有限公司 | Text key word generation method and device and electronic equipment and readable storage medium storing program for executing |
CN107291699A (en) * | 2017-07-04 | 2017-10-24 | 湖南星汉数智科技有限公司 | A kind of sentence semantic similarity computational methods |
CN107562812A (en) * | 2017-08-11 | 2018-01-09 | 北京大学 | A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space |
CN108132931A (en) * | 2018-01-12 | 2018-06-08 | 北京神州泰岳软件股份有限公司 | A kind of matched method and device of text semantic |
CN108415977A (en) * | 2018-02-09 | 2018-08-17 | 华南理工大学 | One is read understanding method based on the production machine of deep neural network and intensified learning |
Non-Patent Citations (1)
Title |
---|
ZONGKUI ZHU et al.: "A Semantic Similarity Computing Model based on Siamese Network for Duplicate Questions Identification", Proceedings of the Evaluation Tasks at the China Conference on Knowledge Graph and Semantic Computing *
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110046240A (en) * | 2019-04-16 | 2019-07-23 | 浙江爱闻格环保科技有限公司 | In conjunction with the target domain question and answer method for pushing of keyword retrieval and twin neural network |
CN110046240B (en) * | 2019-04-16 | 2020-12-08 | 浙江爱闻格环保科技有限公司 | Target field question-answer pushing method combining keyword retrieval and twin neural network |
CN110211594A (en) * | 2019-06-06 | 2019-09-06 | 杭州电子科技大学 | A kind of method for distinguishing speek person based on twin network model and KNN algorithm |
CN110211594B (en) * | 2019-06-06 | 2021-05-04 | 杭州电子科技大学 | Speaker identification method based on twin network model and KNN algorithm |
CN110413988A (en) * | 2019-06-17 | 2019-11-05 | 平安科技(深圳)有限公司 | Method, apparatus, server and the storage medium of text information matching measurement |
CN110413988B (en) * | 2019-06-17 | 2023-01-31 | 平安科技(深圳)有限公司 | Text information matching measurement method, device, server and storage medium |
CN110362681B (en) * | 2019-06-19 | 2023-09-22 | 平安科技(深圳)有限公司 | Method, device and storage medium for identifying repeated questions of question-answering system |
CN110362681A (en) * | 2019-06-19 | 2019-10-22 | 平安科技(深圳)有限公司 | The recognition methods of question answering system replication problem, device and storage medium |
WO2020252930A1 (en) * | 2019-06-19 | 2020-12-24 | 平安科技(深圳)有限公司 | Method, apparatus, and device for identifying duplicate questions in question answering system, and storage medium |
CN110717332A (en) * | 2019-07-26 | 2020-01-21 | 昆明理工大学 | News and case similarity calculation method based on asymmetric twin network |
CN110610003A (en) * | 2019-08-15 | 2019-12-24 | 阿里巴巴集团控股有限公司 | Method and system for assisting text annotation |
CN110610003B (en) * | 2019-08-15 | 2023-09-15 | 创新先进技术有限公司 | Method and system for assisting text annotation |
CN110941951A (en) * | 2019-10-15 | 2020-03-31 | 平安科技(深圳)有限公司 | Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment |
WO2021072863A1 (en) * | 2019-10-15 | 2021-04-22 | 平安科技(深圳)有限公司 | Method and apparatus for calculating text similarity, electronic device, and computer-readable storage medium |
CN110738059A (en) * | 2019-10-21 | 2020-01-31 | 支付宝(杭州)信息技术有限公司 | text similarity calculation method and system |
CN111209395A (en) * | 2019-12-27 | 2020-05-29 | 铜陵中科汇联科技有限公司 | Short text similarity calculation system and training method thereof |
CN111198939A (en) * | 2019-12-27 | 2020-05-26 | 北京健康之家科技有限公司 | Statement similarity analysis method and device and computer equipment |
CN111627566A (en) * | 2020-05-22 | 2020-09-04 | 泰康保险集团股份有限公司 | Indication information processing method and device, storage medium and electronic equipment |
CN111783419A (en) * | 2020-06-12 | 2020-10-16 | 上海东普信息科技有限公司 | Address similarity calculation method, device, equipment and storage medium |
CN111783419B (en) * | 2020-06-12 | 2024-02-27 | 上海东普信息科技有限公司 | Address similarity calculation method, device, equipment and storage medium |
CN111737954A (en) * | 2020-06-12 | 2020-10-02 | 百度在线网络技术(北京)有限公司 | Text similarity determination method, device, equipment and medium |
US11676609B2 (en) | 2020-07-06 | 2023-06-13 | Beijing Century Tal Education Technology Co. Ltd. | Speaker recognition method, electronic device, and storage medium |
CN111785287A (en) * | 2020-07-06 | 2020-10-16 | 北京世纪好未来教育科技有限公司 | Speaker recognition method, speaker recognition device, electronic equipment and storage medium |
CN111785287B (en) * | 2020-07-06 | 2022-06-07 | 北京世纪好未来教育科技有限公司 | Speaker recognition method, speaker recognition device, electronic equipment and storage medium |
CN111859988A (en) * | 2020-07-28 | 2020-10-30 | 阳光保险集团股份有限公司 | Semantic similarity evaluation method and device and computer-readable storage medium |
CN113743077B (en) * | 2020-08-14 | 2023-09-29 | 北京京东振世信息技术有限公司 | Method and device for determining text similarity |
CN113743077A (en) * | 2020-08-14 | 2021-12-03 | 北京京东振世信息技术有限公司 | Method and device for determining text similarity |
CN112784587B (en) * | 2021-01-07 | 2023-05-16 | 国网福建省电力有限公司泉州供电公司 | Text similarity measurement method and device based on multi-model fusion |
CN112784587A (en) * | 2021-01-07 | 2021-05-11 | 国网福建省电力有限公司泉州供电公司 | Text similarity measurement method and device based on multi-model fusion |
CN112800196A (en) * | 2021-01-18 | 2021-05-14 | 北京明略软件系统有限公司 | FAQ question-answer library matching method and system based on twin network |
CN112800196B (en) * | 2021-01-18 | 2024-03-01 | 南京明略科技有限公司 | FAQ question-answering library matching method and system based on twin network |
WO2023065635A1 (en) * | 2021-10-22 | 2023-04-27 | 平安科技(深圳)有限公司 | Named entity recognition method and apparatus, storage medium and terminal device |
CN114595687A (en) * | 2021-12-20 | 2022-06-07 | 昆明理工大学 | Laos language text regularization method based on BilSTM |
CN114595687B (en) * | 2021-12-20 | 2024-04-19 | 昆明理工大学 | Laos text regularization method based on BiLSTM |
CN116776854A (en) * | 2023-08-25 | 2023-09-19 | 湖南汇智兴创科技有限公司 | Online multi-version document content association method, device, equipment and medium |
CN116776854B (en) * | 2023-08-25 | 2023-11-03 | 湖南汇智兴创科技有限公司 | Online multi-version document content association method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN109543009B (en) | 2019-10-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543009B (en) | Text similarity assessment system and text similarity assessment method | |
Wang et al. | On the use of bert for automated essay scoring: Joint learning of multi-scale essay representation | |
CN109284506A (en) | A kind of user comment sentiment analysis system and method based on attention convolutional neural networks | |
CN108628823A (en) | In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training | |
CN112883714B (en) | ABSC task syntactic constraint method based on dependency graph convolution and transfer learning | |
CN115357719B (en) | Power audit text classification method and device based on improved BERT model | |
CN110162789A (en) | A kind of vocabulary sign method and device based on the Chinese phonetic alphabet | |
CN116579347A (en) | Comment text emotion analysis method, system, equipment and medium based on dynamic semantic feature fusion | |
CN112818698A (en) | Fine-grained user comment sentiment analysis method based on dual-channel model | |
CN114154570A (en) | Sample screening method and system and neural network model training method | |
CN113723083A (en) | Weighted negative supervision text emotion analysis method based on BERT model | |
Thattinaphanich et al. | Thai named entity recognition using Bi-LSTM-CRF with word and character representation | |
CN117312490A (en) | Characterization model of text attribute graph, pre-trained self-supervision method and node representation updated model framework | |
Perera et al. | Personality Classification of text through Machine learning and Deep learning: A Review (2023) | |
Ma et al. | Multi-teacher knowledge distillation for end-to-end text image machine translation | |
Zheng et al. | Named entity recognition in electric power metering domain based on attention mechanism | |
CN114757183B (en) | Cross-domain emotion classification method based on comparison alignment network | |
CN115510230A (en) | Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism | |
CN113779249B (en) | Cross-domain text emotion classification method and device, storage medium and electronic equipment | |
CN114330350A (en) | Named entity identification method and device, electronic equipment and storage medium | |
CN111309849B (en) | Fine-grained value information extraction method based on joint learning model | |
CN114692615B (en) | Small sample intention recognition method for small languages | |
CN117744658A (en) | Ship naming entity identification method based on BERT-BiLSTM-CRF | |
Chavali et al. | A study on named entity recognition with different word embeddings on gmb dataset using deep learning pipelines | |
CN117251545A (en) | Multi-intention natural language understanding method, system, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CP02 | Change in the address of a patent holder |
Address after: 519031 office 1316, No. 1, lianao Road, Hengqin new area, Zhuhai, Guangdong Patentee after: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd. Address before: 519031 room 417, building 20, creative Valley, Hengqin New District, Zhuhai City, Guangdong Province Patentee before: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd. |
|
CP02 | Change in the address of a patent holder | ||
PP01 | Preservation of patent right |
Effective date of registration: 20240718 Granted publication date: 20191025 |
|
PP01 | Preservation of patent right |