
CN109543009A - Text similarity assessment system and text similarity assessment method - Google Patents

Text similarity assessment system and text similarity assessment method

Info

Publication number
CN109543009A
CN109543009A
Authority
CN
China
Prior art keywords
text
doc1
word
module
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811210881.4A
Other languages
Chinese (zh)
Other versions
CN109543009B (en)
Inventor
郑权 (Zheng Quan)
徐泓洋 (Xu Hongyang)
张峰 (Zhang Feng)
聂颖 (Nie Ying)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dragon Horse Zhixin (Zhuhai Hengqin) Technology Co Ltd
Original Assignee
Dragon Horse Zhixin (Zhuhai Hengqin) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dragon Horse Zhixin (Zhuhai Hengqin) Technology Co Ltd
Priority to CN201811210881.4A
Publication of CN109543009A
Application granted
Publication of CN109543009B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/22: Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a text similarity assessment system and a text similarity assessment method. The assessment system includes a corpus acquisition module, a word segmentation module, a word vector training module, a siamese biLSTM network module, an attention module, and a likelihood probability computing module. The invention proposes a siamese network structure based on an attention mechanism that can locate the key information determining the degree of similarity between two texts. Compared with the prior art, the proposed network structure focuses more on local information and excludes interference from global information, improving the accuracy of text similarity computation across different scenarios.

Description

Text similarity assessment system and text similarity assessment method
"Technical Field"
The present invention relates to the field of electronic information and data processing, and in particular to a text similarity assessment system and a text similarity assessment method.
"Background Art"
Text similarity computation is an important research topic in natural language processing. By calculation method, existing approaches divide into character-based methods and corpus-based methods. Character-based methods measure similarity mainly from the character portions that two texts share; they ignore the semantic information of the texts, so while they are effective for unordered character lists, they fail on the language of the texts. Corpus-based methods instead mine the semantic information of characters from textual context to judge the similarity of two texts. Representative work in this line includes methods that compute vector similarity directly, such as word embedding, and methods that build a model for the judgment, such as the siamese network. Corpus-based methods are the mainstream direction of current research.
Corpus-based methods have developed rapidly in recent years, but they still suffer from insufficient information mining. In many application scenarios, the key information that determines the degree of similarity between two texts accounts for only a small part of the text. Existing work mostly mines global semantic information and judges text similarity from it, which is clearly inaccurate.
"Summary of the Invention"
The first object of the present invention is to provide a text similarity assessment system that adds an attention mechanism on top of a siamese biLSTM network module, locates the key information determining the degree of similarity between two texts, and improves the accuracy of text similarity computation across different scenarios. The first object of the present invention is realized by the following technical scheme:
A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus comprising two texts doc1_a and doc1_b;
a word segmentation module, for dividing the text doc1_a and the text doc1_b respectively into word sequences of X words;
a word vector training module, for vectorizing the segmented word sequences of the text doc1_a and the text doc1_b;
a siamese biLSTM network module, including a biLSTMa network module and a biLSTMb network module, respectively used to input the word vector of each word of the text doc1_a and the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the text doc1_a and the text doc1_b, where i takes values 1 to X;
an attention module, which provides a normalized weight ai for the encoded information Hai and a normalized weight bi for the encoded information Hbi, where Σai = 1 and Σbi = 1, computes the attention vector sa of the text doc1_a by the formula sa = Σ ai*Hai, and computes the attention vector sb of the text doc1_b by the formula sb = Σ bi*Hbi;
a likelihood probability computing module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector sa and the attention vector sb.
As a specific technical scheme, the corpus has the form: doc1_a, doc1_b, sim; where sim is a label, sim = 1 indicating that the text doc1_a and the text doc1_b are similar, and sim = 0 indicating that they are dissimilar.
As a specific technical scheme, the attention module includes: a Tanh function processing unit, which computes over the encoded information Hai by the formula uai = Tanh(W*Hai + b) and outputs a weight uai, and computes over the encoded information Hbi by the formula ubi = Tanh(W*Hbi + b) and outputs a weight ubi; a softmax function processing unit, which computes the normalized weight ai of the current word of the text doc1_a by the formula ai = softmax(uai*uw), and computes the normalized weight bi of the current word of the text doc1_b by the formula bi = softmax(ubi*uw); and a weighted-sum unit, for completing the calculation of the formula sa = Σ ai*Hai and the formula sb = Σ bi*Hbi; where W, uw, and b are preset parameters or parameters obtained by training.
As a specific technical scheme, the parameters W, uw, and b are parameters obtained by training; the text similarity assessment system further includes a parameter training module, which sets a loss function and optimizes continuously until the loss function converges, thereby determining the parameters W, uw, and b.
As a specific technical scheme, the loss function uses the following logloss function or a mean squared error function; the logloss function is:
logloss = -(1/N) Σ_{l=1..N} Σ_{j=1..M} y_{l,j} log(p_{l,j})
where N is the total number of test samples, M is the number of classes, y_{l,j} is a binary variable taking the value 0 or 1 that indicates whether the l-th sample belongs to the j-th class, and p_{l,j} is the model-predicted probability that the l-th sample belongs to the j-th class.
As a specific technical scheme, the likelihood probability p is calculated as:
p = g(sa, sb) = exp(-||sa - sb||₁), where 0 < p ≤ 1,
or p = cosine(sa, sb).
A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus comprising two texts doc1_a and doc1_b, the text doc1_a and the text doc1_b each comprising Y sentences;
a word segmentation module, for dividing the k-th sentence of each of the text doc1_a and the text doc1_b respectively into word sequences of X words, where k takes values 1 to Y;
a word vector training module, for vectorizing the segmented word sequences of the k-th sentence of each of the text doc1_a and the text doc1_b;
a first siamese biLSTM network module, including a biLSTMa1 network module and a biLSTMb1 network module, respectively used to input the word vector of each word of the k-th sentence of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the k-th sentence of the text doc1_a and the text doc1_b, where i takes values 1 to X;
a first attention module, which provides a normalized weight ai for the encoded information Hai and a normalized weight bi for the encoded information Hbi, where Σai = 1 and Σbi = 1, computes the attention vector sak of the k-th sentence of the text doc1_a by the formula sak = Σ ai*Hai, and computes the attention vector sbk of the k-th sentence of the text doc1_b by the formula sbk = Σ bi*Hbi;
a second siamese biLSTM network module, including a biLSTMa2 network module and a biLSTMb2 network module, respectively used to input the attention vector sak and the attention vector sbk, perform sentence-level encoding on them, and output the corresponding encoded information HAk and HBk;
a second attention module, which provides a normalized weight Ak for the encoded information HAk and a normalized weight Bk for the encoded information HBk, where ΣAk = 1 and ΣBk = 1, computes the attention vector Va of the text doc1_a by the formula Va = Σ Ak*HAk, and computes the attention vector Vb of the text doc1_b by the formula Vb = Σ Bk*HBk;
a likelihood probability computing module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector Va and the attention vector Vb.
The present invention also provides a memory, characterized in that the memory stores a program of the above text similarity assessment system.
The second object of the present invention is to provide a text similarity assessment method that, based on the text similarity assessment system described above, performs similarity assessment on two input texts to be predicted. The second object of the present invention is realized by the following technical scheme:
A text similarity assessment method, characterized in that two texts to be predicted are input into the text similarity assessment system described above, which outputs the likelihood probability p of the two texts to be predicted.
The present invention also provides a computer, including a memory and a processor, characterized in that the memory stores a program supporting the processor in executing the above text similarity assessment method, and the processor is configured to execute the program stored in the memory.
The beneficial effects of the present invention are: a siamese network structure based on an attention mechanism is proposed that can locate the key information determining the degree of similarity between two texts. Compared with the prior art, the proposed network structure focuses more on local information and excludes interference from global information, improving the accuracy of text similarity computation across different scenarios. The present invention performs particularly well in computing sentence similarity.
"Description of the Drawings"
Fig. 1 is a schematic diagram of the text similarity assessment system provided by the present invention.
Fig. 2 is the network structure of the data processing flow of the text similarity assessment system provided by the present invention.
Fig. 3 is the network structure of the data processing flow of the text similarity assessment system after adding the sentence-level siamese biLSTM network module and the sentence-level attention module on the basis of Fig. 2.
"Specific Embodiments"
Embodiment One
As shown in Fig. 1, the text similarity assessment system provided in this embodiment includes: a corpus acquisition module, a word segmentation module, a word vector training module, a siamese biLSTM network module (biLSTM: Bi-directional Long Short-Term Memory, a bidirectional long short-term memory network), an attention module, a likelihood probability computing module, and a parameter training module. Each is described in detail below:
The corpus acquisition module is used to input a corpus comprising the two texts doc1_a and doc1_b. In this embodiment, the corpus has the form: doc1_a, doc1_b, sim; where doc1_a and doc1_b are two similar or dissimilar texts and sim is a label: sim = 1 indicates similar, sim = 0 indicates dissimilar.
The word segmentation module is used to divide the text doc1_a and the text doc1_b into word sequences. As shown in Fig. 2, the text doc1_a is divided into a word sequence of four words Wa1, Wa2, Wa3, Wa4, and the text doc1_b is divided into a word sequence of four words Wb1, Wb2, Wb3, Wb4. The jieba ("stutter") segmenter is a commonly used word segmentation tool; its effect is as follows:
"Students and teachers of the Institute of Computing of the Chinese Academy of Sciences" ---> [Institute of Computing of the Chinese Academy of Sciences, students, and, teachers].
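As an illustration, a minimal sketch of this segmentation step, assuming the open-source jieba package; the sample sentence is a reconstruction of the example above, not taken verbatim from the patent:

    # A minimal sketch of the word segmentation step, assuming the open-source
    # jieba package (pip install jieba).
    import jieba

    sentence = "中国科学院计算所的学生和老师"
    words = list(jieba.cut(sentence))  # default precise-mode segmentation
    print(words)  # e.g. ['中国科学院', '计算所', '的', '学生', '和', '老师']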
The word vector training module is used to vectorize the word sequences segmented from the text doc1_a and the text doc1_b. Word2vec (word to vector, also called word embeddings) is an open-source word vector generation tool developed by Google. It is a method that uses a neural network model to mine latent semantic associations between words: its core idea is to predict the current word from the words appearing in its context, mining the latent features of a word through its co-occurring words, and its output represents each word as a low-dimensional dense vector. As shown in Fig. 2, the text doc1_a is divided into the four words Wa1, Wa2, Wa3, Wa4, whose word vectors are denoted Va1, Va2, Va3, Va4; the text doc1_b is divided into the four words Wb1, Wb2, Wb3, Wb4, whose word vectors are denoted Vb1, Vb2, Vb3, Vb4.
In practice the dimension of the word vectors can be set according to the specific situation; it is conventionally set to 300 dimensions, i.e. a 300×1 vector. For example:
"Chinese Academy of Sciences" -> [0.03, 0.3, 0.423, 0.43, 0.7623, 1.32, 2.34, 0.1323, ...], a 300×1 vector in which the value of each dimension is obtained by training on the corpus.
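As an illustration only, a sketch of this word vector training step, assuming the gensim library (version 4.0 or later); the corpus and all parameter values are illustrative:

    # A sketch of word2vec training, assuming the gensim library (>= 4.0).
    # `tokenized_corpus` is a small illustrative stand-in for a real corpus.
    from gensim.models import Word2Vec

    tokenized_corpus = [
        ["中国科学院", "计算所", "的", "学生", "和", "老师"],
        ["学生", "和", "老师", "讨论", "文本", "相似度"],
    ]
    model = Word2Vec(
        sentences=tokenized_corpus,
        vector_size=300,  # 300-dimensional vectors, as suggested above
        window=5,         # context window used to predict the current word
        min_count=1,
        sg=1,             # skip-gram variant
    )
    vec = model.wv["中国科学院"]  # a 300-dimensional dense vector trained from the corpus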
As is well known, the LSTM network (Long Short-Term Memory) is an improved model of the recurrent neural network (Recurrent Neural Network, abbreviated RNN): a forget gate decides which information should be filtered out, an input gate determines the current input information and the current state, and an output gate determines the output. Through this gating mechanism it learns the contextual information of a text. BiLSTM is a bidirectional structure built on the idea that both the forward and the reverse reading of a text can capture useful information; during training the outputs of the two directions are usually spliced together and passed into the next layer.
As shown in Fig. 1, the siamese biLSTM network module includes a biLSTMa network module and a biLSTMb network module, respectively used to input the word vectors of each word of the text doc1_a and the text doc1_b (Va1, Va2, Va3, Va4 and Vb1, Vb2, Vb3, Vb4 described above), perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the text doc1_a and the text doc1_b, where i indexes the segmented words of the two texts; in this embodiment i takes 1 to 4, e.g. Ha1 to Ha4 and Hb1 to Hb4 in Fig. 2. Hai and Hbi are obtained by splicing the hidden-layer state vectors of the biLSTM network module; for example, Hai = [ha+i, ha-i], where ha+i and ha-i are the hidden-layer state vectors generated by the two different hidden layers of biLSTMa.
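A sketch of this word-level encoding, assuming PyTorch; the hidden size is illustrative, since the patent does not fix one:

    # Word-level biLSTM encoding, assuming PyTorch; dimensions are illustrative.
    import torch
    import torch.nn as nn

    EMB_DIM, HID_DIM, X = 300, 128, 4  # 4 words per text, as in Fig. 2

    bilstm_a = nn.LSTM(EMB_DIM, HID_DIM, bidirectional=True, batch_first=True)

    va = torch.randn(1, X, EMB_DIM)  # word vectors Va1..Va4, batch of one text
    H_a, _ = bilstm_a(va)            # shape (1, X, 2*HID_DIM)
    # Each row Hai of H_a concatenates the forward and backward hidden states,
    # matching Hai = [ha+i, ha-i] in the text above.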
The attention module includes a Tanh function processing unit, a softmax function processing unit, and a weighted-sum unit. As shown in Fig. 1, they are described as follows:
the Tanh function processing unit computes over the encoded information Hai by the formula uai = Tanh(W*Hai + b) and outputs a weight uai, and computes over the encoded information Hbi by the formula ubi = Tanh(W*Hbi + b) and outputs a weight ubi;
the softmax function processing unit computes the normalized weight ai of the current word of the text doc1_a by the formula ai = softmax(uai*uw), and computes the normalized weight bi of the current word of the text doc1_b by the formula bi = softmax(ubi*uw), where Σai = 1 and Σbi = 1;
the weighted-sum unit computes the attention vector sa of the text doc1_a by the formula sa = Σ ai*Hai, and the attention vector sb of the text doc1_b by the formula sb = Σ bi*Hbi.
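A sketch of the three units of the attention module, assuming PyTorch; W, b, and uw correspond to the trainable parameters named above, and the encoding dimension is illustrative:

    # The attention module: Tanh unit, softmax unit, weighted-sum unit.
    import torch
    import torch.nn as nn

    class WordAttention(nn.Module):
        def __init__(self, enc_dim):
            super().__init__()
            self.W_b = nn.Linear(enc_dim, enc_dim)        # computes W*Hai + b
            self.uw = nn.Parameter(torch.randn(enc_dim))  # context vector uw

        def forward(self, H):                      # H: (batch, X, enc_dim)
            u = torch.tanh(self.W_b(H))            # uai = Tanh(W*Hai + b)
            a = torch.softmax(u @ self.uw, dim=1)  # ai = softmax(uai*uw), sum ai = 1
            s = (a.unsqueeze(-1) * H).sum(dim=1)   # sa = sum_i ai*Hai
            return s, a

    attn = WordAttention(enc_dim=256)
    sa, ai = attn(torch.randn(1, 4, 256))  # attention vector for one text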
The likelihood probability computing module computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector sa and the attention vector sb. Specifically, the likelihood probability p can be calculated by the Manhattan distance method or the cosine method, with the corresponding formulas:
p = g(sa, sb) = exp(-||sa - sb||₁), where 0 < p ≤ 1,
or p = cosine(sa, sb).
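Both likelihood formulas are straightforward to express; a sketch assuming PyTorch tensors sa and sb of shape (batch, dim):

    # The two likelihood probability options given above.
    import torch
    import torch.nn.functional as F

    def manhattan_likelihood(sa, sb):
        # p = exp(-||sa - sb||_1), hence 0 < p <= 1
        return torch.exp(-torch.sum(torch.abs(sa - sb), dim=-1))

    def cosine_likelihood(sa, sb):
        # p = cosine(sa, sb)
        return F.cosine_similarity(sa, sb, dim=-1)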
The parameters W, uw, and b in the above formulas are preset or obtained by training. In this embodiment, the parameter training module sets a loss function and optimizes continuously until the loss function converges, thereby determining the parameters W, uw, and b. In this embodiment, the loss function logloss is:
logloss = -(1/N) Σ_{l=1..N} Σ_{j=1..M} y_{l,j} log(p_{l,j})
where N is the total number of test samples, M is the number of classes, y_{l,j} is a binary variable taking the value 0 or 1 that indicates whether the l-th sample belongs to the j-th class, and p_{l,j} is the model-predicted probability that the l-th sample belongs to the j-th class.
The optimization method is gradient descent: the data are input and training proceeds until convergence. The loss function may also be chosen as mean squared error (MSE) or another function according to the actual situation. Training a model with a loss function to determine suitable parameters belongs to the prior art of neural network training and is not repeated here.
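A sketch of the training loop, assuming PyTorch. Since sim takes values in {0, 1}, the logloss above reduces to binary cross-entropy; the model below is a hypothetical end-to-end combination of the siamese biLSTM, the attention module, and the Manhattan likelihood, and the weight sharing between the two branches is an assumption (the patent draws biLSTMa and biLSTMb as separate modules):

    # End-to-end training sketch, assuming PyTorch. All sizes and data are toy
    # values; weight sharing between the two branches is an assumption.
    import torch
    import torch.nn as nn

    class SiameseSimilarity(nn.Module):
        def __init__(self, emb_dim=300, hid_dim=128):
            super().__init__()
            self.encoder = nn.LSTM(emb_dim, hid_dim, bidirectional=True,
                                   batch_first=True)
            self.W_b = nn.Linear(2 * hid_dim, 2 * hid_dim)
            self.uw = nn.Parameter(torch.randn(2 * hid_dim))

        def attend(self, x):
            H, _ = self.encoder(x)
            a = torch.softmax(torch.tanh(self.W_b(H)) @ self.uw, dim=1)
            return (a.unsqueeze(-1) * H).sum(dim=1)

        def forward(self, va, vb):
            sa, sb = self.attend(va), self.attend(vb)
            return torch.exp(-torch.sum(torch.abs(sa - sb), dim=-1))

    model = SiameseSimilarity()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # gradient descent
    loss_fn = nn.BCELoss()  # logloss; nn.MSELoss() is the mean squared error option

    va = torch.randn(8, 4, 300)              # toy word vectors for doc1_a
    vb = torch.randn(8, 4, 300)              # toy word vectors for doc1_b
    sim = torch.randint(0, 2, (8,)).float()  # toy labels
    for _ in range(100):                     # optimize until the loss converges
        loss = loss_fn(model(va, vb), sim)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()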
Embodiment Two
The present invention does not restrict the length of the texts it can process. For a sentence, the word-level siamese biLSTM network and attention module structure provided in Embodiment One yields an attention vector that represents the sentence. For a long text such as a paragraph or an article, after the multiple sentence vectors are obtained (i.e. the sa and sb obtained in multiple passes of Embodiment One), one more layer is added: a sentence-level siamese biLSTM network module and a sentence-level attention module, which finally produce an attention vector that can represent the long text; the other links are unchanged.
Embodiment Two provides a text similarity assessment system that can handle two long texts each containing multiple sentences. The assessment system includes a corpus acquisition module, a word segmentation module, a word vector training module, a first siamese biLSTM network module, a first attention module, a second siamese biLSTM network module, a second attention module, and a likelihood probability computing module, described in detail as follows:
the corpus acquisition module, for inputting a corpus comprising two texts doc1_a and doc1_b, the text doc1_a and the text doc1_b each comprising Y sentences (e.g. 4 sentences each);
the word segmentation module, for dividing the k-th sentence of each of the text doc1_a and the text doc1_b (Fig. 3 shows only one sentence of each text as an example) into word sequences of X words (e.g. 4 words: Wa1, Wa2, Wa3, Wa4 and Wb1, Wb2, Wb3, Wb4 in Fig. 3), where k takes 1 to Y;
the word vector training module, for vectorizing the segmented word sequences of the k-th sentence of each of the text doc1_a and the text doc1_b; for example, the text doc1_a is divided into the four words Wa1, Wa2, Wa3, Wa4 with word vectors Va1, Va2, Va3, Va4, and the text doc1_b is divided into the four words Wb1, Wb2, Wb3, Wb4 with word vectors Vb1, Vb2, Vb3, Vb4;
the first siamese biLSTM network module, including a biLSTMa1 network module and a biLSTMb1 network module, respectively used to input the word vector of each word of the k-th sentence of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the k-th sentence of the text doc1_a and the text doc1_b, where i takes 1 to X; see Ha1 to Ha4 and Hb1 to Hb4 in Fig. 3;
the first attention module, which provides normalized weights ai and bi for the encoded information Hai and Hbi respectively, where Σai = 1 and Σbi = 1, computes the attention vector sak of the k-th sentence of the text doc1_a (e.g. sa3 in Fig. 3) by the formula sak = Σ ai*Hai, and computes the attention vector sbk of the k-th sentence of the text doc1_b (e.g. sb3 in Fig. 3) by the formula sbk = Σ bi*Hbi;
the second siamese biLSTM network module, including a biLSTMa2 network module and a biLSTMb2 network module, respectively used to input the attention vector sak and the attention vector sbk, perform sentence-level encoding on them, and output the corresponding encoded information HAk and HBk; see HA1 to HA4 and HB1 to HB4 in Fig. 3;
the second attention module, which provides normalized weights Ak and Bk for the encoded information HAk and HBk respectively (see A1-A4 and B1-B4 in Fig. 3), where ΣAk = 1 and ΣBk = 1, computes the attention vector Va of the text doc1_a by the formula Va = Σ Ak*HAk, and computes the attention vector Vb of the text doc1_b by the formula Vb = Σ Bk*HBk (a sketch of this sentence-level stage follows the list);
the likelihood probability computing module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector Va and the attention vector Vb.
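A sketch of the added sentence-level stage, assuming PyTorch; the first-stage output is stood in for by random sentence vectors, and all dimensions are illustrative:

    # Sentence-level stage of Embodiment Two, assuming PyTorch.
    import torch
    import torch.nn as nn

    ENC_DIM, Y = 256, 4  # sentence-vector size and Y = 4 sentences per text

    bilstm2 = nn.LSTM(ENC_DIM, ENC_DIM // 2, bidirectional=True, batch_first=True)
    W_b2 = nn.Linear(ENC_DIM, ENC_DIM)
    uw2 = nn.Parameter(torch.randn(ENC_DIM))

    # sak (k = 1..Y): word-level attention vectors of the Y sentences of doc1_a,
    # stood in here by random tensors in place of the first-stage output.
    sak = torch.randn(1, Y, ENC_DIM)

    HA, _ = bilstm2(sak)                                  # sentence-level encoding HAk
    A = torch.softmax(torch.tanh(W_b2(HA)) @ uw2, dim=1)  # weights Ak, sum Ak = 1
    Va = (A.unsqueeze(-1) * HA).sum(dim=1)                # Va = sum_k Ak*HAk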
The first siamese biLSTM network module and the second siamese biLSTM network module in Embodiment Two have the same function as the siamese biLSTM network module described in Embodiment One; the qualifiers "first" and "second" in this embodiment serve only to distinguish the word-level processing stage from the sentence-level processing stage. Likewise, the first attention module and the second attention module in Embodiment Two have the same function as the attention module described in Embodiment One, the qualifiers "first" and "second" again serving only to distinguish the word-level stage from the sentence-level stage. The likelihood probability computing module in Embodiment Two is identical in function to the likelihood probability computing module described in Embodiment One. In addition, the text similarity assessment system provided in Embodiment Two also includes a parameter training module identical in function to the parameter training module described in Embodiment One.
This embodiment also provides a memory that stores a program of the above text similarity assessment system.
This embodiment also provides a text similarity assessment method: the trained text similarity assessment system can process texts to be predicted; the two texts to be judged are input into the model, which outputs their likelihood probability p as the similarity of the two texts.
This embodiment also provides a computer, including a memory and a processor; the memory stores a program supporting the processor in executing the above text similarity assessment method, and the processor is configured to execute the program stored in the memory.
The above embodiments are a sufficient disclosure and are not intended to limit the present invention; any replacement by equivalent technical features that can be obtained without creative work, based on the inventive subject matter of the present invention, shall be considered within the scope disclosed by this application.

Claims (10)

1. A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus comprising two texts doc1_a and doc1_b;
a word segmentation module, for dividing the text doc1_a and the text doc1_b respectively into word sequences of X words;
a word vector training module, for vectorizing the segmented word sequences of the text doc1_a and the text doc1_b;
a siamese biLSTM network module, including a biLSTMa network module and a biLSTMb network module, respectively used to input the word vector of each word of the text doc1_a and the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the text doc1_a and the text doc1_b, where i takes values 1 to X;
an attention module, which provides a normalized weight ai for the encoded information Hai and a normalized weight bi for the encoded information Hbi, where Σai = 1 and Σbi = 1, computes the attention vector sa of the text doc1_a by the formula sa = Σ ai*Hai, and computes the attention vector sb of the text doc1_b by the formula sb = Σ bi*Hbi;
a likelihood probability computing module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector sa and the attention vector sb.
2. The text similarity assessment system according to claim 1, characterized in that the corpus has the form: doc1_a, doc1_b, sim; where sim is a label, sim = 1 indicating that the text doc1_a and the text doc1_b are similar, and sim = 0 indicating that they are dissimilar.
3. The text similarity assessment system according to claim 1, characterized in that the attention module includes: a Tanh function processing unit, which computes over the encoded information Hai by the formula uai = Tanh(W*Hai + b) and outputs a weight uai, and computes over the encoded information Hbi by the formula ubi = Tanh(W*Hbi + b) and outputs a weight ubi; a softmax function processing unit, which computes the normalized weight ai of the current word of the text doc1_a by the formula ai = softmax(uai*uw), and computes the normalized weight bi of the current word of the text doc1_b by the formula bi = softmax(ubi*uw); and a weighted-sum unit, for completing the calculation of the formula sa = Σ ai*Hai and the formula sb = Σ bi*Hbi; where W, uw, and b are preset parameters or parameters obtained by training.
4. The text similarity assessment system according to claim 3, characterized in that the parameters W, uw, and b are parameters obtained by training; the text similarity assessment system further includes a parameter training module, which sets a loss function and optimizes continuously until the loss function converges, thereby determining the parameters W, uw, and b.
5. The text similarity assessment system according to claim 1, characterized in that the loss function uses the following logloss function or a mean squared error function, the logloss function being:
logloss = -(1/N) Σ_{l=1..N} Σ_{j=1..M} y_{l,j} log(p_{l,j})
where N is the total number of test samples, M is the number of classes, y_{l,j} is a binary variable taking the value 0 or 1 that indicates whether the l-th sample belongs to the j-th class, and p_{l,j} is the model-predicted probability that the l-th sample belongs to the j-th class.
6. The text similarity assessment system according to claim 1, characterized in that the likelihood probability p is calculated as:
p = g(sa, sb) = exp(-||sa - sb||₁), where 0 < p ≤ 1,
or p = cosine(sa, sb).
7. A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus comprising two texts doc1_a and doc1_b, the text doc1_a and the text doc1_b each comprising Y sentences;
a word segmentation module, for dividing the k-th sentence of each of the text doc1_a and the text doc1_b respectively into word sequences of X words, where k takes values 1 to Y;
a word vector training module, for vectorizing the segmented word sequences of the k-th sentence of each of the text doc1_a and the text doc1_b;
a first siamese biLSTM network module, including a biLSTMa1 network module and a biLSTMb1 network module, respectively used to input the word vector of each word of the k-th sentence of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the k-th sentence of the text doc1_a and the text doc1_b, where i takes values 1 to X;
a first attention module, which provides a normalized weight ai for the encoded information Hai and a normalized weight bi for the encoded information Hbi, where Σai = 1 and Σbi = 1, computes the attention vector sak of the k-th sentence of the text doc1_a by the formula sak = Σ ai*Hai, and computes the attention vector sbk of the k-th sentence of the text doc1_b by the formula sbk = Σ bi*Hbi;
a second siamese biLSTM network module, including a biLSTMa2 network module and a biLSTMb2 network module, respectively used to input the attention vector sak and the attention vector sbk, perform sentence-level encoding on the attention vector sak and the attention vector sbk, and output the corresponding encoded information HAk and HBk;
a second attention module, which provides a normalized weight Ak for the encoded information HAk and a normalized weight Bk for the encoded information HBk, where ΣAk = 1 and ΣBk = 1, computes the attention vector Va of the text doc1_a by the formula Va = Σ Ak*HAk, and computes the attention vector Vb of the text doc1_b by the formula Vb = Σ Bk*HBk;
a likelihood probability computing module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector Va and the attention vector Vb.
8. A memory, characterized in that the memory stores a program of the text similarity assessment system according to any one of claims 1 to 7.
9. A text similarity assessment method, characterized in that two texts to be predicted are input into the text similarity assessment system according to any one of claims 1 to 7, which outputs the likelihood probability p of the two texts to be predicted.
10. A computer, including a memory and a processor, characterized in that the memory stores a program supporting the processor in executing the text similarity assessment method according to claim 9, and the processor is configured to execute the program stored in the memory.
CN201811210881.4A 2018-10-17 2018-10-17 Text similarity assessment system and text similarity assessment method Active CN109543009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811210881.4A CN109543009B (en) 2018-10-17 2018-10-17 Text similarity assessment system and text similarity assessment method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811210881.4A CN109543009B (en) 2018-10-17 2018-10-17 Text similarity assessment system and text similarity assessment method

Publications (2)

Publication Number Publication Date
CN109543009A (en) 2019-03-29
CN109543009B CN109543009B (en) 2019-10-25

Family

ID=65843947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811210881.4A Active CN109543009B (en) Text similarity assessment system and text similarity assessment method

Country Status (1)

Country Link
CN (1) CN109543009B (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046240A (en) * 2019-04-16 2019-07-23 浙江爱闻格环保科技有限公司 In conjunction with the target domain question and answer method for pushing of keyword retrieval and twin neural network
CN110211594A (en) * 2019-06-06 2019-09-06 杭州电子科技大学 A kind of method for distinguishing speek person based on twin network model and KNN algorithm
CN110362681A (en) * 2019-06-19 2019-10-22 平安科技(深圳)有限公司 The recognition methods of question answering system replication problem, device and storage medium
CN110413988A (en) * 2019-06-17 2019-11-05 平安科技(深圳)有限公司 Method, apparatus, server and the storage medium of text information matching measurement
CN110610003A (en) * 2019-08-15 2019-12-24 阿里巴巴集团控股有限公司 Method and system for assisting text annotation
CN110717332A (en) * 2019-07-26 2020-01-21 昆明理工大学 News and case similarity calculation method based on asymmetric twin network
CN110738059A (en) * 2019-10-21 2020-01-31 支付宝(杭州)信息技术有限公司 text similarity calculation method and system
CN110941951A (en) * 2019-10-15 2020-03-31 平安科技(深圳)有限公司 Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN111198939A (en) * 2019-12-27 2020-05-26 北京健康之家科技有限公司 Statement similarity analysis method and device and computer equipment
CN111209395A (en) * 2019-12-27 2020-05-29 铜陵中科汇联科技有限公司 Short text similarity calculation system and training method thereof
CN111627566A (en) * 2020-05-22 2020-09-04 泰康保险集团股份有限公司 Indication information processing method and device, storage medium and electronic equipment
CN111737954A (en) * 2020-06-12 2020-10-02 百度在线网络技术(北京)有限公司 Text similarity determination method, device, equipment and medium
CN111783419A (en) * 2020-06-12 2020-10-16 上海东普信息科技有限公司 Address similarity calculation method, device, equipment and storage medium
CN111785287A (en) * 2020-07-06 2020-10-16 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium
CN111859988A (en) * 2020-07-28 2020-10-30 阳光保险集团股份有限公司 Semantic similarity evaluation method and device and computer-readable storage medium
CN112784587A (en) * 2021-01-07 2021-05-11 国网福建省电力有限公司泉州供电公司 Text similarity measurement method and device based on multi-model fusion
CN112800196A (en) * 2021-01-18 2021-05-14 北京明略软件系统有限公司 FAQ question-answer library matching method and system based on twin network
CN113743077A (en) * 2020-08-14 2021-12-03 北京京东振世信息技术有限公司 Method and device for determining text similarity
CN114595687A (en) * 2021-12-20 2022-06-07 昆明理工大学 Laos language text regularization method based on BilSTM
WO2023065635A1 (en) * 2021-10-22 2023-04-27 平安科技(深圳)有限公司 Named entity recognition method and apparatus, storage medium and terminal device
CN116776854A (en) * 2023-08-25 2023-09-19 湖南汇智兴创科技有限公司 Online multi-version document content association method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168954A (en) * 2017-05-18 2017-09-15 北京奇艺世纪科技有限公司 Text key word generation method and device and electronic equipment and readable storage medium storing program for executing
CN107291699A (en) * 2017-07-04 2017-10-24 湖南星汉数智科技有限公司 A kind of sentence semantic similarity computational methods
CN107562812A (en) * 2017-08-11 2018-01-09 北京大学 A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space
CN108132931A (en) * 2018-01-12 2018-06-08 北京神州泰岳软件股份有限公司 A kind of matched method and device of text semantic
CN108415977A (en) * 2018-02-09 2018-08-17 华南理工大学 One is read understanding method based on the production machine of deep neural network and intensified learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168954A (en) * 2017-05-18 2017-09-15 北京奇艺世纪科技有限公司 Text key word generation method and device and electronic equipment and readable storage medium storing program for executing
CN107291699A (en) * 2017-07-04 2017-10-24 湖南星汉数智科技有限公司 A kind of sentence semantic similarity computational methods
CN107562812A (en) * 2017-08-11 2018-01-09 北京大学 A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space
CN108132931A (en) * 2018-01-12 2018-06-08 北京神州泰岳软件股份有限公司 A kind of matched method and device of text semantic
CN108415977A (en) * 2018-02-09 2018-08-17 华南理工大学 One is read understanding method based on the production machine of deep neural network and intensified learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZONGKUI ZHU et al.: "A Semantic Similarity Computing Model based on Siamese Network for Duplicate Questions Identification", 《PROCEEDINGS OF THE EVALUATION TASKS AT THE CHINA CONFERENCE ON KNOWLEDGE GRAPH AND SEMANTIC COMPUTING》 *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046240A (en) * 2019-04-16 2019-07-23 浙江爱闻格环保科技有限公司 In conjunction with the target domain question and answer method for pushing of keyword retrieval and twin neural network
CN110046240B (en) * 2019-04-16 2020-12-08 浙江爱闻格环保科技有限公司 Target field question-answer pushing method combining keyword retrieval and twin neural network
CN110211594A (en) * 2019-06-06 2019-09-06 杭州电子科技大学 A kind of method for distinguishing speek person based on twin network model and KNN algorithm
CN110211594B (en) * 2019-06-06 2021-05-04 杭州电子科技大学 Speaker identification method based on twin network model and KNN algorithm
CN110413988A (en) * 2019-06-17 2019-11-05 平安科技(深圳)有限公司 Method, apparatus, server and the storage medium of text information matching measurement
CN110413988B (en) * 2019-06-17 2023-01-31 平安科技(深圳)有限公司 Text information matching measurement method, device, server and storage medium
CN110362681B (en) * 2019-06-19 2023-09-22 平安科技(深圳)有限公司 Method, device and storage medium for identifying repeated questions of question-answering system
CN110362681A (en) * 2019-06-19 2019-10-22 平安科技(深圳)有限公司 The recognition methods of question answering system replication problem, device and storage medium
WO2020252930A1 (en) * 2019-06-19 2020-12-24 平安科技(深圳)有限公司 Method, apparatus, and device for identifying duplicate questions in question answering system, and storage medium
CN110717332A (en) * 2019-07-26 2020-01-21 昆明理工大学 News and case similarity calculation method based on asymmetric twin network
CN110610003A (en) * 2019-08-15 2019-12-24 阿里巴巴集团控股有限公司 Method and system for assisting text annotation
CN110610003B (en) * 2019-08-15 2023-09-15 创新先进技术有限公司 Method and system for assisting text annotation
CN110941951A (en) * 2019-10-15 2020-03-31 平安科技(深圳)有限公司 Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
WO2021072863A1 (en) * 2019-10-15 2021-04-22 平安科技(深圳)有限公司 Method and apparatus for calculating text similarity, electronic device, and computer-readable storage medium
CN110738059A (en) * 2019-10-21 2020-01-31 支付宝(杭州)信息技术有限公司 text similarity calculation method and system
CN111209395A (en) * 2019-12-27 2020-05-29 铜陵中科汇联科技有限公司 Short text similarity calculation system and training method thereof
CN111198939A (en) * 2019-12-27 2020-05-26 北京健康之家科技有限公司 Statement similarity analysis method and device and computer equipment
CN111627566A (en) * 2020-05-22 2020-09-04 泰康保险集团股份有限公司 Indication information processing method and device, storage medium and electronic equipment
CN111783419A (en) * 2020-06-12 2020-10-16 上海东普信息科技有限公司 Address similarity calculation method, device, equipment and storage medium
CN111783419B (en) * 2020-06-12 2024-02-27 上海东普信息科技有限公司 Address similarity calculation method, device, equipment and storage medium
CN111737954A (en) * 2020-06-12 2020-10-02 百度在线网络技术(北京)有限公司 Text similarity determination method, device, equipment and medium
US11676609B2 (en) 2020-07-06 2023-06-13 Beijing Century Tal Education Technology Co. Ltd. Speaker recognition method, electronic device, and storage medium
CN111785287A (en) * 2020-07-06 2020-10-16 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium
CN111785287B (en) * 2020-07-06 2022-06-07 北京世纪好未来教育科技有限公司 Speaker recognition method, speaker recognition device, electronic equipment and storage medium
CN111859988A (en) * 2020-07-28 2020-10-30 阳光保险集团股份有限公司 Semantic similarity evaluation method and device and computer-readable storage medium
CN113743077B (en) * 2020-08-14 2023-09-29 北京京东振世信息技术有限公司 Method and device for determining text similarity
CN113743077A (en) * 2020-08-14 2021-12-03 北京京东振世信息技术有限公司 Method and device for determining text similarity
CN112784587B (en) * 2021-01-07 2023-05-16 国网福建省电力有限公司泉州供电公司 Text similarity measurement method and device based on multi-model fusion
CN112784587A (en) * 2021-01-07 2021-05-11 国网福建省电力有限公司泉州供电公司 Text similarity measurement method and device based on multi-model fusion
CN112800196A (en) * 2021-01-18 2021-05-14 北京明略软件系统有限公司 FAQ question-answer library matching method and system based on twin network
CN112800196B (en) * 2021-01-18 2024-03-01 南京明略科技有限公司 FAQ question-answering library matching method and system based on twin network
WO2023065635A1 (en) * 2021-10-22 2023-04-27 平安科技(深圳)有限公司 Named entity recognition method and apparatus, storage medium and terminal device
CN114595687A (en) * 2021-12-20 2022-06-07 昆明理工大学 Laos language text regularization method based on BilSTM
CN114595687B (en) * 2021-12-20 2024-04-19 昆明理工大学 Laos text regularization method based on BiLSTM
CN116776854A (en) * 2023-08-25 2023-09-19 湖南汇智兴创科技有限公司 Online multi-version document content association method, device, equipment and medium
CN116776854B (en) * 2023-08-25 2023-11-03 湖南汇智兴创科技有限公司 Online multi-version document content association method, device, equipment and medium

Also Published As

Publication number Publication date
CN109543009B (en) 2019-10-25

Similar Documents

Publication Publication Date Title
CN109543009B (en) Text similarity assessment system and text similarity assessment method
Wang et al. On the use of bert for automated essay scoring: Joint learning of multi-scale essay representation
CN109284506A (en) A kind of user comment sentiment analysis system and method based on attention convolutional neural networks
CN108628823A (en) In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training
CN112883714B (en) ABSC task syntactic constraint method based on dependency graph convolution and transfer learning
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN110162789A (en) A kind of vocabulary sign method and device based on the Chinese phonetic alphabet
CN116579347A (en) Comment text emotion analysis method, system, equipment and medium based on dynamic semantic feature fusion
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN114154570A (en) Sample screening method and system and neural network model training method
CN113723083A (en) Weighted negative supervision text emotion analysis method based on BERT model
Thattinaphanich et al. Thai named entity recognition using Bi-LSTM-CRF with word and character representation
CN117312490A (en) Characterization model of text attribute graph, pre-trained self-supervision method and node representation updated model framework
Perera et al. Personality Classification of text through Machine learning and Deep learning: A Review (2023)
Ma et al. Multi-teacher knowledge distillation for end-to-end text image machine translation
Zheng et al. Named entity recognition in electric power metering domain based on attention mechanism
CN114757183B (en) Cross-domain emotion classification method based on comparison alignment network
CN115510230A (en) Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism
CN113779249B (en) Cross-domain text emotion classification method and device, storage medium and electronic equipment
CN114330350A (en) Named entity identification method and device, electronic equipment and storage medium
CN111309849B (en) Fine-grained value information extraction method based on joint learning model
CN114692615B (en) Small sample intention recognition method for small languages
CN117744658A (en) Ship naming entity identification method based on BERT-BiLSTM-CRF
Chavali et al. A study on named entity recognition with different word embeddings on gmb dataset using deep learning pipelines
CN117251545A (en) Multi-intention natural language understanding method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 519031 Office 1316, No. 1, Lianao Road, Hengqin New Area, Zhuhai, Guangdong

Patentee after: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.

Address before: 519031 room 417, building 20, creative Valley, Hengqin New District, Zhuhai City, Guangdong Province

Patentee before: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd.

PP01 Preservation of patent right

Effective date of registration: 20240718

Granted publication date: 20191025
