CN109543009A - Text similarity assessment system and text similarity assessment method - Google Patents
Text similarity assessment system and text similarity assessment method
- Publication number
- CN109543009A (application number CN201811210881.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- doc1
- word
- module
- formula
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 31
- 239000013598 vector Substances 0.000 claims abstract description 78
- 238000012549 training Methods 0.000 claims abstract description 20
- 230000011218 segmentation Effects 0.000 claims abstract description 11
- 230000006870 function Effects 0.000 claims description 31
- 238000012545 processing Methods 0.000 claims description 10
- 230000015654 memory Effects 0.000 claims description 9
- 230000005055 memory storage Effects 0.000 claims description 4
- 238000012360 testing method Methods 0.000 claims description 3
- 238000004364 calculation method Methods 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- 230000006403 short-term memory Effects 0.000 description 3
- 230000000694 effects Effects 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 230000000306 recurrent effect Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 230000007787 long-term memory Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 238000013139 quantization Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a text similarity assessment system and a text similarity assessment method. The assessment system includes a corpus acquisition module, a word segmentation module, a word vector training module, a Siamese biLSTM network module, an attention module and a likelihood probability computation module. The invention proposes a Siamese network structure based on an attention mechanism, which can find the key information that determines the degree of similarity between two texts. Compared with the prior art, the network structure proposed by the present invention focuses more on local information, excludes the interference of global information, and improves the accuracy of text similarity computation in different scenarios.
Description
"Technical Field"
The present invention relates to the field of electronic information and data processing, and in particular to a text similarity assessment system and a text similarity assessment method.
"Background Art"
Text similarity computation is an important research topic in the field of natural language processing. According to the calculation method, it can be divided into character-based methods and corpus-based methods. Character-based methods mainly measure similarity from the characters shared by two texts; they do not consider the semantic information of the text, so they are effective for judging unordered character lists but fail on the language of the text itself. Corpus-based methods, in contrast, mine the semantic information of characters from the context of the text in order to judge the similarity of two texts. Representative work in this line of research includes methods that directly compute vector similarity, such as word embedding, and methods that build a model to make the judgement, such as the Siamese network. Corpus-based methods are the mainstream direction of current research.
Corpus-based methods have developed rapidly in recent years, but they still suffer from insufficient information mining. In many application scenarios, the key information that determines the degree of similarity between two texts accounts for only a very small part of the text. Existing work mostly mines global semantic information and judges the degree of similarity of the texts from that global information, which is clearly inaccurate.
"Summary of the Invention"
The first object of the present invention is to provide a text similarity assessment system which adds an attention mechanism on top of a Siamese biLSTM network module, finds the key information that determines the degree of similarity between two texts, and improves the accuracy of text similarity computation in different scenarios. The first object of the present invention is achieved by the following technical scheme:
A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus containing two texts doc1_a and doc1_b;
a word segmentation module, for dividing the text doc1_a and the text doc1_b each into a word sequence of X words;
a word vector training module, for vectorizing the word sequences obtained by segmenting the text doc1_a and the text doc1_b;
a Siamese biLSTM network module, including a biLSTMa network module and a biLSTMb network module, respectively used to input the word vector of each word of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the text doc1_a and the text doc1_b, with i ranging from 1 to X;
an attention module, which provides a regularized weight ai and a regularized weight bi for the encoded information Hai and Hbi respectively, where ∑ai = 1 and ∑bi = 1, and which computes the attention vector sa of the text doc1_a by the formula sa = ∑ai*Hai and the attention vector sb of the text doc1_b by the formula sb = ∑bi*Hbi;
a likelihood probability computation module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector sa and the attention vector sb.
As a specific technical scheme, the corpus takes the form: doc1_a, doc1_b, sim; where sim is a label, sim = 1 indicates that the text doc1_a and the text doc1_b are similar, and sim = 0 indicates that the text doc1_a and the text doc1_b are dissimilar.
As a specific technical scheme, the attention module includes: a Tanh function processing unit, which computes the encoded information Hai by the formula uai = Tanh(W*Hai + b) and outputs a weight uai, and computes the encoded information Hbi by the formula ubi = Tanh(W*Hbi + b) and outputs a weight ubi; a softmax function processing unit, which computes the regularized weight ai of the current word of the text doc1_a by the formula ai = softmax(uai*uw), and computes the regularized weight bi of the current word of the text doc1_b by the formula bi = softmax(ubi*uw); and a weighted summation unit, for completing the calculation of the formula sa = ∑ai*Hai and the calculation of the formula sb = ∑bi*Hbi; where W, uw and b are either set parameters or parameters obtained by training.
As a specific technical scheme, the parameters W, uw and b are parameters obtained by training; the text similarity assessment system further includes a parameter training module, which sets a loss function and optimizes continuously until the loss function converges, thereby determining the parameters W, uw and b.
As a specific technical scheme, the loss function uses the following logloss function or uses a mean square error function; the logloss function is:
logloss = -(1/N) ∑_{l=1}^{N} ∑_{j=1}^{M} y_{l,j}·log(p_{l,j})
where N is the total number of test samples, M is the total number of classes, y_{l,j} is a binary variable whose value 0 or 1 indicates whether the l-th sample belongs to the j-th class, and p_{l,j} is the probability predicted by the model that the label of the l-th sample belongs to the j-th class.
As a specific technical scheme, the likelihood probability p is calculated as:
p = g(sa, sb) = exp(-||sa - sb||_1), with 0 < p ≤ 1,
or p = cosine(sa, sb).
A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus containing two texts doc1_a and doc1_b, the text doc1_a and the text doc1_b each containing Y sentences;
a word segmentation module, for dividing the k-th sentence of the text doc1_a and of the text doc1_b each into a word sequence of X words, with k ranging from 1 to Y;
a word vector training module, for vectorizing the word sequences obtained by segmenting the k-th sentence of the text doc1_a and of the text doc1_b;
a first Siamese biLSTM network module, including a biLSTMa1 network module and a biLSTMb1 network module, respectively used to input the word vector of each word of the k-th sentence of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the k-th sentence of the text doc1_a and of the text doc1_b, with i ranging from 1 to X;
a first attention module, which provides a regularized weight ai and a regularized weight bi for the encoded information Hai and Hbi respectively, where ∑ai = 1 and ∑bi = 1, and which computes the attention vector sak of the k-th sentence of the text doc1_a by the formula sak = ∑ai*Hai and the attention vector sbk of the k-th sentence of the text doc1_b by the formula sbk = ∑bi*Hbi;
a second Siamese biLSTM network module, including a biLSTMa2 network module and a biLSTMb2 network module, respectively used to input the attention vector sak and the attention vector sbk, perform sentence-level encoding on the attention vector sak and the attention vector sbk, and output the corresponding encoded information HAk and HBk;
a second attention module, which provides a regularized weight Ak and a regularized weight Bk for the encoded information HAk and HBk respectively, where ∑Ak = 1 and ∑Bk = 1, and which computes the attention vector Va of the text doc1_a by the formula Va = ∑Ak*HAk and the attention vector Vb of the text doc1_b by the formula Vb = ∑Bk*HBk;
a likelihood probability computation module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector Va and the attention vector Vb.
The present invention also provides a memory, characterized in that the memory stores a program of the above text similarity assessment system.
The second object of the present invention is to provide a text similarity assessment method, which performs similarity assessment on two input texts to be predicted based on the text similarity assessment system described above. The second object of the present invention is achieved by the following technical scheme:
A text similarity assessment method, characterized in that: two texts to be predicted are input into the text similarity assessment system described above, which outputs the likelihood probability p of the two texts to be predicted.
The present invention also provides a computer, including a memory and a processor, characterized in that the memory stores a program supporting the processor in executing the above text similarity assessment method, and the processor is configured to execute the program stored in the memory.
The beneficial effects of the present invention are: a Siamese network structure based on an attention mechanism is proposed, which can find the key information that determines the degree of similarity between two texts. Compared with the prior art, the network structure proposed by the present invention focuses more on local information, excludes the interference of global information, and improves the accuracy of text similarity computation in different scenarios. The present invention performs better at computing sentence similarity.
"Description of the Drawings"
Fig. 1 is a schematic diagram of the text similarity assessment system provided by the present invention.
Fig. 2 is the network structure of the data processing procedure of the text similarity assessment system provided by the present invention.
Fig. 3 is the network structure of the data processing procedure of the text similarity assessment system after a sentence-level Siamese biLSTM network module and a sentence-level attention module are added on the basis of Fig. 2.
"Specific Embodiments"
Embodiment One
As shown in Fig. 1, the text similarity assessment system provided in this embodiment includes: a corpus acquisition module, a word segmentation module, a word vector training module, a Siamese biLSTM network (Bi-directional Long Short Term Memory, abbreviated as biLSTM) module, an attention module, a likelihood probability computation module and a parameter training module. They are described in detail below:
The corpus acquisition module is used to input a corpus containing two texts doc1_a and doc1_b. In this embodiment, the corpus takes the form: doc1_a, doc1_b, sim; where doc1_a and doc1_b are two similar or dissimilar texts, sim is the label, sim = 1 indicates similar, and sim = 0 indicates dissimilar.
The word segmentation module is used to divide the text doc1_a and the text doc1_b into word sequences. As shown in Fig. 2, the text doc1_a is divided into the word sequence of the four words Wa1, Wa2, Wa3, Wa4, and the text doc1_b is divided into the word sequence of the four words Wb1, Wb2, Wb3, Wb4. The jieba segmentation tool is a commonly used word segmentation tool; its effect is as follows:
"The students and teachers of the Institute of Computing Technology, Chinese Academy of Sciences" ---> [Chinese Academy of Sciences, Institute of Computing, of, students, and, teachers].
The word vector training module is used to vectorize the word sequences obtained by segmenting the text doc1_a and the text doc1_b.
Word2vec (i.e. word to vector, also called word embeddings; Chinese name "word vector") is a word vector generation tool developed and open-sourced by Google. It is a method that uses a neural network model to mine the latent semantic associations between words; its core idea is to predict the current word from the words appearing in its context, mining the latent features of a word through its co-occurring words, and its output represents each word as a low-dimensional dense vector. As shown in Fig. 2, the text doc1_a is segmented into the four words Wa1, Wa2, Wa3, Wa4, whose word vectors are denoted Va1, Va2, Va3, Va4; the text doc1_b is segmented into the four words Wb1, Wb2, Wb3, Wb4, whose word vectors are denoted Vb1, Vb2, Vb3, Vb4.
In practice the dimension of the word vectors can be set according to the specific situation; it is conventionally set to 300 dimensions, i.e. a 300*1 vector. For example:
"Chinese Academy of Sciences" -> [0.03, 0.3, 0.423, 0.43, 0.7623, 1.32, 2.34, 0.1323, ...], a 300*1 vector in which the value of each dimension is obtained by training on the corpus.
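A minimal sketch of this step using the gensim implementation of word2vec is given below; gensim is an assumed toolkit (the patent only names word2vec), and the tiny corpus is a placeholder.

```python
# Sketch: train 300-dimensional word vectors on a segmented corpus with gensim's word2vec.
# gensim is an assumed toolkit; the corpus below is a placeholder.
from gensim.models import Word2Vec

segmented_corpus = [
    ["中科院", "计算所", "的", "学生", "和", "老师"],
    # ... one list of words per segmented sentence in the training corpus
]
model = Word2Vec(sentences=segmented_corpus, vector_size=300, window=5, min_count=1)
vec = model.wv["中科院"]    # a 300-dimensional dense vector for this word
print(vec.shape)            # (300,)
```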
It is known that the LSTM network (Long Short Term Memory network, abbreviated as LSTM) is an improved model of the recurrent neural network (Recurrent Neural Network, abbreviated as RNN): a forget gate determines which information needs to be filtered out, an input gate determines the current input information and the current state, and an output gate determines the output. Through this gating mechanism the network learns the contextual information of the text. BiLSTM is a bidirectional structure; it assumes that useful information can be captured from both the forward order and the reverse order of the text, and during training the information from the two directions is usually spliced together and passed to the next layer of computation.
As shown in Fig. 1, the Siamese biLSTM network module includes a biLSTMa network module and a biLSTMb network module, which respectively take as input the word vectors of each word of the text doc1_a and of the text doc1_b (Va1, Va2, Va3, Va4 and Vb1, Vb2, Vb3, Vb4 described above), perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the text doc1_a and of the text doc1_b, where i indexes the segmented words of the text doc1_a and the text doc1_b; in this embodiment i takes 1 to 4, e.g. Ha1 to Ha4 and Hb1 to Hb4 in Fig. 2. Hai and Hbi are obtained by splicing the hidden-layer state vectors of the biLSTM network module, for example Hai = [ha+i, ha-i], where ha+i and ha-i are the hidden-layer state vectors generated by the two different hidden layers (forward and backward) of biLSTMa.
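A minimal PyTorch sketch of this word-level encoding is given below; PyTorch and the dimensions are assumptions for illustration. The output for each word is the concatenation of the forward and backward hidden states, as described above.

```python
# Sketch: word-level encoding with a bidirectional LSTM (PyTorch and dimensions are assumed).
# Each Hai is the concatenation of the forward and backward hidden states for word i.
import torch
import torch.nn as nn

emb_dim, hidden_dim, num_words = 300, 128, 4           # assumed dimensions
bilstm_a = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

Va = torch.randn(1, num_words, emb_dim)                # word vectors Va1..Va4 (placeholder values)
Ha, _ = bilstm_a(Va)                                   # Ha1..Ha4, shape (1, 4, 2*hidden_dim)
print(Ha.shape)                                        # torch.Size([1, 4, 256])
```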
The attention module includes a Tanh function processing unit, a softmax function processing unit and a weighted summation unit, as shown in Fig. 1, described as follows:
The Tanh function processing unit computes the encoded information Hai by the formula uai = Tanh(W*Hai + b) and outputs a weight uai, and computes the encoded information Hbi by the formula ubi = Tanh(W*Hbi + b) and outputs a weight ubi;
The softmax function processing unit computes the regularized weight ai of the current word of the text doc1_a by the formula ai = softmax(uai*uw), and computes the regularized weight bi of the current word of the text doc1_b by the formula bi = softmax(ubi*uw), where ∑ai = 1 and ∑bi = 1;
The weighted summation unit computes the attention vector sa of the text doc1_a by the formula sa = ∑ai*Hai, and computes the attention vector sb of the text doc1_b by the formula sb = ∑bi*Hbi.
The likelihood probability computation module computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector sa and the attention vector sb. Specifically, the likelihood probability p can be calculated by the Manhattan-distance method or by the cosine method; the corresponding calculation formulas are respectively:
p = g(sa, sb) = exp(-||sa - sb||_1), with 0 < p ≤ 1,
or p = cosine(sa, sb).
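A minimal sketch of the two alternatives, continuing the PyTorch assumption above:

```python
# Sketch of the likelihood probability p computed from the two attention vectors sa and sb.
import torch
import torch.nn.functional as F

def manhattan_likelihood(sa, sb):
    # p = exp(-||sa - sb||_1), which lies in (0, 1]
    return torch.exp(-torch.sum(torch.abs(sa - sb), dim=-1))

def cosine_likelihood(sa, sb):
    # p = cosine(sa, sb)
    return F.cosine_similarity(sa, sb, dim=-1)

sa, sb = torch.randn(1, 256), torch.randn(1, 256)   # placeholder attention vectors
print(manhattan_likelihood(sa, sb), cosine_likelihood(sa, sb))
```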
The parameters W, uw and b in the above formulas are either set or obtained by training. The parameter training module in this embodiment sets a loss function and optimizes continuously until the loss function converges, thereby determining the parameters W, uw and b. In this embodiment, the loss function logloss is as follows:
logloss = -(1/N) ∑_{l=1}^{N} ∑_{j=1}^{M} y_{l,j}·log(p_{l,j})
In this formula, N is the total number of test samples, M is the total number of classes, y_{l,j} is a binary variable whose value 0 or 1 indicates whether the l-th sample belongs to the j-th class, and p_{l,j} is the probability predicted by the model that the label of the l-th sample belongs to the j-th class. The optimization method is gradient descent; the input data is fed in and training proceeds until convergence. The loss function can also be chosen as mean square error (mse) or another function according to the actual situation. Training a model with a loss function and determining suitable parameters belongs to the prior art of neural network training and is not repeated here.
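A minimal sketch of such a training step is given below; for the two-class similar/dissimilar case the logloss above reduces to binary cross-entropy. The stand-in parameters, data and learning rate are placeholders for illustration, not the patent's actual configuration.

```python
# Sketch of the parameter training module: logloss (binary cross-entropy for the
# similar/dissimilar case) minimized by gradient descent until convergence.
# The stand-in parameters and data below are placeholders for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

sa = torch.randn(8, 8)                     # placeholder attention vectors for 8 text pairs
sb = torch.randn(8, 8)
sim = torch.randint(0, 2, (8,)).float()    # 0/1 labels from the corpus

proj = nn.Linear(8, 8)                     # stand-in for the trainable parameters (e.g. W, uw, b)
optimizer = torch.optim.SGD(proj.parameters(), lr=0.01)

for step in range(200):                    # iterate until the loss converges
    p = torch.exp(-torch.sum(torch.abs(proj(sa) - proj(sb)), dim=-1))  # p = exp(-||.||_1)
    loss = F.binary_cross_entropy(p, sim)  # logloss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```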
Embodiment Two
The specific length of the text that the present invention can process is not restricted. If the texts are sentences, the word-level (word level) Siamese biLSTM network and attention module structure provided in Embodiment One already yields attention vectors that represent the sentences. If the texts are long texts such as paragraphs or articles, then after the multiple sentence vectors are obtained (i.e. multiple sa and sb obtained as in Embodiment One), one further layer consisting of a sentence-level (sentence level) Siamese biLSTM network module and a sentence-level attention module is added, finally yielding attention vectors that can represent the long texts; the other links remain unchanged.
This Embodiment Two provides a text similarity assessment system that can process two long texts each containing multiple sentences. The assessment system includes a corpus acquisition module, a word segmentation module, a word vector training module, a first Siamese biLSTM network module, a first attention module, a second Siamese biLSTM network module, a second attention module and a likelihood probability computation module, described in detail as follows:
The corpus acquisition module is used to input a corpus containing two texts doc1_a and doc1_b, the text doc1_a and the text doc1_b each containing Y sentences (e.g. 4 sentences each);
The word segmentation module is used to divide the k-th sentence of the text doc1_a and of the text doc1_b (Fig. 3 only shows one sentence of the text doc1_a and of the text doc1_b as an example) each into a word sequence of X words (e.g. 4 words, referring to Wa1, Wa2, Wa3, Wa4 and Wb1, Wb2, Wb3, Wb4 in Fig. 3), with k ranging from 1 to Y;
The word vector training module is used to vectorize the word sequences obtained by segmenting the k-th sentence of the text doc1_a and of the text doc1_b; for example, the k-th sentence of the text doc1_a is segmented into the four words Wa1, Wa2, Wa3, Wa4, whose word vectors are denoted Va1, Va2, Va3, Va4, and the k-th sentence of the text doc1_b is segmented into the four words Wb1, Wb2, Wb3, Wb4, whose word vectors are denoted Vb1, Vb2, Vb3, Vb4.
The first Siamese biLSTM network module includes a biLSTMa1 network module and a biLSTMb1 network module, which respectively take as input the word vector of each word of the k-th sentence of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the k-th sentence of the text doc1_a and of the text doc1_b, with i ranging from 1 to X; see Ha1 to Ha4 and Hb1 to Hb4 in Fig. 3.
The first attention module provides a regularized weight ai and a regularized weight bi for the encoded information Hai and Hbi respectively, where ∑ai = 1 and ∑bi = 1, computes the attention vector sak of the k-th sentence of the text doc1_a (e.g. sa3 in Fig. 3) by the formula sak = ∑ai*Hai, and computes the attention vector sbk of the k-th sentence of the text doc1_b (e.g. sb3 in Fig. 3) by the formula sbk = ∑bi*Hbi.
The second Siamese biLSTM network module includes a biLSTMa2 network module and a biLSTMb2 network module, which respectively take as input the attention vector sak and the attention vector sbk, perform sentence-level encoding on the attention vector sak and the attention vector sbk, and output the corresponding encoded information HAk and HBk; see HA1 to HA4 and HB1 to HB4 in Fig. 3.
The second attention module provides a regularized weight Ak and a regularized weight Bk for the encoded information HAk and HBk respectively (see A1-A4 and B1-B4 in Fig. 3), where ∑Ak = 1 and ∑Bk = 1, computes the attention vector Va of the text doc1_a by the formula Va = ∑Ak*HAk, and computes the attention vector Vb of the text doc1_b by the formula Vb = ∑Bk*HBk.
The likelihood probability computation module computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector Va and the attention vector Vb.
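A minimal sketch of one branch of this hierarchical structure (word-level biLSTM and attention per sentence, then sentence-level biLSTM and attention over the resulting sentence vectors) is given below; PyTorch and all dimensions are assumptions for illustration, and the attention class is the same mechanism as in the earlier word-level sketch.

```python
# Sketch of one branch of the hierarchical encoder of Embodiment Two:
# word-level biLSTM + attention per sentence, then sentence-level biLSTM + attention.
# PyTorch and the dimensions are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """u = tanh(W*H + b), weights = softmax(u * uw), output = weighted sum of H."""
    def __init__(self, enc_dim, att_dim=128):
        super().__init__()
        self.W = nn.Linear(enc_dim, att_dim)
        self.uw = nn.Parameter(torch.randn(att_dim))

    def forward(self, H):                                    # H: (batch, steps, enc_dim)
        a = F.softmax(torch.tanh(self.W(H)) @ self.uw, dim=1)
        return (a.unsqueeze(-1) * H).sum(dim=1)              # (batch, enc_dim)

class HierarchicalEncoder(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=128):
        super().__init__()
        self.word_bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.word_attn = Attention(2 * hidden_dim)           # first attention module (word level)
        self.sent_bilstm = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.sent_attn = Attention(2 * hidden_dim)           # second attention module (sentence level)

    def forward(self, doc):                                  # doc: (Y sentences, X words, emb_dim)
        H, _ = self.word_bilstm(doc)                         # word-level encodings Hai for each sentence
        s = self.word_attn(H)                                # sentence vectors sa1..saY, (Y, 2*hidden_dim)
        HA, _ = self.sent_bilstm(s.unsqueeze(0))             # sentence-level encodings HAk
        return self.sent_attn(HA)                            # document attention vector Va, (1, 2*hidden_dim)

encoder = HierarchicalEncoder()
doc1_a = torch.randn(4, 4, 300)                              # Y=4 sentences of X=4 words (placeholder vectors)
Va = encoder(doc1_a)
print(Va.shape)                                              # torch.Size([1, 256])
```

The doc1_b branch would use the same shared modules to obtain Vb, and the likelihood probability p would then be computed from Va and Vb as in Embodiment One.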
The functions of the first Siamese biLSTM network module and the second Siamese biLSTM network module in Embodiment Two are the same as the function of the Siamese biLSTM network module described in Embodiment One; in this embodiment the qualifiers "first" and "second" are only used to distinguish the word-level processing stage from the sentence-level processing stage. Likewise, the functions of the first attention module and the second attention module in Embodiment Two are the same as the function of the attention module described in Embodiment One, and the qualifiers "first" and "second" again only distinguish the word-level processing stage from the sentence-level processing stage. Furthermore, the likelihood probability computation module in Embodiment Two has the same function as the likelihood probability computation module described in Embodiment One. In addition, the text similarity assessment system provided by this Embodiment Two also includes a parameter training module with the same function as the parameter training module described in Embodiment One.
This embodiment also provides a memory, which stores a program of the above text similarity assessment system.
This embodiment also provides a text similarity assessment method: the trained text similarity assessment system can process texts to be predicted; the two texts to be judged are input into the model, and the model outputs the likelihood probability p of the two texts, which serves as the similarity of the two texts.
This embodiment also provides a computer, including a memory and a processor, the memory storing a program that supports the processor in executing the above text similarity assessment method, and the processor being configured to execute the program stored in the memory.
The above embodiments are only a sufficient disclosure and are not intended to limit the present invention; any replacement with equivalent technical features that can be obtained without creative work on the basis of the inventive idea of the present invention shall be considered within the scope disclosed by this application.
Claims (10)
1. A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus containing two texts doc1_a and doc1_b;
a word segmentation module, for dividing the text doc1_a and the text doc1_b each into a word sequence of X words;
a word vector training module, for vectorizing the word sequences obtained by segmenting the text doc1_a and the text doc1_b;
a Siamese biLSTM network module, including a biLSTMa network module and a biLSTMb network module, respectively used to input the word vector of each word of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the text doc1_a and the text doc1_b, with i ranging from 1 to X;
an attention module, which provides a regularized weight ai and a regularized weight bi for the encoded information Hai and Hbi respectively, where ∑ai = 1 and ∑bi = 1, and which computes the attention vector sa of the text doc1_a by the formula sa = ∑ai*Hai and the attention vector sb of the text doc1_b by the formula sb = ∑bi*Hbi;
a likelihood probability computation module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector sa and the attention vector sb.
2. The text similarity assessment system according to claim 1, characterized in that the corpus takes the form: doc1_a, doc1_b, sim; where sim is a label, sim = 1 indicates that the text doc1_a and the text doc1_b are similar, and sim = 0 indicates that the text doc1_a and the text doc1_b are dissimilar.
3. The text similarity assessment system according to claim 1, characterized in that the attention module includes:
a Tanh function processing unit, which computes the encoded information Hai by the formula uai = Tanh(W*Hai + b) and outputs a weight uai, and computes the encoded information Hbi by the formula ubi = Tanh(W*Hbi + b) and outputs a weight ubi; a softmax function processing unit, which computes the regularized weight ai of the current word of the text doc1_a by the formula ai = softmax(uai*uw), and computes the regularized weight bi of the current word of the text doc1_b by the formula bi = softmax(ubi*uw); and a weighted summation unit, for completing the calculation of the formula sa = ∑ai*Hai and the calculation of the formula sb = ∑bi*Hbi; where W, uw and b are either set parameters or parameters obtained by training.
4. The text similarity assessment system according to claim 3, characterized in that the parameters W, uw and b are parameters obtained by training; the text similarity assessment system further includes a parameter training module, which sets a loss function and optimizes continuously until the loss function converges, thereby determining the parameters W, uw and b.
5. The text similarity assessment system according to claim 1, characterized in that the loss function uses the following logloss function or uses a mean square error function, the logloss function being:
logloss = -(1/N) ∑_{l=1}^{N} ∑_{j=1}^{M} y_{l,j}·log(p_{l,j})
where N is the total number of test samples, M is the total number of classes, y_{l,j} is a binary variable whose value 0 or 1 indicates whether the l-th sample belongs to the j-th class, and p_{l,j} is the probability predicted by the model that the label of the l-th sample belongs to the j-th class.
6. The text similarity assessment system according to claim 1, characterized in that the likelihood probability p is calculated as:
p = g(sa, sb) = exp(-||sa - sb||_1), with 0 < p ≤ 1,
or p = cosine(sa, sb).
7. A text similarity assessment system, characterized by comprising:
a corpus acquisition module, for inputting a corpus containing two texts doc1_a and doc1_b, the text doc1_a and the text doc1_b each containing Y sentences;
a word segmentation module, for dividing the k-th sentence of the text doc1_a and of the text doc1_b each into a word sequence of X words, with k ranging from 1 to Y;
a word vector training module, for vectorizing the word sequences obtained by segmenting the k-th sentence of the text doc1_a and of the text doc1_b;
a first Siamese biLSTM network module, including a biLSTMa1 network module and a biLSTMb1 network module, respectively used to input the word vector of each word of the k-th sentence of the text doc1_a and of the text doc1_b, perform word-level encoding on the word vector of each word, and output the encoded information Hai and Hbi of each word of the k-th sentence of the text doc1_a and of the text doc1_b, with i ranging from 1 to X;
a first attention module, which provides a regularized weight ai and a regularized weight bi for the encoded information Hai and Hbi respectively, where ∑ai = 1 and ∑bi = 1, and which computes the attention vector sak of the k-th sentence of the text doc1_a by the formula sak = ∑ai*Hai and the attention vector sbk of the k-th sentence of the text doc1_b by the formula sbk = ∑bi*Hbi;
a second Siamese biLSTM network module, including a biLSTMa2 network module and a biLSTMb2 network module, respectively used to input the attention vector sak and the attention vector sbk, perform sentence-level encoding on the attention vector sak and the attention vector sbk, and output the corresponding encoded information HAk and HBk;
a second attention module, which provides a regularized weight Ak and a regularized weight Bk for the encoded information HAk and HBk respectively, where ∑Ak = 1 and ∑Bk = 1, and which computes the attention vector Va of the text doc1_a by the formula Va = ∑Ak*HAk and the attention vector Vb of the text doc1_b by the formula Vb = ∑Bk*HBk;
a likelihood probability computation module, which computes the likelihood probability p of the text doc1_a and the text doc1_b from the attention vector Va and the attention vector Vb.
8. A memory, characterized in that the memory stores a program of the text similarity assessment system according to any one of claims 1 to 7.
9. A text similarity assessment method, characterized in that: two texts to be predicted are input into the text similarity assessment system according to any one of claims 1 to 7, which outputs the likelihood probability p of the two texts to be predicted.
10. A computer, including a memory and a processor, characterized in that the memory stores a program supporting the processor in executing the text similarity assessment method according to claim 9, and the processor is configured to execute the program stored in the memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811210881.4A CN109543009B (en) | 2018-10-17 | 2018-10-17 | Text similarity assessment system and text similarity assessment method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811210881.4A CN109543009B (en) | 2018-10-17 | 2018-10-17 | Text similarity assessment system and text similarity assessment method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109543009A true CN109543009A (en) | 2019-03-29 |
CN109543009B CN109543009B (en) | 2019-10-25 |
Family
ID=65843947
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811210881.4A Active CN109543009B (en) | 2018-10-17 | 2018-10-17 | Text similarity assessment system and text similarity appraisal procedure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109543009B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110046240A (en) * | 2019-04-16 | 2019-07-23 | 浙江爱闻格环保科技有限公司 | In conjunction with the target domain question and answer method for pushing of keyword retrieval and twin neural network |
CN110211594A (en) * | 2019-06-06 | 2019-09-06 | 杭州电子科技大学 | A kind of method for distinguishing speek person based on twin network model and KNN algorithm |
CN110362681A (en) * | 2019-06-19 | 2019-10-22 | 平安科技(深圳)有限公司 | The recognition methods of question answering system replication problem, device and storage medium |
CN110413988A (en) * | 2019-06-17 | 2019-11-05 | 平安科技(深圳)有限公司 | Method, apparatus, server and the storage medium of text information matching measurement |
CN110610003A (en) * | 2019-08-15 | 2019-12-24 | 阿里巴巴集团控股有限公司 | Method and system for assisting text annotation |
CN110717332A (en) * | 2019-07-26 | 2020-01-21 | 昆明理工大学 | News and case similarity calculation method based on asymmetric twin network |
CN110738059A (en) * | 2019-10-21 | 2020-01-31 | 支付宝(杭州)信息技术有限公司 | text similarity calculation method and system |
CN110941951A (en) * | 2019-10-15 | 2020-03-31 | 平安科技(深圳)有限公司 | Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment |
CN111198939A (en) * | 2019-12-27 | 2020-05-26 | 北京健康之家科技有限公司 | Statement similarity analysis method and device and computer equipment |
CN111209395A (en) * | 2019-12-27 | 2020-05-29 | 铜陵中科汇联科技有限公司 | Short text similarity calculation system and training method thereof |
CN111627566A (en) * | 2020-05-22 | 2020-09-04 | 泰康保险集团股份有限公司 | Indication information processing method and device, storage medium and electronic equipment |
CN111737954A (en) * | 2020-06-12 | 2020-10-02 | 百度在线网络技术(北京)有限公司 | Text similarity determination method, device, equipment and medium |
CN111783419A (en) * | 2020-06-12 | 2020-10-16 | 上海东普信息科技有限公司 | Address similarity calculation method, device, equipment and storage medium |
CN111785287A (en) * | 2020-07-06 | 2020-10-16 | 北京世纪好未来教育科技有限公司 | Speaker recognition method, speaker recognition device, electronic equipment and storage medium |
CN111859988A (en) * | 2020-07-28 | 2020-10-30 | 阳光保险集团股份有限公司 | Semantic similarity evaluation method and device and computer-readable storage medium |
CN112784587A (en) * | 2021-01-07 | 2021-05-11 | 国网福建省电力有限公司泉州供电公司 | Text similarity measurement method and device based on multi-model fusion |
CN112800196A (en) * | 2021-01-18 | 2021-05-14 | 北京明略软件系统有限公司 | FAQ question-answer library matching method and system based on twin network |
CN113743077A (en) * | 2020-08-14 | 2021-12-03 | 北京京东振世信息技术有限公司 | Method and device for determining text similarity |
CN114595687A (en) * | 2021-12-20 | 2022-06-07 | 昆明理工大学 | Laos language text regularization method based on BilSTM |
WO2023065635A1 (en) * | 2021-10-22 | 2023-04-27 | 平安科技(深圳)有限公司 | Named entity recognition method and apparatus, storage medium and terminal device |
CN116776854A (en) * | 2023-08-25 | 2023-09-19 | 湖南汇智兴创科技有限公司 | Online multi-version document content association method, device, equipment and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107168954A (en) * | 2017-05-18 | 2017-09-15 | 北京奇艺世纪科技有限公司 | Text key word generation method and device and electronic equipment and readable storage medium storing program for executing |
CN107291699A (en) * | 2017-07-04 | 2017-10-24 | 湖南星汉数智科技有限公司 | A kind of sentence semantic similarity computational methods |
CN107562812A (en) * | 2017-08-11 | 2018-01-09 | 北京大学 | A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space |
CN108132931A (en) * | 2018-01-12 | 2018-06-08 | 北京神州泰岳软件股份有限公司 | A kind of matched method and device of text semantic |
CN108415977A (en) * | 2018-02-09 | 2018-08-17 | 华南理工大学 | One is read understanding method based on the production machine of deep neural network and intensified learning |
-
2018
- 2018-10-17 CN CN201811210881.4A patent/CN109543009B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107168954A (en) * | 2017-05-18 | 2017-09-15 | 北京奇艺世纪科技有限公司 | Text key word generation method and device and electronic equipment and readable storage medium storing program for executing |
CN107291699A (en) * | 2017-07-04 | 2017-10-24 | 湖南星汉数智科技有限公司 | A kind of sentence semantic similarity computational methods |
CN107562812A (en) * | 2017-08-11 | 2018-01-09 | 北京大学 | A kind of cross-module state similarity-based learning method based on the modeling of modality-specific semantic space |
CN108132931A (en) * | 2018-01-12 | 2018-06-08 | 北京神州泰岳软件股份有限公司 | A kind of matched method and device of text semantic |
CN108415977A (en) * | 2018-02-09 | 2018-08-17 | 华南理工大学 | One is read understanding method based on the production machine of deep neural network and intensified learning |
Non-Patent Citations (1)
Title |
---|
ZONGKUI ZHU et al.: "A Semantic Similarity Computing Model based on Siamese Network for Duplicate Questions Identification", Proceedings of the Evaluation Tasks at the China Conference on Knowledge Graph and Semantic Computing *
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110046240A (en) * | 2019-04-16 | 2019-07-23 | 浙江爱闻格环保科技有限公司 | In conjunction with the target domain question and answer method for pushing of keyword retrieval and twin neural network |
CN110046240B (en) * | 2019-04-16 | 2020-12-08 | 浙江爱闻格环保科技有限公司 | Target field question-answer pushing method combining keyword retrieval and twin neural network |
CN110211594A (en) * | 2019-06-06 | 2019-09-06 | 杭州电子科技大学 | A kind of method for distinguishing speek person based on twin network model and KNN algorithm |
CN110211594B (en) * | 2019-06-06 | 2021-05-04 | 杭州电子科技大学 | Speaker identification method based on twin network model and KNN algorithm |
CN110413988A (en) * | 2019-06-17 | 2019-11-05 | 平安科技(深圳)有限公司 | Method, apparatus, server and the storage medium of text information matching measurement |
CN110413988B (en) * | 2019-06-17 | 2023-01-31 | 平安科技(深圳)有限公司 | Text information matching measurement method, device, server and storage medium |
CN110362681B (en) * | 2019-06-19 | 2023-09-22 | 平安科技(深圳)有限公司 | Method, device and storage medium for identifying repeated questions of question-answering system |
CN110362681A (en) * | 2019-06-19 | 2019-10-22 | 平安科技(深圳)有限公司 | The recognition methods of question answering system replication problem, device and storage medium |
WO2020252930A1 (en) * | 2019-06-19 | 2020-12-24 | 平安科技(深圳)有限公司 | Method, apparatus, and device for identifying duplicate questions in question answering system, and storage medium |
CN110717332A (en) * | 2019-07-26 | 2020-01-21 | 昆明理工大学 | News and case similarity calculation method based on asymmetric twin network |
CN110610003A (en) * | 2019-08-15 | 2019-12-24 | 阿里巴巴集团控股有限公司 | Method and system for assisting text annotation |
CN110610003B (en) * | 2019-08-15 | 2023-09-15 | 创新先进技术有限公司 | Method and system for assisting text annotation |
CN110941951A (en) * | 2019-10-15 | 2020-03-31 | 平安科技(深圳)有限公司 | Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment |
WO2021072863A1 (en) * | 2019-10-15 | 2021-04-22 | 平安科技(深圳)有限公司 | Method and apparatus for calculating text similarity, electronic device, and computer-readable storage medium |
CN110738059A (en) * | 2019-10-21 | 2020-01-31 | 支付宝(杭州)信息技术有限公司 | text similarity calculation method and system |
CN111209395A (en) * | 2019-12-27 | 2020-05-29 | 铜陵中科汇联科技有限公司 | Short text similarity calculation system and training method thereof |
CN111198939A (en) * | 2019-12-27 | 2020-05-26 | 北京健康之家科技有限公司 | Statement similarity analysis method and device and computer equipment |
CN111627566A (en) * | 2020-05-22 | 2020-09-04 | 泰康保险集团股份有限公司 | Indication information processing method and device, storage medium and electronic equipment |
CN111783419A (en) * | 2020-06-12 | 2020-10-16 | 上海东普信息科技有限公司 | Address similarity calculation method, device, equipment and storage medium |
CN111783419B (en) * | 2020-06-12 | 2024-02-27 | 上海东普信息科技有限公司 | Address similarity calculation method, device, equipment and storage medium |
CN111737954A (en) * | 2020-06-12 | 2020-10-02 | 百度在线网络技术(北京)有限公司 | Text similarity determination method, device, equipment and medium |
US11676609B2 (en) | 2020-07-06 | 2023-06-13 | Beijing Century Tal Education Technology Co. Ltd. | Speaker recognition method, electronic device, and storage medium |
CN111785287A (en) * | 2020-07-06 | 2020-10-16 | 北京世纪好未来教育科技有限公司 | Speaker recognition method, speaker recognition device, electronic equipment and storage medium |
CN111785287B (en) * | 2020-07-06 | 2022-06-07 | 北京世纪好未来教育科技有限公司 | Speaker recognition method, speaker recognition device, electronic equipment and storage medium |
CN111859988A (en) * | 2020-07-28 | 2020-10-30 | 阳光保险集团股份有限公司 | Semantic similarity evaluation method and device and computer-readable storage medium |
CN113743077B (en) * | 2020-08-14 | 2023-09-29 | 北京京东振世信息技术有限公司 | Method and device for determining text similarity |
CN113743077A (en) * | 2020-08-14 | 2021-12-03 | 北京京东振世信息技术有限公司 | Method and device for determining text similarity |
CN112784587B (en) * | 2021-01-07 | 2023-05-16 | 国网福建省电力有限公司泉州供电公司 | Text similarity measurement method and device based on multi-model fusion |
CN112784587A (en) * | 2021-01-07 | 2021-05-11 | 国网福建省电力有限公司泉州供电公司 | Text similarity measurement method and device based on multi-model fusion |
CN112800196A (en) * | 2021-01-18 | 2021-05-14 | 北京明略软件系统有限公司 | FAQ question-answer library matching method and system based on twin network |
CN112800196B (en) * | 2021-01-18 | 2024-03-01 | 南京明略科技有限公司 | FAQ question-answering library matching method and system based on twin network |
WO2023065635A1 (en) * | 2021-10-22 | 2023-04-27 | 平安科技(深圳)有限公司 | Named entity recognition method and apparatus, storage medium and terminal device |
CN114595687A (en) * | 2021-12-20 | 2022-06-07 | 昆明理工大学 | Laos language text regularization method based on BilSTM |
CN114595687B (en) * | 2021-12-20 | 2024-04-19 | 昆明理工大学 | Laos text regularization method based on BiLSTM |
CN116776854A (en) * | 2023-08-25 | 2023-09-19 | 湖南汇智兴创科技有限公司 | Online multi-version document content association method, device, equipment and medium |
CN116776854B (en) * | 2023-08-25 | 2023-11-03 | 湖南汇智兴创科技有限公司 | Online multi-version document content association method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN109543009B (en) | 2019-10-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543009B (en) | Text similarity assessment system and text similarity assessment method | |
Wang et al. | On the use of bert for automated essay scoring: Joint learning of multi-scale essay representation | |
CN109284506A (en) | A kind of user comment sentiment analysis system and method based on attention convolutional neural networks | |
CN108628823A (en) | In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training | |
CN112883714B (en) | ABSC task syntactic constraint method based on dependency graph convolution and transfer learning | |
CN115357719B (en) | Power audit text classification method and device based on improved BERT model | |
CN110162789A (en) | A kind of vocabulary sign method and device based on the Chinese phonetic alphabet | |
CN116579347A (en) | Comment text emotion analysis method, system, equipment and medium based on dynamic semantic feature fusion | |
CN112818698A (en) | Fine-grained user comment sentiment analysis method based on dual-channel model | |
CN114154570A (en) | Sample screening method and system and neural network model training method | |
CN113723083A (en) | Weighted negative supervision text emotion analysis method based on BERT model | |
Thattinaphanich et al. | Thai named entity recognition using Bi-LSTM-CRF with word and character representation | |
CN117312490A (en) | Characterization model of text attribute graph, pre-trained self-supervision method and node representation updated model framework | |
Perera et al. | Personality Classification of text through Machine learning and Deep learning: A Review (2023) | |
Ma et al. | Multi-teacher knowledge distillation for end-to-end text image machine translation | |
Zheng et al. | Named entity recognition in electric power metering domain based on attention mechanism | |
CN114757183B (en) | Cross-domain emotion classification method based on comparison alignment network | |
CN115510230A (en) | Mongolian emotion analysis method based on multi-dimensional feature fusion and comparative reinforcement learning mechanism | |
CN113779249B (en) | Cross-domain text emotion classification method and device, storage medium and electronic equipment | |
CN114330350A (en) | Named entity identification method and device, electronic equipment and storage medium | |
CN111309849B (en) | Fine-grained value information extraction method based on joint learning model | |
CN114692615B (en) | Small sample intention recognition method for small languages | |
CN117744658A (en) | Ship naming entity identification method based on BERT-BiLSTM-CRF | |
Chavali et al. | A study on named entity recognition with different word embeddings on gmb dataset using deep learning pipelines | |
CN117251545A (en) | Multi-intention natural language understanding method, system, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CP02 | Change in the address of a patent holder |
Address after: 519031 office 1316, No. 1, lianao Road, Hengqin new area, Zhuhai, Guangdong Patentee after: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd. Address before: 519031 room 417, building 20, creative Valley, Hengqin New District, Zhuhai City, Guangdong Province Patentee before: LONGMA ZHIXIN (ZHUHAI HENGQIN) TECHNOLOGY Co.,Ltd. |
|
CP02 | Change in the address of a patent holder | ||
PP01 | Preservation of patent right |
Effective date of registration: 20240718 Granted publication date: 20191025 |
|
PP01 | Preservation of patent right |