CN110287494A - Short-text similarity matching method based on the deep-learning BERT algorithm - Google Patents
Short-text similarity matching method based on the deep-learning BERT algorithm
- Publication number
- CN110287494A CN110287494A CN201910583223.8A CN201910583223A CN110287494A CN 110287494 A CN110287494 A CN 110287494A CN 201910583223 A CN201910583223 A CN 201910583223A CN 110287494 A CN110287494 A CN 110287494A
- Authority
- CN
- China
- Prior art keywords
- short text
- text
- word
- bert
- similarity matching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
- G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F40/289 — Handling natural language data; phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30 — Handling natural language data; semantic analysis
- G06N3/045 — Neural networks; combinations of networks
- G06N3/088 — Neural-network learning methods; non-supervised learning, e.g. competitive learning
Abstract
The invention discloses a short-text similarity matching method based on the deep-learning BERT algorithm, belonging to the field of artificial intelligence. The method is implemented as follows: 1) perform BERT training using a common data set together with short texts to obtain a trained BERT model; 2) apply word segmentation to the short texts to be matched; 3) input the segmented short texts from step 2) into the BERT model obtained in step 1) to obtain the feature vectors of the short texts; 4) obtain the matching short text using the cosine similarity algorithm. The invention performs short-text similarity matching with a pre-trained BERT model and performs better than previous text similarity matching methods.
Description
Technical field
The present invention relates to the field of artificial intelligence, and specifically to a short-text similarity matching method based on the deep-learning BERT algorithm.
Background art
Natural language processing frequently requires measuring the similarity between two short texts. Text is a high-dimensional semantic space; by decomposing it abstractly, we can quantify its similarity from a mathematical standpoint. Given a similarity metric between texts, we can cluster them with partitioning methods such as K-means, density-based methods such as DBSCAN, or model-based probabilistic methods. We can also use inter-text similarity to deduplicate a large-scale corpus, or to find related names of a given entity (fuzzy matching). Many methods exist for measuring the similarity of two strings: using a hash code directly, using classic topic models, or representing texts as word-embedding vectors and measuring the Euclidean distance or Pearson distance between the feature vectors. With the rapid development of artificial intelligence, new algorithms and models keep emerging to realize deep-learning computation better and more efficiently. Short-text similarity matching plays an important role in text analysis and corpus processing, so improving its efficiency under this fast-moving computing environment is of great significance.
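The vector-space baselines mentioned above (Euclidean distance and Pearson distance between feature vectors) can be sketched as follows; this is an illustrative toy example with made-up vectors, not part of the claimed method:

```python
import math

def euclidean_distance(u, v):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pearson_correlation(u, v):
    # Linear correlation between the components of the two vectors.
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]
print(euclidean_distance(u, v))   # sqrt(14)
print(pearson_correlation(u, v))  # ≈ 1.0 (perfectly correlated)
```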
Summary of the invention
The technical task of the invention is to address the above deficiency by providing a short-text similarity matching method based on the deep-learning BERT algorithm that performs similarity matching on short texts with a pre-trained model and achieves a better application effect.
The technical solution adopted by the present invention to solve the technical problem is as follows:
A short-text similarity matching method based on the deep-learning BERT algorithm, characterized in that the method is implemented as follows:
1) perform BERT training using a common data set together with short texts to obtain a trained BERT model;
2) apply word segmentation to the short texts to be matched;
3) input the segmented short texts from step 2) into the BERT model obtained in step 1) to obtain the feature vectors of the short texts;
4) obtain the matching short text using the cosine similarity algorithm.
BERT is a pre-trained language-representation method: a general-purpose "language understanding" model is trained on a large text corpus (such as Wikipedia), and that model is then used to perform the desired NLP task. BERT performs better than earlier methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP. We therefore use a pre-trained BERT model for short-text similarity matching, which performs better.
BERT randomly masks words of the input sequence, and the objective is to predict the masked words from their context (unlike a traditional left-to-right language model, the masked model can use both the left and right context of the masked word simultaneously). Using the left and right context of the masked word simultaneously is realized by the bidirectional Transformer encoder, which removes the limitation that a language model is unidirectional. In addition, the BERT model introduces a "next-sentence prediction" task to jointly learn the pre-trained representations.
Specifically, the BERT training with a common data set and short texts includes training on the relationship between words within a sentence and training on the relationship between sentences.
The training method for the relationship between words within a sentence is:
Randomly mask part of the words as training samples; of these, 80% are replaced with the [MASK] token, 10% are replaced with a random word, and 10% are kept unchanged.
Preferably, the masked part of the words is 15%, i.e. 15% of the words are randomly masked as training samples; the model then needs more training steps to converge.
Further, the training method for the relationship between sentences is:
Pre-train a binary classification model with a 1:1 ratio of positive to negative samples; a positive sample is a given sentence pair A and B in which B is the actual next sentence of A, and a negative sample uses a sentence randomly selected from the corpus as B.
Preferably, the word segmentation of the short texts to be matched includes removing stop words and removing links.
Specifically, a segmentation tool segments the short texts using a statistics-based segmentation method: while performing dictionary-based string-matching segmentation of the short text, a hidden Markov model is used to recognize new words, and the short text is split accordingly.
Specifically, the segmented short text is input into the trained BERT model. By looking up the word-vector table, the BERT model converts each word of the sentence into a one-dimensional vector as the model input; the model output is, for each input word, a vector representation fused with the full-text semantic information.
Further, the model input also includes a text vector and a position vector:
Text vector: the values of this vector are learned automatically during model training; it captures the global semantic information of the text and is blended with the semantic information of the individual characters/words.
Position vector: since the semantic information carried by characters/words differs with their position in the text, the BERT model adds a different vector to the characters/words at different positions to distinguish them.
The BERT model takes the sum of the word vector, text vector, and position vector as the model input, so the text vectors output by the model, converted from the character/word vectors, contain more accurate semantic information.
Preferably, the matching short text is obtained with the cosine similarity algorithm: the cosine similarity between the target short text and each of the other short texts is computed, and the text with the highest similarity is taken as the similarity match for the target text.
Compared with the prior art, the short-text similarity matching method based on the deep-learning BERT algorithm of the invention has the following advantages:
Earlier approaches measure the similarity of two strings directly with a hash code, with a classic topic model, or by abstracting texts into word-vector representations and measuring the Euclidean or Pearson distance between the feature vectors. BERT performs outstandingly by comparison because it is the first unsupervised, deeply bidirectional system for pre-training NLP. Performing short-text similarity matching with a pre-trained BERT model therefore performs better: it can greatly raise matching accuracy, improve the efficiency of short-text similarity matching, and better realize text analysis and corpus processing in natural language processing.
Brief description of the drawings
Fig. 1 is a flow diagram of the short-text similarity matching method based on the deep-learning BERT algorithm of the invention;
Fig. 2 is an example diagram of the relationship training of the BERT algorithm in the embodiment.
Specific embodiment
The present invention is further described below with reference to specific embodiments.
A short-text similarity matching method based on the deep-learning BERT algorithm, characterized in that the method is implemented as follows:
1) perform BERT training using a common data set together with short texts to obtain a trained BERT model;
2) apply word segmentation to the short texts to be matched;
3) input the segmented short texts from step 2) into the BERT model obtained in step 1) to obtain the feature vectors of the short texts;
4) obtain the matching short text using the cosine similarity algorithm.
BERT is a pre-trained language-representation method: a general-purpose "language understanding" model is trained on a large text corpus (such as Wikipedia), and that model is then used to perform the desired NLP task. BERT performs better than earlier methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP. We therefore use a pre-trained BERT model for short-text similarity matching, which performs better.
BERT randomly masks words of the input sequence, and the objective is to predict the masked words from their context (unlike a traditional left-to-right language model, the masked model can use both the left and right context of the masked word simultaneously). Using the left and right context of the masked word simultaneously is realized by the bidirectional Transformer encoder, which removes the limitation that a language model is unidirectional. In addition, the BERT model introduces a "next-sentence prediction" task to jointly learn the pre-trained representations.
The BERT training with a common data set and short texts includes training on the relationship between words within a sentence and training on the relationship between sentences.
The training method for the relationship between words within a sentence is:
Randomly mask 15% of the words as training samples; of these, 80% are replaced with the [MASK] token, 10% are replaced with a random word, and 10% are kept unchanged.
Take the sentence "Wu Zetian was the first empress of China." as an example, as shown in Fig. 2:
The word "Zetian" is chosen and masked as the training sample.
For the masked word "Zetian", 80% of the time it is replaced with the [MASK] token:
"Wu [MASK] was the first empress of China.";
10% of the time the word is kept unchanged:
"Wu Zetian was the first empress of China.";
10% of the time it is replaced with a random word:
"Wu Zongtian was the first empress of China.".
The model then predicts the word at the position of "Zetian"; in the example of Fig. 2 the predicted probability is 0.999 for "Zetian" and 0.0 for every other candidate word.
Only 80% of the selected positions are replaced with the [MASK] token because, if 100% were replaced with [MASK], the model might encounter words it has never seen during fine-tuning; therefore 10% of the selected words are kept unchanged. A further 10% are replaced with a random word so that the Transformer is forced to keep a distributed representation of every input token; otherwise the Transformer might simply memorize that this [MASK] is "Zetian".
(The rest of the training proceeds in the same way; the model is brought to convergence by repeated training.)
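The 15% / 80-10-10 masking procedure just described can be sketched as follows; the vocabulary and tokenization here are simplified assumptions for illustration, not the embodiment's actual implementation:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    """Randomly pick ~15% of positions as training targets, then replace
    80% of them with [MASK], 10% with a random word, and keep 10% unchanged."""
    rng = rng or random.Random()
    tokens = list(tokens)
    targets = {}  # position -> original word the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # position not selected for prediction
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            tokens[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            tokens[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word unchanged
    return tokens, targets

sentence = ["Wu", "Zetian", "was", "the", "first", "empress", "of", "China"]
masked, targets = mask_tokens(sentence, sentence, rng=random.Random(0))
print(masked, targets)
```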
The training method for the relationship between sentences is:
Pre-train a binary classification model with a 1:1 ratio of positive to negative samples; a positive sample is a given sentence pair A and B in which B is the actual next sentence of A, and a negative sample uses a sentence randomly selected from the corpus as B.
Many important downstream tasks, such as question answering (QA) and natural language inference (NLI), depend on understanding the relationship between two sentences, and this relationship is not captured directly by language modeling. To train a model that understands sentence relationships, a binary next-sentence prediction task is pre-trained whose samples can easily be generated from any monolingual corpus. Specifically, sentences A and B are selected as a pre-training sample: with 50% probability B is the actual next sentence of A, and with 50% probability it is a random sentence from the corpus.
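Generating the 1:1 positive/negative samples for this next-sentence prediction task can be sketched as follows (an illustrative assumption; any monolingual corpus, represented here as a list of sentence lists, would do):

```python
import random

def make_nsp_samples(documents, rng=None):
    """For each adjacent sentence pair (A, B) in a document, emit (A, B, 1)
    with probability 0.5 (B really follows A) and (A, random sentence, 0)
    otherwise, giving a 1:1 positive/negative ratio on average."""
    rng = rng or random.Random()
    all_sentences = [s for doc in documents for s in doc]
    samples = []
    for doc in documents:
        for a, b in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                samples.append((a, b, 1))  # positive: actual next sentence
            else:
                samples.append((a, rng.choice(all_sentences), 0))  # negative: random sentence
    return samples

docs = [["s1", "s2", "s3"], ["t1", "t2"]]
print(make_nsp_samples(docs, rng=random.Random(1)))
```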
The word segmentation of the short texts to be matched includes removing stop words and removing links.
Word segmentation is applied to the short texts to be matched: a segmentation tool segments the short text using a statistics-based segmentation method; while performing dictionary-based string-matching segmentation of the short text, a hidden Markov model is used to recognize new words, and the short text is split accordingly.
For example, the text library consists of paired texts (in the format "id", "text-a", "text-b"), where text-a is an approximate short text such as "What is private health insurance?" or "Private health insurance is what?", and text-b is the corresponding answer text; in this example, text-b is the text that tells the user what private health insurance is. The "jieba" tokenizer is used to segment every sample in the text library. During segmentation, stop words without specific meaning, such as punctuation marks, must first be removed to guarantee segmentation quality and speed; segmentation is then performed against the user dictionary to ensure that proper nouns can be extracted. In the example above, "private health insurance" should be treated as one proper noun rather than being split into the three words "private", "health", "insurance". Thus "What is private health insurance?" is converted to "What" "is" "private health insurance".
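The dictionary string-matching part of this segmentation step can be illustrated with a simplified forward maximum-matching segmenter; this is a sketch under the assumption of a tiny hand-made user dictionary, whereas the embodiment itself uses the jieba tool together with a hidden Markov model for new words:

```python
def forward_max_match(text, user_dict, max_len=6):
    """Greedy left-to-right segmentation: at each position take the longest
    word found in the user dictionary, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in user_dict or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Hypothetical user dictionary that keeps the proper noun "私人健康险"
# ("private health insurance") whole instead of splitting it.
user_dict = {"私人", "健康险", "私人健康险", "是", "什么"}
print(forward_max_match("什么是私人健康险", user_dict))  # ['什么', '是', '私人健康险']
```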
The segmented short text is input into the trained BERT model to obtain the feature vector of the short text:
By looking up the word-vector table, the BERT model converts each word of the sentence into a one-dimensional vector as the model input; the model output is, for each input word, a vector representation fused with the full-text semantic information.
In addition to the word vectors, the model input includes two other parts: 1) the text vector, whose values are learned automatically during model training, captures the global semantic information of the text and is blended with the semantic information of the individual characters/words; 2) the position vector: since the semantic information carried by a character/word differs with its position in the text (e.g. "I love you" versus "you love me"), the BERT model adds a different vector to the characters/words at different positions to distinguish them.
The BERT model takes the sum of the word vector, text vector, and position vector as the model input, so the text vectors output by the model, converted from the character/word vectors, contain more accurate semantic information.
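The element-wise sum of word, text, and position vectors described above can be sketched with toy two-dimensional embeddings (made-up numbers; real BERT embedding tables are learned and much larger):

```python
def bert_input_embeddings(token_ids, word_table, text_vector, position_table):
    """Model input = word vector + text vector + position vector,
    summed element-wise at every token position."""
    embeddings = []
    for pos, tid in enumerate(token_ids):
        w = word_table[tid]      # looked up from the word-vector table
        p = position_table[pos]  # distinguishes "I love you" from "you love me"
        embeddings.append([a + b + c for a, b, c in zip(w, text_vector, p)])
    return embeddings

word_table = {0: [0.1, 0.2], 1: [0.3, 0.4]}   # per-word vectors
text_vector = [0.01, 0.01]                    # global text vector (learned in training)
position_table = [[0.0, 0.0], [1.0, 1.0]]     # one vector per position
print(bert_input_embeddings([0, 1], word_table, text_vector, position_table))
```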
The matching short text is obtained with the cosine similarity algorithm: the cosine similarity between the target short text and each of the other short texts is computed, and the text with the highest similarity is taken as the similarity match for the target text.
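This final matching step can be sketched as follows; the feature vectors here are toy stand-ins for the BERT output vectors:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = u·v / (|u| |v|)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def best_match(target_vec, candidates):
    """Return the candidate text whose feature vector has the highest
    cosine similarity with the target vector."""
    return max(candidates, key=lambda item: cosine_similarity(target_vec, item[1]))[0]

target = [1.0, 0.0, 1.0]
candidates = [
    ("text-1", [1.0, 0.1, 0.9]),  # nearly parallel to the target
    ("text-2", [0.0, 1.0, 0.0]),  # orthogonal to the target
]
print(best_match(target, candidates))  # -> text-1
```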
Those skilled in the art can readily implement the present invention from the above specific embodiments. However, it should be understood that the present invention is not limited to the above specific embodiments. On the basis of the disclosed embodiments, those skilled in the art can combine different technical features arbitrarily to realize different technical solutions.
Except for the technical features described in the specification, everything is known to those skilled in the art.
Claims (10)
1. A short-text similarity matching method based on the deep-learning BERT algorithm, characterized in that the method is implemented as follows:
1) perform BERT training using a common data set together with short texts to obtain a trained BERT model;
2) apply word segmentation to the short texts to be matched;
3) input the segmented short texts from step 2) into the BERT model obtained in step 1) to obtain the feature vectors of the short texts;
4) obtain the matching short text using the cosine similarity algorithm.
2. The method according to claim 1, characterized in that performing BERT training with a common data set and short texts includes training on the relationship between words within a sentence and training on the relationship between sentences.
3. The method according to claim 2, characterized in that the training method for the relationship between words within a sentence is: randomly mask part of the words as training samples, of which 80% are replaced with the [MASK] token, 10% are replaced with a random word, and 10% are kept unchanged.
4. The method according to claim 3, characterized in that the masked part of the words is 15%.
5. The method according to claim 2, 3 or 4, characterized in that the training method for the relationship between sentences is: pre-train a binary classification model with a 1:1 ratio of positive to negative samples; a positive sample is a given sentence pair A and B in which B is the actual next sentence of A, and a negative sample uses a sentence randomly selected from the corpus as B.
6. The method according to claim 1, characterized in that the word segmentation of the short texts to be matched includes removing stop words and removing links.
7. The method according to claim 6, characterized in that a segmentation tool segments the short text using a statistics-based segmentation method: while performing dictionary-based string-matching segmentation of the short text, a hidden Markov model is used to recognize new words, and the short text is split accordingly.
8. The method according to claim 1, characterized in that the segmented short text is input into the trained BERT model; by looking up the word-vector table, the BERT model converts each word of the sentence into a one-dimensional vector as the model input, and the model output is, for each input word, a vector representation fused with the full-text semantic information.
9. The method according to claim 8, characterized in that the model input also includes a text vector and a position vector: the text vector is learned automatically during model training, captures the global semantic information of the text, and is blended with the semantic information of the individual characters/words; the position vector distinguishes the characters/words at different positions, since the semantic information they carry differs with position, by adding a different vector to each position.
10. The method according to claim 1, characterized in that the matching short text is obtained with the cosine similarity algorithm: the cosine similarity between the target short text and each of the other short texts is computed, and the text with the highest similarity is taken as the similarity match for the target text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910583223.8A | 2019-07-01 | 2019-07-01 | Short-text similarity matching method based on the deep-learning BERT algorithm
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910583223.8A | 2019-07-01 | 2019-07-01 | Short-text similarity matching method based on the deep-learning BERT algorithm
Publications (1)
Publication Number | Publication Date
---|---
CN110287494A | 2019-09-27
Family
ID=68021471
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910583223.8A (Pending) | Short-text similarity matching method based on the deep-learning BERT algorithm | 2019-07-01 | 2019-07-01
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110287494A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030004716A1 (en) * | 2001-06-29 | 2003-01-02 | Haigh Karen Z. | Method and apparatus for determining a measure of similarity between natural language sentences |
US20150095017A1 (en) * | 2013-09-27 | 2015-04-02 | Google Inc. | System and method for learning word embeddings using neural language models |
CN109710770A (en) * | 2019-01-31 | 2019-05-03 | 北京牡丹电子集团有限责任公司数字电视技术中心 | A text classification method and device based on transfer learning |
CN109815336A (en) * | 2019-01-28 | 2019-05-28 | 无码科技(杭州)有限公司 | A text aggregation method and system |
- 2019-07-01: Application CN201910583223.8A filed in China; published as CN110287494A; status: Pending
Non-Patent Citations (1)
Title |
---|
Liu Jiming et al.: "A cross-task dialogue system based on small-sample machine learning", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition) * |
Cited By (58)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674772B (en) * | 2019-09-29 | 2022-08-05 | 国家电网有限公司技术学院分公司 | Intelligent safety control auxiliary system and method for electric power operation site |
CN110674772A (en) * | 2019-09-29 | 2020-01-10 | 国家电网有限公司技术学院分公司 | Intelligent safety control auxiliary system and method for electric power operation site |
CN110750616A (en) * | 2019-10-16 | 2020-02-04 | 网易(杭州)网络有限公司 | Retrieval type chatting method and device and computer equipment |
WO2021082842A1 (en) * | 2019-10-29 | 2021-05-06 | 平安科技(深圳)有限公司 | Quality perception-based text generation method and apparatus, device, and storage medium |
CN110929714A (en) * | 2019-11-22 | 2020-03-27 | 北京航空航天大学 | Information extraction method of intensive text pictures based on deep learning |
CN111090755B (en) * | 2019-11-29 | 2023-04-04 | 福建亿榕信息技术有限公司 | Text incidence relation judging method and storage medium |
CN111090755A (en) * | 2019-11-29 | 2020-05-01 | 福建亿榕信息技术有限公司 | Text incidence relation judging method and storage medium |
CN111222329A (en) * | 2019-12-10 | 2020-06-02 | 上海八斗智能技术有限公司 | Sentence vector training method and model, and sentence vector prediction method and system |
CN111222329B (en) * | 2019-12-10 | 2023-08-01 | 上海八斗智能技术有限公司 | Sentence vector training method, sentence vector model, sentence vector prediction method and sentence vector prediction system |
CN111026850A (en) * | 2019-12-23 | 2020-04-17 | 园宝科技(武汉)有限公司 | Intellectual property matching technology of bidirectional coding representation of self-attention mechanism |
CN111159340B (en) * | 2019-12-24 | 2023-11-03 | 重庆兆光科技股份有限公司 | Machine reading understanding answer matching method and system based on random optimization prediction |
CN111159340A (en) * | 2019-12-24 | 2020-05-15 | 重庆兆光科技股份有限公司 | Answer matching method and system for machine reading understanding based on random optimization prediction |
CN111126068A (en) * | 2019-12-25 | 2020-05-08 | 中电云脑(天津)科技有限公司 | Chinese named entity recognition method and device and electronic equipment |
CN111241275A (en) * | 2020-01-02 | 2020-06-05 | 厦门快商通科技股份有限公司 | Short text similarity evaluation method, device and equipment |
CN111241275B (en) * | 2020-01-02 | 2022-12-06 | 厦门快商通科技股份有限公司 | Short text similarity evaluation method, device and equipment |
CN111339766A (en) * | 2020-02-19 | 2020-06-26 | 云南电网有限责任公司昆明供电局 | Operation ticket compliance checking method and device |
CN111368037A (en) * | 2020-03-06 | 2020-07-03 | 平安科技(深圳)有限公司 | Text similarity calculation method and device based on Bert model |
CN111401076B (en) * | 2020-04-09 | 2023-04-25 | 支付宝(杭州)信息技术有限公司 | Text similarity determination method and device and electronic equipment |
CN111401076A (en) * | 2020-04-09 | 2020-07-10 | 支付宝(杭州)信息技术有限公司 | Text similarity determination method and device and electronic equipment |
CN111460162A (en) * | 2020-04-11 | 2020-07-28 | 科技日报社 | Text classification method and device, terminal equipment and computer readable storage medium |
CN111460162B (en) * | 2020-04-11 | 2021-11-02 | 科技日报社 | Text classification method and device, terminal equipment and computer readable storage medium |
CN111241851A (en) * | 2020-04-24 | 2020-06-05 | 支付宝(杭州)信息技术有限公司 | Semantic similarity determination method and device and processing equipment |
CN111666753A (en) * | 2020-05-11 | 2020-09-15 | 清华大学深圳国际研究生院 | Short text matching method and system based on global and local matching |
CN111553479A (en) * | 2020-05-13 | 2020-08-18 | 鼎富智能科技有限公司 | Model distillation method, text retrieval method and text retrieval device |
CN111553479B (en) * | 2020-05-13 | 2023-11-03 | 鼎富智能科技有限公司 | Model distillation method, text retrieval method and device |
CN111563143A (en) * | 2020-07-20 | 2020-08-21 | 上海二三四五网络科技有限公司 | Method and device for determining new words |
CN111881257A (en) * | 2020-07-24 | 2020-11-03 | 广州大学 | Automatic matching method, system and storage medium based on subject word and sentence subject matter |
CN111881257B (en) * | 2020-07-24 | 2022-06-03 | 广州大学 | Automatic matching method, system and storage medium based on subject word and sentence subject matter |
CN112329450A (en) * | 2020-07-29 | 2021-02-05 | 好人生(上海)健康科技有限公司 | Insurance medical code mapping dictionary table production method |
CN112101030B (en) * | 2020-08-24 | 2024-01-26 | 沈阳东软智能医疗科技研究院有限公司 | Method, device and equipment for establishing term mapping model and realizing standard word mapping |
CN112101030A (en) * | 2020-08-24 | 2020-12-18 | 沈阳东软智能医疗科技研究院有限公司 | Method, device and equipment for establishing term mapping model and realizing standard word mapping |
CN112100373A (en) * | 2020-08-25 | 2020-12-18 | 南方电网深圳数字电网研究院有限公司 | Contract text analysis method and system based on deep neural network |
CN112308743A (en) * | 2020-10-21 | 2021-02-02 | 上海交通大学 | Trial risk early warning method based on triple similar tasks |
CN112308743B (en) * | 2020-10-21 | 2022-11-11 | 上海交通大学 | Trial risk early warning method based on triple similar tasks |
CN112381099A (en) * | 2020-11-24 | 2021-02-19 | 中教云智数字科技有限公司 | Question recording system based on digital education resources |
CN112231449A (en) * | 2020-12-10 | 2021-01-15 | 杭州识度科技有限公司 | Vertical-domain entity linking system based on multi-path recall |
CN112580373A (en) * | 2020-12-26 | 2021-03-30 | 内蒙古工业大学 | High-quality Mongolian unsupervised neural machine translation method |
CN112580373B (en) * | 2020-12-26 | 2023-06-27 | 内蒙古工业大学 | High-quality unsupervised Mongolian neural machine translation method |
CN113590813A (en) * | 2021-01-20 | 2021-11-02 | 腾讯科技(深圳)有限公司 | Text classification method, recommendation device and electronic equipment |
CN113221530A (en) * | 2021-04-19 | 2021-08-06 | 杭州火石数智科技有限公司 | Text similarity matching method and device based on circle loss, computer equipment and storage medium |
CN113221530B (en) * | 2021-04-19 | 2024-02-13 | 杭州火石数智科技有限公司 | Text similarity matching method and device, computer equipment and storage medium |
WO2022252638A1 (en) * | 2021-05-31 | 2022-12-08 | 平安科技(深圳)有限公司 | Text matching method and apparatus, computer device and readable storage medium |
CN113221531B (en) * | 2021-06-04 | 2024-08-06 | 西安邮电大学 | Semantic matching method for multi-model dynamic collaboration |
CN113221531A (en) * | 2021-06-04 | 2021-08-06 | 西安邮电大学 | Multi-model dynamic collaborative semantic matching method |
CN113569011A (en) * | 2021-07-27 | 2021-10-29 | 马上消费金融股份有限公司 | Training method, device and equipment of text matching model and storage medium |
CN113569011B (en) * | 2021-07-27 | 2023-03-24 | 马上消费金融股份有限公司 | Training method, device and equipment of text matching model and storage medium |
CN113591475A (en) * | 2021-08-03 | 2021-11-02 | 美的集团(上海)有限公司 | Unsupervised interpretable word segmentation method and device and electronic equipment |
CN113536789A (en) * | 2021-09-16 | 2021-10-22 | 平安科技(深圳)有限公司 | Method, device, equipment and medium for predicting relevance of algorithm competition |
CN113590763A (en) * | 2021-09-27 | 2021-11-02 | 湖南大学 | Similar text retrieval method and device based on deep learning and storage medium |
CN114357109A (en) * | 2021-11-25 | 2022-04-15 | 达而观数据(成都)有限公司 | Investment audit doubtful point extraction method based on mixed semantic similarity model |
CN114003698B (en) * | 2021-12-27 | 2022-04-01 | 成都晓多科技有限公司 | Text retrieval method, system, equipment and storage medium |
CN114003698A (en) * | 2021-12-27 | 2022-02-01 | 成都晓多科技有限公司 | Text retrieval method, system, equipment and storage medium |
CN114358210A (en) * | 2022-01-14 | 2022-04-15 | 平安科技(深圳)有限公司 | Text similarity calculation method and device, computer equipment and storage medium |
CN114358210B (en) * | 2022-01-14 | 2024-07-02 | 平安科技(深圳)有限公司 | Text similarity calculation method, device, computer equipment and storage medium |
CN114154496A (en) * | 2022-02-08 | 2022-03-08 | 成都四方伟业软件股份有限公司 | Coal mine supervision classification scheme comparison method and device based on a deep learning BERT model |
CN115186660A (en) * | 2022-07-07 | 2022-10-14 | 东航技术应用研发中心有限公司 | Aviation safety report analysis and evaluation method based on text similarity model |
CN115309899A (en) * | 2022-08-09 | 2022-11-08 | 烟台中科网络技术研究所 | Method and system for identifying and storing specific content in text |
CN116127334A (en) * | 2023-02-22 | 2023-05-16 | 佛山科学技术学院 | Semi-structured text matching method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110287494A (en) | A short text similarity matching method based on the deep-learning BERT algorithm | |
White et al. | Inference is everything: Recasting semantic resources into a unified evaluation framework | |
CN107423284B (en) | Method and system for constructing sentence representation fusing internal structure information of Chinese words | |
CN110516245A (en) | Fine-grained sentiment analysis method, apparatus, computer equipment and storage medium | |
CN109960786A (en) | Measurement of Chinese word similarity based on a convergence strategy | |
CN110337645A (en) | Adaptable processing component | |
CN115357719B (en) | Power audit text classification method and device based on improved BERT model | |
CN107797987A (en) | A mixed-corpus named entity recognition method based on Bi-LSTM-CNN | |
CN111581953A (en) | Method for automatically analyzing grammar phenomenon of English text | |
Chen et al. | ADOL: a novel framework for automatic domain ontology learning | |
CN113971394A (en) | Text repeat rewriting system | |
CN116186422A (en) | Disease-related public opinion analysis system based on social media and artificial intelligence | |
CN110232121A (en) | A control command classification method based on semantic networks | |
Zheng et al. | Enhanced word embedding with multiple prototypes | |
Wu | English Vocabulary Learning Aid System Using Digital Twin Wasserstein Generative Adversarial Network Optimized With Jelly Fish Optimization Algorithm | |
Duan et al. | Automatically build corpora for chinese spelling check based on the input method | |
Ali et al. | Word embedding based new corpus for low-resourced language: Sindhi | |
Meng et al. | Design of Intelligent Recognition Model for English Translation Based on Deep Machine Learning | |
CN114417008A (en) | Construction engineering field-oriented knowledge graph construction method and system | |
Wu | A computational neural network model for college English grammar correction | |
Marfani et al. | Analysis of learners’ sentiments on MOOC forums using natural language processing techniques | |
Guo | RETRACTED: An automatic scoring method for Chinese-English spoken translation based on attention LSTM [EAI Endorsed Scal Inf Syst (2022), Online First] | |
CN114970557A (en) | Knowledge enhancement-based cross-language structured emotion analysis method | |
CN113569560A (en) | Automatic scoring method for Chinese bilingual composition | |
Charnine et al. | Optimal automated method for collaborative development of university curricula |
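The documents above share a common pattern: a BERT-style encoder maps each short text to a fixed-length sentence vector, after which similarity matching reduces to a vector comparison such as cosine similarity. The sketch below illustrates only that final comparison step with toy stand-in vectors (the embedding values are hypothetical, not real BERT outputs, and this is not the specific method claimed in CN110287494A):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors: 1.0 means
    identical direction (high similarity), 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "sentence embeddings" for two short texts (hypothetical values;
# a real system would obtain these from a BERT encoder).
emb_a = [0.2, 0.8, 0.1]
emb_b = [0.25, 0.75, 0.05]

print(cosine_similarity(emb_a, emb_b))  # close to 1.0 for similar texts
```

In practice the two texts would be tokenized (after word segmentation, for Chinese), encoded by the fine-tuned BERT model, and the resulting vectors compared or fed to a classification head; the cosine score is then thresholded to decide a match.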
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2019-09-27 |