
CN115982144A - Similar text duplicate removal method and device, storage medium and electronic device - Google Patents


Info

Publication number
CN115982144A
CN115982144A (Application CN202211698954.5A)
Authority
CN
China
Prior art keywords: text, vector, matrix, processed, equal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211698954.5A
Other languages
Chinese (zh)
Inventor
杨梦诗
刘升平
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202211698954.5A priority Critical patent/CN115982144A/en
Publication of CN115982144A publication Critical patent/CN115982144A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a similar text deduplication method and device, a storage medium and an electronic device. The similar text deduplication method comprises the following steps: obtaining a text to be processed; inputting the text to be processed into a pre-trained semantic model, and outputting a text characterization matrix that combines the context and word position information of the text to be processed; and taking the text characterization matrix as input and performing similar text deduplication with a clustering algorithm to obtain the processed target text. Because the semantic model is pre-trained on massive text and combines context and word positions, it can accurately capture broad and complex semantic information, which solves the technical problem of low similar text deduplication precision in the prior art.

Description

Similar text duplicate removal method and device, storage medium and electronic device
Technical Field
The invention relates to the field of natural language processing, and in particular to a similar text deduplication method and device, a storage medium and an electronic device.
Background
The existing text deduplication system mainly comprises the following steps: text cleaning, text word segmentation, text representation (text fingerprinting), candidate generation and distance calculation. Text cleaning mainly cleans and arranges punctuation, whitespace, Chinese and English characters, simplified and traditional characters and the like in the document as required, which facilitates accurate word segmentation and text representation. Text word segmentation uses a mature tool such as jieba to split sentences into words, forming a vocabulary or extracting high-frequency keywords for the subsequent text representation. The text representation encodes the word information from the previous step, for example with tf-idf, to obtain a vector capable of representing the text's semantic information. Candidate text pairs are then generated according to the correlation between text fingerprints, and finally repeated texts are removed by calculating the distance between text pairs, for example the Hamming distance.
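For illustration only, the following is a minimal sketch of such a conventional pipeline, assuming jieba for word segmentation and scikit-learn's TfidfVectorizer for the text fingerprint; the sample texts and the similarity threshold are made up for the example, and cosine similarity over tf-idf vectors stands in for the fingerprint-plus-Hamming-distance variant.

```python
# Hypothetical sketch of the conventional deduplication pipeline described above.
import re
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = ["患者主诉头痛三天。", "病人自述头痛已有三日。", "今日天气晴朗。"]  # illustrative corpus

def clean(text):
    # Text cleaning: drop punctuation and whitespace so segmentation is more accurate.
    return re.sub(r"[^\w]", "", text)

# Text word segmentation with jieba, then tf-idf as the text representation (fingerprint).
segmented = [" ".join(jieba.lcut(clean(t))) for t in texts]
tfidf = TfidfVectorizer().fit_transform(segmented)

# Candidate generation and distance calculation: flag pairs above a similarity threshold.
similarity = cosine_similarity(tfidf)
THRESHOLD = 0.5  # illustrative value
duplicates = [(i, j) for i in range(len(texts))
              for j in range(i + 1, len(texts)) if similarity[i, j] > THRESHOLD]
print(duplicates)
```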
The similar text deduplication in the prior art depends heavily on the word segmentation effect, so it is difficult to obtain a satisfactory result in highly specialized fields such as medicine; moreover, text fingerprints based on words or fragments cannot sufficiently capture complex semantic features such as context and word order information, which introduces errors into the similarity calculation and leads to low deduplication precision for similar texts.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for removing duplicate of similar texts, a storage medium and an electronic device, which are used for at least solving the technical problem of low duplicate removal precision of similar texts in the prior art.
According to an aspect of the embodiments of the present invention, there is provided a similar text deduplication method, including: acquiring a text to be processed; inputting the text to be processed into a pre-training semantic model, and outputting a text representation matrix combining context and word position information of the text to be processed; and taking the text representation matrix as input, and performing similar text deduplication processing by using a clustering algorithm to obtain a processed target text.
Optionally, the step of taking the text representation matrix as input and performing similar text deduplication processing by using a clustering algorithm to obtain a processed target text includes: inputting the text representation matrix into the clustering algorithm to generate a plurality of semantically similar clusters; and taking a representative sample from each of the plurality of semantically similar clusters according to the distance between the text vectors of each cluster and its center vector to obtain the target text.
Optionally, the inputting of the text to be processed into a pre-trained semantic model and the outputting of a text representation matrix combining context and word position information of the text to be processed include: adding [cls] and [seq] at the head and the tail of the text to be processed respectively to obtain a newly added text; performing word segmentation processing on the newly added text to obtain a word segmentation set whose total number of tokens is l; and inputting the word segmentation set into the pre-trained semantic model to obtain the corresponding identifier id of each word of the word segmentation set in the vocabulary of the pre-trained semantic model, and generating a first vector of dimension l, wherein the text representation matrix is composed of the characterization vectors of different texts, and the text characterization vector is obtained by calculating the first vector, the second vector and the third vector.
Optionally, the first vector χ_input_id is represented as follows:

χ_input_id = (x_[cls], x_[1], x_[2], ..., x_[l-2], x_[seq]),

where x_[cls] and x_[seq] respectively represent the ids of [cls] and [seq] in the pre-trained model vocabulary, and x_[i] represents the id of the i-th token of the newly added text in the pre-trained model vocabulary, with 1 ≤ i ≤ l-2.

Optionally, the second vector χ_segment_id defaults all tokens to 0, i.e. the l-dimensional vector of 0s: χ_segment_id = (0, 0, ..., 0);

the third vector χ_attention_mask defaults all tokens to 1, i.e. the l-dimensional vector of 1s: χ_attention_mask = (1, 1, ..., 1).

Optionally, the method further includes: processing the newly added texts in parallel to obtain bs texts processed in parallel; and inputting the bs texts into the pre-trained semantic model to generate a first matrix, a second matrix and a third matrix of the bs texts, wherein the first matrix, the second matrix and the third matrix respectively take the first vectors, the second vectors and the third vectors as row vectors and have dimensions (bs, lmax), lmax is the maximum l value among the bs texts, the χ vector of each text forms a row vector of the X matrix, and χ vectors whose dimension is less than lmax are padded with 0 at the end. The formula is expressed as follows:

X_k = (χ_k^(1); χ_k^(2); ...; χ_k^(bs)),

where k ∈ {input_id, attention_mask, segment_id} and χ_k^(i) denotes the χ_k vector of the i-th text in the batch, with 1 ≤ i ≤ bs.

Optionally, the inputting of the text to be processed into a pre-trained semantic model and the outputting of a text representation matrix combining context and word position information of the text to be processed include: taking the first matrix, the second matrix and the third matrix as input and generating, through calculation by the pre-trained semantic model encoder, the vector representation matrix V of each token in the bs texts, with dimensions (bs, lmax, hs), where hs is the hidden_size of the hidden layer of the pre-trained model:

V = f_enc(X_input_id, X_attention_mask, X_segment_id).

Optionally, the method further includes: determining the [cls] vector as the text characterization vector to obtain the hs-dimensional characterization vector v_[cls]^(i) corresponding to each text in the batch:

V_[cls] = (v_[cls]^(1); v_[cls]^(2); ...; v_[cls]^(bs)),

where v_[cls]^(i) denotes the hs-dimensional characterization vector of the i-th text in the batch, with 1 ≤ i ≤ bs.

Optionally, the method further includes: executing the above steps in a loop to obtain the characterization matrix V_[cls] of all texts, with dimension (n, hs), where n is the total number of texts:

V_[cls] = (v_[cls]^(1); v_[cls]^(2); ...; v_[cls]^(n)),

where v_[cls]^(j) denotes the hs-dimensional characterization vector of the j-th text in the total sample, with 1 ≤ j ≤ n, and equals the characterization vector v_[cls]^(p,i) of the i-th text of the p-th batch of parallel samples, with j = (p-1)·bs + i and 1 ≤ i ≤ bs.

Optionally, the step of taking the text representation matrix as input and using a clustering algorithm to perform similar text deduplication processing to obtain a processed target text includes: normalizing the text characterization matrix V_[cls] to obtain a normalized characterization matrix Ṽ_[cls], so that the Euclidean distance between the normalized matrix vectors is equivalent to the cosine distance:

ṽ_[cls]^(j) = v_[cls]^(j) / ||v_[cls]^(j)||_2, 1 ≤ j ≤ n,

where ṽ_[cls]^(j) is the normalized characterization vector of the j-th text.

Optionally, the method further includes: taking the normalized characterization matrix Ṽ_[cls] as input, clustering according to the Euclidean distances between the n text vectors to obtain the cluster label corresponding to each vector, where vectors with the same label form a semantically similar cluster, so that the n text vectors finally form m clusters, with n ≥ m:

d_ij = sqrt( Σ_{k=1}^{hs} (ṽ_[cls],k^(i) − ṽ_[cls],k^(j))² ),

where d_ij denotes the Euclidean distance between the normalized i-th and j-th text vectors, and ṽ_[cls],k^(j) is the k-th dimensional component of the normalized j-th text vector, with 1 ≤ k ≤ hs.

Optionally, the obtaining of the target text by taking a representative sample from each of the plurality of semantically similar clusters according to the distance between the text vectors of each cluster and its center vector includes: calculating the mean of the samples in each cluster as the cluster center vector, where the number of clusters is m:

c^(g) = (1/|g|) Σ_{ṽ ∈ cluster g} ṽ,

where c^(g) denotes the center vector of the g-th semantically similar cluster, with 1 ≤ g ≤ m, and |g| denotes the total number of samples contained in the g-th cluster;

and finding the sample closest to the center vector in each cluster to form the deduplicated text set, which contains m texts, finally realizing deduplication from n texts to m texts.
According to another aspect of the embodiments of the present application, there is provided a similar text deduplication apparatus, including: an acquisition unit configured to acquire a text to be processed; an output unit configured to input the text to be processed into a pre-trained semantic model and output a text representation matrix combining the context and word position information of the text to be processed; and a processing unit configured to take the text representation matrix as input and perform similar text deduplication processing by using a clustering algorithm to obtain a processed target text.
In the embodiment of the invention, a text to be processed is obtained; the text to be processed is input into a pre-trained semantic model, and a text representation matrix combining the context and word position information of the text to be processed is output; and the text representation matrix is taken as input and similar text deduplication processing is performed with a clustering algorithm to obtain the processed target text. Because the semantic model is pre-trained on massive text and combines context and word positions, it can accurately capture broad and complex semantic information, thereby solving the technical problem of low similar text deduplication precision in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention. In the drawings:
fig. 1 is a block diagram of a hardware structure of a mobile terminal according to an alternative similar text deduplication method of the present invention;
FIG. 2 is a flow diagram of an alternative similar text deduplication method in accordance with an embodiment of the present invention;
fig. 3 is a diagram of an alternative similar text deduplication apparatus according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The similar text deduplication method provided by the embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of the application in a mobile terminal, fig. 1 is a block diagram of a hardware structure of the mobile terminal of a similar text deduplication method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA, etc.) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used for storing computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the similar text deduplication method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
Fig. 2 is a flowchart of a similar text deduplication method according to an embodiment of the present invention, and as shown in fig. 2, the similar text deduplication method includes the following steps:
step S202, acquiring a text to be processed.
And step S204, inputting the text to be processed into a pre-training semantic model, and outputting a text representation matrix combining the context and the word position information of the text to be processed.
And step S206, using the text representation matrix as input, and utilizing a clustering algorithm to perform similar text duplicate removal processing to obtain a processed target text.
In this embodiment, the similar text deduplication method may be applied to, but is not limited to, text processing. When related texts are processed, similar texts among them may first be deduplicated. The pre-trained semantic model may include, but is not limited to, sentenceBert and simBert; the main application scenarios of simBert are divided into similar text generation, similar text retrieval and text matching.
The clustering algorithm may include, but is not limited to, k-means and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), the latter being a fairly representative density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as the largest set of density-connected points, can partition regions of sufficiently high density into clusters, and can find clusters of arbitrary shape in a noisy spatial database.
In the k-means clustering algorithm, k means that the data are clustered into k clusters, and "means" indicates that the mean of the data in each cluster is used as the center of that cluster, also called the centroid. K-means clustering tries to put similar objects into the same cluster and dissimilar objects into different clusters, so a method of measuring the similarity of data is needed. The k-means algorithm is a typical distance-based clustering algorithm: distance is used as the evaluation index of similarity, with the Euclidean distance as the similarity measure by default, i.e. the closer two objects are, the greater their similarity.
In the prior art, text word segmentation uses a mature tool such as jieba to split sentences into words, forming a vocabulary or extracting high-frequency keywords for the subsequent text representation; the text representation encodes the word information from the previous step, for example with tf-idf, to obtain a vector capable of representing text semantic information; candidate text pairs are then generated according to the correlation between text fingerprints; and finally repeated texts are removed by calculating the distance between text pairs, for example the Hamming distance.
According to the embodiment provided in the present application, a text to be processed is obtained; the text to be processed is input into a pre-trained semantic model, and a text representation matrix combining the context and word position information of the text to be processed is output; and the text representation matrix is taken as input and similar text deduplication processing is performed with a clustering algorithm to obtain the processed target text. Because the semantic model is pre-trained on massive text and combines context and word positions, it can accurately capture broad and complex semantic information, thereby solving the technical problem of low similar text deduplication precision in the prior art.
Optionally, the step of taking the text representation matrix as input and performing similar text deduplication processing by using a clustering algorithm to obtain a processed target text includes: inputting the text representation matrix into the clustering algorithm to generate a plurality of semantically similar clusters; and taking a representative sample from each of the plurality of semantically similar clusters according to the distance between the text vectors of each cluster and its center vector to obtain the target text.
Optionally, the inputting of the text to be processed into a pre-trained semantic model and the outputting of a text representation matrix combining context and word position information of the text to be processed may include: adding [cls] and [seq] at the head and the tail of the text to be processed respectively to obtain a newly added text; performing word segmentation processing on the newly added text to obtain a word segmentation set whose total number of tokens is l; and inputting the word segmentation set into the pre-trained semantic model to obtain the corresponding identifier id of each word of the word segmentation set in the vocabulary of the pre-trained semantic model, and generating a first vector of dimension l, wherein the text representation matrix is composed of the characterization vectors of different texts, and the text characterization vector is obtained by calculating the first vector, the second vector and the third vector.
Optionally, the first vector χ_input_id is represented as follows:

χ_input_id = (x_[cls], x_[1], x_[2], ..., x_[l-2], x_[seq]),

where x_[cls] and x_[seq] respectively represent the ids of [cls] and [seq] in the pre-trained model vocabulary, and x_[i] represents the id of the i-th token of the newly added text in the pre-trained model vocabulary, with 1 ≤ i ≤ l-2.

Optionally, the second vector χ_segment_id defaults all tokens to 0, i.e. the l-dimensional vector of 0s: χ_segment_id = (0, 0, ..., 0);

the third vector χ_attention_mask defaults all tokens to 1, i.e. the l-dimensional vector of 1s: χ_attention_mask = (1, 1, ..., 1).

Optionally, the method further includes: processing the newly added texts in parallel to obtain bs texts processed in parallel; and inputting the bs texts into the pre-trained semantic model to generate a first matrix, a second matrix and a third matrix of the bs texts, wherein the first matrix, the second matrix and the third matrix respectively take the first vectors, the second vectors and the third vectors as row vectors and have dimensions (bs, lmax), lmax is the maximum l value among the bs texts, the χ vector of each text forms a row vector of the X matrix, and χ vectors whose dimension is less than lmax are padded with 0 at the end. The formula is expressed as follows:

X_k = (χ_k^(1); χ_k^(2); ...; χ_k^(bs)),

where k ∈ {input_id, attention_mask, segment_id} and χ_k^(i) denotes the χ_k vector of the i-th text in the batch, with 1 ≤ i ≤ bs.

Optionally, the inputting of the text to be processed into a pre-trained semantic model and the outputting of a text representation matrix combining context and word position information of the text to be processed may include: taking the first matrix, the second matrix and the third matrix as input and generating, through calculation by the pre-trained semantic model encoder, the vector representation matrix V of each token in the bs texts, with dimensions (bs, lmax, hs), where hs is the hidden_size of the hidden layer of the pre-trained model:

V = f_enc(X_input_id, X_attention_mask, X_segment_id).

Optionally, the method further includes: determining the [cls] vector as the text characterization vector to obtain the hs-dimensional characterization vector v_[cls]^(i) corresponding to each text in the batch:

V_[cls] = (v_[cls]^(1); v_[cls]^(2); ...; v_[cls]^(bs)),

where v_[cls]^(i) denotes the hs-dimensional characterization vector of the i-th text in the batch, with 1 ≤ i ≤ bs.

Optionally, the method further includes: executing the above steps in a loop to obtain the characterization matrix V_[cls] of all texts, with dimension (n, hs), where n is the total number of texts:

V_[cls] = (v_[cls]^(1); v_[cls]^(2); ...; v_[cls]^(n)),

where v_[cls]^(j) denotes the hs-dimensional characterization vector of the j-th text in the total sample, with 1 ≤ j ≤ n, and equals the characterization vector v_[cls]^(p,i) of the i-th text of the p-th batch of parallel samples, with j = (p-1)·bs + i and 1 ≤ i ≤ bs.

Optionally, the step of taking the text representation matrix as input and using a clustering algorithm to perform similar text deduplication processing to obtain a processed target text includes: normalizing the text characterization matrix V_[cls] to obtain a normalized characterization matrix Ṽ_[cls], so that the Euclidean distance between the normalized matrix vectors is equivalent to the cosine distance:

ṽ_[cls]^(j) = v_[cls]^(j) / ||v_[cls]^(j)||_2, 1 ≤ j ≤ n,

where ṽ_[cls]^(j) is the normalized characterization vector of the j-th text.

Optionally, the method further includes: taking the normalized characterization matrix Ṽ_[cls] as input, clustering according to the Euclidean distances between the n text vectors to obtain the cluster label corresponding to each vector, where vectors with the same label form a semantically similar cluster, so that the n text vectors finally form m clusters, with n ≥ m:

d_ij = sqrt( Σ_{k=1}^{hs} (ṽ_[cls],k^(i) − ṽ_[cls],k^(j))² ),

where d_ij denotes the Euclidean distance between the normalized i-th and j-th text vectors, and ṽ_[cls],k^(j) is the k-th dimensional component of the normalized j-th text vector, with 1 ≤ k ≤ hs.

Optionally, the obtaining of the target text by taking a representative sample from each of the plurality of semantically similar clusters according to the distance between the text vectors of each cluster and its center vector may include: calculating the mean of the samples in each cluster as the cluster center vector, where the number of cluster center vectors is m:

c^(g) = (1/|g|) Σ_{ṽ ∈ cluster g} ṽ,

where c^(g) denotes the center vector of the g-th semantically similar cluster, with 1 ≤ g ≤ m, and |g| denotes the total number of samples contained in the g-th cluster;

and finding the sample closest to the center vector in each cluster to form the deduplicated text set, which contains m texts, finally realizing deduplication from n texts to m texts.
As an alternative embodiment, the present application further provides an accurate and convenient text similarity deduplication method. The details of this scheme are as follows.
In the embodiment, an accurate and convenient text similarity duplication elimination method is provided, and the specific steps are as follows:
firstly, a text vector is obtained through a semantic model.
In the first step, the rich big-data prior knowledge of deep-learning pre-trained semantic models (such as simBert and sentenceBert) is utilized; context and word position information can be combined, the text characterization vector can be obtained accurately in one stop, and complicated text cleaning and word segmentation processes are not needed.
And secondly, text duplicate removal is realized through a clustering algorithm.
In the second step, a clustering algorithm (such as k-means or DBSCAN) is introduced. Clusters with similar semantics can be generated in one stop according to a similarity threshold, and a representative sample can be taken from each cluster to form the deduplicated text set. The method computes similarity in parallel, which improves efficiency, and at the same time avoids a pipeline of processes such as building indexes, generating candidates and calculating distances.
In this embodiment, an accurate and convenient text similarity deduplication method is specifically implemented as follows:

Step 1: obtain a text representation matrix using a pre-trained semantic model.

Step 1.1: taking the original text as input, generate the X_input_id (corresponding to the first matrix), X_segment_id (corresponding to the second matrix) and X_attention_mask (corresponding to the third matrix) matrices using the tokenizer class of the transformers library.

Step 1.1.1: for a single text, first perform tokenization (Chinese text can be split directly into characters), and add [cls] and [seq] at the head and tail respectively, generating l tokens in total (including [cls] and [seq]); find the corresponding id of each token in the pre-trained model vocabulary and generate the l-dimensional χ_input_id vector (corresponding to the first vector):

χ_input_id = (x_[cls], x_[1], x_[2], ..., x_[l-2], x_[seq]),

where x_[cls] and x_[seq] respectively represent the ids of [cls] and [seq] in the pre-trained model vocabulary, and x_[i] represents the id of the i-th token of the text in the pre-trained model vocabulary, with 1 ≤ i ≤ l-2.
Step 1.1.2: for the single text, generate the χ_segment_id vector (corresponding to the second vector) and the χ_attention_mask vector (corresponding to the third vector). χ_segment_id defaults all tokens to 0, i.e. the l-dimensional zero vector, and χ_attention_mask defaults all (non-padding) tokens to 1, i.e. the l-dimensional vector of 1s:

χ_segment_id = (0, 0, ..., 0)
χ_attention_mask = (1, 1, ..., 1)
Step 1.1.3: to improve efficiency, the texts need to be computed in parallel, with parallel number bs. The X_input_id, X_attention_mask and X_segment_id matrices of the bs texts can be generated in parallel using the batch_encode method of the tokenizer class of the transformers library; the dimensions are (bs, lmax), where lmax is the maximum l value among the bs texts. The χ vectors of the single texts form the row vectors of the X matrix, and χ vectors whose dimension is less than lmax are padded with 0 at the end. The formula is expressed as follows:

X_k = (χ_k^(1); χ_k^(2); ...; χ_k^(bs)),

where k ∈ {input_id, attention_mask, segment_id} and χ_k^(i) denotes the χ_k vector of the i-th text in the batch, with 1 ≤ i ≤ bs.
Step 1.2: taking X_input_id, X_attention_mask and X_segment_id as input, obtain the text characterization matrix V_[cls] using the model class of the transformers library.
Step 1.2.1: taking the X_input_id, X_attention_mask and X_segment_id matrices obtained in step 1.1 as input, the pre-trained model encoder (denoted f_enc) generates the vector representation matrix V of each token in the bs texts, with dimensions (bs, lmax, hs), where hs is the hidden_size of the pre-trained model's hidden layer:

V = f_enc(X_input_id, X_attention_mask, X_segment_id)
Step 1.2.2: take the [cls] vector representation as the text characterization vector to obtain the hs-dimensional characterization vector v_[cls]^(i) corresponding to each text in the batch:

V_[cls] = (v_[cls]^(1); v_[cls]^(2); ...; v_[cls]^(bs)),

where v_[cls]^(i) denotes the hs-dimensional characterization vector of the i-th text in the batch, with 1 ≤ i ≤ bs.
Step 1.2.3: execute steps 1.1, 1.2.1 and 1.2.2 in a loop to obtain the characterization matrix V_[cls] of all texts, with dimension (n, hs), where n is the total number of texts:

V_[cls] = (v_[cls]^(1); v_[cls]^(2); ...; v_[cls]^(n)),

where v_[cls]^(j) denotes the hs-dimensional characterization vector of the j-th text in the total sample, with 1 ≤ j ≤ n, and equals the characterization vector v_[cls]^(p,i) of the i-th text of the p-th batch of parallel samples, with j = (p-1)·bs + i and 1 ≤ i ≤ bs.
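A minimal sketch of steps 1.2.1 to 1.2.3 follows, again assuming an illustrative BERT-style checkpoint; it runs the encoder batch by batch and stacks the [cls] vectors into the (n, hs) characterization matrix:

```python
# Sketch of steps 1.2.1-1.2.3: encode each batch and collect the [cls] vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # illustrative checkpoint
model = AutoModel.from_pretrained("bert-base-chinese")
model.eval()

texts = ["患者主诉头痛三天。", "病人自述头痛已有三日。", "今日天气晴朗。"]  # n illustrative texts
bs = 2  # parallel number

chunks = []
with torch.no_grad():
    for start in range(0, len(texts), bs):
        enc = tokenizer(texts[start:start + bs], padding=True, return_tensors="pt")
        V = model(**enc).last_hidden_state  # shape (bs, lmax, hs); hs is the hidden_size
        chunks.append(V[:, 0, :])           # position 0 holds the [cls] vector of each text

V_cls = torch.cat(chunks, dim=0)  # characterization matrix of all texts, shape (n, hs)
print(V_cls.shape)
```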
Step 2: taking all the text characterization matrices as input, perform similar text deduplication processing using a clustering algorithm.
Step 2.1: normalize the characterization matrix V_[cls] obtained in step 1 to obtain the normalized characterization matrix Ṽ_[cls]; the Euclidean distance between the normalized matrix vectors is then equivalent to the cosine distance:

ṽ_[cls]^(j) = v_[cls]^(j) / ||v_[cls]^(j)||_2, 1 ≤ j ≤ n,

where ṽ_[cls]^(j) is the normalized characterization vector of the j-th text.
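A small sketch of step 2.1, assuming the (n, hs) matrix from the previous sketch is available as a NumPy array; after row-wise L2 normalization, the squared Euclidean distance between two rows equals 2 minus twice their cosine similarity, so the two distances rank text pairs identically:

```python
# Sketch of step 2.1: row-wise L2 normalization of the characterization matrix.
import numpy as np
from sklearn.preprocessing import normalize

V_cls = np.random.rand(5, 768)  # stand-in for the (n, hs) matrix from step 1
V_norm = normalize(V_cls, norm="l2", axis=1)  # every row now has unit length

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
a, b = V_norm[0], V_norm[1]
print(np.sum((a - b) ** 2), 2 - 2 * np.dot(a, b))
```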
Step 2.2: taking the normalized characterization matrix Ṽ_[cls] as input, label each text vector with a cluster using a clustering algorithm from the sklearn library.
Step 2.2.1: select a clustering algorithm and set its parameters as required, for example the eps and min_samples values for DBSCAN or the k value for k-means; the former is suitable when the number of deduplicated target samples is not fixed and outlier samples need to be kept, while the latter is suitable when a fixed number (k) of samples is to be kept.
Step 2.2.2: taking the normalized characterization matrix Ṽ_[cls] as input, cluster according to the Euclidean distances between the n text vectors to obtain the cluster label corresponding to each vector. Vectors with the same label (except the -1 label in DBSCAN) form a semantically similar cluster, each vector with the -1 label in DBSCAN forms its own cluster, and finally the n text vectors form m clusters (n ≥ m):

d_ij = sqrt( Σ_{k=1}^{hs} (ṽ_[cls],k^(i) − ṽ_[cls],k^(j))² ),

where d_ij denotes the Euclidean distance between the normalized i-th and j-th text vectors, and ṽ_[cls],k^(j) is the k-th dimensional component of the normalized j-th text vector, with 1 ≤ k ≤ hs.
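A sketch of step 2.2, assuming scikit-learn; the eps, min_samples and k values are illustrative only. Vectors labeled -1 by DBSCAN are reassigned their own singleton cluster labels, matching the handling described above:

```python
# Sketch of step 2.2: cluster the normalized vectors by Euclidean distance.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

V_norm = np.random.rand(20, 768)  # stand-in for the normalized (n, hs) matrix

# Variable number of deduplicated samples, outliers kept as singletons:
labels = DBSCAN(eps=0.3, min_samples=2, metric="euclidean").fit_predict(V_norm)
next_label = labels.max() + 1
for i, lab in enumerate(labels):
    if lab == -1:               # each DBSCAN noise point becomes its own cluster
        labels[i] = next_label
        next_label += 1

# Alternative when a fixed number k of texts should be kept:
# labels = KMeans(n_clusters=5, n_init=10).fit_predict(V_norm)
print(labels)
```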
Step 2.3: perform text deduplication processing on each semantically similar cluster.
Step 2.3.1: calculate the mean of the samples in each cluster as the cluster center vector, where the number of clusters is m:

c^(g) = (1/|g|) Σ_{ṽ ∈ cluster g} ṽ,

where c^(g) denotes the center vector of the g-th semantically similar cluster, with 1 ≤ g ≤ m, and |g| denotes the total number of samples contained in the g-th cluster.
Step 2.3.2: find the sample closest to the center vector in each cluster to form the deduplicated text set, which contains m texts, finally realizing deduplication from n texts to m texts.
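A sketch of step 2.3, assuming the labels and the normalized matrix from the previous sketches plus a parallel list of the original texts; for each cluster, the mean of its members is the center vector and the member closest to it is kept as the representative:

```python
# Sketch of step 2.3: keep the sample nearest to each cluster center.
import numpy as np

def deduplicate(texts, V_norm, labels):
    kept = []
    for g in np.unique(labels):
        idx = np.where(labels == g)[0]             # indices of the samples in cluster g
        center = V_norm[idx].mean(axis=0)          # cluster center vector
        dists = np.linalg.norm(V_norm[idx] - center, axis=1)
        kept.append(texts[idx[np.argmin(dists)]])  # member closest to the center
    return kept  # m deduplicated texts out of the original n

# Illustrative usage with stand-in data:
texts = [f"text {i}" for i in range(20)]
V_norm = np.random.rand(20, 768)
labels = np.random.randint(0, 5, size=20)
print(deduplicate(texts, V_norm, labels))
```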
In the scheme provided by this embodiment, first, obtaining text vectors with a semantic model and running a clustering algorithm both have mature toolkits, so one-stop calls are possible; second, the semantic model is pre-trained on massive text and can accurately capture broad and complex semantic information by combining context and word positions, so compared with conventional schemes based on word segmentation, keywords and the like, the accuracy of text representation and of the subsequent similarity calculation improves qualitatively. The method therefore achieves a double improvement of accuracy and convenience in the text similarity deduplication task.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a similar text deduplication device is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and the description of which has been already made is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 3 is a block diagram of a similar text deduplication apparatus according to an embodiment of the present invention, the similar text deduplication apparatus including:
an obtaining unit 31, configured to obtain a text to be processed.
And the output unit 33 is configured to input the text to be processed into the pre-training semantic model, and output a text representation matrix combining the context and the word position information of the text to be processed.
And the first processing unit 35 is configured to perform similar text deduplication processing by using a clustering algorithm with the text representation matrix as input to obtain a processed target text.
According to the embodiment provided in the present application, the acquisition unit 31 acquires the text to be processed; the output unit 33 inputs the text to be processed into the pre-trained semantic model and outputs a text representation matrix combining the context and word position information of the text to be processed; and the first processing unit 35 takes the text representation matrix as input and performs similar text deduplication processing with a clustering algorithm to obtain the processed target text. Because the semantic model of the invention is pre-trained on massive text and combines context and word positions, it can accurately capture broad and complex semantic information, thereby solving the technical problem of low similar text deduplication precision in the prior art.
Optionally, the processing unit 35 may include: a first input module configured to input the text representation matrix into the clustering algorithm to generate a plurality of semantically similar clusters; and a determining module configured to take a representative sample from each of the plurality of semantically similar clusters according to the distance between the text vectors of each cluster and its center vector to obtain the target text.
Optionally, the output unit 33 may include: an adding module configured to add [cls] and [seq] at the head and the tail of the text to be processed respectively to obtain a newly added text; a word segmentation module configured to perform word segmentation processing on the newly added text to obtain a word segmentation set whose total number of tokens is l; and a second input module configured to input the word segmentation set into the pre-trained semantic model to obtain the corresponding identifier id of each word of the word segmentation set in the vocabulary of the pre-trained semantic model, and to generate a first vector of dimension l, wherein the text characterization matrix is composed of the characterization vectors of different texts, and the text characterization vector is obtained by calculating the first vector, the second vector and the third vector.
Optionally, the first vector χ_input_id is represented as follows:

χ_input_id = (x_[cls], x_[1], x_[2], ..., x_[l-2], x_[seq]),

where x_[cls] and x_[seq] respectively represent the ids of [cls] and [seq] in the pre-trained model vocabulary, and x_[i] represents the id of the i-th token of the newly added text in the pre-trained model vocabulary, with 1 ≤ i ≤ l-2.

Optionally, the second vector χ_segment_id defaults all tokens to 0, i.e. the l-dimensional vector of 0s: χ_segment_id = (0, 0, ..., 0);

the third vector χ_attention_mask defaults all tokens to 1, i.e. the l-dimensional vector of 1s: χ_attention_mask = (1, 1, ..., 1).

Optionally, the apparatus may further include: a second processing unit configured to process the newly added texts in parallel to obtain bs texts processed in parallel; and a generating unit configured to input the bs texts into the pre-trained semantic model and generate a first matrix, a second matrix and a third matrix of the bs texts, wherein the first matrix, the second matrix and the third matrix respectively take the first vectors, the second vectors and the third vectors as row vectors and have dimensions (bs, lmax), lmax is the maximum l value among the bs texts, the χ vector of each text forms a row vector of the X matrix, and χ vectors whose dimension is less than lmax are padded with 0 at the end. The formula is expressed as follows:

X_k = (χ_k^(1); χ_k^(2); ...; χ_k^(bs)),

where k ∈ {input_id, attention_mask, segment_id} and χ_k^(i) denotes the χ_k vector of the i-th text in the batch, with 1 ≤ i ≤ bs.

Optionally, the output unit 33 may include: a generating module configured to take the first matrix, the second matrix and the third matrix as input and generate, through calculation by the pre-trained semantic model encoder, the vector representation matrix V of each token in the bs texts, with dimensions (bs, lmax, hs), where hs is the hidden_size of the hidden layer of the pre-trained semantic model:

V = f_enc(X_input_id, X_attention_mask, X_segment_id).

Optionally, the apparatus may further perform the following operations: determining the [cls] vector as the text characterization vector to obtain the hs-dimensional characterization vector v_[cls]^(i) corresponding to each text in the batch:

V_[cls] = (v_[cls]^(1); v_[cls]^(2); ...; v_[cls]^(bs)),

where v_[cls]^(i) denotes the hs-dimensional characterization vector of the i-th text in the batch, with 1 ≤ i ≤ bs.

Optionally, the apparatus may further perform the following operations: executing the above steps in a loop to obtain the characterization matrix V_[cls] of all texts, with dimension (n, hs), where n is the total number of texts:

V_[cls] = (v_[cls]^(1); v_[cls]^(2); ...; v_[cls]^(n)),

where v_[cls]^(j) denotes the hs-dimensional characterization vector of the j-th text in the total sample, with 1 ≤ j ≤ n, and equals the characterization vector v_[cls]^(p,i) of the i-th text of the p-th batch of parallel samples, with j = (p-1)·bs + i and 1 ≤ i ≤ bs.

Optionally, the apparatus may further perform the following operations: normalizing the text characterization matrix V_[cls] to obtain a normalized characterization matrix Ṽ_[cls], so that the Euclidean distance between the normalized matrix vectors is equivalent to the cosine distance:

ṽ_[cls]^(j) = v_[cls]^(j) / ||v_[cls]^(j)||_2, 1 ≤ j ≤ n,

where ṽ_[cls]^(j) is the normalized characterization vector of the j-th text.

Optionally, the apparatus may further perform the following operations: taking the normalized characterization matrix Ṽ_[cls] as input, clustering according to the Euclidean distances between the n text vectors to obtain the cluster label corresponding to each vector, where vectors with the same label form a semantically similar cluster, so that the n text vectors finally form m clusters, with n ≥ m:

d_ij = sqrt( Σ_{k=1}^{hs} (ṽ_[cls],k^(i) − ṽ_[cls],k^(j))² ),

where d_ij denotes the Euclidean distance between the normalized i-th and j-th text vectors, and ṽ_[cls],k^(j) is the k-th dimensional component of the normalized j-th text vector, with 1 ≤ k ≤ hs.

Optionally, the apparatus may further perform the following operations: calculating the mean of the samples in each cluster as the cluster center vector, where the number of cluster center vectors is m:

c^(g) = (1/|g|) Σ_{ṽ ∈ cluster g} ṽ,

where c^(g) denotes the center vector of the g-th semantically similar cluster, with 1 ≤ g ≤ m, and |g| denotes the total number of samples contained in the g-th cluster; and finding the sample closest to the center vector in each cluster to form the deduplicated text set, which contains m texts, finally realizing deduplication from n texts to m texts.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a text to be processed;
s2, inputting the text to be processed into a pre-training semantic model, and outputting a text representation matrix combining the context and word position information of the text to be processed;
and S3, with the text representation matrix as input, carrying out similar text duplicate removal processing by using a clustering algorithm to obtain a processed target text.
Optionally, in this embodiment, the storage medium may include but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a text to be processed;
s2, inputting the text to be processed into a pre-training semantic model, and outputting a text representation matrix combining the context and word position information of the text to be processed;
and S3, with the text representation matrix as input, carrying out similar text duplicate removal processing by using a clustering algorithm to obtain a processed target text.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented with a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device. In some cases the steps shown or described may be executed in a different order, or the modules may be made into individual integrated circuit modules, or multiple modules or steps may be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (15)

1. A method for similar text deduplication, comprising:
acquiring a text to be processed;
inputting the text to be processed into a pre-training semantic model, and outputting a text representation matrix combining context and word position information of the text to be processed;
and taking the text representation matrix as input, and performing similar text deduplication processing by using a clustering algorithm to obtain a processed target text.
2. The method of claim 1, wherein the step of taking the text representation matrix as input and performing similar text deduplication processing by using a clustering algorithm to obtain a processed target text comprises:
inputting the text representation matrix into the clustering algorithm to generate a plurality of clusters with similar semantics;
and taking a representative sample from each of the plurality of semantically similar clusters according to the distance between the text vectors of each cluster and its center vector to obtain the target text.
3. The method of claim 1, wherein the inputting the text to be processed into a pre-trained semantic model and outputting a text representation matrix combining context and word position information of the text to be processed comprises:
for a single text, adding [cls] and [seq] at the head and the tail of the text to be processed respectively to obtain a newly added text;
performing word segmentation processing on the newly added text to obtain a word segmentation set with a total number of word segmentation tokens of l;
and inputting the word segmentation set into the pre-trained semantic model to obtain the corresponding identifier id of each word of the word segmentation set in the vocabulary of the pre-trained semantic model, and generating a first vector of dimension l, wherein the text representation matrix is composed of the characterization vectors of different texts, and the text characterization vector is obtained by calculating the first vector, the second vector and the third vector.
4. The method of claim 3, wherein the first vector χ_input_id is represented as follows:
χ_input_id = (x_[cls], x_[1], x_[2], ..., x_[l-2], x_[seq]),
where x_[cls] and x_[seq] respectively represent the ids of [cls] and [seq] in the pre-trained model vocabulary, and x_[i] represents the id of the i-th token of the newly added text in the pre-trained model vocabulary, with 1 ≤ i ≤ l-2.
5. The method of claim 3, wherein, for the second vector and the third vector:
the second vector χ_segment_id defaults all tokens to 0, i.e. the l-dimensional vector of 0s: χ_segment_id = (0, 0, ..., 0);
the third vector χ_attention_mask defaults all tokens to 1, i.e. the l-dimensional vector of 1s: χ_attention_mask = (1, 1, ..., 1).
6. The method of claim 5, further comprising:
processing the newly added texts in parallel to obtain bs texts processed in parallel;
inputting the bs texts into the pre-trained semantic model, and generating a first matrix, a second matrix and a third matrix of the bs texts, wherein the first matrix, the second matrix and the third matrix respectively take the first vectors, the second vectors and the third vectors as row vectors and have dimensions (bs, lmax), the text characterization matrix is obtained by calculating the first matrix, the second matrix and the third matrix, lmax is the maximum l value among the bs texts, the χ vector of each text forms a row vector of the X matrix, and χ vectors whose dimension is less than lmax are padded with 0 at the end, the formula being expressed as follows:
X_k = (χ_k^(1); χ_k^(2); ...; χ_k^(bs)),
where k ∈ {input_id, attention_mask, segment_id} and χ_k^(i) denotes the χ_k vector of the i-th text in the batch, with 1 ≤ i ≤ bs.
7. The method of claim 6, wherein the inputting of the text to be processed into a pre-trained semantic model and the outputting of a text representation matrix combining context and word position information of the text to be processed comprise:
taking the first matrix, the second matrix and the third matrix as input and generating, through calculation by the pre-trained semantic model encoder, the vector representation matrix V of each token in the bs texts, with dimensions (bs, lmax, hs), where hs is the hidden_size of the hidden layer of the pre-trained model:
V = f_enc(X_input_id, X_attention_mask, X_segment_id).
8. the method of claim 7, further comprising:
subjecting said [ cls ] to]The vector is determined as the text representation vector, and hs dimension representation vector v corresponding to each text in the batch is obtained [cls]
Figure FDA0004024458260000032
Wherein,
Figure FDA0004024458260000033
and representing the hs dimension characterization vector of the ith text in the batch, wherein i is more than or equal to 1 and less than or equal to bs.
9. The method of claim 8, further comprising:
executing the above steps in a loop to obtain the characterization matrix V_[cls] of all texts, with dimension (n, hs), where n is the total number of texts:
V_[cls] = (v_[cls]^(1); v_[cls]^(2); ...; v_[cls]^(n)),
where v_[cls]^(j) denotes the hs-dimensional characterization vector of the j-th text in the total sample, with 1 ≤ j ≤ n, and equals the characterization vector v_[cls]^(p,i) of the i-th text of the p-th batch of parallel samples, with j = (p-1)·bs + i and 1 ≤ i ≤ bs.
10. The method according to claim 1, wherein the step of performing similar text deduplication processing by using the text characterization matrix as an input and using a clustering algorithm to obtain a processed target text comprises:
performing normalization on the text characterization matrix V_[cls] to obtain a normalized characterization matrix V'_[cls], such that the Euclidean distance between the normalized vectors is equivalent to the cosine distance:
v'_[cls]^(j) = v_[cls]^(j) / ||v_[cls]^(j)||, 1 ≤ j ≤ n,
wherein v'_[cls]^(j) is the j-th normalized text characterization vector.
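Row-wise L2 normalization makes ||a - b||² = 2 - 2·cos(a, b) for unit vectors, so Euclidean and cosine distances rank pairs identically; a minimal numpy sketch of the normalization of claim 10:

    import numpy as np

    def l2_normalize(V_cls: np.ndarray) -> np.ndarray:
        """Normalize each row of the (n, hs) characterization matrix to unit length."""
        norms = np.linalg.norm(V_cls, axis=1, keepdims=True)
        return V_cls / np.clip(norms, 1e-12, None)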
11. The method of claim 10, further comprising:
taking the normalized characterization matrix V'_[cls] as input, performing clustering according to the Euclidean distances among the n text vectors to obtain a cluster label corresponding to each vector, wherein vectors with the same label form a semantically similar cluster, and the n text vectors finally form m clusters, n ≥ m,
d(i, j) = ||v'_[cls]^(i) - v'_[cls]^(j)|| = sqrt(Σ_{k=1}^{hs} (v'_k^(i) - v'_k^(j))²),
wherein d(i, j) represents the Euclidean distance between the normalized i-th and j-th text vectors, and v'_k^(j) is the k-th dimension component of the j-th normalized text vector, 1 ≤ k ≤ hs.
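The claims do not fix a particular clustering algorithm; the sketch below uses scikit-learn's KMeans as one possible choice, with the number of clusters m supplied as an assumed parameter:

    from sklearn.cluster import KMeans

    def cluster_texts(V_norm, m):
        """Cluster n normalized text vectors into m semantically similar clusters;
        Euclidean distance on unit vectors is equivalent to cosine distance."""
        return KMeans(n_clusters=m, n_init=10, random_state=0).fit_predict(V_norm)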
12. The method of claim 2, wherein obtaining the target text by taking a representative sample from each of the plurality of semantically similar clusters according to the distance between the text vectors of each cluster and the center vector comprises:
calculating the mean value of the samples in each cluster as the cluster center vector, the number of cluster center vectors being m:
c^(g) = (1/|g|) Σ_{j ∈ cluster g} v'_[cls]^(j),
wherein c^(g) represents the center vector of the g-th semantically similar cluster, 1 ≤ g ≤ m, and |g| represents the total number of samples contained in the g-th cluster;
and finding, in each cluster, the sample closest to the center vector, the found samples forming the de-duplicated text set of size m, thereby achieving de-duplication from n texts to m texts.
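A sketch of the de-duplication step of claim 12: compute each cluster's mean as its center vector and keep only the sample nearest to that center, reducing n texts to m representatives; the cluster labels are assumed to come from the previous clustering step:

    import numpy as np

    def deduplicate(V_norm, labels, texts):
        """Keep, per cluster, the text whose normalized vector is closest to the cluster mean."""
        kept = []
        for g in np.unique(labels):
            idx = np.where(labels == g)[0]
            center = V_norm[idx].mean(axis=0)                      # cluster center vector
            dists = np.linalg.norm(V_norm[idx] - center, axis=1)   # distances to the center
            kept.append(texts[idx[np.argmin(dists)]])
        return kept                                                # m representative texts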
13. A similar text deduplication apparatus, comprising:
the acquisition unit is used for acquiring a text to be processed;
the output unit is used for inputting the text to be processed into a pre-training semantic model and outputting a text representation matrix combining the context and the word position information of the text to be processed;
and the processing unit is used for carrying out similar text duplicate removal processing by using the text representation matrix as input and utilizing a clustering algorithm to obtain a processed target text.
14. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 12 when executed.
15. An electronic device comprising a memory and a processor, wherein the memory has a computer program stored therein, and the processor is configured to execute the computer program to perform the method of any of claims 1 to 12.
CN202211698954.5A 2022-12-28 2022-12-28 Similar text duplicate removal method and device, storage medium and electronic device Pending CN115982144A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211698954.5A CN115982144A (en) 2022-12-28 2022-12-28 Similar text duplicate removal method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211698954.5A CN115982144A (en) 2022-12-28 2022-12-28 Similar text duplicate removal method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN115982144A true CN115982144A (en) 2023-04-18

Family

ID=85960802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211698954.5A Pending CN115982144A (en) 2022-12-28 2022-12-28 Similar text duplicate removal method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN115982144A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118378723A (en) * 2024-06-21 2024-07-23 中国电信股份有限公司 Model training data processing method and device and electronic equipment
CN118377899A (en) * 2024-04-15 2024-07-23 广州天懋信息系统股份有限公司 Text data de-duplication method, apparatus, storage medium and program product


Similar Documents

Publication Publication Date Title
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
CN111291177B (en) Information processing method, device and computer storage medium
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN109508379A (en) A kind of short text clustering method indicating and combine similarity based on weighted words vector
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN111125469B (en) User clustering method and device of social network and computer equipment
CN115982144A (en) Similar text duplicate removal method and device, storage medium and electronic device
CN111506726B (en) Short text clustering method and device based on part-of-speech coding and computer equipment
CN109829045A (en) A kind of answering method and device
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN110895559A (en) Model training method, text processing method, device and equipment
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN114238329A (en) Vector similarity calculation method, device, equipment and storage medium
CN112632261A (en) Intelligent question and answer method, device, equipment and storage medium
CN113722512A (en) Text retrieval method, device and equipment based on language model and storage medium
CN116910599A (en) Data clustering method, system, electronic equipment and storage medium
CN115470338A (en) Multi-scene intelligent question and answer method and system based on multi-way recall
CN112579752A (en) Entity relationship extraction method and device, storage medium and electronic equipment
CN112632248A (en) Question answering method, device, computer equipment and storage medium
Bassiou et al. Greek folk music classification into two genres using lyrics and audio via canonical correlation analysis
CN117235137B (en) Professional information query method and device based on vector database
CN114282513A (en) Text semantic similarity matching method and system, intelligent terminal and storage medium
CN110825852B (en) Long text-oriented semantic matching method and system
CN114003706A (en) Keyword combination generation model training method and device
CN110209895B (en) Vector retrieval method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination