CN115982144A - Similar text duplicate removal method and device, storage medium and electronic device - Google Patents
Similar text duplicate removal method and device, storage medium and electronic device
- Publication number: CN115982144A
- Application number: CN202211698954.5A
- Authority: CN (China)
- Prior art keywords: text, vector, matrix, processed, equal
- Prior art date: 2022-12-28
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a similar text deduplication method and device, a storage medium and an electronic device. The similar text deduplication method comprises the following steps: obtaining a text to be processed; inputting the text to be processed into a pre-training semantic model, and outputting a text representation matrix combining the context and word position information of the text to be processed; and taking the text representation matrix as input and performing similar text deduplication processing with a clustering algorithm to obtain the processed target text. Because the semantic model is pre-trained on massive texts, it can combine context and word position to accurately capture broad and complex semantic information, which solves the technical problem of low similar text deduplication precision in the prior art.
Description
Technical Field
The invention relates to the field of natural language processing, and in particular to a similar text deduplication method and device, a storage medium and an electronic device.
Background
The existing text deduplication pipeline mainly comprises the following steps: text cleaning, text word segmentation, text representation (text fingerprinting), candidate generation and distance calculation. Text cleaning mainly cleans and normalizes punctuation, whitespace, mixed Chinese and English, and simplified versus traditional characters in the document as required, which facilitates accurate word segmentation and text representation. Text word segmentation uses a mature tool such as jieba to split sentences into words, forming a vocabulary or extracting high-frequency keywords for subsequent text representation. Text representation encodes the word information from the previous step, for example with tf-idf, to obtain a vector capable of representing the semantic information of the text. Candidate text pairs are then generated according to the correlation between text fingerprints. Finally, duplicate texts are removed by calculating the distance between text pairs, for example the hamming distance.
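As a non-limiting illustration, such a conventional pipeline might be sketched as follows with jieba word segmentation, a simhash-style fingerprint and the Hamming distance; the 64-bit hash, the frequency weighting and the distance threshold are illustrative assumptions rather than part of the present disclosure:

```python
import hashlib

import jieba


def simhash(text, bits=64):
    # Weight each token by its frequency, then take a per-bit majority vote.
    weights = {}
    for tok in jieba.lcut(text):
        weights[tok] = weights.get(tok, 0) + 1
    votes = [0] * bits
    for tok, w in weights.items():
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming(a, b):
    return bin(a ^ b).count("1")


docs = ["患者出现发热症状", "病人有发烧的表现"]   # placeholder documents
fingerprints = [simhash(d) for d in docs]
is_duplicate = hamming(fingerprints[0], fingerprints[1]) <= 3   # illustrative threshold
```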
Similar text deduplication in the prior art depends heavily on the quality of word segmentation, so it is difficult to obtain satisfactory results in highly specialized fields such as medicine. Moreover, text fingerprints based on words or fragments cannot sufficiently capture complex semantic features such as context and word-order information, which introduces errors into the similarity calculation, so the deduplication precision for similar texts is low.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the invention provide a similar text deduplication method and device, a storage medium and an electronic device, so as to at least solve the technical problem of low deduplication precision for similar texts in the prior art.
According to an aspect of the embodiments of the present invention, there is provided a similar text deduplication method, including: acquiring a text to be processed; inputting the text to be processed into a pre-training semantic model, and outputting a text representation matrix combining context and word position information of the text to be processed; and taking the text representation matrix as input, and performing similar text deduplication processing by using a clustering algorithm to obtain a processed target text.
Optionally, the step of taking the text representation matrix as input and performing similar text deduplication processing with a clustering algorithm to obtain a processed target text includes: inputting the text representation matrix into the clustering algorithm to generate a plurality of clusters with similar semantics; and taking a representative sample from each of the plurality of semantically similar clusters according to the distances between the text vectors of each cluster and its center vector, to obtain the target text.
Optionally, inputting the text to be processed into a pre-training semantic model and outputting a text representation matrix combining context and word position information of the text to be processed includes: adding [cls] and [seq] at the head and the tail of the text to be processed respectively to obtain a newly added text; performing word segmentation processing on the newly added text to obtain a token set whose total number of tokens is l; and inputting the token set into the pre-training semantic model to obtain the identifier id of each token of the token set in the vocabulary of the pre-training semantic model, and generating a first vector of dimension l, wherein the text representation matrix is composed of the representation vectors of different texts, and each text representation vector is obtained by calculation from the first vector, a second vector and a third vector.
Optionally, the first vector χ_input_id is represented as follows:

χ_input_id = (x_[cls], x_[1], x_[2], ..., x_[l-2], x_[seq])

wherein x_[cls] and x_[seq] respectively represent the ids of [cls] and [seq] in the pre-training model vocabulary, and x_[i] represents the id of the i-th token of the newly added text in the pre-training model vocabulary, with 1 ≤ i ≤ l-2.
Optionally, the second vector χ_segment_id defaults to 0 for all tokens, i.e. an l-dimensional vector whose values are all 0: χ_segment_id = (0, 0, ..., 0);

the third vector χ_attention_mask defaults to 1 for all tokens, i.e. an l-dimensional vector whose values are all 1: χ_attention_mask = (1, 1, ..., 1).
Optionally, the method further includes: processing the newly added texts in parallel, with bs texts processed in parallel; and inputting the bs texts into the pre-training semantic model and generating a first matrix, a second matrix and a third matrix of the bs texts, wherein the first matrix, the second matrix and the third matrix take the first vectors, the second vectors and the third vectors respectively as row vectors, the dimensions are (bs, lmax), lmax is the maximum l value among the bs texts, the χ vector of each text forms a row vector of the corresponding X matrix, and χ vectors whose dimension is less than lmax are padded with 0 at the end, expressed as:

X_k = (χ_k^(1); χ_k^(2); ...; χ_k^(bs))

wherein k ∈ {input_id, attention_mask, segment_id}, χ_k^(i) represents the χ_k vector of the i-th text in the batch, and 1 ≤ i ≤ bs.
Optionally, inputting the text to be processed into a pre-training semantic model and outputting a text representation matrix combining context and word position information of the text to be processed includes: taking the first matrix, the second matrix and the third matrix as input and generating, through calculation by the pre-training semantic model encoder, a vector representation matrix V of each token in the bs texts, with dimensions (bs, lmax, hs), where hs is the hidden_size of the hidden layer of the pre-training model:

V = f_enc(X_input_id, X_attention_mask, X_segment_id).
Optionally, the method further includes: determining the [cls] vector as the text representation vector, and obtaining the hs-dimensional representation vector v_[cls] corresponding to each text in the batch;

wherein v_[cls]^(i) represents the hs-dimensional representation vector of the i-th text in the batch, i.e. the row of V corresponding to the [cls] token of that text, and 1 ≤ i ≤ bs.
Optionally, the method further includes: executing the above steps in a loop to obtain the representation matrix V_[cls] of all texts, with dimensions (n, hs), where n is the total number of texts;

wherein v_[cls]^(j) represents the hs-dimensional representation vector of the j-th text in the total sample, 1 ≤ j ≤ n, and v_[cls]^(j) equals the representation vector of the i-th text of the p-th batch of parallel samples, with j = (p-1)·bs + i and 1 ≤ i ≤ bs.
Optionally, the step of taking the text representation matrix as input and performing similar text deduplication processing with a clustering algorithm to obtain a processed target text includes: normalizing the text representation matrix V_[cls] to obtain a normalized representation matrix V′_[cls], so that the Euclidean distance between the normalized matrix vectors is equivalent to the cosine distance:

v′_[cls]^(j) = v_[cls]^(j) / ||v_[cls]^(j)||, 1 ≤ j ≤ n

wherein v′_[cls]^(j) is the normalized representation vector of the j-th text, obtained by dividing v_[cls]^(j) by its norm.
Optionally, the method further includes: taking the normalized representation matrix V′_[cls] as input, performing clustering according to the Euclidean distances between the n text vectors to obtain the cluster label corresponding to each vector, wherein vectors with the same label form a semantically similar cluster, and finally the n text vectors form m clusters, with n ≥ m;

d(i, j) = sqrt( Σ_{k=1..hs} ( v′_k^(i) − v′_k^(j) )² )

wherein d(i, j) represents the Euclidean distance between the normalized i-th and j-th text vectors, v′_k^(j) is the k-th dimensional component of the normalized j-th text vector, and 1 ≤ k ≤ hs.
Optionally, obtaining the target text by taking a representative sample from each of the plurality of semantically similar clusters according to the distances between the text vectors of each cluster and its center vector includes: calculating the mean of the samples in each cluster as the cluster center vector, the number of clusters being m:

c^(g) = (1/|g|) · Σ_{v′ in cluster g} v′

wherein c^(g) represents the center vector of the g-th semantically similar cluster, 1 ≤ g ≤ m, and |g| represents the total number of samples contained in the g-th cluster;

and finding the sample closest to the center vector in each cluster to form the deduplicated text set, the number of texts in the set being m, finally realizing deduplication from n texts to m texts.
According to a first aspect of embodiments of the present application, there is provided a similar text deduplication apparatus, including: the acquisition unit is used for acquiring a text to be processed; the output unit is used for inputting the text to be processed into a pre-training semantic model and outputting a text representation matrix combining the context and the word position information of the text to be processed; and the processing unit is used for carrying out similar text duplicate removal processing by using the text representation matrix as input and utilizing a clustering algorithm to obtain a processed target text.
In the embodiments of the invention, a text to be processed is obtained; the text to be processed is input into a pre-training semantic model, and a text representation matrix combining the context and word position information of the text to be processed is output; and the text representation matrix is taken as input and similar text deduplication processing is performed with a clustering algorithm to obtain the processed target text. Because the semantic model is pre-trained on massive texts, it can combine context and word position to accurately capture broad and complex semantic information, which solves the technical problem of low similar text deduplication precision in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention and do not constitute a limitation of the invention. In the drawings:
Fig. 1 is a block diagram of the hardware structure of a mobile terminal for an alternative similar text deduplication method according to the present invention;
FIG. 2 is a flow diagram of an alternative similar text deduplication method in accordance with an embodiment of the present invention;
Fig. 3 is a block diagram of an alternative similar text deduplication apparatus according to an embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The similar text deduplication method provided by the embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of the application in a mobile terminal, fig. 1 is a block diagram of a hardware structure of the mobile terminal of a similar text deduplication method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA, etc.) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used for storing computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the similar text deduplication method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
Fig. 2 is a flowchart of a similar text deduplication method according to an embodiment of the present invention, and as shown in fig. 2, the similar text deduplication method includes the following steps:
step S202, acquiring a text to be processed.
And step S204, inputting the text to be processed into a pre-training semantic model, and outputting a text representation matrix combining the context and the word position information of the text to be processed.
And step S206, using the text representation matrix as input, and utilizing a clustering algorithm to perform similar text duplicate removal processing to obtain a processed target text.
In this embodiment, the similar text deduplication method may be applied to, but is not limited to, text processing scenarios; when related texts are processed, the similar texts among them may first be deduplicated. The pre-training semantic model may include, but is not limited to, models such as simBert and sentenceBert, where the main application scenarios of simBert are similar text generation, similar text retrieval and text matching.
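As a non-limiting illustration, such a sentence-level encoder can typically be called in one stop; the sketch below assumes the sentence-transformers package and a hypothetical multilingual checkpoint, neither of which is prescribed by the present disclosure:

```python
from sentence_transformers import SentenceTransformer

# Hypothetical checkpoint; any simBert/sentenceBert-style encoder could be substituted.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

texts = ["患者出现发热症状", "病人有发烧的表现", "今天天气晴朗"]
vectors = model.encode(texts, normalize_embeddings=True)  # (n, hs) text representation matrix
```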
The clustering algorithm may include, but is not limited to, k-means and DBSCAN. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as the largest set of density-connected points, so it can divide regions of sufficiently high density into clusters and find clusters of arbitrary shape in a noisy spatial database.

In the k-means clustering algorithm, k indicates that the data are clustered into k clusters, and "means" indicates that the mean of the data in each cluster is used as the center of that cluster, also called the centroid. K-means clustering tries to put similar objects into the same cluster and dissimilar objects into different clusters, so a method for measuring the similarity of data is needed. The k-means algorithm is a typical distance-based clustering algorithm that uses distance as the evaluation index of similarity, by default the Euclidean distance; that is, the closer two objects are, the greater their similarity.
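As a non-limiting sketch, both algorithms are available in one stop from the sklearn library; the parameter values below are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

X = np.random.rand(100, 768)  # stand-in for a text representation matrix

kmeans_labels = KMeans(n_clusters=10, n_init=10).fit_predict(X)  # keep a fixed number of clusters
dbscan_labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)    # density-based; -1 marks outliers
```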
In the prior art, text word segmentation uses a mature tool such as jieba to split sentences into words, forming a vocabulary or extracting high-frequency keywords for subsequent text representation; the text representation encodes the word information from the previous step, for example with tf-idf, to obtain a vector capable of representing the semantic information of the text; candidate text pairs are then generated according to the correlation between text fingerprints; and finally, duplicate texts are removed by calculating the distance between text pairs, such as the hamming distance.
According to the embodiment provided by the present application, a text to be processed is obtained; the text to be processed is input into a pre-training semantic model, and a text representation matrix combining the context and word position information of the text to be processed is output; and the text representation matrix is taken as input and similar text deduplication processing is performed with a clustering algorithm to obtain the processed target text. Because the semantic model is pre-trained on massive texts, it can combine context and word position to accurately capture broad and complex semantic information, thereby solving the technical problem of low deduplication precision for similar texts in the prior art.
Optionally, the step of taking the text representation matrix as input and performing similar text deduplication processing with a clustering algorithm to obtain a processed target text includes: inputting the text representation matrix into the clustering algorithm to generate a plurality of clusters with similar semantics; and taking a representative sample from each of the plurality of semantically similar clusters according to the distances between the text vectors of each cluster and its center vector, to obtain the target text.
Optionally, inputting the text to be processed into a pre-training semantic model and outputting a text representation matrix combining context and word position information of the text to be processed may include: adding [cls] and [seq] at the head and the tail of the text to be processed respectively to obtain a newly added text; performing word segmentation processing on the newly added text to obtain a token set whose total number of tokens is l; and inputting the token set into the pre-training semantic model to obtain the identifier id of each token of the token set in the vocabulary of the pre-training semantic model, and generating a first vector of dimension l, wherein the text representation matrix is composed of the representation vectors of different texts, and each text representation vector is obtained by calculation from the first vector, a second vector and a third vector.
Optionally, the first vector χ_input_id is represented as follows:

χ_input_id = (x_[cls], x_[1], x_[2], ..., x_[l-2], x_[seq])

wherein x_[cls] and x_[seq] respectively represent the ids of [cls] and [seq] in the pre-training model vocabulary, and x_[i] represents the id of the i-th token of the newly added text in the pre-training model vocabulary, with 1 ≤ i ≤ l-2.

Optionally, the second vector χ_segment_id defaults to 0 for all tokens, i.e. an l-dimensional vector whose values are all 0: χ_segment_id = (0, 0, ..., 0);

the third vector χ_attention_mask defaults to 1 for all tokens, i.e. an l-dimensional vector whose values are all 1: χ_attention_mask = (1, 1, ..., 1).
Optionally, the method further includes: processing the newly added texts in parallel, with bs texts processed in parallel; and inputting the bs texts into the pre-training semantic model and generating a first matrix, a second matrix and a third matrix of the bs texts, wherein the first matrix, the second matrix and the third matrix take the first vectors, the second vectors and the third vectors respectively as row vectors, the dimensions are (bs, lmax), lmax is the maximum l value among the bs texts, the χ vector of each text forms a row vector of the corresponding X matrix, and χ vectors whose dimension is less than lmax are padded with 0 at the end, expressed as:

X_k = (χ_k^(1); χ_k^(2); ...; χ_k^(bs))

wherein k ∈ {input_id, attention_mask, segment_id}, χ_k^(i) represents the χ_k vector of the i-th text in the batch, and 1 ≤ i ≤ bs.
Optionally, inputting the text to be processed into a pre-training semantic model and outputting a text representation matrix combining context and word position information of the text to be processed may include: taking the first matrix, the second matrix and the third matrix as input and generating, through calculation by the pre-training semantic model encoder, a vector representation matrix V of each token in the bs texts, with dimensions (bs, lmax, hs), where hs is the hidden_size of the hidden layer of the pre-training model:

V = f_enc(X_input_id, X_attention_mask, X_segment_id).
Optionally, the method further includes: determining the [cls] vector as the text representation vector, and obtaining the hs-dimensional representation vector v_[cls] corresponding to each text in the batch;

wherein v_[cls]^(i) represents the hs-dimensional representation vector of the i-th text in the batch, i.e. the row of V corresponding to the [cls] token of that text, and 1 ≤ i ≤ bs.
Optionally, the method further includes: executing the above steps in a loop to obtain the representation matrix V_[cls] of all texts, with dimensions (n, hs), where n is the total number of texts;

wherein v_[cls]^(j) represents the hs-dimensional representation vector of the j-th text in the total sample, 1 ≤ j ≤ n, and v_[cls]^(j) equals the representation vector of the i-th text of the p-th batch of parallel samples, with j = (p-1)·bs + i and 1 ≤ i ≤ bs.
Optionally, the step of taking the text representation matrix as input and performing similar text deduplication processing with a clustering algorithm to obtain a processed target text includes: normalizing the text representation matrix V_[cls] to obtain a normalized representation matrix V′_[cls], so that the Euclidean distance between the normalized matrix vectors is equivalent to the cosine distance:

v′_[cls]^(j) = v_[cls]^(j) / ||v_[cls]^(j)||, 1 ≤ j ≤ n

wherein v′_[cls]^(j) is the normalized representation vector of the j-th text, obtained by dividing v_[cls]^(j) by its norm.
Optionally, the method further includes: taking the normalized representation matrix V′_[cls] as input, performing clustering according to the Euclidean distances between the n text vectors to obtain the cluster label corresponding to each vector, wherein vectors with the same label form a semantically similar cluster, and finally the n text vectors form m clusters, with n ≥ m;

d(i, j) = sqrt( Σ_{k=1..hs} ( v′_k^(i) − v′_k^(j) )² )

wherein d(i, j) represents the Euclidean distance between the normalized i-th and j-th text vectors, v′_k^(j) is the k-th dimensional component of the normalized j-th text vector, and 1 ≤ k ≤ hs.
Optionally, obtaining the target text by taking a representative sample from each of the plurality of semantically similar clusters according to the distances between the text vectors of each cluster and its center vector may include: calculating the mean of the samples in each cluster as the cluster center vector, the number of cluster center vectors being m:

c^(g) = (1/|g|) · Σ_{v′ in cluster g} v′

wherein c^(g) represents the center vector of the g-th semantically similar cluster, 1 ≤ g ≤ m, and |g| represents the total number of samples contained in the g-th cluster;

and finding the sample closest to the center vector in each cluster to form the deduplicated text set, the number of texts in the set being m, finally realizing deduplication from n texts to m texts.
As an alternative embodiment, the present application further provides an accurate and convenient text similarity deduplication method. The details of this scheme are as follows.
In this embodiment, an accurate and convenient text similarity deduplication method is provided; its two main steps are as follows:
firstly, a text vector is obtained through a semantic model.
In the first step, the rich big-data prior knowledge of deep-learning pre-training semantic models (such as simBert and sentenceBert) is utilized: context and word position information are combined, and a text representation vector is obtained accurately in one stop, without complicated text cleaning and word segmentation processes.
And secondly, text duplicate removal is realized through a clustering algorithm.
A clustering algorithm (such as k-means or DBSCAN) is introduced in the second step. Clusters with similar semantics can be generated in one stop according to a similarity threshold, and one representative sample is taken from each cluster to form the deduplicated text set. The method computes similarity in parallel, which improves efficiency, and at the same time avoids a pipeline of steps such as building indexes, generating candidates and calculating distances.
In this embodiment, an accurate and convenient text similarity deduplication method is specifically implemented as follows:
step 1: obtaining a text representation matrix by utilizing a pre-training semantic model;
step 1.1: generating X by using tokenizer class of transformations library by taking original text as input input_id (corresponding to the first matrix), X attention_mask (corresponding to the second matrix), X segment_id Matrix (equivalent to the third matrix).
Step 1.1.1: for a single text, first perform tokenization (Chinese text can be split directly into characters), and add [cls] and [seq] at the head and tail respectively, generating l tokens in total (including [cls] and [seq]); find the id of each token in the pre-training model vocabulary and generate the l-dimensional χ_input_id vector (corresponding to the first vector):

χ_input_id = (x_[cls], x_[1], x_[2], ..., x_[l-2], x_[seq])

wherein x_[cls] and x_[seq] respectively represent the ids of [cls] and [seq] in the pre-training model vocabulary, and x_[i] represents the id of the i-th token of the text in the pre-training model vocabulary, with 1 ≤ i ≤ l-2.
Step 1.1.2: for a single text, generate the χ_segment_id vector (corresponding to the second vector) and the χ_attention_mask vector (corresponding to the third vector); χ_segment_id defaults to 0 for all tokens, i.e. the l-dimensional zero vector, and χ_attention_mask defaults to 1 for all (non-padding) tokens, i.e. the l-dimensional vector whose values are all 1:

χ_segment_id = (0, 0, ..., 0)

χ_attention_mask = (1, 1, ..., 1)
Step 1.1.3: to improve efficiency, texts need to be computed in parallel, with parallelism bs. The X_input_id, X_attention_mask and X_segment_id matrices of bs texts can be generated in parallel by using the batch_encode method of the tokenizer class of the transformers library; the dimensions are (bs, lmax), where lmax is the maximum l value among the bs texts. The χ vectors of the single texts form the row vectors of the corresponding X matrix, and χ vectors whose dimension is less than lmax are padded with 0 at the end, expressed as:

X_k = (χ_k^(1); χ_k^(2); ...; χ_k^(bs))

wherein k ∈ {input_id, attention_mask, segment_id}, χ_k^(i) represents the χ_k vector of the i-th text in the batch, and 1 ≤ i ≤ bs.
Step 1.2: with X_input_id, X_attention_mask and X_segment_id as input, use the model class of the transformers library to obtain the text representation matrix V_[cls].
Step 1.2.1: take the X_input_id, X_attention_mask and X_segment_id matrices obtained in step 1.1 as input and pass them through the pre-training model encoder (denoted f_enc) to generate the vector representation matrix V of each token in the bs texts, with dimensions (bs, lmax, hs), where hs is the hidden_size of the hidden layer of the pre-trained model:

V = f_enc(X_input_id, X_attention_mask, X_segment_id)
Step 1.2.2: take the [cls] vector representation as the text representation vector, obtaining the hs-dimensional representation vector v_[cls] corresponding to each text in the batch, i.e. the row of V corresponding to the [cls] token (the first token) of that text;

wherein v_[cls]^(i) represents the hs-dimensional representation vector of the i-th text in the batch, and 1 ≤ i ≤ bs.
Step 1.2.3: execute steps 1.1, 1.2.1 and 1.2.2 in a loop to obtain the representation matrix V_[cls] of all texts, with dimensions (n, hs), where n is the total number of texts;

wherein v_[cls]^(j) represents the hs-dimensional representation vector of the j-th text in the total sample, 1 ≤ j ≤ n, and v_[cls]^(j) equals the representation vector of the i-th text of the p-th batch of parallel samples, with j = (p-1)·bs + i and 1 ≤ i ≤ bs.
Step 2: taking the representation matrix of all texts as input, perform similar text deduplication processing with a clustering algorithm.
Step 2.1: normalize the representation matrix V_[cls] obtained in step 1 to obtain the normalized representation matrix V′_[cls], so that the Euclidean distance between the normalized matrix vectors is equivalent to the cosine distance:

v′_[cls]^(j) = v_[cls]^(j) / ||v_[cls]^(j)||, 1 ≤ j ≤ n

wherein v′_[cls]^(j) is the normalized representation vector of the j-th text, obtained by dividing v_[cls]^(j) by its norm.
Step 2.2: with the normalized representation matrix V′_[cls] as input, label each text vector with a cluster using a clustering algorithm from the sklearn library.
Step 2.2.1: select a clustering algorithm and set its parameters according to requirements, e.g. the eps and min_samples values in DBSCAN or the k value in k-means; the former is suitable when the number of deduplicated target samples is not fixed and outlier samples need to be kept, while the latter is suitable when a fixed number (k) of samples is to be kept.
Step 2.2.2: with the normalized representation matrix V′_[cls] as input, perform clustering according to the Euclidean distances between the n text vectors to obtain the cluster label corresponding to each vector. Vectors with the same label (except the -1 label in DBSCAN) form a semantically similar cluster, each vector with the -1 label in DBSCAN forms its own cluster, and finally the n text vectors form m clusters (n ≥ m):

d(i, j) = sqrt( Σ_{k=1..hs} ( v′_k^(i) − v′_k^(j) )² )

wherein d(i, j) represents the Euclidean distance between the normalized i-th and j-th text vectors, v′_k^(j) is the k-th dimensional component of the normalized j-th text vector, and 1 ≤ k ≤ hs.
Step 2.3: perform text deduplication processing on each semantically similar cluster.
Step 2.3.1: calculate the mean of the samples in each cluster as the cluster center vector, the number of cluster center vectors being m:

c^(g) = (1/|g|) · Σ_{v′ in cluster g} v′

wherein c^(g) represents the center vector of the g-th semantically similar cluster, 1 ≤ g ≤ m, and |g| represents the total number of samples contained in the g-th cluster.
Step 2.3.2: find the sample closest to the center vector in each cluster to form the deduplicated text set; the number of texts in the set is m, finally realizing deduplication from n texts to m texts.
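Steps 2.3.1 and 2.3.2 can be sketched as follows, with stand-ins for the normalized vectors, cluster labels and raw texts that would come from the previous steps:

```python
import numpy as np

# Stand-ins for the outputs of steps 1 and 2.2.
texts = ["文本一", "文本二", "文本三", "文本四"]
V_norm = np.random.randn(4, 768)
V_norm /= np.linalg.norm(V_norm, axis=1, keepdims=True)
labels = np.array([0, 0, 1, 1])

deduped = []
for g in sorted(set(labels)):
    member_idx = np.where(labels == g)[0]
    center = V_norm[member_idx].mean(axis=0)             # step 2.3.1: cluster center vector
    dists = np.linalg.norm(V_norm[member_idx] - center, axis=1)
    deduped.append(texts[member_idx[np.argmin(dists)]])  # step 2.3.2: sample closest to the center
# len(deduped) == m: the n texts are reduced to m representative texts
```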
In the scheme provided by this embodiment, first, both obtaining text vectors with the semantic model and the clustering algorithm have mature toolkits, so they can be called in one stop; second, because the semantic model is pre-trained on massive texts, it can combine context and word position to accurately capture broad and complex semantic information, so compared with conventional schemes based on word segmentation, keywords and the like, the accuracy of text representation and of the subsequent similarity calculation is qualitatively improved. The method therefore achieves a double improvement in accuracy and convenience for the text similarity deduplication task.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a similar text deduplication device is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and the description of which has been already made is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 3 is a block diagram of a similar text deduplication apparatus according to an embodiment of the present invention, the similar text deduplication apparatus including:
an obtaining unit 31, configured to obtain a text to be processed.
And the output unit 33 is configured to input the text to be processed into the pre-training semantic model, and output a text representation matrix combining the context and the word position information of the text to be processed.
And the first processing unit 35 is configured to perform similar text deduplication processing by using a clustering algorithm with the text representation matrix as input to obtain a processed target text.
According to the embodiment provided by the present application, the acquisition unit 31 acquires the text to be processed; the output unit 33 inputs the text to be processed into the pre-training semantic model and outputs a text representation matrix combining the context and word position information of the text to be processed; and the first processing unit 35 takes the text representation matrix as input and performs similar text deduplication processing with a clustering algorithm to obtain the processed target text. Because the semantic model of the invention is pre-trained on massive texts, it can combine context and word position to accurately capture broad and complex semantic information, thereby solving the technical problem of low deduplication precision for similar texts in the prior art.
Optionally, the processing unit 35 may include: a first input module, configured to input the text representation matrix into the clustering algorithm to generate a plurality of semantically similar clusters; and a determining module, configured to take a representative sample from each of the plurality of semantically similar clusters according to the distances between the text vectors of each cluster and its center vector, to obtain the target text.
Optionally, the output unit 33 may include: an adding module, configured to add [cls] and [seq] at the head and the tail of the text to be processed respectively to obtain a newly added text; a word segmentation module, configured to perform word segmentation processing on the newly added text to obtain a token set whose total number of tokens is l; and a second input module, configured to input the token set into the pre-training semantic model to obtain the identifier id of each token of the token set in the vocabulary of the pre-training semantic model and to generate a first vector of dimension l, wherein the text representation matrix is composed of the representation vectors of different texts, and each text representation vector is obtained by calculation from the first vector, a second vector and a third vector.
Optionally, the first vector χ_input_id is represented as follows:

χ_input_id = (x_[cls], x_[1], x_[2], ..., x_[l-2], x_[seq])

wherein x_[cls] and x_[seq] respectively represent the ids of [cls] and [seq] in the pre-training model vocabulary, and x_[i] represents the id of the i-th token of the newly added text in the pre-training model vocabulary, with 1 ≤ i ≤ l-2.

Optionally, the second vector χ_segment_id defaults to 0 for all tokens, i.e. an l-dimensional vector whose values are all 0: χ_segment_id = (0, 0, ..., 0);

the third vector χ_attention_mask defaults to 1 for all tokens, i.e. an l-dimensional vector whose values are all 1: χ_attention_mask = (1, 1, ..., 1).
Optionally, the apparatus may further include: a second processing unit, configured to process the newly added texts in parallel, with bs texts processed in parallel; and a generating unit, configured to input the bs texts into the pre-training semantic model and generate a first matrix, a second matrix and a third matrix of the bs texts, wherein the first matrix, the second matrix and the third matrix take the first vectors, the second vectors and the third vectors respectively as row vectors, the dimensions are (bs, lmax), lmax is the maximum l value among the bs texts, the χ vector of each text forms a row vector of the corresponding X matrix, and χ vectors whose dimension is less than lmax are padded with 0 at the end, expressed as:

X_k = (χ_k^(1); χ_k^(2); ...; χ_k^(bs))

wherein k ∈ {input_id, attention_mask, segment_id}, χ_k^(i) represents the χ_k vector of the i-th text in the batch, and 1 ≤ i ≤ bs.
Optionally, the output unit 33 may include: a generating module, configured to take the first matrix, the second matrix and the third matrix as input and to generate, through calculation by the pre-training semantic model encoder, a vector representation matrix V of each token in the bs texts, with dimensions (bs, lmax, hs), where hs is the hidden_size of the hidden layer of the pre-training semantic model:

V = f_enc(X_input_id, X_attention_mask, X_segment_id).
Optionally, the apparatus may further perform the following operations: determining the [cls] vector as the text representation vector, and obtaining the hs-dimensional representation vector v_[cls] corresponding to each text in the batch;

wherein v_[cls]^(i) represents the hs-dimensional representation vector of the i-th text in the batch, and 1 ≤ i ≤ bs.
Optionally, the apparatus may further perform the following operations: executing the above steps in a loop to obtain the representation matrix V_[cls] of all texts, with dimensions (n, hs), where n is the total number of texts;

wherein v_[cls]^(j) represents the hs-dimensional representation vector of the j-th text in the total sample, 1 ≤ j ≤ n, and v_[cls]^(j) equals the representation vector of the i-th text of the p-th batch of parallel samples, with j = (p-1)·bs + i and 1 ≤ i ≤ bs.
Optionally, the apparatus may further perform the following operations: normalizing the text representation matrix V_[cls] to obtain a normalized representation matrix V′_[cls], so that the Euclidean distance between the normalized matrix vectors is equivalent to the cosine distance:

v′_[cls]^(j) = v_[cls]^(j) / ||v_[cls]^(j)||, 1 ≤ j ≤ n

wherein v′_[cls]^(j) is the normalized representation vector of the j-th text, obtained by dividing v_[cls]^(j) by its norm.
Optionally, the apparatus may further perform the following operations: taking the normalized representation matrix V′_[cls] as input, performing clustering according to the Euclidean distances between the n text vectors to obtain the cluster label corresponding to each vector, wherein vectors with the same label form a semantically similar cluster, and finally the n text vectors form m clusters, with n ≥ m;

d(i, j) = sqrt( Σ_{k=1..hs} ( v′_k^(i) − v′_k^(j) )² )

wherein d(i, j) represents the Euclidean distance between the normalized i-th and j-th text vectors, v′_k^(j) is the k-th dimensional component of the normalized j-th text vector, and 1 ≤ k ≤ hs.
Optionally, the apparatus may further perform the following operations: calculating the mean of the samples in each cluster as the cluster center vector, the number of cluster center vectors being m:

c^(g) = (1/|g|) · Σ_{v′ in cluster g} v′

wherein c^(g) represents the center vector of the g-th semantically similar cluster, 1 ≤ g ≤ m, and |g| represents the total number of samples contained in the g-th cluster; and finding the sample closest to the center vector in each cluster to form the deduplicated text set, the number of texts in the set being m, finally realizing deduplication from n texts to m texts.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring a text to be processed;
s2, inputting the text to be processed into a pre-training semantic model, and outputting a text representation matrix combining the context and word position information of the text to be processed;
and S3, with the text representation matrix as input, carrying out similar text duplicate removal processing by using a clustering algorithm to obtain a processed target text.
Optionally, in this embodiment, the storage medium may include but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring a text to be processed;
s2, inputting the text to be processed into a pre-training semantic model, and outputting a text representation matrix combining the context and word position information of the text to be processed;
and S3, with the text representation matrix as input, carrying out similar text duplicate removal processing by using a clustering algorithm to obtain a processed target text.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized in a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a memory device and executed by a computing device, and in some cases, the steps shown or described may be executed out of order, or separately as individual integrated circuit modules, or multiple modules or steps thereof may be implemented as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.
Claims (15)
1. A method for similar text deduplication, comprising:
acquiring a text to be processed;
inputting the text to be processed into a pre-training semantic model, and outputting a text representation matrix combining context and word position information of the text to be processed;
and taking the text representation matrix as input, and performing similar text deduplication processing by using a clustering algorithm to obtain a processed target text.
2. The method of claim 1, wherein the step of taking the text representation matrix as input and performing similar text deduplication processing by using a clustering algorithm to obtain a processed target text comprises:
inputting the text representation matrix into the clustering algorithm to generate a plurality of clusters with similar semantics;
and taking a representative sample from each of the semantically similar clusters according to the distance between the text vector of each cluster and the center vector, to obtain the target text.
3. The method of claim 1, wherein the inputting the text to be processed into a pre-trained semantic model and outputting a text representation matrix combining context and word position information of the text to be processed comprises:
for a single text, adding [cls] and [seq] at the head and the tail of the text to be processed respectively to obtain a newly added text;
performing word segmentation processing on the newly added text to obtain a token set whose total number of tokens is l;
and inputting the token set into the pre-training semantic model to obtain the identifier id of each token of the token set in the vocabulary of the pre-training semantic model, and generating a first vector of dimension l, wherein the text representation matrix is composed of the representation vectors of different texts, and each text representation vector is obtained by calculation from the first vector, a second vector and a third vector.
4. The method of claim 3, wherein the first vector χ_input_id is represented as follows:

χ_input_id = (x_[cls], x_[1], x_[2], ..., x_[l-2], x_[seq])

wherein x_[cls] and x_[seq] respectively represent the ids of [cls] and [seq] in the pre-training model vocabulary, and x_[i] represents the id of the i-th token of the newly added text in the pre-training model vocabulary, with 1 ≤ i ≤ l-2.
5. The method of claim 3, wherein, for the second vector and the third vector,

the second vector χ_segment_id defaults to 0 for all tokens, i.e. an l-dimensional vector whose values are all 0: χ_segment_id = (0, 0, ..., 0);

and the third vector χ_attention_mask defaults to 1 for all tokens, i.e. an l-dimensional vector whose values are all 1: χ_attention_mask = (1, 1, ..., 1).
6. The method of claim 5, further comprising:
processing the newly added texts in parallel to obtain bs texts processed in parallel;
inputting the bs texts into the pre-training semantic model, and generating a first matrix, a second matrix and a third matrix of the bs texts, wherein the first matrix, the second matrix and the third matrix take the first vectors, the second vectors and the third vectors respectively as row vectors, the dimensions are (bs, lmax), the text representation matrix is obtained by calculation from the first matrix, the second matrix and the third matrix, lmax is the maximum l value among the bs texts, the χ vector of each text forms a row vector of the corresponding X matrix, and χ vectors whose dimension is less than lmax are padded with 0 at the end, expressed as:

X_k = (χ_k^(1); χ_k^(2); ...; χ_k^(bs))

wherein k ∈ {input_id, attention_mask, segment_id}, χ_k^(i) represents the χ_k vector of the i-th text in the batch, and 1 ≤ i ≤ bs.
7. The method of claim 6, wherein the inputting the text to be processed into a pre-trained semantic model and outputting a text representation matrix combining context and word position information of the text to be processed comprises:

taking the first matrix, the second matrix and the third matrix as input, and generating, through calculation by the pre-training semantic model encoder, a vector representation matrix V of each token in the bs texts, with dimensions (bs, lmax, hs), where hs is the hidden_size of the hidden layer of the pre-training model:

V = f_enc(X_input_id, X_attention_mask, X_segment_id).
8. the method of claim 7, further comprising:
subjecting said [ cls ] to]The vector is determined as the text representation vector, and hs dimension representation vector v corresponding to each text in the batch is obtained [cls] ;
9. The method of claim 8, further comprising:

executing the above steps in a loop to obtain the characterization matrix V_[cls] of all texts, with dimensions (n, hs), wherein n is the total number of texts.
10. The method according to claim 1, wherein the step of performing similar text deduplication processing by using the text characterization matrix as an input and using a clustering algorithm to obtain a processed target text comprises:

normalizing the text characterization matrix V_[cls] to obtain a normalized characterization matrix V′_[cls], wherein the Euclidean distance between the normalized matrix vectors is equivalent to the cosine distance.
11. The method of claim 10, further comprising:

taking the normalized characterization matrix V′_[cls] as input, performing clustering according to the Euclidean distances between the n text vectors to obtain the cluster label corresponding to each vector, wherein vectors with the same label form a semantically similar cluster, and finally the n text vectors form m clusters, with n ≥ m;

d(i, j) = sqrt( Σ_{k=1..hs} ( v′_k^(i) − v′_k^(j) )² )

wherein d(i, j) represents the Euclidean distance between the normalized i-th and j-th text vectors, v′_k^(j) is the k-th dimensional component of the normalized j-th text vector, and 1 ≤ k ≤ hs.
12. The method of claim 2, wherein obtaining the target text by taking a representative sample from each of the plurality of semantically similar clusters according to the distance between the text vector of each cluster and the center vector comprises:

calculating the mean of the samples in each cluster as the cluster center vector, the number of cluster center vectors being m:

c^(g) = (1/|g|) · Σ_{v′ in cluster g} v′

wherein c^(g) represents the center vector of the g-th semantically similar cluster, 1 ≤ g ≤ m, and |g| represents the total number of samples contained in the g-th cluster;
and finding out the sample closest to the central vector in each cluster to form a text set after de-duplication, wherein the number of the text sets is m, and finally de-duplication from n to m texts is realized.
13. A similar text deduplication apparatus, comprising:
the acquisition unit is used for acquiring a text to be processed;
the output unit is used for inputting the text to be processed into a pre-training semantic model and outputting a text representation matrix combining the context and the word position information of the text to be processed;
and the processing unit is used for carrying out similar text duplicate removal processing by using the text representation matrix as input and utilizing a clustering algorithm to obtain a processed target text.
14. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 12 when executed.
15. An electronic device comprising a memory and a processor, wherein the memory has a computer program stored therein, and the processor is configured to execute the computer program to perform the method of any of claims 1 to 12.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211698954.5A | 2022-12-28 | 2022-12-28 | Similar text duplicate removal method and device, storage medium and electronic device |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN115982144A | 2023-04-18 |

Family ID: 85960802

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211698954.5A | Similar text duplicate removal method and device, storage medium and electronic device | 2022-12-28 | 2022-12-28 |

Cited By (2)

| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN118377899A | 2024-04-15 | 2024-07-23 | Text data de-duplication method, apparatus, storage medium and program product |
| CN118378723A | 2024-06-21 | 2024-07-23 | Model training data processing method and device and electronic equipment |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |