CN111737469A - Data mining method and device, terminal equipment and readable storage medium

Info

Publication number: CN111737469A
Application number: CN202010584569.2A
Authority: CN
Inventors: 衣杨; 佘滢; 宋嘉伦; 赵福利; 林倩青; 周晓聪
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2020-06-23
Filing date: 2020-06-23
Publication date: 2020-10-02

Abstract

The embodiment of the invention discloses a data mining method, a data mining device, terminal equipment and a readable storage medium, wherein the method comprises the following steps: processing data to be analyzed to obtain a standard data set; when the standard data set is used as a target set to be classified, selecting a data sample in the target set as a clustering center; performing classification operation by taking the clustering center of a target set and a data sample farthest from the clustering center of the target set as centers; respectively calculating the compactness of each category according to a preset error square sum formula; and repeatedly executing the classification operation by taking the category with the minimum compactness as a new target set until the classification number reaches a preset number. According to the technical scheme, the error square sum formula is used for calculating the compactness of each category respectively, the category with the minimum compactness is used as a new target set to continue to perform classification operation, the condition that the classification process falls into a local optimal solution is avoided, and the accuracy of data processing is improved.

Description

Data mining method and device, terminal equipment and readable storage medium

Technical Field

The present invention relates to the field of data mining, and in particular, to a data mining method, apparatus, terminal device, and readable storage medium.

Background

With the development and application of network technology and the explosive growth of information resources, research on text mining, information filtering and information retrieval has unprecedented prospects. Clustering techniques are therefore becoming the core of text information mining. Text clustering is an important technique used in text mining to discover the distribution of data and the patterns implicit in it. At present, some simple clustering algorithms are widely applied in the field of data mining because their principles are simple, they are easy to implement, and they converge quickly. However, such algorithms produce different clustering results for different initial values and easily fall into a local minimum, so the clustering results are not ideal, which hinders effective and accurate mining and objective analysis of large amounts of data.

Disclosure of Invention

In view of the above problems, the present invention provides a data mining method, apparatus, terminal device and readable storage medium.

One embodiment of the present invention provides a data mining method, including:

processing data to be analyzed to obtain a standard data set;

when the standard data set is used as a target set to be classified, selecting a data sample in the target set as a clustering center;

performing classification operation by taking the clustering center of a target set and a data sample farthest from the clustering center of the target set as centers;

respectively calculating the compactness of each category according to a preset error square sum formula;

and repeatedly executing the classification operation by taking the category with the minimum compactness as a new target set until the classification number reaches a preset number.

The data mining method according to the foregoing embodiment, which performs classification operations centering on the cluster center of the target set and the data sample farthest from the cluster center of the target set, includes:

calculating a first distance between each data sample in the target set and the clustering center of the target set and a second distance between each data sample in the target set and the data sample farthest from the clustering center of the target set according to a distance measurement formula;

taking data samples with the first distance smaller than the second distance as a class;

and taking the data samples with the first distance being greater than or equal to the second distance as another class.

The data mining method according to the foregoing embodiment determines the data sample farthest from the cluster center of the target set by:

calculating the distance between each data sample in the target set and the clustering center of the target set through a distance measurement formula;

and selecting the data sample with the largest distance as the data sample farthest from the clustering center of the target set.

In the data mining method according to the above embodiment, the distance metric formula is as follows:

where dis represents the distance between two data samples, A_i represents the i-th coordinate point of the weight vector of one data sample, B_i represents the i-th coordinate point of the weight vector of the other data sample, and n represents the number of coordinate points in the weight vector.

In the data mining method according to the above embodiment, the sum of squared errors formula is as follows:

where ASSE represents the sum of squared errors and reflects the compactness of the category whose cluster center is c_l, c_k represents the other cluster center, m represents the number of data samples in the category whose cluster center is c_l, x_j represents the j-th data sample in the category whose cluster center is c_l, and r represents a regularization constant.

The data mining method according to the above embodiment, where the processing of the data to be analyzed to obtain the standard data set includes:

performing text segmentation on data to be analyzed, and constructing a bag-of-words model vector;

counting common words and important words in the bag-of-words model vector by using the term frequency-inverse document frequency (TF-IDF) method to obtain a text-vocabulary matrix;

and performing dimensionality reduction on the text-vocabulary matrix to obtain a standard data set.

Another embodiment of the present invention provides a data mining apparatus, including:

the data preprocessing module is used for processing the data to be analyzed to obtain a standard data set;

the initial clustering center selection module is used for selecting a data sample as a clustering center in a target set when the standard data set is taken as the target set to be classified;

the classification operation execution module is used for executing classification operation by taking the clustering center of a target set and a data sample farthest from the clustering center of the target set as centers;

the compactness calculation module is used for calculating the compactness of each category according to a preset error square sum formula;

and the new target set determining module is used for taking the category with the minimum compactness as a new target set and repeatedly executing the classification operation until the classification number reaches a preset number.

The data mining device according to the above embodiment determines the data sample farthest from the cluster center of the target set by:

The above embodiments relate to a terminal device, comprising a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to perform the data mining method of the above embodiments.

The above embodiments relate to a readable storage medium storing a computer program which, when run on a processor, performs the data mining method of the above embodiments.

The data mining method disclosed by the invention processes data to be analyzed to obtain a standard data set; when the standard data set is used as a target set to be classified, selecting a data sample in the target set as a clustering center; performing classification operation by taking the clustering center of a target set and a data sample farthest from the clustering center of the target set as centers; respectively calculating the compactness of each category according to a preset error square sum formula; and repeatedly executing the classification operation by taking the category with the minimum compactness as a new target set until the classification number reaches a preset number. According to the technical scheme, massive network data can be collected and sorted, main information related to the data is mined, the data mining process is more objective, the subjectivity of manual operation is avoided, and the data processing efficiency is improved. In addition, the method is used for processing the mass data, and the obtained information is more accurate and reliable.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention. Like components are numbered similarly in the various figures.

FIG. 1 is a flow chart of a data mining method according to an embodiment of the present invention;

FIG. 2 is a flow chart illustrating a data processing procedure involved in a data mining method according to an embodiment of the present invention;

FIG. 3 is a flow chart illustrating a classification operation involved in a data mining method provided by an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a data mining device according to an embodiment of the present invention.

Main element symbols:

1-data mining device; 100-data preprocessing module; 200-initial clustering center selection module; 300-classification operation execution module; 400-compactness calculation module; 500-new target set determination module.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.

The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

Hereinafter, the terms "including", "having", and their derivatives, which may be used in various embodiments of the present invention, are only intended to indicate specific features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as first excluding the existence of, or adding to, one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.

Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.

Example 1

In this embodiment, referring to FIG. 1, a data mining method is shown, which includes the following steps:

Step S100: Process the data to be analyzed to obtain a standard data set.

At present, search engines are widely used, and users are accustomed to searching for and obtaining resource information through them. Massive amounts of resource information are stored on the Internet. A search engine can retrieve a large amount of data according to a user's search keywords, and a user can also use a web crawler to crawl a large amount of data from the Internet. The large amount of data obtained needs to be analyzed and processed so that the user can obtain effective data of practical value.

First, the acquired data to be analyzed needs to be processed to obtain a standard data set. Referring to fig. 2, the processing of the data to be analyzed includes the steps of:

Step S101: Perform text segmentation on the data to be analyzed, and construct a bag-of-words model vector.

Text segmentation is performed on the data to be analyzed. A long text can be segmented accurately and reasonably by using the Python Chinese word segmentation component "jieba" to obtain the corresponding word set. Repeated words are removed from the word set to obtain a new word set T, and the words contained in each evaluation index are represented by a bag-of-words model. That is, a vector is constructed for each evaluation index: the dimension of the vector equals the number of words in the word set T, each value of the vector is the number of times the corresponding word of T occurs in the evaluation text, and the order of the elements in the vector is consistent with the order of the words in the dictionary.

Exemplarily, when mining and analyzing a large amount of paper data, the corresponding evaluation indexes may include science, society, humanities, astronomy and the like. Each evaluation index includes a plurality of words, so a vector may be constructed from the word set under each evaluation index; the dimension of the vector equals the number of words, and each value in the vector is given by the number of occurrences of the corresponding word.
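The bag-of-words construction of step S101 can be sketched as follows. This is only an illustration under stated assumptions: the texts are assumed to be already segmented into word lists (the embodiment uses jieba for that step; a hypothetical pre-segmented input stands in here), and the helper names `build_vocabulary` and `bag_of_words` are illustrative and do not come from the source.

```python
from collections import Counter

def build_vocabulary(segmented_texts):
    """Return the de-duplicated word set T as a list in first-seen order."""
    vocabulary = []
    seen = set()
    for words in segmented_texts:
        for word in words:
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

def bag_of_words(words, vocabulary):
    """Count how often each vocabulary word occurs in one segmented text."""
    counts = Counter(words)
    return [counts[word] for word in vocabulary]

# Hypothetical pre-segmented texts (in practice produced by a word segmenter).
texts = [
    ["data", "mining", "text", "data"],
    ["text", "clustering", "center"],
]
vocab = build_vocabulary(texts)
vectors = [bag_of_words(t, vocab) for t in texts]
```

Each vector has one position per word of T, so all vectors share the same dimension and element order regardless of which words a given text contains.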

Step S102: Count common words and important words in the bag-of-words model vector by using the term frequency-inverse document frequency (TF-IDF) method to obtain a text-vocabulary matrix.

The constructed bag-of-words model vectors can be converted into a text-vocabulary matrix by using term frequency-inverse document frequency (TF-IDF). The method aims to find words that appear frequently in a given text but rarely in other texts, so that common words are filtered out and important words are retained, thereby obtaining the text-vocabulary matrix.
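A minimal sketch of the TF-IDF weighting is given below. The source does not state which TF-IDF variant is used, so the unsmoothed form tf × log(N / df) is an assumption made here, and the function name `tf_idf_matrix` is illustrative.

```python
import math

def tf_idf_matrix(count_vectors):
    """Convert per-text word counts into a text-vocabulary TF-IDF matrix."""
    n_texts = len(count_vectors)
    n_words = len(count_vectors[0])
    # Document frequency: number of texts in which each word occurs.
    doc_freq = [sum(1 for row in count_vectors if row[j] > 0)
                for j in range(n_words)]
    matrix = []
    for row in count_vectors:
        total = sum(row) or 1  # guard against an empty text
        weights = []
        for j, count in enumerate(row):
            tf = count / total
            # Assumed variant: unsmoothed inverse document frequency.
            idf = math.log(n_texts / doc_freq[j]) if doc_freq[j] else 0.0
            weights.append(tf * idf)
        matrix.append(weights)
    return matrix

# Hypothetical count vectors for two texts over a five-word vocabulary.
counts = [[2, 1, 1, 0, 0],
          [0, 0, 1, 1, 1]]
weights = tf_idf_matrix(counts)
```

In this example the third word occurs in both texts, so its weight is zero in both rows (it is filtered as a common word), while a word that is frequent in only one text keeps a positive weight.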

Step S103: Perform dimensionality reduction on the text-vocabulary matrix to obtain a standard data set.

The text-vocabulary matrix is a matrix composed of sparse vectors, in which each row represents a text and each column corresponds to a word related to the texts. Because the number of texts is far smaller than the number of words, the matrix is high-dimensional, so principal component analysis (PCA) is used for dimensionality reduction: the text-vocabulary matrix is converted through a linear transformation so that the main characteristic components of the data are extracted to obtain the standard data set.
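The PCA step can be illustrated with the following sketch. The source names PCA but gives no implementation; power iteration with deflation is a technique chosen here only to keep the example free of external libraries, and all function names are hypothetical.

```python
import math

def center(rows):
    """Subtract the column means so each feature has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def pca_project(rows, n_components):
    """Project rows onto the first n_components principal directions."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        v = top_component(cov)
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                         for a in range(d))
        components.append(v)
        # Deflate: remove the found direction so the next pass finds the next one.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]

# Hypothetical 3-dimensional rows that vary only along the direction (1, 2, 0).
data = [[0.0, 0.0, 5.0], [1.0, 2.0, 5.0], [2.0, 4.0, 5.0], [3.0, 6.0, 5.0]]
reduced = pca_project(data, n_components=1)
```

Here the three-dimensional rows are reduced to one coordinate each, and the spacing between consecutive rows along the principal direction is preserved. A production system would more likely rely on an established numerical library than on this hand-written routine.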

Step S200: When the standard data set is taken as a target set to be classified, select a data sample in the target set as a clustering center.

The standard data set comprises a plurality of data samples, and one data sample can be randomly selected, and the data sample point is used as an initial clustering center.

Step S300: Perform a classification operation by taking the clustering center of the target set and the data sample farthest from the clustering center of the target set as centers.

The distance between each data sample in the target set and the cluster center of the target set may be calculated using the following distance metric formula. The Euclidean distance tends to reflect differences in numerical value, whereas cosine similarity focuses more on the difference between two vectors in direction, so cosine similarity is more suitable for measuring text similarity. Accordingly, the distance metric formula is as follows:

The distance between two data samples may represent the similarity between the two data samples, i.e. the greater the distance between two data samples, the less the similarity between two data samples; the smaller the distance between two data samples, the greater the similarity between the two data samples.

In this embodiment, the initially set clustering center is referred to as the first clustering center, and the data sample that the distance metric formula determines to be farthest from the initially set clustering center of the target set is referred to as the second clustering center.
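The selection of the second clustering center can be sketched as follows. The distance metric formula itself is not reproduced in this text, so the form used here, 1 minus the cosine similarity, is an assumption chosen so that a larger distance means less similarity, as described above. The names `cosine_distance` and `farthest_sample` are illustrative.

```python
import math

def cosine_distance(a, b):
    """Assumed form: 1 - cosine similarity of weight vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here for zero vectors
    return 1.0 - dot / (norm_a * norm_b)

def farthest_sample(samples, center):
    """Return the index of the sample with the largest distance to center."""
    distances = [cosine_distance(s, center) for s in samples]
    return max(range(len(samples)), key=distances.__getitem__)

# Hypothetical weight vectors; the first one plays the first clustering center.
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
first_center = samples[0]
second_index = farthest_sample(samples, first_center)
second_center = samples[second_index]
```

With these vectors the third sample is orthogonal to the first clustering center and is therefore selected as the second clustering center.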

And performing a classification operation according to the first clustering center and the second clustering center to classify the target set into two classes.

Referring to FIG. 3, the classification operation includes the following steps:

Step S301: Calculate, according to the distance metric formula, a first distance between each data sample in the target set and the clustering center of the target set, and a second distance between each data sample in the target set and the data sample farthest from the clustering center of the target set.

Illustratively, a first distance between each data sample in the target set and the cluster center of the target set is calculated according to the distance metric formula,

where dis1 represents the distance between the k-th data sample and the first cluster center C₁, A_{k,i} represents the i-th coordinate point of the weight vector of the k-th data sample, C_{1,i} represents the i-th coordinate point of the weight vector of the first cluster center, and n represents the number of coordinate points in the weight vector.

Illustratively, a second distance between each data sample in the target set and the data sample farthest from the cluster center of the target set is calculated according to the distance metric formula,

where dis2 represents the distance between the k-th data sample and the second cluster center C₂, A_{k,i} represents the i-th coordinate point of the weight vector of the k-th data sample, C_{2,i} represents the i-th coordinate point of the weight vector of the second cluster center, and n represents the number of coordinate points in the weight vector.

Step S302: Take the data samples whose first distance is smaller than the second distance as one class.

The first distance is compared with the second distance, and the data samples whose first distance is smaller than the second distance are taken as one class; the cluster center of this class is the first cluster center.

Step S303: Take the data samples whose first distance is greater than or equal to the second distance as another class.

The first distance is compared with the second distance, and the data samples whose first distance is greater than or equal to the second distance are taken as another class; the cluster center of this class is the second cluster center.
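Steps S301 to S303 can be sketched as follows. As an assumption, the distance is again taken to be 1 minus the cosine similarity, since the formula itself is not reproduced in this text; the function name `split_target_set` is illustrative.

```python
import math

def cosine_distance(a, b):
    """Assumed distance: 1 - cosine similarity of two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def split_target_set(samples, first_center, second_center):
    """Assign each sample to the class of the nearer of the two centers."""
    first_class, second_class = [], []
    for sample in samples:
        dis1 = cosine_distance(sample, first_center)   # first distance (S301)
        dis2 = cosine_distance(sample, second_center)  # second distance (S301)
        if dis1 < dis2:
            first_class.append(sample)                 # step S302
        else:
            second_class.append(sample)                # step S303, ties included
    return first_class, second_class

# Hypothetical target set; the first and last samples act as the two centers.
samples = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
first_class, second_class = split_target_set(samples, samples[0], samples[3])
```

A sample that is equally distant from both centers falls into the second class, which matches the "greater than or equal to" condition of step S303.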

Step S400: Calculate the compactness of each category according to a preset sum of squared errors formula.

The compactness of the first cluster and the compactness of the second cluster are calculated respectively. This embodiment computes compactness using the sum of squared errors formula, where ASSE represents the sum of squared errors and reflects the compactness of the category whose cluster center is c_l, c_k represents the other cluster center, m represents the number of data samples in the category whose cluster center is c_l, x_j represents the j-th data sample in that category, and r represents a regularization constant; in this embodiment, r = 1.

Exemplarily, when calculating the compactness of the first cluster, whose cluster center is c₁, c₁ is used as c_l, c₂ is used as c_k, and x_j represents the j-th data sample in the category whose cluster center is c₁. The sum of squared errors of the first cluster reflects the compactness of the first cluster.

Exemplarily, when calculating the compactness of the second cluster, whose cluster center is c₂, c₂ is used as c_l, c₁ is used as c_k, and x_j represents the j-th data sample in the category whose cluster center is c₂. The sum of squared errors of the second cluster reflects the compactness of the second cluster.

It should be understood that the numerator of the sum of squared errors formula reflects how tightly the data samples of a cluster are grouped: the smaller the numerator, the smaller the distance between each data sample point in the cluster and the cluster center, and the stronger the compactness of the cluster. The denominator of the sum of squared errors formula reflects the closeness between the data sample points of a cluster and another cluster: the larger the denominator, the larger the distance between each data sample point in the cluster and the center of the other cluster, the less close the two clusters are and the greater their separation, which likewise reflects stronger compactness of the cluster. It can be understood that the smaller the sum of squared errors of a cluster, the more compact the cluster.
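The compactness calculation can be illustrated with the sketch below. The exact formula is not reproduced in this text, so the form used here is an assumption inferred from the description of its numerator, denominator, m and r: the mean over the cluster of ||x_j - c_l||² / (||x_j - c_k||² + r), with squared Euclidean distance as the error. The names `squared_error` and `asse` are illustrative.

```python
def squared_error(a, b):
    """Squared Euclidean distance between two weight vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def asse(cluster, own_center, other_center, r=1.0):
    """Assumed form: mean of ||x_j - c_l||^2 / (||x_j - c_k||^2 + r).

    cluster      -- the m data samples x_j of the class
    own_center   -- c_l, the cluster center of this class
    other_center -- c_k, the other cluster center
    r            -- regularization constant (1 in the embodiment)
    """
    m = len(cluster)
    if m == 0:
        return 0.0
    total = sum(
        squared_error(x, own_center) / (squared_error(x, other_center) + r)
        for x in cluster
    )
    return total / m

# Hypothetical clusters: one tight around its center, one more spread out.
tight = [[0.0, 0.0], [0.0, 1.0]]
loose = [[10.0, 0.0], [10.0, 4.0]]
tight_score = asse(tight, own_center=[0.0, 0.0], other_center=[10.0, 0.0])
loose_score = asse(loose, own_center=[10.0, 0.0], other_center=[0.0, 0.0])
```

Under this assumed form the tight cluster receives the smaller value, consistent with the statement that a smaller sum of squared errors indicates a more compact cluster. The regularization constant keeps the denominator positive even when a sample coincides with the other cluster center.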

Step S500: Repeat the classification operation by taking the category with the minimum compactness as a new target set until the number of classes reaches a preset number.

The compactness of the first cluster is compared with the compactness of the second cluster, and the category with the minimum compactness, that is, the category with the largest sum of squared errors, is taken as the new target set. Steps S300, S400 and S500 are executed repeatedly until the number of classes reaches the preset number.
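The overall loop of steps S200 to S500 can be sketched as follows. This is only an illustration under stated assumptions: the distance is taken as 1 minus the cosine similarity and the compactness score as the mean of ||x_j - c_l||² / (||x_j - c_k||² + r), neither of which is confirmed by the formulas missing from this text. The function `bisect_clusters` and its parameters `initial_index` and `seed` are hypothetical.

```python
import math
import random

def cosine_distance(a, b):
    """Assumed distance: 1 - cosine similarity of two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def asse(cluster, own_center, other_center, r=1.0):
    """Assumed compactness score; larger values mean a less compact class."""
    total = 0.0
    for x in cluster:
        own = sum((p - q) ** 2 for p, q in zip(x, own_center))
        other = sum((p - q) ** 2 for p, q in zip(x, other_center))
        total += own / (other + r)
    return total / len(cluster)

def bisect_clusters(samples, n_classes, initial_index=None, seed=0):
    """Repeatedly split the least compact class until n_classes exist."""
    target = list(samples)
    # Step S200: pick the initial clustering center (randomly unless given).
    if initial_index is None:
        center = random.Random(seed).choice(target)
    else:
        center = target[initial_index]
    finished = []  # classes that are not split any further
    while len(finished) + 1 < n_classes and len(target) > 1:
        # Step S300: the second center is the sample farthest from the first.
        far = max(target, key=lambda s: cosine_distance(s, center))
        first, second = [], []
        for s in target:
            if cosine_distance(s, center) < cosine_distance(s, far):
                first.append(s)
            else:
                second.append(s)
        if not first or not second:
            break  # degenerate split; stop rather than loop forever
        # Step S400: compactness of each class.
        score_first = asse(first, center, far)
        score_second = asse(second, far, center)
        # Step S500: the class with the larger score becomes the new target.
        if score_first >= score_second:
            finished.append(second)
            target = first
        else:
            finished.append(first)
            target, center = second, far
    return finished + [target]

# Hypothetical weight vectors forming three groups by direction.
data = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.8], [1.0, 0.9], [0.0, 1.0], [0.1, 1.0]]
classes = bisect_clusters(data, 3, initial_index=0)
```

The new target set keeps the cluster center of the class it came from, so each later split again uses that center and the sample farthest from it. The guard against an empty class is an addition made for this sketch and is not described in the source.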

The preset number of classes can be set according to the actual situation. Exemplarily, when data mining is performed on the evaluation index data of the four major university rankings, all the evaluation index data are subjected to cluster analysis to generate a new evaluation system, and the number of classes can be set to 4 so that all the evaluation indexes are reasonably divided into four categories. This avoids the subjectivity of manual classification, makes the data mining and processing process more objective, and improves the speed of data mining.

Exemplarily, for a paper recommendation system, the above technical solution can be used to perform data mining on the paper data that a user has retrieved. For example, paper data in the field of artificial intelligence can be classified according to ten machine learning algorithms so as to divide the paper data into 10 categories. User habits can be obtained from the mining results, and a user profile can be built so that related academic papers can subsequently be recommended to the user, thereby improving the user experience.

Exemplarily, for a search engine, the above technical solution can be used to perform data mining on the information data that a user has retrieved. For example, the information data may be divided into a plurality of categories, such as society, science, humanities, history and astronomy, in order to mine user habits and build a user profile, so that relevant information can be accurately recommended to the user in the future.

The data mining method disclosed by the embodiment processes data to be analyzed to obtain a standard data set; when the standard data set is used as a target set to be classified, selecting a data sample in the target set as a clustering center; performing classification operation by taking the clustering center of a target set and a data sample farthest from the clustering center of the target set as centers; respectively calculating the compactness of each category according to a preset error square sum formula; and repeatedly executing the classification operation by taking the category with the minimum compactness as a new target set until the classification number reaches a preset number. According to the technical scheme, massive network data can be collected and sorted, main information related to the data is mined, the data mining process is more objective, the subjectivity of manual operation is avoided, and the data processing efficiency is improved. In addition, the method is used for processing the mass data, and the obtained information is more accurate and reliable.

Example 2

In the present embodiment, referring to FIG. 4, a data mining apparatus 1 is shown, including: a data preprocessing module 100, an initial clustering center selecting module 200, a classification operation executing module 300, a compactness calculating module 400 and a new target set determining module 500.

A data preprocessing module 100, configured to process data to be analyzed to obtain a standard data set; an initial clustering center selecting module 200, configured to select a data sample as a clustering center in a target set when the standard data set is used as the target set to be classified; a classification operation executing module 300, configured to execute a classification operation centering on a cluster center of a target set and a data sample farthest from the cluster center of the target set; the compactness calculation module 400 is used for calculating the compactness of each category according to a preset error square sum formula; and a new target set determining module 500, configured to repeatedly perform the classification operation with the category with the smallest compactness as a new target set until the number of classifications reaches a preset number.

In this embodiment, the data mining apparatus 1 is configured to execute the data mining method according to the above embodiment by using the data preprocessing module 100, the initial clustering center selecting module 200, the classifying operation executing module 300, the compactness calculating module 400, and the new target set determining module 500 in a matching manner, and the implementation scheme and the beneficial effect related to the above embodiment are also applicable to this embodiment, and are not described herein again.

It should be understood that the present embodiment relates to a terminal device, comprising a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to execute the data mining method according to the above embodiment.

It should be appreciated that the present embodiments relate to a readable storage medium storing a computer program which, when run on a processor, performs the data mining method described in the above embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, each functional module or unit in each embodiment of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program code.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims

1. A method of data mining, the method comprising:

processing data to be analyzed to obtain a standard data set;

2. The data mining method of claim 1, wherein the classifying operation is performed centering on a cluster center of a target set and a data sample farthest from the cluster center of the target set, and comprises:

3. The data mining method of claim 1, wherein the data sample furthest from the cluster center of the target set is determined by:

4. A method of data mining according to claim 2 or 3, wherein the distance metric is formulated as follows:

where dis represents the distance between two data samples, A_i represents the i-th coordinate point of the weight vector of one data sample, B_i represents the i-th coordinate point of the weight vector of the other data sample, and n represents the number of coordinate points in the weight vector.

5. The data mining method of claim 1, wherein the sum of squared errors formula is as follows:

where ASSE represents the sum of squared errors and reflects the compactness of the category whose cluster center is c_l, c_k represents the other cluster center, m represents the number of data samples in the category whose cluster center is c_l, x_j represents the j-th data sample in the category whose cluster center is c_l, and r represents a regularization constant.

6. The data mining method of claim 1, wherein the processing the data to be analyzed to obtain a standard data set comprises:

7. A data mining device, the device comprising:

8. The data mining device of claim 7, wherein the data sample furthest from the cluster center of the target set is determined by:

9. A terminal device, comprising a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to perform the data mining method of any one of claims 1 to 6.

10. A readable storage medium, characterized in that it stores a computer program which, when run on a processor, performs the data mining method of any of claims 1 to 6.