CN111737469A - Data mining method and device, terminal equipment and readable storage medium - Google Patents
- Publication number
- CN111737469A (application CN202010584569.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- target set
- distance
- clustering center
- data sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
An embodiment of the invention discloses a data mining method, a data mining apparatus, a terminal device and a readable storage medium. The method comprises: processing data to be analyzed to obtain a standard data set; when the standard data set is taken as the target set to be classified, selecting a data sample in the target set as a clustering center; performing a classification operation with the clustering center of the target set and the data sample farthest from that clustering center as the two centers; calculating the compactness of each category according to a preset sum-of-squared-errors formula; and taking the category with the lowest compactness as the new target set and repeating the classification operation until the number of categories reaches a preset number. Because the compactness of each category is computed with the sum-of-squared-errors formula and the least compact category is split further, the classification process avoids falling into a locally optimal solution, and the accuracy of data processing is improved.
Description
Technical Field
The present invention relates to the field of data mining, and in particular, to a data mining method, apparatus, terminal device, and readable storage medium.
Background
With the development and application of network technology and the explosive growth of information resources, research on text mining, information filtering and information retrieval has unprecedented prospects, and clustering is becoming a core technique of text information mining. Text clustering is an important technique used in text mining to discover the distribution of data and the patterns implicit in it. At present, some simple clustering algorithms are widely applied in the field of data mining because their principle is simple, they are easy to implement and they converge quickly. However, such algorithms produce different clustering results for different initial values and easily fall into a local minimum, so the clustering results are not ideal, which hinders effective and accurate mining and objective analysis of large amounts of data.
Disclosure of Invention
In view of the above problems, the present invention provides a data mining method, apparatus, terminal device and readable storage medium.
One embodiment of the present invention provides a data mining method, including:
processing data to be analyzed to obtain a standard data set;
when the standard data set is used as a target set to be classified, selecting a data sample in the target set as a clustering center;
performing classification operation by taking the clustering center of a target set and a data sample farthest from the clustering center of the target set as centers;
respectively calculating the compactness of each category according to a preset error square sum formula;
and repeatedly executing the classification operation by taking the category with the minimum compactness as a new target set until the classification number reaches a preset number.
The data mining method according to the foregoing embodiment, which performs classification operations centering on the cluster center of the target set and the data sample farthest from the cluster center of the target set, includes:
calculating a first distance between each data sample in the target set and the clustering center of the target set and a second distance between each data sample in the target set and the data sample farthest from the clustering center of the target set according to a distance measurement formula;
taking data samples with the first distance smaller than the second distance as a class;
and taking the data samples with the first distance being greater than or equal to the second distance as another class.
The data mining method according to the foregoing embodiment determines the data sample farthest from the cluster center of the target set by:
calculating the distance between each data sample in the target set and the clustering center of the target set through a distance measurement formula;
and selecting the data sample with the largest distance as the data sample farthest from the clustering center of the target set.
In the data mining method according to the above embodiment, the distance metric formula is as follows:
dis denotes the distance between two data samples, $A_i$ denotes the i-th coordinate of the weight vector of one data sample, $B_i$ denotes the i-th coordinate of the weight vector of the other data sample, and n denotes the number of coordinates in the weight vector.
In the data mining method according to the above embodiment, the sum of squared errors formula is as follows:
ASSE denotes the sum of squared errors and reflects the compactness of the category whose clustering center is $c_l$; $c_k$ denotes the other clustering center; m denotes the number of data samples in the category whose clustering center is $c_l$; $x_j$ denotes the j-th data sample in that category; and r denotes a regularization constant.
The data mining method according to the above embodiment, where the processing of the data to be analyzed to obtain the standard data set includes:
performing text segmentation on data to be analyzed, and constructing a bag-of-words model vector;
weighting common words and important words in the bag-of-words model vector by using the term frequency-inverse document frequency (TF-IDF) method to obtain a text-vocabulary matrix;
and performing dimensionality reduction on the text-vocabulary matrix to obtain a standard data set.
Another embodiment of the present invention provides a data mining apparatus, including:
the data preprocessing module is used for processing the data to be analyzed to obtain a standard data set;
the initial clustering center selection module is used for selecting a data sample as a clustering center in a target set when the standard data set is taken as the target set to be classified;
the classification operation execution module is used for executing classification operation by taking the clustering center of a target set and a data sample farthest from the clustering center of the target set as centers;
the compactness calculation module is used for calculating the compactness of each category according to a preset error square sum formula;
and the new target set determining module is used for taking the category with the minimum compactness as a new target set and repeatedly executing the classification operation until the classification number reaches a preset number.
The data mining device according to the above embodiment determines the data sample farthest from the cluster center of the target set by:
calculating the distance between each data sample in the target set and the clustering center of the target set through a distance measurement formula;
and selecting the data sample with the largest distance as the data sample farthest from the clustering center of the target set.
The above embodiments relate to a terminal device, comprising a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to perform the data mining method of the above embodiments.
The above embodiments relate to a readable storage medium storing a computer program which, when run on a processor, performs the data mining method of the above embodiments.
The data mining method disclosed by the invention processes data to be analyzed to obtain a standard data set; when the standard data set is used as a target set to be classified, selecting a data sample in the target set as a clustering center; performing classification operation by taking the clustering center of a target set and a data sample farthest from the clustering center of the target set as centers; respectively calculating the compactness of each category according to a preset error square sum formula; and repeatedly executing the classification operation by taking the category with the minimum compactness as a new target set until the classification number reaches a preset number. According to the technical scheme, massive network data can be collected and sorted, main information related to the data is mined, the data mining process is more objective, the subjectivity of manual operation is avoided, and the data processing efficiency is improved. In addition, the method is used for processing the mass data, and the obtained information is more accurate and reliable.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention. Like components are numbered similarly in the various figures.
FIG. 1 is a flow chart of a data mining method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a data processing procedure involved in a data mining method according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating a classification operation involved in a data mining method provided by an embodiment of the present invention;
FIG. 4 shows a schematic structural diagram of a data mining device according to an embodiment of the present invention.
Main element symbols:
1-data mining device; 100-data preprocessing module; 200-initial clustering center selection module; 300-classification operation execution module; 400-compactness calculation module; 500-new target set determination module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Hereinafter, the terms "including", "having", and their derivatives, as used in various embodiments of the present invention, are intended only to indicate specific features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as excluding the existence or possible addition of one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.
Furthermore, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.
Example 1
In this embodiment, referring to fig. 1, a data mining method is shown, which includes the following steps:
Step S100: Process the data to be analyzed to obtain a standard data set.
At present, various search engines are widely used, and users are used to search resource information from the search engines and obtain corresponding resource information. Massive resource information is stored in the internet, a search engine can retrieve a large amount of data according to search keywords of a user, the user can also use a web crawler to crawl a large amount of data from the internet, and the obtained large amount of data needs to be analyzed and processed so that the user can obtain effective data with utilization value.
First, the acquired data to be analyzed needs to be processed to obtain a standard data set. Referring to fig. 2, the processing of the data to be analyzed includes the steps of:
Step S101: Perform text segmentation on the data to be analyzed and construct bag-of-words model vectors.
Text segmentation is performed on the data to be analyzed; a long text can be segmented accurately and reasonably with the Python Chinese word segmentation component "jieba" to obtain a corresponding vocabulary set. Repeated words are removed from the vocabulary set to obtain a new word set T, and the words contained in each evaluation index are represented with a bag-of-words model. That is, a vector is constructed for each evaluation index: the dimension of the vector equals the number of words in the word set T, each element is the frequency with which the corresponding word of T occurs in the evaluation text, and the order of the elements in the vector is consistent with the order of the words in the dictionary.
Exemplarily, when mining and analyzing a large amount of paper data, the evaluation indexes may include science, society, humanities, astronomy and the like. Each evaluation index includes a plurality of words, so a vector can be constructed from the word set under each evaluation index; the dimension of the vector equals the number of words, and each element of the vector is the frequency of occurrence of the corresponding word.
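The bag-of-words construction described in step S101 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `build_bag_of_words` and the sample tokens are hypothetical, and the texts are assumed to be segmented already (with jieba, a token list could be obtained from `jieba.lcut(text)`).

```python
from collections import Counter


def build_bag_of_words(tokenized_texts):
    """Return (vocabulary, vectors); each vector holds per-word counts."""
    # Deduplicated word set T, kept in first-appearance order so that
    # vector positions follow the order of words in the dictionary.
    vocabulary = []
    seen = set()
    for tokens in tokenized_texts:
        for word in tokens:
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)

    # One vector per text: element j counts occurrences of vocabulary[j].
    vectors = []
    for tokens in tokenized_texts:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical pre-segmented texts used only for illustration.
texts = [["data", "mining", "data"], ["text", "mining"]]
vocab, vecs = build_bag_of_words(texts)
```

Every vector has the same dimension as the word set T, which is the property the later TF-IDF and dimensionality-reduction steps rely on.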
Step S102: Use the term frequency-inverse document frequency (TF-IDF) method to weight common words and important words in the bag-of-words model vectors and obtain a text-vocabulary matrix.
The constructed bag-of-words model vectors are converted into a text-vocabulary matrix using TF-IDF. The aim is to find words that appear frequently in a given text but rarely in other texts, so that common words are filtered out and important words are retained, yielding the text-vocabulary matrix.
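A minimal sketch of the TF-IDF weighting in step S102 is given below. The text does not state which TF-IDF variant is used, so the unsmoothed form tf = count / text length and idf = log(N / df) is an assumption, as are the function name `tf_idf` and the sample counts.

```python
import math


def tf_idf(count_vectors):
    """Convert bag-of-words count vectors into a TF-IDF text-vocabulary matrix."""
    n_docs = len(count_vectors)
    n_words = len(count_vectors[0])
    # Document frequency: number of texts in which each word occurs.
    df = [sum(1 for row in count_vectors if row[j] > 0) for j in range(n_words)]

    matrix = []
    for row in count_vectors:
        total = sum(row) or 1  # avoid division by zero for an empty text
        matrix.append([
            (row[j] / total) * math.log(n_docs / df[j]) if df[j] else 0.0
            for j in range(n_words)
        ])
    return matrix


# Hypothetical counts: rows are texts, columns are vocabulary words.
counts = [[2, 1, 0], [0, 1, 1]]
weights = tf_idf(counts)
```

In this example the second word occurs in both texts, so its weight is zero everywhere; this is the filtering of common words that the step describes.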
Step S103: Perform dimensionality reduction on the text-vocabulary matrix to obtain the standard data set.
The text-vocabulary matrix is composed of sparse vectors: each row represents a text and each column corresponds to a word of the vocabulary. Because the number of texts is far smaller than the number of words, principal component analysis (PCA) is applied to reduce the dimensionality: the text-vocabulary matrix is converted through a linear transformation so that the main feature components of the data are extracted, yielding the standard data set.
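To illustrate the linear transformation behind PCA, the sketch below extracts only the first principal component by power iteration on the covariance matrix and projects each row onto it. It is a simplified stand-in under stated assumptions: a practical pipeline would normally keep several components using a linear algebra library, and the function name, the iteration count and the starting vector are hypothetical choices.

```python
def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the leading principal component."""
    n, d = len(rows), len(rows[0])
    # Center each column so the covariance is computed about the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    # Power iteration; the uneven start vector is an arbitrary choice that
    # could still be orthogonal to the leading direction in contrived cases.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    # Project each centered row onto the component (the reduced coordinate).
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical data lying exactly on the line y = x.
direction, scores = first_principal_component([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

For this input the direction is the unit vector along the diagonal and the middle row projects to zero, since it coincides with the mean.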
Step S200: When the standard data set is taken as the target set to be classified, select a data sample in the target set as the clustering center.
The standard data set comprises a plurality of data samples; one data sample can be selected at random and used as the initial clustering center.
Step S300: Perform the classification operation with the clustering center of the target set and the data sample farthest from that clustering center as the two centers.
The distance between each data sample in the target set and the clustering center of the target set may be calculated using a distance metric formula. Euclidean distance tends to reflect differences in numerical magnitude, whereas cosine similarity focuses on the difference in direction between two vectors, so cosine similarity is better suited to measuring text similarity. Accordingly, the distance metric formula is as follows:
dis denotes the distance between two data samples, $A_i$ denotes the i-th coordinate of the weight vector of one data sample, $B_i$ denotes the i-th coordinate of the weight vector of the other data sample, and n denotes the number of coordinates in the weight vector.
The distance between two data samples may represent the similarity between the two data samples, i.e. the greater the distance between two data samples, the less the similarity between two data samples; the smaller the distance between two data samples, the greater the similarity between the two data samples.
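A cosine-based distance consistent with this description can be sketched as follows. The exact formula is not reproduced in this text, so the form dis = 1 - cos(A, B) is an assumption chosen so that a larger distance means a smaller similarity; the function name `cosine_distance` and the handling of zero vectors are likewise hypothetical.

```python
import math


def cosine_distance(a, b):
    """Assumed distance: 1 minus the cosine similarity of two weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector has no direction, so treat it as unrelated.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])   # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])       # 1.0
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # 2.0
```

Under this assumed form the distance ranges from 0 for vectors pointing the same way to 2 for opposite vectors, regardless of their magnitudes.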
In this embodiment, the initially set clustering center is referred to as the first clustering center, and the data sample that the distance metric formula identifies as farthest from the initially set clustering center of the target set is referred to as the second clustering center.
And performing a classification operation according to the first clustering center and the second clustering center to classify the target set into two classes.
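Selecting the second clustering center amounts to taking the sample with the largest distance to the first one. The sketch below assumes the distance is 1 minus cosine similarity (the formula itself is not reproduced in this text); `farthest_sample_index` and the sample data are hypothetical.

```python
import math


def cosine_distance(a, b):
    # Assumed distance: 1 - cosine similarity (zero vectors map to 1.0).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def farthest_sample_index(samples, center):
    """Index of the sample with the largest distance to the given center."""
    return max(range(len(samples)),
               key=lambda k: cosine_distance(samples[k], center))


# Hypothetical target set; the first sample plays the first clustering center.
samples = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
second_center_index = farthest_sample_index(samples, samples[0])
```

If several samples tie for the largest distance, `max` keeps the first one; the text does not specify a tie-breaking rule.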
Referring to FIG. 3, the classification operation includes the following steps:
Step S301: Calculate, according to the distance metric formula, a first distance between each data sample in the target set and the clustering center of the target set, and a second distance between each data sample in the target set and the data sample farthest from the clustering center of the target set.
Illustratively, the first distance between each data sample in the target set and the clustering center of the target set is calculated according to the distance metric formula, where dis1 denotes the distance between the k-th data sample and the first clustering center $C_1$, $A_{k,i}$ denotes the i-th coordinate of the weight vector of the k-th data sample, $C_{1,i}$ denotes the i-th coordinate of the weight vector of the first clustering center, and n denotes the number of coordinates in the weight vector.
Exemplarily, the second distance between each data sample in the target set and the data sample farthest from the clustering center of the target set is calculated according to the distance metric formula, where dis2 denotes the distance between the k-th data sample and the second clustering center $C_2$, $A_{k,i}$ denotes the i-th coordinate of the weight vector of the k-th data sample, $C_{2,i}$ denotes the i-th coordinate of the weight vector of the second clustering center, and n denotes the number of coordinates in the weight vector.
Step S302: Take the data samples whose first distance is smaller than the second distance as one class.
The first distance is compared with the second distance, and the data samples whose first distance is smaller than the second distance form one class; the clustering center of this class is the first clustering center.
Step S303: Take the data samples whose first distance is greater than or equal to the second distance as the other class.
The first distance is compared with the second distance, and the data samples whose first distance is greater than or equal to the second distance form the other class; the clustering center of this class is the second clustering center.
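Steps S301 to S303 can be sketched as a single split function. The comparison rule (strictly smaller goes to the first class, greater than or equal goes to the second) follows the text; the distance form 1 minus cosine similarity, the function names and the sample data are assumptions for illustration.

```python
import math


def cosine_distance(a, b):
    # Assumed distance: 1 - cosine similarity (zero vectors map to 1.0).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def split_target_set(samples, first_center, second_center):
    """Divide the target set into two classes around the two centers."""
    first_class, second_class = [], []
    for sample in samples:
        d1 = cosine_distance(sample, first_center)   # first distance (S301)
        d2 = cosine_distance(sample, second_center)  # second distance (S301)
        if d1 < d2:
            first_class.append(sample)    # S302: closer to the first center
        else:
            second_class.append(sample)   # S303: ties go to the second class
    return first_class, second_class


# Hypothetical target set with its first center and farthest sample.
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.1]]
first_class, second_class = split_target_set(samples, samples[0], samples[3])
```

Because ties are assigned to the second class, every sample lands in exactly one class and the second class always contains the second clustering center itself.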
Step S400: Calculate the compactness of each category according to a preset sum-of-squared-errors formula.
The compactness of the first cluster and the compactness of the second cluster are calculated separately. This embodiment uses a sum-of-squared-errors formula to compute the compactness, where ASSE denotes the sum of squared errors and reflects the compactness of the category whose clustering center is $c_l$, $c_k$ denotes the other clustering center, m denotes the number of data samples in the category whose clustering center is $c_l$, $x_j$ denotes the j-th data sample in that category, and r denotes a regularization constant; in this embodiment, r is 1.
Exemplarily, when calculating the compactness of the first cluster, whose clustering center is $c_1$, $c_1$ is used as $c_l$, $c_2$ is used as $c_k$, and $x_j$ denotes the j-th data sample in the category whose clustering center is $c_1$. The sum of squared errors of the first cluster reflects the compactness of the first cluster.
Exemplarily, when calculating the compactness of the second cluster, whose clustering center is $c_2$, $c_2$ is used as $c_l$, $c_1$ is used as $c_k$, and $x_j$ denotes the j-th data sample in the category whose clustering center is $c_2$. The sum of squared errors of the second cluster reflects the compactness of the second cluster.
It should be understood that the numerator of the sum-of-squared-errors formula reflects how tightly the data samples of a cluster are grouped: the smaller the numerator, the smaller the distances between the data samples in the cluster and its clustering center, and the more compact the cluster. The denominator reflects how close the data samples of the cluster are to the other cluster: the larger the denominator, the larger the distances between the data samples in the cluster and the center of the other cluster, the greater the separation between the two clusters, and hence the more compact the cluster. It can be understood that the smaller the sum of squared errors of a cluster, the more compact the cluster.
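The numerator and denominator roles described above can be illustrated with the following sketch. The exact ASSE formula is not reproduced in this text, so the form used here, the sum of squared distances to the own center divided by the sum of squared distances to the other center plus r, is an assumption; the use of squared Euclidean distance and the name `asse` are also hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors (assumed error term)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def asse(cluster, own_center, other_center, r=1.0):
    """Assumed compactness score: smaller values mean a more compact cluster."""
    # Numerator: spread of the cluster's samples around its own center c_l.
    numerator = sum(squared_distance(x, own_center) for x in cluster)
    # Denominator: separation from the other center c_k, plus the
    # regularization constant r, which also keeps it from being zero.
    denominator = sum(squared_distance(x, other_center) for x in cluster) + r
    return numerator / denominator


# Hypothetical cluster with its own center and the other cluster's center.
cluster = [[0.0, 0.0], [2.0, 0.0]]
score = asse(cluster, own_center=[1.0, 0.0], other_center=[1.0, 3.0])
```

Here the numerator is 1 + 1 = 2 and the denominator is 10 + 10 + 1 = 21, so moving the other center farther away lowers the score, which matches the reading that greater separation indicates a more compact cluster.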
Step S500: Take the category with the lowest compactness as the new target set and repeat the classification operation until the number of categories reaches the preset number.
The compactness of the first cluster is compared with that of the second cluster, and the category with the lowest compactness, that is, the category with the largest sum of squared errors, is taken as the new target set. Steps S300, S400 and S500 are repeated until the number of categories reaches the preset number.
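Putting steps S200 to S500 together gives a loop of the following shape. This is a sketch under several assumptions that the text does not confirm: the distance is 1 minus cosine similarity, the compactness score is a ratio of squared Euclidean sums plus r, the class that is not split again is set aside as finished, and all function and parameter names are hypothetical.

```python
import math
import random


def cosine_distance(a, b):
    # Assumed distance: 1 - cosine similarity (zero vectors map to 1.0).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def asse(cluster, own_center, other_center, r=1.0):
    # Assumed compactness score; a larger value means a less compact class.
    sq = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return (sum(sq(x, own_center) for x in cluster)
            / (sum(sq(x, other_center) for x in cluster) + r))


def bisecting_cluster(samples, n_classes, center=None, seed=0):
    """Repeatedly split the least compact class until n_classes exist."""
    target = list(samples)
    if center is None:
        center = random.Random(seed).choice(target)  # S200: random start
    finished = []
    while len(finished) + 1 < n_classes and len(target) > 1:
        # S300: the farthest sample becomes the second clustering center.
        far = max(target, key=lambda s: cosine_distance(s, center))
        first = [s for s in target
                 if cosine_distance(s, center) < cosine_distance(s, far)]
        second = [s for s in target
                  if cosine_distance(s, center) >= cosine_distance(s, far)]
        if not first or not second:
            break  # the target set cannot be split any further
        # S400 and S500: keep splitting the class with the larger score.
        if asse(first, center, far) >= asse(second, far, center):
            finished.append(second)
            target = first
        else:
            finished.append(first)
            target, center = second, far
    return finished + [target]


# Hypothetical data; the first center is fixed so the run is reproducible.
data = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.1]]
clusters = bisecting_cluster(data, n_classes=3, center=[1.0, 0.0])
```

Each pass through the loop adds exactly one class, so the preset number is reached after at most n_classes - 1 splits; the loop may stop earlier if the current target set can no longer be divided.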
The preset number of categories can be set according to the actual situation. Exemplarily, when data mining is performed on the evaluation index data of the four major university rankings, all the evaluation index data are subjected to cluster analysis to generate a new evaluation system, and the number of categories can be set to 4 so that all the evaluation indexes are reasonably classified into four categories. This avoids the subjectivity of manual classification, makes the data mining and processing more objective, and improves the data mining speed.
Exemplarily, for a paper recommendation system, the above technical scheme can be used to perform data mining on the paper data that a user has retrieved. For example, paper data in the artificial intelligence field can be classified according to ten major machine learning algorithms into 10 categories; user habits can be obtained from the mining results and a user profile can be built, so that related academic papers can subsequently be recommended to the user, thereby improving the user experience.
Exemplarily, for a search engine, the above technical solution can be used to perform data mining on the information data that a user has retrieved. For example, the information data may be divided into a plurality of categories, such as society, science, humanities, history and astronomy, in order to mine user habits and build a user profile, so that relevant information can be accurately recommended to the user in the future.
The data mining method disclosed by the embodiment processes data to be analyzed to obtain a standard data set; when the standard data set is used as a target set to be classified, selecting a data sample in the target set as a clustering center; performing classification operation by taking the clustering center of a target set and a data sample farthest from the clustering center of the target set as centers; respectively calculating the compactness of each category according to a preset error square sum formula; and repeatedly executing the classification operation by taking the category with the minimum compactness as a new target set until the classification number reaches a preset number. According to the technical scheme, massive network data can be collected and sorted, main information related to the data is mined, the data mining process is more objective, the subjectivity of manual operation is avoided, and the data processing efficiency is improved. In addition, the method is used for processing the mass data, and the obtained information is more accurate and reliable.
Example 2
In the present embodiment, referring to fig. 4, a data mining apparatus 1 is shown including: the system comprises a data preprocessing module 100, an initial cluster center selecting module 200, a classification operation executing module 300, a compactness calculating module 400 and a new target set determining module 500.
A data preprocessing module 100, configured to process data to be analyzed to obtain a standard data set; an initial clustering center selecting module 200, configured to select a data sample as a clustering center in a target set when the standard data set is used as the target set to be classified; a classification operation executing module 300, configured to execute a classification operation centering on a cluster center of a target set and a data sample farthest from the cluster center of the target set; the compactness calculation module 400 is used for calculating the compactness of each category according to a preset error square sum formula; and a new target set determining module 500, configured to repeatedly perform the classification operation with the category with the smallest compactness as a new target set until the number of classifications reaches a preset number.
In this embodiment, the data mining apparatus 1 is configured to execute the data mining method according to the above embodiment by using the data preprocessing module 100, the initial clustering center selecting module 200, the classifying operation executing module 300, the compactness calculating module 400, and the new target set determining module 500 in a matching manner, and the implementation scheme and the beneficial effect related to the above embodiment are also applicable to this embodiment, and are not described herein again.
It should be understood that the present embodiment relates to a terminal device, comprising a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to execute the data mining method according to the above embodiment.
It should be appreciated that the present embodiments relate to a readable storage medium storing a computer program which, when run on a processor, performs the data mining method described in the above embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, each functional module or unit in each embodiment of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a smartphone, a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. A method of data mining, the method comprising:
processing data to be analyzed to obtain a standard data set;
when the standard data set is used as a target set to be classified, selecting a data sample in the target set as a clustering center;
performing classification operation by taking the clustering center of a target set and a data sample farthest from the clustering center of the target set as centers;
respectively calculating the compactness of each category according to a preset error square sum formula;
and repeatedly executing the classification operation by taking the category with the minimum compactness as a new target set until the classification number reaches a preset number.
2. The data mining method of claim 1, wherein performing the classification operation by taking the clustering center of a target set and the data sample farthest from the clustering center of the target set as centers comprises:
calculating a first distance between each data sample in the target set and the clustering center of the target set and a second distance between each data sample in the target set and the data sample farthest from the clustering center of the target set according to a distance measurement formula;
taking data samples with the first distance smaller than the second distance as a class;
and taking the data samples with the first distance being greater than or equal to the second distance as another class.
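A minimal sketch of this two-way classification, assuming Euclidean distance as the distance measure (the claimed distance formula is not reproduced in this text). The function name `classify` is hypothetical.

```python
import math

def classify(samples, center):
    """Split samples into two classes by comparing the first and second distances."""
    # Euclidean distance is an assumption made for this sketch.
    farthest = max(samples, key=lambda s: math.dist(s, center))
    one_class, other_class = [], []
    for s in samples:
        first = math.dist(s, center)     # first distance: to the clustering center
        second = math.dist(s, farthest)  # second distance: to the farthest sample
        if first < second:
            one_class.append(s)
        else:
            # first >= second, so an equidistant sample lands in this class
            other_class.append(s)
    return one_class, other_class

if __name__ == "__main__":
    print(classify([(0, 0), (2, 0), (4, 0)], (0, 0)))
```

In the usage line, the sample `(2, 0)` is equally far from the center `(0, 0)` and from the farthest sample `(4, 0)`, so under the greater-than-or-equal rule it joins the class of the farthest sample.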
3. The data mining method of claim 1, wherein the data sample furthest from the cluster center of the target set is determined by:
calculating the distance between each data sample in the target set and the clustering center of the target set through a distance measurement formula;
and selecting the data sample with the largest distance as the data sample farthest from the clustering center of the target set.
4. A method of data mining according to claim 2 or 3, wherein the distance metric is formulated as follows:
5. The data mining method of claim 1, wherein the sum of squared errors formula is as follows:
ASSE stands for the sum of squared errors and reflects the compactness of the class that has $c_l$ as its clustering center, $c_k$ represents another clustering center, $m$ represents the number of data samples in the class that has $c_l$ as its clustering center, $x_j$ represents the j-th data sample in that class, and $r$ represents the regularization constant.
6. The data mining method of claim 1, wherein the processing the data to be analyzed to obtain a standard data set comprises:
performing text segmentation on data to be analyzed, and constructing a bag-of-words model vector;
counting common words and important words in the bag-of-words model vector by using a term frequency-inverse document frequency (TF-IDF) method to obtain a text-vocabulary matrix;
and performing dimensionality reduction on the text-vocabulary matrix to obtain a standard data set.
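The first two preprocessing steps can be illustrated with a short Python sketch. It makes two assumptions: whitespace splitting stands in for a real word segmenter, and the weight is the textbook product of term frequency and the logarithm of inverse document frequency. The function name `text_vocabulary_matrix` is hypothetical, and the dimensionality-reduction step is deliberately left out.

```python
import math
from collections import Counter

def text_vocabulary_matrix(texts):
    """Build a TF-IDF weighted text-vocabulary matrix from raw texts."""
    # Whitespace splitting stands in for text segmentation (an assumption).
    docs = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # Document frequency: number of texts in which each word appears.
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    matrix = []
    for d in docs:
        total = sum(d.values()) or 1  # avoid division by zero for empty texts
        # Words present in every text get weight 0; rarer words weigh more.
        matrix.append([(d[w] / total) * math.log(n / df[w]) for w in vocab])
    return vocab, matrix

if __name__ == "__main__":
    vocab, matrix = text_vocabulary_matrix(["apple banana apple", "banana cherry"])
    print(vocab)
    print(matrix)
```

Each row of the returned matrix corresponds to one text and each column to one word. A reduction technique could then be applied to the rows to obtain the standard data set.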
7. A data mining device, the device comprising:
the data preprocessing module is used for processing the data to be analyzed to obtain a standard data set;
the initial clustering center selection module is used for selecting a data sample as a clustering center in a target set when the standard data set is taken as the target set to be classified;
the classification operation execution module is used for executing classification operation by taking the clustering center of a target set and a data sample farthest from the clustering center of the target set as centers;
the compactness calculation module is used for calculating the compactness of each category according to a preset error square sum formula;
and the new target set determining module is used for taking the category with the minimum compactness as a new target set and repeatedly executing the classification operation until the classification number reaches a preset number.
8. The data mining device of claim 7, wherein the data sample furthest from the cluster center of the target set is determined by:
calculating the distance between each data sample in the target set and the clustering center of the target set through a distance measurement formula;
and selecting the data sample with the largest distance as the data sample farthest from the clustering center of the target set.
9. A terminal device, comprising a memory for storing a computer program and a processor for executing the computer program to enable the terminal device to perform the data mining method of any one of claims 1 to 6.
10. A readable storage medium, characterized in that it stores a computer program which, when run on a processor, performs the data mining method of any of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010584569.2A CN111737469A (en) | 2020-06-23 | 2020-06-23 | Data mining method and device, terminal equipment and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010584569.2A CN111737469A (en) | 2020-06-23 | 2020-06-23 | Data mining method and device, terminal equipment and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111737469A (en) | 2020-10-02 |
Family
ID=72650840
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010584569.2A Pending CN111737469A (en) | 2020-06-23 | 2020-06-23 | Data mining method and device, terminal equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111737469A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113920373A (en) * | 2021-10-29 | 2022-01-11 | 平安银行股份有限公司 | Object classification method and device, terminal equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020183966A1 (en) * | 2001-05-10 | 2002-12-05 | Nina Mishra | Computer implemented scalable, incremental and parallel clustering based on weighted divide and conquer |
CN107273918A (en) * | 2017-05-26 | 2017-10-20 | 国信优易数据有限公司 | A kind of sample data classification determines method and apparatus |
WO2017181660A1 (en) * | 2016-04-21 | 2017-10-26 | 华为技术有限公司 | K-means algorithm-based data clustering method and device |
CN107704872A (en) * | 2017-09-19 | 2018-02-16 | 安徽理工大学 | A kind of K means based on relatively most discrete dimension segmentation cluster initial center choosing method |
CN110109975A (en) * | 2019-05-14 | 2019-08-09 | 重庆紫光华山智安科技有限公司 | Data clustering method and device |
CN110825826A (en) * | 2019-11-07 | 2020-02-21 | 深圳大学 | Clustering calculation method, device, terminal and storage medium |
Non-Patent Citations (1)
Title |
---|
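Shi Yunping (石云平): "Analysis of the K-means algorithm using the average error criterion function E" ("使用平均误差准则函数E的K-means算法分析"), Computer and Information Technology (《计算机与信息技术》) * |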
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804641B (en) | Text similarity calculation method, device, equipment and storage medium | |
CN111898366B (en) | Document subject word aggregation method and device, computer equipment and readable storage medium | |
CN109408641B (en) | Text classification method and system based on supervised topic model | |
CA3059414A1 (en) | Hybrid approach to approximate string matching using machine learning | |
CN113360701B (en) | Sketch processing method and system based on knowledge distillation | |
US20080082475A1 (en) | System and method for resource adaptive classification of data streams | |
CN107832778B (en) | Same target identification method based on spatial comprehensive similarity | |
CN112579783B (en) | Short text clustering method based on Laplace atlas | |
Bortnikova et al. | Search Query Classification Using Machine Learning for Information Retrieval Systems in Intelligent Manufacturing. | |
Wong et al. | Feature selection and feature extraction: highlights | |
CN111737469A (en) | Data mining method and device, terminal equipment and readable storage medium | |
US11048730B2 (en) | Data clustering apparatus and method based on range query using CF tree | |
Kostkina et al. | Document categorization based on usage of features reduction with synonyms clustering in weak semantic map | |
CN117077680A (en) | Question and answer intention recognition method and device | |
CN115455142A (en) | Text retrieval method, computer device and storage medium | |
CN106021346B (en) | Retrieval processing method and device | |
CN114254622A (en) | Intention identification method and device | |
CN113673237A (en) | Model training method, intent recognition method, device, electronic equipment and storage medium | |
Geler | Role of Similarity Measures in Time Series Analysis | |
Azis et al. | LL-KNN ACW-NB: Local Learning K-Nearest Neighbor in Absolute Correlation Weighted Naïve Bayes for Numerical Data Classification | |
Su et al. | Research on product reviews hot spot discovery algorithm based on mapreduce | |
Hochma et al. | Efficient Feature Ranking and Selection using Statistical Moments | |
CN116932487B (en) | Quantized data analysis method and system based on data paragraph division | |
EP4369258A1 (en) | Systems and methods for finding nearest neighbors | |
CN110609961A (en) | Collaborative filtering recommendation method based on word embedding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||