CN109214468A - Data clustering method based on extension-distance optimization of cluster centers - Google Patents
Data clustering method based on extension-distance optimization of cluster centers Download PDF Info
- Publication number
- CN109214468A (application CN201811215252.0A)
- Authority
- CN
- China
- Prior art keywords
- distance
- sample
- clustering
- extension
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A data clustering method that optimizes cluster centers using the extension distance. The classical distances between samples are mapped onto an extension interval, so that each pairwise distance is treated as a point of that interval; extension-space computation then yields the side distances and the mapping interval. Quantitative indices of cluster-center quality are proposed from the average extension side distances: the average left extension side distance measures the closeness of samples, and the average right extension side distance measures the separation from cluster centers. Closeness and separation then serve as the center-selection criterion: the sample distances are sorted in ascending order; the center coordinate corresponding to the first sample distance greater than the closeness index becomes the first initial cluster center; each subsequent sample distance is examined in turn and, if it exceeds the closeness index and its extension distance to every previously selected cluster center exceeds the separation index, it becomes the next selected cluster center, until the optimal initial points are selected. If not enough satisfactory centers are found, a scaling factor is used to scale the mapping interval.
Description
Technical Field
The invention relates to a data clustering method that optimizes initial cluster centers using extension distances.
Background
With the rapid development of big data technology, the ability to mine the hidden knowledge in data has become an important factor in whether an enterprise is competitive. Clustering is an important means of data analysis and is widely applied in data mining, image processing, and other fields. However, in existing improved algorithms, the selection of initial cluster centers still suffers from the defect that the first initial center may fall on a sparse boundary, which leads to poor intra-cluster balance during clustering and an unsatisfactory clustering effect. An algorithm based on the extension-distance k-means is therefore provided: the classical distances are mapped to extension distances, and the "intra-class difference" of the extension distances enables a discriminating screening of the sample distances, preventing center points from being placed on sparse boundaries and realizing the clustering of samples.
Disclosure of Invention
In order to overcome the defect of conventional clustering methods that initial cluster centers easily fall on sparse boundaries, a data clustering method based on extension-distance optimization of cluster centers is provided, achieving a better clustering effect by means of extension (extenics) theory.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a data clustering method based on an optimal clustering center of extension distance comprises the following steps:
s1: constructing an equivalent dense distance interval: normalizing the sample data, calculating a normalized inter-sample distance set by using a classical distance to measure the density degree between sample points, and constructing an equivalent dense distance interval [ A, B ] according to a sorting rule;
Let the set of n samples be X = {x_1, x_2, …, x_n}, where each x_i is an m-dimensional vector (i = 1, 2, …, n). Then

Z = [A, B] = [min(D), max(D)]    (1)

where D is the set of distances between every two samples. The inter-sample distance is used to measure the density of the sample distribution.
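As a concrete illustration of step S1, the sketch below builds the pairwise distance set D and the interval Z = [A, B] of formula (1). Min-max normalization and the Euclidean metric are assumptions, since the patent names only a "classical distance" and does not specify the normalization.

```python
import numpy as np

def dense_interval(X):
    """Normalize samples, compute the set D of all C(n, 2) pairwise
    classical (Euclidean) distances, and return the equivalent dense
    distance interval Z = [A, B] = [min(D), max(D)] (formula (1))."""
    X = np.asarray(X, dtype=float)
    # min-max normalization per feature, guarding constant columns
    rng = X.max(axis=0) - X.min(axis=0)
    rng[rng == 0] = 1.0
    Xn = (X - X.min(axis=0)) / rng
    n = len(Xn)
    # all C(n, 2) pairwise distances between distinct samples
    D = np.array([np.linalg.norm(Xn[i] - Xn[j])
                  for i in range(n) for j in range(i + 1, n)])
    return Xn, D, (D.min(), D.max())
```

For three samples the function returns C(3, 2) = 3 distances, and the interval endpoints are simply the smallest and largest of them.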
S2: mapping sample distances to equivalent dense distance intervals: using the sample distance described in S1 as a mapping point, using the equivalent dense distance interval as a mapping extension interval, and mapping the sample distance to a point on the extension interval according to equations (2) and (3) by using the concept of extension side distance, i.e., extending the left side distance and the right side distance;
s3: and (3) construction of evaluation indexes: obtaining corresponding extension average left side distance between samples according to the extension side distances of all sample distances and the formula (4)And can extend the average right side distanceRespectively as indexes for measuring the quality of the cluster center point, i.e. the intensityAnd degree of separation
Wherein,the number of combinations of 2 samples arbitrarily taken from n samples is shown.
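Formula (4) itself is not reproduced in this text; under the natural reading that it averages the side distances over all C(n, 2) sample pairs, the two indices can be sketched as:

```python
from math import comb

def average_side_distances(lefts, rights, n):
    """Average the extension side distances over all C(n, 2) sample
    pairs (assumed reading of formula (4)): the average left side
    distance serves as the closeness index, the average right side
    distance as the separation index. `lefts` and `rights` hold one
    value per sample pair."""
    m = comb(n, 2)
    assert len(lefts) == len(rights) == m
    closeness = sum(lefts) / m
    separation = sum(rights) / m
    return closeness, separation
```

For n = 3 samples there are exactly three pairs, so each index is the mean of three side distances.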
S4: clustering center points according to a screening principle: and taking the extension side distance density and the distance index constructed in the S3 as a screening principle of the initial clustering center to obtain an optimized center clustering point.
In step S4, the following sub-steps are included:
s41: screening a first cluster center: sorting the extension side distances in an ascending order, traversing the sorted extension distances, and taking the center point coordinate corresponding to the first extension distance larger than the inter-sample density index as a first cluster center.
S42: non-first cluster center screening: calculating the coordinates of the center point corresponding to the next value and the determined initial clustering center C1,...,ci]Distance between the sample and the sampleComparing if they are all greater thanThe center point coordinate is taken as the next initial clustering center ci+1And judging whether the cluster heart number reaches the required number or not, and stopping when the cluster heart number reaches the required numberSearching; otherwise, the point is invalid, and the next potential cluster center point is judged.
S43: after traversing once, when the initial clustering center reaches K, outputting a clustering center point, otherwise, executing the step S44;
s44, interval scaling, namely, introducing a scaling factor, calculating a reduction factor η according to the formula (5), and dynamically reducing the right side distance of the average extension of the sampleAnd repeating the steps from S41 to S42 until the number of the clustering centers reaches K, and finishing the selection of the clustering centers.
S45: and clustering the extension distances of the sample distances to obtain a clustering result.
Furthermore, sample clustering is carried out from these cluster centers. The obtained optimal initial cluster centers overcome the defect of centers appearing in non-dense boundary regions: they are distributed in dense regions to the greatest extent, and the cluster centers are evenly spread.
Compared with the prior art, the invention has the beneficial effects that:
1. the sample distances reflect the density among samples and are used to construct the dense interval; 2. extension-distance theory is fused in: the distance between two samples is mapped as a point onto the extension interval, and the sample distances measuring density are screened discriminately in the extension-distance domain, effectively preventing centers from falling on sparse boundaries; 3. a scaling factor is introduced to scale the interval; 4. the results obtained are scientific and reasonable and better suited to clustering high-dimensional sample data, so the method has broad application prospects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is the classical-distance-to-extension-distance mapping in the embodiment.
Figs. 3a to 3c are the initial-cluster-center distributions of the algorithms before and after improvement on the Iris data, i.e., two-dimensional displays of the initial cluster centers obtained on the Iris sample set by the algorithm of this embodiment and the two comparison algorithms, where fig. 3a is the extension-distance method, fig. 3b the density method, and fig. 3c the average-difference method.

Figs. 4a to 4c are the corresponding initial-cluster-center distributions on the Wine data, i.e., two-dimensional displays of the initial cluster centers obtained on the Wine sample set, where fig. 4a is the extension-distance method, fig. 4b the density method, and fig. 4c the average-difference method.

Figs. 5a to 5d are clustering graphs on the Iris data from the initial cluster centers, i.e., three-dimensional displays of the clustering results obtained on the Iris sample set, where fig. 5a is the actual clustering, fig. 5b clustering by the extension-distance method, fig. 5c by the density method, and fig. 5d by the average-difference method.

Figs. 6a to 6d are the corresponding clustering graphs on the Wine data, i.e., three-dimensional displays of the clustering effect from the initial cluster centers on the Wine sample set, where fig. 6a is the actual clustering, fig. 6b clustering by the extension-distance method, fig. 6c by the density method, and fig. 6d by the average-difference method.
Detailed Description
The invention is further illustrated by the following figures and examples.
Examples
The embodiment provides a k-means algorithm based on extension-distance-optimized cluster centers, comprising the following steps:
s1: the method is compared with a traditional algorithm, an average difference degree K-mean clustering algorithm and a density-based improved K-means clustering algorithm. Constructing an equivalent dense distance interval [ A, B ] by using the classical distance between every two samples in the test sample set;
the test data set adopted in this example is derived from an Iris data set and a Wine data set used for test clustering in a UCI database, and the characteristics of each data are shown in table 1:
TABLE 1 basic characteristics of the data sets
The average-difference-degree K-means clustering algorithm is from: Liwu, Zhao Jia Yan, Titaishan. Improved K-means clustering algorithm optimizing the initial clustering center based on average difference degree [J]. Control and Decision, 2017, 32(04): 759-762.

The density-based improved K-means clustering algorithm is from: Hanjun, Neijian, Huang He, et al. Power supply block division method based on an improved K-means clustering algorithm [J]. Electric Power Automation Equipment, 2015, 35(06): 123-.
The inter-sample distance is used to measure the density of the inter-sample distribution.
S2: mapping sample distances to equivalent dense distance intervals: using the sample distance D described in S1 as a mapping point and the equivalent dense distance interval as a mapping extension interval [ a, B ], using the concept of extension side distance, mapping the sample distance to a point on the extension interval according to equations (2) and (3) as described in fig. 1, i.e. extending the left side distance and the right side distance;
s3: and (3) construction of evaluation indexes: obtaining corresponding extension average left side distance between samples according to the extension side distance of all sample distances D and the formula (4)And can extend the average right side distanceRespectively as indexes for measuring the quality of the cluster center point, i.e. the intensityAnd degree of separation
Wherein,the number of combinations of 2 samples arbitrarily taken from n samples is shown.
S4: clustering center points according to a screening principle: and taking the extension side distance density and the distance index constructed in the S3 as a screening principle of the initial clustering center to obtain an optimized center clustering point.
In step S4, the following sub-steps are included:
s41: screening a first cluster center: sorting the extension side distances in an ascending order, traversing the sorted extension distances, and taking the center point coordinate corresponding to the first extension distance larger than the inter-sample density index as a first cluster center.
S42: non-first cluster center screening: calculating the coordinates of the center point corresponding to the next value and the determined initial clustering center C1,...,ci]The extension distance of (2) can be used to extend the right distance with the sampleComparing if they are all greater thanThe center point coordinate is taken as the next initial clustering center ci+1Judging whether the cluster center number reaches the required number, and stopping searching when the cluster center number reaches the required number; otherwise, the point is invalid, and the next potential cluster center point is judged.
S43: after traversing once, when the initial clustering center reaches K, outputting a clustering center point, otherwise, executing the step S44;
s44, interval scaling, namely, introducing a scaling factor, calculating a reduction factor η according to the formula (5), and dynamically reducing the right side distance of the average extension of the sampleAnd repeating the steps from S41 to S42 until the number of the clustering centers reaches K, and finishing the selection of the clustering centers.
S45: and clustering the extension distances of the sample distances to obtain a clustering result.
The cluster centers obtained by the method of this example and the comparison methods on the two data sets are displayed in two-dimensional space, as shown in figs. 3 and 4:
In this embodiment, for both low-dimensional and high-dimensional data, the initial cluster centers selected by the improved algorithm are more uniformly distributed, and even when a selected center lies near a boundary region, the data points around it are relatively dense.
Figs. 5 and 6 show three-dimensional representations of the clustering obtained from the respective center points:
To quantitatively measure the effectiveness of the algorithms via the clustering effect of the center points, a classification accuracy index and a cluster balance index are adopted; the result statistics are shown in tables 2 to 4;
TABLE 2 Cluster accuracy index before and after improvement
The statistics of the number of each cluster sample after clustering the initial clustering center point are shown in table 3:
TABLE 3 number of samples of each cluster after clustering by the improved pre-and post-algorithm
The balance index of the algorithm before and after improvement obtained from table 3 is shown in table 4:
TABLE 4 Cluster equalization index before and after improvement
The accuracy index is the ratio of the number of samples correctly classified into their corresponding classes to the total number of samples; the cluster balance index is the variance between the number of samples in each cluster and the number of samples in the actual corresponding class.
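A minimal sketch of the two indices as just defined. Reading "variance" as the mean squared difference between the per-cluster counts and the actual class sizes is an assumption; the patent does not spell out the exact formula.

```python
import numpy as np

def accuracy_and_balance(pred_counts, true_counts, n_correct, n_total):
    """Accuracy: share of samples classified into their correct class.
    Balance (assumed reading): mean squared difference between the
    per-cluster sample counts and the actual class sizes."""
    accuracy = n_correct / n_total
    diffs = np.asarray(pred_counts, float) - np.asarray(true_counts, float)
    balance = float(np.mean(diffs ** 2))   # variance-style spread measure
    return accuracy, balance
```

A perfectly balanced result (each cluster the same size as its true class) gives a balance index of zero; a lower value means better cluster balance.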
The embodiment shows that, on high-dimensional data sets, the initial cluster centers selected by the extension-distance-based improved k-means algorithm yield higher accuracy and better balance, making the method well suited to clustering massive, high-dimensional data in the big data setting.
The embodiments described in this specification are merely examples of the proposed method on two sample sets from the UCI database; the scope of the invention should not be construed as limited to the specific forms set forth in the embodiments, but also covers equivalent technical means conceivable by those skilled in the art according to the inventive concept.
Claims (1)
1. A data clustering method based on an optimal clustering center of extension distance comprises the following steps:
s1: constructing an equivalent dense distance interval: normalizing the sample data, calculating a normalized inter-sample distance set by using a classical distance to measure the density degree between sample points, and constructing an equivalent dense distance interval [ A, B ] according to a sorting rule;
let the set of n samples be X = {x_1, x_2, …, x_n}, where each x_i is an m-dimensional vector, i = 1, 2, …, n,

Z = [A, B] = [min(D), max(D)]    (1)

where D is the set of distances between every two samples; the inter-sample distance is used for measuring the density of the distribution among samples;
s2: mapping sample distances to equivalent dense distance intervals: using the sample distance described in S1 as a mapping point, using the equivalent dense distance interval as a mapping extension interval, and mapping the sample distance to a point on the extension interval according to equations (2) and (3) by using the concept of extension side distance, i.e., extending the left side distance and the right side distance;
s3: and (3) construction of evaluation indexes: obtaining corresponding extension average left side distance between samples according to the extension side distances of all sample distances and the formula (4)And can extend the average right side distanceRespectively as indexes for measuring the quality of the cluster center point, i.e. the intensityAnd degree of separation
Wherein,a combination number representing 2 samples arbitrarily taken from n samples;
s4: clustering center points according to a screening principle: taking the extension side distance density and the distance index constructed in the S3 as a screening principle of an initial clustering center to obtain an optimized center clustering point;
in step S4, the following sub-steps are included:
s41: screening a first cluster center: sorting the extension side distances in an ascending order, traversing the sorted extension distances, and taking the center point coordinate corresponding to the first extension distance larger than the inter-sample density index as a first cluster center;
s42: non-first cluster center screening: calculating the coordinates of the center point corresponding to the next value and the determined initial clustering center C1,...,ci]Distance between the sample and the sampleComparing if they are all greater thanThe center point coordinate is taken as the next initial clustering center ci+1Judging whether the cluster core number reaches the required number, and stopping searching when the cluster core number reaches the required number; otherwise, judging the next potential cluster center point if the point is invalid;
s43: after traversing once, when the initial clustering center reaches K, outputting a clustering center point, otherwise, executing the step S44;
s44, interval scaling, namely, introducing a scaling factor, calculating a reduction factor η according to the formula (5), and dynamically reducing the right side distance of the average extension of the sampleRepeating the steps from S41 to S42 until the number of the clustering centers reaches K, and finishing the selection of the clustering centers;
s45: and clustering the extension distances of the sample distances to obtain a clustering result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811215252.0A CN109214468A (en) | 2018-10-18 | 2018-10-18 | Data clustering method based on extension-distance optimization of cluster centers |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811215252.0A CN109214468A (en) | 2018-10-18 | 2018-10-18 | Data clustering method based on extension-distance optimization of cluster centers |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109214468A true CN109214468A (en) | 2019-01-15 |
Family
ID=64980831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811215252.0A Pending CN109214468A (en) | Data clustering method based on extension-distance optimization of cluster centers |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109214468A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112034473A (en) * | 2020-08-31 | 2020-12-04 | 福建省特种设备检验研究院 | Method, device and equipment for measuring distance between guide rail brackets of elevator and storage medium |
CN112034473B (en) * | 2020-08-31 | 2024-02-27 | 福建省特种设备检验研究院 | Elevator guide rail bracket spacing measuring method, device, equipment and storage medium |
CN112070548A (en) * | 2020-09-11 | 2020-12-11 | 上海风秩科技有限公司 | User layering method, device, equipment and storage medium |
CN112070548B (en) * | 2020-09-11 | 2024-02-20 | 上海秒针网络科技有限公司 | User layering method, device, equipment and storage medium |
CN113591672A (en) * | 2021-07-28 | 2021-11-02 | 常州大学 | Detection method for identifying fish state based on Mask-Rcnn |
CN113591672B (en) * | 2021-07-28 | 2024-05-03 | 常州大学 | Detection method for identifying fish state based on Mask-Rcnn |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lengyel et al. | Silhouette width using generalized mean—A flexible method for assessing clustering efficiency | |
García et al. | Theoretical analysis of a performance measure for imbalanced data | |
CN105930862A (en) | Density peak clustering algorithm based on density adaptive distance | |
CN107832456B (en) | Parallel KNN text classification method based on critical value data division | |
CN104598586B (en) | The method of large-scale text categorization | |
CN109214468A (en) | Data clustering method based on extension-distance optimization of cluster centers | |
CN106056136A (en) | Data clustering method for rapidly determining clustering center | |
CN111259933B (en) | High-dimensional characteristic data classification method and system based on distributed parallel decision tree | |
JPWO2005050479A1 (en) | Similar pattern search device, similar pattern search method, similar pattern search program, and fraction separation device | |
Yang et al. | Density clustering with divergence distance and automatic center selection | |
WO2018006631A1 (en) | User level automatic segmentation method and system | |
CN107016416B (en) | Data classification prediction method based on neighborhood rough set and PCA fusion | |
Ding et al. | Student behavior clustering method based on campus big data | |
CN113254717A (en) | Multidimensional graph network node clustering processing method, apparatus and device | |
CN111563821A (en) | Financial stock fluctuation prediction method based on quantitative investment of support vector machine | |
CN112650818B (en) | Clustering mining method based on multidimensional time series data | |
Rani | Visual analytics for comparing the impact of outliers in k-means and k-medoids algorithm | |
CN111027609B (en) | Image data weighted classification method and system | |
CN109784354A (en) | Based on the non-parametric clustering method and electronic equipment for improving classification effectiveness | |
Ji et al. | Machine learning of discriminative gate locations for clinical diagnosis | |
CN110008994A (en) | P-CFSFDP Density Clustering method based on the operation of Spark platform | |
CN113379823B (en) | Minority sample generation method based on construction of equilateral balanced triangle SMOTE algorithm | |
CN108090514B (en) | Infrared image identification method based on two-stage density clustering | |
CN113535527A (en) | Load shedding method and system for real-time flow data predictive analysis | |
Ding et al. | Evolutionary Parameter-Free Clustering Algorithm |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190115 |