CN109214468A - Data clustering method based on extension-distance optimization of cluster centers - Google Patents
Data clustering method based on extension-distance optimization of cluster centers Download PDF Info
- Publication number
- CN109214468A (application CN201811215252.0A)
- Authority
- CN
- China
- Prior art keywords
- distance
- sample
- clustering
- extension
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A data clustering method that optimizes cluster centers using the extension distance. The classical distances between samples are mapped onto an extension interval, so that each pairwise distance is treated as a point of that interval; extension-space computation then yields the side distances and the mapping interval. Quantitative indices of cluster-center quality are proposed from the average extension side distances: the average left extension side distance measures the closeness of samples, and the average right extension side distance measures the separation from cluster centers. Closeness and separation then serve as the center-selection criterion: the sample distances are sorted in ascending order; the center coordinate corresponding to the first sample distance greater than the closeness index becomes the first initial cluster center; each subsequent sample distance is examined in turn and, if it exceeds the closeness index and its extension distance to every previously selected cluster center exceeds the separation index, it becomes the next selected cluster center, until the optimal initial points are selected. If not enough satisfactory centers are found, a scaling factor is used to scale the mapping interval.
Description
Technical Field
The invention relates to a data clustering method that optimizes initial cluster centers using extension distances.
Background
With the rapid development of big data technology, the ability to mine the hidden knowledge in data has become an important factor in whether an enterprise is competitive. Clustering is an important means of data analysis and is widely applied in data mining, image processing, and other fields. However, in existing improved algorithms, the selection of initial cluster centers still suffers from the defect that the first initial center may fall on a sparse boundary, which leads to poor intra-cluster balance during clustering and an unsatisfactory clustering effect. An algorithm based on the extension-distance k-means is therefore provided: the classical distances are mapped to extension distances, and the "intra-class difference" of the extension distances enables a discriminating screening of the sample distances, preventing center points from being placed on sparse boundaries and realizing the clustering of samples.
Disclosure of Invention
In order to overcome the defect of conventional clustering methods that initial cluster centers easily fall on sparse boundaries, a data clustering method based on extension-distance optimization of cluster centers is provided, achieving a better clustering effect by means of extension (extenics) theory.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a data clustering method based on an optimal clustering center of extension distance comprises the following steps:
s1: constructing an equivalent dense distance interval: normalizing the sample data, calculating a normalized inter-sample distance set by using a classical distance to measure the density degree between sample points, and constructing an equivalent dense distance interval [ A, B ] according to a sorting rule;
Let the set of n samples be X = {x_1, x_2, …, x_n}, where each x_i is an m-dimensional vector (i = 1, 2, …, n). Then

Z = [A, B] = [min(D), max(D)]    (1)

where D is the set of distances between every two samples. The inter-sample distance is used to measure the density of the sample distribution.
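As a concrete illustration of step S1, the sketch below builds the pairwise distance set D and the interval Z = [A, B] of formula (1). Min-max normalization and the Euclidean metric are assumptions, since the patent names only a "classical distance" and does not specify the normalization.

```python
import numpy as np

def dense_interval(X):
    """Normalize samples, compute the set D of all C(n, 2) pairwise
    classical (Euclidean) distances, and return the equivalent dense
    distance interval Z = [A, B] = [min(D), max(D)] (formula (1))."""
    X = np.asarray(X, dtype=float)
    # min-max normalization per feature, guarding constant columns
    rng = X.max(axis=0) - X.min(axis=0)
    rng[rng == 0] = 1.0
    Xn = (X - X.min(axis=0)) / rng
    n = len(Xn)
    # all C(n, 2) pairwise distances between distinct samples
    D = np.array([np.linalg.norm(Xn[i] - Xn[j])
                  for i in range(n) for j in range(i + 1, n)])
    return Xn, D, (D.min(), D.max())
```

For three samples the function returns C(3, 2) = 3 distances, and the interval endpoints are simply the smallest and largest of them.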
S2: mapping sample distances to equivalent dense distance intervals: using the sample distance described in S1 as a mapping point, using the equivalent dense distance interval as a mapping extension interval, and mapping the sample distance to a point on the extension interval according to equations (2) and (3) by using the concept of extension side distance, i.e., extending the left side distance and the right side distance;
s3: and (3) construction of evaluation indexes: obtaining corresponding extension average left side distance between samples according to the extension side distances of all sample distances and the formula (4)And can extend the average right side distanceRespectively as indexes for measuring the quality of the cluster center point, i.e. the intensityAnd degree of separation
Wherein,the number of combinations of 2 samples arbitrarily taken from n samples is shown.
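Formula (4) itself is not reproduced in this text; under the natural reading that it averages the side distances over all C(n, 2) sample pairs, the two indices can be sketched as:

```python
from math import comb

def average_side_distances(lefts, rights, n):
    """Average the extension side distances over all C(n, 2) sample
    pairs (assumed reading of formula (4)): the average left side
    distance serves as the closeness index, the average right side
    distance as the separation index. `lefts` and `rights` hold one
    value per sample pair."""
    m = comb(n, 2)
    assert len(lefts) == len(rights) == m
    closeness = sum(lefts) / m
    separation = sum(rights) / m
    return closeness, separation
```

For n = 3 samples there are exactly three pairs, so each index is the mean of three side distances.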
S4: clustering center points according to a screening principle: and taking the extension side distance density and the distance index constructed in the S3 as a screening principle of the initial clustering center to obtain an optimized center clustering point.
In step S4, the following sub-steps are included:
s41: screening a first cluster center: sorting the extension side distances in an ascending order, traversing the sorted extension distances, and taking the center point coordinate corresponding to the first extension distance larger than the inter-sample density index as a first cluster center.
S42: non-first cluster center screening: calculating the coordinates of the center point corresponding to the next value and the determined initial clustering center C1,...,ci]Distance between the sample and the sampleComparing if they are all greater thanThe center point coordinate is taken as the next initial clustering center ci+1And judging whether the cluster heart number reaches the required number or not, and stopping when the cluster heart number reaches the required numberSearching; otherwise, the point is invalid, and the next potential cluster center point is judged.
S43: after traversing once, when the initial clustering center reaches K, outputting a clustering center point, otherwise, executing the step S44;
s44, interval scaling, namely, introducing a scaling factor, calculating a reduction factor η according to the formula (5), and dynamically reducing the right side distance of the average extension of the sampleAnd repeating the steps from S41 to S42 until the number of the clustering centers reaches K, and finishing the selection of the clustering centers.
S45: and clustering the extension distances of the sample distances to obtain a clustering result.
Furthermore, sample clustering is carried out from these cluster centers. The obtained optimal initial cluster centers overcome the defect of centers appearing in non-dense boundary regions: they are distributed in dense regions to the greatest extent, and the cluster centers are evenly spread.
Compared with the prior art, the invention has the beneficial effects that:
1. the sample distances reflect the density among samples and are used to construct the dense interval; 2. extension-distance theory is fused in: the distance between two samples is mapped as a point onto the extension interval, and the sample distances measuring density are screened discriminately in the extension-distance domain, effectively preventing centers from falling on sparse boundaries; 3. a scaling factor is introduced to scale the interval; 4. the results obtained are scientific and reasonable and better suited to clustering high-dimensional sample data, so the method has broad application prospects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is the classical-distance-to-extension-distance mapping in the embodiment.
Figs. 3a to 3c are the initial-cluster-center distributions of the algorithms before and after improvement on the Iris data, i.e., two-dimensional displays of the initial cluster centers obtained on the Iris sample set by the algorithm of this embodiment and the two comparison algorithms, where fig. 3a is the extension-distance method, fig. 3b the density method, and fig. 3c the average-difference method.

Figs. 4a to 4c are the corresponding initial-cluster-center distributions on the Wine data, i.e., two-dimensional displays of the initial cluster centers obtained on the Wine sample set, where fig. 4a is the extension-distance method, fig. 4b the density method, and fig. 4c the average-difference method.

Figs. 5a to 5d are clustering graphs on the Iris data from the initial cluster centers, i.e., three-dimensional displays of the clustering results obtained on the Iris sample set, where fig. 5a is the actual clustering, fig. 5b clustering by the extension-distance method, fig. 5c by the density method, and fig. 5d by the average-difference method.

Figs. 6a to 6d are the corresponding clustering graphs on the Wine data, i.e., three-dimensional displays of the clustering effect from the initial cluster centers on the Wine sample set, where fig. 6a is the actual clustering, fig. 6b clustering by the extension-distance method, fig. 6c by the density method, and fig. 6d by the average-difference method.
Detailed Description
The invention is further illustrated by the following figures and examples.
Examples
The embodiment provides a k-means algorithm based on extension-distance-optimized cluster centers, comprising the following steps:
s1: the method is compared with a traditional algorithm, an average difference degree K-mean clustering algorithm and a density-based improved K-means clustering algorithm. Constructing an equivalent dense distance interval [ A, B ] by using the classical distance between every two samples in the test sample set;
the test data set adopted in this example is derived from an Iris data set and a Wine data set used for test clustering in a UCI database, and the characteristics of each data are shown in table 1:
TABLE 1 basic characteristics of the data sets
The average-difference-degree K-means clustering algorithm is from: Liwu, Zhao Jia Yan, Titaishan. Improved K-means clustering algorithm optimizing the initial clustering center based on average difference degree [J]. Control and Decision, 2017, 32(04): 759-762.

The density-based improved K-means clustering algorithm is from: Hanjun, Neijian, Huang He, et al. Power supply block division method based on an improved K-means clustering algorithm [J]. Electric Power Automation Equipment, 2015, 35(06): 123-.
The inter-sample distance is used to measure the density of the inter-sample distribution.
S2: mapping sample distances to equivalent dense distance intervals: using the sample distance D described in S1 as a mapping point and the equivalent dense distance interval as a mapping extension interval [ a, B ], using the concept of extension side distance, mapping the sample distance to a point on the extension interval according to equations (2) and (3) as described in fig. 1, i.e. extending the left side distance and the right side distance;
s3: and (3) construction of evaluation indexes: obtaining corresponding extension average left side distance between samples according to the extension side distance of all sample distances D and the formula (4)And can extend the average right side distanceRespectively as indexes for measuring the quality of the cluster center point, i.e. the intensityAnd degree of separation
Wherein,the number of combinations of 2 samples arbitrarily taken from n samples is shown.
S4: clustering center points according to a screening principle: and taking the extension side distance density and the distance index constructed in the S3 as a screening principle of the initial clustering center to obtain an optimized center clustering point.
In step S4, the following sub-steps are included:
s41: screening a first cluster center: sorting the extension side distances in an ascending order, traversing the sorted extension distances, and taking the center point coordinate corresponding to the first extension distance larger than the inter-sample density index as a first cluster center.
S42: non-first cluster center screening: calculating the coordinates of the center point corresponding to the next value and the determined initial clustering center C1,...,ci]The extension distance of (2) can be used to extend the right distance with the sampleComparing if they are all greater thanThe center point coordinate is taken as the next initial clustering center ci+1Judging whether the cluster center number reaches the required number, and stopping searching when the cluster center number reaches the required number; otherwise, the point is invalid, and the next potential cluster center point is judged.
S43: after traversing once, when the initial clustering center reaches K, outputting a clustering center point, otherwise, executing the step S44;
s44, interval scaling, namely, introducing a scaling factor, calculating a reduction factor η according to the formula (5), and dynamically reducing the right side distance of the average extension of the sampleAnd repeating the steps from S41 to S42 until the number of the clustering centers reaches K, and finishing the selection of the clustering centers.
S45: and clustering the extension distances of the sample distances to obtain a clustering result.
The cluster centers obtained by the method of this example and the comparison methods on the two data sets are displayed in two-dimensional space, as shown in figs. 3 and 4:
In this embodiment, for both low-dimensional and high-dimensional data, the initial cluster centers selected by the improved algorithm are more uniformly distributed, and even when a selected center lies near a boundary region, the data points around it are relatively dense.
Figs. 5 and 6 show three-dimensional representations of the clustering obtained from the respective center points:
To quantitatively measure the effectiveness of the algorithms via the clustering effect of the center points, a classification accuracy index and a cluster balance index are adopted; the result statistics are shown in tables 2 to 4;
TABLE 2 Cluster accuracy index before and after improvement
The statistics of the number of each cluster sample after clustering the initial clustering center point are shown in table 3:
TABLE 3 number of samples of each cluster after clustering by the improved pre-and post-algorithm
The balance index of the algorithm before and after improvement obtained from table 3 is shown in table 4:
TABLE 4 Cluster equalization index before and after improvement
The accuracy index is the ratio of the number of samples correctly classified into their corresponding classes to the total number of samples; the cluster balance index is the variance between the number of samples in each cluster and the number of samples in the actual corresponding class.
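A minimal sketch of the two indices as just defined. Reading "variance" as the mean squared difference between the per-cluster counts and the actual class sizes is an assumption; the patent does not spell out the exact formula.

```python
import numpy as np

def accuracy_and_balance(pred_counts, true_counts, n_correct, n_total):
    """Accuracy: share of samples classified into their correct class.
    Balance (assumed reading): mean squared difference between the
    per-cluster sample counts and the actual class sizes."""
    accuracy = n_correct / n_total
    diffs = np.asarray(pred_counts, float) - np.asarray(true_counts, float)
    balance = float(np.mean(diffs ** 2))   # variance-style spread measure
    return accuracy, balance
```

A perfectly balanced result (each cluster the same size as its true class) gives a balance index of zero; a lower value means better cluster balance.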
The embodiment shows that, on high-dimensional data sets, the initial cluster centers selected by the extension-distance-based improved k-means algorithm yield higher accuracy and better balance, making the method well suited to clustering massive, high-dimensional data in the big data setting.
The embodiments described in this specification are merely examples of the proposed method on two sample sets from the UCI database; the scope of the invention should not be construed as limited to the specific forms set forth in the embodiments, but also covers equivalent technical means conceivable by those skilled in the art according to the inventive concept.
Claims (1)
1. A data clustering method based on an optimal clustering center of extension distance comprises the following steps:
s1: constructing an equivalent dense distance interval: normalizing the sample data, calculating a normalized inter-sample distance set by using a classical distance to measure the density degree between sample points, and constructing an equivalent dense distance interval [ A, B ] according to a sorting rule;
let the set of n samples be X = {x_1, x_2, …, x_n}, where each x_i is an m-dimensional vector, i = 1, 2, …, n,

Z = [A, B] = [min(D), max(D)]    (1)

where D is the set of distances between every two samples; the inter-sample distance is used for measuring the density of the distribution among samples;
s2: mapping sample distances to equivalent dense distance intervals: using the sample distance described in S1 as a mapping point, using the equivalent dense distance interval as a mapping extension interval, and mapping the sample distance to a point on the extension interval according to equations (2) and (3) by using the concept of extension side distance, i.e., extending the left side distance and the right side distance;
s3: and (3) construction of evaluation indexes: obtaining corresponding extension average left side distance between samples according to the extension side distances of all sample distances and the formula (4)And can extend the average right side distanceRespectively as indexes for measuring the quality of the cluster center point, i.e. the intensityAnd degree of separation
Wherein,a combination number representing 2 samples arbitrarily taken from n samples;
s4: clustering center points according to a screening principle: taking the extension side distance density and the distance index constructed in the S3 as a screening principle of an initial clustering center to obtain an optimized center clustering point;
in step S4, the following sub-steps are included:
s41: screening a first cluster center: sorting the extension side distances in an ascending order, traversing the sorted extension distances, and taking the center point coordinate corresponding to the first extension distance larger than the inter-sample density index as a first cluster center;
s42: non-first cluster center screening: calculating the coordinates of the center point corresponding to the next value and the determined initial clustering center C1,...,ci]Distance between the sample and the sampleComparing if they are all greater thanThe center point coordinate is taken as the next initial clustering center ci+1Judging whether the cluster core number reaches the required number, and stopping searching when the cluster core number reaches the required number; otherwise, judging the next potential cluster center point if the point is invalid;
s43: after traversing once, when the initial clustering center reaches K, outputting a clustering center point, otherwise, executing the step S44;
s44, interval scaling, namely, introducing a scaling factor, calculating a reduction factor η according to the formula (5), and dynamically reducing the right side distance of the average extension of the sampleRepeating the steps from S41 to S42 until the number of the clustering centers reaches K, and finishing the selection of the clustering centers;
s45: and clustering the extension distances of the sample distances to obtain a clustering result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811215252.0A CN109214468A (en) | 2018-10-18 | 2018-10-18 | Data clustering method based on extension-distance optimization of cluster centers |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811215252.0A CN109214468A (en) | 2018-10-18 | 2018-10-18 | Data clustering method based on extension-distance optimization of cluster centers |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109214468A true CN109214468A (en) | 2019-01-15 |
Family
ID=64980831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811215252.0A Pending CN109214468A (en) | Data clustering method based on extension-distance optimization of cluster centers |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109214468A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112034473A (en) * | 2020-08-31 | 2020-12-04 | 福建省特种设备检验研究院 | Method, device and equipment for measuring distance between guide rail brackets of elevator and storage medium |
CN112034473B (en) * | 2020-08-31 | 2024-02-27 | 福建省特种设备检验研究院 | Elevator guide rail bracket spacing measuring method, device, equipment and storage medium |
CN112070548A (en) * | 2020-09-11 | 2020-12-11 | 上海风秩科技有限公司 | User layering method, device, equipment and storage medium |
CN112070548B (en) * | 2020-09-11 | 2024-02-20 | 上海秒针网络科技有限公司 | User layering method, device, equipment and storage medium |
CN113591672A (en) * | 2021-07-28 | 2021-11-02 | 常州大学 | Detection method for identifying fish state based on Mask-Rcnn |
CN113591672B (en) * | 2021-07-28 | 2024-05-03 | 常州大学 | Detection method for identifying fish state based on Mask-Rcnn |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lengyel et al. | Silhouette width using generalized mean—A flexible method for assessing clustering efficiency | |
García et al. | Theoretical analysis of a performance measure for imbalanced data | |
CN105930862A (en) | Density peak clustering algorithm based on density adaptive distance | |
CN107832456B (en) | Parallel KNN text classification method based on critical value data division | |
CN104598586B (en) | The method of large-scale text categorization | |
CN109214468A (en) | Data clustering method based on extension-distance optimization of cluster centers | |
CN106056136A (en) | Data clustering method for rapidly determining clustering center | |
CN111259933B (en) | High-dimensional characteristic data classification method and system based on distributed parallel decision tree | |
JPWO2005050479A1 (en) | Similar pattern search device, similar pattern search method, similar pattern search program, and fraction separation device | |
Yang et al. | Density clustering with divergence distance and automatic center selection | |
WO2018006631A1 (en) | User level automatic segmentation method and system | |
CN107016416B (en) | Data classification prediction method based on neighborhood rough set and PCA fusion | |
Ding et al. | Student behavior clustering method based on campus big data | |
CN113254717A (en) | Multidimensional graph network node clustering processing method, apparatus and device | |
CN111563821A (en) | Financial stock fluctuation prediction method based on quantitative investment of support vector machine | |
CN112650818B (en) | Clustering mining method based on multidimensional time series data | |
Rani | Visual analytics for comparing the impact of outliers in k-means and k-medoids algorithm | |
CN111027609B (en) | Image data weighted classification method and system | |
CN109784354A (en) | Based on the non-parametric clustering method and electronic equipment for improving classification effectiveness | |
Ji et al. | Machine learning of discriminative gate locations for clinical diagnosis | |
CN110008994A (en) | P-CFSFDP Density Clustering method based on the operation of Spark platform | |
CN113379823B (en) | Minority sample generation method based on construction of equilateral balanced triangle SMOTE algorithm | |
CN108090514B (en) | Infrared image identification method based on two-stage density clustering | |
CN113535527A (en) | Load shedding method and system for real-time flow data predictive analysis | |
Ding et al. | Evolutionary Parameter-Free Clustering Algorithm |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190115 |