
CN107578070A - K-means initial cluster center optimization method based on neighborhood information and mean difference degree - Google Patents

K-means initial cluster center optimization method based on neighborhood information and mean difference degree

Info

Publication number
CN107578070A
CN107578070A
Authority
CN
China
Prior art keywords: clustering, distance, matrix, calculating, sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710849046.4A
Other languages
Chinese (zh)
Inventor
吴仲城
吴紫恒
李芳
张俊
罗健飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Kemei Network Information Technology Co Ltd
Original Assignee
Anhui Kemei Network Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Kemei Network Information Technology Co Ltd filed Critical Anhui Kemei Network Information Technology Co Ltd
Priority to CN201710849046.4A
Publication of CN107578070A
Legal status: Pending

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a K-means initial cluster center optimization method based on neighborhood information and mean difference degree, comprising: S1, inputting a sample set X = {X_1, X_2, ..., X_i, ..., X_n} of n objects, determining the number of clusters K, and initializing the number of currently determined initial cluster centers k = 0; S2, forming a distance matrix D; S3, determining the neighborhood radius value δ; S4, calculating the number of samples N_i in the δ-neighborhood of each sample point X_i to form a matrix N; S5, taking the sample point X_i with the largest δ-neighborhood sample count N_i in N as the 1st cluster center C_1, letting k = k + 1, and setting the corresponding N_i in N to 0; S6, finding the sample point X_j with the largest δ-neighborhood sample count N_j in N, calculating its distances to the cluster centers {C_1, C_2, ..., C_k}, and setting the corresponding N_j in N to 0; S7, if no distance between X_j and the cluster centers is less than the mean difference degree M, letting C_{k+1} = X_j and k = k + 1, otherwise returning to S6; S8, if the current number of cluster centers k equals the number of clusters K, outputting the K initial cluster centers, otherwise returning to S6; S9, clustering the whole sample set with the K-means clustering algorithm and outputting the clustering result.

Description

K-means initial clustering center optimization method based on neighborhood information and average difference degree
Technical Field
The invention relates to the technical field of data mining, and in particular to a K-means initial clustering center optimization method based on neighborhood information and average difference degree.
Background
A clustering algorithm is an unsupervised classification algorithm: a group of objects without class labels is divided into several classes according to a certain similarity measure, so that the distance between objects within the same class is as small as possible and the distance between objects in different classes is as large as possible. Clustering is one of the basic methods of system modeling and data mining and is widely applied in fields such as text classification and image recognition.
The K-means clustering algorithm is a dynamic clustering algorithm based on partitioning and, owing to its simplicity, has become one of the most popular clustering methods. The traditional K-means clustering algorithm has two main drawbacks. First, its clustering result is easily affected by the initial cluster centers; when the initial cluster centers are chosen unreasonably, the clustering results may be inconsistent or the algorithm may fail to converge. Second, it cannot overcome the adverse effect of outliers on the clustering result, so the result is unstable and its accuracy is low.
Disclosure of Invention
The invention aims to provide a K-means initial clustering center optimization method based on neighborhood information and average difference degree, so as to improve the accuracy of the K-means clustering algorithm.
In order to achieve this purpose, the invention adopts the following technical scheme: a K-means initial clustering center optimization method based on neighborhood information and average difference degree is provided, comprising the following steps:
S1, inputting a sample set X = {X_1, X_2, ..., X_i, ..., X_n} of n objects, where each X_i is an m-dimensional vector, determining the number of clustering classes K, and initializing the number of currently determined initial clustering centers k = 0;
S2, calculating the distance between every two objects in the sample set to form a distance matrix D;
S3, calculating the overall average difference degree M of the sample set and determining a neighborhood radius value δ;
S4, based on the distance matrix D, calculating the number of samples N_i in the δ-neighborhood of each sample point X_i to form a matrix N;
S5, taking the sample point X_i with the largest δ-neighborhood sample count N_i in the matrix N as the 1st clustering center C_1, letting k = k + 1, and setting the largest N_i in the matrix N to 0;
S6, continuing to find the sample point X_j with the largest δ-neighborhood sample count N_j in the matrix N, calculating its distances to the existing clustering centers {C_1, C_2, ..., C_k}, and setting the largest N_j in the matrix N to 0;
S7, if no distance between X_j and the clustering centers in {C_1, C_2, ..., C_k} is less than the average difference degree M, letting C_{k+1} = X_j and k = k + 1; otherwise, returning to step S6;
S8, if the current number of clustering centers k equals the number of clustering classes K, outputting the K initial clustering centers C = {C_1, C_2, ..., C_K}; otherwise, returning to step S6;
S9, clustering the whole sample set with the K-means clustering algorithm and outputting the clustering result.
In step S2, the distance between every two objects in the sample set is calculated as follows:
Suppose two objects in the sample set are X_i and X_j, where X_i = {x_i1, x_i2, ..., x_im} and X_j = {x_j1, x_j2, ..., x_jm}; then the distance d_ij between X_i and X_j is the Euclidean distance
d_ij = sqrt( Σ_{l=1}^{m} (x_il - x_jl)^2 ).
in step S3, the process of calculating the overall average difference M of the sample set specifically includes:
calculating a sample X i The average degree of difference of (a) is:
calculating the overall average difference degree of the sample set as:
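For illustration, a minimal Python sketch of steps S2 and S3 is given below. It is not taken from the patent: the function names and the random example data are assumptions made here, and the per-sample average difference degree is assumed to be the mean distance from X_i to the other n - 1 samples.

```python
import numpy as np


def distance_matrix(X):
    """Step S2: pairwise Euclidean distance matrix D for a sample array of shape (n, m)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def overall_average_difference(D):
    """Step S3: overall average difference degree M, assumed here to be the mean of
    each sample's average distance to the other n - 1 samples."""
    n = D.shape[0]
    per_sample = D.sum(axis=1) / (n - 1)    # average difference degree of each X_i
    return per_sample.mean()


X = np.random.rand(200, 2)                  # illustrative data only
D = distance_matrix(X)
M = overall_average_difference(D)
delta = M / 4.0                             # neighborhood radius, as in the embodiment below
```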
Compared with the prior art, the invention has the following technical effects: the invention comprehensively considers the spatial distribution of the data set, so the obtained clustering centers are more reasonable and better match the actual situation. The invention not only effectively overcomes the blindness and randomness of initial clustering center selection in the K-means algorithm, significantly improving the accuracy and stability of the clustering result and requiring fewer iterations, but also alleviates the sensitivity to outliers to a certain extent.
Drawings
The following detailed description of embodiments of the invention refers to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of a K-means initial clustering center optimization method based on neighborhood information and average difference in the invention.
Detailed Description
To further illustrate the features of the present invention, refer to the following detailed description of the invention and the accompanying drawings. The drawings are for reference and illustration purposes only and are not intended to limit the scope of the present disclosure.
As shown in fig. 1, the present embodiment discloses a K-means initial clustering center optimization method based on neighborhood information and average difference degree, which includes the following steps:
S1, inputting a sample set X = {X_1, X_2, ..., X_i, ..., X_n} of n objects, where each X_i is an m-dimensional vector, determining the number of clustering classes K, and initializing the number of currently determined initial clustering centers k = 0;
S2, calculating the distance between every two objects in the sample set to form a distance matrix D;
S3, calculating the overall average difference degree M of the sample set and determining a neighborhood radius value δ;
S4, based on the distance matrix D, calculating the number of samples N_i in the δ-neighborhood of each sample point X_i to form a matrix N;
S5, taking the sample point X_i with the largest δ-neighborhood sample count N_i in the matrix N as the 1st clustering center C_1, letting k = k + 1, and setting the largest N_i in the matrix N to 0;
S6, continuing to find the sample point X_j with the largest δ-neighborhood sample count N_j in the matrix N, calculating its distances to the existing clustering centers {C_1, C_2, ..., C_k}, and setting the largest N_j in the matrix N to 0;
S7, if no distance between X_j and the clustering centers in {C_1, C_2, ..., C_k} is less than the average difference degree M, letting C_{k+1} = X_j and k = k + 1; otherwise, returning to step S6;
S8, if the current number of clustering centers k equals the number of clustering classes K, outputting the K initial clustering centers C = {C_1, C_2, ..., C_K}; otherwise, returning to step S6;
S9, clustering the whole sample set with the K-means clustering algorithm and outputting the clustering result.
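The following Python sketch illustrates how steps S5 through S9 could be realized. It is not the patent's reference implementation: the helper name select_initial_centers, the synthetic three-blob data set, and the use of scikit-learn's KMeans with a user-supplied init array are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def select_initial_centers(X, K, delta, M, D):
    """Steps S5-S8: repeatedly take the point whose delta-neighborhood is densest
    and keep it as a new center only if it lies at least M away from every
    center already chosen."""
    counts = (D <= delta).sum(axis=1)                # matrix N from step S4
    centers = []
    while len(centers) < K:
        j = int(np.argmax(counts))                   # densest remaining candidate
        if counts[j] == 0:
            raise RuntimeError("candidates exhausted; delta or M may need tuning")
        counts[j] = 0                                # "set N_j to 0"
        if not centers:
            centers.append(X[j])                     # step S5: first center C_1
        elif np.all(np.linalg.norm(np.asarray(centers) - X[j], axis=1) >= M):
            centers.append(X[j])                     # step S7: far enough from all centers
    return np.asarray(centers)


# Synthetic example with three well-separated blobs, then step S9 with K-means.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.8, random_state=0)
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))    # step S2
M = (D.sum(axis=1) / (len(X) - 1)).mean()                          # step S3
delta = M / 4.0
K = 3
initial_centers = select_initial_centers(X, K, delta, M, D)
labels = KMeans(n_clusters=K, init=initial_centers, n_init=1).fit_predict(X)
```

Passing an explicit array as init together with n_init=1 makes scikit-learn start from exactly these centers rather than re-sampling them, which is the role the selected centers play in step S9.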
Further, in step S2, the distance between every two objects in the sample set is calculated as follows:
Suppose two objects in the sample set are X_i and X_j, where X_i = {x_i1, x_i2, ..., x_im} and X_j = {x_j1, x_j2, ..., x_jm}; then the distance d_ij between X_i and X_j is the Euclidean distance
d_ij = sqrt( Σ_{l=1}^{m} (x_il - x_jl)^2 ).
further, in step S3, the process of calculating the overall average difference M of the sample set specifically includes:
calculating sample X i The average degree of difference of (a) is:
calculating the overall average difference degree of the sample set as:
further, in step S4, the neighborhood radius value δ is calculated as: δ = M/4.
Further, in step S3, the neighborhood is defined as follows:
Let <U, Δ> be a non-empty metric space, where U is a non-empty finite set of objects. The closed ball centered at a sample X ∈ U with radius δ is called the δ-neighborhood of X and is defined as:
n(X) = { y ∈ U | Δ(X, y) ≤ δ };
where δ ≥ 0 and Δ is a distance function, for which the Euclidean distance is adopted.
Further, the distance matrix D is specifically the n × n matrix D = [d_ij] (i, j = 1, 2, ..., n), where d_ij is the distance between X_i and X_j.
Further, the matrix N is specifically N = [N_1, N_2, ..., N_n], where N_i is the number of samples in the δ-neighborhood of sample point X_i.
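As a concrete illustration of the two matrices just defined, the small Python sketch below builds D and the neighborhood-count vector N for a toy data set; the function and variable names are not taken from the patent.

```python
import numpy as np


def build_matrices(X, delta):
    """Distance matrix D (n x n) and vector N, where N[i] counts the samples
    inside the closed delta-ball around X[i] (the point itself included)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    N = (D <= delta).sum(axis=1)
    return D, N


X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
D, N = build_matrices(X, delta=0.5)
# N == [2, 2, 1]: the two nearby points see each other, the far point only itself
```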
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A K-means initial clustering center optimization method based on neighborhood information and average difference degree, characterized by comprising the following steps:
S1, inputting a sample set X = {X_1, X_2, ..., X_i, ..., X_n} of n objects, where each X_i is an m-dimensional vector, determining the number of clustering classes K, and initializing the number of currently determined initial clustering centers k = 0;
S2, calculating the distance between every two objects in the sample set to form a distance matrix D;
S3, calculating the overall average difference degree M of the sample set and determining a neighborhood radius value δ;
S4, based on the distance matrix D, calculating the number of samples N_i in the δ-neighborhood of each sample point X_i to form a matrix N;
S5, taking the sample point X_i with the largest δ-neighborhood sample count N_i in the matrix N as the 1st clustering center C_1, letting k = k + 1, and setting the largest N_i in the matrix N to 0;
S6, continuing to find the sample point X_j with the largest δ-neighborhood sample count N_j in the matrix N, calculating its distances to the existing clustering centers {C_1, C_2, ..., C_k}, and setting the largest N_j in the matrix N to 0;
S7, if no distance between X_j and the clustering centers in {C_1, C_2, ..., C_k} is less than the average difference degree M, letting C_{k+1} = X_j and k = k + 1; otherwise, returning to step S6;
S8, if the current number of clustering centers k equals the number of clustering classes K, outputting the K initial clustering centers C = {C_1, C_2, ..., C_K}; otherwise, returning to step S6;
S9, clustering the whole sample set with the K-means clustering algorithm and outputting the clustering result.
2. The method according to claim 1, characterized in that, in step S2, the distance between every two objects in the sample set is calculated as follows:
Suppose two objects in the sample set are X_i and X_j, where X_i = {x_i1, x_i2, ..., x_im} and X_j = {x_j1, x_j2, ..., x_jm}; then the distance d_ij between X_i and X_j is the Euclidean distance
d_ij = sqrt( Σ_{l=1}^{m} (x_il - x_jl)^2 ).
3. The method according to claim 2, characterized in that, in step S3, the overall average difference degree M of the sample set is calculated as follows:
The average difference degree of sample X_i is
M_i = (1 / (n - 1)) * Σ_{j=1, j≠i}^{n} d_ij,
and the overall average difference degree of the sample set is
M = (1 / n) * Σ_{i=1}^{n} M_i.
CN201710849046.4A 2017-09-19 2017-09-19 K means initial cluster center method for optimizing based on neighborhood information and mean difference degree Pending CN107578070A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710849046.4A CN107578070A (en) 2017-09-19 2017-09-19 K means initial cluster center method for optimizing based on neighborhood information and mean difference degree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710849046.4A CN107578070A (en) 2017-09-19 2017-09-19 K means initial cluster center method for optimizing based on neighborhood information and mean difference degree

Publications (1)

Publication Number Publication Date
CN107578070A true CN107578070A (en) 2018-01-12

Family

ID=61032940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710849046.4A Pending CN107578070A (en) 2017-09-19 2017-09-19 K means initial cluster center method for optimizing based on neighborhood information and mean difference degree

Country Status (1)

Country Link
CN (1) CN107578070A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805174A (en) * 2018-05-18 2018-11-13 广东惠禾科技发展有限公司 clustering method and device
CN109063781A (en) * 2018-08-14 2018-12-21 浙江理工大学 A kind of fuzzy image Fabric Design method of imitative natural colour function and form
CN111860700A (en) * 2020-09-22 2020-10-30 深圳须弥云图空间科技有限公司 Energy consumption classification method and device, storage medium and equipment

Similar Documents

Publication Publication Date Title
US20200160124A1 (en) Fine-grained image recognition
CN103258210B (en) A kind of high-definition image classification method based on dictionary learning
CN109273054B (en) Protein subcellular interval prediction method based on relational graph
CN110097060B (en) Open set identification method for trunk image
TWI464604B (en) Data clustering method and device, data processing apparatus and image processing apparatus
CN106845528A (en) 2017-06-13 An image classification algorithm based on K-means and deep learning
CN110516533B (en) Pedestrian re-identification method based on depth measurement
CN109948534B (en) Method for face recognition by adopting fast density peak value clustering
CN110942091A (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN105631416A (en) Method for carrying out face recognition by using novel density clustering
CN109241816B (en) Image re-identification system based on label optimization and loss function determination method
CN107194351B (en) Face recognition feature extraction method based on Weber local symmetric graph structure
CN110705648A (en) Large-scale multi-view data self-dimension-reduction K-means algorithm and system
CN107578070A (en) K means initial cluster center method for optimizing based on neighborhood information and mean difference degree
CN107358172B (en) Human face feature point initialization method based on human face orientation classification
CN102663681B (en) Gray scale image segmentation method based on sequencing K-mean algorithm
CN112115806A (en) Remote sensing image scene accurate classification method based on Dual-ResNet small sample learning
WO2020114109A1 (en) Interpretation method and apparatus for embedding result
CN110188864B (en) Small sample learning method based on distribution representation and distribution measurement
CN109948662B (en) Face image depth clustering method based on K-means and MMD
CN115344693A (en) Clustering method based on fusion of traditional algorithm and neural network algorithm
CN114463552A (en) Transfer learning and pedestrian re-identification method and related equipment
CN103577825B (en) The Motion parameters method of synthetic aperture sonar picture and automatic recognition system
CN118072119A (en) Multi-source heterogeneous data distillation method and device for data privacy protection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20180112)