
CN107578070A - K-means initial cluster center optimization method based on neighborhood information and mean difference degree - Google Patents

K-means initial cluster center optimization method based on neighborhood information and mean difference degree

Info

Publication number
CN107578070A
CN107578070A
Authority
CN
China
Prior art keywords: clustering, distance, matrix, calculating, sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710849046.4A
Other languages
Chinese (zh)
Inventor
吴仲城
吴紫恒
李芳
张俊
罗健飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Kemei Network Information Technology Co Ltd
Original Assignee
Anhui Kemei Network Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Kemei Network Information Technology Co Ltd filed Critical Anhui Kemei Network Information Technology Co Ltd
Priority to CN201710849046.4A
Publication of CN107578070A
Legal status: Pending

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a K-means initial cluster center optimization method based on neighborhood information and mean difference degree, comprising: S1, inputting a sample set X = {X_1, X_2, ..., X_i, ..., X_n} of n objects, determining the number of clusters K, and initializing the number of currently determined initial cluster centers k = 0; S2, forming a distance matrix D; S3, determining the neighborhood radius value δ; S4, calculating the number of samples N_i in the δ-neighborhood of each sample point X_i to form a matrix N; S5, taking the sample point X_i with the largest δ-neighborhood sample count N_i in N as the 1st cluster center C_1, letting k = k + 1, and setting the corresponding N_i in N to 0; S6, finding the sample point X_j with the largest δ-neighborhood sample count N_j in N, calculating its distances to the cluster centers {C_1, C_2, ..., C_k}, and setting the corresponding N_j in N to 0; S7, if no distance between X_j and the cluster centers is less than the mean difference degree M, letting C_{k+1} = X_j and k = k + 1, otherwise returning to S6; S8, if the current number of cluster centers k equals the number of clusters K, outputting the K initial cluster centers, otherwise returning to S6; S9, clustering the whole sample set with the K-means clustering algorithm and outputting the clustering result.

Description

K-means initial clustering center optimization method based on neighborhood information and average difference degree
Technical Field
The invention relates to the technical field of data mining, and in particular to a K-means initial clustering center optimization method based on neighborhood information and average difference degree.
Background
A clustering algorithm is an unsupervised classification algorithm: a group of objects without class labels is divided into several classes according to a certain similarity measure, so that the distance between objects within the same class is as small as possible and the distance between objects in different classes is as large as possible. Clustering is one of the basic methods of system modeling and data mining and is widely applied in fields such as text classification and image recognition.
The K-means clustering algorithm is a dynamic clustering algorithm based on partitioning and, owing to its simplicity, has become one of the most popular clustering methods. The traditional K-means clustering algorithm has two main drawbacks. First, its clustering result is easily affected by the initial cluster centers; when the initial cluster centers are chosen unreasonably, the clustering results may be inconsistent or the algorithm may fail to converge. Second, it cannot overcome the adverse effect of outliers on the clustering result, so the result is unstable and its accuracy is low.
Disclosure of Invention
The invention aims to provide a K-means initial clustering center optimization method based on neighborhood information and average difference degree, so as to improve the accuracy of the K-means clustering algorithm.
In order to achieve this purpose, the invention adopts the following technical scheme: a K-means initial clustering center optimization method based on neighborhood information and average difference degree is provided, comprising the following steps:
S1, inputting a sample set X = {X_1, X_2, ..., X_i, ..., X_n} of n objects, where each X_i is an m-dimensional vector, determining the number of clustering classes K, and initializing the number of currently determined initial clustering centers k = 0;
S2, calculating the distance between every two objects in the sample set to form a distance matrix D;
S3, calculating the overall average difference degree M of the sample set and determining a neighborhood radius value δ;
S4, based on the distance matrix D, calculating the number of samples N_i in the δ-neighborhood of each sample point X_i to form a matrix N;
S5, taking the sample point X_i with the largest δ-neighborhood sample count N_i in the matrix N as the 1st clustering center C_1, letting k = k + 1, and setting the largest N_i in the matrix N to 0;
S6, continuing to find the sample point X_j with the largest δ-neighborhood sample count N_j in the matrix N, calculating its distances to the existing clustering centers {C_1, C_2, ..., C_k}, and setting the largest N_j in the matrix N to 0;
S7, if no distance between X_j and the clustering centers in {C_1, C_2, ..., C_k} is less than the average difference degree M, letting C_{k+1} = X_j and k = k + 1; otherwise, returning to step S6;
S8, if the current number of clustering centers k equals the number of clustering classes K, outputting the K initial clustering centers C = {C_1, C_2, ..., C_K}; otherwise, returning to step S6;
S9, clustering the whole sample set with the K-means clustering algorithm and outputting the clustering result.
In step S2, the distance between every two objects in the sample set is calculated as follows:
Suppose two objects in the sample set are X_i and X_j, where X_i = {x_i1, x_i2, ..., x_im} and X_j = {x_j1, x_j2, ..., x_jm}; then the distance d_ij between X_i and X_j is the Euclidean distance
d_ij = sqrt( Σ_{l=1}^{m} (x_il - x_jl)^2 ).
in step S3, the process of calculating the overall average difference M of the sample set specifically includes:
calculating a sample X i The average degree of difference of (a) is:
calculating the overall average difference degree of the sample set as:
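For illustration, a minimal Python sketch of steps S2 and S3 is given below. It is not taken from the patent: the function names and the random example data are assumptions made here, and the per-sample average difference degree is assumed to be the mean distance from X_i to the other n - 1 samples.

```python
import numpy as np


def distance_matrix(X):
    """Step S2: pairwise Euclidean distance matrix D for a sample array of shape (n, m)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def overall_average_difference(D):
    """Step S3: overall average difference degree M, assumed here to be the mean of
    each sample's average distance to the other n - 1 samples."""
    n = D.shape[0]
    per_sample = D.sum(axis=1) / (n - 1)    # average difference degree of each X_i
    return per_sample.mean()


X = np.random.rand(200, 2)                  # illustrative data only
D = distance_matrix(X)
M = overall_average_difference(D)
delta = M / 4.0                             # neighborhood radius, as in the embodiment below
```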
Compared with the prior art, the invention has the following technical effects: the invention comprehensively considers the spatial distribution of the data set, so the obtained clustering centers are more reasonable and better match the actual situation. The invention not only effectively overcomes the blindness and randomness of initial clustering center selection in the K-means algorithm, significantly improving the accuracy and stability of the clustering result and requiring fewer iterations, but also alleviates the sensitivity to outliers to a certain extent.
Drawings
The following detailed description of embodiments of the invention refers to the accompanying drawings in which:
FIG. 1 is a schematic flow chart of a K-means initial clustering center optimization method based on neighborhood information and average difference in the invention.
Detailed Description
To further illustrate the features of the present invention, refer to the following detailed description of the invention and the accompanying drawings. The drawings are for reference and illustration purposes only and are not intended to limit the scope of the present disclosure.
As shown in fig. 1, the present embodiment discloses a K-means initial clustering center optimization method based on neighborhood information and average difference degree, which includes the following steps:
S1, inputting a sample set X = {X_1, X_2, ..., X_i, ..., X_n} of n objects, where each X_i is an m-dimensional vector, determining the number of clustering classes K, and initializing the number of currently determined initial clustering centers k = 0;
S2, calculating the distance between every two objects in the sample set to form a distance matrix D;
S3, calculating the overall average difference degree M of the sample set and determining a neighborhood radius value δ;
S4, based on the distance matrix D, calculating the number of samples N_i in the δ-neighborhood of each sample point X_i to form a matrix N;
S5, taking the sample point X_i with the largest δ-neighborhood sample count N_i in the matrix N as the 1st clustering center C_1, letting k = k + 1, and setting the largest N_i in the matrix N to 0;
S6, continuing to find the sample point X_j with the largest δ-neighborhood sample count N_j in the matrix N, calculating its distances to the existing clustering centers {C_1, C_2, ..., C_k}, and setting the largest N_j in the matrix N to 0;
S7, if no distance between X_j and the clustering centers in {C_1, C_2, ..., C_k} is less than the average difference degree M, letting C_{k+1} = X_j and k = k + 1; otherwise, returning to step S6;
S8, if the current number of clustering centers k equals the number of clustering classes K, outputting the K initial clustering centers C = {C_1, C_2, ..., C_K}; otherwise, returning to step S6;
S9, clustering the whole sample set with the K-means clustering algorithm and outputting the clustering result.
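The following Python sketch illustrates how steps S5 through S9 could be realized. It is not the patent's reference implementation: the helper name select_initial_centers, the synthetic three-blob data set, and the use of scikit-learn's KMeans with a user-supplied init array are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def select_initial_centers(X, K, delta, M, D):
    """Steps S5-S8: repeatedly take the point whose delta-neighborhood is densest
    and keep it as a new center only if it lies at least M away from every
    center already chosen."""
    counts = (D <= delta).sum(axis=1)                # matrix N from step S4
    centers = []
    while len(centers) < K:
        j = int(np.argmax(counts))                   # densest remaining candidate
        if counts[j] == 0:
            raise RuntimeError("candidates exhausted; delta or M may need tuning")
        counts[j] = 0                                # "set N_j to 0"
        if not centers:
            centers.append(X[j])                     # step S5: first center C_1
        elif np.all(np.linalg.norm(np.asarray(centers) - X[j], axis=1) >= M):
            centers.append(X[j])                     # step S7: far enough from all centers
    return np.asarray(centers)


# Synthetic example with three well-separated blobs, then step S9 with K-means.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.8, random_state=0)
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))    # step S2
M = (D.sum(axis=1) / (len(X) - 1)).mean()                          # step S3
delta = M / 4.0
K = 3
initial_centers = select_initial_centers(X, K, delta, M, D)
labels = KMeans(n_clusters=K, init=initial_centers, n_init=1).fit_predict(X)
```

Passing an explicit array as init together with n_init=1 makes scikit-learn start from exactly these centers rather than re-sampling them, which is the role the selected centers play in step S9.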
Further, in step S2, the distance between every two objects in the sample set is calculated as follows:
Suppose two objects in the sample set are X_i and X_j, where X_i = {x_i1, x_i2, ..., x_im} and X_j = {x_j1, x_j2, ..., x_jm}; then the distance d_ij between X_i and X_j is the Euclidean distance
d_ij = sqrt( Σ_{l=1}^{m} (x_il - x_jl)^2 ).
further, in step S3, the process of calculating the overall average difference M of the sample set specifically includes:
calculating sample X i The average degree of difference of (a) is:
calculating the overall average difference degree of the sample set as:
further, in step S4, the neighborhood radius value δ is calculated as: δ = M/4.
Further, in step S3, the neighborhood is defined as follows:
Let <U, Δ> be a non-empty metric space, where U is a non-empty finite set of objects. The closed ball centered at a sample X ∈ U with radius δ is called the δ-neighborhood of X and is defined as:
n(X) = { y ∈ U | Δ(X, y) ≤ δ };
where δ ≥ 0 and Δ is a distance function, for which the Euclidean distance is adopted.
Further, the distance matrix D is specifically the n × n matrix D = [d_ij] (i, j = 1, 2, ..., n), where d_ij is the distance between X_i and X_j.
Further, the matrix N is specifically N = [N_1, N_2, ..., N_n], where N_i is the number of samples in the δ-neighborhood of sample point X_i.
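As a concrete illustration of the two matrices just defined, the small Python sketch below builds D and the neighborhood-count vector N for a toy data set; the function and variable names are not taken from the patent.

```python
import numpy as np


def build_matrices(X, delta):
    """Distance matrix D (n x n) and vector N, where N[i] counts the samples
    inside the closed delta-ball around X[i] (the point itself included)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    N = (D <= delta).sum(axis=1)
    return D, N


X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
D, N = build_matrices(X, delta=0.5)
# N == [2, 2, 1]: the two nearby points see each other, the far point only itself
```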
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A K-means initial clustering center optimization method based on neighborhood information and average difference degree, characterized by comprising the following steps:
S1, inputting a sample set X = {X_1, X_2, ..., X_i, ..., X_n} of n objects, where each X_i is an m-dimensional vector, determining the number of clustering classes K, and initializing the number of currently determined initial clustering centers k = 0;
S2, calculating the distance between every two objects in the sample set to form a distance matrix D;
S3, calculating the overall average difference degree M of the sample set and determining a neighborhood radius value δ;
S4, based on the distance matrix D, calculating the number of samples N_i in the δ-neighborhood of each sample point X_i to form a matrix N;
S5, taking the sample point X_i with the largest δ-neighborhood sample count N_i in the matrix N as the 1st clustering center C_1, letting k = k + 1, and setting the largest N_i in the matrix N to 0;
S6, continuing to find the sample point X_j with the largest δ-neighborhood sample count N_j in the matrix N, calculating its distances to the existing clustering centers {C_1, C_2, ..., C_k}, and setting the largest N_j in the matrix N to 0;
S7, if no distance between X_j and the clustering centers in {C_1, C_2, ..., C_k} is less than the average difference degree M, letting C_{k+1} = X_j and k = k + 1; otherwise, returning to step S6;
S8, if the current number of clustering centers k equals the number of clustering classes K, outputting the K initial clustering centers C = {C_1, C_2, ..., C_K}; otherwise, returning to step S6;
S9, clustering the whole sample set with the K-means clustering algorithm and outputting the clustering result.
2. The method according to claim 1, characterized in that, in step S2, the distance between every two objects in the sample set is calculated as follows:
Suppose two objects in the sample set are X_i and X_j, where X_i = {x_i1, x_i2, ..., x_im} and X_j = {x_j1, x_j2, ..., x_jm}; then the distance d_ij between X_i and X_j is the Euclidean distance
d_ij = sqrt( Σ_{l=1}^{m} (x_il - x_jl)^2 ).
3. The method according to claim 2, characterized in that, in step S3, the overall average difference degree M of the sample set is calculated as follows:
The average difference degree of sample X_i is
M_i = (1 / (n - 1)) * Σ_{j=1, j≠i}^{n} d_ij,
and the overall average difference degree of the sample set is
M = (1 / n) * Σ_{i=1}^{n} M_i.
CN201710849046.4A 2017-09-19 2017-09-19 K means initial cluster center method for optimizing based on neighborhood information and mean difference degree Pending CN107578070A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710849046.4A CN107578070A (en) 2017-09-19 2017-09-19 K means initial cluster center method for optimizing based on neighborhood information and mean difference degree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710849046.4A CN107578070A (en) 2017-09-19 2017-09-19 K means initial cluster center method for optimizing based on neighborhood information and mean difference degree

Publications (1)

Publication Number Publication Date
CN107578070A true CN107578070A (en) 2018-01-12

Family

ID=61032940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710849046.4A Pending CN107578070A (en) 2017-09-19 2017-09-19 K means initial cluster center method for optimizing based on neighborhood information and mean difference degree

Country Status (1)

Country Link
CN (1) CN107578070A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805174A (en) * 2018-05-18 2018-11-13 广东惠禾科技发展有限公司 clustering method and device
CN109063781A (en) * 2018-08-14 2018-12-21 浙江理工大学 A kind of fuzzy image Fabric Design method of imitative natural colour function and form
CN111860700A (en) * 2020-09-22 2020-10-30 深圳须弥云图空间科技有限公司 Energy consumption classification method and device, storage medium and equipment

Similar Documents

Publication Publication Date Title
US20200160124A1 (en) Fine-grained image recognition
CN103258210B (en) A kind of high-definition image classification method based on dictionary learning
CN109273054B (en) Protein subcellular interval prediction method based on relational graph
CN110097060B (en) Open set identification method for trunk image
TWI464604B (en) Data clustering method and device, data processing apparatus and image processing apparatus
CN106845528A (en) 2017-06-13 An image classification algorithm based on K-means and deep learning
CN110516533B (en) Pedestrian re-identification method based on depth measurement
CN109948534B (en) Method for face recognition by adopting fast density peak value clustering
CN110942091A (en) Semi-supervised few-sample image classification method for searching reliable abnormal data center
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN105631416A (en) Method for carrying out face recognition by using novel density clustering
CN109241816B (en) Image re-identification system based on label optimization and loss function determination method
CN107194351B (en) Face recognition feature extraction method based on Weber local symmetric graph structure
CN110705648A (en) Large-scale multi-view data self-dimension-reduction K-means algorithm and system
CN107578070A (en) K means initial cluster center method for optimizing based on neighborhood information and mean difference degree
CN107358172B (en) Human face feature point initialization method based on human face orientation classification
CN102663681B (en) Gray scale image segmentation method based on sequencing K-mean algorithm
CN112115806A (en) Remote sensing image scene accurate classification method based on Dual-ResNet small sample learning
WO2020114109A1 (en) Interpretation method and apparatus for embedding result
CN110188864B (en) Small sample learning method based on distribution representation and distribution measurement
CN109948662B (en) Face image depth clustering method based on K-means and MMD
CN115344693A (en) Clustering method based on fusion of traditional algorithm and neural network algorithm
CN114463552A (en) Transfer learning and pedestrian re-identification method and related equipment
CN103577825B (en) The Motion parameters method of synthetic aperture sonar picture and automatic recognition system
CN118072119A (en) Multi-source heterogeneous data distillation method and device for data privacy protection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20180112)