
CN109214468A - Data clustering method based on extension-distance optimization of cluster centers - Google Patents

Data clustering method based on extension-distance optimization of cluster centers Download PDF

Info

Publication number
CN109214468A
CN109214468A (application number CN201811215252.0A)
Authority
CN
China
Prior art keywords
distance
sample
clustering
extension
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811215252.0A
Other languages
Chinese (zh)
Inventor
赵燕伟
朱芬
谢智伟
徐晨
桂方志
任设东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201811215252.0A priority Critical patent/CN109214468A/en
Publication of CN109214468A publication Critical patent/CN109214468A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data clustering method based on extension-distance optimization of cluster centers. The classical distances between samples are mapped onto an extension interval, so that the distance between every two samples becomes a point of that interval; extension-space computation then yields the side distances and the mapping interval. The average extension side distances are proposed as quantitative indices of cluster-center quality: the average extension left side distance measures sample closeness, and the average extension right side distance measures separation from the cluster centers. Closeness and separation then serve as the center-selection criterion: the sample distances are sorted in ascending order, and the center coordinate corresponding to the first distance exceeding the closeness index becomes the first initial cluster center; each subsequent sample distance is accepted as a further center only if it exceeds the closeness index and its extension distance to every already-selected cluster center exceeds the separation index, until the optimal initial centers are chosen. If too few satisfactory centers are found, a scaling factor is used to shrink the mapping interval and the search repeats.

Description

Data clustering method based on extension-distance optimization of the clustering center
Technical Field
The invention relates to a data clustering method that optimizes the initial clustering centers using extension distances.
Background
With the rapid development of big-data technology, the ability to mine the hidden knowledge in data has become an important factor in whether an enterprise stays competitive. Clustering is an important means of data analysis, widely applied in data mining, image processing, and related fields. Existing improved algorithms, however, still allow the first initial clustering center to fall on a sparse boundary, which causes poor intra-cluster balance during clustering and an unsatisfactory clustering result. An algorithm based on extension-distance k-means is therefore proposed: classical distances are mapped to extension distances, and the "intra-class difference" of the extension distances allows the sample distances to be screened discriminatively, preventing center points from being placed on sparse boundaries and enabling effective sample clustering.
Disclosure of Invention
In order to overcome the tendency of conventional clustering methods to place initial cluster centers on sparse boundaries, a data clustering method based on extension-distance optimization of the clustering centers is provided; extension theory is used to achieve a better clustering result.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a data clustering method based on extension-distance optimization of the clustering centers comprises the following steps:
S1: Constructing the equivalent dense distance interval: normalize the sample data, compute the set of normalized inter-sample classical distances to measure the density among sample points, and construct the equivalent dense distance interval [A, B] according to the sorting rule;
Let the set of n samples be X = {x1, x2, …, xn}, where each xi is an m-dimensional vector (i = 1, 2, …, n). Then
Z = [A, B] = [min(D), max(D)]    (1)
where D = {d(xi, xj) | 1 ≤ i < j ≤ n} is the set of classical distances between every two samples. The inter-sample distances measure the density of the sample distribution.
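Step S1 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the classical distance is assumed to be Euclidean and the normalization is assumed to be per-dimension min-max scaling, and the function name is hypothetical.

```python
import itertools
import math

def dense_interval(samples):
    """Step S1 sketch: min-max normalize, compute all pairwise classical
    (Euclidean) distances, and form Z = [A, B] = [min(D), max(D)]."""
    dims = list(zip(*samples))
    lo = [min(d) for d in dims]
    hi = [max(d) for d in dims]
    # Per-dimension min-max normalization (assumed scheme).
    norm = [tuple((v - a) / (b - a) if b > a else 0.0
                  for v, a, b in zip(x, lo, hi))
            for x in samples]
    # One classical distance per unordered pair: C(n, 2) values in total.
    D = [math.dist(p, q) for p, q in itertools.combinations(norm, 2)]
    return D, (min(D), max(D))
```

For three points forming a unit right angle, the interval endpoints are simply the shortest and longest pairwise distances, 1 and √2.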
S2: Mapping the sample distances to the equivalent dense distance interval: taking each sample distance from S1 as a point to be mapped and the equivalent dense distance interval as the extension interval, map each sample distance to a point of the extension interval according to equations (2) and (3), obtaining its extension left side distance and right side distance;
S3: Construction of the evaluation indexes: from the extension side distances of all sample distances, obtain by formula (4) the corresponding average extension left side distance and average extension right side distance between samples, which serve respectively as the indexes measuring cluster-center quality, namely the closeness index and the separation index.
Here C(n, 2) denotes the number of combinations of 2 samples taken arbitrarily from the n samples.
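Because formulas (2) to (4) appear only as unreadable images in this copy, the sketch below uses an assumed form of the side distances: the left side distance of a mapped distance d is taken as d − A and the right side distance as B − d, averaged over all C(n, 2) sample distances to give the closeness and separation indices. Both function names and formulas are assumptions, not the patented definitions.

```python
def side_distances(d, A, B):
    """Assumed stand-in for eqs. (2)-(3): distance of the mapped point d
    from the left endpoint A and the right endpoint B of Z = [A, B]."""
    return d - A, B - d

def average_side_distances(D, A, B):
    """Assumed stand-in for eq. (4): average the side distances over all
    C(n, 2) sample distances, giving the closeness index (average left
    side distance) and the separation index (average right side distance)."""
    n = len(D)
    left = sum(d - A for d in D) / n
    right = sum(B - d for d in D) / n
    return left, right
```

Under this reading, a small average left side distance means most sample distances sit near the dense end of the interval, while a large average right side distance indicates room for well-separated centers.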
S4: Screening the clustering center points: taking the extension side-distance closeness and separation indexes constructed in S3 as the screening principle for the initial clustering centers, obtain the optimized cluster center points.
In step S4, the following sub-steps are included:
S41: Screening the first cluster center: sort the extension side distances in ascending order, traverse them, and take the center-point coordinate corresponding to the first extension distance that is greater than the inter-sample closeness index as the first cluster center.
S42: Screening the non-first cluster centers: compute the extension distance between the center-point coordinate corresponding to the next value and each already-determined initial clustering center [c1, …, ci], and compare each of these distances with the average extension right side distance (the separation index). If all are greater, take this center-point coordinate as the next initial clustering center c(i+1), then check whether the number of cluster centers has reached the required count and stop the search if it has; otherwise the point is invalid and the next candidate cluster center is examined.
S43: After one full traversal, if the number of initial clustering centers has reached K, output the clustering center points; otherwise, execute step S44;
S44: Interval scaling: introduce a scaling factor and compute the reduction factor η according to formula (5); dynamically reduce the average extension right side distance of the samples, and repeat steps S41 to S42 until the number of clustering centers reaches K, completing the selection of the clustering centers.
S45: Cluster the samples according to the extension distances of the sample distances to obtain the clustering result.
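The sub-steps S41 to S45 can be sketched end to end as below. This is an illustrative reading, not the patented algorithm: formulas (2) to (5) are unreadable images in this copy, so the closeness and separation indices use the assumed side-distance forms d − A and B − d, the "center coordinate corresponding to a sample distance" is assumed to be the midpoint of the sample pair, and the reduction factor η of formula (5) is replaced by a fixed multiplicative shrink.

```python
import itertools
import math

def select_centers(samples, K, eta=0.9, max_rounds=50):
    """Illustrative sketch of steps S41-S45 under the assumptions
    stated above; not the patented formulas (2)-(5)."""
    # S1/S2 analogue: classical (Euclidean) pairwise distances, Z = [A, B].
    pairs = list(itertools.combinations(range(len(samples)), 2))
    D = [math.dist(samples[i], samples[j]) for i, j in pairs]
    A, B = min(D), max(D)
    # S3 analogue: assumed closeness (avg left side distance d - A)
    # and separation (avg right side distance B - d) indices.
    closeness = sum(d - A for d in D) / len(D)
    separation = sum(B - d for d in D) / len(D)
    # Assumed "center coordinate" of a sample distance: the pair midpoint.
    mids = [tuple((p + q) / 2 for p, q in zip(samples[i], samples[j]))
            for i, j in pairs]
    order = sorted(range(len(D)), key=lambda k: D[k])  # ascending distances
    for _ in range(max_rounds):
        centers = []
        for k in order:
            if D[k] - A <= closeness:
                continue  # S41: left side distance must exceed closeness
            # S42: candidate must lie farther than the separation index
            # from every center already selected.
            if all(math.dist(mids[k], c) > separation for c in centers):
                centers.append(mids[k])
            if len(centers) == K:
                return centers  # S43: required number of centers reached
        separation *= eta  # S44: shrink the threshold and traverse again
    return centers  # best effort if K centers were never reached
```

On a toy two-cluster set such as [(0, 0), (0.1, 0), (5, 5), (5.1, 5)], `select_centers(samples, 2)` returns two center candidates; the scaling loop of S44 is what eventually admits a second center when the initial separation threshold is too strict.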
Furthermore, the samples are clustered according to these center points; the optimized initial cluster centers avoid the defect of centers appearing in sparse boundary regions, are distributed in dense regions to the greatest extent, and are uniformly spread.
Compared with the prior art, the invention has the beneficial effects that:
1. The density among samples is reflected by the sample distances, which are used to construct the dense interval. 2. Extension-distance theory is fused in: the distance between two samples is mapped as a point onto the extension interval, and the sample distances that measure density are screened discriminatively in the extension-distance domain, effectively preventing center points from falling on sparse boundaries. 3. A scaling factor is introduced to scale the interval. 4. The results are scientific and reasonable and well suited to clustering high-dimensional sample data, so the method has broad application prospects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 shows the mapping from classical distance to extension distance in the embodiment.
Figs. 3a to 3c are the initial-cluster-center distributions, before and after improvement, on the Iris data: two-dimensional displays of the initial cluster centers obtained on the Iris sample set by the algorithm of this embodiment and the two comparison algorithms, where Fig. 3a uses the extension-distance method, Fig. 3b the density method, and Fig. 3c the average-difference method.
Figs. 4a to 4c are the initial-cluster-center distributions, before and after improvement, on the Wine data: two-dimensional displays of the initial cluster centers obtained on the Wine sample set by the algorithm of this embodiment and the two comparison algorithms, where Fig. 4a uses the extension-distance method, Fig. 4b the density method, and Fig. 4c the average-difference method.
Figs. 5a to 5d are the clustering results on the Iris data from the initial cluster centers before and after improvement: three-dimensional displays of the clustering obtained on the Iris sample set, where Fig. 5a is the actual clustering, Fig. 5b the extension-distance method, Fig. 5c the density method, and Fig. 5d the average-difference method.
Figs. 6a to 6d are the clustering results on the Wine data from the initial cluster centers before and after improvement: three-dimensional displays of the clustering obtained on the Wine sample set, where Fig. 6a is the actual clustering, Fig. 6b the extension-distance method, Fig. 6c the density method, and Fig. 6d the average-difference method.
Detailed Description
The invention is further illustrated by the following figures and examples.
Examples
This embodiment provides a k-means algorithm based on extension-distance optimization of the clustering centers, comprising the following steps:
s1: the method is compared with a traditional algorithm, an average difference degree K-mean clustering algorithm and a density-based improved K-means clustering algorithm. Constructing an equivalent dense distance interval [ A, B ] by using the classical distance between every two samples in the test sample set;
the test data set adopted in this example is derived from an Iris data set and a Wine data set used for test clustering in a UCI database, and the characteristics of each data are shown in table 1:
TABLE 1 basic characteristics of the data sets
The K-means clustering algorithm based on average difference degree is from: Liwu, Zhao Jia Yan, Titaishan. Improved K-means clustering algorithm optimizing initial cluster centers based on average difference degree [J]. Control and Decision, 2017, 32(04): 759-762.
The density-based improved K-means clustering algorithm is from: Hanjun, Neijian, Huang He, et al. Power supply block division method based on improved K-means clustering algorithm [J]. Electric Power Automation Equipment, 2015, 35(06): 123-.
S2: Mapping the sample distances to the equivalent dense distance interval: taking each sample distance D from S1 as a point to be mapped and the equivalent dense distance interval [A, B] as the extension interval, map each sample distance to a point of the extension interval according to equations (2) and (3), as described in Fig. 1, obtaining its extension left side distance and right side distance;
S3: Construction of the evaluation indexes: from the extension side distances of all sample distances D, obtain by formula (4) the corresponding average extension left side distance and average extension right side distance between samples, which serve respectively as the indexes measuring cluster-center quality, namely the closeness index and the separation index.
Here C(n, 2) denotes the number of combinations of 2 samples taken arbitrarily from the n samples.
S4: Screening the clustering center points: taking the extension side-distance closeness and separation indexes constructed in S3 as the screening principle for the initial clustering centers, obtain the optimized cluster center points.
In step S4, the following sub-steps are included:
S41: Screening the first cluster center: sort the extension side distances in ascending order, traverse them, and take the center-point coordinate corresponding to the first extension distance that is greater than the inter-sample closeness index as the first cluster center.
S42: Screening the non-first cluster centers: compute the extension distance between the center-point coordinate corresponding to the next value and each already-determined initial clustering center [c1, …, ci], and compare each of these distances with the average extension right side distance (the separation index). If all are greater, take this center-point coordinate as the next initial clustering center c(i+1), then check whether the number of cluster centers has reached the required count and stop the search if it has; otherwise the point is invalid and the next candidate cluster center is examined.
S43: After one full traversal, if the number of initial clustering centers has reached K, output the clustering center points; otherwise, execute step S44;
S44: Interval scaling: introduce a scaling factor and compute the reduction factor η according to formula (5); dynamically reduce the average extension right side distance of the samples, and repeat steps S41 to S42 until the number of clustering centers reaches K, completing the selection of the clustering centers.
S45: Cluster the samples according to the extension distances of the sample distances to obtain the clustering result.
The clustering centers obtained on the two data sets by the method of this embodiment and the comparison methods are displayed in two-dimensional space, as shown in Figs. 3 and 4:
In this embodiment, the initial cluster centers selected by the improved algorithm are distributed more uniformly on both the low-dimensional and the high-dimensional data, and even when an initial cluster center lies in a boundary region, the data points around it are relatively dense.
Figs. 5 and 6 show three-dimensional displays of the clustering produced by the respective center points:
To quantitatively measure the effectiveness of the algorithms through the clustering produced by the center points, a classification-accuracy index and a clustering-balance index are adopted; the resulting statistics are shown in Tables 2 to 4.
TABLE 2 Cluster accuracy index before and after improvement
The statistics of the number of each cluster sample after clustering the initial clustering center point are shown in table 3:
TABLE 3 Number of samples in each cluster after clustering, for the algorithms before and after improvement
The balance index of the algorithm before and after improvement obtained from table 3 is shown in table 4:
TABLE 4 Cluster equalization index before and after improvement
The accuracy index is the ratio of the number of samples correctly classified into corresponding categories to the total number of samples; the cluster balance is the variance between the number of samples in each class and the number of samples in the actual corresponding class.
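The two indices defined above can be sketched as follows. The balance index is assumed here to be the population variance of the differences between each cluster's size and the corresponding actual class size, which is one plausible reading of the text; the function names are hypothetical.

```python
def accuracy_index(pred_labels, true_labels):
    """Ratio of samples assigned to the correct class to the total count."""
    hits = sum(p == t for p, t in zip(pred_labels, true_labels))
    return hits / len(true_labels)

def balance_index(pred_labels, true_labels, classes):
    """Assumed reading of the cluster-balance index: population variance
    of (cluster size - actual class size) across the classes."""
    diffs = [pred_labels.count(c) - true_labels.count(c) for c in classes]
    mean = sum(diffs) / len(diffs)
    return sum((x - mean) ** 2 for x in diffs) / len(diffs)
```

A perfectly balanced clustering, where every cluster has exactly the size of its actual class, gives a balance index of zero regardless of accuracy.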
This embodiment shows that, on high-dimensional data sets, the initial cluster centers selected by the improved k-means algorithm based on extension distance achieve higher accuracy and balance, making the method well suited to clustering massive, high-dimensional data in the big-data setting.
The embodiments described in this specification merely illustrate the proposed method on two sample sets from the UCI database. The scope of the present invention should not be construed as limited to the specific forms set forth in the embodiments; it also covers equivalent technical means that those skilled in the art can conceive according to the inventive concept.

Claims (1)

1. A data clustering method based on extension-distance optimization of the clustering centers, comprising the following steps:
S1: constructing the equivalent dense distance interval: normalizing the sample data, computing the set of normalized inter-sample classical distances to measure the density among sample points, and constructing the equivalent dense distance interval [A, B] according to the sorting rule;
let the set of n samples be X = {x1, x2, …, xn}, where each xi is an m-dimensional vector, i = 1, 2, …, n;
Z = [A, B] = [min(D), max(D)]    (1)
wherein D = {d(xi, xj) | 1 ≤ i < j ≤ n} is the set of classical distances between every two samples; the inter-sample distance is used to measure the density of the distribution among samples;
S2: mapping the sample distances to the equivalent dense distance interval: taking each sample distance from S1 as a point to be mapped and the equivalent dense distance interval as the extension interval, mapping each sample distance to a point of the extension interval according to equations (2) and (3), obtaining its extension left side distance and right side distance;
S3: construction of the evaluation indexes: from the extension side distances of all sample distances, obtaining by formula (4) the corresponding average extension left side distance and average extension right side distance between samples, serving respectively as the indexes measuring cluster-center quality, namely the closeness index and the separation index;
wherein C(n, 2) denotes the number of combinations of 2 samples taken arbitrarily from the n samples;
S4: screening the clustering center points: taking the extension side-distance closeness and separation indexes constructed in S3 as the screening principle of the initial clustering centers to obtain the optimized cluster center points;
in step S4, the following sub-steps are included:
S41: screening the first cluster center: sorting the extension side distances in ascending order, traversing them, and taking the center-point coordinate corresponding to the first extension distance greater than the inter-sample closeness index as the first cluster center;
s42: non-first cluster center screening: calculating the coordinates of the center point corresponding to the next value and the determined initial clustering center C1,...,ci]Distance between the sample and the sampleComparing if they are all greater thanThe center point coordinate is taken as the next initial clustering center ci+1Judging whether the cluster core number reaches the required number, and stopping searching when the cluster core number reaches the required number; otherwise, judging the next potential cluster center point if the point is invalid;
S43: after one full traversal, if the number of initial clustering centers has reached K, outputting the clustering center points; otherwise executing step S44;
S44: interval scaling: introducing a scaling factor, computing the reduction factor η according to formula (5), dynamically reducing the average extension right side distance of the samples, and repeating steps S41 to S42 until the number of clustering centers reaches K, completing the selection of the clustering centers;
S45: clustering the samples according to the extension distances of the sample distances to obtain the clustering result.
CN201811215252.0A 2018-10-18 2018-10-18 Data clustering method based on extension-distance optimization of cluster centers Pending CN109214468A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811215252.0A CN109214468A (en) 2018-10-18 2018-10-18 Data clustering method based on extension-distance optimization of cluster centers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811215252.0A CN109214468A (en) 2018-10-18 2018-10-18 Data clustering method based on extension-distance optimization of cluster centers

Publications (1)

Publication Number Publication Date
CN109214468A true CN109214468A (en) 2019-01-15

Family

ID=64980831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811215252.0A Pending CN109214468A (en) Data clustering method based on extension-distance optimization of cluster centers

Country Status (1)

Country Link
CN (1) CN109214468A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112034473A * 2020-08-31 2020-12-04 福建省特种设备检验研究院 Method, device and equipment for measuring distance between guide rail brackets of elevator and storage medium
CN112034473B * 2020-08-31 2024-02-27 福建省特种设备检验研究院 Elevator guide rail bracket spacing measuring method, device, equipment and storage medium
CN112070548A * 2020-09-11 2020-12-11 上海风秩科技有限公司 User layering method, device, equipment and storage medium
CN112070548B * 2020-09-11 2024-02-20 上海秒针网络科技有限公司 User layering method, device, equipment and storage medium
CN113591672A * 2021-07-28 2021-11-02 常州大学 Detection method for identifying fish state based on Mask-Rcnn
CN113591672B * 2021-07-28 2024-05-03 常州大学 Detection method for identifying fish state based on Mask-Rcnn

Similar Documents

Publication Publication Date Title
Lengyel et al. Silhouette width using generalized mean—A flexible method for assessing clustering efficiency
García et al. Theoretical analysis of a performance measure for imbalanced data
CN105930862A (en) Density peak clustering algorithm based on density adaptive distance
CN107832456B (en) Parallel KNN text classification method based on critical value data division
CN104598586B (en) The method of large-scale text categorization
CN109214468A (en) Data clustering method based on extension-distance optimization of cluster centers
CN106056136A (en) Data clustering method for rapidly determining clustering center
CN111259933B (en) High-dimensional characteristic data classification method and system based on distributed parallel decision tree
JPWO2005050479A1 (en) Similar pattern search device, similar pattern search method, similar pattern search program, and fraction separation device
Yang et al. Density clustering with divergence distance and automatic center selection
WO2018006631A1 (en) User level automatic segmentation method and system
CN107016416B (en) Data classification prediction method based on neighborhood rough set and PCA fusion
Ding et al. Student behavior clustering method based on campus big data
CN113254717A (en) Multidimensional graph network node clustering processing method, apparatus and device
CN111563821A (en) Financial stock fluctuation prediction method based on quantitative investment of support vector machine
CN112650818B (en) Clustering mining method based on multidimensional time series data
Rani Visual analytics for comparing the impact of outliers in k-means and k-medoids algorithm
CN111027609B (en) Image data weighted classification method and system
CN109784354A (en) Based on the non-parametric clustering method and electronic equipment for improving classification effectiveness
Ji et al. Machine learning of discriminative gate locations for clinical diagnosis
CN110008994A (en) P-CFSFDP Density Clustering method based on the operation of Spark platform
CN113379823B (en) Minority sample generation method based on construction of equilateral balanced triangle SMOTE algorithm
CN108090514B (en) Infrared image identification method based on two-stage density clustering
CN113535527A (en) Load shedding method and system for real-time flow data predictive analysis
Ding et al. Evolutionary Parameter-Free Clustering Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190115