1 s2.0 S1877050923018549 Main
1 s2.0 S1877050923018549 Main
1 s2.0 S1877050923018549 Main
2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017)
Abstract: The traditional stand-alone K-means clustering algorithm has the limitation of time
consumption and memory overflow when dealing with large-scale data. Although this problem is
solved with the help of MapReduce framework. However, the clustering accuracy effect is not stable
due to the selection of initial clustering center. Therefore, this paper presents an algorithm for
optimizing the initial clustering center in K-means by using several equal-scale sampling, calculating
the local density and selecting the optimal initial clustering center. The experimental results show
that the optimized algorithm shortens the clustering time and improves the accuracy and stability of
clustering procedure in K-means.
1. Introduction
With the rapid development of information technology, data scale is increasing continually. The
large-scale data set for effective mining analysis can promote the development and progress of
information technology. Clustering analysis is an important data processing technology which is
widely used in data mining, information retrieval, and other research. The main goal is to divide the
data set into similar category in the same subset under the guarantee that similarity between different
subsets is small. K-means algorithm is one of the classical clustering algorithms in the partition
method that owns the characteristics of simple operation and fast convergence[1]. With the
continuous expansion of data size, the traditional stand-alone clustering algorithm is unable to meet
the needs of large data clustering. The parallel K-means clustering algorithm[2] is thus proposed that
combines the stand-alone k-means clustering algorithm and MapReduce to process large-scale data.
However, the parallel K-means algorithm still has some flaws: (1) Pre-given the categories and easy
to fall into local optimization; (2) Sensitive to edge and isolated points. Aiming at the above issues, a
method is proposed to determine K initial clustering centers based on data density[3], which can
select clustering center by calculating the density of the data to avoid random selection.
Unfortunately, the method must calculate the density of all data, the workload and time consumption
is immeasurable.
This paper presents an algorithm for optimizing the initial clustering center in K-means. First,
equal-scale sampling is adopted to ensure that no duplication of the sample set. Second, the local
density and average density of the sample is calculated and sorted where the larger local density of
data is selected as the initial clustering center. Finally, the optimization algorithm is implemented
under MapReduce framework. The experiment and performance analysis show that the proposed
algorithm can effectively reduce the iterations and clustering time and owns high clustering stability.
The rest of the paper is organized as follows. Section 2 mainly introduces the related technology
and work. Section 3 presents the optimization algorithm in detail. The experiment and performance
analysis is given in section 4. The conclusions are made in section 5.
178
Advances in Engineering Research, volume 138
179
Advances in Engineering Research, volume 138
180
Advances in Engineering Research, volume 138
100 900
PK-Means PK-Means
90 800
DK-Means DK-Means
OK-Means OK-Means
80
700
70
600
60
accuracy rate/%
500
time/s
50
400
40
300
30
200
20
10 100
0 0
30 60 90 120 150 30 60 90 120 150
5. Conclusions
In terms of the accuracy and stability of clustering large-scale data in K-means, an optimized
initial clustering center selection algorithm is proposed in this paper under MapReduce framework.
The initial clustering center is optimized by several equal-scale sampling methods, and the calulation
of local density of the data. The error square sum function is introduced to judge the fluctuation of the
center point in the iterative process. Compared with other existing approachs, the proposed
optimization algorithm can reduces the iterations and time consumption significantly when clustering
large-scale data, and owns better stability and accuracy.
6. Acknowledgments
This work was financially supported by National Natural Science Foundation of China (No.
61402095).
References
[1] Jigui S, Jie L, Lianyu Z, et al, Research on clustering algorithm[J], Journal of Software, 2008,
19(1):48-61.
[2] Weizhong Z, Huifang M, Yanxiang F, et al, Research on Parallel K - means Clustering Algorith
m Based on Cloud Computing Platform Hadoop[J], Computer Science, 2011, 38(10):166-168.
[3] Weiben Z, Yuexiang S, Optimization Algorithm of K - means Clustering Center Based on Densi
ty[J], Application Research of Computers, 2012, 29(5):1726-1728.
[4] LI Jian-Jiang, J Cui, D Wang, L Yan, et al. Survey of MapReduce paraller programming model [J],
Tien Tzu Hsueh Pao/acta Electronica Sinica, November 2011,39(11):2635-2642.
[5] Boukhdhir A, Lachiheb O, Gouider M S. An improved mapReduce design of kmeans for clusteri
ng very large datasets[C]// Ieee/acs, International Conference of Computer Systems and Application
s. 2015.
[6] Anchalia P P. Improved MapReduce k-Means Clustering Algorithm with Combiner[C]// Uksim-
Amss, International Conference on Computer Modelling and Simulation. IEEE, 2014:386-391.
[7] Cui X, Zhu P, Yang X, et al. Optimized big data K-means clustering using MapReduce[J]. The J
ournal of Supercomputing, 2014, 70(3):1249-1259.
[8] Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases[C]// AC
MSIGMOD International Conference on Management of Data. ACM, 1998:73-84.
181