Nothing Special   »   [go: up one dir, main page]

skip to main content
10.5555/2872599.2872610acmconferencesArticle/Chapter ViewAbstractPublication Pageshp3cConference Proceedingsconference-collections
research-article

Incremental, distributed single-linkage hierarchical clustering algorithm using mapreduce

Published: 12 April 2015 Publication History

Abstract

Single-linkage hierarchical clustering is one of the prominent and widely-used data mining techniques for its informative representation of clustering results. However, the parallelization of this algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. Moreover, in many modern applications, new data is continuously added into the already huge datasets. It would be impractical to reapply the clustering algorithm on the augmented datasets from scratch.
In this paper, we propose a unified algorithm which can not only cluster the large dataset, but also incorporate the newly arrived data incrementally. More specifically, we formulate the single-linkage hierarchical clustering problem as a Minimum Spanning Tree (MST) construction problem on a complete graph. The algorithm decomposes the complete graph into a set of non-overlapped subgraphs, computes the intermediate sub-MSTs for each subgraph in parallel, and merges all the sub-MSTs to achieve the final solution. In addition, the same framework can treat the incremental data insertion as a separate data subset and integrate it nicely with the existing solution. We implement the unified algorithm by employing MapReduce framework.
Using both synthetic and real-world datasets containing up to millions of high-dimensional points, we show that the proposed algorithm achieves a scalable speedup up to 200 on 300 computer cores for the base dataset and a speedup up to 120 for the dataset with maximum 5% random insertion.

References

[1]
http://www.facebook.com/blog/blog.php?topic_id=185341929641. {Online}.
[2]
http://www.nersc.gov/users/computational-systems/testbeds/jesup. {Online}.
[3]
SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 1 (1973).
[4]
Agrawal, R., and Srikant, R. Quest Synthetic Data Generator. IBM Almaden Research Center (1994).
[5]
Alizadeh, A., Eisen, M., Davis, R., Ma, C., Lossos, I., Rosenwald, A., Boldrick, J., Sabet, H., Tran, T., Yu, X., Ji, P., Yang, L., Ge, M., Moore, T., Hudson, J., Lu, L., Tibshirani, R., Sherlock, G., Chan, W., Greiner, T., Weisenburger, D., Armitage, J., Warnke, R., Levy, R., Wilson, W., Grever, M., Byrd, J., Botstein, D., Brown, P., and Staudt, L. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000).
[6]
Bertone, S., De Lucia, G., and Thomas, P. The recycling of gas and metals in galaxy formation: predictions of a dynamical feedback model. Monthly Notices of the Royal Astronomical Society 379, 3 (2007), 1143--1154.
[7]
Boruvka, O. O Jistém Problému Minimálním (About a Certain Minimal Problem) (in Czech, German summary). Práce Mor. Prírodoved. Spol. v Brne III 3 (1926).
[8]
Bower, R., Benson, A., Malbon, R., Helly, J., Frenk, C., Baugh, C., Cole, S., and Lacey, C. Breaking the hierarchy of galaxy formation. Monthly Notices of the Royal Astronomical Society 370, 2 (2006), 645--655.
[9]
Cathey, R. J., Jensen, E. C., Beitzel, S. M., Frieder, O., and Grossman, D. Exploiting parallelism to support scalable hierarchical clustering. Journal of the American Society for Information Science and Technology 58, 8 (2007), 1207--1221.
[10]
Chang, D.-J., Kantardzic, M. M., and Ouyang, M. Hierarchical clustering with CUDA/GPU. In ISCA PDCCS, J. H. Graham and A. Skjellum, Eds., ISCA (2009), 7--12.
[11]
Chen, C., Hwang, S.-C., and Oyang, Y.-J. An incremental hierarchical data clustering algorithm based on gravity theory. In Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD (London, UK, 2002), 237--250.
[12]
Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to Algorithms, 3rd ed. The MIT Press, 2009.
[13]
De Lucia, G., and Blaizot, J. The hierarchical formation of the brightest cluster galaxies. Monthly Notices of the Royal Astronomical Society 375, 1 (2007), 2--14.
[14]
Dean, J., and Ghemawat, S. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6, OSDI'04, USENIX Association (Berkeley, CA, USA, 2004), 10--10.
[15]
Gurrutxaga, I., Arbelaitz, O., Martín, J. I., Muguerza, J., Pérez, J. M., and Perona, I. n. SIHC: A Stable Incremental Hierarchical Clustering Algorithm. In ICEIS (2)'09 (2009), 300--304.
[16]
Heisele, B., Serre, T., Mukherjee, S., and Poggio, T. Feature reduction and hierarchy of classifiers for fast object detection in video images. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on 2 (2001), 18.
[17]
Hendrix, W., Patwary, M. M. A., Agrawal, A., keng Liao, W., and Choudhary, A. Parallel hierarchical clustering on shared memory platforms. In HiPC (2012), 1--9.
[18]
Hofmeyr, S. A., Forrest, S., and Somayaji, A. Intrusion detection using sequences of system calls. Journal of Computer Security 6 (1998), 151--180.
[19]
Jain, A. K., Murty, M. N., and Flynn, P. J. Data clustering: A review, 1999.
[20]
Kruskal, J. B. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society 7, 1 (Feb. 1956), 48--50.
[21]
Lattanzi, S., Moseley, B., Suri, S., and Vassilvitskii, S. Filtering: a method for solving graph problems in mapreduce. In Proceedings of the 23rd ACM symposium on Parallelism in algorithms and architectures, SPAA (2011), 85--94.
[22]
Liu, L., Rui, Y., Sun, L., Yang, B., Zhang, J., and Yang, S.-Q. Topic mining on web-shared videos. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2008), 2145--2148.
[23]
Madeira, S. C., and Oliveira, A. L. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinformatics 1, 1 (Jan. 2004), 24--45.
[24]
Pisharath, J., Liu, Y., keng Liao, W., Memik, G., Choudhary, A., and Dubey, P. NU-MineBench 3.0.
[25]
Prim, R. C. Shortest connection networks and some generalizations. Bell System Technology Journal (1957).
[26]
Rastogi, V., Machanavajjhala, A., Chitnis, L., and Sarma, A. D. Finding connected components on map-reduce in logarithmic rounds. CoRR abs/1203.5387 (2012).
[27]
Sahoo, N., Callan, J., Krishnan, R., Duncan, G., and Padman, R. Incremental hierarchical clustering of text documents. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM '06 (New York, NY, USA, 2006), 357--366.
[28]
Surdeanu, M., Turmo, J., and Ageno, A. A hybrid unsupervised approach for document clustering. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (2005), 685--690.
[29]
Wang, S., and Dutta, H. Parable: A parallel random-partition based hierarchical clustering algorithm for the mapreduce framework. Technical Report, CCLS-11-04 (2011).
[30]
Widyantoro, D. H., Ioerger, T. R., and Yen, J. An incremental approach to building a cluster hierarchy. In IEEE International Conference on Data Mining (2002), 705--708.

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
HPC '15: Proceedings of the Symposium on High Performance Computing
April 2015
253 pages
ISBN:9781510801011

Sponsors

Publisher

Society for Computer Simulation International

San Diego, CA, United States

Publication History

Published: 12 April 2015

Check for updates

Author Tags

  1. MapReduce
  2. hierarchical clustering
  3. incremental hierarchical clustering

Qualifiers

  • Research-article

Conference

SpringSim '15
Sponsor:
SpringSim '15: 2015 Spring Simulation Multiconference
April 12 - 15, 2015
Virginia, Alexandria

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 150
    Total Downloads
  • Downloads (Last 12 months)8
  • Downloads (Last 6 weeks)1
Reflects downloads up to 14 Nov 2024

Other Metrics

Citations

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media