Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3626183.3659973acmconferencesArticle/Chapter ViewAbstractPublication PagesspaaConference Proceedingsconference-collections
research-article
Open access

Optimal Parallel Algorithms for Dendrogram Computation and Single-Linkage Clustering

Published: 17 June 2024 Publication History

Abstract

Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree T, the SLD of T is a binary dendrogram that summarizes the n-1 clusterings obtained by contracting the edges of T in order of weight. Existing algorithms for computing the SLD all require Ømega(nłog n) work where n = |T|. Furthermore, to the best of our knowledge no prior work provides a parallel algorithm obtaining non-trivial speedup for this problem. In this paper, we design faster parallel algorithms for computing SLDs both in theory and in practice based on new structural results about SLDs. In particular, we obtain a deterministic output-sensitive parallel algorithm based on parallel tree contraction that requires O(n łog h) work and O(łog^2 n łog^2 h) depth, where h is the height of the output SLD. We also give a deterministic bottom-up algorithm for the problem inspired by the nearest-neighbor chain algorithm for hierarchical agglomerative clustering, and show that it achieves O(nłog h) work and O(h łog n) depth. Our results are based on a novel divide-and-conquer framework for building SLDs, inspired by divide-and-conquer algorithms for Cartesian trees. Our new algorithms can quickly compute the SLD on billion-scale trees, and obtain up to 150x speedup over the highly-efficient Union-Find algorithm typically used to compute SLDs in practice.

References

[1]
Karl Abrahamson, Norm Dadoun, David G. Kirkpatrick, and T Przytycka. 1989. A simple parallel tree contraction algorithm. Journal of Algorithms, Vol. 10, 2 (1989), 287--302.
[2]
Umut A Acar, Vitaly Aksenov, and Sam Westrick. 2017. Brief Announcement: Parallel Dynamic Tree Contraction via Self-Adjusting Computation. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA).
[3]
Daniel Anderson. 2023. Parallel Batch-Dynamic Algorithms Dynamic Trees, Graphs, and Self-Adjusting Computation. Ph.,D. Dissertation. Carnegie Mellon University.
[4]
Dalya Baron. 2019. Machine Learning in Astronomy: a practical overview. arxiv: 1904.07248 [astro-ph.IM]
[5]
J-P Benzécri. 1982. Construction d'une classification ascendante hiérarchique par la recherche en cha^ine des voisins réciproques. Cahiers de l'analyse des données, Vol. 7, 2 (1982), 209--218.
[6]
Guy E. Blelloch, Daniel Anderson, and Laxman Dhulipala. 2020a. ParlayLib - A Toolkit for Parallel Algorithms on Shared-Memory Multicore Machines. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 507--509. https://cmuparlay.github.io/parlaylib/
[7]
Guy E. Blelloch, Laxman Dhulipala, and Yihan Sun. 2021. Introduction to Parallel Algorithms. https://www.cs.cmu.edu/ guyb/paralg/paralg/parallel.pdf. Carnegie Mellon University.
[8]
Guy E. Blelloch, Jeremy T. Fineman, Yan Gu, and Yihan Sun. 2020b. Optimal Parallel Algorithms in the Binary-Forking Model. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 89--102.
[9]
Guy E. Blelloch, Yan Gu, Julian Shun, and Yihan Sun. 2020c. Randomized Incremental Convex Hull is Highly Parallel. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 103--115.
[10]
Ricardo JGB Campello, Davoud Moulavi, Arthur Zimek, and Jörg Sander. 2015. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 10, 1 (2015), 1--51.
[11]
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms.
[12]
Laxman Dhulipala, Xiaojun Dong, Kishen N Gowda, and Yan Gu. 2024. Optimal Parallel Algorithms for Dendrogram Computation and Single-Linkage Clustering. arxiv: 2404.19019 [cs.DS]
[13]
Laxman Dhulipala, David Eisenstat, Jakub Łkacki, Vahab Mirrokni, and Jessica Shi. 2021. Hierarchical agglomerative graph clustering in nearly-linear time. In International Conference on Machine Learning. PMLR, 2676--2686.
[14]
E.D. Feigelson and G.J. Babu. 1998. Statistical Methodology for Large Astronomical Surveys. Symposium - International Astronomical Union, Vol. 179 (1998), 363--370.
[15]
Molly Gasperini, Andrew J Hill, José L McFaline-Figueroa, Beth Martin, Seungsoo Kim, Melissa D Zhang, Dana Jackson, Anh Leith, Jacob Schreiber, William S Noble, et al. 2019. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell, Vol. 176, 1 (2019), 377--390.
[16]
Hillel Gazit, Gary L Miller, and Shang-Hua Teng. 1988. Optimal tree contraction in the EREW model. In Concurrent Computations: Algorithms, Architecture, and Technology. 139--156.
[17]
Markus Götz, Gabriele Cavallaro, Thierry Géraud, Matthias Book, and Morris Riedel. 2018. Parallel computation of component trees on distributed memory machines. IEEE Transactions on Parallel and Distributed Systems, Vol. 29, 11 (2018), 2582--2598.
[18]
J. C. Gower and G. J. S. Ross. 1969. Minimum Spanning Trees and Single Linkage Cluster Analysis. Journal of the Royal Statistical Society. seriesx C (Applied Statistics), Vol. 18, 1 (1969), 54--64.
[19]
Yan Gu, Julian Shun, Yihan Sun, and Guy E. Blelloch. 2015. A Top-Down Parallel Semisort. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 24--34.
[20]
MohammadTaghi Hajiaghayi, Marina Knittel, Hamed Saleh, and Hsin-Hao Su. 2022. Adaptive Massively Parallel Constant-Round Tree Contraction. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), Mark Braverman (Ed.), Vol. 215. 83:1--83:23.
[21]
Jivrí Havel, Franccois Merciol, and Sébastien Lefèvre. 2019. Efficient tree construction for multiscale image representation and processing. Journal of Real-Time Image Processing, Vol. 16 (2019), 1129--1146.
[22]
William Hendrix, Diana Palsetia, Md Mostofa Ali Patwary, Ankit Agrawal, Wei-keng Liao, and Alok Choudhary. 2013. A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures. In IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV). IEEE, 7--13.
[23]
David B Henry, Patrick H Tolan, and Deborah Gorman-Smith. 2005. Cluster analysis in family psychology research. Journal of Family Psychology, Vol. 19, 1 (2005), 121.
[24]
J. JaJa. 1992. Introduction to Parallel Algorithms.
[25]
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In International World Wide Web Conference (WWW). 591--600.
[26]
Ivica Letunic and Peer Bork. 2007. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics, Vol. 23, 1 (2007), 127--128.
[27]
Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval.
[28]
Magdalen Dobson Manohar, Zheqi Shen, Guy Blelloch, Laxman Dhulipala, Yan Gu, Harsha Vardhan Simhadri, and Yihan Sun. 2024. ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate Nearest Neighbor Search Algorithms. In ACM Symposium on Principles and Practice of Parallel Programming (PPOPP). 270--285.
[29]
Gary L Miller and John H Reif. 1985. Parallel tree contraction and its application. In IEEE Symposium on Foundations of Computer Science (FOCS), Vol. 26. 478--489.
[30]
Corey J. Nolet, Divye Gala, Alex Fender, Mahesh doixjade, Joe Eaton, Edward Raff, John Zedlewski, Brad Rees, and Tim Oates. 2023. cuSLINK: Single-Linkage Agglomerative Clustering on the GPU. In ECML PKDD. 711--726.
[31]
Georgios K Ouzounis and Pierre Soille. 2012. The alpha-tree algorithm. JRC Scientific and Policy Report (2012).
[32]
John H. Reif and Stephen R. Tate. 1994. Dynamic parallel tree contraction (extended abstract). In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 114--121.
[33]
Piyush Sao, Andrey Prokopenko, and Damien Lebrun-Grandié. 2024. PANDORA: A Parallel Dendrogram Construction Algorithm for Single Linkage Clustering on GPU. arxiv: 2401.06089 [cs.LG]
[34]
Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. 2008. Introduction to information retrieval.
[35]
Julian Shun and Guy E. Blelloch. 2014. A simple parallel cartesian tree algorithm and its application to parallel suffix tree construction. ACM Trans. Parallel Comput., Vol. 1, 1 (2014).
[36]
Julian Shun, Yan Gu, Guy E. Blelloch, Jeremy T. Fineman, and Phillip B. Gibbons. 2015. Sequential random permutation, list contraction and tree contraction are highly parallel. In ACM-SIAM Symposium on Discrete Algorithms (SODA). 431--448.
[37]
Suhas Jayaram Subramanya, Devvrit, Rohan Kadekodi, Ravishankar Krishaswamy, and Harsha Vardhan Simhadri. 2019. DiskANN: fast accurate billion-point nearest neighbor search on a single node. In Neural Information Processing Systems (NeurIPS).
[38]
Baris Sumengen, Anand Rajagopalan, Gui Citovsky, David Simcha, Olivier Bachem, Pradipta Mitra, Sam Blasiak, Mason Liang, and Sanjiv Kumar. 2021. Scaling Hierarchical Agglomerative Clustering to Billion-sized Datasets. arxiv: 2105.11653 [cs.LG]
[39]
Yiqiu Wang, Shangdi Yu, Yan Gu, and Julian Shun. 2021. Fast Parallel Algorithms for Euclidean Minimum Spanning Tree and Hierarchical Spatial Clustering. In Proceedings of the 2021 International Conference on Management of Data. 1982--1995.
[40]
Lo"ic Yengo, Sailaja Vedantam, Eirini Marouli, Julia Sidorenko, Eric Bartell, Saori Sakaue, Marielisa Graff, Anders U Eliasen, Yunxuan Jiang, Sridharan Raghavan, et al. 2022. A saturated map of common genetic variants associated with human height. Nature, Vol. 610, 7933 (2022), 704--712.
[41]
Odilia Yim and Kylee T Ramdeen. 2015. Hierarchical cluster analysis: comparison of three linkage measures and application to psychological data. The Quantitative Methods for Psychology, Vol. 11, 1 (2015), 8--21.
[42]
Shangdi Yu, Yiqiu Wang, Yan Gu, Laxman Dhulipala, and Julian Shun. 2021. ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain. Proc. VLDB Endow., Vol. 15, 2 (2021), 285--298.

Cited By

View all
  • (2025)YOLO-HV: A fast YOLOv8-based method for measuring hemorrhage volumesBiomedical Signal Processing and Control10.1016/j.bspc.2024.107131100(107131)Online publication date: Feb-2025

Index Terms

  1. Optimal Parallel Algorithms for Dendrogram Computation and Single-Linkage Clustering

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      SPAA '24: Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures
      June 2024
      510 pages
      ISBN:9798400704161
      DOI:10.1145/3626183
      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 17 June 2024

      Check for updates

      Author Tags

      1. dendrograms
      2. hac
      3. hierarchical graph clustering
      4. parallel algorithms
      5. single-linkage clustering

      Qualifiers

      • Research-article

      Funding Sources

      Conference

      SPAA '24
      Sponsor:

      Acceptance Rates

      Overall Acceptance Rate 447 of 1,461 submissions, 31%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)76
      • Downloads (Last 6 weeks)20
      Reflects downloads up to 12 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2025)YOLO-HV: A fast YOLOv8-based method for measuring hemorrhage volumesBiomedical Signal Processing and Control10.1016/j.bspc.2024.107131100(107131)Online publication date: Feb-2025

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Get Access

      Login options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media