Hop: Heterogeneity-aware Decentralized Training

Published: 04 April 2019


Recent work has shown that decentralized algorithms can outperform centralized ones in machine learning. The two approaches differ mainly in their communication patterns, and both are susceptible to performance degradation in heterogeneous environments. Although considerable effort has gone into making centralized algorithms robust to heterogeneity, the problem has received little attention for decentralized algorithms. This paper proposes Hop, the first heterogeneity-aware decentralized training protocol. Based on a unique characteristic of decentralized training that we have identified, the iteration gap, we propose a queue-based synchronization mechanism that can efficiently implement backup workers and bounded staleness in the decentralized setting. To cope with deterministic slowdown, we propose skipping iterations so that the effect of slower workers is further mitigated. We build a prototype implementation of Hop on TensorFlow. Experimental results on CNN and SVM show significant speedup over standard decentralized training in heterogeneous settings.
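
To make the bounded-staleness idea concrete, the following is a minimal toy sketch, not Hop's actual protocol or its TensorFlow implementation. It assumes a ring of workers that exchange scalar values through per-neighbor queues and average them (plain gossip averaging stands in for a training step). The function name `simulate`, the `periods` parameter used to model slow workers, and the exact blocking rule are all hypothetical choices made for illustration. Backup workers and iteration skipping are not modeled.

```python
from collections import deque


def simulate(periods, staleness, ticks):
    """Toy ring of workers with queue-based, bounded-staleness synchronization.

    periods[i]: ticks worker i needs per iteration (assumed model of heterogeneity).
    staleness:  how many iterations behind a neighbor's newest update may be.
    Returns (completed iterations per worker, largest gap seen between neighbors).
    """
    n = len(periods)
    neighbors = [((i - 1) % n, (i + 1) % n) for i in range(n)]
    values = [float(i) for i in range(n)]
    iters = [0] * n
    # inbox[i][j] holds updates sent from neighbor j to worker i as (iteration_tag, value).
    # Every worker starts with its neighbors' initial values, tagged with iteration 0.
    inbox = [{j: deque([(0, values[j])]) for j in neighbors[i]} for i in range(n)]
    max_gap = 0

    for t in range(1, ticks + 1):
        for i in range(n):
            if t % periods[i] != 0:
                continue  # worker i is still computing in this tick
            k = iters[i]
            newest = {j: inbox[i][j][-1] for j in neighbors[i]}
            # Block if any neighbor's newest update is more than `staleness` iterations old.
            if any(tag < k - staleness for tag, _ in newest.values()):
                continue
            for j in neighbors[i]:
                inbox[i][j] = deque([newest[j]])  # discard older queued updates
            # Average own value with the newest value from each neighbor.
            total = values[i] + sum(v for _, v in newest.values())
            values[i] = total / (1 + len(newest))
            iters[i] = k + 1
            # Push the new value, tagged with the iteration just completed, to neighbors.
            for j in neighbors[i]:
                inbox[j][i].append((iters[i], values[i]))
            gap = max(abs(iters[i] - iters[j]) for j in neighbors[i])
            max_gap = max(max_gap, gap)
    return iters, max_gap


if __name__ == "__main__":
    # Worker 3 is four times slower than the others.
    for s in (0, 3):
        done, gap = simulate([1, 1, 1, 4], staleness=s, ticks=20)
        print(f"staleness={s}: iterations={done}, max neighbor gap={gap}")
```

In this toy model, the blocking rule keeps the iteration gap between adjacent workers at or below `staleness + 1`, and a larger staleness bound lets the fast workers complete more iterations while the slow one catches up.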





ASPLOS '19: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems
April 2019
1126 pages


  1. decentralized training
  2. heterogeneity


ASPLOS '19 Paper Acceptance Rate 74 of 351 submissions, 21%
Overall Acceptance Rate 535 of 2,713 submissions, 20%


