Abstract
Deep learning (DL) has proliferated over the past decade, with widespread applications in image recognition, language translation, computer vision, and other areas. Practitioners from different business groups in industry train DL models for these applications on shared cloud computing infrastructure, with different priorities. During model training, a key challenge is to minimize the lifecycle, i.e., the end-to-end turnaround time, of high-priority training jobs. This paper analyzes the distributed training of machine learning (ML) models and identifies a short-board (weakest-link) effect in the training process: GPU training requires higher network bandwidth than CPU training, so the network becomes the bottleneck. This insight motivates the design of GAI, a centralized scheduler for ML workloads. GAI relies on two techniques: (1) a tree-based structure that stores cluster information hierarchically and enables multi-layer scheduling, and (2) an extended priority algorithm that considers job priorities along multiple dimensions to support resource degradation and preemption. A prototype of GAI is implemented on top of Kubernetes, Kubeflow, and TensorFlow and evaluated using both a simulator and a real cloud-based cluster. Evaluations show a 28% increase in scheduling throughput and a 21% speedup in training convergence for DL models.
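To make the two techniques concrete, the following is a minimal sketch, not the paper's implementation: the names (TreeNode, Job, place, schedule), the priority dimensions, and the GPU-only capacity model are all assumptions. It shows a tree that mirrors the cluster hierarchy (cluster, rack, node) so upper layers can prune whole subtrees, and a multi-dimensional priority tuple that orders jobs; the hook where degradation or preemption would apply is noted in a comment.

```python
import heapq
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Job:
    name: str
    business_priority: int  # assumed dimension: priority of the owning business group
    urgency: int            # assumed dimension: deadline-derived urgency
    gpus: int               # GPUs requested

    def priority(self) -> tuple:
        # Multi-dimensional priority: compare business priority first, then
        # urgency. heapq is a min-heap, so negate for "higher runs first".
        return (-self.business_priority, -self.urgency)

@dataclass
class TreeNode:
    name: str
    free_gpus: int = 0
    children: List["TreeNode"] = field(default_factory=list)

    def aggregate(self) -> int:
        # Leaf capacities roll up the tree so multi-layer scheduling can
        # skip whole subtrees without inspecting every machine.
        if self.children:
            self.free_gpus = sum(c.aggregate() for c in self.children)
        return self.free_gpus

def place(node: TreeNode, job: Job) -> Optional[TreeNode]:
    # Descend only into subtrees whose aggregate capacity fits the job;
    # return the leaf (machine) chosen, or None if nothing fits.
    if node.free_gpus < job.gpus:
        return None
    if not node.children:
        return node
    for child in sorted(node.children, key=lambda c: -c.free_gpus):
        leaf = place(child, job)
        if leaf is not None:
            return leaf
    return None

def schedule(root: TreeNode, jobs: List[Job]) -> Dict[str, str]:
    # Drain jobs in priority order. A full scheduler like GAI would also
    # degrade resources or preempt lower-priority jobs when a
    # high-priority job cannot fit; that logic is omitted here.
    heap = [(j.priority(), i, j) for i, j in enumerate(jobs)]
    heapq.heapify(heap)
    assignments: Dict[str, str] = {}
    while heap:
        _, _, job = heapq.heappop(heap)
        root.aggregate()
        leaf = place(root, job)
        if leaf is not None:
            leaf.free_gpus -= job.gpus
            assignments[job.name] = leaf.name
    return assignments

if __name__ == "__main__":
    cluster = TreeNode("cluster", children=[
        TreeNode("rack-1", children=[TreeNode("node-1", 4), TreeNode("node-2", 2)]),
        TreeNode("rack-2", children=[TreeNode("node-3", 8)]),
    ])
    jobs = [Job("resnet", 2, 1, 4), Job("vgg", 1, 3, 2), Job("lenet", 2, 2, 8)]
    print(schedule(cluster, jobs))  # e.g. {'lenet': 'node-3', 'resnet': 'node-1', 'vgg': 'node-2'}
```

In this toy run the high-priority 8-GPU job is placed first on the only rack that can hold it, illustrating why aggregating capacity at inner tree nodes lets the scheduler reject infeasible subtrees in one comparison.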
Acknowledgements
This research is supported by the National Natural Science Foundation of China under Grant No. 61373030.
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Gao, C., Ren, R., Cai, H. (2018). GAI: A Centralized Tree-Based Scheduler for Machine Learning Workload in Large Shared Clusters. In: Vaidya, J., Li, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2018. Lecture Notes in Computer Science, vol 11335. Springer, Cham. https://doi.org/10.1007/978-3-030-05054-2_46
Print ISBN: 978-3-030-05053-5
Online ISBN: 978-3-030-05054-2