
DOI: 10.1145/3295500.3356152

Spread-n-share: improving application performance and cluster throughput with resource-aware job placement

Published: 17 November 2019

Abstract

Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing the processes of a parallel job onto as few compute nodes as possible. While CE minimizes inter-node network communication, it often induces self-contention among the tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the imbalance in cache and memory-bandwidth usage remains under-investigated.
In this work, we propose Spread-n-Share (SNS): a batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottlenecks, and co-locates jobs in a resource-compatible manner. We implement Uberun, a prototype scheduler, to validate SNS, treating shared-cache capacity and memory bandwidth as two types of performance-critical shared resources. Experimental results using 12 diverse cluster workloads show that SNS improves overall system throughput by 19.8% on average over CE, while achieving an average individual job speedup of 1.8%.
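To make the strategy concrete, here is a minimal sketch of the two SNS steps described above: "spread" a resource-bound job across enough nodes that its per-node demand for each shared resource fits, then "share" the leftover capacity with jobs whose resource profiles are compatible. Everything below (the Node/Job records, the per-process demand estimates, the greedy placement loop) is an illustrative assumption for exposition, not the paper's Uberun implementation.

import math
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    cores: int
    llc_mb: float          # shared last-level cache capacity (MB)
    mem_bw_gbs: float      # memory bandwidth (GB/s)
    used_cores: int = 0
    used_llc: float = 0.0
    used_bw: float = 0.0

@dataclass
class Job:
    procs: int             # number of processes requested
    llc_per_proc: float    # estimated LLC footprint per process (MB)
    bw_per_proc: float     # estimated bandwidth demand per process (GB/s)

def spread_width(job: Job, node: Node) -> int:
    # Compact-n-Exclusive would use ceil(procs / cores) nodes. SNS widens the
    # allocation until per-node demand for each shared resource also fits,
    # which removes the job's self-contention on that resource.
    by_cores = math.ceil(job.procs / node.cores)
    by_llc = math.ceil(job.procs * job.llc_per_proc / node.llc_mb)
    by_bw = math.ceil(job.procs * job.bw_per_proc / node.mem_bw_gbs)
    return max(by_cores, by_llc, by_bw)

def compatible(node: Node, procs: int, job: Job) -> bool:
    # "Share" step: co-locate only where the resident jobs leave enough
    # headroom in every shared resource, keeping the mix resource-compatible.
    return (node.used_cores + procs <= node.cores
            and node.used_llc + procs * job.llc_per_proc <= node.llc_mb
            and node.used_bw + procs * job.bw_per_proc <= node.mem_bw_gbs)

def place(job: Job, cluster: List[Node]) -> List[Node]:
    # Greedy placement: spread the job's slice over `width` compatible nodes,
    # or defer it when no resource-compatible mix exists yet.
    width = spread_width(job, cluster[0])       # assumes homogeneous nodes
    per_node = math.ceil(job.procs / width)
    chosen = [n for n in cluster if compatible(n, per_node, job)][:width]
    if len(chosen) < width:
        return []
    for n in chosen:
        n.used_cores += per_node
        n.used_llc += per_node * job.llc_per_proc
        n.used_bw += per_node * job.bw_per_proc
    return chosen

# Example: a bandwidth-bound 28-process job gets 4 nodes, not the
# ceil(28/16) = 2 that a compact placement would use.
cluster = [Node(cores=16, llc_mb=35.0, mem_bw_gbs=60.0) for _ in range(8)]
job = Job(procs=28, llc_per_proc=1.0, bw_per_proc=8.0)
print(len(place(job, cluster)))  # -> 4

A real scheduler such as Uberun would additionally need to estimate these per-process demands online (e.g., from hardware performance counters) and weigh the throughput gain against the extra inter-node communication that spreading introduces; the sketch ignores both concerns.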




Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN: 9781450362290
DOI: 10.1145/3295500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 November 2019


Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • National Natural Science Foundation of China

Conference

SC '19

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 45
  • Downloads (last 6 weeks): 4
Reflects downloads up to 19 Sep 2024


Cited By

  • (2024) ImmunoPlane: Middleware for Providing Adaptivity to Distributed Internet-of-Things Applications. 2024 IEEE/ACM Ninth International Conference on Internet-of-Things Design and Implementation (IoTDI), pp. 13-24. DOI: 10.1109/IoTDI61053.2024.00006. Online publication date: 13-May-2024.
  • (2024) Software Resource Disaggregation for HPC with Serverless Computing. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 139-156. DOI: 10.1109/IPDPS57955.2024.00021. Online publication date: 27-May-2024.
  • (2024) Response Time and Energy-Aware Optimization for Co-Locating Microservices and Offline Tasks. 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 525-530. DOI: 10.1109/COMPSAC61105.2024.00077. Online publication date: 2-Jul-2024.
  • (2023) Towards OS Heterogeneity Aware Cluster Management for HPC. Proceedings of the 14th ACM SIGOPS Asia-Pacific Workshop on Systems, pp. 16-23. DOI: 10.1145/3609510.3609819. Online publication date: 24-Aug-2023.
  • (2023) Unified Programming Models for Heterogeneous High-Performance Computers. Journal of Computer Science and Technology 38(1), pp. 211-218. DOI: 10.1007/s11390-023-2888-4. Online publication date: 31-Jan-2023.
  • (2023) Evaluating the Potential of Coscheduling on High-Performance Computing Systems. Job Scheduling Strategies for Parallel Processing, pp. 155-172. DOI: 10.1007/978-3-031-43943-8_8. Online publication date: 15-Sep-2023.
  • (2021) MXDAG. Proceedings of the 20th ACM Workshop on Hot Topics in Networks, pp. 221-228. DOI: 10.1145/3484266.3487384. Online publication date: 10-Nov-2021.
  • (2021) Towards All-optical Circuit-switched Datacenter Network Cores. Proceedings of the ACM SIGCOMM 2021 Workshop on Optical Systems, pp. 1-5. DOI: 10.1145/3473938.3474505. Online publication date: 23-Aug-2021.
  • (2021) Coordinated process scheduling algorithms for coupled earth system models. Concurrency and Computation: Practice and Experience 33(20). DOI: 10.1002/cpe.6346. Online publication date: 10-Jun-2021.
  • (2020) Job placement using reinforcement learning in GPU virtualization environment. Cluster Computing. DOI: 10.1007/s10586-019-03044-7. Online publication date: 9-Jan-2020.
