
DOI: 10.1145/3295500.3356152

Spread-n-share: improving application performance and cluster throughput with resource-aware job placement

Published: 17 November 2019

Abstract

Traditional batch job schedulers adopt the Compact-n-Exclusive (CE) strategy, packing the processes of a parallel job onto as few compute nodes as possible. While CE minimizes inter-node network communication, it often induces self-contention among the tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the imbalance in cache and memory-bandwidth usage remains under-investigated.
In this work, we propose Spread-n-Share (SNS): a batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottlenecks, and co-locates jobs in a resource-compatible manner. We implement Uberun, a prototype scheduler, to validate SNS, treating shared-cache capacity and memory bandwidth as two types of performance-critical shared resources. Experimental results using 12 diverse cluster workloads show that SNS improves overall system throughput by 19.8% on average over CE, while achieving an average individual job speedup of 1.8%.
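To make the strategy concrete, here is a minimal sketch of the two SNS steps described above: "spread" a resource-bound job across enough nodes that its per-node demand for each shared resource fits, then "share" the leftover capacity with jobs whose resource profiles are compatible. Everything below (the Node/Job records, the per-process demand estimates, the greedy placement loop) is an illustrative assumption for exposition, not the paper's Uberun implementation.

import math
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    cores: int
    llc_mb: float          # shared last-level cache capacity (MB)
    mem_bw_gbs: float      # memory bandwidth (GB/s)
    used_cores: int = 0
    used_llc: float = 0.0
    used_bw: float = 0.0

@dataclass
class Job:
    procs: int             # number of processes requested
    llc_per_proc: float    # estimated LLC footprint per process (MB)
    bw_per_proc: float     # estimated bandwidth demand per process (GB/s)

def spread_width(job: Job, node: Node) -> int:
    # Compact-n-Exclusive would use ceil(procs / cores) nodes. SNS widens the
    # allocation until per-node demand for each shared resource also fits,
    # which removes the job's self-contention on that resource.
    by_cores = math.ceil(job.procs / node.cores)
    by_llc = math.ceil(job.procs * job.llc_per_proc / node.llc_mb)
    by_bw = math.ceil(job.procs * job.bw_per_proc / node.mem_bw_gbs)
    return max(by_cores, by_llc, by_bw)

def compatible(node: Node, procs: int, job: Job) -> bool:
    # "Share" step: co-locate only where the resident jobs leave enough
    # headroom in every shared resource, keeping the mix resource-compatible.
    return (node.used_cores + procs <= node.cores
            and node.used_llc + procs * job.llc_per_proc <= node.llc_mb
            and node.used_bw + procs * job.bw_per_proc <= node.mem_bw_gbs)

def place(job: Job, cluster: List[Node]) -> List[Node]:
    # Greedy placement: spread the job's slice over `width` compatible nodes,
    # or defer it when no resource-compatible mix exists yet.
    width = spread_width(job, cluster[0])       # assumes homogeneous nodes
    per_node = math.ceil(job.procs / width)
    chosen = [n for n in cluster if compatible(n, per_node, job)][:width]
    if len(chosen) < width:
        return []
    for n in chosen:
        n.used_cores += per_node
        n.used_llc += per_node * job.llc_per_proc
        n.used_bw += per_node * job.bw_per_proc
    return chosen

# Example: a bandwidth-bound 28-process job gets 4 nodes, not the
# ceil(28/16) = 2 that a compact placement would use.
cluster = [Node(cores=16, llc_mb=35.0, mem_bw_gbs=60.0) for _ in range(8)]
job = Job(procs=28, llc_per_proc=1.0, bw_per_proc=8.0)
print(len(place(job, cluster)))  # -> 4

A real scheduler such as Uberun would additionally need to estimate these per-process demands online (e.g., from hardware performance counters) and weigh the throughput gain against the extra inter-node communication that spreading introduces; the sketch ignores both concerns.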




Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN: 9781450362290
DOI: 10.1145/3295500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 November 2019


Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • National Natural Science Foundation of China

Conference

SC '19

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 45
  • Downloads (last 6 weeks): 4
Reflects downloads up to 19 Sep 2024


Cited By

  • (2024) ImmunoPlane: Middleware for Providing Adaptivity to Distributed Internet-of-Things Applications. 2024 IEEE/ACM Ninth International Conference on Internet-of-Things Design and Implementation (IoTDI), pp. 13-24. DOI: 10.1109/IoTDI61053.2024.00006. Online publication date: 13-May-2024.
  • (2024) Software Resource Disaggregation for HPC with Serverless Computing. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 139-156. DOI: 10.1109/IPDPS57955.2024.00021. Online publication date: 27-May-2024.
  • (2024) Response Time and Energy-Aware Optimization for Co-Locating Microservices and Offline Tasks. 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 525-530. DOI: 10.1109/COMPSAC61105.2024.00077. Online publication date: 2-Jul-2024.
  • (2023) Towards OS Heterogeneity Aware Cluster Management for HPC. Proceedings of the 14th ACM SIGOPS Asia-Pacific Workshop on Systems, pp. 16-23. DOI: 10.1145/3609510.3609819. Online publication date: 24-Aug-2023.
  • (2023) Unified Programming Models for Heterogeneous High-Performance Computers. Journal of Computer Science and Technology 38(1), pp. 211-218. DOI: 10.1007/s11390-023-2888-4. Online publication date: 31-Jan-2023.
  • (2023) Evaluating the Potential of Coscheduling on High-Performance Computing Systems. Job Scheduling Strategies for Parallel Processing, pp. 155-172. DOI: 10.1007/978-3-031-43943-8_8. Online publication date: 15-Sep-2023.
  • (2021) MXDAG. Proceedings of the 20th ACM Workshop on Hot Topics in Networks, pp. 221-228. DOI: 10.1145/3484266.3487384. Online publication date: 10-Nov-2021.
  • (2021) Towards All-optical Circuit-switched Datacenter Network Cores. Proceedings of the ACM SIGCOMM 2021 Workshop on Optical Systems, pp. 1-5. DOI: 10.1145/3473938.3474505. Online publication date: 23-Aug-2021.
  • (2021) Coordinated process scheduling algorithms for coupled earth system models. Concurrency and Computation: Practice and Experience 33(20). DOI: 10.1002/cpe.6346. Online publication date: 10-Jun-2021.
  • (2020) Job placement using reinforcement learning in GPU virtualization environment. Cluster Computing. DOI: 10.1007/s10586-019-03044-7. Online publication date: 9-Jan-2020.
