research-article

Compiling for niceness: mitigating contention for QoS in warehouse scale computers

Authors:

Mary Lou Soffa

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization

Pages 1 - 12

https://doi.org/10.1145/2259016.2259018

Published: 31 March 2012

Abstract

As the class of datacenters recently coined as warehouse scale computers (WSCs) continues to leverage commodity multicore processors with increasing core counts, there is a growing need to consolidate various workloads on these machines to fully utilize their computation power. However, it is well known that when multiple applications are co-located on a multicore machine, contention for shared memory resources can cause severe cross-core performance interference. To ensure that the quality of service (QoS) of user-facing applications does not suffer from performance interference, WSC operators resort to disallowing co-location of latency-sensitive applications with other applications. This policy translates to low machine utilization and millions of dollars wasted in WSCs.

This paper presents QoS-Compile, the first compilation approach that statically manipulates application contentiousness to enable the co-location of applications with varying QoS requirements and, as a result, can greatly improve machine utilization. Our technique first pinpoints an application's code regions that tend to cause contention and performance interference. QoS-Compile then transforms those regions to reduce their contentious nature. In essence, to co-locate applications of different QoS priorities, our compilation technique uses pessimizing transformations to throttle down the memory access rate of the contentious regions in low-priority applications, reducing their interference with high-priority applications. Our evaluation using synthetic benchmarks, SPEC benchmarks, and large-scale Google applications shows that QoS-Compile can greatly reduce contention, improve the QoS of applications, and improve machine utilization. Our experiments show that our technique improves applications' QoS performance by 21% and machine utilization by 36% on average.
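The throttling idea can be illustrated with a minimal sketch. The abstract does not specify which pessimizing transformations QoS-Compile applies, so the padding scheme below is only an assumed example: a low-priority loop is rewritten so that it does some do-nothing work after every few memory accesses, which lowers the rate at which it issues accesses without changing its result. The function names, the `pad_iters` and `chunk` parameters, and the cost model are all hypothetical.

```python
def stream_sum(data):
    """Baseline low-priority region: touches every element back to back."""
    total = 0
    for x in data:
        total += x
    return total


def stream_sum_throttled(data, pad_iters=4, chunk=16):
    """Same computation with padding injected after every `chunk` accesses.

    `pad_iters` and `chunk` are hypothetical tuning knobs, not parameters
    taken from the paper.
    """
    total = 0
    for i, x in enumerate(data, start=1):
        total += x
        if i % chunk == 0:
            for _ in range(pad_iters):  # padding: burns time, touches no data
                pass
    return total


def relative_access_rate(pad_iters, chunk, cost_access=1.0, cost_pad=1.0):
    """Toy cost model (an assumption): fraction of the baseline access rate
    that remains once `pad_iters` padding steps follow every `chunk` accesses."""
    baseline = chunk * cost_access
    return baseline / (baseline + pad_iters * cost_pad)
```

Under this toy model, padding that costs as much as the accesses it follows (for example `pad_iters=16`, `chunk=16`) halves the region's access rate. A real compiler pass would insert such padding at the instruction level and only in the regions identified as contentious.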

Published In

CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization

March 2012, 285 pages

ISBN: 9781450312066

DOI: 10.1145/2259016

General Chairs: Carol Eidt (Microsoft), Anne Holler (VMware)

Program Chairs: Uma Srinivasan (Intel), Saman Amarasinghe (MIT)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CGO '12: Annual IEEE/ACM International Symposium on Code Generation and Optimization

March 31 - April 4, 2012

San Jose, California

Acceptance Rates

CGO '12 Paper Acceptance Rate 26 of 90 submissions, 29%;

Overall Acceptance Rate 312 of 1,061 submissions, 29%
