Research article, DOI: 10.1145/2259016.2259018

Compiling for niceness: mitigating contention for QoS in warehouse scale computers

Published: 31 March 2012

Abstract

As the class of datacenters recently termed warehouse scale computers (WSCs) continues to leverage commodity multicore processors with increasing core counts, there is a growing need to consolidate various workloads on these machines to fully utilize their computational power. However, it is well known that when multiple applications are co-located on a multicore machine, contention for shared memory resources can cause severe cross-core performance interference. To ensure that the quality of service (QoS) of user-facing applications does not suffer from performance interference, WSC operators resort to disallowing co-location of latency-sensitive applications with other applications. This policy translates to low machine utilization and millions of dollars wasted in WSCs.
This paper presents QoS-Compile, the first compilation approach that statically manipulates application contentiousness to enable the co-location of applications with varying QoS requirements, and as a result, can greatly improve machine utilization. Our technique first pinpoints an application's code regions that tend to cause contention and performance interference. QoS-Compile then transforms those regions to reduce their contentious nature. In essence, to co-locate applications of different QoS priorities, our compilation technique uses pessimizing transformations to throttle down the memory access rate of the contentious regions in low priority applications, reducing their interference with high priority applications. Our evaluation using synthetic benchmarks, SPEC benchmarks and large-scale Google applications shows that QoS-Compile can greatly reduce contention, improve QoS of applications, and improve machine utilization. Our experiments show that our technique improves applications' QoS performance by 21% and machine utilization by 36% on average.
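The idea of throttling a contentious region can be illustrated with a small sketch. This is not the transformation QoS-Compile performs; it is a hypothetical, source-level analogy written for this page. The names `stream_sum`, `stream_sum_throttled`, `burst` and `nap_seconds` are assumptions, and `time.sleep` merely stands in for whatever delay a compiler might insert at the instruction level.

```python
import time


def stream_sum(data, stride=16):
    """Baseline 'contentious' region: strides through memory with no pauses."""
    total = 0
    for i in range(0, len(data), stride):
        total += data[i]
    return total


def stream_sum_throttled(data, stride=16, burst=1024, nap_seconds=0.0001):
    """Same computation, with a pause after every `burst` accesses.

    The pause lowers the average memory access rate of the loop, which is
    the effect a pessimizing transformation aims for in a low priority
    application. Returns (total, naps_taken) so the throttling is visible.
    """
    total = 0
    naps = 0
    accesses = 0
    for i in range(0, len(data), stride):
        total += data[i]
        accesses += 1
        if accesses % burst == 0:
            # Hypothetical stand-in for inserted delay code.
            time.sleep(nap_seconds)
            naps += 1
    return total, naps


if __name__ == "__main__":
    data = list(range(1 << 16))
    print(stream_sum(data))
    print(stream_sum_throttled(data))
```

The throttled version returns the same sum as the baseline; only its pacing changes. In this sketch, `burst` and `nap_seconds` are the two knobs that trade the low priority application's own speed against the pressure it puts on shared memory resources.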


    Published In

    CGO '12: Proceedings of the Tenth International Symposium on Code Generation and Optimization
    March 2012
    285 pages
    ISBN: 9781450312066
    DOI: 10.1145/2259016

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article

    Conference

    CGO '12

    Acceptance Rates

    CGO '12 Paper Acceptance Rate 26 of 90 submissions, 29%
    Overall Acceptance Rate 312 of 1,061 submissions, 29%

    Article Metrics

    • Downloads (Last 12 months): 11
    • Downloads (Last 6 weeks): 1
    Reflects downloads up to 18 Nov 2024

    Cited By

    • (2024) Characterizing the Performance of Emerging Deep Learning, Graph, and High Performance Computing Workloads Under Interference. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 468-477. DOI: 10.1109/IPDPSW63119.2024.00098. Online publication date: 27-May-2024
    • (2023) AuRORA: Virtualized Accelerator Orchestration for Multi-Tenant Workloads. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pages 62-76. DOI: 10.1145/3613424.3614280. Online publication date: 28-Oct-2023
    • (2023) MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural Networks. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 828-841. DOI: 10.1109/HPCA56546.2023.10071035. Online publication date: Feb-2023
    • (2022) VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 388-401. DOI: 10.1145/3503222.3507752. Online publication date: 28-Feb-2022
    • (2022) Guaranteeing Performance SLAs of Cloud Applications Under Resource Storms. IEEE Transactions on Cloud Computing, 10:2, pages 1329-1343. DOI: 10.1109/TCC.2020.2985372. Online publication date: 1-Apr-2022
    • (2020) Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 681-697. DOI: 10.1109/MICRO50266.2020.00062. Online publication date: Oct-2020
    • (2020) CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 193-206. DOI: 10.1109/HPCA47549.2020.00025. Online publication date: Feb-2020
    • (2019) Laius. Proceedings of the ACM International Conference on Supercomputing, pages 58-68. DOI: 10.1145/3330345.3330351. Online publication date: 26-Jun-2019
    • (2019) Pliant: Leveraging Approximation to Improve Datacenter Resource Efficiency. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 159-171. DOI: 10.1109/HPCA.2019.00035. Online publication date: Feb-2019
    • (2018) ResQ. Proceedings of the 15th USENIX Conference on Networked Systems Design and Implementation, pages 283-297. DOI: 10.5555/3307441.3307466. Online publication date: 9-Apr-2018
