
DOI: 10.1145/2830772.2830803
Research article, MICRO Conference Proceedings

The application slowdown model: quantifying and controlling the impact of inter-application interference at shared caches and main memory

Published: 05 December 2015

Abstract

In a multi-core system, interference at shared resources (such as caches and main memory) slows down applications running on different cores. Accurately estimating the slowdown of each application has several benefits: e.g., it can enable shared resource allocation in a manner that avoids unfair application slowdowns or provides slowdown guarantees. Unfortunately, prior works on estimating slowdowns either lead to inaccurate estimates, do not take into account shared caches, or rely on a priori application knowledge. This severely limits their applicability.
In this work, we propose the Application Slowdown Model (ASM), a new technique that accurately estimates application slowdowns due to interference at both the shared cache and main memory, without requiring a priori application knowledge. ASM is based on the observation that the performance of each application is strongly correlated with the rate at which the application accesses the shared cache. Thus, ASM reduces the problem of estimating slowdown to that of estimating the shared cache access rate the application would have achieved had it run alone on the system. To estimate this for each application, ASM periodically 1) minimizes interference for the application at the main memory, and 2) quantifies the interference the application receives at the shared cache, in an aggregate manner over a large set of requests. Our evaluations across 100 workloads show that ASM has an average slowdown estimation error of only 9.9%, a 2.97x improvement over the best previous mechanism.
We present several use cases of ASM that leverage its slowdown estimates to improve fairness, performance and provide slowdown guarantees. We provide detailed evaluations of three such use cases: slowdown-aware cache partitioning, slowdown-aware memory bandwidth partitioning and an example scheme to provide soft slowdown guarantees. Our evaluations show that these new schemes perform significantly better than state-of-the-art cache partitioning and memory scheduling schemes.
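The core ratio behind ASM can be sketched in a few lines: slowdown is estimated as the application's shared-cache access rate when run alone divided by its access rate under contention. This is an illustrative sketch of the idea described in the abstract, not the authors' implementation; the function name and the sample access-rate numbers are hypothetical.

```python
def estimate_slowdown(car_alone: float, car_shared: float) -> float:
    """ASM's key ratio: estimated slowdown = CAR_alone / CAR_shared,
    where CAR is the application's shared-cache access rate.
    CAR_alone is itself an estimate, obtained by periodically minimizing
    the application's memory interference and accounting for its
    shared-cache interference in aggregate."""
    if car_shared <= 0:
        raise ValueError("shared cache access rate must be positive")
    return car_alone / car_shared

# Hypothetical numbers: an application that issues 50 shared-cache
# accesses per 1000 cycles alone, but only 20 per 1000 cycles when
# co-running, is estimated to be slowed down by 2.5x.
print(estimate_slowdown(50, 20))  # → 2.5
```

The appeal of this formulation is that both rates are cheap to measure in aggregate, so no per-request bookkeeping or offline profiling of the application is needed.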




Published In

MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture
December 2015
787 pages
ISBN:9781450340342
DOI:10.1145/2830772
Publisher

Association for Computing Machinery

New York, NY, United States


Funding Sources

  • Intel Science and Technology Center on Cloud Computing
  • NSF
  • Semiconductor Research Corporation

Conference

MICRO-48

Acceptance Rates

MICRO-48 Paper Acceptance Rate: 61 of 283 submissions (22%)
Overall Acceptance Rate: 484 of 2,242 submissions (22%)

Article Metrics

  • Downloads (last 12 months): 86
  • Downloads (last 6 weeks): 7
Reflects downloads up to 24 Nov 2024

Cited By

  • (2024) Suppressing the Interference Within a Datacenter: Theorems, Metric and Strategy. IEEE Transactions on Parallel and Distributed Systems, 35(5):732-750, May 2024. DOI: 10.1109/TPDS.2024.3354418
  • (2024) Hashing ATD Tags for Low-Overhead Safe Contention Monitoring. IEEE Computer Architecture Letters, 23(2):166-169, July 2024. DOI: 10.1109/LCA.2024.3401570
  • (2024) FEDGE: An Interference-Aware QoS Prediction Framework for Black-Box Scenario in IaaS Clouds with Domain Generalization. IPDPS 2024, pp. 128-138, May 2024. DOI: 10.1109/IPDPS57955.2024.00020
  • (2024) MemorAI: Energy-Efficient Last-Level Cache Memory Optimization for Virtualized RANs. IEEE ICMLCN 2024, pp. 25-30, May 2024. DOI: 10.1109/ICMLCN59089.2024.10624821
  • (2024) Accelerating Aggregation Using a Real Processing-in-Memory System. IEEE ICDE 2024, pp. 3920-3932, May 2024. DOI: 10.1109/ICDE60146.2024.00300
  • (2024) Spatial Variation-Aware Read Disturbance Defenses: Experimental Analysis of Real DRAM Chips and Implications on Future Solutions. IEEE HPCA 2024, pp. 560-577, March 2024. DOI: 10.1109/HPCA57654.2024.00048
  • (2024) MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Computing. IEEE HPCA 2024, pp. 186-203, March 2024. DOI: 10.1109/HPCA57654.2024.00024
  • (2024) Geo-Distributed Analytical Streaming Architecture for IoT Platforms. IEEE CLUSTER 2024, pp. 263-274, September 2024. DOI: 10.1109/CLUSTER59578.2024.00030
  • (2023) Resource Scheduling Techniques in Cloud from a View of Coordination: A Holistic Survey. Frontiers of Information Technology & Electronic Engineering, 24(1):1-40, January 2023. DOI: 10.1631/FITEE.2100298
  • (2023) McCore: A Holistic Management of High-Performance Heterogeneous Multicores. MICRO-56, pp. 1044-1058, October 2023. DOI: 10.1145/3613424.3614295
