research-article

Optimal Symbiosis and Fair Scheduling in Shared Cache

Authors:

Zhenlin WangAuthors Info & Claims

IEEE Transactions on Parallel and Distributed Systems, Volume 28, Issue 4

Pages 1134 - 1148

https://doi.org/10.1109/TPDS.2016.2611572

Published: 01 April 2017 Publication History

Abstract

On multi-core processors, applications are run sharing the cache. This paper presents optimization theory to co-locate applications to minimize cache interference and maximize performance. The theory precisely specifies MRC-based composition, optimization, and correctness conditions. The paper also presents a new technique called footprint symbiosis to obtain the best shared cache performance under fair CPU allocation as well as a new sampling technique which reduces the cost of locality analysis. When sampling and optimization are combined, the paper shows that it takes less than 0.1 second analysis per program to obtain a co-run that is within 1.5 percent of the best possible performance. In an exhaustive evaluation with 12,870 tests, the best prior work improves co-run performance by 56 percent on average. The new optimization improves it by another 29 percent. Without single co-run test, footprint symbiosis is able to choose co-run choices that are just 8 percent slower than the best co-run solutions found with exhaustive testing.

References

[1]

M. Arnold and B. G. Ryder, “A framework for reducing the cost of instrumented code,” in Proc. ACM SIGPLAN Conf. Program. Language Des. Implementation, Jun. 2001, pp. 168–179.

Digital Library

[2]

E. Berg and E. Hagersten, “StatCache: A probabilistic approach to efficient and accurate data locality analysis,” in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., 2004, pp. 20–27.

Digital Library

[3]

K. Beyls and E. H. D'Hollander, “Discovery of locality-improving refactoring by reuse path analysis,” in Proc. High Perform. Comput. Commun., 2006, pp. 220–229.

Digital Library

[4]

M. D. Bond, K. E. Coons, and K. S. McKinley, “PACER: Proportional detection of data races,” in Proc. ACM SIGPLAN Conf. Program. Language Des. Implementation, 2010, pp. 255–268.

Digital Library

[5]

J. F. Cantin and M. D. Hill, “Cache performance for SPEC CPU2000 benchmarks,” 2003. {Online}. Available: http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data

[6]

C. Cascaval, E. Duesterwald, P. F. Sweeney, and R. W. Wisniewski, “Multiple page size modeling and optimization,” in Proc. Int. Conf. Parallel Archit. Compilation Tech., 2005, pp. 339–349.

Digital Library

[7]

T. M. Chilimbi and M. Hirzel, “Dynamic hot data stream prefetching for general-purpose programs,” in Proc. ACM SIGPLAN Conf. Program. Language Des. Implementation, 2002, pp. 199–209.

Digital Library

[8]

C. Delimitrou and C. Kozyrakis, “Paragon: QoS-aware scheduling for heterogeneous datacenters,” in Proc. Int. Conf. Archit. Support Program. Languages Operating Syst., 2013, pp. 77–88.

Digital Library

[9]

P. J. Denning, “The working set model for program behaviour,” Commun. ACM, vol. Volume 11, no. Issue 5, pp. 323–333, 1968.

Digital Library

[10]

P. J. Denning and S. C. Schwartz, “Properties of the working set model,” Commun. ACM, vol. Volume 15, no. Issue 3, pp. 191–198, 1972.

Digital Library

[11]

C. Ding, X. Xiang, B. Bao, H. Luo, Y. Luo, and X. Wang, “Performance metrics and models for shared cache,” J. Comput. Sci. Technol., vol. Volume 29, no. Issue 4, pp. 692–712, 2014.

[12]

Y. Jiang, X. Shen, J. Chen, and R. Tripathi, “Analysis and approximation of optimal co-scheduling on chip multiprocessors,” in Proc. Int. Conf. Parallel Archit. Compilation Tech., 2008, pp. 220–229.

Digital Library

[13]

Y. Jiang, K. Tian, and X. Shen, “Combining locality analysis with online proactive job co-scheduling in chip multiprocessors,” in Proc. Int. Conf. High Perform. Embedded Archit. Compilers, 2010, pp. 201–215.

Digital Library

[14]

Y. Jiang, K. Tian, X. Shen, J. Zhang, J. Chen, and R. Tripathi, “The complexity of optimal job co-scheduling on chip multiprocessors and heuristics-based solutions,” IEEE Trans. Parallel Distrib. Syst., vol. Volume 22, no. Issue 7, pp. 1192–1205, 2011.

Digital Library

[15]

, 15, 2016. {Online}. Available: https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt

[16]

L. Liu, Y. Li, Z. Cui, Y. Bao, M. Chen, and C. Wu, “Going vertical in memory management: Handling multiplicity by multi-policy,” in Proc. Int. Symp. Comput. Archit., 2014, pp. 169–180.

Digital Library

[17]

X. Liu, K. Sharma, and J. M. Mellor-Crummey, “ArrayTool: A lightweight profiler to guide array regrouping,” in Proc. Int. Conf. Parallel Archit. Compilation Tech., 2014, pp. 405–416.

Digital Library

[18]

J. Mars, L. Tang, K. Skadron, M. L. Soffa, and R. Hundt, “Increasing utilization in modern warehouse-scale computers using bubble-up,” IEEE Micro, vol. Volume 32, no. Issue 3, pp. 88–99, 2012.

Digital Library

[19]

R. L. Mattson, J. Gecsei, D. Slutz, and I. L. Traiger, “Evaluation techniques for storage hierarchies,” IBM Syst. J., vol. Volume 9, no. Issue 2, pp. 78–117, 1970.

Digital Library

[20]

T. Moseley, A. Shye, V. J. Reddi, D. Grunwald, and R. Peri, “Shadow profiling: Hiding instrumentation costs with parallelism,” in Proc. Int. Symp. Code Generation Optimization, 2007, pp. 198–208.

Digital Library

[21]

A. Sandberg, A. Sembrant, E. Hagersten, and D. Black-Schaffer, “Modeling performance variation due to cache sharing,” in Proc. Int. Symp. High-Perform. Comput. Archit., 2013, pp. 155–166.

Digital Library

[22]

D. L. Schuff, M. Kulkarni, and V. S. Pai, “Accelerating multicore reuse distance analysis with sampling and parallelization,” in Proc. Int. Conf. Parallel Archit. Compilation Tech., 2010, pp. 53–64.

Digital Library

[23]

R. Sen and D. A. Wood, “Reuse-based online models for caches,” in Proc. Int. Conf. Meas. Model. Comput. Syst., 2013, pp. 279–292.

Digital Library

[24]

A. J. Smith, “On the effectiveness of set associative page mapping and its applications in main memory management,” in Proc. Int. Conf. Softw. Eng., 1976, pp. 286–292.

Digital Library

[25]

A. Snavely and D. M. Tullsen, “Symbiotic jobscheduling for a simultaneous multithreading processor,” in Proc. Int. Conf. Archit. Support Program. Languages Operating Syst., 2000, pp. 234–244.

Digital Library

[26]

X.-H. Sun and D. Wang, “APC: A performance metric of memory systems,” SIGMETRICS Perform. Eval. Rev., vol. Volume 40, no. Issue 2, pp. 125–130, 2012.

Digital Library

[27]

D. K. Tam, R. Azimi, L. Soares, and M. Stumm, “RapidMRC: Approximating L2 miss rate curves on commodity systems for online optimizations,” in Proc. Int. Conf. Archit. Support Program. Languages Operating Syst., 2009, pp. 121–132.

Digital Library

[28]

S. Wallace and K. Hazelwood, “SuperPin: Parallelizing dynamic instrumentation for real-time performance,” in Proc. Int. Symp. Code Generation Optimization, 2007, pp. 209–220.

Digital Library

[29]

J. Wires, S. Ingram, Z. Drudi, N. J. Harvey, A. Warfield, and C. Data, “Characterizing storage workloads with counter stacks,” in Proc. Symp. Operating Syst. Des. Implementation, 2014, pp. 335–349.

Digital Library

[30]

M.-J. Wu, M. Zhao, and D. Yeung, “Studying multicore processor scaling via reuse distance analysis,” in Proc. Int. Symp. Comput. Archit., 2013, pp. 499–510.

Digital Library

[31]

X. Xiang, B. Bao, T. Bai, C. Ding, and T. M. Chilimbi, “All-window profiling and composable models of cache sharing,” in Proc. ACM SIGPLAN Symp. Principles Practice Parallel Program., 2011, pp. 91–102.

Digital Library

[32]

X. Xiang, B. Bao, C. Ding, and Y. Gao, “Linear-time modeling of program working set in shared cache,” in Proc. Int. Conf. Parallel Archit. Compilation Tech., 2011, pp. 350–360.

Digital Library

[33]

X. Xiang, B. Bao, C. Ding, and K. Shen, “Cache conscious task regrouping on multicore processors,” in Proc. IEEE Int. Symp. Cluster Comput. Grid, 2012, pp. 603–611.

Digital Library

[34]

X. Xiang, C. Ding, H. Luo, and B. Bao, “HOTL: A higher order theory of locality,” in Proc. Int. Conf. Archit. Support Program. Languages Operating Syst., 2013, pp. 343–356.

Digital Library

[35]

X. Zhang, S. Dwarkadas, and K. Shen, “Towards practical page coloring-based multicore cache management,” in Proc. 4th ACM Eur. Conf. Comput. Syst., 2009, pp. 89–102.

Digital Library

[36]

Y. Zhong and W. Chang, “Sampling-based program locality approximation,” in Proc. Int. Symp. Memory Manage., 2008, pp. 91–100.

Digital Library

[37]

Y. Zhong, X. Shen, and C. Ding, “Program locality analysis using reuse distance,” ACM Trans. Program. Languages Syst., vol. Volume 31, no. Issue 6, pp. 1–39, 2009.

Digital Library

[38]

S. Zhuravlev, S. Blagodurov, and A. Fedorova, “Addressing shared resource contention in multicore processors via scheduling,” in Proc. Int. Conf. Archit. Support Program. Languages Operating Syst., 2010, pp. 129–142.

Digital Library

[39]

S. Kim, D. Chandra, and Y. Solihin, “Fair cache sharing and partitioning in a chip multiprocessor architecture,” in Proc. 13th Int. Conf. Parallel Archit. Compilation Tech., 2004, pp. 111–122.

Digital Library

[40]

D. Chandra, F. Guo, S. Kim, and Y. Solihin, “Predicting inter-thread cache contention on a chip multi-processor architecture,” in Proc. Int. Symp. High-Perform. Comput. Archit., 2005, pp. 340–351.

Digital Library

[41]

G. E. Suh, S. Devadas, and L. Rudolph, “Analytical cache models with applications to cache partitioning,” in Proc. Int. Conf. Supercomputing, 2001, pp. 1–12.

Digital Library

Cited By

Ding CChen DLiu FReber BSmith W(2022)CARL: Compiler Assigned Reference LeasingACM Transactions on Architecture and Code Optimization10.1145/349873019:1(1-28)Online publication date: 17-Mar-2022
https://dl.acm.org/doi/10.1145/3498730
Xiang YYe CWang XLuo YWang Z(2019)EMBAProceedings of the 48th International Conference on Parallel Processing10.1145/3337821.3337863(1-12)Online publication date: 5-Aug-2019
https://dl.acm.org/doi/10.1145/3337821.3337863
Chen DLiu FDing CPai S(2018)Locality analysis through static parallel samplingACM SIGPLAN Notices10.1145/3296979.319240253:4(557-570)Online publication date: 11-Jun-2018
https://dl.acm.org/doi/10.1145/3296979.3192402
Show More Cited By

Recommendations

Adaptive prefetching for shared cache based chip multiprocessors
DATE '09: Proceedings of the Conference on Design, Automation and Test in Europe

Chip multiprocessors (CMPs) present a unique scenario for software data prefetching with subtle tradeoffs between memory bandwidth and performance. In a shared L2 based CMP, multiple cores compete for the shared on-chip cache space and limited off-chip ...
Benefits of cache-affinity scheduling in shared-memory multiprocessors: a summary

An interesting and common class of workloads for shared-memory multiprocessors is multiprogrammed workloads. Because these workloads generally contain more processes than there are processors in the machine, there are two factors that increase the ...
An efficient cache design for scalable glueless shared-memory multiprocessors
CF '06: Proceedings of the 3rd conference on Computing frontiers

Traditionally, cache coherence in large-scale shared-memory multiprocessors has been ensured by means of a distributed directory structure stored in main memory. In this way, the access to main memory to recover the sharing status of the block is ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Parallel and Distributed Systems

IEEE Transactions on Parallel and Distributed Systems Volume 28, Issue 4

April 2017

310 pages

ISSN:1045-9219

Issue’s Table of Contents

Copyright © 2017.

Publisher

IEEE Press

Publication History

Published: 01 April 2017

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

5
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 22 Sep 2024

Other Metrics

View Author Metrics

Citations

Cited By

Ding CChen DLiu FReber BSmith W(2022)CARL: Compiler Assigned Reference LeasingACM Transactions on Architecture and Code Optimization10.1145/349873019:1(1-28)Online publication date: 17-Mar-2022
https://dl.acm.org/doi/10.1145/3498730
Xiang YYe CWang XLuo YWang Z(2019)EMBAProceedings of the 48th International Conference on Parallel Processing10.1145/3337821.3337863(1-12)Online publication date: 5-Aug-2019
https://dl.acm.org/doi/10.1145/3337821.3337863
Chen DLiu FDing CPai S(2018)Locality analysis through static parallel samplingACM SIGPLAN Notices10.1145/3296979.319240253:4(557-570)Online publication date: 11-Jun-2018
https://dl.acm.org/doi/10.1145/3296979.3192402
Chen DLiu FDing CPai SFoster JGrossman D(2018)Locality analysis through static parallel samplingProceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3192366.3192402(557-570)Online publication date: 11-Jun-2018
https://dl.acm.org/doi/10.1145/3192366.3192402
Xiang YWang XHuang ZWang ZLuo YWang ZOliveira RFelber PHu Y(2018)DCAPSProceedings of the Thirteenth EuroSys Conference10.1145/3190508.3190511(1-15)Online publication date: 23-Apr-2018
https://dl.acm.org/doi/10.1145/3190508.3190511

View Options

View options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents