research-article

Analysis and approximation of optimal co-scheduling on chip multiprocessors

Authors:

Rahul Tripathi

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

Pages 220 - 229

https://doi.org/10.1145/1454115.1454146

Published: 25 October 2008

Abstract

Cache sharing among processors is important for chip multiprocessors because it reduces inter-thread latency, but it also introduces cache contention, which can degrade program performance considerably. Recent studies have shown that job co-scheduling can effectively alleviate the contention, but how to find optimal co-schedules efficiently remains an open question. Answering that question is critical for determining the potential of a co-scheduling system. This paper presents a theoretical analysis of the complexity of co-scheduling and proves that the problem is NP-complete. For the special case of two sharers per chip, we propose an algorithm that finds optimal co-schedules in polynomial time. For more complex cases, we design and evaluate a sequence of approximation algorithms, among which the hierarchical matching algorithm produces near-optimal schedules and scales well. This study facilitates the evaluation of co-scheduling systems and offers techniques directly usable in proactive job co-scheduling.
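The abstract does not spell out the polynomial-time algorithm, so the following is only a sketch of the problem it addresses, not the authors' method. It assumes the two-sharers case can be stated as: given a symmetric table of pairwise co-run costs, split the jobs into pairs so that the summed cost is minimal. The function name `best_pairing` and the `degradation` table are hypothetical, and the numbers are made up for illustration. The search below is an exhaustive dynamic program over bitmasks, which is exact but exponential in the number of jobs. Under the stated assumption the same formulation is a minimum-weight perfect matching, a problem known to be solvable in polynomial time.

```python
from functools import lru_cache


def best_pairing(degradation):
    """Return (total_cost, pairs) minimizing the summed co-run cost.

    degradation[i][j] is an assumed, symmetric cost of placing jobs i and j
    on the same chip. The number of jobs must be even.
    """
    n = len(degradation)
    if n % 2:
        raise ValueError("need an even number of jobs")
    full = (1 << n) - 1

    @lru_cache(maxsize=None)
    def solve(mask):
        # Bit k of mask is set once job k has been assigned to a chip.
        if mask == full:
            return 0.0, ()
        # Pair the lowest-numbered unassigned job first so that each
        # pairing is enumerated exactly once.
        i = next(k for k in range(n) if not mask & (1 << k))
        best = None
        for j in range(i + 1, n):
            if mask & (1 << j):
                continue
            rest_cost, rest_pairs = solve(mask | (1 << i) | (1 << j))
            cost = degradation[i][j] + rest_cost
            if best is None or cost < best[0]:
                best = (cost, ((i, j),) + rest_pairs)
        return best

    cost, pairs = solve(0)
    return cost, list(pairs)


# Illustrative (made-up) costs for four jobs.
example = [
    [0, 5, 1, 4],
    [5, 0, 3, 1],
    [1, 3, 0, 6],
    [4, 1, 6, 0],
]
print(best_pairing(example))  # (2.0, [(0, 2), (1, 3)])
```

With four jobs there are only three possible pairings (costs 11, 2, and 7 in this table), so the result is easy to check by hand. The exhaustive approach stops being practical well before realistic job counts, which is why a polynomial-time algorithm for this case matters.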


Cited By

Saroliya U, Arima E, Liu D, Schulz M (2023). Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach. 2023 IEEE International Conference on Cluster Computing (CLUSTER), pages 185-196. Online publication date: 31-Oct-2023.
https://doi.org/10.1109/CLUSTER52292.2023.00023

Arslan S, Ünsal O (2023). Efficient thread-to-core mapping alternatives for application-level redundant multithreading. Concurrency and Computation: Practice and Experience, 35:24. Online publication date: 18-Jan-2023.
https://doi.org/10.1002/cpe.7622

Yu H, Shen J, Zhang H, Wang J, Miao C, Xu M (2022). Scorpius: Proactive Code Preparation to Accelerate Function Startup. 2022 IEEE/ACM 30th International Symposium on Quality of Service (IWQoS), pages 1-10. Online publication date: 10-Jun-2022.
https://doi.org/10.1109/IWQoS54832.2022.9812868

Index Terms

Analysis and approximation of optimal co-scheduling on chip multiprocessors
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Multiprocessing / multiprogramming / multitasking
        Scheduling


Information & Contributors

Information

Published In


PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

October 2008

328 pages

ISBN:9781605582825

DOI:10.1145/1454115

General Chair: Andreas Moshovos (University of Toronto, Canada)

Program Chairs: David Tarditi (Microsoft, USA), Kunle Olukotun (Stanford University, USA)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

PACT '08

Sponsor:

PACT '08: International Conference on Parallel Architectures and Compilation Techniques

October 25 - 29, 2008

Toronto, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%


