Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors

Authors:

Michael Stumm

EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

Pages 47 - 58

https://doi.org/10.1145/1272996.1273004

Published: 21 March 2007

Abstract

The major chip manufacturers have all introduced chip multiprocessing (CMP) and simultaneous multithreading (SMT) technology into their processing units. As a result, even low-end computing systems and game consoles have become shared memory multiprocessors with L1 and L2 cache sharing within a chip. Mid- and large-scale systems will have multiple processing chips and hence consist of an SMP-CMP-SMT configuration with non-uniform data sharing overheads. Current operating system schedulers are not aware of these new cache organizations, and as a result, distribute threads across processors in a way that causes many unnecessary, long-latency cross-chip cache accesses.

In this paper, we describe the design and implementation of a scheme that schedules threads based on sharing patterns detected online using features of the standard performance monitoring units (PMUs) available in today's processing units. The primary advantage of using the PMU infrastructure is that it is fine-grained (down to the cache line) and has relatively low overhead. We have implemented our scheme in Linux running on an 8-way Power5 SMP-CMP-SMT multiprocessor. For commercial multithreaded server workloads (VolanoMark, SPECjbb, and RUBiS), we demonstrate reductions in cross-chip cache accesses of up to 70%. These reductions lead to application-reported performance improvements of up to 7%.
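The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the general idea of sharing-aware placement, not the authors' implementation. It assumes each thread has a hypothetical "sharing signature" (counts of sampled cross-chip cache accesses per hashed memory region, as a PMU-based sampler might produce), groups threads whose signatures overlap strongly, and places each group on one chip. The names `similarity`, `cluster_threads`, `assign_to_chips`, the dot-product metric, and the threshold value are all illustrative assumptions.

```python
def similarity(a, b):
    """Dot product of two sharing signatures (assumed metric; higher means more shared data)."""
    return sum(x * y for x, y in zip(a, b))


def cluster_threads(signatures, threshold):
    """Greedy one-pass clustering: a thread joins the first cluster whose
    representative signature is similar enough, else it starts a new cluster."""
    clusters = []  # list of (representative_signature, [thread ids])
    for tid, sig in signatures.items():
        for rep, members in clusters:
            if similarity(rep, sig) >= threshold:
                members.append(tid)
                break
        else:
            clusters.append((sig, [tid]))
    return [members for _, members in clusters]


def assign_to_chips(clusters, num_chips):
    """Place the largest clusters first, each on the currently least-loaded chip."""
    placement = {}
    load = [0] * num_chips
    for members in sorted(clusters, key=len, reverse=True):
        chip = load.index(min(load))
        for tid in members:
            placement[tid] = chip
        load[chip] += len(members)
    return placement


# Hypothetical data: sampled remote-access counts per hashed memory region.
signatures = {
    "t0": [9, 8, 0, 0],
    "t1": [0, 0, 7, 9],
    "t2": [8, 9, 0, 1],
    "t3": [1, 0, 9, 8],
}
clusters = cluster_threads(signatures, threshold=50)
placement = assign_to_chips(clusters, num_chips=2)
print(clusters)
print(placement)
```

In this toy input, threads t0 and t2 touch the same regions, as do t1 and t3, so each pair ends up on its own chip and their shared data can stay in that chip's caches. A real scheduler would also have to balance load against sharing and re-evaluate clusters as sharing patterns change over time.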


Published In

EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007

March 2007

431 pages

ISBN: 9781595936363

DOI: 10.1145/1272996

ACM SIGOPS Operating Systems Review Volume 41, Issue 3
EuroSys'07 Conference Proceedings
June 2007
386 pages
ISSN: 0163-5980
DOI: 10.1145/1272998

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

EuroSys07

Sponsor:

SIGOPS

EuroSys07: EuroSys 2007 Conference

March 21 - 23, 2007

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%
