research-article

Per-thread cycle accounting in SMT processors

Authors:

Lieven Eeckhout

ACM SIGARCH Computer Architecture News, Volume 37, Issue 1

Pages 133 - 144

https://doi.org/10.1145/2528521.1508260

Published: 07 March 2009

Abstract

This paper proposes a cycle accounting architecture for Simultaneous Multithreading (SMT) processors that estimates the execution times for each of the threads had they been executed alone, while they are running simultaneously on the SMT processor. This is done by attributing each cycle during multi-threaded execution to one of three components: base, miss event, or waiting cycle. Single-threaded alone execution time is then estimated as the sum of the base and miss event components; the waiting cycle component represents the cycles lost due to SMT execution. The cycle accounting architecture incurs reasonable hardware cost (around 1KB of storage) and estimates single-threaded performance with average prediction errors of around 7.2% for two-program workloads and 11.7% for four-program workloads.
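
The bookkeeping described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `CycleStack`, its field names, and the derived `estimated_slowdown` ratio are hypothetical and are not taken from the paper, which describes a hardware mechanism. The sketch shows how the three per-thread components might be summed to obtain the alone-execution estimate.

```python
from dataclasses import dataclass


@dataclass
class CycleStack:
    """Per-thread cycle counters (hypothetical names, for illustration only)."""

    base: int = 0        # cycles spent making forward progress
    miss_event: int = 0  # cycles lost to the thread's own miss events
    waiting: int = 0     # cycles lost because of SMT co-execution

    def account(self, component: str, cycles: int = 1) -> None:
        """Attribute `cycles` to exactly one of the three components."""
        if component not in ("base", "miss_event", "waiting"):
            raise ValueError(f"unknown component: {component}")
        setattr(self, component, getattr(self, component) + cycles)

    @property
    def smt_cycles(self) -> int:
        """Total cycles observed while running on the SMT processor."""
        return self.base + self.miss_event + self.waiting

    @property
    def estimated_alone_cycles(self) -> int:
        """Alone execution time estimate: base plus miss event components."""
        return self.base + self.miss_event

    @property
    def estimated_slowdown(self) -> float:
        """SMT cycles divided by estimated alone cycles (an assumed metric)."""
        alone = self.estimated_alone_cycles
        return self.smt_cycles / alone if alone else float("inf")


# Made-up numbers: 600 base, 250 miss event, 150 waiting cycles.
stack = CycleStack()
stack.account("base", 600)
stack.account("miss_event", 250)
stack.account("waiting", 150)
print(stack.estimated_alone_cycles, stack.smt_cycles)
```

With these made-up counts, 850 of the 1000 observed cycles would be attributed to the thread itself and 150 to sharing the processor.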

The cycle accounting architecture has several important applications to system software and its interaction with SMT hardware. First, the estimated single-threaded alone execution time gives system software an accurate picture of the processor cycles each thread actually consumed. Using the alone execution time instead of the total execution time (timeslice) may make system software scheduling policies more effective. Second, a new class of thread-progress aware SMT fetch policies, based on per-thread progress indicators, enables system software level priorities to be enforced at the hardware level.
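
One way to picture a thread-progress aware policy is sketched below. The abstract does not specify how progress indicators or priorities are combined, so everything here is an assumption: the progress indicator is taken to be estimated alone cycles divided by elapsed SMT cycles, priorities are modeled as positive weights, and the function names `progress` and `pick_fetch_thread` are hypothetical. The policy simply favors the thread whose weighted progress lags the most.

```python
from typing import Dict, Tuple


def progress(alone_cycles: int, smt_cycles: int) -> float:
    """Assumed progress indicator: share of SMT cycles that count as alone progress."""
    return alone_cycles / smt_cycles if smt_cycles else 0.0


def pick_fetch_thread(threads: Dict[str, Tuple[int, int, float]]) -> str:
    """Return the thread that lags most relative to its priority weight.

    `threads` maps a thread name to a tuple of
    (estimated alone cycles, elapsed SMT cycles, priority weight).
    A larger weight means the thread is entitled to more progress.
    """
    if not threads:
        raise ValueError("no threads to choose from")

    def weighted_progress(item: Tuple[str, Tuple[int, int, float]]) -> float:
        _, (alone, total, weight) = item
        if weight <= 0:
            raise ValueError("priority weight must be positive")
        return progress(alone, total) / weight

    return min(threads.items(), key=weighted_progress)[0]


# Equal priorities: thread B has made less progress, so it is favored.
equal = {"A": (850, 1000, 1.0), "B": (400, 1000, 1.0)}
print(pick_fetch_thread(equal))

# Thread A is given four times the weight, so it is now the one lagging.
skewed = {"A": (850, 1000, 4.0), "B": (400, 1000, 1.0)}
print(pick_fetch_thread(skewed))
```

A real fetch policy would make this decision in hardware every cycle or every few cycles; the dictionary-based interface is only a convenience for the illustration.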

Cited By

Kandalintsev A, Kliazovich D, Lo Cigno R (2016). Freeze'nSense: estimation of performance isolation in cloud environments. Software: Practice and Experience 47:6 (831-847). Online publication date: 27-Sep-2016.
https://doi.org/10.1002/spe.2456
Navarro M, Feliu J, Petit S, Gómez M, Sahuquillo J (2024). SYNPA: SMT Performance Analysis and Allocation of Threads to Cores in ARM Processors. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (705-715). Online publication date: 27-May-2024.
https://doi.org/10.1109/IPDPS57955.2024.00068
Eyerman S, Heirman W, Du Bois K, Hur I (2018). Multi-Stage CPI Stacks. IEEE Computer Architecture Letters 17:1 (55-58). Online publication date: 1-Jan-2018.
https://dl.acm.org/doi/10.1109/LCA.2017.2761751

Index Terms

Per-thread cycle accounting in SMT processors
1. Computer systems organization
  1. Architectures
    1. Parallel architectures

Information & Contributors

Information

Published In

ACM SIGARCH Computer Architecture News Volume 37, Issue 1

ASPLOS 2009

March 2009

346 pages

ISSN:0163-5964

DOI:10.1145/2528521

ASPLOS XIV: Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
March 2009
358 pages
ISBN:9781605584065
DOI:10.1145/1508244
General Chair: Mary Lou Soffa, University of Virginia, USA
Program Chair: Mary Jane Irwin, Penn State University, USA

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 March 2009

Published in SIGARCH Volume 37, Issue 1

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 60
Total Downloads: 790
Downloads (Last 12 months): 38
Downloads (Last 6 weeks): 2

Reflects downloads up to 24 Nov 2024

Citations

Cited By

Eyerman S, Heirman W, Du Bois K, Hur I (2018). Extending the Performance Analysis Tool Box: Multi-stage CPI Stacks and FLOPS Stacks. 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (179-188). Online publication date: Apr-2018.
https://doi.org/10.1109/ISPASS.2018.00031
Jahre M, Eeckhout L (2018). GDP: Using Dataflow Properties to Accurately Estimate Interference-Free Performance at Runtime. 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) (296-309). Online publication date: Feb-2018.
https://doi.org/10.1109/HPCA.2018.00034
Xiong D, Huang K, Jiang X, Yan X (2017). Providing Predictable Performance via a Slowdown Estimation Model. ACM Transactions on Architecture and Code Optimization 14:3 (1-26). Online publication date: 22-Aug-2017.
https://dl.acm.org/doi/10.1145/3124451
Feliu J, Eyerman S, Sahuquillo J, Petit S, Eeckhout L (2017). Improving IBM POWER8 Performance Through Symbiotic Job Scheduling. IEEE Transactions on Parallel and Distributed Systems 28:10 (2838-2851). Online publication date: 1-Oct-2017.
https://doi.org/10.1109/TPDS.2017.2691708
Selfa V, Sahuquillo J, Eeckhout L, Petit S, Gomez M (2017). Application Clustering Policies to Address System Fairness with Intel’s Cache Allocation Technology. 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) (194-205). Online publication date: Sep-2017.
https://doi.org/10.1109/PACT.2017.19
Petit S, Sahuquillo J, Gómez M, Selfa V (2017). A research-oriented course on Advanced Multicore Architecture. Journal of Parallel and Distributed Computing 105:C (63-72). Online publication date: 1-Jul-2017.
https://dl.acm.org/doi/10.1016/j.jpdc.2017.01.011
Bao Y, Wang S (2017). Labeled von Neumann Architecture for Software-Defined Cloud. Journal of Computer Science and Technology 32:2 (219-223). Online publication date: 13-Mar-2017.
https://doi.org/10.1007/s11390-017-1716-0
