research-article

Per-thread cycle accounting in SMT processors

Published: 07 March 2009

Abstract

This paper proposes a cycle accounting architecture for Simultaneous Multithreading (SMT) processors that estimates, while the threads run simultaneously on the SMT processor, the execution time each thread would have had if executed alone. It does so by attributing each cycle during multi-threaded execution to one of three components: base, miss event, or waiting. Single-threaded alone execution time is then estimated as the sum of the base and miss event components; the waiting cycle component represents the cycles lost due to SMT execution. The cycle accounting architecture incurs reasonable hardware cost (around 1KB of storage) and estimates single-threaded performance with average prediction errors of around 7.2% for two-program workloads and 11.7% for four-program workloads.
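The accounting identity described above can be illustrated with a small sketch. The `CycleCounters` class and its field names are hypothetical, and the counter values are made up for illustration; the hardware mechanism that decides which component a given cycle belongs to is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class CycleCounters:
    """Hypothetical per-thread counters; the names are assumptions, not the paper's."""
    base: int        # cycles attributed to the base component
    miss_event: int  # cycles attributed to miss events
    waiting: int     # cycles lost because of SMT co-execution

    def total_cycles(self) -> int:
        # Every cycle is attributed to exactly one of the three components.
        return self.base + self.miss_event + self.waiting

    def estimated_alone_cycles(self) -> int:
        # Alone execution time is estimated as base + miss event cycles.
        return self.base + self.miss_event


# Illustrative numbers only (not measurements from the paper).
t0 = CycleCounters(base=600, miss_event=250, waiting=150)
print(t0.total_cycles())            # 1000
print(t0.estimated_alone_cycles())  # 850
```

In this made-up example, 150 of the 1000 cycles the thread spent on the SMT processor are waiting cycles, so its estimated alone execution time is 850 cycles.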
The cycle accounting architecture has several important applications to system software and its interaction with SMT hardware. First, the estimated single-threaded alone execution time gives system software an accurate picture of the processor cycles each thread actually consumed. Using the alone execution time instead of the total execution time (timeslice) may make system software scheduling policies more effective. Second, a new class of thread-progress aware SMT fetch policies, based on per-thread progress indicators, enables system software level priorities to be enforced at the hardware level.
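One way to picture a per-thread progress indicator is as the ratio of estimated alone cycles to total SMT cycles. The sketch below is an assumption made for illustration: the function names, the ratio definition, the target values standing in for system software priorities, and the rule of favoring the thread furthest below its target are all hypothetical and are not taken from the paper's fetch policies.

```python
from typing import Dict, Tuple


def progress(alone_cycles: int, total_cycles: int) -> float:
    """Hypothetical progress indicator: estimated alone cycles per SMT cycle."""
    return alone_cycles / total_cycles if total_cycles else 0.0


def pick_thread_to_fetch(threads: Dict[str, Tuple[int, int, float]]) -> str:
    """Return the thread furthest below its target progress.

    `threads` maps a thread name to
    (estimated_alone_cycles, total_cycles, target_progress).
    The target stands in for a system software priority (an assumption).
    """
    def gap(name: str) -> float:
        alone, total, target = threads[name]
        return progress(alone, total) - target

    return min(threads, key=gap)


# Illustrative numbers only.
threads = {
    "A": (850, 1000, 0.9),  # progress 0.85, below its target of 0.9
    "B": (600, 1000, 0.5),  # progress 0.60, above its target of 0.5
}
print(pick_thread_to_fetch(threads))  # A
```

Thread A is chosen here because it is the only thread running behind its target, which is the kind of decision a progress-aware policy would need to make in hardware.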

    Information

    Published In

    ACM SIGARCH Computer Architecture News, Volume 37, Issue 1
    ASPLOS 2009
    March 2009
    346 pages
    ISSN: 0163-5964
    DOI: 10.1145/2528521

    ASPLOS XIV: Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
    March 2009
    358 pages
    ISBN: 9781605584065
    DOI: 10.1145/1508244
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cycle accounting
    2. simultaneous multithreading (smt)
    3. thread-progress aware fetch policy

    Qualifiers

    • Research-article
