research-article

Open access

Making the Most of SMT in HPC: System- and Application-Level Perspectives

Authors:

Michael A. Laurenzano,

William A. Ward, Jr.,

Laura Carrington

ACM Transactions on Architecture and Code Optimization (TACO), Volume 11, Issue 4

Article No.: 59, Pages 1 - 26

https://doi.org/10.1145/2687651

Published: 09 January 2015

Abstract

This work presents an end-to-end methodology for quantifying the performance and power benefits of simultaneous multithreading (SMT) for HPC centers and applies this methodology to a production system and workload. Ultimately, SMT’s value system-wide depends on whether users effectively employ SMT at the application level. However, predicting SMT’s benefit for HPC applications is challenging because doubling the number of threads may change the application’s characteristics. This work proposes statistical modeling techniques to predict the speedup SMT confers to HPC applications. This approach, accurate to within 8%, uses only lightweight, transparent performance monitors collected during a single run of the application.

Information & Contributors

Information

Published In

ACM Transactions on Architecture and Code Optimization Volume 11, Issue 4

January 2015

797 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/2695583

Editor:
Koen De Bosschere
Ghent University

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2015

Accepted: 01 October 2014

Revised: 01 October 2014

Received: 01 June 2014

Published in TACO Volume 11, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

DoD High Performance Computing Modernization Program at the AFRL, ARL and ERDC DoD Supercomputing Resource Centers
HPCMP's PETTT program (Contract No: GS04T09DBC0017 through DRC)
Air Force Office of Scientific Research under AFOSR Award No. FA9550-12-1-0476
