research-article
Open access

Making the Most of SMT in HPC: System- and Application-Level Perspectives

Published: 09 January 2015

Abstract

This work presents an end-to-end methodology for quantifying the performance and power benefits of simultaneous multithreading (SMT) for HPC centers and applies this methodology to a production system and workload. Ultimately, SMT’s system-wide value depends on whether users employ SMT effectively at the application level. However, predicting SMT’s benefit for HPC applications is challenging, because doubling the number of threads may change the application’s characteristics. This work proposes statistical modeling techniques to predict the speedup that SMT confers on HPC applications. This approach, accurate to within 8%, uses only lightweight, transparent performance monitors collected during a single run of the application.
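As a rough illustration of the idea of predicting SMT speedup from performance-monitor data gathered in a single run, the sketch below fits a one-variable least-squares line and scores predictions by mean absolute percentage error. Everything in it is an assumption made for illustration: the predictor (`stall_fraction`), the synthetic numbers, the helper names, and the use of ordinary least squares are hypothetical and are not taken from the article, whose actual models and counters are not described in this abstract.

```python
# Hypothetical sketch: synthetic data and a plain least-squares line,
# NOT the article's model or its measured values.

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def mean_abs_pct_error(predicted, actual):
    """Mean absolute percentage error between predictions and measurements."""
    total = sum(abs(p - a) / a for p, a in zip(predicted, actual))
    return 100.0 * total / len(actual)

# Assumed training set: one counter-derived feature per application
# (fraction of cycles stalled without SMT) and its measured SMT speedup.
train_stall_fraction = [0.10, 0.20, 0.30, 0.40, 0.50]
train_speedup = [0.95, 1.05, 1.15, 1.25, 1.35]

intercept, slope = fit_line(train_stall_fraction, train_speedup)

def predict_speedup(stall_fraction):
    """Predict SMT speedup for a new application from its single-run feature."""
    return intercept + slope * stall_fraction

if __name__ == "__main__":
    print(predict_speedup(0.25))
```

A predicted speedup below 1.0 would suggest that enabling SMT slows the application down; an error figure such as the abstract's "within 8%" would be obtained by comparing such predictions against measured speedups on applications held out of the fit.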

Published In

ACM Transactions on Architecture and Code Optimization, Volume 11, Issue 4
January 2015
797 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2695583
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2015
Accepted: 01 October 2014
Revised: 01 October 2014
Received: 01 June 2014
Published in TACO Volume 11, Issue 4

Author Tags

  1. simultaneous multithreading
  2. energy evaluation
  3. high-performance computing
  4. performance evaluation

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • DoD High Performance Computing Modernization Program at the AFRL, ARL and ERDC DoD Supercomputing Resource Centers
  • HPCMP's PETTT program (Contract No: GS04T09DBC0017 through DRC)
  • Air Force Office of Scientific Research under AFOSR Award No. FA9550-12-1-0476

Cited By

  • (2022) Efficient detection of silent data corruption in HPC applications with synchronization-free message verification. The Journal of Supercomputing 78:1 (1381-1408). DOI: 10.1007/s11227-021-03892-4. Online publication date: 1-Jan-2022
  • (2021) vkpolybench: A crossplatform Vulkan Compute port of the PolyBench/GPU benchmark suite. SoftwareX 15 (100793). DOI: 10.1016/j.softx.2021.100793. Online publication date: Jul-2021
  • (2021) On the Detection of Silent Data Corruptions in HPC Applications Using Redundant Multi-threading. Euro-Par 2020: Parallel Processing Workshops (290-302). DOI: 10.1007/978-3-030-71593-9_23. Online publication date: 14-Mar-2021
  • (2020) Performance Characterization of Simultaneous Multi-Threading and Index Partitioning for an Online Document Search Application. 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (231-240). DOI: 10.1109/ISPASS48437.2020.00043. Online publication date: Aug-2020
  • (2020) Timing Predictability in High-Performance Computing with Probabilistic Real-Time. IEEE Access (1-1). DOI: 10.1109/ACCESS.2020.3038559. Online publication date: 2020
  • (2017) Improving IBM POWER8 Performance Through Symbiotic Job Scheduling. IEEE Transactions on Parallel and Distributed Systems 28:10 (2838-2851). DOI: 10.1109/TPDS.2017.2691708. Online publication date: 1-Oct-2017
  • (2016) The slowdown or race-to-idle question. Proceedings of the 2016 Conference on Design, Automation & Test in Europe (535-538). DOI: 10.5555/2971808.2971932. Online publication date: 14-Mar-2016
  • (2016) SMT-Aware Instantaneous Footprint Optimization. Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (267-279). DOI: 10.1145/2907294.2907308. Online publication date: 31-May-2016
  • (2016) Symbiotic job scheduling on the IBM POWER8. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) (669-680). DOI: 10.1109/HPCA.2016.7446103. Online publication date: Mar-2016
  • (2015) Compute bottlenecks on the new 64-bit ARM. Proceedings of the 3rd International Workshop on Energy Efficient Supercomputing (1-7). DOI: 10.1145/2834800.2834806. Online publication date: 15-Nov-2015
