SMT-Aware Instantaneous Footprint Optimization

Authors:

Shuaiwen Leon Song

HPDC '16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing

Pages 267 - 279

https://doi.org/10.1145/2907294.2907308

Published: 31 May 2016

Abstract

Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the entire memory hierarchy of a physical core. Without careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging because they typically spawn threads within Single Program Multiple Data (SPMD) models. Since these threads have similar resource requirements, their contention cannot be easily mitigated through simple thread scheduling. To address this issue, we first conduct a systematic performance evaluation of a wide range of representative HPC and CMP applications on three mainstream SMT architectures and quantify their performance sensitivity to SMT effects. We then introduce a simple scheme for SMT-aware code optimization that aims to reduce memory contention across SMT threads. Finally, we develop a lightweight performance tool, named SMTAnalyzer, to identify optimization opportunities in the source code of multithreaded programs. Experiments on three SMT architectures (Intel Xeon, IBM POWER7, and Intel Xeon Phi) demonstrate that our proposed SMT-aware optimization scheme can significantly improve the performance of general HPC applications.

Cited By

Li B, Su P, Chabbi M, Jiao S, Liu X, Dubach C, Bruening D, Hardekopf B (2023). DJXPerf: Identifying Memory Inefficiencies via Object-Centric Profiling for Java. Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, pp. 81-94. Online publication date: 17-Feb-2023.
https://dl.acm.org/doi/10.1145/3579990.3580010

Index Terms

SMT-Aware Instantaneous Footprint Optimization
1. General and reference
  1. Cross-computing tools and techniques

Information & Contributors

Information

Published In

HPDC '16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing

May 2016

302 pages

ISBN:9781450343145

DOI:10.1145/2907294

General Chair:
Hiroshi Nakashima, Kyoto University, Japan

Program Chairs:
Kenjiro Taura, The University of Tokyo, Japan
Jack Lange, University of Pittsburgh, USA

Copyright © 2016 ACM.

© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

SIGHPC: ACM Special Interest Group on High Performance Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

HPDC'16

HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing

May 31 - June 4, 2016

Kyoto, Japan

Acceptance Rates

HPDC '16 Paper Acceptance Rate 20 of 129 submissions, 16%

Overall Acceptance Rate 166 of 966 submissions, 17%
