Research article. DOI: 10.1145/2907294.2907308

SMT-Aware Instantaneous Footprint Optimization

Published: 31 May 2016

Abstract

Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the entire memory hierarchy of a physical core. Without careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging because such applications typically spawn threads under the Single Program Multiple Data (SPMD) model. Since these threads have similar resource requirements, their contention cannot be easily mitigated through simple thread scheduling. To address this issue, we first conduct a systematic performance evaluation of a wide range of representative HPC and CMP applications on three mainstream SMT architectures and quantify their performance sensitivity to SMT effects. We then introduce a simple scheme for SMT-aware code optimization that aims to reduce memory contention across SMT threads. Finally, we develop a lightweight performance tool, named SMTAnalyzer, to identify optimization opportunities in the source code of multithreaded programs. Experiments on three SMT architectures (i.e., Intel Xeon, IBM POWER7, and Intel Xeon Phi) demonstrate that our proposed SMT-aware optimization scheme can significantly improve the performance of general HPC applications.
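
The abstract does not spell out the optimization scheme, so the following is only a minimal sketch of the general idea behind footprint-aware tuning, not the authors' method. It assumes that each SMT thread on a core streams through the same number of arrays in tiles, and it picks a tile length so that the tiles of all sibling threads fit together in the cache they share. The function name `smt_aware_tile_elems`, its parameters, and the `fill_fraction` safety margin are hypothetical and chosen for illustration.

```python
def smt_aware_tile_elems(cache_bytes, smt_threads, arrays_per_thread,
                         elem_bytes, fill_fraction=0.5):
    """Return the largest per-array tile length (in elements) such that the
    tiles of all SMT threads sharing one core fit in the shared cache.

    All parameters are illustrative assumptions, not values from the paper.
    fill_fraction reserves part of the cache for other data (stack, code).
    """
    if min(cache_bytes, smt_threads, arrays_per_thread, elem_bytes) <= 0:
        raise ValueError("sizes and counts must be positive")
    if not 0 < fill_fraction <= 1:
        raise ValueError("fill_fraction must be in (0, 1]")

    # Cache budget available to a single SMT thread, in bytes.
    per_thread_budget = int(cache_bytes * fill_fraction) // smt_threads
    # Each thread holds one tile per array it touches.
    return per_thread_budget // (arrays_per_thread * elem_bytes)


if __name__ == "__main__":
    # Hypothetical 32 KiB cache, three arrays of 8-byte elements per thread.
    for threads in (1, 2, 4):
        print(threads, smt_aware_tile_elems(32 * 1024, threads, 3, 8))
```

Under these assumptions the tile shrinks as more SMT threads share the core, which is the point of sizing the working set per core rather than per thread. A real tuning pass would also need measured cache sizes and access patterns, which this sketch does not model.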

Cited By

  • DJXPerf: Identifying Memory Inefficiencies via Object-Centric Profiling for Java. Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, pp. 81-94, 2023. DOI: 10.1145/3579990.3580010. Online publication date: 17 Feb 2023.

        Published In

        HPDC '16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
        May 2016
        302 pages
        ISBN: 9781450343145
        DOI: 10.1145/2907294
        © 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. instantaneous footprint
        2. locality
        3. memory hierarchy
        4. performance tools
        5. smt
        6. smt-aware optimization

        Qualifiers

        • Research-article

        Acceptance Rates

        HPDC '16 paper acceptance rate: 20 of 129 submissions (16%)
        Overall acceptance rate: 166 of 966 submissions (17%)

        Article Metrics

        • Downloads (last 12 months): 105
        • Downloads (last 6 weeks): 5
        Download counts as of 27 Nov 2024.