DOI: 10.1145/2464996.2465016

Holistic run-time parallelism management for time and energy efficiency

Published: 10 June 2013

Abstract

The ubiquity of parallel machines necessitates time- and energy-efficient parallel execution of programs across a wide range of hardware and software environments. Prevalent parallel execution models often fall short of this goal: unable to account for dynamic changes in operating conditions, they may create a non-optimal degree of parallelism, leading to underutilization of resources or contention for them. We propose ParallelismDial (PD), a model that dynamically, continuously, and judiciously adapts a program's degree of parallelism to its current operating environment. PD measures system efficiency with a holistic metric and uses this metric to systematically optimize the program's execution.
We apply PD to two diverse parallel programming models: Intel TBB, an industry standard, and Prometheus, a recent research effort. We implemented two prototypes of PD and evaluated them on two stock multicore workstations, in both dedicated and multiprogrammed environments. Experimental results show that the prototypes outperform state-of-the-art approaches by, on average, 15% in time and 31% in energy efficiency in the dedicated environment, and by 19% and 21% in time and energy, respectively, in the multiprogrammed environment.
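The abstract frames PD as a feedback-driven search over a program's degree of parallelism, guided by a measured efficiency metric. The sketch below is only an illustration of that general idea, not the paper's algorithm: it uses a hypothetical spin-loop workload and plain throughput (tasks completed per second) as the stand-in metric, and greedily widens the thread count while the measurement keeps improving.

    // Illustrative sketch only (not PD itself): adapt the degree of
    // parallelism (worker-thread count) to a measured efficiency score.
    // The workload and the throughput metric are placeholders; PD's
    // holistic time/energy metric and search strategy are not reproduced.
    #include <algorithm>
    #include <atomic>
    #include <chrono>
    #include <cmath>
    #include <functional>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Placeholder work: each worker repeatedly completes small compute tasks.
    static void run_tasks(std::atomic<long>& completed, std::atomic<bool>& stop) {
        while (!stop.load(std::memory_order_relaxed)) {
            volatile double x = 0.0;
            for (int i = 0; i < 100000; ++i) x += std::sqrt(static_cast<double>(i));
            completed.fetch_add(1, std::memory_order_relaxed);
        }
    }

    // Run the placeholder workload with `dop` threads for a short window and
    // report tasks completed per second (the stand-in efficiency metric).
    static double measure_efficiency(int dop, std::chrono::milliseconds window) {
        std::atomic<long> completed{0};
        std::atomic<bool> stop{false};
        std::vector<std::thread> workers;
        for (int i = 0; i < dop; ++i)
            workers.emplace_back(run_tasks, std::ref(completed), std::ref(stop));
        std::this_thread::sleep_for(window);
        stop.store(true);
        for (auto& t : workers) t.join();
        return completed.load() / std::chrono::duration<double>(window).count();
    }

    int main() {
        const int max_dop =
            static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
        int best_dop = 1;
        double best_score = 0.0;
        // Greedy upward search: widen the degree of parallelism while the
        // measured efficiency keeps improving, then settle on the best point.
        for (int dop = 1; dop <= max_dop; ++dop) {
            double score = measure_efficiency(dop, std::chrono::milliseconds(200));
            std::cout << "dop=" << dop << "  score=" << score << "\n";
            if (score > best_score) { best_score = score; best_dop = dop; }
            else break;  // efficiency stopped improving; stop widening
        }
        std::cout << "selected degree of parallelism: " << best_dop << "\n";
        return 0;
    }

Because the abstract says PD adapts continuously to dynamic conditions and weighs energy as well as time, a controller in its spirit would repeat such a search during execution and fold an energy measure into the score rather than rely on throughput alone.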

Published In

ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing
June 2013
512 pages
ISBN: 9781450321303
DOI: 10.1145/2464996

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. autotuning
  2. parallel programming
  3. performance portability
  4. performance tuning
  5. run-time optimization

Conference

ICS '13: International Conference on Supercomputing
June 10-14, 2013
Eugene, Oregon, USA

Acceptance Rates

ICS '13 Paper Acceptance Rate: 43 of 202 submissions, 21%
Overall Acceptance Rate: 629 of 2,180 submissions, 29%

Article Metrics

  • Downloads (Last 12 months): 14
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 21 Nov 2024

Cited By

  • (2022) Amphis: Managing Reconfigurable Processor Architectures With Generative Adversarial Learning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41(11):3993-4003. DOI: 10.1109/TCAD.2022.3197980. Online publication date: Nov 2022.
  • (2022) Thermal-Aware Thread and Turbo Frequency Throttling Optimization for Parallel Applications. 2022 35th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI), pages 1-6. DOI: 10.1109/SBCCI55532.2022.9893245. Online publication date: 22 Aug 2022.
  • (2021) Teaching High Productivity and High Performance in an Introductory Parallel Programming Course. 2021 IEEE 28th International Conference on High Performance Computing, Data and Analytics Workshop (HiPCW), pages 21-28. DOI: 10.1109/HiPCW54834.2021.00010. Online publication date: Dec 2021.
  • (2021) Providing high-level self-adaptive abstractions for stream parallelism on multicores. Software: Practice and Experience, 51(6):1194-1217. DOI: 10.1002/spe.2948. Online publication date: 10 Jan 2021.
  • (2020) ALERT. Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference, pages 353-369. DOI: 10.5555/3489146.3489170. Online publication date: 15 Jul 2020.
  • (2020) Enhancing Resource Management Through Prediction-Based Policies. Euro-Par 2020: Parallel Processing, pages 493-509. DOI: 10.1007/978-3-030-57675-2_31. Online publication date: 18 Aug 2020.
  • (2019) Generative and multi-phase learning for computer systems optimization. Proceedings of the 46th International Symposium on Computer Architecture, pages 39-52. DOI: 10.1145/3307650.3326633. Online publication date: 22 Jun 2019.
  • (2019) Aurora: Seamless Optimization of OpenMP Applications. IEEE Transactions on Parallel and Distributed Systems, 30(5):1007-1021. DOI: 10.1109/TPDS.2018.2872992. Online publication date: 1 May 2019.
  • (2019) Simplifying and implementing service level objectives for stream parallelism. The Journal of Supercomputing. DOI: 10.1007/s11227-019-02914-6. Online publication date: 5 Jun 2019.
  • (2019) Tuning Parallel Applications. Parallel Computing Hits the Power Wall, pages 41-54. DOI: 10.1007/978-3-030-28719-1_4. Online publication date: 6 Nov 2019.
