Online Scalability Characterization of Data-Parallel Programs on Many Cores

Authors:

Bernhard Egger

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

Pages 191 - 205

https://doi.org/10.1145/2967938.2967960

Published: 11 September 2016

Abstract

We present an accurate online scalability prediction model for data-parallel programs on NUMA many-core systems. Memory contention is considered the major limiting factor of program scalability because data parallelism limits the amount of synchronization and the number of data dependencies between parallel work units. Reflecting the architecture of NUMA systems, contention is modeled at the last-level caches of the compute nodes and at the memory nodes, using a two-level queuing model to estimate the mean service time of the individual memory nodes. Scalability predictions for individual or co-located parallel applications are based solely on data obtained during a short sampling period at runtime; this allows the presented model to be employed in a variety of scenarios. The proposed model has been implemented in an open-source OpenCL runtime and in the GNU OpenMP runtime, and evaluated on a 64-core AMD system. For a wide variety of parallel workloads and configurations, the evaluations show that the model predicts the scalability of data-parallel kernels with high accuracy.
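To make the idea of queuing-based scalability prediction concrete, the sketch below applies textbook exact Mean Value Analysis (MVA) to a closed queueing network. It is not the paper's two-level model, which this page does not spell out; it only illustrates how contention at shared resources can bound speedup. The function names, the `think_time` parameter (compute time per memory request), and the `demands` list (service demand at each contended resource, for example one last-level cache and one memory node) are all assumptions introduced for illustration.

```python
from typing import Sequence


def mva_throughput(n_cores: int, think_time: float,
                   demands: Sequence[float]) -> float:
    """Throughput of a closed queueing network with n_cores customers.

    Illustrative assumption: each core alternates between computing for
    think_time and issuing one request that visits every station, where
    demands[k] is the service demand at station k. Standard exact MVA.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    if think_time + sum(demands) <= 0.0:
        raise ValueError("think_time and demands must not all be zero")

    queue = [0.0] * len(demands)  # mean queue length per station
    throughput = 0.0
    for n in range(1, n_cores + 1):
        # An arriving request waits for the requests already queued
        # (as seen with n - 1 customers), then receives its own service.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        throughput = n / (think_time + sum(residence))
        # Little's law gives the new mean queue lengths.
        queue = [throughput * r for r in residence]
    return throughput


def predicted_speedup(n_cores: int, think_time: float,
                      demands: Sequence[float]) -> float:
    """Speedup over one core, as the ratio of modeled throughputs."""
    base = mva_throughput(1, think_time, demands)
    return mva_throughput(n_cores, think_time, demands) / base
```

In this sketch, a workload with no memory demand scales linearly with the core count, while a workload that spends all of its time at a single shared station does not scale at all. In an online setting, the inputs would have to be derived from counters sampled at runtime; how the paper does that is not described on this page.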

Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

September 2016

474 pages

ISBN: 9781450341219

DOI: 10.1145/2967938

General Chairs: Ayal Zaks (Intel, Israel), Bilha Mendelson (Optitura, Israel)

Program Chairs: Lawrence Rauchwerger (Texas A&M University, USA), Wen-mei W. Hwu (University of Illinois at Urbana-Champaign, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

IFIP WG 10.3
IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

BK21 Plus for Pioneers in Innovative Computing funded by the National Research Foundation (NRF) of Korea
Promising-Pioneering Researcher Program through Seoul National University
Basic Science Research Program through NRF funded by the Ministry of Science ICT & Future Planning

Conference

PACT '16

PACT '16: International Conference on Parallel Architectures and Compilation

September 11 - 15, 2016

Haifa, Israel

Acceptance Rates

PACT '16 Paper Acceptance Rate: 31 of 119 submissions, 26%

Overall Acceptance Rate: 121 of 471 submissions, 26%

Cited By

Cho Y, Park J, Negele F, Jo C, Gross T, Egger B, Lee J, Agrawal K, Spear M (2022). Dopia. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 32-45. Online publication date: 2-Apr-2022. https://dl.acm.org/doi/10.1145/3503221.3508421

Entezari-Maleki R, Cho Y, Egger B (2020). Evaluation of memory performance in NUMA architectures using Stochastic Reward Nets. Journal of Parallel and Distributed Computing. Online publication date: Jun-2020. https://doi.org/10.1016/j.jpdc.2020.05.022

Cho Y, Guzman C, Egger B, Evripidou S, Stenström P, O'Boyle M (2018). Maximizing system utilization via parallelism management for co-located parallel applications. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pages 1-14. Online publication date: 1-Nov-2018. https://dl.acm.org/doi/10.1145/3243176.3243199

Cho Y, Guzman C, Egger B (2017). POSTER: Improving NUMA System Efficiency with a Utilization-Based Co-scheduling. 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 150-151. Online publication date: Sep-2017. https://doi.org/10.1109/PACT.2017.27

Plauth M, Sterz C, Eberhardt F, Feinbube F, Polze A (2017). Assessing NUMA Performance Based on Hardware Event Counters. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 904-913. Online publication date: May-2017. https://doi.org/10.1109/IPDPSW.2017.51
