DOI: 10.1145/2967938.2967960
research-article

Online Scalability Characterization of Data-Parallel Programs on Many Cores

Published: 11 September 2016

Abstract

We present an accurate online scalability prediction model for data-parallel programs on NUMA many-core systems. Because data parallelism limits synchronization and data dependencies between parallel work units, memory contention is considered the major limiting factor of program scalability. Reflecting the architecture of NUMA systems, contention is modeled at the last-level caches of the compute nodes and at the memory nodes, using a two-level queuing model to estimate the mean service time of the individual memory nodes. Scalability predictions for individual or co-located parallel applications are based solely on data obtained during a short sampling period at runtime, which allows the model to be employed in a variety of scenarios. The proposed model has been implemented in an open-source OpenCL runtime and the GNU OpenMP runtime and evaluated on a 64-core AMD system. For a wide variety of parallel workloads and configurations, the evaluation shows that the model predicts the scalability of data-parallel kernels with high accuracy.
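
To give a feel for how a queuing model can turn memory contention into a scalability estimate, the following is a minimal sketch and not the model from the paper. It assumes a much simpler setting: a single memory node treated as one queuing station, cores that alternate between computing and issuing memory requests, and exact Mean Value Analysis for a closed network. The function names and the parameters `think_time` (mean compute time between memory requests) and `service_time` (mean uncontended memory service time) are hypothetical; in an online setting they would come from a sampling period.

```python
def mva_throughput(n_cores, think_time, service_time):
    """Exact Mean Value Analysis for a closed queuing network with one
    delay station (computation) and one queuing station (memory node).

    Returns the rate of completed memory requests with n_cores cores.
    """
    queue_len = 0.0   # mean queue length at the memory node with n-1 cores
    throughput = 0.0
    for n in range(1, n_cores + 1):
        # An arriving request waits for the requests already queued.
        response = service_time * (1.0 + queue_len)
        # Each core cycles through one compute phase and one memory request.
        throughput = n / (think_time + response)
        # Little's law gives the mean queue length for the next iteration.
        queue_len = throughput * response
    return throughput


def predicted_speedup(n_cores, think_time, service_time):
    """Speedup over one core, assuming progress is proportional to throughput."""
    base = mva_throughput(1, think_time, service_time)
    return mva_throughput(n_cores, think_time, service_time) / base


if __name__ == "__main__":
    # Illustrative values only: 9 time units of compute per 1 unit of memory service.
    for cores in (1, 2, 4, 8, 16, 32, 64):
        print(cores, round(predicted_speedup(cores, 9.0, 1.0), 2))
```

In this simplified setting the predicted speedup is bounded by `(think_time + service_time) / service_time`, the point at which the memory node saturates. The paper's model additionally accounts for last-level caches and multiple memory nodes, which this sketch does not attempt to reproduce.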


Published In

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
September 2016
474 pages
ISBN:9781450341219
DOI:10.1145/2967938
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data-parallelism
  2. last-level caches
  3. many-core resource management
  4. memory architectures
  5. memory contention
  6. multi-socket multi-core architectures
  7. NUMA architectures
  8. OpenCL
  9. OpenMP
  10. parallel programming models
  11. performance modeling
  12. performance prediction
  13. performance scalability
  14. queuing models
  15. space-shared scheduling

Qualifiers

  • Research-article

Funding Sources

  • BK21 Plus for Pioneers in Innovative Computing funded by the National Research Foundation (NRF) of Korea
  • Promising-Pioneering Researcher Program through Seoul National University
  • Basic Science Research Program through NRF funded by the Ministry of Science ICT & Future Planning

Conference

PACT '16
Sponsor:
  • IFIP WG 10.3
  • IEEE TCCA
  • SIGARCH
  • IEEE CS TCPP

Acceptance Rates

PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;
Overall Acceptance Rate 121 of 471 submissions, 26%

Article Metrics

  • Downloads (Last 12 months): 16
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 23 Nov 2024

Cited By

  • (2022) Dopia. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 32-45. DOI: 10.1145/3503221.3508421. Online publication date: 2-Apr-2022.
  • (2020) Evaluation of memory performance in NUMA architectures using Stochastic Reward Nets. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2020.05.022. Online publication date: Jun-2020.
  • (2018) Maximizing system utilization via parallelism management for co-located parallel applications. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pages 1-14. DOI: 10.1145/3243176.3243199. Online publication date: 1-Nov-2018.
  • (2017) POSTER: Improving NUMA System Efficiency with a Utilization-Based Co-scheduling. 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 150-151. DOI: 10.1109/PACT.2017.27. Online publication date: Sep-2017.
  • (2017) Assessing NUMA Performance Based on Hardware Event Counters. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 904-913. DOI: 10.1109/IPDPSW.2017.51. Online publication date: May-2017.
