research-article

Mesh independent loop fusion for unstructured mesh applications

Authors: Carlo Bertolli, Paul H.J. Kelly, Gihan R. Mudalige, Mike B. Giles

CF '12: Proceedings of the 9th conference on Computing Frontiers

Pages 43 - 52

https://doi.org/10.1145/2212908.2212917

Published: 15 May 2012

Abstract

Applications based on unstructured meshes are typically compute-intensive, leading to long running times. In principle, state-of-the-art hardware, such as multi-core CPUs and many-core GPUs, could be used to accelerate them, but these architectures require specialised knowledge to achieve optimal performance. OP2 is a parallel programming layer that attempts to ease this programming burden by allowing programmers to express parallel iterations over the elements of an unstructured mesh through an API call, a so-called OP2-loop. The OP2 compiler infrastructure then uses source-to-source transformations to realise a parallel implementation of each OP2-loop and to discover opportunities for optimisation.
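To make the idea of a parallel iteration over mesh elements concrete, the following is a minimal sketch under stated assumptions. It is not the OP2 API: `par_loop`, `edge_difference`, and the `edge_to_node` map are hypothetical names chosen for illustration, and the loop runs sequentially where a real back end would be free to run iterations in parallel.

```python
# Hypothetical sketch: NOT the OP2 API. All names here are illustrative.

def par_loop(kernel, n_elements, *args):
    """Apply `kernel` once per element of an iteration set.

    Iterations are assumed to be independent, so a real back end could
    execute them in parallel; this sketch simply runs them in order.
    """
    for i in range(n_elements):
        kernel(i, *args)


def edge_difference(e, edge_to_node, node_value, edge_out):
    """Per-edge kernel: reach the two end nodes through the edge-to-node map."""
    a, b = edge_to_node[e]
    edge_out[e] = node_value[b] - node_value[a]


# A tiny unstructured mesh: 4 nodes joined by 3 edges.
edge_to_node = [(0, 1), (1, 2), (2, 3)]
node_value = [1.0, 4.0, 9.0, 16.0]
edge_out = [0.0] * len(edge_to_node)

par_loop(edge_difference, len(edge_to_node), edge_to_node, node_value, edge_out)
```

The point of the sketch is the separation of concerns: the user writes only the per-element kernel and names the data it touches, while the loop driver decides how the iterations are executed.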

In this paper, we describe how several compiler techniques can be effectively utilised in tandem to increase the performance of unstructured mesh applications. In particular, we show how whole-program analysis, which is often inhibited by the size of the control flow graph, often becomes feasible as a result of the OP2 programming model, facilitating aggressive optimisation. We then show how whole-program analysis becomes an enabler of OP2-loop optimisations. Based on this, we show how a classical technique, namely loop fusion, which is typically difficult to apply to unstructured mesh applications, can be defined at compile-time. We examine the limits of its application and show experimental results on a computational fluid dynamics application benchmark, assessing the performance gains due to loop fusion.
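As an illustration of the classical transformation named above, the sketch below fuses two loops by hand. It is an assumption-laden example, not the paper's compiler transformation: the function names and the tiny mesh are hypothetical, and it covers only one simple case, where both loops iterate over the same set of edges and iteration `e` of the second loop reads only the value that iteration `e` of the first loop produced. In that case fusion is legal whatever the mesh connectivity is; the abstract does not state the paper's actual legality criteria.

```python
# Illustrative only: a hand-fused pair of loops over the same edge set.

def unfused(edge_to_node, node_x, node_y):
    n_edges = len(edge_to_node)
    # Loop 1: squared length of each edge, stored in a temporary array.
    sq_len = [0.0] * n_edges
    for e in range(n_edges):
        a, b = edge_to_node[e]
        sq_len[e] = (node_x[b] - node_x[a]) ** 2 + (node_y[b] - node_y[a]) ** 2
    # Loop 2: same iteration set; iteration e reads only sq_len[e].
    length = [0.0] * n_edges
    for e in range(n_edges):
        length[e] = sq_len[e] ** 0.5
    return length


def fused(edge_to_node, node_x, node_y):
    n_edges = len(edge_to_node)
    length = [0.0] * n_edges
    for e in range(n_edges):
        a, b = edge_to_node[e]
        # The temporary array collapses to a scalar: one pass over the edges.
        sq = (node_x[b] - node_x[a]) ** 2 + (node_y[b] - node_y[a]) ** 2
        length[e] = sq ** 0.5
    return length


# A triangle with nodes at (0, 0), (3, 0) and (3, 4).
edge_to_node = [(0, 1), (1, 2), (2, 0)]
node_x = [0.0, 3.0, 3.0]
node_y = [0.0, 0.0, 4.0]

lengths_unfused = unfused(edge_to_node, node_x, node_y)
lengths_fused = fused(edge_to_node, node_x, node_y)
```

Fusion here removes the intermediate array and the second traversal of the edge set, which is the kind of saving in memory traffic that makes the technique attractive. When the second loop instead reads values produced by other iterations through a map, this simple argument no longer applies.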


Cited By

Xia C, Zhao J, Sun Q, Wang Z, Wen Y, Yu T, Feng X, Cui H, Tsafrir D, Musuvathi M, Gupta R, Abu-Ghazaleh N (2024). Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pages 286-301. Online publication date: 27-Apr-2024.
https://dl.acm.org/doi/10.1145/3617232.3624858

Krieger C, Strout M, Olschanowsky C, Stone A, Guzik S, Gao X, Bertolli C, Kelly P, Mudalige G, Van Straalen B, Williams S (2013). Loop Chaining. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pages 375-384. Online publication date: 20-May-2013.
https://dl.acm.org/doi/10.1109/IPDPSW.2013.68

Kelly P, Filinski A, Grelck C (2012). Using domain-specific languages and access-execute descriptors to expand the parallel code synthesis design space. Proceedings of the 1st ACM SIGPLAN workshop on Functional high-performance computing, pages 1-2. Online publication date: 15-Sep-2012.
https://dl.acm.org/doi/10.1145/2364474.2364476

Index Terms

Mesh independent loop fusion for unstructured mesh applications
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages


Information & Contributors

Information

Published In


CF '12: Proceedings of the 9th conference on Computing Frontiers

May 2012

320 pages

ISBN:9781450312158

DOI:10.1145/2212908

General Chair: John Feo (Pacific Northwest National Laboratory, USA)

Program Chairs: Paolo Faraboschi (HP Labs, Spain), Oreste Villa (Pacific Northwest National Laboratory, USA)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

CF'12

Sponsor:

SIGMICRO

CF'12: Computing Frontiers Conference

May 15 - 17, 2012

Cagliari, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%


Bibliometrics & Citations

Total Citations: 3
Total Downloads: 123
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 21 Nov 2024
