article

Cache-Efficient Multigrid Algorithms

Authors:

Sriram Sellappa,

Siddhartha ChatterjeeAuthors Info & Claims

International Journal of High Performance Computing Applications, Volume 18, Issue 1

Pages 115 - 133

https://doi.org/10.1177/1094342004041295

Published: 01 February 2004 Publication History

Abstract

Multigrid is widely used as an efficient solver for sparse linear systems arising from the discretization of elliptic boundary value problems. Linear relaxation methods such as Gauss-Seidel and Red-Black Gauss-Seidel form the principal computational component of multigrid, and thus affect its efficiency. In the context of multigrid, these iterative solvers are executed for a small number of iterations (2-8). We exploit this property of the algorithm to develop a cache-efficient multigrid method, by focusing on improving the memory behavior of the linear relaxation methods. The efficiency in our cache-efficient linear relaxation algorithm comes from two sources: reducing the number of data cache and TLB misses, and reducing the number of memory references by keeping values register-resident. Our optimizations are applicable to multigrid applied to linear systems arising from constant coefficient elliptic PDEs on structured grids. Experiments on five modern computing platforms show a performance improvement of 1.15-2.7 times over a standard implementation of Full Multigrid V-Cycle.

References

[1]

Bassetti, F., Davis, K., and Quinlan, D. December 1998. Optimizing transformations of stencil operations for parallel object-oriented scientific frameworks on cache-based architectures. In Proceedings of ISCOPE'98, Santa Fe, NM.]]

[2]

Bassetti, F., Davis, K., and Marathe, M. December 1999. Improving cache utilization of linear relaxation methods: theory and practice. In Proceedings of ISCOPE'99, San Francisco, CA.]]

[3]

Briggs, W. L.1987. A Multigrid Tutorial, SIAM, Philadelphia, PA.]]

[4]

Bromley, M., Heller, S., McNerney, T., and Steele, Jr, G. L. June 1991. Fortran at ten gigaflops: the Connection Machine convolution compiler. In Proceedings of the ACM SIGP-LAN'91 Conference on Programming Language Design and Implementation, Toronto, Canada, pp. 145-156.]]

[5]

Chatterjee, S., Jain, V. V., Lebeck, A. R., Mundhra, S., and Thottethodi, M. June 1999. Nonlinear array layouts for hierarchical memory systems. In Proceedings of 1999 ACM International Conference on Supercomputing, Rhodes, Greece, pp. 444-453.]]

[6]

Douglas, C., Hu, J., Kowarschik, M., Rüde, U., and Weiss, C.2000. Cache optimization for structured and unstructured grid multigrid. Electronic Transactions on Numerical Analysis10:21-40. University of Kentucky, Louisville, KY, ISSN 1068-9613.]]

[7]

Frumkin, M. A. and Van Der Wijngaart, R. F. March 2001. Interference lattice-based loop nest tilings for stencil computations. In Proceedings of the 10th SIAM Conference on Parallel Processing for Scientific Computing (CD-ROM), Portsmouth, VA, SIAM, Philadelphia, PA.]]

[8]

Hill, M. D. and Smith, A. J.1989. Evaluating associativity in a CPU caches. IEEE Transactions on Computing38(12): 1612-1620.]]

[9]

Lam, M. S., Rothberg, E. E., and Wolf, M. E. April 1991. The cache performance and optimizations of blocked algorithms. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 63-74.]]

[10]

Lebeck, A. R. and Wood, D. A.1994. Cache profiling and the SPEC benchmarks: a case study. IEEE Computer27(10): 15-26.]]

[11]

Leiserson, C. E., Rao, S., and Toledo, S.1997. Efficient out-of-core algorithms for linear relaxation using blocking covers. Journal of Computer and System Science54(2):332-344.]]

[12]

Mitchell, N., Carter, L., Ferrante, J., and Högstedt, K. 1998. Quantifying the multi-level nature of tiling interactions. In Languages and Compilers for Parallel Computing: 10th Annual Workshop, LCPC'97, Lecture Notes in Computer Science Vol. 1366, Springer-Verlag, Berlin, pp. 1-15.]]

[13]

Povitsky, A. October 1999. Wavefront cache-friendly algorithm for compact numerical schemes, Technical Report 99-40, ICASE, Hampton, VA.]]

[14]

Stals, L. and Rüde, U. October 1997. Techniques for improving the data locality of iterative methods, Technical Report MRR 038-97, Institut für Mathematik, Universität Augsburg, Augsburg, Germany.]]

[15]

Wolf, M. E. and Lam, M. S. June 1991. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, Toronto, Canada, pp. 30-44,.]]

[16]

Wolfe, M. November 1989. More iteration space tiling. In Proceedings of Supercomputing'89, Reno, NV, pp. 655-664.]]

[17]

Zagha, M., Larson, B., Turner, S., and Itzkowitz, M. November 1996. Performance analysis using the MIPS R10000 performance counters. In Proceedings of SC96 (CD-ROM), Pittsburgh, PA. Available from http://www.supercomp. org/sc96.]]

Cited By

Antepara OWilliams SJohansen HZhao THirsch SGoyal PHall M(2023)Performance Portability Evaluation of Blocked Stencil Computations on GPUsProceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624177(1007-1018)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3624062.3624177
Kempf DHeß RMüthing SBastian P(2020)Automatic Code Generation for High-performance Discontinuous Galerkin Methods on Modern ArchitecturesACM Transactions on Mathematical Software10.1145/342414447:1(1-31)Online publication date: 8-Dec-2020
https://dl.acm.org/doi/10.1145/3424144
Zhao TBasu PWilliams SHall MJohansen HTaufer MBalaji PPeña A(2019)Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3295500.3356210(1-44)Online publication date: 17-Nov-2019
https://dl.acm.org/doi/10.1145/3295500.3356210
Show More Cited By

Recommendations

Improving Performance of Large Physically Indexed Caches by Decoupling Memory Addresses from Cache Addresses

Modern CPUs often use large physically indexed caches that are direct-mapped or have low associativities. Such caches do not interact well with virtual memory systems. An improperly placed physical page will end up in a wrong place in the cache, causing ...
Location cache: a low-power L2 cache system
ISLPED '04: Proceedings of the 2004 international symposium on Low power electronics and design

While set-associative caches incur fewer misses than direct-mapped caches, they typically have slower hit times and higher power consumption, when multiple tag and data banks are probed in parallel. This paper presents the location cache structure which ...
Cache-Efficient Multigrid Algorithms
ICCS '01: Proceedings of the International Conference on Computational Sciences-Part I

Multigrid is widely used as an efficient solver for sparse linear systems arising from the discretization of elliptic boundary value problems. Linear relaxation methods like Gauss-Seidel and Red-Black Gauss-Seidel form the principal computational ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image International Journal of High Performance Computing Applications

International Journal of High Performance Computing Applications Volume 18, Issue 1

February 2004

160 pages

ISSN:1094-3420

Issue’s Table of Contents

Publisher

Sage Publications, Inc.

United States

Publication History

Published: 01 February 2004

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

18
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 01 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Antepara OWilliams SJohansen HZhao THirsch SGoyal PHall M(2023)Performance Portability Evaluation of Blocked Stencil Computations on GPUsProceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis10.1145/3624062.3624177(1007-1018)Online publication date: 12-Nov-2023
Kempf DHeß RMüthing SBastian P(2020)Automatic Code Generation for High-performance Discontinuous Galerkin Methods on Modern ArchitecturesACM Transactions on Mathematical Software10.1145/342414447:1(1-31)Online publication date: 8-Dec-2020
Zhao TBasu PWilliams SHall MJohansen HTaufer MBalaji PPeña A(2019)Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3295500.3356210(1-44)Online publication date: 17-Nov-2019
Ciznicki MKurowski KWeglarz J(2017)Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architecturesCluster Computing10.1007/s10586-016-0686-220:3(2535-2549)Online publication date: 1-Sep-2017
Basu PWilliams SVan Straalen BOliker LHall MAmarasinghe SKamil SSadayappan P(2014)Converting Stencils to Accumulations Forcommunication-Avoiding Optimizationin Geometric MultigridProceedings of the Second Workshop on Optimizing Stencil Computations10.1145/2686745.2686749(9-16)Online publication date: 20-Oct-2014
Williams SKalamkar DSingh ADeshpande AVan Straalen BSmelyanskiy MAlmgren ADubey PShalf JOliker LHollingsworth J(2012)Optimization of geometric multigrid for emerging multi- and manycore processorsProceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis10.5555/2388996.2389126(1-11)Online publication date: 10-Nov-2012
Krueger JDonofrio DShalf JMohiyuddin MWilliams SOliker LPfreund FLathrop SCosta JKramer W(2011)Hardware/software co-design for energy-efficient seismic modelingProceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/2063384.2063482(1-12)Online publication date: 12-Nov-2011
Sung IStratton JHwu WSalapura VGschwind MKnoop J(2010)Data layout transformation exploiting memory-level parallelism in structured grid many-core applicationsProceedings of the 19th international conference on Parallel architectures and compilation techniques10.1145/1854273.1854336(513-522)Online publication date: 11-Sep-2010
(2010)Algorithm engineeringundefinedOnline publication date: 1-Jan-2010
Wallin DLof HHagersten EHolmgren S(2009)Reconsidering algorithms for iterative solvers in the multicore eraInternational Journal of Computational Science and Engineering10.1504/IJCSE.2009.0291634:4(270-282)Online publication date: 1-Nov-2009
Show More Cited By

View Options

View options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents