Abstract
Chebyshev filter diagonalization is well established in quantum chemistry and quantum physics for computing bulks of eigenvalues of large sparse matrices. Choosing a block vector implementation, we investigate optimization opportunities on the new class of high-performance compute devices featuring both high-bandwidth and low-bandwidth memory. We focus on the transparent access to the full address space supported by both architectures under consideration: Intel Xeon Phi “Knights Landing” and Nvidia “Pascal”/“Volta.” After a thorough performance analysis of the single-device implementations using the roofline model, we propose two optimizations: (1) subspace blocking, which improves performance and data access efficiency and, as we show, allows problems much larger than the high-bandwidth memory to be handled transparently without significant performance penalties; and (2) pipelining of the communication and computation phases of successive subspaces, which hides communication costs without causing extra memory traffic. As an application scenario we perform filter diagonalization studies for topological quantum matter. Performance numbers on up to 2048 nodes of the Oakforest-PACS and Piz Daint supercomputers are presented, reaching beyond 500 Tflop/s for the computation of \(10^2\) inner eigenvalues of sparse matrices of dimension \(4\cdot 10^9\).
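The core kernel behind the abstract can be stated concretely: a Chebyshev window filter is applied to a whole block of vectors via the three-term recurrence, so every pass over the sparse matrix is amortized across many right-hand sides. The following minimal NumPy/SciPy sketch illustrates that scheme only; it is not the paper's GHOST/CUDA implementation. The function name `chebyshev_filter`, the target window \([a, b]\), and the coefficient formulas (standard expressions from the kernel polynomial method literature, with Jackson damping) are illustrative assumptions, as is the premise that the spectrum of `H` has already been mapped into \([-1, 1]\).

```python
# Minimal sketch of a Chebyshev window filter applied to a block of
# vectors, as in Chebyshev filter diagonalization. Assumptions (not
# from the paper): SciPy sparse matrix H with spectrum in [-1, 1],
# rectangular target window [a, b], Jackson damping.
import numpy as np
import scipy.sparse as sp

def chebyshev_filter(H, X, a, b, degree):
    """Apply an approximate spectral projector onto [a, b] to the
    block vector X using the three-term Chebyshev recurrence.
    Requires degree >= 1."""
    n = degree + 1
    k = np.arange(n)
    # Chebyshev expansion coefficients of the indicator of [a, b].
    c = np.empty(n)
    c[0] = (np.arccos(a) - np.arccos(b)) / np.pi
    c[1:] = 2.0 * (np.sin(k[1:] * np.arccos(a))
                   - np.sin(k[1:] * np.arccos(b))) / (k[1:] * np.pi)
    # Jackson damping factors suppress Gibbs oscillations.
    g = ((n - k + 1) * np.cos(np.pi * k / (n + 1))
         + np.sin(np.pi * k / (n + 1)) / np.tan(np.pi / (n + 1))) / (n + 1)
    t_prev, t_cur = X, H @ X                  # T_0(H) X and T_1(H) X
    Y = g[0] * c[0] * t_prev + g[1] * c[1] * t_cur
    for m in range(2, n):
        # Three-term recurrence: T_{m}(H) X = 2 H T_{m-1} X - T_{m-2} X.
        t_prev, t_cur = t_cur, 2.0 * (H @ t_cur) - t_prev
        Y += g[m] * c[m] * t_cur              # accumulate filtered block
    return Y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 2000
    A = sp.random(n, n, density=1e-3, random_state=rng, format="csr")
    H = 0.5 * (A + A.T)                       # symmetric, spectrum in [-1, 1]
    X = rng.standard_normal((n, 8))           # block of 8 search vectors
    Y = chebyshev_filter(H, X, a=-0.1, b=0.1, degree=500)
```

In this form, the block width (here eight vectors) plays the role of the block vector parameter the abstract refers to; the paper's subspace blocking and communication/computation pipelining then operate on successive such blocks.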
Notes
- 4. No relevant performance differences were observed between CUDA 8 and CUDA 9 on the V100.
- 5. Due to the absence of suitable tools, measurements were not conducted on the Oakforest-PACS system but on an Intel Xeon Phi 7210 with 64 cores and the same amount of L2 cache.
- 6. The identification of the respective bottleneck is ongoing, but probably pointless, as this architecture line will not be continued by Intel.
Acknowledgments
This work was funded by DFG SPP1648 through the ESSEX-II project and by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID d35. We gratefully acknowledge access to the Oakforest-PACS supercomputer at JCAHPC (University of Tokyo) and inspiring discussions with Andreas Alvermann, Bruno Lang, and Jonas Thies. H.F. and G.W. are grateful for the hospitality of Los Alamos National Laboratory.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this paper
Kreutzer, M. et al. (2018). Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science, vol 10876. Springer, Cham. https://doi.org/10.1007/978-3-319-92040-5_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92039-9
Online ISBN: 978-3-319-92040-5