Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs

  • Conference paper
  • Published in: High Performance Computing (ISC High Performance 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10876)

Included in the following conference series: ISC High Performance.

Abstract

Chebyshev filter diagonalization is well established in quantum chemistry and quantum physics to compute bulks of eigenvalues of large sparse matrices. Choosing a block vector implementation, we investigate optimization opportunities on the new class of high-performance compute devices featuring both high-bandwidth and low-bandwidth memory. We focus on the transparent access to the full address space supported by both architectures under consideration: Intel Xeon Phi “Knights Landing” and Nvidia “Pascal”/“Volta.” After a thorough performance analysis of the single-device implementations using the roofline model we propose two optimizations: (1) Subspace blocking is applied for improved performance and data access efficiency. We also show that it allows transparently handling problems much larger than the high-bandwidth memory without significant performance penalties. (2) Pipelining of communication and computation phases of successive subspaces is implemented to hide communication costs without extra memory traffic. As an application scenario we perform filter diagonalization studies for topological quantum matter. Performance numbers on up to 2048 nodes of the Oakforest-PACS and Piz Daint supercomputers are presented, achieving beyond 500 Tflop/s for computing \(10^2\) inner eigenvalues of sparse matrices of dimension \(4\cdot 10^9\).
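To make the numerical core concrete, the sketch below (plain Python/NumPy; the paper's production code uses GHOST-based C/CUDA kernels, and all names here are illustrative) applies a Chebyshev polynomial filter to a block of vectors via the three-term recurrence. Each recurrence step costs one sparse matrix-multiple-vector multiplication plus block-vector updates, which is why memory bandwidth and blocking dominate the performance analysis. The indicator-window coefficients with Jackson damping follow a common kernel-polynomial convention and stand in for the paper's actual filter construction.

    import numpy as np

    def chebyshev_filter(apply_H, X, degree, bounds, window):
        """Apply p(H) = sum_k g_k c_k T_k(Hs) to a block of vectors X.

        Illustrative sketch, not the authors' implementation.

        apply_H : callable returning H @ V for a block V (the spMMV)
        bounds  : (lmin, lmax), estimates of the extreme eigenvalues of H
        window  : (a, b), target interval containing the wanted eigenvalues
        """
        lmin, lmax = bounds
        c = 0.5 * (lmax + lmin)            # spectral center
        e = 0.5 * (lmax - lmin)            # half-width: Hs = (H - c*I)/e
        # Map the target window into [-1, 1] (must lie inside the spectrum)
        a, b = (window[0] - c) / e, (window[1] - c) / e
        ta, tb = np.arccos(a), np.arccos(b)

        # Chebyshev expansion coefficients of the indicator function of [a, b]
        k = np.arange(1, degree + 1)
        coef = np.empty(degree + 1)
        coef[0] = (ta - tb) / np.pi
        coef[1:] = 2.0 * (np.sin(k * ta) - np.sin(k * tb)) / (k * np.pi)

        # Jackson damping to suppress Gibbs oscillations (one common convention)
        N = degree + 1
        coef[1:] *= ((N - k + 1) * np.cos(np.pi * k / (N + 1))
                     + np.sin(np.pi * k / (N + 1)) / np.tan(np.pi / (N + 1))) / (N + 1)

        def apply_Hs(V):                   # shifted and scaled operator
            return (apply_H(V) - c * V) / e

        T_prev, T_curr = X, apply_Hs(X)    # T_0(Hs) X and T_1(Hs) X
        Y = coef[0] * T_prev + coef[1] * T_curr
        for j in range(2, degree + 1):
            # three-term recurrence: T_j = 2*Hs*T_{j-1} - T_{j-2}
            T_prev, T_curr = T_curr, 2.0 * apply_Hs(T_curr) - T_prev
            Y += coef[j] * T_curr
        return Y

In the method, this filter is applied repeatedly to a search space whose vectors are processed in blocks; the subspace blocking and communication/computation pipelining described above operate on exactly this loop, splitting the block X into chunks small enough to reside in high-bandwidth memory.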


Notes

  1. https://www.lrz.de/services/compute/supermuc/systemdescription/.

  2. http://www.cscs.ch/computers/piz_daint.

  3. http://www.cc.u-tokyo.ac.jp/system/ofp/index-e.html.

  4. No relevant performance differences were observed between CUDA 8 and CUDA 9 on the V100.

  5. Due to the absence of suitable tools, measurements were not conducted on the Oakforest-PACS system but on an Intel Xeon Phi 7210 with 64 cores and the same amount of L2 cache.

  6. Identification of the respective bottleneck is ongoing, but probably moot since Intel will not continue this architecture line.


Acknowledgments

This work was funded by DFG SPP1648 through the ESSEX-II project and by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID d35. We gratefully acknowledge access to the Oakforest-PACS supercomputer at JCAHPC (University of Tokyo) and inspiring discussions with Andreas Alvermann, Bruno Lang, and Jonas Thies. H.F. and G.W. are thankful for the hospitality of Los Alamos National Laboratory.

Author information

Corresponding author

Correspondence to Georg Hager.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Kreutzer, M., et al. (2018). Chebyshev Filter Diagonalization on Modern Manycore Processors and GPGPUs. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds.) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science, vol. 10876. Springer, Cham. https://doi.org/10.1007/978-3-319-92040-5_17


  • DOI: https://doi.org/10.1007/978-3-319-92040-5_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-92039-9

  • Online ISBN: 978-3-319-92040-5

  • eBook Packages: Computer Science (R0)
