The BLAS API of BLASFEO: Optimizing Performance for Small Matrices

Published: 19 May 2020

Abstract

Basic Linear Algebra Subroutines For Embedded Optimization (BLASFEO) is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and other applications targeting relatively small matrices. BLASFEO defines an application programming interface (API) which uses a packed matrix format as its native format. This format is analogous to the internal memory buffers of optimized BLAS, but it is exposed to the user and it removes the packing cost from the routine call. For matrices fitting in cache, BLASFEO outperforms optimized BLAS implementations, both open source and proprietary. This article investigates the addition of a standard BLAS API to the BLASFEO framework, and proposes an implementation switching between two or more algorithms optimized for different matrix sizes. Thanks to the modular assembly framework in BLASFEO, tailored linear algebra kernels with mixed column- and panel-major arguments are easily developed. This BLAS API has lower performance than the BLASFEO API, but it nonetheless outperforms optimized BLAS and especially LAPACK libraries for matrices fitting in cache. Therefore, it can boost a wide range of applications, where standard BLAS and LAPACK libraries are employed and the matrix size is moderate. In particular, this article investigates the benefits in scientific programming languages such as Octave, SciPy, and Julia.
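The packed format mentioned in the abstract can be illustrated with a small index-mapping sketch. The code below is a minimal illustration, not BLASFEO's actual implementation: it assumes a panel height of 4 doubles and a caller-supplied padded column count `cn`, whereas the real BLASFEO format is defined by the library (via the `blasfeo_dmat` structure, with its own alignment and padding rules). The point it demonstrates is why packing helps: the 4 entries a micro-kernel loads from a panel are contiguous in memory, and exposing the packed format to the user removes the packing cost from each routine call.

```c
#include <assert.h>

/* Illustrative panel-major layout: the matrix is cut into horizontal
 * panels of PS rows; within a panel, elements are stored column-major
 * with leading dimension PS, so each micro-kernel load of PS entries
 * is contiguous. PS = 4 doubles is an assumption (one 256-bit vector). */
#define PS 4

/* offset of element (i, j) in a column-major matrix with leading dim lda */
static int cm_index(int i, int j, int lda) { return i + j * lda; }

/* offset of element (i, j) in a panel-major matrix; cn = padded column
 * count, so each panel occupies PS * cn contiguous doubles */
static int pm_index(int i, int j, int cn) {
    return (i / PS) * PS * cn + j * PS + i % PS;
}

/* pack a column-major m-by-n matrix A (leading dim lda) into a
 * panel-major buffer B (padded column count cn >= n) */
static void pack_panel_major(int m, int n, const double *A, int lda,
                             double *B, int cn) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            B[pm_index(i, j, cn)] = A[cm_index(i, j, lda)];
}
```

With this mapping, a routine operating on the packed buffer walks down a column of one panel through consecutive addresses, which is the access pattern an optimized `gemm` micro-kernel wants; in a standard BLAS call this packing happens internally on every invocation, while an API that accepts the packed format natively pays it once.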



    Published In

ACM Transactions on Mathematical Software, Volume 46, Issue 2
June 2020, 274 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/3401021

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 May 2020
    Online AM: 07 May 2020
    Accepted: 01 January 2020
    Revised: 01 September 2019
    Received: 01 February 2019
    Published in TOMS Volume 46, Issue 2

    Author Tags

    1. BLAS
    2. Linear algebra
    3. high-performance
    4. libraries
    5. matrix

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • DyConPV
    • German Federal Ministry for Economic Affairs and Energy (BMWi) via eco4wind
    • DFG via Research Unit FOR 2401


Cited By

    • (2024) Optimizing Full-Spectrum Matrix Multiplications on ARMv8 Multi-Core CPUs. IEEE Transactions on Parallel and Distributed Systems 35, 3 (Mar. 2024), 439-454. DOI: 10.1109/TPDS.2024.3350368
    • (2023) Fast integrators with sensitivity propagation for use in CasADi. In 2023 European Control Conference (ECC), 1-6. DOI: 10.23919/ECC57647.2023.10178402
    • (2023) Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors. ACM Transactions on Mathematical Software 49, 3 (Sep. 2023), 1-29. DOI: 10.1145/3595178
    • (2023) Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors. Cluster Computing 26, 5 (Oct. 2023), 2539-2549. DOI: 10.1007/s10586-021-03274-8
    • (2023) Parallel Cholesky Factorization for Banded Matrices Using OpenMP Tasks. In Euro-Par 2023: Parallel Processing, 725-739. DOI: 10.1007/978-3-031-39698-4_49
    • (2022) IATF: An Input-Aware Tuning Framework for Compact BLAS Based on ARMv8 CPUs. In Proceedings of the 51st International Conference on Parallel Processing, 1-11. DOI: 10.1145/3545008.3545032
    • (2021) LIBSHALOM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1145/3458817.3476217
    • (2021) Embedded Real-Time Nonlinear Model Predictive Control for the Thermal Torque Derating of an Electric Vehicle. IFAC-PapersOnLine 54, 6 (2021), 359-364. DOI: 10.1016/j.ifacol.2021.08.570
    • (2020) Benchmarking Julia’s Communication Performance: Is Julia HPC Ready or Full HPC? In 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 20-25. DOI: 10.1109/PMBS51919.2020.00008
