The BLAS API of BLASFEO: Optimizing Performance for Small Matrices

Published: 19 May 2020

Abstract

Basic Linear Algebra Subroutines For Embedded Optimization (BLASFEO) is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and other applications targeting relatively small matrices. BLASFEO defines an application programming interface (API) which uses a packed matrix format as its native format. This format is analogous to the internal memory buffers of optimized BLAS, but it is exposed to the user and it removes the packing cost from the routine call. For matrices fitting in cache, BLASFEO outperforms optimized BLAS implementations, both open source and proprietary. This article investigates the addition of a standard BLAS API to the BLASFEO framework, and proposes an implementation switching between two or more algorithms optimized for different matrix sizes. Thanks to the modular assembly framework in BLASFEO, tailored linear algebra kernels with mixed column- and panel-major arguments are easily developed. This BLAS API has lower performance than the BLASFEO API, but it nonetheless outperforms optimized BLAS and especially LAPACK libraries for matrices fitting in cache. Therefore, it can boost a wide range of applications, where standard BLAS and LAPACK libraries are employed and the matrix size is moderate. In particular, this article investigates the benefits in scientific programming languages such as Octave, SciPy, and Julia.
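The packed format mentioned in the abstract can be illustrated with a small index-mapping sketch. The code below is a minimal illustration, not BLASFEO's actual implementation: it assumes a panel height of 4 doubles and a caller-supplied padded column count `cn`, whereas the real BLASFEO format is defined by the library (via the `blasfeo_dmat` structure, with its own alignment and padding rules). The point it demonstrates is why packing helps: the 4 entries a micro-kernel loads from a panel are contiguous in memory, and exposing the packed format to the user removes the packing cost from each routine call.

```c
#include <assert.h>

/* Illustrative panel-major layout: the matrix is cut into horizontal
 * panels of PS rows; within a panel, elements are stored column-major
 * with leading dimension PS, so each micro-kernel load of PS entries
 * is contiguous. PS = 4 doubles is an assumption (one 256-bit vector). */
#define PS 4

/* offset of element (i, j) in a column-major matrix with leading dim lda */
static int cm_index(int i, int j, int lda) { return i + j * lda; }

/* offset of element (i, j) in a panel-major matrix; cn = padded column
 * count, so each panel occupies PS * cn contiguous doubles */
static int pm_index(int i, int j, int cn) {
    return (i / PS) * PS * cn + j * PS + i % PS;
}

/* pack a column-major m-by-n matrix A (leading dim lda) into a
 * panel-major buffer B (padded column count cn >= n) */
static void pack_panel_major(int m, int n, const double *A, int lda,
                             double *B, int cn) {
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            B[pm_index(i, j, cn)] = A[cm_index(i, j, lda)];
}
```

With this mapping, a routine operating on the packed buffer walks down a column of one panel through consecutive addresses, which is the access pattern an optimized `gemm` micro-kernel wants; in a standard BLAS call this packing happens internally on every invocation, while an API that accepts the packed format natively pays it once.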



    Published In

ACM Transactions on Mathematical Software, Volume 46, Issue 2
June 2020, 274 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/3401021

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 19 May 2020
    Online AM: 07 May 2020
    Accepted: 01 January 2020
    Revised: 01 September 2019
    Received: 01 February 2019
    Published in TOMS Volume 46, Issue 2

    Author Tags

    1. BLAS
    2. Linear algebra
    3. high-performance
    4. libraries
    5. matrix

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • DyConPV
    • German Federal Ministry for Economic Affairs and Energy (BMWi) via eco4wind
    • DFG via Research Unit FOR 2401


Cited By

    • (2024) Optimizing Full-Spectrum Matrix Multiplications on ARMv8 Multi-Core CPUs. IEEE Transactions on Parallel and Distributed Systems 35, 3 (Mar. 2024), 439-454. DOI: 10.1109/TPDS.2024.3350368
    • (2023) Fast integrators with sensitivity propagation for use in CasADi. In 2023 European Control Conference (ECC), 1-6. DOI: 10.23919/ECC57647.2023.10178402
    • (2023) Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors. ACM Transactions on Mathematical Software 49, 3 (Sep. 2023), 1-29. DOI: 10.1145/3595178
    • (2023) Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors. Cluster Computing 26, 5 (Oct. 2023), 2539-2549. DOI: 10.1007/s10586-021-03274-8
    • (2023) Parallel Cholesky Factorization for Banded Matrices Using OpenMP Tasks. In Euro-Par 2023: Parallel Processing, 725-739. DOI: 10.1007/978-3-031-39698-4_49
    • (2022) IATF: An Input-Aware Tuning Framework for Compact BLAS Based on ARMv8 CPUs. In Proceedings of the 51st International Conference on Parallel Processing, 1-11. DOI: 10.1145/3545008.3545032
    • (2021) LIBSHALOM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1145/3458817.3476217
    • (2021) Embedded Real-Time Nonlinear Model Predictive Control for the Thermal Torque Derating of an Electric Vehicle. IFAC-PapersOnLine 54, 6 (2021), 359-364. DOI: 10.1016/j.ifacol.2021.08.570
    • (2020) Benchmarking Julia’s Communication Performance: Is Julia HPC Ready or Full HPC? In 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), 20-25. DOI: 10.1109/PMBS51919.2020.00008
