Abdelfattah A, Costa T, Dongarra J, Gates M, Haidar A, Hammarling S, Higham N, Kurzak J, Luszczek P, Tomov S and Zounon M. (2021). A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines. ACM Transactions on Mathematical Software. 47:3. (1-23). Online publication date: 30-Sep-2021.

Charara A, Keyes D and Ltaief H. (2019). Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs. ACM Transactions on Mathematical Software. 45:2. (1-28). Online publication date: 30-Jun-2019.

Dongarra J, Gates M, Kurzak J, Luszczek P and Tsai Y. Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators. Proceedings of the IEEE. 10.1109/JPROC.2018.2868961. 106:11. (2040-2055).

https://ieeexplore.ieee.org/document/8476161/

Abdelfattah A, Haidar A, Tomov S and Dongarra J. (2018). Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization. 2018 IEEE High Performance Extreme Computing Conference (HPEC). 10.1109/HPEC.2018.8547576. 978-1-5386-5989-2. (1-7).

https://ieeexplore.ieee.org/document/8547576/

Haidar A, Abdelfattah A, Zounon M, Tomov S and Dongarra J. A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2017.2783929. 29:5. (973-984).

https://ieeexplore.ieee.org/document/8214236/

Abdelfattah A, Haidar A, Tomov S and Dongarra J. Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs. Proceedings of the International Conference on Supercomputing. (1-10).

https://doi.org/10.1145/3079079.3079103

Gates M, Kurzak J, Luszczek P, Pei Y and Dongarra J. (2017). Autotuning batch Cholesky factorization in CUDA with interleaved layout of matrices. 2017 IEEE International Parallel and Distributed Processing Symposium: Workshops (IPDPSW). 10.1109/IPDPSW.2017.18. 978-1-5386-3408-0. (1408-1417).

http://ieeexplore.ieee.org/document/7965201/

Kurzak J, Anzt H, Gates M and Dongarra J. (2016). Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs. IEEE Transactions on Parallel and Distributed Systems. 27:7. (2036-2048). Online publication date: 1-Jul-2016.

https://doi.org/10.1109/TPDS.2015.2481890

Abdelfattah A, Haidar A, Tomov S and Dongarra J. (2016). Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs. Procedia Computer Science. 80:C. (119-130). Online publication date: 1-Jun-2016.

https://doi.org/10.1016/j.procs.2016.05.303

Kabir K, Haidar A, Tomov S and Dongarra J. (2015). On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors. High Performance Computing. 10.1007/978-3-319-20119-1_5. (58-73).

https://link.springer.com/10.1007/978-3-319-20119-1_5

Haidar A, Dong T, Tomov S, Luszczek P and Dongarra J. (2015). A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations. High Performance Computing. 10.1007/978-3-319-20119-1_3. (31-47).

https://link.springer.com/10.1007/978-3-319-20119-1_3

Deshmukh S, Yokota R and Bosilca G. (2023). Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors. ACM Transactions on Mathematical Software. 49:3. (1-29). Online publication date: 30-Sep-2023.

https://doi.org/10.1145/3595178

Abdelfattah A, Haidar A, Tomov S and Dongarra J. (2016). On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures. 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 10.1109/IPDPSW.2016.190. 978-1-5090-3682-0. (1249-1258).

http://ieeexplore.ieee.org/document/7530009/