  • Abdelfattah A, Costa T, Dongarra J, Gates M, Haidar A, Hammarling S, Higham N, Kurzak J, Luszczek P, Tomov S and Zounon M. (2021). A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines. ACM Transactions on Mathematical Software. 47:3. (1-23). Online publication date: 30-Sep-2021.

    https://doi.org/10.1145/3431921

  • Charara A, Keyes D and Ltaief H. (2019). Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs. ACM Transactions on Mathematical Software. 45:2. (1-28). Online publication date: 30-Jun-2019.

    https://doi.org/10.1145/3267101

  • Dongarra J, Gates M, Kurzak J, Luszczek P and Tsai Y. Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators. Proceedings of the IEEE. 10.1109/JPROC.2018.2868961. 106:11. (2040-2055).

    https://ieeexplore.ieee.org/document/8476161/

  • Abdelfattah A, Haidar A, Tomov S and Dongarra J. (2018). Optimizing GPU Kernels for Irregular Batch Workloads: A Case Study for Cholesky Factorization. 2018 IEEE High Performance Extreme Computing Conference (HPEC). 10.1109/HPEC.2018.8547576. 978-1-5386-5989-2. (1-7).

    https://ieeexplore.ieee.org/document/8547576/

  • Haidar A, Abdelfattah A, Zounon M, Tomov S and Dongarra J. A Guide for Achieving High Performance with Very Small Matrices on GPU: A Case Study of Batched LU and Cholesky Factorizations. IEEE Transactions on Parallel and Distributed Systems. 10.1109/TPDS.2017.2783929. 29:5. (973-984).

    https://ieeexplore.ieee.org/document/8214236/

  • Abdelfattah A, Haidar A, Tomov S and Dongarra J. Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs. Proceedings of the International Conference on Supercomputing. (1-10).

    https://doi.org/10.1145/3079079.3079103

  • Gates M, Kurzak J, Luszczek P, Pei Y and Dongarra J. (2017). Autotuning batch Cholesky factorization in CUDA with interleaved layout of matrices. 2017 IEEE International Parallel and Distributed Processing Symposium: Workshops (IPDPSW). 10.1109/IPDPSW.2017.18. 978-1-5386-3408-0. (1408-1417).

    http://ieeexplore.ieee.org/document/7965201/

  • Kurzak J, Anzt H, Gates M and Dongarra J. (2016). Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs. IEEE Transactions on Parallel and Distributed Systems. 27:7. (2036-2048). Online publication date: 1-Jul-2016.

    https://doi.org/10.1109/TPDS.2015.2481890

  • Abdelfattah A, Haidar A, Tomov S and Dongarra J. (2016). Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs. Procedia Computer Science. 80:C. (119-130). Online publication date: 1-Jun-2016.

    https://doi.org/10.1016/j.procs.2016.05.303

  • Kabir K, Haidar A, Tomov S and Dongarra J. (2015). On the Design, Development, and Analysis of Optimized Matrix-Vector Multiplication Routines for Coprocessors. High Performance Computing. 10.1007/978-3-319-20119-1_5. (58-73).

    https://link.springer.com/10.1007/978-3-319-20119-1_5

  • Haidar A, Dong T, Tomov S, Luszczek P and Dongarra J. (2015). A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations. High Performance Computing. 10.1007/978-3-319-20119-1_3. (31-47).

    https://link.springer.com/10.1007/978-3-319-20119-1_3

  • Deshmukh S, Yokota R and Bosilca G. (2023). Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors. ACM Transactions on Mathematical Software. 49:3. (1-29). Online publication date: 30-Sep-2023.

    https://doi.org/10.1145/3595178

  • Abdelfattah A, Haidar A, Tomov S and Dongarra J. (2016). On the Development of Variable Size Batched Computation for Heterogeneous Parallel Architectures. 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). 10.1109/IPDPSW.2016.190. 978-1-5090-3682-0. (1249-1258).

    http://ieeexplore.ieee.org/document/7530009/