Performance Evaluation of Matrix-Matrix Multiplications Using Intel's Advanced Vector Extensions (AVX)

Published: 01 November 2016

Abstract

Intel's Advanced Vector Extensions (AVX) is a single-instruction, multiple-data (SIMD) instruction set introduced with the second-generation Intel Core processor family and supported by subsequent generations of Intel and AMD processors. AVX exploits SIMD computing units for fine-grained parallelism: each instruction processes multiple data elements simultaneously and independently. Many applications, such as signal processing, recognition, visual processing, scientific and engineering computation, and physics, need the vector floating-point performance that AVX provides. Matrix-matrix multiplication is at the core of many of these algorithms, so accelerating its implementation is essential.
It is very important to use compilers that can optimally exploit the new features of evolving processors, which requires a clear view of each compiler's effect on the performance characteristics of AVX code. Choosing the appropriate programming method is equally important for obtaining the best performance. This paper reports a performance evaluation of dense matrix-matrix multiplication kernels in three forms (C = A·B, C = A·Bᵀ, and C = Aᵀ·B) implemented with Intel's AVX instruction set. The results obtained with inline assembly are compared against those obtained with intrinsic functions, and the effects of two widely used C++ compilers, the Intel C++ compiler (ICC) in Intel Parallel Studio XE 2016 and the Microsoft Visual Studio C++ compiler 2015 (MSVC++), are investigated. The results are evaluated on an Intel Core i7 Broadwell system for square matrices of various large sizes. The Intel compiler outperforms the MSVC++ compiler by factors of 1.34, 1.32, and 1.22 using inline assembly and by 1.36, 1.19, and 1.25 using intrinsic functions for C = A·B, C = A·Bᵀ, and C = Aᵀ·B, respectively. Intrinsic functions outperform inline assembly by factors of 2.1, 2.13, and 2.18 using the Intel compiler and by 2.08, 2.49, and 2.11 using the MSVC++ compiler for the same three forms.


Cited By

  • (2024) Efficient and Accurate PageRank Approximation on Large Graphs. Proceedings of the ACM on Management of Data 2:4, 1-26. DOI: 10.1145/3677132. Online publication date: 30-Sep-2024.
  • (2020) SIMD programming using Intel vector extensions. Journal of Parallel and Distributed Computing 135:C, 83-100. DOI: 10.1016/j.jpdc.2019.09.012. Online publication date: 1-Jan-2020.
  • (2018) Effective Implementation of Matrix-Vector Multiplication on Intel's AVX multicore Processor. Computer Languages, Systems and Structures 51:C, 158-175. DOI: 10.1016/j.cl.2017.06.003. Online publication date: 1-Jan-2018.


    Published In

    Microprocessors & Microsystems  Volume 47, Issue PB
    November 2016
    241 pages

    Publisher

    Elsevier Science Publishers B. V.

    Netherlands


    Author Tags

    1. Advanced vector extension (AVX)
    2. Inline assembly
    3. Intel C++ compiler
    4. Intrinsic functions
    5. Matrix-matrix multiplications
    6. Microsoft VC++ compiler

    Qualifiers

    • Research-article

