Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3650200.3656630acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article
Open access

Tile Size and Loop Order Selection using Machine Learning for Multi-/Many-Core Architectures

Published: 03 June 2024 Publication History

Abstract

Loop tiling and loop interchange (or permutation) are techniques that can expose task and data-level parallelisms and can exploit data locality available in multi-dimensional loop nests. Choosing the appropriate tile size and loop order is important to achieve significant performance improvement. However, the effect of these transformations on the performance of the loop nest is not straightforward due to the complex interplay of several architectural features in multi-/many-core architectures. In this work, we propose using a supervised learning technique and develop a Support Vector Machine (SVM) based hierarchical classifier to identify the best-performing tile size and loop order for a given loop nest. Our approach results in identifying tile sizes and loop orders whose performance, on average, is within 18% and 9% of the optimal performance for two sets of loop nests on Intel Xeon Cascadelake architecture. Further, our method outperforms state-of-the-art techniques, Pluto and Polly, with a geometric mean speedup of 1.35x to 1.58x.

References

[1]
Felix Agakov, Edwin Bonilla, John Cavazos, Björn Franke, Grigori Fursin, Michael FP O’Boyle, John Thomson, Marc Toussaint, and Christopher KI Williams. 2006. Using machine learning to focus iterative optimization. In International Symposium on Code Generation and Optimization (CGO’06). IEEE, 295–305.
[2]
Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. 2014. Opentuner: An extensible framework for program autotuning. In Proceedings of the 23rd international conference on Parallel architectures and compilation. 303–316.
[3]
Mohamed Arafa, Bahaa Fahim, Sailesh Kottapalli, Akhilesh Kumar, Lily P Looi, Sreenivas Mandava, Andy Rudoff, Ian M Steiner, Bob Valentine, Geetha Vedaraman, 2019. Cascade lake: Next generation intel xeon scalable processor. IEEE Micro 39, 2 (2019), 29–36.
[4]
Amir H Ashouri, Andrea Bignoli, Gianluca Palermo, Cristina Silvano, Sameer Kulkarni, and John Cavazos. 2017. Micomp: Mitigating the compiler phase-ordering problem using optimization sub-sequences and machine learning. ACM Transactions on Architecture and Code Optimization (TACO) 14, 3 (2017), 1–28.
[5]
Shilpa Babalad, Shirish K Shevade, Matthew Jacob Thazhuthaveetil, and R Govindarajan. 2023. A Machine Learning Approach to Identify the Best-Performing Loop Order. https://github.com/knightlander2023/OptLoopOrder, Technical Report, Department of Computer Science and Automation, Indian Institute of Science, Bengaluru.
[6]
David F Bacon, Susan L Graham, and Oliver J Sharp. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys (CSUR) 26, 4 (1994), 345–420.
[7]
David Bailey, Tim Harris, William Saphir, Rob Van Der Wijngaart, Alex Woo, and Maurice Yarrow. 1995. The NAS parallel benchmarks 2.0. Technical Report. Technical Report NAS-95-020, NASA Ames Research Center.
[8]
David H Bailey, Eric Barszcz, John T Barton, David S Browning, Robert L Carter, Leonardo Dagum, Rod A Fatoohi, Paul O Frederickson, Thomas A Lasinski, Rob S Schreiber, 1991. The NAS parallel benchmarks summary and preliminary results. In Supercomputing’91: Proceedings of the 1991 ACM/IEEE conference on Supercomputing. IEEE, 158–165.
[9]
Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques. 72–81.
[10]
Uday Bondhugula, Muthu Baskaran, Sriram Krishnamoorthy, Jagannathan Ramanujam, Atanas Rountev, and Ponnuswamy Sadayappan. 2008. Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model. In International Conference on Compiler Construction. Springer, 132–146.
[11]
Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation. 101–113.
[12]
Marius Cornea. 2015. Intel AVX-512 instructions and their use in the implementation of math functions. Intel Corporation (2015), 1–20.
[13]
Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning 20, 3 (1995), 273–297.
[14]
Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. Synthesizing benchmarks for predictive modeling. In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 86–99.
[15]
Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimaraes, and Fernando Magno Quinão Pereira. 2021. AnghaBench: A suite with one million compilable C benchmarks for code-size reduction. In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 378–390.
[16]
Pradipta De, Ravi Kothari, and Vijay Mann. 2007. Identifying sources of operating system jitter through fine-grained kernel instrumentation. In 2007 IEEE International Conference on Cluster Computing. IEEE, 331–340.
[17]
Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, and Olivier Temam. 2006. Semi-automatic composition of loop transformations for deep parallelism and memory hierarchies. International Journal of Parallel Programming 34, 3 (2006), 261–317.
[18]
Tobias Grosser, Hongbin Zheng, Raghesh Aloor, Andreas Simbürger, Armin Größlinger, and Louis-Noël Pouchet. 2011. Polly-Polyhedral optimization in LLVM. In Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), Vol. 2011. 1.
[19]
Ameer Haj-Ali, Nesreen K Ahmed, Ted Willke, Yakun Sophia Shao, Krste Asanovic, and Ion Stoica. 2020. Neurovectorizer: End-to-end vectorization with deep reinforcement learning. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization. 242–255.
[20]
Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-Noël Pouchet, and Ponnuswamy Sadayappan. 2013. When polyhedral transformations meet SIMD code generation. In Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation. 127–138.
[21]
David Meyer, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, Friedrich Leisch, Chih-Chung Chang, Chih-Chen Lin, and Maintainer David Meyer. 2019. Package ‘e1071’. The R Journal (2019).
[22]
Kumudha Narasimhan, Aravind Acharya, Abhinav Baid, and Uday Bondhugula. 2021. A practical tile size selection model for affine loop nests. In Proceedings of the ACM International Conference on Supercomputing. 27–39.
[23]
LN Pouchet. 2012. Polybench: The polyhedral benchmark suite. http://www. cs. ucla. edu/pouchet/software/polybench.
[24]
LN Pouchet and Scott Grauer-Gray. 2011. PolyBench: The Polyhedral Benchmark suite (2011), Version 3.2. http://www-roc. inria. fr/  pouchet/software/polybench.
[25]
Louis-Noël Pouchet, C. Bastoul, and U. Bondhugula. 2019. PoCC: the polyhedral compiler collection. http://web.cs.ucla.edu/ pouchet/software/pocc/.
[26]
Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Cohen, Jagannathan Ramanujam, Ponnuswamy Sadayappan, and Nicolas Vasilache. 2011. Loop transformations: convexity, pruning and optimization. ACM SIGPLAN Notices 46, 1 (2011), 549–562.
[27]
Kishore Kumar Pusukuri, Rajiv Gupta, and Laxmi N Bhuyan. 2012. Thread tranquilizer: Dynamically reducing performance variation. ACM Transactions on Architecture and Code Optimization (TACO) 8, 4 (2012), 1–21.
[28]
Peter J Rousseeuw and Mia Hubert. 2011. Robust statistics for outlier detection. Wiley interdisciplinary reviews: Data mining and knowledge discovery 1, 1 (2011), 73–79.
[29]
Savvas Sioutas, Sander Stuijk, Henk Corporaal, Twan Basten, and Lou Somers. 2018. Loop transformations leveraging hardware prefetching. In Proceedings of the 2018 International Symposium on Code Generation and Optimization. 254–264.
[30]
Avinash Sodani. 2015. Knights Landing (KNL): 2nd generation Intel® Xeon Phi processor. In 2015 IEEE Hot Chips 27 Symposium (HCS). IEEE, 1–24.
[31]
Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sundaram Chinthamani, Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu. 2016. Knights Landing: Second-generation Intel Xeon Phi product. IEEE Micro 36, 2 (2016), 34–46.
[32]
Mark Stephenson and Saman Amarasinghe. 2005. Predicting unroll factors using supervised classification. In International symposium on code generation and optimization. IEEE, 123–134.
[33]
Kevin Stock, Louis-Noël Pouchet, and P Sadayappan. 2012. Using machine learning to improve automatic vectorization. ACM Transactions on Architecture and Code Optimization (TACO) 8, 4 (2012), 1–23.
[34]
Konrad Trifunovic, Dorit Nuzman, Albert Cohen, Ayal Zaks, and Ira Rosen. 2009. Polyhedral-model guided loop-nest auto-vectorization. In 2009 18th International Conference on Parallel Architectures and Compilation Techniques. IEEE, 327–337.
[35]
Sven Verdoolaege. 2010. isl: An integer set library for the polyhedral model. In International Congress on Mathematical Software. Springer, 299–302.
[36]
Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, Jose Ignacio Gomez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization (TACO) 9, 4 (2013), 1–23.
[37]
Rui Xu, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Yuhong Song, and Han Wang. 2023. Loop interchange and tiling for multi-dimensional loops to minimize write operations on NVMs. Journal of Systems Architecture 135 (2023), 102799.

Index Terms

  1. Tile Size and Loop Order Selection using Machine Learning for Multi-/Many-Core Architectures

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing
    May 2024
    582 pages
    ISBN:9798400706103
    DOI:10.1145/3650200
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 June 2024

    Check for updates

    Author Tags

    1. Hierarchical Classifier
    2. Loop transformations
    3. Supervised learning
    4. Support Vector Machine
    5. Vectorization and Parallelization

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICS '24
    Sponsor:

    Acceptance Rates

    ICS '24 Paper Acceptance Rate 45 of 125 submissions, 36%;
    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 223
      Total Downloads
    • Downloads (Last 12 months)223
    • Downloads (Last 6 weeks)51
    Reflects downloads up to 23 Nov 2024

    Other Metrics

    Citations

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    HTML Format

    View this article in HTML Format.

    HTML Format

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media