DOI: 10.1145/3579371.3589046
research-article
Open access

Orinoco: Ordered Issue and Unordered Commit with Non-Collapsible Queues

Published: 17 June 2023

Abstract

Modern out-of-order processors call for more aggressive scheduling techniques such as priority scheduling and out-of-order commit to make use of increasing core resources. Since these approaches prioritize the issue or commit of certain instructions, they face the conundrum of providing the capacity efficiency of scheduling structures while preserving the ideal ordering of instructions. Traditional collapsible queues are too expensive for today's processors, while state-of-the-art queue designs compromise with the pseudo-ordering of instructions, leading to performance degradation as well as other limitations.
In this paper, we present Orinoco, a microarchitecture/circuit co-design that supports ordered issue and unordered commit with non-collapsible queues. We decouple the temporal ordering of instructions from their queue positions by introducing an age matrix with a bit count encoding, along with a commit dependency matrix and a memory disambiguation matrix that determine which instructions to prioritize for issue or commit. We leverage the Processing-in-Memory (PIM) approach to implement the matrix schedulers efficiently as 8T SRAM arrays. Orinoco achieves an average IPC improvement of 14.8% over a baseline in-order commit core with the state-of-the-art scheduler, while incurring overhead equivalent to a few kilobytes of SRAM.
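The idea of decoupling instruction age from queue position can be illustrated with a small software model. The sketch below is not taken from the paper: it is a textbook-style age matrix in which each entry keeps a bitmask of the entries that are older than it, and it assumes that "bit count encoding" can be read as ranking ready entries by the number of older ready entries (a population count). The class name `AgeMatrix` and its methods are hypothetical, and the model ignores the circuit-level SRAM implementation entirely.

```python
class AgeMatrix:
    """Toy age matrix for a non-collapsible queue (illustrative assumption,
    not the Orinoco circuit). Entries never move; age lives in the matrix."""

    def __init__(self, size):
        self.size = size
        self.valid = [False] * size
        # older[i] is a bitmask: bit j set means entry j is older than entry i.
        self.older = [0] * size

    def allocate(self, slot):
        """Place a new instruction in any free slot; all valid entries are older."""
        mask = 0
        for j in range(self.size):
            if self.valid[j]:
                mask |= 1 << j
        self.older[slot] = mask
        self.valid[slot] = True

    def release(self, slot):
        """Free a slot and clear its column so nobody treats it as older."""
        self.valid[slot] = False
        self.older[slot] = 0
        clear = ~(1 << slot)
        for i in range(self.size):
            self.older[i] &= clear

    def select(self, ready, width):
        """Return up to `width` ready slots, oldest first.

        The rank of a ready entry is the count of older entries that are
        also ready; entries with rank < width are selected.
        """
        ready_mask = 0
        for j in ready:
            if self.valid[j]:
                ready_mask |= 1 << j
        picks = []
        for i in range(self.size):
            if (ready_mask >> i) & 1:
                rank = bin(self.older[i] & ready_mask).count("1")
                if rank < width:
                    picks.append((rank, i))
        return [slot for _, slot in sorted(picks)]


# Usage: slots are filled in the order 2, 0, 3, so slot 2 holds the oldest entry.
q = AgeMatrix(4)
for slot in (2, 0, 3):
    q.allocate(slot)
oldest_two = q.select(ready={0, 2, 3}, width=2)   # [2, 0]
q.release(2)      # slot 2 is freed without shifting any other entry
q.allocate(2)     # the reused slot 2 is now the youngest entry
after_reuse = q.select(ready={2, 3}, width=1)     # [3]
```

Because a freed slot is reused without shifting the remaining entries, the queue stays non-collapsible while selection still follows true program age rather than slot position.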


Cited By

  • (2024) Multi: Reduce Energy Overhead of Criticality-Aware Dynamic Instruction Scheduling for Energy Efficiency. 2024 IEEE 42nd International Conference on Computer Design (ICCD), 60-67. DOI: 10.1109/ICCD63220.2024.00020. Online publication date: 18-Nov-2024

    Published In

    ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture
    June 2023
    1225 pages
    ISBN: 9798400700958
    DOI: 10.1145/3579371
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. microarchitecture
    2. out-of-order execution
    3. instruction scheduling
    4. out-of-order commit
    5. processing-in-memory

    Qualifiers

    • Research-article


    Acceptance Rates

    Overall Acceptance Rate 543 of 3,203 submissions, 17%


    Article Metrics

    • Downloads (last 12 months): 1,033
    • Downloads (last 6 weeks): 94
    Reflects downloads up to 16 Feb 2025
