research-article

A scalable architecture for reprioritizing ordered parallelism

Authors: Gilead Posluns, Mark C. Jeffrey

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture

Pages 437 - 453

https://doi.org/10.1145/3470496.3527387

Published: 11 June 2022

Abstract

Many algorithms schedule their work, or tasks, according to a priority order for correctness or faster convergence. While priority schedulers commonly implement task enqueue and dequeueMin operations, some algorithms need a priority update operation that alters the scheduling metadata for a task. Prior software and hardware systems that support scheduling with priority updates compromise on either parallelism, work-efficiency, or both, leading to missed performance opportunities. Moreover, incorrectly navigating these compromises violates correctness in those algorithms that are not resilient to relaxing priority order.

We present Hive, a task-based execution model and multicore architecture that extracts abundant fine-grain parallelism from algorithms with priority updates, while retaining their strict priority schedules. Like prior hardware systems for ordered parallelism, Hive uses data- and control-dependence speculation and a large speculative window to execute tasks in parallel and out of order. Hive improves on prior work by (i) directly supporting updates in the interface, (ii) identifying the novel scheduler-carried dependence, and (iii) speculating on such dependences with task versioning, distinct from data versioning. Hive enables safe speculative updates to the schedule and avoids spurious conflicts among tasks to better utilize speculation tracking resources and efficiently uncover more parallelism. Across a suite of nine benchmarks, Hive improves performance at 256 cores by up to 2.8× over the next best hardware solution, and even more over software-only parallel schedulers.
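The scheduler interface the abstract describes (enqueue, dequeueMin, and a priority update) can be illustrated with a small sequential sketch. This is only a software analogy under stated assumptions, not Hive's hardware mechanism or its task versioning: the class name `UpdatableScheduler`, its method names, the stamp-based lazy invalidation, and the `shortest_paths` driver are all hypothetical names chosen for this example.

```python
import heapq
import itertools


class UpdatableScheduler:
    """Sequential priority scheduler with enqueue, dequeue_min, and update.

    Assumption for this sketch: an update is implemented by lazy
    invalidation. Every heap entry carries a unique stamp, and only the
    entry whose stamp matches the task's latest stamp is considered live.
    """

    def __init__(self):
        self._heap = []                  # entries: (priority, stamp, task)
        self._live = {}                  # task -> stamp of its live entry
        self._stamp = itertools.count()  # unique, monotonically increasing

    def enqueue(self, task, priority):
        stamp = next(self._stamp)
        self._live[task] = stamp
        heapq.heappush(self._heap, (priority, stamp, task))

    def update(self, task, priority):
        # Re-enqueueing supersedes any pending entry for the same task;
        # the older entry stays in the heap but is skipped on dequeue.
        self.enqueue(task, priority)

    def dequeue_min(self):
        while self._heap:
            priority, stamp, task = heapq.heappop(self._heap)
            if self._live.get(task) == stamp:
                del self._live[task]
                return task, priority
        return None  # no live tasks remain


def shortest_paths(graph, source):
    """Dijkstra-style traversal that relies on priority updates."""
    dist = {source: 0}
    done = set()
    sched = UpdatableScheduler()
    sched.enqueue(source, 0)
    while (item := sched.dequeue_min()) is not None:
        node, d = item
        done.add(node)
        for nbr, weight in graph.get(node, ()):
            if nbr not in done and d + weight < dist.get(nbr, float("inf")):
                dist[nbr] = d + weight
                sched.update(nbr, d + weight)  # reprioritize a pending task
    return dist


graph = {"a": [("b", 4), ("c", 1)], "c": [("b", 2)], "b": []}
print(shortest_paths(graph, "a"))  # {'a': 0, 'b': 3, 'c': 1}
```

In this sketch the task for node `b` is first scheduled at priority 4 and later updated to 3. The stale entry is discarded when it reaches the front of the heap, so each node is processed once, in strict priority order.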

Cited By

M. Orenes-Vera, E. Tureci, M. Martonosi, and D. Wentzlaff. 2024. MuchiSim: A Simulation Framework for Design Exploration of Multi-Chip Manycore Systems. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 48--60. Online publication date: 5 May 2024. https://doi.org/10.1109/ISPASS61541.2024.00015

Index Terms

Computer systems organization → Architectures → Parallel architectures → Multicore architectures

Published In

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture

June 2022, 1097 pages

ISBN: 9781450386104

DOI: 10.1145/3470496

General Chairs: Valentina Salapura (Google), Mohamed Zahran (New York University)

Program Chairs: Fred Chong (The University of Chicago), Lingjia Tang (The University of Michigan)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Natural Sciences and Engineering Research Council of Canada
Edward S. Rogers Sr. Graduate Scholarships
Queen Elizabeth II Graduate Scholarship in Science and Technology Program
Engineering Science Research Opportunities Program

Conference

ISCA '22: The 49th Annual International Symposium on Computer Architecture

Sponsor: SIGARCH

June 18 - 22, 2022

New York, New York

Acceptance Rates

ISCA '22 Paper Acceptance Rate 67 of 400 submissions, 17%;

Overall Acceptance Rate 543 of 3,203 submissions, 17%
