High-performance, general-purpose microprocessors serve as compute engines for computers ranging from personal computers to supercomputers. Sequential programs constitute a major portion of the real-world software that runs on these computers. State-of-the-art microprocessors exploit instruction level parallelism (ILP) to achieve high performance on such applications by searching for independent instructions in a dynamic window of instructions and executing them on a wide-issue pipeline. However, increasing the window size and the issue width to extract more ILP may hinder high clock speeds, limiting overall performance.
The Multiscalar architecture employs multiple small windows and many narrow-issue processing units to exploit ILP at high clock speeds. Sequential programs are partitioned into code fragments called tasks, which are speculatively executed in parallel. Inter-task register dependences are honored via communication and synchronization, while inter-task control flow and memory dependences are handled by speculation and verification in hardware.
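The speculate-then-verify discipline described above can be illustrated with a toy, software-only model. This is a sketch of the idea, not the hardware mechanism: the class and function names (`Task`, `run_multiscalar`) and the memory model are illustrative assumptions, not taken from the thesis.

```python
# Toy model of Multiscalar-style task speculation: tasks run speculatively
# against the initial memory image, then commit in program order; a task
# whose speculative loads were invalidated by an earlier task's stores is
# squashed and re-executed. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    reads: set     # memory addresses the task loads
    writes: dict   # address -> value the task stores

    def run(self, memory):
        # Record the values read so commit-time verification can detect
        # a memory dependence violation.
        observed = {a: memory.get(a, 0) for a in self.reads}
        return observed, dict(self.writes)

def run_multiscalar(tasks, memory):
    """Execute all tasks speculatively 'in parallel', then commit in
    program order, squashing mis-speculated tasks."""
    spec = [t.run(memory) for t in tasks]      # speculative pass
    squashes = 0
    for t, (observed, stores) in zip(tasks, spec):
        # Verify: did an earlier task's commit change a value we read?
        if any(memory.get(a, 0) != v for a, v in observed.items()):
            squashes += 1                      # mis-speculation: squash...
            _, stores = t.run(memory)          # ...and re-execute
        memory.update(stores)                  # commit in program order
    return memory, squashes
```

For example, a second task that loads an address stored to by the first task reads a stale value speculatively, is squashed at verification, and is re-executed before its stores commit.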
Since this thesis is the first attempt at investigating the problem of compiling for the Multiscalar architecture, I identify the fundamental interactions between applications and the Multiscalar architecture from the standpoint of performance and explore a few compiler optimization opportunities instead of proposing the best technique for a specific problem.
Control flow speculation, register communication, memory dependence speculation, load imbalance, and task overheads are the key performance issues. To extract high degrees of ILP, compiler heuristics partition programs into large tasks comprising multiple basic blocks. To maintain high prediction accuracy and avoid delays due to inter-task register communication, the heuristics control the number of successors of each task while including register dependences within tasks. Inter-task register communication is generated and scheduled so that computation and communication overlap.
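A minimal sketch of a task-selection heuristic in this spirit follows. It is an assumption-laden illustration, not the thesis's algorithm: the CFG encoding, the `max_blocks` budget, and the rule of ending a task at a branch point (so each task keeps few successors, which helps inter-task prediction) are all simplifications.

```python
# Hedged sketch of a greedy task-partitioning heuristic: grow tasks from
# basic blocks along straight-line paths, and end a task at any block with
# multiple successors so that tasks keep few exit points. The CFG format
# (block -> successor list) and the size budget are illustrative.
def partition_tasks(cfg, entry, max_blocks=3):
    """Return a list of tasks, each a list of basic blocks."""
    tasks, claimed, work = [], set(), [entry]
    while work:
        b = work.pop(0)
        if b in claimed:
            continue
        task = []
        while b is not None and b not in claimed and len(task) < max_blocks:
            claimed.add(b)
            task.append(b)
            succs = cfg.get(b, [])
            if len(succs) == 1:        # straight-line code: keep growing
                b = succs[0]
            else:                      # branch point: end the task here;
                work.extend(succs)     # successors seed new tasks
                b = None
        if b is not None:              # stopped at a size limit or a
            work.append(b)             # claimed block: resume from there
        tasks.append(task)
    return tasks
```

On a small diamond-shaped CFG (A to B, which branches to C and D, both reaching E), this groups A and B into one task and starts new tasks at the branch targets, so every block lands in exactly one task.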
For the SPEC95 benchmarks, the heuristics increase task sizes significantly while improving control flow speculation accuracy with respect to basic blocks, enabling large window spans from which to extract parallelism. Including register dependences within tasks improves performance considerably. Sophisticated register communication generation and scheduling are effective in boosting performance. Dead register analysis reduces register communication traffic considerably. All the optimizations grow in importance for larger numbers of processing units (PUs).
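The dead-register idea can be sketched with standard backward liveness dataflow: a register written in a task needs to be forwarded only if some successor task may read it before redefining it. The sketch below assumes a toy three-address instruction format (`(dest, srcs)`) and illustrative names; it shows the analysis style, not the thesis's implementation.

```python
# Hedged sketch of dead-register analysis over tasks: compute live-in sets
# by iterating backward dataflow to a fixed point, then forward from each
# task only the registers it defines that are live into a successor.
def forwarded_registers(tasks_code, succ):
    """tasks_code: task id -> list of (dest, srcs) instructions.
    succ: task id -> list of successor task ids.
    Returns task id -> set of registers that must be forwarded."""
    def use_def(code):
        use, defs = set(), set()
        for dest, srcs in code:
            use |= {s for s in srcs if s not in defs}  # upward-exposed uses
            defs.add(dest)
        return use, defs

    ud = {t: use_def(c) for t, c in tasks_code.items()}
    live_in = {t: set() for t in tasks_code}

    def out_set(t):                    # union of successors' live-in sets
        out = set()
        for s in succ.get(t, []):
            out |= live_in[s]
        return out

    changed = True
    while changed:                     # iterate to a fixed point
        changed = False
        for t in tasks_code:
            use, defs = ud[t]
            new_in = use | (out_set(t) - defs)
            if new_in != live_in[t]:
                live_in[t], changed = new_in, True

    # A register is forwarded iff it is defined here and live afterward;
    # everything else defined in the task is dead communication traffic.
    return {t: ud[t][1] & out_set(t) for t in tasks_code}
```

In a two-task example where the first task defines r1 and r2 but only r1 is read by the second task, the analysis forwards r1 alone, eliminating the dead r2 message.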
Cited By
- Estebanez A, Llanos D and Gonzalez-Escribano A (2016). A Survey on Thread-Level Speculation Techniques, ACM Computing Surveys, 49:2, (1-39), Online publication date: 11-Nov-2016.
- Colohan C, Ailamaki A, Steffan J and Mowry T (2008). Incrementally parallelizing database transactions with thread-level speculation, ACM Transactions on Computer Systems (TOCS), 26:1, (1-50), Online publication date: 1-Feb-2008.
- Zhai A, Steffan J, Colohan C and Mowry T (2008). Compiler and hardware support for reducing the synchronization of speculative threads, ACM Transactions on Architecture and Code Optimization (TACO), 5:1, (1-33), Online publication date: 1-May-2008.
- Dou J and Cintra M (2007). A compiler cost model for speculative parallelization, ACM Transactions on Architecture and Code Optimization (TACO), 4:2, (12-es), Online publication date: 1-Jun-2007.
- Colohan C, Ailamaki A, Steffan J and Mowry T Tolerating Dependences Between Large Speculative Threads Via Sub-Threads Proceedings of the 33rd annual international symposium on Computer Architecture, (216-226)
- Colohan C, Ailamaki A, Steffan J and Mowry T (2006). Tolerating Dependences Between Large Speculative Threads Via Sub-Threads, ACM SIGARCH Computer Architecture News, 34:2, (216-226), Online publication date: 1-May-2006.
- Quiñones C, Madriles C, Sánchez J, Marcuello P, González A and Tullsen D (2005). Mitosis compiler, ACM SIGPLAN Notices, 40:6, (269-279), Online publication date: 12-Jun-2005.
- Quiñones C, Madriles C, Sánchez J, Marcuello P, González A and Tullsen D Mitosis compiler Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, (269-279)
- Marcuello P, González A and Tubella J (2004). Thread Partitioning and Value Prediction for Exploiting Speculative Thread-Level Parallelism, IEEE Transactions on Computers, 53:2, (114-125), Online publication date: 1-Feb-2004.
- Du Z, Lim C, Li X, Yang C, Zhao Q and Ngai T A cost-driven compilation framework for speculative parallelization of sequential programs Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation, (71-81)
- Du Z, Lim C, Li X, Yang C, Zhao Q and Ngai T (2004). A cost-driven compilation framework for speculative parallelization of sequential programs, ACM SIGPLAN Notices, 39:6, (71-81), Online publication date: 9-Jun-2004.
- Dou J and Cintra M Compiler Estimation of Load Imbalance Overhead in Speculative Parallelization Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, (203-214)
- Zhai A, Colohan C, Steffan J and Mowry T Compiler optimization of scalar value communication between speculative threads Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, (171-183)
- Zhai A, Colohan C, Steffan J and Mowry T (2002). Compiler optimization of scalar value communication between speculative threads, ACM SIGPLAN Notices, 37:10, (171-183), Online publication date: 1-Oct-2002.
- Zhai A, Colohan C, Steffan J and Mowry T (2002). Compiler optimization of scalar value communication between speculative threads, ACM SIGARCH Computer Architecture News, 30:5, (171-183), Online publication date: 1-Dec-2002.
- Zhai A, Colohan C, Steffan J and Mowry T (2002). Compiler optimization of scalar value communication between speculative threads, ACM SIGOPS Operating Systems Review, 36:5, (171-183), Online publication date: 1-Dec-2002.
- Codrescu L, Wills D and Meindl J (2001). Architecture of the Atlas Chip-Multiprocessor, IEEE Transactions on Computers, 50:1, (67-82), Online publication date: 1-Jan-2001.
- Rotenberg E and Smith J Control independence in trace processors Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture, (4-15)
- Vijaykumar T and Sohi G Task selection for a multiscalar processor Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture, (81-92)