research-article

Software thread integration for instruction-level parallelism

Authors:

Alexander G. DeanAuthors Info & Claims

ACM Transactions on Embedded Computing Systems (TECS), Volume 13, Issue 1

Article No.: 8, Pages 1 - 23

https://doi.org/10.1145/2512466

Published: 05 September 2013 Publication History

Abstract

Multimedia applications require a significantly higher level of performance than previous workloads of embedded systems. They have driven digital signal processor (DSP) makers to adopt high-performance architectures like VLIW (Very-Long Instruction Word). Despite many efforts to exploit instruction-level parallelism (ILP) in the application, the speed is a fraction of what it could be, limited by the difficulty of finding enough independent instructions to keep all of the processor's functional units busy.

This article proposes Software Thread Integration (STI) for instruction-level parallelism. STI is a software technique for interleaving multiple threads of control into a single implicitly multithreaded one. We use STI to improve the performance on ILP processors by merging parallel procedures into one, increasing the compiler's scope and hence allowing it to create a more efficient instruction schedule. Assuming the parallel procedures are given, we define a methodology for finding the best performing integrated procedure with a minimum compilation time.

We quantitatively estimate the performance impact of integration, allowing various integration scenarios to be compared and ranked via profitability analysis. During integration of threads, different ILP-improving code transformations are selectively applied according to the control structure and the ILP characteristics of the code, driven by interactions with software pipelining. The estimated profitability is verified and corrected by an iterative compilation approach, compensating for possible estimation inaccuracy. Our modeling methods combined with limited compilation quickly find the best integration scenario without requiring exhaustive integration.

References

[1]

Aigner, G., Diwan, A., Heine, D. L., Lam, M. S., Moore, D. L., Murphy, B. R., and Sapuntzakis, C. 1999. An overview of the SUIF2 compiler infrastructure. http://Suif.stanford.edu/Suif/Suif2/doc-2.2.0-4/.

[2]

Aiken, A. and Nicolau, A. 1987. Loop quantization: an analysis and algorithm. Tech. rep., Cornell University, Ithaca, NY.

Digital Library

[3]

Allen, J., Kennedy, K., Porterfield, C., and Warren, J. 1983. Conversion of control dependence to data dependence. In Proceedings of the 10th ACM Symposium on Principles of Programming Languages. 177--189.

Digital Library

[4]

Allen, R. and Johnson, S. 1988. Compiling C for vectorization, parallelization, and inline expansion. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, New York, NY, 241--249.

Digital Library

[5]

Allen, V. H., Janardhan, J., Lee, R. M., and Srinivas, M. 1992. Enhanced region scheduling on a program dependence graph. In Proceedings of the 25th Annual International Symposium on Microarchitecture. IEEE Computer Society Press, Los Alamitos, CA, 72--80.

Digital Library

[6]

Almagor, L., Cooper, K. D., Grosul, A., Harvey, T. J., Reeves, S. W., Subramanian, D., Torczon, L., and Waterman, T. 2004. Finding effective compilation sequences. In Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. ACM Press, New York, NY, 231--239.

Digital Library

[7]

Callahan, D., Cocke, J., and Kennedy, K. 1988. Estimating interlock and improving balance for pipelined architectures. J. Parallel Distrib. Comput. 5, 4, 334--358.

Digital Library

[8]

Carr, S., Ding, C., and Sweany, P. 1996. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Hawaii International Conference on System Sciences.

Digital Library

[9]

Carribault, P., Cohen, A., and Jalby, W. 2005. Deep jam: Conversion of coarse-grain parallelism to instruction-level and vector parallelism for irregular applications. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. 291--302.

Digital Library

[10]

Chang, P. P., Warter, N. J., Mahlke, S., Chen, W. Y., and Hwu, W. W. 1991. Three superblock scheduling models for superscalar and superpipelined processors. Tech. rep., University of Illinois, Urbana, IL.

[11]

Charlesworth, A. 1981. An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family. IEEE Comput. 14, 3, 18--27.

Digital Library

[12]

Davidson, J. W. and Holler, A. M. 1992. Subprogram inlining: A study of its effects on program execution time. IEEE Trans. Softw. Engi. 18, 2, 89--102.

Digital Library

[13]

Dean, A. G. 2000. Software thread integration for hardware to software migration. Ph.D. Dissertation, Carnegie Mellon University, Pittsburg, PA.

Digital Library

[14]

Dean, A. G. 2002. Compiling for fine-grain concurrency: Planning and performing software thread integration. In Proceedings of the 23rd IEEE Real-Time Systems Symposium (RTSS'02). IEEE Computer Society. 103.

Digital Library

[15]

Dean, A. G. and Shen, J. P. 1998. Techniques for software thread integration in real-time embedded systems. In Proceedings of the 19th IEEE Real-Time Systems Symposium. 322--333.

Digital Library

[16]

Ferrante, J., Ottenstein, K. J., and Warren, J. D. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3, 319--349.

Digital Library

[17]

Fisher, J. A. 1981. Trace scheduling: A technique for global microcode compaction. IEEE Trans. Comput. 30, 7, 278--490.

Digital Library

[18]

Gupta, R. and Soffa, M. L. 1990. Region scheduling: An approach for detecting and redistributing parallelism. IEEE Trans. Softw. Engi. 16, 4, 421--431.

Digital Library

[19]

Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B. 2001. Mibench: A free, commercially representative embedded benchmark suite. In Proceedings of the 4th IEEE Annual Workshop on Workload Characterization.

Digital Library

[20]

Hank, R. E., Hwu, W. W., and Rau, B. R. 1995. Region-based compilation: An introduction and motivation. In Proceedings of the 28th Annual International Symposium on Microarchitecture. IEEE Computer Society Press, 158--168.

Digital Library

[21]

Havanki, W., Banerjia, S., and Conte, T. 1998. Treegion scheduling for wide issue processors. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture. IEEE Computer Society, Washington, DC, 266.

Digital Library

[22]

Hwu, W. W. and Chang, P. P. 1989. Inline function expansion for compiling C programs. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, New York, NY, 246--257.

Digital Library

[23]

Hwu, W. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringmann, R. A., Ouellette, R. G., Hank, R. E., Kiyohara, T., Haab, G. E., Holm, J. G., and Lavery, D. M. 1993. The superblock: An effective technique for VLIW and superscalar compilation. J. Supercomput. 7, 1--2, 229--248.

Digital Library

[24]

Kisuki, T., Knijnenburg, P. M. W., and O'Boyle, M. F. P. 2000. Combined selection of tile sizes and unroll factors using iterative compilation. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, Washington, DC, 237.

Digital Library

[25]

Lam, M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, 318--328.

Digital Library

[26]

Mahlke, S. A., Lin, D. C., Chen, W. Y., Hank, R. E., and Bringmann, R. A. 1992. Effective compiler support for predicated execution using the hyperblock. SIGMICRO Newsl. 23, 1--2, 45--54.

Digital Library

[27]

Qian, Y., Carr, S., and Sweany, P. H. 2002. Optimizing loop performance for clustered VLIW architectures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 271--280.

Digital Library

[28]

Rau, B. R. and Glaeser, C. D. 1981. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Workshop on Microprogramming. IEEE Press, Piscataway, NJ, 183--198.

Digital Library

[29]

So, W. 2007. Software thread integration for insturction level parallelism. Ph.D. Dissertation, North Carolina State University, Raleigh, NC.

[30]

So, W. and Dean, A. G. 2003. Procedure cloning and integration for converting parallelism from coarse to fine grain. In Proceedings of the 7th Workshop on Interaction between Compilers and Computer Architectures. IEEE Computer Society, 27--36.

Digital Library

[31]

So, W. and Dean, A. G. 2005. Complementing software pipelining with software thread integration. In Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. ACM Press.

Digital Library

[32]

So, W. and Dean, A. G. 2006. Reaching fast code faster: Using modeling for efficient software thread integration on a VLIW DSP. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems. ACM Press, 13--23.

Digital Library

[33]

Texas Instruments. 2004. TMS320C6000 Optimizing Compiler User's Guide (Rev. L). Texas Instruments, Dallas, TX.

[34]

Thies, W., Karczmarek, M., and Amarasinghe, S. 2002. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction.

Digital Library

[35]

Tullsen, D. M., Eggers, S. J., and Levy, H. M. 1995. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture. ACM Press, New York, NY, 392--403.

Digital Library

[36]

Wall, D. W. 1991. Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM Press, New York, NY, 176--188.

Digital Library

[37]

Warter, N. J., Bockhaus, J. W., Haab, G. E., and Subramanian, K. 1992. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture.

Digital Library

[38]

Way, T., Breech, B., and Pollock, L. 2001. Demand-driven inlining heuristics in a region-based optimizing compiler for ILP architectures. In Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems. 90--95.

[39]

Zhou, H. and Conte, T. M. 2002. Code size efficiency in global scheduling for ILP processors. In Proceedings of the 6th Annual Workshop on Interaction between Compilers and Computer Architectures. IEEE Computer Society, Washington, DC, 79.

Digital Library

Index Terms

Software thread integration for instruction-level parallelism
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Recommendations

Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue super-scalar processors exploit ILP by executing multiple instructions from a ...
An evaluation of speculative instruction execution on simultaneous multithreaded processors

Modern superscalar processors rely heavily on speculative execution for performance. For example, our measurements show that on a 6-issue superscalar, 93% of committed instructions for SPECINT95 are speculative. Without speculation, processor resources ...
Exploiting Java instruction/thread level parallelism with horizontal multithreading

Java bytecodes can be executed with the following three methods: a Java interpretor running on a particular machine interprets bytecodes; a Just-In-Time (JIT) compiler translates bytecodes to the native primitives of the particular machine and the ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Embedded Computing Systems

ACM Transactions on Embedded Computing Systems Volume 13, Issue 1

August 2013

332 pages

ISSN:1539-9087

EISSN:1558-3465

DOI:10.1145/2501626

Issue’s Table of Contents

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 05 September 2013

Accepted: 01 November 2011

Revised: 01 August 2011

Received: 01 January 2011

Published in TECS Volume 13, Issue 1

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
309
Total Downloads

Downloads (Last 12 months)9
Downloads (Last 6 weeks)4

Reflects downloads up to 21 Nov 2024

Other Metrics

View Author Metrics

Citations

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents