Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

Software thread integration for instruction-level parallelism

Published: 05 September 2013 Publication History

Abstract

Multimedia applications require a significantly higher level of performance than previous workloads of embedded systems. They have driven digital signal processor (DSP) makers to adopt high-performance architectures like VLIW (Very-Long Instruction Word). Despite many efforts to exploit instruction-level parallelism (ILP) in the application, the speed is a fraction of what it could be, limited by the difficulty of finding enough independent instructions to keep all of the processor's functional units busy.
This article proposes Software Thread Integration (STI) for instruction-level parallelism. STI is a software technique for interleaving multiple threads of control into a single implicitly multithreaded one. We use STI to improve the performance on ILP processors by merging parallel procedures into one, increasing the compiler's scope and hence allowing it to create a more efficient instruction schedule. Assuming the parallel procedures are given, we define a methodology for finding the best performing integrated procedure with a minimum compilation time.
We quantitatively estimate the performance impact of integration, allowing various integration scenarios to be compared and ranked via profitability analysis. During integration of threads, different ILP-improving code transformations are selectively applied according to the control structure and the ILP characteristics of the code, driven by interactions with software pipelining. The estimated profitability is verified and corrected by an iterative compilation approach, compensating for possible estimation inaccuracy. Our modeling methods combined with limited compilation quickly find the best integration scenario without requiring exhaustive integration.

References

[1]
Aigner, G., Diwan, A., Heine, D. L., Lam, M. S., Moore, D. L., Murphy, B. R., and Sapuntzakis, C. 1999. An overview of the SUIF2 compiler infrastructure. http://Suif.stanford.edu/Suif/Suif2/doc-2.2.0-4/.
[2]
Aiken, A. and Nicolau, A. 1987. Loop quantization: an analysis and algorithm. Tech. rep., Cornell University, Ithaca, NY.
[3]
Allen, J., Kennedy, K., Porterfield, C., and Warren, J. 1983. Conversion of control dependence to data dependence. In Proceedings of the 10th ACM Symposium on Principles of Programming Languages. 177--189.
[4]
Allen, R. and Johnson, S. 1988. Compiling C for vectorization, parallelization, and inline expansion. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, New York, NY, 241--249.
[5]
Allen, V. H., Janardhan, J., Lee, R. M., and Srinivas, M. 1992. Enhanced region scheduling on a program dependence graph. In Proceedings of the 25th Annual International Symposium on Microarchitecture. IEEE Computer Society Press, Los Alamitos, CA, 72--80.
[6]
Almagor, L., Cooper, K. D., Grosul, A., Harvey, T. J., Reeves, S. W., Subramanian, D., Torczon, L., and Waterman, T. 2004. Finding effective compilation sequences. In Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. ACM Press, New York, NY, 231--239.
[7]
Callahan, D., Cocke, J., and Kennedy, K. 1988. Estimating interlock and improving balance for pipelined architectures. J. Parallel Distrib. Comput. 5, 4, 334--358.
[8]
Carr, S., Ding, C., and Sweany, P. 1996. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Hawaii International Conference on System Sciences.
[9]
Carribault, P., Cohen, A., and Jalby, W. 2005. Deep jam: Conversion of coarse-grain parallelism to instruction-level and vector parallelism for irregular applications. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. 291--302.
[10]
Chang, P. P., Warter, N. J., Mahlke, S., Chen, W. Y., and Hwu, W. W. 1991. Three superblock scheduling models for superscalar and superpipelined processors. Tech. rep., University of Illinois, Urbana, IL.
[11]
Charlesworth, A. 1981. An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family. IEEE Comput. 14, 3, 18--27.
[12]
Davidson, J. W. and Holler, A. M. 1992. Subprogram inlining: A study of its effects on program execution time. IEEE Trans. Softw. Engi. 18, 2, 89--102.
[13]
Dean, A. G. 2000. Software thread integration for hardware to software migration. Ph.D. Dissertation, Carnegie Mellon University, Pittsburg, PA.
[14]
Dean, A. G. 2002. Compiling for fine-grain concurrency: Planning and performing software thread integration. In Proceedings of the 23rd IEEE Real-Time Systems Symposium (RTSS'02). IEEE Computer Society. 103.
[15]
Dean, A. G. and Shen, J. P. 1998. Techniques for software thread integration in real-time embedded systems. In Proceedings of the 19th IEEE Real-Time Systems Symposium. 322--333.
[16]
Ferrante, J., Ottenstein, K. J., and Warren, J. D. 1987. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3, 319--349.
[17]
Fisher, J. A. 1981. Trace scheduling: A technique for global microcode compaction. IEEE Trans. Comput. 30, 7, 278--490.
[18]
Gupta, R. and Soffa, M. L. 1990. Region scheduling: An approach for detecting and redistributing parallelism. IEEE Trans. Softw. Engi. 16, 4, 421--431.
[19]
Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B. 2001. Mibench: A free, commercially representative embedded benchmark suite. In Proceedings of the 4th IEEE Annual Workshop on Workload Characterization.
[20]
Hank, R. E., Hwu, W. W., and Rau, B. R. 1995. Region-based compilation: An introduction and motivation. In Proceedings of the 28th Annual International Symposium on Microarchitecture. IEEE Computer Society Press, 158--168.
[21]
Havanki, W., Banerjia, S., and Conte, T. 1998. Treegion scheduling for wide issue processors. In Proceedings of the 4th International Symposium on High-Performance Computer Architecture. IEEE Computer Society, Washington, DC, 266.
[22]
Hwu, W. W. and Chang, P. P. 1989. Inline function expansion for compiling C programs. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, New York, NY, 246--257.
[23]
Hwu, W. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringmann, R. A., Ouellette, R. G., Hank, R. E., Kiyohara, T., Haab, G. E., Holm, J. G., and Lavery, D. M. 1993. The superblock: An effective technique for VLIW and superscalar compilation. J. Supercomput. 7, 1--2, 229--248.
[24]
Kisuki, T., Knijnenburg, P. M. W., and O'Boyle, M. F. P. 2000. Combined selection of tile sizes and unroll factors using iterative compilation. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, Washington, DC, 237.
[25]
Lam, M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM Press, 318--328.
[26]
Mahlke, S. A., Lin, D. C., Chen, W. Y., Hank, R. E., and Bringmann, R. A. 1992. Effective compiler support for predicated execution using the hyperblock. SIGMICRO Newsl. 23, 1--2, 45--54.
[27]
Qian, Y., Carr, S., and Sweany, P. H. 2002. Optimizing loop performance for clustered VLIW architectures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 271--280.
[28]
Rau, B. R. and Glaeser, C. D. 1981. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th Annual Workshop on Microprogramming. IEEE Press, Piscataway, NJ, 183--198.
[29]
So, W. 2007. Software thread integration for insturction level parallelism. Ph.D. Dissertation, North Carolina State University, Raleigh, NC.
[30]
So, W. and Dean, A. G. 2003. Procedure cloning and integration for converting parallelism from coarse to fine grain. In Proceedings of the 7th Workshop on Interaction between Compilers and Computer Architectures. IEEE Computer Society, 27--36.
[31]
So, W. and Dean, A. G. 2005. Complementing software pipelining with software thread integration. In Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. ACM Press.
[32]
So, W. and Dean, A. G. 2006. Reaching fast code faster: Using modeling for efficient software thread integration on a VLIW DSP. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems. ACM Press, 13--23.
[33]
Texas Instruments. 2004. TMS320C6000 Optimizing Compiler User's Guide (Rev. L). Texas Instruments, Dallas, TX.
[34]
Thies, W., Karczmarek, M., and Amarasinghe, S. 2002. StreamIt: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction.
[35]
Tullsen, D. M., Eggers, S. J., and Levy, H. M. 1995. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture. ACM Press, New York, NY, 392--403.
[36]
Wall, D. W. 1991. Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM Press, New York, NY, 176--188.
[37]
Warter, N. J., Bockhaus, J. W., Haab, G. E., and Subramanian, K. 1992. Enhanced modulo scheduling for loops with conditional branches. In Proceedings of the 25th Annual International Symposium on Microarchitecture.
[38]
Way, T., Breech, B., and Pollock, L. 2001. Demand-driven inlining heuristics in a region-based optimizing compiler for ILP architectures. In Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems. 90--95.
[39]
Zhou, H. and Conte, T. M. 2002. Code size efficiency in global scheduling for ILP processors. In Proceedings of the 6th Annual Workshop on Interaction between Compilers and Computer Architectures. IEEE Computer Society, Washington, DC, 79.

Index Terms

  1. Software thread integration for instruction-level parallelism

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Embedded Computing Systems
    ACM Transactions on Embedded Computing Systems  Volume 13, Issue 1
    August 2013
    332 pages
    ISSN:1539-9087
    EISSN:1558-3465
    DOI:10.1145/2501626
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Journal Family

    Publication History

    Published: 05 September 2013
    Accepted: 01 November 2011
    Revised: 01 August 2011
    Received: 01 January 2011
    Published in TECS Volume 13, Issue 1

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Software thread integration
    2. digital signal processor
    3. instruction-level parallelism
    4. thread-level parallelism
    5. very long instruction word

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 309
      Total Downloads
    • Downloads (Last 12 months)9
    • Downloads (Last 6 weeks)4
    Reflects downloads up to 21 Nov 2024

    Other Metrics

    Citations

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media