Enabling energy-proportional computing on instruction-level parallel processors

The Journal of Supercomputing

Abstract

Digital signal processor (DSP) cores with very-long-instruction-word (VLIW) architectures are widely used in recent embedded and multimedia systems-on-chips, and improving power efficiency has become a crucial issue in designing VLIW-DSP cores. This paper proposes a novel approach, energy-proportional parallel computing, to improve the power efficiency of a VLIW-DSP core. The main idea is to adapt parallelism to performance demand and to apply power gating to idle hardware. The challenge is to reduce the power dissipated in the register files. Energy proportionality is realized through an architecture featuring distributed and power-gated register files. Power efficiency is exploited by the compiler through instruction scheduling and register allocation. Energy-aware list scheduling is proposed to reduce register file power while building an as-soon-as-possible schedule. Following the scheduling outcome, register allocation is performed through weighted graph coloring with energy-oriented decision rules. Evaluation with the MiBench benchmark suite shows a significant power saving and scaling range in register file power. Moreover, the evaluation shows that a power-gated processor maintains good energy efficiency across applications with diverse behavior. These results suggest that energy-proportional parallel computing is an attractive direction for future processor design in the deep-submicron era.
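The abstract names energy-aware list scheduling but does not give the algorithm, so the following Python sketch is only a rough illustration of the general idea and not the authors' method. It assumes unit-latency operations, a fixed issue width, and a hypothetical mapping `bank_of` from each operation to the register-file bank it writes. Each cycle greedily issues ready operations as soon as possible, and among ready operations it prefers those that write to a bank already active in that cycle, so fewer banks need to be powered on. The function name, the data layout, and the tie-break rule are all assumptions; the sketch also ignores power-gating wake-up latency and break-even time, and its tie-break is not critical-path aware.

```python
def energy_aware_list_schedule(ops, deps, bank_of, width):
    """Greedy list scheduling with a bank-aware tie-break (illustrative only).

    ops     -- operation names in program order
    deps    -- dict: op -> iterable of ops that must finish first
    bank_of -- dict: op -> id of the register-file bank it writes (hypothetical)
    width   -- number of issue slots per cycle
    Returns (schedule, active_banks): one list of ops and one set of
    bank ids per cycle. Unit latency is assumed for every operation.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    done = set()          # ops completed in earlier cycles
    pending = list(ops)   # ops not yet scheduled, in program order
    schedule, active_banks = [], []
    while pending:
        # An op is ready once all of its predecessors completed earlier.
        ready = [op for op in pending if set(deps.get(op, ())) <= done]
        if not ready:
            raise ValueError("dependence cycle detected")
        cycle, banks = [], set()
        while ready and len(cycle) < width:
            # Prefer an op whose bank is already active in this cycle.
            # min() returns the first minimal element, so program order
            # breaks any remaining ties.
            pick = min(ready, key=lambda op: bank_of[op] not in banks)
            ready.remove(pick)
            cycle.append(pick)
            banks.add(bank_of[pick])
        for op in cycle:
            pending.remove(op)
        done.update(cycle)
        schedule.append(cycle)
        active_banks.append(banks)
    return schedule, active_banks
```

For example, with five operations `a` to `e` where `a`, `c`, `e` write bank 0, `b` and `d` write bank 1, `e` depends on `a` and `c`, and the issue width is 2, the sketch groups `a` with `c` and `b` with `d`, so each cycle touches a single bank. Issuing strictly in program order would pair `a` with `b` and `c` with `d`, touching both banks in the first two cycles for the same schedule length.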

Author information

Corresponding author

Correspondence to Yung-Cheng Ma.

About this article

Cite this article

Ma, YC., Chao, WS. & Liu, TA. Enabling energy-proportional computing on instruction-level parallel processors. J Supercomput 71, 391–447 (2015). https://doi.org/10.1007/s11227-014-1301-z

  • DOI: https://doi.org/10.1007/s11227-014-1301-z
