research-article

High performance, energy efficiency, and scalability with GALS chip multiprocessors

Published: 01 January 2009

Abstract

Chip multiprocessors with globally asynchronous locally synchronous (GALS) clocking styles are promising candidates for processing computationally intensive and energy-constrained workloads. The GALS methodology simplifies clock tree design, provides opportunities to apply clock and voltage scaling jointly in system submodules to achieve high energy efficiency, and can result in easily scalable clocking systems. However, its use typically also introduces performance penalties due to additional communication latency between clock domains. We show that GALS chip multiprocessors (CMPs) with large inter-processor first-in first-out (FIFO) buffers can inherently hide much of the GALS performance penalty while executing applications that have been mapped with few communication loops. In fact, the penalty can be driven to zero with sufficiently large FIFOs and the removal of multiple-loop communication links. We present an example mesh-connected GALS chip multiprocessor and show that its performance (throughput) is reduced by less than 1% on average compared to the corresponding synchronous system for many DSP workloads. Furthermore, adaptive clock and voltage scaling for each processor provides approximately 40% power savings without any performance reduction. These results compare favorably with the GALS uniprocessor, which, compared to the corresponding synchronous uniprocessor, has a reported performance (throughput) reduction of greater than 10% and an energy savings of approximately 25% using dynamic clock and voltage scaling for many general-purpose applications.
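
The claim that sufficiently large FIFOs can hide clock-domain crossing latency on a one-way link can be illustrated with a toy model. The sketch below is not the simulator used in the article; it assumes (hypothetically) that both clock domains tick at the same rate and that each side observes the other side's FIFO pointer `sync_latency` extra cycles late. The function name and parameters are invented for illustration, and communication loops in the application mapping are not modeled.

```python
def simulate_link_throughput(depth, sync_latency, cycles=10_000):
    """Return words delivered per cycle on a producer -> FIFO -> consumer link.

    Toy assumptions (not from the article): both clock domains tick at the
    same rate, and each side sees the other side's pointer `sync_latency`
    extra cycles late because of clock-domain synchronization.
    """
    cum_writes = []  # cum_writes[t] = total words written through cycle t
    cum_reads = []   # cum_reads[t]  = total words read through cycle t
    writes = reads = 0

    def seen(history, t):
        # Count as observed from the other clock domain at cycle t.
        idx = t - 1 - sync_latency
        return history[idx] if idx >= 0 else 0

    for t in range(cycles):
        reads_seen = seen(cum_reads, t)    # producer's delayed view of reads
        writes_seen = seen(cum_writes, t)  # consumer's delayed view of writes
        if writes - reads_seen < depth:    # FIFO does not look full: write
            writes += 1
        if writes_seen - reads > 0:        # FIFO does not look empty: read
            reads += 1
        cum_writes.append(writes)
        cum_reads.append(reads)
    return reads / cycles
```

In this model, with `sync_latency=3`, a 2-word FIFO sustains about a quarter of a word per cycle, while an 8-word FIFO sustains close to one word per cycle. Once the depth covers the round-trip pointer delay, the extra latency no longer limits throughput.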



Published In

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 17, Issue 1
January 2009
160 pages

Publisher

IEEE Educational Activities Department

United States

Publication History

Published: 01 January 2009
Revised: 21 December 2007
Received: 22 July 2007

Author Tags

  1. array processor
  2. chip multiprocessor
  3. energy efficient
  4. globally asynchronous locally synchronous (GALS)
  5. low power
  6. scalable

Qualifiers

  • Research-article

Cited By

  • (2020) Scalable energy-efficient parallel sorting on a fine-grained many-core processor array. Journal of Parallel and Distributed Computing 138:C (32-47). DOI: 10.1016/j.jpdc.2019.12.011. Online publication date: 1-Apr-2020
  • (2016) Clock domain crossing (CDC) in 3D-SICs. Integration, the VLSI Journal 52:C (367-380). DOI: 10.1016/j.vlsi.2015.05.002. Online publication date: 1-Jan-2016
  • (2013) A compact clock generator for heterogeneous GALS MPSoCs in 65-nm CMOS technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21:3 (566-570). DOI: 10.1109/TVLSI.2012.2187224. Online publication date: 1-Mar-2013
  • (2012) Dataflow-driven execution control in a coarse-grained reconfigurable array (abstract only). Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (269-269). DOI: 10.1145/2145694.2145755. Online publication date: 22-Feb-2012
  • (2010) Design space exploration of a mesochronous link for cost-effective and flexible GALS NOCs. Proceedings of the Conference on Design, Automation and Test in Europe (679-684). DOI: 10.5555/1870926.1871091. Online publication date: 8-Mar-2010
  • (2010) A low-area multi-link interconnect architecture for GALS chip multiprocessors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 18:5 (750-762). DOI: 10.1109/TVLSI.2009.2017912. Online publication date: 1-May-2010
  • (2009) Architecture design principles for the integration of synchronization interfaces into Network-on-Chip switches. Proceedings of the 2nd International Workshop on Network on Chip Architectures (31-36). DOI: 10.1145/1645213.1645222. Online publication date: 12-Dec-2009
