research-article

High performance, energy efficiency, and scalability with GALS chip multiprocessors

Authors:

Bevan M. Baas

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 17, Issue 1

Pages 66 - 79

https://doi.org/10.1109/TVLSI.2008.2001947

Published: 01 January 2009

Abstract

Chip multiprocessors (CMPs) with globally asynchronous locally synchronous (GALS) clocking styles are promising candidates for processing computationally intensive and energy-constrained workloads. The GALS methodology simplifies clock tree design, provides opportunities to scale clock frequency and supply voltage jointly in system submodules to achieve high energy efficiency, and can also result in easily scalable clocking systems. However, it typically introduces performance penalties due to additional communication latency between clock domains. We show that GALS CMPs with large inter-processor first-in, first-out (FIFO) buffers can inherently hide much of the GALS performance penalty while executing applications that have been mapped with few communication loops. In fact, the penalty can be driven to zero with sufficiently large FIFOs and the removal of multiple-loop communication links. We present an example mesh-connected GALS chip multiprocessor and show that, for many DSP workloads, its throughput is on average less than 1% lower than that of the corresponding synchronous system. Furthermore, adaptive clock and voltage scaling for each processor provides approximately 40% power savings without any performance reduction. These results compare favorably with the GALS uniprocessor, which, compared to the corresponding synchronous uniprocessor, has a reported throughput reduction of greater than 10% and an energy savings of approximately 25% using dynamic clock and voltage scaling for many general-purpose applications.
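The claim that sufficiently large FIFOs can hide clock-domain crossing latency can be illustrated with a toy model. The sketch below is not the simulator or the FIFO design used in the article. It assumes a single one-way link with credit-based flow control, a fixed crossing latency in each direction, and a producer and consumer that can each handle one item per cycle. The function name `simulate_link` and its parameters are hypothetical.

```python
from collections import deque


def simulate_link(fifo_depth, crossing_latency, cycles=10_000):
    """Toy model (an assumption, not the article's method).

    Returns items consumed per cycle over a one-way link where every
    data item and every returned credit takes `crossing_latency`
    cycles to cross between clock domains.
    """
    credits = fifo_depth          # free FIFO slots known to the producer
    data_in_flight = deque()      # arrival times of items heading to the consumer
    credits_in_flight = deque()   # arrival times of credits heading to the producer
    consumed = 0

    for t in range(cycles):
        # Credits that have finished crossing back become usable.
        while credits_in_flight and credits_in_flight[0] <= t:
            credits_in_flight.popleft()
            credits += 1

        # The consumer takes at most one arrived item per cycle and
        # sends a credit back to signal the freed slot.
        if data_in_flight and data_in_flight[0] <= t:
            data_in_flight.popleft()
            consumed += 1
            credits_in_flight.append(t + crossing_latency)

        # The producer sends at most one item per cycle if it holds a credit.
        if credits > 0:
            credits -= 1
            data_in_flight.append(t + crossing_latency)

    return consumed / cycles


if __name__ == "__main__":
    for depth in (1, 2, 6, 16):
        rate = simulate_link(fifo_depth=depth, crossing_latency=3)
        print(f"depth={depth:2d}  throughput={rate:.3f} items/cycle")
```

In this model the full/empty signaling forms a loop of two crossings, so throughput is limited to roughly the FIFO depth divided by the round-trip latency until the FIFO is deep enough to cover the round trip. Beyond that depth, the crossing latency only delays the first item and no longer reduces steady-state throughput.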

Published In

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Volume 17, Issue 1

January 2009

160 pages

ISSN:1063-8210

Copyright © 2009.

Publisher

IEEE Educational Activities Department

United States

Publication History

Published: 01 January 2009

Revised: 21 December 2007

Received: 22 July 2007

Cited By

Stillmaker A, Bohnenstiehl B, Stillmaker L, Baas B (2020) Scalable energy-efficient parallel sorting on a fine-grained many-core processor array. Journal of Parallel and Distributed Computing 138:C (32-47). DOI: 10.1016/j.jpdc.2019.12.011. Online publication date: 1-Apr-2020
https://dl.acm.org/doi/10.1016/j.jpdc.2019.12.011
(2016) Clock domain crossing (CDC) in 3D-SICs. Integration, the VLSI Journal 52:C (367-380). DOI: 10.1016/j.vlsi.2015.05.002. Online publication date: 1-Jan-2016
https://dl.acm.org/doi/10.1016/j.vlsi.2015.05.002
Höppner S, Eisenreich H, Henker S, Walter D, Ellguth G, Schüffny R (2013) A compact clock generator for heterogeneous GALS MPSoCs in 65-nm CMOS technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21:3 (566-570). DOI: 10.1109/TVLSI.2012.2187224. Online publication date: 1-Mar-2013
https://dl.acm.org/doi/10.1109/TVLSI.2012.2187224
Panda R, Hauck S, Compton K, Hutchings B (2012) Dataflow-driven execution control in a coarse-grained reconfigurable array (abstract only). Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays (269-269). DOI: 10.1145/2145694.2145755. Online publication date: 22-Feb-2012
https://dl.acm.org/doi/10.1145/2145694.2145755
Ludovici D, Strano A, Gaydadjiev G, Benini L, Bertozzi D, De Micheli G, Al-Hashimi B, Mueller W, Macii E (2010) Design space exploration of a mesochronous link for cost-effective and flexible GALS NOCs. Proceedings of the Conference on Design, Automation and Test in Europe (679-684). DOI: 10.5555/1870926.1871091. Online publication date: 8-Mar-2010
https://dl.acm.org/doi/10.5555/1870926.1871091
Yu Z, Baas B (2010) A low-area multi-link interconnect architecture for GALS chip multiprocessors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 18:5 (750-762). DOI: 10.1109/TVLSI.2009.2017912. Online publication date: 1-May-2010
https://dl.acm.org/doi/10.1109/TVLSI.2009.2017912
Ludovici D, Strano A, Bertozzi D, Palesi M, Kumar S (2009) Architecture design principles for the integration of synchronization interfaces into Network-on-Chip switches. Proceedings of the 2nd International Workshop on Network on Chip Architectures (31-36). DOI: 10.1145/1645213.1645222. Online publication date: 12-Dec-2009
https://dl.acm.org/doi/10.1145/1645213.1645222
