research-article

DESC: energy-efficient data exchange using synchronized counters

Authors:

Mahdi Nazm Bojnordi,

Engin Ipek

MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 234 - 246

https://doi.org/10.1145/2540708.2540729

Published: 07 December 2013

Abstract

Increasing cache sizes in modern microprocessors require long wires to connect cache arrays to processor cores. As a result, the last-level cache (LLC) has become a major contributor to processor energy, necessitating techniques to increase the energy efficiency of data exchange over LLC interconnects.

This paper presents an energy-efficient data exchange mechanism using synchronized counters. The key idea is to represent information by the delay between two consecutive pulses on a set of wires, which makes the number of state transitions on the interconnect independent of the data patterns, and significantly lowers the activity factor. Simulation results show that the proposed technique reduces overall processor energy by 7%, and the L2 cache energy by 1.81× on a set of sixteen parallel applications. This efficiency gain is attained at a cost of less than 1% area overhead to the L2 cache, and a 2% delay overhead to execution time.
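The key idea above is to encode a value as the delay between two pulses, so that each wire's transition count does not depend on the data. A small cycle-level model can illustrate this. It is a hypothetical sketch, not the authors' circuit, and it rests on several assumptions:

- each data wire carries one fixed-width chunk;
- a shared reset pulse starts identical counters at the transmitter and the receiver;
- each value is marked by a single toggle (transition signaling);
- the reset wire's own transitions are not counted.

The function names `transfer` and `binary_transitions` are illustrative only.

```python
# Hypothetical, simplified model of time-based data exchange with
# synchronized counters (an illustration, not the DESC hardware).
from typing import List, Optional, Tuple


def transfer(chunks: List[int], chunk_bits: int) -> Tuple[List[Optional[int]], int]:
    """Send one chunk per wire; return (decoded chunks, data-wire transitions)."""
    limit = 1 << chunk_bits
    for c in chunks:
        if not 0 <= c < limit:
            raise ValueError(f"chunk {c} does not fit in {chunk_bits} bits")

    n = len(chunks)
    tx_levels = [0] * n          # level currently driven on each data wire
    rx_levels = [0] * n          # last level the receiver observed per wire
    decoded: List[Optional[int]] = [None] * n
    transitions = 0

    # Assumption: a shared reset pulse (not counted here) starts both
    # counters together, so each cycle they hold the same counter value.
    for counter in range(limit):
        # Transmitter: toggle a wire when the counter equals its chunk value.
        for w, value in enumerate(chunks):
            if counter == value:
                tx_levels[w] ^= 1
                transitions += 1
        # Receiver: on a level change, latch the local counter as the value.
        for w in range(n):
            if tx_levels[w] != rx_levels[w]:
                rx_levels[w] = tx_levels[w]
                decoded[w] = counter

    return decoded, transitions


def binary_transitions(prev_word: int, next_word: int) -> int:
    """Toggles on a conventional parallel binary bus: the Hamming distance."""
    return bin(prev_word ^ next_word).count("1")


if __name__ == "__main__":
    print(transfer([0, 5, 15, 7], 4))
    print(transfer([0, 0, 0, 0], 4))
    print(binary_transitions(0x0000, 0xFFFF), binary_transitions(0x1234, 0x1234))
```

In this model, `transfer([0, 5, 15, 7], 4)` and `transfer([0, 0, 0, 0], 4)` both cost exactly four data-wire transitions, one per wire, whatever the values are. A binary bus instead costs anywhere from 0 toggles (`binary_transitions(0x1234, 0x1234)`) to 16 (`binary_transitions(0x0000, 0xFFFF)`) for a 16-bit word, depending on the data. The trade-off the sketch makes visible is latency: a transfer may take up to `2**chunk_bits` counter cycles.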

Cited By

Maragkoudaki E, Pavlidis V (2020). Energy-Efficient Time-Based Adaptive Encoding for Off-Chip Communication. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28:12, 2551-2562. Online publication date: Dec-2020.
https://doi.org/10.1109/TVLSI.2020.3018062
Behnam P, Bojnordi M (2020). STFL-DDR: Improving the Energy-Efficiency of Memory Interface. IEEE Transactions on Computers 69:12, 1823-1834. Online publication date: 1-Dec-2020.
https://doi.org/10.1109/TC.2020.2978826
Behnam P, Bojnordi M (2019). STFL. Proceedings of the 56th Annual Design Automation Conference 2019, 1-6. Online publication date: 2-Jun-2019.
https://dl.acm.org/doi/10.1145/3316781.3317819

Index Terms

DESC: energy-efficient data exchange using synchronized counters
1. Hardware
  1. Integrated circuits
    1. Interconnect
    2. Semiconductor memory

Information & Contributors

Information

Published In

ACM Conferences

MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

December 2013

498 pages

ISBN:9781450326384

DOI:10.1145/2540708

General Chair: Matthew Farrens, UC Davis

Program Chair: Christos Kozyrakis, Stanford University

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 December 2013

Qualifiers

Research-article

Funding Sources

Division of Computing and Communication Foundations

Conference

MICRO-46

Sponsor:

SIGMICRO

MICRO-46: The 46th Annual IEEE/ACM International Symposium on Microarchitecture

December 7 - 11, 2013

Davis, California

Acceptance Rates

MICRO-46 paper acceptance rate: 39 of 239 submissions (16%)

Overall acceptance rate: 484 of 2,242 submissions (22%)

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 14

Total Downloads: 324

Downloads (last 12 months): 1
Downloads (last 6 weeks): 0

Reflects downloads up to 10 Feb 2025

Citations

Ghose S, Yaglikçi A, Gupta R, Lee D, Kudrolli K, Liu W, Hassan H, Chang K, Chatterjee N, Agrawal A, O'Connor M, Mutlu O (2018). What Your DRAM Power Models Are Not Telling You. Proceedings of the ACM on Measurement and Analysis of Computing Systems 2:3, 1-41. Online publication date: 21-Dec-2018.
https://dl.acm.org/doi/10.1145/3224419
Khoshavi N, Demara R (2018). Read-Tuned STT-RAM and eDRAM Cache Hierarchies for Throughput and Energy Optimization. IEEE Access 6, 14576-14590. Online publication date: 2018.
https://doi.org/10.1109/ACCESS.2018.2813668
Behnam P, Sedaghati N, Bojnordi M (2017). Adaptive Time-based Encoding for Energy-Efficient Large Cache Architectures. Proceedings of the 5th International Workshop on Energy Efficient Supercomputing, 1-8. Online publication date: 12-Nov-2017.
https://dl.acm.org/doi/10.1145/3149412.3149417
Li A, Zhao W, Song S, Hunter H, Moreno J, Emer J, Sanchez D (2017). BVF. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 532-545. Online publication date: 14-Oct-2017.
https://dl.acm.org/doi/10.1145/3123939.3123944
Mishkin M, Kim N, Lipasti M (2017). Temporal codes in on-chip interconnects. 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), 1-6. Online publication date: Jul-2017.
https://doi.org/10.1109/ISLPED.2017.8009158
Wang S, Ipek E, Hsu W, Yang C, Lipasti M, Lee H (2016). Reducing data movement energy via online data clustering and encoding. The 49th Annual IEEE/ACM International Symposium on Microarchitecture, 1-13. Online publication date: 15-Oct-2016.
https://dl.acm.org/doi/10.5555/3195638.3195676
Mittal S, Vetter J (2016). A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems. IEEE Transactions on Parallel and Distributed Systems 27:5, 1524-1536. Online publication date: 1-May-2016.
https://dl.acm.org/doi/10.1109/TPDS.2015.2435788