research-article

Concise loads and stores: the case for an asymmetric compute-memory architecture for approximation

Authors: Shih-Chieh Lin, Michael A. Laurenzano, Jason Mars

MICRO-49: The 49th Annual IEEE/ACM International Symposium on Microarchitecture

Article No.: 41, Pages 1 - 13

Published: 15 October 2016

Abstract

Cache capacity and memory bandwidth play critical roles in application performance, particularly for data-intensive applications from domains that include machine learning, numerical analysis, and data mining. Many of these applications are also tolerant of imprecise inputs and have loose constraints on output quality, making them ideal candidates for approximate computing. This paper introduces a novel approximate computing technique that decouples the format of data in the memory hierarchy from the format of data in the compute subsystem, significantly reducing the cost of storing and moving bits throughout the memory hierarchy and improving application performance. This asymmetric compute-memory extension to conventional architectures, ACME, adds two new instruction classes to the ISA, load-concise and store-concise, along with three small functional units in the microarchitecture to support these instructions. ACME does not affect exact execution of applications and comes into play only when concise memory operations are used. Through detailed experimentation, we find that ACME is very effective at trading result accuracy for improved application performance. Our results show that ACME achieves a 1.3X speedup (up to 1.8X) while maintaining 99% accuracy, or a 1.1X speedup while maintaining 99.999% accuracy. Moreover, our approach incurs negligible area and power overheads, adding just 0.005% area and 0.1% power to a conventional modern architecture.
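The decoupling idea in the abstract can be illustrated with a small software emulation. This is only a sketch under stated assumptions, not ACME's actual design: the abstract does not specify the concise format, so the code assumes a hypothetical 16-bit encoding that keeps the upper half of an IEEE 754 single-precision value. The function names `store_concise`, `load_concise`, and `concise_dot` are invented for illustration and do not correspond to real instructions or APIs.

```python
import struct


def store_concise(value: float) -> int:
    """Hypothetical store-concise: keep only the top 16 bits of the float32 encoding.

    Assumption: the concise format is sign + 8 exponent bits + 7 mantissa bits.
    The paper's real format may differ.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return bits >> 16  # drop the low 16 mantissa bits (truncation, no rounding)


def load_concise(half: int) -> float:
    """Hypothetical load-concise: widen a 16-bit stored value back to float32."""
    bits = (half & 0xFFFF) << 16  # low mantissa bits are filled with zeros
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def concise_dot(xs, ys):
    """Dot product where 'memory' holds 16-bit values and compute uses full floats."""
    mem_x = [store_concise(x) for x in xs]  # half the bits stored and moved
    mem_y = [store_concise(y) for y in ys]
    return sum(load_concise(a) * load_concise(b) for a, b in zip(mem_x, mem_y))


if __name__ == "__main__":
    x = 3.14159
    approx = load_concise(store_concise(x))
    print(f"exact={x} approx={approx} rel_err={abs(approx - x) / x:.6f}")
```

In this sketch the stored representation is half the size of the compute representation, so each cache line or memory transfer carries twice as many values, at the cost of a relative error bounded by roughly 2^-7 per value. Exact execution is untouched as long as the ordinary (non-concise) path is used.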

Published In

MICRO-49: The 49th Annual IEEE/ACM International Symposium on Microarchitecture

October 2016

816 pages

General Chairs: Wei-Chung Hsu (NTU, Taiwan), Chia-Lin Yang (NTU, Taiwan)
Program Chairs: Mikko Lipasti (Univ. Wisconsin), Hsien-Hsin Lee (TSMC, Taiwan)

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS\DATC: IEEE Computer Society

Publisher

IEEE Press

Conference

MICRO-49: The 49th Annual IEEE/ACM International Symposium on Microarchitecture

October 15 - 19, 2016

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
