Computer Architecture with Associative Processor Replacing Last-Level Cache and SIMD Accelerator

Published: 01 February 2015 in IEEE Transactions on Computers, vol. 64, no. 2 (IEEE Computer Society)

Abstract

This study presents a computer architecture in which the last-level cache and the SIMD accelerator are replaced by an associative processor. An associative processor combines data storage with data processing, functioning simultaneously as a massively parallel SIMD processor and as a memory. An analytic performance model of this architecture is introduced. Comparative analysis, supported by cycle-accurate simulation and emulation, shows that the proposed architecture may outperform a conventional computer architecture comprising a SIMD coprocessor and a shared last-level cache while consuming less power.
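To make the principle concrete, the following minimal Python sketch (an illustration only, not the authors' model or simulator) shows how an associative processor can carry out a word-parallel, bit-serial vector addition purely through parallel compare passes and masked write passes over its rows. The field layout, operand width, pass ordering, and helper names (compare, write, ap_add) are assumptions made for this example.

    import random

    WIDTH = 8  # bits per operand field (illustrative choice)

    def compare(rows, cols, pattern):
        """Parallel search: indices of rows whose bits at `cols` equal `pattern`."""
        return {i for i, row in enumerate(rows)
                if all(row[c] == p for c, p in zip(cols, pattern))}

    def write(rows, matched, cols, values):
        """Parallel masked write: set `cols` to `values` in every matched row."""
        for i in matched:
            for c, v in zip(cols, values):
                rows[i][c] = v

    def ap_add(rows, a0, b0, c_col):
        """Word-parallel, bit-serial add: B <- A + B, carry kept in c_col."""
        # Full-adder passes for the input states whose (b, carry) bits actually
        # change, ordered so a row rewritten within a bit slice cannot match a
        # later pass of the same slice.
        passes = [((0, 0, 1), (1, 0)),   # a=0, b=0, c=1 -> sum=1, carry=0
                  ((0, 1, 1), (0, 1)),   # a=0, b=1, c=1 -> sum=0, carry=1
                  ((1, 1, 0), (0, 1)),   # a=1, b=1, c=0 -> sum=0, carry=1
                  ((1, 0, 0), (1, 0))]   # a=1, b=0, c=0 -> sum=1, carry=0
        for k in range(WIDTH):                       # one bit slice per step
            cols = (a0 + k, b0 + k, c_col)
            for key, new_bits in passes:
                matched = compare(rows, cols, key)               # compare pass
                write(rows, matched, (b0 + k, c_col), new_bits)  # write pass

    if __name__ == "__main__":
        random.seed(0)
        vals = [(random.randrange(256), random.randrange(256)) for _ in range(4)]
        # Row layout (assumed): A bits (LSB first), then B bits, then one carry bit.
        rows = [[(a >> i) & 1 for i in range(WIDTH)] +
                [(b >> i) & 1 for i in range(WIDTH)] + [0]
                for a, b in vals]
        ap_add(rows, a0=0, b0=WIDTH, c_col=2 * WIDTH)
        for (a, b), row in zip(vals, rows):
            s = sum(row[WIDTH + i] << i for i in range(WIDTH))
            print(f"{a} + {b} -> {s} (carry {row[2 * WIDTH]})")

The run time of such an operation grows with the operand width rather than with the number of rows, which is what lets the same array serve both as a memory and as a massively parallel SIMD engine, as described in the abstract.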
