Article

A first glance at Kilo-instruction based multiprocessors

Authors:

Marco Galluzzi,

Valentín Puente,

Adrián Cristal,

Ramón Beivide,

José-Ángel Gregorio,

Mateo Valero

CF '04: Proceedings of the 1st conference on Computing frontiers

Pages 212 - 221

https://doi.org/10.1145/977091.977120

Published: 14 April 2004

Abstract

The ever-increasing gap between processor and memory speeds, sometimes referred to as the Memory Wall problem [42], has a very negative impact on performance. This mismatch will be more severe in future processor generations. Modern cache organizations and prefetching techniques will not be able to solve this problem. A novel and promising technique for dealing with the Memory Wall consists of designing processors able to maintain thousands of in-flight instructions. One example of this kind of processor is the Kilo-instruction processor [8]. When running numerical applications, Kilo-instruction processors have demonstrated their ability to maintain high IPC values as memory latencies increase.

In this paper, we study, for the first time, the influence of Kilo-instruction processors on the performance of small-scale CC-NUMA multiprocessors. Our first results, using an ideal network, show the enormous potential of Kilo-instruction processors as computing nodes, not only for hiding local DRAM latencies but also remote ones. A deeper analysis, using realistic networks, reveals that each node places heavy demands on packet throughput, since larger re-order buffers translate into a higher density of remote accesses. We then show that current interconnection networks cannot cope with these high traffic levels, so newer and faster networks have to be designed. In short, our results show dramatic performance gains over multiprocessors based on current microprocessors and point to a possible way to build future shared-memory multiprocessor systems.
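The link the abstract draws between larger re-order buffers and heavier network demand can be illustrated with a back-of-the-envelope model based on Little's law. The sketch below is not taken from the paper: the function names, the assumption of one independent remote miss every `miss_interval` instructions, and all numeric values are hypothetical and chosen only for illustration.

```python
def outstanding_misses(window_size: int, miss_interval: int) -> int:
    """Misses that can be in flight at once for a given instruction window.

    Assumes (hypothetically) one remote miss every `miss_interval`
    instructions and that all misses inside the window are independent.
    """
    return max(1, window_size // miss_interval)


def requests_per_cycle(window_size: int, miss_interval: int, latency: int) -> float:
    """Little's law: throughput = requests in flight / latency per request."""
    return outstanding_misses(window_size, miss_interval) / latency


if __name__ == "__main__":
    LATENCY = 400       # hypothetical remote access latency, in cycles
    MISS_INTERVAL = 64  # hypothetical number of instructions between remote misses
    for window in (128, 1024, 4096):
        rate = requests_per_cycle(window, MISS_INTERVAL, LATENCY)
        print(f"window={window:5d}  requests/cycle={rate:.4f}")
```

Under these assumptions, a 4096-entry window keeps 32 times as many requests in flight as a 128-entry one, so for the same latency the node injects packets 32 times as fast. That is the kind of pressure on the interconnection network that the abstract describes.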

References

[1] H. Akkary, R. Rajwar, and S. Srinivasan. Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors. Proceedings of the 36th Annual ACM/IEEE Intl. Symposium on Microarchitecture, pages 423--434, December 2003.

[2] J.-L. Baer and T.-F. Chen. An Effective On-chip Preloading Scheme to Reduce Data Access Penalty. In Proceedings of Supercomputing '91, pages 176--186, November 1991.

[3] W. Camp and J. Tomkins. The Red Storm Computer Architecture and its Implementation. Conference on High-Speed Computing, April 2003.

[4] C. Carrion, R. Beivide, J. Gregorio, and F. Vallejo. A Flow Control Mechanism to Avoid Message Deadlock in K-ary N-cube Networks. Fourth International Conference on High Performance Computing, pages 322--329, December 1997.

[5] R. Chappell, J. Stark, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous Subordinate Microthreading (SSMT). Proceedings of the 26th Annual Intl. Symposium on Computer Architecture, pages 186--195, May 1999.

[6] J. D. Collins, D. M. Tullsen, H. Wang, and J. P. Shen. Dynamic Speculative Precomputation. Proceedings of the 34th Annual ACM/IEEE Intl. Symposium on Microarchitecture, pages 306--317, December 2001.

[7] A. Cristal, J. F. Martinez, J. Llosa, and M. Valero. A Case for Resource-conscious Out-of-order Processors. In IEEE TCCA Computer Architecture Letters, 2, October 2003.

[8] A. Cristal, D. Ortega, J. Llosa, and M. Valero. Kilo-instruction Processors. Proceedings of the 5th International Symposium on High Performance Computing (invited paper), pages 10--25, October 2003.

[9] A. Cristal, D. Ortega, J. Llosa, and M. Valero. Out-of-Order Commit Processors. Proceedings of the 10th Intl. Conference on High Performance Computer Architecture, February 2004.

[10] A. Cristal, M. Valero, A. Gonzalez, and J. Llosa. Large Virtual ROBs by Processor Checkpointing. Technical Report UPC-DAC-2002-39, Universidad Politécnica de Cataluña, July 2002.

[11] M. Dubois and Y. Song. Assisted Execution. Technical Report CENG 98-25, Department of EE-Systems, University of Southern California, October 1998.

[12] M. Galles. Spider: A High-Speed Network Interconnect. IEEE Micro, 17(1):34--39, Jan.-Feb. 1997.

[13] K. Gharachorloo, A. Gupta, and J. Hennessy. Hiding Memory Latency Using Dynamic Scheduling in Shared-memory Multiprocessors. Proceedings of the 19th Annual Intl. Symposium on Computer Architecture, pages 22--33, May 1992.

[14] K. Gharachorloo, A. Gupta, and J. Hennessy. Performance Evaluation of Memory Consistency Models for Shared-memory Multiprocessors. Proceedings of the 4th Intl. Conference on Architectural Support for Programming Languages and Operating Systems, pages 245--257, April 1991.

[15] C. Gniady, B. Falsafi, and T. Vijaykumar. Is SC + ILP = RC? Proceedings of the 26th Annual Intl. Symposium on Computer Architecture, pages 162--171, May 1999.

[16] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, California, 3rd edition, 2003.

[17] W. Hwu and Y. Patt. Checkpoint Repair for Out-of-Order Execution Machines. pages 18--26, December 1987.

[18] D. Joseph and D. Grunwald. Prefetching Using Markov Predictors. Proceedings of the 24th Annual Intl. Symposium on Computer Architecture, pages 252--263, June 1997.

[19] T. Karkhanis and J. Smith. A Day in the Life of a Cache Miss. 2nd Annual Workshop on Memory Performance Issues (WMPI), June 2002.

[20] M. Karlsson, F. Dahlgren, and P. Stenstrom. A Prefetching Technique for Irregular Accesses to Linked Data Structures. Proceedings of the 6th Intl. Conference on High Performance Computer Architecture, pages 206--217, January 2000.

[21] A. Klaiber and H. Levy. An Architecture for Software-Controlled Data Prefetching. Proceedings of the 18th Annual Intl. Symposium on Computer Architecture, pages 43--53, May 1991.

[22] J. Martinez, A. Cristal, M. Valero, and J. Llosa. Ephemeral Registers. Technical Report CSL-TR-2003-1035, Cornell Computer Systems Lab, 2003.

[23] T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez, and V. Vinals. Delaying Physical Register Allocation Through Virtual-Physical Registers. Proceedings of the 32nd Annual ACM/IEEE Intl. Symposium on Microarchitecture, pages 186--192, November 1999.

[24] T. Mowry, M. Lam, and A. Gupta. Design and Evaluation of a Compiler Algorithm for Prefetching. Proceedings of the 5th Intl. Conference on Architectural Support for Programming Languages and Operating Systems, pages 62--73, October 1992.

[25] S. Mukherjee, P. Bannon, S. Lang, A. Spink, and D. Webb. The Alpha 21364 Network Architecture. In Proceedings of Hot Interconnects 9, August 2001.

[26] N.R. Adiga et al. An Overview of the BlueGene/L Supercomputer. In Proceedings of Supercomputing '02, November 2002.

[27] D. Ortega, E. Ayguade, J.-L. Baer, and M. Valero. Cost-Effective Compiler Directed Memory Prefetching and Bypassing. Proceedings of the 11th Intl. Conference on Parallel Architectures and Compilation Techniques, pages 189--198, September 2002.

[28] V. Pai, P. Ranganathan, and S. Adve. RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors. IEEE TCCA Newsletter, 35(11):37--48, October 1997.

[29] I. Park, C. Ooi, and T. Vijaykumar. Reducing Design Complexity of the Load/Store Queue. Proceedings of the 36th Annual ACM/IEEE Intl. Symposium on Microarchitecture, pages 411--422, December 2003.

[30] V. Puente, J. Gregorio, and R. Beivide. SICOSYS: An Integrated Framework for Studying Interconnection Networks in Multiprocessor Systems. Proceedings of the 10th Euromicro Workshop on Parallel and Distributed Processing, pages 360--368, January 2002.

[31] V. Puente, J. Gregorio, R. Beivide, and C. Izu. On the Design of a High-Performance Adaptive Router for CC-NUMA Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 14(5), May 2003.

[32] V. Puente, C. Izu, J. Gregorio, R. Beivide, and F. Vallejo. The Adaptive Bubble Router. Journal on Parallel and Distributed Computing, 61(9):1180--1208, September 2001.

[33] P. Ranganathan, V. Pai, and S. Adve. Using Speculative Retirement and Larger Instruction Windows to Narrow the Performance Gap between Memory Consistency Models. In Proceedings of the 9th Symposium on Parallel Algorithms and Architectures, June 1997.

[34] A. Roth and G. S. Sohi. Speculative Data-Driven Multithreading. Proceedings of the 7th Intl. Conference on High Performance Computer Architecture, pages 37--50, January 2001.

[35] S. Sethumadhavan, R. Desikan, D. Burger, C. Moore, and S. Keckler. Scalable Hardware Memory Disambiguation for High ILP Processors. Proceedings of the 36th Annual ACM/IEEE Intl. Symposium on Microarchitecture, pages 399--410, December 2003.

[36] J. Singh, W. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Computer Architecture News, 20(1):5--44, March 1992.

[37] A. Smith. Cache Memories. Computing Surveys, 14(3):473--530, September 1982.

[38] Y. Solihin, J. Lee, and J. Torrellas. Using a User-Level Memory Thread for Correlation Prefetching. Proceedings of the 29th Annual Intl. Symposium on Computer Architecture, pages 171--182, May 2002.

[39] C. Stunkel, J. Herring, B. Abali, and R. Sivaram. A New Switch Chip for IBM RS/6000 SP Systems. In Proceedings of Supercomputing '99, November 1999.

[40] R. Tomasulo. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development, (11):25--33, January 1967.

[41] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. Proceedings of the 22nd Annual Intl. Symposium on Computer Architecture, pages 24--36, June 1995.

[42] W. Wulf and S. McKee. Hitting the Memory Wall: Implications of the Obvious. Computer Architecture News, 23(1):20--24, March 1995.

[43] C. Zilles and G. Sohi. Execution-based Prediction using Speculative Slices. Proceedings of the 28th Annual Intl. Symposium on Computer Architecture, pages 2--13, July 2001.

Index Terms

A first glance at Kilo-instruction based multiprocessors
1. Computer systems organization
  1. Architectures

Published In

CF '04: Proceedings of the 1st conference on Computing frontiers

April 2004

522 pages

ISBN:1581137419

DOI:10.1145/977091

General Chair: Stamatis Vassiliadis, Delft University of Technology, The Netherlands

Program Chairs: Jean-Luc Gaudiot, University of California at Irvine, USA; Vincenzo Piuri, University of Milan, Italy

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CF04

Sponsor:

CF04: Computing Frontiers Conference

April 14 - 16, 2004

Ischia, Italy

Acceptance Rates

Overall Acceptance Rate 273 of 785 submissions, 35%

Bibliometrics

Article Metrics

Total Citations: 8
Total Downloads: 476
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 14 Feb 2025

Cited By

Kaxiras S, Martonosi M (2008). Computer Architecture Techniques for Power-Efficiency. Synthesis Lectures on Computer Architecture 3:1 (1-207). Online publication date: Jan-2008.
https://doi.org/10.2200/S00119ED1V01Y200805CAC004

Galluzzi M, Vallejo E, Cristal A, Vallejo F, Beivide R, Stenström P, Smith J, Valero M (2007). Implicit transactional memory in kilo-instruction multiprocessors. Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture (339-353). Online publication date: 23-Aug-2007.
https://dl.acm.org/doi/10.5555/2392163.2392195

Galluzzi M, Vallejo E, Cristal A, Vallejo F, Beivide R, Stenström P, Smith J, Valero M (2007). Implicit Transactional Memory in Kilo-Instruction Multiprocessors. Advances in Computer Systems Architecture (339-353). Online publication date: 2007.
https://doi.org/10.1007/978-3-540-74309-5_32

Vachharajani N, Iyer M, Ashok C, Vachharajani M, August D, Connors D (2005). Chip multi-processor scalability for single-threaded applications. ACM SIGARCH Computer Architecture News 33:4 (44-53). Online publication date: 1-Nov-2005.
https://dl.acm.org/doi/10.1145/1105734.1105741

Vallejo E, Galluzzi M, Cristal A, Vallejo F, Beivide R, Stenstrom P, Smith J, Valero M (2005). Implementing kilo-instruction multiprocessors. ICPS '05. Proceedings. International Conference on Pervasive Services, 2005 (325-336). Online publication date: 2005.
https://doi.org/10.1109/PERSER.2005.1506430

Cristal A, Santana O, Cazorla F, Galluzzi M, Ramirez T, Pericas M, Valero M (2005). Kilo-Instruction Processors. IEEE Micro 25:3 (48-57). Online publication date: 1-May-2005.
https://dl.acm.org/doi/10.1109/MM.2005.53

Kirman M, Kirman N, Martínez J (2005). Cherry-MP. Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture (245-256). Online publication date: 12-Nov-2005.
https://dl.acm.org/doi/10.1109/MICRO.2005.15

Cristal A, Santana O, Valero M (2004). Maintaining Thousands of In-flight Instructions. Euro-Par 2004 Parallel Processing (9-20). Online publication date: 2004.
https://doi.org/10.1007/978-3-540-27866-5_2
