Exploiting Execution Locality with a Decoupled Kilo-Instruction Processor

Miquel Pericàs¹,², Adrian Cristal², Ruben González¹, Daniel A. Jiménez³ & Mateo Valero¹,²

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4759)

Abstract

Overcoming increasing memory latency is one of the main problems that microprocessor designers have faced over the years. The two basic techniques introduced to mitigate latencies are caches and out-of-order execution. However, neither of these solutions is adequate for hiding off-chip memory accesses on the order of 200 cycles or more. Theoretically, increasing the size of the instruction window would allow much longer latencies to be hidden, but scaling the structures to support thousands of in-flight instructions would be prohibitively expensive.

The distribution of instruction issue times in the presence of L2 cache misses, however, is highly correlated. This paper describes this phenomenon of Execution Locality and shows how it can be exploited with an inexpensive microarchitecture consisting of two linked cores. This Decoupled Kilo-Instruction Processor (D-KIP) is very effective at recovering lost potential performance. Extensive simulations show that speed-ups of up to 379% are possible for numerical benchmarks, thanks to the exploitation of impressive degrees of Memory-Level Parallelism (MLP) and the execution of independent instructions in the shadow of L2 misses.
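The window-size argument in the abstract can be illustrated with a back-of-the-envelope sketch. This is not the authors' model: it assumes a deliberately simple rule of thumb in which the window must hold roughly (miss latency × issue width) instructions for the core to keep issuing across a single miss. The 200-cycle latency comes from the abstract; the 4-wide issue width and the 128- and 1024-entry windows are hypothetical values chosen only for illustration, and instruction dependences are ignored.

```python
def window_needed(miss_latency_cycles: int, issue_width: int) -> int:
    """Rough count of in-flight instructions needed to keep issuing
    for the whole duration of one off-chip miss (rule of thumb only)."""
    return miss_latency_cycles * issue_width


def hidden_fraction(window_size: int, miss_latency_cycles: int, issue_width: int) -> float:
    """Fraction of the miss latency a window of the given size could cover
    under the same simplistic model (capped at 1.0)."""
    needed = window_needed(miss_latency_cycles, issue_width)
    return min(1.0, window_size / needed)


if __name__ == "__main__":
    latency = 200  # cycles, as quoted in the abstract
    width = 4      # hypothetical issue width
    print("instructions needed:", window_needed(latency, width))
    for window in (128, 1024):  # hypothetical window sizes
        frac = hidden_fraction(window, latency, width)
        print(f"window={window}: covers about {frac:.0%} of the miss")
```

Under these assumptions a conventional window of about a hundred entries covers only a small part of a 200-cycle miss, while a window of roughly a thousand entries covers all of it, which is the motivation for a kilo-instruction design.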


Author information

Authors and Affiliations

Computer Architecture Department, Technical University of Catalonia (UPC), Jordi Girona, 1-3, Mòdul D6 Campus Nord, 08034, Barcelona, Spain
Miquel Pericàs, Ruben González & Mateo Valero
Barcelona Supercomputing Center (BSC), Jordi Girona, 29, Edifici Nexus-II Campus Nord, 08034, Barcelona, Spain
Miquel Pericàs, Adrian Cristal & Mateo Valero
Department of Computer Science, The University of Texas at San Antonio (UTSA), Science Building, One UTSA Circle, San Antonio, TX 78249-1644, USA
Daniel A. Jiménez

Editor information

Jesús Labarta, Kazuki Joe, Toshinori Sato

Cite this paper

Pericàs, M., Cristal, A., González, R., Jiménez, D.A., Valero, M. (2008). Exploiting Execution Locality with a Decoupled Kilo-Instruction Processor. In: Labarta, J., Joe, K., Sato, T. (eds) High-Performance Computing. ISHPC ALPS 2005 2006. Lecture Notes in Computer Science, vol 4759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77704-5_5

DOI: https://doi.org/10.1007/978-3-540-77704-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77703-8
Online ISBN: 978-3-540-77704-5
eBook Packages: Computer Science, Computer Science (R0)
