Abstract
Overcoming increasing memory latency is one of the main problems that microprocessor designers have faced over the years. The two basic techniques introduced to mitigate latencies are caches and out-of-order execution. However, neither of these solutions is adequate for hiding off-chip memory accesses on the order of 200 cycles or more. In theory, increasing the size of the instruction window would allow much longer latencies to be hidden, but scaling the structures to support thousands of in-flight instructions would be prohibitively expensive.
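As a rough illustration of why the window would have to grow so much, the following back-of-envelope sketch estimates how many instructions must be in flight for the processor to keep issuing during one off-chip miss. The function name, the issue rate of 4 instructions per cycle, and the 500-cycle case are assumptions chosen for illustration and do not come from the paper.

```python
import math


def window_needed(latency_cycles: int, sustained_ipc: float) -> int:
    """Estimate the in-flight instructions needed to cover one miss.

    Simplifying assumption: the core keeps issuing `sustained_ipc`
    instructions per cycle while the miss is outstanding, so the window
    must hold latency * IPC instructions behind the missing load.
    """
    return math.ceil(latency_cycles * sustained_ipc)


# Hypothetical design points: a 4-wide core and two miss latencies.
for latency in (200, 500):
    print(latency, "cycles ->", window_needed(latency, 4.0), "instructions")
```

Under these assumptions a 200-cycle miss already calls for a window of several hundred entries, and longer latencies push the requirement into the thousands.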
However, the distribution of instruction issue times in the presence of L2 cache misses is highly correlated. This paper describes this phenomenon of Execution Locality and shows how it can be exploited with an inexpensive microarchitecture consisting of two linked cores. This Decoupled Kilo-Instruction Processor (D-KIP) is very effective in recovering lost potential performance. Extensive simulations show that speed-ups of up to 379% are possible for numerical benchmarks thanks to the exploitation of impressive degrees of Memory-Level Parallelism (MLP) and the execution of independent instructions in the shadow of L2 misses.
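The role of Memory-Level Parallelism can be made concrete with a deliberately simple toy model. It is not the D-KIP microarchitecture or the simulation methodology of the paper: it assumes that all misses are independent, that they occur at a fixed spacing in the instruction stream, and that every miss inside the window overlaps with the oldest one. All names and parameter values are hypothetical.

```python
import math


def miss_stall_cycles(num_misses: int, gap: int, window: int, latency: int) -> int:
    """Toy estimate of the cycles spent waiting on L2 misses.

    Assumptions (illustrative only):
      * misses occur every `gap` instructions and are independent,
      * a window of `window` instructions starting at a miss also
        contains every later miss that falls inside it,
      * all misses in the same window overlap completely.
    """
    # Misses sit at offsets 0, gap, 2*gap, ... inside the window.
    overlapped = (window - 1) // gap + 1
    # Each group of overlapped misses costs one full memory latency.
    groups = math.ceil(num_misses / overlapped)
    return groups * latency


small = miss_stall_cycles(num_misses=100, gap=100, window=64, latency=200)
large = miss_stall_cycles(num_misses=100, gap=100, window=1024, latency=200)
print("small window:", small, "cycles; large window:", large, "cycles")
```

In this model the small window exposes every miss serially, while the kilo-instruction window overlaps about ten misses at a time. Real workloads have dependent misses and irregular spacing, so the sketch only shows the direction of the effect.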
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pericàs, M., Cristal, A., González, R., Jiménez, D.A., Valero, M. (2008). Exploiting Execution Locality with a Decoupled Kilo-Instruction Processor. In: Labarta, J., Joe, K., Sato, T. (eds) High-Performance Computing. ISHPC ALPS 2005 2006. Lecture Notes in Computer Science, vol 4759. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77704-5_5
DOI: https://doi.org/10.1007/978-3-540-77704-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77703-8
Online ISBN: 978-3-540-77704-5
eBook Packages: Computer Science, Computer Science (R0)