SHIFT: shared history instruction fetch for lean-core server processors

Authors:

Babak Falsafi

MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 272 - 283

https://doi.org/10.1145/2540708.2540732

Published: 07 December 2013

Abstract

In server workloads, large instruction working sets result in high L1 instruction cache miss rates. Fast access requirements preclude instruction caches large enough to accommodate the deep software stacks prevalent in server applications. Prefetching is a promising approach to mitigating instruction-fetch stalls: it relies on the recurring instruction streams of server workloads to predict future instruction misses. By recording and replaying instruction streams from dedicated storage next to each core, stream-based prefetchers have been shown to overcome instruction-fetch stalls. However, existing stream-based prefetchers incur high history storage costs because of the large instruction working sets and complex control flow inherent in server workloads. These storage requirements prohibit their use in emerging lean-core server processors.

We introduce Shared History Instruction Fetch (SHIFT), an instruction prefetcher suitable for lean-core server processors. By sharing the history across cores, SHIFT minimizes the cost per core without sacrificing miss coverage. Moreover, by embedding the shared instruction history in the LLC, SHIFT obviates the need for dedicated instruction history storage, while transparently enabling multiple instruction histories in the presence of workload consolidation. In a 16-core server CMP, SHIFT eliminates 81% (up to 93%) of instruction cache misses, achieving an average speedup of 19% (up to 42%). SHIFT captures 90% of the performance benefit of the state-of-the-art instruction prefetcher at 14x lower storage cost.
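The record-and-replay idea with a history shared across cores can be illustrated with a toy model. The sketch below is not the paper's design; it is a minimal illustration under stated assumptions: one core records every fetched block address into a single circular log, an index maps each block to its most recent position in that log, and any core that misses looks up the index and prefetches the next few recorded blocks. All names (`SharedHistory`, `Core`, `run_demo`), the LRU cache model, and the parameter values are hypothetical.

```python
from collections import OrderedDict


class SharedHistory:
    """Toy circular log of instruction block addresses shared by all cores."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = [None] * capacity
        self.next_pos = 0   # total number of records written so far
        self.index = {}     # block address -> position of its latest record

    def record(self, block):
        self.buffer[self.next_pos % self.capacity] = block
        self.index[block] = self.next_pos
        self.next_pos += 1

    def replay(self, miss_block, lookahead):
        """Return up to `lookahead` blocks that followed miss_block last time."""
        pos = self.index.get(miss_block)
        if pos is None or pos < self.next_pos - self.capacity:
            return []  # never recorded, or already overwritten
        start = pos + 1
        end = min(start + lookahead, self.next_pos)
        return [self.buffer[p % self.capacity] for p in range(start, end)]


class Core:
    """Core with a small LRU instruction cache (assumed model, not from the paper)."""

    def __init__(self, cache_blocks, history=None, lookahead=4, records=False):
        self.cache = OrderedDict()
        self.cache_blocks = cache_blocks
        self.history = history
        self.lookahead = lookahead
        self.records = records
        self.misses = 0

    def _insert(self, block):
        self.cache[block] = True
        self.cache.move_to_end(block)
        while len(self.cache) > self.cache_blocks:
            self.cache.popitem(last=False)  # evict least recently used

    def fetch(self, block):
        predicted = []
        if block not in self.cache:
            self.misses += 1
            if self.history is not None:
                # Look up the shared history before recording this access.
                predicted = self.history.replay(block, self.lookahead)
        if self.records and self.history is not None:
            self.history.record(block)
        self._insert(block)
        for b in predicted:
            self._insert(b)  # prefetch the predicted blocks


def run_demo(stream_len=64, passes=5, cache_blocks=16):
    stream = list(range(stream_len))

    baseline = Core(cache_blocks)  # no prefetcher
    for _ in range(passes):
        for b in stream:
            baseline.fetch(b)

    history = SharedHistory(capacity=256)
    recorder = Core(cache_blocks, history, records=True)
    for b in stream:               # one core populates the shared history
        recorder.fetch(b)

    reader = Core(cache_blocks, history)  # another core only reads it
    for _ in range(passes):
        for b in stream:
            reader.fetch(b)

    return baseline.misses, reader.misses


if __name__ == "__main__":
    base, shared = run_demo()
    print(f"misses without prefetching: {base}, with shared history: {shared}")
```

In this toy setup the reading core keeps no history of its own, yet it avoids most of the misses that the baseline core takes on the same recurring stream, which is the intuition behind amortizing one history across cores that run the same workload.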

Index Terms

SHIFT: shared history instruction fetch for lean-core server processors
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Recommendations

Confluence: unified instruction supply for scale-out servers
MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture

Multi-megabyte instruction working sets of server workloads defy the capacities of latency-critical instruction-supply components of a core; the instruction cache (L1-I) and the branch target buffer (BTB). Recent work has proposed dedicated prefetching ...

RDIP: return-address-stack directed instruction prefetching
MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

L1 instruction fetch misses remain a critical performance bottleneck, accounting for up to 40% slowdowns in server applications. Whereas instruction footprints typically fit within last-level caches, they overwhelm L1 caches, whose capacity is limited ...

Proactive instruction fetch
MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

Fast access requirements preclude building L1 instruction caches large enough to capture the working set of server workloads. Efforts exist to mitigate limited L1 instruction cache capacity by relying on the stability and repetitiveness of the ...

Information & Contributors

Information

Published In

MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

December 2013

498 pages

ISBN:9781450326384

DOI:10.1145/2540708

General Chair: Matthew Farrens (UC Davis)
Program Chair: Christos Kozyrakis (Stanford University)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Swiss National Science Foundation

Conference

MICRO-46

Sponsor:

SIGMICRO

MICRO-46: The 46th Annual IEEE/ACM International Symposium on Microarchitecture

December 7 - 11, 2013

Davis, California

Acceptance Rates

MICRO-46 Paper Acceptance Rate 39 of 239 submissions, 16%;

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Bibliometrics

Total Citations: 39
Total Downloads: 554
Downloads (Last 12 months): 40
Downloads (Last 6 weeks): 2

Reflects downloads up to 18 Dec 2024

Citations

Cited By

Brunner R, Kumar R (2024). Weeding out Front-End Stalls with Uneven Block Size Instruction Cache. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1382-1396. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00102

Oh S, Xu M, Khan T, Kasikci B, Litz H (2024). UDP: Utility-Driven Fetch Directed Instruction Prefetching. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 1188-1201. Online publication date: 29-Jun-2024.
https://doi.org/10.1109/ISCA59077.2024.00089

Hassan M, Park C, Black-Schaffer D (2023). Protean: Resource-efficient Instruction Prefetching. Proceedings of the International Symposium on Memory Systems, pp. 1-13. Online publication date: 2-Oct-2023.
https://dl.acm.org/doi/10.1145/3631882.3631904

Asheim T, Grot B, Kumar R (2023). A Storage-Effective BTB Organization for Servers. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1153-1167. Online publication date: Feb-2023.
https://doi.org/10.1109/HPCA56546.2023.10070938

Kumar R, Grot B (2022). Shooting Down the Server Front-End Bottleneck. ACM Transactions on Computer Systems 38:3-4, pp. 1-30. Online publication date: 4-Jan-2022.
https://dl.acm.org/doi/10.1145/3484492

Schall D, Margaritov A, Ustiugov D, Sandberg A, Grot B, Salapura V, Zahran M, Chong F, Tang L (2022). Lukewarm serverless functions. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 757-770. Online publication date: 18-Jun-2022.
https://dl.acm.org/doi/10.1145/3470496.3527390

Ansari A, Golshan F, Barati R, Lotfi-Kamran P, Sarbazi-Azad H (2022). MANA: Microarchitecting a Temporal Instruction Prefetcher. IEEE Transactions on Computers, pp. 1-1. Online publication date: 2022.
https://doi.org/10.1109/TC.2022.3176825

Cho M, Hur J, Jang W (2021). Clean-Prefetcher: Look-Ahead Prefetching without Cache Pollution. IEICE Electronics Express. Online publication date: 2021.
https://doi.org/10.1587/elex.18.20210027

Khan T, Brown N, Sriraman A, Soundararajan N, Kumar R, Devietti J, Subramoney S, Pokam G, Litz H, Kasikci B (2021). Twig: Profile-Guided BTB Prefetching for Data Center Applications. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 816-829. Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1145/3466752.3480124

Pourhabibi A, Sutherland M, Daglis A, Falsafi B (2021). Cerebros: Evading the RPC Tax in Datacenters. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 407-420. Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1145/3466752.3480055
