DOI: 10.1145/2540708.2540732
research-article

SHIFT: shared history instruction fetch for lean-core server processors

Published: 07 December 2013

Abstract

In server workloads, large instruction working sets result in high L1 instruction cache miss rates. Fast access requirements preclude large instruction caches that can accommodate the deep software stacks prevalent in server applications. Prefetching has been a promising approach to mitigate instruction-fetch stalls by relying on recurring instruction streams of server workloads to predict future instruction misses. By recording and replaying instruction streams from dedicated storage next to each core, stream-based prefetchers have been shown to overcome instruction fetch stalls. Problematically, existing stream-based prefetchers incur high history storage costs resulting from large instruction working sets and complex control flow inherent in server workloads. The high storage requirements of these prefetchers prohibit their use in emerging lean-core server processors.
We introduce Shared History Instruction Fetch (SHIFT), an instruction prefetcher suitable for lean-core server processors. By sharing the history across cores, SHIFT minimizes the cost per core without sacrificing miss coverage. Moreover, by embedding the shared instruction history in the LLC, SHIFT obviates the need for dedicated instruction history storage, while transparently enabling multiple instruction histories in the presence of workload consolidation. In a 16-core server CMP, SHIFT eliminates 81% (up to 93%) of instruction cache misses and achieves a 19% (up to 42%) speedup on average. SHIFT captures 90% of the performance benefit of the state-of-the-art instruction prefetcher at 14x lower storage cost.
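
The record-and-replay idea with a history shared across cores can be illustrated with a toy model. The sketch below is not SHIFT's hardware design: the class names (`SharedHistory`, `Core`), the circular log with a dictionary index, the LRU cache model, the choice of a single recording core, and the `lookahead` depth are all assumptions made for illustration. One core records the sequence of instruction blocks it fetches; on a miss, any core looks up the missed block in the shared log and prefetches the blocks that followed it.

```python
from __future__ import annotations
from collections import OrderedDict


class SharedHistory:
    """Hypothetical shared log of instruction block addresses (circular buffer)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.log: list[int | None] = [None] * capacity
        self.head = 0                      # number of records written so far
        self.index: dict[int, int] = {}    # block -> sequence number of latest record

    def record(self, block: int) -> None:
        self.log[self.head % self.capacity] = block
        self.index[block] = self.head
        self.head += 1

    def replay(self, block: int, lookahead: int) -> list[int]:
        """Return up to `lookahead` blocks that followed `block` in the log."""
        seq = self.index.get(block)
        if seq is None or seq < self.head - self.capacity:
            return []                      # never recorded, or already overwritten
        end = min(seq + 1 + lookahead, self.head)
        return [self.log[s % self.capacity] for s in range(seq + 1, end)]


class Core:
    """Toy core with a small LRU instruction cache (an assumed model)."""

    def __init__(self, history: SharedHistory, cache_blocks: int,
                 lookahead: int, records_history: bool = False) -> None:
        self.history = history
        self.cache_blocks = cache_blocks
        self.lookahead = lookahead
        self.records_history = records_history
        self.cache: OrderedDict[int, bool] = OrderedDict()
        self.misses = 0

    def _fill(self, block: int) -> None:
        self.cache[block] = True
        self.cache.move_to_end(block)          # mark as most recently used
        while len(self.cache) > self.cache_blocks:
            self.cache.popitem(last=False)     # evict least recently used

    def fetch(self, block: int) -> bool:
        hit = block in self.cache
        if not hit:
            self.misses += 1
            # On a miss, prefetch the blocks that followed it in the shared log.
            for nxt in self.history.replay(block, self.lookahead):
                self._fill(nxt)
        self._fill(block)
        if self.records_history:
            self.history.record(block)
        return hit


def demo() -> tuple[int, int]:
    stream = list(range(8))                    # one pass over 8 instruction blocks
    history = SharedHistory(capacity=64)
    recorder = Core(history, cache_blocks=4, lookahead=3, records_history=True)
    follower = Core(history, cache_blocks=4, lookahead=3)
    for block in stream:
        recorder.fetch(block)
    for block in stream:
        follower.fetch(block)
    return recorder.misses, follower.misses
```

In this toy run, `demo()` returns `(8, 2)`: the recording core misses on every block the first time through, while the second core, which keeps no history of its own, misses only on blocks 0 and 4 because each miss triggers prefetches of the next three blocks from the shared log. The numbers describe this sketch only and say nothing about the coverage figures reported in the abstract.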

    Published In

    MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
    December 2013
    498 pages
    ISBN:9781450326384
    DOI:10.1145/2540708
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. branch prediction
    2. caching
    3. instruction streaming
    4. prefetching

    Qualifiers

    • Research-article

    Conference

    MICRO-46

    Acceptance Rates

    MICRO-46 Paper Acceptance Rate 39 of 239 submissions, 16%;
    Overall Acceptance Rate 484 of 2,242 submissions, 22%

    Cited By

    • (2024) Weeding out Front-End Stalls with Uneven Block Size Instruction Cache. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1382-1396. DOI: 10.1109/MICRO61859.2024.00102. Online publication date: 2-Nov-2024
    • (2024) UDP: Utility-Driven Fetch Directed Instruction Prefetching. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pp. 1188-1201. DOI: 10.1109/ISCA59077.2024.00089. Online publication date: 29-Jun-2024
    • (2023) Protean: Resource-efficient Instruction Prefetching. Proceedings of the International Symposium on Memory Systems, pp. 1-13. DOI: 10.1145/3631882.3631904. Online publication date: 2-Oct-2023
    • (2023) A Storage-Effective BTB Organization for Servers. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 1153-1167. DOI: 10.1109/HPCA56546.2023.10070938. Online publication date: Feb-2023
    • (2022) Shooting Down the Server Front-End Bottleneck. ACM Transactions on Computer Systems 38:3-4, pp. 1-30. DOI: 10.1145/3484492. Online publication date: 4-Jan-2022
    • (2022) Lukewarm serverless functions. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 757-770. DOI: 10.1145/3470496.3527390. Online publication date: 18-Jun-2022
    • (2022) MANA: Microarchitecting a Temporal Instruction Prefetcher. IEEE Transactions on Computers, pp. 1-1. DOI: 10.1109/TC.2022.3176825. Online publication date: 2022
    • (2021) Clean-Prefetcher: Look-Ahead Prefetching without Cache Pollution. IEICE Electronics Express. DOI: 10.1587/elex.18.20210027. Online publication date: 2021
    • (2021) Twig: Profile-Guided BTB Prefetching for Data Center Applications. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 816-829. DOI: 10.1145/3466752.3480124. Online publication date: 18-Oct-2021
    • (2021) Cerebros: Evading the RPC Tax in Datacenters. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 407-420. DOI: 10.1145/3466752.3480055. Online publication date: 18-Oct-2021
