
ZSim: fast and accurate microarchitectural simulation of thousand-core systems

Published: 23 June 2013

Abstract

Architectural simulation is time-consuming, and the trend towards hundreds of cores is making sequential simulation even slower. Existing parallel simulation techniques either scale poorly due to excessive synchronization, or sacrifice accuracy by allowing event reordering and using simplistic contention models. As a result, most researchers use sequential simulators and model small-scale systems with 16-32 cores. With 100-core chips already available, developing simulators that scale to thousands of cores is crucial.
We present three novel techniques that, together, make thousand-core simulation practical. First, we speed up detailed core models (including OOO cores) with instruction-driven timing models that leverage dynamic binary translation. Second, we introduce bound-weave, a two-phase parallelization technique that scales parallel simulation on multicore hosts efficiently with minimal loss of accuracy. Third, we implement lightweight user-level virtualization to support complex workloads, including multiprogrammed, client-server, and managed-runtime applications, without the need for full-system simulation, sidestepping the lack of scalable OSs and ISAs that support thousands of cores.
We use these techniques to build zsim, a fast, scalable, and accurate simulator. On a 16-core host, zsim models a 1024-core chip at speeds of up to 1,500 MIPS using simple cores and up to 300 MIPS using detailed OOO cores, 2-3 orders of magnitude faster than existing parallel simulators. Simulator performance scales well with both the number of modeled cores and the number of host cores. We validate zsim against a real Westmere system on a wide variety of workloads, and find performance and microarchitectural events to be within a narrow range of the real system.
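The abstract describes bound-weave only as a two-phase parallelization technique. As a rough illustration, the toy sketch below assumes that the first ("bound") phase simulates each core in isolation with contention-free latencies while logging its shared-memory accesses, and that the second ("weave") phase replays the logged accesses in timestamp order to add contention delays. The function names, the latency constants, and the single shared bank are all hypothetical choices made for this example; this is not zsim's code, and it runs single-threaded for clarity.

```python
import heapq

ZERO_LOAD_LATENCY = 10  # hypothetical uncontended access latency, in cycles
BANK_BUSY_CYCLES = 4    # hypothetical cycles the shared bank is occupied per access


def bound_phase(gaps_per_core):
    """Simulate each core in isolation, assuming no contention.

    gaps_per_core[c] lists the compute cycles core c spends before each of
    its accesses. Returns, per core, the (issue_cycle, core) events it logged.
    Cores do not interact here, so this loop could be run in parallel.
    """
    traces = []
    for core, gaps in enumerate(gaps_per_core):
        cycle, events = 0, []
        for gap in gaps:
            cycle += gap
            events.append((cycle, core))
            cycle += ZERO_LOAD_LATENCY
        traces.append(events)
    return traces


def weave_phase(traces):
    """Replay logged accesses, in bound-phase issue order, through one bank.

    Returns the extra delay (in cycles) each core accumulates from contention.
    """
    extra = [0] * len(traces)
    bank_free_at = 0
    for issue, core in heapq.merge(*traces):
        arrival = issue + extra[core]        # shift by delays already incurred
        start = max(arrival, bank_free_at)   # wait if the bank is still busy
        extra[core] += start - arrival
        bank_free_at = start + BANK_BUSY_CYCLES
    return extra


if __name__ == "__main__":
    # Two cores issue accesses at the same cycles; one of them must wait.
    print(weave_phase(bound_phase([[5, 5], [5, 5]])))
```

The design point the sketch tries to convey is that the expensive per-core work happens without synchronization, and only the much smaller stream of logged accesses is ordered globally.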



Published In

ACM SIGARCH Computer Architecture News, Volume 41, Issue 3
ISCA '13
June 2013
666 pages
ISSN:0163-5964
DOI:10.1145/2508148
  • ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture
    June 2013
    686 pages
    ISBN:9781450320795
    DOI:10.1145/2485922
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article


Cited By

  • (2024) GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems. ACM Transactions on Architecture and Code Optimization 21:3 (1-25). DOI: 10.1145/3661998. Online publication date: 26-Apr-2024
  • (2024) A Survey of Computing-in-Memory Processor: From Circuit to Application. IEEE Open Journal of the Solid-State Circuits Society 4 (25-42). DOI: 10.1109/OJSSCS.2023.3328290. Online publication date: 2024
  • (2024) MindPalace: A Framework for Studying Microarchitecture Design of Function-as-a-Service. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (313-315). DOI: 10.1109/ISPASS61541.2024.00042. Online publication date: 5-May-2024
  • (2024) MuchiSim: A Simulation Framework for Design Exploration of Multi-Chip Manycore Systems. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (48-60). DOI: 10.1109/ISPASS61541.2024.00015. Online publication date: 5-May-2024
  • (2024) Tartan: Microarchitecting a Robotic Processor. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) (548-565). DOI: 10.1109/ISCA59077.2024.00047. Online publication date: 29-Jun-2024
  • (2024) A Comparative Study on Simulation Frameworks for AI Accelerator Evaluation. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (321-328). DOI: 10.1109/IPDPSW63119.2024.00073. Online publication date: 27-May-2024
  • (2024) An Architecture-Level Framework for Enabling Processing-Using-Memory Simulations in Deep Neural Networks. 2024 International Conference on Electronics, Information, and Communication (ICEIC) (1-3). DOI: 10.1109/ICEIC61013.2024.10457163. Online publication date: 28-Jan-2024
  • (2024) Gem5-AVX: Extension of the Gem5 Simulator to Support AVX Instruction Sets. IEEE Access 12 (20767-20778). DOI: 10.1109/ACCESS.2024.3359296. Online publication date: 2024
  • (2024) HOPE: Holistic STT-RAM Architecture Exploration Framework for Future Cross-Platform Analysis. IEEE Access 12 (16598-16609). DOI: 10.1109/ACCESS.2024.3358891. Online publication date: 2024
  • (2024) Viper: Utilizing Hierarchical Program Structure to Accelerate Multi-Core Simulation. IEEE Access 12 (17669-17678). DOI: 10.1109/ACCESS.2024.3354069. Online publication date: 2024
