Principal kernel analysis: A tractable methodology to simulate scaled GPU workloads
C Avalos Baddouh, M Khairy, RN Green… - MICRO-54: 54th Annual …, 2021 - dl.acm.org
Simulating all threads in a scaled GPU workload results in prohibitive simulation cost. Cycle-level
simulation is orders of magnitude slower than native silicon, the only solution is to …
Forecasting GPU Performance for Deep Learning Training and Inference
S Lee, A Phanishayee, D Mahajan - Proceedings of the 30th ACM …, 2025 - dl.acm.org
Deep learning kernels exhibit a high level of predictable memory accesses and compute
patterns, making GPU's architecture well-suited for their execution. Moreover, software and …
Treelet prefetching for ray tracing
Ray tracing is traditionally only used in offline rendering to produce images of high fidelity
because it is computationally expensive. Recent Graphics Processing Units (GPUs) have …
CRISP: Concurrent Rendering and Compute Simulation Platform for GPUs
… We would like to thank Cesar Avalos for his help in the project. We would also like to
thank Shichen Qiao and Matthew D. Sinclair for their work on per-stream stat in GPGPU-Sim. …
[PDF] Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads
Simulating all threads in a scaled GPU workload results in prohibitive simulation cost. Cycle-level
simulation is orders of magnitude slower than native silicon, the only solution is to …
[PDF] Accelerating the Evaluation of Large Workloads on Post-Dennard Systems using Sampling
A Sabu - alenks.github.io
With the end of Moore’s law, computer architects have turned to alternative approaches to
enhance computational capabilities. One prominent strategy involves a shift towards …
Data-driven Forecasting of Deep Learning Performance on GPUs
Deep learning kernels exhibit predictable memory accesses and compute patterns, making
GPUs' parallel architecture well-suited for their execution. Software and runtime systems for …
Photon: A fine-grained sampled simulation methodology for GPU workloads
GPUs, due to their massively-parallel computing architectures, provide high performance for
data-parallel applications. However, existing GPU simulators are too slow to enable …
Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads
Today, DNNs’ high computational complexity and sub-optimal device utilization present a
major roadblock to democratizing DNNs. To reduce the execution time and improve device …
Development Of A Heterogeneous Architecture Simulation Framework
S Mohapatra - 2022 - etda.libraries.psu.edu
Heterogeneous systems consisting of processors of varying nature which complement each
other’s deficiencies are rapidly eclipsing the homogeneous systems of past. The consumer …