DOI: 10.1145/3577193.3593719

SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation

Published: 21 June 2023

Abstract

Fast and accurate climate simulations and weather predictions are critical for understanding and preparing for the impact of climate change. Real-world climate and weather simulations involve the use of complex compound stencil kernels, which are composed of a combination of different stencils. Horizontal diffusion is one such important compound stencil found in many climate and weather prediction models. Its computation involves a large amount of data access and manipulation that leads to two main issues on current computing systems. First, such compound stencils have high memory bandwidth demands as they require large amounts of data access. Second, compound stencils have complex data access patterns and poor data locality, as the memory access pattern is typically irregular with low arithmetic intensity. As a result, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. Recent works propose using FPGAs as an alternative to traditional CPU and GPU-based systems to accelerate weather stencil kernels. However, we observe that stencil computation cannot leverage the bit-level flexibility available on an FPGA because of its complex memory access patterns, leading to high hardware resource utilization and low peak performance.
We introduce SPARTA, a novel spatial accelerator for horizontal diffusion weather stencil computation. We exploit the two-dimensional spatial architecture to efficiently accelerate the horizontal diffusion stencil by designing the first scaled-out spatial accelerator using the MLIR (Multi-Level Intermediate Representation) compiler framework. We evaluate SPARTA on a real cutting-edge AMD-Xilinx Versal AI Engine (AIE) spatial architecture. Our real-system evaluation results demonstrate that SPARTA outperforms state-of-the-art CPU, GPU, and FPGA implementations by 17.1×, 1.2×, and 2.1×, respectively. Compared to the most energy-efficient design on an HBM-based FPGA, SPARTA provides 2.43× higher energy efficiency. Our results reveal that balancing workload across the available processing resources is crucial in achieving high performance on spatial architectures. We also implement and evaluate five elementary stencils that are commonly used as benchmarks for stencil computation research. We freely open-source all our implementations to aid future research in stencil computation and spatial computing systems at https://github.com/CMU-SAFARI/SPARTA.
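To make the notion of a compound stencil concrete, the following is a minimal sketch of one horizontal diffusion step on a 2D grid, written in plain Python. The abstract does not state the equations, so the formulation here (a 5-point Laplacian, followed by flux stencils with a simple limiter, followed by an output stencil) is an assumption based on the form commonly seen in weather stencil benchmarks; the function name `horizontal_diffusion`, the helper `limited`, and the scalar `coeff` are hypothetical and are not taken from the SPARTA code base.

```python
def horizontal_diffusion(inp, coeff):
    """One horizontal diffusion step on a 2D grid given as a list of lists.

    Assumed formulation (not taken from the paper): Laplacian -> limited
    fluxes -> output. Points closer than 2 cells to the edge are copied.
    """
    ni, nj = len(inp), len(inp[0])

    # Stage 1: 5-point Laplacian stencil on every point with 4 neighbors.
    lap = [[0.0] * nj for _ in range(ni)]
    for i in range(1, ni - 1):
        for j in range(1, nj - 1):
            lap[i][j] = 4.0 * inp[i][j] - (
                inp[i - 1][j] + inp[i + 1][j] + inp[i][j - 1] + inp[i][j + 1]
            )

    def limited(d_lap, d_in):
        # Flux limiter: drop the flux when it has the same sign as the
        # gradient of the input field.
        return 0.0 if d_lap * d_in > 0.0 else d_lap

    # Stage 2: flux stencils in the i and j directions, built on stage 1.
    flx = [[0.0] * nj for _ in range(ni)]
    fly = [[0.0] * nj for _ in range(ni)]
    for i in range(1, ni - 2):
        for j in range(1, nj - 1):
            flx[i][j] = limited(lap[i + 1][j] - lap[i][j],
                                inp[i + 1][j] - inp[i][j])
    for i in range(1, ni - 1):
        for j in range(1, nj - 2):
            fly[i][j] = limited(lap[i][j + 1] - lap[i][j],
                                inp[i][j + 1] - inp[i][j])

    # Stage 3: output stencil combining the input with flux differences.
    out = [row[:] for row in inp]
    for i in range(2, ni - 2):
        for j in range(2, nj - 2):
            out[i][j] = inp[i][j] - coeff * (
                flx[i][j] - flx[i - 1][j] + fly[i][j] - fly[i][j - 1]
            )
    return out


if __name__ == "__main__":
    grid = [[0.0] * 7 for _ in range(7)]
    grid[3][3] = 1.0  # single spike in the middle of the grid
    result = horizontal_diffusion(grid, coeff=0.1)
    for row in result:
        print(" ".join(f"{v:6.2f}" for v in row))
```

In this sketch each output point depends on input values up to two cells away, and the three stages depend on one another, so the intermediate `lap`, `flx`, and `fly` fields must be either stored or recomputed. This is one way to see why such kernels perform few arithmetic operations per byte of data moved.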


Information & Contributors

Information

Published In

ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing
June 2023
505 pages
ISBN: 9798400700569
DOI: 10.1145/3577193
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. spatial computing systems
  2. dataflow architectures
  3. high-performance computing
  4. hybrid systems
  5. weather prediction
  6. stencil computation
  7. memory access patterns
  8. climate modeling

Qualifiers

  • Research-article

Conference

ICS '23
Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 194
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 28 Sep 2024

Citations

Cited By

  • (2024) RUBICON: a framework for designing efficient deep learning-based genomic basecallers. Genome Biology 25:1. DOI: 10.1186/s13059-024-03181-2. Online publication date: 16-Feb-2024
  • (2024) Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs. 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 54-65. DOI: 10.1109/FCCM60383.2024.00015. Online publication date: 5-May-2024
  • (2023) MaxEVA: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine. 2023 International Conference on Field Programmable Technology (ICFPT), pp. 96-105. DOI: 10.1109/ICFPT59805.2023.00016. Online publication date: 12-Dec-2023
  • (2023) AIM: Accelerating Arbitrary-Precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), pp. 1-9. DOI: 10.1109/ICCAD57390.2023.10323754. Online publication date: 28-Oct-2023
  • EA4RCA: Efficient AIE accelerator design framework for regular Communication-Avoiding Algorithm. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3678010
