- poster, September 2020
Bandwidth Bottleneck in Network-on-Chip for High-Throughput Processors
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 157–158, https://doi.org/10.1145/3410463.3414673
A critical component of high-throughput processors such as GPGPUs is the network-on-chip (NoC) that interconnects the cores and the memory partitions together. Different NoC architectures for throughput processors have been proposed but they have often ...
- research-article, September 2020
MEPHESTO: Modeling Energy-Performance in Heterogeneous SoCs and Their Trade-Offs
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 413–425, https://doi.org/10.1145/3410463.3414671
Integrated shared memory heterogeneous architectures are pervasive because they satisfy the diverse needs of mobile, autonomous, and edge computing platforms. Although specialized processing units (PUs) that share a unified system memory improve ...
- poster, September 2020
Deep Learning Assisted Resource Partitioning for Improving Performance on Commodity Servers
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 153–154, https://doi.org/10.1145/3410463.3414668
In this paper, we introduce a deep reinforcement learning (DRL) framework for solving the problem of partitioning LLC and memory bandwidth coordinately in an end-to-end manner. To this end, we formulate the problem as a Markov decision process and ...
- poster, September 2020
Decoupled Address Translation for Heterogeneous Memory Systems
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 155–156, https://doi.org/10.1145/3410463.3414662
Support for heterogeneous memory in conventional virtual memory has an inherent problem. To keep translation efficient in the latency-critical translation lookaside buffers (TLBs), page sizes have been growing. However, the heterogeneous memory ...
- poster, September 2020
Collective Affinity Aware Computation Mapping
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 343–344, https://doi.org/10.1145/3410463.3414661
This work defines the concept of collective affinity. It is claimed that collective affinity has more potential than single core-centric affinity for data locality optimization in manycores. The reason is that collective affinity captures the potential ...
- research-article, September 2020
GOPipe: A Granularity-Oblivious Programming Framework for Pipelined Stencil Executions on GPU
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 43–54, https://doi.org/10.1145/3410463.3414656
Recent studies have shown promising performance benefits when multiple stages of a pipelined stencil application are mapped to different parts of a GPU to run concurrently. An important factor for the computing efficiency of such pipelines is the ...
- research-article, September 2020
Regional Out-of-Order Writes in Total Store Order
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 205–216, https://doi.org/10.1145/3410463.3414645
The store buffer, an essential component in today's processors, is designed to hide memory latency by moving stores off the processor's critical path. Furthermore, under the Total Store Order (TSO) memory model, the store buffer ensures the in-order ...
- research-article, September 2020
TAFE: Thread Address Footprint Estimation for Capturing Data/Thread Locality in GPU Systems
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 17–29, https://doi.org/10.1145/3410463.3414641
In multi-GPU and multi-chiplet GPU systems exhibiting NUMA behavior, information about addresses accessed by threads is crucial for various optimizations such as data/thread co-location and cache/scratchpad memory management. To make optimal decisions ...
- research-article, September 2020
Enhancing Address Translations in Throughput Processors via Compression
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 191–204, https://doi.org/10.1145/3410463.3414633
Efficient memory sharing among multiple compute engines plays an important role in shaping the overall application performance on CPU-GPU heterogeneous platforms. Unified Virtual Memory (UVM) is a promising feature that allows globally-visible data ...
- research-article, September 2020
PRISM: Architectural Support for Variable-granularity Memory Metadata
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 441–454, https://doi.org/10.1145/3410463.3414630
Modern architectures track memory accesses using page granularity metadata such as access and dirty bits, leading to fundamental tradeoffs for system software that uses this metadata. Larger page sizes reduce address translation overheads and page table ...
- research-article, September 2020
Transmuter: Bridging the Efficiency Gap using Memory and Dataflow Reconfiguration
- Subhankar Pal,
- Siying Feng,
- Dong-hyeon Park,
- Sung Kim,
- Aporva Amarnath,
- Chi-Sheng Yang,
- Xin He,
- Jonathan Beaumont,
- Kyle May,
- Yan Xiong,
- Kuba Kaszyk,
- John Magnus Morton,
- Jiawen Sun,
- Michael O'Boyle,
- Murray Cole,
- Chaitali Chakrabarti,
- David Blaauw,
- Hun-Seok Kim,
- Trevor Mudge,
- Ronald Dreslinski
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 175–190, https://doi.org/10.1145/3410463.3414627
With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build hardware for emerging applications that meets power and performance targets while remaining flexible and programmable for end users. This is particularly ...
- research-article, September 2020
Ribbon: High Performance Cache Line Flushing for Persistent Memory
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 427–439, https://doi.org/10.1145/3410463.3414625
Cache line flushing (CLF) is a fundamental building block for programming persistent memory (PM). CLF is prevalent in PM-aware workloads to ensure crash consistency. It also imposes high overhead. Extensive works have explored persistency semantics and ...
- research-article, September 2020
Analyzing and Leveraging Shared L1 Caches in GPUs
PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 161–173, https://doi.org/10.1145/3410463.3414623
Graphics Processing Units (GPUs) concurrently execute thousands of threads, which makes them effective for achieving high throughput for a wide range of applications. However, the memory wall often limits peak throughput. GPUs use caches to address this ...