- research-article, August 2023
Efficient GPU Implementation of Affine Index Permutations on Arrays
FHPNC 2023: Proceedings of the 11th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing, Pages 15–28, https://doi.org/10.1145/3609024.3609411
Optimal usage of the memory system is a key element of fast GPU algorithms. Unfortunately many common algorithms fail in this regard despite exhibiting great regularity in memory access patterns. In this paper we propose efficient kernels to permute the ...
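The abstract only hints at what an affine index permutation is. As a rough illustration (plain NumPy, not the paper's GPU kernels), an affine map A·i + b on index vectors already covers transposes and cyclic shifts:

```python
import numpy as np

def affine_permute(x, A, b):
    """Gather out[i] = x[(A @ i + b) mod shape] for every index vector i.
    A plain-NumPy sketch of an affine index permutation, not the paper's
    GPU kernels."""
    idx = np.indices(x.shape).reshape(x.ndim, -1)             # index vectors as columns
    src = (A @ idx + b[:, None]) % np.array(x.shape)[:, None]  # affine map, wrapped
    return x[tuple(src)].reshape(x.shape)

# Transpose of a square array is the affine map A = [[0,1],[1,0]], b = 0;
# a cyclic row shift is A = I, b = (1, 0).
x = np.arange(9).reshape(3, 3)
swap = np.array([[0, 1], [1, 0]])
assert np.array_equal(affine_permute(x, swap, np.zeros(2, dtype=int)), x.T)
```

The GPU problem the paper targets is making such gathers coalesce in memory; this sketch only shows the indexing semantics.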
AgEBO-tabular: joint neural architecture and hyperparameter search with autotuned data-parallel training for tabular data
- Romain Égelé,
- Prasanna Balaprakash,
- Isabelle Guyon,
- Venkatram Vishwanath,
- Fangfang Xia,
- Rick Stevens,
- Zhengying Liu
SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 30, Pages 1–14, https://doi.org/10.1145/3458817.3476203
Developing high-performing predictive models for large tabular data sets is a challenging task. Neural architecture search (NAS) is an AutoML approach that generates and evaluates multiple neural networks with different architectures concurrently to ...
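As a toy illustration of what "joint" architecture and hyperparameter search means, here is random sampling over a combined space. This is a crude stand-in for AgEBO's aging evolution plus Bayesian optimization, and every name in it is hypothetical:

```python
import random

def joint_search(arch_space, lr_space, evaluate, budget=20, seed=0):
    """Toy joint neural-architecture + hyperparameter search by random
    sampling -- a stand-in for AgEBO's aging evolution plus Bayesian
    optimization. `evaluate` is assumed to return a validation score
    to maximize; all names here are hypothetical."""
    rng = random.Random(seed)
    best = None
    for _ in range(budget):
        cand = {"layers": rng.choice(arch_space), "lr": rng.choice(lr_space)}
        score = evaluate(cand)          # would train and validate a model
        if best is None or score > best[0]:
            best = (score, cand)
    return best
```

A real system evaluates candidates concurrently across data-parallel training jobs; autotuning that training is the part AgEBO adds on top of the search loop.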
- research-article, August 2021
Executable modelling for highly parallel accelerators
MODELS '19: Proceedings of the 22nd International Conference on Model Driven Engineering Languages and Systems, Pages 318–321, https://doi.org/10.1109/MODELS-C.2019.00049
High-performance embedded computing is developing rapidly since applications in most domains require a large and increasing amount of computing power. On the hardware side, this requirement is met by the introduction of heterogeneous systems, with ...
- research-article, August 2019
Compositional deep learning in Futhark
FHPNC 2019: Proceedings of the 8th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing, Pages 47–59, https://doi.org/10.1145/3331553.3342617
We present a design pattern for composing deep learning networks in a typed, higher-order fashion. The exposed library functions are generically typed and the composition structure allows for networks to be trained (using back-propagation) and for ...
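The higher-order composition pattern can be sketched in Python with layers as (forward, backward) pairs; this is an illustrative toy restricted to linear layers, not the library's actual Futhark API:

```python
import numpy as np

def dense(W):
    """A linear layer as a (forward, backward) pair; backward maps the
    output gradient to the input gradient. Hypothetical toy API, not the
    paper's Futhark library."""
    return (lambda x: W @ x,
            lambda g: W.T @ g)

def compose(f, g):
    """Sequential composition: forward runs f then g; backward reverses
    the order, which is exactly the shape of back-propagation."""
    f_fwd, f_bwd = f
    g_fwd, g_bwd = g
    return (lambda x: g_fwd(f_fwd(x)),
            lambda grad: f_bwd(g_bwd(grad)))
```

A nonlinear layer would additionally have to cache its forward activations for the backward pass; the point here is only that networks compose like functions.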
- research-article, February 2019
Task-DAG Support in Single-Source PHAST Library: Enabling Flexible Assignment of Tasks to CPUs and GPUs in Heterogeneous Architectures
PMAM'19: Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores, Pages 91–100, https://doi.org/10.1145/3303084.3309496
Nowadays, the majority of desktop, mobile, and embedded devices in the consumer and industrial markets are heterogeneous, as they contain at least multi-core CPU and GPU resources in the same system. However, exploiting the performance and energy-...
- short-paper, January 2019
Single-source Library for Enabling Seamless Assignment of Data-parallel Task-DAGs to CPUs and GPUs in Heterogeneous Architectures
PARMA-DITAM 2019: Proceedings of the 10th and 8th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms, Article No.: 3, Pages 1–4, https://doi.org/10.1145/3310411.3310416
Currently, the majority of devices is heterogeneous and comprises at least a multi-core CPU and a GPU. Exploiting these modules requires programmers to a) assign parallel activities to the different hardware resources, and b) code each activity through ...
- research-article, February 2018
Extending ILUPACK with a Task-Parallel Version of BiCG for Dual-GPU Servers
PMAM'18: Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores, Pages 71–78, https://doi.org/10.1145/3178442.3178450
We target the solution of sparse linear systems via iterative Krylov subspace-based methods enhanced with the ILUPACK preconditioner on graphics processing units (GPUs). Concretely, in this work we extend ILUPACK with an implementation of the BiCG ...
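For reference, here is a minimal unpreconditioned BiCG iteration in NumPy. The paper's actual contribution, ILUPACK preconditioning with task-parallel execution across two GPUs, is omitted from this sketch:

```python
import numpy as np

def bicg(A, b, x0=None, tol=1e-10, maxiter=200):
    """Minimal unpreconditioned BiCG sketch. BiCG maintains a second
    ("shadow") recurrence driven by A^T, which is what distinguishes it
    from plain CG and makes it applicable to nonsymmetric systems."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    r = b - A @ x
    r_hat = r.copy()                 # shadow residual for the A^T recurrence
    p, p_hat = r.copy(), r_hat.copy()
    rho = r_hat @ r
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rho / (p_hat @ Ap)
        x += alpha * p
        r -= alpha * Ap
        r_hat -= alpha * (A.T @ p_hat)
        if np.linalg.norm(r) < tol:
            break
        rho_new = r_hat @ r
        beta = rho_new / rho
        p = r + beta * p
        p_hat = r_hat + beta * p_hat
        rho = rho_new
    return x
```

Production solvers (including ILUPACK's) add preconditioning and guard against the breakdown cases where rho or the denominator vanishes; this sketch leaves both out for brevity.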
- research-article, October 2017
Making collection operations optimal with aggressive JIT compilation
SCALA 2017: Proceedings of the 8th ACM SIGPLAN International Symposium on Scala, Pages 29–40, https://doi.org/10.1145/3136000.3136002
Functional collection combinators are a neat and widely accepted data processing abstraction. However, their generic nature results in high abstraction overheads -- Scala collections are known to be notoriously slow for typical tasks. We show that ...
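The abstraction overhead in question comes from materializing an intermediate collection per combinator. A Python analogue contrasts that with a fused single-pass pipeline, which is roughly the shape aggressive JIT compilation lowers chained combinators into:

```python
data = list(range(1000))

# Eager, one intermediate collection per combinator (analogous to the
# chained filter/map on Scala collections the paper measures):
step1 = [x for x in data if x % 3 == 0]   # intermediate list from filter
step2 = [x * x for x in step1]            # another intermediate list from map
eager = sum(step2)

# Fused: a single streaming pass with no intermediate collections:
fused = sum(x * x for x in data if x % 3 == 0)

assert eager == fused
```

Same result, but the fused form allocates nothing per stage; eliminating those per-stage allocations and megamorphic calls is where the paper's speedups come from.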
- short-paper, May 2017
Optimizing Data-Intensive Applications Automatically By Leveraging Parallel Data Processing Frameworks
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data, Pages 1675–1678, https://doi.org/10.1145/3035918.3056440
In this demonstration we will showcase CASPER, a novel tool that enables sequential data-intensive programs to automatically leverage the optimizations provided by parallel data processing frameworks. The goal of CASPER is to reduce the inertia against ...
- research-article, September 2016
Online Scalability Characterization of Data-Parallel Programs on Many Cores
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, Pages 191–205, https://doi.org/10.1145/2967938.2967960
We present an accurate online scalability prediction model for data-parallel programs on NUMA many-core systems. Memory contention is considered to be the major limiting factor of program scalability as data parallelism limits the amount of ...
- research-article, August 2015
Petuum: A New Platform for Distributed Machine Learning on Big Data
- Eric P. Xing,
- Qirong Ho,
- Wei Dai,
- Jin-Kyu Kim,
- Jinliang Wei,
- Seunghak Lee,
- Xun Zheng,
- Pengtao Xie,
- Abhimanu Kumar,
- Yaoliang Yu
KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Pages 1335–1344, https://doi.org/10.1145/2783258.2783323
How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or ...
- research-article, May 2015
LightLDA: Big Topic Models on Modest Computer Clusters
WWW '15: Proceedings of the 24th International Conference on World Wide Web, Pages 1351–1361, https://doi.org/10.1145/2736277.2741115
When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-...
- research-article, February 2013
CAP: co-scheduling based on asymptotic profiling in CPU+GPU hybrid systems
PMAM '13: Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores, Pages 107–114, https://doi.org/10.1145/2442992.2443004
Hybrid systems with CPU and GPU have become the new standard in high performance computing. Workloads are split into two parts and distributed to different devices to utilize both CPU and GPU for data parallelism in hybrid systems. But it is challenging ...
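Profiling-based co-scheduling boils down to splitting work proportionally to measured device throughput. A minimal sketch, with hypothetical rates and deliberately none of CAP's asymptotic-profiling refinement:

```python
def split_work(total_items, cpu_rate, gpu_rate):
    """Split a data-parallel workload between CPU and GPU proportionally
    to profiled throughput (items/second). The rates are hypothetical
    measurements; CAP's asymptotic-profiling model is more refined."""
    gpu_items = round(total_items * gpu_rate / (cpu_rate + gpu_rate))
    return total_items - gpu_items, gpu_items

# With a GPU measured 9x faster than the CPU, it gets 9/10 of the items.
cpu_items, gpu_items = split_work(1000, cpu_rate=50.0, gpu_rate=450.0)
```

The hard part the paper addresses is that the rates themselves drift with input size and occupancy, so a single static profile gives the wrong split.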
- Article, December 2012
An Experimentation Platform for the Automatic Parallelization of R Programs
APSEC '12: Proceedings of the 2012 19th Asia-Pacific Software Engineering Conference - Volume 01, Pages 203–212, https://doi.org/10.1109/APSEC.2012.70
We present our ALCHEMY platform that supports the automatic parallelization of R programs during execution. Parallelization occurs fully transparent to the user. Different parallelization techniques can be implemented as modules, linked into the ...
- research-article, October 2012
Adaptive data parallelism for internet clients on heterogeneous platforms
DLS '12: Proceedings of the 8th symposium on Dynamic languages, Pages 53–62, https://doi.org/10.1145/2384577.2384585
Today's Internet is long past static web pages filled with HTML-formatted text sprinkled with an occasional image or animation. We have entered an era of Rich Internet Applications executed locally on Internet clients such as web browsers: games, ...
Also Published in: ACM SIGPLAN Notices, Volume 48, Issue 2
- article, July 2011
Hypercubic storage layout and transforms in arbitrary dimensions using GPUs and CUDA
Concurrency and Computation: Practice & Experience (CCOMP), Volume 23, Issue 10, Pages 1027–1050, https://doi.org/10.1002/cpe.1628
Many simulations in the physical sciences are expressed in terms of rectilinear arrays of variables. It is attractive to develop such simulations for use in 1-, 2-, 3- or arbitrary physical dimensions and also in a manner that supports exploitation of ...
- Article, July 2010
Operational Semantics of the Marte Repetitive Structure Modeling Concepts for Data-Parallel Applications Design
ISPDC '10: Proceedings of the 2010 Ninth International Symposium on Parallel and Distributed Computing, Pages 25–32, https://doi.org/10.1109/ISPDC.2010.30
This paper presents an operational semantics of the repetitive model of computation, which is the basis for the repetitive structure modeling (RSM) package defined in the standard UML Marte profile. It also deals with the semantics of an RSM extension ...
- article, December 2009
Exploiting graphical processing units for data-parallel scientific applications
Graphical processing units (GPUs) have recently attracted attention for scientific applications such as particle simulations. This is partially driven by low commodity pricing of GPUs but also by recent toolkit and library developments that make them ...
- Article, June 2007
Selective bandwidth and resource management in scheduling for dynamically reconfigurable architectures
DAC '07: Proceedings of the 44th annual Design Automation Conference, Pages 771–776, https://doi.org/10.1145/1278480.1278673
Partial dynamic reconfiguration (often referred to as partial RTR) enables true on-demand computing. A dynamically invoked application is assigned resources such as data bandwidth, configurable logic, and the limited logic resources are customized ...
- Article, January 2006
PARLGRAN: parallelism granularity selection for scheduling task chains on dynamically reconfigurable architectures
ASP-DAC '06: Proceedings of the 2006 Asia and South Pacific Design Automation Conference, Pages 491–496, https://doi.org/10.1145/1118299.1118419
Partial dynamic reconfiguration, often called RTR (run-time reconfiguration) is a key feature in modern reconfigurable platforms. While partial RTR enables additional application performance, it imposes physical constraints necessitating simultaneous ...