- research-article, August 2023
Efficient GPU Implementation of Affine Index Permutations on Arrays
FHPNC 2023: Proceedings of the 11th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing, Pages 15–28, https://doi.org/10.1145/3609024.3609411
Optimal usage of the memory system is a key element of fast GPU algorithms. Unfortunately many common algorithms fail in this regard despite exhibiting great regularity in memory access patterns. In this paper we propose efficient kernels to permute the ...
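The abstract only hints at what an affine index permutation is. As a rough illustration (plain NumPy, not the paper's GPU kernels), an affine map A·i + b on index vectors already covers transposes and cyclic shifts:

```python
import numpy as np

def affine_permute(x, A, b):
    """Gather out[i] = x[(A @ i + b) mod shape] for every index vector i.
    A plain-NumPy sketch of an affine index permutation, not the paper's
    GPU kernels."""
    idx = np.indices(x.shape).reshape(x.ndim, -1)             # index vectors as columns
    src = (A @ idx + b[:, None]) % np.array(x.shape)[:, None]  # affine map, wrapped
    return x[tuple(src)].reshape(x.shape)

# Transpose of a square array is the affine map A = [[0,1],[1,0]], b = 0;
# a cyclic row shift is A = I, b = (1, 0).
x = np.arange(9).reshape(3, 3)
swap = np.array([[0, 1], [1, 0]])
assert np.array_equal(affine_permute(x, swap, np.zeros(2, dtype=int)), x.T)
```

The GPU problem the paper targets is making such gathers coalesce in memory; this sketch only shows the indexing semantics.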
AgEBO-tabular: joint neural architecture and hyperparameter search with autotuned data-parallel training for tabular data
- Romain Égelé,
- Prasanna Balaprakash,
- Isabelle Guyon,
- Venkatram Vishwanath,
- Fangfang Xia,
- Rick Stevens,
- Zhengying Liu
SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 30, Pages 1–14, https://doi.org/10.1145/3458817.3476203
Developing high-performing predictive models for large tabular data sets is a challenging task. Neural architecture search (NAS) is an AutoML approach that generates and evaluates multiple neural networks with different architectures concurrently to ...
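As a toy illustration of what "joint" architecture and hyperparameter search means, here is random sampling over a combined space. This is a crude stand-in for AgEBO's aging evolution plus Bayesian optimization, and every name in it is hypothetical:

```python
import random

def joint_search(arch_space, lr_space, evaluate, budget=20, seed=0):
    """Toy joint neural-architecture + hyperparameter search by random
    sampling -- a stand-in for AgEBO's aging evolution plus Bayesian
    optimization. `evaluate` is assumed to return a validation score
    to maximize; all names here are hypothetical."""
    rng = random.Random(seed)
    best = None
    for _ in range(budget):
        cand = {"layers": rng.choice(arch_space), "lr": rng.choice(lr_space)}
        score = evaluate(cand)          # would train and validate a model
        if best is None or score > best[0]:
            best = (score, cand)
    return best
```

A real system evaluates candidates concurrently across data-parallel training jobs; autotuning that training is the part AgEBO adds on top of the search loop.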
- research-article, August 2021
Executable modelling for highly parallel accelerators
MODELS '19: Proceedings of the 22nd International Conference on Model Driven Engineering Languages and Systems, Pages 318–321, https://doi.org/10.1109/MODELS-C.2019.00049
High-performance embedded computing is developing rapidly since applications in most domains require a large and increasing amount of computing power. On the hardware side, this requirement is met by the introduction of heterogeneous systems, with ...
- research-article, August 2019
Compositional deep learning in Futhark
FHPNC 2019: Proceedings of the 8th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing, Pages 47–59, https://doi.org/10.1145/3331553.3342617
We present a design pattern for composing deep learning networks in a typed, higher-order fashion. The exposed library functions are generically typed and the composition structure allows for networks to be trained (using back-propagation) and for ...
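The higher-order composition pattern can be sketched in Python with layers as (forward, backward) pairs; this is an illustrative toy restricted to linear layers, not the library's actual Futhark API:

```python
import numpy as np

def dense(W):
    """A linear layer as a (forward, backward) pair; backward maps the
    output gradient to the input gradient. Hypothetical toy API, not the
    paper's Futhark library."""
    return (lambda x: W @ x,
            lambda g: W.T @ g)

def compose(f, g):
    """Sequential composition: forward runs f then g; backward reverses
    the order, which is exactly the shape of back-propagation."""
    f_fwd, f_bwd = f
    g_fwd, g_bwd = g
    return (lambda x: g_fwd(f_fwd(x)),
            lambda grad: f_bwd(g_bwd(grad)))
```

A nonlinear layer would additionally have to cache its forward activations for the backward pass; the point here is only that networks compose like functions.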
- research-article, February 2019
Task-DAG Support in Single-Source PHAST Library: Enabling Flexible Assignment of Tasks to CPUs and GPUs in Heterogeneous Architectures
PMAM'19: Proceedings of the 10th International Workshop on Programming Models and Applications for Multicores and Manycores, Pages 91–100, https://doi.org/10.1145/3303084.3309496
Nowadays, the majority of desktop, mobile, and embedded devices in the consumer and industrial markets are heterogeneous, as they contain at least multi-core CPU and GPU resources in the same system. However, exploiting the performance and energy-...
- short-paper, January 2019
Single-source Library for Enabling Seamless Assignment of Data-parallel Task-DAGs to CPUs and GPUs in Heterogeneous Architectures
PARMA-DITAM 2019: Proceedings of the 10th and 8th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms, Article No.: 3, Pages 1–4, https://doi.org/10.1145/3310411.3310416
Currently, the majority of devices is heterogeneous and comprises at least a multi-core CPU and a GPU. Exploiting these modules requires programmers to a) assign parallel activities to the different hardware resources, and b) code each activity through ...
- research-article, February 2018
Extending ILUPACK with a Task-Parallel Version of BiCG for Dual-GPU Servers
PMAM'18: Proceedings of the 9th International Workshop on Programming Models and Applications for Multicores and Manycores, Pages 71–78, https://doi.org/10.1145/3178442.3178450
We target the solution of sparse linear systems via iterative Krylov subspace-based methods enhanced with the ILUPACK preconditioner on graphics processing units (GPUs). Concretely, in this work we extend ILUPACK with an implementation of the BiCG ...
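For reference, here is a minimal unpreconditioned BiCG iteration in NumPy. The paper's actual contribution, ILUPACK preconditioning with task-parallel execution across two GPUs, is omitted from this sketch:

```python
import numpy as np

def bicg(A, b, x0=None, tol=1e-10, maxiter=200):
    """Minimal unpreconditioned BiCG sketch. BiCG maintains a second
    ("shadow") recurrence driven by A^T, which is what distinguishes it
    from plain CG and makes it applicable to nonsymmetric systems."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float).copy()
    r = b - A @ x
    r_hat = r.copy()                 # shadow residual for the A^T recurrence
    p, p_hat = r.copy(), r_hat.copy()
    rho = r_hat @ r
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rho / (p_hat @ Ap)
        x += alpha * p
        r -= alpha * Ap
        r_hat -= alpha * (A.T @ p_hat)
        if np.linalg.norm(r) < tol:
            break
        rho_new = r_hat @ r
        beta = rho_new / rho
        p = r + beta * p
        p_hat = r_hat + beta * p_hat
        rho = rho_new
    return x
```

Production solvers (including ILUPACK's) add preconditioning and guard against the breakdown cases where rho or the denominator vanishes; this sketch leaves both out for brevity.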
- research-article, October 2017
Making collection operations optimal with aggressive JIT compilation
SCALA 2017: Proceedings of the 8th ACM SIGPLAN International Symposium on Scala, Pages 29–40, https://doi.org/10.1145/3136000.3136002
Functional collection combinators are a neat and widely accepted data processing abstraction. However, their generic nature results in high abstraction overheads -- Scala collections are known to be notoriously slow for typical tasks. We show that ...
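The abstraction overhead in question comes from materializing an intermediate collection per combinator. A Python analogue contrasts that with a fused single-pass pipeline, which is roughly the shape aggressive JIT compilation lowers chained combinators into:

```python
data = list(range(1000))

# Eager, one intermediate collection per combinator (analogous to the
# chained filter/map on Scala collections the paper measures):
step1 = [x for x in data if x % 3 == 0]   # intermediate list from filter
step2 = [x * x for x in step1]            # another intermediate list from map
eager = sum(step2)

# Fused: a single streaming pass with no intermediate collections:
fused = sum(x * x for x in data if x % 3 == 0)

assert eager == fused
```

Same result, but the fused form allocates nothing per stage; eliminating those per-stage allocations and megamorphic calls is where the paper's speedups come from.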
- short-paper, May 2017
Optimizing Data-Intensive Applications Automatically By Leveraging Parallel Data Processing Frameworks
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data, Pages 1675–1678, https://doi.org/10.1145/3035918.3056440
In this demonstration we will showcase CASPER, a novel tool that enables sequential data-intensive programs to automatically leverage the optimizations provided by parallel data processing frameworks. The goal of CASPER is to reduce the inertia against ...
- research-article, September 2016
Online Scalability Characterization of Data-Parallel Programs on Many Cores
PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, Pages 191–205, https://doi.org/10.1145/2967938.2967960
We present an accurate online scalability prediction model for data-parallel programs on NUMA many-core systems. Memory contention is considered to be the major limiting factor of program scalability as data parallelism limits the amount of ...
- research-article, August 2015
Petuum: A New Platform for Distributed Machine Learning on Big Data
- Eric P. Xing,
- Qirong Ho,
- Wei Dai,
- Jin-Kyu Kim,
- Jinliang Wei,
- Seunghak Lee,
- Xun Zheng,
- Pengtao Xie,
- Abhimanu Kumar,
- Yaoliang Yu
KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Pages 1335–1344, https://doi.org/10.1145/2783258.2783323
How can one build a distributed framework that allows efficient deployment of a wide spectrum of modern advanced machine learning (ML) programs for industrial-scale problems using Big Models (100s of billions of parameters) on Big Data (terabytes or ...
- research-article, May 2015
LightLDA: Big Topic Models on Modest Computer Clusters
WWW '15: Proceedings of the 24th International Conference on World Wide Web, Pages 1351–1361, https://doi.org/10.1145/2736277.2741115
When building large-scale machine learning (ML) programs, such as massive topic models or deep neural networks with up to trillions of parameters and training examples, one usually assumes that such massive tasks can only be attempted with industrial-...
- research-article, February 2013
CAP: co-scheduling based on asymptotic profiling in CPU+GPU hybrid systems
PMAM '13: Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores, Pages 107–114, https://doi.org/10.1145/2442992.2443004
Hybrid systems with CPU and GPU have become the new standard in high performance computing. Workloads are split into two parts and distributed to different devices to utilize both CPU and GPU for data parallelism in hybrid systems. But it is challenging ...
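Profiling-based co-scheduling boils down to splitting work proportionally to measured device throughput. A minimal sketch, with hypothetical rates and deliberately none of CAP's asymptotic-profiling refinement:

```python
def split_work(total_items, cpu_rate, gpu_rate):
    """Split a data-parallel workload between CPU and GPU proportionally
    to profiled throughput (items/second). The rates are hypothetical
    measurements; CAP's asymptotic-profiling model is more refined."""
    gpu_items = round(total_items * gpu_rate / (cpu_rate + gpu_rate))
    return total_items - gpu_items, gpu_items

# With a GPU measured 9x faster than the CPU, it gets 9/10 of the items.
cpu_items, gpu_items = split_work(1000, cpu_rate=50.0, gpu_rate=450.0)
```

The hard part the paper addresses is that the rates themselves drift with input size and occupancy, so a single static profile gives the wrong split.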
- Article, December 2012
An Experimentation Platform for the Automatic Parallelization of R Programs
APSEC '12: Proceedings of the 2012 19th Asia-Pacific Software Engineering Conference - Volume 01, Pages 203–212, https://doi.org/10.1109/APSEC.2012.70
We present our ALCHEMY platform that supports the automatic parallelization of R programs during execution. Parallelization occurs fully transparent to the user. Different parallelization techniques can be implemented as modules, linked into the ...
- research-article, October 2012
Adaptive data parallelism for internet clients on heterogeneous platforms
DLS '12: Proceedings of the 8th symposium on Dynamic languages, Pages 53–62, https://doi.org/10.1145/2384577.2384585
Today's Internet is long past static web pages filled with HTML-formatted text sprinkled with an occasional image or animation. We have entered an era of Rich Internet Applications executed locally on Internet clients such as web browsers: games, ...
Also Published in: ACM SIGPLAN Notices, Volume 48, Issue 2
- article, July 2011
Hypercubic storage layout and transforms in arbitrary dimensions using GPUs and CUDA
Concurrency and Computation: Practice & Experience (CCOMP), Volume 23, Issue 10, Pages 1027–1050, https://doi.org/10.1002/cpe.1628
Many simulations in the physical sciences are expressed in terms of rectilinear arrays of variables. It is attractive to develop such simulations for use in 1-, 2-, 3- or arbitrary physical dimensions and also in a manner that supports exploitation of ...
- Article, July 2010
Operational Semantics of the Marte Repetitive Structure Modeling Concepts for Data-Parallel Applications Design
ISPDC '10: Proceedings of the 2010 Ninth International Symposium on Parallel and Distributed Computing, Pages 25–32, https://doi.org/10.1109/ISPDC.2010.30
This paper presents an operational semantics of the repetitive model of computation, which is the basis for the repetitive structure modeling (RSM) package defined in the standard UML Marte profile. It also deals with the semantics of an RSM extension ...
- article, December 2009
Exploiting graphical processing units for data-parallel scientific applications
Graphical processing units (GPUs) have recently attracted attention for scientific applications such as particle simulations. This is partially driven by low commodity pricing of GPUs but also by recent toolkit and library developments that make them ...
- Article, June 2007
Selective bandwidth and resource management in scheduling for dynamically reconfigurable architectures
DAC '07: Proceedings of the 44th annual Design Automation Conference, Pages 771–776, https://doi.org/10.1145/1278480.1278673
Partial dynamic reconfiguration (often referred to as partial RTR) enables true on-demand computing. A dynamically invoked application is assigned resources such as data bandwidth, configurable logic, and the limited logic resources are customized ...
- Article, January 2006
PARLGRAN: parallelism granularity selection for scheduling task chains on dynamically reconfigurable architectures
ASP-DAC '06: Proceedings of the 2006 Asia and South Pacific Design Automation Conference, Pages 491–496, https://doi.org/10.1145/1118299.1118419
Partial dynamic reconfiguration, often called RTR (run-time reconfiguration) is a key feature in modern reconfigurable platforms. While partial RTR enables additional application performance, it imposes physical constraints necessitating simultaneous ...