Scalable energy-efficient parallel sorting on a fine-grained many-core processor array
Three parallel sorting applications and two list output protocols for the first phase of an external sort execute on a fine-grained many-core processor array that contains no algorithm-specific hardware acting as a co-processor with a ...
Highlights
- Many-core processors can serve as a high-performance energy-efficient co-processor.
A Parallel Multilevel Feature Selection algorithm for improved cancer classification
Biological data tends to grow exponentially, and processing it consumes more resources, time and manpower. Parallelization of algorithms could reduce overall execution time. There are two main challenges in parallelizing computational methods. (...
Highlights
- Biological data keeps growing and dealing with this huge data is a challenging task.
Scheduling directed acyclic graphs with optimal duplication strategy on homogeneous multiprocessor systems
Modern applications generally need a large volume of computation and communication to fulfill the goal. These applications are often implemented on multiprocessor systems to meet the requirements in computing capacity and communication ...
Highlights
- A MILP formulation for the duplication-based scheduling problem is proposed.
- A ...
Structured multi-block grid partitioning using balanced cut trees
An algorithm to partition structured multi-block hexahedral grids for a load balanced assignment of the partitions to a given number of bins is presented. It uses a balanced hierarchical cut tree data structure to partition the ...
Highlights
- Balanced Cut Trees are used to generate structured partitions of hexahedral grids.
Efficient AES implementation on Sunway TaihuLight supercomputer: A systematic approach
Encryption is an important technique to improve information security for many real-world applications. The Advanced Encryption Standard (AES) is a widely used, efficient cryptographic algorithm. Although AES is fast both in software and ...
Highlights
- A data layout enabling SIMD operations inside one Computing Processor Element.
Subgraph fault tolerance of distance optimally edge connected hypercubes and folded hypercubes
The hypercube and the folded hypercube are the most fundamental interconnection networks owing to their attractive topological properties. We assume that, for any distinct vertices u, v ∈ V, κ(u, v), defined as the local connectivity of u and v, is the ...
DQPFS: Distributed quadratic programming based feature selection for big data
With the advent of Big Data, the scalability of machine learning algorithms has become more crucial than ever. Furthermore, feature selection, as an essential preprocessing technique, can improve the performance of the ...
Highlights
- Proposing a distributed and scalable feature selection for Big Data, DQPFS.
- An ...
Kokkos implementation of an Ewald Coulomb solver and analysis of performance portability
We have implemented the computation of Coulomb interactions in particle systems using the performance portable C++ framework Kokkos. For the computation of the electrostatic interactions in particle systems we used an Ewald summation. ...
Highlights
- Portability for particle simulation algorithm demonstrated with C++ framework Kokkos
CHAMELEON: Reactive Load Balancing for Hybrid MPI+OpenMP Task-Parallel Applications
Many applications in high performance computing are designed based on underlying performance and execution models. While these models could successfully be employed in the past for balancing load within and between compute nodes, ...
Highlights
- Increasing dynamic variability observable in modern hardware and software.
sLASs: A fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library)
In this work we have implemented a novel Linear Algebra Library on top of the task-based runtime OmpSs-2. We have used some of the most advanced OmpSs-2 features: weak dependencies and regions, together with the final clause for the ...
Highlights
- Development of a highly optimized auto-tuned library for BLAS-3 and LAPACK operations.
On the performance difference between theory and practice for parallel algorithms
The performance of parallel algorithms is often inconsistent with their preliminary theoretical analyses. Indeed, the difference is increasing between the ability to theoretically predict the performance of a parallel algorithm and the ...
Highlights
- Performance analysis of Cormen’s parallel Quicksort algorithm.
- Comparing the ...
Efficient convolution pooling on the GPU
Shunsuke Suita, Takahiro Nishimura, Hiroki Tokura, Koji Nakano, Yasuaki Ito, Akihiko Kasagi, Tsuguchika Tabaru
The main contribution of this paper is to show efficient implementations of convolution-pooling on the GPU, in which the pooling follows the multiple convolution. Since the multiple convolution and the pooling operations are ...
Highlights
- Efficient GPU implementations for the convolution-pooling have been presented.
On demand clock synchronization for live VM migration in distributed cloud data centers
Live migration of virtual machines (VMs) has become an extremely powerful tool for cloud data center management and provides significant benefits of seamless VM mobility among physical hosts within a data center or across multiple data ...
Highlights
- Clock synchronization problem for time-sensitive applications and services.
Extending the limits for big data RSA cracking: Towards cache-oblivious TU decomposition
Nowadays, Big Data security processes require mining large amounts of content that was traditionally not used for security analysis. The RSA algorithm has become the de facto standard for encryption, especially ...
Highlights
- We investigate prospects for a cache oblivious adaptation of the TURBO algorithm for solving linear systems over finite fields, necessary for adversarial ...
A semantic-based methodology for digital forensics analysis
Nowadays, more than ever, digital forensics activities are involved in any criminal, civil or military investigation and represent a fundamental tool to support cyber-security. Investigators use a variety of techniques and proprietary ...
Highlights
- A reusable methodology that makes use of NLP techniques to produce a semantic representation of relevant concepts of a specific domain is presented.
Blockchain 3.0 applications survey
In this paper we survey a number of interesting applications of blockchain technology not related to cryptocurrencies. As a matter of fact, after an initial period of application to cryptocurrencies and to the financial world, ...
Highlights
- Survey of five selected Blockchain 3.0 applications.
- Problem ...
High level programming abstractions for leveraging hierarchical memories with micro-core architectures
Micro-core architectures combine many low memory, low power computing cores together in a single package. These are attractive for use as accelerators but due to limited on-chip memory and multiple levels of memory hierarchy, the way ...
Highlights
- A pass-by-reference model is mandatory for micro-cores to support arbitrarily large data.
Designing an efficient parallel spectral clustering algorithm on multi-core processors in Julia
Spectral clustering is widely used in data mining, machine learning and other fields. It can identify the arbitrary shape of a sample space and converge to the global optimal solution. Compared with the traditional k-means algorithm, ...
Highlights
- A Julia-based parallel spectral clustering algorithm is designed.
- The ...