A Decomposition-Based Approach for Scalable Many-Field Packet Classification on Multi-core Processors
As a kernel function in network routers, packet classification requires the incoming packet headers to be checked against a set of predefined rules. There are two trends for packet classification: (1) to examine a large number of packet header fields, ...
Fully Optimized Code Block Segmentation Algorithm for LTE-Advanced
In our previous work, we presented a brief analysis of the performance of the code block segmentation procedure adopted by the 3GPP LTE Advanced (LTE-A) Standard as part of its physical layer channel coding scheme. Here, a detailed analysis of its ...
Invasive Compute Balancing for Applications with Shared and Hybrid Parallelization
Achieving high scalability with dynamically adaptive algorithms in high-performance computing (HPC) is a non-trivial task. The invasive paradigm using compute migration represents an efficient alternative to classical data migration approaches for such ...
PageRank Computation Using a Multiple Implicitly Restarted Arnoldi Method for Modeling Epidemic Spread
A parallel implementation based on the multiple implicitly restarted Arnoldi method (MIRAM) is proposed for calculating the dominant eigenpair of stochastic matrices derived from very large real networks. Their high damping factor makes many existing algorithms less ...
Cluster Cache Monitor: Leveraging the Proximity Data in CMP
As the number of cores and the working sets of parallel workloads increase, shared L2 caches exhibit fewer misses than private L2 caches by making better use of the total available cache capacity, but they also induce higher overall L1 miss latencies ...
BPLG: A Tuned Butterfly Processing Library for GPU Architectures
To increase the efficiency of existing software, many works are incorporating GPU processing. However, despite the current advances in GPU languages and tools, taking advantage of their parallel architecture is still far more complex than ...
List Scheduling in Embedded Systems Under Memory Constraints
Video decoding and image processing in embedded systems are subject to strong resource constraints, particularly in terms of memory. List-scheduling heuristics with static priorities (HEFT, SDC, etc.) being the oft-cited solutions due to both their good ...
A Hardware/Software Approach for Database Query Acceleration with FPGAs
Complex analytics queries often involve expensive operations that may require large computational runtimes leading to slow query responsiveness and hampering real-time performance. Moreover, running these expensive analytics queries inside traditional ...
An Autotuning Engine for the 3D Fast Wavelet Transform on Clusters with Hybrid CPU + GPU Platforms
This work presents an optimization method to run the 3D-fast wavelet transform (3D-FWT) on a CPU + GPU system. The optimization engine detects the different computing components in the system, and executes the appropriate kernel implemented in both CUDA ...
The Scalability of Disjoint Data Structures on a New Hardware Transactional Memory System
In this paper we present our experiences constructing and testing in-memory data structures designed to be disjoint enough for transactional memory to be profitable as a serialization mechanism with no fallback to traditional locking. Our goal was to ...
Extending Summation Precision for Network Reduction Operations
Double precision summation is at the core of numerous important algorithms such as Newton-Krylov methods and other operations involving inner products, such as matrix multiplication and dot products. However, the effectiveness of summation is limited ...