Masoud Daneshtalab

KTH Royal Institute of Technology, Electronic systems, Faculty Member

University of Tehran, School of Electrical and Computer Engineering, Graduate Student

Papers by Masoud Daneshtalab

Editorial: Special issue on many-core embedded systems

PARS—An efficient congestion-Aware Routing method for Networks-on-Chip

The performance of NoCs (Networks-on-Chip) relies heavily on the routing algorithm. Despite its higher implementation complexity compared with deterministic routing, adaptive routing has several merits, such as lower latency, higher throughput, and better fault tolerance. Most existing adaptive routing algorithms are based on a comparison of the horizontal and vertical congestion status in the network. However, the performance of adaptive routing schemes suffers from the inadequate global ...

TransPar: Transformation based dynamic Parallelism for low power CGRAs

2014 24th International Conference on Field Programmable Logic and Applications (FPL), 2014

Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern applications (e.g. 4G, CDMA, etc.). Recently proposed CGRAs offer runtime parallelism to reduce energy consumption (by lowering voltage/frequency). To implement runtime parallelism, CGRAs commonly store multiple compile-time generated implementations of an application (with different degrees of parallelism) and select the optimal version at runtime. However, compile-time binding incurs excessive configuration memory overheads and/or is unable to parallelize an application even when sufficient resources are available. As a solution to this problem, we propose Transformation based dynamic Parallelism (TransPar). TransPar stores only a single implementation and applies a series of transformations to generate the bitstream for the parallel version. In addition, it allows an application to be displaced and/or rotated so that it can be parallelized in resource-constrained scenarios. By storing only a single implementation, TransPar offers significant reductions in configuration memory requirements (up to 73% for the tested applications) compared to state-of-the-art compaction techniques. Simulation and synthesis results, using real applications, reveal that the additional flexibility allows up to 33% energy reduction compared to static memory based parallelism techniques. Gate-level analysis reveals that TransPar incurs negligible silicon (0.2% of the platform) and timing (6 additional cycles per application) penalties.

NeuroCGRA: A CGRA with support for neural networks

2014 International Conference on High Performance Computing & Simulation (HPCS), 2014

Today, Coarse Grained Reconfigurable Architectures (CGRAs) are becoming an increasingly popular implementation platform. In real-world applications, CGRAs are required to simultaneously host processing (e.g. audio/video acquisition) and estimation (e.g. audio/video/image recognition) tasks. For estimation problems, neural networks promise a higher efficiency than conventional processing. However, most existing CGRAs provide no support for neural networks. To realize both neural networks and conventional processing on the same platform, this paper presents NeuroCGRA. NeuroCGRA allows the processing elements and the network to dynamically morph into either a conventional CGRA or a neural network, depending on the hosted application. We have chosen the DRRA as a vehicle to study the feasibility and overheads of our approach. Synthesis results reveal that the proposed enhancements incur negligible overheads (4.4% area and 9.1% power) compared to the original DRRA cell.

RuRot: Run-time rotatable-expandable partitions for efficient mapping in CGRAs

2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014

Today, Coarse Grained Reconfigurable Architectures (CGRAs) host multiple applications with arbitrary communication and computation patterns. Compile-time mapping decisions are neither optimal nor desirable for efficiently supporting diverse and unpredictable application requirements. As a solution to this problem, recently proposed architectures offer run-time remapping. The run-time remappers displace or expand (parallelize/serialize) an application to optimize different parameters (such as platform utilization). However, the existing remappers support application displacement or expansion in either the horizontal or the vertical direction. Moreover, most of the works only address dynamic remapping in packet-switched networks and are therefore not applicable to CGRAs that exploit circuit switching for low power and high predictability. To enhance the optimality of run-time remappers, this paper presents a design framework called Run-time Rotatable-expandable Partitions (RuRot). RuRot provides architectural support to dynamically remap or expand (i.e. parallelize) the hosted applications in CGRAs with circuit-switched interconnects. Compared to the state of the art, the proposed design supports application rotation (in clockwise and anticlockwise directions) and displacement (in horizontal and vertical directions) at run-time. Simulation results using a few applications reveal that the additional flexibility significantly enhances device utilization (on average 50% for the tested applications). Synthesis results confirm that the proposed remapper has negligible silicon (0.2% of the platform) and timing (2 cycles per application) overheads.

Smart hill climbing for agile dynamic mapping in many-core systems

Proceedings of the 50th Annual Design Automation Conference on - DAC '13, 2013

A stochastic hill climbing algorithm is adapted to rapidly find an appropriate start node for application mapping in network-based many-core systems. Because the workload of such systems is highly dynamic and unpredictable, an agile run-time task allocation scheme is required. The scheme should map the tasks of an incoming application at run-time onto an optimum contiguous area of the available nodes. Contiguous, unfragmented mapping places communicating tasks in close proximity, which significantly reduces the power dissipation, the congestion between different applications, and the latency of the system. To find an optimum region, we first propose an approximate model that quickly estimates the available area around a given node. The stochastic hill climbing algorithm is then used as a search heuristic to find a node that has the required number of available nodes around it. The presented agile climber takes its steps using an adapted version of the hill climbing algorithm named Smart Hill Climbing (SHiC), which takes the run-time status of the system into account. Finally, the application mapping is performed starting from the selected first node. Experiments show a significant gain in mapping contiguousness, which results in better network latency and power dissipation compared to state-of-the-art works.
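
The search described in this abstract can be illustrated with a minimal sketch. It is not the authors' SHiC implementation: the abstract does not specify the area model or the step rule, so the square-window estimate `free_area`, the random restarts, and the names `find_first_node`, `radius`, and `restarts` are all assumptions made for illustration. The sketch only shows plain stochastic hill climbing over a 2D mesh in which `True` marks an occupied node.

```python
import random


def free_area(grid, x, y, radius):
    """Assumed area model: count free nodes in the square window around (x, y)."""
    rows, cols = len(grid), len(grid[0])
    count = 0
    for i in range(max(0, x - radius), min(rows, x + radius + 1)):
        for j in range(max(0, y - radius), min(cols, y + radius + 1)):
            if not grid[i][j]:
                count += 1
    return count


def find_first_node(grid, required, radius=1, restarts=20, seed=0):
    """Stochastic hill climbing with restarts (hypothetical helper).

    Returns a free node whose window holds at least `required` free nodes,
    or None if no such node is found within the restart budget.
    """
    rng = random.Random(seed)
    rows, cols = len(grid), len(grid[0])
    for _ in range(restarts):
        x, y = rng.randrange(rows), rng.randrange(cols)
        score = free_area(grid, x, y, radius)
        while True:
            if not grid[x][y] and score >= required:
                return (x, y)
            # Collect the mesh neighbours that strictly improve the estimate.
            better = []
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nx, ny = x + dx, y + dy
                if 0 <= nx < rows and 0 <= ny < cols:
                    s = free_area(grid, nx, ny, radius)
                    if s > score:
                        better.append((nx, ny, s))
            if not better:
                break  # local maximum reached: restart from a new random node
            # Stochastic step: pick one improving neighbour at random.
            x, y, score = rng.choice(better)
    return None


# 5x5 mesh whose top row is occupied by an already running application.
mesh = [[True] * 5] + [[False] * 5 for _ in range(4)]
start = find_first_node(mesh, required=9)
```

In this sketch the returned node would serve as the first node from which the tasks of the incoming application are mapped; a real mapper would also have to account for the run-time system status that the abstract mentions.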

DyXYZ: Fully Adaptive Routing Algorithm for 3D NoCs

2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2013
