Papers by Masoud Daneshtalab
The performance of NoCs (Networks-on-Chip) relies heavily on the routing algorithm. Despite its higher implementation complexity compared with deterministic routing, adaptive routing has several merits, such as lower latency, higher throughput and better fault tolerance. Most existing adaptive routing algorithms are based on comparing the horizontal and vertical congestion status in the network. However, the performance of adaptive routing schemes suffers from the inadequate global ...
2014 24th International Conference on Field Programmable Logic and Applications (FPL), 2014
Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern applications (e.g. 4G, CDMA, etc.). Recently proposed CGRAs offer runtime parallelism to reduce energy consumption (by lowering voltage/frequency). To implement runtime parallelism, CGRAs commonly store multiple compile-time generated implementations of an application (with different degrees of parallelism) and select the optimal version at runtime. However, compile-time binding incurs excessive configuration memory overheads and/or is unable to parallelize an application even when sufficient resources are available. As a solution to this problem, we propose Transformation based dynamic Parallelism (TransPar). TransPar stores only a single implementation and applies a series of transformations to generate the bitstream for the parallel version. In addition, it allows an application to be displaced and/or rotated so that it can be parallelized in resource-constrained scenarios. By storing only a single implementation, TransPar offers significant reductions in configuration memory requirements (up to 73% for the tested applications) compared to state-of-the-art compaction techniques. Simulation and synthesis results, using real applications, reveal that the additional flexibility allows up to 33% energy reduction compared to static memory-based parallelism techniques. Gate-level analysis reveals that TransPar incurs negligible silicon (0.2% of the platform) and timing (6 additional cycles per application) penalties.
2014 International Conference on High Performance Computing & Simulation (HPCS), 2014
Today, Coarse Grained Reconfigurable Architectures (CGRAs) are becoming an increasingly popular implementation platform. In real-world applications, CGRAs are required to simultaneously host processing (e.g. audio/video acquisition) and estimation (e.g. audio/video/image recognition) tasks. For estimation problems, neural networks promise a higher efficiency than conventional processing. However, most existing CGRAs provide no support for neural networks. To realize both neural networks and conventional processing on the same platform, this paper presents NeuroCGRA. NeuroCGRA allows the processing elements and the network to dynamically morph into either a conventional CGRA or a neural network, depending on the hosted application. We have chosen the DRRA as a vehicle to study the feasibility and overheads of our approach. Synthesis results reveal that the proposed enhancements incur negligible overheads (4.4% area and 9.1% power) compared to the original DRRA cell.
2014 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014
Today, Coarse Grained Reconfigurable Architectures (CGRAs) host multiple applications with arbitrary communication and computation patterns. Compile-time mapping decisions are neither optimal nor desirable for efficiently supporting diverse and unpredictable application requirements. As a solution to this problem, recently proposed architectures offer run-time remapping. Run-time remappers displace or expand (parallelize/serialize) an application to optimize different parameters (such as platform utilization). However, existing remappers support application displacement or expansion in either the horizontal or the vertical direction. Moreover, most works address dynamic remapping only in packet-switched networks and are therefore not applicable to CGRAs that exploit circuit switching for low power and high predictability. To enhance the optimality of run-time remappers, this paper presents a design framework called Run-time Rotatable-expandable Partitions (RuRot). RuRot provides architectural support to dynamically remap or expand (i.e. parallelize) the hosted applications in CGRAs with circuit-switched interconnects. Compared to the state of the art, the proposed design supports application rotation (in clockwise and anticlockwise directions) and displacement (in horizontal and vertical directions) at run-time. Simulation results using a few applications reveal that the additional flexibility significantly enhances device utilization (on average 50% for the tested applications). Synthesis results confirm that the proposed remapper has negligible silicon (0.2% of the platform) and timing (2 cycles per application) overheads.
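The rotation and displacement operations named in this abstract can be illustrated in software on a set of cell coordinates. The following is only a minimal sketch under stated assumptions: the abstract describes architectural support in circuit-switched hardware, and nothing here reflects how RuRot is actually implemented. The function names (`normalize`, `rotate_cw`, `rotate_ccw`, `displace`, `fits`) and the convention that a partition is a set of `(x, y)` cells with `y` growing downward are hypothetical.

```python
def normalize(cells):
    """Shift a partition so its bounding box starts at (0, 0)."""
    min_x = min(x for x, _ in cells)
    min_y = min(y for _, y in cells)
    return {(x - min_x, y - min_y) for x, y in cells}


def rotate_cw(cells):
    """Rotate a partition 90 degrees clockwise (y grows downward)."""
    cells = normalize(cells)
    height = max(y for _, y in cells) + 1
    return {(height - 1 - y, x) for x, y in cells}


def rotate_ccw(cells):
    """Rotate a partition 90 degrees anticlockwise (y grows downward)."""
    cells = normalize(cells)
    width = max(x for x, _ in cells) + 1
    return {(y, width - 1 - x) for x, y in cells}


def displace(cells, dx, dy):
    """Move a partition horizontally by dx and vertically by dy."""
    return {(x + dx, y + dy) for x, y in cells}


def fits(cells, occupied, cols, rows):
    """Check that every cell lies on the fabric and is not already in use."""
    return all(
        0 <= x < cols and 0 <= y < rows and (x, y) not in occupied
        for x, y in cells
    )


# Hypothetical 3x3 fabric whose two left columns are already occupied.
occupied = {(x, y) for x in (0, 1) for y in range(3)}
app = {(0, 0), (1, 0), (2, 0)}  # a 3-cell application laid out horizontally

# No horizontal or vertical shift of the unrotated layout fits ...
unrotated_fits = any(
    fits(displace(app, dx, dy), occupied, 3, 3)
    for dx in range(3) for dy in range(3)
)
# ... but rotating it and displacing it into the free column does.
rotated_fits = fits(displace(rotate_cw(app), 2, 0), occupied, 3, 3)
```

In this toy case the rotated layout fits where the unrotated one cannot, which is the kind of extra placement freedom the abstract attributes to combining rotation with displacement.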
Proceedings of the 50th Annual Design Automation Conference on - DAC '13, 2013
A stochastic hill climbing algorithm is adapted to rapidly find an appropriate start node for application mapping in network-based many-core systems. Because the workload of such systems is highly dynamic and unpredictable, an agile run-time task allocation scheme is required. The scheme should map the tasks of an incoming application at run-time onto an optimum contiguous area of the available nodes. Contiguous, unfragmented mapping places communicating tasks in close proximity, which significantly reduces power dissipation, congestion between different applications, and system latency. To find an optimum region, we first propose an approximate model that quickly estimates the available area around a given node. The stochastic hill climbing algorithm is then used as a search heuristic to find a node that has the required number of available nodes around it. The presented agile climber takes its steps using an adapted version of the hill climbing algorithm named Smart Hill Climbing (SHiC), which takes the runtime status of the system into account. Finally, the application mapping is performed starting from the selected first node. Experiments show a significant gain in mapping contiguousness, which results in better network latency and power dissipation compared to state-of-the-art works.
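The search described in this abstract can be sketched with plain stochastic hill climbing on a 2D mesh. This is not the paper's SHiC algorithm or its approximate area model, neither of which is specified in the abstract. As an assumption, the availability estimate below simply counts free nodes in a square window around a node, and the names `free_area` and `stochastic_hill_climb` are hypothetical.

```python
import random


def free_area(grid, x, y, radius=1):
    """Assumed estimate: number of free nodes in the square window around (x, y).

    grid[row][col] is True when the node is occupied.
    """
    rows, cols = len(grid), len(grid[0])
    count = 0
    for j in range(max(0, y - radius), min(rows, y + radius + 1)):
        for i in range(max(0, x - radius), min(cols, x + radius + 1)):
            if not grid[j][i]:
                count += 1
    return count


def stochastic_hill_climb(grid, required, start, radius=1, max_steps=100, rng=None):
    """Walk from `start` to a node with at least `required` free nodes around it.

    At each step one of the strictly better mesh neighbours is picked at
    random; the walk stops at a good-enough node or at a local maximum.
    """
    rng = rng or random.Random()
    rows, cols = len(grid), len(grid[0])
    x, y = start
    for _ in range(max_steps):
        score = free_area(grid, x, y, radius)
        if score >= required:
            break
        uphill = [
            (nx, ny)
            for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
            if 0 <= nx < cols and 0 <= ny < rows
            and free_area(grid, nx, ny, radius) > score
        ]
        if not uphill:
            break  # local maximum: no neighbour improves the estimate
        x, y = rng.choice(uphill)
    return (x, y), free_area(grid, x, y, radius)


# Hypothetical 4x4 mesh: the two left columns are occupied, the rest are free.
mesh = [[True, True, False, False] for _ in range(4)]
node, score = stochastic_hill_climb(mesh, required=6, start=(0, 0),
                                    rng=random.Random(0))
```

Picking randomly among improving neighbours, rather than always taking the steepest one, is what makes the climb stochastic. A run-time-aware variant such as the one the abstract names would presumably bias these choices using system status, but that part is not reproduced here.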
2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 2013
7th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), 2012
Communications in Computer and Information Science, 2011
2012 IEEE 3rd International Conference on Networked Embedded Systems for Every Application (NESEA), 2012
2010 IEEE International 3D Systems Integration Conference (3DIC), 2010
2010 15th CSI International Symposium on Computer Architecture and Digital Systems, 2010
2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2012
2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip, 2011
2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip, 2012
2011 IEEE 2nd International Conference on Networked Embedded Systems for Enterprise Applications, 2011
2011 IEEE International 3D Systems Integration Conference (3DIC), 2012
Proceedings of the 8th ACM International Conference on Computing Frontiers - CF '11, 2011
6th International Workshop on Reconfigurable Communication-Centric Systems-on-Chip (ReCoSoC), 2011
6th International Workshop on Reconfigurable Communication-Centric Systems-on-Chip (ReCoSoC), 2011