Nothing Special   »   [go: up one dir, main page]

Academia.eduAcademia.edu
Application-Specific Networks-on-Chip Topology Customization Using Network Partitioning Ahmed A. Morgan Haytham Elmiligi M. Watheq El-Kharashi Elec. and Comp. Eng. Dept. University of Victoria BC, Canada V8W 3P6 Elec. and Comp. Eng. Dept. University of Victoria BC, Canada V8W 3P6 Dept. of Comp. & Sys. Eng. Ain Shams University Cairo 11517, Egypt ahmorgan@ece.uvic.ca haytham@ece.uvic.ca watheq@ece.uvic.ca Fayez Gebali Elec. and Comp. Eng. Dept. University of Victoria BC, Canada V8W 3P6 fayez@ece.uvic.ca ABSTRACT One of the most challenging problems in Application-Specific Networks-on-Chip (ASNoC) design is to customize the topological structure of the on-chip network in order to meet the application requirements with the minimum possible cost. In this paper, the area cost of ASNoCs is reduced by using network partitioning techniques. The enhancement in area cost is achieved by reducing both routers area and the number of global links. Given the application core graph, Fiduccia-Mattheyses (FM) algorithm is adopted with modification to formulate the partitioning problem as an optimization one. As a proof of concept, our technique is applied to three different applications with different number of cores. Results show that the proposed technique is a promising way to reduce the ASNoC area compared to other topology generation techniques. Categories and Subject Descriptors B.4.3 [INPUT/OUTPUT AND DATA COMMUNICATIONS]: Interconnections—Topology General Terms Algorithm, Analysis, Optimization, Topology Keywords Application-Specific Networks-on-Chip (ASNoC), Core graph, Network partitioning 1. INTRODUCTION The number of cores within modern System-on-Chip (SoC) applications grows rapidly. Although this large number of Permission to make digital or hard copies of part or all of this work or personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. ,)07 1RYHPEHU&DLUR(J\SW . © ACM 2008 ISBN: 978-1-60558-407-2 ...$5.00 computational resources allows more functionality to be integrated on a single chip, the communication among these cores constitutes a major performance bottleneck. ApplicationSpecific Networks-on-Chip (ASNoC) is an emerging paradigm proposed to handle the communication problem between these computational resources [2]. ASNoC aims at optimizing the design of the on-chip network to comply with the application requirements. High throughput, low latency, low energy consumption, and low area are the main desirable characteristics of ASNoC-based systems [9]. The choice of a network topology for ASNoCbased applications significantly impacts their performance. in the literature, both standard and custom topologies are used for ASNoCs [1, 22, 23]. Although standard topologies could be easily implemented and ensure the regularity of the on-chip network, they might not be the best choice for ASNoCs due to the mismatch between the topology and the application characteristics [2, 22]. Therefore, it is required to carefully customize the topological structure of ASNoCs to meet the design requirements, while minimizing the cost. In [10], a topology generation methodology to minimize the power consumption cost of ASNoCs was proposed. In this paper, network partitioning techniques are employed to optimize ASNoCs topology from the area cost point of view. To this end, our contribution is threefold. 1. Analyzing the impact of network partitioning on ASNoC area, 2. Formulating a partitioning technique for any application. As the problem of network partitioning is known to be an NP-hard one [6], it is formulated as an optimization one based on the famous Fiduccia-Mattheyses (FM) algorithm [11], and 3. Validating our technique through three real applications as case studies. Results show how applying network partitioning reduces the area cost compared to both standard and custom topologies. This paper is organized as follows. Related work is discussed in Section 2. The impact of applying network partitioning techniques on ASNoC area is investigated in Section 3. Section 4 gives some background from graph theory on 2. RELATED WORK Standard topologies like ring, octagon, spidergon, mesh, torus, folded torus, SPIN, and BFT were compared and evaluated with respect to throughput, average latency, energy consumption, and area in [5, 23]. Different algorithms, such as NMAP [20], PMAP [17], BMAP [24], GMAP [13], and PBB [13], were proposed to map any application onto these standard topologies. A tool, SUNMAP, was presented in [21] to automatically select the best standard topology for a given application. SUNMAP was only limited to five topologies (mesh, torus, hypercube, butterfly, and clos). The advantages of using custom topologies over standard topologies for ASNoC were explained in [2, 22]. The authors in [22] proposed a methodology for ASNoC custom topology generation by adding some long-range links to standard mesh topology. Their methodology tries to connect distant nodes that communicate with each other frequently with direct long links. The long-range link insertion methodology proved useful in reducing the average latency and improving the throughput. Graph partitioning algorithms1 are widely used in many fields, such as parallel computing, power system analysis, and VLSI design to divide a large network into smaller partitions [6, 15, 19]. The most popular partitioning algorithms are Kernighan-Lin (KL) [16] and its derivatives like Fiduccia-Mattheyses (FM) [11] and Min-Cut [18]. Many other techniques were also proposed to carry out the partitioning task like linear, scattered, random, and spectral methods [7, 12]. A tool, OIDIPUS, that uses network partitioning for ASNoC topology generation was presented in [1]. As being the first work to use network partitioning, the tool objectives were limited to reduce latency and power consumption. The number of partitions and the size of each partition should be given to the tool. Moreover, each partition was only limited to ring topology. Finally, the generated topology from the tool was not compared to any other standard or custom topologies. To the best of our knowledge, the work presented here is the first work that tries to reduce ASNoC area using network partitioning. The proposed technique is completely general and puts no restrictions on the number of partitions or the size and the topology of each partition. 3. IMPACT OF NETWORK PARTITIONING ON ASNOC AREA The ever-increasing on-chip communication requirements for modern applications force ASNoCs to replace ordinary shared buses. Nevertheless, ASNoC area is not supposed to exceed 10% of the whole design area [14]. The area of the on-chip networks is mainly determined by routers area [24]. Moreover, ports with their buffers constitute the major percentage of any router area [8]. Therefore, to lower ASNoC 1 Network partitioning and graph partitioning are used interchangeably in this paper. area, the number of ports for the routers used in generating the topology should be reduced. To explain this point further, the area of input-queue routers with different number of ports, including the local port, is replicated from [24] in Figure 1. Each port has an equal buffer size and consists of two channels for packets reception and transmission, respectively. The figure clarifies how reducing the number of ports is a key factor in reducing the ASNoC area. 4 11 x 10 10 9 Router area (µm2) how to represent different applications. Formulation and assumptions of the network partitioning problem are discussed in Section 5. Section 6 represents some experimental results and justifies the efficiency of our technique through case studies. Finally, we draw our conclusions and give ideas for future work in Section 7. 8 7 6 5 4 3 2 1 0 2 3 4 5 Number of ports Figure 1: Router area versus number of ports [24]. Network partitioning divides a large network into smaller partitions. Partitions are then connected together to produce a complete custom topology. It is apparent that smaller partitions require routers with less number of ports than those used in a single unpartitioned network. As a result, partitioning could be used to reduce ASNoC routers area. Not only does using network partitioning reduce routers area, but also removes some of the global links. Global links connect between routers, whereas local links connect cores to routers. Global links are usually long and require driving circuits and repeaters. Accordingly, it is desirable to decrease the number of global links in the generated topology. This discussion clarifies that network partitioning could be used to reduce the two major factors of ASNoC area; routers area and the number of global links. Consequently, the area reduction problem converges into a network partitioning problem. As an example of the area reduction resulted from using network partitioning, Figure 2 shows a mesh implementation of 16-node application before and after partitioning. The two partitions are separated by the dotted line in Figure 2(b). The partitioning strategy itself is discussed in details in Section 5. The number in each node in the figure represents the number of ports of the router in this node. From one hand, partitioning results in removing three global links in this example. On the other hand, Table 1 summarizes the total number of routers with different number of ports before and after partitioning and the total ASNoC routers area. Using network partitioning reduces the ASNoC routers area by 7.36% in this example. As the number of nodes becomes larger, the number of links removed by partitioning increases. By removing a link, the number of ports of the two routers that were connected by this link decreases by one. According to Figure 1, the router area decreases in a nearly linear proportion with the number of ports removed. Therefore, the more cores exist in the application, the more global links are removed by partitioning, and the more routers area is saved. As a result, 3 4 4 3 3 4 4 3 C1 70 C2 362 C3 362 362 C7 4 5 5 3 4 4 5 3 353 357 C6 300 C8 27 C5 C4 C9 49 0.5 910 C16 C10 C13 C11 32 16 C9 190 C12 16 C4 173 C1 0.5 670 16 313 C5 600 C5 64 C2 60 C7 C1 128 64 C6 C3 64 C3 C2 64 64 313 4 3 4 5 5 5 4 3 500 94 C10 3 4 3 3 4 (a) before partitioning 4 4 3 (b) after partitioning Circles represent routers and rectangles represent local processing cores. Number in each circle represents the number of ports of the router in this node. Figure 2: Mesh implementation of a 16-node application before and after partitioning. Table 1: Number of routers and total routers area before and after partitioning of a 16-node application, mapped onto mesh topology, as shown in Figure 2. Before partitioning After partitioning 5-port Router 4 4-port Router 8 3-port Router 2 Total routers area (µm2 ) 1,332,000 2 6 8 1,234,000 network partitioning is to be more useful in reducing the area cost as the number of cores in the application increases. 4. C11 GRAPH THEORY REPRESENTATION OF APPLICATIONS Graph theory concepts could be used to represent ASNoC applications. The application description and traffic characteristics are assumed to be statically analyzed and known a priori in this paper. Therefore, any application could be described by a core graph [20, 21]. Graph partitioning algorithms could be then applied to this graph. Core graph represents the processing elements in the application and the traffic between them. As shown in Figure 3, a core graph is a directed graph G(C, W), where each vertex ci ∈ C represents a core in the application, and the directed edge wij ∈ W represents the communication between cores ci and cj . The weight of the edge expresses the number of packets per time step transfered from core ci to core cj . Using ASNoC terminology, a core graph could be represented in the form of a traffic distribution matrix (λ) [10]. If there are n cores in the application, the dimension of the matrix is n×n. Any element λij in the matrix represents the weight of the edge wij . That is, the number of packets sent from core ci to core cj per time step. For example, the traffic distribution matrix for the PIP application shown in Figure 3(c) is represented as: 157 16 16 16 C15 16 C14 500 C12 (a) 250 C8 40 40 C8 64 C7 64 C4 C6 (c) (b) Figure 3: Examples of core graphs (a) VOPD [20], [24] (b) MPEG-4 [10], [21] (c) PIP [4], [20]. 5. PROBLEM FORMULATION The problem of network partitioning is known to be an NP-hard one [6]. Therefore, partitioning is usually formulated as an optimization problem. FM algorithm is adopted with modification to formulate our partitioning technique. Therefore, it is given the name ASNoC-FM. This section presents the mathematical formulation of the partitioning scheme used in this paper. Using ASNoC notations, the objective function to be minimized is first formulated in Subsection 5.1 according to the FM algorithm. Two constraints are then added to the algorithm in Subsection 5.2 to match the ASNoC design environment. FM algorithm is selected because of its superior performance over other partitioning techniques [6]. The algorithm aims at minimizing the total weights of those edges cut due to partitioning. Using ASNoC terminology, the FM algorithm results in minimizing the inter-partition traffic. The advantage of this minimization is fourfold. First, it ensures that the cores that communicate with each other frequently are within the same partition. Second, this minimization avoids creating bottlenecks in the ASNoC through the interpartition links. Third, it reduces the required buffering for those nodes located at the boundary of partitions. Fourth, minimizing the inter-partition traffic proves useful in mitigating the effect of partitioning on the average delay, as explained in Section 6. 5.1 Objective Function Formulation In order to formulate the partitioning problem mathematically, suppose that there are n cores in the application and it is required to divide them into m partitions, P1 · · ·Pm . The number of cores within any partition, Pk , is nk . Accordingly, the total traffic (λtotal ), in packets/time step, exchanged between all the cores within the application, is expressed as: λtotal = n n X X λij (1) i=1 j=1 For any partition, Pk , the total traffic within the partition (λk ), in packets/time step, is expressed as: λk = n n X X λij , ∀ {i, j} where ci , cj ∈ Pk (2) i=1 j=1  0  128   0   0 λ=  0  0   0 0 0 0 0 0 0 0 0 0 0 64 0 0 0 0 0 0 0 0 64 0 0 0 0 0 64 0 0 0 0 0 0 0 0 0 0 0 64 0 0 0 0 0 0 64 0 64 0 0 0 0 0 0 0 0 64 0            Similarly, for any partition, Pk , the total traffic sent from its nodes to all nodes in other partitions (λ̄k ), in packets/time step, is expressed as: λ̄k = n X n X i=1 j=1 λij , ∀ {i, j} where ci ∈ Pk and cj ∈ / Pk (3) The total traffic (λtotal ) depends on the application itself and will not be affected by partitioning. In contrast, the intra-partition traffic (λk ) and the inter-partition traffic (λ̄k ) change according to the employed partitioning scheme. As explained in Section 5, FM algorithm aims at minimizing the total traffic between all partitions. As a result, the overall inter-partition traffic (λinter ), in packets/time step, to be minimized as an objective function is expressed as: λinter = m X λ̄k (4) k=1 5.2 Constraints Formulation The FM optimization algorithm is originally proposed as a graph partitioning technique. It is going to be modified it in this subsection by adding some constraints to make it more suitable for the on-chip environment. The first constraint to be added to the FM algorithm is to divide the total ASNoC traffic equally between partitions, i.e., λ1 = · · · = λm = λtotal /m. This traffic-balance constraint ensures equal activity in each partition and avoids creating hot spots in the design. As the traffic for all partitions cannot exactly be the same, a more realistic unbalance traffic factors, uλ1 · · · uλm , are to be used. For any partition (Pk ), its traffic factor (uλk ) is expressed as a percentage of the total traffic (λtotal ). The choice of these traffic factors gives the designer the flexibility to allow for some traffic variations between partitions. Although these factors are applicationdependent and no specific values could be proposed for them, they should be carefully chosen. Very large factors produce completely unbalanced partitions, whereas no result might be obtained with very small factors. Using these unbalance traffic factors, the first constraint is expressed as: λtotal ± uλk · λtotal , ∀ 1 ≤ k ≤ m (5) m The second constraint to be added to the FM optimization algorithm is to divide the cores equally between partitions, i.e., n1 = · · · = nm = n/m. This core-balance constraint is to facilitate the floorplanning. Similar to the first constraint, unbalance core factors, un1 · · · unm , are used to allow for some variation between the number of cores within partitions. For any partition (Pk ), its core factor (unk ) represents the number of cores above or below the average number of cores for all partitions. According to the number of cores within the application, each factor is usually either 0 or 1. Using these unbalance core factors, the second constraint is expressed as: jnk nk = ± unk , ∀ 1 ≤ k ≤ m (6) m To this end, our ASNoC-FM optimization problem is completely formulated as: M inimize : m X λinter = λ̄k (7) λk = k=1 Subject to : λk − ( 1 ± uλk ) · λtotal = 0 m ,∀ 1 ≤ k ≤ m (8) ,∀ 1 ≤ k ≤ m (9) and nk − jnk m = ±unk where n is the total number of cores in the application, nk is the number of cores in a specific partition (Pk ), and m is the number of partitions. λij is the amount of traffic sent from core ci to core cj . uλk and unk are the traffic and core unbalance factors, respectively. λk and λ̄k are the intrapartition and the inter-partition traffic for partition (Pk ), as expressed in (2) and (3), respectively. Finally, λtotal and λinter are the total and the total inter-partition traffic, as expressed in (1) and (4), respectively. The ASNoC-FM algorithm starts with an arbitrary division of cores between partitions. Cores within any partition are ranked according to the amount of communication with other partitions. The core with the highest rank is then moved to the partition with which it communicates the most. The move is only confirmed and carried out if it reduces the inter-partition traffic according to (7) and meets the constraints in (8) and (9). The ranking process is then repeated and the algorithm continues with all other cores to partition the cores with the minimum inter-partition traffic. ASNoC-FM finally produces the cores within each partition corresponding to this minimum traffic. 6. EXPERIMENTAL RESULTS As a proof of concept, the ASNoC-FM algorithm is applied to three applications with different number of cores: Video Object Plane Decoder (VOPD) with 16 cores, as shown in Figure 3(a) [20, 24], MPEG-4 decoder with 12 cores, as shown in Figure 3(b) [21, 24], and a Picture-In-Picture (PIP) with 8 cores, as shown in Figure 3(c) [4, 20]. As the number of cores is not large, the number of partitions is chosen to be two. The software package, Chaco [12], is used to carry out the partitioning. NMAP is employed for the mapping because of its superior performance over other mapping techniques [20]. Mesh topology is used for all partitions. The reason behind mesh selection is threefold. First, mesh is proved to outperform other ASNoC topologies with respect to area and power consumption [4, 23]. Second, most of practical NoC implementations are achieved with grid-based topologies, like mesh and torus [3]. Third, mesh topology could be easily implemented inside chips. Total routers area and the number of global links resulted from using ASNoC-FM algorithm are compared to those of a standard mesh topology, mapped also by NMAP algorithm. Long-range links are then inserted into the mesh according to the methodology proposed in [22] and discussed in the related work. The resultant topology is given the name custom long-range topology. Total routers area and the number of global links of the long-range topology are also included in the comparison. These results are shown in Figures 4 and 5, respectively. Using ASNoC-FM, routers area is reduced by 7.46%, 6.86%, and 5.53% with respect to standard topology and by 10.75%, 11.05%, and 11.7% with respect to custom long-range topology for the VOPD, MPEG-4, and PIP applications, respectively. Similarly, the number of global links are reduced by 12.5%, 11.76%, and 10% with respect to standard topology and by 19.05%, 20%, and 22.2% with respect to custom long-range topology for the same three applications, respectively. From these results, it is apparent that network partitioning is a promising way to reduce ASNoC area especially for applications with large number of cores. Any ASNoC design faces a trade-off between cost and performance. Although ASNoC-FM algorithm reduces both routers area and the number of global links, it increases 5 x 10 12 Routers area (µm2) 1.6 ASNoC−FM Standard mesh Custom long−range Average internode disance (hops) 14 10 8 6 4 2 0 VOPD MPEG−4 1.2 1 0.8 0.6 0.4 0.2 0 PIP ASNoC−FM Standard mesh Custom long−range 1.4 VOPD Application MPEG−4 Figure 4: Routers area for different topology generation techniques. Figure 6: Average internode distance for different topology generation techniques. Number of global links Average internode distance (hops) 3.5 25 ASNoC−FM Standard mesh Custom long−range 20 15 10 5 VOPD MPEG−4 2.5 2 1.5 1 0.5 the average delay of the ASNoC. The zero-load average delay of ASNoC is usually expressed by the average internode distance [22]. To see the effect of using ASNoC-FM on the delay, the average internode distances for the three applications are calculated and plotted in Figure 6. The average internode distances are increased by 1.81%, 1.48%, and 20% with respect to standard topology and by 9.1%, 11.84%, and 25% with respect to custom long-range topology for the VOPD, MPEG-4, and PIP applications, respectively. The increase in average delay comes from the fact that the traffic to or from some nodes, other than those on the boundary, in a specific partition needs more hops to reach the destination nodes in other partitions. Consequently, minimizing the inter-partition traffic by using ASNoC-FM makes the increase in the average delay minimum. To validate this point, Figure 7 represents the average internode distances resulting from using other partitioning schemes. It is apparent that ASNoC-FM completely outperforms random and scattered partitioning schemes. Moreover, it is better than the spectral scheme by a ratio up to 1.95%. This result emphasizes that ASNoC-FM has the minimum effect on average delay. From the results above, it is apparent that application requirements and design constraints are to control the ASNoC topology generation. From one hand, previous results prove that network partitioning methodology outperforms other standard and custom topology generation methods from the area-cost point of view in the expenses of the delay. On the VOPD MPEG−4 PIP Application PIP Applicaion Figure 5: Number of global links for different topology generation techniques. ASNoC−FM Spectral Scattered Random 3 0 0 PIP Application Figure 7: Comparison between average internode distances for different partitioning schemes. other hand, long-range link insertion is a powerful method to reduce the delay in the expenses of the area-cost. Finally, if both area and delay are of equal importance, a compromise is required to select the most suitable topology generation method according to the application requirements and the design constraints. 7. CONCLUSION AND FUTURE WORK ASNoC is an efficient paradigm to solve the on-chip communication problem in modern SoC applications. ASNoC area should be kept as small as possible. In this paper, we analyzed how the area of the on-chip network could be reduced using network partitioning techniques. The smaller partitions require less number of ports and buffering, which constitute the main parts of the on-chip area. The formulation of the partitioning problem is presented by adopting and modifying the FM graph partitioning algorithm. Results prove that network partitioning is an effective way to reduce the area especially for applications with large number of cores. We plan to extend this work in different ways. First, we will try to develop some methods to compensate for the increase in average delay resulting from using network partitioning. Second, we are going to investigate and model the optimum number of partitions and the optimum number of nodes per partition. Third, we will target some of the reliability issues resulting from using network partitioning. Finally, we will try to combine partitioning with 3D NoCs to better enhance the performance. 8. REFERENCES [1] T. Ahonen, D. Siguenza-Tortosa, H. Bin, and J. Nurmi. Topology optimization for application-specific networks-on-chip. In Proceedings of the 2004 ACM international workshop on System level interconnect prediction (SLIP’04), pages 53–60, Paris, France, Feb. 14–15, 2004. [2] L. Benini. Application specific NoC design. In Proceedings of the IEEE Design, Automation and Test in Europe Conference (DATE’06), volume 1, pages 1–5, Munich, Germany, Mar. 6–10, 2006. [3] L. Benini and G. D. Micheli. Networks on chips: technology and tools. Morgan Kaufmann, San Francisco, CA, USA, July 2006. [4] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, and G. D. Micheli. NoC synthesis flow for customized domain specific multiprocessor systems-on-chip. IEEE Transactions on Parallel and Distributed Systems, 16(2):113–129, Feb. 2005. [5] L. Bononi and N. Concer. Simulation and analysis of network on chip architectures: Ring, spidergon and 2D mesh. In Proceedings of the IEEE Design, Automation and Test in Europe Conference (DATE’06), volume 2, pages 154–159, Munich, Germany, Mar. 6–10, 2006. [6] B. Chamberlain. Graph partitioning algorithms for distributing workloads of parallel computations. Technical Report UW-CSE-98-10-03, University of Washington, WA, USA, Oct. 1998. [7] P. Chan, M. Schlag, and J. Zien. Spectral k-way ratio-cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(9):1088–1095, Sept. 1994. [8] M. Coenen, S. Murali, A. Radulescu, K. Goossens, and G. D. Micheli. A buffer-sizing algorithm for networks on chip using TDMA and credit-based end-to-end flow control. In Proceedings of the Third IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’06), pages 130–135, Seoul, Korea, Oct. 22–25, 2006. [9] H. Elmiligi, A. A. Morgan, M. W. El-Kharashi, and F. Gebali. A topology-based design methodology for networks-on-chip applications. In Proceedings of the second International Design and Test Workshop (IDT’07), pages 61–65, Cairo, Egypt, Dec. 16–18, 2007. [10] H. Elmiligi, A. A. Morgan, M. W. El-Kharashi, and F. Gebali. Power-aware topology optimization for networks-on-chips. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’08), pages 360–363, Seattle, WA, USA, May 18–21, 2008. [11] C. Fiduccia and R. Mattheyses. A linear-time heuristic for improving network partitions. In Proceedings of the 19th Annual IEEE/ACM Design Automation Conference (DAC’82), pages 175–181, Las Vegas, NV, USA, June 14–18, 1982. [12] B. Hendrickson and R. Leland. The Chaco user’s guide: Version 2. Technical Report SAND95-2344, Sandia National Laboratories, Albuquerque, NM, USA, July 1995. [13] J. Hu and R. Marculescu. Energy-aware mapping for [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] tile-based NoC architectures under performance constraints. In Proceedings of the IEEE Asia and South Pacific Design Automation Conference (ASP-DAC’03), pages 233–239, Kitakyushu, Japan, Jan. 21–24, 2003. A. Jantsch. NoCs: A new contract between hardware and software. In Proceedings of the Euromicro Symposium on Digital System Design (DSD’03), pages 10–16, Belek-Antalya, Turkey, Sept. 1–6, 2003. G. Karypis and V. Kumar. Multilevel algorithms for multi-constraint graph partitioning. In Proceedings of the 1998 IEEE/ACM Conference on Supercomputing (SC’98), pages 1–17, San Jose, CA , USA, Nov. 7–13 1998. B. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49(2):291–307, Feb. 1970. N. Koziris, M. Romesis, P. Tsanakas, and G. Papakonstantinou. An efficient algorithm for the physical mapping of clustered taskgraphs onto multiprocessor architectures. In Proceedings of the Eighth IEEE Euromicro Workshop on Parallel and Distributed Processing (EURO-PDP’2000), pages 406–413, Rhodos, Greece, Jan. 19–21, 2000. B. Krishnamurthy. An improved min-cut algorithm for partitioning VLSI networks. IEEE Transactions on Computers, 33(5):438–446, May 1984. F. Li, Y. Xiao, and Q. Zhang. Graph partition based traffic balance in industrial network. In Proceedings of the IEEE International Conference on Communications and Networking in China (CHINACOM’07), pages 438–442, Shanghai, China, Aug. 22–24, 2007. S. Murali and G. D. Micheli. Bandwidth-constrained mapping of cores onto NoC architectures. In Proceedings of the IEEE Design, Automation and Test in Europe Conference and Exhibition (DATE’04), volume 2, pages 896–901, Paris, France, Feb. 16–20, 2004. S. Murali and G. D. Micheli. SUNMAP: A tool for automatic topology selection and generation for NoCs. In Proceedings of the 41st IEEE/ACM Design Automation Conference (DAC’04), pages 914–919, San Diego, CA, USA, June 7–11, 2004. U. Ogras and R. Marculescu. It’s a small world after all: NoC performance optimization via long-range link insertion. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 14(7):693–706, July 2006. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh. Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Transactions on Computers, 54(8):1025–1040, Aug. 2005. W. Shen, C. Chao, Y. Lien, and A. Wu. A new binomial mapping and optimization algorithm for reduced-complexity mesh-based on-chip network. In Proceedings of the IEEE International Symposium on Networks-on-Chip (NOCS’07), pages 317–322, Princeton, NJ, USA, May 7–9, 2007.