Adaptive Load Balancing in Learning-based Approaches for many-core embedded systems

Adaptive routing algorithms improve network performance by distributing traffic over the whole network. However, they require congestion information to facilitate load balancing. To provide local and global congestion information, we propose a learning method based on dual reinforcement learning approach. This information can be dynamically updated according to the changing traffic condition in the network by propagating data and learning packets. We utilize a congestion detection method which updates the learning rate according to the congestion level. This method calculates the average number of free buffer slots in each switch at specific time intervals and compares it with maximum and minimum values. Based on the comparison result, the learning rate sets to a value between 0 and 1. If a switch gets congested, the learning rate is set to a high value, meaning that the global information is more important than local. In contrast, local is more emphasized than global information in non-congested switches. Results show that the proposed approach achieves a significant performance improvement over the traditional Q-routing, DRQ-routing, DBAR and Dynamic XY

Copyright Notice The document is provided by the contributing author(s) as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. This is the author’s version of the work. The final version can be found on the publisher's webpage. This document is made available only for personal use and must abide to copyrights of the publisher. Permission to make digital or hard copies of part or all of these works for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage. This works may not be reposted without the explicit permission of the copyright holder. Permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the corresponding copyright holders. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each copyright holder. IEEE papers: © IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The final publication is available at http://ieeexplore.ieee.org ACM papers: © ACM. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The final publication is available at http://dl.acm.org/ Springer papers: © Springer. Pre-prints are provided only for personal use. The final publication is available at link.springer.com Adaptive Load Balancing in Learning-based Approaches for many-core embedded systems F. Farahnakian, M. Ebrahimi, M.Daneshtalab, P.Liljeberg, and J.Plosila Abstract—Adaptive routing algorithms improve network performance by distributing traffic over the whole network. However, they require congestion information to facilitate load-balancing. To provide local and global congestion information, we propose a learning method based on dual reinforcement learning approach. This information can be dynamically updated according to the changing traffic condition in the network by propagating data and learning packets. We utilize a congestion detection method which updates the learning rate according to the congestion level. This method calculates the average number of free buffer slots in each switch at specific time intervals and compares it with maximum and minimum values. Based on the comparison result, the learning rate sets to a value between 0 and 1. If a switch gets congested, the learning rate is set to a high value, meaning that the global information is more important than local. In contrast, local is more emphasized than global information in noncongested switches. Results show that the proposed approach achieves a significant performance improvement over the traditional Q-routing, DRQ-routing , DBAR and Dynamic XY algorithms. Index Terms—Congestion-aware routing algorithm, Adaptive routing algorithm, Q-learning and Q-routing approaches ——————————  —————————— 1 INTRODUCTION As the routing model impacts the performance and power consumption of the network in Many-core Embedded Systems (MES), a considerable amount of research has been done to improve the routing efficiency of the network. Routing algorithm determines the output channels for forwarding a packet to the destination switch. Generally, routing algorithms can be classified into deterministic and adaptive [1]. In deterministic routing algorithms [2], a transfer path is completely determined by the source and destination addresses. These algorithms are unable to alleviate congestion since the routing decision is independent of the network condition. For example, in the XY routing algorithm, packets first transfer along the X direction, then along the Y direction. Implementing deterministic routing is simple because it stays unchanged during network's operation. However, when the packet injection rate increases, deterministic algorithms cannot balance the load across the links. In addition, congestion is a major factor in limiting the performance of network and needs to be avoided [3]. In adaptive routing algorithms [4], the transfer path of each packet is determined based on the current network conditions. When network congestion happens, they choose paths with low latencies to avoid congested links and switches. Reinforcement learning [5] is a machine learning paradigm that has been widely applied in many different areas. In reinforcement learning, an agent or decision-maker learns a model of the environment at run-time and then utilizes these experiences to find an effective control policy for the given task. So, it provides a general framework for highperformance, self-optimizing controller design without prior knowledge. The most popular learning algorithm applied in reinforcement learning is Q-learning [6]. In Q-learning, the learning agent percepts the environment and chooses an action at each state. The agent receives a reward after every action it executes. The final goal of the agent is to learn a policy for selecting the best action among all possible actions. The Q-learning algorithm was used to create an adaptive routing algorithm named Q-routing [7]. Each switch makes its routing decisions based on the information on their neighboring switches in Q-routing. In other words, a switch stores a table of Q-values that estimate the quality of the alternative paths. These values are updated each time the switch sends a packet to one of its neighbors. Therefore, a few Q-values might be updated depending on the traffic patterns and load levels in the network. Dual Reinforcement Qrouting (DRQ-routing) [8] has been presented to address this problem. In DRQ-routing, the Q-values are updated by carrying the latency information to neighboring switches (backward exploration unique to DRQ-routing) and by receiving packets from neighboring switches (forward exploration similar to Q-routing). The speed of learning is expected to be higher in DRQ-routing than Q-routing because of the enhanced exploration in DRQ-routing. At high and medium injection loads, the routing policy learned by DRQ-routing leads to a higher performance than Q-routing in terms of average packet delivery time. In this paper we present a routing algorithm named Congestion-aware routing Algorithm using Dual Q-routing (CADuQ) with the following main contributions:  A learning method based on DRQ-routing is utilized in CADuQ to alleviate congestion in the network by providing local and global congestion information. This congestion information is carried by packets throughout the network. Each switch maintains a table to store the estimated latencies from the source to any possible destination switches. The corresponding entry of the table is updated whenever the switch receives a packet. This method employs both backward and forward explorations to update the estimated latency values in each switch. The estimated latencies are used to select less congested paths for packets.  The waiting time in the packet queue of a switch is a commonly used metric for estimating latency in communication networks. Since the range of the waiting time varies from zero to infinity, the size of the tables should be very large to store the estimated waiting times. This is not a suitable approach due to its area overhead. Although an upper bound value can be considered, but this cannot be easily determined as it highly depends on the traffic condition in the network. In this paper, we considered the average number of free slots in input ports as a metric to reflect the congestion condition in each switch. In this way, not only the values are limited to a bounded range but also as shown in the result section, they lead to a better performance gain.  In traditional methods, commonly, a static learning rate is used whatever the network load is. Another contribution of this paper is to adjust the learning rate in accordance with the changing traffic condition of the network. For this purpose, we present a congestion detection method which calculates the average number of free buffer slots for each switch at specific time intervals. According to the obtained value, when the network is congested, the learning rate is set to a large value in order to keep the values as updated as possible. In contract, when the switch is not congested the learning rate is set to a minimum value amplifying the impact of local congestion values. The reminder of this paper is organized as follows. In Section 2, the related work is discussed. Some background information on Q-routing and DRQ-routing techniques is given in Section 3. In Section 4, we discuss the congestion metrics that can be used to estimate the congestion status in the network. The congestion detection method is described in Section 5. The proposed adaptive routing algorithm is explained in Section 6. The results are reported in Section 7, while the summary and conclusion are given in the last section. 2 RELATED WORK A significant amount of research has been directed towards developing adaptive routing protocols in network. Adaptive routing algorithms, depending on the degree of adaptiveness can be classified as partially adaptive or fully adaptive [9]. Partially adaptive routing algorithms use some of the shortest paths between the sender and the receiver, but not all packets are allowed to use every shortest path. For example the turn model routing is based on prohibiting certain turns during the packet routing to prevent deadlock [10]. Fully adaptive routing algorithms allow routing packets on any shortest paths. To make the routing algorithm fully adaptive, there is a need of virtual channels in a wormhole switching network [11][12]. Adaptive routing algorithms can be distinguished whether they are minimal or non-minimal. Minimal adaptive routing algorithm only route packets along shortest paths to their destinations. Non-minimal routing algorithms do not necessarily use shortest paths. Adaptive routing policies can be categorized into congestion-oblivious and congestion-aware schemes. In congestion-oblivious algorithms, routing decisions are independent of the congestion condition of the network. Random [14] and Zigzag [13] are two examples of congestion-oblivious methods. DyXY [15] and DyAD [16] are two approaches of the congestion-aware routing algorithms. These routing algorithms consider the congestion status of the network in the routing decision. The congestion level of a switch can be measured based on the number of available virtual channels [17] or queue length [18] at input port. The presented approach in [19] uses the amount of free buffer space available in each VC of adjacent switches as a congestion metric. Congestion-aware routing policies can be further classiﬁed based on whether they rely on purely local congestion information or take into account the congestion status at other points in the network. In DyXY, a packet is sent either to the X or Y direction depending on the congestion condition. It uses local information which is the current queue length of the corresponding input port in the neighboring switches to decide on the next hop. Most common implementations of routing algorithms, e.g. NoP[20], RCA [21], and DBAR [22], have focused on collecting local or global congestion information to get an estimation of the congested areas in the network. The Q-routing allows a network to be constantly adapted to the changing traffic condition. In this method, each switch makes its routing decision based on Q-values which estimate the quality of alternative routes. However, Q-routing suffers from the unreliability of the estimated Q-values. The reason is that depending on traffic patterns and load levels, only few Q-values are updated regularly while most of the Q-values in the network remain unupdated. Some proposals have been presented to address this issue such as PQ-routing [23], CQ-routing [24] and DRQrouting [25]. PQ-routing keeps the best Q-values and reuses them by predicting the traffic trend. This algorithm is able to explore paths that have been inactive for a long time, and thereby is able to restore its previous policy. In CQ-routing, each Q-value is attached with a conﬁdence value (C-value) which determines how closely the corresponding Q-value represents the current state of the network. A similar idea is presented in DRQ-routing where the speed of restoration is expected to be high. In the DRQ method, learning is performed by the latency information appended to data packet to neighboring switches (backward exploration unique to DRQ-routing) and also by receiving latency information from the neighboring switch where a data packet is sent to (forward exploration similar to Q-routing). This approach leads to increase the speed of learning by providing a bi-directional exploration. Reinforcement learning approaches have rarely been investigated in on-chip networks. A fault-tolerant deflection routing algorithm (FTDR) is presented in [26] which is inspired by Q-learning techniques for tolerating faults. In another work, DyNoC [27] is proposed in order to handle communication among modules which are dynamically placed on a reconfigurable on-chip network. Q-routing is also used to provide different levels of Quality-of-Service (QoS) such as Best Effort (BE) and Guaranteed Throughput (GT) [28]. In this approach, a novel scheme is presented which contrasts the performance of Q-routing with the XY routing strategy in the context of QoS. C-routing [29] is a Q-learning based method which takes advantage of clustering approach to reduce the routing table size compared with conventional Qrouting algorithms. A Q-learning based Congestion-aware Algorithm (QCA) [30] is presented to alleviate congestion condition. QCA estimates and predicts the congestion condition of the network as close to the actual values as possible. This estimation is used in the routing decision to choose a less congested path. Moreover, Dual Q-routing Adaptive learning Rate (DuQAR) [31] has enhanced Q-routing performance for Networks-on-Chip when the network becomes congested. For this to happen, a congestion detection technique is introduced in this method which updates information dynamically according to the changing traffic condition. HARAQ [32] takes the best usage of allowable turn in the network and finds all alternative paths to route packets. Then it uses a learning approach to find an optimal path between all options of minimal or non-minimal routes. 3 BACKGROUND This section reviews the basic concepts of Q-learning which is one of the most widely used reinforcement learning methods. Then we describe Q-routing algorithm by applying Q-learning to a communication network. Finally, we explain the DRQ-routing approach taking advantage of the dual reinforcement learning aspect. 3.1 Q-learning Q-learning [6] is a highly adaptive algorithm which enables an agent or decision maker to interact with the environment without prior knowledge about the system status to achieve a given goal. At each step of an episode, the agent first observes the current system state (s) and chooses an action (a). By performing the action, the system moves to the next state (s’) and the agent also receives a reward (a real number) or punishment (a negative real number). The reward value updates the Q-value, Q(s,a), which represents the expected long-term reward of taking action a in state s. Q-values are stored into a Q-table that maintains four fields named Current State, Next State, Action, and Q-value. Current State refers to everything that the agent currently perceives, while Next State is determined by the Current State and the actions selected by the agents. The Action indicates the action that the agent performs. Q-value holds the reward value that the agent receives in the current state when it executes an action. Once Q-values have been learned, the optimal action from any state is the one with the highest Q-value. 3.2 Q-routing Q-routing [7] is an adaptive routing method based on the Q-learning model in a communication network. Each switch includes a Q-table that estimates the quality of alternative routes. Q-table is updated each time a switch sends a packet to one of its neighbors. Assume that a packet destined for the switch d is forwarded from switch x to the neighboring switch y, (Figure 1). The maximum time it will take for a packet to reach its destination from the switch x is bounded by the sum of three quantities: - The waiting time (qy) in the packet queue of the switch y. - The transmission delay (δ) over the link from switch x to y. - The time Qy(z, d) it would take for the switch y to send this packet via the switch z to the destination switch (Figure 1). Switch z has the minimum latency (Q-value) between the neighbors of switch y. Qy(z, d) is obtained from the following equation: Q y ( z, d )  min Q y (n, d ) nN ( y ) where N(y) is a set neighbors of switch y. The switch y sends an estimated latency value to the switch x. This value contains the best latency estimate from the switch y to the destination d (Qy(z,d)) and the waiting time at the input buffer of switch y (qy) back to the switch x. Upon receiving the estimated value, the switch x computes the new estimate for Qx(y,d) as follows: Qx ( y, d ) est  Qy ( z, d )  q y   Qx(y, d)est is the switch x's best estimated delay that it would take for a packet to reach its destination switch d from the switch x when sent via its neighboring switch y. The new Q-value in the Q-table is obtained by the following equation after receiving the Qx(y, d)est value: Qx  y, d new  Qx  y, d old   Qx ( y, d ) est  Qx  y, d old  (1) Learning is performed by updating Q-values. Learning rate (γ) determines in which rate the new information overwrites the old one. Learning rate can take a value between zero and one; the value of zero means that no learning takes place by the algorithm; while the value of one indicates that only the most recent information is used. x s Qx ( i y,d ) y j z d Figure 1. An example of Q-routing method The Q-routing algorithm has two steps as follows: - Step 1: the switch x sends a packet, destined for the switch d, to one of its neighboring switches, y. 1. Select a packet from the queue. 2. Find the minimum Q-value from the switch x to the destination switch d, through one of the neighboring switches, assumed y. where N(x) is a set of the x’s neighboring switches. 3. Forward the packet to the neighboring switch y. 4. As he switch y receives a packet from switch x, find the minimum Q-value from the switch y to the destination switch d through one of the neighboring switches, assumed z. where N(y) is a set of the y’s neighboring switches. 5. Send y’s estimate back to switch x which includes Qy(z,d) and qy. 6. Get ready for receiving the next packet (goto 1). - Step 2: when the switch x receives Q-value‘s estimate from its neighboring switch y. 1. Switch x receives an estimated value from neighbor y containing Qy(z,d) and qy. 2. Update the Q-value (Qx(y, d)) as given in Equation (1). Q-routing first learns a representation of the network state in terms of Q-values and then uses these values to make routing decisions. Therefore, it is able to find the least congested path among available paths from a source switch to a destination switch. 3.3 DRQ-routing The Q-routing method only updates the Q-values when a switch sends a packet to neighboring switches. Therefore, this algorithm only uses forward exploration to determine the latency of the remaining path from the current switch to the destination switch. The DRQ-routing technique utilizes Q-routing for both backward and forward exploration for updating Q-values more frequently. Backward exploration indicates the latency of the traversed path from the current to the source switch. Therefore, the corresponding entries of the Q-table are updated with both forward and backward exploration rather than only backward exploration in Q-routing. As an example of the backward exploration, consider a case where the switch x sends a packet to its destination switch d via its neighboring switch y. When the data packet traverses from the source to the destination, it carries some latency information between each two neighboring switches. As the switch y receives this packet, it uses this information for updating its own estimate of sending a packet to source s via switch x. Suppose the packet is currently at switch x in Figure 2. This packet contains the minimum latency value from the switch s to the switch x passing though the switch h. This value can be defined as: Qx (h, s)  min Qx (n, s) nN ( x ) When the packet arrives at switch y, it can update the estimation of sending a packet to the switch s via neighbor x. The new value includes Qx(h,s) and the waiting time at the input buffer of switch x (qx): (2) Qy x, s new  Qy x, s old   (Qx h, s   qx   )  Qy x, s old  In this way, each packet carries the routing information from a source to a destination. These values are used to update the Q-values of intermediate switches. In other words, the receiving switch uses the latency information of the traversed path to update the Q-values. Q x(h,s s h ) Qy ( x Qx ( i y,d ) j x,s ) y z d Figure 2. An example of DRQ-routing method 4 CONGESTION DETECTION METRICS The primary objective of the proposed algorithm is to improve the network performance by routing the packets trough the less congested paths. In Q-routing, Q-values reflect the real congestion level in each switch. So by sending packets through paths with minimum Q-values, congestion can be alleviated in the network. For calculating the Qvalues, we consider two congestion metrics: Waiting time (wTime): when a header flit enters a switch through one of the input port, it is stored in an input buffer. The flit is kept in a switch until it proceeds to the routing unit. If the switch is congested, the flit potentially waits on the input channel for extended period of time. Therefore, the waiting time in the input channel can determine the congestion status in a switch. wTime metric indicates the time from when a header flit enters an upstream switch until it processes by routing unit. Average free buffers (AvgBf): indicates the average number of free input buffer slots at downstream switches. The count of free buffers was first proposed as an indicator of congestion by Kim et al.[19]. Higher buffer capacity or a larger number of free buffer slots in each virtual channel will reduce network contention, thereby reducing latency. Measuring Q-values based on these metrics can reflect the congestion level of each switch. In order to validate the efficiency of congestion metrics, we set up experimental environment using a wormhole-based simulator based on the OMNET framework [33]. For all switches, the data width is set to 32 bits and the maximum bandwidth at each link is 1 flit per cycle. We used two virtual channels that each input buffer (FIFO) size of 8 flits. In each simulation time step (nanosecond), these packets are stored in the queue of the source switches. The amount of injected packets depends on the network offered load. The network offered load is calculated as follow: Network offered load = flit-size/flit arrival delay flit_size represents the size of the flit. Flit-arrival-delay represents the delay time between previous flit generation and next flit generation. The simulations were conducted on the 2D-mesh which is the most typical topology. The mesh size is selected to be 8×8 in our experimental results. The average latency as a function of the offered load is plotted for uniform random, transpose and hotspot traffic patterns. Experimental results (Figure 3) show that Q-values based on the average number of free buffers (AvgBf) are more efficient than wTime under various traffic patterns. It can be obtained that AvgBf represent the real latency value. 80 60 40 120 100 120 wTime AvgBf Latency(nSec) Latency(nSec) 100 wTime AvgBf Latency(nSec) 120 100 80 60 80 60 40 40 0.08 0.2 0.3 0.5 1 1,3 Offered Load(GB/Sec) wTime AvgBf 0.08 0.2 0.3 0.5 1 1,3 Offered Load(GB/Sec) 0.08 0.2 0.3 0.5 1 1,3 Offered Load(GB/Sec) (a) (b) (c) Figure 3. Performance evaluation under (a) uniform, (b) transpose and (c) hotspot traffic in the 8×8 mesh network In order to demonstrate how fast the AvgBf can detect the network congestion, Figure 4 depicts the learning process under the hotspot traffic with a single hotspot switch at (4, 4) in the 8×8 2D mesh. The variation of Q-values is plotted on the vertical axis and simulation time on the horizontal axis. Figure 4(a) and Figure 4(b) show the Q-values when using the AvgBf and wTime metrics, respectively. The variations of Q-values in Figure 4(a) are much less than that of Q-values in Figure 4(b). Moreover, Figure 4(a) shows that the Q-values are converged faster and have a shorter learning process. As can be obtained from these two figures, the buffer size can better reflect the congestion status of the network. (a) (b) Figure 4. The variation of Q-values based on the (a) AvgBf (b) wTime metric of a hotspot switch at (4, 4) in the 8×8 2D mesh 5 CONGESTION DETECTION METHOD In order to distribute the traffic, we need to distinguish the current congestion level of the network. Results in Section 4 show that using AvgBf for Q-values is more efficient than wTime to represent the current congestion status of the network. Furthermore, it is necessary to update the Q-values dynamically based on the network congestion condition. For this purpose, we present a congestion detection method which calculates the average number of free buffer slots of each switch at a predetermined time interval. After that, the average number of free buffer slots is compared with the maximum (MaxVal) and minimum (MinVal) values at each time interval. A massive experimental result was done in order to estimate the MaxVal and MinVal values. Based on our analysis, in general, the best results were obtained when MinVal and MaxVal set to 25% and 65% of total buffer slots. Thereby, a switch is counted as congested one if less than or equal to 25% of total queue size is free (MinVal). On the other hand, a switch is counted as a less congested one if more than or equal to 65% of total buffer size in a switch is free (MaxVal). The equations (3) and (4) are used to set the MinVal and MaxVal values, respectively. MinVal =25% × (TotalBufferSlots) MaxVal =65% × (TotalBufferSlots) (3) (4) If AvgBf in a switch is larger than MaxVal, it indicates that the switch is not congested and it is unnecessary to update the Q-values regularly. Thus, the learning rate is set to a minimum value amplifying the impact of local congestion statuses (LearnRate= 0.1). Moreover, if AvgBf in a switch is smaller than MinVal, the learning rate is set to a large value in order to keep the Q-values as updated as possible (LearnRate= 0.9). Finally, if AvgBf is between MinVal and MaxVal, LearnRate is set to 0.5. In this way, global congestion information gets more emphasis than local values. The functionality of congestion detection method is given as follow: ---congestion detection method 1. MinVal=25% (TotalBufferSlots); 2. MaxVal=65% (TotalBufferSlots); 3. AvgFreeBufferSlots, counti =0; 4. for t = current simulation time to 200ns OR 100 clk then 5. if switchi receives a flit 6. counti=counti+1; 7. FreeBufferSlotsi+= FreeBufferSlotsi[t]; 8. end if; 9. end for; 10. AvgFreeBufferSlots= FreeBufferSlotsi /counti ; -- Determine the learning rate 11. if AvgFreeBufferSlots i< = MinVal 12. LearnRate= 0.9; 13. else if MinVal < AvgFreeBufferSlots i< MaxVal 14. LearnRate = 0.5; 15. else if AvgFreeBufferSlots i>= MaxVal 16. LearnRate= 0.1; Initially, AvgBf is set to zero. Then it is measured in 200 nanoseconds (100 cycles) time interval (line 4-10) and then it compares with the MinVal and MaxVal values. Finally, the learning rate is set to a new value which is used during the current time interval (line 10-17). Throughout our experiments, we have considered learning rate values. These values are between 0 to 1 and control how much to change the Q-values. We could miss it in Equation (1) and (2) by setting ɣ to 1. The switch takes longer to learn when this rate set to a large value and Q-values update frequently. We therefore use three values learning rate depending upon the congestion level in a switch. In general, if the network gets congested, the Q-values are frequently updated and global congestion values from distant switches become reliable. In contrast, a switch may receive few packets in a time interval, so that the values from distant switches are not accurate and the local values should be more emphasized than the global ones. 6 CONGESTION-AWARE ROUTING ALGORITHM USING DRQ-ROUTING In this section, we describe the routing algorithm based on the adaptive learning rate. We explain the format of Qtables according to the 2D-mesh network features along with the format of packets that propagate in the network. Two types of packets can travel within the network, data packets and learning packets. Data packets carry both data and congestion information (used in backward exploration) while learning packets carry only the congestion information (used in forward exploration). 6.1 Routing Algorithm Routing algorithm selects an output channel for forwarding a packet to a destination switch. Congestion-aware routing Algorithm using Dual Q-routing (CADuQ) method can efficiently estimate the latency of a packet to reach different destinations through each of the possible output channels. CADuQ diagnoses the congested path in the network and routes packets through the less congested ones. CADuQ has three main characteristics: (1) Local and global congestion information is provided based on dual reinforcement learning by propagating data and learning packets.(2) local congestion information used in forward exploration is measured according to the free input buffer slots of the input buffer. This information is carried by learning packets. However, local congestion information used in backward exploration is calculated based on the free buffer slots of the input buffer located in the output direction. The reason is that, this information is used when a packet traverses in the reverse direction (i.e. from the current node to the source node). This information is carried by data packets. (3) Since the reliability of latency values depend on the speed of learning, we utilized a congestion detection method for updating the learning rate that was described in Section 5. In addition, forward and backward explorations in the CADuQ can reflect the current congestion condition in the network. The CADuQ algorithm can be summarized in three steps as follows: - Step 1: When a data packet is delivered from switch x to switch y. 1. Switch x determines the neighboring switch (y) with minimum latency value to the destination switch d: 2. Find the minimum estimated latency from the switch x back to the source switch s through the neighboring switch h (Q x(h,s)): Also, find the average number of free output buffer slots in switch x (qx). --- Include the backward estimation (Qx(h,s) and qx) into the header of the data packet. 3. BackwardLatency  Qx(h,s)+ qx ; 4. Forward the packet to the neighbor y. - Step 2: when switch y receives a data packet from switch x. 1. Extract the backward estimation including Qx(h,s) and qx from the received packet. 2. Set the learning rate according to the “Congestion Detection” method. 3. Update the Q-value (Qy(x, s)) using Equation (2). 4. Find the minimum latency estimation from the switch y to the destination switch d through the neighboring switch z: Also, calculate the average number of free input buffer slots in switch y (qy). 4. Send the learning packet back to the switch x including forward estimation (Qy(z,d) and qy). 5. ForwardLatency  Qy(z,d) + qy ; - Step 3: when a learning packet is transferred from switch y to switch x. 1. Extract the forward estimation containing Qy(z,d) and qy. 2. Set the learning rate according to the “Congestion Detection” method. 3. Update the Q-value (Qx(y, d)) using Equation (1) 6.2 Routing Tables In Q-routing, each switch has a Q-table similar to Figure 5(a) that maintains the routing latency from itself to the possible destination switches. The Q-table contains n entries, where n is the number of switches in the network. In this table, each row corresponds to a destination; each column relates to an output port; the contents of the table are the estimated latencies. Therefore, routing unit selects an output port from the Q-table which has minimum latency value for a corresponding destination. Notice that, in conventional Q-routing models, the waiting time at input buffers is a measure of latency, however, we use the number of free buffer slots at input buffers as a metric of latency. Since CADuQ is a minimal routing approach, packets can be delivered into at most two directions at a switch. Our basic structure of the adjusted Q-table is shown in Figure 5(b) in which the number of columns is reduced to two. Therefore, the table size is decreased by 50% compared with conventional Q-tables. Output Ports East West North . . . Y-Dir 0 1 0 1 Destination Switches X-Dir South Estimated Latencies . . . Estimated Latencies 8 8 (a) (b) Figure 5. (a) Conventional Q-table, (b) Adjusted Q-table of switch 1 in 3× 3 mesh network 6.3 Packets Format The header flit of data packet contains routing information such as destination and source addresses. In CADUQ, the data packet format is modified by adding five bits into the header flit (Figure 6). These additional fields can be described as follows:  Direction: determines the direction in which the data packet is delivered from the current switch. The value of 0 and 1 are corresponding to the X-direction and Y-direction, respectively. Since the packet is sent either to the X or Y dimension, one bit is enough to encode the direction of neighboring switch.  Backward Latency: determines the expected total latency to forward a packet from the current switch to the source switch. In Equation (2), the terms Qx(h,s) and qx indicate the global and local information of the backward latency. Total latency values are determined with four bits. 4 bits 1 bits 6 bits 6 bits Source Switch ID Destination Switch ID Direction Backward Latency ... Figure 6. Header of the data packet format The format of the learning packet is illustrated in Figure 7, which consists of three fields as follows:  Direction: It is a one bit value determining the direction of receiver switch of the data packet or direction of sender switch of the learning packet.  Forward Latency: It is obtained by summing up the forward local and global latency values. Local latency is measured by the average free buffer slots in downstream switch. In Equation (1), the value of Qy(z,d) and qy indicate the forward local latency. The term Qy(z,d) in Equation (1) indicates the global and local information in forward latency. This latency value determines the expected latency of a packet from the next switch to the destination. We have assigned four bits to represent the forward latency.  Destination switch ID: It determines the destination switch of a data packet. Six bits are considered for this purpose which is sufficient for 8×8 mesh network. For a bigger size network, more bits should be allocated to the destination field. 1 bits 4 bits Direction Forward Latency 6 bits Destination Switch ID Figure 7. Learning packet format Now, we present an example for the CADuQ routing process in a 3×3 mesh network. The number of Virtual Channels (VCs) is two and each VC has a buffer where buffer size is set to 8 flits. The MinVal and MaxVal value can be computed by using equations (3) and (4): MinVal=25%×16= 4 MaxVal=65%×16= 10.4 In Figure 8, suppose a packet is generated at the source switch 0 for the destination switch 8. Two output channels (East and North) can be selected at switch 0 for forwarding the data packet to the destination. As illustrated in Figure 8(a), the east direction has the lowest latency to reach the destination switch, so it is chosen as the next switch. Before forwarding the packet to switch 1, the sum of backward local and global latencies is added to the header flit of the data packet. The backward local latency (7) is obtained from the Q-table, indicating the average number of free buffer slots at the output port of switch 0. The global latency is equal to zero because the switch 0 is the source switch. When the intermediate switch 1 receives the data packet from the switch 0, backward exploration information is extracted from the data packet (Figure 8 (b)). Based on this information, the first row and Y-Dir column of the routing table in switch 1 is updated. Suppose the average free buffer slot of switch 1 is greater than MaxVal. So, the congestion condition of switch 1 is low and the learning rate is set to 0.1. This learning rate is used to update the Q-values. Therefore, according to Equation (2), the new Q-value is: Q1 0,0new  Q1 0,0old  0.1(0  q0   )  Q1 0,0old   2  0.1((0  7  0)  2)  2.5 We suppose that the link delay, δ, is a constant value. In this work, we set it to 0. Using Q-table information at switch 1, among north and east directions, the north direction has the lowest estimated latency. Hence, switch 4 is selected for forwarding the packet to switch 8. At this time, the learning packet is generated to return the forward local and global latencies back to switch 0. Local information is the average free number of free buffer slots at input port of switch 1. This value is assumed to be 6 in our example. The minimum estimated latency of routing the packet from switch 1 to switch 8 via the north port is considered as the global latency. This value is extracted from the Q-table in switch 1 (i.e. global=3). The sum of the local and global values is the new latency estimation from switch 1 to the switch 0 which will be stored in the learning packet. Finally, the table in the switch 0 is updated whenever the learning packet is received. Assume the average number of free buffer slots in the switch 0 is 7. Since this value is between MaxVal and MinVal in the last time interval, the learning rate γ is set to 0.5. Q0 1,8new  Q0 1,8old  0.5(Q1 4,8  q1 )  Q0 1,8old   2  0.5((3  6)  2)  5.5 As shown in Figure 8(a), the corresponding entry of the Q-table at switch 0 is updated taking the new estimated value and an existing estimation by using Equation (1). As can be seen from Figure 8 (b), switch 4 is selected for forwarding the packet to the destination switch and a similar process is done until the packet reaches switch 8. 6 7 8 6 7 8 3 4 5 3 4 5 0 1 2 0 1 2 Q-table of switch 1 X-Dir 0 8 Y-Dir 0 . . . 4 Source Destination Backward Switch ID Switch ID Direction Latency 8 1 7+0 X-Dir ... 0 Data packet from switch 0 to switch 1 2 8 5 . . . 3 Y-Dir 2 Forward Direction Latency 1 3+6 Destination Switch ID 8 2.5 7 5.5 Learning Packet (a) (b) Data Packet Figure 8. An example of CADuQ for a 3 ×3 mesh 7 RESULTS In order to evaluate the performance of our proposed routing algorithm, we compare it with conventional on-chip network (DBAR and Dynamic XY) algorithms and Q-learning schemes (Q-routing and DRQ-routing). The performance of the routing scheme is evaluated through latency curves. The simulations were conducted on the 8×8 and 14×14 2D-mesh under various traffic patterns. It is assumed that latency is the duration from the time when the first flit is created at the source core, to the time when the last flit is delivered to the destination core. To propagate data packets, two VCs are used along the X and Y dimensions, while for learning packets, one virtual channel is utilized. The buffer size per VC is set to 8 flits. The simulator is warmed up with 3,000 packets and then the average performance is measured over 10,000 data packets. Three synthetic traffic profiles including uniform random, transpose and hotspot, along with SPLASH-2 [34] application traces are used. 7.1 Uniform Random Traffic Profile In uniform random traffic, each core sends a packet to another core with a random probability. The destination of different packets in each switch is determined randomly using a uniform distribution. In Figure 9, the average latency as a function of the offered load is plotted. As observed from the results, the proposed routing scheme achieves better performance compared with the other schemes. As load increases, DBAR is unable to tolerate the high load condition, while Q-routing schemes learn an efficient routing policy. CADuQ can alleviate the congested areas and performs considerably better than other schemes. 7.2 Transpose Traffic Profile Transpose traffic means that a switch with the coordinates (i,j) sends packets to switch with the coordinates (n−j, n−i) in an n×n mesh. Figure 10 shows CADuQ behaves as efficiently as other routing algorithms. CADuQ leads to the lowest latency due to the fact that it can increase the exploration (two-fold) by dual reinforcement learning. In turn, increased exploration leads to increased speed of learning. Moreover, proposed method improves the quality of learning by adapting learning rate based on the congestion condition. 7.3 Hotspot Traffic Profile Under the hotspot traffic pattern, one or more switches are chosen as hotspots receiving an extra portion of the traffic in addition to the regular uniform traffic. This traffic represents a more realistic traffic pattern. In simulations, given a hotspot percentage of H, a newly generated packet is directed to each hotspot switch with an additional H percent probability. We simulate the hotspot traffic with a single hotspot switch at (4, 4) and (7, 7) in the 8×8 and 14×14 2D-meshes, respectively. The average latency of each network with H=10% are illustrated in Figure 11. As observed from the figure, the Q-learning schemes behave as efficiently as DBAR and DyXY especially in medium and high loads. Figure 11 illustrates the performance gain of our proposed method over Q-routing, DRQ-routing and conventional on-chip network (DBAR and Dynamic XY) algorithms near the saturation point (0.5) for a 14×14 2Dmesh. Using minimal routes along with the intelligent selection policy reduces the average network latency of CQ-routing for both mesh sizes. Table 1 illustrates the performance gain of the proposed approach over Q-routing, DRQ-routing, DBAR and Dynamic XY algorithms near the saturation point (0.5) for a 8×8 2D-mesh. Experimental results under different traffic patterns demonstrate that the on-chip network utilizing the CADUQ routing method clearly outperforms the conventional approach considerably. In addition, the impact of using dual reinforcement Learning policy and congestion detection method near the saturation point (0.5) over the DyXY routing, DBAR and the other Q-routing techniques is summarized in Table 2 for a 14×14 2D-mesh. The results reveal that the proposed approach can distribute the traffic efficiently. Table 1. Performance gain of the CADuQ method over others methods for three traffic patterns in a 8×8 2D-mesh Traffic Pattern Q-routing DRQ-routing DBAR DyXY Uniform 17.7% 12.9% 24.1% 30.6% 28% 35% Transpose 12.2% 7% 19.2% Hotspot 14.2% 8% 23.8% Table 2. Performance gain of the CADuQ method over others methods for three traffic patterns in a 14×14 2D-mesh Traffic Pattern Q-routing DRQ-routing DBAR DyXY Uniform 15.4% 8.6% 22.8% 30% 20.1% 26.7% Transpose 9.6% 7.2% 12.7% Hotspot 11.6% 9.4% 18.3% 7.4 Application Traffic Profile Application traces are obtained from the GEMS simulator [35] using some application benchmark suites selected from SPLASH-2. We use a 64-switch network configuration: 20 processors and 44 L2-cache memory modules. For the CPU, we assume a core similar to Sun Niagara and use SPARC ISA. Each L2 cache core is 512KB, and thus, the total shared L2 cache is 22MB. The memory hierarchy implemented is governed by a two-level directory cache coherence protocol. Each processor has a private write-back L1 cache (split L1 I and D cache, 64 KB, 2-way, 3-cycle access). The L2 cache is shared among all processors and split into banks (44 banks, 512 KB each for a total of 22 MB, 6-cycle bank access), connected via on-chip routers. The L1/L2 block size is 64 B. Our coherence model includes a MESI-based protocol with distributed directories, with each L2 bank maintaining its own local directory. The simulated memory hierarchy mimics SNUCA while the off-chip memory is a 4 GB DRAM with a 220-cycle access time. Error! Reference source not found. shows the average packet latency across five benchmark traces, normalized to DyXY. CADUQ provides lower latency than other schemes and it shows the greatest performance gain on Ocean with 50% reduction in latency. 7.5 Area overhead To estimate the hardware cost of our proposed method along with other routing schemes, the on-chip router of each scheme is implemented with VHDL and synthesized with Synopsys Design Compiler using the 65nm standard CMOS technology with a timing constraint of 1GHz for the system clock and supply voltage of 1V. The synthesized netlist is verified through post synthesis simulations. The layout areas of the four schemes are listed in Table 3 and the area overhead of CADuQ is verified with the other learning methods. Table 3. Hardware cost Network Area (mm2) 0.1503 0.1683 0.1689 0.1705 Routing Method DyXY Q-routing DRQ-routing CADuQ 8 CONCLUSION In this paper, we propose a congestion-aware routing algorithm based on dual reinforcement learning for multiprocessor platform. Commonly, in learning approaches, each switch maintains a routing table which is updated at runtime by local and global latency information (Q-values). The latency values are obtained when transferring data and learning packets between adjacent switches. Depending on the network load and traffic patterns, some of the Q-values may not be updated for a long time. On the other hand, the routing decision based on unreliable Q-values cannot be accurate and does not necessarily reflect the current congestion state of the network. For addressing this issue, we proposed a technique to adjust the learning rate with the network condition. A congestion detection technique is utilized to compute the current congestion level of a switch. If a switch is congested, the learning rate is set to a large value in order to keep the global latency values as updated as possible. Otherwise, when the switch is not congested, a small value is assigned to the learning rate, meaning that the local information prioritize over global information. The experiments show that the CADuQ routing method is able to route packets more efficiently than traditional methods. For instance, the performance is improved up to 35% compared with the DyXY method under hotspot traffic in 8×8 mesh network. DyXY DBAR Q-routing DRQ-routing CADuQ 100 80 DyXY DBAR Q-routing DRQ-routing CADuQ 120 Latency(nSec) Latency(nSec) 120 60 100 80 60 40 40 0.08 0.2 0.3 0.5 1 0.08 1,3 Offered Load(GB/Sec) 0.2 0.3 0.5 1 1,3 Offered Load(GB/Sec) (a) (b) Figure 9. Performance under different traffic models in (a) 8×8 2D-mesh and (b) 14×14 2D-mesh under the uniform traffic model DyXY DBAR Q-routing DRQ-routing CADuQ 100 80 DyXY DBAR Q-routing DRQ-routing CADuQ 120 Latency(nSec) Latency(nSec) 120 60 40 100 80 60 40 0.08 0.2 0.3 0.5 1 Offered Load(GB/Sec) 1,3 0.08 0.2 0.3 0.5 1 Offered Load(GB/Sec) (a) (b) Figure 10. Performance under different traffic models in (a) 8×8 2D-mesh and (b) 14×14 2D-mesh under the transpose traffic model 1,3 DyXY DBAR Q-routing DRQ-routing CADuQ 100 80 DyXY DBAR Q-routing DRQ-routing CADuQ 120 Latency(nSec) Latency(nSec) 120 60 40 100 80 60 40 0.08 0.2 0.3 0.5 1 1,3 0.08 Offered Load(GB/Sec) 0.2 0.3 0.5 1 1,3 Offered Load(GB/Sec) (a) (b) Figure 11. Performance under different traffic models in (a) 8×8 2D-mesh and (b) 14×14 2D-mesh under the hotspot traffic model Normalized average latency DyXY DBAR Q-routing DRQ-routing CADuQ 1 0,9 0,8 0,7 0,6 0,5 0,4 0,3 0,2 0,1 0 FFT LU Radix Ocean Water-Nsq Figure 12. Performance across SPLASH-2 benchmarks REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] L. M. Ni, P. K. McKinley, “A survey of wormhole routing techniques in direct networks”, Computer, pp. 62-76,1993. E. Rijpkema, K. Goossens, A. Radulescu, J. Dielissen, J. Van Meerbergen, P. Wielage, and E. Waterlander, “Trade-offs in the design of a switch with both guaranteed and best-effort services for networks on chip”, IEE Proceedings: Computers and Digital Techniques, pp. 294-302, 2003. D. Kim, S. Yoo, S. Lee, “A Network Congestion-Aware Memory Controller”, in Proc. of Fourth ACM/IEEE International Symposium on Networks-on-Chip, pp. 257-264, 2010. M. Dehyadegari et al., “An Adaptive Fuzzy Logic-based Routing Algorithm for Networks-on-Chip,” in Proceedings of 13th IEEE/NASA-ESA International Conference on Adaptive Hardware and Systems (AHS), pp. 208-214, 2011. R. S. Sutton and A. G. Barto, “Reinforcement Learning. An Introduction”, MIT Press, Cambridge, MA, 2000. C. J. C. H. Watkins and P. Dayan, “Q-Learning”, in Proc. Machine Learning, pp.279-292, 1992. J. A. Boyan, M. L . Littman, “Packet routing in dynamically changing networks:A reinforcement learning approach”, Advances in Neural Information Processing Systems 6., pp. 671-678,1994. S. Kumar and R. Miikkulainen, “Dual reinforcement Q-routing: An on-line adaptive routing algorithm”, in Proc. of the Artificial Neural Networks in Engineering Conference, pp. 231-238, 1997. T. Schonwald, J. Zimmermann, and O. Bringmann, "Fully Adaptive Fault-Tolerant Routing Algorithm for Network-on-Chip Architectures," in Euromicro Conf. on Digital System Design Architectures, Methods and Tools (DSD), Lübeck, pp. 527-34, 2007. G.-M. Chiu, “The Odd-Even Turn Model for Adaptive Routing”, IEEE Transactions on Parallel and Distributed Systems, vol. 1, no. 7, 2000. Y. M. Boura and C. R. Das, “Efﬁcient Fully Adaptive Wormhole Routing in n-Dimensional Meshes”, in Proceedings of the 14th International Conference on Distributed Computing Systems (ICDCS), 1994. A. A. Chien and J. H. Kim, “Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors”, Journal of the ACM, vol. 42, no. 1, 1995. W. Feng and K. G. Shin, “Impact of Selection Functions on Routing Algorithm Performance in Multicomputer Networks”, In International Conference on Supercomputing, pp. 132–139, 1997. H.G. Badr, S. Podar, “An optimal shortest-path routing policy for network computers with regular mesh-connected topologies”, pp.1362-1371, 1989. M. Li, Q. Zeng, W. Jone, “DyXY - a proximity congestion-aware deadlock-free dynamic routing method for network on chip”, in Proc. of DAC, pp. 849-852, 2006. J. C. Hu and R. Marculescu, “DyAD – Smart Routing for Network-on-Chip”, Design and Automation Conference, 2004. W. J. Dally and H. Aoki. Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels. IEEE Transactions on Parallel and Distributed Systems, 4(4):466– 475, 1993. A. Singh,W. J. Dally, A. K. Gupta, and B. Towles. GOAL: A Load-Balanced Adaptive Routing Algorithm for Torus Networks. In International Symposium on Computer Architecture, pages 194–205, 2003. J. Kim, D. Park, T. Theocharides, N. Vijaykrishnan, and C. R. Das. A Low Latency Router Supporting Adaptivity for On-Chip Interconnects. In International Conference on Design Automation, pages 559–564, 2005. [20] G. Ascia, et al., “Implementation and Analysis of a New Selection Strategy for Adaptive Routing in Networks-on-Chip”, IEEE Transaction on Computers, v.57, I.6, pp. 809-820, 2008. [21] P. Gratz, B. Grot and S.W. Keckler, “Regional Congestion Awareness for Load Balance in Networks-on-Chip”, in Proc. of HPCA, pp. 203–214, 2008. [22] S. Ma, et al., “DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip”, in Proc. of ISCA, pp.413-424, 2011. [23] S.P. Choi and D.-Y. Yeung, “Predictive Q-routing: A memory-based reinforcement learning approach to adaptive traffic control”, In Advances in Neural Information Processing Systems 8 (NIPS8), 945–951, 1996. [24] S. Kumar, and R. Miikkulainen, “Confidence-Based Q-routing: An On-Line Adaptive Network Routing Algorithm”, in Smart Engineering Systems: Neural Networks, Fuzzy Logic, Data Mining, and Evolutionary Programming, Vol. 8, pp. 147-152, 1998. [25] S. Kumar and R. Miikkulainen, “Dual reinforcement Q-routing: An on-line adaptive routing algorithm”, in Proc. of the Artificial Neural Networks in Engineering Conference, pp. 231-238, 1997. [26] C. Feng, Z. Lu, A. Jantsch, J. Li, and M. Zhang, “A reconfigurable fault-tolerant deflection routing algorithm based on reinforcement learning for network-on-chip”, in Proc of NoCArc, pp.11-16 , 2010. [27] M. Majer, et al. “Packet Routing in Dynamically Changing Networks on Chip”, in Proc. of the 19th International Parallel and Distributed Processing Symposium (IPDPS). Denver, CO, U.S.A. , 2005. [28] K. K. Paliwal, J. S.George, N. Rameshan, V. Laxmi, M.S. Gaur, V. Janyani, and R.Narasimhan, “Implementation of QoS Aware Q-routing Algorithm for Network-on-Chip”, pp. 370-380, 1C3 2009. [29] M.K. Puthal, V. Singh, M.S. Gaur, and V. Laxmi, “C-routing: An Adaptive Hierarchical NoC Routing Methodology” , 19th International Conference on VLSI and System-on-Chip, pp. 392 – 397, 2011. [30] F. Farahnakian, M. Ebrahimi, M. Daneshtalab, P. Liljeberg, and J. Plosila "Q-learning based Congestion-aware Routing Algorithm for On-Chip Network," in Proceedings of 2th IEEE International Conference on Networked Embedded Systems for Enterprise Applications (NESEA), pp. 17, 2011. [31] F. Farahnakian, M. Ebrahimi, M. Daneshtalab, J. Plosila, P. Liljeberg, "Adaptive Reinforcement Learning Method for Networks-on-Chip," in Proceedings of 12th IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XII), 2012. [32] M. Ebrahimi, M. Daneshtalab, F. Farahnakian, P. Liljeberg, J. Plosila, M. Palesi, and H. Tenhunen, "HARAQ: Congestion-Aware Learning Model for Highly Adaptive Routing Algorithm in On-Chip Networks," in Proceedings of 6th ACM/IEEE International Symposium on Networks-on-Chip (NOCS), pp. 19-26, 2012. [33] A. Varga et al., “The OMNeT++ discrete event simulation system”, In Proc. of the European Simulation Multiconference (ESM’2001), pp. 319-324, 2001. [34] S.C. Woo, et al., “The splash-2 programs: Characterization and methodological considerations”, in Proc. of Computer Architecture (ISCA), pp. 24-36, 1995. [35] M. K. Martin, D. J. Sorin, B. M. Beckmann, et al. “Multifacet's general execution driven multiprocessor simulator (GEMS) toolset,” SIGARCH Computer Architecture News, v. 33, No. 4, pp.92-99, 2005.

Log In

Adaptive Load Balancing in Learning-based Approaches for many-core embedded systems

Related papers

Related papers

Related topics