1 Introduction

The rapid advancement of computer applications has significantly increased the demand for processor performance [1]. CPU–GPU heterogeneous network-on-chip (NoC) designs integrate both CPU and GPU cores on the same chip, enabling the sharing of various on-chip resources, and have become crucial for enhancing overall system performance [2, 3]. Despite their promising potential to deliver exceptional performance, CPU–GPU heterogeneous processors also introduce a major challenge: resource sharing between the CPU and GPU exacerbates contention, which in turn limits system performance.

In CPU–GPU heterogeneous multi-core systems, CPU and GPU applications contend for shared resources. The CPU is mainly tasked with managing operations that require fast responses and low latency [4, 5], while the GPU, with its numerous computational units and robust thread execution capacity, excels at handling computation-heavy tasks with high throughput [6]. These contrasting characteristics result in two primary communication patterns [7,8,9]. The CPU frequently enters a wait state while awaiting requested data, making it highly sensitive to delays. In contrast, the GPU generates a large number of memory access requests in a short time, consuming considerable memory bandwidth and network resources; owing to its high throughput, it tolerates latency far better. Because of these differences in computational characteristics, the traditional round-robin arbitration policy used in NoCs often gives GPUs an advantage in resource competition, allowing them to occupy a significant portion of network resources. This dominance can substantially degrade CPU performance. In previous work [10], we ran CPU and GPU benchmarks concurrently on a CPU–GPU heterogeneous system and observed that CPU performance decreased by an average of 62%. To achieve more balanced resource allocation, it is essential to optimize and improve NoC resource arbitration policies in CPU–GPU heterogeneous systems.

Numerous researchers have attempted to mitigate resource contention between CPU and GPU applications by partitioning resources such as virtual channels and memory controllers within the network. Zheng et al. [11] proposed a NoC architecture that assigns network regions of varying sizes and locations to concurrently running applications, thereby reducing resource competition between them. Cui et al. [12] presented an interference-free NoC architecture that alleviates CPU–GPU interference by partitioning memory controllers. Li et al. [13] analyzed combined application characteristics and optimized router injection ports by modifying injection links and buffer organization, aiming to reduce contention in heterogeneous multicore systems. Fang et al. [14] studied the placement of the last-level cache (LLC) and memory controller (MC) and proposed a more efficient placement strategy; to address hotspot issues in the network, they designed a CPU–GPU task-based routing algorithm for path planning, which reduces resource contention within the network. Other researchers have focused on enhancing prioritization mechanisms in CPU–GPU heterogeneous systems. Wen et al. [15] studied the memory request streams from the CPU and GPU in a heterogeneous NoC where they share an LLC, and introduced a cache request stream management strategy that prioritizes CPU requests, leading to improved CPU performance.

In recent years, many researchers have used machine learning to improve microarchitecture design. Machine learning can provide better decision making and find more efficient optimization solutions for complex systems. For large-scale dynamic heterogeneous systems, Peng et al. [16] present a novel view of DRL-based offloading strategies from the perspective of key design elements. Lin et al. [17] explored the need for machine learning to enhance the power estimation and prediction capabilities of servers, providing important guidance for server power modeling. Garza et al. [18] propose a bit-level indirect branch prediction scheme that uses perceptron-based predictors to predict individual branch target address bits. Bhatia et al. [19] propose a perceptron-based prefetch filtering mechanism that enhances prefetch coverage while maintaining accuracy. Bera et al. [20] propose a holistic prefetch algorithm that learns to prefetch using multiple types of program context and system-level feedback inherent to its design. Yang et al. [21] propose RL-CoPref, a reinforcement learning (RL)-based coordinated prefetching controller for multiple prefetchers, which dynamically adjusts prefetch activation and prefetch level so that multiple prefetchers complement each other in hybrid applications. Singh et al. [22] propose Sibyl, the first technique to use RL for data placement in a hybrid storage system, which is highly adaptable and easy to scale.

Machine learning can also provide innovative solutions for on-chip network optimization: by learning the network state, researchers have used machine learning to optimize routers in terms of routing, fault tolerance, and DVFS control. Fettes et al. [23] propose a learning-enabled, energy-aware DVFS mechanism for multicore architectures, utilizing both supervised learning and reinforcement learning techniques. Zheng et al. [24] apply RL to automatically learn an optimal DVFS control policy that improves NoC energy efficiency and propose an artificial neural network to efficiently implement the large state-action table required by RL. Lin et al. [25] propose a novel deep reinforcement learning framework for designing optimal loop layouts for routerless NoCs, achieving better throughput and latency. Sethumurugan et al. [26] use ML as an offline tool to design a cost-effective cache replacement policy. Kang et al. [27] propose Q-adaptive routing, a multi-agent reinforcement learning routing scheme for dragonfly systems. Yin et al. [28] applied machine learning techniques to improve NoC performance and proposed a novel arbitration scheme that is effective under high contention. Zhou et al. [29] introduce an approach to automatically distill arbitration logic from simulation traces; the distilled policy also generalizes to various injection rates and traffic patterns. Chen et al. [30] propose LAMP, an RL-based algorithm that determines transmission paths according to traffic load, effectively utilizing NoC resources and reducing the latency of point-to-point NoCs.

The objective of this paper is to alleviate resource contention between CPUs and GPUs in heterogeneous NoC and to improve the arbitration policy for better on-chip resource allocation. We analyze the resource competition intensity in different regions of the heterogeneous NoC and use reinforcement learning to study the impact of message feature weights on resource allocation during arbitration in these regions. Based on the analysis, we propose an optimized arbitration policy. The contributions of this paper are summarized as follows:

  • We analyze the limitations of traditional arbitration policies in resource allocation and explore the impact of message feature weights on resource allocation during arbitration in different regions of heterogeneous NoC using reinforcement learning.

  • Based on the results of message feature weight analysis, we propose a regional-contention-driven (RCD) arbitration policy, which uses message features with different weights for resource allocation priority calculation according to the intensity of regional resource competition, and allocates resources more rationally.

  • We propose a dynamic regional-contention-driven (DRCD) arbitration policy. This policy uses dynamic sampling techniques to evaluate the contention coefficients of different network regions during program execution, assesses resource contention intensity, and dynamically adjusts the arbitration policy.

The remainder of this paper is organized as follows. Section 2 analyzes the performance degradation caused by the limitations of traditional arbitration policies in heterogeneous NoC and discusses the application of machine learning in improving NoC performance. Section 3 presents the proposed RCD and DRCD policies in detail. In Sects. 4 and 5, we introduce the simulator parameters, benchmark set, and evaluation results along with analysis. Section 6 concludes the paper.

2 Background and motivation

2.1 CPU–GPU heterogeneous architecture

Figure 1 illustrates the baseline chip architecture and router microarchitecture used in this study. In this architecture, the CPU is equipped with both L1 and L2 private caches, while the GPU is equipped with an L1 private cache. All CPU and GPU cores share the last-level cache (LLC) and memory controllers (MC). Each core is connected to other cores through a router to facilitate packet exchange and transmission.

Fig. 1 Heterogeneous NoC architecture and router microarchitecture

The key components of a router include input buffers, routing computation modules, virtual channel arbiters, crossbars, and switch allocators. When a message enters a router, the input buffer receives it, and the routing computation module determines the appropriate output port based on the message's destination. The allocator decides which message may leave the virtual channels of the router's input ports and proceed to the crossbar's input ports. Finally, the crossbar transfers each message to its designated output port.

Fig. 2 Heterogeneous NoC with LLC/MC center model

Figure 2 depicts the baseline NoC topology in this work, which employs a 5\(\times\)5 Mesh architecture. It consists of 5 CPU cores, 11 GPU cores, 6 LLCs, and 3 MCs. In our prior research [14], we categorized heterogeneous NoC architectures into three models based on the placement of LLC and MC: center placement (LLC/MC CENTER), side placement (LLC/MC SIDE), and corner placement (LLC/MC CORNER).

Fig. 3 Latency of three placement models under different NoC scales

In our previous research [14], we compared the latency of the three models at different scales under high- and low-traffic benchmarks (the benchmarks are described in Sect. 4.1) and normalized the results to the center placement model. As shown in Fig. 3, the center placement model consistently has the lowest latency across scales, with an average latency reduction of 56%. As the network scale increases, the advantage of the center placement model becomes more apparent: in larger networks, it distributes CPU and GPU traffic more evenly across different paths, effectively reducing resource contention. Based on these experimental results, we use the center placement model as the baseline for subsequent arbitration policy optimization. The center placement model takes into account the memory access requirements of both the CPU and GPU, positioning the MCs around the LLCs to minimize the hop count for accessing the LLC. However, because this model concentrates traffic in the central region, it may lead to congestion and hotspot formation under heavy load, which can affect the overall performance of the NoC.

2.2 NoC arbitration

During the arbitration process at the crossbar’s input ports, multiple virtual channels from the router’s input ports may request access to the same crossbar input port. These requests must be resolved by an arbiter, which allocates resources based on predefined priority rules. Specifically, when multiple virtual channels contend for the same crossbar input port, they submit requests to the arbiter. The arbiter evaluates these requests according to the priority policy and decides which message can proceed to the crossbar’s input port, while others must wait for the next arbitration cycle.

Arbitration policies significantly influence the throughput and latency of the NoC. The round-robin policy is one of the most commonly used approaches. Under this policy, the request that wins arbitration is assigned the lowest priority in the next round; if no new requests arrive, the existing priorities remain unchanged. Another policy, older-first, prioritizes messages based on their residence time in the router, giving higher priority to those that have been waiting longer. These traditional policies provide high local fairness among input buffers.
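As an illustration of the rotating-priority behavior described above, the following minimal sketch (hypothetical class and method names, not taken from the simulator) grants one of several requesting virtual channels and demotes the winner for the next round:

```python
class RoundRobinArbiter:
    """Minimal sketch of round-robin (rotating-priority) arbitration."""

    def __init__(self, num_requesters):
        self.num_requesters = num_requesters
        # Start so that requester 0 has the highest priority initially.
        self.last_winner = num_requesters - 1

    def arbitrate(self, requests):
        """`requests` is a list of booleans, one per input virtual channel."""
        if not any(requests):
            return None  # nothing to grant this cycle
        # Scan starting just after the previous winner, so the winner of the
        # last round now has the lowest priority.
        for offset in range(1, self.num_requesters + 1):
            candidate = (self.last_winner + offset) % self.num_requesters
            if requests[candidate]:
                self.last_winner = candidate
                return candidate
```

An older-first arbiter would instead grant the requester whose message has the longest residence time in the router.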

However, in heterogeneous NoCs, these traditional policies do not take into account the differences between message types from heterogeneous cores, which may lead to poor global service fairness. For example, CPU request messages may be blocked by latency-tolerant GPU traffic, increasing CPU latency and reducing throughput. In recent research [28, 29], machine learning methods have been used to determine the allocation of, or access to, shared hardware resources. However, these methods do not fully consider the differences in message types when applied to CPU–GPU heterogeneous systems and cannot dynamically adjust arbitration policies as resource contention in the heterogeneous system changes.

2.3 Reinforcement learning

Reinforcement learning (RL) is a machine learning method used for decision-making problems. In RL, an agent learns a policy through interactions with the environment, aiming to maximize its long-term rewards. During training, the environment provides a numerical reward for each action the agent takes, and the agent uses this reward to adjust and optimize its policy.

Q-learning is a value-iteration-based RL algorithm that guides the agent in selecting optimal actions by learning a state-action value function. In Q-learning, the agent updates the Q-value for each state-action pair to learn its policy. The Q-value represents the expected future reward for taking a specific action in a given state. Traditional Q-learning methods store these Q-values in a table for updating and lookup. However, in this study, the state-action space of the arbitration problem may be vast, making the traditional Q-table approach impractical for handling such a large state space.
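For reference, the core of Q-learning is the standard value update applied after each interaction (a textbook formula, not specific to this work):

$$\begin{aligned} Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max _{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \end{aligned}$$

where \(\alpha\) is the learning rate, \(\gamma\) is the discount factor, and \(r_{t+1}\) is the reward received after taking action \(a_t\) in state \(s_t\); the Q-table stores one such value per state-action pair.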

Deep Q-learning (DQN) extends Q-learning by using a neural network to approximate the Q-function. The network outputs an approximate Q-value for each possible action, replacing the Q-table used in traditional Q-learning. DQN also introduces techniques such as target networks and experience replay, which further enhance the stability and convergence of the learning process.

3 Heterogeneous NoC arbitration policy

3.1 Probability-based CPU-first arbitration policy (PBC)

The high thread-level parallelism of GPU cores leads to a significant imbalance in the number of data packets generated by CPU and GPU in the network. During competition for crossbar port resources, the sheer volume of GPU messages gives them an advantage, resulting in GPU applications monopolizing crossbar port resources and significantly degrading CPU performance. To mitigate this issue, we propose a probability-based CPU-first arbitration policy (PBC), designed to enhance the priority of CPU programs in resource competition.

Algorithm 1 Probability-based CPU-first arbitration policy (PBC)

As detailed in Algorithm 1, the approach sets a probability parameter as a decimal value between 0 and 1. During the arbitration phase of the crossbar switch, a random decimal number between 0 and 1 is generated. If this number is less than the preset probability parameter, CPU packets are prioritized over GPU packets for resource access, and the round-robin policy is then applied among the CPU packets to select the packet that is finally granted access. If no CPU packets are involved in the current arbitration, the process defaults to the round-robin policy over the GPU packets.
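The following sketch captures the essence of Algorithm 1 (the parameter name `p_cpu` and the `round_robin` helper are illustrative assumptions, not the simulator's actual interface):

```python
import random

def pbc_arbitrate(packets, p_cpu, round_robin):
    """Probability-based CPU-first (PBC) arbitration sketch.

    packets     : competing packets, each carrying an `is_cpu` flag.
    p_cpu       : preset probability parameter in [0, 1].
    round_robin : fallback arbiter applied within the chosen subset.
    """
    cpu_packets = [p for p in packets if p.is_cpu]
    # With probability p_cpu, restrict arbitration to CPU packets (if any exist).
    if cpu_packets and random.random() < p_cpu:
        return round_robin(cpu_packets)
    # Otherwise, or if no CPU packet competes, apply round-robin to all packets.
    return round_robin(packets)
```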

Implementing PBC gives CPU messages higher priority during the switch allocation phase. Although PBC improves the CPU's priority in accessing resources to some extent, it still has several potential limitations:

  • The effectiveness of the PBC policy is highly sensitive to the preset probability parameter. Inappropriate values may hinder performance: setting the probability too high may excessively limit GPU access, while setting it too low may offer minimal improvement to CPU performance.

  • The PBC policy prioritizes the CPU through probability alone and overlooks other key message features. It does not account for varying resource competition intensities or use features like source, destination, and age to guide resource allocation, limiting its applicability.

3.2 RCD and DRCD arbitration policy

3.2.1 NoC arbitration with RL

An ideal arbitration policy should be able to reflect the priority of messages competing for NoC resources based on their multiple characteristics. However, manually combining and weighing these features is extremely complex and difficult to implement. For the benchmarks used in this work, a single simulation run may span millions of cycles, so manually extracting and weighing the features of packets involved in every arbitration is an enormous workload that is difficult to accomplish efficiently.

RL is an effective tool to learn the relationships between message features and arbitration decisions by interacting with the environment and making optimal decisions based on these relationships. NoC is a complex system, influenced by various factors such as load fluctuations and traffic patterns. The environment of NoC systems is dynamic and uncertain. In this context, RL can better adapt to the dynamic changes during program execution, thus effectively addressing complex scheduling problems. Additionally, there are many important message features involved in the NoC resource arbitration process, such as local age, hop count, message type, and destination port. Due to the intricate interactions between these features, the system state space in NoC resource arbitration is highly complex, making it difficult for traditional heuristic methods to handle effectively. RL, by automatically learning policies, can more efficiently handle these complex state spaces. Yin et al. [28] used DQN to study the arbitration priority in NoC, and by analyzing the trained neural network weights, they identified message features that have a significant impact on arbitration decisions. However, their research focused on the global performance of the entire heterogeneous on-chip network and did not fully consider the differences in resource competition across various areas within the heterogeneous network.

In this work, we use a center placement model, where hotspots are more likely to form in the central regions of the network under high load. Compared to the edge areas, these hotspot regions experience particularly intense resource competition, which places higher demands on the arbitration policy. Therefore, it is necessary to explore the impact of resource competition intensity on the selection of message features during arbitration and to develop corresponding arbitration policies based on different levels of competition.

3.2.2 Regional-contention-driven arbitration policy

In this work, we use DQN to explore the differences in resource competition between different areas of a heterogeneous NoC. By analyzing the weights of message features under different degrees of resource competition, we formulate reasonable arbitration policies for the heterogeneous NoC.

Fig. 4 The hotspot area and edge area of NoC

Based on the intensity of resource competition, we divide the heterogeneous NoC into hotspot and edge areas, as shown in Fig. 4, and separately study the arbitration policies for these two regions using RL. The hotspot and edge areas of the NoC are managed by their respective RL agents. We used a high-load benchmark combination (mcf+spmv), a set of memory-intensive applications that cause frequent resource requests in hotspot areas [14], thus providing an effective training environment for the agents. Following Yin's work [28], the selected neural network is a multilayer perceptron (MLP) with one hidden layer; the activation functions are Sigmoid and ReLU, and the learning rate, discount factor, and exploration rate are set to 0.001, 0.9, and 0.001, respectively. In the NoC architecture used in this work, each router has five input ports, each with four input virtual channels, and each message has five features represented by nine numbers. Therefore, the neural network has \(5 \times 4 \times 9 = 180\) input neurons, 20 neurons in the hidden layer, and 20 neurons in the output layer.
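For concreteness, a Q-network with the stated dimensions could look as follows (a PyTorch-style sketch; which activation is placed on which layer is an assumption, since the text only names Sigmoid and ReLU):

```python
import torch.nn as nn

NUM_VCS = 5 * 4          # 5 input ports x 4 virtual channels per port
FEATURES_PER_MSG = 9     # 3 dynamic values + 6 one-hot elements (Sect. 3.2.2)

class ArbiterQNet(nn.Module):
    """MLP mapping the 180-element state vector to one Q-value per VC (sketch)."""

    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(NUM_VCS * FEATURES_PER_MSG, 20)  # 180 -> 20
        self.act = nn.ReLU()   # assumed placement of the ReLU activation
        self.out = nn.Linear(20, NUM_VCS)                         # 20 Q-values

    def forward(self, state):
        return self.out(self.act(self.hidden(state)))
```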

Figure 5 shows how we use agents to train arbitration policies based on message features.

Fig. 5 Agent training process

State Vector: During each arbitration at the crossbar input ports, the router collects the features of all messages competing for the same input port to construct the state vector. Each message contains 5 features, all of which are normalized to improve the convergence of training. The selected message features, as shown in Table 1, include the following two categories:

  • Dynamic Features: local age, hop count, and distance, which change as the message traverses the network. Each of these features is represented by one element.

  • Static Features: message type and destination type, which remain constant after the message enters the NoC and reflect the application's behavioral characteristics. These two features are one-hot encoded, requiring six elements in total.

Table 1 Message features
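To illustrate how a message's portion of the state vector might be assembled from the features in Table 1, consider the sketch below; the normalization constants and the category orderings of the one-hot codes are assumptions made for the example:

```python
def encode_message(msg, max_age=255.0, max_hops=8.0, max_dist=8.0):
    """Encode one message as 9 normalized elements (sketch).

    Dynamic features (1 element each): local age, hop count, distance.
    Static features (3-way one-hot each): message type, destination type.
    """
    dynamic = [
        min(msg.local_age / max_age, 1.0),
        min(msg.hop_count / max_hops, 1.0),
        min(msg.distance / max_dist, 1.0),
    ]
    msg_types = ["request", "response", "coherence"]   # hypothetical ordering
    dst_types = ["cpu", "gpu", "llc_mc"]               # hypothetical ordering
    one_hot_type = [1.0 if msg.msg_type == t else 0.0 for t in msg_types]
    one_hot_dst = [1.0 if msg.dst_type == t else 0.0 for t in dst_types]
    return dynamic + one_hot_type + one_hot_dst
```

Concatenating such 9-element encodings for all 20 virtual channels then yields the 180-element input of the Q-network (padding empty channels with zeros is an assumption of this sketch).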

Agent: The hotspot and edge areas of the network each use different agents. During each crossbar input port arbitration process, the router sends the constructed state vector to the corresponding agent. The agent calculates the Q-values for each virtual channel using the neural network, forming a Q-value vector, which is then returned to the router. Based on these Q-values, the router decides the arbitration outcome, with the virtual channel having the highest score granted the resource. Arbitration at each crossbar input port is performed independently. If a crossbar input port has only one virtual channel requesting the resource, there is no need to query the agent, and the resource is granted directly.

Reward: After each resource arbitration, the router generates a reward and feeds it back to the agent. The reward should reflect the quality of the arbitration. In this work, we use global age and the contention factor (defined in Sect. 3.2.3) to calculate the reward. Global age has been shown to be one of the optimal arbitration criteria, but it incurs high hardware costs in a NoC and is difficult to implement [31]. The contention factor reflects the intensity of the current resource contention. If the arbitration selects the message with the largest global age and the contention factor decreases after the arbitration, a positive reward is given to the agent; otherwise, the reward is zero.
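Under this description, the reward computation reduces to a simple check; the unit reward value below is an assumption:

```python
def compute_reward(chosen_msg, competing_msgs, cf_before, cf_after):
    """Positive reward only when the globally oldest message was granted
    and the contention factor decreased after the arbitration (sketch)."""
    oldest_age = max(m.global_age for m in competing_msgs)
    chose_oldest = chosen_msg.global_age == oldest_age
    contention_reduced = cf_after < cf_before
    return 1.0 if (chose_oldest and contention_reduced) else 0.0
```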

Training: After each arbitration decision, the router stores the tuple (state, action, next state, reward) in the replay memory and uses the experience replay technique for training. After sufficient training, the agents for the hotspot and edge areas optimize their respective arbitration policies, and the router performs resource arbitration based on the optimized policies.
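A generic DQN update with experience replay, in the spirit of this training loop, is sketched below (built on the hypothetical `ArbiterQNet` above; the batch size and replay capacity are assumptions, while the discount factor follows Sect. 3.2.2):

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

GAMMA = 0.9                          # discount factor (Sect. 3.2.2)
BATCH_SIZE = 32                      # assumption
replay_memory = deque(maxlen=10000)  # assumption: bounded replay buffer

def store_transition(state, action, next_state, reward):
    """Store one (state, action, next state, reward) tuple after arbitration."""
    replay_memory.append((state, action, next_state, reward))

def train_step(qnet, target_net, optimizer):
    """One experience-replay update of the Q-network (sketch).

    States are 180-element torch tensors, `action` is the index of the
    granted virtual channel, `reward` is a float.
    """
    if len(replay_memory) < BATCH_SIZE:
        return
    batch = random.sample(replay_memory, BATCH_SIZE)
    states, actions, next_states, rewards = zip(*batch)
    states = torch.stack(states)
    next_states = torch.stack(next_states)
    actions = torch.tensor(actions)
    rewards = torch.tensor(rewards)

    # Q-values of the actions actually taken.
    q_taken = qnet(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bootstrapped targets from the periodically synchronized target network.
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
    targets = rewards + GAMMA * q_next

    loss = F.mse_loss(q_taken, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```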

Fig. 6 Average weight of hidden layer

After the training was completed, we visualized and analyzed the weights of the hidden layer of the neural network. Figure 6 shows the weight distribution of message features in the hotspot area and edge area during arbitration. The results show that there is a significant difference in the dependence of arbitration policies on message features in different areas, and this difference directly affects the effectiveness of arbitration policies.

In the edge area of the NoC, where resource competition is weaker, the network load typically does not create severe bottlenecks, allowing messages to pass through relatively smoothly. In such cases, the focus of the arbitration policy is on optimizing the scheduling order to reduce unnecessary delays. As a result, dynamic features related to message transmission, such as local age, distance, and hop count, carry more weight in the arbitration process. These features help the network organize scheduling more efficiently, further improving transmission performance.

In the hotspot area, where resource competition is intense, limited resources must be shared among a large number of packets, leading to potential congestion and significant delays. In this scenario, the arbitration policy needs to more intelligently grant resources to higher-priority messages to alleviate the performance degradation caused by resource contention. Here, static features related to the message's application behavior, such as message type and destination type, carry more weight in the arbitration process. These features help the system allocate resources more effectively in high-competition environments, reduce bottlenecks, and optimize overall performance.

Based on the above observations, we propose a regional-contention-driven (RCD) arbitration policy, which applies different arbitration schemes to high-load and low-load areas, as shown in Algorithm 2.

Algorithm 2 Regional-contention-driven arbitration policy (RCD)

Under low load: In the edge area, where resource competition is weak, the arbitration policy should aim to minimize unnecessary waiting times. Therefore, we combine the local age and distance features to prioritize messages that have longer waiting times and shorter transmission paths, thus improving transmission efficiency and optimizing the scheduling order. The priority is computed as follows:

\(\textit{Priority\_level} = (\textit{Local\_Age} \ll 1) + (\textit{Distance} \gg 1)\)

Under high load: In the hotspot area, resource competition intensifies, and a large number of GPU packets may interfere with the normal operation of CPU programs, delaying responses to CPU requests. To address this issue, we design priority rules based on destination type and message type, giving higher priority to response packets for CPU messages. This ensures that the critical path of CPU tasks is not disrupted, thereby alleviating the CPU's performance degradation in high-competition environments. In the following priority, the Message_type value is 2 for coherence and response messages and 1 for request messages; the Destination_Type value is 2 for messages destined for a CPU and 1 for messages destined for a GPU, LLC, or MC. The priority is computed as follows:

\(\textit{Priority\_level} = (\textit{Message\_type} \ll 1) + (\textit{Destination\_Type} \ll 1) + \textit{Local\_Age}\)
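Both rules can be expressed compactly as follows (a sketch interpreting \(\ll\)/\(\gg\) as left/right shifts, i.e., doubling and halving; field names are hypothetical):

```python
def rcd_priority(msg, high_load):
    """Compute the RCD arbitration priority of a message (sketch)."""
    if not high_load:
        # Edge / low-load rule: favor long-waiting messages on short paths.
        return (msg.local_age << 1) + (msg.distance >> 1)
    # Hotspot / high-load rule: favor CPU-bound response and coherence traffic.
    message_type = 2 if msg.msg_type in ("response", "coherence") else 1
    destination_type = 2 if msg.dst_type == "cpu" else 1
    return (message_type << 1) + (destination_type << 1) + msg.local_age

def rcd_arbitrate(competing_msgs, high_load):
    # The virtual channel holding the highest-priority message wins the port.
    return max(competing_msgs, key=lambda m: rcd_priority(m, high_load))
```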

3.2.3 Dynamic regional-contention-driven arbitration policy

In the previous sections, we analyzed the message feature weight distribution in the hotspot and edge areas and designed arbitration policies for the high-competition hotspot areas and low-competition edge areas. However, due to dynamic changes in application load and execution phases, the degree of resource competition in these areas also varies over time.

In low-load situations, even hotspot areas containing LLC and MC may not experience intense resource competition. For example, during low-load phases, both CPU and GPU memory access requests are fewer, and the competition in hotspot areas may resemble that of the edge areas. Applying a high-competition policy in such cases would not only introduce additional computational complexity but also waste resources. In contrast, a simple low-competition policy can more efficiently allocate resources.

Therefore, we propose a policy-switching mechanism based on dynamic sampling monitoring to adapt to the dynamic changes in system load and resource competition.

To quantitatively monitor resource competition on different cores, we propose a contention factor (CF) that describes the degree of resource contention between CPUs and GPUs over a period of time. The CF uses the rate at which CPU and GPU requests overlap in competing for the same crossbar input port to reflect the intensity of competition. It is calculated as follows:

$$\begin{aligned} \text {CF}(i) = \frac{N_{\text {overlap}}(i)}{N_{\text {total}}(i)} \end{aligned}$$
(1)

\(N_{\text {overlap}}(i)\) denotes the number of overlapping CPU and GPU resource requests at node i, and \(N_{\text {total}}(i)\) denotes the total number of resource requests at node i. When the contention factor is high, resource contention at the node is intense, and the high-load arbitration policy should be preferred to alleviate resource bottlenecks. Conversely, when the contention factor is low, the system is in a low-load state, and the low-load arbitration policy can more effectively optimize resource allocation efficiency.

Fig. 7 Stage transition in DRCD

The entire application execution process is divided into several stages: the initialization stage, the sampling stage, and the main stage. Figure 7 shows the stage transitions during application execution.

Initialization Stage: During the initialization stage, the system state is unstable, and a low-load arbitration policy is used. No sampling is performed during this period.

Sampling Stage: The goal of the sampling stage is to assess the current resource competition intensity and determine whether a switch to the other arbitration policy is necessary. During this stage, the edge areas continue to use the low-load policy, while the hotspot areas arbitrate with both the low-load and high-load policies, and the corresponding CFs are calculated. If the low-load policy is in use and the CF increases significantly, resource contention has become intense, and the system may need to switch to the high-load policy.

$$\begin{aligned} \text {Contention\_Ratio} = \frac{\sum \text {CF}_{\text {low\_load}}}{\sum \text {CF}_{\text {high\_load}}} \end{aligned}$$
(2)

Main Stage: The two contention factors generated in the sampling stage determine the arbitration policy for the main stage. To collect the contention factors of the different cores in the hotspot area, we place a central control logic module at the central node of the network. After the central control logic collects the contention factors of the cores in the hotspot area, it calculates the ratio of contention factors under the two arbitration policies according to formula (2). When resource competition within an area intensifies significantly, its contention factor shows a noticeable increase. We ran multiple memory-intensive benchmarks and recorded the contention factors of the hotspot and edge areas; analysis of these experiments showed that the contention factor under high load is approximately 1.5 times that under low load. Therefore, we set 1.5 as the threshold for switching between the two arbitration policies. When the ratio reaches or exceeds this threshold, the high-load policy is considered more suitable for the current hotspot area; when the ratio falls below the threshold, the low-load policy is enabled to optimize network performance. After the main stage is completed, the system returns to the sampling stage.
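The sampling-to-main-stage decision can be summarized by the following sketch (the central control logic gathers per-node contention factors measured under each policy during the sampling stage; function and variable names are illustrative):

```python
SWITCH_THRESHOLD = 1.5   # empirical threshold from Sect. 3.2.3

def contention_factor(n_overlap, n_total):
    """CF(i) = N_overlap(i) / N_total(i), Eq. (1)."""
    return n_overlap / n_total if n_total > 0 else 0.0

def select_hotspot_policy(cf_low_policy, cf_high_policy):
    """Choose the hotspot-area policy for the next main stage.

    cf_low_policy / cf_high_policy: per-node contention factors collected
    while sampling with the low-load and high-load arbitration policies.
    """
    high_sum = sum(cf_high_policy)
    if high_sum == 0:
        return "low_load"  # no contention observed during sampling
    contention_ratio = sum(cf_low_policy) / high_sum  # Eq. (2)
    return "high_load" if contention_ratio >= SWITCH_THRESHOLD else "low_load"
```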

Table 2 Stage length setting

The setting of the stage lengths is important for system performance. If the sampling stage is too short, accurate resource contention information may not be obtained because of short-term performance fluctuations. If it is too long, arbitration policies may not be adjusted in time, missing the best opportunity to optimize system performance. In our previous research [14], we analyzed the IPC of CPU programs across different execution phases and determined the appropriate stage lengths based on the resulting performance changes. The duration of each stage is specified in Table 2.

3.2.4 The scalability of policy

In this section, we discuss the scalability of DRCD under different network scales, placement structures, and traffic patterns.

Network Scale: For larger-scale mesh networks (e.g., 8\(\times\)8), the weights of message features during arbitration may vary. For example, longer transmission paths might make hop count more important than local age [29], as hop count better reflects the time a message takes to travel through the network. The weight of message type may also increase, as properly scheduling different message types becomes more critical. In this case, the NoC arbitration rules need to be updated to assign different weights to the features. We plan to further study the variation of feature weights in large-scale networks in future work.

Placement Structure: In networks of the same scale, different placement structures may impact the overlap regions of resource access. In our study, we adopted the central model with the minimum average latency. For other structures, designers need to refer to routing algorithms to identify overlapping regions of CPU and GPU resource access (typically the LLC/MC regions), and include the nodes in these regions within the central control logic to dynamically adjust arbitration policies based on resource contention. In this case, the arbitration rules themselves do not need modification.

Traffic Patterns: When the traffic environment changes, the policy does not need modification. In this study, although the RL training focused mainly on memory-intensive programs, by learning the feature weights of messages under different load conditions (high load and low load) in different network regions, our arbitration rules can flexibly adapt to other traffic patterns. During program execution, our designed arbitration rules automatically adjust the policy according to changes in traffic load, thus working effectively under various traffic patterns without requiring further adjustments to the arbitration rules.

4 Experiment setup

4.1 Simulator and benchmark

We use the MacSim [14] simulator for our experiments, a trace-driven, cycle-level simulator for heterogeneous architectures. The processor configurations are shown in Table 3. Our baseline CPU is modeled similarly to Intel's Sandy Bridge, while the GPU core is comparable to NVIDIA Fermi. In all simulations, applications that finish early are repeatedly re-executed so that resource contention between applications persists throughout the run.

Table 3 Heterogeneous CPU–GPU Architecture Configuration
Table 4 CPU and GPU benchmark

For our experiments, we use the SPEC CPU2006 benchmarks in conjunction with a set of CUDA GPGPU benchmarks drawn from the Nvidia CUDA SDK, the Rodinia suite, and the Parboil suite. In the experimental setup, each CPU core runs a CPU application, while all GPU cores execute a single GPU application. PKC was used as a statistical measure to assess the communication state of the network and to identify the high- and low-traffic application groups. The grouping results are shown in Table 4.

4.2 Metrics

IPC is selected as the metric to evaluate CPU performance, while application latency is used to assess overall network performance. The IPC of core i is computed by equation 3, where cycles denotes the number of cycles the CPU takes to execute the application and \(instruction_{i}\) denotes the number of instructions executed by CPU core i. The average IPC of the CPU cores is calculated by equation 4, where n is the total number of CPU cores.

$$\begin{aligned} IPC_{i}&= \frac{instruction_{i}}{cycles} \end{aligned}$$
(3)
$$\begin{aligned} \overline{IPC}&= \frac{\sum _{i=0}^{n-1} IPC_{i}}{n} \end{aligned}$$
(4)
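Equations (3) and (4) correspond directly to the following calculation (sketch; a single shared cycle count is assumed, as in equation 3):

```python
def average_cpu_ipc(instructions_per_core, cycles):
    """Per-core IPC (Eq. 3) averaged over all n CPU cores (Eq. 4)."""
    ipcs = [instr / cycles for instr in instructions_per_core]
    return sum(ipcs) / len(ipcs)
```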

The power consumption of the NoC is divided into link power and router power. In this work, changes to the arbitration policy affect router power consumption. The energy consumed by each message as it passes through a router is expressed by the following equation:

$$\begin{aligned} E_{router}= E_{BW} + E_{RC} + E_{VA} + E_{SA} + E_{BR} + E_{ST} \end{aligned}$$
(5)

Here, \(E_{BW}\), \(E_{RC}\), \(E_{VA}\), \(E_{SA}\), \(E_{BR}\), and \(E_{ST}\) represent the energy of buffer writing, routing computation, virtual channel allocation, switch allocation, buffer reading, and switch traversal, respectively. The number of virtual channel allocations and switch allocations is determined by the competition for each router's resources [32, 33]: the more resource requests, the higher the power consumption of this part.

5 Results and analysis

In this section, we first evaluate the PBC policy and analyze its limitations. We then assess the proposed RCD and DRCD arbitration policies and provide a detailed analysis of the experimental results. We choose the classic round-robin and older-first policies as comparison benchmarks. All experimental results are normalized based on the round-robin policy.

5.1 PBC policy

Figures 8 and 9 show the latency and performance results of the PBC policy under the high-traffic group. For the PBC policy, we use two probability parameters, 0.2 and 0.7, representing low and high probabilities of the CPU obtaining resources, respectively. The experimental results indicate that the performance of the PBC policy is significantly influenced by the probability parameter. When the parameter is set to 0.2, the latency and performance are similar to those of traditional arbitration policies: although CPU packets receive some priority, the improvement is not significant. When the parameter is set to 0.7, CPU performance improves significantly, but the overall latency in some workloads also increases noticeably. A higher probability parameter means that CPU messages are given priority in most cases, which reduces their waiting time and increases their throughput. However, it also significantly increases the waiting time of GPU messages, especially when GPU programs issue memory accesses frequently. The resulting backlog of GPU messages causes network congestion, which in turn increases overall latency.

Fig. 8 IPC between PBC policy and round-robin policy under high-traffic networks

Fig. 9 Latency between PBC policy and round-robin policy under high-traffic networks

Due to the instability of the PBC policy under different probability settings, and its inability to make reasonable arbitration decisions based on message features, it cannot consistently and effectively improve network performance.

5.2 RCD and DRCD policy

Figures 10 and 11 compare the latency and IPC of the traditional policies and the proposed RCD policy in the low-traffic group. Across 10 sets of mixed-application workloads, the RCD policy reduced average latency by 1.08\(\%\) and improved IPC by 2.2\(\%\) compared to the traditional round-robin policy. In the low-load environment, where there is less resource competition, the effect of the arbitration policy is not significant, which aligns with our expectations.

Fig. 10 Latency of different policies under low-traffic networks

Fig. 11 IPC of different policies under low-traffic networks

In the high-traffic group, the RCD policy significantly improved CPU performance compared to traditional arbitration policies, as shown in Figs. 12 and 13. Across 10 sets of mixed-application workloads, the average IPC of the RCD policy increased by 11.42\(\%\). This indicates that in high-load environments, the RCD policy effectively mitigates the interference of GPU applications with the CPU, thereby enhancing CPU performance. Additionally, the average latency of the RCD policy decreased by 7.99\(\%\), a result of the policy's ability to reduce resource competition and packet congestion, which in turn improves overall network transmission efficiency.

Fig. 12 Latency of different policies under high-traffic networks

Fig. 13 IPC of different policies under high-traffic networks

Furthermore, we introduced the DRCD policy, which dynamically adjusts the arbitration mechanism based on the load during program execution, better adapting to different load conditions. To more accurately simulate real network environments when evaluating the DRCD policy, we bound different CPU benchmarks to different CPU cores and mixed them with different GPU benchmarks. We also divided the mixed workloads into two categories: mixed-low workloads, which contain combinations of benchmarks from the low-traffic group, and mixed-high workloads, which contain combinations from the high-traffic group. As shown in Figs. 14 and 15, compared with the RCD policy, the DRCD policy achieved an average IPC improvement of 4.82\(\%\) and a 2.7\(\%\) reduction in latency across the different mixed-workload conditions. Compared with the round-robin policy, the DRCD policy reduces latency by an average of 10.47\(\%\) and improves CPU performance by 16.79\(\%\).

5.3 Energy consumption

In this section, we analyze the power consumption of the proposed RCD and DRCD arbitration policies. The RCD policy requires only simple hardware modifications. Among the features used by RCD, local age requires a 5-bit counter added to the input buffer to store the message's local age. The distance can be calculated directly from the current router location and the destination node ID. The message type already exists in the message header and can be used directly. The destination type can be determined directly from the destination node ID.

Fig. 14 Latency between RCD policy and DRCD policy on network

Fig. 15 IPC between RCD policy and DRCD policy on network

We compared the network power consumption of the round-robin arbitration policy with that of the RCD arbitration policy. Compared with round-robin, the RCD policy increased power consumption by an average of 2.8%, and in the high-traffic benchmark tests the average increase was 1.6%. The slight increase is due to the additional hardware and the circuitry needed for priority calculation. For an arbiter, a higher number of resource requests results in higher power consumption. The RCD policy alleviates resource competition under high load, reduces the time that messages remain congested in the router, and thus issues fewer resource requests than the round-robin policy, so the power increase under high-load benchmarks is lower. The DRCD policy additionally requires a central control logic module that stores the contention factors returned by the cores and broadcasts the arbitration policy to be used; however, the amount of stored information is small, and the total overhead is low. The DRCD arbitration policy increases power consumption by an average of 4.6% compared with the round-robin policy, which remains acceptable relative to the corresponding IPC improvement.

6 Conclusion

We studied the resource contention problem in heterogeneous NoCs and found that traditional arbitration policies cannot achieve balanced resource allocation. The degree of resource contention in different regions of a heterogeneous NoC may vary significantly, and the message features relied upon for arbitration differ under varying contention levels. To address this issue, we conducted an in-depth analysis of message feature weights for the hotspot and edge areas using reinforcement learning and designed the regional-contention-driven (RCD) arbitration policy. To further adapt to the dynamic changes during actual program execution, we proposed the dynamic regional-contention-driven (DRCD) arbitration policy based on RCD, which adjusts the arbitration mechanism by monitoring the network's resource contention status in real time. We validated the effectiveness of the proposed policies through extensive simulation experiments. In mixed benchmark testing, compared to traditional arbitration policies, the RCD policy reduced overall network latency by 7.99\(\%\) and improved CPU performance by 11.42\(\%\) in high-load networks, while the DRCD policy reduced network latency by 10.47\(\%\) and improved CPU performance by 16.79\(\%\). The power consumption of the two policies increased by an average of 2.8% and 4.6%, respectively. The DRCD policy is not limited to the topology scale used in this study: in other architectures and larger-scale scenarios, although the message feature weights relied upon during arbitration may differ, resource contention still varies between areas, and the ideas presented in this paper can be used to select an appropriate arbitration policy based on the contention conditions. In future work, we aim to further optimize the DRCD policy to achieve more accurate congestion monitoring and to explore the performance of arbitration policies in large-scale heterogeneous NoC architectures with more cores.