1 Introduction

The proliferation of supercomputers with parallel computing capabilities has revolutionized several industries [1, 2]. Modern supercomputers typically consist of nodes containing multiple multicore central processing units (CPUs) and one or more graphics processing units (GPUs). These architectures allow computationally intensive problems to be tackled efficiently by utilizing both types of processing elements, often through hybrid CPU-GPU implementations. This requires custom analysis to identify bottlenecks and to allocate tasks to the most appropriate computing resource to maximize overall performance. One area that has benefited significantly from these advances is digital signal processing for wireless communications, particularly in the implementation of efficient multiple-input multiple-output (MIMO) algorithms [3, 4].

MIMO techniques have been widely employed since the third generation (3G) of mobile communications. In the current fifth generation (5G) and future sixth generation (6G), the use of MIMO has expanded to deployments referred to as massive MIMO, which involve tens or hundreds of antenna elements. A second major improvement is the use of higher frequency bands, as is the case with the millimeter wave (mmWave) spectrum, which is a key technology for achieving enhanced mobile broadband (eMBB) services [5, 6]. The deployment of mmWave requires the use of beamforming techniques with highly directional beams to increase the gain of the communication link between the transmitter (Tx) and the receiver (Rx), compensating for the larger channel propagation losses. The high directivity is typically achieved by using massive MIMO systems, which often require more complex signal processing techniques, highlighting the need for efficient implementations. This study focuses on this specific scenario, a massive MIMO system using a mmWave frequency band, where achieving an efficient and robust channel estimation is critical.

Accurate parametric modeling of mmWave channels allows the implementation of beamforming techniques. In this case, the directional characteristics of the channel are captured through the estimation of angle-of-arrival (AoA) and angle-of-departure (AoD) parameters. In this work, we focus on an artificial intelligence (AI) approach using two convolutional neural networks (CNNs) to accurately retrieve AoA and AoD values from frequency-domain channel observations in an analog beamforming scenario, an approach that exploits the sparse nature of the mmWave channel. In particular, the estimation is addressed as a supervised image-to-image translation problem, where the aim is to transfer an image from one domain to another while preserving the content of the given image. Our approach builds upon the framework proposed in [7], where the 2D input matrix can be interpreted as the discrete Fourier transform (DFT) of a sum of 2D complex sinusoids defined by the existing channel paths. This matrix, which can be seen as an input image, is fed into AI-based methods for signal denoising and peak enhancement, producing as output a real image that encodes the underlying paths in its main peaks. The first AI architecture evaluated in this paper is a residual CNN, denoted as ResNet, based on the 2D-DeepFreq architecture presented in [8], which is an adaptation of the residual CNN proposed in [9]. The second evaluated architecture is based on a well-known CNN architecture called U-Net, which has been widely used for problems such as semantic segmentation [10] and audio source separation [11].

GPUs have become the preferred choice for deploying deep neural networks (DNNs) in both training and inference due to their high computational power and flexibility. In the context of mmWave MIMO systems, where data parallelism and efficient handling of intensive matrix operations are essential, GPUs offer significant advantages. The ability of GPUs to process multiple data points in parallel enables rapid and accurate channel estimation. In this contribution, the focus of the evaluation is on the efficient implementation of the CNN models in an embedded system. We evaluate the effect of modifying the frequencies of its CPU and GPU on the performance of the inference process, both in terms of execution time and energy consumption. These parameters are important metrics in the context of 6G systems, where substantial effort is devoted to placing sustainability at the center of architecture design and future implementations [12, 13]. In fact, obtaining energy-efficient and low-complexity AI solutions is a challenge for future sustainable 6G systems.

A key technology trend toward achieving sustainable communication systems [14] is edge computing, and more specifically edge AI, which has emerged as a promising solution to the challenges of real-time data processing in wireless communication systems [15,16,17]. By processing data at the edge of the network, close to the data source, edge AI reduces latency and bandwidth consumption, enabling real-time decision making and analysis. In our contribution, the low-power NVIDIA Jetson Orin Nano [18] exemplifies this paradigm shift with its advanced GPU architecture and energy-efficient design. This low-power system on a chip (SoC) incorporates Tensor Cores optimized for DNN processing, allowing it to meet the real-time demands of edge AI applications with a high performance-per-watt ratio, essential for energy-efficient operation in 5G and 6G environments. While GPUs provide a robust solution for edge computing tasks, alternative hardware platforms such as FPGAs and specialized AI accelerators offer promising future directions. FPGAs, for instance, allow for customized configurations that can potentially optimize latency and energy efficiency in specific DNN applications, and assessing their feasibility in scenarios that demand extreme customization or ultralow power consumption remains an interesting direction. However, given the GPU’s current advantages in flexibility, widespread support, and optimization for DNNs, it remains the most suitable solution for our present study.

The main contributions of this work are the following:

  • The paper demonstrates the deployment of two deep learning models for mmWave MIMO channel estimation, specifically a ResNet and a U-Net, on a low-power embedded system (NVIDIA Jetson Orin Nano), showcasing the feasibility and benefits of edge AI for 6G networks.

  • This study provides a detailed comparison of energy consumption between the ResNet and U-Net models, revealing that, despite its higher power usage, the U-Net’s faster execution time leads to superior overall energy efficiency, a crucial consideration for sustainable edge AI deployments.

  • The research highlights the trade-offs between power consumption and execution speed, offering valuable insights into optimizing AI model selection for resource-constrained environments in future wireless communication systems.

  • By implementing the models on an actual edge computing platform, the paper demonstrates the practical challenges and solutions involved in deploying AI-based channel estimation.

Experimental results show that the U-Net model is consistently faster than the ResNet model for all CPU and GPU frequencies. Both models behave similarly regarding their execution time and power consumption as we modify the frequencies of both devices. The U-Net model draws more power, but since it is faster, it consumes less energy per channel. Specifically, the U-Net model is about 3.5 times faster than the ResNet regardless of the frequencies, while consuming about 1.5 times more power. As a result, in the best case, the ResNet model consumes 0.59 joules per channel, while the U-Net only consumes 0.30 joules per channel. Our results highlight the trade-offs between computational performance and power efficiency, providing valuable insights for optimizing AI applications on low-power embedded systems.

The remainder of the paper is structured as follows. Section 2 provides an overview of the system model and background. In Sect. 3, we describe the proposed deep learning models, including the architecture of the ResNet and U-Net, and their implementation on the NVIDIA Jetson Orin Nano platform. Next, in Sect. 4, we detail the experimental setup, evaluation metrics, and the results of our performance and energy efficiency analysis. Finally, Sect. 5 offers concluding remarks, summarizing the key findings of the study.

2 System Model

2.1 Channel Model

Consider a single-user mmWave MIMO geometric channel where both the Tx and Rx are equipped with uniform linear arrays consisting of \(n_t\) and \(n_r\) antenna elements, respectively. Let L represent the number of scatterers, with each scatterer contributing a single propagation path between Tx and Rx. The complex channel coefficient for each path is given by \(\alpha _l\), \(l=1,\ldots ,L\), while \(\psi _l\) and \(\phi _l\) denote the AoA and AoD, respectively. The full parametric model expresses the channel as [19]:

$$\begin{aligned} {\textbf{H}}(\varvec{\theta }) = \sqrt{n_t n_r}\sum _{l=1}^{L}\alpha _l {\textbf{a}}_r(\psi _l){\textbf{a}}_t^H(\phi _l), \end{aligned}$$
(1)

where \(\varvec{\theta }~\triangleq ~[|\alpha _1|,\angle {\alpha _1},\phi _1,\psi _1,\dots ,|\alpha _L|,\angle {\alpha _L},\phi _L,\psi _L]^{T}\) is the parameter vector. Here, \(|\alpha _l|\) and \(\angle {\alpha _l}\) represent the magnitude and phase of each channel coefficient, respectively. The \(\alpha _l\) values are independent and identically distributed (i.i.d.) random variables with the distribution \(\alpha _l~\sim ~\mathcal{C}\mathcal{N}(0,\sigma ^{2}_{\alpha }/L)\). The AoAs (\(\psi _l\)) and AoDs (\(\phi _l\)) are uniformly distributed within \([0,2\pi ]\).

The array responses for the Tx and Rx antennas, assuming half-wavelength antenna spacing, are given by:

$$\begin{aligned} & {\textbf{a}}_t(\phi _l) = \frac{1}{\sqrt{n_t}}[1,\,e^{-j\pi \cos \phi _l},\cdots ,e^{-j\pi (n_t-1)\cos \phi _l}]^T, \end{aligned}$$
(2)
$$\begin{aligned} & {\textbf{a}}_r(\psi _l)=\frac{1}{\sqrt{n_r}}[1,\,e^{-j\pi \cos \psi _l},\cdots ,e^{-j\pi (n_r-1)\cos \psi _l}]^T. \end{aligned}$$
(3)
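For illustration, the following is a minimal NumPy sketch of the channel model in Eqs. (1)-(3); the function names and example parameter values are ours, not taken from any reference implementation:

```python
import numpy as np

def steering_vector(angle, n):
    """ULA response with half-wavelength spacing, Eqs. (2)-(3)."""
    return np.exp(-1j * np.pi * np.arange(n) * np.cos(angle)) / np.sqrt(n)

def channel(alphas, aoas, aods, n_r, n_t):
    """Geometric channel of Eq. (1): sqrt(n_t*n_r) * sum of rank-one path terms."""
    H = np.zeros((n_r, n_t), dtype=complex)
    for a, psi, phi in zip(alphas, aoas, aods):
        # alpha_l * a_r(psi_l) a_t(phi_l)^H, one term per path
        H += a * np.outer(steering_vector(psi, n_r), steering_vector(phi, n_t).conj())
    return np.sqrt(n_t * n_r) * H

# Example: L = 3 paths between 16-antenna arrays
L, n_r, n_t = 3, 16, 16
alphas = (np.random.randn(L) + 1j * np.random.randn(L)) / np.sqrt(2 * L)  # CN(0, 1/L)
aoas = np.random.uniform(0, 2 * np.pi, L)
aods = np.random.uniform(0, 2 * np.pi, L)
H = channel(alphas, aoas, aods, n_r, n_t)
```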

Estimating the channel \({\textbf{H}}(\varvec{\theta })\) is equivalent to estimating the parameters in \(\varvec{\theta }\). Measurements have shown that mmWave channels exhibit significant sparsity [20]. Consequently, the L paths are typically well-separated, which simplifies the task of channel estimation.

2.2 Pilot-Based Training Phase

The initial phase is the open-loop pilot-based training phase. The beam search space is defined by a codebook containing P and Q code words/directions at the Tx and Rx sides, respectively. After transmitting the pilot symbol through the \(Q\times P\) direction combinations, the observation matrix is formed.

The process works as follows: a pilot symbol \(\textrm{s}\), known to both Tx and Rx, is transmitted and received through a subset of \(P\le N_{max}\) and \(Q\le N_{max}\) spatial directions, respectively. Here, \(N_{max}\) denotes the maximum number of angle quantization levels, limited by the realistic phase shifters’ angle resolution. Since the beamforming scenario is analog, the Tx and Rx each have a single radio frequency chain, so beamforming and combining operations are performed in the analog domain [7].

Beamforming/combining vectors are computed to match the channel response [21], thus \({\textbf{f}}={\textbf{a}}_t({\bar{\phi }}_p)\) for \(p=0,\,1,\dots ,P-1\), and \({\textbf{w}}={\textbf{a}}_r({\bar{\psi }}_q)\) for \(q=0,\,1,\dots , Q-1\). For each \(\{q,p\}\) direction pair, the received signal is given by

$$\begin{aligned} \textrm{y}_{q,p}=\sqrt{\rho }\, {\textbf{w}}_q^H{\textbf{H}}{\textbf{f}}_p\,\textrm{s} + {\textbf{w}}_q^H{\textbf{n}}, \end{aligned}$$
(4)

where \(\rho \in {\mathbb {R}}^{+}\) represents the transmit power. The noise term \({\textbf{n}} \sim \mathcal{C}\mathcal{N}(0,\varvec{\Sigma }_{{\textbf{n}}})\) is a complex additive white Gaussian noise vector of size \(n_r \times 1\) with covariance \(\varvec{\Sigma }_{{\textbf{n}}}=\sigma ^2_{n}{\textbf{I}}_{n_r}\), where \({\textbf{I}}_{n_r}\) is the \(n_r\times n_r\) identity matrix. During training, the symbol \(\textrm{s}\) is set to 1 for simplicity, so that the system signal-to-noise ratio (SNR) is \(\rho /\sigma _{n}^2\).

The observation matrix, formed by transmitting the pilot symbols through the \(Q\times P\) directions, is given by

$$\begin{aligned} {\textbf{Y}}= \begin{bmatrix} \textrm{y}_{0,0} & \textrm{y}_{0,1} & \cdots & \textrm{y}_{0,P-1} \\ \textrm{y}_{1,0} & \textrm{y}_{1,1} & \cdots & \textrm{y}_{1,P-1} \\ \vdots & \vdots & \ddots & \vdots \\ \textrm{y}_{Q-1,0} & \textrm{y}_{Q-1,1} & \cdots & \textrm{y}_{Q-1,P-1} \end{bmatrix}=\sqrt{\rho }\, {\textbf{G}}(\varvec{\theta }) + {\textbf{N}}. \end{aligned}$$
(5)

The noise matrix \({\textbf{N}}\in {\mathbb {C}}^{Q\times P}\) contains i.i.d. elements \(\sim \mathcal{C}\mathcal{N}(0,\sigma ^2_n)\), and \({\textbf{G}} \in {\mathbb {C}}^{Q\times P}\) encodes the channel parameter vector \(\varvec{\theta }\).
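A minimal NumPy sketch of the beam sweep that forms \({\textbf{Y}}\) according to Eqs. (4) and (5) follows; it reuses steering_vector() and channel() from the sketch in Sect. 2.1, and the uniform angle grids stand in for the codebook quantization, which is not fully specified here:

```python
import numpy as np
# steering_vector() and channel() as defined in the sketch of Sect. 2.1

def observation_matrix(H, P, Q, rho=1.0, sigma_n=0.1):
    """Form Y per Eqs. (4)-(5) by sweeping the Q x P beam-pair grid (pilot s = 1)."""
    n_r, n_t = H.shape
    phi_grid = np.linspace(0, np.pi, P, endpoint=False)   # Tx directions (illustrative)
    psi_grid = np.linspace(0, np.pi, Q, endpoint=False)   # Rx directions (illustrative)
    Y = np.zeros((Q, P), dtype=complex)
    for q, psi in enumerate(psi_grid):
        w = steering_vector(psi, n_r)          # combining vector w_q = a_r(psi_q)
        for p, phi in enumerate(phi_grid):
            f = steering_vector(phi, n_t)      # beamforming vector f_p = a_t(phi_p)
            n = (np.random.randn(n_r) + 1j * np.random.randn(n_r)) * sigma_n / np.sqrt(2)
            Y[q, p] = np.sqrt(rho) * w.conj() @ H @ f + w.conj() @ n
    return Y
```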

To separate the impact of different path components, the elements in the observation matrix are rearranged as follows:

$$\begin{aligned} {\textbf{Y}} = \sqrt{\rho }\sum _{l = 1}^{L}{\textbf{G}}^{(l)}(\varvec{\theta }_l) + {\textbf{N}}. \end{aligned}$$
(6)

In this expression, the observation matrix is written as a sum of path contributions \({\textbf{G}}^{(l)}(\varvec{\theta }_l)\in {\mathbb {C}}^{Q\times P}\), each depending on a parameter vector

$$\begin{aligned} \varvec{\theta }_l~=~[|\alpha _l|,\angle {\alpha _l},\phi _l,\,\psi _l]^{T}. \end{aligned}$$
(7)

The problem of estimating the AoA and AoD is approached by searching for spectral peaks in the 2D image formed by the values in the observation matrix \({\textbf{Y}}\). Specifically, the estimation is treated as a supervised image-to-image translation problem, where images from one domain are transformed to exhibit the characteristics of images from another domain.

3 Deep-Learning-Based AoA/AoD Estimation

The procedure for estimating the AoA and AoD follows a three-step process. Initially, in a preprocessing phase, the complex-valued observation matrix \({\textbf{Y}}\) is separated into its real and imaginary components, denoted as \({\textbf{Y}}_{\Re }\in {\mathbb {R}}^{Q\times P}\) and \({\textbf{Y}}_\Im \in {\mathbb {R}}^{Q\times P}\), respectively. To enlarge their dimensions to \(Q'=\beta Q\) and \(P'=\beta P\), nearest-neighbor interpolation [22] is applied, resulting in the new matrices \({\bar{\textbf{Y}}}_{\Re }\in {\mathbb {R}}^{Q'\times P'}\) and \({\bar{\textbf{Y}}}_\Im \in {\mathbb {R}}^{Q'\times P'}\):

$$\begin{aligned} {\bar{\textbf{Y}}}_{\Re }= {\mathcal {I}} ({\textbf{Y}}_{\Re }), \\ {\bar{\textbf{Y}}}_{\Im }= {\mathcal {I}} ({\textbf{Y}}_{\Im }), \end{aligned}$$

where \({\mathcal {I}}(\cdot )\) represents the interpolation function and \(\beta \in {\mathbb {Z}}_{+}\) is a design parameter. In this work, interpolation with \(\beta \in \{2, 4\}\) has been applied to set \(Q'= P'\) (input matrix of size \(64 \times 64\)). By interpolating all inputs to a uniform size, we ensured that a single network model could be used across different antenna setups, simplifying the overall implementation.
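As a concrete illustration of this preprocessing step, the following NumPy sketch upsamples the real and imaginary parts by an integer factor \(\beta \) via nearest-neighbor replication and stacks them as two input channels (the function name and default are ours):

```python
import numpy as np

def preprocess(Y, target=64):
    """Nearest-neighbor interpolation of Re/Im parts to target x target,
    stacked as two input channels (Q' = P' = 64 in this work)."""
    beta = target // Y.shape[0]                # beta in {2, 4} for Q in {32, 16}
    up = lambda A: np.repeat(np.repeat(A, beta, axis=0), beta, axis=1)
    return np.stack([up(Y.real), up(Y.imag)], axis=-1)   # shape (Q', P', 2)
```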

After this preprocessing, the input domain consists of the matrices \({\bar{\textbf{Y}}}_{\Re }\) and \({\bar{\textbf{Y}}}_\Im \), from which we aim to estimate the AoAs and AoDs. Consequently, the network input is of size \(Q' \times P'\times 2\). Next, these matrices are fed into the NN architectures, yielding an output matrix (or image) \({\textbf{Z}}\in {\mathbb {R}}^{M\times N}\). The parameters M and N are chosen as integer multiples of the input dimensions \(Q'\) and \(P'\). In the last step, the output images are post-processed to find the most prominent peaks. In the following, we present the ResNet and U-Net models, as well as the algorithm used in the post-processing step for peak detection and the training configuration for the NNs.

3.1 ResNet Model

The first approach utilizes a residual CNN denoted as ResNet, which is an adaptation of the 2D-DeepFreq architecture introduced in [8] (refer to Fig. 1). This modified ResNet integrates transposed convolution layers to achieve super-resolution. The goal is to address the AoA and AoD estimation problem by identifying the complex sinusoids present in the input complex-valued observation matrix \({\textbf{Y}}\). Unlike the 2D-DeepFreq model, our input \({\textbf{Y}}\) already represents a frequency-domain sum of multiple sinusoids. Therefore, the matched filtering module found in 2D-DeepFreq before the upsampling stage has been omitted. Let \(w_n\) and \(h_n\) represent the width and height of the output feature map at the n-th layer, respectively. The number of output channels at the n-th layer is denoted by \(c_n\), the number of filters by \(f_n\), and the length of the filter kernels by \(k_n\) (assumed square). All filters used in the convolutions have a consistent size of \(k_n=k=5\), as detailed in [9]. The input consists of the interpolated real and imaginary components of the observation matrix, namely \({\bar{\textbf{Y}}}_{\Re }\) and \({\bar{\textbf{Y}}}_{\Im }\). These components are processed through an upsampling stage (Up 1) using a transposed 2D convolutional layer with a \(k \times k\) filter kernel (\(\text {ConvT}_{k}\)) and a stride of 2.

Fig. 1 ResNet architecture

Fig. 2 Residual block (RB)

Then, the image enters the super-resolution module which iteratively stacks 64 residual blocks (RBs), as described in [9]. As shown in Fig. 2, RBs follow the \(\text {Conv}_k\)-BN-ReLU order, where \(\text {Conv}_k\) stands for a 2D-convolution with a filter with dimensions \(k\times k\), ReLU is the rectified linear unit function and BN stands for batch normalization. For a generic input x to any RB, the output is given by:

$$\begin{aligned} \text {RB}(x) := \text {ReLU}({\mathcal {F}}(x) + x), \end{aligned}$$

where

$$\begin{aligned} {\mathcal {F}}(x):= \text {BN}(\text {Conv}_{k}(\text {ReLU}(\text {BN}(\text {Conv}_{k}(x))))). \end{aligned}$$

Finally, another transposed convolutional layer (\(\text {ConvT}_{k}\)) with a stride of 2 is applied for a second upsampling (Up 2). The output of the network is a matrix \({\textbf{Z}}\in {\mathbb {R}}^{M\times N}\), where M and N are selected as integer multiples of the input dimensions \(Q'\) and \(P'\). In this work we set a minimum upsampling factor of 2.
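To make the architecture concrete, the following is a PyTorch sketch of the residual block of Fig. 2 and the overall Up 1, stacked RBs, Up 2 pipeline; the framework choice, the channel width, and the exact upsampling configuration are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """RB(x) = ReLU(F(x) + x), with F = BN(Conv(ReLU(BN(Conv(x))))) and k = 5."""
    def __init__(self, c, k=5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, k, padding=k // 2), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, k, padding=k // 2), nn.BatchNorm2d(c),
        )
    def forward(self, x):
        return torch.relu(self.body(x) + x)

class ResNetEstimator(nn.Module):
    """Up 1 -> 64 stacked RBs -> Up 2 (stride-2 transposed convolutions)."""
    def __init__(self, c=32, k=5, n_blocks=64):       # c is an assumed width
        super().__init__()
        self.up1 = nn.ConvTranspose2d(2, c, k, stride=2, padding=k // 2, output_padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(c, k) for _ in range(n_blocks)])
        self.up2 = nn.ConvTranspose2d(c, 1, k, stride=2, padding=k // 2, output_padding=1)
    def forward(self, x):                             # x: (B, 2, Q', P')
        return self.up2(self.blocks(self.up1(x))).squeeze(1)   # (B, M, N)
```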

3.2 U-Net Model

The second proposed method is based on the well-known U-Net architecture (see Fig. 3), which was originally created for biomedical image segmentation tasks [10]. U-Net is a fully convolutional network designed for applications involving large amounts of data and high variability. Its architecture, characterized by a “U” shape, includes a contracting path to capture context and a symmetric expanding path to enable precise localization. In this study, we utilize the Wave-U-Net variant, which was introduced for audio source separation [11].

The core component of the U-Net model is the U-Net Block, depicted in Fig. 4. The U-Net Block consists of an encoder section (left) and a decoder section (right), which are composed of depth blocks (D-Blocks) and upsampling blocks (U-Blocks), respectively. The encoder is made up of five D-Blocks, defined as

$$\begin{aligned} \text {D-Block}(x):= \text {MaxPool} (\text {ConvBlock}(\text {ConvBlock}(x))), \end{aligned}$$

where x is the input tensor, and ConvBlock represents a standard convolutional block (with \(k=5\)), defined as:

$$\begin{aligned} \text {ConvBlock}(x) = \text {ReLU} (\text {BN} (\text {Conv}_{k} (x))). \end{aligned}$$
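A minimal PyTorch sketch of these encoder components, under the same assumptions as the ResNet sketch above (framework and channel widths are illustrative), could read:

```python
import torch.nn as nn

def conv_block(c_in, c_out, k=5):
    """ConvBlock(x) = ReLU(BN(Conv_k(x)))."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2),
                         nn.BatchNorm2d(c_out), nn.ReLU())

def d_block(c_in, c_out, k=5):
    """D-Block(x) = MaxPool(ConvBlock(ConvBlock(x))); halves the spatial size."""
    return nn.Sequential(conv_block(c_in, c_out, k),
                         conv_block(c_out, c_out, k), nn.MaxPool2d(2))
```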
Fig. 3 U-Net overall scheme

Fig. 4 U-Net block

Between the encoder and decoder of the U-Net Block, there is a Dense Block. This Dense Block transforms an input tensor x with shape \((B, H, W, C)\) into a 2D tensor with shape \((B, H \times W \times C)\) using a flattening operator (Fl). The output is then passed through two dense layers and a final reshaping operator (Rshp) to fit the decoder input:

$$\begin{aligned} \text {DenseBlock}(x) = \text {ConvBlock}(\text {Rshp}(\text {Dense}_{2}(\text {Dense}_{1}(\text {Fl}(x))))). \end{aligned}$$

Here, the number of neurons in \(\text {Dense}_2\) is chosen to match the required size for the decoder input, while \(\text {Dense}_1\) defines the bottleneck size, which is set to 1024. The decoder, in turn, contains five U-Blocks and a super-resolution layer. Each U-Block first upsamples its input with a \(\text {ConvT}_{k}\) layer (followed by batch normalization and ReLU activation) and then applies two convolutional blocks, defined as:

$$\begin{aligned} \text {U-Block}_i(x) = \text {ConvBlock}(\text {ConvBlock}(\text {ConvT}_{k}(x \oplus y_i))), \end{aligned}$$

where \(y_i\) represents the output of the corresponding D-Block in the encoder, and \(\oplus \) indicates concatenation. Finally, the inputs and outputs of the U-Net block pass through the upsampling layers “Up 0” and “Up 1” before being concatenated and forwarded to the final “Up 2” layer, as illustrated in Fig. 3.
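A corresponding PyTorch sketch of the U-Block is given below; following the equation above, concatenation happens before the transposed convolution, so the skip tensor \(y_i\) is assumed to match the spatial size of the decoder input at that stage, and the channel widths are again illustrative:

```python
import torch
import torch.nn as nn
# conv_block() as defined in the previous sketch

class UBlock(nn.Module):
    """U-Block_i(x) = ConvBlock(ConvBlock(ConvT_k(x (+) y_i))),
    where (+) is channel-wise concatenation with the encoder output y_i."""
    def __init__(self, c_in, c_skip, c_out, k=5):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in + c_skip, c_out, k, stride=2,
                                     padding=k // 2, output_padding=1)
        self.convs = nn.Sequential(conv_block(c_out, c_out, k),
                                   conv_block(c_out, c_out, k))
    def forward(self, x, y_i):
        return self.convs(self.up(torch.cat([x, y_i], dim=1)))
```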

Table 1 shows the size of both models based on their trainable and non-trainable parameters. The U-Net model is much larger than the ResNet model, but it is still a small DNN that can be processed on a low-power SoC with 8 GB of DDR memory.

Table 1 Model parameters (thousands) and size for both models

3.3 Peak Detection

To detect the most prominent peaks in the image produced by either the ResNet or the U-Net, where the useful angular information resides, we use the SimpleBlobDetector class, which implements the well-known blob detection method included in the OpenCV library. Specifically, we first normalize and scale the pixel values to the range 0-255, and then apply a Gaussian blur filter to smooth the image and make the peak detection more reliable. To finish the preprocessing, we binarize the image to highlight the peaks. We then apply the blob detection method to find the regions with the highest average intensities and sort them by this value to retain the most prominent ones [23].
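A minimal sketch of this pipeline using the OpenCV API is shown below; the blur, threshold, and filtering settings are illustrative choices, not necessarily the exact values used in our experiments:

```python
import cv2
import numpy as np

def detect_peaks(Z, n_peaks=3, thresh=128):
    """Find the most prominent peaks in the network output Z."""
    img = cv2.normalize(Z, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    img = cv2.GaussianBlur(img, (5, 5), 0)             # smooth before detection
    _, binary = cv2.threshold(img, thresh, 255, cv2.THRESH_BINARY)

    params = cv2.SimpleBlobDetector_Params()
    params.filterByColor = True
    params.blobColor = 255                             # detect bright blobs (peaks)
    params.filterByArea = False
    params.filterByInertia = False
    params.filterByConvexity = False
    detector = cv2.SimpleBlobDetector_create(params)
    keypoints = detector.detect(binary)

    # Rank blobs by the mean intensity around their centers in the smoothed image
    def mean_intensity(kp):
        x, y, r = int(kp.pt[0]), int(kp.pt[1]), max(int(kp.size / 2), 1)
        return img[max(y - r, 0):y + r, max(x - r, 0):x + r].mean()
    return sorted(keypoints, key=mean_intensity, reverse=True)[:n_peaks]
```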

3.4 Training Configuration

Similar to the method described in [9], during the training phase, the target for each channel in the dataset is a high-resolution matrix \({\textbf{X}} \in {\mathbb {R}}^{M \times N}\), composed of the sum of L 2D Gaussian functions. The elements of the matrix \({\textbf{X}}\), for \(m=0,\ldots ,M-1\) and \(n=0,\ldots ,N-1\), are calculated as follows:

$$\begin{aligned} {\textbf{X}}_{m,n}= \frac{1}{2\pi \sigma _1\sigma _2}\sum _{l=1}^L e^{-\left( \frac{(\omega _m- {\tilde{\omega }}_{\psi _l})^2}{2\sigma _1^2} + \frac{\left( \omega _n - {\tilde{\omega }}_{\phi _l}\right) ^2}{2\sigma _2^2}\right) }, \end{aligned}$$
(8)

where \(\omega _m = m\frac{2\pi }{M}\), \(\omega _n = n\frac{2\pi }{N}\), and \({\tilde{\omega }}_{\phi _l}={\mathcal {W}}_{2\pi }(\omega _{\phi _l})\), \({\tilde{\omega }}_{\psi _l}={\mathcal {W}}_{2\pi }(\omega _{\psi _l})\) denote the \(2\pi \)-wrapped ground-truth frequencies. The standard deviations \(\sigma _1\) and \(\sigma _2\) determine the width of the Gaussian functions, with values set to \(\sigma _1=\sigma _2=0.2\). The models are trained by minimizing the mean square error (MSE) between the ground truth and the NN output, defined by the loss function:

$$\begin{aligned} \text {Loss} = \frac{1}{MN} \sum _{n=0}^{N-1} \sum _{m=0}^{M-1} ({\textbf{Z}}_{m,n} - {\textbf{X}}_{m,n})^2. \end{aligned}$$
(9)
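For illustration, the target of Eq. (8) and the loss of Eq. (9) can be generated as follows; the mapping from angles to \(2\pi \)-wrapped spatial frequencies, \(\omega = \pi \cos (\cdot )\) for a half-wavelength ULA, and the output size \(M = N = 128\) are our assumptions:

```python
import numpy as np

def ground_truth(aoas, aods, M=128, N=128, sigma=0.2):
    """Target X of Eq. (8): one 2D Gaussian per path, sigma_1 = sigma_2 = 0.2."""
    w_m = 2 * np.pi * np.arange(M) / M
    w_n = 2 * np.pi * np.arange(N) / N
    X = np.zeros((M, N))
    for psi, phi in zip(aoas, aods):
        # assumed angle-to-spatial-frequency map for a half-wavelength ULA
        w_psi = np.mod(np.pi * np.cos(psi), 2 * np.pi)
        w_phi = np.mod(np.pi * np.cos(phi), 2 * np.pi)
        X += np.exp(-((w_m[:, None] - w_psi) ** 2
                      + (w_n[None, :] - w_phi) ** 2) / (2 * sigma ** 2))
    return X / (2 * np.pi * sigma ** 2)

def mse_loss(Z, X):
    """Eq. (9): mean squared error between network output Z and target X."""
    return np.mean((Z - X) ** 2)
```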

4 Edge AI Implementation

4.1 Experimental Environment

For this investigation, we utilized Jetson Orin Nano modules, built on an 8 nm process, as our testing units [18]. Each module houses an Orin SoC that integrates a six-core ARM Cortex-A78AE CPU with 1.5 MB of L2 and 4 MB of L3 cache, alongside a GPU based on NVIDIA’s Ampere architecture. This GPU includes four streaming multiprocessors, 512 CUDA cores, and 16 Tensor Cores. The system also includes 4 GB of 64-bit LPDDR5 memory, providing a bandwidth of 34 GB/s, shared between the CPU and GPU. The modules operate in two power modes: a low-power 7-watt mode and a higher 10-watt mode for increased performance. Figure 5 depicts the main components of the Jetson Orin Nano module, including its Orin SoC.

Fig. 5 Main components of the Jetson Orin Nano module

To measure power consumption accurately, we employed the PMLIB framework [24], which captures power data at 10 Hz from the device’s integrated sensors. These readings are averaged over the duration of each test to evaluate the total power usage across three main domains: the entire module; the combined load of the CPU, GPU, and other processing units such as the computer vision accelerators; and the memory subsystem together with other key components such as the image signal processor.

Energy usage was calculated by multiplying the measured power consumption by the operational time during specific tasks, providing a measure in joules (J). This approach allowed us to quantify the energy consumption necessary to evaluate the efficiency of the components associated with each sensor under different operational scenarios.
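A minimal sketch of this average-power-times-time computation, with hypothetical sample values, is:

```python
import numpy as np

def energy_joules(power_watts, sample_rate_hz=10.0):
    """Energy as average power times elapsed time over a uniformly sampled trace."""
    duration_s = len(power_watts) / sample_rate_hz
    return float(np.mean(power_watts)) * duration_s

trace = np.array([4.8, 5.1, 5.0, 5.2, 4.9])   # hypothetical 10 Hz samples (W)
print(energy_joules(trace))                    # 5.0 W x 0.5 s = 2.5 J
```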

The board’s power, thermal, and electrical management is supported by the NVIDIA board support package (BSP) [25]. This package includes CPU dynamic frequency scaling, managed by the Linux ondemand governor, which adjusts the CPU’s frequency in response to real-time system demands; a similar strategy is employed for the GPU. This adaptive frequency scaling ensures that both the CPU and GPU modulate their operational frequencies based on their computational load, optimizing power efficiency. Alternatively, a fixed frequency can be set for both the CPU and the GPU that does not vary during a given test. Our experiments explored various frequency settings for both the CPU and GPU to determine their effects on the board’s overall power consumption.
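As a hedged illustration of how such a fixed CPU frequency can be set, the following sketch clamps the standard Linux cpufreq limits; the GPU frequency is pinned analogously through its devfreq interface, whose sysfs path is platform-specific and therefore not shown:

```python
from pathlib import Path

def pin_cpu_freq(khz, cpus=range(6)):
    """Clamp the cpufreq min/max limits of each core to a single value (in kHz).
    Requires root. If a write is rejected, set the other limit first."""
    for i in cpus:
        base = Path(f"/sys/devices/system/cpu/cpu{i}/cpufreq")
        (base / "scaling_max_freq").write_text(str(khz))
        (base / "scaling_min_freq").write_text(str(khz))

# Example: pin the six Cortex-A78AE cores to 1267.2 MHz
# pin_cpu_freq(1267200)
```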

4.2 Experimental Results

For the training and validation of the ResNet and U-Net architectures, datasets composed of 10,000 and 1000 different observation matrices were used, respectively. The datasets were synthetically generated following the model in Eq. (5) with \(\rho = 1\), i.e., an SNR of \(1/\sigma ^2_n\). More specifically, each dataset element contains a random observation matrix generated for a random SNR in the range between \(-10\, \text {dB}\) and \(25\, \text {dB}\), with a number of channel paths randomly selected from \(L = 1\) to \(10\). The codebook size is also selected randomly by setting \(Q = P\) to either 16 or 32. Channel coefficients \( \alpha _l \), \( l=1,\ldots ,L \), are drawn from a zero-mean complex Gaussian distribution with variance \( 1/L \), while the AoA and AoD angles are drawn from a uniform distribution in the range \([0, \pi ]\). Each dataset element also contains the corresponding ground-truth AoA and AoD for the \( L \) channel paths to assist the model training. A sketch of this sampling procedure is given below.
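The following NumPy sketch draws the random parameters of one dataset element under the assumptions just listed (the helper name is ours); the observation matrix itself then follows Eq. (5) via the sketches in Sect. 2:

```python
import numpy as np

def sample_dataset_entry():
    """Draw the random parameters of one synthetic training example."""
    snr_db = np.random.uniform(-10, 25)
    sigma_n = 10 ** (-snr_db / 20)                 # rho = 1, so SNR = 1 / sigma_n^2
    L = np.random.randint(1, 11)                   # 1 to 10 channel paths
    Q = P = np.random.choice([16, 32])             # codebook size
    alphas = (np.random.randn(L) + 1j * np.random.randn(L)) / np.sqrt(2 * L)
    aoas = np.random.uniform(0, np.pi, L)          # ground-truth AoAs
    aods = np.random.uniform(0, np.pi, L)          # ground-truth AoDs
    return snr_db, sigma_n, L, Q, P, alphas, aoas, aods
```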

The experimental evaluation of the resulting CNN implementations assumes, for the sake of simplicity, equal values for the parameters P, Q, \(n_r\), and \(n_t\) (note that in more realistic systems these parameters may differ), which implies that the size of the observation matrix \({\textbf{Y}}\) is directly \(n_r\times n_t\). As a typical evaluation case, we chose \(L=3\) paths and observation matrices of size \(n_r\times n_t =16 \times 16\).

All experiments were performed with a batch size of one: for every channel, the data are transferred from the CPU to the GPU, inference runs on the GPU, and the result is post-processed on the CPU. We measure the time and power consumption of the whole process while iterating over several channels and report the average over all iterations. A sketch of this measurement loop is shown below.
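The following PyTorch sketch illustrates the per-channel measurement loop; it assumes the detect_peaks() helper from the sketch in Sect. 3.3 and a CUDA device, and it is not the exact benchmarking harness used for the reported figures:

```python
import time
import numpy as np
import torch

def time_per_channel(model, channels, device="cuda"):
    """Average time of transfer + inference + post-processing at batch size 1."""
    model.eval().to(device)
    times = []
    with torch.no_grad():
        for Y in channels:                          # each Y: (Q', P', 2) array
            t0 = time.perf_counter()
            x = torch.from_numpy(Y).float().permute(2, 0, 1).unsqueeze(0).to(device)
            z = model(x)                            # inference on the GPU
            torch.cuda.synchronize()                # wait for the kernel to finish
            Z = z.squeeze().cpu().numpy()
            _ = detect_peaks(Z)                     # CPU post-processing (Sect. 3.3)
            times.append(time.perf_counter() - t0)
    return float(np.mean(times))
```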

Fig. 6 Execution time using both models (rn and un) to perform the inference of one channel varying both the CPU and GPU frequencies. Note that the results for the different GPU frequencies overlap for each of the two models

Figure 6 presents the inference times (in seconds) for both models across various combinations of CPU and GPU frequencies. It is evident that the U-Net model (un) consistently outperforms the ResNet model (rn) in speed, irrespective of the frequencies. Both models exhibit similar behavior in response to changes in CPU and GPU frequencies: the execution time decreases rapidly at lower CPU frequencies and more gradually at higher frequencies, achieving the fastest execution time at the highest frequency. Interestingly, the GPU frequency does not affect the execution time of either model, as a batch size of one does not fully utilize the GPU resources. Using the jetson-stats package, we monitored the platform’s resource usage during the inference process. At the highest GPU frequency, only about 50% of the GPU’s capacity is utilized. However, as the GPU frequency decreases (while maintaining the CPU frequency), the communication latency to transfer each channel to the GPU is reduced, leading to up to 70% GPU utilization. Thus, the reduction in GPU frequency is offset by increased utilization, resulting in consistent inference times regardless of the GPU frequency. To confirm the effect of the GPU frequency, we also carried out the inference process using a batch size of 32 channels. In this case, the GPU load is always very close to 100% for every GPU frequency. As a consequence, the GPU frequency does affect the execution time, which decreases as the frequency increases.

Fig. 7 Watts consumed by the whole platform to perform the inference using the ResNet model

Fig. 8 Watts consumed by the whole platform to perform the inference using the U-Net model

Reducing inference execution time is important, but in many applications using edge devices, minimizing power consumption can be paramount. Extending battery life is crucial in many environments, and one effective approach is to reduce the frequencies of the device components. Figures 7 and 8 illustrate the power consumption of the entire platform while executing both models at various combinations of CPU and GPU frequencies. We determined the power consumed by the application by subtracting the idle power of the platform, measured at the minimum CPU and GPU frequencies. Both figures use the same color scale to facilitate comparison between the two models.

As previously observed, the U-Net model is consistently faster than the ResNet model. However, it also consumes more power across all CPU and GPU frequencies. The power consumption patterns of both models are similar, gradually increasing from the bottom-left corner to the top-right corner as both CPU and GPU frequencies rise. Therefore, to minimize power consumption, it is essential to use the lowest frequencies for both computing devices.

Fig. 9 Total energy (J) consumed by the whole platform to perform the inference of one channel for each combination of CPU and GPU frequencies using the ResNet model

Fig. 10 Total energy (J) consumed by the whole platform to perform the inference of one channel for each combination of CPU and GPU frequencies using the U-Net model

To assess the trade-off between execution time and power consumption, we calculated the energy required to perform the inference of one channel. This was done by multiplying the power consumption shown in the previous figures by the execution time shown in Fig. 6, resulting in the energy consumption per channel in joules. Figures 9 and 10 present this metric for both models, using the same color scale for easy comparison.

As demonstrated, the U-Net model consumes less energy per channel than the ResNet model across all combinations of CPU and GPU frequencies. This indicates that its faster execution time more than compensates for its higher power consumption, leading to lower energy consumption per channel. When analyzing the evolution of energy consumption with varying CPU and GPU frequencies, both models exhibit similar behavior. The highest energy consumption occurs at the highest CPU and GPU frequencies, while the lowest energy consumption is observed at the two lowest GPU frequencies and when the CPU runs between 883.2 MHz and 1267.2 MHz.

5 Conclusion

In this paper, we have presented a comprehensive study on the implementation of deep learning-based models, specifically a residual convolutional neural network (ResNet) and a U-Net architecture, for the efficient estimation of angle-of-arrival (AoA) and angle-of-departure (AoD) in mmWave MIMO systems. Our focus was on evaluating the performance and energy efficiency of these models when deployed on the NVIDIA Jetson Orin Nano platform, a low-power embedded system optimized for edge AI applications.

The experimental results demonstrate that while the U-Net model consumes more power than the ResNet model, it is approximately 3.5 times faster, leading to significantly lower energy consumption per channel estimation. This finding underscores the importance of considering both execution time and power consumption in the design and selection of AI models for edge computing in future 6G systems. The trade-offs between computational performance and power efficiency highlighted in this study provide valuable insights for optimizing AI applications on low-power embedded systems.

Our research suggests that deploying deep learning models like U-Net on edge devices is a promising approach for enhancing the efficiency of mmWave MIMO channel estimation. This approach not only supports the high-throughput requirements of 6G communications but also aligns with the growing emphasis on sustainability in modern communication system design.