1 - Design of GPU-Based Aerial Monitoring Signal Processing Software

IET Radar, Sonar & Navigation
ORIGINAL RESEARCH
DOI: 10.1049/rsn2.12244
Received: 13 November 2021 | Revised: 18 January 2022 | Accepted: 18 February 2022

Design of high-speed software defined radar with GPU accelerator

Wenda Li1 | Chong Tang1 | Shelly Vishwakarma1 | Karl Woodbridge2 | Kevin Chetty1

1 Department of Security and Crime Science, University College London, London, UK
2 Department of Electronic and Electrical Engineering, University College London, London, UK

Correspondence
Wenda Li, Department of Security and Crime Science, University College London, 35 Tavistock Square, Bloomsbury, London, WC1H 9EZ, UK.
Email: wenda.li@ucl.ac.uk

Funding information
Engineering and Physical Sciences Research Council, Grant/Award Number: EP/R018677/1

Abstract
Software defined radar (SDRadar) systems have become an important area for future radar development and are based on similar concepts to software defined radio (SDR). Most of the processing, such as filtering, frequency conversion and signal generation, is implemented in software. Currently, radar systems tend to have complex signal processing and operate at wider bandwidths, which means that limits on the available computational power must be considered when designing an SDRadar system. This paper presents a feasible solution to this potential limitation by accelerating the signal processing using a GPU to enable the development of a high-speed SDRadar system. The developed system overcomes the processing speed limitation of CPU-only processing, and has been tested on three different SDR devices. Results show that, with the GPU accelerator, the processing rate can reach up to 80 MHz compared to 20 MHz with the CPU only. The high-speed processing makes it possible to run in real time and process the full bandwidth of the WiFi signal acquired by multiple channels. The gains made through porting the processing to the GPU move the technology towards real-world application in various scenarios ranging from healthcare to IoT, and other applications that require significant computational processing.

KEYWORDS
GPU accelerator, signal processing, software defined radar
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2022 The Authors. IET Radar, Sonar & Navigation published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
IET Radar Sonar Navig. 2022;16:1083–1094. wileyonlinelibrary.com/journal/rsn2

1 | INTRODUCTION

The applications of radar cover many broad and varied areas, for example, long-range airborne and weather surveillance, short-range target detection, target recognition and classification. These applications have diverse demands, leading to the proliferation of highly specialised radar systems on the same platform, for example, ships, aircraft and others [1]. In addition, these platforms are also equipped with a number of other types of Radio Frequency (RF) sensors, such as communication and navigation systems. Many radar systems were implemented using hardware such as Field programmable gate arrays (FPGAs). However, a software implementation, which uses general-purpose processors, is more desirable for its cost-effectiveness, flexibility and fast development. Thus, the approach of sharing a platform with an SDRadar allows these requirements to be met [2]. However, increases in signal sampling rates (signal bandwidth) and the number of receiver channels in Multiple-Input and Multiple-Output (MIMO)/distributed radar systems mean there is more data to be processed. For example, the universal software radio peripheral (USRP) family [3] has sampling rates from 20 MHz up to 160 MHz for 2 to 4 channels, the DigitizerNetbox [4] can operate at 5 GHz with up to 16 channels, and there are also Ultra-Wide Band designs [5]. Consequently, this increasing requirement in computational power becomes a key parameter to be considered for an SDRadar system.

FPGAs and GPUs are commonly employed to accelerate computational processing. There are a number of differences between these two architectures, in terms of flexibility, power
consumption and latency etc. [6, 7]. The major concern in this work is the amount of onboard memory. For example, the commercial Xilinx FPGA UltraScale product line has a maximum of 133 megabytes of block memory [8]. In comparison, gaming GPUs such as the NVIDIA GeForce RTX 2060 can offer onboard memory of 6 gigabytes. In this work, we focus on high sampling rate radar signals, which require extensive data transfer via peripheral component interconnect (PCI) Express. A GPU has larger onboard memory and allows data to be transferred and processed simultaneously, making it well-suited for this SDRadar work. Also, a GPU-based accelerator means the system can be easily modified to work with other Software Defined Radio (SDR) devices without the heavy development required by an FPGA-based accelerator.

With the introduction of NVIDIA Compute Unified Device Architecture (CUDA), the parallel processing capabilities of GPUs have become accessible beyond graphics, both financially and technologically. Another advantage of a GPU-based accelerator is its ability to free up the CPU from heavy parallel computing so that it can focus on Data Acquisition (DAQ) and synchronisation control. The utilisation of a GPU enables real-time operation without degradation of the sampling rate.

Considering these advantages, many researchers have been working on GPU-based accelerators in various systems to deal with computationally heavy processes such as the fast Fourier transform (FFT) [9] and correlation [10]. An early work [11] presents an analysis of correlation with GPU acceleration and demonstrates a speed-up factor of 15 compared to a CPU-only process. Work [10] demonstrates GPU-accelerated back-projection in the reconstruction of Synthetic Aperture Radar. It also shows an impressive runtime with a speed-up factor between 50 and 60. However, this system does not provide a practical solution for GPU-based accelerators in real-time processing. A Raspberry Pi based SDRadar system described in [9] was built for passive radar, including reference signal reconstruction and a two-dimensional FFT (size of 2048 × 512) with the on-board CPU and GPU. In this case, the GPU only shows a slight improvement of 10% due to the sequential framework design, which leads to overhead for a single FFT. The limitation of this system is that it operates at a relatively low sampling rate of 240 kHz and is not fully compatible with the parallel framework needed to take full advantage of GPU acceleration. GPUs have also been used to accelerate radar image processing [12] with real-time capability. This system uses a hybrid CPU-GPU scheme to process a 40 MHz sampling rate, and can generate images of 6000 pixels at 8 fps. In SDRadar systems, using a GPU as an accelerator has been studied broadly in recent years. One of the most representative works is [13], which implemented a real-time Global Positioning System (GPS) receiver with adaptive beam-steering capability using a software-defined approach. This SDRadar offers sufficient computational capability to support a 4-element antenna array at up to 40 Msps (mega samples per second). After that, a testbed [14] for an anti-jamming receiver was developed with embedded Space-Time Adaptive Processing and Space-Frequency Adaptive Processing. It further deploys batch mode to take full advantage of the parallelism resources provided by the GPU. However, both of these SDRadars were developed specifically for beamforming and are limited in flexibility and configurability. A comparison between these works and our system is summarised in Table 1.

TABLE 1 Comparison with other SDRadar systems
Work | [13] | [14] | Our work
Mechanism | Beamforming | Nulling and beamforming | Cross ambiguity function
Configurability | Limited | Partially | Fully adjustable
SDR device | USRP2 | ADC | Multiple
No. of channels | 4 | Up to 8 | Up to 16
Real-time ability | Partially | Partially | Fully adjustable
Processed sample rate (I/Q) | 20 Msps | 20 Msps | 80 Msps
Application | GPS sensors | Anti-jamming | Near-field monitoring
Abbreviations: ADC, analog-to-digital converter; SDR, Software defined radio.

In this work, we use the LabVIEW GPU Analysis Toolkit to interface with the CUDA functions and embed them into our previous SDRadar system [15] with a new architecture. This is achieved by switching to a parallel framework where DAQ (by the CPU) and signal processing (partly by the GPU) run separately. The processing speed is significantly improved with the GPU accelerator when compared to the CPU-only system. To demonstrate the effectiveness of the GPU accelerator in real time, three SDR devices have been tested to quantify the advantage of the proposed concepts in flexible deployment and processing speed. The hardware specification of these SDR devices is presented in Table 2 and the devices are shown in Figure 1.

Compared to previous works [12, 17, 18], the following contributions are made by this paper:

• This paper presents a robust high-speed SDRadar system that is capable of processing sampling rates of up to 80 MHz without dropping samples. The system runs typical passive radar processing including the Cross-Ambiguity Function (CAF) and Direct Signal Interference cancellation [19].
• The proposed system can be easily tuned to run with various SDR devices. We have modified it to run with three
different devices including the USRP 2920, USRP 2945 and DigitizerNetbox.
• The new system shows a marked improvement in processing speed compared to our previous CPU-only system [15]. Three SDRadar features (full bandwidth processing, multi-channel and phased-array system) have been implemented to demonstrate the feasibility of the GPU accelerator.

TABLE 2 Specification of SDR devices in this work
Device | USRP 2920 [3] | USRP 2945 [16] | DigitizerNetbox DN2.593-16 [4]
No. of channels (devices) | 1 (2) | 4 (1) | 16 (1)
Connector | Ethernet | PCIe | Ethernet
ADC resolution | 16-bit | 14-bit | 16-bit
Max. sampling rate per channel | 20 MHz | 80 MHz | 80 MHz
Theoretical sampling rate (as an SDR device) | 40 MHz | 100 MHz | 640 MHz
Coherent/non-coherent | Non-coherent | Non-coherent | Coherent
Data type | I/Q | I/Q | Amplitude
Abbreviation: SDR, Software defined radio.

FIGURE 1 Three Software Defined Radio (SDR) devices: (a) USRP 2920, (b) USRP 2945, and (c) DigitizerNetbox (DN)

The rest of this article is organised as follows. Section 2 outlines the concepts of our SDRadar system and the associated signal processing for WiFi-based passive sensing; Section 3 presents the design and implementation of GPU-based parallel processing; measured performance and experimental results are shown in Section 4; finally, conclusions from this study are in Section 5.

2 | SIGNAL PROCESSING IN SDR

2.1 | Cross-correlation

Cross-correlation evaluates the level of similarity between two functions or signals [20], and is widely used to detect the time delay and Doppler shift between a known transmitted signal and the signal reflected from objects. According to our previous works [15, 21], cross-correlation consumes most of the computational power in signal processing. In such scenarios, the speed of the cross-correlation is crucial to the overall performance of the system. In the continuous-time and discrete domains, the cross-correlation between the reference (transmitted) signal $s_r$ and the surveillance (received) signal $s_s$ can be written as Equations (1) and (2):

$$(s_r \cdot s_s)(\tau) = \int_{-\infty}^{\infty} s_r^{*}(t)\, s_s(t+\tau)\, dt \qquad (1)$$

$$(s_r \cdot s_s)[n] = \sum_{m=-\infty}^{\infty} s_r^{*}[m]\, s_s[m+n] \qquad (2)$$

where $*$ represents the complex conjugate. The cross-correlation theorem suggested in [22] is presented in discrete form in Equation (3):

$$(s_r \cdot s_s)[n] = \sum_{m=-\infty}^{\infty} \mathrm{IFFT}\left\{ \mathrm{FFT}\left(s_r^{*}[m]\right)\, \mathrm{FFT}\left(s_s[m+n]\right) \right\} \qquad (3)$$

where FFT is the fast Fourier transform and IFFT is the inverse fast Fourier transform.

One of the limitations of applying Equation (3) to an SDRadar system is the size of $s_r$ and $s_s$. Consider a sampling rate of 20 MHz (a typical bandwidth of a WiFi signal at 2.4 GHz): there are 20 M data points to process every second. In particular, the FFT of a long sequence is very slow, which makes real-time cross-correlation almost impossible. For this reason, batch processing has been applied, which segments a long sequence $s$ into $L$ short sequences of equal length, $s = [s_1, s_2, \ldots, s_L]$. Cross-correlation with batch processing can be expressed as:

$$s_r \cdot s_s = \sum_{i=0}^{L-1} \mathrm{IFFT}\left\{ \mathrm{FFT}\left(s_{r,i}^{*}\right)\, \mathrm{FFT}\left(s_{s,i}\right) \right\} \qquad (4)$$
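To make the frequency-domain route concrete, the following minimal Python sketch compares a direct evaluation of the discrete cross-correlation with an FFT-based one, and then applies it batch by batch. Several things here are assumptions for illustration rather than the paper's implementation: the correlation is treated as circular over a finite block, the conjugate is applied to the reference spectrum (the standard form of the correlation theorem), a naive DFT stands in for an FFT library so that the example needs only the standard library, and the function names and the chirp test signal are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT; a stand-in for a real FFT library."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def xcorr_direct(s_r, s_s):
    """Circular form of Equation (2): sum over m of conj(s_r[m]) * s_s[m + n]."""
    size = len(s_r)
    return [
        sum(s_r[m].conjugate() * s_s[(m + n) % size] for m in range(size))
        for n in range(size)
    ]


def xcorr_fft(s_r, s_s):
    """Frequency-domain route: IFFT(conj(FFT(s_r)) * FFT(s_s))."""
    spectrum = [a.conjugate() * b for a, b in zip(dft(s_r), dft(s_s))]
    return dft(spectrum, inverse=True)


def xcorr_batched(s_r, s_s, num_batches):
    """Correlate batch by batch; one row of lags per batch (the terms of Equation (4))."""
    length = len(s_r) // num_batches  # batch length; any remainder is dropped
    return [
        xcorr_fft(s_r[i * length:(i + 1) * length], s_s[i * length:(i + 1) * length])
        for i in range(num_batches)
    ]


# Hypothetical test signal: a 16-sample chirp and a copy circularly delayed by 3 samples.
SIZE, DELAY = 16, 3
reference = [cmath.exp(-1j * cmath.pi * m * m / SIZE) for m in range(SIZE)]
surveillance = [reference[(m - DELAY) % SIZE] for m in range(SIZE)]

corr = xcorr_fft(reference, surveillance)
peak_lag = max(range(SIZE), key=lambda n: abs(corr[n]))  # lag with the strongest match
rows = xcorr_batched(reference, surveillance, num_batches=2)
```

In this sketch `peak_lag` recovers the delay that was applied to the reference, and `xcorr_batched` shows why batching helps: each transform operates on a short block instead of the full sequence.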
2.2 | Signal processing in SDR

The block diagram of the SDRadar signal processing is shown in Figure 2. The processing begins with DAQ from the SDR devices. The signals from the antenna are digitised and transferred to the computer for processing. The CAF calculates the distance and velocity of the target. However, the target, clutter and direct signal are mixed, which reduces the sensitivity of the SDR. The CLEAN algorithm [19] is used to suppress the clutter and direct signal. Afterwards, Constant False Alarm Rate processing is used to further reduce the noise.

FIGURE 2 Block diagram of proposed SDRadar system

The block diagram of the CAF process is shown in Figure 3. The complete CAF process has two inputs: the transmitted signal, which is broadcast by the radar system, and the received signal, which is reflected from the target. Both signals are reshaped into batch mode to avoid processing a long sequence. Afterwards, the cross-correlation of Equation (4) is applied to find the relative time delay within each batch. The final step is to carry out another FFT across the batches, which calculates the Doppler shift from the time delays given by the cross-correlation. The output is a 2D range-Doppler surface $S_{CAF}$.

FIGURE 3 Block diagram of Cross-Ambiguity Function (CAF)

The CLEAN algorithm follows a similar approach to the CAF, except that both of its inputs are the transmitted signal. This creates a self range-Doppler surface $S_{self}$, which represents the direct signal. The cleaned range-Doppler surface $S_{clean}$ is then calculated as $S_{clean} = |S_{CAF}| - \alpha |S_{self}|$, where $|\cdot|$ represents the absolute value and $\alpha$ is the scaling factor for $S_{self}$ in relation to $S_{CAF}$.

3 | GPU-BASED PARALLEL PROCESSING

Comparing the hardware architectures of GPUs and CPUs, GPUs are specifically designed for computationally intensive calculations with a high degree of parallelism. Consequently, GPUs devote more transistors to data processing than to data caching and flow control. These differences in hardware architecture determine the different processing speeds for signal processing. Processing data in GPU memory also does not affect the operating system, which gives better stability. As a result, accelerating the signal processing with a GPU is considered a powerful and fast development option for SDRadar systems [10, 17].

CUDA is a general-purpose parallel computing platform and programming model that leverages the parallel compute engine in GPUs to solve many complex computational problems more efficiently than on a CPU [23]. Moreover, the LabVIEW GPU Analysis Toolkit can access the CUDA functions, and DAQ software in LabVIEW can control the SDR devices.

A comparison of FFTs performed (in batch mode) by the CPU and the GPU is shown in Figure 4. The X-axis represents data points from SDR devices, and the Y-axis represents time (a batch can be considered as the data points from one specific time period). A conventional SDRadar system uses a CPU to perform the FFT in sequential order, sweeping across all data points in each batch as shown in Figure 4a. In this paper, we propose to use a GPU to perform multiple FFTs simultaneously across a plane of data points, as presented in Figure 4b. The high capacity for parallel computation is expected to significantly accelerate the cross-correlation of the SDR.

FIGURE 4 A comparison between (a) serial fast Fourier transform (FFT) processing (on CPU) and (b) parallel FFT processing (on GPU)

3.1 | Mapping the data points to GPU grid

Figure 5 presents the structure used to map the data points to the GPU grid. An SDRadar system with inputs from $N$ channels at sampling rate $S_r$ over integration time $T_i$ has a data size of $N \times S_r \times T_i$. Within each channel, data points are segmented equally into $L$ batches with a batch length of $L_B$. Batches from each channel are combined and downloaded into GPU memory for cross-correlation. In the CUDA framework, a given sequence of instructions is called a kernel. Each kernel controls a group of blocks which process the data in parallel.
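The segmentation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `GridPlan`, `plan_grid` and `map_to_blocks` names are hypothetical, the channel-major block ordering (block index = channel × L + batch) is an assumption rather than the paper's layout, and the block count follows the relation $N_B = N \times L$ used in the text. The numeric example uses the four-channel, 20 MHz, 100-batch configuration reported later in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridPlan:
    num_blocks: int       # N_B = N * L
    batch_length: int     # L_B, samples per block
    processing_rate: int  # N_B * L_B, samples per integration period


def plan_grid(num_channels, sampling_rate_hz, integration_time_s, batches_per_channel):
    """Derive block count and batch length from the acquisition parameters."""
    samples_per_channel = round(sampling_rate_hz * integration_time_s)
    if samples_per_channel % batches_per_channel:
        raise ValueError("samples per channel must divide evenly into batches")
    batch_length = samples_per_channel // batches_per_channel
    num_blocks = num_channels * batches_per_channel
    return GridPlan(num_blocks, batch_length, num_blocks * batch_length)


def map_to_blocks(channels, batches_per_channel):
    """Flatten per-channel samples into blocks; block index = channel * L + batch."""
    blocks = []
    for samples in channels:
        length = len(samples) // batches_per_channel
        for batch in range(batches_per_channel):
            blocks.append(samples[batch * length:(batch + 1) * length])
    return blocks


# Four channels at 20 MHz, 1 s integration time, 100 batches per channel.
plan = plan_grid(num_channels=4, sampling_rate_hz=20_000_000,
                 integration_time_s=1.0, batches_per_channel=100)

# Toy data: two channels of six samples, three batches each.
toy_blocks = map_to_blocks([[0, 1, 2, 3, 4, 5], [10, 11, 12, 13, 14, 15]],
                           batches_per_channel=3)
```

With an integration time of 1 s, `plan.processing_rate` equals the number of samples that must be handled per second, which is the quantity compared against the sampling rate throughout the paper.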
The block index (red text in Figure 5) indicates the number of parallel blocks $N_B$, where $N_B = N \times L$. For simplicity, we use the processing rate $P_r$ to measure the actual amount of data processed by the system. The number of blocks, the overall data size downloaded to the GPU and the processing/sampling rate can be expressed as:

$$N \times S_r \times T_i = N_B \times L_B = P_r \qquad (5)$$

where the left part represents the processed data points in CPU memory and the middle part represents the data points in GPU memory. The processing rate can also be calculated as the product $N_B \times L_B$. Note that the values of the parameters in Equation (5) are not constant. They vary depending on the SDR device and the maximum throughput of DAQ.

FIGURE 5 Mapping data points on the GPU

3.2 | System integration

Figure 6 schematically displays the system integration of the GPU-accelerated cross-correlation processing for our SDRadar system. There are three major threads: raw data DAQ (Thread1, T1), multi-channel cross-correlation by the GPU (Thread2, T2) and spectrogram generation (Thread3, T3). The solid arrows describe the main data stream, and the red arrows indicate the data flow inside the GPU.

In T1, the raw RF signal is acquired by the SDR device and transferred into the host computer's memory through the Ethernet/PCIe port (10 GHz). Afterwards, there is an initial preparation step that sorts the data into a 2D matrix. Then, it can be saved to the hard drive for off-line processing or downloaded into the GPU memory through the PCIe 3.0 x16 interface for subsequent online processing. In T2, each cross-correlation is processed in parallel by the GPU accelerator, and the resulting data are uploaded from GPU memory back to host memory at the end. In T3, the spectrogram is generated by the CPU from the data loaded in the host memory, either for graphical user display or for storage on the hard drive for further processing, for example, activity recognition, identification, etc.

FIGURE 6 Flow chart of GPU-accelerated raw Radio Frequency (RF) data processing

FIGURE 7 Cross-correlation on the GPU

Recalling Equation (5) and taking an integration time $T_i$ of 1 second as an example, the total number of received data points is $N \times S_r$. The cross-correlation processing on the GPU is described in detail below:

1. Reshape the received vector data (complex values) into parallel blocks $N_B \times L_B$. This process is done by the CPU.
2. Download the batch data from CPU to GPU memory.
3. A complex conjugate is implemented as a GPU function and applied to the received signal, of size $(N_B - L) \times L_B$, in the time domain by changing the sign of the imaginary parts of the complex numbers. The transmitted signal is not conjugated.
4. FFT: a forward FFT is performed on all the batches $N_B \times L_B$.
5. The transmitted signal $L \times L_B$ and every $L \times L_B$ received signal, both complex-valued in the Fourier domain, are multiplied. This gives a size of $(N_B - L) \times L_B$.
6. IFFT: an inverse fast Fourier transform is performed on all the batches $(N_B - L) \times L_B$.
7. Upload the processed data from GPU to CPU memory.
8. Keep only the first 30 samples within each batch, giving $(N_B - L) \times 30$, because the rest of the batch length $L_B$ is redundant in terms of the sensing distance.
9. The final step is to do another short FFT across all the batches on the CPU. This calculates the Doppler shift from the time delay.
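The nine steps above can be traced end to end with a small CPU-only Python sketch. It is a simplified illustration, not the LabVIEW/CUDA implementation: it handles one reference and one surveillance channel, the GPU download and upload steps become no-ops, a naive DFT replaces the GPU FFT, and the conjugate is applied to the reference spectrum as in Equations (1) to (4), which is an assumption that differs from the wording of step 3. All function names and the synthetic target (delay of 2 samples, Doppler bin 1) are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) DFT; a stand-in for the GPU FFT used in the paper."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def to_batches(signal, num_batches):
    """Step 1: reshape a long vector into num_batches rows of equal length."""
    length = len(signal) // num_batches
    return [signal[i * length:(i + 1) * length] for i in range(num_batches)]


def range_doppler(reference, surveillance, num_batches, keep_delays):
    """Return surface[doppler_bin][delay_bin] following the nine steps."""
    ref_rows = to_batches(reference, num_batches)      # step 1
    sur_rows = to_batches(surveillance, num_batches)
    # Step 2 (download to GPU memory) has no counterpart in this CPU-only sketch.
    delay_rows = []
    for ref, sur in zip(ref_rows, sur_rows):
        ref_f, sur_f = dft(ref), dft(sur)               # step 4: forward FFT
        product = [r.conjugate() * s for r, s in zip(ref_f, sur_f)]  # steps 3 and 5
        corr = dft(product, inverse=True)               # step 6: inverse FFT
        # Step 7 (upload to CPU memory) is likewise a no-op here.
        delay_rows.append(corr[:keep_delays])           # step 8: keep the first delay bins
    # Step 9: FFT across batches, one transform per delay bin.
    columns = [dft([row[d] for row in delay_rows]) for d in range(keep_delays)]
    return [[columns[d][k] for d in range(keep_delays)] for k in range(num_batches)]


# Hypothetical target: delay of 2 samples and Doppler bin 1, using 4 batches of 8 samples.
BATCHES, LENGTH, DELAY, DOPPLER_BIN, KEEP = 4, 8, 2, 1, 4
chirp = [cmath.exp(-1j * cmath.pi * m * m / LENGTH) for m in range(LENGTH)]
reference = chirp * BATCHES
surveillance = [
    chirp[(m - DELAY) % LENGTH] * cmath.exp(2j * cmath.pi * DOPPLER_BIN * i / BATCHES)
    for i in range(BATCHES)
    for m in range(LENGTH)
]

surface = range_doppler(reference, surveillance, BATCHES, keep_delays=KEEP)
peak = max(
    ((k, d) for k in range(BATCHES) for d in range(KEEP)),
    key=lambda kd: abs(surface[kd[0]][kd[1]]),
)  # (doppler_bin, delay_bin) of the strongest response
```

Truncating each correlation row before the Doppler transform (step 8) keeps the final FFT short, which is why that step can stay on the CPU.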
The implementation of cross-correlation on the GPU is shown in Figure 7.

Finally, to ensure different SDR devices work in a similar fashion, a 'two-process' design has been used, where the DAQ and signal processing run separately as shown in Figure 8. A queue links the DAQ and signal processing and follows First-In-First-Out order. Queue length control ensures that the queue does not grow too long to be processed: if the queue length exceeds a pre-defined value, the queue is erased. This design minimises the interference of data overflow in DAQ with the real-time signal processing.

FIGURE 8 Two-process design for SDRadar system

3.3 | Performance comparison on CPU and GPU

Here we compare the processing time of cross-correlation on the CPU and the GPU. The host computer is equipped with an AMD Ryzen 3600 @ 3.59 GHz CPU and an NVIDIA GeForce RTX 2060 GPU with 6 GB of graphics memory. Randomly generated data with a data type of CDB (16 bytes/element) was used in this comparison. For each test, both processing methods were run 100 times and the processing time was averaged. To match the DAQ flow from SDR devices, we simulate 1 second of data but with varied sampling rates, that is, different numbers of data points. Due to the specification of the SDR devices, we consider a maximum sampling rate of 50 MHz and up to 16 channels.

Figure 9 shows the performance comparison between the CPU-based and GPU-based cross-correlation processing for various data sizes. It can be seen from Figures 9a and b that the processing time of both the CPU and the GPU increases in proportion to the number of data points, irrespective of whether the block number or the batch length is increased. However, the GPU-based method is much faster than the CPU-based processing for any value of block number or batch length, even though the CPU-based processing has been optimised with matrix multiplication. Figures 9c and d show the ratio of the processing time of the CPU-based processing over the GPU-based processing for each test. The acceleration of the GPU-based method over the CPU-based method ranged from 2 to 5 times. In addition, both Figures 9c and d clearly
demonstrate that this acceleration ratio increases with batch length and block number. The comparison between Figures 9c and d demonstrates that the GPU accelerates the cross-correlation based on two factors: (1) faster FFT processing over longer data in each block; (2) more efficient parallel processing due to the highly parallel core hierarchy in the GPU.

FIGURE 9 Performance comparisons between CPU and GPU based cross-correlation processing on various sizes of batch length and block number. (a) Processing time comparison on 100 blocks with varying batch length. (b) Processing time comparison on varying block numbers with a fixed batch length of 20 k. (c) and (d) GPU-CPU processing acceleration ratio and processing rate corresponding to (a) and (b), respectively

The simulation is based on 1 second of data, which means latency will be induced when the processing time is longer than 1 second, making it unsuitable for real-time processing. Besides, the CPU also has another heavy task: DAQ from the SDR devices. From Figure 9b, it can be seen that the CPU needs more than 1 second of processing time in the tests with 200 k and 500 k batch lengths and with 800 and 1600 blocks, while the GPU takes less than 1 second in all tests. In addition, the overall acceleration ratio in this work is not as significant as in work [18], which ranged from 10 to 30 times faster. The reason is that we also simulated the process of downloading and uploading data to the GPU memory, which takes longer than the cross-correlation itself. This has a particular effect on processing time, especially with a large data size.

It is worth noting from Figure 9a that the processing of $10\mathrm{k}\,(L_B) \times 100\,(N_B)$ can be accelerated to 2 times faster than CPU processing. This is the minimum sampling rate (1 MHz) used in this work; the GPU performs 2 times faster than the CPU even though the processing time is short. In comparison, at $500\mathrm{k}\,(L_B) \times 100\,(N_B)$ (the maximum sampling rate), the GPU delivers an acceleration ratio of more than 4 times. From Figure 9b, GPU processing also has a much shorter processing time when dealing with multiple channels. The processing of $20\mathrm{k}\,(L_B) \times 50\,(N_B)$ has been accelerated to up to 3 times faster than CPU processing. The acceleration increases only slightly, to 3.7 times, at $20\mathrm{k}\,(L_B) \times 1600\,(N_B)$, but the processing time is still lower than 1 second. These results indicate an attractive performance delivered by the GPU accelerator, which makes high sampling rate processing feasible.

4 | SDRADAR APPLICATIONS

A typical radar system requires at least two channels: one transmitter and one receiver. Two USRP 2920s have been used since each of them has only 1 channel. More details about the two-USRP 2920 setup can be found in our previous paper [24]. In comparison, the USRP 2945 (4 channels) and DigitizerNetbox (16 channels) can each fully function as an SDRadar system. Additional channels can be employed for distributed sensing and angle-of-arrival detection.

4.1 | Advantages in sensitivity (USRP 2920)

One of the important measures for an SDRadar system is its ability to detect targets against the background clutter and noise. This is defined by multiple factors, for example, the signal strength, antenna beam pattern, etc. In terms of the range-Doppler surface, the sensitivity can be measured as the Peak Signal-to-Noise Ratio (PSNR), where the peak represents the pulse relating to the object and the noise represents the sidelobes. A high PSNR means the object can be easily identified.

Here we provide examples of range-Doppler surfaces based on WiFi signals. In this measurement, two USRP 2920s were used as receivers: one channel was connected directly to the WiFi router to measure a 'Reference' signal while the other channel was connected to an antenna to record corresponding
signal reflections from the environment. The antenna was oriented in the same direction, towards the WiFi router, in a stationary environment without any Doppler shifts. Both USRP 2920s were operated at 20 MHz to cover the full bandwidth of the WiFi signal in the 2.4 GHz channel. To demonstrate the sensitivity of the SDRadar system in different signal scenarios, three WiFi frame rates were used: 20 Hz, 100 Hz and 1 kHz. The integration time $T_i$ was set to 1 s and the block number was held constant at 100, which gives a maximum batch length of 20 M/100 = 200 k. Thus, the processing rate is (2 × 100) × 200k = 40 MHz, where 2 represents the two channels. The SDRadar system performed full CAF processing for various batch lengths.

Figure 10 presents the range-Doppler surface for three frame rates at different batch lengths. As expected, a frame rate of 1 kHz performs best compared to frame rates of 20 and 100 Hz. This is because the effective signal depends on the WiFi frame rate, as discussed in our previous paper [15]. However, the effective signal may not be captured if the batch length is not sufficient, even at a high frame rate. Consequently, improvements can be observed in the range-Doppler surfaces with increasing batch length and frame rate. In the worst case, a 20 Hz frame rate with a 10 k batch length makes it very difficult to identify the preferred peak. This situation improves when the batch length is increased to 200 k. The range-Doppler surface at a 100 Hz frame rate with a 200 k batch length performs almost as well as the same batch length at a 1 kHz frame rate. In comparison, all peaks at the 1 kHz frame rate can easily be distinguished.

FIGURE 10 Range-Doppler surface for a WiFi signal: rows 1, 2 and 3 illustrate frame rates of 20 Hz, 100 Hz and 1 kHz, respectively. Columns 1-7 are batch lengths of 10 k, 20 k, 30 k, 50 k, 100 k, 150 k and 200 k, respectively. The x-axis plots range and the y-axis plots Doppler

Figure 11 presents the PSNR versus batch length from 1 k to 200 k for the three frame rates. The red dashed line indicates the maximum CPU processing ability at 100 k; the GPU accelerator, however, can easily process more than 200 k. The highest PSNR at the 20 Hz frame rate reaches up to 17 dB, whereas the 100 Hz and 1 kHz frame rates can both reach more than 27 dB. The PSNR values of the 100 Hz and 1 kHz frame rates converge after a batch length of 160 k. This indicates that a WiFi signal at 100 Hz can deliver high performance when sufficient data points have been processed, which can also be observed in Figure 10. In addition, at the 1 kHz frame rate there is only a small improvement after a batch length of 80 k, so processing additional data points brings no extra benefit in PSNR. Thus, the trade-off between the amount of processing and the PSNR threshold needs to be identified depending on the frame rate.

4.2 | Distributed channels (USRP 2945)

Distributed channels, which measure the object from different aspect angles, can generate multiple range-Doppler surfaces simultaneously. This can bring spatial diversity in Doppler information and deliver higher recognition accuracy [25]. However, there are many challenges for such systems, for example, the clock/time synchronisation among different channels and the much higher sampling rate compared with a single-channel SDR.

In this measurement, a USRP 2945 was used as the receiver with a total of 4 channels. Among them, one channel was used to recreate the transmitted signal, and the other three channels
are for the surveillance. A participant was asked to walk around randomly within an area of 6 × 5 m, with pauses during the experiment, as shown in Figure 12. A WiFi router was aligned with the walking path and marked as 0°. Afterwards, three antennas (one for each channel) were set at different angles towards the walking path to demonstrate the Doppler signatures at different angles. The WiFi router was fixed at a 100 Hz frame rate, and the SDRadar system was set up with constant parameters of 100 blocks and a 20 MHz sampling rate. The full size of the processed data points is (4 × 100) × 200k = 80 M.

FIGURE 11 PSNR versus batch length

FIGURE 12 Experiment setup: distributed channel (6 m × 5 m area; Node 1 Tx/Rx, Nodes 2 to 4 Rx; labelled distances 4 m, 3 m and 0.5 m)

Figure 13 shows a 30 s Doppler spectrogram for walking, captured at different angles. Because the USRP 2945 has three surveillance channels, we measured the walking activity three times. It can be seen that the real-time Doppler record for each period of walking can be distinguished from the others. Doppler signatures vary in each node due to the variations in monitoring angle. This difference provided by spatial diversity is very important for improving the accuracy of many machine learning tasks such as activity recognition, localisation and people counting. Additionally, unlike the CSI-based systems [22], the phase noise of our SDRadar exhibits better stability and can more easily be extracted to provide an indication of target direction, where positive pulses represent the person moving towards the antenna and negative pulses represent the person moving away from the antenna. Moreover, we can also observe the micro-Doppler caused by limb movement, which indicates that the system is very sensitive even to small movements.

Afterwards, we record the processing time within each step to demonstrate the actual time spent by the GPU and CPU during real-time processing. Three processing rate tests were conducted: CPU-only at 20 MHz, and GPU at both 20 and 80 MHz. Based on observations, CPU at 20 MHz and GPU at 80 MHz are the maximum processing rates that can be operated without time latency on the current hardware. We ran 100 s of processing for each test and recorded the average processing time.

Table 3 summarises the three tests with the processing time for each step. As can be seen, the CPU at 20 MHz reaches 1329.3 ms in overall time, which is close to that of the GPU at 80 MHz at 1389.5 ms. Also, the GPU at 20 MHz has an overall time of 654.4 ms, which is only half of the CPU-only processing time. These results indicate that our GPU accelerator can improve the processing rate by up to four times compared with the CPU-only SDRadar system. The reshape process converts the long vector data into a 2D matrix, and its time depends on the size of the data. The longest processing times for the CPU-only system are the FFT and IFFT, due to the serial processing. In comparison, the GPU has much lower processing times for the FFT and IFFT at both 20 and 80 MHz. There are additional download and upload processes for the GPU to transfer data to and from CPU memory. However, the GPU is still faster than the CPU-only processing even when including these extra steps.
FIGURE 13 Doppler spectrogram captured by three distributed receivers at different angles (random direction walking)
TABLE 3 CPU & GPU real processing time comparison

Device                            USRP 2945 (CPU)   USRP 2945 (GPU)   USRP 2945 (GPU)
Channels                          4                 4                 4
Sampling rate (Sr) per channel    5 MHz             5 MHz             20 MHz
Processing rate (Pr)              20 MHz            20 MHz            80 MHz
Block number (NB)                 400               400               400
Batch length (LB)                 50k               50k               200k

Processing time (ms) for 1 s of data
Reshape                           287.5             258.6             567.5
Download to GPU                   0.0               74.2              125.7
FFT                               467.2             54.6              152.8
IFFT                              380.8             54.9              148.7
Upload to CPU                     0.0               40.7              129.8
Others                            193.8             171.4             265.0
Overall                           1329.3            654.4             1389.5

Abbreviations: FFT, fast Fourier transform; IFFT, inverse fast Fourier transform.
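The arithmetic behind Table 3 can be checked with a short script. The following is a minimal sketch rather than part of the authors' software: the dictionary layout and the function names (`overall_ms`, `points_per_second`, `throughput_gain`, `step_speedup`) are illustrative assumptions, and "throughput gain" is defined here simply as data points handled per millisecond of processing, relative to the CPU‐only test. All numbers are copied from Table 3.

```python
# Per-step processing times in ms for 1 s of data, copied from Table 3.
STEPS = ("Reshape", "Download to GPU", "FFT", "IFFT", "Upload to CPU", "Others")

# name: (block number NB, batch length LB, per-step times in ms)
TABLE_3 = {
    "CPU @ 20 MHz": (400, 50_000, (287.5, 0.0, 467.2, 380.8, 0.0, 193.8)),
    "GPU @ 20 MHz": (400, 50_000, (258.6, 74.2, 54.6, 54.9, 40.7, 171.4)),
    "GPU @ 80 MHz": (400, 200_000, (567.5, 125.7, 152.8, 148.7, 129.8, 265.0)),
}


def overall_ms(name):
    """Sum of the per-step times; should reproduce the 'Overall' row."""
    return sum(TABLE_3[name][2])


def points_per_second(name):
    """Data points per second of data: block number times batch length."""
    blocks, batch, _ = TABLE_3[name]
    return blocks * batch


def throughput_gain(name, baseline="CPU @ 20 MHz"):
    """Points handled per ms of processing, relative to the baseline test."""
    def rate(test):
        return points_per_second(test) / overall_ms(test)
    return rate(name) / rate(baseline)


def step_speedup(step, fast, slow="CPU @ 20 MHz"):
    """Ratio of the slow test's time to the fast test's time for one step."""
    index = STEPS.index(step)
    return TABLE_3[slow][2][index] / TABLE_3[fast][2][index]


if __name__ == "__main__":
    for name in TABLE_3:
        print(f"{name}: {points_per_second(name) / 1e6:.0f} M points, "
              f"{overall_ms(name):.1f} ms overall, "
              f"gain x{throughput_gain(name):.2f}")
    print(f"FFT speedup at 20 MHz: x{step_speedup('FFT', 'GPU @ 20 MHz'):.1f}")
```

Under this definition the GPU at 80 MHz comes out at roughly 3.8 times the CPU‐only throughput, which is consistent with the "up to four times" figure quoted in the text, and block number times batch length reproduces the 20 M and 80 M points per second implied by the processing rates.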
4.3 | Angle finding (DigitizerNetbox)

The 16‐channel DigitizerNetbox was originally designed as a waveform spectrum analyser, whereas we use it as a MIMO‐SDR. We integrated a phased array antenna with the DigitizerNetbox and performed an Angle‐of‐Arrival (AoA) analysis over the measurements from all 16 channels. Let the RF signal received at the ith channel be $s_i(t)$; the sampled data from the DigitizerNetbox can then be written as $S(t) = (s_1[t]\ s_2[t]\ \ldots\ s_i[t])$, $i \le 16$. For a sampling period T, the AoA is calculated as the sum of the signal amplitude from the signal source as

$$\theta(t) = \mathrm{FFT}\left(\sum_{t=0}^{T-1} S(t)\right).$$

Since there is only a single FFT process, we slightly modified the architecture shown in Figure 7 by removing the multiplication and the second FFT process. Here, the AoA analysis aims to find the direction of the incoming signals from the signal source.

Because the Ethernet cable creates a data flow bottleneck, we reduced the DAQ rate to 3 iterations per second (96 M data points received every second). The sampling rate was set at 80 MHz for all 16 channels. We ran the system with and without the GPU accelerator to show the difference in angular detection. To ensure the system remained in real‐time processing, the CPU‐only system was run at a block number of 0.4 k and a batch length of 20 k (8 M data points processed every iteration), while the GPU accelerator was run at a block number of 1.6 k and a batch length of 20 k (32 M data points processed every iteration). According to the angle resolution $\Delta\theta = \lambda / D$, where $\lambda$ is the wavelength and D is the size of the antenna, the 16‐channel system is expected to provide 4 times higher resolution than the 4‐channel system.

In this measurement, a WiFi access point (AP) was located at 50° towards the phased array antenna at a distance of 3 m. Figure 14 presents an AoA plot for a signal source using data from the 4‐channel system processed by the CPU (as the baseline) and data from the 16‐channel system processed by the GPU. As expected, the angular resolution is greatly improved by the 16‐channel system, with a clear peak at 50°, whereas the 4‐channel system gives a much coarser estimate. This indicates that the performance gain from using a GPU accelerator can significantly improve the angular resolution at no additional cost on the computing unit.

5 | CONCLUSIONS

This paper presents a high‐speed design for an SDRadar system that uses a GPU accelerator to speed up the cross‐correlation process. The idea is that the CPU handles raw data acquisition (DAQ) and post‐processing, which are non‐parallel threads, while the GPU handles the parallel threads involving the FFT process. The proposed GPU accelerator demonstrates high flexibility and extensibility, and is able to work with three different SDR devices. Experimental results show that the proposed GPU accelerator can speed up the system by up to four times compared with the CPU‐only system. There are significant improvements in PSNR (Figure 10) and angular resolution (Figure 14) from using the GPU accelerator to process more samples.

Future work will focus on a joint FPGA and GPU implementation for ultra‐speed SDRadar systems. This could free up more computational power on the CPU and reduce the sampling rate on the computer side, as some processing can be performed by the FPGA. It is envisioned that such systems could be a solution for many industry‐based radar applications and can be compatible with AI systems for mission‐critical applications that require very low latency, such as autonomous vehicles and manufacturing operations.
FIGURE 14 Performance gain: angle‐of‐arrival plot for a signal source measured at 50° to the antenna, (a) data from 4‐channel and processed by CPU and (b) data from 16‐channel and processed by GPU
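The angle‐of‐arrival step of Section 4.3 (sum the channel vector over time, then take a single transform across channels) can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' GPU implementation: it assumes a uniform linear array with half‐wavelength spacing, ideal noise‐free complex baseband snapshots, and an angle measured from broadside, none of which are specified in the text. It also uses a direct zero‐padded DFT in plain Python in place of an FFT library call. All function names are hypothetical.

```python
import cmath
import math


def plane_wave_snapshot(num_channels, angle_deg, spacing=0.5):
    """Ideal snapshot across a uniform linear array (spacing in wavelengths)."""
    step = 2.0 * math.pi * spacing * math.sin(math.radians(angle_deg))
    return [cmath.exp(1j * step * i) for i in range(num_channels)]


def aoa_spectrum(snapshots, spacing=0.5, num_bins=512):
    """Sum snapshots over time, then apply a zero-padded DFT across channels."""
    num_channels = len(snapshots[0])
    # Time sum per channel, i.e. the sum over t of S(t).
    summed = [sum(snap[i] for snap in snapshots) for i in range(num_channels)]
    angles, magnitude = [], []
    for k in range(num_bins):
        freq = k / num_bins - 0.5            # spatial frequency, cycles/element
        sin_theta = freq / spacing
        if abs(sin_theta) > 1.0:
            continue                         # outside the visible region
        response = sum(summed[i] * cmath.exp(-2j * math.pi * freq * i)
                       for i in range(num_channels))
        angles.append(math.degrees(math.asin(sin_theta)))
        magnitude.append(abs(response))
    return angles, magnitude


def peak_angle(angles, magnitude):
    """Angle (degrees) of the strongest bin."""
    return angles[magnitude.index(max(magnitude))]


def half_power_width(angles, magnitude):
    """Angular span (degrees) of the bins within 3 dB of the peak."""
    threshold = max(magnitude) / math.sqrt(2.0)
    inside = [a for a, m in zip(angles, magnitude) if m >= threshold]
    return max(inside) - min(inside)


if __name__ == "__main__":
    for channels in (4, 16):
        # 100 identical snapshots stand in for T samples of a static source.
        snaps = [plane_wave_snapshot(channels, 50.0)] * 100
        angles, magnitude = aoa_spectrum(snaps)
        print(f"{channels:2d} channels: peak {peak_angle(angles, magnitude):.1f} deg, "
              f"3 dB width {half_power_width(angles, magnitude):.1f} deg")
```

With a source placed at 50°, both array sizes peak near 50°, but the 16‐channel main lobe is several times narrower than the 4‐channel one. This is the behaviour expected from $\Delta\theta = \lambda / D$ when the aperture D grows with the number of channels.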
ACKNOWLEDGEMENT
This work is part of the OPERA project funded by the UK Engineering and Physical Sciences Research Council (EPSRC), Grant No: EP/R018677/1.

CONFLICT OF INTEREST
All authors listed in this paper declare that they have no conflicts of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.

ORCID
Wenda Li https://orcid.org/0000-0001-6617-9136
Shelly Vishwakarma https://orcid.org/0000-0003-1035-3259

REFERENCES
1. Debatty, T.: Software defined radar a state of the art. In: 2010 2nd International Workshop on Cognitive Information Processing, pp. 253–257. IEEE (2010)
2. Jondral, F.K.: Software‐defined radio—basics and evolution to cognitive radio. EURASIP J. Wirel. Commun. Netw. 3, 652784 (2005)
3. Ni usrp 2920. [Online]. https://www.ni.com/en‐gb/shop/hardware/products/usrp‐software‐defined‐radio‐device.html
4. Digitizernetbox. [Online]. https://spectrum‐instrumentation.com/en/digitizernetbox
5. Anderson, C.R., et al.: Analysis and implementation of a time‐interleaved adc array for a software‐defined uwb receiver. IEEE Trans. Veh. Technol. 58(8), 4046–4063 (2009)
6. Nurvitadhi, E., et al.: Accelerating binarized neural networks: comparison of fpga, cpu, gpu, and asic. In: 2016 International Conference on Field‐Programmable Technology (FPT), pp. 77–84. IEEE (2016)
7. Fowers, J., et al.: A performance and energy comparison of convolution on gpus, fpgas, and multicore processors. ACM Trans. Archit. Code Optim. 9(4), 1–21 (2013)
8. Xilinx: [Online]. UltraScale Architecture and product Data Sheet: Overview
9. Moser, D., et al.: Design and evaluation of a low‐cost passive radar receiver based on iot hardware. In: 2019 IEEE Radar Conference (RadarConf), pp. 1–6. IEEE (2019)
10. Fasih, A., Hartley, T.: Gpu‐accelerated synthetic aperture radar backprojection in cuda. In: 2010 IEEE Radar Conference, pp. 1408–1413. IEEE (2010)
11. Gembris, D., et al.: Correlation analysis on gpu systems using nvidia’s cuda. J. Real. Time. Image. Process. 6(4), 275–280 (2011)
12. Garcia‐Rial, F., Ubeda‐Medina, L., Grajal, J.: Real‐time gpu‐based image processing for a 3‐d thz radar. IEEE Trans. Parallel Distr. Syst. 28(10), 2953–2964 (2017)
13. Seo, J., et al.: A real‐time capable software‐defined receiver using gpu for adaptive anti‐jam gps sensors. Sensors. 11(9), 8966–8991 (2011)
14. Xu, H., Cui, X., Lu, M.: An sdr‐based real‐time testbed for gnss adaptive array anti‐jamming algorithms accelerated by gpu. Sensors. 16(3), 356 (2016)
15. Li, W., et al.: Passive wifi radar for human sensing using a stand‐alone access point. IEEE Trans. Geosci. Rem. Sens. (2020)
16. Ni usrp 2945. [Online]. https://www.ni.com/en‐gb/support/model.usrp‐2945.html
17. Zhang, C., Yang, Q., Deng, W.: High frequency radar signal processing based on the parallel technique, 2015
18. Li, J., Xiao, Y.: Gpu accelerated parallel fft processing for Fourier transform hyperspectral imaging. Appl. Opt. 54(13), D91–D98 (2015)
19. Chetty, K., Smith, G.E., Woodbridge, K.: Through‐the‐wall sensing of personnel using passive bistatic wifi radar at standoff distances. IEEE Trans. Geosci. Rem. Sens. 50(4), 1218–1226 (2011)
20. Yoo, J.‐C., Han, T.H.: Fast normalized cross‐correlation. Circ. Syst. Signal Process. 28(6), 819–843 (2009)
21. Li, W., et al.: Physical activity sensing via stand‐alone wifi device. In: 2019 IEEE Global Communications Conference (GLOBECOM), pp. 1–6. IEEE (2019)
22. Cross‐correlation theorem. [Online]. https://mathworld.wolfram.com/Cross‐CorrelationTheorem.html
23. Cuda. [Online]. NVIDIA CUDA C Programming Guide
24. Li, W., Tan, B., Piechocki, R.J.: Wifi‐based passive sensing system for human presence and activity event classification. IET Wirel. Sens. Syst. 8(6), 276–283 (2018)
25. Fioranelli, F., et al.: Feature diversity for optimized human micro‐Doppler classification using multistatic radar. IEEE Trans. Aero. Electron. Syst. 53(2), 640–654 (2017)

How to cite this article: Li, W., et al.: Design of high‐speed software defined radar with GPU accelerator. IET Radar Sonar Navig. 16(7), 1083–1094 (2022). https://doi.org/10.1049/rsn2.12244
