Modelo 5G Ok
Henry Räbinä
Supervisor
Advisor
Abstract
This master’s thesis assesses the problems related to the use of field-programmable
gate arrays (FPGAs) in team-based embedded system development, in the
context of 5G layer 1 (5G L1). Due to the high potential for performance and
efficiency improvements, considerable effort has recently been put into integrating
FPGAs into traditional software development. Several tools have been developed
for this purpose, but they all share the common issue of adding overhead to the
program, which cannot be tolerated in 5G L1 due to strict real-time requirements.
Due to the overhead issues, the interface between the software and the
FPGA hardware in 5G L1 is implemented as a direct memory mapping.
However, the FPGA designs include multiple design variants, each containing
parts that are common to every variant, and parts that are specific to only
one variant. Because developing the memory-mapped interface for each
separate variant would require excessive software branching, the concept of
a unified programming model (UPM) was developed to reduce the need for
branching. The UPM requires the common interfaces of each variant to be the same.
Tiivistelmä
This master’s thesis examines the use of field-programmable gate arrays (FPGAs),
and the problems related to them, in team-based embedded systems development
in the context of layer 1 of 5G networks. FPGAs offer many opportunities for
improving the performance and energy efficiency of applications, which is why their
integration into traditional software development has recently been studied
extensively. Several tools have been developed for this purpose, but they are not
suitable for use in 5G development, because the additional latency they introduce
is too large given the real-time requirements of 5G.
Preface
This thesis was written for the Master’s degree in the Micro- and Nanoelectronic circuit
design major at Aalto University. The research work of this thesis was conducted at
Nokia Solutions and Networks (NSN).
Otaniemi, 24.7.2019
Henry Räbinä
Contents
Abstract
Preface
Contents
Abbreviations
1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Goals and contribution
  1.4 Structure
2 Theory
  2.1 Cellular networks
  2.2 5G technology
    2.2.1 Overall architecture and protocol stack
    2.2.2 5G layer 1
  2.3 FPGAs and SoCs
    2.3.1 Operation of FPGAs
    2.3.2 Hardware design using FPGAs
  2.4 Co-operation between hardware and software
  2.5 Software development for reconfigurable hardware
  2.6 General guidelines for team-based FPGA development
3 Solution development
  3.1 Current FPGA design workflow
  3.2 Proposed improvements and motivation
    3.2.1 Improvements to the current workflow
    3.2.2 The new approach
References
Abbreviations
1G 1st Generation
2G 2nd Generation
3G 3rd Generation
3GPP 3rd generation partnership project
4G 4th Generation
5G 5th Generation
5GC 5G core network
ALM Adaptive logic module
AMF Access and mobility management function
API Application programming interface
ARM Advanced reduced instruction set computer machine
ASIC Application specific integrated circuit
AXI Advanced eXtensible interface
CLB Configurable logic block
CP Control plane
CPU Central processing unit
DL Downlink
DMA Direct memory access
DSP Digital signal processing
EDA Electronic design automation
FDD Frequency-division duplexing
FIFO First-in first-out
FPGA Field-programmable gate array
gNB Next generation Node B
GPU Graphics processing unit
HDL Hardware description language
HLS High-level synthesis
HPS Hard processor system
HW Hardware
IC Integrated circuit
IEEE Institute of Electrical and Electronics Engineers
IMT International mobile telecommunications
I/O Input/output
IoT Internet of things
IP Intellectual property
ITU-R International telecommunication union radiocommunication sector
JSON JavaScript object notation
JTAG Joint Test Action Group
L1 Layer 1
LTE Long term evolution
LUT Lookup table
MAC Medium access control protocol
MM Memory-mapped
Because the requirements for 5G networks are strict, many parts of the 5G technology
are desired to be implemented in hardware (HW) instead of software (SW). Especially
the strict latency requirements support the need for HW, because the execution time
of computation is lower and significantly more stable on HW than on SW. In addition,
because the specifications for 5G networks are not complete yet, it is essential that the
features of 5G can be fluently modified after the initial implementation. Therefore,
reconfigurable HW is a good choice for 5G implementation, offering a compromise
between efficiency and flexibility. FPGAs are an example of reconfigurable HW that
could be used in this application.
• Includes only a simple user interface, or does not have a user interface at all.
1.2 Motivation
This thesis was conducted for Nokia Solutions and Networks (referred to as Nokia in
the rest of this thesis) during the first half of the year 2019. Nokia develops 5G HW
and SW using the configuration described previously in section 1.1, and uses FPGAs
as the reconfigurable HW for implementing 5G. The FPGA HW development at
Nokia is a broad process involving several HW design teams. In addition to
developing the required HW for 5G implementation, the HW teams must provide the SW
teams with a programming model (PM) that can be used to write SW for controlling
the 5G HW implemented on the FPGAs. The PM is essentially a description of the
registers of the FPGA design that are interfacing towards the 5G L1 SW. Thus, the
PM can be considered as an abstraction of the FPGA design that is seen by the
L1 SW. The SW teams can then control the FPGA HW via these interfacing registers.
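The idea of a PM as a register description can be illustrated with a rough sketch. Note that this is a hypothetical Python model for illustration only: all register names, offsets, access modes and the base address below are invented, and the real L1 SW accesses the registers through a direct memory mapping rather than a Python object.

```python
# Sketch: a programming model (PM) viewed as a description of the
# registers that an FPGA IP block exposes towards the L1 SW.
# All register names, offsets and the base address are invented
# for illustration; they are not Nokia's actual PM.

REGISTER_MAP = {
    "control": {"offset": 0x00, "access": "rw"},  # e.g. start/stop bits
    "status":  {"offset": 0x04, "access": "ro"},  # e.g. busy/done flags
    "data_in": {"offset": 0x08, "access": "rw"},
}

class MappedIp:
    """Minimal model of SW controlling an IP block via memory-mapped registers."""

    def __init__(self, base_addr, register_map):
        self.base = base_addr
        self.map = register_map
        self.mem = {}  # stands in for the physical register file

    def write(self, name, value):
        reg = self.map[name]
        if "w" not in reg["access"]:
            raise PermissionError(f"{name} is read-only")
        self.mem[self.base + reg["offset"]] = value

    def read(self, name):
        return self.mem.get(self.base + self.map[name]["offset"], 0)

ip = MappedIp(0xA0000000, REGISTER_MAP)
ip.write("control", 1)     # SW starts the IP by writing a register
print(ip.read("control"))  # 1
```

In this view, the UPM requirement simply means that the common IP blocks of every design variant must expose the same register-map description to the SW teams.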
The UPM is currently in use and it works from the SW viewpoint. However,
from the HW viewpoint, the process of creating the UPM causes problems. The
main issue is that to keep the register banks of the common IPs of each variant
similar, much synchronization between the different teams in the HW development
process is required. In addition, there are certain phases in the development process
that require unnecessary manual work, which is time-consuming for the busy HW
teams. Thus, it is desirable to improve the FPGA development process such
that creating and maintaining the UPM becomes simpler.
1. The UPM is always synchronized with the FPGA design variants. This is the
primary goal because errors in the UPM translate directly into problems in
the L1 SW development.
2. The creation and maintenance of the UPM is as simple as possible for the HW
developers. This is the secondary goal and even though it is important, the
primary goal should not be compromised because of this goal.
This objective is further divided into two parts. In the first part, minor modifications
to the FPGA development workflow are implemented. These modifications consist
of developing new scripts for automating a few manual parts of the process. The
goal is that these improvements are taken into use quickly after they have been
verified to work properly. In the second part, a significantly modified approach for
the HW development is proposed as a proof-of-concept. It is not assumed that this
process would be taken into use in the near future, but the objective is to examine
the advantages and disadvantages of this process compared to the old process.
In addition to the practical work, the theoretical part of this thesis addresses the
topics of HW/SW co-operation, and SW development for reconfigurable HW. The
general requirements for interconnecting external HW to SW running on a host
CPU are examined, and two SW tools for integrating FPGAs into traditional SW
development are discussed in some detail. Also, the problems of these tools with
respect to the FPGA development at Nokia are studied.
1.4 Structure
In the second chapter of this thesis, relevant theory and background related to the
topic of the thesis are discussed. These include mobile networks, 5G technology,
FPGA architecture and operation, and SW development for reconfigurable HW.
The main purpose of this chapter is to give the reader the relevant background
information, and to set a proper context for the thesis topic. The third chapter first
discusses the old FPGA development process in detail, and afterwards explains the
proposed improvements to the process. The minor enhancements to the old process
and the major modification are discussed separately.
The fourth chapter of this thesis analyses the developed improvements, and
discusses their feasibility. First, measurement results related to the minor improvements
of the old process are presented and analyzed. Afterwards, the proof-of-concept
method is further discussed, including for example the pros and cons of the method.
In addition, a few alternative methods for improving the development process are
briefly presented. The fifth and final chapter of the thesis summarizes the results
of the research work, evaluates their quality, and provides ideas for further research
and development.
2 Theory
This chapter presents the basic theoretical concepts that are relevant for this thesis.
Several topics are addressed that provide the reader with the necessary background
information to set this thesis to the correct context. These topics include cellular
networks, 5G technology, FPGAs, system-on-chips (SoCs) and co-operation between
HW and SW. Specifically related to the last topic, the concept of developing SW for
reconfigurable HW is discussed, and a few tools related to this are presented. Finally,
the general recommendations for using FPGAs in embedded system development are
discussed.
Each base station in the cellular network transmits only at a limited power and covers a
limited area, which is called a cell. The base stations are geographically located such
that there are no uncovered spots in the coverage area. When a user is in a specific
cell, the user is connected to the corresponding base station. To avoid interference,
the cells that are near to each other use different frequencies. However, two cells
that are far enough from each other can use the same frequency because the cells do
not overlap and thus there will be no interference. This is a highly efficient method
to utilize the scarce frequency resources. [7] Figure 1 illustrates a cellular network,
and it also explains the concepts of cells and base stations. The cell labeled with
number 2 uses a different frequency than the two cells labeled with number 1.
Figure 1: An illustration of a cellular network. Modified based on [7].
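The frequency-reuse principle described above can be sketched as a simple constraint: neighbouring cells must use different frequencies, while distant cells may share one. The cell layout and frequency names below are invented for illustration.

```python
# Sketch of the frequency-reuse constraint: neighbouring cells must
# use different frequencies, while distant (non-neighbouring) cells
# may reuse the same one. The cell layout and frequency names are
# invented for illustration.

neighbours = {1: [2], 2: [1, 3], 3: [2]}  # cell adjacency
frequency = {1: "f1", 2: "f2", 3: "f1"}   # cells 1 and 3 reuse f1

def reuse_ok(neighbours, frequency):
    """True if no two neighbouring cells share a frequency."""
    return all(frequency[c] != frequency[n]
               for c, ns in neighbours.items() for n in ns)

print(reuse_ok(neighbours, frequency))  # True
```

This mirrors the situation in Figure 1: the cell labeled 2 differs from its neighbours labeled 1, while the two cells labeled 1 are far enough apart to reuse the same frequency.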
The technologies used in cellular networks in different eras are usually separated by
defining generations of mobile communication technologies. All of these generations
have provided a significant improvement in the network capability compared to the
previous generations. So far there have been four generations of mobile networks
that are commonly called 1G, 2G, 3G and 4G. 1G was the first generation of mobile
communications used in the 1980s, and it was used for analog radio signals. It
was later replaced by 2G that utilized digital radio signals in the network. The 3G
network was the first mobile network that could transfer broadband data, which was
an essential step in the breakthrough of smartphones. 4G technology improved the
data connections provided by 3G by making them faster, enabling larger bandwidth,
and reducing delays, among other improvements. [7, 9] To create and maintain
the technical specifications and technical reports for these mobile communication
standards, an organization known as the 3rd Generation Partnership Project (3GPP)
was founded in 1998. Currently, the scope of 3GPP covers 3G, 4G, and 5G networks.
[10]
The next generation of mobile networks is known as 5G. It is expected to answer the
challenge of continuously increasing consumption of mobile data by providing very
high speed and capacity, broad bandwidth, and at the same time near-zero latency.
In addition to being just an improvement to the 4G technology, 5G is assumed to
provide enough capacity for IoT. Therefore, new technologies and innovations are
required for implementing 5G networks that could fulfill these requirements. [3, 4]
2.2 5G technology
As mentioned in the previous section, 5G means the fifth generation of mobile
communication technology, and it is specified in the 3GPP standard. The first 5G
specifications were delivered by 3GPP in release 15, even though 5G requirements
were discussed already in release 14. [11] The radiocommunication sector of the
International Telecommunication Union (ITU-R) (an agency of the United Nations
To be able to support the above three usage scenarios and the continuous growth
of mobile data usage in the future, 5G technology must satisfy strict technical
requirements. The ITU-R has created the International mobile communications
(IMT)-2020 standard, the purpose of which is to provide international specifications
for 5G. This standard has defined eight key capabilities and their values for 5G,
which are shown in the following table: [3]
Table 1: The key capabilities defined for 5G in IMT-2020. [3]

Capability                   Value
Peak data rate               20 Gbps
User experienced data rate   0.1-1 Gbps
Latency                      1 ms over-the-air
Mobility                     500 km/h
Connection density           10^6 /km^2
Energy efficiency            100 times compared with IMT-Advanced
Spectrum efficiency          3-5 times compared with IMT-Advanced
Area traffic capacity        10 Mbit/s/m^2
The IMT-Advanced standard in the above table refers to the specifications for
4G networks provided by the ITU-R. Based on Table 1, it is evident that these
specifications cannot be satisfied with current generation mobile networks.
[Figure 2: The overall 5G architecture. The UE connects to the NG-RAN nodes (gNB/ng-eNB), which interconnect over the Xn interface and connect to the 5GC, which contains the AMF (over the NG-C interface), the SMF, and the UPF (over the NG-U interface) providing the connection to the internet.]
As can be seen from Figure 2, 5GC consists of the access and mobility management
function (AMF), the session management function (SMF) and the user plane
function (UPF). The AMF provides services related to the mobility of the user. These
include for example mobility management, connection management and reachability.
The SMF provides functions for session establishment and session management.
The combined AMF and SMF correspond to the mobility management entity that
was used in the LTE technology. The UPF concentrates on data connectivity and
providing fast access to the internet. The UPF combines the services provided by
serving gateway and packet gateway used in LTE. [16] The interface between a
RAN-node and the UPF is called the next generation user plane interface (NG-U),
and the interface between a RAN-node and the AMF is called the next generation
control plane interface (NG-C). The interface between two RAN-nodes is called the
Xn-interface, which can be divided into Xn user plane interface and Xn control plane
interface. [14]
The 5G NR architecture contains several protocols for controlling the data that
is sent and received by the connected devices. In 5G, there are separate protocol
stacks for the UP and for the CP. [14] The protocol stack for UP is shown in Figure
3, and the protocol stack for CP is shown in Figure 4. As can be seen from these
figures, there are a total of seven different protocols, which will be briefly introduced
in the following paragraphs.
[Figure 3: The user plane protocol stack between the UE and the gNB: SDAP, PDCP, RLC, MAC and PHY.]
[Figure 4: The control plane protocol stack between the UE, the gNB and the AMF: NAS, RRC, PDCP, RLC, MAC and PHY.]
The physical layer (PHY) is used to transfer data to the layers above it. This data
includes information on how the data is transferred over the radio interface, and
what characteristics the data has. However, the data provided by PHY does not
contain any information about what is actually transferred. The services provided
by PHY to the higher layers are called the transport channels. There are different
channels for both downlink (DL) and uplink (UL). [14]
The medium access control (MAC) protocol offers functions for data transfer and
radio resource allocation. The MAC protocol provides services known as the logical
channels for the higher layer, and it requires the transport channels from the physical
layer. The logical channels can be divided into two categories: control channels, which
are used only for transferring the CP information, and traffic channels, which are used
only for transferring the UP information. One of the main functions of the MAC
protocol is to provide a mapping between the logical channels and transport channels.
Other services provided by MAC include for example scheduling services, priority
handling services and multiplexing/demultiplexing services. [14, 17]
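The mapping between logical and transport channels can be pictured as a simple lookup, here for the downlink direction. This is a simplified sketch based on the general structure of the 3GPP NR channel mappings; in the actual specification BCCH maps to BCH for the master information block and to DL-SCH for other system information, while the table below keeps one target per channel for brevity.

```python
# Sketch: the MAC layer's mapping from downlink logical channels to
# transport channels, simplified from the 3GPP NR specifications.
# (In the specs BCCH maps to both BCH and DL-SCH depending on the
# system information carried; one target is kept here for brevity.)
DL_MAP = {
    "PCCH": "PCH",     # paging control
    "BCCH": "BCH",     # broadcast control (master information block)
    "CCCH": "DL-SCH",  # common control
    "DCCH": "DL-SCH",  # dedicated control
    "DTCH": "DL-SCH",  # dedicated traffic (user data)
}

def transport_channel(logical):
    """Return the transport channel that carries a given logical channel."""
    return DL_MAP[logical]

print(transport_channel("DTCH"))  # DL-SCH
```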
The functions provided by the radio link control (RLC) protocol are performed
by several RLC entities, which are located in the UEs and in the base stations. The
purpose of the RLC protocol is to manage the data transfer between the RLC entities
at the gNB nodes and the RLC entities at the UEs. The RLC entities at the gNB
nodes contain RLC service data units, and the entities at the UEs contain RLC
protocol data units. The RLC protocol also provides segmentation and reassembly
services. [14, 18]
In the packet data convergence protocol (PDCP), the PDCP entities are associated
either to UP or to CP. The PDCP provides services to the layers above it, which
are different in UP and CP. The services include transferring the user plane and
control plane data, integrity protection, header compression and ciphering. PDCP
requires the RLC channels that are provided to it by the RLC protocol. [19] The
service data adaptation protocol (SDAP) is the protocol that is above the PDCP in
the UP, and it is responsible for providing a mapping between a quality of service
flow and a data radio bearer, provided to it by the PDCP. The SDAP provides the
quality of service flows to the 5GC. SDAP is the highest-level protocol in the UP. [14]
The radio resource control protocol (RRC) is the protocol that is above the PDCP
in the CP. The main services that the RRC requires from the lower layers are
integrity protection, ciphering and robust delivery of sequential information. As
explained previously, these are provided by the PDCP. The services that the RRC
provides to the upper layers include for example broadcasting control information
and transferring dedicated signalling. [20] The non-access stratum (NAS) protocol is
the highest-level protocol in the CP. The main areas of service for NAS include for
example mobility management, authentication and security control. [14]
2.2.2 5G layer 1
The 5G layer 1 corresponds to the physical layer, described in the previous section.
This is the layer that is closest to the HW, and it provides information about the
data to be transferred to the upper layers. Both the DL and UL in L1 use a waveform
modulation scheme known as orthogonal frequency division multiplexing (OFDM)
with minor modifications. [14] In telecommunications, modulation means the process
of modifying the properties of some signal by mixing another signal with it.
Typically, the signal that is mixed with the original signal contains information to be
transmitted, and the original signal (often called carrier) is used for transferring the
information wirelessly. [21]
OFDM allows the simultaneous usage of several closely spaced frequency bands
(subcarriers) for data transfer such that the bands do not interfere with each other.
When several frequency bands are simultaneously used, there must usually be a
certain amount of space between the bands because when modulation is applied to a
band, it spreads out on the sides which causes different frequency bands to overlap.
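The reason the OFDM subcarriers can overlap without interfering is their orthogonality over one symbol period. As standard OFDM background (not specific to this thesis), with symbol duration T and subcarrier spacing Δf = 1/T, the inner product of two subcarriers k and l is

```latex
\int_0^T e^{j 2\pi k \Delta f t} \left( e^{j 2\pi l \Delta f t} \right)^{*} \mathrm{d}t
  = \int_0^T e^{j 2\pi (k - l) t / T} \, \mathrm{d}t
  = \begin{cases} T, & k = l \\ 0, & k \neq l, \end{cases}
```

so each subcarrier can be demodulated without interference from its neighbours, provided the spacing equals the inverse of the symbol duration.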
[Figure: The spectrum of overlapping OFDM subcarriers as a function of frequency.]
The physical layer downlink can roughly be divided into three channels: the physical
downlink shared channel (PDSCH), the physical downlink control channel (PDCCH)
and the physical broadcast channel (PBCH) [23, 14]. The purpose of PDSCH is
to transfer the downlink user data, UE specific higher layer information, system
information and paging. PDCCH carries downlink control information, and one
important application for it is to schedule DL transmissions on PDSCH and UL
transmissions on the equivalent uplink channel. Other applications for PDCCH
include for example giving notifications to the UEs, or switching the active band-
width part of a UE. PBCH is used to carry very basic system information. [14, 24, 25]
The physical layer uplink can also be roughly divided into three channels: the physical
uplink shared channel (PUSCH), the physical uplink control channel (PUCCH) and
the physical random access channel (PRACH) [23, 14]. The operation of PUSCH
can be compared to PDSCH, except that it transfers uplink data. PUCCH is used
to carry the uplink control information from the UE to the gNB. PRACH is used to
transfer a random-access preamble from the UE towards the gNB. The purpose of
this is to notify the gNB of a random-access attempt and to guide the gNB to adjust
several parameters of the UE. [14, 24, 25]
In 5G, all data transmission in both DL and UL is divided into frames, each being
10 ms in duration. A frame in telecommunication is a complete data unit that is
transmitted between two nodes in the network. A frame contains the data to be sent
combined with addressing and protocol information. Every frame is further divided
into 10 subframes, each being 1 ms in duration. A subframe can in turn be divided
into slots, which can differ in length depending on the subcarrier spacing. Due to the
nature of OFDM, the slot length decreases as the subcarrier spacing increases. [14, 25]
During one slot, the 5G physical layer must be able to process a varying number of
events, and because the duration of one slot may be quite short, the available time
to process one event can be extremely short. The highest subcarrier spacing used
in 5G for data channels is 120 kHz (for the synchronization signal that is used for
initially accessing the network, it is 240 kHz), and this corresponds to a slot duration
of 125 µs. If the number of events that the 5G L1 should be able to process were,
for example, 20, the average time for processing one event under these conditions
would be about 6 µs. In practice, some margin should be left to guarantee that the
latency requirement is filled, and thus the real available time for event processing
would be even less. This sets extremely strict real-time requirements for the 5G L1
implementation. [25] These requirements also motivate the usage of HW instead of
SW for the L1 implementation.
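The timing figures above follow directly from the NR numerology, where the slot duration halves each time the subcarrier spacing doubles from 15 kHz. A small sketch of the arithmetic (the 20-event count is the example figure used in the text, not a specified value):

```python
# Sketch of the 5G NR timing arithmetic. In the NR numerology, the
# subcarrier spacing is 15 kHz * 2**mu and the slot duration is
# 1 ms / 2**mu (standard NR background). The 20-event count below is
# only the example figure from the text.

def slot_duration_us(scs_khz):
    """Slot duration in microseconds for a given subcarrier spacing in kHz."""
    mu = (scs_khz // 15).bit_length() - 1  # 15 kHz -> mu=0, 120 kHz -> mu=3
    return 1000.0 / (2 ** mu)

def average_event_budget_us(scs_khz, events_per_slot):
    """Average processing time available per event within one slot."""
    return slot_duration_us(scs_khz) / events_per_slot

print(slot_duration_us(120))             # 125.0
print(average_event_budget_us(120, 20))  # 6.25
```

The 6.25 µs average budget, before any safety margin, illustrates why the L1 implementation faces such strict real-time requirements.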
The development of modern FPGAs began in the 1980s, when Xilinx introduced the
first FPGAs in 1984. Since then, the capacity and speed of FPGAs have increased
rapidly, while the energy consumption and costs have decreased. The main reason
behind this development has been the scaling of semiconductor process technology,
described by Moore’s Law. In addition, the development of efficient electronic design
automation (EDA) tools for FPGA development supported the success of FPGAs.
[27] As a result, FPGAs have become promising devices for several applications, and
they are widely used in microelectronics and embedded system development. Some
applications that make use of FPGAs include network equipment, automation, data
centers and application specific integrated circuit (ASIC) prototyping. [26]
Nokia uses a large amount of HW for implementing 5G L1, but the device of
interest with respect to this thesis is the Intel Stratix 10 SX2800 SoC, which is
included in the HW. The Stratix 10 contains the required components to host
both, the L1 HW and the L1 SW. The Stratix 10 has a hard processor system
(HPS) containing four advanced reduced instruction set computer machine (ARM)
Cortex-A53 cores, and an FPGA containing over 2.5 million logic elements. The
processor containing the ARM cores can be used with up to 1.5 GHz frequency, and
it includes for example cache memories, support for direct memory access (DMA)
and several peripheral interfaces. [28] As described in chapter 1, the L1 SW runs on
the ARM cores and the L1 HW on the FPGA can be controlled by those cores via
the interfacing registers. Figure 6 illustrates the structure and different components
of the Intel Stratix 10 device.
As discussed in the previous section, the strict real-time requirements offer a major
reason for using FPGAs in 5G implementation. An example of the problems in SW
implementation is cache memories, which can cause large variation in the processing
times, since the time to fetch data from the memory can vary significantly depending
on where the data is physically located [5]. For example, it is faster to fetch data
from the level 1 cache than from the main memory. HW can provide considerably
more stable processing times. In addition, implementing computations on HW is
generally faster than implementing the same computations on SW. However, the
amount of speedup is strongly dependent on the application. [30]
An alternative to using FPGAs would be to produce ASICs, but the problem
is that manufacturing separate ASICs is very time-consuming and expensive,
and they cannot be modified after manufacturing and delivery to the customer,
unlike FPGAs. This non-configurable nature of ASICs is a major drawback as the
specifications for 5G are not complete yet, and thus it would be essential that it is
relatively simple and fast to modify the HW of the 5G L1 implementation after it
has been delivered to the customer. In addition, as described in chapter 1, there are
several variants of the L1 HW design. By using FPGAs, the SW can decide at boot
which variant will be loaded to the FPGA. This would not be possible with ASICs,
and thus a separate ASIC would have to be manufactured for each variant, which
would further increase the costs. Therefore, Nokia has chosen to use FPGAs instead
of ASICs.
The combination of one multiplexer and the RAM cells connected to the data
inputs of the multiplexer is called a lookup table (LUT) [26]. Furthermore, several
LUTs and flip-flops (and possibly other components) can be grouped together to form
configurable logic blocks (CLBs) that can implement more complicated logic than
single LUTs. The CLBs are connected to each other and to other parts of the FPGA
via a switch matrix. The switch matrix is formed by connecting the programmable
RAM cells to the select inputs of the multiplexers. The inputs and outputs of the
CLBs are then connected to the data inputs and outputs of the multiplexers of the
switch matrix. [26] The overall architecture of an FPGA is illustrated in Figure 7.
The CLBs in the figure are directly connected to the switch matrices even though it
is not evident from the figure.
[Figure 7: The overall architecture of an FPGA: rows of CLBs connected via switch matrices and interconnect wires, surrounded by I/O banks.]
However, the overall structure of modern FPGAs is not quite as simple as implied by
Figure 7. Instead, modern FPGAs include a set of ready-made building blocks that
are provided within the FPGA. These blocks may include for example memories
or memory controllers, digital signal processing (DSP) blocks, input/output (I/O)
controllers, or even CPUs. The idea of including such ready-made blocks into an
FPGA is that using these blocks in an FPGA design, instead of implementing them
on the FPGA logic, may improve area, performance and power ratio. The inclusion of
such blocks may also make FPGAs more appealing and competitive devices compared
to for example DSP processors. [26]
Figure 8 further explains the concepts of the lookup table and the switch matrix.
The c-letters in the figure illustrate the programmable RAM cells and the trapezoids
illustrate the multiplexers. The two multiplexers on the left side of the figure form
the switch matrix and the rightmost multiplexer, combined with the four RAM cells
that are connected to the data inputs of the multiplexer, correspond to the LUT.
As an example of the programmability of an FPGA, let us assume that we would
like to implement a 2-input OR-gate with the configuration shown in Figure 8. To
implement the gate, the three uppermost RAM cells of the LUT (corresponding to
the multiplexer values 11, 10 and 01) should be programmed to store 1, and the final
RAM cell corresponding to 00 should be programmed to store 0. If the two select
inputs of the LUT would contain for example bits 1 and 1, the output of the LUT
would be 1 which corresponds to the operation of the OR-gate. [26]
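The LUT behaviour in this example can be modelled as a small truth table indexed by the select inputs. This is a conceptual sketch of the idea, not HW code: the dictionary stands in for the four programmable RAM cells.

```python
# Sketch: a 2-input LUT modelled as a four-entry truth table indexed
# by the select inputs. The stored bits below implement the 2-input
# OR gate of the example: entries 11, 10 and 01 hold 1, entry 00 holds 0.

def make_lut2(ram_cells):
    """ram_cells maps the (a, b) select inputs to the stored RAM bit."""
    def lut(a, b):
        return ram_cells[(a, b)]
    return lut

or_lut = make_lut2({(1, 1): 1, (1, 0): 1, (0, 1): 1, (0, 0): 0})
print(or_lut(1, 1))  # 1
print(or_lut(0, 0))  # 0
```

Reprogramming the four stored bits turns the same LUT into any other 2-input function, which is exactly what makes the FPGA fabric reconfigurable.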
[Figure 8: A lookup table and a switch matrix built from multiplexers. The c letters denote programmable RAM cells; the cells on the select inputs of the two left multiplexers form the switch matrix, and the rightmost multiplexer with the four RAM cells on its data inputs forms the lookup table.]
Naturally, the logic blocks of real FPGAs are significantly more complex than
for example the one shown in Figure 8. As an example, Figure 9 shows a high-level
block diagram of one so-called adaptive logic module (ALM) of the Stratix 10
FPGA. In the Stratix 10, one CLB consists of 10 ALMs connected together. [31] In
total, the Stratix 10 contains 933 120 ALMs [28].
HDL code into the actual IC. These issues make it difficult to embed FPGAs into
traditional SW development, even though it has been shown that the usage of FPGAs
could improve the efficiency of developed SW significantly in many applications,
for example data analysis [34]. Therefore, much effort is currently put into making
FPGAs more accessible for SW developers. [26] One method to achieve this could
be to shift to a higher abstraction level in the design by developing programming
languages for HW development that are more similar to the languages used in SW
programming. This is known as high-level synthesis (HLS). Catapult C, developed
by Mentor Graphics, is one example of an HLS tool. It uses C/C++ or SystemC as
input language, and produces RTL code as an output. [35]
The FPGA development process can roughly be divided into five sections [36]:
1. Creation of the HDL code
2. Analysis and elaboration
3. Synthesis
4. Implementation (place-and-route)
5. Bitfile generation
Simulating the created design is not included in the above list, but in reality it is an
extremely important step in the process and it is done in many phases of the flow.
In addition, the flow contains several checks and verifications that are not in the
list, related to for example clock skews, clock domain crossings, checks for logical
equivalence and so on. The creation of the HDL code is the only one of the five
steps that is done manually, and the rest of the steps are done automatically by
EDA tools. However, this has not always been the case. During the first years after
the invention of the FPGA, implementing the designs on FPGAs was manual work.
Afterwards, as the designs became larger and more complicated, it was necessary to
develop EDA tools for the process to keep the design effort at an acceptable level. [27]
There exist several tools for this process, provided by the FPGA vendors. The tool
used by Nokia will be addressed in chapter 3 of this thesis.
The first step after writing the HDL code is to analyze it. This means that the
code is checked for syntactic and semantic errors. A syntactic error means a situ-
ation where the written HDL code does not support the syntax of the used HDL
language. A semantic error, on the other hand, refers to a situation where the
syntax of the HDL code is correct, but the written HDL code is meaningless for
the HDL compiler. If the analysis of the HDL code does not give any errors, the
code can be elaborated. In the elaboration phase, the top-level HDL design is
expanded such that all individual entities are represented by unique components,
and the interconnections between them are defined. It should be noted that after the
elaboration, the components are still represented by only generic RTL constructs. [33]
The second step in the process is to synthesize the elaborated design. The syn-
thesis tool maps the generic RTL constructs of the elaborated design into gate level
components that are included in the FPGA. These may include for example LUTs,
flip-flops and phase-locked loops. The mapping created by the synthesis tool is called
a netlist, and it will be used in the subsequent phases of the design process. This
mapping is minimized and optimized, and the mapping is logically equivalent to the
elaborated design. [36, 37]
The third phase of the design flow is the implementation (also known as place-
and-route or fitter). In this phase, the netlist generated by the synthesis tool will be
mapped to the HW inside the FPGA. As the name suggests, this process can be
divided into two parts: placing the design, and routing the design. Placement is the
process of deciding where the components generated by the synthesis process are
situated inside the FPGA. Routing is the process of determining the physical connec-
tors that are used to connect the placed components together. The implementation
is a very HW-intensive and time-consuming process. Essentially, implementation
can be seen as an extremely complicated optimization problem: it strives to map
a netlist to the available HW inside the FPGA with certain boundary conditions.
These include for example the timing constraints. [36, 37]
The mapping generated by the place-and-route tool is not yet compatible with
the FPGA, and therefore the last phase of the design flow is to generate a bitfile
based on the mapping that can directly be loaded into the FPGA. The bitfile is
generated automatically by the SW tool that is used in the development, and it
can be loaded to the FPGA by using for example the Joint Test Action Group
(JTAG) standard. JTAG was originally developed to provide relatively effortless
technology for testing printed circuit board assemblies without the need for low-level
physical access or much development for functional tests. However, nowadays JTAG
is widely used for programming, debugging and testing interfaces on for example
microcontrollers, ASICs, and FPGAs [38, 39]. After the bitfile is loaded into the
FPGA, the design can be tested in practice to verify that it operates correctly. It
should be noted that because RAM is volatile memory, meaning that it requires
continuous supply voltage to store data, FPGAs can only store the loaded bitfile as
long as power is supplied to the FPGA. Thus, every time the FPGA is rebooted,
the bitfile is lost and it must be reloaded to the FPGA.
[Figure: comparison of ASIC, FPGA, and CPU in terms of flexibility.]
[Figure: the interface between a software application running on a microprocessor and a custom-HW module, consisting of the API, the software driver, the microprocessor interface, the on-chip bus, the hardware interface, the programming model, and the ports of the custom-HW module.]
As can be seen from the figure, five elements of interest can be identified that
constitute the interface between the software application and the custom-HW module.
These elements will be discussed during the rest of this chapter.
1. On-chip bus
2. Microprocessor interface
3. Hardware interface
4. Software driver
5. Programming model
The on-chip bus is a component that is used for transferring data between the
software application and the custom-HW module. In more detail, the bus system
describes a protocol, i.e. the different phases in communication required to transfer
data between components in a predefined manner. Broadly, two types of on-chip
buses exist: those that are shared among many components (usually called masters
and slaves), and those that implement only dedicated point-to-point connections.
Not all on-chip buses transfer data according to the same protocol. Instead,
there exist several on-chip bus standards, and if the developer desires to connect a
custom-HW block to a CPU/microprocessor, it is often required to know the bus
standard that is used by the processor. Some commonly used standards include for
example the advanced microcontroller bus architecture (AMBA), CoreConnect and Avalon. [30]
Let us now consider the shared bus configuration and the point-to-point bus con-
figuration in a little more detail. In the shared configuration, an address space is
utilized to control the communication between the devices that are connected to the
on-chip bus. This address space is used in such a way that when data is transferred via
the bus, there is always a destination address associated with the data. This address
is used to determine the component that should receive the data. [30]
Since transferring data via the on-chip bus requires several pieces of information,
the bus itself physically consists of several wires. These wires can be divided into
four categories: address wires, data wires, command wires and synchronization wires.
As their names imply, data wires carry the data to be transferred, and address
wires contain the address that is associated with the data. Command wires carry
information about the type of transfer that is to be performed. This information
may for instance indicate whether a read or a write operation is to be performed. Finally,
the synchronization wires ensure that the masters and slaves that are attached to
the bus are synchronized during the data transfer. [30]
The point-to-point configuration has some similarities and some differences in com-
parison to the shared configuration. Similarly as in the shared configuration, the
point-to-point configuration also recognizes slave and master components. However,
because the connection is only from one point to another, there does not exist an
address space or any similar concept. Instead, the data is represented as a continuous
stream of items. Even though there is no address space, the data items may still be
divided into different logical channels by multiplexing several data streams over the
same physical channel. [30]
The microprocessor interface provides a method for the software program to connect
to the on-chip bus. It consists of hardware and low-level firmware to enable this
connection. Microprocessor interfaces can be categorized into three classes: memory-
mapped (MM) interfaces, coprocessor interfaces and custom-instruction interfaces.
The MM interface is the most commonly used. It reserves a portion of the address
space of the processor to be used in communication between HW and SW. This
property makes it easy to use with SW development. As a simple example, a register
connected to the on-chip bus can be an MM interface, making the register a shared
resource between SW and HW. In this case, a bus transfer to a specific memory
address causes an access to the register. More advanced examples of MM interfaces
include for instance mailbox, first-in first-out (FIFO) queue, and shared memory
configuration. [30] However, these will not be discussed in more detail in this thesis.
Even though the MM interface is very widely used, its performance may not be
enough in high-throughput applications. In this situation, it may be replaced with a
coprocessor interface, which is a dedicated interface specifically designed to be used
to attach custom-HW modules to the processor. In addition to higher throughput,
the coprocessor interface provides a fixed latency, which is not the case with the MM
interface. The coprocessor interface does not utilize the on-chip bus, but instead it is
a specific port on the processor which is controlled with a certain set of instructions.
However, the drawback of using the coprocessor interface instead of the MM interface
is that the custom-HW module must be designed to be compatible with that interface
on the particular processor. Therefore, the reusability of the HW block is limited to
systems that use the same processor. [30]
The software driver and the programming model are the elements that are clos-
est to the software application and to the custom-HW module, respectively. The
software driver is an interface that connects the SW application to the microprocessor
interface. It is used to wrap the transactions between HW and SW into SW function
calls, and to convert concepts that are natural to software into structures that are
applicable for communication with HW. Similarly, the programming model is an
interface for connecting the custom-HW module to the hardware interface. It is a
high-level representation of the HW to the SW application. The programming model
is straightforward for the microprocessor to handle because it specifies the memory
areas used by the HW module and the commands understood by it. [30]
In a reconfigurable system, the HW interface changes together with the bitstream that is loaded to the FPGA. Therefore, developing just one constant
interface will cause issues. More specifically, if the MM interface is used, the problem
is that the address space of the interfacing components of the FPGA system changes
whenever the HW interface is modified. The consequence of this is that the address
space used in the SW must be reconfigured. [42]
As discussed in chapter 2.3, the Intel Stratix 10 SX2800 contains both an FPGA and
an HPS. A typical usage scenario of such a device is to implement fast computation
on the FPGA HW, controlled with SW developed for the HPS. Generally,
when a new FPGA design is created, a base address is assigned to each IP block of
the design. This assignment can be performed automatically with the FPGA design
tools. The base addresses are then mapped to the processor so that the developed SW
can access the FPGA design. This enables the user to control the FPGA HW via SW.
Due to the reasons discussed above, it is clear that developing SW for reconfig-
urable HW is not a simple task. Because a single, static interface between HW and
SW is clearly a problematic solution, roughly two practical approaches can be found.
The first one is to move to a higher abstraction level in the development. Several
frameworks have been designed for this purpose, for example the open computing
language (OpenCL) developed by Khronos Group [43], and the reusable integration
framework for FPGA accelerators (RIFFA) developed at the University of California,
San Diego [44]. The second approach is to stick with the traditional interface logic
development, and try to make the process of maintaining and updating the interface
and programming model as simple as possible.
Let us discuss the concept of generating the interfacing logic between HW and
SW by increasing the abstraction level. The objective of moving to a higher
abstraction level is to hide the low-level details of the HW/SW interface from the
developer.
To illustrate the frameworks for increasing the abstraction level, the two frameworks
that were mentioned above (OpenCL and RIFFA) will be examined. Fundamentally,
these frameworks address the same problem but approach it from slightly different
perspectives. RIFFA is used to make connecting custom IP blocks on an FPGA to
the CPU on the host computer as simple as possible. Therefore, using it in FPGA
design still requires the HW design skills to develop IP blocks using HDLs, or at
least the developer must have access to IP blocks made by someone else. [44]
On the other hand, it can be seen that OpenCL combines the approach provided by
RIFFA with HLS. In other words, using OpenCL for FPGA development requires
(ideally) only SW design experience. [26]
OpenCL is an open and royalty-free SW framework that can be used for devel-
oping SW that runs across several platforms, including for example CPUs, GPUs,
FPGAs and so on. In other words, OpenCL can be seen as a programming model,
providing SW developers relatively easy access to, for example, different
HW accelerators via SW. [43] This can be highly useful, as traditional SW developers
most likely do not have the necessary knowledge and skills to utilize for example
FPGAs without a framework such as OpenCL.
In terms of FPGAs, the main idea of OpenCL is that it abstracts away the FPGA
design flow, enabling SW developers to use traditional SW development
methods to program FPGAs. An OpenCL program consists of a set of functions
(usually called kernels) running on the accelerator, and a host program running on
the host (usually CPUs), that is connected to the accelerator. The host program is
the SW that controls the HW on the accelerator. The host program controls the
accelerator by using library routines that abstract the communication between the
host processor and the kernels on the accelerator. [26]
The host program of an OpenCL application can be written in several languages
that support it, including for example C/C++ and Python [43]. The
kernel functions that are executed on the FPGA are written in a C-like language
that has been modified to better support the device model used by
OpenCL. However, the important issue to note is that both “sides” of an
OpenCL program are developed with a software language, which does not require
expertise in HDLs. [26]
The method to connect the FPGA board to the host system varies depending
on the development setup. For example, using the Intel FPGA SDK for OpenCL,
there are basically two methods to connect the FPGA to the host. In the first
scenario, the FPGA is placed on an accelerator board that can be directly connected
to the host PC via a peripheral component interconnect express (PCIe) connection.
In this case, the OpenCL host program runs on the CPU of the host PC, and the
kernels are implemented on the FPGA accelerator board. In the second scenario
the host program runs on the CPU cores of the FPGA HPS, and the kernels are
implemented on the FPGA fabric. Thus, the CPU cores are connected to the FPGA
through specialized bridges. [47, 48, 49] Figure 12 illustrates the basic concept of an
OpenCL design flow.
Figure 12: High-level illustration of the OpenCL design flow. Modified based on [50].
The basis of RIFFA is the concept of a communication channel that is used for
communication between SW threads on the CPU and the developed IP cores on the FPGA.
To use a channel, it must first be opened, then data can be read and written through
it, and finally it must be closed. On the FPGA side, the operation of a channel is
implemented as a FIFO interface for both receiving and transmitting data. On the
SW side, data can be sent to and received from the channel as byte arrays with SW
function calls. The upstream transfers (i.e. data flowing from the IP cores to the
PC) are initiated by the IP cores, and downstream transfers (i.e. data flowing from
the PC to the IP cores) are initiated by the PC. [44]
The SW interface of RIFFA is a very simple set of functions that are used to
communicate with the FPGA. The main functions are open, close, send and receive,
which are used for initializing the FPGA, closing the connection to the FPGA, sending
data to the FPGA, and reading data from the FPGA, respectively. In addition, there
is a function for listing all FPGAs that are connected to the host, and a function that
resets a specified FPGA and the transfers across all channels that are connected to it.
Because there are only these six commands for communicating with the FPGA, it is
very simple for a SW developer to communicate with the IP blocks on the FPGA. [44]
The HW interface of RIFFA consists of two separate sets of signals: one for re-
ceiving data, and the other for sending data. Some of these signals are used for the
handshake protocol, and the rest implement the FIFO ports that were mentioned
above. One problem with the HW interface is that it is required to know the length
of every data transfer. This may cause issues with applications where the transfer
length is not known. However, in some situations this problem can be solved by
buffering the data until all of it is generated, and transferring it afterwards. [44]
Let us first examine the HW side of the architecture. The data transfer is imple-
mented as scatter gather DMA based on the PCIe protocol, and the FPGA acts
as the DMA bus master in this configuration. In the above figure, this basically
consists of receive (RX) and transmit (TX) engines and everything above them. [44]
A scatter gather DMA is an architecture that, as opposed to the traditional DMA
which is able to fetch data only from a single buffer of consecutive memory, can
collect data from several buffers distributed over nonconsecutive parts of memory.
In this way, the scatter gather DMA can be used to transfer larger buffers than
is possible with ordinary DMA, reducing the number of DMA transfers and thus
increasing the overall performance of transfers significantly. [51]
As can be seen from Figure 13, the DMA bus master configuration is connected to a
PCIe endpoint core. One of the uses for this core is that it enables the translation
between the packet data supported by PCIe and the payload data. The custom IP
cores access and execute computation on this payload data via the signals of the HW
interface that were briefly introduced above. [44]
An upstream transfer is initiated by a channel through a subset of the signals of
the HW interface. As can be seen from Figure 13, the data
is first written to the TX FIFO, where it is split into blocks that are suitable for
separate PCIe write packets. To send these blocks to the host PC, the targeted
memory locations must be known. These memory locations are provided as scatter
gather elements. Thus, a channel first requests these elements from the host. This is
illustrated as the “scatter gather requester” box in Figure 13. After a channel has
acquired the scatter gather elements, it creates a write packet supported by the PCIe
protocol for each data block created in TX FIFO. [44]
Because there are multiple independent channels transferring the upstream data, and
they all share the same upstream PCIe link, multiplexing is needed. This
is provided and managed by the TX engine block, shown in Figure 13. In addition,
the TX engine also formats the channel requests into complete PCIe packets and sends
them to the PCIe endpoint. [44]
The downstream transfer is initiated by the host PC via SW APIs. Once the
initialization has been completed, the channels request the scatter gather elements
(memory locations) for transferring the data. These requests are addressed by the
TX engine. After the memory locations have been received, separate PCIe read
requests are made for the data at the memory locations defined by the scatter gather
elements. This requested data is forwarded to the RX engine at the PCIe endpoint.
There, the RX engine extracts data from the PCIe packets and demultiplexes the
data to correct channels. The reordering queue shown in Figure 13 is used to ensure
that the data is forwarded to the channel in the correct order. [44]
Let us now examine the SW side of the architecture. As can be seen from Fig-
ure 13, there is a kernel device driver and an API on the host PC. The kernel driver
registers all connected FPGAs that have RIFFA support, and it allocates a memory
buffer for each FPGA that is used for transferring scatter gather data between the
host PC and the FPGA. The language bindings allow the user application to call
into the kernel driver, and the FPGA can access the driver via a device interrupt. [44]
Even though SW tools like OpenCL and RIFFA make it more effortless to integrate
FPGAs into traditional SW development, they both require relatively complicated
logic to be able to adapt easily to different FPGA designs. This causes the overhead
in communication between the host SW and the FPGA to be quite high. Even
though this overhead might not cause issues in many applications, the real-time
requirements of 5G L1 discussed in chapter 2.2.2 are so strict that the overhead will
cause problems. Different latencies of RIFFA are listed in Table 2:
The “FPGA to host interrupt time” is the latency between the moment when the
FPGA signals an interrupt to the host and the moment when the host receives the
interrupt. The “host read from FPGA round-trip time” is the total time that it takes
for a request to propagate from the kernel driver to RIFFA (through the RX and TX
engines), and back. The “host thread wake after interrupt time” refers to the time
to resume a SW thread after it has been woken by an interrupt from the FPGA. [44]
As can be seen, all these latencies are in the order of microseconds, which is a large
portion of the time available to perform a certain calculation in 5G L1. In many
worst-case scenarios (from the viewpoint of timing requirements), such latencies are
not acceptable.
For OpenCL, the problem with the overhead is even worse. The issue with OpenCL
is the overhead associated with launching a kernel function on the device, for example
an FPGA or a GPU. This overhead can be in the order of tens of microseconds, and
it is not dependent on the actual computation that is performed on the kernel. [52,
53] Based on the real-time requirements of 5G L1, it is clear that such overhead
times cannot be tolerated. In contrast, accessing the registers of a memory mapped
FPGA is considerably faster. Depending on the situation, the time required for this
is in the order of a few hundred nanoseconds. [54]
According to [55], successful FPGA development can be divided into three sectors:
1. Project management. This includes for example defining the project requirements
and objectives, resource and cost management, risk assessment and
project execution.
2. FPGA vendor choice and partnership. The careful selection of the FPGA
vendor ensures that the same technology and devices can be efficiently used for
not only the current but also the future FPGA projects.
3. FPGA design methodology. This includes the whole FPGA design flow, includ-
ing for example device selection, IP reuse and design environment.
If each of these sectors is managed carefully, successful FPGA development and
desired results can be achieved [55]. Because the problems at Nokia are closely
related to the interaction between different FPGA design teams, the best practices
of working in a team-based design environment are discussed in more detail.
Nowadays, many large FPGA design processes are divided into several teams, each
team developing a certain portion of the design. There are several advantages that
can be achieved by successfully using this team-based design approach. The first
one is acceleration of the design process. Because the separate teams can begin to
implement their parts of the project without waiting for the rest of the team, the
overall process can be completed within a shorter schedule. In other words, the
process becomes more parallel. Another advantage is that the verification of the
design simplifies, because several phases of the verification can be executed by the
teams only for the corresponding portions of the design. The team-based design
approach also isolates the possible problems (for example timing issues) to a certain
portion, simplifying the debugging. A further advantage is that the compilation
time of minor changes to a single portion decreases, because the modification can be
performed only by the corresponding team without directly affecting the rest of the
design. [55]
For a team-based FPGA design process to be successful, there are two major parts
that are required. The first one is a single team lead that is responsible for the
top-level design planning and integration of the design. The second one consists of
the other team members that create the RTL descriptions of the corresponding IP
blocks, and implement the blocks. The team leader creates new top-level designs
either according to a predefined schedule or whenever a major improvement has
been developed for one or more IP blocks. Even though the idea is that the team
members develop their portions of the design independently, it is beneficial to
develop the portions in the context of the top-level design to avoid issues in
integration. For these two parts to work effectively together, the team lead should
create an initial project setup that assigns the portions of the design to the different
teams. [55] Figure 14 shows a high-level illustration of the team-based design flow.
If the FPGA system contains a processor that is used to execute SW, the overall
development is usually considered to be embedded system design. This requires new
aspects to be taken into account because adding SW to the development process
introduces new design challenges to the HW engineers. One major advantage of
using FPGAs in embedded systems is that the reconfigurable nature of FPGAs allows
the developer to modify the design in both SW and HW. This makes it possible
to match the design more precisely to the target application. In comparison, if an
embedded system is built around a simple microcontroller, only SW modifications
can be done to meet the requirements. On the other hand, the flexibility provided by
an FPGA also creates the problem of determining how a certain functionality should
be implemented or how computations should be split between SW and HW. [55]
Each IP block includes a register interface that is mapped to the addresses of the
SW interface, and these registers represent the data that is transferred between HW
and SW. The register address map contains each of these registers. [55]
The register address map is the main interface between the application SW and the
RTL description of the HW, and the information provided by it is used by many
participants in several data formats. Therefore, the address map is shared within
several teams and groups that are associated with the embedded system develop-
ment. This creates challenges in synchronization of the address map. For these
reasons, it is highly important that the register address map is strictly controlled
and every modification to the map is communicated across the entire design team. [55]
As mentioned previously, the HW designer must connect the register address map
interface to other parts of the system. This requires creating a considerable amount of
logic for all the registers to be able to communicate with the system. The problem
that arises is that modifying the register address map requires the HW developer
to modify the interconnection code and the documentation, and the address map
must be again communicated to the rest of the design team. In addition, the SW
developers must modify the SW header files accordingly. Performing this manually
requires much work and is prone to errors. However, FPGA vendors and other
companies provide EDA tools that automate the creation of the interconnection logic,
SW header files and all other necessary files. It is highly recommended to use
such tools for register address map management. [55]
3 Solution development
This chapter first introduces the current FPGA development workflow and the
problems related to it. The rest of the chapter focuses on explaining and motivating
the proposed improvements to the workflow, and describing the development of the
improvements.
Before the development process is described in more detail, let us first discuss
the specifications of the UPM and the requirements for creating it. The UPM itself
is an XML file that contains a description of the interfacing registers towards the L1
SW. The most important requirements for the UPM are listed below:
• The UPM must contain the interfacing registers of every FPGA design variant,
even though each design does not have to actually implement the registers of
every variant.
• Every FPGA design variant must use the same version of the same common
IP blocks.
• Every modification made to the UPM must comply with the open-closed principle.
This means that new interfacing registers may be added to previously unused
memory locations and new fields may be added to previously unused parts of a
register. However, the names, locations or contents of existing registers cannot
be modified without special arrangements, which include synchronization
between the L1 HW and the L1 SW teams.
• Each FPGA variant team may independently release new bitstreams without
synchronization as long as no modifications to common IPs are performed, and
the new releases are backwards compatible with the UPM.
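For illustration, a fragment of what a register description inside the UPM might look like is sketched below. The element names loosely follow IP-XACT conventions (the exact namespace prefix depends on the standard revision), and the registers and addresses are invented:

```xml
<!-- Hypothetical excerpt; register names and addresses are invented. -->
<ipxact:addressBlock>
  <ipxact:name>common_ip_1_regs</ipxact:name>
  <ipxact:baseAddress>0x0000</ipxact:baseAddress>
  <ipxact:register>
    <ipxact:name>CTRL</ipxact:name>
    <ipxact:addressOffset>0x0</ipxact:addressOffset>
    <ipxact:size>32</ipxact:size>
  </ipxact:register>
  <!-- Open-closed principle: a new register may be appended at a
       previously unused offset, but CTRL above may not be renamed,
       moved, or have its existing fields changed. -->
  <ipxact:register>
    <ipxact:name>NEW_FEATURE</ipxact:name>
    <ipxact:addressOffset>0x8</ipxact:addressOffset>
    <ipxact:size>32</ipxact:size>
  </ipxact:register>
</ipxact:addressBlock>
```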
Figure 15 shows a simplified block diagram that illustrates the basic building blocks
of the 5G L1 FPGA designs. As can be seen from the figure, there are common IP
blocks and variant-specific IP blocks, and each block is directly connected to one
register bank. The register banks are further connected to the SW, and through these
connections the SW can be used to control the IP blocks implemented on the FPGA.
The dashed box around the register banks illustrates the UPM, and as can be seen,
it combines the register banks of every common and variant-specific IP block.
The teams involved in the FPGA design process can be divided into three categories:
the teams developing the IP blocks and the corresponding register banks of the
FPGA design (submodule teams), the teams that integrate these blocks into the final
top-level design of each variant (integration teams), and the team that generates the
final UPM. There is a separate team for developing every IP module, and there is
also an independent integration team for every L1 HW variant. Figure 16 illustrates
the different team categories, and the interaction between the teams. As can be seen
from the figure, the submodule teams provide all information of an IP block (i.e. the
description of the register bank in Excel or XML format, and the VHDL codes of the
IP block and the register bank) to the integration teams, and the integration teams
in turn provide the PM XML files of each design variant to the UPM generation
team. Every integration team also generates a bitstream to be loaded to the FPGA,
even though only two bitstreams are shown in Figure 16.
Let us now describe the development process in more detail. The design flow starts
from the specifications for the HW to be implemented. Based on the specifications,
the RTL descriptions of the required IP blocks and the interfacing registers towards
the L1 SW are created by the submodule teams. Each submodule team is responsible
for implementing one specific IP block and the corresponding register interface of
the top-level design. The RTL descriptions of the IP blocks are created using VHDL
language and HLS tools. On the other hand, the RTL descriptions of the interfacing
registers are not created manually by the developer. Instead, a script that takes as
input a simple description of the desired registers, in Excel format, is used.
The Excel tables describing the interfacing registers are created manually by the
submodule teams. For the Excel tables to be compatible with the script, the tables
must conform to a certain format. This required format is further discussed in chapter
3.2.2. The script produces an XML file and a VHDL file based on the input Excel
table. The usage of these files is addressed shortly. Because a register interface can
technically be generated without any information about the corresponding IP block,
the internal implementation and functionality of the IP blocks are not addressed in
this thesis. However, in practice the desired usage of the IP block must be known,
because otherwise it cannot be specified what kind of interfacing registers are required
for the IP.
The XML file produced by the script contains essentially the same description
of the interfacing registers as the Excel file, and it complies with the IP-XACT format.
IP-XACT is an XML format that is used to describe individual and reusable elec-
tronic circuit designs. The purpose of IP-XACT is to provide a standard that can be
used for compatible sharing and reuse of circuit designs between component vendors
such that they are also compatible between different EDA tools. [56] IP-XACT has
been standardized by the Institute of Electrical and Electronics Engineers (IEEE) in
the IEEE 1685-2009 standard [57]. The created XML file is used in the subsequent
phase of the design flow, as an integration team generates a programming model
corresponding to a certain design variant.
The VHDL file produced by the script is used to create the HW interface for
the control SW. It contains not only the VHDL description of the used registers,
but also all the interfacing logic that is necessary for accessing the registers both
from the IP block HW and from the ARM cores. The interface protocol according
to which the VHDL code is created is specified in the Excel table. This can be for
example the Advanced eXtensible Interface (AXI) protocol. The creation of the XML
file and the VHDL file of the interfacing registers is performed for every IP block.
Afterwards, the created IP-XACT descriptions and the VHDL files are stored in an
IP library for later reuse.
All parts of the design flow discussed so far are performed by the submodule teams.
Thus, the next step in the process involves the integration teams. As implied in
Figure 16, the submodule teams provide the RTL descriptions of the IP blocks and
the Excel/XML and VHDL descriptions of the register interfaces to the integration
teams. At this point, it should be noted that the submodule teams strive to develop
as “general” and reusable IP modules as possible, i.e. they do not try to optimize
the modules for some specific purpose. This is a good approach in terms of the
reusability of the IP blocks. However, as pointed out in chapter 2.6, this approach
may sometimes create problems in the final integration.
The integration team of each variant uses the IP blocks that belong to the specific
variant to create the top-level L1 HW design and the corresponding bitstream. This
design then goes through the basic steps of FPGA design flow described in section
2.3.2 (simulations, synthesis, place-and-route, timing analysis etc.), and finally the
generated bitfile can be loaded to the Stratix 10 FPGA. The SW tool used for this
part of the design flow is the Intel Quartus Prime Pro (from now on referred to
as Quartus). Quartus is a tool developed by Intel that is specifically targeted for
developing HW for PLDs. It can be used for the whole development process, starting
from the VHDL programming all the way to the bitfile generation. However, in this
case the most important tool in Quartus is the platform designer.
The platform designer is a system integration tool that is used to automate the
process of integrating separate IP blocks into a larger design. It provides a graphical
user interface where the developer can easily connect the different IP blocks together,
assuming that each IP block is developed to support a certain interface protocol.
Only certain protocols are supported by the platform designer, one of them being
the AXI protocol that is used to connect the IP blocks to each other and to the
ARM cores of the Stratix 10 device. After all the components are connected to
each other in the platform designer, it can be used to generate the top-level RTL
description of the design, based on which the final bitstream can be generated
and loaded to the FPGA. In addition to connecting the IP blocks together, the
base addresses for the SW interfaces are also specified in the platform designer.
Since the SW running on the ARM cores and the FPGA HW communicate by
sharing data via the interfacing registers, each located in a certain memory address,
this interface implements the MM architecture, discussed in chapter 2.4. There-
fore, it is essential that the base addresses can be acquired from the platform designer.
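The memory-mapped register access itself can be illustrated with a small sketch. On the target, the SW would map the register bank from /dev/mem at the base address assigned in the platform designer; here an anonymous mapping stands in for the hardware, and the offset 0x040 is an arbitrary example, not part of the actual 5G L1 register map.

```python
import mmap
import struct

def read_reg(mem, offset):
    """Read a 32-bit little-endian register at a byte offset into a
    memory-mapped region."""
    return struct.unpack_from("<I", mem, offset)[0]

def write_reg(mem, offset, value):
    """Write a 32-bit little-endian register."""
    struct.pack_into("<I", mem, offset, value)

# On the target, the region would be mapped from /dev/mem at the base
# address assigned in the platform designer, e.g. (illustrative only):
#   fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
#   mem = mmap.mmap(fd, mmap.PAGESIZE, offset=base_address)
# Here an anonymous mapping stands in for the hardware register bank.
mem = mmap.mmap(-1, mmap.PAGESIZE)
write_reg(mem, 0x040, 0xDEADBEEF)   # hypothetical register at offset 0x040
print(hex(read_reg(mem, 0x040)))    # → 0xdeadbeef
```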
In addition to the bitstreams, each integration team also produces the PM XML file
of the corresponding variant. Essentially, the PM XML file is a combination of all the
XML files corresponding to the IP blocks that are used in the FPGA design variant.
In other words, the PM XML file contains all the interfacing registers of those IP
blocks that are used by the integration team. The PMs of the independent design
variants are created automatically using a C shell script. This script requires as
inputs the XML files of the interfacing registers of the different components (provided
by the submodule teams), and the base addresses of the register banks. The script
produces several outputs, the most important of which are the PM XML file and
the C header files that could be used to develop SW for the FPGA design. However,
these header files are not used because they do not comply with the UPM. Instead, the
used header files are generated separately based on the UPM by the L1 SW team.
The final part of the development process is the generation of the UPM. This
is done by a separate team. As can be seen from Figure 16, each variant integration
team provides the PM XML file of the corresponding variant to the UPM generation
team. As the UPM is the “top-level” PM, it must be formed by combining the
information of all the PMs provided by the integration teams. Currently, this is done
largely by hand. One of the L1 HW variants is a “master” variant in the sense that
every IP block and the corresponding register interface that is included in this master
variant is also included in some other variant, but not all the blocks that are in the
master variant are included in every other variant. Therefore, every register interface
that is included in the master variant must be included in the UPM. Currently,
every register bank XML file that is not included in the master variant is searched
and extracted manually from the XML files of the other variants. These are then
manually copied into a single XML file, which is merged into the XML file corresponding
to the master variant with a C shell script.
[Figure 17: Activity diagram of the current workflow. Each submodule team converts its Excel file to an XML file and a VHDL file and stores the files to the IP database; the compatibility of the PMs is checked; the PM generation is automated with a script, while the UPM generation is only partly automated with a script.]
One of the most important observations to be made from the figure is that there
is a lack of organized communication between the different teams. In addition, it
can be seen that writing the base addresses to the PM generation script is performed
manually by an integration team, even though this could be done automatically
based on the .qsys file that is produced by the platform designer. Finally, it can be
seen that the generation of the UPM is partly manual, even though it could be fully
automated. These issues will be addressed in the next section of this thesis.
Let us now compare the described development flow with the best practices of
team-based FPGA development discussed in chapter 2.6. If only a single variant-
integration team and the submodule teams are considered, the design process is
almost identical to the preferred embedded system development process. The variant-
integration team can be considered as the team leader, and the submodule teams
correspond to the other team members which independently develop certain portions
of the top-level design. In addition, the register bank interfaces of the IP blocks are
fluently communicated to the integration team and the platform designer is used
to automatically generate the interfacing logic and the SW header files that are
transferred to the L1 SW team. The main difference is that the submodule teams
produce as general IP blocks as possible, i.e. the blocks are not developed in the context of
the top-level design.
However, as several variant-integration teams and the UPM generation team are intro-
duced to the process, it ceases to follow one of the essential general guidelines. This is
the requirement for a single team leader that would manage all the other participants
in the process. Even though it can still be seen that the variant-integration teams
act as leaders for the submodule teams, there is no clear leader for the integration
teams. They are in equal position with respect to each other, and even though the
UPM generation team communicates with each integration team, it does not manage
or control them.
Another issue that was emphasized in chapter 2.6 was that it is essential to com-
municate the modifications of the register address map (programming model) to
every group involved in the FPGA design process. As mentioned, this principle is
followed in the context of one integration team, the submodule teams and the SW
team. However, this is not applied between the different variant integration teams
which can also be observed from Figure 17. Thus, the integration teams do not
clearly communicate with each other on which versions of the IP blocks they are
using in the corresponding FPGA variants.
These two issues, i.e. the lack of a leader for the integration teams and the lack of
communication between the integration teams, are the main causes for the problems
in maintaining the PMs synchronized over all FPGA design variants. The lack of
the team leader enables the integration teams to develop the FPGA designs independently, inevitably leading to the situation where the different variants, which should comply with the same UPM, use different versions of the same common IP blocks. The
lack of communication in turn prevents the teams from fixing this because they do
not systematically share the information on the used IP blocks.
The work done in this thesis can be divided into two parts. The first part con-
sists of improving the current design process. These improvements are to be taken
into use quite quickly. The second part of the work introduces a more significant
modification to the design process, which is examined as a proof-of-concept. It is not
assumed that this new process will be taken into use in the near future, but the goal
is to show that the proposed process can be used in FPGA development and that it
can provide improvements to the process. These two parts are addressed separately
in the following two sections of the thesis.
The improvements to the current design process concern the following two phases of the design flow:

1. The method for acquiring the base addresses of the interfacing register banks.
2. The generation of the UPM based on the PM XML files of the design variants.
Currently, the base addresses are written manually to the PM generation script.
They are defined in an Excel table that is formed based on the platform designer.
This is unnecessary manual work, because when the top-level design of the L1 HW is
constructed in the platform designer, a .qsys file, that contains all the base addresses
and the associated register banks, is generated. In addition to being slow, writing
the base addresses to the script manually is quite prone to errors. Thus, it would be
faster and less error-prone to fetch the base addresses from the .qsys file with a script.
Therefore, a Python script that reads the base addresses from the .qsys file based on
the names of the components defined in the C shell script was created. However, the
component names defined in the C shell script and the component names in the .qsys
file differ considerably, and currently the mapping between these names is hard-coded
in the Python script. This mapping could perhaps be replaced with an algorithm that
directly finds the base addresses based on the component names. Some kind of string
distance algorithm, for example the edit distance, could be utilized for this purpose.
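Such a name-matching approach could be sketched as follows. The component names, the XML element names and the similarity cutoff below are illustrative assumptions; a real .qsys file uses a different and richer schema, and the sketch uses a ratio-based similarity from the standard library rather than the edit distance itself.

```python
import difflib
import xml.etree.ElementTree as ET

# Hypothetical component names as the C shell script knows them, and the
# (often differently named) instances found in a simplified .qsys-like file.
SCRIPT_NAMES = ["common_ip_1", "specific_ip_3"]
QSYS_XML = """
<system>
  <module name="common_ip_1_inst_0" baseAddress="0x00300000"/>
  <module name="specific_ip3_top" baseAddress="0x00340000"/>
</system>
"""

def base_addresses(script_names, qsys_xml):
    """Map each script-side name to the closest instance name in the
    .qsys-like XML and return its base address. The string-similarity
    lookup replaces the hard-coded name mapping."""
    root = ET.fromstring(qsys_xml)
    qsys = {m.get("name"): m.get("baseAddress") for m in root.iter("module")}
    result = {}
    for name in script_names:
        match = difflib.get_close_matches(name, qsys, n=1, cutoff=0.4)
        if match:
            result[name] = qsys[match[0]]
    return result

print(base_addresses(SCRIPT_NAMES, QSYS_XML))
```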
The process of generating the UPM is also performed quite manually even though
it could be more automatic. The first step in the UPM generation process is that
the variant integration teams provide the PM XML files of each variant for the
UPM generation. As mentioned earlier, one of these variants can be considered
as a “master” variant, and each of the interfacing register banks of that variant
will be included in the UPM. Thus, all the register banks that are not included
in the master variant, but can be found from one or more other variants must be
searched. Currently, this is done manually by reading through all the PM XML
files, and copying and pasting every found XML register bank description (that is
not in the master variant) to a separate XML file. This new file and the master
variant XML file are then merged together with a C shell script to form the final UPM.
The UPM creation process was improved by creating a Python script that au-
tomatically reads through all the PM XML files of the variants, searches the desired
register bank descriptions, and writes them to an XML file that can subsequently be
used as an input for the C shell script. Currently, the names of the desired register
banks are hard-coded in the Python script, instead of searching for the differences
between the XML file of the master variant and all the other variants every time.
However, this does not make the implementation significantly less flexible, because
new register banks are added to or removed from the design variants only very rarely.
Therefore, the hard-coded register bank list needs to be modified equally rarely.
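The core of such an extraction script can be sketched as follows. The element and attribute names are simplified assumptions; the actual PM XML files follow the IP-XACT schema and are considerably richer.

```python
import xml.etree.ElementTree as ET

# Names of the register banks missing from the master variant; in the
# actual script these are hard-coded for the reason described above.
WANTED_BANKS = {"Specific_IP_3_regs"}

def extract_banks(variant_xmls, wanted):
    """Collect the wanted register-bank elements from the variant PM XML
    strings into one combined tree, skipping duplicates."""
    out = ET.Element("registerBanks")
    seen = set()
    for xml_text in variant_xmls:
        root = ET.fromstring(xml_text)
        for bank in root.iter("registerBank"):
            name = bank.get("name")
            if name in wanted and name not in seen:
                out.append(bank)
                seen.add(name)
    return out

variant = ('<pm><registerBank name="Specific_IP_3_regs"/>'
           '<registerBank name="Common_IP_1_regs"/></pm>')
combined = extract_banks([variant], WANTED_BANKS)
print(ET.tostring(combined, encoding="unicode"))
```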
Because the UPM is a combination of all the register banks of every IP block
of the 5G L1 top-level design, it is desired to develop a simple method for keeping the
PM of one common IP block synchronized over all design variants. This can then be
easily expanded to the whole UPM, as the same method can be applied to every common IP block of the 5G L1 design. The variant-specific IP blocks of the design do not
need to be synchronized because they are not shared between all the different variants.
It should be noted that implementing the functionality that uses the information
provided by the added registers or register fields could cause problems for some
teams, because of the internal connections between the IP blocks in the different
variants. Taking the new functionality into use may therefore be difficult for those
teams. However, this problem is out of the scope of
this thesis because solving it would require the knowledge of the intended purpose
of each register that is included in the UPM, and it is only known by the HW
developers who are developing the corresponding feature. This problem can be seen
as a manifestation of the issue that the submodule teams do not develop the IP blocks
in the context of the top-level designs, even though this would be recommended
according to the best practices discussed in chapter 2.6.
To keep one IP block and the corresponding register bank synchronized between
different HW design teams, there must be a simple and effective method to transfer
data between the teams. In the proposed solution, the data transfer is implemented
with a single top-level Excel file. This file contains the description of every register
bank, together with the base address and the “scope” of each bank. In this context,
the scope of a register bank refers to all the design variants that
use the register bank. In more detail, the process can roughly be described with the
following steps:
1. One integration team desires to take into use a newer version of a common IP
block, provided by the corresponding submodule team. The other teams must
review this proposed modification via the top-level Excel file.
2. Once the other teams have accepted the modification, the proposing team merges
the modified register bank into the top-level Excel file and pushes the file to the
version control system.
3. All the other teams can now fetch the top-level Excel file from the version
control, extract the modified register bank, and generate new XML and VHDL
files based on the modified register bank.
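The role of the scope information can be illustrated with a minimal sketch; the bank names, base addresses and variant names below are hypothetical.

```python
# Hypothetical top-level entries: each register bank carries its base
# address and its "scope", i.e. the design variants that use it.
TOP_LEVEL = [
    {"bank": "Common_IP_1_regs", "base": 0x00300000,
     "scope": ["variant_A", "variant_B"]},
    {"bank": "Specific_IP_3_regs", "base": 0x00340000,
     "scope": ["variant_B"]},
]

def banks_for_variant(entries, variant):
    """Return the register banks whose scope includes the given variant,
    i.e. the banks that the variant's integration team must track."""
    return [e["bank"] for e in entries if variant in e["scope"]]

print(banks_for_variant(TOP_LEVEL, "variant_A"))   # → ['Common_IP_1_regs']
```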
It should be emphasized that the above process must be followed only for the common
IPs. If an integration team desires to take a newer version of a variant-specific IP
block into use, this can be done without acceptance from the other teams. However,
the modified register bank of a variant-specific IP must still be merged into the top-level
Excel file because the UPM is generated based on it. For this process to be effective,
it is essential that the top-level Excel file is stored in a version control system, and
that there are proper and systematic review practices applied to it. A problem
related to this is that the current FPGA design workflow does not include proper
code review, and thus it would be important that such a review system is introduced.
Let us now consider the implementation of the new process in more detail. The first
issue noted in the process was that viewing and modifying the Excel files, which can
on occasion be considerably large, is often quite cumbersome. This is especially true
for the top-level Excel file, which may contain the data of tens of individual Excel
files, each describing the register bank of a separate IP block.
In addition, it is relatively complicated to develop scripts for parsing or modifying
the data stored in Excel files. Therefore, it was concluded that there should be a
better format for storing the information of the interfacing registers, and it was
decided to examine the usage of JavaScript Object Notation (JSON) files for this purpose.
JSON is, as the name implies, derived from the JavaScript programming language,
and it is a data format that can be used to store data in an organized and easily
accessible manner. JSON is relatively easy for humans to read and write, but at the
same time it is also simple for a computer to access and parse. JSON consists of two
fundamental structures: a collection of name-value pairs, which can be compared
to, for example, dictionaries or objects in traditional programming languages, and
ordered lists of values, which can be compared to lists or arrays. [58]
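A minimal example of these two structures, written in the spirit of the register bank descriptions (the names and values are illustrative, not taken from an actual bank file):

```python
import json

# An object (name-value pairs) whose "fields" value is an ordered array.
text = ('{"register": "Register_1", "fields": '
        '[{"field": "Register_1_1", "start": 31, "stop": 21}]}')

data = json.loads(text)             # parse into a dict and a list
print(data["fields"][0]["field"])   # → Register_1_1
print(json.dumps(data, indent=2))   # serialize back, pretty-printed
```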
The Excel table in Figure 18 defines three interfacing registers: “Register_1”, “Regis-
ter_2” and “Register_3”. These are defined in the column F of the table. In column
E, the address of each register is specified. However, these are not the absolute
addresses of the registers, but rather offsets to the base address of the register bank.
For example, the absolute address of “Register_2” is <base address> + 0x040. In
column J, the dimension of a register is defined. This can be used to easily define a
varying number of identical registers. In this case, the dimension of every register is
1, so there is only one of each register. If the dimension of some register were, for
instance, 3, three identical registers would be generated.
In column G of the Excel table, the different fields of a register are defined. A
field is a certain range of bits of the register. For example, the register “Register_1”
consists of three fields: “Register_1_1”, “Register_1_2” and “Register_1_3”. The
range of bits that each field covers can be seen in the columns H and I. By looking at
these columns, it can be seen that each register is 32 bits wide. For the register “Reg-
ister_1”, bits 31-21 are covered by the field “Register_1_1”, bits 20-11 are covered
by the field “Register_1_2”, and bits 10-0 are covered by the field “Register_1_3”.
The field coverage areas for the registers “Register_2” and “Register_3” can be found
similarly.
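As a worked example of this field layout, a field covering bits msb..lsb can be extracted from a 32-bit register value with a shift and a mask:

```python
def field(value, msb, lsb):
    """Extract bits msb..lsb (inclusive) from a 32-bit register value."""
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)

# "Register_1" layout from Figure 18: bits 31-21 are Register_1_1,
# bits 20-11 are Register_1_2, and bits 10-0 are Register_1_3.
reg = 0xFFE00000                # only the bits of Register_1_1 set
print(hex(field(reg, 31, 21)))  # → 0x7ff
print(hex(field(reg, 20, 11)))  # → 0x0
```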
Finally, the columns K and L define the access type and the default value of the
register fields. The access type defines the operations that the SW can perform on a
register. Some common types are displayed in Figure 18, namely read-write (RW, the
SW has both read and write access), write-only (WO, the SW has only write access),
and read-only (RO, the SW has only read access). It can be seen that all fields of the
register “Register_1” have RW access, all fields of the register “Register_2” have
WO access, and the field “Register_3_1” has RO access. In column L, the default
value of each field is defined.
Considering the Excel table in Figure 18, it can be seen why JSON is a suitable data
format for storing information about the interfacing register banks. The fields in the
first row of the Excel table can be seen as properties that belong to the register bank.
Every field is associated with a varying number of values, and some of these values
may further be associated with other values. As an example, the field “register” is
associated with three registers, and each of them is further associated with at least
an address, dimension and some number of fields. Each of these fields is in turn
associated with other values. The two fundamental data structures of JSON can be
well applied to such information relations.
{"field" : "Register_1_2",
"start" : 20,
"stop" : 11,
"rw" : "RW",
"default" : "0x0",
"function" : "",
"description" : "",
"comment" : ""},
{"field" : "Register_1_3",
"start" : 10,
"stop" : 0,
"rw" : "RW",
"default" : "0x0",
"function" : "",
"description" : "",
"comment" : ""}
],
"address" : "0x0",
"dim" : 1,
"function" : "",
"description" : "Example description",
"comment" : ""},
{"register" : "Register_2",
"fields" : [
{"field" : "Register_2_1",
"start" : 31,
"stop" : 1,
"rw" : "WO",
"default" : "0x0",
"function" : "",
"description" : "",
"comment" : "Example comment"},
{"field" : "Register_2_2",
"start" : 0,
"stop" : 0,
"rw" : "WO",
"default" : "0x0",
"function" : "",
"description" : "",
"comment" : ""}
],
"address" : "0x040",
"dim" : 1,
"function" : "",
"description" : "",
"comment" : ""},
{"register" : "Register_3",
"fields" : [
{"field" : "Register_3_1",
"start" : 31,
"stop" : 0,
"rw" : "RO",
"default" : "0x0",
"function" : "",
"description" : "",
"comment" : ""}
],
"address" : "0x080",
"dim" : 1,
"function" : "",
"description" : "",
"comment" : ""}
],
"hdl_path" : "",
"function" : "",
"description" : "",
"comment" : "Type_1",
"base" : "0x00300000"
}
It can be seen from Listing 1 that the JSON file contains exactly the same infor-
mation as the corresponding Excel file. However, the base address in the JSON file
is associated with a separate field “base”, even though in the Excel file it is stored
under the comment field. This approach has been selected because the comment field
may contain much more information than just the separation between specific and
common blocks, and thus parsing the base address from the comment field would be
prone to errors.
Let us now consider the actions that are required from the integration teams. In the
proposed new method, this process begins with a team fetching the top-level Excel
file from the version control. Afterwards, every sub-Excel file that is found from the
top-level Excel file is extracted to a separate Excel file and all these files are stored
to a specified directory. The next step in the process is to convert every extracted
Excel file into a JSON format. This is done to enable the development and usage of
simple and robust scripts for accessing and modifying the register banks.
After all the JSON files have been generated, the next step in the process involves
synchronization between the top-level Excel file and the PM that is currently used
by an integration team. As all the integration teams have accepted the modifications
in the review of the top-level Excel file, an integration team naturally wants to
replace the old IP blocks and register banks with the modified ones. To achieve this,
a script for comparing two JSON files was developed.
This script is used such that every JSON file that was generated based on the top-
level Excel file is compared with a corresponding old JSON file that was produced
when the integration team had previously fetched the top-level Excel file from the
version control. If a difference between the two JSON files is recognized, the old
JSON file is replaced by the new JSON file (similarly for Excel files), and a new
XML register bank and a VHDL register bank are generated based on the new Excel
file. These can then be used for the generation of the PM XML file and the bitstream.
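The essence of such a comparison is that JSON documents can be compared structurally rather than line by line; a minimal sketch with hypothetical register bank contents:

```python
import json

def banks_differ(old_text, new_text):
    """Compare two register bank JSON documents structurally; whitespace,
    line breaks and key order do not affect the result."""
    return json.loads(old_text) != json.loads(new_text)

old = '{"register": "Register_2", "address": "0x040", "dim": 1}'
new = '{ "dim": 1,\n  "register": "Register_2",\n  "address": "0x040" }'
print(banks_differ(old, new))       # → False (only the formatting differs)

changed = '{"register": "Register_2", "address": "0x080", "dim": 1}'
print(banks_differ(old, changed))   # → True (the address changed)
```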
The previous step illustrates one of the advantages of using JSON format instead of
Excel format for storing the information about the register banks. The comparison
of the register banks is considerably more robust and simpler to implement using
JSON files, compared to using Excel files. The file format of JSON is very clearly
structured and organized, and for example empty spaces or empty lines in the JSON
file have no effect on how a computer interprets the file. On the other hand, the
data format in an Excel file is much looser: just a few empty rows or columns
in a specific place could cause the comparison algorithm to fail. In addition, JSON
files are more lightweight than Excel files. Due to these advantages of JSON files
over Excel files, it would be preferable to use only the JSON files in the development
process. However, the script that produces XML and VHDL files based on the Excel
files uses tools provided by a company called Magillem, and these tools require that
the input of the script is an Excel file in a certain format. It would require much effort
to create a script that would convert the JSON file directly to the XML and VHDL
files, and therefore the implementation of such a script is out of the scope of this
thesis. However, it could be done as a future improvement.
Even though the JSON files offer several advantages over the Excel files, they also
have certain disadvantages. The register bank files are designed to be read and
modified by humans, and therefore it is highly important that they are simple to
read, interpret and modify. This is one area where the Excel format outperforms
the JSON format. With JSON files, the user must take care that, in addition to the
text itself, the braces, brackets, quotation marks and commas are formatted correctly
so that the file remains valid JSON. Thus, to completely replace the Excel files with
JSON files, it would be desirable to develop a user interface for modifying and viewing the
contents of the files.
Now that the differences between the top-level Excel file and the previously used
register banks have been synchronized, the integration team can, if desired, propose
new modifications for the IP blocks to other teams. After that, the final step in the
process is to merge the (possibly) modified Excel files to the top-level Excel file. Each
Excel file consists of four sheets, and each sheet must be merged into the corresponding
sheet of the top-level Excel file. To make this step effortless, a Python script was
created for merging the Excel files into the top-level Excel file.
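The sheet-by-sheet merging can be sketched as follows. To keep the example self-contained, each workbook is modelled as a dict mapping sheet names to lists of rows rather than as an actual .xlsx file; the real script operates on workbooks, but the per-sheet logic is the same.

```python
# Each workbook is modelled as a dict mapping sheet name -> list of rows.
def merge_workbooks(top_level, sub_files):
    """Append the rows of every sheet of each sub-file to the
    corresponding sheet of the top-level workbook."""
    for sub in sub_files:
        for sheet, rows in sub.items():
            top_level.setdefault(sheet, []).extend(rows)
    return top_level

top = {"registers": [("Register_1", "0x0")], "info": []}
sub = {"registers": [("Register_9", "0x100")], "info": [("IP_9",)]}
merged = merge_workbooks(top, [sub])
print(len(merged["registers"]))     # → 2
```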
All the previously described steps in the new development process are combined into
two Python scripts. The first one implements all steps from fetching the top-level
Excel file from the version control to generating the necessary XML and VHDL files
based on the comparison of the JSON files. The other script is used to merge all the
desired Excel files into the top-level Excel file. This makes it very straightforward
for an integration team to perform these steps, enabling the team to focus on more
complicated work. In conclusion, these two top-level Python scripts perform the
following steps:
1. Extract every sub-Excel file in the fetched top-level Excel file into a separate
Excel file.
2. Convert each extracted Excel file into a JSON file.
3. Compare each produced JSON file with the older version of the same JSON
file that is stored in the local directory of the integration team.
4.1. If differences are found in the comparison and the JSON file represents a register
bank that is used in the specific design variant, generate the XML and VHDL
description of the register bank based on the corresponding Excel file.
4.2. If there are no differences, or the JSON file represents an IP block that is not
used by the integration team, move on to the next JSON file.
5. When all JSON files are compared, replace all the old JSON files with the new
JSON files.
6. Merge all the Excel files together to form the top-level Excel file.
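The steps above can be condensed into a skeletal driver. The data shapes and helper logic below are stand-ins for the real extraction, conversion and generation stages, not the actual scripts:

```python
# Skeletal driver for the synchronization steps: the top-level data, the
# local copies and the scope sets are simplified stand-ins for the files.
def synchronize(variant, top_level_banks, local_banks):
    """Return the banks whose XML/VHDL must be regenerated for a variant:
    those in the variant's scope that differ from the local copies.
    Afterwards, replace the local copies with the new ones (step 5)."""
    regenerate = [name for name, (data, scope) in top_level_banks.items()
                  if variant in scope and local_banks.get(name) != data]
    local_banks.update({n: d for n, (d, _) in top_level_banks.items()})
    return regenerate

top = {"Common_IP_1_regs":   ({"address": "0x0"},   {"A", "B"}),
       "Specific_IP_3_regs": ({"address": "0x080"}, {"B"})}
local = {"Common_IP_1_regs": {"address": "0x0"}}
print(synchronize("B", top, local))   # → ['Specific_IP_3_regs']
```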
Finally, the team that created the top-level Excel file pushes it to the version control
system where other teams can easily review it and pull it to their local working
directories for modification and synchronization. This process is repeated as the
development proceeds.
The activity diagram in Figure 19 illustrates the proposed new workflow. For
the sake of simplicity, the diagram illustrates an example of a workflow that includes
only two variant integration teams, which are noted as team A and team B. This
diagram describes a use case where team A first takes a newer version of a common
IP block into use, and subsequently team B updates its local design to comply with
the modified top-level Excel file. In addition, there is also a UPM team that
finally generates the UPM that covers both variants. The diagram in Figure 19 is
color-coded similarly as the diagram in Figure 17.
[Figure 19: Activity diagram of the proposed workflow, in which the Excel files are converted to XML and VHDL files as in Figure 17.]
Let us compare this diagram to the activity diagram in Figure 17. The upper part of
the diagram is identical to that of the diagram in Figure 17, describing the submodule
teams and their workflow. The lower part, on the other hand, shows the automated
steps of synchronization between the top-level Excel file and the local development
environments of the integration teams. The most significant difference is that now
there is organized communication between the integration teams A and B via review
of the top-level Excel file. As can be seen at the bottom left corner of the diagram,
team A does not take the modifications to the common IP block into use before the
changes have been reviewed and accepted by team B.
Another observation that can be made is that the integration teams no longer
provide PM XML files as in the diagram in Figure 17. The reason for this is simply
that the PM XML files of different design variants are not needed anymore, because
the top-level Excel file describes the UPM directly. This clearly illustrates that the
PMs of the variants are not needed as such; their main purpose is
to serve as “tools” for the generation of the UPM. This also holds for the original
FPGA workflow. Because of these differences, the UPM is not generated by merging
XML files together as in the old process. Instead, the UPM generation script in the
proposed method is the same C shell script that was used for generating the PM
XML files of the design variants in the old process. The only difference is that in
the new process, the inputs of that script include the XML files corresponding to
every sub-Excel file included in the top-level Excel file. This also means that the
C header files, that were previously generated by the L1 SW based on the UPM
(see chapter 3.1), can now be generated directly by the HW teams, based on the
top-level Excel file. As can be seen from Figure 19, the base addresses for the UPM
generation script can be acquired from the top-level Excel file with a script, and also
the generation of the UPM has been fully scripted.
The main advantage of the proposed improvements is that the time required to
perform the described actions is reduced significantly. Below are two tables, each
illustrating the reduction of the execution time of the corresponding phase of the design flow.
Table 3 shows execution times of gathering the base addresses and writing them to
the PM generation script. Four different methods of gathering the base addresses
were examined: using the developed script, using a separate Excel file that contains
all the base addresses, using the address map table found in the platform designer,
and using the .qsys file, generated by the platform designer. All the base addresses
were removed from the PM generation script before measuring the execution time.
As can be seen from the table, the method that uses the script for fetching the
base addresses is by far the fastest. The problem with the other methods is that
it is quite slow for a human to find the correct component in a given source.
Especially the address map table of the platform designer is quite complicated and
difficult to read for a human. Another advantage of using the script is that it is less
prone to errors than manually reading the addresses from a source file.
Table 4 shows the measured time that it takes to create the UPM. This time
consists of two parts: generating the XML file that contains all the register banks
that are not included in the master variant (the “variant XML file”), and executing the
script that merges the generated XML file with the XML file of the master variant.
Generating the variant XML file was performed in two ways: using a script, and
manually copying the register bank information from the separate XML files into the
variant XML file. As can be seen from the table, this process is clearly faster when
the script is used.
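The merge step is simple in principle: the variant-specific register banks are appended to those of the master variant. A minimal Python sketch of the idea (the `registerBank` element name and the file layout are assumptions for illustration; the actual script is a C shell script operating on the real PM XML schema):

```python
import xml.etree.ElementTree as ET

# Invented stand-ins for the master variant XML and the variant XML file.
MASTER = "<programmingModel><registerBank name='common_dl'/></programmingModel>"
VARIANT = "<programmingModel><registerBank name='variant_a_fft'/></programmingModel>"

def merge_upm(master_text, variant_text):
    """Append the variant-specific register banks to the master variant's banks."""
    master = ET.fromstring(master_text)
    variant = ET.fromstring(variant_text)
    for bank in variant.findall("registerBank"):
        master.append(bank)
    return master

merged = merge_upm(MASTER, VARIANT)
print([b.get("name") for b in merged.findall("registerBank")])
```

The scripted merge is also order-preserving, which matters because the register banks common to every variant must appear identically in each generated UPM.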
Let us consider the advantages of the proposed new method. Because the main goal
of the process is to ensure that every released bitstream is always compatible with
the UPM that is in use, the new process will be considered first from this viewpoint.
As described in chapter 3.2.2, two scenarios of programming model modifications
exist: a modification of a variant-specific IP block and a modification of a common
IP block. The first issue to note is that the proposed method assumes that every
modification to a register bank complies with the open-closed principle introduced
in chapter 3.1. Modifications that do not comply with this principle require special
arrangements and are out of the scope of this thesis. From now on, every
modification discussed in this thesis is assumed to comply with the open-closed
principle, and this will not be emphasized separately.
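In the register bank context, complying with the open-closed principle means that a new version of a register bank may only add registers; it must never remove or redefine existing ones. A minimal sketch of how such compliance could be checked automatically (the dict-based bank representation below is an assumption made for illustration, not the actual PM format):

```python
def complies_with_open_closed(old_bank, new_bank):
    """True if every register of the old bank survives unchanged in the new bank.

    Banks are modeled as {register name: offset} dicts; new registers may be
    appended, but existing registers must keep their names and offsets.
    """
    return all(new_bank.get(name) == offset for name, offset in old_bank.items())

old = {"ctrl": 0x00, "status": 0x04}
extended = {"ctrl": 0x00, "status": 0x04, "irq_mask": 0x08}  # appended register
moved = {"ctrl": 0x00, "status": 0x08}                       # offset changed

print(complies_with_open_closed(old, extended))  # True: only an addition
print(complies_with_open_closed(old, moved))     # False: existing register moved
```

A compliant modification thus never breaks SW that was built against the older bank layout, which is precisely why the proposed method can assume it.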
In the case of a variant-specific IP block, the integration team of the corresponding
variant is the only one that desires to take an updated version of the IP block into
use. Therefore,
the team may independently take a new version of the IP into use, generate and
release a new bitstream using it, and merge the modified register bank into the
top-level Excel file. The other integration teams are not involved in this process.
Even though the UPM will be updated in such a modification (because the top-level
Excel file is updated), this will not cause conflicts with the bitstreams. This is due
to the bitstreams of the other FPGA design variants not using the variant-specific
register bank that was modified.
For updates to the common IP blocks, the situation is more complicated, because
every bitstream variant uses the same common IP blocks. Therefore, an integration
team cannot independently take an updated common IP into use. Instead, it must
inform the other integration teams that it desires to take a new version of the IP
block into use. The idea of the top-level Excel file is to make this communication
as straightforward as possible. The team may merge the proposed updated register
bank Excel file into the top-level Excel file, and ask the other integration teams to
review and accept it. By using the top-level Excel file, it is effortless for all the teams
to not only review the actual proposed change, but also to check whether it conflicts
with any of their variant-specific IPs, or planned updates to them.
The new process introduces several advantages, but it still requires communication
between the integration teams, which makes the workflow more complicated for
them. Even though the teams would prefer to work independently, communication is
necessary because common IP blocks cannot be taken into use without synchroniza-
tion between the integration teams. This synchronization causes problems because
the FPGA design variants differ considerably, and the development of the variants
does not proceed at the same pace. Therefore, the communication may slow down
the FPGA development, but it is a compromise that must be made to ensure that
the UPM can be used to develop SW for every FPGA design variant. This is the first
priority in the FPGA development.
The fundamental reason for the need for communication is that, to keep the devel-
opment fluent, the variant integration teams must agree on the common IP blocks
in use before each bitstream variant is generated. If, on the other hand, the UPM
generation team strives to generate the UPM based on the PM XML files that the
integration teams have produced without synchronization, the problem of unequal
common register banks will eventually occur. This will require one or more
integration teams to take a different common IP block into use, regenerate the
bitstream, redo several functional verifications, and so on, which will significantly
slow down the process.
The above problem can be avoided by synchronizing the common IP blocks in use
before the bitstream is generated. The most reasonable method to do this is
manual communication between the variant integration teams. Even if there were a
script that compared the IP blocks of one integration team to the IP blocks of
all the other teams before the integration team uses the IPs to create a new bitstream,
the process would still work quite poorly, because the script cannot know which
phase of development the other teams are in. For example, some team might
have plans to release a bitstream with a different common IP in the very near future.
In addition, when one integration team takes a new version of a common IP block
into use, the script would notify the other teams that the versions of the common
IP block do not match, which would again require synchronization between the teams.
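The mechanical part of such a comparison script would be straightforward; what it cannot capture is the development phase of each team. A hedged sketch, with invented team and IP block names:

```python
def find_version_conflicts(teams):
    """Report common IP blocks whose versions differ between integration teams.

    `teams` maps a team name to {IP block: version}; any IP block used by more
    than one team with differing versions is a conflict requiring synchronization.
    """
    versions = {}  # IP block -> {version: [teams using that version]}
    for team, ips in teams.items():
        for ip, ver in ips.items():
            versions.setdefault(ip, {}).setdefault(ver, []).append(team)
    return {ip: vers for ip, vers in versions.items() if len(vers) > 1}

teams = {
    "variant_a": {"common_dma": "2.1", "fft_a": "1.0"},
    "variant_b": {"common_dma": "2.0", "beamform_b": "3.2"},
}
print(find_version_conflicts(teams))  # only common_dma conflicts
```

The script can only report that `common_dma` versions diverge; it cannot tell whether variant_b is about to release a bitstream with version 2.1 anyway, which is exactly the information that requires human communication.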
Even though it was observed that an FPGA development process without syn-
chronization between the teams would be very difficult, to say the least, this does
not mean that major improvements cannot be achieved with technical methods. As
an example, the HW teams are currently developing a method to automatically
generate two design variants based on only one design. The idea is that one inte-
gration team would use the platform designer to create a variant that contains all
the common IP blocks and the specific IP blocks of two different (but quite similar)
variants. Afterwards, the .qsys file, which is essentially a text-based description of
the connections between the IP blocks, could be modified with a script such that
the two different FPGA variants would be “extracted”. Because the variants differ
only in the used variant-specific IP blocks, the extraction can, in principle, be done
by only removing certain IP blocks and connections from the .qsys file. However, it
has been noted that this approach cannot be used to generate all the design variants
from a single master variant, because the differences between the variants are too
significant. Thus, this method will not remove the need for communication between
the teams.
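The extraction step described above could, in principle, be sketched as follows; the XML layout is a heavily simplified stand-in for the real .qsys format, and the module names are invented:

```python
import xml.etree.ElementTree as ET

# Combined design containing common IP and the specific IPs of two variants.
COMBINED = """
<system>
  <module name="common_dma"/>
  <module name="fft_a"/>
  <module name="beamform_b"/>
  <connection start="common_dma.m0" end="fft_a.s0"/>
  <connection start="common_dma.m0" end="beamform_b.s0"/>
</system>
"""

def extract_variant(qsys_text, drop_modules):
    """Remove the given variant-specific modules and their connections."""
    root = ET.fromstring(qsys_text)
    for elem in list(root):  # iterate over a copy so removal is safe
        name = elem.get("name", "")
        endpoints = {elem.get("start", "").split(".")[0],
                     elem.get("end", "").split(".")[0]}
        if name in drop_modules or endpoints & drop_modules:
            root.remove(elem)
    return root

# "Extract" variant A by dropping the IP specific to variant B.
variant_a = extract_variant(COMBINED, {"beamform_b"})
print([m.get("name") for m in variant_a.iter("module")])
```

Because only whole modules and their connections are removed, the common part of the design stays byte-for-byte identical between the two extracted variants, which is what keeps their register maps aligned.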
The most notable advantage of maintaining the top-level Excel file by the submodule
teams is that there would be no need for synchronization between the integration
teams via the top-level Excel file. The submodule teams could simply develop new
versions of the IP blocks independently, and always update the top-level Excel file
when a new version of an IP block is released. It would then be known that, as all the
integration teams develop the corresponding bitstreams based on the top-level Excel
file, every bitstream release would, in theory, comply with the UPM. This process
would require very little synchronization between the different teams associated
with the FPGA development, which the FPGA developers would find highly desirable.
However, problems also arise in the process where the top-level Excel file is
maintained by the submodule teams. The first issue that is evident is
the requirement that when a common IP block is modified, a new bitstream should
be released with modifications to only that IP. In practice, this would be difficult
to achieve as the submodule teams work independently. If a submodule team that
develops a common IP block releases a new version of the IP, it does not know
whether some other team has released a newer version of some other IP block after
the previous bitstream was released. To overcome this problem, the submodule
teams would have to synchronize with each other, which is exactly what should be
avoided.
Another problem is that the integration teams could not work as independently
as before. For example, let us consider a situation where one integration team desires
to release a new bitstream that contains a newer version of an IP block that is specific
only to the corresponding design variant. Even if this newer version of an IP block
has been released by the corresponding submodule team, the integration team may
not be able to release a new bitstream if a modification has also been done to some
common IP block. In this case, the integration team would somehow need to know
whether all the other integration teams are already using the new version of the
common IP.
The above problems could be avoided if it were assumed that there is a notable
time interval between consecutive IP block version releases. In this scenario, the
integration teams would create a new FPGA design bitstream after each release of
a new IP block version. However, this is not a feasible assumption. In practice, there
are numerous submodule teams working independently, and thus it is highly unlikely
that there would always be a sufficient interval between two subsequent IP block
updates. For these reasons, this process was not examined further in this thesis.
Chapter 2 of this thesis first briefly discussed mobile communications and the rele-
vant technological aspects of 5G networks. In particular, it was argued that
the strict real-time requirements of 5G L1 favor the usage of HW instead of SW
for implementing the necessary computations. Afterwards, the general architecture
and usage of FPGAs were presented, and it was discussed why FPGAs provide an
attractive option for 5G HW implementations. Finally, the general requirements and
architectural aspects of interfacing between HW and SW were presented.
Chapter 2.6 presented some general guidelines for team-based FPGA development.
It was observed that many companies involved in FPGA development share the
same challenges in the development process. Two essential roles in team-based
design were recognized: a single team leader who takes care of the top-level plan-
ning and integration, and the other team members who are responsible for the
IP block development. Including SW in the development process complicates the
development, because the register address map must be maintained across sev-
eral parts of the development team. To keep the process organized, it is highly
important to strictly manage the register map and to communicate every change
of the map to every group associated with the embedded system development process.
The current FPGA development workflow and the proposed improvements were
discussed in chapter 3. It was noted that even though the workflow associated with
generating a single FPGA design variant follows the best practices of team-based
FPGA design, the overall workflow including all the different variant-integration
teams and the UPM generation team does not follow them. This is due to the lack of a
single team leader for the integration teams, and the lack of necessary communication
between the integration teams. These problems were addressed by creating a top-level
Excel file, which serves as a common interface for all the integration teams to modify
the register banks and to communicate the modifications to the other integration
teams. In a way, the top-level Excel file can be seen as a replacement for the single
team leader that was concluded to be missing from the current workflow. In addition to
the major modifications, two minor steps in the current workflow, namely the
acquisition of the base addresses of the interfacing register banks and the generation
of the UPM based on the PM XML files of the design variants, were automated with
scripts.
For the larger modification to the development process, it was argued that the
proposed process essentially ensures that the generated bitstreams of each variant are
always compatible with the most recent version of the UPM. However, this requires
direct communication between the variant integration teams, and it may also slow
down the process because the integration teams must always synchronize with each
other before applying IP block modifications. In addition, it was noted that a
project is currently under development that would allow the integration teams to
generate two FPGA variants from only a single design. This would reduce the
number of register maps, making it easier to synchronize the bitstreams.
One important aspect of the proposed process is the effort to replace the Excel files
with the JSON file format. It was shown that the format of the register bank
information stored in the Excel files can be adapted very well to JSON, and a
script for converting an Excel file to a JSON file was developed. However, there are a
few challenges in replacing Excel files with JSON files. The first one is that a script
for converting a JSON file into an XML file still needs to be developed. Secondly,
even though JSON is a lightweight data format and supports scripting well, modifying
or reading large register banks is still simpler in Excel format than in JSON format.
Therefore, a simple user interface that allows effortless review and modification of
the JSON files should be developed. These problems could be addressed as future
improvements.
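The missing JSON-to-XML converter could mirror the structure of the existing Excel-to-JSON script. A hedged sketch, using an invented register bank schema that is not the actual format produced by the developed script:

```python
import json
import xml.etree.ElementTree as ET

# Invented JSON layout for one register bank; the real layout follows the
# format produced by the Excel-to-JSON script described in the thesis.
BANK_JSON = """
{
  "name": "dl_ctrl",
  "registers": [
    {"name": "enable", "offset": "0x00", "width": 32},
    {"name": "status", "offset": "0x04", "width": 32}
  ]
}
"""

def json_bank_to_xml(text):
    """Convert a register bank JSON description to a PM-style XML element."""
    bank = json.loads(text)
    root = ET.Element("registerBank", name=bank["name"])
    for reg in bank["registers"]:
        ET.SubElement(root, "register", name=reg["name"],
                      offset=reg["offset"], width=str(reg["width"]))
    return root

xml_text = ET.tostring(json_bank_to_xml(BANK_JSON), encoding="unicode")
print(xml_text)
```

Since both formats are hierarchical, the conversion is largely mechanical; the harder part, as noted above, is providing a comfortable review interface on top of the JSON files.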
As a conclusion, this thesis proposed a few methods to improve the currently used
FPGA development process at Nokia, and it also described a larger update to the
development flow. The aim of these modifications was to keep the process simple for
the HW designers and to ensure that the generated FPGA bitstreams always comply
with the UPM.