

A scalable and high-performance FFT processor,
optimized for UWB-OFDM

Ramesh Chidambaram

Advisors:
Dr. Rene van Leuken (Delft University of Technology)
And
Ir. Jos Huisken (Silicon Hive, Philips Technology Incubator)

November 2004 – July 2005

Microelectronics
Department of Electrical Engineering
Faculty of Electrical Engineering, Mathematics and Computer Science
“Thaivathal agatheninum muyarchi than mey varuthak kooli tharum”
The Thirukural, verse 619 (A Tamil epic written by Valluvar, dated 500-550 AD)

Interpretations:
Even a task rendered impossible by divine fate can be conquered by sheer human effort
(And)
Even the effort invested into a task made impossible by fate, has its own rewards

“Dance like a butterfly, sting like a bee!”
- Muhammad Ali (born Cassius Clay)

Preface

Retargetable application specific instruction-set processor (ASIP) based designs aim to match
the extent of programmability of a general-purpose solution and yet provide efficient
performance, in terms of data computation rate and area utilization, as delivered by an
application specific integrated circuit (ASIC). The present work examines the design of such a
retargetable core and the accompanying embedded software for a very high performance Fast
Fourier Transform (FFT). The underlying aim is to raise the performance calibre of ANSI-C
programmable ASIP designs to match applications that until now belonged exclusively to the
domain of ASIC solutions.
One of the primary project objectives is that the application should be easily scalable in the
order of the FFT: it should cater to any FFT order with minimal changes to the hardware.
Secondly, from a performance perspective, the design should be capable of meeting the
requirements of a highly time-critical FFT used for orthogonal frequency division multiplexed
ultra wideband (UWB-OFDM), i.e. compute a 128-point FFT within 312.5 ns.
At present, the standardization of UWB-OFDM is still to be finalized by the IEEE
802.15.3a workgroup. Such a design therefore amounts to minimizing risk while producing a
core for an application before it is comprehensively standardized. This approach brings the
significant business advantage of being first to market, and also helps gather useful research
regarding implementation issues.

Overview

The processor design, with the generation of the corresponding embedded software, is
performed using the coarse-grained retargetable computing technology proprietary to Silicon
Hive. Silicon Hive belongs to the portfolio of the Philips Technology Incubator and is a fully
owned business of Philips Electronics N.V. The technology has been developed and patented by
Philips. Chapter 1 describes the philosophy behind ASIP design and how the design template of
Silicon Hive is utilized for the same.
There is a plethora of literature published that relates to algorithms for implementing the
Discrete Fourier Transform (DFT). However, algorithmic and hardware implementation
perspectives greatly differ. Chapter 2 addresses this issue and compares various algorithmic
techniques – including Radix-2, higher radix, the constant geometry approach, multiple radix,
etc - from the standpoint of a retargetable ASIP.
Aspects of the UWB-OFDM proposal that are pertinent to the FFT block are examined
in Chapter 3. This chapter also deals with formulating the FFT block’s requirements and
shedding light on the design challenge involved.
Chapter 4 describes the approach to algorithm selection and the design choices made to
realize a scalable FFT processor, which is optimised for UWB-OFDM performance
specifications. Finally, Chapter 5 presents a compilation of obtained results and the conclusions
reached. Scope for further improvement is also discussed.

Contents
Preface V

Overview VII

Chapter 1 Overview of ASIP Design 1


1.1 Application specific, yet programmable solutions 1
1.2 ASIP as an accelerator 3
1.3 The Silicon Hive template 4
1.4 Utilizing the Silicon Hive template for designing a FFT processor 7
1.5 List of references 8

Chapter 2 Examination of FFT implementations 9


2.1 Introduction 9
2.2 The Radix-2 algorithm 10
2.3 Radix-2^v algorithms 13
2.3.1 The Radix-4 algorithm 14
2.3.2 The Radix-8 and higher radix algorithms 16
2.4 Split/Mixed Radix solutions 18
2.5 The Constant Geometry architecture 18
2.6 Other algorithms 20
2.7 Fetching twiddle factors from memory Vs a CORDIC 20
2.8 Twiddle factor driven approach 21
2.9 Summary 21
2.10 List of References 22

Chapter 3 The UWB-OFDM Standard 25


3.1 Introduction 25
3.2 The MB-OFDM UWB standard 26
3.3 Examining the complexity of the FFT for MB-OFDM UWB 29
3.4 Summary 31
3.5 List of References 31

Chapter 4 Explanation of Design Choices 33


4.1 Summary of basic requirements 33
4.2 Initial choice of algorithm 34
4.3 Data path bandwidth requirement 34
4.4 Memory ping ponging 35
4.5 Twiddle handling 36

4.6 Signal flow graph (SFG) based analysis 37
4.7 Processor architecture requirements 39
4.8 Bit reversed addressing in the last stage 41
4.9 Inter-stage scaling for SQNR 41
4.10 Conclusion 42
4.11 List of references 42

Chapter 5 Results and Conclusions 43


5.1 Numerical accuracy and SQNR 43
5.2 Scheduling results 45
5.3 Processor architecture 48
5.4 Synthesis results 48
5.5 Examining Scalability 56
5.6 Scope for further research 58
5.7 Summary of Results 59
5.8 References 60
Appendix 5.a. 61

List of Abbreviations 63

Acknowledgements 65

Chapter 1
Overview of ASIP Design

1.1 Application specific, yet programmable solutions


With the shrinkage in fabrication process feature size, Application Specific Integrated Circuit
(ASIC) manufacturing costs have grown to sizable proportions. The increase in design complexity
is accompanied by increased investment in developing design automation tools to cope with
the complexity, creating mask sets suitable for the feature size, etc. The resulting non-recurrent
costs force vendors to deal in very high volumes just to break even, or to charge exorbitant
prices per unit. In such a constricting environment, the risks that stem from changes in
standards, minor variations in requirements, etc become overwhelming. The cost incurred by an
added redesign cycle is prohibitive for small design changes. An amalgam of factors
along such lines [1] has led to a paradigm shift in design strategy. Programmable solutions that
provide a level of performance comparable to that of ASIC designs are becoming more popular.
Such solutions have come to be referred to as Application Specific Instruction set processors, or
ASIP for short. The paramount objective in ASIP design is to allow a family of related
applications to be capable of efficiently compiling on the same target device. An ASIP solution
is an enhancement over a traditional DSP processor solution owing to the added inclusion of
pertinent application specific instructions. In the past however, hardware flexibility has come at
the price of performance limitations. By contrast, in ASIP designs the objective is to adjust the
extent of programmability so as to achieve sufficient flexibility and performance, and yet restrict
costs within acceptable limits. Since an application driven enhancement in the instruction set
translates into specialized hardware blocks, adjusting the extent of programmability is facilitated
by user dominated hardware/software codesign. The performance level is dictated by the
exploitation of instruction level parallelism (ILP) in a Very Long Instruction Word (VLIW)
architecture. The VLIW approach allows the data path and instruction set to be optimised in
accordance with the application. Significant gains as compared to a general DSP processor are
obtained by placing specialized computational units to speed-up critical sections of an
application. The relative position of the flexibility and performance of ASIP designs taken in
comparison to programmable DSP processors and ASIC solutions is depicted in Figure 1.1.
Retargetable ASIP based designs form an alternative to ASIC and general-purpose
microcontroller based solutions. They aim to match the extent of programmability of a general-
purpose solution and yet provide efficient performance matching that of an ASIC solution.

Figure 1.1 ASIP: Programmability and performance comparisons

This type of retargetability isn’t literally an improvement in technology, but rather an
overhauling of design objectives to stay competitive in the embedded systems market. Contrary
to traditional ASIC design, an ASIP aims to provide a ‘generic processor that serves as a target
device for a certain family of applications’. The myriad forces that drive the market for such
solutions include frequent upgrading of standards, changes in customers’ demands, reducing
time-to-market overheads, making non-recurring engineering costs stay that way, logistics costs,
silicon reuse, etc. Retargetability comes at different levels, referring to the point at which
flexibility enters the system. The first type is design-level or static retargetability, which makes
the (re)design process itself flexible; examples are easy scalability of a core’s architecture or the
ability to change parameters within the application offline. The second type is retargetability
during execution, i.e. dynamic reconfigurability, invoked during run-time in an online fashion.
Here, each target configuration is considered a context, and depending on requirements the
system can seamlessly switch between one context and another.

1.2 ASIP as an accelerator
An ASIP design involves matching of the architecture to a dynamically configurable set of
applications. In order to utilize the functionality of an ASIP in realizing an embedded system
solution, a set-up as shown in Figure 1.2 is commonly used. Communication between the CPU
and ASIP is performed via the system bus. This is further aided by a direct memory access
(DMA) controller, used in conjunction with a FIFO unit to load or store bulk data. Transfers
handled by the DMA controller proceed independently of the CPU, allowing the CPU to
continue performing other tasks while the DMA engages the ASIP.

Figure 1.2. ASIP as an accelerator in an embedded solution

The CPU instructs the DMA to fetch a block of instructions from the Program and data memory
and store it on the internal-program memory of the ASIP. The FIFO is used to control data flow
between DMA and ASIP by passing tokens back and forth. The list of operations in an
application is partitioned into those executed on the CPU itself and those dispatched to the
ASIP. This is the function of partitioning in a user-dominated hardware/software codesign
approach. Since a VLIW-type ASIP is adopted, a spatial compiler is also used to bind
instructions to a set of hardware resources. On occasion, an operation requires the use of two or
more non-adjacent hardware units; the spatial compiler must then deduce the hardware resources
required and the on-chip data routing needed to realise such an operation. In digital circuits the
predominant expenditure of energy comes from switching. Given that the VLIW instruction
word drives the activation and deactivation of bus lines and functional clusters, an optimal
operation refinement scheme is imperative, especially in the case of low-power implementations
of highly time-critical systems. In a VLIW processor, the instruction is mapped directly as a
control vector for the hardware. Hence the op-code is generated in a binary format from the
instructions specified in the software application kernel. Thus,

optimisation of the instruction word length, extent of encoding, efficient operation grouping, etc
are part of the tool’s functions in generating a controller that works the data-path. The VLIW
instructions have to be tailored in a few ways for effective hardware utilisation and software
efficiency. Firstly, when the ASIP is made up of multiple IP Cores, the tool complexity to
formulate VLIW instructions increases. The encoding scheme for instruction words is optimised
to have a suitable Hamming distance between them and also exploit parallelism. VLIW
instruction word formulation for exploiting ILP can be performed statically (offline) or
dynamically (during run-time, as in the case of super-scalar architectures). Another level of
instruction tailoring targets efficient hardware resource utilization by techniques such as
pipelining and software loop folding. Given the effort required to construct optimal VLIW
instructions from an application kernel specified in a high-level language (like C), the
compiler’s capability is of substantial importance. For ease of design, the focus of most system
design flows is restricted to data-path construction. This translates into a number of functional
units (ALUs, MACs, SHUs, etc) and storage-related units like registers, load/store units, etc.
The control section, comprising program instruction memory, address generation, addressing
flags, etc, is automatically generated to satisfy the requirements of the data-path section.

1.3 The Silicon Hive template


Silicon Hive is a fully owned enterprise of Philips Electronics N.V. that licences ASIP solutions
along the lines described in the foregoing section. All processors are constructed according to
certain basic guidelines that together comprise the Silicon Hive template. Architectures developed
using the Silicon Hive template incorporate a coarse-grained approach to realizing the
specifications required of an ultra-long instruction word (ULIW) type ASIP accelerator. The
predominant objective behind such a template is to facilitate automatic generation of processors
and the corresponding enhanced instruction set from a concise machine description. The benefit
of the coarse-grained approach derives from using much denser logic units, which are
optimised with regard to parameters such as performance, power dissipation, cost, etc. The basic
building block of the architecture template is the Processing and Storage Element (PSE) [2]. A
PSE is a VLIW-like data path consisting of several interconnection networks (IN), a plurality of
operation issue slots (IS) with their corresponding functional units (FU), distributed
register files (RF) and local memory storage (MEM) accompanied by a load/store unit (LSU).
A PSE is configurable, and a general schematic of such a PSE is depicted in Figure 1.3. The
usage of such a PSE with a controller (CTRL) and configuration memory (CONFIG) in the

construction of a processor cell is depicted in Figure 1.4. As depicted, several PSE units
interact by means of intra-cell communication lines (CL).

Figure 1.3. General schematic of a Processing and Storage Element (PSE)

Figure 1.4. Usage of Processing and Storage Elements (PSE) in building a Cell

A more holistic view of the ULIW architecture [3] that makes up the ASIP is depicted in
Figure 1.5. The architectural template shown in Figure 1.5 is configurable towards a particular
family of applications. Silicon Hive’s proprietary design methodology involves automated
instantiation of such an architecture from a hardware specification, written in a highly abstract
hardware description language called TIM.

Figure 1.5. Holistic view of the architectural template

Using a modular, highly abstract language for hardware description enables incorporating rapid
changes to the hardware architecture, thus easing design iterations for co-simulation. A graphical
view of the data path of a processor (titled Moustique MPP ME, Copyright Philips Electronics
NV) developed using this template and methodology is depicted in Figure 1.6.

Figure 1.6. Data path view generated for example processor architecture

The design flow begins with formulating a machine description and a concomitant application.
The former serves to generate the instruction set for cosimulation, and is also the input to the
hardware design flow. The latter is partitioned and compiled over a model emulating the

behaviour of the processor. This process of co-simulation comes in different flavours, each
catering to different simulation objectives. A graphical depiction of the design flow and co-
simulation strategy that leads to generating the processor architecture side-by-side with a
pertinent application is shown in Figure 1.7.

Figure 1.7. User-dominated codesign for processor and application generation

1.4 Utilizing the Silicon Hive template for designing a FFT processor
The Silicon Hive template is used to generate an ASIP. The present work exploits many of the
advantages it provides in the design of a streamlined ASIP, such as the ease of effecting rapid
design iterations for user-dominated hardware/software codesign, automatic processor
generation, and easy application compilation and cosimulation. These features are used in
designing a FFT processor and writing the concomitant application kernel.
The characteristics that define the present target processor are:
1. The processor caters efficiently to an application kernel – it is application specific in that
sense – but isn’t constructed with priority towards a plurality of kernels. In the present
case, it refers to the processor being capable of servicing a Fast Fourier Transform (FFT)
algorithm (viz. the application kernel), but not to switch contexts to other related kernels
such as Minimum Mean-Square error (MMSE) algorithm, Channel equalization, etc.
2. The processor description is such that it is possible to rapidly prototype a target device
given different requirements. For example, if the bit-reversed addressing of the last stage
of an in-place FFT algorithm can be omitted, then the issue slot that handles it can be
excluded from the generated architecture, with advantages in program memory width,
area, power, cost, etc. This exploits the rapid prototyping and automatic core generation
inherent in the Silicon Hive template.
3. The application should be generic, in the sense of being easily adaptable, encompassing
all special cases, being efficiently scalable, etc. And yet it is meant to target one
particular specialized case and satisfy all specifications meant for it. In the present case,
this implies that the application and core should possess quite generic capabilities, and
yet attain all requirements of the 128-point FFT for UWB-OFDM specifications.
4. In order to be fully compatible with the compiler and the ANSI-C based programming
environment, a 16-bit based data value width is chosen. In an ANSI-C based application-
programming environment, the short data type represents a 16-bit value.
5. It would be possible to utilize the arithmetic and logic units to embed a few other
simple application kernels (such as a FIR filter). However, these applications are not
accompanied by guarantees.
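To illustrate the 16-bit choice in point 4, a complex sample in such a kernel would typically be stored as a pair of short values in a Q1.15 fixed-point format. The following sketch is a hypothetical illustration of 16-bit fractional arithmetic in ANSI-C, not code from the actual kernel:

```c
#include <stdint.h>

/* Hypothetical Q1.15 fixed-point multiply on 16-bit data.
 * The product of two Q1.15 values is a Q2.30 intermediate;
 * shifting right by 15 bits returns it to Q1.15. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q2.30 intermediate */
    return (int16_t)(p >> 15);            /* truncate back to Q1.15 */
}
```

Here 16384 represents 0.5, so q15_mul(16384, 16384) yields 8192, i.e. 0.25. Truncation at each such step is a source of the quantisation noise examined later in terms of SQNR.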

The utility of an ASIP based solution is apparent, especially in the context of UWB-OFDM. It
essentially implies the ability to produce an IP core for a market even before standardization is
complete. Being first to market has significant business advantages, and also helps gather
research input regarding implementation ahead of ASIC competitors.

1.5 List of references


[1] Keutzer, K.; Malik, S.; Newton, A.R.; "From ASIC to ASIP: the next design discontinuity",
Proc. 2002 IEEE International Conference on Computer Design: VLSI in Computers and
Processors, 16-18 Sept. 2002, pp. 84-90.

[2] Leijten, J.; Burns, G.; Huisken, J.; Waterlander, E.; van Wel, A.; "AVISPA: a massively
parallel reconfigurable accelerator", Proc. International Symposium on System-on-Chip,
19-21 Nov. 2003, pp. 165-168.

[3] Halfhill, T. R.; "Silicon Hive Breaks Out", Microprocessor Report, December 1, 2003,
www.MPROnline.com. Available at: www.siliconhive.com

[4] Kastrup, B.; van Wel, A.; "Moustique: smaller than an ASIC and fully programmable",
Proc. International Symposium on System-on-Chip, 19-21 Nov. 2003.

Chapter 2
Examination of FFT implementations

This section examines the details of existing DFT implementations. A basic overview of each
algorithm is followed by an enumeration of its related characteristics and a discussion about its
suitability for a retargetable hardware implementation.

2.1 Introduction
Digital technology has evolved to the extent that rigorous computations can be carried
out with a high level of performance at low cost. A significant example of this would be the
present day ability to implement the operation of a Discrete Fourier transform (DFT) in
hardware. The implementation takes as many forms as the possibilities for algorithmic
optimisation. The Fast Fourier Transform (FFT) algorithm is one such example and it exploits
the symmetry and periodicity of sinusoids within the DFT summation, the definition of which is
shown in Equation (2.1).
X(k) = Σ_{n=0}^{N−1} x(n)·W_N^(kn) ,  0 ≤ k ≤ N−1 - Equation (2.1)

where W_N = e^(−j2π/N) is the twiddle or phase factor.


The notation adopted here is that X(k) represents the frequency-domain values obtained
from the Fourier transform of x(n). The computational problem for the DFT is to compute the
sequence {X(k)} of N complex-valued numbers given another sequence of data {x(n)} of length
N. The FFT algorithm uses a divide-and-conquer approach to reduce the number of
computations to arrive at the exact same result of the DFT. The Radix-2 algorithm, introduced in
the seminal publication by Cooley and Tukey [1] exploits the possibility of recursively reducing
the problem into two subproblems of half the size at every intermediate stage. Since then, there
have been many algorithms that aim to compute the DFT at reduced computational cost, given
certain constraints that make the computation efficient. In the case of the Radix-2 FFT, this
constraint is that the number of samples must be a power of ‘2’ to allow for exhaustive
application of the divide-by-2 and conquer approach. Although there are techniques like zero
padding to work around this restriction [2], in some cases it proves to be inefficient when
increased spectral resolution isn’t a priority, for example, the calculation of a 1028-point DFT
using the Radix-2 FFT algorithm. Given that the FFT has been a significant enabler in the
context of telecommunication employing frequency-domain techniques [3], its constraints are
gladly traded for its widespread utility. The same reason has spawned algorithms with still
more restricted operating domains, like the Winograd algorithm [4], the prime factor
algorithm [5], etc. The drive towards performing the Radix-2 FFT with fewer
intermediate stages gave rise to the Radix-4 and higher radix algorithms. Another significant
development is the usage of split radix and mixed radix algorithms, which have come to be the
method of choice for Application Specific Integrated Circuit (ASIC) implementations of very
specialized high performance FFTs. Another classic and elegant algorithm is the constant
geometry algorithm devised by Singleton [6], which has also found applicability in hardwired
FFT implementations.
Benchmarking of FFT algorithms from a hardware implementation’s perspective has
traditionally entailed a direct one-to-one comparison of the number of mathematical operations
involved in computing the FFT of a particular size. This has been the strategy used for
benchmarking in [7], wherein algorithms are pegged against each other for the case of a 1024-
point DFT. The trend of devising new algorithms that further reduce the arithmetic effort
involved still continues. Examples are the Decimation-in-Time-Frequency
algorithm [8] and the Quick Fourier transform algorithm [9]. While expending significant design
effort on reducing arithmetic complexity can be justified in ASIC designs, for ASIP designs the
onus is almost entirely on data handling. Here, the biggest bottleneck is the handling of a
substantial amount of data from memory in a manner that scales suitably with FFT size.
The Radix-2, Radix-4 and Split/Multiple Radix FFT algorithms are treated from an
algorithmic perspective in significant depth by most standard texts ([10], [11]).

2.2 The Radix-2 algorithm


In Equation (2.1), for each value of k, direct computation of X(k) involves N complex
multiplications (4N real multiplications) and N−1 complex additions (4N−2 real additions).
Consequently, computing all N values of the DFT requires N^2 complex multiplications and
N^2 − N complex additions. Direct computation of the DFT does not exploit the symmetry and
periodicity properties of the twiddle factor W_N, which are:
W_N^(k+N/2) = −W_N^k - Equation (2.2)

W_N^(k+N) = W_N^k - Equation (2.3)
Details about the Radix-2 FFT decimation-in-time (DIT) and decimation-in-frequency (DIF)
algorithms are described in [12], including the explanation of how the total numbers of complex
multiplications and additions are reduced to (N/2)·log2(N) and N·log2(N) respectively. A basic
unit called a butterfly performs the most fundamental operation of the FFT. The entire signal
flow corresponding to the FFT algorithm can be realized by repeated instantiation of the
butterfly unit with selectively ordered inputs. The most rudimentary butterfly unit is the DIT
Cooley-Tukey butterfly [12], depicted in Figure 2.1, where ⊗ refers to complex multiplication
and ⊕ refers to complex addition.

Figure 2.1 The Cooley-Tukey butterfly

In the Radix-2 implementation, a_N^k and b_N^k are N/2 samples apart. The Radix-2 algorithm is an
in-place algorithm, which implies that A_N^k and B_N^k are also N/2 samples apart. Thus, if a_N^k is
addressed by an index i (equal to k), and b_N^k is addressed by index i + N/2, then A_N^k and B_N^k
can also be addressed by the same pair of indices. This translates into simpler data handling in
hardware. An optimum way to implement the Cooley-Tukey butterfly is shown in Figure 2.2,
which reduces the number of complex multiplications.

Figure 2.2 The Optimised Cooley-Tukey butterfly

While these butterflies are for the DIT algorithm, the twiddle factor appears on different edges
in the DIF butterfly, since in that case it is the frequency sum that is divided into even and odd
sums. The corresponding butterfly is the Gentleman-Sande butterfly [12], shown in Figure 2.3(a).
An optimised version of the same butterfly is depicted in Figure 2.3(b).

Figure 2.3 The Gentleman-Sande butterfly:
(a) Direct algorithmic version
(b) Optimised version

It’s worth noting that even within a single butterfly unit, there is scope for reworking primary
operations. For example, in the optimised Gentleman-Sande butterfly the twiddle factor is
addressed by the same index k as used for a_N^k. When implementing the butterfly unit
under a resource restriction of a single, clock-driven arithmetic logic unit (ALU), the twiddle
factor can be fetched from memory after the two input data have been fetched, allowing the
same data path to be reused for fetching the twiddle factor. By way of example, the signal flow
graph for a DIF implementation of a Radix-2 8-point FFT is shown in Figure 2.4.

Figure 2.4. Radix-2 DIF FFT for N=8

For the sake of simplicity, arrowheads and other implicit details have been excluded. As can be
observed, Stage 1 performs a bifurcation of data into Stage 2; Stage 2 thus comprises two
FFT blocks, each of N=4. The twiddle addressing is performed with reference to N=4 in the
second stage (i.e. 0 corresponds to W_4^0). The ‘shuffling’ of data inherent to ‘in-place’
algorithms necessitates a reordering stage. This reordering is based on bit-reversed addressing
for Radix-2. The regular structure of the signal flow graph (i.e. the same number of butterflies
working per stage, etc) allows a high level of parallelism and pipelining to be incorporated in
such ‘in-place’ algorithms. The simplicity of the stages, however, is eclipsed by the overhead of
the reordering stage. The reordering done for bit-reversed addressing is probably the biggest
bottleneck experienced in designing vectorial processing cores. The reordering stage requires
that data be handled individually, which translates into the consumption of N clock cycles for
reordering. In a retargetable solution, the bit-reversed addressing becomes cumbersome, as it
doesn’t retain any form of regularity with N. This implies that the reordering is highly software
intensive. This tends to make designers wary of Radix-2 (or, in general, Radix-2^v). Since even
more attractive solutions exist for ASIC designs (like frozen split-radix/mixed-radix algorithms),
there too the Radix-2 algorithm isn’t preferred.
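The reordering overhead can be made concrete with a small sketch (hypothetical C, not the actual kernel): every element is visited individually, so the loop consumes on the order of N iterations, and the reversal width log2(N) changes with the FFT size, which is what destroys regularity in a retargetable setting.

```c
/* Reverse the lowest log2N bits of index i. */
static unsigned bit_reverse(unsigned i, unsigned log2N)
{
    unsigned r = 0;
    for (unsigned b = 0; b < log2N; b++) {
        r = (r << 1) | (i & 1u);
        i >>= 1;
    }
    return r;
}

/* In-place bit-reversed reordering of N = 2^log2N complex samples,
 * one element pair at a time -- roughly N cycles of data handling. */
static void bit_reverse_reorder(float *re, float *im,
                                unsigned N, unsigned log2N)
{
    for (unsigned i = 0; i < N; i++) {
        unsigned j = bit_reverse(i, log2N);
        if (j > i) {               /* swap each pair exactly once */
            float t;
            t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
    }
}
```

For N = 8 this maps index 1 (binary 001) to index 4 (binary 100), undoing the output shuffling produced by the DIF stages of Figure 2.4.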
However, the Radix-2 algorithm lends itself to significant versatility owing to its simple
dataflow structure. Highly vectorial processing for a Radix-2 FFT suffers from overheads owing
to twiddle addressing. This is demonstrated by the fact that W_4^1 = W_8^2, which implies
that interleaving of twiddle factors is required between stages for regular vectorial twiddle
handling.
A feature of the DIF structure over the DIT structure is that the DIF gives a better
signal-to-noise ratio with respect to the quantisation effects that stem from twiddle factor
multiplication and truncation [19]. Additionally, the DIF structure lends itself more easily to the
twiddle factor driven approach, explained in section 2.8.

2.3 Radix-2^v algorithms


A natural offshoot of the Radix-2 algorithm’s methodology is to perform the decimation of data
such that at each stage the problem size is divided by a factor of 2^v instead of just 2. In that
sense, Radix-2^v is a generalisation of the Radix-2 algorithm. In this section, the Radix-4 and
Radix-8 algorithms are examined, and their characteristics from a retargetable hardware
perspective are dealt with.

2.3.1 The Radix-4 algorithm
The Radix-4 algorithm incorporates a series-sum split of four at every stage, as compared to two
in Radix-2. This way, the initial N-point DFT is divided to give four (N/4)-point FFT
computations. Since the Radix-4 algorithm is a mutation of the Radix-2, all the predominant
features of Radix-2 hold true here as well, namely the highly regular structure, the ‘in-place’
data addressing between stages, the scope for pipelining, etc. The mathematical formulation
of the algorithm can be found in [11]. Since every butterfly works on four inputs x(4n),
x(4n+1), x(4n+2), x(4n+3), where n = 0, 1, ..., (N/4)−1, feeding the butterfly with non-
consecutive data ensemble values makes vectorial processing a bit tricky for generic DSP
processor cores. The same applies to the addressing and handling of twiddle factors, since each
twiddle factor needs to be referenced and fetched separately. Thus, while the Radix-4 algorithm
is ‘faster’ from an algorithmic perspective, completing the N-point FFT in log4(N) stages as
compared to log2(N) stages in Radix-2, there is the added detail that the algorithm is better
suited to a FFT of size N that is a proper power of 4. While the same techniques of zero
padding and ensemble manipulation (to make the Radix-4 butterfly behave like a Radix-2
butterfly) can be applied, this entails a drop in optimality. The Radix-4 butterfly’s nodal symbol
is shown in Figure 2.5. In essence, it is a two-stage Radix-2 implementation of a 4-point FFT, as
shown in Figure 2.6.

Figure. 2.5. Nodal version of Radix-4 Butterfly

Figure. 2.6. Signal flow view of Radix-4 Butterfly

As an example, consider a 16-point DIF Radix-4 FFT. The twiddle factors in the internal section
are referenced to N = 4, while the twiddles used in the global signal flow graph are with
respect to N, which is 16 in this example. This is indicated by the subscript 2 to depict
Radix-2, and the subscript 4 to depict Radix-4. The value of k is equivalent to 4n. Thus, if 4n
points to x(1), then the set of global twiddle exponents is {0, 1, 2, 3}. A significant feature here
is the occurrence of trivial operations within the butterfly, which can be exploited.
Since W4^0, W4^1 and W2^0 are always 1, -j and 1 respectively, these complex multiplication
operations become trivial. If a data value c = Re{c} + j Im{c} is multiplied by (-j), the
operation reduces to interchanging the real (Re{c}) and imaginary (Im{c}) quantities, followed
by a change of sign (a 2's complement negation) to give Im{c} - j Re{c}. This is a data-flow
operation rather than a complex multiplication. Although the butterfly is shown to work in two
stages, this needn't be the case from an implementation perspective. A functional unit
representing the butterfly would have four inputs and four outputs, and the data flow for each
output can be made independent of the others and computed in parallel. This allows the
butterfly to operate in a single clock cycle without intermediate data storage. The SFG for a
Radix-4 implementation of a 16-point FFT is shown in Figure 2.7.

Figure.2.7. Radix-4 implementation of a 16-point FFT
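The trivial (-j) multiplication and the four-output butterfly can be sketched as a minimal Python model (the function names are illustrative, and the per-output global twiddle multiplications that follow the butterfly are omitted):

```python
def mul_minus_j(re, im):
    """Multiply (re + j*im) by -j without a multiplier: swap the
    components and negate the new imaginary part (a 2's complement
    negation in hardware), giving im - j*re."""
    return im, -re

def radix4_butterfly(a, b, c, d):
    """Core of one DIF Radix-4 butterfly (global twiddles omitted).
    The internal factors 1, -j, -1, j are all trivial, so no true
    complex multiplier is needed."""
    t0, t1 = a + c, a - c      # first Radix-2 sub-stage
    t2, t3 = b + d, b - d
    return (t0 + t2,           # y0
            t1 - 1j * t3,      # y1: the -j swap/negate trick
            t0 - t2,           # y2
            t1 + 1j * t3)      # y3
```

For the inputs (1, 2, 3, 4) this reproduces the 4-point DFT (10, -2+2j, -2, -2-2j), matching the two-stage Radix-2 view of Figure 2.6.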


As in the case of Radix-2, the in-place algorithm is accompanied by the overhead of last-stage
reshuffling. While it was bit reversal in Radix-2, in Radix-4 it is digit reversal, wherein each
digit is with regard to a number scheme of base 4. Thus, if a binary-format word is used to
address a value, then digit reversal is equivalent to reversing the address sequence taken two
bits at a time, as shown in Figure 2.8.

Figure 2.8. Radix-4 digit reversal operation
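The reordering of Figure 2.8 amounts to reversing the binary address two bits at a time. A small sketch (the function name and the choice of a 4-bit address width are our own):

```python
def digit_reverse_base4(idx, nbits):
    """Reverse an nbits-wide binary address two bits (one base-4
    digit) at a time, as needed after the last Radix-4 stage.
    nbits is assumed even."""
    out = 0
    for _ in range(nbits // 2):
        out = (out << 2) | (idx & 0x3)   # peel off the lowest digit
        idx >>= 2
    return out

# For a 16-point FFT the addresses are 4 bits wide:
# 0001 -> 0100 (1 -> 4), 0110 -> 1001 (6 -> 9), 0111 -> 1101 (7 -> 13)
```

Like bit reversal, the operation is an involution: applying it twice returns the original address, so the same unit serves for both forward and inverse reordering.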

For generic FFT cores, when N is of the form 2x4^v (as in N = 128), it is a commonly adopted
practice to use just ⌊log4(N)⌋ Radix-4 stages. The last stage is made to mimic Radix-2 by
making x(4n+1) and x(4n+3) zero in the Radix-4 butterfly shown in Figure 2.6. Although this
wastes hardware capability in the last stage, it consumes fewer stages than a Radix-2
implementation. However, a point to be noted is that when such a manipulation is performed,
the solution is a mixed-radix solution. The reordering of data after the last stage must take this
into consideration, and the digit-reversal operation embedded in the core must be made to work
conditionally. Another method to circumvent the hindrance to vectorial processing, owing to the
simultaneous non-sequential input requirement, is to adopt a constant geometry structure with
Radix-4 butterflies. Here x(4n), x(4n+1), x(4n+2) and x(4n+3) are stored in four separate
logical memories and are read in a vectorial packed format. The constant geometry architecture
is dealt with in Section 2.4.

2.3.2 The Radix-8 and higher radix algorithms


As far as FFT technique goes, there isn't much difference between the Radix-8, Radix-16 and
Radix-4 algorithms, except that the series split is into N/8 and N/16 instead of N/4. This brings
with it the implicit differences in the number of inputs processed in a single butterfly, the
addressing of twiddle factors, the number of stages being log8(N) and log16(N) respectively,
etc. But as already made apparent in the case of Radix-4, an algorithm that completes in fewer
stages doesn't necessarily translate into a faster hardware implementation. The extent to which
the algorithm's hardware implementation can be made to work in parallel, or how easily
vectorial processing can be implemented, is more important. The Radix-8 and Radix-16
implementations come with the inherent feature that they are fundamentally suited to FFT sizes
that are proper powers of 8 and 16 respectively. This makes such higher-radix solutions less
than ideal for cores that implement a scalable FFT. Furthermore, the requirement to fetch eight
or sixteen individual data values to make a single butterfly work does not ideally suit
semi-custom memory-cell-based design. Such higher-radix solutions are therefore better suited
to an ASIC design than to a retargetable core. There are, however, a few noteworthy features
that accompany the Radix-8 and Radix-16 butterflies.

Figure. 2.6. Signal flow view of Radix-8 Butterfly

The twiddle factors in the internal section are referenced to N = 8 in Stage 1. The twiddle
factors from Stage 2 onwards allow for trivial operations. Examination of W8^1, W8^2 and
W8^3 gives the values (1 - j)/√2, -j and (-1 + j)/√2 respectively. In the case of a pure integer
machine, the scaling by a factor of (1/√2), i.e. 0.7071, isn't straightforward. Care must be taken
to scale it appropriately to fit the number system of the integer-based core. It is also possible to
approximate this by bit shifting distributed across many stages. But 'safer' ways that guarantee
accuracy can be adopted, like implementing the internal structure of the Radix-8 butterfly as a
Radix-4 stage cascaded with a Radix-2 stage. While in certain specialised ASICs this extent of
overhead is excused, it isn't worth the trade-off in a retargetable core. This line of deduction
holds true for Radix-16 butterflies also.
Therefore, higher-radix solutions are generally considered only for ASIC designs, in
which case the choice is no longer restricted to Radix-16. A 64-point FFT can be carried out in
two stages of Radix-8, or as a Radix-32 stage followed by a Radix-2 stage, since a single
Radix-2 last stage is entirely trivial. The choice depends on specifications like the cost of
arithmetic capability, scheduling objectives, etc.
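One common way to fit the 1/√2 factor into a 16-bit integer data path is Q15 fixed point. The sketch below is our own illustration of that option, not the scheme chosen in this work: it quantises the constant and performs the scaled multiply with truncation.

```python
def to_q15(x):
    """Quantise a real value in [-1, 1) to Q15: a 16-bit signed
    integer with 15 fractional bits."""
    return max(-32768, min(32767, int(round(x * 32768))))

def mul_q15(a, b):
    """Q15 x Q15 multiply, truncated back to Q15, as an integer
    core's multiplier followed by a right shift would do."""
    return (a * b) >> 15

INV_SQRT2_Q15 = to_q15(2 ** -0.5)   # 1/sqrt(2) = 0.7071... -> 23170
```

Squaring the quantised constant should give roughly 0.5 (16384 in Q15); a single multiply stays within one LSB of that, but such truncation errors accumulate over stages, which is why the cascaded Radix-4/Radix-2 construction is preferred above for guaranteed accuracy.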

2.4 Split/Mixed Radix solutions
The extension of single-radix solutions is to use a combination of two or more blocks of
butterflies that operate with different radices. The algorithmic basis of the multiple-radix
solution is treated in [10]. It is apparent at first glance that using two or more types of
butterflies in parallel within a single stage doesn't allow for efficient hardware utilisation in
generic cases. The ideal scenario for considering multiple-radix solutions is when the
specifications of the FFT are known beforehand. As an example, an ASIC that implements only
a 128-point FFT doesn't need to satisfy generic cases. Such ASIC designs can therefore afford
the luxury of implementing the 128-point FFT using a multiple-radix combination of Radix-8, 4
and 2 butterflies. This doesn't prove hardware efficient from a viewpoint of silicon utilisation if
the same core must work for other values of FFT size or execute another application. Thus,
while it is a valid design methodology for 'frozen' solutions, it is not ideal for a retargetable
implementation.

2.5 The Constant Geometry architecture


The constant geometry architecture was first devised to work with data read sequentially from
magnetic tape [6]. In later years, the method retained its utility because of the advantages of a
single architecture being repeated to implement all stages. In the domain of retargetable cores,
this translates into distributed data-storage architectures with a high extent of regularity, which
goes hand in hand with comparatively easy application programming. It also obviates the need
for stage-specific calculation of data addresses prior to fetching. The constant geometry
architecture can work with a butterfly of any radix, and is not an in-place algorithm. In the case
of a Radix-2 DIT FFT implemented for the constant geometry architecture, the N input data are
divided into top-(N)/2 and bottom-(N)/2. The constant geometry butterfly operates on one input
from top-(N)/2 and another input from bottom-(N)/2, and the results get stored into consecutive
locations. The generic butterfly of the DIT constant geometry implementation is shown in
Figure 2.7. The use of this butterfly in the SFG for a constant geometry implementation of an
8-point DIT FFT is shown in Figure 2.8.
[Figure: butterfly with inputs x(k) and x(k + N/2); the lower arm is multiplied by W_N^k;
outputs y(2k) = x(k) + W_N^k x(k + N/2) and y(2k+1) = x(k) - W_N^k x(k + N/2)]
Figure.2.7. Butterfly for DIT constant geometry architecture

Figure.2.8. Constant geometry architecture for a Radix-2 DIT FFT

It can be deduced from Figure.2.8 that such an architecture allows data of top-(N)/2 {i.e. x(0),
x(1), x(2), x(3)} to reside on one logical memory and data of bottom-(N)/2 {i.e. x(4), x(5), x(6),
x(7)} to reside on another logical memory. This allows the butterflies to be fed data every clock
cycle without a bottleneck owing to memory input/output bandwidth capability. The principle is
extendable to Radix-4 wherein N/4 data are stored on a single memory and the architecture
operates with four logical memories for its input. The design choice of extending the same to
even higher Radices depends on specifications and memory cost.
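The read/write pattern of one such stage can be sketched as a behavioural Python model (our own sketch of the butterfly mapping only; the twiddle ordering needed for a complete multi-stage FFT is not addressed here):

```python
def cg_stage(x, twiddles):
    """One Radix-2 DIT constant-geometry stage. Butterfly k reads
    x[k] from the top-(N)/2 memory and x[k + N/2] from the
    bottom-(N)/2 memory, and writes its two results to the
    consecutive locations y[2k] and y[2k+1] (not in-place)."""
    N = len(x)
    y = [0j] * N
    for k in range(N // 2):
        t = twiddles[k] * x[k + N // 2]   # twiddle on the lower arm
        y[2 * k] = x[k] + t
        y[2 * k + 1] = x[k] - t
    return y
```

Because every butterfly reads one value from each half and writes a consecutive pair, the two halves can sit in separate logical memories and be streamed every cycle, with no per-stage recomputation of the access pattern.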
While the constant geometry architecture is advantageous owing to being streamlined
and regular, it has a few limitations. Data-path width and memory-port width considerations
come into play when multiple butterflies operate in parallel. As an example, consider two
butterflies operating in tandem, and assume that the bit resolution of packed complex data is
32 bits (16 bits for the real component and 16 bits for the imaginary component). If top-(N)/2
and bottom-(N)/2 reside in separate memories, reading data for two butterflies requires a
data-path width of 64 bits per memory for the read cycle, and writing the results to consecutive
locations requires 128 bits per memory for the write cycle. The fact that the data-path width for
the write cycle is double that of the read cycle becomes all the more pronounced when
specifications warrant eight or more butterflies operating in tandem. As such large data-path
widths mean large chip area and complicated routing, there is a ceiling on the extent to which
the processing can be vectorial. Also notable is that the need for data reordering still exists,
namely bit-reversed addressing in the case of Radix-2. The constant geometry architecture is
quite useful in retargetable cores for medium FFT computation requirements. However, the
architecture offers no advantage for the vectorial handling of twiddles. A discussion of the
utility of the constant geometry architecture from the perspective of an ASIC implementation
targeting OFDM can be found in [13].

2.6 Other algorithms


Algorithms such as the Decimation-in-Time-Frequency (DITF) algorithm [8], the Quick Fourier
Transform (QFT) [9], the Fast Hartley Transform (FHT), the Winograd Fourier transform
algorithm (WFTA) [14], the prime-factor algorithm [14], etc., fall under the category of
special-case algorithms. Although it can be useful for a designer to examine these options while
making an exhaustive design-space exploration, they fundamentally cater to specific FFT sizes,
or aim to reduce the number of complex multiplications, and so on. Algorithms like the
prime-factor algorithm do not target generic FFT implementations. Secondly, present-day cores
allow the designer to treat arithmetic capability as virtually free when designing FFT
architectures, especially in comparison to memory controllers, on-core memory, registers, etc.
Hence, these algorithms provide no distinct advantage from a retargetable hardware
perspective, and in some cases even incur overheads to incorporate.

2.7 Fetching twiddle factors from memory Vs a CORDIC


In the handling of twiddle factors, two main aspects are to be considered. Firstly, the value of a
twiddle factor's real or imaginary component is derived from trigonometric functions and falls
in the range [-1, +1]. This implies that, to be used in a purely integer machine, scaling and
quantisation have to be carried out. The second aspect has to do with the acquisition of the
correctly indexed twiddle factor for each butterfly. The indexing is done with respect to the
variation in order (N, N/2, N/4, etc.) between stages. Traditionally this necessitated an
address-generation unit for both data and twiddle-factor exponents, depending on the input
indexes and the stage of the FFT [15]. Individual address calculation and twiddle index
generation isn't suitable for vectorial processing.
The alternative approach is to generate the twiddle factors on-chip, on demand. The
CORDIC (coordinate rotation digital computer) has been demonstrated to be useful in
generating values of exponential and trigonometric functions [16], and more specifically in the
generation of twiddle factors [17]. The CORDIC can be implemented in Radix-2 and is built
from the basic blocks of adders and shifters. The principle of the CORDIC is to calculate the
real and imaginary coordinates of a vector, given a base vector and an angle through which it
has to be rotated, along the lines of a mixer. The implementation has a pipeline depth that
determines the accuracy of the produced result. The depth of the pipeline also determines the
latency, and can be the driving factor in scheduling an FFT meant to run predominantly in a
single-cycle loop. An advantage, however, is that the CORDIC can replace the need for a
complex multiplier, as the complex data value arriving in the lower arm of the butterfly can be
scaled and rotated by an angle governed by the corresponding twiddle factor.
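A floating-point model of the rotation-mode CORDIC illustrates the shift-and-add principle and how the iteration count (pipeline depth) governs accuracy. This is our own sketch; a hardware version would use fixed point and typically fold the gain compensation into the input scaling.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by 'angle' radians (|angle| <= pi/2) using only
    add/subtract and power-of-two scalings (shifts in hardware) per
    iteration. More iterations, i.e. a deeper pipeline, gives a more
    accurate result."""
    atan_table = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))   # cumulative CORDIC gain
    z = angle
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0              # rotation direction
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atan_table[i]
    return x / gain, y / gain                      # undo the gain
```

Rotating (1, 0) by the angle -2πk/N yields the cosine and (negative) sine components of W_N^k directly, which is how a per-butterfly CORDIC can stand in for a complex multiplier.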
While both schemes are valid possibilities with room for tailoring and optimisation, the
design choice is governed by whether or not the design would significantly benefit from a
well-tailored CORDIC over a well-planned addressing and indexing scheme, with the most
significant deciding factor being chip area. This distinction becomes prominent depending on
the performance requirement. In a high-performance FFT implementation, wherein several
butterflies are to be computed in parallel to meet the cycle budget, each butterfly would require
its own dedicated CORDIC. This translates into a larger core area, which isn't worth the
trade-off, especially in the case of a small FFT order, wherein twiddle factors can be stored in
memory and handled in a vectorial manner.

2.8 Twiddle factor driven approach


When fetching twiddle factors from processor memory, a technique that can be used is the
twiddle-factor-driven approach. It is a methodology to overcome redundant twiddle fetching
and is detailed in [18]. The fundamental concept is to fetch each twiddle factor once and
compute the maximum number of butterflies in which that twiddle factor is used. Such a
twiddle-factor-driven FFT optimises the computation by traversing along stages while
simultaneously calculating butterflies within a stage. It is also possible to exploit trivial
butterflies, which obviates certain twiddle-factor fetches and complex multiplications.
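A behavioural sketch of the idea for a Radix-2 DIF FFT follows. This is our own minimal Python model, not the implementation of [18]: within each stage the twiddle loop is outermost, so every distinct factor is fetched exactly once and reused across all groups; the output is left in bit-reversed order.

```python
import cmath

def dif_fft_twiddle_driven(x):
    """Radix-2 DIF FFT with twiddle-factor-driven loop ordering:
    per stage, each distinct twiddle w is fetched once and applied
    to every butterfly (one per group) that uses it. Output is in
    bit-reversed order."""
    N = len(x)
    x = list(x)
    half = N // 2
    while half >= 1:
        for k in range(half):                             # one fetch per twiddle
            w = cmath.exp(-2j * cmath.pi * k / (2 * half))
            for base in range(0, N, 2 * half):            # reuse across groups
                a, b = x[base + k], x[base + k + half]
                x[base + k] = a + b
                x[base + k + half] = (a - b) * w
        half //= 2
    return x
```

In the later stages the group loop is the long one, so a single fetch of, for example, W^0 serves many butterflies; the trivial cases (W^0 and the -j factor) can additionally skip the complex multiplication altogether.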

2.9 Summary
In this Chapter, aspects of various DFT algorithms have been examined from the perspective of
ASIP implementation. The observations made are used for making design choices in Chapter 4.

In the next Chapter, the requirements of the FFT block for UWB-OFDM are deduced and
analysed.

2.10 List of References


[1] An Algorithm for the Machine Calculation of Complex Fourier Series
J.W. Cooley, J. W. Tukey,
Mathematics of Computation, vol. 19, pp 297-301, April 1965
[2] Aligning harmonics of signals to DFT grid
Luo Lichun; Xiao Xianci;
Antennas and Propagation Society International Symposium, 2003. IEEE
Volume 3, 22-27 June 2003 Page(s):199 - 202 vol.3
[3] FFT: a basic function for a reconfigurable receiver
Palicot, J.; Roland, C.;
Telecommunications, 2003. ICT 2003. 10th International Conference on
Volume 1, 23 Feb.-1 March 2003 Page(s):898 - 902 vol.1
[4] New algorithms for the multidimensional discrete Fourier transform
Auslander, L.; Feig, E.; Winograd, S.;
Acoustics, Speech, and Signal Processing [see also IEEE Transactions on Signal Processing],
IEEE Transactions on
Volume 31, Issue 2, Apr 1983 Page(s):388 - 403
[5] A parallel processing technique for a prime factor algorithm
Deka, R.; Glover, I.A.; Singh, B.;
Digital Processing of Signals in Communications, 1991., Sixth International Conference on
2-6 Sep 1991 Page(s):22 - 24
[6] A method for computing the fast Fourier transform with auxiliary memory and limited
high-speed storage
Singleton, R.;
Audio and Electroacoustics, IEEE Transactions on
Volume 15, Issue 2, Jun 1967 Page(s):91 - 98
[7] Benchmarking of FFT algorithms
Balducci, M.; Ganapathiraju, A.; Hamaker, J.; Picone, J.; Choudary, A.; Skjellum, A.;
Southeastcon '97. 'Engineering new New Century'., Proceedings. IEEE
12-14 April 1997 Page(s):328 - 330
[8] Decimation-in-time-frequency FFT algorithm
Saidi, A.;
Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International
Conference on; Volume iii, 19-22 April 1994 Page(s):III/453 - III/456 vol.3
[9] The quick Fourier transform: an FFT based on symmetries
Haitao Guo; Sitton, G.A.; Burrus, C.S.;
Signal Processing, IEEE Transactions on [see also Acoustics, Speech, and Signal Processing,
IEEE Transactions on],Volume 46, Issue 2, Feb. 1998 Page(s):335 - 341
[10] Inside the FFT Black Box
Chu, E.; George, A.: CRC Press, 2000

[11] Theory and Application of Digital Signal Processing
Rabiner, L. R.; Gold, B.: Englewood CLiffs, N. J.:Prentice-Hall Inc., 1975
[12] Understanding Digital Signal Processing
Lyons, R. G: Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1996
[13] Fast Fourier Transform for high speed OFDM wireless multimedia system
Sadat, A.; Mikhael, W.B.;
Circuits and Systems, 2001. MWSCAS 2001. Proceedings of the 44th IEEE 2001 Midwest
Symposium on; Volume 2, 14-17 Aug. 2001 Page(s):938 - 942 vol.2
[14] Multiplierless Winograd and prime factor FFT implementation
Macleod, M.D.;
Signal Processing Letters, IEEE
Volume 11, Issue 9, Sept. 2004 Page(s):740 - 743
[15] Conflict free memory addressing for dedicated FFT hardware
Johnson, L.G.;
Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on [see also
Circuits and Systems II: Express Briefs, IEEE Transactions on]; Volume 39, Issue 5, May
1992 Page(s):312 - 316
[16] A CORDIC processor for FFT computation and its implementation using gallium
arsenide technology
Sarmiento, R.; de Armas, V.; Lopez, J.F.; Montiel-Nelson, J.A.; Nunez, A.;
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on
Volume 6, Issue 1, March 1998 Page(s):18 - 30
[17] Parallel FFT with CORDIC for ultra wide band
Guoping Zhang; Chen, F.;
Personal, Indoor and Mobile Radio Communications, 2004. PIMRC 2004. 15th IEEE
International Symposium on; Volume 2, 5-8 Sept. 2004 Page(s):1173 - 1177 Vol.2
[18] Twiddle-factor-based FFT algorithm with reduced memory access
Yingtao Jiang; Ting Zhou; Yiyan Tang; Yuke Wang;
Parallel and Distributed Processing Symposium., Proceedings International, IPDPS 2002,
15-19 April 2002 Page(s):70 – 77
[19] Error analysis and resulting structural improvements for fixed point FFTs
Elterich, A.; Stammler, W.;
Acoustics, Speech, and Signal Processing, 1988. ICASSP-88., 1988 International Conference on
11-14 April 1988 Page(s):1419 - 1422 vol.3

Chapter 3
The UWB-OFDM Standard

In this chapter, pertinent details of the UWB MB-OFDM standard are discussed. This is
followed by a derivation of the cycle-count budget and other requirements of an FFT meant for
the same.

3.1 Introduction
The IEEE 802 working group was established in 1980 with the purpose of developing standards
for Local Area Networks (LAN) and Metropolitan Area Networks (MAN) [1]. Each standard (or
recommended practice, or guide) is initiated by a group of people with an interest in developing
it. The list of presently active work and study groups within the LAN/MAN standards
committee (LMSC) is provided in Table 3.1:
Table 3.1 List of Active work or Study groups within LMSC
Study or work group    Group title

802.1 Higher Layer LAN Protocols Working group


802.3 Ethernet Working group
802.11 Wireless LAN Working group
802.15 Wireless Personal Area Network (WPAN) Working group
802.16 Broadband Wireless Access Working group
802.17 Resilient Packet Ring Working group
802.18 Radio Regulatory Technical Advisory Group (RR-TAG)
802.19 Coexistence TAG
802.20 Mobile Broadband Wireless Access (MBWA) Working group
802.21 Media Independent Handoff Working group
802.22 Wireless Regional Area Networks

The 802.15 WPAN working group’s effort focuses on the development of consensus standards
for Personal Area Networks or short distance wireless networks. These address wireless
networking of portable and mobile computing devices such as PCs, Personal Digital Assistants
(PDAs), peripherals, cell phones, pagers, and consumer electronics; allowing these devices to
communicate and interoperate with one another. The goal is to publish standards, recommended
practices, or guides that have broad market applicability and deal effectively with the issues of
coexistence and interoperability with other wired and wireless networking solutions. At the time
of this writing, the list of standards under the purview of the WPAN work group is shown in
Table 3.2. An alternative being discussed for high data-rate WPAN is the usage of Ultra-
wideband (UWB).
Table 3.2 List of standards within 802.15
Standard code Topic of specification

802.15.1 Bluetooth
802.15.2 Coexistence of WPAN (802.15) and WLAN (802.11)
802.15.3 High-rate (20 Mbps or greater) WPAN
802.15.4 Low data-rate, low power, low complexity WPAN

3.2 The MB-OFDM UWB standard


The Federal Communications Commission (FCC) has allocated 7.5 GHz of spectrum for
unlicensed use of UWB devices in the 3.1 to 10.6 GHz frequency band. A method of exploiting
such wide bandwidth is to employ a multi-band approach, in which information is encoded in
multiple sub-bands (in the range of 500-MHz each). The UWB spectral allocation is the first
step toward a new policy of open spectrum initiated by the FCC. This is by far the largest
spectrum allocation for unlicensed use the FCC has ever granted and it spans spectrum allocated
to other pre-existing standards.

Figure 3.1 Spectral Mask for UWB as specified by FCC

In order to ensure the coexistence of other standards (especially WLAN 802.11a/g) with UWB,
the FCC has put in place a transmit power spectral density restriction (measured with respect to
1 MHz) as shown in Figure 3.1. The standardisation for adopting UWB for WPAN is still a
topic of debate at the time of this writing. The two contending standards are labelled Direct
Sequence (DS-UWB) and Multi-band Orthogonal Frequency Division Multiplexing (MB-
OFDM). The MB-OFDM standard for UWB is being formulated by the workgroup IEEE
802.15.3a. The alliance of the industries involved in this workgroup has come to be known as
the Multiband-OFDM Alliance (MBOA) [2], co-founded by Philips. The FCC's initiative to
grant unlicensed use of the UWB channel has caused fervent interest because such a substantial
bandwidth allows for very high data transmission rates (~1 Gbps). However, in adherence to the
FCC's spectral mask, the transmitted power has to be quite low (comparable to the ambient
noise unintentionally radiated by computers). So, while this allows transmission at very high
speeds, the distance of transmission is quite limited (in the range of about 10 meters).
In MB-OFDM, the spectral range (3.1 to 10.6 GHz) is divided into 14 bands of 528 MHz each.
The lower three bands are mandatory and are to be used for standard operation at present. The
remaining bands are for optional use or for purposes of future expansion. In the draft
specification of the MBOA special interest group [3] [4], the spectral details for OFDM symbol
construction are as shown below in Figure 3.2.

Figure 3.2 Sub-carrier frequency allocations


The salient features of the sub-carrier allocation are:

1. The OFDM system employs 128 subcarriers, 122 of which carry energy.

2. Out of the 122 subcarriers, 100 are devoted to data (ci) and 12 are assigned to serve as
pilot subcarriers (Pi).

3. The remaining 10 subcarriers serve as guard carriers. The 10 guard subcarriers

occupy the outermost positions (-61 to -57 and 57 to 61) and are formed by repeating
the five outermost data-bearing subcarriers.

4. The remaining subcarriers and the DC component are left null.

The above-described transmit spectrum spans a band of 528 MHz. In the time domain, the
symbol is transmitted for 312.5 ns. A break-up of a transmitted symbol is shown in Figure 3.4.
A cyclic prefix of 60.6 ns provides resilience against multi-path errors. A guard interval of
9.5 ns separates consecutively transmitted symbols and allows time for switching between
bands.

Figure 3.4 MB-OFDM UWB standard’s transmitted symbol durations

The typical block diagram of a transmitter and receiver section for such a MB-OFDM schema is
shown in Figure 3.5 and Figure 3.6 respectively [5].

Figure 3.5 Block diagram of Transmitter

Figure 3.6 Block diagram of Receiver

The transmitter and receiver sections are just as in conventional wireless OFDM. There are,
however, a few modifications and restrictions incorporated. The constellation mapping is
restricted to Quadrature Phase Shift Keying (QPSK) so that the precision of the internal digital
blocks (IFFT, FFT, ADC, DAC, etc.) can be kept low. A 128-point FFT/IFFT is required for
generating or extracting ensemble information from the symbol, with a maximum latency of
312.5 ns to perform the computation before the next symbol has to be dealt with. As far as
designing an FFT is concerned, these alone form the specifications: the order of the FFT and the
time constraint. It is worth mentioning that the intricate nuances of the MB-OFDM UWB
standard are still under discussion and subject to change. However, minor changes to the
standard do not affect the requirements of the FFT and are irrelevant to the present work.

3.3 Examining the complexity of the FFT for MB-OFDM UWB


To provide a fair idea of the computational requirement of the 128-point FFT with a time
criticality (T) of 312.5 ns, an implementation using the Radix-2 algorithm is examined. For the
sake of convenience, it is initially assumed that the Radix-2 butterfly experiences no waiting
time and that data and twiddle handling bottlenecks do not exist. The number of stages (ns)
required to compute an FFT of order N = 128 is given by,

ns = log2(N) -- Equation 3.1

Thus, for the case of UWB OFDM, ns= 7. The number of butterflies for a stage of the signal
flow graph of the FFT is denoted by nstage. Since each butterfly handles two data inputs at an
instance, for N = 128, nstage = 64. Thus, for the entire 128-point FFT, the number of butterflies
(BFtotal) to be computed is given by,

BFtotal = ns × nstage -- Equation 3.2

= 7 × 64
= 448 Radix-2 butterflies.

A high-end retargetable core can at best be run at an internal clock speed (C.S.) in the
range of 300 MHz. As a measure of the computational capability required, the number of
butterflies that must operate in parallel is derived. The time period of a clock cycle is given by,

tc = 1/(C.S.) -- Equation 3.3

≅ 3.33 × 10^-9 s = 3.33 ns.

Thus, the number of cycles available for computing the complete FFT (Cbudget) is given by,

Cbudget = ⌈T / tc⌉ -- Equation 3.4

= ⌈312.5 / 3.33⌉

= 94 cycles

Thus, the number of butterflies to be computed in parallel (nparallel) is given by,

nparallel = ⌈BFtotal / Cbudget⌉ -- Equation 3.5

= ⌈448 / 94⌉

= 5 Radix-2 butterflies in parallel.

Thus, in the ideal case of being able to compute five Radix-2 butterflies every clock tick, the
cycle budget of 94 cycles would be just met. A summary of the values is depicted in Table 3.3.

Table 3.3 Summary of calculations


N 128
ns 7
BFtotal 448
C.S. (MHz) 300
tc (ns) 3.33
Cbudget 94
nparallel 5

However, the practical working case would require some overhead in between stages.
This is owing to having to recalculate parameters for vectorial processing on a domain based on
N/2, as compared to N for the previous stage. There is also the bit-reversal-addressed
rearrangement in the last stage, which requires N/2 = 64 cycles if done one butterfly output pair
at a time. This borders on a near-unacceptable share of the cycles (~68%) for just
rearrangement of data after the computation is complete. Thus, an estimate of nparallel = 8
seems acceptable, given the need to handle overheads, area considerations, and the fact that the
butterfly counts are multiples of 8 (unlike 5). This means that the total number of cycles that
can be devoted to butterfly computations (Ccomp) is,

Ccomp = BFtotal / nparallel -- Equation 3.6

= 448 / 8

= 56 cycles

Thus, the remaining 38 cycles (Cbudget − Ccomp) must account for all intermediate parameter
computations between stages and other overheads.
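The budget arithmetic of Equations 3.1 to 3.6 can be reproduced in a few lines. This is a plain Python restatement of the calculations above; the function and parameter names are our own.

```python
import math

def fft_cycle_budget(N=128, T_ns=312.5, clock_mhz=300.0, n_parallel=8):
    """Radix-2 budget figures for an N-point FFT that must complete
    within T_ns at the given clock frequency."""
    n_s = int(math.log2(N))                 # Eq. 3.1: number of stages
    bf_total = n_s * (N // 2)               # Eq. 3.2: total butterflies
    t_c = 1e3 / clock_mhz                   # Eq. 3.3: cycle time in ns
    c_budget = math.ceil(T_ns / t_c)        # Eq. 3.4: cycle budget
    n_min = math.ceil(bf_total / c_budget)  # Eq. 3.5: minimum parallelism
    c_comp = bf_total // n_parallel         # Eq. 3.6: butterfly cycles
    return n_s, bf_total, c_budget, n_min, c_comp
```

For the UWB-OFDM case this returns (7, 448, 94, 5, 56), leaving the 38 cycles for inter-stage overhead exactly as derived above.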


Eight butterflies in parallel is a compromise, and it leaves many issues (especially that of
the bit-reversed-addressing-based reordering) to be solved. Nevertheless, even eight butterflies
in parallel will require significant effort in handling data and twiddle factors. Additionally, the
sequencing of operations should scale with N, with a near-linear variation in cycle count.

3.4 Summary
The FFT for UWB-OFDM has a time budget of 312.5 ns for computation. Given a clock speed
in the neighbourhood of 300 MHz, a Radix-2 implementation would entail computing 8
butterflies in tandem. The details discussed and analysed in this chapter form the inputs to the
design decisions of Chapter 4.

3.5 List of References


[1] IEEE 802 LAN/MAN Standards Committee homepage
http://grouper.ieee.org/groups/802/

[2] Multiband OFDM Alliance


www.multibandofdm.org

[3] Multiband OFDM Physical layer Proposal for IEEE 802.15 Task group 3a
IEEE P802.15-03/141r1, March 2004

[4] Multiband OFDM Physical layer Proposal Update


IEEE P802.15-04/0220r1, May 2004

[5] Multi-band OFDM: a new approach for UWB


Batra, A.; Balakrishnan, J.; Dabak, A.;
Circuits and Systems, 2004. ISCAS '04. Proceedings of the 2004 International Symposium on
Volume 5, 23-26 May 2004 Page(s):V-365 - V-368 Vol.5

Chapter 4
Explanation of Design Choices

The focus of this chapter is the reasoning behind design choices. Starting with a review of
requirements, each design choice and its implication for attaining the requirements is examined.

4.1 Summary of basic requirements


The performance of an ASIP solution for a programmable yet high-performance FFT is strongly
dependent on both the processor architecture and its compatibility with the FFT algorithm (i.e.
the application kernel). In light of that, it has been recommended that the development of both
take place simultaneously for a cohesive solution [1]. There are many possibilities and degrees
of design freedom in the choice of an FFT algorithm. Besides selecting one of the primary
methods explained in Chapter 2, it is also possible to merge them if required, for example into a
Radix-4 constant geometry implementation if found suitable. Given such a large design search
space, a systematic approach becomes imperative. As a first step, a complete examination of the
requirements is needed. A glimpse of the extent of design effort was provided in Section 3.3 by
simplistically considering a Radix-2 implementation. There are two distinct and primary design
requirements:

1. The solution should be programmable, meaning the application kernel should scale
with the order of the FFT in terms of clock cycles. Emphasis is also placed on
encompassing special cases: special provisions are needed to handle even rudimentary
cases (e.g. N=2) while incorporating a highly vectorial processing approach.

2. The programmable solution should specifically target UWB-OFDM performance
requirements. As a frame of reference for the design, a clock frequency of 300 MHz is
chosen, which implies the availability of 94 clock cycles for computation of the entire
128-point FFT.
4.2 Initial choice of algorithm
As a starting point, the vectorial Radix-2 approach portrayed in Section 3.3 is adopted, simply
because a Radix-2 solution is very generic. While specialized algorithms - higher-radix,
split-radix, mixed-radix, etc. - cater to particular higher-order FFT implementations, their ease
of scalability becomes an issue. Hence, the Radix-2 Cooley-Tukey algorithm is taken as a
starting point and modifications are incorporated as and when seen appropriate.

4.3 Data path bandwidth requirement


As portrayed in Section 3.3, in order for the number of butterflies processed
simultaneously to be a perfect divisor of 448 and 64 (the number of radix-2 butterflies per
stage), a choice of computing 8 butterflies simultaneously is made. Such a relatively large extent
of parallel computation makes the choice of data bit resolution critical, in the context of wide
data paths and hence larger area on chip. Thus, the bit resolution should not only provide
sufficient signal-to-quantisation noise ratio, but should also be minimal so as not to consume
more silicon area than required. As mentioned in Section 1.4, the chosen data bit
resolution is 16-bit, as opposed to the neighbouring ANSI-C provided bit resolutions of 8-bit
and 32-bit. The topic of signal-to-quantisation noise ratio is further dealt with in Chapter 5. The
representation of data is of the form shown in Figure 4.1.

Figure 4.1. Data packing scheme for representing complex data
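The packing scheme of Figure 4.1 can be sketched in C as follows. This is a minimal illustration of fitting a 16-bit real and a 16-bit imaginary component into one 32-bit word; the helper names and the choice of placing the real part in the low half-word are assumptions for illustration, not necessarily the processor's actual convention.

```c
#include <stdint.h>

/* Hypothetical sketch of the 32-bit complex packing of Figure 4.1.
   The real component is assumed to sit in the low half-word. */
static inline uint32_t pack_complex(int16_t re, int16_t im)
{
    return ((uint32_t)(uint16_t)im << 16) | (uint16_t)re;
}

static inline int16_t unpack_re(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
static inline int16_t unpack_im(uint32_t w) { return (int16_t)(w >> 16); }
```

Eight such 32-bit words concatenated then give one 256-bit vector, matching the 8x32 = 256-bit input width discussed below.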

In the present case, if 8 butterflies are simultaneously to be fed data, it translates into two 8x32 =
256-bit wide inputs to the Issue Slot (IS) that performs the FFT computation. Two
issues come into view at this stage. Firstly, from the perspective of efficient hardware
utilization, the FFT IS should be supplied with two 256-bit values every clock cycle. To
provide such a continuous flow of input data, Load-Store Units (LSUs) alone are sufficient. The
alternatives are to cache data considering the basic data size to be 32-bits - implying a requirement
of about sixteen cache hits every clock cycle - or to have bulk data residing in processor
registers. Aspects such as overhead in packing data, accompanying cost of registers, etc. then
become significant. In comparison to such alternatives, using LSUs to provide data only when
required not only meets performance requirements but is also the cheaper, more reliable
alternative.
The second important issue is that of having to fetch data from different locations, (N/2)
apart, for each stage. This fetching has to take place in a very regular and streamlined manner.
The need for as high an extent of regularity as possible is to help in pipelining chunks of the
application kernel. There is a significant advantage to be gained in terms of cycle budget and
power by pipelining the operations within loops - also known as loop folding - to the extent of
realizing a single-cycle loop. In most standard cases, to realize this high extent of regularity, the
constant-geometry type architecture is preferred. In keeping with the current radix-2 approach,
this can be performed with an architecture that has two logical memories - one for the top (N/2)
data values, and the other for the bottom (N/2) data values - as detailed in Section 2.4. However,
in the case of 8 simultaneous butterfly computations with two single-ported memories, this
implies loading a 256-bit vectorial value from each memory, but storing a 512-bit value to a
single memory. This becomes a memory bandwidth issue. A similar argument holds true for a
radix-4 constant-geometry implementation with four logical memories, with the added overhead
of data handling complexity if the order of the FFT is not an exact power of four. This issue is
however circumvented by using dual-ported memories, which allow two distinct data values to
be stored to two distinct locations within a memory simultaneously. This point strongly
favours usage of dual-ported memories in the processor architecture, as it maintains the
possibility of adopting a constant-geometry approach to writing the application kernel.
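To make the addressing pattern concrete, the scalar sketch below computes the pair of indices one radix-2 butterfly touches at a given stage of an in-place DIF FFT: at the first stage the two operands lie N/2 apart, and the span halves with each subsequent stage. The function name and the 0-based stage numbering are illustrative assumptions.

```c
/* Index pair touched by butterfly k at the given stage (0-based) of an
   in-place radix-2 DIF FFT of order N. At stage 0 the operands lie N/2
   apart; the span halves with every further stage. */
void butterfly_pair(int N, int stage, int k, int *top, int *bot)
{
    int half  = N >> (stage + 1);   /* operand distance at this stage */
    int group = k / half;           /* which butterfly group          */
    int pos   = k % half;           /* offset inside the group        */

    *top = group * 2 * half + pos;
    *bot = *top + half;
}
```

For N = 16, butterfly 0 of stage 0 pairs indices 0 and 8; butterfly 4 of stage 1 pairs indices 8 and 12.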

4.4 Memory ping ponging


Memory ping ponging is a technique that allows the three fundamental loop operations - data
fetching from memory, FFT butterfly computation, and data storing to memory - to be pipelined
and folded into a single-cycle loop. The concept is to have a separate source memory from which
data is read and a separate destination memory to which data is written after computation. This
overcomes the memory bandwidth and contention issues that would occur while using a single
memory, hence making loop folding easier. The source and destination memories are
interchanged between stages. A pictorial representation of memory ping ponging is depicted in
Figure 4.2.

Figure 4.2. Concept of Memory ping ponging
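A minimal C model of the ping-ponging loop is sketched below. Here `compute_stage` is a hypothetical placeholder for the butterfly work; only the swapping of the source and destination roles between stages is the point of the sketch.

```c
#include <stdint.h>
#include <string.h>

/* Placeholder for one FFT stage: reads src, writes dst. Here it only
   copies, standing in for the real butterfly computation. */
static void compute_stage(const uint32_t *src, uint32_t *dst, int n, int s)
{
    (void)s;
    memcpy(dst, src, (size_t)n * sizeof *dst);
}

/* Memory ping-ponging: each stage reads one memory and writes the other,
   and the two memories swap roles between stages. Returns the memory
   that holds the final result. */
uint32_t *fft_stages(uint32_t *mem1, uint32_t *mem2, int n, int n_stages)
{
    uint32_t *src = mem1, *dst = mem2;
    for (int s = 0; s < n_stages; s++) {
        compute_stage(src, dst, n, s);
        uint32_t *tmp = src; src = dst; dst = tmp;   /* interchange */
    }
    return src;
}
```

Whether the result ends up in mem1 or mem2 depends only on whether the number of stages is even or odd.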

4.5 Twiddle handling


While data values can be read in a vectorial manner, vectorial handling of the twiddle factors is
not so straightforward. In this case, the corresponding twiddle values for all 8 butterflies have to
be procured within a single clock cycle. There are two broad classes of approaches to twiddle
handling. The first is that of fetching and storing twiddles in the processor memory, i.e. on-chip
RAM, as part of the processor input/output memory blocks. Writing twiddles to the processor
memory from an external memory (i.e. an external ROM) is done to accomplish a scalable
solution. The external memory contains twiddles calculated for a large FFT order, say a
1024-point FFT, and selected twiddles are transferred to the processor's local memory. The
other method is to generate the twiddles on demand using a CORDIC, as introduced in Section
2.6. The choice lies in making an area comparison. A CORDIC requires 16 pipeline stages to
generate sine and cosine values at 16-bit resolution accuracy, and the eight CORDICs needed to
service 8 parallel FFT butterfly computations consume about 3mm2 (in 120nm process
technology) [2]. This is much larger than the area of the memory block required to store
32-bit twiddle factors for a 1024-point FFT, which would be in the range of 0.25mm2. Hence,
the conclusion is that twiddles are stored in processor memory.
However, the order of twiddle factors varies between stages, and in the case of a Radix-2
in-place implementation this variation manifests as interleaving of twiddles between stages.
Hence, an arrangement has to be incorporated to rearrange twiddles in-between stages in a
highly regular manner, without affecting the FFT IS's operation and without costing additional
clock cycles. Such an arrangement can be realised by an adaptation of the memory ping-ponging
technique used for data handling: a separate IS is placed in the data path to perform the twiddle
selection and repacking operation, as depicted in Figure 4.3. In the case of radix-2, the IS selects
alternate twiddles and packs them together. It is worth noting that, if a constant-geometry
approach is adopted, making the FFT twiddle factor driven (explained in Section 2.8) between
stages is not straightforward. Thus, while the constant-geometry approach is still a possibility,
in view of targeting a twiddle-factor-driven approach to the FFT, the in-place approach is
retained for the application kernel.

Figure 4.3. Twiddle handling and ping ponging
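The selection-and-repacking operation of the twiddle IS can be modelled as below. For radix-2, stage s uses twiddles of the form W^(k·2^s), so the set needed by stage s+1 consists of every other twiddle of the current set. The function name and the flat-array view of the packed twiddle vectors are assumptions for illustration.

```c
#include <stdint.h>

/* Model of the twiddle repacking IS: stage s of a radix-2 DIF FFT uses
   twiddles W^(k * 2^s), so the set for the next stage is obtained by
   selecting alternate entries of the current set and packing them
   contiguously. Each uint32_t stands for one packed complex twiddle. */
void repack_twiddles(const uint32_t *cur, uint32_t *next, int count)
{
    for (int i = 0; i < count / 2; i++)
        next[i] = cur[2 * i];
}
```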

4.6 Signal flow graph (SFG) based analysis


It is a notable point that - although separate memories are used for butterfly inputs and outputs -
the algorithm followed is still referred to as 'in-place'. The reason is that the algorithm is in fact
in-place in the context of the signal flow graph, as the input and output data of the butterflies
have the same data index. The modus operandi of the chosen architecture can be better
visualized with the signal flow graph depicted in Figure 4.4. Since the present case deals with 8
radix-2 butterflies in parallel, an example of a 16-point FFT is used to enumerate pertinent
aspects. It must be noted that the bit reversal stage has been excluded at present. Stage 1 is a
typical representative of the generic case of eight butterflies operating simultaneously in a
vectorial manner - here the butterflies operate upon both the 256-bit wide inputs. The last
three stages are special cases in the sense that the butterflies operate in an intra-vectorial manner
within the same input, which necessitates inclusion of separate instructions in the FFT IS to
handle these stages. The DIF architecture is preferred owing to better signal-to-quantisation
noise ratio, as stated in Chapter 2. The memory ping ponging can be visualized using separate
logical memories - Memory1 and Memory2 to handle data, and Twiddle1 and Twiddle2 to
handle twiddle factors. Stage 1 then operates on data fetched from Memory1 and twiddle
factors from Twiddle1. The output data is stored in Memory2, and while the FFT issue slot is
operating, Twiddle2 is populated with alternate twiddle factors from Twiddle1. In the next
stage, Memory2 provides input, and Memory1 serves as the target memory for that stage. The
twiddle factors are fetched from Twiddle2, and the interleaving operation allows them to be
read in a single fetch as a 256-bit word.

Figure. 4.4. Signal flow for the example of a 16-point FFT


Taking into account the overhead inherent in twiddle factor handling in a highly vectorial
processor, rearrangement of twiddle factors between stages, etc., adopting a twiddle-factor-driven
approach is justified. While there is no escaping having to load data every clock cycle,
fetching of twiddles allows for some leverage. Each twiddle set (of 8 twiddles) is loaded just
once per stage, and all the butterflies corresponding to those twiddles are computed before
loading the next twiddle set. The implementation of such a scheme takes the shape of a
loop-in-loop. The inner loop is restricted to butterfly computation for a particular twiddle factor
set, while the outer loop deals with handling and rearranging twiddle factors. Twiddle factor
sets are read as 256-bit vectors, once each per stage. For each twiddle set, the inner loop iterates
a number of times depending on the FFT stage, and three primary operations take place within
the inner loop during the same clock instance:
i. Loading the next input to be operated on in the next cycle;
ii. The FFT IS operating on the present input;
iii. Storing the output data made available during the previous cycle to memory.

The twiddles that serve as input to the inner loop are handled by the outer loop, which performs
fetching of twiddles, interleaved rearrangement using a dedicated issue slot, and storage to
memory - again pipelined to happen within a single cycle. Well co-ordinated and yet isolated
index handling for all the LSUs is crucial to achieving an optimal scheduling scheme.
The implication of this is two-fold: firstly, it allows for more efficient pipelining and loop
folding of the inner loop; secondly, it allows sufficient cycles for twiddle rearrangement
(interleaved storage), even when the inner loop is folded to a single-cycle loop.

4.7 Processor architecture requirements


A schematic abstraction of the architecture that facilitates this signal flow is depicted in Figure
4.5. To explicitly describe the direction of data flow during alternate cycles, notations (1) and
(2) are used in Figure 4.5. The FFT issue slot is continuously fed data from either Memory1 or
Memory2, and twiddles from Twiddle1 or Twiddle2 respectively. The interleaved
rearrangement of twiddle factors is shown in a dotted rectangle to depict that this operation
takes place outside the inner loop.

Figure 4.5. Cycle specific view of data flow

There are certain implications that accompany this processor architecture. Firstly, the minimum
data memory size required is at least twice the order of the FFT. Secondly, the need for an
Address Generation Unit (AGU) to individually address each twiddle factor is obviated as it can
be handled in a vectorial manner. If not for this, dealing with twiddles becomes a bottleneck
while computing eight butterflies in parallel. Another important implication is that two distinct
data values are to be procured from the same memory per clock cycle. Since the distance
between these two data value indexes varies between stages, a dual-ported memory is required.
It is worth recalling that the dual-ported memory type architecture is also required in the case of
adopting a constant-geometry approach. This way, two data values from a memory can be
loaded every clock tick and the FFT issue slot is effectively utilized. Indexing both data values
independently helps loop folding. Each data path is 256-bits wide. While in Figure 4.5 the data
path for twiddle rearrangement has been shown as distinct from the twiddle fetching data path
for sake of clarity, both operations share the same LSU and data path.
Besides the above requirements, in order to compute the entire FFT in a contiguous
manner over time, separate Input and Output memories are used. During the first stage, data is
read from Input memory, processed and stored to Memory2. While the memory ping ponging
takes place between Memory2 and Memory1 for the rest of the FFT stages, the next set of
ensemble values is loaded into the Input memory via a DMA controller. In the last stage of the
FFT, the stage is fed either from Memory1 or Memory2 and stored to Output memory. This
arrangement is graphically depicted in Figure 4.6.

Figure 4.6 Usage of processor input/output memories

4.8 Bit reversed addressing in the last stage
The usage of a radix-2 in-place DIF algorithm warrants bit-reversed addressing of data in the
last stage. Hence, provision is made for an issue slot that performs the bit-reversal operation for
the last stage. However, taking into consideration that the specific case to be optimised for is a
128-point FFT with a budget of about 94 clock cycles, bit-reversed addressing takes the form of
a severe bottleneck: the 128-point FFT requires on the order of 64 clock cycles just for storing
data in a bit-reversed manner into the Output memory [3]. Thus, for the high-performance case
of UWB-OFDM, bit-reversed addressing is almost prohibitive, as it dominates the cycle budget.
Considering the problem at a system level - where the FFT computation is followed by channel
estimation and equalization - circumvents this issue. These subsequent blocks are designed to
procure their input data in bit-reversed order from Output memory. This obviates the need for
the bit reversal issue slot and makes the last stage's computation vectorial and highly regular.
Since the processor is described in TIM, a highly modular hardware description language, the
bit reversal block is easily excluded prior to committing the design to silicon.
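For reference, the bit-reversed index a subsequent block would use when reading Output memory can be computed as below (7 bits for the 128-point case); the function name is illustrative.

```c
/* Reverse the low `bits` bits of x; e.g. with bits = 7 (128-point FFT),
   index 1 (0000001b) maps to 64 (1000000b). A consumer block reads its
   i-th input from Output memory location bitrev(i, 7). */
unsigned bitrev(unsigned x, int bits)
{
    unsigned r = 0;
    for (int b = 0; b < bits; b++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```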

4.9 Inter-stage scaling for SQNR


In order to extract maximum numerical accuracy from an integer-type processor, the input data
(real and imaginary components) is scaled up depending on the input signal's peak-to-average
ratio. The input signal ensemble values are thus normalised to the peak value of the signal, and
then proportionally and linearly mapped onto a 16-bit scale. However, given the butterfly
operations of complex multiplication and addition, this would cause data overflow and
saturation errors over subsequent stages. In order to retain accuracy, the data values are shifted
right by one bit every alternate stage. This translates into a scaling by a factor of 1/√2 for every
stage of the FFT, thereby maintaining signal power over the stages. In the case of the FFT
comprising an even number of stages, the output power is the same as the input power, as
opposed to being greater by a factor of 2 in the case of an odd number of stages. This aspect of
the FFT is detailed in [4]. Shifting right once every two stages minimises the numerical error
owing to quantisation while still making use of the dynamic range, thereby improving the
signal-to-quantisation noise ratio of the entire FFT. In the case of a scalable FFT, note is kept of
the number of times the shifting operation has taken place, to give a correct perspective on the
output ensemble values.
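The conditional shift can be sketched as below: shifting on even-numbered stages (counting from 0, i.e. starting with the first stage) halves the amplitude every two stages, which averages to a factor of 1/√2 per stage. The function and the shift-count bookkeeping are illustrative assumptions; the right shift of negative values is assumed to be arithmetic, as on common compilers.

```c
#include <stdint.h>

/* Shift every component right by one bit on alternate stages (stage
   numbering from 0, so the first stage shifts). *shifts records how
   often scaling occurred, so the output scale can be reconstructed. */
void scale_stage(int16_t *re, int16_t *im, int n, int stage, int *shifts)
{
    if (stage % 2 == 0) {
        for (int i = 0; i < n; i++) {
            re[i] = (int16_t)(re[i] >> 1);   /* assumed arithmetic shift */
            im[i] = (int16_t)(im[i] >> 1);
        }
        (*shifts)++;
    }
}
```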

4.10 Conclusion
In keeping with the design choices enumerated in this section, the processor description of the
ASIP is carried out in TIM, and the corresponding application kernel is written. A short
summary of the design choices is listed below:
1. A DIF algorithm is chosen in comparison to a DIT algorithm owing to its suitability
towards the twiddle factor driven approach, and for SQNR. Also for SQNR, the data
values are shifted right by one bit every alternate stage.
2. Given cycle budget considerations, the bit-reversed addressing to be performed during
the last stage is omitted and compensated for during channel estimation and equalization.
3. For achieving cycle-budget, eight radix-2 butterflies are computed simultaneously.
Given a 16-bit data resolution requirement, this implies a 256-bit wide data path.
4. Twiddles are fetched and stored on processor memory as opposed to generating twiddles
dynamically using a CORDIC. Also incorporated is a scheme for twiddle ping ponging
and inter-stage rearrangement for vectorial handling of twiddles.
5. Dual ported memories are used as they allow the application kernel to adopt either the
constant geometry approach, or the in-place approach. The in-place approach is adopted
in present work with the aim of using a twiddle factor driven approach.
While writing the application kernel, the principle has been to adhere to intrinsic operations of
the processor rather than adopt regular C-language constructs that are later spatially compiled on
to the processor. The subsequent section deals with results.

4.11 List of references


[1] FFT implementation on DSP-chips - theory and practice
Meyer, R.; Schwarz, K.;
Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference on;
3-6 April 1990, Page(s): 1503 - 1506 vol.3

[2] A 100-MHz, 16-b, direct digital frequency synthesizer with a 100-dBc spurious-free
dynamic range
Madisetti, A.; Kwentus, A.Y.; Willson, A.N., Jr.;
Solid-State Circuits, IEEE Journal of; Volume 34, Issue 8, Aug. 1999, Page(s): 1034 - 1043

[3] A new efficient computational algorithm for bit reversal mapping
Drouiche, K.;
Signal Processing, IEEE Transactions on; Volume 49, Issue 1, Jan. 2001, Page(s): 251 - 254

[4] Effect of rounding and saturation in fixed-point DSP implementation of IFFT and FFT
for OFDM applications
Armstrong, J.; Suraweera, H. A.; Brewer, S.; Slaviero, R.;
The Embedded Signal Processing Conference (GSPx 2004), September 2004.
Text available at: http://www.ctie.monash.edu.au/ofdm/paperpdfs/2004/armstrong_gspx_04.pdf

Chapter 5
Results and Conclusions

In this section, results obtained from the current design are enumerated. The section begins with
a discussion of the signal-to-quantisation noise ratio, the formation of the test input, and
pertinent details of the test bench. This is followed by a discussion of the schedule and of the
results obtained from synthesis. Lastly, a section suggesting possibilities worth exploring has
been included.

5.1 Numerical accuracy and SQNR


In any integer-type DSP processor, numerical accuracy is a very important issue that needs to be
examined. There is no point in a very fast, low-power, programmable FFT processor if the final
answer is numerically unsatisfactory. As outlined in Section 4.9, the data values are shifted right
once every alternate stage, starting from the first stage. It is important that the test bench take
note of the number of times the shifting takes place over the number of stages; only then will
the power scales of the test bench and the processor-generated data match. It is important to test
the processor's accuracy with realistic data in order to obtain a reliable assessment of the
system's performance for UWB-OFDM. Hence, 128-point input data is generated in Matlab
along the guidelines specified for the UWB MB-OFDM standard (as in Section 3.2). The
constructed frequency values are QPSK modulated, and the corresponding time-domain values
are scaled up with regard to the signal's peak-to-average ratio to better suit the dynamic range.
This data is fed as input during the processor simulation, and the results obtained from the
processor are compared with the reference values generated by the Matlab function call. The
resulting error is owing to quantisation and possibly to saturation effects. The plot of signal-to-
quantisation noise ratio for each ensemble value is depicted in Figure 5.1. It is worth noting that
since the signal value obtained for the 64th ensemble is ideally zero, the error is large at this
point. Since the signal format is already known, this point is discarded. Beyond that, the average
SQNR is about 75 dB, which is satisfactory for 16-bit based computational units.
Figure 5.1. Plot examining SQNR for each ensemble output for the 128-point FFT

The quality and capability of a processor can be proved only with an application kernel written
to suit the processor, and the quality of the kernel in turn is assessed by a host test bench written
in the C-language. For simulations, data is initially loaded onto the processor memories through
an Application Program Interface (API) macro. A similar API macro is used to retrieve data
from the processor Output memory after the FFT is fully computed. The test bench must also be
scalable with the order of the FFT, and it generates a report.

Figure 5.2. Comparison between results as obtained from test bench and core (processor)

The sections of the report can be used for comparison, as shown in the snapshot of Figure 5.2.
The FFT computation has a scaling factor of 1/√2 per stage, performed by a right shift every
alternate stage. The test bench keeps track of the extent of intermediate scaling that takes place,
so that it is accounted for while examining the result. The test bench also provides a step-by-step
data check, which is useful in tracing errors. The simulation process for checking functional
accuracy is depicted in Figure 5.3. The test bench's FFT subroutine is corroborated with Matlab,
and thereafter used to provide reference data for comparison.

Figure 5.3. Graphical description of the simulation process

5.2 Scheduling results


An optimal schedule is generated by Manifold, the scheduling tool in the Silicon Hive tool suite.
Operations are packed into instruction words, and the number of instruction words determines
the required size of the program memory. The primary objective of scheduling the application
kernel with the setting for a 128-point FFT is, however, to meet the cycle budget imposed by
the UWB-OFDM specifications. A statistics file generated during scheduling gives the number
of instruction words and the number of clock cycles consumed to compute the application
kernel. The results obtained are:

Number of instruction words = 94 -- Equation 5.1.

Number of clock cycles (C_budget_result) = 105 -- Equation 5.2.

The statistics file in its entirety is provided in Appendix 5.a, wherein each instruction number is
listed alongside the number of cycles that instruction is executed. As a result of having to use
many issue slots to facilitate ping ponging of data, twiddle factor ping ponging and
rearrangement, handling of input and output ensembles separately, etc., the required instruction
word width is 618 bits. This is provided by Apex, the Silicon Hive tool that verifies the
consistency of the processor description. A few aspects come to the forefront. The first is the
relatively high percentage of overhead outside of the single-cycle folded loops. Since pipelining
happens within loop structures, the advantage takes on significant proportions when the inner
loop executes many iterations; however, with 8 butterflies executing in parallel, the inner loop
executes at most eight times per stage for a 128-point FFT. The second point is the small
difference between the number of program instruction words and the number of clock cycles.
Typically, there should be a significant difference between the two, with the number of cycles
being much larger than the number of program instructions. This again owes to having to
optimise for a relatively very small FFT size.
Given that the 128-point FFT application kernel has been capable of executing in 105
cycles, a recalculation is required to reassess the clock speed requirement, as shown below:

Time period for computation of entire 128-point FFT (T) = 312.5 ns -- Equation 5.3.

From Equations 5.2 and 5.3,

Time period required per clock cycle (t_c) = T / C_budget_result -- Equation 5.4.
= 312.5 ns / 105 cycles
= 2.98 ns per cycle

Clock speed (C.S.) required = 1 / t_c -- Equation 5.5.
= 336 MHz approximately


This clock speed requirement for the obtained schedule is in the close neighbourhood of the
initial chosen value of 300 MHz. Besides the primary level results quoted above, Manifold also
generates an in-depth and very elaborate graphical view of the scheduling schema. A small
section of the schedule, which is just for the issue slots, is depicted in Figure 5.4. The numbers
(0 to 93) at the top of the graph represent the index of the program counter, and thus the
instruction word being performed. At the left end of the graph are the labels of the various issue
slots. When an issue slot participates in an instruction, the box at the intersection between the
corresponding program counter index and the issue slot is shaded as shown. If the instruction is
repeated consecutively many times as in a loop, the shading is darker. The number shown within
each shaded box is the number of operations performed by the issue slot in that given instruction
word. It can be noted that instead of the expected seven darker columns (representing the
pipelined loops for the 7 stages of the 128-point FFT), there are only five. This is because the
loops corresponding to stages two and three are unrolled, as the latency of the loop is smaller
than the sum of the pre- and post-ambles. This is also evident in the schedule statistics in
Appendix 5.a.

Figure 5.4. Schedule view of the issue slot usage
5.3 Processor architecture
The processor description in TIM is quite abstract in nature. A graphical view that is useful to
visualize the processor architecture and the extent of connectivity is obtained from Corebrowser,
a custom tool within Silicon Hive. Such a graphical view of the processor, wherein levels of
hierarchy can be traversed, is useful while writing the application kernel. It not only helps check
for a routing path for data between issue slots, but also helps in easily spotting mistakes in the
processor description files, such as in the labelling used to reference specific blocks explicitly
from the application kernel. A snapshot of the entire processor diagram, as generated by
Corebrowser, is shown in Figure 5.6.1, and enlarged views for readability's sake are depicted
in Figures 5.6.2 to 5.6.5. The processor diagram is the complete version of the schematic
architecture diagrams discussed in Figures 4.5 and 4.6 in Section 4.7, and of Figure 5.5.

Figure 5.5 Schematic of FFT Processor Architecture in terms of the Silicon Hive template

The FFT cluster contains the LSUs for interfacing the data and twiddle factor memories. Each
data-handling LSU is accompanied by an ALU to locally generate data indices. The
initialisation of indices, and the indices for twiddle factors, are obtained from the general-purpose
cluster, which primarily contains arithmetic and logic capabilities. The FFT cluster also contains
a separate issue slot for twiddle rearrangement. The control cluster primarily handles the
branching instructions for loops.

5.4 Synthesis results


In this section, synthesis results for the processor are detailed. The TIM description files also
serve as the input to Genesys, a Silicon Hive tool for generating VHDL code corresponding to
the processor. Beyond that point, the standard hardware synthesis flow is adopted in generating
a netlist and estimating parameters such as gate count, area, power dissipation, etc. Synthesis
results were generated using Ambit, a synthesis tool from the Cadence tool suite. Ambit
provides details about the area occupied by the different blocks. The technology setting was for
120nm CMOS, and the generated results for the various blocks are tabulated in Table 5.1.
For the 1.2V supply, 120nm CMOS fabrication process, 180k gates reside within 1 mm2
of chip area. The extent of power dissipation can be calculated using the formula stated in
Equation 5.6:

Power dissipation (mW) = 0.92 × Area (mm2) × C.S. (MHz) -- Equation 5.6

The value of clock speed (C.S.) used is 336 MHz, as derived in Section 5.2. From Table 5.1,
the area corresponding to the logic blocks comes to 2.03mm2. There would be a fractional
increase of approximately 20% after routing, giving an area of 2.44mm2. As can also be noted,
the area is dominated by the memories, here chosen to allow the processor to scale up to the
order of a 1024-point FFT. An advantage of the twiddle-factor-driven approach is that, when
not required, the twiddle factor memory can be turned off. This advantage becomes more
prominent towards the later stages of the FFT, as the number of butterflies that make use of the
same twiddle factor set increases. It must also be noted that the power dissipation values cited in
Table 5.1 are estimated for the case of full activity levels for all blocks. As an example
representative of processor activity, of the eight LSUs in the FFT cluster that handle data, only
four operate at a given moment. Thus, if clock gating were incorporated and sections of the core
turned off when not in use, the power dissipation of the core would be almost halved. With the
average core activity measured to be in the range of 30%, the average power dissipation will be
188.4 mW.

Table 5.1. Synthesis results for the core

Logic Block                                   | Area (mm2) | Gate count (x10^3 gates) | Power dissipation (mW)
Logic cells excluding the FFT functional unit | 0.814      | 146.52                   | 251.62
Estimate of FFT functional unit               | 1.104      | 198.72                   | 341.26
Processor input/output logic                  | 0.114      | 20.52                    | 35.23
Total                                         | 2.032      | 365.76                   | 628.11
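Equation 5.6 can be checked against the rows of Table 5.1: the 2.032mm2 logic total at 336 MHz gives 0.92 × 2.032 × 336 ≈ 628.1 mW, matching the tabulated total. A trivial helper (name illustrative) makes the check repeatable:

```c
/* Equation 5.6: full-activity power estimate in mW for the 1.2V, 120nm
   process, given block area (mm^2) and clock speed (MHz). */
double power_mw(double area_mm2, double freq_mhz)
{
    return 0.92 * area_mm2 * freq_mhz;
}
```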

The area and power dissipation figures for the memory blocks, catering to a 1024-point FFT
and again assuming full activity levels, are listed in Table 5.2. As only two of the four dual-port
data ensemble memories operate in the average case, the power dissipation for the memories
would be in the range of 1.4 Watts.

Table 5.2. Synthesis results for the memory blocks

Memories                                                              Area (mm2)   Power dissipation (mW)
Four 128 bit x 2048 dual port C12FDSRAM for Data                      6.32         1953.64
Two 128 bit x 1024 single port C12FSRAM for Twiddle factors           0.54         166.92
Two 32 bit x 1024 single port C12FSRAM for General purpose clusters   0.4          123.64
Five 128 bit x 128 single port C12FSRAM for Program memory            0.45         139.10
Total                                                                 7.71         2383.2
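The activity-scaled averages quoted above follow directly from the full-activity figures in Tables 5.1 and 5.2. The short calculation below reproduces them; it is illustrative only, with the activity factors (30% average core activity, two of four data memories active) taken from the discussion:

```python
# Reproduce the activity-scaled power estimates (all values in mW,
# taken from Tables 5.1 and 5.2; activity factors from the text).
core_full = 628.11                  # core total at full activity
core_avg = core_full * 0.30         # measured average core activity of 30%

data_mem_full = 1953.64             # four dual-port data ensemble memories
other_mem_full = 166.92 + 123.64 + 139.10   # twiddle, general purpose, program
mem_avg = 0.5 * data_mem_full + other_mem_full  # two of four data memories active

print(round(core_avg, 1))           # 188.4 (mW)
print(round(mem_avg / 1000, 1))     # 1.4 (W)
```

Both results agree with the figures of 188.4 mW and roughly 1.4 Watts cited in the text.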

Since the functionality of such an ASIP is due to the cohesive operation of both the
hardware and software sections, comparison with a non-programmable ASIC solution isn't
entirely justified. However, to obtain a qualitative understanding of the ASIP, its area in the
optimal context (i.e., the 128-point FFT for UWB-OFDM) is compared with that of an ASIC
solution, the details of which are provided in Table 5.3.

Table 5.3. Area comparison with an ASIC solution

Design         Process technology feature size (nm CMOS)   Area of FFT module (mm2)
[1]            180                                         4.22
Present ASIP   120                                         2.44

The FFT module of the UWB MB-OFDM transceiver described in [1] uses a mixed-radix
approach to the 128-point FFT. In contrast, the present ASIP design is scalable with the order of
the FFT and attains the UWB performance requirements with a radix-2 based method. An
aspect to be considered is that, with a change in the order of the FFT, only the application kernel
(and hence the program length, cycle count, etc.) experiences scaling; the hardware costs remain
the same. Scalability is examined in more detail in Section 5.5.

Figures 5.6.1 to 5.6.5. Graphical views of the processor as generated from TIM description files using Corebrowser
5.5 Examining Scalability
Besides attaining the performance requirements of the UWB MB-OFDM standard, the ASIP is
designed with ease of scalability as a primary objective. The application kernel was compiled
and scheduled for FFT sizes of 64 and 256. The cycle counts for the different FFT sizes are
enumerated in Table 5.4.
Table 5.4. Scalability in terms of number of cycles

Order of FFT (N)   Cycle count
64                 54
128                105
256                216

These cycle counts from the actual schedule, as obtained from Manifold, are plotted in Figure 5.7
and compared against the theoretical minimum cycle count that would be obtained if there were
no overhead. As visible from the graph lines in Figure 5.7, the overhead has an additive
component and a multiplicative component (the latter manifesting as a difference in slopes).

Figure 5.7. Plot of scalability: cycle count obtained from the scheduler versus the ideal (no overhead) count, plotted against the order of the FFT (N)
The percentage overhead decreases with better loop folding. Thus, for larger values of N,
wherein the number of loop iterations is greater, the relative overhead is smaller. This is
illustrated in Figure 5.8.

Figure 5.8. Percentage overhead with respect to the scheduled value, decreasing with increase in FFT size (N)
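The overhead percentages of Figure 5.8 can be reproduced with a small calculation. Here the ideal count is assumed to correspond to the (N/2)·log2(N) radix-2 butterflies executing at eight butterflies per cycle with zero loop overhead; the script is an illustrative sketch, not part of the design flow:

```python
from math import log2

# Scheduled cycle counts from Table 5.4, versus an ideal count of
# (N/2)*log2(N) butterflies at eight butterflies per cycle.
scheduled = {64: 54, 128: 105, 256: 216}

for N, cycles in scheduled.items():
    ideal = (N // 2) * int(log2(N)) // 8
    overhead = 100.0 * (cycles - ideal) / cycles
    print(N, ideal, round(overhead, 1))
# N=64:  ideal 24,  overhead 55.6 %
# N=128: ideal 56,  overhead 46.7 %
# N=256: ideal 128, overhead 40.7 %
```

The computed percentages, falling from roughly 56% at N=64 to 41% at N=256, reproduce the downward trend of Figure 5.8.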

To put the scalability of the present design in further perspective, a comparison is performed
with an FFT implemented on the reconfigurable platform described in [2]. The technology
described in [2] utilizes an array of reconfigurable cells, and hence the hardware is dynamically
retargetable. The results in [2] show that the solution outperforms other multimedia
processors and that its processing time is comparable with that of ASIC implementations. The
solution in [2] and the present work are juxtaposed in Table 5.5 with regard to various
parameters. As can be noted, the present ASIP outperforms the FFT implementation in [2] in
all aspects.

Table 5.5. Comparison with a scalable FFT processor

Characteristic Aspect                 [2] (130 nm)              Present ASIP (120 nm)
1. Performance for the N=128 FFT:
   Cycle count                        225                       105
   Time period                        500 ns                    312.5 ns
                                      (clock speed = 450 MHz)   (clock speed = 336 MHz)
2. Scalability, cycle count for:
   N=64                               111                       54
   N=128                              225                       105
   N=256                              520                       216
3. Peak power (core + memories)       4 Watts                   1.78 Watts
4. Area (without memory)              8 x 8 mm2                 2.44 mm2

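As a sanity check on the first row of Table 5.5, the execution times follow directly from cycle count and clock speed; the calculation below is trivial but confirms the tabulated figures:

```python
# Execution time in nanoseconds from cycle count and clock speed in MHz.
def exec_time_ns(cycles, clock_mhz):
    return cycles / clock_mhz * 1e3

print(exec_time_ns(105, 336))   # 312.5 ns (present ASIP)
print(exec_time_ns(225, 450))   # 500.0 ns (solution of [2])
```

The 312.5 ns result is exactly the UWB-OFDM symbol period that the design targets.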
It is further worth noting that the application kernel is fully scalable given the order of the FFT,
N, and the number of stages in the FFT; at the system level, both can be dispatched to the ASIP
from the host processor. On the hardware front, the FFT and generic clusters remain unchanged;
only the size of the memory has to be increased to accommodate the data and twiddles of larger
FFTs. The present area and power figures assume a maximum of a 1024-point FFT.

5.6 Scope for further research

Although the objectives of the thesis have been attained, a few fundamental directions
present themselves for continued research:
1. The present application kernel is twiddle factor driven, but only within a single stage. The
logical extension is to make the implementation twiddle factor driven between stages as
well. In that case, the butterflies spanning two or three stages would be combined into
an inner loop, and there would be no need to perform intermediate store operations to
memory. The implications for loop folding, overhead, register file usage to house
intermediate data between stages, etc. need to be examined in depth.
2. The gamut of features within the present design can be extended to be capable of
performing an Inverse FFT (IFFT), block shifting of carriers to realign spectral
components, etc. The possibility of incorporating bit-reversed addressing for the last
stage even during vectorial processing is worth investigating.
3. The constant geometry approach can be programmed on the same processor. A
comparison of the overhead incurred by the constant geometry and inter-stage twiddle
factor driven approaches can then be performed.
4. It would be beneficial to use a single memory for the twiddle factors, wherein the
twiddles are stored in the same order that they would be used in the various stages of the
FFT. This would obviate the need to ping-pong and interleave twiddles between stages.
5. The conventional approach to designing DSP processors has been to assume that data is
made available in Cartesian coordinates, in the form (real component) + j (imaginary
component). However, data made available to the DSP section of a system solution
has first undergone amplification and filtering in an analogue RF section, followed by
analogue-to-digital conversion, and scaling by an Automatic Gain Control (AGC)
block driven by an estimator of received signal strength. It is also possible to have
analogue preconditioning circuits that estimate the phase and magnitude of a signal
(circuit blocks of this type are variations of phase-locked loops, implementations of
bipolar translinear circuits, etc.). If the DSP section were to receive data in polar
form, as in Magnitude ∠ Phase, the functional units would become a lot simpler. This
would be well suited to twiddle factors being generated on chip in polar coordinates.
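As a sketch of the fourth direction above, the unique twiddle factors of each radix-2 stage can be concatenated in the order the stages consume them, so that a single, linearly addressed memory serves the whole transform. The function below assumes a decimation-in-frequency decomposition; the name and structure are illustrative, not taken from the present design:

```python
import cmath

def stage_ordered_twiddles(N):
    # Concatenate the unique twiddles of every radix-2 DIF stage,
    # in the order the stages consume them; total length is N - 1.
    table = []
    stride = 1
    while stride < N:                      # log2(N) stages in total
        count = N // (2 * stride)          # unique twiddles in this stage
        table.extend(cmath.exp(-2j * cmath.pi * (k * stride) / N)
                     for k in range(count))
        stride *= 2
    return table

print(len(stage_ordered_twiddles(128)))    # 127
```

With such a table, each stage simply continues reading where the previous one stopped, obviating the need to ping-pong and interleave twiddles between stages.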

5.7 Summary of Results

Present work involves the user dominated hardware/software codesign of a programmable,
high performance FFT processor and its concomitant scalable application kernel, applicable to
the high performance requirements of the UWB-OFDM standard. The design has been carried
out within the template of Silicon Hive, and results for the design have been extracted. The
salient points of the processor are summarised as follows:
- The present ASIP is a first 'big step' in attaining a level of performance that is
  comparable to an ASIC solution, by implementing the 128-point FFT for UWB-OFDM
  within 312.5 ns while remaining fully C-programmable. At the time of this writing, such
  high performance applications fall exclusively in the domain of ASIC solutions.
- A specialized FFT hardware unit that computes eight radix-2 butterflies simultaneously
  is used. This is a highly parallel, Single Instruction Multiple Data (SIMD) architecture.
- The architecture is designed such that it can equally serve as a target device for a kernel
  based on either the twiddle factor driven in-place approach or the constant geometry
  approach. This versatility of the hardware with regard to the FFT implementation
  approach is a significant aspect.
- The twiddle factor driven approach has been chosen in the present case, and the kernel is
  fully scalable in the size of the FFT. The extent of changes to be incorporated in
  hardware for handling varying FFT sizes is minimal and only in terms of memory
  availability.
- Although the area is dominated by the processor memories, the area of the FFT core is
  estimated at 2.44 mm2, which proves comparable with an ASIC solution.
- Scalability has been addressed by providing scheduled cycle counts for the 64 and
  256-point FFT cases. This aspect has been compared against a rival solution, with the
  present design proving better in all aspects of performance, scalability, area and power
  dissipation.

With regard to scalability, the overhead is noted to be partly additive and partly
multiplicative. Owing to a greater number of inner loop iterations for larger FFT sizes, the
loop folding is better and hence the percentage overhead decreases with increase in FFT size.
Directions for continuing research have been suggested.

5.8 References
[1] H. Y. Liu, C. C. Lin, Y. W. Lin, C. C. Chung, K. L. Lin, W. C. Chang, L. H. Chen,
H. C. Chang and C. Y. Lee, "A 480Mb/s LDPC-COFDM-Based UWB Baseband Transceiver,"
Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC),
San Francisco, February 2005.

[2] A. H. Kamalizad, C. Pan and N. Bagherzadeh, "Fast parallel FFT on a reconfigurable
computation platform," Proceedings of the 15th Symposium on Computer Architecture and
High Performance Computing, 10-12 Nov. 2003, pp. 254-259.
Appendix 5.a.
Statistics Report of Scheduler for N = 128

Cycles per PC (Program Counter) value: each of PC 0 to 94 takes 1 cycle,
except PC 12, 52, 68, 78 and 88, which take 3 cycles each.
Total: 105 cycles

runtime      = 0.071185
ilp          = 6.02857
instructions = 105
instrs/sec   = 1475
operations   = 633
ops/sec      = 8892
stages       = 962
stages/sec   = 13514
List of Abbreviations
ASIP Application Specific Instruction set Processor
AGU Address Generation Unit
ALU Arithmetic Logic Unit
ANSI American National Standards Institute
API Application Program Interface
ASIC Application Specific Integrated Circuit
CMOS Complementary Metal Oxide Semiconductor
CORDIC Coordinate Rotation Digital Computer
DFT Discrete Fourier Transform
DIF Decimation In Frequency
DIT Decimation In Time
DSP Digital Signal Processing
DS-UWB Direct Sequence - UWB
FCC Federal Communications Commission
FFT Fast Fourier Transform
FIR Finite Impulse Response
FU Functional Unit
IEEE Institute of Electrical and Electronics Engineers
ILP Instruction Level Parallelism
IN Interconnection Network
IP Intellectual Property
IS Issue Slot
LAN Local Area Network
LSU Load/Store Unit
MBOA Multiband-OFDM Alliance
MB-OFDM Multiband-OFDM
MMSE Minimum Mean Square Error
OFDM Orthogonal Frequency Division Multiplex
PC Program Counter
PSE Processing and Storage Element
QPSK Quadrature Phase Shift Keying
SIMD Single Instruction Multiple Data
SFG Signal Flow Graph
SQNR Signal-to-Quantisation Noise Ratio
TIM The Incredible Machine (abstract hardware description language)
ULIW Ultra Long Instruction Word
UWB Ultra Wide Band
VHDL VHSIC Hardware Description Language
VHSIC Very High Speed Integrated Circuitry
VLIW Very Long Instruction Word
WAN Wide Area Network
WPAN Wireless Personal Area Network

Acknowledgements
First and foremost, I convey my gratitude to Rene van Leuken, my supervisor at
the Delft University of Technology, for creating this opportunity at Silicon Hive. I am grateful
to him for being so instrumental in my undertaking this graduation project. It has been a
wonderful 7 months, and I owe it to his dynamism.
I gratefully thank Jos Huisken - my supervisor at Silicon Hive - for giving me the
opportunity to work on this exciting project. I'm very grateful for our project discussions, when
I was given the opportunity to benefit from the wealth of his insight and experience.
The culmination of such a thesis was probably more of a burden on others at Silicon
Hive than myself. To that end, I'm very indebted to Marc Quax for mentoring me through the
design-flow and tool usage. He definitely deserves a medal for the patience he demonstrated
with a greenhorn such as myself. I'm very grateful to Ingolf Held for guiding me through the
nuances of the UWB-OFDM standard, and for contributing to the thesis report. I'm very
thankful for the inputs he provided me with on our many technical and non-technical
discussions.
In general, I thank all those at Silicon Hive and Philips who contributed to the thesis. It
is this stimulating environment - inadequately indicated by names, but surely including the
above - that I am deeply indebted to.
On a more personal note, I am very thankful to my parents - Mrs. Lalitha Chidambaram
and Mr. T. S. Chidambaram - for their constant support and encouragement. I'm also thankful to
all those who engraved their footprint on my life during this period: all my wacky housemates
at Galileistraat for keeping me sane, yet driving me wild at the same time; the Tamilian troupe at
Delft - Rajakumari Kumaraswamy and L. K. Madan; Richa Kumar for an awakening; Anup
Barve for the experiments in his kitchen; Deng Wei, Xiaodong Liu and the rest of the basketball
gang at Delft; the list is endless...
As a dedication, I would dedicate this effort on my part to those teachers in my past and
present for having shaped my future without even knowing about it. Lastly, but by no means
least, I thank the one who orchestrated it all knowing fully well - the guy upstairs and the best
teacher of them all!

Delft, The Netherlands Ramesh Chidambaram


Summer of 2005


