Microprocessors and Microsystems 21 (1997) 237-248
Performance evaluation of DSP and transputer based systems in sequential
real-time applications
M.O. Tokhi a,*, M.A. Hossain b, C. Chambers c
a Department of Automatic Control and Systems Engineering, The University of Sheffield, Sheffield, UK
b Department of Computer Science, University of Dhaka, Dhaka, Bangladesh
c Department of Computer Science, The University of Sheffield, Sheffield, UK
Received 21 October 1996; received in revised form 22 July 1997; accepted 12 August 1997
Abstract
This paper presents an investigation into the performance evaluation of high-performance processors in real-time applications. The
performance evolution of the processors in relation to task size, compiler efficiency for numerical computation and code optimisation are
investigated. A quantitative measure of performance evaluation of processors is introduced. An adaptive filtering algorithm and a beam
simulation algorithm are considered. These are implemented on several high-performance processors incorporating transputers and digital
signal processing devices. A comparative performance evaluation of the architectures is made, demonstrating the critical issues encountered
with fast processing techniques in real-time signal processing and control. © 1997 Elsevier Science B.V.
Keywords: Digital signal processing; Performance evaluation; Real-time signal processing and control
1. Introduction
A real-time system can be regarded as one that has to
respond to externally-generated stimuli within a finite and
specified period. Despite the vastly increased computing
power which is now available there can still be limitations
in computing capability of digital processors in real-time
control applications for two reasons: (a) sample times
have become shorter as greater performance demands are
imposed on the system, (b) algorithms are becoming more
complex as the development of control theory leads to an
understanding of methods for optimising system performance. To satisfy these high performance demands, microprocessor technology has developed at a rapid pace in recent
years. This development has been based on (i) processing speed, (ii) processing
ability, (iii) communication ability, and (iv) control ability.
Digital signal processing (DSP) devices are designed in
hardware to perform concurrent add and multiply instructions and execute irregular algorithms efficiently, typically
finite-impulse response (FIR) and infinite-impulse response
(IIR) filter algorithms. Vector processors are designed to
efficiently process regular algorithms involving matrix
manipulations. However, many demanding complex signal
* Corresponding author. Tel.: +44 (0)114 222 5617; fax: +44 (0)114 273
1729; e-mail: O.Tokhi@sheffield.ac.uk
processing and control algorithms cannot be satisfactorily
realised with conventional computing methods. Alternative
strategies, where high-performance computing methods are
employed, could provide suitable solutions in such applications [1]. Little work has been reported on such methods in
real-time signal processing and control applications [2-5].
For microprocessors with widely different architectures, performance measurements such as MIPS (million
instructions per second), MOPS (million operations per
second) and MFLOPS (million floating-point operations
per second) are meaningless. Of more importance is to
rate the performance of a processor on the type of program
likely to be encountered in a particular application [4]. The
different microprocessors and their different clock rates,
memory cycle times, etc., all confuse the issue of attempting
to rate the processors. In particular, there is an inherent
difficulty in selecting microprocessors in real-time signal
processing and control applications. The ideal performance
of a microprocessor demands a perfect match between
processor capability and program behaviour. Processor
capability can be enhanced with better hardware technology, innovative architectural features and efficient resource
management. From the hardware point of view, current
performance varies according to whether the processor
possesses a pipeline facility, is microcode/hardware operated, has an internal cache or internal RAM, has a built-in
math co-processor, floating point unit etc. Program behaviour, on the other hand, is difficult to predict due to its
heavy dependence on application and run-time conditions.
Other factors affecting program behaviour include algorithm design, data structure, language efficiency, programmer skill and compiler technology [6,7]. The work reported
in this paper attempts to investigate such issues within the
framework of real-time applications. The aim of this investigation is to explore real-time performance evaluation
issues of DSP devices, vector processors and transputers.
In this manner, a typically available processor type from each
category of computing domains is selected and used to provide a comparative study of the features referred to above.
An investigation into the computing capabilities of
several high-performance processors and their suitability
in real-time applications is presented in this paper. Three
computing domains, namely the Texas Instruments
TMS320C40 (C40) DSP device, an Intel 80i860 (i860)
vector processor, and an Inmos T805 (T8) transputer are
studied. The algorithms chosen to highlight the main
characteristics of these processors are the least mean square
(LMS) based adaptive filtering algorithm and finite difference (FD) simulation of a flexible beam in transverse
vibration. Previous investigations have established a comparative performance evaluation of these processors in
implementing the above algorithms on the basis of fixed
task sizes. Not much work, however, is known of the performance evolution of such processors in relation to task
size in real-time applications [8]. Moreover, similar studies
involving code optimisation and compiler efficiency have
not been reported. This work aims at addressing these
points.
2. Algorithms

The performance of a processor is largely affected by the
computing requirements of an application. This in turn is
determined by the degree of regularity of the algorithm.
Regularity is used to describe the degree of uniformity in
the execution thread of the computation. An algorithm consisting of varying loops and conditional jumps, for instance,
is considered to be highly irregular. Many algorithms can be
expressed by matrix computations. This leads to the so-called
regular iterative (RI) type of algorithms, due to their
very regular structure. In implementing an RI type algorithm, a vector processor will, principally, be expected to
perform better. Moreover, if a large amount of data is to be
handled for computation in this type of algorithm, the
performance will be further enhanced if the processor has
more internal data cache, instruction cache and/or a built-in
coprocessor.

The algorithms considered in this investigation consist
of the FD simulation of a flexible beam structure and an
LMS adaptive filter algorithm. These are briefly described
below.

2.1. Beam simulation

Consider a cantilever beam system with a force U(x,t)
applied at a distance x from its fixed (clamped) end at
time t. This will result in a deflection y(x,t) of the beam
from its stationary position at the point where the force
has been applied. In this manner, the governing dynamic
equation of the beam is given by

\mu^2 \frac{\partial^4 y(x,t)}{\partial x^4} + \frac{\partial^2 y(x,t)}{\partial t^2} = \frac{1}{m} U(x,t)   (1)

where \mu is a beam constant and m is the mass of the beam.
Discretising the beam in time and length using central FD
methods, a discrete approximation to Eq. (1) can be
obtained as [9]

Y_{k+1} = -Y_{k-1} - \lambda^2 S Y_k + \frac{(\Delta t)^2}{m} U(x,t)   (2)

where \lambda^2 = [(\Delta t)^2/(\Delta x)^4]\mu^2, with \Delta t and \Delta x representing
the step sizes in time and along the beam respectively, S
is a pentadiagonal matrix (the so-called stiffness matrix of
the beam), and Y_i (i = k+1, k, k-1) is an n x 1 matrix representing the deflection of the end of sections (grid points) 1 to n
of the beam at time step i (the beam being divided into n - 1
sections). Eq. (2) is the required relation for the simulation
algorithm and can easily be implemented on a digital
processor.

2.2. The LMS adaptive filter

The LMS algorithm is one of the most successful adaptive algorithms, developed by Widrow and his co-workers
[10]. It is based on the steepest descent method, where the
weight vector is updated according to

W_{k+1} = W_k + 2\eta e_k X_k   (3)

where W_k and X_k are the weight and the input signal vectors
at time step k respectively, \eta is a constant controlling
the stability and rate of convergence, and e_k is the error
given by

e_k = y_k - W_k^T X_k   (4)

where y_k is the current contaminated signal sample. Eqs. (3)
and (4) constitute the LMS adaptive filter algorithm.

Note in the above that the beam simulation algorithm can
be considered as an RI algorithm, whereas the LMS adaptive
filter constitutes an irregular algorithm.
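As an illustration (not part of the paper), Eqs. (3) and (4) can be sketched in a few lines of Python. The filter length of 5 and \eta = 0.04 follow the values used later in Section 6.2; the test input, the fixed FIR system being identified and the random seed are assumptions of the sketch:

```python
# Sketch of the LMS adaptive filter of Eqs. (3) and (4).
import random

def lms_filter(x, d, n_weights, eta):
    """Return the error sequence e_k for input x and desired signal d."""
    w = [0.0] * n_weights                        # weight vector W_k
    errors = []
    for k in range(n_weights - 1, len(x)):
        xk = x[k - n_weights + 1:k + 1]          # input vector X_k
        y = sum(wi * xi for wi, xi in zip(w, xk))  # filter output W_k^T X_k
        e = d[k] - y                             # Eq. (4): e_k = y_k - W_k^T X_k
        w = [wi + 2.0 * eta * e * xi for wi, xi in zip(w, xk)]  # Eq. (3)
        errors.append(e)
    return errors

random.seed(0)
x = [random.uniform(-1.0, 1.0) for _ in range(2000)]
# Desired signal: output of a fixed 3-tap FIR system to be identified.
d = [0.0] * len(x)
for k in range(2, len(x)):
    d[k] = 0.5 * x[k] - 0.3 * x[k - 1] + 0.1 * x[k - 2]

err = lms_filter(x, d, n_weights=5, eta=0.04)
```

Because the desired signal is an exact FIR function of the input, the error shrinks towards zero as the weights converge, which is the behaviour exploited in the timing experiments of Section 6.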
3. Hardware
The architectures considered include an i860 vector processor, a C40 DSP device and a T8 transputer. These are
described below.
The i860 is a high-performance 64-bit vector processor
with 40 MHz clock speed, a peak integer performance of
40 MIPS, 8 kbytes data cache and 4 kbytes instruction
cache and is capable of 80 MFLOPS. It is Intel's first
superscalar RISC processor possessing separate integer,
floating-point, graphics, adder, multiplier and memory
management units. The i860 executes 82 instructions,
including 42 RISC integer, 24 floating-point, 10 graphics,
and six assembler pseudo operations in one clock cycle.
All external address buses are 32-bit wide and the external
data path or internal data bus is 64-bits wide. However,
the internal RISC integer ALU is only 32 bits wide. The
instruction cache transfers 64 bits per clock cycle, equivalent to 320 Mbytes/sec at 40 MHz. In contrast, the data
cache transfers 128 bits per clock cycle. There are two
floating-point units, namely, the multiplier and the adder
units, which can be used separately or simultaneously
under the co-ordination of the floating point control unit.
Special dual-operation floating-point instructions such as
add-and-multiply and subtract-and-multiply use both the
multiplier and adder units in parallel. Furthermore, both
the integer and the floating-point control units can execute
concurrently [6].
The C40 is a high-performance Texas Instruments 32-bit
DSP processor with 40 MHz clock speed, 8 kbytes on-chip
RAM, 512 bytes on-chip instructions cache and is capable
of 275 MOPS and 40 MFLOPS. This DSP processor
possesses six parallel high-speed communication links for
inter-processor communication, with a 20 Mbytes/sec asynchronous transfer rate at each port, and an eleven operations/
cycle throughput. In addition, it possesses two identical
external data and address buses supporting shared memory
systems and high data rate, single-cycle transfers. It has
separate internal program, data, and DMA coprocessor
buses for support of massive concurrent I/O of program
and data throughput, thereby maximising sustained CPU
performance [11,12].
The T8 is a general purpose medium-grained 32-bit
Inmos processor with 25 MHz clock speed, yielding up to
20 MIPS performance, 4 kbytes on-chip RAM and is
capable of 4.3 MFLOPS. The T8 is a RISC processor
possessing an on-board 64-bit floating-point unit and four
serial communication links. The links operate at a speed of
20 Mbits/sec achieving data rates of up to 1.7 Mbytes/sec
unidirectionally or 2.3 Mbytes/sec bidirectionally. Most
importantly, the links allow a single transputer to be used
as a node among any number of similar devices to form a
powerful parallel processing system. The transputer thus
provides an important bridge between single-chip real-time
control and general-purpose real-time computer control
systems, and, in effect, removes the current distinction
between the two [13,14].
4. Software
Software support is needed for the development of
efficient programs in high level languages. The ideal
performance of a computer system demands a perfect
match between machine capability and program behaviour.
Program performance can be measured by the turnaround time, which
includes disk and memory access, input and output activities, compilation time, operating system overhead and
CPU time. To shorten the turnaround time, one can reduce
all these time factors [6]. Minimising the run-time memory
management within the program and selecting an efficient
compiler, for a specific computation demand, could enhance
the performance. Compilers have a significant impact on
the performance of the system. This is not to say that
any particular high-level language dominates another.
Most languages have advantages in certain computational
domains.
The compiler itself
is critical to the performance of the system as the efficiency
of the mechanism for taking a high-level description of the
application and transforming it into a hardware dependent
implementation differs from compiler to compiler. Identifying the foremost compiler for the application in hand is,
therefore, especially challenging due to the unpredictable
run-time behaviour of the program and memory management capabilities using different compilers. In signal processing and control applications it is important to select a
suitable programming language that can support highly
numerical computation for real-time implementation. The
investigation here involves performance evaluation issues
of some commonly used compilers with the computing
platforms considered.
The algorithms are coded in high-level languages as
appropriate for the hardware used, with the code structure
kept as similar as possible for a given application. The
compilers used consist of the Inmos ANSI C (for T8),
Portland Group ANSI C (for i860), 3L Parallel C (for C40
and T8) and Occam (for T8). For the implementations
involving the T8, as will be noted later, the ANSI C compiler is used in investigations involving performance
evaluations of the hardware architectures in implementing
the algorithms whereas the 3L Parallel C and Occam are
used in investigations involving the performance evaluation
of the compilers.
Performance is also related to the code optimisation facility
of the compiler, which may be machine dependent. The goal
of program optimisation is, in general, to maximise the
speed of code execution. This involves several factors
such as minimisation of code length and memory accesses,
exploitation of parallelism, elimination of dead code, in-line
function expansion, loop unrolling and maximum utilisation
of registers. The optimisation techniques include vectorisation using pipelined hardware and parallelisation using
multiprocessors simultaneously [6]. The i860 and the C40
processors with their respective compilers are considered in
investigating their optimisation facility.
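As a schematic illustration of one of the techniques listed above (written in Python for brevity, rather than as compiler-generated code), loop unrolling trades code length for fewer loop-control operations:

```python
# Illustration of loop unrolling: the unrolled version computes the same
# dot product with one loop-control branch per four multiply-adds.
def dot(a, b):
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def dot_unrolled4(a, b):
    """Functionally equivalent dot product, unrolled by a factor of four."""
    s = 0.0
    n = len(a)
    i = 0
    while i + 4 <= n:       # main unrolled loop
        s += (a[i] * b[i] + a[i + 1] * b[i + 1]
              + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3])
        i += 4
    while i < n:            # clean-up loop for leftover elements
        s += a[i] * b[i]
        i += 1
    return s
```

An optimising compiler applies this kind of transformation automatically, together with the register allocation and instruction scheduling that make it profitable on pipelined hardware.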
5. Performance metrics
A commonly used measure of performance of a processor
in an application is speedup. This is defined as the ratio
of execution time of the processor in implementing the
application algorithm relative to a reference time or execution time of a reference processor [4]. The speedup thus
defined provides a relative performance measure of a processor for fixed load (task size) and thus can be referred to
as fixed-load speedup. This can also be used to obtain a
comparative performance measure of a processor in an
application with fixed task sizes under different processing
conditions, for example, with and without code optimisation. However, this performance metric does not provide a
measure of performance of the processor over a wide range
of computational demands of the application.
It has been observed previously and demonstrated later in
this investigation that the performance of a processor, as
execution time, in implementing an application algorithm
generally evolves linearly with the task size [8]. With some
processors, however, anomalies in the form of change of
gradient (slope) of execution time to task size are observed.
These are mainly due to run-time memory management
conditions of the processor where up to a certain task size
the processor may find the available cache sufficient, but
beyond this it may require to access lower level memory.
Despite this the variation in the slope is relatively small and
the execution time to task size relationship can be considered as linear. This means that a quantitative measure
of performance of a processor in an application can
adequately be given by the average ratio of task size to
execution time or the average speed. Alternatively, the performance of the processor can be measured as the average
ratio of execution time per unit task size, or the average
(execution time) gradient. In this manner, a generalised
performance measure of a processor relative to another in
an application can be obtained.
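As a sketch (not from the paper), the average gradient, and from it the generalised speedup, can be estimated from measured (task size, execution time) pairs by a least-squares line fit; the timing data below are made up for illustration:

```python
# Least-squares estimate of the execution-time gradient (time per unit
# task size) for two processors, and the generalised speedup S_1/2 = G2/G1.
def gradient(sizes, times):
    """Slope of the best-fit line of execution time against task size."""
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_t = sum(times) / n
    num = sum((s - mean_s) * (t - mean_t) for s, t in zip(sizes, times))
    den = sum((s - mean_s) ** 2 for s in sizes)
    return num / den

def generalised_speedup(sizes, times_p1, times_p2):
    """Speedup of processor p1 relative to p2, computed as G2 / G1."""
    return gradient(sizes, times_p2) / gradient(sizes, times_p1)

# Hypothetical measurements: execution time (s) against task size.
sizes = [5, 20, 40, 60, 80]
t_p1 = [0.010, 0.041, 0.080, 0.121, 0.160]   # faster processor
t_p2 = [0.050, 0.200, 0.401, 0.600, 0.801]   # slower processor
```

The least-squares slope averages out the small gradient changes caused by run-time memory management, which is exactly why the average gradient is usable as a single-number performance measure.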
Let the average speeds with two processors p1 and p2 in
an application be denoted by V1 and V2, respectively. The
generalised (execution time) speedup S_{1/2} of p1 relative to p2
in implementing the application algorithm can thus be
defined as

S_{1/2} = V_1 / V_2

Alternatively, if the corresponding average gradients with
p1 and p2 for the application are given by G1 and G2, respectively, S_{1/2} can be expressed as

S_{1/2} = G_2 / G_1

The concept of generalised speedup described above can
also be utilised to obtain a comparative performance
evaluation of a processor for an application under various
processing conditions.

The concept of speed, assumed to be constant for a
processor in an application, has previously been utilised to
derive an expression for the generalised speedup as the ratio
of parallel speed (of a parallel architecture) over sequential
speed (of a single processor) [15,16]. The generalised
speedup introduced above, however, reflects the relative
performance of two uni-processor architectures in an application, and of the same processor under two different processing conditions.
6. Implementations and results
In the following investigations involving the beam simulation algorithm, an aluminium type cantilever beam of
length l = 0.635 m, mass m = 0.037 kg and \mu = 1.351
was considered.
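For illustration (not part of the paper), the FD update of Eq. (2) with these beam parameters can be sketched in Python. The time step follows the \Delta t = 0.055 ms used in Section 6.2; the simplified boundary handling (out-of-range grid points taken as zero), the forcing, its point of application and the number of iterations shown are assumptions of the sketch:

```python
# Sketch of the finite-difference beam simulation of Eq. (2).  Boundary
# handling is simplified, so this illustrates the update structure rather
# than exact cantilever physics.
def simulate_beam(n=19, length=0.635, mass=0.037, mu=1.351,
                  dt=0.055e-3, steps=500, force=0.1, force_point=10):
    """Iterate the FD update of Eq. (2); return the final deflection profile."""
    dx = length / n
    lam2 = (dt ** 2) * (mu ** 2) / (dx ** 4)    # lambda^2 of Eq. (2)
    assert lam2 <= 0.25, "FD scheme unstable for this step-size choice"
    y_prev = [0.0] * n                          # Y_{k-1}
    y_curr = [0.0] * n                          # Y_k

    def y(j):                                   # grid access, zero outside range
        return y_curr[j] if 0 <= j < n else 0.0

    for _ in range(steps):
        y_next = [0.0] * n
        for i in range(n):
            # Row i of the pentadiagonal FD operator: [1 -4 6 -4 1]
            d4 = y(i - 2) - 4 * y(i - 1) + 6 * y(i) - 4 * y(i + 1) + y(i + 2)
            u = force if i == force_point else 0.0
            # Eq. (2) with the 2*Y_k term of S written out explicitly:
            # Y_{k+1} = 2*Y_k - Y_{k-1} - lam2*K*Y_k + dt^2 * U / m
            y_next[i] = (2 * y_curr[i] - y_prev[i] - lam2 * d4
                         + (dt ** 2) * u / mass)
        y_prev, y_curr = y_curr, y_next
    return y_curr
```

The per-iteration work is a fixed sequence of multiply-adds over the whole grid, which is what makes the algorithm regular iterative and well suited to a vector processor.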
6.1. Compiler efficiency
In this section results of investigations of the performance
evaluation of several compilers are presented and discussed.
The compilers involved are the 3L Parallel C version 2.1,
Inmos ANSI C and Occam. All these compilers can be
utilised with the computing platforms considered in this
investigation. It has previously been reported that, although
Occam is more hardware oriented and a straightforward
programming language for parallel processing, it may not
be as suitable as the Parallel C or ANSI C compilers for
numerical computations [17]. To obtain a comparative performance evaluation of these compilers, the flexible beam
simulation algorithm was coded, for 19 equal-length beam
sections, into the three programming languages and run on a
T8. Fig. 1 shows the execution times in implementing the
simulation algorithm, over 20 000 iterations, using the three
compilers. It is noted that the performances with Parallel C
and ANSI C are at a similar level, and are about 1.5
times faster than the Occam. This was further investigated
with a linear algebraic equation. Table 1 shows the performances with integer and floating point operations with and
without array, where
The code without array declares: z = (x + i * y - x * i)/(x * x + y * y).
Fig. 1. Performance of the compilers in implementing the simulation
algorithm on the T8 (execution time for Parallel C, ANSI C and Occam).
Table 1
Comparison of compilers for different types of data processing (execution time for 20 000/40 000 iterations)

Compiler        Floating type data processing         Integer type data processing
                With array       Without array        With array       Without array
3L Parallel C   0.1327/0.2488    0.1263/0.2333        0.1327/0.2488    0.1263/0.2333
ANSI C          0.1328/0.2444    0.0127/0.0227        0.1328/0.2444    0.0126/0.0226
Occam           0.1078/0.2052    0.1078/0.2044        0.1960/0.3825    0.1905/0.3698
The code with array declares: z(i) = [x(i) + i * y(i) - x(i) * i]/[x(i) * x(i) + y(i) * y(i)],
with i = 0, 1, ..., 20000, integer x = 55, y = 25 and floating
x = 55.02562, y = 25.13455. It is noted that better performance is achieved, throughout, with the ANSI C
compiler than the Occam, except in a situation where
the computation involved is floating type data processing with declaring array. As compared to ANSI
C, better performance is achieved with Parallel C for
both integer and floating type computation with array.
Better performance is achieved with Occam compiler
for floating type computation as compared to Parallel C.
It is also noted that, on doubling the amount of data
handled, 1.9 times more execution time was required with
Occam, whereas 1.87 times more was required with each of
ANSI C and Parallel C. This implies that, for large amounts
of data, run-time memory management is handled more
efficiently by Parallel C and ANSI C than by the Occam
compiler.
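For illustration, the benchmark expression above can be reproduced in modern Python to contrast scalar and array-indexed evaluation; the timing harness is an assumption of the sketch, and the timings reflect the interpreter rather than the 1997 compilers:

```python
# The linear algebraic benchmark of this section:
# z = (x + i*y - x*i)/(x*x + y*y), evaluated 20 000 times
# without and with array indexing.
import time

def bench_without_array(n=20000, x=55.02562, y=25.13455):
    z = 0.0
    for i in range(n):
        z = (x + i * y - x * i) / (x * x + y * y)
    return z

def bench_with_array(n=20000, xv=55.02562, yv=25.13455):
    x = [xv] * n
    y = [yv] * n
    z = [0.0] * n
    for i in range(n):
        z[i] = (x[i] + i * y[i] - x[i] * i) / (x[i] * x[i] + y[i] * y[i])
    return z

t0 = time.perf_counter()
z_scalar = bench_without_array()
t1 = time.perf_counter()
z_array = bench_with_array()
t2 = time.perf_counter()
```

The array variant performs the same arithmetic but adds indexed loads and stores on every iteration, which is precisely the run-time memory management overhead the compiler comparison in Table 1 exposes.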
6.2. Code optimisation
Code optimisation facility of compilers for hardware is
another important component affecting the real-time performance of a processor. Almost always optimisation facilities
enhance the real-time performance of a processor. The i860
and the C40 have many optimisation features [12,18,19].
The TMS320 floating-point DSP optimising C compiler is
the TMS320 version of the 3L Parallel C compiler [20].

Fig. 2. Execution times of the i860 in implementing the algorithms with the
Portland Group C compiler optimiser (execution time against optimisation
level): (a) the LMS algorithm; (b) the beam simulation algorithm.

Fig. 3. Speedup with the Portland Group C compiler optimiser in implementing the algorithms on the i860 (speedup against optimisation level):
(a) the LMS algorithm; (b) the beam simulation algorithm.

Fig. 4. Execution times of the processors in implementing the LMS algorithm (execution time against number of weights, with and without optimisation): (a) with the i860; (b) with the C40.

It
has many options, constituting three levels of optimisation,
which aid the successful optimisation of C source code files
on the C40. The Portland Group (PG) C compiler is an
optimising compiler for the i860 [19]. It incorporates four
levels of optimisation.
To measure the performance attainable from the compiler
optimisers, so as to fully utilise the available features,
experiments were conducted to compile and run the LMS
and the beam simulation algorithms on the i860 and the
C40. To study the effect of the PG optimising compiler,
the LMS algorithm was compiled with the number of
weights set at 5 and \eta = 0.04. The algorithm was imple-
the execution time of the processor in implementing the
algorithm over 1000 iterations was recorded. Similarly,
the beam simulation algorithm was compiled and implemented on the i860 with five beam segments and \Delta t =
0.055 ms. The execution time of the processor in implementing the algorithm over 20 000 iterations was recorded
with each of the four levels of optimisation. Fig. 2 shows
the execution times achieved in implementing the LMS and
the beam simulation algorithms, where level 0 corresponds
to no optimisation. The corresponding execution time
speedups achieved with each optimisation level in
implementing the LMS and beam simulation algorithms
are shown in Fig. 3. It is noted in Figs. 2 and 3 that the
performance of the processor in implementing the LMS
algorithm has enhanced significantly with higher levels of
optimisation. The enhancement in the case of the beam
simulation algorithm, on the other hand, is not significant
beyond the first level. The disparity in the speedups in
the case of the two algorithms is thought to be due to the
type of operations performed by the optimiser. As the LMS
algorithm has multiple nested loops, the compiler is able
to recognise the structures involved and restructure the
actual code so that it is still functionally equivalent but
suits the CPU architecture better. Branches to sub-routines,
for instance, are in-lined which cuts out procedure call
overheads, hence speeding up the execution time. The
beam simulation algorithm, however, is already in a
matrix format and thus does not need as much room for
improvement [19].
To study the effect of the optimisers further on the
performance of the system, optimisation level 0 (no optimisation) and level 2 were used with the 3L optimiser in
implementing the algorithms on the C40.

Fig. 5. Execution times of the processors in implementing the beam simulation algorithm (execution time against number of segments, with and without optimisation): (a) with the i860; (b) with the C40.

Similarly, with
the i860, using the PG compiler, optimisation level 0
and level 4 were utilised. The LMS and the beam simulation algorithms were coded for various task sizes, by
changing the number of weights in the case of the LMS algorithm and the number of segments in the case of the beam simulation algorithm. The algorithms were thus implemented
on the i860 and the C40. Figs. 4 and 5 show the execution
times achieved by the processors in implementing the LMS
and beam simulation algorithms over 1000 and 20000
iterations, respectively. It is noted that the execution time
in each case varies approximately linearly as a
function of the task size. With the LMS algorithm, as
noted, the relation has a relatively smaller gradient for
less than 10 and 20 weights with the i860 and the C40,
respectively. For the number of weights greater than
these, the gradient is larger. Such a phenomenon is more
likely associated with dynamic memory management
conditions of the processor in each case. In the case of
the beam simulation algorithm, as noted in Fig. 5, such a
situation is not clearly evident. This suggests that with
the amount of data handling involved in implementing
this algorithm both processors appear to have needed to
utilise the lower level memory throughout. The slight
variation in gradient noted in Fig. 5b with the algorithm
implemented on the C40 is more likely due to computational
error.
It is noted in Figs. 4 and 5 that the enhancement in
performance of the processors in implementing the LMS
algorithm with optimisation is substantially greater than in
implementing the beam simulation algorithm. This, as
discussed above, is due to the structure of the algorithms
where for the LMS algorithm the features of the optimisation are well exploited and not much for the beam
simulation algorithm. This is further evidenced in Fig. 6
with the corresponding speedups achieved with optimisation in implementing the algorithms on the i860 and
the C40. It is important to note that the optimisation with the
C40 offers significant enhancement in implementing the
LMS algorithm. However, the enhancement occurs within
a specific bandwidth of the task size (number of filter
weights). A similar trend will be expected with the speedup
achieved in case of the i860 in implementing the LMS
algorithm. However, due to the better data handling capability of the i860, the upper roll-off point is expected to
occur at a larger task size in comparison with that of the
C40 implementation.
Fig. 6. Optimisation speedup with the processors in implementing
the algorithms (speedup against task size, with the i860 and the C40):
(a) the LMS algorithm; (b) the beam simulation algorithm.
The execution time speedups achieved with code
optimisation in implementing the beam simulation algorithm on the processors, as noted in Fig. 6b, reach a similar
level with both the i860 and the C40. At the lower end
of task sizes the speedup with the C40 is relatively larger
and continues to decrease with an increase in the task
size. The speedup with the implementation on the i860,
on the other hand, increases with the task size rapidly
at the lower end and slowly beyond 20 beam
segments. This suggests that the optimisation for the i860
is performing better than that for the C40 in this type of
application.
Fig. 7. Execution times of the processors in implementing the LMS
algorithm (execution time against number of weights, for the T8 and for
the C40 and i860 with and without optimisation).
Fig. 8. Processor average speed in implementing the LMS algorithm
(average speed for the i860 and C40, each with and without optimisation,
and the T8).
6.3. Comparative performances of the processors

To evaluate the performance of the processors in
implementing the algorithms relative to one another, the
algorithms were implemented on the i860, the C40 and
the T8 with various task sizes and the execution times
recorded accordingly. With the i860 and the C40 implementations the two levels of optimisation considered
above were utilised. The execution times recorded in implementing the LMS algorithm in these exercises correspond to
1000 iterations and those for the beam simulation algorithm
correspond to 20 000 iterations.

Fig. 9. Execution time speedup with the i860 in implementing the LMS
algorithm (speedup against number of weights, with and without optimisation): (a) relative to the T8; (b) relative to the C40.

Fig. 10. Execution times of the processors in implementing the beam
simulation algorithm (execution time against number of segments).

Fig. 7 shows the execution times of the processors in
implementing the LMS algorithm. It is noted that the i860
with optimisation has performed fastest and the T8 slowest of the processors. The performance of the C40, on the other hand, appears to have been enhanced significantly with optimisation compared with its performance without optimisation. This is further demonstrated by the average computational speed of the processors shown in Fig. 8 in implementing the LMS algorithm. It is evident from Fig. 8 that the speedup achieved with the i860 with optimisation is 5.262, 2.2728, 16.274 and 35.996 relative to the i860 with no optimisation, the C40 with optimisation, the C40 with no optimisation and the T8, respectively. The C40 with optimisation, on the other hand, appears to have performed 2.315, 7.16 and 15.838 times faster than the i860 without optimisation, the C40 without optimisation and the T8, respectively. Thus, although the i860 without optimisation appears to have performed slower than the C40 with optimisation, the i860 with optimisation has outperformed both the C40 and the T8 significantly in implementing the LMS algorithm, as is further evidenced by the execution time speedup achieved with the i860 relative to the T8 and the C40 as a function of the number of filter weights in Fig. 9.
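For reference, the LMS weight update benchmarked here follows the standard Widrow-Hoff form [10]. The sketch below is a minimal Python illustration of that update; the filter length, step size and test signals are illustrative assumptions, not the benchmark settings used in the paper.

```python
import numpy as np

def lms_filter(x, d, n_weights=8, mu=0.05):
    """Standard LMS adaptive filter: one Widrow-Hoff weight update per sample."""
    w = np.zeros(n_weights)           # adaptive filter weights
    y = np.zeros(len(x))              # filter output
    e = np.zeros(len(x))              # error signal
    for k in range(n_weights - 1, len(x)):
        u = x[k - n_weights + 1:k + 1][::-1]  # [x[k], x[k-1], ..., most recent first]
        y[k] = w @ u                  # filter output for this sample
        e[k] = d[k] - y[k]            # error against the desired signal
        w += mu * e[k] * u            # Widrow-Hoff weight update
    return y, e, w

# Illustrative use: identify a simple 3-tap FIR system from noise-free data.
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
d = np.convolve(x, [0.5, -0.3, 0.1], mode="full")[:len(x)]
_, e, w = lms_filter(x, d, n_weights=3, mu=0.05)
```

The per-sample cost of this loop grows linearly with the number of weights, which is why the execution times in Figs. 7 and 9 are reported as a function of the number of filter weights.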
Fig. 10 shows the execution times of the processors in implementing the beam simulation algorithm. It is noted that the i860 with optimisation has performed fastest and the T8 slowest of the processors. The performance of neither the i860 nor the C40 appears to have been enhanced significantly with optimisation compared with their performance without optimisation. This is due to the nature of the beam simulation algorithm, for which the optimisation facility does not offer much in either case. The performance of the C40 without optimisation appears not to be significantly different from that of the T8. The performance of the i860, with and without optimisation, is significantly better than that of both the C40 and the T8. This is further demonstrated by the average computational speed of the processors shown in Fig. 11 in implementing the algorithm. It is evident from Fig. 11 that the speedup achieved with the i860 with optimisation is 1.322, 9.693, 14.032 and 14.711 relative to the i860 without optimisation, the C40 with optimisation, the C40 without optimisation and the T8, respectively. The C40 with optimisation, on the other hand, appears to have performed 7.331 times slower than the i860 without optimisation, and only 1.448 and 1.518 times faster than the C40 without optimisation and the T8, respectively. This significant performance of the i860 in implementing the beam simulation algorithm is further evidenced in Fig. 12, which shows the speedup achieved with the i860 relative to the T8 and the C40 as a function of the number of beam segments.
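The speedup figures quoted above are simple ratios of execution times (equivalently, of average speeds) for the same task. A small Python sketch, using hypothetical execution times purely for illustration, shows how such a table of pairwise relative speedups can be derived:

```python
# Speedup of configuration a over configuration b is t_b / t_a, i.e. how
# many times faster a completes the same task. The times below are
# illustrative placeholders, not the measured values from the paper.
exec_times = {
    "i860 (opt)": 1.0,
    "i860 (no opt)": 1.3,
    "C40 (opt)": 9.7,
    "C40 (no opt)": 14.0,
    "T8": 14.7,
}

def speedup(a, b, times=exec_times):
    """Speedup of configuration a relative to configuration b."""
    return times[b] / times[a]

# Full pairwise speedup table, rounded to three decimal places.
table = {
    (a, b): round(speedup(a, b), 3)
    for a in exec_times for b in exec_times if a != b
}
```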
Fig. 11. Processor average speed in implementing the beam simulation algorithm.

7. Conclusion
An investigation into the comparative performance evaluation of several high-performance DSP and RISC processors within the framework of real-time applications has been presented. Compiler efficiency and code optimisation, as important issues affecting the performance of the processors in real-time applications, have been investigated.

Several areas of importance when optimising algorithms to suit an architecture, or vice versa, have been noted. In configuration files, the amount of memory allocated to the stack, the heap and data storage can be specified.
Fig. 12. Execution time speedup with the i860 in implementing the beam simulation algorithm: (a) relative to the T8; (b) relative to the C40.
Careful choice of these can determine whether an algorithm
can be run in on-chip cache memory or in local memory.
This, in turn, can affect the speed of execution. In cases
where the algorithm is irregular, involving many branches
or subroutines and uneven loops, such as digital filters and
similar signal processing algorithms, the C40 has been
found to be well suited. However, with an efficient compiler
optimiser the more powerful number crunchers, such as the
i860 vector processor, can outperform the C40 at its own
game. One reason for this is that the compiler optimiser
restructures the code so that it is in a more regular form.
A quantitative measure of performance of processors has been introduced and utilised in a comparative performance evaluation of processors in the implementation of real-time application algorithms. It has been shown that the performance of a processor evolves approximately linearly with task size. This has led to the introduction of several performance metrics, namely the average speed, average gradient and generalised speedup, for a more comprehensive performance evaluation of a processor under various processing conditions and relative to other processors within an application. These have been shown to provide suitable measures of the performance of a processor over a wide range of loading conditions, and thus reflect the real-time computing capabilities of the processor in a comprehensive manner.
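Given the observation that execution time grows approximately linearly with task size, an average gradient can be sketched as the slope of a least-squares line fitted to (task size, execution time) measurements. The Python sketch below illustrates this under that assumption; the exact metric definitions are those given earlier in the paper, and the measurement data here is invented for illustration.

```python
def linear_fit(sizes, times):
    """Least-squares fit times ~ a*size + b; the slope 'a' plays the role
    of an average gradient (execution cost per unit of task size)."""
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_t = sum(times) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(sizes, times))
    var = sum((s - mean_s) ** 2 for s in sizes)
    a = cov / var
    b = mean_t - a * mean_s
    return a, b

def average_speed(sizes, times):
    """Illustrative average speed: mean task size processed per unit time."""
    return sum(s / t for s, t in zip(sizes, times)) / len(sizes)

# Invented measurements: execution time vs. number of filter weights,
# constructed to be exactly linear (0.02 s per weight plus 0.5 s overhead).
sizes = [20, 40, 60, 80, 100]
times = [0.5 + 0.02 * s for s in sizes]

a, b = linear_fit(sizes, times)
```

A smaller gradient for one processor than another indicates that its advantage grows with task size, which is the behaviour visible in Figs. 9 and 12.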
References

[1] M.O. Tokhi, M.A. Hossain, Real-time active control using sequential and parallel processing methods, in: M.J. Crocker, N.I. Ivanov (Eds.), Proceedings of the Fourth International Congress on Sound and Vibration, St. Petersburg, 24-27 June 1996, International Scientific Publications, Auburn, Vol. 1, pp. 391-398.
[2] R.R. Leitch, M.O. Tokhi, The implementation of active noise control systems using digital signal processing techniques, Proceedings of the Institute of Acoustics 8(1) (1986) 149-157.
[3] M.O. Tokhi, Implementation of adaptive active noise control systems using digital signal processing devices, Proceedings of the Institute of Acoustics 14(4) (1992) 543-550.
[4] M.O. Tokhi, M.A. Hossain, CISC, RISC and DSP processors in real-time signal processing and control, Microprocessors and Microsystems 19(5) (1995) 291-300.
[5] M.O. Tokhi, M.A. Hossain, M.J. Baxter, P.J. Fleming, Heterogeneous and homogeneous parallel architectures for real-time active vibration control, IEE Proceedings, D: Control Theory and Applications 142(6) (1995) 1-8.
[6] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, New York, 1993.
[7] A.J. Anderson, A performance evaluation of microprocessors, DSPs and the transputer for recursive parameter estimation, Microprocessors and Microsystems 15 (1991) 131-136.
[8] M.O. Tokhi, C. Chambers, M.A. Hossain, Performance evolution with DSP and transputer based systems in real-time signal processing and control applications, in: Proceedings of UKACC International Conference on Control-96, Exeter, 2-5 September 1996, IEE, London, Vol. 1, pp. 371-375.
[9] M.O. Tokhi, M.A. Hossain, Self-tuning active vibration control in flexible beam structures, Proceedings of the IMechE, Part I: Journal of Systems and Control Engineering 208 (1994) 263-277.
[10] B. Widrow, J.R. Glover, J.M. McCool, J. Kaunitz, C.S. Williams, R.H. Hearn, J.R. Zeidler, E. Dong, R.C. Goodlin, Adaptive noise cancelling: principles and applications, Proceedings of the IEEE 63 (1975) 1692-1716.
[11] A. Brown, DSP chip with no parallel?, Electronics World + Wireless World (1991) 878-879.
[12] Texas Instruments, TMS320C40 User's Guide, Texas Instruments, New York, 1991.
[13] G.W. Irwin, P.J. Fleming, Transputers in Real-time Control, John Wiley, Chichester, 1992.
[14] Transtech Parallel Systems Ltd., Transtech Parallel Technology, Transtech Parallel Systems Ltd., Bucks, UK, 1991.
[15] X.-H. Sun, J. Gustafson, Toward a better parallel performance metric, Parallel Computing 17 (1991) 1093-1109.
[16] X.-H. Sun, D.T. Rover, Scalability of parallel algorithm-machine combinations, IEEE Transactions on Parallel and Distributed Systems 5(6) (1994) 599-613.
[17] G. Bader, E. Gehrke, On the performance of transputer networks for solving linear systems of equations, Parallel Computing 17 (1991) 1397-1407.
[18] Texas Instruments, TMS320C4x User's Guide, Texas Instruments, New York, 1991.
[19] Portland Group Inc., PG Tools User Manual, Portland Group Inc., Oregon, 1991.
[20] Texas Instruments, TMS320 Floating-Point DSP Optimising C Compiler User's Guide, Texas Instruments, New York, 1991.
Dr. Tokhi obtained his BSc degree in electrical engineering from the Department of Electrical Engineering, Kabul University, Afghanistan in 1978 and his PhD degree from the Department of Electrical and Electronic Engineering, Heriot-Watt University, UK in 1988. He worked as a lecturer in the Department of Electrical Engineering, Kabul University, Afghanistan from 1978 to 1982, as a sound engineer in a private firm in Pakistan from 1982 to 1983, and as a lecturer in the Department of Electrical and Electronic Engineering, Glasgow College of Technology, UK from 1988 to 1989. He joined the Department of Automatic Control and Systems Engineering, The University of Sheffield, UK in 1989, where he is currently employed as a senior lecturer. His research interests include active noise control, active vibration control, real-time signal processing and control, parallel processing, system identification and adaptive control.
Dr. Hossain obtained his MSc degree from the Department of Applied Physics and Electronics, University of Dhaka, Bangladesh in 1987 and his PhD degree from the Department of Automatic Control and Systems Engineering, The University of Sheffield, UK in 1995. He worked as a lecturer in the Department of Applied Physics and Electronics, University of Dhaka, Bangladesh from 1988 to 1995. He is currently working as an Assistant Professor in the Department of Computer Science, University of Dhaka, Bangladesh. His research interests include real-time signal processing and control, parallel processing and adaptive active control.
Colin Chambers graduated in electronic systems and control engineering at Sheffield Hallam University, UK. After gaining an MSc(Eng) in control systems at the University of Sheffield, UK, he was initially involved in research in vibration control. He joined the Department of Computer Science at Sheffield in 1996, where he started his PhD in the correct systems research group. His interests involve analysis methods for safety-critical computer control systems.