

Performance evaluation of DSP and transputer based systems in sequential real-time applications

Microprocessors and Microsystems 21 (1997) 237-248

M.O. Tokhi (a,*), M.A. Hossain (b), C. Chambers (c)

(a) Department of Automatic Control and Systems Engineering, The University of Sheffield, Sheffield, UK
(b) Department of Computer Science, University of Dhaka, Dhaka, Bangladesh
(c) Department of Computer Science, The University of Sheffield, Sheffield, UK

(*) Corresponding author. Tel.: +44 (0)114 222 5617; fax: +44 (0)114 273 1729; e-mail: O.Tokhi@sheffield.ac.uk

Received 21 October 1996; received in revised form 22 July 1997; accepted 12 August 1997

Abstract

This paper presents an investigation into the performance evaluation of high-performance processors in real-time applications. The performance evolution of the processors in relation to task size, compiler efficiency for numerical computation and code optimisation are investigated. A quantitative measure of performance evaluation of processors is introduced. An adaptive filtering algorithm and a beam simulation algorithm are considered. These are implemented on several high-performance processors incorporating transputers and digital signal processing devices. A comparative performance evaluation of the architectures is made, demonstrating the critical issues encountered with fast processing techniques in real-time signal processing and control. © 1997 Elsevier Science B.V.

Keywords: Digital signal processing; Performance evaluation; Real-time signal processing and control

1. Introduction

A real-time system can be regarded as one that has to respond to externally generated stimuli within a finite and specified period. Despite the vastly increased computing power now available, there can still be limitations in the computing capability of digital processors in real-time control applications, for two reasons: (a) sample times have become shorter as greater performance demands are imposed on the system; (b) algorithms are becoming more complex as the development of control theory leads to an understanding of methods for optimising system performance. To satisfy these high performance demands, microprocessor technology has developed at a rapid pace in recent years. This development is based on (i) processing speed, (ii) processing ability, (iii) communication ability, and (iv) control ability. Digital signal processing (DSP) devices are designed in hardware to perform concurrent add and multiply instructions and to execute irregular algorithms efficiently, typically finite-impulse response (FIR) and infinite-impulse response (IIR) filter algorithms. Vector processors are designed to efficiently process regular algorithms involving matrix manipulations. However, many demanding complex signal processing and control algorithms cannot be satisfactorily realised with conventional computing methods. Alternative strategies, where high-performance computing methods are employed, could provide suitable solutions in such applications [1]. Little work has been reported on such methods in real-time signal processing and control applications [2-5]. For microprocessors with widely different architectures, performance measurements such as MIPS (million instructions per second), MOPS (million operations per second) and MFLOPS (million floating-point operations per second) are meaningless.
Of more importance is to rate the performance of a processor on the type of program likely to be encountered in a particular application [4]. The different microprocessors and their different clock rates, memory cycle times, etc., all confuse the issue of attempting to rate the processors. In particular, there is an inherent difficulty in selecting microprocessors for real-time signal processing and control applications. The ideal performance of a microprocessor demands a perfect match between processor capability and program behaviour. Processor capability can be enhanced with better hardware technology, innovative architectural features and efficient resource management. From the hardware point of view, performance varies according to whether the processor possesses a pipeline facility, is microcode/hardware operated, has an internal cache or internal RAM, has a built-in maths co-processor, a floating-point unit, etc. Program behaviour, on the other hand, is difficult to predict due to its heavy dependence on application and run-time conditions. Other factors affecting program behaviour include algorithm design, data structure, language efficiency, programmer skill and compiler technology [6,7]. The work reported in this paper attempts to investigate such issues within the framework of real-time applications.

The aim of this investigation is to explore real-time performance evaluation issues of DSP devices, vector processors and transputers. Typical processors from each of these computing domains are selected and used to provide a comparative study of the features referred to above. An investigation into the computing capabilities of several high-performance processors and their suitability in real-time applications is presented in this paper. Three computing domains, namely the Texas Instruments TMS320C40 (C40) DSP device, an Intel 80i860 (i860) vector processor, and an Inmos T805 (T8) transputer, are studied. The algorithms chosen to highlight the main characteristics of these processors are the least mean square (LMS) based adaptive filtering algorithm and finite difference (FD) simulation of a flexible beam in transverse vibration. Previous investigations have established a comparative performance evaluation of these processors in implementing the above algorithms on the basis of fixed task sizes. Not much, however, is known of the performance evolution of such processors in relation to task size in real-time applications [8]. Moreover, similar studies involving code optimisation and compiler efficiency have not been reported. This work aims at addressing these points.

2. Algorithms

The performance of a processor is largely affected by the computing requirements of an application. This in turn is determined by the degree of regularity of the algorithm. Regularity describes the degree of uniformity in the execution thread of the computation. An algorithm consisting of varying loops and conditional jumps, for instance, is considered highly irregular. Many algorithms can be expressed by matrix computations. This leads to the so-called regular iterative (RI) type of algorithm, due to their very regular structure. In implementing an RI type algorithm, a vector processor will, in principle, be expected to perform better. Moreover, if a large amount of data is to be handled in this type of algorithm, the performance will be further enhanced if the processor has more internal data cache, instruction cache and/or a built-in coprocessor. The algorithms considered in this investigation consist of the FD simulation of a flexible beam structure and an LMS adaptive filter algorithm. These are briefly described below.

2.1. Beam simulation

Consider a cantilever beam system with a force U(x,t) applied at a distance x from its fixed (clamped) end at time t. This will result in a deflection y(x,t) of the beam from its stationary position at the point where the force has been applied. In this manner, the governing dynamic equation of the beam is given by

    μ² ∂⁴y(x,t)/∂x⁴ + ∂²y(x,t)/∂t² = (1/m) U(x,t)        (1)

where μ is a beam constant and m is the mass of the beam. Discretising the beam in time and length using central FD methods, a discrete approximation to Eq. (1) can be obtained as [9]

    Y_{k+1} = -Y_{k-1} - λ² S Y_k + ((Δt)²/m) U(x,t)        (2)

where λ² = [(Δt)²/(Δx)⁴] μ², with Δt and Δx representing the step sizes in time and along the beam respectively, S is a pentadiagonal matrix (the so-called stiffness matrix of the beam), and Y_i (i = k+1, k, k-1) is an n × 1 matrix representing the deflection of the ends of sections (grid points) 1 to n of the beam at time step i (the beam divided into n - 1 sections). Eq. (2) is the required relation for the simulation algorithm and can be implemented on a digital processor easily.
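As an illustration of how Eq. (2) maps onto a sequential implementation, a minimal C sketch of one simulation time step is given below. It assumes the stiffness matrix S has already been assembled (its pentadiagonal coefficients follow from the FD derivation in [9] and are not reproduced here) and, purely for clarity, stores S as a dense n × n array; the function and variable names are illustrative only and do not correspond to the code used in the reported experiments.

```c
/*
 * Sketch of one time step of the beam simulation, Eq. (2):
 *     Y[k+1] = -Y[k-1] - lambda2 * (S * Y[k]) + (dt*dt/m) * U
 * S is assumed to have been assembled beforehand (pentadiagonal
 * coefficients as derived in [9]) and is stored as a dense n x n
 * row-major array purely for clarity.  All names are illustrative.
 */
#include <stddef.h>

void beam_step(size_t n, const double *S, const double *y_k,
               const double *y_km1, const double *u,
               double lambda2, double dt, double m, double *y_kp1)
{
    for (size_t i = 0; i < n; i++) {
        double sy = 0.0;                      /* i-th element of S * Y[k] */
        for (size_t j = 0; j < n; j++)
            sy += S[i * n + j] * y_k[j];
        y_kp1[i] = -y_km1[i] - lambda2 * sy + (dt * dt / m) * u[i];
    }
}
```

In a simulation loop the three deflection vectors would simply be rotated between successive calls.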
2.2. The LMS adaptive filter

The LMS algorithm is one of the most successful adaptive algorithms, developed by Widrow and his co-workers [10]. It is based on the steepest descent method, where the weight vector is updated according to

    W_{k+1} = W_k + 2 η e_k X_k        (3)

where W_k and X_k are the weight and input signal vectors at time step k respectively, η is a constant controlling the stability and rate of convergence, and e_k is the error given by

    e_k = y_k - W_k^T X_k        (4)

where y_k is the current contaminated signal sample. Eqs. (3) and (4) constitute the LMS adaptive filter algorithm. Note in the above that the beam simulation algorithm can be considered as an RI algorithm, whereas the LMS adaptive filter constitutes an irregular algorithm.
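Eqs. (3) and (4) translate directly into the kind of nested-loop code referred to later when discussing compiler optimisation. The fragment below is a minimal sketch of one LMS iteration; the names are illustrative only.

```c
/*
 * Sketch of one iteration of the LMS adaptive filter, Eqs. (3) and (4).
 * w and x hold the n filter weights and the current input vector;
 * y is the current contaminated signal sample; eta is the convergence
 * coefficient.  Names are illustrative.
 */
#include <stddef.h>

double lms_step(size_t n, double *w, const double *x, double y, double eta)
{
    double y_hat = 0.0;
    for (size_t i = 0; i < n; i++)            /* filter output W'X */
        y_hat += w[i] * x[i];

    double e = y - y_hat;                     /* error, Eq. (4) */

    for (size_t i = 0; i < n; i++)            /* weight update, Eq. (3) */
        w[i] += 2.0 * eta * e * x[i];

    return e;
}
```

The dot product and the weight update are the two loops over the number of weights, which is the task size varied in the experiments of Section 6.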
3. Hardware

The architectures considered include an i860 vector processor, a C40 DSP device and a T8 transputer. These are described below.

The i860 is a high-performance 64-bit vector processor with a 40 MHz clock speed, a peak integer performance of 40 MIPS, 8 kbytes of data cache and 4 kbytes of instruction cache, and is capable of 80 MFLOPS. It is Intel's first superscalar RISC processor, possessing separate integer, floating-point, graphics, adder, multiplier and memory management units. The i860 executes 82 instructions, including 42 RISC integer, 24 floating-point, 10 graphics and six assembler pseudo operations, in one clock cycle. All external address buses are 32 bits wide and the external data path or internal data bus is 64 bits wide. However, the internal RISC integer ALU is only 32 bits wide. The instruction cache transfers 64 bits per clock cycle, equivalent to 320 Mbytes/sec at 40 MHz, whereas the data cache transfers 128 bits per clock cycle. There are two floating-point units, namely the multiplier and the adder units, which can be used separately or simultaneously under the co-ordination of the floating-point control unit. Special dual-operation floating-point instructions such as add-and-multiply and subtract-and-multiply use both the multiplier and adder units in parallel. Furthermore, both the integer and the floating-point control units can execute concurrently [6].

The C40 is a high-performance Texas Instruments 32-bit DSP processor with a 40 MHz clock speed, 8 kbytes of on-chip RAM and 512 bytes of on-chip instruction cache, and is capable of 275 MOPS and 40 MFLOPS. This DSP processor possesses six parallel high-speed communication links for inter-processor communication, with a 20 Mbytes/sec asynchronous transfer rate at each port, and a throughput of eleven operations per cycle. It possesses two identical external data and address buses supporting shared memory systems and high data rate, single-cycle transfers. It has separate internal program, data and DMA coprocessor buses to support massive concurrent I/O of program and data, thereby maximising sustained CPU performance [11,12].

The T8 is a general-purpose medium-grained 32-bit Inmos processor with a 25 MHz clock speed, yielding up to 20 MIPS performance, 4 kbytes of on-chip RAM, and is capable of 4.3 MFLOPS. The T8 is a RISC processor possessing an on-board 64-bit floating-point unit and four serial communication links. The links operate at a speed of 20 Mbits/sec, achieving data rates of up to 1.7 Mbytes/sec unidirectionally or 2.3 Mbytes/sec bidirectionally. Most importantly, the links allow a single transputer to be used as a node among any number of similar devices to form a powerful parallel processing system. The transputer thus provides an important bridge between single-chip real-time control and general-purpose real-time computer control systems and, in effect, removes the current distinction between the two [13,14].

4. Software

Software support is needed for the development of efficient programs in high-level languages. The ideal performance of a computer system demands a perfect match between machine capability and program behaviour. Program performance is the turnaround time, which includes disk and memory access, input and output activities, compilation time, operating system overhead and CPU time. To shorten the turnaround time, one can reduce any of these time factors [6]. Minimising the run-time memory management within the program and selecting an efficient compiler for a specific computational demand could enhance the performance.

Compilers have a significant impact on the performance of a system. This is not to say that any particular high-level language dominates another: most languages have advantages in certain computational domains. The compiler itself is critical to the performance of the system, as the efficiency of the mechanism for taking a high-level description of the application and transforming it into a hardware-dependent implementation differs from compiler to compiler. Identifying the foremost compiler for the application in hand is, therefore, especially challenging due to the unpredictable run-time behaviour of the program and the memory management capabilities of different compilers.
In signal processing and control applications it is important to select a suitable programming language that can support highly numerical computation for real-time implementation. The investigation here involves performance evaluation issues of some commonly used compilers with the computing platforms considered. The algorithms are coded in high-level languages as appropriate for the hardware used, with the code structure kept as similar as possible for a given application. The compilers used consist of Inmos ANSI C (for the T8), Portland Group ANSI C (for the i860), 3L Parallel C (for the C40 and the T8) and Occam (for the T8). For the implementations involving the T8, as will be noted later, the ANSI C compiler is used in investigations involving performance evaluation of the hardware architectures in implementing the algorithms, whereas 3L Parallel C and Occam are used in investigations involving the performance evaluation of the compilers.

Performance is also related to the code optimisation facility of the compiler, which may be machine dependent. The goal of program optimisation is, in general, to maximise the speed of code execution. This involves several factors, such as minimisation of code length and memory accesses, exploitation of parallelism, elimination of dead code, in-line function expansion, loop unrolling and maximum utilisation of registers. The optimisation techniques also include vectorisation using pipelined hardware and parallelisation using multiple processors [6]. The i860 and the C40 processors, with their respective compilers, are considered in investigating their optimisation facility.
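As a concrete illustration of one of the optimisations listed above, the fragment below shows a simple dot product (of the kind occurring in the LMS filter) and a manually unrolled version of the same loop. This is a sketch only; the 3L and Portland Group optimisers apply such transformations automatically, and the exact transformations performed at each optimisation level are compiler-specific.

```c
/* A straightforward dot product, as occurs in the LMS filter... */
#include <stddef.h>

double dot(size_t n, const double *a, const double *b)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/*
 * ...and an equivalent version unrolled by a factor of four.  An optimising
 * compiler may apply this transformation itself at higher optimisation
 * levels; it reduces loop-control overhead and exposes independent
 * multiply-accumulate operations to pipelined hardware.
 */
double dot_unrolled(size_t n, const double *a, const double *b)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)                        /* remaining elements */
        s0 += a[i] * b[i];
    return s0 + s1 + s2 + s3;
}
```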
5. Performance metrics

A commonly used measure of performance of a processor in an application is speedup. This is defined as the ratio of the execution time of the processor in implementing the application algorithm relative to a reference time or the execution time of a reference processor [4]. The speedup thus defined provides a relative performance measure of a processor for fixed load (task size) and can therefore be referred to as fixed-load speedup. It can also be used to obtain a comparative performance measure of a processor in an application with fixed task sizes under different processing conditions, for example with and without code optimisation. However, this performance metric does not provide a measure of performance of the processor over a wide range of computational demands of the application.

It has been observed previously, and is demonstrated later in this investigation, that the performance of a processor, as execution time, in implementing an application algorithm generally evolves linearly with the task size [8]. With some processors, however, anomalies in the form of a change of gradient (slope) of execution time against task size are observed. These are mainly due to run-time memory management conditions of the processor: up to a certain task size the processor may find the available cache sufficient, but beyond this it may need to access lower-level memory. Despite this, the variation in the slope is relatively small and the execution time to task size relationship can be considered linear. This means that a quantitative measure of performance of a processor in an application can adequately be given by the average ratio of task size to execution time, or the average speed. Alternatively, the performance of the processor can be measured as the average ratio of execution time per unit task size, or the average (execution time) gradient. In this manner, a generalised performance measure of a processor relative to another in an application can be obtained. Let the average speeds of two processors P1 and P2 in an application be denoted by V1 and V2, respectively. The generalised (execution time) speedup S_{1/2} of P1 relative to P2 in implementing the application algorithm can thus be defined as

    S_{1/2} = V1 / V2

Alternatively, if the corresponding average gradients with P1 and P2 for the application are given by G1 and G2, respectively, S_{1/2} can be expressed as

    S_{1/2} = G2 / G1

The concept of speed, assumed to be constant for a processor in an application, has previously been utilised to derive an expression for the generalised speedup as the ratio of parallel speed (of a parallel architecture) over sequential speed (of a single processor) [15,16]. The generalised speedup introduced above, however, reflects the relative performance of two uni-processor architectures in an application, and of the same processor under two different processing conditions. The concept of generalised speedup described above can thus also be utilised to obtain a comparative performance evaluation of a processor for an application under various processing conditions.
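A minimal C sketch of the metrics defined above is given below; the arrays of recorded task sizes and execution times, and the function names, are illustrative only.

```c
#include <stddef.h>

/*
 * Average speed of a processor in an application: the mean ratio of task
 * size to execution time over a set of recorded measurements (Section 5).
 */
double average_speed(size_t n, const double *task_size, const double *exec_time)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += task_size[i] / exec_time[i];
    return sum / (double)n;
}

/* Generalised speedup of processor P1 relative to P2: S_1/2 = V1 / V2. */
double generalised_speedup(double v1, double v2)
{
    return v1 / v2;
}
```

In this scheme, the average speeds reported later (Figs. 8 and 11) would follow from applying such an average to the recorded execution times, and the generalised speedups quoted in Section 6.3 are ratios of such values.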
6. Implementations and results

In the investigations described below involving the beam simulation algorithm, an aluminium-type cantilever beam of length L = 0.635 m, mass m = 0.037 kg and μ = 1.351 was considered.

6.1. Compiler efficiency

In this section, results of investigations into the performance evaluation of several compilers are presented and discussed. The compilers involved are 3L Parallel C version 2.1, Inmos ANSI C and Occam. All these compilers can be utilised with the computing platforms considered in this investigation. It has previously been reported that, although Occam is more hardware oriented and a straightforward programming language for parallel processing, it may not be as suitable as the Parallel C or ANSI C compilers for numerical computations [17]. To obtain a comparative performance evaluation of these compilers, the flexible beam simulation algorithm was coded, for 19 equal-length beam sections, in the three programming languages and run on a T8. Fig. 1 shows the execution times in implementing the simulation algorithm, over 20 000 iterations, using the three compilers. It is noted that the performances with Parallel C and ANSI C are at a similar level and about 1.5 times faster than with Occam.

[Fig. 1. Performance of the compilers in implementing the simulation algorithm on the T8.]

This was further investigated with a linear algebraic equation. Table 1 shows the performances with integer and floating-point operations, with and without arrays. The code without arrays evaluates

    z = (x + i*y - x*i) / (x*x + y*y)

and the code with arrays evaluates

    z(i) = [x(i) + i*y(i) - x(i)*i] / [x(i)*x(i) + y(i)*y(i)]

with i = 0, 1, ..., 20 000, integer data x = 55, y = 25 and floating-point data x = 55.02562, y = 25.13455.

Table 1
Comparison of compilers for different types of data processing (execution times for 20 000/40 000 iterations)

                 Floating-point data              Integer data
Compiler         With array      Without array    With array      Without array
3L Parallel C    0.1327/0.2488   0.1263/0.2333    0.1327/0.2488   0.1263/0.2333
ANSI C           0.1328/0.2444   0.0127/0.0227    0.1328/0.2444   0.0126/0.0226
Occam            0.1078/0.2052   0.1078/0.2044    0.1960/0.3825   0.1905/0.3698

It is noted that better performance is achieved throughout with the ANSI C compiler than with Occam, except where the computation involves floating-point data processing with an array declared. Compared to ANSI C, better performance is achieved with Parallel C for both integer and floating-point computation with arrays. Better performance is achieved with the Occam compiler than with Parallel C for floating-point computation. It is also noted that, on doubling the amount of data handled, 1.9 times more execution time was required with Occam, whereas 1.87 times more execution time was required with each of ANSI C and Parallel C. This implies that, for large amounts of data handling, the run-time memory management problem can be addressed more efficiently with Parallel C and ANSI C than with the Occam compiler.
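For reference, the two forms compared in Table 1 correspond roughly to the following C loops, shown here for the floating-point case (the integer case uses x = 55, y = 25 with integer variables). This is a sketch under those assumptions, not the exact source used in the tests.

```c
#define N 20000   /* number of iterations, as in Table 1 */

/* "Without array": the same scalar expression evaluated N times. */
double bench_scalar(double x, double y)
{
    double z = 0.0;
    for (int i = 0; i < N; i++)
        z = (x + i * y - x * i) / (x * x + y * y);
    return z;
}

/* "With array": element-wise evaluation over arrays of length N. */
void bench_array(const double *x, const double *y, double *z)
{
    for (int i = 0; i < N; i++)
        z[i] = (x[i] + i * y[i] - x[i] * i) / (x[i] * x[i] + y[i] * y[i]);
}
```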
6.2. Code optimisation

The code optimisation facility of a compiler for the hardware is another important component affecting the real-time performance of a processor. Optimisation facilities almost always enhance the real-time performance of a processor. The i860 and the C40 have many optimisation features [12,18,19]. The TMS320 floating-point DSP optimising C compiler is the TMS320 version of the 3L Parallel C compiler [20]. It has many options, constituting three levels of optimisation, which aid the successful optimisation of C source code on the C40. The Portland Group (PG) C compiler is an optimising compiler for the i860 [19]. It incorporates four levels of optimisation. To measure the performance attainable from the compiler optimisers, so as to fully utilise the available features, experiments were conducted to compile and run the LMS and the beam simulation algorithms on the i860 and the C40.

To study the effect of the PG optimising compiler, the LMS algorithm was compiled with the number of weights set at 5 and η = 0.04. The algorithm was implemented on the i860 with the four levels of optimisation and the execution time of the processor in implementing the algorithm over 1000 iterations was recorded. Similarly, the beam simulation algorithm was compiled and implemented on the i860 with five beam segments and Δt = 0.055 ms. The execution time of the processor in implementing the algorithm over 20 000 iterations was recorded with each of the four levels of optimisation. Fig. 2 shows the execution times achieved in implementing the LMS and the beam simulation algorithms, where level 0 corresponds to no optimisation. The corresponding execution time speedups achieved with each optimisation level in implementing the LMS and beam simulation algorithms are shown in Fig. 3.

[Fig. 2. Execution times of the i860 in implementing the algorithms with the Portland Group C compiler optimiser: (a) the LMS algorithm; (b) the beam simulation algorithm.]

[Fig. 3. Speedup with the Portland Group C compiler optimiser in implementing the algorithms on the i860: (a) the LMS algorithm; (b) the beam simulation algorithm.]

It is noted in Figs. 2 and 3 that the performance of the processor in implementing the LMS algorithm is enhanced significantly with higher levels of optimisation. The enhancement in the case of the beam simulation algorithm, on the other hand, is not significant beyond the first level. The disparity in the speedups for the two algorithms is thought to be due to the type of operations performed by the optimiser. As the LMS algorithm has multiple nested loops, the compiler is able to recognise the structures involved and restructure the actual code so that it is still functionally equivalent but suits the CPU architecture better. Branches to subroutines, for instance, are in-lined, which cuts out procedure call overheads and hence speeds up execution. The beam simulation algorithm, however, is already in a matrix format and thus does not leave as much room for improvement [19].

To study the effect of the optimisers further on the performance of the system, optimisation level 0 (no optimisation) and level 2 were used with the 3L optimiser in implementing the algorithms on the C40. Similarly, with the i860, using the PG compiler, optimisation level 0 and level 4 were utilised. The LMS and the beam simulation algorithms were coded for various task sizes, by changing the number of weights in the case of the LMS algorithm and the number of segments in the case of the beam simulation algorithm. The algorithms were thus implemented on the i860 and the C40. Figs. 4 and 5 show the execution times achieved by the processors in implementing the LMS and beam simulation algorithms over 1000 and 20 000 iterations, respectively.

[Fig. 4. Execution times of the processors in implementing the LMS algorithm: (a) with the i860; (b) with the C40.]

[Fig. 5. Execution times of the processors in implementing the beam simulation algorithm: (a) with the i860; (b) with the C40.]
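The measurements above were recorded over a fixed number of iterations (1000 for the LMS algorithm, 20 000 for the beam simulation). A minimal, hypothetical timing harness of that form is sketched below using the standard C clock() routine; the timing mechanism actually used on each target board is not described here, so this fragment is illustrative only.

```c
/*
 * Hypothetical timing harness: run a routine for a fixed number of
 * iterations and report the elapsed CPU time.  clock() is used purely
 * for portability; the board-specific timing facilities actually used
 * in the experiments may differ.
 */
#include <stdio.h>
#include <time.h>

static volatile double sink;

static void algorithm_iteration(void)    /* placeholder for one iteration of the algorithm under test */
{
    sink += 1.0;
}

static double time_iterations(long iterations)
{
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        algorithm_iteration();
    return (double)(clock() - start) / CLOCKS_PER_SEC;   /* seconds */
}

int main(void)
{
    printf("execution time: %f s\n", time_iterations(1000L));
    return 0;
}
```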
It is noted that the execution time in each case varies approximately linearly as a function of the task size. With the LMS algorithm, the relation has a relatively smaller gradient for fewer than 10 and 20 weights with the i860 and the C40, respectively. For numbers of weights greater than these, the gradient is larger. Such a phenomenon is most likely associated with the dynamic memory management conditions of the processor in each case. In the case of the beam simulation algorithm, as noted in Fig. 5, such a situation is not clearly evident. This suggests that, with the amount of data handling involved in implementing this algorithm, both processors appear to have needed to utilise the lower-level memory throughout. The slight variation in gradient noted in Fig. 5b with the algorithm implemented on the C40 is more likely due to computational error.

It is noted in Figs. 4 and 5 that the enhancement in performance of the processors in implementing the LMS algorithm with optimisation is substantially greater than in implementing the beam simulation algorithm. This, as discussed above, is due to the structure of the algorithms: for the LMS algorithm the features of the optimiser are well exploited, whereas for the beam simulation algorithm they are not. This is further evidenced in Fig. 6 by the corresponding speedups achieved with optimisation in implementing the algorithms on the i860 and the C40. It is important to note that the optimisation with the C40 offers significant enhancement in implementing the LMS algorithm. However, the enhancement occurs within a specific bandwidth of the task size (number of filter weights). A similar trend is expected with the speedup achieved with the i860 in implementing the LMS algorithm; however, due to the better data handling capability of the i860, the upper roll-off point is expected to occur at a larger task size than with the C40 implementation.

[Fig. 6. Optimisation speedup with the processors in implementing the algorithms: (a) the LMS algorithm; (b) the beam simulation algorithm.]

The execution time speedups achieved with code optimisation in implementing the beam simulation algorithm on the processors, as noted in Fig. 6b, reach a similar level with both the i860 and the C40. At the lower end of task sizes the speedup with the C40 is relatively larger and continues to decrease with an increase in the task size. The speedup with the implementation on the i860, on the other hand, increases rapidly with the task size at the lower end and slowly beyond 20 beam segments. This suggests that the optimisation for the i860 is performing better than that for the C40 in this type of application.

6.3. Comparative performances of the processors

To evaluate the performance of the processors in implementing the algorithms relative to one another, the algorithms were implemented on the i860, the C40 and the T8 with various task sizes and the execution times recorded accordingly. With the i860 and the C40 implementations, the two levels of optimisation considered above were utilised. The execution times recorded in implementing the LMS algorithm in these exercises correspond to 1000 iterations and those for the beam simulation algorithm correspond to 20 000 iterations.

[Fig. 7. Execution times of the processors in implementing the LMS algorithm.]

[Fig. 8. Processor average speed in implementing the LMS algorithm. Values legible in the original figure: 2498.62, 1099.34, 474.88.]
Fig. 7 shows the execution times of the processors in implementing the LMS algorithm. It is noted that the i860 with optimisation performed the fastest and the T8 the slowest of the processors. The performance of the C40, on the other hand, appears to have been enhanced significantly with optimisation compared with its performance without optimisation. This is further demonstrated by the average computational speeds of the processors in implementing the LMS algorithm, shown in Fig. 8. It is evident from Fig. 8 that the speedup achieved with the i860 including optimisation is 5.262, 2.2728, 16.274 and 35.996 relative to the i860 with no optimisation, the C40 with optimisation, the C40 with no optimisation and the T8, respectively. The C40 with optimisation, on the other hand, appears to have performed 2.315, 7.16 and 15.838 times faster than the i860 without optimisation, the C40 without optimisation and the T8, respectively. Although the i860 without optimisation appears to have performed slower than the C40 with optimisation, as is further evidenced by the execution time speedup achieved with the i860 relative to the T8 and the C40 as a function of the number of filter weights in Fig. 9, the i860 with optimisation has outperformed both the C40 and the T8 significantly in implementing the LMS algorithm.

[Fig. 9. Execution time speedup with the i860 in implementing the LMS algorithm: (a) relative to the T8; (b) relative to the C40.]

[Fig. 10. Execution times of the processors in implementing the beam simulation algorithm.]

Fig. 10 shows the execution times of the processors in implementing the beam simulation algorithm. It is noted that the i860 with optimisation performed the fastest and the T8 the slowest of the processors. The performance of both the i860 and the C40 appears not to have been enhanced significantly with optimisation compared with their performance without optimisation. This is due to the nature of the beam simulation algorithm, for which the optimisation facility does not offer much in either case. The performance of the C40 without optimisation appears not to be significantly different from that of the T8. The performance of the i860, with and without optimisation, is significantly better than the performance of both the C40 and the T8. This is further demonstrated by the average computational speeds of the processors in implementing the algorithm, shown in Fig. 11.
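As a cross-check on the generalised speedup metric of Section 5, and assuming the three values legible in Fig. 8 (2498.62, 1099.34 and 474.88) are the average speeds of the i860 with optimisation, the C40 with optimisation and the i860 without optimisation respectively, the ratios reproduce the figures quoted above:

    2498.62 / 474.88 ≈ 5.262    (i860 with optimisation relative to i860 without optimisation)
    2498.62 / 1099.34 ≈ 2.273   (i860 with optimisation relative to C40 with optimisation)
    1099.34 / 474.88 ≈ 2.315    (C40 with optimisation relative to i860 without optimisation)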
It is evident from Fig. 11 that the speedup achieved with the i860 with optimisation is 1.322, 9.693, 14.032 and 14.711 relative to the i860 without optimisation, the C40 with optimisation, the C40 without optimisation and the T8, respectively. The C40 with optimisation, on the other hand, appears to have performed 7.331 times slower than the i860 without optimisation, and only 1.448 and 1.518 times faster than the C40 without optimisation and the T8, respectively. Such a significant performance of the i860 in implementing the beam simulation algorithm is further evidenced in Fig. 12, which shows the speedup achieved with the i860 relative to the T8 and the C40 as a function of the number of beam segments.

[Fig. 11. Processor average speed in implementing the beam simulation algorithm.]

[Fig. 12. Execution time speedup with the i860 in implementing the beam simulation algorithm: (a) relative to the T8; (b) relative to the C40.]

7. Conclusion

An investigation into the comparative performance evaluation of several high-performance DSP and RISC processors within the framework of real-time applications has been presented. Compiler efficiency and code optimisation, as important issues affecting the performance of the processors in real-time applications, have been investigated. Several areas of importance when considering optimising algorithms to suit an architecture, or vice versa, have been noted. In configuration files, an amount of memory may be allocated for the use of stack, heap or data storage functions. Careful choice of these can determine whether an algorithm runs in on-chip cache memory or in local memory. This, in turn, can affect the speed of execution. In cases where the algorithm is irregular, involving many branches or subroutines and uneven loops, such as digital filters and similar signal processing algorithms, the C40 has been found to be well suited. However, with an efficient compiler optimiser, more powerful number crunchers such as the i860 vector processor can outperform the C40 at its own game. One reason for this is that the compiler optimiser restructures the code into a more regular form.

A quantitative measure of performance of processors has been introduced and utilised in a comparative performance evaluation of processors in the implementation of real-time application algorithms. It has been shown that the performance of a processor evolves approximately linearly with task size. This has led to the introduction of several performance metrics, namely the average speed, average gradient and generalised speedup, for a more comprehensive performance evaluation of a processor under various processing conditions and relative to other processors within an application. These have been shown to provide suitable measures of the performance of a processor over a wide range of loading conditions, and thus to reflect the real-time computing capabilities of the processor in a comprehensive manner.

References
[1] M.O. Tokhi, M.A. Hossain, Real-time active control using sequential and parallel processing methods, in: M.J. Crocker, N.I. Ivanov (Eds.), Proceedings of the Fourth International Congress on Sound and Vibration, St. Petersburg, 24-27 June 1996, International Scientific Publications, Auburn, Vol. 1, pp. 391-398.
[2] R.R. Leitch, M.O. Tokhi, The implementation of active noise control systems using digital signal processing techniques, Proceedings of the Institute of Acoustics 8(1) (1986) 149-157.
[3] M.O. Tokhi, Implementation of adaptive active noise control systems using digital signal processing devices, Proceedings of the Institute of Acoustics 14(4) (1992) 543-550.
[4] M.O. Tokhi, M.A. Hossain, CISC, RISC and DSP processors in real-time signal processing and control, Microprocessors and Microsystems 19(5) (1995) 291-300.
[5] M.O. Tokhi, M.A. Hossain, M.J. Baxter, P.J. Fleming, Heterogeneous and homogeneous parallel architectures for real-time active vibration control, IEE Proceedings D: Control Theory and Applications 142(6) (1995) 1-8.
[6] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, New York, USA, 1993.
[7] A.J. Anderson, A performance evaluation of microprocessors, DSPs and the transputer for recursive parameter estimation, Microprocessors and Microsystems 15 (1991) 131-136.
[8] M.O. Tokhi, C. Chambers, M.A. Hossain, Performance evolution with DSP and transputer based systems in real-time signal processing and control applications, in: Proceedings of the UKACC International Conference on Control '96, Exeter, 2-5 September 1996, IEE, London, Vol. 1, pp. 371-375.
[9] M.O. Tokhi, M.A. Hossain, Self-tuning active vibration control in flexible beam structures, Proceedings of the IMechE-I: Journal of Systems and Control Engineering 208 (1994) 263-277.
[10] B. Widrow, J.R. Glover, J.M. McCool, J. Kaunitz, C.S. Williams, R.H. Hearn, J.R. Zeidler, E. Dong, R.C. Goodlin, Adaptive noise cancelling: principles and applications, Proceedings of the IEEE 63 (1975) 1692-1696.
[11] A. Brown, DSP chip with no parallel?, Electronics World + Wireless World (1991) 878-879.
[12] Texas Instruments, TMS320C40 User's Guide, Texas Instruments, New York, USA, 1991.
[13] G.W. Irwin, P.J. Fleming, Transputers in Real-time Control, John Wiley, Chichester, 1992.
[14] Transtech Parallel Systems Ltd., Transtech parallel technology, Transtech Parallel Systems Ltd., Bucks, UK, 1991.
[15] X.-H. Sun, J. Gustafson, Toward a better parallel performance metric, Parallel Computing 17(10) (1991) 1093-1109.
[16] X.-H. Sun, D.T. Rover, Scalability of parallel algorithm-machine combinations, IEEE Transactions on Parallel and Distributed Systems 5(6) (1994) 599-613.
[17] G. Bader, E. Gehrke, On the performance of transputer networks for solving linear systems of equations, Parallel Computing 17 (1991) 1397-1407.
[18] Texas Instruments, TMS320C4x User's Guide, Texas Instruments, New York, USA, 1991.
[19] Portland Group Inc., PG Tools User Manual, Portland Group Inc., Oregon, 1991.
[20] Texas Instruments, TMS320 Floating-Point DSP Optimising C Compiler User's Guide, Texas Instruments, New York, USA, 1991.

Dr. Tokhi obtained his BSc degree in electrical engineering from the Department of Electrical Engineering, Kabul University, Afghanistan in 1978 and his PhD degree from the Department of Electrical and Electronic Engineering, Heriot-Watt University, UK in 1988.
He worked as a lecturer in the Department of Electrical Engineering, Kabul University, Afghanistan from 1978 to 1982, as a sound engineer in a private firm in Pakistan from 1982 to 1983, and as a lecturer in the Department of Electrical and Electronic Engineering, Glasgow College of Technology, UK from 1988 to 1989. He joined the Department of Automatic Control and Systems Engineering, The University of Sheffield, UK in 1989, where he is currently employed as a senior lecturer. His research interests include active noise control, active vibration control, real-time signal processing and control, parallel processing, system identification and adaptive control.

Dr. Hossain obtained his MSc degree from the Department of Applied Physics and Electronics, University of Dhaka, Bangladesh in 1987 and his PhD degree from the Department of Automatic Control and Systems Engineering, The University of Sheffield, UK in 1995. He worked as a lecturer in the Department of Applied Physics and Electronics, University of Dhaka, Bangladesh from 1988 to 1995. He is currently working as an Assistant Professor in the Department of Computer Science, University of Dhaka, Bangladesh. His research interests include real-time signal processing and control, parallel processing and adaptive active control.

Colin Chambers graduated in electronic systems and control engineering at Sheffield Hallam University, UK. After gaining an MSc(Eng) in control systems at the University of Sheffield, UK, he was initially involved in research in vibration control. He joined the Department of Computer Science at Sheffield in 1996, where he started his PhD in the correct systems research group. His interests involve analysis methods for safety-critical computer control systems.