Pipelining PDF
Chapter 4
Reference: W. Stallings |
SISD
Represents the organization of a single computer containing a control unit, a
processor unit, and a memory unit.
Instructions are executed sequentially and the system may or may not have
internal parallel processing capabilities.
Parallel processing may be achieved by means of multiple functional units or by
pipeline processing.
SIMD
Represents an organization that includes many processing units under the
supervision of a common control unit.
All processors receive the same instruction from the control unit but operate on
different items of data.
The shared memory unit must contain multiple modules so that it can
communicate with all the processors simultaneously.
Consider two normalized floating-point numbers X = A × 2^a and Y = B × 2^b, where:
o A and B are two fractions that represent the mantissas
o a and b are the exponents
The floating-point addition and subtraction can be performed in four
segments, as shown in Fig. 4-6.
The suboperations that are performed in the four segments are:
o Compare the exponents
The larger exponent is chosen as the exponent of the result.
o Align the mantissas
The exponent difference determines how many times the
mantissa associated with the smaller exponent must be shifted
to the right.
o Add or subtract the mantissas
o Normalize the result
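The four suboperations can be sketched on mantissa/exponent pairs. This is a minimal illustration, not the hardware algorithm: it uses decimal (base-10) arithmetic for readability, and the function name and worked values are assumptions, not from the text.

```python
# Sketch of the four floating-point addition segments on decimal
# mantissa/exponent pairs (illustrative only; real hardware uses binary).

def fp_add(a_mant, a_exp, b_mant, b_exp):
    # Segment 1: compare the exponents; the larger one is chosen
    # as the exponent of the result.
    if a_exp < b_exp:
        a_mant, a_exp, b_mant, b_exp = b_mant, b_exp, a_mant, a_exp
    diff = a_exp - b_exp
    # Segment 2: align the mantissas; the exponent difference tells how
    # far the mantissa with the smaller exponent shifts to the right.
    b_mant = b_mant / (10 ** diff)
    # Segment 3: add (or subtract) the mantissas.
    mant = a_mant + b_mant
    exp = a_exp
    # Segment 4: normalize so the mantissa is a fraction in [0.1, 1).
    while abs(mant) >= 1:
        mant /= 10
        exp += 1
    while mant != 0 and abs(mant) < 0.1:
        mant *= 10
        exp -= 1
    return mant, exp

# 0.9504 x 10^3 + 0.8200 x 10^2 -> 1.0324 x 10^3 -> 0.10324 x 10^4
print(fp_add(0.9504, 3, 0.8200, 2))
```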
Loop buffer: This is a small, very-high-speed register file maintained by the
instruction fetch segment of the pipeline.
Branch prediction: A pipeline with branch prediction uses some additional logic
to guess the outcome of a conditional branch instruction before it is executed.
Delayed branch: In this procedure, the compiler detects the branch instructions
and rearranges the machine language code sequence by inserting useful
instructions that keep the pipeline operating without interruptions.
o A procedure employed in most RISC processors.
o e.g. inserting a no-operation instruction when no useful instruction is available
Fig. 4-9(a): Three-segment pipeline timing - Pipeline timing with data conflict
Fig. 4-9(b) shows the same program with a no-op instruction inserted after the
load to R2 instruction.
Fig. 4-9(b): Three-segment pipeline timing - Pipeline timing with delayed load
Thus the no-op instruction is used to advance one clock cycle in order to
compensate for the data conflict in the pipeline.
The advantage of the delayed load approach is that the data dependency is taken
care of by the compiler rather than the hardware.
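The load-use conflict and its no-op fix can be illustrated with a toy check. The tuple encoding (opcode, destination, sources) and the function name are assumptions for illustration; they are not part of the text.

```python
# Toy detector for the delayed-load hazard: a conflict exists when an
# instruction reads a register that the immediately preceding load writes.

def has_load_use_conflict(program):
    for prev, curr in zip(program, program[1:]):
        if prev[0] == "load" and prev[1] in curr[2]:
            return True
    return False

conflict = [
    ("load",  "R1", []),            # R1 <- M[address 1]
    ("load",  "R2", []),            # R2 <- M[address 2]
    ("add",   "R3", ["R1", "R2"]),  # R3 <- R1 + R2 (needs R2 one cycle too early)
    ("store", None, ["R3"]),        # M[address 3] <- R3
]

# Compiler fix: insert a no-op after the load to R2.
fixed = conflict[:2] + [("nop", None, [])] + conflict[2:]

print(has_load_use_conflict(conflict))  # True
print(has_load_use_conflict(fixed))     # False
```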
Delayed Branch
The method used in most RISC processors is to rely on the compiler to redefine
the branches so that they take effect at the proper time in the pipeline. This
method is referred to as delayed branch.
The compiler is designed to analyze the instructions before and after the branch
and rearrange the program sequence by inserting useful instructions in the delay
steps.
It is up to the compiler to find useful instructions to put after the branch
instruction. Failing that, the compiler can insert no-op instructions.
An Example of Delayed Branch
The program for this example consists of five instructions.
o Load from memory to R1
o Increment R2
o Add R3 to R4
o Subtract R5 from R6
o Branch to address X
Compiled By: Er. Hari Aryal [haryal4@gmail.com]
In Fig. 4-10(a) the compiler inserts two no-op instructions after the branch.
o The branch address X is transferred to PC in clock cycle 7.
The program in Fig. 4-10(b) is rearranged by placing the add and subtract
instructions after the branch instruction.
o PC is updated to the value of X in clock cycle 5.
Vector Operations
Many scientific problems require arithmetic operations on large arrays of
numbers.
A vector is an ordered set of a one-dimensional array of data items.
A vector V of length n is represented as a row vector by V = [v1, v2, …, vn].
To examine the difference between a conventional scalar processor and a vector
processor, consider the following Fortran DO loop:
      DO 20 I = 1, 100
20    C(I) = B(I) + A(I)
This is implemented in machine language by the following sequence of
operations.
      Initialize I = 0
20    Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I <= 100 go to 20
      Continue
A computer capable of vector processing eliminates the overhead associated with
the time it takes to fetch and execute the instructions in the program loop; the
entire loop can be replaced by a single vector statement:
C(1:100) = A(1:100) + B(1:100)
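The contrast between the scalar loop and the single vector statement can be sketched in plain Python. The list comprehension only mimics the one-statement form of the vector operation; the variable names and data values are illustrative assumptions.

```python
# Scalar loop vs. single vector-style statement (illustration only).

A = [float(i) for i in range(1, 101)]     # A(1:100)
B = [float(i) for i in range(101, 201)]   # B(1:100)

# Scalar style: one addition per iteration, plus loop-control overhead
# (read, add, store, increment, test-and-branch on every pass).
C_scalar = []
for i in range(100):
    C_scalar.append(A[i] + B[i])

# Vector style: the whole operation expressed at once,
# as in C(1:100) = A(1:100) + B(1:100).
C_vector = [a + b for a, b in zip(A, B)]

print(C_scalar == C_vector)  # True
```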
A possible instruction format for a vector instruction is shown in Fig. 4-11.
o This assumes that the vector operands reside in memory.
It is also possible to design the processor with a large number of registers and
store all operands in registers prior to the addition operation.
o The base address and length in the vector instruction specify a group of
CPU registers.
The inner product calculation on a pipeline vector processor is shown in Fig. 4-12.
With a four-segment pipeline adder, the products accumulate into four partial sums:
C = A1B1 + A5B5 + A9B9 + A13B13 + ⋯
  + A2B2 + A6B6 + A10B10 + A14B14 + ⋯
  + A3B3 + A7B7 + A11B11 + A15B15 + ⋯
  + A4B4 + A8B8 + A12B12 + A16B16 + ⋯
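How the products of the inner product interleave into partial sums on a multi-segment adder can be sketched as follows; this is a pure-Python illustration, and the function name and sample data are assumptions.

```python
# Sketch of inner-product accumulation on a pipelined adder: successive
# products enter the adder segments in turn, so segment k accumulates
# the products A[k]B[k], A[k+4]B[k+4], ... ; the partial sums are
# combined at the end.

def pipelined_inner_product(A, B, segments=4):
    partial = [0.0] * segments
    for i in range(len(A)):
        partial[i % segments] += A[i] * B[i]
    return sum(partial)  # combine the four partial sums

A = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
B = [1.0] * 8
print(pipelined_inner_product(A, B))  # 36.0
```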
The advantage of a modular memory is that it allows the use of a technique called
interleaving.
In an interleaved memory, different sets of addresses are assigned to different
memory modules.
By staggering the memory access, the effective memory cycle time can be
reduced by a factor close to the number of modules.
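Low-order interleaving can be sketched with a simple address mapping; the module count and function name here are illustrative assumptions.

```python
# Low-order interleaving: with 4 modules, consecutive addresses land in
# consecutive modules, so sequential accesses can be staggered and
# overlapped across modules.

MODULES = 4

def module_and_offset(addr):
    # (which module holds the word, address within that module)
    return addr % MODULES, addr // MODULES

for addr in range(8):
    print(addr, module_and_offset(addr))
# addresses 0,1,2,3 fall in modules 0,1,2,3; address 4 wraps to module 0
```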
Supercomputers
A commercial computer with vector instructions and pipelined floating-point
arithmetic operations is referred to as a supercomputer.
o To speed up the operation, the components are packed tightly together to
minimize the distance that the electronic signals have to travel.
This is augmented by instructions that process vectors and combinations of
scalars and vectors.
A supercomputer is a computer system best known for its high computational
speed, fast and large memory systems, and the extensive use of parallel
processing.
o It is equipped with multiple functional units and each unit has its own
pipeline configuration.
It is specifically optimized for the type of numerical calculations involving
vectors and matrices of floating-point numbers.
They are limited in their use to a number of scientific applications, such as
numerical weather forecasting, seismic wave analysis, and space research.
A measure used to evaluate computers in their ability to perform a given number
of floating-point operations per second is referred to as flops.
A typical supercomputer has a basic cycle time of 4 to 20 ns.
Examples of supercomputers:
o Cray-1: uses vector processing with 12 distinct functional units in parallel; a
large number of registers (over 150); multiprocessor configurations (Cray X-MP
and Cray Y-MP)
o Fujitsu VP-200: 83 vector instructions and 195 scalar instructions; 300
megaflops
4.7 Array Processing
An array processor is a processor that performs computations on large arrays of
data.
The term is used to refer to two different types of processors.
o Attached array processor:
Is an auxiliary processor.
It is intended to improve the performance of the host computer in
specific numerical computation tasks.
o SIMD array processor:
Has a single-instruction multiple-data organization.
It manipulates vector instructions by means of multiple functional
units responding to a common instruction.