Project Report PDF
Project Report PDF
Project Report PDF
To describe the processor in program, based on the VESP simulator [1], I firstly define a structure of proces-
sors(processing elements) which includes registers and memory units, then use that structure to define a SIMD
system structure as below. The registers are described in Table I.
typedef struct
{
unsigned short MAR,MDR;
short Regs[4],MEMORY[16],O,C,Z,S,add,cmp;
} processor;
typedef struct
{
unsigned short MAR,MDR,PC,IR,clock;
short Regs[4],MEMORY[4096],reset,O,C,Z,S,add,cmp,I,F,AMode,jmp;
processor pe[4];
} architecture; architecture simd;
The first two bits of op-code represent the type of instructions(C-type = 00), Regs.Ids selects the dest operand
register from 0 to 3, the memory address code could represent the direct memory location or index offset according
to the address mode, and the processor mask specifies the subset of processing elements that take part in the
computations.
Each processing element can execute the C-type instructions either on an integer or a floating-point arithmetic
unit that can operate on two operands. The C-type instructions are specified as Table III.
If the processing mask bit is set in the instruction, the corresponding processor will execute the instruction.
Note that if the calculation is based on the floating-point operands, it will tranucate the tailing of results since the
memory is just 16 bit while the floating-point data type for C is 32-bits. This could result some deviation for the
calculation of floating-point operands.
B. I-type Instructions
When the first two bits of op-code is 01, it represents the I-type instructions, the last four bits of op-code represents
a unique interconnections function. This instruction will read network output register of dest processing elements.
All the network are specified in the program with an offline algorithm. The I-type instructions are specified as
Table V.
TABLE V: I-type Instructions Description
Instructions Op-code ( Permutation )
0 1 2 3
PER 01 0000
1 2 3 0
( )
0 1 2 3
PEL 01 0001
3 2 1 0
( )
0 1 2 3
SHF 01 0010
0 2 1 3
( )
0 1 2 3
NXC 01 0011
1 0 3 2
( )
0 1 2 3
FXC 01 0100
2 3 0 1
C. S-type Instruction
When the first two bits are 10, it represents S-type Instruction for calculation, which means to execute the
calculation on the master processor. When the first two bits are 11, it represents to execute control instruction such
as jump instruction. And for these control instruction, the operand address could be up to 12-bits so that these
instructions will load the next entry in the instruction memory to execute.
The S-type instructions for calculation is the same as C-type instructions while the control instructions are
specified as Table VII.
Note that the memory access for processing element has 5 types of address mode:
1) Immediate: This mode is to decode the operand as constant value.
2) Direct: This mode is to decode the operand as the absolute memory location.
3) Indirect: This mode is to decode the operand as an address base respect to the address stored in the address
register.
4) Ry-: This mode is same as Indirect mode but to decrement the address register after access.
5) Ry+: This mode is same as Indirect mode but to increment the address register after access.
IV. T IMING A SSUMPTION
To measure the execution time of programs, I include a clock into the structure of the SIMD system. This clock
will be incremented according to the instrucion fetched and executed to reflect the amount of time they would
consume in bit logic. In this system, I assume that every instruction is fetched, decoded and executed in sequential
time while each processing element could execute the same instruction on multiple data at the same time.
• Fetch Cycle: In fetch cycle, it loads next instruction’s address into MAR register at the first clock cycle, then
fetch the next instruction into IR register at the second clock cycle. Therefore, this fetch cycle consumes 2
clock cycles for each instruction.
• Decode Cycle: In decode cycle, it translates the instruction in the IR register into different instructions, which
will not increment the clock cycles in the simulator.
• Execute Cycle: In execute cycle, it executes different instructions, which will consume different clock cycles.
According to the manual of 80x86 instruction set [2] and TMS320C6742 Floating-Point DSP [3], I set some
assumptions for integer calculation and floating-point calculation cycles such as the Multiplication for integer
is 12 and for floating-point is 20. More detailed is shown in Table VIII and IX.
In VIII, note that the clock cycles are variant since the memory access mode may different. If the address mode is
not immediate operands, such as indirect access, the memory access will consume another clock cycle besides the
calculation.
2)Transpose Matrix
This is to load entries from memory units of processing elements and exchange them by interconnection networks
so that get a transposed matrix stored in the same memory location. It mainly uses the S-type Instructions SET
RX and C-type instrutions LDA and STO, I-type instructions PEL, NXC, and FXC.
Outputs: I test this program with sample matrix and get the following outputs.
Input: 1,2,3,4,
5,6,7,8,
9,10,11,12,
13,14,15,16,
Output:
The number of instructions executed = 47
The number of clock cycles used = 186
Therefore, we can see that the output matrix is correct and the clock cycles used are 186.
VI. F LOATING - POINT M ATRIX M ULTIPLICATION
Based on the instructions defined above and the architecture designed, I write a program to load and multiply
two 4x4 matrices with floating-point entries on this SIMD system. By choosing the matrix transposition example
option in the program menu, user is allowd to enter two 4x4 integer matries row by row. After entering all the
entries, the test program will generate coresponding ”load multiplication.simd” program in machine code. And then
user could execute SIMD program and get the outputs.
The SIMD program has two parts:
1)Load Matrix
This is similar to load entries in matrix transpostion. The main difference is: a)the float register should be set
instead of integer register; b) two matrices will be loaded and stored in memory units of processing elements.
2)Multiply Matrix
This is to load entries from memory units of processing elements and exchange them by interconnection networks
and then execute multiplication so that get the output matrix stored in new memory location after several loops. It
mainly uses the S-type Instructions SET RY, C-type instructions LDA, STO, MPY and ADD, I-type instructions
PEL, NXC, and FXC.
To access the memory from different processing elements every clock cycle, I calculate the matrices multiplication
as below:
P0,j = A0,0 × B0,j + A0,1 × B1,j + A0,2 × B2,j + A0,3 × B3,j
P1,j = A1,1 × B1,j + A1,0 × B0,j + A1,3 × B3,j + A1,2 × B2,j
(1)
P2,j = A2,2 × B2,j + A2,3 × B3,j + A2,0 × B0,j + A2,1 × B1,j
P3,j = A3,3 × B3,j + A3,2 × B2,j + A3,1 × B1,j + A3,0 × B0,j
So that for every processing element, it access different processing elements at the same time, which is convinient
for SIMD. The psecode is as below:
Outputs: I test this program with sample matrix and get the following outputs.
Input:
0.5 0.1 3.2 2.3
4.1 5.2 6.5 7.1
8.2 9.1 10.1 11
-12.1 -13.2 -14.3 15.2
Output:
The number of instructions executed = 175
The number of clock cycles used = 1049