CSA - PREVIOUS UNIVERSITY QUESTIONS
Module – 1
No Questions Year Marks
1 With a neat sketch, explain the architecture of a vector Dec-2018 4
supercomputer. Sept-2020
2 Explain implicit and explicit parallelism in parallel Dec-2018 4
programming Sept-2020
3 Explain Flynn’s classification of computer architecture Dec-2018 4
4 Explain SIMD machine model. May-2019 3
5 Explain the role of compilers in exploiting parallelism Dec-2018 3
6 With proper equations, explain the terms (i) CPU Time (ii) Throughput Rate May-2019 4
7 … Parallelism? Dec-2019 3
8 ● Discuss the Bernstein's conditions for checking the parallelism among a set of processes. Sept-2020 3
● State and explain Bernstein's conditions for parallelism?
9 ● State Amdahl's law. Write an expression for the overall speedup. Dec-2019 4
● Define Amdahl's Law. Sept-2020 3
PROBLEMS
1 Consider the execution of the following code segment consisting of seven statements. Use Bernstein's conditions to detect the maximum parallelism embedded in this code. Justify the portions that can be executed in parallel and the remaining portions that must be executed sequentially. Rewrite the code using parallel constructs such as Cobegin and Coend. No variable substitution is allowed. All statements can be executed in parallel if they are declared within the same block of a (Cobegin, Coend) pair. May-2019 5
S1: A=B+C
S2: C=D+E
S3: F=G+E
S4: C=A+F
S5: M=G+C
S6: A=L+E
S7: A=E+A
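The following is an illustrative sketch (not part of the original question): the pairwise Bernstein checks can be automated in Python, with the input set I(k) and output set O(k) of each statement read directly from S1–S7.

# Hedged sketch: pairwise Bernstein's-condition check for S1..S7.
# For each statement k, I(k) = read set, O(k) = write set.
stmts = {
    "S1": ({"B", "C"}, {"A"}),   # A = B + C
    "S2": ({"D", "E"}, {"C"}),   # C = D + E
    "S3": ({"G", "E"}, {"F"}),   # F = G + E
    "S4": ({"A", "F"}, {"C"}),   # C = A + F
    "S5": ({"G", "C"}, {"M"}),   # M = G + C
    "S6": ({"L", "E"}, {"A"}),   # A = L + E
    "S7": ({"E", "A"}, {"A"}),   # A = E + A
}

def parallel(a, b):
    """Si || Sj iff I(i)∩O(j) = O(i)∩I(j) = O(i)∩O(j) = empty."""
    (i1, o1), (i2, o2) = stmts[a], stmts[b]
    return not (i1 & o2) and not (o1 & i2) and not (o1 & o2)

names = list(stmts)
for x in range(len(names)):
    for y in range(x + 1, len(names)):
        if parallel(names[x], names[y]):
            print(names[x], "||", names[y])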
2 A 400MHz processor was used to execute a program with Dec-2019 4
150000 floating point instructions with clock cycle count of 1.
Determine the execution time and MIPS rate for this program.
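A quick sketch of the arithmetic (assuming, as stated, a clock cycle count of 1 per instruction):

# Hedged worked example for Problem 2.
f = 400e6          # clock rate: 400 MHz
Ic = 150_000       # floating-point instruction count
CPI = 1            # cycles per instruction (given)

T = Ic * CPI / f             # execution time in seconds
mips = f / (CPI * 1e6)       # MIPS rate
print(f"T = {T * 1e6:.1f} microseconds, MIPS = {mips:.0f}")
# -> T = 375.0 microseconds, MIPS = 400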
3 Analyse the data dependences among the following statements and construct a dependency graph. Also detect the parallelism embedded in them. Dec-2019
S1 : Load R1 , M(100) / R1 ← Memory(100) /
S2 : Move R2 , R1 / R2 ← (R1 ) /
S3 : Inc R1 / R1 ← (R1 ) + 1 /
S4 : Add R2 , R1 / R2 ← (R2 ) + (R1) /
S5 : Store M(100), R1 / Memory(100) ← (R1) /
4 Consider the execution of an object code with 200,000 Sept-2020 6
instructions on a 40-MHz processor. The program consists of
four major types of instructions. The instruction mix and the
number of cycles (CPI) needed for each instruction type are
given below based on the result of a program trace experiment.
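The instruction-mix table referred to above did not survive in these notes, so the numbers below are hypothetical placeholders; the sketch only shows the method (weighted-average CPI, then execution time and MIPS rate).

# Hedged sketch for Problem 4. The mix fractions and CPI values are
# HYPOTHETICAL (the original table is missing); only the method matters.
f = 40e6                    # 40 MHz clock
Ic = 200_000                # total instruction count
mix = {                     # type: (fraction of Ic, CPI)
    "arithmetic":  (0.60, 1),
    "load/store":  (0.18, 2),
    "branch":      (0.12, 4),
    "memory ref":  (0.10, 8),
}
avg_cpi = sum(frac * cpi for frac, cpi in mix.values())
T = Ic * avg_cpi / f                 # execution time in seconds
mips = f / (avg_cpi * 1e6)           # effective MIPS rate
print(f"average CPI = {avg_cpi:.2f}, T = {T * 1e3:.2f} ms, MIPS = {mips:.1f}")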
ANSWERS
● The same data stream flows through a linear array of processors executing
different instruction streams.
● This architecture is also known as a systolic array, used for pipelined execution of specific algorithms.
4. SIMD Model
● Computers with Multiple Processing elements
● They perform same operation on multiple data points simultaneously
● An operational model of an SIMD computer is specified by a 5-tuple M = (N, C, I, M, R), where N is the number of processing elements (PEs), C is the set of instructions executed directly by the control unit, I is the set of instructions broadcast to the PEs for parallel execution, M is the set of masking schemes, and R is the set of data-routing functions.
5. ROLE OF COMPILERS
Throughput Rate
● System throughput WS
– how many programs a system can execute per unit time
● CPU throughput Wp
– a measure of how many programs can be executed per second
– in a multiprogrammed system, the system throughput is often lower than the CPU throughput.
7. UMA, NUMA, COMA Models
● Multiprocessors are called tightly coupled systems due to the high degree of resource
sharing.
● The system interconnect takes the form of a common bus, a crossbar switch, or a
multistage network
● The UMA model is suitable for general-purpose and time-sharing applications by multiple users.
● It can be used to speed up the execution of a single large program in time-critical applications.
● To coordinate parallel events, synchronization and communication among processors are done using shared variables in the common memory.
● Symmetric Multiprocessor
– When all processors have equal access to all peripheral devices, and
– all the processors are equally capable of running the executive programs, such as the OS kernel and I/O service routines.
b) NUMA(Non Uniform Memory Access)
● It is a shared-memory system in which the access time varies with the location of the memory word.
● Two NUMA machine models are
– Shared local memories model
– A hierarchical cluster model
9. Amdahl’s law
“Amdahl’s law states that the performance improvement to be gained from using some faster
mode of execution is limited by the fraction of the time the faster mode can be used.”
● It is a formula which gives the theoretical speedup in latency of the execution of a
task at a fixed workload that can be expected of a system whose resources are
improved.
● In other words, it is a formula used to find the maximum improvement possible by
just improving a particular part of a system.
● It is often used in parallel computing to predict the theoretical speedup when using
multiple processors.
● Amdahl's law uses two factors to find the speedup from some enhancement: the fraction of time the enhancement can be used (Fraction_enhanced) and the speedup of the enhanced portion (Speedup_enhanced):
Overall speedup = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
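A minimal sketch of the law as a function (the example values are illustrative, not from the original notes):

# Hedged sketch: Amdahl's-law overall speedup.
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Example: 90% of a task sped up by a factor of 8 (e.g., 8 processors).
print(round(amdahl_speedup(0.9, 8), 2))   # -> 4.71, far below 8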
CSA - PREVIOUS UNIVERSITY QUESTIONS
Module – 2
No Questions Year Marks
1 Compare the characteristics of CISC and RISC architectures. Dec-2018 4
Compare RISC and CISC scalar processor architectures. Sept-2020 4
Differentiate between the following with necessary diagrams: RISC and CISC. Dec-2019 4
2 Explain the terms (i) Hit Ratio (ii) Effective Access Time Dec-2018 3
with proper equations.
3 Explain VLIW architecture. Also explain pipelining in Dec-2018 6
VLIW processors
4 Explain the property of locality of reference in memory. May-2019 4
5 Explain memory hierarchy. May-2019 3
6 Explain Superscalar architecture. Also explain pipelining in May-2019 6
superscalar processors.
7 Define the inclusion property of a memory hierarchy. Illustrate Dec-2019 5
the data transfers between adjacent levels of a memory
hierarchy.
8 Distinguish between scalar RISC and super-scalar RISC in terms of instruction issue, pipeline architecture and performance. Dec-2019 4
9 Define a) Instruction issue latency b) Instruction issue rate. Sept-2020 4
PROBLEMS
3. VLIW Architecture
● As shown above, multiple functional units are used concurrently in a VLIW processor.
● All functional units share a common large register file.
● The operations to be simultaneously executed by the functional units are synchronized in a VLIW instruction, say 256 or 1024 bits per instruction word.
● Different fields of the long instruction word carry the opcodes to be dispatched to
different functional units.
● Programs written in short instruction words (32 bits) must be compacted together to
form the VLIW instructions – the code compaction must be done by compiler.
Pipelining:
● The execution of instructions by an ideal VLIW processor is shown below: each
instruction specifies multiple operations. The effective CPI becomes 0.33 in this example
VLIW machines behave much like superscalar machines with three differences:
4. Locality of Reference:
1. Temporal locality –
✓ Recently referenced items are likely to be referenced again in near future. ( ex:
loop portion in a code is executed continuously until loop terminates)
✓ leads to usage of the least-recently-used (LRU) replacement algorithm (see the sketch after this list)
✓ helps to determine the size of memory at successive levels
2. Spatial locality –
✓ Tendency for a process to access items whose addresses are near one another. (ex: array elements, macros, routines, etc.)
✓ Assists in determining the size of the unit of data transfer between adjacent memory levels
3. Sequential locality –
✓ Execution of instructions follows a sequential order unless branched
instruction occurs
✓ Affects determination of grain size for optimal scheduling
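As an illustrative aside (not from the original notes), temporal locality is exactly what an LRU replacement policy exploits; a toy software LRU cache:

# Hedged sketch: a toy LRU cache showing how temporal locality
# (recently used items are reused soon) motivates LRU replacement.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def access(self, key) -> str:
        if key in self.data:
            self.data.move_to_end(key)     # key becomes most recently used
            return f"hit  {key}"
        if len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # evict the least recently used
        self.data[key] = True
        return f"miss {key}"

cache = LRUCache(2)
for k in ["A", "B", "A", "C", "A"]:        # loop-like reuse of A
    print(cache.access(k))                 # A misses once, then keeps hitting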
5. Memory Hierarchy:
Characterized by 5 parameters::
1. Access time (ti) – the round-trip time from the CPU to the ith-level memory.
2. Memory size (si) – the number of bytes or words in level i.
3. Cost per byte (ci) – the cost of the ith-level memory, estimated by ci·si.
4. Transfer bandwidth (bi) – the rate at which information is transferred between adjacent levels.
5. Unit of transfer (xi) – the grain size of data transfer between levels i and i+1.
a) Registers
• Registers are part of processor – Thus, at times not considered a part of memory
• Register assignment made by compiler
• Register operations directly controlled by compiler – thus register transfers take place at
processor speed
b) Cache
● Controlled by the MMU
● Can be implemented at one or multiple levels.
● Information transfer between CPU and cache is in terms of words (4 or 8 bytes, depending on the word length of the machine)
● Cache is divided into cache blocks (typically 32 bytes)
● Blocks are the unit of data transfer between cache and main memory, or between L1 and L2 cache
c) Main Memory (primary memory)
● Larger than cache
● Implemented by cost-effective RAM chips (e.g., DDR SDRAMs)
● Managed by the MMU in cooperation with the OS
● Divided into pages (e.g., 4 Kbytes); each page contains 128 blocks
● Pages are the unit of information transferred between disk and main memory.
● Scattered pages are organized as segments
d) Disk Drives and Backup Storage
• Highest level of the memory hierarchy
• Holds system programs like the OS and compilers, user programs, data, etc.
6. Superscalar Architecture:
● The fig shows a three-issue (m=3) superscalar pipeline, m instructions execute in parallel.
7. Inclusion Property:
Information stored in a memory hierarchy (M1, M2, …, Mn) should satisfy the inclusion property: M1 ⊂ M2 ⊂ … ⊂ Mn.
CSA - PREVIOUS UNIVERSITY QUESTIONS
Module – 3
No Questions Year Marks
9 Explain chained cache coherence protocol. Dec-2019 4
10 Explain the different levels of hierarchy of bus systems. Dec-2019 4
11 ● Define the write-invalidate snoopy bus protocol for Dec-2019 5
maintaining cache coherence. Show the different
possible state transitions for write-through and write-
back cache using the write-invalidate protocol.
● Draw the two state transition graphs for a cache block using the write-invalidate write-through snoopy bus protocol. Sept-2020 3
12 Explain the three major operational characteristics of a Dec-2019 3
multiprocessor interconnection network.
13 How can hardware synchronization be achieved in a multiprocessor system? Sept-2020 4
PROBLEMS
1 Design an 8 input omega network using 2X2 switches as Sept-2020 6
building blocks. Show the switch settings for the permutation
π1=(0,6,3,2,5)(1,4). Show the conflicts in switch settings, if
any. Explain blocking and non-blocking networks in this
context.
2 Design an 8 input omega network using 2X2 switches as Dec-2018 6
building blocks. Show the switch settings for the
permutation π1=(0,6,4,7,3)(1,5)(2). Show the conflicts in
switch settings, if any. Explain blocking and non-blocking
networks in this context.
3 Design an 8 input omega network using 2X2 switches as May-2019 6
building blocks. Show the switch settings for the permutations
π1=(0,7,6,4,2)(1,3)(5). Show the conflicts in switch settings, if
any. Explain blocking and non-blocking networks in this
context.
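A sketch of how such problems can be checked mechanically, assuming the textbook omega-network model (log2(8) = 3 stages of 2×2 switches, a perfect shuffle before each stage, destination-tag switch settings). Two routes that need the same output link of the same stage constitute a conflict, i.e., blocking.

# Hedged sketch: destination-tag routing in an 8-input omega network.
n, k = 8, 3

def route(src: int, dst: int):
    """Return the output link occupied at each of the k stages."""
    links, p = [], src
    for i in range(k):
        p = ((p << 1) | (p >> (k - 1))) & (n - 1)   # perfect shuffle
        p = (p & ~1) | ((dst >> (k - 1 - i)) & 1)   # switch sets low bit
        links.append(p)
    return links

# pi1 = (0,6,3,2,5)(1,4) from the Sept-2020 problem; unlisted inputs (7)
# map to themselves (an assumption of standard cycle notation).
perm = {0: 6, 6: 3, 3: 2, 2: 5, 5: 0, 1: 4, 4: 1, 7: 7}

used = {}
for s, d in perm.items():
    for stage, link in enumerate(route(s, d)):
        if (stage, link) in used:
            print(f"conflict at stage {stage}: {used[stage, link]} vs {s}->{d}")
        used[stage, link] = f"{s}->{d}"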
ANSWERS
● Crossbar networks connect every input to every output through a cross point switch.
● A crossbar network is a single stage, non-blocking permutation network (one-to-one).
● In an n-processor, m-memory system, n*m cross point switches will be required.
● Each cross point is a unary switch which can be open or closed, providing a point-to-
point connection path between the processor and a memory module.
● All processors can send memory requests independently and asynchronously.
● This poses the problem of multiple requests destined for the same memory module at
the same time. In such cases, only one of the requests is serviced at a time.
Multiport Memory :
● Since crossbar switches are expensive, and not suitable for systems with many
processors or memory modules, multiport memory modules may be used instead.
● A multiport memory module has multiple connection points for processors (or I/O devices), and the memory controller in the module handles the arbitration and switching that would otherwise be done by a crosspoint switch.
● Initially, before migration, P1 has the value x in its cache, and x is also in shared memory.
● Now suppose x migrates to P2, and P2 reads x and updates it to x'. Under a write-through policy the new value is updated simultaneously in shared memory, but P1's cache still holds x, so an inconsistency exists.
● In the write-back case, if P1 reads the value x and modifies it to x', the change is not propagated to shared memory, which P2 reads while executing its process. So the updated value is not used by P2 to perform its task (P1 to P2).
Inconsistency due to IO:
● Data movement from an I/O device to shared memory usually does not cause the cached copies of data to be updated.
● As a result, an input operation that writes x will write data to the shared primary memory, causing it to become inconsistent with a cached value of x.
● Likewise, writing data to an I/O device usually uses the data in the shared primary memory, ignoring any potentially cached data with different values.
● When an I/O processor loads new data X' into the main memory (middle figure above), inconsistency occurs between the caches and shared memory.
● Outputting data directly from the shared memory (bypassing the cache) also creates inconsistency.
● Possible solution is to attach I/O processors to private caches (C1 and C2) as shown in
fig.
● This way I/O processors share caches with the CPU. The I/O consistency can be
maintained if cache-to-cache consistency is maintained via the bus.
1. Valid:
The cache block, which is consistent with the memory copy, has been read from shared
memory and has not been modified.
2. Invalid:
The block is not found in the cache or is inconsistent with the memory copy.
3. Reserved: Data has been written exactly once since being read from shared memory.
The cache copy is consistent with the memory copy, which is the only other copy.
4. Dirty: The cache block has been modified (written) more than once, and the cache
copy is the only one in the system (thus inconsistent with all other copies).
Read-Miss:
Write-Hit:
Write-miss:
Read-hit:
Block Replacement
● In the figure above, in the first state, location X is missing in all caches in the system.
● The second state results from three caches (C1, C2 and C3) requesting copies of location X. Three pointers (processor bits) are set in the entry to indicate the caches that have copies of the block of data.
● In the first two states, the dirty bit on the left side of the directory entry is set to clean (C), indicating that no processor has permission to write to the block of data.
● The third state results from cache C3 requesting write permission for the block. In the final state the dirty bit is set to dirty (D) and there is a single pointer to the block of data in cache C3.
● The memory consumed by the directory is proportional to the size of memory, O(N), multiplied by the size of each directory entry, O(N); thus the total memory overhead scales as the square of the number of processors, O(N²).
Limited Directories
● The figure above shows the situation when three caches request read copies in a memory system with a Dir2 NB protocol (only two copies are allowed). The two-pointer directory can be viewed as a two-way set-associative cache.
● When cache C3 requests a copy of location X, the memory module must invalidate the copy in either the C1 or the C2 cache. This process of pointer replacement is called eviction.
5. Write Invalidate and Write Update
Write Invalidate policy :
● Invalidates all remote copies when a local cache block is updated.
● In fig below processor P1 modifies(write) its cache from X to X’ and all other copies are
invalidated via the bus.
● Invalidated blocks are marked dirty, meaning they should not be used.
Write Update policy:
● Broadcasts the new data block to all caches containing a copy of the block.
● In the figure below, the newly updated block content X' is broadcast to all cache copies via the bus.
● The memory copy is also updated if write-through caches are used; if write-back caches are used, the memory copy is updated later when the block is replaced.
● One of the following operations will be performed if processor P1 executes Ans1 ← Fetch&Add(x, e1) and P2 executes Ans2 ← Fetch&Add(x, e2) simultaneously on the shared variable x.
● If the request from P1 is executed before that of P2, the following values are returned: Ans1 ← x and Ans2 ← x + e1, and x becomes x + e1 + e2. If P2 executes first, the roles are reversed.
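A sketch of these semantics in software (a lock stands in for the hardware atomicity; whichever thread the "hardware" serializes first decides who sees the old value):

# Hedged sketch: Fetch&Add semantics on a shared variable x.
import threading

x = 10
ans = {}
lock = threading.Lock()

def fetch_and_add(name: str, e: int):
    global x
    with lock:            # atomic read-modify-write
        ans[name] = x     # the OLD value is returned
        x += e

t1 = threading.Thread(target=fetch_and_add, args=("Ans1", 3))
t2 = threading.Thread(target=fetch_and_add, args=("Ans2", 5))
t1.start(); t2.start(); t1.join(); t2.join()

# Whichever order is serialized, x ends as 10 + 3 + 5 = 18; one processor
# sees 10 and the other sees 10 plus the first one's increment.
print(ans, "x =", x)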
7. Write- invalidate snoopy protocol using write back policy:
Write-Back Caches:
1. The valid state of a write-back cache can be further split into two cache states:
• RW (read-write / exclusive) and
• RO (read-only / shared), as shown in the figure below.
2. The INV (invalidated or not-in-cache) state is equivalent to the invalid state mentioned before.
● When the memory owns a block, caches can contain only RO (shared) copies of the block. In other words, multiple copies may exist in the RO state, and every processor having a copy (called a keeper of the copy) can read it safely (R(i), R(j)).
● The INV state is entered whenever a remote processor writes (W(j)) its local copy or the local processor replaces (Z(i)) its own block copy.
● The RW (exclusive) state corresponds to only one cache copy existing in the entire system, owned by the local processor i. Reads (R(i)) and writes (W(i)) can be safely performed in the RW state.
● From either the RO state or the INV state, the cache block becomes uniquely owned when a local write (W(i)) takes place.
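A compact sketch of the three-state machine just described, using the R/W/Z event notation from the text ("local" = processor i, "remote" = any processor j); this is a plain reconstruction of the textbook diagram, not any particular vendor's protocol:

# Hedged sketch: write-invalidate, write-back snoopy protocol,
# one three-state machine per cache block.
TRANSITIONS = {
    ("RO",  "local_read"):    "RO",   # shared copies are read safely
    ("RO",  "local_write"):   "RW",   # W(i): invalidate remote copies
    ("RO",  "remote_write"):  "INV",  # W(j)
    ("RO",  "local_replace"): "INV",  # Z(i)
    ("RW",  "local_read"):    "RW",   # exclusive owner reads freely
    ("RW",  "local_write"):   "RW",   # ... and writes freely
    ("RW",  "remote_read"):   "RO",   # ownership downgraded to shared
    ("RW",  "remote_write"):  "INV",
    ("RW",  "local_replace"): "INV",
    ("INV", "local_read"):    "RO",   # read miss: fetch a shared copy
    ("INV", "local_write"):   "RW",   # write miss: fetch and own it
}

state = "INV"
for event in ["local_read", "remote_write", "local_write", "remote_read"]:
    state = TRANSITIONS[state, event]
    print(f"{event:14s} -> {state}")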
8.Generalized Multiprocessor System:
● Parallel processing demands the use of efficient system interconnects for fast
communication among multiple processors and shared memory, I/0, and peripheral
devices.
● The following are the commonly used system interconnects for this purpose:
• hierarchical buses,
• crossbar switches, and
• multistage networks.
● Networks are designed with many choices, based on topology, timing, switching, and control strategy.
● In a dynamic network, the interconnections among multiprocessors are established under program control.
1. Timing control can be either synchronous or asynchronous
✓ Synchronous – controlled by a global clock which synchronizes all network activity.
✓ Asynchronous – uses handshaking or interlock mechanisms for communication, especially suitable for coordinating devices with different speeds.
2.Switching Method
✓ Circuit switching – once a device is granted a path in the network, it occupies the
path for the entire duration of the data transfer.
✓ Packet switching – the information is broken into small packets individually
competing for a path in the network.
3.Network Control
● In the fig above singly linked chain, initially assume there are no shared copies of
location X.
● When processor P1 reads location X, the memory sends a copy to cache C1 along with a chain-termination (CT) pointer. The memory also keeps a pointer to cache C1.
● Subsequently, when processor P2 reads location X, the memory sends a copy to cache C2, along with the pointer to cache C1. The memory then keeps a pointer to cache C2.
● By repeating the above step, all of the caches can cache a copy of the location X.
● If processor P3 writes to location X, it is necessary to send a data invalidation
message down the chain.
● To ensure sequential consistency, the memory module denies processor P3 write
permission until the processor with the chain termination pointer acknowledges the
invalidation of the chain.
● Perhaps this scheme should be called a gossip protocol (as opposed to a snoopy
protocol) because information is passed from individual to individual rather than
being spread by covert observation.
10. Hierarchy of Bus Systems:
A bus system consists of a hierarchy of buses connecting various systems and components in
a computer. Different buses are used to perform different interconnections
• There are numerous ways in which buses, processors, memories, and I/O devices can be
organized.
• One organization has processors (and their caches) as leaf nodes in a tree, with the buses
(and caches) to which these processors connect forming the interior nodes.
• This generic organization, with appropriate protocols to ensure cache coherency, can model
most hierarchical bus organizations.
● Divided into different clusters each connected through a cluster bus.
● Inter-cluster bus used to provide connection among clusters.
● Second-level caches (C20-C22) are used between each cluster and the inter-cluster bus. Each second-level cache has a higher capacity than the sum of the capacities of all first-level caches connected beneath it.
● Main memory modules are connected to the Inter-cluster bus.
● Most memory request should be satisfied at the first level caches.
11. Write Invalidate Snoopy bus protocol-Write through and Write back
Write Invalidate policy –
● Invalidates all remote copies when a local cache block is updated. In the figure below, processor P1 modifies (writes) its cache from X to X', and all other copies are invalidated via the bus.
● Invalidated blocks are marked dirty, meaning they should not be used.
The state transitions are triggered by read, write, and replacement operations. A block copy of a write-through cache i attached to processor i can assume one of two possible cache states:
o valid
o invalid
Write-Back Caches:
1. The valid state of a write-back cache can be further split into two cache states:
• RW (read-write / exclusive) and
• RO (read-only / shared), as shown in the figure below.
2. The INV (invalidated or not-in-cache) state is equivalent to the invalid state mentioned before. This three-state coherence scheme corresponds to an ownership protocol.
Module – 4
No Questions Year Marks
1 ● Differentiate between store and forward and wormhole routing. Dec-2018 3
● Explain various message routing schemes used in message passing multi-computers. May-2019 5
● Explain different message routing schemes. Sept-2020 4
ANSWERS:
a) Store-and-forward routing (packets are the smallest unit of information transmission)
b) Wormhole routing (packets are further subdivided into flits)
1. Speedup
● Ideally, a linear pipeline of k stages can process n tasks in k + (n − 1) clock cycles: k cycles are needed to complete the execution of the very first task, and the remaining (n − 1) tasks require one cycle each.
● The speedup over an equivalent non-pipelined processor is Sk = n·k / (k + n − 1).
2. Efficiency:
● The efficiency of a linear k-stage pipeline is Ek = Sk / k = n / (k + n − 1).
3. Throughput:
● Defined as the number of tasks (operations) performed per unit time: Hk = n / [(k + n − 1)·τ], where τ is the clock period.
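A sketch tying the three formulas together (τ = 1/f is the clock period; the values below are illustrative):

# Hedged sketch: linear-pipeline metrics for k stages and n tasks.
def pipeline_metrics(k: int, n: int, f_hz: float):
    cycles = k + (n - 1)              # total clock cycles
    speedup = (n * k) / cycles        # Sk = n*k / (k + n - 1)
    efficiency = n / cycles           # Ek = Sk / k
    throughput = n * f_hz / cycles    # Hk = tasks per second
    return cycles, speedup, efficiency, throughput

cycles, s, e, h = pipeline_metrics(k=4, n=100, f_hz=500e6)
print(cycles, round(s, 2), round(e, 2), f"{h:.3g} tasks/s")
# -> 103 cycles, speedup 3.88, efficiency 0.97, ~4.85e8 tasks/s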
● Packet 1 is allocated the channel; packet 2 is temporarily stored in a packet buffer and will be transmitted when the channel becomes available (i.e., this combines the store-and-forward and wormhole routing schemes).
● The packet buffer may cause significant storage delay.
Adv: Does not waste the resources already allocated.
Disadv: Requires the use of a large buffer to hold the entire packet.
● Wormhole routing uses a blocking policy in case of packet collision: the second packet is blocked but not abandoned.
6 A static pipeline is specified by a single reservation table; for a dynamic pipeline, multiple reservation tables can be generated for the evaluation of different functions.
Module – 5
No Questions Year Marks
1 ● What are the possible hazards that can occur between Dec-2019 6
read and write operations in an instruction pipeline?
● Which are the three logic hazards possible in an Dec-2018 4
instruction pipeline? Define each. Write the necessary
conditions for each to occur.
PROBLEM
1 Consider the execution of a program of 15,00,000 Dec-2019 4
instructions by a linear pipeline processor with a clock rate
of 1000 MHz. Assume that the instruction pipeline has five
stages and that one instruction is issued per cycle. The
penalties due to branch instruction and out-of-order
execution are ignored.
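A sketch of the arithmetic, reusing the linear-pipeline formulas from Module 4 (the non-pipelined reference assumes k cycles per instruction):

# Hedged worked example: k = 5 stages, n = 1,500,000 instructions, 1000 MHz.
k, n, f = 5, 1_500_000, 1000e6

cycles = k + (n - 1)            # pipelined cycle count
T_pipe = cycles / f             # ~1.5 ms
T_seq = n * k / f               # non-pipelined reference time
speedup = T_seq / T_pipe        # approaches k = 5 for large n

print(f"T = {T_pipe * 1e3:.4f} ms, speedup = {speedup:.4f}")
# -> T = 1.5000 ms, speedup = 5.0000 (approximately)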
ANSWERS
1. RAW HAZARD
R(I) ∩ D(J) ≠ ∅ for a RAW hazard – flow dependence
Ex: I: a = b + c
J: e = a + d
2. WAW HAZARD
R(I) ∩ R(J) ≠ ∅ for a WAW hazard – output dependence
Ex: I: a = b + c
J: a = d + e
3. WAR HAZARD
D(I) ∩ R(J) ≠ ∅ for a WAR hazard – anti-dependence
Ex: I: a = b + c
J: b = a + d
2. IN-ORDER ISSUE and OUT-OF-ORDER ISSUE
Three scheduling policies are introduced below.
1. In order issue & In order completion : Instructions are issued in program order and
instructions are completed in program order.
2. In order issue & Out of order completion : Instructions are issued in program
order and instructions are completed not in program order.
3. Out of order issue & Out of order completion : Program order is violated in
instruction issue and completion
i) In-Order Issue
● Figure 6.30a shows a schedule for the six instructions being issued in program order I1, I2, …, I6.
● Pipeline 1 receives I1, I3, and I5, and pipeline 2 receives I2, I4, and I6 in three consecutive cycles.
● Due to the dependence I1 → I2, I2 has to wait one cycle to use the data loaded by I1.
● I3 is delayed one cycle for the same adder used by I2. I6 has to wait for the result of I5 before it can enter the multiplier stages.
● To maintain in-order completion, I5 is forced to wait two cycles before coming out of pipeline 1.
● In total, nine cycles are needed and five idle cycles (shaded boxes) are observed.
Out-of-Order Issue
● Branches are predicted statically based on branch code types, or dynamically based on branch history during program execution.
● The frequency and probabilities of branch taken and branch types across a large number of program traces are used to predict a branch.
● Such a static branch strategy may not be very accurate.
● The static prediction direction (taken or not-taken) can even be wired into the processor.
● The wired-in static prediction cannot be changed once committed to the hardware.
Dynamic branch prediction:
● Uses recent branch history to predict whether or not the branch will be taken the next time it occurs.
● To be accurate, the entire branch history would have to be used to predict the future choice, but this is not practical; a limited recent history is used instead.
● Requires additional hardware to keep track of the past behaviour of the branch instructions at run time.
4. INTERNAL DATA FORWARDING
Memory access operations can be replaced by register transfer operations to save time
5. CARRY-SAVE ADDER (CSA)
● In general, an n-bit CSA is specified as follows. Let X, Y, and Z be three n-bit input numbers, expressed as X = (Xn−1, Xn−2, …, X1, X0), and so on. The CSA performs bitwise operations simultaneously on all columns of digits to produce two n-bit output numbers: a bitwise sum S^b = (0, Sn−1, Sn−2, …, S1, S0) and a carry vector C = (Cn, Cn−1, …, C1, 0).
● Note that the leading bit of the bitwise sum S^b is always 0 and the tail bit of the carry vector C is always 0. The input/output relationships are Si = Xi ⊕ Yi ⊕ Zi and Ci+1 = XiYi + YiZi + ZiXi.
● The arithmetic sum of the three input numbers, S = X + Y + Z, is obtained by adding the two output numbers, S = S^b + C, using a carry-propagate adder (CPA).
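A bitwise sketch of the relationships just stated (the 8-bit width and test values are arbitrary):

# Hedged sketch: n-bit carry-save adder. X + Y + Z is reduced to a
# bitwise sum S^b and a carry vector C, then combined with one CPA add.
def carry_save_add(x: int, y: int, z: int, n: int = 8):
    s, c = 0, 0
    for i in range(n):
        xi, yi, zi = (x >> i) & 1, (y >> i) & 1, (z >> i) & 1
        s |= (xi ^ yi ^ zi) << i                              # Si
        c |= ((xi & yi) | (yi & zi) | (zi & xi)) << (i + 1)   # Ci+1
    return s, c

X, Y, Z = 23, 45, 67
S_b, C = carry_save_add(X, Y, Z)
assert S_b + C == X + Y + Z      # the CPA step: S = S^b + C
print(S_b, C, S_b + C)           # -> ... 135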
6. EFFECT OF BRANCHING:
● When a branch is taken, all instructions following the branch in the pipeline become useless and will be drained from the pipeline. Thus a taken branch causes the pipeline to be flushed, losing a number of pipeline stages.
BRANCH PREDICTION:
• Branches are predicted statically based on branch code types, or dynamically based on branch history during program execution.
• The frequency and probabilities of branch taken and branch types across a large number of program traces are used to predict a branch.
• The static prediction direction (taken or not-taken) can even be wired into the processor.
• The wired-in static prediction cannot be changed once committed to the hardware.
Dynamic branch prediction:
● Uses recent branch history to predict whether or not the branch will be taken the next time it occurs.
● To be accurate, the entire branch history would have to be used to predict the future choice, but this is not practical; a limited recent history is used instead.
● Requires additional hardware to keep track of the past behaviour of the branch instructions at run time.
● Branch Target Buffers (BTBs) are used to implement branch prediction. A BTB holds recent branch information, including the address of the branch target used; the address of the branch instruction locates its entry in the BTB.
● The state transition diagram is used for backtracking the last two branches in a given program. The BTB entry contains backtracking information which guides the prediction.
● The BTB can store not only the branch target address but also the target instruction itself and a few successor instructions, in order to allow zero delay in converting conditional branches to unconditional branches.
7. CDC SCOREBOARDING:
● The figure above shows a CDC 6600-like processor that uses dynamic instruction-scheduling hardware.
● Here, multiple functional units appear as multiple execution pipelines.
● The parallel units allow instructions to complete out of the original program order.
● The processor had instruction buffers for each execution unit.
● Instructions are issued to available functional units regardless of whether register input data are available.
● To control the correct routing of data between execution units and registers, the CDC 6600 used a centralized control unit known as the scoreboard.
● The scoreboard kept track of the registers needed by instructions waiting for the various functional units.
● When all registers have valid data, the scoreboard enables the instruction execution.
● When a functional unit finishes, it signals the scoreboard to release the resources.
● The scoreboard is thus centralized control logic which keeps track of the status of the registers and of the multiple functional units.
a) Superpipelined:
● A superpipelined processor of degree n issues one instruction per pipeline cycle, where the pipeline cycle is 1/n of the base cycle.
Execution time: T(1, n) = k + (N − 1)/n base cycles.
b) Superpipelined superscalar:
● A superpipelined superscalar processor of degree (m, n) issues m instructions every pipeline cycle, with a pipeline cycle 1/n of the base cycle.
Execution time: T(m, n) = k + (N − m)/(m·n) base cycles.
● Tomasulo's algorithm was designed to work with processors having only a few floating-point registers. In the case of the Model 91, only four such registers were available.
● Fig. a below shows a minimum-register machine code for computing X = Y + Z and A = B × C. The pipeline timing with Tomasulo's algorithm appears in Fig. b.
● Here, the total execution time is 13 cycles, counting from cycle 4 to cycle 15 by
ignoring the pipeline startup and draining times.
10. MECHANISMS FOR INSTRUCTION PIPELINING
1. Pre-fetch buffers
2. Multiple functional units
3. Internal data forwarding
4. Hazard avoidance
1. Pre-fetch Buffers
● Three types of buffers are used to match the instruction fetch rate to the pipeline consumption rate:
1. Sequential buffer
2. Target buffer
3. Loop buffer
● In a single memory access time, a block of consecutive instructions is fetched into a pre-fetch buffer.
a) Sequential buffer
● Sequential instructions are loaded into a pair of sequential buffers for in-sequence pipelining.
b) Target buffer
● Instructions from a branch target are loaded into a pair of target buffers for out-of-sequence pipelining.
● A conditional branch instruction causes both the sequential buffers and the target buffers to fill with instructions.
● After the branch condition is checked, the appropriate instructions are taken from one of the two buffer pairs, and the instructions in the other buffer pair are discarded.
● Within each pair, one buffer may be used to load instructions from memory while the other feeds instructions into the pipeline; this prevents collisions between instructions flowing in and out of the pipeline.
c) Loop buffer
● Holds sequential instructions within a loop. It is maintained by the fetch stage.
● Pre-fetched instructions in the loop body are executed repeatedly until all iterations complete execution.
2. Internal Data Forwarding
Memory access operations can be replaced by register transfer operations to save time.
b) Load-Load Forwarding
● The second load operation (LD R2, M) can be eliminated and replaced with a move operation (MOVE R2, R1).
3. Multiple Functional Units
● A certain pipeline stage may become a bottleneck; this stage corresponds to the row with the maximum number of checkmarks in the reservation table.
● The problem can be solved by using multiple copies of the same stage simultaneously, i.e., multiple execution units in a pipelined processor design.
● A pipelined processor with multiple functional units and distributed reservation stations is supported by tagging.
● To resolve data or resource dependences among successive instructions entering the pipeline, reservation stations (RSs) are used with each functional unit.
● Operands wait in an RS until their data dependences are resolved.
● Each RS is uniquely identified by a tag, monitored by a tag unit.
● The tag unit keeps checking the tags from all currently used registers or RSs.
● RSs also serve as buffers to interface the pipelined functional units with the decode and issue units.
● The multiple functional units operate in parallel once the dependences are resolved.
4. Hazard avoidance
● The read and write of shared variables by different instructions in a pipeline
may lead to different results if these instructions are executed out-of-order.
1. RAW HAZARD
R(I) ∩ D(J) ≠ ∅ for a RAW hazard – flow dependence
Ex: I: a = b + c
J: e = a + d
2. WAW HAZARD
R(I) ∩ R(J) ≠ ∅ for a WAW hazard – output dependence
Ex: I: a = b + c
J: a = d + e
3. WAR HAZARD
D(I) ∩ R(J) ≠ ∅ for a WAR hazard – anti-dependence
Ex: I: a = b + c
J: b = a + d
CSA- PREVIOUS UNIVERSITY QUESTIONS
Module – 6
No Questions Year Marks
1 Distinguish between static dataflow computers and dynamic dataflow computers. Dec-2018, Sept-2020 4
2 What are the four context switching polices for Dec-2018 4
multithreaded architecture? Dec-2019 4
3 Write a short note on fine-grain parallelism. Dec-2018 3
4 Explain distributed caching. May-2019 4
5 ● Illustrate the scalable coherence interface (SCI) interconnect model. May-2019 4
● List any two advantages and disadvantages of the Scalable Coherence Interface (SCI). Dec-2019 4
6 With a neat diagram explain the architecture of a multiple May-2019 6
context processor model.
7 ● What are the problems of asynchrony and their May-2019 6
solutions in massively parallel processors?
● Suggest different methods to overcome asynchrony Sept-2020 4
problem.
8 With a neat diagram explain the MIT/Motorola *T prototype May-2019 6
multithreaded architecture
9 ● Explain various latency hiding techniques. Sept-2020 8
● Explain any two latency hiding techniques used in distributed shared memory multicomputers. Dec-2019 6
● Explain any three latency hiding techniques used in distributed shared memory multicomputers. Dec-2018 9
2) Switch on every load –
● This policy allows switching on every load, independent of whether it will cause a miss or not. In this case, R represents the average interval between loads.
● A general multithreading model assumes that a context is blocked for L cycles after every switch; but in the case of a switch-on-load processor, this happens only if the load causes a cache miss.
3) Switch on every instruction-
This policy allows switching on every instruction, independent of whether it is a load or not.
3.FINE-GRAIN PARALLELISM.
● We compare below the grain sizes, communication latencies, and concurrency in four
classes of parallel computers.
● This comparison leads to the rationales for developing fine-grain multicomputers.
Latency Analysis
● The computing granularity and communication latency of leading early examples of multiprocessors, data-parallel computers, and medium- and fine-grain multicomputers are summarized in the table. Four attributes are identified to characterize these machines; only typical values are shown.
● The communication latency Tc measures the data or message transfer time on a system interconnect.
● This corresponds to the shared-memory access time on the Cray Y-MP, the time required to send a 32-bit value across the hypercube network in the CM-2, and the network latency on the iPSC/1 or J-Machine.
● The synchronization overhead Ts is the processing time required on a processor, a PE, or a processing node of a multicomputer for the purpose of synchronization.
● The sum Tc + Ts gives the total time required for IPC.
● The shared-memory Cray Y-MP had a short Tc but a long Ts. The SIMD machine
CM-2 had a short Ts but a long Tc.
● The long latency of the iPSC/1 made it unattractive based on fast advancing
standards.
● The MIT J-Machine was designed to make a major improvement in both of these
communication delays.
● The grain size is measured by the execution time of a typical program, including both
computing time and communication time involved.
● Supercomputers handle large grain. Both the CM-2 and the J-Machine were designed
as fine-grain machines.
● The iPSC/1 was a relatively medium-grain machine compared with the rest.
4.DISTRIBUTED CACHING:
The concept of distributed caching is shown below:
● Every memory location has an owner node
● Eg: N1 owns B and N2 owns A.
● The directories are used to contain import-export lists and state whether the data is shared (for reads, many caches may hold copies) or exclusive (for writes, one cache holds the current value).
● The directories multiplex among a small number of contexts to cover the cache loading effects.
● Distributed caching offers a solution for the remote-loads problem, but not for the synchronizing-loads problem.
● Multithreading offers a solution for remote loads and possibly for synchronizing loads. The two approaches can be combined to solve both types of remote-access problem.
The two major problems of asynchrony are:
▪ REMOTE LOADS
▪ SYNCHRONIZING LOADS
a) REMOTE LOADS
● The key issue in a remote load is how to avoid idling in node N1 during the load operations.
● Variables A and B are located on nodes N2 and N3, respectively. They need to be brought to node N1 to compute the difference A − B into variable C.
● The basic computation demands the execution of two remote loads and then the
subtraction.
● Let pA and pB be the pointers to A and B respectively. The two loads can be issued
from the same thread or from two threads.
● The context(register state) of the computation on N1 is represented by the variable
CTXT.
● It can be a stack pointer, a frame pointer, a current object pointer , a process identifier
etc. In general, variable names like vA, vB and c are interpreted relative to CTXT.
b) SYNCHRONIZING LOAD
● In fig.b, the idling due to synchronizing loads is illustrated.
● In this case , A and B are computed by concurrent processes, and we are not sure
exactly when they will be ready for node N1 to read.
● The ready signals (Ready1 and Ready2) may reach node N1 asynchronously. This is a typical situation in the producer-consumer problem. Busy-waiting may result.
● The latency caused by remote loads is an architectural property. The latency caused
by synchronizing loads also depends on scheduling and the time it takes to compute A
and B. which may be much longer than the transit latency.
2. Distributed caching
● As described under question 4 above, distributed caching solves the remote-loads problem but not the synchronizing-loads problem; combined with multithreading, both types of remote access can be addressed.
8.MIT-J MACHINE
● The architecture and building block of the MIT-J machine, its instruction set, and
system design considerations are described below.
● The building block was the message-driven processor (MDP), a 36-bit microprocessor custom-designed for a fine-grain multicomputer.
J –Machine Architecture
● k-ary n-cube networks were applied in the MIT J-Machine.
● The initial prototype J-Machine used a 1024-node network (8 × 8 × 16), which was a reduced 16-ary 3-cube with 8 nodes along the x- and y-dimensions and 16 nodes along the z-dimension.
● A 4096-node J-Machine would use a full 16-ary 3-cube with 16 × 16 × 16 nodes. The J-Machine designers called their network a three-dimensional mesh.
Network addressing limited the size of the J-Machine to a maximum configuration of 65,536 nodes, corresponding to a three-dimensional mesh with 32 × 32 × 64 nodes. All hidden parts (nodes and links) are not shown, for clarity. Every node has a constant node degree of 6, and there are three rings crossing each node along the three dimensions. The end-around connections can be folded to balance the wire length on all channels.
The MDP Design
● The MDP chip included a processor, a 4096-word by 36-bit memory, and a built-in router with network ports.
● An on-chip memory controller with error checking and correction (ECC) capability permitted local memory to be expanded to 1 million words by adding external DRAM chips.
● The processor was message-driven in the sense that it executed functions in response to messages, via the dispatch mechanism. No receive instruction was needed.
● The MDP created a task to handle each arriving message. Messages carrying these tasks drove each computation.
● The MDP was a general-purpose multicomputer processing node that provided the communication, synchronization, and global naming mechanisms required to efficiently support fine-grain, concurrent programming models.
● The grain size was as small as 8-word objects or 20-instruction tasks.
● Fine-grain programs typically execute from 10 to 100 instructions between communication and synchronization actions.
● MDP chips provided inexpensive processing nodes built of plentiful VLSI commodity parts, used to construct the Jellybean Machine (J-Machine) multicomputer.
● The MDP appeared as a component with a memory port, six two-way network ports, and a diagnostic port.
● The memory port provided a direct interface to up to 1M words of ECC DRAM, consisting of 11 multiplexed address lines, a 12-bit data bus, and 3 control signals.
9.LATENCY HIDING TECHNIQUES
Four latency hiding mechanisms are used to increase scalability and programmability.
● The coherence problem is easily solved for small bus-based multiprocessors through
the use of snoopy cache coherence protocol.
● The cache coherence problem is much more complicated for large-scale multiprocessors (e.g., BBN Butterfly, IBM RP3, Stanford Dash) that use general interconnection networks:
o BBN Butterfly – did not use caches
o IBM RP3 – maintained cache coherence by software
o Stanford Dash – maintained cache coherence with hardware support
Benefits of caching:
● Coherent caches help to hide latency by increasing the cache hit ratio for read operations; when a cache miss occurs, cycles are wasted. The benefit of using distributed caches is to reduce cache misses.
10. ETL/EM-4 in Japan
● The internal design of the processor chip and of the node memory are
shown in Fig.
● The processor chip communicated with the network through a 3 X 3
crossbar switch unit.
● The processor and its memory were interfaced with a memory control
unit. The memory was used to hold programs as well as tokens (operand
segments, heaps, or frames) waiting to be fetched.
● The processor consisted of six component units.
● Input buffer was used as a token store with a capacity of 32 words.
● The fetch-match unit fetched tokens from the memory and performed tag-
matching operations among the tokens fetched in.
● Instructions were directly fetched from the memory through the memory
controller.
● The heart of the processor was the execution unit which fetched
instructions until the end of a thread.
● Instructions with matching tokens were executed.
● Instructions could emit tokens or write to registers.
● Instructions were fetched continually using traditional sequencing until a
“stop” flag was raised to indicate the end of a thread
● Then another pair of tokens was accepted.
● Each instruction in a thread specified the two sources for the next
instruction in the thread.
d. Release consistency (RC)
• It requires that synchronization accesses in the program be identified and classified as either acquires (e.g., locks) or releases (e.g., unlocks).
Disadvantages (RC):
o Increased hardware complexity and a more complex programming model.
Processor consistency (PC)
• Writes issued by each individual processor are always seen in program order.
• The order of writes from two different processors can be out of program order.
Conditions for processor consistency:
✓ Before a read is allowed to perform with respect to any other processor, all previous
read accesses must be performed.
✓ Before a write is allowed to perform with respect to any other processor, all previous
read or write accesses must be performed.