CSA - PREVIOUS UNIVERSITY QUESTIONS
Module – 1
No Questions Year Marks
1 With a neat sketch, explain the architecture of a vector Dec-2018 4
supercomputer. Sept-2020
2 Explain implicit and explicit parallelism in parallel Dec-2018 4
programming Sept-2020
3 Explain Flynn’s classification of computer architecture Dec-2018 4
4 Explain SIMD machine model. May-2019 3
5 Explain the role of compilers in exploiting parallelism Dec-2018 3
6 With proper equations, explain the terms (i) CPU Time (ii) Throughput Rate May-2019 4
7 … Parallelism? Dec-2019 3
8 ● Discuss the Bernstein's conditions for checking the parallelism among a set of processes. Sept-2020 3
● State and explain Bernstein's conditions for parallelism?
9 ● State Amdahl's law. Write an expression for the overall speedup. Dec-2019 4
● Define Amdahl's Law. Sept-2020 3
PROBLEMS
1 Consider the execution of the following code segment consisting of seven statements. Use Bernstein's conditions to detect the maximum parallelism embedded in this code. Justify the portions that can be executed in parallel and the remaining portions that must be executed sequentially. Rewrite the code using parallel constructs such as Cobegin and Coend. No variable substitution is allowed. All statements can be executed in parallel if they are declared within the same block of a (Cobegin, Coend) pair. May-2019 5
S1: A=B+C
S2: C=D+E
S3: F=G+E
S4: C=A+F
S5: M=G+C
S6: A=L+E
S7: A=E+A
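The following is an illustrative sketch (not part of the original question): the pairwise Bernstein checks can be automated in Python, with the input set I(k) and output set O(k) of each statement read directly from S1–S7.

# Hedged sketch: pairwise Bernstein's-condition check for S1..S7.
# For each statement k, I(k) = read set, O(k) = write set.
stmts = {
    "S1": ({"B", "C"}, {"A"}),   # A = B + C
    "S2": ({"D", "E"}, {"C"}),   # C = D + E
    "S3": ({"G", "E"}, {"F"}),   # F = G + E
    "S4": ({"A", "F"}, {"C"}),   # C = A + F
    "S5": ({"G", "C"}, {"M"}),   # M = G + C
    "S6": ({"L", "E"}, {"A"}),   # A = L + E
    "S7": ({"E", "A"}, {"A"}),   # A = E + A
}

def parallel(a, b):
    """Si || Sj iff I(i)∩O(j) = O(i)∩I(j) = O(i)∩O(j) = empty."""
    (i1, o1), (i2, o2) = stmts[a], stmts[b]
    return not (i1 & o2) and not (o1 & i2) and not (o1 & o2)

names = list(stmts)
for x in range(len(names)):
    for y in range(x + 1, len(names)):
        if parallel(names[x], names[y]):
            print(names[x], "||", names[y])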
2 A 400MHz processor was used to execute a program with Dec-2019 4
150000 floating point instructions with clock cycle count of 1.
Determine the execution time and MIPS rate for this program.
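A quick sketch of the arithmetic (assuming, as stated, a clock cycle count of 1 per instruction):

# Hedged worked example for Problem 2.
f = 400e6          # clock rate: 400 MHz
Ic = 150_000       # floating-point instruction count
CPI = 1            # cycles per instruction (given)

T = Ic * CPI / f             # execution time in seconds
mips = f / (CPI * 1e6)       # MIPS rate
print(f"T = {T * 1e6:.1f} microseconds, MIPS = {mips:.0f}")
# -> T = 375.0 microseconds, MIPS = 400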
3 Analyse the data dependences among the following statements and construct a dependency graph. Also detect the parallelism embedded in them. Dec-2019
S1 : Load R1 , M(100) / R1 ← Memory(100) /
S2 : Move R2 , R1 / R2 ← (R1 ) /
S3 : Inc R1 / R1 ← (R1 ) + 1 /
S4 : Add R2 , R1 / R2 ← (R2 ) + (R1) /
S5 : Store M(100), R1 / Memory(100) ← (R1) /
4 Consider the execution of an object code with 200,000 Sept-2020 6
instructions on a 40-MHz processor. The program consists of
four major types of instructions. The instruction mix and the
number of cycles (CPI) needed for each instruction type are
given below based on the result of a program trace experiment.
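The instruction-mix table referred to above did not survive in these notes, so the numbers below are hypothetical placeholders; the sketch only shows the method (weighted-average CPI, then execution time and MIPS rate).

# Hedged sketch for Problem 4. The mix fractions and CPI values are
# HYPOTHETICAL (the original table is missing); only the method matters.
f = 40e6                    # 40 MHz clock
Ic = 200_000                # total instruction count
mix = {                     # type: (fraction of Ic, CPI)
    "arithmetic":  (0.60, 1),
    "load/store":  (0.18, 2),
    "branch":      (0.12, 4),
    "memory ref":  (0.10, 8),
}
avg_cpi = sum(frac * cpi for frac, cpi in mix.values())
T = Ic * avg_cpi / f                 # execution time in seconds
mips = f / (avg_cpi * 1e6)           # effective MIPS rate
print(f"average CPI = {avg_cpi:.2f}, T = {T * 1e3:.2f} ms, MIPS = {mips:.1f}")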
ANSWERS
● The same data stream flows through a linear array of processors executing
different instruction streams.
● This architecture is also known as a systolic array, used for pipelined execution of specific algorithms.
4. SIMD Model
● Computers with Multiple Processing elements
● They perform same operation on multiple data points simultaneously
● An operational model of an SIMD computer is specified by a 5-tuple M = (N, C, I, M, R), where N is the number of processing elements (PEs), C is the set of instructions executed directly by the control unit, I is the set of instructions broadcast to the PEs for parallel execution, M is the set of masking schemes, and R is the set of data-routing functions.
5. ROLE OF COMPILERS
Throughput Rate
● System throughput WS
– how many programs a system can execute per unit time
● CPU throughput Wp
– a measure of how many programs can be executed per second
– in a multiprogrammed system, the system throughput is often lower than the CPU throughput.
7. UMA, NUMA, COMA Models
● Multiprocessors are called tightly coupled systems due to the high degree of resource
sharing.
● The system interconnect takes the form of a common bus, a crossbar switch, or a
multistage network
● The UMA model is suitable for general-purpose and time-sharing applications by multiple users.
● It can be used to speed up the execution of a single large program in time-critical applications.
● To coordinate parallel events, synchronization and communication among processors are done using shared variables in the common memory.
● Symmetric Multiprocessor
– When all processors have equal access to all peripheral devices, and
– all the processors are equally capable of running the executive programs, such as the OS kernel and I/O service routines.
b) NUMA(Non Uniform Memory Access)
● It is a shared-memory system in which the access time varies with the location of the memory word.
● Two NUMA machine models are
– Shared local memories model
– A hierarchical cluster model
9. Amdahl’s law
“Amdahl’s law states that the performance improvement to be gained from using some faster
mode of execution is limited by the fraction of the time the faster mode can be used.”
● It is a formula which gives the theoretical speedup in latency of the execution of a
task at a fixed workload that can be expected of a system whose resources are
improved.
● In other words, it is a formula used to find the maximum improvement possible by
just improving a particular part of a system.
● It is often used in parallel computing to predict the theoretical speedup when using
multiple processors.
● Amdahl's law uses two factors to find the speedup from some enhancement: the fraction of time the enhancement can be used (Fraction_enhanced) and the speedup of the enhanced portion (Speedup_enhanced):
Overall speedup = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
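A minimal sketch of the law as a function (the example values are illustrative, not from the original notes):

# Hedged sketch: Amdahl's-law overall speedup.
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# Example: 90% of a task sped up by a factor of 8 (e.g., 8 processors).
print(round(amdahl_speedup(0.9, 8), 2))   # -> 4.71, far below 8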
CSA - PREVIOUS UNIVERSITY QUESTIONS
Module – 2
No Questions Year Marks
1 Compare the characteristics of CISC and RISC architectures. Dec-2018 4
Compare RISC and CISC scalar processor architectures. Sept-2020 4
Differentiate between the following with necessary diagrams: RISC and CISC. Dec-2019 4
2 Explain the terms (i) Hit Ratio (ii) Effective Access Time Dec-2018 3
with proper equations.
3 Explain VLIW architecture. Also explain pipelining in Dec-2018 6
VLIW processors
4 Explain the property of locality of reference in memory. May-2019 4
5 Explain memory hierarchy. May-2019 3
6 Explain Superscalar architecture. Also explain pipelining in May-2019 6
superscalar processors.
7 Define the inclusion property of a memory hierarchy. Illustrate Dec-2019 5
the data transfers between adjacent levels of a memory
hierarchy.
8 Distinguish between scalar RISC and super-scalar RISC in terms of instruction issue, pipeline architecture and performance. Dec-2019 4
9 Define a) Instruction issue latency b) Instruction issue rate. Sept-2020 4
PROBLEMS
3. VLIW Architecture
● As shown above, multiple functional units are used concurrently in a VLIW processor.
● All functional units share a common large register file.
● The operations to be simultaneously executed by the functional units are synchronized in a VLIW instruction, say 256 or 1024 bits per instruction word.
● Different fields of the long instruction word carry the opcodes to be dispatched to
different functional units.
● Programs written in short instruction words (32 bits) must be compacted together to
form the VLIW instructions – the code compaction must be done by compiler.
Pipelining:
● The execution of instructions by an ideal VLIW processor is shown below: each
instruction specifies multiple operations. The effective CPI becomes 0.33 in this example
VLIW machines behave much like superscalar machines with three differences:
4. Locality of Reference:
1. Temporal locality –
✓ Recently referenced items are likely to be referenced again in near future. ( ex:
loop portion in a code is executed continuously until loop terminates)
✓ leads to usage of the least-recently-used (LRU) replacement algorithm (see the sketch after this list)
✓ helps to determine the size of memory at successive levels
2. Spatial locality –
✓ Tendency for a process to access items whose addresses are near one another. (ex: array elements, macros, routines, etc.)
✓ Assists in determining the size of the unit of data transfer between adjacent memory levels
3. Sequential locality –
✓ Execution of instructions follows a sequential order unless branched
instruction occurs
✓ Affects determination of grain size for optimal scheduling
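As an illustrative aside (not from the original notes), temporal locality is exactly what an LRU replacement policy exploits; a toy software LRU cache:

# Hedged sketch: a toy LRU cache showing how temporal locality
# (recently used items are reused soon) motivates LRU replacement.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def access(self, key) -> str:
        if key in self.data:
            self.data.move_to_end(key)     # key becomes most recently used
            return f"hit  {key}"
        if len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # evict the least recently used
        self.data[key] = True
        return f"miss {key}"

cache = LRUCache(2)
for k in ["A", "B", "A", "C", "A"]:        # loop-like reuse of A
    print(cache.access(k))                 # A misses once, then keeps hitting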
5. Memory Hierarchy:
Characterized by 5 parameters::
1. Access time (ti) – the round-trip time from the CPU to the ith-level memory.
2. Memory size (si) – the number of bytes or words in level i.
3. Cost per byte (ci) – the cost of the ith-level memory, estimated by ci·si.
4. Transfer bandwidth (bi) – the rate at which information is transferred between adjacent levels.
5. Unit of transfer (xi) – the grain size of data transfer between levels i and i+1.
a) Registers
• Registers are part of processor – Thus, at times not considered a part of memory
• Register assignment made by compiler
• Register operations directly controlled by compiler – thus register transfers take place at
processor speed
b) Cache
● Controlled by the MMU
● Can be implemented at one or multiple levels.
● Information transfer between CPU and cache is in terms of words (4 or 8 bytes, depending on the word length of the machine)
● Cache is divided into cache blocks (typically 32 bytes)
● Blocks are the unit of data transfer between cache and main memory, or between L1 and L2 cache
c) Main Memory (primary memory)
● Larger than cache
● Implemented by cost-effective RAM chips (e.g., DDR SDRAMs)
● Managed by the MMU in cooperation with the OS
● Divided into pages (e.g., 4 Kbytes); each page contains 128 blocks
● Pages are the unit of information transferred between disk and main memory.
● Scattered pages are organized as segments
d) Disk Drives and Backup Storage
• Highest level of the memory hierarchy
• Holds system programs like the OS and compilers, user programs, data, etc.
6. Superscalar Architecture:
● The fig shows a three-issue (m=3) superscalar pipeline, m instructions execute in parallel.
7. Inclusion Property:
Information stored in a memory hierarchy (M1, M2, …, Mn) should satisfy the inclusion property: M1 ⊂ M2 ⊂ … ⊂ Mn.
CSA - PREVIOUS UNIVERSITY QUESTIONS
Module – 3
No Questions Year Marks
9 Explain chained cache coherence protocol. Dec-2019 4
10 Explain the different levels of hierarchy of bus systems. Dec-2019 4
11 ● Define the write-invalidate snoopy bus protocol for Dec-2019 5
maintaining cache coherence. Show the different
possible state transitions for write-through and write-
back cache using the write-invalidate protocol.
● Draw the two state transition graphs for a cache block using the write-invalidate write-through snoopy bus protocol. Sept-2020 3
12 Explain the three major operational characteristics of a Dec-2019 3
multiprocessor interconnection network.
13 How can hardware synchronization be achieved in a multiprocessor system? Sept-2020 4
PROBLEMS
1 Design an 8 input omega network using 2X2 switches as Sept-2020 6
building blocks. Show the switch settings for the permutation
π1=(0,6,3,2,5)(1,4). Show the conflicts in switch settings, if
any. Explain blocking and non-blocking networks in this
context.
2 Design an 8 input omega network using 2X2 switches as Dec-2018 6
building blocks. Show the switch settings for the
permutation π1=(0,6,4,7,3)(1,5)(2). Show the conflicts in
switch settings, if any. Explain blocking and non-blocking
networks in this context.
3 Design an 8 input omega network using 2X2 switches as May-2019 6
building blocks. Show the switch settings for the permutations
π1=(0,7,6,4,2)(1,3)(5). Show the conflicts in switch settings, if
any. Explain blocking and non-blocking networks in this
context.
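A sketch of how such problems can be checked mechanically, assuming the textbook omega-network model (log2(8) = 3 stages of 2×2 switches, a perfect shuffle before each stage, destination-tag switch settings). Two routes that need the same output link of the same stage constitute a conflict, i.e., blocking.

# Hedged sketch: destination-tag routing in an 8-input omega network.
n, k = 8, 3

def route(src: int, dst: int):
    """Return the output link occupied at each of the k stages."""
    links, p = [], src
    for i in range(k):
        p = ((p << 1) | (p >> (k - 1))) & (n - 1)   # perfect shuffle
        p = (p & ~1) | ((dst >> (k - 1 - i)) & 1)   # switch sets low bit
        links.append(p)
    return links

# pi1 = (0,6,3,2,5)(1,4) from the Sept-2020 problem; unlisted inputs (7)
# map to themselves (an assumption of standard cycle notation).
perm = {0: 6, 6: 3, 3: 2, 2: 5, 5: 0, 1: 4, 4: 1, 7: 7}

used = {}
for s, d in perm.items():
    for stage, link in enumerate(route(s, d)):
        if (stage, link) in used:
            print(f"conflict at stage {stage}: {used[stage, link]} vs {s}->{d}")
        used[stage, link] = f"{s}->{d}"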
ANSWERS
● Crossbar networks connect every input to every output through a cross point switch.
● A crossbar network is a single stage, non-blocking permutation network (one-to-one).
● In an n-processor, m-memory system, n*m cross point switches will be required.
● Each cross point is a unary switch which can be open or closed, providing a point-to-
point connection path between the processor and a memory module.
● All processors can send memory requests independently and asynchronously.
● This poses the problem of multiple requests destined for the same memory module at
the same time. In such cases, only one of the requests is serviced at a time.
Multiport Memory :
● Since crossbar switches are expensive, and not suitable for systems with many
processors or memory modules, multiport memory modules may be used instead.
● A multiport memory module has multiple connection points for processors (or I/O devices), and the memory controller in the module handles the arbitration and switching that would otherwise be done by a crosspoint switch.
● Initially, before migration, P1 has the value x in its cache, and x is also in shared memory.
● Now suppose x migrates to P2, and P2 reads x and updates it to x'. Under a write-through policy the new value is updated simultaneously in shared memory, but P1's cache still holds x, so an inconsistency exists.
● In the write-back case, if P1 reads the value x and modifies it to x', the change is not propagated to shared memory, which P2 reads while executing its process. So the updated value is not used by P2 to perform its task (P1 to P2).
Inconsistency due to IO:
● Data movement from an I/O device to shared memory usually does not cause the cached copies of data to be updated.
● As a result, an input operation that writes x will write data to the shared primary memory, causing it to become inconsistent with a cached value of x.
● Likewise, writing data to an I/O device usually uses the data in the shared primary memory, ignoring any potentially cached data with different values.
● When an I/O processor loads new data X' into the main memory (middle figure above), inconsistency occurs between the caches and shared memory.
● Outputting data directly from the shared memory (bypassing the cache) also creates inconsistency.
● Possible solution is to attach I/O processors to private caches (C1 and C2) as shown in
fig.
● This way I/O processors share caches with the CPU. The I/O consistency can be
maintained if cache-to-cache consistency is maintained via the bus.
1. Valid:
The cache block, which is consistent with the memory copy, has been read from shared
memory and has not been modified.
2. Invalid:
The block is not found in the cache or is inconsistent with the memory copy.
3. Reserved: Data has been written exactly once since being read from shared memory.
The cache copy is consistent with the memory copy, which is the only other copy.
4. Dirty: The cache block has been modified (written) more than once, and the cache
copy is the only one in the system (thus inconsistent with all other copies).
Read-Miss:
Write-Hit:
Write-miss:
Read-hit:
Block Replacement
● In the figure above, in the first state, location X is missing in all caches in the system.
● The second state results from three caches (C1, C2 and C3) requesting copies of location X. Three pointers (processor bits) are set in the entry to indicate the caches that have copies of the block of data.
● In the first two states, the dirty bit on the left side of the directory entry is set to clean (C), indicating that no processor has permission to write to the block of data.
● The third state results from cache C3 requesting write permission for the block. In the final state the dirty bit is set to dirty (D) and there is a single pointer to the block of data in cache C3.
● The memory consumed by the directory is proportional to the size of memory, O(N), multiplied by the size of each directory entry, O(N); thus the total memory overhead scales as the square of the number of processors, O(N²).
Limited Directories
● The figure above shows the situation when three caches request read copies in a memory system with a Dir2 NB protocol (only two copies are allowed). The two-pointer directory can be viewed as a two-way set-associative cache.
● When cache C3 requests a copy of location X, the memory module must invalidate the copy in either the C1 or the C2 cache. This process of pointer replacement is called eviction.
5. Write Invalidate and Write Update
Write Invalidate policy :
● Invalidates all remote copies when a local cache block is updated.
● In fig below processor P1 modifies(write) its cache from X to X’ and all other copies are
invalidated via the bus.
● Invalidated blocks are marked dirty, meaning they should not be used.
Write Update policy:
● Broadcasts the new data block to all caches containing a copy of the block.
● In the figure below, the newly updated block content X' is broadcast to all cache copies via the bus.
● The memory copy is also updated if write-through caches are used; if write-back caches are used, the memory copy is updated later when the block is replaced.
● One of the following operations will be performed if processor P1 executes Ans1 ← Fetch&Add(x, e1) and P2 executes Ans2 ← Fetch&Add(x, e2) simultaneously on the shared variable x.
● If the request from P1 is executed before that of P2, the following values are returned: Ans1 ← x and Ans2 ← x + e1, and x becomes x + e1 + e2. If P2 executes first, the roles are reversed.
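A sketch of these semantics in software (a lock stands in for the hardware atomicity; whichever thread the "hardware" serializes first decides who sees the old value):

# Hedged sketch: Fetch&Add semantics on a shared variable x.
import threading

x = 10
ans = {}
lock = threading.Lock()

def fetch_and_add(name: str, e: int):
    global x
    with lock:            # atomic read-modify-write
        ans[name] = x     # the OLD value is returned
        x += e

t1 = threading.Thread(target=fetch_and_add, args=("Ans1", 3))
t2 = threading.Thread(target=fetch_and_add, args=("Ans2", 5))
t1.start(); t2.start(); t1.join(); t2.join()

# Whichever order is serialized, x ends as 10 + 3 + 5 = 18; one processor
# sees 10 and the other sees 10 plus the first one's increment.
print(ans, "x =", x)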
7. Write- invalidate snoopy protocol using write back policy:
Write-Back Caches:
1. The valid state of a write-back cache can be further split into two cache states:
• RW (read-write / exclusive) and
• RO (read-only / shared), as shown in the figure below.
2. The INV (invalidated or not-in-cache) state is equivalent to the invalid state mentioned before.
● When the memory owns a block, caches can contain only RO (shared) copies of the block. In other words, multiple copies may exist in the RO state, and every processor having a copy (called a keeper of the copy) can read it safely (R(i), R(j)).
● The INV state is entered whenever a remote processor writes (W(j)) its local copy or the local processor replaces (Z(i)) its own block copy.
● The RW (exclusive) state corresponds to only one cache copy existing in the entire system, owned by the local processor i. Reads (R(i)) and writes (W(i)) can be safely performed in the RW state.
● From either the RO state or the INV state, the cache block becomes uniquely owned when a local write (W(i)) takes place.
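A compact sketch of the three-state machine just described, using the R/W/Z event notation from the text ("local" = processor i, "remote" = any processor j); this is a plain reconstruction of the textbook diagram, not any particular vendor's protocol:

# Hedged sketch: write-invalidate, write-back snoopy protocol,
# one three-state machine per cache block.
TRANSITIONS = {
    ("RO",  "local_read"):    "RO",   # shared copies are read safely
    ("RO",  "local_write"):   "RW",   # W(i): invalidate remote copies
    ("RO",  "remote_write"):  "INV",  # W(j)
    ("RO",  "local_replace"): "INV",  # Z(i)
    ("RW",  "local_read"):    "RW",   # exclusive owner reads freely
    ("RW",  "local_write"):   "RW",   # ... and writes freely
    ("RW",  "remote_read"):   "RO",   # ownership downgraded to shared
    ("RW",  "remote_write"):  "INV",
    ("RW",  "local_replace"): "INV",
    ("INV", "local_read"):    "RO",   # read miss: fetch a shared copy
    ("INV", "local_write"):   "RW",   # write miss: fetch and own it
}

state = "INV"
for event in ["local_read", "remote_write", "local_write", "remote_read"]:
    state = TRANSITIONS[state, event]
    print(f"{event:14s} -> {state}")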
8.Generalized Multiprocessor System:
● Parallel processing demands the use of efficient system interconnects for fast
communication among multiple processors and shared memory, I/0, and peripheral
devices.
● The following are the commonly used system interconnects for this purpose:
• hierarchical buses,
• crossbar switches, and
• multistage networks.
● Networks are designed with many choices, based on topology, timing, switching, and control strategy.
● In a dynamic network, the interconnections among multiprocessors are established under program control.
1. Timing control can be either synchronous or asynchronous
✓ Synchronous – controlled by a global clock which synchronizes all network activity.
✓ Asynchronous – uses handshaking or interlock mechanisms for communication, especially suitable for coordinating devices with different speeds.
2.Switching Method
✓ Circuit switching – once a device is granted a path in the network, it occupies the
path for the entire duration of the data transfer.
✓ Packet switching – the information is broken into small packets individually
competing for a path in the network.
3.Network Control
● In the fig above singly linked chain, initially assume there are no shared copies of
location X.
● When processor P1 reads location X, the memory sends a copy to cache C1 along with a chain-termination (CT) pointer. The memory also keeps a pointer to cache C1.
● Subsequently, when processor P2 reads location X, the memory sends a copy to cache C2, along with the pointer to cache C1. The memory then keeps a pointer to cache C2.
● By repeating the above step, all of the caches can cache a copy of the location X.
● If processor P3 writes to location X, it is necessary to send a data invalidation
message down the chain.
● To ensure sequential consistency, the memory module denies processor P3 write
permission until the processor with the chain termination pointer acknowledges the
invalidation of the chain.
● Perhaps this scheme should be called a gossip protocol (as opposed to a snoopy
protocol) because information is passed from individual to individual rather than
being spread by covert observation.
10. Hierarchy of Bus Systems:
A bus system consists of a hierarchy of buses connecting various systems and components in
a computer. Different buses are used to perform different interconnections
• There are numerous ways in which buses, processors, memories, and I/O devices can be
organized.
• One organization has processors (and their caches) as leaf nodes in a tree, with the buses
(and caches) to which these processors connect forming the interior nodes.
• This generic organization, with appropriate protocols to ensure cache coherency, can model
most hierarchical bus organizations.
● Divided into different clusters each connected through a cluster bus.
● Inter-cluster bus used to provide connection among clusters.
● Second-level caches (C20-C22) are used between each cluster and the inter-cluster bus. Each second-level cache has a higher capacity than the sum of the capacities of all first-level caches connected beneath it.
● Main memory modules are connected to the Inter-cluster bus.
● Most memory request should be satisfied at the first level caches.
11. Write Invalidate Snoopy bus protocol-Write through and Write back
Write Invalidate policy –
● Invalidates all remote copies when a local cache block is updated. In the figure below, processor P1 modifies (writes) its cache from X to X', and all other copies are invalidated via the bus.
● Invalidated blocks are marked dirty, meaning they should not be used.
The state transitions are triggered by read, write, and replacement operations. A block copy of a write-through cache i attached to processor i can assume one of two possible cache states:
o valid
o invalid
Write-Back Caches:
1. The valid state of a write-back cache can be further split into two cache states:
• RW (read-write / exclusive) and
• RO (read-only / shared), as shown in the figure below.
2. The INV (invalidated or not-in-cache) state is equivalent to the invalid state mentioned before. This three-state coherence scheme corresponds to an ownership protocol.
Module – 4
No Questions Year Marks
1 ● Differentiate between store and forward and wormhole routing. Dec-2018 3
● Explain various message routing schemes used in message passing multi-computers. May-2019 5
● Explain different message routing schemes. Sept-2020 4
ANSWERS:
a) Store-and-forward routing (packets are the smallest unit of information transmission)
b) Wormhole routing (packets are further subdivided into flits)
1. Speedup
● Ideally, a linear pipeline of k stages can process n tasks in k + (n − 1) clock cycles: k cycles are needed to complete the execution of the very first task, and the remaining (n − 1) tasks require one cycle each.
● The speedup over an equivalent non-pipelined processor is Sk = n·k / (k + n − 1).
2. Efficiency:
● The efficiency of a linear k-stage pipeline is Ek = Sk / k = n / (k + n − 1).
3. Throughput:
● Defined as the number of tasks (operations) performed per unit time: Hk = n / [(k + n − 1)·τ], where τ is the clock period.
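A sketch tying the three formulas together (τ = 1/f is the clock period; the values below are illustrative):

# Hedged sketch: linear-pipeline metrics for k stages and n tasks.
def pipeline_metrics(k: int, n: int, f_hz: float):
    cycles = k + (n - 1)              # total clock cycles
    speedup = (n * k) / cycles        # Sk = n*k / (k + n - 1)
    efficiency = n / cycles           # Ek = Sk / k
    throughput = n * f_hz / cycles    # Hk = tasks per second
    return cycles, speedup, efficiency, throughput

cycles, s, e, h = pipeline_metrics(k=4, n=100, f_hz=500e6)
print(cycles, round(s, 2), round(e, 2), f"{h:.3g} tasks/s")
# -> 103 cycles, speedup 3.88, efficiency 0.97, ~4.85e8 tasks/s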
● Packet 1 is allocated the channel; packet 2 is temporarily stored in a packet buffer and will be transmitted when the channel becomes available (i.e., this combines the store-and-forward and wormhole routing schemes).
● The packet buffer may cause significant storage delay.
Adv: Does not waste the resources already allocated.
Disadv: Requires the use of a large buffer to hold the entire packet.
● Wormhole routing uses a blocking policy in case of packet collision: the second packet is blocked but not abandoned.
6 A static pipeline is specified by a single reservation table; for a dynamic pipeline, multiple reservation tables can be generated for the evaluation of different functions.
Module – 5
No Questions Year Marks
1 ● What are the possible hazards that can occur between Dec-2019 6
read and write operations in an instruction pipeline?
● Which are the three logic hazards possible in an Dec-2018 4
instruction pipeline? Define each. Write the necessary
conditions for each to occur.
PROBLEM
1 Consider the execution of a program of 15,00,000 Dec-2019 4
instructions by a linear pipeline processor with a clock rate
of 1000 MHz. Assume that the instruction pipeline has five
stages and that one instruction is issued per cycle. The
penalties due to branch instruction and out-of-order
execution are ignored.
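A sketch of the arithmetic, reusing the linear-pipeline formulas from Module 4 (the non-pipelined reference assumes k cycles per instruction):

# Hedged worked example: k = 5 stages, n = 1,500,000 instructions, 1000 MHz.
k, n, f = 5, 1_500_000, 1000e6

cycles = k + (n - 1)            # pipelined cycle count
T_pipe = cycles / f             # ~1.5 ms
T_seq = n * k / f               # non-pipelined reference time
speedup = T_seq / T_pipe        # approaches k = 5 for large n

print(f"T = {T_pipe * 1e3:.4f} ms, speedup = {speedup:.4f}")
# -> T = 1.5000 ms, speedup = 5.0000 (approximately)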
ANSWERS
1. RAW HAZARD
R(I) ∩ D(J) ≠ ∅ for a RAW hazard – flow dependence
Ex: I: a = b + c
J: e = a + d
2. WAW HAZARD
R(I) ∩ R(J) ≠ ∅ for a WAW hazard – output dependence
Ex: I: a = b + c
J: a = d + e
3. WAR HAZARD
D(I) ∩ R(J) ≠ ∅ for a WAR hazard – anti-dependence
Ex: I: a = b + c
J: b = a + d
2. IN-ORDER ISSUE and OUT-OF-ORDER ISSUE
Three scheduling policies are introduced below.
1. In order issue & In order completion : Instructions are issued in program order and
instructions are completed in program order.
2. In order issue & Out of order completion : Instructions are issued in program
order and instructions are completed not in program order.
3. Out of order issue & Out of order completion : Program order is violated in
instruction issue and completion
i) In-Order Issue
● Figure 6.30a shows a schedule for the six instructions being issued in program order I1, I2, …, I6.
● Pipeline 1 receives I1, I3, and I5, and pipeline 2 receives I2, I4, and I6 in three consecutive cycles.
● Due to the dependence I1 → I2, I2 has to wait one cycle to use the data loaded by I1.
● I3 is delayed one cycle for the same adder used by I2. I6 has to wait for the result of I5 before it can enter the multiplier stages.
● To maintain in-order completion, I5 is forced to wait two cycles before coming out of pipeline 1.
● In total, nine cycles are needed and five idle cycles (shaded boxes) are observed.
Out-of-Order Issue
● Branches are predicted statically based on branch code types, or dynamically based on branch history during program execution.
● The frequency and probabilities of branch taken and branch types across a large number of program traces are used to predict a branch.
● Such a static branch strategy may not be very accurate.
● The static prediction direction (taken or not-taken) can even be wired into the processor.
● The wired-in static prediction cannot be changed once committed to the hardware.
Dynamic branch prediction:
● Uses recent branch history to predict whether or not the branch will be taken the next time it occurs.
● To be accurate, the entire branch history would have to be used to predict the future choice, but this is not practical; a limited recent history is used instead.
● Requires additional hardware to keep track of the past behaviour of the branch instructions at run time.
4. INTERNAL DATA FORWARDING
Memory access operations can be replaced by register transfer operations to save time
5. CARRY-SAVE ADDER (CSA)
● In general, an n-bit CSA is specified as follows. Let X, Y, and Z be three n-bit input numbers, expressed as X = (Xn−1, Xn−2, …, X1, X0), and so on. The CSA performs bitwise operations simultaneously on all columns of digits to produce two n-bit output numbers: a bitwise sum S^b = (0, Sn−1, Sn−2, …, S1, S0) and a carry vector C = (Cn, Cn−1, …, C1, 0).
● Note that the leading bit of the bitwise sum S^b is always 0 and the tail bit of the carry vector C is always 0. The input/output relationships are Si = Xi ⊕ Yi ⊕ Zi and Ci+1 = XiYi + YiZi + ZiXi.
● The arithmetic sum of the three input numbers, S = X + Y + Z, is obtained by adding the two output numbers, S = S^b + C, using a carry-propagate adder (CPA).
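A bitwise sketch of the relationships just stated (the 8-bit width and test values are arbitrary):

# Hedged sketch: n-bit carry-save adder. X + Y + Z is reduced to a
# bitwise sum S^b and a carry vector C, then combined with one CPA add.
def carry_save_add(x: int, y: int, z: int, n: int = 8):
    s, c = 0, 0
    for i in range(n):
        xi, yi, zi = (x >> i) & 1, (y >> i) & 1, (z >> i) & 1
        s |= (xi ^ yi ^ zi) << i                              # Si
        c |= ((xi & yi) | (yi & zi) | (zi & xi)) << (i + 1)   # Ci+1
    return s, c

X, Y, Z = 23, 45, 67
S_b, C = carry_save_add(X, Y, Z)
assert S_b + C == X + Y + Z      # the CPA step: S = S^b + C
print(S_b, C, S_b + C)           # -> ... 135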
6. EFFECT OF BRANCHING:
● When a branch is taken, all instructions following the branch in the pipeline become useless and will be drained from the pipeline. Thus a taken branch causes the pipeline to be flushed, losing a number of pipeline stages.
BRANCH PREDICTION:
• Branches are predicted statically based on branch code types, or dynamically based on branch history during program execution.
• The frequency and probabilities of branch taken and branch types across a large number of program traces are used to predict a branch.
• The static prediction direction (taken or not-taken) can even be wired into the processor.
• The wired-in static prediction cannot be changed once committed to the hardware.
Dynamic branch prediction:
● Uses recent branch history to predict whether or not the branch will be taken the next time it occurs.
● To be accurate, the entire branch history would have to be used to predict the future choice, but this is not practical; a limited recent history is used instead.
● Requires additional hardware to keep track of the past behaviour of the branch instructions at run time.
● Branch Target Buffers (BTBs) are used to implement branch prediction. A BTB holds recent branch information, including the address of the branch target used; the address of the branch instruction locates its entry in the BTB.
● The state transition diagram is used for backtracking the last two branches in a given program. The BTB entry contains backtracking information which guides the prediction.
● The BTB can store not only the branch target address but also the target instruction itself and a few successor instructions, in order to allow zero delay in converting conditional branches to unconditional branches.
7. CDC SCOREBOARDING:
● The figure above shows a CDC 6600-like processor that uses dynamic instruction-scheduling hardware.
● Here, multiple functional units appear as multiple execution pipelines.
● The parallel units allow instructions to complete out of the original program order.
● The processor had instruction buffers for each execution unit.
● Instructions are issued to available functional units regardless of whether register input data are available.
● To control the correct routing of data between execution units and registers, the CDC 6600 used a centralized control unit known as the scoreboard.
● The scoreboard kept track of the registers needed by instructions waiting for the various functional units.
● When all registers have valid data, the scoreboard enables the instruction execution.
● When a functional unit finishes, it signals the scoreboard to release the resources.
● The scoreboard is thus centralized control logic which keeps track of the status of the registers and of the multiple functional units.
a) Superpipelined:
● A superpipelined processor of degree n issues one instruction per pipeline cycle, where the pipeline cycle is 1/n of the base cycle.
Execution time: T(1, n) = k + (N − 1)/n base cycles.
b) Superpipelined superscalar:
● A superpipelined superscalar processor of degree (m, n) issues m instructions every pipeline cycle, with a pipeline cycle 1/n of the base cycle.
Execution time: T(m, n) = k + (N − m)/(m·n) base cycles.
● Tomasulo's algorithm was designed to work with processors having only a few floating-point registers. In the case of the Model 91, only four such registers were available.
● Fig. a below shows a minimum-register machine code for computing X = Y + Z and A = B × C. The pipeline timing with Tomasulo's algorithm appears in Fig. b.
● Here, the total execution time is 13 cycles, counting from cycle 4 to cycle 15 by
ignoring the pipeline startup and draining times.
10. MECHANISMS FOR INSTRUCTION PIPELINING
1. Pre-fetch buffers
2. Multiple functional units
3. Internal data forwarding
4. Hazard avoidance
1. Pre-fetch Buffers
● Three types of buffers are used to match the instruction fetch rate to the pipeline consumption rate:
1. Sequential buffer
2. Target buffer
3. Loop buffer
● In a single memory access time, a block of consecutive instructions is fetched into a pre-fetch buffer.
a) Sequential buffer
● Sequential instructions are loaded into a pair of sequential buffers for in-sequence pipelining.
b) Target buffer
● Instructions from a branch target are loaded into a pair of target buffers for out-of-sequence pipelining.
● A conditional branch instruction causes both the sequential buffers and the target buffers to fill with instructions.
● After the branch condition is checked, the appropriate instructions are taken from one of the two buffer pairs, and the instructions in the other buffer pair are discarded.
● Within each pair, one buffer may be used to load instructions from memory while the other feeds instructions into the pipeline; this prevents collisions between instructions flowing in and out of the pipeline.
c) Loop buffer
● Holds sequential instructions within a loop. It is maintained by the fetch stage.
● Pre-fetched instructions in the loop body are executed repeatedly until all iterations complete execution.
2. Internal Data Forwarding
Memory access operations can be replaced by register transfer operations to save time.
b) Load-Load Forwarding
● The second load operation (LD R2, M) can be eliminated and replaced with a move operation (MOVE R2, R1).
3. Multiple Functional Units
● A certain pipeline stage may become a bottleneck; this stage corresponds to the row with the maximum number of checkmarks in the reservation table.
● The problem can be solved by using multiple copies of the same stage simultaneously, i.e., multiple execution units in a pipelined processor design.
● A pipelined processor with multiple functional units and distributed reservation stations is supported by tagging.
● To resolve data or resource dependences among successive instructions entering the pipeline, reservation stations (RSs) are used with each functional unit.
● Operands wait in an RS until their data dependences are resolved.
● Each RS is uniquely identified by a tag, monitored by a tag unit.
● The tag unit keeps checking the tags from all currently used registers or RSs.
● RSs also serve as buffers to interface the pipelined functional units with the decode and issue units.
● The multiple functional units operate in parallel once the dependences are resolved.
4. Hazard avoidance
● The read and write of shared variables by different instructions in a pipeline
may lead to different results if these instructions are executed out-of-order.
1. RAW HAZARD
R(I) ∩ D(J) ≠ ∅ for a RAW hazard – flow dependence
Ex: I: a = b + c
J: e = a + d
2. WAW HAZARD
R(I) ∩ R(J) ≠ ∅ for a WAW hazard – output dependence
Ex: I: a = b + c
J: a = d + e
3. WAR HAZARD
D(I) ∩ R(J) ≠ ∅ for a WAR hazard – anti-dependence
Ex: I: a = b + c
J: b = a + d
CSA- PREVIOUS UNIVERSITY QUESTIONS
Module – 6
No Questions Year Marks
1 Distinguish between static dataflow computers and dynamic dataflow computers. Dec-2018, Sept-2020 4
2 What are the four context switching polices for Dec-2018 4
multithreaded architecture? Dec-2019 4
3 Write a short note on fine-grain parallelism. Dec-2018 3
4 Explain distributed caching. May-2019 4
5 ● Illustrate the scalable coherence interface (SCI) interconnect model. May-2019 4
● List any two advantages and disadvantages of the Scalable Coherence Interface (SCI). Dec-2019 4
6 With a neat diagram explain the architecture of a multiple May-2019 6
context processor model.
7 ● What are the problems of asynchrony and their May-2019 6
solutions in massively parallel processors?
● Suggest different methods to overcome asynchrony Sept-2020 4
problem.
8 With a neat diagram explain the MIT/Motorola *T prototype May-2019 6
multithreaded architecture
9 ● Explain various latency hiding techniques. Sept-2020 8
● Explain any two latency hiding techniques used in distributed shared memory multicomputers. Dec-2019 6
● Explain any three latency hiding techniques used in distributed shared memory multicomputers. Dec-2018 9
2) Switch on every load –
● This policy allows switching on every load, independent of whether it will cause a miss or not. In this case, R represents the average interval between loads.
● A general multithreading model assumes that a context is blocked for L cycles after every switch; but in the case of a switch-on-load processor, this happens only if the load causes a cache miss.
3) Switch on every instruction-
This policy allows switching on every instruction, independent of whether it is a load or not.
3.FINE-GRAIN PARALLELISM.
● We compare below the grain sizes, communication latencies, and concurrency in four
classes of parallel computers.
● This comparison leads to the rationales for developing fine-grain multicomputers.
Latency Analysis
● The computing granularity and communication latency of leading early examples of multiprocessors, data-parallel computers, and medium- and fine-grain multicomputers are summarized in the table. Four attributes are identified to characterize these machines; only typical values are shown.
● The communication latency Tc measures the data or message transfer time on a system interconnect.
● This corresponds to the shared-memory access time on the Cray Y-MP, the time required to send a 32-bit value across the hypercube network in the CM-2, and the network latency on the iPSC/1 or J-Machine.
● The synchronization overhead Ts is the processing time required on a processor, a PE, or a processing node of a multicomputer for the purpose of synchronization.
● The sum Tc + Ts gives the total time required for IPC.
● The shared-memory Cray Y-MP had a short Tc but a long Ts. The SIMD machine
CM-2 had a short Ts but a long Tc.
● The long latency of the iPSC/1 made it unattractive based on fast advancing
standards.
● The MIT J-Machine was designed to make a major improvement in both of these
communication delays.
● The grain size is measured by the execution time of a typical program, including both
computing time and communication time involved.
● Supercomputers handle large grain. Both the CM-2 and the J-Machine were designed
as fine-grain machines.
● The iPSC/1 was a relatively medium-grain machine compared with the rest.
4.DISTRIBUTED CACHING:
The concept of distributed caching is shown below:
● Every memory location has an owner node
● Eg: N1 owns B and N2 owns A.
● The directories are used to contain import-export lists and state whether the data is shared (for reads, many caches may hold copies) or exclusive (for writes, one cache holds the current value).
● The directories multiplex among a small number of contexts to cover the cache loading effects.
● Distributed caching offers a solution for the remote-loads problem, but not for the synchronizing-loads problem.
● Multithreading offers a solution for remote loads and possibly for synchronizing loads. The two approaches can be combined to solve both types of remote-access problem.
The two major problems of asynchrony are:
▪ REMOTE LOADS
▪ SYNCHRONIZING LOADS
a) REMOTE LOADS
● The key issue in a remote load is how to avoid idling in node N1 during the load operations.
● Variables A and B are located on nodes N2 and N3, respectively. They need to be brought to node N1 to compute the difference A − B into variable C.
● The basic computation demands the execution of two remote loads and then the
subtraction.
● Let pA and pB be the pointers to A and B respectively. The two loads can be issued
from the same thread or from two threads.
● The context(register state) of the computation on N1 is represented by the variable
CTXT.
● It can be a stack pointer, a frame pointer, a current object pointer , a process identifier
etc. In general, variable names like vA, vB and c are interpreted relative to CTXT.
b) SYNCHRONIZING LOAD
● In fig.b, the idling due to synchronizing loads is illustrated.
● In this case , A and B are computed by concurrent processes, and we are not sure
exactly when they will be ready for node N1 to read.
● The ready signals (Ready1 and Ready2) may reach node N1 asynchronously. This is a typical situation in the producer-consumer problem. Busy-waiting may result.
● The latency caused by remote loads is an architectural property. The latency caused
by synchronizing loads also depends on scheduling and the time it takes to compute A
and B. which may be much longer than the transit latency.
2. Distributed caching
● As described under question 4 above, distributed caching solves the remote-loads problem but not the synchronizing-loads problem; combined with multithreading, both types of remote access can be addressed.
8.MIT-J MACHINE
● The architecture and building block of the MIT-J machine, its instruction set, and
system design considerations are described below.
● The building block was the message-driven processor (MDP), a 36-bit microprocessor custom-designed for a fine-grain multicomputer.
J –Machine Architecture
● k-ary n-cube networks were applied in the MIT J-Machine.
● The initial prototype J-Machine used a 1024-node network (8 × 8 × 16), which was a reduced 16-ary 3-cube with 8 nodes along the x- and y-dimensions and 16 nodes along the z-dimension.
● A 4096-node J-Machine would use a full 16-ary 3-cube with 16 × 16 × 16 nodes. The J-Machine designers called their network a three-dimensional mesh.
Network addressing limited the size of the J-Machine to a maximum configuration of 65,536 nodes, corresponding to a three-dimensional mesh with 32 × 32 × 64 nodes. All hidden parts (nodes and links) are not shown, for clarity. Every node has a constant node degree of 6, and there are three rings crossing each node along the three dimensions. The end-around connections can be folded to balance the wire length on all channels.
The MDP Design
● The MDP chip included a processor, a 4096-word by 36-bit memory, and a built-in router with network ports.
● An on-chip memory controller with error checking and correction (ECC) capability permitted local memory to be expanded to 1 million words by adding external DRAM chips.
● The processor was message-driven in the sense that it executed functions in response to messages, via the dispatch mechanism. No receive instruction was needed.
● The MDP created a task to handle each arriving message. Messages carrying these tasks drove each computation.
● The MDP was a general-purpose multicomputer processing node that provided the communication, synchronization, and global naming mechanisms required to efficiently support fine-grain, concurrent programming models.
● The grain size was as small as 8-word objects or 20-instruction tasks.
● Fine-grain programs typically execute from 10 to 100 instructions between communication and synchronization actions.
● MDP chips provided inexpensive processing nodes built of plentiful VLSI commodity parts, used to construct the Jellybean Machine (J-Machine) multicomputer.
● The MDP appeared as a component with a memory port, six two-way network ports, and a diagnostic port.
● The memory port provided a direct interface to up to 1M words of ECC DRAM, consisting of 11 multiplexed address lines, a 12-bit data bus, and 3 control signals.
9.LATENCY HIDING TECHNIQUES
Four latency hiding mechanisms are used to increase scalability and programmability.
● The coherence problem is easily solved for small bus-based multiprocessors through
the use of snoopy cache coherence protocol.
● The cache coherence problem is much more complicated for large-scale multiprocessors (e.g., BBN Butterfly, IBM RP3, Stanford Dash) that use general interconnection networks:
o BBN Butterfly – did not use caches
o IBM RP3 – maintained cache coherence by software
o Stanford Dash – maintained cache coherence with hardware support
Benefits of caching:
● Coherent caches help to hide latency by increasing the cache hit ratio for read operations; when a cache miss occurs, cycles are wasted. The benefit of using distributed caches is to reduce cache misses.
10. ETL/EM-4 in Japan
● The internal design of the processor chip and of the node memory are
shown in Fig.
● The processor chip communicated with the network through a 3 X 3
crossbar switch unit.
● The processor and its memory were interfaced with a memory control
unit. The memory was used to hold programs as well as tokens (operand
segments, heaps, or frames) waiting to be fetched.
● The processor consisted of six component units.
● Input buffer was used as a token store with a capacity of 32 words.
● The fetch-match unit fetched tokens from the memory and performed tag-
matching operations among the tokens fetched in.
● Instructions were directly fetched from the memory through the memory
controller.
● The heart of the processor was the execution unit which fetched
instructions until the end of a thread.
● Instructions with matching tokens were executed.
● Instructions could emit tokens or write to registers.
● Instructions were fetched continually using traditional sequencing until a
“stop” flag was raised to indicate the end of a thread
● Then another pair of tokens was accepted.
● Each instruction in a thread specified the two sources for the next
instruction in the thread.
d. Release consistency (RC)
• It requires that synchronization accesses in the program be identified and classified as either acquires (e.g., locks) or releases (e.g., unlocks).
Disadvantages (RC):
o Increased hardware complexity and a more complex programming model.
Processor consistency (PC)
• Writes issued by each individual processor are always seen in program order.
• The order of writes from two different processors can be out of program order.
Conditions for processor consistency:
✓ Before a read is allowed to perform with respect to any other processor, all previous
read accesses must be performed.
✓ Before a write is allowed to perform with respect to any other processor, all previous
read or write accesses must be performed.