ADVANCED COMPUTER ARCHITECTURE
Course Code: 15CS72
Book: “Advanced Computer Architecture – Parallelism, Scalability,
Programmability”, Hwang & Jotwani
Dr. Vasanthakumar G U
Department of Computer Science and Engineering
Cambridge Institute of Technology, Bengaluru
Email: vasanth.cse@cambridge.edu.in
Mobile: +91-9902269559
COURSE OBJECTIVE
PREREQUISITE(S)
COURSE APPLICATIONS
• Parallel Computing
• Research in hardware technologies
• Research in parallel computing, etc.
MODULE – 1
THEORY OF PARALLELISM
• Peak Performance
o Is impossible to achieve in practice
• VLSI Complexity Model
o The 𝑨𝑻𝟐 Model
o Memory bound on chip area 𝑨
o I/O bound on volume 𝑨𝑻
o Bisection communication bound on the cross-section area √𝑨 ∙ 𝑻
o The square of this cross-section area, 𝑨𝑻𝟐, is used as a lower bound
• CONDITIONS OF PARALLELISM
• PROBLEM PARTITIONING AND SCHEDULING
• PROGRAM FLOW MECHANISMS
• SYSTEM INTERCONNECT ARCHITECTURES
• Grain-size problem
o To determine both the size and the number of grains in a parallel program
• Demand-Driven Mechanisms
o Reduction Machine
o Demand-driven computation
o Eager Evaluation and Lazy Evaluation
o Reduction Machine Models – String Reduction Model and Graph Reduction Model
• Network Performance
o Functionality
o Network Latency
o Bandwidth
o Hardware Complexity
o Scalability
• Network Throughput
o Total no. of messages the network can handle per unit time
o Hot spot and Hot spot throughput
For a parallel algorithm, the execution time depends not only on the input size but also on factors such as the parallel architecture and the number of processors.
Performance Metrics
• Parallel Run Time
• Speedup
• Efficiency
Parallel Runtime
The parallel run time T(n) of a program or application is the time required to run
the program on an n-processor parallel computer.
When n = 1, T(1) denotes the sequential run time of the program on a single processor.
Speedup
Speedup S(n) is defined as the ratio of the time taken to run a program on a single processor to the time taken to run it on a parallel computer with n identical processors:
S(n) = T(1) / T(n)
It measures how much faster the program runs on a parallel computer than on a single processor.
Efficiency
The efficiency E(n) of a program on n processors is defined as the ratio of the speedup achieved to the number of processors used to achieve it:
E(n) = S(n) / n
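To make these three metrics concrete, here is a minimal Python sketch; the timings are hypothetical, chosen only for illustration.

def speedup(t1, tn):
    """S(n) = T(1) / T(n)."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E(n) = S(n) / n."""
    return speedup(t1, tn) / n

# Hypothetical measurements: 100 s on one processor, 30 s on four.
t1, t4 = 100.0, 30.0
print(speedup(t1, t4))        # 3.33...: not a perfect 4x speedup
print(efficiency(t1, t4, 4))  # 0.83...: 83% of the four processors is usefully employed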
Benchmarks
Machines are run against benchmark tests to evaluate their performance. When it is not possible to test the actual applications on different machines, the results of the benchmark programs that most resemble those applications are used to evaluate machine performance.
Benchmarks
• Kernel Benchmarks
o Program fragments extracted from real programs
o The heavily used core, responsible for most of the execution time
• Synthetic Benchmarks
o Small programs created especially for benchmarking purposes
o These benchmarks do not perform any useful computation
Examples: LINPACK, LAPACK, Livermore Loops, SPECmarks, NAS Parallel Benchmarks, Perfect Club Parallel Benchmarks
Factors affecting parallel performance:
• Inter-processor Communication
• Load Imbalance
• Inter-Task Dependency
• Extra Computation
• Parallel Balance Point
Amdahl’s Law
[based on fixed problem size or fixed work load]
Gustafson’s Law
[for scaled problems, where problem size increases with machine size
i.e. the number of processors]
For a given problem size, the speedup does not increase linearly as the number of
processors increases. In fact, the speedup tends to become saturated.
This is a consequence of Amdahl’s Law.
Amdahl’s Law
Let α be the fraction of the workload that must execute sequentially, so that Ts = αT(1) and Tp = (1 − α)T(1). Then:
T(n) = Ts + Tp/n = αT(1) + (1 − α)T(1)/n
S(n) = T(1) / T(n) = n / (1 + (n − 1)α)
Sequential operations will tend to dominate the speedup as n becomes very large.
As n → ∞, S(n) → 1/α
This means that no matter how many processors are employed, the speedup of this problem is limited to 1/α.
This is known as sequential bottleneck of the problem.
Note: Sequential bottleneck cannot be removed just by increasing the no. of processors.
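The saturation is easy to see numerically. The following Python sketch assumes a sequential fraction α = 0.05; the value is illustrative only.

def amdahl_speedup(n, alpha):
    """S(n) = n / (1 + (n - 1) * alpha) for sequential fraction alpha."""
    return n / (1 + (n - 1) * alpha)

# With alpha = 0.05, speedup can never exceed 1/alpha = 20.
for n in (1, 10, 100, 1000, 10000):
    print(n, round(amdahl_speedup(n, 0.05), 2))
# 1 1.0 | 10 6.9 | 100 16.81 | 1000 19.63 | 10000 19.96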
Amdahl’s Law has a major shortcoming that stems from its own defining assumption: the total workload (the problem size) is fixed. Execution time therefore decreases as the number of processors increases, but the sequential bottleneck remains. A successful method of overcoming this shortcoming is to increase the problem size!
[Graph: speedup S(n) under Amdahl’s Law versus the number of processors n, saturating toward 1/α]
Gustafson’s Law
Gustafson’s Law relaxes the restriction of a fixed problem size and uses the notion of fixed execution time to get over the sequential bottleneck.
As the machine size increases, the work load (or problem size) is also increased so
as to keep the fixed execution time for the problem.
With α again the sequential fraction of the workload, the fixed-time (scaled) speedup becomes:
S′(n) = α + (1 − α)n = n − α(n − 1)
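A minimal Python sketch of the fixed-time speedup, again assuming an illustrative sequential fraction α = 0.05, shows the contrast with Amdahl’s saturating curve.

def gustafson_speedup(n, alpha):
    """S'(n) = alpha + (1 - alpha) * n for sequential fraction alpha."""
    return alpha + (1 - alpha) * n

for n in (1, 10, 100, 1000):
    print(n, gustafson_speedup(n, 0.05))
# 1 1.0 | 10 9.55 | 100 95.05 | 1000 950.05: near-linear, no 1/alpha ceiling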
Sun and Ni’s Law
This law defines a memory-bounded speedup model which generalizes both Amdahl’s Law and Gustafson’s Law to maximize the use of both processor and memory capacities.
The idea is to solve the maximum possible problem size, limited by the memory capacity.
According to this law, the speedup S*(n) can be defined by:
S*(n) = (α + (1 − α)G(n)) / (α + (1 − α)G(n)/n)
where G(n) is the factor by which the workload grows when the memory capacity increases n times.
Special Cases:
•G(n) = 1
Corresponds to where we have fixed problem size i.e. Amdahl’s Law
•G(n) = n
Corresponds to where the work load increases n times when memory is increased n
times, i.e. for scaled problem or Gustafson’s Law
•G(n) > n
Corresponds to the case where the computational workload increases faster than the memory
requirement.
Comparing the speedup factors S*(n), S′(n) and S(n), we shall find S*(n) ≥ S′(n) ≥ S(n)
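The special cases can be checked directly. This Python sketch evaluates S*(n) for the three G(n) choices above; α = 0.05 and G(n) = n² are illustrative assumptions.

def sun_ni_speedup(n, alpha, g):
    """S*(n) = (alpha + (1 - alpha) G(n)) / (alpha + (1 - alpha) G(n) / n)."""
    w = (1 - alpha) * g(n)
    return (alpha + w) / (alpha + w / n)

n, alpha = 100, 0.05
print(sun_ni_speedup(n, alpha, lambda k: 1))      # G(n) = 1: 16.81, Amdahl's S(n)
print(sun_ni_speedup(n, alpha, lambda k: k))      # G(n) = n: 95.05, Gustafson's S'(n)
print(sun_ni_speedup(n, alpha, lambda k: k * k))  # G(n) > n: ~99.95, S*(n) is largest

This confirms the ordering S*(n) ≥ S′(n) ≥ S(n) for this example.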
Scalability
– Increasing the no. of processors decreases the efficiency!
+ Increasing the amount of computation per processor increases the efficiency!
To keep the efficiency fixed, both the size of problem and the no. of processors must be increased
simultaneously.
Scalability of a parallel system is the measure of its capacity to increase speedup in proportion to
the machine size.
Isoefficiency Function
The isoefficiency function can be used to measure scalability of the parallel computing
systems.
It shows how the size of problem must grow as a function of the number of processors used in order to
maintain some constant efficiency.
The general form of the function is derived using an equivalent definition of efficiency as follows:
E = U / (U + O) = 1 / (1 + O/U)
where U is the time taken to do the useful computation (the essential work) and O is the parallel overhead (O is zero for sequential execution). To keep E fixed, U must grow in proportion to O: U = (E / (1 − E)) · O. This required growth of U with the number of processors is the isoefficiency function.
A small isoefficiency function means that small increments in the problem size (U) are sufficient for the efficient utilization of an increasing number of processors, indicating high scalability.
A large isoefficiency function indicates a poorly scalable system.
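As a rough illustration, the Python sketch below computes how much useful work U is required to hold efficiency fixed; the overhead function O(n) = n·log₂(n) is an assumed example, not taken from the source.

from math import log2

def required_work(n, target_efficiency):
    """U = (E / (1 - E)) * O(n), for the assumed overhead O(n) = n * log2(n)."""
    overhead = n * log2(n)
    return (target_efficiency / (1 - target_efficiency)) * overhead

for n in (2, 8, 64, 1024):
    print(n, round(required_work(n, 0.8), 1))
# The problem size must grow roughly as n log n to keep E = 0.8 here.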
Performance Analysis
Instrumentation
Ways to do instrumentation:
By inserting it into the application source code directly, or
By placing it into the runtime libraries, or
By modifying the linked executable, etc.
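As one concrete illustration of the first approach, here is a minimal Python sketch of source-level instrumentation; the decorator is invented for this example and is not from any particular tool.

import time
from functools import wraps

def instrumented(fn):
    """Wrap a function so each call reports its wall-clock run time."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            print(f"{fn.__name__} took {time.perf_counter() - start:.6f} s")
    return wrapper

@instrumented
def work(n):
    return sum(i * i for i in range(n))

work(100_000)  # prints something like: work took 0.008 s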
Chapter 3
Principles of Scalable Performance
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani
Sumit Mittu
Assistant Professor, CSE/IT
Lovely Professional University
sumit.12735@lpu.co.in
• Scalable Computer
• Workload Growth and Efficiency Curves
• Application Models
o Fixed Load Model
o Fixed Time Model
o Fixed Memory Model
Chapter 4
Processors and Memory Hierarchy
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani
• Broad Categorization
o CISC
o RISC
• Symbolic Processors
o Prolog Processors, Lisp Processors or symbolic manipulators.
o Deals with logic programs, symbolic lists, objects, scripts, production systems, semantic networks,
frames and artificial neural networks.
• Memory Hierarchy
- Need & Significance
• Each process address space is partitioned into parts and used for code, data
and stack.
• Parts are loaded into primary memory when needed and written back to
secondary storage when no longer needed.
• The logical address space is referred to as virtual memory.
• Virtual memory is much larger than the physical memory.
• Virtual memory uses: Virtual address and Physical address.
• CPU translates Virtual address to Physical address.
• Virtual memory system uses paging.
[Figure: the CPU issues virtual addresses 0 … N−1; physical memory provides addresses 0 … P−1, with the disk backing pages that are not resident]
Address Translation: the hardware converts virtual addresses into physical addresses via
an OS-managed lookup table (page table)
Page Faults (Similar to “Cache Misses”)
Servicing a Page Fault
o OS resumes the suspended process (once the faulting page has been read in from disk)
[Figure: two virtual address spaces (Process 1 and Process 2, each 0 … N−1) are mapped by address translation into one physical address space 0 … M−1 in DRAM; e.g. Process 1’s VP 1 maps to PP 2 and Process 2’s VP 2 maps to PP 10, while both processes map a page to PP 7 to share read-only library code]
Protection
• Page table entry contains access rights information
o hardware enforces this protection (trap into OS if violation occurs)
Page Tables (per-process access rights):

Process i:
VP 0: Read: Yes  Write: No   Physical Addr: PP 9
VP 1: Read: Yes  Write: Yes  Physical Addr: PP 4
VP 2: Read: No   Write: No   Physical Addr: XXXXXXX

Process j:
VP 0: Read: Yes  Write: Yes  Physical Addr: PP 6
VP 1: Read: Yes  Write: No   Physical Addr: PP 9
VP 2: Read: No   Write: No   Physical Addr: XXXXXXX
Virtual Memory Address Translation
V = {0, 1, . . . , N–1} virtual address space
P = {0, 1, . . . , M–1} physical address space, with N > M
[Figure: the processor issues a virtual address a; the hardware address-translation mechanism in the on-chip memory management unit (MMU) produces the physical address a′ for main memory; on a miss, a page fault is raised and the OS fault handler transfers the page from secondary memory]
Virtual Memory Address Translation
• Parameters
o P = 2^p = page size (bytes)
o N = 2^n = virtual address limit
o M = 2^m = physical address limit
virtual address:  bits n−1 … p = virtual page number | bits p−1 … 0 = page offset
        ↓ address translation
physical address: bits m−1 … p = physical page number | bits p−1 … 0 = page offset
Notice that the page offset bits don't change as a result of translation
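A minimal Python sketch of this split; the page size and the page-table contents are hypothetical.

P_BITS = 12                  # assume 4 KiB pages, so p = 12
PAGE_TABLE = {0x12: 0x735}   # hypothetical VPN -> PPN mapping

def translate(vaddr):
    """Swap the virtual page number for a physical one; keep the offset."""
    vpn = vaddr >> P_BITS                  # virtual page number
    offset = vaddr & ((1 << P_BITS) - 1)   # page offset, unchanged
    ppn = PAGE_TABLE[vpn]                  # a missing entry would be a page fault
    return (ppn << P_BITS) | offset

print(hex(translate(0x12ABC)))   # 0x735abc: new page number, same offset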
Page Tables
[Figure: a memory-resident page table maps each virtual page number either to a physical page in memory (valid bit = 1) or to a disk address (valid bit = 0); non-resident pages live in disk storage, in a swap file or a regular file-system file]
[Figure: two cache organizations. (1) Translation before cache: the CPU’s virtual address (VA) passes through translation to a physical address (PA), which is looked up in the cache; on a miss, main memory is accessed. (2) TLB-based translation: the VA is first looked up in the TLB; on a TLB hit the PA goes straight to the cache, and only on a TLB miss is the full translation performed]
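A toy Python model of the second organization; the page table, TLB and sizes are invented for illustration.

P_BITS = 12
PAGE_TABLE = {0x12: 0x735, 0x13: 0x0A0}   # hypothetical full page table
tlb = {}                                  # small cache of recent VPN -> PPN entries

def translate(vaddr):
    """Consult the TLB first; walk the page table only on a TLB miss."""
    vpn, offset = vaddr >> P_BITS, vaddr & ((1 << P_BITS) - 1)
    if vpn in tlb:                        # TLB hit: no page-table access needed
        return (tlb[vpn] << P_BITS) | offset, "TLB hit"
    ppn = PAGE_TABLE[vpn]                 # TLB miss: perform the full translation
    tlb[vpn] = ppn                        # fill the TLB for subsequent accesses
    return (ppn << P_BITS) | offset, "TLB miss"

print(translate(0x12ABC))  # first access to the page: TLB miss
print(translate(0x12DEF))  # same page again: TLB hit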
Chapter 6
Pipelining and Superscalar
Techniques
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani
In this chapter…
• Static Pipeline
o Used to perform fixed functions
• Dynamic Scheduling
o Tomasulo’s Algorithm (Register-Tagging Scheme)
• Hardware-based dependence resolution
o Scoreboarding Technique
• Scoreboard: the centralized control unit
• A kind of data-driven mechanism
• Delayed Branches
o A delayed branch of d cycles allows at most d-1 useful instructions to be executed following the
branch taken.
o Execution of these instructions should be independent of branch instruction to achieve a zero
branch penalty
Chapter 7
Multiprocessors and Multicomputers
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani
• Network Characteristics
o Topology
• Dynamic Networks
o Timing control protocol
• Synchronous (with global clock)
• Asynchronous (with handshake or interlocking mechanism)
o Switching method
• Circuit switching
• Packet switching
o Control Strategy
• Centralized (global controller to receive requests from all devices and grant network access)
• Distributed (requests handled by local devices independently)
• Protocol Approaches
o Snoopy Bus Protocol
o Directory Based Protocol
• Write Policies
o (Write-back, Write-through) x (Write-invalidate, Write-update)
• Directory Protocol
• Store-and-forward routing
• Wormhole routing
• Asynchronous Pipelining
• Latency Analysis
o L: Packet length (in bits)
o W: Channel Bandwidth (in bits per second)
o D: Distance (number of nodes traversed minus 1)
o F: Flit length (in bits)
o Communication Latency in Store-and-forward Routing
• TSF = L (D + 1) / W
o Communication Latency in Wormhole Routing
• TWH = L / W + F D / W
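The two formulas translate directly into a small Python calculator; the packet, flit and channel figures below are illustrative, not from the source.

def t_store_and_forward(L, W, D):
    """TSF = L * (D + 1) / W: the whole packet is forwarded at every hop."""
    return L * (D + 1) / W

def t_wormhole(L, W, D, F):
    """TWH = L / W + F * D / W: only a flit header pays the per-hop cost."""
    return L / W + F * D / W

L, W, D, F = 1_000_000, 1e9, 9, 256    # bits, bits/s, hops, bits
print(t_store_and_forward(L, W, D))    # 0.01 s
print(t_wormhole(L, W, D, F))          # ~0.001 s: nearly independent of distance D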
Chapter 8
Multivector and SIMD Computers
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani
• Vector-Vector Instructions
o F1: Vi → Vj
o F2: Vi × Vj → Vk
o Examples: V1 = sin(V2), V3 = V1 + V2
• Vector-Scalar Instructions
o F3: s × Vi → Vj
o Example: V2 = 6 + V1
• Vector-Memory Instructions
o F4: M → V (Vector Load)
o F5: V → M (Vector Store)
o Examples: V2 = Y (load), X = V1 (store)
• Masking
o F10: Vi × Vm → Vj (Vm is a binary vector)
• Examples…
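As a rough software analogue of these instruction classes, here is a NumPy sketch in which arrays stand in for vector registers; NumPy is used purely for illustration.

import numpy as np

V1 = np.array([0.0, 0.5, 1.0, 1.5])
V2 = np.array([2.0, 2.0, 2.0, 2.0])
Vm = np.array([True, False, True, False])  # binary masking vector

V3 = np.sin(V1)    # F1: Vi -> Vj            (cf. V1 = sin(V2))
V4 = V1 + V2       # F2: Vi x Vj -> Vk       (cf. V3 = V1 + V2)
V5 = 6 + V1        # F3: s x Vi -> Vj        (cf. V2 = 6 + V1)
V6 = np.copy(V2)   # F4: M -> V, vector load  (cf. V2 = Y)
X = np.copy(V1)    # F5: V -> M, vector store (cf. X = V1)
V7 = V1[Vm]        # F10: Vi x Vm -> Vj, keeps elements where Vm is 1
print(V7)          # [0. 1.]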
• Vector Loops
o Vector segmentation or strip-mining approach
o Example
• Vector Chaining
o Example: SAXPY code (see the sketch after this list)
• Limited Chaining using only one memory-access pipe in Cray-1
• Complete Chaining using three memory-access pipes in Cray X-MP
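A minimal NumPy sketch of the SAXPY operation (Y ← aX + Y) behind the chaining example; in hardware, the multiply pipeline’s output feeds the add pipeline directly instead of going through memory.

import numpy as np

def saxpy(a, x, y):
    """Y <- a * X + Y: a scalar-vector multiply chained into a vector add."""
    return a * x + y

x, y = np.arange(4.0), np.ones(4)
print(saxpy(2.0, x, y))   # [1. 3. 5. 7.]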
• SIMD Instructions
o Scalar Operations
• Arithmetic/Logical
o Vector Operations
• Arithmetic/Logical
o Data Routing Operations
• Permutations, broadcasts, multicasts, rotation and shifting
o Masking Operations
• Enable/Disable PEs
• Host and I/O
• Bit-slice and Word-slice Processing
o WSBS, WSBP, WPBS, WPBP
Chapter 9
…Dataflow Architectures
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani
• Data-driven machines
• Evolution of Dataflow Machines
• Dataflow Graphs
o Dataflow Graphs examples.
o Activity Templates and Activity Store
o Example: dataflow graph for cos x
• cos x ≅ 1 − x²/2! + x⁴/4! − x⁶/6!
o More examples
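A small Python sketch of the computation that the cos x dataflow graph encodes; each series term is an independent activity whose result feeds the final alternating sum.

from math import factorial, cos

def cos_series(x, terms=4):
    """Sum of (-1)^k * x^(2k) / (2k)! for k = 0 .. terms - 1."""
    return sum((-1) ** k * x ** (2 * k) / factorial(2 * k) for k in range(terms))

print(cos_series(0.5))  # 0.8775824...
print(cos(0.5))         # 0.8775825...: four terms already agree closely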