ADVANCED COMPUTER ARCHITECTURE
Course Code: 15CS72
Book: “Advanced Computer Architecture – Parallelism, Scalability,
Programmability”, Hwang & Jotwani

Dr. Vasanthakumar G U
Department of Computer Science and Engineering
Cambridge Institute of Technology, Bengaluru
Email: vasanth.cse@cambridge.edu.in
Mobile: +91-9902269559
Source diginotes.in Save the earth. Go paperless
COURSE OBJECTIVE

This course will enable students to:

• Describe computer architecture
• Measure the performance of architectures in terms of the right parameters
• Summarize parallel architectures and the software used for them

PREREQUISITES

• Basic Computer Architecture
• Microprocessors and Microcontrollers
Dr. Vasanthakumar G U, Department of CSE, Cambridge Institute of Technology
COURSE OUTCOME

The students should be able to:

• Explain the concepts of parallel computing and hardware technologies
• Compare and contrast parallel architectures
• Illustrate parallel programming concepts

COURSE APPLICATIONS
• Parallel computing
• Research in hardware technologies
• Research in parallel computing
MODULE – 1

THEORY OF PARALLELISM

In this chapter…

• THE STATE OF COMPUTING


• MULTIPROCESSORS AND MULTICOMPUTERS
• MULTIVECTOR AND SIMD COMPUTERS
• PRAM AND VLSI MODELS
• ARCHITECTURAL DEVELOPMENT TRACKS

THE STATE OF COMPUTING
Computer Development Milestones
• How it all started…
o 500 BC: Abacus (China) – The earliest mechanical computer/calculating device.
• Operated to perform decimal arithmetic with carry propagation digit by digit
o 1642: Mechanical Adder/Subtractor (Blaise Pascal)
o 1827: Difference Engine (Charles Babbage)
o 1941: First binary mechanical computer (Konrad Zuse; Germany)
o 1944: Harvard Mark I (IBM)
• The very first electromechanical decimal computer as proposed by Howard Aiken
• Computer Generations
o 1st 2nd 3rd 4th 5th
o Division into generations marked primarily by changes in hardware and software
technologies

THE STATE OF COMPUTING
Computer Development Milestones
• First Generation (1945 – 54)
o Technology & Architecture:
• Vacuum Tubes
• Relay Memories
• CPU driven by PC and accumulator
• Fixed Point Arithmetic
o Software and Applications:
• Machine/Assembly Languages
• Single user
• No subroutine linkage
• Programmed I/O using CPU
o Representative Systems: ENIAC, Princeton IAS, IBM 701

THE STATE OF COMPUTING
Computer Development Milestones
• Second Generation (1955 – 64)
o Technology & Architecture:
• Discrete Transistors
• Core Memories
• Floating Point Arithmetic
• I/O Processors
• Multiplexed memory access
o Software and Applications:
• High level languages used with compilers
• Subroutine libraries
• Batch processing monitor
o Representative Systems: IBM 7090, CDC 1604, Univac LARC

THE STATE OF COMPUTING
Computer Development Milestones
• Third Generation (1965 – 74)
o Technology & Architecture:
• IC Chips (SSI/MSI)
• Microprogramming
• Pipelining
• Cache
• Look-ahead processors
o Software and Applications:
• Multiprogramming and Timesharing OS
• Multiuser applications
o Representative Systems: IBM 360/370, CDC 6600, TI-ASC, PDP-8

THE STATE OF COMPUTING
Computer Development Milestones
• Fourth Generation (1975 – 90)
o Technology & Architecture:
• LSI/VLSI
• Semiconductor memories
• Multiprocessors
• Multi-computers
• Vector supercomputers
o Software and Applications:
• Multiprocessor OS
• Languages, Compilers and environment for parallel processing
o Representative Systems: VAX 9000, Cray X-MP, IBM 3090

THE STATE OF COMPUTING
Computer Development Milestones
• Fifth Generation (1991 onwards)
o Technology & Architecture:
• Advanced VLSI processors
• Scalable Architectures
• Superscalar processors
o Software and Applications:
• Systems on a chip
• Massively parallel processing
• Grand challenge applications
• Heterogeneous processing
o Representative Systems: S-81, IBM ES/9000, Intel Paragon, nCUBE 6480, MPP, VPP500

THE STATE OF COMPUTING
Elements of Modern Computers
• Computing Problems
• Algorithms and Data Structures
• Hardware Resources
• Operating System
• System Software Support
• Compiler Support

THE STATE OF COMPUTING
Evolution of Computer Architecture
• The study of computer architecture involves both the following:
o Hardware organization
o Programming/software requirements
• The evolution of computer architecture is believed to have started
with von Neumann architecture
o Built as a sequential machine
o Executing scalar data
• Major leaps in this context came as…
o Look-ahead, parallelism and pipelining
o Flynn’s classification
o Parallel/Vector Computers
o Development Layers

THE STATE OF COMPUTING
Evolution of Computer Architecture
Flynn’s Classification of Computer Architectures
In 1966, Michael Flynn proposed a classification of computer architectures
based on the number of instruction streams and data streams
(Flynn’s Taxonomy).

• Flynn uses the stream concept for describing a machine's structure.

• A stream simply means a sequence of items (data or instructions).

THE STATE OF COMPUTING
Evolution of Computer Architecture
Flynn’s Taxonomy

• SISD: Single instruction, single data
  – Classical von Neumann architecture
• SIMD: Single instruction, multiple data
• MIMD: Multiple instruction, multiple data
  – The most common and general parallel machine
• MISD: Multiple instruction, single data

THE STATE OF COMPUTING
Evolution of Computer Architecture
SISD
• SISD (Single-Instruction stream, Single-Data stream)
• SISD corresponds to the traditional uniprocessor (von Neumann computer): a
single data stream is processed by one instruction stream.
• A single-processor computer (uniprocessor) in which a single stream of
instructions is generated from the program.

THE STATE OF COMPUTING
Evolution of Computer Architecture
SIMD

• SIMD (Single-Instruction stream, Multiple-Data streams)


• Each instruction is executed on a different set of data by different processors, i.e.,
multiple processing units of the same type operate on multiple data streams.
• This group is dedicated to array-processing machines.
• Sometimes, vector processors can also be seen as part of this group.
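The SISD/SIMD contrast above can be sketched in a few lines of plain Python (the code and names are ours, not from the text): in SISD style one instruction stream handles one data item at a time, while SIMD style broadcasts a single instruction over all data elements at once.

```python
# SISD style: one instruction stream operates on one data item at a time.
def sisd_scale(data, factor):
    out = []
    for x in data:              # a single sequential stream of operations
        out.append(x * factor)
    return out

# SIMD style: one instruction ("multiply by factor") is broadcast to all
# processing elements, each holding one element of the data stream.
def simd_scale(data, factor):
    # the comprehension models every PE applying the same instruction
    # to its own data element in lockstep
    return [x * factor for x in data]

print(sisd_scale([1, 2, 3, 4], 10))   # [10, 20, 30, 40]
print(simd_scale([1, 2, 3, 4], 10))   # [10, 20, 30, 40]
```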

THE STATE OF COMPUTING
Evolution of Computer Architecture
MIMD
• MIMD (Multiple-Instruction streams, Multiple-Data streams)
• Each processor has a separate program.
• An instruction stream is generated from each program.
• Each instruction operates on different data.
• This machine type forms the group of the traditional multiprocessors.
• Several processing units operate on multiple data streams.

THE STATE OF COMPUTING
Evolution of Computer Architecture
MISD

• MISD (Multiple-Instruction streams, Single-Data stream)
• Each processor executes a different sequence of instructions.
• In MISD computers, multiple processing units operate on one single data stream.
• In practice, this kind of organization has never been used

THE STATE OF COMPUTING
Evolution of Computer Architecture
Development Layers

THE STATE OF COMPUTING
System Attributes to Performance
• Machine Capability and Program Behaviour
o Better Hardware Technology
o Innovative Architectural Features
o Efficient Resource Management
• Algorithm Design
• Data Structures
• Language efficiency
• Programmer Skill
• Compiler Technology

• Peak Performance
o Practically impossible to achieve in sustained operation

• Measurement of Program Performance: Turnaround time


o Disk and Memory Access
o Input and Output activities
o Compilation time
o OS Overhead
o CPU time
THE STATE OF COMPUTING
System Attributes to Performance
• Cycle Time (t), Clock Rate (f) and Cycles Per Instruction (CPI)
• Performance Factors
o Instruction Count, Average CPI, Cycle Time, Memory Cycle Time and No. of memory cycles

• MIPS Rate and Throughput Rate


• Programming Environments – Implicit and Explicit Parallelism
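These factors combine into the standard relations used in the text: T = Ic × CPI × τ and MIPS rate = f / (CPI × 10⁶). A minimal sketch (function names and the example numbers are ours):

```python
def cpu_time(ic, cpi, cycle_time):
    """Execution time T = Ic x CPI x tau (seconds)."""
    return ic * cpi * cycle_time

def mips_rate(clock_hz, cpi):
    """MIPS rate = f / (CPI x 10^6)."""
    return clock_hz / (cpi * 1e6)

# Hypothetical example: 1 million instructions, CPI = 2, 100 MHz clock
# (10 ns cycle time)
t = cpu_time(1_000_000, 2, 10e-9)
print(round(t, 6))            # 0.02 (seconds)
print(mips_rate(100e6, 2))    # 50.0 (MIPS)
```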

THE STATE OF COMPUTING
System Attributes to Performance
• A benchmark program contains 450000 arithmetic
instructions, 320000 data transfer instructions and 230000
control transfer instructions. Each arithmetic instruction takes
1 clock cycle to execute whereas each data transfer and
control transfer instruction takes 2 clock cycles to execute. On
a 400 MHz processor, determine:
o Effective no. of cycles per instruction (CPI)
o Instruction execution rate (MIPS rate)
o Execution time for this program
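A worked solution, following directly from the definitions above (the script itself is our own): CPI = total cycles / total instructions, MIPS = f / (CPI × 10⁶), T = Ic × CPI / f.

```python
# Worked solution to the benchmark problem above.
counts = {"arith": 450_000, "data": 320_000, "ctrl": 230_000}
cycles = {"arith": 1, "data": 2, "ctrl": 2}
f = 400e6  # 400 MHz clock

ic = sum(counts.values())                           # total instruction count
total_cycles = sum(counts[k] * cycles[k] for k in counts)
cpi = total_cycles / ic                             # effective CPI
mips = f / (cpi * 1e6)                              # MIPS rate
t = ic * cpi / f                                    # execution time (s)

print(cpi)                 # 1.55
print(round(mips, 2))      # 258.06
print(round(t * 1000, 3))  # 3.875 (milliseconds)
```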

THE STATE OF COMPUTING
System Attributes to Performance
Performance Factors
• Instruction Count (Ic)
• Average Cycles per Instruction (CPI = p + m·k, where p is the number of processor
cycles per instruction, m the number of memory references per instruction, and k
the memory access latency in cycles)
• Processor Cycle Time (τ)

System Attributes and the performance factors they influence:
o Instruction-set architecture: Ic, p
o Compiler technology: Ic, p, m
o Processor implementation and control: p, τ
o Cache and memory hierarchy: k, τ
Multiprocessors and Multicomputers
• Shared Memory Multiprocessors
o The Uniform Memory Access (UMA) Model
o The Non-Uniform Memory Access (NUMA) Model
o The Cache-Only Memory Access (COMA) Model
o The Cache-Coherent Non-Uniform Memory Access (CC-NUMA) Model
• Distributed-Memory Multicomputers
o The No-Remote-Memory-Access (NORMA) Machines
o Message Passing Multicomputers
• Taxonomy of MIMD Computers
• Representative Systems
o Multiprocessors: BBN TC-2000, MPP, S-81, IBM ES/9000 Model 900/VF
o Multicomputers: Intel Paragon XP/S, nCUBE/2 6480, SuperNode 1000, CM5, KSR-1

Multivector and SIMD Computers
• Vector Processors
o Vector Processor Variants
• Vector Supercomputers
• Attached Processors
o Vector Processor Models/Architectures
• Register-to-register architecture
• Memory-to-memory architecture
o Representative Systems:
• Cray-1
• Cray Y-MP (2,4, or 8 processors with 16Gflops peak performance)
• Convex C1, C2, C3 series (C3800 family with 8 processors, 4 GB main memory, 2
Gflops peak performance)
• DEC VAX 9000 (pipeline chaining support)

Multivector and SIMD Computers
• SIMD Supercomputers
o SIMD Machine Model
• S = < N, C, I, M, R >
• N: No. of PEs in the machine
• C: Set of instructions (scalar/program flow) directly executed by control unit
• I: Set of instructions broadcast by CU to all PEs for parallel execution
• M: Set of masking schemes
• R: Set of data routing functions
o Representative Systems:
• MasPar MP-1 (1024 to 16384 PEs)
• CM-2 (65536 PEs)
• DAP600 Family (up to 4096 PEs)
• Illiac-IV (64 PEs)

PRAM and VLSI Models

• Parallel Random Access Machines


o Time and Space Complexities
• Time complexity
• Space complexity
• Serial and Parallel complexity
• Deterministic and Non-deterministic algorithm
o PRAM
• Developed by Fortune and Wyllie (1978)
• Objective:
o Modelling idealized parallel computers with zero synchronization or memory
access overhead
• An n-processor PRAM has a globally addressable Memory

PRAM and VLSI Models

• Parallel Random Access Machines


o PRAM Variants
• EREW-PRAM Model
• CREW-PRAM Model
• ERCW-PRAM Model
• CRCW-PRAM Model
o Discrepancy with Physical Models
• Most popular variants: EREW and CRCW
• An SIMD machine with shared memory is the closest architecture modelled by PRAM
• PRAM allows different instructions to be executed on different processors
simultaneously; thus, PRAM really operates in synchronized MIMD mode with
shared memory

PRAM and VLSI Models

• VLSI Complexity Model
o The AT² Model
  • Memory bound on chip area A
  • I/O bound on volume AT (T being the execution time)
  • Bisection communication bound on the cross-section area √A · T
  • The square of this cross-section, AT², is used as a lower bound

Architectural Development Tracks

Chapter 2
Program and Network Properties
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani



In this chapter…

• CONDITIONS OF PARALLELISM
• PROBLEM PARTITIONING AND SCHEDULING
• PROGRAM FLOW MECHANISMS
• SYSTEM INTERCONNECT ARCHITECTURES

Conditions of Parallelism

• Key areas which need significant progress in parallel processing:


o Computational Models for parallel computing
o Inter-processor communication for parallel architectures
o System Integration for incorporating parallel systems into general computing environment

• Various forms of parallelism can be attributed to:


o Levels of Parallelism
o Computational Granularity
o Time and Space Complexities
o Communication Latencies
o Scheduling Policies
o Load Balancing

CONDITIONS OF PARALLELISM
Data and Resource Dependencies
• Data Dependence
o Flow Dependence
o Anti-dependence
o Output Dependence
o I/O Dependence
o Unknown Dependence
• Control Dependence
o Control-dependence and control-independence
• Resource Dependence
o ALU Dependence, Storage Dependence, etc.
• Bernstein’s Conditions for Parallelism
• Dependence Graph

CONDITIONS OF PARALLELISM
Data and Resource Dependencies
• Consider the following examples and determine the types of
dependencies and draw corresponding dependence graphs
• Example 2.1(a):
o S1: R1 ← M[A]
o S2: R2 ← R1 + R2
o S3: R1 ← R3
o S4: M[B] ← R1
• Example 2.1(b):
o S1: Read array A from file F
o S2: Process data
o S3: Write array B into file F
o S4: Close file F
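The dependence types in Example 2.1(a) can be checked mechanically from each statement’s read and write sets (the encoding below is our own): a flow dependence is write-then-read, an anti-dependence read-then-write, and an output dependence write-then-write.

```python
# Read/write sets for the statements of Example 2.1(a).
stmts = {
    "S1": {"reads": {"M[A]"}, "writes": {"R1"}},       # S1: R1 <- M[A]
    "S2": {"reads": {"R1", "R2"}, "writes": {"R2"}},   # S2: R2 <- R1 + R2
    "S3": {"reads": {"R3"}, "writes": {"R1"}},         # S3: R1 <- R3
    "S4": {"reads": {"R1"}, "writes": {"M[B]"}},       # S4: M[B] <- R1
}

def dependences(a, b):
    """Dependence types of the later statement b on the earlier statement a."""
    kinds = []
    if stmts[a]["writes"] & stmts[b]["reads"]:
        kinds.append("flow")    # write-then-read (RAW)
    if stmts[a]["reads"] & stmts[b]["writes"]:
        kinds.append("anti")    # read-then-write (WAR)
    if stmts[a]["writes"] & stmts[b]["writes"]:
        kinds.append("output")  # write-then-write (WAW)
    return kinds

print(dependences("S1", "S2"))  # ['flow']    S2 reads R1 written by S1
print(dependences("S2", "S3"))  # ['anti']    S3 overwrites R1 read by S2
print(dependences("S1", "S3"))  # ['output']  both write R1
print(dependences("S3", "S4"))  # ['flow']    S4 stores R1 written by S3
```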

CONDITIONS OF PARALLELISM
Data and Resource Dependencies
• Example 2.2: Determine the parallelism embedded in the following
statements and draw the dependency graphs.
o P1: C=DxE
o P2: M=G+C
o P3: A=B+C
o P4: C=L+M
o P5: F=G+E

• Also analyse the statements against Bernstein’s Conditions.
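Bernstein’s conditions (Pi ∥ Pj iff Ii ∩ Oj = Ij ∩ Oi = Oi ∩ Oj = ∅) can be applied to Example 2.2 mechanically; a sketch with our own encoding of the input and output sets:

```python
# (input set, output set) for each statement of Example 2.2.
procs = {
    "P1": ({"D", "E"}, {"C"}),   # C = D x E
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G + E
}

def parallel(a, b):
    """Pi || Pj iff Ii&Oj, Ij&Oi and Oi&Oj are all empty (Bernstein)."""
    (ia, oa), (ib, ob) = procs[a], procs[b]
    return not (ia & ob) and not (ib & oa) and not (oa & ob)

names = sorted(procs)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if parallel(a, b)]
print(pairs)
# [('P1', 'P5'), ('P2', 'P3'), ('P2', 'P5'), ('P3', 'P5'), ('P4', 'P5')]
```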

CONDITIONS OF PARALLELISM
Hardware and Software Parallelism
• Hardware Parallelism
o Cost and Performance Trade-offs
o Resource Utilization Patterns of simultaneously executable operations
o k-issue processors and one-issue processor
o Representative systems: Intel i960CA (3-issue), IBM RISC System 6000 (4-issue)
• Software Parallelism
o Algorithm x Programming Style x Program Design
o Program Flow Graphs
o Control Parallelism
o Data Parallelism
• Mismatch between Software Parallelism and Hardware Parallelism
o Role of Compilers in resolving the mismatch

PROGRAM PARTITIONING AND SCHEDULING
Grain Packing and Scheduling
• Fundamental Objectives
o To partition program into parallel branches, program modules, micro-tasks or grains to yield
shortest possible execution time
o Determining the optimal size of grains in a computation

• Grain-size problem
o To determine both the size and number of grains in parallel program

• Parallelism and scheduling/synchronization overhead trade-off


• Grain Packing approach
o Introduced by Kruatrachue and Lewis (1988)
o Grain size is measured by number of basic machine cycles (both processor and memory) needed
to execute all operations within the node
o A program graph shows the structure of a program
PROGRAM PARTITIONING AND SCHEDULING
Grain Packing and Scheduling
• Example 2.4: Consider the following program
o VAR a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q
o BEGIN
   1) a = 1
   2) b = 2
   3) c = 3
   4) d = 4
   5) e = 5
   6) f = 6
   7) g = a x b
   8) h = c x d
   9) i = d x e
  10) j = e x f
  11) k = d x f
  12) l = j x k
  13) m = 4 x l
  14) n = 3 x m
  15) o = n x i
  16) p = o x h
  17) q = p x q
o END

PROGRAM PARTITIONING AND SCHEDULING
Static Multiprocessor Scheduling
• Node Duplication Scheduling
o To eliminate idle time and further reduce communication delays among processors,
nodes can be duplicated in more than one processor

• Grain packing and node duplication are often used jointly to determine the best
grain size and corresponding schedule.
• Steps involved in grain determination and process of scheduling
optimization:
o Step 1: Construct a fine-grain program graph
o Step 2: Schedule the fine-grain computation
o Step 3: Perform grain packing to produce the coarse grains
o Step 4: Generate a parallel schedule based on the packed graph
PROGRAM FLOW MECHANISM

• Control Flow v/s Data Flow


o Control Flow Computer
o Data Flow Computer
o Example: The MIT tagged-token data flow computer (Arvind and his associates), 1986

• Demand-Driven Mechanisms
o Reduction Machine
o Demand-driven computation
o Eager Evaluation and Lazy Evaluation
o Reduction Machine Models – String Reduction Model and Graph Reduction Model

SYSTEM INTERCONNECT ARCHITECTURES

• Direct Networks for static connections


• Indirect Networks for dynamic connections
• Network Properties and Routing
o Static Networks and Dynamic Networks
o Network Size
o Node Degree (in-degree and out-degree)
o Network Diameter
o Bisection Width
• Channel width (w), channel bisection width (b), wire bisection width (B = b·w)
• Wire length
• Symmetric networks
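For the standard topologies these properties have closed forms; a small sketch (the formulas are the usual textbook results, the function names are ours):

```python
# Node degree, diameter and bisection width of two static topologies.
def ring_properties(n):
    """A ring of n nodes (take n even for a clean bisection)."""
    return {"degree": 2, "diameter": n // 2, "bisection_width": 2}

def hypercube_properties(k):
    """A binary k-cube with n = 2^k nodes."""
    n = 2 ** k
    return {"nodes": n, "degree": k, "diameter": k,
            "bisection_width": n // 2}

print(ring_properties(16))
# {'degree': 2, 'diameter': 8, 'bisection_width': 2}
print(hypercube_properties(4))
# {'nodes': 16, 'degree': 4, 'diameter': 4, 'bisection_width': 8}
```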

SYSTEM INTERCONNECT ARCHITECTURES

• Network Properties and Routing


o Data Routing Functions
• Shifting
• Rotation
• Permutation
• Broadcast
• Multicast
• Shuffle
• Exchange
o Hypercube Routing Functions
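Two of these routing functions have simple bit-level definitions on k-bit node addresses: the perfect shuffle is a cyclic left shift of the address bits, and the exchange complements the least significant bit. A sketch (function names are ours):

```python
# Perfect-shuffle routing function: cyclic left shift of a k-bit address.
def shuffle(addr, k):
    msb = (addr >> (k - 1)) & 1
    return ((addr << 1) & ((1 << k) - 1)) | msb

# Exchange routing function: complement the least significant address bit.
def exchange(addr):
    return addr ^ 1

# 8 nodes (k = 3): the shuffle maps 0..7 onto 0, 2, 4, 6, 1, 3, 5, 7
print([shuffle(a, 3) for a in range(8)])  # [0, 2, 4, 6, 1, 3, 5, 7]
print(exchange(0b110))                    # 7
```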

SYSTEM INTERCONNECT ARCHITECTURES

• Network Performance
o Functionality
o Network Latency
o Bandwidth
o Hardware Complexity
o Scalability

• Network Throughput
o Total no. of messages the network can handle per unit time
o Hot spot and Hot spot throughput

SYSTEM INTERCONNECT ARCHITECTURES
• Static Connection Networks
o Linear Array
o Ring and Chordal Ring
o Barrel Shifter
o Tree and Star
o Fat Tree
o Mesh and Torus
o Systolic Arrays
o Hypercubes
o Cube-connected Cycles
o K-ary n-Cube Networks
• Summary of Static Network Characteristics
o Refer to Table 2.2 (on page 76) of textbook

SYSTEM INTERCONNECT ARCHITECTURES

• Dynamic Connection Networks


o Bus System – digital bus, also called a contention or time-shared bus
o Switch Modules
• Binary switch, straight or crossover connections
o Multistage Interconnection Network (MIN)
• Inter-Stage Connection (ISC) – perfect shuffle, butterfly, multi-way shuffle, crossbar, etc.
o Omega Network
o Baseline Network
o Crossbar Network

• Summary of Dynamic Network Characteristics


o Refer to Table 2.4 (on page 82) of the textbook
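Routing in an Omega network follows the standard destination-tag rule: each of the log₂ n stages applies a perfect shuffle to the node position, then the corresponding destination bit (MSB first) sets the 2×2 switch output (0 = upper, 1 = lower). A sketch of that rule in our own encoding:

```python
# Destination-tag routing in an Omega network with n = 2^k inputs.
def omega_route(src, dst, k):
    """Return the sequence of positions visited from src to dst."""
    pos = src
    path = [pos]
    for i in range(k):
        # perfect shuffle: cyclic left shift of the k-bit position
        pos = ((pos << 1) & ((1 << k) - 1)) | ((pos >> (k - 1)) & 1)
        # switch setting: low bit replaced by destination bit i (MSB first)
        bit = (dst >> (k - 1 - i)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path

path = omega_route(0b101, 0b010, 3)   # 8-input Omega network, route 5 -> 2
print(path[-1])                       # 2
```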

PERFORMANCE EVALUATION OF PARALLEL COMPUTERS
CSE539
(Advanced Computer Architecture)

Sumit Mittu
(Assistant Professor, CSE/IT)
Lovely Professional University



BASICS OF PERFORMANCE EVALUATION

A sequential algorithm is evaluated in terms of its execution time, which is
expressed as a function of its input size.

For a parallel algorithm, the execution time depends not only on input size but also
on factors such as parallel architecture, no. of processors, etc.

Performance Metrics:
Parallel Run Time, Speedup, Efficiency

Standard Performance Measures:
Peak Performance, Sustained Performance,
Instruction Execution Rate (in MIPS), Floating-Point Capability (in MFLOPS)
PERFORMANCE METRICS

Parallel Runtime
The parallel run time T(n) of a program or application is the time required to run
the program on an n-processor parallel computer.
When n = 1, T(1) denotes sequential runtime of the program on single processor.

Speedup
Speedup S(n) is defined as the ratio of the time taken to run a program on a single
processor to the time taken to run the same program on a parallel computer with n
identical processors: S(n) = T(1) / T(n).

It measures how much faster the program runs on a parallel computer than on a
single processor.



PERFORMANCE METRICS

Efficiency
The efficiency E(n) of a program on n processors is defined as the ratio of the
speedup achieved to the number of processors used to achieve it: E(n) = S(n) / n.

Relationship between Execution Time, Speedup and Efficiency and the


number of processors used is depicted using the graphs in next slides.

In the ideal case:


Speedup is expected to be linear, i.e., to grow linearly with the number of
processors; in most cases, however, it falls short due to parallel overhead.
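A minimal sketch of the two metrics just defined, with hypothetical timings (the numbers are ours):

```python
def speedup(t1, tn):
    """S(n) = T(1) / T(n)."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E(n) = S(n) / n."""
    return speedup(t1, tn) / n

# Hypothetical run: 100 s sequentially, 16 s on 8 processors
s = speedup(100.0, 16.0)
e = efficiency(100.0, 16.0, 8)
print(s)  # 6.25
print(e)  # 0.78125  (parallel overhead keeps it below the ideal 1.0)
```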



PERFORMANCE METRICS

Graphs showing relationship b/w T(n) and no. of processors



PERFORMANCE METRICS

Graphs showing relationship b/w S(n) and no. of processors



PERFORMANCE METRICS

Graphs showing relationship b/w E(n) and no. of processors



PERFORMANCE MEASURES

Standard Performance Measures

Most of the standard measures adopted by the industry to compare the


performance of various parallel computers are based on the concepts of:
Peak Performance
[Theoretical maximum based on best possible utilization of all resources]
Sustained Performance
[based on running application-oriented benchmarks]

Generally measured in units of:


MIPS [to reflect instruction execution rate]
MFLOPS [to reflect the floating-point capability]
PERFORMANCE MEASURES

Benchmarks

Benchmarks are a set of programs or program fragments used to compare the
performance of various machines.

Machines are exposed to these benchmark tests and tested for performance.

When it is not possible to test the applications on different machines, the
results of benchmark programs that most resemble the applications run on those
machines are used to evaluate the performance of a machine.



PERFORMANCE MEASURES

Benchmarks

Kernel Benchmarks
[Program fragments which are extracted from real programs]
[Heavily used core and responsible for most execution time]
Synthetic Benchmarks
[Small programs created especially for benchmarking purposes]
[These benchmarks do not perform any useful computation]

EXAMPLES
LINPACK LAPACK Livermore Loops SPECmarks
NAS Parallel Benchmarks Perfect Club Parallel Benchmarks

PARALLEL OVERHEAD

Sources of Parallel Overhead

Parallel computers in practice do not achieve linear speedup or an efficiency of 1


because of parallel overhead. The major sources of which could be:

•Inter-processor Communication
•Load Imbalance
•Inter-Task Dependency
•Extra Computation
•Parallel Balance Point

SPEEDUP
PERFORMANCE LAWS

Speedup Performance Laws

Amdahl’s Law
[based on fixed problem size or fixed work load]

Gustafson’s Law
[for scaled problems, where problem size increases with machine size
i.e. the number of processors]

Sun & Ni’s Law


[applied to scaled problems bounded by memory capacity]
SPEEDUP
PERFORMANCE LAWS

Amdahl’s Law (1967)

For a given problem size, the speedup does not increase linearly as the number of
processors increases. In fact, the speedup tends to become saturated.
This is a consequence of Amdahl’s Law.

According to Amdahl’s Law, a program contains two types of operations:


Completely sequential
Completely parallel

Let the time Ts taken to perform sequential operations be a fraction α (0 < α ≤ 1)
of the total execution time T(1) of the program; then the time Tp to perform
parallel operations is (1 − α)·T(1).

SPEEDUP
PERFORMANCE LAWS

Amdahl’s Law

Thus, Ts = α·T(1) and Tp = (1 − α)·T(1)

Assuming that the parallel operations achieve linear speedup
(i.e. they take 1/n of the single-processor time when run on n processors), then:

T(n) = Ts + Tp/n = α·T(1) + (1 − α)·T(1)/n

Thus, the speedup with n processors will be:

S(n) = T(1) / T(n) = 1 / (α + (1 − α)/n) = n / (1 + (n − 1)·α)

SPEEDUP
PERFORMANCE LAWS

Amdahl’s Law

Sequential operations will tend to dominate the speedup as n becomes very large.

As n → ∞, S(n) → 1/α

This means, no matter how many processors are employed, the speedup of the
problem is limited to 1/α.
This is known as the sequential bottleneck of the problem.

Note: Sequential bottleneck cannot be removed just by increasing the no. of processors.
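The saturation described above is easy to see numerically. A minimal sketch of the fixed-load speedup formula S(n) = 1 / (α + (1 − α)/n), with an assumed sequential fraction of 5%:

```python
def amdahl_speedup(alpha, n):
    """Fixed-workload speedup: S(n) = 1 / (alpha + (1 - alpha) / n)."""
    return 1.0 / (alpha + (1.0 - alpha) / n)

# With alpha = 0.05, speedup saturates near 1/alpha = 20 regardless of n.
for n in (1, 10, 100, 1000, 10**6):
    print(n, round(amdahl_speedup(0.05, n), 2))
```

Going from 1000 to a million processors barely moves the speedup, which is the sequential bottleneck in action.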

SPEEDUP
PERFORMANCE LAWS

Amdahl’s Law

A major shortcoming in applying Amdahl's Law is its underlying assumption:
the total workload (problem size) is fixed, so the execution time decreases with
an increasing number of processors while the speedup stays bounded by 1/α.

A successful way of overcoming this shortcoming is to scale up the problem size!

SPEEDUP
PERFORMANCE LAWS

Amdahl’s Law

<<<GRAPH>>>

SPEEDUP
PERFORMANCE LAWS

Gustafson’s Law (1988)

It relaxed the restriction of fixed size of the problem and used the notion of fixed
execution time for getting over the sequential bottleneck.

According to Gustafson's Law,
if the number of parallel operations in the problem is increased (or scaled up)
sufficiently, then sequential operations will no longer be a bottleneck.

In accuracy-critical applications, it is desirable to solve the largest problem size on a


larger machine rather than solving a smaller problem on a smaller machine, with
almost the same execution time.

SPEEDUP
PERFORMANCE LAWS

Gustafson’s Law

As the machine size increases, the work load (or problem size) is also increased so
as to keep the fixed execution time for the problem.

Let Ts be the constant time taken to perform sequential operations, and
Tp(n, W) be the time taken to perform parallel operations of problem size or
workload W using n processors.
Then the speedup with n processors is:

S'(n) = (Ts + Tp(1, W)) / (Ts + Tp(n, W))

SPEEDUP
PERFORMANCE LAWS

Gustafson’s Law

<<<IMAGES>>>

SPEEDUP
PERFORMANCE LAWS

Gustafson’s Law

Assuming that parallel operations achieve a linear speedup
(i.e. they take 1/n of the time to perform on one processor),
then Tp(1, W) = n·Tp(n, W)

Let α be the fraction of sequential workload in the problem, i.e.
α = Ts / (Ts + Tp(n, W))

Then the speedup can be expressed as:

S'(n) = (Ts + n·Tp(n, W)) / (Ts + Tp(n, W)) = α + n·(1 − α)
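Under the fixed-time model the speedup is a straight line in n rather than a saturating curve. A small sketch (the sequential fraction α = 0.05 is an assumed value):

```python
def gustafson_speedup(alpha, n):
    """Fixed-time (scaled) speedup: S'(n) = alpha + n * (1 - alpha)."""
    return alpha + n * (1.0 - alpha)

# Unlike Amdahl's fixed-load model, the scaled speedup keeps growing with n:
for n in (10, 100, 1000):
    print(n, gustafson_speedup(0.05, n))
```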

SPEEDUP
PERFORMANCE LAWS

Sun & Ni’s Law (1993)

This law defines a memory bounded speedup model which generalizes both Amdahl’s Law
and Gustafson’s Law to maximize the use of both processor and memory capacities.

The idea is to solve the maximum possible problem size, limited by memory capacity

This inherently demands an increased or scaled work load,


providing higher speedup,
Higher efficiency, and
Better resource (processor & memory) utilization
But may result in slight increase in execution time to achieve this scalable
speedup performance!
SPEEDUP
PERFORMANCE LAWS

Sun & Ni’s Law

According to this law, the speedup S*(n) can be defined by:

S*(n) = (α + (1 − α)·G(n)) / (α + (1 − α)·G(n)/n)

where G(n) reflects the increase in the workload as the memory capacity
increases n times.

Assumptions made while deriving the above expression:

• A global address space is formed from all individual memory spaces i.e. there is a
distributed shared memory space
• All available memory capacity is used up for solving the scaled problem.

SPEEDUP
PERFORMANCE LAWS

Sun & Ni’s Law

Special Cases:
•G(n) = 1
Corresponds to where we have fixed problem size i.e. Amdahl’s Law
•G(n) = n
Corresponds to where the work load increases n times when memory is increased n
times, i.e. for scaled problem or Gustafson’s Law
•G(n) ≥ n
Corresponds to where computational workload (time) increases faster than memory
requirement.

Comparing the speedup factors S*(n), S'(n) and S(n), we shall find S*(n) ≥ S'(n) ≥ S(n)
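The special cases above can be checked numerically. A sketch of the three speedup models (the values of α, n and G(n) below are arbitrary assumptions):

```python
def amdahl(alpha, n):                     # fixed load: G(n) = 1
    return 1.0 / (alpha + (1.0 - alpha) / n)

def gustafson(alpha, n):                  # scaled load: G(n) = n
    return alpha + n * (1.0 - alpha)

def sun_ni(alpha, n, g):
    """Memory-bounded speedup: S*(n) = (a + (1-a)G(n)) / (a + (1-a)G(n)/n)."""
    return (alpha + (1.0 - alpha) * g) / (alpha + (1.0 - alpha) * g / n)

a, n = 0.1, 4
print(sun_ni(a, n, 1), amdahl(a, n))      # G(n) = 1 reduces to Amdahl's Law
print(sun_ni(a, n, n), gustafson(a, n))   # G(n) = n reduces to Gustafson's Law
print(sun_ni(a, n, n * n))                # G(n) > n: exceeds both
```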

SPEEDUP
PERFORMANCE LAWS

Sun & Ni’s Law

<<<IMAGES>>>

SCALABILITY METRIC

Scalability
– Increasing the no. of processors decreases the efficiency!
+ Increasing the amount of computation per processor, increases the efficiency!
To keep the efficiency fixed, both the size of problem and the no. of processors must be increased
simultaneously.

A parallel computing system is said to be scalable if its efficiency can be


fixed by simultaneously increasing the machine size and the problem size.

Scalability of a parallel system is the measure of its capacity to increase speedup in proportion to
the machine size.

SCALABILITY METRIC

Isoefficiency Function
The isoefficiency function can be used to measure scalability of the parallel computing
systems.

It shows how the size of problem must grow as a function of the number of processors used in order to
maintain some constant efficiency.

The general form of the function is derived using an equivalent definition of efficiency
as follows:

E = U / (U + O)

Where, U is the time taken to do the useful computation (essential work), and
O is the parallel overhead. (Note: O is zero for sequential execution).
SCALABILITY METRIC

Isoefficiency Function

If the efficiency is fixed at some constant value K, then

U = (K / (1 − K)) · O = K′ · O

Where, K′ = K / (1 − K) is a constant for the fixed efficiency K.

This function is known as the isoefficiency function of the parallel computing system.

A small isoefficiency function means that small increments in the problem size (U) are
sufficient for efficient utilization of an increasing number of processors, indicating high scalability.
A large isoefficiency function indicates a poorly scalable system.
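A minimal numeric sketch of U = K′·O, using a hypothetical overhead model O(n) = n·log2(n) (the overhead function of a real system would have to be measured or derived):

```python
import math

def isoefficiency_workload(k, overhead):
    """Workload U needed to hold efficiency at K:  U = (K / (1 - K)) * O."""
    return k / (1.0 - k) * overhead

# Hypothetical overhead model O(n) = n * log2(n):
for n in (2, 8, 64):
    o = n * math.log2(n)
    u = isoefficiency_workload(0.8, o)
    print(n, o, u)        # E = U / (U + O) stays at 0.8
```

The faster O grows with n, the faster U must grow to hold E at K, which is exactly what the isoefficiency function quantifies.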
PERFORMANCE
MEASUREMENT TOOLS

Performance Analysis

Search Based Tools


Visualization
Utilization Displays
[Processor (utilization) count, Utilization Summary, Gantt charts, Concurrency Profile, Kiviat
Diagrams]
Communication Displays
[Message Queues, Communication Matrix, Communication Traffic, Hypercube]
Task Displays
[Task Gantt, Task Summary]

PERFORMANCE
MEASUREMENT TOOLS

Instrumentation

A way to collect data about an application is to instrument the application executable


so that when it executes, it generates the required information as a side-effect.

Ways to do instrumentation:
By inserting it into the application source code directly, or
By placing it into the runtime libraries, or
By modifying the linked executable, etc.

Doing this, some perturbation of the application program will occur


(i.e. intrusion problem)
PERFORMANCE
MEASUREMENT TOOLS

Instrumentation

Intrusion includes both:


Direct contention for resources (e.g. CPU, memory, communication links, etc.)
Secondary interference with resources (e.g. interaction with cache replacements or virtual
memory, etc.)

To address such effects, you may adopt the following approaches:


Realizing that intrusion affects measurement, treat the resulting data as an
approximation
Leave the added instrumentation in the final implementation.
Try to minimize the intrusion.
Quantify the intrusion and compensate for it!

CSE539: Advanced Computer Architecture

Chapter 3
Principles of Scalable Performance
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani

Sumit Mittu
Assistant Professor, CSE/IT
Lovely Professional University
sumit.12735@lpu.co.in

In this chapter…

• Parallel Processing Applications


• Sources of Parallel Overhead*
• Application Models for Parallel Processing
• Speedup Performance Laws

Dr. Vasanthakumar G U, Department of CSE, Cambridge Institute of Technology
PARALLEL PROCESSING APPLICATIONS
Massive Parallelism for Grand Challenges
• Massively Parallel Processing
• High-Performance Computing and Communication (HPCC) Program
• Grand Challenges
o Magnetic Recording Industry
o Rational Drug Design
o Design of High-speed Transport Aircraft
o Catalyst for Chemical Reactions
o Ocean Modelling and Environment Modelling
o Digital Anatomy in:
• Medical Diagnosis, Pollution reduction, Image Processing, Education Research

PARALLEL PROCESSING APPLICATIONS
Massive Parallelism for Grand Challenges
• Exploiting Massive Parallelism
o Instruction Parallelism v/s Data Parallelism

• The Past and The Future


o Early Representative Massively Parallel Processing Systems
• Kendall Square KSR-1: All-cache ring multiprocessor; 43 Gflops peak performance
• IBM MPP Model: 1024 IBM RISC/6000 processors; 50 Gflops peak performance
• Cray MPP Model: 1024 DEC Alpha Processor chips; 150 Gflops peak performance
• Intel Paragon: 2D mesh-connected multicomputer; 300 Gflops peak performance
• Fujitsu VPP-500: 222-PE MIMD vector system; 355 Gflops peak performance
• TMC CM-5: 16K nodes of SPARC PEs, 2Tflops peak performance

PARALLEL PROCESSING APPLICATIONS
Application Models for Parallel Processing
• Keeping the workload constant
o Efficiency E decreases rapidly as Machine Size n increases
• Because: Overhead h increases faster than machine size

• Scalable Computer
• Workload Growth and Efficiency Curves
• Application Models
o Fixed Load Model
o Fixed Time Model
o Fixed Memory Model

• Trade-offs in Scalability Analysis


Advanced Computer Architecture

Chapter 4
Processors and Memory Hierarchy
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani

In this chapter…

• Design Space of Processors


• Superscalar and Vector Processors
• Memory Hierarchy Technology

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Processor families mapped onto a coordinate space of clock rate
v/s CPI
o Clock Rates moved from lower
to higher speeds
o CPI rate is lowered

• Broad Categorization
o CISC
o RISC

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• The Design Space
o CISC Computers
o RISC Computers
o Superscalar Processors
o VLIW Processors
o Vector Supercomputers

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Instruction Pipelines
o Instruction Cycle Phases
o Pipeline and Pipeline Cycle
o Instruction Pipeline Cycle
o Instruction issue Latency
o Instruction Issue Rate (degree of superscalar processor)
o Simple Operation Latency
o Resource Conflicts
o Base scalar processor

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Vector Processors
o Memory to memory VP
• Memory based instructions
• Longer instructions
• Instructions include memory address
o Register to register VP
• Shorter Instructions
• Vector register files

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Vector Processors
o Vector Instructions (Register to register)
• Binary Vector: V1 ∘ V2 → V3
• Scaling: s1 ∘ V1 → V2
• Binary Reduction: V1 ∘ V2 → s1
• Vector Load: M(1:n) → V1
• Vector Store: V1 → M(1:n)
• Unary Vector: ∘ V1 → V2
• Unary Reduction: ∘ V1 → s1

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Vector Processors
o Vector Instructions (Memory to memory)
• M1(1:n) ∘ M2(1:n) → M3(1:n)
• s1 ∘ M1(1:n) → M2(1:n)
• ∘ M1(1:n) → M2(1:n)
• M1(1:n) ∘ M2(1:n) → M(k)
o Vector Pipelines

• Symbolic Processors
o Prolog Processors, Lisp Processors or symbolic manipulators.
o Deals with logic programs, symbolic lists, objects, scripts, productions systems, semantic networks,
frames and artificial neural networks.

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Symbolic Processors
o Attributes & Characteristics of Symbolic Processing
• Knowledge Representation
o Lists, relational databases, semantic nets, frames etc.
• Common Operations
o Search, sort, pattern matching, filtering etc.
• Memory Requirements
o Large memory with intensive access pattern, content-based
• Communication Patterns
o Varying traffic size, granularity and format of messages
Continued…

ADVANCED PROCESSOR TECHNOLOGY
Design Space of Processors
• Symbolic Processors
o Attributes & Characteristics of Symbolic Processing
• Properties of Algorithms
o Non-deterministic, parallel and distributed computations
• I/O Requirements
o User-guided programs, intelligent person-machine interfaces
• Architecture Features
o Parallel update of knowledge bases, dynamic load balancing, dynamic memory
allocation, hardware based garbage collection etc.

ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology

• Memory Hierarchy
- Need & Significance

ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology
• Memory Hierarchy
o Parameters
• Access time
• Memory size
• Cost per byte
• Transfer bandwidth
• Unit of transfer
o Properties
• Inclusion
• Coherence
• Locality

ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology
• Memory Hierarchy
o Parameters
• T(i): Access time (round-trip time from CPU to i-th level memory)
o T(i-1) < T(i) <T(i+1)
• S(i): Memory size (number of bytes or words in i-th level memory)
o S(i-1) < S(i) < S(i+1)
• C(i): Cost per byte (per byte cost of i-th level memory; total cost estimated by C(i)*S(i))
o C(i-1) > C(i) > C(i+1)
• B(i): Transfer bandwidth (rate at which information is transferred between adjacent levels)
o B(i-1) > B(i) > B(i+1)
• X(i): Unit of transfer (grain size for data transfer between levels I and i+1)
o X(i-1) < X(i) < X(i+1)

ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology
• Memory Hierarchy
o Properties
• Inclusion Property
o M1 ⊂ M2 ⊂ M3 ⊂ ....⊂ Mn
o M(i-1) is a subset of M(i)
• Coherence Property
o Copies of same information item at successive memory levels be consistent
o Strategies to maintain Coherence:
• 1) Write-Through (WT)
• 2) Write-Back (WB)
• Locality of Reference
o Temporal: recently referenced items are likely to be referenced again in near future
o Spatial: tendency of a process to access items whose addresses are near one another
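Spatial locality is easiest to see with a 2-D array: both loops below touch the same elements, but the row-major order visits addresses that are adjacent in memory, so in a language with flat arrays (e.g. C) it makes far better use of each fetched cache line. Plain Python lists blur the effect, so treat this purely as an illustration of access order:

```python
ROWS, COLS = 200, 200
grid = [[1] * COLS for _ in range(ROWS)]

def row_major_sum(g):
    # Inner loop walks along a row: consecutive addresses, good spatial locality.
    return sum(g[i][j] for i in range(ROWS) for j in range(COLS))

def col_major_sum(g):
    # Inner loop walks down a column: strided accesses, poor spatial locality.
    return sum(g[i][j] for j in range(COLS) for i in range(ROWS))
```

Temporal locality, by contrast, is about re-touching the *same* item soon, which is what caches exploit by keeping recently referenced blocks resident.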
ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology

<<<FIGURE: The Working Sets Model>>>
ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology
• Memory Capacity Planning
o Hit ratio at the i-th level: h(i)
o Access Frequency to i-th level memory:
  f(i) = (1 − h(1))·(1 − h(2)) … (1 − h(i−1))·h(i)
o Effective Access Time:
  T(eff) = Σ (i = 1 to n) f(i)·t(i)
o Total Cost of Memory Hierarchy:
  C(total) = Σ (i = 1 to n) c(i)·s(i)
o Hierarchy Optimization: minimize T(eff) = Σ (i = 1 to n) f(i)·t(i)
  subject to: s(i) > 0, t(i) > 0 (for i = 1 to n) and C(total) = Σ (i = 1 to n) c(i)·s(i) < C0
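The access-frequency and effective-access-time formulas above translate directly into code; the hit ratios and access times below are assumed figures for a hypothetical three-level hierarchy (cache, main memory, disk):

```python
def access_frequencies(hit_ratios):
    """f(i) = (1-h(1)) * ... * (1-h(i-1)) * h(i); the last level must have h = 1."""
    freqs, miss_so_far = [], 1.0
    for h in hit_ratios:
        freqs.append(miss_so_far * h)
        miss_so_far *= (1.0 - h)
    return freqs

def effective_access_time(hit_ratios, times):
    """T(eff) = sum of f(i) * t(i) over all levels."""
    return sum(f * t for f, t in zip(access_frequencies(hit_ratios), times))

h = [0.95, 0.99, 1.0]        # assumed hit ratios; the last level always hits
t = [10e-9, 100e-9, 10e-3]   # assumed access times in seconds
print(effective_access_time(h, t))   # dominated by the rare disk accesses
```

Even a 0.05% residual disk-access frequency dominates T(eff), which is why the hierarchy is tuned to push hit ratios at the fast levels as high as possible.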

ADVANCED PROCESSOR TECHNOLOGY
Memory Hierarchy Technology
Virtual Memory Technology
• Virtual memory is backed by the secondary storage device and extends the
apparent memory capacity beyond physical memory.
• Works with primary memory to load applications.
• Reduces the cost of expanding the capacity of physical memory.
• Implementations differ from one OS to another.

Virtual Memory Technology

• Each process address space is partitioned into parts and used for code, data
and stack.
• Parts are loaded into primary memory when needed and written back to
secondary storage otherwise.
• The logical address space is referred to as virtual memory.
• Virtual memory is much larger than the physical memory.
• Virtual memory uses: Virtual address and Physical address.
• CPU translates Virtual address to Physical address.
• Virtual memory system uses paging.

Locating an Object in a Cache

1. Search for a matching tag (e.g. an SRAM cache): the object name X is compared
   against the tags stored alongside the data; on a match the associated data
   is returned.
   <<<FIGURE: tag/data array, entries 0 … N-1>>>

2. Use indirection to look up the actual object location (e.g. a DRAM cache):
   the object name X indexes a lookup table, which yields the location of the
   data in the cache.
   <<<FIGURE: lookup table pointing into the data array>>>
A System with Physical Memory Only
• Examples:
o most Cray machines, early PCs, nearly all embedded systems, etc.
<<<FIGURE: CPU issues physical addresses (0 … N-1) directly into memory>>>
Addresses generated by the CPU point directly to bytes in physical memory


A System with Virtual Memory
• Examples:
o workstations, servers, modern PCs, etc.

<<<FIGURE: CPU issues virtual addresses; an OS-managed page table maps them to
physical addresses in main memory or to locations on disk>>>
Address Translation: the hardware converts virtual addresses into physical addresses via
an OS-managed lookup table (page table)
Page Faults (Similar to “Cache Misses”)

• What if an object is on disk rather than in memory?


o Page table entry indicates that the virtual address is not in memory
o An OS exception handler is invoked, moving data from disk into memory
• current process suspends, others can resume
• OS has full control over the placement.

<<<FIGURE: page table before and after the fault — after the OS handler runs,
the faulting entry points to the page brought from disk into memory>>>
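The valid-bit check that triggers a fault can be sketched as follows; the page-table contents and names here are hypothetical, and a real handler runs in the OS rather than raising an exception:

```python
class PageFault(Exception):
    """Raised when a virtual page is on disk rather than in memory."""

# Hypothetical page table: vpn -> (valid bit, physical page or disk address)
page_table = {0: (1, 7), 1: (0, "disk:block 58"), 2: (1, 4)}

def translate(vpn):
    valid, where = page_table[vpn]
    if not valid:
        # In a real system this traps to the OS handler, which moves the
        # page from disk into memory, updates the entry, and resumes.
        raise PageFault(vpn)
    return where

print(translate(0))   # resident page: returns physical page 7
```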
Servicing a Page Fault

• Processor Signals Controller
  o Read block of length P starting at disk address X and store starting at
    memory address Y
• Read Occurs
  o Direct Memory Access (DMA)
  o Under control of I/O controller
• I/O Controller Signals Completion
  o Interrupt processor
  o OS resumes suspended process

<<<FIGURE: (1) processor initiates block read; (2) DMA transfer from disk to
memory over the memory–I/O bus; (3) I/O controller signals "read done" with
an interrupt>>>
Solution: Separate Virtual Addr. Spaces

o Virtual and physical address spaces divided into equal-sized blocks


• blocks are called “pages” (both virtual and physical)
o Each process has its own virtual address space
• operating system controls how virtual pages are assigned to physical memory

<<<FIGURE: the virtual address spaces (0 … N-1) of processes 1 and 2 are mapped
by address translation onto one physical address space (DRAM); a shared physical
page (e.g. read-only library code) can back virtual pages of both processes>>>
Protection
• Page table entry contains access rights information
o hardware enforces this protection (trap into OS if violation occurs)
Page Tables (example):

Process i:   Read?  Write?  Physical Addr
  VP 0:      Yes    No      PP 9
  VP 1:      Yes    Yes     PP 4
  VP 2:      No     No      XXXXXXX

Process j:   Read?  Write?  Physical Addr
  VP 0:      Yes    Yes     PP 6
  VP 1:      Yes    No      PP 9
  VP 2:      No     No      XXXXXXX
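The access-rights check can be sketched in a few lines; the tables below mirror the example entries above, and the exception stands in for the hardware trap into the OS:

```python
class ProtectionFault(Exception):
    """Stands in for the hardware trap raised on an access-rights violation."""

# Per-process page tables: vpn -> (readable, writable, physical page)
tables = {
    "i": {0: (True, False, 9), 1: (True, True, 4)},
    "j": {0: (True, True, 6), 1: (True, False, 9)},
}

def access(process, vpn, write=False):
    readable, writable, pp = tables[process][vpn]
    if (write and not writable) or (not write and not readable):
        raise ProtectionFault((process, vpn))
    return pp

print(access("i", 1, write=True))   # permitted write: physical page 4
```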
Virtual Memory Address Translation
V = {0, 1, . . . , N–1} virtual address space (N > M)
P = {0, 1, . . . , M–1} physical address space

MAP: V → P ∪ {∅}  address mapping function

MAP(a) = a' if data at virtual address a is present at physical address a' in P
       = ∅ if data at virtual address a is not present in P

<<<FIGURE: the on-chip memory management unit (MMU) performs the hardware
address translation; on a miss, a page fault traps to the OS fault handler,
which transfers the page from secondary memory into main memory>>>
Virtual Memory Address Translation

• Parameters
o P = 2p = page size (bytes).
o N = 2n = Virtual address limit
o M = 2m = Physical address limit

virtual address:   bits n–1 … p = virtual page number | bits p–1 … 0 = page offset
                          ↓ address translation ↓
physical address:  bits m–1 … p = physical page number | bits p–1 … 0 = page offset

Notice that the page offset bits don't change as a result of translation
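The bit-field split can be checked with a couple of shifts; a 4 KB page (p = 12) is assumed here:

```python
P_BITS = 12                 # assumed page size: 2**12 = 4 KB
PAGE_SIZE = 1 << P_BITS

def split(vaddr):
    """Return (virtual page number, page offset) for a virtual address."""
    return vaddr >> P_BITS, vaddr & (PAGE_SIZE - 1)

def join(ppn, offset):
    """Rebuild a physical address; the offset bits pass through unchanged."""
    return (ppn << P_BITS) | offset

vpn, offset = split(0x12345)      # vpn = 0x12, offset = 0x345
print(hex(join(0x7, offset)))     # same offset under a new page number
```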
Page Tables
<<<FIGURE: a memory-resident page table; each entry holds a valid bit and either
a physical page number (valid = 1, page in physical memory) or a disk address
(valid = 0, page in the swap file or a regular file system file)>>>

Integrating VM and Cache

<<<FIGURE: CPU → address translation (VA → PA) → cache; on a cache miss the
physical address goes to main memory>>>

• Most Caches “Physically Addressed”


o Accessed by physical addresses
o Allows multiple processes to have blocks in cache at same time
o Allows multiple processes to share pages
o Cache doesn’t need to be concerned with protection issues
• Access rights checked as part of address translation

Speeding up Translation with a TLB

• “Translation Lookaside Buffer” (TLB)


o Small hardware cache in MMU
o Maps virtual page numbers to physical page numbers
o Contains complete page table entries for small number of pages

On an access, the MMU first looks up the virtual page number in the TLB:
on a TLB hit the physical address is formed directly, while on a TLB miss
the full page-table translation is performed before the cache and main
memory are accessed.
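The hit/miss behaviour can be sketched with a tiny simulator: a small VPN → PPN cache sits in front of the page table, and only misses fall through to the full walk. The class and its FIFO-style eviction are illustrative assumptions, not a description of real MMU hardware:

```python
# A toy TLB: caches recent VPN -> PPN translations, filling from the
# page table on a miss and evicting the oldest entry when full.
class TLBSim:
    def __init__(self, page_table, capacity=16):
        self.page_table = page_table   # VPN -> PPN (all pages assumed resident)
        self.capacity = capacity
        self.tlb = {}                  # insertion-ordered VPN -> PPN cache
        self.hits = self.misses = 0

    def lookup(self, vpn):
        if vpn in self.tlb:
            self.hits += 1             # TLB hit: no page-table walk needed
        else:
            self.misses += 1           # TLB miss: walk the page table
            if len(self.tlb) >= self.capacity:
                self.tlb.pop(next(iter(self.tlb)))  # evict oldest entry
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn]
```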

Page Replacement

• Page replacement refers to the process in which a resident page in


main memory is replaced by a new page transferred from the disk.
• The goal of a page replacement policy is to minimize the number of
  page faults and thereby reduce the effective memory access time.
• R(t): Resident set of all pages residing in the main memory at time t.
• Forward distance ft(x): The number of time slots from time t to the
first repeated reference of page x in the future.
• Backward distance bt(x): The number of time slots from time t to the
most recent reference of page x in the past.

Page Replacement Policies
o Least recently used (LRU): Replaces the page in R(t) which has the
longest backward distance.
o Optimal (OPT) algorithm: Replaces the page in R(t) with longest forward
distance.
o First-in-first-out (FIFO): Replaces the page in R(t) which has been in
memory for the longest time.
o Least frequently used (LFU): Replaces the page in R(t) which has been
least referenced in the past.
o Circular FIFO: Joins all the page frame entries into a circular FIFO
queue using a pointer to indicate the front of the queue.
o Random replacement: Trivial algorithm which chooses any page for
replacement randomly.
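The LRU and FIFO policies above can be compared on a reference trace with a small fault-counting simulator (a sketch; `count_faults` and its interface are illustrative names):

```python
from collections import OrderedDict

def count_faults(trace, frames, policy="LRU"):
    """Count page faults for a reference trace with `frames` page frames.
    LRU evicts the longest backward distance; FIFO evicts the oldest
    resident page regardless of references."""
    resident = OrderedDict()            # resident set R(t), ordered by policy
    faults = 0
    for page in trace:
        if page in resident:
            if policy == "LRU":
                resident.move_to_end(page)   # refresh: most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict front of the queue
            resident[page] = None
    return faults
```

On the trace 1 2 3 4 1 2 5 1 2 3 4 5 with 3 frames, FIFO takes 9 faults and LRU takes 10.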
Advanced Computer Architecture

Chapter 6
Pipelining and Superscalar
Techniques
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani

In this chapter…

• Linear Pipeline Processors


• Non-linear Pipeline Processors
• Instruction Pipeline Design
• Arithmetic Pipeline Design
• Superscalar Pipeline Design

Dr. Vasanthakumar G U, Department of CSE, Cambridge Institute of Technology
LINEAR PIPELINE PROCESSORS
• Linear Pipeline Processor
o Cascade of processing stages which are linearly connected to perform a fixed function over a stream of
data flowing from one end to the other.
o Instruction execution, arithmetic computations, and memory-access operations.
• Models of Linear Pipeline
o Synchronous Model
o Asynchronous Model
o (Corresponding reservation tables)
• Clocking and Timing Control
o Clock Cycle
o Pipeline Frequency
o Clock skewing
o Flow-through delay
o Speedup factor
o Optimal number of Stages and Performance-Cost Ratio (PCR)
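The standard timing measures for a k-stage linear pipeline on n tasks can be expressed directly: a pipeline finishes n tasks in k + (n − 1) cycles, against n·k cycles for an equivalent non-pipelined unit. A sketch (function names are illustrative):

```python
# Speedup, efficiency and throughput of a k-stage linear pipeline
# processing n tasks with clock period tau.
def speedup(k, n):
    """S_k = n*k / (k + n - 1); approaches k as n grows large."""
    return n * k / (k + n - 1)

def efficiency(k, n):
    """Fraction of the ideal speedup actually achieved: S_k / k."""
    return speedup(k, n) / k

def throughput(k, n, tau):
    """Tasks completed per unit time: n / [(k + n - 1) * tau]."""
    return n / ((k + n - 1) * tau)
```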

NON-LINEAR PIPELINE PROCESSORS
• Dynamic Pipeline
o Configured to perform variable functions at different times.
o Allows streamline, feed-forward connection and feedback connection

• Static Pipeline
o Used to perform fixed functions

• Reservation and Latency Analysis


o Reservation tables
o Evaluation time
• Latency Analysis
o Latency
o Collision
o Forbidden latencies
o Latency Sequence, Latency Cycle and Average Latency
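The latency analysis above can be made concrete: given a reservation table (stage → set of clock cycles it occupies), a latency L is forbidden if some stage is used at cycles t and t + L, so two initiations L cycles apart would collide. A sketch with an illustrative, made-up table in the test:

```python
# Forbidden latencies and collision vector of a static pipeline,
# computed from its reservation table.
def forbidden_latencies(table):
    """All pairwise cycle differences within a stage are forbidden."""
    forbidden = set()
    for cycles in table.values():
        cs = sorted(cycles)
        for i in range(len(cs)):
            for j in range(i + 1, len(cs)):
                forbidden.add(cs[j] - cs[i])
    return forbidden

def collision_vector(table):
    """C = (c_m ... c_1): the bit for latency i is 1 iff i is forbidden."""
    f = forbidden_latencies(table)
    m = max(f)
    return [1 if i in f else 0 for i in range(m, 0, -1)]
```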

INSTRUCTION PIPELINE DESIGN

• Instruction Execution Phases


o E.g. Fetch, Decode, Issue, Execute, Write-back
o In-order Instruction issuing and Reordered Instruction issuing
• E.g. X = Y + Z , A = B x C
• Mechanisms/Design Issues for Instruction Pipelining
o Pre-fetch Buffers
o Multiple Functional Units
o Internal Data Forwarding
o Hazard Avoidance
• Dynamic Scheduling
• Branch Handling Techniques

INSTRUCTION PIPELINE DESIGN

• Fetch: fetches instructions from memory; ideally one per cycle


• Decode: reveals instruction operations to be performed and identifies the resources needed
• Issue: reserves the resources and reads the operands from registers
• Execute: actual processing of operations as indicated by instruction
• Write Back: writing results into the registers

INSTRUCTION PIPELINE DESIGN
Mechanisms/Design Issues of Instruction Pipeline
• Pre-fetch Buffers
o Sequential Buffers
o Target Buffers
o Loop Buffers

INSTRUCTION PIPELINE DESIGN
Mechanisms/Design Issues of Instruction Pipeline
• Multiple Functional Units
o Reservation Station and Tags
o Slow-station as Bottleneck stage
• Subdivision of Pipeline Bottleneck stage
• Replication of Pipeline Bottleneck stage
• (Example to be discussed)

INSTRUCTION PIPELINE DESIGN
Mechanisms/Design Issues of Instruction Pipeline
• Internal Forwarding and Register Tagging
o Internal Forwarding:
• A “short-circuit” technique to replace unnecessary memory accesses by register-register
transfers in a sequence of fetch-arithmetic-store operations
o Register Tagging:
• Use of tagged registers, buffers and reservation stations, for exploiting concurrent activities
among multiple arithmetic units
o Store-Fetch Forwarding
• (M ← R1, R2 ← M) replaced by (M ← R1, R2 ← R1)
o Fetch-Fetch Forwarding
• (R1 ← M, R2 ← M) replaced by (R1 ← M, R2 ← R1)
o Store-Store Overwriting
• (M ← R1, M ← R2) replaced by (M ← R2)

INSTRUCTION PIPELINE DESIGN
Mechanisms/Design Issues of Instruction Pipeline
• Hazard Detection and Avoidance
o Domain or Input Set of an instruction
o Range or Output Set of an instruction
o Data Hazards: RAW, WAR and WAW
o Resolution using Register Renaming approach
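Hazard classification follows directly from the domain (input set) and range (output set) of two instructions I and J, with J issued after I. A sketch (register names in the test are illustrative):

```python
# RAW / WAR / WAW detection between instruction I and a later J.
# Each instruction is given as (domain, range): its input and output
# location sets.
def hazards(instr_i, instr_j):
    Di, Ri = instr_i
    Dj, Rj = instr_j
    found = set()
    if Ri & Dj:
        found.add("RAW")  # flow dependence: J reads what I writes
    if Ri & Rj:
        found.add("WAW")  # output dependence: both write the same location
    if Di & Rj:
        found.add("WAR")  # anti-dependence: J overwrites what I still reads
    return found
```

Register renaming removes the WAR and WAW cases by giving each write a fresh physical register; only the true RAW dependence remains.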

INSTRUCTION PIPELINE DESIGN
Dynamic Instruction Scheduling
• Idea of Static Scheduling
o Compiler based scheduling strategy to resolve Interlocking among instructions

• Dynamic Scheduling
o Tomasulo’s Algorithm (Register-Tagging Scheme)
• Hardware based dependence-resolution
o Scoreboarding Technique
• Scoreboard: the centralized control unit
• A kind of data-driven mechanism

INSTRUCTION PIPELINE DESIGN
Branch Handling Techniques
• Branch Taken, Branch Target, Delay Slot
• Effect of Branching
o Parameters:
• k: No. of stages in the pipeline
• n: Total no. of instructions or tasks
• p: Percentage of Brach instructions over n
• q: Percentage of successful branch instructions (branch taken) over p.
• b: Delay Slot
• τ: Pipeline Cycle Time
o Branch Penalty = q of (p of n) * bτ = pqnbτ
o Effective Execution Time:
• Teff = [k + (n-1)] τ + pqnbτ = [k + (n-1) + pqnb]τ

INSTRUCTION PIPELINE DESIGN
Branch Handling Techniques
• Effect of Branching
o Effective Throughput:
• Heff = n/Teff
• Heff = n / {[k + (n-1) + pqnb]τ} = nf / [k + (n-1) + pqnb]
• As nInfinity and b = k-1
o H*eff = f / [pq(k-1)+1]
• If p=0 and q=0 (no branching occurs)
o H**eff = f = 1/τ
o Performance Degradation Factor
• D = 1 – H*eff / f = pq(k-1) / [pq(k-1)+1]
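The branching formulas above transcribe directly into code (with f = 1/τ; function names are illustrative):

```python
# Effective execution time, throughput and degradation of a k-stage
# pipeline under branching: p = fraction of branch instructions,
# q = fraction of those taken, b = delay slot, tau = cycle time.
def effective_time(k, n, p, q, b, tau):
    """T_eff = [k + (n - 1) + p*q*n*b] * tau."""
    return (k + (n - 1) + p * q * n * b) * tau

def effective_throughput(k, n, p, q, b, tau):
    """H_eff = n / T_eff."""
    return n / effective_time(k, n, p, q, b, tau)

def degradation(p, q, k):
    """D = p*q*(k-1) / [p*q*(k-1) + 1], the asymptotic throughput loss."""
    pqk = p * q * (k - 1)
    return pqk / (pqk + 1)
```

With no branches (p = q = 0) the time reduces to the ideal [k + (n − 1)]τ.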

INSTRUCTION PIPELINE DESIGN
Branch Handling Techniques
• Branch Prediction
o Static Branch Prediction: based on branch code types
o Dynamic Branch prediction: based on recent branch history
• Strategy 1: Predict the branch direction based on information found at decode stage.
• Strategy 2: Use a cache to store target addresses at effective address calculation stage.
• Strategy 3: Use a cache to store target instructions at fetch stage
o Branch Target Buffer Organization

• Delayed Branches
o A delayed branch of d cycles allows at most d-1 useful instructions to be executed following the
branch taken.
o Execution of these instructions should be independent of branch instruction to achieve a zero
branch penalty

ARITHMETIC PIPELINE DESIGN
Computer Arithmetic Operations
• Finite-precision arithmetic
• Overflow and Underflow
• Fixed-Point operations
o Notations:
• Signed-magnitude, one’s complement and two’s complement notation
o Operations:
• Addition: (n bit, n bit) → (n bit) sum, 1-bit output carry
• Subtraction: (n bit, n bit) → (n bit) difference
• Multiplication: (n bit, n bit) → (2n bit) product
• Division: (2n bit, n bit) → (n bit) quotient, (n bit) remainder

ARITHMETIC PIPELINE DESIGN
Computer Arithmetic Operations
• Floating-Point Numbers
o X = (m, e) representation
• m: mantissa or fraction
• e: exponent with an implied base or radix r.
• Actual Value X = m * r^e
o Operations on numbers X = (mx, ex) and Y = (my, ey)
• Addition: (mx * r^(ex-ey) + my, ey)
• Subtraction: (mx * r^(ex-ey) – my, ey)
• Multiplication: (mx * my, ex + ey)
• Division: (mx / my, ex – ey)
• Elementary Functions
o Transcendental functions like: Trigonometric, Exponential, Logarithmic, etc.
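The (m, e) arithmetic rules above can be exercised directly. This sketch uses exact integer mantissas to keep the arithmetic transparent; real hardware additionally normalizes and rounds the result:

```python
# Floating-point add and multiply on (mantissa, exponent) pairs with an
# implied radix r, following the alignment rule (mx * r**(ex-ey) + my, ey).
def value(x, r=2):
    m, e = x
    return m * r ** e

def fp_add(x, y, r=2):
    """Align to the smaller exponent (scaling the larger-exponent mantissa
    up by r**(ex-ey)), then add the mantissas."""
    (mx, ex), (my, ey) = x, y
    if ex < ey:                        # ensure ex >= ey before scaling
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    return (mx * r ** (ex - ey) + my, ey)

def fp_mul(x, y):
    """Multiply mantissas, add exponents: (mx * my, ex + ey)."""
    (mx, ex), (my, ey) = x, y
    return (mx * my, ex + ey)
```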

ARITHMETIC PIPELINE DESIGN
Static Arithmetic Pipelines
• Separate units for fixed point operations and floating point operations
• Scalar and Vector Arithmetic Pipelines
• Uni-functional or Static Pipelines
• Arithmetic Pipeline Stages
o Majorly involve hardware to perform: Add and Shift micro-operations
o Addition using: Carry Propagation Adder (CPA) and Carry Save Adder (CSA)
o Shift using: Shift Registers

• Multiplication Pipeline Design


o E.g. To multiply two 8-bit numbers that yield a 16-bit product using CSA and CPA Wallace Tree.

ARITHMETIC PIPELINE DESIGN
Multifunctional Arithmetic Pipelines
• Multifunctional Pipeline:
o Static multifunctional pipeline
o Dynamic multifunctional pipeline

• Case Study: TI-ASC static multifunctional pipeline architecture

SUPERSCALAR PIPELINE DESIGN
• Pipeline Design Parameters
o Pipeline cycle, Base cycle, Instruction issue rate, Instruction issue Latency, Simple Operation Latency
o ILP to fully utilize the pipeline
• Superscalar Pipeline Structure
• Data and Resource Dependencies
• Pipeline Stalling
• Superscalar Pipeline Scheduling
o In-order Issue and in-order completion
o In-order Issue and out-of-order completion
o Out-of-order Issue and out-of-order completion
• Superscalar Performance

SUPERSCALAR PIPELINE DESIGN

Parameter                        Base Scalar Processor   Superscalar Processor (degree = K)
Pipeline Cycle                   1 (base cycle)          1
Instruction Issue Rate           1                       K
Instruction Issue Latency        1                       1
Simple Operation Latency         1                       1
ILP to fully utilize pipeline    1                       K

SUPERSCALAR PIPELINE DESIGN

• Time required by base scalar machine:


o T(1,1) = K + N – 1
• The ideal execution time required by m-issue superscalar machine:
o T(m,1) = K + (N – m)/m
o Where,
• K is the time required to execute first m instructions through m pipelines of k-stages
simultaneously
• Second term corresponds to time required to execute remaining N-m instructions , m per
cycle through m pipelines
• The ideal speedup of superscalar machine
o S(m,1) = T(1,1)/T(m,1) = m(N + k – 1)/[N+ m(k – 1)]
• As N → ∞, S(m,1) → m
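The timing formulas above transcribe directly (function names are illustrative):

```python
# Execution time and speedup of an ideal m-issue superscalar machine
# with k pipeline stages running N independent instructions.
def t_base(k, n):
    """Base scalar machine: T(1,1) = k + n - 1 cycles."""
    return k + n - 1

def t_super(m, k, n):
    """Ideal m-issue superscalar machine: T(m,1) = k + (n - m)/m cycles."""
    return k + (n - m) / m

def speedup_super(m, k, n):
    """S(m,1) = T(1,1)/T(m,1) = m*(n + k - 1) / [n + m*(k - 1)]."""
    return t_base(k, n) / t_super(m, k, n)
```

For small N the speedup is well below m; it approaches the issue width m only as N grows large.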
Advanced Computer Architecture

Chapter 7
Multiprocessors and Multicomputers
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani

In this chapter…

• Multiprocessor System Interconnects


• Cache Coherence and Synchronization Mechanisms
• Three Generations of Multi-computers
• Message Routing Schemes

MULTIPROCESSOR SYSTEM INTERCONNECTS

• Network Characteristics
o Topology
• Dynamic Networks
o Timing control protocol
• Synchronous (with global clock)
• Asynchronous (with handshake or interlocking mechanism)
o Switching method
• Circuit switching
• Packet switching
o Control Strategy
• Centralized (global controller to receive requests from all devices and grant network access)
• Distributed (requests handled by local devices independently)

MULTIPROCESSOR SYSTEM INTERCONNECTS

• Hierarchical Bus System


o Local Bus (board level)
• Memory bus, data bus
o Backplane Bus (backplane level)
• VME bus (IEEE 1014-1987), Multibus II (IEEE 1296-1987), Futurebus+ (IEEE 896.1-1991)
o I/O Bus (I/O level)
o E.g. Encore Multimax multiprocessor’s nanobus
• 20 slots
• 32-bit address path
• 64-bit data path
• Clock rate: 12.5 MHz
• Total Memory bandwidth: 100 Megabytes per second

MULTIPROCESSOR SYSTEM INTERCONNECTS

• Hierarchical Buses and Caches


o Cache Levels
• First level caches
• Second level caches
o Buses
• (Intra) Cluster Bus
• Inter-cluster bus
o Cache coherence
• Snoopy cache protocol for coherence among first level caches of same cluster
• Intra-cluster cache coherence controlled among second level caches and results passed to
first level caches
o Use of Bridges between multiprocessor clusters

MULTIPROCESSOR SYSTEM INTERCONNECTS

• Crossbar Switch Design


o Based on number of network stages
• Single stage (or recirculating) networks
• Multistage networks
o Blocking networks
o Non-blocking (re-arranging) networks
• Crossbar networks
o n × m and n² cross-point switch design
o Crossbar benefits and limitations

• Multiport Memory Design


o Multiport Memory

CACHE COHERENCE MECHANISMS

• Cache Coherence Problem


o Inconsistent copies of same memory block in different caches
o Sources of inconsistency:
• Sharing of writable data
• Process migration
• I/O activity

• Protocol Approaches
o Snoopy Bus Protocol
o Directory Based Protocol

• Write Policies
o (Write-back, Write-through) x (Write-invalidate, Write-update)

CACHE COHERENCE MECHANISMS

• Snoopy Bus Protocols


o Write-through caches
• Write invalidate coherence protocol for write-through caches
• Write-update coherence protocol for write-through caches
• Data item states:
o VALID
o INVALID
• Possible operations:
o Read by same processor R(i) Read by different processor R( j )
o Write by same processor W(i) Write by different processor W( j )
o Replace by same processor Z(i) Replace by different processor Z( j )

CACHE COHERENCE MECHANISMS

• Snoopy Bus Protocols


o Write-through caches – write invalidate scheme

Current state = Valid:
  Operation   New State
  R(i)        Valid
  W(i)        Valid
  Z(i)        Invalid
  R(j)        Valid
  W(j)        Invalid
  Z(j)        Valid

Current state = Invalid:
  Operation   New State
  R(i)        Valid
  W(i)        Valid
  Z(i)        Invalid
  R(j)        Invalid
  W(j)        Invalid
  Z(j)        Invalid
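The state table above can be exercised as a small state machine ('i' is the local processor, 'j' a remote one; the helper names are illustrative):

```python
# Transition function for the write-invalidate, write-through snoopy
# protocol, transcribed from the state table above.
TRANSITIONS = {
    ("Valid",   "R(i)"): "Valid",   ("Invalid", "R(i)"): "Valid",
    ("Valid",   "W(i)"): "Valid",   ("Invalid", "W(i)"): "Valid",
    ("Valid",   "Z(i)"): "Invalid", ("Invalid", "Z(i)"): "Invalid",
    ("Valid",   "R(j)"): "Valid",   ("Invalid", "R(j)"): "Invalid",
    ("Valid",   "W(j)"): "Invalid", ("Invalid", "W(j)"): "Invalid",
    ("Valid",   "Z(j)"): "Valid",   ("Invalid", "Z(j)"): "Invalid",
}

def run(state, ops):
    """Apply a sequence of cache events to one cache block's state."""
    for op in ops:
        state = TRANSITIONS[(state, op)]
    return state
```

Note that a remote write W(j) is the only event that invalidates a valid local copy, which is exactly the write-invalidate policy.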
CACHE COHERENCE MECHANISMS

• Snoopy Bus Protocols


o Write-back caches
• Ownership protocol: write-invalidate coherence protocol for write-back caches
• Data item states:
o RO : Read Only (Valid state)
o RW : Read Write (Valid state)
o INV : Invalid state
• Possible operations:
o Read by same processor R(i) Read by different processor R( j )
o Write by same processor W(i) Write by different processor W( j )
o Replace by same processor Z(i) Replace by different processor Z( j )

CACHE COHERENCE MECHANISMS

• Snoopy Bus Protocols


o Write-back caches – write invalidate (ownership protocol) scheme

Current state = RO (Valid):
  Operation   New State
  R(i)        RO
  W(i)        RW
  Z(i)        INV
  R(j)        RO
  W(j)        INV
  Z(j)        RO

Current state = RW (Valid):
  Operation   New State
  R(i)        RW
  W(i)        RW
  Z(i)        INV
  R(j)        RO
  W(j)        INV
  Z(j)        RW

Current state = INV (Invalid):
  Operation   New State
  R(i)        RO
  W(i)        RW
  Z(i)        INV
  R(j)        INV
  W(j)        INV
  Z(j)        INV
CACHE COHERENCE MECHANISMS

• Snoopy Bus Protocols


o Write-once Protocol
• First write using write-through policy
• Subsequent writes using write-back policy
• In both cases, data item copy in remote caches is invalidated
• Data item states:
o Valid :cache block consistent with main memory copy
o Reserved : data has been written exactly once and is consistent with main memory
copy
o Dirty : data is written more than once but is not consistent with main memory copy
o Invalid :block not found in cache or is inconsistent with main memory copy

CACHE COHERENCE MECHANISMS

• Snoopy Bus Protocols


o Write-once Protocol
• Cache events and actions:
o Read-miss
o Read-hit
o Write-miss
o Write-hit
o Block replacement

CACHE COHERENCE MECHANISMS

• Directory Protocol

CACHE COHERENCE MECHANISMS
• Protocol Performance issues
o Snoopy Cache Protocol Performance determinants:
• Workload Patterns
• Implementation Efficiency
o Goals/Motivation behind using snooping mechanism
• Reduce bus traffic
• Reduce effective memory access time
o Data Pollution Point
• Miss ratio decreases as block size increases, up to a data pollution point (that is, as blocks
become larger, the probability of finding a desired data item in the cache increases).
• Beyond the data pollution point, the miss ratio starts to increase again as block size grows.
o Ping-Pong effect on data shared between multiple caches
• If two processes update a data item alternately, the data will continually migrate between the
two caches, causing a high miss rate

THREE GENERATIONS OF MULTICOMPUTERS

• Multicomputer v/s Multiprocessor


• Design Choices for Multi-computers
o Processors
• Low cost commodity (off-the-shelf) processors
o Memory Structure
• Distributed memory organization
• Local memory with each processor
o Interconnection Schemes
• Message passing, point-to-point, direct networks with send/receive semantics, with/without
uniform message communication speed
o Control Strategy
• Asynchronous MIMD, MPMD and SPMD operations

THREE GENERATIONS OF MULTICOMPUTERS

THREE GENERATIONS OF MULTICOMPUTERS

• The Past, Present and Future Development


o First Generation
• Example Systems: Caltech’s Cosmic Cube, Intel iPSC/1, Ametek S/14, nCube/10
o Second Generation
• Example Systems: iPSC/2, i860, Delta, nCube/2, Supernode 1000, Ametek Series 2010
o Third Generation
• Example Systems: Caltech’s Mosaic C, J-Machine, Intel Paragon
o First and second generation multi-computers are regarded as medium-grain systems
o Third generation multi-computers were regarded as fine-grain systems.
o Fine-grain and shared memory approach can, in theory, combine the relative merits of
multiprocessors and multi-computers in a heterogeneous processing environment.

THREE GENERATIONS OF MULTICOMPUTERS

Attribute                                    1st Generation   2nd Generation   3rd Generation
Typical Node:   MIPS                               1                10              100
                MFLOPS (scalar)                    0.1               2               40
                MFLOPS (vector)                   10                40              200
                Memory Size (in MB)                0.5               4               32
Number of Nodes (N)                               64               256             1024
Typical System: MIPS                              64              2560            100 K
                MFLOPS (scalar)                    6.4             512             40 K
                MFLOPS (vector)                  640              10 K            200 K
                Memory Size (in MB)               32               1 K             32 K
Communication   Local neighbour (in microseconds)   2000             5               0.5
Latency:        Non-local node (in microseconds)    6000             5               0.5
THREE GENERATIONS OF MULTICOMPUTERS

MESSAGE PASSING SCHEMES

• Message Routing Schemes


• Message Formats
o Messages
o Packets
o Flits (Flow Control Digits)
• Data Only Flits
• Sequence Number
• Routing Information

• Store-and-forward routing
• Wormhole routing

MESSAGE PASSING SCHEMES

MESSAGE PASSING SCHEMES

• Asynchronous Pipelining

MESSAGE PASSING SCHEMES

• Latency Analysis
o L: Packet length (in bits)
o W: Channel Bandwidth (in bits per second)
o D: Distance (number of nodes traversed minus 1)
o F: Flit length (in bits)
o Communication Latency in Store-and-forward Routing
• TSF = L (D + 1) / W
o Communication Latency in Wormhole Routing
• TWH = L / W + F D / W
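Plugging illustrative numbers into the two formulas above makes the contrast concrete: store-and-forward latency grows linearly with distance D, while wormhole latency is nearly independent of D once L is much larger than F·D. The example values below are assumptions, not from the slides.

```python
# Communication latency formulas from the slide.
def t_store_and_forward(L, W, D):
    """T_SF = L * (D + 1) / W: the whole packet crosses every hop."""
    return L * (D + 1) / W

def t_wormhole(L, W, D, F):
    """T_WH = L / W + F * D / W: only the header flit pays per-hop cost."""
    return L / W + F * D / W

# Example: 1024-bit packet, 32-bit flits, 7 intermediate hops, 1 Gb/s links.
L, W, D, F = 1024, 1e9, 7, 32
print(t_store_and_forward(L, W, D))  # 8x the single-hop time: linear in D
print(t_wormhole(L, W, D, F))        # close to L/W since F*D << L
```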

Advanced Computer Architecture

Chapter 8
Multivector and SIMD Computers
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani



In this chapter…

• Vector Processing Principles


• Compound Vector Operations
• Vector Loops and Chaining
• SIMD Computer Implementation Models

VECTOR PROCESSING PRINCIPLES

• Vector Processing Definitions


o Vector
o Stride
o Vector Processor
o Vector Processing
o Vectorization
o Vectorizing Compiler or Vectorizer

• Vector Instruction Types


o Vector-vector instructions
o Vector-scalar instructions
o Vector-memory instructions

VECTOR PROCESSING PRINCIPLES

VECTOR PROCESSING PRINCIPLES

• Vector-Vector Instructions
o F1: Vi → Vj
o F2: Vi × Vj → Vk
o Examples: V1 = sin(V2), V3 = V1 + V2

• Vector-Scalar Instructions
o F3: s × Vi → Vj
o Example: V2 = 6 + V1

• Vector-Memory Instructions
o F4: M → V (Vector Load)
o F5: V → M (Vector Store)
o Examples: X = V1 (store), V2 = Y (load)

VECTOR PROCESSING PRINCIPLES

• Vector Reduction Instructions


o F6: Vi  s
o F7: Vi x Vj  s

• Gather and Scatter Instructions


o F8: M  Vi x Vj (Gather)
o F9: Vi x Vj  M (Scatter)

• Masking
o F10: Vi x Vm  Vj (Vm is a binary vector)

• Examples…
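The semantics of gather (F8), scatter (F9) and masking (F10) can be sketched with plain lists, where an index vector plays the role of Vj and a binary vector the role of Vm. This is an illustrative sketch, not hardware syntax.

```python
# Gather / scatter / mask semantics sketched with plain Python lists.
def gather(memory, idx):
    """F8: M -> Vi x Vj — fetch memory at the positions named in idx."""
    return [memory[i] for i in idx]

def scatter(memory, idx, values):
    """F9: Vi x Vj -> M — store values at the positions named in idx."""
    for i, v in zip(idx, values):
        memory[i] = v
    return memory

def mask_select(v, vm):
    """F10: Vi x Vm -> Vj — compress v using the binary mask vector vm."""
    return [x for x, m in zip(v, vm) if m]

M = [10, 20, 30, 40, 50]
print(gather(M, [4, 0, 2]))                     # [50, 10, 30]
print(scatter(M[:], [1, 3], [99, 77]))          # [10, 99, 30, 77, 50]
print(mask_select([7, 8, 9, 6], [1, 0, 1, 0]))  # [7, 9]
```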

VECTOR PROCESSING PRINCIPLES

VECTOR PROCESSING PRINCIPLES

• Vector-Access Memory Schemes


o Vector-operand Specifications
• Base address, stride and length
o C-Access Memory Organization
• Low-order m-way interleaved memory
o S-access Memory Organizations
• High-order m-way interleaved memory
o C/S Access Memory Organization

• Early Supercomputers (Vectors Processors)


o Cray Series ETA 10E NEC Sx-X 44
o CDC Cyber Fujitsu VP2600 Hitachi 820/80

VECTOR PROCESSING PRINCIPLES

• Relative Vector/Scalar Performance


o Vector/scalar speed ratio r
o Vectorization ratio in program f
o Relative Performance P is given by:
• P = 1 / ((1 − f) + f/r) = r / ((1 − f) r + f)
o When f is low, the speedup cannot be high even with very high r
o Limiting Case:
• P  1 if f  0
o Maximum Case:
• P  r if f  1
o Powerful single chip processors and multicore system-on-a-chip provide High-Performance
Computing (HPC) using MIMD and/or SPMD configurations with large no. of processors.
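The formula for P can be evaluated directly to see the Amdahl's-law behaviour the slide describes: a fast vector unit (large r) helps little unless the vectorization ratio f is close to 1. The sample values of f and r below are illustrative.

```python
def relative_performance(f, r):
    """P = 1 / ((1 - f) + f / r): speedup over all-scalar execution."""
    return 1.0 / ((1.0 - f) + f / r)

# Even with r = 100, a half-vectorized program barely doubles in speed:
print(relative_performance(0.5, 100))   # ~1.98
print(relative_performance(0.99, 100))  # ~50: P approaches r only as f -> 1
print(relative_performance(0.0, 100))   # 1.0: limiting case P -> 1 if f -> 0
```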

COMPOUND VECTOR PROCESSING

• Compound Vector Operations


o Compound Vector Functions (CVFs)
• Composite function of vector operations converted from a looping structure of linked scalar
operations
o CVF Example: The SAXPY (Single-precision A multiply X Plus Y) Code
• For I = 1 to N
o Load R1, X(I)
o Load R2, Y(I)
o Multiply R1, A
o Add R2, R1
o Store Y(I), R2
• (End of Loop)
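The SAXPY loop above can be written once as the scalar per-element loop the slides show, and once as a single compound vector operation over whole vectors. The Python below is an illustrative sketch of the two styles, not machine code.

```python
# SAXPY: Y(I) = A * X(I) + Y(I), as a scalar loop and as one vector operation.
def saxpy_scalar(a, x, y):
    for i in range(len(x)):          # one trip per element: load, mul, add, store
        y[i] = a * x[i] + y[i]
    return y

def saxpy_vector(a, x, y):
    # One compound vector function applied element-wise to whole vectors.
    return [a * xi + yi for xi, yi in zip(x, y)]

x, y = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
print(saxpy_scalar(2.0, x, y[:]))    # [12.0, 24.0, 36.0]
print(saxpy_vector(2.0, x, y))       # same result, no per-element loop control
```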

COMPOUND VECTOR PROCESSING

• One-dimensional CVF Examples


o V1(I) = V2(I) + V3(I) × V4(I)
o V1(I) = B(I) + C(I)
o A(I) = V1(I) × S + B(I)
o A(I) = V1(I) + B(I) + C(I)
o A(I) = Q × V1(I) × (R × B(I) + C(I)), etc.
Legend:
o Vi(I) are vector registers
o A(I), B(I), C(I) are vectors in memory
o Q, S are scalars available from scalar registers or memory

COMPOUND VECTOR PROCESSING

• Vector Loops
o Vector segmentation or strip-mining approach
o Example

• Vector Chaining
o Example: SAXPY code
• Limited Chaining using only one memory-access pipe in Cray-1
• Complete Chaining using three memory-access pipes in Cray X-MP

• Functional Unit Independence


• Vector Recurrence
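The strip-mining (vector segmentation) approach mentioned above can be sketched as follows: a loop of length n is cut into strips no longer than the machine's maximum vector length. The MVL value of 64 is an illustrative assumption (it matches classic Cray-style register lengths, but the slides do not fix it).

```python
# Strip-mining sketch: cover range(n) with vector strips of length <= MVL.
MVL = 64  # assumed maximum vector register length

def strip_mine(n):
    """Yield (start, length) for each vector strip covering n elements."""
    start = 0
    while start < n:
        length = min(MVL, n - start)   # last strip may be shorter
        yield start, length
        start += length

print(list(strip_mine(150)))   # [(0, 64), (64, 64), (128, 22)]
```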

COMPOUND VECTOR PROCESSING

COMPOUND VECTOR PROCESSING

SIMD COMPUTER ORGANIZATIONS

• SIMD Computer Variants


o Array Processor
o Associative Processor
• SIMD Processor v/s SISD v/s Vector Processor Operation
o Illustration: for(i=0;i<5;i++) a[i] = a[i]+2;
o Lockstep mode of operation in SIMD processor
o Relative Performance comparison
• SIMD Implementation Models
o Distributed Memory Model
• E.g. Illiac IV
o Shared memory Model
• E.g. BSP (Burroughs Scientific Processor)
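The lockstep operation of the slide's loop `a[i] = a[i] + 2` can be sketched as follows: each PE holds one element and all enabled PEs execute the same instruction in the same step (one step total, versus five scalar iterations). The 5-PE layout and the optional mask are illustrative assumptions.

```python
# SIMD lockstep sketch: one "add scalar" instruction broadcast to all PEs.
def simd_add_scalar(elements, s, mask=None):
    """Each PE adds s to its element; masked-off PEs keep their old value."""
    if mask is None:
        mask = [True] * len(elements)   # all PEs enabled by default
    return [e + s if m else e for e, m in zip(elements, mask)]

a = [3, 1, 4, 1, 5]                      # one element per PE
print(simd_add_scalar(a, 2))             # [5, 3, 6, 3, 7] — a[i] = a[i] + 2
print(simd_add_scalar(a, 2, [1, 0, 1, 0, 1]))  # disabled PEs unchanged
```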

SIMD COMPUTER ORGANIZATIONS

SIMD COMPUTER ORGANIZATIONS

SIMD COMPUTER ORGANIZATIONS

• SIMD Instructions
o Scalar Operations
• Arithmetic/Logical
o Vector Operations
• Arithmetic/Logical
o Data Routing Operations
• Permutations, broadcasts, multicasts, rotation and shifting
o Masking Operations
• Enable/Disable PEs
• Host and I/O
• Bit-slice and Word-slice Processing
o WSBS, WSBP, WPBS, WPBP

Advanced Computer Architecture

Chapter 9
…Dataflow Architectures
Book: “Advanced Computer Architecture – Parallelism, Scalability, Programmability”, Hwang & Jotwani



In this chapter…

• Evolution of Dataflow Computers


• Dataflow Graphs
• Static v/s Dynamic Data Flow Computers
• Pure Dataflow Machines
• Explicit Token Store Machines
• Hybrid and Unified Architectures
• Dataflow v/s Control flow Computers

DATAFLOW AND HYBRID ARCHITECTURES

• Data-driven machines
• Evolution of Dataflow Machines
• Dataflow Graphs
o Dataflow Graphs examples.
o Activity Templates and Activity Store
o Example: dataflow graph for cos x
• cos x ≅ 1 − x²/2! + x⁴/4! − x⁶/6!
o More examples
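The truncated series for cos x can be evaluated directly; in the dataflow graph each term is an independent node that can fire as soon as its inputs (powers of x) are available.

```python
# Evaluate cos x ~ 1 - x^2/2! + x^4/4! - x^6/6! (four independent terms).
from math import factorial, cos

def cos_approx(x):
    return sum((-1) ** k * x ** (2 * k) / factorial(2 * k) for k in range(4))

x = 0.5
print(cos_approx(x))                         # close to math.cos(0.5)
print(abs(cos_approx(x) - cos(x)) < 1e-6)    # True for small x
```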

DATAFLOW AND HYBRID ARCHITECTURES

DATAFLOW AND HYBRID ARCHITECTURES

• Static Dataflow Computers


o Special Feedback Acknowledge Signals between nodes
o Firing Rule:
• A node is enabled as soon as tokens are present on all input arcs and there is no token on
any of its output arc
o Example: Dennis Machine (1974)

• Dynamic Dataflow Computers


o Tagged Tokens
o Firing Rule:
• A node is enabled as soon as tokens with identical tags are present at each of its input arcs
o Example: MIT Tagged Token Dataflow Architecture (TTDA) machine (just simulation, never built)

DATAFLOW AND HYBRID ARCHITECTURES

• Diagrams of static dataflow and dynamic dataflow


• from Hwang and Briggs….

DATAFLOW AND HYBRID ARCHITECTURES

• Pure Dataflow Machines


o TTDA (1983)
• TTDA was simulated but never built
o Manchester Dataflow Computer (1982)
• Operated asynchronously using a separate clock for each PE
o ETL Sigma-1 (1987)
• 128 PEs fully synchronous with a 10-Mhz clock
• Implemented I-structured memory proposed in TTDA
• Lacked High Level Languages for users

DATAFLOW AND HYBRID ARCHITECTURES

• Explicit Token Store Machines


o Eliminate associative token matching.
o Waiting token memory is directly accessed using full/empty bits.
o Examples
• MIT/Motorola Monsoon (proposed 1988; operational 1991)
o Multithreading support
o 8 processors
o 8 I-structure memory modules
o 8x8 crossbar network
• ETL EM-4 (1989)
o Extension of Sigma-1
o Proposed 1024 nodes; Operational Implementation 80 nodes

DATAFLOW AND HYBRID ARCHITECTURES

• Hybrid and Unified Architectures


o Combining Features of von-Neumann and Dataflow architectures
o Examples:
• MIT P-RISC (1988)
• IBM Empire (1991)
• MIT/Motorola *T (1991)
o “RISC-ified” dataflow architecture
• Implemented in P-RISC
• Splitting complex dataflow instructions into separate simple component instructions
• Tighter encoding and longer threads for better performance

• Dataflow Processing v/s Control Flow Processing

DATAFLOW AND HYBRID ARCHITECTURES

• Computing ab + a/c with:


(a) control flow; (b) dataflow. Pure dataflow basic execution pipeline: (c) single-token-per-arc dataflow;
(d) tagged-token dataflow; (e) explicit token store dataflow
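The dataflow evaluation of ab + a/c can be sketched as a toy single-token-per-arc simulator: a node fires as soon as tokens are present on all of its input arcs, so the multiply and the divide are free to fire in either order (or in parallel). The node and token names below are illustrative.

```python
# Toy single-token-per-arc dataflow evaluation of a*b + a/c.
import operator

def run_dataflow(a, b, c):
    tokens = {"a": a, "b": b, "c": c}           # tokens on input arcs
    nodes = [("t1", operator.mul, "a", "b"),     # t1 = a * b
             ("t2", operator.truediv, "a", "c"),  # t2 = a / c
             ("out", operator.add, "t1", "t2")]   # out = t1 + t2
    fired = []
    while nodes:
        for node in nodes:
            dest, op, x, y = node
            if x in tokens and y in tokens:      # firing rule: all inputs ready
                tokens[dest] = op(tokens[x], tokens[y])
                fired.append(dest)
                nodes.remove(node)
                break                            # rescan for newly enabled nodes
    return tokens["out"], fired

print(run_dataflow(6, 2, 3))   # (14.0, ['t1', 't2', 'out'])
```

Note how `out` can only fire last (its inputs are produced by the other nodes), while `t1` and `t2` are data-independent: the execution order is driven purely by data availability, not by a program counter.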
