Nothing Special   »   [go: up one dir, main page]

Parallel Computations & Applications: National Tsing-Hua University 2017, Summer Semester

Download as pdf or txt
Download as pdf or txt
You are on page 1of 75

Parallel Computations

& Applications

National Tsing-Hua University


2017, Summer Semester
Outline
 Embarrassingly Computations
 Divide-And-Conquer Computations
 Pipelined Computations
 Synchronous Computations

Parallel Programming – NTHU LSA Lab 2


Outline
 Embarrassingly Computations
 Image Transformations
 Mandelbrot Set
 Monte Carlo Methods

 Divide-And-Conquer Computations
 Pipelined Computations
 Synchronous Computations

Parallel Programming – NTHU LSA Lab 3


What is Embarrassingly Parallel
 A computation that can be divided into a number of
completely independent tasks

Input Data
Divide input to tasks

Processes

Collect results

Results

Parallel Programming – NTHU LSA Lab 4


Example 1: Image Transformations
 Low-level image operations:
 Shifting: object shifted by ∆𝑥 in the 𝑥-dimension
and ∆𝑦 in the 𝑦-dimension:
𝑥 ′ = 𝑥 + ∆𝑥, 𝑦 ′ = 𝑦 + ∆𝑦
 Scaling: object scaled by a factor of 𝑆𝑥 in the 𝑥-
direction and 𝑆𝑦 in the 𝑦-direction;
𝑥 ′ = 𝑥𝑆𝑥 , 𝑦 ′ = 𝑦𝑆𝑦
 Rotation: object rotated through the angle 𝜃
about the origin of the coordinate system:
𝑥 ′ = 𝑥 cos 𝜃 + 𝑦 sin 𝜃
𝑦 ′ = −𝑥 sin 𝜃 + 𝑦 cos 𝜃
Parallel Programming – NTHU LSA Lab 5
Image Region Partitioning
 Square region  Row region  Column region
partition partition partition

map map map

process process process

Parallel Programming – NTHU LSA Lab 6


640
Pseudo-code for Image Shift 10 x 640

……
 Partition region by ROW with width 10 480
//master process
for(i=0, row=0; i<48; i++, row+=10) // for each of 48 processes
send(row, Pi); // send row no. P47
for(i=0; i<480; i++) for(j=0; j<640; j++) temp_map[i][j] = 0; // initialize temp
for(i=0; i<(480*640); i++) { // for each pixel
recv(oldrow, oldcol, newrow, newcol, PANY); // accept new coordinates
if !((newrow<0)||((newrow>=480)||(newcol<0)||((newcol>=640))
temp_map[newrow][newcol] = map[oldrow][oldcol];
}
for(i=0; i<480; i++) for(j=0; j<640; j++) map[i][j] = temp_map[i][j]; // update map
// slave process
recv (row, Pmaster);
for (oldrow = row; oldrow < (oldrow+10); oldrow++) // for each row in the partition
for (oldcol = 0; oldcol < 640; oldcol++) { // for each column in the row
newrow = oldrow + delta_x; // shift along x-dimension
newcol = oldcol + delta_y; // shift along y-dimension
send(oldrow, oldcol, newrow, newcol, Pmaster); // send out new coordinates
}
Example 2: Mandelbrot Set
The Mandelbrot Set is a set of complex numbers that are

quasi-stable when computed by iterating the function:
𝑍0 = 𝐶, 𝑍𝑘+1 = 𝑍𝑘 2 + 𝐶
 C is some complex number: 𝑪 = 𝑎 + 𝑏𝒊
 𝒁𝒌+𝟏 is the (k+1)th iteration of the complex number
 If |Zk| ≤ 2 for ANY k  C belongs to Mandelbrot Set
𝑍 = 𝑎 2 + 𝑏2
Once |Z | > 2, it will increase forever!
k
𝑘
|Zk| 2i 2+2i
C =-1+0.25i, NOT part of the set
C = -1+0.75i, part of the set 𝑍𝑘 = 22 + 22 = 8

4 -2 2
2 2-i
𝑍𝑘 = 22 + (−1)2 = 5
-2i
0 Iteration ∞
Parallel Programming – NTHU LSA Lab 8
Fractal
 What exact is Mandelbrot Set?
 It is a fractal: An object that display self-similarity at
various scale; Magnifying a fractal reveals small-scale 2
details similar to the large-scale characteristics
 After plotting the Mandelbrot Set
determined by thousands of iteration:
 Add color to the points outside the set &
zoom in at the center of the image: -2i 2i
-2
Parallel Programming – NTHU LSA Lab 10
Mandelbrot Set Program
2
 Compute 𝑍𝑘+1 = 𝑍𝑘 + 𝐶
 Let 𝐶 = 𝐶𝑟𝑒𝑎𝑙 + 𝐶𝑖𝑚𝑎𝑔 𝑖 , 𝑍𝑘 = 𝑍𝑟𝑒𝑎𝑙 + 𝑍𝑖𝑚𝑎𝑔 𝑖
 𝑍𝑘+1 = 𝑍𝑟𝑒𝑎𝑙 2 − 𝑍𝑖𝑚𝑎𝑔 2 + 2𝑍𝑟𝑒𝑎𝑙 𝑍𝑖𝑚𝑎𝑔 𝑖 + 𝐶𝑟𝑒𝑎𝑙 + 𝐶𝑖𝑚𝑎𝑔 𝑖
 𝑍𝑟𝑒𝑎𝑙_𝑛𝑒𝑥𝑡 = 𝑍𝑟𝑒𝑎𝑙 2 − 𝑍𝑖𝑚𝑎𝑔 2 + 𝐶𝑟𝑒𝑎𝑙
 𝑍𝑖𝑚𝑎𝑔_𝑛𝑒𝑥𝑡 = 2𝑍𝑟𝑒𝑎𝑙 𝑍𝑖𝑚𝑎𝑔 + 𝐶𝑖𝑚𝑎𝑔

 Represent image number in program


 C = 2 + 4i C.real = 2, C.imag = 4
Struct complex {
float real;
float imag;
};
Parallel Programming – NTHU LSA Lab 11
Sequential Mandelbrot Set Program
 Testing program:
 Giving a complex number
 Return the iteration number when |Zk| > 2
 Let the maximum iteration is 256
int cal_pixel (complex c) {
int count = 0; // number of iterations
int max= 256; // maximum iteration is 256
float temp, lengthsq;
complex z; // initialize complex number z
z.real = 0; z.imag = 0;
do {
temp = (z.real * z.real) – (z.imag * z.imag) + c.real; // compute next z.real
z.imag = (2 * z.real * z.imag) + c.imag; // compute next z.imag
z.real = temp;
lengthsq = (z.real * z.real) + (z.imag * z.imag);
count++; // update iteration counter
} while ((lengthsq < 4.0) && (count < max));
return count;
} Parallel Programming – NTHU LSA Lab 12
Sequential Mandelbrot Set Program
 Scaling Coordinate Display Program:
 Plot the Mandelbrot Set from the
coordinate system
 Color indicate the iteration number
black=256, white=0
 Points are apart with a fixed distance
read_disk, imag_dist

for (x=real_min; x < real_max; x += real_dist) {


for (y=imag_min; y < imag_max; x += imag_dist) {
c.real = x; c.img = y;
color = cal_pixel (c);
display(x, y, color);
}
}
Parallel Programming – NTHU LSA Lab 13
Parallelizing Mandelbrot Set Program
 Partition screen 640*480 by row using 48 processes
//master process
for(i=0, row=0; i<48; i++, row+=10) // for each process
send(row, Pi); // send row no.
for(i=0; i<(480*640); i++) { // for each pixel point
recv(&x, &y, &color, PANY); // receive coordinate/colors
display(x, y, color); // display pixel
}
//slave process 640
recv (&row, Pmaster);
for (x=0; x < 640; x++) { 10 x 640

……
for (y=row; y < (row+10); y++) {
c.real = min_real + (x * scale_real); 480
c.imag = min_imag + (y * scale_image);
color = cal_pixel (c);
send(x, y, &color, Pmaster);
} Each process may P47
} have different load!
Dynamic Task Assignment
 Work pool / Processor Farm
 Useful when tasks require different execution time
 Dynamic load balancing

Work Pool

(xa, ya) (xc, yc)


(xb, yb)
(xd, yd)
1. Send task
2. Return result &
request new task


3. Send termination
……………………...
Parallel Programming – NTHU LSA Lab 15
Coding for Work Pool Approach
//master process
count = 0; // # of active processes
row = 0; // row being sent
for (k=0; k<num_proc; k++) { // send initial row to each processes
send(row, Pi , data_tag);
count++;
row++;
}
do {
recv(&slave, &r, color, PANY , result_tag);
count--;
if (row < num_row) { // keep sending until no new task
send(row, Pslave , data_tag); // send next row
count++; Tag is needed to distinguish
row++;
between data and termination msg
} else {
send(row, Pslave , terminate_tag); // terminate
}
display(r, color); // display row
} while(count > 0);
Coding for Work Pool Approach
//slave process P ( i )
recv(&row, Pmaster , source_tag);
while (source_tag == data_tag) { // keep receiving new task
c.imag = min_imag + (row * scale_image);
for (x=0; x<640; x++) {
c.real = min_real + (x * scale_real);
color[x] = cal_pixel (c); // compute color of a single row
}
send(i, row, color, Pmaster , result_tag); // send process id and results
recv(&row, Pmaster , source_tag);
}

Parallel Programming – NTHU LSA Lab 17


Example 3: Monte Carlo Methods
 Monte Carlo methods: a class of computational
algorithms that rely on repeated random sampling
to compute their results
 Invented in 1940s by John von Neumann,
Stanislaw Ulam and Nicholas Metropolis,
while they were working on nuclear weapon
(Manhattan Project)
 Especially useful for simulating systems
with many coupled degrees of freedom,
such as fluids, disordered material

Parallel Programming – NTHU LSA Lab 18


Monte Carlo Methods --- 𝜋 calculation
 How to compute 𝜋 ???
 Definition of 𝜋: the area of a circle with unit radius
Total area = 4
𝐀𝐫𝐞𝐚 𝐨𝐟 𝐜𝐢𝐫𝐜𝐥𝐞 𝝅
 We know: =
𝐀𝐫𝐞𝐚 𝐨𝐟 𝐬𝐪𝐮𝐚𝐫𝐞 𝟒
 Randomly choose points from
the square 2 Area = 𝝅

 Giving sufficient number of samples,


the fraction of points within the circle
will be 𝜋/4!!!
 E.g.: With 10,000 randomly sample points 2
𝑥2 + 𝑦2 > 1
we expect 7854 points within the circle
7854/10000 =𝜋/4 𝜋 = 7854/10000*4 = 3.1416
Parallel Programming – NTHU LSA Lab 19
Monte Carlo Methods --- Integral
 Monte Carlo Method can compute ANY definite integral!
 max and min values of the integral must be known
 Very inefficient….
𝑥𝑚𝑎𝑥
 Method: 𝐴𝑟𝑒𝑎 = 𝑓 𝑥 𝑑𝑥
𝑥𝑚𝑖𝑛
 Randomly choose point 𝑥, 𝑦 :
𝑥𝑚𝑎𝑥 ≤ 𝑥 ≤ 𝑥𝑚𝑖𝑛 𝒇 𝒙
𝑦𝑚𝑎𝑥 ≤ 𝑦 ≤ 𝑦𝑚𝑖𝑛 𝑦𝑚𝑎𝑥
 Compute the area (integral)
according to the ratio of points
inside and outside the area 𝑦𝑚𝑖𝑛
𝑥𝑚𝑖𝑛 𝑥𝑚𝑎𝑥
 just like the computation of 𝜋
 Given any point (x, y), outside means : 𝑦 > 𝑓(𝑥)
Parallel Programming – NTHU LSA Lab 20
Outline
 Embarrassingly Computations
 Divide-And-Conquer Computations
 Adding Numbers
 Bucket Sort
 N-Body Simulation

 Pipelined Computations
 Synchronous Computations

Parallel Programming – NTHU LSA Lab 21


What is Divide & Conquer
 Recursively divide a problem into sub-problems that
are of the same form as the larger problem
Initial
Problem
divide

conquer

Final
Result 22
Example 1: Adding Numbers
 Add a sequence of numbers
 Sequential Recursive Code:  Parallel Code:
Scatter the numbers
int add (int* numbers) {
if (len(numbers) <=2) { then reduce results
return numbers[1]+numbers[2];
1,2,3,4
} else {
divide (numbers, sub_num1, sub_num2); 1,2 3,4
part_sum1 = add(sub_num1);
part_sum2 = add(sub_num2); 1 2 3 4
}
return (part_sum1+part_sum2); 3 7
}
10

Parallel Programming – NTHU LSA Lab 23


Example 2: Bucket Sort
 Algorithm
1. Range of numbers is divided into m equal regions
2. One bucket is assigned for each region
3. Place numbers to buckets based on the region
4. Use sequential sort for each bucket
Unsorted numbers
2,4,6,1,8,10,5,9,0,3,7,11

Merge list 0,1,2 3,4,5 6,7,8 9,10,11


0~2 3~5 6~8 9~11
 Only effective if number of items per bucket is similar!!
 Numbers should have a known interval ([max, min])
 Numbers better to be uniformly distributed
Parallel Programming – NTHU LSA Lab 24
Complexity Analysis
 Sequential:
1. Distribute numbers to bucket: O(n)
2. Sequential sort each bucket: (n/m)log(n/m) x m
 Overall: O(n log(n/m))

 Parallelize sorting: one process per bucket


1. Distribute numbers to bucket: O(n)
2. Sequential sort each bucket: (n/m)log(n/m)
 Overall: O(n + n/m log(n/m))

 A single process must scan through all numbers in step1


Parallel Programming – NTHU LSA Lab 25
Further Parallelized Bucket Sort
 Parallelize partitioning and sorting:
Partition numbers to m parts/processes

 Each process divides its numbers to small buckets
 Merge small buckets to large bucket
 Sequential sort each bucket
Unsorted numbers 2,4,6,1,8,10,5,9,0,3,7,11
0~2 3~5 6~8 9~11

P1 P2 P3 P4
2,4,6 1,8,10 5,9,0 3,7,11 Phase1:
2 4 6 1 8 10 0 5 9 3 7 11 Partition

Phase2:
Merge list 0,1,2 3,4,5 6,7,8 9,10,11 Sorting 26
Example 3: N-Body Problem
 Newtonian laws of physics
 The gravitational force between two
bodies of masses 𝑚𝑎 & 𝑚𝑏 :
𝐺𝑚𝑎 𝑚𝑏
𝐹=
𝑟2
 Subject to the force, acceleration occurs
𝐹 =𝑚×𝑎
 Let the time interval be ∆𝑡 &
current velocity 𝑣 𝑡 , position 𝑥 𝑡
 New velocity 𝑣 𝑡+1 :
𝑣 𝑡+1 − 𝑣 𝑡 𝑡+1 𝑡
𝐹∆𝑡
𝐹=𝑚 ⇒𝑣 =𝑣 +
∆𝑡 𝑚
 New position 𝑥 𝑡+1 :
𝑥 𝑡+1 = 𝑥 𝑡 + 𝑣 𝑡+1 ∆𝑡 27
Three-Dimensional Space
 Considering 2 bodies at (𝑥𝑎 , 𝑦𝑎 , 𝑧𝑎 )& 𝑥𝑏 , 𝑦𝑏 , 𝑧𝑏
𝑟 = 𝑥𝑎 − 𝑥𝑏 2 + 𝑦𝑎 − 𝑦𝑏 2 + 𝑧𝑎 − 𝑧𝑏 2
 The forces, velocities and positions can be resolved
in the three direction independently
𝐺𝑚𝑎 𝑚𝑏 𝑥𝑏 − 𝑥𝑎
𝐹𝑥 =
𝑟2 𝑟
𝐺𝑚𝑎 𝑚𝑏 𝑦𝑏 − 𝑦𝑎
𝐹𝑦 = 2
( )
𝑟 𝑟
𝐺𝑚𝑎 𝑚𝑏 𝑧𝑏 − 𝑧𝑎
𝐹𝑧 = 2
( )
𝑟 𝑟
Parallel Programming – NTHU LSA Lab 28
N-Body Sequential Code
 Assume all bodies have the same mass m
for (t=0; t<T; t++) {
for (i=0; i<N; i++) {
F = Compute_Force(i); // compute force in O(N^2)
v_new[i] = v[i] + F *dt / m; // compute new velocity
x_new[i] = x[i] + v_new[i] * dt; // compute new position
}
for(i=0; i<N; i++){
x[i] = x_new[i]; // update position
v[i] = v_new[i]; // update velocity
}
}
 Non-feasible as N increases due to O(𝑁 2 )complexity
Parallel Programming – NTHU LSA Lab 29
Approximate Algorithms
 Reduce time complexity by approximating a
cluster of bodies as a single distant body
How to find those clusters of bodies?
Center
m = m1 + m2 of mass
x = (x1*m1 + x2*m2) / m
y = (y1*m1 + y2*m2) / m

r Distant cluster
of bodies

Parallel Programming – NTHU LSA Lab 30


Barnes-Hut Algorithm
 Step1: Recursively divide space by two in each dimensions
 Record the center mass and position of each internal node
Center mass

Parallel Programming – NTHU LSA Lab 31


Barnes-Hut Algorithm
 Step1: Recursively divide space by two in each dimensions
 Record the center mass and position of each internal node

Parallel Programming – NTHU LSA Lab 32


Barnes-Hut Algorithm
 Step1: Recursively divide space by two in each dimensions
 Record the center mass and position of each internal node

Parallel Programming – NTHU LSA Lab 33


Barnes-Hut Algorithm
 Step1: Recursively divide space by two in each dimensions
 Record the center mass and position of each internal node

Parallel Programming – NTHU LSA Lab 34


Barnes-Hut Algorithm
 Step2: Compute approximate forces on each object
1. traverse the nodes of the tree, starting from the root.
2. If the center-of-mass of an internal node is sufficiently far
from the body, approximate the internal node as a single body
 Far is determined by a parameter: θ=d/r
r: the distance between the body
and the node’s center-of-mass
d: the width of the region

Example: θ=0.5
d/r=2.5 > θ
r=4
a

35
d = 10
Barnes-Hut Algorithm
 Step2: Compute approximate forces on each object
1. Traverse the nodes of the tree, starting from the root.
2. If the center-of-mass of an internal node is sufficiently far
from the body, approximate the internal node as a single body
 Far means d/r < θ (e.t. 0 < θ <1)
r: the distance between the body
and the node’s center-of-mass D C B A
d: the width of the region
x
Example: θ=1 D C
d/rA=16/10 > θ
d/rB=16/2 > θ
A B
d/rC=16/15 > θ
d/rD=16/20 < θ
x 36
d = 16
Barnes-Hut Algorithm
 Step2: Compute approximate forces on each object
3. If it is a leaf node, calculate the force and add to the object.
4. Otherwise, recursively compute the force from
children of the internal node.
Example: θ=1
d/rA’=8/7 > θ  A’ is a leaf node
d/rB'=8/15 < θ  B’ treated like a single node D C B A
d/rC'=8/20 < θ  C’ is a leaf node
C’
C’ B’ A’ x

B’

A’
x 37
d=8
Barnes-Hut Algorithm
 θ controls the accuracy and approximation error of the
algorithm
 θ = 0  d/r ALWAYS larger than θ  same as brute force
 θ = 1  most likely only need to consider the object within the
same cluster/region
 If the tree is balanced, the complexity is O(nlogn)
 But in general , the tree could be very unbalanced ……..
 The tree must be re-built for each time interval

Parallel Programming – NTHU LSA Lab 38


Orthogonal Recursive Bisection Method
 Recursively evenly divide space with the same
number of bodies in each of the dimensions
Divide along x dimension

Parallel Programming – NTHU LSA Lab 39


Orthogonal Recursive Bisection Method
 Recursively evenly divide space with the same
number of bodies in each of the dimensions
Divide along y dimension

Parallel Programming – NTHU LSA Lab 40


Orthogonal Recursive Bisection Method
 Recursively evenly divide space with the same
number of bodies in each of the dimensions
Divide along x dimension

Parallel Programming – NTHU LSA Lab 41


Orthogonal Recursive Bisection Method
 Recursively evenly divide space with the same
number of bodies in each of the dimensions
Divide along y dimension
Balanced tree

Parallel Programming – NTHU LSA Lab 42


Orthogonal Recursive Bisection Method
 It is more balanced, but less accurate
 Objects close to each other may not in the same cluster
Divide along y dimension
Balanced tree

Parallel Programming – NTHU LSA Lab 43


Outline
 Embarrassingly Computations
 Divide-And-Conquer Computations
 Pipelined Computations
 Adding Numbers
 Sorting Numbers
 Linear Equation Solver

 Synchronous Computations

Parallel Programming – NTHU LSA Lab 44


What is Pipelined Computations
 A problem is divided into a series of tasks
 Tasks have to be completed one after the other
 Each task will be executed by a separate process
or processor

P0 P1 P2 P3

Parallel Programming – NTHU LSA Lab 45


Types of Pipelined Computations
 Pipelined approach can provide increased
speed under three types of computations:
1. If more than one instance of the complete problem
is to be executed
2. If a single instance has a series of data items must
be processed, each requiring multiple operations
3. If information to start the next process can be
passed forward before the process has completed
all its internal operations

Parallel Programming – NTHU LSA Lab 46


Type 1 Pipelined Computations
1. If more than one instance of the complete problem
is to be executed

(Alternative space-time diagram)


 After the first (p-1) cycles, one problem instance is
completed in each pipeline cycle
 The number of instance should be >> the number of
processes Parallel Programming – NTHU LSA Lab 47
Type 2 Pipelined Computations
2. If a series of data items must be processed,
each requiring multiple operations

Parallel Programming – NTHU LSA Lab 48


Types 3 Pipelined Computations
3. Only one problem instance, but each process
can pass on information to the next process,
before it has completed

Parallel Programming – NTHU LSA Lab 49


Example1: Adding Numbers
 Compute sum of an array:
 for(i=0; i<n; i++) sum += A[i]

 Pipeline for an unfolded loop:


 sum += A[0], sum += A[1], sum += A[2], ……

A[0] A[1] A[2] A[n]

A A A A
sum Sin Sout Sin Sout Sin Sout …… Sin Sout

Parallel Programming – NTHU LSA Lab 50


Example1: Adding Numbers
 The basic code for Pi:  SPMD Program:
recv(&sum, Pi-1);
// code for process Pi
sum += number;
if (Pi != P0) {
send(&sum, Pi+1);
recv(&sum, Pi-1);
 For the first process, P0: sum += number;
}
send(&sum, Pi+1);
if (Pi != Pn) {
 For the last process, Pn-1: send(&sum, Pi+1);
}
recv(&sum, Pi-1);
sum += number;

Parallel Programming – NTHU LSA Lab 51


Example2: Sorting Numbers
 Insertion Sort:
5 2 1 3 4

5
5 2
5 2 1
5 3 2 1
5 4 3 2 1
Parallel Programming – NTHU LSA Lab 52
Example2: Sorting Numbers
 Insertion Sort:
 Each process holds one number
 Compare & move the smaller
number to the right
recv(&number, Pi-1);
if (number > x) {
send(&x, Pi+1);
x= number;
} else {
send(&number, Pi+1);
}
Example 3: Linear Equation Solver
 Special linear equations of “upper-triangular” form
 a’s and b’s are constants, x’s are unknown to be found
𝒂𝒏−𝟏,𝟎 𝒂𝒏−𝟏,𝟏 𝒂𝒏−𝟏,𝟐 …… 𝒂𝒏−𝟏,𝒏−𝟏 𝒙𝟎 𝒃𝒏−𝟏
𝒂𝒏−𝟐,𝟎 𝒂𝒏−𝟐,𝟏 …… 𝒂𝒏−𝟐,𝒏−𝟐 0 𝒙𝟏 𝒃𝒏−𝟐
…...

…...
=

…...
…..

…..
0 0
𝒂𝟏,𝟎 𝒂𝟏,𝟏 0 0 0 𝒙𝒏−𝟐 𝒃𝟏
𝒂𝟎,𝟎 0 0 0 0 𝒙𝒏−𝟏 𝒃𝟎

𝒂𝒏−𝟏,𝟎 𝒙𝟎 + 𝒂𝒏−𝟏,𝟏 𝒙𝟏 + 𝒂𝒏−𝟏,𝟐 𝒙𝟐 + ⋯ + 𝒂𝒏−𝟏,𝒏−𝟏 𝒙𝒏−𝟏 = 𝐛𝐧−𝟏


..
𝒂𝟐,𝟎 𝒙𝟎 + 𝒂𝟐,𝟏 𝒙𝟏 + 𝒂𝟐,𝟐 𝒙𝟐 = 𝐛𝟐
𝒂𝟏,𝟎 𝒙𝟎 + 𝒂𝟏,𝟏 𝒙𝟏 = 𝐛𝟏
𝒂𝟎,𝟎 𝒙𝟎 = 𝐛𝟎
Parallel Programming – NTHU LSA Lab 54
Example 3: Linear Equation Solver
 Back Substitution
 𝑥0 is found from the last equation
𝑏0
𝑥0 =
𝑎0,0
 Value for 𝑥0 is substituted into the next equation
𝑏1 − 𝑎1,0 𝑥0
𝑥1 =
𝑎1,1
 Values for 𝑥0 , 𝑥1 are substituted into the next equation
𝑏2 − 𝑎2,0 𝑥0 − 𝑎2,1 𝑥1
𝑥2 =
𝑎2,2
 So on until all unknowns are found …
𝑏𝑖 − 𝑖−1
𝑗=0 𝑎𝑖,𝑗 𝑥𝑗
𝑥𝑖 =
𝑎𝑖,𝑖
Parallel Programming – NTHU LSA Lab 55
Example 3: Linear Equation Solver
 First pipeline stage computes 𝑥0 and passes 𝑥0 onto
the second stage, which computes 𝑥1 from 𝑥0 and
passes both 𝑥0 and 𝑥1 onto the next stage, which
computes 𝑥2 from 𝑥0 and 𝑥1 , and so on

Parallel Programming – NTHU LSA Lab 56


Example 3: Linear Equation Solver
 Parallel Code
// code for Pi
sum = 0;
for (j=0; j<i; j++) { // compute partial result
recv(&x[j], Pi-1); // once data is available
send(&x[j], Pi+1);
sum += a[i][j]*x[j];
}
x[i] = (b[i] - sum) / a[i][j]; // send out final result to
send(&x[j], Pi+1); // next process

 Time complexity: 𝑂 𝑛2
 Although later processes have more work for
both communications and computations
Parallel Programming – NTHU LSA Lab 57
Outline
 Embarrassingly Computations
 Divide-And-Conquer Computations
 Pipelined Computations
 Synchronous Computations
 Prefix Sum
 System of Linear Equations

Parallel Programming – NTHU LSA Lab 58


Synchronous Computations
 Definition: all the processes synchronized at
regular points
 Barrier: Basic mechanism for synchronizing
processes
 Inserted at the point in each process where it must wait
 Message (token) is passed among processes for
synchronization
 Deadlock: Common problem occurs from
synchronization
 Two or multiple processes waiting for each other

Parallel Programming – NTHU LSA Lab 59


Barrier
 All processes can only continue from this
POINT when all the processes have reached it

Parallel Programming – NTHU LSA Lab 60


Counter Barrier Implementation
 A.k.a: Linear Barrier
 Centralized counter: count # of processes reaching the barrier
 Increase & check the counter for each barrier call
 Processes is locked by the barrier call until
counter == # processes

61
Counter Barrier Implementation
 Counter-based barrier often have two phases
 Arrival phase: a process enters arrival phase and does not
leave this phase until all processes have arrived in this phase
 Departure phase: Processes are released after moving to the
departure phase
Master Slave processes

Barrier():
Arrival for(i=0;i<p;i++) send(Pmaster)
phase recv(Pany) recv(Pmaster) Barrier():
send(Pmaster)
Departure for(i=0;i<p;i++) recv(Pmaster)
phase send(Pany)

 Slave processes is blocked by recv()


62
 Master could be a bottleneck
Butterfly Barrier Implementation
 At stage i, each process passes a token to the
process with 2i distance away

Parallel Programming – NTHU LSA Lab 63


Butterfly Barrier Implementation
 At stage i, each process passes a token to the
process with 2i distance away

Parallel Programming – NTHU LSA Lab 64


Butterfly Barrier Implementation
 At stage i, each process passes a token to the
process with 2i distance away

Parallel Programming – NTHU LSA Lab 65


Deadlock Problem
 A set of blocked processes each holding some
resources and waiting to acquire a resource held by
another process in the set
P0 P1 P2
 Example:
send(P0)
recv(P1) send(P2) recv(P1)
send(P1) recv(P0) send(P1)
recv(P2)

P0 P1 P2
recv(P0)
recv(P1) recv(P2) recv(P1)
send(P1) send(P0) send(P1)
send(P2)

Parallel Programming – NTHU LSA Lab 66


Example 1: Prefix Sum
 Given a list of numbers 𝑥0 , 𝑥1 , … . 𝑥𝑛−1 ,
compute all partial summations
 𝑥0 ; 𝑥0
+ 𝑥1 ; 𝑥0 + 𝑥1 + 𝑥2 ; … … . .
 Could also replace operator + with AND, OR, *, etc.
 Example:
 𝑥 = 1,2,3,4,5 //sequential code
 Sum = 1,3,6,10,15 for(i = 0; i < n; i++) {
sum[i] = 0;
 Sequential code: O(n2) for (j = 0; j <= i; j++)
sum[i] = sum[i] + x[j];
}
Parallel Programming – NTHU LSA Lab 67
Data Parallelism Solution

Parallel Programming – NTHU LSA Lab 68


Data Parallelism Code
 Sequential Code: O(n2), optimal: O(n)
for (j = 0; j < log(n); j++) /* at each step */
for (i = 2j; i < n; i++) /* add to accumulating sum */
x[i] = x[i] + x[i - 2j]

 Parallel Code: O(log n)


for (j = 0; j < log(n); j++) /* at each step */
forall (i = 0; i < n; i++) /* add to accumulating sum */
if (i >= 2j) x[i] = x[i] + x[i - 2j];

Parallel Programming – NTHU LSA Lab 69


Synchronous Parallelism
 Each iteration composed of several processes
that start together at beginning of iteration
and next iteration cannot begin until all
processes have finished previous iteration
 openMP  MPI
for (j=0; j<n; j++) { // each iteration for (j=0; j<n; j++) { // each iteration
forall (i=0; i<N; i++) { // each process i = myrank;
body(i); body(i);
} barrier(mygroup);
} }

Parallel Programming – NTHU LSA Lab 70


Example 2: System of Linear Equations
 System of linear equations
𝒂𝟎,𝟎 𝒂𝟎,𝟏 𝒂𝟎,𝟐 …… 𝒂𝟎,𝒏−𝟏 𝒙𝟎 𝒃𝟎
𝒂𝟏,𝟎 𝒂𝟏,𝟏 𝒂𝟏,𝟐 …… 𝒂𝟏,𝒏−𝟏 𝒙𝟏 𝒃𝟏
…...

…...
=

…...
…..

…..

…..

…..
𝒂𝒏−𝟐,𝟎 𝒂𝒏−𝟐,𝟏 𝒂𝒏−𝟐,𝟐 …… 𝒂𝒏−𝟐,𝒏−𝟏 𝒙𝒏−𝟐 𝒃𝒏−𝟐
𝒂𝒏−𝟏,𝟎 𝒂𝒏−𝟏,𝟏 𝒂𝒏−𝟏,𝟐 …… 𝒂𝒏−𝟏,𝒏−𝟏 𝒙𝒏−𝟏 𝒃𝒏−𝟏

 Jacobi iteration algorithm:


1
 Convert ith iteration to 𝑥𝑖 = [𝑏𝑖 − 𝑖≠𝑗 𝑎𝑖,𝑗 𝑥𝑗 ]
𝑎𝑖,𝑖
 Initial guess with 𝑥𝑖
= 𝑏𝑖 , and calculate new 𝑥𝑖 values
 Repeat until 𝑥𝑖 𝑡 − 𝑥𝑖 𝑡−1 < error tolerance
Parallel Programming – NTHU LSA Lab 71
Jacobi iteration algorithm

Parallel Programming – NTHU LSA Lab 72


Jacobi iteration algorithm example
2𝑥1 − 𝑥2
−𝑥0 + 2𝑥1 − 𝑥2 = 2 𝑥0 = 2 −
2𝑥0 + 𝑥1 − 2𝑥2 = 2 −1
2𝑥0 − 𝑥1 + 2𝑥2 = 2 2𝑥0 − 2𝑥2
𝑥1 = 2 −
1 1
𝑥𝑖 = [𝑏𝑖 − 𝑎𝑖,𝑗 𝑥𝑗 ] 2𝑥0 − 𝑥1
𝑎𝑖,𝑖 𝑥2 = 2 −
𝑖≠𝑗
2
 Iter1: x10 = 2, 𝑥11 = 2, 𝑥21 = 2
2x11 −𝑥21
 x02 =2− = 4, 𝑥12 = 2, 𝑥22 = 1
−1
𝑒0 = 2 − 4 = 2, 𝑒1 = 0, 𝑒2 = 1
2x21 −𝑥22
 iIter2: x02 =2− = 5, 𝑥13 = −2, 𝑥23 = −1
−1
𝑒0 = 4 − 5 = 1, 𝑒1 = 4, 𝑒2 = 2
Parallel Programming – NTHU LSA Lab 73
Jacobi iteration algorithm
 Sequential Code
 a[][] and b[] holding constants in the equations
 x[] holding unknowns 1
𝑥𝑖 = [𝑏𝑖 − 𝑎𝑖,𝑗 𝑥𝑗 ]
 fixed number of iterations 𝑎𝑖,𝑖
𝑖≠𝑗
for (i = 0; i < n; i++) x[i] = b[i]; /*initialize unknowns*/
for (iteration = 0; iteration < limit; iteration++) {
for (i = 0; i < n; i++) { /* for each unknown */
sum = -a[i][i] * x[i];
for (j = 0; j < n; j++) /* compute summation */
sum = sum + a[i][j] * x[j];
new_x[i] = (b[i] - sum) / a[i][i]; /*compute unknown*/
}
for (i = 0; i < n; i++) x[i] = new_x[i]; /*update to new values*/
} Parallel Programming – NTHU LSA Lab 74
Jacobi iteration algorithm
 Parallel Code
 Process i handles unknown x[i]

x[i] = b[i]; /*initialize unknown*/


for (iteration = 0; iteration < limit; iteration++) {
sum = -a[i][i] * x[i];
for (j = 0; j < n; j++) /* compute summation */
sum = sum + a[i][j] * x[j];
new_x[i] = (b[i] - sum) / a[i][i]; /* compute unknown */
allGather(&new_x[i]); /* gather & broadcast new value */
barrier(); /* wait for all processes */
}

Parallel Programming – NTHU LSA Lab 75

You might also like