Parallel Computations & Applications: National Tsing-Hua University 2017, Summer Semester

Parallel Computations
& Applications
National Tsing-Hua University

2017, Summer Semester
Outline
 Embarrassingly Computations
 Divide-And-Conquer Computations
 Pipelined Computations
 Synchronous Computations
Parallel Programming – NTHU LSA Lab 2

Outline
 Image Transformations
 Mandelbrot Set
 Monte Carlo Methods

What is Embarrassingly Parallel
 A computation that can be divided into a number of
completely independent tasks
Input Data
Divide input to tasks
Processes
Collect results
Results

Example 1: Image Transformations
 Low-level image operations:
 Shifting: object shifted by ∆𝑥 in the 𝑥-dimension
and ∆𝑦 in the 𝑦-dimension:
𝑥 ′ = 𝑥 + ∆𝑥, 𝑦 ′ = 𝑦 + ∆𝑦
 Scaling: object scaled by a factor of 𝑆𝑥 in the 𝑥-
direction and 𝑆𝑦 in the 𝑦-direction;
𝑥 ′ = 𝑥𝑆𝑥 , 𝑦 ′ = 𝑦𝑆𝑦
 Rotation: object rotated through the angle 𝜃
about the origin of the coordinate system:
𝑥 ′ = 𝑥 cos 𝜃 + 𝑦 sin 𝜃
𝑦 ′ = −𝑥 sin 𝜃 + 𝑦 cos 𝜃
Image Region Partitioning
 Square region  Row region  Column region
partition partition partition
map map map
process process process

640
Pseudo-code for Image Shift 10 x 640
……
 Partition region by ROW with width 10 480
//master process
for(i=0, row=0; i<48; i++, row+=10) // for each of 48 processes
send(row, Pi); // send row no. P47
for(i=0; i<480; i++) for(j=0; j<640; j++) temp_map[i][j] = 0; // initialize temp
for(i=0; i<(480*640); i++) { // for each pixel
recv(oldrow, oldcol, newrow, newcol, PANY); // accept new coordinates
if !((newrow<0)||((newrow>=480)||(newcol<0)||((newcol>=640))
temp_map[newrow][newcol] = map[oldrow][oldcol];
}
for(i=0; i<480; i++) for(j=0; j<640; j++) map[i][j] = temp_map[i][j]; // update map
// slave process
recv (row, Pmaster);
for (oldrow = row; oldrow < (oldrow+10); oldrow++) // for each row in the partition
for (oldcol = 0; oldcol < 640; oldcol++) { // for each column in the row
newrow = oldrow + delta_x; // shift along x-dimension
newcol = oldcol + delta_y; // shift along y-dimension
send(oldrow, oldcol, newrow, newcol, Pmaster); // send out new coordinates
}
Example 2: Mandelbrot Set
The Mandelbrot Set is a set of complex numbers that are

quasi-stable when computed by iterating the function:
𝑍0 = 𝐶, 𝑍𝑘+1 = 𝑍𝑘 2 + 𝐶
 C is some complex number: 𝑪 = 𝑎 + 𝑏𝒊
 𝒁𝒌+𝟏 is the (k+1)th iteration of the complex number
 If |Zk| ≤ 2 for ANY k  C belongs to Mandelbrot Set
𝑍 = 𝑎 2 + 𝑏2
Once |Z | > 2, it will increase forever!
k
𝑘
|Zk| 2i 2+2i
C =-1+0.25i, NOT part of the set
C = -1+0.75i, part of the set 𝑍𝑘 = 22 + 22 = 8
4 -2 2
2 2-i
𝑍𝑘 = 22 + (−1)2 = 5
-2i
0 Iteration ∞
Fractal
 What exact is Mandelbrot Set?
 It is a fractal: An object that display self-similarity at
various scale; Magnifying a fractal reveals small-scale 2
details similar to the large-scale characteristics
 After plotting the Mandelbrot Set
determined by thousands of iteration:
 Add color to the points outside the set &
zoom in at the center of the image: -2i 2i
-2
Mandelbrot Set Program
2
 Compute 𝑍𝑘+1 = 𝑍𝑘 + 𝐶
 Let 𝐶 = 𝐶𝑟𝑒𝑎𝑙 + 𝐶𝑖𝑚𝑎𝑔 𝑖 , 𝑍𝑘 = 𝑍𝑟𝑒𝑎𝑙 + 𝑍𝑖𝑚𝑎𝑔 𝑖
 𝑍𝑘+1 = 𝑍𝑟𝑒𝑎𝑙 2 − 𝑍𝑖𝑚𝑎𝑔 2 + 2𝑍𝑟𝑒𝑎𝑙 𝑍𝑖𝑚𝑎𝑔 𝑖 + 𝐶𝑟𝑒𝑎𝑙 + 𝐶𝑖𝑚𝑎𝑔 𝑖
 𝑍𝑟𝑒𝑎𝑙_𝑛𝑒𝑥𝑡 = 𝑍𝑟𝑒𝑎𝑙 2 − 𝑍𝑖𝑚𝑎𝑔 2 + 𝐶𝑟𝑒𝑎𝑙
 𝑍𝑖𝑚𝑎𝑔_𝑛𝑒𝑥𝑡 = 2𝑍𝑟𝑒𝑎𝑙 𝑍𝑖𝑚𝑎𝑔 + 𝐶𝑖𝑚𝑎𝑔
 Represent image number in program

 C = 2 + 4i C.real = 2, C.imag = 4
Struct complex {
float real;
float imag;
};
Sequential Mandelbrot Set Program
 Testing program:
 Giving a complex number
 Return the iteration number when |Zk| > 2
 Let the maximum iteration is 256
int cal_pixel (complex c) {
int count = 0; // number of iterations
int max= 256; // maximum iteration is 256
float temp, lengthsq;
complex z; // initialize complex number z
z.real = 0; z.imag = 0;
do {
temp = (z.real * z.real) – (z.imag * z.imag) + c.real; // compute next z.real
z.imag = (2 * z.real * z.imag) + c.imag; // compute next z.imag
z.real = temp;
lengthsq = (z.real * z.real) + (z.imag * z.imag);
count++; // update iteration counter
} while ((lengthsq < 4.0) && (count < max));
return count;
} Parallel Programming – NTHU LSA Lab 12
Sequential Mandelbrot Set Program
 Scaling Coordinate Display Program:
 Plot the Mandelbrot Set from the
coordinate system
 Color indicate the iteration number
black=256, white=0
 Points are apart with a fixed distance
read_disk, imag_dist
for (x=real_min; x < real_max; x += real_dist) {

for (y=imag_min; y < imag_max; x += imag_dist) {
c.real = x; c.img = y;
color = cal_pixel (c);
display(x, y, color);
}
}
Parallelizing Mandelbrot Set Program
 Partition screen 640*480 by row using 48 processes
//master process
for(i=0, row=0; i<48; i++, row+=10) // for each process
send(row, Pi); // send row no.
for(i=0; i<(480*640); i++) { // for each pixel point
recv(&x, &y, &color, PANY); // receive coordinate/colors
display(x, y, color); // display pixel
}
//slave process 640
recv (&row, Pmaster);
for (x=0; x < 640; x++) { 10 x 640
……
for (y=row; y < (row+10); y++) {
c.real = min_real + (x * scale_real); 480
c.imag = min_imag + (y * scale_image);
color = cal_pixel (c);
send(x, y, &color, Pmaster);
} Each process may P47
} have different load!
Dynamic Task Assignment
 Work pool / Processor Farm
 Useful when tasks require different execution time
 Dynamic load balancing
Work Pool
(xa, ya) (xc, yc)

(xb, yb)
(xd, yd)
1. Send task
2. Return result &
request new task
…
3. Send termination
……………………...
Coding for Work Pool Approach
//master process
count = 0; // # of active processes
row = 0; // row being sent
for (k=0; k<num_proc; k++) { // send initial row to each processes
send(row, Pi , data_tag);
count++;
row++;
}
do {
recv(&slave, &r, color, PANY , result_tag);
count--;
if (row < num_row) { // keep sending until no new task
send(row, Pslave , data_tag); // send next row
count++; Tag is needed to distinguish
row++;
between data and termination msg
} else {
send(row, Pslave , terminate_tag); // terminate
}
display(r, color); // display row
} while(count > 0);
Coding for Work Pool Approach
//slave process P ( i )
recv(&row, Pmaster , source_tag);
while (source_tag == data_tag) { // keep receiving new task
c.imag = min_imag + (row * scale_image);
for (x=0; x<640; x++) {
c.real = min_real + (x * scale_real);
color[x] = cal_pixel (c); // compute color of a single row
}
send(i, row, color, Pmaster , result_tag); // send process id and results
recv(&row, Pmaster , source_tag);
}

Example 3: Monte Carlo Methods
 Monte Carlo methods: a class of computational
algorithms that rely on repeated random sampling
to compute their results
 Invented in 1940s by John von Neumann,
Stanislaw Ulam and Nicholas Metropolis,
while they were working on nuclear weapon
(Manhattan Project)
 Especially useful for simulating systems
with many coupled degrees of freedom,
such as fluids, disordered material

Monte Carlo Methods --- 𝜋 calculation
 How to compute 𝜋 ???
 Definition of 𝜋: the area of a circle with unit radius
Total area = 4
𝐀𝐫𝐞𝐚 𝐨𝐟 𝐜𝐢𝐫𝐜𝐥𝐞 𝝅
 We know: =
𝐀𝐫𝐞𝐚 𝐨𝐟 𝐬𝐪𝐮𝐚𝐫𝐞 𝟒
 Randomly choose points from
the square 2 Area = 𝝅
 Giving sufficient number of samples,

the fraction of points within the circle
will be 𝜋/4!!!
 E.g.: With 10,000 randomly sample points 2
𝑥2 + 𝑦2 > 1
we expect 7854 points within the circle
7854/10000 =𝜋/4 𝜋 = 7854/10000*4 = 3.1416
Monte Carlo Methods --- Integral
 Monte Carlo Method can compute ANY definite integral!
 max and min values of the integral must be known
 Very inefficient….
𝑥𝑚𝑎𝑥
 Method: 𝐴𝑟𝑒𝑎 = 𝑓 𝑥 𝑑𝑥
𝑥𝑚𝑖𝑛
 Randomly choose point 𝑥, 𝑦 :
𝑥𝑚𝑎𝑥 ≤ 𝑥 ≤ 𝑥𝑚𝑖𝑛 𝒇 𝒙
𝑦𝑚𝑎𝑥 ≤ 𝑦 ≤ 𝑦𝑚𝑖𝑛 𝑦𝑚𝑎𝑥
 Compute the area (integral)
according to the ratio of points
inside and outside the area 𝑦𝑚𝑖𝑛
𝑥𝑚𝑖𝑛 𝑥𝑚𝑎𝑥
 just like the computation of 𝜋
 Given any point (x, y), outside means : 𝑦 > 𝑓(𝑥)
Outline
 Adding Numbers
 Bucket Sort
 N-Body Simulation

What is Divide & Conquer
 Recursively divide a problem into sub-problems that
are of the same form as the larger problem
Initial
Problem
divide
conquer
Final
Result 22
Example 1: Adding Numbers
 Add a sequence of numbers
 Sequential Recursive Code:  Parallel Code:
Scatter the numbers
int add (int* numbers) {
if (len(numbers) <=2) { then reduce results
return numbers[1]+numbers[2];
1,2,3,4
} else {
divide (numbers, sub_num1, sub_num2); 1,2 3,4
part_sum1 = add(sub_num1);
part_sum2 = add(sub_num2); 1 2 3 4
}
return (part_sum1+part_sum2); 3 7
}
10

Example 2: Bucket Sort
 Algorithm
1. Range of numbers is divided into m equal regions
2. One bucket is assigned for each region
3. Place numbers to buckets based on the region
4. Use sequential sort for each bucket
Unsorted numbers
2,4,6,1,8,10,5,9,0,3,7,11
Merge list 0,1,2 3,4,5 6,7,8 9,10,11

0~2 3~5 6~8 9~11
 Only effective if number of items per bucket is similar!!
 Numbers should have a known interval ([max, min])
 Numbers better to be uniformly distributed
Complexity Analysis
 Sequential:
1. Distribute numbers to bucket: O(n)
2. Sequential sort each bucket: (n/m)log(n/m) x m
 Overall: O(n log(n/m))
 Parallelize sorting: one process per bucket

1. Distribute numbers to bucket: O(n)
2. Sequential sort each bucket: (n/m)log(n/m)
 Overall: O(n + n/m log(n/m))
 A single process must scan through all numbers in step1

Further Parallelized Bucket Sort
 Parallelize partitioning and sorting:
Partition numbers to m parts/processes

 Each process divides its numbers to small buckets
 Merge small buckets to large bucket
 Sequential sort each bucket
Unsorted numbers 2,4,6,1,8,10,5,9,0,3,7,11
0~2 3~5 6~8 9~11
P1 P2 P3 P4
2,4,6 1,8,10 5,9,0 3,7,11 Phase1:
2 4 6 1 8 10 0 5 9 3 7 11 Partition
Phase2:
Merge list 0,1,2 3,4,5 6,7,8 9,10,11 Sorting 26
Example 3: N-Body Problem
 Newtonian laws of physics
 The gravitational force between two
bodies of masses 𝑚𝑎 & 𝑚𝑏 :
𝐺𝑚𝑎 𝑚𝑏
𝐹=
𝑟2
 Subject to the force, acceleration occurs
𝐹 =𝑚×𝑎
 Let the time interval be ∆𝑡 &
current velocity 𝑣 𝑡 , position 𝑥 𝑡
 New velocity 𝑣 𝑡+1 :
𝑣 𝑡+1 − 𝑣 𝑡 𝑡+1 𝑡
𝐹∆𝑡
𝐹=𝑚 ⇒𝑣 =𝑣 +
∆𝑡 𝑚
 New position 𝑥 𝑡+1 :
𝑥 𝑡+1 = 𝑥 𝑡 + 𝑣 𝑡+1 ∆𝑡 27
Three-Dimensional Space
 Considering 2 bodies at (𝑥𝑎 , 𝑦𝑎 , 𝑧𝑎 )& 𝑥𝑏 , 𝑦𝑏 , 𝑧𝑏
𝑟 = 𝑥𝑎 − 𝑥𝑏 2 + 𝑦𝑎 − 𝑦𝑏 2 + 𝑧𝑎 − 𝑧𝑏 2
 The forces, velocities and positions can be resolved
in the three direction independently
𝐺𝑚𝑎 𝑚𝑏 𝑥𝑏 − 𝑥𝑎
𝐹𝑥 =
𝑟2 𝑟
𝐺𝑚𝑎 𝑚𝑏 𝑦𝑏 − 𝑦𝑎
𝐹𝑦 = 2
( )
𝑟 𝑟
𝐺𝑚𝑎 𝑚𝑏 𝑧𝑏 − 𝑧𝑎
𝐹𝑧 = 2
( )
𝑟 𝑟
N-Body Sequential Code
 Assume all bodies have the same mass m
for (t=0; t<T; t++) {
for (i=0; i<N; i++) {
F = Compute_Force(i); // compute force in O(N^2)
v_new[i] = v[i] + F *dt / m; // compute new velocity
x_new[i] = x[i] + v_new[i] * dt; // compute new position
}
for(i=0; i<N; i++){
x[i] = x_new[i]; // update position
v[i] = v_new[i]; // update velocity
}
}
 Non-feasible as N increases due to O(𝑁 2 )complexity
Approximate Algorithms
 Reduce time complexity by approximating a
cluster of bodies as a single distant body
How to find those clusters of bodies?
Center
m = m1 + m2 of mass
x = (x1*m1 + x2*m2) / m
y = (y1*m1 + y2*m2) / m
r Distant cluster
of bodies

Barnes-Hut Algorithm
 Step1: Recursively divide space by two in each dimensions
 Record the center mass and position of each internal node
Center mass




 Step2: Compute approximate forces on each object
1. traverse the nodes of the tree, starting from the root.
2. If the center-of-mass of an internal node is sufficiently far
from the body, approximate the internal node as a single body
 Far is determined by a parameter: θ=d/r
r: the distance between the body
and the node’s center-of-mass
d: the width of the region
Example: θ=0.5
d/r=2.5 > θ
r=4
a
35
d = 10
1. Traverse the nodes of the tree, starting from the root.
2. If the center-of-mass of an internal node is sufficiently far
from the body, approximate the internal node as a single body
 Far means d/r < θ (e.t. 0 < θ <1)
r: the distance between the body
and the node’s center-of-mass D C B A
d: the width of the region
x
Example: θ=1 D C
d/rA=16/10 > θ
d/rB=16/2 > θ
A B
d/rC=16/15 > θ
d/rD=16/20 < θ
x 36
d = 16
3. If it is a leaf node, calculate the force and add to the object.
4. Otherwise, recursively compute the force from
children of the internal node.
Example: θ=1
d/rA’=8/7 > θ  A’ is a leaf node
d/rB'=8/15 < θ  B’ treated like a single node D C B A
d/rC'=8/20 < θ  C’ is a leaf node
C’
C’ B’ A’ x
B’
A’
x 37
d=8
 θ controls the accuracy and approximation error of the
algorithm
 θ = 0  d/r ALWAYS larger than θ  same as brute force
 θ = 1  most likely only need to consider the object within the
same cluster/region
 If the tree is balanced, the complexity is O(nlogn)
 But in general , the tree could be very unbalanced ……..
 The tree must be re-built for each time interval

Orthogonal Recursive Bisection Method
 Recursively evenly divide space with the same
number of bodies in each of the dimensions
Divide along x dimension

Divide along y dimension

Divide along x dimension

Balanced tree

 It is more balanced, but less accurate
 Objects close to each other may not in the same cluster
Balanced tree

Outline
 Adding Numbers
 Sorting Numbers
 Linear Equation Solver

What is Pipelined Computations
 A problem is divided into a series of tasks
 Tasks have to be completed one after the other
 Each task will be executed by a separate process
or processor
P0 P1 P2 P3

Types of Pipelined Computations
 Pipelined approach can provide increased
speed under three types of computations:
1. If more than one instance of the complete problem
is to be executed
2. If a single instance has a series of data items must
be processed, each requiring multiple operations
3. If information to start the next process can be
passed forward before the process has completed
all its internal operations

Type 1 Pipelined Computations
1. If more than one instance of the complete problem
is to be executed
(Alternative space-time diagram)

 After the first (p-1) cycles, one problem instance is
completed in each pipeline cycle
 The number of instance should be >> the number of
processes Parallel Programming – NTHU LSA Lab 47
Type 2 Pipelined Computations
2. If a series of data items must be processed,
each requiring multiple operations

Types 3 Pipelined Computations
3. Only one problem instance, but each process
can pass on information to the next process,
before it has completed

Example1: Adding Numbers
 Compute sum of an array:
 for(i=0; i<n; i++) sum += A[i]
 Pipeline for an unfolded loop:

 sum += A[0], sum += A[1], sum += A[2], ……
A[0] A[1] A[2] A[n]
A A A A
sum Sin Sout Sin Sout Sin Sout …… Sin Sout

Example1: Adding Numbers
 The basic code for Pi:  SPMD Program:
recv(&sum, Pi-1);
// code for process Pi
sum += number;
if (Pi != P0) {
send(&sum, Pi+1);
recv(&sum, Pi-1);
 For the first process, P0: sum += number;
}
send(&sum, Pi+1);
if (Pi != Pn) {
 For the last process, Pn-1: send(&sum, Pi+1);
}
recv(&sum, Pi-1);
sum += number;

Example2: Sorting Numbers
 Insertion Sort:
5 2 1 3 4
5
5 2
5 2 1
5 3 2 1
5 4 3 2 1
Example2: Sorting Numbers
 Insertion Sort:
 Each process holds one number
 Compare & move the smaller
number to the right
recv(&number, Pi-1);
if (number > x) {
send(&x, Pi+1);
x= number;
} else {
send(&number, Pi+1);
}
Example 3: Linear Equation Solver
 Special linear equations of “upper-triangular” form
 a’s and b’s are constants, x’s are unknown to be found
𝒂𝒏−𝟏,𝟎 𝒂𝒏−𝟏,𝟏 𝒂𝒏−𝟏,𝟐 …… 𝒂𝒏−𝟏,𝒏−𝟏 𝒙𝟎 𝒃𝒏−𝟏
𝒂𝒏−𝟐,𝟎 𝒂𝒏−𝟐,𝟏 …… 𝒂𝒏−𝟐,𝒏−𝟐 0 𝒙𝟏 𝒃𝒏−𝟐
…...
…...
=
…...
…..
…..
0 0
𝒂𝟏,𝟎 𝒂𝟏,𝟏 0 0 0 𝒙𝒏−𝟐 𝒃𝟏
𝒂𝟎,𝟎 0 0 0 0 𝒙𝒏−𝟏 𝒃𝟎
𝒂𝒏−𝟏,𝟎 𝒙𝟎 + 𝒂𝒏−𝟏,𝟏 𝒙𝟏 + 𝒂𝒏−𝟏,𝟐 𝒙𝟐 + ⋯ + 𝒂𝒏−𝟏,𝒏−𝟏 𝒙𝒏−𝟏 = 𝐛𝐧−𝟏

..
𝒂𝟐,𝟎 𝒙𝟎 + 𝒂𝟐,𝟏 𝒙𝟏 + 𝒂𝟐,𝟐 𝒙𝟐 = 𝐛𝟐
𝒂𝟏,𝟎 𝒙𝟎 + 𝒂𝟏,𝟏 𝒙𝟏 = 𝐛𝟏
𝒂𝟎,𝟎 𝒙𝟎 = 𝐛𝟎
 Back Substitution
 𝑥0 is found from the last equation
𝑏0
𝑥0 =
𝑎0,0
 Value for 𝑥0 is substituted into the next equation
𝑏1 − 𝑎1,0 𝑥0
𝑥1 =
𝑎1,1
 Values for 𝑥0 , 𝑥1 are substituted into the next equation
𝑏2 − 𝑎2,0 𝑥0 − 𝑎2,1 𝑥1
𝑥2 =
𝑎2,2
 So on until all unknowns are found …
𝑏𝑖 − 𝑖−1
𝑗=0 𝑎𝑖,𝑗 𝑥𝑗
𝑥𝑖 =
𝑎𝑖,𝑖
 First pipeline stage computes 𝑥0 and passes 𝑥0 onto
the second stage, which computes 𝑥1 from 𝑥0 and
passes both 𝑥0 and 𝑥1 onto the next stage, which
computes 𝑥2 from 𝑥0 and 𝑥1 , and so on

 Parallel Code
// code for Pi
sum = 0;
for (j=0; j<i; j++) { // compute partial result
recv(&x[j], Pi-1); // once data is available
send(&x[j], Pi+1);
sum += a[i][j]*x[j];
}
x[i] = (b[i] - sum) / a[i][j]; // send out final result to
send(&x[j], Pi+1); // next process
 Time complexity: 𝑂 𝑛2
 Although later processes have more work for
both communications and computations
Outline
 Prefix Sum
 System of Linear Equations

Synchronous Computations
 Definition: all the processes synchronized at
regular points
 Barrier: Basic mechanism for synchronizing
processes
 Inserted at the point in each process where it must wait
 Message (token) is passed among processes for
synchronization
 Deadlock: Common problem occurs from
synchronization
 Two or multiple processes waiting for each other

Barrier
 All processes can only continue from this
POINT when all the processes have reached it

Counter Barrier Implementation
 A.k.a: Linear Barrier
 Centralized counter: count # of processes reaching the barrier
 Increase & check the counter for each barrier call
 Processes is locked by the barrier call until
counter == # processes
61
Counter Barrier Implementation
 Counter-based barrier often have two phases
 Arrival phase: a process enters arrival phase and does not
leave this phase until all processes have arrived in this phase
 Departure phase: Processes are released after moving to the
departure phase
Master Slave processes
Barrier():
Arrival for(i=0;i<p;i++) send(Pmaster)
phase recv(Pany) recv(Pmaster) Barrier():
send(Pmaster)
Departure for(i=0;i<p;i++) recv(Pmaster)
phase send(Pany)
 Slave processes is blocked by recv()

62
 Master could be a bottleneck
Butterfly Barrier Implementation
 At stage i, each process passes a token to the
process with 2i distance away



Deadlock Problem
 A set of blocked processes each holding some
resources and waiting to acquire a resource held by
another process in the set
P0 P1 P2
 Example:
send(P0)
recv(P1) send(P2) recv(P1)
send(P1) recv(P0) send(P1)
recv(P2)
P0 P1 P2
recv(P0)
recv(P1) recv(P2) recv(P1)
send(P1) send(P0) send(P1)
send(P2)

Example 1: Prefix Sum
 Given a list of numbers 𝑥0 , 𝑥1 , … . 𝑥𝑛−1 ,
compute all partial summations
 𝑥0 ; 𝑥0
+ 𝑥1 ; 𝑥0 + 𝑥1 + 𝑥2 ; … … . .
 Could also replace operator + with AND, OR, *, etc.
 Example:
 𝑥 = 1,2,3,4,5 //sequential code
 Sum = 1,3,6,10,15 for(i = 0; i < n; i++) {
sum[i] = 0;
 Sequential code: O(n2) for (j = 0; j <= i; j++)
sum[i] = sum[i] + x[j];
}
Data Parallelism Solution

Data Parallelism Code
 Sequential Code: O(n2), optimal: O(n)
for (j = 0; j < log(n); j++) /* at each step */
for (i = 2j; i < n; i++) /* add to accumulating sum */
x[i] = x[i] + x[i - 2j]
 Parallel Code: O(log n)

for (j = 0; j < log(n); j++) /* at each step */
forall (i = 0; i < n; i++) /* add to accumulating sum */
if (i >= 2j) x[i] = x[i] + x[i - 2j];

Synchronous Parallelism
 Each iteration composed of several processes
that start together at beginning of iteration
and next iteration cannot begin until all
processes have finished previous iteration
 openMP  MPI
for (j=0; j<n; j++) { // each iteration for (j=0; j<n; j++) { // each iteration
forall (i=0; i<N; i++) { // each process i = myrank;
body(i); body(i);
} barrier(mygroup);
} }

Example 2: System of Linear Equations
 System of linear equations
𝒂𝟎,𝟎 𝒂𝟎,𝟏 𝒂𝟎,𝟐 …… 𝒂𝟎,𝒏−𝟏 𝒙𝟎 𝒃𝟎
𝒂𝟏,𝟎 𝒂𝟏,𝟏 𝒂𝟏,𝟐 …… 𝒂𝟏,𝒏−𝟏 𝒙𝟏 𝒃𝟏
…...
…...
=
…...
…..
…..
…..
…..
𝒂𝒏−𝟐,𝟎 𝒂𝒏−𝟐,𝟏 𝒂𝒏−𝟐,𝟐 …… 𝒂𝒏−𝟐,𝒏−𝟏 𝒙𝒏−𝟐 𝒃𝒏−𝟐
𝒂𝒏−𝟏,𝟎 𝒂𝒏−𝟏,𝟏 𝒂𝒏−𝟏,𝟐 …… 𝒂𝒏−𝟏,𝒏−𝟏 𝒙𝒏−𝟏 𝒃𝒏−𝟏
 Jacobi iteration algorithm:

1
 Convert ith iteration to 𝑥𝑖 = [𝑏𝑖 − 𝑖≠𝑗 𝑎𝑖,𝑗 𝑥𝑗 ]
𝑎𝑖,𝑖
 Initial guess with 𝑥𝑖
= 𝑏𝑖 , and calculate new 𝑥𝑖 values
 Repeat until 𝑥𝑖 𝑡 − 𝑥𝑖 𝑡−1 < error tolerance
Jacobi iteration algorithm

Jacobi iteration algorithm example
2𝑥1 − 𝑥2
−𝑥0 + 2𝑥1 − 𝑥2 = 2 𝑥0 = 2 −
2𝑥0 + 𝑥1 − 2𝑥2 = 2 −1
2𝑥0 − 𝑥1 + 2𝑥2 = 2 2𝑥0 − 2𝑥2
𝑥1 = 2 −
1 1
𝑥𝑖 = [𝑏𝑖 − 𝑎𝑖,𝑗 𝑥𝑗 ] 2𝑥0 − 𝑥1
𝑎𝑖,𝑖 𝑥2 = 2 −
𝑖≠𝑗
2
 Iter1: x10 = 2, 𝑥11 = 2, 𝑥21 = 2
2x11 −𝑥21
 x02 =2− = 4, 𝑥12 = 2, 𝑥22 = 1
−1
𝑒0 = 2 − 4 = 2, 𝑒1 = 0, 𝑒2 = 1
2x21 −𝑥22
 iIter2: x02 =2− = 5, 𝑥13 = −2, 𝑥23 = −1
−1
𝑒0 = 4 − 5 = 1, 𝑒1 = 4, 𝑒2 = 2
 Sequential Code
 a[][] and b[] holding constants in the equations
 x[] holding unknowns 1
𝑥𝑖 = [𝑏𝑖 − 𝑎𝑖,𝑗 𝑥𝑗 ]
 fixed number of iterations 𝑎𝑖,𝑖
𝑖≠𝑗
for (i = 0; i < n; i++) x[i] = b[i]; /*initialize unknowns*/
for (iteration = 0; iteration < limit; iteration++) {
for (i = 0; i < n; i++) { /* for each unknown */
sum = -a[i][i] * x[i];
for (j = 0; j < n; j++) /* compute summation */
sum = sum + a[i][j] * x[j];
new_x[i] = (b[i] - sum) / a[i][i]; /*compute unknown*/
}
for (i = 0; i < n; i++) x[i] = new_x[i]; /*update to new values*/
} Parallel Programming – NTHU LSA Lab 74
 Parallel Code
 Process i handles unknown x[i]
x[i] = b[i]; /*initialize unknown*/

for (iteration = 0; iteration < limit; iteration++) {
sum = -a[i][i] * x[i];
for (j = 0; j < n; j++) /* compute summation */
sum = sum + a[i][j] * x[j];
new_x[i] = (b[i] - sum) / a[i][i]; /* compute unknown */
allGather(&new_x[i]); /* gather & broadcast new value */
barrier(); /* wait for all processes */
}

Parallel Computations & Applications: National Tsing-Hua University 2017, Summer Semester

Uploaded by

Copyright:

Available Formats

Parallel Computations & Applications: National Tsing-Hua University 2017, Summer Semester

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Parallel Computations & Applications: National Tsing-Hua University 2017, Summer Semester

Uploaded by

Copyright:

Available Formats

Parallel Computations

National Tsing-Hua University

Parallel Programming – NTHU LSA Lab 2

Parallel Programming – NTHU LSA Lab 3

Parallel Programming – NTHU LSA Lab 4

map map map

process process process

Parallel Programming – NTHU LSA Lab 6

 Represent image number in program

for (x=real_min; x < real_max; x += real_dist) {

(xa, ya) (xc, yc)

Parallel Programming – NTHU LSA Lab 17

Parallel Programming – NTHU LSA Lab 18

 Giving sufficient number of samples,

Parallel Programming – NTHU LSA Lab 21

Parallel Programming – NTHU LSA Lab 23

Merge list 0,1,2 3,4,5 6,7,8 9,10,11

 Parallelize sorting: one process per bucket

 A single process must scan through all numbers in step1

Parallel Programming – NTHU LSA Lab 30

Parallel Programming – NTHU LSA Lab 31

Parallel Programming – NTHU LSA Lab 32

Parallel Programming – NTHU LSA Lab 33

Parallel Programming – NTHU LSA Lab 34

Parallel Programming – NTHU LSA Lab 38

Parallel Programming – NTHU LSA Lab 39

Parallel Programming – NTHU LSA Lab 40

Parallel Programming – NTHU LSA Lab 41

Parallel Programming – NTHU LSA Lab 42

Parallel Programming – NTHU LSA Lab 43

Parallel Programming – NTHU LSA Lab 44

Parallel Programming – NTHU LSA Lab 45

Parallel Programming – NTHU LSA Lab 46

(Alternative space-time diagram)

Parallel Programming – NTHU LSA Lab 48

Parallel Programming – NTHU LSA Lab 49

 Pipeline for an unfolded loop:

A[0] A[1] A[2] A[n]

Parallel Programming – NTHU LSA Lab 50

Parallel Programming – NTHU LSA Lab 51

𝒂𝒏−𝟏,𝟎 𝒙𝟎 + 𝒂𝒏−𝟏,𝟏 𝒙𝟏 + 𝒂𝒏−𝟏,𝟐 𝒙𝟐 + ⋯ + 𝒂𝒏−𝟏,𝒏−𝟏 𝒙𝒏−𝟏 = 𝐛𝐧−𝟏

Parallel Programming – NTHU LSA Lab 56

Parallel Programming – NTHU LSA Lab 58

Parallel Programming – NTHU LSA Lab 59

Parallel Programming – NTHU LSA Lab 60

 Slave processes is blocked by recv()

Parallel Programming – NTHU LSA Lab 63

Parallel Programming – NTHU LSA Lab 64

Parallel Programming – NTHU LSA Lab 65

Parallel Programming – NTHU LSA Lab 66

Parallel Programming – NTHU LSA Lab 68

 Parallel Code: O(log n)

Parallel Programming – NTHU LSA Lab 69

Parallel Programming – NTHU LSA Lab 70

 Jacobi iteration algorithm:

Parallel Programming – NTHU LSA Lab 72

x[i] = b[i]; /*initialize unknown*/

Parallel Programming – NTHU LSA Lab 75

x[i] = b[i]; /initialize unknown/