Parallel Computations & Applications: National Tsing-Hua University 2017, Summer Semester
Parallel Computations & Applications: National Tsing-Hua University 2017, Summer Semester
Parallel Computations & Applications: National Tsing-Hua University 2017, Summer Semester
& Applications
Divide-And-Conquer Computations
Pipelined Computations
Synchronous Computations
Input Data
Divide input to tasks
Processes
Collect results
Results
……
Partition region by ROW with width 10 480
//master process
for(i=0, row=0; i<48; i++, row+=10) // for each of 48 processes
send(row, Pi); // send row no. P47
for(i=0; i<480; i++) for(j=0; j<640; j++) temp_map[i][j] = 0; // initialize temp
for(i=0; i<(480*640); i++) { // for each pixel
recv(oldrow, oldcol, newrow, newcol, PANY); // accept new coordinates
if !((newrow<0)||((newrow>=480)||(newcol<0)||((newcol>=640))
temp_map[newrow][newcol] = map[oldrow][oldcol];
}
for(i=0; i<480; i++) for(j=0; j<640; j++) map[i][j] = temp_map[i][j]; // update map
// slave process
recv (row, Pmaster);
for (oldrow = row; oldrow < (oldrow+10); oldrow++) // for each row in the partition
for (oldcol = 0; oldcol < 640; oldcol++) { // for each column in the row
newrow = oldrow + delta_x; // shift along x-dimension
newcol = oldcol + delta_y; // shift along y-dimension
send(oldrow, oldcol, newrow, newcol, Pmaster); // send out new coordinates
}
Example 2: Mandelbrot Set
The Mandelbrot Set is a set of complex numbers that are
quasi-stable when computed by iterating the function:
𝑍0 = 𝐶, 𝑍𝑘+1 = 𝑍𝑘 2 + 𝐶
C is some complex number: 𝑪 = 𝑎 + 𝑏𝒊
𝒁𝒌+𝟏 is the (k+1)th iteration of the complex number
If |Zk| ≤ 2 for ANY k C belongs to Mandelbrot Set
𝑍 = 𝑎 2 + 𝑏2
Once |Z | > 2, it will increase forever!
k
𝑘
|Zk| 2i 2+2i
C =-1+0.25i, NOT part of the set
C = -1+0.75i, part of the set 𝑍𝑘 = 22 + 22 = 8
4 -2 2
2 2-i
𝑍𝑘 = 22 + (−1)2 = 5
-2i
0 Iteration ∞
Parallel Programming – NTHU LSA Lab 8
Fractal
What exact is Mandelbrot Set?
It is a fractal: An object that display self-similarity at
various scale; Magnifying a fractal reveals small-scale 2
details similar to the large-scale characteristics
After plotting the Mandelbrot Set
determined by thousands of iteration:
Add color to the points outside the set &
zoom in at the center of the image: -2i 2i
-2
Parallel Programming – NTHU LSA Lab 10
Mandelbrot Set Program
2
Compute 𝑍𝑘+1 = 𝑍𝑘 + 𝐶
Let 𝐶 = 𝐶𝑟𝑒𝑎𝑙 + 𝐶𝑖𝑚𝑎𝑔 𝑖 , 𝑍𝑘 = 𝑍𝑟𝑒𝑎𝑙 + 𝑍𝑖𝑚𝑎𝑔 𝑖
𝑍𝑘+1 = 𝑍𝑟𝑒𝑎𝑙 2 − 𝑍𝑖𝑚𝑎𝑔 2 + 2𝑍𝑟𝑒𝑎𝑙 𝑍𝑖𝑚𝑎𝑔 𝑖 + 𝐶𝑟𝑒𝑎𝑙 + 𝐶𝑖𝑚𝑎𝑔 𝑖
𝑍𝑟𝑒𝑎𝑙_𝑛𝑒𝑥𝑡 = 𝑍𝑟𝑒𝑎𝑙 2 − 𝑍𝑖𝑚𝑎𝑔 2 + 𝐶𝑟𝑒𝑎𝑙
𝑍𝑖𝑚𝑎𝑔_𝑛𝑒𝑥𝑡 = 2𝑍𝑟𝑒𝑎𝑙 𝑍𝑖𝑚𝑎𝑔 + 𝐶𝑖𝑚𝑎𝑔
……
for (y=row; y < (row+10); y++) {
c.real = min_real + (x * scale_real); 480
c.imag = min_imag + (y * scale_image);
color = cal_pixel (c);
send(x, y, &color, Pmaster);
} Each process may P47
} have different load!
Dynamic Task Assignment
Work pool / Processor Farm
Useful when tasks require different execution time
Dynamic load balancing
Work Pool
…
3. Send termination
……………………...
Parallel Programming – NTHU LSA Lab 15
Coding for Work Pool Approach
//master process
count = 0; // # of active processes
row = 0; // row being sent
for (k=0; k<num_proc; k++) { // send initial row to each processes
send(row, Pi , data_tag);
count++;
row++;
}
do {
recv(&slave, &r, color, PANY , result_tag);
count--;
if (row < num_row) { // keep sending until no new task
send(row, Pslave , data_tag); // send next row
count++; Tag is needed to distinguish
row++;
between data and termination msg
} else {
send(row, Pslave , terminate_tag); // terminate
}
display(r, color); // display row
} while(count > 0);
Coding for Work Pool Approach
//slave process P ( i )
recv(&row, Pmaster , source_tag);
while (source_tag == data_tag) { // keep receiving new task
c.imag = min_imag + (row * scale_image);
for (x=0; x<640; x++) {
c.real = min_real + (x * scale_real);
color[x] = cal_pixel (c); // compute color of a single row
}
send(i, row, color, Pmaster , result_tag); // send process id and results
recv(&row, Pmaster , source_tag);
}
Pipelined Computations
Synchronous Computations
conquer
Final
Result 22
Example 1: Adding Numbers
Add a sequence of numbers
Sequential Recursive Code: Parallel Code:
Scatter the numbers
int add (int* numbers) {
if (len(numbers) <=2) { then reduce results
return numbers[1]+numbers[2];
1,2,3,4
} else {
divide (numbers, sub_num1, sub_num2); 1,2 3,4
part_sum1 = add(sub_num1);
part_sum2 = add(sub_num2); 1 2 3 4
}
return (part_sum1+part_sum2); 3 7
}
10
P1 P2 P3 P4
2,4,6 1,8,10 5,9,0 3,7,11 Phase1:
2 4 6 1 8 10 0 5 9 3 7 11 Partition
Phase2:
Merge list 0,1,2 3,4,5 6,7,8 9,10,11 Sorting 26
Example 3: N-Body Problem
Newtonian laws of physics
The gravitational force between two
bodies of masses 𝑚𝑎 & 𝑚𝑏 :
𝐺𝑚𝑎 𝑚𝑏
𝐹=
𝑟2
Subject to the force, acceleration occurs
𝐹 =𝑚×𝑎
Let the time interval be ∆𝑡 &
current velocity 𝑣 𝑡 , position 𝑥 𝑡
New velocity 𝑣 𝑡+1 :
𝑣 𝑡+1 − 𝑣 𝑡 𝑡+1 𝑡
𝐹∆𝑡
𝐹=𝑚 ⇒𝑣 =𝑣 +
∆𝑡 𝑚
New position 𝑥 𝑡+1 :
𝑥 𝑡+1 = 𝑥 𝑡 + 𝑣 𝑡+1 ∆𝑡 27
Three-Dimensional Space
Considering 2 bodies at (𝑥𝑎 , 𝑦𝑎 , 𝑧𝑎 )& 𝑥𝑏 , 𝑦𝑏 , 𝑧𝑏
𝑟 = 𝑥𝑎 − 𝑥𝑏 2 + 𝑦𝑎 − 𝑦𝑏 2 + 𝑧𝑎 − 𝑧𝑏 2
The forces, velocities and positions can be resolved
in the three direction independently
𝐺𝑚𝑎 𝑚𝑏 𝑥𝑏 − 𝑥𝑎
𝐹𝑥 =
𝑟2 𝑟
𝐺𝑚𝑎 𝑚𝑏 𝑦𝑏 − 𝑦𝑎
𝐹𝑦 = 2
( )
𝑟 𝑟
𝐺𝑚𝑎 𝑚𝑏 𝑧𝑏 − 𝑧𝑎
𝐹𝑧 = 2
( )
𝑟 𝑟
Parallel Programming – NTHU LSA Lab 28
N-Body Sequential Code
Assume all bodies have the same mass m
for (t=0; t<T; t++) {
for (i=0; i<N; i++) {
F = Compute_Force(i); // compute force in O(N^2)
v_new[i] = v[i] + F *dt / m; // compute new velocity
x_new[i] = x[i] + v_new[i] * dt; // compute new position
}
for(i=0; i<N; i++){
x[i] = x_new[i]; // update position
v[i] = v_new[i]; // update velocity
}
}
Non-feasible as N increases due to O(𝑁 2 )complexity
Parallel Programming – NTHU LSA Lab 29
Approximate Algorithms
Reduce time complexity by approximating a
cluster of bodies as a single distant body
How to find those clusters of bodies?
Center
m = m1 + m2 of mass
x = (x1*m1 + x2*m2) / m
y = (y1*m1 + y2*m2) / m
r Distant cluster
of bodies
Example: θ=0.5
d/r=2.5 > θ
r=4
a
35
d = 10
Barnes-Hut Algorithm
Step2: Compute approximate forces on each object
1. Traverse the nodes of the tree, starting from the root.
2. If the center-of-mass of an internal node is sufficiently far
from the body, approximate the internal node as a single body
Far means d/r < θ (e.t. 0 < θ <1)
r: the distance between the body
and the node’s center-of-mass D C B A
d: the width of the region
x
Example: θ=1 D C
d/rA=16/10 > θ
d/rB=16/2 > θ
A B
d/rC=16/15 > θ
d/rD=16/20 < θ
x 36
d = 16
Barnes-Hut Algorithm
Step2: Compute approximate forces on each object
3. If it is a leaf node, calculate the force and add to the object.
4. Otherwise, recursively compute the force from
children of the internal node.
Example: θ=1
d/rA’=8/7 > θ A’ is a leaf node
d/rB'=8/15 < θ B’ treated like a single node D C B A
d/rC'=8/20 < θ C’ is a leaf node
C’
C’ B’ A’ x
B’
A’
x 37
d=8
Barnes-Hut Algorithm
θ controls the accuracy and approximation error of the
algorithm
θ = 0 d/r ALWAYS larger than θ same as brute force
θ = 1 most likely only need to consider the object within the
same cluster/region
If the tree is balanced, the complexity is O(nlogn)
But in general , the tree could be very unbalanced ……..
The tree must be re-built for each time interval
Synchronous Computations
P0 P1 P2 P3
A A A A
sum Sin Sout Sin Sout Sin Sout …… Sin Sout
5
5 2
5 2 1
5 3 2 1
5 4 3 2 1
Parallel Programming – NTHU LSA Lab 52
Example2: Sorting Numbers
Insertion Sort:
Each process holds one number
Compare & move the smaller
number to the right
recv(&number, Pi-1);
if (number > x) {
send(&x, Pi+1);
x= number;
} else {
send(&number, Pi+1);
}
Example 3: Linear Equation Solver
Special linear equations of “upper-triangular” form
a’s and b’s are constants, x’s are unknown to be found
𝒂𝒏−𝟏,𝟎 𝒂𝒏−𝟏,𝟏 𝒂𝒏−𝟏,𝟐 …… 𝒂𝒏−𝟏,𝒏−𝟏 𝒙𝟎 𝒃𝒏−𝟏
𝒂𝒏−𝟐,𝟎 𝒂𝒏−𝟐,𝟏 …… 𝒂𝒏−𝟐,𝒏−𝟐 0 𝒙𝟏 𝒃𝒏−𝟐
…...
…...
=
…...
…..
…..
0 0
𝒂𝟏,𝟎 𝒂𝟏,𝟏 0 0 0 𝒙𝒏−𝟐 𝒃𝟏
𝒂𝟎,𝟎 0 0 0 0 𝒙𝒏−𝟏 𝒃𝟎
Time complexity: 𝑂 𝑛2
Although later processes have more work for
both communications and computations
Parallel Programming – NTHU LSA Lab 57
Outline
Embarrassingly Computations
Divide-And-Conquer Computations
Pipelined Computations
Synchronous Computations
Prefix Sum
System of Linear Equations
61
Counter Barrier Implementation
Counter-based barrier often have two phases
Arrival phase: a process enters arrival phase and does not
leave this phase until all processes have arrived in this phase
Departure phase: Processes are released after moving to the
departure phase
Master Slave processes
Barrier():
Arrival for(i=0;i<p;i++) send(Pmaster)
phase recv(Pany) recv(Pmaster) Barrier():
send(Pmaster)
Departure for(i=0;i<p;i++) recv(Pmaster)
phase send(Pany)
P0 P1 P2
recv(P0)
recv(P1) recv(P2) recv(P1)
send(P1) send(P0) send(P1)
send(P2)
…...
=
…...
…..
…..
…..
…..
𝒂𝒏−𝟐,𝟎 𝒂𝒏−𝟐,𝟏 𝒂𝒏−𝟐,𝟐 …… 𝒂𝒏−𝟐,𝒏−𝟏 𝒙𝒏−𝟐 𝒃𝒏−𝟐
𝒂𝒏−𝟏,𝟎 𝒂𝒏−𝟏,𝟏 𝒂𝒏−𝟏,𝟐 …… 𝒂𝒏−𝟏,𝒏−𝟏 𝒙𝒏−𝟏 𝒃𝒏−𝟏