4 Performance
by Using OpenMP
• Loop optimizations
• Best practices
• Task parallelism
!" #"
It may be easy to write a correctly functioning OpenMP program, but not so easy to create a program that provides the desired level of performance.

A major goal is to organize data accesses so that values are used as often as possible while they are still in the cache.
$" %"
Two-dimensional array access

n = 50000
&" '"
Loop fusion

Loop unrolling

Loop fission

Loop tiling
$ time program
real    5.4
user    3.2
sys     2.0
!!" !#"
Parallel overhead

The amount of time required to coordinate parallel threads, as opposed to doing useful work.

A simple performance model

T_CPU(P) = (1 + O_P × P) × T_serial
!$" !%"
Performance factors

• Synchronization overheads: Time wasted waiting to enter critical regions.

Speedup(P) = T_Serial / T_Elapsed(P)
           = 1 / (f/P + (1 − f) + O × P)
           = 1 / (0.95/P + 0.05 + 0.02 × P)   (for f = 0.95, O = 0.02)

Efficiency(P) = Speedup(P) / P
!&" !'"
Overheads of OpenMP directives on alvin (gcc)

[Chart: measured overheads of the reduction, parallel for, parallel, single, barrier, and for constructs]
!(" !)"
!*" #+"
Best practices

Optimize barrier use
#!" ##"
#$" #%"
Avoid large critical regions

Lost time waiting for locks

[Figure: timeline of threads idling while one thread holds the critical region]

Maximize parallel regions
#&"
#'"
Large parallel regions offer more opportunities for using data in cache and provide a bigger context for compiler optimizations.
#(" #)"
Load imbalance

Unequal work loads lead to idle threads and wasted time.

#pragma omp parallel
{
   #pragma omp for
   for ( ; ; ) {
   }
}

[Figure: per-thread timeline showing busy and idle periods]

Load balancing

• Load balancing is an important aspect of performance
• For regular computations (e.g. vector addition), load balancing is not an issue
• For less regular workloads, care needs to be taken in distributing the work over the threads
• Examples of irregular workloads:
   - multiplication of triangular matrices
   - parallel searches in a linked list
#*"
#*" $+"
$!" $#"
False sharing

When threads on different cores repeatedly update distinct variables that happen to share a cache line, the line migrates back and forth between their caches even though no value is actually shared.

Solutions:

Array padding

/* False sharing: consecutive elements of a[] share a cache line, and
   schedule(static,1) makes different threads update neighboring elements. */
int a[Nthreads];
#pragma omp parallel for shared(Nthreads,a) schedule(static,1)
for (int i = 0; i < Nthreads; i++)
    a[i] += i;

/* Array padding: each thread's element now starts on its own cache
   line, so the updates no longer contend. */
int a[Nthreads][cache_line_size];
#pragma omp parallel for shared(Nthreads,a) schedule(static,1)
for (int i = 0; i < Nthreads; i++)
    a[i][0] += i;

Case study: Matrix times vector product
What are tasks?

• Tasks are independent units of work
• Threads are assigned to perform the work of each task
• Tasks may be deferred
• Tasks may be executed immediately
• The runtime system decides which of the above
• Tasks are composed of:
   - code to execute
   - data environment (it owns its data)
   - internal control variables

[Figure: serial vs parallel execution of tasks]

Tasks in OpenMP

OpenMP has always had tasks, but they were not called that.

• A thread encountering a parallel construct packages up a set of implicit tasks, one per thread.
• A team of threads is created.
• Each thread is assigned to one of the tasks (and tied to it).
• Barrier holds master thread until all implicit tasks are finished.

OpenMP 3.0 adds a way to create a task explicitly for the team to execute.
Certain constructs have suspend/resume points at defined positions within them.

The collapse clause (in OpenMP 3.0) handles perfectly nested multi-dimensional loops.
&!" &#"
Removal of dependencies

Serial version containing an anti-dependency
Parallel version with the dependency removed

Serial version containing a flow dependency
Parallel version with the dependency removed by loop skewing
&$" &%"
Automatic parallelization