Flops, Memory and Parallel Processing and Gaussian elimination (Lecture Notes for 10-1-2013)


We will discuss three important features in the efficiency of Gaussian elimination:
floating point operations (flops), memory access and parallel processing. The topics
span thirty-plus years in the development of efficient implementations of Gaussian
elimination and associated pivoting schemes.
FLOPS, LINPACK and the 1980s
LINPACK is the first publicly available, high quality linear algebra package. It was
developed with an NSF grant in the 1980s by, among other researchers, Cleve Moler,
the creator of Matlab. For the computer architectures of that time, the number of
floating point operations in an algorithm was the most important measure of an
algorithm's efficiency. As we have seen earlier, for larger n, the flop counts for
Gaussian elimination with no pivoting (genp), partial pivoting (gepp), rook pivoting
(gerp) and complete pivoting (gecp) applied to an n by n matrix are:
Method    Flops from elimination    Flops for pivot search      Total flops (leading order term)
------    ----------------------    ----------------------      --------------------------------
genp      (2/3)n^3                  --                          (2/3)n^3
gepp      (2/3)n^3                  (1/2)n^2                    (2/3)n^3
gerp      (2/3)n^3                  approximately (3/2)n^2      (2/3)n^3
gecp      (2/3)n^3                  (1/3)n^3                    n^3

The conclusion from the last column is that, based on flop count, genp, gepp and gerp
should take about the same amount of time and gecp should take about 50% longer
(n^3 versus (2/3)n^3 flops, a ratio of 3/2). In practice genp can lead to large error
growth, so we won't pursue genp further here.
We can check the theory in the above table for the other routines by doing
experiments. When comparing run times it is better to use code produced by a
compiled computer language, rather than an interpreted language such as Matlab.
When timing interpreted code, one is timing the work to interpret the code as well as
the computations. Since a compiled language (such as C, C++, Fortran, etc.) is
translated to machine language before execution, the timing focuses on the
computations, not the translation.
Fortunately, in the Try Your Own Pivot Scheme assignment, C code implementing
gecp is available through the Matlab Mex utility (see item 26 at the course web
page). By making a small change in the gecp source code (in the pivot search
section of the code we can change the for (j = k; j <= nminus1; ++j) loop to
for (j = k; j <= k; ++j)) we can modify the gecp code to implement gepp. I also have
Fortran code, again accessible in Matlab via the Mex utility, that implements rook
pivoting to solve Ax = b. All this code is in LINPACK style in that the flop count
was a key focus in the code development.
We can try out the methods in Matlab (using a midrange, dual processor Acer
laptop):

>> n=2000; A=randn(n,n); b=randn(n,1);
>> tic, x = gecp(A,b); toc                % complete pivoting
Elapsed time is 20.461706 seconds.
>> tic, x = gerp_unblocked(A,b); toc      % rook pivoting with similar style code
Elapsed time is 12.953847 seconds.
>> tic, x = geppinC(A,b); toc             % partial pivoting
Elapsed time is 12.884968 seconds.

These runs support the conclusions of the above theory. The complete pivoting
code takes about 50% (58% in the above runs) longer than the partial or rook
pivoting code and the run times for the partial pivoting code and rook pivoting code
are about the same!
MANAGING MEMORY ACCESS, LAPACK and the 1990s
If we solve the same problem using the built-in backslash in Matlab we get the
following results:
>> tic,x=A\b;toc
Elapsed time is 0.900737 seconds.
The Matlab solution implements Gaussian elimination with partial pivoting, but is
more than 13 times faster than any of the earlier runs. What is going on?
In the late 1980s and early 1990s the use of cache memory became increasingly
popular. Cache memory is relatively expensive, but it is fast compared to the main
RAM memory in a computer. It became apparent that it is important to keep
calculations within the fast cache memory as much as possible. Otherwise, the time
to access the slow main RAM memory outside the cache would dominate the
calculation times.
To address this issue a new publicly available linear algebra package, called LAPACK,
was developed with NSF funding. This package required recoding most of the
LINPACK algorithms, typically using block factorizations, which allow dividing the
algorithm into pieces such that the calculations can be kept in cache memory as
much as possible. The current code for Matlab's backslash does Gaussian elimination
with partial pivoting using such a block algorithm. As we see above this leads to a
huge decrease in run times, more than a factor of 13.
The idea of block factorization (we will ignore pivoting initially for clarity) is based
on the following block factorization of an n by n matrix A:

(0):   [ L11    0  ] [ U11  U12 ]   [ A11  A12 ]
       [ L21   L22 ] [  0   U22 ] = [ A21  A22 ]

where, for a block size b, the matrices L11, U11 and A11 are b by b; L21 and A21 are
(n-b) by b; U12 and A12 are b by (n-b); and L22, U22 and A22 are (n-b) by (n-b). Let

       [ A11 ]              [ L11 ]
  A1 = [ A21 ]   and   L1 = [ L21 ].

We will call the n by b matrices A1 and L1 tall skinny matrices since we will
assume that b is much smaller than n. For example b might be 100 and n 2000 or
larger.
As can be seen by multiplying L times the first block column of U, equation (0)
implies that L1 U11 = A1, or that:

(1):  L1 and U11 are the LU factors of A1.

We can therefore calculate L1 and U11 by doing an LU factorization of the tall skinny
matrix A1. This calculation can be done in cache memory, assuming that b is chosen
small enough.
Next we multiply the first block row of L times U, which implies that L11 U12 = A12.
We can therefore calculate U12 using:

(2):  U12 = (L11)^(-1) A12

This calculation can also be done in cache, since A12 and U12 are short (they have
only b rows), fat (n-b columns) matrices.
Finally, if in (0) we multiply the second block row of L by the second block column
of U, it follows that L21 U12 + L22 U22 = A22, or that L22 U22 = A22 - L21 U12, which
we will call Â22. We break that up into two parts:

(3):  Â22 = A22 - L21 U12

and

(4):  L22 and U22 are the LU factors of Â22.

The matrices in (4) are typically too large to fit in cache memory. For example, if
n = 2000 and b = 100, then A22 and Â22 are 1900 by 1900 (more generally, (n-b) by
(n-b)). However, it is easy to separate the calculations in (3) into smaller pieces that
do fit into cache memory. The structure of (3) is indicated by

(5):  Â22 = A22 - L21 U12

where A22 and Â22 are (n-b) by (n-b), L21 is tall and skinny ((n-b) by b) and U12 is
short and fat (b by (n-b)).

The calculations for a submatrix of Â22, for example the elements in rows 1001 to
1100 and columns 1001 to 1100, only require the corresponding entries of A22, the
elements of rows 1001 to 1100 of L21 and the elements of columns 1001 to 1100 of
U12. The submatrix sizes can be selected so that these calculations fit into cache
memory.
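As a concrete illustration, here is a minimal Matlab sketch (my own, not the course
code) of forming Â22 one tile at a time; the function name and the tile size ts are
hypothetical, and ts would be chosen so each tile and the rows/columns it needs fit
in cache.

function A22hat = schur_update_tiled(A22, L21, U12, ts)
% Form A22hat = A22 - L21*U12 tile by tile, as in (5).
n2 = size(A22, 1);                        % A22 is (n-b) by (n-b)
A22hat = zeros(n2);
for i = 1:ts:n2
    r = i:min(i+ts-1, n2);                % row range of this tile
    for j = 1:ts:n2
        c = j:min(j+ts-1, n2);            % column range of this tile
        % this tile needs only A22(r,c), rows r of L21 and columns c of U12
        A22hat(r, c) = A22(r, c) - L21(r, :) * U12(:, c);
    end
end
end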
The calculations in (4) can be done by repeating the above procedure in a loop. On
the first pass through the loop we calculate the first block row and block column of
L and U by applying the above procedure to the 2000 by 2000 matrix A (in our
example). Next we apply the procedure to the 1900 by 1900 matrix Â22, which
produces the next block row and block column of L and U. Continuing, we eventually
factor the whole matrix.
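Putting steps (1)-(4) together, here is a minimal Matlab sketch of the blocked
factorization loop. It omits pivoting, as in the discussion above, so it is only safe for
matrices that do not need pivoting, and it is my own illustration rather than the
GEPP_BLOCK code described below.

function [L, U] = block_lu_nopivot(A, b)
% Blocked LU factorization without pivoting, following steps (1)-(4).
n = size(A, 1);
L = eye(n);
U = zeros(n);
for k = 1:b:n
    j = min(k + b - 1, n);                  % current block is rows/columns k..j
    % (1): unpivoted LU of the tall skinny panel A(k:n, k:j)
    for p = k:j
        L(p+1:n, p) = A(p+1:n, p) / A(p, p);                        % multipliers
        A(p+1:n, p:j) = A(p+1:n, p:j) - L(p+1:n, p) * A(p, p:j);    % eliminate in the panel
    end
    U(k:j, k:j) = A(k:j, k:j);              % this block is U11
    if j < n
        % (2): U12 = inv(L11)*A12, done as a triangular solve
        U(k:j, j+1:n) = L(k:j, k:j) \ A(k:j, j+1:n);
        % (3): Schur complement update A22hat = A22 - L21*U12 (a large matrix multiply)
        A(j+1:n, j+1:n) = A(j+1:n, j+1:n) - L(j+1:n, k:j) * U(k:j, j+1:n);
        % (4): the next pass through the loop factors A22hat
    end
end
end

For a matrix that does not require pivoting, [L, U] = block_lu_nopivot(A, 100) should
satisfy A = L*U up to rounding error.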
The GEPP_BLOCK code in item 25 at the course web site implements these ideas in
Matlab code. GEPP_BLOCK does a block factorization with partial pivoting. This
requires careful bookkeeping to keep track of the permutations. The bookkeeping is
included in GEPP_BLOCK, but we will not discuss the details here.
A block implementation of rook pivoting is also possible, but due to the more
complicated search in rook pivoting there will be more cache misses, where one
needs to access memory outside of cache, than in partial pivoting. Finally, it
appears that it is not possible to do a block implementation of complete pivoting,
since complete pivoting requires a search of the entire unfinished portion of A at
every step of the factorization.
Here are some timings for the same 2000 by 2000 matrix used earlier, using
compiled code (C and Fortran):
>> tic, x = gecp(A,b); toc     % unblocked complete pivoting
Elapsed time is 20.461706 seconds.
>> tic, x = gerp(A,b); toc     % blocked rook pivoting
Elapsed time is 1.367741 seconds.
>> tic, x = A\b; toc           % blocked partial pivoting
Elapsed time is 0.900737 seconds.

In this experiment blocked rook pivoting requires about 50% more time than
blocked partial pivoting and complete pivoting is more than 20 times slower than
partial pivoting.
Note that the most time consuming part of the blocked Gaussian elimination
algorithm is the matrix multiplication in (3) or (5). Most computer manufacturers
supply very efficient, machine language implementations of simpler matrix
operations such as matrix multiplication. Such routines are called the Basic Linear
Algebra Subprograms, or the BLAS. The blocked Gaussian elimination algorithm
facilitates structuring the code so that it can call BLAS routines.
The above discussion represents the state of the art until very recently. For
example, Matlab's backslash for square linear systems calls an implementation of
blocked Gaussian elimination with partial pivoting.
However, computers are evolving again:
PARALLEL PROCESSING, 2011 AND BEYOND
Computers with multicore processors and parallel processing add-on computer
cards are becoming increasingly common. For example, the NVIDIA Tesla cards
contain hundreds or even thousands of processors. Currently, the more advanced
parallel processing cards cost a few thousand dollars, but the prices will continue to
fall. Soon massively parallel processing will not only be the domain of
supercomputers; parallel processing will become ubiquitous. Our algorithms will need
to adapt.
Parts of the block Gaussian elimination algorithm are easy to parallelize. For
example, it is easy to divide up the calculations in (5) into independent pieces, as
described earlier, by focusing on the calculation of submatrices of Â22. Each piece
can then be sent to a separate processor. These processors can independently and
simultaneously do their portion of the calculations. Since this procedure is relatively
straightforward, the calculations in (3) or (5) are sometimes called embarrassingly
parallel. Similarly, the calculations in (2) are easy to do in parallel.
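As a rough illustration (my own sketch, assuming Matlab's Parallel Computing
Toolbox for parfor; the function name and the tile parameter are hypothetical), each
block row of the update in (5) can be computed independently on its own worker:

function A22hat = schur_update_parallel(A22, L21, U12, tile)
% Each block row of A22hat = A22 - L21*U12 is an independent piece of work.
n2 = size(A22, 1);
starts = 1:tile:n2;
pieces = cell(1, numel(starts));
parfor t = 1:numel(starts)
    r = starts(t):min(starts(t) + tile - 1, n2);   % rows handled by this worker
    pieces{t} = A22(r, :) - L21(r, :) * U12;       % an independent piece of (5)
end
A22hat = vertcat(pieces{:});
end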
If the calculations in (2) and (3) are done in parallel by hundreds or more
processors, these calculations may be done quickly and the bottleneck in the
computations may become the tall skinny LU factorization in step (1).
If this tall skinny LU factorization in (1) is done with partial pivoting, the problem is
not embarrassingly parallel; it is difficult to do the pivoting in parallel efficiently. A
key factor in the computation time is the communication between processors, that
is, moving data. This often dominates the arithmetic computations. Let us look at
one part of the communications, the number of messages that need to be passed
between processors, for partial pivoting of a tall skinny matrix. A separate,
important issue is the volume of data passed between processors, but for simplicity
we will not discuss that here. In some environments, the time to establish a new
data connection, that is to initiate a new message, between processors can be a
bottleneck.
[Figure: the tall skinny matrix A1 divided into m stacked b by b blocks A11, A21,
A31, A41, ..., Am1, one block per processor.]

Pictured above is a potential division of a tall skinny matrix, which we call A1,
among separate processors. In this case we have pictured m blocks in the matrix
and assume that each block is a b by b matrix, so that the complete tall skinny
matrix is n by b with n = mb. Assume that each block is assigned to a separate
processor. Locating a pivot element in partial pivoting is a column by column
operation. We can begin by finding the largest magnitude element in the first
column. To do that we need to collect information from every processor. For
example, each parallel processor could calculate the largest magnitude element of
column one of its submatrix, and then the m local maximum elements can be
compared to determine the overall largest magnitude element. This would require
sending m messages, one for each processor. The selection of pivot elements for a
factorization of the entire n by b matrix would require the same number of
messages for each column, or a total of n = mb messages. There will be other
information passed between processors for the elimination, but we will focus only
on the pivot search. We conclude that

With this approach the pivot search in partial pivoting of the n by b matrix A1
requires n = mb messages.

Recall that n can be large, thousands or more. There are other algorithms for
implementing partial pivoting in parallel, but none of them reduces the number of
messages as well as an alternate pivoting scheme does.
Tournament pivoting is a new (first published in 2011) pivoting scheme. Rather
than working column by column as partial pivoting does, tournament pivoting
selects pivot elements by having competitions between pairs of blocks of A1, or
pairs of sets of rows of A1, in a tournament which chooses an overall winner
consisting of the best b rows of A1. These best rows are pivoted to the top of A1
and Gaussian elimination with no pivoting is applied to the pivoted A1, so that
the winning rows of the tournament are the pivot rows.

The principle motivating the competitions between blocks of A1 is that
Gaussian elimination with partial pivoting tends to move rows that
are linearly independent to the top of the matrix.

We won't try to prove this, but we can look at a specific example. Suppose the
second row of a matrix is linearly dependent on the first row. When we do the
elimination and add a multiple of the first row to the second row, the new row two
will be entirely zero. When we select the pivot element for column two we search
column two on or below the diagonal for the element furthest from zero. Therefore
the current row two, which has a zero on the diagonal, won't remain at row two. So
the linearly dependent row two won't remain at the top of the factorization.
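Here is a small Matlab illustration (the 3 by 3 matrix is my own made-up example,
not one from the notes):

% Row two is 0.5 times row one, so it is linearly dependent on row one.
A = [2 4 6; 1 2 3; 1 5 9];
[L, U, P] = lu(A);        % built-in LU factorization with partial pivoting
% P comes out as [1 0 0; 0 0 1; 0 1 0]: the dependent second row is
% permuted to the bottom rather than staying among the leading pivot rows.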
With this in mind we can recast the goal of pivoting: select rows of the n by b
matrix A1 so that the first b rows of the pivoted matrix are linearly independent.

We can now do this via a tournament involving the blocks A11, A21, ..., Am1 of A1
(pictured above):
1. The rows of A11 and A21 participate in a competition: apply an LU factorization
   with partial pivoting to the 2b by b matrix

       B = [ A11 ]
           [ A21 ].

   The rows moved to the top are the winners, that is, they are the rows of B that
   are more linearly independent, as determined by Gaussian elimination. Let W be
   the b by b matrix containing the winning rows.

2. For k = 3 to m, have the rows of Ak1 (the challengers) compete with the rows of
   W (the current champions). The competition is done in a similar way to step 1:
   apply Gaussian elimination with partial pivoting to the 2b by b matrix

       [ W   ]
       [ Ak1 ].

   The b rows that are moved to the top are the new winners, or current champions.
The result will be the b most linearly independent rows (as determined by the
tournament) of A1 , as desired.
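Here is a minimal Matlab sketch of this flat (sequential) tournament. It is my own
illustration of steps 1 and 2, not code from the course web site; it uses Matlab's
built-in lu for each game and assumes n is an exact multiple of b.

function [W, winrows] = tournament_flat(A1, b)
% Flat-tree tournament pivoting sketch: returns the b winning rows of A1 (W)
% and their row indices in A1 (winrows).
n = size(A1, 1);
m = n / b;                              % number of b by b blocks, assuming n = m*b
winrows = 1:b;                          % current champions: start with block 1 (= A11)
for k = 2:m
    challengers = (k-1)*b+1 : k*b;      % rows of block k (the challengers)
    C = [A1(winrows, :); A1(challengers, :)];   % the 2b by b "game" matrix
    [~, ~, p] = lu(C, 'vector');        % Gaussian elimination with partial pivoting
    idx = [winrows, challengers];       % row indices of A1 in the same order as C
    winrows = reshape(idx(p(1:b)), 1, []);      % rows moved to the top are the winners
end
W = A1(winrows, :);                     % the b most "linearly independent" rows found
end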
We should add that there are many other ways to carry out the tournament to
choose the b winners. The above tournament is a sequential style tournament where
the current team of champions successively meets new challengers, one after the
other. A real world example of this is on the Survivor TV show, where a player who
is sent to an exile island meets a new challenger every week and the winner goes on
to meet the next challenger. In the literature this is called a flat tree tournament.
Another style of tournament is tennis style. A tennis style tournament is divided into
rounds. In the first round all m teams play in m/2 games. The m/2 winning teams
(or sets of rows in our case) are sent to the next round. In the second round there
are m/4 games (or, in our case, LU factorizations on matrices of size 2b by b) and
m/4 sets of rows advance. This continues through the quarterfinals, semifinals and
finals, where the b best rows are selected. This is called a binary tree tournament in
the literature. A binary tree tournament takes less time than a flat tree tournament
(in a real life tournament or on a computer that can run games simultaneously, i.e.
a parallel processing computer).
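A minimal Matlab sketch of such a binary tree tournament might look like the
following (again my own illustration; for simplicity it tracks only the winning rows
themselves, not their original indices in A1, which a real pivoting code would also
need):

function W = tournament_binary(blocks)
% Binary tree (tennis style) tournament sketch.  blocks is a cell array of
% b by b blocks of A1; W is the b by b matrix of winning rows.
while numel(blocks) > 1
    nextround = cell(1, ceil(numel(blocks)/2));
    for g = 1:floor(numel(blocks)/2)        % the games in this round
        C = [blocks{2*g-1}; blocks{2*g}];   % stack the two competing blocks
        [~, ~, p] = lu(C, 'vector');        % partial pivoting decides the game
        b = size(C, 2);
        nextround{g} = C(p(1:b), :);        % the b winning rows advance
    end
    if mod(numel(blocks), 2) == 1           % a team without an opponent gets a bye
        nextround{end} = blocks{end};
    end
    blocks = nextround;
end
W = blocks{1};
end

With m blocks this still plays m - 1 games in total, but the games within a round are
independent and can run simultaneously on a parallel machine.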
What is the communication cost of a tournament to determine the best, or most
linearly independent, rows of the n by b matrix A1? Recall that for simplicity we are
only focusing on the number of messages. In a tournament we need to pass
information between processors at the beginning of each game played. There are
only m - 1 games played in a tournament of m teams (this is simple to see in the
flat tournament case). Therefore

The pivot search in tournament pivoting applied to the n by b matrix A1 requires
initiating m - 1 (approximately n/b) messages.

This is an order of magnitude less than is required by the algorithm we described
earlier for partial pivoting. For example, with n = 2000 and b = 100, so that m = 20,
partial pivoting needs n = mb = 2000 messages while a tournament needs only
m - 1 = 19.
The paper "CALU: A Communication Optimal LU Factorization Algorithm," by Laura
Grigori, James Demmel, and Hua Xiang, SIAM J. Matrix Analysis and Applications,
Vol. 32, pp. 1317-1350, 2011, proves that tournament pivoting is communication
optimal in that the number of messages required by the algorithm, and the number
of words in these messages, achieve, within a modest factor, theoretical lower
bounds on the amount of communication. The authors conclude:

    The reason to consider CALU is that it does an optimal amount of
    communication, and asymptotically less than Gaussian elimination
    with partial pivoting (GEPP), and so will be much faster on platforms
    where communication is expensive.
