Flops Memory Parallel Processing
Flops Memory Parallel Processing
Flops Memory Parallel Processing
Flops from
elimination
Total flops
(leading order
term)
genp
(2 / 3)n 3
(2 / 3)n 3
gepp
(2 / 3)n 3
(1 / 2)n 2
(2 / 3)n 3
gerp
(2 / 3)n 3
(3 / 2)n 2 approximat
(2 / 3)n 3
gecp
ely
(2 / 3)n
(1 / 3)n 3
n3
The conclusion from the last column is, based on flop count, genp, gepp and gerp
should take about the same amount of time and gecp should take about 50%
longer. In practice genp can lead to large error growth, so we wont pursue genp
further here.
We can check the theory in the above table for the other routines by doing
experiments. When comparing run times it is better to use code produced by a
compiled computer language, rather than an interpreted language such as Matlab.
When timing an interpreted language code, one is timing the work to understand
the code as well as any computations. Since a compiled language (such a C, C++,
Fortran, etc.) is translated to machine language before execution, the timing
focuses on the computations, not the translation.
Fortunately, in the Try Your Own Pivot Scheme assignment, C code implementing
gecp is available through the Matlab Mex utility (see item 26 at the course web
page). By making a small change in the gecp source code (in the pivot search
section of the code we can change the for (j = k; j <= nminus1; ++j) loop to for
(j = k; j <= k; ++j)) we can modify the gecp code to implement gepp. I also have
Fortran code, again accessible in Matlab via the Mex utility, that implements rook
pivoting to solve Ax = b. All this code is in LINPACK style in that the flop count
was a key focus in the code development.
We can try out the methods in Matlab (using a midrange, dual processor Acer
laptop):
% complete pivoting
% rook pivoting with similar style
% partial pivoting
These runs support the conclusions of the above theory. The complete pivoting
code takes about 50% (58% in the above runs) longer than the partial or rook
pivoting code and the run times for the partial pivoting code and rook pivoting code
are about the same!
MANAGING MEMORY ACCESS, LAPACK and the 1990s
If we solve the same problem using the built in backslash in Matlab we get the
following results:
>> tic,x=A\b;toc
Elapsed time is 0.900737 seconds.
The Matlab solution implements Gaussian elimination with partial pivoting, but is
more than 13 times faster than any of the earlier runs. What is going on?
In the late 1980s and early 1990s the use of cache memory became increasingly
popular. Cache memory is a relatively expensive but is fast compared to main RAM
memory in a computer. It became apparent that it is important to keep calculations
within the fast cache memory as much as possible. Otherwise, the time to access
the slow main RAM memory outside the cache would dominate the calculation
times.
To address this issue a new publicly available linear algebra package, called LAPACK
was developed with NSF funding. This package required recoding most of the
LINPACK algorithms, typically using block factorizations which allowed dividing the
algorithm into pieces such that the calculations could be kept in cache memory as
much as possible. The current code for Matlab backslash does Gaussian elimination
with partial pivoting using such a block algorithm. As we see above this leads to a
huge decrease in run times, more than a factor of 13.
The idea of block factorization (we will ignore pivoting initially for clarity) is based
on the following block factorization of an n by n matrix A
L
(0) : 11
L21
L22
U 11 U 12 A11 A12
0
U
A
A
22
21
22
L11 , U11
Where, for a block size b, the matrices
and A11 are b by b, L21 and A21 are (nb) by b,
A11
A
21
A1
U12 and A12 are b by (n-b) and L22 , U 22 and A22 are (n-b) by (n-b). Let
and
L
L1 11 .
L21 We will call n by b matrices A1 and L1 tall skinny matrices since we will
assume that b is much smaller than n. For example b might be 100 and n 2000 or
larger.
As can be seen by multiplying the first block column of U times L, equation (0)
implies that L1U11 A1 or that:
(1) : L1
This calculation can also be done in cache since A12 and U 12 are short (they only
have b rows), fat (n columns) matrices.
Finally, if in (0) we multiply the second block row of L by the second block row of U,
it follows that L21U 12 L22U 22 A22 or that L22U 22 A22 L21U12 A22 . We break that up
into two parts
(3) : A 22 A22 L21U 12
and
(4) : L22
The matrices in (4) are typically too large to fit in cache memory. For example if n =
2000 and b = 100, then A22 and A22 are 1900 by 1900 (more generally (n-b) by (nb)). However, it is easy to separate the calculations in (3) into smaller pieces that
do fit into cache memory. The structure of (3) is indicated by
(5) :
A 22
A22
L
21
U 12
The calculations for a submatrix of A22 , for example the elements in rows 1001 to
1100 and columns 1001 to 1100, only require the same entries A22 and the elements
of rows 1001 to 1100 of L21 and columns 1001 to 1100 of U 12 , for this example.
The submatrix sizes can be selected so that these calculations fit into cache
memory.
The calculations in (4) can be done by repeating the above procedure in a loop. The
first pass through the loop we calculate the first block row and column of L and U
applying the above procedure to the 2000 by 2000 matrix A (in our example). Next
we apply the procedure to the 1900 by 1900 matrix A22 , which produces the next
block row and block column of L and U. Continuing we eventually factor the whole
matrix.
The GEPP_BLOCK code in item 25 at the course web site implements these ideas in
Matlab code. GEPP_BLOCK does a block factorization with partial pivoting. This
requires careful book keeping to keep track of the permutations. This is included in
the GEPP_BLOCK but we will not discuss the details here.
A block implementation of rook pivoting is also possible but due to the more
complicated search in rook pivoting there will be more cache misses, where one
needs to access memory outside of cache, than in partial pivoting. Finally, it
appears that is not possible to do a block implementation of complete pivoting since
complete pivoting requires a search of the entire unfinished portion of A, at every
step of the factorization.
Here are some timings for the same 2000 by 2000 matrix used earlier, using
compiled code (C and Fortran):
>> tic, x = gecp(A,b); toc
pivoting
Elapsed time is 20.461706 seconds.
>> tic, x=gerp(A,b); toc
Elapsed time is 1.367741 seconds.
>> tic, x=A\b; toc
Elapsed time is 0.900737 seconds.
% unblocked complete
% blocked rook pivoting
% blocked partial pivoting
In this experiment blocked rook pivoting requires about 50% more time than
blocked partial pivoting and complete pivoting is more than 20 times slower than
partial pivoting.
Note that the most time consuming part of blocked Gaussian elimination algorithm
is the matrix multiplication in (3) or (5). Most computer manufacturers supply very
efficient, machine language implementations of simpler matrix operations such as
matrix multiplication. Such routines are called the Basic Linear Algebra Subroutines
or the BLAS. The blocked Gaussian elimination algorithm facilitates structuring the
code so that the code can call BLAS routines.
The above discussion represents the state of the art until very recently. For
example, Matlabs backslash for square linear systems calls an implementation of
blocked Gaussian elimination with partial pivoting.
However, computers are evolving again:
PARALLEL PROCESSING, 2011 AND BEYOND
Computers with multicore processors and parallel processing add-on computer
cards are becoming increasingly common. For example, the NVIDIA Tessla cards
contain hundreds or even thousands of processors. Currently, the more advanced
parallel processing cards cost a few thousand dollars, but the prices will continue to
fall.
Soon massively parallel processing will not only be the domain of
supercomputer; parallel processing will become ubiquitous. Our algorithms will need
to adapt.
Parts of the block Gaussian elimination algorithm are easy to parallelize. For
example, it is easy to divide up the calculations in (5) into independent pieces as
one part of the communications the number of messages that need to be passed
between processors - for partial pivoting of a tall skinny matrix. A separate,
important issue is the volume of data passed between processors, but for simplicity
we will not discuss that here. In some environments, the time to establish the new
data connection, to initiate a new message, between processors can be a
bottleneck.
To the left we have picture a potential division of a tall skinny matrix, which we call
A11
matrix and assume that each block is a b by b matrix so that the complete
messages, one for each processor. The selection of pivot elements for a
factorization of the entire n by b matrix would require the same number of
Am ,1
messages for each column or a total of n= mb messages. There will be
other information passed between processors for the elimination, but we
will focus only on the pivot search. We conclude that
A21
Recall that n can be large, thousands or more. There are other algorithms for
implementing partial pivoting in parallel, but none of them minimizes the number of
messages as well as an alternate pivoting scheme.
Tournament pivoting is a new (first published in 2011) pivoting scheme. Rather
than working column by column as does partial pivoting, tournament pivoting
selects pivot elements by having competitions between pairs of blocks of A1 or
We wont try to prove this, but we can look at a specific example. Suppose the
second row of a matrix is linearly dependent on the first row. When we do the
elimination and add a multiple of the first row to the second row, the new row two
will be entirely zero. When we select the pivot element for column two we search
column two on or below the diagonal for the element furthest from zero. Therefore
the current row two, which has zero on the diagonal, wont remain at row two. So
the linear dependent row two wont remain at the top of the factorization.
With this in mind we can recast
the goal of pivoting is to select rows of the b by n matrix A1 so that
the first b rows of the pivoted matrix are linear independent.
A11
A21
A31
A41
A11
A21
Am ,1
Ak1
to the meet the next challenger. In the literature this is called a flat tree
tournament.
Another style tournament is tennis style. A tennis style tournament is divided into
rounds. On the first round all m teams play in m/2 games. The m/2 winning teams
(or sets of rows in our case) are sent to the next round. At the second round there
are m/ 4 games (or, in our case, LU factorizations on matrices of size 2b by b) and
m/4 sets of rows advance. This is continued to quarterfinals, semifinals and to finals
where the b best rows are selected. This is called a binary tree tournament in
the literature. A binary tree tournament takes less time than a flat tree tournament
(in a real life tournament or on a computer that can have simultaneous games, i.e. a
parallel processing computer).
What is the communication cost of a tournament to determine the best, or most
linear independent rows, of the n by b matrix A1 . Recall that for simplicity we are
only focusing on the number of messages. In a tournament we need to pass
information between processors at the beginning of each game played. There are
only m-1 games played in a tournament of m teams (this is simple to see in the flat
tournament case). Therefore