T. Joffrain
Department of Computer Sciences
The University of Texas at Austin
USA
email: joffrain@cs.utexas.edu
and
J.B. Layton
Lokheed-Martin Aeronautics Company
USA
email: jafrey.b.layton@lmco.com
and
E.S. Quintana-Orti
Departamento de Ingeniena y Ciencia de Computadores
Universidad Jaume I
Spain
email: quintana@icc.uji.es
and
R.A. van de Geijn
<rvdg@cs.utexas.edu>
Department of Computer Sciences
The University of Texas at Austin
USA
email: rvdg@cs.utexas.edu
Electromagnetics is one of the few application areas where the solution of extremely large dense linear systems is still a relatively common operation. These dense systems result from Boundary Elements Formulations which sacrifice sparsity in the system in order to achieve a reduction of the degrees of freedom required in the discretization. While Fast Multipole Methods (FMM) are becoming a popular alternative solution method, the fact that the solution must be obtained for a very large number of right-hand-sides often favors the more traditional solution via LU factorization with partial pivoting. Also, frequently multiple systems that differ only in parts of the dense matrix must be solved as part of an optimization process. This again favors the use of LU factorization with partial pivoting. The problems are frequently large enough that they do not fit in the combined memories of even large distributed memory architectures.
Traditional Out-of-Core (OOC) algorithms for the LU factorization use a panel approach in which the OOC matrix is processed by bringing into memory one panel of columns at a time. The problem with this technique is that it is inherently not scalable: As the numer of rows of the matrix becomes larger, the width of the panel that can be brought into memory becomes narrower, which adversely affects the ratio of I/O operations to useful computation. For matrices with millions of rows, the number of columns that can be stored into the memory can be so narrow that I/O completely dominates. Furthermore, it is hard to attain parallelism for panels of such narrow width.
The alternative that we discuss here follows closely that applied with great success to other matrix factorizations like the Cholesky decomposition or the QR factorization. We consider the matrix as a series of tiles and factorize the matrix using a two-tile approach where, at any given time, at most two tiles reside in the memory. A tiled approach provides true scalability because the size of the tile is based of the available memory of the machine being used and is independent of the size of the matrix. Besides, this approach lends itself nicely to blocked algorithms because the tiles are in essence blocks themselves.
Our two-tile approach modifies the traditional partial pivoting method of the LU factorization. It is replaced by a blocked variant of the so-called pairwise pivoting. The pivoting scheme has important consequences on the element growth in the factorization and therefore, the numerical stability of the method. Pairwise pivoting has been previously reported as being not much worse than partial pivoting in average. In some examples with a special structure, pairwise pivoting has been reported to fail due to an exponential element growth but then, so does partial pivoting for these same examples!
The OOC LU algorithm is implemented by using the parallel infrastructure in PLAPACK and the parallel OOC package POOCLAPACK. Performance results on large clusters will be reported.