PARA'04 State-of-the-Art
in Scientific Computing
June 20-23, 2004 (Home page)
Updated: 20 February 2004
Performance Modeling and Validation of the Numerical and Symbolic
Factorization Algorithms in SuperLU_DIST
Laura Grigori
INRIA, France
email: Laura.Grigori@irisa.fr
Xiaoye S. Li
Lawrence Berkeley National Laboratory
USA
email: xsli@lbl.gov
ABSTRACT:
SuperLU_DIST is a parallel direct solver for sparse nonsymmetric linear
systems. Recently, we conducted systematic performance analysis of
all different phases of the solver. This paper presents the results
on the factorization algorithms, which usually dominate the runtime.
In numerical factorization, the L and U matrices are represented as
2D block matrices, and are distributed in a 2D block-cyclic fashion on
the 2D process grid. The algorithm is right-looking.
By examining the nonzero structure of L and U, and the communication
pattern of the algorithm, we derived a theoretical bound on the
parallel efficiency of the algorithm. We show that this bound
can be used to predict the performance of any sparse matrix, when the
sparsity is measured with respect to the underlying machine parameters,
i.e., floating-point speed, the network latency and bandwidth.
Furthermore, using this model we can identify, for certain class of
sparse matrices, what hardware parameters to improve are most critical
for improving the overall performance.
In symbolic factorization, the nonzero structures of the lower and upper
triangular factors L and U of are computed. This step precedes
numerical factorization and it is the only step that is sequential
at the moment. In particular, it needs the input matrix structure to
reside on one processor and thus it is currently the memory bottleneck
in the solver.
The available algorithms for symbolic factorization are inherently sequential,
and a new algorithmic approach must be considered to solve this problem in
parallel. Our current research focuses on the development of a
parallel symbolic factorization algorithm for distributed memory machines,
which is suitable for general sparse unsymmetric matrices. Its main goal
is to decrease the memory requirements of the symbolic factorization step,
preventing this step from being the memory and the computational
bottleneck.
First, we performed a simulation of the new algorithm and evaluated its
performance. The analysis showed that the algorithm presents good memory
scalability, i.e., if both the problem size and the number of
processors is increased at the same rate, roughly the same amount of memory is
used per processor. Issues of load balance as well as communication
were addressed. At the time of this writing, we are concentrating on the
implementation of this algorithm using MPI. Before the submission
of the final paper, we will complete a detailed performance study of the
proposed algorithm on the real world matrices.
Home page
2004-02-20