PARA'04 State-of-the-Art
in Scientific Computing
June 20-23, 2004 (Home page)

Updated: 20 February 2004

Performance Modeling and Validation of the Numerical and Symbolic Factorization Algorithms in SuperLU_DIST

Laura Grigori
INRIA, France
email: Laura.Grigori@irisa.fr Xiaoye S. Li
Lawrence Berkeley National Laboratory
USA
email: xsli@lbl.gov

ABSTRACT:

SuperLU_DIST is a parallel direct solver for sparse nonsymmetric linear systems. Recently, we conducted systematic performance analysis of all different phases of the solver. This paper presents the results on the factorization algorithms, which usually dominate the runtime.

In numerical factorization, the L and U matrices are represented as 2D block matrices, and are distributed in a 2D block-cyclic fashion on the 2D process grid. The algorithm is right-looking. By examining the nonzero structure of L and U, and the communication pattern of the algorithm, we derived a theoretical bound on the parallel efficiency of the algorithm. We show that this bound can be used to predict the performance of any sparse matrix, when the sparsity is measured with respect to the underlying machine parameters, i.e., floating-point speed, the network latency and bandwidth. Furthermore, using this model we can identify, for certain class of sparse matrices, what hardware parameters to improve are most critical for improving the overall performance.

In symbolic factorization, the nonzero structures of the lower and upper triangular factors L and U of are computed. This step precedes numerical factorization and it is the only step that is sequential at the moment. In particular, it needs the input matrix structure to reside on one processor and thus it is currently the memory bottleneck in the solver.

The available algorithms for symbolic factorization are inherently sequential, and a new algorithmic approach must be considered to solve this problem in parallel. Our current research focuses on the development of a parallel symbolic factorization algorithm for distributed memory machines, which is suitable for general sparse unsymmetric matrices. Its main goal is to decrease the memory requirements of the symbolic factorization step, preventing this step from being the memory and the computational bottleneck.

First, we performed a simulation of the new algorithm and evaluated its performance. The analysis showed that the algorithm presents good memory scalability, i.e., if both the problem size and the number of processors is increased at the same rate, roughly the same amount of memory is used per processor. Issues of load balance as well as communication were addressed. At the time of this writing, we are concentrating on the implementation of this algorithm using MPI. Before the submission of the final paper, we will complete a detailed performance study of the proposed algorithm on the real world matrices.

Home page

2004-02-20