\documentclass{llncs}

\pagestyle{myheadings}

\begin{document}

\title{LAWRA Workshop \\
       Linear Algebra with Recursive Algorithms \\
            http://lawra.uni-c.dk/lawra/}
\author{Fred Gustavson\inst{1} and Jerzy Wa{\'{s}}niewski\inst{2}}
\institute{
        IBM T.J. Watson Research Center, P.P. Box
        218, Yorktown Heights, NY 10598. USA,
        email: gustav@watson.ibm.com
\and
        The Danish Computing Centre for Research and Education
        (UNI$\bullet$C), Technical University of Denmark,
        Building~304, DK-2800 Lyngby, Denmark, \\
        email: jerzy.wasniewski@uni-c.dk}

\maketitle
\index{Gustavson, Fred}
\index{Wa{\'s}niewski, Jerzy}

\markboth{F.G. Gustavson and J. Wa{\'{s}}niewski}
{LAWRA Workshop, Linear Algebra with Recursive Algorithms}

A good quality numerical software is very important in many industrial
applications. Here "quality" means:
  \begin{itemize}
    \item the practical problem is solved as fast as possible, and
    \item with an acceptable error (as is well-known computers make
roundoff errors which can spoil the result significantly).
  \end{itemize}
  Therefore, some companies develop libraries of computer programs which
are ready to use, and where the "best" present algorithms are implemented.
One such library is LAPACK (Linear Algebra PACKage)~\cite{laug} which is
the basis for highly tuned libraries on different architectures, and BLAS
(Basic Linear Algebra Subroutines)~\cite{blas} in which some basic matrix
and vector operations are implemented, and which is heavily used in
LAPACK.

Modern computer architectures have a hierarchical memory (with 1 or more
levels of cache memory). This allows faster algorithms but only when the
algorithm is designed appropriately. There is some development in this
field (for example in LAPACK and ATLAS)~\cite{laug,atlas98} but it seems
that the memory is not used to the full possible extent, and there is a
reason to do further research.

In the last 3 years a new idea emerged which leads to faster algorithms on
modern processors. This is the recursive formulation of all basic
algorithms in numerical software packages. It turns out that recursion
leads automatically to better utilization of the memory, and so faster
algorithms result. Sometimes better algorithms emerge. There is a number of
algorithms where recursion has been applied successfully. In one case the
recursive algorithm was 10 times faster than the existing one but here the
speed up was additionally due to the SMP environment. The recursive
algorithms automatically benefitted from the parallelism and the traditional
methods did not.

The second important issue is that recursive programs ``capture'' very
concisely the mathematical formulation, and are easy to read and
understand (but only if one knows the subject area).

New research on recursive algorithms started at the IBM Watson Research
Center, USA in 1997. Then, in the beginning of 1998, two more centers
joined the team working on these algorithms: UNI$\bullet$C (Danish
Computing Center for Research and Education), and the University of
Ume{\aa}, Sweden. Also, UNI$\bullet$C works in cooperation with the
University of Rousse, Bulgaria on these problems. The ATLAS~\cite{atlas98}
library is extended by the recursive algorithms. At present there have
been developed and tested the following recursive algorithms:
  \begin{itemize}
    \item LU decomposition for general
matrices~\cite{gus1,tol1,JerzyPoland99},
    \item Cholesky decomposition for symmetric and positive definite
matrices~\cite{gus1,AndersenGustavsonWasniewski99b,JerzyWasniewski98,JerzyPoland99},
    \item QR factorization for general matrices~\cite{ElmrothGustavson98},
    \item Several algorithms for symmetric indefinite
system~\cite{JerzyPoland99} and
    \item Recursive BLAS~\cite{FredGustavson98,JerzyPoland99,BoKagstrom97}.
\footnote{See also the LAWRA papers on http://lawra.uni-c.dk/lawra/papers/
and on http://lawra.uni-c.dk/lawra/references/}
  \end{itemize}

  We have improved significantly some of the existing algorithms in LAPACK.
This happens because the recursive formulation uses the cache memory more
efficiently and sometimes also allows the use of larger block operations
(i.~e. more efficient use of BLAS).

There are many widely used algorithms in practice for which the recursive
formulation has not been studied (e.~g. algorithms for eigenvalues,
singular values, generalized eigenvalues, generalized singular values,
etc.). Based on our preliminary experience, we think that recursion
will also benefit these algorithms.
%We think that we have obtained good experience and predict that the
%impact of recursive algorithms on the quality of new numerical software
%will be significant.

  Another important consequence of this research is that as computer
vendors improve the hardware of parallel high performance
computers, the memory hierarchies are tending to become deeper. Thus
recursive algorithms, since they automatically block for them, will be able
to take advantage of the new memories. In this way our research has impact
on both software and hardware.

\section{Outline of the Workshop}

\section*{New Generalized Data Structures for Matrices Lead to a Variety of
High Performance Dense Linear Algebra Algorithms}

First we give a very brief overview of the Algorithms and Architecture
Approach of as means to produce high performance Dense Linear Algorithms.
The key idea developed is blocking for today's deep memory hierarchies.

Next we develop how recursion relates to dense linear algebra. Specifically
we show it leads to automatic variable blocking which is more general than
conventional fixed blocking. New concise algorithms emerge. The machine
independent design of dense linear algebra codes is challenged. By doing so
we are lead to the conclusion that new data structures are required.

Next we describe new data structures for full and packed storage of dense
symmetric/triangular arrays that generalize both. Using the new data
structures one is led to several new algorithms that save "half" the
storage and outperform the current blocked based level 3 algorithms in
LAPACK. We concentrate on the simplest forms of the new algorithms and show
for Cholesky factorization they are a direct generalization of
LINPACK~\cite{linpack}. This means that level 3 BLAS's~\cite{blas3} are not
required to obtain level 3 performance. The replacement for Level 3 BLAS
are so-called kernel routines, see~\cite{AgawalGustavson94}, and on IBM
platforms they are producible from simple vanilla codes by the XLF
FORTRAN~\cite{ibmlanref} compiler. The results for Cholesky, on Power3 for
$n \geq 200$ is over 720 MFlops and reaches 735 MFlops out of a peak of 800
MFlops. Using conventional full format LAPACK
xPOTRF\footnote{LAPACK~\cite{laug} subroutine name; x can be S for
single precision, D for double precision, C for complex or
Z for double complex.} with
ESSL~\cite{ibmessl97} BLAS's, one first gets 600 MFlops at $n \geq 600$ and
only reaches a peak of 620 MFlops. Before leaving Cholesky, we describe a
recursive formulation of Cholesky Factorization of a matrix in packed
storage due to the authors and Bjarne S.
Andersen~\cite{AndersenGustavsonWasniewski99b} of UNI$\bullet$C. The key to
this algorithm is a novel data structure that converts standard packed
format into $n - 1$ "square" full matrices and n scalar diagonal elements.
This formulation only use conventional xGEMM\footnote{BLAS Level
3~\cite{blas3} subroutine name.}, a surprising result in itself.

We have also produced simple square blocked full matrix data formats where
the blocks themselves are stored in column major ( FORTRAN ) order or row
major ( C ) format. The simple algorithms of LU~\cite{gus1,JerzyPoland99}
factorization with partial pivoting for this new data format is a direct
generalization of LINPACK algorithm xGEFA\footnote{LINPACK~\cite{linpack}
subroutine name.}. Again, no conventional level 3 BLAS's are required; the
replacements are again so-called kernel routines. Programming for squared
blocked full matrix format is accomplished in
FORTRAN~\cite{ibmlanref,fortran95} through the use of three and four
dimensional arrays. Thus, no new compiler support is necessary. As in the
Cholesky case above we mention that other more complicated algorithms are
possible; e.g., recursive ones. The recursive algorithms are also easily
programmed via the use of tables that address where the blocks are stored
in the two dimensional recursive block array.

%\bibliography{references}
\begin{thebibliography}{10}

\bibitem{AgawalGustavson94}
R.C. Agawal, F.G. Gustavson, and M.~Zubair.
\newblock Exploiting functional parallelism on power2 to design
  high-performance numerical algorithms.
\newblock {\em IBM Journal of Research and Development}, 38(5):563--576,
  September 1994.

\bibitem{JerzyPoland99}
B.S. Andersen, F.~Gustavson, A.~Karaivanov, J.~Wa{\'s}niewski, and P.Y.
  Yalamov.
\newblock {LAWRA} -- {L}inear {A}lgebra with {R}ecursive {A}lgorithms.
\newblock In R.~Wyrzykowski, B.~Mochnacki, H.~Piech, and J.~Szopa, editors,
  {\em Proceedings of the $3^{th}$ International Conference on Parallel
  Processing and Applied Mathematics, PPAM'99}, pages 63--76, Kazimierz Dolny,
  Poland, 1999. Technical University of Cz{\c{e}}stochowa.

\bibitem{AndersenGustavsonWasniewski99b}
B.S. Andersen, F.~Gustavson, and J.~Wa{\'s}niewski.
\newblock Formulation of the {C}holesky {F}actorization {O}perating on a
  {M}atrix in {P}acked {S}torage {F}orm.
\newblock http://lawra.uni-c.dk/lawra/pspapers/bjarnepaper99.ps.
\newblock The paper is sent to be published in ACM TOMS.

\bibitem{laug}
E.~Anderson, Z.~Bai, C.~Bischof, S.~Blackford, J.~Demmel, J.~Dongarra,
  J.~Du~Croz, A.~Greenbaum, S.~Hammarling, A.~McKenney, and D.~Sorensen.
\newblock {\em {LAPACK} Users' Guide}.
\newblock Society for Industrial and Applied Mathematics, Philadelphia, PA,
  third edition, 1999.

\bibitem{blas}
J.~Dongarra et~al.
\newblock {BLAS} ({B}asic {L}inear {A}lgebra {S}ubprograms).
\newblock http://www.netlib.org/blas/.
\newblock Ongoing Projects at the Innovative Computing Laboratory, Computer
  Science Department, University of Tennessee at Knoxville, USA.

\bibitem{linpack}
J.~J. Dongarra, J.~R. Bunch, C.~B. Moler, and G.~W. Stewart.
\newblock {\em {LINPACK} {U}sers' {G}uide}.
\newblock Society for Industrial and Applied Mathematics, Philadelphia, PA,
  USA, 1979.

\bibitem{blas3}
J.~J. Dongarra, J.~Du~Croz, I.~S. Duff, and S.~Hammarling.
\newblock A set of {L}evel 3 {B}asic {L}inear {A}lgebra {S}ubprograms.
\newblock {\em {ACM} Trans. Math. Soft.}, 16(1):1--28, March 1990.

\bibitem{ElmrothGustavson98}
E.~Elmroth and F.~Gustavson.
\newblock {N}ew {S}erial and {P}arallel {R}ecursive ${QR}$ {F}actorization
  {A}lgorithms for {SMP} {S}ystems.
\newblock In B.~K{\aa}gstr{\"o}m, J.~Dongarra, E.~Elmroth, and
  J.~Wa{\'s}niewski, editors, {\em Proceedings of the $4^{th}$ International
  Workshop, Applied Parallel Computing, Large Scale Scientific and Industrial
  Problems, PARA'98}, number 1541 in Lecture Notes in Computer Science Number,
  pages 120--128, Ume{\aa}, Sweden, June 1998. Springer.

\bibitem{FredGustavson98}
F.~Gustavson, A.~Henriksson, I.~Jonsson, B.~K{\aa}gstr{\"o}m, and P.~Ling.
\newblock Recursive {B}locked {D}ata {F}ormats and {BLAS}' for {D}ense {L}inear
  {A}lgebra {A}lgorithms.
\newblock In B.~K{\aa}gstr{\"o}m, J.~Dongarra, E.~Elmroth, and
  J.~Wa{\'s}niewski, editors, {\em Proceedings of the $4^{th}$ International
  Workshop, Applied Parallel Computing, Large Scale Scientific and Industrial
  Problems, PARA'98}, number 1541 in Lecture Notes in Computer Science Number,
  pages 195--206, Ume{\aa}, Sweden, June 1998. Springer.

\bibitem{gus1}
F.G. Gustavson.
\newblock Recursion leads to automatic variable blocking for dense
  linear-algebra algorithms.
\newblock {\em IBM Journal of Research and Development}, 41(6), November 1997.

\bibitem{ibmessl97}
{IBM}.
\newblock {\em {E}ngineering and {S}cientific {S}ubroutine {L}ibrary for
  {AIX}}, {V}ersion 3, {V}olume 1 edition, December 1997.
\newblock Pub. number {SA}22--7272--0.

\bibitem{ibmlanref}
{IBM}.
\newblock {\em {XL} Fortran {AIX}, {L}anguage {R}eference}, first edition, Dec
  1997.
\newblock {V}ersion 5, {R}elease 1.

\bibitem{BoKagstrom97}
B.~K{\aa}gstr{\"o}m, P.~Ling, and C.~Van Loan.
\newblock {GEMM}-based level 3 {BLAS}: {H}igh-{P}erformance {M}odel
  {I}mplementations and {P}erformance {E}valuation {B}enchmark.
\newblock {\em ACM Trans. Math. Software}, 24(3):268--302, 1998.

\bibitem{fortran95}
M.~Metcalf and J.~Reid.
\newblock {\em {FORTRAN~90} Explained}.
\newblock Oxford University Press, Oxford, UK, first edition, 1990.

\bibitem{tol1}
S.~Toledo.
\newblock {L}ocality of {R}eference in {LU} {D}ecomposition with {P}artial
  {P}ivoting.
\newblock {\em SIAM Journal of Matrix Analysis and Applications}, 18(4), 1997.

\bibitem{JerzyWasniewski98}
J.~Wa{\'s}niewski, B.S. Andersen, and F.~Gustavson.
\newblock {R}ecursive {F}ormulation of {C}holesky {A}lgorithm in {F}ortran~90.
\newblock In B.~K{\aa}gstr{\"o}m, J.~Dongarra, E.~Elmroth, and
  J.~Wa{\'s}niewski, editors, {\em Proceedings of the $4^{th}$ International
  Workshop, Applied Parallel Computing, Large Scale Scientific and Industrial
  Problems, PARA'98}, number 1541 in Lecture Notes in Computer Science Number,
  pages 574--578, Ume{\aa}, Sweden, June 1998. Springer.

\bibitem{atlas98}
R.C. Whaley and J.~Dongarra.
\newblock {A}utomatically {T}uned {L}inear {A}lgebra {S}oftware ({ATLAS}).
\newblock http://www.netlib.org/atlas/, 1999.
\newblock University of Tennessee at Knoxville, Tennessee, USA.

\end{thebibliography}

\end{document}

