J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. Van De Geijn
IBM and others, USA
Abstract:
During the last half-decade, a number of projects have developed software for generating automatically tuned matrix-matrix multiplication algorithms, including the PHiPAC project and the ATLAS project.
We develop a simple model of hierarchical memories and we use it
to determine an optimal strategy for blocking matrices. This model
predicts the form of current, state-of-the-art L1 kernels and five
other, related L1 kernels. Additionally, it shows that current L1
kernels can continue to produce their high performance for operand
matrices that considerably overflow the L1 cache. For a hierarchical
memory with L memory levels (main memory and L-1 caches), our model
reduces the number of potential matrix multiply algorithms from
to 3. Even so, many current _GEMM implementations use only
one or two algorithms. We use the shape of the matrix input operands,
M, N, and K, to select one of our 3 algorithms.
The theoretical results show that depending on the shape of the matrices involved, different strategies are locally optimal. Rather than determining a blocking strategy at library generation time, the theoretical results also point us to a heuristic that allows the blocking strategy to be determined dynamically at run-time as a function of the shapes of the matrices being multiplied. Preliminary experimental results show that the approach yields performance that is superior to that of ATLAS on Intel Pentium(TM) III and IBM POWER3 processors.