## CS494 Lecture Notes - Multiplying Matrices and the Influence of Memory

• March, 2015
• Revision notes: Fall, 2017.
• James S. Plank
• Directory: /home/plank/cs494/Notes/MM-Memory
This is a set of lecture notes that takes a very simple programming task -- multiplying two matrices -- and shows you how paying attention to memory can yield you enormous performance gains, as opposed to doing things naively. An auxiliary hope is that after going through this lecture, you have a sneaking suspicion that programming in Python and Java is, in some important respects, quite inferior to programming in C.

### Step zero -- Strassen's Algorithm

You should know that Strassen's Algorithm exists, and roughly how it reduces the running time to something lower than O(n3). Here's the explanation on Wikipedia. We're not going to explore Strassen's algorithm.

### Step one -- writing a matrix multiplier

I have a very simple matrix multiplier in mm-plain.cpp. You call it as follows:

 ```usage: mm r1 c1/r2 c2 seed(-1=time(0)) print(y|n)\n"); ```

It's straightforward. We have a class called MM, which has two input matrices, M1 and M2, and a product matrix P. It has Multiply() and PrintAll() methods. We create the input matrices in main() with random numbers between 0 and 2 (using the horrible random number generator drand48(). It's fine for this class, since our real intent is to focus on timing, not the quality of the randomness). Here's the class specification:

 ```class MM { public: vector < vector > M1; vector < vector > M2; vector < vector > P; int Print; void Multiply(); void PrintAll(); }; ```

And here's the all important Multiply() method, which is about as straightforward as you can get:

 ```void MM::Multiply() { int i, j, k; for (i = 0; i < P.size(); i++) { for (j = 0; j < P[0].size(); j++) { for (k = 0; k < M2.size(); k++) P[i][j] += (M1[i][k] * M2[k][j]); } } } ```

When we run it, we first double-check to make sure that it's correct:

```UNIX> mm-plain 4 2 3 1 y
M1: 4 x 2

0.0833 0.9090
1.6696 0.6720
1.1310 0.0035
0.3752 1.9809

M2: 2 x 3

1.5010 0.7325 0.7024
1.1467 0.2651 0.1283

P: 4 x 3

1.1673 0.3020 0.1751
3.2767 1.4012 1.2590
1.7016 0.8294 0.7949
2.8346 0.8000 0.5177
Time: 0.0000
UNIX> echo '1.6696 * 1.5010 + 0.6720 * 1.1467' | bc
3.2765
UNIX>
```
And when we time it, we see the O(n3) behavior that we'd expect (this is on my 2012 MacBook Pro. On my 2015 MacBook, the timings are better by a factor of 2-3, but relatively to each other, they are the same.):
```UNIX> mm-plain 50 50 50 1 n
Time: 0.0006
UNIX> !!:gs/50/100
mm-plain 100 100 100 1 n
Time: 0.0076
UNIX> !!:gs/100/200/
mm-plain 200 200 200 1 n
Time: 0.0480
UNIX> !!:gs/200/400/
mm-plain 400 400 400 1 n
Time: 0.3318
UNIX> !!:gs/400/800/
mm-plain 800 800 800 1 n
Time: 5.7603
UNIX>
```
That last timing looks a bit off to me -- I'm going to suspect a memory issue.

### mm-transpose

If you think about memory usage, the way we access M2 is really poor. With M1 and P, we always access elements that are contiguous in memory. However, with M2, we keep accessing elements in different vectors. That's going to result in very poor cache behavior, and perhaps even bad TLB behavior.

The obvious fix is to transpose M2, which we do in mm-transpose.cpp. I create the matrix transposed -- note how M2 has c2 rows and c1 columns. I create the values so that M2[i][j] in mm-transpose.cpp has the same values that M2[j][i] had in mm-plain.cpp. I do that so that you can double-check the answers with mm-plain.cpp:

 ``` M->M1.resize(r1); for (i = 0; i < M->M1.size(); i++) M->M1[i].resize(c1); M->M2.resize(c2); for (i = 0; i < M->M2.size(); i++) M->M2[i].resize(c1); M->P.resize(r1); for (i = 0; i < M->P.size(); i++) M->P[i].resize(c2, 0); for (i = 0; i < M->M1.size(); i++) { for (j = 0; j < M->M1[i].size(); j++) { M->M1[i][j] = drand48()*2.0; } } for (j = 0; j < M->M2[0].size(); j++) { // What was M2[i][j] is now M2[j][i]. for (i = 0; i < M->M2.size(); i++) { M->M2[i][j] = drand48()*2.0; } } ```

Here's the multiplier, where we run through the rows of M2 rather than the columns:

 ```void MM::Multiply() { int i, j, k; for (i = 0; i < P.size(); i++) { for (j = 0; j < P[0].size(); j++) { for (k = 0; k < M1[0].size(); k++) P[i][j] += (M1[i][k] * M2[j][k]); // This is the change: ^^^^ } } } ```

As always, our first step is to double-check that our results match our first matrix multiplier:

```UNIX> mm-transpose 4 2 3 1 y | head -n 18 > tmp1.txt
UNIX> mm-plain 4 2 3 1 y | head -n 18 > tmp2.txt
UNIX> diff tmp1.txt tmp2.txt
UNIX>
```
Let's compare timings:
```UNIX> echo `mm-plain 50 50 50 1 n` `mm-transpose 50 50 50 1 n`
Time: 0.0008 Time: 0.0007
UNIX> !!:gs/50 50 50/100 100 100/
echo `mm-plain 100 100 100 1 n` `mm-transpose 100 100 100 1 n`
Time: 0.0064 Time: 0.0044
UNIX> !!:gs/100 100 100/200 200 200/
echo `mm-plain 200 200 200 1 n` `mm-transpose 200 200 200 1 n`
Time: 0.0469 Time: 0.0345
UNIX> !!:gs/200 200 200/400 400 400/
echo `mm-plain 400 400 400 1 n` `mm-transpose 400 400 400 1 n`
Time: 0.3288 Time: 0.2770
UNIX> !!:gs/400 400 400/800 800 800/
echo `mm-plain 800 800 800 1 n` `mm-transpose 800 800 800 1 n`
Time: 5.7788 Time: 2.3346
UNIX> echo '0.277 * 8' | bc
2.216
UNIX>
```
As we suspected, that last timing hit a memory cliff for mm-plain. On the flip side, mm-transpose exhibits pretty nice O(n3) behavior.

### Step Three -- Making it more memory efficient by using a 1D array

As suggested by Ben A in class, we can do better if we get rid of C++'s two-dimensional array structure, and simply allocate each matrix in its own r * c array. This is because each row of the vector can be in its own memory location. If you do one big matrix, then you are guarateed that the whole matrix is allocated contiguously. The fix is in mm-oned.cpp. Now, the matrices are simple vectors:

 ```class MM { public: vector M1; vector M2; vector P; int r1, c1, c2; int Print; void Multiply(); void PrintAll(); }; ```

Where we did M[i][j] before, we now do M[i*cols+j]. Here's the new matrix mutiply code:

 ```void MM::Multiply() { int i, j, k; for (i = 0; i < r1; i++) { for (j = 0; j < c2; j++) { for (k = 0; k < c1; k++) P[i*c2+j] += M1[i*c1+k] * M2[j*c1+k]; } } } ```

As suspected, this improves performance over mm-transpose. Let's do our checks and timings (if md5 doesn't exist on your system, try "openssl md5"):

```UNIX> mm-oned 4 2 3 1 y | head -n 18 > tmp3.txt
UNIX> md5 tmp*.txt
MD5 (tmp1.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp2.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp3.txt) = e9c4ee5f9c6520146ba1468f39ed3873
UNIX>
UNIX>
UNIX> echo `mm-plain 50 50 50 1 n` `mm-transpose 50 50 50 1 n` `mm-oned 50 50 50 1 n`
Time: 0.0006 Time: 0.0005 Time: 0.0004
UNIX> !!:gs/50 50 50/100 100 100/
echo `mm-plain 100 100 100 1 n` `mm-transpose 100 100 100 1 n` `mm-oned 100 100 100 1 n`
Time: 0.0047 Time: 0.0043 Time: 0.0040
UNIX> !!:gs/100 100 100/200 200 200/
echo `mm-plain 200 200 200 1 n` `mm-transpose 200 200 200 1 n` `mm-oned 200 200 200 1 n`
Time: 0.0472 Time: 0.0346 Time: 0.0283
UNIX> !!:gs/200 200 200/400 400 400/
echo `mm-plain 400 400 400 1 n` `mm-transpose 400 400 400 1 n` `mm-oned 400 400 400 1 n`
Time: 0.3314 Time: 0.2764 Time: 0.2262
UNIX> !!:gs/400 400 400/800 800 800/
echo `mm-plain 800 800 800 1 n` `mm-transpose 800 800 800 1 n` `mm-oned 800 800 800 1 n`
Time: 5.7702 Time: 2.3443 Time: 1.9389
UNIX> echo "" | awk '{ print 1.9389 / 2.3443 }'
0.82707
UNIX>
```
Seventeen percent is nothing to sneeze at.

(As an aside, on my 2015 Macbook, there was no improvement from mm-transpose to mm-oned. I'm going to surmise that it's the compiler that has improved.)

### Step Four -- Using pointers instead of all of that multiplication and addition

When you look at that inner loop statement:

 ```for (k = 0; k < c1; k++) P[i*c2+j] += M1[i*c1+k] * M2[j*c1+k]; ```

It does a lot of arithmetic. In particular, you have three integer multiplications, three integer additions, plus one floating point addition and multiplication. And those are just the arithmetic operations that you see. Each of those matrix indirections will require a multiplication, and addition, and a load (plus the final store).

Of course, the compiler can and does optimize, so you can bet that a lot of those operations will be minimized; however, do you think that the compiler will get them all? I suspect it will do pretty well, but it won't get them all.

So, our next iteration replaces all of the matrix operations with pointer operations. This eliminates all of the multiplications, with the exception of the floating point. The change is in mm-oned-p.cpp:

(As an aside, when we wrote this in class, we added each pointer separately, testing as we went, so that we could get it correct more quickly. As you know, I always advocate that when you are programming.)

 ```void MM::Multiply() { int i, j, k; double *m1, *m1t, *m2, *p; m1 = &(M1[0]); /* m1 is going to point to the first element of the row of M1 */ p = &(P[0]); /* p points to the product element that we are calculating */ for (i = 0; i < r1; i++) { m2 = &(M2[0]); /* We use m2 to run through the "column" of M2*/ for (j = 0; j < c2; j++) { m1t = m1; /* And we use m1t to actually run through the row of M1 */ for (k = 0; k < c1; k++) { (*p) += (*m1t) * (*m2); m1t++; m2++; } p++; /* Since everything is contiguous, this moves us to the next product element, regardless of whether we are at the end of a row. Similarly, at this point, m2 is pointing to the beginning of the next row of M2, so we don't need to update it. */ } m1 = m1t; /* Also similarly, at the end of the row, m1t is pointing to the */ } /* beginning of the next row. That's convenient. */ } ```

That other piece of code reads much more nicely, doesn't it? When we test it, we see that we have destroyed the performance of the optimizer. Go team!!

```UNIX> mm-oned-p 4 2 3 1 y | head -n 18 > tmp4.txt
UNIX> md5 tmp*.txt
MD5 (tmp1.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp2.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp3.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp4.txt) = e9c4ee5f9c6520146ba1468f39ed3873
UNIX> echo `mm-plain 50 50 50 1 n` `mm-transpose 50 50 50 1 n` `mm-oned 50 50 50 1 n` `mm-oned-p 50 50 50 1 n`
Time: 0.0008 Time: 0.0007 Time: 0.0006 Time: 0.0002
UNIX> !!:gs/50 50 50/100 100 100/
echo `mm-plain 100 100 100 1 n` `mm-transpose 100 100 100 1 n` `mm-oned 100 100 100 1 n` `mm-oned-p 100 100 100 1 n`
Time: 0.0057 Time: 0.0044 Time: 0.0038 Time: 0.0013
UNIX> !!:gs/100 100 100/200 200 200/
echo `mm-plain 200 200 200 1 n` `mm-transpose 200 200 200 1 n` `mm-oned 200 200 200 1 n` `mm-oned-p 200 200 200 1 n`
Time: 0.0468 Time: 0.0356 Time: 0.0283 Time: 0.0094
UNIX> !!:gs/200 200 200/400 400 400/
echo `mm-plain 400 400 400 1 n` `mm-transpose 400 400 400 1 n` `mm-oned 400 400 400 1 n` `mm-oned-p 400 400 400 1 n`
Time: 0.3323 Time: 0.2784 Time: 0.2312 Time: 0.0757
UNIX> !!:gs/400 400 400/800 800 800/
echo `mm-plain 800 800 800 1 n` `mm-transpose 800 800 800 1 n` `mm-oned 800 800 800 1 n` `mm-oned-p 800 800 800 1 n`
Time: 5.8131 Time: 2.3560 Time: 1.9589 Time: 0.7848
UNIX>
```
Wow! I did not expect to beat the optimizer by that much! So take a look at where we've come by paying attention to memory. On the 400x400 matrices, our most recent program is 4.4 times faster than our first program. That's pretty significant!

(And the 2017 aside. Now, I'm only getting 2% improvement from mm-oned-p. The overall improvement over mm-plain is roughly 50%. Again, I'm going to attribute all of this to the compiler, which is able to optimize instructions really well. It can't optimize memory by performing the transposition though, so it is simply the act of transposing memory that has improved our lives. I would call that comforting, because it means that you don't have to write yucky pointer-laden code to get good performance.)

### Step Five Block Multiplication --

This is an idea worth exploring. If you partition a matrix into blocks, and treat each block like its own matrix element, then you can multiply the original matrix by performing matrix multiplication on the block matrix, where the addition and multiplication components of the block matrices are matrix addition and multiplication fo the blocks.

That was an awful sentence. Sorry. Let's look at pictures. Here's a multiplication of a 20x16 matrix A by a 16x12 matrix B, to yield a 20x12 matrix C:

I've already shown the partition of these matrices into 4x4 blocks. You can instead view the product as one of a 5x4 matrix by a 4x3 matrix to yield a 5x3 matrix:

It just so happens that each of the elements in the picture above is a 4x4 matrix instead of a number. So, when you want to calculate element C31 in the above matrix, you do your standard dot product of row 3 and column 1:

However, each of those elements is a 4x4 matrix, so what you're really doing is the following:

Now, why would you care about doing block matrix multiplication? The answer lies in how you access memory. Think about it -- when you calculate the product one row at a time, you will be reading in one row of A and the entire B matrix, before you read in the next row of A. If you can't fit all of that into the cache, then you are going to be loading all of B into the cache once per row of A.

Now, suppose you are doing block multiplications. And you calculate the product by calculating a row of blocks at a time. Now you are loading the entire B matrix from memory once per block, which cuts down on the load time by a factor of the block size. As the matrices get big, this could make a big difference.

To hack this up, we're all tempted to start with the best implementation, mm-oned-p.cpp, and turn it into a block matrix implementation. I'm going to resist that temptation, and instead start with mm-oned.cpp. I'll then convert it to pointers, one step at a time, because I think that will be the least bug-laden approach.

My first block matrix implementation is in: mm-block.cpp. I have added a blocksize argument, and I have added a MultiplyBlock() method, which multiplies the block starting at [row1,col1] of matrix M1 with the block starting at [col1,col2] of M2 (which of course, is transposed). This product is added into the block starting at [ro1,col2] of matrix P.

Here is the relevant code. I'm not proud of the tertiary expressions, and the compiler may not be able to optimize anything away, but I'm going to convert this to pointers anyway, so I'm not sweating it:

 ```class MM { public: vector M1; vector M2; vector P; int r1, c1, c2; int Print; int Blocksize; void MultiplyBlock(int row1, int col1, int col2); void Multiply(); void PrintAll(); }; void MM::MultiplyBlock(int row1, int col1, int col2) { int pr, pc, tmp; for (pr = row1; pr < ((row1 + Blocksize > r1) ? r1 : row1 + Blocksize); pr++) { for (pc = col2; pc < ((col2 + Blocksize > c2) ? c2 : col2 + Blocksize); pc++) { for (tmp = col1; tmp < ((col1 + Blocksize > c1) ? c1 : col1 + Blocksize); tmp++) { P[pr*c2+pc] += (M1[pr*c1+tmp] * M2[pc*c1+tmp]); } } } } void MM::Multiply() { int row1, col1, col2; for (row1 = 0; row1 < r1; row1 += Blocksize) { for (col2 = 0; col2 < c2; col2 += Blocksize) { for (col1 = 0; col1 < c1; col1 += Blocksize) { MultiplyBlock(row1, col1, col2); } } } } ```

If you remember from class, this took a little teeth-gnashing and debugging, but we did get it right eventually:

```UNIX> mm-block 4 2 3 1 1 y | head -n 18 > tmp5.txt
UNIX> mm-block 4 2 3 2 1 y | head -n 18 > tmp6.txt
UNIX> mm-block 4 2 3 3 1 y | head -n 18 > tmp7.txt
UNIX> md5 tmp*.txt
MD5 (tmp1.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp2.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp3.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp4.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp5.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp6.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp7.txt) = e9c4ee5f9c6520146ba1468f39ed3873
UNIX> mm-block 21 17 13 4 1 y | tail -n 22 | head -n 21 | md5
d5a69ea68d5b9ba22a64c25334c947ae
UNIX> mm-oned-p 21 17 13 1 y | tail -n 22 | head -n 21 | md5
d5a69ea68d5b9ba22a64c25334c947ae
UNIX>
```

The next step is to convert the core routine to use pointers rather than array indirection. The goal is to rid our program of all multiplication, with the exception of multiplying the elements of the matrices. We did this in class, and it took multiple days, because the process was bug-prone, and I'd have to quit and go back to my office to debug. We got it right finally, and here's MultiplyBlock():

 ```void MM::MultiplyBlock(int row1, int col1, int col2) { double *m1p, *m1top, *m1; double *m2p, *m2top, *m2base, *m2; double *p, *ptmp; int tmp, tmptop; m1 = &(M1[row1 * c1 + col1]); m1top = m1 + (c1 * Blocksize); if (m1top > &(M1[r1 * c1])) m1top = &(M1[r1 * c1]); m2base = &(M2[col2 * c1 + col1]); m2top = m2base + (c1 * Blocksize); if (m2top > &(M2[c2 * c1])) m2top = &(M2[c2 * c1]); p = &(P[row1 * c2 + col2]); tmptop = col1 + Blocksize; if (tmptop > c1) tmptop = c1; for ( ; m1 < m1top; m1 += c1) { ptmp = p; for (m2 = m2base; m2 < m2top; m2 += c1) { m1p = m1; m2p = m2; for (tmp = col1; tmp < tmptop; tmp++) { *ptmp += ( (*m1p) * (*m2p) ); m1p++; m2p++; } ptmp++; } p += c2; } } ```

The main loop variables are:

• m1 starts at the first element of M1, and drops down one row at each outer iteration.
• m2 starts at the first element of M2, and drops down one column at each second-level iteration. However, you need to remember that M2 is transposed, so in reality, this drops down one row at each level-two iteration (hence why it is incremented by c1).
• p starts at the first element of P, and it follows m1.
• ptmp is the variable that runs through the columns of P. It follows m2.
• m1p and m2p and ptmp are the inner-loop variables.
• m1p and m2p are the inner-loop variables that calculate the dot product for each element of P.
I don't know if this picture helps, but perhaps it does:

First, we assure ourselves that it works:

```UNIX> mm-block-p 21 17 13 5 1 y | tail -n 22 | head -n 21 | md5
d5a69ea68d5b9ba22a64c25334c947ae
UNIX>
```
And then we time it to see the impact of the block. In the test below, I'm varying the block size from 10 to 1800 in multiplication of 1800x1800 matrices. (This is on my MacBook Pro with an Intel Core i5 processor chugging along at 2.4 GHz):

I didn't run these multiple times, so a lot of the jaggedness is not important. However, the general shape is very important, and as you start exploring the interaction of caches and performance, you'll get used to shapes like this graph. The main feature is how the running time degrades steeply from a block size of 380 up to 600. You can correlate that to the L3 cache, which on my Mac is 3MB. How does that correlate? Well, what's in the cache? Typically a row of M1, a row of P and the whole block of M2. So, blocksize(blocksize+2)*8. I'm multiplying by 8 because my matrices are doubles. This value starts to approach 3MB at a block size of 625, which roughly correlates to the degradation of performance.

Those little dips below a blocksize of 200 will correlate to the L2 cache (which is 256K), the L1 cache (whose size I don't know), plus interactions with set associativity. If you really want to understand the behavior, you need generate memory traces and simulate the cache. It's a subtle art, and I'm not going to explore it further. If you are interested in this kind of work, check out the ATLAS project, which started here at UT (http://math-atlas.sourceforge.net/). It's lead author was Clint Whaley, who is now a professor at LSU (http://csc.lsu.edu/~whaley/) -- his work uses empirical testing to optimize the performance of programs with respect to individual machine architectures.

Enough of that aside -- I'm going to choose a block size of 380 as a result of these tests.

Here's where we are with these programs on 1800x1800 matrices. That's a pretty dramatic increase in performance from the naive approach down to the one that uses pointers and blocks!

And here's a second graph that shows the performance as we scale the size of the matrices:

A final remark -- that vertical line in mm-plain is not an outlier. The running time goes from 9.9 seconds for 960x960 to 29.1 seconds for 970x970. Clearly, we have hit a cache boundary.

Stay tuned for the next lecture, where we will attempt to make matrix multiplication even faster on my Macintosh!

(Aside for 2017: Everything is different now, of course. Here's the graph of block size vs performance for the 1800x1800 matrix:

The best performing block size is 15, which achieves a 25% improvement over the non-blocked implementations.

Let's take a look at the performance comparison over all the algorithms:

That's a big difference from before, isn't it? We'll explore this a little more in the next lecture, where we will talk about that block size of 15.)