usage: mm r1 c1/r2 c2 seed(-1=time(0)) print(y|n)
It's straightforward. We have a class called MM, which has two input matrices, M1 and M2, and a product matrix P, plus Multiply() and PrintAll() methods. We create the input matrices in main() with random numbers between 0 and 2 (using the horrible random number generator drand48(); it's fine for this class, since our real intent is to focus on timing, not the quality of the randomness). Here's the class specification:
class MM {
  public:
    vector < vector <double> > M1;
    vector < vector <double> > M2;
    vector < vector <double> > P;
    int Print;
    void Multiply();
    void PrintAll();
};
And here's the all important Multiply() method, which is about as straightforward as you can get:
void MM::Multiply()
{
  int i, j, k;

  for (i = 0; i < P.size(); i++) {
    for (j = 0; j < P[0].size(); j++) {
      for (k = 0; k < M2.size(); k++) P[i][j] += (M1[i][k] * M2[k][j]);
    }
  }
}
When we run it, we first double-check to make sure that it's correct:
UNIX> mm-plain 4 2 3 1 y
M1: 4 x 2
0.0833 0.9090
1.6696 0.6720
1.1310 0.0035
0.3752 1.9809
M2: 2 x 3
1.5010 0.7325 0.7024
1.1467 0.2651 0.1283
P: 4 x 3
1.1673 0.3020 0.1751
3.2767 1.4012 1.2590
1.7016 0.8294 0.7949
2.8346 0.8000 0.5177
Time: 0.0000
UNIX> echo '1.6696 * 1.5010 + 0.6720 * 1.1467' | bc
3.2765
UNIX>

And when we time it, we see the O(n^{3}) behavior that we'd expect (this is on my 2012 MacBook Pro; on my 2015 MacBook, the timings are better by a factor of 2-3, but relative to each other, they are the same):
UNIX> mm-plain 50 50 50 1 n
Time: 0.0006
UNIX> !!:gs/50/100/
mm-plain 100 100 100 1 n
Time: 0.0076
UNIX> !!:gs/100/200/
mm-plain 200 200 200 1 n
Time: 0.0480
UNIX> !!:gs/200/400/
mm-plain 400 400 400 1 n
Time: 0.3318
UNIX> !!:gs/400/800/
mm-plain 800 800 800 1 n
Time: 5.7603
UNIX>

That last timing looks a bit off to me -- I'm going to suspect a memory issue.
The obvious fix is to transpose M2, which we do in mm-transpose.cpp. I create the matrix transposed -- note how M2 has c2 rows and c1 columns. I create the values so that M2[i][j] in mm-transpose.cpp has the same value that M2[j][i] had in mm-plain.cpp. I do that so that you can double-check the answers with mm-plain.cpp:
M->M1.resize(r1);
for (i = 0; i < M->M1.size(); i++) M->M1[i].resize(c1);
M->M2.resize(c2);
for (i = 0; i < M->M2.size(); i++) M->M2[i].resize(c1);
M->P.resize(r1);
for (i = 0; i < M->P.size(); i++) M->P[i].resize(c2, 0);

for (i = 0; i < M->M1.size(); i++) {
  for (j = 0; j < M->M1[i].size(); j++) {
    M->M1[i][j] = drand48()*2.0;
  }
}

for (j = 0; j < M->M2[0].size(); j++) {   // What was M2[i][j] is now M2[j][i].
  for (i = 0; i < M->M2.size(); i++) {
    M->M2[i][j] = drand48()*2.0;
  }
}
Here's the multiplier, where we run through the rows of M2 rather than the columns:
void MM::Multiply()
{
  int i, j, k;

  for (i = 0; i < P.size(); i++) {
    for (j = 0; j < P[0].size(); j++) {
      for (k = 0; k < M1[0].size(); k++) P[i][j] += (M1[i][k] * M2[j][k]);
                                  // This is the change:          ^^^^^^^^
    }
  }
}
As always, our first step is to double-check that our results match our first matrix multiplier:
UNIX> mm-transpose 4 2 3 1 y | head -n 18 > tmp1.txt
UNIX> mm-plain 4 2 3 1 y | head -n 18 > tmp2.txt
UNIX> diff tmp1.txt tmp2.txt
UNIX>

Let's compare timings:
UNIX> echo `mm-plain 50 50 50 1 n` `mm-transpose 50 50 50 1 n`
Time: 0.0008 Time: 0.0007
UNIX> !!:gs/50 50 50/100 100 100/
echo `mm-plain 100 100 100 1 n` `mm-transpose 100 100 100 1 n`
Time: 0.0064 Time: 0.0044
UNIX> !!:gs/100 100 100/200 200 200/
echo `mm-plain 200 200 200 1 n` `mm-transpose 200 200 200 1 n`
Time: 0.0469 Time: 0.0345
UNIX> !!:gs/200 200 200/400 400 400/
echo `mm-plain 400 400 400 1 n` `mm-transpose 400 400 400 1 n`
Time: 0.3288 Time: 0.2770
UNIX> !!:gs/400 400 400/800 800 800/
echo `mm-plain 800 800 800 1 n` `mm-transpose 800 800 800 1 n`
Time: 5.7788 Time: 2.3346
UNIX> echo '0.277 * 8' | bc
2.216
UNIX>

As we suspected, that last timing hit a memory cliff for mm-plain. On the flip side, mm-transpose exhibits pretty nice O(n^{3}) behavior.
The next improvement, in mm-oned.cpp, stores each matrix as a single contiguous vector of doubles rather than a vector of vectors:

class MM {
  public:
    vector <double> M1;
    vector <double> M2;
    vector <double> P;
    int r1, c1, c2;
    int Print;
    void Multiply();
    void PrintAll();
};
Where we did M[i][j] before, we now do M[i*cols+j]. Here's the new matrix multiply code:
void MM::Multiply()
{
  int i, j, k;

  for (i = 0; i < r1; i++) {
    for (j = 0; j < c2; j++) {
      for (k = 0; k < c1; k++) P[i*c2+j] += M1[i*c1+k] * M2[j*c1+k];
    }
  }
}
As suspected, this improves performance over mm-transpose. Let's do our checks and timings (if md5 doesn't exist on your system, try "openssl md5"):
UNIX> mm-oned 4 2 3 1 y | head -n 18 > tmp3.txt
UNIX> md5 tmp*.txt
MD5 (tmp1.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp2.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp3.txt) = e9c4ee5f9c6520146ba1468f39ed3873
UNIX>
UNIX> echo `mm-plain 50 50 50 1 n` `mm-transpose 50 50 50 1 n` `mm-oned 50 50 50 1 n`
Time: 0.0006 Time: 0.0005 Time: 0.0004
UNIX> !!:gs/50 50 50/100 100 100/
echo `mm-plain 100 100 100 1 n` `mm-transpose 100 100 100 1 n` `mm-oned 100 100 100 1 n`
Time: 0.0047 Time: 0.0043 Time: 0.0040
UNIX> !!:gs/100 100 100/200 200 200/
echo `mm-plain 200 200 200 1 n` `mm-transpose 200 200 200 1 n` `mm-oned 200 200 200 1 n`
Time: 0.0472 Time: 0.0346 Time: 0.0283
UNIX> !!:gs/200 200 200/400 400 400/
echo `mm-plain 400 400 400 1 n` `mm-transpose 400 400 400 1 n` `mm-oned 400 400 400 1 n`
Time: 0.3314 Time: 0.2764 Time: 0.2262
UNIX> !!:gs/400 400 400/800 800 800/
echo `mm-plain 800 800 800 1 n` `mm-transpose 800 800 800 1 n` `mm-oned 800 800 800 1 n`
Time: 5.7702 Time: 2.3443 Time: 1.9389
UNIX> echo "" | awk '{ print 1.9389 / 2.3443 }'
0.82707
UNIX>

Seventeen percent is nothing to sneeze at.
(As an aside, on my 2015 MacBook, there was no improvement from mm-transpose to mm-oned. I'm going to surmise that it's the compiler that has improved.)
Take a look at the inner loop of mm-oned:

for (k = 0; k < c1; k++) P[i*c2+j] += M1[i*c1+k] * M2[j*c1+k];
It does a lot of arithmetic. In particular, you have three integer multiplications and three integer additions, plus one floating point addition and one floating point multiplication. And those are just the arithmetic operations that you see: each of those matrix indirections requires a multiplication, an addition, and a load (plus the final store).
Of course, the compiler can and does optimize, so you can bet that a lot of those operations will be minimized; however, do you think that the compiler will get them all? I suspect it will do pretty well, but it won't get them all.
So, our next iteration replaces all of the matrix operations with pointer operations. This eliminates all of the multiplications, with the exception of the floating point. The change is in mm-oned-p.cpp:
(As an aside, when we wrote this in class, we added each pointer separately, testing as we went, so that we could get it correct more quickly. As you know, I always advocate that when you are programming.)
void MM::Multiply()
{
  int i, j, k;
  double *m1, *m1t, *m2, *p;

  m1 = &(M1[0]);      /* m1 is going to point to the first element of the row of M1 */
  p = &(P[0]);        /* p points to the product element that we are calculating */

  for (i = 0; i < r1; i++) {
    m2 = &(M2[0]);    /* We use m2 to run through the "column" of M2 */
    for (j = 0; j < c2; j++) {
      m1t = m1;       /* And we use m1t to actually run through the row of M1 */
      for (k = 0; k < c1; k++) {
        (*p) += (*m1t) * (*m2);
        m1t++;
        m2++;
      }
      p++;            /* Since everything is contiguous, this moves us to the next
                         product element, regardless of whether we are at the end of
                         a row.  Similarly, at this point, m2 is pointing to the
                         beginning of the next row of M2, so we don't need to
                         update it. */
    }
    m1 = m1t;         /* Also similarly, at the end of the row, m1t is pointing to
                         the beginning of the next row.  That's convenient. */
  }
}
That other piece of code reads much more nicely, doesn't it? When we test it, we see that we have destroyed the performance of the optimizer. Go team!!
UNIX> mm-oned-p 4 2 3 1 y | head -n 18 > tmp4.txt
UNIX> md5 tmp*.txt
MD5 (tmp1.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp2.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp3.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp4.txt) = e9c4ee5f9c6520146ba1468f39ed3873
UNIX> echo `mm-plain 50 50 50 1 n` `mm-transpose 50 50 50 1 n` `mm-oned 50 50 50 1 n` `mm-oned-p 50 50 50 1 n`
Time: 0.0008 Time: 0.0007 Time: 0.0006 Time: 0.0002
UNIX> !!:gs/50 50 50/100 100 100/
echo `mm-plain 100 100 100 1 n` `mm-transpose 100 100 100 1 n` `mm-oned 100 100 100 1 n` `mm-oned-p 100 100 100 1 n`
Time: 0.0057 Time: 0.0044 Time: 0.0038 Time: 0.0013
UNIX> !!:gs/100 100 100/200 200 200/
echo `mm-plain 200 200 200 1 n` `mm-transpose 200 200 200 1 n` `mm-oned 200 200 200 1 n` `mm-oned-p 200 200 200 1 n`
Time: 0.0468 Time: 0.0356 Time: 0.0283 Time: 0.0094
UNIX> !!:gs/200 200 200/400 400 400/
echo `mm-plain 400 400 400 1 n` `mm-transpose 400 400 400 1 n` `mm-oned 400 400 400 1 n` `mm-oned-p 400 400 400 1 n`
Time: 0.3323 Time: 0.2784 Time: 0.2312 Time: 0.0757
UNIX> !!:gs/400 400 400/800 800 800/
echo `mm-plain 800 800 800 1 n` `mm-transpose 800 800 800 1 n` `mm-oned 800 800 800 1 n` `mm-oned-p 800 800 800 1 n`
Time: 5.8131 Time: 2.3560 Time: 1.9589 Time: 0.7848
UNIX>

Wow! I did not expect to beat the optimizer by that much! So take a look at how far we've come by paying attention to memory. On the 400x400 matrices, our most recent program is 4.4 times faster than our first program. That's pretty significant!
(And the 2017 aside. Now, I'm only getting 2% improvement from mm-oned-p. The overall improvement over mm-plain is roughly 50%. Again, I'm going to attribute all of this to the compiler, which is able to optimize instructions really well. It can't optimize memory by performing the transposition though, so it is simply the act of transposing memory that has improved our lives. I would call that comforting, because it means that you don't have to write yucky pointer-laden code to get good performance.)
That was an awful sentence. Sorry. Let's look at pictures. Here's a multiplication of a 20x16 matrix A by a 16x12 matrix B, to yield a 20x12 matrix C:
I've already shown the partition of these matrices into 4x4 blocks. You can instead view the product as one of a 5x4 matrix by a 4x3 matrix to yield a 5x3 matrix:
It just so happens that each of the elements in the picture above is a 4x4 matrix instead of a number. So, when you want to calculate element C31 in the above matrix, you do your standard dot product of row 3 and column 1:
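Written out symbolically (my notation, with blocks indexed from zero), that dot product over the 5x4 and 4x3 block matrices is:

```latex
C_{31} = A_{30}B_{01} + A_{31}B_{11} + A_{32}B_{21} + A_{33}B_{31}
       = \sum_{k=0}^{3} A_{3k} B_{k1}
```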
However, each of those elements is a 4x4 matrix, so what you're really doing is the following:
Now, why would you care about doing block matrix multiplication? The answer lies in how you access memory. Think about it -- when you calculate the product one row at a time, you will be reading in one row of A and the entire B matrix, before you read in the next row of A. If you can't fit all of that into the cache, then you are going to be loading all of B into the cache once per row of A.
Now, suppose you are doing block multiplications. And you calculate the product by calculating a row of blocks at a time. Now you are loading the entire B matrix from memory once per block, which cuts down on the load time by a factor of the block size. As the matrices get big, this could make a big difference.
To hack this up, we're all tempted to start with the best implementation, mm-oned-p.cpp, and turn it into a block matrix implementation. I'm going to resist that temptation, and instead start with mm-oned.cpp. I'll then convert it to pointers, one step at a time, because I think that will be the least bug-laden approach.
My first block matrix implementation is in: mm-block.cpp. I have added a blocksize argument, and I have added a MultiplyBlock() method, which multiplies the block starting at [row1,col1] of matrix M1 with the block starting at [col2,col1] of M2 (which, of course, is stored transposed). This product is added into the block starting at [row1,col2] of matrix P.
Here is the relevant code. I'm not proud of the ternary expressions, and the compiler may not be able to optimize anything away, but I'm going to convert this to pointers anyway, so I'm not sweating it:
class MM {
  public:
    vector <double> M1;
    vector <double> M2;
    vector <double> P;
    int r1, c1, c2;
    int Print;
    int Blocksize;
    void MultiplyBlock(int row1, int col1, int col2);
    void Multiply();
    void PrintAll();
};

void MM::MultiplyBlock(int row1, int col1, int col2)
{
  int pr, pc, tmp;

  for (pr = row1; pr < ((row1 + Blocksize > r1) ? r1 : row1 + Blocksize); pr++) {
    for (pc = col2; pc < ((col2 + Blocksize > c2) ? c2 : col2 + Blocksize); pc++) {
      for (tmp = col1; tmp < ((col1 + Blocksize > c1) ? c1 : col1 + Blocksize); tmp++) {
        P[pr*c2+pc] += (M1[pr*c1+tmp] * M2[pc*c1+tmp]);
      }
    }
  }
}

void MM::Multiply()
{
  int row1, col1, col2;

  for (row1 = 0; row1 < r1; row1 += Blocksize) {
    for (col2 = 0; col2 < c2; col2 += Blocksize) {
      for (col1 = 0; col1 < c1; col1 += Blocksize) {
        MultiplyBlock(row1, col1, col2);
      }
    }
  }
}
If you remember from class, this took a little teeth-gnashing and debugging, but we did get it right eventually:
UNIX> mm-block 4 2 3 1 1 y | head -n 18 > tmp5.txt
UNIX> mm-block 4 2 3 2 1 y | head -n 18 > tmp6.txt
UNIX> mm-block 4 2 3 3 1 y | head -n 18 > tmp7.txt
UNIX> md5 tmp*.txt
MD5 (tmp1.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp2.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp3.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp4.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp5.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp6.txt) = e9c4ee5f9c6520146ba1468f39ed3873
MD5 (tmp7.txt) = e9c4ee5f9c6520146ba1468f39ed3873
UNIX> mm-block 21 17 13 4 1 y | tail -n 22 | head -n 21 | md5
d5a69ea68d5b9ba22a64c25334c947ae
UNIX> mm-oned-p 21 17 13 1 y | tail -n 22 | head -n 21 | md5
d5a69ea68d5b9ba22a64c25334c947ae
UNIX>
The next step is to convert the core routine to use pointers rather than array indirection. The goal is to rid our program of all multiplication, with the exception of multiplying the elements of the matrices. We did this in class, and it took multiple days, because the process was bug-prone, and I'd have to quit and go back to my office to debug. We got it right finally, and here's MultiplyBlock():
void MM::MultiplyBlock(int row1, int col1, int col2)
{
  double *m1p, *m1top, *m1;
  double *m2p, *m2top, *m2base, *m2;
  double *p, *ptmp;
  int tmp, tmptop;

  m1 = &(M1[row1 * c1 + col1]);
  m1top = m1 + (c1 * Blocksize);
  if (m1top > &(M1[r1 * c1])) m1top = &(M1[r1 * c1]);

  m2base = &(M2[col2 * c1 + col1]);
  m2top = m2base + (c1 * Blocksize);
  if (m2top > &(M2[c2 * c1])) m2top = &(M2[c2 * c1]);

  p = &(P[row1 * c2 + col2]);

  tmptop = col1 + Blocksize;
  if (tmptop > c1) tmptop = c1;

  for ( ; m1 < m1top; m1 += c1) {
    ptmp = p;
    for (m2 = m2base; m2 < m2top; m2 += c1) {
      m1p = m1;
      m2p = m2;
      for (tmp = col1; tmp < tmptop; tmp++) {
        *ptmp += ( (*m1p) * (*m2p) );
        m1p++;
        m2p++;
      }
      ptmp++;
    }
    p += c2;
  }
}
The main loop variables are: m1, which walks the block's rows of M1 (bounded by m1top, and advanced by c1 per row); m2, which walks the block's rows of the transposed M2 (starting at m2base and bounded by m2top); m1p and m2p, which walk within the current rows; and p and ptmp, which track the current product row and product element. tmp and tmptop count columns within the block.
First, we assure ourselves that it works:
UNIX> mm-block-p 21 17 13 5 1 y | tail -n 22 | head -n 21 | md5
d5a69ea68d5b9ba22a64c25334c947ae
UNIX>

And then we time it to see the impact of the block size. In the test below, I'm varying the block size from 10 to 1800 while multiplying 1800x1800 matrices. (This is on my MacBook Pro with an Intel Core i5 processor chugging along at 2.4 GHz):
I didn't run these multiple times, so a lot of the jaggedness is not important. However, the general shape is very important, and as you start exploring the interaction of caches and performance, you'll get used to shapes like this graph. The main feature is how the running time degrades steeply from a block size of 380 up to 600. You can correlate that to the L3 cache, which on my Mac is 3MB. How does that correlate? Well, what's in the cache? Typically a row of M1, a row of P, and the whole block of M2. So the footprint is blocksize*(blocksize+2)*8 bytes; I'm multiplying by 8 because my matrices hold doubles. This value starts to approach 3MB at a block size of 625, which roughly correlates with the degradation of performance.
Those little dips below a blocksize of 200 will correlate to the L2 cache (which is 256K), the L1 cache (whose size I don't know), plus interactions with set associativity. If you really want to understand the behavior, you need to generate memory traces and simulate the cache. It's a subtle art, and I'm not going to explore it further. If you are interested in this kind of work, check out the ATLAS project, which started here at UT (http://math-atlas.sourceforge.net/). Its lead author was Clint Whaley, who is now a professor at LSU (http://csc.lsu.edu/~whaley/) -- his work uses empirical testing to optimize the performance of programs with respect to individual machine architectures.
Enough of that aside -- I'm going to choose a block size of 380 as a result of these tests.
Here's where we are with these programs on 1800x1800 matrices. That's a pretty dramatic increase in performance from the naive approach down to the one that uses pointers and blocks!
And here's a second graph that shows the performance as we scale the size of the matrices:
A final remark -- that vertical line in mm-plain is not an outlier. The running time goes from 9.9 seconds for 960x960 to 29.1 seconds for 970x970. Clearly, we have hit a cache boundary.
Stay tuned for the next lecture, where we will attempt to make matrix multiplication even faster on my Macintosh!
(One more aside: on the newer machine, the best performing block size is 15, which achieves a 25% improvement over the non-blocked implementations. Let's take a look at the performance comparison over all of the algorithms:

That's a big difference from before, isn't it? We'll explore this a little more in the next lecture, where we will talk about that block size of 15.)