## CS494 Lecture Notes - Matrix Multiplication with Instruction Level Parallelism

• April, 2015
• James S. Plank
• Directory: /home/plank/cs494/Notes/MM-SIMD
• Latest Revision: December, 2017

### Reference Material

The reference material that I used for this lecture is an excellent write-up called "Introduction to SSE Programming" at http://www.codeproject.com/Articles/4522/Introduction-to-SSE-Programming. That write-up has far more detail than I present in this lecture -- I simply adapt it here to our matrix multiplication example from the previous lecture. Although the explanation in the write-up is for single-precision floating point, the extrapolation to double precision is straightforward. In the comments following the write-up, some of the commenters explain the extrapolation to double precision in more detail, so please read those if you'd like more information.

The SSE2 instruction set was developed by Intel in 2000. So, as you can imagine, there is a lot of material on the Internet that discusses how to use these instructions to perform matrix multiplication (including some from Intel). I list a bunch of these below. In other words, none of the material in this lecture is new; however, the discourse is focused on the specific matrix multiplication example that we have been going over in class.

Here are resources that I found on the Internet that discuss double-precision floating point matrix multiplication using the SSE2 instructions. If these links are dead, try the Wayback Machine.

• http://argus.phys.uregina.ca/gluex/DocDB/0016/001659/001/simd-talk.pdf
• http://stackoverflow.com/questions/24032178/speed-up-matrix-multiplication-by-sse2
• https://software.intel.com/sites/products/documentation/doclib/iss/2013/mkl/mklman/index.htm
• https://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-3
• https://github.com/david12341235/ee282/blob/master/pa1/matmul.c
• http://www.cs.cornell.edu/~bindel/class/cs5220-f11/proj1.html
• https://inst.eecs.berkeley.edu/~cs61c/sp12/labs/08/

### Reiterating Where We Ended Last Class

In the previous set of lecture notes, available at http://web.eecs.utk.edu/~plank/plank/classes/cs494/494/notes/MM-Memory/index.html, I went over strategies and programs for performing fast matrix multiplication of doubles.

For reference, our brain-dead implementation in C++ took 68.9 seconds to multiply two 1800x1800 matrices on my Macintosh. Paying attention to caches, memory, and pointers, we were able to reduce that to about seven seconds. The program was mm-block-p.cpp, which performs the multiplication in square blocks and uses pointers to access all of the elements. Let's remind ourselves of the code that performs the block matrix multiplication. This code multiplies the block starting at (row1,col1) of M1 with the block starting at (col2,col1) of the transposed M2, and adds the result to the block starting at (row1,col2) of P. The size of the block is Blocksize, and we have transposed matrix M2 so that we run through it in a cache-friendly manner:

```
void MM::MultiplyBlock(int row1, int col1, int col2)
{
  double *m1p, *m1top, *m1;
  double *m2p, *m2top, *m2base, *m2;
  double *p, *ptmp;
  int tmp, tmptop;

  m1 = &(M1[row1 * c1 + col1]);
  m1top = m1 + (c1 * Blocksize);
  if (m1top > &(M1[r1 * c1])) m1top = &(M1[r1 * c1]);

  m2base = &(M2[col2 * c1 + col1]);
  m2top = m2base + (c1 * Blocksize);
  if (m2top > &(M2[c2 * c1])) m2top = &(M2[c2 * c1]);

  p = &(P[row1 * c2 + col2]);

  tmptop = col1 + Blocksize;
  if (tmptop > c1) tmptop = c1;

  for ( ; m1 < m1top; m1 += c1) {
    ptmp = p;
    for (m2 = m2base; m2 < m2top; m2 += c1) {
      m1p = m1;
      m2p = m2;
      for (tmp = col1; tmp < tmptop; tmp++) {
        *ptmp += ( (*m1p) * (*m2p) );
        m1p++;
        m2p++;
      }
      ptmp++;
    }
    p += c2;
  }
}
```

I explain this code in more detail in the previous lecture.

### Vector Instructions Let You Multiply Two Doubles In One Instruction

The first thing that we will explore is instruction-level parallelism, which is implemented on nearly all processors these days. The reason, as I described in class, is that Moore's law is dead. We can't double the speed of our processors every 18 months any more, so instead, we make them do more with each instruction.

Modern Intel processors have 16 "SIMD" registers, which are 128 bits in length. There are also instructions that act on these registers, often doing multiple things to them in a single instruction. Compilers provide "compiler intrinsics" which allow you to utilize these registers and instructions from within your C/C++ code without resorting to assembly code. That is quite convenient.

First, we include xmmintrin.h and emmintrin.h, which contain prototypes and specs for the intrinsics. Next, we include the following variable declarations in MultiplyBlock():

```
__m128d *p1, *p2, v1, v2;
double prod[2];
```

The two pointers allow us to load 128-bit memory regions into SIMD registers that hold two doubles each, and act on them. The two __m128d variables, v1 and v2, allow you to access SIMD registers directly, without linking them to memory. The array prod[] is going to hold the products that we calculate with the SIMD registers.

Now, take a look at the main loop in MultiplyBlock():

```
  for ( ; m1 < m1top; m1 += c1) {
    ptmp = p;
    for (m2 = m2base; m2 < m2top; m2 += c1) {
      p1 = (__m128d *) m1;
      p2 = (__m128d *) m2;
      v1 = _mm_sub_pd(v1, v1);   /* Zero v1 */
      for (tmp = col1; tmp < tmptop; tmp += 2) {
        v2 = _mm_mul_pd(*p1, *p2);
        v1 = _mm_add_pd(v1, v2);
        p1++;
        p2++;
      }
      _mm_store_pd(prod, v1);
      *ptmp += prod[0];
      *ptmp += prod[1];
      ptmp++;
    }
    p += c2;
  }
```

The first new pieces of code are the assignments to p1 and p2. What this code says to the compiler is: "when I access *p1 and *p2, load the memory that they point to into SIMD registers." If that's confusing, just read on.

The next statement is "v1 = _mm_sub_pd(v1, v1)". This treats v1 as a 128-bit register that holds two doubles, and subtracts each of those doubles from itself. In other words, it sets both of those doubles to zero, regardless of their initial values (so long as those values are not NaN or infinity -- _mm_setzero_pd() avoids even that caveat).

Now, take a look at the inner for loop, which I reproduce below:

```
      for (tmp = col1; tmp < tmptop; tmp += 2) {
        v2 = _mm_mul_pd(*p1, *p2);
        v1 = _mm_add_pd(v1, v2);
        p1++;
        p2++;
      }
```

The _mm_mul_pd() command loads the two doubles pointed to by p1 and the two doubles pointed to by p2 into SIMD registers, and multiplies the two low ones together and the two high ones together, putting the two results into v2. To make this clearer, suppose p1 points to the two doubles { 10, 20 }, and p2 points to the two doubles { 30, 40 }. The _mm_mul_pd() statement will load the four doubles into two SIMD registers and multiply their parts, which means that the resulting register, v2, will hold { 10*30=300, 20*40=800 }. That's exactly what we want. The _mm_add_pd() statement adds the two doubles to the register v1, which of course we initialized to zero before the for loop.

Our loop increments two elements at a time, because we are multiplying two elements at a time. At the end of the loop, our dot product is sitting in v1, but not in one piece: half of the dot product is in the lower half of v1, and the other half is in the upper half. So, when our loop is done, we execute the following code:

```
      _mm_store_pd(prod, v1);
      *ptmp += prod[0];
      *ptmp += prod[1];
      ptmp++;
```

This stores the register v1 into the two doubles pointed to by prod. We then add those two products to *ptmp.

### An example

Let's take a simple 2x2 example:
```UNIX> mm-block-simd 2 2 2 2 0 y
M1: 2 x 2

0.3417 1.4998
0.1927 1.7409

M2: 2 x 2

1.1546 1.5716
1.3844 0.7375

P: 2 x 2

2.4708 1.6431
2.6327 1.5869
Time: 0.0000
UNIX>
```
We call MultiplyBlock(0, 0, 0) and Blocksize is two. In other words, MultiplyBlock() does the complete matrix multiplication. Remember, too, that M2 is transposed in our implementation, so that when MultiplyBlock(0, 0, 0) is called, we have the following:
```M1: { 0.3417, 1.4998, 0.1927, 1.7409 }
M2: { 1.1546, 1.3844, 1.5716, 0.7375 }
P:  { 0.0000, 0.0000, 0.0000, 0.000  }
```
Let's concentrate on the following part of MultiplyBlock():

```
      p1 = (__m128d *) m1;
      p2 = (__m128d *) m2;
      v1 = _mm_sub_pd(v1, v1);   /* Zero v1 */
      for (tmp = col1; tmp < tmptop; tmp += 2) {
        v2 = _mm_mul_pd(*p1, *p2);
        v1 = _mm_add_pd(v1, v2);
        p1++;
        p2++;
      }
      _mm_store_pd(prod, v1);
      *ptmp += prod[0];
      *ptmp += prod[1];
```

Now, m1 is pointing to { 0.3417, 1.4998 } and m2 is pointing to { 1.1546, 1.3844 }. When the for loop starts, v1 is { 0, 0 }. The _mm_mul_pd() statement performs the two multiplications, setting v2 as follows:

v2 = { 0.3417*1.1546, 1.4998*1.3844 } = { 0.3945, 2.0763 }

We add it to v1, which is all zeros, so v1 is { 0.3945, 2.0763 }. The for loop ends here (Blocksize is two), so the _mm_store_pd() statement stores v1 into the two doubles pointed to by prod. Thus prod is now { 0.3945, 2.0763 }. We add both of these to *ptmp (which starts at zero), so *ptmp is 0.3945+2.0763=2.4708. That's our first product element. YAY!!!!!

See if you can trace through the others.

Let's check correctness, proof-by-Cosmo style:

```UNIX> mm-block-p 20 16 12 4 0 y | tail -n 21 | head -n 20 | openssl md5
(stdin)= f203ea56187933fb7392a016cd259a66
UNIX> mm-block-simd 20 16 12 4 0 y | tail -n 21 | head -n 20 | openssl md5
(stdin)= f203ea56187933fb7392a016cd259a66
UNIX>
```
In terms of speed, we are doing half of the multiplications and half of the additions that we were doing previously. We should expect some good speedups:
```UNIX> mm-block-p 1800 1800 1800 380 0 n
Time: 7.1369
UNIX> mm-block-simd 1800 1800 1800 380 0 n
Time: 3.7712
UNIX>
```
Wow. I'll give some more explanations of performance in the next section. (I'll make an aside for 2017 there as well.)

### Let's multiply that register from M2 with four numbers instead of two

Here's a second thought. Let's access two rows of M1 at a time, and calculate two product rows at a time. In other words, instead of doing:
• Load two doubles from M1 into a SIMD register.
• Load two doubles from M2 into a SIMD register.
• Do the two multiplications.
• Do the two additions that add the two products to a SIMD register.
Let's do:
• Load two doubles from M1 into a SIMD register.
• Load two more doubles from the next row of M1 into a SIMD register.
• Load two doubles from M2 into a SIMD register.
• Do the two multiplications of the first two doubles of M1 by the two doubles of M2.
• Do the two multiplications of the second two doubles of M1 by the two doubles of M2.
• Do the two additions from the first multiplications and add to a product SIMD register.
• Do the two additions from the second multiplications and add to a different product SIMD register.
Why might this be faster? Well, in the first case, you'll have to load the values from M2 once per row of M1. In the second case, you are loading them once per two rows of M1.

Let's look at the code. Yes, it's hideous to read, but if you look at it in reference to mm-block-simd.cpp, you should be able to figure it out. The code is in mm-block-simd2.cpp:

```
void MM::MultiplyBlock(int row1, int col1, int col2)
{
  double *m1top, *m1;
  double *m2top, *m2base, *m2;
  double *p, *ptmp, *ptmp2;
  double prod[2];
  int tmp, tmptop;
  __m128d *p1, *p2, *p3, v1, v2, v3, v4;

  m1 = &(M1[row1 * c1 + col1]);
  m1top = m1 + (c1 * Blocksize);
  if (m1top > &(M1[r1 * c1])) m1top = &(M1[r1 * c1]);

  m2base = &(M2[col2 * c1 + col1]);
  m2top = m2base + (c1 * Blocksize);
  if (m2top > &(M2[c2 * c1])) m2top = &(M2[c2 * c1]);

  p = &(P[row1 * c2 + col2]);

  tmptop = col1 + Blocksize;
  if (tmptop > c1) tmptop = c1;

  for ( ; m1 < m1top; m1 += (c1*2)) {   /* Iterate two rows at a time */
    ptmp = p;
    ptmp2 = p+c2;
    for (m2 = m2base; m2 < m2top; m2 += c1) {
      p1 = (__m128d *) m1;
      p2 = (__m128d *) m2;
      p3 = (__m128d *) (m1+c1);         /* This is one row lower than p1 */
      v1 = _mm_sub_pd(v1, v1);          /* Zero v1 */
      v3 = _mm_sub_pd(v3, v3);          /* Zero v3 */
      for (tmp = col1; tmp < tmptop; tmp += 2) {
        v2 = _mm_mul_pd(*p1, *p2);
        v4 = _mm_mul_pd(*p3, *p2);
        v1 = _mm_add_pd(v1, v2);
        v3 = _mm_add_pd(v3, v4);
        p1++;
        p2++;
        p3++;
      }
      _mm_store_pd(prod, v1);
      *ptmp += prod[0];
      *ptmp += prod[1];
      _mm_store_pd(prod, v3);
      *ptmp2 += prod[0];
      *ptmp2 += prod[1];
      ptmp++;
      ptmp2++;
    }
    p += (c2*2);                        /* Iterate the product two rows at a time, too */
  }
}
```

You'll note that the following code appears to be loading twice from p2:

```
        v2 = _mm_mul_pd(*p1, *p2);
        v4 = _mm_mul_pd(*p3, *p2);
```

However, the compiler is smart -- it knows that it only needs to load *p2 into a SIMD register once.

And take a look at the speedup:

```UNIX> mm-block-simd 1800 1800 1800 380 0 n
Time: 3.7529
UNIX> mm-block-simd2 1800 1800 1800 380 0 n
Time: 2.5398
UNIX>
```
That's 33 percent faster! Now, I'm not going to go down the rabbit hole of SIMD performance any further. The code above only uses five of the 16 registers -- if you worked to use all 16, you could probably get it down to a second or so. In fact, if any of you succeed in speeding this up significantly, please send me your code and I'll post it here!

Ok, I lied. I wrote up mm-block-simd4.cpp, which extrapolates the above program to four rows rather than two. I won't put the code in here. If you feel the need to hurt yourself, go for it.

### Some performance graphs

This is some flawed data, but such is life. I have three graphs below:
• 2015 on my Macbook: Intel Core i5, 2.4 GHz, 8G RAM. I'm pretty sure that this code used the Gnu C++ compiler. I didn't test multiple block sizes with the SIMD programs, and I hadn't written the simd4 program yet.

• 2017 on my Macbook: Intel Core i7, 2.2 GHz, 16G RAM. The C++ compiler is now Clang, which obviously does a better job of optimizing. I did test multiple block sizes with all of the block implementations.

• 2017 on Arc: This is some random Linux box in the department. I don't know the compiler. I guess I could look it up, but I'm beyond caring. I chose it because it's one of the older machines in the department.
In the 2017 tests, I used find_block_size_1.sh, find_block_size_2.sh, and find_block_size_4.sh to test the SIMD programs.

There are some interesting things about these graphs:

(Graphs: 2015 on my Macbook; 2017 on my Macbook; 2017 on an older Linux box.)

Some commentary:

• Obviously, the SIMD programs do the best, with the simd4 program winning with its reuse of registers.
• The compilers in 2017 eliminate the need for you to do explicit pointer operations. That's nice.
• The block variants improve matters in all cases. In 2017, the best block sizes are really small. I suspect that this is because the compilers can leverage all of the registers -- in other words, instead of optimizing L1/L2/L3 cache reuse, we are optimizing for register reuse.
• On the flip side, the SIMD variants make good use of the registers, so the block size is now optimized for one of the caches (probably the L1).
• I'm guessing that if you do your cache modeling right, you'll discover that you should have three or four levels of blocking -- perhaps one for each cache (treating the registers like a cache). I'm not writing that code, BTW.

### Registers and Memory

Now, you may also wonder, "How is it that I got more than a factor of two in performance from mm-block-p to mm-block-simd2? I only cut the operation count in two."

The answer lies, as always, in the memory hierarchy, in addition to instruction counts. Let's look at the assembly for the inner loop of mm-block-p.cpp (compile with -S to see it):

```
L9:     cmpl    %ebx, %r10d
        jle     L6
        movsd   (%r8), %xmm1
        movq    %r13, %rdx
        movl    %ebx, %eax
        .align  4,0x90
L5:     movsd   (%rdx), %xmm0
        mulsd   (%rdi), %xmm0
        addsd   %xmm0, %xmm1
        movsd   %xmm1, (%r8)
        addq    $8, %rdx
        addq    $8, %rdi
        incl    %eax
        cmpl    %r10d, %eax
        jne     L5
```

Perhaps we should bribe Dr. Gregor to explain this code to us in detail. However, you'll note that the floating point multiplication is performed by the SSE instruction mulsd, which multiplies a single double at a time. So, for each multiplication, we have to load a value into an SSE register (movsd), do the multiplication (mulsd), add it to the running product (addsd), and then move the product back to memory (movsd). By keeping everything inside the SIMD registers and doing things two at a time, we get really nice speed improvements in mm-block-simd2.cpp.

### Using Assembly Code Rather than The Intrinsics

In "Introduction to SSE Programming," the author also shows how you can embed SIMD assembly instructions like mulsd directly in your code. You do this when you think you can do a better job of register assignment and instruction ordering than the compiler. I bumbled about with that in class, and I think that for the purposes of this class it yields diminishing returns, so I won't say anything more about it -- you can learn it on your own if you are interested.