- April, 2015
- James S. Plank
- Directory:
**/home/plank/cs494/Notes/MM-SIMD**

The SSE2 instruction set was developed by Intel in 2000. So, as you can imagine, there is a lot of material on the Internet that discusses how to use these instructions to perform matrix multiplication (including some from Intel). I list a bunch of these below. In other words, none of the material in this lecture is new; however, the discourse is focused on the specific matrix multiplication example that we have been going over in class.

Here are resources that I found on the Internet that discuss double-precision floating point matrix multiplication using the SSE2 instructions. If these links are dead, try the Wayback Machine.

- http://argus.phys.uregina.ca/gluex/DocDB/0016/001659/001/simd-talk.pdf
- http://stackoverflow.com/questions/24032178/speed-up-matrix-multiplication-by-sse2
- https://software.intel.com/sites/products/documentation/doclib/iss/2013/mkl/mklman/index.htm
- https://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-3
- https://github.com/david12341235/ee282/blob/master/pa1/matmul.c
- http://www.cs.cornell.edu/~bindel/class/cs5220-f11/proj1.html
- https://inst.eecs.berkeley.edu/~cs61c/sp12/labs/08/

For reference, our brain-dead implementation in C++ took 68.9 seconds to multiply two 1800x1800
matrices on my Macintosh. Paying attention to caches, memory and pointers, we were able to
reduce that to seven seconds. The program was
**mm-block-p.cpp**, which performs the multiplication in
square blocks, and uses pointers to access all of the elements.
Let's remind ourselves of the code that performs the block matrix multiplication. This
code multiplies the block starting at (row1,col1) of M1 with the block starting at
(col2,col1) of the transposed M2, and adds the result to the block starting at (row1,col2) of P.
The size of the block is **Blocksize**, and we have transposed matrix M2 so that we
are running through it in a cache-friendly manner:

```
void MM::MultiplyBlock(int row1, int col1, int col2)
{
  double *m1p, *m1top, *m1;
  double *m2p, *m2top, *m2base, *m2;
  double *p, *ptmp;
  int tmp, tmptop;

  m1 = &(M1[row1 * c1 + col1]);
  m1top = m1 + (c1 * Blocksize);
  if (m1top > &(M1[r1 * c1])) m1top = &(M1[r1 * c1]);

  m2base = &(M2[col2 * c1 + col1]);
  m2top = m2base + (c1 * Blocksize);
  if (m2top > &(M2[c2 * c1])) m2top = &(M2[c2 * c1]);

  p = &(P[row1 * c2 + col2]);

  tmptop = col1 + Blocksize;
  if (tmptop > c1) tmptop = c1;

  for ( ; m1 < m1top; m1 += c1) {
    ptmp = p;
    for (m2 = m2base; m2 < m2top; m2 += c1) {
      m1p = m1;
      m2p = m2;
      for (tmp = col1; tmp < tmptop; tmp++) {
        *ptmp += ( (*m1p) * (*m2p) );
        m1p++;
        m2p++;
      }
      ptmp++;
    }
    p += c2;
  }
}
```

I explain this code in more detail in the previous lecture.

Modern Intel processors have 16 "SIMD" registers, which are 128 bits in length. There are also instructions that act on these registers, often doing multiple things to them in a single instruction. Intel has defined "compiler intrinsics" (supported by gcc, clang and Visual C++) which allow you to utilize these registers and instructions from within your C/C++ code without resorting to assembly code. That is quite convenient.

First, we include **xmmintrin.h** and
**emmintrin.h**, which contain prototypes and specs for the intrinsics. Next,
we include the following variable declarations in **MultiplyBlock()**:

```
__m128d *p1, *p2, v1, v2;
double prod[2];
```

The two pointers allow us to load 128-bit memory regions into SIMD registers that hold
two doubles each, and act on them. The two variables allow us to access SIMD registers
directly, without linking them to memory. The array **prod[]** is going to be used
to hold the products that we calculate with SIMD registers.

Now, take a look at the main loop in **MultiplyBlock()**:

```
for ( ; m1 < m1top; m1 += c1) {
  ptmp = p;
  for (m2 = m2base; m2 < m2top; m2 += c1) {
    p1 = (__m128d *) m1;
    p2 = (__m128d *) m2;
    v1 = _mm_sub_pd(v1, v1);           /* Zero v1 */
    for (tmp = col1; tmp < tmptop; tmp += 2) {
      v2 = _mm_mul_pd(*p1, *p2);
      v1 = _mm_add_pd(v1, v2);
      p1++;
      p2++;
    }
    _mm_store_pd(prod, v1);
    *ptmp += prod[0];
    *ptmp += prod[1];
    ptmp++;
  }
  p += c2;
}
```

The first new pieces of code are the assignments to
**p1** and
**p2**.
What this code is saying to the compiler is -- "when I access
***p1** and ***p2**,
load the memory that they point to into SIMD registers."
If that's confusing, just read on.

The next statement is "v1 = _mm_sub_pd(v1, v1)". That specifies that **v1** is
a 128-bit register that holds two doubles, and you are going to subtract each of those doubles
from itself. In other words, it sets both of those doubles to zero, regardless of
their initial values.

Now, take a look at the inner **for** loop, which I reproduce below:

```
for (tmp = col1; tmp < tmptop; tmp += 2) {
  v2 = _mm_mul_pd(*p1, *p2);
  v1 = _mm_add_pd(v1, v2);
  p1++;
  p2++;
}
```

The **_mm_mul_pd()** command loads the two doubles pointed to by **p1** and the
two doubles pointed to by **p2** into SIMD registers, and multiplies the two low
ones and the two high ones, putting the two results into **v2**. To make this clearer, suppose **p1** is pointing
to two doubles { 10, 20 }, and
**p2** is pointing
to two doubles { 30, 40 }. The **_mm_mul_pd()** statement will load the four doubles
into two SIMD registers and multiply their parts, which means that the resulting
register, which is **v2**, will hold { 10*30=300, 20*40=800 }. That's exactly what we
want. The **_mm_add_pd()** statement adds the two doubles to the register **v1**,
which of course we initialized to be zero before the **for** loop.

Our loop increments two elements at a time, because we are multiplying two elements at a time.
At the end of the loop, our product is sitting in **v1**, but not completely. Half of the
dot product is in the lower half of **v1**, and the other half is in the upper half.
So, when our loop is done, we execute the following code:

```
_mm_store_pd(prod, v1);
*ptmp += prod[0];
*ptmp += prod[1];
ptmp++;
```

This stores the register **v1** into the two doubles pointed to by **prod**. We then
add those two products to ***ptmp**.

We call **mm-block-simd** on two 2x2 matrices:

```
UNIX> mm-block-simd 2 2 2 2 0 y
M1:  2 x 2
   0.3417   1.4998
   0.1927   1.7409
M2:  2 x 2
   1.1546   1.5716
   1.3844   0.7375
P:  2 x 2
   2.4708   1.6431
   2.6327   1.5869
Time:     0.0000
UNIX>
```

Since we store matrices in row-major order and transpose M2, the three arrays look like this in memory:

```
M1: { 0.3417, 1.4998, 0.1927, 1.7409 }
M2: { 1.1546, 1.3844, 1.5716, 0.7375 }
P:  { 0.0000, 0.0000, 0.0000, 0.0000 }
```

Let's concentrate on the following part of **MultiplyBlock()**:

```
p1 = (__m128d *) m1;
p2 = (__m128d *) m2;
v1 = _mm_sub_pd(v1, v1);           /* Zero v1 */
for (tmp = col1; tmp < tmptop; tmp += 2) {
  v2 = _mm_mul_pd(*p1, *p2);
  v1 = _mm_add_pd(v1, v2);
  p1++;
  p2++;
}
_mm_store_pd(prod, v1);
*ptmp += prod[0];
*ptmp += prod[1];
```

Now, **m1** is pointing to { 0.3417, 1.4998 } and **m2** is pointing to
{ 1.1546, 1.3844 }. When the **for** loop starts, **v1** is { 0, 0 }.
The **_mm_mul_pd()** statement performs the two multiplications, setting
**v2** to { 0.3417*1.1546, 1.4998*1.3844 } = { 0.3945, 2.0763 }.
We add it to **v1**, which is zeros, so **v1** is { 0.3945, 2.0763 }.
The **for** loop ends here (**Blocksize** is two), so the **_mm_store_pd()**
statement stores **v1** to the two doubles pointed to by **prod**. Thus
**prod** is now { 0.3945, 2.0763 }. We add both of these to ***ptmp** (which
starts at zero), so ***ptmp** is
0.3945+2.0763=2.4708. That's our first product element. YAY!!!!!

See if you can trace through the others.

Let's do a proof-by-Cosmo of correctness:

```
UNIX> mm-block-p 20 16 12 4 0 y | tail -n 21 | head -n 20 | openssl md5
(stdin)= f203ea56187933fb7392a016cd259a66
UNIX> mm-block-simd 20 16 12 4 0 y | tail -n 21 | head -n 20 | openssl md5
(stdin)= f203ea56187933fb7392a016cd259a66
UNIX>
```

In terms of speed, we are doing half of the multiplications and half of the additions that we were doing previously. We should expect some good speedups:

```
UNIX> mm-block-p 1800 1800 1800 380 0 n
Time:     7.1369
UNIX> mm-block-simd 1800 1800 1800 380 0 n
Time:     3.7712
UNIX>
```

Wow. I'll give some more explanations of performance in the next section.

To summarize, for each pair of elements, the inner loop of **mm-block-simd.cpp** does the following:

- Load two doubles from M1 into a SIMD register.
- Load two doubles from M2 into a SIMD register.
- Do the two multiplications.
- Do the two additions that add the two products to a SIMD register.

In **mm-block-simd2.cpp**, we go two-at-a-time on the rows of M1 as well, so that each load from M2 is used twice. The inner loop now does the following:

- Load two doubles from M1 into a SIMD register.
- Load two more doubles from the next row of M1 into a SIMD register.
- Load two doubles from M2 into a SIMD register.
- Do the two multiplications of the first two doubles of M1 by the two doubles of M2.
- Do the two multiplications of the second two doubles of M1 by the two doubles of M2.
- Do the two additions from the first multiplications and add to a product SIMD register.
- Do the two additions from the second multiplications and add to a different product SIMD register.

Let's look at the code. Yes, it's hideous to read, but if you look at it in reference
to **mm-block-simd.cpp**, you should be able to figure it out. The code is in
**mm-block-simd2.cpp**:

```
void MM::MultiplyBlock(int row1, int col1, int col2)
{
  double *m1top, *m1;
  double *m2top, *m2base, *m2;
  double *p, *ptmp, *ptmp2;
  double prod[2];
  int tmp, tmptop;
  __m128d *p1, *p2, *p3, v1, v2, v3, v4;

  m1 = &(M1[row1 * c1 + col1]);
  m1top = m1 + (c1 * Blocksize);
  if (m1top > &(M1[r1 * c1])) m1top = &(M1[r1 * c1]);

  m2base = &(M2[col2 * c1 + col1]);
  m2top = m2base + (c1 * Blocksize);
  if (m2top > &(M2[c2 * c1])) m2top = &(M2[c2 * c1]);

  p = &(P[row1 * c2 + col2]);

  tmptop = col1 + Blocksize;
  if (tmptop > c1) tmptop = c1;

  for ( ; m1 < m1top; m1 += (c1*2)) {      /* Iterate two rows at a time */
    ptmp = p;
    ptmp2 = p+c2;
    for (m2 = m2base; m2 < m2top; m2 += c1) {
      p1 = (__m128d *) m1;
      p2 = (__m128d *) m2;
      p3 = (__m128d *) (m1+c1);            /* This is one row lower than p1 */
      v1 = _mm_sub_pd(v1, v1);             /* Zero v1 */
      v3 = _mm_sub_pd(v3, v3);             /* Zero v3 */
      for (tmp = col1; tmp < tmptop; tmp += 2) {
        v2 = _mm_mul_pd(*p1, *p2);
        v4 = _mm_mul_pd(*p3, *p2);
        v1 = _mm_add_pd(v1, v2);
        v3 = _mm_add_pd(v3, v4);
        p1++;
        p2++;
        p3++;
      }
      _mm_store_pd(prod, v1);
      *ptmp += prod[0];
      *ptmp += prod[1];
      _mm_store_pd(prod, v3);
      *ptmp2 += prod[0];
      *ptmp2 += prod[1];
      ptmp++;
      ptmp2++;
    }
    p += (c2*2);                           /* Iterate the product two rows at a time, too */
  }
}
```

You'll note that the following code appears to be loading twice from **p2**:

```
v2 = _mm_mul_pd(*p1, *p2);
v4 = _mm_mul_pd(*p3, *p2);
```

However, the compiler is smart -- it knows that it only needs to load ***p2** into a SIMD register once.

And take a look at the speedup:

```
UNIX> mm-block-simd 1800 1800 1800 380 0 n
Time:     3.7529
UNIX> mm-block-simd2 1800 1800 1800 380 0 n
Time:     2.5398
UNIX>
```

That's 33 percent faster! Now, I'm not going to go down the rabbit hole of SIMD performance any further. The code above only uses five of the 16 registers -- if you worked to use all 16, you could probably get it down to a second or so. In fact, if any of you succeed in speeding this up significantly, please do so and I'll post your code here!

Now, you may also wonder, "How is it that I got more than a factor of two in performance
from **mm-block-p** to **mm-block-simd2**? I only cut the operation count in half."

The answer lies, as always, in the memory hierarchy in addition to instruction counts.
Let's look at the assembler of the inner loop of **mm-block-p.cpp** (use **-S** as
a compiler option):

```
L9:     cmpl    %ebx, %r10d
        jle     L6
        movsd   (%r8), %xmm1
        movq    %r13, %rdx
        movl    %ebx, %eax
        .align 4,0x90
L5:     movsd   (%rdx), %xmm0
        mulsd   (%rdi), %xmm0
        addsd   %xmm0, %xmm1
        movsd   %xmm1, (%r8)
        addq    $8, %rdx
        addq    $8, %rdi
        incl    %eax
        cmpl    %r10d, %eax
        jne     L5
```

Perhaps we should bribe Dr. Gregor to explain this code to us in detail. However,
you'll note that the floating point multiplication of each double is being performed
by the instruction **mulsd**, which uses a SIMD register but only multiplies a
single double at a time. So, for each multiplication, we have to
load the value into a SIMD register (**movsd**), do a multiplication, add it to the
product (**addsd**), and then move the product back to memory. By having everything done
inside the SIMD registers and doing things two-at-a-time, we get really nice speed
improvements in **mm-block-simd2.cpp**.