The SSE2 instruction set was developed by Intel in 2000. So, as you can imagine, there is a lot of material on the Internet that discusses how to use these instructions to perform matrix multiplication (including some from Intel). I list a bunch of these below. In other words, none of the material in this lecture is new; however, the discourse is focused on the specific matrix multiplication example that we have been going over in class.
Here are resources that I found on the Internet that discuss double-precision floating point matrix multiplication using the SSE2 instructions. If these links are dead, try the Wayback Machine.
For reference, our brain-dead implementation in C++ took 68.9 seconds to multiply two 1800x1800 matrices on my Macintosh. Paying attention to caches, memory and pointers, we were able to reduce that to seven seconds. The program was mm-block-p.cpp, which performs the multiplication in square blocks, and uses pointers to access all of the elements. Let's remind ourselves of the code that performs the block matrix multiplication. This code multiplies the block starting in (row1,col1) of M1 with the block starting in (col2,col1) of M2, and adds the result to the block starting in (row1,col2) of P. The size of the block is Blocksize, and we have transposed matrix M2 so that we are running through it in a cache-friendly manner:
void MM::MultiplyBlock(int row1, int col1, int col2)
{
  double *m1p, *m1top, *m1;
  double *m2p, *m2top, *m2base, *m2;
  double *p, *ptmp;
  int tmp, tmptop;

  m1 = &(M1[row1 * c1 + col1]);
  m1top = m1 + (c1 * Blocksize);
  if (m1top > &(M1[r1 * c1])) m1top = &(M1[r1 * c1]);

  m2base = &(M2[col2 * c1 + col1]);
  m2top = m2base + (c1 * Blocksize);
  if (m2top > &(M2[c2 * c1])) m2top = &(M2[c2 * c1]);

  p = &(P[row1 * c2 + col2]);

  tmptop = col1 + Blocksize;
  if (tmptop > c1) tmptop = c1;

  for ( ; m1 < m1top; m1 += c1) {
    ptmp = p;
    for (m2 = m2base; m2 < m2top; m2 += c1) {
      m1p = m1;
      m2p = m2;
      for (tmp = col1; tmp < tmptop; tmp++) {
        *ptmp += ( (*m1p) * (*m2p) );
        m1p++;
        m2p++;
      }
      ptmp++;
    }
    p += c2;
  }
}
I explain this code in more detail in the previous lecture.
Modern Intel processors have 16 "SIMD" registers, which are 128 bits in length. There are also instructions which act on these registers, often doing multiple things to them in a single instruction. Intel defined "compiler intrinsics" (supported by gcc, clang and Microsoft's compilers) which allow you to utilize these registers and instructions from within your C/C++ code without resorting to assembly code. That is quite convenient.
First, we include xmmintrin.h and emmintrin.h, which contain prototypes and specs for the intrinsics. Next, we include the following variable declarations in MultiplyBlock():
__m128d *p1, *p2, v1, v2;
double prod[2];
The two pointers allow us to load 128-bit memory regions into SIMD registers that hold two doubles each, and act on them. The two variables allow you to access SIMD registers directly, without linking them to memory. The array prod[] is going to be used to hold the products that we are going to calculate with SIMD registers.
Now, take a look at the main loop in MultiplyBlock():
for ( ; m1 < m1top; m1 += c1) {
  ptmp = p;
  for (m2 = m2base; m2 < m2top; m2 += c1) {
    p1 = (__m128d *) m1;
    p2 = (__m128d *) m2;
    v1 = _mm_sub_pd(v1, v1);           /* Zero v1 */
    for (tmp = col1; tmp < tmptop; tmp += 2) {
      v2 = _mm_mul_pd(*p1, *p2);
      v1 = _mm_add_pd(v1, v2);
      p1++;
      p2++;
    }
    _mm_store_pd(prod, v1);
    *ptmp += prod[0];
    *ptmp += prod[1];
    ptmp++;
  }
  p += c2;
}
The first new pieces of code are the assignments to p1 and p2. What this code says to the compiler is -- "when I access *p1 and *p2, load the memory that they point to into SIMD registers." If that's confusing, just read on.
The next statement is "v1 = _mm_sub_pd(v1, v1)." That specifies that v1 is a 128-bit register that holds two doubles, and that you are going to subtract each of those doubles from itself. In other words, it sets both of those doubles to zero, regardless of their initial values (well, almost -- if a double happens to hold a NaN or infinity, the subtraction yields NaN, which is one reason _mm_setzero_pd() is the safer idiom).
Now, take a look at the inner for loop, which I reproduce below:
for (tmp = col1; tmp < tmptop; tmp += 2) {
  v2 = _mm_mul_pd(*p1, *p2);
  v1 = _mm_add_pd(v1, v2);
  p1++;
  p2++;
}
The _mm_mul_pd() command loads the two doubles pointed to by p1 and the two doubles pointed to by p2 into SIMD registers, multiplies the two low ones together and the two high ones together, and puts the two results into v2. To make this clearer, suppose p1 is pointing to two doubles { 10, 20 }, and p2 is pointing to two doubles { 30, 40 }. The _mm_mul_pd() statement will load the four doubles into two SIMD registers and multiply their parts, which means that the resulting register, which is v2, will hold { 10*30=300, 20*40=800 }. That's exactly what we want. The _mm_add_pd() statement adds those two doubles into the register v1, which of course we initialized to zero before the for loop.
Our loop increments by two elements at a time, because we are multiplying two elements at a time. At the end of the loop, our dot product is sitting in v1, but not completely: half of the dot product is in the lower half of v1, and the other half is in the upper half. So, when our loop is done, we execute the following code:
_mm_store_pd(prod, v1);
*ptmp += prod[0];
*ptmp += prod[1];
ptmp++;
This stores the register v1 into the two doubles pointed to by prod. We then add those two products to *ptmp.
UNIX> mm-block-simd 2 2 2 2 0 y
M1:  2 x 2
  0.3417  1.4998
  0.1927  1.7409
M2:  2 x 2
  1.1546  1.5716
  1.3844  0.7375
P:  2 x 2
  2.4708  1.6431
  2.6327  1.5869
Time:    0.0000
UNIX>

We call MultiplyBlock(0, 0, 0) and Blocksize is two. In other words, MultiplyBlock() does the complete matrix multiplication. Remember, too, that M2 is transposed in our implementation, so that when MultiplyBlock(0, 0, 0) is called, we have the following:
M1: { 0.3417, 1.4998, 0.1927, 1.7409 }
M2: { 1.1546, 1.3844, 1.5716, 0.7375 }
P:  { 0.0000, 0.0000, 0.0000, 0.0000 }

Let's concentrate on the following part of MultiplyBlock():
p1 = (__m128d *) m1;
p2 = (__m128d *) m2;
v1 = _mm_sub_pd(v1, v1);           /* Zero v1 */
for (tmp = col1; tmp < tmptop; tmp += 2) {
  v2 = _mm_mul_pd(*p1, *p2);
  v1 = _mm_add_pd(v1, v2);
  p1++;
  p2++;
}
_mm_store_pd(prod, v1);
*ptmp += prod[0];
*ptmp += prod[1];
Now, m1 is pointing to { 0.3417, 1.4998 } and m2 is pointing to { 1.1546, 1.3844 }. When the for loop starts, v1 is { 0, 0 }. The _mm_mul_pd() statement performs the two multiplications, setting v2 to { 0.3417*1.1546, 1.4998*1.3844 } = { 0.3945, 2.0763 }.
We add it to v1, which is zeros, so v1 is { 0.3945, 2.0763 }. The for loop ends here (Blocksize is two), so the _mm_store_pd() statement stores v1 to the two doubles pointed to by prod. Thus prod is now { 0.3945, 2.0763 }. We add both of these to *ptmp (which starts at zero), so *ptmp is 0.3945+2.0763=2.4708. That's our first product element. YAY!!!!!
See if you can trace through the others.
Let's do a proof-by-Cosmo of correctness:
UNIX> mm-block-p 20 16 12 4 0 y | tail -n 21 | head -n 20 | openssl md5
(stdin)= f203ea56187933fb7392a016cd259a66
UNIX> mm-block-simd 20 16 12 4 0 y | tail -n 21 | head -n 20 | openssl md5
(stdin)= f203ea56187933fb7392a016cd259a66
UNIX>

In terms of speed, we are doing half of the multiplications and half of the additions that we were doing previously. We should expect some good speedups:
UNIX> mm-block-p 1800 1800 1800 380 0 n
Time:    7.1369
UNIX> mm-block-simd 1800 1800 1800 380 0 n
Time:    3.7712
UNIX>

Wow. I'll give some more explanations of performance in the next section. (I'll make an aside for 2017 there as well.)
Let's look at the code. Yes, it's hideous to read, but if you look at it in reference to mm-block-simd.cpp, you should be able to figure it out. The code is in mm-block-simd2.cpp:
void MM::MultiplyBlock(int row1, int col1, int col2)
{
  double *m1top, *m1;
  double *m2top, *m2base, *m2;
  double *p, *ptmp, *ptmp2;
  double prod[2];
  int tmp, tmptop;
  __m128d *p1, *p2, *p3, v1, v2, v3, v4;

  m1 = &(M1[row1 * c1 + col1]);
  m1top = m1 + (c1 * Blocksize);
  if (m1top > &(M1[r1 * c1])) m1top = &(M1[r1 * c1]);

  m2base = &(M2[col2 * c1 + col1]);
  m2top = m2base + (c1 * Blocksize);
  if (m2top > &(M2[c2 * c1])) m2top = &(M2[c2 * c1]);

  p = &(P[row1 * c2 + col2]);

  tmptop = col1 + Blocksize;
  if (tmptop > c1) tmptop = c1;

  for ( ; m1 < m1top; m1 += (c1*2)) {   /* Iterate two rows at a time */
    ptmp = p;
    ptmp2 = p+c2;
    for (m2 = m2base; m2 < m2top; m2 += c1) {
      p1 = (__m128d *) m1;
      p2 = (__m128d *) m2;
      p3 = (__m128d *) (m1+c1);         /* This is one row lower than p1 */
      v1 = _mm_sub_pd(v1, v1);          /* Zero v1 */
      v3 = _mm_sub_pd(v3, v3);          /* Zero v3 */
      for (tmp = col1; tmp < tmptop; tmp += 2) {
        v2 = _mm_mul_pd(*p1, *p2);
        v4 = _mm_mul_pd(*p3, *p2);
        v1 = _mm_add_pd(v1, v2);
        v3 = _mm_add_pd(v3, v4);
        p1++;
        p2++;
        p3++;
      }
      _mm_store_pd(prod, v1);
      *ptmp += prod[0];
      *ptmp += prod[1];
      _mm_store_pd(prod, v3);
      *ptmp2 += prod[0];
      *ptmp2 += prod[1];
      ptmp++;
      ptmp2++;
    }
    p += (c2*2);                        /* Iterate the product two rows at a time, too */
  }
}
You'll note that the following code appears to be loading twice from p2:
v2 = _mm_mul_pd(*p1, *p2);
v4 = _mm_mul_pd(*p3, *p2);
However, the compiler is smart -- it knows that it should only load the SIMD register once.
And take a look at the speedup:
UNIX> mm-block-simd 1800 1800 1800 380 0 n
Time:    3.7529
UNIX> mm-block-simd2 1800 1800 1800 380 0 n
Time:    2.5398
UNIX>

That's 33 percent faster! Now, I'm not going to go down the rabbit hole of SIMD performance any further. The code above only uses five of the 16 registers -- if you worked to use all 16, you could probably get it down to a second or so. In fact, if any of you succeed in speeding this up significantly, please do so and I'll post your code here!
Ok, I lied. I wrote up mm-block-simd4.cpp, which extrapolates the above program to four rows rather than two. I won't put the code in here. If you feel the need to hurt yourself, go for it.
There are some interesting things about these graphs:
[Graph: 2015 on my MacBook]
[Graph: 2017 on my MacBook]
[Graph: 2017 on an older Linux box]
Some commentary:
The answer lies, as always, in the memory hierarchy in addition to instruction counts. Let's look at the assembly for the inner loop of mm-block-p.cpp (use -S as a compiler option):
L9:
        cmpl    %ebx, %r10d
        jle     L6
        movsd   (%r8), %xmm1
        movq    %r13, %rdx
        movl    %ebx, %eax
        .align 4,0x90
L5:
        movsd   (%rdx), %xmm0
        mulsd   (%rdi), %xmm0
        addsd   %xmm0, %xmm1
        movsd   %xmm1, (%r8)
        addq    $8, %rdx
        addq    $8, %rdi
        incl    %eax
        cmpl    %r10d, %eax
        jne     L5
Perhaps we should bribe Dr. Gregor to explain this code to us in detail. However, you'll note that the floating-point multiplication of each double is performed by the instruction mulsd. Although mulsd uses a SIMD register, it is a scalar instruction -- it multiplies just one double at a time. So, for each multiplication, we have to load a value into a SIMD register (movsd), do the multiplication (mulsd), add it to the running product (addsd), and then move the product back to memory (movsd). By having everything done inside the SIMD registers and doing things two-at-a-time with the packed instructions, we get really nice speed improvements in mm-block-simd2.cpp.