CS494 Lecture Notes - Some simple SIMD examples

A standard caveat

Since Moore's law is dead, processor designers often focus on making their machines do more with each instruction, rather than making the instruction cycle go faster. Which means that a lecture like this one, on leveraging the results of this work, becomes a moving target. I focus on 128-bit SIMD registers, which have been around for a long time, and have been eclipsed by their larger counterparts. I won't enumerate them, because my enumeration will become obsolete every year. Just suffice it to say that learning SIMD will help you learn the others, and help you think about leveraging processor extensions.

Reference Material

There's a really nice "cheat sheet" at https://db.in.tum.de/~finis/x86-intrin-cheatsheet-v2.1.pdf.

128-bit vectors

Most processors these days have special registers that are large, for example 128 bits, plus instructions that operate on them. For years, Intel processors have had "Streaming SIMD Extensions", or SSE. These are 128-bit vectors plus many sets of instructions (see the link in the last sentence for an explanation of all of the various sets, like SSE2, SSE3, SSSE3, etc).

If you can organize your computation to leverage these instructions, they can speed up your code impressively. I have an example in my lecture notes on the Floyd-Warshall algorithm, which doesn't have the actual code, but it does show the speed-up. (The lecture notes don't have the code, because you will get to write it for lab.)

In this set of lecture notes, I am going to demonstrate how to use a few of these instructions by employing the Microsoft Compiler Intrinsics. These intrinsics give you subroutines that compile down to the SSE instructions, thereby making your life easier as a programmer. You can also use "ASM" statements in your code, but I won't demonstrate that here.


A very simple program that uses 128-bit AND and OR

Here's the program -- in src/and_or.cpp:

#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <string.h>
using namespace std;

#include "MOA.h"
#include <emmintrin.h>

int main(int argc, char **argv)
{
  int print;
  int andor;
  int bytes, bufsize;
  int i, j;

  uint8_t *buf1, *buf2, *buf3;  
  __m128i *b1, *b2, *b3;

  /* usage: and_or bytes AND|OR|and|or|none seed print(Y|N) */

  if (argc != 5 || (strcmp(argv[2], "AND") != 0 && strcmp(argv[2], "OR") != 0  && 
                    strcmp(argv[2], "and") != 0 && strcmp(argv[2], "or") != 0  &&
                    strcmp(argv[2], "none") != 0)) {
    fprintf(stderr, "usage: and_or bytes AND|OR|and|or|none seed print(Y|N)\n");
    exit(1);
  }

  /* Set the variables from the command line */

  bytes = atoi(argv[1]);
  andor = argv[2][0];
  print = (argv[4][0] == 'Y');
  MOA_Seed(atoi(argv[3]));
  
  /* Pad the number of bytes so that it's a multiple of 16, and allocate three buffers: */

  bufsize = bytes;
  if (bufsize % 16 != 0) bufsize += (16 - bufsize%16);
  buf1 = (uint8_t *) malloc(bufsize);
  buf2 = (uint8_t *) malloc(bufsize);
  buf3 = (uint8_t *) malloc(bufsize);

  /* Use MOA.cpp to fill buf1 and buf2 with random bytes */

  MOA_Fill_Random_Region (buf1, bytes);
  MOA_Fill_Random_Region (buf2, bytes);

  /* This lets you access the three buffers as 128-bit vectors. */

  b1 = (__m128i *) buf1;
  b2 = (__m128i *) buf2;
  b3 = (__m128i *) buf3;

  /* Use SIMD to do AND. */

  if (andor == 'A') {
    for (i = 0; i < bytes; i += 16) {
      *b3 = _mm_and_si128(*b1, *b2);
      b1++;
      b2++;
      b3++;
    }

  /* Use SIMD to do OR. */

  } else if (andor == 'O') {
    for (i = 0; i < bytes; i += 16) {
      *b3 = _mm_or_si128(*b1, *b2);
      b1++;
      b2++;
      b3++;
    }

  /* Do the bit operations inefficiently on the bytes. */

  } else if (andor == 'a') {
    for (i = 0; i < bytes; i++) buf3[i] = (buf1[i] & buf2[i]);
  } else if (andor == 'o') {
    for (i = 0; i < bytes; i++) buf3[i] = (buf1[i] | buf2[i]);
  }

  /* Print everything out. */

  if (print) {
    printf("B1: ");
    for (i = 0; i < bytes; i++) {
      for (j = 0; j < 8; j++) printf("%d", (buf1[i] >> j) & 1);
      printf(" ");
    }
    printf("\n");
    printf("B2: ");
    for (i = 0; i < bytes; i++) {
      for (j = 0; j < 8; j++) printf("%d", (buf2[i] >> j) & 1);
      printf(" ");
    }
    printf("\n");
    printf("B3: ");
    for (i = 0; i < bytes; i++) {
      for (j = 0; j < 8; j++) printf("%d", (buf3[i] >> j) & 1);
      printf(" ");
    }
    printf("\n");
  }
  return 0;
}

Most of this is straightforward. You need to include <emmintrin.h> so that you can use the Microsoft Intrinsics for the SSE2 instruction set, which supports 128-bit AND and OR. You'll also need to compile with -msse2. Obviously, your machine architecture has to support this (our lab machines do, as does my Macintosh).

UNIX> make bin/and_or
g++ -Wall -Wextra -Iinclude -o bin/and_or src/and_or.cpp src/MOA.cpp -msse2
UNIX> 

This program allocates three regions of bytes named buf1, buf2 and buf3, and it sets the first two regions to random bytes. It then either takes their bitwise AND or their bitwise OR, and puts the result into buf3. You'll call the program with

UNIX> bin/and_or
usage: and_or bytes AND|OR|and|or|none seed print(Y|N)
UNIX> 
The operation is specified with AND, OR, and, or or none. The capital letters specify to use the SIMD instructions, and lower-case letters specify to simply do the bit operations a byte at a time. We use none as a control.

Let's see it running. First, let's convince ourselves that it works:

UNIX> bin/and_or 5 AND 1 Y
B1: 01000001 11000110 10000111 01011011 10000000 
B2: 01010111 01110111 10011000 01101001 00010010 
B3: 01000001 01000110 10000000 01001001 00000000 
UNIX> bin/and_or 5 and 1 Y
B1: 01000001 11000110 10000111 01011011 10000000 
B2: 01010111 01110111 10011000 01101001 00010010 
B3: 01000001 01000110 10000000 01001001 00000000 
UNIX> bin/and_or 5 OR 1 Y
B1: 01000001 11000110 10000111 01011011 10000000 
B2: 01010111 01110111 10011000 01101001 00010010 
B3: 01010111 11110111 10011111 01111011 10010010 
UNIX> bin/and_or 5 or 1 Y
B1: 01000001 11000110 10000111 01011011 10000000 
B2: 01010111 01110111 10011000 01101001 00010010 
B3: 01010111 11110111 10011111 01111011 10010010 
UNIX> 
I'm convinced. Now, let's see a timing to demonstrate how much faster the SIMD instructions are:
UNIX> time bin/and_or 10000000 none 1 N
0.049u 0.011s 0:00.06 83.3%	0+0k 0+0io 0pf+0w
UNIX> time bin/and_or 10000000 and 1 N
0.096u 0.018s 0:00.11 90.9%	0+0k 0+0io 0pf+0w
UNIX> time bin/and_or 10000000 AND 1 N
0.054u 0.017s 0:00.07 85.7%	0+0k 0+0io 0pf+0w
UNIX> 
Filling the areas with random numbers takes quite a bit of time -- about 0.050 seconds. That means that: That's a nine-fold speedup! Why not 16-fold? Because the instructions are so fast that the speed is limited by the time that it takes to get the bytes to the processor, rather than the instructions.

Take a look at the code that does the SIMD:

  b1 = (__m128i *) buf1;
  b2 = (__m128i *) buf2;
  b3 = (__m128i *) buf3;

  if (andor == 'A') {
    for (i = 0; i < bytes; i += 16) {
      *b3 = _mm_and_si128(*b1, *b2);
      b1++;
      b2++;
      b3++;
    }

This is straightforward -- the intrinsics have made life really easy, as we simply access our memory using those (__m128i *) pointers, and the compiler sets it up so that the memory is loaded into 128-bit registers, the registers are used for 128-bit AND operations, and the results are stored back to memory. You can use __m128i data types as well if you want to store intermediate values and not have them go back to memory.

I didn't compile with optimization previously. When you do, the optimizer is pretty smart about the "bad" loop -- it will figure out the SIMD code to use, and replace it so that they all run at the same time:

UNIX> make bin/and_or_opt
g++ -Wall -Wextra -Iinclude -O3 -o bin/and_or_opt src/and_or.cpp src/MOA.cpp -msse2
UNIX> time bin/and_or_opt 10000000 none 1 N
0.036u 0.005s 0:00.04 75.0%	0+0k 0+0io 0pf+0w
UNIX> time bin/and_or_opt 10000000 and 1 N
0.038u 0.008s 0:00.04 75.0%	0+0k 0+0io 0pf+0w
UNIX> time bin/and_or_opt 10000000 AND 1 N
0.038u 0.007s 0:00.04 75.0%	0+0k 0+0io 0pf+0w
UNIX> 
The compiler will also optimize your SIMD to do things like optimize register usage and pipelining. Optimizing SIMD is a true rabbit hole, so I'll stop with that, but if it's a rabbit hole that you enjoy, by all means dive in!

Playing with SIMD

The best way to play with SIMD is to write simple programs like src/and_or.cpp above, or src/odd_shorts.cpp below. Use Microsoft's web pages to help you. For example, suppose you wanted to use _mm_blendv_ps(). Google it and go to Microsoft's web page, such as https://msdn.microsoft.com/en-US/library/bb514075%28v=vs.100%29.aspx, or Intel's web page, such as https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_blendv_ps. Those pages will give you usage, examples, and how to compile. For example, the pages tell you that you need support for SSE4, and you have to include <smmintrin.h>. That also means you'll have to compile with -msse4.

The page also tells you that you need to use __m128 pointers rather than __m128i, because the SIMD instructions are going to treat memory as floating point. That can be a pain, but since C/C++ let you cast pointers wantonly, you can typically make things work, even if you intermix floating point and integer instructions.


For Example: odd_shorts.cpp

The program src/odd_shorts.cpp shows how you can use the various SIMD instructions to do some neat things. In this example, we have an array array of 8 random shorts (16-bit numbers). The code in src/odd_shorts.cpp turns the even ones into zeros. It does that by doing an AND to set the smallest bit to one for each odd number. It then subtracts those from 0. If the smallest bit was odd, this subtraction turns the short into 0xffff. Otherwise, the number remains 0. You AND this with the original vector, and that turns all of the even numbers into zeros.

The code at the end shows a 64-bit right-shift. You should see why I print the shorts from 7 down to 0, rather than from 0 to 7 in this example. It is because of the endian-ness of the integer types. Were you to print the shorts from 0 to 7, you'd be more confused at how the right-shift works. This also applies to when you print out the bytes rather than shorts -- print them from 15 down to 0, rather than from 0 to 15.

UNIX> make bin/odd_shorts
g++ -Wall -Wextra -Iinclude -o bin/odd_shorts src/odd_shorts.cpp src/MOA.cpp -msse2
UNIX> bin/odd_shorts 1
                                  7    6    5    4    3    2    1    0
b=Starting values:             33f4 ab41 7469 f4ca c748 eeea 3001 6382
x=_mm_set1_epi16(1)            0001 0001 0001 0001 0001 0001 0001 0001
a=_mm_and_si128(x, b)          0000 0001 0001 0000 0000 0000 0001 0000
x=_mm_setzero_si128()          0000 0000 0000 0000 0000 0000 0000 0000
a=_mm_sub_epi16(x, a)          0000 ffff ffff 0000 0000 0000 ffff 0000
b=_mm_and_si128(a, b)          0000 ab41 7469 0000 0000 0000 3001 0000
Finishing values:              0000 ab41 7469 0000 0000 0000 3001 0000

b=Starting values:             93c1 da15 3ff6 64c9 95ec 872d 8fd3 c706
b=_mm_srli_epi64(b, 4)         093c 1da1 53ff 664c 095e c872 d8fd 3c70
As uint64_t: 093c1da153ff664c 095ec872d8fd3c70
UNIX> bin/odd_shorts 2
                                  7    6    5    4    3    2    1    0
b=Starting values:             6545 27e5 7016 6a05 2622 53c6 e361 aed1
x=_mm_set1_epi16(1)            0001 0001 0001 0001 0001 0001 0001 0001
a=_mm_and_si128(x, b)          0001 0001 0000 0001 0000 0000 0001 0001
x=_mm_setzero_si128()          0000 0000 0000 0000 0000 0000 0000 0000
a=_mm_sub_epi16(x, a)          ffff ffff 0000 ffff 0000 0000 ffff ffff
b=_mm_and_si128(a, b)          6545 27e5 0000 6a05 0000 0000 e361 aed1
Finishing values:              6545 27e5 0000 6a05 0000 0000 e361 aed1

b=Starting values:             95a2 0ac5 ac0d 5aa7 2ce8 3d99 5ff6 ca72
b=_mm_srli_epi64(b, 4)         095a 20ac 5ac0 d5aa 02ce 83d9 95ff 6ca7
As uint64_t: 095a20ac5ac0d5aa 02ce83d995ff6ca7
UNIX> 
The program is below. For each of the procedures, look them up on Microsoft's web pages.

#include <iostream>
#include <cstdio>
#include <cstdlib>
#include <string.h>
using namespace std;

#include "MOA.h"
#include <emmintrin.h>

/* This is convenient -- a procedure to print out SIMD variables for debugging. */

void print_simd(string s, __m128i v)
{
  uint16_t array[8];
  __m128i *c;
  int i;

  c = (__m128i *) array;
  *c = v;
  printf("%-30s", s.c_str());
  for (i = 7; i >= 0; i--) printf(" %04x", array[i]);
  printf("\n");
}

int main(int argc, char **argv)
{
  uint16_t array[8];
  uint64_t *a64;
  int i;
  __m128i a, b, x, *c;

  if (argc != 2) {
    fprintf(stderr, "usage: odd_shorts seed\n");
    exit(1);
  }

  /* Set <b>array</b> to 8 random shorts */

  MOA_Seed(atoi(argv[1]));
  for (i = 0; i < 8; i++) array[i] = MOA_Random_W(16, 1);

  /* Print the numbers from 7 down to 1. */

  printf("%-30s", "");
  for (i = 7; i >= 0; i--) printf("%5d", i);
  printf("\n");

  /* Set c to the array, then turn the even numbers to zeros. */

  c = (__m128i *) array;     print_simd("b=Starting values:", *c);
  b = *c;
  x = _mm_set1_epi16(1);     print_simd("x=_mm_set1_epi16(1)", x);
  a = _mm_and_si128(x, b);   print_simd("a=_mm_and_si128(x, b)", a);
  x = _mm_setzero_si128();   print_simd("x=_mm_setzero_si128()", x);
  a = _mm_sub_epi16(x, a);   print_simd("a=_mm_sub_epi16(x, a)", a);
  b = _mm_and_si128(a, b);   print_simd("b=_mm_and_si128(a, b)", b);
  *c = b;

  print_simd("Finishing values:", *c);

  /* Show a 64-bit bit-shift. */

  for (i = 0; i < 8; i++) array[i] = MOA_Random_W(16, 1);
  
  printf("\n");
  c = (__m128i *) array;     print_simd("b=Starting values:", *c);
  b = *c;
  b = _mm_srli_epi64(b, 4);   print_simd("b=_mm_srli_epi64(b, 4)", b);
  *c = b;

  a64 = (uint64_t *) array;

  printf("As uint64_t: %016llx %016llx\n", (long long unsigned int) a64[1], 
                                           (long long unsigned int) a64[0]);
  exit(0);
}

The program src/odd_shorts_time.cpp does a timing much like and_or. If you specify 'Y' for argv[2], then it performs the SIMD calculation above. If you specify 'N', then it sets the even numbers to zero in a normal manner:

  /* Use regular code to isolate the odd shorts. */

  } else if (simd == 'N') {
    for (i = 0; i < bytes/2; i++) {
      if (buf1[i] % 2 == 0) buf1[i] = 0;
    }

  }

If you specify 'none', then it does nothing. As you can see, even with compiler optimization, the SIMD version is much faster:

UNIX> time bin/odd_shorts_time 16000000 none 5555 N
0.031u 0.005s 0:00.04 75.0%	0+0k 0+0io 0pf+0w
UNIX> time bin/odd_shorts_time 16000000 Y 5555 N
0.032u 0.006s 0:00.04 75.0%	0+0k 0+0io 0pf+0w
UNIX> time bin/odd_shorts_time 16000000 N 5555 N
0.065u 0.005s 0:00.07 85.7%	0+0k 0+0io 0pf+0w
UNIX>