If you can organize your computation to leverage these instructions, they can speed up your code impressively. I have an example in my lecture notes on the Floyd-Warshall algorithm, which doesn't have the actual code, but it does show the speed-up. (The lecture notes don't have the code, because you will get to write it for lab.)
In this set of lecture notes, I am going to demonstrate how to use a few of these instructions by employing the Microsoft Compiler Intrinsics. These intrinsics give you subroutines that compile down to the SSE instructions, thereby making your life easier as a programmer. You can also use "ASM" statements in your code, but I won't demonstrate that here.
#include <iostream> #include <cstdio> #include <cstdlib> #include <string.h> using namespace std; #include "MOA.h" #include <emmintrin.h> int main(int argc, char **argv) { int print; int andor; int bytes, bufsize; int i, j; uint8_t *buf1, *buf2, *buf3; __m128i *b1, *b2, *b3; /* usage: and_or bytes AND|OR|and|or|none seed print(Y|N) */ if (argc != 5 || (strcmp(argv[2], "AND") != 0 && strcmp(argv[2], "OR") != 0 && strcmp(argv[2], "and") != 0 && strcmp(argv[2], "or") != 0 && strcmp(argv[2], "none") != 0)) { fprintf(stderr, "usage: and_or bytes AND|OR|and|or|none seed print(Y|N)\n"); exit(1); } /* Set the variables from the command line */ bytes = atoi(argv[1]); andor = argv[2][0]; print = (argv[4][0] == 'Y'); MOA_Seed(atoi(argv[3])); /* Pad the number of bytes so that it's a multiple of 16, and allocate three buffers: */ bufsize = bytes; if (bufsize % 16 != 0) bufsize += (16 - bufsize%16); buf1 = (uint8_t *) malloc(bufsize); buf2 = (uint8_t *) malloc(bufsize); buf3 = (uint8_t *) malloc(bufsize); /* Use MOA.cpp to fill buf1 and buf2 with random bytes */ MOA_Fill_Random_Region (buf1, bytes); MOA_Fill_Random_Region (buf2, bytes); /* This lets you access the three buffers as 128-bit vectors. */ b1 = (__m128i *) buf1; b2 = (__m128i *) buf2; b3 = (__m128i *) buf3; /* Use SIMD to do AND. */ if (andor == 'A') { for (i = 0; i < bytes; i += 16) { *b3 = _mm_and_si128(*b1, *b2); b1++; b2++; b3++; } /* Use SIMD to do OR. */ } else if (andor == 'O') { for (i = 0; i < bytes; i += 16) { *b3 = _mm_or_si128(*b1, *b2); b1++; b2++; b3++; } /* Do the bit operations inefficiently on the bytes. */ } else if (andor == 'a') { for (i = 0; i < bytes; i++) buf3[i] = (buf1[i] & buf2[i]); } else if (andor == 'o') { for (i = 0; i < bytes; i++) buf3[i] = (buf1[i] | buf2[i]); } /* Print everything out. */ if (print) { printf("B1: "); for (i = 0; i < bytes; i++) { for (j = 0; j < 8; j++) printf("%d", (buf1[i] >> j) & 1); printf(" "); } printf("\n"); printf("B2: "); for (i = 0; i < bytes; i++) { for (j = 0; j < 8; j++) printf("%d", (buf2[i] >> j) & 1); printf(" "); } printf("\n"); printf("B3: "); for (i = 0; i < bytes; i++) { for (j = 0; j < 8; j++) printf("%d", (buf3[i] >> j) & 1); printf(" "); } printf("\n"); } return 0; } |
Most of this is straightforward. You need to include <emmintrin.h> so that you can use the Microsoft Intrinsics for the SSE2 instruction set, which supports 128-bit AND and OR. You'll also need to compile with -msse2. Obviously, your machine architecture has to support this (our lab machines do, as did my Macintoshes until they migrated to ARM).
UNIX> make bin/and_or g++ -Wall -Wextra -Iinclude -o bin/and_or src/and_or.cpp src/MOA.cpp -msse2 UNIX>
This program allocates three regions of bytes named buf1, buf2 and buf3, and it sets the first two regions to random bytes. It then either takes their bitwise AND or their bitwise OR, and puts the result into buf3. You'll call the program with
UNIX> bin/and_or usage: and_or bytes AND|OR|and|or|none seed print(Y|N) UNIX>The operation is specified with AND, OR, and, or or none. The capital letters specify to use the SIMD instructions, and lower-case letters specify to simply do the bit operations a byte at a time. We use none as a control.
Let's see it running. First, let's convince ourselves that it works:
UNIX> bin/and_or 5 AND 1 Y B1: 01000001 11000110 10000111 01011011 10000000 B2: 01010111 01110111 10011000 01101001 00010010 B3: 01000001 01000110 10000000 01001001 00000000 UNIX> bin/and_or 5 and 1 Y B1: 01000001 11000110 10000111 01011011 10000000 B2: 01010111 01110111 10011000 01101001 00010010 B3: 01000001 01000110 10000000 01001001 00000000 UNIX> bin/and_or 5 OR 1 Y B1: 01000001 11000110 10000111 01011011 10000000 B2: 01010111 01110111 10011000 01101001 00010010 B3: 01010111 11110111 10011111 01111011 10010010 UNIX> bin/and_or 5 or 1 Y B1: 01000001 11000110 10000111 01011011 10000000 B2: 01010111 01110111 10011000 01101001 00010010 B3: 01010111 11110111 10011111 01111011 10010010 UNIX>I'm convinced. Now, let's see a timing to demonstrate how much faster the SIMD instructions are:
UNIX> time bin/and_or 10000000 none 1 N 0.049u 0.011s 0:00.06 83.3% 0+0k 0+0io 0pf+0w UNIX> time bin/and_or 10000000 and 1 N 0.096u 0.018s 0:00.11 90.9% 0+0k 0+0io 0pf+0w UNIX> time bin/and_or 10000000 AND 1 N 0.054u 0.017s 0:00.07 85.7% 0+0k 0+0io 0pf+0w UNIX>Filling the areas with random numbers takes quite a bit of time -- about 0.050 seconds. That means that:
Take a look at the code that does the SIMD:
b1 = (__m128i *) buf1; b2 = (__m128i *) buf2; b3 = (__m128i *) buf3; if (andor == 'A') { for (i = 0; i < bytes; i += 16) { *b3 = _mm_and_si128(*b1, *b2); b1++; b2++; b3++; } |
This is straightforward -- the intrinsics have made life really easy, as we simply access our memory using those (__m128i *) pointers, and the compiler sets it up so that the memory is loaded into 128-bit registers, the registers are used for 128-bit AND operations, and the results are stored back to memory. You can use __m128i data types as well if you want to store intermediate values and not have them go back to memory.
I didn't compile with optimization previously. When you do, the optimizer is pretty smart about the "bad" loop -- it will figure out the SIMD code to use, and replace it so that they all run at the same time:
UNIX> make bin/and_or_opt g++ -Wall -Wextra -Iinclude -O3 -o bin/and_or_opt src/and_or.cpp src/MOA.cpp -msse2 UNIX> time bin/and_or_opt 10000000 none 1 N 0.036u 0.005s 0:00.04 75.0% 0+0k 0+0io 0pf+0w UNIX> time bin/and_or_opt 10000000 and 1 N 0.038u 0.008s 0:00.04 75.0% 0+0k 0+0io 0pf+0w UNIX> time bin/and_or_opt 10000000 AND 1 N 0.038u 0.007s 0:00.04 75.0% 0+0k 0+0io 0pf+0w UNIX>The compiler will also optimize your SIMD to do things like optimize register usage and pipelining. Optimizing SIMD is a true rabbit hole, so I'll stop with that, but if it's a rabbit hole that you enjoy, by all means dive in!
The page also tells you that you need to use __m128 pointers rather than __m128i, because the SIMD instructions are going to treat memory as floating point. That can be a pain, but since C/C++ let you cast pointers wantonly, you can typically make things work, even if you intermix floating point and integer instructions.
The code at the end shows a 64-bit right-shift. You should see why I print the shorts from 7 down to 0, rather than from 0 to 7 in this example. It is because of the endian-ness of the integer types. Were you to print the shorts from 0 to 7, you'd be more confused at how the right-shift works. This also applies to when you print out the bytes rather than shorts -- print them from 15 down to 0, rather than from 0 to 15.
UNIX> make bin/odd_shorts g++ -Wall -Wextra -Iinclude -o bin/odd_shorts src/odd_shorts.cpp src/MOA.cpp -msse2 UNIX> bin/odd_shorts 1 7 6 5 4 3 2 1 0 b=Starting values: 33f4 ab41 7469 f4ca c748 eeea 3001 6382 x=_mm_set1_epi16(1) 0001 0001 0001 0001 0001 0001 0001 0001 a=_mm_and_si128(x, b) 0000 0001 0001 0000 0000 0000 0001 0000 x=_mm_setzero_si128() 0000 0000 0000 0000 0000 0000 0000 0000 a=_mm_sub_epi16(x, a) 0000 ffff ffff 0000 0000 0000 ffff 0000 b=_mm_and_si128(a, b) 0000 ab41 7469 0000 0000 0000 3001 0000 Finishing values: 0000 ab41 7469 0000 0000 0000 3001 0000 b=Starting values: 93c1 da15 3ff6 64c9 95ec 872d 8fd3 c706 b=_mm_srli_epi64(b, 4) 093c 1da1 53ff 664c 095e c872 d8fd 3c70 As uint64_t: 093c1da153ff664c 095ec872d8fd3c70 UNIX> bin/odd_shorts 2 7 6 5 4 3 2 1 0 b=Starting values: 6545 27e5 7016 6a05 2622 53c6 e361 aed1 x=_mm_set1_epi16(1) 0001 0001 0001 0001 0001 0001 0001 0001 a=_mm_and_si128(x, b) 0001 0001 0000 0001 0000 0000 0001 0001 x=_mm_setzero_si128() 0000 0000 0000 0000 0000 0000 0000 0000 a=_mm_sub_epi16(x, a) ffff ffff 0000 ffff 0000 0000 ffff ffff b=_mm_and_si128(a, b) 6545 27e5 0000 6a05 0000 0000 e361 aed1 Finishing values: 6545 27e5 0000 6a05 0000 0000 e361 aed1 b=Starting values: 95a2 0ac5 ac0d 5aa7 2ce8 3d99 5ff6 ca72 b=_mm_srli_epi64(b, 4) 095a 20ac 5ac0 d5aa 02ce 83d9 95ff 6ca7 As uint64_t: 095a20ac5ac0d5aa 02ce83d995ff6ca7 UNIX>The program is below. For each of the procedures, look them up on Microsoft's web pages.
#include <iostream> #include <cstdio> #include <cstdlib> #include <string.h> using namespace std; #include "MOA.h" #include <emmintrin.h> /* This is convenient -- a procedure to print out SIMD variables for debugging. */ void print_simd(string s, __m128i v) { uint16_t array[8]; __m128i *c; int i; c = (__m128i *) array; *c = v; printf("%-30s", s.c_str()); for (i = 7; i >= 0; i--) printf(" %04x", array[i]); printf("\n"); } int main(int argc, char **argv) { uint16_t array[8]; uint64_t *a64; int i; __m128i a, b, x, *c; if (argc != 2) { fprintf(stderr, "usage: odd_shorts seed\n"); exit(1); } /* Set <b>array</b> to 8 random shorts */ MOA_Seed(atoi(argv[1])); for (i = 0; i < 8; i++) array[i] = MOA_Random_W(16, 1); /* Print the numbers from 7 down to 1. */ printf("%-30s", ""); for (i = 7; i >= 0; i--) printf("%5d", i); printf("\n"); /* Set c to the array, then turn the even numbers to zeros. */ c = (__m128i *) array; print_simd("b=Starting values:", *c); b = *c; x = _mm_set1_epi16(1); print_simd("x=_mm_set1_epi16(1)", x); a = _mm_and_si128(x, b); print_simd("a=_mm_and_si128(x, b)", a); x = _mm_setzero_si128(); print_simd("x=_mm_setzero_si128()", x); a = _mm_sub_epi16(x, a); print_simd("a=_mm_sub_epi16(x, a)", a); b = _mm_and_si128(a, b); print_simd("b=_mm_and_si128(a, b)", b); *c = b; print_simd("Finishing values:", *c); /* Show a 64-bit bit-shift. */ for (i = 0; i < 8; i++) array[i] = MOA_Random_W(16, 1); printf("\n"); c = (__m128i *) array; print_simd("b=Starting values:", *c); b = *c; b = _mm_srli_epi64(b, 4); print_simd("b=_mm_srli_epi64(b, 4)", b); *c = b; a64 = (uint64_t *) array; printf("As uint64_t: %016llx %016llx\n", (long long unsigned int) a64[1], (long long unsigned int) a64[0]); exit(0); } |
The program src/odd_shorts_time.cpp does a timing much like and_or. If you specify 'Y' for argv[2], then it performs the SIMD calculation above. If you specify 'N', then it sets the even numbers to zero in a normal manner:
/* Use regular code to isolate the odd shorts. */ } else if (simd == 'N') { for (i = 0; i < bytes/2; i++) { if (buf1[i] % 2 == 0) buf1[i] = 0; } } |
If you specify 'none', then it does nothing. As you can see, even with compiler optimization, the SIMD version is much faster:
UNIX> time bin/odd_shorts_time 16000000 none 5555 N 0.031u 0.005s 0:00.04 75.0% 0+0k 0+0io 0pf+0w UNIX> time bin/odd_shorts_time 16000000 Y 5555 N 0.032u 0.006s 0:00.04 75.0% 0+0k 0+0io 0pf+0w UNIX> time bin/odd_shorts_time 16000000 N 5555 N 0.065u 0.005s 0:00.07 85.7% 0+0k 0+0io 0pf+0w UNIX>