
# Fast Galois Field Arithmetic Library in C/C++

### http://web.eecs.utk.edu/~jplank

Technical Report UT-CS-07-593
Department of Computer Science
University of Tennessee

The online home for this document is:
http://web.eecs.utk.edu/~jplank/plank/papers/CS-07-593

# If You Use This Code

Please send me email (plank@cs.utk.edu) and let me know. One of the ways in which I am evaluated is the impact of my work, and hard data on how many people use this code is important.

## Acknowledgements

This web site is based upon work supported by the National Science Foundation under Grants CNS-0615221 and CNS-0437508. I would also like to thank Cheng Huang, Lihao Xu, Jin Li and Mochan Shrestha for helpful discussions, and Michel Machado for helpful suggestions.

# Introduction

Galois Field arithmetic is fundamental to many applications, especially Reed-Solomon coding. With a Galois Field GF(2^w), addition, subtraction, multiplication and division operations are defined over the numbers 0, 1, ..., 2^w-1, in such a way that:

• They are closed -- if a and b are elements of the field, then so are (a+b), (a-b), (a*b) and (a/b).

• They adhere to all the well-known properties of addition/subtraction/multiplication/division. For example, addition and multiplication are commutative and associative; multiplication distributes over addition; etc.

• Every element has a multiplicative inverse. This is the most important property. It means that if a is an element of the field and a ≠ 0, then there exists an element b that is also an element of the field such that ab = 1.

When w = 8, the Galois Field GF(2^8) comprises the elements 0, 1, ..., 255. This is an important field because it allows you to perform arithmetic that adheres to the above properties on single bytes. Again, this is essential in Reed-Solomon coding and other kinds of coding. Similarly, w=16 and w=32 are other important fields.

There are useful applications for fields with other values of w. For example, Cauchy Reed-Solomon coding employs Galois Fields with other values of w, and converts them to strictly binary (XOR) codes.

Galois Fields are covered in standard texts on Error Correcting Codes such as Peterson & Weldon [PW72], MacWilliams and Sloane [MS77], and Van Lint [VL82]. These treatments are thorough, and take a bit of time to understand. For a briefer and more pragmatic treatment, see Plank's 1997 tutorial on Reed-Solomon coding [P97], [PD05].

# The Library Files

This library comes as a tar file: galois.tar. The files that compose galois.tar are:

# Using the Command Line Tools

Type make to compile the library and the command line tools. Testing the command line tools is a nice way to see how things work with Galois Fields. First, addition and subtraction in a Galois Field are equivalent -- they are equal to the bitwise exclusive-or operation. The program gf_xor allows you to take the bitwise exclusive-or of two numbers:

```
Unix-Prompt> gf_xor 15 7
8
Unix-Prompt> gf_xor 8 7
15
Unix-Prompt> gf_xor 230498 2738947
2772833
Unix-Prompt> gf_xor 2772833 230498
2738947
Unix-Prompt>
```

The program gf_mult takes three arguments: a, b and w, and prints out the product of a and b in GF(2^w). The program gf_div performs division, and the program gf_inverse returns the multiplicative inverse of a number in GF(2^w):

```
Unix-Prompt> gf_mult 3 7 4
9
Unix-Prompt> gf_div 9 3 4
7
Unix-Prompt> gf_div 9 7 4
3
Unix-Prompt> gf_mult 1234567 2345678 32
1404360778
Unix-Prompt> gf_div 1404360778 1234567 32
2345678
Unix-Prompt> gf_div 1404360778 2345678 32
1234567
Unix-Prompt> gf_inverse 1404360778 32
106460795
Unix-Prompt> gf_mult 1404360778 106460795 32
1
Unix-Prompt>
```
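To see where a result such as `gf_mult 3 7 4` = 9 can come from, here is a minimal sketch of shift-and-reduce multiplication with a brute-force inverse. It is not the library's code. The names `sketch_mult` and `sketch_inverse` are hypothetical, and the primitive polynomial x^4 + x + 1 (0x13) for w = 4 is an assumption, since this document does not say which polynomial the library uses.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply a and b in GF(2^w) by shift-and-reduce.
   prim_poly includes the x^w term, e.g. 0x13 for x^4 + x + 1 (assumed).
   Both a and b must already be smaller than 2^w. */
static uint32_t sketch_mult(uint32_t a, uint32_t b, int w, uint32_t prim_poly)
{
    uint32_t product = 0;
    assert(w >= 1 && w <= 16);             /* keep the sketch to small fields */
    while (b != 0) {
        if (b & 1u) product ^= a;          /* addition in GF(2^w) is XOR */
        b >>= 1;
        a <<= 1;                           /* multiply a by x */
        if (a & (1u << w)) a ^= prim_poly; /* reduce once the degree reaches w */
    }
    return product;
}

/* Brute-force multiplicative inverse: the b for which a * b == 1.
   Returns 0 when a is 0, which has no inverse. */
static uint32_t sketch_inverse(uint32_t a, int w, uint32_t prim_poly)
{
    for (uint32_t b = 1; b < (1u << w); b++)
        if (sketch_mult(a, b, w, prim_poly) == 1u) return b;
    return 0;
}
```

Under that assumed polynomial, `sketch_mult(3, 7, 4, 0x13)` evaluates to 9, and multiplying 9 by `sketch_inverse(3, 4, 0x13)` gives back 7, which agrees with the gf_mult and gf_div output above.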

The other command line tools are discussed at the end of this document.

# Using the Library

The files galois.h and galois.c implement a library of procedures for Galois Field Arithmetic in GF(2^w) for w between 1 and 32. The library is written in C, but will work in C++ as well. It is especially tailored for w equal to 8, 16 and 32, but it is also applicable for any other value of w. For the smaller values of w (where multiplication or logarithm tables fit into memory), these procedures should be very fast.

In the following sections, we describe the procedures implemented by the library.

### 1. General purpose multiplication and division: galois_single_multiply() and galois_single_divide()

The syntax of these two calls is:

```
int galois_single_multiply(int a, int b, int w);
int galois_single_divide(int a, int b, int w);
```

Galois_single_multiply() returns the product of a and b in GF(2^w), and galois_single_divide() returns the quotient of a and b in GF(2^w). w may have any value from 1 to 32.

The decision to make these procedures use regular, signed integers instead of unsigned integers was largely for convenience. It only makes a difference when w equals 32, in which case the sign bit of a, b, or the return value may be set. If it matters, simply convert the integers to unsigned integers. The procedures in this library will work regardless -- when w equals 32, the values are treated as streams of bits and not integers.

It is anticipated that most applications that need to perform single multiplications and divisions only need reasonable performance, which is what these two procedures give you. If you need faster multiplication and division, then see the procedures below, which allow you to get much faster performance.

### 2. Multiplying a region of bytes by a single number in GF(2^8), GF(2^16) and GF(2^32)

A common use of Galois Field arithmetic is multiplying a region of bytes by a single number. This is the basic operation of Reed-Solomon encoding and decoding. This library provides the following three procedures for performing region multiplication:

```
void galois_w08_region_multiply(char *region, int multby, int nbytes, char *r2, int add);
void galois_w16_region_multiply(char *region, int multby, int nbytes, char *r2, int add);
void galois_w32_region_multiply(char *region, int multby, int nbytes, char *r2, int add);
```

These multiply the region of bytes specified by region and nbytes by the number multby in the field specified by the procedure's name. Region should be long-word aligned, otherwise these routines will generate a bus error. There are three separate functionalities of these procedures denoted by the values of r2 and add.

1. If r2 is NULL, the bytes in region are replaced by their products with multby.
2. If r2 is not NULL and add is zero, then the products are placed in the nbytes of memory starting with r2. The two regions should not overlap unless r2 is less than region.
3. If r2 is not NULL and add is one, then the products are calculated and then XOR'd into existing bytes of r2.
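The three cases above can be sketched with a slow, byte-at-a-time reference routine for w = 8. This only illustrates the semantics of r2 and add. It does not reproduce the library's table-driven implementation or its alignment and overlap rules. The names `sketch_mult8` and `sketch_w08_region_multiply` are hypothetical, and the polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply two bytes in GF(2^8); 0x11D is an ASSUMED primitive polynomial. */
static uint8_t sketch_mult8(uint8_t a, uint8_t b)
{
    uint32_t x = a, product = 0;
    while (b != 0) {
        if (b & 1u) product ^= x;
        b >>= 1;
        x <<= 1;
        if (x & 0x100u) x ^= 0x11Du;   /* reduce back to 8 bits */
    }
    return (uint8_t) product;
}

/* Reference semantics only: no tables and no word-at-a-time tricks. */
static void sketch_w08_region_multiply(char *region, int multby, int nbytes,
                                       char *r2, int add)
{
    assert(region != NULL && nbytes >= 0);
    for (int i = 0; i < nbytes; i++) {
        uint8_t p = sketch_mult8((uint8_t) region[i], (uint8_t) multby);
        if (r2 == NULL)
            region[i] = (char) p;          /* case 1: overwrite region */
        else if (!add)
            r2[i] = (char) p;              /* case 2: store products in r2 */
        else
            r2[i] = (char) (r2[i] ^ p);    /* case 3: XOR products into r2 */
    }
}
```

Calling the sketch twice with the same inputs and add set to one XORs the same products into r2 twice, which leaves r2 as it started. That property is what makes case 3 convenient for accumulating parity.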

The performance of these procedures has been tuned to be very fast. A multiplication table is employed when w=8. Log and inverse log tables are employed with w=16, and seven multiplication tables are employed when w=32.

### 3. XOR-ing a region of bytes

The following procedure allows you to perform the bitwise exclusive-or of a region of bytes.

```
void galois_region_xor(char *r1, char *r2, char *r3, int nbytes);
```

R3 may equal either r1 or r2 if you wish to overwrite either of them. Again, all pointers should be long-word aligned.
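As an illustration of why r3 may alias r1 or r2, here is a byte-wise sketch of a region XOR. The name `sketch_region_xor` is hypothetical. The library routine presumably works on whole long words, which would explain the alignment requirement, but that is an assumption and this sketch does not model it.

```c
#include <assert.h>
#include <stddef.h>

/* r3[i] = r1[i] XOR r2[i] for each of the nbytes bytes.
   Each output byte depends only on the input bytes at the same index,
   so r3 may point at the same memory as r1 or r2. */
static void sketch_region_xor(const char *r1, const char *r2, char *r3, int nbytes)
{
    assert(r1 != NULL && r2 != NULL && r3 != NULL && nbytes >= 0);
    for (int i = 0; i < nbytes; i++)
        r3[i] = (char) (r1[i] ^ r2[i]);
}
```

For example, XOR-ing the bytes {15, 8} with {7, 7} in place yields {8, 15}, which matches the first two gf_xor results shown earlier.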

## Advanced uses -- fast single multiplications and divisions

While galois_single_multiply() and galois_single_divide() are nice general-purpose tools, their generality makes them slower than they need to be. For high performance, you may want to employ their underlying implementations, described below:

### 4. Using a multiplication table: galois_multtable_multiply() and galois_multtable_divide()

When w is small, the fastest way to perform multiplication and division is to employ multiplication and division tables. These tables consume 2^(2w+2) bytes each, so they are only applicable when w is reasonably small. For example, when w=8, this is 256 KB per table.

To use multiplication and division tables directly, use one or more of the following routines:

```
int galois_create_mult_tables(int w);
int galois_multtable_multiply(int a, int b, int w);
int galois_multtable_divide(int a, int b, int w);
int *galois_get_mult_table(int w);
int *galois_get_div_table(int w);
```

Galois_create_mult_tables(w) creates multiplication and division tables for a given value of w and stores them internally. If you call it twice with the same value of w, it will not create new tables the second time. You may call it with different values of w and the tables will be stored separately.

If successful, galois_create_mult_tables() will return 0. Otherwise, it will return -1, and any allocated memory will be freed.

Galois_multtable_multiply() and galois_multtable_divide() work just like galois_single_multiply() and galois_single_divide(), except they assume that you have called galois_create_mult_tables() for the appropriate value of w and that it was successful. They do not error-check, so if you have not called galois_create_mult_tables(), they will seg-fault. This decision was made for speed -- although for small values of w (between 1 and 9), galois_single_multiply() uses galois_multtable_multiply(), it is significantly slower because of the error checking that it does.

Finally, to free yourself of procedure call overhead, the routines galois_get_mult_table() and galois_get_div_table() return the tables themselves. The product/quotient of a and b is in element a*2^w+b, which of course may be computed quickly via bit arithmetic as ( (a << w) | b). You do not need to call galois_create_mult_tables() before calling galois_get_mult_table() or galois_get_div_table().
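The flat table layout indexed by ( (a << w) | b) can be sketched as follows. This is a stand-alone illustration, not the library's table builder: `sketch_mult` and `sketch_build_mult_table` are hypothetical names, and the polynomial passed in (0x13 for w = 4 in the usage note) is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Shift-and-reduce multiply in GF(2^w); prim_poly includes the x^w term. */
static uint32_t sketch_mult(uint32_t a, uint32_t b, int w, uint32_t prim_poly)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1u) product ^= a;
        b >>= 1;
        a <<= 1;
        if (a & (1u << w)) a ^= prim_poly;
    }
    return product;
}

/* Build a flat table of 2^(2w) ints in which entry (a << w) | b holds a * b.
   With 4-byte ints that is 2^(2w+2) bytes. Returns NULL if malloc fails;
   the caller frees the table. */
static int *sketch_build_mult_table(int w, uint32_t prim_poly)
{
    uint32_t n;
    int *table;
    assert(w >= 1 && w <= 8);              /* larger w makes the table huge */
    n = 1u << w;
    table = malloc(sizeof(int) * n * n);
    if (table == NULL) return NULL;
    for (uint32_t a = 0; a < n; a++)
        for (uint32_t b = 0; b < n; b++)
            table[(a << w) | b] = (int) sketch_mult(a, b, w, prim_poly);
    return table;
}
```

With a table built by `sketch_build_mult_table(4, 0x13)`, the lookup `table[(3 << 4) | 7]` yields 9 with a single memory access and no procedure call.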

### 5. Using log/anti-log tables: galois_logtable_multiply() and galois_logtable_divide()

When multiplication tables cannot be employed, the next fastest way to perform multiplication and division is to use log and inverse log tables, as described in [P97]. The log table consumes 2^(w+2) bytes and the inverse log table consumes 3*2^(w+2) bytes, which means that middling values of w may be handled. For example, when w=16, this is 1 MB of tables.

To use the log tables, use one or more of the following routines:

```
int galois_create_log_tables(int w);
int galois_logtable_multiply(int a, int b, int w);
int galois_logtable_divide(int a, int b, int w);
int galois_log(int value, int w);
int galois_ilog(int value, int w);
int *galois_get_log_table(int w);
int *galois_get_ilog_table(int w);
```

Galois_create_log_tables(w) creates log and inverse log tables for the given value of w, and stores them internally. If you call it twice with the same value of w, it will not create new tables the second time. You may call it with different values of w and the tables will be stored separately.

If successful, galois_create_log_tables() will return 0. Otherwise, it will return -1, and any allocated memory will be freed.

Galois_logtable_multiply() and galois_logtable_divide() work just like galois_single_multiply() and galois_single_divide(), except they assume that you have called galois_create_log_tables() for the appropriate value of w and that it was successful. They do not error-check, so if you have not called galois_create_log_tables(), they will seg-fault. This decision was made for speed -- although for medium values of w (between 10 and 22), galois_single_multiply() uses galois_logtable_multiply(), it is significantly slower because of the error checking that it does.

Galois_log() and galois_ilog() return the log and inverse log of an element of GF(2^w). You can use them to multiply using the following identity:

a * b = ilog[ (log[a] + log[b]) % ((1 << w)-1) ]
a / b = ilog[ (log[a] - log[b] + (1 << w)) % ((1 << w)-1) ]

The division identity takes into account C's weird definition of modular arithmetic with negative numbers.

To perform the fastest multiplication and division with these tables, you should get access to the tables themselves using galois_get_log_table(int w) and galois_get_ilog_table(int w). Then you may calculate the product/quotient of a and b as:

a * b = ilog [ log[a] + log[b] ]
a / b = ilog [ log[a] - log[b] ]

You do not have to worry about modular arithmetic because the ilog table contains three copies of the inverse logs, and is defined for indices between -2^w+1 and 2^(w+1)-2. This saves a few instructions.
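The three-copy trick can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the struct and function names are hypothetical, the caller must supply a primitive polynomial (0x13 is assumed for w = 4 in the usage note), and the sketch allocates just enough entries to cover every sum and difference of two logs.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int *log;        /* log[x] for x in 1 .. 2^w - 1; log[0] is a placeholder */
    int *ilog_base;  /* start of the allocation, kept so it can be freed */
    int *ilog;       /* points one copy in, so negative indices are valid */
} sketch_log_tables;

/* Returns 0 on success, -1 if allocation fails.
   prim_poly must be primitive, otherwise the tables are incomplete. */
static int sketch_build_log_tables(sketch_log_tables *t, int w, unsigned prim_poly)
{
    int order;
    unsigned x = 1;
    assert(w >= 1 && w <= 16);
    order = (1 << w) - 1;                       /* number of nonzero elements */
    t->log = malloc(sizeof(int) * (size_t) (order + 1));
    t->ilog_base = malloc(sizeof(int) * (size_t) order * 3);
    if (t->log == NULL || t->ilog_base == NULL) {
        free(t->log);
        free(t->ilog_base);
        return -1;
    }
    t->log[0] = -1;                             /* the log of zero is undefined */
    for (int i = 0; i < order; i++) {
        t->log[x] = i;
        /* store x^i in all three copies */
        t->ilog_base[i] = t->ilog_base[i + order] =
            t->ilog_base[i + 2 * order] = (int) x;
        x <<= 1;                                /* next power of the generator */
        if (x & (1u << w)) x ^= prim_poly;
    }
    t->ilog = t->ilog_base + order;             /* index 0 is the middle copy */
    return 0;
}

/* No modulo needed: the index stays within the three copies. */
static int sketch_log_mult(const sketch_log_tables *t, int a, int b)
{
    if (a == 0 || b == 0) return 0;
    return t->ilog[t->log[a] + t->log[b]];
}

static int sketch_log_div(const sketch_log_tables *t, int a, int b)
{
    assert(b != 0);
    if (a == 0) return 0;
    return t->ilog[t->log[a] - t->log[b]];
}
```

For w = 4 with the assumed polynomial, `sketch_log_mult(&t, 3, 7)` gives 9 and `sketch_log_div(&t, 9, 3)` gives 7. The zero checks are needed because zero has no logarithm.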

### 6. Shift-multiplication and slow division: galois_shift_multiply() and galois_shift_divide()

When tables are unusable, general-purpose multiplication and division is implemented with the following two procedures:

```
int galois_shift_multiply(int a, int b, int w);
int galois_shift_divide(int a, int b, int w);
```

Galois_shift_multiply() converts b into a w * w bit matrix and multiplies it by the bit vector a to create the product vector. You may see a quasi-tutorial description of this technique in the paper [P05]. It is significantly slower than the methods that use tables. However, it is general-purpose and requires no preallocation of memory.
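One way to picture the bit-matrix formulation is the sketch below. Column j of the matrix is b multiplied by x^j, and the product is the XOR of the columns selected by the set bits of a. The function names are hypothetical, the column-as-word storage is a choice made for this sketch, and the polynomial is supplied by the caller (0x13 for w = 4 is an assumption). It is not the library's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Fill cols[0..w-1] so that cols[j] is b * x^j reduced modulo prim_poly.
   Each column of the w x w bit matrix is stored as one w-bit word. */
static void sketch_bit_matrix(uint32_t b, int w, uint32_t prim_poly,
                              uint32_t cols[])
{
    assert(w >= 1 && w <= 16);
    for (int j = 0; j < w; j++) {
        cols[j] = b;
        b <<= 1;                              /* multiply by x */
        if (b & (1u << w)) b ^= prim_poly;    /* reduce */
    }
}

/* Matrix-vector product over GF(2): XOR the columns chosen by the bits of a. */
static uint32_t sketch_matrix_mult(uint32_t a, uint32_t b, int w,
                                   uint32_t prim_poly)
{
    uint32_t cols[16];
    uint32_t product = 0;
    sketch_bit_matrix(b, w, prim_poly, cols);
    for (int j = 0; j < w; j++)
        if (a & (1u << j)) product ^= cols[j];
    return product;
}
```

For a = 3 and b = 7 with w = 4, the columns start 7, 14, and bits 0 and 1 of a select both, so the product is 7 XOR 14 = 9. No memory is preallocated, which mirrors the trade-off described above.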

For division, Galois_shift_divide() also converts b into a bit matrix, inverts it, and then multiplies the inverse by a. As such, it is *really* slow. If I get a clue how to implement this one faster, I will.

Galois_single_multiply() uses galois_shift_multiply() for w between 23 and 31. Galois_single_divide() uses galois_shift_divide() for w between 23 and 32.

### 7. The special case of w=32: galois_split_w8_multiply()

Finally, for w = 32 the following procedures are defined:

```
int galois_create_split_w8_tables();
int galois_split_w8_multiply(int a, int b);
```

Galois_create_split_w8_tables() creates seven tables that are 256 KB each. Galois_split_w8_multiply() employs these tables to multiply the 32-bit numbers by breaking them into four eight-bit numbers each, and then performing sixteen multiplications and exclusive-ors to calculate the product. It's a cool technique suggested to me by Cheng Huang of Microsoft, and is a good 16 times faster than using galois_shift_multiply().

"But couldn't you use this technique for other values of w?" Yes, you could, but I'm not implementing it, because I don't think it's that important. If that view changes, I'll fix it.

Galois_single_multiply() uses galois_split_w8_multiply() for w = 32.
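The idea of splitting each 32-bit operand into four bytes and combining sixteen partial products can be sketched without any tables. This shows the algebra only. The function names are hypothetical, and the polynomial x^32 + x^22 + x^2 + x + 1 is an assumption. The loop that multiplies by x^8 does the work that a precomputed table for each value of i + j (0 through 6) could do. Whether the library's seven tables are laid out that way is also an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of an ASSUMED primitive polynomial for w = 32:
   x^32 + x^22 + x^2 + x + 1. */
#define SKETCH_POLY32_LOW 0x00400007u

/* Plain shift-and-reduce multiply in GF(2^32), used as the reference. */
static uint32_t sketch_mult32(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        uint32_t carry;
        if (b & 1u) product ^= a;
        b >>= 1;
        carry = a & 0x80000000u;               /* bit that would become x^32 */
        a <<= 1;
        if (carry) a ^= SKETCH_POLY32_LOW;
    }
    return product;
}

/* Split a and b into four bytes each; XOR together the sixteen partial
   products a_i * b_j * x^(8(i+j)). */
static uint32_t sketch_split_mult32(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t ai = (a >> (8 * i)) & 0xFFu;
        for (int j = 0; j < 4; j++) {
            uint32_t bj = (b >> (8 * j)) & 0xFFu;
            uint32_t term = sketch_mult32(ai, bj);   /* fits in 16 bits */
            for (int k = 0; k < i + j; k++)
                term = sketch_mult32(term, 0x100u);  /* times x^8, reduced */
            product ^= term;
        }
    }
    return product;
}

/* Cross-check the two routines on one pair of inputs. */
static void sketch_check_split(uint32_t a, uint32_t b)
{
    assert(sketch_split_mult32(a, b) == sketch_mult32(a, b));
}
```

Because multiplication distributes over XOR, the two routines agree for every pair of inputs, whichever polynomial is used. The speedup in the real routine comes from replacing the inner work with table lookups.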

The only possible race conditions in these codes are when the various tables are created. For that reason, galois_create_mult_tables() and galois_create_log_tables() should be protected by a mutex if thread safety is a concern.

Since galois_single_multiply() and galois_single_divide() call the table creation routines whenever the tables do not exist, if you are worried about thread safety, then for each value of w that you will use, you should make sure that the first call to galois_single_multiply() or galois_single_divide() is protected. After that, no protection is required.
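One way to protect table creation is a mutex around a "create once" wrapper, sketched below with POSIX threads. The wrapper name `sketch_ensure_tables` and the stand-in `sketch_create_tables` are hypothetical. In a real program the stand-in would be replaced by a call such as galois_create_mult_tables(w) or a first call to galois_single_multiply().

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t sketch_table_lock = PTHREAD_MUTEX_INITIALIZER;
static int sketch_ready[33];       /* one flag per w in 1..32 */
static int sketch_create_calls;    /* counts calls to the stand-in below */

/* Stand-in for a table-creation routine; it only records that it ran. */
static int sketch_create_tables(int w)
{
    (void) w;
    sketch_create_calls++;
    return 0;
}

/* Create the tables for w at most once, even if several threads race here.
   Returns 0 on success, or the creation routine's error code. */
static int sketch_ensure_tables(int w)
{
    int rc = 0;
    assert(w >= 1 && w <= 32);
    pthread_mutex_lock(&sketch_table_lock);
    if (!sketch_ready[w]) {
        rc = sketch_create_tables(w);
        if (rc == 0) sketch_ready[w] = 1;
    }
    pthread_mutex_unlock(&sketch_table_lock);
    return rc;
}
```

Each thread would call `sketch_ensure_tables(w)` once before its first multiplication for that w. After that, the unprotected fast paths can be used directly.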

# Testing Applications

The programs gf_mult, gf_div, gf_log, gf_ilog, gf_inverse and gf_xor are straightforward and allow you to test the various routines for various values of w.

Gf_basic_tester and gf_xor_tester test both correctness and speed. Call gf_xor_tester with no arguments to test the speed of galois_region_xor() on your system. Here it is on a MacBook whose CPU is a little busy doing other things:

```
Unix-Prompt> gf_xor_tester
XOR Tester
Seeding random number generator with 1172533188
Passed correctness test -- doing 10-second timing
1827.79986 Megabytes of XORs per second
Unix-Prompt>
```

Gf_basic_tester takes the following command line arguments:

• W: 1 through 32
• Method: This is one of the following words: default, multtable, logtable, shift, splitw8. It specifies how multiplication/division will be performed. Default uses galois_single_multiply() and galois_single_divide().
• Ntrials: This specifies how many random multiplies/divides to test for correctness.

After testing for correctness, gf_basic_tester tests for speed. There are three special cases: when method=default and w is 8, 16 or 32, gf_basic_tester also tests the speed of the corresponding region multiply procedure (galois_w08_region_multiply(), galois_w16_region_multiply() or galois_w32_region_multiply()).

For example (again, on my MacBook):

```
Unix-Prompt> gf_basic_tester 16 default 100000
W: 16
Method: default
Seeding random number generator with 1172533569
Doing 100000 trials for single-operation correctness.
Passed Single-Operations Correctness Tests.

Doing galois_w16_region_multiply correctness test.
Passed galois_w16_region_multiply correctness test.

Speed Test #1: 10 Seconds of Multiply operations
Speed Test #1: 42.23862 Mega Multiplies per second
Speed Test #2: 10 Seconds of Divide operations
Speed Test #2: 43.42448 Mega Divides per second

Doing 10 seconds worth of region_multiplies - Three tests:
Test 0: Overwrite initial region
Test 1: Products to new region
Test 2: XOR products into second region

Test 0: 253.45548 Megabytes of Multiplies per second
Test 1: 238.59569 Megabytes of Multiplies per second
Test 2: 167.99968 Megabytes of Multiplies per second
```

# References

• [MS77] F. J. MacWilliams and N. J. A. Sloane, The Theory of Error-Correcting Codes, Part I, North-Holland Publishing Company, Amsterdam, New York, Oxford, 1977.

• [PW72] W. W. Peterson and E. J. Weldon, Jr., Error-Correcting Codes, Second Edition, The MIT Press, Cambridge, Massachusetts, 1972, ISBN: 0-262-16-039-0.

• [P97] J. S. Plank, "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems," Software -- Practice & Experience, 27(9), September, 1997, pp. 995-1012. http://web.eecs.utk.edu/~jplank/plank/papers/SPE-9-97.html.

• [P05] J. S. Plank, "Optimizing Cauchy Reed-Solomon Codes for Fault-Tolerant Storage Applications", Technical Report CS-TR-05-569, University of Tennessee, November, 2005. http://web.eecs.utk.edu/~jplank/plank/papers/CS-05-569.html.

• [PD05] J. S. Plank and Y. Ding, "Note: Correction to the 1997 Tutorial on Reed-Solomon Coding", Software, Practice & Experience, Volume 35, Issue 2, February, 2005, pp. 189-194. http://web.eecs.utk.edu/~jplank/plank/papers/SPE-04.html.

• [VL82] J. H. van Lint, Introduction to Coding Theory, Springer-Verlag, New York, 1982.