CS140 -- Lab 9

CS140 -- Data Structures
James S. Plank (with modifications by Brad Vander Zanden)

This is a lab designed to give you practice with hashing. You will implement a hashing library in two different ways, one using separate chaining and one using quadratic probing. You will then implement a program that reads score files and computes the average score for each individual. A user will be able to query names and your program will print the individual's average score and number of scores. This program will use your hash libraries. You will then write a second implementation that uses the Unix-provided hash table library.

Lab Materials

Executables for the test files are in the directory /home/bvz/cs140/labs/lab9. As usual, if you have questions about how these programs should work, try these. I will not post the executables until Monday at 4pm so that you have to answer the design questions without the benefit of the executables.
I have not provided test files for your hash table library because you need to learn how to write simple test files on your own. You will be made to write a test file as part of your design document.
There are two simple example score files for scoreproc in scfile1 and scfile2.
Scorefiles for the 1998 PGA golf season are in /home/bvz/cs140/labs/lab9/golf/*.
You may find that you want to use the sqrt function from the C math library. To use it you will need to include <math.h> in your file. The sqrt function takes a double as an argument and returns a double. You may want to use this function when computing a prime number.
In order to link the C math library into your executable you will need to place the -lm flag at the end of your compilation command. When linking in C libraries it is important to place them last in your list of files because the linker loads only those library functions that it needs, and it decides what it needs using only the files it has already seen. Thus if you place the -lm flag before your scoreproc.o file, the linker will not yet have seen the call to sqrt when it processes the -lm flag, and it will not link in the sqrt function. You will then get a linker error that claims that sqrt cannot be found and you will tear out your hair because you "know" that you included the math library. By the way, the -l indicates that a library name follows and the "m" indicates the math library.
I have added a new file to the objs directory called string_conversion.o and a new file to the include directory called string_conversion.h. If you want to see the source code, you can find it in /home/bvz/cs140/src/string_conversion.c. This new file provides four functions that convert strings to integers, longs, floats, and doubles respectively. They are more reliable than sscanf because 1) they will not treat any variation of "nan" as a number, and 2) they will not treat a word that starts with numbers but has trailing letters, like 123xy, as a number. As an example of the difference between sscanf and these functions, sscanf can accept the string "Nancy" as a number because it begins with "nan", which it reads as the special floating point value NaN, whereas these functions will treat it as a string that is not a number. The functions take two arguments, a character string and an address where they will place the converted result. They return a boolean denoting either success (true) or failure (false). I would prefer that you use these functions instead of sscanf, atoi, or atof.
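The kind of checking described above can be sketched with the standard strtod function. This is only an illustration of the idea: the name strict_string_to_double is hypothetical, and this is not the code or the interface in string_conversion.h.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in, NOT the actual string_conversion.h interface.
   Stores the converted value in *result and returns true only when the
   whole string is an ordinary decimal number. */
bool strict_string_to_double(const char *s, double *result)
{
    const char *p = s;
    char *end;
    double value;

    assert(s != NULL && result != NULL);

    /* After optional white space and a sign, insist on a digit or '.',
       which rejects words such as "nan", "Nancy", and "inf". */
    while (isspace((unsigned char)*p)) p++;
    if (*p == '+' || *p == '-') p++;
    if (!isdigit((unsigned char)*p) && *p != '.') return false;

    errno = 0;
    value = strtod(s, &end);
    /* Reject no conversion, trailing characters ("123xy"), and
       out-of-range values. */
    if (end == s || *end != '\0' || errno == ERANGE) return false;

    *result = value;
    return true;
}
```

The second argument is left untouched on failure, so a caller can test the return value before trusting the result.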

Hash Table Library

In this part of the lab you are going to create a hash table library that supports the following interface. You must adhere precisely to this interface in creating the hash table library:

void *hash_table_create(int data_size, unsigned int (*hash)(void *key, int tablesize), bool (*compare_keys)(void *key1, void *key2)): Creates a record for a hash table, initializes the hash table, and returns a pointer to the record as a void *. data_size is an estimate of the number of entries the hash table must ultimately hold, hash is a pointer to a hash function, and compare_keys is a pointer to a function that compares two keys for equality. Your program should use data_size to calculate a size for your hash table. If you are using separate chaining, then your hash table size should be the first prime number greater than data_size. If you are using quadratic probing, your table size should be one less than the first power of 2 greater than data_size. For example, if your data_size is in the range 16-31, your table size would be 31, which is 2⁵-1. Your program will use the hash function to calculate the index at which a (key,value) pair should be inserted in the array representing the hash table.
bool hash_table_insert(void *key, void *value, void *hash_table): inserts the (key, value) pair into the hash table. Returns true if the insertion succeeds and false if it fails. The insertion should fail only if the key is already in the hash table.
void *hash_table_find(void *key, void *hash_table): locates the indicated key and returns either a pointer to the value associated with that key, or NULL if the key is not found in the hash table.
void hash_table_print(void *hash_table, void (*print_entry)(int entry, void *key, void *value)): prints the size of the hash table and then calls print_entry with each (key,value) pair in the table. print_entry should print the key and value associated with the given entry. entry is the index of the table entry that holds the pair. If separate chaining is used, then your print_entry function may be called multiple times with the same entry, but different values for the key and value.
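To make the interface above concrete, here is a minimal sketch of how create, insert, and find might look with separate chaining. It is one possible layout, not the required one. The names Node, Table, is_prime, string_hash, and string_equal are hypothetical, the sketch assumes that compare_keys returns true when two keys are equal, and it also returns false from hash_table_insert when malloc fails, which goes beyond the rule stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {              /* one (key, value) pair in a chain */
    void *key;
    void *value;
    struct node *next;
} Node;

typedef struct {                   /* the record hidden behind the void * */
    int size;
    Node **buckets;
    unsigned int (*hash)(void *key, int tablesize);
    bool (*compare_keys)(void *key1, void *key2);
} Table;

static bool is_prime(int n)
{
    int i;
    if (n < 2) return false;
    for (i = 2; i <= n / i; i++)
        if (n % i == 0) return false;
    return true;
}

void *hash_table_create(int data_size,
                        unsigned int (*hash)(void *key, int tablesize),
                        bool (*compare_keys)(void *key1, void *key2))
{
    Table *t = malloc(sizeof(Table));
    int n = data_size + 1;

    if (t == NULL) return NULL;
    while (!is_prime(n)) n++;      /* first prime greater than data_size */
    t->buckets = calloc((size_t)n, sizeof(Node *));
    if (t->buckets == NULL) { free(t); return NULL; }
    t->size = n;
    t->hash = hash;
    t->compare_keys = compare_keys;
    return t;
}

void *hash_table_find(void *key, void *hash_table)
{
    Table *t = hash_table;
    Node *p;

    assert(t != NULL);
    for (p = t->buckets[t->hash(key, t->size)]; p != NULL; p = p->next)
        if (t->compare_keys(key, p->key)) return p->value;
    return NULL;
}

bool hash_table_insert(void *key, void *value, void *hash_table)
{
    Table *t = hash_table;
    Node *p;
    unsigned int index;

    assert(t != NULL);
    index = t->hash(key, t->size);
    for (p = t->buckets[index]; p != NULL; p = p->next)
        if (t->compare_keys(key, p->key)) return false;   /* duplicate */
    p = malloc(sizeof(Node));
    if (p == NULL) return false;   /* assumption: report allocation failure */
    p->key = key;
    p->value = value;
    p->next = t->buckets[index];   /* push on the front of the chain */
    t->buckets[index] = p;
    return true;
}

/* Sample callbacks for string keys, used only to exercise the sketch. */
unsigned int string_hash(void *key, int tablesize)
{
    unsigned int total = 0;
    char *s = key;
    int i;
    for (i = 0; s[i] != '\0'; i++)
        total = ((total << 5) + s[i]) % 10000001;
    return total % tablesize;
}

bool string_equal(void *key1, void *key2)
{
    return strcmp(key1, key2) == 0;
}
```

Note that the table stores the key and value pointers it is given rather than copies, so the caller must keep that memory alive for as long as the table is in use.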

You should create a file named hash_table.h to declare these functions. You should implement the hash table library twice, once using separate chaining and once using quadratic probing with rehashing. You should name the two files separate_chaining.c and quadratic_probing.c. Your quadratic probing implementation should rehash whenever the table becomes more than half full. The new table size should be twice the current table size plus 1. For example, if your previous table size was 31, your new table size should be 63. Notice that this formula ensures that the table size is always 1 less than a power of 2.
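The sizing rules for quadratic probing can be sketched as a few small helpers. All four function names are hypothetical, and probe_index assumes the common quadratic sequence in which the i-th probe lands i*i slots past the home index. The lab does not prescribe a particular probe sequence, so treat that part as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* One less than the first power of 2 greater than data_size. */
int probing_table_size(int data_size)
{
    int power = 1;

    assert(data_size > 0);
    while (power <= data_size) power *= 2;
    return power - 1;
}

/* True once the table is more than half full. */
bool needs_rehash(int entry_count, int table_size)
{
    return 2 * entry_count > table_size;
}

/* Twice the current size plus 1, which stays 1 less than a power of 2. */
int rehash_table_size(int table_size)
{
    return 2 * table_size + 1;
}

/* Assumed probe sequence: home, home + 1, home + 4, home + 9, ... */
unsigned int probe_index(unsigned int home, unsigned int i, int table_size)
{
    assert(table_size > 0);
    return (home + i * i) % (unsigned int)table_size;
}
```

When rehashing, every existing pair has to be hashed again with the new table size, because the index a key maps to depends on the size that is passed to the hash function.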

Scoreproc

You will now use your hash table library to implement a program, named scoreproc.c, that processes score files. A score file is a file where each line is either blank (in which case it should be ignored) or has a name and a score on it. The name can be multiple words with any amount of white space between them. You should convert all names to strings with just one space between the words. The last word on each line is a non-negative score, which is a floating point number (as always, use a double and error check if the last word is not a number).
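One way to separate the name from the score and collapse the white space is sketched below. It works on a plain character string rather than on an inputstruct, and the function name split_score_line and its parameters are hypothetical. Converting the score word to a double and checking that it is a non-negative number is left to the caller.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copies the last word of line into score and the preceding words, joined
   by single spaces, into name. Returns false for blank lines, lines with
   only one word, and words that do not fit in the buffers. */
bool split_score_line(const char *line, char *name, size_t name_size,
                      char *score, size_t score_size)
{
    size_t end, start, i, out = 0;

    assert(line != NULL && name != NULL && score != NULL);

    /* Locate the last word: trim trailing white space, then walk back. */
    end = strlen(line);
    while (end > 0 && isspace((unsigned char)line[end - 1])) end--;
    start = end;
    while (start > 0 && !isspace((unsigned char)line[start - 1])) start--;
    if (end == start || end - start >= score_size) return false;
    memcpy(score, line + start, end - start);
    score[end - start] = '\0';

    /* Copy the words before it, separated by exactly one space. */
    for (i = 0; i < start; i++) {
        if (isspace((unsigned char)line[i])) continue;
        if (out > 0 && isspace((unsigned char)line[i - 1])) {
            if (out + 1 >= name_size) return false;
            name[out++] = ' ';
        }
        if (out + 1 >= name_size) return false;
        name[out++] = line[i];
    }
    if (out == 0) return false;    /* no name before the score */
    name[out] = '\0';
    return true;
}
```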

Input

scoreproc takes a number indicating the approximate number of unique names and then a list of score files on the command line. It then reads each score in every file, and for each name, it computes the average score for that name. In other words, a name can have multiple entries in a score file, and different score files can have different scores with the same name.

For example, the files scfile1 and scfile2 are two simple score files:

UNIX> cat scfile1
Phil Fulmer 9
Pat Summitt  10
Cutcliffe 8
UNIX> cat scfile2
Rod Delmonico 7
Pat Summitt 11
Cutcliffe 6

If we call scoreproc with both files as command line arguments, the program will keep track of four names:

Phil Fulmer, with an average score of 9.
Pat Summitt, with an average score of 10.5.
Cutcliffe, with an average score of 7.
Rod Delmonico, with an average score of 7.

(Note those files are really meaningless -- I just assigned random numbers to non-random names....)

Program Actions

scoreproc must first read all the name/score pairs and place them into a hash table. You should use an inputstruct to read each file and then you should use jettison_inputstruct to close each file before reading the next one.

When reading a name your program should first check whether the name is already in the hash table. If it is not then you should create a record for the name and add it to the hash table. Regardless of whether or not you create a record you will then need to update the score information so that you can calculate an average score once all the name/score pairs have been read. You only need to compute an average score when it is requested by the user so your program can keep a running sum of the scores and a running count of the number of scores.
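The running sum and running count can be kept in a small record, as in the sketch below. The type name ScoreRecord and the two helper functions are hypothetical, and your own struct may well hold more than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-name record: enough to compute an average on demand. */
typedef struct {
    double sum;    /* running sum of the scores seen so far */
    int count;     /* number of scores seen so far */
} ScoreRecord;

/* Folds one more score into the record. */
void add_score(ScoreRecord *r, double score)
{
    assert(r != NULL);
    r->sum += score;
    r->count++;
}

/* Computes the average only when asked; an empty record averages to 0. */
double average_score(const ScoreRecord *r)
{
    assert(r != NULL);
    return r->count == 0 ? 0.0 : r->sum / r->count;
}
```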

Once all the name/score pairs have been read your program should ask the user to enter a name. Your program should print the number of scores plus the average score for that name. If the name wasn't specified in the score files, then your program should say that the name isn't found:


UNIX> scoreproc 4 scfile1 scfile2
Enter a name: Pat Summitt
  Pat Summitt: Avg score: 10.50   #scores: 2
Enter a name: Phil Fulmer
  Phil Fulmer: Avg score: 9.00   #scores: 1
Enter a name: Jim Plank
  Jim Plank is not in the score files
Enter a name:  < CNTL-D >

Once the user terminates their queries by entering CNTL-D your program should call hash_table_print in order to print the hash table. Continuing with the previous example, you would get the following output once the user hits CNTL-D:


table size = 5
0: Empty
1: 	Cutcliffe: Avg score: 7.00   #scores: 2
	Pat Summitt: Avg score: 10.50   #scores: 2
2: 	Phil Fulmer: Avg score: 9.00   #scores: 1
3: Empty
4: 	Rod Delmonico: Avg score: 7.00   #scores: 1
UNIX>

Note that for separate chaining there may be multiple records per table entry whereas for quadratic probing there should be only one record per entry.

Printing Notes

All scores and averages should be printed to two decimal places.
You can get the proper indenting for the entries in each hash table bucket by using the \t character in your printf formatting string. The \t character indicates that printf should skip to the next tab stop and then start printing the next character. For example,
```
	printf("\t%s\n", "brad");
```
will indent "brad" by the number of positions defined for the tab stop on your computer.
Place three blank spaces between the average score and the character string "#scores".
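Putting these notes together, a format string along the following lines should produce one indented line for a hash table bucket. The sketch writes into a buffer with snprintf so that the result can be inspected, and the function name format_score_line is hypothetical. The same format string works with printf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Formats "<tab>name: Avg score: x.xx   #scores: n" into buf and returns
   the value snprintf returns (the length the full line needs). */
int format_score_line(char *buf, size_t size, const char *name,
                      double average, int count)
{
    assert(buf != NULL && name != NULL);
    /* %.2f prints two decimal places; three spaces precede "#scores". */
    return snprintf(buf, size, "\t%s: Avg score: %.2f   #scores: %d",
                    name, average, count);
}
```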

Implementation Details

You are going to implement your program in two different ways. First you are going to use the hash table library you develop and then you are going to use the UNIX provided hash table utility. You will ultimately test your program in three different ways:

You will compile it with the separate chaining version of the hash table library. To make your printing work properly, you can use either a global variable or a static local variable that keeps track of the last index that was printed. When your print function is called, you can use this variable to determine whether or not the index has already been printed.
You will compile it with the quadratic probing version of the hash table library. You will not need to modify your program. This shows the advantage of hiding the implementation of the library from your program. As long as the library's interface is unchanged, the library's implementation can be changed without forcing any code to be re-written in the client programs.
You will modify your program to use the UNIX hcreate and hsearch functions to handle your hash table. The modified program should be placed in scoreproc_unix.c. You will not be able to print out the hash table size or its entries at the end of the program so simply quit after the user presses Ctrl-D. Use the UNIX man command to look at the documentation for hcreate and hsearch. Simply type "man hcreate" or "man hsearch". You will find an example program that uses hcreate and hsearch at the end of the man page. You can also find an example program that counts the frequency of words in a file here.
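As a rough sketch of the third approach, the fragment below keeps a running sum and count per name with hsearch. The ENTRY type, the FIND and ENTER actions, and the functions hcreate, hsearch, and hdestroy come from <search.h>. The names ScoreRecord, record_score, and lookup_scores are hypothetical. The caller is assumed to have called hcreate with the estimated number of names before using either function.

```c
#include <assert.h>
#include <search.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    double sum;    /* running sum of scores */
    int count;     /* number of scores */
} ScoreRecord;

/* Adds one score for name, creating the table entry on first sight. */
bool record_score(const char *name, double score)
{
    ENTRY item, *found;
    ScoreRecord *rec;

    assert(name != NULL);
    item.key = (char *)name;
    item.data = NULL;
    found = hsearch(item, FIND);
    if (found == NULL) {
        rec = calloc(1, sizeof(ScoreRecord));
        if (rec == NULL) return false;
        /* hsearch stores the key pointer itself, so give it its own copy. */
        item.key = malloc(strlen(name) + 1);
        if (item.key == NULL) { free(rec); return false; }
        strcpy(item.key, name);
        item.data = rec;
        found = hsearch(item, ENTER);
        if (found == NULL) {       /* the table is full */
            free(item.key);
            free(rec);
            return false;
        }
    }
    rec = found->data;
    rec->sum += score;
    rec->count++;
    return true;
}

/* Returns the record for name, or NULL if the name was never entered. */
ScoreRecord *lookup_scores(const char *name)
{
    ENTRY item, *found;

    assert(name != NULL);
    item.key = (char *)name;
    item.data = NULL;
    found = hsearch(item, FIND);
    return found == NULL ? NULL : found->data;
}
```

Because hcreate manages a single table for the whole process, there is no table argument to pass around, which is one reason the client code looks different from the code that uses your own library.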

For the hashing function that you pass to your library, you should use the function shown below rather than the one described in figure 5.5 on page 152 in the book. The hashing function shown below is a modification of the algorithm in figure 5.5 which mods each intermediate sum by a very large number, in this case 10000001. When he did a number of experiments with different hash functions, Dr. Plank found that the hash function shown below produces far fewer collisions than the hash function used in the book. The reason he provides is that the book's algorithm "shifts early characters off the left end of the word, and thus, we lose their information." In contrast, the mod operation that is performed on the intermediate sums preserves some of the information provided by the early characters in the word and hence provides a more uniform distribution.

unsigned int hash(void *key, int tablesize)
{
  unsigned int total;
  int i;
  char *s = (char *)key;

  total = 0;
  for (i = 0; s[i] != '\0'; i++) {
    total = ((total << 5) + s[i]) % 10000001;
  }
  return total % tablesize;
}
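The effect of the intermediate mod can be seen by comparing the function above with a shift-only variant. The function shift_only_hash below is a hypothetical reconstruction of that style of hash, written for this comparison, and it is not copied from the book. It uses a 32-bit total so that the overflow behavior is explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shift-only variant: nothing is reduced until the end, so
   once about seven more characters have been shifted in, the first
   character has left the 32-bit total entirely. */
unsigned int shift_only_hash(void *key, int tablesize)
{
    const char *s = key;
    uint32_t total = 0;
    int i;

    assert(tablesize > 0);
    for (i = 0; s[i] != '\0'; i++) {
        total = (total << 5) + (unsigned char)s[i];
    }
    return total % (uint32_t)tablesize;
}

/* The lab's function: each intermediate sum is reduced mod 10000001, so
   the total never overflows and every character still affects it. */
unsigned int hash(void *key, int tablesize)
{
    unsigned int total = 0;
    char *s = (char *)key;
    int i;

    assert(tablesize > 0);
    for (i = 0; s[i] != '\0'; i++) {
        total = ((total << 5) + s[i]) % 10000001;
    }
    return total % tablesize;
}
```

With these definitions, two long names that differ only in their first letter always get the same value from shift_only_hash, whatever the table size, while the lab's function keeps their totals distinct before the final mod by tablesize.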

There you have it. Try the example above. Also, a great set of input files to try are the files in /home/bvz/cs140/labs/lab9/golf/* (these are all the PGA tournament results from 1998, minus the Masters, with scores normalized over a 100-point scale):

UNIX> scoreproc 400 /home/bvz/cs140/labs/lab9/golf/*
Enter a name: Tiger Woods
  Tiger Woods: Avg score: 30.32   #scores: 14
Enter a name: Davis Love III
  Davis Love III: Avg score: 38.20   #scores: 13
Enter a name: Glen Day
  Glen Day: Avg score: 49.37   #scores: 20
Enter a name: Jim Plank
  Jim Plank is not in the score files
Enter a name: < CNTL-D >

table size = 401
0: Empty
1: 	Lee Rinker: Avg score: 62.89   #scores: 25
	Scott Gump: Avg score: 62.21   #scores: 23
2: Empty
3: 	Pete Jordan: Avg score: 86.46   #scores: 4
4: 	Jay Horton: Avg score: 100.00   #scores: 1
...
UNIX>

Error Checking

If any of the score files cannot be opened, print an appropriate error message and quit. The design document asks you to list other errors that you think you should check for.

Design Document

Your design document should answer the following questions. You must hand it in when the TA asks for your answers. The answers to the design document will be handed out in the Friday lab and will be posted at 4pm Monday on the lab web page.

Compute the value returned by the above hash function for each of the names in scfile1 and scfile2. Assume a tablesize of 11.
Given the names in scfile1 and scfile2, what will be the size of your hash table using:
1. separate chaining
2. quadratic probing
Note: 11 is not the answer to either of the above questions.
Using the names in scfile1 and scfile2, show which names will get assigned to which entries in the hash table using:
1. separate chaining
2. quadratic probing
You only need to show the names of the people, and not the running score sum or score count.
An example format for your answer might be:
```
0: empty
1: bvz, smiley
2: tang, camel, mouse
```
Show the struct that you will declare to hold the information associated with a person. This struct will be passed as the value parameter to your hash table.
What error checks do you think your program should perform?
Write a test program called hash_test.c that performs the following actions:
1. inserts character strings read from the command line into your hash table.
2. performs a find on each string in argv and prints out a string if it is not found. The only string not found will be argv[0], which is the name of your executable.
3. prints the hash table size and contents using hash_table_print
Print this file out and hand it in with the rest of your design document. You should be able to modify my bst_test_option2.c from lab7 instead of starting from scratch.

What To Submit

You should submit your design document when the TA asks for it during the lab. You will submit the following files to the TAs via the submit script:

scoreproc.c
scoreproc_unix.c
hash_table.h
separate_chaining.c
quadratic_probing.c