CS140 -- Lab 10

CS140 -- Data Structures
James S. Plank (with modifications by Brad Vander Zanden)

Your separate_chaining.c and hashtable.h files are due after one week.
Your linear_probing.c and scoreproc.c files are due on the final due date.
If you cannot complete the separate_chaining.c file by the one-week deadline, move on to the linear_probing.c and scoreproc.c files, because you can no longer get credit for separate_chaining.c after that deadline.

This is a lab designed to give you practice with hashing. You will implement a hashing library in two different ways, one using separate chaining and one using linear probing. You will then implement a program that reads score files and computes the average score for each individual. A user will be able to query names and your program will print the individual's average score and number of scores. This program will use your hash libraries.

Lab Materials

Executables for the test files are in the directory /home/bvz/cs140/labs/lab10. As usual, if you have questions about how these programs should work, try these.
There is a test file called hash_test.c that you can use to test your library. The format of the command is:
```
hash_test word1 word2 ... wordk
```
hash_test inserts the words into the hash table, then tries to find them, and then prints out the hash table. It also tries to locate argv[0], which is hash_test. It will fail to find hash_test but should find all your words.
There are two simple example score files for scoreproc in scfile1 and scfile2.
I have created a makefile for you that you may use to create your executables. You may need to modify the LIBS line so that it includes either the sllist or dllist library. For example, if you use the dllist library in your program, you will need to modify the LIBS line so that it reads:
```
LIBS = $(LIBDIR)/fields.o $(LIBDIR)/dllist.o -lm
```
Scorefiles for the 1998 PGA golf season are in /home/bvz/cs140/labs/lab10/golf/*.
You may find that you want to use the sqrt function from the C math library. To use it you will need to include <math.h> in your file. The sqrt function takes a double as an argument and returns a double. You may want to use this function when computing a prime number.
In order to link the C math library to your executable you will need to place the -lm flag at the end of your compilation command. I have done this for you in the makefile. When linking in C libraries it is important to place them last in your list of files because the linker only loads those functions that it needs, and it decides which functions it needs using only the files it has already seen. Thus if you place the -lm flag before your scoreproc.o file, the linker will not yet have seen the call to sqrt when it processes the -lm flag, and it will not link in the sqrt function. You will then get a linker error that claims that sqrt cannot be found and you will tear out your hair because you "know" that you included the math library. By the way, the -l indicates that a library name follows and the "m" indicates the math library.
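
As a rough illustration of the prime number computation mentioned above, here is a minimal sketch that finds the first prime greater than a given number by trial division. It is not necessarily the algorithm presented in class, and the names is_prime and next_prime are made up for this example. The loop bound i <= n / i is the integer equivalent of i <= sqrt(n); a version that calls sqrt from <math.h> would need the -lm flag as described above.

```c
/* Returns 1 if n is prime and 0 otherwise, using trial division. */
static int is_prime(int n)
{
  int i;

  if (n < 2) return 0;
  if (n % 2 == 0) return n == 2;

  /* Only odd divisors up to the square root of n need to be tried. */
  for (i = 3; i <= n / i; i += 2) {
    if (n % i == 0) return 0;
  }
  return 1;
}

/* Returns the first prime number strictly greater than data_size. */
int next_prime(int data_size)
{
  int candidate = data_size + 1;

  while (!is_prime(candidate)) candidate++;
  return candidate;
}
```

With this sketch, next_prime(4) is 5 and next_prime(400) is 401, which are the table sizes that appear in the scoreproc examples later in this handout.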

Hash Table Library

In this part of the lab you are going to create a hash table library that supports the following interface. You must adhere precisely to this interface in creating the hash table library:

void *hash_table_create(int data_size, unsigned int (*hash)(void *key, int tablesize), int (*compare_keys)(void *key1, void *key2)): Creates a record for a hash table, initializes the hash table, and returns a pointer to the record as a void *. data_size is an estimate of the number of entries the hash table must ultimately hold and hash is a pointer to a hash function. Your program should use data_size to calculate a size for your hash table. For both separate chaining and linear probing, your hash table size should be the first prime number greater than data_size. Use the prime number algorithm presented in class. Your program will use the hash function to calculate the index at which a (key,value) pair should be inserted in the array representing the hash table. For linear probing it is important that your estimated data size be equal to or greater than the number of actual keys you are inserting or else the table may become full and insertions may fail because there are no empty entries in the hash table.
int hash_table_insert(void *hash_table, void *key, void *value): inserts the (key, value) pair into the hash table. Returns 1 if the insertion succeeds and 0 if the insertion fails. The insertion should fail only if the key is already in the hash table.
void *hash_table_find(void *hash_table, void *key): locates the indicated key and returns either a pointer to the value associated with that key, or NULL if the key is not found in the hash table.
void hash_table_print(void *hash_table, void (*print_entry)(int entry, void *key, void *value)): prints the number of entries in the hash table and then calls print_entry with each (key,value) pair in the table. print_entry should print the key and value associated with the given entry. entry is an index for the table's entry. If separate chaining is used, then your print_entry function may be called multiple times with the same entry, but different values for the key and value.

You should create a file named hashtable.h to declare these functions. You should implement the hash table library twice, once using separate chaining and once using linear probing. You should name the two files separate_chaining.c and linear_probing.c. Do not worry about rehashing for linear probing if the table starts becoming too full.
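
To make the interface concrete, here is a minimal sketch of how the create, insert, and find functions might fit together for separate chaining. It is one possible layout, not the required one: the Node and Table structs, the prime_after helper, and the compare_strings callback are all hypothetical names. It hand-rolls the chains instead of using the sllist or dllist library, and it assumes that compare_keys returns 0 when two keys are equal (as strcmp does), which this handout does not specify. The hash function is the one given later in this handout.

```c
#include <stdlib.h>
#include <string.h>

/* One (key, value) pair in a bucket's chain. */
typedef struct node {
  void *key;
  void *value;
  struct node *next;
} Node;

/* Hypothetical table record; the lab does not prescribe these fields. */
typedef struct {
  int size;         /* number of buckets (a prime) */
  Node **buckets;   /* buckets[i] heads the chain for index i */
  unsigned int (*hash)(void *key, int tablesize);
  int (*compare_keys)(void *key1, void *key2);
} Table;

/* Smallest prime strictly greater than n, by trial division. */
static int prime_after(int n)
{
  int candidate, i, is_prime;

  for (candidate = (n < 1) ? 2 : n + 1; ; candidate++) {
    is_prime = 1;
    for (i = 2; i <= candidate / i; i++) {
      if (candidate % i == 0) { is_prime = 0; break; }
    }
    if (is_prime) return candidate;
  }
}

void *hash_table_create(int data_size,
                        unsigned int (*hash)(void *key, int tablesize),
                        int (*compare_keys)(void *key1, void *key2))
{
  Table *t = malloc(sizeof(Table));

  if (t == NULL) return NULL;
  t->size = prime_after(data_size);
  t->buckets = calloc((size_t) t->size, sizeof(Node *));  /* all chains start empty */
  if (t->buckets == NULL) { free(t); return NULL; }
  t->hash = hash;
  t->compare_keys = compare_keys;
  return t;
}

/* Returns the node holding key, or NULL if the key is absent. */
static Node *find_node(Table *t, void *key)
{
  Node *n;

  for (n = t->buckets[t->hash(key, t->size)]; n != NULL; n = n->next) {
    if (t->compare_keys(n->key, key) == 0) return n;  /* assumes 0 means equal */
  }
  return NULL;
}

void *hash_table_find(void *hash_table, void *key)
{
  Node *n = find_node((Table *) hash_table, key);

  return (n == NULL) ? NULL : n->value;
}

int hash_table_insert(void *hash_table, void *key, void *value)
{
  Table *t = (Table *) hash_table;
  unsigned int index = t->hash(key, t->size);
  Node *n;

  if (find_node(t, key) != NULL) return 0;  /* duplicate key */
  n = malloc(sizeof(Node));
  if (n == NULL) return 0;                  /* assumption: treat out-of-memory as failure */
  n->key = key;
  n->value = value;
  n->next = t->buckets[index];              /* prepend to the chain */
  t->buckets[index] = n;
  return 1;
}

/* Example callbacks for string keys: the handout's hash and a strcmp wrapper. */
unsigned int hash(void *key, int tablesize)
{
  unsigned int total = 0;
  char *s = (char *) key;
  int i;

  for (i = 0; s[i] != '\0'; i++) total = ((total << 5) + s[i]) % 10000001;
  return total % tablesize;
}

int compare_strings(void *key1, void *key2)
{
  return strcmp((char *) key1, (char *) key2);
}
```

A client would call hash_table_create(4, hash, compare_strings) and then pass the returned pointer to hash_table_insert and hash_table_find. Note that the sketch stores the key pointer itself and does not copy the key, so the caller must keep the key's memory alive.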

Scoreproc

You will now use your hash table library to implement a program, named scoreproc.c, that processes score files. A score file is a file where each line is either blank (in which case it should be ignored) or has a name and a score on it. The name can be multiple words with any amount of white space between them. You should convert all names to strings with just one space between the words. The last word on each line is a positive score, which is a floating point number (as always, use a double and error check if the last word is not a number).
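
The line format above can be sketched without the fields library as follows. This is only an illustration under stated assumptions: parse_score_line, MAX_LINE, and MAX_WORDS are made-up names, lines are assumed to be shorter than 1000 characters, and the return codes (1 for success, 0 for a blank line, -1 for a malformed line) are my own convention. In your program you would more likely walk the fields of an inputstruct.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_LINE 1000   /* assumed maximum line length */
#define MAX_WORDS 100   /* assumed maximum words per line */

/* Splits line into a name with single spaces between words and a score.
   Returns 1 on success, 0 for a blank line, and -1 for a malformed line. */
int parse_score_line(const char *line, char *name, size_t name_size, double *score)
{
  char copy[MAX_LINE];
  char *words[MAX_WORDS];
  char *word, *end;
  double value;
  size_t used = 0, len;
  int nwords = 0, i;

  if (name_size == 0 || strlen(line) >= sizeof(copy)) return -1;
  strcpy(copy, line);   /* strtok modifies its argument, so work on a copy */

  for (word = strtok(copy, " \t\r\n"); word != NULL; word = strtok(NULL, " \t\r\n")) {
    if (nwords == MAX_WORDS) return -1;
    words[nwords++] = word;
  }
  if (nwords == 0) return 0;    /* blank line: the caller should ignore it */
  if (nwords == 1) return -1;   /* a lone word cannot be both name and score */

  /* The last word must be entirely a positive number. */
  value = strtod(words[nwords - 1], &end);
  if (end == words[nwords - 1] || *end != '\0' || !(value > 0.0)) return -1;

  /* Join the remaining words with exactly one space between them. */
  name[0] = '\0';
  for (i = 0; i < nwords - 1; i++) {
    len = strlen(words[i]);
    if (used + len + 2 > name_size) return -1;   /* room for a space and '\0' */
    if (i > 0) name[used++] = ' ';
    memcpy(name + used, words[i], len);
    used += len;
    name[used] = '\0';
  }
  *score = value;
  return 1;
}
```

For example, the line "Pat   Summitt  10" would produce the name "Pat Summitt" and the score 10.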

Input

scoreproc takes a number indicating the approximate number of unique names and then one or more score files on the command line. It then reads each score in the score files, and for each name, it computes the average score for that name. In other words, a name can have multiple entries in a score file. For example, the file scfile1 contains the entries:

Phil Fulmer 9
Pat Summitt  10
Cutcliffe 8

and the file scfile2 contains the entries:

Rod Delmonico 7
Pat Summitt 11
Cutcliffe 6

If we call scoreproc with scfile1 and scfile2 the program will keep track of four names:

Phil Fulmer, with an average score of 9.
Pat Summitt, with an average score of 10.5.
Cutcliffe, with an average score of 7.
Rod Delmonico, with an average score of 7.

(Note the scores are really meaningless -- I just assigned random numbers to non-random names....)

Program Actions

scoreproc must first read all the name/score pairs and place them into a hash table. You should use an inputstruct to read each file and then you should use jettison_inputstruct to close each file before reading the next one.

When reading a name your program should first check whether the name is already in the hash table. If it is not then you should create a record for the name and add it to the hash table. Regardless of whether or not you create a record you will then need to update the score information so that you can calculate an average score once all the name/score pairs have been read. You only need to compute an average score when it is requested by the user so your program can keep a running sum of the scores and a running count of the number of scores.
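
The running sum and running count described above might be kept in a record like the one sketched here. The Person struct and the person_new, person_add_score, and person_average functions are hypothetical names chosen for this example; your own record may look different.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one name. */
typedef struct {
  char *name;    /* normalized name; can also serve as the hash key */
  double sum;    /* running sum of the scores seen so far */
  int count;     /* running count of the scores seen so far */
} Person;

/* Creates a record with a private copy of name and no scores yet.
   Returns NULL if memory cannot be allocated. */
Person *person_new(const char *name)
{
  Person *p = malloc(sizeof(Person));

  if (p == NULL) return NULL;
  p->name = malloc(strlen(name) + 1);
  if (p->name == NULL) { free(p); return NULL; }
  strcpy(p->name, name);
  p->sum = 0.0;
  p->count = 0;
  return p;
}

/* Records one more score for this person. */
void person_add_score(Person *p, double score)
{
  p->sum += score;
  p->count++;
}

/* Computes the average only when it is asked for. */
double person_average(const Person *p)
{
  return (p->count == 0) ? 0.0 : p->sum / p->count;
}
```

For each name/score pair, the program would call hash_table_find with the name, create and insert a new record if the result is NULL, and then call person_add_score on the record in either case.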

Once all the name/score pairs have been read your program should ask the user to enter a name. Your program should print the person's name, the average score for that person, and the number of scores for that person. If the name wasn't specified in the score files, then your program should say that the name isn't found. The following example is using separate chaining and a hash table of size 5 (5 is the smallest prime number larger than 4):

UNIX> scoreproc 4 scfile1 scfile2  
Enter a name: Pat Summitt
  Pat Summitt: Avg score: 10.50   #scores: 2
Enter a name: Phil Fulmer
  Phil Fulmer: Avg score: 9.00   #scores: 1
Enter a name: Jim Plank
  Jim Plank is not in the score files
Enter a name:  < CNTL-D >

Once the user terminates their queries by entering CNTL-D your program should call hash_table_print in order to print the hash table. Continuing with the previous example, you would get the following output once the user hits CNTL-D:


table size = 5
0: Empty
1: 	Cutcliffe: Avg score: 7.00   #scores: 2
	Pat Summitt: Avg score: 10.50   #scores: 2
2: 	Phil Fulmer: Avg score: 9.00   #scores: 1
3: Empty
4: 	Rod Delmonico: Avg score: 7.00   #scores: 1
UNIX>

Note that for separate chaining there may be multiple records per table entry whereas for linear probing there should be only one record per entry.
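
To see why a linear probing table holds at most one record per entry, here is a minimal sketch of a probe loop. When the home entry is taken by a different key, the search simply moves to the next entry, wrapping around at the end of the table. The probe_slot, toy_hash, and compare_strings names are made up for this example, the toy hash is only there to force collisions (it is not the hash function you should use), and compare_keys is assumed to return 0 for equal keys.

```c
#include <stddef.h>
#include <string.h>

/* Returns the index where key is stored or, if it is absent, the empty
   index where it would be inserted. Returns -1 if the table is full and
   the key is not in it. A NULL in keys[] marks an empty entry. */
int probe_slot(void **keys, int tablesize, void *key,
               unsigned int (*hash)(void *key, int tablesize),
               int (*compare_keys)(void *key1, void *key2))
{
  int start = (int) hash(key, tablesize);
  int i, index;

  for (i = 0; i < tablesize; i++) {
    index = (start + i) % tablesize;                        /* wrap around */
    if (keys[index] == NULL) return index;                  /* empty entry */
    if (compare_keys(keys[index], key) == 0) return index;  /* key found */
  }
  return -1;
}

/* Toy hash for the demonstration only: uses just the first character. */
unsigned int toy_hash(void *key, int tablesize)
{
  return (unsigned int) ((unsigned char *) key)[0] % tablesize;
}

/* Compares two string keys; 0 means equal. */
int compare_strings(void *key1, void *key2)
{
  return strcmp((char *) key1, (char *) key2);
}
```

On an ASCII system with a table of size 5, the keys "a", "f", "k", and "p" all hash to index 2 under toy_hash, so inserting them in that order would place them at indices 2, 3, 4, and 0.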

Printing Notes

You can get the proper indenting for the entries in each hash table bucket by using the \t character in your printf formatting string. The \t character indicates that printf should skip to the next tab stop and then start printing the next character. For example,
```
printf("\t%s\n", "brad");
```
will indent "brad" by the number of positions defined for the tab stop on your computer.
All scores and averages should be printed to two decimal places.
Place three blank spaces between the average score and the label "#scores:".

Implementation Details

You are going to implement your program in two different ways:

You will compile it with the separate chaining version of the hash table library. To make your printing work properly, you can use either a global variable or a static local variable that keeps track of the last index that was printed. When your print function is called, you can use this variable to determine whether or not the index has already been printed.
You will compile it with the linear probing version of the hash table library. You will not need to modify your program. This shows the advantage of hiding the implementation of the library from your program. As long as the library's interface is unchanged, the library's implementation can be changed without forcing any code to be re-written in the client programs.
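
The static local variable idea for printing can be sketched as follows. This is only an illustration: the Person struct and the format_entry helper are hypothetical names, and the sketch assumes that the library's hash_table_print prints the "table size" line and the "Empty" lines itself, so print_entry only handles entries that hold a record. Splitting the formatting into format_entry is a choice made here so that the text can be inspected before it is printed.

```c
#include <stdio.h>

/* Hypothetical record passed to the table as the value. */
typedef struct {
  double sum;   /* running sum of scores */
  int count;    /* running count of scores */
} Person;

/* Formats one record into buf. The index label is added only the first
   time a given index is seen, which is what the static variable tracks. */
int format_entry(char *buf, size_t size, int entry, void *key, void *value)
{
  static int last_entry = -1;   /* last index that was labeled */
  Person *p = (Person *) value;
  char label[32] = "";

  if (entry != last_entry) {
    snprintf(label, sizeof(label), "%d: ", entry);
    last_entry = entry;
  }
  return snprintf(buf, size, "%s\t%s: Avg score: %.2f   #scores: %d\n",
                  label, (char *) key, p->sum / p->count, p->count);
}

/* Callback with the signature that hash_table_print expects. */
void print_entry(int entry, void *key, void *value)
{
  char buf[512];

  format_entry(buf, sizeof(buf), entry, key, value);
  fputs(buf, stdout);
}
```

Because last_entry keeps its value between calls, a second record in the same bucket gets the tab but no index label. One consequence is that the label logic only behaves as intended if the table is printed once per run, which is all scoreproc needs.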

For the hashing function that you pass to your library, you should use the following function. Dr. Plank ran a number of experiments with different hash functions, some from a former textbook and a couple of his own, and he found that the following hash function produced the fewest collisions.

unsigned int hash(void *key, int tablesize)
{
  unsigned int total;
  int i;
  char *s = (char *)key;
  
  total = 0;
  for (i = 0; s[i] != '\0'; i++) {
    total = ((total << 5) + s[i]) % 10000001;
  }
  return total % tablesize;
}
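
If you want to check a hand computation of this function, a small tracing helper like the sketch below may be useful. The trace_hash function is a made-up name, the hash function is repeated only so that the example compiles on its own, and the key "ab" mentioned afterwards was picked for illustration rather than taken from the score files.

```c
#include <stdio.h>

/* The hash function from this handout, repeated so the example is self-contained. */
unsigned int hash(void *key, int tablesize)
{
  unsigned int total;
  int i;
  char *s = (char *)key;

  total = 0;
  for (i = 0; s[i] != '\0'; i++) {
    total = ((total << 5) + s[i]) % 10000001;
  }
  return total % tablesize;
}

/* Prints the running total after each character and then the final index. */
void trace_hash(char *s, int tablesize)
{
  unsigned int total = 0;
  int i;

  for (i = 0; s[i] != '\0'; i++) {
    total = ((total << 5) + s[i]) % 10000001;   /* total * 32 + character code */
    printf("after '%c': total = %u\n", s[i], total);
  }
  printf("index = %u\n", total % tablesize);
}
```

On an ASCII system, trace_hash("ab", 11) reports a total of 97 after 'a' and 97 * 32 + 98 = 3202 after 'b', and 3202 % 11 gives index 1.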

There you have it. Try the example above. Another great set of input files to try is the set of files in /home/bvz/cs140/labs/lab10/golf/* (these are all the PGA tournament results from 1998, minus the Masters, with scores normalized to a 100-point scale):

UNIX> scoreproc 400 /home/bvz/cs140/labs/lab10/golf/*
Enter a name: Tiger Woods
  Tiger Woods: Avg score: 30.32   #scores: 14
Enter a name: Davis Love III
  Davis Love III: Avg score: 38.20   #scores: 13
Enter a name: Glen Day
  Glen Day: Avg score: 49.37   #scores: 20
Enter a name: Jim Plank
  Jim Plank is not in the score files
Enter a name: < CNTL-D >

table size = 401
0: Empty
1: 	Lee Rinker: Avg score: 62.89   #scores: 25
	Scott Gump: Avg score: 62.21   #scores: 23
2: Empty
3: 	Pete Jordan: Avg score: 86.46   #scores: 4
4: 	Jay Horton: Avg score: 100.00   #scores: 1
...
UNIX>

Error Checking

If the score file cannot be opened, print an appropriate error message and quit. The design document asks you to list other errors that you think you should check for.

Design Document

Your design document should answer the following questions. You must hand it in when the TA asks for your answers. The answers to the design document will be handed out in the Friday lab and will be posted at 4pm Monday on the lab web page.

Compute the value returned by the above hash function for each of the names in scfile1 and scfile2. Assume a tablesize of 11.
Given the number of unique names in scfile1 and scfile2, what should be the size of your hash table array? (Hint: Use the prime number procedure discussed in class.) Note: 11 is not the answer.
Using the names in scfile1 and scfile2, show which names will get assigned to which entries in the hash table using:
1. separate chaining
2. linear probing
You only need to show the names of the people, and not the running score sum or score count.
An example format for your answer might be:
```
0: empty
1: bvz, smiley
2: tang, camel, mouse
```
Show the struct that you will declare to hold the information associated with a person. This struct will be passed as the value parameter to your hash table.
What error checks do you think your program should perform?

What To Submit

You should submit your design document when the TA asks for it during the lab. You will submit the following files to the TAs via the submit script:

scoreproc.c
hashtable.h
separate_chaining.c
linear_probing.c

Submit separate_chaining.c and hashtable.h as lab10a. For the final submission, submit all the files as lab10. If one of your two hash table files is not working at that point, please let the TAs know which one is working so they can use that one to test your scoreproc.c file.