CS140 -- Lab 9
This is a lab designed to give you practice with hashing. You will implement a
hashing library in two different ways, one using separate chaining and one
using quadratic probing. You will then implement a program that reads score
files and computes the average score for each individual. A user will be able
to query names and your program will print the individual's average score and
number of scores. This program will use your hash libraries. You will then
do a second implementation using the Unix provided hash table library.
Lab Materials
- Executables for the test files are in the directory
/home/bvz/cs140/labs/lab9. As usual, if you have
questions about how these programs should work, try these. I will not
post the executables until Monday at 4pm so that you have to answer the design
questions without the benefit of the executables.
- I have not provided test files for your hash table library because you
need to learn how to write simple test files on your own. You will be
made to write a test file as part of your design document.
- There are two simple example score files for scoreproc in
scfile1 and scfile2.
- Scorefiles for the 1998 PGA golf season are in
/home/bvz/cs140/labs/lab9/golf/*.
- You may find that you want to use the sqrt function from the
C math library. To use it you will need to include <math.h>
in your file. The sqrt function takes a double as an argument and
returns a double. You may want to use this function when computing
a prime number.
- In order to link the C math library to your executable you will need to
place the -lm flag at the end of your compilation command.
When linking in C libraries it is important to place them
last in your list of files because the linker only loads those functions
that it needs, and it determines what it needs from the files it has
already seen. Thus if you place the -lm flag before your
scoreproc.o file, the linker will not yet have seen any call to sqrt
when it processes the -lm flag and it will not link in the sqrt
function. You will then get a linker error that claims that sqrt
cannot be found and you will tear out your hair because you "know"
that you included the math library. By the way, the -l indicates that a
library name follows and the "m" indicates the math library.
- I have added a new file to the objs directory called
string_conversion.o and a new file to the include directory
called string_conversion.h. If you want to see the source code, you can find it in
/home/bvz/cs140/src/string_conversion.c.
This new file provides
four functions that convert strings to integers, longs, floats, and
doubles respectively. They are more reliable than sscanf because
1) they will not treat any variation of "nan" as meaning infinity, and
2) they will not treat a word that starts with numbers but has trailing
letters, like 123xy, as a number. As an example of the difference
between sscanf and these functions, sscanf will treat the string
"Nancy" as if it is a number representing infinity, whereas these
functions will treat it as a string that is not a number. The functions
take two arguments, a character string and an address where they will
place the converted results. They return a boolean denoting either
success, true, or failure, false.
I would prefer that you use these
functions instead of sscanf, atoi, or atof.
Hash Table Library
In this part of the lab you are going to create a hash table library that
supports the following interface. You must adhere precisely to this
interface in creating the hash table library:
- void *hash_table_create(int data_size, unsigned int (*hash)(void *key, int tablesize), bool (*compare_keys)(void *key1, void *key2)): Creates a
record for a hash table, initializes the hash table, and returns a
pointer to the record as a void *.
data_size is an estimate of the number of entries the hash table
must ultimately hold, hash is a pointer to a hash function, and
compare_keys is a pointer to a function that compares two keys for equality.
Your program should use data_size to calculate a size for your
hash table. If you are using separate chaining, then your hash table
size should be the first prime number greater than data_size. If
you are using quadratic probing, your table size should be one less than
the first power of 2 greater than data_size. For example, if your
data_size is in the range 16-31, your table size would be 31,
which is 2^5 - 1. Your program will use the hash
function to calculate the index at which a (key,value) pair should be
inserted in the array representing the hash table.
- bool hash_table_insert(void *key, void *value, void *hash_table):
inserts the (key, value) pair into the hash table. Returns true
if the insertion succeeded and false if the insertion fails.
The insertion should fail only if the key is already in the hash table.
- void *hash_table_find(void *key, void *hash_table): locates the indicated
key and returns either a pointer to the value associated with that key,
or NULL if the key is not found in the hash table.
- void hash_table_print(void *hash_table, void (*print_entry)(int entry, void *key, void *value)): prints the number of entries in the hash table
and then calls print_entry with each (key,value) pair in the
table. print_entry should print
the key and value associated with the given entry.
entry is the index of the table slot that holds the pair. If separate chaining
is used, then your print_entry function may be called multiple
times with the same entry, but different values for the
key and value.
You should create a file named hash_table.h to declare these
functions.
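As an illustration of the callbacks that hash_table_create expects, here is a minimal sketch of a key-comparison function for keys that are C strings. The name compare_strings is hypothetical, and the sketch assumes that compare_keys should return true when the two keys are equal, which the interface above implies but does not state.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical compare_keys callback for keys that are C strings.
   Assumption: the library expects true when the two keys are equal. */
bool compare_strings(void *key1, void *key2)
{
  return strcmp((char *)key1, (char *)key2) == 0;
}
```

A client would pass this function as the third argument of hash_table_create, alongside a hash function with the signature given in the interface.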
You should implement the hash table library twice, once using separate chaining
and once using quadratic probing with rehashing. You should name the two
files separate_chaining.c and quadratic_probing.c. Your
quadratic probing implementation should rehash whenever the table becomes more
than half full. The new table size should be twice the current table size plus
1. For example, if your previous table size was 31, your new table size should
be 63. Notice that this formula ensures that the table size is always 1 less
than a power of 2.
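The sizing rules above can be sketched as follows. The function names are hypothetical, and the sketch assumes data_size is at least 1 and small enough that the arithmetic does not overflow an int. The primality test bounds its trial division with i <= n / i instead of calling sqrt, so this particular sketch does not need the -lm flag.

```c
#include <stdbool.h>

/* Trial division. The bound i <= n / i plays the role of i <= sqrt(n). */
static bool is_prime(int n)
{
  int i;

  if (n < 2) return false;
  for (i = 2; i <= n / i; i++) {
    if (n % i == 0) return false;
  }
  return true;
}

/* Separate chaining: the first prime strictly greater than data_size. */
int chaining_table_size(int data_size)
{
  int n = data_size + 1;

  while (!is_prime(n)) n++;
  return n;
}

/* Quadratic probing: one less than the first power of 2 that is
   strictly greater than data_size. */
int probing_table_size(int data_size)
{
  int power = 1;

  while (power <= data_size) power *= 2;
  return power - 1;
}

/* Rehashing: 2 * size + 1 keeps the size one less than a power of 2. */
int rehash_table_size(int current_size)
{
  return 2 * current_size + 1;
}
```

For example, a data_size of 400 gives a separate chaining table of size 401, and a quadratic probing table of size 31 grows to 63 when it is rehashed.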
Scoreproc
You will now use your hash table library to implement a
program, named scoreproc.c, that processes score files. A score file
is a file where each line is either blank (in which case it should
be ignored) or has a name and a score on it. The name can
be multiple words with any amount of white space between them. You should
convert all names to strings with just one space between the words.
The last word on each line is a non-negative score, which is a floating
point number (as always, use a double and error check if the
last word is not a number).
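A minimal sketch of splitting one line into a normalized name and a score is shown below. The helper name parse_score_line is hypothetical. Because this handout does not list the function names in string_conversion.h, the sketch uses the standard strtod with an end-pointer check as a stand-in; the end-pointer check is what rejects words such as 123xy or Nancy.

```c
#include <float.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper. Splits line into a name (words joined by single
   spaces) and a score (the last word). Returns false for blank lines,
   lines with no name, and lines whose last word is not a non-negative
   finite number. strtok modifies line in place, and name must be at
   least as large as line. */
bool parse_score_line(char *line, char *name, double *score)
{
  char *word, *last = NULL, *end;
  double value;

  name[0] = '\0';
  for (word = strtok(line, " \t\r\n"); word != NULL;
       word = strtok(NULL, " \t\r\n")) {
    if (last != NULL) {            /* the previous word belongs to the name */
      if (name[0] != '\0') strcat(name, " ");
      strcat(name, last);
    }
    last = word;
  }
  if (last == NULL || name[0] == '\0') return false;

  value = strtod(last, &end);
  if (end == last || *end != '\0') return false;   /* trailing junk */
  if (!(value >= 0.0) || value > DBL_MAX) return false; /* NaN, negative, infinite */
  *score = value;
  return true;
}
```

In your own program the strtod call would be replaced by the appropriate function from string_conversion.h.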
Input
scoreproc takes a number indicating the approximate number
of unique names and then a list of score files on the command line.
It then reads each score in every file, and for each name, it
computes the average score for that name. In other words, a name
can have multiple entries in a score file, and different score
files can have different scores with the same name.
For example, the files
scfile1
and
scfile2
are two simple score files:
UNIX> cat scfile1
Phil Fulmer 9
Pat Summitt 10
Cutcliffe 8
UNIX> cat scfile2
Rod Delmonico 7
Pat Summitt 11
Cutcliffe 6
If we call scoreproc with both files as command line arguments,
the program will keep track of four names:
- Phil Fulmer, with an average score of 9.
- Pat Summitt, with an average score of 10.5.
- Cutcliffe, with an average score of 7.
- Rod Delmonico, with an average score of 7.
(Note those files are really meaningless -- I just assigned random
numbers to non-random names....)
Program Actions
scoreproc must first read all the name/score pairs and place
them into a hash table. You should use an inputstruct to read each file
and then you should use jettison_inputstruct to close each file
before reading the next one.
When reading a name your program should first
check whether the name is already in the hash table. If it is not then
you should create a record for the name and add it to the hash table.
Regardless of whether or not you create a record you will then need to
update the score information so that you can calculate an average
score once all the name/score pairs have been read. You only need to compute
an average score when it is requested by the user so your program can
keep a running sum of the scores and a running count of the number of scores.
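The running sum and running count can be sketched as below. The type and function names are hypothetical, and your own record may well hold more than this; the point is only that the average is derived on demand rather than stored.

```c
/* Hypothetical record: just enough to derive an average later. */
typedef struct {
  double sum;   /* running sum of the scores seen so far */
  int count;    /* running count of the scores seen so far */
} ScoreRecord;

/* Called once per name/score pair, whether or not the record is new. */
void record_add_score(ScoreRecord *rec, double score)
{
  rec->sum += score;
  rec->count++;
}

/* Called only when the user asks for the name. */
double record_average(const ScoreRecord *rec)
{
  return (rec->count == 0) ? 0.0 : rec->sum / rec->count;
}
```

With the two example files, Pat Summitt's record ends with a sum of 21 and a count of 2, which gives the average of 10.5 listed earlier.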
Once all the name/score pairs have been read your program should
ask the user to enter a name. Your program should print the number of
scores plus the average score for that name. If the name wasn't
specified in the score files, then your program should say that the name isn't
found:
UNIX> scoreproc 4 scfile1 scfile2
Enter a name: Pat Summitt
Pat Summitt: Avg score: 10.50 #scores: 2
Enter a name: Phil Fulmer
Phil Fulmer: Avg score: 9.00 #scores: 1
Enter a name: Jim Plank
Jim Plank is not in the score files
Enter a name: < CNTL-D >
Once the user terminates their queries by entering CNTL-D your program
should call hash_table_print in order to print the hash
table. Continuing with the previous example, you would get
the following output once the user hits CNTL-D:
table size = 5
0: Empty
1: Cutcliffe: Avg score: 7.00 #scores: 2
	Pat Summitt: Avg score: 10.50 #scores: 2
2: Phil Fulmer: Avg score: 9.00 #scores: 1
3: Empty
4: Rod Delmonico: Avg score: 7.00 #scores: 1
UNIX>
Note that for separate chaining there may be multiple records per table
entry whereas for quadratic probing there should be only one record per
entry.
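The reason quadratic probing holds one record per entry is that a colliding record moves on to another slot. This handout does not define the probe sequence, so the sketch below assumes the common textbook form, in which the i-th probe lands i*i slots past the home index, wrapping around the table. The function name probe_index is hypothetical; check the book for the exact sequence the course expects.

```c
/* Assumed textbook sequence: home, home + 1, home + 4, home + 9, ...
   all taken modulo the table size. Very large i would overflow i * i;
   the sketch ignores that case. */
unsigned int probe_index(unsigned int home, unsigned int i,
                         unsigned int tablesize)
{
  return (home + i * i) % tablesize;
}
```

An insert would call this with i = 0, 1, 2, ... until it reaches an empty slot, and a find would follow the same sequence until it reaches the key or an empty slot.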
Printing Notes
- All scores and averages should be printed to two decimal
places.
- You can get the proper indenting for the entries in each hash table
bucket by using the \t character in your printf formatting
string. The \t character indicates that printf should skip to the
next tab stop and then start printing the next character. For example,
printf("\t%s\n", "brad");
will indent "brad" by the number of positions defined for the
tab stop on your computer.
- Place three blank spaces between the average score and the character
string "#scores".
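The printing rules above can be sketched with a single format string. The helper name format_record is hypothetical; it writes into a buffer rather than printing so that the result is easy to check.

```c
#include <stdio.h>

/* Hypothetical helper. Formats one record with the average printed to
   two decimal places and three spaces before "#scores". Returns the
   value that snprintf returns. */
int format_record(char *buf, size_t size, const char *name,
                  double sum, int count)
{
  double avg = (count > 0) ? sum / count : 0.0;

  return snprintf(buf, size, "%s: Avg score: %.2f   #scores: %d",
                  name, avg, count);
}
```

A record that is not the first one in its bucket could then be printed with printf("\t%s\n", buf) to get the indentation described above.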
Implementation Details
You are going to implement your program in two different ways. First you
are going to use the hash table library you develop and then you are going
to use the UNIX provided hash table utility. You will ultimately test your
program in three different ways:
- You will compile it with the separate chaining version of the hash
table library. To make your printing work properly, you can use either
a global variable or a static local variable that keeps track of the
last index that was printed. When your print function is called, you can
use this variable to determine whether or not the index has already
been printed.
- You will compile it with the quadratic probing version of the hash
table library. You will not need to modify your program. This shows
the advantage of hiding the implementation of the library from your
program. As long as the library's interface is unchanged, the library's
implementation can be changed without forcing any code to be re-written
in the client programs.
- You will modify your program to use the UNIX hcreate
and hsearch functions to handle your hash table. The modified
program should be placed in scoreproc_unix.c. You will not be
able to print out the hash table size or its entries at the end of the
program so simply quit after the user presses Ctrl-D.
Use the UNIX man command to look at the documentation for hcreate
and hsearch. Simply type "man hcreate" or
"man hsearch". You will find an example program that uses
hcreate and hsearch at the end of the man page.
You can also find an example program that counts the frequency of words
in a file here.
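As a rough sketch of how hcreate and hsearch fit this kind of program, the fragment below keeps a running tally per key. hcreate, hsearch, hdestroy, ENTRY, ENTER and FIND come from the standard <search.h>; the Tally type and the tally_add and tally_find helpers are hypothetical names used only for illustration. Note that hsearch stores the key and data pointers themselves, not copies, so both must remain valid for as long as the table is in use.

```c
#include <search.h>
#include <stddef.h>

/* Hypothetical payload stored in the data field of each ENTRY. */
typedef struct {
  double sum;
  int count;
} Tally;

/* Adds score under key. fresh is used only if key is not yet present.
   Returns 1 on success and 0 if the table is full. */
int tally_add(char *key, Tally *fresh, double score)
{
  ENTRY item, *found;
  Tally *t;

  item.key = key;
  item.data = fresh;
  found = hsearch(item, ENTER);  /* returns the existing entry if key is present */
  if (found == NULL) return 0;
  t = (Tally *)found->data;
  t->sum += score;
  t->count++;
  return 1;
}

/* Returns the tally for key, or NULL if key is not in the table. */
Tally *tally_find(char *key)
{
  ENTRY item, *found;

  item.key = key;
  item.data = NULL;
  found = hsearch(item, FIND);
  return (found == NULL) ? NULL : (Tally *)found->data;
}
```

The caller is expected to call hcreate with the approximate number of names before the first tally_add, and may call hdestroy when it is finished with the table.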
For the hashing function that you pass to your library,
you should use the function shown below rather
than the one described in figure 5.5 on page 152 in the book. The hashing
function shown below is a modification of the algorithm in figure 5.5 which
mods each intermediate sum by a very large number, in this case 10000001.
When he did a number of experiments with different hash functions, Dr. Plank
found that the hash function shown below produces far fewer collisions than
the hash function used in the book. The reason he provides is that
the book's algorithm "shifts early characters off the left end of the word,
and thus, we lose their information." In contrast, the mod operation
that is performed on the intermediate sums preserves some of the information
provided by the early characters in the word and hence provides a more
uniform distribution.
unsigned int hash(void *key, int tablesize)
{
  unsigned int total;
  int i;
  char *s = (char *)key;

  total = 0;
  for (i = 0; s[i] != '\0'; i++) {
    total = ((total << 5) + s[i]) % 10000001;
  }
  return total % tablesize;
}
There you have it. Try the example above. Also, a great set of input
files to try are the files in
/home/bvz/cs140/labs/lab9/golf/* (these are all the
PGA tournament results from 1998, minus the Masters, with scores
normalized over a 100-point scale):
UNIX> scoreproc 400 /home/bvz/cs140/labs/lab9/golf/*
Enter a name: Tiger Woods
Tiger Woods: Avg score: 30.32 #scores: 14
Enter a name: Davis Love III
Davis Love III: Avg score: 38.20 #scores: 13
Enter a name: Glen Day
Glen Day: Avg score: 49.37 #scores: 20
Enter a name: Jim Plank
Jim Plank is not in the score files
Enter a name: < CNTL-D >
table size = 401
0: Empty
1: Lee Rinker: Avg score: 62.89 #scores: 25
	Scott Gump: Avg score: 62.21 #scores: 23
2: Empty
3: Pete Jordan: Avg score: 86.46 #scores: 4
4: Jay Horton: Avg score: 100.00 #scores: 1
...
UNIX>
Error Checking
If any of the score files cannot be opened, print an appropriate error message
and quit. The design document asks you to list other errors that you think you
should check for.
Design Document
Your design document should answer the following questions. You must hand it
in when the TA asks for your answers. The answers to the design document will
be handed out in the Friday lab and will be posted at 4pm Monday on the lab
web page.
- Compute the value returned by the above hash function for each of
the names in scfile1 and scfile2. Assume a tablesize
of 11.
- Given the names in scfile1 and scfile2, what will be
the size of your hash table using:
- separate chaining
- quadratic probing
Note: 11 is not the answer to either of the above questions.
- Using the names in scfile1 and scfile2, show which
names will get assigned to which entries in the hash table using:
- separate chaining
- quadratic probing
You only need to show the names of the people, and not the running
score sum or score count.
An example format for your answer might be:
0: empty
1: bvz, smiley
2: tang, camel, mouse
- Show the struct that you will declare to hold the information associated
with a person. This struct will be passed as the value parameter to your
hash table.
- What error checks do you think your program should perform?
- Write a test program called hash_test.c that performs the
following actions:
- inserts character strings read from the command line into your
hash table.
- performs a find on each string in argv and prints out a string if
it is not found. The only string not found will be argv[0], which
is the name of your executable.
- prints the hash table size and contents using hash_table_print.
Print this file out and hand it in with the rest of your design
document.
You should be able to modify my bst_test_option2.c from lab7
instead of starting from scratch.
What To Submit
You should submit your design document when the TA asks for it during the lab.
You will submit the following files to the TAs via the submit script:
- scoreproc.c
- scoreproc_unix.c
- hash_table.h
- separate_chaining.c
- quadratic_probing.c