This is a Dr. Plank lab that is very similar to the lab Dr. Gregor uses to test hash tables, except you will rely on more crypto-inspired hash functions (which will be good!) versus more generic (less good..) functions in the challenge you will complete this Thursday.
The main advantages of doing this as the "formal" lab are the TAs are all very familiar with it, which will help all of us, all of them enjoyed doing it (which likely bodes well for you), and there are pre-existing grade scripts and a "tester" program that I think is more intuitive than alternatives.
When I did this for the old CS140, I ended up making and posting a video walkthrough on my attempt here. The main punchline is as usual -- start early! -- but you may find this useful before you start in earnest. Also note that you will be doing this lab earlier than prior 140 cohorts, since you covered more basic C/C++ programming in CS101/102, but not as early as last fall's 202 (see below):
Original version is here: /home/jplank/cs202/Labs/Lab4
The openssl command generates hashes from files using either
MD5 or SHA-1. For example:
UNIX> cat files/input-1.txt
Rosalita, jump a little lighter. Senorita come sit by my fire.
UNIX> cat files/input-2.txt
Sosalita, jump a little lighter. Senorita come sit by my fire.
UNIX> openssl md5 files/input-1.txt files/input-2.txt
MD5(files/input-1.txt)= b9937df3fefbe66d8fcdda363730bf14
MD5(files/input-2.txt)= 3a752ef1b9bfd1db6ba6a701b6772065
UNIX> openssl sha1 files/input-1.txt files/input-2.txt
SHA1(files/input-1.txt)= 9a2c3d93445fa844094eb213a17fc5996895c925
SHA1(files/input-2.txt)= 8886b6ef4671093b15c2dba387c3eb169e9db5d2
UNIX>
The representation of the hashes is a long stream of hexadecimal digits. You can read each hex digit as four bits. Thus the first eight bits of b9937df3fefbe66d8fcdda363730bf14 are 10111001 (b9). The MD5 hexadecimal string is 32 characters long -- hence the hash is 128 bits.
The ASCII value of 'R' is 82 (binary 1010010) and the value of 'S' is 83 (binary 1010011).
That means that files/input-1.txt and files/input-2.txt differ by exactly one bit.
However, look how different the hashes of both values are. That's a great thing about
both functions. (Look up avalanche
effect if you want to learn more about that property).
Hashes like MD5 and SHA-1 are often used to represent large files. For example,
suppose we have an audio file in our library that is 4 MB long, and has an MD5 hash of
6282f7860277242c80b69d3a20e0b40e. Suppose Dr. Plank wants to make sure the digital download
is not corrupted. We can calculate the hash
of the digital song, and if it doesn't equal 6282f7860277242c80b69d3a20e0b40e, then
we know it's corrupted. If it does equal 6282f7860277242c80b69d3a20e0b40e, we can assume
that it is not corrupted: the chance that a randomly corrupted file produces the same hash is roughly 1/2^128.
To facilitate testing, you were previously asked to clone the course GitHub repository as follows:
git clone https://github.com/semrich/CS202-22.git cs202
For this assignment, update this clone by using the following:
git pull
We'll discuss this in class, but note that we'll be separating each type of file (including the executables) into its own subdirectory for this assignment, similar to how the prior iteration of CS140 completed this lab assignment.
Your job is to implement the class Hash_202. This implements a hash table, where the
keys are strings of hexadecimal digits, and the vals are arbitrary, non-empty strings.
Most of the lab description
is in the header file,
include/hash_202.hpp:
#pragma once
#include <vector>
#include <string>

class Hash_202 {
  public:

    /* There is no constructor or destructor here.  This allows you to simply declare
       these as variables, and not worry about memory allocation. */

    /* You must call Set_Up() to set up the table before using it.  This method
       takes the hash table size, a name of a hash function (either "Last7" or "XOR"),
       and the name of a collision resolution strategy (either "Linear" or "Double").

       You should error check your input, and look for the following errors, in the
       following order.  If you see one of these errors, return the string specified:

       - Table already set up:                "Hash table already set up"
       - Bad table size:                      "Bad table size"
       - Bad hash function:                   "Bad hash function"
       - Bad collision resolution strategy:   "Bad collision resolution strategy" */

    std::string Set_Up(size_t table_size, const std::string &fxn, const std::string &collision);

    /* Add() adds the given key/val pair to the hash table.  If successful, it should
       return an empty string.  Add should test for the following errors, in this order,
       and return the given strings:

       - Table not set up:                    "Hash table not set up"
       - Empty string for the key:            "Empty key"
       - Key not composed of hex digits:      "Bad key (not all hex digits)"
       - Empty string for the val:            "Empty val"
       - The hash table is full:              "Hash table full"
       - Cannot insert key:                   "Cannot insert key"
       - The key is already in the table:     "Key already in the table" */

    std::string Add(const std::string &key, const std::string &val);

    /* Find() returns the val associated with the given key.  If the hash table has not
       been set up yet, or if the key is not in the hash table, or if the key is not
       composed of all hex digits, it should return an empty string.

       Find() is not const, because it sets the variable Nprobes to equal the number of
       probes that it took to find the answer. */

    std::string Find(const std::string &key);

    /* Print() prints all non-empty slots in the hash table, one per line.
       The index should be printed first, right justified and padded to five characters.
       Then a space, the key, a space, and the val.  This should do nothing if the hash
       table has not been set up yet. */

    void Print() const;

    /* Total_Probes() should traverse the hash table, and for every key, calculate how
       many probes it takes to find that key.  It should return the total number of
       probes.  It should return 0 if the hash table has not been set up yet.  It is not
       const, because it uses Find() to find the number of probes for each key. */

    size_t Total_Probes();

  /* Obviously, you can use these however you want, but here is how Dr. Plank used them.
     You may not modify or add to this header file, so you need to be able to implement
     the lab with just these member variables. */

  protected:
    std::vector <std::string> Keys;   // The hash table of keys.
    std::vector <std::string> Vals;   // Vals[i] is the val associated with Keys[i]
    size_t Nkeys;                     // This is the number of keys that have been inserted.
    int Fxn;                          // The hash function.  Dr. Plank has 'X' = "XOR" and 'L' = "Last7".
    int Coll;                         // The collision resolution strategy.  Dr. Plank has 'L' and 'D'
    size_t Nprobes;                   // When Dr. Plank calls Find(), he sets this value to the number of probes.
};
This defines a class called Hash_202. It will be used to store keys and values.
Both are strings. The keys are strings of hexadecimal digits, just like the
MD5 and SHA-1 hashes. They may be of arbitrary size, and must be composed solely of
the characters 0-9, a-f or A-F. The values are strings.
There is no constructor, so when you create an instance of the class, it will contain an empty hash table. To create the hash table, you must call Set_Up(), which defines the size of the hash table, plus its hash function and collision resolution strategy. We will support two hash functions, "Last7" and "XOR", which are defined below. The hash table will use open addressing with one of two collision resolution strategies: "Linear" for linear probing and "Double" for double hashing. More on those later too.
In the protected data of the class, you represent the hash table with two arrays -- one of keys and one of values. They'll both be the same size, and the element in Vals[i] will correspond to the key in Keys[i]. You represent an empty slot in Keys and Vals with an empty string.
Add() adds the given key/val pair to the hash table with the proper hash function and collision resolution strategy. Please see the header file for information on error handling.
Find() looks for the given key in the hash table and returns its associated value. Please see the header file for information on error handling.
Print() prints the hash table -- see the header file for the format.
Total_Probes() traverses the hash table, and for every key, calculates how many probes it takes to find the key. It returns the total number of probes. I implemented this by calling Find() on every key, and having Find() set the Nprobes variable. This is the reason why both Find() and Total_Probes() are not const methods.
Dr. Plank provides a testing program for this lab in src/hash_tester.cpp. You call it with three arguments (or four if you want a prompt):
UNIX> bin/hash_tester
usage: hash_tester table-size fxn(Last7|XOR) collision(Linear|Double) [prompt]
UNIX>
Let's call it with an 8-element hash table, "Last7" and "Linear", and then see the commands:
UNIX> echo '?' | bin/hash_tester 8 Last7 Linear
A key val      Add the given key/val pair to the hash table.
F key          Find the given key in the hash table.
P              Print the hash table.
TP             Print the total probes for finding all keys.
Q              Quit.
?              Print comands.
UNIX>
To be explicit: Your job is to write src/hash_202.cpp, which implements the methods of the Hash_202 class. The only file that you submit is src/hash_202.cpp. You are not allowed to modify include/hash_202.hpp or src/hash_tester.cpp and shouldn't need to for this assignment.
To read an integer in hexadecimal from standard input, use the hex stream manipulator:
cin >> hex >> i;
Extending on this, src/read_as_hex.cpp reads
strings from standard input and assumes that they are hexadecimal representations
of integers. It prints out the decimal and hexadecimal values of the integers:
/* This program shows how you read a value from a string in hexadecimal
   using a stringstream.  The printf() statement prints the value in
   decimal and then in hexadecimal. */

#include <string>
#include <cstdio>
#include <iostream>
#include <sstream>
using namespace std;

int main()
{
  string s;
  istringstream ss;
  int i;

  while (cin >> s) {
    ss.clear();
    ss.str(s);
    if (ss >> hex >> i) printf("%d 0x%x\n", i, i);
  }
  return 0;
}
Giving it "a", "10" and "20" as input prints out the proper values:
UNIX> echo a 10 20 | bin/read_as_hex
10 0xa
16 0x10
32 0x20
UNIX>
Why does Dr. Plank include this? Well.......
The first hash function is called Last7. It treats the last seven digits of the
hash string as a number in hexadecimal. Thus, the hash of
"b9937df3fefbe66d8fcdda363730bf14" will be 120635156, which is equal to 0x730bf14.
If the string has fewer than 7 characters, then simply treat the whole string
as a hexadecimal number. Thus, the hash of "11" will be 17, which is equal to 0x11.
The second hash function is called XOR. With this hash function, you break up the string into seven-character words, and then treat each word as a number in hexadecimal. The last word can be fewer than seven characters. You then calculate the bitwise exclusive-or of each of these numbers. That is your hash function. So, let's take an easy example: "a000000a0000101". First, we break this up into seven-digit words (except for the last one): "a000000", "a000010" and "1". Their bitwise exclusive-or is equal to 0x11, so the XOR hash of "a000000a0000101" is equal to 17.
UNIX> bin/hash_tester 10 Last7 Linear 'Hash> '
Hash> A a Fred
Hash> P
    0 a Fred
Hash> Q
UNIX>
Since 0xa = 10, the hash value is 10, and 10 mod 10 = 0, so the key/val pair goes to index 0.
Let's try some of those example keys above:
UNIX> bin/hash_tester 10 Last7 Linear 'Hash> '
Hash> A b9937df3fefbe66d8fcdda363730bf14 Fred          # This hashes to 0x730bf14 = 120635156,
Hash> P                                                # So it goes into index 6.
    6 b9937df3fefbe66d8fcdda363730bf14 Fred
Hash> A 11 Luther                                      # This hashes to 0x11 = 17,
Hash> P                                                # So it goes into index 7.
    6 b9937df3fefbe66d8fcdda363730bf14 Fred
    7 11 Luther
Hash> F b9937df3fefbe66d8fcdda363730bf14
Found: Fred
Hash> F Fred                                           # Fred is a value, not a key, so we can't find it.
Not found.
Hash> Q
UNIX> bin/hash_tester 10 XOR Linear 'Hash> '
Hash> A a000000a0000101 Fred                           # Recall from above that this hashes to 0x11 = 17.
Hash> P
    7 a000000a0000101 Fred
Hash> Q
UNIX>
This example shows linear probing in action because "001", "1" and "a000000b" all hash to values that equal 1 mod 10:
UNIX> bin/hash_tester 10 XOR Linear 'Hash> '
Hash> A 001 Fred
Hash> A 1 Binky
Hash> A a000000b Baby Daisy
Hash> P
    1 001 Fred
    2 1 Binky
    3 a000000b Baby Daisy
Hash> TP
3
Hash> Q
UNIX>
The total number of probes is three, because finding "001" requires zero probes, "1" requires one and "a000000b" requires two.
Let's try a harder example with double hashing:
UNIX> bin/hash_tester 10 Last7 Double 'Hash> '
Hash> A 2 Fred
Hash> A 00000100000002 Binky
Hash> A c Luther
Hash> P
    0 00000100000002 Binky
    2 2 Fred
    4 c Luther
Hash> TP
2
Hash> Q
UNIX>
Each of these keys hashes to two mod 10 under Last7. First, "2" goes into index two. Then "00000100000002" collides. With double hashing, the other hash function (here XOR) supplies the increment: its XOR hash is 0x12 = 18, and 18 mod 10 gives an increment of 8. Thus, the first probe tries 2+8 = 10, which is 0 mod 10. Since that slot is empty, "00000100000002" goes into index 0. Finally, when "c" collides, its XOR hash is 0xc = 12, which gives an increment of 2. Therefore, it goes into index 2+2 = 4.
When we call TP, it returns 2, because "2" has zero probes, and "00000100000002" and "c" have one each.
Two final examples -- the first shows an example where the second hash produces an increment of zero. In that case, you set it to one:
UNIX> bin/hash_tester 10 Last7 Double 'Hash> '
Hash> A 0 Fred
Hash> A a Binky
Hash> P
    0 0 Fred
    1 a Binky
Hash> TP
1
Hash> Q
UNIX>
When we insert "a", both hash functions return 10, which equals 0 mod 10. An increment of zero would never leave index 0, so we set the increment to 1, and the key goes into index 1.
The last example shows three keys that hash to 5 with both functions:
UNIX> bin/hash_tester 10 Last7 Double 'Hash> '
Hash> A 5 Fred
Hash> A f Binky
Hash> P
    0 f Binky
    5 5 Fred
Hash> A 19 Luther
A 19 Luther
Cannot insert key
Hash> Q
UNIX>
When we try to put the third key into the table, it can't go there, because the probe sequence keeps alternating between indices five and zero, both of which are occupied. Your code needs to work in this case -- it needs to identify that the key cannot go into the table.
+3  hash_202.cpp is well formatted, commented (inc. name, assignment, and overview), with reasonable variable names
+35 Passes Dr. Plank's grade scripts (0.35 points per script, 1..100)