CS360 -- Lab 3


Variable length encoding

Your job this week is to write a decoding program called huff_dec.c. The program takes two command line arguments -- a code defintion file, and an input file. It then uses the code defined in the code definition file to decode the input file. Its output is on standard output.

The code associates strings with sequences of bits. For example, one code associates the following strings with the following sequences of bits:

String Sequence of bits
" " 10
"\n" 1100
"o" 00
"p" 010
"f" 1101
"t" 111
"the" 011

You'll note that no sequence of bits is a prefix of another sequence of bits. This means that you can decode a stream of bits into unique sequences. For example:

111001001110111000101100

maps to this stream of unique sequence of bits:

111 00 10 011 10 111 00 010 1100 = "to the top\n"

The code definition file defines the associations. Its format is a stream of null-terminated strings that come in pairs: the string and the sequence of bits defined as a stream of zero and one characters. Thus, for example, the above code is stored in t-code.txt, which is composed of the following bytes:

111 'o' 0 48 '0' 48 '0' 0 112 'p' 0 48 '0'
49 '1' 48 '0' 0 116 't' 104 'h' 101 'e' 0 48 '0'
49 '1' 49 '1' 0 32 space 0 49 '1' 48 '0' 0
10 eol 0 49 '1' 49 '1' 48 '0' 48 '0' 0 102 'f'
0 49 '1' 49 '1' 48 '0' 49 '1' 0 116 't' 0
49 '1' 49 '1' 49 '1' 0
The first string is "o", and it is associated with "00". Next is "p", which is associated with "010", and so on.

The input file is in a specific format. The last four bytes represent an unsigned integer, which says how many bits should be read from the input file. The preceding bytes contain the stream of bits. If the total number of bits is t then the size of the file should be:

ceil(t/8) + 4 bytes.

Within each byte, the stream of bits starts with the least significant bit. Thus, the stream "10000000" would be represented by the byte 0x1, and the stream "00000001" would be represented by the byte 0x80.

Take for example, the following file t-encoded.txt:

0x27 0x77 0x34 0x18 0x00 0x00 0x00

The last four bytes contain the integer 0x18, which equals 24. Thus, the bit stream contains 24 bits, which are stored in three bytes: 0x27, 0x77 and 0x34. Converting them to a bit stream:

Yields the stream "111001001110111000101100" = "111","00","10","011","10","111","00","010","1100" = "to the top\n".

Let's try another example, in t-encoded-2.txt:

0x47 0x0c 0x0e 0x00 0x00 0x00

The last four bytes contain 0xe = 14. Thus, the first two bytes contain a string of 14 bits: 0x47 = "11100010" and 0x0c = "00110000". The string is "11100010001100" = "111","00","010","00","1100" = "topo\n".

Armed with this knowledge, you are now equipped to write huf_dec, which should decode the input file using the code defined in the code definition file. Use the C standard I/O library to read both the code definition file and the input file.

You may assume that the code definition file defines a code in which no sequence of bits is a prefix of another. You may also assume that no string or sequence of bits is longer than 10000 characters.

You must test for the following errors:

In each case, make sure your program outputs what mine does.

You may assume that the output of huff_dec is composed solely of printable characters.

And BTW, one way to solve this is to use a Huffman Coding tree -- check it out on Wikipedia. That's not how I solved it, but it is a nice solution, and a good technique to learn.


Fun helper programs

The program huff_create_code reads standard input and creates a decent Huffman code from it on standard output. If you're bored someday, ask me how it works.

The program huff_enc takes two command line arguments -- a code defintion file and the name of an output file. It then encodes standard input to the output file using the code.

For example:

UNIX> cat celebrate.txt
I just want to celebrate another day of living
I just want to celebrate another day of life
Put my faith in the people but the people let me down
So I turned the other way and I carried on anyhow
I just want to celebrate another day of living
I just want to celebrate another day of life
UNIX> huff_create_code < celebrate.txt > celebrate-code.txt
UNIX> wc celebrate-code.txt
       1       3     275 celebrate-code.txt
UNIX> huff_enc < celebrate.txt celebrate-code.txt celebrate-encoded.txt
UNIX> huff_dec celebrate-code.txt celebrate-encoded.txt
I just want to celebrate another day of living
I just want to celebrate another day of life
Put my faith in the people but the people let me down
So I turned the other way and I carried on anyhow
I just want to celebrate another day of living
I just want to celebrate another day of life
UNIX> ls -l cel*
-rw-r--r--  1 plank  staff  275 Jan 31 14:08 celebrate-code.txt
-rw-r--r--  1 plank  staff  109 Jan 31 14:09 celebrate-encoded.txt
-rw-r--r--  1 plank  staff  288 Jan 31 14:08 celebrate.txt
UNIX> 
Here's the code in celebrate-code.txt:

'r' 0 '0' '0' '0' '0' '0' 0 'w' 'a' 'n' 't'
0 '0' '0' '0' '0' '1' 0 'g' 0 '0' '0' '0'
'1' '0' '0' 0 'm' 0 '0' '0' '0' '1' '0' '1'
0 'v' 0 '0' '0' '0' '1' '1' '0' 0 'P' 0
'0' '0' '0' '1' '1' '1' '0' 0 'S' 0 '0' '0'
'0' '1' '1' '1' '1' 0 'i' 0 '0' '0' '1' '0'
0 'a' 0 '0' '0' '1' '1' '0' 0 'b' 0 '0'
'0' '1' '1' '1' '0' '0' 0 'c' 0 '0' '0' '1'
'1' '1' '0' '1' 0 'u' 0 '0' '0' '1' '1' '1'
'1' 0 eol 0 '0' '1' '0' '0' '0' 0 'I' 0
'0' '1' '0' '0' '1' 0 'h' 0 '0' '1' '0' '1'
'0' 0 'w' 0 '0' '1' '0' '1' '1' '0' 0 'y'
0 '0' '1' '0' '1' '1' '1' 0 't' 0 '0' '1'
'1' '0' 0 'e' 0 '0' '1' '1' '1' 0 space 0
'1' '0' 0 'f' 0 '1' '1' '0' '0' '0' 0 'l'
0 '1' '1' '0' '0' '1' 0 'o' 0 '1' '1' '0'
'1' 0 'n' 0 '1' '1' '1' '0' '0' 0 'a' 'n'
'o' 't' 'h' 'e' 'r' 0 '1' '1' '1' '0' '1' '0'
0 'c' 'e' 'l' 'e' 'b' 'r' 'a' 't' 'e' 0 '1'
'1' '1' '0' '1' '1' 0 'd' 0 '1' '1' '1' '1'
'0' '0' 0 'd' 'a' 'y' 0 '1' '1' '1' '1' '0'
'1' 0 'j' 'u' 's' 't' 0 '1' '1' '1' '1' '1'
'0' 0 'p' 0 '1' '1' '1' '1' '1' '1' 0