CS302 -- Lab 7


Lab Objective

This lab is designed to give you experience with various sorting algorithms and with using objects to implement these algorithms. In particular you will need to implement a sort-merge algorithm for external sorting and a quicksort for internal sorting.


Setting Up

You should copy the following files from the /ruby/homes/ftp/pub/bvz/classes/cs302/labs/lab8 directory to your directory:

For the provided Makefile to work, you must name the file you create bsort.cc.

In order to make the graphics package work, you will need to type the following two commands in every window in which you run your program (or you can place them in your .cshrc file):

        setenv AMULET_DIR /sunshine/homes/bvz/amulet/amulet3/
        setenv AMULET_VARS_FILE Makefile.vars.gcc.Solaris

If you have certain types of protections, you may also get a message saying that your display could not be opened when you run your program. If this happens, type the following command:

        xhost +machine_name

where machine_name is the name of your machine (e.g., cetus4a).


Problem Statement

In this lab you will implement the balanced, multi-way external sorting algorithm presented in class. You will also experiment with varying the size of the initial runs and the size of P to get an idea of how these parameters affect the performance of the algorithm. Here is a man page-like description of the program you are to write:

Name

bsort - sort and collate lines of a file.

Synopsis

bsort [ -p num_ways ] [ -r run_size] input_file output_file

Description

bsort sorts the lines in input_file from smallest to largest and stores the sorted lines in output_file.

You may assume that each line in input_file has three integer fields. The first field is the sort key. An example data file may be found in data.
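The line format described above can be sketched as follows. This is only an illustration: LineRecord, parse_line, and key_less are hypothetical names, not the provided Record class or the Fields package, and the sketch assumes the three fields are separated by whitespace.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for one input line; NOT the provided Record class.
struct LineRecord {
    int key;     // first field: the sort key
    int field2;  // the remaining two fields are carried along unchanged
    int field3;
};

// Parse "key f2 f3" from one line (assumes whitespace-separated fields).
// Returns false if the line does not contain three integers.
bool parse_line(const std::string& line, LineRecord& out) {
    std::istringstream in(line);
    return static_cast<bool>(in >> out.key >> out.field2 >> out.field3);
}

// Ordering used when sorting: compare on the key only.
bool key_less(const LineRecord& a, const LineRecord& b) {
    return a.key < b.key;
}
```

In your own program the Fields package and the Record class play these roles; the sketch only shows that the first field alone decides the order.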

Options

Examples

  1. Sort the contents of data and store the sorted file in output-data:

        bsort data output-data
    

  2. Sort the contents of data using a 10-way sort and store the sorted file in output-data:

        bsort -p 10 data output-data
    

  3. Sort the contents of data using a 4-way sort and an initial run size of 6. Store the sorted file in output-data:

        bsort -r 6 -p 4 data output-data
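
The command lines above could be handled along the following lines. This is a sketch under stated assumptions: Options and parse_args are hypothetical names, and the default values in the code are placeholders, since this handout does not state the real defaults.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical container for the parsed command line.
struct Options {
    int num_ways;             // value given with -p
    int run_size;             // value given with -r
    std::string input_file;
    std::string output_file;
};

// Returns true if the arguments match
//   bsort [ -p num_ways ] [ -r run_size ] input_file output_file
bool parse_args(int argc, char** argv, Options& opts) {
    opts.num_ways = 2;    // placeholder default (assumption, not from the handout)
    opts.run_size = 100;  // placeholder default (assumption, not from the handout)
    int i = 1;
    while (i < argc && argv[i][0] == '-') {
        if (i + 1 >= argc) return false;        // switch without a value
        int value = std::atoi(argv[i + 1]);
        if (std::strcmp(argv[i], "-p") == 0) {
            if (value < 2) return false;        // need at least a 2-way merge
            opts.num_ways = value;
        } else if (std::strcmp(argv[i], "-r") == 0) {
            if (value < 1) return false;        // runs must hold at least one line
            opts.run_size = value;
        } else {
            return false;                       // unknown switch
        }
        i += 2;
    }
    if (argc - i != 2) return false;            // exactly two file names must remain
    opts.input_file = argv[i];
    opts.output_file = argv[i + 1];
    return true;
}
```

The switches are accepted in either order, which matches the third example, where -r precedes -p.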
    


Output

The output of bsort is the sorted output file. It should not produce any messages unless the user supplies invalid command-line arguments. A working version of bsort is available at /ruby/homes/ftp/pub/bvz/classes/cs302/bin/bsort. If you have a question about what your program should do, first see what this program does.


Strategy

Write this program using the pseudo-code provided in class and the various classes that we have provided for you.

I created the bsort program incrementally using the following steps:

  1. Write and test an internal memory sorting routine or a priority queue that will be used for creating the initial runs.

  2. Write and test an input routine for reading data from the input file and distributing it onto P scratch output files. This input routine reads data into an array, sorts it, and then outputs the run to one of the output files. If you use a heap, the input routine will read data into the heap and output the elements in the heap to an output file. You should use the Fields package to read in a file and fopen to write out a file. Do not use the disk package for this assignment.

  3. Write and test a routine for merging one set of runs from the scratch input files into a new run and writing the new run to a scratch output file.

  4. Write and test the sort merge procedure. It will have to call the routine written in the previous step to create runs and it will have to manage the scratch files so that on each iteration, the previous input files become output files and the previous output files become input files.

  5. Write and test the main procedure so that it handles the -r and -p switches. In the initial phases of writing the program, I used the default values for the initial run sizes and the value of P.
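
To see how steps 2 through 4 fit together, here is a minimal in-memory sketch of a balanced P-way sort merge. It is not the pseudo-code from class and it does not use the provided classes: the "scratch files" are plain vectors, the records are bare integer keys, and every name (Run, ScratchFile, Bank, make_initial_runs, merge_runs, sort_merge) is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

typedef std::vector<int> Run;           // one sorted run of keys
typedef std::vector<Run> ScratchFile;   // a simulated file: a sequence of runs
typedef std::vector<ScratchFile> Bank;  // a bank holds P simulated files

// Step 2: cut the input into sorted runs and deal them round-robin onto P files.
Bank make_initial_runs(const std::vector<int>& input, std::size_t run_size, std::size_t p) {
    Bank bank(p);
    std::size_t file = 0;
    for (std::size_t start = 0; start < input.size(); start += run_size) {
        std::size_t end = std::min(start + run_size, input.size());
        Run run(input.begin() + start, input.begin() + end);
        std::sort(run.begin(), run.end());  // internal sort of one run
        bank[file].push_back(run);
        file = (file + 1) % p;
    }
    return bank;
}

// Step 3: merge the k-th run of every input file into one longer run.
Run merge_runs(const Bank& in, std::size_t k) {
    typedef std::pair<int, std::size_t> Item;  // (key, index of source file)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > heap;
    std::vector<std::size_t> pos(in.size(), 0);  // next unread position per file
    for (std::size_t f = 0; f < in.size(); ++f)
        if (k < in[f].size() && !in[f][k].empty())
            heap.push(Item(in[f][k][0], f));
    Run out;
    while (!heap.empty()) {
        Item top = heap.top();
        heap.pop();
        out.push_back(top.first);
        std::size_t f = top.second;
        if (++pos[f] < in[f][k].size())
            heap.push(Item(in[f][k][pos[f]], f));
    }
    return out;
}

// Step 4: repeat merge passes, swapping the banks, until a single run remains.
// If passes is non-null it receives the number of merge passes performed.
std::vector<int> sort_merge(const std::vector<int>& input, std::size_t run_size,
                            std::size_t p, int* passes = 0) {
    assert(p >= 2 && run_size >= 1);
    Bank in = make_initial_runs(input, run_size, p);
    int count = 0;
    while (in[0].size() > 1 || !in[1].empty()) {
        Bank out(p);
        for (std::size_t k = 0; k < in[0].size(); ++k)
            out[k % p].push_back(merge_runs(in, k));
        in.swap(out);  // the output bank becomes the input bank of the next pass
        ++count;
    }
    if (passes) *passes = count;
    return in[0].empty() ? std::vector<int>() : in[0][0];
}
```

Because runs are dealt round-robin, file 0 always holds the most runs, so its run count drives each pass. In the real program each file is read and written sequentially rather than indexed, but the bank swap at the end of each pass is the same idea.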

In addition to writing the program, you will need to make some decisions about the use of data structures. There are three principal data structures you will need:

  1. A data structure for creating the initial runs. This data structure should either be an array if you use internal sorting or a heap if you use a priority queue. In both cases you will end up using the Array class that you have been provided.

  2. A data structure for managing the two disk banks. Since the number of files is variable and depends on P, you should use the FileArray class that you have been provided to create your disk banks.

  3. A data structure for managing the lines you read in. You should use the Record class that you have been provided.
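
If you choose internal sorting for the first data structure, a quicksort along the following lines is one option. This sketch sorts a std::vector of integer keys rather than the provided Array class of Record objects, and the pivot choice (the middle element) is just one reasonable assumption.

```cpp
#include <utility>
#include <vector>

// Sort a[lo..hi] (inclusive bounds) in place.
void quicksort(std::vector<int>& a, long lo, long hi) {
    if (lo >= hi) return;                // zero or one element: already sorted
    int pivot = a[lo + (hi - lo) / 2];   // middle element as pivot
    long i = lo, j = hi;
    while (i <= j) {
        while (a[i] < pivot) ++i;        // skip elements already on the left side
        while (a[j] > pivot) --j;        // skip elements already on the right side
        if (i <= j) {
            std::swap(a[i], a[j]);
            ++i;
            --j;
        }
    }
    quicksort(a, lo, j);                 // everything in a[lo..j] is <= pivot
    quicksort(a, i, hi);                 // everything in a[i..hi] is >= pivot
}

// Convenience overload that sorts the whole vector.
void quicksort(std::vector<int>& a) {
    if (!a.empty()) quicksort(a, 0, static_cast<long>(a.size()) - 1);
}
```

To sort records instead of bare keys, compare on the first field only and swap whole records.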


Classes For This Lab

To assist you with this lab, we have prepared a visual debugging environment that shows your disk banks and arrays and that allows you to pause your program and inspect these data structures. The visual debugger is written as a driver program (driver.cc) that initializes the environment and then calls your external sorting function. Hence, for this lab you will not write a main function. Instead, you will write a function named external_sort, which our driver program will call. external_sort takes two parameters, argc and argv:

void external_sort (int argc, char** argv);

You will also need to use the classes that we have provided in order to make the visual debugging environment work. You can find a list of these classes and the methods they support, along with an explanation of how to make the visual debugger pause, here.


Experiments

Once you have completed bsort, you will perform a series of experiments that vary the initial run size and P parameters. You should use initial run sizes of 100, 1000, and 10000 and values of P equal to 2, 4, 8, and 16. For each experiment you should sort the file data found in lab8/data. You should record 1) the aggregate number of file reads and writes (i.e., the number of calls you make to the Read and Write methods for the File objects), and 2) the number of passes your program requires once the initial set of runs has been created. The results should be recorded in a table that looks as follows:

       
                                   P
Run Size        2          4          8          16

100             i/o, pp    i/o, pp    i/o, pp    i/o, pp

1000            i/o, pp    i/o, pp    i/o, pp    i/o, pp

10000           i/o, pp    i/o, pp    i/o, pp    i/o, pp

You should also produce a graph of the results. The x-axis should be P, the y-axis should be the number of reads/writes, and the curves should be of run size. Hence you will have three curves, one each for 100, 1000, and 10000. On the x-axis, you should evenly space 2, 4, 8, and 16.

i/o stands for the aggregate number of reads and writes and pp stands for the number of passes. Even if you cannot get bsort to work, you can use the bsort in the bin directory to conduct these experiments.
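
As a sanity check on the pp column, the number of merge passes of a balanced P-way merge can be predicted from the number of initial runs: each pass merges up to P runs into one, so the run count shrinks from k to ceil(k / P) until a single run remains. The sketch below computes this. The function name predicted_passes and the record counts used to exercise it are hypothetical, since the number of lines in data is not stated here.

```cpp
#include <cstddef>

// Predicted number of merge passes for n records, an initial run size of r,
// and a p-way merge (requires r >= 1 and p >= 2).
int predicted_passes(std::size_t n, std::size_t r, std::size_t p) {
    std::size_t runs = (n + r - 1) / r;  // ceil(n / r) initial runs
    int passes = 0;
    while (runs > 1) {
        runs = (runs + p - 1) / p;       // one pass merges up to p runs into one
        ++passes;
    }
    return passes;
}
```

If your measured pass counts differ from this prediction, check how the last, possibly incomplete, set of runs is handled in each pass.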

What to Submit

You should submit the following items to Hui. The first two items should be sent in one email message, and the third item should be packaged using 302submit and sent to Hui in a separate email message.

  1. The table and graph described in the previous section.

  2. Whether you used an internal sorting algorithm or a priority queue to generate your initial runs.

  3. Your Makefile and .cc files.